Handling Categorical Features With Python

As a data scientist, you will frequently encounter categorical variables in your datasets, such as location, car model, or gender. You cannot feed them directly to most machine learning algorithms, because these algorithms only understand numbers. There are various techniques to convert categorical features to numerical features, but the techniques themselves are not the focus of this post; this post is about how to implement them in Python. I will talk a little bit about each technique without going into too much depth, and will emphasise the various ways you can implement it in Python.

#What are Categorical variables?

Categorical variables are qualitative variables that take non-numeric values: gender, for example, can be male or female. Even when the values are numeric, the feature description may say the variable is categorical, which means those numbers are not mathematically related. Sometimes the answer to a question is one of yes/no, ugly/nice/ok/pretty, good/bad, and so on; in other words, the answer comes from a set of predefined possibilities, and a qualitative/categorical variable takes its values from such a set. These variables often prove to be of great importance and can boost the accuracy of a model to a considerable extent. There are two types of categorical variables:

#Ordinal/Ordered categorical variable

The order of these variables matters. For example, a movie review can be good, average, or bad, and average sits between good and bad; the same goes for the ranked results of a race. In such cases the order carries information, and ignoring it can produce misleading results.

#Nominal categorical variable

For these variables the order doesn't matter: seat type can be economy/business, gender can be male/female, and so on. The order of the values makes no difference to their interpretation.
Whether to interpret a categorical variable as nominal or ordinal depends on the data you have; misunderstanding the variable can lead to false results, so it is really important to think carefully before diving into the implementation.

#Simple approach to encode categorical features

The two approaches we are going to use to convert a categorical variable to its numerical equivalent are as follows:

#Label Encoding

A simple approach to converting a categorical variable to a numerical one is to assign a unique number to each possible value of the variable and replace the values with their corresponding numbers. This technique should only be used for ordinal categorical variables, and only once you know the order of the values: because the order matters, the numbers assigned to the values must also be sorted in ascending or descending order (it doesn't matter which you choose). For example, a movie review variable may have five possible in-order values (excellent, awesome, good, bad, burn it), so the assigned values would run from 5 down to 1, with 5 being excellent and 1 being burn it. If one data point of the review variable has the value 4 (awesome) and another has 2 (bad), their average is 3 (good), which makes sense. This is not the case for nominal variables, which is why we cannot use this method for them.
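As a quick sketch of the idea (not part of the original code), the ordered mapping can be written by hand with pandas, assuming a hypothetical review column and an explicit value-to-number dictionary:

```python
import pandas as pd

# Hypothetical movie-review column with five ordered values.
reviews = pd.Series(["awesome", "bad", "excellent", "good", "burn it"])

# Explicit mapping preserves the order: 5 = excellent ... 1 = burn it.
review_scale = {"excellent": 5, "awesome": 4, "good": 3, "bad": 2, "burn it": 1}
encoded = reviews.map(review_scale)
print(list(encoded))  # [4, 2, 5, 3, 1]
```

Writing the dictionary yourself, rather than letting a library pick the numbers, is what guarantees the encoded order matches the real-world order of the values.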

#One Hot Encoding/One of K scheme

The other approach is called one hot encoding, where a categorical variable is converted into a binary vector: each possible value of the categorical variable becomes a variable of its own with a default value of 0, and the variable corresponding to a data point's actual value is set to 1. The example below illustrates the concept.

Table before applying one hot encoding transformation

| name    | gender |
|---------|--------|
| Roshan  | male   |
| Anna    | female |
| Hussain | male   |
| Ashwini | female |

Table after applying one hot encoding transformation

| name    | male | female |
|---------|------|--------|
| Roshan  | 1    | 0      |
| Anna    | 0    | 1      |
| Hussain | 1    | 0      |
| Ashwini | 0    | 1      |
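Before reaching for a library, the transformation in the tables above can be sketched in a few lines of plain Python (a toy illustration, not a production encoder):

```python
# The rows of the first table, as (name, gender) pairs.
people = [("Roshan", "male"), ("Anna", "female"),
          ("Hussain", "male"), ("Ashwini", "female")]

# One new column per possible value of the categorical variable.
categories = ["male", "female"]

# For each person, emit 1 for the matching category and 0 elsewhere.
encoded = [(name, [1 if gender == c else 0 for c in categories])
           for name, gender in people]
print(encoded[0])  # ('Roshan', [1, 0])
```

Each binary vector has exactly one 1 in it, which is where the name "one hot" (or "one of K") comes from.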

#Implementing Label Encoding

We saw above what label encoding is. You are not always going to get categorical variables in string form; you might encounter seemingly random numerical values (this is typically the case in competitions) that are still categorical features. To deal with such situations there is a utility class, LabelEncoder, in the preprocessing module of the sklearn package; it can handle categorical variables in both numerical and string form. Fire up an IPython console and try the code below.

>>> from sklearn.preprocessing import LabelEncoder
# encoding numerical values
>>> num_encoder = LabelEncoder()
>>> num_encoder.fit([3, 3, 4, 9])
LabelEncoder()
>>> num_encoder.classes_
array([3, 4, 9])
>>> num_encoder.transform([3, 3, 4, 9])
array([0, 0, 1, 2])
>>> num_encoder.inverse_transform([0, 0, 1, 2])
array([3, 3, 4, 9])
# encoding string values
>>> city_encoder = LabelEncoder()
>>> city_encoder.fit(['mumbai', 'delhi', 'mumbai', 'pune'])
LabelEncoder()
>>> list(city_encoder.classes_)
['delhi', 'mumbai', 'pune']
>>> city_encoder.transform(['mumbai', 'mumbai', 'pune'])
array([1, 1, 2])
>>> list(city_encoder.inverse_transform([1, 2, 1]))
['mumbai', 'pune', 'mumbai']

#Implementing One Hot Encoding/One of K-Scheme

You may encounter data in various data types (number/string), and the sklearn package provides utility classes to convert them to the one hot encoding scheme. As we discussed earlier, a categorical variable can be of numerical or string data type; the following are two ways to convert a categorical variable to the one hot encoding scheme:

#SKLearn way

The OneHotEncoder utility class provided by the sklearn package can convert numerical values to one hot encoding, and we can also deal with string values if we use LabelEncoder along with the OneHotEncoder class. The LabelEncoder class maps the string values of the categorical variable to numbers, and those numbers can then be converted to one hot encoding by OneHotEncoder. The implementation is shown in the code below.

In [1]: from sklearn.preprocessing import LabelEncoder, OneHotEncoder

In [2]: import pandas as pd

In [3]: raw_data = {'name': ['Roshan', 'Anna', 'Hussain', 'Ashwini'],
   ...:             'gender': ['male', 'female', 'male', 'female']}

In [4]: df_people = pd.DataFrame(raw_data, columns=['name', 'gender'])

In [5]: gender_encoder = LabelEncoder()

In [6]: df_people['en_gender'] = gender_encoder.fit_transform(df_people['gender'])

In [7]: df_people
Out[7]:
      name  gender  en_gender
0   Roshan    male          1
1     Anna  female          0
2  Hussain    male          1
3  Ashwini  female          0

In [8]: one_hot_gender = OneHotEncoder()

In [9]: one_hot_array = one_hot_gender.fit_transform(df_people.en_gender.values.reshape(len(df_people), 1)).toarray()

In [10]: one_hot_array
Out[10]:
array([[ 0.,  1.],
       [ 1.,  0.],
       [ 0.,  1.],
       [ 1.,  0.]])

In [11]: df_people[gender_encoder.classes_] = pd.DataFrame(one_hot_array)

In [12]: df_people
Out[12]:
      name  gender  en_gender  female  male
0   Roshan    male          1     0.0   1.0
1     Anna  female          0     1.0   0.0
2  Hussain    male          1     0.0   1.0
3  Ashwini  female          0     1.0   0.0

#Pandas way

If your data is already loaded in pandas, then pandas.get_dummies is a very handy method to convert your categorical variables to the one hot encoding scheme; it is much more convenient than the previous sklearn approach. It can convert multiple columns in one call if you pass it the data frame and the columns you want to transform. Below is an example of what I have just explained.

In [1]: import pandas as pd

In [2]: raw_data = {'name': ['Roshan','Anna','Hussain','Ashwini'],
...: 'gender': ['male', 'female', 'male', 'female'],
...: 'location': ['delhi', 'delhi', 'mumbai', 'pune']}
...:

In [3]: df_people = pd.DataFrame(raw_data, columns = ['name', 'gender','location'])

In [4]: df_oh_p = pd.get_dummies(df_people, columns=['gender', 'location'])

In [5]: df_oh_p
Out[5]:
      name  gender_female  gender_male  location_delhi  location_mumbai  \
0   Roshan              0            1               1                0
1     Anna              1            0               1                0
2  Hussain              0            1               0                1
3  Ashwini              1            0               0                0

   location_pune
0              0
1              0
2              0
3              1

In [6]: df_people_new = pd.concat([df_people, df_oh_p], axis=1)

In [7]: df_people_new
Out[7]:
      name  gender location     name  gender_female  gender_male  \
0   Roshan    male    delhi   Roshan              0            1
1     Anna  female    delhi     Anna              1            0
2  Hussain    male   mumbai  Hussain              0            1
3  Ashwini  female     pune  Ashwini              1            0

   location_delhi  location_mumbai  location_pune
0               1                0              0
1               1                0              0
2               0                1              0
3               0                0              1

We discussed two libraries, sklearn and pandas, that help us deal with categorical variables. Which one should you prefer? If you go the sklearn way, you get the advantage of chaining transformers and estimators with Pipeline and FeatureUnion to create data pipelines, which can make your whole analysis more manageable. If you go the pandas way, you get simplicity, but you will have to implement a custom transformer before you can chain it in a Pipeline or FeatureUnion.
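As a minimal sketch of that chaining advantage (assuming scikit-learn 0.20 or newer, where OneHotEncoder accepts string columns directly, plus a toy dataset and a LogisticRegression chosen purely for illustration), encoding and a model can live in a single Pipeline:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data: one categorical feature and a binary target.
X = [["delhi"], ["mumbai"], ["pune"], ["delhi"]]
y = [0, 1, 1, 0]

# The encoder and the estimator are fitted together in one object.
pipe = Pipeline([
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ("clf", LogisticRegression()),
])
pipe.fit(X, y)
predictions = pipe.predict([["mumbai"]])
```

Because the encoding step lives inside the pipeline, new data is transformed with the same fitted categories automatically, and `handle_unknown="ignore"` keeps unseen cities from raising an error at prediction time.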

#Conclusion

We saw the different types of categorical variables and how to encode them so that we can use them in machine learning algorithms along with other features. You could write your own code to convert categorical variables into numerical ones, but you can also leverage the existing utility classes/methods provided by popular ML libraries, which come in handy and can save some time when cleaning dirty data.

#Useful links

  1. Feature extraction sklearn docs
  2. What is one hot encoding and when is it used in data science? on Quora
  3. OneHotEncoder sklearn API docs
  4. LabelEncoder sklearn API docs
  5. pandas.get_dummies API docs
  6. sklearn Pipeline and FeatureUnion