Converting categorical features to numerical – one-hot encoding and ordinal encoding

In the previous chapter, Predicting Online Ads Click-through with Tree-Based Algorithms, we mentioned how one-hot encoding transforms categorical features into numerical ones so that they can be used by the tree algorithms in scikit-learn and TensorFlow. One-hot encoding does not limit us to tree-based algorithms, however: we can apply it to prepare data for any other algorithm that only takes in numerical features.

The simplest solution we can think of for transforming a categorical feature with k possible values is to map it to a numerical feature with values from 1 to k. For example, [Tech, Fashion, Fashion, Sports, Tech, Tech, Sports] becomes [1, 2, 2, 3, 1, 1, 3]. However, this imposes an ordinal characteristic, such as Sports being greater than Tech, and a distance property, such as Sports being closer to Fashion than to Tech.
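
To make this concrete, such a naive integer mapping can be written with a plain dictionary (the list and mapping below simply restate the example above):

>>> interests = ['Tech', 'Fashion', 'Fashion', 'Sports', 'Tech', 'Tech', 'Sports']
>>> int_mapping = {'Tech': 1, 'Fashion': 2, 'Sports': 3}
>>> print([int_mapping[interest] for interest in interests])
[1, 2, 2, 3, 1, 1, 3]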

Instead, one-hot encoding converts the categorical feature into k binary features. Each binary feature indicates the presence or absence of a corresponding possible value. With three binary features for Tech, Fashion, and Sports, Tech is encoded as [1, 0, 0], Fashion as [0, 1, 0], and Sports as [0, 0, 1], so the preceding example [Tech, Fashion, Fashion, Sports, Tech, Tech, Sports] becomes [[1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0], [1, 0, 0], [0, 0, 1]].

Previously, we used OneHotEncoder from scikit-learn to convert a matrix of strings into a binary matrix, but here, let's take a look at another class, DictVectorizer, which also provides an efficient conversion. It transforms dictionary objects (categorical feature: value) into one-hot encoded vectors.

For example, take a look at the following code:

>>> from sklearn.feature_extraction import DictVectorizer
>>> X_dict = [{'interest': 'tech', 'occupation': 'professional'},
...           {'interest': 'fashion', 'occupation': 'student'},
...           {'interest': 'fashion', 'occupation': 'professional'},
...           {'interest': 'sports', 'occupation': 'student'},
...           {'interest': 'tech', 'occupation': 'student'},
...           {'interest': 'tech', 'occupation': 'retired'},
...           {'interest': 'sports', 'occupation': 'professional'}]
>>> dict_one_hot_encoder = DictVectorizer(sparse=False)
>>> X_encoded = dict_one_hot_encoder.fit_transform(X_dict)
>>> print(X_encoded)
[[ 0.  0.  1.  1.  0.  0.]
 [ 1.  0.  0.  0.  0.  1.]
 [ 1.  0.  0.  1.  0.  0.]
 [ 0.  1.  0.  0.  0.  1.]
 [ 0.  0.  1.  0.  0.  1.]
 [ 0.  0.  1.  0.  1.  0.]
 [ 0.  1.  0.  1.  0.  0.]]

We can also see the mapping by executing the following:

>>> print(dict_one_hot_encoder.vocabulary_)
{'interest=fashion': 0, 'interest=sports': 1,
'occupation=professional': 3, 'interest=tech': 2,
'occupation=retired': 4, 'occupation=student': 5}
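
As a side note, categorical features often come in a pandas DataFrame rather than a list of dictionaries (pandas is also used for ordinal encoding later in this section). In that case, one possible way to feed them to DictVectorizer is to convert each row into a dictionary first; the two-row DataFrame below is a made-up sample for illustration:

>>> import pandas as pd
>>> df_new = pd.DataFrame({'interest': ['tech', 'fashion'],
...                        'occupation': ['professional', 'student']})
>>> # each row becomes a {feature: value} dictionary
>>> print(dict_one_hot_encoder.transform(df_new.to_dict(orient='records')))
[[ 0.  0.  1.  1.  0.  0.]
 [ 1.  0.  0.  0.  0.  1.]]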

When it comes to new data, we can transform it with the same fitted encoder as follows:

>>> new_dict = [{'interest': 'sports', 'occupation': 'retired'}]
>>> new_encoded = dict_one_hot_encoder.transform(new_dict)
>>> print(new_encoded)
[[ 0.  1.  0.  0.  1.  0.]]

We can also recover the original features from the encoded vectors by using inverse_transform:

>>> print(dict_one_hot_encoder.inverse_transform(new_encoded))
[{'interest=sports': 1.0, 'occupation=retired': 1.0}]

One important thing to note is that if a new category (not seen in the training data) is encountered in new data, it should be ignored. DictVectorizer handles this implicitly, while OneHotEncoder requires the handle_unknown parameter to be set to 'ignore':

>>> new_dict = [{'interest': 'unknown_interest', 'occupation': 'retired'},
...             {'interest': 'tech', 'occupation': 'unseen_occupation'}]
>>> new_encoded = dict_one_hot_encoder.transform(new_dict)
>>> print(new_encoded)
[[ 0.  0.  0.  0.  1.  0.]
 [ 0.  0.  1.  0.  0.  0.]]
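
For comparison, here is a minimal sketch of the equivalent behavior with OneHotEncoder, assuming scikit-learn 0.20 or later, which accepts string inputs directly (in scikit-learn 1.2 and later, the sparse parameter is renamed sparse_output):

>>> from sklearn.preprocessing import OneHotEncoder
>>> one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
>>> X_str = [['tech', 'professional'],
...          ['fashion', 'student'],
...          ['sports', 'retired']]
>>> X_str_encoded = one_hot_encoder.fit_transform(X_str)
>>> # the unseen interest value is encoded as all zeros for that feature
>>> print(one_hot_encoder.transform([['unknown_interest', 'retired']]))
[[ 0.  0.  0.  0.  1.  0.]]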

Sometimes, we do prefer to transform a categorical feature with k possible values into a numerical feature with values ranging from 1 to k. We conduct ordinal encoding in order to employ ordinal or ranking knowledge in our learning; for example, large, medium, and small become 3, 2, and 1, respectively, and good and bad become 1 and 0, whereas one-hot encoding fails to preserve such information. We can realize ordinal encoding easily with pandas, for example:

>>> import pandas as pd
>>> df = pd.DataFrame({'score': ['low',
... 'high',
... 'medium',
... 'medium',
... 'low']})
>>> print(df)
    score
0     low
1    high
2  medium
3  medium
4     low
>>> mapping = {'low':1, 'medium':2, 'high':3}
>>> df['score'] = df['score'].replace(mapping)
>>> print(df)
   score
0      1
1      3
2      2
3      2
4      1

We convert the string feature into ordinal values based on the mapping we define.
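
Alternatively, a minimal sketch of ordinal encoding with scikit-learn's OrdinalEncoder (assuming scikit-learn 0.20 or later; this is not part of the pandas example above) looks as follows. Note that it maps the listed categories to 0, 1, and 2 rather than 1, 2, and 3:

>>> from sklearn.preprocessing import OrdinalEncoder
>>> # the categories parameter fixes the order low < medium < high
>>> ordinal_encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
>>> scores = [['low'], ['high'], ['medium'], ['medium'], ['low']]
>>> print(ordinal_encoder.fit_transform(scores))
[[ 0.]
 [ 2.]
 [ 1.]
 [ 1.]
 [ 0.]]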
