Representing categorical variables

One of the most common data types we might encounter while building a machine learning system is the categorical feature (also known as a discrete feature), such as the color of a fruit or the name of a company. The challenge with categorical features is that they do not vary in a continuous way, which makes it hard to represent them with numbers.

For example, a banana is either green or yellow, but not both. A product belongs either in the clothing department or in the books department, but rarely in both, and so on.

How would you go about representing such features?

For example, let's assume we are trying to encode a dataset consisting of a list of forefathers of machine learning and artificial intelligence:

In [1]: data = [
... {'name': 'Alan Turing', 'born': 1912, 'died': 1954},
... {'name': 'Herbert A. Simon', 'born': 1916, 'died': 2001},
... {'name': 'Jacek Karpinski', 'born': 1927, 'died': 2010},
... {'name': 'J.C.R. Licklider', 'born': 1915, 'died': 1990},
... {'name': 'Marvin Minsky', 'born': 1927, 'died': 2016}
... ]

While the born and died features are already in a numeric format, the name feature is a bit trickier to encode. We might be tempted to encode it in the following way:

In [2]: {'Alan Turing': 1,
... 'Herbert A. Simon': 2,
... 'Jacek Karpinski': 3,
... 'J.C.R. Licklider': 4,
... 'Marvin Minsky': 5};

Although this seems like a good idea, it does not make much sense from a machine learning perspective. Why not? Well, by assigning ordinal values to these categories, we lead most machine learning algorithms to assume that Alan Turing < Herbert A. Simon < Jacek Karpinski, since 1 < 2 < 3. This is clearly not what we meant to say.

Instead, what we really meant to say was something along the lines of: the first data point belongs to the Alan Turing category, and it does not belong to the Herbert A. Simon or Jacek Karpinski categories, and so on. In other words, we were looking for a binary encoding. In machine learning lingo, this is known as a one-hot encoding, and it is provided by most machine learning packages straight out of the box (except for OpenCV, of course).
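
To make this concrete, here is a minimal hand-rolled sketch of what such an encoding looks like; the variables names, categories, and one_hot below are our own illustration, not part of any library:

# Hand-rolled sketch of a one-hot encoding (illustration only, not library code)
names = ['Alan Turing', 'Herbert A. Simon', 'Jacek Karpinski',
         'J.C.R. Licklider', 'Marvin Minsky']
categories = sorted(names)
one_hot = {name: [1 if cat == name else 0 for cat in categories]
           for name in names}
one_hot['Alan Turing']   # [1, 0, 0, 0, 0]

Of course, we would not want to spell this out by hand for every dataset, and we do not have to.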

In scikit-learn, one-hot encoding is provided by the DictVectorizer class, which can be found in the feature_extraction module. The way it works is simple: feed a list of dictionaries containing the data to the fit_transform method, and it automatically determines which features to encode:

In [3]: from sklearn.feature_extraction import DictVectorizer
... vec = DictVectorizer(sparse=False, dtype=int)
... vec.fit_transform(data)
Out[3]: array([[1912, 1954, 1, 0, 0, 0, 0],
               [1916, 2001, 0, 1, 0, 0, 0],
               [1927, 2010, 0, 0, 0, 1, 0],
               [1915, 1990, 0, 0, 1, 0, 0],
               [1927, 2016, 0, 0, 0, 0, 1]], dtype=int32)

What happened here? The two year entries (born and died) are still intact, but the name entry has been replaced by a set of columns containing ones and zeros. We can call get_feature_names (renamed get_feature_names_out in recent versions of scikit-learn) to find out the order in which the features are listed:

In [4]: vec.get_feature_names()
Out[4]: ['born',
         'died',
         'name=Alan Turing',
         'name=Herbert A. Simon',
         'name=J.C.R. Licklider',
         'name=Jacek Karpinski',
         'name=Marvin Minsky']

The first row of our data matrix, which stands for Alan Turing, is now encoded as born=1912, died=1954, name=Alan Turing=1, name=Herbert A. Simon=0, name=J.C.R. Licklider=0, name=Jacek Karpinski=0, and name=Marvin Minsky=0.
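
As a quick sanity check (this is not part of the original example, but it uses only DictVectorizer's standard transform method), we can encode new dictionaries with the already fitted vectorizer. Note that a name the vectorizer never saw during fitting is silently ignored, so all of its name columns come out as zero:

# Re-encoding a known sample should reproduce the first row of the matrix above
vec.transform([{'name': 'Alan Turing', 'born': 1912, 'died': 1954}])
# expected: array([[1912, 1954, 1, 0, 0, 0, 0]])

# An unseen name is silently ignored, leaving every name column at zero
vec.transform([{'name': 'Ada Lovelace', 'born': 1815, 'died': 1852}])
# expected: array([[1815, 1852, 0, 0, 0, 0, 0]])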

There is one problem with this approach, though. If a categorical feature has a lot of possible values, such as every possible first and last name, then one-hot encoding will lead to a very large data matrix. However, if we inspect the name columns row by row, it becomes clear that every row contains exactly one 1, and all of the other name entries are 0. In other words, the matrix is sparse. Scikit-learn provides a compact representation of sparse matrices, which we can trigger by passing sparse=True to the DictVectorizer constructor:

In [5]: vec = DictVectorizer(sparse=True, dtype=int)
... vec.fit_transform(data)
Out[5]: <5x7 sparse matrix of type '<class 'numpy.int64'>'
        with 15 stored elements in Compressed Sparse Row format>
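
If we ever need the dense view back, the compressed result can be converted explicitly; the following line is a small addition that relies only on the standard toarray method of SciPy sparse matrices:

# Convert the compressed sparse row matrix back to a dense NumPy array
vec.fit_transform(data).toarray()
# expected: the same 5x7 matrix of years and one-hot name columns as before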

Certain machine learning algorithms, such as decision trees, are capable of handling categorical features natively. In these cases, it is not necessary to use one-hot encoding.

We will come back to this technique when we talk about neural networks in Chapter 9, Using Deep Learning to Classify Handwritten Digits.
