Representing text features

As with categorical features, scikit-learn offers an easy way to encode another common feature type: text features. When working with text, it is often convenient to encode individual words or phrases as numerical values.

Let's consider a dataset that contains a small corpus of text phrases:

In [1]: sample = [
... 'feature engineering',
... 'feature selection',
... 'feature extraction'
... ]

One of the simplest methods of encoding such data is by word count: for each phrase, we simply count the occurrences of each word within it. In scikit-learn, this is easily done using CountVectorizer, which works much like DictVectorizer:

In [2]: from sklearn.feature_extraction.text import CountVectorizer
... vec = CountVectorizer()
... X = vec.fit_transform(sample)
... X
Out[2]: <3x4 sparse matrix of type '<class 'numpy.int64'>'
        with 6 stored elements in Compressed Sparse Row format>

By default, this will store our feature matrix, X, as a sparse matrix. If we want to inspect it manually, we need to convert it into a regular array:

In [3]: X.toarray()
Out[3]: array([[1, 0, 1, 0],
               [0, 0, 1, 1],
               [0, 1, 1, 0]], dtype=int64)

To understand what these numbers mean, we have to look at the feature names:

In [4]: vec.get_feature_names()
Out[4]: ['engineering', 'extraction', 'feature', 'selection']

Now it becomes clear what the integers in X mean. If we look at the phrase represented in the top row of X, we see that it contains one occurrence of the word engineering and one occurrence of the word feature. On the other hand, it does not contain the words extraction or selection. Does this make sense? A quick glance at our original data sample reveals that the phrase was indeed feature engineering.

Looking only at the X array (no cheating!), can you guess what the last phrase in sample was?
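
If you'd rather check your answer in code than by eye, the short sketch below (assuming the vec and X objects from the session above) maps each row of X back to the words it contains. Note that this bag-of-words encoding discards word order, so the words come back in the vectorizer's alphabetical column order:

import numpy as np

# Column labels of X, i.e. the vocabulary learned by the vectorizer.
# In scikit-learn >= 1.0 this method is called get_feature_names_out();
# older versions use get_feature_names() as shown above.
words = np.array(vec.get_feature_names_out())

for row in X.toarray():
    # Keep only the words with a nonzero count in this row.
    print(' '.join(words[row > 0]))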

One possible shortcoming of this approach is that we might put too much weight on words that appear very frequently. One approach to fixing this is known as Term Frequency-Inverse Document Frequency (TF-IDF). What TF-IDF does might be easier to understand than its name suggests: it weighs the word counts by a measure of how often the words appear across the entire dataset, so that very common words contribute less to the encoding.

The syntax for TF-IDF is much the same as for the previous command:

In [5]: from sklearn.feature_extraction.text import TfidfVectorizer
... vec = TfidfVectorizer()
... X = vec.fit_transform(sample)
... X.toarray()
Out[5]: array([[ 0.861037 , 0. , 0.50854232, 0. ],
               [ 0. , 0. , 0.50854232, 0.861037 ],
               [ 0. , 0.861037 , 0.50854232, 0. ]])

We note that the numbers are now smaller than before, with the third column taking the biggest hit. This makes sense, as the third column corresponds to the most frequent word across all three phrases, feature:

In [6]: vec.get_feature_names()
Out[6]: ['engineering', 'extraction', 'feature', 'selection']
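
To see where these numbers come from, we can inspect the per-word inverse document frequency (IDF) weights that the vectorizer has learned and rebuild one row by hand. The following sketch assumes the vec object fitted above with scikit-learn's defaults (smooth_idf=True and L2 normalization), under which the IDF weight of a word is ln((1 + n_docs) / (1 + doc_freq)) + 1:

import numpy as np

# Per-word IDF weights, in the same column order as X. Because 'feature'
# appears in all three phrases, it receives the smallest weight.
print(vec.idf_)

# Rebuild the first row by hand: the raw counts for 'feature engineering'
# are [1, 0, 1, 0]; scale them by the IDF weights and L2-normalize.
counts = np.array([1, 0, 1, 0])
weighted = counts * vec.idf_
print(weighted / np.linalg.norm(weighted))  # should match the first row above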

If you're interested in the math behind TF-IDF, you can start with this paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.121.1424&rep=rep1&type=pdf. For more information about its specific implementation in scikit-learn, have a look at the API documentation at http://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting.

Representing text features will become important in Chapter 7, Implementing a Spam Filter with Bayesian Learning.
