Using text in machine learning pipelines

Of course, the ultimate goal of our vectorizers is to use them to make text data ingestible for our machine learning pipelines. Because CountVectorizer and TfidfVectorizer act like any other transformer we have been working with in this book, we will place them inside a scikit-learn pipeline so that the vectorizer is fit only on the training fold during cross-validation, keeping our evaluation honest. In our example, we are going to be working with a very large number of columns (in the hundreds of thousands), so I will use a classifier that is known to be efficient in this situation, a Naive Bayes model:

from sklearn.naive_bayes import MultinomialNB  # for faster predictions with a large number of features
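
Throughout this section, X is the raw text of the tweets and y is the binary Sentiment column. If you are following along with your own data, a setup along these lines is all that is assumed (the file name and the name of the text column below are placeholders, not necessarily the exact ones used here):

import pandas as pd

# placeholder file and text-column names -- substitute your own tweet dataset
tweets = pd.read_csv('twitter_sentiment.csv')
X = tweets['SentimentText']   # the raw text of each tweet
y = tweets['Sentiment']       # 1 for positive, 0 for negative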

Before we start building our pipelines, let's get the null accuracy for the response column, which is either zero (negative) or one (positive):

# get the null accuracy
y.value_counts(normalize=True)

1    0.564632
0    0.435368
Name: Sentiment, dtype: float64

This makes 56.5% the accuracy to beat. Now, let's create a pipeline with two steps:

  • CountVectorizer to featurize the tweets
  • MultinomialNB Naive Bayes model to classify between positive and negative sentiment

First, let's set up our pipeline parameters as follows, and then instantiate our pipeline and grid search:

# set our pipeline parameters
pipe_params = {'vect__ngram_range': [(1, 1), (1, 2)],
               'vect__max_features': [1000, 10000],
               'vect__stop_words': [None, 'english']}

# imports needed for the pipeline and grid search
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# instantiate our pipeline
pipe = Pipeline([('vect', CountVectorizer()), ('classify', MultinomialNB())])

# instantiate our gridsearch object
grid = GridSearchCV(pipe, pipe_params)
# fit the gridsearch object
grid.fit(X, y)

# get our results
print grid.best_score_, grid.best_params_


0.755753132845 {'vect__ngram_range': (1, 2), 'vect__stop_words': None, 'vect__max_features': 10000}
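
Before moving on, it can be worth looking at how every parameter combination fared, not just the winner. GridSearchCV stores the full results in its cv_results_ attribute; a minimal sketch of turning them into a DataFrame:

import pandas as pd

# every parameter combination tried, with its mean cross-validated accuracy
results = pd.DataFrame(grid.cv_results_)
print results[['params', 'mean_test_score']].sort_values('mean_test_score', ascending=False)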

And we got 75.6%, which is great! Now, let's kick things into high gear and incorporate the TfidfVectorizer. Instead of simply rebuilding the pipeline with tf-idf in place of the CountVectorizer, let's try something a bit different. scikit-learn has a FeatureUnion module that facilitates horizontal stacking of features (side by side). This allows us to use multiple types of text featurizers in the same pipeline.

For example, we can build a featurizer that runs both a TfidfVectorizer and a CountVectorizer on our tweets and concatenates them horizontally (keeping the same number of rows but increasing the number of columns):

from sklearn.pipeline import FeatureUnion
# build a separate featurizer object
featurizer = FeatureUnion([('tfidf_vect', TfidfVectorizer()), ('count_vect', CountVectorizer())])

Once we build the featurizer, we can use it to see how it affects the shape of our data:

_ = featurizer.fit_transform(X)
print _.shape  # same number of rows, but twice as many columns as either CV or TFIDF alone

(99989, 211698)
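
To see where that column count comes from, we can fit each vectorizer on its own and add up their column counts; a quick sketch (note that this refits both vectorizers from scratch, so it will take a moment):

# each default vectorizer builds its own vocabulary, so the union's width
# is simply the two widths added together
tfidf_cols = TfidfVectorizer().fit_transform(X).shape[1]
count_cols = CountVectorizer().fit_transform(X).shape[1]
print tfidf_cols, count_cols, tfidf_cols + count_cols  # the sum should match 211698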

We can see that unioning the two featurizers results in a dataset with the same number of rows, but with twice as many columns as either the CountVectorizer or the TfidfVectorizer produces alone. This is because the resulting dataset is literally both datasets placed side by side. This way, our machine learning models may learn from both sets of features simultaneously. Let's change the params of our featurizer object slightly and see what difference it makes:

featurizer.set_params(tfidf_vect__max_features=100,
                      count_vect__ngram_range=(1, 2),
                      count_vect__max_features=300)
# the TfidfVectorizer will only keep 100 words while the CountVectorizer
# will keep 300 one- and two-word phrases
_ = featurizer.fit_transform(X)
print _.shape  # same number of rows, but now only 100 + 300 = 400 columns

(99989, 400)

Let's build a much more comprehensive pipeline that incorporates the feature union of both of our vectorizers:

pipe_params = {'featurizer__count_vect__ngram_range': [(1, 1), (1, 2)],
               'featurizer__count_vect__max_features': [1000, 10000],
               'featurizer__count_vect__stop_words': [None, 'english'],
               'featurizer__tfidf_vect__ngram_range': [(1, 1), (1, 2)],
               'featurizer__tfidf_vect__max_features': [1000, 10000],
               'featurizer__tfidf_vect__stop_words': [None, 'english']}

pipe = Pipeline([('featurizer', featurizer), ('classify', MultinomialNB())])

grid = GridSearchCV(pipe, pipe_params)
grid.fit(X, y)
print grid.best_score_, grid.best_params_

0.758433427677 {'featurizer__tfidf_vect__max_features': 10000, 'featurizer__tfidf_vect__stop_words': 'english', 'featurizer__count_vect__stop_words': None, 'featurizer__count_vect__ngram_range': (1, 2), 'featurizer__count_vect__max_features': 10000, 'featurizer__tfidf_vect__ngram_range': (1, 1)}

Nice, even better than the CountVectorizer alone! It is also interesting to note that the best ngram_range for the CountVectorizer was (1, 2), while it was (1, 1) for the TfidfVectorizer, implying that single-word occurrences alone were not as informative as also counting two-word phrases.
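
Because GridSearchCV refits the best combination on all of the data by default, the winning pipeline is available as grid.best_estimator_ and can be used on raw text directly. A quick sketch (the tweets below are made up purely for illustration):

best_pipe = grid.best_estimator_  # featurizer + MultinomialNB, refit with the best parameters

# hypothetical tweets, just to illustrate prediction
new_tweets = ["this is the best day ever", "I am so disappointed right now"]
print best_pipe.predict(new_tweets)        # 1 = positive, 0 = negative
print best_pipe.predict_proba(new_tweets)  # class probabilities from Naive Bayes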

By this point, it should be obvious that we could have made our pipeline much more complicated by:
    • Grid searching across dozens of parameters for each vectorizer
    • Adding in more steps to our pipeline such as polynomial feature construction
But this would have been very cumbersome for this book and would take hours to run on most laptops. Feel free to expand on this pipeline and beat our score; the sketch below is one possible starting point.
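
For example, a broader search might simply widen the grid over each vectorizer. The extra values below are purely illustrative (we have not run them), and every addition multiplies the number of model fits:

# an expanded (and much slower) grid -- illustrative values only
bigger_params = {'featurizer__count_vect__ngram_range': [(1, 1), (1, 2), (1, 3)],
                 'featurizer__count_vect__max_features': [1000, 5000, 10000, 30000],
                 'featurizer__count_vect__stop_words': [None, 'english'],
                 'featurizer__tfidf_vect__ngram_range': [(1, 1), (1, 2), (1, 3)],
                 'featurizer__tfidf_vect__max_features': [1000, 5000, 10000, 30000],
                 'featurizer__tfidf_vect__stop_words': [None, 'english'],
                 'featurizer__tfidf_vect__norm': ['l1', 'l2']}

bigger_grid = GridSearchCV(pipe, bigger_params)
# bigger_grid.fit(X, y)  # uncomment only if you are willing to wait a while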

Phew, that was a lot. Text can be difficult to work with. Between sarcasm, misspellings, and vocabulary size, data scientists and machine learning engineers have their hands full. This introduction to working with text will allow you, the reader, to experiment with your own large text datasets and obtain your own results!
