Using TF-IDF to improve the result

The technique is called term frequency-inverse document frequency (TF-IDF), and we encountered it in Chapter 4, Representing Data and Engineering Features. If you recall, TF-IDF weighs each word count by a measure of how often the word appears in the entire dataset. The IDF part, the inverse document frequency, has a useful side effect: words that occur in nearly every document, such as and, the, and but, carry only a small weight in the classification.
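To make the weighting concrete, here is a minimal pure-Python sketch of the calculation on a toy count matrix. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row, which is what scikit-learn documents as the default behavior of TfidfTransformer. The function name tfidf, the toy counts, and the three-word vocabulary are made up for illustration:

```python
import math

def tfidf(counts):
    """Weight term counts by smoothed IDF, then L2-normalize each row.

    counts is a list of rows, one per document, one column per term.
    Illustrative helper only; it is not part of any library.
    """
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: in how many documents does each term occur?
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Smoothed IDF (assumed to match scikit-learn's default settings)
    idf = [math.log((1 + n_docs) / (1 + d)) + 1.0 for d in df]
    result = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm if norm > 0 else 0.0 for v in weighted])
    return result, idf

# Hypothetical vocabulary: columns are "the", "free", "meeting"
toy_counts = [
    [3, 1, 0],
    [2, 0, 1],
    [4, 0, 0],
]
weights, idf = tfidf(toy_counts)
print(idf)         # "the" occurs in every document, so it gets the smallest IDF
print(weights[0])  # in the first document, "free" gains weight relative to "the"
```

Because "the" appears in all three toy documents, its IDF is the smallest of the three, so its raw count is scaled up less than the counts of the rarer words.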

We apply TF-IDF by passing our existing feature matrix, X, to the transformer's fit_transform method:

In [24]: tfidf = feature_extraction.text.TfidfTransformer()
In [25]: X_new = tfidf.fit_transform(X)

Don't forget to split the data again. You can also change the random_state parameter, which seeds the shuffling and therefore produces a different train-test split for each value. Note that the overall accuracy might change when the train-test split changes:

In [26]: X_train, X_test, y_train, y_test = ms.train_test_split(X_new, y,
... test_size=0.2, random_state=42)
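The role of the seed can be illustrated with a small standard-library sketch. The helper toy_train_test_split below is a hypothetical stand-in written for this example; it is not scikit-learn's implementation, and it only mimics the idea of shuffling with a seeded random number generator before cutting off a test portion:

```python
import random

def toy_train_test_split(items, test_size=0.2, random_state=None):
    """Hypothetical seeded shuffle split (not scikit-learn's code)."""
    rng = random.Random(random_state)  # the same seed gives the same shuffle
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    # The first n_test shuffled items form the test set, the rest the training set
    return shuffled[n_test:], shuffled[:n_test]

samples = list(range(10))
train_a, test_a = toy_train_test_split(samples, random_state=42)
train_b, test_b = toy_train_test_split(samples, random_state=42)
train_c, test_c = toy_train_test_split(samples, random_state=7)

print(test_a == test_b)  # True: the same seed reproduces the split
print(test_a, test_c)    # a different seed will usually select different samples
```

Fixing the seed makes a reported score reproducible; trying a few different seeds gives a rough feel for how much the score depends on one particular split.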

Then, when we train and score the classifier again, we suddenly find a remarkable score of 99% accuracy:

In [27]: model_naive = naive_bayes.MultinomialNB()
... model_naive.fit(X_train, y_train)
... model_naive.score(X_test, y_test)
Out[27]: 0.99087941628264209

To convince ourselves of the classifier's awesomeness, we can inspect the confusion matrix. This is a matrix that shows, for every true class, how many data samples were assigned to each predicted class. The diagonal element in row i tells us how many samples of class i were correctly classified as belonging to class i. The off-diagonal elements represent misclassifications:

In [28]: metrics.confusion_matrix(y_test, model_naive.predict(X_test))
Out[28]: array([[3746,   84],
                [  11, 6575]])

This tells us we got 3,746 class 0 classifications correct, and 6,575 class 1 classifications correct. We confused 84 samples of class 0 as belonging to class 1 and 11 samples of class 1 as belonging to class 0. If you ask me, that's about as good as it gets.
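As a sanity check, the accuracy reported by score can be recovered from the four entries of the confusion matrix. The short sketch below does the arithmetic in plain Python using the numbers from Out[28]. Precision and recall for class 1 are included as well; treating class 1 as the positive class is an assumption made here purely for illustration:

```python
# Confusion matrix from Out[28]: rows are true classes, columns are predicted classes
cm = [[3746, 84],
      [11, 6575]]

total = sum(sum(row) for row in cm)   # all test samples
correct = cm[0][0] + cm[1][1]         # diagonal entries
accuracy = correct / total

# Class 1 as the "positive" class (assumption for this example)
precision_1 = cm[1][1] / (cm[0][1] + cm[1][1])  # of all predicted 1, how many are truly 1
recall_1 = cm[1][1] / (cm[1][0] + cm[1][1])     # of all true 1, how many were found

print(total, correct)  # 10416 10321
print(accuracy)
print(precision_1, recall_1)
```

The accuracy computed this way, 10,321 correct out of 10,416 test samples, agrees with the score of roughly 0.9909 shown in Out[27].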

