Using TF-IDF to improve the result

The technique is called term frequency-inverse document frequency (TF-IDF), and we encountered it in Chapter 4, Representing Data and Engineering Features. If you recall, TF-IDF weights each word count by a measure of how often the word appears in the entire dataset. A useful side effect of this method is the IDF part, the inverse frequency of words: it makes sure that frequent words, such as and, the, and but, carry only a small weight in the classification.

We apply TF-IDF by calling fit_transform on our existing feature matrix, X:

In [24]: tfidf = feature_extraction.text.TfidfTransformer()
In [25]: X_new = tfidf.fit_transform(X)
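
To make the weighting concrete, the sketch below recomputes TF-IDF by hand for a tiny, made-up count matrix. It assumes the default settings of scikit-learn's TfidfTransformer (smoothed IDF, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row). The helper name tfidf_rows and the toy counts are hypothetical and serve only as an illustration:

```python
import math

def tfidf_rows(counts):
    """Hand-rolled TF-IDF for a list of per-document count rows.

    Assumes smoothed IDF plus L2 row normalization, which mirrors
    the default settings of scikit-learn's TfidfTransformer.
    """
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: in how many documents does each term occur?
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Smoothed inverse document frequency
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    weighted = []
    for row in counts:
        w = [c * i for c, i in zip(row, idf)]
        # L2-normalize the row (guard against all-zero rows)
        norm = math.sqrt(sum(v * v for v in w)) or 1.0
        weighted.append([v / norm for v in w])
    return weighted

# Toy data (hypothetical): columns are the words "the", "money", "meeting"
counts = [
    [3, 1, 0],
    [2, 0, 1],
    [4, 0, 0],
]
weights = tfidf_rows(counts)
```

In the first document, the word "the" occurs three times and "money" only once, yet per occurrence "money" receives the larger weight because it appears in just one of the three documents.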

Don't forget to split the data again. You can also tweak the random_state parameter: changing this seed changes which samples end up in the training and test sets, so the overall accuracy might change as well:

In [26]: X_train, X_test, y_train, y_test = ms.train_test_split(X_new, y,
... test_size=0.2, random_state=42)
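
The role of the seed can be illustrated with a simplified stand-in for a seeded split. This is not how train_test_split is implemented; the function split_indices is a hypothetical helper that only shows why a fixed random_state makes the split reproducible:

```python
import random

def split_indices(n_samples, test_size=0.2, random_state=None):
    """Shuffle sample indices with a seed and cut off a test portion."""
    indices = list(range(n_samples))
    # A dedicated generator seeded with random_state makes the shuffle repeatable
    random.Random(random_state).shuffle(indices)
    n_test = int(round(n_samples * test_size))
    # Return (train indices, test indices)
    return indices[n_test:], indices[:n_test]

train_a, test_a = split_indices(10, random_state=42)
train_b, test_b = split_indices(10, random_state=42)  # same seed, same split
train_c, test_c = split_indices(10, random_state=7)   # different seed
```

The same seed always reproduces the same split, whereas a different seed will usually select a different set of test samples, which is why the reported accuracy can shift slightly from run to run.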

Then, when we train and score the classifier again, we suddenly find a remarkable score of 99% accuracy:

In [27]: model_naive = naive_bayes.MultinomialNB()
... model_naive.fit(X_train, y_train)
... model_naive.score(X_test, y_test)
Out[27]: 0.99087941628264209

To convince us of the classifier's awesomeness, we can inspect the confusion matrix. This is a matrix that shows, for every class, how many samples were classified correctly and how many were misclassified as belonging to a different class. The diagonal elements tell us how many samples of class i were correctly classified as belonging to class i. The off-diagonal elements represent misclassifications:

In [28]: metrics.confusion_matrix(y_test, model_naive.predict(X_test))
Out[28]: array([[3746,   84],
                [  11, 6575]])

This tells us we got 3,746 class 0 classifications correct, and 6,575 class 1 classifications correct. We confused 84 samples of class 0 as belonging to class 1 and 11 samples of class 1 as belonging to class 0. If you ask me, that's about as good as it gets.
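
As a quick sanity check, the accuracy reported earlier can be recovered from these four numbers. The sketch below copies the matrix from Out[28] into a plain Python list and assumes scikit-learn's convention that rows are true classes and columns are predicted classes; the variable names are chosen for illustration:

```python
# Confusion matrix from Out[28]: rows are true classes, columns are predictions
cm = [[3746, 84],
      [11, 6575]]

total = sum(sum(row) for row in cm)
correct = cm[0][0] + cm[1][1]        # sum of the diagonal
accuracy = correct / total

# Per-class recall: fraction of each true class that was labeled correctly
recall_0 = cm[0][0] / sum(cm[0])
recall_1 = cm[1][1] / sum(cm[1])

print(accuracy, recall_0, recall_1)
```

The accuracy computed this way agrees with the score from Out[27], and the per-class recall shows that class 0 is misclassified somewhat more often than class 1.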
