Training on the full dataset

However, if we want to classify the full dataset, we need a more sophisticated approach. For this, we turn to scikit-learn's Naive Bayes classifier, as it knows how to handle sparse matrices. In fact, if you weren't paying attention and treated X_train like every other NumPy array you've worked with before, you might not even notice that anything is different:

In [17]: from sklearn import naive_bayes
... model_naive = naive_bayes.MultinomialNB()
... model_naive.fit(X_train, y_train)
Out[17]: MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Here, we used MultinomialNB from the naive_bayes module, which is the variant of the Naive Bayes classifier best suited to count data, such as word counts.
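
To make the connection to word counts concrete, here is a minimal, self-contained sketch (the toy phrases and variable names are invented for illustration and are not taken from our dataset): a CountVectorizer turns a handful of short texts into a sparse count matrix, which MultinomialNB consumes directly, just as it did with X_train above:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# A toy corpus: two spam-like and two ham-like phrases (invented for illustration)
texts = ['win cash now', 'cheap cash offer', 'meeting at noon', 'lunch at noon']
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

# CountVectorizer produces a SciPy CSR sparse matrix of word counts
counts = CountVectorizer().fit_transform(texts)

# MultinomialNB accepts the sparse matrix as-is; no conversion to dense needed
clf = MultinomialNB().fit(counts, labels)
clf.predict(counts)  # expected: array([1, 1, 0, 0]) on the training phrases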

The classifier trains almost instantly, and we can then compute the scores for both the training and the test sets:

In [18]: model_naive.score(X_train, y_train)
Out[18]: 0.95086413826212191
In [19]: model_naive.score(X_test, y_test)
Out[19]: 0.94422043010752688
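
If you are wondering what score actually measures for a classifier, it is simply the mean accuracy, that is, the fraction of correctly predicted labels. A quick sketch that reproduces In [19] by hand:

import numpy as np

# For classifiers, score returns the mean accuracy of the predictions
y_pred = model_naive.predict(X_test)
np.mean(y_pred == y_test)  # same value as model_naive.score(X_test, y_test)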

And there we have it: 94.4% accuracy on the test set! Pretty good for not doing much other than using the default values, isn't it?

However, what if we were super critical of our own work and wanted to improve the result even further? There are a couple of things we could do.
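
One obvious candidate, to give just a hedged example (not necessarily the route taken in the remainder of the text), is the smoothing parameter alpha that Out[17] showed at its default value of 1.0. A small grid search over alpha could look like this, with the grid values chosen arbitrarily for illustration:

from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Hedged example: tune the Laplace/Lidstone smoothing parameter alpha
# using 5-fold cross-validation on the training set
param_grid = {'alpha': [0.01, 0.1, 0.5, 1.0, 2.0]}
search = GridSearchCV(MultinomialNB(), param_grid, cv=5)
search.fit(X_train, y_train)
search.best_params_, search.best_estimator_.score(X_test, y_test)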
