Training on the full dataset

However, if we want to classify the full dataset, we need a more sophisticated approach. For this, we turn to scikit-learn's Naive Bayes classifier, as it knows how to handle sparse matrices. In fact, if you weren't paying attention and treated X_train like every other NumPy array you've worked with before, you might not even notice that anything is different:

In [17]: from sklearn import naive_bayes
... model_naive = naive_bayes.MultinomialNB()
... model_naive.fit(X_train, y_train)
Out[17]: MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Here, we used MultinomialNB from the naive_bayes module, which is the variant of the Naive Bayes classifier best suited to count data, such as word counts.
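
To make the connection to word counts concrete, here is a minimal, self-contained sketch (the toy phrases and variable names are invented for illustration and are not taken from our dataset): a CountVectorizer turns a handful of short texts into a sparse count matrix, which MultinomialNB consumes directly, just as it did with X_train above:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# A toy corpus: two spam-like and two ham-like phrases (invented for illustration)
texts = ['win cash now', 'cheap cash offer', 'meeting at noon', 'lunch at noon']
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

# CountVectorizer produces a SciPy CSR sparse matrix of word counts
counts = CountVectorizer().fit_transform(texts)

# MultinomialNB accepts the sparse matrix as-is; no conversion to dense needed
clf = MultinomialNB().fit(counts, labels)
clf.predict(counts)  # expected: array([1, 1, 0, 0]) on the training phrases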

The classifier trains almost instantly, and we can then compute the scores for both the training and the test sets:

In [18]: model_naive.score(X_train, y_train)
Out[18]: 0.95086413826212191
In [19]: model_naive.score(X_test, y_test)
Out[19]: 0.94422043010752688
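
If you are wondering what score actually measures for a classifier, it is simply the mean accuracy, that is, the fraction of correctly predicted labels. A quick sketch that reproduces In [19] by hand:

import numpy as np

# For classifiers, score returns the mean accuracy of the predictions
y_pred = model_naive.predict(X_test)
np.mean(y_pred == y_test)  # same value as model_naive.score(X_test, y_test)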

And there we have it: 94.4% accuracy on the test set! Pretty good for not doing much other than using the default values, isn't it?

However, what if we were super critical of our own work and wanted to improve the result even further? There are a couple of things we could do.
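
One obvious candidate, to give just a hedged example (not necessarily the route taken in the remainder of the text), is the smoothing parameter alpha that Out[17] showed at its default value of 1.0. A small grid search over alpha could look like this, with the grid values chosen arbitrarily for illustration:

from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Hedged example: tune the Laplace/Lidstone smoothing parameter alpha
# using 5-fold cross-validation on the training set
param_grid = {'alpha': [0.01, 0.1, 0.5, 1.0, 2.0]}
search = GridSearchCV(MultinomialNB(), param_grid, cv=5)
search.fit(X_train, y_train)
search.best_params_, search.best_estimator_.score(X_test, y_test)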
