We split the data into train and test sets using the default 75:25 ratio, stratifying on the labels so that the class proportions in the test set closely mirror those in the training set:
import pandas as pd
from sklearn.model_selection import train_test_split
y = pd.factorize(docs.topic)[0] # create integer class values
X = docs.body
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)
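To make the effect of stratification concrete, the following is a minimal sketch that uses only the standard library. It is not how sklearn implements train_test_split; the stratified_split helper and the imbalanced toy labels are assumptions made for illustration. It splits each class separately so that both subsets keep the original class proportions.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.25, seed=1):
    """Return (train_idx, test_idx) with per-class proportions preserved.

    Hypothetical helper for illustration; not sklearn's algorithm.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical imbalanced labels: 80 documents of class 0, 20 of class 1
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts, test_counts)
```

Both subsets keep the 4:1 class ratio of the toy labels, which is the property that stratify=y requests from train_test_split.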
We then learn the vocabulary from the training set only and transform both datasets using CountVectorizer with default settings, which yields 25,919 features:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X_train_dtm = vectorizer.fit_transform(X_train)
X_test_dtm = vectorizer.transform(X_test)
X_train_dtm.shape, X_test_dtm.shape
((1668, 25919), (557, 25919))
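Both matrices have the same number of columns because the vocabulary is fixed by the training documents; test-set tokens that never appeared in training are dropped. The sketch below illustrates this with a simplified bag-of-words encoder. It is not CountVectorizer itself: the helper names and the toy sentences are assumptions, and the tokenizer only approximates the default behavior (lowercasing and keeping tokens of two or more word characters).

```python
import re
from collections import Counter

# Tokens of two or more word characters, similar to CountVectorizer's default
TOKEN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    return TOKEN.findall(text.lower())


def fit_vocabulary(docs):
    """Map each distinct training token to a column index (alphabetical order)."""
    terms = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {term: col for col, term in enumerate(terms)}


def transform(docs, vocab):
    """Build a dense document-term matrix; tokens outside vocab are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts[term] for term in vocab])
    return rows


# Hypothetical toy documents
train_docs = ["Stocks rise as markets rally", "Team wins the cup final"]
test_docs = ["Markets rally after cup final upset"]

vocab = fit_vocabulary(train_docs)
train_dtm = transform(train_docs, vocab)
test_dtm = transform(test_docs, vocab)
print(len(vocab), test_dtm)
```

The words "after" and "upset" occur only in the test document, so they contribute no column and no count. This sketch returns dense lists, whereas CountVectorizer returns a sparse matrix.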
Training and prediction follow the standard sklearn fit/predict interface:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)
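To show what fit and predict compute conceptually, here is a minimal multinomial Naive Bayes sketch in plain Python. It assumes Laplace smoothing with alpha=1.0, which matches the MultinomialNB default, but the function names and the toy counts are hypothetical and the code makes no attempt to reproduce sklearn's implementation or its handling of sparse input.

```python
import math
from collections import defaultdict


def fit_multinomial_nb(rows, labels, alpha=1.0):
    """Estimate the log prior and smoothed log term likelihoods for each class."""
    n_features = len(rows[0])
    class_doc_counts = defaultdict(int)
    term_counts = defaultdict(lambda: [0] * n_features)
    for row, label in zip(rows, labels):
        class_doc_counts[label] += 1
        for j, count in enumerate(row):
            term_counts[label][j] += count

    model = {}
    for label, n_docs in class_doc_counts.items():
        # Additive smoothing keeps unseen terms from getting zero probability
        total = sum(term_counts[label]) + alpha * n_features
        log_prior = math.log(n_docs / len(rows))
        log_lik = [math.log((c + alpha) / total) for c in term_counts[label]]
        model[label] = (log_prior, log_lik)
    return model


def predict(model, rows):
    """Pick the class with the highest log posterior score for each row."""
    predictions = []
    for row in rows:
        scores = {
            label: log_prior + sum(c * ll for c, ll in zip(row, log_lik))
            for label, (log_prior, log_lik) in model.items()
        }
        predictions.append(max(scores, key=scores.get))
    return predictions


# Hypothetical term counts over a three-term vocabulary: ["goal", "market", "stock"]
train_rows = [[3, 0, 0], [2, 1, 0], [0, 2, 3], [0, 1, 2]]
train_labels = [0, 0, 1, 1]

model = fit_multinomial_nb(train_rows, train_labels)
preds = predict(model, [[1, 0, 0], [0, 1, 1]])
print(preds)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters for a vocabulary with tens of thousands of terms.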
We evaluate the multiclass predictions using accuracy and find that the default classifier achieves roughly 97.7%:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred_class)
0.97666068222621
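Accuracy is simply the share of test documents whose predicted class equals the true class. The short sketch below spells this out with a hypothetical accuracy helper and made-up labels. It also back-computes the number of correct predictions implied by the score above for the 557 test documents; that count is inferred from the reported figures rather than stated in the output.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels for eight documents across five classes
y_true = [0, 1, 2, 2, 1, 0, 3, 4]
y_pred = [0, 1, 2, 1, 1, 0, 3, 4]
acc = accuracy(y_true, y_pred)  # 7 of 8 correct

# Correct predictions implied by the reported score and test set size
implied_correct = round(0.97666068222621 * 557)
print(acc, implied_correct)
```

Under this reading, the classifier labels 544 of the 557 test documents correctly and misclassifies 13. Because accuracy weights every document equally, it is most informative when the classes are reasonably balanced.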