Training and evaluating a multinomial Naive Bayes classifier

We split the data into the default 75:25 train-test sets, using stratification to ensure that the class distribution in the test set closely mirrors that of the training set:

import pandas as pd
from sklearn.model_selection import train_test_split

y = pd.factorize(docs.topic)[0]  # encode topic labels as integers
X = docs.body
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=1,
                                                    stratify=y)
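To see what stratification buys us, the following self-contained sketch (using toy labels rather than the article data) compares class frequencies in the train and test subsets; with `stratify=y`, the proportions stay nearly identical:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 100 samples with an imbalanced 70:30 class split
y = np.array([0] * 70 + [1] * 30)
X = np.arange(len(y))

# Default split is 75:25; stratify=y preserves the class mix in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1, stratify=y)

train_share = np.bincount(y_tr) / len(y_tr)  # per-class fraction in train set
test_share = np.bincount(y_te) / len(y_te)   # per-class fraction in test set
print(train_share, test_share)
```

Both arrays come out close to the original 0.7/0.3 mix, which is exactly the property the stratified split above relies on.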

Next, we learn the vocabulary from the training set and transform both datasets using CountVectorizer with default settings, obtaining almost 26,000 features:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X_train_dtm = vectorizer.fit_transform(X_train)  # learn vocabulary, then transform
X_test_dtm = vectorizer.transform(X_test)        # reuse the training vocabulary
X_train_dtm.shape, X_test_dtm.shape
((1668, 25919), (557, 25919))

Training and prediction follow the standard sklearn fit/predict interface:

from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)
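Besides hard class labels, MultinomialNB also exposes class probabilities via `predict_proba`. A self-contained sketch on hand-made term counts (not the article data) shows both calls:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy document-term counts: 4 documents, 3 terms, 2 classes
X_counts = np.array([[3, 0, 1],
                     [2, 0, 0],
                     [0, 4, 1],
                     [0, 3, 2]])
y_toy = np.array([0, 0, 1, 1])

nb = MultinomialNB().fit(X_counts, y_toy)

pred = nb.predict(np.array([[2, 0, 1]]))         # counts resembling class 0
proba = nb.predict_proba(np.array([[0, 5, 0]]))  # counts resembling class 1
print(pred, proba)
```

The model assigns the first query to class 0 and gives the second a high probability for class 1, since the term counts line up with those classes' training documents.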

We evaluate the multiclass predictions using accuracy and find that the default classifier achieves almost 98%:

from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred_class)
0.97666068222621
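Accuracy is a single aggregate number and can hide per-class differences in a multiclass setting; a confusion matrix surfaces them. The sketch below uses toy labels (not the article predictions) to show both metrics side by side:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy multiclass labels standing in for y_test / y_pred_class
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])

acc = accuracy_score(y_true, y_pred)   # fraction of correct predictions
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted
print(acc)
print(cm)
```

Here the single misclassification (a class-1 sample predicted as class 2) appears off the diagonal of `cm`, information the scalar accuracy of 5/6 does not convey.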