Recap and inverse document frequency

In the previous chapter, we detected spam emails by applying a naive Bayes classifier to the extracted feature space. The feature space was represented by term frequency (tf), where the collection of text documents was converted to a matrix of term counts. This representation reflects how terms are distributed within each individual document, but it ignores how terms are distributed across all documents in the corpus. For example, some words occur frequently in the language in general, while others occur rarely yet convey important information.

Because of this, we are encouraged to adopt a more comprehensive approach to extracting text features: term frequency-inverse document frequency (tf-idf). It assigns each term frequency a weighting factor that is inversely proportional to the document frequency, the fraction of documents containing the term. In practice, the idf factor of a term t in a set of documents D is calculated as follows:

idf(t, D) = log(n_D / (1 + n_t))

Where n_D is the total number of documents, n_t is the number of documents containing t, and the 1 is added to avoid division by zero.

With the idf factor incorporated, the tf-idf representation diminishes the weight of common terms (such as "get" and "make") that occur frequently across documents, and emphasizes terms that occur rarely but carry meaning.
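To make the effect concrete, here is a minimal sketch that applies the idf formula above to a hypothetical three-document toy corpus (the corpus and the idf helper are illustrative only, not part of our spam dataset or of scikit-learn):

>>> import math
>>> # Toy corpus: 'get' appears in two documents, 'prize' in only one
>>> docs = ['get the free prize now',
...         'get together for lunch',
...         'project meeting agenda']
>>> def idf(term, documents):
...     n_d = len(documents)                # total number of documents
...     n_t = sum(1 for doc in documents
...               if term in doc.split())   # documents containing the term
...     return math.log(n_d / (1 + n_t))
>>> idf('get', docs)      # common term is weighted down
0.0
>>> idf('prize', docs)    # rare term is weighted up
0.4054651081081644

The common word "get" ends up with no weight at all, while the rarer "prize" is boosted. TfidfVectorizer applies the same idea to every term in the vocabulary, with its own smoothing and normalization details.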

We can test the effectiveness of tf-idf on our existing spam email detection model by simply replacing the tf feature extractor, CountVectorizer, with the tf-idf feature extractor, TfidfVectorizer, from scikit-learn. We will reuse most of the previous code and only tune the naive Bayes smoothing term:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from collections import defaultdict
>>> smoothing_factor_option = [1.0, 2.0, 3.0, 4.0, 5.0]
>>> auc_record = defaultdict(float)
>>> for train_indices, test_indices in k_fold.split(cleaned_emails, labels):
...     X_train, X_test = cleaned_emails_np[train_indices], cleaned_emails_np[test_indices]
...     Y_train, Y_test = labels_np[train_indices], labels_np[test_indices]
...     tfidf_vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
...                                        stop_words='english', max_features=8000)
...     # Learn the tf-idf vocabulary on the training fold only, then transform the test fold
...     term_docs_train = tfidf_vectorizer.fit_transform(X_train)
...     term_docs_test = tfidf_vectorizer.transform(X_test)
...     for smoothing_factor in smoothing_factor_option:
...         clf = MultinomialNB(alpha=smoothing_factor, fit_prior=True)
...         clf.fit(term_docs_train, Y_train)
...         prediction_prob = clf.predict_proba(term_docs_test)
...         pos_prob = prediction_prob[:, 1]       # probability of the spam (positive) class
...         auc = roc_auc_score(Y_test, pos_prob)
...         auc_record[smoothing_factor] += auc    # accumulate AUC over the k folds
>>> print('max features  smoothing  fit prior  auc')
>>> for smoothing, smoothing_record in auc_record.items():
...     print('       8000        {0}      True    {1:.4f}'.format(
...         smoothing, smoothing_record/k))
max features smoothing fit prior auc
8000 1.0 True 0.9920
8000 2.0 True 0.9930
8000 3.0 True 0.9936
8000 4.0 True 0.9940
8000 5.0 True 0.9943

The best averaged 10-fold AUC, 0.9943, is achieved with a smoothing factor of 5.0, outperforming the 0.9856 obtained with tf features.
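If you prefer to read off the winning configuration programmatically rather than scanning the printed table, a small follow-up sketch (reusing the auc_record and k defined above) could be:

>>> # Pick the smoothing factor with the highest accumulated AUC across the folds
>>> best_smoothing, best_auc_sum = max(auc_record.items(), key=lambda kv: kv[1])
>>> print('Best smoothing factor: {0}, averaged AUC: {1:.4f}'.format(
...     best_smoothing, best_auc_sum / k))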
