Classifying newsgroup topics with SVMs

Finally, it is time to build our SVM-based newsgroup topic classifier, putting together everything we have just learned.

First, we load and clean the dataset with all 20 groups, as follows:

>>> from sklearn.datasets import fetch_20newsgroups
>>> categories = None
>>> data_train = fetch_20newsgroups(subset='train',
...                                 categories=categories, random_state=42)
>>> data_test = fetch_20newsgroups(subset='test',
...                                categories=categories, random_state=42)
>>> cleaned_train = clean_text(data_train.data)
>>> label_train = data_train.target
>>> cleaned_test = clean_text(data_test.data)
>>> label_test = data_test.target
>>> term_docs_train = tfidf_vectorizer.fit_transform(cleaned_train)
>>> term_docs_test = tfidf_vectorizer.transform(cleaned_test)

As we have seen, the linear kernel is good at classifying text data, so we will continue using linear as the value of the kernel hyperparameter; this leaves only the penalty C to tune, through cross-validation:

>>> from sklearn.svm import SVC
>>> svc_libsvm = SVC(kernel='linear')

The way we have conducted cross-validation so far is to explicitly split the data into folds and write a for loop to consecutively examine each hyperparameter value. To make this less redundant, we introduce a more elegant approach utilizing the GridSearchCV class from scikit-learn. GridSearchCV handles the entire process implicitly, including data splitting, fold generation, cross training and validation, and an exhaustive search over the candidate sets of parameters. What is left for us is just to specify the hyperparameter(s) to tune and the values to explore for each individual hyperparameter.
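For comparison, the manual loop that GridSearchCV replaces might look like the following sketch; it relies on scikit-learn's cross_val_score, and the candidate C values are the same ones we are about to search:

>>> from sklearn.model_selection import cross_val_score
>>> import numpy as np
>>> # try each candidate C with five-fold cross-validation, by hand
>>> for c in (0.1, 1, 10, 100):
...     scores = cross_val_score(SVC(kernel='linear', C=c),
...                              term_docs_train, label_train, cv=5)
...     print('C = {0}: avg accuracy = {1:.4f}'.format(c, np.mean(scores)))

With GridSearchCV, we instead specify the grid once and let it do the work: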

>>> parameters = {'C': (0.1, 1, 10, 100)}
>>> from sklearn.model_selection import GridSearchCV
>>> grid_search = GridSearchCV(svc_libsvm, parameters, n_jobs=-1, cv=5)

The GridSearchCV model we just initialized will conduct five-fold cross-validation (cv=5) and will run in parallel on all available cores (n_jobs=-1). We then perform hyperparameter tuning by simply applying the fit method, and record the running time:

>>> import timeit
>>> start_time = timeit.default_timer()
>>> grid_search.fit(term_docs_train, label_train)
>>> print("--- %0.3fs seconds ---" % (timeit.default_timer() - start_time))
--- 525.728s seconds ---

We can obtain the optimal set of parameters (the optimal C in this case) using the following code:

>>> grid_search.best_params_
{'C': 10}

And we obtain the best five-fold averaged performance under the optimal set of parameters by using the following code:

>>> grid_search.best_score_
0.8888987095633728

We then retrieve the SVM model with the optimal hyperparameter and apply it to the testing set:

>>> svc_libsvm_best = grid_search.best_estimator_
>>> accuracy = svc_libsvm_best.score(term_docs_test, label_test)
>>> print('The accuracy of 20-class classification is: {0:.1f}%'.format(accuracy*100))
The accuracy of 20-class classification is: 78.7%

It should be noted that we tune the model based on the original training set, which is divided into folds for cross training and validation, and then apply the optimal model to the original testing set. We examine the classification performance in this manner in order to measure how well the model generalizes to making correct predictions on a completely new dataset. An accuracy of 78.7% is achieved with our first SVC model.
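If per-class performance is of interest beyond the overall accuracy, a quick sketch using scikit-learn's classification_report displays precision, recall, and F1 score for each of the 20 groups:

>>> from sklearn.metrics import classification_report
>>> # predict on the testing set and report per-class metrics
>>> prediction = svc_libsvm_best.predict(term_docs_test)
>>> print(classification_report(label_test, prediction,
...                             target_names=data_test.target_names))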

scikit-learn provides another SVM classifier, LinearSVC. How does it perform? LinearSVC is similar to SVC with a linear kernel, but it is implemented on top of the liblinear library, which is better optimized for linear kernels than libsvm. We repeat the same preceding process with LinearSVC as follows:

>>> from sklearn.svm import LinearSVC
>>> svc_linear = LinearSVC()
>>> grid_search = GridSearchCV(svc_linear, parameters,
...                            n_jobs=-1, cv=5)
>>> start_time = timeit.default_timer()
>>> grid_search.fit(term_docs_train, label_train)
>>> print("--- %0.3f seconds ---" %
...       (timeit.default_timer() - start_time))
--- 19.915 seconds ---
>>> grid_search.best_params_
{'C': 1}

>>> grid_search.best_score_
0.894643804136468

>>> svc_linear_best = grid_search.best_estimator_
>>> accuracy = svc_linear_best.score(term_docs_test, label_test)
>>> print('The accuracy of 20-class classification is: {0:.1f}%'.format(accuracy*100))
The accuracy of 20-class classification is: 79.9%

The LinearSVC model outperforms SVC, and its training is more than 26 times faster. This is because the highly scalable liblinear library is designed for large datasets, while the libsvm library, with more than quadratic computational complexity in the number of training instances, does not scale well to large training sets.
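If you want to observe this scaling behavior yourself, a rough sketch that times both classifiers on growing subsets of the training data (the subset sizes here are arbitrary) could look like this:

>>> # time fitting on increasing numbers of training samples
>>> for n in (1000, 2000, 4000):
...     for model in (SVC(kernel='linear'), LinearSVC()):
...         start_time = timeit.default_timer()
...         model.fit(term_docs_train[:n], label_train[:n])
...         print('{0} on {1} samples: {2:.3f}s'.format(
...             type(model).__name__, n,
...             timeit.default_timer() - start_time))

The SVC timings should grow considerably faster with the sample size than the LinearSVC ones.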

We can also tweak the feature extractor, the TfidfVectorizer model, to further improve the performance. Since feature extraction and classification are two consecutive steps, they should be cross-validated collectively. We utilize the Pipeline API from scikit-learn to facilitate this.

The tfidf feature extractor and linear SVM classifier are first assembled in the pipeline:

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> pipeline = Pipeline([
...     ('tfidf', TfidfVectorizer(stop_words='english')),
...     ('svc', LinearSVC()),
... ])

The hyperparameters to tune are defined as follows, with the pipeline step name and the parameter name joined by __ as the key, and a tuple of corresponding options as the value:

>>> parameters_pipeline = {
... 'tfidf__max_df': (0.25, 0.5, 1.0),
... 'tfidf__max_features': (10000, None),
... 'tfidf__sublinear_tf': (True, False),
... 'tfidf__smooth_idf': (True, False),
... 'svc__C': (0.3, 1, 3),
... }

Besides the penalty C for the SVM classifier, we tune the tfidf feature extractor in terms of the following (a toy example follows the list):

  • max_df: The maximum document frequency a term is allowed to have, used to filter out common terms that occur across most documents
  • max_features: The number of top features to consider; None means all features are used
  • sublinear_tf: Whether or not to scale the term frequency with the logarithm
  • smooth_idf: Whether or not to add 1 to document frequencies, analogous to a smoothing factor for the term frequency
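As a quick illustration of the first option, here is a toy example (with made-up documents) showing how max_df=0.5 drops terms that occur in more than half of the documents:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> toy_docs = ['the cat sat', 'the dog ran', 'the cat ran']
>>> # 'the', 'cat', and 'ran' each appear in more than half of the documents
>>> toy_vectorizer = TfidfVectorizer(max_df=0.5, sublinear_tf=True,
...                                  smooth_idf=False)
>>> toy_vectorizer.fit(toy_docs)
>>> sorted(toy_vectorizer.vocabulary_)
['dog', 'sat']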

The grid search model searches for the optimal set of parameters throughout the entire pipeline:

>>> grid_search = GridSearchCV(pipeline, parameters_pipeline,
...                            n_jobs=-1, cv=5)
>>> start_time = timeit.default_timer()
>>> grid_search.fit(cleaned_train, label_train)
>>> print("--- %0.3f seconds ---" %
...       (timeit.default_timer() - start_time))
--- 333.761 seconds ---
>>> grid_search.best_params_
{'svc__C': 1, 'tfidf__max_df': 0.5, 'tfidf__max_features': None,
 'tfidf__smooth_idf': False, 'tfidf__sublinear_tf': True}
>>> grid_search.best_score_
0.9018914619056037

>>> pipeline_best = grid_search.best_estimator_

Finally, the optimal model is applied to the testing set as follows:

>>> accuracy = pipeline_best.score(cleaned_test, label_test)
>>> print('The accuracy of 20-class classification is: {0:.1f}%'.format(accuracy*100))
The accuracy of 20-class classification is: 81.0%

The set of hyperparameters, {max_df: 0.5, smooth_idf: False, max_features: None, sublinear_tf: True, C: 1}, facilitates the best classification accuracy, 81.0%, on the entire dataset of 20 groups.
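Beyond the single best combination, the fitted GridSearchCV object records the averaged validation score of every combination it tried in its cv_results_ attribute; a minimal sketch to list them:

>>> # pair each parameter combination with its mean validation score
>>> for params, score in zip(grid_search.cv_results_['params'],
...                          grid_search.cv_results_['mean_test_score']):
...     print(params, '{0:.4f}'.format(score))

This can be handy for deciding whether the search ranges should be extended in a particular direction.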
