The implementations of SVM

We have now covered the fundamentals of the SVM classifier. Let's apply it right away to news topic classification, starting with a binary case that distinguishes two topics, comp.graphics and sci.space.

First, load the training and testing subsets of the computer graphics and science space newsgroup data, respectively:

>>> from sklearn.datasets import fetch_20newsgroups
>>> categories = ['comp.graphics', 'sci.space']
>>> data_train = fetch_20newsgroups(subset='train',
...     categories=categories, random_state=42)
>>> data_test = fetch_20newsgroups(subset='test',
...     categories=categories, random_state=42)

Again, don't forget to specify a random state so that the experiments are reproducible.

Clean the text data and retrieve label information:

>>> cleaned_train = clean_text(data_train.data)
>>> label_train = data_train.target
>>> cleaned_test = clean_text(data_test.data)
>>> label_test = data_test.target
>>> len(label_train), len(label_test)
(1177, 783)
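Note that clean_text is the text-cleaning helper defined earlier in the chapter. If you are working through this section on its own, one possible stand-in (a minimal sketch assuming the NLTK names corpus and WordNet data have been downloaded, not necessarily the exact implementation used earlier) is:

>>> from nltk.corpus import names
>>> from nltk.stem import WordNetLemmatizer
>>> all_names = set(names.words())
>>> lemmatizer = WordNetLemmatizer()
>>> def clean_text(docs):
...     # lowercase, keep alphabetic tokens that are not person names, and lemmatize them
...     return [' '.join(lemmatizer.lemmatize(word.lower())
...                      for word in doc.split()
...                      if word.isalpha() and word not in all_names)
...             for doc in docs]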

As a good practice, check whether the classes are imbalanced:

>>> from collections import Counter
>>> Counter(label_train)
Counter({1: 593, 0: 584})
>>> Counter(label_test)
Counter({1: 394, 0: 389})

Next, extract tf-idf features using the TfidfVectorizer extractor that we just learned about:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> tfidf_vectorizer = TfidfVectorizer(sublinear_tf=True,
...     max_df=0.5, stop_words='english', max_features=8000)
>>> term_docs_train = tfidf_vectorizer.fit_transform(cleaned_train)
>>> term_docs_test = tfidf_vectorizer.transform(cleaned_test)
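Before training, it is worth a quick sanity check on the feature matrices. Since transform reuses the vocabulary learned by fit_transform, both matrices should have the same number of columns (at most 8,000, given max_features=8000), while the row counts correspond to the 1,177 training and 783 testing documents we saw earlier:

>>> print(term_docs_train.shape)   # expect 1177 rows, at most 8,000 columns
>>> print(term_docs_test.shape)    # expect 783 rows, same number of columns as above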

With the features ready, we can now apply the SVM algorithm. Initialize an SVC model with the kernel parameter set to linear (we will explain what this means shortly) and the penalty C set to its default value of 1:

>>> from sklearn.svm import SVC
>>> svm = SVC(kernel='linear', C=1.0, random_state=42)

Then fit our model on the training set:

>>> svm.fit(term_docs_train, label_train)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape=None, degree=3, gamma='auto',
    kernel='linear', max_iter=-1, probability=False, random_state=42,
    shrinking=True, tol=0.001, verbose=False)

Then, predict on the testing set with the trained model and obtain the prediction accuracy directly:

>>> accuracy = svm.score(term_docs_test, label_test)
>>> print('The accuracy on testing set is: {0:.1f}%'.format(accuracy*100))
The accuracy on testing set is: 96.4%
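The score method is a convenient shortcut that combines prediction and accuracy computation. Equivalently, you can generate the predictions explicitly and evaluate them yourself, which also makes it easy to compute other metrics later. Here is a small sketch using scikit-learn's accuracy_score; the result should agree with the 96.4% above:

>>> from sklearn.metrics import accuracy_score
>>> prediction = svm.predict(term_docs_test)
>>> accuracy_score(label_test, prediction)   # same value as svm.score above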

Our first SVM model works very well, achieving an accuracy of 96.4%. What about more than two topics? How does SVM handle multiclass classification?
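As a quick preview before we dig in: scikit-learn's SVC supports multiclass problems out of the box by internally training one-vs-one binary classifiers, so the fitting code barely changes. A minimal sketch (assuming the same loading, cleaning, and tf-idf steps as above, just with an extra category such as rec.sport.hockey added for illustration) could look like this:

>>> categories = ['comp.graphics', 'sci.space', 'rec.sport.hockey']
>>> data_train = fetch_20newsgroups(subset='train',
...     categories=categories, random_state=42)
>>> # ...repeat the cleaning and tf-idf extraction as before...
>>> svm = SVC(kernel='linear', C=1.0, random_state=42)
>>> svm.fit(term_docs_train, label_train)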
