Handling multiclass classification

One last thing worth noting is how logistic regression algorithms deal with multiclass classification. Although we interact with scikit-learn classifiers in multiclass cases in the same way as in binary cases, it is worth understanding how logistic regression works in multiclass classification.

Logistic regression for more than two classes is also called multinomial logistic regression, better known these days as softmax regression. Recall that in the binary case, the model is represented by one weight vector $w$, and the probability of the target being "1", or the positive class, is written as $\hat{y} = P(y=1 \mid x) = \frac{1}{1 + \exp(-w^T x)}$. In a K-class case, the model is represented by K weight vectors, $w_1, w_2, \ldots, w_K$, and the probability of the target being class $k$ is written as follows:

$\hat{y}_k = P(y=k \mid x) = \frac{\exp(w_k^T x)}{\sum_{j=1}^{K} \exp(w_j^T x)}$
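
Before moving on, here is a minimal NumPy sketch (not from the original text) of how these class probabilities can be computed; the weight matrix W and the sample x are made-up values for illustration:

>>> import numpy as np
>>> def softmax_probs(W, x):
...     # W stacks the K weight vectors as rows: shape (K, n_features)
...     scores = W.dot(x)
...     # subtracting the max score keeps exp() numerically stable
...     exp_scores = np.exp(scores - scores.max())
...     # divide by the normalizing term so the K probabilities sum to 1
...     return exp_scores / exp_scores.sum()
>>> W = np.array([[0.1, 0.5], [0.8, -0.2], [-0.3, 0.4]])
>>> x = np.array([1.0, 2.0])
>>> softmax_probs(W, x)   # roughly array([0.489, 0.243, 0.268]), summing to 1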

Note that the term $\sum_{j=1}^{K} \exp(w_j^T x)$ normalizes the probabilities $\hat{y}_k$ ($k$ from 1 to $K$) so that they sum to 1. The cost function in the binary case is expressed as $J(w) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log (1 - \hat{y}^{(i)}) \right]$. Similarly, the cost function in the multiclass case becomes as follows:

$J(w) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} 1\{y^{(i)} = k\} \log \hat{y}_k^{(i)}$

Here, the indicator function $1\{y^{(i)} = k\}$ is 1 only if $y^{(i)} = k$ is true; otherwise, it is 0.
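
As a quick sanity check, the cost can be computed in a few lines of NumPy; this sketch (with made-up probabilities and labels) uses the fact that the indicator simply picks out the log probability of the true class for each sample:

>>> def cross_entropy_cost(probs, labels):
...     # probs: (m, K) predicted probabilities; labels: (m,) true class indices
...     m = len(labels)
...     # the indicator keeps only each sample's true-class log probability
...     return -np.log(probs[range(m), labels]).mean()
>>> probs = np.array([[0.7, 0.2, 0.1],
...                   [0.1, 0.8, 0.1]])
>>> labels = np.array([0, 1])
>>> cross_entropy_cost(probs, labels)   # -(log(0.7) + log(0.8)) / 2, about 0.29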

With the cost function defined, we obtain the step $\Delta w_j$ for the $j$-th weight vector in the same way we derived the step in the binary case:

$\Delta w_j = \eta \frac{1}{m} \sum_{i=1}^{m} x^{(i)} \left( 1\{y^{(i)} = j\} - \hat{y}_j^{(i)} \right)$
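
Here is a minimal sketch of one such update in NumPy, reusing the softmax_probs helper from the earlier sketch (eta is the learning rate; the function names are our own, not scikit-learn's):

>>> def update_weights(W, X, labels, eta):
...     # X: (m, n_features) samples; W: (K, n_features) weight vectors
...     m, K = len(X), len(W)
...     probs = np.array([softmax_probs(W, x) for x in X])   # shape (m, K)
...     # one-hot encoding of the labels implements the indicator 1{y_i = j}
...     indicator = np.eye(K)[labels]
...     # move every weight vector along its averaged gradient at once
...     return W + eta * (indicator - probs).T.dot(X) / m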

In a similar manner, all K weight vectors are updated in each iteration. After sufficient iterations, the learned weight vectors are then used to classify a new sample $x'$ as follows:

$y' = \arg\max_{k} \hat{y}_k = \arg\max_{k} P(y = k \mid x')$
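
In code, this rule is just an argmax over the predicted probabilities; a minimal sketch, again reusing softmax_probs:

>>> def predict(W, X):
...     probs = np.array([softmax_probs(W, x) for x in X])
...     # assign each sample to the class with the highest probability
...     return np.argmax(probs, axis=1)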

To have a better understanding of this, we will experiment with the news topic dataset that we worked on in Chapter 4, News Topic Classification with Support Vector Machine:

(Note that we will reuse the functions already defined in Chapter 4, News Topic Classification with Support Vector Machine, such as clean_text.)

>>> from sklearn.datasets import fetch_20newsgroups
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> data_train = fetch_20newsgroups(subset='train',
...                                 categories=None, random_state=42)
>>> data_test = fetch_20newsgroups(subset='test',
...                                categories=None, random_state=42)
>>> cleaned_train = clean_text(data_train.data)
>>> label_train = data_train.target
>>> cleaned_test = clean_text(data_test.data)
>>> label_test = data_test.target
>>> tfidf_vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
...                                    stop_words='english',
...                                    max_features=40000)
>>> term_docs_train = tfidf_vectorizer.fit_transform(cleaned_train)
>>> term_docs_test = tfidf_vectorizer.transform(cleaned_test)

We will now combine grid search to find the optimal multiclass logistic regression model:

>>> from sklearn.linear_model import SGDClassifier
>>> from sklearn.model_selection import GridSearchCV
>>> parameters = {'penalty': ['l2', None],
...               'alpha': [1e-07, 1e-06, 1e-05, 1e-04],
...               'eta0': [0.01, 0.1, 1, 10]}
>>> # loss='log_loss' (named 'log', with max_iter named n_iter, in older
>>> # scikit-learn versions) makes SGDClassifier a logistic regression model
>>> sgd_lr = SGDClassifier(loss='log_loss', learning_rate='constant',
...                        eta0=0.01, fit_intercept=True, max_iter=10)
>>> grid_search = GridSearchCV(sgd_lr, parameters,
...                            n_jobs=-1, cv=3)
>>> grid_search.fit(term_docs_train, label_train)
>>> print(grid_search.best_params_)
{'penalty': 'l2', 'alpha': 1e-07, 'eta0': 10}

To predict using the optimal model, use the following code:

>>> sgd_lr_best = grid_search.best_estimator_
>>> accuracy = sgd_lr_best.score(term_docs_test, label_test)
>>> print('The accuracy on testing set is: {0:.1f}%'.format(accuracy*100))
The accuracy on testing set is: 79.7%