Training a logistic regression model via stochastic gradient descent

In gradient descent-based logistic regression models, all training samples are used to update the weights in every single iteration. Hence, if the number of training samples is large, the whole training process becomes very time-consuming and computationally expensive, as we just witnessed in our last example.

Fortunately, a small tweak will make logistic regression suitable for large datasets. For each weight update, only one training sample is consumed, instead of the complete training set. The model moves a step based on the error calculated from a single training sample. Once all samples have been used, one iteration finishes. This advanced version of gradient descent is called stochastic gradient descent (SGD). Expressed in a formula, for each iteration, we do the following:

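for i in 1 to m:
    $w := w + \eta \left( y^{(i)} - \hat{y}(x^{(i)}) \right) x^{(i)}$

Here, m is the number of training samples, w is the weight vector, $\eta$ is the learning rate, $x^{(i)}$ and $y^{(i)}$ are the features and label of the i-th sample, and $\hat{y}(x^{(i)})$ is the model's prediction for that sample.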

SGD generally converges in several iterations (usually fewer than 10), much faster than gradient descent, where a large number of iterations is usually needed.

To implement SGD-based logistic regression, we just need to slightly modify the update_weights_gd function:

>>> def update_weights_sgd(X_train, y_train, weights,
...                        learning_rate):
...     """ One weight update iteration: moving weights by one
...         step based on each individual sample
...     Args:
...         X_train, y_train (numpy.ndarray, training data set)
...         weights (numpy.ndarray)
...         learning_rate (float)
...     Returns:
...         numpy.ndarray, updated weights
...     """
...     for X_each, y_each in zip(X_train, y_train):
...         prediction = compute_prediction(X_each, weights)
...         weights_delta = X_each.T * (y_each - prediction)
...         weights += learning_rate * weights_delta
...     return weights
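
To make the inner loop concrete, here is a minimal sketch of a single-sample update worked by hand. The compute_prediction defined here is a stand-in that assumes the chapter's helper applies the sigmoid to the dot product of features and weights; the sample values and the learning rate of 0.1 are made up for illustration.

```python
import numpy as np

def compute_prediction(X, weights):
    # Stand-in for the chapter's helper (assumption): sigmoid of the linear score
    z = np.dot(X, weights)
    return 1.0 / (1.0 + np.exp(-z))

weights = np.zeros(2)              # start from all-zero weights
x_each = np.array([1.0, 2.0])      # one hypothetical training sample
y_each = 1                         # its label
learning_rate = 0.1                # illustrative value only

prediction = compute_prediction(x_each, weights)   # sigmoid(0) = 0.5
weights_delta = x_each.T * (y_each - prediction)   # error 0.5 scaled by each feature
weights += learning_rate * weights_delta           # one step in that direction
print(prediction, weights_delta, weights)
```

Note that for a one-dimensional array such as x_each, the .T transpose leaves the array unchanged, so the delta is simply the sample's features scaled by the prediction error.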

And in the train_logistic_regression function, just change the following line:

weights = update_weights_gd(X_train, y_train, weights, learning_rate)   

Into the following:

weights = update_weights_sgd(X_train, y_train, weights, learning_rate)   

Now let's see how powerful such a small change is. First, we work with 10 thousand training samples, choosing 5 as the number of iterations and 0.01 as the learning rate, and printing out the current cost every other iteration:

>>> start_time = timeit.default_timer()
>>> weights = train_logistic_regression(X_train_10k, y_train_10k,
...               max_iter=5, learning_rate=0.01, fit_intercept=True)
0.414965479133
0.406007112829
0.401049374518
>>> print("--- %0.3fs seconds ---" %
...       (timeit.default_timer() - start_time))
--- 1.007s seconds ---

The training process finishes in just a second! And it also performs better than previous models on the testing set:

>>> predictions = predict(X_test_10k, weights)
>>> print('The ROC AUC on testing set is: {0:.3f}'.format(
...       roc_auc_score(y_test, predictions)))
The ROC AUC on testing set is: 0.720

How about a larger training set of 100 thousand samples? Let's do that with the following:

>>> start_time = timeit.default_timer()
>>> weights = train_logistic_regression(X_train_100k,
...               y_train_100k, max_iter=5, learning_rate=0.01,
...               fit_intercept=True)
0.412786485963
0.407850459722
0.405457331149
>>> print("--- %0.3fs seconds ---" %
...       (timeit.default_timer() - start_time))
--- 24.566s seconds ---

And examine the classification performance on the next 10 thousand samples:

>>> X_dict_test, y_test_next10k = read_ad_click_data(10000, 100000)
>>> X_test_next10k = dict_one_hot_encoder.transform(X_dict_test)
>>> predictions = predict(X_test_next10k, weights)
>>> print('The ROC AUC on testing set is: {0:.3f}'.format(
...       roc_auc_score(y_test_next10k, predictions)))
The ROC AUC on testing set is: 0.736

Clearly, SGD-based models are far more efficient than gradient descent-based models: training on 100 thousand samples takes under 25 seconds.
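
The effect can also be observed without the ad click data. The following is a minimal sketch on synthetic data, not the chapter's implementation: the helper names (sigmoid, log_loss, gd_pass, sgd_pass), the data, and the true_w vector are all made up for illustration, and gd_pass assumes the batch update averages the gradient over all m samples. It runs five passes of each method with the same learning rate and prints the cost after every pass.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(X, y, w):
    # Average log loss over the dataset; clipping avoids log(0)
    p = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)
    return float(np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p)))

def gd_pass(X, y, w, lr):
    # Batch gradient descent: ONE update computed from all samples (assumed averaged)
    grad = X.T @ (y - sigmoid(X @ w))
    return w + lr / len(y) * grad

def sgd_pass(X, y, w, lr):
    # Stochastic gradient descent: one update PER sample, m updates per pass
    w = w.copy()
    for x_i, y_i in zip(X, y):
        w += lr * (y_i - sigmoid(x_i @ w)) * x_i
    return w

# Synthetic binary classification data (hypothetical, for illustration only)
rng = np.random.default_rng(42)
m, n = 2000, 5
X = rng.normal(size=(m, n))
true_w = np.array([1.5, -2.0, 0.5, 1.0, -1.0])
y = (rng.random(m) < sigmoid(X @ true_w)).astype(float)

w_gd = np.zeros(n)
w_sgd = np.zeros(n)
for epoch in range(5):
    w_gd = gd_pass(X, y, w_gd, lr=0.01)
    w_sgd = sgd_pass(X, y, w_sgd, lr=0.01)
    print(epoch + 1, round(log_loss(X, y, w_gd), 4),
          round(log_loss(X, y, w_sgd), 4))
```

In each pass, the batch version moves the weights once, whereas the stochastic version moves them m times, which is why its cost drops much further within the same number of passes.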

As usual, after successfully implementing the SGD-based logistic regression algorithm from scratch, we implement it using scikit-learn's SGDClassifier class:

>>> from sklearn.linear_model import SGDClassifier
>>> sgd_lr = SGDClassifier(loss='log', penalty=None,
...                        fit_intercept=True, n_iter=5,
...                        learning_rate='constant', eta0=0.01)

Here, 'log' for the loss parameter indicates that the cost function is log loss; penalty is the regularization term to reduce overfitting, which we will discuss further in the next section; n_iter is the number of iterations; and the remaining two parameters mean that the learning rate is 0.01 and stays unchanged during the course of training. Note that the default learning_rate is 'optimal', where the learning rate slightly decreases as more and more updates are taken. This can be beneficial for finding the optimal solution on large datasets. In recent scikit-learn releases, the n_iter parameter has been replaced by max_iter, and the 'log' loss has been renamed 'log_loss'.
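
To illustrate the difference between a constant and a decreasing learning rate, here is a small sketch in plain Python. It assumes the decaying form 1 / (alpha * (t + t0)) that the scikit-learn documentation gives for the 'optimal' schedule, where t is the number of updates taken so far. The function names are hypothetical, and the t0 value is a made-up constant: scikit-learn derives its own t0 from a heuristic that is not reproduced here.

```python
def constant_rate(eta0, t):
    # learning_rate='constant': the same step size at every update
    return eta0

def decaying_rate(alpha, t0, t):
    # Decreasing schedule of the form 1 / (alpha * (t + t0))
    return 1.0 / (alpha * (t + t0))

alpha = 0.0001   # SGDClassifier's default alpha
t0 = 1000.0      # hypothetical offset, chosen only for illustration

# Compare both schedules as the number of updates t grows
for t in (1, 10, 100, 1000, 10000):
    print(t, constant_rate(0.01, t), decaying_rate(alpha, t0, t))
```

The constant schedule keeps every step the same size, while the decaying one takes large steps early on and progressively smaller ones as training proceeds.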

Now train the model and test it:

>>> sgd_lr.fit(X_train_100k, y_train_100k)
>>> predictions = sgd_lr.predict_proba(X_test_next10k)[:, 1]
>>> print('The ROC AUC on testing set is: {0:.3f}'.format(
...       roc_auc_score(y_test_next10k, predictions)))
The ROC AUC on testing set is: 0.735

Quick and easy!
