Training a logistic regression model using stochastic gradient descent

In gradient descent-based logistic regression models, all training samples are used to update the weights in every single iteration. Hence, if the number of training samples is large, the whole training process becomes very time-consuming and computationally expensive, as we just witnessed in our last example.

Fortunately, a small tweak makes logistic regression suitable for large datasets. For each weight update, only one training sample is consumed, instead of the complete training set. The model moves a step based on the error calculated from that single training sample. Once all samples have been used, one iteration finishes. This advanced version of gradient descent is called stochastic gradient descent (SGD). Expressed in a formula, for each iteration, we do the following:
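Read off from the update_weights_sgd code that follows, the per-sample update can be written as shown here. This is a reconstruction under assumptions rather than a quoted formula: η stands for learning_rate, m for the number of training samples, and ŷ for the sigmoid prediction that compute_prediction is assumed to return.

```latex
\text{for } i = 1, \dots, m: \qquad
\mathbf{w} := \mathbf{w} + \eta \, \bigl( y^{(i)} - \hat{y}(\mathbf{x}^{(i)}) \bigr) \, \mathbf{x}^{(i)},
\qquad
\hat{y}(\mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^{T}\mathbf{x})}
```

Each pass over the m samples therefore performs m small weight updates, one per sample.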

SGD generally converges in far fewer iterations than gradient descent, which usually needs a large number of iterations because it makes only one weight update per pass over the training set.

To implement SGD-based logistic regression, we just need to slightly modify the update_weights_gd function:

>>> def update_weights_sgd(X_train, y_train, weights,
...                        learning_rate):
...     """ One weight update iteration: moving weights by one
...         step based on each individual sample
...     Args:
...     X_train, y_train (numpy.ndarray, training data set)
...     weights (numpy.ndarray)
...     learning_rate (float)
...     Returns:
...     numpy.ndarray, updated weights
...     """
...     for X_each, y_each in zip(X_train, y_train):
...         prediction = compute_prediction(X_each, weights)
...         weights_delta = X_each.T * (y_each - prediction)
...         weights += learning_rate * weights_delta
...     return weights
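To see the per-sample update in isolation, here is a minimal sketch that uses only the standard library. The names sigmoid, sgd_pass, X_toy, and y_toy are hypothetical and introduced only for illustration, and the sigmoid is assumed to match what compute_prediction returns. Unlike the NumPy version above, which updates weights in place, this sketch works on a copy.

```python
import math

def sigmoid(z):
    """Logistic function; assumed to match compute_prediction."""
    return 1.0 / (1.0 + math.exp(-z))

def sgd_pass(X, y, weights, learning_rate):
    """One pass over the data, updating the weights after every sample."""
    weights = list(weights)  # work on a copy, leave the caller's list alone
    for x_each, y_each in zip(X, y):
        z = sum(w * x for w, x in zip(weights, x_each))
        error = y_each - sigmoid(z)
        # Move every weight by one step based on this single sample
        weights = [w + learning_rate * error * x
                   for w, x in zip(weights, x_each)]
    return weights

# Hypothetical toy data: first column is a constant 1.0 acting as intercept
X_toy = [[1.0, 2.0], [1.0, -1.0], [1.0, 3.0], [1.0, -2.0]]
y_toy = [1, 0, 1, 0]
w_start = [0.0, 0.0]
w_after = sgd_pass(X_toy, y_toy, w_start, learning_rate=0.1)
print(w_after)
```

Four samples produce four weight updates in a single pass, whereas one gradient descent iteration over the same data would produce only one.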

The training function, now named train_logistic_regression_sgd, calls update_weights_sgd in each iteration:

>>> def train_logistic_regression_sgd(X_train, y_train, max_iter,
...                                   learning_rate, fit_intercept=False):
...     """ Train a logistic regression model via SGD
...     Args:
...     X_train, y_train (numpy.ndarray, training data set)
...     max_iter (int, number of iterations)
...     learning_rate (float)
...     fit_intercept (bool, with an intercept w0 or not)
...     Returns:
...     numpy.ndarray, learned weights
...     """
...     if fit_intercept:
...         intercept = np.ones((X_train.shape[0], 1))
...         X_train = np.hstack((intercept, X_train))
...     weights = np.zeros(X_train.shape[1])
...     for iteration in range(max_iter):
...         weights = update_weights_sgd(X_train, y_train, weights,
...                                      learning_rate)
...         # Check the cost for every 2 (for example) iterations
...         if iteration % 2 == 0:
...             print(compute_cost(X_train, y_train, weights))
...     return weights
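The following standard-library sketch mirrors the structure of train_logistic_regression_sgd on a tiny, made-up dataset, so that the behavior of the cost can be checked without the click-through data. All names here (predict_one, log_loss_cost, train_sgd, X_toy, y_toy) are hypothetical, and the cost is assumed to be the average log loss that compute_cost computes.

```python
import math

def predict_one(x, weights):
    """Sigmoid of the weighted sum for one sample."""
    z = sum(w * v for w, v in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def log_loss_cost(X, y, weights):
    """Average log loss over the data set."""
    total = 0.0
    for x_each, y_each in zip(X, y):
        p = predict_one(x_each, weights)
        total -= y_each * math.log(p) + (1 - y_each) * math.log(1.0 - p)
    return total / len(X)

def train_sgd(X, y, max_iter, learning_rate, fit_intercept=False):
    """Train by SGD and record the cost after every pass."""
    if fit_intercept:
        X = [[1.0] + list(row) for row in X]  # prepend the intercept column
    weights = [0.0] * len(X[0])
    costs = []
    for _ in range(max_iter):
        for x_each, y_each in zip(X, y):
            error = y_each - predict_one(x_each, weights)
            weights = [w + learning_rate * error * v
                       for w, v in zip(weights, x_each)]
        costs.append(log_loss_cost(X, y, weights))
    return weights, costs

# Hypothetical, linearly separable toy data with a single feature
X_toy = [[2.0], [-1.0], [3.0], [-2.0], [1.5], [-0.5]]
y_toy = [1, 0, 1, 0, 1, 0]
weights, costs = train_sgd(X_toy, y_toy, max_iter=10,
                           learning_rate=0.1, fit_intercept=True)
print(costs[0], costs[-1])
```

On data like this, the cost after the last pass should be lower than after the first, which is the same trend we look for in the printed costs of the real training run.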

Now, let's see how powerful SGD is. We work with 100,000 training samples and choose 10 as the number of iterations, 0.01 as the learning rate, and print out current costs every other iteration:

>>> import timeit
>>> start_time = timeit.default_timer()
>>> weights = train_logistic_regression_sgd(X_train_enc.toarray(),
...     Y_train, max_iter=10, learning_rate=0.01, fit_intercept=True)
0.4127864859625796
0.4078504597223988
0.40545733114863264
0.403811787845451
0.4025431351250833
>>> print("--- %0.3fs seconds ---" %
...       (timeit.default_timer() - start_time))
--- 40.690s seconds ---
>>> pred = predict(X_test_enc.toarray(), weights)
>>> print('Training samples: {0}, AUC on testing set: {1:.3f}'.format(
...     n_train, roc_auc_score(Y_test, pred)))
Training samples: 100000, AUC on testing set: 0.732

The training process finishes in about 40 seconds, and the model also performs better on the testing set than the previous one trained with gradient descent.

As usual, after successfully implementing the SGD-based logistic regression algorithm from scratch, we implement it again using the SGDClassifier class of scikit-learn:

>>> from sklearn.linear_model import SGDClassifier
>>> sgd_lr = SGDClassifier(loss='log_loss', penalty=None,
...              fit_intercept=True, max_iter=10,
...              learning_rate='constant', eta0=0.01)

Here, 'log_loss' for the loss parameter indicates that the cost function is log loss (scikit-learn releases before 1.1 spelled this 'log'), penalty is the regularization term to reduce overfitting that we will discuss further in the next section, max_iter is the number of iterations (it replaces the n_iter parameter of older releases), and the remaining two parameters mean the learning rate is 0.01 and stays unchanged during the course of training. It should be noted that the default learning_rate is 'optimal', where the learning rate slightly decreases as more and more updates are made. This can be beneficial for finding the optimal solution on large datasets.
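To make the difference between a constant and a decreasing learning rate concrete, here is a small sketch. It assumes the form that the scikit-learn documentation gives for the 'optimal' schedule, eta = 1.0 / (alpha * (t + t0)), where t is the update count and alpha is the regularization strength. In scikit-learn, t0 is chosen by a heuristic; the fixed t0 used here is a hypothetical stand-in, as are the function names.

```python
def constant_eta(t, eta0=0.01):
    """learning_rate='constant': the same step size at every update t."""
    return eta0

def optimal_eta(t, alpha=0.0001, t0=1000.0):
    """Shape of the 'optimal' schedule: shrinks as the update count t grows.

    t0 is a hypothetical fixed value; scikit-learn derives it heuristically.
    """
    return 1.0 / (alpha * (t + t0))

steps = [1, 1000, 100000]
constant_rates = [constant_eta(t) for t in steps]
optimal_rates = [optimal_eta(t) for t in steps]
print(constant_rates)
print(optimal_rates)
```

The constant schedule returns 0.01 at every step, while the decreasing schedule takes smaller and smaller steps as training proceeds.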

Now, train the model and test it:

>>> sgd_lr.fit(X_train_enc.toarray(), Y_train)
>>> pred = sgd_lr.predict_proba(X_test_enc.toarray())[:, 1]
>>> print('Training samples: {0}, AUC on testing set: {1:.3f}'.format(
...     n_train, roc_auc_score(Y_test, pred)))
Training samples: 100000, AUC on testing set: 0.734

Quick and easy!
