Training a logistic regression model with regularization

As we briefly mentioned in the previous section, the penalty parameter of the logistic regression SGDClassifier is related to model regularization. There are two basic forms of regularization, L1 (also called lasso) and L2 (also called ridge). Either way, the regularization is an additional term on top of the original cost function:
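In a standard formulation (assuming the log loss cost over m training samples, with predicted probability ŷ; the notation here is a sketch consistent with the description that follows), the regularized cost can be written as:

$$J(\mathbf{w}) = \frac{1}{m}\sum_{i=1}^{m} -\Big[\, y^{(i)} \log\big(\hat{y}(\mathbf{x}^{(i)})\big) + \big(1 - y^{(i)}\big) \log\big(1 - \hat{y}(\mathbf{x}^{(i)})\big) \Big] + \alpha \lVert \mathbf{w} \rVert^{q}$$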

Here, α is the constant that multiplies the regularization term, and q is either 1 or 2, representing L1 or L2 regularization respectively, where the following applies:
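Under the same notation, the regularization term over n weights can be written as:

$$\lVert \mathbf{w} \rVert^{q} = \sum_{j=1}^{n} \lvert w_{j} \rvert^{q}$$

so q = 1 sums the absolute weight values (L1), and q = 2 sums the squared weight values (L2).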

Training a logistic regression model is the process of reducing the cost as a function of the weights w. If it gets to a point where some weights, such as wi, wj, and wk, are considerably large, the whole cost will be dominated by these large weights. In this case, the learned model may just memorize the training set and fail to generalize to unseen data. The regularization term is introduced in order to penalize large weights, as the weights now become part of the cost to minimize; as a result, regularization reduces overfitting. Finally, the parameter α provides a trade-off between log loss and generalization. If α is too small, it cannot sufficiently suppress large weights and the model may suffer from high variance, or overfitting; on the other hand, if α is too large, the model becomes overly generalized and performs poorly in terms of fitting the dataset, which is a symptom of underfitting. α is an important parameter to tune in order to obtain the best logistic regression model with regularization.
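As a sketch of how this tuning could be done, the following uses cross-validated grid search; GridSearchCV, the candidate alpha values, and the ROC AUC scoring are illustrative choices on top of the X_train_enc and Y_train data prepared earlier (loss='log' matches the older scikit-learn API used in this chapter; newer versions name it 'log_loss'):

>>> from sklearn.linear_model import SGDClassifier
>>> from sklearn.model_selection import GridSearchCV
>>> # Candidate regularization strengths spanning several orders of magnitude
>>> param_grid = {'alpha': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]}
>>> sgd_lr = SGDClassifier(loss='log', penalty='l2', fit_intercept=True,
                           learning_rate='constant', eta0=0.01)
>>> # 3-fold cross-validated search over alpha, scored by ROC AUC
>>> grid_search = GridSearchCV(sgd_lr, param_grid, scoring='roc_auc', cv=3)
>>> grid_search.fit(X_train_enc.toarray(), Y_train)
>>> print(grid_search.best_params_)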

As for choosing between the L1 and L2 forms, the rule of thumb is whether feature selection is expected. In machine learning classification, feature selection is the process of picking a subset of significant features to use in constructing a better model. In practice, not every feature in a dataset carries information useful for discriminating samples; some features are either redundant or irrelevant, and hence can be discarded with little loss. In a logistic regression classifier, feature selection can only be achieved with L1 regularization. To understand this, consider two weight vectors, w1 = (1, 0) and w2 = (0.5, 0.5), and suppose they produce the same amount of log loss; the L1 and L2 regularization terms of each weight vector are then as follows:
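Filling in the arithmetic with the definition of the regularization term given above:

$$\lVert \mathbf{w}_1 \rVert^{1} = \lvert 1 \rvert + \lvert 0 \rvert = 1, \qquad \lVert \mathbf{w}_2 \rVert^{1} = \lvert 0.5 \rvert + \lvert 0.5 \rvert = 1$$

$$\lVert \mathbf{w}_1 \rVert^{2} = 1^2 + 0^2 = 1, \qquad \lVert \mathbf{w}_2 \rVert^{2} = 0.5^2 + 0.5^2 = 0.5$$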

The L1 terms of the two vectors are equal, while the L2 term of w2 is less than that of w1. This indicates that L2 regularization penalizes a weight vector that mixes significantly large and significantly small weights more heavily than L1 regularization does. In other words, L2 regularization favors relatively small values for all weights and avoids any weight becoming significantly large or small, while L1 regularization tolerates some significantly small weights and some significantly large ones. Only with L1 regularization can some weights be compressed to close to, or exactly, 0, which enables feature selection.
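To see this sparsity effect in isolation, here is a small self-contained sketch on synthetic data (the dataset, the alpha value, and the random seed are illustrative assumptions, not part of this chapter's example); it counts how many coefficients each penalty drives to exactly zero:

>>> from sklearn.datasets import make_classification
>>> from sklearn.linear_model import SGDClassifier
>>> # Synthetic data: 20 features, only 5 of which are informative
>>> X_toy, y_toy = make_classification(n_samples=1000, n_features=20,
                                       n_informative=5, random_state=42)
>>> lr_l1 = SGDClassifier(loss='log', penalty='l1', alpha=0.01,
                          random_state=42).fit(X_toy, y_toy)
>>> lr_l2 = SGDClassifier(loss='log', penalty='l2', alpha=0.01,
                          random_state=42).fit(X_toy, y_toy)
>>> # Count how many coefficients each penalty drives to exactly zero
>>> print((lr_l1.coef_ == 0).sum(), (lr_l2.coef_ == 0).sum())

Typically, the L1-penalized model zeroes out a number of the uninformative coefficients, while the L2-penalized model merely shrinks them.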

In scikit-learn, the regularization type can be specified through the penalty parameter, with the options none (without regularization), "l1", "l2", and "elasticnet" (a mixture of L1 and L2), and the multiplier α is specified through the alpha parameter.

We will now examine L1 regularization for feature selection.

Initialize an SGD logistic regression model with L1 regularization, and train the model on 10,000 samples:

>>> sgd_lr_l1 = SGDClassifier(loss='log', penalty='l1', alpha=0.0001,
                              fit_intercept=True, n_iter=10,
                              learning_rate='constant', eta0=0.01)
>>> sgd_lr_l1.fit(X_train_enc.toarray(), Y_train)

With the trained model, we obtain the absolute values of its coefficients:

>>> coef_abs = np.abs(sgd_lr_l1.coef_)
>>> print(coef_abs)
[[0. 0.09963329 0. ... 0. 0. 0.07431834]]

The bottom 10 coefficients and their values are printed as follows:

>>> print(np.sort(coef_abs)[0][:10])
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
>>> bottom_10 = np.argsort(coef_abs)[0][:10]

We can see what these 10 features are by using the following code:

>>> feature_names = enc.get_feature_names()
>>> print('10 least important features are: ',
feature_names[bottom_10])
10 least important features are:
['x0_1001' 'x8_851897aa' 'x8_85119990' 'x8_84ebbcd4' 'x8_84eb6b0e'
'x8_84dda655' 'x8_84c2f017' 'x8_84ace234' 'x8_84a9d4ba' 'x8_84915a27']

They are "1001" from column 0 (that is, the C1 column) in X_train, "851897aa" from column 8 (that is, the device_model column), and so on.
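If you want to map such an encoded name back to its source column programmatically, a small helper along the following lines could do it (this is a hypothetical convenience function, not part of the chapter's code; original_columns is assumed to hold the categorical column names in the order they were passed to the encoder):

>>> def decode_feature_name(name, original_columns):
...     # Names such as 'x8_851897aa' consist of 'x' + the input column
...     # index assigned by OneHotEncoder, '_', and the category value
...     prefix, value = name.split('_', 1)
...     return original_columns[int(prefix[1:])], value

For instance, decode_feature_name('x8_851897aa', original_columns) would return ('device_model', '851897aa'), provided original_columns[8] is device_model.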

Similarly, the top 10 coefficients and their values can be obtained as follows:

>>> print(np.sort(coef_abs)[0][-10:])
[0.67912376 0.70885933 0.79975917 0.8828797 0.98146351 0.98275124
1.08313767 1.13261091 1.18445527 1.40983505]
>>> top_10 = np.argsort(coef_abs)[0][-10:]
>>> print('10 most important features are: ', feature_names[top_10])
10 most important features are:
['x7_cef3e649' 'x3_7687a86e' 'x18_61' 'x18_15' 'x5_9c13b419'
'x5_5e3f096f' 'x2_763a42b5' 'x2_d9750ee7' 'x3_27e3c518'
'x5_1779deee']

They are "cef3e649" from the 7 column (that is app_category) in X_train, "7687a86e" from the third column (that is site_domain), and so on and so forth.
