L2 penalty

The L2 penalty, also known as ridge regression, is similar in many ways to the L1 penalty, but instead of penalizing the sum of the absolute weights, it penalizes the sum of the squared weights. The penalty therefore grows quadratically, with larger (positive or negative) weights incurring a disproportionately greater penalty. In the context of neural networks, this is sometimes referred to as weight decay: if you examine the gradient of the regularized objective function, the penalty term contributes a component proportional to the weights themselves, so at every update the weights are multiplicatively decayed toward zero. As with the L1 penalty, biases or offsets are usually excluded from the penalty, although they could be included.
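
To make this multiplicative decay concrete, here is a minimal sketch (not from the original text) of a single gradient descent update under an L2 penalty; the weights, gradient, learning rate eta, and penalty lambda are all illustrative values:

## one gradient descent step on an L2-regularized loss
## (all values illustrative)
eta <- 0.1      # learning rate
lambda <- 0.5   # L2 penalty strength
w <- c(2, -3, 0.5)              # current weights
grad.loss <- c(0.4, -0.2, 0.1)  # gradient of the unpenalized loss at w

## the penalty 0.5 * lambda * sum(w^2) contributes lambda * w to the
## gradient, so each update first shrinks the weights multiplicatively
w.new <- w * (1 - eta * lambda) - eta * grad.loss
w.new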

From the perspective of a linear regression problem, the L2 penalty modifies the objective function being minimized from $(Y - XB)^T(Y - XB)$ to $(Y - XB)^T(Y - XB) + 0.5\lambda B^T B$. As with the L1 penalty, the L2 penalty can allow otherwise underdetermined problems to be solved, particularly when the covariance matrix of the predictors is singular. The reason is that the effect of the L2 penalty is essentially to increase the variance of each variable. In OLS, solving the normal equations for $B$ gives $B = (X^T X)^{-1} X^T y$, whereas minimizing the regularized objective function shown earlier gives $B = (X^T X + \lambda I)^{-1} X^T y$, where $I$ is the identity matrix.

Since $X^T X$ is (up to centering and scaling) the variance-covariance matrix of the design matrix, adding $\lambda I$ increases the diagonal elements but leaves the off-diagonal elements unchanged. That is, the variances are increased while the covariances stay the same, which shrinks the correlations (standardized covariances) towards zero. A sufficiently strong penalty will make otherwise singular covariance matrices uniquely estimable, and can also help stabilize estimates when there are strongly correlated predictors.
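
As a small illustration (not from the original text) of how adding $\lambda I$ rescues a singular problem, consider a design matrix containing an exactly collinear column; the data here are simulated purely for demonstration and named X.demo and y.demo so as not to overwrite the simulation data used below:

## simulate a design matrix with an exactly collinear column
set.seed(1)
X.demo <- cbind(x1 = rnorm(20), x2 = rnorm(20))
X.demo <- cbind(X.demo, x3 = X.demo[, 1] + X.demo[, 2])
y.demo <- X.demo %*% c(1, 2, 0) + rnorm(20)

## crossprod(X.demo) is singular, so the OLS solution fails:
## solve(crossprod(X.demo), crossprod(X.demo, y.demo))

## adding lambda to the diagonal makes the system invertible (ridge)
lambda <- 0.1
solve(crossprod(X.demo) + lambda * diag(ncol(X.demo)),
      crossprod(X.demo, y.demo))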

L2 penalty in action

To see how the L2 penalty works, we can use the same simulated linear regression problem we used for the L1 penalty. To fit a ridge regression model, we again use the glmnet() function from the glmnet package. As mentioned previously, this function can fit either the L1 or the L2 penalty, controlled by the alpha argument: when alpha = 1, it fits the lasso, and when alpha = 0, it fits ridge regression. This time, we choose alpha = 0. As before, we evaluate a range of lambda values and tune this hyperparameter automatically through cross-validation, using the cv.glmnet() function:

## ridge regression (alpha = 0) with lambda tuned by cross-validation
m.ridge.cv <- cv.glmnet(X[1:100, ], y[1:100], alpha = 0)
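
As an aside not in the original code, the object returned by cv.glmnet() stores the lambda value minimizing the cross-validated error (lambda.min) as well as the largest lambda within one standard error of that minimum (lambda.1se), and coef() can extract the coefficients at either value:

## lambda minimizing cross-validated error, and the largest lambda
## within one standard error of the minimum
m.ridge.cv$lambda.min
m.ridge.cv$lambda.1se

## coefficients at the error-minimizing lambda
coef(m.ridge.cv, s = "lambda.min")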

We plot the ridge regression object to see the error for a variety of lambda values:

plot(m.ridge.cv)
Figure 3.2: Cross-validated error across values of lambda for the ridge regression model

Although the shape differs from the lasso in that the error appears to asymptote at higher lambda values, it is still clear that, when the penalty gets too high, the cross-validated model error increases. As with the lasso, the ridge regression model seems to do well at very low lambda values, perhaps indicating that the L2 penalty does not do much to improve out-of-sample performance/generalizability here.

Finally, we can compare the OLS coefficients with those from the lasso and the ridge regression model:

## combine the OLS, lasso, and ridge coefficients side by side
cbind(
  OLS = coef(m.ols),
  Lasso = coef(m.lasso.cv)[,1],
  Ridge = coef(m.ridge.cv)[,1])


               OLS Lasso Ridge
(Intercept)  2.958  2.99 3.002
X[1:100, ]1 -0.082  1.41 0.958
X[1:100, ]2  2.239  0.71 0.964
X[1:100, ]3  0.602  0.51 0.924
X[1:100, ]4  1.235  1.17 0.949
X[1:100, ]5 -0.041  0.00 0.011

Although ridge regression does not shrink the coefficient for the fifth predictor to exactly zero, it is smaller in absolute value than the OLS estimate, and the remaining parameters are all slightly shrunken but quite close to their true values of 3, 1, 1, 1, 1, and 0.
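
One optional way (not shown in the original) to visualize this smooth shrinkage is to fit the full ridge path with glmnet() and plot the coefficients against log(lambda); unlike the lasso, the paths approach zero without ever reaching it exactly:

## full ridge regularization path on the same training data
m.ridge.path <- glmnet(X[1:100, ], y[1:100], alpha = 0)
plot(m.ridge.path, xvar = "lambda", label = TRUE)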

Weight decay (L2 penalty in neural networks)

Without knowing it, we have actually already seen regularization in action in Chapter 2, Training a Prediction Model. The neural network we trained using the caret and nnet packages used a weight decay of 0.10. We can investigate the effect of weight decay by varying it and tuning it using cross-validation. First, we load the data as before. Then we create a local cluster to run the cross-validation in parallel. Note that, as before, rather than loading the libraries directly, we need to source() the checkpoint.R file so that each of the workers in our cluster uses the same R package versions:

## same data as from previous chapter
digits.train <- read.csv("train.csv")

## convert to factor
digits.train$label <- factor(digits.train$label, levels = 0:9)

i <- 1:5000
digits.X <- digits.train[i, -1]
digits.y <- digits.train[i, 1]

## try various weight decays and number of iterations
## register backend so that different decays can be
## estimated in parallel
cl <- makeCluster(4)
clusterEvalQ(cl, {
  source("checkpoint.R")
})
registerDoSNOW(cl)

Next, we train a neural network on the digit classification task, varying the weight decay penalty between 0 (no penalty) and 0.10. We also loop through two settings for the maximum number of iterations: 100 and 150. Note that this code is computationally intensive and, depending on hardware, may take some time to run:

set.seed(1234)
digits.decay.m1 <- lapply(c(100, 150), function(its) {
  train(digits.X, digits.y,
        method = "nnet",
        tuneGrid = expand.grid(
          .size = c(10),
          .decay = c(0, .1)),
        trControl = trainControl(method = "cv", number = 5),
        MaxNWts = 10000,
        maxit = its)
})
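
Once the training runs complete, and if the cluster is not needed for later examples, its workers can be released (this cleanup step is implied rather than shown in the original code):

## shut down the local cluster when done
stopCluster(cl)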

Examining the results, we see that, when we limit training to only 100 iterations, the non-regularized model (Accuracy = 0.63) outperforms the regularized model (Accuracy = 0.60) based on the cross-validated results (although neither model is doing well in absolute terms on this data):

digits.decay.m1[[1]]
Neural Network 

5000 samples
 784 predictor
  10 classes: '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 4000, 3999, 4000, 4001, 4000 
Resampling results across tuning parameters:

  decay  Accuracy  Kappa  Accuracy SD  Kappa SD
  0.0    0.63      0.59   0.052        0.058   
  0.1    0.60      0.56   0.061        0.068   

Tuning parameter 'size' was held constant at a value of 10
Accuracy was used to select the optimal model using  the
 largest value.
The final values used for the model were size = 10 and decay = 0.

Next we can examine the model with 150 iterations and see whether the regularized or non-regularized model performs better:

digits.decay.m1[[2]]
Neural Network 

5000 samples
 784 predictor
  10 classes: '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 4002, 4000, 4000, 3999, 3999 
Resampling results across tuning parameters:

  decay  Accuracy  Kappa  Accuracy SD  Kappa SD
  0.0    0.65      0.61   0.049        0.055   
  0.1    0.66      0.62   0.071        0.078   

Tuning parameter 'size' was held constant at a value of 10
Accuracy was used to select the optimal model using  the
 largest value.
The final values used for the model were size = 10 and decay = 0.1.

Overall, the model with more iterations outperforms the model with fewer iterations, regardless of the regularization. However, comparing both models with 150 iterations, the regularized model is superior (Accuracy = 0.66) to the non-regularized model (Accuracy = 0.65), although here the difference is relatively small.

These results highlight the point that regularization tends to be most helpful for more complex models that have greater flexibility to fit (and overfit) the data, whereas for models that are already appropriately simple, or even overly simplistic, for the data, regularization may actually decrease performance. In the next section, we will discuss ensemble and model averaging techniques, the last forms of regularization we will highlight in this book.
