Chapter 3. Preventing Overfitting

In the previous chapter, we learned how to train a basic neural network. We also saw the diminishing returns from further training iterations or from a larger neural network, in terms of its predictive ability on holdout or validation data that was not used to train the model. This highlights that, although a more complex model will almost always fit the data it was trained on better, it may not actually predict new data better. This chapter presents approaches, collectively called regularization, that can be used to prevent models from overfitting the data and so improve generalizability. More specifically, whereas models are typically trained by optimizing parameters in a way that reduces the training error, regularization is concerned with reducing testing or validation error so that the model performs well on new data as well as on the training data.

The first part of this chapter provides a conceptual overview of a variety of regularization strategies. The chapter closes with an example use case, applying regularization to improve out-of-sample model performance. It covers the following topics:

  • L1 penalty
  • L2 penalty
  • Ensembles and model averaging
  • Use case – improving out-of-sample model performance using dropout

L1 penalty

The basic concept of the L1 penalty, also known as the Least Absolute Shrinkage and Selection Operator, or lasso (Hastie, T., Tibshirani, R., and Friedman, J. (2009)), is that a penalty is used to shrink the weights towards zero. The penalty term uses the sum of the absolute weights, so the degree of penalization per unit of weight is the same for small and large weights; as a result, small weights may be shrunk all the way to zero, a convenient effect because, in addition to preventing overfitting, it acts as a form of variable selection. The strength of the penalty is controlled by a hyperparameter, λ, which multiplies the sum of the absolute weights; it can be set a priori or, as with other hyperparameters, optimized using cross-validation or some similar approach.

Mathematically, it is easiest to start with an Ordinary Least Squares (OLS) regression model. In regression, a set of coefficients or model weights is estimated using the least squares criterion: the weight/coefficient vector, B, is estimated so as to minimize (Y − XB)^T(Y − XB), where Y is the outcome or dependent variable and X is a k + 1 column design matrix, with k columns for the predictors and one constant column for the intercept (also sometimes called an offset). The difference between the observed outcomes and the predicted values (the design matrix post-multiplied by the weight vector) is the vector of errors or residuals. In this framework, one way to think about the L1 penalty is as a constrained estimator: the weight vector, B, is estimated subject to the constraint that the sum of the absolute weights is no larger than some user-chosen threshold, with smaller thresholds corresponding to stronger penalization.
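To make the least squares criterion concrete, its minimizer has the closed form B = (X^T X)^(-1) X^T Y, the so-called normal equations, which we can verify numerically against lm(). The data below are made up purely for illustration:

```r
## Illustrative check that the closed-form OLS solution from the
## normal equations matches lm(); toy data only
set.seed(42)
n <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 2 + 1.5 * x1 - 0.5 * x2 + rnorm(n, sd = 0.3)

X <- cbind(1, x1, x2)              # design matrix with a constant column
B <- solve(t(X) %*% X, t(X) %*% y) # solve the normal equations

## the two columns should agree up to numerical precision
cbind(NormalEq = drop(B), lm = coef(lm(y ~ x1 + x2)))
```
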

Typically, the intercept or offset term is excluded from this constraint (for example, by pre-centering all the data and dropping the intercept, or by applying the constraint only to the weights). Another way of viewing the L1 penalty is as a modification of the function being minimized, from (Y − XB)^T(Y − XB) to (Y − XB)^T(Y − XB) + λ||B||1, where ||B||1 denotes the sum of the absolute weights. If λ = 0, the L1 penalty reduces to the regular OLS estimator. The user may choose λ, or, more commonly, it is treated as a hyperparameter and optimized by evaluating a range of possible λ values (for example, through cross-validation). Although outside the scope of this book, the L1 penalty can also be viewed from a Bayesian perspective: the final posterior estimates are a function of the estimates from the data and the prior, and the shrinkage produced by the penalty term corresponds to placing a prior, centered on zero, with varying degrees of certainty on the weights. Technically, the parameters could be shrunk towards any arbitrary value, but they are almost always shrunk towards zero.
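The claim that λ = 0 recovers OLS is easy to check numerically. This sketch fits glmnet with a fixed lambda = 0 on toy data; note that glmnet is designed to fit a whole path of λ values, so a single-λ fit is only approximately equal to lm():

```r
## Sketch: the lasso with lambda = 0 approximately reproduces OLS
library(glmnet)

set.seed(7)
X <- matrix(rnorm(100 * 3), ncol = 3)
y <- X %*% c(1, -1, 0.5) + rnorm(100, sd = 0.2)

fit0 <- glmnet(X, y, alpha = 1, lambda = 0)  # no penalty at all

## compare against the unpenalized OLS coefficients
cbind(OLS = coef(lm(y ~ X)), Lasso0 = as.numeric(coef(fit0)))
```
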

Even if the theory behind why and how the L1 penalty works is not entirely clear, a number of practical implications are straightforward. First, the effect of the penalty depends on the size of the weights, and the size of the weights depends on the scale of the data; therefore, data is typically standardized to have unit variance first (or at least to make the variance of each variable equal). The L1 penalty has a tendency to shrink small weights to zero (for an explanation of why this happens, see Hastie, T., Tibshirani, R., and Friedman, J. (2009)). If you retain only the variables to which the L1 penalty assigns nonzero weights, it essentially functions as feature selection, which motivates the name Least Absolute Shrinkage and Selection Operator, or lasso. Even outside strict feature selection, the tendency of the L1 penalty to shrink small coefficients to zero can conveniently simplify the interpretation of model results.
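In practice, you rarely need to standardize by hand when using glmnet: its standardize argument (TRUE by default) scales each predictor internally and then reports coefficients back on the original scale. A minimal sketch on made-up data with wildly different predictor scales:

```r
## Sketch: glmnet standardizes predictors internally by default
library(glmnet)

set.seed(1)
X <- cbind(rnorm(100), rnorm(100, sd = 100))  # very different scales
y <- X[, 1] + 0.01 * X[, 2] + rnorm(100)

## standardize = TRUE (the default) scales each column internally
m1 <- cv.glmnet(X, y, alpha = 1)

## roughly equivalent to scaling the inputs yourself, though m2's
## coefficients are then on the standardized scale
m2 <- cv.glmnet(scale(X), y, alpha = 1, standardize = FALSE)
```
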

When the L1 penalty is viewed as constrained optimization, it is easy to see how it effectively limits the complexity of the model: even if many predictors are included, the sum of the absolute weights cannot exceed the defined threshold. One consequence is that, with the L1 penalty, it is actually possible to include more predictors than cases or observations, so long as the penalty term is sufficiently strong; the apparently over-parameterized model (by number of weights) becomes uniquely estimable through the constraint.
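This point is easy to demonstrate on made-up data: with n = 50 cases and p = 100 predictors, lm() cannot estimate all the coefficients, while the lasso still yields a sparse, estimable fit:

```r
## Sketch: more predictors (p = 100) than observations (n = 50)
library(glmnet)

set.seed(2)
n <- 50; p <- 100
X <- matrix(rnorm(n * p), ncol = p)
y <- X[, 1:3] %*% c(2, -2, 1) + rnorm(n)

## OLS is not identified: many coefficients come back NA
sum(is.na(coef(lm(y ~ X))))

## the lasso is estimable, and most weights are shrunk to exactly zero
m <- cv.glmnet(X, y, alpha = 1)
sum(coef(m) != 0)   # count of nonzero coefficients
```
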

With these basics of the L1 penalty covered, we will now briefly consider how it applies to neural networks, the main use case we are concerned with in this book. Let X represent our inputs, Y our outcome or dependent variable, B our parameters, and F the objective function that is optimized to obtain B; specifically, we minimize F(B; X, Y). In neural networks, the parameters are the biases or offsets (essentially the intercepts from regression) and the weights. The L1 penalty modifies the objective function to F(B; X, Y) + λ||w||1, where w represents the weights only (that is, the offsets are typically ignored). Taking the gradient, the additional term contributed by the penalty is λ * sign(w), which highlights that the gradient of the penalty has constant magnitude regardless of the size of the weight. This will be an important point of distinction from the L2 penalty, which we will discuss next. It is also part of why the L1 penalty tends to result in a sparse solution (that is, more zero weights): small and large weights incur equal gradient penalties, so at each gradient update the weights are pushed towards zero by a constant amount.
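This constant-magnitude gradient is exactly what produces sparsity. A minimal sketch of one update step on the penalty term alone (a soft-thresholding step, as used in proximal gradient methods; the helper function below is purely illustrative) shows that every weight moves towards zero by the same amount, with the zero-crossing clipped so that small weights land exactly at zero:

```r
## One soft-thresholding step on the L1 penalty alone (illustrative)
l1_step <- function(w, lambda, lr = 1) {
  shrunk <- abs(w) - lr * lambda   # every weight shrinks by the same amount
  sign(w) * pmax(shrunk, 0)        # clip at zero rather than overshooting
}

w <- c(-2.0, -0.05, 0.03, 1.5)
l1_step(w, lambda = 0.1)
## -1.9  0.0  0.0  1.4
## weights with |w| < 0.1 are set exactly to zero;
## larger weights move towards zero by 0.1
```
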

We have discussed λ as a single constant controlling the degree of penalization or regularization. However, it is possible to use different values for different parts of the model. Although this is not commonly done in a single-layer neural network (it is atypical to differentially regularize specific weights), it becomes more useful in deep neural networks, where varying degrees of regularization can be applied to different layers. One reason to consider such differential regularization is that it is sometimes desirable to allow a greater number of parameters (say, by including more neurons in a particular layer) and then counteract this somewhat through stronger regularization. That said, because these hyperparameters are typically optimized through cross-validation or other empirical techniques, allowing them to vary for every layer of a deep neural network can be quite computationally demanding, as the number of combinations to evaluate grows exponentially; most commonly, a single value is used across the entire model. After exploring the L1 penalty practically in R, we will move on to consider another common form of regularization, the L2 penalty.

L1 penalty in action

To see how the L1 penalty works, we can use a simulated linear regression problem. First, we will add the R package glmnet to the checkpoint.R file to load a reproducible version of the relevant library, as before; we also need the MASS package, which provides the mvrnorm() function used to simulate the data:

library(MASS)
library(glmnet)

Next we can simulate the data, using a purposefully pathologically correlated set of predictors:

set.seed(1234)

X <- mvrnorm(n = 200, mu = c(0, 0, 0, 0, 0),
  Sigma = matrix(c(
    1, .9999, .99, .99, .10,
    .9999, 1, .99, .99, .10,
    .99, .99, 1, .99, .10,
    .99, .99, .99, 1, .10,
    .10, .10, .10, .10, 1
  ), ncol = 5))

y <- rnorm(200, 3 + X %*% matrix(c(1, 1, 1, 1, 0)), .5)

Next, we can fit an OLS regression model to the first 100 cases, and then use the lasso. For the lasso, we use the glmnet() function from the glmnet package. This function can fit either the L1 penalty or the L2 penalty (discussed in the next section); which one is used is determined by the alpha argument. When alpha = 1, it is the L1 penalty (that is, the lasso), and when alpha = 0 it is the L2 penalty (that is, ridge regression). Further, because we do not know in advance which value of lambda to pick, we can evaluate a range of options and tune this hyperparameter automatically using cross-validation, by using the cv.glmnet() function:

m.ols <- lm(y[1:100] ~ X[1:100, ])

m.lasso.cv <- cv.glmnet(X[1:100, ], y[1:100], alpha = 1)

We can plot the lasso object to see the mean squared error for a variety of lambda values:

plot(m.lasso.cv)

Figure 3.1: Cross-validated mean squared error for a range of lambda values

One thing we can see from the graph is that, when the penalty gets too high, the cross-validated model error increases. Indeed, the lasso does well only at very low lambda values, which perhaps indicates that the lasso does not do much here to improve out-of-sample performance or generalizability. For the sake of this example we will continue, but in actual use this might give us pause to consider whether the lasso was really helping.
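The two vertical dashed lines that plot() draws correspond to the two candidate penalties a cv.glmnet object stores, and these can be extracted directly: lambda.min minimizes the cross-validated error, while lambda.1se is the largest λ within one standard error of that minimum. The sketch below re-creates a small fit so it stands alone:

```r
## Sketch: extracting the tuned penalty values from a cv.glmnet fit
library(glmnet)

set.seed(1234)
X <- matrix(rnorm(100 * 5), ncol = 5)
y <- X %*% c(1, 1, 1, 1, 0) + rnorm(100, sd = 0.5)

cv <- cv.glmnet(X, y, alpha = 1)

cv$lambda.min               # lambda with the lowest cross-validated MSE
cv$lambda.1se               # largest lambda within 1 SE of that minimum
coef(cv, s = "lambda.min")  # coefficients at the error-minimizing lambda
```
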

Finally, we can compare the OLS coefficients with those from the lasso:

cbind(
  OLS = coef(m.ols),
  Lasso = coef(m.lasso.cv)[,1])

               OLS Lasso
(Intercept)  2.958  2.99
X[1:100, ]1 -0.082  1.41
X[1:100, ]2  2.239  0.71
X[1:100, ]3  0.602  0.51
X[1:100, ]4  1.235  1.17
X[1:100, ]5 -0.041  0.00

Notice that the OLS coefficients are noisier and also that, in the lasso, predictor 5 is penalized to 0. Recall from the simulated data that the true coefficients are 3, 1, 1, 1, 1, and 0. The OLS estimates have much too low a value for the first predictor and much too high a value for the second, whereas the lasso has more accurate values for each.
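Since only the first 100 cases were used for fitting, the remaining 100 provide a natural holdout set on which to compare out-of-sample error. The sketch below re-runs the simulation and fits from above so that it stands alone:

```r
## Out-of-sample comparison on the held-out cases 101-200,
## re-running the simulation from above so this block stands alone
library(MASS)
library(glmnet)

set.seed(1234)
X <- mvrnorm(n = 200, mu = rep(0, 5),
             Sigma = matrix(c(
               1, .9999, .99, .99, .10,
               .9999, 1, .99, .99, .10,
               .99, .99, 1, .99, .10,
               .99, .99, .99, 1, .10,
               .10, .10, .10, .10, 1), ncol = 5))
y <- rnorm(200, 3 + X %*% c(1, 1, 1, 1, 0), .5)

m.ols <- lm(y[1:100] ~ X[1:100, ])
m.lasso.cv <- cv.glmnet(X[1:100, ], y[1:100], alpha = 1)

## predict the held-out cases and compare mean squared error
yhat.ols <- cbind(1, X[101:200, ]) %*% coef(m.ols)
yhat.lasso <- predict(m.lasso.cv, X[101:200, ])

c(OLS = mean((y[101:200] - yhat.ols)^2),
  Lasso = mean((y[101:200] - yhat.lasso)^2))
```

With pathologically correlated predictors such as these, the noisier OLS weights tend to carry over into noisier out-of-sample predictions, which is the behavior the penalty is meant to counteract.
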
