How ridge regression works

Ridge regression shrinks the regression coefficients by adding a penalty to the objective function that equals the sum of the squared coefficients, which in turn corresponds to the squared L2 norm of the coefficient vector:

$$\sum_{j=1}^{p} \beta_j^2 = \lVert \beta \rVert_2^2$$

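Hence, the ridge coefficients are defined as follows, where λ ≥ 0 controls the strength of the penalty:

$$\hat{\beta}^{\text{ridge}} = \underset{\beta}{\arg\min} \left\{ \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}$$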

The intercept has been excluded from the penalty to make the procedure independent of the origin chosen for the output variable. Otherwise, adding a constant to all output values would change all slope parameters rather than simply shifting the predictions by that constant.
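The effect of leaving the intercept unpenalized can be checked numerically. The following is a minimal NumPy sketch, not a reference implementation: the helper names `ridge_unpenalized_intercept` and `ridge_penalized_intercept`, the simulated data, and the values of `lam` and `shift` are all illustrative assumptions. It relies on the fact that fitting an unpenalized intercept is equivalent to centering the inputs and the output before solving for the slopes.

```python
import numpy as np


def ridge_unpenalized_intercept(X, y, lam):
    """Hypothetical helper: ridge fit that penalizes only the slopes.

    Centering X and y is equivalent to fitting an unpenalized intercept.
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    p = X.shape[1]
    slopes = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y_mean - x_mean @ slopes
    return intercept, slopes


def ridge_penalized_intercept(X, y, lam):
    """Hypothetical helper: ridge fit that penalizes the intercept as well."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coefs = np.linalg.solve(X1.T @ X1 + lam * np.eye(X1.shape[1]), X1.T @ y)
    return coefs[0], coefs[1:]


# Simulated data (assumption): inputs with non-zero means, linear signal plus noise
rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=50)
lam, shift = 10.0, 100.0

b0, slopes = ridge_unpenalized_intercept(X, y, lam)
b0_shifted, slopes_shifted = ridge_unpenalized_intercept(X, y + shift, lam)
_, pen_slopes = ridge_penalized_intercept(X, y, lam)
_, pen_slopes_shifted = ridge_penalized_intercept(X, y + shift, lam)

# Unpenalized intercept: slopes are unaffected, the intercept absorbs the shift
print("slopes unchanged:", np.allclose(slopes, slopes_shifted))
print("intercept moved by shift:", np.isclose(b0_shifted - b0, shift))
# Penalized intercept: the same shift of the output now alters the slopes
print("penalized slopes unchanged:", np.allclose(pen_slopes, pen_slopes_shifted))
```

The inputs are deliberately simulated with non-zero means: if the input columns were already centered, the intercept and slopes would decouple and the difference between the two variants would not show up in the slopes.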

Because the ridge solution is sensitive to the scale of the inputs, it is important to standardize them by subtracting from each input the corresponding mean and dividing the result by the input's standard deviation. There is also a closed-form solution for the ridge estimator that resembles the OLS case:

$$\hat{\beta}^{\text{ridge}} = \left( X^T X + \lambda I \right)^{-1} X^T y$$

The solution adds the scaled identity matrix λI to X^TX before inversion, which guarantees for any λ > 0 that the matrix to be inverted is non-singular, even if X^TX does not have full rank. This was one of the motivations for using this estimator when it was originally introduced.
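As a rough illustration of the closed-form solution, the sketch below standardizes a simulated input matrix whose fourth column is an exact linear combination of two others, so that X^TX is rank deficient, and then solves the regularized system. The function names `standardize` and `ridge_closed_form`, the data, and the choice `lam = 1.0` are assumptions made for this example only.

```python
import numpy as np


def standardize(X):
    """Subtract each column's mean and divide by its standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y for beta."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


rng = np.random.default_rng(0)
base = rng.normal(size=(100, 3))
# Fourth column duplicates information: X^T X cannot have full rank
X_raw = np.column_stack([base, base[:, 0] + base[:, 1]])
X = standardize(X_raw)
y = base @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=100)
y_centered = y - y.mean()  # centering y takes care of the unpenalized intercept

lam = 1.0
gram = X.T @ X
rank = np.linalg.matrix_rank(X)  # rank of X equals the rank of X^T X
min_eig = np.linalg.eigvalsh(gram + lam * np.eye(X.shape[1])).min()
beta = ridge_closed_form(X, y_centered, lam)

print("columns:", X.shape[1], "rank:", rank)
print("smallest eigenvalue of X^T X + lam * I:", min_eig)
print("ridge coefficients:", beta)
```

Using `np.linalg.solve` rather than forming the inverse explicitly is a common numerical choice. The eigenvalues of X^TX are non-negative, so adding λI pushes every one of them up to at least λ, which is why the system remains solvable here.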

The ridge penalty results in proportional shrinkage of all parameters. In the case of orthonormal inputs, the ridge estimates are just a scaled version of the least squares estimates, that is:

$$\hat{\beta}^{\text{ridge}} = \frac{\hat{\beta}^{\text{LS}}}{1 + \lambda}$$

Using the singular value decomposition (SVD) of the input matrix X, we can gain insight into how the shrinkage affects inputs in the more common case where they are not orthonormal. The SVD of the centered input matrix yields the principal components of the data (refer to the chapter on unsupervised learning), which capture uncorrelated directions in the column space of the data in descending order of variance.

Ridge regression shrinks the coefficients of input variables associated with low-variance directions in the data more than those of input variables aligned with high-variance directions. Hence, the implicit assumption of ridge regression is that the directions in the data that vary the most will be most influential or most reliable when predicting the output.
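The direction-dependent shrinkage can be made concrete with the SVD. The sketch below uses the standard result, not spelled out in the text above, that for X = UDV^T the ridge fit scales the component of y along each left singular vector u_j by d_j² / (d_j² + λ). The simulated data, the mixing matrix, and `lam = 50.0` are illustrative assumptions. The final lines cover the orthonormal special case, where every singular value equals one and the factor reduces to 1 / (1 + λ).

```python
import numpy as np

rng = np.random.default_rng(42)
n, p, lam = 200, 3, 50.0

# Correlated, centered inputs (assumption: simulated data with unequal variances)
mixing = np.array([[3.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.5, 0.2, 0.3]])
X = rng.normal(size=(n, p)) @ mixing
X = X - X.mean(axis=0)
y = rng.normal(size=n)
y = y - y.mean()

# Thin SVD: singular values d are returned in descending order
U, d, Vt = np.linalg.svd(X, full_matrices=False)
shrinkage = d**2 / (d**2 + lam)  # one factor per principal direction

# Ridge fitted values computed two ways: via the SVD and via the closed form
fitted_svd = U @ (shrinkage * (U.T @ y))
beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
fitted_direct = X @ beta

print("singular values:", d)
print("shrinkage factors:", shrinkage)
print("fits agree:", np.allclose(fitted_svd, fitted_direct))

# Orthonormal inputs: Q^T Q = I, so ridge is OLS scaled by 1 / (1 + lam)
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))
beta_ols_q = Q.T @ y
beta_ridge_q = np.linalg.solve(Q.T @ Q + lam * np.eye(p), Q.T @ y)
print("orthonormal case matches:", np.allclose(beta_ridge_q, beta_ols_q / (1 + lam)))
```

Because the factor d_j² / (d_j² + λ) increases with d_j, the directions with the smallest singular values, that is, the least variance, receive the smallest factors and are shrunk the most.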
