17. Regularization

In Chapter 16, we considered various ways to measure model performance. Section 16.3 described k-fold cross-validation, a technique that estimates how well a model will perform by looking at how it predicts on held-out test data. This chapter explores regularization, one technique for improving performance on test data. Specifically, this method aims to prevent overfitting.

17.1 Why Regularize?

Let’s begin with a base case of linear regression. We will be using the ACS data.

import pandas as pd
acs = pd.read_csv('data/acs_ny.csv')
print(acs.columns)
Index(['Acres', 'FamilyIncome', 'FamilyType', 'NumBedrooms',
       'NumChildren', 'NumPeople', 'NumRooms', 'NumUnits',
       'NumVehicles', 'NumWorkers', 'OwnRent', 'YearBuilt',
       'HouseCosts', 'ElectricBill', 'FoodStamp', 'HeatingFuel',
       'Insurance', 'Language'],
      dtype='object')

Now, let’s create our design matrices using patsy.

from patsy import dmatrices
   
# adjacent string literals are concatenated in Python
response, predictors = dmatrices(
  "FamilyIncome ~ NumBedrooms + NumChildren + NumPeople + "
  "NumRooms + NumUnits + NumVehicles + NumWorkers + OwnRent + "
  "YearBuilt + ElectricBill + FoodStamp + HeatingFuel + "
  "Insurance + Language",
  data=acs,
)
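
If you want to see exactly what patsy built, the design matrix carries its own metadata. As a quick, optional check (output omitted here), you can look at its shape and column names; categorical variables such as OwnRent and HeatingFuel are dummy-coded into one column per level, which is why the model ends up with more coefficients than there are raw columns.

# optional: inspect the design matrix that patsy created
print(predictors.shape)
print(predictors.design_info.column_names)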

With our predictor and response matrices created, we can use sklearn to split our data into training and testing sets.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
  predictors, response, random_state=0
)

Now, let’s fit our linear model. Here we are scaling our data with StandardScaler so that we can compare these coefficients against the ones from the regularization techniques used later in the chapter.

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

lr = make_pipeline(
  StandardScaler(with_mean=False), LinearRegression()
)

lr = lr.fit(X_train, y_train)
print(lr)
Pipeline(steps=[('standardscaler', StandardScaler(with_mean=False)),
                ('linearregression', LinearRegression())])
model_coefs = pd.DataFrame(
  data=list(
    zip(
      predictors.design_info.column_names,
      lr.named_steps["linearregression"].coef_[0],
    )
  ),
  columns=["variable", "coef_lr"],
)

print(model_coefs)
                       variable      coef_lr
0                     Intercept 2.697159e-13
1   NumUnits[T.Single attached] 9.661755e+03
2   NumUnits[T.Single detached] 8.345408e+03
3           OwnRent[T.Outright] 2.382740e+03
4             OwnRent[T.Rented] 2.260806e+03
..                          ...          ...
34                     NumRooms 1.340575e+04
35                  NumVehicles 7.228920e+03
36                   NumWorkers 1.877535e+04
37                 ElectricBill 1.000008e+04
38                    Insurance 3.072892e+04

[39 rows x 2 columns]

Now, we can look at our model scores. For regression estimators, the .score() method returns the coefficient of determination, R².

# score on the _training_ data
print(lr.score(X_train, y_train))
0.2726140465638568
# score on the _testing_ data
print(lr.score(X_test, y_test))
0.26976979568488013

In this particular case, our model performs poorly on both the training and the test data. In another scenario, we might instead see a high training score and a low test score, which is a sign of overfitting. Regularization helps with overfitting by putting a constraint (penalty) on the coefficients, which shrinks them toward 0. In the case of LASSO (least absolute shrinkage and selection operator) regression, some coefficients can actually be dropped (i.e., become exactly 0), whereas in ridge regression, coefficients will approach 0, but are never dropped.
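
To make the penalties concrete, here is a sketch of the two objectives, ignoring the scaling constants that scikit-learn applies to the loss term. Both minimize the usual least-squares loss plus a penalty on the coefficients, with a hyper-parameter alpha controlling how strong the penalty is:

\hat{\beta}_{\text{lasso}} = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \alpha \sum_j \lvert \beta_j \rvert

\hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \alpha \sum_j \beta_j^2

The absolute-value (L1) penalty can push individual coefficients exactly to 0, whereas the squared (L2) penalty only shrinks them toward 0.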

17.2 LASSO Regression

The first type of regularization technique is called LASSO, which stands for least absolute shrinkage and selection operator. It is also known as regression with L1 regularization.

We will fit the same model as we did in our linear regression.

from sklearn.linear_model import Lasso

lasso = make_pipeline(
  StandardScaler(with_mean=False),
  Lasso(max_iter=10000, random_state=42),
)

lasso = lasso.fit(X_train, y_train)
print(lasso)
Pipeline(steps=[('standardscaler', StandardScaler(with_mean=False)),
                ('lasso', Lasso(max_iter=10000, random_state=42))])

Now, let’s get a dataframe of coefficients, and combine them with our linear regression results.

coefs_lasso = pd.DataFrame(
  data=list(
    zip(
      predictors.design_info.column_names,
      lasso.named_steps["lasso"].coef_.tolist(),
    )
  ),
  columns=["variable", "coef_lasso"],
)
model_coefs = pd.merge(model_coefs, coefs_lasso, on='variable')
print(model_coefs)
                       variable       coef_lr    coef_lasso
0                     Intercept  2.697159e-13      0.000000
1   NumUnits[T.Single attached]  9.661755e+03   7765.482025
2   NumUnits[T.Single detached]  8.345408e+03   7512.067593
3           OwnRent[T.Outright]  2.382740e+03   2431.710977
4             OwnRent[T.Rented]  2.260806e+03    604.186925
..                          ...           ...           ...
34                     NumRooms  1.340575e+04  10940.150208
35                  NumVehicles  7.228920e+03   7724.681161
36                   NumWorkers  1.877535e+04  16911.035390
37                 ElectricBill  1.000008e+04   9516.123582
38                    Insurance  3.072892e+04  32155.544169

[39 rows x 3 columns]

Notice that most of the coefficients have shrunk relative to their linear regression values. Additionally, some of the coefficients have been driven all the way to 0.

Finally, let’s look at our training and test data scores.

print(lasso.score(X_train, y_train))
0.2669751487716776
print(lasso.score(X_test, y_test))
0.2752627973740016

There isn’t much difference here, but notice that the gap between the training and test scores is small. That is, the model predicts about as well on new, unseen data as it does on the data it was fit on.

17.3 Ridge Regression

Now let’s look at another regularization technique, ridge regression. It is also known as regression with L2 regularization.

Most of the code will be very similar to that seen with the previous methods. We will fit the model on our training data, and combine the results with our ongoing dataframe of results.

from sklearn.linear_model import Ridge

ridge = make_pipeline(
    StandardScaler(with_mean=False), Ridge(random_state=42)
)

ridge = ridge.fit(X_train, y_train)
print(ridge)
Pipeline(steps=[('standardscaler', StandardScaler(with_mean=False)),
                ('ridge', Ridge(random_state=42))])
coefs_ridge = pd.DataFrame(
  data=list(
    zip(
      predictors.design_info.column_names,
      ridge.named_steps["ridge"].coef_.tolist()[0],
    )
  ),
  columns=["variable", "coef_ridge"],
)

model_coefs = pd.merge(model_coefs, coefs_ridge, on="variable")
print(model_coefs)
                       variable       coef_lr    coef_lasso 
0                     Intercept  2.697159e-13      0.000000
1   NumUnits[T.Single attached]  9.661755e+03   7765.482025
2   NumUnits[T.Single detached]  8.345408e+03   7512.067593
3           OwnRent[T.Outright]  2.382740e+03   2431.710977
4             OwnRent[T.Rented]  2.260806e+03    604.186925
..                          ...           ...           ...
34                     NumRooms  1.340575e+04  10940.150208
35                  NumVehicles  7.228920e+03   7724.681161
36                   NumWorkers  1.877535e+04  16911.035390
37                 ElectricBill  1.000008e+04   9516.123582
38                    Insurance  3.072892e+04  32155.544169
      coef_ridge
0       0.000000
1    9659.413514
2    8342.247690
3    2381.429615
4    2259.526329
..           ...
34  13405.409584
35   7228.542922
36  18773.079462
37  10000.853603
38  30727.230542

[39 rows x 4 columns]

17.4 Elastic Net

The elastic net is a regularization technique that combines the ridge and LASSO regression techniques.

from sklearn.linear_model import ElasticNet

en = ElasticNet(random_state=42).fit(X_train, y_train)

coefs_en = pd.DataFrame(
    list(zip(predictors.design_info.column_names, en.coef_)),
    columns=["variable", "coef_en"],
)

model_coefs = pd.merge(model_coefs, coefs_en, on="variable")
print(model_coefs)
                       variable       coef_lr    coef_lasso 
0                     Intercept  2.697159e-13      0.000000
1   NumUnits[T.Single attached]  9.661755e+03   7765.482025
2   NumUnits[T.Single detached]  8.345408e+03   7512.067593
3           OwnRent[T.Outright]  2.382740e+03   2431.710977
4             OwnRent[T.Rented]  2.260806e+03    604.186925
..                         ...            ...           ...
34                     NumRooms  1.340575e+04  10940.150208
35                  NumVehicles  7.228920e+03   7724.681161
36                   NumWorkers  1.877535e+04  16911.035390
37                 ElectricBill  1.000008e+04   9516.123582
38                    Insurance  3.072892e+04  32155.544169
      coef_ridge       coef_en
0       0.000000      0.000000
1    9659.413514   1342.291706
2    8342.247690    168.728479
3    2381.429615    445.533238
4    2259.526329   -600.673747
..           ...           ...
34  13405.409584   5685.101939
35   7228.542922   6059.776166
36  18773.079462  12247.547800
37  10000.853603     97.566664
38  30727.230542     32.484207

[39 rows x 5 columns]

The ElasticNet estimator has two main hyper-parameters, alpha and l1_ratio, that allow you to control the behavior of the model. The alpha parameter controls the overall strength of the penalty, and l1_ratio controls how much of the L2 or L1 penalty is used. If l1_ratio = 0, the model uses only the L2 penalty and behaves like ridge regression. If l1_ratio = 1, it uses only the L1 penalty and behaves like LASSO regression. Any value in between gives some combination of the ridge and LASSO penalties.
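
For reference, the scikit-learn documentation writes the elastic net objective as a weighted mix of the two penalties (sketched here; n is the number of samples):

\min_{\beta} \; \frac{1}{2n} \lVert y - X\beta \rVert_2^2 + \alpha \cdot \text{l1\_ratio} \cdot \lVert \beta \rVert_1 + \frac{\alpha \, (1 - \text{l1\_ratio})}{2} \lVert \beta \rVert_2^2

Setting l1_ratio to 0 leaves only the squared (ridge) penalty, and setting it to 1 leaves only the absolute-value (LASSO) penalty.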

Since LASSO regression can zero out coefficients, let’s look at how the other techniques’ coefficients compare for just the variables that LASSO turned into a 0.

print(model_coefs.loc[model_coefs["coef_lasso"] == 0])
                variable       coef_lr coef_lasso   coef_ridge 
0              Intercept  2.697159e-13        0.0     0.000000
25  HeatingFuel[T.Solar]  1.442204e+02        0.0   142.354045
     coef_en
0   0.000000
25  0.994142

17.5 Cross-Validation

Cross-validation (first described in Section 16.3) is a commonly used technique when fitting models. It was mentioned at the beginning of this chapter as a segue to regularization, but it is also a way to pick optimal parameters for regularization. Since the user must tune certain parameters (also known as hyper-parameters), cross-validation can be used to try out various combinations of these hyper-parameters and pick the “best” model. scikit-learn provides a companion estimator, ElasticNetCV, that iteratively fits the elastic net with various hyper-parameter values chosen by cross-validation.1

1. ElasticNetCV documentation: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNetCV.html

from sklearn.linear_model import ElasticNetCV

en_cv = ElasticNetCV(cv=5, random_state=42).fit(
    X_train, y_train.ravel()  # ravel flattens y to 1d to avoid a shape warning
)

coefs_en_cv = pd.DataFrame(
    list(zip(predictors.design_info.column_names, en_cv.coef_)),
    columns=["variable", "coef_en_cv"],
)

model_coefs = pd.merge(model_coefs, coefs_en_cv, on="variable")
print(model_coefs)
                       variable       coef_lr    coef_lasso 
0                     Intercept  2.697159e-13      0.000000
1   NumUnits[T.Single attached]  9.661755e+03   7765.482025
2   NumUnits[T.Single detached]  8.345408e+03   7512.067593
3           OwnRent[T.Outright]  2.382740e+03   2431.710977
4             OwnRent[T.Rented]  2.260806e+03    604.186925
..                          ...           ...           ...
34                     NumRooms  1.340575e+04  10940.150208
35                  NumVehicles  7.228920e+03   7724.681161
36                   NumWorkers  1.877535e+04  16911.035390
37                 ElectricBill  1.000008e+04   9516.123582
38                    Insurance  3.072892e+04  32155.544169
      coef_ridge       coef_en  coef_en_cv
0       0.000000      0.000000    0.000000
1    9659.413514   1342.291706   -0.000000
2    8342.247690    168.728479    0.000000
3    2381.429615    445.533238    0.000000
4    2259.526329   -600.673747   -0.000000
..           ...           ...         ...
34  13405.409584   5685.101939    0.028443
35   7228.542922   6059.776166    0.000000
36  18773.079462  12247.547800    0.000000
37  10000.853603     97.566664   26.166320
38  30727.230542     32.484207   38.561748

[39 rows x 6 columns]
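
Because ElasticNetCV fits the model across a range of penalty strengths, the fitted object records the values it settled on: the selected penalty strength is stored in the alpha_ attribute and the L1/L2 mix in l1_ratio_ (by default only alpha is searched, so l1_ratio_ stays at its default of 0.5). The exact values depend on the data, so the output is omitted here.

# hyper-parameters selected by cross-validation
print(en_cv.alpha_)
print(en_cv.l1_ratio_)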

Let’s see which coefficients were turned into 0 this time.

print(model_coefs.loc[model_coefs["coef_en_cv"] == 0])
                       variable       coef_lr   coef_lasso 
0                     Intercept  2.697159e-13     0.000000
1   NumUnits[T.Single attached]  9.661755e+03  7765.482025
2   NumUnits[T.Single detached]  8.345408e+03  7512.067593
3           OwnRent[T.Outright]  2.382740e+03  2431.710977
4             OwnRent[T.Rented]  2.260806e+03   604.186925
..                          ...           ...          ...
31                  NumBedrooms  3.755708e+03  4447.892458
32                  NumChildren  9.524915e+03  6905.672216
33                    NumPeople -1.153672e+04 -8777.265840
35                  NumVehicles  7.228920e+03  7724.681161
36                   NumWorkers  1.877535e+04 16911.035390
      coef_ridge       coef_en   coef_en_cv
0       0.000000      0.000000          0.0
1    9659.413514   1342.291706         -0.0
2    8342.247690    168.728479          0.0
3    2381.429615    445.533238          0.0
4    2259.526329   -600.673747         -0.0
..           ...           ...          ...
31   3755.521256   2073.910045          0.0
32   9521.180875   2498.719581          0.0
33 -11533.098634  -2562.412933          0.0
35   7228.542922   6059.776166          0.0
36  18773.079462  12247.547800          0.0
          
[36 rows x 6 columns]

Conclusion

Regularization is a technique used to prevent overfitting. It achieves this goal by applying a penalty to the size of the model’s coefficients. The end result either drops variables from the model or shrinks the model’s coefficients. Both techniques fit the training data a little less closely, with the goal of making better predictions on data that has not been seen before. These techniques can be combined (as seen in the elastic net), and cross-validation can be used to pick the hyper-parameter values that perform best.
