In Chapter 16, we considered various ways to measure model performance. Section 16.3 described k-fold cross-validation, a technique that estimates model performance by measuring how well the model predicts on held-out test data. This chapter explores regularization, one technique to improve performance on test data. Specifically, this method aims to prevent overfitting.
Let’s begin with a base case of linear regression. We will be using the ACS data.
import pandas as pd
acs = pd.read_csv('data/acs_ny.csv')
print(acs.columns)
Index(['Acres', 'FamilyIncome', 'FamilyType', 'NumBedrooms',
       'NumChildren', 'NumPeople', 'NumRooms', 'NumUnits',
       'NumVehicles', 'NumWorkers', 'OwnRent', 'YearBuilt',
       'HouseCosts', 'ElectricBill', 'FoodStamp', 'HeatingFuel',
       'Insurance', 'Language'],
      dtype='object')
Now, let’s create our design matrices using patsy.
from patsy import dmatrices
# sequential strings get concatenated together in Python
response, predictors = dmatrices(
    "FamilyIncome ~ NumBedrooms + NumChildren + NumPeople + "
    "NumRooms + NumUnits + NumVehicles + NumWorkers + OwnRent + "
    "YearBuilt + ElectricBill + FoodStamp + HeatingFuel + "
    "Insurance + Language",
    data=acs,
)
With our predictor and response matrices created, we can use sklearn
to split our data into training and testing sets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    predictors, response, random_state=0
)
Now, let’s fit our linear model. Here we standardize the data so that we can compare these coefficients with those from the regularization techniques we use later.
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
lr = make_pipeline(
    StandardScaler(with_mean=False), LinearRegression()
)
lr = lr.fit(X_train, y_train)
print(lr)
Pipeline(steps=[('standardscaler', StandardScaler(with_mean=False)),
                ('linearregression', LinearRegression())])
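Because the model is wrapped in a pipeline, each fitted step is accessible by name through named_steps; we use this below to pull out the regression coefficients. As a quick sketch, you can also use it to inspect the scale factors the StandardScaler learned (the printed values depend on your training data):
scaler = lr.named_steps["standardscaler"]
# per-feature scale factors learned from the training data
print(scaler.scale_[:5])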
model_coefs = pd.DataFrame(
    data=list(
        zip(
            predictors.design_info.column_names,
            lr.named_steps["linearregression"].coef_[0],
        )
    ),
    columns=["variable", "coef_lr"],
)
print(model_coefs)
variable coef_lr
0 Intercept 2.697159e-13
1 NumUnits[T.Single attached] 9.661755e+03
2 NumUnits[T.Single detached] 8.345408e+03
3 OwnRent[T.Outright] 2.382740e+03
4 OwnRent[T.Rented] 2.260806e+03
.. ... ...
34 NumRooms 1.340575e+04
35 NumVehicles 7.228920e+03
36 NumWorkers 1.877535e+04
37 ElectricBill 1.000008e+04
38 Insurance 3.072892e+04
[39 rows x 2 columns]
Now, we can look at our model scores. For regression models, the score() method returns the coefficient of determination, R².
# score on the _training_ data
print(lr.score(X_train, y_train))
0.2726140465638568
# score on the _testing_ data
print(lr.score(X_test, y_test))
0.26976979568488013
In this particular case, our model performs poorly on both the training and the testing data. In another scenario, we might see a high training score and a low test score, which is a sign of overfitting. Regularization addresses overfitting by putting a penalty on the size of the coefficients, shrinking them toward zero. In LASSO (least absolute shrinkage and selection operator) regression, some coefficients can be dropped entirely (i.e., become 0), whereas in ridge regression, coefficients approach 0 but are never dropped.
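For reference, these two penalties correspond to the following objective functions, written in scikit-learn's parameterization (here $n$ is the number of training samples and $\alpha$ controls the penalty strength):

$$\min_{\beta}\; \frac{1}{2n}\lVert y - X\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1 \quad \text{(LASSO, L1 penalty)}$$

$$\min_{\beta}\; \lVert y - X\beta\rVert_2^2 + \alpha\lVert\beta\rVert_2^2 \quad \text{(ridge, L2 penalty)}$$

The L1 penalty can push individual coefficients exactly to 0, which is why LASSO also performs variable selection; the L2 penalty only shrinks coefficients toward 0.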
The first regularization technique is LASSO. It is also known as regression with L1 regularization.
We will fit the same model as we did in our linear regression.
from sklearn.linear_model import Lasso
lasso = make_pipeline(
    StandardScaler(with_mean=False),
    Lasso(max_iter=10000, random_state=42),
)
# fit on the training data
lasso = lasso.fit(X_train, y_train)
print(lasso)
Pipeline(steps=[('standardscaler', StandardScaler(with_mean=False)),
                ('lasso', Lasso(max_iter=10000, random_state=42))])
Now, let’s get a dataframe of coefficients, and combine them with our linear regression results.
coefs_lasso = pd.DataFrame(
    data=list(
        zip(
            predictors.design_info.column_names,
            lasso.named_steps["lasso"].coef_.tolist(),
        )
    ),
    columns=["variable", "coef_lasso"],
)
model_coefs = pd.merge(model_coefs, coefs_lasso, on='variable')
print(model_coefs)
variable coef_lr coef_lasso
0 Intercept 2.697159e-13 0.000000
1 NumUnits[T.Single attached] 9.661755e+03 7765.482025
2 NumUnits[T.Single detached] 8.345408e+03 7512.067593
3 OwnRent[T.Outright] 2.382740e+03 2431.710977
4 OwnRent[T.Rented] 2.260806e+03 604.186925
.. ... ... ...
34 NumRooms 1.340575e+04 10940.150208
35 NumVehicles 7.228920e+03 7724.681161
36 NumWorkers 1.877535e+04 16911.035390
37 ElectricBill 1.000008e+04 9516.123582
38 Insurance 3.072892e+04 32155.544169
[39 rows x 3 columns]
Notice that the coefficients are generally smaller than their original linear regression values. Additionally, some of the coefficients are now exactly 0.
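As a quick check, we can count how many coefficients the L1 penalty shrank all the way to 0:
# number of LASSO coefficients that are exactly 0
print((model_coefs["coef_lasso"] == 0).sum())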
Finally, let’s look at our training and test data scores.
print(lasso.score(X_train, y_train))
0.2669751487716776
print(lasso.score(X_test, y_test))
0.2752627973740016
There isn’t much difference between the training and test scores here; with the default penalty strength, the regularized model predicts about as well on new, unseen data as it does on the data it was trained on.
Now let’s look at another regularization technique, ridge regression. It is also known as regression with L2 regularization.
Most of the code will be very similar to that seen with the previous methods. We will fit the model on our training data, and combine the results with our ongoing dataframe of results.
from sklearn.linear_model import Ridge
ridge = make_pipeline(
    StandardScaler(with_mean=False), Ridge(random_state=42)
)
ridge = ridge.fit(X_train, y_train)
print(ridge)
Pipeline(steps=[('standardscaler', StandardScaler(with_mean=False)),
                ('ridge', Ridge(random_state=42))])
coefs_ridge = pd.DataFrame(
    data=list(
        zip(
            predictors.design_info.column_names,
            ridge.named_steps["ridge"].coef_.tolist()[0],
        )
    ),
    columns=["variable", "coef_ridge"],
)
model_coefs = pd.merge(model_coefs, coefs_ridge, on="variable")
print(model_coefs)
variable coef_lr coef_lasso
0 Intercept 2.697159e-13 0.000000
1 NumUnits[T.Single attached] 9.661755e+03 7765.482025
2 NumUnits[T.Single detached] 8.345408e+03 7512.067593
3 OwnRent[T.Outright] 2.382740e+03 2431.710977
4 OwnRent[T.Rented] 2.260806e+03 604.186925
.. ... ... ...
34 NumRooms 1.340575e+04 10940.150208
35 NumVehicles 7.228920e+03 7724.681161
36 NumWorkers 1.877535e+04 16911.035390
37 ElectricBill 1.000008e+04 9516.123582
38 Insurance 3.072892e+04 32155.544169
coef_ridge
0 0.000000
1 9659.413514
2 8342.247690
3 2381.429615
4 2259.526329
.. ...
34 13405.409584
35 7228.542922
36 18773.079462
37 10000.853603
38 30727.230542
[39 rows x 4 columns]
Notice that the ridge coefficients are only slightly smaller than the linear regression values, and none of them (apart from the intercept) have been dropped to 0. The elastic net is a regularization technique that combines the ridge and LASSO penalties. Note that the elastic net here is fit directly on the design matrix, without the standardization step used in the previous models, so the scale of its coefficients is not directly comparable to the others.
from sklearn.linear_model import ElasticNet
en = ElasticNet(random_state=42).fit(X_train, y_train)
coefs_en = pd.DataFrame(
    list(zip(predictors.design_info.column_names, en.coef_)),
    columns=["variable", "coef_en"],
)
model_coefs = pd.merge(model_coefs, coefs_en, on="variable")
print(model_coefs)
variable coef_lr coef_lasso
0 Intercept 2.697159e-13 0.000000
1 NumUnits[T.Single attached] 9.661755e+03 7765.482025
2 NumUnits[T.Single detached] 8.345408e+03 7512.067593
3 OwnRent[T.Outright] 2.382740e+03 2431.710977
4 OwnRent[T.Rented] 2.260806e+03 604.186925
.. ... ... ...
34 NumRooms 1.340575e+04 10940.150208
35 NumVehicles 7.228920e+03 7724.681161
36 NumWorkers 1.877535e+04 16911.035390
37 ElectricBill 1.000008e+04 9516.123582
38 Insurance 3.072892e+04 32155.544169
coef_ridge coef_en
0 0.000000 0.000000
1 9659.413514 1342.291706
2 8342.247690 168.728479
3 2381.429615 445.533238
4 2259.526329 -600.673747
.. ... ...
34 13405.409584 5685.101939
35 7228.542922 6059.776166
36 18773.079462 12247.547800
37 10000.853603 97.566664
38 30727.230542 32.484207
[39 rows x 5 columns]
The ElasticNet object has two parameters, alpha and l1_ratio, that allow you to control the behavior of the model. The l1_ratio parameter controls the mix between the L1 and L2 penalties. If l1_ratio = 0, the model behaves like ridge regression; if l1_ratio = 1, it behaves like LASSO regression. Any value in between gives some combination of the two penalties.
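As a minimal sketch of setting these hyper-parameters explicitly (the alpha and l1_ratio values below are arbitrary choices for illustration, not tuned values):
# l1_ratio=0.5 splits the penalty evenly between L1 and L2;
# alpha sets the overall penalty strength
en_mixed = ElasticNet(alpha=1.0, l1_ratio=0.5, random_state=42)
en_mixed = en_mixed.fit(X_train, y_train)
print(en_mixed.score(X_test, y_test))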
Since LASSO regression can zero out coefficients, let’s compare the models on just the variables whose LASSO coefficients were turned into 0.
print(model_coefs.loc[model_coefs["coef_lasso"] == 0])
variable coef_lr coef_lasso coef_ridge
0 Intercept 2.697159e-13 0.0 0.000000
25 HeatingFuel[T.Solar] 1.442204e+02 0.0 142.354045
coef_en
0 0.000000
25 0.994142
Cross-validation (first described in Section 16.3) is a commonly used technique when fitting models. It was mentioned at the beginning of this chapter as a segue to regularization, but it is also a way to pick optimal parameters for regularization. Since the user must tune certain parameters (also known as hyper-parameters), cross-validation can be used to try out various combinations of these hyper-parameters and pick the “best” model. scikit-learn provides a companion estimator to ElasticNet, called ElasticNetCV, that iteratively fits the elastic net with various hyper-parameter values.1
1. ElasticNetCV documentation: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNetCV.html
from sklearn.linear_model import ElasticNetCV
en_cv = ElasticNetCV(cv=5, random_state=42).fit(
    X_train, y_train.ravel()  # ravel flattens y to 1d to avoid a shape warning
)
coefs_en_cv = pd.DataFrame(
    list(zip(predictors.design_info.column_names, en_cv.coef_)),
    columns=["variable", "coef_en_cv"],
)
model_coefs = pd.merge(model_coefs, coefs_en_cv, on="variable")
print(model_coefs)
variable coef_lr coef_lasso
0 Intercept 2.697159e-13 0.000000
1 NumUnits[T.Single attached] 9.661755e+03 7765.482025
2 NumUnits[T.Single detached] 8.345408e+03 7512.067593
3 OwnRent[T.Outright] 2.382740e+03 2431.710977
4 OwnRent[T.Rented] 2.260806e+03 604.186925
.. ... ... ...
34 NumRooms 1.340575e+04 10940.150208
35 NumVehicles 7.228920e+03 7724.681161
36 NumWorkers 1.877535e+04 16911.035390
37 ElectricBill 1.000008e+04 9516.123582
38 Insurance 3.072892e+04 32155.544169
coef_ridge coef_en coef_en_cv
0 0.000000 0.000000 0.000000
1 9659.413514 1342.291706 -0.000000
2 8342.247690 168.728479 0.000000
3 2381.429615 445.533238 0.000000
4 2259.526329 -600.673747 -0.000000
.. ... ... ...
34 13405.409584 5685.101939 0.028443
35 7228.542922 6059.776166 0.000000
36 18773.079462 12247.547800 0.000000
37 10000.853603 97.566664 26.166320
38 30727.230542 32.484207 38.561748
[39 rows x 6 columns]
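ElasticNetCV stores the hyper-parameter values it selected as fitted attributes, so we can inspect them after fitting (the printed values will depend on the data and the scikit-learn version):
# penalty strength and L1/L2 mix chosen by cross-validation
print(en_cv.alpha_)
print(en_cv.l1_ratio_)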
Let’s compare which coefficients were turned into 0 this time.
print(model_coefs.loc[model_coefs["coef_en_cv"] == 0])
variable coef_lr coef_lasso
0 Intercept 2.697159e-13 0.000000
1 NumUnits[T.Single attached] 9.661755e+03 7765.482025
2 NumUnits[T.Single detached] 8.345408e+03 7512.067593
3 OwnRent[T.Outright] 2.382740e+03 2431.710977
4 OwnRent[T.Rented] 2.260806e+03 604.186925
.. ... ... ...
31 NumBedrooms 3.755708e+03 4447.892458
32 NumChildren 9.524915e+03 6905.672216
33 NumPeople -1.153672e+04 -8777.265840
35 NumVehicles 7.228920e+03 7724.681161
36 NumWorkers 1.877535e+04 16911.035390
coef_ridge coef_en coef_en_cv
0 0.000000 0.000000 0.0
1 9659.413514 1342.291706 -0.0
2 8342.247690 168.728479 0.0
3 2381.429615 445.533238 0.0
4 2259.526329 -600.673747 -0.0
.. ... ... ...
31 3755.521256 2073.910045 0.0
32 9521.180875 2498.719581 0.0
33 -11533.098634 -2562.412933 0.0
35 7228.542922 6059.776166 0.0
36 18773.079462 12247.547800 0.0
[36 rows x 6 columns]
Regularization is a technique used to prevent overfitting. It achieves this goal by applying a penalty to each feature added to the model; the end result either drops variables from the model or shrinks their coefficients. Both approaches deliberately fit the training data a little less closely, in the hope of making better predictions on data the model has not seen before. The penalties can be combined (as in the elastic net), and their hyper-parameters can be tuned with cross-validation.