Penalized regression

Maximum likelihood estimation runs into trouble when a regression model contains a large number of predictors, highly correlated predictors, or both: the estimates become highly variable, hard to interpret, and fail to deliver good predictive accuracy. Selecting only the most relevant subset of predictors is one way to address this, but subset selection has its own demerits. Penalized maximum likelihood estimation solves the problem in a different way and is the basis of penalized regression in data mining. There are two variants of penalized regression that we are going to discuss here:

  • Ridge regression: Ridge regression uses the L2 (quadratic) penalty, and its objective can be represented as follows:

    β̂_ridge = argmin_β { Σᵢ (yᵢ − xᵢᵀβ)² + λ Σⱼ βⱼ² }

    In comparison to the maximum likelihood estimates, both the L1 and the L2 penalties shrink the beta coefficients towards zero. This shrinkage reins in the overfitted coefficient values that arise from multicollinearity, a large number of predictors, or both, particularly when the number of observations is small.

    Taking the Cars93_1.csv dataset, we can fit the model and examine its results. The lambda parameter strikes the balance between the penalty and the fit of the log-likelihood function, so its choice is critical: if lambda is too small, the model may overfit the data and show high variance; if it is too large, the estimates become heavily biased.

         > #loading the library
         > library(glmnet)
         > #removing the missing values from the dataset
         > Cars93_2<-na.omit(Cars93_1)
         > #independent variables matrix
         > x<-as.matrix(Cars93_2[,-1])
         > #dependent variable matrix
         > y<-as.matrix(Cars93_2[,1])
         > #fitting the ridge regression model
         > mod<-glmnet(x,y,family = "gaussian",alpha = 0,lambda = 0.001)
         > #summary of the model
         > summary(mod)
    
                    Length Class     Mode
         a0          1     -none-    numeric
         beta       13     dgCMatrix S4
         df          1     -none-    numeric
         dim         2     -none-    numeric
         lambda      1     -none-    numeric
         dev.ratio   1     -none-    numeric
         nulldev     1     -none-    numeric
         npasses     1     -none-    numeric
         jerr        1     -none-    numeric
         offset      1     -none-    logical
         call        6     -none-    call
         nobs        1     -none-    numeric
    
         > #Making predictions
         > pred<-predict(mod,x,type = "link")
         > #estimating the error for the model.
         > mean((y-pred)^2)
         [1] 6.663406
    

    Given the preceding background, the fitted ridge model shows a mean squared error of about 6.66 on the training data. This is not the final iteration: the model needs to be refitted and tested several times on held-out (out-of-bag) subsamples before concluding its final accuracy.
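
    Rather than fixing lambda at an arbitrary value such as 0.001, its choice can also be guided by k-fold cross-validation. The following is a minimal, illustrative sketch (not part of the original walkthrough) that uses cv.glmnet from the glmnet package on the same x and y matrices; the seed, the number of folds, and the object names are arbitrary choices.

         > #cross-validating the ridge model over a grid of lambda values (illustrative sketch)
         > set.seed(123)
         > cv_mod<-cv.glmnet(x,y,family = "gaussian",alpha = 0,nfolds = 10)
         > #lambda giving the lowest mean cross-validated error
         > cv_mod$lambda.min
         > #refitting the ridge model at the selected lambda and checking the training error
         > mod_cv<-glmnet(x,y,family = "gaussian",alpha = 0,lambda = cv_mod$lambda.min)
         > mean((y-predict(mod_cv,newx = x))^2)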

  • Least Absolute Shrinkage and Selection Operator (LASSO): LASSO regression uses the L1 (absolute value) penalty (Tibshirani, 1997). The equation for LASSO regression can be expressed as follows:

    β̂_lasso = argmin_β { Σᵢ (yᵢ − xᵢᵀβ)² + λ Σⱼ |βⱼ| }

    As with ridge regression, the lambda parameter strikes the balance between the penalty and the fit of the log-likelihood function: too small a value and the model may overfit the data with high variance, too large a value and the estimates become biased. Let's implement the model on the Cars93_1.csv dataset and verify the results:

          > #loading the library
          > library(lars)
          > #removing the missing values from the dataset
          > Cars93_2<-na.omit(Cars93_1)
          > #independent variables matrix
          > x<-as.matrix(Cars93_2[,-1])
          > #dependent variable matrix
          > y<-as.matrix(Cars93_2[,1])
          > #fitting the LASSO regression model
          > model<-lars(x,y,type = "lasso")
          > #summary of the model
          > summary(model)
    
          LARS/LASSO
          Call: lars(x = x, y = y, type = "lasso")
             Df     Rss       Cp
          0   1 2215.17 195.6822
          1   2 1138.71  63.7148
          2   3  786.48  21.8784
          3   4  724.26  16.1356
          4   5  699.39  15.0400
          5   6  692.49  16.1821
          6   7  675.16  16.0246
          7   8  634.59  12.9762
          8   9  623.74  13.6260
          9   8  617.24  10.8164
          10  9  592.76   9.7695
          11 10  587.43  11.1064
          12 11  551.46   8.6302
          13 12  548.22  10.2275
          14 13  547.85  12.1805
          15 14  546.40  14.0000
    

    The type argument of the lars function from the Least Angle Regression (LARS) package lets us fit different variants of the model: "lasso", "lar", "forward.stagewise", or "stepwise". Using these we can create various models and compare the results, as sketched next.
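
    As an illustration only (this sketch is not part of the original walkthrough), the following fits two of the alternative variants on the same x and y matrices and compares the minimum Mallows' Cp reached along each path with that of the LASSO fit above; the object names are arbitrary.

          > #fitting alternative lars variants on the same data (illustrative sketch)
          > model_lar<-lars(x,y,type = "lar")
          > model_stage<-lars(x,y,type = "forward.stagewise")
          > #minimum Cp reached along each path, compared with the LASSO fit
          > min(model$Cp)
          > min(model_lar$Cp)
          > min(model_stage$Cp)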

    The summary output of the LASSO model shows the RSS and Cp along with the degrees of freedom at each step, so we need to find the best step, the one at which the RSS for the model is minimum:

       > #select the best step with minimum error
       > best_model<-model$df[which.min(model$RSS)]
       > best_model
       14
       > #Making predictions
       > pred<-predict(model,x,s=best_model,type = "fit")$fit
       > #estimating the error for the model.
       > mean((y-pred)^2)
       [1] 6.685669
    

The best model is available at step 14, where the RSS is minimum, and using that step we can generate predictions with the predict function; the mean squared error of the chosen model is about 6.69. The fitted model can be visualized as shown next. The vertical axis shows the standardized coefficients, and the horizontal axis shows the progression of the coefficients as the ratio of the sum of the absolute values of the current beta coefficients to the maximum of that sum, which is attained by the full OLS fit. When the shrinkage parameter of the LASSO is zero, the model corresponds to the OLS regression method; as the penalty parameter increases, the beta coefficients are pulled towards zero:
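
The plot itself can be produced with the plot method supplied by the lars package; a minimal sketch follows (xvar = "norm", the default, puts the |beta|/max|beta| ratio on the horizontal axis, and breaks = TRUE draws the vertical bars at the steps where variables enter or leave the model).

    > #plotting the coefficient paths of the fitted LASSO model (illustrative sketch)
    > plot(model,xvar = "norm",breaks = TRUE)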

[Figure: LASSO coefficient paths, showing standardized coefficients plotted against the |beta|/max|beta| ratio]

In the preceding graph, the vertical bars indicate the steps at which a variable enters the model or is pulled to zero; the bar labelled 15 marks the final step of the path. When the penalty parameter lambda is very small, the LASSO approaches OLS regression, and as lambda increases, fewer and fewer variables remain in the model, until for a sufficiently large lambda no variables are left at all.

L1 and L2 penalties are known as regularized regression methods. The L2 penalty (ridge) cannot zero out regression coefficients, so you end up keeping all of the coefficients in the model. The L1 penalty (LASSO), on the other hand, performs both parameter shrinkage and variable selection automatically, driving some coefficients exactly to zero.
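
To see this difference directly, the following illustrative sketch (not from the original text) fits both penalties with glmnet on the same x and y matrices at a single, arbitrarily chosen lambda and prints the coefficients side by side; the LASSO column will typically contain exact zeros, whereas the ridge column will not.

    > #ridge (alpha = 0) versus LASSO (alpha = 1) at the same, arbitrary lambda (illustrative sketch)
    > ridge_fit<-glmnet(x,y,family = "gaussian",alpha = 0,lambda = 0.5)
    > lasso_fit<-glmnet(x,y,family = "gaussian",alpha = 1,lambda = 0.5)
    > #coefficient estimates side by side; the LASSO sets some of them exactly to zero
    > round(cbind(ridge = as.vector(coef(ridge_fit)),lasso = as.vector(coef(lasso_fit))),3)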
