Penalized regression

Maximum likelihood estimation runs into trouble when a regression model contains a large number of predictors, highly correlated predictors, or both: the estimates become highly variable, hard to interpret, and fail to deliver good predictive accuracy. Selecting only the most relevant subset of predictors is one way to address this, but subset selection has its own demerits. Penalized maximum likelihood estimation solves the problem in a different way and is the basis of penalized regression in data mining. There are two variants of penalized regression that we are going to discuss here:

  • Ridge regression: Ridge regression uses the L2 (quadratic) penalty, and its objective can be represented as follows:

    β̂_ridge = argmin_β { Σᵢ (yᵢ − xᵢᵀβ)² + λ Σⱼ βⱼ² }

    In comparison to the maximum likelihood estimates, both the L1 and the L2 penalties shrink the beta coefficients towards zero. This shrinkage reins in the overfitted coefficient values that arise from multicollinearity, a large number of predictors, or both, particularly when the number of observations is small.

    Taking the Cars93_1.csv dataset, we can fit the model and examine its results. The lambda parameter strikes the balance between the penalty and the fit of the log-likelihood function, so its choice is critical: if lambda is too small, the model may overfit the data and show high variance; if it is too large, the estimates become heavily biased.

         > #loading the library
         > library(glmnet)
         > #removing the missing values from the dataset
         > Cars93_2<-na.omit(Cars93_1)
         > #independent variables matrix
         > x<-as.matrix(Cars93_2[,-1])
         > #dependent variable matrix
         > y<-as.matrix(Cars93_2[,1])
         > #fitting the ridge regression model
         > mod<-glmnet(x,y,family = "gaussian",alpha = 0,lambda = 0.001)
         > #summary of the model
         > summary(mod)
    
                    Length Class     Mode
         a0          1     -none-    numeric
         beta       13     dgCMatrix S4
         df          1     -none-    numeric
         dim         2     -none-    numeric
         lambda      1     -none-    numeric
         dev.ratio   1     -none-    numeric
         nulldev     1     -none-    numeric
         npasses     1     -none-    numeric
         jerr        1     -none-    numeric
         offset      1     -none-    logical
         call        6     -none-    call
         nobs        1     -none-    numeric
    
         > #Making predictions
         > pred<-predict(mod,x,type = "link")
         > #estimating the error for the model.
         > mean((y-pred)^2)
         [1] 6.663406
    

    Given the preceding background, the fitted ridge model shows a mean squared error of about 6.66 on the training data. This is not the final iteration: the model needs to be refitted and tested several times on held-out (out-of-bag) subsamples before concluding its final accuracy.
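
    Rather than fixing lambda at an arbitrary value such as 0.001, its choice can also be guided by k-fold cross-validation. The following is a minimal, illustrative sketch (not part of the original walkthrough) that uses cv.glmnet from the glmnet package on the same x and y matrices; the seed, the number of folds, and the object names are arbitrary choices.

         > #cross-validating the ridge model over a grid of lambda values (illustrative sketch)
         > set.seed(123)
         > cv_mod<-cv.glmnet(x,y,family = "gaussian",alpha = 0,nfolds = 10)
         > #lambda giving the lowest mean cross-validated error
         > cv_mod$lambda.min
         > #refitting the ridge model at the selected lambda and checking the training error
         > mod_cv<-glmnet(x,y,family = "gaussian",alpha = 0,lambda = cv_mod$lambda.min)
         > mean((y-predict(mod_cv,newx = x))^2)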

  • Least Absolute Shrinkage and Selection Operator (LASSO): LASSO regression uses the L1 (absolute value) penalty (Tibshirani, 1997). The equation for LASSO regression can be expressed as follows:

    β̂_lasso = argmin_β { Σᵢ (yᵢ − xᵢᵀβ)² + λ Σⱼ |βⱼ| }

    As with ridge regression, the lambda parameter strikes the balance between the penalty and the fit of the log-likelihood function: too small a value and the model may overfit the data with high variance, too large a value and the estimates become biased. Let's implement the model on the Cars93_1.csv dataset and verify the results:

          > #loading the library
          > library(lars)
          > #removing the missing values from the dataset
          > Cars93_2<-na.omit(Cars93_1)
          > #independent variables matrix
          > x<-as.matrix(Cars93_2[,-1])
          > #dependent variable matrix
          > y<-as.matrix(Cars93_2[,1])
          > #fitting the LASSO regression model
          > model<-lars(x,y,type = "lasso")
          > #summary of the model
          > summary(model)
    
          LARS/LASSO
          Call: lars(x = x, y = y, type = "lasso")
             Df     Rss       Cp
          0   1 2215.17 195.6822
          1   2 1138.71  63.7148
          2   3  786.48  21.8784
          3   4  724.26  16.1356
          4   5  699.39  15.0400
          5   6  692.49  16.1821
          6   7  675.16  16.0246
          7   8  634.59  12.9762
          8   9  623.74  13.6260
          9   8  617.24  10.8164
          10  9  592.76   9.7695
          11 10  587.43  11.1064
          12 11  551.46   8.6302
          13 12  548.22  10.2275
          14 13  547.85  12.1805
          15 14  546.40  14.0000
    

    The type argument of the lars function from the Least Angle Regression (LARS) package lets us fit different variants of the model: "lasso", "lar", "forward.stagewise", or "stepwise". Using these we can create various models and compare the results, as sketched next.
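
    As an illustration only (this sketch is not part of the original walkthrough), the following fits two of the alternative variants on the same x and y matrices and compares the minimum Mallows' Cp reached along each path with that of the LASSO fit above; the object names are arbitrary.

          > #fitting alternative lars variants on the same data (illustrative sketch)
          > model_lar<-lars(x,y,type = "lar")
          > model_stage<-lars(x,y,type = "forward.stagewise")
          > #minimum Cp reached along each path, compared with the LASSO fit
          > min(model$Cp)
          > min(model_lar$Cp)
          > min(model_stage$Cp)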

    The summary output of the LASSO model shows the RSS and Cp along with the degrees of freedom at each step, so we need to find the best step, the one at which the RSS for the model is minimum:

       > #select the best step with minimum error
       > best_model<-model$df[which.min(model$RSS)]
       > best_model
       14
       > #Making predictions
       > pred<-predict(model,x,s=best_model,type = "fit")$fit
       > #estimating the error for the model.
       > mean((y-pred)^2)
       [1] 6.685669
    

The best model is available at step 14, where the RSS is minimum, and using that step we can generate predictions with the predict function; the mean squared error of the chosen model is about 6.69. The fitted model can be visualized as shown next. The vertical axis shows the standardized coefficients, and the horizontal axis shows the progression of the coefficients as the ratio of the sum of the absolute values of the current beta coefficients to the maximum of that sum, which is attained by the full OLS fit. When the shrinkage parameter of the LASSO is zero, the model corresponds to the OLS regression method; as the penalty parameter increases, the beta coefficients are pulled towards zero:
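
The plot itself can be produced with the plot method supplied by the lars package; a minimal sketch follows (xvar = "norm", the default, puts the |beta|/max|beta| ratio on the horizontal axis, and breaks = TRUE draws the vertical bars at the steps where variables enter or leave the model).

    > #plotting the coefficient paths of the fitted LASSO model (illustrative sketch)
    > plot(model,xvar = "norm",breaks = TRUE)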

[Figure: LASSO coefficient paths, showing standardized coefficients plotted against the |beta|/max|beta| ratio]

In the preceding graph, the vertical bars indicate the steps at which a variable enters the model or is pulled to zero; the bar labelled 15 marks the final step of the path. When the penalty parameter lambda is very small, the LASSO approaches OLS regression, and as lambda increases, fewer and fewer variables remain in the model, until for a sufficiently large lambda no variables are left at all.

L1 and L2 penalties are known as regularized regression methods. The L2 penalty (ridge) cannot zero out regression coefficients, so you end up keeping all of the coefficients in the model. The L1 penalty (LASSO), on the other hand, performs both parameter shrinkage and variable selection automatically, driving some coefficients exactly to zero.
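
To see this difference directly, the following illustrative sketch (not from the original text) fits both penalties with glmnet on the same x and y matrices at a single, arbitrarily chosen lambda and prints the coefficients side by side; the LASSO column will typically contain exact zeros, whereas the ridge column will not.

    > #ridge (alpha = 0) versus LASSO (alpha = 1) at the same, arbitrary lambda (illustrative sketch)
    > ridge_fit<-glmnet(x,y,family = "gaussian",alpha = 0,lambda = 0.5)
    > lasso_fit<-glmnet(x,y,family = "gaussian",alpha = 1,lambda = 0.5)
    > #coefficient estimates side by side; the LASSO sets some of them exactly to zero
    > round(cbind(ridge = as.vector(coef(ridge_fit)),lasso = as.vector(coef(lasso_fit))),3)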
