How to do it...

In this example, we will follow the same logic, but we will use alpha=1, which forces glmnet to fit a Lasso model. The coefficients are now penalized using the L1 norm, which means that some of them (the irrelevant ones) will be shrunk to exactly zero. For this reason, some data scientists use Lasso as a variable selection tool:
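
To make the "exactly zero" behaviour concrete, here is a minimal sketch in base R. It assumes the simplest textbook setting, a single standardized predictor (or an orthonormal design), where the Lasso estimate is the soft-thresholded least-squares coefficient and the Ridge estimate is a proportional shrinkage of it. The helper names soft_threshold and ridge_shrink are hypothetical and are not part of glmnet, whose penalty scaling also differs from this simplified form:

```r
# Lasso in the orthonormal case: shift towards zero by lambda and clip at zero
soft_threshold <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)

# Ridge in the orthonormal case: scale down, never exactly zero
ridge_shrink <- function(b, lambda) b / (1 + lambda)

b_ols <- c(2.0, 0.3, -1.5)   # illustrative least-squares coefficients
lasso_coefs <- soft_threshold(b_ols, 0.5)
ridge_coefs <- ridge_shrink(b_ols, 0.5)
print(lasso_coefs)
print(ridge_coefs)
```

The small coefficient (0.3) falls below the threshold and becomes exactly zero under the L1 penalty, whereas the L2 penalty only scales it down to 0.2.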

  1. We use the same code as before, but now with alpha=1:
library(MASS)
library(tidyr)
library(dplyr)    # provides group_by() and summarise()
library(ggplot2)
library(glmnet)

get_results <- function(lambda){
  coeffs_total = data.frame(V1=numeric(), V2=numeric(), V3=numeric(), V4=numeric(), V5=numeric())
  for (q in 1:100){
    # V2 and V4 are strongly correlated with V1 and V3; V5 does not enter Y
    V1 = runif(1000)*100
    V2 = runif(1000)*10 + V1
    V3 = runif(1000)*100
    V4 = runif(1000)*10 + V3
    V5 = runif(1000)*100
    Residuals = runif(1000)*100
    Y = V1 + V2 + V3 + V4 + Residuals
    # Ordinary least squares; element 1 is the intercept, so the slopes are 2:6
    coefs_lm <- lm(Y ~ V1 + V2 + V3 + V4 + V5)$coefficients
    # alpha=1 selects the Lasso (L1) penalty; $beta excludes the intercept
    coefs_rd <- glmnet(cbind(V1, V2, V3, V4, V5), Y, lambda=lambda, alpha=1)$beta
    frame1 <- data.frame(V1=coefs_lm[2], V2=coefs_lm[3], V3=coefs_lm[4],
                         V4=coefs_lm[5], V5=coefs_lm[6], method="lm")
    frame2 <- data.frame(V1=coefs_rd[1], V2=coefs_rd[2], V3=coefs_rd[3],
                         V4=coefs_rd[4], V5=coefs_rd[5], method="lasso")
    coeffs_total <- rbind(coeffs_total, frame1, frame2)
  }
  # Reshape to long format: one row per (variable, value) pair
  transposed_data = gather(coeffs_total, "variable", "value", 1:5)
  # A ggplot object is not drawn automatically inside a function, so print it
  print(ggplot(transposed_data, aes(x=variable, y=value, fill=method)) + geom_boxplot())
  print(transposed_data %>% group_by(variable, method) %>%
          summarise(median=median(value)))
}
  2. We now run the code with lambda=8. The coefficients are slightly smaller than those obtained with Ridge. More importantly, the coefficient of the irrelevant variable (V5) is now exactly zero. This is slightly better than Ridge, because Lasso is literally telling us to discard that variable from the model:
get_results(8) 

The following screenshot shows the boxplots for the (lambda=8) coefficients:

The following screenshot shows the medians for each coefficient (note that these are the values used for the previous plot):

  3. If we had used 0.1 instead of 8 for lambda, we would have obtained very similar results. This differs from Ridge, where different values of lambda gave very different results:
get_results(0.1) 

The following screenshot shows the boxplots for the (lambda=0.1) coefficients:
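
The variable-selection behaviour described in the steps above can also be checked on a single simulated dataset. The following is a minimal sketch, not part of the original recipe: it fits glmnet once with both lambda values and lists the variables whose coefficients are nonzero. The seed and the object names (fit, selected) are illustrative assumptions:

```r
library(glmnet)

set.seed(1)  # arbitrary seed, only for reproducibility
V1 <- runif(1000)*100
V2 <- runif(1000)*10 + V1
V3 <- runif(1000)*100
V4 <- runif(1000)*10 + V3
V5 <- runif(1000)*100            # irrelevant: does not enter Y
Y  <- V1 + V2 + V3 + V4 + runif(1000)*100
X  <- cbind(V1, V2, V3, V4, V5)

# alpha=1 requests the Lasso penalty; lambda is given in decreasing order
fit <- glmnet(X, Y, alpha=1, lambda=c(8, 0.1))

# fit$beta has one column per lambda; keep the names of the nonzero rows
selected <- lapply(seq_along(fit$lambda), function(j) {
  b <- as.matrix(fit$beta)[, j]
  names(b)[b != 0]
})
names(selected) <- paste0("lambda=", fit$lambda)
print(selected)
```

A variable that is missing from one of the printed lists has been dropped by the Lasso at that lambda. Which variables are dropped depends on the simulated data, so no output is shown here.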
