How to do it...

The handy train() function in caret allows us to find the best hyperparameters for these models. In this recipe, we will use the longley dataset, which ships with R's built-in datasets package. It contains the number of people employed (our target variable) and several features (the number of unemployed, the year, the GNP, the GNP deflator, and the population). There are two challenges here: it only contains annual data from 1947 to 1962 (just 16 observations), and some of the variables are highly collinear.
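Before fitting anything, you can confirm the collinearity yourself with a quick check. The following is only a minimal sketch: the correlation matrix comes from base R, and caret's findCorrelation() helper is used with a purely illustrative cutoff of 0.9:

round(cor(longley), 2)                               # several pairwise correlations exceed 0.9
caret::findCorrelation(cor(longley), cutoff = 0.9)   # indices of columns caret would flag as redundant

With these challenges in mind, follow these steps: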

  1. Let's first load the dataset and build a simple lm() model that will serve as a reference. As can be seen below, three variables have low significance. This model is not very promising as it stands, because the collinear, nonsignificant variables inflate the variance of the coefficient estimates (and hence of the predictions). We set the seed to 100 for reproducibility purposes:
set.seed(100)
library(caret)
summary(lm(Employed ~ ., data = longley))
  2. We now set up a train control object that will be shared across the different models:
rctrl1 <- trainControl(method = "cv", number = 5)
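With only 16 annual observations, a single round of 5-fold cross-validation can be noisy. If you want more stable resampling estimates, repeated cross-validation (or leave-one-out) is a reasonable alternative. The control object below is optional and not used in the rest of the recipe; rctrl_rep is just an illustrative name:

rctrl_rep <- trainControl(method = "repeatedcv", number = 5, repeats = 5)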
  3. We now fit the same linear model through train() and focus our attention on the RMSE. Here we get an RMSE of 0.39, which will be our baseline:
ols_ <- train(Employed ~ ., data = longley, method = "lm",
              trControl = rctrl1, tuneLength = 4, metric = "RMSE",
              preProc = c("center", "scale"))

The following screenshot shows model results:
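If you are not following along with the screenshots, the same numbers can be read from the train object itself, for example:

print(ols_)          # resampling setup and cross-validated RMSE/R-squared
getTrainPerf(ols_)   # one-row summary of the performance of the selected model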

  4. We do the same exercise for the lasso model, and we get an RMSE of 0.369 for a fraction parameter equal to 0.54:
lasso_ <- train(Employed ~ ., data = longley, method = "lasso",
                trControl = rctrl1, tuneLength = 10, metric = "RMSE",
                preProc = c("center", "scale"))

The following screenshot shows Lasso results:
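The fraction value reported above is simply the tuning parameter that minimized the cross-validated RMSE; it is stored in the fitted object:

lasso_$bestTune   # the fraction value chosen by cross-validation
lasso_$results    # RMSE for each candidate value of fraction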

  5. Doing the same for ridge, we get an RMSE of 0.362. So far, this is the best model:
ridge_ <- train(Employed ~ ., data = longley, method = "ridge",
                trControl = rctrl1, tuneLength = 10, metric = "RMSE",
                preProc = c("center", "scale"))

The following screenshot shows Ridge results:
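For method = "ridge", the single tuning parameter is lambda (the weight decay); you can retrieve the selected value in the same way:

ridge_$bestTune   # lambda (weight decay) value chosen by cross-validation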

  6. Finally, for elastic net (method = "glmnet"), the RMSE is 0.34. We conclude that this is the best of the models we have tested:
elasticnet_ <- train(Employed ~ ., data = longley, method = "glmnet",
                     trControl = rctrl1, tuneLength = 10, metric = "RMSE",
                     preProc = c("center", "scale"))

Take a look at the following screenshot:
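Unlike the previous two models, glmnet tunes two hyperparameters at once: alpha (the mixing between ridge and lasso penalties) and lambda (the penalty strength). The selected combination and its performance can be extracted as before:

elasticnet_$bestTune        # alpha and lambda chosen by cross-validation
getTrainPerf(elasticnet_)   # cross-validated RMSE/R-squared of that combination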

  7. We can also use the varImp() function to get the relative importance of each variable. For example, let's see the importance for the elasticnet model (which is our best model here):
varImp(elasticnet_)

The following screenshot shows the variable importance vector:
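The importances are scaled from 0 to 100, and they can also be plotted, which is often easier to read than the raw vector:

plot(varImp(elasticnet_))   # dotplot of the scaled variable importances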
