How to do it...

In this recipe, we will focus on removing irrelevant features from our models, extracting the importance of each feature, and using recursive feature elimination to select features from a preliminary model (the selected features can later be used in a secondary model).

We will work with the Boston dataset, where the goal is to predict the median property price based on environmental variables, crime rates, and so on. We will use a random forest model and follow a manual approach to feature selection: train a model, extract each feature's importance, and build a final model:

  1. First, we load the Boston dataset, define the control and the tuning grid for the model, and train the model using the usual cross-validation (a short sketch for inspecting the full train object follows this step):
library(MASS)
library(caret)
control <- trainControl(method="repeatedcv", number=4, repeats=1)
tunegrid <- expand.grid(.mtry=c(2,3,4,5,6,7,8))
data <- Boston
result <- train(medv~., data=data, method="rf", metric="RMSE", tuneGrid=tunegrid,
                trControl=control, importance=TRUE)$finalModel
result

The following screenshot shows the model results:
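Note that the preceding call keeps only $finalModel and discards caret's cross-validation summary. The following is a minimal sketch (assuming the same data, control, and grid defined previously; the fit name is ours) that retains the full train object so the per-mtry RMSE can be inspected:

# Sketch: keep the full train object to inspect the cross-validated
# RMSE for every candidate mtry value
fit <- train(medv~., data=data, method="rf", metric="RMSE",
             tuneGrid=tunegrid, trControl=control, importance=TRUE)
fit$results[, c("mtry", "RMSE")]  # RMSE per mtry from repeated CV
fit$bestTune                      # the mtry value caret selected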

  2. After the model has been trained, we will extract the importance of each variable via the varImp() function. The same information can be obtained using the importance() function from the randomForest package (a sketch for sorting this table follows this step):
gbmImp <- varImp(result)
importance(result)

The following screenshot shows the variable importance table:
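If a sorted view is more convenient than the raw table, the importance matrix can be reordered directly. This is a small sketch, assuming the %IncMSE column produced for regression forests trained with importance=TRUE:

# Sketch: sort the importance matrix by mean decrease in accuracy (%IncMSE)
imp <- importance(result)
imp[order(imp[, "%IncMSE"], decreasing=TRUE), ]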

  3. The importances can also be plotted using the varImpPlot() function (a base-graphics alternative follows this step):
varImpPlot(result)

The following screenshot shows the variable importance plot:
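As an alternative to varImpPlot(), the same values can be drawn with base graphics; a quick sketch:

# Sketch: plot the sorted %IncMSE values with a base-R dot chart
imp <- importance(result)[, "%IncMSE"]
dotchart(sort(imp), xlab="%IncMSE")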

  4. The model can be rebuilt using just two features (the ones that rank highest in both plots). It is not surprising that a model with just two features performs almost as well as our full model with many extra variables. Nevertheless, what we really care about is which model yields the lowest RMSE, and it would be rare for a model with two features to outperform one with all 13 of them. A promising approach is to start with these two and keep adding more variables (see the sketch after this step):
tunegrid <- expand.grid(.mtry=c(1))
result <- train(medv~., data=data[,c("medv","rm","lstat")], method="rf",
                metric="RMSE", tuneGrid=tunegrid, trControl=control)$finalModel
result

The following screenshot shows model results for the two best variables:
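The forward approach suggested in the preceding step can be sketched as a simple loop. This is a hypothetical helper (the candidates and fit names are ours, not part of the recipe) that evaluates adding each remaining variable to rm and lstat and reports the best cross-validated RMSE:

# Sketch: try adding each remaining predictor to rm and lstat, one at a
# time, and report the lowest cross-validated RMSE for that candidate
candidates <- setdiff(names(data), c("medv", "rm", "lstat"))
for (v in candidates) {
  cols <- c("medv", "rm", "lstat", v)
  fit  <- train(medv~., data=data[, cols], method="rf", metric="RMSE",
                tuneGrid=expand.grid(.mtry=1:3), trControl=control)
  cat(sprintf("%-8s best RMSE: %.3f\n", v, min(fit$results$RMSE)))
}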

  5. The caret package already includes a function for this, which starts with a large model containing many features and recursively eliminates them. As we can see, the best model is actually the one with the full dataset! The print() statement returns the number of variables that were tested, along with some metrics and the chosen model. The predictors() function returns only the chosen variables. Here, we instruct rfe to try subsets of 1, 2, 3, 4, and 5 features; the full model of 13 features is evaluated as well (a sketch for refitting on the chosen subset follows this step):
control <- rfeControl(functions=rfFuncs, method="cv", number=10)
results <- rfe(data[, -14], data[[14]], sizes=1:5, rfeControl=control)
print(results)
predictors(results)

The following screenshot shows the result:

plot(results, type=c("g", "o"))

The following screenshot shows the resulting plot of RMSE (y-axis) against the number of variables (x-axis):
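Once rfe has chosen a subset, the winning variables can be fed back into a regular caret model. A minimal sketch (the chosen and final names are ours):

# Sketch: refit a random forest on exactly the predictors rfe selected
chosen <- predictors(results)
final  <- train(medv~., data=data[, c("medv", chosen)], method="rf",
                metric="RMSE", trControl=control)
final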
