The importance of variables

During model-building exercises, datasets may contain tens of variables, not all of which add value to the predictive model. It is not uncommon to reduce the dataset to a subset of the variables, allowing the machine learning practitioner to devote more time to fine-tuning the chosen variables and the model-building process. There is also a technical justification for reducing the number of variables: machine learning on very large, that is, high-dimensional datasets can be very compute-intensive, requiring significant time, CPU, and RAM to perform the numerical operations. This not only makes certain algorithms impractical to apply, it also causes unwarranted delays. Hence, methodical variable selection helps both in terms of analysis time and the computational requirements of algorithmic analysis.

Variable selection is also known as feature selection or attribute selection. Algorithms such as random forests and lasso regression perform variable selection as part of their normal operation, but variable selection can also be carried out as a separate exercise.
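For instance, the lasso penalty shrinks the coefficients of uninformative predictors to exactly zero, effectively selecting variables while the model is being fit. The following is a minimal sketch of this idea using the glmnet package on the same PimaIndiansDiabetes data we use throughout; note that glmnet and the lambda.min choice are illustrative assumptions on my part and not part of this chapter's main example:

# Illustrative only: glmnet is not used elsewhere in this chapter
library(mlbench)
library(glmnet)

data("PimaIndiansDiabetes")

# glmnet expects a numeric matrix of predictors and a response vector
x <- model.matrix(diabetes ~ ., data = PimaIndiansDiabetes)[, -1]
y <- PimaIndiansDiabetes$diabetes

# Cross-validated lasso for a two-class outcome
set.seed(100)
lasso_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

# Coefficients shrunk to exactly zero correspond to variables
# the lasso has effectively dropped from the model
coef(lasso_fit, s = "lambda.min")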

The R package caret provides a very simple-to-use and intuitive interface for variable selection. As we haven't yet discussed the modeling process, we will learn here how to find the important variables, and delve deeper into the subject in the next chapter.
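As an aside, caret can also rank variables without fitting a model at all: its filterVarImp function scores each predictor individually (for a two-class outcome, by the area under the ROC curve of that predictor alone). A minimal sketch on the same data, offered as a side illustration rather than part of this chapter's example:

library(mlbench)
library(caret)

data("PimaIndiansDiabetes")

# Model-free importance: each predictor is scored on its own.
# For a two-class outcome this is the per-variable AUC.
predictors <- PimaIndiansDiabetes[, setdiff(names(PimaIndiansDiabetes), "diabetes")]
roc_imp <- filterVarImp(x = predictors, y = PimaIndiansDiabetes$diabetes)

# Sort by the first column to see the strongest individual predictors
roc_imp[order(-roc_imp[, 1]), , drop = FALSE]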

We'll use a common, well-known algorithm called random forest, which builds an ensemble of decision trees. The algorithm will be described in more detail in the next chapter; the purpose of using it here is merely to show how variable selection can be performed, and the example illustrates the general process.

We'll reuse the dataset we have been working with, that is, the PimaIndiansDiabetes data from the mlbench package. We haven't discussed the model training process yet, but it is used here to derive the variable importance values. The outcome variable in this case is diabetes, and the other variables are used as the independent variables. In other words, can we predict whether the person has diabetes using the data available:

# Load the required packages (the randomForest package must also be
# installed, as caret's method = "rf" uses it internally)
library(mlbench)
library(caret)

data("PimaIndiansDiabetes")
diab <- PimaIndiansDiabetes
 
# We will use the createDataPartition function from caret to split
# the data. The function produces a set of indices using which we
# will create the corresponding training and test sets

set.seed(100)  # for a reproducible split
training_index <- createDataPartition(diab$diabetes, p = 0.80, list = FALSE, times = 1)
 
# Create the training set
diab_train <- diab[training_index, ]

# Create the test set
diab_test <- diab[-training_index, ]
 
# Create the trainControl parameters for the model
diab_control <- trainControl(method = "repeatedcv", number = 3, repeats = 2, classProbs = TRUE, summaryFunction = twoClassSummary)
 
# Build the model
rf_model <- train(diabetes ~ ., data = diab_train, method = "rf", preProc = c("center", "scale"), tuneLength = 5, trControl = diab_control, metric = "ROC")
 
# Find the Variable Importance 
varImp(rf_model) 
rf variable importance

         Overall
glucose  100.000
mass      52.669
age       39.230
pedigree  24.885
pressure  12.619
pregnant   6.919
insulin    2.294
triceps    0.000

This indicates that glucose levels, body mass index, and age are the top three predictors of diabetes.

caret also includes several useful plot functions. We can visualize the variable importance using the command:

plot(varImp(rf_model))

The preceding command produces a plot of the variable importances. It indicates that glucose, mass, and age were the variables that contributed the most toward creating the model (to predict diabetes).
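To act on these results programmatically, the importance scores can be extracted from the varImp object and used to refit the model on only the top predictors. The following is a minimal sketch, assuming the rf_model, diab_train, and diab_control objects built above are available; the cutoff of three variables is an arbitrary choice for illustration:

# Extract the importance scores as a data frame
imp <- varImp(rf_model)$importance

# Rank the variables and keep the top three
top_vars <- rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:3]
top_vars   # "glucose" "mass" "age"

# Refit the model using only the selected variables
reduced_formula <- reformulate(top_vars, response = "diabetes")
rf_reduced <- train(reduced_formula, data = diab_train, method = "rf",
                    preProc = c("center", "scale"), tuneLength = 5,
                    trControl = diab_control, metric = "ROC")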
