Building an attrition prediction model with stacking

Let's build an attrition prediction model with stacking:

# loading the required libraries and registering the CPU cores for multiprocessing
library(doMC)
library(caret)
library(caretEnsemble)
registerDoMC(cores=4)
# setting the working directory and loading the dataset
setwd("~/Desktop/chapter 15")
mydata <- read.csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
# removing the non-discriminatory features from the dataset, as identified in the EDA step
mydata$EmployeeNumber <- NULL
mydata$Over18 <- NULL
mydata$EmployeeCount <- NULL
mydata$StandardHours <- NULL
# setting up the control parameters for cross-validation
control <- trainControl(method="repeatedcv", number=10, repeats=10, savePredictions=TRUE, classProbs=TRUE)
# declaring the ML algorithms to use in stacking
algorithmList <- c('C5.0', 'nb', 'glm', 'knn', 'svmRadial')
# setting the seed to ensure reproducibility of the results
set.seed(10000)
# creating the stacking model
models <- caretList(Attrition~., data=mydata, trControl=control, methodList=algorithmList)
# obtaining the stacking model results and printing them
results <- resamples(models)
summary(results)

This will result in the following output:

summary.resamples(object = results) 

Models: C5.0, nb, glm, knn, svmRadial
Number of resamples: 100

Accuracy
               Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
C5.0      0.8082192 0.8493151 0.8639456 0.8625833 0.8775510 0.9054054    0
nb        0.8367347 0.8367347 0.8378378 0.8387821 0.8424658 0.8435374    0
glm       0.8299320 0.8639456 0.8775510 0.8790444 0.8911565 0.9387755    0
knn       0.8027211 0.8299320 0.8367347 0.8370763 0.8438017 0.8630137    0
svmRadial 0.8287671 0.8648649 0.8775510 0.8790467 0.8911565 0.9319728    0

Kappa
                 Min.    1st Qu.     Median      Mean   3rd Qu.      Max. NA's
C5.0       0.03992485 0.29828006 0.37227344 0.3678459 0.4495049 0.6112590    0
nb         0.00000000 0.00000000 0.00000000 0.0000000 0.0000000 0.0000000    0
glm        0.26690604 0.39925723 0.47859218 0.4673756 0.5218094 0.7455280    0
knn       -0.05965697 0.02599388 0.06782465 0.0756081 0.1320451 0.2431312    0
svmRadial  0.24565000 0.38667527 0.44195662 0.4497538 0.5192393 0.7423764    0
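Before examining correlations, it can help to visualize these resampling distributions side by side. caret provides lattice-based plot methods for resamples objects, so a brief optional addition is possible here:

# optional: visualizing the base learners' resampling distributions side by
# side (lattice is attached along with caret, so these work directly)
bwplot(results)
dotplot(results)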

# Identifying the correlation between results
modelCor(results)

This will result in the following output:

[Correlation table of the resampled predictions from the five base models]
We can see from the correlation table that none of the individual algorithms' predictions are highly correlated. Very high correlation would mean that the algorithms produced very similar predictions, and combining very similar predictions is unlikely to yield a significant benefit over simply accepting one of the individual predictions.
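Had any of the base learners produced highly correlated predictions, we could have pruned the redundant ones before stacking. Here is a minimal sketch, not part of the original example, using caret's findCorrelation() with an arbitrary illustrative cutoff of 0.75:

# a minimal sketch (not part of the original example): dropping base learners
# whose resampled predictions are highly correlated before stacking
cor_matrix <- modelCor(results)
# findCorrelation() flags the columns of a correlation matrix to remove;
# the 0.75 cutoff is an arbitrary illustrative threshold
to_drop <- findCorrelation(cor_matrix, cutoff = 0.75, names = TRUE)
# keep only the remaining learners (note: plain list subsetting may drop the
# caretList class in some caretEnsemble versions)
models_pruned <- models[setdiff(names(models), to_drop)]

Since none of our base learners cross any such threshold, we can move straight to the next step of stacking the predictions: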

# setting up the cross-validation control parameters for stacking the predictions from the individual ML algorithms
stackControl <- trainControl(method="repeatedcv", number=10, repeats=10, savePredictions=TRUE, classProbs=TRUE)
# stacking the predictions of the individual ML algorithms using a generalized linear model
stack.glm <- caretStack(models, method="glm", trControl=stackControl)
# printing the stacked model's results
print(stack.glm)

This will result in the following output:

A glm ensemble of 5 base models: C5.0, nb, glm, knn, svmRadial

Ensemble results:
Generalized Linear Model

14700 samples
    5 predictors
    2 classes: 'No', 'Yes'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 13230, 13230, 13230, 13230, 13230, 13230, ...
Resampling results:

  Accuracy   Kappa
  0.8844966  0.4869556

With GLM-based stacking, we achieve roughly 88% accuracy. Let's now examine the effect of using a random forest instead of a GLM to stack the individual predictions from each of the five ML algorithms:

# stacking the predictions of the individual ML algorithms using a random forest
stack.rf <- caretStack(models, method="rf", trControl=stackControl)
# printing the summary of the rf-based stacking
print(stack.rf)

This will result in the following output:

A rf ensemble of 5 base models: C5.0, nb, glm, knn, svmRadial

Ensemble results:
Random Forest

14700 samples
    5 predictors
    2 classes: 'No', 'Yes'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 13230, 13230, 13230, 13230, 13230, 13230, ...
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa
  2     0.9122041  0.6268108
  3     0.9133605  0.6334885
  5     0.9132925  0.6342740

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 3.

We see that, without much effort, we were able to achieve an accuracy of about 91% by stacking the predictions, an improvement over both the GLM-based stack and every individual base learner.
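As a quick aside, the fitted stack can score new data like any other caret model. Here is a minimal sketch, where new_employees is a hypothetical data frame with the same feature columns as the training data:

# a minimal sketch (not from the original text): scoring new observations
# with the stacked model; 'new_employees' is a hypothetical data frame with
# the same feature columns used in training
pred_class <- predict(stack.rf, newdata = new_employees)
# class probabilities instead of hard labels
pred_prob <- predict(stack.rf, newdata = new_employees, type = "prob")
head(pred_class)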

We have now explored the various ensembling techniques that can give us better-performing models. Before ending the chapter, however, there are a couple of things we need to take note of.

There is more than one way to implement a given ML technique in R. For example, bagging can be implemented with the functions available in the ipred library rather than with caret, as we did in this chapter. We should also be aware that hyperparameter tuning is an important part of model building if we want the best-performing model. The number of hyperparameters, and the acceptable values for them, vary depending on the library we intend to use, which is why we paid less attention to hyperparameter tuning in the models we built in this chapter. Nevertheless, it is very important to read a library's documentation to understand which hyperparameters its functions expose. In most cases, incorporating hyperparameter tuning significantly improves a model's performance.
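To make the tuning point concrete, here is a minimal sketch of explicit hyperparameter tuning with caret's tuneGrid, reusing the control object and dataset defined earlier; the grid values are arbitrary illustrations, not recommended settings:

# a minimal sketch (not part of the original example): explicitly tuning the
# C5.0 learner's hyperparameters via caret's tuneGrid (requires the C50
# package); the grid values are arbitrary illustrations
c50_grid <- expand.grid(trials = c(10, 20, 30),
                        model = c("tree", "rules"),
                        winnow = c(TRUE, FALSE))
set.seed(10000)
c50_tuned <- train(Attrition ~ ., data = mydata, method = "C5.0",
                   trControl = control, tuneGrid = c50_grid)
print(c50_tuned)

print(c50_tuned) then reports the resampled performance for each grid combination along with the configuration that caret selects.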