The GBM implementation

Let's implement the attrition prediction model with GBMs:

# loading the essential libraries and registering the cores for multiprocessing 
library(doMC)
library(mlbench)
library(gbm)
library(caret)
registerDoMC(cores=4)
# setting the working directory and reading the dataset
setwd("~/Desktop/chapter 15")
mydata <- read.csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
# removing the non-discriminatory features identified in the EDA step
mydata$EmployeeNumber = NULL
mydata$Over18 = NULL
mydata$EmployeeCount = NULL
mydata$StandardHours = NULL
# converting the target Attrition field to numeric, as the gbm model expects all numeric fields in the dataset
# (wrapping in as.factor() first keeps this working on R >= 4.0, where read.csv no longer creates factor columns)
mydata$Attrition = as.numeric(as.factor(mydata$Attrition))
# forcing the attrition column values to be 0 and 1 instead of 1 and 2
mydata = transform(mydata, Attrition=Attrition-1)
# running the gbm model with 10-fold cross-validation to identify the number of trees to build - hyperparameter tuning
gbm.model = gbm(Attrition~., data=mydata, shrinkage=0.01, distribution='bernoulli', cv.folds=10, n.trees=3000, verbose=F)
# extracting and printing the optimal number of trees identified by the cross-validation above
best.iter = gbm.perf(gbm.model, method="cv")
print(best.iter)
# setting the seed for reproducibility
set.seed(123)
# creating a copy of the dataset
mydata1=mydata
# converting target to a factor
mydata1$Attrition=as.factor(mydata1$Attrition)
# setting up the cross-validation controls
fitControl = trainControl(method="repeatedcv", number=10, repeats=10)
# running the gbm model in tandem with caret
caretmodel = train(Attrition~., data=mydata1, method="gbm", distribution="bernoulli", trControl=fitControl, verbose=F, tuneGrid=data.frame(n.trees=best.iter, shrinkage=0.01, interaction.depth=1, n.minobsinnode=1))
# printing the model summary
print(caretmodel)
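
As an optional check, the tuned gbm fit itself can be inspected directly. The following is a minimal, illustrative sketch using gbm's built-in relative-influence summary and its predict() method; note that it scores the training data, so it says nothing about generalization:

# ranking predictors by relative influence at the tuned number of trees
summary(gbm.model, n.trees=best.iter)
# class probabilities on the training data at the tuned number of trees
train.probs = predict(gbm.model, newdata=mydata, n.trees=best.iter, type="response")
head(train.probs)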

Running the main listing above will result in the following output (the first line, 2623, is the tuned number of trees printed by print(best.iter)):

2623
Stochastic Gradient Boosting

1470 samples
  30 predictors
   2 classes: '0', '1'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 1323, 1323, 1323, 1322, 1323, 1323, ...
Resampling results:

  Accuracy   Kappa
  0.8771472  0.4094991

Tuning parameter 'n.trees' was held constant at a value of 2623
Tuning parameter 'shrinkage' was held constant at a value of 0.01
Tuning parameter 'interaction.depth' was held constant at a value of 1
Tuning parameter 'n.minobsinnode' was held constant at a value of 1

You will see that the GBM model achieves an accuracy above 87%, which is better than the 84% we achieved earlier with KNN.
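
To turn the trained caret model into concrete predictions, you can score a dataset and cross-tabulate the results with caret's confusionMatrix() function. The sketch below scores the same data the model was trained on, so the accuracy it prints will be optimistic; the resampled accuracy reported above remains the more honest estimate:

# predicting attrition classes with the trained caret model
predictions = predict(caretmodel, newdata=mydata1)
# cross-tabulating predictions against the actual labels
confusionMatrix(predictions, mydata1$Attrition)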
