The GBM implementation

Let's implement the attrition prediction model with GBMs:

# loading the essential libraries and registering the cores for multiprocessing 
library(doMC)
library(mlbench)
library(gbm)
library(caret)
registerDoMC(cores=4)
# setting the working directory and reading the dataset
setwd("~/Desktop/chapter 15")
mydata <- read.csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
# removing the non-discriminatory features identified in the EDA step
mydata$EmployeeNumber = NULL
mydata$Over18 = NULL
mydata$EmployeeCount = NULL
mydata$StandardHours = NULL
# converting the target Attrition field to numeric, as the gbm model expects all numeric fields in the dataset
# (wrapping in as.factor() first keeps this working on R >= 4.0, where read.csv no longer creates factor columns)
mydata$Attrition = as.numeric(as.factor(mydata$Attrition))
# forcing the attrition column values to be 0 and 1 instead of 1 and 2
mydata = transform(mydata, Attrition=Attrition-1)
# running the gbm model with 10-fold cross-validation to identify the number of trees to build - hyperparameter tuning
gbm.model = gbm(Attrition~., data=mydata, shrinkage=0.01, distribution='bernoulli', cv.folds=10, n.trees=3000, verbose=F)
# extracting and printing the optimal number of trees identified by the cross-validation above
best.iter = gbm.perf(gbm.model, method="cv")
print(best.iter)
# setting the seed for reproducibility
set.seed(123)
# creating a copy of the dataset
mydata1=mydata
# converting target to a factor
mydata1$Attrition=as.factor(mydata1$Attrition)
# setting up the cross-validation controls
fitControl = trainControl(method="repeatedcv", number=10, repeats=10)
# running the gbm model in tandem with caret
caretmodel = train(Attrition~., data=mydata1, method="gbm", distribution="bernoulli", trControl=fitControl, verbose=F, tuneGrid=data.frame(n.trees=best.iter, shrinkage=0.01, interaction.depth=1, n.minobsinnode=1))
# printing the model summary
print(caretmodel)
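
As an optional check, the tuned gbm fit itself can be inspected directly. The following is a minimal, illustrative sketch using gbm's built-in relative-influence summary and its predict() method; note that it scores the training data, so it says nothing about generalization:

# ranking predictors by relative influence at the tuned number of trees
summary(gbm.model, n.trees=best.iter)
# class probabilities on the training data at the tuned number of trees
train.probs = predict(gbm.model, newdata=mydata, n.trees=best.iter, type="response")
head(train.probs)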

Running the main listing above will result in the following output (the first line, 2623, is the tuned number of trees printed by print(best.iter)):

2623
Stochastic Gradient Boosting

1470 samples
  30 predictors
   2 classes: '0', '1'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 1323, 1323, 1323, 1322, 1323, 1323, ...
Resampling results:

  Accuracy   Kappa
  0.8771472  0.4094991

Tuning parameter 'n.trees' was held constant at a value of 2623
Tuning parameter 'shrinkage' was held constant at a value of 0.01
Tuning parameter 'interaction.depth' was held constant at a value of 1
Tuning parameter 'n.minobsinnode' was held constant at a value of 1

You will see that the GBM model achieves an accuracy above 87%, which is better than the 84% we achieved earlier with KNN.
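
To turn the trained caret model into concrete predictions, you can score a dataset and cross-tabulate the results with caret's confusionMatrix() function. The sketch below scores the same data the model was trained on, so the accuracy it prints will be optimistic; the resampled accuracy reported above remains the more honest estimate:

# predicting attrition classes with the trained caret model
predictions = predict(caretmodel, newdata=mydata1)
# cross-tabulating predictions against the actual labels
confusionMatrix(predictions, mydata1$Attrition)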
