K-nearest neighbors model for benchmarking the performance

In this section, we will implement the k-nearest neighbors (KNN) algorithm to build a model on our IBM attrition dataset. We already know from EDA that the dataset suffers from a class imbalance problem. However, we will not treat the class imbalance for now: it is an entire area of its own, with several dedicated techniques, and is therefore out of scope for the ML ensembling topic covered in this chapter. We will consider the dataset as is and build ML models on it. Note that for class-imbalanced datasets, Kappa, precision and recall, or the area under the receiver operating characteristic curve (AUROC) are the appropriate metrics to use; for simplicity, however, we will use accuracy as the performance metric. We will adopt 10-fold cross-validation repeated 10 times to obtain the model performance measurement. Let's now build our attrition prediction model with the KNN algorithm as follows:
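As a quick sanity check, the imbalance noted during EDA can be confirmed from the target column itself; the following is a minimal sketch, assuming the dataset CSV is in the current working directory:

```r
# Read the attrition dataset (the same file used for modeling below)
mydata <- read.csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
# Absolute counts and class proportions of the target variable
table(mydata$Attrition)
prop.table(table(mydata$Attrition))
# The 'No' class dominates the 'Yes' class by a wide margin,
# which is why accuracy alone should be interpreted with care
```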

# Load the necessary libraries
# doMC enables R to use multiple cores available on the system, thereby supporting multiprocessing
library(doMC)
# registerDoMC instructs R to use the specified number of cores; here we ask R to use 4 cores
registerDoMC(cores=4)
# The caret library provides the ML algorithms and supporting routines such as cross-validation
library(caret)
# Set the working directory to where the dataset is located
setwd("~/Desktop/chapter 15")
# Read the CSV file into an R variable called mydata
mydata <- read.csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
# Remove the non-discriminatory features (as identified during EDA) from the dataset
mydata$EmployeeNumber=mydata$Over18=mydata$EmployeeCount=mydata$StandardHours = NULL
# Setting the seed prior to model building ensures reproducibility of the results
set.seed(10000)
# Set the train control parameters, specifying gold-standard 10-fold cross-validation repeated 10 times
fitControl = trainControl(method="repeatedcv", number=10, repeats=10)
# Create a model on the data. Attrition is the target and the model learns from the rest
# of the variables in mydata. We pass the train control parameters and specify that the
# knn algorithm should be used. tuneLength = 20 tells train to evaluate 20 candidate
# values of k and retain the model that produces the best performance measurements.
# The final model is stored in caretmodel
caretmodel = train(Attrition~., data=mydata, trControl=fitControl, method = "knn", tuneLength = 20)
# Output the model object to the console
caretmodel

This will result in the following output:

k-Nearest Neighbors

1470 samples
  30 predictors
   2 classes: 'No', 'Yes'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 1323, 1323, 1324, 1323, 1324, 1322, ...
Resampling results across tuning parameters:

   k  Accuracy   Kappa
   5  0.8216447  0.0902934591
   7  0.8349033  0.0929511324
   9  0.8374198  0.0752842114
  11  0.8410920  0.0687849122
  13  0.8406861  0.0459679081
  15  0.8406875  0.0337742424
  17  0.8400748  0.0315670261
  19  0.8402770  0.0245499585
  21  0.8398721  0.0143638854
  23  0.8393945  0.0084393721
  25  0.8391891  0.0063246624
  27  0.8389174  0.0013913143
  29  0.8388503  0.0007113939
  31  0.8387818  0.0000000000
  33  0.8387818  0.0000000000
  35  0.8387818  0.0000000000
  37  0.8387818  0.0000000000
  39  0.8387818  0.0000000000
  41  0.8387818  0.0000000000
  43  0.8387818  0.0000000000

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 11.

We can see from the model output that the best-performing model is obtained at k = 11, with an accuracy of about 84%. In the rest of the chapter, as we experiment with several ensembling techniques, we will check whether this 84% accuracy baseline from KNN can be beaten.

In a realistic project-building situation, identifying the best hyperparameters alone is not enough. The model needs to be trained on the full dataset with those hyperparameters, and then saved for future use. We will review these steps in the rest of this section.
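When retraining on the full dataset is required, caret lets us fit a model with a fixed hyperparameter by turning resampling off; the sketch below assumes the mydata variable and the k = 11 value from the tuning run above:

```r
library(caret)
# trainControl(method = "none") skips cross-validation entirely,
# so train() simply fits one model on all rows of mydata
finalControl <- trainControl(method = "none")
# tuneGrid pins the hyperparameter instead of searching over it
finalModel <- train(Attrition ~ ., data = mydata,
                    trControl = finalControl,
                    method = "knn",
                    tuneGrid = data.frame(k = 11))
```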

In this case, the caretmodel object already contains the model trained with k = 11, so we do not need to retrain it with the best hyperparameter. To inspect the final model, query the model object with the following code:

caretmodel$finalModel 

This will result in the following output:

11-nearest neighbor model
Training set outcome distribution:

  No  Yes
1233  237
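The outcome distribution above also explains the accuracy plateau seen in the tuning table: for large k, the model effectively always predicts the majority class 'No', so cross-validated accuracy settles near the majority-class proportion while Kappa drops to zero:

```r
# Majority-class baseline implied by the training set distribution
no_count  <- 1233
yes_count <- 237
no_count / (no_count + yes_count)
# approximately 0.8388 -- close to the plateau accuracy of 0.8387818
```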

The next step is to save the best model to a file so that we can load it later and make predictions on unseen data. A model can be saved to a local directory using the saveRDS R command:

# Save the model to disk
saveRDS(caretmodel, "production_model.rds")

In this case, caretmodel is saved as production_model.rds in the working directory. The model is now serialized to a file that can be loaded at any time and used to score unseen data. Loading and scoring can be achieved with the following R code:

# Set the working directory to where the saved .rds file is located
setwd("~/Desktop/chapter 15")
# Load the model
loaded_model <- readRDS("production_model.rds")
# Use the loaded model to make predictions on unseen data
final_predictions <- predict(loaded_model, unseen_data)
Please note that unseen_data must be read into R before it can be scored with the predict command.
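For instance, scoring could look like the following sketch; the file name new_employees.csv is a hypothetical placeholder for whatever CSV holds the unseen records, which must carry the same predictor columns as the training data:

```r
# Load the serialized model (assumes the working directory holds the .rds file)
loaded_model <- readRDS("production_model.rds")
# new_employees.csv is a hypothetical file of unseen records
unseen_data <- read.csv("new_employees.csv")
# Drop the same non-discriminatory columns removed before training
unseen_data$EmployeeNumber <- unseen_data$Over18 <- NULL
unseen_data$EmployeeCount <- unseen_data$StandardHours <- NULL
# Predicted classes ('No'/'Yes') and the underlying class probabilities
final_predictions <- predict(loaded_model, unseen_data)
class_probs <- predict(loaded_model, unseen_data, type = "prob")
head(final_predictions)
head(class_probs)
```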

The part of the workflow in which the final model is trained on the entire dataset, saved, reloaded from the file whenever required, and used to score unseen data is collectively termed an ML productionalization pipeline. This pipeline remains the same for all ML models, regardless of whether the model is built with a single algorithm or with an ensembling technique. Therefore, in the later sections, when we implement the various ensembling techniques, we will not repeat the productionalization pipeline but will stop at obtaining the performance measurement through 10-fold cross-validation repeated 10 times.
