K-nearest neighbors model for benchmarking the performance

In this section, we will implement the k-nearest neighbors (KNN) algorithm to build a model on our IBM attrition dataset. We already know from EDA that the dataset suffers from a class imbalance problem. However, we will not treat the class imbalance for now: it is an entire area of its own, with several dedicated techniques, and is therefore out of scope for the ML ensembling topic covered in this chapter. We will consider the dataset as is and build ML models on it. Note that for class-imbalanced datasets, Kappa, precision and recall, or the area under the receiver operating characteristic curve (AUROC) are the appropriate metrics to use; for simplicity, however, we will use accuracy as the performance metric. We will adopt 10-fold cross-validation repeated 10 times to obtain the model performance measurement. Let's now build our attrition prediction model with the KNN algorithm as follows:
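As a quick sanity check, the imbalance noted during EDA can be confirmed from the target column itself; the following is a minimal sketch, assuming the dataset CSV is in the current working directory:

```r
# Read the attrition dataset (the same file used for modeling below)
mydata <- read.csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
# Absolute counts and class proportions of the target variable
table(mydata$Attrition)
prop.table(table(mydata$Attrition))
# The 'No' class dominates the 'Yes' class by a wide margin,
# which is why accuracy alone should be interpreted with care
```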

# Load the necessary libraries
# doMC enables R to use multiple cores available on the system, thereby supporting multiprocessing
library(doMC)
# registerDoMC instructs R to use the specified number of cores; here we ask R to use 4 cores
registerDoMC(cores=4)
# The caret library provides the ML algorithms and supporting routines such as cross-validation
library(caret)
# Set the working directory to where the dataset is located
setwd("~/Desktop/chapter 15")
# Read the CSV file into an R variable called mydata
mydata <- read.csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
# Remove the non-discriminatory features (as identified during EDA) from the dataset
mydata$EmployeeNumber=mydata$Over18=mydata$EmployeeCount=mydata$StandardHours = NULL
# Setting the seed prior to model building ensures reproducibility of the results
set.seed(10000)
# Set the train control parameters, specifying gold-standard 10-fold cross-validation repeated 10 times
fitControl = trainControl(method="repeatedcv", number=10, repeats=10)
# Create a model on the data. Attrition is the target and the model learns from the rest
# of the variables in mydata. We pass the train control parameters and specify that the
# knn algorithm should be used. tuneLength = 20 tells train to evaluate 20 candidate
# values of k and retain the model that produces the best performance measurements.
# The final model is stored in caretmodel
caretmodel = train(Attrition~., data=mydata, trControl=fitControl, method = "knn", tuneLength = 20)
# Output the model object to the console
caretmodel

This will result in the following output:

k-Nearest Neighbors

1470 samples
  30 predictors
   2 classes: 'No', 'Yes'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 1323, 1323, 1324, 1323, 1324, 1322, ...
Resampling results across tuning parameters:

   k  Accuracy   Kappa
   5  0.8216447  0.0902934591
   7  0.8349033  0.0929511324
   9  0.8374198  0.0752842114
  11  0.8410920  0.0687849122
  13  0.8406861  0.0459679081
  15  0.8406875  0.0337742424
  17  0.8400748  0.0315670261
  19  0.8402770  0.0245499585
  21  0.8398721  0.0143638854
  23  0.8393945  0.0084393721
  25  0.8391891  0.0063246624
  27  0.8389174  0.0013913143
  29  0.8388503  0.0007113939
  31  0.8387818  0.0000000000
  33  0.8387818  0.0000000000
  35  0.8387818  0.0000000000
  37  0.8387818  0.0000000000
  39  0.8387818  0.0000000000
  41  0.8387818  0.0000000000
  43  0.8387818  0.0000000000

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 11.

We can see from the model output that the best-performing model is obtained at k = 11, with an accuracy of about 84%. In the rest of the chapter, as we experiment with several ensembling techniques, we will check whether this 84% accuracy baseline from KNN can be beaten.

In a realistic project-building situation, identifying the best hyperparameters alone is not enough. The model needs to be trained on the full dataset with those hyperparameters, and then saved for future use. We will review these steps in the rest of this section.
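When retraining on the full dataset is required, caret lets us fit a model with a fixed hyperparameter by turning resampling off; the sketch below assumes the mydata variable and the k = 11 value from the tuning run above:

```r
library(caret)
# trainControl(method = "none") skips cross-validation entirely,
# so train() simply fits one model on all rows of mydata
finalControl <- trainControl(method = "none")
# tuneGrid pins the hyperparameter instead of searching over it
finalModel <- train(Attrition ~ ., data = mydata,
                    trControl = finalControl,
                    method = "knn",
                    tuneGrid = data.frame(k = 11))
```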

In this case, the caretmodel object already contains the model trained with k = 11, so we do not need to retrain it with the best hyperparameter. To inspect the final model, query the model object with the following code:

caretmodel$finalModel 

This will result in the following output:

11-nearest neighbor model
Training set outcome distribution:

  No  Yes
1233  237
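The outcome distribution above also explains the accuracy plateau seen in the tuning table: for large k, the model effectively always predicts the majority class 'No', so cross-validated accuracy settles near the majority-class proportion while Kappa drops to zero:

```r
# Majority-class baseline implied by the training set distribution
no_count  <- 1233
yes_count <- 237
no_count / (no_count + yes_count)
# approximately 0.8388 -- close to the plateau accuracy of 0.8387818
```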

The next step is to save the best model to a file so that we can load it later and make predictions on unseen data. A model can be saved to a local directory using the saveRDS R command:

# Save the model to disk
saveRDS(caretmodel, "production_model.rds")

In this case, caretmodel is saved as production_model.rds in the working directory. The model is now serialized to a file that can be loaded at any time and used to score unseen data. Loading and scoring can be achieved with the following R code:

# Set the working directory to where the saved .rds file is located
setwd("~/Desktop/chapter 15")
# Load the model
loaded_model <- readRDS("production_model.rds")
# Use the loaded model to make predictions on unseen data
final_predictions <- predict(loaded_model, unseen_data)
Please note that unseen_data must be read into R before it can be scored with the predict command.
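For instance, scoring could look like the following sketch; the file name new_employees.csv is a hypothetical placeholder for whatever CSV holds the unseen records, which must carry the same predictor columns as the training data:

```r
# Load the serialized model (assumes the working directory holds the .rds file)
loaded_model <- readRDS("production_model.rds")
# new_employees.csv is a hypothetical file of unseen records
unseen_data <- read.csv("new_employees.csv")
# Drop the same non-discriminatory columns removed before training
unseen_data$EmployeeNumber <- unseen_data$Over18 <- NULL
unseen_data$EmployeeCount <- unseen_data$StandardHours <- NULL
# Predicted classes ('No'/'Yes') and the underlying class probabilities
final_predictions <- predict(loaded_model, unseen_data)
class_probs <- predict(loaded_model, unseen_data, type = "prob")
head(final_predictions)
head(class_probs)
```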

The part of the workflow in which the final model is trained on the entire dataset, saved, reloaded from the file whenever required, and used to score unseen data is collectively termed an ML productionalization pipeline. This pipeline remains the same for all ML models, regardless of whether the model is built with a single algorithm or with an ensembling technique. Therefore, in the later sections, when we implement the various ensembling techniques, we will not repeat the productionalization pipeline but will stop at obtaining the performance measurement through 10-fold cross-validation repeated 10 times.
