Leveraging multicore processing in the model

The exercise from the previous section is repeated here using the PimaIndiansDiabetes2 dataset instead. This dataset contains several missing values, so we will first impute them and then run the machine learning example.

The exercise has been repeated with some additional nuances, such as using multicore/parallel processing to make the cross-validations run faster.
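Repeated cross-validation fits the same model many times on different resamples, so the folds can be trained on separate cores in parallel. Before registering a backend, you can check how many cores are available; a minimal check using the parallel package bundled with R (the count of 8 used later in this example is an assumption about the machine):

# Check how many cores are available on this machine
library(parallel)
detectCores()  # e.g. returns 8 on an 8-core machine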

To leverage multicore processing, install the doMC package (along with nnet, which is used for the neural network model) using the following code:

Install.packages("doMC")  # Install package for multicore processing 
Install.packages("nnet") # Install package for neural networks in R 

Now we will run the program as shown in the code here:

# Load the required libraries (doMC for parallel processing, caret for 
# the modelling workflow, and DMwR for knnImputation) 
library(doMC) 
library(caret) 
library(DMwR) 
 
# Register 8 cores (adjust this to the number available on your machine) 
registerDoMC(cores = 8) 
 
# Set seed to create a reproducible example 
set.seed(100) 
 
# Load the PimaIndiansDiabetes2 dataset 
data("PimaIndiansDiabetes2", package = "mlbench") 
diab <- PimaIndiansDiabetes2 
 
# This dataset, unlike PimaIndiansDiabetes, has 652 missing values! 
sum(is.na(diab)) 
# [1] 652 
 
# We will use knnImputation (from the DMwR package) to fill in the 
# missing values 
diab <- knnImputation(diab) 
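 
# Sanity check (added for illustration, not in the original listing): 
# the imputation should have filled in every missing value 
sum(is.na(diab))  # should now return 0 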
 
# Create the train-test set split 
training_index <- createDataPartition(diab$diabetes, p = 0.8, list = FALSE, times = 1) 
 
# Create the training and test datasets 
diab_train <- diab[training_index, ] 
diab_test <- diab[-training_index, ] 
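 
# Optional check (illustrative): createDataPartition samples within each 
# class, so the outcome proportions should be similar in both sets 
prop.table(table(diab_train$diabetes)) 
prop.table(table(diab_test$diabetes)) 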
 
# We will use 10-fold cross-validation, repeated 3 times 
diab_control <- trainControl("repeatedcv", number = 10, repeats = 3, 
                             search = "random", classProbs = TRUE) 
 
# Create the model using method "nnet" (a neural network package in R) 
# Note that we have changed the metric here to "Accuracy" instead of ROC 
nn_model <- train(diabetes ~ ., data = diab_train, method = "nnet", 
                  preProc = c("center", "scale"), trControl = diab_control, 
                  tuneLength = 10, metric = "Accuracy") 
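
# Inspect the tuning parameters caret selected (illustrative; for nnet 
# these are the hidden layer size and the weight decay) 
nn_model$bestTune 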

predictions <- predict(nn_model, diab_test[, -ncol(diab_test)]) 
cf <- confusionMatrix(predictions, diab_test$diabetes) 
cf 
 
# > cf 
# Confusion Matrix and Statistics 
# 
#           Reference 
# Prediction neg pos 
#        neg  89  19 
#        pos  11  34 
# 
#                Accuracy : 0.8039 
#                  95% CI : (0.7321, 0.8636) 
#     No Information Rate : 0.6536 
#     P-Value [Acc > NIR] : 3.3e-05 
# 
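
The reported accuracy is simply the fraction of correct predictions, which we can verify directly from the objects already defined (a quick illustrative check):

# Fraction of test-set predictions that match the true labels; 
# should agree with cf$overall["Accuracy"] 
mean(predictions == diab_test$diabetes) 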

Even with 650+ missing values, our model achieved an accuracy of over 80%.

It can certainly be improved, but as a baseline it demonstrates the kind of performance that can be expected of machine learning models.

In the case of a dichotomous outcome variable, a random guess would have a 50% chance of being correct, while always predicting the majority class would only match the No Information Rate of roughly 65%. An accuracy of 80% is significantly higher than either baseline, as the P-Value [Acc > NIR] in the output confirms.
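The No Information Rate reported by confusionMatrix is just the proportion of the majority class in the test set, which we can compute directly (an illustrative check):

# Accuracy achievable by always predicting the most frequent class 
max(prop.table(table(diab_test$diabetes)))  # ~0.65 here 

We can also plot the model to see how the tuning parameters affected accuracy: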

plot(nn_model) 

The resulting plot shows the cross-validated accuracy for each of the sampled tuning parameter combinations. We can also visualize the confusion matrix as a fourfold plot:

fourfoldplot(cf$table) 

The resulting fourfold plot depicts the four cells of the confusion matrix, with each quadrant's area proportional to its cell count.
