The exercise from the previous section is repeated here using the PimaIndiansDiabetes2 dataset instead. This dataset contains several missing values, so we will first impute them and then run the machine learning example.
The exercise also introduces some additional nuances, such as multicore/parallel processing to make the cross-validations run faster.
To leverage multicore processing, install the doMC package. We also install nnet (the neural network method used below) and DMwR (which provides the knnImputation function) with the following code:

install.packages("doMC")  # Install package for multicore processing
install.packages("nnet")  # Install package for neural networks in R
install.packages("DMwR")  # Install package providing knnImputation()
Now we will run the program as shown in the code here:
# Load the libraries
library(doMC)
library(caret)
library(DMwR)   # Provides knnImputation()

# Register all cores
registerDoMC(cores = 8)

# Set seed to create a reproducible example
set.seed(100)

# Load the PimaIndiansDiabetes2 dataset
data("PimaIndiansDiabetes2", package = 'mlbench')
diab <- PimaIndiansDiabetes2

# This dataset, unlike PimaIndiansDiabetes, has 652 missing values!
> sum(is.na(diab))
[1] 652

# We will use knnImputation to fill in the missing values
diab <- knnImputation(diab)

# Create the train-test set split
training_index <- createDataPartition(diab$diabetes, p = .8, list = FALSE, times = 1)

# Create the training and test datasets
diab_train <- diab[training_index, ]
diab_test  <- diab[-training_index, ]

# We will use 10-fold cross-validation, repeated 3 times
diab_control <- trainControl("repeatedcv", number = 10, repeats = 3,
                             search = "random", classProbs = TRUE)

# Create the model using method "nnet" (a neural network package in R)
# Note that we have changed the metric here to "Accuracy" instead of ROC
nn_model <- train(diabetes ~ ., data = diab_train,
                  method = "nnet",
                  preProc = c("center", "scale"),
                  trControl = diab_control,
                  tuneLength = 10,
                  metric = "Accuracy")

predictions <- predict(nn_model, diab_test[, -ncol(diab_test)])
cf <- confusionMatrix(predictions, diab_test$diabetes)

> cf
# Confusion Matrix and Statistics
#
#           Reference
# Prediction neg pos
#        neg  89  19
#        pos  11  34
#
#                Accuracy : 0.8039
#                  95% CI : (0.7321, 0.8636)
#     No Information Rate : 0.6536
#     P-Value [Acc > NIR] : 3.3e-05
Even with 650+ missing values, our model was able to achieve an accuracy of 80%+.
It can certainly be improved, but as a baseline, it shows the kind of performance that can be expected of machine learning models.
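For context, the "No Information Rate" reported in the output above is simply the share of the majority class, that is, the accuracy of always predicting neg. As a quick sketch, we can recover it from a hypothetical label vector with the same class proportions as the test set (the 100/53 split is illustrative, taken from the confusion matrix counts):

```r
# Hypothetical label vector mirroring the test set's class balance
labels <- factor(c(rep("neg", 100), rep("pos", 53)))

# The No Information Rate is the proportion of the majority class
nir <- max(table(labels)) / length(labels)
round(nir, 4)  # ~0.6536, matching the No Information Rate above
```

Any useful classifier must beat this number, not just the 50% of a coin flip.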
In the case of a dichotomous outcome variable, a random guess would have had a 50% chance of being accurate. An accuracy of 80% is significantly higher than what we could have achieved using guesswork alone:
plot(nn_model)
The resulting plot is as follows:
fourfoldplot(cf$table)
The result is depicted in the following plot:
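The four cells of the fourfold plot are the confusion matrix counts printed earlier. As a sanity check, the headline statistics can be recomputed by hand from those counts (note that caret treats the first factor level, neg, as the positive class):

```r
# Cell counts taken from the confusion matrix output above
tp <- 89   # predicted neg, actually neg
fn <- 11   # predicted pos, actually neg
fp <- 19   # predicted neg, actually pos
tn <- 34   # predicted pos, actually pos

accuracy    <- (tp + tn) / (tp + fn + fp + tn)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)

round(c(accuracy, sensitivity, specificity), 4)
# Accuracy matches the 0.8039 reported by confusionMatrix()
```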