In Chapter 12, Multilevel Analyses, we examined how to fit and predict nested data using multilevel analyses. In the previous chapter, we discussed text mining in R. In this chapter, we will discover how to perform cross-validation and bootstrapping using the caret
package and how to export models using Predictive Model Markup Language (PMML).
In this section, we will discuss how to examine the reliability of a model with cross-validation. We start by discussing what cross-validation is.
You might remember that, in several chapters, we used half of the data to train the model and the other half to test it. The aim of this process was to ensure that the high reliability of a classification, for instance, was due to true relationships in the data rather than to the fitting of noise. We saw in the previous chapter, for instance, that the reliability of a classification on the training set is usually higher than on the test set (unseen data).
The process of using half of the data for training and half for testing is actually a special case of cross-validation: two-fold cross-validation. We can perform cross-validation with more folds. Two very common approaches are ten-fold cross-validation and leave-one-out cross-validation. In ten-fold cross-validation, the data is randomly split into 10 groups containing the same (or approximately the same) number of cases. For the following explanation, we will call these groups 1 to 10. The analysis is performed 10 times. The first time, groups 1 to 9 are used to train the algorithm and group 10 is used to test the model. The second time, groups 1 to 8 and group 10 are used for training and group 9 for testing, and so on (see the following figure). Each group is therefore used nine times for training (together with the eight other training groups of that iteration) and once for testing. Proceeding this way gives more accurate estimates of the reliability of the algorithm on the data, as the analysis is performed several times with different sets, which provides a distribution of the reliability measures. Another advantage is that nine tenths of the data is used for training at each iteration, and all cases are used at both the training and testing stages.
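The procedure just described can be sketched in a few lines of base R (caret automates all of this for us later in the chapter; the majority-class "model" below is just a hypothetical placeholder for a real classifier):

```r
# Sketch of ten-fold cross-validation in base R.
set.seed(42)
# Assign each row of iris to one of 10 folds at random
folds = sample(rep(1:10, length.out = nrow(iris)))

accuracies = numeric(10)
for (k in 1:10) {
  trainSet = iris[folds != k, ]   # 9 folds for training
  testSet  = iris[folds == k, ]   # 1 fold for testing
  # A real classifier would be fitted here; as a placeholder we
  # simply predict the most frequent class of the training set
  majority = names(which.max(table(trainSet$Species)))
  accuracies[k] = mean(testSet$Species == majority)
}
mean(accuracies)   # average reliability over the 10 iterations
sd(accuracies)     # and its spread
```

Each case ends up in the test set exactly once, and in the training set in the nine other iterations, which is exactly the scheme described above.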
Leave-one-out cross-validation is quite similar, except that the data is not split into groups: the analysis is performed as many times as there are cases, and each case is used once for testing and all the other times for training.
There are several ways of performing cross-validation in R. Using the caret
package is one of the most effective, as the package provides a unified framework to use with different algorithms. This is exactly what we are going to do in the rest of this section, to keep things simple.
First, we start by installing (if not already done) and loading the package:
install.packages("caret")
library(caret)
We can configure the resampling scheme with the trainControl()
function.
For ten-fold cross-validation, we set the method and the number of folds as follows:
CtrlCV = trainControl(method = "cv", number = 10)
For leave-one-out cross-validation, the configuration is as follows (no number of folds is needed, as it is determined by the number of cases):
CtrlLOO = trainControl(method = "LOOCV")
Now that we have set this, we can perform the analyses! We will do so using ten-fold cross-validation (replace CtrlCV
with CtrlLOO
for leave-one-out cross-validation). To simplify, we will use the iris
dataset for our examples. If any of the following required packages are not installed on your system, please use the install.packages()
function to install them:
We start with Naïve Bayes. The klaR
and MASS
packages are required and are loaded automatically:
install.packages("klaR")
modelNB = train(Species ~ ., data = iris, trControl = CtrlCV, method = "nb")
The C4.5 classifier requires loading RWeka:
library(RWeka)
modelC45 = train(Species ~ ., data = iris, trControl = CtrlCV, method = "J48")
For C5.0, the C50
and plyr
packages are loaded automatically:
modelC50 = train(Species ~ ., data = iris, trControl = CtrlCV, method = "C5.0")
For CART, the rpart
package is loaded automatically:
modelCART = train(Species ~ ., data = iris, trControl = CtrlCV, method = "rpart")
For random forests, the randomForest
package is loaded automatically:
modelRF = train(Species ~ ., data = iris, trControl = CtrlCV, method = "rf")
We have covered these algorithms earlier in this book, and many more are available. To get the list, simply type:
names(getModelInfo())
We can inspect the accuracy of the models as follows (we take the example of the Naïve Bayes classification):
modelNB
The output reproduced here displays information about the sample, classes, and predictors; a summary of the method used; and, more importantly, the average accuracy and kappa values and their standard deviations for the different tuning parameters, which differ depending on the algorithm used (see the end of the output for a description). We can see that the average accuracy and kappa values are excellent, with small standard deviations:
Naive Bayes

150 samples
  4 predictor
  3 classes: 'setosa', 'versicolor', 'virginica'

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 135, 135, 135, 135, 135, 135, ...
Resampling results across tuning parameters:

  usekernel  Accuracy   Kappa  Accuracy SD  Kappa SD
  FALSE      0.9533333  0.93   0.05488484   0.08232726
   TRUE      0.9533333  0.93   0.05488484   0.08232726

Tuning parameter 'fL' was held constant at a value of 0
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were fL = 0 and usekernel = FALSE.
The aim of bootstrapping is also to obtain a more precise image of the reliability of the model on the data, but this is done in a different fashion. Instead of partitioning the data into training and testing sets, a random sample of n cases is drawn N times from the original set with replacement (meaning that the same case can occur several times in a given sample), where N is the number of iterations and n is the number of cases. The analysis is performed on each of the samples independently, which yields a mean and a standard deviation for the estimates.
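The resampling scheme just described can be illustrated in base R (caret's "boot" method automates it for us below). Here, as a hypothetical stand-in for a full model fit, we bootstrap a simple statistic, the mean sepal length:

```r
# Sketch of the bootstrap: draw n cases with replacement, N times,
# and recompute the statistic of interest on each resample.
set.seed(42)
N = 1000            # number of iterations
n = nrow(iris)      # number of cases
boot_means = numeric(N)
for (i in 1:N) {
  # sample() with replace = TRUE lets the same case occur several times
  resample = iris[sample(n, n, replace = TRUE), ]
  boot_means[i] = mean(resample$Sepal.Length)
}
mean(boot_means)    # bootstrap estimate
sd(boot_means)      # its standard deviation
```

The spread of the N recomputed values is what gives us the standard deviation of the estimate; with a fitted model, the resampled statistic would be its accuracy or kappa instead.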
Bootstrapping is done in a way similar to cross-validation, simply by specifying it using the trainControl
function:
CtrlBoot = trainControl(method="boot", number=1000)
Let's take our examples again:
modelNBboot = train(Species ~ ., data = iris, trControl = CtrlBoot, method = "nb")
modelC45boot = train(Species ~ ., data = iris, trControl = CtrlBoot, method = "J48")
modelC50boot = train(Species ~ ., data = iris, trControl = CtrlBoot, method = "C5.0")
modelCARTboot = train(Species ~ ., data = iris, trControl = CtrlBoot, method = "rpart")
modelRFboot = train(Species ~ ., data = iris, trControl = CtrlBoot, method = "rf")
The output is quite similar to what we have seen previously. We will, therefore, not comment on it.
Predictive models are built with the intent of predicting unseen data. This can be done very easily. In what follows, we first partition the data into two sets using stratified sampling: one of 75 percent, which we will use for training and testing with cross-validation, and another of 25 percent, which contains unseen data (data that we have not yet used):
forCV = createDataPartition(iris$Species, p = 0.75, list = FALSE)
CVset = iris[forCV, ]
NEWset = iris[-forCV, ]
We now create the cross-validated model (with Naïve Bayes):
model = train(Species ~ ., data = CVset, trControl = CtrlCV, method = "nb")
We can now predict our unseen data using this model:
Predictions = predict(model, NEWset)
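To see the whole train-then-predict-on-unseen-data logic spelled out, here is a self-contained base R sketch of the same 75/25 stratified split. The nearest-centroid classifier is a hypothetical stand-in for the caret models used above:

```r
# Stratified 75/25 split: sample 75 percent of the rows within each species
set.seed(42)
idx = unlist(lapply(split(seq_len(nrow(iris)), iris$Species),
                    function(i) sample(i, round(0.75 * length(i)))))
trainSet = iris[idx, ]
testSet  = iris[-idx, ]    # unseen data

# Placeholder model: per-species centroids of the two petal measurements
centroids = aggregate(cbind(Petal.Length, Petal.Width) ~ Species,
                      trainSet, mean)
classify = function(pl, pw) {
  d = (centroids$Petal.Length - pl)^2 + (centroids$Petal.Width - pw)^2
  as.character(centroids$Species[which.min(d)])   # nearest centroid
}

# Predict the unseen cases and measure the accuracy
preds = mapply(classify, testSet$Petal.Length, testSet$Petal.Width)
mean(preds == testSet$Species)
```

The createDataPartition() call above performs the same kind of stratified sampling for us, and predict() plays the role of the classify() helper.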