Chapter 14. Cross-validation and Bootstrapping Using Caret and Exporting Predictive Models Using PMML

In Chapter 12, Multilevel Analyses, we examined how to fit multilevel models to nested data and use them for prediction. In the previous chapter, we discussed text mining in R. In this chapter, we will discover how to perform cross-validation and bootstrapping using the caret package, and how to export models using the Predictive Model Markup Language (PMML).

Cross-validation and bootstrapping of predictive models using the caret package

In this section, we will discuss how to examine the reliability of a model with cross-validation. We start by discussing what cross-validation is.

Cross-validation

You might remember that, in several chapters, we used half of the data to train the model and the other half to test it. The aim of this process was to ensure that the high reliability of a classification, for instance, was due to true relationships in the data rather than to the fitting of noise. We saw in the previous chapter, for instance, that the reliability of a classification is usually higher on the training set than on the test set (unseen data).

The process of using half of the data for training and half for testing is actually a special case of cross-validation: two-fold cross-validation. We can perform cross-validation using more folds. Two very common approaches are ten-fold cross-validation and leave-one-out cross-validation. In ten-fold cross-validation, the data is randomly split into 10 groups containing the same (or approximately the same) number of cases. For the following explanation, we will call these groups 1 to 10. The analysis is performed 10 times. The first time, groups 1 to 9 are used to train the algorithm and group 10 is used to test the model. The second time, groups 1 to 8 and group 10 are used for training and group 9 for testing, and so on (see the following figure). Each group is therefore used once for testing and nine times for training, together with eight other groups. Proceeding this way allows for more accurate estimates of the reliability of the algorithm on the data: as the analysis is performed several times with different sets, we obtain a distribution of reliability measures rather than a single value. Another advantage is that nine-tenths of the data is used for training at each iteration, and every case is used at both the training and testing stages.

Representation of the training and testing sets in ten-fold cross-validation
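
To make the splitting concrete, here is a minimal sketch of a fold assignment in base R (purely illustrative; caret will handle this for us later in this section):

set.seed(1)                                # for reproducibility
n = nrow(iris)                             # 150 cases
fold = sample(rep(1:10, length.out = n))   # randomly assign each case to one of 10 folds
table(fold)                                # each fold contains 15 cases
trainSet = iris[fold != 10, ]              # groups 1 to 9: 135 cases for training
testSet  = iris[fold == 10, ]              # group 10: 15 cases for testing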

Leave-one-out cross-validation is quite similar, except that the data is not split into groups. Using this approach, the analysis is performed as many times as there are cases; each case is used once for testing, with all the remaining cases used for training.
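
The following is a minimal sketch of leave-one-out cross-validation in base R; the choice of lda() from the MASS package as the classifier is ours, purely for illustration:

library(MASS)                                 # for lda()
hits = 0
for (i in 1:nrow(iris)) {
   fit = lda(Species ~ ., data = iris[-i, ])  # train on all cases except case i
   pred = predict(fit, iris[i, ])$class       # predict the single held-out case
   hits = hits + (pred == iris$Species[i])
}
hits / nrow(iris)                             # leave-one-out accuracy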

Performing cross-validation in R with caret

There are several ways of performing cross-validation in R. Using the caret package is one of the most effective, as the package provides a unified framework for working with many different algorithms. This is exactly what we are going to do in the rest of this section, to keep things simple.

First, we start by installing (if not already done) and loading the package:

install.packages("caret"); library (caret)

We can determine the number of folds with the trainControl() function.

Here, for ten-fold cross-validation, the number of folds is set to 10 as follows:

CtrlCV = trainControl(method = "cv", number = 10)

Here, for leave-one-out cross-validation, we only specify the method; the number of iterations is determined by the number of cases:

CtrlLOO = trainControl(method = "LOOCV")

Now that we have set this, we can perform the analyses! We will do so using ten-fold cross-validation (replace CtrlCV with CtrlLOO for leave-one-out cross-validation). To keep things simple, we will use the iris dataset for the examples. If any of the required packages mentioned below are not installed on your system, please install them with the install.packages() function:

  • Naive Bayes:
    install.packages("klaR")
    modelNB = train(Species ~ ., data = iris, trControl = CtrlCV,
       method = "nb")

    The klaR and MASS packages are required and are loaded automatically.

  • C4.5:

    This requires loading RWeka:

    library(RWeka)
    modelC45 = train(Species ~ ., data = iris,
       trControl = CtrlCV, method = "J48")
  • C5.0:
    modelC50 = train(Species ~ ., data = iris,
       trControl = CtrlCV, method = "C5.0")

    The C50 and plyr packages are loaded automatically.

  • CART:
    modelCART = train(Species ~ ., data = iris,
       trControl = CtrlCV, method = "rpart")

    The rpart package is loaded automatically.

  • Random forests:
    modelRF = train(Species ~ ., data = iris, trControl = CtrlCV,
       method = "rf")

    The randomForest package is loaded automatically.

We have covered the other algorithms in this book, and many more are available. To get the list, simply type:

names(getModelInfo())
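
The list is long; you can filter it with standard R tools, for instance as follows:

grep("bayes", names(getModelInfo()), value = TRUE, ignore.case = TRUE)   # methods with "bayes" in their name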

We can inspect the accuracy of the models as follows (we take the example of the Naïve Bayes classification):

modelNB

The output reproduced here displays information about the sample, classes, and predictors; a summary of the method used; and, more importantly, the average accuracy and kappa values and their standard deviations for the different tuning parameters, which vary depending on the algorithm used (see the end of the output for a description). We can see that the average accuracy and kappa values are excellent, with small standard deviations:

Naive Bayes

150 samples
  4 predictor
  3 classes: 'setosa', 'versicolor', 'virginica'

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 135, 135, 135, 135, 135, 135, ... 
Resampling results across tuning parameters:

  usekernel  Accuracy   Kappa  Accuracy SD  Kappa SD  
  FALSE      0.9533333  0.93   0.05488484   0.08232726
   TRUE      0.9533333  0.93   0.05488484   0.08232726
Tuning parameter 'fL' was held constant at a value of 0.
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were fL = 0 and usekernel = FALSE.
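
The per-fold values behind these averages are stored in the fitted object, and several models can be compared side by side with the resamples() function; here is a quick sketch, assuming the models from the list above have been created with the same trainControl settings:

modelNB$resample                               # accuracy and kappa for each of the 10 folds
results = resamples(list(NB = modelNB, C5.0 = modelC50, RF = modelRF))
summary(results)                               # distributions of the reliability measures per model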

Bootstrapping

The aim of bootstrapping is also to obtain a more precise picture of the reliability of the model on the data, but this is done in a different fashion. Instead of partitioning the data into training and testing sets, a random sample of n cases is drawn from the original set with replacement (meaning that the same case can occur several times in a sample), and this is repeated N times, where n is the number of cases and N is the number of iterations. The analysis is performed on each of the samples independently, which yields a mean and standard deviation for the estimates.
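
A minimal sketch of what a single bootstrap sample looks like in base R:

set.seed(1)
idx = sample(nrow(iris), replace = TRUE)   # draw n cases with replacement
boot = iris[idx, ]                         # one bootstrap sample of the same size as the data
sum(duplicated(idx))                       # some cases appear more than once ...
length(setdiff(1:nrow(iris), idx))         # ... while others are left out entirely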

Performing bootstrapping in R with caret

Bootstrapping is done in a way similar to cross-validation, simply by specifying it using the trainControl function:

CtrlBoot = trainControl(method="boot", number=1000)

Let's take our examples again:

  • Naive Bayes:
    modelNBboot = train(Species ~ ., data = iris, 
       trControl = CtrlBoot, method = "nb")
  • C4.5:
    modelC45boot = train(Species ~ ., data = iris, 
       trControl = CtrlBoot, method = "J48")
  • C5.0:
    modelC50boot = train(Species ~ ., data = iris,
       trControl = CtrlBoot, method = "C5.0")
  • CART:
    modelCARTboot = train(Species ~ ., data = iris, 
       trControl = CtrlBoot, method = "rpart")
  • Random forests:
    modelRFboot = train(Species ~ ., data = iris, 
       trControl = CtrlBoot, method = "rf")

The output is quite similar to what we have seen previously. We will, therefore, not comment on it.

Predicting new data

Predictive models are built with the intent of predicting unseen data. This can be done very easily. In what follows, we first partition the data into two sets using stratified sampling: one containing 75 percent of the cases, which we will use to train and test the model using cross-validation, and another containing the remaining 25 percent, which serves as unseen data (data that we have not yet used):

forCV = createDataPartition(iris$Species, p=0.75, list=FALSE)
CVset = iris[forCV,]
NEWset = iris[-forCV,]
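
As a quick check, we can verify that the stratified sampling preserved the class proportions in both sets:

table(CVset$Species)    # about 38 cases per species in the cross-validation set
table(NEWset$Species)   # about 12 cases per species in the unseen set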

We now create the cross-validated model (with Naïve Bayes):

model = train(Species ~ ., data = CVset, trControl = CtrlCV,
   method = "nb")

We can now predict our unseen data using this model:

Predictions = predict(model, NEWset)
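
We can then assess the quality of these predictions on the unseen data, for instance with caret's confusionMatrix() function:

confusionMatrix(Predictions, NEWset$Species)   # accuracy, kappa, and per-class statistics on unseen data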