The cross-validation parameter

Cross-validation takes the train-test split concept a step further. The aim of the machine learning exercise is, in essence, to find the set of model parameters that provides the best performance. A model parameter is an argument that the function (the model) takes. For example, for a decision tree model, the parameters may include how many levels deep the tree should be built, the minimum number of observations required for a split, and so on. If, say, there are n different parameters, each with k possible values, the total number of combinations would be k^n. We generally select a fixed set of candidate values for each parameter and can easily end up with 100-1000+ combinations. We then test the performance of the model (for example, the accuracy with which it predicts the outcome) for each of these combinations.
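
As a quick illustration of how these combinations multiply, the short sketch below builds a grid from a few decision tree parameters; the parameter names and candidate values are purely illustrative:

# Hypothetical decision tree parameters, each with a few candidate values
param_grid <- expand.grid(max_depth = c(3, 5, 7, 10),
                          min_split = c(2, 10, 20),
                          cp        = c(0.001, 0.01, 0.1))
nrow(param_grid)   # 4 x 3 x 3 = 36 combinations to evaluate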

With a simple train-test split, if we had selected, say, 500 combinations of parameters, we would simply run each of them against the training dataset and determine which one shows the best performance.

With cross-validation, we further split the training set into smaller subsets; three or five folds are commonly used. If there are three folds, that is, if we split the training set into three subsets, we keep aside one fold, say Fold 2, build a model with a given set of parameters using Folds 1 and 3, and then test its accuracy against Fold 2. This step is repeated so that each fold serves once as the held-out subset, and the accuracy measures from each iteration are collected. Eventually, we arrive at the optimal combination by selecting the parameters that showed the best overall performance across the folds.

The standard approach can be summarized as follows:

  1. Create an 80-20 train-test split
  2. Execute your model(s) using different combinations of model parameters
  3. Select the model parameters that show the best overall performance and create the final model
  4. Apply the final model on the test set to see the results
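
A minimal sketch of these four steps is shown below. It assumes a data frame df with a two-class factor column named outcome (both names are hypothetical) and uses a single rpart parameter purely for illustration:

# Step 1: create an 80-20 train-test split
library(caret)
library(rpart)

set.seed(100)
train_idx <- createDataPartition(df$outcome, p = 0.8, list = FALSE)
train_set <- df[train_idx, ]
test_set  <- df[-train_idx, ]

# Step 2: try a few candidate values of one model parameter (rpart's cp)
# Performance is measured on the training data here, which is exactly the
# weakness that cross-validation (described next) addresses
candidate_cp <- c(0.001, 0.01, 0.1)
train_acc <- sapply(candidate_cp, function(cp_val) {
  fit  <- rpart(outcome ~ ., data = train_set, cp = cp_val)
  pred <- predict(fit, train_set, type = "class")
  mean(pred == train_set$outcome)
})

# Step 3: pick the best-performing parameter value and build the final model
best_cp     <- candidate_cp[which.max(train_acc)]
final_model <- rpart(outcome ~ ., data = train_set, cp = best_cp)

# Step 4: apply the final model to the held-out test set
test_pred <- predict(final_model, test_set, type = "class")
mean(test_pred == test_set$outcome)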

The cross-validation approach mandates that we should further split the training dataset into smaller subsets. These subsets are generally known as folds and collectively they are known as the k-folds, where k represents the number of splits:

  1. Create an 80-20 train-test split
  2. Split the training set into k-folds, say, three folds
  3. Set aside Fold 1 and build a model using Fold 2 and Fold 3
  4. Test your model performance on Fold 1 (for example, the percentage of accurate results)
  5. Set aside Fold 2 and build a model using Fold 1 and Fold 3
  6. Test your model performance on Fold 2
  7. Set aside Fold 3 and build a model using Fold 1 and Fold 2
  8. Test your model performance on Fold 3
  9. Take the average performance of the model across all three folds
  10. Repeat Steps 3 to 9 for each set of model parameters
  11. Select the model parameters that show the best overall performance and create the final model
  12. Apply the final model on the test set to see the results
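
The sketch below illustrates Steps 2 to 9 for a single parameter value, again using the hypothetical train_set and outcome names from the earlier sketch:

library(caret)
library(rpart)

set.seed(100)
# Step 2: split the training set into three folds
# (createFolds returns the row indices of each held-out fold)
folds <- createFolds(train_set$outcome, k = 3)

# Steps 3-8: hold each fold out in turn, build on the other two, test on it
fold_acc <- sapply(folds, function(holdout_idx) {
  fit  <- rpart(outcome ~ ., data = train_set[-holdout_idx, ], cp = 0.01)
  pred <- predict(fit, train_set[holdout_idx, ], type = "class")
  mean(pred == train_set$outcome[holdout_idx])
})

# Step 9: average performance across the three folds
mean(fold_acc)

# Steps 10-12: repeat for every candidate parameter value, pick the best one,
# refit on the full training set, and evaluate on the held-out test set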

This image illustrates the difference between an approach without cross-validation and one with cross-validation. The cross-validation method is arguably more robust and involves a more rigorous evaluation of the model. That said, it is often useful to first build a model without cross-validation to get a sense of the kind of performance that may be expected. For example, if a model built with, say, 2-3 simple training-test splits shows an accuracy of around 30%, it is unlikely that any other approach, including cross-validation, would somehow raise that to 90%. In other words, the standard approach gives us an early indication of the performance we can expect. Since cross-validation can be quite compute-intensive and time-consuming, this initial feedback is valuable for a preliminary analysis of the overall process.
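
One way to approximate such a quick check in caret is its LGOCV resampling method, which performs a small number of repeated training-test splits. The sketch below uses hypothetical object names and assumes the diab_train data from the earlier exercise:

quick_control <- trainControl(method = "LGOCV", p = 0.8, number = 3,
                              classProbs = TRUE, summaryFunction = twoClassSummary)
quick_model <- train(diabetes ~ ., data = diab_train, method = "rf",
                     tuneLength = 5, trControl = quick_control, metric = "ROC")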

The caret package in R provides a very user-friendly way to build models using cross-validation. Recall that any data pre-processing must be made an integral part of the cross-validation process. So, if we had to center and scale the dataset and perform a five-fold cross-validation, all we would have to do is define the type of resampling we'd like to use in caret's trainControl function.

Caret's webpage on trainControl provides a detailed overview of the functions along with worked-out examples at https://topepo.github.io/caret/model-training-and-tuning.html#basic-parameter-tuning.

We have used this approach in our earlier exercise where we built a model using RandomForest on the PimaIndiansDiabetes dataset. It is shown again here to indicate where the technique was used:

# Create the trainControl parameters for the model 
# The parameters indicate that a 3-fold CV would be created 
# and that the process would be repeated 2 times (repeats) 
# The class probabilities in each run will be stored 
# And we'll use the twoClassSummary function to measure the model 
# performance 
diab_control <- trainControl("repeatedcv", number = 3, repeats = 2, 
                             classProbs = TRUE, summaryFunction = twoClassSummary)
 
# Build the model 
# We use the train function of caret to build the model 
# As part of the training process, we specify a tuneLength of 5 
# This parameter lets caret select a set of default tuning parameter values 
# trControl = diab_control indicates that the model will be built 
# using the cross-validation method specified in diab_control 
# Finally, preProc = c("center", "scale") indicates that the data 
# will be centered and scaled at each pass of the model iteration 
 
rf_model <- train(diabetes ~ ., data = diab_train, method = "rf", 
                  preProc = c("center", "scale"), tuneLength = 5, 
                  trControl = diab_control, metric = "ROC")
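
After training, the cross-validated performance for each candidate parameter value can be inspected and the final model applied to the held-out test set. A short sketch is shown below, assuming the 20% split from the earlier exercise is stored in diab_test (a hypothetical name here):

# Print the resampled ROC, sensitivity, and specificity for each mtry tried
rf_model

# Apply the final model to the held-out test set
predictions <- predict(rf_model, newdata = diab_test)
confusionMatrix(predictions, diab_test$diabetes)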

You can get a more detailed explanation of summaryFunction from https://cran.r-project.org/web/packages/caret/vignettes/caret.pdf.

The summaryFunction argument is used to pass in a function that takes the observed and predicted values and estimates some measure of performance. Two such functions are already included in the package: defaultSummary and twoClassSummary. The latter will compute measures specific to two-class problems, such as the area under the ROC curve, the sensitivity and specificity. Since the ROC curve is based on the predicted class probabilities (which are not computed automatically), another option is required. The classProbs = TRUE option is used to include these calculations.
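
If a metric other than those computed by defaultSummary or twoClassSummary is needed, a custom function with the same (data, lev, model) signature can be supplied instead. The sketch below uses hypothetical object names and simply reports accuracy:

# A custom summary function: caret passes a data frame with obs and pred columns
accuracy_summary <- function(data, lev = NULL, model = NULL) {
  c(Accuracy = mean(data$obs == data$pred))
}

custom_control <- trainControl("repeatedcv", number = 3, repeats = 2,
                               summaryFunction = accuracy_summary)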

Here is an explanation of tuneLength from the help file for the train function of caret.

tuneLength is an integer denoting the amount of granularity in the tuning parameter grid. By default, this argument is the number of levels for each tuning parameter that should be generated by train. If trainControl has the option search = random, this is the maximum number of tuning parameter combinations that will be generated by the random search.

Note that if this argument is given it must be named.
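
For method = "rf", the only tuning parameter is mtry, so tuneLength = 5 asks caret to try five values of mtry. If we prefer to specify the candidate values ourselves, an explicit grid can be passed through the tuneGrid argument instead; a sketch with hypothetical object names is shown below:

rf_grid <- expand.grid(mtry = c(2, 4, 6, 8))
rf_model_grid <- train(diabetes ~ ., data = diab_train, method = "rf", 
                       preProc = c("center", "scale"), tuneGrid = rf_grid, 
                       trControl = diab_control, metric = "ROC")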