Cross-validation

We've seen that in the real world, we often come across situations where we don't have a test data set available to measure the performance of our model on unseen data. The most typical reason is that we have very little data overall and want to use all of it to train our model. Another situation is that we want to keep a sample of the data as a validation set to tune model meta parameters, such as cost and gamma for SVMs with radial kernels, and as a result, we've already reduced our starting data and don't want to reduce it further.

Whatever the reason for the lack of a test data set, we already know that we should never use our training data as a measure of model performance and generalization because of the problem of overfitting. This is especially relevant for powerful and expressive models, such as neural networks and SVMs with radial kernels, which are often capable of approximating the training data very closely but may still fail to generalize well to unseen data. In this section, we introduce the notion of cross-validation, which is perhaps best explained by a diagram:

Figure: Cross-validation

There is actually more than one variant of cross-validation, but in the previous diagram, we show the most common one, which is known as k-fold cross-validation. Under this scheme, we split our training data into k equally sized and nonoverlapping partitions. We then train a model using k-1 of these partitions and use the remaining partition as a test set. This is done k times in total, once for every partition that is left out and used as a test set. Finally, we compute a global estimate of the accuracy of our method on unseen data by aggregating all estimates that we obtained across the k different test sets. For example, for classification problems where we compute the classification accuracy, we can obtain a global test set classification accuracy by taking the average classification accuracy across all the different test sets.

As an example, suppose that we want to train a binary classification model with a training data set of only 500 observations, and we want to do 10-fold cross-validation. We produce 10 partitions of 50 observations each. Then, we train 10 models, each of which is trained with 9 of these partitions for a training set size of 450 observations and tested with the remaining partition of 50 observations. For each of these models, we measure the classification accuracy on the partition set aside for testing, and take the average of these 10 measurements in order to obtain a global estimate of the model's classification accuracy on unseen data. Individually, the test set accuracy of one of these models may be inaccurate due to the randomness of the sampling of the training data and the relatively small size of the test data. The averaging process at the end is instrumental in smoothing out these irregularities that might occur for individual models.
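
To make the procedure concrete, here is a minimal sketch of how 10-fold cross-validation could be carried out by hand with the svm() function from the e1071 package. The data frame my_data and its output column label are hypothetical placeholders rather than objects defined in this chapter:

> library(e1071)
> k <- 10
> # Randomly assign each observation to one of the k folds
> folds <- sample(rep(1:k, length.out = nrow(my_data)))
> fold_accuracies <- sapply(1:k, function(i) {
    # Train on all folds except fold i and test on fold i
    fold_model <- svm(label ~ ., data = my_data[folds != i, ],
      kernel = "radial")
    fold_predictions <- predict(fold_model, my_data[folds == i, ])
    mean(fold_predictions == my_data$label[folds == i])
  })
> # Aggregate the per-fold accuracies into a single estimate
> mean(fold_accuracies)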

Choosing a good value for k is an important design decision. Using a small value for k results in models that are trained on a small proportion of the input data, which introduces a significant bias. For example, if we were to use a value of 2 for k, we would end up training each model on only half of the training data. This is yet another example of the bias-variance trade-off that constantly crops up in statistical modeling. If, instead, we choose a large value of k, each time we train a model we will be using most of the training data and thus our bias will be kept low. However, because we are essentially using almost the same training data each time and testing on a small test set, we will observe a large variance in our estimate of the model's accuracy on unseen data. In the limit, we can actually set the value of k to be the total number of observations in our training data. When this happens, each time we train a model, our test set consists of only the one observation that is left out. For this reason, this process is referred to as leave-one-out cross-validation.
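
In terms of the sketch above, leave-one-out cross-validation is simply the special case where the number of folds equals the number of observations, so every model is tested on a single held-out row (again using the hypothetical my_data frame):

> # Leave-one-out cross-validation: one model per observation
> loo_correct <- sapply(1:nrow(my_data), function(i) {
    loo_model <- svm(label ~ ., data = my_data[-i, ], kernel = "radial")
    # Predict the single observation that was left out
    predict(loo_model, my_data[i, ]) == my_data$label[i]
  })
> mean(loo_correct)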

Note that we train as many models as the value of k. At the same time, the size of the training set used for each model also increases with k, so when k is high, we end up with a long overall training time. Sometimes, this may also be a factor in choosing an appropriate value of k, especially when the training time of an individual model is high. A good rule of thumb is to choose a value of 10 for k.

Returning to the analysis we performed in the previous section, remember that we manually tried out different gamma and cost parameters on the same test set because we didn't want to set aside a validation set to do this. The e1071 package provides us with a tune() function that can carry out k-fold cross-validation to determine appropriate values for model meta parameters, such as the SVM cost. To do this, it takes an argument called ranges, which is a list of vectors containing the values of all the parameters that we want to vary across the different runs of cross-validation. In this case, we have two parameters, cost and gamma, so we provide it with two vectors:

> set.seed(2002)
> bdf_radial_tune <- tune(svm, V42 ~ ., data = bdf_train, 
  kernel = "radial", ranges = list(cost = c(0.01, 0.1, 1, 10, 100),     
  gamma = c(0.01, 0.05, 0.1, 0.5, 1)))
> bdf_radial_tune

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation 

- best parameters:
 cost gamma
    1  0.05

- best performance: 0.1194818

As we can see, our k-fold cross-validation predicts that the best parameter combination to use is a cost of 1 and a gamma of 0.05, which, in this case, turns out to be consistent with what we found earlier using performance on a held-out test set. This, of course, will not always be the case; in fact, our cross-validation indicates that the expected performance is actually closer to 88 percent rather than the 89 percent that we found earlier using these parameters. We can deduce this from the output by noting that the best performance listed in the tuning results is the lowest average error obtained across the 10 folds of the training data, so the corresponding accuracy is 1 - 0.1194818, or roughly 88 percent.
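
The object returned by tune() also stores, in its best.model component, a model that has been refit on the entire training set using the winning parameter combination. Rather than retraining a model ourselves, we can retrieve this directly and, assuming the held-out test data frame from the previous section is called bdf_test, evaluate it as follows:

> # Model refit on all of bdf_train with cost = 1 and gamma = 0.05
> bdf_radial_best <- bdf_radial_tune$best.model
> bdf_test_predictions <- predict(bdf_radial_best, bdf_test)
> mean(bdf_test_predictions == bdf_test$V42)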

Note

A very readable book on support vector machines is An Introduction to Support Vector Machines and Other Kernel-based Learning Methods by Nello Cristianini and John Shawe-Taylor. Another good reference that presents an insightful link between SVMs and a related type of neural network known as a Radial Basis Function Network is Neural Networks and Learning Machines by Simon Haykin, which we also referenced in Chapter 4, Neural Networks.
