Cross-validation

Cross-validation is a simple and, in most cases, effective solution for evaluating a model without setting data aside. This process is summarized in the following figure. We take our data and partition it into K portions. We try to keep the portions more or less equal (in size and sometimes also in other features, such as, for example, an equal number of classes). Then, we use K-1 of the portions to train the model and the remaining one to test it. We repeat this procedure systematically, each time leaving a different portion out of the training set and using that portion as the test set, until we have completed K rounds. The results are then averaged over the K runs. This is known as K-fold cross-validation. When K is equal to the number of data points, we get what is known as leave-one-out cross-validation (LOOCV). Sometimes, when doing LOOCV, the number of rounds can be less than the total number of data points if we have a prohibitive number of them:

Figure 5.7: Schematic representation of K-fold cross-validation
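To make the procedure concrete, the following is a minimal sketch using scikit-learn; the synthetic data, the logistic regression model, and the choice of K=5 are assumptions made purely for illustration and are not part of the original text.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

# Synthetic dataset for illustration only (100 points, 2 classes)
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

model = LogisticRegression()

# K-fold cross-validation with K=5: train on 4 folds, test on the held-out fold,
# repeat K times, and average the scores
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
kfold_scores = cross_val_score(model, X, y, cv=kfold)
print("5-fold CV accuracy: {:.3f} +/- {:.3f}".format(kfold_scores.mean(), kfold_scores.std()))

# LOOCV: K equals the number of data points, so each round leaves out a single point
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("LOOCV accuracy: {:.3f}".format(loo_scores.mean()))

Note that LOOCV fits the model once per data point, which is exactly why, as discussed next, it can become computationally expensive for large datasets or slow-to-fit models.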

CV is the bread and butter of machine learning practitioners. There are a few details we are omitting here, but they are not essential for the current discussion. For more details, you can read Python Machine Learning by Sebastian Raschka, or the Python Data Science Handbook by Jake VanderPlas.

Cross-validation is a very simple and useful idea, but for some models or for large amounts of data, its computational cost may be prohibitive. Many people have tried to come up with simpler-to-compute quantities that approximate the results obtained with cross-validation and that work in scenarios where cross-validation is not straightforward to perform. This is the subject of the next section.
