Combining grid search with nested cross-validation

Although grid search with cross-validation makes model selection much more robust, you might have noticed that we still split the data into training and validation sets only once. As a result, our results might depend too heavily on that single split of the data.

Instead of splitting the data into training and validation sets once, we can go a step further and use cross-validation for this split as well. The result is known as nested cross-validation, and the process is illustrated in the following diagram:

In nested cross-validation, an outer loop runs around the grid search, repeatedly splitting the data into training and validation sets. For each of these splits, a grid search is run on the training part, which reports back a set of best parameter values. Then, for each outer split, we compute a test score using those best settings.
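The following is a minimal sketch of this procedure in scikit-learn, where passing a GridSearchCV instance to cross_val_score runs the grid search inside each outer split. The SVC classifier, the parameter grid, and the iris dataset are illustrative assumptions, not fixed by the text above:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.svm import SVC

    iris = load_iris()

    # Illustrative parameter grid for an SVC (an assumption for this sketch)
    param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
                  'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}

    # Inner loop: the grid search picks the best parameters on each training part
    grid_search = GridSearchCV(SVC(), param_grid, cv=5)

    # Outer loop: cross_val_score evaluates the whole selection procedure,
    # yielding one test score per outer split
    scores = cross_val_score(grid_search, iris.data, iris.target, cv=5)
    print("Outer cross-validation scores:", scores)
    print("Mean score:", scores.mean())

Because each of the outer scores comes from a model whose parameters were chosen independently, the result estimates how well the model generalizes when tuned this way; it does not hand us a single fitted model or a single parameter setting to use on new data.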

Running a grid search over many parameters and on large datasets can be computationally demanding. Fortunately, the problem is embarrassingly parallel: building a model with a particular parameter setting on a particular cross-validation split is completely independent of all the other parameter settings and models. This makes parallelization over multiple CPU cores or over a cluster very attractive for grid search and cross-validation.
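As a sketch of how this looks in scikit-learn (reusing param_grid and the iris data from the example above), both GridSearchCV and cross_val_score accept an n_jobs parameter, and n_jobs=-1 uses all available CPU cores:

    # Parallelize the inner grid search over all available CPU cores
    parallel_search = GridSearchCV(SVC(), param_grid, cv=5, n_jobs=-1)
    scores = cross_val_score(parallel_search, iris.data, iris.target, cv=5)

Setting n_jobs=-1 on both the inner and the outer loop at once is usually not helpful, as the nested worker pools compete for the same cores; parallelizing one level is generally enough.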

Now that we know how to find the best parameters of a model, let's take a closer look at the different evaluation metrics that we can use to score a model.
