Learning curves

The learning curve (see the right-hand panel of the preceding chart for our house price regression example) helps determine whether a model's cross-validation performance would benefit from additional data, and whether the prediction errors are driven more by bias or by variance.

If training and cross-validation performance converge, then more data is unlikely to improve performance. At this point, it is important to evaluate whether the model's performance meets expectations, as determined by a human benchmark. If it does not, you should modify the model's hyperparameter settings to better capture the relationship between the features and the outcome, or choose a different algorithm with a higher capacity to capture complexity.

In addition, the variation of the train and test errors shown by the shaded confidence intervals provides clues about the bias and variance sources of the prediction error. Wide bands around the cross-validation error are evidence of variance, whereas the level of the training error points to bias: a persistently high training error suggests the model underfits.

In our example, the cross-validation error has continued to drop, but the incremental improvements have shrunk and the errors have plateaued, so a larger training set is unlikely to yield much benefit. On the other hand, the model shows substantial variance, given that the range of validation errors is much wider than the range of training errors.
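The diagnostics above can be reproduced with scikit-learn's `learning_curve` utility, which refits a model on increasing training-set sizes under cross-validation. The following is a minimal sketch using synthetic regression data and a `Ridge` model as stand-ins for the house price example; the per-fold standard deviations correspond to the shaded confidence bands discussed above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the house price data (hypothetical dataset).
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0,
                       random_state=0)

# Refit the model on growing training fractions with 5-fold CV.
train_sizes, train_scores, valid_scores = learning_curve(
    Ridge(alpha=1.0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring='neg_root_mean_squared_error',
)

# Scores are negated RMSE; flip the sign and summarize mean +/- std
# per training size. The std across folds is what the shaded
# confidence intervals visualize.
train_rmse, valid_rmse = -train_scores, -valid_scores
for n, tr, va in zip(train_sizes, train_rmse, valid_rmse):
    print(f'{n:>4d} samples: '
          f'train RMSE {tr.mean():6.2f} +/- {tr.std():.2f} | '
          f'CV RMSE {va.mean():6.2f} +/- {va.std():.2f}')
```

Plotting the two mean-RMSE curves against `train_sizes`, with shaded bands of one standard deviation, reproduces the learning-curve panel: convergence of the curves signals that more data will not help, while a persistent gap and wide validation bands signal variance.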
