Data splitting

One of the key questions that must be addressed when working on any machine learning model is: how accurate will the model be once it is deployed in production on future data?

This question cannot be answered directly. However, answering it is essential for obtaining buy-in from the commercial teams that ultimately benefit from the model. Dividing the dataset into training and testing datasets comes in handy in such a scenario.

The training dataset is the data used to build the model. The testing dataset is data the model has not seen; that is, its data points are not used in building the model. Essentially, the testing dataset stands in for the data that is likely to arrive in the future. Hence, the accuracy we observe on the testing dataset is a reasonable estimate of the model's accuracy on future data.
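As a concrete illustration, here is a minimal sketch of such a split using scikit-learn's train_test_split (the Iris dataset, the 80/20 ratio, and the random seed are illustrative assumptions, not choices from the text):

```python
# Minimal sketch of a train/test split; the dataset, the 80/20 ratio,
# and the random seed are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows as the "future" (testing) dataset; the model
# never sees these points during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
```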

Typically, in regression, we deal with the problem of generalization versus overfitting. Overfitting arises when the model is so complex that it fits all the training data points perfectly, resulting in the minimum possible error on the data it was trained on. A typical example of an overfitted model looks like this:

From the graph, one can observe that the line (colored in black) does not fit all the data points perfectly, while the curve (colored in blue) passes through every point and hence has minimal error on the data points on which it was trained.

However, the line has a better chance of generalizing to a new dataset than the curve does. Thus, in practice, regression/classification is a trade-off between the generalizability and the complexity of the model, as the sketch below demonstrates.
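To make the trade-off concrete, here is a small sketch on synthetic data (the linear data-generating process, noise level, and polynomial degrees are assumptions for illustration). A straight line and a high-degree polynomial are both fit to ten noisy points; the polynomial interpolates the training points almost exactly but typically does worse on fresh points:

```python
# Sketch: a straight line vs. a high-degree polynomial ("the curve") on
# noisy data drawn from a linear relationship. All data here is synthetic
# and illustrative.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(0, 0.2, size=10)   # truth: a line plus noise

x_test = np.linspace(0.05, 0.95, 50)                  # unseen points
y_test = 2 * x_test + rng.normal(0, 0.2, size=50)

line = np.polyfit(x_train, y_train, deg=1)    # simple model
curve = np.polyfit(x_train, y_train, deg=9)   # interpolates all 10 training points

for name, coeffs in [("line", line), ("curve", curve)]:
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"{name}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```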

The lower the generalizability of the model, the higher the error rate on unseen data points.

This phenomenon can be observed in the following graph. As the complexity of the model increases, the error rate on unseen data points keeps decreasing up to a point, after which it starts increasing again:

The curve colored in blue is the error rate on the training dataset, and the curve colored in red is the error rate on the testing dataset.
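This U-shaped test-error behavior can also be reproduced numerically. The sketch below uses polynomial degree as a stand-in for model complexity (the sinusoidal signal, noise level, and degree range are illustrative assumptions): training error falls steadily as the degree grows, while testing error falls and then rises again:

```python
# Sketch: sweep model complexity (polynomial degree) and track training vs.
# testing error. The data-generating process is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(1)
truth = lambda x: np.sin(2 * np.pi * x)               # assumed "true" signal
x_train = np.sort(rng.uniform(0, 1, 20))
y_train = truth(x_train) + rng.normal(0, 0.3, 20)
x_test = np.sort(rng.uniform(0, 1, 200))
y_test = truth(x_test) + rng.normal(0, 0.3, 200)

for deg in range(1, 13):
    coeffs = np.polyfit(x_train, y_train, deg)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {deg:2d}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```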

The validation dataset is used to obtain the optimal hyperparameters of the model. For example, in techniques such as random forest or GBM, the number of trees to build or the depth of each tree is a hyperparameter. As we vary a hyperparameter, the accuracy on unseen datasets changes.

However, we cannot keep varying the hyperparameters until the test dataset accuracy is the highest, because we would then have effectively peeked at the testing dataset, our stand-in for future data, while building the model.

The validation dataset comes in handy in such scenarios: we train models with different hyperparameter settings on the training dataset and keep varying the hyperparameters until the accuracy on the validation dataset is the highest. That setting forms the optimal hyperparameter combination for the model.
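Here is a minimal sketch of that workflow, assuming scikit-learn's RandomForestClassifier (the dataset, the 60/20/20 split, and the max_depth grid are illustrative assumptions): hyperparameters are chosen by validation accuracy, and the testing dataset is scored exactly once at the end:

```python
# Sketch: tune a hyperparameter on a validation set, then touch the test
# set only once. Split ratios and the max_depth grid are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# 60% train / 20% validation / 20% test.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=0
)

best_depth, best_val_acc = None, 0.0
for depth in [2, 4, 8, 16]:                   # candidate hyperparameter values
    model = RandomForestClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    val_acc = model.score(X_val, y_val)
    if val_acc > best_val_acc:
        best_depth, best_val_acc = depth, val_acc

# Refit with the chosen hyperparameter; the test set is used only here.
final = RandomForestClassifier(max_depth=best_depth, random_state=0)
final.fit(X_train, y_train)
print("best max_depth:", best_depth, "| test accuracy:", final.score(X_test, y_test))
```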
