Time series cross-validation with sklearn

Because the data is a time series, standard cross-validation, which assigns observations to folds without regard to their order in time, produces a situation where data from the future will be used to predict data from the past. This is unrealistic at best and data snooping at worst, to the extent that future data reflects past events.
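The problem can be made concrete with a minimal sketch. It assumes a hypothetical series of ten time-ordered observations (the name data and its size are illustrative, not from the text) and uses sklearn's standard KFold splitter to count the folds in which the training set contains an observation that comes later than the earliest validation observation:

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical data: ten observations, indexed in time order
data = np.arange(10)

# Count folds where at least one training index lies in the "future"
# relative to the earliest validation index
leaky_folds = 0
for train, validate in KFold(n_splits=5).split(data):
    if train.max() > validate.min():
        leaky_folds += 1
    print(train, validate)

print(leaky_folds)
```

Only the last fold trains exclusively on observations that precede its validation set; the other four train on later data.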

To address time dependency, the sklearn.model_selection.TimeSeriesSplit object implements a walk-forward test with an expanding training set, where subsequent training sets are supersets of past training sets, as shown in the following code:

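from sklearn.model_selection import TimeSeriesSplit

data = list(range(10))  # assumed here: ten time-ordered observations
tscv = TimeSeriesSplit(n_splits=5)
for train, validate in tscv.split(data):
    print(train, validate)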

[0 1 2 3 4] [5]
[0 1 2 3 4 5] [6]
[0 1 2 3 4 5 6] [7]
[0 1 2 3 4 5 6 7] [8]
[0 1 2 3 4 5 6 7 8] [9]

You can use the max_train_size parameter to cap the number of training observations. This turns the expanding window into walk-forward cross-validation, where the size of the training set remains constant over time once enough observations are available, similar to how zipline tests a trading algorithm. Scikit-learn facilitates the design of custom cross-validation methods using subclassing, which we will implement in the following chapters.
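As a rough illustration of the fixed-size variant, the sketch below reuses the same setup with an assumed ten-observation series and an arbitrarily chosen max_train_size of 3; both values are illustrative choices, not from the text:

```python
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical data: ten observations, indexed in time order
data = list(range(10))

# Cap each training set at the three most recent observations
tscv = TimeSeriesSplit(n_splits=5, max_train_size=3)
splits = [(train.tolist(), validate.tolist())
          for train, validate in tscv.split(data)]

for train, validate in splits:
    print(train, validate)
```

Each training set now holds only the three observations immediately preceding the validation point, so the window rolls forward rather than growing.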
