Data splitting

Let's now split the data into a training set and a test set. Training the model on one part of the data and testing it on another is the basis for later using the model for prediction. Given a dataset of 192 rows, we split it in a convenient ratio (say, 70:30), which allocates 134 rows for training and 58 rows for testing.

In general, for algorithms based on artificial neural networks, the split is made by selecting rows at random to reduce bias. With time series data, however, the order of the values matters, so random selection is not appropriate. A simple alternative is to divide the ordered dataset into a training part and a test part. The following code calculates the index of the division point and separates the data accordingly: the first 70% of the observations are used to train the model, and the remaining 30% are used to test it:

train_len = int(len(dataset) * 0.70)
test_len  = len(dataset) - train_len
train = dataset[0:train_len,:]
test  = dataset[train_len:len(dataset),:]

The first two lines of code set the lengths of the two groups of data. The next two lines split the dataset into two parts: rows 0 to train_len - 1 (Python indexing is zero-based) form the training set, and rows train_len to the last row form the test set. To confirm that the split is correct, we can print the lengths of the two datasets:

print(len(train), len(test))

This gives the following results:

134 58

As expected, the operation divided the dataset into 134 rows (training set) and 58 rows (test set).
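
The same ordered split can be wrapped in a small helper so that the ratio is easy to change. The following is only a minimal sketch: the function name chronological_split is hypothetical, and the dataset here is a synthetic stand-in of 192 ordered values, assumed to be a two-dimensional NumPy array like the one sliced above.

```python
import numpy as np

def chronological_split(data, train_fraction=0.70):
    """Split an ordered array into train and test parts without shuffling.

    Hypothetical helper, not part of any library.
    """
    train_len = int(len(data) * train_fraction)
    return data[:train_len], data[train_len:]

# Synthetic stand-in for the real series: 192 ordered observations, one column.
dataset = np.arange(192, dtype=np.float32).reshape(-1, 1)

train, test = chronological_split(dataset)
print(len(train), len(test))  # 134 58

# Because nothing is shuffled, every training row precedes every test row.
print(train[-1, 0] < test[0, 0])
```

Since the rows are never reordered, stacking the two parts back together reproduces the original sequence, which is the property a time series model relies on.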
