There's more...

It is also common to split data into three sets: training, validation, and test. The validation set is used for frequent evaluation and for tuning the model's hyperparameters. Suppose we want to train a decision tree classifier and find the optimal value of the max_depth hyperparameter, which sets the maximum depth of the tree. To do so, we train the model multiple times on the training set, each time with a different value of the hyperparameter, and then evaluate the performance of all these models on the validation set. We pick the best of those models and, finally, evaluate its performance on the test set. Because the test set played no part in choosing the hyperparameter, that final score is a less biased estimate of how the model performs on unseen data.
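The procedure described above can be sketched as follows. This is a minimal illustration rather than the recipe's own code: the synthetic dataset from make_classification, the list of candidate depths, and the use of accuracy as the metric are all assumptions made only to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only so that the sketch runs on its own (an assumption)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# 70-10-20 split: first set aside 30%, then divide it into validation and test
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_temp, y_temp, test_size=2 / 3, stratify=y_temp, random_state=42
)

# Train one tree per candidate depth and score each on the validation set
candidate_depths = [2, 4, 6, 8, 10]  # hypothetical grid of values
valid_scores = {}
for depth in candidate_depths:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    valid_scores[depth] = accuracy_score(y_valid, tree.predict(X_valid))

# Pick the depth with the best validation score
best_depth = max(valid_scores, key=valid_scores.get)

# Refit with the chosen depth and evaluate once on the untouched test set
best_tree = DecisionTreeClassifier(max_depth=best_depth, random_state=42)
best_tree.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, best_tree.predict(X_test))

print(f"Best max_depth: {best_depth}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

The test set is touched exactly once, after the hyperparameter has been chosen, which is the point of keeping it separate from the validation set.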

In the following code block, we illustrate a possible way of creating a train-validation-test split, using the same train_test_split function:

# imports needed by this snippet (X and y are the features and target
# prepared earlier in the recipe)
import numpy as np
from sklearn.model_selection import train_test_split

# define the size of the validation and test sets
VALID_SIZE = 0.1
TEST_SIZE = 0.2

# create the initial split - training and temp (validation + test)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y,
    test_size=(VALID_SIZE + TEST_SIZE),
    stratify=y,
    random_state=42
)

# calculate the new test size as a share of the temp set (0.2 / 0.3 = ~0.67)
NEW_TEST_SIZE = np.around(TEST_SIZE / (VALID_SIZE + TEST_SIZE), 2)

# create the valid and test sets
X_valid, X_test, y_valid, y_test = train_test_split(
    X_temp, y_temp,
    test_size=NEW_TEST_SIZE,
    stratify=y_temp,
    random_state=42
)

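We ran train_test_split twice. The first call set aside 30% of the data (validation plus test) and left 70% for training. The second call divided that 30% into the validation and test sets, so we had to express test_size as a share of the temporary set rather than of the full dataset: 0.2 / 0.3, or roughly 0.67. This way, the initially defined proportions (70-10-20) are preserved, up to rounding of the sample counts.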
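As a quick sanity check, the resulting shares can be inspected directly. The sketch below repeats the two-step split on hypothetical toy data (1,000 rows with a 30% positive class, an assumption made purely for illustration) and prints each set's share of the observations together with its share of the positive class.

```python
import numpy as np
from sklearn.model_selection import train_test_split

VALID_SIZE = 0.1
TEST_SIZE = 0.2

# Hypothetical toy data: 1,000 rows, binary target with 30% positives
X = np.arange(2000).reshape(1000, 2)
y = np.array([1] * 300 + [0] * 700)

# Same two-step split as in the recipe
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=(VALID_SIZE + TEST_SIZE), stratify=y, random_state=42
)
NEW_TEST_SIZE = np.around(TEST_SIZE / (VALID_SIZE + TEST_SIZE), 2)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_temp, y_temp, test_size=NEW_TEST_SIZE, stratify=y_temp, random_state=42
)

# Share of all observations in each set, and share of positives within it
shares = {}
positive_rates = {}
for name, target in [("train", y_train), ("valid", y_valid), ("test", y_test)]:
    shares[name] = len(target) / len(y)
    positive_rates[name] = target.mean()
    print(f"{name}: share={shares[name]:.3f}, positives={positive_rates[name]:.3f}")
```

The shares should land close to 0.7, 0.1, and 0.2 rather than exactly on them, because the set sizes are rounded to whole observations and NEW_TEST_SIZE is itself rounded to two decimals. Thanks to stratify, the share of positives should stay near 0.3 in every set.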

Sometimes, we do not have enough data to split it into three sets, either because we have few observations overall or because the data is highly imbalanced and a three-way split would remove valuable samples from the training set. That is why practitioners often use a method called cross-validation, which we describe in the Tuning hyperparameters using grid search and cross-validation recipe.
