Learning curves help us understand how the size of the training dataset influences the performance of a machine learning model. This is very useful when you have to deal with computational constraints. Let's go ahead and plot the learning curves by varying the size of our training dataset.
# Learning curves
from sklearn.model_selection import learning_curve

classifier = RandomForestClassifier(random_state=7)
parameter_grid = np.array([200, 500, 800, 1100])
train_sizes, train_scores, validation_scores = learning_curve(classifier,
        X, y, train_sizes=parameter_grid, cv=5)
print("\n##### LEARNING CURVES #####")
print("\nTraining scores:\n", train_scores)
print("\nValidation scores:\n", validation_scores)
We want to evaluate the performance metrics using training datasets of sizes 200, 500, 800, and 1100 samples. We use five-fold cross-validation, as specified by the cv parameter in the learning_curve function.
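The learning_curve function returns one row per training-set size and one column per cross-validation fold, which is why the plotting code that follows averages the scores along axis=1. Here is a quick sketch to confirm the shapes, assuming the arrays from the previous snippet are still in scope:

# Inspect the shapes of the arrays returned by learning_curve (sketch)
print("\nTrain sizes:", train_sizes)                        # the four sizes we requested
print("\nTraining scores shape:", train_scores.shape)       # (4, 5): 4 sizes x 5 folds
print("\nValidation scores shape:", validation_scores.shape)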
# Plot the curve
plt.figure()
plt.plot(parameter_grid, 100*np.average(train_scores, axis=1), color='black')
plt.title('Learning curve')
plt.xlabel('Number of training samples')
plt.ylabel('Accuracy')
plt.show()
Although smaller training sets appear to achieve higher training accuracy, they are prone to overfitting. Larger training datasets, on the other hand, consume more computational resources. We therefore need to make a trade-off here to pick the right size of the training dataset.
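One way to make this trade-off visible is to plot the cross-validation scores alongside the training scores: the gap between the two curves typically narrows as the training set grows. The following is a minimal sketch that reuses the arrays computed above:

# Plot training and validation accuracy together (sketch)
plt.figure()
plt.plot(parameter_grid, 100*np.average(train_scores, axis=1),
        color='black', label='Training')
plt.plot(parameter_grid, 100*np.average(validation_scores, axis=1),
        color='gray', linestyle='--', label='Validation')
plt.title('Learning curve: training vs. validation')
plt.xlabel('Number of training samples')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

A training-set size where the validation curve has flattened out and sits close to the training curve is usually a reasonable choice.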