Using scikit-learn for k-fold cross-validation

In scikit-learn, cross-validation can be performed in three steps:

  1. Load the dataset. Since we already did this earlier, we don't have to do it again (a reminder sketch of this step appears right after this list).
  2. Instantiate the classifier:
      In [8]: from sklearn.neighbors import KNeighborsClassifier
      ...     model = KNeighborsClassifier(n_neighbors=1)
  3. Perform cross-validation with the cross_val_score function. This function takes as input a model, the full dataset (X), the target labels (y), and an integer value for the number of folds (cv). It is not necessary to split the data by hand; the function does that automatically based on the number of folds. After the cross-validation is completed, the function returns the test scores:
      In [9]: from sklearn.model_selection import cross_val_score
      ...     scores = cross_val_score(model, X, y, cv=5)
      ...     scores
      Out[9]: array([ 0.96666667,  0.96666667,  0.93333333,  0.93333333,  1.        ])
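
As a reminder, here is a minimal sketch of the loading step. The five scores above are consistent with the Iris dataset, so the sketch assumes Iris; if the earlier section used a different dataset, substitute it here:

      from sklearn.datasets import load_iris

      # Assumption: the dataset loaded earlier is Iris; adjust if not.
      iris = load_iris()
      X = iris.data    # feature matrix, shape (150, 4)
      y = iris.target  # class labels, shape (150,)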

To get a sense of how the model did on average, we can look at the mean and standard deviation of the five scores:

In [10]: scores.mean(), scores.std()
Out[10]: (0.95999999999999996, 0.024944382578492935)

With five folds, we get a much better sense of how robust the classifier is. We see that kNN with k=1 achieves 96% accuracy on average, and this value fluctuates from fold to fold with a standard deviation of roughly 2.5%.
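
To see what cross_val_score is doing under the hood, here is a rough, hand-written equivalent. It assumes X and y are the NumPy arrays loaded earlier; note that when a classifier is passed an integer cv, cross_val_score uses stratified folds, so the sketch mirrors that:

      from sklearn.model_selection import StratifiedKFold
      from sklearn.neighbors import KNeighborsClassifier
      import numpy as np

      # cross_val_score with cv=5 and a classifier splits the data into
      # five stratified folds; each fold serves as the test set once.
      manual_scores = []
      for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
          model = KNeighborsClassifier(n_neighbors=1)
          model.fit(X[train_idx], y[train_idx])
          manual_scores.append(model.score(X[test_idx], y[test_idx]))

      print(np.mean(manual_scores), np.std(manual_scores))

Up to floating-point noise, the mean and standard deviation printed here should match the values returned by cross_val_score above.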
