Implementing a bagging classifier

We can, for instance, build an ensemble from a collection of 10 k-NN classifiers as follows:

In [1]: from sklearn.ensemble import BaggingClassifier
...     from sklearn.neighbors import KNeighborsClassifier
...     bag_knn = BaggingClassifier(KNeighborsClassifier(),
...                                 n_estimators=10)

The BaggingClassifier class provides a number of options to customize the ensemble:

  • n_estimators: As shown in the preceding code, this specifies the number of base estimators in the ensemble.
  • max_samples: This denotes the number (or fraction) of samples to draw from the dataset to train each base estimator. The separate bootstrap argument controls how these samples are drawn: setting bootstrap=True samples with replacement (effectively implementing bagging), while setting bootstrap=False samples without replacement (implementing pasting).
  • max_features: This denotes the number (or fraction) of features to draw from the feature matrix to train each base estimator. We can set max_samples=1.0 and max_features<1.0 to implement the random subspace method. Alternatively, we can set both max_samples<1.0 and max_features<1.0 to implement the random patches method.

This gives us the ultimate freedom to implement any kind of averaging ensemble; a sketch of the random subspace variant is shown below. An ensemble can then be fit to data like any other estimator, using the ensemble's fit method.
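To make these options concrete, here is a minimal sketch of a random subspace ensemble of k-NN classifiers. The variable name subspace_knn, as well as the choice of 10 estimators and 50% of the features, are purely illustrative:

    from sklearn.ensemble import BaggingClassifier
    from sklearn.neighbors import KNeighborsClassifier

    # Random subspace method: every estimator is trained on all samples
    # (max_samples=1.0, no bootstrapping), but only on a random 50% of
    # the features.
    subspace_knn = BaggingClassifier(KNeighborsClassifier(n_neighbors=5),
                                     n_estimators=10,
                                     max_samples=1.0,
                                     max_features=0.5,
                                     bootstrap=False,
                                     random_state=3)

Like any other scikit-learn estimator, this ensemble would then be trained with subspace_knn.fit(X_train, y_train) and evaluated with subspace_knn.score(X_test, y_test).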

For example, if we wanted to implement bagging with 10 k-NN classifiers with k = 5, where every k-NN classifier is trained on 50% of the samples in the dataset, we would modify the preceding command as follows:

In [2]: bag_knn = BaggingClassifier(KNeighborsClassifier(n_neighbors=5),
...                                 n_estimators=10, max_samples=0.5,
...                                 bootstrap=True, random_state=3)

In order to observe a performance boost, we have to apply the ensemble to a particular dataset, such as the breast cancer dataset from Chapter 5, Using Decision Trees to Make a Medical Diagnosis:

In [3]: from sklearn.datasets import load_breast_cancer
...     dataset = load_breast_cancer()
...     X = dataset.data
...     y = dataset.target

As usual, we should follow the best practice of splitting the data into training and test sets:

In [4]: from sklearn.model_selection import train_test_split
...     X_train, X_test, y_train, y_test = train_test_split(
...         X, y, random_state=3
...     )

Then, we can train the ensemble using the fit method and evaluate its generalization performance using the score method:

In [5]: bag_knn.fit(X_train, y_train)
...     bag_knn.score(X_test, y_test)
Out[5]: 0.93706293706293708

The performance boost will become evident once we also train a single k-NN classifier on the data:

In [6]: knn = KNeighborsClassifier(n_neighbors=5)
...     knn.fit(X_train, y_train)
...     knn.score(X_test, y_test)
Out[6]: 0.91608391608391604

Without changing the underlying algorithm, we were able to improve our test score from 91.6% to 93.7% by simply letting 10 k-NN classifiers do the job instead of a single one.

You're welcome to experiment with other bagging ensembles. For example, how would you adjust the preceding code snippets to implement the random patches method?

You can find the answer in the Jupyter Notebook, 10.00-Combining-Different-Algorithms-Into-an-Ensemble.ipynb, on GitHub.
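If you would like to check your reasoning before opening the notebook, one possible random patches configuration is sketched below, assuming the imports and the train/test split from earlier in this section. It is not necessarily the notebook's exact solution, and the variable name bag_patches is purely illustrative:

    # Random patches method: each base estimator is trained on a random
    # subset of the samples *and* a random subset of the features.
    bag_patches = BaggingClassifier(KNeighborsClassifier(n_neighbors=5),
                                    n_estimators=10,
                                    max_samples=0.5,   # 50% of the samples
                                    max_features=0.5,  # 50% of the features
                                    bootstrap=True,
                                    random_state=3)
    bag_patches.fit(X_train, y_train)
    bag_patches.score(X_test, y_test)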