Implementing a boosting classifier

For example, we can build a boosting classifier from a collection of 10 decision trees as follows:

In [11]: from sklearn.ensemble import GradientBoostingClassifier
...      boost_class = GradientBoostingClassifier(n_estimators=10,
...                                               random_state=3)

These classifiers support both binary and multiclass classification.
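
For instance, here is a minimal sketch of a three-class problem, using the Iris dataset (which is not part of this chapter's running example) purely to illustrate that the same estimator handles multiclass targets:

from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

# Iris has three classes; the same estimator handles them without any extra setup
X_iris, y_iris = load_iris(return_X_y=True)
multi_boost = GradientBoostingClassifier(n_estimators=10, random_state=3)
multi_boost.fit(X_iris, y_iris)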

Similar to the BaggingClassifier class, the GradientBoostingClassifier class provides a number of options to customize the ensemble (a short example follows the list):

  • n_estimators: This denotes the number of base estimators in the ensemble. A large number of estimators typically results in better performance.
  • loss: This denotes the loss function (or cost function) to be optimized. Setting loss='deviance' implements logistic regression for classification with probabilistic outputs. Setting loss='exponential' actually results in AdaBoost, which we will talk about in a little bit.
  • learning_rate: This denotes the factor by which the contribution of each tree is shrunk. There is a trade-off between learning_rate and n_estimators: a smaller learning rate typically requires more estimators.
  • max_depth: This denotes the maximum depth of the individual trees in the ensemble.
  • criterion: This denotes the function to measure the quality of a node split.
  • min_samples_split: This denotes the minimum number of samples required to split an internal node.
  • max_leaf_nodes: This denotes the maximum number of leaf nodes allowed in each individual tree.
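
As a quick illustration of these options, the following sketch builds a more heavily customized ensemble. The parameter values are illustrative assumptions only, not settings tuned for any particular dataset:

from sklearn.ensemble import GradientBoostingClassifier

# 100 shallow trees, each with its contribution scaled down by the learning rate
custom_boost = GradientBoostingClassifier(n_estimators=100,
                                          learning_rate=0.1,
                                          max_depth=2,
                                          min_samples_split=4,
                                          random_state=3)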

We can apply the boosted classifier to the preceding breast cancer dataset to get an idea of how this ensemble compares to a bagged classifier. But first, we need to reload the dataset:

In [12]: from sklearn.datasets import load_breast_cancer
...      dataset = load_breast_cancer()
...      X = dataset.data
...      y = dataset.target
In [13]: from sklearn.model_selection import train_test_split
...      X_train, X_test, y_train, y_test = train_test_split(
...          X, y, random_state=3
...      )

Then, we find that the boosted classifier achieves 94.4% accuracy on the test set—a little under 1% better than the preceding bagged classifier:

In [14]: boost_class.fit(X_train, y_train)
...      boost_class.score(X_test, y_test)
Out[14]: 0.94405594405594406

We would expect an even better score if we increased the number of base estimators from 10 to 100. In addition, we might want to play around with the learning rate and the depths of the trees.
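
One way to explore these settings, sketched below with an assumed (untuned) parameter grid, is a small grid search over the number of estimators, the learning rate, and the tree depth:

from sklearn.model_selection import GridSearchCV

# Hypothetical grid; the candidate values are assumptions for illustration only
param_grid = {
    'n_estimators': [10, 100],
    'learning_rate': [0.01, 0.1, 1.0],
    'max_depth': [1, 3, 5],
}
search = GridSearchCV(GradientBoostingClassifier(random_state=3),
                      param_grid, cv=5)
search.fit(X_train, y_train)
search.best_params_, search.score(X_test, y_test)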
