Example of AdaBoost with Scikit-Learn

Let's continue using the Wine dataset in order to analyze the performance of AdaBoost with different parameters. As with almost all algorithms, Scikit-Learn implements both a classifier, AdaBoostClassifier (based on the SAMME and SAMME.R algorithms), and a regressor, AdaBoostRegressor (based on the AdaBoost.R2 algorithm). In this case, we are going to use the classifier, but I invite the reader to test the regressor using a custom dataset or one of the built-in toy datasets. In both classes, the most important parameters are n_estimators and learning_rate (whose default value is 1.0). The default underlying weak learner is always a decision tree, but it's possible to employ other models by creating a base instance and passing it through the base_estimator parameter. As explained in the chapter, real-valued AdaBoost algorithms require an output based on a probability vector. In Scikit-Learn, some classifiers/regressors (such as SVMs) don't compute the probabilities unless explicitly requested (by setting the parameter probability=True); therefore, if an exception is raised, I invite you to check the documentation in order to learn how to force the algorithm to compute them.
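For example, this is a minimal sketch (with purely illustrative hyperparameters) showing how an SVM can be plugged in as the underlying weak learner (in more recent Scikit-Learn versions, the parameter has been renamed to estimator):

from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier

# Illustrative sketch: an SVM as base estimator. probability=True is
# required because the real-valued algorithm (SAMME.R) relies on predict_proba()
svc = SVC(kernel='rbf', probability=True, random_state=1000)
adc = AdaBoostClassifier(base_estimator=svc, n_estimators=20, learning_rate=0.8, random_state=1000)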

The examples we are going to discuss have only a didactic purpose, because they focus on a single parameter at a time. In a real-world scenario, it's always better to perform a grid search (which is more expensive), so as to analyze a set of combinations (a minimal sketch is shown at the end of the learning rate analysis). Let's start analyzing the cross-validation score as a function of the number of estimators (the vectors X and Y are the ones defined in the previous example):

import numpy as np

from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

scores_ne = []

# 10-fold CV accuracy for an increasing number of weak learners
for ne in range(10, 201, 10):
    adc = AdaBoostClassifier(n_estimators=ne, learning_rate=0.8, random_state=1000)
    scores_ne.append(np.mean(cross_val_score(adc, X, Y, cv=10)))

We have considered a range starting from 10 trees and ending with 200 trees, in steps of 10. The learning rate is kept constant and equal to 0.8. The resulting plot is shown in the following graph:

10-fold cross-validation accuracy as a function of the number of estimators
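The diagrams in this section can be easily reproduced; for example, a minimal Matplotlib sketch for the previous plot could be:

import matplotlib.pyplot as plt

# Plot the CV accuracy against the number of estimators
plt.figure(figsize=(10, 6))
plt.plot(list(range(10, 201, 10)), scores_ne)
plt.xlabel('Number of estimators')
plt.ylabel('10-fold cross-validation accuracy')
plt.grid(True)
plt.show()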

The maximum is reached with 50 estimators. Larger values cause a performance worsening due to over-specialization and the consequent variance increase. As explained in other chapters, the capacity of a model must be tuned according to the Occam's Razor principle, not only because the resulting model can be faster to train, but also because an excess of capacity is normally saturated by overfitting the training set, reducing the ability to generalize. Cross-validation can immediately show this effect, which can instead remain hidden when a standard training/test split is performed (above all when the samples are not shuffled).
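To observe this effect directly, the reader can compare the cross-validation scores with those obtained through an unshuffled split (a minimal sketch; remember that the Wine samples are ordered by class):

from sklearn.model_selection import train_test_split

# With shuffle=False, the test set contains almost exclusively the last class,
# so the resulting score can be extremely misleading
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, shuffle=False)

adc = AdaBoostClassifier(n_estimators=200, learning_rate=0.8, random_state=1000)
adc.fit(X_train, Y_train)

print(adc.score(X_train, Y_train))
print(adc.score(X_test, Y_test))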

Let's now check the performance with different learning rates (keeping the number of trees fixed):

import numpy as np

scores_eta_adc = []

# 10-fold CV accuracy for 100 learning rates in the range [0.01, 1.0]
for eta in np.linspace(0.01, 1.0, 100):
    adc = AdaBoostClassifier(n_estimators=50, learning_rate=eta, random_state=1000)
    scores_eta_adc.append(np.mean(cross_val_score(adc, X, Y, cv=10)))

The final plot is shown in the following graph:

10-fold cross-validation accuracy as a function of the learning rate (number of estimators = 50)

Again, different learning rates yield different accuracies. The choice of η = 0.8 seems to be the most effective, as both higher and lower values lead to a performance worsening. As explained, the learning rate has a direct impact on the re-weighting process. Very small values require a larger number of estimators, because the subsequent distributions are very similar. On the other hand, large values can lead to a premature over-specialization. Even if the default value is 1.0, I always suggest checking the accuracy with smaller values too. There's no golden rule for picking the right learning rate in every case, but it's important to remember that lower values allow the algorithm to adapt to the training set more smoothly, while higher values reduce the robustness to outliers, because the misclassified samples are immediately boosted and the probability of sampling them increases very rapidly. The result of this behavior is a constant focus on those samples that may be affected by noise, almost forgetting the structure of the remaining sample space.
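As anticipated at the beginning of the section, in a real-world scenario the two parameters should be tuned jointly. The following is a minimal grid-search sketch (the candidate values are purely illustrative):

from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [30, 50, 100, 150],
    'learning_rate': [0.4, 0.6, 0.8, 1.0],
}

# Exhaustive search over all combinations with 10-fold cross-validation
gs = GridSearchCV(AdaBoostClassifier(random_state=1000),
                  param_grid=param_grid, cv=10, n_jobs=-1)
gs.fit(X, Y)

print(gs.best_params_)
print(gs.best_score_)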

The last experiment consists of analyzing the performance after a dimensionality reduction performed with Principal Component Analysis (PCA) and Factor Analysis (FA) (with 50 estimators and η = 0.8):

import numpy as np

from sklearn.decomposition import PCA, FactorAnalysis

scores_pca = []

# 10-fold CV accuracy after a PCA with a decreasing number of components
# (the Wine dataset has 13 features, so i = 13 corresponds to no reduction)
for i in range(13, 1, -1):
    if i < 13:
        pca = PCA(n_components=i, random_state=1000)
        X_pca = pca.fit_transform(X)
    else:
        X_pca = X

    adc = AdaBoostClassifier(n_estimators=50, learning_rate=0.8, random_state=1000)
    scores_pca.append(np.mean(cross_val_score(adc, X_pca, Y, cv=10)))

scores_fa = []

# Same experiment repeated with Factor Analysis
for i in range(13, 1, -1):
    if i < 13:
        fa = FactorAnalysis(n_components=i, random_state=1000)
        X_fa = fa.fit_transform(X)
    else:
        X_fa = X

    adc = AdaBoostClassifier(n_estimators=50, learning_rate=0.8, random_state=1000)
    scores_fa.append(np.mean(cross_val_score(adc, X_fa, Y, cv=10)))

The resulting plot is shown in the following graph:

10-fold cross-validation accuracy as a function of the number of components (PCA and Factor Analysis)

This exercise confirms some important features analyzed in Chapter 5, EM Algorithm and Applications. First of all, the performance is not dramatically affected even by a 50% dimensionality reduction. This consideration is further confirmed by the feature importance analysis performed in the previous example: decision trees can achieve quite a good classification considering only 6-7 features, because the remaining ones offer only a marginal contribution to the characterization of a sample. Moreover, FA is almost always superior to PCA. With 7 components, the accuracy achieved with the FA algorithm is higher than 0.95 (very close to the value achieved with no reduction), while PCA reaches this value only with 12 components. The reader should remember that PCA is a particular case of FA, based on the assumption of homoscedastic noise. The diagram confirms that this condition is not acceptable for the Wine dataset: assuming different noise variances allows remodeling the reduced dataset in a more accurate way, minimizing the cross-effect of the missing features. Even if PCA is normally the first choice, with large datasets I suggest always comparing the performance with that of a Factor Analysis and choosing the technique that guarantees the best result (considering also that FA is more expensive in terms of computational complexity).
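As a final check, the heteroscedasticity of the noise can be verified directly by inspecting the per-feature noise variances estimated by Factor Analysis (a minimal sketch, assuming X is the Wine array defined previously):

from sklearn.decomposition import FactorAnalysis

# Strongly unequal values indicate heteroscedastic noise, which violates
# the implicit homoscedasticity assumption of PCA
fa = FactorAnalysis(n_components=7, random_state=1000)
fa.fit(X)

print(fa.noise_variance_)

If the printed values differ significantly across the features, the more flexible noise model of FA is justified.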
