AdaBoost

In the previous section, we saw that sampling with replacement leads to datasets where the samples are randomly reweighted. However, if M is very large, most of the samples will appear only once and, moreover, all the choices are totally random. AdaBoost is an algorithm proposed by Schapire and Freund that tries to maximize the efficiency of each weak learner by employing adaptive boosting (hence the name). In particular, the ensemble is grown sequentially and the data distribution is recomputed at each step so as to increase the weight of the samples that were misclassified and reduce the weight of the ones that were correctly classified. In this way, every new learner is forced to focus on the regions that were more problematic for the previous estimators. The reader can immediately understand that, contrary to random forests and other bagging methods, boosting doesn't rely on randomness to reduce the variance and improve the accuracy. Rather, it works in a deterministic way, and each new data distribution is chosen with a precise goal. In this section, we are going to consider a variant called Discrete AdaBoost (formally AdaBoost.M1), which requires a classifier whose output is thresholded (for example, -1 and 1). However, real-valued versions (whose output behaves like a probability) have also been developed (a classical example is shown in Additive Logistic Regression: a Statistical View of Boosting, Friedman J., Hastie T., Tibshirani R., Annals of Statistics, 28/1998). As the main concepts are always the same, the reader interested in the theoretical details of the other variants can find them in the referenced papers.

For simplicity, the training dataset of AdaBoost.M1 is defined as follows:
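The equation is not reproduced here; a standard formulation, consistent with the bipolar thresholded outputs mentioned above, would be:

```latex
X = \{ (\bar{x}_1, y_1), (\bar{x}_2, y_2), \ldots, (\bar{x}_M, y_M) \} \quad \text{where } \bar{x}_i \in \mathbb{R}^n \text{ and } y_i \in \{-1, 1\}
```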

This choice is not a limitation because, in multi-class problems, a one-versus-the-rest strategy can easily be employed, even if algorithms such as AdaBoost.SAMME guarantee much better performance. In order to manipulate the data distribution, we need to define a weight set:
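A plausible form of the weight set, consistent with the initialization described later (all weights initially equal to 1/M), is:

```latex
W^{(t)} = \{ w_1^{(t)}, w_2^{(t)}, \ldots, w_M^{(t)} \} \quad \text{with } w_i^{(1)} = \frac{1}{M} \;\; \forall i \in \{1, \ldots, M\}
```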

The weight set allows us to define an implicit data distribution D(t)(x), which is initially equivalent to the original one but can easily be reshaped by changing the values wi. Once the family and the number of estimators, Nc, have been chosen, it's possible to start the global training process. The algorithm can be applied to any kind of learner that is able to produce thresholded estimations (while the real-valued variants can work with probabilities, obtained, for example, through the Platt scaling method).

The first instance d1(x) is trained with the original dataset, which means with the data distribution D(1)(x). The next instances, instead, are trained with the reweighted distributions D(2)(x), D(3)(x), ..., D(Nc)(x). In order to compute them, after each training process, the normalized weighted error sum is computed:
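A plausible reconstruction of this quantity (the weighted fraction of misclassified samples, expressed through the indicator function I(•)) is:

```latex
\epsilon^{(t)} = \frac{\displaystyle\sum_{i=1}^{M} w_i^{(t)} \, I\!\left( d_t(\bar{x}_i) \neq y_i \right)}{\displaystyle\sum_{i=1}^{M} w_i^{(t)}}
```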

This value is bounded between 0 (no misclassifications) and 1 (all samples have been misclassified) and it's employed to compute the estimator weight α(t):
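The standard Discrete AdaBoost choice, which matches the behavior described next (a null weight for ε(t) = 0.5, a weight greater than 1 below ε(t) ≈ 0.27, and α(t) → +∞ as ε(t) → 0), is the log-odds of the error:

```latex
\alpha^{(t)} = \log\left( \frac{1 - \epsilon^{(t)}}{\epsilon^{(t)}} \right)
```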

To understand how this function works, it's useful to consider its plot (shown in the following diagram):

Estimator weight plot as a function of the normalized weighted error sum

This diagram unveils an implicit assumption: the worst classifier is not the one that misclassifies all samples (ε(t) = 1), but a totally random binary guess (corresponding to ε(t) = 0.5). In this case, α(t) is null and, therefore, the output of the estimator is completely discarded. When ε(t) < 0.5, boosting is applied (between about 0.05 and 0.5, the trend is almost linear), but the weight becomes greater than 1 only when ε(t) is below about 0.25 (larger values lead to a penalty, because the weight is smaller than 1). This value is a threshold to qualify an estimator as trusted, or very strong, and α(t) → +∞ in the particular case of a perfect estimator (no errors).

In practice, an upper bound should be imposed in order to avoid overflows or divisions by zero. When ε(t) > 0.5, instead, the estimator is unacceptably weak, because it's worse than a random guess and the resulting boosting would be negative. To avoid this problem, real implementations must invert the output of such estimators, transforming them de facto into learners with ε(t) < 0.5 (this is not an issue, as the transformation is applied to all output values in the same way). It's important to consider that this algorithm shouldn't be applied directly to multi-class scenarios because, as pointed out in Multi-class AdaBoost, Zhu J., Rosset S., Zou H., Hastie T., 01/2006, the threshold 0.5 corresponds to the accuracy of a random guess only for binary choices. When the number of classes is larger than two, a random estimator outputs a class with probability 1/Ny (where Ny is the number of classes) and, therefore, AdaBoost.M1 would boost the classifiers in the wrong way, yielding poor final accuracies (the real threshold should be 1 - 1/Ny, which is larger than 0.5 when Ny > 2). The AdaBoost.SAMME algorithm (implemented by scikit-learn) has been proposed to solve this problem and exploit the power of boosting in multi-class scenarios as well.
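As a quick numerical check of this behavior, the following sketch (assuming the log-odds form α(t) = log((1 − ε(t))/ε(t)) discussed above) shows the null weight of a random guess, the boosting above 1 for small errors, and the negative weight that motivates the output inversion:

```python
import math

def estimator_weight(eps):
    """Estimator weight alpha(t) = log((1 - eps) / eps), a form
    consistent with the plot described in the text."""
    return math.log((1.0 - eps) / eps)

# A random binary guess (eps = 0.5) gets a null weight ...
print(estimator_weight(0.5))   # 0.0
# ... the weight exceeds 1 only below eps ~ 0.27 ...
print(estimator_weight(0.25))  # log(3) ~ 1.099
# ... and an estimator worse than random would get a negative
# weight, which is why its outputs are inverted in practice
print(estimator_weight(0.75))  # -log(3) ~ -1.099
```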

The global decision function is defined as follows:
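A plausible reconstruction, consistent with the sign(•)-based aggregation in the algorithm summary at the end of this section, is:

```latex
\hat{d}(\bar{x}) = \operatorname{sign}\left( \sum_{t=1}^{N_c} \alpha^{(t)} d_t(\bar{x}) \right)
```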

In this way, as the estimators are added sequentially, the importance of each new one will decrease while the accuracy of di(x) increases. However, it's also possible to observe a plateau if the complexity of X is very high. In this case, many learners will have a high weight, because the final prediction must take into account a sub-combination of learners in order to achieve an acceptable accuracy. As this algorithm specializes the learners at each step, a good practice is to start with a small number of estimators (for example, 10 or 20) and increase the number until no further improvement is achieved. Sometimes, a minimal number of good learners (such as SVMs or decision trees) is sufficient to reach the highest possible accuracy (limited to this kind of algorithm), but in other cases, the number of estimators can reach several thousand. Grid search and cross-validation are, again, the only good strategies to make the right choice.

After each training step, it is necessary to update the weights in order to produce a boosted distribution. This is achieved using an exponential function (based on the bipolar outputs {-1, 1}):
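A plausible reconstruction (where the product yi dt(xi) equals 1 for a correct classification and -1 for an error, and Z(t) is chosen so that the new weights sum to 1) is:

```latex
w_i^{(t+1)} = \frac{w_i^{(t)} \, e^{-\alpha^{(t)} y_i d_t(\bar{x}_i)}}{Z^{(t)}} \quad \text{with } Z^{(t)} = \sum_{j=1}^{M} w_j^{(t)} \, e^{-\alpha^{(t)} y_j d_t(\bar{x}_j)}
```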

Given a sample xi, if it has been misclassified, its weight is increased, taking into account the overall estimator weight. This approach allows a further adaptive behavior: a classifier with a high α(t) is already very accurate, and a higher attention level is necessary to focus only on the (few) misclassified samples. On the contrary, if α(t) is small, the estimator must improve its overall performance, and the over-weighting process must be applied to a large subset (therefore, the distribution won't peak around a few samples, but will only penalize the small subset that has been correctly classified, leaving the estimator free to explore the remaining space with the same probability). Even if it's not present in the original proposal, it's also possible to include a learning rate η that multiplies the exponent:
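With the learning rate, the exponent is simply scaled, so a plausible form is:

```latex
w_i^{(t+1)} \propto w_i^{(t)} \, e^{-\eta \, \alpha^{(t)} y_i d_t(\bar{x}_i)}
```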

A value of η = 1 has no effect, while smaller values have been proven to increase the accuracy by avoiding a premature specialization. Of course, when η << 1, the number of estimators must be increased in order to compensate for the weaker reweighting, and this can lead to a training performance loss. As for the other hyperparameters, the right value for η must be discovered using a cross-validation technique (alternatively, if it's the only value that must be fine-tuned, it's possible to start with η = 1 and proceed by decreasing its value until the maximum accuracy has been reached).

The complete AdaBoost.M1 algorithm is as follows:

  1. Set the family and the number of estimators Nc
  2. Set the initial weights W(1) equal to 1/M
  3. Set the learning rate η (for example, η = 1)
  4. Set the initial distribution D(1) equal to the dataset X
  5. For i=1 to Nc:
    1. Train the ith estimator di(x) with the data distribution D(i)
    2. Compute the normalized weighted error sum ε(i):
      1. If ε(i) > 0.5, invert all estimator outputs
    3. Compute the estimator weight α(i)
    4. Update the weights using the exponential formula (with or without the learning rate)
    5. Normalize the weights
  6. Create the global estimator by applying the sign(•) function to the weighted sum of α(i)di(x) (for i=1 to Nc)
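The whole procedure can be sketched in plain Python, using decision stumps on 1-D data as weak learners (a minimal illustration, not scikit-learn's implementation; the formulas for ε(i), α(i), and the exponential reweighting follow the standard Discrete AdaBoost definitions assumed throughout this section):

```python
import math

def stump_predict(x, thr, pol):
    # A decision stump: pol for x >= thr, -pol otherwise
    return pol if x >= thr else -pol

def train_stump(X, y, w):
    """Pick the (threshold, polarity) stump with the lowest weighted
    error: a classic weak learner for AdaBoost."""
    best = None
    for thr in sorted(set(X)):
        for pol in (1, -1):
            err = sum(wi for wi, xi, yi in zip(w, X, y)
                      if stump_predict(xi, thr, pol) != yi)
            if best is None or err < best[0]:
                best = (err, thr, pol)
    return best  # (weighted error, threshold, polarity)

def adaboost_m1(X, y, n_estimators=10, eta=1.0):
    M = len(X)
    w = [1.0 / M] * M                      # initial weights W(1) = 1/M
    ensemble = []
    for _ in range(n_estimators):
        err, thr, pol = train_stump(X, y, w)
        if err > 0.5:                      # worse than a random guess:
            pol, err = -pol, 1.0 - err     # invert the estimator output
        err = max(err, 1e-10)              # bound to avoid division by zero
        alpha = math.log((1.0 - err) / err)
        ensemble.append((alpha, thr, pol))
        # Exponential reweighting (with optional learning rate eta)
        w = [wi * math.exp(-eta * alpha * yi * stump_predict(xi, thr, pol))
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]       # normalize the weights
    return ensemble

def predict(ensemble, x):
    # Global decision: sign of the weighted sum of the estimators
    total = sum(a * stump_predict(x, thr, pol) for a, thr, pol in ensemble)
    return 1 if total >= 0 else -1

# Toy 1-D dataset that no single stump can separate
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1, 1, -1, -1, 1, 1]
model = adaboost_m1(X, y, n_estimators=10)
print([predict(model, x) for x in X])  # matches y after boosting
```

Note how each training round concentrates the weights on the samples misclassified in the previous rounds, exactly as described above, so that the later stumps specialize in the problematic regions.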