Gradient boosting

At this point, we can introduce a more general method of creating boosted ensembles. Let's choose a generic algorithm family, represented as follows:
f(x; \theta_i) \quad i = 1, 2, \ldots, N_c
Each model is parametrized using the vector θ_i and there are no restrictions on the kind of method that is employed. In this case, we are going to consider decision trees (one of the most widespread choices when this boosting strategy is employed; in this case, the algorithm is known as gradient tree boosting), but the theory is generic and can easily be applied to more complex models, such as neural networks. In a decision tree, the parameter vector θ_i is made up of selection tuples, so the reader can think of this method as a pseudo-random forest where, instead of randomness, we look for extra optimality by exploiting the experience accumulated by the previous estimators. In fact, as with AdaBoost, a gradient boosting ensemble is built sequentially, using a technique that is formally defined as Forward Stage-wise Additive Modeling. The resulting estimator is represented as a weighted sum:
d(x) = \sum_{i=1}^{N_c} \alpha_i f(x; \theta_i)
Therefore, the variables to manage are the single estimator weights α_i and the parameter vectors θ_i. However, we don't have to work with the whole set at once, but only with a single tuple (α_i, θ_i) at each iteration, without modifying the values already chosen during the previous iterations. The general procedure can be summarized with a loop:

  1. The estimator sum is initialized to a null value: d_0(x) = 0
  2. For i=1 to N_c:
    1. The best tuple (α_i, θ_i) is chosen and the estimator f(x; θ_i) is trained
    2. d_i(x) = d_{i-1}(x) + α_i f(x; θ_i)
  3. The final estimator d(x) is output
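
To make the loop concrete, the following is a minimal Python sketch (our own illustration, not a reference implementation), assuming scikit-learn regression trees as base learners and a squared-error loss, for which the best new estimator at each step is simply the one fitted on the current residuals (with α_i = 1):

    import numpy as np

    from sklearn.tree import DecisionTreeRegressor

    def forward_stagewise(X, y, n_estimators=50, max_depth=3):
        # 1. The estimator sum is initialized to a null value
        d = np.zeros(y.shape[0])
        ensemble = []
        for i in range(n_estimators):
            # 2.1. For a squared-error loss, the best new estimator is the
            #      one fitted on the residuals of the current additive model
            f = DecisionTreeRegressor(max_depth=max_depth)
            f.fit(X, y - d)
            # 2.2. d_i(x) = d_{i-1}(x) + alpha_i * f(x; theta_i), alpha_i = 1
            d += f.predict(X)
            ensemble.append(f)
        # 3. The final estimator d(x) is the sum of all the trained models
        return ensemble

    def predict(ensemble, X):
        return np.sum([f.predict(X) for f in ensemble], axis=0)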

How is it possible to find the best tuple? We have already presented a strategy for improving the performance of every learner by reweighting (boosting) the dataset. In this case, instead, the algorithm is based on a cost function that we need to minimize:
C = \sum_{j=1}^{N} L\left(y_j, d(x_j)\right)
In particular, the generic optimal tuple is obtained as follows:
(\alpha_i, \theta_i) = \underset{\alpha, \theta}{\operatorname{argmin}} \; \sum_{j=1}^{N} L\left(y_j, d_{i-1}(x_j) + \alpha f(x_j; \theta)\right)
As the process is sequential, each estimator is optimized to improve the previous one's accuracy. However, contrary to AdaBoost, we are not constrained to a specific loss function (it's possible to prove that AdaBoost.M1 is equivalent to this algorithm with an exponential loss, but the proof is beyond the scope of this book). As we are going to discuss, other cost functions can yield better performance in several different scenarios, because they avoid premature convergence towards sub-optimal minima.

The problem could be considered solved by employing the previous formula to optimize each new learner; however, the argmin(•) function requires a complete exploration of the cost function space and, as C(•) depends on each specific model instance and, therefore, on θ_i, several retraining processes would be necessary in order to find the optimal solution. Moreover, the problem is generally non-convex and the number of variables can be very high. Numerical algorithms such as L-BFGS or other quasi-Newton methods would need too many iterations and a prohibitive computational time. It's clear that such an approach is not affordable in the vast majority of cases, and the Gradient Boosting algorithm has been proposed as an intermediate solution. The idea is to find a sub-optimal solution with a gradient descent strategy limited to a single step for each iteration.

In order to present the algorithm, it's useful to rewrite the additive model with an explicit reference to the optimal goal:
d_i(x) = d_{i-1}(x) + \alpha_i f(x; \theta_i) \quad \text{with} \quad (\alpha_i, \theta_i) = \underset{\alpha, \theta}{\operatorname{argmin}} \; \sum_{j=1}^{N} L\left(y_j, d_{i-1}(x_j) + \alpha f(x_j; \theta)\right)
Note that the cost function is computed taking into account all the previously trained models; therefore, the correction is always incremental. If the cost function L is differentiable (a fundamental condition that is not difficult to meet), it's possible to compute the gradient with respect to the current additive model (at the i-th iteration, we need to consider the additive model obtained by summing the previous i-1 models):
\nabla_{d_{i-1}} L = \frac{\partial L(y_j, d_{i-1}(x_j))}{\partial d_{i-1}(x_j)} \quad j = 1, 2, \ldots, N
At this point, a new classifier can be added by moving the current additive model in the direction of the negative gradient:
d_i(x) = d_{i-1}(x) - \eta \nabla_{d_{i-1}} L
We haven't considered the parameter α_i yet (nor the learning rate η, which is a constant); however, the reader familiar with basic calculus can immediately understand that the effect of such an update is to reduce the value of the global loss function by forcing the next model to improve its accuracy with respect to its predecessors. Nevertheless, a single gradient step isn't enough to guarantee an appropriate boosting strategy. In fact, as discussed previously, we also need to weight each classifier according to its ability to reduce the loss. Once the gradient has been computed, it's possible to determine the best value for the weight α_i with a direct minimization of the loss function (using a line search algorithm), computed considering the current additive model with α as an extra variable:
\alpha_i = \underset{\alpha}{\operatorname{argmin}} \; \sum_{j=1}^{N} L\left(y_j, d_{i-1}(x_j) + \alpha f(x_j; \theta_i)\right)
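
Operationally, this minimization over α can be delegated to any scalar optimizer. The following is a minimal sketch based on SciPy's minimize_scalar (the helper name and the per-sample loss signature are our own conventions):

    import numpy as np

    from scipy.optimize import minimize_scalar

    def line_search_alpha(loss, y, d_prev, f_new):
        # Find alpha_i = argmin_a sum_j L(y_j, d_prev_j + a * f_new_j).
        # loss(y, pred) returns the per-sample loss values; d_prev and
        # f_new are the training-set predictions of the current additive
        # model and of the new estimator, respectively.
        result = minimize_scalar(lambda a: np.sum(loss(y, d_prev + a * f_new)))
        return result.x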
When using the gradient tree boosting variant, an improvement can be achieved by splitting the weight α_i into m sub-weights α_i(j), each associated with a terminal node of the tree. The computational complexity is slightly increased, but the final accuracy can be higher than the one obtained with a single weight. The reason derives from the functional structure of a tree: as boosting forces a specialization in specific regions, a single weight could lead to over-weighting a learner even for samples that it cannot correctly classify. Using different weights, instead, it's possible to operate a fine-grained filtering of the result, accepting or discarding an outcome according to its value and to the properties of the specific tree.
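
Formally, following Friedman's original tree boosting formulation, if R_i(1), R_i(2), ..., R_i(m) denote the terminal regions of the i-th tree, each sub-weight is obtained with a separate line search restricted to the samples falling into the corresponding leaf:

\alpha_i^{(j)} = \underset{\alpha}{\operatorname{argmin}} \; \sum_{x_k \in R_i^{(j)}} L\left(y_k, d_{i-1}(x_k) + \alpha\right)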

This solution cannot provide the same accuracy as a complete optimization, but it's rather fast, and it's possible to compensate for the loss using more estimators and a lower learning rate. Like many other algorithms, gradient boosting must be tuned in order to yield the maximum accuracy with a low variance. The learning rate is normally much smaller than 1.0 and its value should be found by validating the results and considering the total number of estimators (it's better to reduce it when more learners are employed). Moreover, a regularization technique can be added in order to prevent overfitting. When working with specific classifier families (such as logistic regression or neural networks), it's very easy to include an L1 or L2 penalty, but it's not so easy with other estimators. For this reason, a common regularization technique (also implemented by scikit-learn) is the downsampling of the training dataset: selecting P < N random samples allows the estimators to reduce the variance and prevent overfitting. Alternatively, it's possible to employ a random feature selection (for gradient tree boosting only), as in a random forest; picking a fraction of the total number of features increases the uncertainty and avoids over-specialization. Of course, the main drawback of these techniques is a loss of accuracy (proportional to the downsampling/feature selection ratio) that must be analyzed in order to find the most appropriate trade-off.
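
In scikit-learn, all these techniques are exposed as hyperparameters of the ensemble classes. A possible configuration is shown below (the specific values are purely illustrative and should be selected via cross-validation):

    from sklearn.ensemble import GradientBoostingClassifier

    # subsample implements the downsampling regularization (each tree is
    # trained on a random fraction of the dataset), while max_features
    # implements the random feature selection discussed above
    clf = GradientBoostingClassifier(n_estimators=500,
                                     learning_rate=0.05,
                                     subsample=0.8,
                                     max_features=0.5)

Following the earlier observation, the larger number of estimators is paired with a lower learning rate.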

Before moving to the next section, it's useful to briefly discuss the main cost functions that are normally employed with this kind of algorithm. In the first chapter, we presented some common cost functions, like mean squared error, Huber loss (very robust in regression contexts), and cross-entropy. They are all valid examples, but there are other functions that are peculiar to classification problems. The first one is Exponential Loss, defined as follows (assuming the labels y_j ∈ {-1, +1}):
L = \sum_{j=1}^{N} e^{-y_j f(x_j)}
As pointed out by Hastie, Tibshirani, and Friedman, this function transforms gradient boosting into an AdaBoost.M1 algorithm. The corresponding cost function has a very precise behavior that sometimes is not the most adequate to solve particular problems. In fact, an exponential loss has a very high impact when the error is large, yielding distributions that are strongly peaked around a few samples. The subsequent classifiers can consequently be driven to over-specialize their structure to cope only with a small data region, with a concrete risk of losing the ability to correctly classify other samples. In many situations, this behavior is not dangerous and the final bias-variance trade-off is absolutely reasonable; however, there are problems where a softer loss function allows a better final accuracy and generalization ability. The most common choice for real-valued binary classification problems is the Binomial Negative Log-Likelihood Loss (deviance), defined as follows (in this case, we are assuming that the classifier f(•) is not thresholded, but outputs a positive-class probability, with y_j ∈ {0, 1}):
L = -\sum_{j=1}^{N} \left( y_j \log f(x_j) + (1 - y_j) \log\left(1 - f(x_j)\right) \right)
This loss function is the same as the one employed in logistic regression and, contrary to Exponential Loss, it doesn't yield peaked distributions. Two misclassified samples with different probabilities are boosted proportionally to the error (not to its exponential value), so as to force the classifiers to focus on the whole misclassified population with almost the same probability (of course, a higher probability assigned to samples whose error is very large is desirable, assuming that all the other misclassified samples always have a good chance of being selected). The natural extension of the Binomial Negative Log-Likelihood Loss to multi-class problems is the Multinomial Negative Log-Likelihood Loss, defined as follows (the classifier f(•) is represented as a probability vector with p components):
L = -\sum_{k=1}^{N} \sum_{j=1}^{p} I_{y_k = j} \log f_j(x_k)
In the previous formula, the notation I_{y=j} must be interpreted as an indicator function, which is equal to 1 when y=j and 0 otherwise. The behavior of this loss function is perfectly analogous to the binomial variant and, in general, it is the default choice for classification problems. The reader is invited to test the examples with both exponential loss and deviance and compare the results.
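
With scikit-learn, the comparison only requires changing the loss parameter of GradientBoostingClassifier; a minimal sketch on a synthetic dataset follows (note that, depending on the version, the deviance is selected with loss='log_loss' or, in older releases, loss='deviance'):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1000, n_features=20,
                               random_state=1000)

    # Compare the deviance with the exponential loss via cross-validation
    for loss in ('log_loss', 'exponential'):
        clf = GradientBoostingClassifier(loss=loss, n_estimators=100,
                                         random_state=1000)
        print(loss, cross_val_score(clf, X, y, cv=10).mean())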

The complete gradient boosting algorithm is as follows:

  1. Set the family and the number of estimators N_c
  2. Select a loss function L (for example, the deviance)
  3. Initialize the base estimator d_0(x) as a constant (such as 0) or using another model
  4. Set the learning rate η (such as η = 1)
  5. For i=1 to N_c:
    1. Compute the gradient ∇_d L(•) with respect to the additive model obtained at step i-1
    2. Train the i-th estimator f(x; θ_i) on the pseudo-residual dataset { (x_j, -∂L(y_j, d_{i-1}(x_j))/∂d_{i-1}(x_j)) }
    3. Perform a line search to compute α_i
    4. Add the weighted estimator to the ensemble: d_i(x) = d_{i-1}(x) + η α_i f(x; θ_i)
  6. The final estimator d(x) is output
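
As a reference, the following is a compact Python implementation of these steps for a binary classification problem with the deviance loss (our own sketch: the additive model d(x) works on the log-odds, the pseudo-residuals are y_j - σ(d_{i-1}(x_j)), all helper names are ours, and a single global weight α_i is used instead of the per-leaf variant):

    import numpy as np

    from scipy.optimize import minimize_scalar
    from sklearn.tree import DecisionTreeRegressor

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def deviance(y, d):
        # Binomial negative log-likelihood computed on the raw log-odds d
        return np.sum(np.logaddexp(0.0, d) - y * d)

    def gradient_boosting(X, y, n_estimators=100, eta=0.1, max_depth=3):
        # Step 3: initialize the base estimator d_0(x) as a constant (0)
        d = np.zeros(y.shape[0])
        ensemble = []
        for i in range(n_estimators):
            # Steps 5.1/5.2: compute the pseudo-residuals (negative
            # gradient) and train the i-th estimator on them
            residuals = y - sigmoid(d)
            f = DecisionTreeRegressor(max_depth=max_depth)
            f.fit(X, residuals)
            h = f.predict(X)
            # Step 5.3: line search for alpha_i
            alpha = minimize_scalar(lambda a: deviance(y, d + a * h)).x
            # Step 5.4: add the weighted estimator to the ensemble
            d += eta * alpha * h
            ensemble.append((eta * alpha, f))
        # Step 6: the ensemble represents the final estimator d(x)
        return ensemble

    def predict_proba(ensemble, X):
        # d(x) is the weighted sum of all the trees; the output is
        # converted into a positive-class probability with the sigmoid
        d = np.sum([w * f.predict(X) for w, f in ensemble], axis=0)
        return sigmoid(d)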
