Simplified split-finding algorithms

The sklearn implementation of gradient boosting finds the optimal split by enumerating all candidate split points for continuous features. This exact greedy algorithm is computationally very demanding because it must first sort the data by feature value before scoring the potentially very large number of split options. The approach becomes problematic when the data does not fit in memory or when training is distributed across multiple machines.
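To make the cost concrete, here is a minimal sketch of exact greedy split finding for a single continuous feature. The function name `best_split_exact` and the variance-reduction criterion are illustrative assumptions, not sklearn's actual internals; note also that this naive version recomputes the child impurities from scratch at every candidate, whereas real implementations use running sums.

```python
import numpy as np

def best_split_exact(x: np.ndarray, y: np.ndarray):
    """Enumerate every candidate threshold after sorting by feature value."""
    order = np.argsort(x)                     # the costly sort step
    x_sorted, y_sorted = x[order], y[order]
    parent_impurity = y.var() * len(y)        # total squared deviation
    best_gain, best_threshold = -np.inf, None
    for i in range(1, len(x_sorted)):         # one candidate per adjacent pair
        if x_sorted[i] == x_sorted[i - 1]:
            continue                          # no valid split between equal values
        left, right = y_sorted[:i], y_sorted[i:]
        child_impurity = left.var() * len(left) + right.var() * len(right)
        gain = parent_impurity - child_impurity
        if gain > best_gain:
            best_gain = gain
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2
    return best_threshold, best_gain

rng = np.random.default_rng(0)
x = rng.normal(size=1_000)
y = (x > 0.5).astype(float) + rng.normal(scale=0.1, size=1_000)
print(best_split_exact(x, y))  # threshold recovered near 0.5
```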

An approximate split-finding algorithm reduces the number of split points by assigning feature values to a user-determined number of bins, which can also greatly reduce the memory requirements during training because only the aggregate statistics for each bin need to be stored. XGBoost introduced a weighted quantile sketch algorithm that divides weighted training samples into percentile bins so that each bin carries an approximately equal share of the total weight. XGBoost also introduced sparsity-aware split finding to handle sparse data caused by missing values, frequent zero entries, and one-hot encoding: it learns an optimal default direction for each split, so the algorithm only needs to evaluate non-missing values.
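The following sketch illustrates the binning idea under simplifying assumptions: the helper `best_split_binned`, the use of unweighted quantiles, and the variance criterion are illustrative and considerably simpler than XGBoost's weighted quantile sketch, but they show how the candidate set shrinks from every observed value to a handful of bin edges.

```python
import numpy as np

def best_split_binned(x: np.ndarray, y: np.ndarray, n_bins: int = 32):
    """Score only quantile bin edges instead of every observed value."""
    # Candidate thresholds shrink from O(n) to at most n_bins - 1 edges.
    quantiles = np.linspace(0, 1, n_bins + 1)[1:-1]
    edges = np.unique(np.quantile(x, quantiles))
    parent_impurity = y.var() * len(y)
    best_gain, best_threshold = -np.inf, None
    for t in edges:
        mask = x <= t
        left, right = y[mask], y[~mask]
        if len(left) == 0 or len(right) == 0:
            continue                          # degenerate split, skip
        child_impurity = left.var() * len(left) + right.var() * len(right)
        gain = parent_impurity - child_impurity
        if gain > best_gain:
            best_gain, best_threshold = gain, t
    return best_threshold, best_gain

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = (x > 0.5).astype(float) + rng.normal(scale=0.1, size=100_000)
print(best_split_binned(x, y))  # 31 candidates instead of ~100,000
```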

In contrast, LightGBM uses gradient-based one-side sampling (GOSS) to exclude a significant proportion of samples with small gradients and uses only the remainder to estimate the information gain and select split values. The rationale is that samples with large gradients require more training and tend to contribute more to the information gain. LightGBM also uses exclusive feature bundling (EFB) to combine mutually exclusive features, that is, features that rarely take nonzero values simultaneously, to reduce the number of features the algorithm must scan. As a result, LightGBM was the fastest implementation when it was released.
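The GOSS sampling step can be sketched as follows. The function `goss_sample` and the parameters `a` (fraction of large-gradient samples kept) and `b` (fraction of the remainder sampled) follow the notation of the LightGBM paper, but this is a simplified illustration rather than LightGBM's implementation; the sampled small-gradient samples are re-weighted by (1 - a) / b so the estimated information gain stays approximately unbiased.

```python
import numpy as np

def goss_sample(gradients: np.ndarray, a: float = 0.2, b: float = 0.1,
                rng: np.random.Generator | None = None):
    """Return sample indices and weights for one GOSS boosting iteration."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(gradients)
    top_k, rand_k = int(a * n), int(b * n)
    order = np.argsort(-np.abs(gradients))   # sort by |gradient|, descending
    top_idx = order[:top_k]                  # keep all large-gradient samples
    rand_idx = rng.choice(order[top_k:], size=rand_k, replace=False)
    weights = np.ones(top_k + rand_k)
    weights[top_k:] = (1 - a) / b            # re-weight the sampled remainder
    return np.concatenate([top_idx, rand_idx]), weights

gradients = np.random.default_rng(1).normal(size=10_000)
idx, w = goss_sample(gradients)
print(f"kept {len(idx)} of {len(gradients)} samples")  # 20% + 10% = 3,000
```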
