Gradient boosting and class imbalance

Ensembles of models (several models combined into one) fall into two main groups: bagging and boosting. Bagging stands for bootstrap aggregation: several submodels are trained on bootstrapped copies of the dataset (resampled with replacement), and their predictions are then aggregated. Each bootstrapped dataset is different, so each model yields different results. Boosting, on the other hand, trains each subsequent model on the residuals from the previous step. At each step, we have the aggregated model built so far and a new model trained on its residuals. The two are then combined into a new aggregated model, weighted so that the overall predictions are as good as possible.
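
The residual-fitting loop described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the implementation used by any particular library: the weak learner is a one-split regression stump, the loss is squared error, and each new stump is added with a fixed learning_rate instead of an optimally chosen weight. The names fit_stump and boost, and the toy data, are hypothetical.

```python
import random


def fit_stump(xs, residuals):
    """Fit a one-split regression stump: pick the threshold on x that
    minimizes the squared error of predicting the mean on each side."""
    best = None
    # Skip the largest x so the right-hand side is never empty.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean


def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Start from the mean of y, then repeatedly fit a stump to the
    residuals of the current aggregated model and add it in."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(ys)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Assumption: a fixed shrinkage factor replaces the optimal weight.
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(predict, xs, ys):
    """Mean squared error of a prediction function on (xs, ys)."""
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Toy data (made up for this sketch): a noisy quadratic.
random.seed(0)
xs = [i / 10 for i in range(50)]
ys = [x * x + random.gauss(0, 0.5) for x in xs]

mean_y = sum(ys) / len(ys)
baseline_mse = mse(lambda x: mean_y, xs, ys)   # constant model
boosted_mse = mse(boost(xs, ys), xs, ys)       # after 50 boosting rounds
print(baseline_mse, boosted_mse)
```

Both errors are measured on the training data, so the comparison only shows that each round reduces the remaining residuals. It says nothing about how well the model generalizes.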

The best-known bagging technique is random forests, which we used earlier in this chapter. Several boosting techniques have become enormously popular in recent years, gradient boosting being the most popular of them.

A separate topic is class imbalance, which changes how we should assess our models' performance. The main problem is that when the data is imbalanced, the model tends to predict the dominant class to the detriment of the others. It is not unusual to see a model assign every observation to the most common class. If 95% of the observations belong to one class, such a model reaches 95% accuracy while never detecting the minority class. There are essentially four ways of addressing this problem:

  • Upsampling, meaning that we resample with replacement from the least frequent classes until they are as frequent as the most common one
  • Downsampling, which works in the opposite direction: we keep only a subsample of the most common classes so that they match the least frequent one
  • Synthetic minority oversampling technique (SMOTE): this one generates new, synthetic minority-class observations by interpolating between existing ones and their nearest neighbors
  • Weighting the classes (usually giving each class a weight inversely proportional to its frequency)

The first three can be applied with any classification technique in caret, whereas the last can be applied only if the underlying method accepts it.
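
To make the resampling and weighting options concrete, here is a minimal sketch in plain Python rather than caret. It is an illustration only: the helper names upsample, downsample, and inverse_frequency_weights are hypothetical, the 90/10 dataset is made up, and the weight formula n / (k * count) is just one common way of making weights inversely proportional to class frequency. SMOTE is left out because it needs a nearest-neighbor search.

```python
import random
from collections import Counter


def upsample(rows, labels, rng):
    """Resample each smaller class with replacement until every class
    is as large as the most common one."""
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        out_rows.extend(rng.choices(pool, k=target - n))
        out_labels.extend([cls] * (target - n))
    return out_rows, out_labels


def downsample(rows, labels, rng):
    """Keep a random subset (without replacement) of each class, sized
    to match the least frequent class."""
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, l in zip(rows, labels) if l == cls]
        out_rows.extend(rng.sample(pool, k=target))
        out_labels.extend([cls] * target)
    return out_rows, out_labels


def inverse_frequency_weights(labels):
    """Weight each class by n / (k * count), so that rarer classes
    receive proportionally larger weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


# Made-up imbalanced dataset: 90 "no" observations and 10 "yes".
rng = random.Random(42)
rows = list(range(100))
labels = ["no"] * 90 + ["yes"] * 10

up_rows, up_labels = upsample(rows, labels, rng)
down_rows, down_labels = downsample(rows, labels, rng)
weights = inverse_frequency_weights(labels)

print(Counter(up_labels), Counter(down_labels), weights)
```

The resampled rows can be passed to any classifier, whereas the weights dictionary is only useful to a method that accepts per-class or per-observation weights. Resampling should be applied to the training data only, so that the evaluation data keeps the original class proportions.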
