There's more...

In this recipe, we showed how to use advanced classifiers to achieve better results. To make things even more interesting, these models have multiple hyperparameters to tune, which can significantly affect their performance.

For brevity, we do not discuss hyperparameter tuning of these models here. We refer to the Notebook in the GitHub repository for a short introduction to tuning these models using randomized grid search. Here, we only present the results, comparing the performance of the models with default settings versus their tuned counterparts.
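As a rough illustration of that tuning approach, the following is a minimal sketch of a randomized search with scikit-learn's RandomizedSearchCV. The parameter distributions, the estimator, and the name rf_rs are assumptions for illustration only; the actual grids are defined in the Notebook, and X_train/y_train are the training data prepared earlier in the recipe.

    from scipy.stats import randint
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV

    # Hypothetical parameter distributions for illustration -- the Notebook
    # in the repository defines its own grids.
    param_distributions = {
        "n_estimators": randint(50, 500),
        "max_depth": randint(2, 20),
        "min_samples_leaf": randint(1, 20),
    }

    # Draw 100 random hyperparameter sets and evaluate each with 5-fold
    # cross-validation, scoring by recall (the problem is imbalanced).
    rf_rs = RandomizedSearchCV(
        estimator=RandomForestClassifier(random_state=42),
        param_distributions=param_distributions,
        n_iter=100,
        scoring="recall",
        cv=5,
        random_state=42,
        n_jobs=-1,
    )
    # rf_rs.fit(X_train, y_train)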

Before describing the results, we briefly go over the details of the considered classifiers:

Random Forest: Random Forest is the first of the models we consider in this recipe, and it is an example of an ensemble of models. It trains a series of smaller models (decision trees) and uses them to create predictions. In the case of a regression problem, it takes the average of the predictions of all the underlying trees; for classification, it uses a majority vote. There are several aspects that make Random Forest stand out:

  • It uses bagging (bootstrap aggregation): each tree is trained on a subset of all available observations, drawn randomly with replacement, so (unless specified otherwise) the number of observations used for each tree is the same as the total in the training set. Even though a single tree might have high variance with respect to a particular dataset, thanks to bagging the forest has lower variance overall, without increasing the bias. Additionally, this can also reduce the effect of any outliers in the data, as they will not appear in all of the trees.
  • Additionally, each tree only considers a subset of all features to create the splits.
  • Thanks to the two mechanisms above, the trees in the forest are largely uncorrelated with each other and can be built independently.

Random Forest is a good algorithm to be familiar with, as it provides a good trade-off between complexity and performance. Often, without any tuning, we can get much better performance than with simpler algorithms, such as decision trees or linear/logistic regression. That is because Random Forest has lower bias (due to its flexibility) and reduced variance (due to aggregating the predictions of multiple trees).
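To make the above concrete, here is a minimal sketch of training such a classifier with scikit-learn. The hyperparameter values are illustrative assumptions only, and X_train/y_train/X_test are assumed to be the data prepared earlier in the recipe.

    from sklearn.ensemble import RandomForestClassifier

    rf = RandomForestClassifier(
        n_estimators=100,     # number of trees in the forest
        bootstrap=True,       # each tree is trained on a bootstrap sample (bagging)
        max_features="sqrt",  # each split considers a random subset of features
        random_state=42,
        n_jobs=-1,
    )
    # rf.fit(X_train, y_train)
    # y_pred = rf.predict(X_test)  # class decided by majority vote of the trees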

Gradient Boosted Trees: Gradient Boosted Trees is another type of ensemble model. The idea is to train many weak learners (decision trees/stumps) and combine them to obtain a strong learner. In contrast to Random Forest, Gradient Boosted Trees is a sequential/iterative algorithm: we start with the first weak learner, and each subsequent learner tries to correct the mistakes of the previous ones by being fitted to the residuals (error terms) of the models built so far.

The term gradient comes from the fact that the trees are built using gradient descent, an optimization algorithm. Without going into too much detail, it uses the gradient (slope) of the loss function to minimize the overall loss and achieve the best performance. The loss function represents the difference between the actual values and the predicted ones. In practice, to carry out the gradient descent procedure in Gradient Boosted Trees, we add to the ensemble a tree that reduces the value of the loss function (that is, one that follows the gradient).

The reason why we create an ensemble of weak learners instead of strong learners is that with strong learners, the remaining errors/mislabeled data points are most likely just noise in the data, so fitting subsequent models to those residuals would cause the overall model to overfit the training data.
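To make the mechanism concrete, here is a toy sketch in which each new regression stump is fitted to the residuals of the current ensemble; with squared-error loss, these residuals coincide with the negative gradient. This is an illustration of the idea only, not the implementation used in the recipe.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    # Toy regression data
    rng = np.random.default_rng(42)
    X = rng.uniform(0, 10, size=(200, 1))
    y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

    learning_rate = 0.1
    prediction = np.zeros_like(y)
    stumps = []

    for _ in range(100):
        residuals = y - prediction  # errors of the current ensemble
        stump = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
        # take a small step in the direction that reduces the loss
        prediction += learning_rate * stump.predict(X)
        stumps.append(stump)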

Extreme Gradient Boosting (XGBoost): XGBoost is an implementation of Gradient Boosted Trees that incorporates a series of improvements resulting in superior performance, both in terms of evaluation metrics and training time. Since its publication, the algorithm has been successfully used to win many data science competitions. In this recipe, we only present a high-level overview of its distinguishing features; for a more detailed description, please refer to the original paper or the documentation. The key concepts of XGBoost are listed below, followed by a short usage sketch:

  • XGBoost combines a pre-sorted algorithm with a histogram-based algorithm to calculate the best splits. This tackles a significant inefficiency of Gradient Boosted Trees: when creating a new branch, they evaluate the potential loss for all possible splits, which becomes especially costly with hundreds or thousands of features.
  • The algorithm uses the Newton-Raphson method for boosting (instead of gradient descent), which provides a more direct route to the minimum of the loss function.
  • XGBoost has an extra randomization parameter to reduce the correlation between the trees.
  • XGBoost combines Lasso (L1) and Ridge (L2) regularization to prevent overfitting.
  • It offers a different (more efficient) approach to tree pruning.
  • XGBoost has a feature called monotonic constraints: the algorithm sacrifices some accuracy and increases the training time in order to improve model interpretability.
  • XGBoost does not take categorical features as input; we must first apply some kind of encoding.
  • The algorithm can handle missing values in the data.
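A minimal usage sketch, assuming the xgboost package is installed and that X_train/y_train/X_test come from the recipe; the hyperparameter values are illustrative only:

    from xgboost import XGBClassifier

    xgb_clf = XGBClassifier(
        n_estimators=100,
        max_depth=3,
        learning_rate=0.1,
        subsample=0.8,         # extra randomization: row subsampling
        colsample_bytree=0.8,  # extra randomization: feature subsampling
        reg_alpha=0.1,         # L1 (Lasso) regularization
        reg_lambda=1.0,        # L2 (Ridge) regularization
        random_state=42,
    )
    # xgb_clf.fit(X_train, y_train)
    # y_pred = xgb_clf.predict(X_test)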

LightGBM: LightGBM, released by Microsoft, is another competition-winning implementation of Gradient Boosted Trees. Thanks to a number of improvements, LightGBM achieves performance similar to XGBoost, but with a faster training time. Some of its key features are listed below, followed by a short usage sketch:

  • The difference in speed is caused by the approach to growing trees. In general, algorithms such as XGBoost grow trees level-wise (horizontally). LightGBM, on the other hand, grows trees leaf-wise (vertically): the leaf-wise algorithm chooses the leaf with the maximum reduction in the loss function. This growth strategy was later added to XGBoost as well (grow_policy = 'lossguide').
  • LightGBM uses a technique called Gradient-based One-Side Sampling (GOSS) to filter out data instances when finding the best split value. Intuitively, observations with small gradients are already well learned by the model, while those with large gradients leave more room for improvement. GOSS retains the instances with large gradients and, additionally, samples randomly from the observations with small gradients.
  • LightGBM uses Exclusive Feature Bundling (EFB) to take advantage of sparse datasets and bundles together features that are mutually exclusive (they never take nonzero values at the same time). This leads to a reduction in the complexity (dimensionality) of the feature space.
  • The model can easily overfit on small datasets.
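A minimal usage sketch, assuming the lightgbm package is installed and that X_train/y_train/X_test come from the recipe; the values below are illustrative only:

    from lightgbm import LGBMClassifier

    lgbm_clf = LGBMClassifier(
        n_estimators=100,
        num_leaves=31,         # leaf-wise growth is controlled via the number of leaves
        learning_rate=0.1,
        min_child_samples=20,  # guards against overfitting on small datasets
        random_state=42,
    )
    # lgbm_clf.fit(X_train, y_train)
    # y_pred = lgbm_clf.predict(X_test)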

Having described the models, we can now compare the results of their default and tuned variants.

For the models calibrated using the randomized search (those with rs in the name), we used 100 random sets of hyperparameters. As the considered problem deals with imbalanced data (the minority class accounts for ~20% of observations), we look at recall for performance evaluation. It seems that the basic decision tree achieved the best recall score on the test set, at the cost of much lower precision than the more advanced models. That is why the F1 Score (the harmonic mean of precision and recall) is the lowest for the decision tree. These results by no means indicate that the more advanced models are inferior; they might simply require more tuning or a different set of hyperparameters. For example, the advanced models had a maximum tree depth enforced, while the decision tree had no such limit (it reached a depth of 37!). The more advanced the model, the more effort it requires to "get it right".
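For reference, a small sketch of how such an evaluation can be computed with scikit-learn; y_test and a fitted classifier (for example, the hypothetical rf_rs from the earlier sketch) are assumed to come from the recipe:

    from sklearn.metrics import f1_score, precision_score, recall_score

    def evaluation_summary(y_true, y_pred):
        """Return the metrics used to compare the classifiers."""
        return {
            "recall": recall_score(y_true, y_pred),
            "precision": precision_score(y_true, y_pred),
            "f1_score": f1_score(y_true, y_pred),
        }

    # Example usage:
    # evaluation_summary(y_test, rf_rs.predict(X_test))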

It is also worth mentioning that the training of the LightGBM model was by far the fastest. However, this could also depend on the random draw of the hyperparameters, so this alone should not be used as a deciding argument about the speed of the algorithms.

There are many different classifiers available to experiment with. Some of the possibilities include:

  • Logistic regression—Often a good starting point to get the baseline
  • Support vector machines (SVMs)
  • Naive Bayes classifier
  • ExtraTrees classifier (also known as Extremely Randomized Trees (ERT))
  • AdaBoost—The first boosting algorithm
  • CatBoost—A recently released algorithm, developed with special attention to dealing with categorical features
  • Artificial neural networks