Random Forest (RF) and Gradient Boosted Machine (GBM) (also referred to as Gradient Boosted Trees) are two examples of ensemble tree-based models currently available in MLlib; you can think of an ensemble as an uber-model that represents a collection of base models. The best way to understand what an ensemble is doing behind the scenes is to consider a simple analogy:
"Suppose that you are the head coach of a famous soccer club and you have heard rumors of an incredible athlete from Brazil and it may be advantageous to sign this young athlete to your club before the other teams do; but your schedule is incredibly busy and instead, you send 10 of your assistant coaches to assess the player. Each one of your assistant coaches grades the player based on his/her coaching philosophy - maybe one coach wants to measure how fast the player can run 40 yards while another coach thinks height and arm-reach are important. Regardless of how each coach defines "athlete potential" you, as head coach, just want to know if you should sign the player to a contract now or wait. And so it goes that your coaches fly down to Brazil and each coach makes an assessment; upon arrival you go up to each of your coaches and ask "should we draft this player now or wait?" and, based on a simple rule like majority vote, you can make your decision. This an example of what an ensemble is doing behind the scenes with respect to a classification task."
You can think of each coach as a decision tree; therefore, you have an ensemble of 10 trees (one for each of the 10 coaches). Just as each coach assesses the player in his or her own way, so does each tree: for each of the 10 trees created, a random subset of features is considered at each node (hence the random in Random Forest; forest because there are many trees!). The reason for introducing this randomness, and for combining many base models, is to prevent over-fitting the data.
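To make the analogy concrete, the following is a minimal sketch of building a 10-tree forest with MLlib's `RandomForestClassifier`. It assumes a training DataFrame named `train` with the usual `label` and `features` columns (these names and the parameter values are illustrative assumptions, not taken from the text):

```scala
import org.apache.spark.ml.classification.RandomForestClassifier

// Assumption: `train` is a DataFrame with "label" and "features" columns.
val rf = new RandomForestClassifier()
  .setNumTrees(10)                   // 10 trees -- one per "assistant coach"
  .setFeatureSubsetStrategy("sqrt")  // K random features considered at each node
  .setMaxDepth(5)
  .setLabelCol("label")
  .setFeaturesCol("features")

// The fitted model predicts by aggregating the votes of the individual trees.
val rfModel = rf.fit(train)
```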
While RF and GBM are both tree-based ensembles, the manner in which they go about training differs in a few ways that deserve mention:
- GBMs must be trained one tree at a time in order to minimize a loss function (for example, log-loss, squared error, and so on), so they usually take longer to train than an RF, which can grow multiple trees in parallel. However, when training a GBM, it is recommended to build shallow trees, which in turn lends itself to faster training.
- RFs generally do not over-fit the data compared to a GBM; that is, we can add more trees to our forest and be less prone to over-fitting than if we added more trees (boosting iterations) to our GBM.
- Hyper-parameter tuning for an RF is much simpler than for a GBM. In their paper, Influence of Hyperparameters on Random Forest Accuracy, Bernard et al. show via experimentation that the number K of random features to select at each node is a key influencer with respect to model accuracy (https://hal.archives-ouvertes.fr/hal-00436358/document). Conversely, a GBM has many more hyper-parameters that must be considered, such as the loss function, learning rate, number of iterations, and so on (see the sketch after this list).
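To give a sense of this larger hyper-parameter surface, here is a comparable sketch using MLlib's `GBTClassifier`, under the same assumption of a `train` DataFrame with `label` and `features` columns; the specific values are illustrative, not recommendations:

```scala
import org.apache.spark.ml.classification.GBTClassifier

// Same assumption: `train` holds "label" and "features" columns.
val gbt = new GBTClassifier()
  .setLossType("logistic")  // loss function to minimize (logistic loss for classification)
  .setStepSize(0.1)         // learning rate
  .setMaxIter(100)          // number of boosting iterations (trees are built sequentially)
  .setMaxDepth(3)           // shallow trees, as recommended for boosting
  .setLabelCol("label")
  .setFeaturesCol("features")

val gbtModel = gbt.fit(train)
```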
As with most "which is better?" questions in data science, choosing between an RF and a GBM is open-ended and very task- and dataset-dependent.
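In practice, that usually means trying both on your own data. The following is one possible sketch (again assuming the `rf` estimator and `train` DataFrame from the earlier sketch) of using cross-validation to tune the K discussed by Bernard et al., exposed in MLlib as the `featureSubsetStrategy` parameter; the same pattern, with a larger grid, can be repeated for `gbt` to compare the two:

```scala
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Tune only K (the per-node feature subset) for the RF defined earlier --
// the single knob Bernard et al. identify as a key influencer on accuracy.
val grid = new ParamGridBuilder()
  .addGrid(rf.featureSubsetStrategy, Array("sqrt", "log2", "onethird"))
  .build()

val cv = new CrossValidator()
  .setEstimator(rf)
  .setEvaluator(new BinaryClassificationEvaluator())  // areaUnderROC by default
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)

val bestRf = cv.fit(train)  // repeat with `gbt` and its (larger) grid to compare
```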