Random forest key concepts

Random forest is a decision-tree-based machine learning method. A decision tree predicts the target variable by growing a hierarchy of successive splits on the input features, choosing each split by its information gain. If this sounds confusing, just think of it as a series of decision rules that result in a prediction at the end. The following diagram may help explain the idea:

A simple diagram of the ID3 decision tree algorithm. Source: Wikimedia Commons
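To see these decision rules concretely, here is a minimal sketch (assuming scikit-learn is available) that grows a shallow tree on the classic iris dataset and prints its rule hierarchy; the entropy criterion is used because it corresponds to splitting on information gain:

```python
# Grow a small decision tree and print its learned decision rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Limit depth so the printed rule hierarchy stays readable.
tree = DecisionTreeClassifier(max_depth=2, criterion="entropy", random_state=0)
tree.fit(X, y)

# export_text renders the tree as the "series of decision rules"
# described above.
print(export_text(tree, feature_names=load_iris().feature_names))
```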

Random forests take this concept a step further by introducing randomness and repetition. From the original training set, each tree gets its own randomized variant, created by bootstrapping: drawing records at random, with replacement, until the sample is (usually) the same size as the original. Because some records repeat and others are left out, bootstrapping simulates different datasets drawn from the same underlying population of data records.
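As a concrete illustration, the following NumPy sketch draws one bootstrap sample (the toy arrays here are made up purely for the example):

```python
# A minimal sketch of bootstrapping: draw n records at random
# *with replacement* to build one resampled training set per tree.
import numpy as np

rng = np.random.default_rng(seed=0)
X = np.arange(10).reshape(10, 1)   # toy feature matrix with 10 records
y = np.arange(10)                  # toy targets

n = len(X)
indices = rng.integers(0, n, size=n)   # sampling with replacement
X_boot, y_boot = X[indices], y[indices]

# Some records appear multiple times, others not at all, simulating a
# different dataset drawn from the same population.
print(sorted(indices.tolist()))
```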

A popular form of random forest also adds randomness while growing each tree: at every branch, the feature to split on is chosen from a random subset of the features rather than from all of them. This whole process is repeated over and over until a forest of decision trees has been grown; a single model often contains hundreds or even thousands of trees. The model then combines the outputs of all the trees, averaging them for regression or taking a majority vote for classification, to arrive at the predicted value:

Random forest visual. All tree variations vote to arrive at the consolidated answer y.
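The sketch below (again assuming scikit-learn) puts these pieces together: many trees, each trained on its own bootstrap sample, each splitting on a random feature subset, and all of them voting on the final prediction. The parameter values are illustrative, not prescriptive:

```python
# Fit a random forest whose trees vote on the predicted class.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=500,      # number of trees in the forest
    max_features="sqrt",   # size of the random feature subset at each split
    bootstrap=True,        # train each tree on its own bootstrap sample
    random_state=0,
)
forest.fit(X_train, y_train)

# Each tree votes; the forest reports the majority class.
print(forest.score(X_test, y_test))
```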

This is a simplified explanation; there is much more to it, including several tuning parameters whose best values depend on the problem being investigated. This is where the art in data science comes into play. Averaging over many decorrelated trees reduces variance through a wisdom-of-crowds effect; the cost is a small increase in bias.
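One common way to handle those tuning parameters is a cross-validated grid search, sketched below with scikit-learn; the parameter names are real scikit-learn options, but the candidate values are only examples:

```python
# A sketch of tuning a few common random forest knobs with
# cross-validated grid search.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

param_grid = {
    "n_estimators": [100, 500],       # how many trees to grow
    "max_depth": [None, 5],           # how deep each tree may grow
    "max_features": ["sqrt", 0.5],    # feature subset size per split
}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```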
