Randomization with random forests

As we saw with bagging, we create a number of bags, and each model is trained on one of them. Each bag consists of a subset of the actual dataset; however, the number of features (variables) remains the same in every bag. In other words, what we performed in bagging is subsetting the dataset rows.
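As a minimal NumPy sketch of this idea (the array names and sizes here are illustrative, not from the text), each bag resamples the rows with replacement while keeping every column:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 20))  # 1,000 observations, 20 features

# Bagging: each bag resamples the rows with replacement,
# but keeps all 20 feature columns.
n_bags = 10
bags = []
for _ in range(n_bags):
    row_idx = rng.choice(len(X), size=len(X), replace=True)
    bags.append(X[row_idx])  # shape (1000, 20): every column retained
```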

In random forests, we still create bags by subsetting the rows of the dataset, but we also subset the features (columns) included in each bag.

Assume that you have 1,000 observations with 20 features in your dataset. We can create 20 bags, each containing 100 observations (this is possible because we bootstrap with replacement) and five features. Twenty models are then trained, each seeing only the bag assigned to it. The final prediction is obtained by majority voting for classification problems or by averaging for regression problems.
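This worked example translates almost directly into code. The sketch below is one way to realize it, assuming scikit-learn's DecisionTreeClassifier as the base learner and a toy binary target; all names are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy binary target

n_bags, bag_rows, bag_feats = 20, 100, 5
models, feat_sets = [], []
for _ in range(n_bags):
    # Bootstrap 100 rows with replacement, and pick 5 distinct features.
    rows = rng.choice(len(X), size=bag_rows, replace=True)
    feats = rng.choice(X.shape[1], size=bag_feats, replace=False)
    tree = DecisionTreeClassifier().fit(X[np.ix_(rows, feats)], y[rows])
    models.append(tree)
    feat_sets.append(feats)

# Classification: final prediction by majority vote across the 20 trees.
# (For regression, you would average the predictions instead.)
votes = np.stack([m.predict(X[:, f]) for m, f in zip(models, feat_sets)])
prediction = (votes.mean(axis=0) > 0.5).astype(int)
```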

Another key difference between bagging and random forests is the ML algorithm used to build the models. In bagging, any ML algorithm may be used to create the base models; random forest models, however, are built specifically using CART.
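In practice you rarely assemble a random forest by hand. The sketch below uses scikit-learn's RandomForestClassifier (an assumption of this example, not something the text prescribes), whose base trees are an optimized CART variant. Note that scikit-learn subsamples features per split rather than per bag, a common variation on the per-bag scheme described above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# n_estimators = number of bags/trees; max_features = how many features
# each split considers (subsampled per split in scikit-learn).
rf = RandomForestClassifier(n_estimators=20, max_features=5, random_state=0)
rf.fit(X_tr, y_tr)
print(rf.score(X_te, y_te))  # accuracy on the held-out split
```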

Random forest modeling is yet another very popular machine learning algorithm. It has repeatedly proved to be among the best-performing algorithms, even when applied to noisy datasets. For anyone who has understood bootstrapping, understanding random forests is a cakewalk.
