Random forest - feature bagging of decision trees

The ensemble technique bagging (which stands for bootstrap aggregating), briefly mentioned in the first chapter, can effectively overcome overfitting. To recap, different sets of training samples are randomly drawn with replacement from the original training data; each set is used to train an individual classification model. The results of these separate models are then combined via majority vote to make the final decision.
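As a quick illustration (our own sketch, not part of the project code), scikit-learn's BaggingClassifier can wrap a base decision tree and perform exactly this bootstrap-and-aggregate procedure; the synthetic dataset below is made up for the example:

>>> from sklearn.ensemble import BaggingClassifier
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.datasets import make_classification
>>> # A small synthetic dataset, purely for illustration
>>> X, y = make_classification(n_samples=1000, random_state=42)
>>> # Each of the 100 trees is fitted on a bootstrap sample drawn
>>> # with replacement; predictions are then aggregated across trees
>>> bagging = BaggingClassifier(DecisionTreeClassifier(),
...                             n_estimators=100, bootstrap=True,
...                             random_state=42)
>>> bagging.fit(X, y)
>>> bagging.predict(X[:5])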

Tree bagging, as previously described, reduces the high variance that a decision tree model suffers from and hence, in general, performs better than a single tree. However, in some cases where one or more features are strong indicators, individual trees are constructed largely based on these features and, as a result, become highly correlated. Aggregating multiple correlated trees will not make much of a difference. To force the trees to be uncorrelated, random forest only considers a random subset of the features when searching for the best splitting point at each node. Individual trees are now trained on different random subsets of features, which guarantees more diversity and better performance. Random forest is a variant of the tree bagging model with additional feature-based bagging.
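The following minimal sketch (again our own, not the library's internal implementation) shows how such a feature subset could be drawn at a single node, assuming a hypothetical 64-dimensional dataset:

>>> import numpy as np
>>> m = 64                   # total number of features (made up)
>>> k = int(np.sqrt(m))      # subset size, sqrt(m) by convention
>>> rng = np.random.default_rng(42)
>>> # A fresh subset of k candidate features is drawn at every node;
>>> # only these are searched for the best splitting point
>>> candidate_features = rng.choice(m, size=k, replace=False)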

To deploy random forest in our click-through prediction project, we will use the RandomForestClassifier class from scikit-learn. Similar to the way we previously implemented the decision tree, we only tweak the max_depth parameter:

>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.model_selection import GridSearchCV
>>> # 'parameters' is the same max_depth grid used for the single
>>> # decision tree, e.g. parameters = {'max_depth': [3, 10, None]}
>>> random_forest = RandomForestClassifier(n_estimators=100,
...                     criterion='gini', min_samples_split=30,
...                     n_jobs=-1)
>>> grid_search = GridSearchCV(random_forest, parameters,
...                            n_jobs=-1, cv=3, scoring='roc_auc')
>>> grid_search.fit(X_train, y_train)
>>> print(grid_search.best_params_)
{'max_depth': None}

Use the model with the optimal parameter, max_depth=None (nodes are expanded until another stopping criterion is met), to predict unseen cases:

>>> from sklearn.metrics import roc_auc_score
>>> random_forest_best = grid_search.best_estimator_
>>> pos_prob = random_forest_best.predict_proba(X_test)[:, 1]
>>> print('The ROC AUC on testing set is: {0:.3f}'.format(
...     roc_auc_score(y_test, pos_prob)))
The ROC AUC on testing set is: 0.724

It turns out that the random forest model gives a lift in performance over the single decision tree.

Although, for demonstration, we only played with the max_depth parameter, there are three other important parameters that we can tune to improve the performance of a random forest model, as shown in the example after this list:

  • max_features: The number of features to consider at each best splitting point search. Typically, for an m-dimensional dataset, √m (rounded) is a recommended value for max_features. This can be specified as max_features="sqrt" in scikit-learn. Other options include "log2" and a fraction of the original features, anywhere from 20% to 50%.
  • n_estimators: The number of trees considered for majority voting. Generally speaking, the more trees, the better the performance, but the longer the computation time. It is usually set to 100, 200, 500, and so on.
  • min_samples_split: The minimal number of samples required for a further split at a node. Too small a value tends to cause overfitting, while too large a value is likely to introduce underfitting. 10, 30, and 50 might be good options to start with.
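For instance, all three could be tuned jointly with the same GridSearchCV setup used above; the candidate values below are illustrative starting points, not prescriptions:

>>> # Illustrative grid over the three parameters discussed above
>>> parameters = {'max_features': ['sqrt', 'log2'],
...               'n_estimators': [100, 200, 500],
...               'min_samples_split': [10, 30, 50]}
>>> random_forest = RandomForestClassifier(criterion='gini', n_jobs=-1)
>>> grid_search = GridSearchCV(random_forest, parameters,
...                            n_jobs=-1, cv=3, scoring='roc_auc')
>>> grid_search.fit(X_train, y_train)
>>> print(grid_search.best_params_)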