However, Will is not satisfied with Jacob’s answers. He wants more suggestions and
recommendations. Therefore, he contacts more friends and asks for their advice. Will’s other
friends also use Jacob’s approach and ask Will a list of questions to provide the best advice.
After going through the suggestions of all his friends, Will picks the most repeated location and
finalizes it for his vacation. This complete approach is exactly what happens in a random forest.
In a random forest, it is not difficult to calculate the relative importance of each feature to a prediction. Python’s ML library scikit-learn provides a handy tool for this purpose, which calculates the importance of every feature. The analysis looks at the tree nodes that use a feature and how much those nodes decrease impurity across the forest. After training, a score is generated for every feature, and the results are scaled so that all the importances sum to one.
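As a minimal sketch, the scores can be read from a fitted forest’s feature_importances_ attribute (shown here on scikit-learn’s built-in iris dataset; the parameter values are illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Fit a forest on a small sample dataset.
iris = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(iris.data, iris.target)

# feature_importances_ holds one impurity-based score per feature;
# scikit-learn scales the scores so that they sum to one.
for name, score in zip(iris.feature_names, forest.feature_importances_):
    print(f"{name}: {score:.3f}")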
Feature importance is valuable because it can help you decide whether to drop a feature. A feature can be dropped when it contributes nothing to the prediction. Bear in mind that in ML, more features can lead to an overfitting issue.
To speed up the model or improve the prediction process, a random forest offers several hyperparameters. Let’s go over some of the hyperparameters from scikit-learn; a short configuration sketch follows the list.
• The “n_jobs” hyperparameter tells the engine how many processors it is allowed to use. A value of −1 means there is no limit, while a value of 1 means no more than one processor may be used.
• The “random_state” hyperparameter makes the output of the model replicable. The model is bound to generate identical results if it is fed identical training data, identical hyperparameters, and a fixed random_state value.
• There is also “oob_score”, which is essentially a cross-validation method. Out-of-bag sampling leaves roughly one-third of the data out of each tree’s bootstrap sample and uses it to evaluate performance.
• The “n_estimators” hyperparameter is the total number of trees the algorithm builds before averaging the predictions or taking the majority vote. A larger number of trees usually slows computation but makes the predictions more stable.
• The “max_features” hyperparameter sets the maximum number of features the algorithm may consider when splitting a node in a single tree.
• Lastly, there is the “min_samples_leaf” hyperparameter. It sets the minimum number of samples that must remain in a leaf node; a split of an internal node is considered only if it leaves at least that many samples in each branch.
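As a hedged sketch, these hyperparameters map onto scikit-learn’s RandomForestClassifier as follows (the specific values are illustrative, not recommendations):

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=200,      # number of trees in the forest
    max_features="sqrt",   # features considered at each split
    min_samples_leaf=2,    # minimum samples required at a leaf node
    n_jobs=-1,             # -1 uses all available processors
    random_state=42,       # fixes the randomness for replicable results
    oob_score=True,        # evaluate on the out-of-bag samples
)
# After forest.fit(X, y), forest.oob_score_ reports the out-of-bag accuracy.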
To create a random forest, you would go through the following steps (a brief from-scratch sketch follows the list).
1. From the given total of “m” features, pick “k” features at random.
2. Among the “k” features, calculate the node “d” using the best split point.
3. Split the node into daughter nodes using the best split.
4. Keep repeating the above steps until the required number of nodes “l” is reached.
5. Generate the forest by repeating the above steps to produce “n” trees.
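A minimal from-scratch sketch of these steps, assuming numpy arrays as input and scikit-learn’s DecisionTreeClassifier as the base learner (the function name build_forest is hypothetical):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_forest(X, y, n_trees=10, k_features="sqrt", seed=None):
    """Grow n_trees trees, each on a bootstrap sample, with a random
    subset of k features considered at every split."""
    rng = np.random.default_rng(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap: sample rows with replacement (the "bagging" step).
        idx = rng.integers(0, len(X), size=len(X))
        # max_features restricts each split to k random features; the tree
        # itself computes the best split point for each node "d".
        tree = DecisionTreeClassifier(max_features=k_features,
                                      random_state=int(rng.integers(1 << 31)))
        tree.fit(X[idx], y[idx])
        forest.append(tree)
    return forest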
For executing predictions, consider the following pseudocode (a matching code sketch appears after the steps).
1. Take the test features and apply the rules of each (randomly generated) decision tree to predict an outcome, then store each result. Each predicted outcome is referred to as a target.
2. Count the votes for each predicted target.
3. The predicted target with the highest number of votes becomes the final prediction of the algorithm.
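Continuing the sketch above, majority voting over the trees’ predicted targets could look like this (predict_forest is a hypothetical helper):

import numpy as np

def predict_forest(forest, X_test):
    # Step 1: each tree predicts a target for every test row.
    votes = np.stack([tree.predict(X_test) for tree in forest])
    # Steps 2-3: count the votes per row and keep the most-voted target.
    final = []
    for column in votes.T:
        values, counts = np.unique(column, return_counts=True)
        final.append(values[np.argmax(counts)])
    return np.array(final)

For regression, the vote would simply be replaced by the average of the trees’ outputs.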