Improving classification accuracy using random forests

Random forests (also sometimes called random decision forests, RDFs) are ensembles of decision trees and are among the most successful machine learning models for classification and regression. They combine many decision trees in order to reduce the risk of overfitting. Like decision trees, random forests handle categorical features, extend to the multiclass classification setting, do not require feature scaling, and are able to capture nonlinearities and feature interactions. RFs have numerous advantages; in particular, by combining many decision trees they can overcome the overfitting problem that a single tree exhibits on its training dataset.

A forest in an RF (or RDF) usually consists of hundreds to thousands of trees, each trained on a different bootstrap sample of the same training set. More technically, an individual tree that is grown very deep tends to learn highly irregular patterns, and this nature of deep trees creates overfitting problems on the training set: such a tree has low bias but very high variance, so it performs poorly on new data even when your dataset quality is good in terms of the features presented. An RF, on the other hand, averages multiple decision trees together with the goal of reducing the variance and ensuring consistency, and it can additionally compute proximities between pairs of cases.
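As a hedged aside (this formula does not appear in the text above; it is the standard bagging argument, with the symbols sigma and rho introduced here for illustration), the variance reduction from averaging can be made precise. If each of the B trees has variance \sigma^2 and any two trees have pairwise correlation \rho, the variance of their average is

\mathrm{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2

As B grows, the second term vanishes and only \rho\,\sigma^2 remains; randomizing the feature subsets lowers the correlation \rho between trees, which is exactly why a random forest improves on averaging identical trees.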

However, this comes at the cost of a small increase in bias and some loss of interpretability of the results; eventually, though, the performance of the final model increases dramatically. When using an RF as a classifier, the parameters are set as follows (a training sketch that puts them together appears after this list):

  • The numTrees parameter sets the number of trees in the forest. If it is 1, no bootstrapping is used at all; if it is > 1, bootstrapping is performed.
  • The featureSubsetStrategy parameter controls the number of features considered for splits at each node. The supported values are auto, all, sqrt, log2, and onethird, as well as numerical values in the range (0.0, 1.0] or [1, number of features]. If featureSubsetStrategy is set to auto, the algorithm chooses the best feature subset strategy automatically: if numTrees == 1, featureSubsetStrategy is set to all; if numTrees > 1 (that is, a forest), it is set to sqrt for classification.
  • If a real value n in the range (0, 1.0] is set, n * number_of_features features are used; if an integer value n in the range [1, number of features] is set, exactly n features are used.
  • The categoricalFeaturesInfo parameter, which is a map, is used for specifying arbitrary categorical features. An entry (n -> k) indicates that feature n is categorical with k categories indexed from 0: {0, 1,...,k-1}.
  • The impurity criterion is used only for the information gain calculation. The supported values are gini (or entropy) for classification and variance for regression.
  • The maxDepth is the maximum depth of the tree (for example, depth 0 means 1 leaf node, depth 1 means 1 internal node + 2 leaf nodes, and so on).
  • The maxBins signifies the maximum number of bins used for splitting the features, where the suggested value is 100 to get better results.
  • Finally, the random seed is used for bootstrapping and choosing feature subsets, so that the results are reproducible despite the random nature of the algorithm.
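The following is a minimal sketch (not taken from the text above) showing how these parameters fit together in Spark's RDD-based MLlib API. It assumes a spark-shell session with an existing SparkContext named sc and a hypothetical LIBSVM-formatted input file; the path, split ratios, and parameter values are illustrative only:

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.util.MLUtils

// Load and split the data (the path is a placeholder)
val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

// Parameter settings discussed above
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]() // empty map: all features treated as continuous
val numTrees = 50                             // > 1, so bootstrapping is performed
val featureSubsetStrategy = "auto"            // resolves to sqrt for classification when numTrees > 1
val impurity = "gini"
val maxDepth = 5
val maxBins = 100                             // the suggested value mentioned above
val seed = 12345                              // fixed seed for reproducible results

val model = RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins, seed)

// Estimate the test error on the held-out split
val testErr = testData.map { point =>
  if (model.predict(point.features) == point.label) 0.0 else 1.0
}.mean()
println(s"Test error = $testErr")

Increasing numTrees generally reduces variance at the cost of training time, while maxDepth and maxBins trade the capacity of each individual tree against its risk of overfitting.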

As already mentioned, RFs are fast and scalable enough for large-scale datasets, so Spark is a suitable technology for implementing them and exploiting that scalability. However, if the proximities are calculated, storage requirements grow quadratically with the number of training cases, since a proximity is stored for every pair of cases.
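To make that storage cost concrete (an illustrative calculation, not a figure from the text above): for a training set of one million cases, the full proximity matrix would contain 10^12 entries, roughly 8 TB at 8 bytes per entry, which is why proximities are rarely computed on truly large-scale data.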
