C4.5

C4.5 works similarly to ID3, but uses the gain ratio as a partitioning criterion, which partly resolves the issue mentioned previously. Another advantage is that it accepts numeric attributes, which it splits into categories: the value of the split is selected so as to minimize the entropy of the resulting partitions of the considered attribute. A further difference from ID3 is that C4.5 allows post-pruning, which is a bottom-up simplification of the tree intended to avoid overfitting to the training data.
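To illustrate how a split value on a numeric attribute can be chosen to decrease entropy, here is a minimal sketch in Python. The function names (`entropy`, `best_numeric_split`) are illustrative, not part of any actual C4.5 implementation: it tries each midpoint between consecutive sorted values and keeps the one giving the lowest weighted entropy.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_numeric_split(values, labels):
    """Return (threshold, weighted_entropy) for the binary split of a
    numeric attribute that minimizes the weighted entropy of the two
    partitions, trying midpoints between consecutive distinct values."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal attribute values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if weighted < best[1]:
            best = (threshold, weighted)
    return best
```

For instance, with values `[1, 2, 3, 10, 11, 12]` and classes `["a", "a", "a", "b", "b", "b"]`, the split at 6.5 separates the classes perfectly and yields a weighted entropy of 0.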

The gain ratio

Using the gain ratio as the partitioning criterion overcomes a shortcoming of ID3, which is to prefer attributes with many modalities as nodes, because such attributes tend to have a higher information gain. The gain ratio divides the information gain by a value called split information. This value is computed as minus the sum, over the modalities of the attribute, of the ratio of the number of cases in the modality to the number of cases to partition upon, multiplied by the base-2 logarithm of that same ratio.

The formula for split information is provided here:

$$\mathrm{SplitInfo}(A) = -\sum_{i=1}^{m} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}$$

where D is the set of cases to partition and D_1, ..., D_m are the subsets of D corresponding to the m modalities of attribute A. The gain ratio is then:

$$\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}(A)}$$
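The computation described above can be sketched in Python as follows; the function names (`split_information`, `gain_ratio`) are illustrative, not taken from any particular library:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_information(attribute_values):
    """Minus the sum over modalities of (|D_i|/|D|) * log2(|D_i|/|D|)."""
    n = len(attribute_values)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(attribute_values).values())

def gain_ratio(attribute_values, labels):
    """Information gain of the attribute divided by its split information."""
    n = len(labels)
    # Information gain: entropy before the split minus the weighted
    # entropy of the subsets induced by each modality.
    gain = entropy(labels) - sum(
        (len(subset) / n) * entropy(subset)
        for subset in (
            [lab for v, lab in zip(attribute_values, labels) if v == modality]
            for modality in set(attribute_values)
        )
    )
    return gain / split_information(attribute_values)
```

With an attribute taking two equally frequent modalities that separate two classes perfectly, both the gain and the split information equal 1 bit, so the gain ratio is 1.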

Post-pruning

A problem frequently observed using predictive algorithms is overfitting. When overfitting occurs, the classifier is very good at classifying observations in the training dataset, but classifies unseen observations (for example, those from a testing dataset) poorly. Pruning reduces this problem by decreasing the size of the tree bottom up (replacing a node by a leaf on the basis of the most frequently observed class), thereby making the classifier less sensitive to noise in the training data.

There are several ways to perform pruning. C4.5 implements pessimistic pruning, in which a measure of incorrectly classified observations, computed on the training data only, is used to determine which nodes are to be replaced by leaves. For each subtree, a first measure is computed as the sum of its errors plus 0.5 multiplied by its number of leaves. If the number of errors made by replacing the subtree with a leaf labeled with its most frequent class, plus 0.5, is within one standard error of the previously computed value, the subtree is replaced by that leaf.
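The pruning test above can be sketched as follows. This is a simplified illustration of the rule, under the assumption that error counts are binomially distributed; the function names (`pessimistic_error`, `should_prune`) are illustrative and not part of any C4.5 implementation:

```python
import math

def pessimistic_error(errors, leaves):
    """Continuity-corrected error count: observed errors plus 0.5 per leaf."""
    return errors + 0.5 * leaves

def should_prune(subtree_errors, subtree_leaves, leaf_errors, n_cases):
    """Replace a subtree by a leaf if the leaf's corrected error count is
    within one standard error of the subtree's corrected error count,
    treating errors as binomially distributed over the n_cases covered."""
    e_subtree = pessimistic_error(subtree_errors, subtree_leaves)
    std_err = math.sqrt(e_subtree * (n_cases - e_subtree) / n_cases)
    e_leaf = pessimistic_error(leaf_errors, 1)  # a single replacement leaf
    return e_leaf <= e_subtree + std_err
```

For example, a subtree with 5 leaves and 2 training errors over 20 cases has a corrected error of 4.5; replacing it by a leaf making 3 errors gives a corrected error of 3.5, which is within one standard error, so the subtree would be pruned.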
