C4.5

C4.5 works similarly to ID3, but uses the gain ratio as a partitioning criterion, which partly resolves the issue mentioned previously. Another advantage is that it accepts numeric attributes, which it splits into categories: the value of the split is selected so as to minimize the entropy of the resulting partitions of the considered attribute. A further difference from ID3 is that C4.5 allows post-pruning, which is a bottom-up simplification of the tree intended to avoid overfitting to the training data.
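To illustrate how a split value on a numeric attribute can be chosen to decrease entropy, here is a minimal sketch in Python. The function names (`entropy`, `best_numeric_split`) are illustrative, not part of any actual C4.5 implementation: it tries each midpoint between consecutive sorted values and keeps the one giving the lowest weighted entropy.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_numeric_split(values, labels):
    """Return (threshold, weighted_entropy) for the binary split of a
    numeric attribute that minimizes the weighted entropy of the two
    partitions, trying midpoints between consecutive distinct values."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal attribute values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if weighted < best[1]:
            best = (threshold, weighted)
    return best
```

For instance, with values `[1, 2, 3, 10, 11, 12]` and classes `["a", "a", "a", "b", "b", "b"]`, the split at 6.5 separates the classes perfectly and yields a weighted entropy of 0.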

The gain ratio

Using the gain ratio as the partitioning criterion overcomes a shortcoming of ID3, which is to prefer attributes with many modalities as nodes, because such attributes tend to have a higher information gain. The gain ratio divides the information gain by a value called split information. This value is computed as minus the sum, over the modalities of the attribute, of the ratio of the number of cases in the modality to the number of cases to partition upon, multiplied by the base-2 logarithm of that same ratio.

The formula for split information is provided here:

$$\mathrm{SplitInfo}(A) = -\sum_{i=1}^{m} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}$$

where D is the set of cases to partition and D_1, ..., D_m are the subsets of D corresponding to the m modalities of attribute A. The gain ratio is then:

$$\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}(A)}$$
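The computation described above can be sketched in Python as follows; the function names (`split_information`, `gain_ratio`) are illustrative, not taken from any particular library:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_information(attribute_values):
    """Minus the sum over modalities of (|D_i|/|D|) * log2(|D_i|/|D|)."""
    n = len(attribute_values)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(attribute_values).values())

def gain_ratio(attribute_values, labels):
    """Information gain of the attribute divided by its split information."""
    n = len(labels)
    # Information gain: entropy before the split minus the weighted
    # entropy of the subsets induced by each modality.
    gain = entropy(labels) - sum(
        (len(subset) / n) * entropy(subset)
        for subset in (
            [lab for v, lab in zip(attribute_values, labels) if v == modality]
            for modality in set(attribute_values)
        )
    )
    return gain / split_information(attribute_values)
```

With an attribute taking two equally frequent modalities that separate two classes perfectly, both the gain and the split information equal 1 bit, so the gain ratio is 1.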

Post-pruning

A problem frequently observed using predictive algorithms is overfitting. When overfitting occurs, the classifier is very good at classifying observations in the training dataset, but classifies unseen observations (for example, those from a testing dataset) poorly. Pruning reduces this problem by decreasing the size of the tree bottom up (replacing a node by a leaf on the basis of the most frequently observed class), thereby making the classifier less sensitive to noise in the training data.

There are several ways to perform pruning. C4.5 implements pessimistic pruning, in which a measure of incorrectly classified observations, computed on the training data only, is used to determine which nodes are to be replaced by leaves. For each subtree, a first measure is computed as the sum of its errors plus 0.5 multiplied by its number of leaves. If the number of errors made by replacing the subtree with a leaf labeled with its most frequent class, plus 0.5, is within one standard error of the previously computed value, the subtree is replaced by that leaf.
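The pruning test above can be sketched as follows. This is a simplified illustration of the rule, under the assumption that error counts are binomially distributed; the function names (`pessimistic_error`, `should_prune`) are illustrative and not part of any C4.5 implementation:

```python
import math

def pessimistic_error(errors, leaves):
    """Continuity-corrected error count: observed errors plus 0.5 per leaf."""
    return errors + 0.5 * leaves

def should_prune(subtree_errors, subtree_leaves, leaf_errors, n_cases):
    """Replace a subtree by a leaf if the leaf's corrected error count is
    within one standard error of the subtree's corrected error count,
    treating errors as binomially distributed over the n_cases covered."""
    e_subtree = pessimistic_error(subtree_errors, subtree_leaves)
    std_err = math.sqrt(e_subtree * (n_cases - e_subtree) / n_cases)
    e_leaf = pessimistic_error(leaf_errors, 1)  # a single replacement leaf
    return e_leaf <= e_subtree + std_err
```

For example, a subtree with 5 leaves and 2 training errors over 20 cases has a corrected error of 4.5; replacing it by a leaf making 3 errors gives a corrected error of 3.5, which is within one standard error, so the subtree would be pruned.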
