ID3

ID3 is one of the simplest algorithms for producing decision trees with categorical classes and attributes. We chose to explain it because of its simplicity (though we will not examine its use here). We will then build upon this understanding when discussing the other algorithms.

ID3 relies on a measure called information gain to build the trees. The goal is to maximize the predictive power of the tree by reducing the uncertainty in the data.

Entropy

Entropy is a measure of uncertainty in a source of information. We discuss it before we talk about information gain, as information gain relies on the computation of entropy.

Entropy is easily understood using an example. Let's consider three opaque boxes, each containing 100 M&Ms. In box 1, there are 99 red M&Ms and 1 yellow. In box 2, there are as many red M&Ms as yellow ones (50 of each). In box 3, there are 25 red M&Ms and 75 yellow. Knowing this, we want to guess the color of the next M&M we pick from each of the boxes.

As you may have guessed, it is easiest to guess the color of the picked M&M in the first box, because there is only one yellow M&M in it; in this case, there is almost no uncertainty. By contrast, uncertainty is maximal in the second box, because the M&M we pick is equally likely to be red or yellow. Finally, the third box carries an intermediate amount of uncertainty: picking a yellow M&M is three times as likely as picking a red one.

Entropy quantifies what we intuitively grasped from our prior knowledge of the content of the boxes. In the following formula for the computation of entropy (where c is the number of classes), p_i refers to the probability of each of the outcomes:

Entropy = - p_1 * log2(p_1) - p_2 * log2(p_2) - ... - p_c * log2(p_c)

Let's examine the entropy for the M&Ms problem.

Here, entropy is measured as minus the probability of red multiplied by the base two logarithm of the probability of red, minus the probability of yellow multiplied by the base two logarithm of the probability of yellow:

- p_red * log2(p_red) -  p_yellow * log2(p_yellow)

In box 1, the entropy is:

- (99/100) * log2(99/100) - (1/100) * log2(1/100) #= 0.08079314

In box 2, the entropy is:

- (50/100) * log2(50/100) - (50/100) * log2(50/100) # = 1

In box 3, the entropy is:

- (25/100) * log2(25/100) - (75/100) * log2(75/100) #= 0.8112781

The entropy of attributes containing more than two categories is computed in the same way: it is the sum of - p_category * log2(p_category) over all categories, where p_category is the probability of each category considered in turn. For instance, let's imagine a yellow M&M is replaced by a blue M&M in the third box. The entropy in that box is now:

- (25/100) * log2(25/100) - (74/100) * log2(74/100) - (1/100) *
    log2(1/100)  #= 0.8878966
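
To make it easier to check such values, here is a minimal sketch of an entropy() helper (our own convenience function, not part of ID3 or of any R package) that takes a vector of category counts and returns the entropy in bits:

# hypothetical helper: entropy, in bits, of a vector of category counts
entropy = function(counts) {
  p = counts / sum(counts)   # turn counts into probabilities
  p = p[p > 0]               # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

entropy(c(99, 1))       # box 1, approximately 0.0808
entropy(c(50, 50))      # box 2, exactly 1
entropy(c(25, 75))      # box 3, approximately 0.8113
entropy(c(25, 74, 1))   # box 3 after the swap, approximately 0.8879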

Now that entropy is understood, let's discuss information gain.

Information gain

In the context of decision trees, information gain measures the reduction in uncertainty obtained by a partition of the data. It is computed by subtracting the entropy after the partition is performed from the entropy before the partition. To illustrate this, let's take the example of the Titanic data we have discussed. We need to preprocess the data first:

# prepare the dataset: convert the Titanic contingency table to a data frame
Titanic.Weights = as.data.frame(Titanic)
# replicate each row by its frequency to obtain one row per passenger
Titanic.df = Titanic.Weights[rep(1:nrow(Titanic.Weights),
   Titanic.Weights$Freq),]
# keep only the Class, Sex, Age, and Survived columns (drop Freq)
Titanic.df = Titanic.df[,1:4]

The following code shows how many people did and didn't survive on the Titanic:

table(Titanic.df$Survived)

The output is provided here:

No    Yes 
1490  711

The entropy among all cases is computed as:

EntropAll = -(1490/2201) * log2(1490/2201) -(711/2201) *   
   log2(711/2201) #= 0.9076514

The following code displays how many of the male and female individuals did and didn't survive:

table(Titanic.df$Sex,Titanic.df$Survived)

Here is the output:

         No  Yes
Male   1364  367
Female  126  344

The entropy among males is:

EntropM = -(1364/1731) * log2(1364/1731) -(367/1731) * 
   log2(367/1731) # = 0.745319

The entropy among females is:

EntropF = -(126/470) * log2(126/470) -(344/470) * 
   log2(344/470)  # = 0.8387034

The entropy after the partition is performed is computed as the sum of the entropies for each category weighted by the proportion of the observations they contain:

EntropSex = ( (1731/ 2201) * EntropM ) + ((470/ 2201) * 
   EntropF ) # = 0.7652602

The information gain is:

InformationGain = EntropAll - EntropSex  # = 0.1423912

While we provided an example only for sex, ID3 computes the information gain for every attribute that has not been selected at a higher node. It iteratively creates the next node by selecting the attribute that provides the highest information gain, and continues until there are no unused attributes left or until a node contains only observations of the same class. In either case, a leaf node is created. Generally speaking, ID3 favors splits on attributes with a large number of categories, and can therefore produce inefficient trees.
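
To make this procedure concrete, here is a minimal sketch of how the root split of a tree grown on Titanic.df could be chosen. The information.gain() function is our own illustration (not a reference ID3 implementation) and reuses the hypothetical entropy() helper sketched earlier:

# hypothetical helper: information gain of splitting target on attribute
information.gain = function(attribute, target) {
  entropy.before = entropy(table(target))
  # proportion of observations falling into each category of the attribute
  weights = table(attribute) / length(attribute)
  # entropy of the target within each category of the attribute
  entropy.per.level = sapply(levels(attribute), function(lvl)
    entropy(table(target[attribute == lvl])))
  entropy.before - sum(weights * entropy.per.level)
}

# information gain of each candidate attribute at the root node
sapply(c("Class", "Sex", "Age"), function(attr)
  information.gain(Titanic.df[[attr]], Titanic.df$Survived))

For Sex, this should reproduce the value of about 0.142 obtained by hand above; ID3 would select whichever attribute yields the largest gain and then repeat the computation within each resulting subset.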
