Chapter 2

Decision Trees

2.1 Introduction

A decision tree (DT) is a statistical model used for classification. This machine-learning approach classifies the data in a dataset into classes and represents the result as a flowchart-like tree structure [1]. A record is classified by flowing through a query structure from the root until it reaches a leaf, which represents one class. The root represents the attribute that plays the main role in classification, and each leaf represents a class. The DT model follows the steps outlined below to classify data:

  1. It places all training examples at the root.

  2. It divides the training examples based on selected attributes.

  3. It selects attributes using a statistical measure.

  4. Recursive partitioning continues until no training examples remain, no attributes remain, or all remaining training examples belong to the same class.

In other words, a DT is induced greedily in a top-down fashion. This top-down induction of decision trees (TDIDT) [2] is the most common strategy used to learn DTs from data, as sketched below.
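To make the recursive procedure concrete, the following is a minimal MATLAB-style sketch of TDIDT. It is illustrative only and is not the implementation developed later in this chapter; the helper select_attribute (which would rank attributes by a measure such as information gain) is hypothetical, and the data are assumed to be encoded as integer category codes.

% build_tree.m -- illustrative sketch of top-down induction of a decision tree (TDIDT)
% data:       numeric matrix, one row per training example, categorical values coded as integers
% labels:     numeric vector of class codes
% attributes: vector of column indices still available for splitting
function node = build_tree(data, labels, attributes)
    % Stop when all remaining examples share one class or no attribute remains.
    if numel(unique(labels)) == 1 || isempty(attributes)
        node = struct('leaf', true, 'class', mode(labels));   % majority (or only) class
        return;
    end
    best = select_attribute(data, labels, attributes);  % hypothetical helper, e.g. highest information gain
    node = struct('leaf', false, 'attribute', best, 'values', [], 'children', {{}});
    vals = unique(data(:, best));                        % branches only for values that occur,
    for k = 1:numel(vals)                                % so no child call receives an empty dataset
        idx = data(:, best) == vals(k);
        node.values(k) = vals(k);
        node.children{k} = build_tree(data(idx, :), labels(idx), setdiff(attributes, best));
    end
end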

DT learning is supervised, because the tree is constructed from class-labeled training tuples. In this chapter, we will discuss two DT algorithms:

  1. ID3 (Iterative Dichotomiser 3)

  2. C4.5 (Successor of ID3)

The statistical measure used in ID3 to select the attribute that best splits the dataset with respect to the given classes is information gain, whereas in C4.5 the criterion is gain ratio. Both measures are closely related to another concept called entropy; indeed, information gain is defined in terms of entropy.

In Section 2.2, we discuss entropy with practical examples.

2.2 Entropy

The concept of entropy used in machine learning algorithms comes from the field of information theory. It is a measure of the uncertainty in a random variable. Shannon entropy quantifies this uncertainty as the expected value of the information contained in a message:

H(X) = \sum_{i} P(x_i)\, I(x_i) = -\sum_{i} P(x_i) \log_b P(x_i)
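As a quick illustration of this formula (with b = 2, so the result is in bits), the following small MATLAB function is a sketch; the name shannon_entropy and its counts-based interface are our own choices for illustration, not part of the chapter's later code.

% shannon_entropy.m -- entropy in bits of a discrete distribution given as class counts
function H = shannon_entropy(counts)
    p = counts ./ sum(counts);   % convert counts to probabilities P(x_i)
    p = p(p > 0);                % drop zero-probability outcomes (0*log2(0) is taken as 0)
    H = -sum(p .* log2(p));      % H(X) = -sum_i P(x_i) log2 P(x_i)
end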

2.2.1 Example

If we toss a fair coin, the probabilities of the outcomes head and tail are equal, that is, 1/2. In such a maximally uncertain environment, one full bit of information is delivered by each toss of the fair coin. When the coin is unfair or biased, there is less uncertainty (entropy), and hence the information can be stored in fewer bits. Tables 2.1 and 2.2 describe the outcomes of four tossing experiments with a fair coin and with an unfair coin, respectively.

The entropy of dataset D1 can be calculated as follows:

Entropy(D1) = −P(H)log2 P(H) − P(T)log2 P(T)
Entropy(D1) = −(2/4)log2(2/4) − (2/4)log2(2/4)
Entropy(D1) = −(1/2)log2(1/2) − (1/2)log2(1/2)
Entropy(D1) = −(1/2)(−1) − (1/2)(−1)
Entropy(D1) = 1

Entropy(D2) = −P(H)log2 P(H)
Entropy(D2) = −(4/4)log2(4/4)
Entropy(D2) = 0, which means the uncertainty is zero for the unfair coin.
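Using the shannon_entropy sketch given above, these two results can be reproduced directly:

shannon_entropy([2 2])   % dataset D1 (2 heads, 2 tails): returns 1
shannon_entropy([4 0])   % dataset D2 (4 heads, 0 tails): returns 0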

Table 2.1 Dataset D1 Describing Outcomes of Four Experiments with Fair Coin

Toss No.    Outcome
1           H
2           T
3           H
4           T

Table 2.2 Dataset D2 Describing Outcomes of Four Experiments with Unfair Coin

Toss No.    Outcome
1           H
2           H
3           H
4           H

2.2.2 Understanding the Concept of Number of Bits

Now consider the example where three possible outcomes exist:

Opinion No. for a Political Issue    Outcome
1                                    Positive
2                                    Positive
3                                    Negative
4                                    Neutral

Entropy(D3) = −P("Positive")log2 P("Positive") − P("Negative")log2 P("Negative") − P("Neutral")log2 P("Neutral")
Entropy(D3) = −(2/4)log2(2/4) − (1/4)log2(1/4) − (1/4)log2(1/4)
Entropy(D3) = −(1/2)(−1) − (1/4)(−2) − (1/4)(−2)
Entropy(D3) = 0.5 + 0.5 + 0.5
Entropy(D3) = 1.5
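The same result follows from the shannon_entropy sketch given earlier:

shannon_entropy([2 1 1])   % dataset D3 (2 Positive, 1 Negative, 1 Neutral): returns 1.5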

If there were four equally likely classes, the entropy would be 2 bits, with the four possible values 00, 01, 10, and 11. One can understand the meaning of 1 or 2 bits, but what is the meaning of 1.5 bits? We answer this question in the next paragraph.

Let us apply simple Huffman coding to the above dataset. The final Huffman code is as follows:

Symbol      Code
Positive    0 [1-bit code]
Negative    10 [2-bit code]
Neutral     11 [2-bit code]

The expected code length under this encoding is (1/2)(1 bit) + (1/4)(2 bits) + (1/4)(2 bits) = 1.5 bits, which is exactly the entropy of the dataset.

The concept of bits can be used to assess the potential for compressing a message. In the machine-learning world, the same concept is used for a different purpose. To build a DT for classification, the goal is to find the attribute that most reduces the entropy (information content) of the whole system when a single dataset is split into multiple datasets. If we are able to split our dataset so that the resulting datasets are purer (with less or no uncertainty, i.e., lower entropy), the job of classification can proceed further. In order to build a DT for classification, attribute selection is therefore a vital step.

2.3 Attribute Selection Measure

In C4.5, the attribute selection measure is gain ratio, which relies on a quantity called SplitInfo. However, to appreciate the worth of this measure, we first analyze information gain, the measure used in ID3 to select attributes when building a DT.

2.3.1 Information Gain of ID3

We will begin with an example to explain the concept of information gain.

Opinion No. for a Political Issue    Gender    Age         Outcome
1                                    Male      Below 60    Positive
2                                    Male      Above 60    Positive
3                                    Female    Above 60    Negative
4                                    Female    Below 60    Neutral

Analyzing the above dataset, one will intuitively select "Gender" as the best choice for the split. One can derive a simple rule, such as "if Gender is Male then the opinion is Positive, else it may be Negative or Neutral." But why was the attribute "Gender" selected? Can we tell the machine to select "Gender" in such a scenario to split the data? The measure information gain performs this job: the attribute selected for splitting is the one with the highest information gain.

The formula of information gain for attribute A is given below:

Gain(A) = Info(D) - Info_A(D)

where:

Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)
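As a sketch of how Info_A(D) can be computed, the following hypothetical MATLAB helper (expected_info, reusing the shannon_entropy sketch from Section 2.2) takes a matrix with one row per value of attribute A and one column per class, where each entry is a record count:

% expected_info.m -- illustrative sketch of Info_A(D)
% counts: rows = values of attribute A, columns = classes, entries = record counts
function I = expected_info(counts)
    total = sum(counts(:));                    % |D|
    I = 0;
    for j = 1:size(counts, 1)
        Dj = sum(counts(j, :));                % |Dj|, size of the j-th child dataset
        I = I + (Dj / total) * shannon_entropy(counts(j, :));
    end
end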

Gender    Age         Outcome
Male      Below 60    Positive
Male      Above 60    Positive
Female    Above 60    Negative
Female    Below 60    Neutral

For the above dataset, there are two candidate attributes on the basis of which a split can be made.

The entropy, Info(D), of the above dataset is 1.5 (see the entropy calculation for dataset D3).

The attribute "Gender" has two possible values, "Male" and "Female." If "Gender" is selected as the attribute to split the above dataset, the split results in two small child datasets. Similarly, the attribute "Age" splits the dataset into two small child datasets, one for each of its two values present in the data.

Dataset (Dgender = male):

Age         Outcome
Below 60    Positive
Above 60    Positive

Info(Dgender = male) = −(2/2)log2(2/2) = 0

Dataset (Dgender = Female):

Age         Outcome
Above 60    Negative
Below 60    Neutral

Info(Dgender = Female) = −(1/2)log2(1/2) − (1/2)log2(1/2) = 1
InfoGender(D) = (2/4)(0) + (2/4)(1) = 0.5
Gain(Gender) = 1.5 − 0.5 = 1

Dataset (DAge = Below60):

Gender    Outcome
Male      Positive
Female    Neutral

Entropy(DAge = Below60) = −(1/2)log2(1/2) − (1/2)log2(1/2)
Entropy(DAge = Below60) = −(1/2)(−1) − (1/2)(−1) = 1

Dataset (DAge = Above60):

Gender    Outcome
Male      Positive
Female    Negative

Entropy(DAge = Above60) = −(1/2)log2(1/2) − (1/2)log2(1/2)
Entropy(DAge = Above60) = −(1/2)(−1) − (1/2)(−1) = 1
InfoAge(D) = (2/4)(1) + (2/4)(1) = 1
Gain(Age) = 1.5 − 1 = 0.5
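Under the hypothetical shannon_entropy and expected_info sketches given earlier, the two gains can be checked numerically as follows (the class order in each count vector is Positive, Negative, Neutral):

info_D      = shannon_entropy([2 1 1]);           % Info(D) = 1.5
info_gender = expected_info([2 0 0; 0 1 1]);      % rows: Male, Female -> 0.5
info_age    = expected_info([1 0 1; 1 1 0]);      % rows: Below 60, Above 60 -> 1
gain_gender = info_D - info_gender                % 1
gain_age    = info_D - info_age                   % 0.5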

Because Gain(Gender) > Gain(Age), we will select the attribute "Gender." There is no competitor for the attribute "Age" in the next iteration; therefore, it will be selected for further partitioning.

2.3.2 The Problem with Information Gain

The problem with information gain as a measure for selecting the attribute to partition on is that, in its quest for pure partitions, it can select attributes that are meaningless from a machine-learning point of view. We demonstrate the problem with the same example but with a new attribute, "Opinion no.," the serial number of each opinion on the political issue:

Opinion No.    Gender    Age         Outcome
1              Male      Below 60    Positive
2              Male      Above 60    Positive
3              Female    Above 60    Negative
4              Female    Below 60    Neutral

Now three candidate attributes compete to be selected for splitting. Since we have already calculated the information gain for "Gender" and "Age" in Section 2.3.1, we only calculate the information gain for the attribute "Opinion no." This attribute divides the dataset into four pure child datasets.

Dataset (D"Opinion no" = 1):

Gender    Age         Outcome
Male      Below 60    Positive

Dataset (D"Opinion no" = 2):

Gender    Age         Outcome
Male      Above 60    Positive

Dataset (D"Opinion no" = 3):

Gender    Age         Outcome
Female    Above 60    Negative

Dataset (D"Opinion no" = 4):

Gender    Age         Outcome
Female    Below 60    Neutral

Entropy(D"Opinion no" = 1) = 0
Entropy(D"Opinion no" = 2) = 0
Entropy(D"Opinion no" = 3) = 0
Entropy(D"Opinion no" = 4) = 0
Info"Opinion no"(D) = 0
Gain("Opinion no") = 1.5 − 0 = 1.5
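With the same sketches, the gain of "Opinion no." can be confirmed: each of its four child datasets is pure, so the expected information of the split is 0.

info_opinion = expected_info([1 0 0; 1 0 0; 0 1 0; 0 0 1]);   % four pure children -> 0
gain_opinion = 1.5 - info_opinion                              % 1.5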

Thus, Gain("Opinion no") > Gain("Gender") > Gain("Age"); therefore, the attribute "Opinion no." is selected to split the dataset. Since "Opinion no." has a large number of distinct values, it produces numerous perfectly pure child datasets that are useless for machine-learning purposes. This drawback is due to an inherent deficiency of information gain: it prefers the attribute that divides the parent dataset into child datasets with the least entropy, regardless of how many children are produced. This bias led to the proposal of another measure, called gain ratio.

2.4 Implementation in MATLAB®

The implementation in MATLAB will help the reader understand the concept in more practical detail. Using the small example dataset discussed above, we illustrate the problem with the attribute selection measure "information gain."

The file containing the small dataset is a simple comma-separated values (CSV) file that is used as the input source to bring the data into the MATLAB environment. The file name is "infogainconcept.csv." The following code shows how the data from the CSV file is loaded into MATLAB.

%test.m
% Read the example dataset and separate the input features from the class labels.

fid = fopen('C:\infogainconcept.csv');
out = textscan(fid, '%s%s%s%s', 'delimiter', ',');    % four string columns
fclose(fid);
num_featureswithclass = size(out, 2);                 % number of columns, including the class
tot_rec = size(out{num_featureswithclass}, 1) - 1;    % number of records (excluding the header row)
for i = 1:tot_rec
    yy{i} = out{num_featureswithclass}{i+1};          % class labels (skip the header)
end
for i = 1:num_featureswithclass
    xx{i} = out{i};                                   % each column, header included
end
[a, b] = infogain(xx, yy);

The function "infogain" (in the corresponding file, infogain.m) takes as input xx (containing all input features) and yy (containing the class feature). In order to calculate the information gain of each feature, the entropy of the whole dataset must be calculated first.
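The book's listing for this step is not reproduced in this extract; a minimal sketch of such a dataset-entropy calculation, here wrapped in a hypothetical helper named entropy_of_labels, could look as follows (yy is the cell array of class labels built in test.m):

% entropy_of_labels.m -- illustrative sketch, not the book's listing
function H = entropy_of_labels(labels)
    classes = unique(labels);                                  % distinct class values
    H = 0;
    for c = 1:numel(classes)
        p = sum(strcmp(labels, classes{c})) / numel(labels);   % relative frequency, always > 0 here
        H = H - p * log2(p);
    end
end

Inside infogain, the dataset entropy would then be obtained as info_D = entropy_of_labels(yy).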

The function then computes the information gain of each feature.
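The corresponding listing is also not reproduced here; a sketch of the per-feature calculation, under the same assumptions (xx{i} is a cell array of strings with the header in its first row, as produced by test.m, and entropy_of_labels is the hypothetical helper sketched above), might be:

% Illustrative sketch of the information gain calculation inside infogain
info_D       = entropy_of_labels(yy);                % entropy of the whole dataset
num_features = numel(xx) - 1;                        % last column of xx is the class feature
tot_rec      = numel(yy);                            % number of records
info_gains   = zeros(num_features, 1);
for f = 1:num_features
    vals     = xx{f}(2:end);                         % feature values, header row skipped
    distinct = unique(vals);
    info_A   = 0;
    for v = 1:numel(distinct)
        idx    = strcmp(vals, distinct{v});          % records where feature f takes this value
        Dj     = sum(idx);
        info_A = info_A + (Dj / tot_rec) * entropy_of_labels(yy(idx));
    end
    info_gains(f) = info_D - info_A;                 % Gain(A) = Info(D) - Info_A(D)
end
[max_gain, max_gain_feature] = max(info_gains);      % feature with the highest information gain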

This calculation determines the expected information of the split induced by each feature and then the information gain of each feature. For the three features, the following values of information gain are obtained:

  1. Num: 1.5000

  2. Gender: 1

  3. Age: 0.500

Since the information gain of the attribute "Num" is the highest, it would be used as the splitting point. In this scenario, attribute selection on the basis of the information gain criterion is useless, and the criterion therefore needs improvement. "Gain ratio," an improved attribute selection criterion, is discussed next.

2.4.1 Gain Ratio of C4.5

In order to solve the bias problem of information gain, another measure called gain ratio was proposed. The idea behind the new measure is simple and intuitive: penalize attributes that make many splits. Hence, attributes like "Opinion no." are penalized for excessive partitioning. The extent of partitioning is measured by SplitInfo, which normalizes the information gain to give the gain ratio:

SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\!\left(\frac{|D_j|}{|D|}\right)

GainRatio(A) = \frac{Gain(A)}{SplitInfo_A(D)}

We will again use our small dataset to illustrate the concept:

Opinion No.    Gender    Age         Outcome
1              Male      Below 60    Positive
2              Male      Above 60    Positive
3              Female    Above 60    Negative
4              Female    Below 60    Neutral

SplitInfo"Opinion no"(D) = −(1/4)log2(1/4) − (1/4)log2(1/4) − (1/4)log2(1/4) − (1/4)log2(1/4)
SplitInfo"Opinion no"(D) = 0.5 + 0.5 + 0.5 + 0.5 = 2
SplitInfoAge(D) = −(2/4)log2(2/4) − (2/4)log2(2/4) = 0.5 + 0.5 = 1
SplitInfoGender(D) = −(2/4)log2(2/4) − (2/4)log2(2/4) = 0.5 + 0.5 = 1
GainRatio("Opinion no") = Gain("Opinion no")/SplitInfo"Opinion no"(D) = 1.5/2 = 0.75
Similarly, GainRatio(Gender) = Gain(Gender)/SplitInfoGender(D) = 1/1 = 1
GainRatio(Age) = Gain(Age)/SplitInfoAge(D) = 0.5/1 = 0.5
Hence GainRatio(Gender) > GainRatio("Opinion no") > GainRatio(Age)

Therefore, “Gender” will be selected as an attribute to split the dataset in the first iteration.
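These worked values can also be checked with a few lines of MATLAB (the gains of 1.5, 1, and 0.5 are taken from Sections 2.3.1 and 2.3.2):

split_opinion = -sum((1/4) * log2(1/4) * ones(1, 4));   % SplitInfo("Opinion no") = 2
split_gender  = -2 * (2/4) * log2(2/4);                 % SplitInfo(Gender) = 1
split_age     = -2 * (2/4) * log2(2/4);                 % SplitInfo(Age) = 1
gr_opinion = 1.5 / split_opinion                        % 0.75
gr_gender  = 1.0 / split_gender                         % 1
gr_age     = 0.5 / split_age                            % 0.5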

In the next iteration, the attributes "Age" and "Opinion no." compete to be selected for splitting the datasets produced in the first iteration. One of the two datasets, Dataset (Dgender = male), has zero entropy; therefore, there is no need for a further split. The other dataset, (Dgender = Female), has nonzero entropy, so we need to select an attribute. The two competing attributes have the same values of information gain and SplitInfo, and hence the measure gain ratio is also the same. When such a situation arises, the decision can be taken arbitrarily.

From the above example it is clear that the value of “SplitInfo” increases as the number of partitions increases. Consider three scenarios with evenly distributed partitions.

  • 10 possible partitions with even distribution:

    SplitInfo = −10 × (1/10)log2(1/10) = 3.32

  • 100 possible partitions with even distribution:

    SplitInfo = −100 × (1/100)log2(1/100) = 6.64

  • 1000 possible partitions with even distribution:

    SplitInfo = −1000 × (1/1000)log2(1/1000) = 9.97

It can be seen that as the number of partitions increases, the value of the SplitInfo goes up, which in turn reduces the value of Gain Ratio for a particular attribute.
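For k evenly distributed partitions, SplitInfo reduces to log2(k), which the following one-liner confirms for the three scenarios above:

k = [10 100 1000];
splitinfo_even = -k .* ((1 ./ k) .* log2(1 ./ k))   % 3.32  6.64  9.97, i.e., log2(k)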

2.4.2 Implementation in MATLAB

We have to make small changes to the function "infogain." The new function is named "gaininfo." The file "gaininfo.m" calculates the gain ratio of each feature and returns the one with the maximum value. The following code snippet shows how the SplitInfo of each attribute is calculated.

% To calculate the split info of each attribute
r = q;                          % q: one row per attribute, columns hold record counts per attribute value
[row_r, col_r] = size(r);
split_count = zeros(row_r, 1);
splitinfo = zeros(row_r, 1);
for i = 1:row_r
    % Total number of records counted for attribute i
    for jj = 1:col_r
        if r(i, jj) ~= 0
            split_count(i) = split_count(i) + r(i, jj);
        end
    end
    % SplitInfo(i) = -sum over values of (|Dj|/|D|) * log2(|Dj|/|D|)
    for jj = 1:col_r
        if r(i, jj) ~= 0
            r(i, jj) = r(i, jj) / split_count(i);
            splitinfo(i) = splitinfo(i) + r(i, jj) * log2(r(i, jj));
        end
    end
    splitinfo(i) = -1 .* splitinfo(i);
end

After calculating the “SplitInfo” of each attribute, the “Gain Ratio” is calculated using the following code.

Gain_ratio = info_gains ./ splitinfo;            % element-wise Gain(A)/SplitInfo(A)
[gain, max_gain_feature] = max(Gain_ratio);      % attribute with the highest gain ratio

Now the criterion for selecting an attribute when building a DT is improved.

The two functions "infogain" and "gaininfo" can be used to develop a DT easily.

References

1. Han, J. and Kamber, M. Data Mining: Concepts and Techniques, 2nd Ed. Burlington, MA: Elsevier, 2006, p. 772.

2. Quinlan, J.R. Induction of decision trees, Machine Learning, vol.1, 81–106, 1986.
