This chapter covers the following items:
–Decision tree algorithms
–Classification with decision tree algorithms
–Examples and applications
Let us now dwell on what a decision tree is. A decision tree is a flowchart-like tree structure. In a typical decision tree, each internal node (also known as a nonleaf node) denotes a test on an attribute [1–3], each branch represents an outcome of that test, and each leaf node (also known as a terminal node) holds a class label. The topmost node in the tree is the root node. Figure 6.1 provides a typical illustration of a decision tree for the concept time to export (days): based on this attribute, it predicts whether an economy belongs to USA, New Zealand, Italy or Sweden.
Each internal (nonleaf) node represents a test on an attribute, and each leaf node represents a class (here, a likely range of time to export (days)); internal nodes are drawn as rectangles and leaf nodes as ovals. Some decision tree algorithms produce only binary trees, in which each internal node branches to exactly two other nodes, while others can produce nonbinary trees [4–6].
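To make this structure concrete, the following is a minimal Python sketch of how such a tree could be represented and traversed; the class names and fields are illustrative choices, not taken from any particular library.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Union


@dataclass
class LeafNode:
    """Terminal node: stores the predicted class label."""
    label: str


@dataclass
class InternalNode:
    """Nonleaf node: stores the tested attribute and one subtree per test outcome."""
    attribute: str
    branches: Dict[str, Union["InternalNode", LeafNode]] = field(default_factory=dict)


def classify(node: Union[InternalNode, LeafNode], sample: Dict[str, str]) -> Optional[str]:
    """Walk from the root to a leaf, following the branch that matches each test outcome."""
    while isinstance(node, InternalNode):
        node = node.branches.get(sample[node.attribute])
        if node is None:          # unseen attribute value: no branch to follow
            return None
    return node.label
```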
Some questions regarding decision trees are raised and addressed below:
Section 6.2 provides a basic algorithm for learning decision trees. While the tree is being built, attribute selection measures are used to select the attribute that best partitions the datasets into distinct classes. Section 6.1 presents some of the popular attribute selection measures. When decision trees are constructed, some branches may reflect noise or outliers in the training data. Tree pruning attempts to identify and remove such branches, which is particularly important for improving classification accuracy on unseen data. Tree pruning is elaborated in Section 6.2, which also addresses scalability issues concerning large datasets.
Decision tree induction refers to the learning of decision trees from class-labeled training datasets. In this section, it is important to mention two cornerstone algorithms that generated a flood of subsequent work on decision tree induction. During the late 1970s and early 1980s, J. Ross Quinlan, a researcher in machine learning, developed a decision tree algorithm known as ID3 (Iterative Dichotomiser). This work expanded on the earlier concept learning work of E. B. Hunt, J. Marin and P. T. Stone, and Quinlan later presented a benchmark algorithm called C4.5 (a successor of ID3) [8]. In 1984, a group of statisticians, L. Breiman, J. Friedman, R. Olshen and C. Stone, published the book Classification and Regression Trees (CART) [9], in which they described the generation of binary decision trees. ID3 and CART were invented independently of each other, and surprisingly at around the same time, yet they follow a similar approach to learning decision trees from training datasets.
Most decision tree induction algorithms follow a top-down approach: decision trees are built in a top-down, recursive, divide-and-conquer manner, starting with a training set of datasets and their associated class labels. The training set is then recursively partitioned into smaller subsets as the tree is being built. Figure 6.2 provides a summarized form of a decision tree algorithm. Although the algorithm may seem long at first glance, it is straightforward. The strategy of the algorithm is as follows:
–The algorithm has three parameters: D, attribute_list and Attribute_selection_method. D is referred to as the data partition; initially it is the complete set of training datasets and their associated class labels. The list of attributes describing the datasets is the parameter attribute_list. Finally, Attribute_selection_method specifies a heuristic procedure for selecting the attribute that “best” discriminates the given datasets according to class. This procedure uses an attribute selection measure such as the Gini index or information gain. The Gini index enforces the resulting tree to be binary, whereas other measures, such as information gain, allow multiway splits, that is, two or more branches to be grown from a node.
–The tree starts as a single node, N, which represents the training datasets in D (Step 1). The partition of class-labeled training datasets at node N is the set of datasets that follow a path from the root of the tree to node N when being processed by the tree [10].
The steps for training the dataset with the decision tree algorithm are as follows (see Figure 6.2):
Steps (1–3) The datasets in D may all be of the same class. In such a case, node N becomes a leaf that is labeled with that class.
Steps (4–5) Note that these are terminating conditions. All the terminating conditions are explained at the end of the algorithm.
Step (6) Otherwise, the algorithm calls Attribute_selection_method to determine the splitting criterion. The splitting criterion indicates which attribute to test at node N by determining the “best” way to separate the datasets in D into individual classes. The splitting criterion also tells us which branches to grow from node N with respect to the outcomes of the chosen test. More specifically, the splitting criterion indicates the splitting attribute and may also indicate either a split point or a splitting subset. At this point, it is important that the resulting partitions at each branch are as “pure” as possible. A partition is pure if all of its datasets belong to the same class.
Steps (7–9) Node N is labeled with the splitting criterion, which serves as a test at the node. A branch is grown from node N for each outcome of the splitting criterion.
Steps (10–15) The datasets in D are partitioned accordingly. Figure 6.3 shows the three possible scenarios. Let A be the splitting attribute. On the basis of the training data, A has v distinct values, {a1, a2, ..., av}. The three possible cases are discussed below.
Figure 6.3(a)–(c) illustrates examples of splitting criteria and the resulting branch labels in detail.
Let us take the case in which A is discrete-valued. In this case, the outcomes of the test at node N correspond directly to the known values of A. A branch is created for each known value aj of A and labeled with that value (Figure 6.3(a)). Partition Dj is the subset of class-labeled datasets in D having value aj of A. Because all datasets in a given partition have the same value for A, A need not be considered in any future partitioning of the datasets. Therefore, it is removed from attribute_list.
After having covered decision trees and decision tree induction, we can move on to attribute selection measures. An attribute selection measure is a heuristic for selecting the splitting criterion that “best” separates a given data partition D of class-labeled training datasets into individual classes. If D were to be split into smaller partitions according to the outcomes of the splitting criterion, ideally each partition would be pure, that is, all the datasets in a partition would belong to the same class. Attribute selection measures are also known as splitting rules because they determine how the datasets at a given node are to be partitioned. The attribute selection measure provides a ranking for each attribute describing the given training datasets. The attribute having the best score for the measure is chosen as the splitting attribute for the given datasets. If the splitting attribute is continuous-valued or if we are restricted to binary trees, then either a split point or a splitting subset must also be determined as part of the splitting criterion. The tree node created for partition D is labeled with the splitting criterion, branches are grown for each outcome of the criterion, and the datasets are partitioned accordingly. This section describes three popular attribute selection measures: information gain, gain ratio and Gini index [11].
Let D, the data partition, be a training set of class-labeled datasets. Suppose the class label attribute has m distinct values defining m distinct classes Ci (for i = 1, ..., m). Let Ci,D be the set of datasets of class Ci in D. Then |D| and |Ci,D| denote the number of datasets in D and Ci,D, respectively.
ID3 uses information gain as its attribute selection measure. This measure is based on the pioneering work by Claude Shannon [12] on information theory, which studied the value or “information content” of messages. Let node N represent or hold the datasets of partition D. The attribute with the highest information gain is chosen as the splitting attribute for node N. This attribute minimizes the information needed to classify the datasets in the resulting partitions and reflects the least randomness or “impurity” in those partitions. Such an approach minimizes the expected number of tests needed to classify a given dataset and ensures that a simple (though not necessarily the simplest) tree is found.
The expected information needed to classify a dataset in D is given by the following formula:
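Written out, this is the standard entropy expression referred to throughout this section as eq. (6.1):

$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i) \qquad (6.1)$$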
where pi is the nonzero probability that an arbitrary dataset in D belongs to class Ci, estimated by |Ci,D|/|D|. Info(D) is the average amount of information needed to identify the class label of a dataset in D. Note that the information available at this point is based solely on the proportions of datasets in each class. Info(D) is also known as the entropy of D.
Ideally, the partitioning would produce an exact classification of the datasets; in other words, each partition would be pure. It is likely, however, that the partitions will be impure, as a partition may contain datasets from several different classes rather than from a single class.
The amount of information still needed after the partitioning in order to arrive at an exact classification is measured by the following formula:
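In the standard notation, with D split by attribute A into v partitions {D1, D2, ..., Dv}, this is eq. (6.2):

$$\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \mathrm{Info}(D_j) \qquad (6.2)$$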
The term |Dj|/|D| acts as the weight of the jth partition. InfoA(D) is the expected information required to classify a dataset from D based on the partitioning by A. The smaller the expected information still required, the greater the purity of the partitions.
Information gain is defined as the difference between the original information requirement (based only on the proportion of classes) and the new requirement (obtained after partitioning on A). In short, Gain(A) tells us how much would be gained by branching on A. The attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute at node N. This is equivalent to saying that we want to partition on the attribute A that would do the “best classification,” so that the amount of information still required to finish classifying the datasets, InfoA(D), is minimal.
The following formula clarifies the situation:
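That is, eq. (6.3) is:

$$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D) \qquad (6.3)$$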
Now suppose you want to compute the information gain of a continuous-valued attribute A rather than a discrete-valued one. (For instance, suppose that we have the raw values for this attribute rather than the discretized version of Age from Table 2.19.) In this case, you must determine the “best” split point for A, where the split point is a threshold on A.
You first sort the values of A in increasing order. The midpoint between each pair of adjacent values is considered as a possible split point; therefore, given v values of A, v − 1 possible splits are evaluated [13]. For instance, the midpoint between the values ai and ai+1 of A is computed by the following formula:
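In standard form this midpoint is

$$\text{split point}_i = \frac{a_i + a_{i+1}}{2}.$$

As a small illustrative example, if two adjacent sorted ages were 48 and 55, the candidate split point between them would be (48 + 55)/2 = 51.5.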
If the values of A are sorted in advance, determining the best split for A requires only one pass through the values. For each possible split point of A, you evaluate InfoA(D), where the number of partitions is two, that is, v = 2 (or j = 1, 2) in eq. (6.2). The point giving the minimum expected information requirement for A is selected as the split point for A. D1 is the set of datasets in D satisfying A ≤ split_point, and D2 is the set of datasets in D satisfying A > split_point.
The flowchart of the ID3 algorithm is provided in Figure 6.4.
Based on the ID3 algorithm flowchart (Figure 6.4), you can follow the steps applied in the ID3 algorithm.
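As a concrete, simplified illustration of these steps, the Python sketch below grows a tree over discrete-valued attributes using entropy and information gain; the function names and the nested-tuple tree representation are illustrative choices, not part of any published ID3 implementation.

```python
import math
from collections import Counter


def entropy(rows):
    """Info(D): entropy of the class label stored in the last column (eq. (6.1))."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def info_gain(rows, attr_idx):
    """Gain(A) = Info(D) - Info_A(D) for a discrete-valued attribute (eqs. (6.2)-(6.3))."""
    total = len(rows)
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attr_idx], []).append(row)
    expected_info = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(rows) - expected_info


def id3(rows, attr_indices):
    """Return a class label (leaf) or (attr_idx, {value: subtree}) (internal node)."""
    labels = {row[-1] for row in rows}
    if len(labels) == 1:                      # all datasets belong to the same class
        return labels.pop()
    if not attr_indices:                      # no attributes left: majority vote
        return Counter(row[-1] for row in rows).most_common(1)[0][0]
    best = max(attr_indices, key=lambda a: info_gain(rows, a))
    remaining = [a for a in attr_indices if a != best]
    branches = {}
    for row in rows:
        branches.setdefault(row[best], []).append(row)
    return (best, {value: id3(part, remaining) for value, part in branches.items()})
```

Calling id3(rows, [0, 1, 2]) on records shaped like Table 6.1 (gender, age group and QIV group in columns 0–2 and the class in the last column) should reproduce the root-selection step carried out by hand in Example 6.1.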
Example 6.1 below shows the application of the ID3 algorithm step by step to give a clearer picture.
Example 6.1 Table 6.1 lists, for a sample of individuals from the WAIS-R dataset, the attributes gender, age, verbal IQ (QIV) and class. The data come from the WAIS-R dataset described in Chapter 2 (Tables 2.19 and 6.1). Three variables (gender, age and QIV) are handled, and the class attribute takes the values patient and healthy. Let us now see how we can build the ID3 tree from these data.
We apply the steps of the ID3 algorithm in order so as to build the tree shown in Figure 6.5.
Table 6.1: Selected sample from the WAIS-R dataset.
Steps (1–4) The first step is to calculate the root node of the tree. The total number of records is 15, as can be seen in the first column of Table 6.1 (Individual (ID): 1, 2, 3, …, 15). We use the entropy formula of eq. (6.1) to calculate the root node.
As can be seen in the class column of Table 6.1, 11 individuals belong to the patient class and 4 to the healthy class. The overall entropy can be calculated from eq. (6.1) as follows:
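Substituting these counts (11 patient, 4 healthy, 15 in total) into eq. (6.1), the arithmetic works out to approximately

$$\mathrm{Info}(D) = -\frac{11}{15}\log_2\frac{11}{15} - \frac{4}{15}\log_2\frac{4}{15} \approx 0.837 \text{ bits}.$$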
The entropy value for the gender variable (female–male), our first variable, is calculated from the data in Table 6.1.
Step (5) The entropy value of the gender variable is calculated.
The weighted total value of the gender variable can be calculated as in Figure 6.5.
Step (6) The gain calculation of the gender variable is done according to eq. (6.3).
The calculation is as follows:
Now let us do the gain calculation for another variable, age, following the same steps. In order to form the branches of the ID3 tree, the age variable should be split into groups; let us assume that the age variable is grouped and labeled as shown in Table 6.2.
Table 6.2: The grouping of the age variable.
Age interval | Group no.
48–55 | 1
59–64 | 2
65–75 | 3
76–85 | 4
86–100 | 5
Steps (1–4) Let us now calculate the entropy values of the group numbers as labeled with the age interval values given in Table 6.2; eq. (6.2) is used for this calculation.
Step (5) The calculation of the weighted total value for the age variable is as follows:
Step (6) The gain for the age variable, Gain(Age), is calculated according to eq. (6.3):
At this point the gain value of our last variable, namely, the QIV variable, is to be calculated.
We apply the same steps for the QIV variable. In order to form the branches of the tree, the QIV variable must be split into groups. Assume that the QIV variable is grouped as shown in Table 6.3:
Table 6.3: Grouping the QIV variable.
QIV | Group no.
151–159 | 1
160–164 | 2
165–174 | 3
175–184 | 4
>185 | 5
Steps (1–4) Let us calculate the entropy value of the QIV variable based on Table 6.3.
Step (5) Let us use eq. (6.2) in order to calculate the weighted total value of the QIV variable.
Step (6) Let us apply the gain calculation for the QIV variable based on eq. (6.3):
Step (7) We have now calculated the gain values of all the variables in Table 6.1, namely age, gender and QIV. The calculations have yielded the following results:
When the gain values are ranked, the greatest is Gain(Gender). Therefore, gender is selected as the root node, as presented in Figure 6.6. In line with the procedural steps of the ID3 algorithm, the gender variable constitutes the root of the tree, and its branches are labeled female and male (Figure 6.6).
The ID3 algorithm is applied to the economy (U.N.I.S.) dataset, the multiple sclerosis (MS) dataset and the WAIS-R dataset in Sections 6.3.1.1, 6.3.1.2 and 6.3.1.3, respectively.
As the first dataset, the following sections use economic data for the U.N.I.S. countries. The attributes of these countries’ economies include year, unemployment, youth male (% of male labor force ages 15–24) (national estimate), gross value added at factor cost ($), tax revenue, …, and gross domestic product (GDP) growth (annual %). The economy (U.N.I.S.) dataset consists of a total of 18 attributes. Data belonging to the USA, New Zealand, Italy and Sweden economies from 1960 to 2015 are defined based on the attributes given in Table 2.7, economy (U.N.I.S.) dataset (http://data.worldbank.org) [15], and are used in the following sections. For the classification of the D matrix through the ID3 algorithm, the training procedure is employed first. For the training procedure, 66.66% of the D matrix is split off as the training dataset (151 × 18) and 33.33% as the test dataset (77 × 18).
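As a rough illustration of this split-and-classify workflow (not the authors’ exact procedure), the sketch below uses scikit-learn. Note that DecisionTreeClassifier implements an optimized CART-style learner, so criterion="entropy" only approximates ID3’s information-gain splitting, and the file name economy_unis.csv is a placeholder.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Placeholder loading step: assumes 18 attribute columns followed by a class column.
data = np.loadtxt("economy_unis.csv", delimiter=",", dtype=str)
X = data[:, :-1].astype(float)   # 18 economic attributes
y = data[:, -1]                  # class labels: USA, New Zealand, Italy, Sweden

# 66.66% training, 33.33% test, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3333, random_state=0)

# Entropy criterion approximates ID3's information gain; sklearn grows binary trees.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, tree.predict(X_test)))
```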
After the training dataset has been trained with the ID3 algorithm, we can classify the test dataset. Although the procedural steps of the algorithm may seem complicated at first glance, all you have to do is concentrate on the steps and grasp them. For this, let us have a close look at the steps provided in Figure 6.7.
How can we find the class of a test sample belonging to the economy (U.N.I.S.) dataset provided in Table 2.7.1 according to the ID3 algorithm? We can find the class label of a sample from the economy (U.N.I.S.) dataset through the following steps (Figure 6.9):
Steps (1–3) Let us split the economy (U.N.I.S.) dataset as follows: 66.66% as the training data (D = 151 × 18) and 33.33% as the test data (T = 77 × 18).
Steps (4–10) Create a random subset. Training the training set (D = 151 × 18) with the ID3 algorithm has yielded the following result: number of nodes: 15. A sample ID3 decision tree structure is given in Figure 6.10.
Steps (11–12) The splitting criterion labels the node.
The training of the economy (U.N.I.S.) dataset is carried out by applying the ID3 algorithm (Steps 1–12 as specified in Figure 6.8). Through this, a decision tree made up of 15 nodes (the learned tree, N = 15) has been obtained. The nodes chosen by the learned ID3 tree (gross value added at factor cost ($), tax revenue, unemployment, youth male) and some of the rules pertaining to these nodes are presented in Figure 6.8.
Figure 6.8(a) and (b) shows a sample decision tree structure obtained by applying ID3 to the economy (U.N.I.S.) dataset. Some of the IF–THEN rules derived from it are as follows:
…
…
…
Rule 1
IF ((gross value added at factor cost ($) (current US$) < $2.17614e+12) and (tax revenue (% of GDP) < 12.20%) and (unemployment, youth male (% of labor force ages 15–24) < 3.84%))
THEN class = Italy
Rule 2
IF ((gross value added at factor cost ($) (current US$) < $2.17614e+12) and (tax revenue (% of GDP) < 12.20%) and (unemployment, youth male (% of labor force ages 15–24) > 3.84%))
THEN class = USA
Rule 3
IF ((gross value added at factor cost ($) (current US$) < $2.17614e+12) and (tax revenue (% of GDP) ≥ 12.20%))
THEN class = Italy
Rule 4
IF (gross value added at factor cost ($) (current US$) ≥ $2.17614e+12)
THEN class = USA
…
…
…
As a result, the 33.33% portion of the economy (U.N.I.S.) dataset allocated for the test procedure has been classified into USA, New Zealand, Italy and Sweden with an accuracy rate of 95.2% based on the ID3 algorithm.
As presented in Table 2.10, the MS dataset contains data from the following groups: 76 samples belonging to RRMS, 76 samples to SPMS, 76 samples to PPMS and 76 samples belonging to healthy subjects of the control group. The attributes of these groups are data regarding brain stem (MRI 1), corpus callosum–periventricular (MRI 2) and upper cervical (MRI 3) lesion diameter size (millimetric (mm)) in the MR image, together with the EDSS score. The data consist of a total of 112 attributes. Using these attributes of 304 individuals, it is known whether they belong to an MS subgroup or to the healthy group. How can we classify which MS patient belongs to which subgroup of MS, including healthy individuals and those diagnosed with MS (based on the lesion diameters (MRI 1, MRI 2 and MRI 3), the number of lesions for (MRI 1, MRI 2 and MRI 3) as obtained from MRI images, and the EDSS scores)? The D matrix has a dimension of 304 × 112; that is, the D matrix includes the MS dataset of 304 individuals along with their 112 attributes (see Table 2.12 for the MS dataset). For the classification of the D matrix through the ID3 algorithm, the training procedure is employed first. For the training procedure, 66.66% of the D matrix is split off as the training dataset (203 × 112) and 33.33% as the test dataset (101 × 112).
After the training dataset has been trained with the ID3 algorithm, we can classify the test dataset. Although the procedural steps of the algorithm may seem complicated at first glance, all you have to do is concentrate on the steps and grasp them. For this, let us have a close look at the steps provided in Figure 6.9.
In order to perform classification with the ID3 algorithm, 33.33% of the MS dataset has been selected randomly as the test data.
Let us now consider how to classify the MS dataset provided in Table 2.10.1 in accordance with the steps of the ID3 algorithm (Figure 6.9).
Steps (1–3) Let us identify 66.66% of the MS dataset as the training data (D=203 × 112) and 33.33% as the test data (T =101 × 112).
Steps (4–10) Create a random subset. As a result of training the training set with the ID3 algorithm, the following result has been obtained: number of nodes: 39.
Steps (11–12) The splitting criterion labels the node.
The training of the MS dataset is done by applying the ID3 algorithm (Steps 1–12 as specified in Figure 6.7). Through this, a decision tree comprising 39 nodes (the learned tree, N = 39) has been obtained. The node chosen by the learned ID3 tree (MRI 1) and some of the rules pertaining to the nodes are presented in Figure 6.9.
Figure 6.10(a) and (b) shows a sample decision tree structure obtained by applying ID3 to the MS dataset. Some of the IF–THEN rules derived from it are as follows:
…
…
…
Rule 1
IF (1st region (MRI 1) lesion count < 10.5)
THEN class = RRMS
Rule 2
IF (1st region (MRI 1) lesion count ≥ 10.5)
THEN class = PPMS
…
…
…
As a result, the 33.33% portion of the MS dataset allocated for the test procedure has been classified into RRMS, SPMS, PPMS and healthy with an accuracy rate of 88.3% based on the ID3 algorithm.
As presented in Table 2.16, the WAIS-R dataset contains data of 200 samples belonging to the patient group and 200 samples to the healthy control group. The attributes are data regarding school education, gender, three-dimensional modeling, memory, ordering ability, age, …, DM. The WAIS-R dataset consists of a total of 21 attributes. Using these attributes of 400 individuals, it is known whether they belong to the patient or the healthy group. How can we classify which individual belongs to the patient group and which to the healthy group, based on the WAIS-R test (school education, gender, three-dimensional modeling, memory, ordering ability, age, …, DM)? The D matrix has a dimension of 400 × 21; that is, the D matrix includes the WAIS-R dataset of 400 individuals along with their 21 attributes (see Table 2.16 for the WAIS-R dataset). For the classification of the D matrix through the ID3 algorithm, the training procedure is employed first. For the training procedure, 66.66% of the D matrix is split off as the training dataset (267 × 21) and 33.33% as the test dataset (133 × 21).
After the training dataset has been trained with the ID3 algorithm, we can classify the test dataset. Although the procedural steps of the algorithm may seem complicated at first glance, all you have to do is concentrate on the steps and grasp them. We can have a glance at the steps shown in Figure 6.11:
How can we classify the WAIS-R dataset in Table 2.16.1 in accordance with the steps of ID3 algorithm (Figure 6.11)?
Steps (1–3) Let us identify 66.66% of the WAIS-R dataset as the training data (D = 267 × 21) and 33.33% as the test data (T = 133 × 21).
Steps (4–10) Create a random subset. As a result of training the training set (D = 267 × 21) with the ID3 algorithm, the following result has been obtained: number of nodes: 27. The sample decision tree structure is provided in Figure 6.12.
Steps (11–12) The splitting criterion labels the node.
The training of the WAIS-R dataset is conducted by applying the ID3 algorithm (Steps 1–12 as specified in Figure 6.12). Through this, a decision tree made up of 27 nodes (the learned tree, N = 27) has been obtained. The nodes chosen by the learned ID3 tree (three-dimensional modeling, memory, ordering ability, age) and some of the rules pertaining to these nodes are presented in Figure 6.12.
Figure 6.12(a) and (b) shows a sample decision tree structure obtained by applying ID3 to the WAIS-R dataset. Some of the IF–THEN rules derived from it are as follows:
…
…
…
Rule 1
IF (three-dimensional modeling test score < 13.5 and ordering ability test score < 0.5)
THEN class = healthy
Rule 2
IF (three-dimensional modeling test score ≥ 13.5 and memory test score < 16.5 and age < 40)
THEN class = patient
Rule 3
IF (three-dimensional modeling test score ≥ 13.5 and memory test score < 16.5 and age ≥ 40)
THEN class = healthy
Rule 4
IF (three-dimensional modeling test score ≥ 13.5 and memory test score ≥ 16.5)
THEN class = patient
…
…
…
As a result, the 33.33% portion of the WAIS-R dataset allocated for the test procedure has been classified into patient and healthy individuals with an accuracy rate of 51% based on the ID3 algorithm.
Having covered ID3, we can now have a look at C4.5. The information gain measure is biased toward attributes with a large number of values. For instance, consider an attribute acting as a unique identifier, such as an ID. A split on ID would result in a large number of partitions (as many as there are values), each containing just one dataset.
Because each partition is pure, the information required to classify dataset D based on this partitioning would be InfoID(D) = 0. Therefore, the information gained by partitioning on this attribute is maximal, yet such a partitioning is useless for classification [14].
C4.5, a successor of ID3, uses an extension to information gain known as the gain ratio, which attempts to overcome this bias toward tests with many outcomes. It applies a kind of normalization to information gain using a “split information” value defined analogously to Info(D).
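In standard notation, this split information can be written as

$$\mathrm{SplitInfo}_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\!\left(\frac{|D_j|}{|D|}\right).$$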
This value represents the potential information generated by splitting the training dataset D into v partitions, corresponding to the v outcomes of a test on attribute A. For each outcome, it considers the number of datasets having that outcome with respect to the total number of datasets in D. It thus differs from information gain, which measures the information with respect to classification acquired from the same partitioning. The gain ratio is defined through the following calculation [15]:
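In standard form, this is the ratio of the gain to the split information:

$$\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}_A(D)}.$$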
The attribute with the maximum gain ratio is selected as the splitting attribute. Note that as the split information approaches 0, the ratio becomes unstable. To avoid this, a constraint is added: the information gain of the test selected must be large, at least as great as the average gain over all tests examined.
Figure 6.13 presents the flowchart of the C4.5 algorithm.
Based on the C4.5 algorithm flowchart (Figure 6.13), you can follow the steps of the C4.5 algorithm.
Example 6.2 shows the application of the C4.5 algorithm step by step to give a clearer picture (the same procedure as was followed for the ID3 algorithm; see Figure 6.14).
Example 6.2 Table 6.1 lists, for the individuals from the WAIS-R dataset, the attributes gender, age, QIV and the class label. The data for the WAIS-R dataset handled in Section 6.2 (indicated in Tables 2.16.1 and 6.1) are used.
We can apply the steps of the C4.5 algorithm in order, step by step, so as to build the tree shown in Figure 6.1.