This chapter covers the following items:
–Decision tree algorithms
–Classification with decision tree algorithms
–Examples and applications
Let us now dwell on what a decision tree is. A decision tree is a flowchart-like tree structure. In a typical decision tree, each internal node (also known as a nonleaf node) denotes a test on an attribute [1–3], each branch represents an outcome of that test, and each leaf node (also known as a terminal node) holds a class label. The topmost node in the tree is the root node. Figure 6.1 provides a typical illustration of a decision tree for the concept time to export (days): based on this attribute, it predicts whether an economy belongs to USA, New Zealand, Italy or Sweden.
Each internal (nonleaf) node represents a test on an attribute, and each leaf node represents a class (here, a likely range of time to export (days)); internal nodes are drawn as rectangles and leaf nodes as ovals. Some decision tree algorithms produce only binary trees, in which each internal node branches to exactly two other nodes, while others can produce nonbinary trees [4–6].
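To make this structure concrete, the following is a minimal Python sketch of how such a tree could be represented and traversed; the class names and fields are illustrative choices, not taken from any particular library.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Union


@dataclass
class LeafNode:
    """Terminal node: stores the predicted class label."""
    label: str


@dataclass
class InternalNode:
    """Nonleaf node: stores the tested attribute and one subtree per test outcome."""
    attribute: str
    branches: Dict[str, Union["InternalNode", LeafNode]] = field(default_factory=dict)


def classify(node: Union[InternalNode, LeafNode], sample: Dict[str, str]) -> Optional[str]:
    """Walk from the root to a leaf, following the branch that matches each test outcome."""
    while isinstance(node, InternalNode):
        node = node.branches.get(sample[node.attribute])
        if node is None:          # unseen attribute value: no branch to follow
            return None
    return node.label
```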
Some questions regarding decision trees are raised and addressed below:
Section 6.2 provides a basic algorithm for learning decision trees. While the tree is being built, attribute selection measures are used to select the attribute that best partitions the datasets into distinct classes. Section 6.1 presents some of the popular attribute selection measures. When decision trees are constructed, some branches may reflect noise or outliers in the training data. Tree pruning attempts to identify and remove such branches, which is particularly important for improving classification accuracy on unseen data. Tree pruning is elaborated in Section 6.2, which also addresses scalability issues concerning large datasets.
Decision tree induction refers to the learning of decision trees from class-labeled training datasets. In this section, it is important to mention two cornerstone algorithms that generated a flood of subsequent work on decision tree induction. During the late 1970s and early 1980s, J. Ross Quinlan, a researcher in machine learning, developed a decision tree algorithm known as ID3 (Iterative Dichotomiser). This work expanded on the earlier concept learning work of E. B. Hunt, J. Marin and P. T. Stone, and Quinlan later presented a benchmark algorithm called C4.5 (a successor of ID3) [8]. In 1984, a group of statisticians, L. Breiman, J. Friedman, R. Olshen and C. Stone, published the book Classification and Regression Trees (CART) [9], in which they described the generation of binary decision trees. ID3 and CART were invented independently of each other, and surprisingly at around the same time, yet they follow a similar approach to learning decision trees from training datasets.
Most decision tree induction algorithms follow a top-down approach: decision trees are built in a top-down, recursive, divide-and-conquer manner, starting with a training set of datasets and their associated class labels. The training set is then recursively partitioned into smaller subsets as the tree is being built. Figure 6.2 provides a summarized form of a decision tree algorithm. Although the algorithm may seem long at first glance, it is straightforward. The strategy of the algorithm is as follows:
–The algorithm has three parameters: D, attribute_list and Attribute_selection_method. D is referred to as the data partition; initially it is the complete set of training datasets and their associated class labels. The list of attributes describing the datasets is the parameter attribute_list. Finally, Attribute_selection_method specifies a heuristic procedure for selecting the attribute that “best” discriminates the given datasets according to class. This procedure uses an attribute selection measure such as the Gini index or information gain. The Gini index enforces the resulting tree to be binary, whereas other measures, such as information gain, allow multiway splits, that is, two or more branches to be grown from a node.
–The tree starts as a single node, N, which represents the training datasets in D (Step 1). The partition of class-labeled training datasets at node N is the set of datasets that follow a path from the root of the tree to node N when being processed by the tree [10].
The steps for training the dataset with the decision tree algorithm are as follows (see Figure 6.2):
Steps (1–3) The datasets in D may all be of the same class. In such a case, node N becomes a leaf that is labeled with that class.
Steps (4–5) Note that these are terminating conditions. All the terminating conditions are explained at the end of the algorithm.
Step (6) Otherwise, the algorithm calls Attribute_selection_method to determine the splitting criterion. The splitting criterion indicates which attribute to test at node N by determining the “best” way to separate the datasets in D into individual classes. The splitting criterion also tells us which branches to grow from node N with respect to the outcomes of the chosen test. More specifically, the splitting criterion indicates the splitting attribute and may also indicate either a split point or a splitting subset. At this point, it is important that the resulting partitions at each branch are as “pure” as possible. A partition is pure if all of its datasets belong to the same class.
Steps (7–9) Node N is labeled with the splitting criterion, which serves as a test at the node. A branch is grown from node N for each outcome of the splitting criterion.
Steps (10–15) The datasets in D are partitioned accordingly. Figure 6.3 shows the three possible scenarios. Let A be the splitting attribute. On the basis of the training data, A has v distinct values, {a1, a2, ..., av}. The three possible cases are discussed below.
Figure 6.3(a)–(c) illustrates examples of splitting criteria and the resulting branch labels in detail.
Let us take the case in which A is discrete-valued. In this case, the outcomes of the test at node N correspond directly to the known values of A. A branch is created for each known value aj of A and labeled with that value (Figure 6.3(a)). Partition Dj is the subset of class-labeled datasets in D having value aj of A. Because all datasets in a given partition have the same value for A, A need not be considered in any future partitioning of the datasets. Therefore, it is removed from attribute_list.
After having covered decision trees and decision tree induction, we can move on to attribute selection measures. An attribute selection measure is a heuristic for selecting the splitting criterion that “best” separates a given data partition D of class-labeled training datasets into individual classes. If D were to be split into smaller partitions according to the outcomes of the splitting criterion, ideally each partition would be pure, that is, all the datasets in a partition would belong to the same class. Attribute selection measures are also known as splitting rules because they determine how the datasets at a given node are to be partitioned. The attribute selection measure provides a ranking for each attribute describing the given training datasets. The attribute having the best score for the measure is chosen as the splitting attribute for the given datasets. If the splitting attribute is continuous-valued or if we are restricted to binary trees, then either a split point or a splitting subset must also be determined as part of the splitting criterion. The tree node created for partition D is labeled with the splitting criterion, branches are grown for each outcome of the criterion, and the datasets are partitioned accordingly. This section describes three popular attribute selection measures: information gain, gain ratio and Gini index [11].
Let D, the data partition, be a training set of class-labeled datasets. Suppose the class label attribute has m distinct values defining m distinct classes Ci (for i = 1, ..., m). Let Ci,D be the set of datasets of class Ci in D. Then |D| and |Ci,D| denote the number of datasets in D and Ci,D, respectively.
ID3 uses information gain as its attribute selection measure. This measure is based on the pioneering work by Claude Shannon [12] on information theory, which studied the value or “information content” of messages. Let node N represent or hold the datasets of partition D. The attribute with the highest information gain is chosen as the splitting attribute for node N. This attribute minimizes the information needed to classify the datasets in the resulting partitions and reflects the least randomness or “impurity” in those partitions. Such an approach minimizes the expected number of tests needed to classify a given dataset and ensures that a simple (though not necessarily the simplest) tree is found.
The expected information needed to classify a dataset in D is given by the following formula:
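Written out, this is the standard entropy expression referred to throughout this section as eq. (6.1):

$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i) \qquad (6.1)$$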
where pi is the nonzero probability that an arbitrary dataset in D belongs to class Ci, estimated by |Ci,D|/|D|. Info(D) is the average amount of information needed to identify the class label of a dataset in D. Note that the information available at this point is based solely on the proportions of datasets in each class. Info(D) is also known as the entropy of D.
Ideally, the partitioning would produce an exact classification of the datasets; in other words, each partition would be pure. It is likely, however, that the partitions will be impure, as a partition may contain datasets from several different classes rather than from a single class.
The amount of information still needed after the partitioning in order to arrive at an exact classification is measured by the following formula:
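In the standard notation, with D split by attribute A into v partitions {D1, D2, ..., Dv}, this is eq. (6.2):

$$\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \mathrm{Info}(D_j) \qquad (6.2)$$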
The term |Dj|/|D| acts as the weight of the jth partition. InfoA(D) is the expected information required to classify a dataset from D based on the partitioning by A. The smaller the expected information still required, the greater the purity of the partitions.
Information gain is defined as the difference between the original information requirement (based only on the proportion of classes) and the new requirement (obtained after partitioning on A). In short, Gain(A) tells us how much would be gained by branching on A. The attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute at node N. This is equivalent to saying that we want to partition on the attribute A that would do the “best classification,” so that the amount of information still required to finish classifying the datasets, InfoA(D), is minimal.
The following formula clarifies the situation:
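That is, eq. (6.3) is:

$$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D) \qquad (6.3)$$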
Now suppose you want to compute the information gain of a continuous-valued attribute A rather than a discrete-valued one. (For instance, suppose that we have the raw values for this attribute rather than the discretized version of Age from Table 2.19.) In this case, you must determine the “best” split point for A, where the split point is a threshold on A.
You first sort the values of A in increasing order. The midpoint between each pair of adjacent values is considered as a possible split point; therefore, given v values of A, v − 1 possible splits are evaluated [13]. For instance, the midpoint between the values ai and ai+1 of A is computed by the following formula:
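In standard form this midpoint is

$$\text{split point}_i = \frac{a_i + a_{i+1}}{2}.$$

As a small illustrative example, if two adjacent sorted ages were 48 and 55, the candidate split point between them would be (48 + 55)/2 = 51.5.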
If the values of A are sorted in advance, determining the best split for A requires only one pass through the values. For each possible split point of A, you evaluate InfoA(D), where the number of partitions is two, that is, v = 2 (or j = 1, 2) in eq. (6.2). The point giving the minimum expected information requirement for A is selected as the split point for A. D1 is the set of datasets in D satisfying A ≤ split_point, and D2 is the set of datasets in D satisfying A > split_point.
The flowchart of the ID3 algorithm is provided in Figure 6.4.
Based on the ID3 algorithm flowchart (Figure 6.4), you can follow the steps applied in the ID3 algorithm.
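As a concrete, simplified illustration of these steps, the Python sketch below grows a tree over discrete-valued attributes using entropy and information gain; the function names and the nested-tuple tree representation are illustrative choices, not part of any published ID3 implementation.

```python
import math
from collections import Counter


def entropy(rows):
    """Info(D): entropy of the class label stored in the last column (eq. (6.1))."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def info_gain(rows, attr_idx):
    """Gain(A) = Info(D) - Info_A(D) for a discrete-valued attribute (eqs. (6.2)-(6.3))."""
    total = len(rows)
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attr_idx], []).append(row)
    expected_info = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(rows) - expected_info


def id3(rows, attr_indices):
    """Return a class label (leaf) or (attr_idx, {value: subtree}) (internal node)."""
    labels = {row[-1] for row in rows}
    if len(labels) == 1:                      # all datasets belong to the same class
        return labels.pop()
    if not attr_indices:                      # no attributes left: majority vote
        return Counter(row[-1] for row in rows).most_common(1)[0][0]
    best = max(attr_indices, key=lambda a: info_gain(rows, a))
    remaining = [a for a in attr_indices if a != best]
    branches = {}
    for row in rows:
        branches.setdefault(row[best], []).append(row)
    return (best, {value: id3(part, remaining) for value, part in branches.items()})
```

Calling id3(rows, [0, 1, 2]) on records shaped like Table 6.1 (gender, age group and QIV group in columns 0–2 and the class in the last column) should reproduce the root-selection step carried out by hand in Example 6.1.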
Example 6.1 below shows the application of the ID3 algorithm step by step to give a clearer picture.
Example 6.1 Table 6.1 lists, for a sample of individuals from the WAIS-R dataset, the attributes gender, age, verbal IQ (QIV) and class. The data come from the WAIS-R dataset described in Chapter 2 (Tables 2.19 and 6.1). Three variables (gender, age and QIV) are handled, and the class attribute takes the values patient and healthy. Let us now see how we can build the ID3 tree from these data.
We apply the steps of the ID3 algorithm in order so as to build the tree shown in Figure 6.5.
Table 6.1: Selected sample from the WAIS-R dataset.
Steps (1–4) The first step is to calculate the root node of the tree. The total number of records is 15, as can be seen in the first column of Table 6.1 (Individual (ID): 1, 2, 3, …, 15). We use the entropy formula of eq. (6.1) to calculate the root node.
As can be seen in the class column of Table 6.1, 11 individuals belong to the patient class and 4 to the healthy class. The overall entropy can be calculated from eq. (6.1) as follows:
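Substituting these counts (11 patient, 4 healthy, 15 in total) into eq. (6.1), the arithmetic works out to approximately

$$\mathrm{Info}(D) = -\frac{11}{15}\log_2\frac{11}{15} - \frac{4}{15}\log_2\frac{4}{15} \approx 0.837 \text{ bits}.$$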
The entropy value for the gender variable (female–male), our first variable, is calculated from the data in Table 6.1.
Step (5) The entropy value of the gender variable is calculated.
The weighted total value of the gender variable can be calculated as in Figure 6.5.
Step (6) The gain calculation of the gender variable is done according to eq. (6.3).
The calculation is as follows:
Now let us do the gain calculation for another variable, age, following the same steps. In order to form the branches of the ID3 tree, the age variable should be split into groups; let us assume that the age variable is grouped and labeled as shown in Table 6.2.
Table 6.2: The grouping of the age variable.
Age interval | Group no.
48–55 | 1
59–64 | 2
65–75 | 3
76–85 | 4
86–100 | 5
Steps (1–4) Let us now calculate the entropy values of the group numbers as labeled with the age interval values given in Table 6.2; eq. (6.2) is used for this calculation.
Step (5) The calculation of the weighted total value for the age variable is as follows:
Step (6) The gain for the age variable, Gain(Age), is calculated according to eq. (6.3):
At this point the gain value of our last variable, namely, the QIV variable, is to be calculated.
We apply the same steps for the QIV variable. In order to form the branches of the tree, the QIV variable must be split into groups. Assume that the QIV variable is grouped as shown in Table 6.3:
Table 6.3: Grouping the QIV variable.
QIV | Group no.
151–159 | 1
160–164 | 2
165–174 | 3
175–184 | 4
>185 | 5
Steps (1–4) Let us calculate the entropy value of the QIV variable based on Table 6.3.
Step (5) Let us use eq. (6.2) in order to calculate the weighted total value of the QIV variable.
Step (6) Let us apply the gain calculation for the QIV variable based on eq. (6.3):
Step (7) We have now calculated the gain values of all the variables in Table 6.1, namely age, gender and QIV. The calculations have yielded the following results:
When the gain values are ranked, the greatest is Gain(Gender). Therefore, gender is selected as the root node, as presented in Figure 6.6. In line with the procedural steps of the ID3 algorithm, the gender variable constitutes the root of the tree, and its branches are labeled female and male (Figure 6.6).
The ID3 algorithm is applied to the economy (U.N.I.S.) dataset, the multiple sclerosis (MS) dataset and the WAIS-R dataset in Sections 6.3.1.1, 6.3.1.2 and 6.3.1.3, respectively.
As the first dataset, the following sections use economic data for the U.N.I.S. countries. The attributes of these countries’ economies include year, unemployment, youth male (% of male labor force ages 15–24) (national estimate), gross value added at factor cost ($), tax revenue, …, and gross domestic product (GDP) growth (annual %). The economy (U.N.I.S.) dataset consists of a total of 18 attributes. Data belonging to the USA, New Zealand, Italy and Sweden economies from 1960 to 2015 are defined based on the attributes given in Table 2.7, economy (U.N.I.S.) dataset (http://data.worldbank.org) [15], and are used in the following sections. For the classification of the D matrix through the ID3 algorithm, the training procedure is employed first. For the training procedure, 66.66% of the D matrix is split off as the training dataset (151 × 18) and 33.33% as the test dataset (77 × 18).
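As a rough illustration of this split-and-classify workflow (not the authors’ exact procedure), the sketch below uses scikit-learn. Note that DecisionTreeClassifier implements an optimized CART-style learner, so criterion="entropy" only approximates ID3’s information-gain splitting, and the file name economy_unis.csv is a placeholder.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Placeholder loading step: assumes 18 attribute columns followed by a class column.
data = np.loadtxt("economy_unis.csv", delimiter=",", dtype=str)
X = data[:, :-1].astype(float)   # 18 economic attributes
y = data[:, -1]                  # class labels: USA, New Zealand, Italy, Sweden

# 66.66% training, 33.33% test, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3333, random_state=0)

# Entropy criterion approximates ID3's information gain; sklearn grows binary trees.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, tree.predict(X_test)))
```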
After the training dataset has been trained with the ID3 algorithm, we can classify the test dataset. Although the procedural steps of the algorithm may seem complicated at first glance, all you have to do is concentrate on the steps and grasp them. For this, let us have a close look at the steps provided in Figure 6.7.
How can we find the class of a test sample belonging to the economy (U.N.I.S.) dataset provided in Table 2.7.1 according to the ID3 algorithm? We can find the class label of a sample from the economy (U.N.I.S.) dataset through the following steps (Figure 6.9):
Steps (1–3) Let us split the economy (U.N.I.S.) dataset as follows: 66.66% as the training data (D = 151 × 18) and 33.33% as the test data (T = 77 × 18).
Steps (4–10) Create a random subset. Training the training set (D = 151 × 18) with the ID3 algorithm has yielded the following result: number of nodes: 15. A sample ID3 decision tree structure is given in Figure 6.10.
Steps (11–12) The splitting criterion labels the node.
The training of the economy (U.N.I.S.) dataset is carried out by applying the ID3 algorithm (Steps 1–12 as specified in Figure 6.8). Through this, a decision tree made up of 15 nodes (the learned tree, N = 15) has been obtained. The nodes chosen by the learned ID3 tree (gross value added at factor cost ($), tax revenue, unemployment, youth male) and some of the rules pertaining to these nodes are presented in Figure 6.8.
Figure 6.8(a) and (b) shows a sample decision tree structure obtained by applying ID3 to the economy (U.N.I.S.) dataset. Some of the IF–THEN rules derived from it are as follows:
…
…
…
Rule 1
IF ((gross value added at factor cost ($) (current US$) < $2.17614e+12) and (tax revenue (% of GDP) < 12.20%) and (unemployment, youth male (% of labor force ages 15–24) < 3.84%))
THEN class = Italy
Rule 2
IF ((gross value added at factor cost ($) (current US$) < $2.17614e+12) and (tax revenue (% of GDP) < 12.20%) and (unemployment, youth male (% of labor force ages 15–24) > 3.84%))
THEN class = USA
Rule 3
IF ((gross value added at factor cost ($) (current US$) < $2.17614e+12) and (tax revenue (% of GDP) ≥ 12.20%))
THEN class = Italy
Rule 4
IF (gross value added at factor cost ($) (current US$) ≥ $2.17614e+12)
THEN class = USA
…
…
…
As a result, the 33.33% portion of the economy (U.N.I.S.) dataset allocated for the test procedure has been classified into USA, New Zealand, Italy and Sweden with an accuracy rate of 95.2% based on the ID3 algorithm.
As presented in Table 2.10, the MS dataset contains data from the following groups: 76 samples belonging to RRMS, 76 samples to SPMS, 76 samples to PPMS and 76 samples belonging to healthy subjects of the control group. The attributes of these groups are data regarding brain stem (MRI 1), corpus callosum–periventricular (MRI 2) and upper cervical (MRI 3) lesion diameter size (millimetric (mm)) in the MR image, together with the EDSS score. The data consist of a total of 112 attributes. Using these attributes of 304 individuals, it is known whether they belong to an MS subgroup or to the healthy group. How can we classify which MS patient belongs to which subgroup of MS, including healthy individuals and those diagnosed with MS (based on the lesion diameters (MRI 1, MRI 2 and MRI 3), the number of lesions for (MRI 1, MRI 2 and MRI 3) as obtained from MRI images, and the EDSS scores)? The D matrix has a dimension of 304 × 112; that is, the D matrix includes the MS dataset of 304 individuals along with their 112 attributes (see Table 2.12 for the MS dataset). For the classification of the D matrix through the ID3 algorithm, the training procedure is employed first. For the training procedure, 66.66% of the D matrix is split off as the training dataset (203 × 112) and 33.33% as the test dataset (101 × 112).
After the training dataset has been trained with the ID3 algorithm, we can classify the test dataset. Although the procedural steps of the algorithm may seem complicated at first glance, all you have to do is concentrate on the steps and grasp them. For this, let us have a close look at the steps provided in Figure 6.9.
In order to perform classification with the ID3 algorithm, 33.33% of the MS dataset has been selected randomly as the test data.
Let us now consider how to classify the MS dataset provided in Table 2.10.1 in accordance with the steps of the ID3 algorithm (Figure 6.9).
Steps (1–3) Let us identify 66.66% of the MS dataset as the training data (D=203 × 112) and 33.33% as the test data (T =101 × 112).
Steps (4–10) Create a random subset. As a result of training the training set with the ID3 algorithm, the following result has been obtained: number of nodes: 39.
Steps (11–12) The splitting criterion labels the node.
The training of the MS dataset is done by applying the ID3 algorithm (Steps 1–12 as specified in Figure 6.7). Through this, a decision tree comprising 39 nodes (the learned tree, N = 39) has been obtained. The node chosen by the learned ID3 tree (MRI 1) and some of the rules pertaining to the nodes are presented in Figure 6.9.
Figure 6.10(a) and (b) shows a sample decision tree structure obtained by applying ID3 to the MS dataset. Some of the IF–THEN rules derived from it are as follows:
…
…
…
Rule 1
IF (1st region (MRI 1) lesion count < 10.5)
THEN class = RRMS
Rule 2
IF (1st region (MRI 1) lesion count ≥ 10.5)
THEN class = PPMS
…
…
…
As a result, the 33.33% portion of the MS dataset allocated for the test procedure has been classified into RRMS, SPMS, PPMS and healthy with an accuracy rate of 88.3% based on the ID3 algorithm.
As presented in Table 2.16, the WAIS-R dataset contains data of 200 samples belonging to the patient group and 200 samples to the healthy control group. The attributes are data regarding school education, gender, three-dimensional modeling, memory, ordering ability, age, …, DM. The WAIS-R dataset consists of a total of 21 attributes. Using these attributes of 400 individuals, it is known whether they belong to the patient or the healthy group. How can we classify which individual belongs to the patient group and which to the healthy group, based on the WAIS-R test (school education, gender, three-dimensional modeling, memory, ordering ability, age, …, DM)? The D matrix has a dimension of 400 × 21; that is, the D matrix includes the WAIS-R dataset of 400 individuals along with their 21 attributes (see Table 2.16 for the WAIS-R dataset). For the classification of the D matrix through the ID3 algorithm, the training procedure is employed first. For the training procedure, 66.66% of the D matrix is split off as the training dataset (267 × 21) and 33.33% as the test dataset (133 × 21).
After the training dataset has been trained with the ID3 algorithm, we can classify the test dataset. Although the procedural steps of the algorithm may seem complicated at first glance, all you have to do is concentrate on the steps and grasp them. We can have a glance at the steps shown in Figure 6.11:
How can we classify the WAIS-R dataset in Table 2.16.1 in accordance with the steps of ID3 algorithm (Figure 6.11)?
Steps (1–3) Let us identify 66.66% of the WAIS-R dataset as the training data (D = 267 × 21) and 33.33% as the test data (T = 133 × 21).
Steps (4–10) Create a random subset. As a result of training the training set (D = 267 × 21) with the ID3 algorithm, the following result has been obtained: number of nodes: 27. The sample decision tree structure is provided in Figure 6.12.
Steps (11–12) The splitting criterion labels the node.
The training of the WAIS-R dataset is conducted by applying the ID3 algorithm (Steps 1–12 as specified in Figure 6.12). Through this, a decision tree made up of 27 nodes (the learned tree, N = 27) has been obtained. The nodes chosen by the learned ID3 tree (three-dimensional modeling, memory, ordering ability, age) and some of the rules pertaining to these nodes are presented in Figure 6.12.
Figure 6.12(a) and (b) shows a sample decision tree structure obtained by applying ID3 to the WAIS-R dataset. Some of the IF–THEN rules derived from it are as follows:
…
…
…
Rule 1
IF (three-dimensional modeling test score < 13.5 and ordering ability test score < 0.5)
THEN class = healthy
Rule 2
IF (three-dimensional modeling test score ≥ 13.5 and memory test score < 16.5 and age < 40)
THEN class = patient
Rule 3
IF (three-dimensional modeling test score ≥ 13.5 and memory test score < 16.5 and age ≥ 40)
THEN class = healthy
Rule 4
IF (three-dimensional modeling test score ≥ 13.5 and memory test score ≥ 16.5)
THEN class = patient
…
…
…
As a result, the 33.33% portion of the WAIS-R dataset allocated for the test procedure has been classified into patient and healthy individuals with an accuracy rate of 51% based on the ID3 algorithm.
Having covered ID3, we can now have a look at C4.5. The information gain measure is biased toward attributes with a large number of values. For instance, consider an attribute acting as a unique identifier, such as an ID. A split on ID would result in a large number of partitions (as many as there are values), each containing just one dataset.
Because each partition is pure, the information required to classify dataset D based on this partitioning would be InfoID(D) = 0. Therefore, the information gained by partitioning on this attribute is maximal, yet such a partitioning is useless for classification [14].
C4.5, a successor of ID3, uses an extension to information gain known as the gain ratio, which attempts to overcome this bias toward tests with many outcomes. It applies a kind of normalization to information gain using a “split information” value defined analogously to Info(D).
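In standard notation, this split information can be written as

$$\mathrm{SplitInfo}_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\!\left(\frac{|D_j|}{|D|}\right).$$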
This value represents the potential information generated by splitting the training dataset D into v partitions, corresponding to the v outcomes of a test on attribute A. For each outcome, it considers the number of datasets having that outcome with respect to the total number of datasets in D. It thus differs from information gain, which measures the information with respect to classification acquired from the same partitioning. The gain ratio is defined through the following calculation [15]:
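In standard form, this is the ratio of the gain to the split information:

$$\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}_A(D)}.$$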
The attribute with the maximum gain ratio is selected as the splitting attribute. Note that as the split information approaches 0, the ratio becomes unstable. To avoid this, a constraint is added: the information gain of the test selected must be large, at least as great as the average gain over all tests examined.
Figure 6.13 presents the flowchart of the C4.5 algorithm.
Based on the C4.5 algorithm flowchart (Figure 6.13), you can follow the steps of the C4.5 algorithm.
Example 6.2 shows the application of the C4.5 algorithm step by step to give a clearer picture (the same procedure as was followed for the ID3 algorithm; see Figure 6.14).
Example 6.2 Table 6.1 lists, for the individuals from the WAIS-R dataset, the attributes gender, age, QIV and the class label. The data for the WAIS-R dataset handled in Section 6.2 (indicated in Tables 2.16.1 and 6.1) are used.
We can apply the steps of the C4.5 algorithm in order, step by step, so as to build the tree shown in Figure 6.1.