Steps (1–4) The first step is to calculate the root node in order to form the tree structure. As can be seen in the first column of Table 6.1, the total number of records is 15 (Individual (ID): 1, 2, 3, …, 15). We can calculate the root node using the entropy formula given in eq. (6.1).
As can be seen in the class column of Table 6.1, 11 individuals belong to the patient class and four to the healthy class. The general status entropy can be calculated based on eq. (6.1):
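Substituting the class counts just stated (11 patient, 4 healthy out of 15 records) into eq. (6.1) gives the following worked value (computed here for clarity):

$$
Entropy(D) = -\frac{11}{15}\log_2\frac{11}{15} - \frac{4}{15}\log_2\frac{4}{15} \approx 0.837
$$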
Step (5) Table 6.1 lists the calculation of the entropy value for the gender variable (female–male; our first variable) based on eq. (6.4).
Step (6) The gain ratio for the gender variable can be calculated based on eq. (6.5).
The calculation will be as follows:
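The worked numbers depend on Table 6.1's per-gender counts, which are not reproduced here. As a hedged illustration, the following Python sketch shows how the entropy and gain ratio quantities of eqs. (6.1)–(6.5) can be computed; the per-gender counts below are placeholders, not the book's values:

```python
import math

def entropy(counts):
    """Shannon entropy (eq. (6.1)) from a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gain_ratio(parent_counts, subsets):
    """Gain ratio (eq. (6.5)): information gain normalized by split information."""
    total = sum(parent_counts)
    weighted = sum(sum(s) / total * entropy(s) for s in subsets)
    gain = entropy(parent_counts) - weighted
    split_info = entropy([sum(s) for s in subsets])
    return gain / split_info if split_info > 0 else 0.0

parent = [11, 4]                 # 11 patient, 4 healthy (Table 6.1)
by_gender = [[7, 1], [4, 3]]     # hypothetical female/male class counts
print(entropy(parent))           # ~0.837, the general status entropy
print(gain_ratio(parent, by_gender))
```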
At this point, let us do the gain ratio calculation for the age variable following the same steps. In order to form the branches of the tree built by the C4.5 algorithm, the age variable should be split into groups. Let us assume that the age variable is grouped and labeled as follows.
Step (3) Let us now calculate the entropy values of the group numbers as labeled with the age interval values in Table 6.2; eq. (6.2) is used for this calculation.
Step (6) The gain ratio for the age variable, Gain Ratio(Age), is calculated according to eq. (6.5):
Now, let us calculate the gain ratio value of the QIV variable, which is our last variable.
Let us follow the same steps and apply them to the QIV variable as well. In order to form the branches of the tree, it is required to split the QIV variable into groups. Let us assume that the QIV variable is grouped as follows.
Step (3) Let us calculate the entropy value of the QIV variable based on Table 6.3:
Step (5) We can use eq. (6.5) to calculate the total weighted value of the QIV variable:
Step (6) Let us calculate the gain ratio value for the QIV variable.
Step (7) We have been able to calculate the gain ratio values of all the variables in Table 6.1. These variables are age, gender and QIV. The calculation has yielded the following results for the variables:
Gain Ratio(Gender) = 0.8387; Gain Ratio(Age) = 0.3890; Gain Ratio(QIV) = 0.3878

When the values are ranked, the greatest gain ratio is obtained for Gain Ratio(Gender). Accordingly, gender is selected as the root node, as presented in Figure 6.15.
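In code, this root selection reduces to taking the arg-max over the computed gain ratios; a minimal sketch using the three values above:

```python
gain_ratios = {"Gender": 0.8387, "Age": 0.3890, "QIV": 0.3878}
root = max(gain_ratios, key=gain_ratios.get)
print(root)  # -> Gender, chosen as the root node of the tree
```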
The gender variable constitutes the root of the tree in line with the procedural steps of the C4.5 algorithm, as can be seen in Figure 6.15. The branches of the tree are depicted as female and male.
The C4.5 algorithm has been applied to the economy (U.N.I.S.) dataset, the multiple sclerosis (MS) dataset and the WAIS-R dataset in Sections 6.4.1.1, 6.4.1.2 and 6.4.1.3, respectively.
As the first set of data, we will use, in the following sections, some data related to the U.N.I.S. countries' economies. The attributes of these countries' economies are data regarding years, unemployment youth female (% of male labor force ages 15–24) (national estimate), account at a financial institution, male (% age 15+), gross value added at factor cost ($), tax revenue, …, GDP growth (annual %). The data consist of a total of 18 attributes. Data belonging to the USA, New Zealand, Italy and Sweden economies from 1960 to 2015 are defined based on the attributes given in Table 2.7.1 (economy dataset (http://data.worldbank.org)) [15] and will be used in the following sections. For the classification of the D matrix through the C4.5 algorithm, the training procedure is to be employed first. For the training procedure, 66.66% of the D matrix can be split off as the training dataset (151 × 18) and 33.33% as the test dataset (77 × 18).
Once the training dataset has been trained with the C4.5 algorithm, we can classify the test dataset. Although the procedural steps of the algorithm may seem complicated at first glance, the only thing you have to do is concentrate on the steps and grasp them. We can take a close look at the steps presented in Figure 6.16.
How can we find the class of test data belonging to the economy (U.N.I.S.) dataset provided in Table 2.7.1 according to the C4.5 algorithm? We can find the class label of a sample from the economy (U.N.I.S.) dataset through the following steps (Figure 6.16):
Steps (1–3) Let us split the U.N.I.S. dataset as follows: 66.66% as the training data (D = 151 × 18) and 33.33% as the test data (T = 77 × 18).
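The book does not show code for these steps. As a rough sketch, the split-train-evaluate cycle can be mirrored with scikit-learn, whose DecisionTreeClassifier grows CART-style trees; criterion="entropy" only approximates C4.5's gain-ratio splitting, and the file names below are hypothetical:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical loader: X is the 228 x 18 attribute matrix, y the country labels.
X, y = np.load("unis_X.npy"), np.load("unis_y.npy")

# Steps (1-3): 66.66% training (151 x 18), 33.33% test (77 x 18).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3333, random_state=0)

# Steps (4-12): grow the tree; entropy-based splitting stands in for C4.5,
# since scikit-learn has no native gain-ratio criterion.
tree = DecisionTreeClassifier(criterion="entropy").fit(X_train, y_train)
print(accuracy_score(y_test, tree.predict(X_test)))
```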
Steps (4–10) Create random subset: the training of the training set (D = 151 × 18) through the C4.5 algorithm has yielded the following results: number of nodes – 15. A sample decision tree structure for C4.5 is elicited in Figure 6.17.
Steps (11–12) The splitting criterion labels the node.
The training of the economy (U.N.I.S.) dataset is done by applying the C4.5 algorithm (Steps 1–12 as specified in Figure 6.16). Through this, the decision tree made up of 15 nodes (the learned tree (N = 15)) has been obtained. The nodes chosen via the C4.5 learned tree (account at a financial institution (male), gross value added at factor cost) and some of the rules regarding the nodes are provided in Figure 6.17.
Based on Figure 6.17(a) and (b), which show the economy (U.N.I.S.) dataset sample decision tree structure obtained with the C4.5 algorithm, some of the resulting if–then rules are as follows:
…
…
…
Rule 1
IF (account at a financial institution for male (% age 15+) < %1.02675e+10) and (gross value added at factor cost (current US($)) ≥ $2.17617e+12)
THEN class = USA
Rule 2
IF (account at a financial institution, male (%age 15+) ≥ %1.02675e+10)
THEN class = Italy
…
…
…
As a result, the data in the economy (U.N.I.S.) dataset, with a 33.33% portion allocated for the test procedure and classified as USA, New Zealand, Italy and Sweden, have been classified with an accuracy rate of 86.4% based on the C4.5 algorithm.
As presented in Table 2.10, the MS dataset has data from the following groups: 76 samples belonging to RRMS, 76 samples to SPMS, 76 samples to PPMS and 76 samples belonging to healthy subjects of the control group. The attributes of the control group are data regarding brain stem (MRI 1), corpus callosum periventricular (MRI 2) and upper cervical (MRI 3) lesion diameter size (millimetric (mm)) in the MR image and the EDSS score. The data consist of a total of 112 attributes. It is known that, using these attributes of 304 individuals, we can know whether the data belong to an MS subgroup or the healthy group. How can we make the classification as to which MS patient belongs to which subgroup of MS, including healthy individuals and those diagnosed with MS (based on the lesion diameters (MRI 1, MRI 2, MRI 3), the number of lesions for each size (MRI 1, MRI 2, MRI 3) as obtained from MRI images and the EDSS scores)? The D matrix has a dimension of 304 × 112. This means the D matrix includes the MS dataset of 304 individuals along with their 112 attributes (see Table 2.10 for the MS dataset). For the classification of the D matrix through the C4.5 algorithm, the training procedure is to be employed first. For the training procedure, 66.66% of the D matrix can be split off as the training dataset (203 × 112) and 33.33% as the test dataset (101 × 112).
Once the training dataset has been trained with the C4.5 algorithm, we can classify the test dataset. Although the procedural steps of the algorithm may seem complicated at first glance, the only thing you have to do is concentrate on the steps and grasp them. For this, let us have a close look at the steps provided in Figure 6.18.
In order to be able to do classification through the C4.5 algorithm, the test data have been randomly selected as 33.33% of the MS dataset.
Let us now think for a while about how to classify the MS dataset provided in Table 2.10.1 in accordance with the steps of the C4.5 algorithm (Figure 6.18).
Steps (1–3) Let us identify 66.66% of the MS dataset as the training data (D=203 × 112) and 33.33% as the test data (T=101 × 112).
Steps (4–10) Create random subset: The training of the training set (D=203 × 112) through C4.5 algorithm has yielded the following results: number of nodes – 23. Sample decision tree structure for C4.5 is elicited in Figure 6.19.
Steps (11–12) The splitting criterion labels the node.
The training of the MS dataset is done by applying the C4.5 algorithm (Steps 1–12 as specified in Figure 6.18). Through this, the decision tree made up of 23 nodes (the learned tree (N = 23)) has been obtained. The node chosen via the C4.5 learned tree (second region lesion size for 4 mm) and some of the rules pertaining to the nodes are presented in Figure 6.19.
Based on Figure 6.19(a) and (b), which show the MS dataset sample decision tree structure obtained with the C4.5 algorithm, some of the resulting if–then rules are as follows:
…
…
…
Rule 1
IF (number of 4 mm lesions in the second region < 6)
THEN class = PPMS
Rule 2
IF (number of 4 mm lesions in the second region ≥ 6)
THEN class = SPMS
…
…
As a result, the data in the MS dataset, with a 33.33% portion allocated for the test procedure and classified as RRMS, SPMS, PPMS and healthy, have been classified with an accuracy rate of 67.28% based on the C4.5 algorithm.
As presented in Table 2.16, the WAIS-R dataset has data of 200 samples belonging to the patient group and 200 samples to the healthy control group. The attributes of the control group are data regarding school education, gender, …, DM. The data consist of a total of 21 attributes. It is known that, using these attributes of 400 individuals, we can know whether the data belong to the patient or the healthy group. How can we classify which individual belongs to the patient group and which to the healthy group, as diagnosed with the WAIS-R test (based on school education, gender, …, DM)? The D matrix has a dimension of 400 × 21. This means the D matrix includes the WAIS-R dataset of 400 individuals along with their 21 attributes (see Table 2.16 for the WAIS-R dataset). For the classification of the D matrix through the C4.5 algorithm, the training procedure is to be employed first. For the training procedure, 66.66% of the D matrix can be split off as the training dataset (267 × 21) and 33.33% as the test dataset (133 × 21).
Once the training dataset has been trained with the C4.5 algorithm, we can classify the test dataset. Although the procedural steps of the algorithm may seem complicated at first glance, the only thing you have to do is concentrate on the steps and grasp them. We can take a close look at the steps presented in Figure 6.20.
How can we classify the WAIS-R dataset in Table 2.16.1 in accordance with the steps of the C4.5 algorithm (Figure 6.20)?
Steps (1–3) Let us identify 66.66% of the WAIS-R dataset as the training data (D = 267 × 21) and 33.33% as the test data (T = 133 × 21).
Steps (4–10) Create random subset: the training of the training set (D = 267 × 21) with the C4.5 algorithm has yielded the following results: number of nodes – 15. A sample decision tree structure is shown in Figure 6.21.
Steps (11–12) The splitting criterion labels the node.
The training of the WAIS-R dataset is done by applying the C4.5 algorithm (Steps 1–12 as specified in Figure 6.20). Through this, the decision tree consisting of 15 nodes (the learned tree (N = 15)) has been obtained. The nodes chosen via the C4.5 learned tree (gender, ordering ability) and some of the rules regarding the nodes are shown in Figure 6.21.
Based on Figure 6.21(a) and (b), which show the WAIS-R dataset sample decision tree structure obtained with the C4.5 algorithm, some of the resulting IF–THEN rules are as follows:
…
…
…
Rule 1
IF ((Gender = Female) and (ordering ability of test score result < 0.5))
THEN CLASS = patient
Rule 2
IF ((Gender = Male) and (ordering ability of test score result ≥ 0.5))
THEN CLASS = healthy
…
…
…
As a result, the data in the WAIS-R dataset, with a 33.33% portion allocated for the test procedure and classified as patient and healthy individuals, have been classified with an accuracy rate of 99% based on the C4.5 algorithm.
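To make the extracted rules concrete, here is a minimal sketch of how Rules 1 and 2 above could be applied to a new record; the function name is illustrative, and records matching neither rule would be routed through the elided branches of the full 15-node tree:

```python
def classify_waisr(gender, ordering_ability):
    """Apply the two sample IF-THEN rules extracted from the learned tree."""
    if gender == "female" and ordering_ability < 0.5:
        return "patient"   # Rule 1
    if gender == "male" and ordering_ability >= 0.5:
        return "healthy"   # Rule 2
    return None            # covered by the remaining (elided) rules

print(classify_waisr("female", 0.3))  # -> patient
```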
As the third algorithm in this section, let us elaborate on CART. We can use the notation described previously. The Gini index measures the impurity of D, a data partition or set of training datasets:

$$
Gini(D) = 1 - \sum_{i=1}^{m} p_i^2
$$

where $p_i$ is the probability that a dataset in D belongs to class $C_i$, estimated as $|C_{i,D}|/|D|$, and the sum is calculated over the m classes [15–18].
The Gini index considers a binary split for each of the attributes. Suppose A is a discrete-valued attribute with v distinct values, $\{a_1, a_2, \ldots, a_v\}$, occurring in D. To determine the best binary split on A, one should analyze all the possible subsets that can be generated using the known values of A. Each subset $S_A$ defines a binary test for attribute A of the form "$A \in S_A$". For a given dataset, this test is satisfied if the value of A for the dataset is one of the values listed in $S_A$. If A has v possible values, then there are $2^v$ possible subsets.
For example, if school education has three possible values, {primary school, junior high school, vocational school}, then the possible subsets are {primary school, junior high school, vocational school}, {primary school, junior high school}, {primary school, vocational school}, {junior high school, vocational school}, {primary school}, {junior high school}, {vocational school} and {}. We exclude the full set {primary school, junior high school, vocational school} and the empty set from consideration, since they do not conceptually represent a split. This example illustrates that there are $2^v - 2$ possible ways to form two partitions of the data D based on a binary split on A. For the binary split, a weighted sum of the impurity of each resultant partition is calculated. For instance, if the binary split on A splits D into $D_1$ and $D_2$, the Gini index of D given that partitioning is

$$
Gini_A(D) = \frac{|D_1|}{|D|}\,Gini(D_1) + \frac{|D_2|}{|D|}\,Gini(D_2)
$$
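The subset count is easy to verify programmatically; a short sketch enumerating the $2^v - 2$ candidate subsets for the school education example (each binary partition appears twice, once per side):

```python
from itertools import combinations

values = ["primary school", "junior high school", "vocational school"]

# All subsets except the full set and the empty set: 2^v - 2 candidates,
# i.e. (2^v - 2) / 2 distinct binary partitions of D.
splits = [set(c) for r in range(1, len(values))
          for c in combinations(values, r)]
print(len(splits))  # -> 6 = 2**3 - 2
for s in splits:
    print(s, "vs", set(values) - s)
```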
Here, for each attribute, each of the possible binary splits is considered. If the attribute is discrete-valued, then the subset yielding the minimum Gini index for that attribute is chosen as its splitting subset [9].
For attributes with continuous values, each possible split-point has to be taken into consideration. The strategy is similar to the one described previously for information gain: the midpoint between each pair of sorted adjacent values is treated as a possible split-point. The point giving the minimum Gini index for the given (in this case continuous-valued) attribute is taken as its split-point. Note that, for a possible split-point of A, $D_1$ is the set of datasets in D satisfying $A \le$ split_point and $D_2$ is the set of datasets in D satisfying $A >$ split_point, as mentioned earlier. The reduction in impurity incurred by a binary split on a discrete- or continuous-valued attribute A is

$$
\Delta Gini(A) = Gini(D) - Gini_A(D)
$$
The attribute that has the minimum Gini index or the one that maximizes the reduction in impurity is chosen as the splitting attribute.
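As an illustration of this selection rule, the following sketch computes the Gini index and scans candidate split-points for a continuous attribute using the midpoint strategy described above; the toy ages and labels are invented for the example:

```python
def gini(labels):
    """Gini index of a list of class labels: 1 - sum(p_i^2)."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split_point(values, labels):
    """Scan midpoints of sorted adjacent values; return the split-point
    minimizing the weighted Gini index of the two partitions."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # identical adjacent values yield no midpoint
        mid = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [lab for v, lab in pairs if v <= mid]
        right = [lab for v, lab in pairs if v > mid]
        w = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
        if w < best[1]:
            best = (mid, w)
    return best

ages, classes = [22, 24, 45, 48, 85, 90], list("PPHHPH")  # P: patient, H: healthy
print(best_split_point(ages, classes))  # -> (34.5, 0.25)
```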
Figure 6.22 presents the flowchart of the CART algorithm.
Based on the aforementioned flowchart for the CART algorithm, you can find the steps of the CART algorithm in Figure 6.23. The steps applied in the CART algorithm are as follows.
Example 6.3 shows the application of the CART algorithm (Figure 6.23) to get a clearer picture through relevant steps in the required order.
Example 6.3 In Table 6.4, four variables for nine individuals chosen from the sample WAIS-R dataset are elicited: school education, gender, age and class. In this example, let us consider how we can form the tree structure by calculating the Gini index values of the CART algorithm (Figure 6.23) for the three variables (school education, age and gender), with the class labels being patient and healthy.
Let us now apply the steps of the CART algorithm (Figure 6.23) in order to form the tree structure for the variables in the sample dataset given in Table 6.4.
Iteration 1 The procedural steps in Table 6.4 and in Figure 6.23 regarding the CART algorithm will be used to form the tree structure for all the variables.
Steps (1–4) Regarding the variables provided (school education, age, gender and class) in Table 6.4, binary grouping is performed.
Table 6.5 is based on the number of records in each class (patient and healthy individuals) for the variable groupings.
Variables are as follows:
School education (primary school, vocational school, high school)
Age (20 ≤ young ≤ 25), (40 ≤ middle ≤ 50), (80 ≤ old ≤ 90)
Gender (female, male)
Step (5) $Gini_{left}$ and $Gini_{right}$ values (eq. (6.6)) are calculated for the variables given in Table 6.5 (school education, age and gender), following the CART algorithm steps (Figure 6.23).
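Eqs. (6.6) and (6.7) are not reproduced here; they presumably take the standard CART form for a split of the $n$ records into left and right branches:

$$
Gini_{left/right} = 1 - \sum_{i=1}^{m} p_i^2, \qquad
Gini_j = \frac{n_{left}}{n}\,Gini_{left} + \frac{n_{right}}{n}\,Gini_{right}
$$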
Gini calculation for school education is as follows:
Gini calculation for age is as follows:
Gini calculation for gender is as follows:
Table 6.4: Selected sample from the WAIS-R dataset.
Table 6.5: The number of classes that variables belong to identified as class (patient, healthy).
Step (6) The $Gini_j$ value of the variables (school education, age, gender) is calculated (see eq. (6.7)) using the results obtained in Step 5.
Table 6.6 lists the results obtained from the steps applied for Iteration 1 to the CART algorithm as in Steps 1–6.
Table 6.6: The application of the CART algorithm for Iteration 1 to the sample WAIS-R dataset.
Steps (7–11) The Gini values calculated for the variables in Table 6.6 are as follows:
The smallest value among the $Gini_j$ values has been obtained as $Gini_{Age} = 0.229$. Branching will begin from the age variable (the smallest value), namely, the root node.
In order to start the root node, we take the registered individual (ID) values that correspond to the age variable groupings (right–left) in Table 6.4.
Given this, the procedure starts with the individuals with IDs 2 and 7, which are assigned to a decision (leaf) classification. The individuals with IDs 1, 3, 4, 5 and 6 will form the splitting of the other nodes in the tree. The results obtained from the first split in the tree are shown in Figure 6.24.
Iteration 2 The tree structure should be formed for all the variables in Table 6.4 in the CART algorithm. Thus, for the other variables, it is required to apply in Iteration 2 the same procedural steps of the CART algorithm used in Iteration 1.
The new sample WAIS-R dataset in Table 6.7 has been formed by excluding the individuals with IDs 2 and 7, separated in the first branching of the tree structure in Iteration 1.
Table 6.7: Sample WAIS-R dataset for the Iteration 2.
Steps (1–4) Regarding the variables provided (school education, age, gender and class) in Table 6.7, binary grouping is performed.
Table 6.8 is based on the number of records in each class (patient, healthy) for the variable groupings.
Variables are as follows:
School education (primary school, vocational school, high school)
Age (40 ≤ middle ≤ 50) (80 ≤ old ≤ 90)
Gender (female, male)
Table 6.8: The number of classes that variables belong to, identified as class (patient, healthy), in Table 6.7 (for Iteration 2).
Step (5) $Gini_{left}$ and $Gini_{right}$ values (eq. (6.6)) are calculated for the variables listed in Table 6.8 (school education, age and gender), following the CART algorithm steps (Figure 6.23).
Gini calculation for school education is as follows:
Gini calculation for age is as follows:
Gini calculation for gender is as follows:
Step (6) The $Gini_j$ value of the variables (school education, age, gender) is calculated (see eq. (6.7)) using the results obtained in Step 5.
Table 6.9: The application of the CART algorithm for Iteration 2 to the sample WAIS-R dataset.
Table 6.9 lists the results obtained from the steps applied for Iteration 2 to the CART algorithm as in Steps 1–6.
Steps (7–11) The Gini values calculated for the variables listed in Table 6.9 are as follows:
The smallest value among the $Gini_j$ values has been obtained as $Gini_{Gender} = 0.200$.
Branching will begin starting from the gender variable (the smallest value), namely, the root node.
In order to start this node, we take the female and male values that correspond to the gender variable groupings (right–left) in Table 6.7.
Given this, the procedure starts with the individuals with IDs 1, 4 and 5, which are assigned to a decision (leaf) classification. The remaining individuals will form the splitting of the other nodes in the tree. The results obtained from this split in the tree (see Figure 6.25) are given as follows:
Iteration 3 The tree structure should be formed for all the variables in Table 6.4 in the CART algorithm. Thus, for the other variables, it is required to apply in Iteration 3 the same procedural steps of the CART algorithm used in Iterations 1 and 2.
Regarding the variables provided (school education, age, gender and class) in Table 6.10, binary grouping is performed.
Table 6.10 is based on the number of records in each class (patient, healthy; see Figure 6.26) for the variable groupings.
Table 6.10: Iteration 3 – sample WAIS-R dataset.
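Taken together, the three iterations follow one recurring pattern: compute per-variable Gini values on the remaining records, branch on the minimum, and drop the records the new node resolves. A compact, self-contained sketch of that loop, with invented records mirroring Table 6.4's structure and per-value grouping as a simplification of CART's binary subset search:

```python
def gini(labels):
    """Gini index of a list of class labels: 1 - sum(p_i^2)."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def weighted_gini(rows, var):
    """Weighted Gini of a split on var, grouping records per value."""
    n = len(rows)
    groups = {}
    for r in rows:
        groups.setdefault(r[var], []).append(r["class"])
    return sum(len(g) / n * gini(g) for g in groups.values())

# Invented records mirroring Table 6.4's structure (IDs, two variables, class).
rows = [
    {"id": 1, "gender": "F", "age": "middle", "class": "patient"},
    {"id": 2, "gender": "M", "age": "young",  "class": "healthy"},
    {"id": 3, "gender": "F", "age": "old",    "class": "patient"},
    {"id": 4, "gender": "M", "age": "middle", "class": "healthy"},
]
variables = ["gender", "age"]

while rows and variables:
    scores = {v: weighted_gini(rows, v) for v in variables}
    best = min(scores, key=scores.get)      # branch on the minimum Gini
    print("split on", best, scores)
    # Keep only the impurest branch for further splitting; the purer
    # branches are resolved as leaves, as in Iterations 1-3 above.
    keep = max({r[best] for r in rows},
               key=lambda val: gini([r["class"] for r in rows if r[best] == val]))
    rows = [r for r in rows if r[best] == keep]
    variables.remove(best)
```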
The CART algorithm has been applied to the economy (U.N.I.S.) dataset, the multiple sclerosis (MS) dataset and the WAIS-R dataset in Sections 6.5.1.1, 6.5.1.2 and 6.5.1.3, respectively.
As the first set of data, we will use, in the following sections, some data related to the U.N.I.S. countries' economies. The attributes of these countries' economies are data regarding years, unemployment, GDP per capita (current international $), youth male (% of male labor force ages 15–24) (national estimate), …, GDP growth (annual %). The data consist of a total of 18 attributes. Data belonging to the USA, New Zealand, Italy and Sweden economies from 1960 to 2015 are defined based on the attributes given in Table 2.7. The economy dataset (http://data.worldbank.org) [15] is used in the following sections. For the classification of the D matrix through the CART algorithm, the training procedure is to be employed first. For the training procedure, 66.66% of the D matrix can be split off as the training dataset (151 × 18) and 33.33% as the test dataset (77 × 18).
Once the training dataset has been trained with the CART algorithm, we can classify the test dataset. Although the procedural steps of the algorithm may seem complicated at first glance, the only thing you have to do is concentrate on the steps and grasp them. For this, let us have a close look at the steps provided in Figure 6.27.
How can we find the class of test data belonging to the economy (U.N.I.S.) dataset provided in Table 2.7.1 according to the CART algorithm? We can find the class label of a sample from the economy (U.N.I.S.) dataset through the following steps (Figure 6.27):
Steps (1–3) Let us split the U.N.I.S. dataset as follows: 66.66% as the training data (D = 151 × 18) and 33.33% as the test data (T = 77 × 18).
Steps (4–10) Create random subset: the training of the training set (D = 151 × 18) through the CART algorithm has yielded the following results: number of nodes – 20. A sample CART decision tree structure is elicited in Figure 6.28.
Steps (11–12) The splitting criterion labels the node.
The training of the economy (U.N.I.S.) dataset is carried out by applying the CART algorithm (Steps 1–12 as specified in Figure 6.27). Through this, the decision tree consisting of 20 nodes (the learned tree (N = 20)) has been obtained. The node chosen via the CART learned tree (GDP per capita (current international $)) and some of the rules for the nodes are shown in Figure 6.28.
Based on Figure 6.28(a) and (b), which show the economy (U.N.I.S.) dataset sample decision tree structure obtained with the CART algorithm, some of the resulting if–then rules are as follows:
…
…
…
Rule 1
IF (GDP per capita (current international $) < $144992e+06)
THEN class = USA
Rule 2
IF (GDP per capita (current international $) ≥ $144992e+06)
THEN class = Italy
…
…
…
As a result, the data in the economy (U.N.I.S.) dataset, with a 33.33% portion allocated for the test procedure and classified as USA, New Zealand, Italy and Sweden, have been classified with an accuracy rate of 95.2% based on the CART algorithm.
As listed in Table 2.10, the MS dataset has data from the following groups: 76 samples belonging to RRMS, 76 samples to SPMS, 76 samples to PPMS and 76 samples belonging to healthy subjects of the control group. The attributes of the control group are data regarding brain stem (MRI 1), corpus callosum periventricular (MRI 2) and upper cervical (MRI 3) lesion diameter size (millimetric (mm)) in the MR image and the EDSS score. The data consist of a total of 112 attributes. It is known that, using these attributes of 304 individuals, we can know whether the data belong to an MS subgroup or the healthy group. How can we make the classification as to which MS patient belongs to which subgroup of MS, including healthy individuals and those diagnosed with MS (based on the lesion diameters (MRI 1, MRI 2, MRI 3), the number of lesions for each size (MRI 1, MRI 2, MRI 3) as obtained from MRI images and the EDSS scores)? The D matrix has a dimension of 304 × 112. This means the D matrix includes the MS dataset of 304 individuals along with their 112 attributes (see Table 2.12.1 for the MS dataset). For the classification of the D matrix through the CART algorithm, the training procedure is to be employed first. For the training procedure, 66.66% of the D matrix can be split off as the training dataset (203 × 112) and 33.33% as the test dataset (101 × 112).
Once the training dataset has been trained with the CART algorithm, we can classify the test dataset. Although the procedural steps of the algorithm may seem complicated at first glance, the only thing you have to do is concentrate on the steps and grasp them. We can take a close look at the steps presented in Figure 6.29.