3 Dynamic fuzzy decision tree learning

This chapter is divided into six sections. Section 3.1 analyses the research status of decision trees. In Section 3.2, we present the decision tree methods of the dynamic fuzzy lattice. In Section 3.3, we describe the special attribute processing techniques of dynamic fuzzy decision trees (DFDTs). Section 3.4 presents the pruning strategy of DFDTs. In Section 3.5, we introduce an application. A summary of this chapter is given in Section 3.6.

3.1 Research status of decision trees

3.1.1 Overseas research status

1. Decision tree algorithms

An Attribute Value Taxonomies Decision Tree Learning algorithm [1] uses the accuracy of attribute values in instances to construct a decision tree across multiple layers. The algorithm has higher classification accuracy for multi-layer instances, and, when there are more missing values, it achieves better results than C4.5 and produces a more compact decision tree than other methods. A learning algorithm for a Fuzzy Decision Tree [2] has been developed by constructing the tree based on the uncertainty of the decision boundary in terms of certain properties. This paper proposes a method to obtain classification confidence estimates as well as determine the location and associated uncertainty of each decision boundary. The method was applied to four machine learning datasets, and the results showed improvements in classification accuracy and tree size. A Soft Decision Tree (SDT) method using fuzzy sets has also been proposed [3]. SDT determines the structure of the tree by merging the growth and pruning processes and modifies the tree to improve its generality. The method proposed in [4] constructs a decision tree depending on the category of an attribute subset. Attribute selection is an important factor in improving the accuracy of the classifier, so this method retains the attribute classification information. A subset of the original attribute set is then obtained to construct the decision tree. As the information contained in the subset is more concentrated, the accuracy of the classifier is improved, and good results have been achieved in pattern recognition experiments. A method of constructing a decision tree by discretizing continuous-valued attributes in the numerical domain has been proposed [5]. The continuous-valued attributes are first discretized, and then the decision tree is constructed with an estimated value for each interval. This method is used to construct decision trees on datasets in the numerical domain and offers significant improvements in speed and accuracy. At the same time, a practical pre-pruning method has been developed to deal with numerical attributes, and highly accurate results have been obtained.

2. Decision tree pruning

Decision tree pruning methods are summarized in [6, 7] in terms of their theoretical basis, computational complexity, advantages, and disadvantages. A large amount of experimental data is used to objectively evaluate excessive and insufficient pruning, with the aim of reducing the error inherent in current pruning methods. The DI pruning method was presented in [8]. Most pruning methods reduce the classification error rate, but in uncertain domains, many subtrees that cannot decrease the error rate may have some special effects. The DI pruning method takes into account the complexity of the subtree and determines the corresponding decision rules from the leaf nodes of subtrees. At the same time, the DI method can be used to evaluate the quality of data for knowledge discovery. This method has been applied in the UnDeT software. A unified framework based on the pruning method was defined in [9], which maps the pruning process to an optimization problem. The framework can be used to learn the pruning method, leading to an experimental analysis of post-pruning in terms of prediction accuracy and tree size. In addition to construction and pruning methods, there have been many studies on attribute processing and applications in decision trees, such as IBM's processing schemes for missing values in data mining decision rules. Decision trees have also been applied to the classification of chemical and biological structures [10].

3.1.2 Domestic research status

1. Decision tree algorithms

The decision tree algorithms proposed in [11, 12] are based on rough set theory. Attribute reduction is first carried out [14], and then the branch attribute is selected by calculating the relative positive region. Yin et al. [13] proposed a reduced decision tree algorithm that can deal with changes in the data based on an incremental algorithm. Three basic theorems were proposed and proved, and the reliability of the algorithm was guaranteed. This algorithm can solve the problem of constructing decision trees on dynamic datasets, and the proposed Dynamic Decision Tree algorithm also promotes the development and improvement of online rule extraction. A two-stage decision tree construction algorithm that combines IBM's SLIQ algorithm with multi-attribute classification based on ECGA has been proposed [14]. The tree construction algorithm first constructs subtrees based on data from a coarse classification; multi-attribute classification based on ECGA is then applied to the incompletely classified data at the leaf nodes. The rule set is obtained directly from the incompletely classified data of the leaf nodes, producing a decision tree that mixes multiple attributes. Yang et al. [15] proposed a method to construct a decision tree by transforming a number of classes into two categories. Classes are transformed into positive and negative classes to produce the first level, and the positive class is then subdivided into positive and negative classes to generate the second level. In the same way, the negative class of the first level is also subdivided into positive and negative classes at the second level. This process is repeated until the classification is complete. Wang et al. [16] presented a fast scalable parallel classification (FSPC) algorithm to deal with datasets with multiple attributes. FSPC not only reduces the cost of communication and I/O but also improves the parallelism of the algorithm by dividing the dataset vertically in the process of selecting the test attribute. There have been many other studies on decision tree learning algorithms, such as parallel decision tree classification based on SPRINT [17], decision tree classification research, and the latest developments of decision tree algorithms in data mining [18, 19].

2. Other aspects

Wen et al. [20] reported a method to deal with missing attribute values in decision trees, defining a way to fill in missing values that cannot otherwise be added to the decision tree during construction. A new method of information acquisition for semi-structured decision trees has been proposed from the perspective of applications, covering its standardization, characteristics, utility, and construction process [21]. This provides a new, efficient retrieval tool for web information. A network intrusion detection method based on multiple decision trees [22] divides a large dataset into several sub-datasets, to which the decision tree algorithm is applied for information retrieval. The results from multiple decision trees are then combined into a global judgment by voting. The method can be applied to network intrusion detection, which not only improves the ability to process massive data but also reduces the false-positive rate. Applications of decision tree research also include data packet filtering [22], customer classification [16], and so on.

Although scholars have continued to research and improve decision tree algorithms (such as SLIQ, SPRINT, and PUBLIC), the shortcomings and limitations of each algorithm are obvious. The ID3 algorithm is simple and easy to use, but it cannot deal with continuous attributes and the constructed decision tree overfits the data; in this regard, the C4.5 algorithm offers two improvements. The CART algorithm not only deals with highly skewed or polymorphic numerical data but also handles ordered or unordered categorical data. However, it is not suitable for processing large-scale data because the training samples must reside in memory.

Decision tree construction algorithms are generated through an in-depth study of decision tree learning. Although the decision trees built by these methods are very useful in generating rules and knowledge, they are often compromised by uncertainties related to people’s thinking and perception in terms of rule generation and knowledge expression. Thus, inductive learning with an exact description of features cannot adapt to the imprecise knowledge representation in a system. To handle imprecise knowledge representation in an uncertain (fuzzy) environment, a fuzzy decision tree learning algorithm is required.

The world and its problem domains are always changing. How to express the dynamic nature of decision trees is therefore a genuine problem. Current approaches can only solve problems in which the decision tree node set and attribute set are static. In fact, it is hoped that the set of nodes in the decision tree can describe dynamic fuzzy uncertainty problems in the real world. Hence, we introduce the Dynamic Fuzzy Lattice to conduct exploratory research on this problem, as described in the following sections. The concept of the Dynamic Fuzzy Lattice is described in Appendix 8.4.

3.2 Decision tree methods for a dynamic fuzzy lattice

In this section, we first introduce the concept and classical algorithms related to decision trees and then outline the theoretical framework of decision trees based on the Dynamic Fuzzy Lattice. The following introduces the ID3 algorithm and related examples.

3.2.1 ID3 algorithm and examples

ID3 is a decision tree classification algorithm based on information entropy that classifies instances according to an attribute set. Rooted in information theory, ID3 uses the mutual information (information gain) as a measure of attribute discrimination, and the attribute with the largest information gain is used as the criterion for attribute selection [24].

For a training instance set X, let the number of training instances that belong to class i be $N_i$ and the total number of instances in X be $|X|$. If the probability of an instance belonging to class i is $p_i$, then

$$p_i = \frac{N_i}{|X|}.$$

If the decision attribute has k different values, then the entropy of the training instance set X with respect to this k-state classification is

$$\mathrm{Entropy}(X) = -\sum_{i=1}^{k} p_i \log_2 p_i.$$

The information gain of attribute A relative to instance set X is defined as

$$\mathrm{Gain}(X, A) = \mathrm{Entropy}(X) - \sum_{v \in \mathrm{Value}(A)} \frac{|X_v|}{|X|}\,\mathrm{Entropy}(X_v),$$

where Value(A) is the set of all possible values of attribute A, and $X_v$ denotes the subset of X for which the value of attribute A is v, that is,

$$X_v = \{x \in X \mid A(x) = v\}.$$
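To make these definitions concrete, the following minimal Python sketch (ours, not part of the original text) computes the entropy and information gain defined above for instances stored as dictionaries; the toy data are hypothetical stand-ins for Tab. 3.1, not the actual table contents.

```python
import math
from collections import Counter

def entropy(instances, target):
    """Entropy(X) = -sum_i p_i * log2(p_i), where p_i = N_i / |X|."""
    counts = Counter(inst[target] for inst in instances)
    total = len(instances)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(instances, attribute, target):
    """Gain(X, A) = Entropy(X) - sum_v |X_v|/|X| * Entropy(X_v)."""
    total = len(instances)
    remainder = 0.0
    for value in {inst[attribute] for inst in instances}:
        subset = [inst for inst in instances if inst[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset, target)
    return entropy(instances, target) - remainder

# Hypothetical instances in the spirit of Tab. 3.1 (values are illustrative only).
X = [
    {"Effort": "hard", "Family": "good", "Homework": "yes", "Score": "high"},
    {"Effort": "normal", "Family": "bad", "Homework": "yes", "Score": "low"},
    {"Effort": "hard", "Family": "bad", "Homework": "no", "Score": "low"},
    {"Effort": "normal", "Family": "good", "Homework": "no", "Score": "high"},
]
best = max(["Effort", "Family", "Homework"],
           key=lambda a: information_gain(X, a, "Score"))
print(best)  # the attribute with the largest information gain becomes the split attribute
```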

Tab. 3.1: The training instance set of the students’ scores.

Consider student achievements as an example to illustrate the actual process of the ID3 algorithm. The training instance set of the students’ scores is given in Tab. 3.1.

At the beginning of the algorithm, all instances are included in the root node. To find the best partition attribute, the entropy of the training example set is needed:

The entropy of each attribute in the training instance is calculated as follows:

and

Thus, the information gain for the “Effort” attribute is

In the same way, Gain(X, Family) = 0.4591, Gain(X, Homework) = 0.

From this, the information gain of the "Family" attribute is the largest, so "Family" is chosen as the classification attribute of this node. The final decision tree is shown in Fig. 3.1.

Fig. 3.1: Decision tree generated by Tab. 3.1.

3.2.2 Characteristics of dynamic fuzzy analysis of decision trees

The ID3 algorithm ends with the construction of the decision tree and operates on some of the attributes of the instance set. The instance set has four attributes (see Tab. 3.1): "Effort", "Family", "Homework", and "Score". For example, the range of "Effort" is {hard, normal}, and there is no clear boundary between the two values; that is, there is no exact standard to measure what degree of effort is "hard" and what degree is "normal". Thus, the condition for obtaining the attribute value of "Effort" is fuzzy. At the same time, "Effort" is constantly changing and developing, which shows that it has the characteristics of dynamic fuzzy data (DFD). As decision tree learning completes the classification of instances through such attributes, the decision tree learning system exhibits both dynamic and fuzzy characteristics.

3.2.3 Representation methods for dynamic fuzzy problems in decision trees [25, 26]

Dynamic fuzzy knowledge representation for decision tree learning can be summarized as follows:

(1) For attribute $x_i$, an attribute value represented by $(\overrightarrow{x}_i, \overleftarrow{x}_i)$ is used to replace $x_i$. This indicates dynamic fuzzy characteristics, and its value is a dynamic fuzzy number. For example, the attribute "Effort" in Tab. 3.1 is replaced by "Effort = hard" over the range $(\overrightarrow{0.7}, \overleftarrow{0.3})$. The attribute value is thus represented by a dynamic fuzzy number, where $(\overrightarrow{\alpha}, \overleftarrow{\alpha})$ indicates a higher or a more general degree of effort.

(2) For an instance $\langle x, c(x)\rangle$, $x = \langle x_1, x_2, \ldots, x_i\rangle$ represents the training instance and $x_i$ represents the attributes of the training instance. From (1), we can see that the attributes have the characteristics of DFD. Thus, $x$ is represented by $(\overrightarrow{x}, \overleftarrow{x})$ and $c(x)$ is represented by the dynamic fuzzy number $(\overrightarrow{\alpha}, \overleftarrow{\alpha})$, that is, $c(\overrightarrow{x}, \overleftarrow{x}) = (\overrightarrow{\alpha}, \overleftarrow{\alpha}) \in [\overrightarrow{0}, \overleftarrow{0}] \times [\overrightarrow{1}, \overleftarrow{1}]$.

(3) Each hypothesis $h$ in the hypothesis space $H$ is represented by a dynamic fuzzy number $(\overrightarrow{\alpha}, \overleftarrow{\alpha})$. A hypothesis $h \in H$ represents a function defined on $X$, $h: X \to [0, 1] \times [\leftarrow, \rightarrow]$, where $h(\overrightarrow{x}, \overleftarrow{x}) = (\overrightarrow{\alpha}, \overleftarrow{\alpha}) \in [\overrightarrow{0}, \overleftarrow{0}] \times [\overrightarrow{1}, \overleftarrow{1}]$.

(4) The instance set $U$ contains the instances, and the characteristics of DFD are represented by their attributes. Thus, the instance set $U$ is represented by a dynamic fuzzy set (DFS) $(\overrightarrow{U}, \overleftarrow{U})$, which means that the instance set is a domain in a dynamic fuzzy space. A small illustrative sketch of this representation is given below.
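As a rough illustration only, the following sketch models a dynamic fuzzy attribute value as a membership degree paired with a direction of change; the data structure and the concrete numbers are our assumptions, not the book's formal definition.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# A dynamic fuzzy number: a membership degree in [0, 1] paired with a direction
# of change ("->" for an increasing tendency, "<-" for a decreasing tendency).
DFN = Tuple[float, str]

@dataclass
class DFInstance:
    """A training instance whose attribute values and class label are dynamic fuzzy numbers."""
    attributes: Dict[str, Tuple[str, DFN]]  # attribute name -> (symbolic value, DFN)
    label: DFN                              # c(x) represented as a dynamic fuzzy number

# Hypothetical rendering of an instance in the spirit of Tab. 3.2
# (the numbers are illustrative, not taken from the book).
instance9 = DFInstance(
    attributes={
        "Effort": ("hard", (0.7, "->")),
        "Family": ("good", (0.8, "->")),
        "Homework": ("yes", (0.6, "<-")),
    },
    label=(0.9, "->"),
)
print(instance9.attributes["Effort"])  # ('hard', (0.7, '->'))
```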

The representation of instance 9 in Tab. 3.1 after the above substitutions is presented in Tab. 3.2.

As DFS and DFL are used to represent instances, the decision tree constructed by those instances undergoes the corresponding transformation. For example, if we consider Fig. 3.1 in dynamic fuzzy terms, we obtain Fig. 3.2. The decision tree constructed by instances based on DFS and DFL is called a Dynamic Fuzzy Decision Tree.

To construct a decision tree using the Dynamic Fuzzy Lattice, we need to define the general-to-specific order structure. First, consider the following two hypotheses, $h_i$ and $h_j$.

Definition 3.1 Let $h_i$, $h_j$ be functions defined on $(\overrightarrow{X}, \overleftarrow{X})$. $h_i$ is equal to or more general than $h_j$ if and only if

$$\forall (\overrightarrow{x}, \overleftarrow{x}) \in (\overrightarrow{X}, \overleftarrow{X}): \quad \big(h_j(\overrightarrow{x}, \overleftarrow{x}) = (\overrightarrow{1}, \overleftarrow{1})\big) \rightarrow \big(h_i(\overrightarrow{x}, \overleftarrow{x}) = (\overrightarrow{1}, \overleftarrow{1})\big).$$

This is denoted by $h_i \geq_g h_j$, where $(\overrightarrow{x}, \overleftarrow{x})$ represents an instance. If $h_i$ is strictly more general than $h_j$, we write $h_i >_g h_j$.

Tab. 3.2: Representation of instance 9 in Tab. 3.1.

Fig. 3.2: Decision tree obtained by applying dynamic fuzzification to Fig. 3.1.

The $\geq_g$ operator defines a partial order relation on the instance space $(\overrightarrow{U}, \overleftarrow{U})$ (the relation is reflexive, antisymmetric, and transitive), which provides an effective structure for learning problems. A more intuitive explanation of the $\geq_g$ relation is that $h_i$ places fewer constraints on the instances: any instance classified into a class by $h_j$ will be classified into the same class by $h_i$. Because $h_j$ has more constraints than $h_i$, $h_i$ is more general than $h_j$. The attribute tests "Family = good" and "Effort = hard" in Fig. 3.3 divide the instances of the training set. As shown in Fig. 3.3, the left tree structure gives the set of instances corresponding to each node of the decision tree, and the relationship between an upper set and a lower set can be described by $\geq_g$, such as $\{3, 5, 10, 11\} \geq_g \{3, 5\}$ and $\{3, 5, 10, 11\} \geq_g \{10, 11\}$. Therefore, the sets of instances corresponding to the nodes of the decision tree form a partial order. The right side shows the result of dividing the instance set at each layer of the decision tree. It can be seen from the partition result that the classification attributes gradually divide the instance set.
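Read this way, the $\geq_g$ relation between decision tree nodes reduces to set inclusion of the instance sets they cover. The following is a minimal sketch of that reading, using the example node sets quoted above (the helper name is ours):

```python
def more_general_or_equal(covered_i, covered_j):
    """h_i >=_g h_j holds when every instance covered by h_j is also covered by h_i."""
    return set(covered_j) <= set(covered_i)

# Instance sets attached to decision tree nodes, as in the Fig. 3.3 example.
root_node = {3, 5, 10, 11}
left_child = {3, 5}
right_child = {10, 11}

print(more_general_or_equal(root_node, left_child))    # True: {3, 5, 10, 11} >=_g {3, 5}
print(more_general_or_equal(root_node, right_child))   # True: {3, 5, 10, 11} >=_g {10, 11}
print(more_general_or_equal(left_child, right_child))  # False: sibling nodes are incomparable
```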

Fig. 3.3: Partition of training samples corresponding to the DFDT in Fig. 3.2.

3.2.4 DFDT classification attribute selection algorithm

The representation of the DFDT is given above. As in the construction of a classical decision tree, we need to determine how to divide the set of instances, that is, to decide which attribute to use in the next step of the classification. This leads to the DFDT classification attribute selection algorithm.

In DFDT learning, $(\overrightarrow{a}_i, \overleftarrow{a}_i)$ denotes an attribute, the range of $(\overrightarrow{a}_i, \overleftarrow{a}_i)$ is $(\overrightarrow{V}_{a_i}, \overleftarrow{V}_{a_i}) = \{(\overrightarrow{v}_1, \overleftarrow{v}_1), (\overrightarrow{v}_2, \overleftarrow{v}_2), \ldots, (\overrightarrow{v}_k, \overleftarrow{v}_k)\}$, and $|(\overrightarrow{V}_{a_i}, \overleftarrow{V}_{a_i})| = k$ denotes the number of attribute values of $(\overrightarrow{a}_i, \overleftarrow{a}_i)$. An instance can be denoted as

$$(\overrightarrow{I}, \overleftarrow{I}) = \big\{(\overrightarrow{v}_{a_1}, \overleftarrow{v}_{a_1}), (\overrightarrow{v}_{a_2}, \overleftarrow{v}_{a_2}), \ldots, (\overrightarrow{v}_{a_m}, \overleftarrow{v}_{a_m})\big\},$$

where $(\overrightarrow{v}_{a_i}, \overleftarrow{v}_{a_i})$ denotes the value of the $i$-th attribute $(\overrightarrow{a}_i, \overleftarrow{a}_i)$.

The instance set can be denoted as $(\overrightarrow{U}, \overleftarrow{U}) = \{(\overrightarrow{I}_i, \overleftarrow{I}_i) \mid i = 1, 2, \ldots, n\}$.

From the above, the set of instances in $(\overrightarrow{U}, \overleftarrow{U})$ whose attribute $(\overrightarrow{a}_i, \overleftarrow{a}_i)$ takes the value $(\overrightarrow{t}, \overleftarrow{t})$ can be denoted as

$$\mathrm{Par}^{(\overrightarrow{t}, \overleftarrow{t})}_{(\overrightarrow{a}_i, \overleftarrow{a}_i)}(\overrightarrow{U}, \overleftarrow{U}) = \big\{(\overrightarrow{I}, \overleftarrow{I}) \in (\overrightarrow{U}, \overleftarrow{U}) \mid (\overrightarrow{v}_{a_i}, \overleftarrow{v}_{a_i}) = (\overrightarrow{t}, \overleftarrow{t})\big\},$$

where $(\overrightarrow{t}, \overleftarrow{t}) = (\overrightarrow{v}_{a_i}^{\,k}, \overleftarrow{v}_{a_i}^{\,k})$, that is, the $k$-th $\big(k \leq |(\overrightarrow{V}_{a_i}, \overleftarrow{V}_{a_i})|\big)$ attribute value of attribute $(\overrightarrow{a}_i, \overleftarrow{a}_i)$.

The instance set $(\overrightarrow{U}, \overleftarrow{U})$ can thus be divided according to the values of attribute $(\overrightarrow{a}_i, \overleftarrow{a}_i)$, that is,

$$(\overrightarrow{U}, \overleftarrow{U}) = \bigcup_{k=1}^{n} \mathrm{Par}^{(\overrightarrow{v}_k, \overleftarrow{v}_k)}_{(\overrightarrow{a}_i, \overleftarrow{a}_i)}(\overrightarrow{U}, \overleftarrow{U}),$$

where $n \leq |(\overrightarrow{V}_{a_i}, \overleftarrow{V}_{a_i})|$.
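A plain-Python sketch of this partitioning step follows; for brevity it ignores the direction component of the dynamic fuzzy values, which is an assumption made only to keep the example short.

```python
from collections import defaultdict

def partition_by_attribute(instances, attribute):
    """Group the instance set U into Par^{v}_{a}(U) for each value v taken by `attribute`."""
    parts = defaultdict(list)
    for inst in instances:
        parts[inst[attribute]].append(inst)
    return dict(parts)

# Hypothetical instances; in the full model every value would be a dynamic fuzzy pair.
U = [
    {"Family": "good", "Score": "high"},
    {"Family": "bad", "Score": "low"},
    {"Family": "good", "Score": "high"},
]
print(partition_by_attribute(U, "Family"))
# {'good': [two instances], 'bad': [one instance]}
```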

The condition attributes are $(\overrightarrow{A}, \overleftarrow{A}) = \{(\overrightarrow{A}_1, \overleftarrow{A}_1), (\overrightarrow{A}_2, \overleftarrow{A}_2), \ldots, (\overrightarrow{A}_m, \overleftarrow{A}_m)\}$ and the decision attribute is $(\overrightarrow{A}_D, \overleftarrow{A}_D)$. As the decision attribute $(\overrightarrow{A}_D, \overleftarrow{A}_D)$ depends on the condition attributes $(\overrightarrow{A}_i, \overleftarrow{A}_i)$, there is a definite relationship between the partition of the instance set induced by a condition attribute and that induced by the decision attribute. It satisfies one of the following three relationships:

Let $\mathrm{Par}^{(\overrightarrow{v}_j, \overleftarrow{v}_j)}_{(\overrightarrow{A}_D, \overleftarrow{A}_D)}(\overrightarrow{U}, \overleftarrow{U})$ be the partition of $(\overrightarrow{U}, \overleftarrow{U})$ by the $j$-th attribute value of the decision attribute $(\overrightarrow{A}_D, \overleftarrow{A}_D)$, and let $\mathrm{Par}^{(\overrightarrow{v}_k, \overleftarrow{v}_k)}_{(\overrightarrow{A}_i, \overleftarrow{A}_i)}(\overrightarrow{U}, \overleftarrow{U})$ be the partition by the $k$-th value of the condition attribute $(\overrightarrow{A}_i, \overleftarrow{A}_i)$; then

$$\mathrm{Par}^{(\overrightarrow{v}_j, \overleftarrow{v}_j)}_{(\overrightarrow{A}_D, \overleftarrow{A}_D)}(\overrightarrow{U}, \overleftarrow{U}) \supseteq \mathrm{Par}^{(\overrightarrow{v}_k, \overleftarrow{v}_k)}_{(\overrightarrow{A}_i, \overleftarrow{A}_i)}(\overrightarrow{U}, \overleftarrow{U}), \quad
\mathrm{Par}^{(\overrightarrow{v}_j, \overleftarrow{v}_j)}_{(\overrightarrow{A}_D, \overleftarrow{A}_D)}(\overrightarrow{U}, \overleftarrow{U}) = \mathrm{Par}^{(\overrightarrow{v}_k, \overleftarrow{v}_k)}_{(\overrightarrow{A}_i, \overleftarrow{A}_i)}(\overrightarrow{U}, \overleftarrow{U}), \quad \text{or} \quad
\mathrm{Par}^{(\overrightarrow{v}_j, \overleftarrow{v}_j)}_{(\overrightarrow{A}_D, \overleftarrow{A}_D)}(\overrightarrow{U}, \overleftarrow{U}) \subseteq \mathrm{Par}^{(\overrightarrow{v}_k, \overleftarrow{v}_k)}_{(\overrightarrow{A}_i, \overleftarrow{A}_i)}(\overrightarrow{U}, \overleftarrow{U}).$$

Because "⊇" and "⊆" are inverse operations, and "=" is a special case of "⊇", we take the first situation in what follows.

$\mathrm{Par}^{(\overrightarrow{v}_j, \overleftarrow{v}_j)}_{(\overrightarrow{A}_D, \overleftarrow{A}_D)}(\overrightarrow{U}, \overleftarrow{U}) \supseteq \mathrm{Par}^{(\overrightarrow{v}_k, \overleftarrow{v}_k)}_{(\overrightarrow{A}_i, \overleftarrow{A}_i)}(\overrightarrow{U}, \overleftarrow{U})$ means that the set of instances for which the decision attribute $(\overrightarrow{A}_D, \overleftarrow{A}_D)$ takes the value $(\overrightarrow{v}_j, \overleftarrow{v}_j)$ contains the instances for which the condition attribute $(\overrightarrow{A}_i, \overleftarrow{A}_i)$ takes the value $(\overrightarrow{v}_k, \overleftarrow{v}_k)$. That is, when the value of the condition attribute $(\overrightarrow{A}_i, \overleftarrow{A}_i)$ is $(\overrightarrow{v}_k, \overleftarrow{v}_k)$, the instances can be determined to belong to class $(\overrightarrow{v}_j, \overleftarrow{v}_j)$. Since $\mathrm{Par}^{(\overrightarrow{v}_k, \overleftarrow{v}_k)}_{(\overrightarrow{A}_i, \overleftarrow{A}_i)}(\overrightarrow{U}, \overleftarrow{U}) \subseteq \mathrm{Par}^{(\overrightarrow{v}_j, \overleftarrow{v}_j)}_{(\overrightarrow{A}_D, \overleftarrow{A}_D)}(\overrightarrow{U}, \overleftarrow{U})$, attribute values of $(\overrightarrow{A}_i, \overleftarrow{A}_i)$ other than $(\overrightarrow{v}_k, \overleftarrow{v}_k)$ may also classify instances into class $(\overrightarrow{v}_j, \overleftarrow{v}_j)$.

At the same time, with $n \leq |(\overrightarrow{V}_{A_i}, \overleftarrow{V}_{A_i})|$,

$$\mathrm{Par}^{(\overrightarrow{v}_j, \overleftarrow{v}_j)}_{(\overrightarrow{A}_D, \overleftarrow{A}_D)}(\overrightarrow{U}, \overleftarrow{U}) \subseteq \bigcup_{k=1}^{n} \mathrm{Par}^{(\overrightarrow{v}_k, \overleftarrow{v}_k)}_{(\overrightarrow{A}_i, \overleftarrow{A}_i)}(\overrightarrow{U}, \overleftarrow{U}),$$

so the degree of certainty with which attribute $(\overrightarrow{A}_i, \overleftarrow{A}_i)$ classifies the instances into class $(\overrightarrow{v}_j, \overleftarrow{v}_j)$ is

$$\mathrm{Cer}^{j}_{(\overrightarrow{A}_i, \overleftarrow{A}_i)}(\overrightarrow{U}, \overleftarrow{U}) = \frac{\max\Big(\big|\mathrm{Par}^{(\overrightarrow{v}_k, \overleftarrow{v}_k)}_{(\overrightarrow{A}_i, \overleftarrow{A}_i)}(\overrightarrow{U}, \overleftarrow{U})\big|\Big)}{\min\Big(\big|\bigcup_{k=1}^{n} \mathrm{Par}^{(\overrightarrow{v}_k, \overleftarrow{v}_k)}_{(\overrightarrow{A}_i, \overleftarrow{A}_i)}(\overrightarrow{U}, \overleftarrow{U})\big|\Big)},$$

where

$$\mathrm{Par}^{(\overrightarrow{v}_k, \overleftarrow{v}_k)}_{(\overrightarrow{A}_i, \overleftarrow{A}_i)}(\overrightarrow{U}, \overleftarrow{U}) \subseteq \mathrm{Par}^{(\overrightarrow{v}_j, \overleftarrow{v}_j)}_{(\overrightarrow{A}_D, \overleftarrow{A}_D)}(\overrightarrow{U}, \overleftarrow{U}) \subseteq \bigcup_{k=1}^{n} \mathrm{Par}^{(\overrightarrow{v}_k, \overleftarrow{v}_k)}_{(\overrightarrow{A}_i, \overleftarrow{A}_i)}(\overrightarrow{U}, \overleftarrow{U}), \qquad n \leq |(\overrightarrow{V}_{A_i}, \overleftarrow{V}_{A_i})|.$$

There are $n = |(\overrightarrow{V}_{A_D}, \overleftarrow{V}_{A_D})|$ classes in the instance set, and the condition attribute $(\overrightarrow{A}_i, \overleftarrow{A}_i)$ has a certainty degree of partitioning for each class. Thus, the certainty degree of its partitioning of the entire instance set is denoted as

$$\mathrm{Cer}_{(\overrightarrow{A}_i, \overleftarrow{A}_i)}(\overrightarrow{U}, \overleftarrow{U}) = \sum_{k=1}^{n} \mathrm{pro}_k \cdot \mathrm{Cer}^{k}_{(\overrightarrow{A}_i, \overleftarrow{A}_i)}(\overrightarrow{U}, \overleftarrow{U}),$$
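The following rough Python sketch (ours) shows one way this selection criterion could be evaluated, under two explicit assumptions: the direction component of the dynamic fuzzy values is ignored, and $\mathrm{pro}_k$ is taken to be the proportion of instances in class $k$.

```python
def certainty_degree(instances, attribute, target):
    """Cer_A(U): weighted sum over classes of the per-class certainty of `attribute`.

    Assumptions (ours, for illustration): pro_k is the proportion of instances in
    class k, and Cer^k is the size of the largest attribute-value block contained
    entirely in class k, divided by the size of the union of all such blocks
    (taken as 0 when no block is contained in the class).
    """
    total = len(instances)
    classes = {inst[target] for inst in instances}
    values = {inst[attribute] for inst in instances}
    cer = 0.0
    for cls in classes:
        class_idx = {i for i, inst in enumerate(instances) if inst[target] == cls}
        pro = len(class_idx) / total
        # Attribute-value blocks Par^{v}_{A}(U) lying entirely inside this class.
        blocks = []
        for v in values:
            block = {i for i, inst in enumerate(instances) if inst[attribute] == v}
            if block <= class_idx:
                blocks.append(block)
        if blocks:
            union_size = len(set().union(*blocks))
            cer += pro * (max(len(b) for b in blocks) / union_size)
    return cer

def select_split_attribute(instances, attributes, target):
    """Choose the attribute with the largest certainty degree as the split attribute."""
    return max(attributes, key=lambda a: certainty_degree(instances, a, target))
```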
