5.2.2 Decision Tree Classification
Decision tree classification uses a decision tree as a predictive model which
maps observations about an item to conclusions about the item’s target
value. More descriptive names for such tree models are classification trees
or regression trees. In these tree structures, leaves represent class labels and
branches represent conjunctions of features that lead to those class labels.
In decision analysis, a decision tree can be used to visually and explicitly
represent decisions and decision making. In data mining, a decision tree
describes data but not decisions; rather, the resulting classification tree can
be an input for decision making. This section deals with decision trees in data
mining. Decision tree learning is a method commonly used in data mining.
The goal is to create a model that predicts the value of a target variable
based on several input variables. An example is shown in Figure 5.2.2. Each
interior node corresponds to one of the input variables; there are edges
to children for each of the possible values of that input variable. Each
leaf represents a value of the target variable given the values of the input
variables represented by the path from the root to the leaf. A tree can be
“learned” by splitting the source set into subsets based on an attribute value
test. This process is repeated on each derived subset in a recursive manner
called recursive partitioning. The recursion is completed when the subset
at a node has all the same value of the target variable, or when splitting no
longer adds value to the predictions. This process of top-down induction of
decision trees (TDIDT) [14] is an example of a greedy algorithm, and it is by
far the most common strategy for learning decision trees from data, but it is
not the only strategy. In fact, some approaches have been developed recently
allowing tree induction to be performed in a bottom-up fashion [4].
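The phrase “splitting no longer adds value” is usually made precise with an impurity measure. The following Python sketch scores a candidate attribute test by its information gain, i.e. the reduction in entropy of the target variable; the entropy criterion and the toy records are illustrative assumptions, since the text above does not prescribe a particular splitting measure.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(records, attribute, target="label"):
    """Entropy reduction obtained by splitting the records on one attribute."""
    parent = entropy([r[target] for r in records])
    counts = Counter(r[attribute] for r in records)
    weighted_children = 0.0
    for value, count in counts.items():
        subset = [r[target] for r in records if r[attribute] == value]
        weighted_children += (count / len(records)) * entropy(subset)
    return parent - weighted_children

# Toy records: splitting on "sex" separates the classes, "embarked" does not.
records = [
    {"sex": "male",   "embarked": "S", "label": "died"},
    {"sex": "male",   "embarked": "C", "label": "died"},
    {"sex": "female", "embarked": "S", "label": "survived"},
    {"sex": "female", "embarked": "C", "label": "survived"},
]
print(information_gain(records, "sex"))       # 1.0 -> a useful split
print(information_gain(records, "embarked"))  # 0.0 -> adds no value
```

A split whose gain is close to zero is exactly the situation in which the recursion stops because splitting no longer adds value to the predictions.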
In classification, a decision tree contains three different types of nodes,
which are described as follows:
• A root node that has no incoming edges and zero or more outgoing edges.
• Internal nodes, each of which has exactly one incoming edge and two or more outgoing edges.
• Leaf or terminal nodes, each of which has exactly one incoming edge and no outgoing edges.
Figure 5.2.2 shows the decision tree for the Titanic passenger-survival
classification problem. In the figure, “sibsp” is the number of spouses or
siblings aboard. Each interior node tests one of the input variables, and
each leaf gives the value of the target variable (survived or died) for the
combination of input values on the path from the root to that leaf.
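The tree of Figure 5.2.2 can be written directly as nested attribute tests, which makes the root, internal and leaf terminology concrete: each test corresponds to a node, and each return corresponds to a leaf. The function name and the dictionary-based passenger record below are illustrative assumptions; the tests themselves are taken from the figure.

```python
def titanic_tree(passenger):
    """Follow Figure 5.2.2 from the root to a leaf and return the class label.

    `passenger` is a dict with the keys 'sex', 'age' and 'sibsp'
    (number of spouses or siblings aboard).
    """
    if passenger["sex"] == "male":          # root node
        if passenger["age"] > 9.5:          # internal node
            return "died"                   # leaf node
        elif passenger["sibsp"] > 2.5:      # internal node
            return "died"                   # leaf node
        else:
            return "survived"               # leaf node
    else:
        return "survived"                   # leaf node

print(titanic_tree({"sex": "female", "age": 29, "sibsp": 0}))  # survived
print(titanic_tree({"sex": "male",   "age": 40, "sibsp": 0}))  # died
```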
5.2.3 Hunt’s Algorithm
Building an optimal decision tree is the key problem in a decision tree
classifier. In general, many decision trees can be constructed from a given set of
attributes. While some of the trees are more accurate than others, finding
the optimal tree is computationally infeasible because of the exponential
size of the search space. However, various efficient algorithms have been
developed to construct a reasonably accurate, albeit suboptimal, decision
tree in a reasonable amount of time. These algorithms usually employ a
greedy strategy that grows a decision tree by making a series of locally
optimal decisions about which attribute to use for partitioning the data.
Hunt’s algorithm is one such efficient method for constructing a decision
tree. It grows a decision tree in a recursive fashion by partitioning the
training records into successively purer subsets. Let D_t be the set of training
records that reach a node t. The general recursive procedure is defined in
Algorithm 5.1 [17]. It recursively applies the procedure to each subset until
all the records in the subset belong to the same class. Hunt’s algorithm
assumes that each combination of attribute values has a unique class label
during the procedure. If all the records associated with D_t have identical
attribute values except for the class label, then it is not possible to split these
records any further. In that case, the node is declared a leaf node with the
same class label as the majority class of the training records associated with
this node.
Figure 5.2.2: An example of decision tree classification for the Titanic data. The root node tests “Is sex male?”; male passengers are then tested on “Is age > 9.5?” and “Is sibsp > 2.5?”, and the leaves are labeled “died” or “survived” together with the probability of survival and the percentage of observations in the leaf (0.73/36%, 0.89/2%, 0.05/2%, 0.17/61%).
Algorithm 5.1: Hunt’s algorithm
(1) If D_t contains records that all belong to the same class y_t, then t is a leaf node labeled as y_t.
(2) If D_t is an empty set, then t is a leaf node labeled by the default class, y_d.
(3) If D_t contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and apply the procedure recursively to each subset.
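The three cases of Algorithm 5.1 translate almost directly into code. The following Python sketch is a minimal rendering of Hunt’s procedure; choosing the splitting attribute (here simply the next unused one) and the dictionary-based record format are simplifying assumptions, since the algorithm leaves the attribute test unspecified.

```python
from collections import Counter

def hunt(records, attributes, default_class=None):
    """Grow a decision tree from `records` following Algorithm 5.1.

    Each record is a dict containing a 'label' key; `attributes` lists the
    attribute names still available for splitting.
    """
    # Case (2): D_t is empty -> leaf labeled with the default class y_d.
    if not records:
        return default_class
    labels = [r["label"] for r in records]
    majority = Counter(labels).most_common(1)[0][0]
    # Case (1): all records belong to the same class y_t -> leaf labeled y_t.
    # If no attributes remain (identical attribute values), fall back to the
    # majority class, as described in the text above.
    if len(set(labels)) == 1 or not attributes:
        return majority
    # Case (3): records belong to more than one class -> apply an attribute
    # test and recurse on each smaller subset.
    attr, remaining = attributes[0], attributes[1:]
    node = {"test": attr, "children": {}}
    for value in set(r[attr] for r in records):
        subset = [r for r in records if r[attr] == value]
        node["children"][value] = hunt(subset, remaining, default_class=majority)
    return node
```

A call such as hunt(records, ["sex", "pclass"]) returns either a class label (a leaf) or a nested dictionary of attribute tests, i.e. the learned tree.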
5.3 Bayesian Network and Classification
5.3.1 Bayesian Network
Bayesian network theory can be thought of as a fusion of incidence
diagrams and Bayes’ theorem. A Bayesian network, or belief network, shows
conditional probability and causality relationships between variables. For
example, a Bayesian network could represent the probabilistic relationships
between diseases and symptoms. Given the symptoms, the network can
be used to compute the probabilities of the presence of various diseases.
The probability of an event occurring given that another event has already
occurred is called conditional probability. The probabilistic model is described
qualitatively by a directed acyclic graph, or DAG. The vertices of the graph,
which represent variables, are called nodes. The nodes are represented as
circles containing the variable name. The connections between the nodes
are called arcs, or edges. The edges are drawn as arrows between the
nodes, and represent dependence between the variables. An edge between a
pair of nodes indicates that one node is the parent of the other, so no
independence is assumed between them; independence assumptions are implied in
Bayesian networks by the absence of an edge. Figure 5.3.1 shows an example of
a DAG. The node where the arc originates is called the parent, while the node
where the arc ends is called the child. In this case, V_0 is a parent of V_1 and
V_2, and V_2 has parents V_0 and V_1. Nodes that can be reached from other nodes
are called descendants. Nodes that lead a path to a specific node are called
ancestors. For example, V_1 and V_2 are descendants of V_0, and V_1 is an ancestor
of V_2 and V_3. Since no child can be its own ancestor or descendant, there
are no loops in Bayesian networks. Bayesian networks will generally also
include a set of probability tables, stating the probabilities for the true/false
values of the variables. The main point of Bayesian networks is to allow
for probabilistic inference to be performed. This means that the probability
of each value of a node in the Bayesian network can be computed when
the values of the other variables are known. Also, because independence
among the variables is easy to recognize since conditional relationships are
clearly defined by a graph edge, not all joint probabilities in the Bayesian
system need to be calculated in order to make a decision. Bayesian networks
can also be used directly for classification, as described below.
Figure 5.3.1: An example of a DAG (with nodes V_0, V_1, V_2 and V_3)
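To make probabilistic inference concrete for a network like the one in Figure 5.3.1, the sketch below encodes the parent relations stated above (V_0 is a parent of V_1 and V_2, and V_2 also has parent V_1), attaches probability tables for true/false variables, and computes a posterior by enumerating the joint distribution. All numeric values are invented for illustration, and V_3 is omitted because its parents are not specified in the text.

```python
from itertools import product

# Probability tables (illustrative values): each entry is
# P(variable = True | parents).  V_0 has no parents, V_1 depends on V_0,
# and V_2 depends on both V_0 and V_1.
p_v0 = 0.30
p_v1 = {True: 0.80, False: 0.10}                    # keyed by the value of V_0
p_v2 = {(True, True): 0.90, (True, False): 0.50,    # keyed by (V_0, V_1)
        (False, True): 0.60, (False, False): 0.05}

def joint(v0, v1, v2):
    """P(V_0=v0, V_1=v1, V_2=v2) = P(V_0) P(V_1 | V_0) P(V_2 | V_0, V_1)."""
    p = p_v0 if v0 else 1 - p_v0
    p *= p_v1[v0] if v1 else 1 - p_v1[v0]
    p *= p_v2[(v0, v1)] if v2 else 1 - p_v2[(v0, v1)]
    return p

# Infer P(V_0 = True | V_2 = True): sum the joint probability over the
# unobserved variable V_1 and normalise by the probability of the evidence.
numerator = sum(joint(True, v1, True) for v1 in (True, False))
evidence = sum(joint(v0, v1, True)
               for v0, v1 in product((True, False), repeat=2))
print(numerator / evidence)
```

Because each variable only needs a probability table over its own parents, the full joint distribution never has to be written out explicitly, which is the saving referred to above.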
Figure 5.3.2 depicts the possible structure of a Bayesian network used
for classification. The dotted lines denote potential links, and the blue box
indicates that additional nodes and links can be added to the model, usually
between the input and output nodes.
In order to perform classification with a Bayesian network such as the
one depicted in Fig. 5.3.2, evidence must first be set on the input nodes,
and then the output nodes can be queried using standard Bayesian network
inference. The result is a distribution for each output node, so that you
can not only determine the most probable state for each output, but also
see the probability assigned to each output state. Figure 5.3.3 shows the
structure of a Naive Bayes classifier, which is the simplest useful form of
Bayesian network classifier. The links in a Naive Bayes model are directed
from output to input, which gives the model its simplicity, as there are
no interactions between the inputs, except indirectly via the output. Note,
however, that directing links from output to input is not a requirement for
all Bayesian network classifiers.
Figure 5.3.2: Generic structure of a Bayesian network classifier (input nodes Input1–Input4 and output nodes Output1–Output2)
One of the most effective classifiers, in the sense that its predictive
performance is competitive with state-of-the-art classifiers, is the so-called
naive Bayesian classifier described, for example, by Duda and Hart [9] and
by Langley et al. [12]. This classifier learns from training data the conditional
probability of each attribute A_i given the class label C. Classification is then
done by applying Bayes’ rule to compute the probability of C given the
particular instance of A_1, ..., A_n, and then predicting the class with the highest
posterior probability. This computation is rendered feasible by making a
strong independence assumption: all the attributes A_i are conditionally
independent given the value of the class C.
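The computation just described — estimate P(C) and each P(A_i | C) from the training data, then apply Bayes’ rule under the conditional-independence assumption — can be sketched in a few lines of Python. The add-one smoothing and the toy training records are illustrative assumptions.

```python
from collections import Counter, defaultdict

def train_naive_bayes(records, target="label"):
    """Estimate class priors P(C) and per-class value counts for each attribute."""
    class_counts = Counter(r[target] for r in records)
    value_counts = defaultdict(Counter)       # (class, attribute) -> value counts
    for r in records:
        for attr, value in r.items():
            if attr != target:
                value_counts[(r[target], attr)][value] += 1
    return class_counts, value_counts

def predict(class_counts, value_counts, instance):
    """Return the class with the highest posterior probability for `instance`."""
    total = sum(class_counts.values())
    scores = {}
    for c, count in class_counts.items():
        score = count / total                  # prior P(C)
        for attr, value in instance.items():
            counts = value_counts[(c, attr)]
            # Add-one smoothing so an unseen value does not zero out the product.
            score *= (counts[value] + 1) / (sum(counts.values()) + len(counts) + 1)
        scores[c] = score                      # proportional to P(C | A_1, ..., A_n)
    return max(scores, key=scores.get)

records = [
    {"sex": "male",   "pclass": "3rd", "label": "died"},
    {"sex": "male",   "pclass": "1st", "label": "survived"},
    {"sex": "female", "pclass": "1st", "label": "survived"},
    {"sex": "female", "pclass": "3rd", "label": "survived"},
]
model = train_naive_bayes(records)
print(predict(*model, {"sex": "female", "pclass": "1st"}))  # survived
```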
5.3.2 Backpropagation and Classification
5.3.2.1 Backpropagation Method
Backpropagation [1] is a common method of training artificial neural
networks so as to minimize the objective function. Arthur E. Bryson and Yu-
Chi Ho described it as a multi-stage dynamic system optimization method
in 1969 [15, 6]. It wasn’t until 1974 and later, when applied in the context
of neural networks and through the work of Paul Werbos [19], Rumelhart
and Kubat [16, 11], that it gained recognition, and it led to a “renaissance”
in the field of artificial neural network research. It is a supervised learning
method, and is a generalization of the delta rule. It requires a dataset of the
desired output for many inputs, making up the training set. It is most useful
for feed-forward networks (networks that have no feedback, or simply, that
have no connections that loop). The term is an abbreviation for “backward
propagation of errors”. Backpropagation requires that the activation
function used by the artificial neurons be differentiable.
Figure 5.3.3: Naive Bayes model (inputs Input1–Input4 linked from a single output node Output1)
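As a minimal, self-contained illustration of “backward propagation of errors” in a feed-forward network, the sketch below trains a single sigmoid hidden layer on the XOR problem with plain batch gradient descent. The network size, learning rate, random seed and dataset are all illustrative assumptions, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR training set: inputs X and the desired outputs y.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer with 4 units and a single output unit, with bias terms.
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

learning_rate = 0.5
for step in range(20000):
    # Forward pass through the feed-forward network.
    h = sigmoid(X @ W1 + b1)      # hidden-layer activations
    out = sigmoid(h @ W2 + b2)    # network output

    # Backward pass: the output error is propagated back through the weights,
    # using the derivative sigmoid'(z) = s * (1 - s) of the activation.
    delta_out = (out - y) * out * (1 - out)
    delta_h = (delta_out @ W2.T) * h * (1 - h)

    # Gradient-descent updates that reduce the squared-error objective.
    W2 -= learning_rate * h.T @ delta_out
    b2 -= learning_rate * delta_out.sum(axis=0)
    W1 -= learning_rate * X.T @ delta_h
    b1 -= learning_rate * delta_h.sum(axis=0)

print(np.round(out.ravel(), 2))   # typically approaches [0, 1, 1, 0]
```

The backward pass is exactly where the differentiable activation function required above comes in: the error signal at each layer is obtained by multiplying by the derivative of that activation.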