5.2.2 Decision Tree Classification
Decision tree classification uses a decision tree as a predictive model which
maps observations about an item to conclusions about the item’s target
value. More descriptive names for such tree models are classification trees
or regression trees. In these tree structures, leaves represent class labels and
branches represent conjunctions of features that lead to those class labels.
In decision analysis, a decision tree can be used to visually and explicitly
represent decisions and decision making. In data mining, a decision tree
describes data but not decisions; rather, the resulting classification tree can
be an input for decision making. This section deals with decision trees in data
mining. Decision tree learning is a method commonly used in data mining.
The goal is to create a model that predicts the value of a target variable
based on several input variables. An example is shown in Figure 5.2.2. Each
interior node corresponds to one of the input variables; there are edges
to children for each of the possible values of that input variable. Each
leaf represents a value of the target variable given the values of the input
variables represented by the path from the root to the leaf. A tree can be
“learned” by splitting the source set into subsets based on an attribute value
test. This process is repeated on each derived subset in a recursive manner
called recursive partitioning. The recursion is completed when the subset
at a node has all the same value of the target variable, or when splitting no
longer adds value to the predictions. This process of top-down induction of
decision trees (TDIDT) [14] is an example of a greedy algorithm, and it is by
far the most common strategy for learning decision trees from data, but it is
not the only strategy. In fact, some approaches have been developed recently
allowing tree induction to be performed in a bottom-up fashion [4].
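The phrase “splitting no longer adds value” is usually made precise with an impurity measure. The following Python sketch scores a candidate attribute test by its information gain, i.e. the reduction in entropy of the target variable; the entropy criterion and the toy records are illustrative assumptions, since the text above does not prescribe a particular splitting measure.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(records, attribute, target="label"):
    """Entropy reduction obtained by splitting the records on one attribute."""
    parent = entropy([r[target] for r in records])
    counts = Counter(r[attribute] for r in records)
    weighted_children = 0.0
    for value, count in counts.items():
        subset = [r[target] for r in records if r[attribute] == value]
        weighted_children += (count / len(records)) * entropy(subset)
    return parent - weighted_children

# Toy records: splitting on "sex" separates the classes, "embarked" does not.
records = [
    {"sex": "male",   "embarked": "S", "label": "died"},
    {"sex": "male",   "embarked": "C", "label": "died"},
    {"sex": "female", "embarked": "S", "label": "survived"},
    {"sex": "female", "embarked": "C", "label": "survived"},
]
print(information_gain(records, "sex"))       # 1.0 -> a useful split
print(information_gain(records, "embarked"))  # 0.0 -> adds no value
```

A split whose gain is close to zero is exactly the situation in which the recursion stops because splitting no longer adds value to the predictions.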
In classification, a decision tree contains three different types of nodes,
which are described as follows:
• A root node that has no incoming edges and zero or more outgoing edges.
• Internal nodes, each of which has exactly one incoming edge and two or more outgoing edges.
• Leaf or terminal nodes, each of which has exactly one incoming edge and no outgoing edges.
Figure 5.2.2 shows the decision tree for the Titanic passenger-survival
classification problem. In the figure, “sibsp” is the number of spouses or
siblings aboard. Each interior node tests one of the input variables, and
each leaf gives the value of the target variable (survived or died) for the
combination of input values on the path from the root to that leaf.
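The tree of Figure 5.2.2 can be written directly as nested attribute tests, which makes the root, internal and leaf terminology concrete: each test corresponds to a node, and each return corresponds to a leaf. The function name and the dictionary-based passenger record below are illustrative assumptions; the tests themselves are taken from the figure.

```python
def titanic_tree(passenger):
    """Follow Figure 5.2.2 from the root to a leaf and return the class label.

    `passenger` is a dict with the keys 'sex', 'age' and 'sibsp'
    (number of spouses or siblings aboard).
    """
    if passenger["sex"] == "male":          # root node
        if passenger["age"] > 9.5:          # internal node
            return "died"                   # leaf node
        elif passenger["sibsp"] > 2.5:      # internal node
            return "died"                   # leaf node
        else:
            return "survived"               # leaf node
    else:
        return "survived"                   # leaf node

print(titanic_tree({"sex": "female", "age": 29, "sibsp": 0}))  # survived
print(titanic_tree({"sex": "male",   "age": 40, "sibsp": 0}))  # died
```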
5.2.3 Hunt’s Algorithm
Building an optimal decision tree is the key problem in a decision tree
classifier. In general, many decision trees can be constructed from a given set of
attributes. While some of the trees are more accurate than others, finding
the optimal tree is computationally infeasible because of the exponential
size of the search space. However, various efficient algorithms have been
developed to construct a reasonably accurate, albeit suboptimal, decision
tree in a reasonable amount of time. These algorithms usually employ a
greedy strategy that grows a decision tree by making a series of locally
optimal decisions about which attribute to use for partitioning the data.
Hunt’s algorithm is one such efficient method for constructing a decision
tree. It grows a decision tree in a recursive fashion by partitioning the
training records into successively purer subsets. Let D_t be the set of training
records that reach a node t. The general recursive procedure is defined in
Algorithm 5.1 [17]. It recursively applies the procedure to each subset until
all the records in the subset belong to the same class. Hunt’s algorithm
assumes that each combination of attribute values has a unique class label
during the procedure. If all the records associated with D_t have identical
attribute values except for the class label, then it is not possible to split these
records any further. In that case, the node is declared a leaf node with the
same class label as the majority class of the training records associated with
this node.
Figure 5.2.2: An example of decision tree classification for the Titanic data. The root node tests “Is sex male?”; male passengers are then tested on “Is age > 9.5?” and “Is sibsp > 2.5?”, and the leaves are labeled “died” or “survived” together with the probability of survival and the percentage of observations in the leaf (0.73/36%, 0.89/2%, 0.05/2%, 0.17/61%).
Algorithm 5.1: Hunt’s algorithm
(1) If D_t contains records that all belong to the same class y_t, then t is a leaf node labeled as y_t.
(2) If D_t is an empty set, then t is a leaf node labeled by the default class, y_d.
(3) If D_t contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and apply the procedure recursively to each subset.
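The three cases of Algorithm 5.1 translate almost directly into code. The following Python sketch is a minimal rendering of Hunt’s procedure; choosing the splitting attribute (here simply the next unused one) and the dictionary-based record format are simplifying assumptions, since the algorithm leaves the attribute test unspecified.

```python
from collections import Counter

def hunt(records, attributes, default_class=None):
    """Grow a decision tree from `records` following Algorithm 5.1.

    Each record is a dict containing a 'label' key; `attributes` lists the
    attribute names still available for splitting.
    """
    # Case (2): D_t is empty -> leaf labeled with the default class y_d.
    if not records:
        return default_class
    labels = [r["label"] for r in records]
    majority = Counter(labels).most_common(1)[0][0]
    # Case (1): all records belong to the same class y_t -> leaf labeled y_t.
    # If no attributes remain (identical attribute values), fall back to the
    # majority class, as described in the text above.
    if len(set(labels)) == 1 or not attributes:
        return majority
    # Case (3): records belong to more than one class -> apply an attribute
    # test and recurse on each smaller subset.
    attr, remaining = attributes[0], attributes[1:]
    node = {"test": attr, "children": {}}
    for value in set(r[attr] for r in records):
        subset = [r for r in records if r[attr] == value]
        node["children"][value] = hunt(subset, remaining, default_class=majority)
    return node
```

A call such as hunt(records, ["sex", "pclass"]) returns either a class label (a leaf) or a nested dictionary of attribute tests, i.e. the learned tree.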
5.3 Bayesian Network and Classification
5.3.1 Bayesian Network
Bayesian network theory can be thought of as a fusion of incidence
diagrams and Bayes’ theorem. A Bayesian network, or belief network, shows
conditional probability and causality relationships between variables. For
example, a Bayesian network could represent the probabilistic relationships
between diseases and symptoms. Given the symptoms, the network can
be used to compute the probabilities of the presence of various diseases.
The probability of an event occurring given that another event has already
occurred is called conditional probability. The probabilistic model is described
qualitatively by a directed acyclic graph, or DAG. The vertices of the graph,
which represent variables, are called nodes. The nodes are represented as
circles containing the variable name. The connections between the nodes
are called arcs, or edges. The edges are drawn as arrows between the
nodes, and represent dependence between the variables. An edge between a
pair of nodes indicates that one node is the parent of the other, so no
independence is assumed between them; independence assumptions are implied in
Bayesian networks by the absence of an edge. Figure 5.3.1 shows an example of
a DAG. The node where the arc originates is called the parent, while the node
where the arc ends is called the child. In this case, V_0 is a parent of V_1 and
V_2, and V_2 has parents V_0 and V_1. Nodes that can be reached from other nodes
are called descendants. Nodes that lead a path to a specific node are called
ancestors. For example, V_1 and V_2 are descendants of V_0, and V_1 is an ancestor
of V_2 and V_3. Since no child can be its own ancestor or descendant, there
are no loops in Bayesian networks. Bayesian networks will generally also
include a set of probability tables, stating the probabilities for the true/false
values of the variables. The main point of Bayesian networks is to allow
for probabilistic inference to be performed. This means that the probability
of each value of a node in the Bayesian network can be computed when
the values of the other variables are known. Also, because independence
among the variables is easy to recognize since conditional relationships are
clearly defined by a graph edge, not all joint probabilities in the Bayesian
system need to be calculated in order to make a decision. Bayesian networks
can also be used directly for classification, as described below.
Figure 5.3.1: An example of a DAG (with nodes V_0, V_1, V_2 and V_3)
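To make probabilistic inference concrete for a network like the one in Figure 5.3.1, the sketch below encodes the parent relations stated above (V_0 is a parent of V_1 and V_2, and V_2 also has parent V_1), attaches probability tables for true/false variables, and computes a posterior by enumerating the joint distribution. All numeric values are invented for illustration, and V_3 is omitted because its parents are not specified in the text.

```python
from itertools import product

# Probability tables (illustrative values): each entry is
# P(variable = True | parents).  V_0 has no parents, V_1 depends on V_0,
# and V_2 depends on both V_0 and V_1.
p_v0 = 0.30
p_v1 = {True: 0.80, False: 0.10}                    # keyed by the value of V_0
p_v2 = {(True, True): 0.90, (True, False): 0.50,    # keyed by (V_0, V_1)
        (False, True): 0.60, (False, False): 0.05}

def joint(v0, v1, v2):
    """P(V_0=v0, V_1=v1, V_2=v2) = P(V_0) P(V_1 | V_0) P(V_2 | V_0, V_1)."""
    p = p_v0 if v0 else 1 - p_v0
    p *= p_v1[v0] if v1 else 1 - p_v1[v0]
    p *= p_v2[(v0, v1)] if v2 else 1 - p_v2[(v0, v1)]
    return p

# Infer P(V_0 = True | V_2 = True): sum the joint probability over the
# unobserved variable V_1 and normalise by the probability of the evidence.
numerator = sum(joint(True, v1, True) for v1 in (True, False))
evidence = sum(joint(v0, v1, True)
               for v0, v1 in product((True, False), repeat=2))
print(numerator / evidence)
```

Because each variable only needs a probability table over its own parents, the full joint distribution never has to be written out explicitly, which is the saving referred to above.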
Figure 5.3.2 depicts the possible structure of a Bayesian network used
for classification. The dotted lines denote potential links, and the blue box
indicates that additional nodes and links can be added to the model, usually
between the input and output nodes.
In order to perform classification with a Bayesian network such as the
one depicted in Fig. 5.3.2, evidence must first be set on the input nodes,
and then the output nodes can be queried using standard Bayesian network
inference. The result is a distribution for each output node, so that you
can not only determine the most probable state for each output, but also
see the probability assigned to each output state. Figure 5.3.3 shows the
structure of a Naive Bayes classifier, which is the simplest useful form of
Bayesian network classifier. The links in a Naive Bayes model are directed
from output to input, which gives the model its simplicity, as there are
no interactions between the inputs, except indirectly via the output. Note,
however, that directing links from output to input is not a requirement for
all Bayesian network classifiers.
Figure 5.3.2: Generic structure of a Bayesian network classifier (input nodes Input1–Input4 and output nodes Output1–Output2)
One of the most effective classifiers, in the sense that its predictive
performance is competitive with state-of-the-art classifiers, is the so-called
naive Bayesian classifier described, for example, by Duda and Hart [9] and
by Langley et al. [12]. This classifier learns from training data the conditional
probability of each attribute A_i given the class label C. Classification is then
done by applying Bayes’ rule to compute the probability of C given the
particular instance of A_1, ..., A_n, and then predicting the class with the highest
posterior probability. This computation is rendered feasible by making a
strong independence assumption: all the attributes A_i are conditionally
independent given the value of the class C.
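The computation just described — estimate P(C) and each P(A_i | C) from the training data, then apply Bayes’ rule under the conditional-independence assumption — can be sketched in a few lines of Python. The add-one smoothing and the toy training records are illustrative assumptions.

```python
from collections import Counter, defaultdict

def train_naive_bayes(records, target="label"):
    """Estimate class priors P(C) and per-class value counts for each attribute."""
    class_counts = Counter(r[target] for r in records)
    value_counts = defaultdict(Counter)       # (class, attribute) -> value counts
    for r in records:
        for attr, value in r.items():
            if attr != target:
                value_counts[(r[target], attr)][value] += 1
    return class_counts, value_counts

def predict(class_counts, value_counts, instance):
    """Return the class with the highest posterior probability for `instance`."""
    total = sum(class_counts.values())
    scores = {}
    for c, count in class_counts.items():
        score = count / total                  # prior P(C)
        for attr, value in instance.items():
            counts = value_counts[(c, attr)]
            # Add-one smoothing so an unseen value does not zero out the product.
            score *= (counts[value] + 1) / (sum(counts.values()) + len(counts) + 1)
        scores[c] = score                      # proportional to P(C | A_1, ..., A_n)
    return max(scores, key=scores.get)

records = [
    {"sex": "male",   "pclass": "3rd", "label": "died"},
    {"sex": "male",   "pclass": "1st", "label": "survived"},
    {"sex": "female", "pclass": "1st", "label": "survived"},
    {"sex": "female", "pclass": "3rd", "label": "survived"},
]
model = train_naive_bayes(records)
print(predict(*model, {"sex": "female", "pclass": "1st"}))  # survived
```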
5.3.2 Backpropagation and Classification
5.3.2.1 Backpropagation Method
Backpropagation [1] is a common method of training artificial neural
networks so as to minimize the objective function. Arthur E. Bryson and Yu-
Chi Ho described it as a multi-stage dynamic system optimization method
in 1969 [15, 6]. It wasn’t until 1974 and later, when applied in the context
of neural networks and through the work of Paul Werbos [19], Rumelhart
and Kubat [16, 11], that it gained recognition, and it led to a “renaissance”
in the field of artificial neural network research. It is a supervised learning
method, and is a generalization of the delta rule. It requires a dataset of the
desired output for many inputs, making up the training set. It is most useful
for feed-forward networks (networks that have no feedback, or simply, that
have no connections that loop). The term is an abbreviation for “backward
propagation of errors”. Backpropagation requires that the activation
function used by the artificial neurons be differentiable.
Figure 5.3.3: Naive Bayes model (inputs Input1–Input4 linked from a single output node Output1)
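As a minimal, self-contained illustration of “backward propagation of errors” in a feed-forward network, the sketch below trains a single sigmoid hidden layer on the XOR problem with plain batch gradient descent. The network size, learning rate, random seed and dataset are all illustrative assumptions, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR training set: inputs X and the desired outputs y.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer with 4 units and a single output unit, with bias terms.
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

learning_rate = 0.5
for step in range(20000):
    # Forward pass through the feed-forward network.
    h = sigmoid(X @ W1 + b1)      # hidden-layer activations
    out = sigmoid(h @ W2 + b2)    # network output

    # Backward pass: the output error is propagated back through the weights,
    # using the derivative sigmoid'(z) = s * (1 - s) of the activation.
    delta_out = (out - y) * out * (1 - out)
    delta_h = (delta_out @ W2.T) * h * (1 - h)

    # Gradient-descent updates that reduce the squared-error objective.
    W2 -= learning_rate * h.T @ delta_out
    b2 -= learning_rate * delta_out.sum(axis=0)
    W1 -= learning_rate * X.T @ delta_h
    b1 -= learning_rate * delta_h.sum(axis=0)

print(np.round(out.ravel(), 2))   # typically approaches [0, 1, 1, 0]
```

The backward pass is exactly where the differentiable activation function required above comes in: the error signal at each layer is obtained by multiplying by the derivative of that activation.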