function used by the artificial neurons (or “nodes”) be differentiable. The
main framework of backpropagation can be described as Algorithm 5.2.
In step 2.2, the ratio influences the speed and quality of learning; it is called
the learning rate. The sign of the gradient of a weight indicates the direction
in which the error is increasing; this is why the weight must be updated in
the opposite direction.
Algorithm 5.2: Backpropagation
(1) Propagation
(1.1) Forward propagation of a training pattern’s input through the neural
network in order to generate the propagation’s output activations.
(1.2) Backward propagation of the propagation’s output activations through
the neural network using the training pattern’s target in order to generate the
deltas of all output and hidden neurons.
(2) Weight update
(2.1) For each weight, multiply its output delta and input activation to obtain
the gradient of the weight.
(2.2) Bring the weight in the opposite direction of the gradient by subtracting
a ratio of it from the weight.
(3) Repeat phases 1 and 2 until the performance of the network is satisfactory.
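To make the procedure concrete, the following minimal sketch implements Algorithm 5.2 for a one-hidden-layer network with sigmoid activations and a squared-error loss; the layer sizes, learning rate and iteration count are illustrative choices rather than values prescribed by the text.

```python
# Minimal sketch of Algorithm 5.2: forward pass, backward pass (deltas),
# then a gradient-descent weight update with learning rate eta.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy training pattern: 2 inputs, 1 target output.
x = np.array([0.5, -0.2])
t = np.array([1.0])

# Randomly initialised weights: input->hidden (2x3), hidden->output (3x1).
W1 = rng.normal(scale=0.5, size=(2, 3))
W2 = rng.normal(scale=0.5, size=(3, 1))
eta = 0.1  # learning rate (the "ratio" in step 2.2)

for _ in range(1000):
    # (1.1) Forward propagation of the input through the network.
    h = sigmoid(x @ W1)          # hidden activations
    y = sigmoid(h @ W2)          # output activations

    # (1.2) Backward propagation: deltas of the output and hidden neurons.
    delta_out = (y - t) * y * (1 - y)
    delta_hid = (delta_out @ W2.T) * h * (1 - h)

    # (2) Weight update: gradient = input activation x output delta,
    # and each weight moves opposite to the sign of its gradient.
    W2 -= eta * np.outer(h, delta_out)
    W1 -= eta * np.outer(x, delta_hid)

print("final output:", sigmoid(sigmoid(x @ W1) @ W2))
```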
5.3.2.2 Classifier with Backpropagation
The model structure of the BP (backpropagation) classification algorithm
uses full connections between all layers and nodes from the input layer to
the output layer, which obviously requires a great deal of computation.
However, a standard neural network or backpropagation-based decision
support system is still not fully satisfactory, because better decision
performance and fewer training iterations are desirable when developing
a system for a specific domain area.
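As a hedged illustration of such a fully connected classifier, the sketch below trains scikit-learn's MLPClassifier (which fits its weights by backpropagation) on synthetic data; the layer sizes and iteration limit are arbitrary choices, not values from the text.

```python
# A fully connected feed-forward classifier trained by backpropagation,
# using scikit-learn's MLPClassifier on synthetic two-class data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two fully connected hidden layers; every node of one layer feeds every
# node of the next, so training cost grows quickly with layer sizes.
clf = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=500, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```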
5.3.3 Association-based Classification
Association rule mining is an important and highly active area of data
mining research. Recently, data mining techniques have been developed
that apply concepts used in association rule mining to the problem of
classification. In this section, we study three methods in historical order.
The first two, ARCS [2] and associative classification [18], use
association rules for classification. The third method, CAEP [8], mines
“emerging patterns” that consider the concept of support used in mining
associations. The first method mines association rules based on clustering
and then employs the rules for classification. ARCS, the Association Rule
Clustering System, mines association rules of the form Aquan1 ∧ Aquan2 ⇒
Acat, where Aquan1 and Aquan2 are tests on quantitative attribute ranges
(where the ranges are dynamically determined), and Acat assigns a class
label for a categorical attribute from the given training data. Association
rules are plotted on a 2-D grid. The algorithm scans the grid, searching for
rectangular clusters of rules. In this way, adjacent ranges of the quantitative
attributes occurring within a rule cluster may be combined. The clustered
association rules generated by ARCS were empirically found to be slightly
more accurate than C4.5 when there are outliers in the data. The accuracy of
ARCS is related to the degree of discretization used. In terms of scalability,
ARCS requires “a constant amount of memory”, regardless of the database
size. C4.5 has exponentially higher execution times than ARCS, requiring the
entire database, multiplied by some factor, to fit entirely in main memory.
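The following toy sketch (an illustrative simplification, not the published ARCS implementation) shows the grid idea: rules over two binned quantitative attributes are laid out on a 2-D grid and adjacent cells that predict the same class are merged, so neighbouring ranges combine into a single rule. Only row-wise merging is shown here; ARCS itself searches for full rectangular clusters.

```python
# Illustrative only: merge adjacent quantitative ranges of rules that
# predict the same class, in the spirit of ARCS's grid clustering.
import numpy as np

# grid[i, j] holds the class label predicted by the rule for
# (age bin i) AND (income bin j); -1 means no rule was mined for that cell.
grid = np.array([
    [0, 0, 1, -1],
    [0, 0, 1,  1],
    [-1, 2, 2,  2],
])

def merge_row_runs(grid):
    """Merge horizontally adjacent cells with the same label into range rules."""
    rules = []
    for i, row in enumerate(grid):
        j = 0
        while j < len(row):
            if row[j] == -1:
                j += 1
                continue
            start = j
            while j + 1 < len(row) and row[j + 1] == row[start]:
                j += 1
            rules.append((i, (start, j), int(row[start])))  # (age bin, income range, class)
            j += 1
    return rules

for age_bin, income_range, label in merge_row_runs(grid):
    print(f"age_bin={age_bin} AND income_bin in {income_range} => class {label}")
```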
The second method is referred to as associative classification. It mines
rules of the form condset ⇒ y, where condset is a set of items (or attribute-
value pairs) and y is a class label. Rules that satisfy a pre-specified minimum
support are frequent, where a rule has support s if s% of the samples in
the given data set contain condset and belong to class y. A rule satisfying
minimum confidence is called accurate, where a rule has confidence c if c%
of the samples in the given data set that contain condset belong to class y. If
a set of rules has the same condset, then the rule with the highest confidence
is selected as the possible rule (PR) to represent the set.
The association classification method consists of two steps. The first
step finds the set of all PRs that are both frequent and accurate. It uses an
iterative approach, where prior knowledge is used to prune the rule search.
The second step uses a heuristic method to construct the classifier, where the
discovered rules are organized according to decreasing precedence based
on their confidence and support. The algorithm may require several passes
over the data set, depending on the length of the longest rule found.
When classifying a new sample, the first rule satisfying the sample is
used to classify it. The classifier also contains a default rule, having lowest
precedence, which specifies a default class for any new sample that is
not satisfied by any other rule in the classifier. In general, the associative
classification method was empirically found to be more accurate than C4.5
on several data sets. Each of the above two steps was shown to have linear
scale-up.
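The toy sketch below illustrates these two steps in a hedged, simplified form (it is not a faithful implementation of the published method): it mines rules condset ⇒ y that meet minimum support and confidence, keeps the highest-confidence class per condset as its PR, orders the PRs by decreasing confidence and support, and classifies a sample with the first matching rule, falling back to a default class. The data set, item names and thresholds are made up for illustration.

```python
# Simplified associative classification: mine PRs, order them, classify
# with the first rule that matches, else use a default class.
from itertools import combinations
from collections import Counter

# Each training sample is (set of items, class label); thresholds are illustrative.
data = [
    ({"age=young", "student=yes"}, "buys"),
    ({"age=young", "student=no"}, "no_buy"),
    ({"age=old", "student=yes"}, "buys"),
    ({"age=old", "student=no"}, "buys"),
]
min_sup, min_conf = 0.25, 0.6

def mine_prs(data, max_len=2):
    n = len(data)
    items = sorted({i for s, _ in data for i in s})
    rules = []
    for k in range(1, max_len + 1):
        for condset in combinations(items, k):
            cond = set(condset)
            covered = [y for s, y in data if cond <= s]
            if not covered:
                continue
            # PR: for a given condset, keep only the most confident class.
            y, cnt = Counter(covered).most_common(1)[0]
            sup, conf = cnt / n, cnt / len(covered)
            if sup >= min_sup and conf >= min_conf:
                rules.append((cond, y, sup, conf))
    # Decreasing precedence: higher confidence first, then higher support.
    return sorted(rules, key=lambda r: (-r[3], -r[2]))

def classify(sample, rules, default="buys"):
    # The first rule whose condset is contained in the sample fires;
    # otherwise the default rule (lowest precedence) applies.
    for cond, y, _, _ in rules:
        if cond <= sample:
            return y
    return default

rules = mine_prs(data)
print(classify({"age=young", "student=no"}, rules))  # -> "no_buy"
```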
The third method, CAEP (classification by aggregating emerging
patterns), uses the notion of itemset supports to mine emerging patterns
(EPs), which are used to construct a classifier. Roughly speaking, an EP is
an itemset (or set of items) whose support increases significantly from one
class of data to another. The ratio of the two supports is called the growth
rate of the EP. For example, suppose that we have a data set of customers
with the classes buys_computer = “yes”, or C1, and buys_computer = “no”,
or C2. The itemset {age = “30”, student = “no”} is a typical EP, whose support
increases from 0.2% in C1 to 57.6% in C2, at a growth rate of 288. Note
that an item is either a simple equality test on a categorical attribute or a
test of whether a quantitative attribute’s value lies in an interval. Each EP is
a multi-attribute test and can be very strong at differentiating instances of
one class from another. For instance, if a new sample X contains the above
EP, then with odds of 99.6% we can claim that X belongs to C2. In general,
the differentiating power of an EP is roughly proportional to its growth rate
and its support in the target class.
For each class C, CAEP finds EPs satisfying given support and growth
rate thresholds, where the growth rate is computed with respect to the set of
all non-C samples versus the target set of all C samples. “Border-based”
algorithms can be used for this purpose. When classifying a new sample,
X, for each class C, the differentiating power of the EPs of class C that
occur in X is aggregated to derive a score for C, which is then normalized.
The class with the largest normalized score determines the class label of X.
CAEP has been found to be more accurate than C4.5 and association-based
classification on several data sets. It also performs well on data sets where
the main class of interest is in the minority, and it scales up with data volume
and dimensionality. An alternative classifier, called the JEP-classifier, was
proposed based on jumping emerging patterns (JEPs). A JEP is a special
type of EP, defined as an itemset whose support increases abruptly from
zero in one data set to nonzero in another data set. The two classifiers are
considered complementary.
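As a rough, illustrative sketch of the CAEP idea (the normalization step and the published border-based mining are omitted, and the data and thresholds are invented), the code below computes itemset supports in the target class and in the non-target samples, keeps itemsets whose growth rate exceeds a threshold, and scores a new sample by aggregating the strength growth/(growth + 1) × support of the EPs it contains.

```python
# Illustrative emerging-pattern mining and score aggregation.
from itertools import combinations

def support(itemset, samples):
    return sum(1 for s in samples if itemset <= s) / max(len(samples), 1)

def mine_eps(target, others, min_growth=5.0, max_len=2):
    items = sorted({i for s in target for i in s})
    eps = []
    for k in range(1, max_len + 1):
        for cand in map(set, combinations(items, k)):
            sup_t, sup_o = support(cand, target), support(cand, others)
            growth = float("inf") if sup_o == 0 else sup_t / sup_o
            if sup_t > 0 and growth >= min_growth:
                eps.append((cand, sup_t, growth))
    return eps

def score(sample, eps):
    # Aggregate growth/(growth+1) * support over the class's EPs in the sample.
    total = 0.0
    for itemset, sup, growth in eps:
        if itemset <= sample:
            g = 1.0 if growth == float("inf") else growth / (growth + 1.0)
            total += g * sup
    return total

# Toy usage: two classes of transactions, then score a new sample per class.
c1 = [{"age=30", "student=no"}, {"age=30", "credit=fair"}]
c2 = [{"age=40", "student=yes"}, {"age=40", "credit=good"}]
eps_c1, eps_c2 = mine_eps(c1, c2), mine_eps(c2, c1)
x = {"age=30", "student=no"}
print("score C1:", score(x, eps_c1), "score C2:", score(x, eps_c2))
```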
5.3.4 Support Vector Machines and Classification
5.3.4.1 Support Vector Machines
In machine learning, support vector machines [7] are supervised learning
models with associated learning algorithms that analyze data and recognize
patterns, used for classification and regression analysis. The basic SVM
takes a set of input data and predicts, for each given input, which of two
possible classes forms the output, making it a non-probabilistic binary
linear classifier. Given a set of training examples, each marked as belonging
to one of two categories, an SVM training algorithm builds a model that
assigns new examples into one category or the other. An SVM model is
a representation of the examples as points in space, mapped so that the
examples of the separate categories are divided by a clear gap that is as
wide as possible. New examples are then mapped into that same space and
predicted to belong to a category based on which side of the gap they fall on.
In addition to performing linear classification, SVMs can efficiently perform
non-linear classification using what is called the kernel trick, implicitly
mapping their inputs into high-dimensional feature spaces. More formally,
a support vector machine constructs a hyperplane or set of hyperplanes in
a high- or infinite-dimensional space, which can be used for classification,
regression, or other tasks. Intuitively, a good separation is achieved by
the hyperplane that has the largest distance to the nearest training data
point of any class (so-called functional margin), since in general the larger
the margin the lower the generalization error of the classifier. Whereas
the original problem may be stated in a finite-dimensional space, it often
happens that the sets to discriminate are not linearly separable in that space.
For this reason, it was proposed that the original finite-dimensional space
be mapped into a much higher-dimensional space, presumably making the
separation easier in that space. To keep the computational load reasonable,
the mappings used by SVM schemes are designed to ensure that dot
products may be computed easily in terms of the variables in the original
space, by defining them in terms of a kernel function K(x, y) selected to
suit the problem [20]. The hyperplanes in the higher-dimensional space are
defined as the set of points whose inner product with a vector in that space
is constant. The vectors defining the hyperplanes can be chosen to be linear
combinations with parameters of images of feature vectors that occur in the
data base. With this choice of a hyperplane, the points in the feature space
that are mapped into the hyperplane are defined by the relation
Σ_i α_i K(x_i, x) = constant. Note that if K(x, y) becomes small as y grows
further away from x, each element in the sum measures the degree of
closeness of the test point x to the corresponding data base point x_i.
In this way, the sum of kernels
above can be used to measure the relative nearness of each test point to the
data points originating in one or the other of the sets to be discriminated.
Note the fact that the set of points x mapped into any hyperplane can be
quite convoluted as a result, allowing much more complex discrimination
between sets which are not convex at all in the original space.
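The sketch below evaluates the relation Σ_i α_i K(x_i, x) for an RBF kernel at a few arbitrary test points, illustrating how each term shrinks as the test point moves away from the corresponding data point; the α values, γ and data points are invented for illustration.

```python
# Evaluate sum_i alpha_i * K(x_i, x) with an RBF kernel
# K(a, b) = exp(-gamma * ||a - b||^2).
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def decision_sum(x, points, alphas, gamma=0.5):
    """The kernel sum whose level sets define hyperplanes in feature space."""
    return sum(a * rbf_kernel(xi, x, gamma) for a, xi in zip(alphas, points))

points = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
alphas = np.array([0.8, -0.5, 0.3])

# A nearby test point gets large kernel contributions; a distant one does not.
for x in [np.array([0.1, 0.1]), np.array([3.0, 3.0])]:
    print(x, "->", decision_sum(x, points, alphas))
```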
5.3.4.2 Classifier with Support Vector Machines
The original optimal hyperplane algorithm proposed by Vapnik in 1963 was
a linear classifier. However, in 1992, Bernhard E. Boser, Isabelle M. Guyon
and Vladimir N. Vapnik suggested a way to create nonlinear classifiers
by applying the kernel trick (originally proposed by Aizerman et al. [3])
to maximum-margin hyperplanes [5]. The resulting algorithm is formally
similar, except that every dot product is replaced by a nonlinear kernel
function. This allows the algorithm to fit the maximum-margin hyperplane
in a transformed feature space. The transformation may be nonlinear and
the transformed space high dimensional; thus though the classifier is a
hyperplane in the high-dimensional feature space, it may be nonlinear in the
original input space. If the kernel used is a Gaussian radial basis function,
the corresponding feature space is a Hilbert space of infinite dimensions.
Maximum margin classifiers are well regularized, so the infinite dimensions
do not spoil the results. Some common kernels include:
Polynomial (homogeneous): K(x_i, x_j) = (x_i · x_j)^d.
Polynomial (inhomogeneous): K(x_i, x_j) = (x_i · x_j + 1)^d.
Gaussian radial basis function: K(x_i, x_j) = exp(−γ ||x_i − x_j||^2), for γ > 0.
Hyperbolic tangent: K(x_i, x_j) = tanh(κ x_i · x_j + c), for κ > 0 and c < 0.
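As a hedged usage example, the kernels listed above are available through scikit-learn's SVC; the parameter values (degree, γ, coef0) below are illustrative choices, not values prescribed by the text.

```python
# Fit SVMs with polynomial, RBF and sigmoid (hyperbolic tangent) kernels
# on a non-linearly separable toy data set and compare test accuracy.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

kernels = {
    "polynomial": SVC(kernel="poly", degree=3, coef0=1.0),  # (x_i . x_j + 1)^d
    "rbf": SVC(kernel="rbf", gamma=0.5),                    # exp(-gamma ||x_i - x_j||^2)
    "sigmoid": SVC(kernel="sigmoid", coef0=-1.0),           # tanh(kappa x_i . x_j + c)
}
for name, clf in kernels.items():
    clf.fit(X_train, y_train)
    print(name, "test accuracy:", round(clf.score(X_test, y_test), 3))
```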
The classifier with SVM has the following properties. SVMs belong
to a family of generalized linear classifiers and can be interpreted as an
extension of the perceptron. They can also be considered a special case of
Tikhonov regularization. A special property is that they simultaneously
minimize the empirical classification error and maximize the geometric
margin; hence they are also known as maximum margin classifiers. A
comparison of the SVM to other classifiers has been made by Meyer, Leisch
and Hornik [13].
Simple feature selection algorithms are ad hoc, but there are also more
methodical approaches. From a theoretical perspective, it can be shown
that optimal feature selection for supervised learning problems requires an
exhaustive search of all possible subsets of features of the chosen cardinality.
If large numbers of features are available, this is impractical. For practical
supervised learning algorithms, the search is for a satisfactory set of features
instead of an optimal set. Feature selection algorithms typically fall into
two categories: feature ranking and subset selection. Feature ranking ranks
the features by a metric and eliminates all features that do not achieve an
adequate score. Subset selection searches the set of possible features for the
optimal subset. In statistics, the most popular form of feature selection is
stepwise regression. It is a greedy algorithm that adds the best feature (or
deletes the worst feature) at each round. The main control issue is deciding
when to stop the algorithm. In machine learning, this is typically done by
cross-validation. In statistics, some criteria are optimized. This leads to the
inherent problem of nesting. More robust methods have been explored,
such as branch and bound and piecewise linear network. Subset selection
evaluates a subset of features as a group for suitability. Subset selection
algorithms can be broken into Wrappers, Filters and Embedded. Wrappers
use a search algorithm to search through the space of possible features
and evaluate each subset by running a model on the subset. Wrappers
can be computationally expensive and have a risk of overfitting to the
model. Filters are similar to Wrappers in the search approach, but instead
of evaluating against a model, a simpler filter is evaluated. Embedded
techniques are embedded in and specific to a model. Many popular search
approaches use greedy hill climbing, which iteratively evaluates a candidate
subset of features, then modifies the subset and evaluates whether the new
subset is an improvement over the old.
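A minimal sketch of such a greedy wrapper search follows, under the assumption that cross-validated accuracy of an arbitrary model (here a logistic regression pipeline) is the evaluation criterion: starting from the empty set, the feature whose addition most improves the score is added at each round, and the loop stops when no addition helps.

```python
# Greedy forward (wrapper) feature selection evaluated by cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

selected, remaining = [], list(range(X.shape[1]))
best_score = 0.0
while remaining:
    # Evaluate each candidate subset (current selection plus one feature).
    scores = {
        f: cross_val_score(model, X[:, selected + [f]], y, cv=5).mean()
        for f in remaining
    }
    f_best, s_best = max(scores.items(), key=lambda kv: kv[1])
    if s_best <= best_score:  # stopping rule: no improvement over the old subset
        break
    selected.append(f_best)
    remaining.remove(f_best)
    best_score = s_best

print("selected features:", selected, "cv accuracy:", round(best_score, 3))
```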