function used by the artificial neurons (or “nodes”) be differentiable. The
main framework of backpropagation can be described as Algorithm 5.2.
In step 2.2, the ratio influences the speed and quality of learning; it is called
the learning rate. The sign of the gradient of a weight indicates the direction
in which the error is increasing; this is why the weight must be updated in
the opposite direction.
Algorithm 5.2: Backpropagation
(1) Propagation
(1.1) Forward propagation of a training pattern’s input through the neural
network in order to generate the propagation’s output activations.
(1.2) Backward propagation of the propagation’s output activations through
the neural network using the training pattern’s target in order to generate the
deltas of all output and hidden neurons.
(2) Weight update
(2.1) For each weight, multiply its output delta and input activation to obtain
the gradient of the weight.
(2.2) Bring the weight in the opposite direction of the gradient by subtracting
a ratio of it from the weight.
(3) Repeat phases 1 and 2 until the performance of the network is satisfactory.
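To make the procedure concrete, the following minimal sketch implements Algorithm 5.2 for a one-hidden-layer network with sigmoid activations and a squared-error loss; the layer sizes, learning rate and iteration count are illustrative choices rather than values prescribed by the text.

```python
# Minimal sketch of Algorithm 5.2: forward pass, backward pass (deltas),
# then a gradient-descent weight update with learning rate eta.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy training pattern: 2 inputs, 1 target output.
x = np.array([0.5, -0.2])
t = np.array([1.0])

# Randomly initialised weights: input->hidden (2x3), hidden->output (3x1).
W1 = rng.normal(scale=0.5, size=(2, 3))
W2 = rng.normal(scale=0.5, size=(3, 1))
eta = 0.1  # learning rate (the "ratio" in step 2.2)

for _ in range(1000):
    # (1.1) Forward propagation of the input through the network.
    h = sigmoid(x @ W1)          # hidden activations
    y = sigmoid(h @ W2)          # output activations

    # (1.2) Backward propagation: deltas of the output and hidden neurons.
    delta_out = (y - t) * y * (1 - y)
    delta_hid = (delta_out @ W2.T) * h * (1 - h)

    # (2) Weight update: gradient = input activation x output delta,
    # and each weight moves opposite to the sign of its gradient.
    W2 -= eta * np.outer(h, delta_out)
    W1 -= eta * np.outer(x, delta_hid)

print("final output:", sigmoid(sigmoid(x @ W1) @ W2))
```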
5.3.2.2 Classifier with Backpropagation
The model structure of the BP (backpropagation) classification algorithm
uses full connections between all layers and nodes from the input layer to
the output layer, which obviously requires a great deal of computation.
However, a standard neural network or backpropagation-based decision
support system is still not fully satisfactory, because better decision
performance and fewer training iterations are desirable when developing
a system for a specific domain area.
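As a hedged illustration of such a fully connected classifier, the sketch below trains scikit-learn's MLPClassifier (which fits its weights by backpropagation) on synthetic data; the layer sizes and iteration limit are arbitrary choices, not values from the text.

```python
# A fully connected feed-forward classifier trained by backpropagation,
# using scikit-learn's MLPClassifier on synthetic two-class data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two fully connected hidden layers; every node of one layer feeds every
# node of the next, so training cost grows quickly with layer sizes.
clf = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=500, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```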
5.3.3 Association-based Classification
Association rule mining is an important and highly active area of data
mining research. Recently, data mining techniques have been developed
that apply concepts used in association rule mining to the problem of
classification. In this section, we study three methods in historical order.
The first two, ARCS [2] and associative classification [18], use
association rules for classification. The third method, CAEP [8], mines
“emerging patterns” that consider the concept of support used in mining
associations. The first method mines association rules based on clustering
and then employs the rules for classification. ARCS, the Association Rule
Clustering System, mines association rules of the form Aquan1 ∧ Aquan2 ⇒
Acat, where Aquan1 and Aquan2 are tests on quantitative attribute ranges
(where the ranges are dynamically determined), and Acat assigns a class
label for a categorical attribute from the given training data. Association
rules are plotted on a 2-D grid. The algorithm scans the grid, searching for
rectangular clusters of rules. In this way, adjacent ranges of the quantitative
attributes occurring within a rule cluster may be combined. The clustered
association rules generated by ARCS were empirically found to be slightly
more accurate than C4.5 when there are outliers in the data. The accuracy of
ARCS is related to the degree of discretization used. In terms of scalability,
ARCS requires “a constant amount of memory”, regardless of the database
size. C4.5 has exponentially higher execution times than ARCS, requiring the
entire database, multiplied by some factor, to fit entirely in main memory.
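The following toy sketch (an illustrative simplification, not the published ARCS implementation) shows the grid idea: rules over two binned quantitative attributes are laid out on a 2-D grid and adjacent cells that predict the same class are merged, so neighbouring ranges combine into a single rule. Only row-wise merging is shown here; ARCS itself searches for full rectangular clusters.

```python
# Illustrative only: merge adjacent quantitative ranges of rules that
# predict the same class, in the spirit of ARCS's grid clustering.
import numpy as np

# grid[i, j] holds the class label predicted by the rule for
# (age bin i) AND (income bin j); -1 means no rule was mined for that cell.
grid = np.array([
    [0, 0, 1, -1],
    [0, 0, 1,  1],
    [-1, 2, 2,  2],
])

def merge_row_runs(grid):
    """Merge horizontally adjacent cells with the same label into range rules."""
    rules = []
    for i, row in enumerate(grid):
        j = 0
        while j < len(row):
            if row[j] == -1:
                j += 1
                continue
            start = j
            while j + 1 < len(row) and row[j + 1] == row[start]:
                j += 1
            rules.append((i, (start, j), int(row[start])))  # (age bin, income range, class)
            j += 1
    return rules

for age_bin, income_range, label in merge_row_runs(grid):
    print(f"age_bin={age_bin} AND income_bin in {income_range} => class {label}")
```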
The second method is referred to as associative classification. It mines
rules of the form condset ⇒ y, where condset is a set of items (or attribute-
value pairs) and y is a class label. Rules that satisfy a pre-specified minimum
support are frequent, where a rule has support s if s% of the samples in
the given data set contain condset and belong to class y. A rule satisfying
minimum confidence is called accurate, where a rule has confidence c if c%
of the samples in the given data set that contain condset belong to class y. If
a set of rules has the same condset, then the rule with the highest confidence
is selected as the possible rule (PR) to represent the set.
The association classification method consists of two steps. The first
step finds the set of all PRs that are both frequent and accurate. It uses an
iterative approach, where prior knowledge is used to prune the rule search.
The second step uses a heuristic method to construct the classifier, where the
discovered rules are organized according to decreasing precedence based
on their confidence and support. The algorithm may require several passes
over the data set, depending on the length of the longest rule found.
When classifying a new sample, the first rule satisfying the sample is
used to classify it. The classifier also contains a default rule, having lowest
precedence, which specifies a default class for any new sample that is
not satisfied by any other rule in the classifier. In general, the associative
classification method was empirically found to be more accurate than C4.5
on several data sets. Each of the above two steps was shown to have linear
scale-up.
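The toy sketch below illustrates these two steps in a hedged, simplified form (it is not a faithful implementation of the published method): it mines rules condset ⇒ y that meet minimum support and confidence, keeps the highest-confidence class per condset as its PR, orders the PRs by decreasing confidence and support, and classifies a sample with the first matching rule, falling back to a default class. The data set, item names and thresholds are made up for illustration.

```python
# Simplified associative classification: mine PRs, order them, classify
# with the first rule that matches, else use a default class.
from itertools import combinations
from collections import Counter

# Each training sample is (set of items, class label); thresholds are illustrative.
data = [
    ({"age=young", "student=yes"}, "buys"),
    ({"age=young", "student=no"}, "no_buy"),
    ({"age=old", "student=yes"}, "buys"),
    ({"age=old", "student=no"}, "buys"),
]
min_sup, min_conf = 0.25, 0.6

def mine_prs(data, max_len=2):
    n = len(data)
    items = sorted({i for s, _ in data for i in s})
    rules = []
    for k in range(1, max_len + 1):
        for condset in combinations(items, k):
            cond = set(condset)
            covered = [y for s, y in data if cond <= s]
            if not covered:
                continue
            # PR: for a given condset, keep only the most confident class.
            y, cnt = Counter(covered).most_common(1)[0]
            sup, conf = cnt / n, cnt / len(covered)
            if sup >= min_sup and conf >= min_conf:
                rules.append((cond, y, sup, conf))
    # Decreasing precedence: higher confidence first, then higher support.
    return sorted(rules, key=lambda r: (-r[3], -r[2]))

def classify(sample, rules, default="buys"):
    # The first rule whose condset is contained in the sample fires;
    # otherwise the default rule (lowest precedence) applies.
    for cond, y, _, _ in rules:
        if cond <= sample:
            return y
    return default

rules = mine_prs(data)
print(classify({"age=young", "student=no"}, rules))  # -> "no_buy"
```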
The third method, CAEP (classification by aggregating emerging
patterns), uses the notion of itemset supports to mine emerging patterns
(EPs), which are used to construct a classifier. Roughly speaking, an EP is
an itemset (or set of items) whose support increases significantly from one
class of data to another. The ratio of the two supports is called the growth
rate of the EP. For example, suppose that we have a data set of customers
with the classes buys_computer = “yes”, or C1, and buys_computer = “no”,
or C2. The itemset {age = “30”, student = “no”} is a typical EP, whose support
increases from 0.2% in C1 to 57.6% in C2, at a growth rate of 288. Note
that an item is either a simple equality test on a categorical attribute or a
test of whether a quantitative attribute’s value lies in an interval. Each EP is
a multi-attribute test and can be very strong at differentiating instances of
one class from another. For instance, if a new sample X contains the above
EP, then with odds of 99.6% we can claim that X belongs to C2. In general,
the differentiating power of an EP is roughly proportional to its growth rate
and its support in the target class.
For each class C, CAEP finds EPs satisfying given support and growth
rate thresholds, where the growth rate is computed with respect to the set of
all non-C samples versus the target set of all C samples. “Border-based”
algorithms can be used for this purpose. When classifying a new sample,
X, for each class C, the differentiating power of the EPs of class C that
occur in X is aggregated to derive a score for C, which is then normalized.
The class with the largest normalized score determines the class label of X.
CAEP has been found to be more accurate than C4.5 and association-based
classification on several data sets. It also performs well on data sets where
the main class of interest is in the minority, and it scales up with data volume
and dimensionality. An alternative classifier, called the JEP-classifier, was
proposed based on jumping emerging patterns (JEPs). A JEP is a special
type of EP, defined as an itemset whose support increases abruptly from
zero in one data set to nonzero in another data set. The two classifiers are
considered complementary.
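As a rough, illustrative sketch of the CAEP idea (the normalization step and the published border-based mining are omitted, and the data and thresholds are invented), the code below computes itemset supports in the target class and in the non-target samples, keeps itemsets whose growth rate exceeds a threshold, and scores a new sample by aggregating the strength growth/(growth + 1) × support of the EPs it contains.

```python
# Illustrative emerging-pattern mining and score aggregation.
from itertools import combinations

def support(itemset, samples):
    return sum(1 for s in samples if itemset <= s) / max(len(samples), 1)

def mine_eps(target, others, min_growth=5.0, max_len=2):
    items = sorted({i for s in target for i in s})
    eps = []
    for k in range(1, max_len + 1):
        for cand in map(set, combinations(items, k)):
            sup_t, sup_o = support(cand, target), support(cand, others)
            growth = float("inf") if sup_o == 0 else sup_t / sup_o
            if sup_t > 0 and growth >= min_growth:
                eps.append((cand, sup_t, growth))
    return eps

def score(sample, eps):
    # Aggregate growth/(growth+1) * support over the class's EPs in the sample.
    total = 0.0
    for itemset, sup, growth in eps:
        if itemset <= sample:
            g = 1.0 if growth == float("inf") else growth / (growth + 1.0)
            total += g * sup
    return total

# Toy usage: two classes of transactions, then score a new sample per class.
c1 = [{"age=30", "student=no"}, {"age=30", "credit=fair"}]
c2 = [{"age=40", "student=yes"}, {"age=40", "credit=good"}]
eps_c1, eps_c2 = mine_eps(c1, c2), mine_eps(c2, c1)
x = {"age=30", "student=no"}
print("score C1:", score(x, eps_c1), "score C2:", score(x, eps_c2))
```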
5.3.4 Support Vector Machines and Classification
5.3.4.1 Support Vector Machines
In machine learning, support vector machines [7] are supervised learning
models with associated learning algorithms that analyze data and recognize
patterns, used for classification and regression analysis. The basic SVM
takes a set of input data and predicts, for each given input, which of two
possible classes forms the output, making it a non-probabilistic binary
linear classifier. Given a set of training examples, each marked as belonging
to one of two categories, an SVM training algorithm builds a model that
assigns new examples into one category or the other. An SVM model is
a representation of the examples as points in space, mapped so that the
examples of the separate categories are divided by a clear gap that is as
wide as possible. New examples are then mapped into that same space and
predicted to belong to a category based on which side of the gap they fall on.
In addition to performing linear classification, SVMs can efficiently perform
non-linear classification using what is called the kernel trick, implicitly
mapping their inputs into high-dimensional feature spaces. More formally,
a support vector machine constructs a hyperplane or set of hyperplanes in
a high- or infinite-dimensional space, which can be used for classification,
regression, or other tasks. Intuitively, a good separation is achieved by
the hyperplane that has the largest distance to the nearest training data
point of any class (so-called functional margin), since in general the larger
the margin the lower the generalization error of the classifier. Whereas
the original problem may be stated in a finite-dimensional space, it often
happens that the sets to discriminate are not linearly separable in that space.
For this reason, it was proposed that the original finite-dimensional space
be mapped into a much higher-dimensional space, presumably making the
separation easier in that space. To keep the computational load reasonable,
the mappings used by SVM schemes are designed to ensure that dot
products may be computed easily in terms of the variables in the original
space, by defining them in terms of a kernel function K(x, y) selected to
suit the problem [20]. The hyperplanes in the higher-dimensional space are
defined as the set of points whose inner product with a vector in that space
is constant. The vectors defining the hyperplanes can be chosen to be linear
combinations with parameters of images of feature vectors that occur in the
data base. With this choice of a hyperplane, the points in the feature space
that are mapped into the hyperplane are defined by the relation
Σ_i α_i K(x_i, x) = constant. Note that if K(x, y) becomes small as y grows
further away from x, each element in the sum measures the degree of
closeness of the test point x to the corresponding data base point x_i.
In this way, the sum of kernels
above can be used to measure the relative nearness of each test point to the
data points originating in one or the other of the sets to be discriminated.
Note the fact that the set of points x mapped into any hyperplane can be
quite convoluted as a result, allowing much more complex discrimination
between sets which are not convex at all in the original space.
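The sketch below evaluates the relation Σ_i α_i K(x_i, x) for an RBF kernel at a few arbitrary test points, illustrating how each term shrinks as the test point moves away from the corresponding data point; the α values, γ and data points are invented for illustration.

```python
# Evaluate sum_i alpha_i * K(x_i, x) with an RBF kernel
# K(a, b) = exp(-gamma * ||a - b||^2).
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def decision_sum(x, points, alphas, gamma=0.5):
    """The kernel sum whose level sets define hyperplanes in feature space."""
    return sum(a * rbf_kernel(xi, x, gamma) for a, xi in zip(alphas, points))

points = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
alphas = np.array([0.8, -0.5, 0.3])

# A nearby test point gets large kernel contributions; a distant one does not.
for x in [np.array([0.1, 0.1]), np.array([3.0, 3.0])]:
    print(x, "->", decision_sum(x, points, alphas))
```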
5.3.4.2 Classifier with Support Vector Machines
The original optimal hyperplane algorithm proposed by Vapnik in 1963 was
a linear classifier. However, in 1992, Bernhard E. Boser, Isabelle M. Guyon
and Vladimir N. Vapnik suggested a way to create nonlinear classifiers
by applying the kernel trick (originally proposed by Aizerman et al. [3])
to maximum-margin hyperplanes [5]. The resulting algorithm is formally
similar, except that every dot product is replaced by a nonlinear kernel
function. This allows the algorithm to fit the maximum-margin hyperplane
in a transformed feature space. The transformation may be nonlinear and
the transformed space high dimensional; thus though the classifier is a
hyperplane in the high-dimensional feature space, it may be nonlinear in the
original input space. If the kernel used is a Gaussian radial basis function,
the corresponding feature space is a Hilbert space of infinite dimensions.
Maximum margin classifiers are well regularized, so the infinite dimensions
do not spoil the results. Some common kernels include:
Polynomial (homogeneous): K(x_i, x_j) = (x_i · x_j)^d.
Polynomial (inhomogeneous): K(x_i, x_j) = (x_i · x_j + 1)^d.
Gaussian radial basis function: K(x_i, x_j) = exp(−γ ||x_i − x_j||^2), for γ > 0.
Hyperbolic tangent: K(x_i, x_j) = tanh(κ x_i · x_j + c), for κ > 0 and c < 0.
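As a hedged usage example, the kernels listed above are available through scikit-learn's SVC; the parameter values (degree, γ, coef0) below are illustrative choices, not values prescribed by the text.

```python
# Fit SVMs with polynomial, RBF and sigmoid (hyperbolic tangent) kernels
# on a non-linearly separable toy data set and compare test accuracy.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

kernels = {
    "polynomial": SVC(kernel="poly", degree=3, coef0=1.0),  # (x_i . x_j + 1)^d
    "rbf": SVC(kernel="rbf", gamma=0.5),                    # exp(-gamma ||x_i - x_j||^2)
    "sigmoid": SVC(kernel="sigmoid", coef0=-1.0),           # tanh(kappa x_i . x_j + c)
}
for name, clf in kernels.items():
    clf.fit(X_train, y_train)
    print(name, "test accuracy:", round(clf.score(X_test, y_test), 3))
```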
The classifier with SVM has the following properties. SVMs belong
to a family of generalized linear classifiers and can be interpreted as an
extension of the perceptron. They can also be considered a special case of
Tikhonov regularization. A special property is that they simultaneously
minimize the empirical classification error and maximize the geometric
margin; hence they are also known as maximum margin classifiers. A
comparison of the SVM to other classifiers has been made by Meyer, Leisch
and Hornik [13].
Simple feature selection algorithms are ad hoc, but there are also more
methodical approaches. From a theoretical perspective, it can be shown
that optimal feature selection for supervised learning problems requires an
exhaustive search of all possible subsets of features of the chosen cardinality.
If large numbers of features are available, this is impractical. For practical
supervised learning algorithms, the search is for a satisfactory set of features
instead of an optimal set. Feature selection algorithms typically fall into
two categories: feature ranking and subset selection. Feature ranking ranks
the features by a metric and eliminates all features that do not achieve an
adequate score. Subset selection searches the set of possible features for the
optimal subset. In statistics, the most popular form of feature selection is
stepwise regression. It is a greedy algorithm that adds the best feature (or
deletes the worst feature) at each round. The main control issue is deciding
when to stop the algorithm. In machine learning, this is typically done by
cross-validation. In statistics, some criteria are optimized. This leads to the
inherent problem of nesting. More robust methods have been explored,
such as branch and bound and piecewise linear network. Subset selection
evaluates a subset of features as a group for suitability. Subset selection
algorithms can be broken into Wrappers, Filters and Embedded. Wrappers
use a search algorithm to search through the space of possible features
and evaluate each subset by running a model on the subset. Wrappers
can be computationally expensive and have a risk of overfitting to the
model. Filters are similar to Wrappers in the search approach, but instead
of evaluating against a model, a simpler filter is evaluated. Embedded
techniques are embedded in and specific to a model. Many popular search
approaches use greedy hill climbing, which iteratively evaluates a candidate
subset of features, then modifies the subset and evaluates whether the new
subset is an improvement over the old.
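A minimal sketch of such a greedy wrapper search follows, under the assumption that cross-validated accuracy of an arbitrary model (here a logistic regression pipeline) is the evaluation criterion: starting from the empty set, the feature whose addition most improves the score is added at each round, and the loop stops when no addition helps.

```python
# Greedy forward (wrapper) feature selection evaluated by cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

selected, remaining = [], list(range(X.shape[1]))
best_score = 0.0
while remaining:
    # Evaluate each candidate subset (current selection plus one feature).
    scores = {
        f: cross_val_score(model, X[:, selected + [f]], y, cv=5).mean()
        for f in remaining
    }
    f_best, s_best = max(scores.items(), key=lambda kv: kv[1])
    if s_best <= best_score:  # stopping rule: no improvement over the old subset
        break
    selected.append(f_best)
    remaining.remove(f_best)
    best_score = s_best

print("selected features:", selected, "cv accuracy:", round(best_score, 3))
```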