Assuming that each z_i is independent and identically distributed according to a multinomial distribution with probabilities τ_1, τ_2, ..., τ_K for the K clusters, and that the density of an observation x_i given z_i is
$$P(x_i \mid z_i, \Theta) = \prod_{k=1}^{K} P_k(x_i \mid \theta_k)^{z_{ik}},$$
the resulting complete-data log-likelihood is
$$\ell_C(\Theta; D) = \sum_{i=1}^{m} \sum_{k=1}^{K} z_{ik} \log[\tau_k P_k(x_i \mid \theta_k)],$$
which is also known as the classification log-likelihood in the clustering literature. The EM algorithm can
then be viewed as operating in two steps. In the E-step, we calculate
the class-conditional probability P(z_i | x_i, Θ) for each sequence under
each of the K clusters using the current parameter estimates. In the M-step, we
update the parameters by weighting each sequence according to its class-conditional
probability. Thus the EM algorithm is guaranteed to lead to a sequence
of Θ's which have non-decreasing likelihood, i.e., under fairly broad
conditions it will find at least a local maximum. Algorithm 4.10 shows
the main steps of the EM algorithm; an implementation of EM clustering
is provided in the MCLUST R package (http://www.stat.washington.
edu/mclust/).
Algorithm 4.10: EM clustering
Input: Data set D
Output: Clustering result
(1) First, initialize the parameters θ to some random values.
(2) Compute the best value for Z given these parameter values.
(3) Then, use the just-computed values of Z to compute a better estimate for the
parameters θ. Parameters associated with a particular value of Z will use only
those data points whose associated latent variable has that value.
(4) Iterate steps 2 and 3 until convergence.
(5) Return clustering result.
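To make the steps of Algorithm 4.10 concrete, the following sketch implements EM for a one-dimensional Gaussian mixture in Python. It is a minimal illustration, not the MCLUST implementation, and the function and variable names are chosen for this example only: it alternates the E-step (computing the class-conditional probabilities P(z_i | x_i, Θ)) with the M-step (re-estimating τ_k, μ_k and σ_k from the weighted data) until the log-likelihood stops improving.

```python
import numpy as np

def em_gmm_1d(x, K, n_iter=100, tol=1e-6, seed=0):
    """Minimal EM sketch for a 1-D Gaussian mixture with K components."""
    rng = np.random.default_rng(seed)
    m = len(x)
    # (1) Initialize the parameters Theta = (tau, mu, sigma) at random.
    tau = np.full(K, 1.0 / K)
    mu = rng.choice(x, K, replace=False).astype(float)
    sigma = np.full(K, np.std(x) + 1e-9)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: class-conditional probabilities P(z_i = k | x_i, Theta).
        dens = (1.0 / (np.sqrt(2 * np.pi) * sigma)
                * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2))
        weighted = tau * dens                      # shape (m, K)
        ll = np.log(weighted.sum(axis=1)).sum()    # observed-data log-likelihood
        resp = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: update the parameters, weighting each observation by resp.
        Nk = resp.sum(axis=0)
        tau = Nk / m
        mu = (resp * x[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk) + 1e-9
        if ll - prev_ll < tol:                     # likelihood is non-decreasing
            break
        prev_ll = ll
    labels = resp.argmax(axis=1)                   # MAP cluster assignment
    return labels, tau, mu, sigma

# Usage: two well-separated Gaussian groups.
data = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(6, 1, 200)])
labels, tau, mu, sigma = em_gmm_1d(data, K=2)
```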
4.3.5.2 Extensions of EM Algorithm
The EM algorithm for clustering has a number of limitations. First, the rate
of convergence can be very slow. This does not appear to be a problem in
practice for well-separated mixtures when started with reasonable values.
Second, the number of conditional probabilities associated with each
observation is equal to the number of components in the mixture, so that the
EM algorithm for clustering may not be practical for models with a very large
number of clusters. Finally, EM breaks down when the covariance matrix
corresponding to one or more components becomes ill-conditioned (singular
or nearly singular). If EM for a model with a certain number of components
is applied to a mixture with fewer groups than the number of mixture
components, then it may fail due to ill-conditioning. A number of variants
of the EM algorithm have been proposed for model-based clustering, some
of which can avoid the limitations of the EM algorithm. CEM and SEM
[19] are two widely used variants. The classification EM (CEM) algorithm
can be regarded as a classification version of the EM algorithm, where the
complete-data log-likelihood is maximized. It incorporates a classification step
(C-step) between the E-step and the M-step of the standard EM algorithm,
using the maximum a posteriori (MAP) principle. In the C-step, each object x_i is assigned to the
cluster that provides the maximum posterior probability. It has been
shown that the k-means algorithm is exactly the CEM algorithm for a
Gaussian mixture with equal proportions and a common spherical covariance matrix
of the form σ²I. Since most of the classical clustering criteria can be analyzed as
classification maximum likelihood criteria, the CEM algorithm turns out to
be quite a general clustering algorithm. However, from the practical point
of view, the solution provided by the CEM algorithm does depend on its
initial position, especially when the clusters are not well separated. Celeux
et al. considered a stochastic version of EM as well as the CEM algorithm
in the context of computing the MLE for finite mixture models. They call
it the stochastic EM (SEM) algorithm. With the SEM algorithm, the current
posterior probabilities are used in a stochastic E-step (S-step), wherein each
observation x_i is assigned to one of the K clusters according to the
posterior probability distributions for all clusters. Numerical experiments
have shown that SEM performs well and can overcome some of the
limitations of the EM algorithm.
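To illustrate how the variants differ, the following hedged sketch (with illustrative names, reusing the responsibility matrix `resp` from the em_gmm_1d example above) shows the hard assignment that CEM inserts between the E-step and the M-step, and the random assignment that SEM uses instead; neither is the reference implementation of [19], only a compact rendering of the two ideas.

```python
import numpy as np

def c_step(resp):
    """CEM: assign each object to the cluster with maximum posterior (MAP)."""
    hard = np.zeros_like(resp)
    hard[np.arange(len(resp)), resp.argmax(axis=1)] = 1.0
    return hard

def s_step(resp, rng):
    """SEM: draw each assignment at random from its posterior distribution."""
    hard = np.zeros_like(resp)
    for i, p in enumerate(resp):
        hard[i, rng.choice(len(p), p=p)] = 1.0
    return hard
```

In both cases the M-step is then carried out with the hard 0/1 weights in place of the soft posterior probabilities.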
4.4 High-dimensional Clustering Algorithms
The amounts of data collected in the world are so large that it is becoming increasingly
difficult for users to access them. Knowledge Discovery in Databases (KDD)
is the non-trivial process of identifying valid, novel, potentially useful
and ultimately understandable patterns in data [23]. The KDD process is
interactive and iterative, involving numerous steps. Data mining is one such
step in the KDD process. In this section, we focus on the high-dimensional
clustering problem, which is one of the most useful tasks in data mining for
discovering groups and identifying interesting distributions and patterns
in the underlying data. Thus, the goal of clustering is to partition a data
set into subgroups such that objects in each particular group are similar
and objects in different groups are dissimilar [15]. According to [31], four
problems need to be overcome for high-dimensional clustering:
(1) Multiple dimensions are hard to think in, impossible to visualize,
and due to the exponential growth of the number of possible values
with each dimension, complete enumeration of all subspaces becomes
intractable with increasing dimensionality. This problem is referred to
as the curse of dimensionality.
(2) The concept of distance becomes less precise as the number of
dimensions grows, since the distance between any two points in a
given dataset converges. The discrimination of the nearest and farthest
point in particular becomes meaningless:
$$\lim_{d \to \infty} \frac{dist_{\max} - dist_{\min}}{dist_{\min}} = 0.$$
(3) A cluster is intended to group objects that are related, based on
observations of their attributes' values. However, given a large number
of attributes, some of them will usually not be meaningful for a
given cluster. For example, in newborn screening, a cluster of samples
might identify newborns that share similar blood values, which might
lead to insights about the relevance of certain blood values for a disease.
But for different diseases, different blood values might form a cluster,
and other values might be uncorrelated. This is known as the local
feature relevance problem: different clusters might be found in different
subspaces, so a global filtering of attributes is not sufficient.
(4) Given a large number of attributes, it is likely that some attributes are
correlated. Hence, clusters might exist in arbitrarily oriented affine
subspaces. Recent research by Houle et al. [42] indicates that the
discrimination problems only occur when there is a high number of
irrelevant dimensions, and that shared-nearest-neighbor approaches
can improve results.
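The distance-concentration effect described in problem (2) is easy to observe empirically. The short sketch below (illustrative only, not taken from the cited works) draws uniform random points and prints the relative contrast (dist_max − dist_min)/dist_min, which shrinks toward zero as the dimensionality d grows.

```python
import numpy as np

rng = np.random.default_rng(1)
for d in (2, 10, 100, 1000):
    points = rng.random((1000, d))          # 1000 uniform points in [0, 1]^d
    query = rng.random(d)                   # a reference point
    dist = np.linalg.norm(points - query, axis=1)
    contrast = (dist.max() - dist.min()) / dist.min()
    print(f"d={d:5d}  relative contrast = {contrast:.3f}")
# The printed contrast decreases with d, so the nearest and farthest
# neighbors become almost indistinguishable in high dimensions.
```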
Many clustering algorithms have been proposed for the high-dimensional
clustering problem. In general, the algorithmic approaches for finding
these subspaces (i.e., traversing the search space of all possible axis-parallel
subspaces) can be divided into the following two categories: Bottom-up
approaches and Top-down approaches. In the following sections, we will
discuss these two categories in detail.
4.4.1 Bottom-up Approaches
The exponential search space that needs to be traversed is equivalent to
the search space of the frequent item set problem in market basket analysis
in transaction databases [10]. Each attribute represents an item and each
subspace cluster is a transaction of the items representing the attributes that
span the corresponding subspace. Finding item sets with frequency 1 then
relates to finding all combinations of attributes that constitute a subspace
containing at least one cluster. This observation is the rationale of most
bottom-up subspace clustering approaches. The subspaces that contain
clusters are determined starting from all one-dimensional subspaces that
accommodate at least one cluster by employing a search strategy similar to
frequent item set mining algorithms. To apply any efficient frequent item
set mining algorithm, the cluster criterion must implement a downward
closure property (also called monotonicity property): If subspace S contains
a cluster, then any subspace T ⊆ S must also contain a cluster. The reverse
implication, that if a subspace T does not contain a cluster, then any
superspace S ⊇ T also cannot contain a cluster, can be used for pruning, that is,
excluding specific subspaces from consideration. Let us note that there are
bottom-up algorithms that do not use an APRIORI-like subspace search,
but instead apply other search heuristics. For this reason, the bottom-up
approaches are also referred to as subspace clustering.
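A minimal sketch of this bottom-up strategy is given below (the cluster criterion is a simple stand-in, not that of any specific algorithm): starting from the one-dimensional subspaces that contain a cluster, (k+1)-dimensional candidates are generated by joining k-dimensional ones, and a candidate is pruned unless all of its k-dimensional projections also contain a cluster, exactly as in frequent item set mining.

```python
from itertools import combinations

def bottom_up_subspaces(dims, contains_cluster):
    """Enumerate subspaces satisfying a downward-closed cluster criterion.

    dims             -- iterable of attribute indices
    contains_cluster -- callable: frozenset of attributes -> bool
    """
    result = []
    # Level 1: all one-dimensional subspaces that contain a cluster.
    level = [frozenset([d]) for d in dims if contains_cluster(frozenset([d]))]
    while level:
        result.extend(level)
        # Join step: build (k+1)-dimensional candidates from k-dimensional ones.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1}
        # Prune step: every k-dimensional projection must itself contain a cluster.
        kept = set(level)
        level = [c for c in candidates
                 if all(c - {d} in kept for d in c) and contains_cluster(c)]
    return result

# Toy usage with a hypothetical criterion: clusters exist exactly in the
# subspaces of the attribute set {0, 1, 2}.
good = frozenset({0, 1, 2})
print(bottom_up_subspaces(range(5), lambda s: s <= good))
```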
CLIQUE [1], the pioneering approach to subspace clustering, uses a
grid-based clustering notion. The data space is partitioned by an axis-
parallel grid into equal units of width ζ. Only units which contain at least
τ points are considered dense. A cluster is defined as a maximal set
of adjacent dense units. Since dense units satisfy the downward closure
property, subspace clusters can be explored rather efficiently in a bottom-
up way. Starting with all one-dimensional dense units, (k+1)-dimensional
dense units are computed from the set of k-dimensional dense units in an
APRIORI-like style. If a (k+1)-dimensional unit contains a projection onto
a k-dimensional unit that is not dense, then the (k+1)-dimensional unit
also cannot be dense. Further, a heuristic that is based on the minimum
description length principle is introduced to discard candidate units
within less interesting subspaces (i.e., subspaces that contain only a very
small number of dense units). This way, the efficiency of the algorithm is
enhanced but at the cost of incomplete results, namely some true clusters
are lost. There are some variants of CLIQUE. The method ENCLUS [11] also
relies on a fixed grid, but searches for subspaces that potentially contain
one or more clusters rather than for dense units. Three quality criteria
for subspaces are introduced, one implementing the downward closure
property. The method MAFIA [45] uses an adaptive grid. The generation of
subspace clusters is similar to CLIQUE. Another variant of CLIQUE, called
nCluster [41], allows overlapping windows of length δ as one-dimensional
units of the grid. In summary, all grid-based methods use a simple but rather
efficient cluster model. The shape of each resulting cluster corresponds to a
polygon with axis-parallel lines in the corresponding subspace. Obviously,
the accuracy and efficiency of CLIQUE and its variants primarily depend
on the granularity and the positioning of the grid. A higher grid granularity
results in higher runtime requirements but will most likely produce more
accurate results. SUBCLU [16] uses the DBSCAN cluster model of density
connected sets. It is shown that density-connected sets satisfy the downward
closure property. This enables SUBCLU to search for density based clusters
in subspaces in an APRIORI-like style. The resulting clusters may exhibit
an arbitrary shape and size in the corresponding subspaces. RIS [39] is
a subspace ranking algorithm that uses a quality criterion to rate the
interestingness of subspaces. This criterion is based on the monotonicity
of core points which are the central concept of the density-based clustering
notion of DBSCAN. An Apriori-like subspace generation method (similar
to SUBCLU) is used to compute all relevant subspaces and rank them by
interestingness. The clusters can be computed in the generated subspaces
using any clustering method of choice. SURFING [14] is a subspace ranking
algorithm that does not rely on a global density threshold. It computes the
interestingness of a subspace based on the distribution of the k-nearest
neighbors of all data points in the corresponding projection. An efficient,
bottom-up subspace expansion heuristic ensures that less interesting
subspaces are not generated for examination. More subspace clustering
algorithms are discussed in detail in [53, 56, 35].
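To make the grid-based cluster model that CLIQUE and its variants build on more tangible, the following sketch (with illustrative parameter names; ζ is the grid width and τ the density threshold, as above) counts the points falling into each unit of the grid in a given subspace and keeps the dense ones; higher-dimensional dense units would then be generated from these in an APRIORI-like fashion.

```python
import numpy as np
from collections import Counter

def dense_units(data, subspace, zeta=0.1, tau=20):
    """Find the dense grid units of a subspace.

    data     -- array of shape (n, d) with values scaled to [0, 1)
    subspace -- tuple of attribute indices spanning the subspace
    zeta     -- width of the axis-parallel grid cells
    tau      -- density threshold: a unit is dense if it holds >= tau points
    """
    cells = Counter()
    for row in data[:, subspace]:
        cells[tuple((row // zeta).astype(int))] += 1   # grid unit of this point
    return {cell for cell, count in cells.items() if count >= tau}

# Toy usage with hypothetical data: a dense blob only in attributes (0, 1).
rng = np.random.default_rng(2)
data = rng.random((500, 4))
data[:200, :2] = 0.05 + 0.05 * rng.random((200, 2))   # blob in subspace (0, 1)
print(dense_units(data, (0, 1)))   # contains the unit holding the blob
print(dense_units(data, (2, 3)))   # typically empty: points spread thinly
```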
4.4.2 Top-down Approaches
The rationale behind top-down approaches is to determine the subspace
of a cluster starting from the full-dimensional space. This is usually done
by determining a subset of attributes for a given set of points (potential
cluster members) such that the points meet the given cluster criterion when
projected onto the corresponding subspace. Obviously, the dilemma is
that for the determination of the subspace of a cluster, at least some cluster
members must be identified. On the other hand, in order to determine cluster
memberships, the subspace of each cluster must be known. To escape from
this circular dependency, most top-down approaches rely on a rather strict
assumption, which we call the locality assumption. It is assumed that the
subspace of a cluster can be derived from the local neighborhood (in the
full-dimensional data space) of the cluster center or the cluster members.
In other words, it is assumed that even in the full-dimensional space, the
subspace of each cluster can be learned from the local neighborhood of
cluster representatives or cluster members. Other top-down approaches
that do not rely on the locality assumption use random sampling in order
to generate a set of potential cluster members. Because of the way they
work, top-down approaches are also referred to as projective clustering. Projective
clustering is an efficient way of dealing with high-dimensional clustering
problems. Explicitly or implicitly, projective clustering algorithms assume
the following definition: Given a data set D of n-dimensional data objects,
a projected cluster is defined as a pair (C_k, S_k), where C_k is a subset of data
objects and S_k is a subset of attributes such that the data objects in C_k are
projected along each attribute in S_k onto a small range of values, compared
to the range of values of the whole data set in S_k, and the data objects in C_k
are uniformly distributed along every other attribute not in S_k. The task
of projective clustering is to search for and report all projected clusters in the
search space.
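The following sketch checks this definition for a candidate pair (C_k, S_k) under a deliberately simplified criterion (only the small-range condition along the attributes of S_k is tested, not the uniformity along the remaining attributes); all names and the threshold are illustrative, not part of any published algorithm.

```python
import numpy as np

def is_projected_cluster(data, members, attrs, max_ratio=0.2):
    """Check the small-range condition for the candidate pair (C_k, S_k).

    data      -- full data set, array of shape (n, d)
    members   -- indices of the candidate cluster C_k
    attrs     -- attribute indices of the candidate subspace S_k
    max_ratio -- how small the members' range must be, relative to the
                 range of the whole data set along the same attribute
    """
    cluster = data[np.asarray(members)]
    for a in attrs:
        full_range = data[:, a].max() - data[:, a].min()
        cluster_range = cluster[:, a].max() - cluster[:, a].min()
        # Along every attribute of S_k the members must project onto a
        # small range of values compared to the whole data set.
        if cluster_range > max_ratio * full_range:
            return False
    return True

# Toy usage: 300 objects; the first 50 are tight along attributes 0 and 2 only.
rng = np.random.default_rng(3)
data = rng.random((300, 5))
data[:50, [0, 2]] = 0.5 + 0.02 * rng.random((50, 2))
print(is_projected_cluster(data, range(50), [0, 2]))   # True
print(is_projected_cluster(data, range(50), [0, 1]))   # False: attribute 1 is spread out
```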