Assuming that each z_i is independent and identically distributed according to a multinomial distribution with probabilities τ_1, τ_2, ..., τ_K for the K clusters, and that the density of an observation x_i given z_i is
$$P(x_i \mid z_i, \Theta) = \prod_{k=1}^{K} P_k(x_i \mid \theta_k)^{z_{ik}},$$
the resulting complete-data log-likelihood is
$$\ell_C(\Theta; D) = \sum_{i=1}^{m} \sum_{k=1}^{K} z_{ik} \log[\tau_k P_k(x_i \mid \theta_k)],$$
which is also known as the classification log-likelihood in the clustering literature. The EM algorithm can
then be viewed as operating in two steps. In the E-step, we calculate
the class-conditional probability P(z_i | x_i, Θ) for each sequence under
each of the K clusters using the current parameter estimates. In the M-step, we
update the parameters by weighting each sequence according to its class-conditional
probability. Thus the EM algorithm is guaranteed to lead to a sequence
of Θ's which have non-decreasing likelihood, i.e., under fairly broad
conditions it will find at least a local maximum. Algorithm 4.10 shows
the main steps of the EM algorithm; an implementation of EM clustering
is provided in the MCLUST R package (http://www.stat.washington.
edu/mclust/).
Algorithm 4.10: EM clustering
Input: Data set D
Output: Clustering result
(1) First, initialize the parameters θ to some random values.
(2) Compute the best value for Z given these parameter values.
(3) Then, use the just-computed values of Z to compute a better estimate for the
parameters θ. Parameters associated with a particular value of Z will use only
those data points whose associated latent variable has that value.
(4) Iterate steps 2 and 3 until convergence.
(5) Return clustering result.
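To make the steps of Algorithm 4.10 concrete, the following sketch implements EM for a one-dimensional Gaussian mixture in Python. It is a minimal illustration, not the MCLUST implementation, and the function and variable names are chosen for this example only: it alternates the E-step (computing the class-conditional probabilities P(z_i | x_i, Θ)) with the M-step (re-estimating τ_k, μ_k and σ_k from the weighted data) until the log-likelihood stops improving.

```python
import numpy as np

def em_gmm_1d(x, K, n_iter=100, tol=1e-6, seed=0):
    """Minimal EM sketch for a 1-D Gaussian mixture with K components."""
    rng = np.random.default_rng(seed)
    m = len(x)
    # (1) Initialize the parameters Theta = (tau, mu, sigma) at random.
    tau = np.full(K, 1.0 / K)
    mu = rng.choice(x, K, replace=False).astype(float)
    sigma = np.full(K, np.std(x) + 1e-9)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: class-conditional probabilities P(z_i = k | x_i, Theta).
        dens = (1.0 / (np.sqrt(2 * np.pi) * sigma)
                * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2))
        weighted = tau * dens                      # shape (m, K)
        ll = np.log(weighted.sum(axis=1)).sum()    # observed-data log-likelihood
        resp = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: update the parameters, weighting each observation by resp.
        Nk = resp.sum(axis=0)
        tau = Nk / m
        mu = (resp * x[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk) + 1e-9
        if ll - prev_ll < tol:                     # likelihood is non-decreasing
            break
        prev_ll = ll
    labels = resp.argmax(axis=1)                   # MAP cluster assignment
    return labels, tau, mu, sigma

# Usage: two well-separated Gaussian groups.
data = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(6, 1, 200)])
labels, tau, mu, sigma = em_gmm_1d(data, K=2)
```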
4.3.5.2 Extensions of EM Algorithm
The EM algorithm for clustering has a number of limitations. First, the rate
of convergence can be very slow. This does not appear to be a problem in
practice for well-separated mixtures when started with reasonable values.
Second, the number of conditional probabilities associated with each
observation is equal to the number of components in the mixture, so that the
EM algorithm for clustering may not be practical for models with a very large
number of clusters. Finally, EM breaks down when the covariance matrix
corresponding to one or more components becomes ill-conditioned (singular
or nearly singular). If EM for a model with a certain number of components
is applied to a mixture with fewer groups than the number of mixture
components, then it may fail due to ill-conditioning. A number of variants
of the EM algorithm have been proposed for model-based clustering, some
of which can avoid the limitations of the EM algorithm. CEM and SEM
[19] are two widely used variants. The classification EM (CEM) algorithm
can be regarded as a classification version of the EM algorithm, where the
complete-data log-likelihood is maximized. It incorporates a classification step
(C-step) between the E-step and the M-step of the standard EM algorithm,
using the maximum a posteriori (MAP) principle. In the C-step, each object x_i is assigned to the
cluster that provides the maximum posterior probability. It has been
shown that the k-means algorithm is exactly the CEM algorithm for a
Gaussian mixture with equal proportions and a common spherical covariance matrix
of the form σ²I. Since most of the classical clustering criteria can be analyzed as
classification maximum likelihood criteria, the CEM algorithm turns out to
be quite a general clustering algorithm. However, from the practical point
of view, the solution provided by the CEM algorithm does depend on its
initial position, especially when the clusters are not well separated. Celeux
et al. considered a stochastic version of EM as well as the CEM algorithm
in the context of computing the MLE for finite mixture models. They call
it the stochastic EM (SEM) algorithm. With the SEM algorithm, the current
posterior probabilities are used in a stochastic E-step (S-step), wherein each
observation x_i is assigned to one of the K clusters according to the
posterior probability distributions for all clusters. Numerical experiments
have shown that SEM performs well and can overcome some of the
limitations of the EM algorithm.
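To illustrate how the variants differ, the following hedged sketch (with illustrative names, reusing the responsibility matrix `resp` from the em_gmm_1d example above) shows the hard assignment that CEM inserts between the E-step and the M-step, and the random assignment that SEM uses instead; neither is the reference implementation of [19], only a compact rendering of the two ideas.

```python
import numpy as np

def c_step(resp):
    """CEM: assign each object to the cluster with maximum posterior (MAP)."""
    hard = np.zeros_like(resp)
    hard[np.arange(len(resp)), resp.argmax(axis=1)] = 1.0
    return hard

def s_step(resp, rng):
    """SEM: draw each assignment at random from its posterior distribution."""
    hard = np.zeros_like(resp)
    for i, p in enumerate(resp):
        hard[i, rng.choice(len(p), p=p)] = 1.0
    return hard
```

In both cases the M-step is then carried out with the hard 0/1 weights in place of the soft posterior probabilities.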
4.4 High-dimensional Clustering Algorithms
The amounts of data collected in the world are so large that it is becoming increasingly
difficult for users to access them. Knowledge Discovery in Databases (KDD)
is the non-trivial process of identifying valid, novel, potentially useful
and ultimately understandable patterns in data [23]. The KDD process is
interactive and iterative, involving numerous steps. Data mining is one such
step in the KDD process. In this section, we focus on the high-dimensional
clustering problem, which is one of the most useful tasks in data mining for
discovering groups and identifying interesting distributions and patterns
in the underlying data. Thus, the goal of clustering is to partition a data
set into subgroups such that objects in each particular group are similar
and objects in different groups are dissimilar [15]. According to [31], four
problems need to be overcome for high-dimensional clustering:
(1) Multiple dimensions are hard to think in, impossible to visualize,
and due to the exponential growth of the number of possible values
with each dimension, complete enumeration of all subspaces becomes
intractable with increasing dimensionality. This problem is referred to
as the curse of dimensionality.
(2) The concept of distance becomes less precise as the number of
dimensions grows, since the distance between any two points in a
given dataset converges. The discrimination of the nearest and farthest
point in particular becomes meaningless:
$$\lim_{d \to \infty} \frac{dist_{\max} - dist_{\min}}{dist_{\min}} = 0.$$
(3) A cluster is intended to group objects that are related, based on
observations of their attributes' values. However, given a large number
of attributes, some of them will usually not be meaningful for a
given cluster. For example, in newborn screening, a cluster of samples
might identify newborns that share similar blood values, which might
lead to insights about the relevance of certain blood values for a disease.
But for different diseases, different blood values might form a cluster,
and other values might be uncorrelated. This is known as the local
feature relevance problem: different clusters might be found in different
subspaces, so a global filtering of attributes is not sufficient.
(4) Given a large number of attributes, it is likely that some attributes are
correlated. Hence, clusters might exist in arbitrarily oriented affine
subspaces. Recent research by Houle et al. [42] indicates that the
discrimination problems only occur when there is a high number of
irrelevant dimensions, and that shared-nearest-neighbor approaches
can improve results.
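The distance-concentration effect described in problem (2) is easy to observe empirically. The short sketch below (illustrative only, not taken from the cited works) draws uniform random points and prints the relative contrast (dist_max − dist_min)/dist_min, which shrinks toward zero as the dimensionality d grows.

```python
import numpy as np

rng = np.random.default_rng(1)
for d in (2, 10, 100, 1000):
    points = rng.random((1000, d))          # 1000 uniform points in [0, 1]^d
    query = rng.random(d)                   # a reference point
    dist = np.linalg.norm(points - query, axis=1)
    contrast = (dist.max() - dist.min()) / dist.min()
    print(f"d={d:5d}  relative contrast = {contrast:.3f}")
# The printed contrast decreases with d, so the nearest and farthest
# neighbors become almost indistinguishable in high dimensions.
```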
Many clustering algorithms have been proposed for the high-dimensional
clustering problem. In general, the algorithmic approaches for finding
these subspaces (i.e., traversing the search space of all possible axis-parallel
subspaces) can be divided into the following two categories: Bottom-up
approaches and Top-down approaches. In the following sections, we will
discuss these two categories in detail.
4.4.1 Bottom-up Approaches
The exponential search space that needs to be traversed is equivalent to
the search space of the frequent item set problem in market basket analysis
in transaction databases [10]. Each attribute represents an item and each
subspace cluster is a transaction of the items representing the attributes that
span the corresponding subspace. Finding item sets with frequency 1 then
relates to finding all combinations of attributes that constitute a subspace
containing at least one cluster. This observation is the rationale of most
bottom-up subspace clustering approaches. The subspaces that contain
clusters are determined starting from all one-dimensional subspaces that
accommodate at least one cluster by employing a search strategy similar to
frequent item set mining algorithms. To apply any efficient frequent item
set mining algorithm, the cluster criterion must implement a downward
closure property (also called monotonicity property): If subspace S contains
a cluster, then any subspace T ⊆ S must also contain a cluster. The reverse
implication, that if a subspace T does not contain a cluster, then any
superspace S ⊇ T also cannot contain a cluster, can be used for pruning, that is,
excluding specific subspaces from consideration. Let us note that there are
bottom-up algorithms that do not use an APRIORI-like subspace search,
but instead apply other search heuristics. For this reason, the bottom-up
approaches are also referred to as subspace clustering.
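A minimal sketch of this bottom-up strategy is given below (the cluster criterion is a simple stand-in, not that of any specific algorithm): starting from the one-dimensional subspaces that contain a cluster, (k+1)-dimensional candidates are generated by joining k-dimensional ones, and a candidate is pruned unless all of its k-dimensional projections also contain a cluster, exactly as in frequent item set mining.

```python
from itertools import combinations

def bottom_up_subspaces(dims, contains_cluster):
    """Enumerate subspaces satisfying a downward-closed cluster criterion.

    dims             -- iterable of attribute indices
    contains_cluster -- callable: frozenset of attributes -> bool
    """
    result = []
    # Level 1: all one-dimensional subspaces that contain a cluster.
    level = [frozenset([d]) for d in dims if contains_cluster(frozenset([d]))]
    while level:
        result.extend(level)
        # Join step: build (k+1)-dimensional candidates from k-dimensional ones.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1}
        # Prune step: every k-dimensional projection must itself contain a cluster.
        kept = set(level)
        level = [c for c in candidates
                 if all(c - {d} in kept for d in c) and contains_cluster(c)]
    return result

# Toy usage with a hypothetical criterion: clusters exist exactly in the
# subspaces of the attribute set {0, 1, 2}.
good = frozenset({0, 1, 2})
print(bottom_up_subspaces(range(5), lambda s: s <= good))
```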
CLIQUE [1], the pioneering approach to subspace clustering, uses a
grid-based clustering notion. The data space is partitioned by an axis-
parallel grid into equal units of width ζ. Only units which contain at least
τ points are considered dense. A cluster is defined as a maximal set
of adjacent dense units. Since dense units satisfy the downward closure
property, subspace clusters can be explored rather efficiently in a bottom-
up way. Starting with all one-dimensional dense units, (k+1)-dimensional
dense units are computed from the set of k-dimensional dense units in an
APRIORI-like style. If a (k+1)-dimensional unit contains a projection onto
a k-dimensional unit that is not dense, then the (k+1)-dimensional unit
also cannot be dense. Further, a heuristic that is based on the minimum
description length principle is introduced to discard candidate units
within less interesting subspaces (i.e., subspaces that contain only a very
small number of dense units). This way, the efficiency of the algorithm is
enhanced but at the cost of incomplete results, namely some true clusters
are lost. There are some variants of CLIQUE. The method ENCLUS [11] also
relies on a fixed grid, but searches for subspaces that potentially contain
one or more clusters rather than for dense units. Three quality criteria
for subspaces are introduced, one implementing the downward closure
property. The method MAFIA [45] uses an adaptive grid. The generation of
subspace clusters is similar to CLIQUE. Another variant of CLIQUE, called
nCluster [41], allows overlapping windows of length δ as one-dimensional
units of the grid. In summary, all grid-based methods use a simple but rather
efficient cluster model. The shape of each resulting cluster corresponds to a
polygon with axis-parallel lines in the corresponding subspace. Obviously,
the accuracy and efficiency of CLIQUE and its variants primarily depend
on the granularity and the positioning of the grid. A higher grid granularity
results in higher runtime requirements but will most likely produce more
accurate results. SUBCLU [16] uses the DBSCAN cluster model of density
connected sets. It is shown that density-connected sets satisfy the downward
closure property. This enables SUBCLU to search for density based clusters
in subspaces in an APRIORI-like style. The resulting clusters may exhibit
an arbitrary shape and size in the corresponding subspaces. RIS [39] is
a subspace ranking algorithm that uses a quality criterion to rate the
interestingness of subspaces. This criterion is based on the monotonicity
of core points which are the central concept of the density-based clustering
notion of DBSCAN. An Apriori-like subspace generation method (similar
to SUBCLU) is used to compute all relevant subspaces and rank them by
interestingness. The clusters can be computed in the generated subspaces
using any clustering method of choice. SURFING [14] is a subspace ranking
algorithm that does not rely on a global density threshold. It computes the
interestingness of a subspace based on the distribution of the k-nearest
neighbors of all data points in the corresponding projection. An efficient,
bottom-up subspace expansion heuristic ensures that less interesting
subspaces are not generated for examination. More subspace clustering
algorithms are discussed in detail in [53, 56, 35].
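To make the grid-based cluster model that CLIQUE and its variants build on more tangible, the following sketch (with illustrative parameter names; ζ is the grid width and τ the density threshold, as above) counts the points falling into each unit of the grid in a given subspace and keeps the dense ones; higher-dimensional dense units would then be generated from these in an APRIORI-like fashion.

```python
import numpy as np
from collections import Counter

def dense_units(data, subspace, zeta=0.1, tau=20):
    """Find the dense grid units of a subspace.

    data     -- array of shape (n, d) with values scaled to [0, 1)
    subspace -- tuple of attribute indices spanning the subspace
    zeta     -- width of the axis-parallel grid cells
    tau      -- density threshold: a unit is dense if it holds >= tau points
    """
    cells = Counter()
    for row in data[:, subspace]:
        cells[tuple((row // zeta).astype(int))] += 1   # grid unit of this point
    return {cell for cell, count in cells.items() if count >= tau}

# Toy usage with hypothetical data: a dense blob only in attributes (0, 1).
rng = np.random.default_rng(2)
data = rng.random((500, 4))
data[:200, :2] = 0.05 + 0.05 * rng.random((200, 2))   # blob in subspace (0, 1)
print(dense_units(data, (0, 1)))   # contains the unit holding the blob
print(dense_units(data, (2, 3)))   # typically empty: points spread thinly
```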
4.4.2 Top-down Approaches
The rationale behind top-down approaches is to determine the subspace
of a cluster starting from the full-dimensional space. This is usually done
by determining a subset of attributes for a given set of points (potential
cluster members) such that the points meet the given cluster criterion when
projected onto the corresponding subspace. Obviously, the dilemma is
that for the determination of the subspace of a cluster, at least some cluster
members must be identified. On the other hand, in order to determine cluster
memberships, the subspace of each cluster must be known. To escape from
this circular dependency, most top-down approaches rely on a rather strict
assumption, which we call the locality assumption. It is assumed that the
subspace of a cluster can be derived from the local neighborhood (in the
full-dimensional data space) of the cluster center or the cluster members.
In other words, it is assumed that even in the full-dimensional space, the
subspace of each cluster can be learned from the local neighborhood of
cluster representatives or cluster members. Other top-down approaches
that do not rely on the locality assumption use random sampling in order
to generate a set of potential cluster members. Because of the way they
work, top-down approaches are also referred to as projective clustering. Projective
clustering is an efficient way of dealing with high-dimensional clustering
problems. Explicitly or implicitly, projective clustering algorithms assume
the following definition: Given a data set D of n-dimensional data objects,
a projected cluster is defined as a pair (C_k, S_k), where C_k is a subset of data
objects and S_k is a subset of attributes such that the data objects in C_k are
projected along each attribute in S_k onto a small range of values, compared
to the range of values of the whole data set in S_k, and the data objects in C_k
are uniformly distributed along every other attribute not in S_k. The task
of projective clustering is to search for and report all projected clusters in the
search space.
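The following sketch checks this definition for a candidate pair (C_k, S_k) under a deliberately simplified criterion (only the small-range condition along the attributes of S_k is tested, not the uniformity along the remaining attributes); all names and the threshold are illustrative, not part of any published algorithm.

```python
import numpy as np

def is_projected_cluster(data, members, attrs, max_ratio=0.2):
    """Check the small-range condition for the candidate pair (C_k, S_k).

    data      -- full data set, array of shape (n, d)
    members   -- indices of the candidate cluster C_k
    attrs     -- attribute indices of the candidate subspace S_k
    max_ratio -- how small the members' range must be, relative to the
                 range of the whole data set along the same attribute
    """
    cluster = data[np.asarray(members)]
    for a in attrs:
        full_range = data[:, a].max() - data[:, a].min()
        cluster_range = cluster[:, a].max() - cluster[:, a].min()
        # Along every attribute of S_k the members must project onto a
        # small range of values compared to the whole data set.
        if cluster_range > max_ratio * full_range:
            return False
    return True

# Toy usage: 300 objects; the first 50 are tight along attributes 0 and 2 only.
rng = np.random.default_rng(3)
data = rng.random((300, 5))
data[:50, [0, 2]] = 0.5 + 0.02 * rng.random((50, 2))
print(is_projected_cluster(data, range(50), [0, 2]))   # True
print(is_projected_cluster(data, range(50), [0, 1]))   # False: attribute 1 is spread out
```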