PROCLUS [9] is one of the classical projective clustering algorithms.
It discovers groups of data objects that are located closely along each of the
relevant dimensions of the associated subspace. In such cases, the data objects
spread along certain directions that are parallel to the original data axes.
ORCLUS [10] aims to detect arbitrarily oriented subspaces formed by any
set of orthogonal vectors. EPCH [38] focuses on uncovering projective
clusters with varying dimensionality, without requiring users to input
the expected average dimensionality l of the associated subspace or the
number of clusters K that inherently exists in the data set. A d-dimensional
histogram with equal-width bins is used to capture the dense units
and their locations in the d-dimensional space. A compression structure is
used to store these dense units and their locations. Finally, a search method
is used to merge similar and adjacent dense units and form subspace
clusters. P3C [44] can effectively discover projective clusters in the data
while minimizing the number of required parameters. P3C also does not
need the number of projective clusters as input and can discover the true
number of clusters. P3C consists of three steps. First, regions
corresponding to the clusters on each attribute are discovered. Second, a
cluster core structure, described as a combination of the detected regions, is
designed to capture the dense areas in the high-dimensional space. Third,
cluster cores are refined into projective clusters, outliers are identified,
and the relevant attributes for each cluster are determined. STATPC [43]
uses a varying-width hyper-rectangle structure to find the dense areas
embedded in the high-dimensional space. By using a spatial statistical
method, all dense hyper-rectangles are found. A heuristic search process
is then run to merge these dense hyper-rectangles and generate the clustering
result. The clusters of projective clustering are defined as the dense
areas in the corresponding subsets of attributes. In projective clustering, it is
common to first use a hyper-rectangle structure to find the dense
areas in the d-dimensional space, and then run a search method
that merges these hyper-rectangles to generate clusters. Because the dense
areas are captured by the hyper-rectangle structure, it is important to define
this structure before clustering. There are two kinds of hyper-rectangle
structures used in projective clustering—the equal-width hyper-rectangle
structure and the varying-width hyper-rectangle structure. For the equal-width
hyper-rectangle structure, each dimension is divided into equal-width
intervals, and the hyper-rectangles are constructed from these intervals;
for instance, the d-dimensional equal-width histogram is the first step in the
construction of the hyper-rectangle structure in EPCH.
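The equal-width construction can be illustrated with a short sketch. The code below is a minimal illustration of the idea, not the EPCH implementation; the number of bins per dimension and the density threshold are assumed parameters. It bins each dimension into equal-width intervals and reports the grid cells whose point counts exceed the threshold, i.e., the dense units that a projective clustering method would subsequently merge.

```python
import numpy as np
from collections import Counter

def dense_units(data, bins_per_dim=10, density_threshold=5):
    """Locate dense cells of an equal-width d-dimensional histogram.

    data: (n, d) array; bins_per_dim and density_threshold are
    illustrative parameters, not values prescribed by EPCH.
    """
    lo, hi = data.min(axis=0), data.max(axis=0)
    width = (hi - lo) / bins_per_dim
    width[width == 0] = 1.0                      # guard against constant dimensions
    # Map every point to the index of its grid cell.
    cells = np.floor((data - lo) / width).astype(int)
    cells = np.clip(cells, 0, bins_per_dim - 1)  # points on the upper boundary
    counts = Counter(map(tuple, cells))
    # Keep only cells that hold at least density_threshold points.
    return {cell: c for cell, c in counts.items() if c >= density_threshold}

# Example: two Gaussian blobs in 3-D show up as two groups of dense cells.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (50, 3)), rng.normal(1, 0.1, (50, 3))])
print(dense_units(X))
```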
4.4.3 Other Methods
Hybrid clustering algorithms belong to neither the bottom-up nor the top-down
approaches. Algorithms that aim neither at uniquely assigning each data
point to a cluster nor at finding all clusters in all subspaces are called hybrid
algorithms. Some hybrid algorithms offer the user the optional functionality
of a pure projected clustering algorithm. Others aim at computing only
the subspaces of potential interest rather than the final clusters. Usually,
hybrid methods that report clusters allow overlapping clusters, but do
not aim at computing all clusters in all subspaces. DOC [49] uses a global
density threshold to define a subspace cluster by means of hypercubes of
fixed side-length w containing at least α points. A random search algorithm
is proposed to compute such subspace clusters from a starting seed of
sampled points. A third parameter β specifies the balance between the
number of points and the dimensionality of a cluster. This parameter affects
the dimensionality of the resulting clusters, and thus DOC usually also has
problems with subspace clusters of significantly different dimensionality.
Due to the very simple clustering model, the clusters may contain additional
noise points (if w is too large) or miss points that naturally belong to the
cluster (if w is too small). One run of DOC may (with a certain probability)
find one subspace cluster. If k clusters need to be identified, DOC has to
be applied at least k times. If the points assigned to the clusters found so
far are excluded from subsequent runs, DOC can be considered a pure
projected clustering algorithm because each point is uniquely assigned
to one cluster or to noise (if not assigned to a cluster). On the other hand,
if the cluster points are not excluded from subsequent runs, the resulting
clusters of multiple runs may overlap. Usually, DOC cannot produce all
clusters in all subspaces. MINECLUS [60] is based on a similar idea as
DOC, but proposes a deterministic method to find an optimal projected
cluster, given a sample seed point. The authors transform the problem
into a frequent item set mining problem and employ a modified frequent
pattern tree growth method. Further heuristics are introduced to enhance
efficiency and accuracy.
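A small sketch can make the DOC-style search concrete. The code below is only an illustration of the randomized idea described above, not the published DOC procedure: it repeatedly samples a seed point and a small discriminating set, keeps the dimensions on which the discriminating set stays within w of the seed, and scores the resulting hypercube with a quality function of the form points × (1/β)^dimensions. The number of trials, the size of the discriminating set, and the quality function are assumptions of this sketch.

```python
import numpy as np

def doc_like_search(data, w=0.2, alpha=10, beta=0.25, trials=200, seed=0):
    """Randomized search for one axis-parallel projected cluster
    (a simplified sketch in the spirit of DOC, not the original algorithm)."""
    rng = np.random.default_rng(seed)
    n, d = data.shape
    best = None
    for _ in range(trials):
        p = data[rng.integers(n)]                      # random seed point
        disc = data[rng.choice(n, size=min(5, n), replace=False)]
        # Relevant dimensions: the discriminating set lies within w of the seed.
        dims = np.where(np.all(np.abs(disc - p) <= w, axis=0))[0]
        if dims.size == 0:
            continue
        # Members: points inside the hypercube of side 2w on those dimensions.
        inside = np.all(np.abs(data[:, dims] - p[dims]) <= w, axis=1)
        members = np.where(inside)[0]
        if members.size < alpha:
            continue
        quality = members.size * (1.0 / beta) ** dims.size
        if best is None or quality > best[0]:
            best = (quality, members, dims)
    return best  # (quality, member indices, relevant dimensions) or None
```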
DiSH [6] follows a similar idea as PreDeCon but uses a hierarchical
clustering model. This way, hierarchies of subspace clusters can be
discovered, that is, the information that a lower-dimensional cluster is
embedded within a higher-dimensional one. The distance between points
and clusters reflects the dimensionality of the subspace that is spanned by
combining the corresponding subspaces of the clusters. As in COSA, the
weighting of attributes is learned for each object, not for entire clusters.
The learning of weights, however, is based on single attributes, not on the
entire feature space. DiSH uses an algorithm that is inspired by the density-based
hierarchical clustering algorithm OPTICS. However, DiSH extends
the cluster ordering computed by OPTICS in order to find hierarchies of
subspace clusters with multiple inclusions (a lower-dimensional subspace
cluster may be embedded in multiple higher-dimensional subspace
clusters). SCHISM [51] mines interesting subspaces rather than subspace
clusters; hence, it is not exactly a subspace clustering algorithm, but it solves
a related problem: finding subspaces in which to look for clusters. It employs a
grid-like discretization of the database and applies a depth-first search with
backtracking to find maximally interesting subspaces. FIRES [47] first computes
one-dimensional clusters using any clustering technique the user
prefers. These one-dimensional clusters are then
merged by applying a "clustering of clusters." The similarity of clusters
is defined by the number of intersecting points. The resulting clusters
represent hyper-rectangular approximations of the true subspace clusters.
In an optional postprocessing step, these approximations can be refined
by again applying any clustering algorithm to the points included in the
approximation, projected onto the corresponding subspace. Though using
a bottom-up search strategy, FIRES is rather efficient because it does not
employ a worst-case exhaustive search procedure but a heuristic that is
linear in the dimensionality of the data space. However, this performance
boost is paid for by an expected loss of clustering accuracy. It cannot be
specified whether the subspace clusters produced by FIRES may overlap or
not. In general, the clusters may overlap, but usually FIRES cannot produce
all clusters in all subspaces.
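The "clustering of clusters" step can be illustrated with a short sketch. The code below is a greatly simplified illustration of the intersection-based merging idea, not the FIRES algorithm itself: each one-dimensional base cluster is represented by the set of point ids it contains, and base clusters whose overlap exceeds an assumed threshold are greedily merged into candidate subspace clusters. The threshold and the greedy strategy are assumptions of the sketch.

```python
def merge_base_clusters(base_clusters, min_overlap=0.5):
    """Greedily merge 1-D base clusters by point overlap.

    base_clusters: dict mapping (dimension, cluster_id) -> set of point ids.
    min_overlap is an assumed overlap threshold, not a FIRES parameter.
    Returns a list of (set of dimensions, set of shared point ids).
    """
    merged = []
    for key, points in base_clusters.items():
        placed = False
        for dims, shared in merged:
            overlap = len(points & shared) / min(len(points), len(shared))
            if overlap >= min_overlap:
                dims.add(key[0])            # extend the subspace
                shared &= points            # keep only jointly covered points
                placed = True
                break
        if not placed:
            merged.append(({key[0]}, set(points)))
    return merged

# Example: the clusters on dimensions 0 and 2 share most of their points.
base = {(0, 'a'): {1, 2, 3, 4}, (2, 'b'): {2, 3, 4, 5}, (1, 'c'): {9, 10}}
print(merge_base_clusters(base))
```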
4.5 Constraint-based Clustering Algorithm
In computer science, constrained clustering is a class of semi-supervised
learning algorithms [36]. Typically, constrained clustering incorporates
a set of must-link constraints, a set of cannot-link constraints, or both, into a
data clustering algorithm. Both must-link and cannot-link constraints
define a relationship between two data instances. A must-link constraint is
used to specify that the two instances in the must-link relation should be
associated with the same cluster. A cannot-link constraint is used to specify
that the two instances in the cannot-link relation should not be associated
with the same cluster. These sets of constraints act as a guide: a
constrained clustering algorithm will attempt to find clusters in the data set
which satisfy the specified must-link and cannot-link constraints. Some
constrained clustering algorithms will abort if no clustering exists
which satisfies the specified constraints. Others will try to minimize the
amount of constraint violation should it be impossible to find a clustering
which satisfies the constraints.
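As a concrete illustration of these two constraint types, the sketch below (a minimal example, not drawn from any of the cited algorithms) checks which must-link and cannot-link pairs a given cluster assignment violates.

```python
def violated_constraints(labels, must_link, cannot_link):
    """Return the constraint pairs that a clustering violates.

    labels: dict mapping instance id -> cluster id.
    must_link / cannot_link: iterables of (instance_a, instance_b) pairs.
    """
    bad_must = [(a, b) for a, b in must_link if labels[a] != labels[b]]
    bad_cannot = [(a, b) for a, b in cannot_link if labels[a] == labels[b]]
    return bad_must, bad_cannot

# Example: instances 1 and 2 must be together, 1 and 3 must be apart.
labels = {1: 'A', 2: 'B', 3: 'A'}
print(violated_constraints(labels, [(1, 2)], [(1, 3)]))
# -> ([(1, 2)], [(1, 3)])  both constraints are violated by this clustering
```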
4.5.1 COP K-means
In the context of partitioning algorithms, instance-level constraints are a
useful way to express prior knowledge about which instances should or
should not be grouped together. Consequently, we consider two types of
pair-wise constraints:
Must-link constraints specify that two instances have to be in the same
cluster.
Cannot-link constraints specify that two instances must not be placed
in the same cluster.
The must-link constraints define a transitive binary relation over the
instances. Consequently, when making use of a set of constraints (of both
kinds), we take the transitive closure over the constraints. The full set of
derived constraints is then presented to the clustering algorithm. In general,
constraints may be derived from partially labeled data or from background
knowledge about the domain or data set.
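The transitive closure of the must-link relation can be computed with a union-find structure, as in the short sketch below (an illustrative implementation, not the one used in the cited work): instances connected by chains of must-link pairs end up in the same component, and each cannot-link pair is then propagated to every pair of instances drawn from the two components containing its endpoints.

```python
from itertools import combinations, product

def derive_constraints(must_link, cannot_link):
    """Transitive closure of must-link pairs plus the entailed cannot-links
    (a sketch; the representation details are assumptions of this example)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for a, b in must_link:
        union(a, b)

    # Group instances into must-link components.
    comps = {}
    for x in list(parent):
        comps.setdefault(find(x), set()).add(x)

    closed_must = {p for c in comps.values() for p in combinations(sorted(c), 2)}
    closed_cannot = set()
    for a, b in cannot_link:
        ca = comps.get(find(a), {a})
        cb = comps.get(find(b), {b})
        closed_cannot.update(product(ca, cb))
    return closed_must, closed_cannot

# (1,2) and (2,3) imply (1,3); the cannot-link (3,4) spreads to 1 and 2 as well.
print(derive_constraints([(1, 2), (2, 3)], [(3, 4)]))
```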
Algorithm 4.11 gives the framework of the COP K-means algorithm. The
major modification is that, when updating cluster assignments, we ensure
that none of the specified constraints are violated. We attempt to assign
each point d_i to its closest cluster C_j. This will succeed unless a constraint
would be violated. If there is another point d_= that must be assigned to
the same cluster as d_i, but that is already in some other cluster, or there is
another point d_≠ that cannot be grouped with d_i but is already in C_j, then d_i
cannot be placed in C_j. We continue down the sorted list of clusters until
we find one that can legally host d_i. Constraints are never broken; if a legal
cluster cannot be found for d_i, the empty partition is returned. An interactive
demo of this algorithm can be found at http://www.cs.cornell.edu/home/
wkiri/cop-kmeans/.
4.5.2 MPCK-means
Given a set of data objects D, a set of must-link constraints M, a set of
cannot-link constraints C, corresponding cost sets W and W̄, and the desired
number of clusters K, MPCK-means finds a disjoint K-partitioning {C_k}_{k=1}^{K}
of D (with each cluster having a centroid µ_k and a local weight matrix A_k)
such that the objective function is (locally) minimized [17]. The algorithm
integrates the use of constraints and metric learning. Constraints are utilized
during cluster initialization and when assigning points to clusters, and the
distance metric is adapted by re-estimating the weight matrices A_k during
each iteration based on the current cluster assignments and constraint
violations. Algorithm 4.12 gives the pseudocode of MPCK-means.
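To give a flavor of how the learned metric and the constraint costs interact, the following sketch evaluates a penalized objective of the kind described above: a per-cluster distance ||x − µ_k||²_{A_k} under the weight matrix A_k, plus costs for violated must-link and cannot-link pairs. It is a simplified illustration under assumed uniform penalty terms, not the exact objective of MPCK-means in [17].

```python
import numpy as np

def penalized_objective(X, labels, centroids, A, must_link, cannot_link,
                        w_must=1.0, w_cannot=1.0):
    """Sketch of a constraint-penalized clustering objective.

    X: (n, d) data; labels: (n,) cluster indices; centroids: (K, d);
    A: list of K (d, d) positive-definite weight matrices.
    w_must / w_cannot are assumed uniform violation costs.
    """
    def dist2(x, mu, Ak):
        diff = x - mu
        return float(diff @ Ak @ diff)          # squared distance under A_k

    # Distortion term: each point is measured with its own cluster's metric.
    total = sum(dist2(X[i], centroids[labels[i]], A[labels[i]])
                for i in range(len(X)))
    # Penalty for must-link pairs split across clusters.
    total += sum(w_must for a, b in must_link if labels[a] != labels[b])
    # Penalty for cannot-link pairs placed in the same cluster.
    total += sum(w_cannot for a, b in cannot_link if labels[a] == labels[b])
    return total
```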
Algorithm 4.11: COP-K-means
Input: data set D, must-link constraints Con_= ⊆ D × D, cannot-link constraints
Con_≠ ⊆ D × D
Output: Clustering result
(1) Let C_1, ..., C_k be the initial cluster centers.
(2) For each point d_i in D, assign it to the closest cluster C_j such that
violate-constraints(d_i, C_j, Con_=, Con_≠) is false. If no such cluster exists, fail
(return {}).
(3) For each cluster C_i, update its center by averaging all of the points d_j that have
been assigned to it.
(4) Iterate between (2) and (3) until convergence.
(5) Return {C_1, ..., C_k}.
violate-constraints(data point d, cluster C, must-link constraints Con_= ⊆ D × D,
cannot-link constraints Con_≠ ⊆ D × D)
(1) For each (d, d_=) ∈ Con_=: if d_= ∉ C, return true.
(2) For each (d, d_≠) ∈ Con_≠: if d_≠ ∈ C, return true.
(3) Otherwise, return false.
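The pseudocode above translates directly into a runnable sketch. The following implementation is a minimal, illustrative version: the random initialization, Euclidean distance, and plain convergence check are assumptions of the sketch rather than requirements of the original algorithm.

```python
import numpy as np

def cop_kmeans(X, k, must_link, cannot_link, max_iter=100, seed=0):
    """Minimal COP-K-means sketch: returns labels, or None if infeasible."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)

    def violates(i, c, labels):
        # Must-link partner already placed in a different cluster?
        for a, b in must_link:
            if a == i and labels[b] not in (-1, c): return True
            if b == i and labels[a] not in (-1, c): return True
        # Cannot-link partner already placed in this cluster?
        for a, b in cannot_link:
            if a == i and labels[b] == c: return True
            if b == i and labels[a] == c: return True
        return False

    for _ in range(max_iter):
        labels[:] = -1
        for i in rng.permutation(len(X)):
            # Try clusters from nearest to farthest until one is legal.
            order = np.argsort(((X[i] - centers) ** 2).sum(axis=1))
            choice = next((c for c in order if not violates(i, c, labels)), None)
            if choice is None:
                return None                    # no legal assignment: fail
            labels[i] = choice
        new_centers = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                                else centers[c] for c in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels
```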
4.5.3 AFCC
AFCC is based on an iterative reallocation that partitions a data set into
an optimal number of clusters by locally minimizing the sum of intra-cluster
distances while respecting as many of the provided constraints as possible.
AFCC alternates between a membership-updating step and a
centroid-estimation step, while actively generating new candidate pairs for
constraints at each iteration [28]. After the initialization step, we continue
by computing α, the factor that will ensure a balanced influence of
the constrained data and the unlabeled patterns, and then β, the factor that will
determine which term of the membership-updating equation will dominate.
Afterwards, the memberships are updated. In the second step, based on
the cardinalities of the different clusters, spurious clusters are discarded,
thus obtaining the centroids of the good clusters. At this point a data partition
is available; AFCC will then try to identify the least well-defined cluster and
actively select good candidates for generating maximally informative
constraints. As the distance d(x_i, µ_k) between a data item x_i and a cluster
centroid µ_k, one can use either the ordinary Euclidean distance, when the
clusters are assumed to be spherical, or the Mahalanobis distance, when they
are assumed to be elliptical:

d²(x_i, µ_k) = |C_k|^{1/n} (x_i − µ_k)^T C_k^{−1} (x_i − µ_k),

where n is the dimension of the space considered and C_k is the covariance
matrix of cluster k:

C_k = ( Σ_{i=1}^{M} u_{ik}² (x_i − µ_k)(x_i − µ_k)^T ) / ( Σ_{i=1}^{M} u_{ik}² ).
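The adaptive distance above is easy to compute once the memberships are known. The sketch below illustrates the two formulas only, not the AFCC algorithm: it estimates the fuzzy covariance matrix of a cluster from the membership degrees u_{ik} and then evaluates the Mahalanobis-type distance d²(x_i, µ_k) = |C_k|^{1/n} (x_i − µ_k)^T C_k^{−1} (x_i − µ_k).

```python
import numpy as np

def fuzzy_covariance(X, u_k, mu_k):
    """C_k = sum_i u_ik^2 (x_i - mu_k)(x_i - mu_k)^T / sum_i u_ik^2."""
    diff = X - mu_k                 # (M, n) deviations from the centroid
    w = u_k ** 2                    # squared membership degrees
    return (diff.T * w) @ diff / w.sum()

def mahalanobis_sq(x, mu_k, C_k):
    """d^2(x, mu_k) = |C_k|^{1/n} (x - mu_k)^T C_k^{-1} (x - mu_k)."""
    n = len(x)
    diff = x - mu_k
    return float(np.linalg.det(C_k) ** (1.0 / n)
                 * diff @ np.linalg.solve(C_k, diff))

# Example with random membership degrees for one cluster.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
u_k = rng.random(50)
mu_k = (u_k[:, None] ** 2 * X).sum(axis=0) / (u_k ** 2).sum()
C_k = fuzzy_covariance(X, u_k, mu_k)
print(mahalanobis_sq(X[0], mu_k, C_k))
```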