PROCLUS [9] is one of the classical projective clustering algorithms.
It discovers groups of data objects that are located closely along each of the
relevant dimensions of the associated subspace. In such cases, the data objects
spread along certain directions that are parallel to the original data axes.
ORCLUS [10] aims to detect arbitrarily oriented subspaces formed by any
set of orthogonal vectors. EPCH [38] focuses on uncovering projective
clusters with varying dimensionality, without requiring users to input
the expected average dimensionality l of the associated subspace or the
number of clusters K that inherently exists in the data set. A d-dimensional
histogram with equal-width bins is used to capture the dense units
and their locations in the d-dimensional space. A compression structure is
used to store these dense units and their locations. Finally, a search method
is used to merge similar and adjacent dense units and form subspace
clusters. P3C [44] can effectively discover projective clusters in the data
while minimizing the number of required parameters. P3C also does not
need the number of projective clusters as input and can discover the true
number of clusters. P3C consists of three steps. First, regions
corresponding to the clusters on each attribute are discovered. Second, a
cluster core structure, described as a combination of the detected regions, is
designed to capture the dense areas in the high-dimensional space. Third,
cluster cores are refined into projective clusters, outliers are identified,
and the relevant attributes for each cluster are determined. STATPC [43]
uses a varying-width hyper-rectangle structure to find the dense areas
embedded in the high-dimensional space. By using a spatial statistical
method, all dense hyper-rectangles are found. A heuristic search process
is then run to merge these dense hyper-rectangles and generate the clustering
result. The clusters of projective clustering are defined as the dense
areas in the corresponding subsets of attributes. In projective clustering, it is
common to first use a hyper-rectangle structure to find the dense
areas in the d-dimensional space, and then run a search method
that merges these hyper-rectangles to generate clusters. Because the dense
areas are captured by the hyper-rectangle structure, it is important to define
this structure before clustering. There are two kinds of hyper-rectangle
structures used in projective clustering—the equal-width hyper-rectangle
structure and the varying-width hyper-rectangle structure. For the equal-width
hyper-rectangle structure, each dimension is divided into equal-width
intervals, and the hyper-rectangles are constructed from these intervals;
for instance, the d-dimensional equal-width histogram is the first step in the
construction of the hyper-rectangle structure in EPCH.
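The equal-width construction can be illustrated with a short sketch. The code below is a minimal illustration of the idea, not the EPCH implementation; the number of bins per dimension and the density threshold are assumed parameters. It bins each dimension into equal-width intervals and reports the grid cells whose point counts exceed the threshold, i.e., the dense units that a projective clustering method would subsequently merge.

```python
import numpy as np
from collections import Counter

def dense_units(data, bins_per_dim=10, density_threshold=5):
    """Locate dense cells of an equal-width d-dimensional histogram.

    data: (n, d) array; bins_per_dim and density_threshold are
    illustrative parameters, not values prescribed by EPCH.
    """
    lo, hi = data.min(axis=0), data.max(axis=0)
    width = (hi - lo) / bins_per_dim
    width[width == 0] = 1.0                      # guard against constant dimensions
    # Map every point to the index of its grid cell.
    cells = np.floor((data - lo) / width).astype(int)
    cells = np.clip(cells, 0, bins_per_dim - 1)  # points on the upper boundary
    counts = Counter(map(tuple, cells))
    # Keep only cells that hold at least density_threshold points.
    return {cell: c for cell, c in counts.items() if c >= density_threshold}

# Example: two Gaussian blobs in 3-D show up as two groups of dense cells.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (50, 3)), rng.normal(1, 0.1, (50, 3))])
print(dense_units(X))
```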
4.4.3 Other Methods
Hybrid clustering algorithms belong to neither the bottom-up nor the top-down
approaches. Algorithms that aim neither at uniquely assigning each data
point to a cluster nor at finding all clusters in all subspaces are called hybrid
algorithms. Some hybrid algorithms offer the user the optional functionality
of a pure projected clustering algorithm. Others aim at computing only
the subspaces of potential interest rather than the final clusters. Usually,
hybrid methods that report clusters allow overlapping clusters, but do
not aim at computing all clusters in all subspaces. DOC [49] uses a global
density threshold to define a subspace cluster by means of hypercubes of
fixed side-length w containing at least α points. A random search algorithm
is proposed to compute such subspace clusters from a starting seed of
sampled points. A third parameter β specifies the balance between the
number of points and the dimensionality of a cluster. This parameter affects
the dimensionality of the resulting clusters, and thus DOC usually also has
problems with subspace clusters of significantly different dimensionality.
Due to the very simple clustering model, the clusters may contain additional
noise points (if w is too large) or miss points that naturally belong to the
cluster (if w is too small). One run of DOC may (with a certain probability)
find one subspace cluster. If k clusters need to be identified, DOC has to
be applied at least k times. If the points assigned to the clusters found so
far are excluded from subsequent runs, DOC can be considered a pure
projected clustering algorithm because each point is uniquely assigned
to one cluster or to noise (if not assigned to a cluster). On the other hand,
if the cluster points are not excluded from subsequent runs, the resulting
clusters of multiple runs may overlap. Usually, DOC cannot produce all
clusters in all subspaces. MINECLUS [60] is based on a similar idea as
DOC, but proposes a deterministic method to find an optimal projected
cluster, given a sample seed point. The authors transform the problem
into a frequent item set mining problem and employ a modified frequent
pattern tree growth method. Further heuristics are introduced to enhance
efficiency and accuracy.
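A small sketch can make the DOC-style search concrete. The code below is only an illustration of the randomized idea described above, not the published DOC procedure: it repeatedly samples a seed point and a small discriminating set, keeps the dimensions on which the discriminating set stays within w of the seed, and scores the resulting hypercube with a quality function of the form points × (1/β)^dimensions. The number of trials, the size of the discriminating set, and the quality function are assumptions of this sketch.

```python
import numpy as np

def doc_like_search(data, w=0.2, alpha=10, beta=0.25, trials=200, seed=0):
    """Randomized search for one axis-parallel projected cluster
    (a simplified sketch in the spirit of DOC, not the original algorithm)."""
    rng = np.random.default_rng(seed)
    n, d = data.shape
    best = None
    for _ in range(trials):
        p = data[rng.integers(n)]                      # random seed point
        disc = data[rng.choice(n, size=min(5, n), replace=False)]
        # Relevant dimensions: the discriminating set lies within w of the seed.
        dims = np.where(np.all(np.abs(disc - p) <= w, axis=0))[0]
        if dims.size == 0:
            continue
        # Members: points inside the hypercube of side 2w on those dimensions.
        inside = np.all(np.abs(data[:, dims] - p[dims]) <= w, axis=1)
        members = np.where(inside)[0]
        if members.size < alpha:
            continue
        quality = members.size * (1.0 / beta) ** dims.size
        if best is None or quality > best[0]:
            best = (quality, members, dims)
    return best  # (quality, member indices, relevant dimensions) or None
```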
DiSH [6] follows a similar idea as PreDeCon but uses a hierarchical
clustering model. This way, hierarchies of subspace clusters can be
discovered, that is, the information that a lower-dimensional cluster is
embedded within a higher-dimensional one. The distance between points
and clusters reflects the dimensionality of the subspace that is spanned by
combining the corresponding subspaces of the clusters. As in COSA, the
weighting of attributes is learned for each object, not for entire clusters.
The learning of weights, however, is based on single attributes, not on the
entire feature space. DiSH uses an algorithm that is inspired by the density-based
hierarchical clustering algorithm OPTICS. However, DiSH extends
the cluster ordering computed by OPTICS in order to find hierarchies of
subspace clusters with multiple inclusions (a lower-dimensional subspace
cluster may be embedded in multiple higher-dimensional subspace
clusters). SCHISM [51] mines interesting subspaces rather than subspace
clusters; hence, it is not exactly a subspace clustering algorithm, but it solves
a related problem: finding subspaces in which to look for clusters. It employs a
grid-like discretization of the database and applies a depth-first search with
backtracking to find maximally interesting subspaces. FIRES [47] first computes
one-dimensional clusters using any clustering technique the user
prefers. These one-dimensional clusters are then
merged by applying a "clustering of clusters." The similarity of clusters
is defined by the number of intersecting points. The resulting clusters
represent hyper-rectangular approximations of the true subspace clusters.
In an optional postprocessing step, these approximations can be refined
by again applying any clustering algorithm to the points included in the
approximation, projected onto the corresponding subspace. Though using
a bottom-up search strategy, FIRES is rather efficient because it does not
employ a worst-case exhaustive search procedure but a heuristic that is
linear in the dimensionality of the data space. However, this performance
boost is paid for by an expected loss of clustering accuracy. It cannot be
specified whether the subspace clusters produced by FIRES may overlap or
not. In general, the clusters may overlap, but usually FIRES cannot produce
all clusters in all subspaces.
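The "clustering of clusters" step can be illustrated with a short sketch. The code below is a greatly simplified illustration of the intersection-based merging idea, not the FIRES algorithm itself: each one-dimensional base cluster is represented by the set of point ids it contains, and base clusters whose overlap exceeds an assumed threshold are greedily merged into candidate subspace clusters. The threshold and the greedy strategy are assumptions of the sketch.

```python
def merge_base_clusters(base_clusters, min_overlap=0.5):
    """Greedily merge 1-D base clusters by point overlap.

    base_clusters: dict mapping (dimension, cluster_id) -> set of point ids.
    min_overlap is an assumed overlap threshold, not a FIRES parameter.
    Returns a list of (set of dimensions, set of shared point ids).
    """
    merged = []
    for key, points in base_clusters.items():
        placed = False
        for dims, shared in merged:
            overlap = len(points & shared) / min(len(points), len(shared))
            if overlap >= min_overlap:
                dims.add(key[0])            # extend the subspace
                shared &= points            # keep only jointly covered points
                placed = True
                break
        if not placed:
            merged.append(({key[0]}, set(points)))
    return merged

# Example: the clusters on dimensions 0 and 2 share most of their points.
base = {(0, 'a'): {1, 2, 3, 4}, (2, 'b'): {2, 3, 4, 5}, (1, 'c'): {9, 10}}
print(merge_base_clusters(base))
```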
4.5 Constraint-based Clustering Algorithm
In computer science, constrained clustering is a class of semi-supervised
learning algorithms [36]. Typically, constrained clustering incorporates
a set of must-link constraints, a set of cannot-link constraints, or both, into a
data clustering algorithm. Both must-link and cannot-link constraints
define a relationship between two data instances. A must-link constraint is
used to specify that the two instances in the must-link relation should be
associated with the same cluster. A cannot-link constraint is used to specify
that the two instances in the cannot-link relation should not be associated
with the same cluster. These sets of constraints act as a guide: a
constrained clustering algorithm will attempt to find clusters in the data set
which satisfy the specified must-link and cannot-link constraints. Some
constrained clustering algorithms will abort if no clustering exists
which satisfies the specified constraints. Others will try to minimize the
amount of constraint violation should it be impossible to find a clustering
which satisfies the constraints.
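As a concrete illustration of these two constraint types, the sketch below (a minimal example, not drawn from any of the cited algorithms) checks which must-link and cannot-link pairs a given cluster assignment violates.

```python
def violated_constraints(labels, must_link, cannot_link):
    """Return the constraint pairs that a clustering violates.

    labels: dict mapping instance id -> cluster id.
    must_link / cannot_link: iterables of (instance_a, instance_b) pairs.
    """
    bad_must = [(a, b) for a, b in must_link if labels[a] != labels[b]]
    bad_cannot = [(a, b) for a, b in cannot_link if labels[a] == labels[b]]
    return bad_must, bad_cannot

# Example: instances 1 and 2 must be together, 1 and 3 must be apart.
labels = {1: 'A', 2: 'B', 3: 'A'}
print(violated_constraints(labels, [(1, 2)], [(1, 3)]))
# -> ([(1, 2)], [(1, 3)])  both constraints are violated by this clustering
```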
4.5.1 COP K-means
In the context of partitioning algorithms, instance-level constraints are a
useful way to express prior knowledge about which instances should or
should not be grouped together. Consequently, we consider two types of
pair-wise constraints:
Must-link constraints specify that two instances have to be in the same
cluster.
Cannot-link constraints specify that two instances must not be placed
in the same cluster.
The must-link constraints define a transitive binary relation over the
instances. Consequently, when making use of a set of constraints (of both
kinds), we take the transitive closure over the constraints. The full set of
derived constraints is then presented to the clustering algorithm. In general,
constraints may be derived from partially labeled data or from background
knowledge about the domain or data set.
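The transitive closure of the must-link relation can be computed with a union-find structure, as in the short sketch below (an illustrative implementation, not the one used in the cited work): instances connected by chains of must-link pairs end up in the same component, and each cannot-link pair is then propagated to every pair of instances drawn from the two components containing its endpoints.

```python
from itertools import combinations, product

def derive_constraints(must_link, cannot_link):
    """Transitive closure of must-link pairs plus the entailed cannot-links
    (a sketch; the representation details are assumptions of this example)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for a, b in must_link:
        union(a, b)

    # Group instances into must-link components.
    comps = {}
    for x in list(parent):
        comps.setdefault(find(x), set()).add(x)

    closed_must = {p for c in comps.values() for p in combinations(sorted(c), 2)}
    closed_cannot = set()
    for a, b in cannot_link:
        ca = comps.get(find(a), {a})
        cb = comps.get(find(b), {b})
        closed_cannot.update(product(ca, cb))
    return closed_must, closed_cannot

# (1,2) and (2,3) imply (1,3); the cannot-link (3,4) spreads to 1 and 2 as well.
print(derive_constraints([(1, 2), (2, 3)], [(3, 4)]))
```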
Algorithm 4.11 gives the framework of the COP K-means algorithm. The
major modification is that, when updating cluster assignments, we ensure
that none of the specified constraints are violated. We attempt to assign
each point d_i to its closest cluster C_j. This will succeed unless a constraint
would be violated. If there is another point d_= that must be assigned to
the same cluster as d_i, but that is already in some other cluster, or there is
another point d_≠ that cannot be grouped with d_i but is already in C_j, then d_i
cannot be placed in C_j. We continue down the sorted list of clusters until
we find one that can legally host d_i. Constraints are never broken; if a legal
cluster cannot be found for d_i, the empty partition is returned. An interactive
demo of this algorithm can be found at http://www.cs.cornell.edu/home/
wkiri/cop-kmeans/.
4.5.2 MPCK-means
Given a set of data objects D, a set of must-link constraints M, a set of
cannot-link constraints C, corresponding cost sets W and W̄, and the desired
number of clusters K, MPCK-means finds a disjoint K-partitioning {C_k}_{k=1}^{K}
of D (with each cluster having a centroid µ_k and a local weight matrix A_k)
such that the objective function is (locally) minimized [17]. The algorithm
integrates the use of constraints and metric learning. Constraints are utilized
during cluster initialization and when assigning points to clusters, and the
distance metric is adapted by re-estimating the weight matrices A_k during
each iteration based on the current cluster assignments and constraint
violations. Algorithm 4.12 gives the pseudocode of MPCK-means.
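To give a flavor of how the learned metric and the constraint costs interact, the following sketch evaluates a penalized objective of the kind described above: a per-cluster distance ||x − µ_k||²_{A_k} under the weight matrix A_k, plus costs for violated must-link and cannot-link pairs. It is a simplified illustration under assumed uniform penalty terms, not the exact objective of MPCK-means in [17].

```python
import numpy as np

def penalized_objective(X, labels, centroids, A, must_link, cannot_link,
                        w_must=1.0, w_cannot=1.0):
    """Sketch of a constraint-penalized clustering objective.

    X: (n, d) data; labels: (n,) cluster indices; centroids: (K, d);
    A: list of K (d, d) positive-definite weight matrices.
    w_must / w_cannot are assumed uniform violation costs.
    """
    def dist2(x, mu, Ak):
        diff = x - mu
        return float(diff @ Ak @ diff)          # squared distance under A_k

    # Distortion term: each point is measured with its own cluster's metric.
    total = sum(dist2(X[i], centroids[labels[i]], A[labels[i]])
                for i in range(len(X)))
    # Penalty for must-link pairs split across clusters.
    total += sum(w_must for a, b in must_link if labels[a] != labels[b])
    # Penalty for cannot-link pairs placed in the same cluster.
    total += sum(w_cannot for a, b in cannot_link if labels[a] == labels[b])
    return total
```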
Algorithm 4.11: COP-K-means
Input: data set D, must-link constraints Con_= ⊆ D × D, cannot-link constraints
Con_≠ ⊆ D × D
Output: Clustering result
(1) Let C_1, ..., C_k be the initial cluster centers.
(2) For each point d_i in D, assign it to the closest cluster C_j such that
violate-constraints(d_i, C_j, Con_=, Con_≠) is false. If no such cluster exists, fail
(return {}).
(3) For each cluster C_i, update its center by averaging all of the points d_j that have
been assigned to it.
(4) Iterate between (2) and (3) until convergence.
(5) Return {C_1, ..., C_k}.
violate-constraints(data point d, cluster C, must-link constraints Con_= ⊆ D × D,
cannot-link constraints Con_≠ ⊆ D × D)
(1) For each (d, d_=) ∈ Con_=: if d_= ∉ C, return true.
(2) For each (d, d_≠) ∈ Con_≠: if d_≠ ∈ C, return true.
(3) Otherwise, return false.
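The pseudocode above translates directly into a runnable sketch. The following implementation is a minimal, illustrative version: the random initialization, Euclidean distance, and plain convergence check are assumptions of the sketch rather than requirements of the original algorithm.

```python
import numpy as np

def cop_kmeans(X, k, must_link, cannot_link, max_iter=100, seed=0):
    """Minimal COP-K-means sketch: returns labels, or None if infeasible."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)

    def violates(i, c, labels):
        # Must-link partner already placed in a different cluster?
        for a, b in must_link:
            if a == i and labels[b] not in (-1, c): return True
            if b == i and labels[a] not in (-1, c): return True
        # Cannot-link partner already placed in this cluster?
        for a, b in cannot_link:
            if a == i and labels[b] == c: return True
            if b == i and labels[a] == c: return True
        return False

    for _ in range(max_iter):
        labels[:] = -1
        for i in rng.permutation(len(X)):
            # Try clusters from nearest to farthest until one is legal.
            order = np.argsort(((X[i] - centers) ** 2).sum(axis=1))
            choice = next((c for c in order if not violates(i, c, labels)), None)
            if choice is None:
                return None                    # no legal assignment: fail
            labels[i] = choice
        new_centers = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                                else centers[c] for c in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels
```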
4.5.3 AFCC
AFCC is based on an iterative reallocation that partitions a data set into
an optimal number of clusters by locally minimizing the sum of intra-cluster
distances while respecting as many of the provided constraints as possible.
AFCC alternates between a membership-updating step and a
centroid-estimation step, while actively generating new candidate pairs for
constraints at each iteration [28]. After the initialization step, we continue
by computing α, the factor that will ensure a balanced influence of
the constrained data and the unlabeled patterns, and then β, the factor that will
determine which term of the membership-updating equation will dominate.
Afterwards, the memberships are updated. In the second step, based on
the cardinalities of the different clusters, spurious clusters are discarded,
thus obtaining the centroids of the good clusters. At this point a data partition
is available; AFCC will then try to identify the least well-defined cluster and
actively select good candidates for generating maximally informative
constraints. As the distance d(x_i, µ_k) between a data item x_i and a cluster
centroid µ_k, one can use either the ordinary Euclidean distance, when the
clusters are assumed to be spherical, or the Mahalanobis distance, when they
are assumed to be elliptical:

d²(x_i, µ_k) = |C_k|^{1/n} (x_i − µ_k)^T C_k^{−1} (x_i − µ_k),

where n is the dimension of the space considered and C_k is the covariance
matrix of cluster k:

C_k = ( Σ_{i=1}^{M} u_{ik}² (x_i − µ_k)(x_i − µ_k)^T ) / ( Σ_{i=1}^{M} u_{ik}² ).
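The adaptive distance above is easy to compute once the memberships are known. The sketch below illustrates the two formulas only, not the AFCC algorithm: it estimates the fuzzy covariance matrix of a cluster from the membership degrees u_{ik} and then evaluates the Mahalanobis-type distance d²(x_i, µ_k) = |C_k|^{1/n} (x_i − µ_k)^T C_k^{−1} (x_i − µ_k).

```python
import numpy as np

def fuzzy_covariance(X, u_k, mu_k):
    """C_k = sum_i u_ik^2 (x_i - mu_k)(x_i - mu_k)^T / sum_i u_ik^2."""
    diff = X - mu_k                 # (M, n) deviations from the centroid
    w = u_k ** 2                    # squared membership degrees
    return (diff.T * w) @ diff / w.sum()

def mahalanobis_sq(x, mu_k, C_k):
    """d^2(x, mu_k) = |C_k|^{1/n} (x - mu_k)^T C_k^{-1} (x - mu_k)."""
    n = len(x)
    diff = x - mu_k
    return float(np.linalg.det(C_k) ** (1.0 / n)
                 * diff @ np.linalg.solve(C_k, diff))

# Example with random membership degrees for one cluster.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
u_k = rng.random(50)
mu_k = (u_k[:, None] ** 2 * X).sum(axis=0) / (u_k ** 2).sum()
C_k = fuzzy_covariance(X, u_k, mu_k)
print(mahalanobis_sq(X[0], mu_k, C_k))
```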