(1) Start by assigning each item to its own cluster, so that if you have m
items, you now have m clusters, each containing just one item. Let
the distances (similarities) between the clusters equal the distances
(similarities) between the items they contain.
(2) Find the closest (most similar) pair of clusters and merge them into a
single cluster, so that now you have one less cluster.
(3) Compute distances (similarities) between the new cluster and each of
the old clusters.
(4) Repeat steps 2 and 3 until all items are clustered into a single cluster
of size m. Thus, the agglomerative clustering algorithm results in
a binary cluster tree with single-item clusters as its leaf nodes and a
root node containing all the items. A minimal code sketch of this
procedure is given below.
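The following Python sketch is our own illustration of these four steps (it is not code from the book, and its naive pairwise search is only meant to be readable, not efficient). The linkage argument anticipates the single link, complete link and group average variants described next.

    def agglomerative(dist, linkage=min):
        """Naive agglomerative clustering sketch (illustrative only).

        dist    -- m x m symmetric matrix of pairwise item distances
        linkage -- min (single link), max (complete link), or e.g.
                   statistics.mean (group average)
        Returns the sequence of merges as (cluster_a, cluster_b, distance).
        """
        clusters = [[i] for i in range(len(dist))]   # step 1: one cluster per item
        merges = []
        while len(clusters) > 1:                     # step 4: stop at a single cluster
            best = None
            for a in range(len(clusters)):           # step 2: find the closest pair
                for b in range(a + 1, len(clusters)):
                    d = linkage([dist[i][j] for i in clusters[a] for j in clusters[b]])
                    if best is None or d < best[0]:
                        best = (d, a, b)
            d, a, b = best
            merges.append((list(clusters[a]), list(clusters[b]), d))
            clusters[a] = clusters[a] + clusters[b]  # step 3: merge; cluster distances
            del clusters[b]                          # are recomputed via linkage()
        return merges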
Single Link
For the single link or MIN version of hierarchical clustering, the
proximity of two clusters is defined as the minimum distance (maximum
similarity) between any two objects in the different clusters. Single link is good
at handling non-elliptical shapes, but is sensitive to noise and outliers.
Figure 4.3.2(a) gives a sample similarity matrix for five objects and the
dendrogram shows the series of merges derived by using the single
link clustering method [Fig. 4.3.2(b)].
        x1      x2      x3      x4      x5
x1    1.00    0.92    0.09    0.65    0.21
x2    0.92    1.00    0.72    0.61    0.50
x3    0.09    0.72    1.00    0.45    0.30
x4    0.65    0.61    0.45    1.00    0.84
x5    0.21    0.50    0.30    0.84    1.00

(a) Similarity matrix for five objects    (b) Dendrogram of single link
Figure 4.3.2: An example of Single Link
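As a concrete illustration (ours, not the book's), the merges of Fig. 4.3.2(b) can be reproduced with SciPy by converting the similarities above into distances, e.g. d = 1 - s:

    import numpy as np
    from scipy.cluster.hierarchy import linkage
    from scipy.spatial.distance import squareform

    # Similarity matrix from Figure 4.3.2(a)
    S = np.array([[1.00, 0.92, 0.09, 0.65, 0.21],
                  [0.92, 1.00, 0.72, 0.61, 0.50],
                  [0.09, 0.72, 1.00, 0.45, 0.30],
                  [0.65, 0.61, 0.45, 1.00, 0.84],
                  [0.21, 0.50, 0.30, 0.84, 1.00]])

    D = 1.0 - S                              # turn similarities into distances
    condensed = squareform(D, checks=False)  # condensed form expected by linkage()
    Z = linkage(condensed, method='single')  # 'single' = MIN linkage
    print(Z)                                 # each row: merged clusters, distance, size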
Complete Link Clustering
For the complete link or MAX version of hierarchical clustering, the proximity
of two clusters is defined to be the maximum distance (minimum
similarity) between any two points in the different clusters.
(The technique is called complete link because, if you start with all
points as singleton clusters, and add links between points, strongest
links first, then a group of points is not a cluster until all the points
in it are completely linked.) Complete link is less susceptible to noise
and outliers, but it can break large clusters and tends to favor globular
(convex) shapes. Figure 4.3.3(a) gives a sample similarity
matrix for five objects and the dendrogram shows the series of merges
derived by using the complete link clustering method [Fig. 4.3.3(b)].
        x1      x2      x3      x4      x5
x1    1.00    0.92    0.09    0.65    0.21
x2    0.92    1.00    0.72    0.61    0.50
x3    0.09    0.72    1.00    0.45    0.30
x4    0.65    0.61    0.45    1.00    0.84
x5    0.21    0.50    0.30    0.84    1.00

(a) Similarity matrix for five objects    (b) Dendrogram of complete link
Figure 4.3.3: An example of Complete Link Clustering
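Continuing the SciPy sketch above (which built the condensed distance matrix), the complete link merges of Fig. 4.3.3(b) come from changing only the method argument; the call below is illustrative:

    from scipy.cluster.hierarchy import linkage

    # 'complete' = MAX linkage on the condensed distances built for Fig. 4.3.2
    Z_complete = linkage(condensed, method='complete')
    print(Z_complete)   # compare the merge order with the single link result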
Average Link Clustering
For the group average version of hierarchical clustering, the proximity
of two clusters is defined to be the average of the pairwise proximities
between all pairs of points in the different clusters. Notice that
this is an intermediate approach between MIN and MAX. This is
expressed by the following equation:

\[ \mathrm{similarity}(cluster_1, cluster_2) = \frac{\sum_{p_1 \in cluster_1,\; p_2 \in cluster_2} \mathrm{similarity}(p_1, p_2)}{|cluster_1| \cdot |cluster_2|} \]
Figure 4.3.4 gives a sample similarity matrix and the dendrogram shows
the series of merges that result from using the group average approach.
The hierarchical clustering in this simple case is the same as produced by
MIN.
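As a small illustrative check of this formula (the choice of clusters is ours), the group average similarity between the clusters {x1, x2} and {x4, x5} of Fig. 4.3.2(a) works out as follows:

    import numpy as np

    # Similarity matrix from Figure 4.3.2(a)
    S = np.array([[1.00, 0.92, 0.09, 0.65, 0.21],
                  [0.92, 1.00, 0.72, 0.61, 0.50],
                  [0.09, 0.72, 1.00, 0.45, 0.30],
                  [0.65, 0.61, 0.45, 1.00, 0.84],
                  [0.21, 0.50, 0.30, 0.84, 1.00]])

    cluster1, cluster2 = [0, 1], [3, 4]          # {x1, x2} and {x4, x5}
    pairwise = [S[p1, p2] for p1 in cluster1 for p2 in cluster2]
    sim = sum(pairwise) / (len(cluster1) * len(cluster2))
    print(sim)   # (0.65 + 0.21 + 0.61 + 0.50) / 4 = 0.4925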
Ward’s Method
Ward’s method says that the distance between two clusters, C_k and C_{k'}, is
how much the sum of squares will increase when we merge them:
\[ O(C_k, C_{k'}) = \sum_{x_i \in C_k \cup C_{k'}} \lVert x_i - \mathrm{centroid}_{C_k \cup C_{k'}} \rVert^2 - \sum_{x_i \in C_k} \lVert x_i - \mathrm{centroid}_{C_k} \rVert^2 - \sum_{x_i \in C_{k'}} \lVert x_i - \mathrm{centroid}_{C_{k'}} \rVert^2 = \frac{\mathrm{num}_k \, \mathrm{num}_{k'}}{\mathrm{num}_k + \mathrm{num}_{k'}} \, \lVert \mathrm{centroid}_{C_k} - \mathrm{centroid}_{C_{k'}} \rVert^2 \]
where centroid_k is the center of cluster k, and num_k is the number of objects
in it. O is called the merging cost of combining the clusters C_k and C_{k'}. With
hierarchical clustering, the sum of squares starts out at zero (because every
point is in its own cluster) and then grows as we merge clusters. Ward’s
method keeps this growth as small as possible. This is nice if you believe
that the sum of squares should be small. Notice that the number of points
shows up in O, as well as their geometric separation. Given two pairs of
clusters whose centers are equally far apart, Ward’s method will prefer to
merge the smaller ones. Ward’s method is both greedy and constrained by
previous choices as to which clusters to form. This means its sum of squares
for a given number k of clusters is usually larger than the minimum for that
k, and even larger than what k-means will achieve. If this is bothersome for
your application, one common trick is to use hierarchical clustering to pick
K (see below), and then run k-means starting from the clusters found by
Ward’s method to reduce the sum of squares from a good starting point.
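A hedged illustration of this trick (the library, data and parameter choices are ours, not the author's): scikit-learn's AgglomerativeClustering with Ward linkage picks the clusters, whose centroids then seed k-means.

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering, KMeans

    rng = np.random.default_rng(0)
    # Toy data: three well-separated 2-D blobs
    X = np.vstack([rng.normal(loc=c, scale=0.5, size=(60, 2)) for c in (0.0, 4.0, 8.0)])

    K = 3  # suppose inspection of the hierarchical tree suggested K = 3
    ward = AgglomerativeClustering(n_clusters=K, linkage='ward').fit(X)

    # Seed k-means with the centroids of the Ward clusters
    centroids = np.array([X[ward.labels_ == k].mean(axis=0) for k in range(K)])
    km = KMeans(n_clusters=K, init=centroids, n_init=1).fit(X)
    print(km.inertia_)  # within-cluster sum of squares after the k-means refinement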
        x1      x2      x3      x4      x5
x1    1.00    0.92    0.09    0.65    0.21
x2    0.92    1.00    0.72    0.61    0.50
x3    0.09    0.72    1.00    0.45    0.30
x4    0.65    0.61    0.45    1.00    0.84
x5    0.21    0.50    0.30    0.84    1.00

(a) Similarity matrix for five objects    (b) Dendrogram of average link
Figure 4.3.4: An example of Average Link Clustering
4.3.3 Density-based methods
Density-based approaches apply a local cluster criterion. Clusters are
regarded as regions in the data space in which the objects are dense, and
which are separated by regions of low object density (noise). These regions
may have an arbitrary shape and the points inside a region may be arbitrarily
distributed.
4.3.3.1 DBSCAN
The most popular density based clustering method is DBSCAN [22]. In
contrast to many newer methods, it features a well-defined cluster model
called "density-reachability". Similar to link-based clustering, it is based
on connecting points within certain distance thresholds. DBSCAN has
been applied to a 5-dimensional feature space created from several satellite
images covering the area of California (5 different spectral channels: 1
visible, 2 reflected infrared, and 2 emitted (thermal) infrared). The images
are taken from the raster data of the SEQUOIA 2000 Storage Benchmark.
This kind of clustering application is one of the basic methods for automatic
land-use detection from remote sensing data. The main idea of DBSCAN
is illustrated in Fig. 4.3.5: its definition of a cluster is based on the notion of
density-reachability. Basically, a point q is directly density-reachable from a
point p if q is not farther away than a given distance Eps (i.e., q is part of p's
neighborhood) and if p is surrounded by sufficiently many points (at least
MinPts), so that one may consider p and q to be part of a cluster. The point q
is called density-reachable (note the distinction from "directly density-reachable")
from p if there is a sequence p_1, ..., p_m of points with p_1 = p and p_m = q,
where each p_{i+1} is directly density-reachable from p_i. Note that the relation
of density-reachable is not symmetric: q might lie on the edge of a cluster,
having insufficiently many neighbors to count as dense itself. In that case q
can still be reached from the core point p through a chain of dense points, but
no point is density-reachable from q, because such a chain would have to start
at the non-dense point q. Due to this asymmetry, the notion of density-connected
is introduced: two points p and q are density-connected if there is a point o
such that both p and q are density-reachable from o. Density-connectedness
is symmetric.
Algorithm 4.7 gives the main steps of DBSCAN. DBSCAN visits each point of the
database, possibly multiple times (e.g., as a candidate for different clusters).
For practical purposes, however, the time complexity is mostly
governed by the number of neighborhood (regionQuery) invocations. DBSCAN executes
exactly one such query for each point, and if an indexing structure is used
that executes such a neighborhood query in O(log n), an overall runtime
complexity of O(n log n) is obtained. Without the use of an accelerating
index structure, the run time complexity is O(n^2). Often the distance matrix
of size (n^2 - n)/2 is materialized to avoid distance re-computations; this,
however, also needs O(n^2) memory. Note that DBSCAN only connects points that
satisfy the density criterion, defined in the original variant as a minimum
number of other objects (MinPts) within the Eps radius. A cluster then consists
of all density-connected objects (which can form a cluster of an arbitrary shape,
in contrast to many other methods) plus all objects that are within these objects' range.
Figure 4.3.5: The basic idea of DBSCAN.
Another interesting property of DBSCAN is that its complexity is fairly
low—it requires a linear number of range queries on the database—and
that it will discover essentially the same results (it is deterministic for core
and noise points, but not for border points) in each run, therefore there is
no need to run it multiple times.
Algorithm 4.7: DBSCAN
(1) Select a point p.
(2) Retrieve all points density-reachable from p with respect to Eps and MinPts.
(3) If p is a core point, a cluster is formed.
(4) If p is a border point, no points are density-reachable from p and DBSCAN
visits the next point of the database.
(5) Continue the process until all of the points have been processed.
(6) Return the clustering results.
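A brief, hedged illustration of the algorithm in practice (the data and parameter values below are arbitrary choices of ours, not from the text), using scikit-learn's DBSCAN implementation:

    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(0)
    # Two dense blobs plus a few scattered noise points
    X = np.vstack([rng.normal(loc=0.0, scale=0.3, size=(50, 2)),
                   rng.normal(loc=3.0, scale=0.3, size=(50, 2)),
                   rng.uniform(low=-2, high=5, size=(10, 2))])

    db = DBSCAN(eps=0.5, min_samples=5).fit(X)   # eps = Eps, min_samples = MinPts
    print(np.unique(db.labels_))                 # cluster ids; -1 marks noise points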
OPTICS [11] is a generalization of DBSCAN that removes the need
to choose an appropriate value for the range parameter Eps, and produces
a hierarchical result related to that of linkage clustering. DeLi-Clu
(Density-Link-Clustering) [5] combines ideas from single-link clustering and
OPTICS, eliminating the Eps parameter entirely and offering performance
improvements over OPTICS by using an R-tree index. The key drawback
of DBSCAN and OPTICS is that they expect some kind of density drop
to detect cluster borders. Moreover, they cannot detect intrinsic cluster
structures which are prevalent in the majority of real life data. On data sets
with, for example, overlapping Gaussian distributions (a common use case
in artificial data), the cluster borders produced by these algorithms will
often look arbitrary, because the cluster density decreases continuously. On
a data set consisting of mixtures of Gaussians, these algorithms are nearly
always outperformed by methods such as EM clustering that are able to
precisely model this kind of data.
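For completeness, a minimal sketch using scikit-learn's OPTICS implementation (illustrative parameter choice, reusing the array X from the DBSCAN sketch above):

    import numpy as np
    from sklearn.cluster import OPTICS

    # min_samples plays the role of MinPts; no Eps value needs to be fixed in advance
    opt = OPTICS(min_samples=5).fit(X)
    print(opt.reachability_[opt.ordering_][:10])  # start of the reachability plot
    print(np.unique(opt.labels_))                 # extracted clusters; -1 marks noise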
4.3.3.2 DENCLUE
DENCLUE (DENsity CLUstEring) [32] takes a more formal approach to
density based clustering: it models the overall density of a set of points as
the sum of "influence" functions associated with each point. The resulting
overall density function will have local peaks, i.e., local density maxima,
and these local peaks can be used to define clusters in a straightforward
way. Specifically, for each data point, a hill climbing procedure finds the
nearest peak associated with that point, and the set of all data points
associated with a particular peak (called a local density attractor) becomes
a (center-defined) cluster. However, if the density at a local peak is too low,
then the points in the associated cluster are classified as noise and discarded.
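The sketch below is our own illustration of this core idea, using a Gaussian influence function and a naive gradient ascent; it is not the original DENCLUE implementation, and the bandwidth h, step size and data are arbitrary choices.

    import numpy as np

    def density(x, points, h=0.5):
        """Overall density at x: sum of Gaussian influence functions of all points."""
        return np.sum(np.exp(-np.sum((points - x) ** 2, axis=1) / (2 * h ** 2)))

    def hill_climb(x, points, h=0.5, step=0.1, iters=100):
        """Naive gradient ascent towards the nearest local density peak (attractor)."""
        for _ in range(iters):
            diffs = points - x
            weights = np.exp(-np.sum(diffs ** 2, axis=1) / (2 * h ** 2))
            grad = (weights[:, None] * diffs).sum(axis=0) / h ** 2  # density gradient
            x = x + step * grad / (np.linalg.norm(grad) + 1e-12)    # normalized ascent step
        return x

    rng = np.random.default_rng(1)
    points = np.vstack([rng.normal(0.0, 0.3, (30, 2)),
                        rng.normal(3.0, 0.3, (30, 2))])
    attractor = hill_climb(points[0], points)
    print(attractor, density(attractor, points))  # peak near the first blob's center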