(1) Start by assigning each item to its own cluster, so that if you have m
items, you now have m clusters, each containing just one item. Let
the distances (similarities) between the clusters equal the distances
(similarities) between the items they contain.
(2) Find the closest (most similar) pair of clusters and merge them into a
single cluster, so that now you have one less cluster.
(3) Compute distances (similarities) between the new cluster and each of
the old clusters.
(4) Repeat steps 2 and 3 until all items are clustered into a single cluster
of size m. Thus, the agglomerative clustering algorithm results in
a binary cluster tree with single-item clusters as its leaf nodes and a
root node containing all the items. A minimal code sketch of this
procedure is given below.
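The following Python sketch is our own illustration of these four steps (it is not code from the book, and its naive pairwise search is only meant to be readable, not efficient). The linkage argument anticipates the single link, complete link and group average variants described next.

    def agglomerative(dist, linkage=min):
        """Naive agglomerative clustering sketch (illustrative only).

        dist    -- m x m symmetric matrix of pairwise item distances
        linkage -- min (single link), max (complete link), or e.g.
                   statistics.mean (group average)
        Returns the sequence of merges as (cluster_a, cluster_b, distance).
        """
        clusters = [[i] for i in range(len(dist))]   # step 1: one cluster per item
        merges = []
        while len(clusters) > 1:                     # step 4: stop at a single cluster
            best = None
            for a in range(len(clusters)):           # step 2: find the closest pair
                for b in range(a + 1, len(clusters)):
                    d = linkage([dist[i][j] for i in clusters[a] for j in clusters[b]])
                    if best is None or d < best[0]:
                        best = (d, a, b)
            d, a, b = best
            merges.append((list(clusters[a]), list(clusters[b]), d))
            clusters[a] = clusters[a] + clusters[b]  # step 3: merge; cluster distances
            del clusters[b]                          # are recomputed via linkage()
        return merges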
Single Link
For the single link or MIN version of hierarchical clustering, the
proximity of two clusters is defined as the minimum distance (maximum
similarity) between any two objects in the different clusters. Single link is good
at handling non-elliptical shapes, but is sensitive to noise and outliers.
Figure 4.3.2(a) gives a sample similarity matrix for five objects and the
dendrogram shows the series of merges derived by using the single
link clustering method [Fig. 4.3.2(b)].
        x1      x2      x3      x4      x5
x1    1.00    0.92    0.09    0.65    0.21
x2    0.92    1.00    0.72    0.61    0.50
x3    0.09    0.72    1.00    0.45    0.30
x4    0.65    0.61    0.45    1.00    0.84
x5    0.21    0.50    0.30    0.84    1.00

(a) Similarity matrix for five objects    (b) Dendrogram of single link
Figure 4.3.2: An example of Single Link
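As a concrete illustration (ours, not the book's), the merges of Fig. 4.3.2(b) can be reproduced with SciPy by converting the similarities above into distances, e.g. d = 1 - s:

    import numpy as np
    from scipy.cluster.hierarchy import linkage
    from scipy.spatial.distance import squareform

    # Similarity matrix from Figure 4.3.2(a)
    S = np.array([[1.00, 0.92, 0.09, 0.65, 0.21],
                  [0.92, 1.00, 0.72, 0.61, 0.50],
                  [0.09, 0.72, 1.00, 0.45, 0.30],
                  [0.65, 0.61, 0.45, 1.00, 0.84],
                  [0.21, 0.50, 0.30, 0.84, 1.00]])

    D = 1.0 - S                              # turn similarities into distances
    condensed = squareform(D, checks=False)  # condensed form expected by linkage()
    Z = linkage(condensed, method='single')  # 'single' = MIN linkage
    print(Z)                                 # each row: merged clusters, distance, size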
Complete Link Clustering
For the complete link or MAX version of hierarchical clustering, the proximity
of two clusters is defined to be the maximum distance (minimum
similarity) between any two points in the different clusters.
(The technique is called complete link because, if you start with all
points as singleton clusters, and add links between points, strongest
links first, then a group of points is not a cluster until all the points
in it are completely linked.) Complete link is less susceptible to noise
and outliers, but it can break large clusters and tends to favor globular
(convex) shapes. Figure 4.3.3(a) gives a sample similarity
matrix for five objects and the dendrogram shows the series of merges
derived by using the complete link clustering method [Fig. 4.3.3(b)].
        x1      x2      x3      x4      x5
x1    1.00    0.92    0.09    0.65    0.21
x2    0.92    1.00    0.72    0.61    0.50
x3    0.09    0.72    1.00    0.45    0.30
x4    0.65    0.61    0.45    1.00    0.84
x5    0.21    0.50    0.30    0.84    1.00

(a) Similarity matrix for five objects    (b) Dendrogram of complete link
Figure 4.3.3: An example of Complete Link Clustering
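Continuing the SciPy sketch above (which built the condensed distance matrix), the complete link merges of Fig. 4.3.3(b) come from changing only the method argument; the call below is illustrative:

    from scipy.cluster.hierarchy import linkage

    # 'complete' = MAX linkage on the condensed distances built for Fig. 4.3.2
    Z_complete = linkage(condensed, method='complete')
    print(Z_complete)   # compare the merge order with the single link result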
Average Link Clustering
For the group average version of hierarchical clustering, the proximity
of two clusters is defined to be the average of the pairwise proximities
between all pairs of points in the different clusters. Notice that
this is an intermediate approach between MIN and MAX. This is
expressed by the following equation:

\[ \mathrm{similarity}(cluster_1, cluster_2) = \frac{\sum_{p_1 \in cluster_1,\; p_2 \in cluster_2} \mathrm{similarity}(p_1, p_2)}{|cluster_1| \cdot |cluster_2|} \]
Figure 4.3.4 gives a sample similarity matrix and the dendrogram shows
the series of merges that result from using the group average approach.
The hierarchical clustering in this simple case is the same as produced by
MIN.
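As a small illustrative check of this formula (the choice of clusters is ours), the group average similarity between the clusters {x1, x2} and {x4, x5} of Fig. 4.3.2(a) works out as follows:

    import numpy as np

    # Similarity matrix from Figure 4.3.2(a)
    S = np.array([[1.00, 0.92, 0.09, 0.65, 0.21],
                  [0.92, 1.00, 0.72, 0.61, 0.50],
                  [0.09, 0.72, 1.00, 0.45, 0.30],
                  [0.65, 0.61, 0.45, 1.00, 0.84],
                  [0.21, 0.50, 0.30, 0.84, 1.00]])

    cluster1, cluster2 = [0, 1], [3, 4]          # {x1, x2} and {x4, x5}
    pairwise = [S[p1, p2] for p1 in cluster1 for p2 in cluster2]
    sim = sum(pairwise) / (len(cluster1) * len(cluster2))
    print(sim)   # (0.65 + 0.21 + 0.61 + 0.50) / 4 = 0.4925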
Ward’s Method
Ward’s method says that the distance between two clusters, C_k and C_{k'}, is
how much the sum of squares will increase when we merge them:
\[ O(C_k, C_{k'}) = \sum_{x_i \in C_k \cup C_{k'}} \lVert x_i - \mathrm{centroid}_{C_k \cup C_{k'}} \rVert^2 - \sum_{x_i \in C_k} \lVert x_i - \mathrm{centroid}_{C_k} \rVert^2 - \sum_{x_i \in C_{k'}} \lVert x_i - \mathrm{centroid}_{C_{k'}} \rVert^2 = \frac{\mathrm{num}_k \, \mathrm{num}_{k'}}{\mathrm{num}_k + \mathrm{num}_{k'}} \, \lVert \mathrm{centroid}_{C_k} - \mathrm{centroid}_{C_{k'}} \rVert^2 \]
where centroid_k is the center of cluster k, and num_k is the number of objects
in it. O is called the merging cost of combining the clusters C_k and C_{k'}. With
hierarchical clustering, the sum of squares starts out at zero (because every
point is in its own cluster) and then grows as we merge clusters. Ward’s
method keeps this growth as small as possible. This is nice if you believe
that the sum of squares should be small. Notice that the number of points
shows up in O, as well as their geometric separation. Given two pairs of
clusters whose centers are equally far apart, Ward’s method will prefer to
merge the smaller ones. Ward’s method is both greedy and constrained by
previous choices as to which clusters to form. This means its sum of squares
for a given number k of clusters is usually larger than the minimum for that
k, and even larger than what k-means will achieve. If this is bothersome for
your application, one common trick is to use hierarchical clustering to pick
K (see below), and then run k-means starting from the clusters found by
Ward’s method to reduce the sum of squares from a good starting point.
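A hedged illustration of this trick (the library, data and parameter choices are ours, not the author's): scikit-learn's AgglomerativeClustering with Ward linkage picks the clusters, whose centroids then seed k-means.

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering, KMeans

    rng = np.random.default_rng(0)
    # Toy data: three well-separated 2-D blobs
    X = np.vstack([rng.normal(loc=c, scale=0.5, size=(60, 2)) for c in (0.0, 4.0, 8.0)])

    K = 3  # suppose inspection of the hierarchical tree suggested K = 3
    ward = AgglomerativeClustering(n_clusters=K, linkage='ward').fit(X)

    # Seed k-means with the centroids of the Ward clusters
    centroids = np.array([X[ward.labels_ == k].mean(axis=0) for k in range(K)])
    km = KMeans(n_clusters=K, init=centroids, n_init=1).fit(X)
    print(km.inertia_)  # within-cluster sum of squares after the k-means refinement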
        x1      x2      x3      x4      x5
x1    1.00    0.92    0.09    0.65    0.21
x2    0.92    1.00    0.72    0.61    0.50
x3    0.09    0.72    1.00    0.45    0.30
x4    0.65    0.61    0.45    1.00    0.84
x5    0.21    0.50    0.30    0.84    1.00

(a) Similarity matrix for five objects    (b) Dendrogram of average link
Figure 4.3.4: An example of Average Link Clustering
4.3.3 Density-based methods
Density-based approaches apply a local cluster criterion. Clusters are
regarded as regions in the data space in which the objects are dense, and
which are separated by regions of low object density (noise). These regions
may have an arbitrary shape and the points inside a region may be arbitrarily
distributed.
4.3.3.1 DBSCAN
The most popular density based clustering method is DBSCAN [22]. In
contrast to many newer methods, it features a well-defined cluster model
called "density-reachability". Similar to link-based clustering, it is based
on connecting points within certain distance thresholds. DBSCAN has
been applied to a 5-dimensional feature space created from several satellite
images covering the area of California (5 different spectral channels: 1
visible, 2 reflected infrared, and 2 emitted (thermal) infrared). The images
are taken from the raster data of the SEQUOIA 2000 Storage Benchmark.
This kind of clustering application is one of the basic methods for automatic
land-use detection from remote sensing data. The main idea of DBSCAN
is illustrated in Fig. 4.3.5: its definition of a cluster is based on the notion of
density-reachability. Basically, a point q is directly density-reachable from a
point p if q is not farther away than a given distance Eps (i.e., q is part of p's
neighborhood) and if p is surrounded by sufficiently many points (at least
MinPts), so that one may consider p and q to be part of a cluster. The point q
is called density-reachable (note the distinction from "directly density-reachable")
from p if there is a sequence p_1, ..., p_m of points with p_1 = p and p_m = q,
where each p_{i+1} is directly density-reachable from p_i. Note that the relation
of density-reachable is not symmetric: q might lie on the edge of a cluster,
having insufficiently many neighbors to count as dense itself. In that case q
can still be reached from the core point p through a chain of dense points, but
no point is density-reachable from q, because such a chain would have to start
at the non-dense point q. Due to this asymmetry, the notion of density-connected
is introduced: two points p and q are density-connected if there is a point o
such that both p and q are density-reachable from o. Density-connectedness
is symmetric.
Algorithm 4.7 gives the main steps of DBSCAN. DBSCAN visits each point of the
database, possibly multiple times (e.g., as a candidate for different clusters).
For practical purposes, however, the time complexity is mostly
governed by the number of neighborhood (regionQuery) invocations. DBSCAN executes
exactly one such query for each point, and if an indexing structure is used
that executes such a neighborhood query in O(log n), an overall runtime
complexity of O(n log n) is obtained. Without the use of an accelerating
index structure, the run time complexity is O(n^2). Often the distance matrix
of size (n^2 - n)/2 is materialized to avoid distance re-computations; this,
however, also needs O(n^2) memory. Note that DBSCAN only connects points that
satisfy the density criterion, defined in the original variant as a minimum
number of other objects (MinPts) within the Eps radius. A cluster then consists
of all density-connected objects (which can form a cluster of an arbitrary shape,
in contrast to many other methods) plus all objects that are within these objects' range.
Figure 4.3.5: The basic idea of DBSCAN.
Another interesting property of DBSCAN is that its complexity is fairly
low—it requires a linear number of range queries on the database—and
that it will discover essentially the same results (it is deterministic for core
and noise points, but not for border points) in each run, therefore there is
no need to run it multiple times.
Algorithm 4.7: DBSCAN
(1) Select a point p.
(2) Retrieve all points density-reachable from p with respect to Eps and MinPts.
(3) If p is a core point, a cluster is formed.
(4) If p is a border point, no points are density-reachable from p and DBSCAN
visits the next point of the database.
(5) Continue the process until all of the points have been processed.
(6) Return the clustering results.
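A brief, hedged illustration of the algorithm in practice (the data and parameter values below are arbitrary choices of ours, not from the text), using scikit-learn's DBSCAN implementation:

    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(0)
    # Two dense blobs plus a few scattered noise points
    X = np.vstack([rng.normal(loc=0.0, scale=0.3, size=(50, 2)),
                   rng.normal(loc=3.0, scale=0.3, size=(50, 2)),
                   rng.uniform(low=-2, high=5, size=(10, 2))])

    db = DBSCAN(eps=0.5, min_samples=5).fit(X)   # eps = Eps, min_samples = MinPts
    print(np.unique(db.labels_))                 # cluster ids; -1 marks noise points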
OPTICS [11] is a generalization of DBSCAN that removes the need
to choose an appropriate value for the range parameter Eps, and produces
a hierarchical result related to that of linkage clustering. DeLi-Clu
(Density-Link-Clustering) [5] combines ideas from single-link clustering and
OPTICS, eliminating the Eps parameter entirely and offering performance
improvements over OPTICS by using an R-tree index. The key drawback
of DBSCAN and OPTICS is that they expect some kind of density drop
to detect cluster borders. Moreover, they cannot detect intrinsic cluster
structures which are prevalent in the majority of real life data. On data sets
with, for example, overlapping Gaussian distributions (a common use case
in artificial data), the cluster borders produced by these algorithms will
often look arbitrary, because the cluster density decreases continuously. On
a data set consisting of mixtures of Gaussians, these algorithms are nearly
always outperformed by methods such as EM clustering that are able to
precisely model this kind of data.
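For completeness, a minimal sketch using scikit-learn's OPTICS implementation (illustrative parameter choice, reusing the array X from the DBSCAN sketch above):

    import numpy as np
    from sklearn.cluster import OPTICS

    # min_samples plays the role of MinPts; no Eps value needs to be fixed in advance
    opt = OPTICS(min_samples=5).fit(X)
    print(opt.reachability_[opt.ordering_][:10])  # start of the reachability plot
    print(np.unique(opt.labels_))                 # extracted clusters; -1 marks noise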
4.3.3.2 DENCLUE
DENCLUE (DENsity CLUstEring) [32] takes a more formal approach to
density based clustering: it models the overall density of a set of points as
the sum of "influence" functions associated with each point. The resulting
overall density function will have local peaks, i.e., local density maxima,
and these local peaks can be used to define clusters in a straightforward
way. Specifically, for each data point, a hill climbing procedure finds the
nearest peak associated with that point, and the set of all data points
associated with a particular peak (called a local density attractor) becomes
a (center-defined) cluster. However, if the density at a local peak is too low,
then the points in the associated cluster are classified as noise and discarded.
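The sketch below is our own illustration of this core idea, using a Gaussian influence function and a naive gradient ascent; it is not the original DENCLUE implementation, and the bandwidth h, step size and data are arbitrary choices.

    import numpy as np

    def density(x, points, h=0.5):
        """Overall density at x: sum of Gaussian influence functions of all points."""
        return np.sum(np.exp(-np.sum((points - x) ** 2, axis=1) / (2 * h ** 2)))

    def hill_climb(x, points, h=0.5, step=0.1, iters=100):
        """Naive gradient ascent towards the nearest local density peak (attractor)."""
        for _ in range(iters):
            diffs = points - x
            weights = np.exp(-np.sum(diffs ** 2, axis=1) / (2 * h ** 2))
            grad = (weights[:, None] * diffs).sum(axis=0) / h ** 2  # density gradient
            x = x + step * grad / (np.linalg.norm(grad) + 1e-12)    # normalized ascent step
        return x

    rng = np.random.default_rng(1)
    points = np.vstack([rng.normal(0.0, 0.3, (30, 2)),
                        rng.normal(3.0, 0.3, (30, 2))])
    attractor = hill_climb(points[0], points)
    print(attractor, density(attractor, points))  # peak near the first blob's center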