76 Applied Data Mining
Another interesting property of DBSCAN is that its complexity is fairly
low—it requires a linear number of range queries on the database—and
that it will discover essentially the same results (it is deterministic for core
and noise points, but not for border points) in each run, therefore there is
no need to run it multiple times.
Algorithm 4.7: DBSCAN
(1) select a point p
(2) Retrieve all points density-reachable from p wrt G and MinPts.
(3) If p is a core point, a cluster is formed.
(4) If p is a border point, no points are density-reachable from p and DBSCAN
visits the next point of the database.
(5) Continue the process until all of the points have been processed.
(6) Retur
n clustering results.
OPTICS [11] is a generalization of DBSCAN that removes the need
to choose an appropriate value for the range parameter, and produces
a hierarchical result related to that of linkage clustering. DeLi-Clu [5],
Density-Link-Clustering combines ideas from single-link clustering and
OPTICS, eliminating the parameter entirely and offering performance
improvements over OPTICS by using an R-tree index. The key drawback
of DBSCAN and OPTICS is that they expect some kind of density drop
to detect cluster borders. Moreover they cannot detect intrinsic cluster
structures which are prevalent in the majority of real life data. On data sets
with, for example, overlapping Gaussian distributions—a common use case
in artifi cial data—the cluster borders produced by these algorithms will
often look arbitrary, because the cluster density decreases continuously. On
a data set consisting of mixtures of Gaussians, these algorithms are nearly
always outperformed by methods such as EM clustering that are able to
precisely model this kind of data.
4.3.3.2 DENCLUE
DENCLUE (DENsity CLUstEring) [32] is a density clustering approach
that takes a more formal approach to density based clustering by modeling
the overall density of a set of points as the sum of “infl uence” functions
associated with each point. The resulting overall density function will have
local peaks, i.e., local density maxima, and these local peaks can be used to
defi ne clusters in a straightforward way. Specifi cally, for each data point, a
hill climbing procedure fi nds the nearest peak associated with that point,
and the set of all data points associated with a particular peak (called a local
density attractor) becomes a (center-defi ned) cluster. However, if the density
at a local peak is too low, then the points in the associated cluster are classifi ed