88 Applied Data Mining
4.4.3 Other Methods
Hybrid clustering algorithms belong neither to the bottom-up nor to the
top-down approach. Algorithms that aim neither at uniquely assigning each
data point to a cluster nor at finding all clusters in all subspaces are called
hybrid algorithms. Some hybrid algorithms offer the user the optional
functionality of a pure projected clustering algorithm. Others aim at
computing only the subspaces of potential interest rather than the final
clusters. Usually, hybrid methods that report clusters allow overlapping
clusters, but do not aim at computing all clusters in all subspaces. DOC [49]
uses a global density threshold to define a subspace cluster by means of
hypercubes of fixed side length w containing at least α points. A random search algorithm
is proposed to compute such subspace clusters from a starting seed of
sampled points. A third parameter β specifies the trade-off between the
number of points and the dimensionality of a cluster. Because this parameter
affects the dimensionality of the resulting clusters, DOC usually also has
problems with subspace clusters of significantly different dimensionality.
Due to the very simple clustering model, the clusters may contain additional
noise points (if w is too large) or not all points that naturally belong to the
cluster (if w is too small). One run of DOC may (with a certain probability)
find one subspace cluster. If k clusters need to be identified, DOC has to
be applied at least k times. If the points assigned to the clusters found so
far are excluded from subsequent runs, DOC can be considered as a pure
projected clustering algorithm because each point is uniquely assigned
to one cluster or to noise (if not assigned to a cluster). On the other hand,
if the cluster points are not excluded from subsequent runs, the resulting
clusters of multiple runs may overlap. Usually, DOC cannot produce all
clusters in all subspaces. MINECLUS [60] is based on an idea similar to
DOC, but proposes a deterministic method to find an optimal projected
cluster, given a sample seed point. The authors transform the problem
into a frequent itemset mining problem and employ a modified frequent
pattern tree growth method. Further heuristics are introduced to enhance
efficiency and accuracy.
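As a concrete illustration of the randomized search in DOC, and of the per-point sets of relevant dimensions that MINECLUS instead mines as frequent itemsets, the following sketch implements one DOC-style trial and a best-of-n search. Function names, the seed-set size, the use of |x − p| ≤ w per dimension, and the quality function μ(|C|, |D|) = |C| · (1/β)^|D| are illustrative assumptions that follow the original DOC proposal only loosely, not a faithful reimplementation.

```python
import random

def doc_trial(points, w, alpha, beta, dims, seed_size=3, rng=None):
    """One randomized DOC-style trial (sketch): pick a seed point p and a
    small discriminating sample X, keep the dimensions on which all of X
    stays within w of p, then collect every point lying inside the resulting
    axis-parallel hypercube around p in those dimensions."""
    rng = rng or random.Random()
    p = rng.choice(points)
    X = rng.sample(points, min(seed_size, len(points)))
    # Relevant dimensions for this seed. Per point, such a dimension set is
    # exactly what MINECLUS encodes as an itemset and mines for frequency.
    D = [d for d in range(dims) if all(abs(x[d] - p[d]) <= w for x in X)]
    cluster = [q for q in points if all(abs(q[d] - p[d]) <= w for d in D)]
    if len(cluster) < alpha:          # density threshold: at least alpha points
        return None
    # Quality mu(|C|, |D|) = |C| * (1/beta)^|D| balances cluster size against
    # dimensionality; beta < 1 rewards clusters spanning more dimensions.
    quality = len(cluster) * (1.0 / beta) ** len(D)
    return quality, D, cluster

def doc_search(points, w, alpha, beta, dims, trials=200, rng=None):
    """Repeat randomized trials and keep the best-scoring projected cluster;
    one search run yields (at most) one cluster, as described in the text."""
    rng = rng or random.Random(0)
    best = None
    for _ in range(trials):
        result = doc_trial(points, w, alpha, beta, dims, rng=rng)
        if result and (best is None or result[0] > best[0]):
            best = result
    return best
```

Finding k clusters then means running `doc_search` k times, optionally removing each reported cluster's points before the next run, which is exactly the choice that makes the method behave either as a pure projected clustering algorithm or as one producing overlapping clusters.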
DiSH [6] follows a similar idea as PreDeCon but uses a hierarchical
clustering model. This way, hierarchies of subspace clusters can be
discovered, that is, the information that a lower-dimensional cluster is
embedded within a higher-dimensional one. The distance between points
and clusters refl ects the dimensionality of the subspace that is spanned by
combining the corresponding subspace of each cluster. As in COSA, the
weighting of attributes is learned for each object, not for entire clusters.
The learning of weights, however, is based on single attributes, not on the
entire feature space. DiSH uses an algorithm that is inspired by the
density-based hierarchical clustering algorithm OPTICS. However, DiSH extends