CHAPTER 7
Advanced Clustering Analysis
7.1 Introduction
Clustering is a useful approach in data mining processes for identifying
patterns and revealing underlying knowledge from large data collections.
The application areas of clustering include image segmentation, information
retrieval, document classification, association rule mining, web usage
tracking, and transaction analysis. Generally, clustering is defined as the
process of partitioning an unlabeled data set into meaningful groups (clusters)
so that intra-group similarities are maximized and inter-group similarities
are minimized at the same time. In essence, clustering involves the following
unsupervised learning process: define an encoder function $c(x)$ that maps
each data object $x_i$ into a particular group $G_k$ ($c(x_i) = k \Leftrightarrow x_i \in G_k$,
$k = 1, 2, \ldots, K$), so that a cluster criterion

$$Q(c) = \sum_{k=1}^{K} \sum_{c(x_i)=k,\; c(x_j)=k} \mathrm{dist}(x_i, x_j)$$

is minimized.
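To make the criterion concrete, the following sketch computes Q(c) for a
given assignment. The encoder c is represented as an array of cluster labels,
and Euclidean distance stands in for dist; both choices are illustrative
assumptions, since the criterion itself is metric-agnostic.

import numpy as np

def cluster_criterion(X, labels, K):
    """Compute Q(c): the sum of pairwise distances within each cluster.

    X      : (n, d) array of data objects x_1, ..., x_n.
    labels : length-n integer array; labels[i] = k means c(x_i) = k.
    K      : the number of clusters.
    """
    total = 0.0
    for k in range(K):
        members = X[labels == k]  # all x_i with c(x_i) = k
        # Sum dist(x_i, x_j) over all pairs inside cluster k.
        diffs = members[:, None, :] - members[None, :, :]
        total += np.sqrt((diffs ** 2).sum(axis=-1)).sum() / 2  # each pair once
    return total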
As we know, this is a classical combinatorial optimization problem, and
solving it exactly is NP-hard, even with just two clusters [13]. According
to computational complexity theory [36], no complete algorithm can obtain the
globally optimal solution in polynomial time, unless P = NP. The iterative
refinement method, a popular approximation approach, is widely adopted
by various unsupervised learning algorithms. A general iterative refinement
clustering process can be summarized as Algorithm 7.1 [6].
Algorithm 7.1: General iterative refinement clustering
Initialization: Initialize the parameters of the current cluster model.
Refinement: Repeat until the cluster model converges.
(1) Generate the cluster membership assignments for all data objects, based on
the current model;
(2) Refine the model parameters based on the current cluster membership
assignments.
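As a concrete instance of Algorithm 7.1, the sketch below fills in the two
refinement steps with K-means style updates: step (1) assigns each object to
its nearest center, and step (2) recomputes the centers as cluster means.
Testing for unchanged assignments is one common convergence criterion, not
the only one.

import numpy as np

def iterative_refinement_kmeans(X, K, init_centers, max_iter=100):
    """Algorithm 7.1 instantiated with K-means updates."""
    centers = np.array(init_centers, dtype=float)
    labels = np.zeros(len(X), dtype=int)
    for it in range(max_iter):
        # Step (1): generate cluster memberships from the current model.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        new_labels = dists.argmin(axis=1)
        if it > 0 and np.array_equal(new_labels, labels):
            break  # the cluster model has converged
        labels = new_labels
        # Step (2): refine the model parameters from the memberships.
        for k in range(K):
            mask = labels == k
            if mask.any():
                centers[k] = X[mask].mean(axis=0)
    return labels, centers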
Figure 7.1.1: An example of an iterative refinement clustering algorithm
An intuitive depiction of an iterative refinement clustering algorithm
is shown in Fig. 7.1.1. The horizontal axis denotes the feasible solutions of
the clustering problem and the vertical axis denotes the corresponding objective
function values of those solutions.
In this chapter, a feasible solution is a result of the encoder function
(a clustering result) and the objective function value is the value
of the cluster criterion $Q(c) = \sum_{k=1}^{K} \sum_{c(x_i)=k,\; c(x_j)=k} \mathrm{dist}(x_i, x_j)$. Without loss
of generality, we assume that point 3 is selected as the initialization of an
iterative refinement clustering algorithm; by repeating steps (1) and (2),
the algorithm will converge to point 4, a feasible solution with a
suboptimal objective function value. An iterative refinement clustering
algorithm is a popular clustering approach based on heuristic
search, which minimizes the cluster criterion in a designated region
(i.e., guiding the heuristic search with the specified initial values). In this
scenario, the obtained clustering result depends on the initialization,
and the minimized cluster criterion only reflects the sub-optimal solution
found by this run of the heuristic search. In other words, the nature of the
heuristic search process makes iterative refinement clustering algorithms heavily
sensitive to the initialization settings, and thus they cannot guarantee
high-quality clustering results with randomly chosen initializations [5]. Therefore,
how to deal with the initialization sensitivity problem of iterative
refinement clustering algorithms has become an active and widely studied research topic.
An important underlying assumption of iterative refinement clustering
algorithms is that the initialization model must be correct, since it can
determine the clustering solution [6]; that is, different initialization models
will produce different clustering results (or different local minimum points,
as shown in Fig. 7.1.1). Since the problem of obtaining a globally optimal initial
state has been shown to be NP-hard [15], the study of initialization methods
towards a sub-optimal clustering result is more practical, and is of
great value for clustering research. Recently, initialization methods have
been categorized into three major families: random sampling methods,
distance optimization methods, and density estimation methods [17]. Forgy adopted
uniformly random input objects as the seed clusters [14], and MacQueen
gave an equivalent approach that selects the first K input objects as the seed
clusters [29]. In FASTCLUS, a K-means variant implemented in SAS
[37], the simple cluster seeking (SCS) initialization method is adopted [21].
Katsavounidis et al. proposed a method that utilizes the sorted pairwise
distances for initialization [22]. Kaufman and Rousseeuw introduced a
method that estimates the density through pairwise distance comparison,
and initializes the seed clusters using the input objects from areas with
high local density [23]. In Ref. [18], a method which combines local density
approximation and random initialization is proposed. Belal et al. find a
set of medians extracted from the dimension with the maximum variance and then
use these medians as the initialization of K-means [4]. Niu et al. give a novel
algorithm called PR (Pointer Ring), which initializes cluster centers based
on a pointer ring by further partitioning traditional hyper-rectangular units
into hyper-triangle subspaces [33]. The initialization steps of the K-means++
algorithm can be described as follows: choose an initial center $m_1$ uniformly at
random from the data set; then select the next center $m_i = x'$ from the data
set with probability proportional to $\mathrm{dist}(x', m)^2$, where $\mathrm{dist}(x, m)$
denotes the shortest distance from a data object $x$ to the closest center already
chosen; iterate until K centers are found [2]. The
main steps of initializing the centers of K-means by a kd-tree are: first, the
density of the data at various locations is estimated using a kd-tree; then
a modification of the Katsavounidis algorithm, which incorporates
this density information, is used to choose K seeds for the K-means algorithm [39].
More recently, Lu et al. treated the clustering problem as a weighted clustering
problem so as to find a better initial cluster center based on a hierarchical
approach [28].
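As an illustration of the K-means++ seeding just described, a minimal sketch
follows; the squared-distance weighting is the one given in [2], while the
function and variable names are ours.

import numpy as np

def kmeanspp_seeds(X, K, rng=None):
    """K-means++ seeding: the first center is uniform at random; each
    later center is drawn with probability proportional to the squared
    distance to the closest center chosen so far."""
    rng = np.random.default_rng() if rng is None else rng
    centers = [X[rng.integers(len(X))]]  # m_1: uniform at random
    while len(centers) < K:
        # D(x): shortest squared distance from x to the closest center.
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)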
7.2 Space Smoothing Search Methods in Heuristic Clustering
The goal of these modified initialization methods is to reduce the influence
of the sub-optimal solutions (the local minimum points) strewn across the
whole search space, as shown in Fig. 7.1.1. Although iterative refinement
clustering algorithms with these modified initialization methods have
some merit in improving the quality of cluster results, they still have a high
probability of being attracted by local minimum points. The local search method
is the essence of iterative refinement clustering algorithms, and the
many local minimum points make a local search problem hard and sensitive to
the initialization. The proposed modified initialization methods focus only
on how to select an initialization that can improve the quality of an
iterative refinement clustering algorithm, while the search space, in which
many local minimum points are embedded, is ignored. The smoothing search space
method reconstructs the search space by filling in local minimum points, to reduce
their influence. In this chapter, we first design two
smoothing operators to reconstruct the search space by filling the minimum
traps (points), based on the relationship between the distance metric and the
cluster criterion. Each smoothing operator has a parameter, the smoothing factor,
which controls the number of minimum traps. We then give a top-down
clustering algorithm with smoothing search space (TDCS3) to reduce the
influence of initialization. The main steps of TDCS3 are to: (1) dynamically
reconstruct a series of smoothed search spaces as a hierarchical structure,
by filling the local minimum points: the most smoothed search space is at the top,
the original search space is at the bottom, and the other smoothed search spaces
are distributed between them; (2) at the top level of the hierarchical
structure, run an existing iterative refinement clustering algorithm with
random initialization to generate the cluster result; (3) from the second
level down to the bottom level of the hierarchical structure, run the same
clustering algorithm with the initialization derived from the cluster result on
the previous level. A sketch of this flow is given below.
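The top-down flow of TDCS3 can be sketched as follows. Here smooth stands
for one of the smoothing operators designed in this chapter (the operators
actually act on the distance structure of the search space; the sketch
abstracts this as a transformation of the data), and base_cluster is any
existing iterative refinement clustering algorithm; both are placeholders
for this sketch.

def tdcs3(X, K, base_cluster, smooth, factors):
    """Sketch of TDCS3: cluster on a series of smoothed search spaces,
    from the most smoothed (top level) down to the original space.

    base_cluster(X, K, init) -> (labels, centers): an iterative
        refinement clustering algorithm; init=None means random init.
    smooth(X, alpha): a smoothing operator; the smoothing factor alpha
        controls how many minimum traps are filled.
    factors: smoothing factors ordered from most to least smoothing;
        the last one should correspond to the original search space.
    """
    init = None  # the top level starts from a random initialization
    for alpha in factors:
        labels, centers = base_cluster(smooth(X, alpha), K, init)
        init = centers  # the result on this level seeds the next level
    return labels, centers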
Figure 7.2.1: Illustration of smoothing search space (the start solution, a local minimum, and the global optimum are marked)
7.2.1 Smoothing Search Space and Smoothing Operator
7.2.1.1 Local Search and Smoothing Search Space
The local search method is the essence of iterative refinement clustering
algorithms. During the mid-sixties, the local search method was first
proposed to cope with the overwhelming computational intractability of
NP-hard combinatorial optimization problems. Given a minimization (or
maximization) problem with objective function f and feasible region F, a
typical local search algorithm requires that with each solution $x_i \in R^d$
there is associated a predefined neighborhood $N(x_i) \subseteq R^d$. Given a
current solution point $x_i \in R^d$, the set $N(x_i)$ is searched for a point
$x_{i+1}$ with $f(x_{i+1}) < f(x_i)$ (or $f(x_{i+1}) > f(x_i)$ for
maximization). If such a point exists, it becomes the new current solution
point ($x_i \leftarrow x_{i+1}$), and the process is iterated. Otherwise,
$x_i$ is retained as a local optimum with respect to $N(x_i)$. Then a set of
feasible solution points
is generated, and each of them is locally improved within its neighborhood.
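In schematic form, the neighborhood descent just described might look like
the following; the neighbors function defining N(x_i) is problem-specific
and assumed given.

def local_search(x0, f, neighbors):
    """Generic local search for minimization: move to a strictly better
    neighbor while one exists; stop at a local optimum w.r.t. N(x)."""
    x = x0
    while True:
        better = [y for y in neighbors(x) if f(y) < f(x)]
        if not better:
            return x  # x is a local optimum with respect to N(x)
        x = min(better, key=f)  # accept the best improving neighbor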
Local search methods only check the neighborhood of the current feasible
solution $x_i$, so the search range is dramatically reduced and convergence
is accelerated. A major shortcoming of local search is that the algorithm has
a tendency to get stuck at a locally optimal configuration, i.e., a local
minimum point, such as point 2 or point 4 in Fig. 7.1.1. Different
neighborhood structures result in different terrain surface structures of the
search space and produce different numbers of local minimum points. The
effectiveness of a local search algorithm relies on the number of local
minimum points in the search space [16]; that is, local minimum points make
a search problem hard. The smaller the number of local minimum points, the
more effective a local search algorithm is. In order to reduce the influence
of local minima on a local search algorithm, some local minimum traps must be
filled. Gu and Huang [16] called this method of filling minimum traps
smoothing the search space; it dynamically reconstructs the problem structure
and smooths the rugged terrain surface of the search space. The smoothed
search space hides some local minimum points, thereby improving the
performance of the traditional local search algorithm. Figure 7.2.1
illustrates the idea of smoothing the search space.
From Fig. 7.2.1, we can see that many local minimum traps are filled
after running a smoothing operator. The solid curve shows the original
search space, which has many minimum traps, and the dashed curve shows the
smoothed search space with fewer minimum traps. From the foregoing discussion,
we can see that the many local minimum points embedded
in the search space make a local search problem hard and sensitive to the
initialization. Since the essence of iterative refinement clustering algorithms
is the local search method, they share the same underlying reason for the
initialization sensitivity problem. The main idea of smoothing the search space
is always the same, but different application areas have different ways of
smoothing it. In the clustering area, clustering is defined as the process
of partitioning unlabeled data objects into meaningful groups (clusters)
so that the value of the cluster criterion Q(c) is minimized. Minimizing the
value of Q(c) means that the intra-similarities of all clusters are maximized,
or that the distance of each data object to its cluster center is minimized.
So the cluster criterion Q(c) has a close relationship with the similarity or
distance between data objects.
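As one concrete possibility, a distance-based smoothing operator in the
spirit of Gu and Huang's scheme [16] pulls pairwise distances toward their
average, with a smoothing factor alpha >= 1 controlling how strongly the
terrain is flattened (alpha = 1 recovers the original space). The operators
designed for TDCS3 in this chapter are defined separately; this is only an
illustrative sketch, and it assumes the distances have been scaled into [0, 1].

import numpy as np

def smooth_distances(D, alpha):
    """Smooth a pairwise distance matrix D toward its average value:
        d'_ij = d_avg + (d_ij - d_avg)**alpha   if d_ij >= d_avg
        d'_ij = d_avg - (d_avg - d_ij)**alpha   if d_ij <  d_avg
    With D scaled into [0, 1], a larger alpha flattens the terrain and
    hides more local minimum traps; alpha = 1 returns D unchanged."""
    d_avg = D.mean()
    above = D >= d_avg
    Ds = np.empty_like(D, dtype=float)
    Ds[above] = d_avg + (D[above] - d_avg) ** alpha
    Ds[~above] = d_avg - (d_avg - D[~above]) ** alpha
    return Ds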