7.3 Using Approximate Backbone for Initializations in Clustering
As described in Section 7.2, solving iterative refinement clustering exactly is NP-hard, even with just two clusters. The K-centre clustering algorithm is a traditional iterative refinement method, so it inherits both the advantages and the drawbacks of this family of algorithms.
For large data sets, researchers turn to heuristic methods to solve this clustering problem. For example, the K-centre clustering algorithm is a widely used approach that heuristically searches for a partition by minimizing the sum of squared error locally, yielding locally suboptimal clustering results. The heuristic search process, however, makes K-centre clustering algorithms heavily sensitive to initialization, and with random initializations high-quality clustering results usually cannot be guaranteed [5]. On the other hand, different local suboptimal clustering results do reflect the different likelihoods of data instances gathering around various centres in a data set [43]. Since about 80% of local suboptimal solutions are observed to be distributed around the global optimal solutions [24, 30, 35], it is believed that finding the common intersection of various local suboptimal clustering results will help locate the global optimal solutions. Moreover, for K-centre heuristic clustering, choosing these intersection areas as the initial search space is expected to produce better clustering results.

Backbone analysis has recently become an active research topic for NP-hard problems. The backbone of an NP-hard problem is regarded as the core part shared by all global optimal solutions; it was first proposed in [26] for the Travelling Salesman Problem (TSP) and has attracted much attention since [42, 38]. An exact backbone, however, is generally hard to obtain for many optimization problems in real applications. Instead, the Approximate Backbone (AB), as its name indicates an approximate form of the backbone, defined as the intersection of different local suboptimal solutions of a data set, is often used to investigate the characteristics of the data set and to speed up the convergence of heuristic algorithms [44, 8, 19].

In this chapter, we adopt the concept of the AB to address the initialization problem of heuristic clustering described above, and in particular propose a Heuristic Clustering Approach Based on Approximate Backbone (HC_AB). The basic idea of HC_AB is as follows: we first identify the AB from a set of local suboptimal solutions derived from running K-centre clustering with different initialization settings; we then construct a new restricted search space based on the AB for the heuristic search; finally, we re-run the K-centre clustering algorithm on this new search space, which has the AB as part of its initialization, and generate a better clustering result that is deemed to best approximate the global optimal solution.
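The three-step idea can be summarized in a short sketch. The following Python pseudocode is a minimal illustration, not the authors' implementation; the helpers k_centre (one run of a K-centre heuristic from a given initialization) and build_search_space (the restriction step detailed in Section 7.3.2.1) are hypothetical stand-ins for components defined later in this section.

```python
import random

def hc_ab(D, K, M, k_centre, build_search_space):
    """Sketch of HC_AB: approximate-backbone-guided K-centre clustering.

    D: data set, K: number of clusters, M: number of restarts.
    k_centre(D, K, init) -> set of K centres (one local suboptimal result).
    build_search_space(ab, results) -> restricted candidate-centre pool.
    """
    # Step 1: collect M local suboptimal results from random initializations.
    results = [k_centre(D, K, init=random.sample(range(len(D)), K))
               for _ in range(M)]
    # Step 2: the approximate backbone is the intersection of all results.
    ab = set.intersection(*map(set, results))
    # Step 3: re-run the heuristic, fixing the AB and drawing the remaining
    # centres from the restricted search space.
    pool = build_search_space(ab, results)
    init = list(ab) + random.sample(sorted(pool - ab), K - len(ab))
    return k_centre(D, K, init=init)
```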
7.3.1 Definitions and Background of Approximate Backbone
In this section, we give several definitions that provide the background of the approximate backbone. Given a data set $D = \{x_1, x_2, \ldots, x_N\}$ containing $N$ data objects, each object $x_i \in R^d$ is defined over a $d$-dimensional feature space. Let $dist : R^d \times R^d \to R^+$ be a given distance function between any two objects in $R^d$. A K-centre clustering algorithm takes $D$ as input and partitions the $N$ data objects into $K$ clusters such that the sum of squared error

$$H = \sum_{k=1}^{K} \sum_{x_i \in C_k} dist(x_i, v_k)$$

is minimized, where $C_k$ is a cluster and $v_k$ is the centre of $C_k$. Since each cluster is represented by its centre, a K-centre clustering result can be represented as $V = \{v_1, v_2, \ldots, v_K\}$. A local suboptimal clustering result is a set of centres whose corresponding $H$ value is minimal within a local area. The global optimal clustering results are those suboptimal clustering results with the smallest $H$ value.
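As a concrete reading of $H$, the following sketch computes the sum of squared error for a given set of centres. It takes $dist$ to be the squared Euclidean distance, which is one common choice; the text above leaves $dist$ abstract, so this is an assumption.

```python
import numpy as np

def assign(D, centres):
    """Assign each object to its nearest centre, forming the clusters C_k."""
    D, centres = np.asarray(D), np.asarray(centres)
    d2 = ((D[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def sse(D, centres):
    """Sum of squared error H = sum_k sum_{x in C_k} dist(x, v_k),
    with dist taken as squared Euclidean distance (an assumption)."""
    D, centres = np.asarray(D), np.asarray(centres)
    labels = assign(D, centres)
    return sum(np.sum((D[labels == k] - centres[k]) ** 2)
               for k in range(len(centres)))
```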
DEFINITION 7.3 (Backbone). Given the global optimal clustering results $Z^* = \{V_1^*, V_2^*, \ldots, V_P^*\}$, where $V_p^* = \{v_{p1}^*, v_{p2}^*, \ldots, v_{pK}^*\}$, $p = 1, \ldots, P$, the backbone of the clustering problem is defined as the intersection of the $P$ global optimal clustering results:

$$backbone(V_1^*, V_2^*, \ldots, V_P^*) = V_1^* \cap V_2^* \cap \cdots \cap V_P^*$$
In fact, the global optimal solutions of an NP-hard problem are generally hard to obtain, which makes it difficult to identify the theoretically ideal backbone. However, in many research areas it has been observed that about 80% of local suboptimal solutions are distributed around the global optimal solutions, forming a "big valley" structure. Motivated by this fact, we can approximate the ideal backbone by making use of the local suboptimal solutions.
DEFINITION 7.4 (Approximate Backbone). Given the local suboptimal clustering results $Z = \{V_1, V_2, \ldots, V_M\}$, where $V_m = \{v_{m1}, v_{m2}, \ldots, v_{mK}\}$, $m = 1, \ldots, M$, the AB is defined as the intersection of the $M$ local suboptimal clustering results:

$$a\_bone(V_1, V_2, \ldots, V_M) = V_1 \cap V_2 \cap \cdots \cap V_M$$
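Computationally, both the backbone and the AB are plain set intersections over centre sets. A minimal sketch, assuming each clustering result is given as a set of centre identifiers (as in Table 7.3.1 below):

```python
def a_bone(results):
    """Approximate backbone: intersection of M local suboptimal
    clustering results, each given as a set of centre identifiers."""
    return set.intersection(*map(set, results))

# The backbone (Definition 7.3) is the same operation applied to the
# global optimal clustering results instead of local suboptimal ones.
```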
As described above, our method aims to use the AB of local suboptimal solutions to form part of the initialization (i.e., the start points for heuristic search); constructing an appropriate AB for the heuristic search in the K-centre clustering algorithm is therefore a key issue. In other words, the quality of the K-centre clustering results depends greatly on the characteristics of the AB. To address this concern, we propose two parameters to describe the characteristics of the AB: Scale and Purity. The former describes what proportion of the $K$ required centres is covered by the AB, whereas the latter denotes what proportion of the centres included in the AB also exists in the theoretically ideal backbone. In particular, the Approximate Backbone Scale (ABS) and the Approximate Backbone Purity (ABP) are defined as follows.
DEFINITION 7.5 (Approximate Backbone Scale). Given an AB, $a\_bone(V_1, V_2, \ldots, V_M)$, the Approximate Backbone Scale is defined as the ratio of the AB cardinality to the cluster number $K$:

$$ABS = \frac{|a\_bone(V_1, V_2, \ldots, V_M)|}{K}$$
DEFINITION 7.6 (Approximate Backbone Purity). Given an AB, $a\_bone(V_1, V_2, \ldots, V_M)$, and a backbone, $backbone(V_1^*, \ldots, V_P^*)$, the Approximate Backbone Purity is defined as the ratio of the cardinality of the intersection of the AB and the backbone to the AB cardinality:

$$ABP = \frac{|a\_bone(V_1, V_2, \ldots, V_M) \cap backbone(V_1^*, V_2^*, \ldots, V_P^*)|}{|a\_bone(V_1, V_2, \ldots, V_M)|}$$
As the AB is used for the selection of the initialization, in order to achieve the best result of the heuristic search we want an AB with both a large ABS and a large ABP value, indicating that most of the required centres are already included in the initialization and that the included centres are closely scattered around the global optimal clustering result. To better understand ABS and ABP, we give an example. Given a data set D containing 500 objects and 10 clusters, each cluster is represented by a representative object. An assumed global optimal clustering result $V^*$ is shown in the first row of Table 7.3.1, and three local suboptimal clustering results $V_1$, $V_2$, $V_3$, obtained by running the K-centre clustering algorithm with three different initializations, are also listed there. For simplicity, we assume that each centre is represented by an object ID in D and that there is only one global optimal clustering result in D, that is, $backbone(V^*) = V^*$.
For this example, we obtain the AB: $a\_bone(V_1, V_2, V_3) = \{43, 78, 198, 240, 310, 366, 480\}$. From the definition of ABS, the value of ABS in this example is calculated as follows:

$$ABS = \frac{|a\_bone(V_1, V_2, V_3)|}{K} = \frac{7}{10} = 0.7$$
Table 7.3.1: Clustering results of D

Name   Centre set
V*     22, 78, 109, 180, 230, 292, 310, 366, 412, 475
V1     43, 78, 109, 198, 240, 262, 310, 366, 412, 480
V2     43, 78, 128, 198, 240, 262, 310, 366, 412, 480
V3     43, 78, 128, 198, 240, 252, 310, 366, 432, 480
We observe that three centres are common to the AB and the backbone, so the value of ABP is:

$$ABP = \frac{|a\_bone(V_1, V_2, V_3) \cap V^*|}{|a\_bone(V_1, V_2, V_3)|} = \frac{3}{7} \approx 0.429$$
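The worked example can be checked directly in code. A minimal sketch, taking the centre sets of Table 7.3.1 as Python sets:

```python
V_star = {22, 78, 109, 180, 230, 292, 310, 366, 412, 475}
V1 = {43, 78, 109, 198, 240, 262, 310, 366, 412, 480}
V2 = {43, 78, 128, 198, 240, 262, 310, 366, 412, 480}
V3 = {43, 78, 128, 198, 240, 252, 310, 366, 432, 480}

ab = V1 & V2 & V3                  # {43, 78, 198, 240, 310, 366, 480}
ABS = len(ab) / 10                 # 7 / 10 = 0.7
ABP = len(ab & V_star) / len(ab)   # 3 / 7 ~= 0.429
```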
According to Definition 7.4, the AB is derived from M local suboptimal solutions, so the characteristics of the AB are closely related to M. To illustrate this relationship, we construct three data sets, RandomS1, RandomS2 and RandomS3, each of which contains 34 clusters. Each cluster has 100 data objects: 99 objects are generated by a Gaussian distribution function with a different mean (µ) and standard deviation (σ), and the last object is the mean of the other 99, which is deemed the centre of the cluster. We run the Vertex Substitution Heuristic (VSH) algorithm [7], a classical K-centre clustering algorithm, on these three data sets, and denote the processes as VSH_RandomS1, VSH_RandomS2 and VSH_RandomS3, respectively. VSH was executed for M = 2:2:20 runs on each data set, where M = 2:2:20 means M changes from 2 to 20 in steps of 2. The relationships between ABS, ABP and M are shown in Fig. 7.3.2.
[Figure 7.3.2: The ABS and ABP of the AB for VSH_RandomS1, VSH_RandomS2 and VSH_RandomS3, plotted against the number of local suboptimal cluster results, M.]
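Data sets in the style of RandomS1–RandomS3 are straightforward to reproduce. A minimal sketch under the stated construction (99 Gaussian samples per cluster plus their mean as the designated centre); the particular ranges for µ and σ and the dimensionality are assumptions, since the chapter does not list them:

```python
import numpy as np

def make_randoms(n_clusters=34, per_cluster=99, dim=2, seed=0):
    """Each cluster: 99 Gaussian objects plus their mean as the centre."""
    rng = np.random.default_rng(seed)
    data, centres = [], []
    for _ in range(n_clusters):
        mu = rng.uniform(-50, 50, size=dim)   # assumed range of cluster means
        sigma = rng.uniform(0.5, 3.0)         # assumed spread
        pts = rng.normal(mu, sigma, size=(per_cluster, dim))
        centre = pts.mean(axis=0)             # the 100th object of the cluster
        data.append(np.vstack([pts, centre]))
        centres.append(centre)
    return np.vstack(data), np.array(centres)
```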
From Fig. 7.3.2 we can see that ABS and ABP change in opposite directions with M: ABS decreases as M increases, whereas ABP increases, and eventually the changes in both become slight and a balanced state is reached. As indicated above, we would like an AB with both a high ABS and a high ABP value in order to construct a good starting point for the K-centre clustering algorithm. Because ABS and ABP change inconsistently with M, we have to choose a trade-off between them to ensure better clustering results. According to the experimental results, choosing a reasonable M can guarantee a better AB. The setting of M is an open question in various applications [19]; in this chapter, we set the value of M experimentally based on the size of the data set and the cluster number.
7.3.2 Heuristic Clustering Algorithm based on Approximate Backbone
7.3.2.1 Reconstructing the New Search Space
A new search space S, determined by the AB, is crucial to heuristic clustering algorithms. Different applications may require different reconstruction methods for the new search space. For example, in the TSP [16], the new search space is reconstructed by including all the edges that do not occur in the AB. While the construction of the AB forms a fraction of the new search space, the number of centres within the AB is usually less than the predefined number of clusters in K-centre clustering. In that case, how to select the remaining centres from the local suboptimal solutions is another important issue. Essentially, a cluster centre that is not included in the AB but occurs frequently in local suboptimal clustering results is more likely to be selected to represent a real cluster. Although we could build the new search space from all the centres occurring in the local suboptimal solutions, in order to reduce its size we select a centre into the new search space only if it is a frequent centre.

DEFINITION 7.7 (Frequent Centre). Given an occurrence threshold β, the local suboptimal clustering results $Z = \{V_1, \ldots, V_M\}$ and the AB $a\_bone(Z)$, for each centre $v_i \in UZ \setminus a\_bone(Z)$, if the occurrence frequency of $v_i$ exceeds β, then $v_i$ is a frequent centre, where $\setminus$ is the set difference operator and $UZ = \bigcup_{m=1}^{M} V_m$.
We use the example shown in Table 7.3.1 to explain the refinement of the new search space. From the table, we find that 13 distinct objects are selected as cluster centres in $V_1$, $V_2$, $V_3$. The AB of this example is $a\_bone(V_1, V_2, V_3) = \{43, 78, 198, 240, 310, 366, 480\}$, so the remaining candidate centres are $UZ \setminus a\_bone(V_1, V_2, V_3) = \{109, 128, 252, 262, 412, 432\}$. If the occurrence frequency of such a centre $v_i$ exceeds β, then it is a frequent centre and will be included in S. According to the big valley phenomenon, here we set the parameter β = M × 80% = 2.4 (for this example M = 3), and we obtain the three additional centres 128, 262 and 412.
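A small sketch of this refinement step, reusing the V1, V2, V3 sets from the earlier code block; the exact threshold comparison is a parameter choice, and β = 2 is an assumption made here so that the chapter's three additional centres are recovered:

```python
from collections import Counter

def frequent_centres(results, ab, beta):
    """Centres outside the AB whose occurrence count across the M
    local suboptimal results reaches the threshold beta."""
    counts = Counter(v for V in results for v in V if v not in ab)
    return {v for v, c in counts.items() if c >= beta}

results = [V1, V2, V3]              # centre sets from Table 7.3.1
ab = set.intersection(*results)     # {43, 78, 198, 240, 310, 366, 480}
S = ab | frequent_centres(results, ab, beta=2)
# Occurrence counts outside the AB: 109, 252 and 432 appear once;
# 128, 262 and 412 appear twice, so frequent_centres(...) == {128, 262, 412}.
```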