7.3 Using Approximate Backbone for Initializations in Clustering
As described in Section 7.2, solving iterative refinement clustering exactly is NP-hard, even with just two clusters. The K-centre clustering algorithm is a traditional iterative refinement method, so it inherits both the advantages and the drawbacks of this family of algorithms.
For large data sets, researchers turn to heuristic methods to solve this clustering problem. For example, the K-centre clustering algorithm is a widely used approach that heuristically searches for a partition by minimizing the sum of squared error locally, yielding locally suboptimal clustering results. The heuristic search process, however, makes K-centre clustering algorithms heavily sensitive to initialization, and with random initializations high-quality clustering results usually cannot be guaranteed [5]. On the other hand, different local suboptimal clustering results do reflect the different likelihoods of data instances gathering around various centres in a data set [43]. Since about 80% of local suboptimal solutions are observed to be distributed around the global optimal solutions [24, 30, 35], it is believed that finding the common intersection of various local suboptimal clustering results will help locate the global optimal solutions. Moreover, for K-centre heuristic clustering, choosing these intersection areas as the initial search space is expected to produce better clustering results.

Backbone analysis has recently become an active research topic for NP-hard problems. The backbone of an NP-hard problem is regarded as the core part shared by all global optimal solutions; it was first proposed in [26] for the Travelling Salesman Problem (TSP) and has attracted much attention since [42, 38]. An exact backbone, however, is generally hard to obtain for many optimization problems in real applications. Instead, the Approximate Backbone (AB), as its name indicates an approximate form of the backbone, defined as the intersection of different local suboptimal solutions of a data set, is often used to investigate the characteristics of the data set and to speed up the convergence of heuristic algorithms [44, 8, 19].

In this chapter, we adopt the concept of the AB to address the initialization problem of heuristic clustering described above, and in particular propose a Heuristic Clustering Approach Based on Approximate Backbone (HC_AB). The basic idea of HC_AB is as follows: we first identify the AB from a set of local suboptimal solutions derived from running K-centre clustering with different initialization settings; we then construct a new restricted search space based on the AB for the heuristic search; finally, we re-run the K-centre clustering algorithm on this new search space, which has the AB as part of its initialization, and generate a better clustering result that is deemed to best approximate the global optimal solution.
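The three-step idea can be summarized in a short sketch. The following Python pseudocode is a minimal illustration, not the authors' implementation; the helpers k_centre (one run of a K-centre heuristic from a given initialization) and build_search_space (the restriction step detailed in Section 7.3.2.1) are hypothetical stand-ins for components defined later in this section.

```python
import random

def hc_ab(D, K, M, k_centre, build_search_space):
    """Sketch of HC_AB: approximate-backbone-guided K-centre clustering.

    D: data set, K: number of clusters, M: number of restarts.
    k_centre(D, K, init) -> set of K centres (one local suboptimal result).
    build_search_space(ab, results) -> restricted candidate-centre pool.
    """
    # Step 1: collect M local suboptimal results from random initializations.
    results = [k_centre(D, K, init=random.sample(range(len(D)), K))
               for _ in range(M)]
    # Step 2: the approximate backbone is the intersection of all results.
    ab = set.intersection(*map(set, results))
    # Step 3: re-run the heuristic, fixing the AB and drawing the remaining
    # centres from the restricted search space.
    pool = build_search_space(ab, results)
    init = list(ab) + random.sample(sorted(pool - ab), K - len(ab))
    return k_centre(D, K, init=init)
```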
7.3.1 Definitions and Background of Approximate Backbone
In this section, we give several definitions that provide the background of the approximate backbone. Given a data set $D = \{x_1, x_2, \ldots, x_N\}$ containing $N$ data objects, each object $x_i \in R^d$ is defined over a $d$-dimensional feature space. Let $dist : R^d \times R^d \to R^+$ be a given distance function between any two objects in $R^d$. A K-centre clustering algorithm takes $D$ as input and partitions the $N$ data objects into $K$ clusters such that the sum of squared error

$$H = \sum_{k=1}^{K} \sum_{x_i \in C_k} dist(x_i, v_k)$$

is minimized, where $C_k$ is a cluster and $v_k$ is the centre of $C_k$. Since each cluster is represented by its centre, a K-centre clustering result can be represented as $V = \{v_1, v_2, \ldots, v_K\}$. A local suboptimal clustering result is a set of centres whose corresponding $H$ value is minimal within a local area. The global optimal clustering results are those suboptimal clustering results with the smallest $H$ value.
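As a concrete reading of $H$, the following sketch computes the sum of squared error for a given set of centres. It takes $dist$ to be the squared Euclidean distance, which is one common choice; the text above leaves $dist$ abstract, so this is an assumption.

```python
import numpy as np

def assign(D, centres):
    """Assign each object to its nearest centre, forming the clusters C_k."""
    D, centres = np.asarray(D), np.asarray(centres)
    d2 = ((D[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def sse(D, centres):
    """Sum of squared error H = sum_k sum_{x in C_k} dist(x, v_k),
    with dist taken as squared Euclidean distance (an assumption)."""
    D, centres = np.asarray(D), np.asarray(centres)
    labels = assign(D, centres)
    return sum(np.sum((D[labels == k] - centres[k]) ** 2)
               for k in range(len(centres)))
```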
DEFINITION 7.3 (Backbone). Given the global optimal clustering results $Z^* = \{V_1^*, V_2^*, \ldots, V_P^*\}$, where $V_p^* = \{v_{p1}^*, v_{p2}^*, \ldots, v_{pK}^*\}$, $p = 1, \ldots, P$, the backbone of the clustering problem is defined as the intersection of the $P$ global optimal clustering results:

$$backbone(V_1^*, V_2^*, \ldots, V_P^*) = V_1^* \cap V_2^* \cap \cdots \cap V_P^*$$
In fact, the global optimal solutions of an NP-hard problem are generally hard to obtain, which makes it difficult to identify the theoretically ideal backbone. However, in many research areas it has been observed that about 80% of local suboptimal solutions are distributed around the global optimal solutions, forming a "big valley" structure. Motivated by this fact, we can approximate the ideal backbone by making use of the local suboptimal solutions.
DEFINITION 7.4 (Approximate Backbone). Given the local suboptimal clustering results $Z = \{V_1, V_2, \ldots, V_M\}$, where $V_m = \{v_{m1}, v_{m2}, \ldots, v_{mK}\}$, $m = 1, \ldots, M$, the AB is defined as the intersection of the $M$ local suboptimal clustering results:

$$a\_bone(V_1, V_2, \ldots, V_M) = V_1 \cap V_2 \cap \cdots \cap V_M$$
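Computationally, both the backbone and the AB are plain set intersections over centre sets. A minimal sketch, assuming each clustering result is given as a set of centre identifiers (as in Table 7.3.1 below):

```python
def a_bone(results):
    """Approximate backbone: intersection of M local suboptimal
    clustering results, each given as a set of centre identifiers."""
    return set.intersection(*map(set, results))

# The backbone (Definition 7.3) is the same operation applied to the
# global optimal clustering results instead of local suboptimal ones.
```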
As described above, our method aims to use the AB of local suboptimal solutions to form part of the initialization (i.e., the start points for heuristic search); constructing an appropriate AB for the heuristic search in the K-centre clustering algorithm is therefore a key issue. In other words, the quality of the K-centre clustering results depends greatly on the characteristics of the AB. To address this concern, we propose two parameters to describe the characteristics of the AB: Scale and Purity. The former describes what proportion of the $K$ required centres is covered by the AB, whereas the latter denotes what proportion of the centres included in the AB also exists in the theoretically ideal backbone. In particular, the Approximate Backbone Scale (ABS) and the Approximate Backbone Purity (ABP) are defined as follows.
DEFINITION 7.5 (Approximate Backbone Scale). Given an AB, $a\_bone(V_1, V_2, \ldots, V_M)$, the Approximate Backbone Scale is defined as the ratio of the AB cardinality to the cluster number $K$:

$$ABS = \frac{|a\_bone(V_1, V_2, \ldots, V_M)|}{K}$$
DEFINITION 7.6 (Approximate Backbone Purity). Given an AB, $a\_bone(V_1, V_2, \ldots, V_M)$, and a backbone, $backbone(V_1^*, \ldots, V_P^*)$, the Approximate Backbone Purity is defined as the ratio of the cardinality of the intersection of the AB and the backbone to the AB cardinality:

$$ABP = \frac{|a\_bone(V_1, V_2, \ldots, V_M) \cap backbone(V_1^*, V_2^*, \ldots, V_P^*)|}{|a\_bone(V_1, V_2, \ldots, V_M)|}$$
As the AB is used for the selection of the initialization, in order to achieve the best result of the heuristic search we want an AB with both a large ABS and a large ABP value, indicating that most of the required centres are already included in the initialization and that the included centres are closely scattered around the global optimal clustering result. To better understand ABS and ABP, we give an example. Given a data set D containing 500 objects and 10 clusters, each cluster is represented by a representative object. An assumed global optimal clustering result $V^*$ is shown in the first row of Table 7.3.1, and three local suboptimal clustering results $V_1$, $V_2$, $V_3$, obtained by running the K-centre clustering algorithm with three different initializations, are also listed there. For simplicity, we assume that each centre is represented by an object ID in D and that there is only one global optimal clustering result in D, that is, $backbone(V^*) = V^*$.
For this example, we obtain the AB: $a\_bone(V_1, V_2, V_3) = \{43, 78, 198, 240, 310, 366, 480\}$. From the definition of ABS, the value of ABS in this example is calculated as follows:

$$ABS = \frac{|a\_bone(V_1, V_2, V_3)|}{K} = \frac{7}{10} = 0.7$$
Table 7.3.1: Clustering results of D

Name   Centre set
V*     22, 78, 109, 180, 230, 292, 310, 366, 412, 475
V1     43, 78, 109, 198, 240, 262, 310, 366, 412, 480
V2     43, 78, 128, 198, 240, 262, 310, 366, 412, 480
V3     43, 78, 128, 198, 240, 252, 310, 366, 432, 480
We observe that three centres are common to the AB and the backbone, so the value of ABP is:

$$ABP = \frac{|a\_bone(V_1, V_2, V_3) \cap V^*|}{|a\_bone(V_1, V_2, V_3)|} = \frac{3}{7} \approx 0.429$$
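The worked example can be checked directly in code. A minimal sketch, taking the centre sets of Table 7.3.1 as Python sets:

```python
V_star = {22, 78, 109, 180, 230, 292, 310, 366, 412, 475}
V1 = {43, 78, 109, 198, 240, 262, 310, 366, 412, 480}
V2 = {43, 78, 128, 198, 240, 262, 310, 366, 412, 480}
V3 = {43, 78, 128, 198, 240, 252, 310, 366, 432, 480}

ab = V1 & V2 & V3                  # {43, 78, 198, 240, 310, 366, 480}
ABS = len(ab) / 10                 # 7 / 10 = 0.7
ABP = len(ab & V_star) / len(ab)   # 3 / 7 ~= 0.429
```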
According to Definition 7.4, the AB is derived from M local suboptimal solutions, so the characteristics of the AB are closely related to M. To illustrate this relationship, we construct three data sets, RandomS1, RandomS2 and RandomS3, each of which contains 34 clusters. Each cluster has 100 data objects: 99 objects are generated by a Gaussian distribution function with a different mean (µ) and standard deviation (σ), and the last object is the mean of the other 99, which is deemed the centre of the cluster. We run the Vertex Substitution Heuristic (VSH) algorithm [7], a classical K-centre clustering algorithm, on these three data sets, and denote the processes as VSH_RandomS1, VSH_RandomS2 and VSH_RandomS3, respectively. VSH was executed for M = 2:2:20 runs on each data set, where M = 2:2:20 means M changes from 2 to 20 in steps of 2. The relationships between ABS, ABP and M are shown in Fig. 7.3.2.
[Figure 7.3.2: The ABS and ABP of the AB for VSH_RandomS1, VSH_RandomS2 and VSH_RandomS3, plotted against the number of local suboptimal cluster results, M.]
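Data sets in the style of RandomS1–RandomS3 are straightforward to reproduce. A minimal sketch under the stated construction (99 Gaussian samples per cluster plus their mean as the designated centre); the particular ranges for µ and σ and the dimensionality are assumptions, since the chapter does not list them:

```python
import numpy as np

def make_randoms(n_clusters=34, per_cluster=99, dim=2, seed=0):
    """Each cluster: 99 Gaussian objects plus their mean as the centre."""
    rng = np.random.default_rng(seed)
    data, centres = [], []
    for _ in range(n_clusters):
        mu = rng.uniform(-50, 50, size=dim)   # assumed range of cluster means
        sigma = rng.uniform(0.5, 3.0)         # assumed spread
        pts = rng.normal(mu, sigma, size=(per_cluster, dim))
        centre = pts.mean(axis=0)             # the 100th object of the cluster
        data.append(np.vstack([pts, centre]))
        centres.append(centre)
    return np.vstack(data), np.array(centres)
```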
From Fig. 7.3.2 we can see that ABS and ABP change in opposite directions with M: ABS decreases as M increases, whereas ABP increases, and eventually the changes in both become slight and a balanced state is reached. As indicated above, we would like an AB with both a high ABS and a high ABP value in order to construct a good starting point for the K-centre clustering algorithm. Because ABS and ABP change inconsistently with M, we have to choose a trade-off between them to ensure better clustering results. According to the experimental results, choosing a reasonable M can guarantee a better AB. The setting of M is an open question in various applications [19]; in this chapter, we set the value of M experimentally based on the size of the data set and the cluster number.
7.3.2 Heuristic Clustering Algorithm based on Approximate Backbone
7.3.2.1 Reconstructing the New Search Space
A new search space S, determined by the AB, is crucial to heuristic clustering algorithms. Different applications may require different reconstruction methods for the new search space. For example, in the TSP [16], the new search space is reconstructed by including all the edges that do not occur in the AB. While the construction of the AB forms a fraction of the new search space, the number of centres within the AB is usually less than the predefined number of clusters in K-centre clustering. In that case, how to select the remaining centres from the local suboptimal solutions is another important issue. Essentially, a cluster centre that is not included in the AB but occurs frequently in local suboptimal clustering results is more likely to be selected to represent a real cluster. Although we could build the new search space from all the centres occurring in the local suboptimal solutions, in order to reduce its size we select a centre into the new search space only if it is a frequent centre.

DEFINITION 7.7 (Frequent Centre). Given an occurrence threshold β, the local suboptimal clustering results $Z = \{V_1, \ldots, V_M\}$ and the AB $a\_bone(Z)$, for each centre $v_i \in UZ \setminus a\_bone(Z)$, if the occurrence frequency of $v_i$ exceeds β, then $v_i$ is a frequent centre, where $\setminus$ is the set difference operator and $UZ = \bigcup_{m=1}^{M} V_m$.
We use the example shown in Table 7.3.1 to explain the refinement of the new search space. From the table, we find that 13 distinct objects are selected as cluster centres in $V_1$, $V_2$, $V_3$. The AB of this example is $a\_bone(V_1, V_2, V_3) = \{43, 78, 198, 240, 310, 366, 480\}$, so the remaining candidate centres are $UZ \setminus a\_bone(V_1, V_2, V_3) = \{109, 128, 252, 262, 412, 432\}$. If the occurrence frequency of such a centre $v_i$ exceeds β, then it is a frequent centre and will be included in S. According to the big valley phenomenon, here we set the parameter β = M × 80% = 2.4 (for this example M = 3), and we obtain the three additional centres 128, 262 and 412.
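A small sketch of this refinement step, reusing the V1, V2, V3 sets from the earlier code block; the exact threshold comparison is a parameter choice, and β = 2 is an assumption made here so that the chapter's three additional centres are recovered:

```python
from collections import Counter

def frequent_centres(results, ab, beta):
    """Centres outside the AB whose occurrence count across the M
    local suboptimal results reaches the threshold beta."""
    counts = Counter(v for V in results for v in V if v not in ab)
    return {v for v, c in counts.items() if c >= beta}

results = [V1, V2, V3]              # centre sets from Table 7.3.1
ab = set.intersection(*results)     # {43, 78, 198, 240, 310, 366, 480}
S = ab | frequent_centres(results, ab, beta=2)
# Occurrence counts outside the AB: 109, 252 and 432 appear once;
# 128, 262 and 412 appear twice, so frequent_centres(...) == {128, 262, 412}.
```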