CHAPTER 7
Advanced Clustering Analysis
7.1 Introduction
Clustering is a useful approach in data mining processes for identifying
patterns and revealing underlying knowledge from large data collections.
The application areas of clustering include image segmentation, information
retrieval, document classification, association rule mining, web usage
tracking, and transaction analysis. Generally, clustering is defined as the
process of partitioning an unlabeled data set into meaningful groups (clusters)
so that intra-group similarities are maximized and inter-group similarities
are minimized at the same time. In essence, clustering involves the following
unsupervised learning process: define an encoder function $c(x)$ that maps
each data object $x_i$ into a particular group $G_k$ ($c(x_i) = k \Leftrightarrow x_i \in G_k$,
$k = 1, 2, \ldots, K$), so that a cluster criterion

$$Q(c) = \sum_{k=1}^{K} \sum_{c(x_i)=k,\; c(x_j)=k} \mathrm{dist}(x_i, x_j)$$

is minimized.
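To make the criterion concrete, the following sketch computes Q(c) for a
given assignment. The encoder c is represented as an array of cluster labels,
and Euclidean distance stands in for dist; both choices are illustrative
assumptions, since the criterion itself is metric-agnostic.

import numpy as np

def cluster_criterion(X, labels, K):
    """Compute Q(c): the sum of pairwise distances within each cluster.

    X      : (n, d) array of data objects x_1, ..., x_n.
    labels : length-n integer array; labels[i] = k means c(x_i) = k.
    K      : the number of clusters.
    """
    total = 0.0
    for k in range(K):
        members = X[labels == k]  # all x_i with c(x_i) = k
        # Sum dist(x_i, x_j) over all pairs inside cluster k.
        diffs = members[:, None, :] - members[None, :, :]
        total += np.sqrt((diffs ** 2).sum(axis=-1)).sum() / 2  # each pair once
    return total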
As we know, this is a classical combinatorial optimization problem, and
solving it exactly is NP-hard, even with just two clusters [13]. According
to computational complexity theory [36], no complete algorithm can obtain the
globally optimal solution in polynomial time, unless P = NP. The iterative
refinement method, a popular approximation approach, is widely adopted
by various unsupervised learning algorithms. A general iterative refinement
clustering process can be summarized as Algorithm 7.1 [6].
Algorithm 7.1: General iterative refinement clustering
Initialization: Initialize the parameters of the current cluster model.
Refinement: Repeat until the cluster model converges.
(1) Generate the cluster membership assignments for all data objects, based on
the current model;
(2) Refine the model parameters based on the current cluster membership
assignments.
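As a concrete instance of Algorithm 7.1, the sketch below fills in the two
refinement steps with K-means style updates: step (1) assigns each object to
its nearest center, and step (2) recomputes the centers as cluster means.
Testing for unchanged assignments is one common convergence criterion, not
the only one.

import numpy as np

def iterative_refinement_kmeans(X, K, init_centers, max_iter=100):
    """Algorithm 7.1 instantiated with K-means updates."""
    centers = np.array(init_centers, dtype=float)
    labels = np.zeros(len(X), dtype=int)
    for it in range(max_iter):
        # Step (1): generate cluster memberships from the current model.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        new_labels = dists.argmin(axis=1)
        if it > 0 and np.array_equal(new_labels, labels):
            break  # the cluster model has converged
        labels = new_labels
        # Step (2): refine the model parameters from the memberships.
        for k in range(K):
            mask = labels == k
            if mask.any():
                centers[k] = X[mask].mean(axis=0)
    return labels, centers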
Figure 7.1.1: An example of an iterative refinement clustering algorithm
An intuitive depiction of an iterative refinement clustering algorithm
is shown in Fig. 7.1.1. The horizontal axis denotes the feasible solutions of
the clustering problem and the vertical axis denotes the corresponding objective
function values of those solutions.
In this chapter, a feasible solution is a result of the encoder function
(a clustering result) and the objective function value is the value
of the cluster criterion $Q(c) = \sum_{k=1}^{K} \sum_{c(x_i)=k,\; c(x_j)=k} \mathrm{dist}(x_i, x_j)$. Without loss
of generality, we assume that point 3 is selected as the initialization of an
iterative refinement clustering algorithm; by repeating steps (1) and (2),
the algorithm will converge to point 4, a feasible solution with a
suboptimal objective function value. An iterative refinement clustering
algorithm is a popular clustering approach based on heuristic
search, which minimizes the cluster criterion in a designated region
(i.e., guiding the heuristic search with the specified initial values). In this
scenario, the obtained clustering result depends on the initialization,
and the minimized cluster criterion only reflects the sub-optimal solution
found by this run of the heuristic search. In other words, the nature of the
heuristic search process makes iterative refinement clustering algorithms heavily
sensitive to the initialization settings, and thus they cannot guarantee
high-quality clustering results with randomly chosen initializations [5]. Therefore,
how to deal with the initialization sensitivity problem of iterative
refinement clustering algorithms has become an active and widely studied research topic.
An important underlying assumption of iterative refinement clustering
algorithms is that the initialization model must be correct, since it can
determine the clustering solution [6]; that is, different initialization models
will produce different clustering results (or different local minimum points,
as shown in Fig. 7.1.1). Since the problem of obtaining a globally optimal initial
state has been shown to be NP-hard [15], the study of initialization methods
towards a sub-optimal clustering result is more practical, and is of
great value for clustering research. Recently, initialization methods have
been categorized into three major families: random sampling methods,
distance optimization methods, and density estimation methods [17]. Forgy adopted
uniformly random input objects as the seed clusters [14], and MacQueen
gave an equivalent approach that selects the first K input objects as the seed
clusters [29]. In FASTCLUS, a K-means variant implemented in SAS
[37], the simple cluster seeking (SCS) initialization method is adopted [21].
Katsavounidis et al. proposed a method that utilizes the sorted pairwise
distances for initialization [22]. Kaufman and Rousseeuw introduced a
method that estimates the density through pairwise distance comparison,
and initializes the seed clusters using the input objects from areas with
high local density [23]. In Ref. [18], a method which combines local density
approximation and random initialization is proposed. Belal et al. find a
set of medians extracted from the dimension with the maximum variance and then
use these medians as the initialization of K-means [4]. Niu et al. give a novel
algorithm called PR (Pointer Ring), which initializes cluster centers based
on a pointer ring by further partitioning traditional hyper-rectangular units
into hyper-triangle subspaces [33]. The initialization steps of the K-means++
algorithm can be described as follows: choose an initial center $m_1$ uniformly at
random from the data set; then select the next center $m_i = x'$ from the data
set with probability proportional to $\mathrm{dist}(x', m)^2$, where $\mathrm{dist}(x, m)$
denotes the shortest distance from a data object $x$ to the closest center already
chosen; iterate until K centers are found [2]. The
main steps of initializing the centers of K-means by a kd-tree are: first, the
density of the data at various locations is estimated using a kd-tree; then
a modification of the Katsavounidis algorithm, which incorporates
this density information, is used to choose K seeds for the K-means algorithm [39].
More recently, Lu et al. treated the clustering problem as a weighted clustering
problem so as to find a better initial cluster center based on a hierarchical
approach [28].
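As an illustration of the K-means++ seeding just described, a minimal sketch
follows; the squared-distance weighting is the one given in [2], while the
function and variable names are ours.

import numpy as np

def kmeanspp_seeds(X, K, rng=None):
    """K-means++ seeding: the first center is uniform at random; each
    later center is drawn with probability proportional to the squared
    distance to the closest center chosen so far."""
    rng = np.random.default_rng() if rng is None else rng
    centers = [X[rng.integers(len(X))]]  # m_1: uniform at random
    while len(centers) < K:
        # D(x): shortest squared distance from x to the closest center.
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)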
7.2 Space Smoothing Search Methods in Heuristic Clustering
The goal of these modified initialization methods is to reduce the influence
of the sub-optimal solutions (the local minimum points) strewn across the
whole search space, as shown in Fig. 7.1.1. Although iterative refinement
clustering algorithms with these modified initialization methods have
some merit in improving the quality of cluster results, they still have a high
probability of being attracted by local minimum points. The local search method
is the essence of iterative refinement clustering algorithms, and the
many local minimum points make a local search problem hard and sensitive to
the initialization. The proposed modified initialization methods focus only
on how to select an initialization that can improve the quality of an
iterative refinement clustering algorithm, while the search space, in which
many local minimum points are embedded, is ignored. The smoothing search space
method reconstructs the search space by filling in local minimum points, to reduce
their influence. In this chapter, we first design two
smoothing operators to reconstruct the search space by filling the minimum
traps (points), based on the relationship between the distance metric and the
cluster criterion. Each smoothing operator has a parameter, the smoothing factor,
which controls the number of minimum traps. We then give a top-down
clustering algorithm with smoothing search space (TDCS3) to reduce the
influence of initialization. The main steps of TDCS3 are to: (1) dynamically
reconstruct a series of smoothed search spaces as a hierarchical structure,
by filling the local minimum points: the most smoothed search space is at the top,
the original search space is at the bottom, and the other smoothed search spaces
are distributed between them; (2) at the top level of the hierarchical
structure, run an existing iterative refinement clustering algorithm with
random initialization to generate the cluster result; (3) from the second
level down to the bottom level of the hierarchical structure, run the same
clustering algorithm with the initialization derived from the cluster result on
the previous level. A sketch of this flow is given below.
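The top-down flow of TDCS3 can be sketched as follows. Here smooth stands
for one of the smoothing operators designed in this chapter (the operators
actually act on the distance structure of the search space; the sketch
abstracts this as a transformation of the data), and base_cluster is any
existing iterative refinement clustering algorithm; both are placeholders
for this sketch.

def tdcs3(X, K, base_cluster, smooth, factors):
    """Sketch of TDCS3: cluster on a series of smoothed search spaces,
    from the most smoothed (top level) down to the original space.

    base_cluster(X, K, init) -> (labels, centers): an iterative
        refinement clustering algorithm; init=None means random init.
    smooth(X, alpha): a smoothing operator; the smoothing factor alpha
        controls how many minimum traps are filled.
    factors: smoothing factors ordered from most to least smoothing;
        the last one should correspond to the original search space.
    """
    init = None  # the top level starts from a random initialization
    for alpha in factors:
        labels, centers = base_cluster(smooth(X, alpha), K, init)
        init = centers  # the result on this level seeds the next level
    return labels, centers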
Figure 7.2.1: Illustration of smoothing search space (the start solution, a local minimum, and the global optimum are marked)
7.2.1 Smoothing Search Space and Smoothing Operator
7.2.1.1 Local Search and Smoothing Search Space
The local search method is the essence of iterative refinement clustering
algorithms. During the mid-sixties, the local search method was first
proposed to cope with the overwhelming computational intractability of
NP-hard combinatorial optimization problems. Given a minimization (or
maximization) problem with objective function f and feasible region F, a
typical local search algorithm requires that with each solution $x_i \in R^d$
there is associated a predefined neighborhood $N(x_i) \subseteq R^d$. Given a
current solution point $x_i \in R^d$, the set $N(x_i)$ is searched for a point
$x_{i+1}$ with $f(x_{i+1}) < f(x_i)$ (or $f(x_{i+1}) > f(x_i)$ for
maximization). If such a point exists, it becomes the new current solution
point ($x_i \leftarrow x_{i+1}$), and the process is iterated. Otherwise,
$x_i$ is retained as a local optimum with respect to $N(x_i)$. Then a set of
feasible solution points
is generated, and each of them is locally improved within its neighborhood.
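In schematic form, the neighborhood descent just described might look like
the following; the neighbors function defining N(x_i) is problem-specific
and assumed given.

def local_search(x0, f, neighbors):
    """Generic local search for minimization: move to a strictly better
    neighbor while one exists; stop at a local optimum w.r.t. N(x)."""
    x = x0
    while True:
        better = [y for y in neighbors(x) if f(y) < f(x)]
        if not better:
            return x  # x is a local optimum with respect to N(x)
        x = min(better, key=f)  # accept the best improving neighbor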
Local search methods only check the neighborhood of the current feasible
solution $x_i$, so the search range is dramatically reduced and convergence
is accelerated. A major shortcoming of local search is that the algorithm has
a tendency to get stuck at a locally optimal configuration, i.e., a local
minimum point, such as point 2 or point 4 in Fig. 7.1.1. Different
neighborhood structures result in different terrain surface structures of the
search space and produce different numbers of local minimum points. The
effectiveness of a local search algorithm relies on the number of local
minimum points in the search space [16]; that is, local minimum points make
a search problem hard. The smaller the number of local minimum points, the
more effective a local search algorithm is. In order to reduce the influence
of local minima on a local search algorithm, some local minimum traps must be
filled. Gu and Huang [16] called this method of filling minimum traps
smoothing the search space; it dynamically reconstructs the problem structure
and smooths the rugged terrain surface of the search space. The smoothed
search space hides some local minimum points, thereby improving the
performance of the traditional local search algorithm. Figure 7.2.1
illustrates the idea of smoothing the search space.
From Fig. 7.2.1, we can see that many local minimum traps are filled
after running a smoothing operator. The solid curve shows the original
search space, which has many minimum traps, and the dashed curve shows the
smoothed search space with fewer minimum traps. From the foregoing discussion,
we can see that the many local minimum points embedded
in the search space make a local search problem hard and sensitive to the
initialization. Since the essence of iterative refinement clustering algorithms
is the local search method, they share the same underlying reason for the
initialization sensitivity problem. The main idea of smoothing the search space
is always the same, but different application areas have different ways of
smoothing it. In the clustering area, clustering is defined as the process
of partitioning unlabeled data objects into meaningful groups (clusters)
so that the value of the cluster criterion Q(c) is minimized. Minimizing the
value of Q(c) means that the intra-similarities of all clusters are maximized,
or that the distance of each data object to its cluster center is minimized.
So the cluster criterion Q(c) has a close relationship with the similarity or
distance between data objects.
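As one concrete possibility, a distance-based smoothing operator in the
spirit of Gu and Huang's scheme [16] pulls pairwise distances toward their
average, with a smoothing factor alpha >= 1 controlling how strongly the
terrain is flattened (alpha = 1 recovers the original space). The operators
designed for TDCS3 in this chapter are defined separately; this is only an
illustrative sketch, and it assumes the distances have been scaled into [0, 1].

import numpy as np

def smooth_distances(D, alpha):
    """Smooth a pairwise distance matrix D toward its average value:
        d'_ij = d_avg + (d_ij - d_avg)**alpha   if d_ij >= d_avg
        d'_ij = d_avg - (d_avg - d_ij)**alpha   if d_ij <  d_avg
    With D scaled into [0, 1], a larger alpha flattens the terrain and
    hides more local minimum traps; alpha = 1 returns D unchanged."""
    d_avg = D.mean()
    above = D >= d_avg
    Ds = np.empty_like(D, dtype=float)
    Ds[above] = d_avg + (D[above] - d_avg) ** alpha
    Ds[~above] = d_avg - (d_avg - D[~above]) ** alpha
    return Ds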