Chapter 7: Advanced Clustering Analysis (5/6)

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

7.4.3.2 Deﬁ nition of SLDA

Let D = x

, x

, ..., x

be a data set of N d-dimensional data objects. Let

A = attr

, attr

, ..., attr

be the set of attributes of the data objects so that x

¢ dom(attr

), where dom(attr

) denotes the domain of the attribute attr

, j = 1,

..., d. Without losing the generality, we assume that all the attributes have

been normalized, i.e., dom(attr

) ¢ [0, 1]. In this section, we ﬁ rst construct

a hyper-rectangle structure of data object x based on the kernel width W*

and determine the corresponding subset of attributes S by running Rodeo;

and then, conduct a spatial statistical test on the hyper-rectangle to decide

whether it is a SLDA around the data object x. To better describe the process

of determining the SLDA, we introduce the following deﬁ nitions.

Deﬁ nition 7.6 (Hyper-Rectangle). Given a subset of attributes S and a kernel

width W* = width

, ...,width

of data object x, the hyper-rectangle structure

H around x can be constructed as : H = I

???I

, where I

= [x

?width

/2, x

+width

/2], j = 1, p, p = dim(S).

0.8

0.6

0.4

0.2

Dimension 4

0 0.5 1

Dimension 4

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

1234

Dime nsi ons

Kernel withd

Ke r ne l w id t h

0.8

0.6

0.4

0.2

Dimension 2

0 0.5 1

Dimension 1

(a) Distribution on dimension 1 and 2 (b) Distribution on dimension 3 and 4 (c) the kernel width of each dimension

Figure 7.4.1: The distribution and the kernel width

The volume of hyper-rectangle around the data object x is vol(H) = 2

= 1

width

. The number of data objects within the hyper-rectangle H can be

used to indicate the local density around x.

Deﬁ nition 7.6 (Data Objects in Hyper-Rectangle). Given a data set D and

a hyper-rectangle H around a data object x in the subset of attributes S, the

data objects located in H are deﬁ ned as: remm(H) = x

¢ D | (x

− width

/2)

≤ x

j ≤ (x

− width

/2), i = 1, ...,N; j = 1, ..., p.

Deﬁ nition 7.7 (Local Density of Hyper-Rectangle). Given the identiﬁ ed

data object set remm(H) of a data object x , the local density around x in the

subset of attributes S is deﬁ ned as: LS(H) =|remm(H)|, where |·| means

the number of objects within the hyper-rectangle.

Known from the spatial statistical theory of assigning N data points

in a space, the number of data points which are assigned in a bounded

area is subject to the Binomial distribution with parameters of N and the

Advanced Clustering Analysis 173

174 Applied Data Mining

volume of the bound area [16]. Here H is the hyper-rectangle in the subset

of attributes, which is a bounded area, and the local density LS(H) is the

number of data objects assigned in H. So LS(H) is subject to the Binomial

distribution with parameter N and vol(H), i.e.,

LS(H) ´ Binomial(N, vol(H)).

To decide whether H is a dense hyper-rectangle, we run a null

hypothesis statistical test on H.

: Hyper-rectangle H in S contains LS(H) data objects.

The signiﬁ cant level α of the statistical hypothesis is a ﬁ xed probability

of wrongly rejecting the null hypothesis, when in fact it is true. α is also called

the rate of false positives or the probability of type I error. The critical value

of the statistical hypothesis test is a threshold which the value of the test

statistic is compared to determine whether the null hypothesis is rejected.

There are two test methods for hypothesis test: one-side test and two-side

test. For the one-side test, the critical value θ is computed based on α =

p(LS(H) > θ). For the two-side test, the computation of left critical value θ

is the same as the one-side test, but the right critical value θ

is computed

based on α = p(LS(H) > θ

) Where is a probability function.

Deﬁ nition 7.8 (Signiﬁ cant Local Dense Area). Let H be a hyper-rectangle

in the subset of attributes S. Let α be a signiﬁ cant level and θ be the critical

value computed at the signiﬁ cant level α based on the one-side test, where

the probability is computed using Binomial(N, vol(H)). If LS(H) > θ, (H,

S) is deemed as a Signiﬁ cant Local Density Area (SLDA) around the data

object x.

0.8

0.6

0.4

0.2

Dimension 2

0 0.5 1

Dimension 1

Figure 7.4.2: Hyper-rectangle of dataset D

7.4.4 Projective Clustering based on SLDAs

7.4.4.1 Finding SLDAs

After using the Rodeo method, the subset of attributes S and the kernel

width w* of a data object x are generated. Based on S and w*, a hyper-

rectangle structure H is constructed around x. We also use the data set

described in Section 2.2 as an example to illustrate the constructed hyper-

rectangle. The subset of attributes having important supports on kernel

estimation is on dimensions 1 and 2.

The example of H is shown in Fig. 7.4.2. In this ﬁ gure, there are two

hyper-rectangles H

, H

around two randomly selected data objects. But

the local density values of each hyper-rectangle are different. Based on the

Deﬁ nitions 7.6 and 7.7, the local density value of the dashed frame (H

) is

2, whereas the local density value of the real line frame (H

) is 5. According

to the spatial statistical hypothesis described in Section 2.3, we set the

signiﬁ cant level α = 0.001 and the value of θ calculated by the one-side

statistical test is 3.21. So the dashed frame is not a signiﬁ cant local density

area as LS(H

) < θ, and it will be deleted.

The greedy search method, G_SLDA as shown in Algorithm 7.6, is

utilized to ﬁ nd the whole signiﬁ cant local dense areas which can cover

the data distribution of D in the d-dimensional space. The main steps of

G_SLDA are to:

(1) randomly select a data object x from D and calculate the subset of

attributes S and the kernel width W* using Rodeo (steps 3–4);

(2) create a hyper-rectangle structure H around x based on Deﬁ nition 7.5,

and obtain the local dense value of H based on Deﬁ nition 7.6 and 7.7

(steps 5–6);

Algorithm 7.6: G_SLAD

Input: Data set D

Output: SLDA set SLD

1. SLD Ø;

2. Loop

3. Randomly select a data object x from D;

4. [S,W*] =Rodeo(D, x);

5. H = CreateH(x, S, W*);

6. remm(H) = G_remm(D, H);

7. Run a statistical test of LS(H)=|remm(H)|on H according to deﬁ nition 4;

8. If LS(H) θ

9. SLD = SLD  (H, S);

10. x and the data object in remm(H) are signed as visited;

11. end

12. Until all data objects in D are visited

13. return SLD

Advanced Clustering Analysis

175

176 Applied Data Mining

(3) decide whether H is a Signiﬁ cant Local Dense Area based on Deﬁ nition

7.8. If H is a signiﬁ cant local dense area, store the pair (H, S) into SLD

set, and label x and the data objects in H as visited, otherwise, discard

the obtained hyper-rectangle H (steps 7–11). These steps iteratively

run until all data objects have been processed.

7.4.4.2 Generating Clusters by Merging SLDAs

G_SLDA ﬁ nd out all the signiﬁ cant local density areas of data set D. Each

SLDA, (H, S), in SLD contain a density hyper-rectangle satisfying LS(H) ≥ θ

and its relevant subset of attribute S. The main structure of the density area

in a subset of attributes can be captured by all the density hyper-rectangles

embedded in it. To further understand the relationship between the density

area and signiﬁ cant local density area, we give an example to show it. Given

a data set D used in Fig. 7.4.1, each object is described by four attributes,

the range of each attribute is in [0,1]. The data of the ﬁ rst two attributes is

subject to a normal distribution and the data of the last two attributes is

subject to a uniform distribution.

From the above discussion, we know that the identiﬁ ed subset of

attributes is on S = 1, 2 denoted by nodes in Fig. 7.4.2, and the SLDAs

generated by G_SLDA on S = 1, 2 are represented by six solid rectangles (i.e.,

, H

). From Fig. 7.4.3, we therefore can conclude that the main structure

of data distribution is characterized by six dense hyper-rectangles.

Figure 7.4.3: Example of the relationship between SLDAs and the density area

0.8

0.6

0.4

0.2

Dimension 2

0 0.5 1

Dimension 1

The clustering result of projective clustering is represented by the data

objects which are located in small ranges in its speciﬁ c subspace. As for

SLDA, the various Hs do indicate these small ranges in S. The data objects

that are scattered in deferent hyper-rectangle Hs but within the same

subspace S constitute the different parts of a dense area in S. To obtain the

fully projected clusters, we need to merge these hyper-rectangles within

the same subspace S to generate its corresponding cluster. More concretely,

the clusters derived by a projective clustering algorithm are represented by

all the dense areas and their related subsets of attributes.

Algorithm 7.7: MC_SLDA

Input. SLD

Output. Clustering results and outliers

(1) Divide SLD into several subsets;

(2) For each subset of SLD, a single-linkage merger algorithm is run to ﬁ nd out the

clustering results;

(3) Reﬁ ne the clustering results

Hence we merge all the dense hyper-rectangles in the same subset of

attributes to generate the clusters. Further, for the example shown in Fig.

7.4.3, the cluster in the subset of attributes S = 1, 2 is (

= 1

, (1, 2)), where



= 1

denotes the merger processing of density hyper-rectangles. The

three major steps of the merger clustering algorithm on SLDAs, named

MC_SLDA, are to (1) divide SLDAs into several subsets so that the hyper-

rectangles within one attribute subset have the same subspace S; (2) run a

single-linkage merger algorithm on these subsets to ﬁ nd the fully projected

signiﬁ cant local dense area; (3) reﬁ ne the clustering results and detecting

the outliers. The pseudo codes of MC_SLDA are detailed in Algorithm 2.

The data objects which are not included in any clusters are denoted as Rest

= D(

k=1

where

is the set different operator. In the clustering result reﬁ nement, we use

the reassign method proposed in [39] to assign data objects in Rest to the

corresponding clusters. After the reﬁ nement, the data objects which do not

belong to any clusters can be regarded as outliers, and an outlier collection

is generated.

Hyper-rectangle structure is often used in ﬁ nding the density area

in high dimensional data sets. The determination of the width of hyper-

rectangle structure is a crucial task in high dimension clustering applications.

The majority of projective clustering algorithms use the restrictive model

to determine the width of hyper-rectangle, which has signiﬁ cant efforts on

discovering real clustering results. Inspired by the kernel density method,

we present a new way to design the hyper-rectangle structure, whose width

is determined by the true data distribution. In order to examine whether

a hyper-rectangle structure is a Signiﬁ cant Local Density Area, we run a

Advanced Clustering Analysis 177

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Chapter 7: Advanced Clustering Analysis (5/6)

Create new playlist

Sign In

Sign Up

Table of Contents for
Chapter 7: Advanced Clustering Analysis (5/6)