CHAPTER 4
Clustering Analysis
4.1 Clustering Analysis
Clustering analysis is an important learning method that requires no prior knowledge. Clustering is usually performed when no information is available concerning the membership of data items in predefined classes; for this reason, clustering is traditionally seen as part of unsupervised learning. We nevertheless speak here of unsupervised clustering to distinguish it from a more recent and less common approach that makes use of a small amount of supervision to "guide" or "adjust" clustering. In this chapter, we focus on this unsupervised learning method. The aim of clustering analysis is to divide data into groups (clusters) that are meaningful, useful, or both. For meaningful groups, the goal of clustering is to capture the natural structure of the data. In some cases, however, clustering is only a useful starting point for other purposes, such as data summarization. Whether for understanding or utility, clustering analysis has played an important role in a wide variety of fields: computer science, pattern recognition, information retrieval, machine learning, biology, data mining, etc. Many data mining queries are concerned either with how the data objects are grouped or with which objects could be considered remote from natural groupings. There has been much work on cluster analysis, but we are now witnessing a significant resurgence of interest in new clustering techniques. Scalability and high dimensionality are not the only focus of recent research in clustering analysis. Indeed, it is getting difficult to keep track of all the new clustering strategies, their advantages and shortcomings. The following are the typical requirements for a good clustering technique in data mining [30, 29]:
Scalability: The clustering method should be applicable to huge databases, and its running time should grow no worse than linearly with the size of the data.
Versatility: Clustering objects can be of different types: numerical data, boolean data, or categorical data. Ideally, a clustering method should be suitable for all of these types of data objects.
Ability to discover clusters with different shapes: This is an important requirement for spatial data clustering. Many clustering algorithms can only discover clusters with spherical shapes.
Minimal input parameters: The method should require a minimum amount of domain knowledge for correct clustering. However, most current clustering algorithms have several key parameters and are thus not practical for use in real-world applications.
Robustness with regard to noise: This is important because noise exists everywhere in practical problems. A good clustering algorithm should be able to perform successfully even in the presence of a great deal of noise.
Insensitivity to the data input order: The clustering method should give consistent results irrespective of the order in which the data is presented.
Scalability to high dimensionality: The ability to handle high dimensionality is very challenging, but real data sets are often multidimensional.
There is no single algorithm that can fully satisfy all the above requirements. It is important to understand the characteristics of each algorithm so that the proper algorithm can be selected for the clustering problem at hand. Recently, several new clustering techniques have offered useful advances, possibly even complete solutions. During the past decades, clustering analysis has been used to deal with practical problems in many applications, as summed up by Han [30, 29].
Biology. Biologists have spent many years creating a taxonomy (hierarchical classification) of all living things: kingdom, class, order, family, genus and species. More recently, biologists have applied clustering to analyze the large amounts of genetic information that are now available. For example, clustering has been used to find groups of genes that have similar functions in high-dimensional gene data.
Information Retrieval. The World Wide Web consists of billions of Web pages, and a query to a search engine can return thousands of pages. Clustering can be used to group these search results into a small number of clusters, each of which captures a particular aspect of the query.
Climate. Understanding the earth's climate requires finding patterns in the atmosphere and ocean. To that end, clustering analysis has been applied to find patterns in the atmospheric pressure of polar regions and areas of the ocean that have a significant impact on land climate.
Psychology and Medicine. An illness or condition frequently has a number of variations, and cluster analysis can be used to identify these different subcategories. For example, clustering has been used to identify types of depression. Cluster analysis can also be used to detect patterns in the spatial or temporal distribution of a disease.
Business. Businesses collect large amounts of information on current and
potential customers. Clustering can be used to segment customers into a
small number of groups for additional analysis and marketing activities.
This chapter provides an introduction to clustering analysis. We begin with a discussion of the types of data encountered in clustering analysis, and then introduce some traditional clustering algorithms that can handle low-dimensional data. High dimensionality poses a new challenge for clustering analysis, and many high-dimensional clustering algorithms have been proposed by researchers. Constraint-based clustering is a kind of semi-supervised learning method, and it will be briefly discussed in this chapter as well. Consensus clustering operates on the clustering results derived by other, traditional clustering algorithms; it is a new method for improving the quality of a clustering result.
4.2 Types of Data in Clustering Analysis
Clustering analysis methods are used in many different application areas, so many different types of data sets are encountered in clustering. Data sets are made up of data objects (also referred to as samples, examples, instances, data points, or objects), and a data object represents an entity. For example, in a sales database, the objects may be customers, store items and sales; in a medical database, the objects may be patients; in a university database, the objects may be students, courses, professors and salaries; in a webpage database, the objects may be users, links and pages; in a tagging database, the objects may be users, tags and resources, and so on. In the clustering scenario, there are two traditional ways to organize the data objects: the data matrix and the proximity matrix.
4.2.1 Data Matrix
A set of objects is represented as an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute. This matrix has different names, e.g., pattern matrix or data matrix, depending on the particular field. Figure 4.2.1 below provides a concrete example of web usage data objects and their corresponding data matrix, where s_i, i = 1, ..., m, denotes the m user sessions and p_j, j = 1, ..., n, denotes the n pages; a_ij = 1 indicates that s_i has visited p_j, and a_ij = 0 otherwise. Because different attributes may be measured on different scales, e.g., centimeters and kilograms, the data is sometimes transformed before being used. In cases where the range of values differs widely from attribute to attribute, these differing attribute scales can dominate the results of the cluster analysis, and it is common to standardize the data so that all attributes are on the same scale. To introduce these approaches clearly, we denote by x_i the i-th data object, by x_ij the value of the j-th attribute of the i-th object, and by x'_ij the standardized attribute value. Some common approaches for data standardization, illustrated in the code sketch after the list, are as follows:
(1) x'_ij = x_ij / max_i |x_ij|. Divide each attribute value of an object by the maximum observed absolute value of that attribute. This restricts all attribute values to lie between -1 and 1. If all the values are positive, all transformed values lie between 0 and 1. This approach may not produce good results unless the attributes are uniformly distributed, and it is also sensitive to outliers.
(2) x'_ij = (x_ij - μ_j) / σ_j. For each attribute value, subtract the mean of that attribute and then divide by the standard deviation of the attribute, where μ_j = (1/m) Σ_{i=1..m} x_ij is the mean of the j-th feature and σ_j = ((1/m) Σ_{i=1..m} (x_ij - μ_j)²)^{1/2} is the standard deviation of the j-th feature. Kaufman et al. indicate that if the data are normally distributed, then most transformed attribute values will lie between -1 and 1. This approach makes no assumption about the data distribution, but, like the first approach, it is sensitive to outliers.
(3) x'_ij = (x_ij - μ_j) / σ_j^A. For each attribute value, subtract the mean of that attribute and divide by the attribute's absolute deviation (KR 90), where σ_j^A = (1/m) Σ_{i=1..m} |x_ij - μ_j| is the absolute deviation of the j-th attribute. Typically, most attribute values will lie between -1 and 1. This approach is the most robust of the three in the presence of outliers.
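The three standardization approaches above can be implemented in a few lines. Below is a minimal sketch in Python with NumPy; the function and variable names (scale_by_max, z_score, robust_scale, X) are illustrative assumptions rather than notation from the text, and each function standardizes the columns (attributes) of an m by n data matrix.

    import numpy as np

    def scale_by_max(X):
        # Approach (1): divide each attribute by its maximum observed
        # absolute value, so all values lie in [-1, 1].
        return X / np.abs(X).max(axis=0)

    def z_score(X):
        # Approach (2): subtract the attribute mean and divide by the
        # standard deviation (population form, dividing by m).
        return (X - X.mean(axis=0)) / X.std(axis=0)

    def robust_scale(X):
        # Approach (3): subtract the attribute mean and divide by the
        # mean absolute deviation, which is more robust to outliers.
        mu = X.mean(axis=0)
        return (X - mu) / np.abs(X - mu).mean(axis=0)

    # Toy data: three objects measured in centimeters and kilograms.
    X = np.array([[170.0, 65.0],
                  [160.0, 50.0],
                  [180.0, 80.0]])
    print(z_score(X))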
Figure 4.2.1: An example of a data matrix: an m by n matrix whose rows are the user sessions s_1, ..., s_m, whose columns are the pages p_1, ..., p_n, and whose entries are the visit indicators a_ij.
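As a concrete illustration of Figure 4.2.1, the following sketch builds a small binary session-page matrix from raw (session, page) visit records. The variable names (sessions, pages, visits, A) and the toy records are illustrative assumptions.

    import numpy as np

    sessions = ["s1", "s2", "s3"]            # the m user sessions
    pages = ["p1", "p2", "p3", "p4"]         # the n pages
    visits = [("s1", "p1"), ("s1", "p3"),    # raw visit records
              ("s2", "p2"), ("s3", "p1"), ("s3", "p4")]

    row = {s: i for i, s in enumerate(sessions)}
    col = {p: j for j, p in enumerate(pages)}

    A = np.zeros((len(sessions), len(pages)), dtype=int)
    for s, p in visits:
        A[row[s], col[p]] = 1  # a_ij = 1 iff session s_i visited page p_j

    print(A)  # the m by n data matrix of Figure 4.2.1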
4.2.2 The Proximity Matrix
While cluster analysis sometimes uses the original data matrix, many clustering algorithms use a similarity matrix, S, or a dissimilarity matrix, D. For convenience, both matrices are commonly referred to as a proximity matrix, P. A proximity matrix, P, is an m by m matrix containing all the pairwise dissimilarities or similarities between the objects being considered. If x_i and x_j are the i-th and j-th objects, respectively, then the entry p_ij is the similarity, s_ij, or the dissimilarity, d_ij, between x_i and x_j, where p_ij denotes the element at the i-th row and j-th column of the proximity matrix P. For simplicity, we will use p_ij to represent either s_ij or d_ij. Figure 4.2.2 gives an example of a proximity matrix from social tagging mining, where each element of the matrix is the cosine similarity between two tags.

    1
    0.41  1
    0     0.50  1
    0.58  0     0     1
    0.41  0.50  0.50  0.71  1

Figure 4.2.2: An example of a proximity matrix (the lower triangle of the symmetric matrix is shown)
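A proximity matrix such as the one in Figure 4.2.2 can be computed directly from a data matrix. Below is a minimal sketch that computes pairwise cosine similarities between the rows of a matrix X; the data values are made up for illustration and are not the tag vectors behind Figure 4.2.2.

    import numpy as np

    # Each row is an object (e.g., a tag's occurrence vector).
    X = np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0],
                  [1.0, 1.0, 0.0]])

    # Normalize each row to unit length; then P[i, j] is the cosine
    # similarity s_ij between objects x_i and x_j.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / norms
    P = Xn @ Xn.T

    print(np.round(P, 2))  # symmetric m by m matrix with 1s on the diagonal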
As the descriptions above show, data objects are typically described by attributes. An attribute is a data field that represents a characteristic or feature of a data object. In the literature, the nouns attribute, dimension, feature, and variable are often used interchangeably. The term "dimension" is commonly used in data warehousing, the term "feature" tends to be used in machine learning, and the term "variable" is favored by statisticians. In the data mining and database areas, researchers prefer the term "attribute". For example, the attributes describing a user object in a webpage database can include user_ID, page_1, page_2, ..., page_n. A set of attributes used to describe a given object is called an attribute vector or feature vector. The distribution of data involving one attribute is called univariate, a bivariate distribution involves two attributes, and so on. The type of an attribute is determined by its possible values: nominal, binary, ordinal or numeric.
Nominal Attributes. The values of a nominal attribute are symbols or names of things; that is, "nominal" relates to names. Each value represents some kind of category, code, or state, and there is no meaningful order among the values of this kind of attribute. As we know, in some cases,