CHAPTER 4
Clustering Analysis
4.1 Clustering Analysis
Clustering analysis is an important learning method that requires no prior knowledge. Clustering is usually performed when no information is available concerning the membership of data items in predefined classes; for this reason, clustering is traditionally seen as part of unsupervised learning. We nevertheless speak here of unsupervised clustering to distinguish it from a more recent and less common approach that makes use of a small amount of supervision to "guide" or "adjust" clustering. In this chapter, we focus on this unsupervised learning method. The aim of clustering analysis is to divide data into groups (clusters) that are meaningful, useful, or both. For meaningful groups, the goal of clustering is to capture the natural structure of the data. In some cases, however, clustering is only a useful starting point for other purposes, such as data summarization. Whether for understanding or utility, clustering analysis has played an important role in a wide variety of fields: computer science, pattern recognition, information retrieval, machine learning, biology, data mining, etc. Many data mining queries are concerned either with how the data objects are grouped or with which objects could be considered remote from natural groupings. There has been much work on cluster analysis, but we are now witnessing a significant resurgence of interest in new clustering techniques. Scalability and high dimensionality are not the only focus of recent research in clustering analysis. Indeed, it is getting difficult to keep track of all the new clustering strategies, their advantages and shortcomings. The following are the typical requirements for a good clustering technique in data mining [30, 29]:
Scalability: The clustering method should be applicable to huge databases, and its running time should grow no worse than linearly with the size of the data.
Versatility: Clustering objects can be of different types: numerical data, boolean data, or categorical data. Ideally, a clustering method should be suitable for all of these types of data objects.
Ability to discover clusters with different shapes: This is an important requirement for spatial data clustering. Many clustering algorithms can only discover clusters with spherical shapes.
Minimal input parameters: The method should require a minimum amount of domain knowledge for correct clustering. However, most current clustering algorithms have several key parameters and are thus not practical for use in real-world applications.
Robustness with regard to noise: This is important because noise exists everywhere in practical problems. A good clustering algorithm should be able to perform successfully even in the presence of a great deal of noise.
Insensitivity to the data input order: The clustering method should give consistent results irrespective of the order in which the data is presented.
Scalability to high dimensionality: The ability to handle high dimensionality is very challenging, but real data sets are often multidimensional.
There is no single algorithm that can fully satisfy all the above requirements. It is important to understand the characteristics of each algorithm so that the proper algorithm can be selected for the clustering problem at hand. Recently, several new clustering techniques have offered useful advances, possibly even complete solutions. During the past decades, clustering analysis has been used to deal with practical problems in many applications, as summed up by Han [30, 29].
Biology. Biologists have spent many years creating a taxonomy (hierarchical classification) of all living things: kingdom, class, order, family, genus and species. More recently, biologists have applied clustering to analyze the large amounts of genetic information that are now available. For example, clustering has been used to find groups of genes that have similar functions in high-dimensional gene data.
Information Retrieval. The World Wide Web consists of billions of Web pages, and a query to a search engine can return thousands of pages. Clustering can be used to group these search results into a small number of clusters, each of which captures a particular aspect of the query.
Climate. Understanding the earth's climate requires finding patterns in the atmosphere and ocean. To that end, clustering analysis has been applied to find patterns in the atmospheric pressure of polar regions and areas of the ocean that have a significant impact on land climate.
Psychology and Medicine. An illness or condition frequently has a number of variations, and cluster analysis can be used to identify these different subcategories. For example, clustering has been used to identify types of depression. Cluster analysis can also be used to detect patterns in the spatial or temporal distribution of a disease.
Business. Businesses collect large amounts of information on current and
potential customers. Clustering can be used to segment customers into a
small number of groups for additional analysis and marketing activities.
This chapter provides an introduction to clustering analysis. We begin with a discussion of the types of data encountered in clustering analysis, and then introduce some traditional clustering algorithms that can handle low-dimensional data. High dimensionality poses a new challenge for clustering analysis, and many high-dimensional clustering algorithms have been proposed by researchers. Constraint-based clustering is a kind of semi-supervised learning method, and it will be briefly discussed in this chapter as well. Consensus clustering operates on the clustering results derived by other, traditional clustering algorithms; it is a new method for improving the quality of a clustering result.
4.2 Types of Data in Clustering Analysis
Clustering analysis methods are used in many different application areas, so many different types of data sets are encountered in clustering. Data sets are made up of data objects (also referred to as samples, examples, instances, data points, or objects), and a data object represents an entity. For example, in a sales database, the objects may be customers, store items and sales; in a medical database, the objects may be patients; in a university database, the objects may be students, courses, professors and salaries; in a webpage database, the objects may be users, links and pages; in a tagging database, the objects may be users, tags and resources, and so on. In the clustering scenario, there are two traditional ways to organize the data objects: the data matrix and the proximity matrix.
4.2.1 Data Matrix
A set of objects is represented as an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute. This matrix has different names, e.g., pattern matrix or data matrix, depending on the particular field. Figure 4.2.1 below provides a concrete example of web usage data objects and their corresponding data matrix, where s_i, i = 1, ..., m, denotes the m user sessions and p_j, j = 1, ..., n, denotes the n pages; a_ij = 1 indicates that s_i has visited p_j, and a_ij = 0 otherwise. Because different attributes may be measured on different scales, e.g., centimeters and kilograms, the data is sometimes transformed before being used. In cases where the range of values differs widely from attribute to attribute, these differing attribute scales can dominate the results of the cluster analysis, and it is common to standardize the data so that all attributes are on the same scale. To introduce these approaches clearly, we denote by x_i the i-th data object, by x_ij the value of the j-th attribute of the i-th object, and by x'_ij the standardized attribute value. Some common approaches for data standardization, illustrated in the code sketch after the list, are as follows:
(1) x'_ij = x_ij / max_i |x_ij|. Divide each attribute value of an object by the maximum observed absolute value of that attribute. This restricts all attribute values to lie between -1 and 1. If all the values are positive, all transformed values lie between 0 and 1. This approach may not produce good results unless the attributes are uniformly distributed, and it is also sensitive to outliers.
(2) x'_ij = (x_ij - μ_j) / σ_j. For each attribute value, subtract the mean of that attribute and then divide by the standard deviation of the attribute, where μ_j = (1/m) Σ_{i=1..m} x_ij is the mean of the j-th feature and σ_j = ((1/m) Σ_{i=1..m} (x_ij - μ_j)²)^{1/2} is the standard deviation of the j-th feature. Kaufman et al. indicate that if the data are normally distributed, then most transformed attribute values will lie between -1 and 1. This approach makes no assumption about the data distribution, but, like the first approach, it is sensitive to outliers.
(3) x'_ij = (x_ij - μ_j) / σ_j^A. For each attribute value, subtract the mean of that attribute and divide by the attribute's absolute deviation (KR 90), where σ_j^A = (1/m) Σ_{i=1..m} |x_ij - μ_j| is the absolute deviation of the j-th attribute. Typically, most attribute values will lie between -1 and 1. This approach is the most robust of the three in the presence of outliers.
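The three standardization approaches above can be implemented in a few lines. Below is a minimal sketch in Python with NumPy; the function and variable names (scale_by_max, z_score, robust_scale, X) are illustrative assumptions rather than notation from the text, and each function standardizes the columns (attributes) of an m by n data matrix.

    import numpy as np

    def scale_by_max(X):
        # Approach (1): divide each attribute by its maximum observed
        # absolute value, so all values lie in [-1, 1].
        return X / np.abs(X).max(axis=0)

    def z_score(X):
        # Approach (2): subtract the attribute mean and divide by the
        # standard deviation (population form, dividing by m).
        return (X - X.mean(axis=0)) / X.std(axis=0)

    def robust_scale(X):
        # Approach (3): subtract the attribute mean and divide by the
        # mean absolute deviation, which is more robust to outliers.
        mu = X.mean(axis=0)
        return (X - mu) / np.abs(X - mu).mean(axis=0)

    # Toy data: three objects measured in centimeters and kilograms.
    X = np.array([[170.0, 65.0],
                  [160.0, 50.0],
                  [180.0, 80.0]])
    print(z_score(X))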
Figure 4.2.1: An example of a data matrix: an m by n matrix whose rows are the user sessions s_1, ..., s_m, whose columns are the pages p_1, ..., p_n, and whose entries are the visit indicators a_ij.
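As a concrete illustration of Figure 4.2.1, the following sketch builds a small binary session-page matrix from raw (session, page) visit records. The variable names (sessions, pages, visits, A) and the toy records are illustrative assumptions.

    import numpy as np

    sessions = ["s1", "s2", "s3"]            # the m user sessions
    pages = ["p1", "p2", "p3", "p4"]         # the n pages
    visits = [("s1", "p1"), ("s1", "p3"),    # raw visit records
              ("s2", "p2"), ("s3", "p1"), ("s3", "p4")]

    row = {s: i for i, s in enumerate(sessions)}
    col = {p: j for j, p in enumerate(pages)}

    A = np.zeros((len(sessions), len(pages)), dtype=int)
    for s, p in visits:
        A[row[s], col[p]] = 1  # a_ij = 1 iff session s_i visited page p_j

    print(A)  # the m by n data matrix of Figure 4.2.1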
4.2.2 The Proximity Matrix
While cluster analysis sometimes uses the original data matrix, many clustering algorithms use a similarity matrix, S, or a dissimilarity matrix, D. For convenience, both matrices are commonly referred to as a proximity matrix, P. A proximity matrix, P, is an m by m matrix containing all the pairwise dissimilarities or similarities between the objects being considered. If x_i and x_j are the i-th and j-th objects, respectively, then the entry p_ij is the similarity, s_ij, or the dissimilarity, d_ij, between x_i and x_j, where p_ij denotes the element at the i-th row and j-th column of the proximity matrix P. For simplicity, we will use p_ij to represent either s_ij or d_ij. Figure 4.2.2 gives an example of a proximity matrix from social tagging mining, where each element of the matrix is the cosine similarity between two tags.

    1
    0.41  1
    0     0.50  1
    0.58  0     0     1
    0.41  0.50  0.50  0.71  1

Figure 4.2.2: An example of a proximity matrix (the lower triangle of the symmetric matrix is shown)
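A proximity matrix such as the one in Figure 4.2.2 can be computed directly from a data matrix. Below is a minimal sketch that computes pairwise cosine similarities between the rows of a matrix X; the data values are made up for illustration and are not the tag vectors behind Figure 4.2.2.

    import numpy as np

    # Each row is an object (e.g., a tag's occurrence vector).
    X = np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0],
                  [1.0, 1.0, 0.0]])

    # Normalize each row to unit length; then P[i, j] is the cosine
    # similarity s_ij between objects x_i and x_j.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / norms
    P = Xn @ Xn.T

    print(np.round(P, 2))  # symmetric m by m matrix with 1s on the diagonal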
As the descriptions above show, data objects are typically described by attributes. An attribute is a data field that represents a characteristic or feature of a data object. In the literature, the nouns attribute, dimension, feature, and variable are often used interchangeably. The term "dimension" is commonly used in data warehousing, the term "feature" tends to be used in machine learning, and the term "variable" is favored by statisticians. In the data mining and database areas, researchers prefer the term "attribute". For example, the attributes describing a user object in a webpage database can include user_ID, page_1, page_2, ..., page_n. A set of attributes used to describe a given object is called an attribute vector or feature vector. The distribution of data involving one attribute is called univariate, a bivariate distribution involves two attributes, and so on. The type of an attribute is determined by its possible values: nominal, binary, ordinal or numeric.
Nominal Attributes. The values of a nominal attribute are symbols or names of things; that is, "nominal" relates to names. Each value represents some kind of category, code, or state, and there is no meaningful order among the values of this kind of attribute. As we know, in some cases,