depression. Cluster analysis can also be used to detect patterns in the spatial
or temporal distribution of a disease.
Business. Businesses collect large amounts of information on current and
potential customers. Clustering can be used to segment customers into a
small number of groups for additional analysis and marketing activities.
This chapter provides an introduction to clustering analysis. We
begin with a discussion of the data types encountered in clustering
analysis, and then introduce some traditional clustering algorithms
that are suited to clustering low-dimensional data. High-dimensional
data poses a newer challenge for clustering analysis, and many
high-dimensional clustering algorithms have been proposed by researchers.
Constraint-based clustering is a kind of semi-supervised
learning method, and it is briefly discussed in this chapter as well.
Consensus clustering operates on the clustering results produced by
other traditional clustering algorithms; it is a newer method for improving
the quality of clustering results.
4.2 Types of Data in Clustering Analysis
Clustering analysis methods are used in many different
application areas, so different types of data sets are
encountered. A data set is made up of data objects (also referred to as samples,
examples, instances, data points, or objects), and each data object represents
an entity. For example, in a sales database, the objects may be customers,
store items and sales; in a medical database, the objects may be patients; in
a university database, the objects may be students, courses, professors and
salaries; in a webpage database, the objects may be users, links and pages; in a
tagging database, the objects may be users, tags and resources, and so on.
In clustering, there are two traditional ways to organize the data
objects: the data matrix and the proximity matrix.
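The two organizations can be illustrated with a minimal sketch. The toy web usage data, the function names, and the choice of Euclidean distance as the proximity measure are all assumptions made for illustration; other proximity measures could be used in the same way.

```python
from math import sqrt

# Hypothetical toy data: 4 user sessions and 3 pages.
sessions = ["s1", "s2", "s3", "s4"]
pages = ["p1", "p2", "p3"]
visits = {
    "s1": {"p1", "p2"},
    "s2": {"p2"},
    "s3": {"p1", "p2", "p3"},
    "s4": {"p3"},
}


def build_data_matrix(sessions, pages, visits):
    """Return an m-by-n 0/1 matrix: one row per session, one column per page.

    Entry [i][j] is 1 if session i visited page j, otherwise 0.
    """
    return [[1 if p in visits[s] else 0 for p in pages] for s in sessions]


def euclidean(u, v):
    """Euclidean distance between two equal-length rows (assumed measure)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def build_proximity_matrix(data, dist=euclidean):
    """Return an m-by-m matrix of pairwise distances between the rows of data."""
    return [[dist(u, v) for v in data] for u in data]


data_matrix = build_data_matrix(sessions, pages, visits)
proximity_matrix = build_proximity_matrix(data_matrix)

for name, row in zip(sessions, data_matrix):
    print(name, row)
for name, row in zip(sessions, proximity_matrix):
    print(name, [round(d, 3) for d in row])
```

The data matrix keeps the original attributes of each object, while the proximity matrix keeps only pairwise relationships between objects. With a distance such as this one, the proximity matrix is symmetric and has zeros on its diagonal.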
4.2.1 Data Matrix
A set of objects is represented as an $m$ by $n$ matrix, where there are $m$ rows,
one for each object, and $n$ columns, one for each attribute. This matrix
has different names, e.g., pattern matrix or data matrix, depending on
the particular field. Figure 4.2.1 below provides a concrete example of
web usage data objects and their corresponding data matrix, where $s_i$,
$i = 1, \ldots, m$ indicates $m$ user sessions and $p_j$, $j = 1, \ldots, n$
indicates $n$ pages; $a_{ij} = 1$ indicates that $s_i$ has visited $p_j$, and
$a_{ij} = 0$ otherwise. Because different attributes may
be measured on different scales, e.g., centimeter and kilogram, the data
is sometimes transformed before being used. In cases where the range of