Description: Given a set of observations $(x_1, x_2, \cdots, x_n)$, where each observation is a $d$-dimensional real vector, K-means clustering aims to partition the $n$ observations into $k$ sets ($k \leq n$), $S = \{S_1, S_2, \cdots, S_k\}$, so as to minimize the Within-Cluster Sum of Squares (WCSS), where $\mu_i$ is the mean of the points in $S_i$:

$$\arg\min_{S} \sum_{i=1}^{k} \sum_{x_j \in S_i} \| x_j - \mu_i \|^2$$
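To make the objective concrete, the following is a minimal Python sketch of a Lloyd's-style K-means iteration that alternates between assignment and centroid updates, each step decreasing the WCSS above. It is not the book's implementation; the function name `kmeans`, the random initialization, and the parameters are illustrative assumptions.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal Lloyd's-style K-means sketch: alternately assign points to the
    nearest centroid and recompute centroids as cluster means."""
    rng = np.random.default_rng(seed)
    # Initialise centroids by picking k observations at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each observation joins its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # WCSS value of the final partition.
    wcss = sum(((X[labels == j] - centroids[j]) ** 2).sum() for j in range(k))
    return labels, centroids, wcss
```

A call such as `kmeans(X, k=3)` on an $n \times d$ array `X` returns the cluster labels, the centroids, and the achieved WCSS.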
The advantages of the K-means algorithm are: it has low time consumption and fast processing speed, provided the value of k is small; it produces satisfactorily compact clusters; and the clusters do not overlap, since the structure is non-hierarchical.
There are some disadvantages of the K-means algorithm: it cannot determine the appropriate number of clusters automatically, so the user has to supply the value k as an input in advance. At the same time, fixing a specific number of clusters restricts the prediction of what the real k should be. Different initial partitions can lead to different numbers of clusters, and in some experiments the resulting cluster compositions differ as well.
Figure 12.3.1: Frame Structure of K-Means Algorithm (Start → choose the number of clusters K → compute centroids → measure distances of objects to centroids → group objects by minimum distance → repeat until no object changes group → End)
There are extensive related research works on it. The author in [27] characterized K-means as a classical heuristic clustering algorithm. Due to the sensitivity of K-means to initialization, several modified approaches have been proposed in the literature. Fast Global K-means [51] (FGK-means), for example, is an incremental clustering approach that dynamically adds one cluster centre at a time through a deterministic global search procedure consisting of D executions of the K-means algorithm with different suitable initial positions. Zhu et al. presented a new clustering strategy, which can produce a much lower Q(C) value than affinity propagation (AP) by initializing K-means clustering with the cluster centres produced by AP [45]. In [42], the authors motivated, theoretically and experimentally, the use of a deterministic divisive hierarchical method, PCA-Part (Principal Component Analysis Partitioning), as the initialization of K-means. In order to overcome the sensitivity problem of heuristic clustering algorithms, Han et al. proposed CLARANS, based on the random restart local search method [4]. VSH [9] iteratively modifies the cluster centres to deal with the initialization problem. Further modified methods addressing the initialization sensitivity of clustering algorithms can be found in [20, 36, 59].
12.3.2 Hierarchical Clustering
The K-means algorithm has the limitation of requiring a specific number of clusters, and it suffers from non-determinism. It also returns the clusters as an unstructured set. Because of such limitations, if a hierarchical structure is required, we need to turn to hierarchical clustering.
Hierarchical clustering constructs a hierarchy of clusters that can be illustrated in a tree structure as a dendrogram. Each node in the tree structure, including the root, represents the relationship between parents and children, so it is possible to explore different levels of clustering granularity [19]. Hierarchical clustering algorithms are either top-down or bottom-up. Bottom-up algorithms treat each file as a separate cluster in the beginning and then merge clusters until all of them have been merged into a single cluster that contains all the files. The bottom-up approach is called hierarchical agglomerative clustering.
Top-down clustering requires a method for dividing a cluster. It splits clusters recursively until individual documents are reached [69].
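As a rough illustration of the bottom-up variant (not taken from the book), the sketch below uses SciPy's agglomerative routines; the toy data `X`, the `average` linkage choice, and the cut level `t=4` are arbitrary assumptions for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data standing in for document/file vectors.
X = np.random.rand(20, 5)

# Bottom-up (agglomerative) clustering: every point starts as its own
# cluster, and the two closest clusters are merged repeatedly.
Z = linkage(X, method="average", metric="euclidean")

# The merge history Z encodes the dendrogram; cutting it at a chosen
# level yields flat clusters without fixing the number of clusters
# before the tree is built (here we cut into at most 4 clusters).
labels = fcluster(Z, t=4, criterion="maxclust")
```

The same linkage matrix can be rendered as a dendrogram to inspect different levels of granularity, which is the flexibility the text refers to.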
The advantages of hierarchical clustering are [5, 19]: it has high flexibility with respect to the level of granularity; it can easily work with any form of similarity metric or distance; and it does not require pre-assignment of the number of clusters, and therefore has high applicability.
The disadvantages of hierarchical clustering are summarized as [70]: the termination criteria and the interpretation of the hierarchy are complex; if an incorrect assignment is made, most hierarchical algorithms do not rebuild intermediate clusters; and the single pass of analysis and the local decisions it entails strongly influence the resulting clusters.
12.3.3 Spectral Clustering
Spectral clustering combines some of the benefits of the two aforementioned approaches. It refers to a class of techniques that rely on the eigenvalues of the adjacency (similarity) matrix; it partitions all of the elements into disjoint clusters so that elements with high similarity end up in the same cluster, while elements in one cluster have low similarity to the elements of other clusters. Spectral clustering is based on graph partitioning: it maps the original inherent relationships onto a new spectral space, and all items are simultaneously partitioned into disjoint clusters with minimum-cut optimization. Spectral clustering techniques make use of the spectrum of the similarity matrix of the data to perform dimensionality reduction before clustering in fewer dimensions [71].
The original formula for spectral clustering is:

$$L = I - D^{-1/2} W D^{-1/2}$$

where $W$ is the corresponding similarity matrix and $D$ is the diagonal degree matrix with $D_{ii} = \sum_j W_{ij}$.
According to the spectral graph theory in [13], the $k$ singular vectors of the reformed matrix $RM_{User} = D^{-1/2} SM_{User} D^{-1/2}$ give the best approximation to the projection of the user-tag vectors onto the new spectral space.
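A minimal sketch of this procedure, assuming a symmetric similarity matrix W is already available, might look as follows; the function name `spectral_clusters` and the use of scikit-learn's KMeans for the final step are illustrative choices, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clusters(W, k):
    """Sketch of spectral clustering on a symmetric similarity matrix W,
    following RM = D^{-1/2} W D^{-1/2} as described above."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    RM = D_inv_sqrt @ W @ D_inv_sqrt
    # The k leading eigenvectors of RM span the new spectral space.
    vals, vecs = np.linalg.eigh(RM)          # eigenvalues in ascending order
    U = vecs[:, -k:]
    # Row-normalise, then cluster the projected points in the spectral space.
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)
```

The eigendecomposition is the "standard linear algebra" step mentioned below, and the final K-means runs in the low-dimensional spectral space rather than on the original data.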
Compared to the clustering algorithms above, the spectral clustering algorithm has several fundamental advantages: it is very simple to implement; it does not suffer from local minima, so it can be solved efficiently by standard linear algebra methods; it keeps the shapes and densities within clusters invariant; and the quality of the obtained result is better.
The disadvantages of spectral clustering are summarized as: its high time and space complexity make the processing inefficient, and in some cases the clustering process is unstable.
Figure 12.3.2: Example of the Spectral Clustering [37]
Figure 12.3.3: Example of the Spectral Clustering [37]
12.3.4 Quality of Clusters and Modularity Method
There are various categories of methods to measure the quality of clusters, such as "Compactness", a measure of how similar the objects within an individual cluster are to one another, or "Isolation", a measure of how well separated the objects in a cluster are from the objects outside it [54]. In this research, we combine these attributes and utilize the modularity method to evaluate the clustering algorithms. It is one of the quantitative measures of the "goodness" of the clusters discovered.
The modularity value is computed as the difference between the actual number of edges within a cluster and the expected number of such edges. A high modularity value indicates a good division: the nodes within the same cluster have dense connections, with only sparse connections between different clusters. It helps to evaluate the quality of a cluster; here "quality of cluster" comprises two criteria, i.e., the number of clusters and the similarity within each cluster [32].
Consider a particular division of a network into k clusters. We can define a $k \times k$ symmetric matrix $SM$ whose element $sm_{pq}$ is the fraction of all edges in the network that link vertices in cluster $C_p$ to vertices in cluster $C_q$. Taking any two clusters $C_p$ and $C_q$, the similarity $smC_{pq}$ between them can be defined as
$$smC_{pq} = \frac{\sum_{c_p \in C_p} \sum_{c_q \in C_q} c_{pq}}{\sum_{c_p \in C} \sum_{c_q \in C} c_{pq}}, \qquad p, q = 1, 2, \cdots, m$$
where $c_{pq}$ is the element of the similarity matrix over all the objects. When $p = q$, $smC_{pq}$ is the similarity between the elements inside a cluster, while for $p \neq q$, $smC_{pq}$ is the similarity between cluster $C_p$ and cluster $C_q$. So the condition for a high quality clustering is $\max(\sum_{p} smC_{pp})$ and $\min(\sum_{p \neq q} smC_{pq})$, $p, q = 1, 2, \cdots, m$.
Summing over all pairs of vertices in the same group, the modularity,
denoted Q, is given by:
$$Q = \sum_{p=1}^{m} \left[ smc_{pp} - \Big( \sum_{q=1}^{m} smc_{pq} \Big)^{2} \right] = \mathrm{Tr}\, SM - \| SM^{2} \|$$
where the value m is the number of clusters and $\|X\|$ denotes the sum of the elements of the matrix $X$. The trace of this matrix, $\mathrm{Tr}\, SM = \sum_{p=1}^{m} smC_{pp}$, gives the fraction of edges in the network that connect vertices in the same cluster, and a good division into clusters should have a high value of it. However, the trace alone is not sufficient: if we place all vertices in a single cluster, TrSM attains its maximal value of 1 while conveying no information about cluster structure at all.
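For concreteness, a small sketch of this computation is given below, assuming a weighted adjacency (or similarity) matrix `adj` and a flat cluster assignment `labels`; both names are hypothetical and the code is only an illustration of the formula above.

```python
import numpy as np

def modularity(adj, labels):
    """Sketch of Q = Tr(SM) - ||SM^2||, where ||.|| sums all matrix elements."""
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    m = len(clusters)
    total = adj.sum()
    # SM[p, q]: fraction of all edge weight linking cluster p to cluster q.
    SM = np.zeros((m, m))
    for i, p in enumerate(clusters):
        for j, q in enumerate(clusters):
            SM[i, j] = adj[np.ix_(labels == p, labels == q)].sum() / total
    # Q = sum_p [ SM_pp - (sum_q SM_pq)^2 ] = Tr(SM) - ||SM^2||.
    return np.trace(SM) - (SM @ SM).sum()
```

Applying this to the labels produced by any of the clustering algorithms above gives a single score that rewards dense within-cluster connections and penalizes edges between clusters.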
This quantity measures the fraction of the edges in the network that connect vertices of the same type, minus the expected value of the same quantity in a network with the same cluster divisions. The value Q is used to evaluate the clusters [4]: values approaching Q = 1, which is the maximum,