Description: Given a set of observations $(x_1, x_2, \cdots, x_n)$, where each observation is a $d$-dimensional real vector, K-means clustering aims to partition the $n$ observations into $k$ sets ($k \leq n$), $S = \{S_1, S_2, \cdots, S_k\}$, so as to minimize the Within-Cluster Sum of Squares (WCSS), where $\mu_i$ is the mean of the points in $S_i$:

$$\arg\min_{S} \sum_{i=1}^{k} \sum_{x_j \in S_i} \| x_j - \mu_i \|^2$$
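To make the objective concrete, the following is a minimal Python sketch of a Lloyd's-style K-means iteration that alternates between assignment and centroid updates, each step decreasing the WCSS above. It is not the book's implementation; the function name `kmeans`, the random initialization, and the parameters are illustrative assumptions.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal Lloyd's-style K-means sketch: alternately assign points to the
    nearest centroid and recompute centroids as cluster means."""
    rng = np.random.default_rng(seed)
    # Initialise centroids by picking k observations at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each observation joins its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # WCSS value of the final partition.
    wcss = sum(((X[labels == j] - centroids[j]) ** 2).sum() for j in range(k))
    return labels, centroids, wcss
```

A call such as `kmeans(X, k=3)` on an $n \times d$ array `X` returns the cluster labels, the centroids, and the achieved WCSS.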
The advantages of the K-means algorithm are: it has low time consumption and fast processing speed, provided the value of k is small; it produces satisfactorily compact clusters; and the clusters do not overlap, since the structure is non-hierarchical.
There are some disadvantages of the K-means algorithm: it cannot determine the appropriate number of clusters automatically, so the user has to supply the value k as an input in advance. At the same time, fixing a specific number of clusters restricts the prediction of what the real k should be. Different initial partitions can lead to different numbers of clusters, and in some experiments the resulting cluster compositions differ as well.
Figure 12.3.1: Frame Structure of K-Means Algorithm (Start → choose the number of clusters K → compute centroids → measure distances of objects to centroids → group objects by minimum distance → repeat until no object changes group → End)
There are extensive related research works on it. The author in [27] characterized K-means as a classical heuristic clustering algorithm. Due to the sensitivity of K-means to initialization, several modified approaches have been proposed in the literature. Fast Global K-means [51] (FGK-means), for example, is an incremental clustering approach that dynamically adds one cluster centre at a time through a deterministic global search procedure consisting of D executions of the K-means algorithm with different suitable initial positions. Zhu et al. presented a new clustering strategy, which can produce a much lower Q(C) value than affinity propagation (AP) by initializing K-means clustering with the cluster centres produced by AP [45]. In [42], the authors motivated, theoretically and experimentally, the use of a deterministic divisive hierarchical method, PCA-Part (Principal Component Analysis Partitioning), as the initialization of K-means. In order to overcome the sensitivity problem of heuristic clustering algorithms, Han et al. proposed CLARANS, based on the random restart local search method [4]. VSH [9] iteratively modifies the cluster centres to deal with the initialization problem. Further modified methods addressing the initialization sensitivity of clustering algorithms can be found in [20, 36, 59].
12.3.2 Hierarchical Clustering
The K-means algorithm has the limitation of requiring a specific number of clusters, and it suffers from non-determinism. It also returns the clusters as an unstructured set. Because of such limitations, if a hierarchical structure is required, we need to turn to hierarchical clustering.
Hierarchical clustering constructs a hierarchy of clusters that can be illustrated in a tree structure as a dendrogram. Each node in the tree structure, including the root, represents the relationship between parents and children, so it is possible to explore different levels of clustering granularity [19]. Hierarchical clustering algorithms are either top-down or bottom-up. Bottom-up algorithms treat each file as a separate cluster in the beginning and then merge clusters until all of them have been merged into a single cluster that contains all the files. The bottom-up approach is called hierarchical agglomerative clustering.
Top-down clustering requires a method for dividing a cluster. It splits clusters recursively until individual documents are reached [69].
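As a rough illustration of the bottom-up variant (not taken from the book), the sketch below uses SciPy's agglomerative routines; the toy data `X`, the `average` linkage choice, and the cut level `t=4` are arbitrary assumptions for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data standing in for document/file vectors.
X = np.random.rand(20, 5)

# Bottom-up (agglomerative) clustering: every point starts as its own
# cluster, and the two closest clusters are merged repeatedly.
Z = linkage(X, method="average", metric="euclidean")

# The merge history Z encodes the dendrogram; cutting it at a chosen
# level yields flat clusters without fixing the number of clusters
# before the tree is built (here we cut into at most 4 clusters).
labels = fcluster(Z, t=4, criterion="maxclust")
```

The same linkage matrix can be rendered as a dendrogram to inspect different levels of granularity, which is the flexibility the text refers to.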
The advantages of hierarchical clustering are [5, 19]: it has high flexibility with respect to the level of granularity; it can easily work with any form of similarity metric or distance; and it does not require pre-assignment of the number of clusters, and therefore has high applicability.
The disadvantages of hierarchical clustering are summarized as [70]: the termination criteria and the interpretation of the hierarchy are complex; if an incorrect assignment is made, most hierarchical algorithms do not rebuild intermediate clusters; and the single pass of analysis and the local decisions it entails strongly influence the resulting clusters.
12.3.3 Spectral Clustering
Spectral clustering combines some of the benefits of the two aforementioned approaches. It refers to a class of techniques that rely on the eigenvalues of the adjacency (similarity) matrix; it partitions all of the elements into disjoint clusters so that elements with high similarity end up in the same cluster, while elements in one cluster have low similarity to the elements of other clusters. Spectral clustering is based on graph partitioning: it maps the original inherent relationships onto a new spectral space, and all items are simultaneously partitioned into disjoint clusters with minimum-cut optimization. Spectral clustering techniques make use of the spectrum of the similarity matrix of the data to perform dimensionality reduction before clustering in fewer dimensions [71].
The original formula for spectral clustering is:

$$L = I - D^{-1/2} W D^{-1/2}$$

where $W$ is the corresponding similarity matrix and $D$ is the diagonal degree matrix with $D_{ii} = \sum_j W_{ij}$.
According to the spectral graph theory in [13], the $k$ singular vectors of the reformed matrix $RM_{User} = D^{-1/2} SM_{User} D^{-1/2}$ give the best approximation to the projection of the user-tag vectors onto the new spectral space.
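A minimal sketch of this procedure, assuming a symmetric similarity matrix W is already available, might look as follows; the function name `spectral_clusters` and the use of scikit-learn's KMeans for the final step are illustrative choices, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clusters(W, k):
    """Sketch of spectral clustering on a symmetric similarity matrix W,
    following RM = D^{-1/2} W D^{-1/2} as described above."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    RM = D_inv_sqrt @ W @ D_inv_sqrt
    # The k leading eigenvectors of RM span the new spectral space.
    vals, vecs = np.linalg.eigh(RM)          # eigenvalues in ascending order
    U = vecs[:, -k:]
    # Row-normalise, then cluster the projected points in the spectral space.
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)
```

The eigendecomposition is the "standard linear algebra" step mentioned below, and the final K-means runs in the low-dimensional spectral space rather than on the original data.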
Compared to the clustering algorithms above, the spectral clustering algorithm has several fundamental advantages: it is very simple to implement; it does not suffer from local minima, so it can be solved efficiently by standard linear algebra methods; it keeps the shapes and densities within clusters invariant; and the quality of the obtained result is better.
The disadvantages of spectral clustering are summarized as: its high time and space complexity make the processing inefficient, and in some cases the clustering process is unstable.
Figure 12.3.2: Example of the Spectral Clustering [37]
Figure 12.3.3: Example of the Spectral Clustering [37]
12.3.4 Quality of Clusters and Modularity Method
There are various categories of methods to measure the quality of clusters, such as "Compactness", a measure of how similar the objects within an individual cluster are to one another, or "Isolation", a measure of how well separated the objects in a cluster are from the objects outside it [54]. In this research, we combine these attributes and utilize the modularity method to evaluate the clustering algorithms. It is one of the quantitative measures of the "goodness" of the clusters discovered.
The modularity value is computed as the difference between the actual number of edges within a cluster and the expected number of such edges. A high modularity value indicates a good division: the nodes within the same cluster have dense connections, with only sparse connections between different clusters. It helps to evaluate the quality of a cluster; here "quality of cluster" comprises two criteria, i.e., the number of clusters and the similarity within each cluster [32].
Consider a particular division of a network into k clusters. We can define a $k \times k$ symmetric matrix $SM$ whose element $sm_{pq}$ is the fraction of all edges in the network that link vertices in cluster $C_p$ to vertices in cluster $C_q$. Taking any two clusters $C_p$ and $C_q$, the similarity $smC_{pq}$ between them can be defined as
$$smC_{pq} = \frac{\sum_{c_p \in C_p} \sum_{c_q \in C_q} c_{pq}}{\sum_{c_p \in C} \sum_{c_q \in C} c_{pq}}, \qquad p, q = 1, 2, \cdots, m$$
where $c_{pq}$ is the element of the similarity matrix over all the objects. When $p = q$, $smC_{pq}$ is the similarity between the elements inside a cluster, while for $p \neq q$, $smC_{pq}$ is the similarity between cluster $C_p$ and cluster $C_q$. So the condition for a high quality clustering is $\max(\sum_{p} smC_{pp})$ and $\min(\sum_{p \neq q} smC_{pq})$, $p, q = 1, 2, \cdots, m$.
Summing over all pairs of vertices in the same group, the modularity,
denoted Q, is given by:
$$Q = \sum_{p=1}^{m} \left[ smc_{pp} - \Big( \sum_{q=1}^{m} smc_{pq} \Big)^{2} \right] = \mathrm{Tr}\, SM - \| SM^{2} \|$$
where the value m is the number of clusters and $\|X\|$ denotes the sum of the elements of the matrix $X$. The trace of this matrix, $\mathrm{Tr}\, SM = \sum_{p=1}^{m} smC_{pp}$, gives the fraction of edges in the network that connect vertices in the same cluster, and a good division into clusters should have a high value of it. However, the trace alone is not sufficient: if we place all vertices in a single cluster, TrSM attains its maximal value of 1 while conveying no information about cluster structure at all.
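For concreteness, a small sketch of this computation is given below, assuming a weighted adjacency (or similarity) matrix `adj` and a flat cluster assignment `labels`; both names are hypothetical and the code is only an illustration of the formula above.

```python
import numpy as np

def modularity(adj, labels):
    """Sketch of Q = Tr(SM) - ||SM^2||, where ||.|| sums all matrix elements."""
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    m = len(clusters)
    total = adj.sum()
    # SM[p, q]: fraction of all edge weight linking cluster p to cluster q.
    SM = np.zeros((m, m))
    for i, p in enumerate(clusters):
        for j, q in enumerate(clusters):
            SM[i, j] = adj[np.ix_(labels == p, labels == q)].sum() / total
    # Q = sum_p [ SM_pp - (sum_q SM_pq)^2 ] = Tr(SM) - ||SM^2||.
    return np.trace(SM) - (SM @ SM).sum()
```

Applying this to the labels produced by any of the clustering algorithms above gives a single score that rewards dense within-cluster connections and penalizes edges between clusters.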
This quantity measures the fraction of the edges in the network that connect vertices of the same type, minus the expected value of the same quantity in a network with the same cluster divisions. The value Q is used to evaluate the clusters [4]: values approaching Q = 1, which is the maximum,