11.3 Clustering Graph and Network Data

Cluster analysis on graph and network data extracts valuable knowledge and information. Such data are increasingly popular in many applications. We discuss applications and challenges of clustering graph and network data in Section 11.3.1. Similarity measures for this form of clustering are given in Section 11.3.2. You will learn about graph clustering methods in Section 11.3.3.

In general, the terms graph and network can be used interchangeably. In the rest of this section, we mainly use the term graph.

11.3.1 Applications and Challenges

As a customer relationship manager at AllElectronics, you notice that a great deal of data relating to customers and their purchase behavior can best be modeled using graphs.

Example 11.16

Bipartite graph

The customer purchase behavior at AllElectronics can be represented in a bipartite graph. In a bipartite graph, vertices can be divided into two disjoint sets so that each edge connects a vertex in one set to a vertex in the other set. For the AllElectronics customer purchase data, one set of vertices represents customers, with one customer per vertex. The other set represents products, with one product per vertex. An edge connects a customer to a product, representing the purchase of the product by the customer. Figure 11.12 shows an illustration.

“What kind of knowledge can we obtain by a cluster analysis of the customer-product bipartite graph?” By clustering the customers such that those customers buying similar sets of products are placed into one group, a customer relationship manager can make product recommendations. For example, suppose Ada belongs to a customer cluster in which most of the customers purchased a digital camera in the last 12 months, but Ada has yet to purchase one. As manager, you decide to recommend a digital camera to her.

Alternatively, we can cluster products such that those products purchased by similar sets of customers are grouped together. This clustering information can also be used for product recommendations. For example, if a digital camera and a high-speed flash memory card belong to the same product cluster, then when a customer purchases a digital camera, we can recommend the high-speed flash memory card.

Figure 11.12 Bipartite graph representing customer-purchase data.

Bipartite graphs are widely used in many applications. Consider another example.

Example 11.17

Web search engines

In web search engines, search logs are archived to record user queries and the corresponding click-through information. (The click-through information tells us on which pages, given as a result of a search, the user clicked.) The query and click-through information can be represented using a bipartite graph, where the two sets of vertices correspond to queries and web pages, respectively. An edge links a query to a web page if a user clicks the web page when asking the query. Valuable information can be obtained by cluster analyses on the query–web page bipartite graph. For instance, we may identify queries posed in different languages, but that mean the same thing, if the click-through information for each query is similar.

As another example, all the web pages on the Web form a directed graph, also known as the web graph, where each web page is a vertex, and each hyperlink is an edge pointing from a source page to a destination page. Cluster analysis on the web graph can disclose communities, find hubs and authoritative web pages, and detect web spam.

In addition to bipartite graphs, cluster analysis can also be applied to other types of graphs, including general graphs, as elaborated in Example 11.18.

Example 11.18

Social network

A social network is a social structure. It can be represented as a graph, where the vertices are individuals or organizations, and the links are interdependencies between the vertices, representing friendship, common interests, or collaborative activities. AllElectronics' customers form a social network, where each customer is a vertex, and an edge links two customers if they know each other.

As customer relationship manager, you are interested in finding useful information that can be derived from AllElectronics' social network through cluster analysis. You obtain clusters from the network, where customers in a cluster know each other or have friends in common. Customers within a cluster may influence one another regarding purchase decision making. Moreover, communication channels can be designed to inform the “heads” of clusters (i.e., the “best” connected people in the clusters), so that promotional information can be spread out quickly. Thus, you may use customer clustering to promote sales at AllElectronics.

As another example, the authors of scientific publications form a social network, where the authors are vertices and two authors are connected by an edge if they coauthored a publication. The network is, in general, a weighted graph because an edge between two authors can carry a weight representing the strength of the collaboration such as how many publications the two authors (as the end vertices) coauthored. Clustering the coauthor network provides insight as to communities of authors and patterns of collaboration.

“Are there any challenges specific to cluster analysis on graph and network data?” In most of the clustering methods discussed so far, objects are represented using a set of attributes. A unique feature of graph and network data is that only objects (as vertices) and relationships between them (as edges) are given. No dimensions or attributes are explicitly defined. To conduct cluster analysis on graph and network data, there are two major new challenges.

■ “How can we measure the similarity between two objects on a graph?” Typically, we cannot use conventional distance measures, such as Euclidean distance. Instead, we need to develop new measures to quantify the similarity. Such measures often are not metric, and thus raise new challenges regarding the development of efficient clustering methods. Similarity measures for graphs are discussed in Section 11.3.2.

■ “How can we design clustering models and methods that are effective on graph and network data?” Graph and network data are often complicated, carrying topological structures that are more sophisticated than traditional cluster analysis applications. Many graph data sets are large, such as the web graph containing at least tens of billions of web pages in the publicly indexable Web. Graphs can also be sparse where, on average, a vertex is connected to only a small number of other vertices in the graph. To discover accurate and useful knowledge hidden deep in the data, a good clustering method has to accommodate these factors. Clustering methods for graph and network data are introduced in Section 11.3.3.

11.3.2 Similarity Measures

“How can we measure the similarity or distance between two vertices in a graph?” In our discussion, we examine two types of measures: geodesic distance and distance based on random walk.

Geodesic Distance

A simple measure of the distance between two vertices in a graph is the shortest path between the vertices. Formally, the geodesic distance between two vertices is the length, in terms of the number of edges, of the shortest path between the vertices. For two vertices that are not connected in a graph, the geodesic distance is defined as infinite.

Using geodesic distance, we can define several other useful measurements for graph analysis and clustering. Given a graph $G = (V, E)$, where V is the set of vertices and E is the set of edges, we define the following:

■ For a vertex $v \in V$, the eccentricity of v, denoted $eccen(v)$, is the largest geodesic distance between v and any other vertex $u \in V - \{v\}$. The eccentricity of v captures how far away v is from its remotest vertex in the graph.

■ The radius of graph G is the minimum eccentricity of all vertices. That is,

$r = \min_{v \in V} eccen(v)$ (11.27)

The radius captures the distance between the “most central point” and the “farthest border” of the graph.

■ The diameter of graph G is the maximum eccentricity of all vertices. That is,

$d = \max_{v \in V} eccen(v)$ (11.28)

The diameter represents the largest distance between any pair of vertices.

■ A peripheral vertex is a vertex that achieves the diameter.

Example 11.19

Measurements based on geodesic distance

Consider graph G in Figure 11.13. The eccentricity of a is 2, that is, $eccen(a) = 2$. Similarly, $eccen(b) = 2$ and $eccen(c) = eccen(d) = eccen(e) = 3$. Thus, the radius of G is 2, and the diameter is 3. Note that it is not necessary that $d = 2 \cdot r$. Vertices c, d, and e are peripheral vertices.

Figure 11.13 A graph, G, where vertices c, d, and e are peripheral.
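To make these measurements concrete, the following is a minimal Python sketch that computes eccentricity, radius, diameter, and the peripheral vertices using breadth-first search. The edge list is a hypothetical graph chosen to reproduce the eccentricities reported in Example 11.19; the actual edge set of Figure 11.13 may differ.

from collections import deque

def geodesic_distances(adj, source):
    """Geodesic (shortest-path) distances from source via BFS.
    Vertices not connected to source stay at infinity."""
    dist = {v: float("inf") for v in adj}
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if dist[v] == float("inf"):
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def eccentricity(adj, v):
    """Largest geodesic distance between v and any other vertex."""
    dist = geodesic_distances(adj, v)
    return max(d for u, d in dist.items() if u != v)

# A hypothetical undirected graph consistent with Example 11.19.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("b", "e")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

eccen = {v: eccentricity(adj, v) for v in adj}
radius = min(eccen.values())       # Eq. (11.27)
diameter = max(eccen.values())     # Eq. (11.28)
peripheral = [v for v in adj if eccen[v] == diameter]
print(eccen)                         # {'a': 2, 'b': 2, 'c': 3, 'd': 3, 'e': 3}
print(radius, diameter, peripheral)  # 2 3 ['c', 'd', 'e']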

SimRank: Similarity Based on Random Walk and Structural Context

For some applications, geodesic distance may be inappropriate in measuring the similarity between vertices in a graph. Here we introduce SimRank, a similarity measure based on random walk and on the structural context of the graph. In mathematics, a random walk is a trajectory that consists of taking successive random steps.

Example 11.20

Similarity between people in a social network

Let’s consider measuring the similarity between two vertices in the AllElectronics customer social network of Example 11.18. Here, similarity can be explained as the closeness between two participants in the network, that is, how close two people are in terms of the relationship represented by the social network.

“How well can the geodesic distance measure similarity and closeness in such a network?” Suppose Ada and Bob are two customers in the network, and the network is undirected. The geodesic distance between Ada and Bob (i.e., the length of the shortest path between them) is the smallest number of hops over which a message can be passed from Ada to Bob, and vice versa. However, this information is not useful for AllElectronics' customer relationship management because the company typically does not want to send a specific message from one customer to another. Therefore, geodesic distance does not suit the application.

“What does similarity mean in a social network?” We consider two ways to define similarity:

■ Two customers are considered similar to one another if they have similar neighbors in the social network. This heuristic is intuitive because, in practice, two people receiving recommendations from a good number of common friends often make similar decisions. This kind of similarity is based on the local structure (i.e., the neighborhoods) of the vertices, and thus is called structural context–based similarity.

■ Suppose AllElectronics sends promotional information to both Ada and Bob in the social network. Ada and Bob may randomly forward such information to their friends (or neighbors) in the network. The closeness between Ada and Bob can then be measured by the likelihood that other customers simultaneously receive the promotional information that was originally sent to Ada and Bob. This kind of similarity is based on the random walk reachability over the network, and thus is referred to as similarity based on random walk.

Let’s have a closer look at what is meant by similarity based on structural context, and similarity based on random walk.

The intuition behind similarity based on structural context is that two vertices in a graph are similar if they are connected to similar vertices. To measure such similarity, we need to define the notion of individual neighborhood. In a directed graph $G = (V, E)$, where V is the set of vertices and $E \subseteq V \times V$ is the set of edges, for a vertex $v \in V$, the individual in-neighborhood of v is defined as

$I(v) = \{u \mid (u, v) \in E\}$ (11.29)

Symmetrically, we define the individual out-neighborhood of v as

$O(v) = \{w \mid (v, w) \in E\}$ (11.30)

Following the intuition illustrated in Example 11.20, we define SimRank, a structural-context similarity, with a value that is between 0 and 1 for any pair of vertices. For any vertex $v \in V$, the similarity between the vertex and itself is $s(v, v) = 1$ because the neighborhoods are identical. For vertices $u, v \in V$ such that $u \neq v$, we can define

$s(u, v) = \frac{C}{|I(u)||I(v)|} \sum_{x \in I(u)} \sum_{y \in I(v)} s(x, y)$ (11.31)

where C is a constant between 0 and 1. A vertex may not have any in-neighbors. Thus, we define Eq. (11.31) to be 0 when either I(u) or I(v) is ∅. Parameter C specifies the rate of decay as similarity is propagated across edges.

“How can we compute SimRank?” A straightforward method iteratively evaluates Eq. (11.31) until a fixed point is reached. Let $s_i(u, v)$ be the SimRank score calculated at the $i$th round. To begin, we set

$s_0(u, v) = \begin{cases} 0 & \text{if } u \neq v \\ 1 & \text{if } u = v \end{cases}$ (11.32)

We use Eq. (11.31) to compute $s_{i+1}$ from $s_i$ as

$s_{i+1}(u, v) = \frac{C}{|I(u)||I(v)|} \sum_{x \in I(u)} \sum_{y \in I(v)} s_i(x, y)$ (11.33)

It can be shown that $\lim_{i \to \infty} s_i(u, v) = s(u, v)$. Additional methods for approximating SimRank are given in the bibliographic notes (Section 11.7).
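The following is a minimal Python sketch of this iterative computation. The small directed graph, the decay constant C = 0.8, and the iteration count are illustrative assumptions, not values prescribed by the text.

from itertools import product

def simrank(vertices, edges, C=0.8, iterations=10):
    """Iteratively evaluate Eqs. (11.31)-(11.33) toward the fixed point."""
    in_nbrs = {v: set() for v in vertices}    # I(v), Eq. (11.29)
    for u, v in edges:
        in_nbrs[v].add(u)

    # Initialization, Eq. (11.32): s0(u, v) = 1 if u == v, else 0.
    s = {(u, v): 1.0 if u == v else 0.0 for u, v in product(vertices, repeat=2)}

    for _ in range(iterations):
        s_next = {}
        for u, v in product(vertices, repeat=2):
            if u == v:
                s_next[(u, v)] = 1.0      # s(v, v) = 1 by definition
            elif not in_nbrs[u] or not in_nbrs[v]:
                s_next[(u, v)] = 0.0      # defined as 0 when I(u) or I(v) is empty
            else:
                total = sum(s[(x, y)] for x in in_nbrs[u] for y in in_nbrs[v])
                # One update step, Eq. (11.33)
                s_next[(u, v)] = C * total / (len(in_nbrs[u]) * len(in_nbrs[v]))
        s = s_next
    return s

# Hypothetical directed graph: two "queries" pointing at two shared "pages".
vertices = ["q1", "q2", "p1", "p2"]
edges = [("q1", "p1"), ("q1", "p2"), ("q2", "p1"), ("q2", "p2")]
scores = simrank(vertices, edges)
print(scores[("p1", "p2")])  # 0.4: pages with identical in-neighborhoods score highly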

Now, let’s consider similarity based on random walk. A directed graph is strongly connected if, for any two vertices u and v, there is a path from u to v and another path from v to u. In a strongly connected graph $G = (V, E)$, for any two vertices $u, v \in V$, we can define the expected distance from u to v as

$d(u, v) = \sum_{t:\, u \rightsquigarrow v} P[t]\, l(t)$ (11.34)

where $t: u \rightsquigarrow v$ is a path starting from u and ending at v that may contain cycles but does not reach v until the end. For a traveling tour $t = w_1 \to w_2 \to \cdots \to w_k$, its length is $l(t) = k - 1$. The probability of the tour is defined as

$P[t] = \begin{cases} \prod_{i=1}^{k-1} \frac{1}{|O(w_i)|} & \text{if } l(t) > 0 \\ 0 & \text{if } l(t) = 0 \end{cases}$ (11.35)

To measure the probability that a vertex w receives a message that originated simultaneously from u and v, we extend the expected distance to the notion of expected meeting distance, that is,

$m(u, v) = \sum_{t:\,(u,v) \rightsquigarrow (x,x)} P[t]\, l(t)$ (11.36)

where $t: (u, v) \rightsquigarrow (x, x)$ is a pair of tours $u \rightsquigarrow x$ and $v \rightsquigarrow x$ of the same length. Using a constant C between 0 and 1, we define the expected meeting probability as

$p(u, v) = \sum_{t:\,(u,v) \rightsquigarrow (x,x)} P[t]\, C^{l(t)}$ (11.37)

which is a similarity measure based on random walk. Here, the parameter C specifies the probability of continuing the walk at each step of the trajectory.

It has been shown that $s(u, v) = p(u, v)$ for any two vertices, u and v. That is, SimRank is based on both structural context and random walk.
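As a rough illustration only, Eq. (11.37) can be approximated by Monte Carlo simulation, under the assumption that p(u, v) can be read as the expected value of C raised to the number of steps until two simultaneous random walkers, started at u and v, first occupy the same vertex. The graph below is a hypothetical strongly connected example.

import random

def meeting_probability(adj_out, u, v, C=0.8, trials=100_000, max_steps=50):
    """Monte Carlo sketch of the expected meeting probability p(u, v):
    two walkers step simultaneously to uniformly random out-neighbors and
    contribute C**k on first meeting at step k. Walk pairs that do not
    meet within max_steps contribute (approximately) 0."""
    total = 0.0
    for _ in range(trials):
        a, b = u, v
        for step in range(max_steps + 1):
            if a == b:
                total += C ** step
                break
            a = random.choice(adj_out[a])
            b = random.choice(adj_out[b])
    return total / trials

# Hypothetical strongly connected directed graph (a 4-cycle with a chord).
adj_out = {"a": ["b"], "b": ["c", "a"], "c": ["d"], "d": ["a"]}
print(meeting_probability(adj_out, "a", "c"))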

11.3.3 Graph Clustering Methods

Let’s consider how to conduct clustering on a graph. We first describe the intuition behind graph clustering. We then discuss two general categories of graph clustering methods.

To find clusters in a graph, imagine cutting the graph into pieces, each piece being a cluster, such that the vertices within a cluster are well connected and the vertices in different clusters are connected in a much weaker way. Formally, for a graph $G = (V, E)$, a cut $C = (S, T)$ is a partitioning of the set of vertices V in G, that is, $V = S \cup T$ and $S \cap T = \emptyset$. The cut set of a cut is the set of edges $\{(u, v) \in E \mid u \in S, v \in T\}$. The size of the cut is the number of edges in the cut set. For weighted graphs, the size of a cut is the sum of the weights of the edges in the cut set.

“What kinds of cuts are good for deriving clusters in graphs?” In graph theory and some network applications, a minimum cut is of importance. A cut is minimum if the cut’s size is not greater than any other cut’s size. There are polynomial time algorithms to compute minimum cuts of graphs. Can we use these algorithms in graph clustering?

Example 11.21

Cuts and clusters

Consider graph G in Figure 11.14. The graph has two clusters, $\{a, b, c, d, e, f\}$ and $\{g, h, i, j, k\}$, and one outlier vertex, l.

Consider cut $C_1 = (\{a, b, c, d, e, f, g, h, i, j, k\}, \{l\})$. Only one edge, namely, (e, l), crosses the two partitions created by C1. Therefore, the cut set of C1 is $\{(e, l)\}$ and the size of C1 is 1. (Note that the size of any cut in a connected graph cannot be smaller than 1.) As a minimum cut, C1 does not lead to a good clustering because it only separates the outlier vertex, l, from the rest of the graph.

image

Figure 11.14 A graph G and two cuts.

Cut $C_2 = (\{a, b, c, d, e, f, l\}, \{g, h, i, j, k\})$ leads to a much better clustering than C1.

The edges in the cut set of C2 are those connecting the two “natural clusters” in the graph. Specifically, for each of the vertices d, h, e, and k, which are incident to the cut-set edges (d, h) and (e, k), most of the edges connecting to that vertex stay within its own cluster.

Example 11.21 indicates that using a minimum cut is unlikely to lead to a good clustering. We are better off choosing a cut where, for each vertex u that is involved in an edge in the cut set, most of the edges connecting to u belong to one cluster. Formally, let $deg(u)$ be the degree of u, that is, the number of edges connecting to u. The sparsity of a cut $C = (S, T)$ is defined as

$\Phi = \frac{\text{cut size}}{\min\{|S|, |T|\}}$ (11.38)

A cut is sparsest if its sparsity is not greater than the sparsity of any other cut. There may be more than one sparsest cut.

In Example 11.21 and Figure 11.14, C2 is a sparsest cut. Using sparsity as the objective function, a sparsest cut tries to minimize the number of edges crossing the partitions and balance the partitions in size.
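Below is a minimal Python sketch of cut size and sparsity. The edge list is a hypothetical graph in the spirit of Figure 11.14 (two dense groups bridged by the edges (d, h) and (e, k), plus the outlier l); the actual edges of that figure may differ.

def cut_size(edges, S):
    """Number of edges with exactly one endpoint in S (size of the cut set)."""
    S = set(S)
    return sum(1 for u, v in edges if (u in S) != (v in S))

def sparsity(edges, S, T):
    """Sparsity of the cut C = (S, T), Eq. (11.38):
    cut size divided by the size of the smaller side."""
    return cut_size(edges, S) / min(len(S), len(T))

# Hypothetical graph with two natural clusters and an outlier l.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "e"),
         ("d", "e"), ("d", "f"), ("e", "f"),
         ("g", "h"), ("g", "i"), ("h", "i"), ("h", "j"), ("i", "j"),
         ("j", "k"), ("h", "k"),
         ("d", "h"), ("e", "k"),   # the two edges crossing the natural clusters
         ("e", "l")]               # the outlier's only edge

V = {v for e in edges for v in e}
C1 = (V - {"l"}, {"l"})            # a minimum cut: isolates the outlier
C2 = ({"a", "b", "c", "d", "e", "f", "l"},
      V - {"a", "b", "c", "d", "e", "f", "l"})

for S, T in (C1, C2):
    print(cut_size(edges, S), sparsity(edges, S, T))
# C1: size 1, sparsity 1.0; C2: size 2, sparsity 0.4 (the sparser, better cut)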

Consider a clustering on a graph $G = (V, E)$ that partitions the graph into k clusters. The modularity of a clustering assesses the quality of the clustering and is defined as

$Q = \sum_{i=1}^{k} \left( \frac{l_i}{|E|} - \left( \frac{d_i}{2|E|} \right)^2 \right)$ (11.39)

where $l_i$ is the number of edges between vertices in the $i$th cluster, and $d_i$ is the sum of the degrees of the vertices in the $i$th cluster. The modularity of a clustering of a graph is the difference between the fraction of all edges that fall into individual clusters and the fraction that would do so if the graph vertices were randomly connected. The optimal clustering of graphs maximizes the modularity.
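The following sketch evaluates Eq. (11.39) for a clustering given as a list of vertex sets; the example graph and clusterings are assumptions chosen for illustration.

def modularity(edges, clusters):
    """Modularity Q of a clustering, Eq. (11.39):
    Q = sum_i ( l_i/|E| - (d_i / 2|E|)^2 ), where l_i is the number of
    edges inside cluster i and d_i is the total degree of its vertices."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    Q = 0.0
    for cluster in clusters:
        c = set(cluster)
        l_i = sum(1 for u, v in edges if u in c and v in c)
        d_i = sum(degree[v] for v in c)
        Q += l_i / m - (d_i / (2 * m)) ** 2
    return Q

# Hypothetical graph: two triangles joined by a single bridge edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
print(modularity(edges, [{"a", "b", "c"}, {"d", "e", "f"}]))  # ~0.357
print(modularity(edges, [{"a", "b", "c", "d", "e", "f"}]))    # 0.0

As expected, splitting the two triangles yields higher modularity than putting every vertex in one cluster.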

Theoretically, many graph clustering problems can be regarded as finding good cuts, such as the sparsest cuts, on the graph. In practice, however, a number of challenges exist:

■ High computational cost: Many graph cut problems are computationally expensive. The sparsest cut problem, for example, is NP-hard. Therefore, finding the optimal solutions on large graphs is often impossible. A good trade-off between efficiency/scalability and quality has to be achieved.

■ Sophisticated graphs: Graphs can be more sophisticated than the ones described here, involving weights and/or cycles.

■ High dimensionality: A graph can have many vertices. In a similarity matrix, a vertex is represented as a vector (a row in the matrix) with a dimensionality that is the number of vertices in the graph. Therefore, graph clustering methods must handle high dimensionality.

■ Sparsity: A large graph is often sparse, meaning each vertex on average connects to only a small number of other vertices. A similarity matrix from a large sparse graph can also be sparse.

There are two kinds of methods for clustering graph data, which address these challenges. One uses clustering methods for high-dimensional data, while the other is designed specifically for clustering graphs.

The first group of methods is based on generic clustering methods for high-dimensional data. They extract a similarity matrix from a graph using a similarity measure such as those discussed in Section 11.3.2. A generic clustering method, typically one designed for high-dimensional data, can then be applied on the similarity matrix to discover clusters. For example, in many scenarios, once a similarity matrix is obtained, spectral clustering methods (Section 11.2.4) can be applied. Spectral clustering can approximate optimal graph cut solutions. For additional information, please refer to the bibliographic notes (Section 11.7).
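As a sketch of this pipeline, assuming scikit-learn is available, the code below fills a similarity matrix using the SCAN-style structural similarity of Eq. (11.40), introduced later in this section, and hands it to a generic spectral clustering routine; any measure from Section 11.3.2 could be substituted. The graph is hypothetical.

import numpy as np
from sklearn.cluster import SpectralClustering

# Hypothetical graph: two triangles joined by one bridge edge.
vertices = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
adj = {v: {v} for v in vertices}       # each neighborhood includes the vertex itself
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

# Similarity matrix from the normalized common-neighborhood measure.
n = len(vertices)
S = np.zeros((n, n))
for i, u in enumerate(vertices):
    for j, v in enumerate(vertices):
        S[i, j] = len(adj[u] & adj[v]) / np.sqrt(len(adj[u]) * len(adj[v]))

# A generic clustering method applied to the precomputed similarity matrix.
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(S)
print(dict(zip(vertices, labels)))  # the two triangles should separate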

The second group of methods is specific to graphs. They search the graph to find well-connected components as clusters. Let’s look at a method called SCAN (Structural Clustering Algorithm for Networks) as an example.

Given an undirected graph $G = (V, E)$, for a vertex $u \in V$, the neighborhood of u is $\Gamma(u) = \{v \mid (u, v) \in E\} \cup \{u\}$. Using the idea of structural-context similarity, SCAN measures the similarity between two vertices, $u, v \in V$, by the normalized common neighborhood size, that is,

$\sigma(u, v) = \frac{|\Gamma(u) \cap \Gamma(v)|}{\sqrt{|\Gamma(u)||\Gamma(v)|}}$ (11.40)

The larger the value computed, the more similar the two vertices. SCAN uses a similarity threshold ε to define cluster membership. For a vertex $u \in V$, the ε-neighborhood of u is defined as $N_\varepsilon(u) = \{v \in \Gamma(u) \mid \sigma(u, v) \geq \varepsilon\}$. The ε-neighborhood of u contains all neighbors of u with a structural-context similarity to u that is at least ε.

In SCAN, a core vertex is a vertex inside of a cluster. That is, $u \in V$ is a core vertex if $|N_\varepsilon(u)| \geq \mu$, where μ is a popularity threshold. SCAN grows clusters from core vertices. If a vertex v is in the ε-neighborhood of a core u, then v is assigned to the same cluster as u. This process of growing clusters continues until no cluster can be further grown. The process is similar to the density-based clustering method, DBSCAN (Chapter 10).

Formally, a vertex v can be directly reached from a core u if $v \in N_\varepsilon(u)$. Transitively, a vertex v can be reached from a core u if there exist vertices $w_1, w_2, \ldots, w_n$ such that $w_1$ can be reached from u, $w_i$ can be reached from $w_{i-1}$ for $1 < i \leq n$, and v can be reached from $w_n$. Moreover, two vertices $u, v \in V$, which may or may not be cores, are said to be connected if there exists a core w such that both u and v can be reached from w. All vertices in a cluster are connected. A cluster is a maximal set of vertices such that every pair in the set is connected.

Some vertices may not belong to any cluster. Such a vertex u is a hub if the neighborhood $\Gamma(u)$ of u contains vertices from more than one cluster. If a vertex does not belong to any cluster and is not a hub, it is an outlier.

The SCAN algorithm is shown in Figure 11.15. The search framework closely resembles the cluster-finding process in DBSCAN. SCAN finds a cut of the graph, where each cluster is a set of vertices that are connected based on the transitive similarity in a structural context.

Figure 11.15 SCAN algorithm for cluster analysis on graph data.
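To ground the description, here is a compact Python sketch of SCAN following the definitions above. The ε and μ settings and the example graph are illustrative assumptions, and the algorithm of Figure 11.15 may differ in bookkeeping details.

import math
from collections import deque

def scan(adj, eps=0.7, mu=2):
    """SCAN sketch: grow clusters from core vertices, then label the
    remaining vertices as hubs or outliers. adj maps each vertex to its
    set of adjacent vertices in an undirected graph."""
    gamma = {u: adj[u] | {u} for u in adj}       # Γ(u) includes u itself

    def sigma(u, v):                              # Eq. (11.40)
        return len(gamma[u] & gamma[v]) / math.sqrt(len(gamma[u]) * len(gamma[v]))

    def eps_neighborhood(u):                      # N_eps(u)
        return {v for v in gamma[u] if sigma(u, v) >= eps}

    cluster_of = {}
    next_id = 0
    for u in adj:
        if u in cluster_of or len(eps_neighborhood(u)) < mu:
            continue                              # not an unvisited core
        # Grow a new cluster from core u, DBSCAN-style.
        queue = deque([u])
        cluster_of[u] = next_id
        while queue:
            w = queue.popleft()
            if len(eps_neighborhood(w)) < mu:
                continue                          # only cores propagate growth
            for v in eps_neighborhood(w):         # v directly reached from core w
                if v not in cluster_of:
                    cluster_of[v] = next_id
                    queue.append(v)
        next_id += 1

    hubs, outliers = set(), set()
    for u in adj:
        if u in cluster_of:
            continue
        touched = {cluster_of[v] for v in adj[u] if v in cluster_of}
        (hubs if len(touched) > 1 else outliers).add(u)
    return cluster_of, hubs, outliers

# Hypothetical usage: two triangles bridged by vertex x.
edges = [("a", "b"), ("a", "c"), ("b", "c"),
         ("d", "e"), ("d", "f"), ("e", "f"),
         ("c", "x"), ("x", "d")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)
print(scan(adj))  # x touches both clusters, so it is labeled a hub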

An advantage of SCAN is that its time complexity is linear with respect to the number of edges. In very large and sparse graphs, the number of edges is on the same scale as the number of vertices. Therefore, SCAN is expected to have good scalability when clustering large graphs.
