12
Mining Large Graphs
David F. Gleich and Michael W. Mahoney
CONTENTS
12.1 Introduction .................................................................... 192
12.1.1 Scope and Overview ................................................... 192
12.2 Preliminaries ................................................................... 193
12.2.1 Graph Representations ................................................ 193
12.2.2 Graph Mining Tasks ................................................... 194
12.2.3 Classification of Large Graphs ........................................ 195
12.2.4 Large Graph Mining Systems ......................................... 197
12.2.5 Sources for Data ....................................................... 197
12.3 Canonical Types of Large-Scale Graph Mining Methods ...................... 198
12.3.1 Geodesic Neighborhood-Based Graph Mining ........................ 198
12.3.2 Diffusion Neighborhood-Based Graph Mining ........................ 199
12.3.3 Generalized Matrix-Vector Products Graph Mining .................. 199
12.4 Mining with Geodesic Neighborhoods ......................................... 199
12.4.1 Single Neighborhoods .................................................. 199
12.4.2 All Neighborhoods ..................................................... 200
12.4.3 Complexities with Power-Law Graphs ................................ 200
12.5 Mining with Diffusion Neighborhoods ......................................... 201
12.5.1 ACL Push Methods ................................................... 203
12.5.2 Random Walks with Restart .......................................... 203
12.6 Mining with Generalized Matrix-Vector Products ............................. 204
12.6.1 Algorithms with Standard Matrix-Vector Products .................. 205
12.6.2 Algorithms with Semi-Ring Matrix-Vector Products ................. 205
12.6.3 General Updates ....................................................... 207
12.6.4 Edge-Oriented Updates ................................................ 207
12.6.5 Complexities with Power-Law Graphs ................................ 208
12.6.6 Implementations in SQL ............................................... 208
12.7 Limitations and Tradeoffs in Mining Large Graphs ........................... 208
12.7.1 Graph Drawing ........................................................ 209
12.7.2 Viral Propagation ..................................................... 210
12.7.3 Communities and Modularity ......................................... 211
12.7.3.1 Preliminaries ............................................... 211
12.7.3.2 Modularity Definition ...................................... 212
12.7.3.3 Modularity as a Cut Measure .............................. 212
12.8 Large Graph Mining in the Future ............................................ 214
12.8.1 Mining with Streaming Procedures ................................... 214
12.8.2 Mining with Generalized Matrix–Matrix Products ................... 214
192 Handbook of Big Data
Acknowledgments ...................................................................... 215
References ............................................................................. 215
12.1 Introduction
Graphs provide a general representation or data model for many types of data, where pair-
wise relationships are known or thought to be particularly important.
Thus, it should
not be surprising that interest in graph mining has grown with the recent interest in big
data. Much of the big data generated and analyzed involves pair-wise relationships among
a set of entities. For example, in e-commerce applications such as with Amazon’s product
database, customers are related to products through their purchasing activities; on the web,
web pages are related through hypertext linking relationships; on social networks such as
Facebook, individuals are related through their friendships; and so on. Similarly, in scientific
applications, research articles are related through citations; proteins are related through
metabolic pathways, co-expression, and regulatory network effects within a cell; materials
are related through models of their crystalline structure; and so on.
While many graphs are small, many large graphs are now extremely large. For
example, in early 2008, Google announced that it had indexed over one trillion URLs
on the internet, corresponding to a graph with over one trillion nodes [1]; in 2012, the
Facebook friendship network spanned 721 million individuals and had 137 billion links [2];
phone companies process a few trillion calls a year [3]; the human brain has around 100
billion neurons and 100 trillion neuronal connections [4]; one of the largest reported graph
experiments involved 4.4 trillion nodes and around 70 trillion edges in a synthetic experiment
that required one petabyte of storage space [5]; and one of the largest reported experiments
with a real-world graph involved over 1.5 trillion edges [6].
Given the ubiquity, size, and importance of graphs in many application areas, it should
come as no surprise that large graph mining serves numerous roles within the large-scale data
analysis ecosystem. For example, it can help us learn new things about the world, including
both the chemical and biological sciences [7,8] as well as results in the social and economic
sciences such as the Facebook study that showed that any two people in the(ir) world can
be connected through approximately four intermediate individuals [2]. Alternatively, large
graph mining produces similar information for recommendation, suggestion, and prediction
from messy data [9,10]; it can also tell us how to optimize a data infrastructure to improve
response time [11]; and it can tell us when and how our data are anomalous [12].
12.1.1 Scope and Overview
In this chapter, we will provide an overview of several topics in the general area of mining
large graphs. This is a large and complicated area. Thus, rather than attempting to be
comprehensive, we will instead focus on what seems to us to be particularly interesting or
underappreciated algorithmic developments that will in upcoming years provide the basis for
an improved understanding of the properties of moderately large to very large informatics
graphs. There are many reviews and overviews for the interested reader to learn more about
graph mining; see, for example, [13,14]. An important theme in our chapter is that large
graphs are often very different than small graphs, and thus intuitions from small graphs
often simply do not hold for large graphs.* A second important theme is that, depending on
the size of the graph, different classes of algorithms may be more or less appropriate. Thus,
we will concern ourselves primarily with what is (and is not) even possible in large graph
mining; we'll describe why one might (or might not) be interested in performing particular
graph mining tasks that are possible; and we will provide brief comments on how to make a
large graph mining task work on a large distributed system such as a MapReduce cluster or a
Spark cluster. Throughout, we'll highlight some of the common challenges, we'll discuss the
heuristics and procedures used to overcome these challenges, and we'll describe how some of
these procedures are useful outside the domain for which they were originally developed. At
several points, we will also highlight the relationships between seemingly distinct methods
and unexpected, often implicit, properties of large-scale graph algorithms.

*In the simplest case, a graph G = (V, E) consists of a set V of vertices or nodes and a set E of edges,
where each edge consists of an undirected pair of nodes. Of course, in many applications one is interested in
graphs that have weighted or directed edges, are time-varying, have additional meta-information associated
with the nodes or edges, and so on.
12.2 Preliminaries
When data is represented as a graph, the objects underlying the relationships are called
nodes or vertices, and the relationships are called edges, links, or arcs. For instance, if we are
considering a dataset representing web pages and the links from one page to another, then
the vertices represent the web pages and the edges represent those links between pages. The
result is a directed graph because edges between pages need not be reciprocal. Thus, the
idea with representing the data as a graph is that we can abstract the details of a particular
domain away into the formation of a graph. Then we can take a domain-specific question,
such as “How do I understand phone calling patterns?,” and we can rephrase that as a
question about the vertices and edges in the graph that is used to model the data.
Let us note that there are often many ways to turn a set of data into a graph. There could
be multiple types of possible edges corresponding to different types of relationships among
the objects. This is common in what is known as semantic graph analysis and semantic
graph mining. Determining what edges to use from such a graph is a fascinating problem
that can often have a dramatic effect on the result and/or the scalability of an algorithm. In
order to keep our discussion contained, however, we will assume that the underlying graph
has been constructed in such a way that the graph mining tasks we discuss make sense on
the final graph. Having a non-superficial understanding of what graph mining algorithms
actually do and why they might or might not be useful often provides excellent guidance
on choosing nodes/edges to include or exclude in the graph construction process.
12.2.1 Graph Representations
The canonical graph we analyze is G =(V,E), where V is the set of vertices and E is the
set of edges. We will use n to denote the number of vertices. We assume that the number of
edges is O(n) as well, and we will use this for complexity results. If we wish to be specific,
the number of edges will be |E|. Graphs can be either directed or undirected, although some
algorithms may not make sense on both types.
We have attempted in this preliminary work to strike a balance between providing accurate intuition
about our perspective on large graph mining and precise formal statements. In the main text, we skew
toward accurate intuition, and in some cases, we provide additional technical caveats for the experts in the
footnotes.
Or are—aside from the linguistic issue, one of the challenges in developing graph algorithms is that
graphs can be used to represent a single data point as well as many data points. For example, there is
N = 1 web graph out there, but graphs are also used to represent correlations and similarities between
many different data points, each of which is represented by a feature vector. Different research areas think
about these issues in very different ways.
For tiny graphs, for example, graphs containing fewer than several thousand nodes,
one can take a bird's-eye view and think about the entire graph since, for example, it can
be stored in processor cache. For larger graphs, however, it is important to worry about how the
data are structured in order to determine how algorithms run. We need to discuss two important
representations of graphs that have a large impact on what is and is not possible with large
graph mining.
Edge list. The edge list is simply a list of pairs of vertices in the graph, one pair for
each edge. Edges can appear in any order. There is no index, so checking whether or not
an edge exists requires a linear scan over all edges. This might also be distributed among
many machines. Edge lists are common in graphs created based on translating data from
sources such as log files into relationships.
Adjacency list. Given a vertex, its adjacency list is the set of neighbors for that vertex.
An adjacency list representation allows us to query for this set of neighbors in a time that we
will consider constant.
Adjacency lists are common when graphs are an explicit component
of the original data model.
The adjacency list is the most flexible format because it can always serve as an edge list
through a simple in-place transformation. In comparison, although building the adjacency
list representation from an edge list is a linear-time operation, it may involve an expensive
amount of data movement within a distributed environment.
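To make the tradeoff concrete, here is a minimal Python sketch (using a made-up toy edge list) of the linear-time conversion from an edge list to an adjacency list:

```python
from collections import defaultdict

# A toy undirected edge list; in practice this might come from parsing
# log files or a distributed collection of pairs.
edge_list = [(0, 1), (0, 2), (1, 2), (2, 3)]

# One linear scan builds the adjacency list. Checking whether an edge
# exists in the raw edge list requires scanning all pairs; with the
# adjacency list, a neighbor query is a single dictionary lookup.
adjacency = defaultdict(list)
for u, v in edge_list:
    adjacency[u].append(v)
    adjacency[v].append(u)  # undirected: store both directions

print(dict(adjacency))
```

On one machine this is a single scan; in a distributed setting the same conversion requires grouping all pairs by vertex, which is the expensive data movement mentioned above.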
12.2.2 Graph Mining Tasks
We will use the following representative problems/algorithms to help frame our discussion
below.
Random walk steps. A random walk in a graph moves from vertex to vertex by randomly
choosing a neighbor. For most adjacency list representations, one step of a random
walk is a constant-time operation. Running millions of random walk steps given an
adjacency list representation of a large graph is easy, and these steps can be used to
extract a small region of a massive graph near the seed node [16,17].
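As an illustrative sketch (the toy graph and the `random_walk` helper below are made up for the example), a walk over an adjacency list needs only one neighbor lookup per step:

```python
import random

# Toy adjacency list; any dict mapping a vertex to its neighbor list works.
adjacency = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}

def random_walk(adjacency, seed, steps, rng_seed=0):
    """Take `steps` random walk steps starting from vertex `seed`.
    Each step picks a uniformly random neighbor of the current vertex,
    a constant-time operation on an adjacency list."""
    rng = random.Random(rng_seed)
    current = seed
    visited = [current]
    for _ in range(steps):
        current = rng.choice(adjacency[current])
        visited.append(current)
    return visited

walk = random_walk(adjacency, seed=0, steps=10)
print(walk)
```

The vertices touched by many such walks concentrate near the seed, which is what makes walks useful for extracting a small region of a massive graph.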
Connected components. Determining the connected components of a graph is a fundamental
step in most large graph mining pipelines. On an adjacency list, this can be done using
breadth-first search in O(n) time and memory. On an edge list, this can also be done
in O(n) time and memory, assuming that the diameter of the graph does not grow
with the size of the graph, by using semi-ring iterations [18] that we will discuss in
Section 12.6.
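A minimal sketch of the breadth-first-search approach on an adjacency list, using a toy graph with two components, might look like this:

```python
from collections import deque

def connected_components(adjacency):
    """Label each vertex with a component id via breadth-first search;
    O(n) time and memory when the number of edges is O(n)."""
    component = {}
    for source in adjacency:
        if source in component:
            continue
        component[source] = source  # new component rooted at source
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in component:
                    component[v] = source
                    queue.append(v)
    return component

# Two components: {0, 1, 2} and {3, 4}.
adjacency = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]}
print(connected_components(adjacency))  # → {0: 0, 1: 0, 2: 0, 3: 3, 4: 3}
```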
PageRank. PageRank [16] is one of a host of graph centrality measures [19] that give
information to address the question "What are the most important nodes in my
graph?" See [20] for a long list of examples of where it has been successfully used
to analyze large graphs. Just as with connected components, it takes O(n) time and
memory to compute PageRank, in either the adjacency list representation or edge list
representation. Computing PageRank is one of the most common primitives used to test
large graph analysis frameworks (e.g., [21–23]).
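As a sketch of the idea (not any particular system's implementation), the following toy code computes PageRank by power iteration, repeatedly passing each vertex's mass to its out-neighbors; `alpha` is the usual teleportation parameter, and the graph is made up for the example:

```python
def pagerank(adjacency, alpha=0.85, iters=50):
    """Power-iteration PageRank over an adjacency list; one sweep
    touches every edge once, so it costs O(n) when |E| = O(n)."""
    n = len(adjacency)
    x = {v: 1.0 / n for v in adjacency}
    for _ in range(iters):
        nxt = {v: (1.0 - alpha) / n for v in adjacency}
        for u, neighbors in adjacency.items():
            if neighbors:
                share = alpha * x[u] / len(neighbors)
                for v in neighbors:
                    nxt[v] += share
            else:
                # A dangling node spreads its mass uniformly.
                for v in nxt:
                    nxt[v] += alpha * x[u] / n
        x = nxt
    return x

# Toy directed graph; vertex 2 collects links from 0, 1, and 3.
adjacency = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
scores = pagerank(adjacency)
print(scores)
```

The scores sum to one, and the most heavily linked vertex receives the largest score, which is the behavior the centrality interpretation relies on.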
Various implementations and systems we consider may not truly guarantee constant time access to
the neighborhood set of vertices, for example, it may be O(log n), or be constant in some loosely amortized
sense, but this is still a useful approximation for the purposes of distinguishing access patterns for algorithms
in large graph mining.
This data movement is a great fit for Google’s MapReduce system.
This is not always guaranteed because selecting a random neighbor may be an O(d) operation, where
d is the degree of the node, depending on the implementation details.
Current research efforts are devoted to running random walks with restarts for millions of seeds
concurrently on edge list representations of massive graphs [15].
Effective diameter. The effective diameter of a graph is the length of the longest path
necessary to connect 90% of the possible node pairs.
Understanding this value guides
our intuition about short paths between nodes. Generating an accurate estimate of the
effective diameter is possible in O(n) time and memory using a simple algorithm with
a sophisticated analysis [24,25].
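To illustrate the definition (not the O(n) estimation algorithm itself, which relies on probabilistic counting sketches), the following toy code computes the effective diameter exactly on a tiny graph by running a breadth-first search from every node:

```python
from collections import deque

def bfs_distances(adjacency, source):
    """Shortest-path distances from `source` by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def effective_diameter(adjacency, percentile=0.9):
    """Smallest distance d such that at least `percentile` of the
    connected (ordered) node pairs are within distance d."""
    lengths = []
    for s in adjacency:
        for t, d in bfs_distances(adjacency, s).items():
            if s != t:
                lengths.append(d)
    lengths.sort()
    index = max(0, int(percentile * len(lengths)) - 1)
    return lengths[index]

# A path graph 0-1-2-3-4: the diameter is 4, but most pairs are closer,
# so the effective diameter is smaller.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(effective_diameter(path))  # → 3
```

Even on this tiny path the effective diameter (3) is smaller than the true diameter (4), which is the robustness property that motivates the 90% cutoff.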
Extremal eigenvalues. There are a variety of matrix structures associated with a graph.
One of the most common is the adjacency matrix, denoted A, where A_{i,j} = 1 if there is
an edge between vertices i and j and A_{i,j} = 0 otherwise. Another common matrix is the
normalized Laplacian, denoted L, where L_{i,i} = 1 and L_{i,j} = -1/√(degree(i) · degree(j))
if there is an edge between i and j, and L_{i,j} = 0 otherwise. The largest and smallest
eigenvalues and eigenvectors of the adjacency or normalized Laplacian matrix of a graph
reveal a host of graph properties, from a network centrality score known as eigenvector
centrality to the Fiedler vector that indicates good ways of splitting a graph into
pieces [26,27]. The best algorithms for these problems use the ARPACK software [28],
which includes sophisticated techniques to lock eigenvalues and vectors after they have
converged. This method would require something like O(nk log n) time and memory to
compute reasonable estimates of the extremal k eigenvalues and eigenvectors.
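In practice one would call ARPACK (for example, through a sparse eigensolver wrapper), but the underlying idea for the largest eigenvalue can be sketched with plain power iteration, computing y = Ax directly from the adjacency list; the toy graph below is a complete graph on four vertices, whose largest adjacency eigenvalue is 3:

```python
import math

def largest_adjacency_eigenvalue(adjacency, iters=200):
    """Estimate the largest eigenvalue of the adjacency matrix by power
    iteration; each matrix-vector product y = A x is computed directly
    from the adjacency list, so one iteration costs O(|E|)."""
    x = {v: 1.0 for v in adjacency}
    value = 0.0
    for _ in range(iters):
        y = {v: sum(x[u] for u in adjacency[v]) for v in adjacency}  # y = A x
        norm = math.sqrt(sum(val * val for val in y.values()))
        x = {v: val / norm for v, val in y.items()}
        # Rayleigh quotient x' A x of the current (unit-norm) iterate.
        value = sum(x[v] * sum(x[u] for u in adjacency[v]) for v in adjacency)
    return value

# Complete graph on 4 vertices: largest adjacency eigenvalue is 3.
k4 = {v: [u for u in range(4) if u != v] for v in range(4)}
print(round(largest_adjacency_eigenvalue(k4), 6))
```

Power iteration only recovers the single dominant eigenpair and converges slowly when eigenvalues are close; ARPACK's Lanczos-based routines address both issues, which is why they are preferred for the k extremal eigenvalues.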
Triangle counting. Triangles, or triples of vertices (i, j, k) where all are connected, have
a variety of uses in large graph mining. For instance, counting the triangles incident
to a node helps indicate the tendency of the graph to have interesting groups, and
these counts feature in many link prediction, recommendation system, and anomaly detection
schemes. Given a sparse graph (such that there are order n edges), computing the
triangles takes O(n√n) work and memory.
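A simple (and deliberately naive) sketch of triangle counting checks, for each vertex, which pairs of its neighbors are themselves adjacent; the toy 4-clique below is made up for the example:

```python
from itertools import combinations

def count_triangles(adjacency):
    """Count triangles by checking, for each vertex, which pairs of its
    neighbors are themselves connected; each triangle is discovered
    three times (once per corner), hence the division by 3."""
    neighbor_sets = {v: set(ns) for v, ns in adjacency.items()}
    total = 0
    for v, ns in neighbor_sets.items():
        for u, w in combinations(ns, 2):
            if w in neighbor_sets[u]:
                total += 1
    return total // 3

# A 4-clique contains C(4,3) = 4 triangles.
k4 = {v: [u for u in range(4) if u != v] for v in range(4)}
print(count_triangles(k4))  # → 4
```

This naive version costs O(Σ_v d_v²); the standard improvement orders vertices by degree and checks each neighbor pair in only one direction, which achieves the O(m^{3/2}) bound on sparse graphs.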
All-pairs problems. Explicit all-pairs computations (shortest paths, commute times, and
graph kernels [29]) on graphs are generally infeasible for large graphs. Sometimes, there
are algorithms that enable fast (near constant-time) queries of any given distance pair
or the closest k-nodes query; and there is a class of algorithms that generate so-called
Nyström approximations of these distances that yield near-constant time queries.
Finding exact scalable methods for these problems is one of the open challenges in
large graph mining.
There are of course many other things that could be computed, for example, the δ-
hyperbolicity properties of a graph with a Θ(n^4) algorithm [30]; however, these are among
the most representative problems/algorithms in which graph miners are interested. See
Table 12.1 for a brief summary.
12.2.3 Classification of Large Graphs
We now classify large graphs based on their size. As always with a classification, this is only
a rough guide that aids our intuition about some natural boundaries in how properties of
graph mining change with size. We will use the previous list of tasks from Section 12.2.2
The diameter of a graph is the length of the longest shortest path connecting any pair of nodes that have
a valid path between them. This measure is not reliable/robust, as many graphs contain a small number of
outlier pieces that increase the diameter substantially. Clearly, the exact percentile is entirely arbitrary, but choosing
90% is common. A parameter-less alternative is the average distance in the graph.
Formally, all matrices associated with a graph require a mapping of the vertices to the indices 1 to
n; however, many implementations of algorithms with matrices on graphs need not create this mapping
explicitly. Instead, the algorithm can use the natural vertex identifiers with the implicit understanding
that the algorithm is equivalent to some ordering of the vertices.
This bound is not at all precise, but a fully precise bound is not a useful guide to practice; this statement
represents a working intuition for how long it takes compared to other ideas.