12
Mining Large Graphs
David F. Gleich and Michael W. Mahoney
CONTENTS
12.1 Introduction .................................................................... 192
12.1.1 Scope and Overview ................................................... 192
12.2 Preliminaries ................................................................... 193
12.2.1 Graph Representations ................................................ 193
12.2.2 Graph Mining Tasks ................................................... 194
12.2.3 Classification of Large Graphs ........................................ 195
12.2.4 Large Graph Mining Systems ......................................... 197
12.2.5 Sources for Data ....................................................... 197
12.3 Canonical Types of Large-Scale Graph Mining Methods ...................... 198
12.3.1 Geodesic Neighborhood-Based Graph Mining ........................ 198
12.3.2 Diffusion Neighborhood-Based Graph Mining ........................ 199
12.3.3 Generalized Matrix-Vector Products Graph Mining .................. 199
12.4 Mining with Geodesic Neighborhoods ......................................... 199
12.4.1 Single Neighborhoods .................................................. 199
12.4.2 All Neighborhoods ..................................................... 200
12.4.3 Complexities with Power-Law Graphs ................................ 200
12.5 Mining with Diffusion Neighborhoods ......................................... 201
12.5.1 ACL Push Methods ................................................... 203
12.5.2 Random Walks with Restart .......................................... 203
12.6 Mining with Generalized Matrix-Vector Products ............................. 204
12.6.1 Algorithms with Standard Matrix-Vector Products .................. 205
12.6.2 Algorithms with Semi-Ring Matrix-Vector Products ................. 205
12.6.3 General Updates ....................................................... 207
12.6.4 Edge-Oriented Updates ................................................ 207
12.6.5 Complexities with Power-Law Graphs ................................ 208
12.6.6 Implementations in SQL ............................................... 208
12.7 Limitations and Tradeoffs in Mining Large Graphs ........................... 208
12.7.1 Graph Drawing ........................................................ 209
12.7.2 Viral Propagation ..................................................... 210
12.7.3 Communities and Modularity ......................................... 211
12.7.3.1 Preliminaries ............................................... 211
12.7.3.2 Modularity Definition ...................................... 212
12.7.3.3 Modularity as a Cut Measure .............................. 212
12.8 Large Graph Mining in the Future ............................................ 214
12.8.1 Mining with Streaming Procedures ................................... 214
12.8.2 Mining with Generalized Matrix–Matrix Products ................... 214
192 Handbook of Big Data
Acknowledgments ...................................................................... 215
References ............................................................................. 215
12.1 Introduction
Graphs provide a general representation or data model for many types of data, where pair-
wise relationships are known or thought to be particularly important.
Thus, it should
not be surprising that interest in graph mining has grown with the recent interest in big
data. Much of the big data generated and analyzed involves pair-wise relationships among
a set of entities. For example, in e-commerce applications such as with Amazon’s product
database, customers are related to products through their purchasing activities; on the web,
web pages are related through hypertext linking relationships; on social networks such as
Facebook, individuals are related through their friendships; and so on. Similarly, in scientific
applications, research articles are related through citations; proteins are related through
metabolic pathways, co-expression, and regulatory network effects within a cell; materials
are related through models of their crystalline structure; and so on.
While many graphs are small, many large graphs are now extremely large. For
example, in early 2008, Google announced that it had indexed over one trillion URLs
on the internet, corresponding to a graph with over one trillion nodes [1]; in 2012, the
Facebook friendship network spanned 721 million individuals and had 137 billion links [2];
phone companies process a few trillion calls a year [3]; the human brain has around 100
billion neurons and 100 trillion neuronal connections [4]; one of the largest reported graph
experiments involved 4.4 trillion nodes and around 70 trillion edges in a synthetic experiment
that required one petabyte of storage space [5]; and one of the largest reported experiments
with a real-world graph involved over 1.5 trillion edges [6].
Given the ubiquity, size, and importance of graphs in many application areas, it should
come as no surprise that large graph mining serves numerous roles within the large-scale data
analysis ecosystem. For example, it can help us learn new things about the world, including
both the chemical and biological sciences [7,8] as well as results in the social and economic
sciences such as the Facebook study that showed that any two people in the(ir) world can
be connected through approximately four intermediate individuals [2]. Alternatively, large
graph mining produces similar information for recommendation, suggestion, and prediction
from messy data [9,10]; it can also tell us how to optimize a data infrastructure to improve
response time [11]; and it can tell us when and how our data are anomalous [12].
12.1.1 Scope and Overview
In this chapter, we will provide an overview of several topics in the general area of mining
large graphs. This is a large and complicated area. Thus, rather than attempting to be
comprehensive, we will instead focus on what seems to us to be particularly interesting or
underappreciated algorithmic developments that will in upcoming years provide the basis for
an improved understanding of the properties of moderately large to very large informatics
graphs. There are many reviews and overviews for the interested reader to learn more about
graph mining; see, for example, [13,14]. An important theme in our chapter is that large
graphs are often very different than small graphs, and thus intuitions from small graphs
often simply do not hold for large graphs.* A second important theme is that, depending on
the size of the graph, different classes of algorithms may be more or less appropriate. Thus,
we will concern ourselves primarily with what is (and is not) even possible in large graph
mining; we'll describe why one might (or might not) be interested in performing particular
graph mining tasks that are possible; and we will provide brief comments on how to make a
large graph mining task work on a large distributed system such as a MapReduce cluster or a
Spark cluster. Throughout, we'll highlight some of the common challenges, we'll discuss the
heuristics and procedures used to overcome these challenges, and we'll describe how some of
these procedures are useful outside the domain for which they were originally developed. At
several points, we will also highlight the relationships between seemingly distinct methods
and unexpected, often implicit, properties of large-scale graph algorithms.

*In the simplest case, a graph G = (V, E) consists of a set V of vertices or nodes and a set E of edges,
where each edge consists of an undirected pair of nodes. Of course, in many applications one is interested in
graphs that have weighted or directed edges, are time-varying, have additional meta-information associated
with the nodes or edges, and so on.
12.2 Preliminaries
When data is represented as a graph, the objects underlying the relationships are called
nodes or vertices, and the relationships are called edges, links, or arcs. For instance, if we are
considering a dataset representing web pages and the links from one page to another, then
the vertices represent the web pages and the edges represent those links between pages. The
result is a directed graph because edges between pages need not be reciprocal. Thus, the
idea with representing the data as a graph is that we can abstract the details of a particular
domain away into the formation of a graph. Then we can take a domain-specific question,
such as “How do I understand phone calling patterns?,” and we can rephrase that as a
question about the vertices and edges in the graph that is used to model the data.
Let us note that there are often many ways to turn a set of data into a graph. There could
be multiple types of possible edges corresponding to different types of relationships among
the objects. This is common in what is known as semantic graph analysis and semantic
graph mining. Determining what edges to use from such a graph is a fascinating problem
that can often have a dramatic effect on the result and/or the scalability of an algorithm. In
order to keep our discussion contained, however, we will assume that the underlying graph
has been constructed in such a way that the graph mining tasks we discuss make sense on
the final graph. Having a non-superficial understanding of what graph mining algorithms
actually do and why they might or might not be useful often provides excellent guidance
on choosing nodes/edges to include or exclude in the graph construction process.
12.2.1 Graph Representations
The canonical graph we analyze is G =(V,E), where V is the set of vertices and E is the
set of edges. We will use n to denote the number of vertices. We assume that the number of
edges is O(n) as well, and we will use this for complexity results. If we wish to be specific,
the number of edges will be |E|. Graphs can be either directed or undirected, although some
algorithms may not make sense on both types.
We have attempted in this preliminary work to strike a balance between providing accurate intuition
about our perspective on large graph mining and precise formal statements. In the main text, we skew
toward accurate intuition, and in some cases, we provide additional technical caveats for the experts in the
footnotes.
Or are—aside from the linguistic issue, one of the challenges in developing graph algorithms is that
graphs can be used to represent a single data point as well as many data points. For example, there is
N = 1 web graph out there, but graphs are also used to represent correlations and similarities between
many different data points, each of which is represented by a feature vector. Different research areas think
about these issues in very different ways.
For tiny graphs, for example, graphs containing fewer than several thousand nodes,
one can take a bird's-eye view and think about the entire graph since, for example, it can
be stored in processor cache. For larger graphs, however, it is important to worry about how the
data are structured in order to determine how algorithms run. We need to discuss two important
representations of graphs that have a large impact on what is and is not possible with large
graph mining.
Edge list. The edge list is simply a list of pairs of vertices in the graph, one pair for
each edge. Edges can appear in any order. There is no index, so checking whether or not
an edge exists requires a linear scan over all edges. This might also be distributed among
many machines. Edge lists are common in graphs created based on translating data from
sources such as log files into relationships.
Adjacency list. Given a vertex, its adjacency list is the set of neighbors for that vertex.
An adjacency list representation allows us to query for this set of neighbors in a time that we
will consider constant.
Adjacency lists are common when graphs are an explicit component
of the original data model.
The adjacency list is the most flexible format because it can always serve as an edge list
through a simple in-place transformation. In comparison, although building the adjacency
list representation from an edge list is a linear-time operation, it may involve an expensive
amount of data movement within a distributed environment.
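To make the tradeoff concrete, here is a minimal Python sketch (using a made-up toy edge list) of the linear-time conversion from an edge list to an adjacency list:

```python
from collections import defaultdict

# A toy undirected edge list; in practice this might come from parsing
# log files or a distributed collection of pairs.
edge_list = [(0, 1), (0, 2), (1, 2), (2, 3)]

# One linear scan builds the adjacency list. Checking whether an edge
# exists in the raw edge list requires scanning all pairs; with the
# adjacency list, a neighbor query is a single dictionary lookup.
adjacency = defaultdict(list)
for u, v in edge_list:
    adjacency[u].append(v)
    adjacency[v].append(u)  # undirected: store both directions

print(dict(adjacency))
```

On one machine this is a single scan; in a distributed setting the same conversion requires grouping all pairs by vertex, which is the expensive data movement mentioned above.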
12.2.2 Graph Mining Tasks
We will use the following representative problems/algorithms to help frame our discussion
below.
Random walk steps. A random walk in a graph moves from vertex to vertex by randomly
choosing a neighbor. For most adjacency list representations, one step of a random
walk is a constant-time operation. Running millions of random walk steps given an
adjacency list representation of a large graph is easy, and these steps can be used to
extract a small region of a massive graph near the seed node [16,17].
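As an illustrative sketch (the toy graph and the `random_walk` helper below are made up for the example), a walk over an adjacency list needs only one neighbor lookup per step:

```python
import random

# Toy adjacency list; any dict mapping a vertex to its neighbor list works.
adjacency = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}

def random_walk(adjacency, seed, steps, rng_seed=0):
    """Take `steps` random walk steps starting from vertex `seed`.
    Each step picks a uniformly random neighbor of the current vertex,
    a constant-time operation on an adjacency list."""
    rng = random.Random(rng_seed)
    current = seed
    visited = [current]
    for _ in range(steps):
        current = rng.choice(adjacency[current])
        visited.append(current)
    return visited

walk = random_walk(adjacency, seed=0, steps=10)
print(walk)
```

The vertices touched by many such walks concentrate near the seed, which is what makes walks useful for extracting a small region of a massive graph.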
Connected components. Determining the connected components of a graph is a fundamental
step in most large graph mining pipelines. On an adjacency list, this can be done using
breadth-first search in O(n) time and memory. On an edge list, this can also be done
in O(n) time and memory, assuming that the diameter of the graph does not grow
with the size of the graph, by using semi-ring iterations [18] that we will discuss in
Section 12.6.
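A minimal sketch of the breadth-first-search approach on an adjacency list, using a toy graph with two components, might look like this:

```python
from collections import deque

def connected_components(adjacency):
    """Label each vertex with a component id via breadth-first search;
    O(n) time and memory when the number of edges is O(n)."""
    component = {}
    for source in adjacency:
        if source in component:
            continue
        component[source] = source  # new component rooted at source
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in component:
                    component[v] = source
                    queue.append(v)
    return component

# Two components: {0, 1, 2} and {3, 4}.
adjacency = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]}
print(connected_components(adjacency))  # → {0: 0, 1: 0, 2: 0, 3: 3, 4: 3}
```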
PageRank. PageRank [16] is one of a host of graph centrality measures [19] that give
information to address the question "What are the most important nodes in my
graph?" See [20] for a long list of examples of where it has been successfully used
to analyze large graphs. Just as with connected components, it takes O(n) time and
memory to compute PageRank, in either the adjacency list representation or edge list
representation. Computing PageRank is one of the most common primitives used to test
large graph analysis frameworks (e.g., [21–23]).
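As a sketch of the idea (not any particular system's implementation), the following toy code computes PageRank by power iteration, repeatedly passing each vertex's mass to its out-neighbors; `alpha` is the usual teleportation parameter, and the graph is made up for the example:

```python
def pagerank(adjacency, alpha=0.85, iters=50):
    """Power-iteration PageRank over an adjacency list; one sweep
    touches every edge once, so it costs O(n) when |E| = O(n)."""
    n = len(adjacency)
    x = {v: 1.0 / n for v in adjacency}
    for _ in range(iters):
        nxt = {v: (1.0 - alpha) / n for v in adjacency}
        for u, neighbors in adjacency.items():
            if neighbors:
                share = alpha * x[u] / len(neighbors)
                for v in neighbors:
                    nxt[v] += share
            else:
                # A dangling node spreads its mass uniformly.
                for v in nxt:
                    nxt[v] += alpha * x[u] / n
        x = nxt
    return x

# Toy directed graph; vertex 2 collects links from 0, 1, and 3.
adjacency = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
scores = pagerank(adjacency)
print(scores)
```

The scores sum to one, and the most heavily linked vertex receives the largest score, which is the behavior the centrality interpretation relies on.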
Various implementations and systems we consider may not truly guarantee constant time access to
the neighborhood set of vertices, for example, it may be O(log n), or be constant in some loosely amortized
sense, but this is still a useful approximation for the purposes of distinguishing access patterns for algorithms
in large graph mining.
This data movement is a great fit for Google’s MapReduce system.
This is not always guaranteed because selecting a random neighbor may be an O(d) operation, where
d is the degree of the node, depending on the implementation details.
Current research efforts are devoted to running random walks with restarts for millions of seeds
concurrently on edge list representations of massive graphs [15].
Effective diameter. The effective diameter of a graph is the length of the longest path
necessary to connect 90% of the possible node pairs.
Understanding this value guides
our intuition about short paths between nodes. Generating an accurate estimate of the
effective diameter is possible in O(n) time and memory using a simple algorithm with
a sophisticated analysis [24,25].
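To illustrate the definition (not the O(n) estimation algorithm itself, which relies on probabilistic counting sketches), the following toy code computes the effective diameter exactly on a tiny graph by running a breadth-first search from every node:

```python
from collections import deque

def bfs_distances(adjacency, source):
    """Shortest-path distances from `source` by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def effective_diameter(adjacency, percentile=0.9):
    """Smallest distance d such that at least `percentile` of the
    connected (ordered) node pairs are within distance d."""
    lengths = []
    for s in adjacency:
        for t, d in bfs_distances(adjacency, s).items():
            if s != t:
                lengths.append(d)
    lengths.sort()
    index = max(0, int(percentile * len(lengths)) - 1)
    return lengths[index]

# A path graph 0-1-2-3-4: the diameter is 4, but most pairs are closer,
# so the effective diameter is smaller.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(effective_diameter(path))  # → 3
```

Even on this tiny path the effective diameter (3) is smaller than the true diameter (4), which is the robustness property that motivates the 90% cutoff.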
Extremal eigenvalues. There are a variety of matrix structures associated with a graph.
One of the most common is the adjacency matrix, denoted A, where A_{i,j} = 1 if there is
an edge between vertices i and j and A_{i,j} = 0 otherwise. Another common matrix is the
normalized Laplacian, denoted L, where L_{i,i} = 1 and L_{i,j} = -1/√(degree(i) · degree(j))
if there is an edge between i and j, and L_{i,j} = 0 otherwise. The largest and smallest
eigenvalues and eigenvectors of the adjacency or normalized Laplacian matrix of a graph
reveal a host of graph properties, from a network centrality score known as eigenvector
centrality to the Fiedler vector that indicates good ways of splitting a graph into
pieces [26,27]. The best algorithms for these problems use the ARPACK software [28],
which includes sophisticated techniques to lock eigenvalues and vectors after they have
converged. This method would require something like O(nk log n) time and memory to
compute reasonable estimates of the extremal k eigenvalues and eigenvectors.
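In practice one would call ARPACK (for example, through a sparse eigensolver wrapper), but the underlying idea for the largest eigenvalue can be sketched with plain power iteration, computing y = Ax directly from the adjacency list; the toy graph below is a complete graph on four vertices, whose largest adjacency eigenvalue is 3:

```python
import math

def largest_adjacency_eigenvalue(adjacency, iters=200):
    """Estimate the largest eigenvalue of the adjacency matrix by power
    iteration; each matrix-vector product y = A x is computed directly
    from the adjacency list, so one iteration costs O(|E|)."""
    x = {v: 1.0 for v in adjacency}
    value = 0.0
    for _ in range(iters):
        y = {v: sum(x[u] for u in adjacency[v]) for v in adjacency}  # y = A x
        norm = math.sqrt(sum(val * val for val in y.values()))
        x = {v: val / norm for v, val in y.items()}
        # Rayleigh quotient x' A x of the current (unit-norm) iterate.
        value = sum(x[v] * sum(x[u] for u in adjacency[v]) for v in adjacency)
    return value

# Complete graph on 4 vertices: largest adjacency eigenvalue is 3.
k4 = {v: [u for u in range(4) if u != v] for v in range(4)}
print(round(largest_adjacency_eigenvalue(k4), 6))
```

Power iteration only recovers the single dominant eigenpair and converges slowly when eigenvalues are close; ARPACK's Lanczos-based routines address both issues, which is why they are preferred for the k extremal eigenvalues.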
Triangle counting. Triangles, or triples of vertices (i, j, k) where all are connected, have
a variety of uses in large graph mining. For instance, counting the triangles incident
to a node helps indicate the tendency of the graph to have interesting groups, and
these counts feature in many link prediction, recommendation system, and anomaly detection
schemes. Given a sparse graph (such that there are order n edges), computing the
triangles takes O(n√n) work and memory.
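A simple (and deliberately naive) sketch of triangle counting checks, for each vertex, which pairs of its neighbors are themselves adjacent; the toy 4-clique below is made up for the example:

```python
from itertools import combinations

def count_triangles(adjacency):
    """Count triangles by checking, for each vertex, which pairs of its
    neighbors are themselves connected; each triangle is discovered
    three times (once per corner), hence the division by 3."""
    neighbor_sets = {v: set(ns) for v, ns in adjacency.items()}
    total = 0
    for v, ns in neighbor_sets.items():
        for u, w in combinations(ns, 2):
            if w in neighbor_sets[u]:
                total += 1
    return total // 3

# A 4-clique contains C(4,3) = 4 triangles.
k4 = {v: [u for u in range(4) if u != v] for v in range(4)}
print(count_triangles(k4))  # → 4
```

This naive version costs O(Σ_v d_v²); the standard improvement orders vertices by degree and checks each neighbor pair in only one direction, which achieves the O(m^{3/2}) bound on sparse graphs.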
All-pairs problems. Explicit all-pairs computations (shortest paths, commute times, and
graph kernels [29]) on graphs are generally infeasible for large graphs. Sometimes, there
are algorithms that enable fast (near constant-time) queries of any given distance pair
or the closest k-nodes query; and there is a class of algorithms that generate so-called
Nyström approximations of these distances that yield near-constant time queries.
Finding exact scalable methods for these problems is one of the open challenges in
large graph mining.
There are of course many other things that could be computed, for example, the δ-
hyperbolicity properties of a graph with a Θ(n^4) algorithm [30]; however, these are among
the most representative problems/algorithms in which graph miners are interested. See
Table 12.1 for a brief summary.
12.2.3 Classification of Large Graphs
We now classify large graphs based on their size. As always with a classification, this is only
a rough guide that aids our intuition about some natural boundaries in how properties of
graph mining change with size. We will use the previous list of tasks from Section 12.2.2
The diameter of a graph is the length of the longest shortest path connecting any pair of nodes that have
a valid path between them. This measure is not reliable/robust, as many graphs contain a small number of
outlier pieces that increase the diameter substantially. Clearly, the exact percentile is entirely arbitrary, but choosing
90% is common. A parameter-less alternative is the average distance in the graph.
Formally, all matrices associated with a graph require a mapping of the vertices to the indices 1 to
n; however, many implementations of algorithms with matrices on graphs need not create this mapping
explicitly. Instead, the algorithm can use the natural vertex identifiers with the implicit understanding
that the algorithm is equivalent to some ordering of the vertices.
This bound is not at all precise, but a fully precise bound is not a useful guide to practice; this statement
represents a working intuition for how long it takes compared to other ideas.