Mining Large Graphs 195
Effective diameter. The effective diameter of a graph is the length of the longest path necessary to connect 90% of the possible node pairs.∗ Understanding this value guides our intuition about short paths between nodes. Generating an accurate estimate of the effective diameter is possible in O(n) time and memory using a simple algorithm with a sophisticated analysis [24,25].
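To make the definition concrete, here is a minimal sketch that estimates the effective diameter by running breadth-first searches from a sample of source nodes and taking the 90th percentile of the observed shortest-path distances. This is an illustration of the quantity being estimated, not the O(n) algorithm of [24,25], which relies on probabilistic neighborhood counters; the function name and adjacency-list format are our own choices.

```python
import random
from collections import deque

def effective_diameter(adj, samples=100, percentile=0.9, seed=0):
    """Estimate the effective diameter by BFS from sampled sources.

    adj: dict mapping each node to a list of its neighbors.
    Returns the smallest observed distance d such that `percentile`
    of the reachable (source, target) pairs are within distance d.
    """
    rng = random.Random(seed)
    nodes = list(adj)
    sources = rng.sample(nodes, min(samples, len(nodes)))
    dists = []
    for s in sources:
        # standard BFS from s over the unweighted graph
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        dists.extend(d for d in dist.values() if d > 0)
    dists.sort()
    # pick the 90th-percentile distance among reachable pairs
    return dists[min(int(percentile * len(dists)), len(dists) - 1)]
```

On a path of five nodes this returns 4 (the true diameter dominates the top decile of distances), while on a clique every pair is at distance 1.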
Extremal eigenvalues. There are a variety of matrix structures associated with a graph. One of the most common is the adjacency matrix, denoted A, where A_{i,j} = 1 if there is an edge between vertices i and j and A_{i,j} = 0 otherwise.† Another common matrix is the normalized Laplacian, denoted L, where L_{i,i} = 1, L_{i,j} = −1/√(degree(i) · degree(j)) if there is an edge between i and j, and L_{i,j} = 0 otherwise. The largest and smallest eigenvalues and eigenvectors of the adjacency or normalized Laplacian matrix of a graph reveal a host of graph properties, from a network centrality score known as eigenvector centrality to the Fiedler vector that indicates good ways of splitting a graph into pieces [26,27]. The best algorithms for these problems use the ARPACK software [28], which includes sophisticated techniques to lock eigenvalues and eigenvectors after they have converged. This method requires something like O(nk log n) time and memory to compute reasonable estimates of the k extremal eigenvalues and eigenvectors.‡
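As a concrete sketch, SciPy's eigsh routine wraps the ARPACK software cited above. The following builds the normalized Laplacian for a small example graph (the edge list and function name are illustrative, not from the text) and extracts the two smallest eigenpairs; the eigenvector paired with the second-smallest eigenvalue is the Fiedler vector.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def normalized_laplacian(edges, n):
    """Build L = I - D^{-1/2} A D^{-1/2} for an undirected graph on n nodes."""
    rows, cols = zip(*edges)
    A = sp.coo_matrix((np.ones(len(edges)), (rows, cols)), shape=(n, n))
    A = (A + A.T).tocsr()  # store each undirected edge in both directions
    d = np.asarray(A.sum(axis=1)).ravel()
    Dinv = sp.diags(1.0 / np.sqrt(d))
    return sp.eye(n) - Dinv @ A @ Dinv

# Example: two triangles joined by the edge (2, 3).
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 5), (5, 3)]
L = normalized_laplacian(edges, 6)

# eigsh calls ARPACK; which='SM' asks for the smallest-magnitude eigenvalues.
vals, vecs = eigsh(L, k=2, which='SM')
fiedler = vecs[:, 1]  # eigenvector of the second-smallest eigenvalue
```

For a connected graph the smallest eigenvalue of L is 0, and all eigenvalues of the normalized Laplacian lie in [0, 2]; the signs of the Fiedler vector's entries suggest a split of the graph into pieces.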
Triangle counting. Triangles, or triples of vertices (i, j, k) that are all connected to each other, have a variety of uses in large graph mining. For instance, counting the triangles incident to a node helps indicate the tendency of the graph to have interesting groups, and thus features in many link prediction, recommendation, and anomaly detection schemes. Given a sparse graph (such that there are order n edges), counting the triangles takes O(n√n) work and memory.
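A standard way to achieve this bound is to orient each edge from its lower-degree to its higher-degree endpoint and then intersect the "forward" neighbor sets; with m edges this costs O(m^{3/2}) operations, which is O(n√n) when m is order n. The sketch below illustrates the idea (the function name and dict-of-lists representation are our own):

```python
def count_triangles(adj):
    """Count triangles by intersecting forward-neighbor sets.

    adj: dict mapping each node to a list of its neighbors.
    Each edge is oriented from its lower-ranked endpoint (by degree,
    ties broken by node id) to its higher-ranked endpoint, so every
    triangle is counted exactly once.
    """
    rank = {u: i for i, u in
            enumerate(sorted(adj, key=lambda u: (len(adj[u]), u)))}
    fwd = {u: {v for v in adj[u] if rank[v] > rank[u]} for u in adj}
    count = 0
    for u in adj:
        for v in fwd[u]:
            # each common forward neighbor w closes a triangle (u, v, w)
            count += len(fwd[u] & fwd[v])
    return count
```

On the complete graph K4 this counts 4 triangles; on a path it counts none.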
All-pairs problems. Explicit all-pairs computations (shortest paths, commute times, and graph kernels [29]) are generally infeasible for large graphs. Sometimes there are algorithms that enable fast (near constant-time) queries of the distance between any given pair of nodes, or of the closest k nodes to a given node; and there is a class of algorithms that generate so-called Nyström approximations of these distances that yield near constant-time queries. Finding exact, scalable methods for these problems is one of the open challenges in large graph mining.
There are of course many other things that could be computed, for example, the δ-hyperbolicity properties of a graph with a Θ(n^4) algorithm [30]; however, the tasks above are among the most representative problems and algorithms in which graph miners are interested. See Table 12.1 for a brief summary.
12.2.3 Classification of Large Graphs
We now classify large graphs based on their size. As always with a classification, this is only
a rough guide that aids our intuition about some natural boundaries in how properties of
graph mining change with size. We will use the previous list of tasks from Section 12.2.2
∗ The diameter of a graph is the length of the longest shortest path over all pairs of nodes that have a valid path between them. This measure is not reliable or robust, as many graphs contain a small number of outlier pieces that greatly increase the diameter. Clearly, the exact percentile is entirely arbitrary, but choosing 90% is common. A parameter-free alternative is the average distance in the graph.
† Formally, all matrices associated with a graph require a mapping of the vertices to the indices 1 to n; however, many implementations of matrix algorithms on graphs need not create this mapping explicitly. Instead, the algorithm can use the natural vertex identifiers with the implicit understanding that the algorithm is equivalent to some ordering of the vertices.
‡ This bound is not at all precise, but a fully precise bound is not a useful guide to practice; this statement represents a working intuition for how long it takes compared to other ideas.