The standard PageRank algorithm uses the Power Method to compute successive iterates that converge to the principal eigenvector of the Markov matrix A representing the Web link graph. Since it was shown in the previous chapter that the Power Method converges quickly, one approach to accelerating PageRank would be to directly modify the Power Method to exploit our knowledge about the matrix A.
In this chapter, we present several algorithms that accelerate the convergence of PageRank by using successive iterates of the Power Method to estimate the nonprincipal eigenvectors of A, and periodically subtracting these estimates from the current iterate x(k) of the Power Method. We call this class of algorithms extrapolation algorithms. We present three extrapolation algorithms in this chapter: Aitken Extrapolation and Quadratic Extrapolation exploit our knowledge that the first eigenvalue λ1(A) = 1, and Power Extrapolation exploits our knowledge from the previous chapter that λ2(A) = c.
Empirically, we show that Quadratic Extrapolation and Power Extrapolation speed up PageRank computation by 25–300% on a Web graph of 80 million nodes, with minimal overhead. This contribution is useful to the PageRank community and the numerical linear algebra community in general, as these are fast methods for determining the dominant eigenvector of a Markov matrix that is too large for standard fast methods to be practical.
We begin by introducing Aitken Extrapolation, which we develop as follows. We assume that the iterate x(k) can be expressed as a linear combination of the first two eigenvectors. This assumption allows us to solve for the principal eigenvector u1 in closed form using the successive iterates x(k), x(k+1), and x(k+2).
Of course, x(k) can only be approximated as a linear combination of the first two eigenvectors, so the u1 that we compute is only an estimate of the true u1. However, it can be seen from Section 2.3.1 that this approximation becomes increasingly accurate as k becomes larger.
We begin our formulation of Aitken Extrapolation by assuming that x(k) can be expressed as a linear combination of the first two eigenvectors u1 and u2:
Since the first eigenvalue λ1 of a Markov matrix is 1, we can write the next two iterates as
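Written out, a reconstruction of (5.1)–(5.3) consistent with the derivation that follows (α2 is the coefficient on u2, and λ2 the second eigenvalue):

```latex
\begin{align}
x^{(k)}   &= u_1 + \alpha_2 u_2, \tag{5.1}\\
x^{(k+1)} &= A x^{(k)} = u_1 + \alpha_2 \lambda_2 u_2, \tag{5.2}\\
x^{(k+2)} &= A x^{(k+1)} = u_1 + \alpha_2 \lambda_2^2 u_2. \tag{5.3}
\end{align}
```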
Now, let us define gi = (xi(k+1) − xi(k))² and hi = xi(k+2) − 2xi(k+1) + xi(k), where xi represents the ith component of a vector x. Doing simple algebra using (5.2) and (5.3) gives
Now, let us define fi as the quotient gi/hi:
Therefore,
Hence, from (5.1), we have a closed-form solution for u1:
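A reconstruction of the closed form (5.11), obtained by substituting fi = gi/hi into (5.1); it is exactly the Aitken Δ2 formula applied to the ith component:

```latex
(u_1)_i \;\approx\; x_i^{(k)} \;-\; \frac{\bigl(x_i^{(k+1)} - x_i^{(k)}\bigr)^2}{x_i^{(k+2)} - 2x_i^{(k+1)} + x_i^{(k)}}. \tag{5.11}
```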
However, since this solution is based on the assumption that x(k) can be written as a linear combination of u1 and u2, (5.11) gives only an approximation to u1. Algorithms 3 and 4 show how to use Aitken Extrapolation in conjunction with the Power Method to get consistently better estimates of u1.
Aitken Extrapolation is equivalent to applying the well-known Aitken Δ2 method for accelerating linearly convergent sequences [4] to each component of the iterate x(k). What is novel here is this derivation of Aitken acceleration, and the proof that Aitken acceleration computes the principal eigenvector of a Markov matrix in one step under the assumption that the power-iteration estimate can be expressed as a linear combination of the first two eigenvectors.
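As a concrete sketch (not the book's code; function names are ours), the componentwise step, guarded against vanishing denominators, might look like this in NumPy:

```python
import numpy as np

def aitken_extrapolation(x_k, x_k1, x_k2, eps=1e-12):
    """One Aitken Delta^2 step applied to each component of the iterate.

    x_k, x_k1, x_k2 are the successive Power Method iterates
    x(k), x(k+1) = A x(k), and x(k+2) = A x(k+1).
    """
    g = (x_k1 - x_k) ** 2          # g_i = (x_i^(k+1) - x_i^(k))^2
    h = x_k2 - 2.0 * x_k1 + x_k    # h_i = x_i^(k+2) - 2 x_i^(k+1) + x_i^(k)
    # where h_i ~ 0, the component has already converged; leave it unchanged
    h_safe = np.where(np.abs(h) > eps, h, 1.0)
    f = np.where(np.abs(h) > eps, g / h_safe, 0.0)
    u1 = x_k - f                   # closed form: u1 ~ x(k) - g/h
    return u1 / u1.sum()           # renormalize to unit L1 mass
```

When x(k) really is a combination of the first two eigenvectors, this step recovers u1 exactly; otherwise it produces the estimate whose behavior Figure 5.1 illustrates.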
As a sidenote, let us briefly develop a related method. Rather than using (5.4), let us define gi alternatively as gi = xi(k)·xi(k+2) − (xi(k+1))².
We define h as in (5.5), and fi now becomes
Algorithm 3: Aitken Extrapolation
Algorithm 4: Power Method with Aitken Extrapolation
By (5.2),
Again, this is an approximation to u1, since it is based on the assumption that x(k) can be expressed as a linear combination of u1 and u2. What is interesting here is that this is equivalent to performing a second-order epsilon acceleration algorithm [73] on each component of the iterate x(k). For this reason, we call this algorithm Epsilon Extrapolation. In experiments, Epsilon Extrapolation has performed similarly to Aitken Extrapolation, so we will focus on Aitken Extrapolation in the next sections.
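For reference, a reconstruction of the Epsilon Extrapolation closed form implied by the derivation above (hi is the same as in (5.5); gi is the alternative definition):

```latex
f_i \;=\; \frac{g_i}{h_i} \;=\; \frac{x_i^{(k)}\, x_i^{(k+2)} - \bigl(x_i^{(k+1)}\bigr)^2}{x_i^{(k+2)} - 2x_i^{(k+1)} + x_i^{(k)}} \;\approx\; (u_1)_i .
```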
In order for an extrapolation method such as Aitken Extrapolation or Epsilon Extrapolation to be useful, the overhead should be minimal. By overhead, we mean any costs in addition to the cost of applying Algorithm 1 to generate iterates. It is clear from inspection that the operation count of the loop in Algorithm 3 is O(n), where n is the number of pages on the Web. The operation count of one extrapolation step is less than the operation count of a single iteration of the Power Method, and since Aitken Extrapolation may be applied only periodically, we say that Aitken Extrapolation has minimal overhead. In our implementation, the additional cost of each application of Aitken Extrapolation was negligible—about 1% of the cost of a single iteration of the Power Method (i.e., 1% of the cost of Algorithm 1).
In Figure 5.1, we show the convergence of the Power Method with Aitken Extrapolation applied once at the 10th iteration, compared to the convergence of the unaccelerated Power Method for the STANFORD.EDU dataset. The x-axis denotes the number of times a multiplication by A occurred, that is, the number of times Algorithm 1 was invoked. Note that there is a spike at the acceleration step, but speedup occurs nevertheless. This spike is caused by the poor approximation for u1. The speedup likely occurs because, while the extrapolation step introduces some error in the components along each eigenvector, it reduces the error in the component along the second eigenvector, which the standard Power Method is slow to eliminate.
Multiple applications of Aitken Extrapolation are generally not more useful than only one. This is because, by the time the residual settles down from the first application, a second extrapolation approximation does not give a very different estimate from the current iterate itself.
For c = 0.99, Aitken Extrapolation takes 38% less time to reach an iterate with a residual of 0.01. However, after this initial speedup, the convergence rate for Aitken slows down, so that to reach an iterate with a residual of 0.002, the time savings drops to 13%. For lower values of c, Aitken provides much less benefit. Since there is a spike in the residual graph, if Aitken Extrapolation is applied too often, the power iterations will not converge. In experiments, Epsilon Extrapolation performed similarly to Aitken Extrapolation.
The methods presented thus far are very different from standard fast eigensolvers, which generally rely strongly on matrix factorizations or matrix inversions. Standard fast eigensolvers do not work well for the PageRank problem, since the Web hyperlink matrix is so large and sparse. For problems where the matrix is small enough for an efficient inversion, standard eigensolvers such as inverse iteration are likely to be faster than these methods. The Aitken and Epsilon Extrapolation methods take advantage of the fact that the first eigenvalue of the Markov hyperlink matrix is 1 to find an approximation to the principal eigenvector.
In the next section, we present Quadratic Extrapolation, which assumes the iterate x(k) can be expressed as a linear combination of the first three eigenvectors, and solves for u1 in closed form under this assumption. As we shall soon discuss, the Quadratic Extrapolation step is simply a linear combination of successive iterates, and thus does not produce spikes in the residual.
We develop the Quadratic Extrapolation algorithm as follows. We assume that the Markov matrix A has only 3 eigenvectors, and that the iterate x(k) can be expressed as a linear combination of these 3 eigenvectors. These assumptions allow us to solve for the principal eigenvector u1 in closed form using the successive iterates x(k), x(k+1), x(k+2), and x(k+3).
Of course, A has more than 3 eigenvectors, and x(k) can only be approximated as a linear combination of the first 3 eigenvectors. Therefore, the u1 that we compute in this algorithm is only an estimate for the true u1. We show empirically that this estimate is a better estimate of u1 than the iterate x(k), and that our estimate becomes closer to the true value of u1 as k becomes larger. In Section 5.3.3 we show that by periodically applying Quadratic Extrapolation to the successive iterates computed in PageRank, for values of c close to 1, we can speed up the convergence of PageRank by a factor of over 3.
We begin our formulation with the assumption stated at the beginning of this section: x(k) = u1 + α2u2 + α3u3.
We then define the successive iterates x(k+1) = Ax(k), x(k+2) = Ax(k+1), and x(k+3) = Ax(k+2).
Since we assume A has 3 eigenvectors, the characteristic polynomial pA(λ) is given by pA(λ) = γ0 + γ1λ + γ2λ² + γ3λ³.
A is a Markov matrix, so we know that the first eigenvalue λ1 = 1. The eigenvalues of A are also the zeros of the characteristic polynomial pA(λ). Therefore, pA(1) = γ0 + γ1 + γ2 + γ3 = 0.
The Cayley-Hamilton theorem states that any matrix A satisfies its own characteristic polynomial: pA(A) = 0 [30]. Therefore, by the Cayley-Hamilton theorem, for any vector z in ℝn, pA(A)z = 0.
Letting z = x(k), we have
From (5.13)–(5.15),
From (5.17),
We may rewrite this as
Let us make the following definitions: y(k+1) = x(k+1) − x(k), y(k+2) = x(k+2) − x(k), y(k+3) = x(k+3) − x(k), and Y = [y(k+1) y(k+2)].
We can now write (5.22) in matrix notation:
We now wish to solve for the coefficient vector γ = (γ0 γ1 γ2 γ3)T. Since we are not interested in the trivial solution γ = 0, we constrain the leading term of the characteristic polynomial: γ3 = 1.
We may do this because constraining a single coefficient of the polynomial does not affect the zeros.1 (5.26) is therefore written
This is an overdetermined system, so we solve the corresponding least-squares problem [γ1 γ2]T = −Y+y(k+3), where Y+ is the pseudoinverse of the matrix Y. Now, (5.17), (5.27), and (5.29) completely determine the coefficients of the characteristic polynomial pA(λ) (5.16).
Algorithm 5: Quadratic Extrapolation
We may now divide pA(λ) by λ − 1 to get the polynomial qA(λ), whose roots are λ2 and λ3, the second two eigenvalues of A
Simple polynomial division gives the following values for β0, β1, and β2: β0 = γ1 + γ2 + γ3, β1 = γ2 + γ3, and β2 = γ3.
Again, by the Cayley-Hamilton theorem, if z is any vector in ℝn, then qA(A)z = αu1, where u1 is the eigenvector of A corresponding to eigenvalue 1 (the principal eigenvector) and α is a scalar constant. Letting z = x(k), we have
From (5.13)–(5.15), we get a closed form solution for u1, up to a scaling constant: u1 ≈ β0x(k+1) + β1x(k+2) + β2x(k+3). However, since this solution is based on the assumption that A has only 3 eigenvectors, (5.36) gives only an approximation to u1. Algorithms 5 and 6 show how to use Quadratic Extrapolation in conjunction with the Power Method to get consistently better estimates of u1.
Algorithm 6: Power Method with Quadratic Extrapolation
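A compact sketch of the extrapolation step in Algorithm 5 (function and variable names are ours, not the book's), using a generic least-squares solver in place of the Gram-Schmidt procedure of Algorithm 7:

```python
import numpy as np

def quadratic_extrapolation(x0, x1, x2, x3):
    """One Quadratic Extrapolation step from four successive iterates
    x(k), x(k+1), x(k+2), x(k+3) (here x0..x3)."""
    # y(k+j) = x(k+j) - x(k)
    y1, y2, y3 = x1 - x0, x2 - x0, x3 - x0
    Y = np.column_stack([y1, y2])                  # n x 2 matrix
    # fix gamma3 = 1 and solve the least-squares problem Y [g1 g2]^T = -y3
    (g1, g2), *_ = np.linalg.lstsq(Y, -y3, rcond=None)
    g3 = 1.0
    # divide p_A by (lambda - 1): beta0 = g1+g2+g3, beta1 = g2+g3, beta2 = g3
    b0, b1, b2 = g1 + g2 + g3, g2 + g3, g3
    x_star = b0 * x1 + b1 * x2 + b2 * x3           # unnormalized estimate of u1
    return x_star / x_star.sum()
```

If x(k) truly lies in the span of the first three eigenvectors, the least-squares system is consistent and the returned vector is exactly u1; in general it is the improved estimate discussed above.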
1. Compute the reduced QR factorization Y = QR using 2 steps of Gram-Schmidt.
2. Compute the vector −QTy(k).
3. Solve the upper triangular system R[γ1 γ2]T = −QTy(k).
Algorithm 7: Using Gram-Schmidt to solve for γ1 and γ2
The overhead in performing the extrapolation shown in Algorithm 5 comes primarily from the least-squares computation of γ1 and γ2:
It is clear that the other steps in this algorithm are either O(1) or O(n) operations.
Since Y is an n × 2 matrix, we can do the least-squares solution cheaply in just 2 iterations of the Gram-Schmidt algorithm [68]. Therefore, γ1 and γ2 can be computed in O(n) operations. While a presentation of Gram-Schmidt is outside the scope of this book, we show in Algorithm 7 how to apply Gram-Schmidt to solve for [γ1γ2]T in O(n) operations. Since the extrapolation step is on the order of a single iteration of the Power Method, and since Quadratic Extrapolation is applied only periodically during the Power Method, we say that Quadratic Extrapolation has low overhead. In our experimental setup, the overhead of a single application of Quadratic Extrapolation is half the cost of a standard power iteration (i.e., half the cost of Algorithm 1). This number includes the cost of storing on disk the intermediate data required by Quadratic Extrapolation (such as the previous iterates), since they may not fit in main memory.
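A sketch of Algorithm 7's O(n) least-squares solve (names are ours): two Gram-Schmidt steps give the reduced QR factorization of the n × 2 matrix Y, after which the 2 × 2 triangular system is solved by back-substitution. The columns are assumed linearly independent.

```python
import numpy as np

def least_squares_gamma(y1, y2, rhs):
    """Solve min || [y1 y2] g - rhs || for g = [gamma1, gamma2] in O(n).

    y1, y2 are the columns of Y; rhs is the right-hand-side vector
    (-y(k) in the notation of Algorithm 7).
    """
    # two steps of Gram-Schmidt: Y = QR with Q n x 2, R 2 x 2 upper triangular
    r11 = np.linalg.norm(y1)
    q1 = y1 / r11
    r12 = q1 @ y2
    w = y2 - r12 * q1
    r22 = np.linalg.norm(w)
    q2 = w / r22
    # project the right-hand side: b = Q^T rhs
    b1, b2 = q1 @ rhs, q2 @ rhs
    # back-substitution on R g = b
    g2 = b2 / r22
    g1 = (b1 - r12 * g2) / r11
    return g1, g2
```

Every line is a dot product, norm, or axpy on length-n vectors, which is what makes the overhead linear in the number of pages.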
Of the algorithms we have discussed for accelerating the convergence of PageRank, Quadratic Extrapolation performs the best empirically. In particular, Quadratic Extrapolation considerably improves convergence relative to the Power Method when the damping factor c is close to 1. We measured the performance of Quadratic Extrapolation under various scenarios on the LARGEWEB dataset. Figure 5.2 shows the rates of convergence when c = 0.90; after factoring in overhead, Quadratic Extrapolation reduces the time needed to reach a residual of 0.001 by 23%.2 Figure 5.3 shows the rates of convergence when c = 0.95; in this case, Quadratic Extrapolation speeds up convergence more significantly, saving 31% in the time needed to reach a residual of 0.001. Finally, in the case where c = 0.99, the speedup is more dramatic. Figure 5.4 shows the rates of convergence of the Power Method and Quadratic Extrapolation for c = 0.99. Because the Power Method is so slow to converge in this case, we plot the curves until a residual of 0.01 is reached. The use of extrapolation saves 69% in time needed to reach a residual of 0.01; that is, the unaccelerated Power Method took more than three times as long as the Quadratic Extrapolation method to reach the desired residual. The wallclock times for these scenarios are summarized in Figure 5.5.
Figure 5.6 shows the convergence for the Power Method, Aitken Extrapolation, and Quadratic Extrapolation on the STANFORD.EDU dataset; each method was carried out to 200 iterations. To reach a residual of 0.01, Quadratic Extrapolation saved 59% in time over the Power Method, compared to a 38% savings for Aitken Extrapolation.
An important observation about Quadratic Extrapolation is that it does not necessarily need to be applied very often to achieve maximum benefit. By contracting the error in the current iterate along the direction of the second and third eigenvectors, Quadratic Extrapolation actually enhances the convergence of future applications of the standard Power Method. The Power Method, as discussed previously, is very effective in annihilating error components of the iterate in directions along eigenvectors with small eigenvalues. By subtracting approximations to the second and third eigenvectors, Quadratic Extrapolation leaves error components primarily along the smaller eigenvectors, which the Power Method is better equipped to eliminate.
For instance, in Figure 5.7, we plot the convergence when Quadratic Extrapolation is applied 5 times compared with when it is applied as often as possible (in this case, 14 times), to achieve a residual of 0.001. Note that the additional applications of Quadratic Extrapolation do not lead to much further improvement. In fact, once we factor in the 0.5 iteration-cost of each application of Quadratic Extrapolation, the case where it was applied 5 times ends up being faster.
Like Aitken and Epsilon Extrapolation, Quadratic Extrapolation makes the assumption that an iterate can be expressed as a linear combination of a subset of the eigenvectors of A in order to find an approximation to the principal eigenvector of A. In Aitken and Epsilon Extrapolation, we assume that x(k) can be written as a linear combination of the first two eigenvectors, and in Quadratic Extrapolation, we assume that x(k) can be written as a linear combination of the first three eigenvectors. Since the assumption made in Quadratic Extrapolation is closer to reality, the resulting approximations are closer to the true value of the principal eigenvector of A.
While Aitken and Epsilon Extrapolation are logical extensions of existing acceleration algorithms, Quadratic Extrapolation is completely novel. Furthermore, all these algorithms are general purpose. That is, they can be used to compute the principal eigenvector of any large, sparse Markov matrix, not just the Web graph. They should be useful in any situation where the size and sparsity of the matrix is such that a QR factorization is prohibitively expensive.
Both Aitken and Quadratic Extrapolation assume that none of the nonprincipal eigenvalues of the hyperlink matrix are known. However, in the previous chapter, we showed that the modulus of the second eigenvalue of A is given by the damping factor c. Note that the Web graph can have many eigenvalues with modulus c (e.g., c, −c, ci, and −ci). In the next sections, we discuss Power Extrapolation, which exploits known eigenvalues of A to accelerate PageRank.
As in the previous sections, the simplest extrapolation rule assumes that the iterate x(k−1) can be expressed as a linear combination of the eigenvectors u1 and u2, where u2 has eigenvalue c:
Now consider the current iterate x(k); because the Power Method generates iterates by successive multiplication by A, we can write x(k) as
Plugging in λ2 = c, we see that
This allows us to solve for u1 in closed form:
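The closed form is u1 ≈ (x(k) − c·x(k−1))/(1 − c); as a sketch in NumPy (names are ours):

```python
import numpy as np

def simple_extrapolation(x_prev, x_curr, c):
    """d = 1 Power Extrapolation: if x(k-1) = u1 + a2*u2 with
    eigenvalue(u2) = c, then x(k) = u1 + a2*c*u2, so subtracting
    c*x(k-1) eliminates the u2 term."""
    x_star = (x_curr - c * x_prev) / (1.0 - c)
    return x_star / x_star.sum()
```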
Figure 5.8 shows the convergence of this simple power extrapolation algorithm (which we shall call Simple Extrapolation) and the standard Power Method, where there was one application of Simple Extrapolation at iteration 3 of the Power Method. Simple Extrapolation is not effective, as the assumption that c is the only eigenvalue of modulus c is inaccurate. In fact, by doubling the error in the eigenspace corresponding to eigenvalue –c, this extrapolation technique slows down the convergence of the Power Method.
The next extrapolation rule assumes that the iterate x(k−2) can be expressed as a linear combination of the eigenvectors u1, u2, and u3, where u2 has eigenvalue c and u3 has eigenvalue −c:
Now consider the current iterate x(k); because the Power Method generates iterates by successive multiplication by A, we can write x(k) as
Plugging in λ2 = c and λ3 = –c, we see that
This allows us to solve for u1 in closed form:
A2 Extrapolation eliminates error along the eigenspaces corresponding to eigenvalues of c and –c.
Figure 5.9 shows the convergence of A2 extrapolated PageRank and the standard Power Method, where A2 Extrapolation was applied once at iteration 4. Empirically, A2 Extrapolation speeds up the convergence of the Power Method by 18%. The initial application increases the residual, but by correctly subtracting out much of the error along the largest nonprincipal eigenvectors, it speeds up convergence in subsequent Power Method iterations.
The previous extrapolation rule made use of the fact that (−c)² = c². We can generalize that derivation to the case where the eigenvalues of modulus c are given by cdi, where {di} are the dth roots of unity. From Theorem 2.1 of [35] and Theorem 1 given in the Appendix, it follows that these eigenvalues arise from leaf nodes in the strongly connected component (SCC) graph of the Web that are cycles of length d. Because we know empirically that the Web has such leaf nodes in the SCC graph, it is likely that eliminating error along the dimensions of eigenvectors corresponding to these eigenvalues will speed up PageRank.
We make the assumption that x(k−d) can be expressed as a linear combination of the eigenvectors {u1,…, ud+1}, where the eigenvalues of {u2, …, ud+1} are the dth roots of unity, scaled by c:
Then consider the current iterate x(k); because the Power Method generates iterates by successive multiplication by A, we can write x(k) as
But since λi is cdi, where di is a dth root of unity
Algorithm 8: Power Extrapolation
Algorithm 9: Power Method with Power Extrapolation
This allows us to solve for u1 in closed form:
For instance, for d = 4, the assumption made is that the nonprincipal eigenvalues of modulus c are given by c, −c, ci, and −ci (i.e., the 4th roots of unity, scaled by c). A graph in which the leaf nodes in the SCC graph contain only cycles of length l, where l is any divisor of d = 4, has exactly this property.
Algorithms 8 and 9 show how to use Ad Extrapolation in conjunction with the Power Method. Note that Power Extrapolation with d = 1 is just Simple Extrapolation.
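A sketch of the Ad Extrapolation step of Algorithm 8 (names are ours). Since each assumed nonprincipal eigenvalue cdi satisfies (cdi)^d = c^d, the correction is a single linear combination of x(k) and x(k−d):

```python
import numpy as np

def power_extrapolation(x_kmd, x_k, c, d):
    """A^d Power Extrapolation: given x(k-d) and x(k) = A^d x(k-d),
    and assuming the nonprincipal eigenvalues of modulus c are the
    dth roots of unity scaled by c, estimate
    u1 ~ (x(k) - c^d * x(k-d)) / (1 - c^d).
    d = 1 is Simple Extrapolation; d = 2 is A^2 Extrapolation."""
    cd = c ** d
    x_star = (x_k - cd * x_kmd) / (1.0 - cd)
    return x_star / x_star.sum()
```

This is the O(n) linear combination whose overhead is discussed next.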
The overhead in performing the extrapolation shown in Algorithm 8 comes from computing the linear combination (x(k) − c^d x(k−d))/(1 − c^d), an O(n) computation.
In our experimental setup, the overhead of a single application of Power Extrapolation is 1% of the cost of a standard power iteration. Furthermore, Power Extrapolation needs to be applied only once to achieve the full benefit.
In our experiments, Ad Extrapolation performs the best for d = 6. Figure 5.10 plots the convergence of Ad Extrapolation for d ∈ {1, 2, 4, 6, 8}, as well as of the standard Power Method, for c = 0.85 and c = 0.90.
The wallclock speedups, compared with the standard Power Method, for these 5 values of d for c = 0.85 are given in Table 5.1.
For comparison, Figure 5.11 compares the convergence of Quadratic Extrapolated PageRank with A6 Extrapolated PageRank. Note that the speedups in convergence are similar; however, A6 Extrapolation is much simpler to implement and has negligible overhead, so its wallclock speedup is higher. In particular, each application of Quadratic Extrapolation costs about half of a standard iteration and must be applied several times to achieve maximum benefit.
Type | Speedup
---|---
d = 1 | −28%
d = 2 | 18%
d = 4 | 26%
d = 6 | 30%
d = 8 | 22%
Quadratic | 21%
In this section, we present empirical results demonstrating the suitability of the L1 residual, even in the context of measuring convergence of induced document rankings. In measuring the convergence of the PageRank vector, prior work has usually relied on δk = ||Ax(k) − x(k)||p, the Lp norm of the residual vector, for p = 1 or p = 2, as an indicator of convergence [7, 56]. Given the intended application, we might expect that a better measure of convergence is the distance, using an appropriate measure of distance, between the rank orders for query results induced by Ax(k) and x(k). We use two measures of distance for rank orders, both based on Kendall's rank correlation measure [47]: the KDist measure, defined below, and the Kmin measure, introduced by Fagin et al. in [27]. To determine whether the residual is a "good" measure of convergence, we compared it to the KDist and Kmin of rankings generated by Ax(k) and x(k).
We show empirically that in the case of PageRank computations, the L1 residual δk is closely correlated with the KDist and Kmin distances between query results generated using the values in Ax(k) and x(k).
We define the distance measure KDist as follows. Consider two partially ordered lists of URLs, τ1 and τ2, each of length k. Let U be the union of the URLs in τ1 and τ2. If ρ1 is U − τ1, then let τ′1 be the extension of τ1, where τ′1 contains ρ1 appearing after all the URLs in τ1.3 We extend τ2 analogously to yield τ′2. KDist is then defined as
In other words, KDist (τ1, τ2) is the probability that τ′1 and τ′2 disagree4 on the relative ordering of a randomly selected pair of distinct nodes (u, v) ∈ U × U.
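A direct implementation of this definition for two top-k lists (a sketch; names are ours):

```python
from itertools import combinations

def kdist(tau1, tau2):
    """KDist(tau1, tau2): fraction of distinct pairs in U on which the
    extended lists disagree. URLs absent from a list are tied at its
    end (footnote 3); a pair ordered in one list and tied in the other
    counts as a disagreement (footnote 4)."""
    U = set(tau1) | set(tau2)
    # rank in the extended list: listed URLs keep their position,
    # unlisted URLs all share the ordinal rank len(tau)
    rank1 = {u: i for i, u in enumerate(tau1)}
    rank2 = {u: i for i, u in enumerate(tau2)}
    disagreements = 0
    for u, v in combinations(U, 2):
        d1 = rank1.get(u, len(tau1)) - rank1.get(v, len(tau1))
        d2 = rank2.get(u, len(tau2)) - rank2.get(v, len(tau2))
        # opposite orderings, or ordered in one list and tied in the other
        if d1 * d2 < 0 or (d1 == 0) != (d2 == 0):
            disagreements += 1
    n = len(U)
    return disagreements / (n * (n - 1) / 2)
```

For example, kdist(['a', 'b', 'c'], ['a', 'c', 'b']) is 1/3: only the pair (b, c) is ordered differently.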
To measure the convergence of PageRank iterations in terms of induced rank orders, we measured the KDist distance between the induced rankings for the top 100 results, averaged across 27 test queries, using successive power iterates for the LARGEWEB dataset, with the damping factor c set to 0.9.5 The average residuals using the KDist, Kmin, and L1 measures are plotted in Figure 5.12.6 Surprisingly, the L1 residual is almost perfectly correlated with KDist, and is closely correlated with Kmin.7 A rigorous explanation for the close match between the L1 residual and the Kendall’s τ-based residuals is an interesting avenue of future investigation.
1 i.e., (5.27) fixes a scaling for the coefficients γ.
2 The time savings we give factor in the overhead of applying extrapolation, and represent wallclock time savings.
3 The URLs in ρ are placed with the same ordinal rank at the end of τ.
4 A pair ordered in one list and tied in the other is considered a disagreement.
5 Computing Kendall’s τ over the complete ordering of all of LARGEWEB is expensive; instead we opt to compute KDist and Kmin over query results. The argument can also be made that it is more sensible to compute these measures over a randomly chosen subset of the web rather than the whole web, since small changes in ordering over low ranked pages rarely indicate an important change.
6 The L1 residual δk is normalized so that δ0 is 1.
7 We emphasize that we have shown close agreement between L1 and KDist for measuring residuals, not for distances between arbitrary vectors.