Chapter Seven

BlockRank

In this chapter, we exploit yet another observation about the Web matrix A to speed up PageRank. We observe that the Web link graph has a nested block structure; that is, the vast majority of hyperlinks link pages on a host to other pages on the same host, and many of the remaining links connect pages within the same domain. We show how to exploit this structure to speed up the computation of PageRank by BlockRank, a 3-stage algorithm whereby (1) the local PageRanks of pages for each host are computed independently using the link structure of that host, (2) these local PageRanks are then weighted by the “importance” of the corresponding host (calculated by running PageRank at the host level), and (3) the standard PageRank algorithm is then run using as its starting vector the weighted concatenation of the local PageRanks. We begin the chapter by exploring the block structure of the Web.

7.1 BLOCK STRUCTURE OF THE WEB

The key terminology we use here is given in Table 7.1.

To investigate the structure of the Web, we run the following simple experiment. We take all the hyperlinks in FULL-LARGEWEB, and count how many of these are “intrahost” links (links from a page to another page in the same host) and how many are “interhost” links (links from a page to a page in a different host). Table 7.2 shows that 79.1% of the links in this dataset are intrahost links and 20.9% are interhost links. These intrahost connectivity statistics are not far from the earlier results of Bharat et al. [11]. We also investigate the number of links that are intradomain links, and the number of links that are interdomain links. Notice in Table 7.2 that an even larger majority of links are intradomain links (83.9%).
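As a rough illustration of how such a measurement can be made, the sketch below classifies an edge list of (source url, destination url) pairs into intrahost/interhost and intradomain/interdomain counts. The helper functions and the tiny example edge list are illustrative assumptions, not the chapter's crawl data or code.

from urllib.parse import urlparse

def host_of(url):
    """Hostname component of a url, e.g. 'cs.stanford.edu'."""
    return urlparse(url).hostname or ""

def domain_of(url):
    """Crude domain extraction: the last two labels of the hostname."""
    return ".".join(host_of(url).split(".")[-2:])

def link_statistics(edges):
    """Count intrahost/interhost and intradomain/interdomain links."""
    stats = {"intrahost": 0, "interhost": 0, "intradomain": 0, "interdomain": 0}
    for src, dst in edges:
        stats["intrahost" if host_of(src) == host_of(dst) else "interhost"] += 1
        stats["intradomain" if domain_of(src) == domain_of(dst) else "interdomain"] += 1
    return stats

edges = [
    ("http://cs.stanford.edu/research/", "http://cs.stanford.edu/admissions/"),
    ("http://cs.stanford.edu/research/", "http://www.stanford.edu/"),
    ("http://cs.stanford.edu/", "http://www.berkeley.edu/"),
]
print(link_statistics(edges))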

These results make sense intuitively. Take as an example the domain cs.stanford.edu. Most of the links in cs.stanford.edu are links around the cs.stanford.edu site (such as cs.stanford.edu/admissions or cs.stanford.edu/research). Furthermore, a large percentage of non-navigational links are links to other Stanford hosts, such as www.stanford.edu, library.stanford.edu, or www-cs-students.stanford.edu.

Table 7.1 Example illustrating our terminology using the sample url http://cs.stanford.edu/research/.

Table 7.2 Hyperlink statistics on LARGEWEB for the full graph (Full: 291M nodes, 1.137B links) and for the graph with dangling nodes removed (DNR: 64.7M nodes, 607M links).

One might expect that this structure exists even in lower levels of the Web hierarchy. For example, one might expect that pages under cs.stanford.edu/admissions/ are highly interconnected, and very loosely connected with pages in other sublevels, leading to a nested block structure. This type of nested block structure can be naturally exposed by sorting the link graph to construct a link matrix in the following way. We sort urls lexicographically, except that as the sort key we reverse the components of the domain. For instance, the sort key for the url www.stanford.edu/home/students/ would be edu.stanford.www/home/students. The urls are then assigned sequential identifiers when constructing the link matrix. A link matrix contains as its (i, j)th entry 1 if there is a link from page i to page j, and 0 otherwise. This arrangement has the desired property that urls are grouped in turn by top level domain, domain, hostname, and finally path. The subgraph for pages in stanford.edu appears as a sub-block of the full link matrix. In turn, the subgraph for pages in www-db.stanford.edu appears as a nested sub-block.
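A minimal sketch of this ordering, assuming urls are available as plain strings; the sort_key function and the sample urls are illustrative only.

from urllib.parse import urlparse

def sort_key(url):
    """Reverse the hostname labels so lexicographic sorting groups urls by
    top-level domain, then domain, then host, then path."""
    parsed = urlparse(url)
    reversed_host = ".".join(reversed((parsed.hostname or "").split(".")))
    return reversed_host + (parsed.path or "/")

urls = [
    "http://www.stanford.edu/home/students/",
    "http://cs.stanford.edu/research/",
    "http://www.berkeley.edu/",
    "http://daily.stanford.edu/",
]
# Document ids would be assigned in this sorted order, so pages of the same
# host (and domain) receive consecutive ids and form contiguous blocks.
for u in sorted(urls, key=sort_key):
    print(sort_key(u), "<-", u)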

We can then gain insight into the structure of the Web by using dotplots to visualize the link matrix. In a dotplot, if there exists a link from page i to page j then point (i, j) is colored; otherwise point (i, j) is white. Since our full datasets are too large to see individual pixels, we show several slices of the Web in Figure 7.1. Notice three things:

  1. There is a definite block structure to the Web.
  2. The individual blocks are much smaller than the entire Web.
  3. There are clear nested blocks corresponding to domains, hosts, and subdirectories.

Figure 7.1(a) shows the dotplot for the ibm.com domain. Notice that there are clear blocks, which correspond to different hosts within ibm.com; for example, the upper left block corresponds to the almaden.ibm.com hosts (the hosts for IBM’s Almaden Research Center). Notice that the pages at the very end of the plot (pages i ≥ 18544) are heavily inlinked (the vertical line at the lower right-hand corner of the plot). These are the pages within the www.ibm.com host, and it is expected that they should be heavily inlinked. Also notice that there are 4 patterns that look like the upside-down letter “L.” These are sites that have a shallow hierarchy; the root node links to most pages in the host (horizontal line), and is linked to by most pages in the host (vertical line), but there is not much interlinkage within the site itself (empty block). Finally, notice that the area around the diagonal is very dense; this corresponds to strong intrablock linkage, especially in the smaller blocks.

Figure 7.1 A view of four different slices of the Web: (a) the IBM domain; (b) all the hosts in the Stanford and Berkeley domains; (c) the first 50 Stanford domains, alphabetically; and (d) the host-graph of the Stanford and Berkeley domains.

Figure 7.1(b) shows the dotplot for STANFORD/BERKELEY. Notice that this also has a strong block structure and a dense diagonal. Furthermore, this plot makes clear the nested block structure of the web. The superblock on the upper left-hand side is the stanford.edu domain, and the superblock on the lower right-hand side is the berkeley.edu domain.

Figure 7.1(c) shows a closeup of the first 50 hosts alphabetically within the stanford.edu domain. The majority of this dotplot is composed of three large hosts: acomp.stanford.edu, the academic computing site at Stanford, in the upper left-hand corner; cmgm.stanford.edu, an online bioinformatics resource, in the middle; and daily.stanford.edu, the Web site for the Stanford Daily (Stanford’s student newspaper), in the lower right-hand corner. There are many interesting structural motifs in this plot. First there is a long vertical line in the upper left-hand corner. This feature corresponds to the web site http://acomp.stanford.edu; most pages in the acomp.stanford.edu host point to this root node. Also, there is a clear nested block structure within acomp.stanford.edu on the level of different directories in the url hierarchy.

In the Stanford Daily site, we see diagonal lines, long vertical blocks, a main center block, and short thin blocks. The first several web pages in daily.stanford.edu represent the front page of the paper for the past several days. Each front page links to the front page of the day before, and therefore there is a small diagonal line in the upper left-hand corner of the Stanford Daily block. The diagonals are due to the url naming convention of the Stanford Daily, which causes the lexicographic ordering of urls to induce a chronological ordering of the articles. The front pages link to articles, which are the middle pages of the block. Therefore, we see a horizontal strip in the top middle. These articles also link back to the front pages, and so we see a vertical strip on the middle left-hand side. The articles link to each other, since each article links to related articles and articles by the same author. This accounts for the square block in the center. The long vertical strips represent pages on the standard menu on each page of the site (such as the “subscriptions” page, the “write a letter to the editor” page, and the “advertising” page). Finally, the diagonal lines that surround the middle block are pages such as “e-mail this article to a friend” or “comment on this article,” which are linked to only one article each.

Figure 7.1(d) shows the host graph for the stanford.edu and berkeley.edu domains, in which each host is treated as a single node, and an edge is placed between host i and host j if there is a link between any page in host i and any page in host j. Again, we see strong block structure on the domain level, and the dense diagonal shows strong block structure on the host level as well. The vertical and horizontal lines near the bottom right-hand edge of the Stanford and Berkeley domains represent the www.stanford.edu and www.berkeley.edu hosts, showing that these hosts are, as expected, strongly linked to most other hosts within their own domain.

7.1.1 Block Sizes

We investigate here the sizes of the hosts on the Web. Figure 7.2(a) shows the distribution over number of (crawled) pages of the hosts in LARGEWEB. Notice that the majority of sites are small, on the order of 100 pages. Figure 7.2(b) shows the sizes of the host-blocks in the Web when dangling nodes are removed. When dangling nodes are removed, the blocks become smaller, and the distribution is still skewed toward small blocks. The largest block had 6,000 pages. In future sections we show how to exploit the small sizes of the blocks, relative to the dataset as a whole, to speed link analysis.

Figure 7.2 Histogram of distribution over host sizes of the Web. The x-axis gives bucket sizes for the log10 of the size of the host-blocks, and the y-axis gives the fraction of host-blocks that are that size.

7.1.2 The GeoCities Effect

While one would expect that most domains have high intracluster link density, as in cs.stanford.edu, there are some domains that one would expect to have low intracluster link density, for example pages.yahoo.com (formerly www.geocities.com). The Web site http://pages.yahoo.com is the root page for Yahoo! GeoCities, a free Web hosting service. There is no reason to think that people who have Web sites on GeoCities would prefer to link to one another rather than to sites not in GeoCities.1 Figure 7.3 shows a histogram of the intradomain densities of the Web. In Figure 7.3(a) there is a spike near 0% intrahost linkage, showing that many hosts are not very intraconnected. However, when we remove the hosts that have only 1 page (Figure 7.3(b)), this spike is substantially dampened, and when we exclude hosts with fewer than 5 pages, the spike is eliminated. This shows that the hosts in LARGEWEB that are not highly intraconnected are very small in size. When the very small hosts are removed, the great majority of hosts have high intra-host densities.

It should be noted that this test does not exclude the possibility of nested subhosts. For example, each user in GeoCities may have a highly intralinked area within GeoCities. In fact, this is observed to be the case with most free hosts. Thus, algorithms for subdividing hosts based on url path and intra-level density would be useful for the BlockRank algorithm. However, this is outside the scope of this book.

7.2 BLOCKRANK ALGORITHM

We now present the BlockRank algorithm that exploits the empirical findings of the previous section to speed up the computation of PageRank. This work is motivated by and builds on aggregation/disaggregation techniques [58, 65] and domain decomposition techniques [30] in numerical linear algebra. Steps 2 and 3 of the BlockRank algorithm are similar to the Rayleigh-Ritz refinement technique [51].

Figure 7.3 Distribution over interconnectivity of host-blocks for the DNR-LARGEWEB dataset. The x-axis shows percentile buckets for intrahost linkage density (the percent of edges originating or terminating in a given host that are intrahost links), and the y-axis shows the fraction of hosts that have that linkage density. (a) for all hosts; (b) for all hosts that have more than 1 page; (c) for all hosts that have 5 or more pages.

7.2.1 Overview of BlockRank Algorithm

The block structure of the Web suggests a fast algorithm for computing PageRank, wherein a “local PageRank vector” is computed for each host, giving the relative importance of pages within a host. These local PageRank vectors can then be used to form an approximation to the global PageRank vector that is used as a starting vector for the standard PageRank computation. This is the basic idea behind the BlockRank algorithm, which we summarize here and describe in this section. The algorithm proceeds as follows:

0. Split the Web into blocks by domain.

1. Compute the local PageRanks for each block (Section 7.2.2).

2. Estimate the relative importance, or “BlockRank” of each block (Section 7.2.3).

3. Weight the local PageRanks in each block by the BlockRank of that block, and concatenate the weighted local PageRanks to form an approximate global PageRank vector (Section 7.2.4).

4. Use this approximate global PageRank vector as a starting vector for standard PageRank (Section 7.2.5).

We describe the steps in detail below, and we introduce some notation here. We use lower-case indices (i, j) to represent indices of individual Web pages, and upper-case indices (I, J) to represent indices of blocks. We use the shorthand notation i ∈ I to denote that page i is in block I. The number of elements in block J is denoted n_J. The graph of a given block J is given by the n_J × n_J submatrix G_JJ of the matrix G.

7.2.2 Computing Local PageRanks

In this section, we describe computing a “local PageRank vector” for each block in the Web. Since most blocks have very few links in and out of the block as compared to the number of links within the block, it seems plausible that the relative rankings of most of the pages within a block are determined by the intra-block links.

We define the local PageRank vector l_J of a block J (with graph G_JJ) to be the result of the PageRank algorithm applied only on block J, as if block J represented the entire Web, and as if the links to pages in other blocks did not exist.

That is, l_J = PageRank(G_JJ, s_J, v_J),

where the start vector s_J is the n_J × 1 uniform probability vector over pages in block J ([1/n_J]_{n_J×1}), and the personalization vector v_J is the n_J × 1 vector whose elements are all zero except the element corresponding to the root node of block J, whose value is 1. (A uniform vector may also be used as the personalization vector for local PageRank computations. It does not affect the final computed PageRank vector, and the difference in computation time is negligible.)
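The following sketch shows what such a local PageRank computation might look like for a single block, assuming the block's link structure is available as a small dense adjacency matrix; the pagerank() helper is ordinary power iteration with a personalization (random jump) vector, and the toy block is invented for illustration.

import numpy as np

def pagerank(G, start, personalization, c=0.85, tol=1e-3, max_iter=1000):
    """Power iteration: x <- c * P^T x + (1 - c) * personalization."""
    n = G.shape[0]
    out_degree = G.sum(axis=1)
    P = np.zeros((n, n))
    for i in range(n):
        # Dangling pages jump according to the personalization vector.
        P[i] = G[i] / out_degree[i] if out_degree[i] > 0 else personalization
    x = start.astype(float)
    for _ in range(max_iter):
        x_new = c * (P.T @ x) + (1 - c) * personalization
        if np.abs(x_new - x).sum() < tol:
            return x_new
        x = x_new
    return x

# Toy block with 4 pages; page 0 plays the role of the block's root node.
G_JJ = np.array([[0, 1, 1, 1],
                 [1, 0, 0, 0],
                 [1, 0, 0, 1],
                 [1, 0, 0, 0]], dtype=float)
n_J = G_JJ.shape[0]
s_J = np.full(n_J, 1.0 / n_J)           # uniform start vector
v_J = np.zeros(n_J); v_J[0] = 1.0       # personalization concentrated on the root node
l_J = pagerank(G_JJ, s_J, v_J)
print(l_J, l_J.sum())                   # a probability vector over the block's pages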

Local PageRank accuracies

To investigate how well these local PageRank vectors approximate the relative magnitudes of the true PageRank vectors within a given host, we run the following experiment. We compute the local PageRank vectors of each host in STANFORD/BERKELEY. We also compute the global PageRank vector for STANFORD/BERKELEY using the standard PageRank algorithm whose personalization vector is a uniform distribution over root nodes. We then compare the local PageRank scores of the pages within a given host to the global PageRank scores of the pages in the same host.

Specifically, we take the elements corresponding to the pages in host J of the global PageRank vector g, and form the vector g_J from these elements. We normalize g_J so that its elements sum to 1 in order to compare it to the local PageRank vector l_J, which also has an L1 norm of 1. Specifically, g_J = g(j ∈ J) / ||g(j ∈ J)||_1.

Table 7.3 The “closeness” as measured by average (a) absolute error, and (b) KDist distance of the local PageRank vectors l_J and the global PageRank segments g_J, compared to the closeness of uniform vectors and the global PageRank segments g_J, for the STANFORD/BERKELEY dataset.

We call these vectors normalized global PageRank segments, or global PageRank segments for short.
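As a small illustration, forming a normalized global PageRank segment amounts to slicing the global vector and rescaling the slice to unit L1 norm; the values below are made up.

import numpy as np

g = np.array([0.05, 0.02, 0.01, 0.30, 0.40, 0.22])   # global PageRank over all pages
pages_in_J = [0, 1, 2]                                # indices of the pages in host J

g_J = g[pages_in_J]
g_J = g_J / g_J.sum()                                 # normalize to L1 norm 1
print(g_J, g_J.sum())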

The results are summarized in Table 7.3. The absolute error ||l_J − g_J||_1 is on average 0.2383 for the hosts in STANFORD/BERKELEY.

We compare the error of the local PageRank vectors to the error of a uniform vector [1/n_J]_{n_J×1} for each host J. The error of the uniform vector is on average 1.2804 for the hosts in STANFORD/BERKELEY. One can see that the local PageRank vectors are much closer to the global PageRank segments than the uniform vectors are. So a concatenation of the local PageRank vectors may form a better start vector for the standard PageRank iteration than the uniform vector.

The relative ordering of pages within a host induced by local PageRank scores is generally close to the intrahost ordering induced by the global PageRank scores. To compare the orderings, we measure the average “distance” between the local PageRank vectors l_J and the global PageRank segments g_J using a distance measure based on Kendall’s-τ rank correlation (the KDist distance reported in Table 7.3).

The average distance is 0.0571 for the hosts in STANFORD/BERKELEY. Notice that this distance is small. This observation means that the ordering induced by the local PageRank is close to being correct, and thus suggests that the majority of the L1 error in the comparison of local and global PageRanks comes from the miscalibration of a few pages on each host. Indeed, the miscalibration may be among important pages; as we discuss next, this miscalibration is corrected by the final step of our algorithm. Furthermore, the relative rankings of pages on different hosts are unknown at this point. For these reasons, we do not suggest using local PageRank for ranking pages; we use it only as a tool for computing the global PageRank more efficiently.
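A sketch of this comparison for a single host is shown below. The KDist measure used in Table 7.3 is a Kendall’s-τ based distance; as a stand-in, the snippet uses SciPy's kendalltau correlation (a value of 1 means the two orderings agree exactly) together with the L1 error of the normalized vectors. The example numbers are illustrative.

import numpy as np
from scipy.stats import kendalltau

l_J = np.array([0.2196, 0.0910, 0.1244, 0.0459, 0.0387])   # local PageRanks for host J
g_J = np.array([0.4137, 0.0730, 0.0875, 0.0491, 0.0454])   # global PageRank segment

abs_error = np.abs(l_J / l_J.sum() - g_J / g_J.sum()).sum()  # L1 error after normalizing
tau, _ = kendalltau(l_J, g_J)                                # rank agreement
print(f"L1 error = {abs_error:.4f}, Kendall tau = {tau:.4f}")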

Table 7.4 confirms this observation for the host aa.stanford.edu. Notice that the ordering is preserved, and a large part of the discrepancy is due to http://aa.stanford.edu. The local PageRank computation gives too little weight to the root node. Since the elements of the local vector sum to 1, the ranks of all other pages are upweighted.

Table 7.4 The local PageRank vector for the domain aa.stanford.edu (left) compared to the global PageRank segment corresponding to the same pages. The local PageRank vector has a similar ordering to that of the normalized components of the global PageRank vector. The discrepancy in actual ranks is largely due to the fact that the local PageRank vector does not give enough weight to the root node http://aa.stanford.edu.

Web Page                                                   Local    Global
http://aa.stanford.edu                                     0.2196   0.4137
http://aa.stanford.edu/aeroastro/AAfolks.html              0.0910   0.0730
http://aa.stanford.edu/aeroastro/AssistantsAero.html       0.0105   0.0048
http://aa.stanford.edu/aeroastro/EngineerAero.html         0.0081   0.0044
http://aa.stanford.edu/aeroastro/Faculty.html              0.0459   0.0491
http://aa.stanford.edu/aeroastro/FellowsAero.html          0.0081   0.0044
http://aa.stanford.edu/aeroastro/GraduateGuide.html        0.1244   0.0875
http://aa.stanford.edu/aeroastro/Labs.html                 0.0387   0.0454
http://aa.stanford.edu/aeroastro/Links.html                0.0926   0.0749
http://aa.stanford.edu/aeroastro/MSAero.html               0.0081   0.0044
http://aa.stanford.edu/aeroastro/News.html                 0.0939   0.0744
http://aa.stanford.edu/aeroastro/PhdAero.html              0.0081   0.0044
http://aa.stanford.edu/aeroastro/aacourseinfo.html         0.0111   0.0039
http://aa.stanford.edu/aeroastro/aafaculty.html            0.0524   0.0275
http://aa.stanford.edu/aeroastro/aalabs.html               0.0524   0.0278
http://aa.stanford.edu/aeroastro/admitinfo.html            0.0110   0.0057
http://aa.stanford.edu/aeroastro/courseinfo.html           0.0812   0.0713
http://aa.stanford.edu/aeroastro/draftcourses.html         0.0012   0.0003
http://aa.stanford.edu/aeroastro/labs.html                 0.0081   0.0044
http://aa.stanford.edu/aeroastro/prospective.html          0.0100   0.0063
http://aa.stanford.edu/aeroastro/resources.html            0.0112   0.0058
http://aa.stanford.edu/aeroastro/visitday.html             0.0123   0.0068

Local PageRank convergence rates

Another interesting question to investigate is how quickly the local PageRank scores converge. In Figure 7.4, we show a histogram of the number of iterations it takes for the local PageRank scores for each host in DNR-LARGEWEB to converge to an L1 residual < 10−1. Notice that most hosts converge to this residual in fewer than 12 iterations.

Interestingly, there is no correlation between the convergence rate of a host and the host’s size. Rather, the convergence rate is primarily dependent on the extent of the nested block structure within the host. That is, hosts with strong nested blocks are likely to converge slowly, since they represent slow-mixing Markov chains. Hosts with a more random connection pattern converge faster, since they represent fast-mixing Markov chains.

This observation suggests that one could make the local PageRank computations even faster by wisely choosing the blocks. That is, if a host has a strong nested block structure, use the directories within that host as your blocks. However, this is not a crucial issue, because we show in Section 7.3 that the local PageRank computations can be performed in a distributed fashion in parallel with the crawl. Therefore, reducing the cost of the local PageRank computations is not a bottleneck for computing PageRank with our scheme, as the local computations can be pipelined with the crawl.2

Figure 7.4 Local PageRank convergence rates for hosts in DNR-LARGEWEB. The x-axis is the number of iterations until convergence to a tolerance of 10−1, and the y-axis is the fraction of hosts that converge in a given number of iterations.

7.2.3 Estimating the Relative Importance of Each Block

In this section, we describe computing the relative importance, or BlockRank, of each block. Assume there are k blocks in the Web. To compute BlockRanks, we first create the block graph B, where each vertex in the graph corresponds to a block in the Web graph. An edge between two pages in the Web is represented as an edge between the corresponding blocks (or a self-edge, if both pages are in the same block). The edge weights are determined as follows: the weight of an edge between blocks I and J is defined to be the sum of the edge-weights from pages in I to pages in J in the Web graph, weighted by the local PageRanks of the linking pages in block I.

That is, if A is the Web graph and l_i is the local PageRank of page i in block I, then the weight of edge B_IJ is given by B_IJ = Σ_{i∈I} Σ_{j∈J} A_ij · l_i.

We can write this in matrix notation as follows. Define the local PageRank matrix L to be the n × k matrix whose columns are the local PageRank vectors l_J: L = [l_1 l_2 ... l_k].

Define the matrix S to be the n × k matrix that has the same structure as L, but whose nonzero entries are all replaced by 1. Then the block matrix B is the k × k matrix B = L^T A S.   (7.1)

Notice that B is a transition matrix where the element B_IJ represents the transition probability of block I to block J. That is, B_IJ = p(J | I).

Once we have the k × k transition matrix B, we may use the standard PageRank algorithm on this reduced matrix to compute the BlockRank vector b. That is, b = PageRank(B, v_k, v_k), where v_k is the uniform k-vector [1/k]_{k×1}.

Note that this process is the same as computing the stationary distribution of the transition matrix c · B + (1 − c)E_k, where we define E_k = [1]_{k×1} × v_k^T. The analogy to the random surfer model of [56] is this: we imagine a random surfer going from block to block according to the transition probability matrix B. At each stage, the surfer will get bored with probability 1 − c and jump to a different block.
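A small sketch of the matrix construction in this step, assuming the local PageRank matrix L has already been assembled in Step 1; the five-page, two-block example and its values are illustrative. S has the same sparsity pattern as L with ones in place of the nonzero entries, and B = L^T A S is the k × k block graph.

import numpy as np

# Illustrative 5-page, 2-block example: pages 0-2 in block 0, pages 3-4 in block 1.
A = np.array([[0, 1, 1, 1, 0],
              [1, 0, 0, 0, 0],
              [1, 0, 0, 0, 1],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
A = A / A.sum(axis=1, keepdims=True)      # row-stochastic web matrix (no dangling pages here)

L = np.array([[0.5, 0.0],                 # columns are the local PageRank vectors l_J
              [0.3, 0.0],
              [0.2, 0.0],
              [0.0, 0.6],
              [0.0, 0.4]])
S = (L > 0).astype(float)                 # same structure as L, nonzeros replaced by 1

B = L.T @ A @ S                           # k x k block transition matrix
print(B, B.sum(axis=1))                   # each row sums to 1, as a transition matrix should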

7.2.4 Approximating Global PageRank Using Local PageRank and BlockRank

In this section, we find an estimate to the global PageRank vector x. At this point, we have two sets of rankings. Within each block J, we have the local PageRanks l_J of the pages in the block. We also have the BlockRank vector b whose elements b_J are the BlockRank for each block J, measuring the relative importance of the blocks. We may now approximate the global PageRank of a page j ∈ J as its local PageRank l_j, weighted by the BlockRank b_J of the block in which it resides. That is, x_j^(0) = l_j · b_J.

In matrix notation, this is x^(0) = L b.

Recall that the local PageRanks of each block sum to 1. The BlockRanks also sum to 1; therefore, our approximate global PageRanks will also sum to 1. The reasoning follows. The sum of our approximate global PageRanks sum(x_j^(0)) = Σ_j x_j^(0) can be written as a sum over blocks: Σ_j x_j^(0) = Σ_J Σ_{j∈J} x_j^(0).

Using our definition of x_j^(0) above, this is Σ_J Σ_{j∈J} l_j b_J = Σ_J b_J Σ_{j∈J} l_j.

Since the local PageRanks for each block sum to 1 (Σ_{j∈J} l_j = 1), this reduces to Σ_J b_J.

And since the BlockRanks also sum to 1 (Σ_J b_J = 1), the approximate global PageRanks sum to 1 as claimed.

Therefore, we may use our approximate global PageRank vector x^(0) as a start vector for the standard PageRank algorithm.
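A two-line sketch of this weighting step, with made-up L and b, confirms that the approximate vector x^(0) = L b sums to 1:

import numpy as np

L = np.array([[0.5, 0.0],
              [0.3, 0.0],
              [0.2, 0.0],
              [0.0, 0.6],
              [0.0, 0.4]])        # n x k local PageRank matrix (illustrative)
b = np.array([0.7, 0.3])          # BlockRank vector (illustrative)

x0 = L @ b                        # approximate global PageRank vector x^(0)
print(x0, x0.sum())               # sums to 1.0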

7.2.5 Using This Estimate as a Start Vector

In order to compute the true global PageRank vector x from our approximate PageRank vector x^(0), we simply use it as a start vector for standard PageRank. That is, x = PageRank(G, x^(0), v),

where G is the graph of the Web, and v is the uniform distribution over root nodes. In Section 7.6, we show how to compute different personalizations quickly once x has been computed. The BlockRank algorithm for computing PageRank, presented in the preceding sections, is summarized by Algorithm 13.

0. Sort the Web graph lexicographically as described in Section 7.1, exposing the nested block structure of the Web.

1. Compute the local PageRank vector l_J for each block J.

2. Compute the block transition matrix B and the BlockRank vector b.

3. Find an approximation x^(0) to the global PageRank vector by weighting the local PageRanks of pages in block J by the BlockRank of J.

4. Use this approximation as a start vector for a standard PageRank iteration.

Algorithm 13: BlockRank Algorithm
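For readers who prefer code, here is a compact, end-to-end sketch of Algorithm 13 on a toy graph. It uses plain dense power iteration, a uniform personalization for the local runs, and invented data; it is a reading aid under those assumptions, not the implementation used for the experiments in this chapter.

import numpy as np

def pagerank(M, start, v, c=0.85, tol=1e-6, max_iter=1000):
    """Power iteration x <- c * P^T x + (1 - c) * v; dangling rows jump to v."""
    rs = M.sum(axis=1)
    P = np.array([M[i] / rs[i] if rs[i] > 0 else v for i in range(M.shape[0])])
    x = start.copy()
    for _ in range(max_iter):
        x_new = c * (P.T @ x) + (1 - c) * v
        done = np.abs(x_new - x).sum() < tol
        x = x_new
        if done:
            break
    return x

def blockrank(A, blocks, c=0.85):
    n, k = A.shape[0], len(blocks)
    L = np.zeros((n, k))
    S = np.zeros((n, k))
    # Step 1: local PageRank vector for each block (uniform personalization here).
    for J, pages in enumerate(blocks):
        s = np.full(len(pages), 1.0 / len(pages))
        L[pages, J] = pagerank(A[np.ix_(pages, pages)], s, s, c)
        S[pages, J] = 1.0
    # Step 2: block graph B = L^T A S and BlockRank vector b.
    A_hat = A / np.maximum(A.sum(axis=1, keepdims=True), 1)
    B = L.T @ A_hat @ S
    v_k = np.full(k, 1.0 / k)
    b = pagerank(B, v_k, v_k, c)
    # Step 3: approximate global PageRank x0 = L b.
    x0 = L @ b
    # Step 4: standard PageRank over the full graph, started from x0.
    v = np.full(n, 1.0 / n)
    return pagerank(A, x0, v, c)

A = np.array([[0, 1, 1, 1, 0],
              [1, 0, 0, 0, 0],
              [1, 0, 0, 0, 1],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
blocks = [np.array([0, 1, 2]), np.array([3, 4])]
print(blockrank(A, blocks))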

7.3 ADVANTAGES OF BLOCKRANK

The BlockRank algorithm has four major advantages over the standard PageRank algorithm.

Advantage 1 A major speedup of our algorithm comes from caching effects. All the host-blocks in our crawl are small enough that each block graph fits in main memory, and the vector of ranks for the active block largely fits in the CPU cache. As the full graph does not fit in main memory, the local PageRank iterations thus require less disk i/o than the global computations. The full rank vectors do fit in main memory; however, using the sorted link structure3 dramatically improves the memory access patterns to the rank vector. Indeed, if we use the sorted link structure, designed for BlockRank, as the input instead to the standard PageRank algorithm, the enhanced locality of reference to the rank vectors cuts the time needed for each iteration of the standard algorithm by more than half: from 6.5 minutes to 3.1 minutes for each iteration on DNR-LARGEWEB!

Advantage 2 In our BlockRank algorithm, the local PageRank vectors for many blocks will converge quickly; thus the computations of those blocks may be terminated after only a few iterations. This increases the effectiveness of the local PageRank computation by allowing it to expend more computation on slowly converging blocks, and less computation on faster converging blocks. Note, for instance, in Figure 7.4 that there is a wide range of rates of convergence for the blocks. In the standard PageRank algorithm, iterations operate on the whole graph; thus the convergence bottleneck is largely due to the slowest blocks. Much computation is wasted recomputing the PageRank of blocks whose local computation has already converged.

Advantage 3 The local PageRank computations in Step 1 of the BlockRank algorithm can be computed in a completely parallel or distributed fashion. That is, the local PageRanks for each block can be computed on a separate processor, or computer. The only communication required is that, at the end of Step 1, each computer should send its local PageRank vector to a central computer that will compute the global PageRank vector. If our graph consists of n total pages, the net communication cost consists of 8n bytes (if using 8-byte double precision floating point values). Naive parallelization of the computation that does not exploit block structure would require a transfer of 8n bytes after each iteration, a significant penalty. Furthermore, the local PageRank computations can be pipelined with the web crawl. That is, the local PageRank computation for a host can begin as a separate process as soon as the crawler finishes crawling the host. In this case, only the costs of Steps 2–4 of the BlockRank algorithm become rate-limiting.

Advantage 4 In several scenarios, the local PageRank computations (the results of Step 1) can be reused during future applications of the BlockRank algorithm. Consider, for instance, news sites such as cnn.com that are crawled more frequently than the general Web. In this case, after a crawl of cnn.com, if we wish to recompute the global PageRank vector, we can rerun the BlockRank algorithm, except that in Step 1, only the local PageRanks for the cnn.com block need to be recomputed. The remaining local PageRanks will be unchanged and can be reused in Steps 2-3. In this way, we can also reuse the local PageRank computations for the case of computing several “personalized” PageRank vectors. We further discuss personalized PageRank in Section 7.6.

7.4 EXPERIMENTAL RESULTS

In this section, we investigate the speedup of BlockRank compared to the standard algorithm for computing PageRank. The speedup of our algorithm for typical scenarios comes from the first three advantages listed in Section 7.3. The speedups are due to less expensive iterations, as well as fewer total iterations. (Advantage 4 is discussed in subsequent sections.)

We begin with the scenario in which PageRank is computed after the completion of the crawl; we assume that only Step 0 of the BlockRank algorithm is computed concurrently with the crawl. As mentioned in Advantage 1 from the previous section, the improved reference locality due to the blockiness exposed by lexicographically sorting the link matrix achieves a speedup of a factor of 2 in the time needed for each iteration of the standard PageRank algorithm. This speedup is completely independent of the value chosen for c, and does not affect the rate of convergence as measured in the number of iterations required to reach a particular L1 residual.

If instead of the standard PageRank algorithm, we use the BlockRank algorithm on the block structured matrix, we gain the full benefit of Advantages 1 and 2; the blocks each fit in main memory, and many blocks converge more quickly than the entire Web. We compare the wallclock time it takes to compute PageRank using the BlockRank algorithm in this scenario, where local PageRank vectors are computed serially after the crawl is complete, with the wallclock time it takes to compute PageRank using the standard algorithm given in [56]. Table 7.5 gives the running times of the 4 steps of the BlockRank algorithm on the LARGEWEB dataset. The first 3 rows of Table 7.6 give the wallclock running times for standard PageRank, standard PageRank using the url-sorted link matrix, and the full BlockRank algorithm computed after the crawl. We see a small additional speedup for BlockRank on top of the previously described speedup. Subsequently, we will describe a scenario in which the costs of Steps 1-3 become largely irrelevant, leading to further effective speedups.

In this next scenario, we assume that the cost of Step 1 can be made negligible in one of two ways: the local PageRank vectors can be pipelined with the web crawl, or they can be computed in parallel after the crawl. If the local PageRank vectors are computed as soon as possible (e.g., as soon as a host has been fully crawled), the majority of local PageRank vectors will have been computed by the time the crawl is finished. Similarly, if the local PageRank vectors are computed after the crawl, but in a distributed manner, using multiple processors (or machines) to compute the PageRank vectors independently, the time it takes to compute the local PageRanks will be low compared to the standard PageRank computation. Thus, only the running time of Steps 2-4 of BlockRank will be relevant in computing net speedup. The construction of B is the dominant cost of Step 2, but this too can be pipelined; Step 3 has negligible cost. Thus the speedup of BlockRank in this scenario is determined by the increased rate of convergence in Step 4 that comes from using the BlockRank approximation as the start vector. We now take a closer look at the relative rates of convergence. In Figure 7.5, we show the convergence rate of standard PageRank compared to the convergence of Step 4 of BlockRank on the STANFORD/BERKELEY dataset for a random jump probability 1 − c = 0.15 (c = 0.85). Note that to achieve convergence to a residual of 10−4, using the BlockRank start vector leads to a speedup of a factor of 2 on the STANFORD/BERKELEY dataset. The LARGEWEB dataset yielded an increase in convergence rate of 1.55. These results are summarized in Table 7.7. Combined with the first effect described above (from the sorted link structure), in this scenario, our algorithm yields a net speedup of over 3. (For higher values of c, as explored in [40], the speedup is even more significant; for example, we got a 10-times speedup on the STANFORD/BERKELEY dataset when we set c = 0.99.)

Table 7.5 Running times for the individual steps of BlockRank for c = 0.85 in achieving a final residual of <10−3.

Table 7.6 Wallclock running times for 4 algorithms for computing PageRank with c = 0.85 to a residual of <10−3.

These results are the most significant speedup results to date for computing PageRank, and the only results showing significant speedup for c = 0.85 on large datasets. Also, it should be noted that the BlockRank algorithm can be used in conjunction with other methods of accelerating PageRank, such as Quadratic Extrapolation [40] or Gauss-Seidel [7, 30] (or both). These methods would simply be applied to Step 4 in the BlockRank algorithm. When used in conjunction with these methods, one should expect even faster convergence for BlockRank; these hybrid approaches are left for future study.

Table 7.7 Number of iterations needed to converge for standard PageRank and for BlockRank (to a tolerance of 10−4 for STANFORD/BERKELEY, and 10−3 for LARGEWEB).

Figure 7.5 Convergence rates for standard PageRank (solid line) versus BlockRank (dotted line). The x-axis is the number of iterations, and the y-axis is the log of the L1 residual. STANFORD/BERKELEY data set; c = 0.85.

Earlier, we noted that we can also utilize the strategy of reusing local PageRank vectors when we wish to recompute PageRank after several pages have been added or removed from the Web. Since the Web is highly dynamic, with Web pages being added or removed all the time, this is an important problem to address. In particular, we wish to crawl certain hosts, such as daily news providers such as cnn.com, more frequently than others.

If we use BlockRank to compute the PageRank vector x, and store the local PageRank vectors l_J, then we only need to recompute the local PageRanks of those hosts to which pages have been added or removed at each update.

So, Step 1 would simply be replaced by

Compute the local PageRank vector l_J for each block J that has been updated.

7.5 DISCUSSION

We have shown that the hyperlink graph of the Web has a nested block structure, something that has not yet been thoroughly investigated in studies of the Web. We exploit this structure to compute PageRank in a fast manner using an algorithm we call BlockRank. We show empirically that BlockRank speeds up PageRank computations by factors of 2 and higher, depending on the particular scenario.

This work is rich in implications for future work. For instance, some host-blocks are fast-mixing, and others are slow-mixing. Slow-mixing blocks generally have a strong nested block structure within them. Finding the “best” blocks for BlockRank is an interesting problem. Also, the block structure of the Web can be exploited for hyperlink-based algorithms other than Web search. It would be interesting to develop algorithms for clustering or classification that exploit this structure.

In the next section, we will discuss how to extend BlockRank to compute personalized PageRank.

7.6 PERSONALIZED PAGERANK

In [56], it was originally suggested that, by changing the random jump vector v to be nonuniform, the resultant PageRank vector can be biased to prefer certain kinds of pages. For example, a random surfer interested in sports may get bored every once in a while and jump to http://www.espn.com, while a random surfer interested in current events may jump to http://www.cnn.com. While personalized PageRank is a compelling idea, in general it requires computing a large number of PageRank vectors, which is computationally prohibitive even for a moderate number of users.

We use the BlockRank algorithm and a simple restriction on the jump behavior of the random surfer to dramatically reduce the computation time of personalized PageRank. The restriction is this: instead of being able to choose a distribution over pages to which he jumps when he gets bored, the random surfer may only choose hosts. For example, the random surfer interested in sports may jump to the www.espn.com host, but he may not jump to a particular page within that host, such as http://www.espn.com/ncb/columns/forde_pat/index.html. We can then encode the personalization in a k-dimensional vector v_k (where k is the number of host-blocks in the Web) that is a distribution over different hosts.

With this restriction, the local PageRank vectors will not change for different personalizations. In fact, since the local PageRank vectors do not change for different personalizations, neither does the block matrix B.

Only the BlockRank vector b will change for different personalizations. Therefore, we only need to recompute the BlockRank vector b for each block-personalization vector v_k.

The personalized PageRank computation could proceed as follows. Assuming you have already computed a generic PageRank vector x once using the BlockRank algorithm, and have stored the block-transition matrix B, the personalized BlockRank algorithm is simply the last 3 steps of the generic BlockRank algorithm (Algorithm 13).

For a given block-personalization vector v_k,

  1. Compute the personalized BlockRank vector b.

    b = PageRank(B, s, v_k), where s is a uniform start vector.

  2. Find an approximation x^(0) to the global PageRank vector by weighting the local PageRanks of pages in block J by the personalized BlockRank b_J of J.

  3. Use this approximation as a start vector for a standard PageRank iteration.

    x = PageRank(G, x^(0), v), where v is an induced personalization vector over pages, as described below.

Algorithm 14: Simple Personalized BlockRank Algorithm

7.6.1 Inducing Random Jump Probabilities over Pages

The Personalized BlockRank algorithm requires that the random surfer not have the option of jumping to a specific page when he gets bored (he may only jump to the host). However, the last step in the BlockRank algorithm requires a random jump probability distribution over pages. Thus, we need to induce the probability p(j) that the random surfer will jump to page j, given the probability p(J) that he will jump to the host J in which page j resides. We induce this probability as follows: p(j) = p(j | J) p(J).   (7.2)

That is, the probability that the random surfer jumps to page j is the probability that he will jump to host J, times the probability of being at page j given that he is in host J.

Since the local PageRank vector l_J is the stationary probability distribution of pages within host J, p(j | J) is given by the element of l_J corresponding to page j. Therefore, the element L_jJ of the matrix L corresponds to L_jJ = p(j | J). Also, by definition, the element (v_k)_J = p(J). Therefore, in matrix notation, (7.2) can be written as v = L v_k.
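As a quick sketch, inducing the page-level jump vector is a single matrix-vector product, v = L v_k; the matrices below are illustrative.

import numpy as np

L = np.array([[0.5, 0.0],
              [0.3, 0.0],
              [0.2, 0.0],
              [0.0, 0.6],
              [0.0, 0.4]])            # columns are local PageRank vectors
v_k = np.array([0.8, 0.2])            # host-level personalization (sums to 1)

v = L @ v_k                           # page-level jump vector: v_j = p(j | J) * p(J)
print(v, v.sum())                     # also sums to 1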

7.6.2 Using “Better” Local PageRanks

If we have already computed the generic PageRank vector x, we have even “better” local PageRank vectors than we began with. That is, we can normalize segments of x to form the normalized global PageRank segments g_J as described in Section 7.2. These scores are of course better estimates of the relative magnitudes of pages within the block than the local PageRank vectors l_J, since they are derived from the generic PageRank vector for the full Web. So we can modify Personalized BlockRank as follows. Let us define the matrix H in a manner similar to the way we defined L, except using the normalized global PageRank segments g_J rather than the local PageRank vectors l_J.

Again, we only need to compute H once. We define the matrix B_H to be similar to the matrix B as defined in (7.1), but using H instead of L: B_H = H^T A S.   (7.3)

Thus, assuming you have already computed a generic PageRank vector x once using the BlockRank algorithm, and have computed H from the generic PageRank vector x and B_H as defined in (7.3), the algorithm is given as follows:

For a given block-personalization vector v_k,

  1. Compute the personalized BlockRank vector b.

    b = PageRank(B_H, s, v_k), where s is a uniform start vector.

  2. Find an approximation x^(0) to the global PageRank vector by weighting the local PageRanks of pages in block J by the personalized BlockRank b_J of J.

  3. Induce the personalization vector v over pages from the personalization vector v_k over hosts.

  4. Use this approximation as a start vector for a standard PageRank iteration.

Algorithm 15: Personalized BlockRank Algorithm with Induced Jumps
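A rough sketch of the quantities Algorithm 15 relies on: H is formed by normalizing host segments of an already-computed global PageRank vector, B_H = H^T A S as in (7.3), and a personalized BlockRank vector is computed on B_H by power iteration. The toy graph and values are illustrative, and inducing the page-level jump vector via H (mirroring Section 7.6.1, which uses L) is an assumption of this sketch.

import numpy as np

# Toy setup: 5 pages in 2 host-blocks, with an illustrative generic PageRank vector x.
A = np.array([[0, 1, 1, 1, 0],
              [1, 0, 0, 0, 0],
              [1, 0, 0, 0, 1],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
A = A / A.sum(axis=1, keepdims=True)                 # row-stochastic web matrix
blocks = [np.array([0, 1, 2]), np.array([3, 4])]
x = np.array([0.30, 0.15, 0.20, 0.20, 0.15])         # generic global PageRank (illustrative)

n, k = A.shape[0], len(blocks)
H = np.zeros((n, k))
for J, pages in enumerate(blocks):
    H[pages, J] = x[pages] / x[pages].sum()          # normalized global PageRank segments
S = (H > 0).astype(float)
B_H = H.T @ A @ S                                    # block matrix built from H, as in (7.3)

c = 0.85
v_k = np.array([0.8, 0.2])                           # host-level personalization
b = np.full(k, 1.0 / k)
for _ in range(200):                                 # personalized BlockRank on B_H
    b = c * (B_H.T @ b) + (1 - c) * v_k

x0 = H @ b        # start vector for the final full-graph PageRank run
v = H @ v_k       # induced page-level jump vector (assumption: induced via H)
print(x0, v)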

7.6.3 Experiments

We test this algorithm by computing the personalized PageRank of a random surfer who is a graduate student in linguistics at Stanford. When he gets bored, he has an 80% probability of jumping to the linguistics host www-linguistics.stanford.edu, and a 20% probability of jumping to the main Stanford host www.stanford.edu. Figure 7.6 shows that computing the personalized PageRank for this surfer according to Algorithm 15 yields speedup benefits comparable to those of standard BlockRank. However, the main benefit is that the local PageRank vectors do not need to be computed at all for Personalized BlockRank. The matrix H is formed from the already computed generic PageRank vector. Therefore, the overhead of computing personalized PageRank vectors using the Personalized BlockRank algorithm is minimal.

Figure 7.6 Convergence of personalized PageRank computations using standard PageRank and personalized BlockRank.

7.6.4 Topic-Sensitive PageRank

Haveliwala [34] showed that one could compute a personalized PageRank vector by taking the linear combination of several precomputed personalized PageRank vectors. For example, if Bob is interested in sports and finance, and there is already a personalized PageRank vector p1 computed for sports, and another personalized PageRank vector p2 already computed for finance, then Bob’s personalized PageRank vector would simply be β1·p1 + β2·p2, where β1 represents Bob’s level of interest in sports, and β2 represents Bob’s level of interest in finance. We choose β1 + β2 = 1.
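At query time this composition is just a weighted vector sum; a tiny sketch with placeholder basis vectors:

import numpy as np

p_sports  = np.array([0.4, 0.3, 0.2, 0.1])   # precomputed basis vector for sports (illustrative)
p_finance = np.array([0.1, 0.2, 0.3, 0.4])   # precomputed basis vector for finance (illustrative)
beta1, beta2 = 0.7, 0.3                      # Bob's interest levels, beta1 + beta2 = 1

p_bob = beta1 * p_sports + beta2 * p_finance
print(p_bob, p_bob.sum())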

In order to increase coverage, we may choose a number of granular topics and precompute their PageRank vectors, and take the linear combinations at query time. Because of their topical nature, we refer to these vectors as “topic-sensitive PageRank vectors,” or “personalized basis vectors.”

We prove here that β1·p1 + β2·p2 is equivalent to Bob’s personalized PageRank vector p, where p is the left principal eigenvector of the matrix [cP + β1(1 − c)E1 + β2(1 − c)E2]^T (where E1 and E2 represent the random jump matrices for sports and finance). That is, with β1 + β2 = 1, we prove that:

[cP + β1(1 − c)E1 + β2(1 − c)E2]^T (β1·p1 + β2·p2) = β1·p1 + β2·p2.   (7.4)

The proof is simple. When we multiply out the terms on the left-hand side of (7.4), we get

β1·cP^T·p1 + β2·cP^T·p2 + β1(1 − c)·E1^T(β1·p1 + β2·p2) + β2(1 − c)·E2^T(β1·p1 + β2·p2).

Recall that E1^T·y is the same for any y that has an L1 norm of 1, since E1^T is a stochastic rank 1 matrix; because β1 + β2 = 1, the vector β1·p1 + β2·p2 has an L1 norm of 1, so E1^T(β1·p1 + β2·p2) = E1^T·p1. Likewise, E2^T(β1·p1 + β2·p2) = E2^T·p2. Therefore, we can simplify the left-hand side to be

β1·cP^T·p1 + β1(1 − c)·E1^T·p1 + β2·cP^T·p2 + β2(1 − c)·E2^T·p2.

Combining terms, we get

β1·[cP + (1 − c)E1]^T·p1 + β2·[cP + (1 − c)E2]^T·p2.

Gathering terms again, and recalling that p1 and p2 are the personalized PageRank vectors for sports and finance, which by definition satisfy p1 = [cP + (1 − c)E1]^T·p1 and p2 = [cP + (1 − c)E2]^T·p2, this is

β1·p1 + β2·p2.

By definition, this is equal to p.

There are two limitations to topic-sensitive PageRank. First, one needs to be able to store all the basis vectors in memory at query time. This limits the number of possible basis vectors to the low tens at most. Haveliwala stored 12 basis vectors in memory in [34]. Second, even if there were no memory issue, the time and resources to compute thousands of basis vectors on a regular basis are too great.

For true personalization, we wish to be able to quickly compute and store in memory thousands of basis vectors at a speed that will allow us to compute them on a regular basis.

7.6.5 Pure BlockRank

While the BlockRank algorithm provides significant speedups in the computation of PageRank, it is still not nearly enough for our needs. Furthermore, we have not addressed the issues of storing in memory thousands of basis vectors.

A simple modification to BlockRank solves both of these problems. The solution is to eliminate Step 4 of the BlockRank algorithm. That is, instead of using BlockRank as a starting point for PageRank, use BlockRank as a ranking algorithm itself.

The revised personalized PageRank algorithm would then proceed as follows: assuming you have already computed a generic PageRank vector x once using the BlockRank algorithm, and have computed H from the generic PageRank vector x, and B_H as defined in (7.3), the algorithm proceeds as in Algorithm 16.

In fact, Step 2 may be done at query time. In the context of topic-sensitive PageRank, we may store in memory several block-basis vectors, and at query time compose them with one another to get a personalized block vector, and compose the result with the local rankings as described in Step 2 (see the sketch following Algorithm 16). Furthermore, we would only have to do this for those pages that show up as results to the query, making this operation only a few thousand multiplies, which may easily be done at query time.

For a given block-personalization vector v_k,

  1. Compute the personalized BlockRank vector b.

    b = PageRank(B_H, s, v_k), where s is a uniform start vector.

  2. Find an approximation to the global PageRank vector by weighting the local PageRanks of pages in block J by the personalized BlockRank b_J of J.

Algorithm 16: Personalized BlockRank Algorithm
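A small sketch of this query-time composition, with invented block-basis vectors and invented H entries for the pages matching a query; each result page j in host J is scored by its personalized BlockRank b_J times its entry H_jJ.

import numpy as np

b_sports  = np.array([0.6, 0.4])      # precomputed block-basis vector for sports (illustrative)
b_finance = np.array([0.3, 0.7])      # precomputed block-basis vector for finance (illustrative)
beta1, beta2 = 0.7, 0.3               # this user's interest weights

b_user = beta1 * b_sports + beta2 * b_finance   # personalized block vector, composed at query time

# (page id, host index, H_jJ) for just the pages returned for the query (illustrative).
query_results = [(17, 0, 0.5), (203, 1, 0.6), (204, 1, 0.4)]
scores = {page: b_user[host] * h_jJ for page, host, h_jJ in query_results}
print(sorted(scores.items(), key=lambda kv: -kv[1]))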

The computation of these block-basis vectors takes on the order of 7-8 minutes on a single machine as described in the previous chapter, and they are less than 1/1000 the size of full PageRank vectors, making it possible to store tens of thousands of these personalized basis vectors in memory, rather than just tens.

This is the speed and memory-efficiency needed for true personalized web search.

1 There may of course be deeper structure found in the path component, although we currently do not directly exploit such structure.

2 Generally this requires a site-based crawler (such as the WebBase crawler [18]) which maintains a pool of active hosts and crawls hosts to completion before adding new hosts to the pool.

3 As in Section 7.1, this entails assigning document ids in lexicographic order of the url (with the components of the full hostname reversed).
