52 The Web as a Dynamic Graph*

S. N. Maheshwari

Indian Institute of Technology Delhi, Delhi

52.1 Introduction

52.2 Experimental Observations

52.3 Theoretical Growth Models

52.4 Properties of Web Graphs and Web Algorithmics

Generating Function Framework • Average Path Length • Emergence of Giant Components • Search on Web Graphs • Crawling and Trawling

52.5 Conclusions

References

52.1 Introduction

The World Wide Web (the Web) was started as an experimental network in 1991. Its growth since then can only be termed explosive. It has several billion pages today and is growing exponentially with time. This growth is totally distributed; there is no central authority to control it. The hyperlinks endow the Web with some structure: viewing the individual web pages as nodes and the hyperlinks as directed edges between them, the Web can be looked upon as a directed graph. What stands out is that this directed graph is not only dynamic (it is rapidly growing and changing), it has for some time been much too large to even have a complete snapshot. Experimental understanding of its structure is based on large but partial web crawls. Which properties are investigated is itself driven by the requirements of increasingly sophisticated applications, as well as by analogies and insights from fields like bibliometrics, the study of citations in academic literature [1].

Let us briefly consider topic search, discussed in detail in Chapter 51, which involves searching for pages on the Web that correspond closely to a given search topic. The seminal work in this area is Kleinberg's HITS algorithm [2], which assumes that for any topic on the Web there are pages which could be considered "authoritative" on that topic, and pages which are "hubs" in the sense that they contain links to relevant pages on that topic. Given a collection of pages and links between them, selected by some sampling method as pertaining to the given topic, the HITS algorithm ranks the pages by weights which are representative of the quality of the pages as hubs or authorities. These weights are the entries of principal eigenvectors, and are, in some sense, a "measure" of the "denseness" of the interconnections between the pages. This model of dense interconnection between hubs and authorities of a given topic gave rise to the notion of "cyber communities" in the Web associated with different topics. Underlying this model of cyber communities was the hypothesis that a subgraph representing a Web community would contain "bipartite cores." Bipartite cores, as the name suggests, are complete bipartite graphs corresponding to the hubs and authorities around which the communities are supposed to have developed. Experimental investigations of the structure of the web graph, carried out on graphs extracted from partial crawls, have confirmed much of the above and more. The structural understanding resulting from these experimental investigations has fueled both theoretical model building, which attempts to explain experimentally observed phenomena, and the development of new algorithmic techniques that solve traditional problems of search and information retrieval on the web graph in novel ways. Moreover, the reality of the Web as a structure which is too large and continuously changing makes the standard off-line and on-line models for algorithm design totally inapplicable. Over the last seven to eight years researchers have attempted to grapple with these unique issues of complexity in the study of the Web. Interestingly, contributions to this study have come not only from computer scientists, but also from physicists, who have brought to Web model building techniques from statistical mechanics that have been successful in predicting macro level behavior of a variety of natural phenomena from millions of their constituent parts. In this chapter an attempt is made to put together what to the author are the main strands of this rapidly evolving model building. The rest of the chapter is organized as follows: Section 52.2 surveys the experimental observations and reflects the major trends in the findings. Section 52.3 contains the basic theoretical framework developed to explain the experimental findings. Section 52.4 contains examples of web algorithmics. Section 52.5 is crystal gazing and reflects what to the author are the grand challenges.
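To make the hub/authority computation concrete, here is a minimal sketch of the HITS iteration, assuming the topic subgraph has already been sampled and is given as an edge list (the node labels and toy graph below are illustrative, not taken from Reference 2).

```python
import numpy as np

def hits(edges, n, iters=50):
    """Minimal HITS iteration on a directed graph with nodes 0..n-1.

    edges: list of (u, v) pairs meaning u has a hyperlink to v.
    Returns (hub, authority) score vectors, each L2-normalised.
    """
    A = np.zeros((n, n))
    for u, v in edges:
        A[u, v] = 1.0                      # adjacency matrix of the topic subgraph

    hub = np.ones(n)
    auth = np.ones(n)
    for _ in range(iters):
        auth = A.T @ hub                   # good authorities are pointed to by good hubs
        hub = A @ auth                     # good hubs point to good authorities
        auth /= np.linalg.norm(auth)
        hub /= np.linalg.norm(hub)
    return hub, auth

# Toy topic subgraph: nodes 0 and 1 act as hubs for authorities 2 and 3.
edges = [(0, 2), (0, 3), (1, 2), (1, 3), (4, 0)]
hub, auth = hits(edges, 5)
print("hub scores      :", hub.round(3))
print("authority scores:", auth.round(3))
```

At convergence the authority and hub vectors are the principal eigenvectors of $A^TA$ and $AA^T$ respectively, which is the sense in which the weights "measure" the denseness of the hub and authority interconnection.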

52.2 Experimental Observations

Recent literature contains reports of a number of experiments conducted to investigate topological properties satisfied by the web graph [3–5]. These experiments were conducted over a period of time, using Web samples of varying sizes. Albert, Jeong, and Barabasi [3] used the nd.edu subset of the Web. Kumar et al. [5] used a cleaned up version of a 1997 web crawl carried out by Alexa Inc. Broder et al. [4] based their measurements on an Altavista crawl having about 200 million pages and 1.5 billion hyperlinks. The most fundamental observation that emerges from these experiments, conducted at different times and focusing on different subparts of the Web, is that the degree distribution of nodes in the web graph follows a power law. The degree distribution is said to satisfy a power law if the fraction of nodes of degree $x$ is proportional to $x^{-\alpha}$ for some $\alpha > 0$. Power law distributions are observed for both the indegrees and the outdegrees of the web graph. Broder et al. report that for indegrees the power law coefficient is $\alpha \approx 2.1$, and for outdegrees $\alpha \approx 2.72$. There is a very close match in the literature in the value of α for the indegree distribution. For the outdegree distribution the reported value of α varies from 2.4 to 2.72 [3].

Broder et al. [4] also analysed the crawl for connectedness. Viewing the web graph as an undirected graph, it was observed that 91% of the nodes were connected to each other and formed a giant connected component. Interestingly, it was found that the distribution of the number of connected components by their sizes also satisfied a power law (α ≈ 2.5). Power law distribution in the sizes of the components was observed even when the graph was viewed as directed. However, the size of the largest strongly connected component (giant SCC) was only 28% of the total web crawl. The giant SCC was reachable from about 22% of the nodes (the set IN). A similar percentage of nodes was reachable from the giant SCC (the set OUT). A significant portion of the remaining nodes constituted, in Broder et al.'s terminology, "tendrils": nodes reachable from IN or from which the set OUT is reachable. All the experiments done so far point to a fractal-like, self-similar nature of the Web, in the sense that the structure described above is likely to be exhibited in any non-trivial crawl carried out at any time.

Kumar et al. [5] also carried out experiments to measure the number of bipartite cores in the Web. In a cleaned up version of the web graph consisting of 24 million nodes, they reported discovering around 135,000 bipartite cliques $K_{i,j}$ with $i \ge 3$ and $j = 3$. The number of $K_{i,j}$'s with $i, j = 3$ was approximately 90,000, and the numbers dropped exponentially with increase in $j$. Finding such cliques in the web graph is an algorithmically challenging problem which we discuss further in the section on Web algorithmics.

Measurements have also been made of the "diameter" of the Web [3,4]. If the Web has the structure asserted in Reference 4, then the probability of a path existing between two random vertices is approximately 24%, and the average shortest path length is 16. Albert et al. [3] measured the average shortest path length on a generated directed graph whose in- and outdegree distributions satisfied power laws with coefficients 2.1 and 2.45, respectively. In a directed graph with $8\times 10^8$ vertices the average shortest path length was determined to be approximately 19, and was a linear function of the logarithm of the number of vertices. The Web, therefore, is considered to exhibit the "small worlds" phenomenon [6].

52.3 Theoretical Growth Models

This fractal-like structure of an evolving graph, whose growth processes are so organised that the degree distribution of its vertices satisfies a power law, which has a large number of bipartite cliques as subgraphs, and which exhibits the small world phenomenon, has generated a lot of interest among computer scientists recently. Part of the reason is that the web graph does not belong to the much studied $G_{n,p}$ model, which consists of all graphs with n vertices having p as the probability that there is an edge between any two vertices [7]. Graphs in $G_{n,p}$ are essentially sparse and are unlikely to have many bipartite cliques as subgraphs. Moreover, for large n the degree distribution function is Poisson. There is consensus that the primary reason for the web graph to be different from a traditional random graph is that edges in the Web exhibit preferential attachment. Hyperlinks from a new web page are more likely to be directed to a popular, well established website or webpage, just as a new manuscript is more likely to cite papers that are already well cited. It is interesting to note that preferential attachment has been used to model evolving phenomena in fields ranging from economics [8] and biology [9] to languages [10]. Simon [11] used preferential attachment to explain power law distributions already observed in phenomena like the distribution of incomes, the distribution of species among genera, and the distribution of word frequencies in documents. A modern exposition of Simon's argument in the framework of its applicability to web graphs is provided by Mitzenmacher [12], and is the basis of Barabasi et al.'s "mean field theory" based model [13], as well as the "rate equation" approach of Krapivsky and Redner [14]. All three use continuous differential equations to model the dynamics of the evolving web graph. This approach is attractive because of its simplicity and the ease with which it enables one to focus on understanding the issues involved, particularly those relating to power law distributions associated with the Web.

Let us consider the following growth model (we will call this the basic model) which forms the kernel of most of the models reported in the literature. At each time step a new node is added to the web graph. This node gets one edge incident at it from one of the existing nodes in the graph, and has one edge pointing out of it. Let us assume that the tail (the node from which the edge emanates) of the edge which is incident at the node just added is chosen uniformly at random from the existing nodes. The head (the node at which the edge is incident) of the edge which emanates from the node just added is chosen with probability proportional to the indegree of the head, in keeping with the preferential attachment paradigm that new web pages tend to attach themselves to popular web pages. We assume, to keep matters simple, that the whole process starts with a single node with one edge emanating from it which is incident at the node itself.
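A small simulation sketch of this basic model follows (a hypothetical illustration; the chapter itself works analytically). The sketch draws the preferential edge in proportion to the actual indegree, that is, normalised by the total indegree rather than by t as in the continuum analysis below, so it reproduces the qualitative picture derived in this section: a heavy, power-law indegree tail against an exponentially decaying outdegree distribution that roughly halves with each unit of outdegree.

```python
import random
from collections import Counter

def basic_model(T, seed=0):
    """Simulate the basic model for T steps.

    Start: node 0 with a self-loop.  At each step a new node v arrives with
    (i) one edge u -> v whose tail u is a uniformly random existing node, and
    (ii) one edge v -> w whose head w is chosen with probability proportional
    to current indegree (preferential attachment).
    """
    random.seed(seed)
    heads = [0]                       # multiset of edge heads: node appears once per unit of indegree
    indeg, outdeg = Counter({0: 1}), Counter({0: 1})
    for v in range(1, T + 1):
        u = random.randrange(v)       # uniform tail for the edge pointing at the new node
        indeg[v] += 1
        outdeg[u] += 1
        heads.append(v)
        w = random.choice(heads)      # head chosen proportional to indegree
        indeg[w] += 1
        outdeg[v] += 1
        heads.append(w)
    return indeg, outdeg

indeg, outdeg = basic_model(200_000)
n = len(indeg)
in_frac = Counter(indeg.values())     # heavy (power-law) tail
out_frac = Counter(outdeg.values())   # exponential tail, roughly halving with k
print("indegree fractions :", {k: round(in_frac[k] / n, 3) for k in range(1, 6)})
print("outdegree fractions:", {k: round(out_frac[k] / n, 3) for k in range(1, 6)})
```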

Let $I_k(t)$ and $O_k(t)$ denote the number of nodes in the evolved web graph with indegree and outdegree equal to k, respectively, after t timesteps. $I_k(t)$ and $O_k(t)$ are random variables for all k, t ≥ 1. We will assume for the purposes of what follows that their values are concentrated around their means, and we will use $I_k(t)$ and $O_k(t)$ to denote the expected values also. Note that, at time t, one node and two directed edges are added to the graph. The expected change in the value of $I_k(t)$ is controlled by two processes: nodes of indegree k − 1 that receive the preferentially attached edge (increasing $I_k(t)$), and nodes of indegree k that receive it (decreasing $I_k(t)$). The expected increase is $(k-1)I_{k-1}(t)/t$. This is because of our assumption that the attachment probability is proportional to the indegree of the head of the edge emanating from the node just added, and t is the total number of nodes in the graph. Reasoning in the same way we get that the expected decrease is $kI_k(t)/t$. Since this change has taken place over one unit of time we can write

$$\frac{\Delta I_k(t)}{\Delta t} = \frac{(k-1)I_{k-1}(t) - kI_k(t)}{t}$$

or, in the continuous domain

$$\frac{dI_k(t)}{dt} = \frac{(k-1)I_{k-1}(t) - kI_k(t)}{t}.$$

(52.1)

We can solve Equation 52.1 for different values of k starting with 1. For k = 1 we note that the growth process at each time instance introduces a node of indegree 1. Therefore, Equation 52.1 takes the form

$$\frac{dI_1(t)}{dt} = 1 - \frac{I_1(t)}{t}$$

whose solution has the form $I_1(t) = i_1 t$. Substituting this in the above equation we get $i_1 = 1/2$. Working in the same way we can show that $I_2(t) = t/6$. It is not too difficult to see that $I_k(t)$ is a linear function of t for every k. Therefore, substituting $I_k(t) = i_k t$ in Equation 52.1 we get the recurrence equation

$$i_k = \frac{k-1}{k+1}\,i_{k-1}$$

whose solution is

$$i_k = \frac{1}{k(k+1)}.$$

(52.2)

What interpretation do we put on the solution $I_k(t) = i_k t$? It essentially means that in the steady state (i.e., as $t \to \infty$) the fraction of nodes in the graph with indegree k is $i_k$. Equation 52.2 implies $i_k \sim k^{-2}$, i.e., the indegree distribution satisfies a power law with α = 2.
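As a quick sanity check, the recurrence can be iterated exactly and compared against the closed form of Equation 52.2; the snippet below uses exact rational arithmetic and makes no assumptions beyond the recurrence itself.

```python
from fractions import Fraction

# Iterate i_k = ((k - 1)/(k + 1)) * i_{k-1} with i_1 = 1/2 and compare against
# the closed form i_k = 1/(k(k + 1)) of Equation 52.2.
i = Fraction(1, 2)
for k in range(2, 9):
    i = Fraction(k - 1, k + 1) * i
    assert i == Fraction(1, k * (k + 1)), k
    print(f"i_{k} = {i}")
```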

Let us now develop an estimate for Ok. The counterpart of Equation 52.1 in this case is

$$\frac{dO_k(t)}{dt} = \frac{O_{k-1}(t) - O_k(t)}{t}.$$

(52.3)

Assuming that in the steady state $O_k(t) = o_k t$, we can show that $o_1 = 1/2$, and for k > 1

$$o_k = \frac{o_{k-1}}{2}.$$

This implies $o_k = 2^{-k}$. That the outdegree distribution is exponential and not a power law should not come as a surprise if we recall that the process affecting outdegrees was uniformly random and had no component of preferential attachment associated with it.

It is instructive to note that very simple changes in the growth model affect the degree distribution functions substantially. Consider the following variation on the growth model analysed above. The edge emanating out of the added node does not attach itself to another node purely on the basis of its indegree. With probability β the edge points to a node chosen uniformly at random, and with probability 1 − β the edge is directed to a node chosen with probability proportional to its indegree. Equation 52.1 now takes the form

$$\frac{dI_k(t)}{dt} = \frac{\bigl(\beta I_{k-1}(t) + (1-\beta)(k-1)I_{k-1}(t)\bigr) - \bigl(\beta I_k(t) + (1-\beta)kI_k(t)\bigr)}{t}.$$

Note that the first term of the first part of the r.h.s. corresponds to the increase due to the uniform process and the second term to the increase due to the preferential attachment process. The recurrence equation now takes the form [12]

$$i_k\bigl(1 + \beta + k(1-\beta)\bigr) = i_{k-1}\bigl(\beta + (k-1)(1-\beta)\bigr)$$

or,

$$\frac{i_k}{i_{k-1}} = 1 - \frac{2-\beta}{1+\beta+k(1-\beta)} \approx 1 - \left(\frac{2-\beta}{1-\beta}\right)\frac{1}{k}$$

for large k. It can be verified by substitution that

$$i_k \sim k^{-\frac{2-\beta}{1-\beta}}$$

(52.4)

satisfies the above recurrence. It should be noted that in this case the power law coefficient can be any number larger than 2 depending upon the value of β.

How do we get a power law distribution in the outdegree of the nodes of the graph? The simplest modification to the basic model that ensures this is to choose the tail of the edge incident at the added node with probability proportional to the outdegree of the existing nodes. Aiello et al. [15], and Cooper and Frieze [16], have both analysed versions of the basic model in which, at any time epoch, apart from the edges incident at and emanating out of the added node being chosen uniformly at random and according to the out- and indegree distributions of the existing nodes, edges are also added between existing nodes of the graph. Both in- and outdegree distributions show power law behaviour. From a Web modeling perspective this is not particularly satisfying, because there is no natural analogue of preferential attachment to explain the process that controls the number of hyperlinks in a web page. Nevertheless, all models in the literature that produce a power law distribution in the outdegree resort to it in one form or another. We will now summarize the other significant variations of the basic model that have been reported in the literature.

Kumar et al. [17] categorise evolutionary web graph models according to the rate of growth they allow. Linear growth models allow one node to be added at each time epoch along with a fixed number of edges to the nodes already in the graph. Exponential growth models allow the graph to grow by a fixed fraction of its current size at each time epoch. The models discussed above are linear growth models. Kumar et al. in Reference 17 introduce the notion of copying, in which the head of the edge emanating out of the added node is chosen to be the head of an edge emanating out of a "designated" node (chosen randomly). The intuition for copying is derived from the insight that links out of a new page are more likely to be directed to pages that deal with the "topic" associated with the page. The designated node represents the choice of the topic, and the links out of it are very likely links to "other" pages relating to the topic. Copying from the designated node is done with probability 1 − β; with probability β the head is chosen uniformly at random. The exponential growth model that they have analyzed does not involve the copying rule for distributing the new edges to be added. The tail of a new edge is chosen to be among the new nodes with some probability factor; if the tail is to be among the old nodes, then the old node is chosen with probability proportional to its outdegree. The analysis, as in References 15 and 16, is carried out entirely within the discrete domain, using martingale theory to establish that the expected values are sharply concentrated. For the linear growth model the power law coefficient is the same as in Equation 52.4. The copying model is particularly interesting because estimates of the distribution of bipartite cliques in graphs generated using copying match those found experimentally.
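The copying rule is easy to simulate. The sketch below is a rough illustration of linear growth with copying, assuming d out-links per new node and a small seed clique; the parameter names and values are illustrative assumptions, not those analysed in Reference 17.

```python
import random
from collections import Counter

def copying_model(T, d=3, beta=0.3, seed=0):
    """Linear growth with copying (illustrative sketch).

    Each new node v picks a random 'designated' node p and creates d out-links;
    the i-th link goes to a uniformly random older node with probability beta,
    and copies the head of p's i-th out-link with probability 1 - beta.
    Starts from a (d+1)-clique so that every node has d out-links.
    """
    random.seed(seed)
    out = {v: [(v + j) % (d + 1) for j in range(1, d + 1)] for v in range(d + 1)}
    for v in range(d + 1, T):
        p = random.randrange(v)                      # designated (prototype) node
        out[v] = [random.randrange(v) if random.random() < beta else out[p][i]
                  for i in range(d)]
    return out

out = copying_model(50_000)
indeg = Counter(w for links in out.values() for w in links)
# Copying concentrates links: a few nodes accumulate very large indegrees.
print("largest indegrees:", sorted(indeg.values(), reverse=True)[:5])
```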

Another approach that has been used to model evolutionary graphs is the so-called Master Equation Approach introduced by Dorogovtsev et al. [18], which focuses on the probability that at time t a node introduced at time i has indegree k. If we denote this quantity by p(k, i, t), then the equation controlling this quantity for indegrees in the basic model becomes

$$p(k,i,t+1) = \frac{k-1}{t+1}\,p(k-1,i,t) + \left(1 - \frac{k}{t+1}\right)p(k,i,t).$$

(52.5)

The first term on the r.h.s. corresponds to the probability with which the node increases its indegree. The second term is the complementary probability with which the node remains in its former state. The overall distribution of nodes of indegree k in the graph is

$$P(k,t) = \frac{1}{t+1}\sum_{i=0}^{t} p(k,i,t).$$

Using this definition together with Equation 52.5 we get

$$(t+1)P(k,t+1) = (k-1)P(k-1,t) + (t-k)P(k,t).$$

For extremely large networks the stationary form of this equation, i.e., as $t \to \infty$, is

$$P(k) = \frac{k-1}{k+1}\,P(k-1).$$

(52.6)

Notice that the solution of Equation 52.6 with appropriate initial conditions is of the form $P(k) \sim k^{-2}$, which is the same as that obtained by working in the continuous domain.

Rigorous analysis of the stochastic processes carried out so far in References 15–17 has taken into account only growth, that is, birth processes. The combinatorics of a web graph model that also involves death processes has still to be worked out. It must be pointed out that, using less rigorous techniques, a series of results dealing with issues like non-linear preferential attachment and growth rates, growth rates that change with time (aging and decay), and death processes in the form of edge removals have appeared in the literature. This work is primarily being done by physicists. Albert and Barabasi [19], and Dorogovtsev and Mendes [20], are two comprehensive surveys written for physicists which computer scientists would do well to go through. However, with all this work we are still far away from having a comprehensive model that takes into account all that we understand of the Web. All models view web growth as a global process. But we know that a web page to propagate Esperanto in India is more likely to have hyperlinks to and from pages of Esperanto enthusiasts in the rest of the world. From a modeling perspective, every new node added during the growth process may be associated with the topic of the node, chosen let us say by preferential attachment, to which the added node first points. This immediately reduces the set of nodes from which links can point to it. In effect, the probability for links to be added between two nodes associated with particular topics will be a function of how related the topics are. That there is some underlying structure to the web graph, defined by the topics associated with its nodes, has been in the modeling horizon for some time. Search engines that use web directories organised hierarchically as graphs and trees of topics have been designed [21]. Experiments to discover the topic-related underlying structures have been carried out [22–24]. A model that takes into account the implicit topic-based structure and models web growth as a number of simultaneously occurring local processes [25] is a step in this direction. All that is known about the Web, and the models that have been developed so far, point to the realisation that developing mathematically tractable models that capture the phenomena faithfully is a challenging problem which will excite the imagination and the creative energies of researchers for some time.

52.4 Properties of Web Graphs and Web Algorithmics

The properties of web graphs that have been studied analytically are the average shortest path length between two vertices and the size and distribution of giant components. All these properties have been studied extensively for the $G_{n,p}$ class of random graphs. Seminal work in this area for graphs with given degree sequences was done by Molloy and Reed [26], who came up with a precise condition under which a phase transition takes place and giant components of size comparable to the graph itself start to show up. In what follows we will discuss these issues using generating functions, as done by Newman, Strogatz, and Watts [27], primarily because of the simplicity with which these reasonably complex issues can be handled in an integrated framework.

52.4.1 Generating Function Framework

Following [27] we define for a large undirected graph with N nodes the generating function

$$G_0(x) = \sum_{k=0}^{\infty} p_k x^k,$$

(52.7)

where $p_k$ is the probability that a randomly chosen vertex has degree k. We assume that the probability distribution is correctly normalised, that is, $G_0(1) = 1$. $G_0(x)$ can also represent graphs where we know the exact number $n_k$ of vertices of degree k, by defining

$$G_0(x) = \frac{\sum_{k=0}^{\infty} n_k x^k}{\sum_{k=0}^{\infty} n_k}.$$

The denominator is required to ensure that the generating function is properly normalised. Consider the function $[G_0(x)]^2$. Note that the coefficient of $x^n$ in $[G_0(x)]^2$ is $\sum_{i+j=n} p_i p_j$, which is nothing but the probability that two independently chosen vertices have degrees summing to n. The product of two or more generating functions representing different degree distributions can be interpreted in the same way, as representing probability distributions reflecting independent choices from the distributions involved.
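The convolution interpretation of products of generating functions can be seen directly on a small example (the degree distribution below is arbitrary, chosen only for illustration):

```python
import numpy as np

# An arbitrary, normalised degree distribution p_k for k = 0..4.
p = np.array([0.1, 0.4, 0.3, 0.15, 0.05])

# Squaring G_0(x) is a polynomial convolution, so the coefficient of x^n in
# [G_0(x)]^2 is sum_{i+j=n} p_i p_j: the probability that two independently
# chosen vertices have degrees summing to n.
g0_squared = np.convolve(p, p)
print("P(deg(u) + deg(v) = 3):", round(g0_squared[3], 4))
print("direct computation    :", round(sum(p[i] * p[3 - i] for i in range(4)), 4))
```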

Consider the problem of estimating the average degree of a vertex chosen at random. The average degree is given by $\sum_k k p_k$, which is also equal to $G_0'(1)$. Interestingly, the average degree of a vertex chosen at random and the average degree of a vertex pointed to by a random edge are different. A random edge points to a vertex with probability proportional to the degree of the vertex, that is, proportional to $kp_k$. The appropriately normalised distribution is given by the generating function

$$\frac{\sum_k k p_k x^k}{\sum_k k p_k} = x\,\frac{G_0'(x)}{G_0'(1)}.$$

(52.8)

The generating function given above in Equation 52.8 has to be divided by x if we want the distribution of the degree of an immediate neighbor excluding the edge by which we reached it. We will denote that generating function by $G_1(x) = G_0'(x)/G_0'(1)$. The neighbors of these immediate neighbors are the second neighbors of the original node. The probability that any of the second neighbors connects to any of the immediate neighbors or to one another is no more than of order $N^{-1}$ and hence can be neglected when N is large. Under this assumption the distribution of the number of second neighbors of the originally chosen node is generated by

$$\sum_k p_k\,[G_1(x)]^k = G_0(G_1(x)).$$

The average number of second neighbors, therefore, is

$$z_2 = \left[\frac{d}{dx}G_0(G_1(x))\right]_{x=1} = G_0'(1)\,G_1'(1),$$

(52.9)

using the fact that G1(1) = 1.

52.4.2 Average Path Length

We can extend this reasoning to estimate the distribution of the number of mth neighbors of a randomly chosen node. The generating function for this distribution, denoted by $G^{(m)}(x)$, is given by

$$G^{(m)}(x) = \begin{cases} G_0(x) & m = 1,\\ G^{(m-1)}(G_1(x)) & m \ge 2. \end{cases}$$

Let zm denote the average number of mth nearest neighbors. We have

$$z_m = \left.\frac{dG^{(m)}(x)}{dx}\right|_{x=1} = G_1'(1)\,\bigl[G^{(m-1)}\bigr]'(1) = [G_1'(1)]^{m-1}z_1 = \left[\frac{z_2}{z_1}\right]^{m-1}z_1.$$

(52.10)

Let l be the smallest integer such that

$$1 + \sum_{i=1}^{l} z_i \ge N.$$

The average shortest path length between two random vertices can be estimated to be of the order l. Using Equation 52.10 we get

$$l \approx \frac{\log\bigl[(N-1)(z_2-z_1) + z_1^2\bigr] - \log z_1^2}{\log(z_2/z_1)}.$$

When $N \gg z_1$ and $z_2 \gg z_1$ the above simplifies to

$$l \approx \frac{\log(N/z_1)}{\log(z_2/z_1)} + 1.$$

There do exist more rigorous proofs of this result for special classes of random graphs. The most interesting observation made in Reference 27 is that only estimates of the numbers of nearest and second nearest neighbors are necessary for the calculation of the average shortest path length; by making these purely local measurements one can get a fairly good estimate of the average shortest distance, which is a global property. One, of course, is assuming that the graph is connected, or one is limiting the calculation to the giant connected component.
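The locality of the estimate is easy to demonstrate numerically: given any degree distribution, $z_1$ and $z_2$ are just two moments, and l follows. The sketch below uses an illustrative truncated power law and an illustrative graph size; it is not meant to reproduce the Web's measured (directed graph) path lengths, only to show that nothing beyond $z_1$ and $z_2$ is needed.

```python
import numpy as np

# Illustrative degree distribution: truncated power law p_k ~ k^-2.1, k = 1..1000.
k = np.arange(1, 1001)
p = k ** -2.1
p /= p.sum()

z1 = (k * p).sum()              # z1 = G_0'(1), the average degree
z2 = (k * (k - 1) * p).sum()    # z2 = G_0'(1) G_1'(1) = <k^2> - <k>
N = 2e8                         # illustrative graph size
l = np.log(N / z1) / np.log(z2 / z1) + 1
print(f"z1 = {z1:.2f}, z2 = {z2:.2f}, estimated average path length l = {l:.1f}")
```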

52.4.3 Emergence of Giant Components

Consider the process of choosing a random edge and determining the component of which one of its end nodes is a part. If there are no other edges incident at that node then that node is a component by itself. Otherwise the end node could be connected to one other component, or two other components, and so on. Therefore, the distribution of the size of the component attached to the end of a random edge is made up of the probability that the end node is by itself, that it is connected to one other component, to two other components, and so on. If $H_1(x)$ is the generating function for the distribution of the sizes of the components attached to one of the ends of such an edge, then $H_1(x)$ satisfies the recursive equation

$$H_1(x) = xG_1(H_1(x)).$$

(52.11)

Note that each such component is associated with the end of an edge. Therefore, the component associated with a random vertex is a collection of such components associated with the ends of the edges leaving the vertex, and so, H0(x) the generating function associated with size of the whole component is given by

$$H_0(x) = xG_0(H_1(x)).$$

(52.12)

We can use Equations 52.11 and 52.12 to compute the average component size which, in an analogous manner to computing the average degree of a node, is nothing but

$$H_0'(1) = 1 + G_0'(1)\,H_1'(1).$$

(52.13)

Similarly using Equation (52.11) we get

$$H_1'(1) = 1 + G_1'(1)\,H_1'(1),$$

which gives the average component size as

$$1 + \frac{G_0'(1)}{1 - G_1'(1)}.$$

(52.14)

The giant component first appears when $G_1'(1) = 1$. This condition is equivalent to

$$G_0''(1) = G_0'(1).$$

(52.15)

Using Equation 52.7, the condition implied by Equation 52.15 can also be written as

$$\sum_k k(k-2)\,p_k = 0.$$

(52.16)

This condition is the same as that obtained by Molloy and Reed in Reference 26. The sum on the l.h.s. of Equation 52.16 increases monotonically as edges are added to the graph. Therefore, the giant component comes into existence when the sum on the l.h.s. of Equation 52.16 becomes positive. Newman et al. state in Reference 27 that once there is a giant component in the graph, $H_0(x)$ generates the probability distribution of the sizes of the components excluding the giant component. Molloy and Reed have shown in Reference 28 that the giant component almost surely has $cN + o(N)$ nodes. It should be remembered that the results in References 26–28 are all for undirected graphs with specified degree sequences. As such they are applicable to graphs with power law degree distributions. However, web graphs are directed, and a model that adequately explains a structure consistent with all that is known experimentally is still to be developed.
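For degree distributions given in closed form, the criterion of Equation 52.16 is a one-line computation. The sketch below evaluates the sum for truncated power law distributions with a few illustrative exponents (the cutoff is arbitrary).

```python
import numpy as np

def molloy_reed_sum(alpha, kmax=10_000):
    """Evaluate sum_k k(k-2) p_k for a truncated power law p_k ~ k^-alpha, k >= 1.
    By Equation 52.16 a giant component exists once this sum becomes positive."""
    k = np.arange(1, kmax + 1, dtype=float)
    p = k ** -alpha
    p /= p.sum()
    return (k * (k - 2) * p).sum()

for alpha in (2.1, 3.0, 3.5, 4.0):
    q = molloy_reed_sum(alpha)
    print(f"alpha = {alpha}: sum = {q:+.3f} ->",
          "giant component" if q > 0 else "no giant component")
```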

52.4.4 Search on Web Graphs

Perhaps the most important algorithmic problem on the Web is the mining of data. We will, however, not have much to say about it, partly because it has been addressed in Chapter 51, but also because we want to focus on issues that become specially relevant in an evolving web graph. Consider the problem of finding a path between two nodes in a graph. On the Web this forms the core of the peer to peer (P2P) search problem, defined in the context of locating a particular file on the Web when there is no information available in a global directory about the node on which the file resides. The basic operation available is to pass messages to neighbors in a totally decentralised manner. Normally a distributed message passing flooding technique on an arbitrary random network would be suspect because of the very large number of messages that may be generated. Under the assumption that a node knows the identities of its neighbors and perhaps its neighbors' neighbors, Adamic et al. [29] claim, experimentally through simulations, that in power law graphs with N nodes the average search time is of the order of $N^{0.79}$ (the graphs are undirected and the power law coefficient of the generated graphs is 2.1) when the search is totally random. The intuitive explanation is that even a random search tends to gravitate towards nodes of high degree. When the strategy is to move to the highest degree neighbor, the exponent comes down to 0.70. Adamic et al. [29] have also come up with a fairly simple analysis to explain why random as well as degree directed search algorithms need not have time complexity more than a small fractional power of N (the reported exponent is 0.1). In what follows we develop the argument along the lines of Mehta [30], whose analysis, done assuming that the richest node is selected on the basis of looking at the neighbors and the neighbors' neighbors, results in a much sharper bound than the one in Reference 29. We will assume that the cutoff degree m, beyond which the probability distribution is small enough to be ignored, is given by $m = N^{1/\alpha}$ [31].

Using

$$p_k = p(k) = \frac{k^{-\alpha}}{\sum_k k^{-\alpha}}$$

as the probability distribution function, the expected degree z of a random node is

$$z = G_0'(1) \approx \frac{\int_1^m k^{1-\alpha}\,dk}{\int_1^m k^{-\alpha}\,dk} = \frac{(\alpha-1)\,(m^{2-\alpha}-1)}{(\alpha-2)\,(m^{1-\alpha}-1)} \approx \ln m,$$

(52.17)

under the assumption that α tends to 2 and $m^{1-\alpha}$ is small enough to be ignored. Similarly the probability distribution underlying $G_1(x)$ is

$$p_1(k) = \frac{k\,p_k}{\sum_k k\,p_k} = \frac{k^{1-\alpha}}{\sum_k k^{1-\alpha}} = c\,k^{1-\alpha},$$

(52.18)

where

$$c = \frac{1}{\sum_k k^{1-\alpha}} \approx \frac{2-\alpha}{m^{2-\alpha}-1} \approx \frac{1}{\ln m} = \frac{\alpha}{\ln N}.$$

(52.19)

The CDF for G1(x) using (52.18) and (52.19) is

$$P(x) = \int_1^x p_1(k)\,dk \approx \frac{\ln x}{\ln m}.$$

(52.20)

The search proceeds by choosing at each step the highest degree node among the n neighbors of the current node. The total time taken can be estimated as the sum of the number of new nodes scanned at each step, and the number of steps itself can be bounded by noting that this sum cannot exceed the size of the graph. Let $p_{\max}(x, n)$ be the distribution of the degree of the richest node (largest degree node) among the n neighbors of a node. Determining the richest node is equivalent to taking the maximum of n independent random variables, and in terms of the CDF P(x),

$$p_{\max}(x,n) = \begin{cases} 0 & x = 0,\\ P(x)^n - P(x-1)^n & 0 < x \le m,\\ 0 & x > m. \end{cases}$$

(52.21)

This can be approximated as follows

$$p_{\max}(x,n) \approx \frac{d\,P(x)^n}{dx} = n\,P(x)^{n-1}\,\frac{dP(x)}{dx} = n\,(\log_m x)^{n-1}\,c\,x^{-1}.$$

(52.22)

We can now calculate the expected degree of the richest node among the n nodes as

$$f(n) = E[x_{\max}(n)] = \int_1^m x\,p_{\max}(x,n)\,dx = nc\int_1^m (\log_m x)^{n-1}\,dx.$$

(52.23)

Note that if the average number of neighbors of every node is z, then the number of second neighbors seen when one is at the current richest node is f(z)(z − 1). This is because each of these f(z) neighbors has, on average, z neighbors, one of which is the richest node itself. At the next step, the expected degree of the richest neighbor of the node whose current degree is f(z) is given by f(f(z)), and the number of second neighbors seen at this stage is correspondingly f(f(z))(z − 1). An overlap between two successive computations of neighbors takes place only when there is an edge which makes a node both a neighbor and a second neighbor. However, the probability of this happening is of the order of $N^{-1}$ and can be ignored in the limit for large N. Therefore we can assume that at every stage new nodes are scanned, and the number of steps, l, is controlled by

$$(z-1)\sum_{i=0}^{l} f^{\,i}(z) = N.$$

(52.24)

Assuming for the sake of simplicity that f(f(z)) = f(z) (simulations indicate that the value of $f^{\,i}(z)$ increases with i and then stabilises), we can look for the fixed point f(n) = n, that is,

$$n = nc\int_1^m (\log_m x)^{n-1}\,dx,$$

or

$$(\ln m)^n = \int_1^m (\ln x)^{n-1}\,dx.$$

(52.25)

Substituting $e^t$ for x we get

$$(\ln m)^n = \int_0^{\ln m} e^t\,t^{n-1}\,dt = \int_0^{\ln m} t^{n-1}\sum_{i\ge 0}\frac{t^i}{i!}\,dt = \frac{(\ln m)^n}{n} + \sum_{i\ge 1}\frac{(\ln m)^{n+i}}{i!\,(n+i)} \approx \frac{(\ln m)^n}{n} + \frac{(\ln m)^n}{n+c}\,e^{\ln m},$$

where c is a constant. The above implies

$$1 = \frac{1}{n} + \frac{m}{n+c},$$

which gives

$$n = O(m).$$

Substituting this in Equation 52.24 we get

$$(\ln m - 1)\sum_{i=0}^{l} O(m) = N,$$

or

$$l \approx \frac{N}{m\ln m} = \frac{N^{1-1/\alpha}}{\ln m}.$$

Taking α to be 2.1, the number of steps would be of the order of $N^{0.52}/\ln m$. Mehta [30] also reports results of simulations carried out on graphs generated with α equal to 2.1 using the method given in Reference 31. The simulations were carried out on the giant components. He reports the number of steps to grow proportionally to $N^{0.34}$. The simulation results are very preliminary, as the giant components were not very large (the largest was of the order of 20 thousand nodes).
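A rough simulation sketch of degree-directed search is given below. It builds a synthetic undirected preferential-attachment graph as a stand-in for the power law graphs used in References 29–31 (the generator, parameters, and the fallback rule for dead ends are illustrative assumptions, not the procedures of Adamic et al. or Mehta) and counts the steps taken until the target is the current node or one of its neighbors.

```python
import random

def pa_graph(n, m=2, seed=0):
    """Undirected preferential-attachment graph (a stand-in for a power law graph)."""
    random.seed(seed)
    adj = {v: set() for v in range(n)}
    targets, weighted = list(range(m)), []
    for v in range(m, n):
        for w in set(targets):
            adj[v].add(w)
            adj[w].add(v)
            weighted += [v, w]                 # each endpoint recorded once per incident edge
        targets = random.sample(weighted, m)   # next targets, roughly proportional to degree
    return adj

def degree_directed_search(adj, start, goal, max_steps=10_000):
    """Move to the highest-degree unvisited neighbour; stop when goal is current or adjacent."""
    visited, cur = {start}, start
    for step in range(max_steps):
        if cur == goal or goal in adj[cur]:
            return step
        unvisited = [w for w in adj[cur] if w not in visited]
        cur = (max(unvisited, key=lambda w: len(adj[w])) if unvisited
               else random.choice(list(adj[cur])))   # dead end: take a random step
        visited.add(cur)
    return max_steps

N = 20_000
adj = pa_graph(N)
steps = sorted(degree_directed_search(adj, random.randrange(N), random.randrange(N))
               for _ in range(20))
print("median steps over 20 searches:", steps[len(steps) // 2])
```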

52.4.5 Crawling and Trawling

Crawling can be looked upon as a process in which an agent moves along the nodes and edges of a randomly evolving graph. Off-line versions of the Web, which are the basis of much of what is known experimentally about the Web, have been obtained essentially through this process. Cooper and Frieze [32] have attempted to study the expected performance of a crawl process in which the agent makes a fixed number of moves between two successive time steps involving addition of nodes and edges to the graph. The results are fairly pessimistic. The expected proportion of unvisited nodes is of the order of 0.57 of the graph size when the edges are added uniformly at random. The situation is even worse when the edges are distributed in proportion to the degree of nodes: the proportion of unvisited vertices increases to 0.59.

Trawling, on the other hand, involves analyzing the web graph obtained through a crawl for subgraphs which satisfy some particular structure. Since the web graph representation may run into terabytes of data, the traditional random access model for computing algorithmic complexity loses relevance. Even the standard external memory models may not apply, as data may be on tapes and may not fit totally on the disks available. In these environments it may even be of interest to develop algorithms where the figure of merit is the number of passes made over the graph represented on tape (S. Rajagopalan, private communication). In any case the issue in trawling is to design algorithms that efficiently stream data between secondary and main memory.

Kumar et al. discuss in Reference 5 the design of trawling algorithms to determine bipartite cores in web graphs. We will focus only on the core issue here, which is that the size of the web graph is huge in comparison to the cores that are to be detected. This requires pruning of the web graph before algorithms for core detection are run. If (i, j) cores (i represents outdegree and j indegree) have to be detected, then nodes with outdegree less than i and nodes with indegree less than j can be pruned out. This can be done with repeated sorting of the web graph by in- and outdegrees, keeping track of only those nodes that satisfy the pruning criteria. If an index of all the pruned vertices is kept in memory, then repeated sorting may also be avoided.

The other pruning strategy discussed in Reference 5 is the so-called inclusion-exclusion pruning, in which the focus at every step is to either discover an (i, j) core or exclude a node from further contention. Consider a node x with outdegree equal to i (such nodes will be termed fans) and let Γ(x) be the set of nodes with which x could potentially form an (i, j) core (called centers). An (i, j) core will be formed if and only if there are i − 1 other nodes all pointing to each node in Γ(x). While this condition is easy to check if there are two indices in memory, the whole process can be done in two passes. In the first pass, identify all the fans with outdegree i and output, for each such fan, the set of i centers adjacent to it. In the second pass, use an index on the destination id to generate the set of fans pointing to each of the i centers and compute the intersection of these sets. Kumar et al. [5] mention that this process can be batched, with an index maintained only of the centers that result from the fan pruning process. If we also maintain the set of fans corresponding to the centers that have been indexed, then the dual condition can be used: x is part of a core if and only if the intersection of the sets Γ−1(y), taken over the centers y in Γ(x), has size at least j. If this process results in the identification of a core then the core is output; otherwise the node x is pruned out. Kumar et al. [5] claim that this process does not eliminate any cores that have not yet been identified.
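An in-memory sketch of the first (degree-based, iterative) pruning strategy is given below. It is only an illustration of the idea, since the setting Kumar et al. [5] address keeps the edge list on disk and realises the same pruning through repeated sorting or streaming passes; the toy graph and thresholds are hypothetical.

```python
from collections import defaultdict

def prune_for_cores(edges, i, j):
    """Iterative degree pruning before (i, j)-core detection: repeatedly drop
    potential fans with outdegree < i and potential centers with indegree < j."""
    out_adj, in_adj = defaultdict(set), defaultdict(set)
    for u, v in edges:
        out_adj[u].add(v)
        in_adj[v].add(u)
    changed = True
    while changed:
        changed = False
        for u in [u for u in out_adj if len(out_adj[u]) < i]:
            for v in out_adj.pop(u):
                if v in in_adj:
                    in_adj[v].discard(u)
            changed = True
        for v in [v for v in in_adj if len(in_adj[v]) < j]:
            for u in in_adj.pop(v):
                if u in out_adj:
                    out_adj[u].discard(v)
            changed = True
    return out_adj, in_adj

# Toy graph: fans 0,1,2 all point to centers 10,11,12 (a (3,3) core) plus noise edges.
edges = [(f, c) for f in (0, 1, 2) for c in (10, 11, 12)] + [(3, 10), (4, 99)]
fans, centers = prune_for_cores(edges, 3, 3)
print("surviving fans   :", sorted(fans))      # [0, 1, 2]
print("surviving centers:", sorted(centers))   # [10, 11, 12]
```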

The area of trawling is in its infancy, and the techniques that will be developed will depend primarily on the structural properties being discovered. It is likely to be influenced by the underlying web model and by whether any semantic information is also used in defining the structure. This semantic information may be inferred by structural analysis or may be available in some other way. The development of future web models will depend largely on our understanding of the web, and they themselves will influence the algorithmic techniques developed.

52.5 Conclusions

The modeling and analysis of the web as a dynamic graph is very much in its infancy. Continuous mathematical models, which have been the focus of this write-up, provide good intuitive understanding at the expense of rigour. Discrete combinatorial models that do not brush the problems of proving concentration bounds under the rug are available only for very simple growth models. These growth models do not incorporate death processes, the issues relating to aging (newly created pages are more likely to be involved in link generation processes), or, for that matter, the fact that in- and outdegree distributions on the web are not independent and may depend upon the underlying community structure. The process of crawling, which can visit only those vertices that have at least one edge pointing to them, gives a very limited understanding of degree distributions. There are reasons to believe that a significantly large number of pages on the web have only outgoing hyperlinks. All this calls for extensive experimental investigations and the development of mathematical models that are tractable and help both in the development of new analytical as well as algorithmic techniques.

The author would like to acknowledge Amit Agarwal who provided a willing sounding board and whose insights have significantly influenced the author’s approach on these issues.

References

1.L. Egghe and R. Rousseau. Introduction to Infometrics: Quantitative Methods in Library, Documentation and Information Science. Elsevier, 1990.

2.J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999.

3.R. Albert, H. Jeong, and A. L. Barabasi. Diameter of the World Wide Web. Nature, 401:130–131, 1999.

4.A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph Structure in the Web: Experiments and Models. In Proceedings of the 9th World Wide Web Conference, 2000.

5.R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling Emerging Cyber-communities Automatically. In Proceedings of the 8th World Wide Web Conference, 1999.

6.D. J. Watts. Small Worlds: The Dynamics of Networks between Order and Randomness. Princeton University Press, 1999.

7.B. Bollobas. Modern Graph Theory. Springer-Verlag, New York, 1998.

8.B. Mandelbrot. Fractals and Scaling in Finance. Springer-Verlag, 1997.

9.G. U. Yule. A Mathematical Theory of Evolution based on the Conclusions of Dr J.C. Willis, F.R.S. Philosophical Transaction of the Royal Society of London (Series B), 213:21–87, 1925.

10.G. K. Zipf. Selective Studies and the Principle of Relative Frequency in Language. Harvard University Press, 1932.

11.H. A. Simon. On a Class of Skew Distribution Functions. Biometrika. 42:425–440, 1955.

12.M. Mitzenmacher. A Brief History of Generative Models for Power Law and Lognormal Distributions. Internet Mathematics, 1(2):226–251, 2004.

13.A. L. Barabasi, R. Albert, and H. Jeong. Mean Field Theory for Scale-free Random Networks. Physica A, 272:173–189, 1999.

14.P. L. Krapivsky and S. Redner. Organisation of Growing Random Networks. Physical Review E, 63:066123001–066123014, 2001.

15.W. Aiello, F. Chung, and L. Lu. Random Evolution in Massive Graphs. In Handbook on Massive Data Sets, (Eds. J. Abello et al.), pp. 97–122, 2002.

16.C. Cooper and A. Frieze. A General model of Web Graphs. Random Structures & Algorithms, 22(3):311–335, 2003.

17.R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal. Stochastic Models for the Web Graph. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science, pp. 57–65, 2000.

18.S. N. Dorogovtsev, J. F. F. Mendes, and A. N. Samukhin. Structure of Growing Networks with Preferential Linking. Phys. Rev. Lett. 85:4633, 2000.

19.R. Albert and A. L. Barabasi. Statistical Mechanics of Complex Networks. Reviews of Modern Physics, 74:47–97, 2002.

20.S. N. Dorogovtsev and J. F. F. Mendes. Evolution of Networks. Advances in Physics, 51:1079–1187, 2002.

21.S. Chakrabarti, M. Van den Berg, and B. Dom. Focused Crawling: A New Approach to Topic Specific Web Resource Discovery. Computer Networks: The International Journal of Computer and Telecommunications Networking, 31:1623–1640, 1999.

22.S. Chakrabarti, M. M. Joshi, K. Punera, and D. V. Pennock. The Structure of Broad Topics on the Web. In Proceedings of the 11th World Wide Web Conference, pp. 225–234, 2002.

23.C. Chekuri, M. Goldwasser, P. Raghavan, and E. Upfal. Web Search using Automatic Classification. In Proceedings of the 6th World Wide Web Conference, 1997.

24.D. Gibson, J. Kleinberg, and P. Raghavan. Inferring Web Communities from Link Topology. In Proceedings of the 9th ACM Conference on Hypertext and Hypermedia, pp. 225–234, 1998.

25.A. Agarwal, S. N. Maheshwari, and B. Mehta. An Evolutionary web graph Model. Technical Report, Department of Computer Science and Engineering, IIT Delhi, 2002.

26.M. Molloy and B. Reed. A Critical Point for Random Graphs with a given Degree Sequence. Random Structures and Algorithms, 6:161–179, 1995.

27.M. E. J. Newman, S.H. Strogatz, and D.J. Watts. Random Graphs with Arbitrary Degree Distribution and their Applications. Physical Review E, 64:026118, 2001.

28.M. Molloy and B. Reed. The Size of a Giant Component of a Random Graph with a given Degree Sequence. Combinatorics, Probability and Computing, 7:295–305, 1998.

29.L. A. Adamic, R. M. Lukose, A. R. Puniyani, and B. A. Huberman. Search in Power Law Networks. Physical Review E., 64:46135–46143, 2001.

30.B. Mehta, Search in Web Graphs. M. Tech. Thesis, Department of Computer Science and Engineering I.I.T. Delhi, 2002.

31.W. Aiello, F. Chung, and L. Lu. A Random Graph Model for Massive Graphs. In Proceedings of the 32nd Annual ACM Symposium on Theory of Computing, pp. 171–180, 2000.

32.C. Cooper and A. Frieze. Crawling on Web Graphs. In Proceedings of the 43rd Annual Symposium on Theory of Computing, pp. 419–427, 2002.

* This chapter has been reprinted from the first edition of this Handbook, without any content updates.
