Understanding the PageRank algorithm

At some point, most of us have come across the term PageRank. PageRank is one of the many ways in which Google ranks web pages for search and indexing. A simple Google search (pun totally intended) will return results telling you that it basically involves a collection of nodes through which we can walk at random. However, what does that really mean?

Say the control is dropped onto some node within a graph. What we are saying is that, with a probability of alpha, the control can jump to any node on the graph, unbiased; and when it does land on a node, it shares a portion of its rank equally with all of its connected nodes before traversing along one of that node's edges at random, with a probability of (1 - alpha).
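To make the walk concrete, here is a minimal sketch of a single step of it in Python. The adjacency-list graph, the surf_step name, and the alpha value are all assumptions made for illustration; the description above only fixes the two probabilities:

import random

def surf_step(graph, current, alpha=0.15):
    # With probability alpha, make an unbiased jump to any node
    # (we also jump if the current node has no outgoing edges).
    if random.random() < alpha or not graph[current]:
        return random.choice(list(graph))
    # Otherwise, traverse one of the current node's edges at random.
    return random.choice(graph[current])

# A toy graph as an adjacency list (hypothetical example).
graph = {'A': ['B', 'C'], 'B': ['C'], 'C': ['A']}
print(surf_step(graph, 'A'))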

How and why does this matter? It's just jumping from one node to another and then randomly traversing to some other connected node, right?

If you do this for long enough, you will land on all of the nodes, and on some nodes more often than others. You see where I am going with this? This would end up telling you which nodes are frequented more than others, which could happen for the following two reasons:

  • We just happened to jump to the same node multiple times
  • That node is connected to multiple nodes

The first scenario can happen; however, since we know that our jumps are unbiased, the Law of Large Numbers dictates that, over a long enough walk, those jumps spread evenly across all of the nodes, so we can safely rule it out.

The second scenario, on the other hand, is not only possible but also very important to PageRank. Once you land on one of these nodes, that's when we calculate the PageRank for that node, based on alpha and on the rank inherited from the preceding node.
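Before we go any further, a short simulation of the walk can make this frequency argument tangible. Everything here (the toy graph, the walk length, the simulate_walk helper) is a made-up illustration, not anyone's production code: we count how often the walk visits each node and normalize those counts into frequencies:

import random
from collections import Counter

def simulate_walk(graph, alpha=0.15, steps=100_000):
    # Walk the graph as described above, counting visits per node.
    visits = Counter()
    current = random.choice(list(graph))
    for _ in range(steps):
        if random.random() < alpha or not graph[current]:
            current = random.choice(list(graph))     # unbiased jump
        else:
            current = random.choice(graph[current])  # follow a random edge
        visits[current] += 1
    # Normalize the raw counts into visit frequencies.
    return {node: count / steps for node, count in visits.items()}

graph = {'A': ['B', 'C'], 'B': ['C'], 'C': ['A']}
print(simulate_walk(graph))

Run it a few times: the frequencies barely change between runs (the Law of Large Numbers at work), and the node with the most incoming links, C in this toy graph, consistently comes out on top.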

We have been talking in abstract terms of nodes and edges; however, for a moment, let's take a look at a statement made in the very first publication on PageRank by Sergey Brin and Lawrence Page (http://infolab.stanford.edu/~backrub/google.html):

We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also, C(A) is defined as the number of links going out of page A. The PageRank of page A is given as follows:

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.
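To see the formula in action, suppose (purely hypothetically) that page A is cited by two pages: T1, with PR(T1) = 1 and two outgoing links, and T2, with PR(T2) = 0.5 and a single outgoing link. With d = 0.85, this gives:

PR(A) = (1 - 0.85) + 0.85 (1/2 + 0.5/1) = 0.15 + 0.85 * 1.0 = 1.0

Each citation contributes its own rank divided by its outgoing link count, so a link from a page with few outgoing links is worth more than a link from a page that links to everything.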

From the preceding statement, we can see that the PageRank (PR) of a given page/node is derived from the PR of its citations (T1...Tn). But how does one know where to start, since calculating PR(T1) requires knowing the PR of T1's own citations? The simple answer is that we do not need to know the value of PR(T1), or of any other citation for that matter. What we can do instead is simply guess a value for PR(T1) and repeatedly apply the formula, feeding the values derived in each step into the next.
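Here is a minimal sketch of that guess-and-iterate idea in Python, applying the quoted formula directly. The toy graph, the initial guess of 1 for every page, and the fixed number of iterations are assumptions for illustration, not the authors' implementation:

def pagerank(graph, d=0.85, iterations=50):
    # Iterate PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)),
    # starting from a blind guess of PR = 1 for every page.
    # graph maps each page to the pages it links to; we assume every
    # page has at least one outgoing link (dangling pages need extra care).
    pr = {page: 1.0 for page in graph}
    # Invert the graph once: for each page, which pages cite it?
    citations = {page: [] for page in graph}
    for page, links in graph.items():
        for target in links:
            citations[target].append(page)
    for _ in range(iterations):
        # Recompute every rank from the previous pass's values.
        pr = {page: (1 - d) + d * sum(pr[t] / len(graph[t]) for t in citations[page])
              for page in graph}
    return pr

graph = {'A': ['B', 'C'], 'B': ['C'], 'C': ['A']}
print(pagerank(graph))

After a few dozen passes, the values stop changing noticeably, no matter what we start from, which is exactly why the initial guess does not matter.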

However, why on earth would that work, you ask? The answer is simple: remember the Law of Large Numbers? If you repeat an action for long enough, its average result converges to the expected value. Then there is the question of how you can do this efficiently for the millions and billions of pages on the web. There are ways and means, which are beyond the scope of this chapter and this book; however, for those interested, the book on Google's PageRank available at https://press.princeton.edu/titles/8216.html is a great read. I hope this sheds some light on the basic principle involved.
