Implementing the PageRank algorithm

The most important part of the PageRank algorithm is to come up with the best way to calculate the importance of each page that is returned by the query results. To calculate a number from 0 to 1 that can quantify the importance of a particular page, the algorithm incorporates information from the following two components:

  • Information that was specific to the query entered by the user: This component estimates, in the context of the query entered by the user, how relevant the content of the web page is. The content of the page is directly dependent on the author of the page.
  • Information that was not relevant to the query entered by the user: This component tries to quantify the importance of each web page in the context of its links, views, and neighborhood. This component is difficult to calculate as web pages are heterogeneous and coming up with criteria that can be applied across the web is difficult to develop.

In order to implement the PageRank algorithm in Python, first, let's import the necessary libraries:

import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
%matplotlib inline

For the purpose of demonstration, let's assume that we are analyzing only five webpages in the network. Let's call this set of pages myPages and together they are in a network named myWeb:

myWeb = nx.DiGraph()
myPages = range(1,5)

Now, let's connect them randomly to simulate an actual network:

connections = [(1,3),(2,1),(2,3),(3,1),(3,2),(3,4),(4,5),(5,1),(5,4)]
myWeb.add_nodes_from(myPages)
myWeb.add_edges_from(connections)

Now, let's plot this graph:

pos=nx.shell_layout(myWeb)
nx.draw(myWeb, pos, arrows=True, with_labels=True)
plt.show()

It creates the visual representation of our network, as follows:

In the PageRank algorithm, the patterns of a web page are contained in a matrix called the transition matrix. There are algorithms that constantly update the transition matrix to capture the constantly changing state of the web. The size of the transition matrix is n x n, where n is the number of nodes. The numbers in the matrix are the probability that a visitor will next go to that link due to the outbound link.

In our case, the preceding graph shows the static web that we have. Let's define a function that can be used to create the transition matrix:

Note that this function will return G, which represents the transition matrix for our graph.

Let's generate the transition matrix for our graph:

Note that the transition matrix is 5 x 5 for our graph. Each column corresponds to each node in the graph. For example, column 2 is about the second node. There is a 0.5 probability that the visitor will navigate from node 2 to node 1 or node 3. Note that the diagonal of the transition matrix is 0 as in our graph, there is no outbound link from a node to itself. In an actual network, it may be possible.

Note that the transition matrix is a sparse matrix. As the number of nodes increases, most of its values will be 0.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.189.180.244