Applying the PageRank algorithm

A research paper, titled The Anatomy of a Large-Scale Hypertextual Web Search Engine, by Sergey Brin and Lawrence Page, revolutionized web searching, and Google based its search engine on this concept of PageRank and came to dominate other web search engines.

When searching the web using Google, pages that are ranked highly by its algorithm are displayed. In the context of graphs, instead of web pages, if vertices are ranked based on the same algorithm, lots of new inferences can be made. From the outside, it may sound like this PageRank algorithm is useful only for web searches. But it has immense potential to be applied to many other areas.

In graph parlance, if there is an edge, E, connecting two vertices, from V1 to V2, according to the PageRank algorithm, V2 is more important than V1. In a huge graph of vertices and edges, it is possible to calculate the PageRank of each and every vertex.

The PageRank algorithm can be applied very well to the tennis tournament analysis use case covered in the preceding section. In the graph representation that is adopted here, each match is represented as an edge. The source vertex has the winner's details and the destination vertex has the loser's details. In the game of tennis, if this can be termed as some fictitious importance ranking, then in a given match the winner has higher importance ranking than the loser.

If the graph in the previous use case is taken to demonstrate the PageRank algorithm, then that graph has to be reversed so that the winner of each match becomes the destination vertex of each and every edge. At the Scala REPL prompt, try the following statements:

scala> import org.apache.spark._
  import org.apache.spark._ 
  scala> import org.apache.spark.graphx._
  import org.apache.spark.graphx._    
  scala> import org.apache.spark.rdd.RDD
  import org.apache.spark.rdd.RDD
    scala> //Define a property class that is going to hold all the properties of the vertex which is nothing but player informationscala> case class Player(name: String, country: String)
      defined class Player
    scala> // Create the player verticesscala> val players: RDD[(Long, Player)] = sc.parallelize(Array((1L, Player("Novak Djokovic", "SRB")), (3L, Player("Roger Federer", "SUI")),(5L, Player("Tomas Berdych", "CZE")), (7L, Player("Kei Nishikori", "JPN")), (11L, Player("Andy Murray", "GBR")),(15L, Player("Stan Wawrinka", "SUI")),(17L, Player("Rafael Nadal", "ESP")),(19L, Player("David Ferrer", "ESP"))))
players: org.apache.spark.rdd.RDD[(Long, Player)] = ParallelCollectionRDD[212] at parallelize at <console>:64
    scala> //Define a property class that is going to hold all the properties of the edge which is nothing but match informationscala> case class Match(matchType: String, points: Int, head2HeadCount: Int)
      defined class Match
    scala> // Create the match edgesscala> val matches: RDD[Edge[Match]] = sc.parallelize(Array(Edge(1L, 5L, Match("G1", 1,1)), Edge(1L, 7L, Match("G1", 1,1)), Edge(3L, 1L, Match("G1", 1,1)), Edge(3L, 5L, Match("G1", 1,1)), Edge(3L, 7L, Match("G1", 1,1)), Edge(7L, 5L, Match("G1", 1,1)), Edge(11L, 19L, Match("G2", 1,1)), Edge(15L, 11L, Match("G2", 1, 1)), Edge(15L, 19L, Match("G2", 1, 1)), Edge(17L, 11L, Match("G2", 1, 1)), Edge(17L, 15L, Match("G2", 1, 1)), Edge(17L, 19L, Match("G2", 1, 1)), Edge(3L, 15L, Match("S", 5, 1)), Edge(1L, 17L, Match("S", 5, 1)), Edge(1L, 3L, Match("F", 11, 1))))
matches: org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[Match]] = ParallelCollectionRDD[213] at parallelize at <console>:64
    scala> //Create a graph with the vertices and edgesscala> val playGraph = Graph(players, matches)
playGraph: org.apache.spark.graphx.Graph[Player,Match] = org.apache.spark.graphx.impl.GraphImpl@263cd0e2
    scala> //Reverse this graph to have the winning player coming in the destination vertex
	scala> val rankGraph = playGraph.reverse
rankGraph: org.apache.spark.graphx.Graph[Player,Match] = org.apache.spark.graphx.impl.GraphImpl@7bb131fb
    scala> //Run the PageRank algorithm to calculate the rank of each vertex
	scala> val rankedVertices = rankGraph.pageRank(0.0001).vertices
rankedVertices: org.apache.spark.graphx.VertexRDD[Double] = VertexRDDImpl[1184] at RDD at VertexRDD.scala:57
    scala> //Extract the vertices sorted by the rank
	scala> val rankedPlayers = rankedVertices.join(players).map{case 
	(id,(importanceRank,Player(name,country))) => (importanceRank,
	name)}.sortByKey(ascending=false)
	
	rankedPlayers: org.apache.spark.rdd.RDD[(Double, String)] = ShuffledRDD[1193] at sortByKey at <console>:76
    
	scala> rankedPlayers.collect().foreach(println)
      (3.382662570589846,Novak Djokovic)    
      (3.266079758089846,Roger Federer)    
      (0.3908953124999999,Rafael Nadal)    
      (0.27431249999999996,Stan Wawrinka)    
      (0.1925,Andy Murray)    
      (0.1925,Kei Nishikori)    
      (0.15,David Ferrer)    
      (0.15,Tomas Berdych)
    

If the preceding code is scrutinized carefully, it can be seen that the highest ranked players have won the highest number of matches.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.145.69.185