PageRank with Apache Giraph

This recipe builds and tests the default Apache Giraph PageRank example, which is modeled after the Google Pregel implementation. It demonstrates the steps involved in submitting a Giraph job to a pseudo-distributed Hadoop cluster and executing it.

Getting ready

For first-time Giraph users, we recommend running this recipe using a pseudo-distributed Hadoop cluster.

For the client machine, you will need Subversion and Maven installed and available on the user environment path.

This recipe does not require a full understanding of the Giraph API, but it does assume some familiarity with Bulk Synchronous Parallel (BSP) and the design goals of vertex-centric APIs including Apache Giraph and Google Pregel.

How to do it...

Carry out the following steps to build and test the default Giraph PageRank example:

  1. Navigate to a base folder and perform an SVN checkout of the latest Giraph source, located at the official Apache site:
    $ svn co https://svn.apache.org/repos/asf/giraph/trunk
  2. Change into the trunk folder and run the build:
    $ mvn compile
  3. Once the build finishes, navigate to the target folder created in the trunk and you should see the JAR file giraph-0.2-SNAPSHOT-jar-with-dependencies.jar.
  4. Run the following command:
    $ hadoop jar giraph-0.2-SNAPSHOT-jar-with-dependencies.jar org.apache.giraph.benchmark.PageRankBenchmark -V 1000 -e 1 -s 5 -w 1 -v
  5. You should see the job execute, and the MapReduce command-line output should report success.
  6. The Giraph stats counter group in the printout should show the following stats:
    INFO mapred.JobClient:   Giraph Stats
    INFO mapred.JobClient:     Aggregate edges=1000
    INFO mapred.JobClient:     Superstep=6
    INFO mapred.JobClient:     Last checkpointed superstep=0
    INFO mapred.JobClient:     Current workers=1
    INFO mapred.JobClient:     Current master task partition=0
    INFO mapred.JobClient:     Sent messages=0
    INFO mapred.JobClient:     Aggregate finished vertices=1000
    INFO mapred.JobClient:     Aggregate vertices=1000

How it works...

First, we use Subversion to check out the latest source from the official Apache site. Once we build the JAR file, the PageRankBenchmark example job is available for submission. Before we are ready to test Giraph, we must set the following command line options:

  • -V: The total number of vertices to run through PageRank. We chose 1000 just for testing; for more meaningful results we would want millions of vertices over a fully-distributed cluster.
  • -e: The number of outgoing edges defined for each vertex. This controls how many messages each vertex sends to its neighboring vertices during each superstep, where a neighbor is a vertex connected to another vertex by one or more edges.
  • -s: The total number of supersteps to run before terminating PageRank.
  • -w: The total number of workers available to handle distinct graph partitions. Since we are running a pseudo-distributed cluster (single host), it is safe to limit this to one. In a fully-distributed cluster, we would want multiple workers spread across different physical hosts.
  • -v: Activates verbose mode to follow the job progress on the console.
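These options map directly onto the Giraph Stats counters shown in step 6. As a sanity check, the small plain-Java sketch below (our own illustrative class, not part of the Giraph API) reproduces the counter arithmetic from the values we passed:

```java
public class GiraphStatsCheck {
    // Aggregate edges = total vertices (-V) x out-edges per vertex (-e).
    static long aggregateEdges(long vertices, long edgesPerVertex) {
        return vertices * edgesPerVertex;
    }

    public static void main(String[] args) {
        long vertices = 1000;     // -V 1000
        long edgesPerVertex = 1;  // -e 1
        long workers = 1;         // -w 1

        System.out.println("Aggregate edges=" + aggregateEdges(vertices, edgesPerVertex));
        System.out.println("Aggregate vertices=" + vertices);
        System.out.println("Current workers=" + workers);
    }
}
```

With -V 1000 and -e 1, this prints Aggregate edges=1000, matching the counter in the job printout.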

The job has no classpath dependencies beyond core Hadoop and ZooKeeper, and can be submitted directly to the cluster with the hadoop jar command.

The PageRankBenchmark example does not output the results back to HDFS. It is designed primarily to test and expose certain cluster bottlenecks that might hinder other production Giraph jobs. Running the job against a large number of vertices with multiple edges may expose memory constraints, network I/O connectivity issues between workers, and other potential problems.

There's more...

Apache Giraph is a relatively new open source batch computation framework. The following tips will help you further your understanding:

Keep up with the Apache Giraph community

Apache Giraph has a very active developer community. The API is constantly being enhanced with new features, bug fixes, and occasional refactoring. It is a good idea to update your source from trunk at least once a week. At the time of this writing, Giraph has no public Maven artifact. This will change in the very near future, but for now SVN is required to pull source updates.

Read and understand the Google Pregel paper

In 2010, Google published a research paper, Pregel: A System for Large-Scale Graph Processing, describing in high-level technical detail its proprietary system for scalable graph-centric processing based on the Bulk Synchronous Parallel (BSP) model.

Apache Giraph is an open source implementation of many of the concepts found in this research paper. Familiarity with the Pregel design will help to explain many components found in the Giraph codebase.

A basic introduction to BSP can be found on Wikipedia at http://en.wikipedia.org/wiki/Bulk_Synchronous_Parallel.
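To make the superstep model concrete, here is a minimal, self-contained Java simulation of Pregel-style synchronous PageRank. This is an illustration of the BSP pattern only, not the actual Giraph Vertex API; the graph (a directed ring with one out-edge per vertex) and the 0.15/0.85 damping constants are our own illustrative choices:

```java
public class PregelStylePageRank {
    // Runs synchronous (superstep-at-a-time) PageRank on a directed ring
    // of n vertices, each with exactly one out-edge, and returns the ranks.
    static double[] run(int n, int supersteps) {
        double[] rank = new double[n];
        java.util.Arrays.fill(rank, 1.0 / n);
        for (int step = 0; step < supersteps; step++) {
            // Message phase: each vertex sends rank/outDegree along its
            // out-edges (out-degree is always 1 on a ring).
            double[] incoming = new double[n];
            for (int v = 0; v < n; v++) {
                incoming[(v + 1) % n] += rank[v];
            }
            // Compute phase: every vertex updates its value from the
            // messages it received. The implicit barrier between phases is
            // what BSP's synchronization step models: no vertex begins
            // superstep step+1 until all vertices finish superstep step.
            for (int v = 0; v < n; v++) {
                rank[v] = 0.15 / n + 0.85 * incoming[v];
            }
        }
        return rank;
    }

    public static void main(String[] args) {
        double[] rank = run(10, 5); // 5 supersteps, analogous to -s 5
        double total = 0;
        for (double r : rank) total += r;
        System.out.printf("rank mass after 5 supersteps: %.6f%n", total);
    }
}
```

On a symmetric graph like this ring, the total rank mass stays at 1.0 across supersteps. In Giraph, the same per-superstep structure is partitioned across workers, with the barrier coordinated between them rather than implicit in a single loop.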

See also

  • Single-source shortest-path with Apache Giraph
  • Using Apache Giraph to perform a distributed breadth-first search