Time for action – representing the graph

Let's define a textual representation of the graph that we'll use in the following examples.

Create the following as graph.txt:

12,3,40C
21,4
31,5,6
41,2
53,6
63,5
76

What just happened?

We defined a file structure that will represent our graph, based somewhat on the adjacency list approach. We assumed that each node has a unique ID and the file structure has four fields, as follows:

  • The node ID
  • A comma-separated list of neighbors
  • The distance from the start node
  • The node status

In the initial representation, only the starting node has values for the third and fourth columns: its distance from itself is 0 and its status is "C", which we'll explain later.

Our graph is directional—more formally referred to as a directed graph—that is to say, if node 1 lists node 2 as a neighbor, there is only a return path if node 2 also lists node 1 as its neighbor. We see this in the graphical representation where all but one edge has an arrow on both ends.

Overview of the algorithm

Because this algorithm and corresponding MapReduce job is quite involved, we'll explain it before showing the code, and then demonstrate it in use later.

Given the previous representation, we will define a MapReduce job that will be executed multiple times to get the final output; the input to a given execution of the job will be the output from the previous execution.

Based on the color code described in the previous section, we will define three states for a node:

  • Pending: The node is yet to be processed; it is in the default state (white)
  • Currently processing: The node is being processed (gray)
  • Done: The final distance for the node has been determined (black)

The mapper

The mapper will read in the current representation of the graph and treat each node as follows:

  • If the node is marked as Done, it gives output with no changes.
  • If the node is marked as Currently processing, its state is changed to Done and gives output with no other changes. Each of its neighbors gives output as per the current record with its distance incremented by one, but with no neighbors; node 1 doesn't know node 2's neighbors, for example.
  • If the node is marked Pending, its state is changed to Currently processing and it gives output with no further changes.

The reducer

The reducer will receive one or more records for each node ID, and it will combine their values into the final output node record for that stage.

The general algorithm for the reducer is as follows:

  • A Done record is the final output and no further processing of the values is performed
  • For other nodes, the final output is built up by taking the list of neighbors, where it is to be found, and the highest distance and state

Iterative application

If we apply this algorithm once, we will get node 1 marked as Done, several more (its immediate neighbors) as Current, and a few others as Pending. Successive applications of the algorithm will see all nodes move to their final state; as each node is encountered, its neighbors are brought into the processing pipeline. We will show this later.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.15.168.211