Let's define a textual representation of the graph that we'll use in the following examples.
Create the following as graph.txt
:
12,3,40C 21,4 31,5,6 41,2 53,6 63,5 76
We defined a file structure that will represent our graph, based somewhat on the adjacency list approach. We assumed that each node has a unique ID and the file structure has four fields, as follows:
In the initial representation, only the starting node has values for the third and fourth columns: its distance from itself is 0 and its status is "C", which we'll explain later.
Our graph is directional—more formally referred to as a directed graph—that is to say, if node 1 lists node 2 as a neighbor, there is only a return path if node 2 also lists node 1 as its neighbor. We see this in the graphical representation where all but one edge has an arrow on both ends.
Because this algorithm and corresponding MapReduce job is quite involved, we'll explain it before showing the code, and then demonstrate it in use later.
Given the previous representation, we will define a MapReduce job that will be executed multiple times to get the final output; the input to a given execution of the job will be the output from the previous execution.
Based on the color code described in the previous section, we will define three states for a node:
The mapper will read in the current representation of the graph and treat each node as follows:
The reducer will receive one or more records for each node ID, and it will combine their values into the final output node record for that stage.
The general algorithm for the reducer is as follows:
If we apply this algorithm once, we will get node 1 marked as Done, several more (its immediate neighbors) as Current, and a few others as Pending. Successive applications of the algorithm will see all nodes move to their final state; as each node is encountered, its neighbors are brought into the processing pipeline. We will show this later.
3.15.168.211