Chapter 12

Parallel patterns: graph search

Juan Gómez-Luna and Izzat El Hajj

Abstract

This chapter presents the parallel graph search pattern. Since graph search computation is about examining the vertex values, there is very little computation on these values once they are loaded from memory. As a result, the speed of graph search is typically limited by memory bandwidth. This chapter presents a graph data format, similar to the CSR storage format for sparse matrices, that helps minimize the consumption of memory bandwidth. We then introduce work queues, an important class of parallel data structures that supports work-efficient iterative algorithms that require dynamic discovery and collection of the data to be processed. We will show that the privatization technique, in the form of block-level queues and warp-level queues, can be productively used to minimize serialization when collecting data into the work queues. The chapter concludes with potential further optimizations that reduce kernel launch overhead and improve load balance.

Keywords

Graphs; graph algorithms; graph search; breadth-first search; social networks; driving direction map services; integrated circuits; maze routing; work efficiency; edge; vertex; double buffering; algorithm complexity; privatization; queue; hierarchical queue; hierarchical kernels

Our final parallel pattern is graph search. A graph is a data structure that represents the relations between entities. The entities involved are represented as vertices and the relations are represented as edges. Many important real-world problems are naturally formulated as large-scale graph problems and can benefit from massively parallel computation. Prominent examples include social networks and driving direction services. Graphs are intrinsically related to sparse matrices. In fact, graph computation can be formulated in terms of sparse matrix operations. However, one can often improve the efficiency of graph computation by exploiting properties that are specific to the type of graph computation being performed. In this chapter, we will focus on graph search, a graph computation that underlies many real-world applications. Since graph search computation is about examining the vertex values, there is very little computation on these values once they are loaded from memory. As a result, the speed of graph search is typically limited by memory bandwidth. We will discuss graph data formats that help minimize the consumption of memory bandwidth. We will then introduce work queues, an important class of parallel data structures that supports work-efficient iterative algorithms that require dynamic discovery and collection of the data to be processed. We will show that the privatization technique can be productively used to minimize serialization when collecting data into the work queues.

12.1 Background

A graph data structure represents the relation between entities. For example, in social media, the entities are users and the relations are connections between users. For another example, in map driving direction applications, the entities are locations and the relations are the roadways between them. Some relations are bi-directional, such as friend connections in a social network. Other relations are directional. For example, roads may be one-way streets. We will focus on directional relations; bi-directional relations can be represented with two directional relations, one in each direction. A directional relation is represented as an arrowed edge going from a source vertex to a destination vertex.

Fig. 12.1 shows a simple graph with directional edges. We assign a unique number to each vertex. There is one edge going from vertex 0 to vertex 1 and one going from vertex 0 to vertex 2. For a driving direction application, we may need to find all the alternative routes that we could take going from the location represented by vertex 0 to that represented by vertex 5. By visual inspection, we see that there are three possible paths: 0→1→3→4→5, 0→1→4→5, and 0→2→5.

image
Figure 12.1 A simple graph with directional edges.

An intuitive representation of a graph is an adjacency matrix. We assign a unique number to each vertex. When there is an edge going from vertex i to vertex j, the value of element A[i][j] of the adjacency matrix is 1. Otherwise, it is 0. Fig. 12.2 shows the adjacency matrix for the simple graph in Fig. 12.1. We see that A[1][3] and A[4][5] are 1 since there is an edge going from vertex 1 to vertex 3 and an edge going from vertex 4 to vertex 5. For clarity, we leave the 0 values out of the adjacency matrix. That is, if an element is empty, its value is 0.

image
Figure 12.2 Adjacency matrix representation of the simple graph example.

If a graph with N vertices is fully connected, each vertex has (N-1) outgoing edges, and there are a total of N(N-1) edges, since there is no edge going from a vertex to itself. For example, if our 9-vertex graph were fully connected, there would be eight edges going out of each vertex, for a total of 72 edges. Obviously, our graph is much less connected; each vertex has three or fewer outgoing edges. Such a graph is referred to as being sparsely connected. That is, the average number of outgoing edges from each vertex is much smaller than N-1.

At this point, the reader has most likely made the correct observation that sparsely connected graphs can probably benefit from a sparse matrix representation. Indeed, many real-world graphs are sparsely connected. For example, in a social network such as Facebook, Twitter, or LinkedIn, the average number of connections for each user is much smaller than the total number of users. This makes the number of non-zero elements in the adjacency matrix much smaller than the total number of elements. As we have seen in Chapter 10, Parallel patterns: sparse matrix computation, using a compressed representation such as Compressed Sparse Row (CSR) can drastically reduce the amount of storage for and the number of wasted operations on the zero elements.

Fig. 12.3 shows a CSR representation of our simple graph example. We will refer to the row pointer array as the edges array. Recall that each row pointer gives the starting location for the non-zero elements in a row. For example, edges[3]=7 gives the starting location of the non-zero elements in row 3 of the original adjacency matrix. Also, edges[4]=9 gives the starting location of the non-zero elements in row 4 of the original matrix. Thus, we expect to find the non-zero data for row 3 in data[7] and data[8] and the column indices for these elements in destination[7] and destination[8]. These are the data and column indices for the two edges leaving vertex 3. The reason we call the column index array destination is that the column index of an element in the adjacency matrix gives the destination of the represented edge. In our example, we see that the destinations of the two edges from source vertex 3 are destination[7]=4 and destination[8]=8.

image
Figure 12.3 Sparse matrix (CSR) representation of adjacency matrix.

Obviously, the data array is unnecessary. Since the value of all its elements is 1, we really don’t need to store it. We can make the data implicit—whenever one of the non-zero element values is used, we can just assume it is 1. That is, the existence of each column index in the destination array implies that an edge does exist. However, in some applications, the adjacency matrix may store additional information about the relationship, such as the distance between two locations or the date when two social network users became connected. In those applications, the data array will need to be used.

Sparse representation can lead to significant savings in storing the adjacency matrix. For our example, assuming that the data array can be eliminated, the CSR representation requires storage for 25 locations versus the 9² = 81 locations if we stored the entire adjacency matrix. For real-life problems where a very small fraction of the adjacency matrix elements are non-zero, the savings can be tremendous.
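As a concrete point of reference, the CSR arrays below are spelled out in C. They are reconstructed from the element values quoted throughout this chapter (e.g., edges[2]=4, edges[3]=7, edges[4]=9, destination[7]=4, destination[8]=8), so they should be consistent with Fig. 12.3, although the ordering within each row is an assumption and the figure remains the authoritative source.

// CSR representation of the 9-vertex example graph, reconstructed from the
// values cited in the text. The data array is omitted since all edges have
// the implicit value 1.
#define NUM_VERTICES 9
#define NUM_EDGES 15

// edges[v] gives the starting index in dest[] of the outgoing edges of
// vertex v; vertex v has edges[v+1] - edges[v] outgoing edges.
int edges[NUM_VERTICES + 1] = {0, 2, 4, 7, 9, 11, 12, 13, 15, 15};

// dest[e] gives the destination vertex of edge e.
int dest[NUM_EDGES] = {
    1, 2,       // edges leaving vertex 0
    3, 4,       // edges leaving vertex 1
    5, 6, 7,    // edges leaving vertex 2
    4, 8,       // edges leaving vertex 3
    5, 8,       // edges leaving vertex 4
    6,          // edge leaving vertex 5
    8,          // edge leaving vertex 6
    0, 6        // edges leaving vertex 7 (vertex 8 has no outgoing edges)
};

// Total storage: (9 + 1) + 15 = 25 locations, versus 9*9 = 81 locations
// for the full adjacency matrix.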

12.2 Breadth-First Search

An important graph computation is breadth-first search (BFS). BFS is often used to discover the shortest number of edges that one needs to take in order to go from one vertex to another vertex of the graph. There are several forms of BFS. Each form derives a different type of result but one can typically derive the result of one form from that of another.

A simple form of BFS, given a vertex referred to as the source, labels each vertex with the smallest number of edges that one needs to traverse in order to go from the source to that vertex.

Fig. 12.4(A) shows the desired BFS result with vertex 0 as the source. Through one edge, we can get to vertices 1 and 2. Thus, we mark these vertices as level 1. By traversing another edge, we can get to vertices 3 (through vertex 1), 4 (through vertex 1), 5 (through vertex 2), 6 (through vertex 2) and 7 (through vertex 2). Thus we mark these vertices as level 2. Finally, by traversing one more edge, we can get to vertex 8 (through any of vertices 3, 4, or 6), so we mark vertex 8 as level 3. Obviously, the BFS result with another vertex as the source, say vertex 2, would be quite different.

image
Figure 12.4 Breadth-first search results. (A) Vertex 0 is source, (B) vertex 2 is source.

Fig. 12.4(B) shows the desired result of BFS with vertex 2 as the source. The level 1 vertices are 5, 6, and 7. The level 2 vertices are 8 (through vertex 6) and 0 (through vertex 7). Only vertex 1 is at level 3 (through vertex 0). Finally, the level 4 vertices are 3 and 4 (both through vertex 1). It is interesting to note that the levels of the vertices are quite different even though we moved the source to a vertex that is only one edge away from the original source.

Once we have all the vertices labeled, we can easily find a path from the source to any of the vertices in terms of the number of edges traveled. For example, in Fig. 12.4(B), we see that vertex 1 is labeled as level 3. So we know that the smallest number of edges between the source (vertex 2) and vertex 1 is 3. If we need to find the path, we can simply start from the destination vertex and trace back to the source. At each step, we select the predecessor whose level is one less than the current vertex. If there are multiple predecessors with the same level, we can randomly pick one. Any one thus selected would give a sound solution. The fact that there are multiple predecessors to choose from means that there are multiple equally good solutions to the problem. In our example, we can find a shortest path from vertex 2 to vertex 1 by starting from vertex 1, choosing vertex 0, then vertex 7, and then vertex 2. Therefore a solution path is 2→7→0→1. This of course assumes that each vertex has a list of pointers to the source vertices of all the incoming edges so that one can find the predecessors of a given vertex.
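The back-tracing procedure can be sketched in C as follows. The in_edges and in_source arrays are hypothetical names for a CSR-style list of the incoming edges of each vertex (the predecessor lists that the procedure assumes); they are not part of the format in Fig. 12.3.

// Trace one shortest path from dest_vertex back to the source, assuming
// label[] holds valid BFS levels and (in_edges, in_source) describe the
// incoming edges of each vertex in CSR style (hypothetical arrays).
// Returns the number of vertices written to path[]; path[0] is dest_vertex
// and the last entry is the source (level 0).
int trace_back(int dest_vertex, int *label, int *in_edges, int *in_source,
               int *path) {
    int count = 0;
    int current = dest_vertex;
    path[count++] = current;
    while (label[current] > 0) {
        // Pick any predecessor whose level is exactly one less than ours.
        for (int i = in_edges[current]; i < in_edges[current + 1]; i++) {
            int pred = in_source[i];
            if (label[pred] == label[current] - 1) {
                current = pred;
                break;
            }
        }
        path[count++] = current;
    }
    return count;
}

Applied to the example above, starting from vertex 1 with vertex 2 as the source, this sketch could produce path[] = {1, 0, 7, 2}, which read backwards is the solution path 2→7→0→1.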

Fig. 12.5 shows an important application of BFS in computer-aided design (CAD). When designing an integrated circuit chip, there are many electronic components that need to be connected to complete the design. The connectors of these components are called net terminals. Fig. 12.5(A) shows two such net terminals as red dots, one belonging to a component in the upper left part of the chip and the other to a component in the lower right part. Assume that the design requires that these two net terminals be connected. This is done by running, or routing, a wire of a given width from the first net terminal to the second net terminal.

image
Figure 12.5 Maze routing in integrated circuits—an application for breadth-first search. (A) Breadth-first search, (B) identifying a routing path.

The routing software represents the chip as a grid of wiring blocks where each block can potentially serve as a piece of a wire. A wire can be formed by extending in either the horizontal or the vertical direction. For example, the black J-shape in the lower half of the chip consists of 21 wiring blocks and connects three net terminals. Once a wiring block is used as part of a wire, it can no longer be used as part of any other wires. Furthermore, it forms a blockage for wiring blocks around it. No wires can be extended from a used block’s lower neighbor to its upper neighbor, or from its left neighbor to its right neighbor, etc. Once a wire is formed, all other wires must be routed around it. Routing blocks can also be occupied by circuit components, which impose the same blockage constraint as when they are used as part of a wire. This is why the problem is called a maze routing problem. The previously formed circuit components and wires form a maze for the wires that are yet to be formed. The maze routing software finds a route for each additional wire given all the constraints from the previously formed components and wires.

The maze routing application represents the chip as a graph. The routing blocks are vertices. An edge from vertex i to vertex j indicates that one can extend a wire from block i to block j. Once a block is occupied by a wire or a component, it is either marked as a blockage vertex or taken away from the graph, depending on the design of the application. Fig. 12.5 shows that the application solves the maze routing problem with a BFS from the source net terminal to the destination net terminal. This is done by starting with the source vertex and labeling the vertices into levels. The immediate vertical or horizontal neighbors (a total of four) that are not blockages are marked as level 1. We see that all four neighbors of the source are reachable and will be marked as level 1. The neighbors of level 1 vertices that are neither blockages nor visited by the current search will be marked as level 2. The reader should verify that there are 4 level-1 vertices, 8 level-2 vertices, and 12 level-3 vertices, etc. in Fig. 12.5(A). As we can see, the BFS essentially forms a wave front of vertices for each level. These wave fronts start small for level 1 but can grow very large very quickly in a few levels.

Fig. 12.5(B) shows that once the BFS is complete, we can form a wire by finding a shortest path from the source to the destination. As we explained earlier in this chapter, this can be done by starting with the destination vertex and tracing back to the predecessors whose levels are one lower than the current vertex. Whenever there are multiple predecessors that have equivalent levels, there are multiple routes that are of the same length. One could design heuristics to choose the predecessor in such a way that minimizes the difficulty of constraints for wires that are yet to be formed.

12.3 A Sequential BFS Function

We are now ready to write a sequential breadth-first search function. We assume that the graph is represented in the CSR format shown in Fig. 12.3. The function receives the index of the source vertex, the edges (edges) array, and the destination (dest) array for the graph. Furthermore, it receives a label array whose elements will be used to store the visit status information for the vertices.

Before the search, the label element for the source is initialized to 0, indicating that it is a level 0 vertex. All other label elements are initialized to -1, indicating that their associated vertices have not been visited. At the end of the search, all label array elements corresponding to vertices reachable from the source should be set to a positive level number. If the label array element of any vertex remains -1 after the search, it means that the vertex is unreachable from the source.

Fig. 12.6 shows a sequential implementation of the BFS function. It maintains two frontier arrays: one stores the frontier vertices discovered in the previous iteration (previous frontier), and one stores the frontier vertices that are being discovered in the current iteration (current frontier). These arrays are declared as frontier[0][MAX_FRONTIER_SIZE] and frontier[1][MAX_FRONTIER_SIZE]. The roles of these two arrays alternate. During the first iteration, frontier[0] stores the current frontier and frontier[1] stores the previous frontier, which contains only the source vertex. During the second iteration, the two arrays exchange their roles: frontier[0] stores the previous frontier and frontier[1] stores the current frontier. That is, what is being assembled as the current frontier in one iteration becomes the previous frontier in the next iteration. This way, one of the arrays holds the stable frontier formed during the previous iteration while the other one’s contents are being assembled. By switching the roles of these two arrays, we avoid the need for copying the contents from a current frontier array to a previous frontier array when we move to the next iteration. This technique is commonly called ping-pong buffering (also known as double buffering).

image
Figure 12.6 A sequential breadth-first search function.

The function assumes that all label array elements are initialized to -1 by the caller. At the beginning of the function, the label[source] element is initialized to 0, indicating that the source is the level 0 vertex for the search. The function maintains a pointer variable c_frontier that points to the beginning of the current frontier array and another pointer variable p_frontier that points to the beginning of the previous frontier array. At the beginning of the function, c_frontier is initialized so that it points to frontier[0] and p_frontier to frontier[1]. The function also maintains two tail indices. The p_frontier_tail variable indicates the number of elements that have been inserted into the previous frontier array. The c_frontier_tail variable stores the index of the position at which a newly discovered frontier vertex can be inserted in the current frontier array. It also indicates the number of frontier vertices that have been inserted into the current frontier array thus far.

Before the first iteration, the source vertex is inserted into the previous frontier. The insert_frontier function will place the source into p_frontier[0] and increment the p_frontier_tail variable to 1. This makes the source the only vertex in the previous frontier array for processing in the first iteration.

Note that there is no easy way to determine, before entering the while-loop, the number of iterations that it will take. Even with the same number of vertices and edges, some graphs will have more levels and others will have fewer. In fact, some vertices may even be unreachable from the source, making it inappropriate to use a test such as “all vertices have been visited” as a termination condition. The only reliable way to detect that all levels have been discovered is that no new frontier vertex was discovered in the current iteration. This is why the while-loop tests whether p_frontier_tail > 0 before entering the next iteration.
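The following C sketch reconstructs the function from the description above; the names MAX_FRONTIER_SIZE and insert_frontier follow the text, and the details may differ slightly from the code in Fig. 12.6.

#define MAX_FRONTIER_SIZE 9  // assumed; must be at least the largest frontier

void insert_frontier(int vertex, int *frontier, int *frontier_tail) {
    frontier[*frontier_tail] = vertex;
    (*frontier_tail)++;
}

void BFS_sequential(int source, int *edges, int *dest, int *label) {
    int frontier[2][MAX_FRONTIER_SIZE];
    int *c_frontier = &frontier[0][0];   // current frontier being assembled
    int *p_frontier = &frontier[1][0];   // previous (stable) frontier
    int c_frontier_tail = 0, p_frontier_tail = 0;

    insert_frontier(source, p_frontier, &p_frontier_tail);
    label[source] = 0;                   // the source is the level-0 vertex

    while (p_frontier_tail > 0) {
        for (int f = 0; f < p_frontier_tail; f++) {       // visit previous frontier
            int c_vertex = p_frontier[f];
            for (int i = edges[c_vertex]; i < edges[c_vertex + 1]; i++) {
                if (label[dest[i]] == -1) {                // not yet visited
                    insert_frontier(dest[i], c_frontier, &c_frontier_tail);
                    label[dest[i]] = label[c_vertex] + 1;
                }
            }
        }
        // Swap the roles of the two frontier arrays (ping-pong buffering).
        int *temp = c_frontier;
        c_frontier = p_frontier;
        p_frontier = temp;
        p_frontier_tail = c_frontier_tail;
        c_frontier_tail = 0;
    }
}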

We will use the example in Fig. 12.4(B) to illustrate the design of the while-loop, which implements the iterative process for labeling the vertices. The outer for-loop iterates through all the vertices in the previous frontier array. For the first iteration, there is only one vertex in the previous frontier array, the source. In our example, it is vertex 2. This means that the outer for-loop will only iterate once for the first iteration of the while-loop. During this only iteration of the outer for-loop, we first assign the value of p_frontier[0] (which is 2) to c_vertex.

We will then identify all the edges that go from c_vertex to its neighbors. As we have shown in Fig. 12.3, these edges are in a dest array section that starts at index edges[c_vertex] and ends at the location edges[c_vertex+1]-1. In our example edges[2] has value 4 and edges[3]-1 has value 6. This means that the edges for vertex 2 can be found in dest[4], dest[5], and dest[6]. The inner for-loop will iterate through these three edges.

For each edge, the if-statement checks if the destination of the edge has been visited. If the label value of the destination is still -1, it has not been visited before and a new vertex has been discovered for the current frontier. The code inside the if-statement inserts the destination into the current frontier array. It marks the destination as one level higher than the c_vertex. For our example, since vertex 2 is at level 0, the destination vertices dest[4] (vertex 5), dest[5] (vertex 6), and dest[6] (vertex 7) will all be labeled as level 1 at the end of the inner for-loop. This is indeed the correct result for these three vertices according to Fig. 12.4(B).

In our example, since vertex 2 is the only vertex in the p_frontier array during the first iteration of the while-loop, the outer for-loop will not iterate beyond its first iteration, and we are at the end of the first iteration of the while-loop. The c_frontier array contains the three new frontier vertices 5, 6, and 7. The code at the end of the while-loop swaps the roles of the two frontier arrays. It copies the c_frontier_tail value (3) to p_frontier_tail, indicating that there are three vertices in the p_frontier array for the next iteration of the while-loop. It then resets c_frontier_tail to 0, effectively emptying the c_frontier array for use by the next while-loop iteration.

During the next iteration of the while-loop, the outer for-loop will execute three iterations, one for each of the previous frontier vertices 5, 6, and 7. The inner for-loop instances for these three vertices are more interesting. The if-statement of the inner-loop iteration for vertex 5 will discover that the destination of the only edge leaving vertex 5, vertex 6, has been visited in the previous while-loop iteration; its label is 1. Thus, no further action will be taken for this edge. The reader should verify that one of the edges from vertex 7 requires action (to vertex 0) and the other one does not (to vertex 6).

12.4 A Parallel BFS Function

When it comes to parallelizing BFS, there are a few options. For example, Harish and Narayanan propose a parallelization in which each thread is assigned to a vertex and all vertices are visited during each iteration [HN 2007]. If any of the sources of the incoming edges of a vertex became visited in the previous iteration, the vertex will be marked as visited in the current iteration. The amount of work done is proportional to V*L, where V is the total number of vertices in the graph and L is the number of levels of the search results. For large graphs, the number of levels can be quite high and the work efficiency of the algorithm can be very low, causing the parallel code to run slower than the sequential code.
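The following CUDA sketch conveys the spirit of this vertex-centric approach, though it is not the authors' exact formulation: it uses the outgoing-edge CSR arrays from this chapter rather than incoming edges, and the kernel name and the done flag (used by the host to detect termination) are illustrative assumptions.

// One thread per vertex; the host launches this kernel once per level.
// Every vertex is examined in every iteration, which is why the total
// work is proportional to V * L.
__global__ void BFS_vertex_parallel(int level, int num_vertices, int *edges,
                                    int *dest, int *label, int *done) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v < num_vertices && label[v] == level) {
        for (int i = edges[v]; i < edges[v + 1]; i++) {
            if (label[dest[i]] == -1) {
                label[dest[i]] = level + 1;  // benign race: all writers store level + 1
                *done = 0;                   // a vertex was newly labeled; keep iterating
            }
        }
    }
}

The host would set the done flag to 1 before each launch, increment level after each launch, and stop once the flag comes back still set to 1.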

One can design a parallel BFS algorithm that has work efficiency comparable to the sequential algorithm. Luo et al. propose to parallelize each iteration of the while-loop in Fig. 12.6 by having multiple threads collaboratively process the previous frontier array and assemble the current frontier array [LWH 2010]. This effectively parallelizes the outer for-loop in Fig. 12.6. We will pursue this direction in the current section. In the next section, we will examine optimization strategies to enhance the performance of kernels produced with this strategy.

A straightforward strategy for parallelizing each iteration of the while-loop is to assign a section of the previous frontier array to each thread block. Fig. 12.7 shows a sketch of the changes that we need to make to the sequential BFS_sequential function so that it can properly launch a CUDA kernel to perform the main activities of each iteration of the while-loop in parallel. Basically, the function needs to allocate device global memory versions of the edges, dest, and label arrays. The pointers to these device global memory versions will be called edges_d, dest_d, and label_d. The contents of these arrays also need to be copied from host to device using cudaMemcpy.

image
Figure 12.7 A sketch of the BFS host code function.

The kernel in Fig. 12.8 declares an extra array visited (compared to the sequential code) to track whether a vertex has participated in a frontier. The reason for using this new array is that we will be using atomic operations on its elements, and it is much simpler if the value of each element is limited to 0 or 1. The label array elements need to track the level information, which makes them more complicated to use with atomic operations. It is more convenient to separate visit marking (visited) from the level information (label).

image
Figure 12.8 A parallel BFS kernel based on block-level privatized queues.

The host code then allocates the frontier_d array in the device global memory. Note that there is no need for the host to maintain a copy of the frontier array since it will only be accessed by the device. The c_frontier_d and p_frontier_d pointers will point to either the first half or the second half of frontier_d. Initially, the host code initializes c_frontier_d to point to the first half and p_frontier_d to the second half. Their roles will swap at the end of each while-loop iteration. The host also needs to allocate the tail variables in the device global memory. The pointers to these variables will be c_frontier_tail_d and p_frontier_tail_d.

The host code then needs to launch a simple kernel to initialize all visited_d elements to 0 except the source element, which is set to 1; the c_frontier_tail_d variable to 0; p_frontier_d[0] to source; the p_frontier_tail_d variable to 1; and label_d[source] to 0. After all this work, the device is set up to execute the main activities of each while-loop iteration in parallel. Thus, the bulk of the code in the while-loop is replaced with a kernel launch and a call to cudaMemcpy() to read the value of the total number of vertices in the newly discovered frontier. This value will be used to determine whether the current iteration has made any progress and the while-loop should be allowed to continue.
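A condensed sketch of this host code is shown below, reconstructed from the description; BLOCK_SIZE, the setup kernel (omitted), and the exact kernel signature are assumptions that may differ from Fig. 12.7.

#define BLOCK_SIZE 256  // assumed thread-block size

void BFS_host(int source, int *edges, int *dest, int *label,
              int num_vertices, int num_edges) {
    int *edges_d, *dest_d, *label_d, *visited_d, *frontier_d;
    int *c_frontier_tail_d, *p_frontier_tail_d;

    // Allocate device versions of the arrays and copy the graph over.
    cudaMalloc(&edges_d, (num_vertices + 1) * sizeof(int));
    cudaMalloc(&dest_d, num_edges * sizeof(int));
    cudaMalloc(&label_d, num_vertices * sizeof(int));
    cudaMalloc(&visited_d, num_vertices * sizeof(int));
    cudaMalloc(&frontier_d, 2 * num_vertices * sizeof(int));
    cudaMalloc(&c_frontier_tail_d, sizeof(int));
    cudaMalloc(&p_frontier_tail_d, sizeof(int));
    cudaMemcpy(edges_d, edges, (num_vertices + 1) * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dest_d, dest, num_edges * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(label_d, label, num_vertices * sizeof(int), cudaMemcpyHostToDevice);

    // The two halves of frontier_d serve as the current and previous frontiers.
    int *c_frontier_d = &frontier_d[0];
    int *p_frontier_d = &frontier_d[num_vertices];

    // A simple setup kernel (not shown) initializes visited_d, label_d[source],
    // p_frontier_d[0] = source, *p_frontier_tail_d = 1, and *c_frontier_tail_d = 0.

    int p_frontier_tail = 1;
    while (p_frontier_tail > 0) {
        int num_blocks = (p_frontier_tail + BLOCK_SIZE - 1) / BLOCK_SIZE;
        BFS_Bqueue<<<num_blocks, BLOCK_SIZE>>>(p_frontier_d, p_frontier_tail_d,
            c_frontier_d, c_frontier_tail_d, edges_d, dest_d, label_d, visited_d);
        // Read back how many vertices were discovered in this iteration.
        cudaMemcpy(&p_frontier_tail, c_frontier_tail_d, sizeof(int),
                   cudaMemcpyDeviceToHost);
        // Swap the roles of the frontier arrays and their tail variables.
        int *t = c_frontier_d; c_frontier_d = p_frontier_d; p_frontier_d = t;
        t = c_frontier_tail_d; c_frontier_tail_d = p_frontier_tail_d; p_frontier_tail_d = t;
        cudaMemset(c_frontier_tail_d, 0, sizeof(int));  // empty the new current frontier
    }
    cudaMemcpy(label, label_d, num_vertices * sizeof(int), cudaMemcpyDeviceToHost);
    // ... cudaFree the device allocations ...
}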

A kernel based on this strategy is shown in Fig. 12.8. The threads divide the section in an interleaved manner to enable coalesced memory access to the p_frontier array. This is shown as the statement that accesses p_frontier at the beginning of the kernel. As each thread processes a vertex in the p_frontier array, it inserts or writes the unvisited neighbors of the vertex into the c_frontier array. This is shown in the first for-loop in Fig. 12.8. Once all threads complete their processing of the p_frontier array, the c_frontier array will contain all the vertices of the new frontier and will become the p_frontier array for the next iteration of the while-loop.

The for-loop that visits each neighbor of a thread’s assigned frontier vertex looks similar to the inner for-loop in Fig. 12.6. However, there is a slight but important difference in terms of their execution efficiency. Each of the outer for-loop iterations in Fig. 12.6 processes the neighbors for one frontier vertex. It is very possible that frontier vertices have common neighbors. For example, in Fig. 12.4(A), vertices 3, 4, and 6 are all in the level-2 frontier and they have a common neighbor vertex 8. The outer for-loop iterations in Fig. 12.6 are executed sequentially. In general, we are referring to the situation where two frontier vertices A and B have a common neighbor and the neighbor has not been visited so far. Let us assume that the outer for-loop iteration that processes A is executed first. The neighbor will be marked as visited as a result of processing A. When B is processed in a later iteration, it will find the neighbor marked as visited so it will not mark it again. In our example, during the processing of level-2 frontier of Fig. 12.4(A), assume that vertex 3 is processed first. Vertex 8 will be marked as visited (level 3) and will be inserted into the c_frontier array. When vertices 4 and 6 are subsequently processed, they will find vertex 8 already visited so they will not insert it into the c_frontier.

In the parallel kernel, the frontier vertices are processed by threads that execute in parallel. Since the global memory writes performed by a thread are not guaranteed to be visible to other threads without kernel termination or a memory fence, the threads may not see the marks made by each other. In our example, the threads that process vertices 3, 4, and 6 all execute in parallel. They may or may not be able to see the marks made by each other. So, each of them will likely mark vertex 8 as level 3 and insert it into the c_frontier. As a result, a vertex could appear multiple times in the c_frontier. This is harmless in terms of correctness. The threads that process these redundant copies of the frontier vertices will take the same actions and will not affect the final execution result. However, there could be a significant amount of such redundant processing for large graphs.

In order to avoid generating redundant copies of frontier vertices, we use atomic operations to mark and check the visit status of vertices in Fig. 12.8. The kernel uses a visited array to track whether a vertex has been visited. Each thread first uses an atomic operation to check whether each destination of its assigned vertex still needs to be visited. Keep in mind that the atomic operations performed by one thread block are visible to all other thread blocks. This way, if a vertex is the destination of multiple vertices in the current frontier, only one thread will succeed in the check and the destination vertex will be entered into the c_frontier array only once.
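One way to express this check, used in the kernel sketch later in this section, is with atomicExch, which atomically sets the visited flag to 1 and returns its old value; only the first thread to reach an unvisited vertex sees a 0. The variable names below follow the surrounding discussion and are illustrative.

int was_visited = atomicExch(&visited[dest[i]], 1);  // atomically mark as visited
if (!was_visited) {
    // Only the thread that flipped the flag from 0 to 1 reaches this point,
    // so it alone labels the vertex and inserts it into the current frontier.
    label[dest[i]] = label[my_vertex] + 1;
    // ... insert dest[i] into the queue ...
}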

There are three important considerations with respect to writing vertices into the c_frontier array. First, the vertices written by a thread during the current iteration of the while-loop will likely need to be processed by another thread in another block during the next iteration of the while-loop. Recall that a write to global memory by a thread is not guaranteed to be visible to threads in other blocks without a kernel termination/relaunch or a memory fence. As a result, we will terminate the kernel at the end of each while-loop iteration and relaunch the kernel for the next iteration of the while-loop.

Second, since the threads would be simultaneously inserting vertices into the c_frontier array, they need to use atomic operations when they perform read-modify-write on the c_frontier_tail variable to ensure the integrity of updates to the variable.

Third, for each previous frontier vertex, a thread will likely write multiple vertices into the c_frontier array. This would likely create a global memory write pattern that cannot be coalesced. We will use a privatized buffer in the shared memory to assemble the contributions of the threads in a block, and have the threads write the contents of the shared memory buffer into the global memory in a coalesced manner at the end of the kernel. We will call this privatized buffer a block-level queue. We will also need to create a privatized c_frontier_tail_s variable in the shared memory for insertion into the block-level queue.

In Fig. 12.8, the block-level queue is declared as a shared memory array c_frontier_s. Insertions into c_frontier_s are made through the shared memory variable c_frontier_tail_s. Thread 0 initializes the value of c_frontier_tail_s to 0 while all other threads wait for this initialization at the __syncthreads() barrier. In the first for-loop, each thread inserts each newly found neighbor into the c_frontier_s array. This is done by performing an atomic operation on the c_frontier_tail_s variable and writing the neighbor into the c_frontier_s array location whose index is the old c_frontier_tail_s value returned by the atomic operation. In the case where the block-level queue overflows, the remaining entries are stored directly in the c_frontier array.

The total number of new frontier vertices found by all threads in the block is given by the final value of c_frontier_tail_s. We use the if-statement to identify thread 0 to reserve a section in the global c_frontier array by performing an atomic operation on c_frontier_tail. The atomic operation will return the beginning index of the reserved section. It will increase the c_frontier_tail value by the total number of vertices to be written into the section. Thus, the next block will start its section at the location indexed by the new c_frontier_tail value. This is illustrated by the bottom part of Fig. 12.9.

image
Figure 12.9 Block-level queue (b-queue) contents are copied into the global queue (g-queue) at the end of the kernel in a coalesced manner.

The second for-loop in Fig. 12.8 implements the coalesced writes to the global c_frontier array. During each iteration, each thread will write one element of the c_frontier_s array into c_frontier array. We design the indexing scheme so that adjacent threads will write adjacent locations in the c_frontier array. All threads will iterate until they have collectively completed writing all the contents of the c_frontier_s array.
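Putting these pieces together, the kernel below is a sketch along the lines of Fig. 12.8, written from the description in this section; the BQ_CAPACITY constant, the overflow handling details, and some variable names are assumptions that may differ from the actual figure.

#define BQ_CAPACITY 1024  // assumed capacity of the block-level queue

__global__ void BFS_Bqueue(int *p_frontier, int *p_frontier_tail,
                           int *c_frontier, int *c_frontier_tail,
                           int *edges, int *dest, int *label, int *visited) {
    __shared__ int c_frontier_s[BQ_CAPACITY];          // block-level queue
    __shared__ int c_frontier_tail_s, our_c_frontier_tail;
    if (threadIdx.x == 0) c_frontier_tail_s = 0;
    __syncthreads();

    // Threads divide the previous frontier in an interleaved (coalesced) manner.
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < *p_frontier_tail) {
        const int my_vertex = p_frontier[tid];
        for (int i = edges[my_vertex]; i < edges[my_vertex + 1]; i++) {
            // Atomically mark the neighbor; only one thread sees the old value 0.
            const int was_visited = atomicExch(&visited[dest[i]], 1);
            if (!was_visited) {
                label[dest[i]] = label[my_vertex] + 1;
                const int my_tail = atomicAdd(&c_frontier_tail_s, 1);
                if (my_tail < BQ_CAPACITY) {
                    c_frontier_s[my_tail] = dest[i];     // insert into the b-queue
                } else {                                 // b-queue overflow:
                    c_frontier_tail_s = BQ_CAPACITY;     // store directly in the g-queue
                    const int my_global_tail = atomicAdd(c_frontier_tail, 1);
                    c_frontier[my_global_tail] = dest[i];
                }
            }
        }
    }
    __syncthreads();

    // Thread 0 reserves a section of the global queue for this block's b-queue.
    if (threadIdx.x == 0) {
        our_c_frontier_tail = atomicAdd(c_frontier_tail, c_frontier_tail_s);
    }
    __syncthreads();

    // Copy the b-queue into the reserved section with coalesced writes.
    for (int i = threadIdx.x; i < c_frontier_tail_s; i += blockDim.x) {
        c_frontier[our_c_frontier_tail + i] = c_frontier_s[i];
    }
}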

Since the block-level queue is a performance optimization scheme, falling back to the global queue will not affect correctness. It will likely reduce performance, which is a reasonable tradeoff.

12.5 Optimizations

While we have achieved parallel execution with the BFS_Bqueue kernel in Fig. 12.8, there are several areas of improvement in performance and efficiency. We will go over each of these areas in this section [MG 2012].

Memory Bandwidth

When a thread processes its assigned frontier vertex in the kernel of Fig. 12.8, it accesses two consecutive edges array elements in the global memory in the for-loop, followed by a number of consecutive dest array locations in the global memory. It then accesses a sequence of label array elements that are more or less random, indexed by the dest element values. This means that adjacent threads are not accessing adjacent global memory locations when accessing the edges, dest, and label arrays, thus these accesses are not coalesced. One should perform these accesses through the texture memory. We will leave this as an exercise.

Fig. 12.10 illustrates the global memory access pattern for processing the level-2 frontier vertices in Fig. 12.4(B). The source of the search is vertex 2. The two level-2 frontier vertices are 0 and 8. Let us assume that threads 0 and 1 will process these vertices. The access pattern to the p_frontier array is coalesced. The accesses to the edges array, however, are clearly not coalesced: thread 0 and thread 1 first access edges[0] and edges[8], which are not consecutive locations. They then access edges[1] and edges[9], which again are not consecutive locations.

image
Figure 12.10 Memory access pattern for processing the level-2 frontier in Fig. 12.4(B).

Based on the edges element values, thread 0 will access dest[0] and dest[1] whereas thread 1 will not make any further accesses since vertex 8 does not have any outgoing edges. One can easily imagine that if vertex 8 had any outgoing edges, they would not be in consecutive locations as the ones accessed by thread 0. Thread 0 will access the label array based on dest[0] and dest[1] values. In this example, as shown in Fig. 12.10, it happens to access label[1] and label[2]. In general, it could access locations that are of arbitrary distance away from each other, depending on the shape of the graph and the way the vertices are numbered. Obviously, the accesses to the label array are not coalesced in general. Therefore, accesses to the edges, dest, and label arrays should go through the texture memory.
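On devices of compute capability 3.5 and later, one simple alternative to explicit texture binding is the __ldg() intrinsic, which routes reads through the read-only data (texture) cache. The fragment below is a hedged sketch of the neighbor loop using it; it is safe here only because the label elements of previous-frontier vertices are never written during the same kernel launch.

// Read edges, dest, and the current vertex's label through the read-only cache.
const int my_level = __ldg(&label[my_vertex]);
for (int i = __ldg(&edges[my_vertex]); i < __ldg(&edges[my_vertex + 1]); i++) {
    const int neighbor = __ldg(&dest[i]);
    // ... atomic visited check and insertion as before, labeling with my_level + 1 ...
}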

Hierarchical Queues

The block-level queue c_frontier_s in Figs. 12.8 and 12.9 is an example of a hierarchical queue design. In general, when we have a queue that receives insertion requests from a large number of parallel threads, their atomic operations on the tail variable will likely cause excessive contention and therefore serialization among these threads. Giving each block its private queue significantly reduces the level of contention in queue insertion. The cost is the extra step at the end of the kernel, where the contents of the private queue need to be consolidated into the global queue.

As it turns out, even the block-level queues can suffer heavy contention. This is because all threads in a warp are guaranteed to cause contention when they access their block-level queue. All threads in the same warp execute the same instruction at any point in time. So, all of them will execute the atomic operations at the same time and cause a very high level of contention. Such contention will effectively serialize the execution of the threads in a warp and drastically reduce the execution speed.

The contentions inside each warp can be addressed by adding another level of queues to the hierarchy, as shown in Fig. 12.11. We will call these warp-level queues (w-queues). The number of such warp-level queues is usually a power of two and is a parameter that can be tuned. During kernel execution, we classify threads into the same number of classes as the number of warp-level queues using the least significant bits of their threadIdx.x values. The rationale is that we want to evenly distribute the atomic operations executed by threads in a warp to the warp-level queues.

image
Figure 12.11 The design and consolidation process of w-queue, b-queue, and g-queue.

For example, if we have four warp-level queues, we direct threads according to the two least significant bits of their threadIdx.x values. All threads whose two least significant bits of threadIdx.x are 00 will access warp-level queue 0. Assume that we have 64 threads in a block; the 16 threads directed to warp-level queue 0 are threads 0, 4, 8, 12, …, 56, 60. In this case, there are two warps. In warp 0, 8 of its 32 threads are directed to warp-level queue 0. These threads are 0, 4, 8, 12, 16, 20, 24, and 28. In warp 1, 8 of its 32 threads are also directed to warp-level queue 0. These threads are 32, 36, 40, 44, 48, 52, 56, and 60. The point is that whenever a warp executes an atomic operation, one fourth of its threads will be directed to warp-level queue 0. Similarly, the 16 threads directed to warp-level queue 1 are threads 1, 5, 9, 13, …, 57, 61. Thus, one fourth of the threads of a warp will be directed to warp-level queue 1.
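The classification described above amounts to selecting a warp-level queue with the least significant bits of threadIdx.x. A minimal sketch is shown below, assuming the number of w-queues (called NUM_WQUEUES here) is a power of two and that w_frontier_s and w_frontier_tail_s are hypothetical shared memory arrays holding the w-queues and their tails.

#define NUM_WQUEUES 4  // assumed; a power of two, tunable

const int wq = threadIdx.x & (NUM_WQUEUES - 1);      // e.g., bits 00, 01, 10, 11
const int my_tail = atomicAdd(&w_frontier_tail_s[wq], 1);
w_frontier_s[wq][my_tail] = dest[i];                 // insert into the selected w-queue

With NUM_WQUEUES equal to 4, the 32 threads of a warp spread their atomic operations over four tail variables, cutting the per-queue contention to roughly one fourth.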

At the end of the kernel, we need to first consolidate the warp-level queue contents into the block-level queue, as illustrated in Fig. 12.11. Note that it may be advantageous to use a different warp to copy each warp-level queue's contents into the block-level queue. This part involves significant thread index manipulation and is left as an exercise. We can then consolidate the block-level queue contents into the global queue as shown in Fig. 12.8.

By increasing the number of warp-level queues, we can decrease the level of contention to each warp-level queue. However, there is a cost of having more w-queues. As we increase the number of w-queues, the size of each w-queue becomes smaller. This increases the probability for one of the queues to overflow. The threads should check the overflow condition in a way similar to what we discussed for the block-level queue and redirect any overflowing vertices to the block-level queue. In some cases, the thread may find the block-level queue in an overflow condition and thus need to redirect the vertex to the global queue. We leave the detailed implementation of the BFS kernel with three levels of queue as an exercise.

Kernel Launch Overhead

In most graphs, the frontiers of the first several iterations of a BFS can be quite small. The frontier of the first iteration only has the neighbors of the source. The frontier of the next iteration has all the unvisited neighbors of the current frontier vertices. For these initial iterations, the kernel launch overhead may outweigh the benefit of parallelism. In general, the size of the frontier grows from one iteration to the next by a factor that is the average number of out-going edges per vertex. One way to deal with these initial iterations is to prepare another kernel that is launched with only one thread block. The kernel uses only a block-level queue except for overflow. It implements the initial iterations of the while-loop. Since the block-level queue is in the shared memory, we can use __syncthreads() to ensure that during the next iteration, other threads in the block can use the queue entries prepared by each thread in the current iteration.

Once the frontier reaches a size that overflows the block-level queue, the kernel copies the block-level queue contents to the global queue and returns to the host code. The host code will launch the regular kernel in the subsequent iterations of the while-loop. The specialized kernel eliminates the kernel launch overhead for the initial iterations. We leave the specialized kernel as an exercise.

Load Balance

The amount of work to be done by each thread depends on the connectivity of the vertex assigned to it. In some graphs, such as social network graphs, some vertices (celebrities) may have several orders of magnitude more out-going edges than others. When this happens, one or a few of the threads can take excessively long and slow down the execution of the entire thread grid. This is an extreme example of load imbalance in parallel computing. We can potentially address this by having the threads which encounter vertices that have extremely large number of out-going edges to launch a kernel and use many threads to process the problematic vertices. The mechanism that enables threads to launch new kernels without involving the host is called dynamic parallelism, which will be addressed in Chapter 13, CUDA dynamic parallelism.

12.6 Summary

The graph search pattern is rich with challenges. It is a memory-bound computation with a significant portion of irregular memory accesses. Its input set is dynamic and depends on the data. The collection of the input data for each iteration requires a well-designed hierarchy of queues that are invaluable in many real applications. Its workload varies over time and requires careful design of the kernel and even some specialized kernels.

12.7 Exercises

1. Extend the BFS_Bqueue kernel to check and handle the overflows when threads insert new frontier vertices in the block-level queue.

2. Extend the BFS_Bqueue kernel to use texture memory to access the edges, dest, and label arrays.

3. Extend the BFS_Bqueue kernel to implement the warp-level queue.

4. Write a BFS_small_frontier kernel to implement the first iterations of the search until the frontier grows beyond 1024 vertices.

References

1. Harish, P., & Narayanan, P. J. (2007). Accelerating large graph algorithms on the GPU using CUDA. In: International conference on high-performance computing. India.

2. Luo, L., Wong, M., & Hwu, W. (2010). An effective GPU implementation of breadth-first search. In: ACM/IEEE design automation conference (DAC).

3. Merrill, D., Garland, M., & Grimshaw, A. (2012). Scalable GPU graph traversal. In: Proceedings of the 17th ACM SIGPLAN symposium on principles and practice of parallel programming (PPoPP) (pp. 117–128).
