Chapter 13

CUDA dynamic parallelism

Juan Gómez-Luna and Izzat El Hajj

Abstract

This chapter introduces CUDA dynamic parallelism, an extension to the CUDA programming model that enables a CUDA kernel to create new thread grids by launching new kernels. Dynamic parallelism allows algorithms that dynamically discover new work to prepare and launch kernels without burdening the host or resorting to complex software techniques. The chapter starts with a simple pattern that benefits from dynamic parallelism. It then presents the essential concepts required for the practical use of dynamic parallelism: memory data visibility, device configuration, memory management, synchronization, streams, and events. Finally, it uses two advanced examples, Bezier curve calculation and quadtree construction, to illustrate some of the important subtleties and details that programmers will likely encounter when using dynamic parallelism in real applications.

Keywords

CUDA dynamic parallelism; nested parallelism; device-side kernel launch; parent and child grids; memory data visibility; device configuration; memory management; synchronization; streams; events; Bezier curves; quadtrees; recursion

CUDA dynamic parallelism is an extension to the CUDA programming model that enables a CUDA kernel to create new thread grids by launching new kernels. Dynamic parallelism was introduced with the Kepler architecture, first appearing in the GK110 chip. In previous CUDA systems, kernels could only be launched from the host code. Algorithms that involved recursion, irregular loop structures, time-space variation, or other constructs that do not fit a flat, single level of parallelism had to be implemented with multiple kernel launches, which increases the burden on the host, the amount of host-device communication, and the total execution time. In some cases, programmers resorted to loop serialization and other awkward techniques to support these algorithmic needs at the cost of software maintainability. Dynamic parallelism allows algorithms that dynamically discover new work to prepare and launch kernels without burdening the host or impacting software maintainability. This chapter describes the extended capabilities of the CUDA architecture that enable dynamic parallelism, including the modifications and additions to the CUDA programming model, as well as guidelines and best practices for exploiting this added capability.

13.1 Background

Many real-world applications employ algorithms that either vary the amount of work across space or dynamically vary the amount of work performed over time. As we saw in Chapter 12, Parallel Patterns: Graph Search, the amount of work done when processing each frontier vertex in a graph search can vary dramatically in graphs such as social networks. As another example, Fig. 13.1 shows a turbulence simulation in which the level of required modeling detail varies across both space and time. As the combustion flow moves from left to right, the level of activity and intensity increases. The level of detail required to model the right side of the model is much higher than that for the left side. On one hand, using a fixed fine grid would incur too much work for no gain on the left side of the model. On the other hand, using a fixed coarse grid would sacrifice too much accuracy on the right side of the model. Ideally, one should use fine grids for the parts of the model that require more detail and coarse grids for those that do not.

image
Figure 13.1 Fixed versus dynamic grids for a turbulence simulation model.

Previous CUDA systems require all kernels to be launched from host code. The amount of work done by a thread grid is predetermined at kernel launch. With the SPMD programming style for the kernel code, it is tedious, if not extremely difficult, to have thread-blocks use different grid spacings. This limitation favors the use of a fixed grid system. In order to achieve the desired accuracy, such a fixed grid approach, as illustrated in the upper right portion of Fig. 13.1, typically needs to accommodate the most demanding parts of the model and performs unnecessary extra work in parts that do not require as much detail.

A more desirable approach is shown as the dynamic, variable grid in the lower right portion of Fig. 13.1. As the simulation algorithm detects fast-changing simulation quantities in some areas of the model, it refines the grid in those areas to achieve the desired level of accuracy. Such refinement does not need to be done for the areas that do not exhibit such intensive activity. This way, the algorithm can dynamically direct more computational work to the areas of the model that benefit from the additional work.

Fig. 13.2 shows a conceptual comparison of behavior between a system without dynamic parallelism and one with dynamic parallelism with respect to the simulation model in Fig. 13.1. Without dynamic parallelism, the host code must launch all kernels. If new work is discovered during the execution of a kernel, such as the need to refine the grid in an area of the model, the kernel must terminate, report back to the host code, and have the host launch a new kernel. This is illustrated in Fig. 13.2(A), where the host launches a wave of kernels, receives information from these kernels after their termination, and launches the next level of kernels for any new work discovered by the completed kernels.

image
Figure 13.2 Kernel launch patterns for algorithms with dynamic work variation, with and without dynamic parallelism.

Fig. 13.2(B) shows that with dynamic parallelism, the threads that discover new work can simply launch kernels to do the work. In our example, when a thread discovers that an area of the model needs to be refined, it can launch a kernel to perform the computation step on the refined grid area without the overhead of terminating the kernel, reporting back to the host, and having the host launch new kernels.

13.2 Dynamic Parallelism Overview

From the perspective of programmers, dynamic parallelism means that they can write a kernel launch statement in a kernel. In Fig. 13.3, the main function (host code) launches three kernels, A, B, and C. These are kernel launches in the original CUDA model. What is different is that one of the kernels, B, launches three kernels X, Y, and Z. This would have been illegal in previous CUDA systems.

image
Figure 13.3 A simple example of a kernel (B) launching three kernels (X, Y, and Z).

The syntax for launching a kernel from a kernel is the same as that for launching a kernel from host code (a brief sketch follows the parameter descriptions below):

kernel_name<<< Dg, Db, Ns, S >>>([kernel arguments])

• Dg is of type dim3 and specifies the dimensions and size of the grid.

• Db is of type dim3 and specifies the dimensions and size of each thread-block.

• Ns is of type size_t and specifies the number of bytes of shared memory that is dynamically allocated per thread-block for this call, which is in addition to the statically allocated shared memory. Ns is an optional argument that defaults to 0.

• S is of type cudaStream_t and specifies the stream associated with this call. The stream must have been allocated in the same thread-block where the call is being made. S is an optional argument that defaults to 0. Streams are discussed in more detail in Chapter 18.
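For concreteness, the following is a minimal sketch of a device-side launch; the kernel names, dimensions, and per-element work are illustrative and not taken from the text. Device-side launches require compiling with relocatable device code (e.g., nvcc -rdc=true).

__global__ void childKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;                 // stands in for real per-element work
}

__global__ void parentKernel(float* data, int n) {
    if (threadIdx.x == 0 && blockIdx.x == 0) {  // launch from a single thread
        dim3 Dg((n + 255) / 256);               // grid dimensions (Dg)
        dim3 Db(256);                           // block dimensions (Db)
        // Ns and S are omitted and default to 0: no extra dynamic shared
        // memory, and the block's default NULL stream.
        childKernel<<<Dg, Db>>>(data, n);
    }
}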

13.3 A Simple Example

In this section, we provide a simple example of coding in each of two styles – first, in the original CUDA style, and second, in the dynamic parallelism style. The example is based on a hypothetical parallel algorithm that does not compute useful results, but provides a conceptually simple computational pattern that recurs in many applications. It serves to illustrate the difference between the two styles and how one can use the dynamic parallelism style to extract more parallelism while reducing control flow divergence when the amount of work done by each thread in an algorithm can vary dynamically.

Fig. 13.4 shows a simple example kernel coded without dynamic parallelism. In this example, each thread of the kernel performs some computation (line 05) then loops over a list of data elements it is responsible for (line 07), and performs another computation for each data element (line 08).

image
Figure 13.4 A simple example of a hypothetical parallel algorithm coded in CUDA without dynamic parallelism.
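Because Fig. 13.4 appears here only as a placeholder, the following is a minimal sketch of the kernel structure it describes. The function and parameter names are illustrative, and the line numbers cited in the text refer to the figure, not to this sketch.

__device__ void doSomeWork(float x) { }   // stands in for the per-thread computation
__device__ void doMoreWork(float x) { }   // stands in for the per-element computation

__global__ void kernel_without_cdp(unsigned int* start, unsigned int* end,
                                   float* someData, float* moreData) {
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    doSomeWork(someData[i]);                           // per-thread computation
    for (unsigned int j = start[i]; j < end[i]; ++j) {
        doMoreWork(moreData[j]);                       // serial loop; length varies per thread
    }
}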

This computation pattern recurs frequently in many applications. For example, in graph search, each thread could visit a vertex then loop over a list of neighboring vertices. The reader should find this kernel structure very similar to that of BFS_Bqueue_kernel in Fig. 12.8. In sparse matrix computations, each thread could first identify the starting location of a row of non-zero elements and loop over the non-zero values. In simulations such as the example in the beginning of the chapter, each thread could represent a coarse grid element and loop over finer grid elements.

There are two main problems with writing applications this way. First, if the work in the loop (lines 07-09) can be profitably performed in parallel, then we have missed out on an opportunity to extract more parallelism from the application. Second, if the number of iterations in the loop varies significantly between threads in the same warp, then the resulting control divergence can degrade the performance of the program.

Fig. 13.5 shows a version of the same program that uses dynamic parallelism. In this version, the original kernel is separated into two kernels, a parent kernel and a child kernel. The parent kernel starts off the same as the original kernel and is executed by a grid of threads referred to as the parent grid. Instead of executing the loop, it launches a child kernel to continue the work (lines 07–08). The child kernel is executed by another grid of threads, called the child grid, which performs the work that was originally performed inside the loop body (line 18).

image
Figure 13.5 A revised example using CUDA dynamic parallelism.

Writing the program in this way addresses both problems that were mentioned about the original code. First, the loop iterations are now executed in parallel by the child kernel threads instead of serially by the original kernel thread. Thus, we have extracted more parallelism from the program. Second, each thread now executes a single loop iteration which results in better load balance and eliminates control divergence. Although these two goals could have been achieved by the programmer by rewriting the kernels differently, such manual transformations can be awkward, complicated and error prone. Dynamic parallelism provides an easy way to express such computational patterns.
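Continuing the sketch above (again with illustrative names rather than the exact code of Fig. 13.5, and reusing the doSomeWork() and doMoreWork() helpers), the parent/child split might look as follows.

__global__ void child_kernel(unsigned int start, unsigned int end, float* moreData) {
    unsigned int j = start + blockIdx.x * blockDim.x + threadIdx.x;
    if (j < end) doMoreWork(moreData[j]);          // one former loop iteration per thread
}

__global__ void parent_kernel(unsigned int* start, unsigned int* end,
                              float* someData, float* moreData) {
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    doSomeWork(someData[i]);                       // same per-thread computation as before
    unsigned int n = end[i] - start[i];            // work discovered by this thread
    if (n > 0) {                                   // cover the former loop with a child grid
        child_kernel<<<(n + 255) / 256, 256>>>(start[i], end[i], moreData);
    }
}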

13.4 Memory Data Visibility

In the next three sections, we will briefly explain some important details that govern the execution behavior of programs that use dynamic parallelism. It is important for a programmer to understand these details in order to use dynamic parallelism confidently. We will cover the rules for memory data visibility in this section. These rules specify how the data objects of a parent grid can be accessed by threads in a child grid. They extend the data consistency rules that apply between threads in the same grid and between threads in different grids in a program without dynamic parallelism. For example, the global memory data written by threads in a grid are not guaranteed to be visible to other threads until either an explicit memory fence or kernel termination. Such rules are extended in dynamic parallelism so that one can clearly understand how a parent and a child can make data values visible to each other.

Global Memory

A parent thread and its child grid can make their global memory data visible to each other, with weak consistency guarantees between child and parent. The memory views of the parent thread and the child grid are said to be consistent with each other if the effects of their memory operations are fully visible to each other. There are two points in the execution of a child grid when its view of memory is consistent with the parent thread:

1. When the child grid is created by a parent thread. That means that all global memory operations in the parent thread prior to invoking the child grid are visible to the child grid.

2. When the child grid completes, as signaled by the completion of a synchronization API call in the parent thread. That means all memory operations of the child grid are visible to the parent after the parent has synchronized on the child grid’s completion (see Section 13.6 for details about synchronization). A brief sketch of these two visibility points follows.
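A minimal sketch of these two visibility points, assuming a single-thread parent grid for simplicity and using the device-side cudaDeviceSynchronize() call described in Section 13.6:

__device__ int flag;                    // global-scope device variable

__global__ void child_k(void) {
    if (flag == 1) flag = 2;            // the parent's write (flag = 1) is visible here
                                        // because it happened before the launch
}

__global__ void parent_k(void) {        // assumed to be launched as parent_k<<<1, 1>>>()
    flag = 1;                           // visible to the child when it is created
    child_k<<<1, 1>>>();
    cudaDeviceSynchronize();            // wait for the child grid to complete
    // After this synchronization, the child's write (flag == 2) is guaranteed
    // to be visible to the parent thread.
}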

Zero-Copy Memory

Zero-copy system memory has identical coherence and consistency guarantees to global memory, and follows the same semantics as detailed above. A kernel may not allocate or free zero-copy memory; however, it may use pointers to zero-copy memory passed in from the host code.

Constant Memory

Constants may not be written to by a kernel, even between dynamic parallelism kernel launches. That is, the value of all __constant__ variables must be set from the host prior to launch of the first kernel. Constant memory variables are globally visible to all kernels, and so must remain constant for the lifetime of the entire dynamic parallelism launch tree invoked by the host code.

Taking the address of a constant memory object from within a thread has the same semantics as for non-dynamic-parallelism programs, and passing that pointer from parent to child or from a child to parent is fully supported.

Local Memory

Local memory is private storage for a thread, and is not visible outside of that thread. It is illegal to pass a pointer to local memory as a launch argument when launching a child kernel. The result of dereferencing such a local memory pointer from a child is undefined. For example, the following is illegal, with undefined behavior if x_array is accessed by any threads that execute the child_launch kernel:

int x_array[10];    // Creates x_array in parent’s local memory

child_launch<<< 1, 1 >>>(x_array);

It is sometimes difficult for a programmer to know when a variable is placed into local memory by the compiler. As a general rule, all storage passed to a child kernel should be allocated explicitly from the global-memory heap, either with malloc() or new() or by declaring __device__ storage at global scope. For example, Fig. 13.6(A) shows a valid kernel launch where a pointer to a global memory variable is passed as an argument into the child kernel. Fig. 13.6(B) shows invalid code where a pointer to a local memory (automatic) variable is passed into the child kernel.

The NVIDIA CUDA C compiler will issue a warning if it detects that a pointer to local memory is being passed as an argument to a kernel launch. However, such detection is not guaranteed (Fig. 13.6).

image
Figure 13.6 Passing a pointer as an argument to a child kernel.

Shared Memory

Shared memory is private storage for an executing thread-block, and data is not visible outside of that thread-block. Passing a pointer to a shared-memory variable to a child kernel either through memory or as an argument will result in undefined behavior.

Texture Memory

Texture memory accesses (read-only) are performed on a memory region that may be aliased to the global memory region. Texture memory has the same consistency guarantees as global memory and follows the same semantics. In particular, writes to memory prior to a child kernel launch are reflected in texture memory accesses of the child. Also, writes to memory by a child will be reflected in the texture memory accesses by a parent, after the parent synchronizes on the child’s completion.

Concurrent texture memory access and writes to global memory objects which alias the texture memory objects between a parent and its children or between multiple children will result in undefined behavior.

13.5 Configurations and Memory Management

Dynamic parallelism allows a CUDA thread to play the role of host code in launching kernels. There are two other major types of host code activities that support a kernel launch: configuring the device hardware and preparing the device memory for executing the kernel. A programmer also needs to understand how these activities apply to the kernels launched by a CUDA thread.

Launch Environment Configuration

A kernel launched with dynamic parallelism inherits all device configuration settings from its parent kernel. Such configuration settings include shared memory and L1 cache size as returned from cudaDeviceGetCacheConfig() and device execution parameter limits as returned from cudaDeviceGetLimit(). For example, if a parent kernel is configured with 16K bytes of shared memory and 48K bytes of L1 cache, then the child kernel it launches will have identical configurations. Likewise, a parent’s device limits such as stack size will be passed as-is to its children.

Memory Allocation and Lifetime

Dynamic parallelism makes it possible to invoke cudaMalloc and cudaFree from kernels. However, they have slightly modified semantics. Within the device environment, the total allocatable memory is limited to the device malloc() heap size, which may be smaller than the available unused device memory. Moreover, it is an error to invoke cudaFree from the host program on a pointer that was allocated by cudaMalloc on the device, or to invoke cudaFree from the device program on a pointer that was allocated by cudaMalloc on the host. These limitations may be removed in future versions of CUDA. The supported combinations and allocation limits are summarized in the following table.

                          cudaMalloc() on Host      cudaMalloc() on Device
cudaFree() on Host        Supported                 Not supported
cudaFree() on Device      Not supported             Supported
Allocation limit          Free device memory        cudaLimitMallocHeapSize
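As a sketch of device-side allocation under these rules (the kernel and sizes are hypothetical), a kernel may allocate from the device heap and must free that memory on the device. The heap itself is sized from the host with cudaDeviceSetLimit(cudaLimitMallocHeapSize, …) before the first kernel launch.

__global__ void scratch_kernel(int* sizes) {
    int n = sizes[threadIdx.x];
    float* scratch = NULL;
    cudaMalloc((void**)&scratch, n * sizeof(float));   // allocated from the device heap
    if (scratch != NULL) {
        // ... use scratch as per-thread working storage ...
        cudaFree(scratch);        // must be freed by device code, not by the host
    }
}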

Nesting Depth

Kernels launched with dynamic parallelism may themselves launch other kernels, which may in turn launch other kernels, and so on. Each subordinate launch is considered a new “nesting level,” and the total number of levels is the “nesting depth” of the program. The maximum nesting depth is limited in hardware to 24.

In the presence of parent-child synchronization, there are additional constraints on nesting depth due to the amount of memory required by the system to store parent kernel state. These constraints will be discussed in Section 13.6 when we discuss synchronization depth.

Pending Launch Pool Configuration

The pending launch pool is a buffer that tracks the kernels that are executing or waiting to be executed. This pool is allocated a fixed amount of space, thereby supporting a fixed number of pending kernel launches (2048 by default). If this number is exceeded, a virtualized pool is used, which leads to a significant slowdown that can be an order of magnitude or more. To avoid this slowdown, the programmer can increase the size of the fixed pool by calling the cudaDeviceSetLimit() API from the host code to set the cudaLimitDevRuntimePendingLaunchCount configuration.
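A host-side configuration along these lines (the helper function is illustrative) enlarges the fixed pool before the first kernel that performs device-side launches:

#include <cuda_runtime.h>

void configurePendingLaunchPool(size_t expectedLaunches) {
    size_t current = 0;
    cudaDeviceGetLimit(&current, cudaLimitDevRuntimePendingLaunchCount);  // default: 2048
    if (expectedLaunches > current) {
        cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, expectedLaunches);
    }
}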

Errors and Launch Failures

Like CUDA API function calls in host code, any CUDA API function called within a kernel may return an error code. Any kernel launch that fails, for example due to insufficient execution resources, also returns an error code. The last error code returned is recorded and may be retrieved via the cudaGetLastError() call. Errors are recorded on a per-thread basis, so that each thread can identify the most recent error that it has generated. The error code is of type cudaError_t, which is a 32-bit integer value.1
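A short sketch of per-thread error checking inside a parent kernel (the recording variable and kernels are illustrative):

__device__ cudaError_t lastLaunchError;     // illustrative place to record an error

__global__ void some_child(void) { }

__global__ void checking_parent(void) {
    some_child<<<1, 1>>>();                 // may fail, e.g., for lack of resources
    cudaError_t err = cudaGetLastError();   // most recent error of this thread
    if (err != cudaSuccess) {
        lastLaunchError = err;              // e.g., record it for the host to inspect
    }
}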

13.6 Synchronization, Streams, and Events

Synchronization

As with kernel launches from the host, kernel launches from the device are non-blocking. If a parent thread wants to wait for a child kernel to complete before proceeding, it must perform synchronization explicitly.

One way for a parent thread to perform synchronization with its child kernels on the device is by invoking cudaDeviceSynchronize(). A thread that invokes this call will wait until all kernels launched by any thread in the thread-block have completed. However, this does not mean that all threads in the block will wait, so if a block-wide synchronization is desired, then cudaDeviceSynchronize() invoked by one thread in the block must also be followed by __syncthreads() invoked by all threads in the block. Synchronization can also be performed on streams within the same thread-block (which will be discussed shortly).
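A sketch of this pattern, in which one thread launches and synchronizes and the whole block then waits (the kernels are illustrative):

__global__ void child_work(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;      // placeholder for the real child work
}

__global__ void block_sync_parent(float* data, int n) {
    if (threadIdx.x == 0) {
        child_work<<<(n + 255) / 256, 256>>>(data, n);
        cudaDeviceSynchronize();     // this thread waits for all child grids
                                     // launched by any thread in the block
    }
    __syncthreads();                 // now every thread in the block may rely
                                     // on the child grid's results
}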

If a parent kernel launches other child kernels and does not explicitly synchronize on the completion of those kernels, then the runtime will perform the synchronization implicitly before the parent kernel terminates. This ensures that the parent and child kernels are properly nested, and that no kernel completes before its children have completed. This implicit synchronization is illustrated in Fig. 13.7.

image
Figure 13.7 Completion sequence for parent and child grids.

Synchronization Depth

If a parent kernel performs explicit synchronization on a child kernel, it may be swapped out of execution while waiting for the child kernel to complete. For this reason, memory needs to be allocated as a backing-store for the parent kernel state. Ancestors of the synchronizing parent kernel may also be swapped out. Thus the backing store needs to be large enough to fit the state of all kernels up to the deepest nesting level at which synchronization is performed. This deepest nesting level defines the synchronization depth.

Conservatively, the amount of memory allocated for the backing store for each level of the synchronization depth must be large enough to support storing state for the maximum number of live threads possible on the device. On current generation devices, this amounts to ~150 MB per level, which will be unavailable for program use even if it is not all consumed. The maximum synchronization depth is thus limited by the amount of memory allocated by the software for the backing store, and is likely to be a more important constraint than the maximum nesting depth stipulated by the hardware.

The default amount of memory reserved for the backing store is sufficient for a synchronization depth of two. However, the programmer can increase this amount using the cudaDeviceSetLimit() API call from the host code to set a larger value for the cudaLimitDevRuntimeSyncDepth configuration parameter.
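For example, a host-side call along these lines (the depth of 4 is illustrative) reserves backing store for deeper synchronization:

#include <cuda_runtime.h>

void configureSyncDepth(void) {
    cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 4);   // default depth is 2
}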

Streams

Just like host code can use streams to execute kernels concurrently, kernel threads can also use streams when launching kernels with dynamic parallelism. Both named and unnamed (NULL) streams can be used.

The scope of a stream is private to the block in which the stream was created. In other words, streams created by a thread may be used by any thread within the same thread-block, but stream handles should not be passed to other blocks or child/parent kernels. Using a stream handle within a block that did not allocate it will result in undefined behavior. Streams created on the host have undefined behavior when used within any kernel, just as streams created by a parent grid have undefined behavior if used within a child grid.

When a stream is not specified for a kernel launch, the default NULL stream of the block is used by all threads. This means that all kernels launched in the same block will be serialized even if they were launched by different threads. However, kernels launched by different threads in a block can often safely execute concurrently, so programmers must be careful to explicitly use a different stream in each thread if they wish to avoid the performance penalty of serialization.

Similar to host-side launch, work launched into separate streams may run concurrently, but actual concurrency is not guaranteed. Programs that require concurrency between child kernels in order to run correctly are ill-formed and will have undefined behavior. An unlimited number of named streams are supported per block, but the maximum concurrency supported by the platform is limited. If more streams are created than can support concurrent execution, some of these may serialize or alias with each other.

The host-side NULL stream’s global-synchronization semantics are not supported under dynamic parallelism. To make this difference between the stream behavior on the host side and on the device side with dynamic parallelism explicit, all streams created in a kernel must be created using the cudaStreamCreateWithFlags() API with the cudaStreamNonBlocking flag (an example is shown later in Fig. 13.10). Calls to cudaStreamCreate() from a kernel will fail with a compiler “unrecognized function call” error, so as to make clear the different stream semantics under dynamic parallelism.

The cudaStreamSynchronize() API is not available within a kernel, only cudaDeviceSynchronize() can be used to wait explicitly for launched work to complete. This is because the underlying system software implements only a block-wide synchronization call, and it is undesirable to offer an API with incomplete semantics (that is, the synchronization function guarantees that one stream synchronizes, but coincidentally provides a full barrier as a side-effect). Streams created within a thread-block are implicitly synchronized when all threads in the thread-block exit execution.

Events

Only the inter-stream synchronization capabilities of CUDA events are supported in kernel functions. Events within individual streams are currently not supported in kernel functions. This means that cudaStreamWaitEvent() is supported, but cudaEventSynchronize(), timing with cudaEventElapsedTime(), and event query via cudaEventQuery() are not. These may be supported in a future version.2

Event objects may be shared between the threads within a block that created them but are local to that block and should not be passed to child/parent kernels. Using an event handle within a block that did not allocate it will result in undefined behavior.

An unlimited number of events are supported per block, but these consume device memory. Owing to resource limitations, if too many events are created (exact number is implementation-dependent), then device-launched grids may attain less concurrency than might be expected. Correct execution is guaranteed, however.

13.7 A More Complex Example

We now show a more interesting and useful example: adaptive subdivision of spline curves. This example illustrates a variable number of child kernel launches, according to the workload. The example calculates Bezier curves [Wiki_Bezier], which are frequently used in computer graphics to draw smooth, intuitive curves that are defined by a set of control points, typically specified by a user.

Mathematically, a Bezier curve is defined by a set of control points P0 through Pn, where n is called its order (n=1 for linear, 2 for quadratic, 3 for cubic, etc.). The first and last control points are always the end points of the curve; however, the intermediate control points (if any) generally do not lie on the curve.

Linear Bezier Curves

Given two control points P0 and P1, a linear Bezier curve is simply a straight line connecting those two points. The coordinates of the points on the curve are given by the following linear interpolation formula:

B(t) = P0 + t(P1 − P0) = (1 − t)P0 + tP1,   t ∈ [0, 1]

Quadratic Bezier Curves

A quadratic Bezier curve is defined by three control points P0, P1, and P2. The points on a quadratic curve are defined as a linear interpolation of corresponding points on the linear Bezier curves from P0 to P1 and from P1 to P2, respectively. The calculation of the coordinates of points on the curve is expressed in the following formula:

B(t) = (1 − t)[(1 − t)P0 + tP1] + t[(1 − t)P1 + tP2],   t ∈ [0, 1],

which can be simplified into the following formula:

B(t) = (1 − t)²P0 + 2(1 − t)tP1 + t²P2,   t ∈ [0, 1].
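As a small illustration of this formula (not code from Fig. 13.8), a single quadratic Bezier point can be evaluated as follows; the helper name is hypothetical.

#include <cuda_runtime.h>   // for float2 and make_float2

__host__ __device__ float2 bezier2(float2 P0, float2 P1, float2 P2, float t) {
    float u = 1.0f - t;                                   // (1 - t)
    return make_float2(u * u * P0.x + 2.0f * u * t * P1.x + t * t * P2.x,
                       u * u * P0.y + 2.0f * u * t * P1.y + t * t * P2.y);
}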

Bezier Curve Calculation (Without Dynamic Parallelism)

Fig. 13.8 shows a CUDA C program that calculates the coordinates of points on a Bezier curve. The main function (line 48) initializes a set of control points to random values (line 51, see footnote 3). In a real application, these control points are most likely inputs from a user or a file. The control points are part of the bLines_h array whose element type BezierLine is declared in line 07. The storage for the bLines_h array is allocated in line 50. The host code then allocates the corresponding device memory for the bLines_d array and copies the initialized data to bLines_d (lines 54–56). It then calls the computeBezierLines() kernel to calculate the coordinates of the Bezier curve.

image
Figure 13.8 Bezier curve calculation without dynamic parallelism (support code in Fig. A13.8).

The computeBezierLines() kernel starting at line 13 is designed to use a thread-block to calculate the curve points for a set of three control points (of the quadratic Bezier formula). Each thread-block first computes a measure of the curvature of the curve defined by the three control points. Intuitively, the larger the curvature, the more points it takes to draw a smooth quadratic Bezier curve for the three control points. This defines the amount of work to be done by each thread-block. This is reflected in lines 20 and 21, where the total number of points to be calculated by the current thread-block is proportional to the curvature value.

In the for-loop in line 24, all threads calculate a consecutive set of Bezier curve points in each iteration. The detailed calculation in the loop body is based on the formula we presented earlier. The key point is that the number of iterations taken by threads in a block can be very different from that taken by threads in another block. Depending on the scheduling policy, such variation of the amount of work done by each thread-block can result in decreased utilization of SMs and thus reduced performance.

Bezier Curve Calculation (With Dynamic Parallelism)

Fig. 13.9 shows a Bezier curve calculation code using dynamic parallelism. It breaks the computeBezierLines() kernel in Fig. 13.8 into two kernels. The first part, computeBezierLines_parent(), discovers the amount of work to be done for each set of control points. The second part, computeBezierLines_child(), performs the calculation.

image
Figure 13.9 Bezier calculation with dynamic parallelism (support code in Fig. A13.8).

With the new organization, the amount of work done for each set of control points by the computeBezierLines_parent() kernel is much smaller than in the original computeBezierLines() kernel. Therefore, we use one thread to do this work in computeBezierLines_parent(), as opposed to using one block in computeBezierLines(). In line 58, we only need to launch one thread per set of control points. This is reflected by dividing N_LINES by BLOCK_DIM to form the number of blocks in the kernel launch configuration.

There are two key differences between the computeBezierLines_parent() kernel and the computeBezierLines() kernel. First, the index used to access the control points is formed on a thread basis (line 08 in Fig. 13.9) rather than a block basis (line 14 in Fig. 13.8). This is because the work for each set of control points is done by a thread rather than a block, as we mentioned above. Second, the memory for storing the calculated Bezier curve points is dynamically determined and allocated in line 15 in Fig. 13.9. This allows the code to assign just enough memory to each set of control points in the BezierLine type. Note that in Fig. 13.8, each BezierLine element is declared with the maximal possible number of points. In contrast, the declaration in Fig. 13.9 has only a pointer to dynamically allocated storage. Allowing a kernel to call the cudaMalloc() function can lead to a substantial reduction in memory usage in situations where the curvature of the control points varies significantly.

Once a thread of the computeBezierLines_parent() kernel determines the amount of work needed by its set of control points, it launches the computeBezierLines_child() kernel to do the work (line 19 in Fig. 13.9). In our example, every thread from the parent grid creates a new grid for its assigned set of control points. This way, the work done by each thread-block is balanced. The amount of work done by each child grid varies.

After the computeBezierLines_parent() kernel terminates, the main function can copy the data back and draw the curve on an output device. It also calls a kernel to free all storage allocated to the vertices in the bLines_d data structure in parallel (line 61). This is necessary since the vertex storage was allocated on the device by the computeBezierLines_parent() kernel so it has to be freed by device code (Section 13.5).

Launch Pool Size

As explained in Section 13.6, the launch pool storage may be virtualized when the fixed-pool size is full. That is, all launched grids will still be queued successfully. However, using the virtualized pool has a higher cost than using the fixed-size pool. The Bezier curve calculation with dynamic parallelism helps us to illustrate this.

Since the default size of the fixed-size pool is 2048 (it can be queried with cudaDeviceGetLimit()), launching more than 2048 grids will require the use of the virtualized pool once the fixed-size pool is full. For example, if N_LINES (defined in Fig. 13.8, line 45) is set to 4096, half of the launches will use the virtualized pool, which incurs a significant performance penalty. However, if the fixed-size pool is set to 4096, the execution time will be reduced by an order of magnitude.

As a general recommendation, the size of the fixed-size pool should be set to the number of launched grids (if it exceeds the default size). In the case of the Bezier curves example, we would use cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, N_LINES) before launching the computeBezierLines_parent() kernel (line 58).

Streams

Named and unnamed (NULL) streams are offered by the device runtime, as mentioned in Section 13.6. One key consideration is that the default NULL stream has block scope. Thus, by default, all grids launched within a thread-block will use the same stream, even if they are launched by different threads. As a consequence, these grids will execute sequentially.

The Bezier example launches as many grids as there are threads in the computeBezierLines_parent() kernel (line 19 in Fig. 13.9). Moreover, since MAX_TESS_POINTS is equal to 32 (see Fig. 13.8, line 04) and the thread-block size in computeBezierLines_child() is 32, the number of blocks per grid will be 1 for the computeBezierLines_child() kernel. If the default NULL stream is used, all of these single-block grids will be serialized. Thus, using the default NULL stream when launching the computeBezierLines_child() kernel can result in a drastic reduction in parallelism compared to the original kernel without dynamic parallelism.

Given that N_LINES is 256 and BLOCK_DIM is 64, only four blocks are launched in computeBezierLines_parent(). Thus, only four default streams will be available for the computeBezierLines_child() kernel. Consequently, some streaming multiprocessors (SMs) will remain unused on any GPU with more than four SMs. Since each grid in the same stream consists of only one thread-block and all grids in the same stream are serialized with respect to each other, each SM can also be underutilized.

If more concurrency is desired (with the aim of better utilizing all SMs), named streams must be created and used in each thread. Fig. 13.10 shows the sequence of instructions that should replace line 19 in Fig. 13.9.

image
Figure 13.10 Child kernel launch with named streams.

Using the code in Fig. 13.10, kernels launched from the same thread-block will be in different streams and can run concurrently. This will better utilize all SMs in the situation described above, leading to a considerable reduction of the execution time.
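Because Fig. 13.10 appears here only as a placeholder, the following hedged sketch shows the per-thread stream pattern it describes; the kernel signatures and launch configuration are illustrative rather than the book's exact code.

__global__ void bezier_child(int lidx, float* vertices, int nPoints) {
    int i = threadIdx.x;
    if (i < nPoints) vertices[lidx * 32 + i] = (float)i;        // placeholder work
}

__global__ void bezier_parent(float* vertices, int* nPoints, int nLines) {
    int lidx = blockIdx.x * blockDim.x + threadIdx.x;
    if (lidx < nLines) {
        cudaStream_t stream;
        cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);  // required flag
        bezier_child<<<(nPoints[lidx] + 31) / 32, 32, 0, stream>>>(
                lidx, vertices, nPoints[lidx]);
        cudaStreamDestroy(stream);   // safe here: resources are released only
                                     // after the work in the stream completes
    }
}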

13.8 A Recursive Example

Dynamic parallelism allows programmers to implement recursive algorithms. In this section, we illustrate the use of dynamic parallelism for implementing recursion with a quadtree [Quadtree 1974]. Quadtrees partition a two-dimensional space by recursively subdividing it into four quadrants. Each quadrant is considered a node of the quadtree, and contains a number of points. If the number of points in a quadrant is greater than a fixed minimum, the quadrant will be subdivided into four more quadrants, that is, four child nodes.

Fig. 13.11 depicts an overview of how the construction of a quadtree can be implemented with dynamic parallelism. In this implementation one node (quadrant) is assigned to one thread-block. Initially (depth =0), one thread-block is assigned the entire two-dimensional space (root node), which contains all points. It divides the space into four quadrants, and launches one thread-block for each quadrant (depth =1). These child blocks will again subdivide their quadrants if they contain more points than a fixed minimum. In this example the minimum is two; thus, blocks 00 and 02 do not launch children. Blocks 01 and 03 launch a kernel with four blocks each.

image
Figure 13.11 Quadtree example. Each thread-block is assigned to one quadrant. If the number of points in a quadrant is more than 2, the block launches 4 child blocks. Shadowed blocks are active blocks in each level of depth.

As the flow graph in the right-hand side of Fig. 13.11 shows, a block first checks if the number of points in its quadrant is greater than the minimum required for further division and the maximum depth has not been reached. If either of the conditions fails, the work for the quadrant is complete and the block returns. Otherwise, the block computes the center of the bounding box that surrounds its quadrant. The center is in the middle of four new quadrants. The number of points in each of them is counted. A four-element scan operation is used to compute the offsets to the locations where the points will be stored. Then, the points are reordered, so that those points in the same quadrant are grouped together and placed into their section of the point storage. Finally, the block launches a child kernel with four thread-blocks, one for each of the four new quadrants.
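Before looking at the actual kernel in Fig. 13.13, the following minimal sketch shows only the recursive launch pattern that this flow implies; it is not the quadtree code itself, and the parameters and the even split are illustrative simplifications.

__global__ void subdivide(int depth, int num_points, int max_depth, int min_points) {
    if (num_points <= min_points || depth >= max_depth) {
        return;                              // terminating condition: work is done
    }
    // ... count points per quadrant, scan for offsets, reorder points ...
    if (threadIdx.x == 0) {                  // one launching thread per block
        // For the sketch only, assume the points split evenly among quadrants.
        subdivide<<<4, blockDim.x>>>(depth + 1, num_points / 4, max_depth, min_points);
    }
}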

Fig. 13.12 continues the small example in Fig. 13.11 and illustrates in detail how the points are reordered at each level of depth. In this example, we assume that each quadrant must have a minimum of two points in order to be further divided. The algorithm uses two buffers to store the points and reorder them. The points should be in buffer 0 at the end of the algorithm. Thus, it might be necessary to swap the buffer contents before leaving, in case the points are in buffer 1 when the terminating condition is met.

image
Figure 13.12 Quadtree example. At each level of depth, a block groups all points in the same quadrant together.

In the initial kernel launch from the host code (for depth =0), thread-block 0 is assigned all the points that reside in buffer 0, shown in Fig. 13.12(A). Block 0 further divides the quadrant into four child quadrants, groups together all points in the same child quadrant, and stores them in buffer 1, as shown in Fig. 13.12(B). Its four children, block 00 to block 03, are assigned each of the four new quadrants, shown as marked ranges in Fig. 13.12(B). Blocks 00 and 02 will not launch children, since the number of points in their respective assigned quadrant is only 2. They swap their points to buffer 0. Blocks 01 and 03 reorder their points to group those in the same quadrant, and launch four child blocks each, as shown in Fig. 13.12(C). Blocks 010, 011, 012, 013, 030, 031, and 032 do not launch children (they have 2 or fewer points) nor need to swap points (they are already in buffer 0). Only block 033 reorders its points, and launches four blocks, as shown in Fig. 13.12(D). Blocks 0330 to 0333 will exit after swapping their points to buffer 0, which can be seen in Fig. 13.12(E).

The kernel code in Fig. 13.13 implements the flow graph from Fig. 13.11 in CUDA. The quadtree is implemented with a node array, where each element contains all the pertinent information for one node of the quadtree (definition in Fig. A13.14 in Code Appendix). As the quadtree is constructed, new nodes will be created and placed into the array during the execution of the kernels. The kernel code assumes that the node parameter points to the next available location in the node array.

image
Figure 13.13 Quadtree with dynamic parallelism: recursive kernel (support code in Fig. A13.13).

At each level of depth, every block starts by checking the number of points in its node (quadrant). Each point is a pair of floats representing x and y coordinates (definition in Fig. A13.13 in Code Appendix). If the number of points is less than or equal to the minimum or if the maximum depth is reached (line 11), the block will exit. Before exiting, the block carries out a buffer swap if necessary. This is done in the device function check_num_points_and_depth() shown in Fig. 13.14.

imageimage
Figure 13.14 Quadtree with dynamic parallelism: device functions (support code in Fig. A13.14).

If the block doesn’t exit, the center of the bounding box is computed (line 17). A bounding box is defined by its top-left and bottom-right corners. The coordinates of the center are computed as the coordinates of the middle point between these two corner points. The definition of a bounding box (including function compute_center()) is in Fig. A13.13 in the Code Appendix.

As the center defines the four quadrants, the number of points in each quadrant is counted (line 26). The device function count_points_in_children() can be found in Fig. 13.14 (see footnote 4). The threads of the block collaboratively go through the range of points and atomically update the counters in shared memory for each quadrant.

The device function scan_for_offsets() is called then (line 29). As can be seen in Fig. 13.14, it performs a sequential scan on the four counters in shared memory. Then, it adds the global offset of the parent quadrant to these values to derive the starting offset for each quadrant’s group in the buffer.

Using the quadrants’ offsets, the points are reordered with reorder_points() (line 32). For simplicity, this device function (Fig. 13.14) uses an atomic operation on one of the four quadrant counters to derive the location for placing each point.

Finally, the last thread of the block (line 35) determines the next available location in the node array (line 37), prepares the new node contents for the child quadrants (line 40), and launches one child kernel with four thread-blocks (line 43). The device function prepare_children() prepares the new node contents for the children by setting the limits of the children’s bounding boxes and the range of points in each quadrant. The prepare_children() function can be found in Fig. 13.14.

The rest of the definitions and the main function can be found in Fig. A13.14 in the Code Appendix.

13.9 Summary

CUDA dynamic parallelism extends the CUDA programming model to allow kernels to launch kernels. This allows each thread to dynamically discover work and launch new grids according to the amount of work discovered. It also supports dynamic allocation of device memory by threads. As we show in the Bezier Curve calculation example, these extensions can lead to better work balance across threads and blocks as well as more efficient memory usage. CUDA Dynamic Parallelism also helps programmers to implement recursive algorithms, as the quadtree example shows.

Besides ensuring better work balance, dynamic parallelism offers many advantages in terms of programmability. However, it is important to keep in mind that launching grids with a very small number of threads can lead to severe underutilization of the GPU resources. A general recommendation is to launch child grids with a large number of thread-blocks or, if the number of blocks is small, with thread-blocks of at least hundreds of threads.

Similarly, nested parallelism, which can be seen as a form of tree processing, provides higher performance when tree nodes are thick (that is, each node deploys many threads) and/or when the branching degree is large (that is, each parent node has many children). Because the nesting depth is limited in hardware, only relatively shallow trees can be implemented efficiently.

13.10 Exercises

1. True or False: Parent and child grids have coherent access to global memory, with weak consistency between child and parent.

2. True or False: Zero-copy system memory has no coherence and consistency guarantees between parent and children.

3. True or False: Parent kernels can define new __constant__ variables that will be inherited by child kernels.

4. True or False: Child kernels can inherit parent’s shared and local memories, and coherence is guaranteed.

5. Six (6) blocks of 256 threads run the following parent kernel:
  __global__ void parent_kernel(int *output, int *input, int *size) {
    // Thread index
    int idx = threadIdx.x + blockDim.x*blockIdx.x;
    // Number of child blocks
    int numBlocks = size[idx] / blockDim.x;
    // Launch child
    child_kernel<<< numBlocks, blockDim.x >>>(output, input, size);
  }
How many child kernels could run concurrently?

a. 1536

b. 256

c. 6

d. 1

6. Choose the right statement for the Bezier example:

a. If N_LINES =1024, and BLOCK_DIM =64, the number of child kernel launches will be 16.

b. If N_LINES =1024, the fixed-size pool should be set to 1024 (Note: Default size is 2048).

c. If N_LINES =1024, BLOCK_DIM =64, and per-thread streams are used, a total of 16 streams will be deployed.

d. If N_LINES =1024, BLOCK_DIM =64, and aggregation is used, the number of child kernel launches will be 16.

7. Consider a two-dimensional organization of 64 equidistant points that is classified with a quadtree. What will be the maximum depth of the quadtree (including the root node)?

a. 21

b. 4

c. 64

d. 16

8. For the same quadtree, what will be the total number of child kernel launches?

a. 21

b. 4

c. 64

d. 16

A13.1 Code Appendix

image
Figure A13.8 Support code for Bezier Curve calculation without dynamic parallelism.
image
Figure A13.13 Support code for quadtree with dynamic parallelism: definition of points and bounding box.
imageimage
Figure A13.14 Support code for quadtree with dynamic parallelism: definitions and main function.

1. No notification of ECC errors is available to code within a CUDA kernel. ECC errors are only reported at the host side. Any ECC errors which arise during execution of a dynamic parallelism kernel will either generate an exception or continue execution (depending upon error and configuration).

2. To ensure that this restriction is clearly seen by the user, dynamic parallelism cudaEvents must be created via cudaEventCreateWithFlags(), which currently only accepts the cudaEventDisableTiming flag value when called from a kernel.

3. Function initializeBLines() can be found in Fig. A13.8 in the Code Appendix at the end of the chapter.

4. The device functions in Fig. 13.14 are simplified for clarity.
