Chapter 10. Advanced OpenACC

Jeff Larkin, NVIDIA

With the basics of OpenACC programming well in hand, this chapter discusses two advanced OpenACC features for maximizing application performance. The first feature is asynchronous operations, which allow multiple things to happen at the same time to better utilize the available system resources, such as a GPU, a CPU, and the PCIe connection in between. The second feature is support for multiple accelerator devices. The chapter discusses two ways that an application can utilize two or more accelerator devices to increase performance: one using purely OpenACC, and the other combining OpenACC with the Message Passing Interface (MPI).

10.1 Asynchronous Operations

Programming is often taught by developing a series of steps that, when completed, achieve a specific result. If I add E = A + B and then F = C + D, then I can add G = E + F and get the sum of A, B, C, and D, as shown in Figure 10.1.


Figure 10.1 Summing four numbers step-by-step

Our brains often like to think in an ordered list of steps, and that influences the way we write our programs, but in fact we’re used to carrying out multiple tasks at the same time. Take, for instance, cooking a spaghetti dinner. It would be silly to cook the pasta, and then the sauce, and finally the loaf of bread. Instead, you will probably let the sauce simmer while you bake the bread and, when the bread is almost finished, boil the pasta so that all the parts of the meal will be ready when they are needed. By thinking about when you need each part of the meal and cooking the parts so that they all complete only when they are needed, you can cook the meal in less time than if you’d prepared each part in a separate complete step. You also would better utilize the kitchen by taking advantage of the fact that you can have two pots and the oven working simultaneously.

By thinking of your operations as a series of dependencies, rather than a series of steps, you open yourself to running the steps in potentially more efficient orders, possibly even in parallel. This is a different sort of parallelism than we have discussed so far in this book. The chapters before this one focus on data parallelism, or performing the same action on many different data locations in parallel. In the case of the spaghetti dinner analogy, we’re discussing task-based parallelism, or exposing different tasks, each of which may or may not be data parallel, that can be executed concurrently.

Let us look at the spaghetti dinner example a little more closely. You can treat cooking the pasta, preparing the sauce, and baking the bread as three separate operations in cooking the dinner. Cooking the pasta takes roughly 30 minutes, because it involves bringing the water to a boil, actually cooking the pasta, and then draining it when it is done. Making the sauce takes quite a bit longer, roughly two hours: putting the sauce in a pot, adding seasoning, and then letting it simmer so that the seasoning takes effect. Baking the bread is another 30-minute operation, involving heating the oven, baking the bread, and then allowing the bread to cool. Finally, when these operations are complete, you can plate and eat your dinner.

If I did each of these operations one after another (in any order), then the entire process would take three hours, and at least some part of my meal would be cold. To understand the dependencies between the steps, let’s look at the process in reverse.

1. To plate the meal, the pasta, sauce, and bread must all be complete.

2. For the pasta to be complete, it must have been drained, which requires that it has been cooked, which requires that the water has been brought to a boil.

3. For the sauce to be ready, it must have simmered, which requires that the seasoning has been added, and the sauce has been put into a pot.

4. For the bread to be done, it must have cooled, before which it must have been baked, which requires that the oven has been heated.

From this, you can see that the three parts of the meal do not need to be prepared in any particular order, but each has its own steps that do have a required order. To put it in parallel programming terms, each step within the larger tasks has a dependency on the preceding step, but the three major tasks are independent of each other. Figure 10.2 illustrates this point. The circles indicate the steps in cooking the dinner, and the rounded rectangles indicate the steps that have dependencies. Notice that when each task reaches the plating step, it becomes dependent on the completion of the other tasks. Therefore, where the rounded rectangles overlap is where the cook needs to synchronize the tasks, because the results of all three tasks are required before you can move on; but where the boxes do not cross, the tasks can operate asynchronously from each other.


Figure 10.2 Dependency graph for spaghetti dinner example

Asynchronous programming is a practice of exposing the dependencies in the code to enable multiple independent operations to be run concurrently across the available system resources. Exposing dependencies does not guarantee that independent operations will be run concurrently, but it does enable this possibility. In the example, what if your stove has only one burner? Could you still overlap the cooking of the pasta with the simmering of the sauce? No, because both require the same resource: the burner. In this case, you will need to serialize those two tasks so that one uses the burner as soon as the other is complete, but the order in which those tasks complete is still irrelevant. Asynchronous programming is also the practice of removing synchronization between independent steps, and hence the term asynchronous. It takes great care to ensure that necessary synchronization is placed in the code to produce consistently correct results, but when done correctly this style of programming can better utilize available resources.

10.1.1 Asynchronous OpenACC Programming

By default, all OpenACC directives are synchronous with the host thread, meaning that after the host thread has sent the required data and instructions to the accelerator, the host thread will wait for the accelerator to complete its work before continuing execution. By remaining synchronous with the host thread, OpenACC ensures that operations occur in the same order when run on the accelerator as when run in the original program, thereby ensuring that the program will run correctly with or without an accelerator.

This is rarely the most efficient way to run the code, however, because system resources go unused while the host is waiting for the accelerator to complete. For instance, on a system with one CPU and one GPU connected by PCIe, at any given time either the CPU is computing, the GPU is computing, or the PCIe bus is copying data, but no two of those resources are ever used concurrently. If we were to expose the dependencies in our code and then launch our operations asynchronously from the host thread, then the host thread would be free to send more data and instructions to the device or even participate in the calculation. OpenACC uses the async clause and the wait directive to achieve exactly this.

Asynchronous Work Queues

OpenACC uses asynchronous work queues to expose dependencies between operations. Anything placed into the same work queue will be executed in the order it was enqueued. This is a typical first-in, first-out (FIFO) queue. Operations placed in different work queues can execute in any order, no matter which was enqueued first. When the results of a particular queue (or all queues) are needed, the programmer must wait (or synchronize) on the work queue (or queues). Referring to Figure 10.2, the rounded rectangles could each represent a separate work queue, and the steps should be put into that queue in the order in which they must be completed.

OpenACC identifies asynchronous queues with non-negative integer numbers, so all work enqueued to queue 1, for instance, will complete in order and operate independently of work enqueued in queue 2. The specification also defines several special work queues. If acc_async_sync is specified as the queue, then all directives placed into the queue will become synchronous with the host thread, something that is useful for debugging. If the queue acc_async_noval is used, then the operation will be placed into the default asynchronous queue.
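The following short sketch (not one of this chapter’s listings; the array name A and its size N are illustrative) shows how these special values can be passed to the async clause: acc_async_noval behaves exactly like async with no argument, and acc_async_sync temporarily makes a directive synchronous, which can be handy when you are hunting for a suspected race.

#include <openacc.h>

#define N 1024

int main(void)
{
  float A[N];
  for (int i = 0; i < N; i++) A[i] = (float)i;

#pragma acc data create(A[0:N])
  {
    // Placed in the default asynchronous queue, exactly as if "async"
    // had been written with no argument.
#pragma acc update device(A[0:N]) async(acc_async_noval)

    // Forced to be synchronous with the host thread; useful while
    // debugging to rule out a missing synchronization.
#pragma acc update device(A[0:N]) async(acc_async_sync)

    // Wait for any remaining asynchronous work before leaving the region.
#pragma acc wait
  }
  return 0;
}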

The async Clause

Specific directives are placed into work queues by adding the async clause to the directive. This clause may be added to parallel and kernels regions and the update directive. This clause optionally accepts a non-negative integer value or one of the special values previously discussed. If no parameter is given to the async clause, then it will behave as if the acc_async_noval queue was specified, often referred to as the default queue. Listing 10.1 demonstrates the use of the async clause in C/C++, both with and without the optional parameter.

Listing 10.1 Example of async clause in C/C++


// The next directive updates the value of A on the device
// asynchronously of the host thread and in queue #1.
#pragma acc update device(A[:N]) async(1)
// The next directive updates the value of B on the device
// and may execute before the previous update has completed.
// This update has been placed in the default queue.
#pragma acc update device(B[:N]) async
// The host thread will continue execution, even before the
// above updates have completed.


The wait Directive

Once work has been made asynchronous, it’s necessary to later synchronize when the results of that work are needed. Failure to synchronize will likely produce incorrect results. This can be achieved in OpenACC by using either the wait directive or the wait clause. The wait directive identifies a place in the code that the encountering thread cannot pass until all asynchronous work up to that point in the program has completed. The directive optionally accepts a work queue number so that the program waits only for the work in the specified work queue to complete, and not all asynchronous work. Listing 10.2 demonstrates the use of the wait directive in C/C++.

Listing 10.2 Example of wait directive in C/C++


// The next directive updates the value of A on the device
// asynchronously of the host thread and in queue #1.
#pragma acc update device(A[:N]) async(1)
// The host thread will continue execution, even before the
// above updates have completed.
// The next directive updates the value of B on the device
// and may execute before the previous update has completed.
// This update has been placed in queue #2.
#pragma acc update device(B[:N]) async(2)
// The next directive updates the value of C on the device
// and may execute before the previous update has completed.
// This update has been placed in the default queue.
#pragma acc update device(C[:N]) async
// When the program encounters the following directive, it will
// block further execution until the data transfer in queue #1
// has completed.
#pragma acc wait(1)
// When the program encounters the following directive, it will
// wait for all remaining asynchronous work from this thread to
// complete before continuing.
#pragma acc wait


Perhaps rather than insert a whole new directive into the code to force this synchronization, you would rather force an already existing directive to wait on a particular queue. This can be achieved with the wait clause, which is available on many directives. Listing 10.3 demonstrates this in C/C++.

Listing 10.3 Example of wait clause in C/C++


// The next directive updates the value of A on the device
// asynchronously of the host thread and in queue #1.
#pragma acc update device(A[:N]) async(1)
// The next directive updates the value of B on the device
// and may execute before the previous update has completed.
// This update has been placed in queue #2.
#pragma acc update device(B[:N]) async(2)
// The host thread will continue execution, even before the
// above updates have completed.
// The next directive creates a parallel loop region, which it
// enqueues in work queue #1, but waits on queue #2. Queue 1
// will wait here until all prior work in queue #2 has completed
// and then continue executing.
#pragma acc parallel loop async(1) wait(2)
for (int i=0 ; i < N; i++) {
  C[i] = A[i] + B[i];
}
// When the program encounters the following directive, it will
// wait for all remaining asynchronous work from this thread
// into any queue to complete before continuing.
#pragma acc wait


Joining Work Queues

When placing work into multiple work queues, eventually you need to wait for work to complete in both queues before execution moves forward. The most obvious way to handle such a situation is to insert a wait directive for each queue and then move forward with the calculation. Although this technique works for ensuring correctness, it results in the host CPU blocking execution until both queues have completed and could create bubbles in one or both queues while no work is available to execute, particularly if one queue requires more time to execute than the other. Instead of using the host CPU to synchronize the queues, it’s possible to join the queues together at the point where they need to synchronize by using an asynchronous wait directive. The idea of asynchronously waiting may seem counterintuitive, so consider Listing 10.4. The code performs a simple vector summation but has separate loops for initializing the A array, initializing the B array, and finally summing them both into the C array. Note that this is a highly inefficient way to perform a vector sum, but it is written only to demonstrate the concept discussed in this section. The OpenACC directives place the work of initializing A and B into queues 1 and 2, respectively. The summation loop should not be run until both queues 1 and 2 have completed, so an asynchronous wait directive is used to force queue 1 to wait until the work in queue 2 has completed before moving forward.

On first reading, the asynchronous wait directive may feel backward, so let’s break down what it is saying. First, the directive says to wait on queue 2, but then it also says to place this directive asynchronously into work queue 1 so that the host CPU is free to move on to enqueue the summation loop into queue 1 and then wait for it to complete. Effectively, queues 1 and 2 are joined together at the asynchronous wait directive and then queue 1 continues with the summation loop. By writing the code in this way you enable the OpenACC runtime to resolve the dependency between queues 1 and 2 on the device, for devices that support this, rather than block the host CPU. This results in better utilization of system resources, because the host CPU no longer must idly wait for both queues to complete.

Listing 10.4 Example of asynchronous wait directive in C/C++


#pragma acc data create(A[:N],B[:N],C[:N])
  {
#pragma acc parallel loop async(1)
  for(int i=0; i<N; i++)
  {
    A[i] = 1.f;
  }
#pragma acc parallel loop async(2)
  for(int i=0; i<N; i++)
  {
    B[i] = 2.f;
  }
#pragma acc wait(2) async(1)
#pragma acc parallel loop async(1)
  for(int i=0; i<N; i++)
  {
    C[i] = A[i] + B[i];
  }
#pragma acc update self(C[0:N]) async(1)
  // Host CPU free to do other, unrelated work
#pragma acc wait(1)
  }


Interoperating with CUDA Streams

When developing for an NVIDIA GPU, you may need to mix OpenACC’s asynchronous work queues with CUDA streams to ensure that any dependencies between OpenACC and CUDA kernels or accelerated libraries are properly exposed. Chapter 9 demonstrates that it is possible to use the acc_get_cuda_stream function to obtain the CUDA stream associated with an OpenACC work queue. For instance, in Listing 10.5, you see that acc_get_cuda_stream is used to obtain a CUDA stream, which is then used by the NVIDIA cuBLAS library to allow the arrays populated in queue 1 in OpenACC to then safely be passed between the asynchronous loop in line 6, the asynchronous library call in line 15, and the update directive in line 16. By getting the CUDA stream associated with queue 1 in this way, you can still operate asynchronously from the host thread while ensuring that the dependency between the OpenACC directives and the cublas function is met. This feature is useful when you are adding asynchronous CUDA or library calls to a primarily OpenACC application.

Listing 10.5 Example of acc_get_cuda_stream function


 1   cudaStream_t stream = (cudaStream_t)acc_get_cuda_stream(1);
 2   status = cublasSetStream(handle,stream);
 3
 4 #pragma acc data create(a[:N],b[:N])
 5   {
 6 #pragma acc parallel loop async(1)
 7     for(int i=0;i<N;i++)
 8     {
 9       a[i] = 1.f;
10       b[i] = 0.f;
11     }
12
13 #pragma acc host_data use_device(a,b)
14     {
15       status = cublasSaxpy(handle, N, &alpha, a, 1, b, 1);
16 #pragma acc update self(b[:N]) async (1)
17 #pragma acc wait(1)
18     }
19   }


It is also possible to work in reverse, associating a CUDA stream used in an application with a particular OpenACC work queue. You achieve this by calling the acc_set_cuda_stream function with the CUDA stream you wish to use and the queue number with which it should be associated. In Listing 10.6, you see that I have reversed my function calls so that I first query which stream the cublas library will use and then associate that stream with OpenACC’s work queue 1. Such an approach may be useful when you are adding OpenACC to a code that already has extensive use of CUDA or accelerated libraries, because you can reuse existing CUDA streams.

Listing 10.6 Example of acc_set_cuda_stream function


 1   status = cublasGetStream(handle, &stream);
 2   acc_set_cuda_stream(1, (void*)stream);
 3
 4 #pragma acc data create(a[:N],b[:N])
 5   {
 6 #pragma acc parallel loop async(1)
 7     for(int i=0;i<N;i++)
 8     {
 9       a[i] = 1.f;
10       b[i] = 0.f;
11     }
12
13 #pragma acc host_data use_device(a,b)
14     {
15       status = cublasSaxpy(handle, N, &alpha, a, 1, b, 1);
16     }
17 #pragma acc update self(b[:N]) async (1)
18 #pragma acc wait(1)
19   }


These examples strictly work on NVIDIA GPUs with CUDA, but OpenACC compilers using OpenCL have a similar API for interoperating with OpenCL work queues.

10.1.2 Software Pipelining

This section demonstrates one technique that is enabled by asynchronous programming: software pipelining. Pipelining is not the only way to take advantage of asynchronous work queues, but it does demonstrate how work queues can allow better utilization of system resources by ensuring that multiple operations can happen at the same time. For this example, we will use a simple image-filtering program that reads an image, performs some manipulations on the pixels, and then writes the results back out. Listing 10.7 shows the main filter routine, which has already been accelerated with OpenACC directives.

Listing 10.7 Baseline OpenACC image-filtering code


const int filtersize = 5;
const double filter[5][5] =
{
   0,  0,  1,  0,  0,
   0,  2,  2,  2,  0,
   1,  2,  3,  2,  1,
   0,  2,  2,  2,  0,
   0,  0,  1,  0,  0
};
// The denominator for scale should be the sum
// of non-zero elements in the filter.
const float scale = 1.0 / 23.0;

void blur5(unsigned char *restrict imgData,
 unsigned char *restrict out, long w, long h, long ch, long step)
{
  long x, y;

#pragma acc parallel loop collapse(2) gang vector \
            copyin(imgData[0:w * h * ch]) \
            copyout(out[0:w * h * ch])
  for ( y = 0; y < h; y++ )
  {
    for ( x = 0; x < w; x++ )
    {
      float blue = 0.0, green = 0.0, red = 0.0;
      for ( int fy = 0; fy < filtersize; fy++ )
      {
        long iy = y - (filtersize/2) + fy;
        for ( int fx = 0; fx < filtersize; fx++ )
        {
          long ix = x - (filtersize/2) + fx;
          if ( (iy<0)  || (ix<0) ||
               (iy>=h) || (ix>=w) ) continue;
          blue  += filter[fy][fx] *
                  (float)imgData[iy * step + ix * ch];
          green += filter[fy][fx] *
                   (float)imgData[iy * step + ix * ch + 1];
          red   += filter[fy][fx] *
                   (float)imgData[iy * step + ix * ch + 2];
        }
      }
      out[y * step + x * ch]      = 255 - (scale * blue);
      out[y * step + x * ch + 1 ] = 255 - (scale * green);
      out[y * step + x * ch + 2 ] = 255 - (scale * red);
    }
  }
}


We start by analyzing the current application performance using the PGI compiler, an NVIDIA GPU, and the PGProf profiling tool. Figure 10.3 shows a screenshot of PGProf displaying the GPU timeline for the filter application. For reference, the data presented is for a Tesla P100 accelerator and an 8,000×4,051 pixel image. Notice that copying the image data to and from the device takes nearly as much time as actually performing the filtering of the image. The program spends 8.8ms copying the image to the device, 8.2ms copying the results back from the device, and 20.5ms performing the filtering operation.


Figure 10.3 PGProf timeline for baseline image filter

The most significant performance limiter in this application is the time spent copying data between the CPU and the GPU over PCIe. Unfortunately, you cannot remove that time, because you do need to perform those copies for the program to work, so you need to look for ways to overlap the data copies with the GPU work. If you could completely overlap both directions of data transfer with the filtering kernel, then the total run time would drop from roughly 37.5ms (8.8ms + 20.5ms + 8.2ms) to only the 20.5ms of kernel time, a speedup of about 1.8×, so this is the maximum speedup you can obtain by optimizing to hide the data movement.

How can you overlap the data transfers, though, if you need to get the data to the device before you start and cannot remove it until you have computed the results? The trick is to recognize that you do not need all of the data on the device to proceed; you need only enough to kick things off. If you could break the work into discrete parts, just as you did with cooking dinner earlier, then you can operate on each smaller part independently and potentially overlap steps that do not require the same system resource at the same time. Image manipulation is a perfect case study for this sort of optimization, because each pixel of an image is independent of all the others, so the image itself can be broken into smaller parts that can be independently filtered.

Blocking the Computation

Before we can pipeline the operations, we need to refactor the code to operate on chunks, commonly referred to as blocks, of independent work. The code, as it is written, essentially has one block. By adding a loop around this operation and adjusting the bounds of the existing loops to avoid redundant work, you can make the code do exactly the same thing it does currently, but in two or more blocks. You write the code to break the operation only along rows, because this means that each block is contiguous in memory, something that will be important in later steps. It should also be written to allow for any number of blocks from one up to the height of the image, giving you maximum flexibility later to find just the right number of blocks. Listing 10.8 shows the new code with the addition of the blocking loop.

Listing 10.8 Blocked OpenACC image-filtering code


 1 void blur5_blocked(unsigned char *restrict imgData,
 2                    unsigned char *restrict out,
 3                    long w, long h, long ch, long step)
 4 {
 5   long x, y;
 6   const int nblocks = 8;
 7
 8   long blocksize = h/ nblocks;
 9 #pragma acc data copyin(imgData[:w*h*ch],filter) \
10                     copyout(out[:w*h*ch])
11   for ( long blocky = 0; blocky < nblocks; blocky++)
12   {
13     // For data copies we include the ghost zones for filter
14     long starty = blocky * blocksize;
15     long endy   = starty + blocksize;
16 #pragma acc parallel loop collapse(2) gang vector
17     for ( y = starty; y < endy; y++ )
18     {
19       for ( x = 0; x < w; x++ )
20       {
21         float blue = 0.0, green = 0.0, red = 0.0;
22         for ( int fy = 0; fy < filtersize; fy++ )
23         {
24           long iy = y - (filtersize/2) + fy;
25           for ( int fx = 0; fx < filtersize; fx++ )
26           {
27             long ix = x - (filtersize/2) + fx;
28             if ( (iy<0)  || (ix<0) ||
29                 (iy>=h) || (ix>=w) ) continue;
30             blue  += filter[fy][fx] *
31                      (float)imgData[iy * step + ix * ch];
32             green += filter[fy][fx] * (float)imgData[iy * step + ix *
33                      ch + 1];
34             red   += filter[fy][fx] * (float)imgData[iy * step + ix *
35                      ch + 2];
36           }
37         }
38         out[y * step + x * ch]      = 255 - (scale * blue);
39         out[y * step + x * ch + 1 ] = 255 - (scale * green);
40         out[y * step + x * ch + 2 ] = 255 - (scale * red);
41       }
42     }
43   }
44 }


There are a few important things to note from this code example. First, we have blocked only the computation, and not the data transfers. This is because we want to ensure that we obtain correct results by making only small changes to the code with each step. Second, notice that within each block we calculate the starting and ending rows of the block and adjust the y loop accordingly (lines 14–17). Failing to make this adjustment would produce correct results but would increase the run time proportionally to the number of blocks. With this change in place, you will see later in Table 10.1 that the performance is slightly worse than the original. This is because we have introduced a little more overhead to the operation to manage the multiple blocks of work. However, this additional overhead will be more than compensated for in the final code.


Figure 10.4 PGProf timeline for multidevice, software pipelined image filter

Blocking the Data Movement

Now that the computation has been blocked and is producing correct results, it’s time to break up the data movement between the host and the accelerator to place it within blocks as well. To do this, you first need to change the data clauses for the imgData and out arrays to remove the data copies around the blocking loop (see Listing 10.9, line 9). This can be done by changing both arrays to use the create data clause. Next, you add two acc update directives: one at the beginning of the block’s execution (line 17) to copy the input data from imgData before the computation, and another after the computation to copy the results back into the out array (line 48).

Our image filter has one subtlety that needs to be addressed during this step. The filter operation for element [i][j] of the image requires the values from several surrounding pixels, particularly two to the left and right, and two above and below. This means that for the filter to produce the same results, when you copy a block’s portion of imgData to the accelerator you also need to copy the two rows before that block and the two after, if they exist. These additional rows are frequently referred to as a halo or ghost zone, because they surround the core part of the data. In Listing 10.9, notice that, to accommodate the surrounding rows, the starting and ending rows for the update directive are slightly different from the starting and ending rows of the compute loops. Not all computations require this overlapping, but because of the nature of the image filter, this application does. The full code for this step is found in Listing 10.9.

Listing 10.9 Blocked OpenACC image-filtering code with separate update directives


 1 void blur5_update(unsigned char *restrict imgData,
 2    unsigned char *restrict out, long w, long h,
 3    long ch, long step)
 4 {
 5   long x, y;
 6   const int nblocks = 8;
 7
 8   long blocksize = h/ nblocks;
 9 #pragma acc data create(imgData[:w*h*ch],out[:w*h*ch]) \
10                        copyin(filter)
11   {
12   for ( long blocky = 0; blocky < nblocks; blocky++)
13   {
14     // For data copies we include ghost zones for the filter
15     long starty = MAX(0,blocky * blocksize - filtersize/2);
16     long endy   = MIN(h,starty + blocksize + filtersize/2);
17 #pragma acc update \
18               device(imgData[starty*step:(endy-starty)*step])
19     starty = blocky * blocksize;
20     endy = starty + blocksize;
21 #pragma acc parallel loop collapse(2) gang vector
22     for ( y = starty; y < endy; y++ )
23     {
24       for ( x = 0; x < w; x++ )
25       {
26         float blue = 0.0, green = 0.0, red = 0.0;
27         for ( int fy = 0; fy < filtersize; fy++ )
28         {
29           long iy = y - (filtersize/2) + fy;
30           for ( int fx = 0; fx < filtersize; fx++ )
31           {
32             long ix = x - (filtersize/2) + fx;
33             if ( (iy<0)  || (ix<0) ||
34                 (iy>=h) || (ix>=w) ) continue;
35             blue  += filter[fy][fx] *
36                     (float)imgData[iy * step + ix * ch];
37             green += filter[fy][fx] *
38                     (float)imgData[iy * step + ix * ch + 1];
39             red   += filter[fy][fx] *
40                     (float)imgData[iy * step + ix * ch + 2];
41           }
42         }
43         out[y * step + x * ch]      = 255 - (scale * blue);
44         out[y * step + x * ch + 1 ] = 255 - (scale * green);
45         out[y * step + x * ch + 2 ] = 255 - (scale * red);
46       }
47     }
48 #pragma acc update self(out[starty*step:blocksize*step])
49   }
50   }
51 }


Again, you need to run the code and check the results before moving to the next step. You see in Table 10.1 that the code is again slightly slower than the previous version because of increased overhead, but that will soon go away.

Making It Asynchronous

The last step in implementing software pipelining using OpenACC is to make the whole operation asynchronous. Again, the key requirement for making an operation work effectively asynchronously is to expose the dependencies or data flow in the code so that independent operations can be run concurrently with sufficient resources. You will use OpenACC’s work queues to expose the flow of data through each block of the image. For each block, the code must update the device, apply the filter, and then update the host, and it must be done in that order. This means that for each block these three operations should be put in the same work queue. These operations for different blocks may be placed in different queues, so the block number is a convenient way to refer to the work queues. There is overhead involved in creating new work queues, so it is frequently a best practice to reuse some smaller number of queues throughout the computation.

Because we know that we are hoping to overlap a host-to-device copy, accelerator computation, and a device-to-host copy, it seems as if three work queues should be enough. For each block of the operation, you use the block number modulo 3 to select which queue receives that block’s work. As a personal preference, I generally choose to reserve queue 0 for special situations, so for this example we will always add 1 to the queue number to avoid using queue 0.

Listing 10.10 shows the code from this step, which adds the async clause to both update directives and the parallel loop, using the formula just discussed to determine the work queue number. However, one critical and easy-to-miss step remains before we can run the code: adding synchronization to the end of the computation. If we did not synchronize before writing the image to a file, we’d almost certainly produce an image that’s only partially filtered. The wait directive in line 52 of Listing 10.10 prevents the host thread from moving forward until after all asynchronous work on the accelerator has completed. The final code for this step appears in Listing 10.10.

Listing 10.10 Final, pipelined OpenACC filtering code with asynchronous directives


 1 void blur5_pipelined(unsigned char *restrict imgData,
 2    unsigned char *restrict out, long w, long h,
 3    long ch, long step)
 4 {
 5   long x, y;
 6   const int nblocks = 8;
 7
 8   long blocksize = h/ nblocks;
 9 #pragma acc data create(imgData[:w*h*ch],out[:w*h*ch])
10   {
11   for ( long blocky = 0; blocky < nblocks; blocky++)
12   {
13     // For data copies we include ghost zones for the filter
14     long starty = MAX(0,blocky * blocksize - filtersize/2);
15     long endy   = MIN(h,starty + blocksize + filtersize/2);
16 #pragma acc update \
17             device(imgData[starty*step:(endy-starty)*step]) \
18             async(blocky%3+1)
19     starty = blocky * blocksize;
20     endy = starty + blocksize;
21 #pragma acc parallel loop collapse(2) gang vector \
22                       async(blocky%3+1)
23     for ( y = starty; y < endy; y++ )
24     {
25       for ( x = 0; x < w; x++ )
26       {
27         float blue = 0.0, green = 0.0, red = 0.0;
28         for ( int fy = 0; fy < filtersize; fy++ )
29         {
30           long iy = y - (filtersize/2) + fy;
31           for ( int fx = 0; fx < filtersize; fx++ )
32           {
33             long ix = x - (filtersize/2) + fx;
34             if ( (iy<0)  || (ix<0) ||
35                 (iy>=h) || (ix>=w) ) continue;
36             blue  += filter[fy][fx] *
37                      (float)imgData[iy * step + ix * ch];
38             green += filter[fy][fx] *
39                      (float)imgData[iy * step + ix * ch + 1];
40             red   += filter[fy][fx] *
41                      (float)imgData[iy * step + ix * ch + 2];
42           }
43         }
44         out[y * step + x * ch]      = 255 - (scale * blue);
45         out[y * step + x * ch + 1 ] = 255 - (scale * green);
46         out[y * step + x * ch + 2 ] = 255 - (scale * red);
47       }
48     }
49 #pragma acc update self(out[starty*step:blocksize*step]) \
50                    async(blocky%3+1)
51   }
52 #pragma acc wait
53   }
54 }


What is the net result of these changes? First, notice in Table 10.1 that the run time of the whole operation has now improved by quite a bit. Compared with the original version, you see an improvement of 1.8×, which is roughly what we predicted earlier for fully overlapped code. Figure 10.5 is another timeline from PGProf, which shows that we have achieved exactly what we set out to do. Notice that nearly all of the data movement is overlapped with computation on the GPU, making it essentially free. The little bit of data transfer at the beginning and end of the computation is offset by a small amount of overlap between subsequent filter kernels as each kernel begins to run out of work and the next one begins to run. You can experiment with the number of blocks and find the optimal value, which may vary depending on the machine on which you are running. In Figure 10.5, we are obviously operating on eight blocks.


Figure 10.5 PGProf timeline for software pipelined image filter

Table 10.1 Image filter performance at each optimization step

OPTIMIZATION STEP       TIME (MS)   SPEEDUP

Baseline                43.102      1.00×
Blocked                 45.217      0.95×
Blocked with update     52.161      0.83×
Asynchronous            23.602      1.83×

10.2 Multidevice Programming

Many systems are now built with not only one, but several accelerator devices. This section discusses ways to extend existing code to divide work across multiple devices.

10.2.1 Multidevice Pipeline

In the preceding section, you modified an image-filtering program to operate over blocks of the image so that you could overlap the cost of transferring the requisite data to and from the device with computation on the device. Here is something to consider about that example: If each block is fully independent of the other blocks and manages its data correctly, do those blocks need to be run on the same device? The answer, of course, is no, but how do you send blocks to different devices using OpenACC?

To do that, OpenACC provides the acc_set_device_num function. This function accepts two arguments: a device type and a device number. With those two pieces of information, you should be able to choose any device in your machine and direct your work to that device. Once you have called acc_set_device_num, all OpenACC directives encountered by the host thread up until the next call to acc_set_device_num will be handled by the specified device. Each device will have its own copies of the data (if your program is running on a device that has a distinct memory) and its own work queues. Also, each host thread can set its own device number, something you’ll take advantage of in this section.
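Although this chapter later dedicates a host thread to each device (Listing 10.13), that is only a convenience. The following minimal sketch, which is not one of the chapter’s listings (the array A, its size, and the even division of work are illustrative assumptions), shows a single host thread dividing a simple initialization across every available device by switching the current device before issuing each chunk of work.

#include <openacc.h>

#define N 1000000

int main(void)
{
  static float A[N];
  int ndev = acc_get_num_devices(acc_device_default);
  if (ndev < 1) ndev = 1;   // fall back to whatever default device exists
  long chunk = N / ndev;

  for (int d = 0; d < ndev; d++)
  {
    long start = d * chunk;
    long len   = (d == ndev - 1) ? (N - start) : chunk;

    // Every directive issued by this host thread after this call targets
    // device d, until the next call to acc_set_device_num.
    acc_set_device_num(d, acc_device_default);

#pragma acc parallel loop copyout(A[start:len]) async(1)
    for (long i = start; i < start + len; i++)
      A[i] = 2.0f * (float)i;
  }

  // Each device has its own work queues, so select each device in turn
  // and wait for its queue to drain.
  for (int d = 0; d < ndev; d++)
  {
    acc_set_device_num(d, acc_device_default);
#pragma acc wait
  }
  return 0;
}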

The OpenACC runtime library provides routines for querying and setting both the device type (e.g., NVIDIA GPU, AMD GPU, Intel Xeon Phi, etc.) and the device number. With these API routines you can direct both your data and compute kernels to different devices as needed. Let us first look at the routines for querying and setting the device type.

Listing 10.11 demonstrates the usage of acc_get_device_type and acc_set_device_type to query and set the type of accelerator to use by default. Setting only the type means that the OpenACC runtime will still choose the specific device within that type to use. You can take even more control and specify the exact device number within a device type to use.

Listing 10.11 Example usage of setting and getting OpenACC device type


 1 #include <stdio.h>
 2 #include <openacc.h>
 3 int main()
 4 {
 5   int error = 0;
 6   acc_device_t myDev = -1;
 7
 8   acc_set_device_type(acc_device_nvidia);
 9   myDev = acc_get_device_type();
10   if ( myDev != acc_device_nvidia )
11   {
 12     fprintf(stderr, "Wrong device type returned! %d != %d\n",
13            (int)myDev, (int)acc_device_nvidia);
14     error = -1;
15   }
16   return error;
17 }


Listing 10.12 demonstrates the acc_set_device_num and acc_get_device_num routines. This listing also demonstrates the acc_get_num_devices routine, which is used to learn how many devices of a particular type are available. This information can be useful in ensuring that the application is portable to any machine, no matter how many accelerators are installed.

Listing 10.12 Example usage of getting the number of devices, setting the current device, and getting the current device number


 1 #include <stdio.h>
 2 #include <openacc.h>
 3 int main()
 4 {
 5   int numDevices = -1, myNum = -1;
 6   numDevices = acc_get_num_devices(acc_device_nvidia);
 7   acc_set_device_num(0,acc_device_nvidia);
 8   myNum = acc_get_device_num(acc_device_nvidia);
 9
10   printf("Using device %d of %d. ", myNum, numDevices);
11
12   return 0;
13 }


Let’s extend the pipelined example from the preceding section to send the chunks of work to all available devices. Although I am running on an NVIDIA Tesla P100, I don’t want to assume that other users will have a machine that matches my own, so I will specify acc_device_default as my device type. This is a special device type value that tells the OpenACC runtime that I don’t care what specific device type it chooses. On my testing machine, I could also have used acc_device_nvidia, because that best describes the accelerators in my machine. I have chosen to use OpenMP to assign a CPU thread to each of my accelerator devices. This is not strictly necessary (a single thread could send work to all devices), but in my opinion this simplifies the code because I can call acc_set_device_num once in each thread and not worry about it again. Listing 10.13 shows the resulting multidevice code.

Listing 10.13 Multidevice image-filtering code using OpenACC and OpenMP


 1 #include <openacc.h>
 2 #include <omp.h>
 3 void blur5_pipelined_multi(unsigned char *restrict imgData,
 4                            unsigned char *restrict out,
 5                            long w, long h, long ch, long step)
 6 {
 7   int nblocks = 32;
 8
 9   long blocksize = h/ nblocks;
10 #pragma omp parallel \
11           num_threads(acc_get_num_devices(acc_device_default))
12   {
13     int myid = omp_get_thread_num();
14     acc_set_device_num(myid,acc_device_default);
15     int queue = 1;
16 #pragma acc data create(imgData[:w*h*ch],out[:w*h*ch])
17   {
18 #pragma omp for schedule(static)
19   for ( long blocky = 0; blocky < nblocks; blocky++)
20   {
21     // For data copies we include ghost zones for the filter
22     long starty = MAX(0,blocky * blocksize - filtersize/2);
23     long endy   = MIN(h,starty + blocksize + filtersize/2);
24 #pragma acc update device \
25                    (imgData[starty*step:(endy-starty)*step]) \
26                    async(queue)
27     starty = blocky * blocksize;
28     endy = starty + blocksize;
29 #pragma acc parallel loop collapse(2) gang vector async(queue)
30     for ( long y = starty; y < endy; y++ )
31     {
32       for ( long x = 0; x < w; x++ )
33       {
34         float blue = 0.0, green = 0.0, red = 0.0;
35         for ( int fy = 0; fy < filtersize; fy++ )
36         {
37           long iy = y - (filtersize/2) + fy;
38           for ( int fx = 0; fx < filtersize; fx++ )
39           {
40             long ix = x - (filtersize/2) + fx;
41             if ( (iy<0)  || (ix<0) ||
42                 (iy>=h) || (ix>=w) ) continue;
43             blue  += filter[fy][fx] *
44                      (float)imgData[iy * step + ix * ch];
45             green += filter[fy][fx] *
46                      (float)imgData[iy * step + ix * ch + 1];
47             red   += filter[fy][fx] *
48                      (float)imgData[iy * step + ix * ch + 2];
49           }
50         }
51         out[y * step + x * ch]      = 255 - (scale * blue);
52         out[y * step + x * ch + 1 ] = 255 - (scale * green);
53         out[y * step + x * ch + 2 ] = 255 - (scale * red);
54       }
55     }
56 #pragma acc update self(out[starty*step:blocksize*step]) \
57                      async(queue)
58     queue = (queue%3)+1;
59   }
60 #pragma acc wait
61   }
62   }
63 }


In line 10 of Listing 10.13, you start by querying the number of available devices and then use that number to spawn an OpenMP parallel region with that number of threads. In line 14, which is run redundantly by all threads in the OpenMP region, you set the device number that will be used by each individual thread, using the thread number to select the device. Notice also that you change from using a simple modulo operator to set the work queue number to having a per-thread counter. This change prevents the modulo from assigning all chunks from a given device to a reduced number of queues when certain device counts are used. The remainder of the code is unchanged, because each device will have its own data region (line 16) and will have its own work queues.

Table 10.2 shows the results of running the code on one, two, and four GPUs (three is skipped because it would not divide the work evenly). Notice that you see a performance improvement with each additional device, but it is not a strict doubling as the number of devices doubles. It is typical to see some performance degradation as more devices are added because of additional overhead brought into the calculation by each device.

Table 10.2 Multiple-device pipeline timings

NUMBER OF DEVICES   TIME (MS)   SPEEDUP

1                   20.628      1.00×
2                   12.041      1.71×
4                    7.769      2.66×

Figure 10.4, presented earlier, shows another screenshot from PGProf, in which you can clearly see overlapping data movement and computation across multiple devices. Note that in the example code it was necessary to run the operation twice so that the first time would absorb the start-up cost of the multiple devices and the second would provide fair timing. In short-running benchmarks like this one, such a step is needed to prevent the start-up time from dominating the performance. In real, long-running applications, this is not necessary because the start-up cost will be hidden by the long runtime.
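A minimal sketch of that measurement pattern is shown below. It is not part of the chapter’s source; it assumes the blur5_pipelined_multi routine and its arguments from Listing 10.13 are in scope, and it uses omp_get_wtime purely as a convenient portable timer.

#include <stdio.h>
#include <omp.h>    // omp_get_wtime() is used here only as a portable timer

void time_filter(unsigned char *restrict imgData, unsigned char *restrict out,
                 long w, long h, long ch, long step)
{
  // Warm-up call: absorbs device initialization and other one-time costs.
  blur5_pipelined_multi(imgData, out, w, h, ch, step);

  // Timed call: measures only the steady-state cost of the filter.
  double t0 = omp_get_wtime();
  blur5_pipelined_multi(imgData, out, w, h, ch, step);
  double t1 = omp_get_wtime();
  printf("Filter time: %.3f ms\n", (t1 - t0) * 1000.0);
}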

10.2.2 OpenACC and MPI

Another common method for managing multiple devices with OpenACC is to use MPI to divide the work among multiple processes, each of which is assigned to work with an individual accelerator. MPI is commonly used in scientific and high-performance computing applications, so many applications already use MPI to divide work among multiple processes.

This approach has several advantages and disadvantages compared with the approach used in the preceding section. For one thing, application developers are forced to consider how to divide the problem domain into discrete chunks, which are assigned to different processors, thereby effectively isolating each GPU from the others to avoid possible race conditions on the same data. Also, because the domain is divided among multiple processes, it is possible that these processes may span multiple nodes in a cluster, and this increases both the available memory and the number of accelerators available. For applications already using MPI to run in parallel, using MPI to manage multiple accelerators often requires fewer code changes compared with managing multiple devices using OpenACC alone. Using existing MPI also has the advantage that non-accelerated parts of the code can still use MPI parallelism to avoid a serialization slowdown due to Amdahl’s law. When you’re dealing with GPUs and other discrete accelerators, it is common for a system to have more CPU cores than it does discrete accelerators. In these situations, it is common to either share a single accelerator among multiple MPI ranks or to thread within each MPI rank to better utilize all system resources.

When you are using MPI to utilize multiple discrete accelerators, to obtain the best possible performance it is often necessary to understand how the accelerators and host CPUs are connected. For instance, if a system has two multicore CPU sockets and four GPU cards, as shown in Figure 10.6, then it’s likely that each CPU socket will be directly connected to two GPUs but will have to communicate with the other two GPUs by way of the other CPU, and this reduces the available bandwidth and increases the latency between the CPU and the GPU.


Figure 10.6 System with two multicore CPU sockets and four GPU cards

Furthermore, in some GPU configurations each pair of GPUs may be able to communicate directly with each other, but only indirectly with GPUs on the opposite socket. Best performance is achieved when each rank is assigned to the nearest accelerator, something known as good accelerator affinity. Although some MPI job launchers will handle assigning a GPU to each MPI rank, you should not count on this behavior; instead you should use the acc_set_device_num function to manually assign a particular accelerator to each MPI rank. By understanding how MPI assigns ranks to CPU cores and how the CPUs and accelerators are configured, you can assign the accelerator affinity so that each rank is communicating with the nearest available accelerator, thereby maximizing host-device bandwidth and minimizing latency. For instance, in Figure 10.6, you would want to assign GPU0 and GPU1 to ranks physically assigned to CPU0, and GPU2 and GPU3 to ranks physically assigned to CPU1. This is true whether ranks are assigned to the CPUs in a round-robin manner, packed to CPU0 and then CPU1, or in some other order.
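One common way to implement that assignment is sketched below. This helper is not taken from the chapter’s sample code (the function name and the round-robin mapping are illustrative): each rank discovers its node-local rank with MPI_Comm_split_type and then selects a device with acc_set_device_num. Note that the simple modulo mapping gives the desired affinity only when the launcher also binds ranks to the appropriate CPU sockets, so it is worth checking how your MPI launcher places ranks on cores.

#include <mpi.h>
#include <openacc.h>

// Illustrative helper: map each MPI rank to one of the accelerators on its
// node by round-robin over the node-local rank. Assumes MPI_Init has
// already been called.
void assign_device_to_rank(void)
{
  MPI_Comm local_comm;
  int local_rank, num_devices;

  // Ranks that share a node land in the same node-local communicator.
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                      MPI_INFO_NULL, &local_comm);
  MPI_Comm_rank(local_comm, &local_rank);

  // Spread the node-local ranks across the node's accelerators.
  num_devices = acc_get_num_devices(acc_device_default);
  if (num_devices > 0)
    acc_set_device_num(local_rank % num_devices, acc_device_default);

  MPI_Comm_free(&local_comm);
}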

Many MPI libraries can now communicate directly with accelerator memory, particularly on clusters utilizing NVIDIA GPUs, which support a feature known as GPUDirect RDMA (remote direct memory access). When you are using one of these MPI libraries it’s possible to use the host_data directive to pass accelerator memory to the library, as shown in Chapter 9 and discussed next.

MPI Without Direct Memory Access

Let’s first consider an OpenACC application using MPI that cannot use an accelerator-aware MPI library. In this case, any call to an MPI routine would require the addition of an update directive before, after, or potentially before and after the library call, depending on how the data are used. Although MPI does support asynchronous sending and receiving of messages, the MPI API does not understand OpenACC’s concept of work queues, and therefore either update directives must be synchronous or a wait directive must appear before any MPI function that sends data. Similarly, an OpenACC update directive should not appear directly following an asynchronous MPI function that receives data, but only after synchronous communication or the associated MPI_Wait call; in this way, you ensure that the data in the host buffer are correct before they are updated to the device. Listing 10.14 shows several common communication patterns for interacting between OpenACC and an MPI library that is not accelerator aware.

Listing 10.14 Common patterns for MPI communication with OpenACC updates


// When using MPI_Sendrecv or any blocking collective, it is
// necessary to have an update before and after.

#pragma acc update self(u_new[offset_first_row:m-2])
MPI_Sendrecv(u_new+offset_first_row, m-2, MPI_DOUBLE, t_nb, 0,
             u_new+offset_bottom_boundary, m-2,
             MPI_DOUBLE, b_nb, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
#pragma acc update device(u_new[offset_bottom_boundary:m-2])
// When using either MPI_Send or MPI_Isend, an update is
// required before the MPI call
#pragma acc update self(u_new[offset_first_row:m-2])
MPI_Send(u_new+offset_first_row, m-2, MPI_DOUBLE, t_nb, 0,
         MPI_COMM_WORLD);

// When using a blocking receive, an update may appear
// immediately after the MPI call.

MPI_Recv(u_new+offset_bottom_boundary, m-2, MPI_DOUBLE, b_nb, 0,
         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
#pragma acc update device(u_new[offset_bottom_boundary:m-2])

// When using a nonblocking receive, it is necessary to call
// MPI_Wait before using an OpenACC update.

MPI_Irecv(u_new+offset_bottom_boundary, m-2, MPI_DOUBLE, b_nb, 0,
          MPI_COMM_WORLD, &request);
MPI_Wait(&request, MPI_STATUS_IGNORE);
#pragma acc update device(u_new[offset_bottom_boundary:m-2])


The biggest drawback to this approach is that all interactions between OpenACC and MPI require a copy through host memory. This means that there can never be effective overlapping between MPI and OpenACC and that it may be necessary to communicate the data over the PCIe bus twice as often. If possible, this approach to using MPI with OpenACC should be avoided.

MPI with Direct Memory Access

Having an MPI implementation that is accelerator aware greatly simplifies the mixture of OpenACC and MPI. It is no longer necessary to manually stage data through the host memory using the update directive, but rather the MPI library can directly read from and write to accelerator memory, potentially reducing the bandwidth cost associated with the data transfer and simplifying the code. Just because an MPI library is accelerator aware does not necessarily guarantee that data will be sent directly between the accelerator memories. The library may still choose to stage the data through host memory or perform a variety of optimizations invisibly to the developer. This means that it is the responsibility of the library developer, rather than the application developer, to determine the best way to send the message on a given machine. As discussed in Chapter 9, the OpenACC host_data directive is used for passing device pointers to host libraries. This same directive can be used to pass accelerator memory to an accelerator-aware MPI library. Note, however, that if the same code is used with a non-accelerator-aware library a program crash will occur. Listing 10.15 demonstrates some common patterns for using host_data to pass device arrays into the MPI library. This method has far fewer caveats to highlight than the preceding one.

Listing 10.15 Example using OpenACC host_data directive with accelerator-enabled MPI


// By using an accelerator-aware MPI library, host_data may be
// used to pass device memory directly into the library.

#pragma acc host_data use_device(u_new)
{
MPI_Sendrecv(u_new+offset_first_row, m-2, MPI_DOUBLE, t_nb, 0,
             u_new+offset_bottom_boundary, m-2,
             MPI_DOUBLE, b_nb, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

// Because the Irecv is writing directly to accelerator memory
// only the MPI_Wait is required to ensure the data are current,
// no additional update is required.

#pragma acc host_data use_device(u_new)
{
MPI_Irecv(u_new+offset_bottom_boundary, m-2, MPI_DOUBLE, b_nb, 0,
          MPI_COMM_WORLD, &request);
}
MPI_Wait(&request, MPI_STATUS_IGNORE );


One advantage of this approach over the other one comes when you’re using asynchronous MPI calls. In the earlier case the update of the device data could not occur until after MPI_Wait has completed, because the update may copy stale data to the device. Because the MPI library is interacting with accelerator memory directly, MPI_Wait will guarantee that data have been placed in the accelerator buffer, removing the overhead of updating the device after the receive has completed. Additionally, the MPI library can invisibly perform optimizations of the way the messages are transferred, such as pipelining smaller message blocks, a feature that may be more efficient than directly copying the memory or manually staging buffers through host memory, for all the same reasons pipelining was beneficial in the earlier example.

This section has presented two methods for utilizing multiple accelerator devices with OpenACC. The first method uses OpenACC’s built-in acc_set_device_num function call to direct data and work to specific devices. In particular, OpenMP threads are used to assign a different host thread to each device, although this is not strictly necessary. The second method uses MPI to partition work across multiple processes, each of which utilizes a different accelerator. This approach is often simpler to implement in scientific code, because such code frequently uses MPI partitioning, meaning that non-accelerated code may still run across multiple MPI ranks. Both approaches can take advantage of multiple accelerators in the same node, but only the MPI approach can be used to manage multiple accelerators on different nodes. Which approach will work best depends on the specific needs of the application, and both techniques can be used together in the same application.

The source code in GitHub accompanying this chapter (check the Preface for a link to GitHub) includes the image filter pipeline code used in this chapter and a simple OpenACC + MPI application that can be used as examples.

10.3 Summary

You can exploit advanced techniques using OpenACC asynchronous execution and management of multiple devices. These techniques are designed to maximize application performance by improving the way the application uses all available resources. Asynchronous execution decouples work on the host from work on the accelerator, including data transfers between the two, allowing the application to use the host, accelerator, and any connecting bus concurrently, but only if all data dependencies are properly handled. Multidevice support extends this idea by allowing a program to utilize all available accelerators for maximum performance.

10.4 Exercises

1. The source code for the image-filtering example used in this chapter has been provided at the GitHub site shown in the Preface. You should attempt to build and understand this sample application.

2. You should experiment to determine the best block size for your machine. Does changing the image size affect the result? Does changing the accelerator type change the result?
