Chapter 5. Compiling OpenACC

Randy Allen, Mentor Graphics

Getting eight independent oxen to work in unison is not simple. However, it is child’s play compared with coordinating a thousand chickens. In either case, the problem is to harness independently operating beasts so that their efforts are coordinated toward achieving a single goal.

Exploiting parallelism—whether for pulling wagons or for speeding up computation—is challenging. At one level, it requires abstract thinking. Understanding the portions of an algorithm that can exploit parallelism—and, for that matter, creating an algorithm that can exploit parallelism—requires that you deal with problems at an abstract level. Gaining abstract understanding and designing algorithms are tasks best performed by humans.

At a different level, scheduling a thousand processors to coordinate effectively and arranging for data to be at the right place at the right time is complex. Complex, but only in its detail, not in its intellectual content. The human ability to manage detail of this complexity is at best spotty, but computers excel at exactly this.

Computers manage detail. Humans understand abstraction. Both are required for successful parallel execution. As with the gatekeeper and keymaster in Ghostbusters, it seems natural to get these two together. OpenACC does exactly that.

This chapter overviews the challenges that you must solve to effectively utilize massive parallelism and presents the theoretical foundations underlying solutions to those challenges. Then it closes with details of how compilers combine that theory with OpenACC directives to effect your requests and get effective speedup on parallel architectures.

5.1 The Challenges of Parallelism

Exploiting parallelism requires solving two challenges. First is figuring out the parts of the program that can be executed in parallel. Second is scheduling the parallel parts on the available hardware and getting needed data to the right place at the right time. A prerequisite for both is understanding the parallelism available in hardware.

The next section overviews that topic. Subsequent sections detail the process by which compilers map programs to parallelism, highlighting the type of information that they need to successfully effect a mapping. Later sections detail how OpenACC directives provide that information and how compilers use it to generate parallel code. OpenACC is not free, and the concluding sections detail the challenges introduced by OpenACC directives.

5.1.1 Parallel Hardware

At its most fundamental level, parallelism is achieved by replicating something. For computer hardware, that something is usually functional units. A functional unit may be an entire processor (including a program counter and branching logic), or a simple arithmetic unit (only arithmetic operations). The more replication you implement, the more operations that can be performed. However, just as chickens and oxen require coordination in the form of a harness and a driver, functional units need to be coordinated. Software can serve the role of the driver, but the harness, particularly for thousands of processors in GPUs, requires hardware support.

The simplest way of coordinating functional units is to link their execution so that all the units are executing the same instruction at the same time. This method is more or less required with arithmetic units, because arithmetic units rarely contain program counters. For processors, this method of coordination is often effected by having the processors share a single program counter. Sharing a single program counter is the approach taken in NVIDIA GPUs. Obviously, having processors redundantly execute the same instruction is productive only if each operates on different data. Following this approach, each chicken is essentially marching in lockstep.

An alternative is to have completely independent processors. Under that model, functional units are free to execute any instruction at any time on any data. No clock, program counter, or other tie constrains the functional units. At any given time, the processors may be executing different instructions on different data. In the pioneer analogy, each ox or chicken would be proceeding forward at its own pace, oblivious to the progress of the other animals.

Examples of functional units that concurrently execute the same instruction are vector units and lockstep processors. Generally, these are classified as single-instruction multiple-data (SIMD)1 or single-instruction multiple-thread (SIMT).2 The defining point semantically is simultaneous execution.

1. J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, second edition (San Francisco: Morgan Kaufmann, 1996).

2. M. McCool, J. Reinders, and A. Robison, Structured Parallel Programming: Patterns for Efficient Computation (San Francisco: Elsevier, 2013).

Examples of functional units that execute independently abound, and they are illustrated by virtually any processors connected by either an internal bus or an external network. The processors may share memory; memory may be distributed among them; or memory may be split between those two options. In any case, because the functional units are loosely connected, the relative order in which operations execute on two processors is indeterminate. This execution is multiple-instruction multiple-data (MIMD).3

3. Hennessy and Patterson, Computer Architecture.

GPUs contain both types of parallelism in their streaming multiprocessors (SMs) and warps. With an understanding of basic hardware parallelism, next we turn to understanding the mapping from programming languages to hardware.

5.1.2 Mapping Loops

To productively utilize hundreds or thousands of processors, an application must be doing a great deal of computation. Scheduling that many processors requires regularity. Without regularity, many processors can be programmed to do only one task. Without regularity, processors are figuratively executing like chickens flying in random directions. For programming languages, "regularity" and "repetition" imply loops. And, unless an application is boringly repeating the same calculation, regularity also means arrays. Pragmatically speaking, effectively utilizing parallelism is a matter of mapping loops onto parallel hardware. Also pragmatically speaking, this mapping is performed by having processors execute loop iterations. In other words, the unit of regularity is the loop body, and the loop iterations are partitioned among the processors.

Loops in standard programming languages are, by definition, sequential (meaning that the iterations of the loop are executed in consecutive order). The semantics of parallel hardware, by the adjective parallel, are not. Given that, it is not immediately obvious that loops can be directly mapped to parallel hardware. In fact, not all loops can be.4

4. J. R. Allen and K. Kennedy, Optimizing Compilers for Modern Architectures (San Francisco: Morgan Kaufmann, 2001).

Functional units that execute a single instruction implement simultaneous semantics. In terms of array accesses in loops, simultaneous semantics means that the behavior of the statement is defined by the final result obtained if all iterations of the loop are executed together. Consider the following C fragment:

for(i=0; i<n; i++) {
    a[i] = a[i] + 1;
}

Each element of a is incremented by 1. Because each element is accessed in only one loop iteration, the order in which the iterations occur is irrelevant. The loop may be iterated forward as written, backward, simultaneously (as with a vector processor or lockstep processor), or randomly (as with independent processors). The result will be the same.

This is not true of all loops. Here we change the fragment slightly:

for(i=1; i<n; i++) {
    a[i] = a[i-1] + 1;
}

This fragment sets each element of a to the value of its ordinal position (assuming elements are initially 0). Executed sequentially, a loop iteration picks up the value of a stored by the preceding iteration, increments it, and stores the result for the next iteration. Simultaneous execution changes the result. Executed simultaneously, all the initial values of a are fetched, then incremented, and finally stored. No updated values are used. Stated differently, every element of a is set to 1.

The reason this loop produces different results when executed simultaneously is that the loop body depends on values produced by previous iterations. When the loop executes simultaneously, the newly computed values are not available until the computation is complete. That is too late for the dependent iteration to use.

Because MIMD processors may execute loop iterations in any order, and because simultaneous execution is one possible order, not all sequential loops can be executed correctly on MIMD hardware. Sequential loops in which no statement depends on itself, directly or indirectly, may be executed simultaneously. The condition that determines whether sequential loops will execute correctly with indeterminate execution order is more restrictive. A loop can be executed in MIMD fashion if and only if no memory location is written/read, read/written, or written/written on two different loop iterations. To illustrate the difference, the following can be executed correctly simultaneously but not indeterminately.

for(i=1; i<n; i++) {
    a[i] = b[i] + c[i];
    d[i] = a[i-1] + 1;
}

Executed simultaneously, all executions of the first statement complete before any execution of the second statement begins. The second statement correctly gets updated values. Executed indeterminately, loop iteration 2 (for instance) may execute before loop iteration 1 starts. If that happens, loop iteration 2 will incorrectly fetch old values of a.

5.1.3 Memory Hierarchy

So far, parallelism has been discussed from a passive viewpoint. That is, you have seen methods that passively examine loops to uncover their semantics: A loop either is or is not acceptable for parallel execution. A more aggressive approach, undertaken by compilers, is to actively transform a nonparallel loop so that it can be correctly executed on parallel hardware.5

5. Allen and Kennedy, Optimizing Compilers for Modern Architectures.

Consider the ideal parallel loop presented earlier, slightly modified:

for(i=0; i<n; i++) {
    t = a[i] + 1;
    a[i] = t;
}

Rather than use the result directly, the expression value is first stored into a temporary. Using temporaries in this fashion is a common coding style for users wary of compiler optimization. Excluding this temporary, the semantics of the loop are identical to those of the original.

Although the semantics are identical, the loop iterations as written cannot be correctly executed either simultaneously or indeterminately. The reason is the scalar temporary t. On a nonparallel machine, the temporary provides a location to hold the computation to avoid recomputing it. Executing the iterations in either simultaneous or indeterminate order, it is possible (and indeed probable) that multiple loop iterations will store into the temporary at roughly the same time. If so, the value fetched on an iteration may not be the proper one.

In the specific example, the obvious solution is to propagate the scalar forward. More generally, however, there may be many uses of t in the loop, in which case saving the results of the computation may be the most efficient approach. The more general solution is to expand the scalar to create a location for each loop iteration.

for(i=0; i<n; i++) {
    t[i] = a[i] + 1;
    a[i] = t[i];
}

A literal expansion illustrates the basic concept, but there are more efficient ways to achieve the end result.6 The requirement is that each loop iteration (or thread) have a dedicated location for the value. On simultaneous architectures, the scalar is usually expanded into a vector register or equivalent in a transformation known as scalar expansion.7 MIMD architectures usually provide individual processors with dedicated memory—memory private to that processor. Allocating the scalar to this storage (a transformation known as privatization) removes the bottleneck.8

6. Allen and Kennedy, Optimizing Compilers for Modern Architectures.

7. M. J. Wolfe, Optimizing Supercompilers for Supercomputers (Cambridge, MA: MIT Press, 1989).

8. Allen and Kennedy, Optimizing Compilers for Modern Architectures.

5.1.4 Reductions

Simultaneous and indeterminate execution cover common forms of fully parallel loops. However, there are computations that are only partially parallel but that are important enough to merit special hardware attention. The most prevalent of these are reductions. Reductions are operations that reduce a vector or an array to a scalar. Dot product (a special version of a more general sum reduction) is one example:

t = 0;
for(i=0; i<n; i++) {
    t = t + a[i] * b[i];
}

Because of the update of t, these loop iterations obviously cannot be executed simultaneously or indeterminately. Locking access to the temporary before the fetch and releasing it after the update will eliminate the problem, but it effectively serializes the execution.

Locking is necessary when results are accumulated in a globally accessible variable. Following the lead of the preceding section, if the accumulations are instead made to a private variable, no locking is necessary. The final accumulation into the global location, which does require locking, can then be performed outside the loop. Assuming tp is a private variable and that the outer brackets indicate the start of parallel execution, the following illustrates the transformation.

t = 0;
{
    tp = 0;
    for(i=0; i<n; i++) {
        tp = tp + a[i] * b[i];
    }
    lock(); t = t + tp; unlock();
}

This code may accumulate the sums in a different order from the original, because each individual private variable accumulates part of the overall sum. Although mathematical addition is associative, computer floating-point addition is not. This means that accumulating the sum in a different order may cause the results to vary slightly from the sequential result, and from run to run. Despite the possible variances, reductions consume enough computation in enough applications to justify this partial parallelism.

5.1.5 OpenACC for Parallelism

Given that the semantics of sequential loops differ from those of parallel hardware, it is not hard to see the role of OpenACC directives in exploiting parallelism in sequential programs. Programmers armed with OpenACC directives can use them to indicate parallel loops to the compiler, which can then take care of scheduling them.

The OpenACC gang and worker loop directives state that a loop will execute correctly regardless of the order in which the iterations are executed. Accordingly, such loops can be correctly executed on MIMD hardware. In fact, given that any iteration order is valid, gang and worker loops can also be executed simultaneously. This precipitates one of the decisions faced by a compiler (discussed later): Does it execute loops exactly as prescribed by directives (i.e., as gang and worker), or does it use the information to schedule loops as it sees fit (i.e., schedule a gang loop as gang, worker, and vector)? From a loop semantics viewpoint, gang and worker both imply indeterminate semantics. The difference between the two is that gang implies that there is more beneficial parallelism underneath.

The OpenACC vector loop directive states that a loop will execute correctly if the iterations are executed simultaneously. Accordingly, the loop can be executed on vector hardware or on the vector lanes of a GPU.
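
As a concrete sketch (the loop nest and array names here are hypothetical, not taken from a particular application), the three levels might be combined on a triply nested loop as follows, with gang on the outermost loop, worker on the middle loop, and vector on the innermost, unit-stride loop. Each directive asserts that the iterations of its loop may be executed in an indeterminate order (gang, worker) or simultaneously (vector).

#pragma acc parallel loop gang
for(int i=0; i<nx; i++) {
    #pragma acc loop worker
    for(int j=0; j<ny; j++) {
        #pragma acc loop vector
        for(int k=0; k<nz; k++) {
            a[i][j][k] = b[i][j][k] + c[i][j][k];
        }
    }
}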

In OpenACC, a programmer can specify that a scalar variable be made private to a loop, directing the compiler to allocate a copy of the variable to local storage. By directing that the variable be made private, you are guaranteeing that this particular variable will not block parallelism, and that the move to private will preserve the semantics of the loop. For this to be true, the scalar must be used in one of only two ways: (a) The scalar must be defined on each loop iteration prior to any use, or (b) the scalar must not be assigned within the loop. You distinguish these two cases by indicating firstprivate for the latter, indicating that the private variable needs to be initialized with the value of the global location.
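
A minimal sketch of the two cases, reusing the temporary t from Section 5.1.3 for the first and a hypothetical read-only scale factor s for the second:

/* Case (a): t is written before it is read on every iteration. */
#pragma acc parallel loop private(t)
for(i=0; i<n; i++) {
    t = a[i] + 1;
    a[i] = t;
}

/* Case (b): s is never assigned in the loop, so each private copy
   must be initialized with the value of the global location. */
#pragma acc parallel loop firstprivate(s)
for(i=0; i<n; i++) {
    b[i] = s * a[i];
}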

Reductions are indicated on a parallel loop by using a reduction clause that names the kind of reduction (e.g., + for a sum) and the scalar that is the result of the reduction. With this information, the compiler knows to create a private copy of the reduction variable inside the parallel loop, initialize it according to the reduction kind, and create appropriate locking and updating of the associated global variable.
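
For the dot product shown earlier, a reduction clause along these lines (a sketch, not the only possible form) conveys everything the compiler needs to generate the private accumulation and the final locked update itself:

t = 0;
#pragma acc parallel loop reduction(+:t)
for(i=0; i<n; i++) {
    t = t + a[i] * b[i];
}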

5.2 Restructuring Compilers

Previous sections present challenges that you must solve to effectively utilize parallelism: recognizing loops that can be executed as gang, worker, and vector loops; recognizing reductions; and placing variables appropriately in the memory hierarchy. These challenges are not new; parallel architectures have existed for decades, and every parallel architecture faces the challenges presented here. Compiler writers tend to be smart. Compiler research over those decades has resulted in a theoretical foundation to meet many of those challenges. To best apply OpenACC directives, you need to know where compilers can best be helped.

5.2.1 What Compilers Can Do

Fundamentally, executing a sequential loop simultaneously or indeterminately is reordering the execution of the loop. Compilers approach this reordering by creating a relation representing all the execution orderings in the sequential program that must be preserved for the computation results to be preserved. This relation is called dependency.9 Here is an example of an ordering that must be preserved (i.e., a dependency):

9. Allen and Kennedy, Optimizing Compilers for Modern Architectures.

t[i] = a[i] + 1;
a[i] = t[i];

The first statement computes a value of t[i] that is used by the second. As a result, their relative execution order must be preserved or else the second statement will use an incorrect value of t[i]. (Similarly, changing the relative execution order will also cause a[i] in the first statement to receive an incorrect value.) With dependency defined in this way, a compiler is free to reorder a program at will, as long as it does not violate any dependencies, and the results that are computed will remain unchanged. Dependency theory also includes the ability to attribute dependencies to specific loops, enabling more precise reordering.

Loop-based dependency enables an exact definition of sequential loops that can be correctly executed using simultaneous or indeterminate semantics. As a result, given an exact dependency graph, restructuring compilers can do a good job of determining loops that can be correctly executed as gang, worker, or vector loops. The problem is building an exact dependency graph for loops based on array references. In general, this problem is undecidable. Pragmatically, dependency theory has the mathematical foundations to deal with array subscripts that are linear functions of the loop induction variables, and that covers most loops encountered in practice. Based on this, dependency-based compilers can do a good, but not perfect, job on most programs.
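
For example, in the following loop (illustrative, not drawn from the text) the subscripts 2*i and 2*i+1 are linear functions of the induction variable, and a standard dependency test can prove that an even index never equals an odd one. The compiler can therefore conclude that no iteration reads or writes a location touched by another iteration.

for(i=0; i<n; i++) {
    a[2*i] = a[2*i+1] + 1;
}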

Dependency is based on memory access. Two statements can be dependent only if they access a common memory location. Although the presence of a dependency limits the reordering that can be effected, it also implies that a memory location is reused between two statements. Compilers are able to utilize that reuse to find the best locations for variables in the memory hierarchy.

The scalar precursor of dependency is data-flow analysis, which summarizes the relationships between definitions and uses of scalar variables.10 Data-flow analysis reveals whether a scalar is always defined before use in a loop (meaning it can be made private) or whether it needs initializing. Recognizing reductions is straightforward for humans, but the process is more abstract than it is mechanical, and general recognition is not simple for computers. Although it is not perfect, data-flow analysis does provide information necessary for recognizing most reductions.

10. S. S. Muchnick, Advanced Compiler Design and Implementation (San Francisco: Morgan Kaufmann, 1997).

With dependency and data-flow analysis, compilers can do a reasonable job exploiting parallel hardware. But they can’t do everything.

5.2.2 What Compilers Can’t Do

With an exact dependency graph, compilers are perfect at uncovering potential parallelism from a semantic point of view. However, as noted earlier, creating an exact dependency graph is in general undecidable. In addition to the limitation mentioned earlier (array subscripts must be linear functions of the loop induction variables),11 two programming constructs provide the biggest challenges for compilers: symbolic subscript expressions and aliasing.

11. Wolfe, Optimizing Supercompilers for Supercomputers.

Symbolic subscript expressions, wherein a loop variable’s coefficient or other factor is unknowable at compile time, defy a compiler’s best efforts. Such expressions arise more commonly than one might think, primarily because of multidimensional array references. In many cases, the symbolic expression involves physical quantities (e.g., mass or distance) that a user can reason about.
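
The following hypothetical fragment illustrates the problem. The two-dimensional array has been linearized by hand, and the row length lda is a runtime parameter; unless the compiler can prove that lda is positive and at least n, it cannot rule out two different iterations of the i loop touching the same element, and it must conservatively assume a dependency. A user who knows that lda is the physical leading dimension of the matrix can reason about this immediately.

void scale(double *a, int lda, int m, int n, double s) {
    for(int i=0; i<m; i++) {
        for(int j=0; j<n; j++) {
            a[i*lda + j] = s * a[i*lda + j];
        }
    }
}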

Aliasing, which occurs when the same memory location may have two or more references to it, is more subtle.12 The following fragment illustrates aliasing and associated problems.

12. K. D. Cooper and L. Torczon, Engineering a Compiler (San Francisco: Elsevier, 2004).

void copy(double a[], double b[], int n) {
    int i;
    for(i=0; i<n; i++)
        a[i] = b[i];
}
. . .
double x[100];
copy(&x[1], &x[0], 10);

Looking at the procedure copy in isolation, the loop appears parallelizable. But in fact, it’s not. The reason is the way in which copy is called. The fact that a and b are different names for the same memory means that the statement depends on itself. Resolving aliases in a program is an NP-complete problem, where NP stands for nondeterministic polynomial time. Compilers must adopt a conservative approach (that is, compilers assume that the worst possible case is happening unless it can prove otherwise), and this means that, in most cases of possible aliasing, compilers must assume that a dependency exists.
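
In C, one way for the programmer to supply the missing information (outside OpenACC itself) is the C99 restrict qualifier, which promises the compiler that the two pointers never refer to overlapping storage; with that promise, the call copy(&x[1], &x[0], 10) becomes the caller's error rather than the compiler's problem. A sketch:

void copy(double * restrict a, double * restrict b, int n) {
    int i;
    for(i=0; i<n; i++)
        a[i] = b[i];
}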

Another symbolic area that creates problems for compilers is symbolic loop bounds and steps. Doing the best job of allocating loops to parallel hardware requires knowing roughly how many times the loops iterate. For instance, there is little point in scheduling a loop with only three iterations as a vector loop; the overhead of scheduling the loop will overwhelm the benefit. Given two potential candidates for vector execution, one that iterates 3 times and the other that iterates 128 times, the larger iteration loop is absolutely the better choice. However, the compiler can’t make that choice without knowledge of the loop sizes, and if those sizes are symbolic, the compiler can only guess.

Compilers must work with the semantics of a computation as written and, in the case of OpenACC, with the semantics as expressed in a sequential language. Those semantics sometimes distort the mathematics underlying the physical phenomenon. Consider the following fragment:

for(i=1; i<m-1; i++) {
    for(j=1; j<n-1; j++) {
        a[i][j] = (a[i+1][j] + a[i-1][j] + a[i][j+1]
                + a[i][j-1]) / 4.0;
    }
}

This fragment is typical code used to solve differential equations, which have the property that at steady state, the value at a point is equal to the average value across a surrounding sphere. A compiler examining the fragment would correctly conclude that the statement depends upon itself in every loop and that it cannot be executed simultaneously or indeterminately. In fact, however, it can. The reason is that the fragment is converging toward a fixed point: when the values no longer change between iterations, a holds the solution. As a result, any errors that are encountered along the way by parallel execution are insignificant to the final solution. A human can readily recognize this; a compiler cannot.
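
This is exactly the kind of knowledge OpenACC directives are designed to carry. A sketch of how you might assert the parallelism the compiler cannot prove, accepting that the parallel sweep may take a slightly different path to the same converged answer:

#pragma acc parallel loop
for(i=1; i<m-1; i++) {
    #pragma acc loop
    for(j=1; j<n-1; j++) {
        a[i][j] = (a[i+1][j] + a[i-1][j] + a[i][j+1]
                + a[i][j-1]) / 4.0;
    }
}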

The items in this section—symbolic expressions, aliasing, hidden semantics—are linked by a common theme: information. Compilers are not omniscient, and they must be conservative in their approach, so it’s not surprising that information is a primary need. OpenACC is designed as a means for transferring this information from the programmer to the compiler.

5.3 Compiling OpenACC

OpenACC directives obviously simplify parts of the compilation process for GPU programs. Gang, worker, and vector directives tell the compiler whether a loop executes correctly in an indeterminate order, thereby eliminating the need for sophisticated dependency analysis and detailed loop analysis. Moreover, private clauses dictate precisely where variables need to be located in the memory hierarchy. Reduction recognition is tricky and detailed; the reduction clause eliminates the need for lots of checks normally employed by compilers to ensure correctness. Explicit directive control of memory transfers between host and device relieves the compiler of having to do detailed interprocedural analysis when it tries to optimize memory transfers.

Although it’s easy to think that directives solve all the compiler’s problems, that is not the case, and, in some cases, they complicate the compiler’s work. The phrase effective parallelism has two parts: parallelism, which means getting an application to run in parallel, and effective, which means getting the application to run faster. Experienced programmers are aware that the two are not always the same. The ease of applying directives raises user expectations, making the compiler’s job harder.

Some of the expected simplifications aren’t necessarily simplifications. For instance, there are reasons to build a dependency graph beyond uncovering parallel loops. Users make mistakes in applying directives—mistakes that can be hard to debug by running the application. The compiler can help detect these at compile time only if it performs the analysis it would need to actually restructure the program.

Key challenges faced by an OpenACC-enhanced compiler are detailed next.

5.3.1 Code Preparation

A compiler that uncovers parallel loops almost certainly builds a dependency graph. Although it is not necessarily wise to do so, the addition of OpenACC directives allows the compiler to bypass building the graph. However, less well known is the fact that such compilers also must implement a number of auxiliary transformations in order to support building the graph. These cannot be bypassed.

One example is auxiliary induction variable substitution. It is not uncommon for programmers who don’t trust compiler optimization to perform optimizations by hand on their source code. Strength reduction is one example:

ix = 0;
iy = 0;
for(int i=0; i<n; i++) {
    y[iy] = y[iy] + alpha * x[ix];
    ix = ix + incx;
    iy = iy + incy;
}

What the user really intends is this:

for(int i=0; i<n; i++) {
    y[incy * i] = y[incy * i] + alpha * x[incx * i];
}

The programmer wrote the first form for efficiency. Additions are much cheaper than multiplications, so in terms of the integer support arithmetic, the first is faster (although a reasonable compiler will eventually generate that form regardless of what is written). The transformation that reduces the second form to the first is known as strength reduction (replacing an operation with a cheaper one). For vector or parallel execution, however, the second form is necessary, both to uncover the memory access strides through the vectors and to eliminate the scalar dependencies. This is true whether the parallelism is implicitly detected or explicitly stated by a directive. The transformation that recovers the second form from the first (known as auxiliary induction variable substitution),13 as well as other preliminary transformations, must be performed either by the compiler or by the user.

13. Wolfe, Optimizing Supercompilers for Supercomputers.

5.3.2 Scheduling

Scheduling parallel and vector code involves balancing two conflicting forces. In one direction, parallelism is effective only when all processors are being productive. The best chance of keeping all processors busy is to make each instruction packet to be executed in parallel as small as possible, so that there are as many packets as possible. This approach means that as much code as possible will be ready to execute at any time. However, parallel execution does not come free; there is always some overhead involved in executing an instruction packet in parallel. Maximizing the number of packets maximizes the overhead. Effective scheduling therefore requires that you keep instruction packets small enough to balance the load across the processors, but large enough to minimize the overhead.

Most effective scheduling techniques are dynamic: they schedule work while the application is running. The alternative is static scheduling, wherein the compiler fixes the schedule for the application during the compilation process. Static scheduling introduces less overhead, because the compiler schedules the code directly with no overhead, whereas dynamic scheduling generally achieves better load balance, because it can adjust for execution changes based on input data.

OpenACC gangs require static scheduling on GPUs; workers can be dynamically scheduled. The combination strikes an efficient compromise between low overhead and good load balance.

5.3.3 Serial Code

GPUs start up in fully parallel mode, with all processors—blocks, warps, and vectors—executing the kernel. Their “natural” state is to be executing in parallel, and there is no one instruction that says, “Go serial.” As a result, it is relatively simple to execute parallel code; accordingly, the more difficult task is to execute serial code. Assuming that workers are dynamically scheduled using a bakery counter (all loop iterations to be performed are placed in a queue; as a worker becomes free, the queue gives the worker an iteration to perform), serial code at the worker level is not difficult. When workers are working on a serial section, all processors except one (a chief processor) are held in a pen. When the chief processor encounters a worker loop, it adds the iterations to the queue and opens up the pen.
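
A minimal sketch of the bakery counter idea, using a C11 atomic counter and a hypothetical do_iteration routine standing in for the loop body; each free worker repeatedly takes the next ticket until the iterations are exhausted:

#include <stdatomic.h>

extern void do_iteration(int i);   /* hypothetical loop body */

atomic_int next_iter;              /* the bakery counter, starts at 0 */

void worker(int n) {
    for(;;) {
        int i = atomic_fetch_add(&next_iter, 1);  /* take a ticket */
        if (i >= n)
            break;                                /* queue is empty */
        do_iteration(i);
    }
}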

Serializing the vector lanes down to a single execution is more challenging. Making only one lane of a warp execute (making it effectively the “chief” lane) requires neutering the remaining lanes so that they do not execute. The pragmatic way of effecting this is to place a jump around the serial section of code, predicated on the thread id. For the zeroth thread (the chief lane), the jump is not taken; for all others it is. The result is similar to this:

if (thread_id == 0) {
    /* serial code */
}
/* back to full vector code */

The transformation is simple when performed on a single basic block (i.e., a group of instructions without control flow change; each instruction in the block is executed if and only if another instruction in the block is executed). But the process for handling multiple basic blocks with control flow change is more difficult.

An alternative implementation for GPUs with fully predicated instructions is to use a process known as if conversion.14 If-conversion removes all branches in a section of code, replacing them with predicated instructions where the predicates exactly mimic the effects of the control flow. In optimizing compilers, if-conversion converts all control dependencies into data dependencies, thereby simplifying the construction of a dependency graph and allowing masked vector execution. It can produce the minimally necessary controlling condition for every statement, something that is not necessarily true for inserting jumps around serial blocks.
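
A sketch of the idea on a simple loop: the branch in the first form is replaced in the second by a predicate and a select, so that the body becomes a single straight-line block that can run under a lane mask.

/* With control flow */
for(i=0; i<n; i++) {
    if (a[i] > 0.0)
        b[i] = a[i];
    else
        b[i] = 0.0;
}

/* After if conversion (conceptually) */
for(i=0; i<n; i++) {
    int p = (a[i] > 0.0);           /* predicate mimicking the branch */
    b[i] = p ? a[i] : 0.0;          /* both arms merged under the predicate */
}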

14. Allen and Kennedy, Optimizing Compilers for Modern Architectures.

5.3.4 User Errors

When optimizing a program, compilers analyze the program to uncover all the information they need to create more efficient forms of code. Compilers must be conservative; they do not effect an optimization unless they can absolutely prove that the change is safe. One would think that directives, including those in OpenACC, would save compilers the bother of implementing the analysis, given that the directives are providing the needed information. That approach works if the information provided by the user is correct. Unfortunately, users are not always perfect, and the coordination between OpenACC and user is not always perfect. Two critical areas compilers need to check are control flow and efficiency.

Control flow checks are useful because of the way in which GPUs implement control flow for simultaneous operations. When control flow splits (i.e., a conditional branch is encountered), the GPU continues; vector lanes that do not take the branch are enabled, and vector lanes that do take the branch are disabled. Eventually, the different paths should converge so that all vector lanes are again enabled. If that convergence does not occur, the result is a condition known as divergence, and it most likely means that an otherwise valid program will hang at execution. Divergence normally does not happen when the compiler is working on its own, because compilers tackle only structured control flow, and divergence does not happen with structured programs. Users, however, are not limited to using structured programming or to placing directives around structured loops. Branches that terminate loops prematurely, particularly by branching to a point before the loop, complicate the convergence conditions.
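
As a hypothetical illustration, the break below terminates the loop prematurely, which complicates the convergence conditions the compiler must establish; a compiler that checks control flow can reject the directive or warn at compile time rather than generate code that may hang.

#pragma acc parallel loop vector
for(i=0; i<n; i++) {
    if (a[i] < 0.0)
        break;              /* premature exit from the vector loop */
    b[i] = a[i] * 2.0;
}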

Efficiency is a second consideration, one that exposes a weakness in the directive strategy. At a coarse level, parallel performance15 on virtually any parallel architecture is governed by a few simple principles.16

15. We ignore, for the moment, memory concerns, something that definitely cannot be ignored in practice.

16. Allen and Kennedy, Optimizing Compilers for Modern Architectures.

1. The outermost parallel loop should be the gang loop.

2. The next outermost parallel loop should be the worker loop.

3. If there is no other parallel loop, the outer loop should be both the gang and the worker loop.

4. Vector loops are most efficient when all the vectors access contiguous memory. In C, this means when the vector loop runs down the row with unit stride. In Fortran, this means when the vector loop runs down the column with unit stride.

5. Gang, worker, and vector loops all have a minimal size at which they become more efficient than scalar execution. Contrary to common conceptions, this size is usually more in the range of 5 to 10 than it is 2.

Note that compilers are proactive in trying to achieve these conditions. They will interchange loops to get a more profitable parallel loop to the outermost position or to get a better vector loop to an innermost position.
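
A sketch of the principles applied to a simple C loop nest (the arrays are hypothetical). As written, the inner loop strides through memory a row length at a time, which violates principle 4; after interchange, the unit-stride loop is innermost and makes a good vector loop, and the outer loop makes a good gang loop.

/* As written: the inner i loop has stride n in memory */
for(j=0; j<n; j++)
    for(i=0; i<m; i++)
        a[i][j] = b[i][j] + c[i][j];

/* After interchange: the inner j loop is unit stride */
#pragma acc parallel loop gang
for(i=0; i<m; i++) {
    #pragma acc loop vector
    for(j=0; j<n; j++)
        a[i][j] = b[i][j] + c[i][j];
}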

Although these principles are good guides, they are not absolute on every architecture. Directives in a program are usually immutable; as a result, they are not guaranteed to be optimally placed for every architecture. Compilers do have accurate models of architectures, so where they have full knowledge (e.g., loop lengths), they are better at loop selection than users.

One of the questions faced by compilers, given this, is whether the compiler should view OpenACC directives as prescriptive or descriptive. Prescriptive says that the user has prescribed exactly the desired mapping to hardware, and the compiler should follow it whether or not it is the most efficient. Descriptive says that the user has provided information to the compiler, describing the desired loop behavior, but the compiler should make the final decision. There are compilers supporting either strategy.

Whether it takes a prescriptive or a descriptive approach, a compiler should at a minimum check that the computation will speed up enough to justify the time spent transferring data to the device. If not, it makes more sense to execute the code on the host processor.

5.4 Summary

Exploiting parallelism with any number of processors is challenging, but it is particularly so when you are trying to effectively use hundreds or thousands of processors. The theme of this chapter is that effectively exploiting parallelism—particularly massive parallelism—is a job that requires cooperation of both the programmer and the compiler, and that OpenACC directives provide a firm basis for that cooperation. In doing so, you will be able to not only succeed in shepherding your one thousand chickens, but also see performance gains for your eight oxen.

The key challenges include recognizing loops that observe simultaneous and indeterminate semantics, allocating variables to the proper level in the memory hierarchy, and recognizing and scheduling reductions. OpenACC directives help meet these challenges and allow programmers to help overcome the compiler’s biggest weaknesses: aliasing, symbolic expressions, and hidden semantics. Although the use of OpenACC directives combined with a compiler is not a perfect solution, it greatly simplifies parallel programming and enables faster exploration of algorithms and loop parallelism to get the best possible performance.

5.5 Exercises

1. The OpenACC standard says that a vector loop cannot contain a gang or worker loop. Why?

2. Dr. Grignard, a longtime professor of parallel programming, has said, “If you can run a loop backwards (i.e., negating the step, starting with the upper bound, and ending at the lower) and you get the same result, you can run it correctly in vector.” Is he right?

3. Dr. Grignard has also said that a loop that passes the test of running backward can also be run correctly as a gang loop. Is that correct?

4. The following procedure is one encoding of daxpy, a procedure for multiplying a scalar times a vector and adding the result to another vector (Alpha × X + Y).

void daxpy(int n, double alpha, double x[], int incx,
           double y[], int incy)
{
    for(int i=0; i<n; i++) {
        y[incy * i] = y[incy * i] + alpha * x[incx * i];
    }
}

5. Given the definition of daxpy as a procedure that adds two vectors, it seems natural to assume that the loop can be run in vector and in parallel.

a. Can a compiler vectorize this loop as is?

b. What should a compiler do if a user places a vector directive on the loop?

6. The text states that if a statement in a loop does not depend upon itself, then it can be executed simultaneously. Consider the following loop:

for(i=1; i<n; i++) {
    d[i] = a[i-1] + 1;
    a[i] = b[i] + c[i];
}

There is a dependency from the second statement to the first, but neither statement depends upon itself. Can the statements be executed simultaneously? What should a compiler do if the user places a vector directive around the loop?

7. The text states that if there is a dependency in a loop caused by the loop, then the loop cannot be executed in gang or worker mode. Consider again the following loop:

for(i=1; i<n; i++) {
    d[i] = a[i-1] + 1;
    a[i] = b[i] + c[i];
}

This loop carries a dependency from the second statement to the first. Can it be executed in gang or worker mode? If not, can you think of ways of changing it so that it can?

8. Following is a fairly standard coding of matrix multiply:

for(j=0; j<n; j++) {
    for(i=0; i<m; i++) {
        c[i][j] = 0;
        for(k=0; k<p; k++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}

What OpenACC directives would you place, and where, to get the best execution speed? Are there other changes you would make to the loops? Suppose the loop were instead written this way:

for(j=0; j<n; j++) {
    for(i=0; i<m; i++) {
        t = 0;
        for(k=0; k<p; k++) {
            t += a[i][k] * b[k][j];
        }
        c[i][j] = t;
    }
}

What would you do differently? Which version would you expect to execute faster, and why?
