Chapter 5. Compiling OpenACC

Randy Allen, Mentor Graphics

Getting eight independent oxen to work in unison is not simple. However, it is child’s play compared with coordinating a thousand chickens. In either case, the problem is to harness independently operating beasts so that their efforts are coordinated toward achieving a single goal.

Exploiting parallelism—whether for pulling wagons or for speeding up computation—is challenging. At one level, it requires abstract thinking. Understanding the portions of an algorithm that can exploit parallelism—and, for that matter, creating an algorithm that can exploit parallelism—requires that you deal with problems at an abstract level. Gaining abstract understanding and designing algorithms are tasks best performed by humans.

At a different level, scheduling a thousand processors to coordinate effectively and arranging for data to be at the right place at the right time is complex. Complex, but only in its detail, not in its intellectual content. The human ability to manage detail of this complexity is at best spotty, but computers excel at exactly this.

Computers manage detail. Humans understand abstraction. Both are required for successful parallel execution. As with the gatekeeper and keymaster in Ghostbusters, it seems natural to get these two together. OpenACC does exactly that.

This chapter overviews the challenges that you must solve to effectively utilize massive parallelism and presents the theoretical foundations underlying solutions to those challenges. Then it closes with details of how compilers combine that theory with OpenACC directives to effect your requests and get effective speedup on parallel architectures.

5.1 The Challenges of Parallelism

Exploiting parallelism requires solving two challenges. First is figuring out the parts of the program that can be executed in parallel. Second is scheduling the parallel parts on the available hardware and getting needed data to the right place at the right time. A prerequisite for both is understanding the parallelism available in hardware.

The next section overviews that topic. Subsequent sections detail the process by which compilers map programs to parallelism, highlighting the type of information that they need to successfully effect a mapping. Later sections detail how OpenACC directives provide that information and how compilers use it to generate parallel code. OpenACC is not free, and the concluding sections detail the challenges introduced by OpenACC directives.

5.1.1 Parallel Hardware

At its most fundamental level, parallelism is achieved by replicating something. For computer hardware, that something is usually functional units. A functional unit may be an entire processor (including a program counter and branching logic), or a simple arithmetic unit (only arithmetic operations). The more replication you implement, the more operations that can be performed. However, just as chickens and oxen require coordination in the form of a harness and a driver, functional units need to be coordinated. Software can serve the role of the driver, but the harness, particularly for thousands of processors in GPUs, requires hardware support.

The simplest way of coordinating functional units is to link their execution so that all the units are executing the same instruction at the same time. This method is more or less required with arithmetic units, because arithmetic units rarely contain program counters. For processors, this method of coordination is often effected by having the processors share a single program counter. Sharing a single program counter is the approach taken in NVIDIA GPUs. Obviously, having processors redundantly execute the same instruction is productive only if each operates on different data. Following this approach, each chicken is essentially marching in lockstep.

An alternative is to have completely independent processors. Under that model, functional units are free to execute any instruction at any time on any data. No clock, program counter, or other tie constrains the functional units. At any given time, the processors may be executing different instructions on different data. In the pioneer analogy, each ox or chicken would be proceeding forward at its own pace, oblivious to the progress of the other animals.

Examples of functional units that concurrently execute the same instruction are vector units and lockstep processors. Generally, these are classified as single-instruction multiple-data (SIMD)1 or single-instruction multiple-thread (SIMT).2 The defining point semantically is simultaneous execution.

1. J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, second edition (San Francisco: Morgan Kaufmann, 1996).

2. M. McCool, J. Reinders, and A. Robison, Structured Parallel Programming: Patterns for Efficient Computation (San Francisco: Elsevier, 2013).

Examples of functional units that execute independently abound, and they are illustrated by virtually any processors connected by either an internal bus or an external network. The processors may share memory; memory may be distributed among them; or memory may be split between those two options. In any case, because the functional units are loosely connected, the relative order in which operations execute on two processors is indeterminate. This execution is multiple-instruction multiple-data (MIMD).3

3. Hennessy and Patterson, Computer Architecture.

GPUs contain both types of parallelism in their streaming multiprocessors (SMs) and warps. With an understanding of basic hardware parallelism, next we turn to understanding the mapping from programming languages to hardware.

5.1.2 Mapping Loops

To productively utilize hundreds or thousands of processors, an application must be doing a great deal of computation. Scheduling that many processors requires regularity. Without regularity, many processors can be programmed to do only one task. Without regularity, processors are figuratively executing like chickens flying in random directions. For programming languages, "regularity" and "repetition" imply loops. And, unless an application is boringly repeating the same calculation, regularity also means arrays. Pragmatically speaking, effectively utilizing parallelism is a matter of mapping loops onto parallel hardware. Also pragmatically speaking, this mapping is performed by having processors execute loop iterations. In other words, the unit of regularity is the loop body, and the loop iterations are partitioned among the processors.

Loops in standard programming languages are, by definition, sequential (meaning that the iterations of the loop are executed in consecutive order). The semantics of parallel hardware, by the adjective parallel, are not. Given that, it is not immediately obvious that loops can be directly mapped to parallel hardware. In fact, not all loops can be.4

4. J. R. Allen and K. Kennedy, Optimizing Compilers for Modern Architectures (San Francisco: Morgan Kaufmann, 2001).

Functional units that execute a single instruction implement simultaneous semantics. In terms of array accesses in loops, simultaneous semantics means that the behavior of the statement is defined by the final result obtained if all iterations of the loop are executed together. Consider the following C fragment:

for(i=0; i<n; i++) {
    a[i] = a[i] + 1;
}

Each element of a is incremented by 1. Because each element is accessed in only one loop iteration, the order in which the iterations occur is irrelevant. The loop may be iterated forward as written, backward, simultaneously (as with a vector processor or lockstep processor), or randomly (as with independent processors). The result will be the same.

This is not true of all loops. Here we change the fragment slightly:

for(i=1; i<n; i++) {
    a[i] = a[i-1] + 1;
}

This fragment sets each element of a to the value of its ordinal position (assuming elements are initially 0). Executed sequentially, a loop iteration picks up the value of a stored by the preceding iteration, increments it, and stores the result for the next iteration. Simultaneous execution changes the result. Executed simultaneously, all the initial values of a are fetched, then incremented, and finally stored. No updated values are used. Stated differently, every element of a is set to 1.

The reason this loop produces different results when executed simultaneously is that the loop body depends on values produced by previous iterations. When the loop executes simultaneously, the newly computed values are not available until the computation is complete. That is too late for the dependent iteration to use.

Because MIMD processors may execute loop iterations in any order, and because simultaneous execution is one possible order, not all sequential loops can be executed correctly on MIMD hardware. Sequential loops in which no statement depends on itself, directly or indirectly, may be executed simultaneously. The condition that determines whether sequential loops will execute correctly with indeterminate execution order is more restrictive. A loop can be executed in MIMD fashion if and only if no memory location is written/read, read/written, or written/written on two different loop iterations. To illustrate the difference, the following can be executed correctly simultaneously but not indeterminately.

for(i=1; i<n; i++) {
    a[i] = b[i] + c[i];
    d[i] = a[i-1] + 1;
}

Executed simultaneously, all executions of the first statement complete before any execution of the second statement begins. The second statement correctly gets updated values. Executed indeterminately, loop iteration 2 (for instance) may execute before loop iteration 1 starts. If that happens, loop iteration 2 will incorrectly fetch old values of a.

5.1.3 Memory Hierarchy

So far, parallelism has been discussed from a passive viewpoint. That is, you have seen methods that passively examine loops to uncover their semantics: A loop either is or is not acceptable for parallel execution. A more aggressive approach, undertaken by compilers, is to actively transform a nonparallel loop so that it can be correctly executed on parallel hardware.5

5. Allen and Kennedy, Optimizing Compilers for Modern Architectures.

Consider the ideal parallel loop presented earlier, slightly modified:

for(i=0; i<n; i++) {
    t = a[i] + 1;
    a[i] = t;
}

Rather than use the result directly, the expression value is first stored into a temporary. Using temporaries in this fashion is a common coding style for users wary of compiler optimization. Excluding this temporary, the semantics of the loop are identical to those of the original.

Although the semantics are identical, the loop iterations as written cannot be correctly executed either simultaneously or indeterminately. The reason is the scalar temporary t. On a nonparallel machine, the temporary provides a location to hold the computation to avoid recomputing it. Executing the iterations in either simultaneous or indeterminate order, it is possible (and indeed probable) that multiple loop iterations will store into the temporary at roughly the same time. If so, the value fetched on an iteration may not be the proper one.

In the specific example, the obvious solution is to propagate the scalar forward. More generally, however, there may be many uses of t in the loop, in which case saving the results of the computation may be the most efficient approach. The more general solution is to expand the scalar to create a location for each loop iteration.

for(i=0; i<n; i++) {
    t[i] = a[i] + 1;
    a[i] = t[i];
}

A literal expansion illustrates the basic concept, but there are more efficient ways to achieve the end result.6 The requirement is that each loop iteration (or thread) have a dedicated location for the value. On simultaneous architectures, the scalar is usually expanded into a vector register or equivalent in a transformation known as scalar expansion.7 MIMD architectures usually provide individual processors with dedicated memory—memory private to that processor. Allocating the scalar to this storage (a transformation known as privatization) removes the bottleneck.8

6. Allen and Kennedy, Optimizing Compilers for Modern Architectures.

7. M. J. Wolfe, Optimizing Supercompilers for Supercomputers (Cambridge, MA: MIT Press, 1989).

8. Allen and Kennedy, Optimizing Compilers for Modern Architectures.

5.1.4 Reductions

Simultaneous and indeterminate execution cover common forms of fully parallel loops. However, there are computations that are only partially parallel but that are important enough to merit special hardware attention. The most prevalent of these are reductions. Reductions are operations that reduce a vector or an array to a scalar. Dot product (a special version of a more general sum reduction) is one example:

t = 0;
for(i=0; i<n; i++) {
    t = t + a[i] * b[i];
}

Because of the update of t, these loop iterations obviously cannot be executed simultaneously or indeterminately. Locking access to the temporary before the fetch and releasing it after the update will eliminate the problem, but it effectively serializes the execution.

Locking is necessary when results are accumulated in a globally accessible variable. Following the lead of the preceding section, if the accumulations are instead made to a private variable, no locking is necessary. The final accumulation into the global location, which does require locking, can then be performed outside the loop. Assuming tp is a private variable and that the outer brackets indicate the start of parallel execution, the following illustrates the transformation.

t = 0;
{
    tp = 0;
    for(i=0; i<n; i++) {
        tp = tp + a[i] * b[i];
    }
    lock(); t = t + tp; unlock();
}

This code may accumulate the sums in a different order from the original, because each individual private variable accumulates part of the overall sum. Although mathematical addition is associative, computer floating-point addition is not. This means that accumulating the sum in a different order may cause the results to vary slightly from the sequential result, and from run to run. Despite the possible variances, reductions consume enough computation in enough applications to justify this partial parallelism.

5.1.5 OpenACC for Parallelism

Given that the semantics of sequential loops differ from those of parallel hardware, it is not hard to see the role of OpenACC directives in exploiting parallelism in sequential programs. Programmers armed with OpenACC directives can use them to indicate parallel loops to the compiler, which can then take care of scheduling them.

The OpenACC gang and worker loop directives state that a loop will execute correctly regardless of the order in which the iterations are executed. Accordingly, such loops can be correctly executed on MIMD hardware. In fact, given that any iteration order is valid, gang and worker loops can also be executed simultaneously. This precipitates one of the decisions faced by a compiler (discussed later): Does it execute loops exactly as prescribed by directives (i.e., as gang and worker), or does it use the information to schedule loops as it sees fit (i.e., schedule a gang loop as gang, worker, and vector)? From a loop semantics viewpoint, gang and worker both imply indeterminate semantics. The difference between the two is that gang implies that there is more beneficial parallelism underneath.

The OpenACC vector loop directive states that a loop will execute correctly if the iterations are executed simultaneously. Accordingly, the loop can be executed on vector hardware or on the vector lanes of a GPU.
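
As a concrete sketch (the loop nest and array names here are hypothetical, not taken from a particular application), the three levels might be combined on a triply nested loop as follows, with gang on the outermost loop, worker on the middle loop, and vector on the innermost, unit-stride loop. Each directive asserts that the iterations of its loop may be executed in an indeterminate order (gang, worker) or simultaneously (vector).

#pragma acc parallel loop gang
for(int i=0; i<nx; i++) {
    #pragma acc loop worker
    for(int j=0; j<ny; j++) {
        #pragma acc loop vector
        for(int k=0; k<nz; k++) {
            a[i][j][k] = b[i][j][k] + c[i][j][k];
        }
    }
}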

In OpenACC, a programmer can specify that a scalar variable be made private to a loop, directing the compiler to allocate a copy of the variable to local storage. By directing that the variable be made private, you are guaranteeing that this particular variable will not block parallelism, and that the move to private will preserve the semantics of the loop. For this to be true, the scalar must be used in one of only two ways: (a) The scalar must be defined on each loop iteration prior to any use, or (b) the scalar must not be assigned within the loop. You distinguish these two cases by indicating firstprivate for the latter, indicating that the private variable needs to be initialized with the value of the global location.
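
A minimal sketch of the two cases, reusing the temporary t from Section 5.1.3 for the first and a hypothetical read-only scale factor s for the second:

/* Case (a): t is written before it is read on every iteration. */
#pragma acc parallel loop private(t)
for(i=0; i<n; i++) {
    t = a[i] + 1;
    a[i] = t;
}

/* Case (b): s is never assigned in the loop, so each private copy
   must be initialized with the value of the global location. */
#pragma acc parallel loop firstprivate(s)
for(i=0; i<n; i++) {
    b[i] = s * a[i];
}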

Reductions are indicated on a parallel loop by using a reduction clause that names the kind of reduction (e.g., + for a sum) and the scalar that is the result of the reduction. With this information, the compiler knows to create a private copy of the reduction variable inside the parallel loop, initialize it according to the reduction kind, and create appropriate locking and updating of the associated global variable.
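
For the dot product shown earlier, a reduction clause along these lines (a sketch, not the only possible form) conveys everything the compiler needs to generate the private accumulation and the final locked update itself:

t = 0;
#pragma acc parallel loop reduction(+:t)
for(i=0; i<n; i++) {
    t = t + a[i] * b[i];
}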

5.2 Restructuring Compilers

Previous sections present challenges that you must solve to effectively utilize parallelism: recognizing loops that can be executed as gang, worker, and vector loops; recognizing reductions; and placing variables appropriately in the memory hierarchy. These challenges are not new; parallel architectures have existed for decades, and every parallel architecture faces the challenges presented here. Compiler writers tend to be smart. Compiler research over those decades has resulted in a theoretical foundation to meet many of those challenges. To best apply OpenACC directives, you need to know where compilers can best be helped.

5.2.1 What Compilers Can Do

Fundamentally, executing a sequential loop simultaneously or indeterminately is reordering the execution of the loop. Compilers approach this reordering by creating a relation representing all the execution orderings in the sequential program that must be preserved for the computation results to be preserved. This relation is called dependency.9 Here is an example of an ordering that must be preserved (i.e., a dependency):

9. Allen and Kennedy, Optimizing Compilers for Modern Architectures.

t[i] = a[i] + 1;
a[i] = t[i];

The first statement computes a value of t[i] that is used by the second. As a result, their relative execution order must be preserved or else the second statement will use an incorrect value of t[i]. (Similarly, changing the relative execution order will also cause a[i] in the first statement to receive an incorrect value.) With dependency defined in this way, a compiler is free to reorder a program at will, as long as it does not violate any dependencies, and the results that are computed will remain unchanged. Dependency theory also includes the ability to attribute dependencies to specific loops, enabling more precise reordering.

Loop-based dependency enables an exact definition of sequential loops that can be correctly executed using simultaneous or indeterminate semantics. As a result, given an exact dependency graph, restructuring compilers can do a good job of determining loops that can be correctly executed as gang, worker, or vector loops. The problem is building an exact dependency graph for loops based on array references. In general, this problem is undecidable. Pragmatically, dependency theory has the mathematical foundations to deal with array subscripts that are linear functions of the loop induction variables, and that covers most loops encountered in practice. Based on this, dependency-based compilers can do a good, but not perfect, job on most programs.
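
For example, in the following loop (illustrative, not drawn from the text) the subscripts 2*i and 2*i+1 are linear functions of the induction variable, and a standard dependency test can prove that an even index never equals an odd one. The compiler can therefore conclude that no iteration reads or writes a location touched by another iteration.

for(i=0; i<n; i++) {
    a[2*i] = a[2*i+1] + 1;
}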

Dependency is based on memory access. Two statements can be dependent only if they access a common memory location. Although the presence of a dependency limits the reordering that can be effected, it also implies that a memory location is reused between two statements. Compilers are able to utilize that reuse to find the best locations for variables in the memory hierarchy.

The scalar precursor of dependency is data-flow analysis, which summarizes the relationships between definitions and uses of scalar variables.10 Data-flow analysis reveals whether a scalar is always defined before use in a loop (meaning it can be made private) or whether it needs initializing. Recognizing reductions is straightforward for humans, but the process is more abstract than it is mechanical, and general recognition is not simple for computers. Although it is not perfect, data-flow analysis does provide information necessary for recognizing most reductions.

10. S. S. Muchnick, Advanced Compiler Design and Implementation (San Francisco: Morgan Kaufmann, 1997).

With dependency and data-flow analysis, compilers can do a reasonable job exploiting parallel hardware. But they can’t do everything.

5.2.2 What Compilers Can’t Do

With an exact dependency graph, compilers are perfect at uncovering potential parallelism from a semantic point of view. However, as noted earlier, creating an exact dependency graph is in general undecidable. In addition to the limitation mentioned earlier (array subscripts must be linear functions of the loop induction variables),11 two programming constructs provide the biggest challenges for compilers: symbolic subscript expressions and aliasing.

11. Wolfe, Optimizing Supercompilers for Supercomputers.

Symbolic subscript expressions, wherein a loop variable’s coefficient or other factor is unknowable at compile time, defy a compiler’s best efforts. Such expressions arise more commonly than one might think, primarily because of multidimensional array references. In many cases, the symbolic expression involves physical quantities (e.g., mass or distance) that a user can reason about.
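
The following hypothetical fragment illustrates the problem. The two-dimensional array has been linearized by hand, and the row length lda is a runtime parameter; unless the compiler can prove that lda is positive and at least n, it cannot rule out two different iterations of the i loop touching the same element, and it must conservatively assume a dependency. A user who knows that lda is the physical leading dimension of the matrix can reason about this immediately.

void scale(double *a, int lda, int m, int n, double s) {
    for(int i=0; i<m; i++) {
        for(int j=0; j<n; j++) {
            a[i*lda + j] = s * a[i*lda + j];
        }
    }
}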

Aliasing, which occurs when the same memory location may have two or more references to it, is more subtle.12 The following fragment illustrates aliasing and associated problems.

12. K. D. Cooper and L. Torczon, Engineering a Compiler (San Francisco: Elsevier, 2004).

void copy(double a[], double b[], int n) {
    int i;
    for(i=0; i<n; i++)
        a[i] = b[i];
}
. . .
double x[100];
copy(&x[1], &x[0], 10);

Looking at the procedure copy in isolation, the loop appears parallelizable. But in fact, it’s not. The reason is the way in which copy is called. The fact that a and b are different names for the same memory means that the statement depends on itself. Resolving aliases in a program is an NP-complete problem, where NP stands for nondeterministic polynomial time. Compilers must adopt a conservative approach (that is, compilers assume that the worst possible case is happening unless it can prove otherwise), and this means that, in most cases of possible aliasing, compilers must assume that a dependency exists.
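
In C, one way for the programmer to supply the missing information (outside OpenACC itself) is the C99 restrict qualifier, which promises the compiler that the two pointers never refer to overlapping storage; with that promise, the call copy(&x[1], &x[0], 10) becomes the caller's error rather than the compiler's problem. A sketch:

void copy(double * restrict a, double * restrict b, int n) {
    int i;
    for(i=0; i<n; i++)
        a[i] = b[i];
}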

Another symbolic area that creates problems for compilers is symbolic loop bounds and steps. Doing the best job of allocating loops to parallel hardware requires knowing roughly how many times the loops iterate. For instance, there is little point in scheduling a loop with only three iterations as a vector loop; the overhead of scheduling the loop will overwhelm the benefit. Given two potential candidates for vector execution, one that iterates 3 times and the other that iterates 128 times, the larger iteration loop is absolutely the better choice. However, the compiler can’t make that choice without knowledge of the loop sizes, and if those sizes are symbolic, the compiler can only guess.

Compilers must work with the semantics of a computation as written and, in the case of OpenACC, with the semantics as expressed in a sequential language. Those semantics sometimes distort the mathematics underlying the physical phenomenon. Consider the following fragment:

for(i=1; i<m-1; i++) {
    for(j=1; j<n-1; j++) {
        a[i][j] = (a[i+1][j] + a[i-1][j] + a[i][j+1]
                + a[i][j-1]) / 4.0;
    }
}

This fragment is typical code used to solve differential equations, which have the property that at steady state, the value at a point is equal to the average value across a surrounding sphere. A compiler examining the fragment would correctly conclude that the statement depends upon itself in every loop and that it cannot be executed simultaneously or indeterminately. In fact, however, it can. The reason is that the fragment is converging toward a fixed point: when the values no longer change between iterations, a holds the solution. As a result, any errors that are encountered along the way by parallel execution are insignificant to the final solution. A human can readily recognize this; a compiler cannot.
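
This is exactly the kind of knowledge OpenACC directives are designed to carry. A sketch of how you might assert the parallelism the compiler cannot prove, accepting that the parallel sweep may take a slightly different path to the same converged answer:

#pragma acc parallel loop
for(i=1; i<m-1; i++) {
    #pragma acc loop
    for(j=1; j<n-1; j++) {
        a[i][j] = (a[i+1][j] + a[i-1][j] + a[i][j+1]
                + a[i][j-1]) / 4.0;
    }
}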

The items in this section—symbolic expressions, aliasing, hidden semantics—are linked by a common theme: information. Compilers are not omniscient, and they must be conservative in their approach, so it’s not surprising that information is a primary need. OpenACC is designed as a means for transferring this information from the programmer to the compiler.

5.3 Compiling OpenACC

OpenACC directives obviously simplify parts of the compilation process for GPU programs. Gang, worker, and vector directives tell the compiler whether a loop executes correctly in an indeterminate order, thereby eliminating the need for sophisticated dependency analysis and detailed loop analysis. Moreover, private clauses dictate precisely where variables need to be located in the memory hierarchy. Reduction recognition is tricky and detailed; the reduction clause eliminates the need for lots of checks normally employed by compilers to ensure correctness. Explicit directive control of memory transfers between host and device relieves the compiler of having to do detailed interprocedural analysis when it tries to optimize memory transfers.

Although it’s easy to think that directives solve all the compiler’s problems, that is not the case, and, in some cases, they complicate the compiler’s work. The phrase effective parallelism has two parts: parallelism, which means getting an application to run in parallel, and effective, which means getting the application to run faster. Experienced programmers are aware that the two are not always the same. The ease of applying directives raises user expectations, making the compiler’s job harder.

Some of the expected simplifications aren’t necessarily simplifications. For instance, there are reasons to build a dependency graph beyond uncovering parallel loops. Users make mistakes in applying directives—mistakes that can be hard to debug by running the application. The compiler can help detect these at compile time only if it performs the analysis it would need to actually restructure the program.

Key challenges faced by an OpenACC-enhanced compiler are detailed next.

5.3.1 Code Preparation

A compiler that uncovers parallel loops almost certainly builds a dependency graph. Although it is not necessarily wise to do so, the addition of OpenACC directives allows the compiler to bypass building the graph. However, less well known is the fact that such compilers also must implement a number of auxiliary transformations in order to support building the graph. These cannot be bypassed.

One example is auxiliary induction variable substitution. It is not uncommon for programmers who don’t trust compiler optimization to perform optimizations by hand on their source code. Strength reduction is one example:

ix = 0;
iy = 0;
for(int i=0; i<n; i++) {
    y[iy] = y[iy] + alpha * x[ix];
    ix = ix + incx;
    iy = iy + incy;
}

What the user really intends is this:

for(int i=0; i<n; i++) {
    y[incy * i] = y[incy * i] + alpha * x[incx * i];
}

The programmer wrote the first form for efficiency. Additions are much cheaper than multiplications, so in terms of the integer support arithmetic, the first is faster (although a reasonable compiler will eventually generate that form regardless of what is written). The transformation that reduces the second form to the first is known as strength reduction (replacing an operation with a cheaper one). For vector or parallel execution, however, the second form is necessary, both to uncover the memory access strides through the vectors and to eliminate the scalar dependencies. This is true whether the parallelism is implicitly detected or explicitly stated by a directive. The transformation that recovers the second form from the first (known as auxiliary induction variable substitution),13 as well as other preliminary transformations, must be performed either by the compiler or by the user.

13. Wolfe, Optimizing Supercompilers for Supercomputers.

5.3.2 Scheduling

Scheduling parallel and vector code involves balancing two conflicting forces. In one direction, parallelism is effective only when all processors are being productive. The best chance of keeping all processors busy is to make each instruction packet to be executed in parallel as small as possible, so that there are as many packets as possible. This approach means that as much code as possible will be ready to execute at any time. However, parallel execution does not come free; there is always some overhead involved in executing an instruction packet in parallel. Maximizing the number of packets maximizes the overhead. Effective scheduling therefore requires that you keep instruction packets small enough to balance the load across the processors, but large enough to minimize the overhead.

Most effective scheduling techniques are dynamic: they schedule work while the application is running. The alternative is static scheduling, wherein the compiler fixes the schedule for the application during the compilation process. Static scheduling introduces less overhead, because the compiler schedules the code directly with no overhead, whereas dynamic scheduling generally achieves better load balance, because it can adjust for execution changes based on input data.

OpenACC gangs require static scheduling on GPUs; workers can be dynamically scheduled. The combination strikes an efficient compromise between low overhead and good load balance.

5.3.3 Serial Code

GPUs start up in fully parallel mode, with all processors—blocks, warps, and vectors—executing the kernel. Their “natural” state is to be executing in parallel, and there is no one instruction that says, “Go serial.” As a result, it is relatively simple to execute parallel code; accordingly, the more difficult task is to execute serial code. Assuming that workers are dynamically scheduled using a bakery counter (all loop iterations to be performed are placed in a queue; as a worker becomes free, the queue gives the worker an iteration to perform), serial code at the worker level is not difficult. When workers are working on a serial section, all processors except one (a chief processor) are held in a pen. When the chief processor encounters a worker loop, it adds the iterations to the queue and opens up the pen.
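
A minimal sketch of the bakery counter idea, using a C11 atomic counter and a hypothetical do_iteration routine standing in for the loop body; each free worker repeatedly takes the next ticket until the iterations are exhausted:

#include <stdatomic.h>

extern void do_iteration(int i);   /* hypothetical loop body */

atomic_int next_iter;              /* the bakery counter, starts at 0 */

void worker(int n) {
    for(;;) {
        int i = atomic_fetch_add(&next_iter, 1);  /* take a ticket */
        if (i >= n)
            break;                                /* queue is empty */
        do_iteration(i);
    }
}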

Serializing the vector lanes down to a single execution is more challenging. Making only one lane of a warp execute (making it effectively the “chief” lane) requires neutering the remaining lanes so that they do not execute. The pragmatic way of effecting this is to place a jump around the serial section of code, predicated on the thread id. For the zeroth thread (the chief lane), the jump is not taken; for all others it is. The result is similar to this:

if (thread_id == 0) {
    /* serial code */
}
/* back to full vector code */

The transformation is simple when performed on a single basic block (i.e., a group of instructions without control flow change; each instruction in the block is executed if and only if another instruction in the block is executed). But the process for handling multiple basic blocks with control flow change is more difficult.

An alternative implementation for GPUs with fully predicated instructions is to use a process known as if conversion.14 If-conversion removes all branches in a section of code, replacing them with predicated instructions where the predicates exactly mimic the effects of the control flow. In optimizing compilers, if-conversion converts all control dependencies into data dependencies, thereby simplifying the construction of a dependency graph and allowing masked vector execution. It can produce the minimally necessary controlling condition for every statement, something that is not necessarily true for inserting jumps around serial blocks.
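
A sketch of the idea on a simple loop: the branch in the first form is replaced in the second by a predicate and a select, so that the body becomes a single straight-line block that can run under a lane mask.

/* With control flow */
for(i=0; i<n; i++) {
    if (a[i] > 0.0)
        b[i] = a[i];
    else
        b[i] = 0.0;
}

/* After if conversion (conceptually) */
for(i=0; i<n; i++) {
    int p = (a[i] > 0.0);           /* predicate mimicking the branch */
    b[i] = p ? a[i] : 0.0;          /* both arms merged under the predicate */
}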

14. Allen and Kennedy, Optimizing Compilers for Modern Architectures.

5.3.4 User Errors

When optimizing a program, compilers analyze the program to uncover all the information they need to create more efficient forms of code. Compilers must be conservative; they do not effect an optimization unless they can absolutely prove that the change is safe. One would think that directives, including those in OpenACC, would save compilers the bother of implementing the analysis, given that the directives are providing the needed information. That approach works if the information provided by the user is correct. Unfortunately, users are not always perfect, and the coordination between OpenACC and user is not always perfect. Two critical areas compilers need to check are control flow and efficiency.

Control flow checks are useful because of the way in which GPUs implement control flow for simultaneous operations. When control flow splits (i.e., a conditional branch is encountered), the GPU continues; vector lanes that do not take the branch are enabled, and vector lanes that do take the branch are disabled. Eventually, the different paths should converge so that all vector lanes are again enabled. If that convergence does not occur, the result is a condition known as divergence, and it most likely means that an otherwise valid program will hang at execution. Divergence normally does not happen when the compiler is working on its own, because compilers tackle only structured control flow, and divergence does not happen with structured programs. Users, however, are not limited to using structured programming or to placing directives around structured loops. Branches that terminate loops prematurely, particularly by branching to a point before the loop, complicate the convergence conditions.
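
As a hypothetical illustration, the break below terminates the loop prematurely, which complicates the convergence conditions the compiler must establish; a compiler that checks control flow can reject the directive or warn at compile time rather than generate code that may hang.

#pragma acc parallel loop vector
for(i=0; i<n; i++) {
    if (a[i] < 0.0)
        break;              /* premature exit from the vector loop */
    b[i] = a[i] * 2.0;
}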

Efficiency is a second consideration, one that exposes a weakness in the directive strategy. At a coarse level, parallel performance15 on virtually any parallel architecture is governed by a few simple principles.16

15. We ignore, for the moment, memory concerns, something that definitely cannot be ignored in practice.

16. Allen and Kennedy, Optimizing Compilers for Modern Architectures.

1. The outermost parallel loop should be the gang loop.

2. The next outermost parallel loop should be the worker loop.

3. If there is no other parallel loop, the outer loop should be both the gang and the worker loop.

4. Vector loops are most efficient when all the vectors access contiguous memory. In C, this means when the vector loop runs down the row with unit stride. In Fortran, this means when the vector loop runs down the column with unit stride.

5. Gang, worker, and vector loops all have a minimal size at which they become more efficient than scalar execution. Contrary to common conceptions, this size is usually more in the range of 5 to 10 than it is 2.

Note that compilers are proactive in trying to achieve these conditions. They will interchange loops to get a more profitable parallel loop to the outermost position or to get a better vector loop to an innermost position.
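
A sketch of the principles applied to a simple C loop nest (the arrays are hypothetical). As written, the inner loop strides through memory a row length at a time, which violates principle 4; after interchange, the unit-stride loop is innermost and makes a good vector loop, and the outer loop makes a good gang loop.

/* As written: the inner i loop has stride n in memory */
for(j=0; j<n; j++)
    for(i=0; i<m; i++)
        a[i][j] = b[i][j] + c[i][j];

/* After interchange: the inner j loop is unit stride */
#pragma acc parallel loop gang
for(i=0; i<m; i++) {
    #pragma acc loop vector
    for(j=0; j<n; j++)
        a[i][j] = b[i][j] + c[i][j];
}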

Although these principles are good guides, they are not absolute on every architecture. Directives in a program are usually immutable; as a result, they are not guaranteed to be optimally placed for every architecture. Compilers do have accurate models of architectures, so where they have full knowledge (e.g., loop lengths), they are better at loop selection than users.

One of the questions faced by compilers, given this, is whether the compiler should view OpenACC directives as prescriptive or descriptive. Prescriptive says that the user has prescribed exactly the desired mapping to hardware, and the compiler should follow it whether or not it is the most efficient. Descriptive says that the user has provided information to the compiler, describing the desired loop behavior, but the compiler should make the final decision. There are compilers supporting either strategy.

Whether it takes a prescriptive or a descriptive approach, a compiler should at a minimum check that the computation will speed up enough to justify the time spent transferring data to the device. If not, it makes more sense to execute the code on the host processor.

5.4 Summary

Exploiting parallelism with any number of processors is challenging, but it is particularly so when you are trying to effectively use hundreds or thousands of processors. The theme of this chapter is that effectively exploiting parallelism—particularly massive parallelism—is a job that requires cooperation of both the programmer and the compiler, and that OpenACC directives provide a firm basis for that cooperation. In doing so, you will be able to not only succeed in shepherding your one thousand chickens, but also see performance gains for your eight oxen.

The key challenges include recognizing loops that observe simultaneous and indeterminate semantics, allocating variables to the proper level in the memory hierarchy, and recognizing and scheduling reductions. OpenACC directives help meet these challenges and allow programmers to help overcome the compiler’s biggest weaknesses: aliasing, symbolic expressions, and hidden semantics. Although the use of OpenACC directives combined with a compiler is not a perfect solution, it greatly simplifies parallel programming and enables faster exploration of algorithms and loop parallelism to get the best possible performance.

5.5 Exercises

1. The OpenACC standard says that a vector loop cannot contain a gang or worker loop. Why?

2. Dr. Grignard, a longtime professor of parallel programming, has said, “If you can run a loop backwards (i.e., negating the step, starting with the upper bound, and ending at the lower) and you get the same result, you can run it correctly in vector.” Is he right?

3. Dr. Grignard has also said that a loop that passes the test of running backward can also be run correctly as a gang loop. Is that correct?

4. The following procedure is one encoding of daxpy, a procedure for multiplying a scalar times a vector and adding the result to another vector (Alpha × X + Y).

void daxpy(int n, double alpha, double x[], int incx,
           double y[], int incy)
{
    for(int i=0; i<n; i++) {
        y[incy * i] = y[incy * i] + alpha * x[incx * i];
    }
}

5. Given the definition of daxpy as a procedure that adds two vectors, it seems natural to assume that the loop can be run in vector and in parallel.

a. Can a compiler vectorize this loop as is?

b. What should a compiler do if a user places a vector directive on the loop?

6. The text states that if a statement in a loop does not depend upon itself, then it can be executed simultaneously. Consider the following loop:

for(i=1; i<n; i++) {
    d[i] = a[i-1] + 1;
    a[i] = b[i] + c[i];
}

There is a dependency from the second statement to the first, but neither statement depends upon itself. Can the statements be executed simultaneously? What should a compiler do if the user places a vector directive around the loop?

7. The text states that if there is a dependency in a loop caused by the loop, then the loop cannot be executed in gang or worker mode. Consider again the following loop:

for(i=1; i<n; i++) {
    d[i] = a[i-1] + 1;
    a[i] = b[i] + c[i];
}

This loop carries a dependency from the second statement to the first. Can it be executed in gang or worker mode? If not, can you think of ways of changing it so that it can?

8. Following is a fairly standard coding of matrix multiply:

for(j=0; j<n; j++) {
    for(i=0; i<m; i++) {
        c[i][j] = 0;
        for(k=0; k<p; k++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}

What OpenACC directives would you place, and where, to get the best execution speed? Are there other changes you would make to the loops? Suppose the loop were instead written this way:

for(j=0; j<n; j++) {
    for(i=0; i<m; i++) {
        t = 0;
        for(k=0; k<p; k++) {
            t += a[i][k] * b[k][j];
        }
        c[i][j] = t;
    }
}

What would you do differently? Which version would you expect to execute faster, and why?
