Chapter 2. Loop-Level Parallelism

Adrian Jackson, EPCC

Loops are at the heart of the key computational kernels of many applications, especially in computational simulations. Because of the nature of the data that applications or algorithms operate on, data is stored in arrays, often multidimensional ones. This means that the majority of work in the application lies in iterating over those arrays and updating them using the algorithm being employed.

The simplest method for iterating over a multidimensional array in programming languages like C, C++, and Fortran is by using nested loops. Here’s an example in C:

double particles[N][M][P];
. . .
for(int i=0; i<N; i++){
    for(int j=0; j<M; j++){
        for(int k=0; k<P; k++){
            particles[i][j][k] = particles[i][j][k] * 2;
            . . .
        }
    }
}

Here is one in Fortran:

double precision, dimension(P,M,N) :: particles
. . .
do i=1,N
    do j=1,M
        do k=1,P
            particles(k,j,i) = particles(k,j,i) * 2
            . . .
        end do
    end do
end do

When you are looking for a way to use the large range of computational resources available in parallel computing hardware, it is therefore natural to alight on the idea of distributing the iterations of these nests of loops to different processing elements in a computational resource; in this way, you distribute work across available hardware and reduce the overall runtime of the application. Indeed, this is a core feature of parallel programming constructs, such as OpenMP1 and OpenCL.2

1. http://www.openmp.org.

2. https://www.khronos.org/opencl/.

OpenACC, as discussed in Chapter 1, also provides for mapping loops to computational hardware by annotating the loops to be parallelized with the kernels directive (#pragma acc kernels for C/C++, and !$acc kernels for Fortran) or the parallel loop directive (#pragma acc parallel loop for C/C++, and !$acc parallel loop for Fortran).

However, recognizing that accelerators or manycore (more than 12–18 cores) hardware may have several levels of parallelism—that is, cores, threads, and vector units—OpenACC allows you to specify a more detailed description of the way loops are mapped to the hardware being exploited by a given execution of an application.

This chapter explores in detail how to parallelize loops using OpenACC, discussing the various levels of parallelism provided and the clauses that can be added to the parallelization directives to ensure that the resulting parallelism still produces correct results, or to improve performance.

2.1 Kernels Versus Parallel Loops

Before we discuss the details of the loop-level parallelism available in OpenACC, it is worth spending a little time exploring the differences between the two main types of OpenACC directives you can use to parallelize loops (kernels and parallel) using a real code example.

As mentioned in Chapter 1, both directives can be used to distribute one or more loops across parallel hardware, but each achieves parallelization in a slightly different way. The kernels directive specifies a region of code that the compiler should analyze, leaving the compiler to decide what to parallelize and how to distribute it to the computing hardware. The parallel directive, on the other hand, asserts that the code in the parallel region is safe to parallelize and runs all the code within that region on all the available threads in the hardware being used.

For instance, if you were to use the kernels directive to parallelize the loops outlined in this chapter, it could look like this, for C:

#pragma acc kernels
{
    for(int i=0; i<N; i++){
        for(int j=0; j<M; j++){
            for(int k=0; k<P; k++){
                particles[i][j][k] = particles[i][j][k] * 2;
            }
        }
    }
}

Here it is in Fortran:

!$acc kernels
do i=1,N
    do j=1,M
        do k=1,P
            particles(k,j,i) = particles(k,j,i) * 2
        end do
    end do
end do
!$acc end kernels

If you were to use the parallel directive to parallelize the same code, it would look like this, for C:

#pragma acc parallel
{
    for(int i=0; i<N; i++){
        for(int j=0; j<M; j++){
            for(int k=0; k<P; k++){
                particles[i][j][k] = particles[i][j][k] * 2;
            }
        }
    }
}

Here it is in Fortran:

!$acc parallel
do i=1,N
    do j=1,M
        do k=1,P
            particles(k,j,i) = particles(k,j,i) * 2
        end do
    end do
end do
!$acc end parallel

However, remember that the parallel directive, by itself, is relatively useless. Indeed, the preceding code, when parallelized using the parallel directive on its own, will not actually achieve any splitting or sharing of work; rather, it will simply run the same code on many threads at once, and will therefore neither speed up the code nor produce the correct result.

To achieve correct and efficient parallelization when using the parallel directive, you must combine it with one or more loop directives to specify where to apply work sharing. Therefore, the example should have been parallelized as follows, in C:

#pragma acc parallel loop
for(int i=0; i<N; i++){
    for(int j=0; j<M; j++){
        for(int k=0; k<P; k++){
            particles[i][j][k] = particles[i][j][k] * 2;
        }
    }
}

And this way in Fortran:

!$acc parallel loop
do i=1,N
    do j=1,M
        do k=1,P
            particles(k,j,i) = particles(k,j,i) * 2
        end do
    end do
end do
!$acc end parallel loop

Both kernels and parallel allow you to include multiple loops inside them; however, the way they treat those loops differs. If there are multiple loops inside a kernel region, each loop (that the compiler decides it can parallelize) will generate a separate kernel to be executed, and these will run sequentially, one after the other. In contrast, if there are multiple loops inside a single parallel region, then these may run at the same time, or at least there is no guarantee that the first loop will have finished before the second one starts. Consider this example, in Fortran:

!$acc parallel
!$acc loop
do i=1,1000
    a(i) = b(i)
end do
!$acc loop
do i=1,1000
    b(i) = b(i)*2
    c(i) = b(i)+a(i)
end do
!$acc end parallel

Here it is in C:

#pragma acc parallel
{
    #pragma acc loop
    for(int i=0; i<1000; i++){
        a[i] = b[i];
    }
    #pragma acc loop
    for(int i=0; i<1000; i++){
        b[i] = b[i]*2;
        c[i] = b[i]+a[i];
    }
}

This code is not guaranteed to be correct, because the second loop could start executing before the first loop has finished, depending on how the loops have been mapped to the hardware by the compiler. If kernels had been used instead, this would have generated code that was guaranteed to be correct.
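
For comparison, here is a sketch in C of the same two loops placed in a kernels region. Because each loop in a kernels region generates its own kernel, and those kernels run one after the other, the second loop cannot start until the first has finished:

#pragma acc kernels
{
    for(int i=0; i<1000; i++){
        a[i] = b[i];
    }
    for(int i=0; i<1000; i++){
        b[i] = b[i]*2;
        c[i] = b[i]+a[i];
    }
}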

2.2 Three Levels of Parallelism

Because parallel hardware has different places where parallelism can occur—across processors, cores, threads, hyperthreads, vector units, and so on—OpenACC provides functionality for you to specify how to split work across different hardware units. This functionality is provided by the gang, worker, and vector clauses.

These clauses, which can be applied to the loop directive, affect how loops and loop iterations are distributed to hardware components on the targeted processors or accelerator. This may affect both the performance and the functionality of an application, because the types of synchronization that are available between loop iterations are dependent on the hardware where the iterations are executed.

We discuss these clauses in this section, but note that what they actually translate to depends on the hardware your application is running on and the compiler you are using. Most hardware that OpenACC programs run on supports at least two levels of parallelism (across cores and across vector units), but some hardware may support more or fewer levels. However, you can make some general assumptions: gang is the most coarse-grained, mapping to chunks of work assigned to various gangs of workers to perform independently; worker defines how work is distributed inside a gang; and vector maps to the hardware's instruction-level parallelism (i.e., vector operations).

The use of gang, worker, and vector allows programmers to abstract away the complexity of different hardware, providing a generalized mapping between the program's functionality and the hardware the program is running on. However, it is possible to specify in more detail exactly how to map loops to hardware, for instance by specifying the vector length along with the number of workers, or gangs, to use. This may provide optimized performance for specific hardware, albeit at the expense of how well the application's performance holds up when it is ported to other hardware (often termed performance portability).

Finally, note that OpenACC does not require programmers to specify this mapping between program functionality and parallel hardware. If you choose not to specify gang, worker, or vector clauses on your loop directive, then the compiler will attempt a mapping based on what it knows about the compiled code and hardware being used.

2.2.1 Gang, Worker, and Vector Clauses

Gangs can best be thought of as groups of workers, wherein different groups (gangs) can work independently, possibly at the same time, possibly at different times. Gangs do not allow for synchronization between different gangs, but workers within the same gang can synchronize (pause and wait for all workers, or a subset of workers, to reach the same place in the code, or protect the update of a shared variable between the gang's workers).

Workers are threads within a gang, and they allow users or developers to specify how threads can be mapped to the hardware being used. As such, they are an intermediate level between the group of threads (the gang) and the low-level parallelism implemented in vector (which specifies how instructions are mapped to the hardware available to an individual thread).

Vector parallelism has the finest granularity, with an individual instruction operating on multiple pieces of data (much like single-instruction, multiple-data (or SIMD) parallelism on a modern CPU, or warps on NVIDIA GPUs). These operations are performed with reference to a vector length, which specifies how many units of data will be operated on for a given instruction.

In general, a gang consists of one or more workers, each of which operates on a vector of some length. Within a gang the OpenACC model exposes a cache memory, which can be used by all workers and vectors within that gang and allows synchronization within the gang, although not as a programmer-specifiable feature (i.e., you cannot add clauses or directives to manually control this synchronization).
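
As an illustration, here is a sketch in C of how the three clauses might be applied to the particles loop nest used earlier in this chapter; the exact mapping chosen here is only one possibility, and the compiler remains free to adjust it:

#pragma acc parallel loop gang
for(int i=0; i<N; i++){
    #pragma acc loop worker
    for(int j=0; j<M; j++){
        #pragma acc loop vector
        for(int k=0; k<P; k++){
            particles[i][j][k] = particles[i][j][k] * 2;
        }
    }
}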

This three-level layout of parallelism enables a mapping to all current hardware platforms, as shown in Table 2.1.

Table 2.1 Potential mapping of gang, worker, and vector to various hardware platforms

Platform                         Gang                                    Worker      Vector
Multicore CPU                    Whole CPU (NUMA domain)                 Core        SIMD vector
Manycore CPU (e.g., Xeon Phi)    NUMA domain (whole chip or quadrant)    Core        SIMD vector
NVIDIA GPU (CUDA)                Thread block                            Warp        Thread
AMD GPU (OpenCL)                 Workgroup                               Wavefront   Thread

Note: The mapping actually used is up to the compiler and its runtime environment, and this table is not guaranteed for an actual OpenACC implementation. The programmer can guide the compiler only by specifying a number of each element to use.

2.2.2 Mapping Parallelism to Hardware

As discussed earlier, you do not have to specify mapping to hardware via these clauses in order to write a correct OpenACC program. Therefore, it is good practice not to include them until your developed or ported program has been validated and you are ready to start optimizing it.

At this point it is crucial to understand both the hardware you are targeting and your algorithm or application. If you know the sizes of your loops (or at least roughly the range of sizes they may be) and the synchronization you need between loop iterations, then you should be able to make sensible choices about what hardware mapping directives you can use on which loops.

For NVIDIA GPU hardware, it is likely that gangs map to thread blocks, workers to warps, and vector to threads. Alternatively, gangs could simply map to thread blocks and workers to the threads within those thread blocks. Likewise, different loops in a parallel or kernels region could map to different thread blocks.

On multicore or manycore processors you can imagine gangs mapping to a processor or NUMA region of a processor, workers to cores or hyperthreads on the processor, and vectors to the vector units in the cores.3 Alternatively, you could imagine gangs and workers mapping to cores, vectors to the vector units, and the NUMA characteristics of the node being ignored when mapping OpenACC directives to hardware.

3. NUMA stands for non-uniform memory access. If you have multiple processors in a node it is likely all of them can access the memory of the node, but it’s quicker to access memory directly attached to a processor than it is to access memory attached to another processor in the node. Each processor would be a NUMA region: a region where local memory accesses are faster than remote memory accesses.

For parallel and kernels regions, in addition to manually specifying these mapping clauses on individual loops, you can specify the sizes of gangs, workers, and vectorization to use at the higher level (i.e., on the parallel or kernels directive itself). This is done using the clauses num_gangs(), num_workers(), and vector_length().

These clauses take an integer value that specifies, respectively, the number of gangs to be used for each kernel or gang loop in the region, the number of workers per gang, and the vector length to use for vector operations. Note that these are optional clauses, and you can use any combination of them as required. Here's an example:

!$acc parallel loop num_gangs(32) vector_length(128)
do i=1,N
    do j=1,M
        do k=1,P
            particles(k,j,i) = particles(k,j,i) * 2
        end do
    end do
end do
!$acc end parallel loop
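
A C version of the same idea might look like this (the num_workers clause is added here purely for illustration; all three clauses are optional and can be used in any combination):

#pragma acc parallel loop num_gangs(32) num_workers(4) vector_length(128)
for(int i=0; i<N; i++){
    for(int j=0; j<M; j++){
        for(int k=0; k<P; k++){
            particles[i][j][k] = particles[i][j][k] * 2;
        }
    }
}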

If the value you specify is not supported or implementable on the hardware being used, the actual implementation may use a lower value than what you specify in these clauses.

2.3 Other Loop Constructs

This section discusses the loop constructs available in OpenACC.

2.3.1 Loop Collapse

When you manually specify how loop iterations should be mapped to the available parallel hardware, the loops you are parallelizing must have enough iterations to map to the various hardware levels. You must also specify loop clauses on all loops inside a parallel region, something that can be tedious and can impact the readability of the code.

As an alternative, you can treat a whole nest of loops as a single loop space to be distributed to the parallel hardware, specify the parallelism at a single loop level, and allow the compiler to sort out the mapping of those loops to specific hardware. This approach can also have the benefit of providing a larger loop iteration space to be distributed across hardware threads or vector hardware, which is useful if some loops in the loop nest have small iteration counts, as demonstrated in the following code:

for(int i=0; i<8; i++){
    for(int j=0; j<8; j++){
        for(int k=0; k<8; k++){
. . .
        }
    }
}

To tell the compiler to treat a whole loop nest as a single loop, you can use the collapse clause on the loop directive. This clause takes a single positive integer as an argument, which determines how many loops within the loop nest are collapsed into a single loop. Here's an example:

!$acc parallel
!$acc loop collapse(3)
do i=1,8
    do j=1,8
        do k=1,8
            …
        end do
    end do
end do
!$acc end parallel
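
Here is a sketch of the same collapsed loop nest in C, mirroring the Fortran example above:

#pragma acc parallel
{
    #pragma acc loop collapse(3)
    for(int i=0; i<8; i++){
        for(int j=0; j<8; j++){
            for(int k=0; k<8; k++){
                . . .
            }
        }
    }
}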

For collapse to work, the loops need to be tightly nested—that is, there is no code between loops—and the size of the loops must be computable and must not change during the loop.

2.3.2 Independent Clause

A large fraction of OpenACC's functionality relies on the compiler's ability to analyze code and work out whether it can be safely parallelized (as in kernels regions), but compilers do not always have enough information to make this decision safely.

Therefore, compilers often err on the side of caution when deciding which loops can be parallelized, and this may mean performance is lost because functionality that could run across many threads is instead run sequentially. Here's an example of such code:

!$acc kernels
!$acc loop
do i = 1, 512
    do j = 1, 1024
        local_index = ((i-1)*1024)+j
        halo_data(local_index) = particle_data(j,i)
    enddo
enddo
!$acc end kernels

The code in the innermost loop depends on the outer loop counter (to calculate local_index), and therefore the compiler may decide there is a loop dependency (i.e., that one iteration of the loop depends on the result of the previous iteration); as a result, the compiler produces a sequential kernel rather than parallelizing the loops. You can address this problem by adding the independent clause to the loop directive, telling the compiler that you guarantee the loop iterations are independent and that it can therefore safely parallelize the loops.
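
A minimal sketch of the same pattern in C, with the independent clause asserting that the iterations do not conflict (halo_data and particle_data are assumed to be suitably declared arrays), might look like this:

#pragma acc kernels
#pragma acc loop independent
for(int i=0; i<512; i++){
    #pragma acc loop independent
    for(int j=0; j<1024; j++){
        int local_index = (i*1024)+j;
        halo_data[local_index] = particle_data[i][j];
    }
}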

Note that this issue is not restricted to OpenACC or accelerator applications; compilers have similar issues when trying to vectorize code, and many compilers or parallel programming approaches provide methods for giving the compiler more information about loops to help it distribute iterations to parallel hardware.

The independent clause is applicable only to loops in kernels regions, because the parallel directive already informs the compiler that all loops inside the parallel region are independent unless they explicitly include a seq clause.

In general, it is not necessary to add independent to all your loops in kernels regions. This is primarily an optimization step after you have parallelized your code with OpenACC.

Compiler reports on the parallelism that has been implemented, or profiling data from your application, can usually guide you to loops that do not seem to be parallelizing. You can then decide whether they truly are independent and therefore whether you can apply this clause.

For instance, the PGI compiler, when compiling code using the -Minfo=accel flag, may produce output like this:

      . . . Parallelization would require privatization of
            array 'halo_data(:)'
         Accelerator kernel generated
         !$acc loop seq

Such output indicates that the associated code loop needs further investigation.

2.3.3 Seq and Auto Clauses

At times a loop may appear parallelizable but contains some data dependency or calculations that the compiler does not recognize as restricting the parallelism. In these cases, when using the kernels directive, the compiler may erroneously parallelize a loop when it should not.

Furthermore, the parallel directive instructs the compiler to parallelize all the code contained within the parallel region, whether or not it is correct to do so.

Also, loop nests to be parallelized with OpenACC directives may contain more loops than there are types of parallelism you are targeting (i.e., more than three loops, which you could map to gang, worker, and vector). In this scenario it is often possible to collapse loops to reduce the number of loops to be distributed to hardware, but in some situations loop collapsing is not possible (e.g., loops that are not perfectly nested).

Therefore, OpenACC provides a clause, seq, that you can add to a loop to stop the compiler from parallelizing or vectorizing that loop, or to control where parallelism happens in the loop nest. You can add seq to any loop directive, but it is mutually exclusive with the gang, worker, and vector clauses (because they specify parallelism, and seq disables parallelism). Here’s an example:

#pragma acc parallel
#pragma acc loop gang
for (f=0; f<A; f++){
    #pragma acc loop worker
    for (g=0; g<B; g++){
        #pragma acc loop seq
        for (h=0; h<X; h++){
            #pragma acc loop vector
            for (i=0; i<Q; i++){
                …
            }
        }
    }
}

You can use auto, on the other hand, to instruct the compiler to decide for itself which levels of parallelism the loops should be mapped to. This clause is automatically applied to any loop inside a kernels region that has not had a parallelism clause (gang, worker, vector, or seq) applied to it.

However, note that auto does not instruct the compiler that loops are independent. Therefore, auto may need to be combined with the independent clause for loops the compiler is struggling to automatically parallelize.

You can also use auto to enable compiler checking on loops inside a parallel region. As previously discussed, these loops are assumed to be independent and parallelizable by default.
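
A minimal sketch of this in C might look like the following; with auto, the compiler analyzes the loop and decides whether it is safe to parallelize, rather than assuming independence:

#pragma acc parallel
{
    #pragma acc loop auto
    for(int i=0; i<N; i++){
        a[i] = a[i] * 2;
    }
}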

2.3.4 Reduction Clause

A number of code or algorithm features can cause compilers not to be able to automatically parallelize loops. One occurs when a single variable, or an array of variables, is updated during each loop iteration. Here’s an example, in C:

total = 0;
for(i=0;i<100;i++){
    total += data[i];
}

Here it is in Fortran:

total = 0
do i=1,100
    total = total + data(i)
end do

Because the variable total at iteration i depends on iteration i-1, the compiler will not be able to automatically parallelize the loop. However, you can fix this issue by using a reduction clause.

The reduction clause, which can be added to a kernels, parallel, or loop directive, specifies that a variable will be operated on locally by each thread working on it, and that a final operation will then bring together all the local versions of the variable. It does something like the following.

Convert this loop:

total = 0
do i=1,100
   total = total + data(i)
end do

To this set of loops:

! declare an array of local_totals with an entry for each thread
! running the code
total = 0
local_totals(thread_number) = 0
! parallelize this loop
do i=1,100
   local_totals(thread_number) = &
       local_totals(thread_number) + data(i)
end do
! do this loop in serial
do i=1,number_of_threads
   total = total + local_totals(i)
end do

The reduction clause enables you to maintain your original loop in your source code and automatically convert it into a set of parallel and sequential loops when compilation happens. In this way, the compiler parallelizes as much as possible while producing the correct result.

In OpenACC, a reduction clause can apply the operators shown in Table 2.2 to scalar variables. (Reductions are currently not allowed for arrays or array elements, C structure members, C++ class or structure members, or parts of a derived type in Fortran.)

Table 2.2 Supported reduction operations

C AND C++    FORTRAN
+            +
*            *
max          max
min          min
&            iand
|            ior
^            ieor
&&           .and.
||           .or.
             .eqv.
             .neqv.

When using a reduction clause, you do not need to initialize your variable; it will be automatically initialized correctly for you.

Here’s an example of applying a reduction to the loop outlined earlier:

total = 0
!$acc parallel loop reduction(+:total)
do i=1,100
    total = total + data(i)
end do
!$acc end parallel loop

Here it is for C:

total = 0;
#pragma acc parallel loop reduction(+:total)
for(i=0; i<100; i++){
    total = total + data[i];
}

With parallel regions, you can use the reduction clause either for the whole parallel region (as a clause on the parallel directive itself) or on a loop directive.

If you are using the kernels directive, the reduction clause can be applied only at the loop directive level.
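
For example, the same reduction written with the kernels directive might look like this in C, with the clause placed on the loop directive:

#pragma acc kernels
{
    #pragma acc loop reduction(+:total)
    for(i=0; i<100; i++){
        total = total + data[i];
    }
}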

Within a parallel region, if you apply reduction on a loop by using the vector or worker clause (and no gang clause) and if the reduction variable is listed as a private variable, then the value of the private copy of the scalar will be updated at the exit of the loop. If the reduction variable is not specified as private, or if you apply the reduction to a loop by using the gang clause, the value of the variable will not be updated until execution reaches the end of the parallel region.

2.4 Summary

Loop parallelization functionality is at the heart of the ability of OpenACC to exploit multicore, manycore, and GPU hardware. Efficiently and correctly parallelizing loops across hardware is key to ensuring good performance.

The parallel and kernels directives differ; kernels gives the compiler more responsibility for identifying parallelism, and parallel mandates that the compiler parallelize the code within that region.

OpenACC provides extra functionality, additional clauses that can be added to parallelization directives, to enable you to help the compiler efficiently parallelize your code and then to sensibly map it to the hardware you are using.

Judicious use of independent, collapse, and reduction clauses, where required and safe, will enable the compiler to parallelize your code as much as possible. Specifying how to map loops to hardware features (threads, cores, vector hardware, etc.)—based on the hardware you are exploiting and your knowledge of your code—can lead to further performance improvements.

Hopefully, with this information and some experimenting, you will be able to build correct OpenACC code and make it fly. Just remember that correctness should come before performance, so having well-tested and validated OpenACC code should be your first target before you start optimizing the loops and exploiting the hardware you are using to its full potential.

2.5 Exercises

1. If you run the following fragment of OpenACC on a GPU using all 1,536 threads that the GPU has, how many times as fast would you expect it to run, compared with using a single thread on the GPU?

#pragma acc parallel
for(i=0;i<2359296;i++){
    answer[i] = part1[i] + part2[i] * 4;
}

2. The following fragment of code produces an incorrect answer. Can you fix it?

passing_count = 0
!$acc parallel loop
do i=1,10000
    moving(i) = (left(i) - right(i)) * 2
    left(i) = left(i) + 0.6
    right(i) = right(i) - 3
    passing_count = passing_count + 1
end do
!$acc end parallel loop

3. The following code does not efficiently utilize the hardware available on a GPU. Can you suggest a way of optimizing it to improve the utilization of the GPU?

#pragma acc parallel loop
for(i=0;i<8;i++){
    for(j=0;j<320;j++){
        for(k=0;k<320;k++){
            new_data[i][j][k] = (old_data[i][j][k] * 2) + 4.3;
        }
    }
}
