CHAPTER 6

Compiling for distributed memory machines

Up until now, our focus has been almost entirely on compiling for parallelism on shared memory machines, motivated in part by the dominance of multicore processors. Nevertheless, all large machines for high-performance numerical computing, as well as many small scale clusters, have a physically distributed memory architecture and are programmed using a distributed memory programming model.

Distributed memory machines consist of nodes connected to one another by Ethernet or a variety of proprietary interconnects. Examples include the IBM Blue Gene machines [29, 162], clusters of workstations, and small clusters of rack-mounted nodes [173]. Distributed memory machines usually execute single program multiple data (SPMD) programs. In an SPMD program, the program’s data is distributed (or spread) across multiple instances of the same program, allowing each instance to compute in parallel on its part of the data. Because the computations are not fully independent, communication must sometimes be used to transmit a coherent copy of a datum to a process that needs it. This communication is carried out across the nodes of a distributed memory machine using explicit messages and collective communication operations that must be specified by either the programmer or the compiler.
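A minimal sketch of such an SPMD program, written in C with MPI, is shown below. It computes the average of a block-distributed array, the same computation as the example of Figure 6.1; the array size, the assumption that the size divides evenly among the processes, and the use of MPI_Allreduce are illustrative choices rather than the book's code.

/* A sketch of an SPMD program: each process sums its own block of the
 * array and a collective operation combines the partial sums. */
#include <mpi.h>
#include <stdio.h>

#define N 100                        /* global array size (assumed to divide evenly) */

int main(int argc, char *argv[]) {
    int pid, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int chunk = N / nprocs;          /* size of the locally owned block       */
    double a_local[N];               /* oversized local array, for simplicity */
    for (int i = 0; i < chunk; i++)  /* initialize the locally owned elements */
        a_local[i] = (double)(pid * chunk + i);

    double local_sum = 0.0, global_sum = 0.0;
    for (int i = 0; i < chunk; i++)
        local_sum += a_local[i];

    /* Collective communication: every process obtains the global sum. */
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (pid == 0)
        printf("average = %f\n", global_sum / N);

    MPI_Finalize();
    return 0;
}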

It is increasingly common for nodes to be multicore processors. In this case, each core is usually treated as a single node, and the fact that a shared memory exists across some of the nodes is exploited by the underlying MPI library, if at all. In some cases a hybrid model is used, where a multithreaded program runs on each node and communicates via message passing with other nodes, but we do not know of compiler support for this model. The complexity of maintaining two programming models has prevented it from becoming standard practice. In what follows, we assume that each core is treated as a different node.

Figure 6.1(b) shows a high-level diagram of a distributed memory machine, and Figure 6.1(a) shows a simple program that computes the average of the elements in an array. Several interesting concepts are touched on by this figure. For example, the network topology is not specified. In general, compilers view data as near or far, and are not cognizant of the details of a specific network topology. The figure also shows that arrays are distributed across the processes, while a copy of each scalar often resides on every process.

In Figure 6.1(c), the SPMD program that performs this computation is shown. For simplicity, it is assumed that the data array has 100 elements, and that there are four processes, i.e., the array size is evenly divisible by the number of processes. In forming the SPMD program, several problems must be solved:


Figure 6.1: An example of a distributed memory program.

Data distribution. The data used by the program must be distributed across the different processes and placed into local arrays, and the subscripts accessing the original, undistributed arrays must be modified to access the local, distributed arrays. These issues are discussed in detail in Section 6.1.

Computing the data accessed by a reference. As a precursor to efficiently solving the next two problems, it is necessary to compute the data accessed by an array reference. How to do this is discussed in Section 6.2.

Computation partitioning. The bounds of the loops on each processor must be shrunk to only cover the iterations of the computation that is performed by each process. This problem is discussed in Section 6.3.

Generating communication. A global, coherent view of the data is maintained by explicit communication between the program instances running on different processes. As well, it is desirable to move communication out of inner-most loops, both to maximize parallelism and to minimize the overhead of communication operations. How compilers do this is discussed in Section 6.4.

Programming languages targeting distributed memory machines. Several programming languages have been designed to support program development for distributed memory machines. Three of the more popular are discussed in Section 6.5.

6.1 DATA DISTRIBUTION

Parallelism on distributed memory machines is exploited by spreading the computation across multiple processors or cores. On distributed memory machines, this is done by distributing the data across the processors executing the program. This has two benefits. First, each processor (or more accurately, each process executing on a processor) computes the new values for the data that resides on it, or is owned by it, in parallel with the other processes. The scheduling of computation on the process that owns the storage location whose value is being computed is called the owner computes rule [198]. Second, by distributing the large arrays, programs can scale both in performance and in their maximum size, since each new process added to the computation increases the amount of physical memory the program can use.

There are five commonly used data distribution mechanisms.

1. Replicated: a copy of the data is given to every process. Figures 6.2(a) and (b) show an example of a replicated distribution of an array.

2. Block: The array is divided into |P| chunks (where |P| represents the number of processes), or blocks, of contiguous elements which are distributed onto the processes. Figures 6.3(a) and (b) show an example of a block distributed array.

3. Cyclic or cyclic(1): Elements of the array are distributed in a round-robin, or cyclic, fashion across processes. Figures 6.4(a) and (b) show an example of a cyclically distributed array.

4. Cyclic(bs): The array is broken into blocks containing bs (block size) elements, which are then distributed in a cyclic pattern across the processes. This distribution is sometimes referred to as a block-cyclic distribution. Figures 6.5(a) and (b) show an example of a block-cyclically distributed array.

5. General: Generalizations of block-cyclic distributions exist where the array is broken into varying sized blocks, but are beyond the scope of this lecture and are not supported by any commercially available or widely used compilers. In Chapter 8, papers that explore this in more depth are suggested.


Figure 6.2: Layout of the replicated C array a[99] on three processes.


Figure 6.3: Layout of a block distributed array onto three processes.


Figure 6.4: Layout of a cyclically distributed array onto three processes.

Each dimension of an array is distributed independently of other dimensions. Thus, one dimension can be distributed with a block distribution, and another with a cyclic, block-cyclic or replicated distribution. The distributed array will have two sets of bounds: the global bounds, which are the bounds of the initial undistributed array, and the local bounds, which are the bounds of the array on some process. We now describe each distribution in detail for a one-dimensional array a with global lower and upper bounds of La and Ua, respectively.


Figure 6.5: Layout of a block-cyclic distributed array onto three processes with a block size of three.

6.1.1 REPLICATED DISTRIBUTION

The replicated distribution is the simplest of the five distributions we will discuss. Each process p owns the elements {e | La ≤ e ≤ Ua}; that is, element e is owned by all processes {pi | 0 ≤ i < |P|}. Because all processes own every element of the array, under the owner computes rule all processes will (redundantly) compute new values for all elements and will use these locally computed values. This means that if some element of a replicated array a, e.g., a[i], is assigned a value that is a function of the process id, e.g., a[i] = pid, then the values of a[i] that exist, and are read, on different processes will differ. Replicating an array is desirable when the cost of performing the redundant computation is less than the cost of communicating the subset of values computed on each process to all other processes.

6.1.2 BLOCK DISTRIBUTION

In the block distribution, the array needs to be divided into |P| blocks. We present a computationally simple way of describing this from [187]. Let La be the global lower bound of a, and Ua the global upper bound. The number of elements in a is then |a| = Ua − La + 1. The formula for La,p (Ua,p), the lower (upper) bound of a on process p, is:

Images

Note that the upper bound is simply one less than the lower bound of the next highest processor.

The elements owned by process p can be described by the triplet [La,p : Ua,p : 1], i.e., the set {e | La,p ≤ e ≤ Ua,p}. The owning process of some element a[σ(I)], where σ(I) is the subscript for the block distributed dimension whose bounds are being computed and I is the vector of index variables for the loop nest surrounding the reference, is

Images

Because computing the intersection of different owned sets requires knowledge of other distributions, we defer this discussion until Section 6.4.
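As a concrete illustration, the following C sketch implements one common convention for the block distribution: a block size of ⌈|a|/|P|⌉, with each upper bound being one less than the next process's lower bound, as noted above. The exact formula used in the text is not reproduced here, so treat these helper functions as an assumed, representative variant.

/* A sketch of block-distribution bounds and ownership under the convention
 * block size = ceil(|a| / |P|). */
#include <stdio.h>

int block_lb(int La, int Ua, int P, int p) {     /* lower bound of a on p    */
    int n  = Ua - La + 1;                        /* |a|                       */
    int bs = (n + P - 1) / P;                    /* ceil(|a| / |P|)           */
    return La + p * bs;
}

int block_ub(int La, int Ua, int P, int p) {     /* upper bound of a on p     */
    int ub = block_lb(La, Ua, P, p + 1) - 1;     /* next lower bound, less 1  */
    return ub < Ua ? ub : Ua;
}

int block_owner(int La, int Ua, int P, int e) {  /* owner of global element e */
    int n  = Ua - La + 1;
    int bs = (n + P - 1) / P;
    return (e - La) / bs;
}

int main(void) {
    /* A 100-element array on 4 processes: blocks of 25 elements each. */
    for (int p = 0; p < 4; p++)
        printf("process %d owns [%d:%d]\n", p,
               block_lb(0, 99, 4, p), block_ub(0, 99, 4, p));
    printf("element 60 is owned by process %d\n", block_owner(0, 99, 4, 60));
    return 0;
}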

6.1.3 CYCLIC DISTRIBUTION

The cyclic distribution spreads elements of a across processes in a round-robin fashion. Process p has the p’th element of the array, and there are Ua − La + 1 array elements in total. Thus, after placing the first element onto process p, there are Ua − La + 1 − (p + 1) elements left to be distributed, and p will get every |P|’th of those elements. More succinctly, the number of elements |e| is given by:

Images

Thus, the maximum element index on the process, in the global index space of the array, is

Images

The bound on the number of elements on each process allows us to define the set of elements owned by each process as a Diophantine expression with bounds, i.e.,

Images
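The round-robin placement is easy to state in code. The following C sketch assumes a zero-based process numbering and shows, for a small example, which global elements each process owns, the local index of each element, and the element count just described; the array bounds and process count are illustrative.

/* A sketch of the cyclic distribution: process p owns global elements
 * La + p, La + p + |P|, La + p + 2|P|, ... */
#include <stdio.h>

int cyclic_owner(int La, int P, int e)       { return (e - La) % P; }
int cyclic_local_index(int La, int P, int e) { return (e - La) / P; }

int cyclic_count(int La, int Ua, int P, int p) { /* number of elements on p */
    int n = Ua - La + 1;
    return (n - p + P - 1) / P;                  /* ceil((|a| - p) / |P|)    */
}

int main(void) {
    /* 10 elements on 3 processes: p0 owns 0,3,6,9; p1 owns 1,4,7; p2 owns 2,5,8. */
    for (int p = 0; p < 3; p++) {
        printf("process %d (%d elements):", p, cyclic_count(0, 9, 3, p));
        for (int e = p; e <= 9; e += 3)
            printf(" a[%d]->local[%d]", e, cyclic_local_index(0, 3, e));
        printf("\n");
    }
    return 0;
}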

6.1.4 BLOCK-CYCLIC DISTRIBUTION

The block and cyclic distributions are degenerate forms of the more general block-cyclic distribution. Whereas the block distribution emphasizes spatial locality, and the cyclic distribution emphasizes load balance, the block-cyclic distribution allows both benefits to be achieved. In the block-cyclic distribution, blocks of size bs are cyclically distributed across the |P| processes.

Historically, the block-cyclic distribution was the last distribution for which access and ownership functions were defined in a closed form. The problem was of great concern in the early nineties, because High Performance Fortran [124], which sought to automate much of the difficulty of writing distributed memory programs, specified the block-cyclic distribution as one of its fundamental and supported distributions. The difficulty of creating closed-form expressions for the block-cyclic distribution was, in hindsight, overstated. The crux of the problem is that while the elements owned by a process under a block or cyclic distribution can be expressed as a linear function in one variable, the block-cyclic distribution requires the use of two variables. Once this key fact is observed, a solution is straightforward to obtain.

Because of this, there were no good solutions for block-cyclic distributions until the mid-1990s, at which time multiple solutions appeared almost simultaneously. A good overview of these solutions can be found in [239]. Most of these solutions varied in performance by a small constant, and in most cases the performance differences can be attributed to implementation tuning. In the remainder of this section we will focus on one of these techniques, that of [153, 154], because of the author’s familiarity with the method.

Figures 6.5(a) and (b) show how the blocks of data are distributed. The technique described below will use one variable to traverse the blocks that are owned by a process, and a second variable to traverse the array elements within a block.

We will label the blocks resident on the process by the first element, in the global index space of the array, of each block. Then, the blocks βp owned by process p are given by

Images

where each block is labeled by the index of its first element in the local array bounds. To enumerate the elements within a block b ∈ βp we use

Images

and the region (or elements) of an array owned by a process p is

Images

If we label each block on a process p from Lb,p = 0 to some upper bound Ub,p − 1, where Ub,p is the number of blocks, then

Images

We can rewrite the region of a owned by process p as

Images
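The two-variable formulation can be made concrete with a short C sketch that enumerates the elements owned by one process: an outer loop steps through the blocks owned by the process and an inner loop steps through the elements of each block. The array bounds, block size, and process count below are illustrative values, not taken from the text.

/* A sketch of enumerating the elements owned by process p under a
 * block-cyclic(bs) distribution, using one variable for blocks and a
 * second for elements within a block. */
#include <stdio.h>

int main(void) {
    int La = 0, Ua = 19;            /* global bounds of a                      */
    int P  = 3, p = 1, bs = 3;      /* |P| processes, this process, block size */

    /* Blocks owned by p begin at La + (p + k*|P|)*bs for k = 0, 1, 2, ... */
    for (int start = La + p * bs; start <= Ua; start += P * bs) {
        printf("block starting at %d:", start);
        /* Elements within the block, clipped to the global upper bound. */
        for (int off = 0; off < bs && start + off <= Ua; off++)
            printf(" a[%d]", start + off);
        printf("\n");
    }
    return 0;
}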

6.2 COMPUTING THE REFERENCE SET

The next problem is how to compute the set of elements owned by some process p (the elements of some array a that reside on p) that are accessed by a reference a[. . . , σd(I), . . .], where σd(I) is the subscript of the block distributed dimension d of a under consideration. I is the vector of normalized (see Section 4.2) index variable values in the loop nest surrounding the reference, and each ij ∈ I takes on values 0 ≤ ij < Uij. It is desirable to be able to compute the elements of dimension d that are both accessed by this reference and that reside on the process.

Let σd(I) = c1i1 + c2i2 + . . . + cnin + c0; then the region of the array referenced by σd(I) across every process is

Images

We are now ready to compute the array elements accessed in the dimension d of array reference a[σd(I)] on process p, using some distribution. Intuitively, we need to intersect the set of elements on process p under the distribution with the set ρd of elements accessed by the reference. This can be done by intersecting the affine expression describing the owned elements with the array elements accessed by the affine subscript expression σ, i.e., by solving Equations 6.3 and 6.4.

Because block-cyclic is the most complicated distribution, we will examine computing the reference set for it in detail. For simplicity, we assume that σd(I) only involves a single loop index ij.

For a block cyclic distribution, we again equate the expressions for what is accessed by the reference and what is owned by the process, getting:

Images

Moving terms not involving the equation variables e, ij and b to the left-hand side gives:

Images

We can solve this equation for the intersection using the techniques for solving Diophantine equations discussed in Section 2.3.2 and in more detail in Chapter 7. We first form the C matrix:

Images

and the A matrix:

Images

Forming the 2 × 4 IA matrix and using the elimination procedure noted in Section 7.1 gives:

Images

where the first three columns are the U matrix and the last column is the D matrix. Since TD = C we can write T as T = [c0bsbs p, t2, t3]. We now find the parametric equations for b, e and ij by multiplying TU, giving

Images

where all terms on the right (except for the parameters ti) are known values at run time.
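Since the parametric solution depends on the omitted matrices, the following C sketch checks the same reference set by brute force: it enumerates the iterations of a loop whose subscript is a[c*i + c0] and keeps those that touch elements owned by process p under a block-cyclic(bs) distribution. The subscript coefficients, loop bound, and distribution parameters are made-up values for illustration; a real compiler would instead enumerate the solutions of the parametric equations directly.

/* A brute-force check of the reference set that the closed-form
 * (Diophantine) solution computes directly. */
#include <stdio.h>

/* Owner of global element e under block-cyclic(bs) on |P| processes,
 * assuming a zero-based global index space. */
int bc_owner(int e, int bs, int P) { return (e / bs) % P; }

int main(void) {
    int c = 2, c0 = 1;              /* subscript a[2*i + 1]                   */
    int Ui = 40;                    /* loop runs i = 0 .. Ui-1                */
    int bs = 3, P = 4, p = 2;       /* distribution parameters and process id */

    printf("iterations of i whose accesses fall on process %d:", p);
    for (int i = 0; i < Ui; i++)
        if (bc_owner(c * i + c0, bs, P) == p)
            printf(" %d", i);
    printf("\n");
    return 0;
}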

6.3 COMPUTATION PARTITIONING

Given an array reference within a loop, and a surrounding loop nest, it is necessary to find the iterations of the loop nest that access the array. We will show how to do this for each of the four distributions discussed above (replicated, block, cyclic, and block-cyclic). In general, we assume an affine expression and bounds on the variables of that expression that describe the elements owned by a process, an affine subscript function σ(I) in the index variables of the surrounding loop nest I = [i1, i2, . . . , in], and bounds on the loops. Let σ−1 be an inverse of the subscript function that returns a value for id given an array element e and bounds on all index variables other than id. Our goal is to intersect the elements owned by a process with those accessed by the subscript, and from the resulting solution find the maximum and minimum values of id that are needed to access the elements.

Below, we describe how to find the bounds of the i loop for each dimension of a referenced array. Afterwards we discuss what to do when an array has multiple distributed dimensions, and when there are multiple references in a loop.

6.3.1 BOUNDS OF REPLICATED ARRAY DIMENSIONS

For a replicated distribution, all elements of the array are on each process, and therefore all iterations of the surrounding loop nest must be executed.

6.3.2 BOUNDS OF BLOCK DISTRIBUTED ARRAY DIMENSIONS

For a block distribution, let us first consider the case where σ is an affine subscript expression in a single index variable ik, i.e., it has the form e = c·ik + c0, so that ik = (e − c0)/c. Given the upper (lower) bound of a on p, Ua,p (La,p), the bounds on ik are

Images
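For the common case c > 0, the bounds amount to intersecting the interval [La,p : Ua,p] with the values that c·ik + c0 can take, giving ik between ⌈(La,p − c0)/c⌉ and ⌊(Ua,p − c0)/c⌋, clipped to the original loop bounds. The C sketch below computes these shrunken bounds; the particular subscript and block values are illustrative, and the exact formula in the text is not reproduced.

/* A sketch of shrinking the i loop for a reference a[c*i + c0] so that it
 * covers only the block [Lap : Uap] owned by this process (assumes c > 0). */
#include <stdio.h>

int ceil_div(int a, int b)  { return (a >= 0) ? (a + b - 1) / b : -((-a) / b); }
int floor_div(int a, int b) { return (a >= 0) ? a / b : -(((-a) + b - 1) / b); }

int main(void) {
    int c = 3, c0 = 2;              /* subscript a[3*i + 2]                  */
    int Ui = 30;                    /* original loop: i = 0 .. Ui-1          */
    int Lap = 25, Uap = 49;         /* block owned by this process           */

    int lb = ceil_div(Lap - c0, c);      /* smallest i with c*i + c0 >= Lap  */
    int ub = floor_div(Uap - c0, c);     /* largest i with c*i + c0 <= Uap   */
    if (lb < 0) lb = 0;                  /* clip to the original loop bounds */
    if (ub > Ui - 1) ub = Ui - 1;

    printf("shrunken loop: for (i = %d; i <= %d; i++)\n", lb, ub);
    return 0;
}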

6.3.3 BOUNDS OF CYCLICALLY DISTRIBUTED ARRAY DIMENSIONS

A cyclic reference is somewhat more difficult because the elements are not contiguous, and the stride of the array accesses must be accounted for.

The upper and lower bounds of the ik values for p can be found as above for the block distribution. The elements accessed by the subscript expression are simply the values of σ evaluated over the range given by those bounds. The elements that are actually present on the process can be described by the function e = |P|i′ + p, i′ ∈ [0 : |e|], where |e| is the number of elements given in Section 6.1.3. We solve for the values where the elements accessed by the subscript coincide with the elements present on the process, that is, where

Images

i.e.,

Images

Using the techniques from the background material of Section 7.1 allows the solution for ik to be cast in terms of a parametric equation that can be used to enumerate the loop index values.

6.3.4 BOUNDS OF BLOCK-CYCLICALLY DISTRIBUTED ARRAY DIMENSIONS

The block-cyclic case is the most difficult. What is needed is a loop that traverses the blocks, a nested loop that traverses the elements within each block, a value of i in the global iteration space of the loop, and an index into the local array. First, we find the bounds on b, the starting elements of blocks that fall within the bounds defined by the block size, the array size, and the iteration space of the loop. From Equation 6.5 we know that b = t3, and therefore the bounds on t3 are the bounds on b. Next, we find the bounds on t2, needed to compute the elements e that are accessed within each block. We know that 0 ≤ e ≤ bs − 1 from the discussion in Section 6.1.4 above. Substituting the right-hand side of the equation for e in Equation 6.5 above we get

Images

and solving for t2 in the inequality gives

Images

In Equation 6.5, ij = t3, and therefore the inequality above gives the bounds on ij. Figure 6.6 shows how the bounds of a loop are transformed to perform only the iterations that access the locally owned elements of a block-cyclically distributed array.


Figure 6.6: An example of determining loop bounds when accessing a block-cyclically distributed array.
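In the spirit of Figure 6.6, the following C sketch shows the transformed loop structure for the simplest subscript, a[i]: the original loop over the global index space becomes an outer loop over the blocks owned by the process and an inner loop over the elements within each block, with a running local index used to address the local array. The sizes and the loop body are illustrative.

/* A sketch of the transformed loop for
 *     for (i = 0; i < N; i++) a[i] = ...;
 * when a is block-cyclic(bs) distributed onto |P| processes. */
#include <stdio.h>

int main(void) {
    int N = 20, bs = 3, P = 3, p = 1;
    double a_local[20];              /* oversized local array, for simplicity */

    int local = 0;                   /* running local index on this process   */
    for (int start = p * bs; start < N; start += P * bs)        /* blocks     */
        for (int off = 0; off < bs && start + off < N; off++) { /* elements   */
            int i = start + off;     /* global iteration / global index       */
            a_local[local] = 2.0 * i;    /* stands in for the loop body       */
            printf("i=%2d -> a_local[%d]\n", i, local);
            local++;
        }
    return 0;
}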

6.3.5 SUBSCRIPTS WITH MULTIPLE LOOP INDICES

What do we do if σ is not in a single variable, but contains multiple index variables? Again, assume we are trying to find the bounds on a loop with index ik. For all loops outside of ik in the loop nest, the values of their index variables i1, i2, . . . , ik−1 are known, and effectively constant, values when an instance of ik begins to execute.

Thus, the sum of

Images

can be formed and made part of the constant in the bounds formula above. For the loops nested within ik, i.e., loops with index variables ik+1, . . . , in we know the maximum and minimum values of the index variables. Thus, we can find the extreme values of the integer expression

Images

as described in Section 7.3. For finding the lower bound of ik we need the smallest value of Images which requires using the lower bound of the expression of Equation 6.9. Similarly, to find the upper bound we need to use the upper bound of the expression of Equation 6.9. Index variables not mentioned in the subscript of the distributed dimension are not constrained by the distribution, and the full range of iterations must be executed.

6.3.6 GENERATING LOOP BOUNDS FOR MULTI-DIMENSIONAL ARRAYS

The techniques described so far in this section compute, for a single dimension of an array, the set of iterations of the surrounding loop nest that must be executed to access the elements owned by a process and referenced by that dimension’s subscript expression. In general, arrays will contain multiple distributed dimensions.

In the simple case the different distributed dimensions are indexed by different index variables. Let id be the index variable used to index dimension d of the array. Then the constraints imposed on the values of id found using the formulae of this section give the necessary bounds on id for this reference.

The more complicated case arises when one or more loop indices ik are used to index several dimensions id1, id2, . . . , idn of the array. For a value of ik to be in the iteration space accessing the reference, it must be valid for indexing into each dimension d1, d2, . . . , dn. Therefore, the contribution of this reference to the iteration space of the ik loop is the intersection of the iteration sets defined by each of these dimensions.

6.3.7 GENERATING LOOP BOUNDS WITH MULTIPLE REFERENCES

Loops with a single reference are exceedingly rare, so it is necessary to combine the iteration spaces defined by each reference into a total iteration space. If some iteration of a loop index i is needed by any reference, it must be included in the iteration space of the loop. Thus, given the sets of iterations for the r references in the loop, the total iteration space of the loop nest is their union. Because the iteration sets are unioned, the resulting loop iteration set will likely be too large for at least some references. To protect references from being executed for invalid iterations, a guard must be inserted around each reference aj that allows the reference to execute only in those iterations where the value of the current loop nest index is in the iteration set for aj.

At first this seems horrendously expensive, since in the worst case almost the whole iteration space of the loop will be executed, with expensive guards, implemented as if statements, executed for each reference. Typically, the situation is much better than this, especially in well written programs. The references within a loop generally access arrays of the same size and with the same distribution. This leads to similar iteration sets, and loop bounds, being defined for each reference. When bounds do differ, they often differ by one. In this case, iteration peeling (see Section 4.1) can be used to pull off the iterations that are not shared by all references, making the majority of the iterations guard free. Even if iteration peeling cannot be used, the loop bounds defined by each reference largely overlap, the work done on each process is roughly 1/|P| of the total work, and significant parallelism is realized in the execution.
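The following C sketch shows the guarded form for a loop containing two references whose iteration sets differ by one. The ownership tests owns_a() and owns_b() are hypothetical stand-ins for the bounds computations of this section, and the arrays are kept at global size for simplicity rather than being localized.

/* A sketch of guarding references when the loop runs over the union of the
 * per-reference iteration sets. */
#include <stdio.h>

#define N 100
int owns_a(int p, int i) { return i / 25 == p; }        /* a[i]: blocks of 25 */
int owns_b(int p, int i) { return (i + 1) / 25 == p; }  /* b[i+1]: shifted    */

int main(void) {
    int p = 1;                        /* this process's id                    */
    double a[N] = {0}, b[N] = {0};

    /* The loop covers the union of the iteration sets; the guards keep each
     * reference from executing in iterations it does not own. */
    for (int i = 0; i < N - 1; i++) {
        if (owns_a(p, i)) a[i]     = i;           /* iterations owned for a   */
        if (owns_b(p, i)) b[i + 1] = 2.0 * i;     /* iterations owned for b   */
    }
    printf("a[30] = %.0f, b[40] = %.0f\n", a[30], b[40]);
    return 0;
}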

6.4 GENERATING COMMUNICATION

The typical memory model for distributed computation says that a process will execute whatever code is necessary to compute the values of the data (typically array elements) that have been distributed onto it, i.e., the owner computes rule [198]. Computing these values requires accessing data that is distributed onto other processes, and this requires generating message passing communication to get that data.

Generating communication requires two distinct activities. First, the kind of communication needed is determined. This is done by analyzing the dependence relations between writes of owned data and reads of data that are potentially not owned. In general, any dependence that has a non-equal direction on a parallel loop will need to obtain data from other processes, since the presence of the non-equal direction means that any subset of the iteration space (such as the subset executing in a given process) will need data from outside that subset (and therefore residing on another process). Second, the data that must be communicated needs to be determined. Because most MPI communication requires actions by both the sender and the receiver of the data, it is necessary for the sending process to know that it must send data, and what data it must send.

There is no over-arching strategy for generating communication operations in message passing programs; rather, there is a collection of heuristics that attempt to discover whether a write/read pair corresponds to a well-known and supported messaging pattern, and if so to generate code that efficiently creates communication using that pattern. In this section we will discuss three communication patterns: shift, implemented with send and receive operations; broadcast, a collective operation; and a wavefront communication pattern. We also discuss how they are detected. This will give the general flavor of how communication can be generated by a compiler.

6.4.1 THE SHIFT COMMUNICATION PATTERN

Figure 6.8(a) shows an HPF program with an array A whose columns are block distributed onto four processes. Figure 6.8(b) shows the layout on the processes, with arrows indicating elements on one process that must be sent to another process when computing values for the elements of A. The compiler computes, or generates code that computes at run time, the elements of A needed by each process Pi. The compiler also computes, or generates code to compute at run time, the values of I1 and I2 that define the bounds of the I1 and I2 loops on each process. By evaluating the subscript expression of the B reference over those values, the values of B that need to be read can be determined. These values can be intersected (by means of Diophantine equations for general distributions and affine array subscripts) with what is owned by each process, the result of the intersection being that the left-hand strip of B on each process needs to be sent to the neighboring process to the left.

A simpler technique can also be used for certain subscript patterns. Given two arrays, A and B, that have the same size and distribution, when some column (row) access of B is offset by a constant value k < 0, |k| columns (rows) of B need to be transferred to the left (upper) process in the grid. If the constant offset k is positive, then |k| columns (rows) are transferred to the right (lower) process. In the example of Figure 6.8, k = −1. If B is ALIGNed or otherwise has its distribution shifted relative to A’s distribution, then this shift is used to modify the value of k. Thus, if the HPF alignment directive ALIGN B(I1, I1 − 1) WITH A(I1, I1) were used, B would be shifted one column to the left, the value of k would be increased by one, and the final k in the example would be 0. Similarly, if the (pathological, in this case) alignment ALIGN B(I1, I1 + 1) WITH A(I1, I1) were used, the elements of B would be shifted one column to the right with respect to A, the final k value would be −2, and two columns of B would need to be communicated to the left process.

Where the communication is placed depends on the dependence structure of the loop. In the program of Figure 6.8(a), the B array is updated inside of the I1 loop, and the elements updated are elements that will be communicated. Therefore, it is necessary to move the communication outside of this loop.
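A C/MPI sketch of the shift pattern follows, assuming a one-dimensional block distribution of columns and k = −1 as in Figure 6.8: each process sends its leftmost owned column of B to its left neighbor and receives its right neighbor's leftmost column into a halo buffer. The array shapes and the use of MPI_Sendrecv are illustrative choices, not the compiler-generated code of the example.

/* A sketch of the shift communication pattern: exchange boundary columns
 * with neighboring processes. */
#include <mpi.h>
#include <stdio.h>

#define NROWS 8
#define LCOLS 4         /* columns owned locally (assumed, for illustration) */

int main(int argc, char *argv[]) {
    int pid, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double b[NROWS][LCOLS];          /* local block of B                     */
    double send[NROWS], halo[NROWS]; /* leftmost owned column / halo column  */
    for (int i = 0; i < NROWS; i++) {
        for (int j = 0; j < LCOLS; j++) b[i][j] = pid * 100 + i * LCOLS + j;
        send[i] = b[i][0];
        halo[i] = 0.0;
    }

    int left  = (pid == 0)          ? MPI_PROC_NULL : pid - 1;
    int right = (pid == nprocs - 1) ? MPI_PROC_NULL : pid + 1;

    /* Shift: send to the left neighbor, receive from the right neighbor. */
    MPI_Sendrecv(send, NROWS, MPI_DOUBLE, left,  0,
                 halo, NROWS, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    if (pid == 0)
        printf("process 0 received halo[0] = %f\n", halo[0]);
    MPI_Finalize();
    return 0;
}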

6.4.2 THE BROADCAST COMMUNICATION PATTERN

Figure 6.9(a) shows an HPF program where the same element (4) of array B is read by every process. Since B is a distributed array, this requires that B(4) be communicated to every process, which is a broadcast communication pattern. The compiler can easily detect this pattern in the program, and by finding the process that owns the constant element can set up the proper broadcast communication operation.

An analogous, but slightly more complicated, example of this is when a column or row of an array is read by every process, and that column or row resides entirely on a single process. Similarly, if an entire array is distributed onto a single process and read by every process, the entire array will need to be broadcast.
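A minimal C/MPI sketch of the broadcast pattern is shown below; the identity of the owning process and the value it holds are assumptions for illustration.

/* A sketch of the broadcast pattern: the process owning the
 * constant-subscript element B(4) broadcasts its value to every process. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int pid;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);

    int owner = 0;                   /* assumed owner of B(4) in this sketch  */
    double b4 = 0.0;
    if (pid == owner)
        b4 = 42.0;                   /* stands in for the locally stored B(4) */

    /* After the broadcast, every process has a copy of B(4). */
    MPI_Bcast(&b4, 1, MPI_DOUBLE, owner, MPI_COMM_WORLD);

    printf("process %d has B(4) = %f\n", pid, b4);
    MPI_Finalize();
    return 0;
}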

6.4.3 THE WAVEFRONT COMMUNICATION PATTERN

Consider the code of Figure 6.7(a). The parallelism shown in Figure 6.7(b) exists along the diagonals of the grid (shown with the same shading) and can be exposed by generating code like that of Figure 6.7(c). The compiler can recognize that the loop can be tiled (see Section 5.6) to form the necessary blocks of computation, with the execution order enforced by the communication placed outside of the inner two loops, as shown in Figure 6.7(c).
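A C/MPI sketch of the resulting pipeline follows, assuming rows are block distributed so that each process's rows depend on the last row of the process above it: a receive before the local sweep and a send after it enforce the cross-process flow dependence and stagger the processes into a wavefront. Column tiling, which deepens the pipeline as in Figure 6.7(c), is omitted for brevity.

/* A sketch of wavefront (pipelined) execution enforced by communication. */
#include <mpi.h>
#include <stdio.h>

#define LROWS 4
#define NCOLS 8

int main(int argc, char *argv[]) {
    int pid, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double grid[LROWS][NCOLS];       /* locally owned rows                   */
    double boundary[NCOLS] = {0};    /* last row of the process above        */
    for (int i = 0; i < LROWS; i++)
        for (int j = 0; j < NCOLS; j++) grid[i][j] = 1.0;

    /* Receive the boundary row from the predecessor: this enforces the
     * cross-process flow dependence. */
    if (pid > 0)
        MPI_Recv(boundary, NCOLS, MPI_DOUBLE, pid - 1, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Local sweep: each element depends on the elements above and to the left. */
    for (int i = 0; i < LROWS; i++)
        for (int j = 1; j < NCOLS; j++) {
            double above = (i == 0) ? boundary[j] : grid[i - 1][j];
            grid[i][j] = grid[i][j - 1] + above;
        }

    /* Pass this process's last row down the pipeline. */
    if (pid < nprocs - 1)
        MPI_Send(grid[LROWS - 1], NCOLS, MPI_DOUBLE, pid + 1, 0, MPI_COMM_WORLD);

    printf("process %d finished its rows\n", pid);
    MPI_Finalize();
    return 0;
}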

6.5 DISTRIBUTED MEMORY PROGRAMMING LANGUAGES

Several languages have been developed for targeting distributed memory machines. Nevertheless, the most popular programming model for distributed memory machines is C, C++ or Fortran combined with MPI. Under such a model the compiler’s role is to optimize the program as a sequential program—the exploitation of parallelism, data distribution, computation partitioning and generation of communication is entirely the programmer’s responsibility. Because MPI is a library, production compilers have no understanding of the underlying semantics of library calls, and make no effort to optimize code across calls.


Figure 6.7: Example of a wavefront pipelining enforced by communication.

We now briefly discuss several other programming languages for distributed computing. These languages fall into the realm of Partitioned Global Address Space (PGAS) languages. They allow the programmer to manipulate data using the global indices of distributed arrays. This, in and of itself, is a huge improvement for the programmer over manipulating data using the local address space.


Figure 6.8: An example of a shift communication pattern.

6.5.1 HIGH PERFORMANCE FORTRAN (HPF)

Compilers for High Performance Fortran, and the closely related research languages Fortran90D [35], Vienna Fortran [46], and Fortran D [103], arguably provide the most aggressive support for distributed memory computing. HPF only required the programmer to specify the data distribution of the program. Based on that distribution, the HPF compiler would perform computation partitioning on loops, and automatically generate communication, following the strategies outlined above. It was HPF, and its predecessors, that motivated much of the development and formalization of the work discussed earlier in this chapter, and it was the failure of HPF in the marketplace that led to the rise of Co-Array Fortran (discussed further in Section 6.5.2) and Unified Parallel C (discussed further in 6.5.3).

Novel features of HPF were the ability to declare processor grids1 and templates, and advanced support for data distribution via these declared processor grids, templates, and the ALIGN primitive. We demonstrate this functionality via the example illustrated in Figure 6.10. A 2 × 2 grid of processors is declared in S1. This is a logical processor grid, in that it will be mapped by the HPF runtime onto whatever number of processors is specified by the programmer.


Figure 6.9: An example of a broadcast communication pattern.

Statements S2 and S3 declare a one-dimensional and a two-dimensional array, respectively. Remember that Fortran stores arrays in column-major order, and therefore the leftmost subscript indexes elements that are contiguous in memory, the opposite of C, C++, Java, and most other languages. In statement S4, the DISTRIBUTE directive distributes A onto the 4 × 4 processor grid P using a block distribution; HPF also supported cyclic, block-cyclic and replicated distributions. The ALIGN directive in statement S5 places each element B(I) on the same processor on which column I of A is placed. The “*” in the second (column) dimension of A says to spread B across all of the columns of the A matrix.

Compiling HPF programs requires many of the analyses and transformations discussed earlier. In the program of Figure 6.11(a) (and its transformed form in Figure 6.11(b)), a flow dependence exists from the write to A[I] to the read of A[I-1]. When compiling for shared memory, three possibilities exist. First, the loop could be left serial. Second, loop fission or distribution could be applied, and the barrier at the end of the first loop (containing S1) would enforce the dependence. Third, producer/consumer synchronization could be added to enforce the dependence. With a distributed memory target, adding send and receive operations within the loop would enforce the dependences, but because of the great expense of a communication operation, a guard if statement would need to be placed around the send and receive to ensure that the send executes only in the last iteration (where data that is written is read by the next process) and the receive executes only in the first iteration (where data that is read was written by the previous process). The overhead of the if statements is high enough (both in execution time and in making the flow of control more complicated, potentially hindering other optimizations) that loop fission is applied instead, and the dependence is enforced by the send and receive placed between the two loops rather than by a barrier at the end of the first loop. This code is shown in Figure 6.11(b).


Figure 6.10: An example of HPF ALIGN and DISTRIBUTE directives.


Figure 6.11: An example of transformations necessary to exploit parallelism in an HPF program.

Similarly, the code found in Figure 6.7(b) will, with a good HPF compiler, result in code that uses messages to both communicate data and synchronize dependences, as shown in Figure 6.7.

These examples offer some insight into why HPF was even less successful than shared memory auto-parallelizing compilers. First, providing data distribution information gives hints as to how parallel code should be scheduled, but ultimately the problem is the same as that confronted by an auto-parallelizing compiler for shared memory. In particular, data dependence analysis is necessary to find parallel loops in a shared memory compiler, and transformations are necessary to eliminate anti and output dependences and to enforce flow dependences. In an HPF compiler, dependence analysis is necessary to determine where communication is necessary, what kind of communication is necessary, and what must be communicated. The compiler’s job is made easier in some cases by the distributed memory model; for example, cross-iteration (and therefore cross-process) anti-dependences can be eliminated by a combination of serial execution within a process, process-private storage for all array elements accessed by the process, and not generating communication that would overwrite the dependent array element on the process at the sink of the dependence. The distributed memory model makes the compiler’s job more difficult because it is necessary either to have exact distance information for cross-iteration flow dependences, so that exactly the right data is communicated, or to over-communicate, leading to larger communication overheads. Because message passing communication is typically orders of magnitude more expensive than synchronization on shared memory processors, conservative analyses that lead to unnecessary or inefficient communication can cause significant slowdowns in program performance. As well, just as interchange and other transformations are necessary when targeting shared memory to form an outer loop that has no dependences and can be parallelized, with HPF dependences must be moved out of the inner loop so that substantial work exists to amortize the cost of the communication. At the end of the day, the technical issues that must be confronted by an HPF compiler are similar, and in many cases fundamentally the same, as those confronted by shared memory compilers.

Less technical reasons also led to the lack of widespread adoption of HPF. A major issue was the compilation of block-cyclic distributions, or more precisely, the lack of any good method of generating code for block-cyclic distributions when major vendors began work on HPF [36, 89, 98]. Two strategies were followed: (i) vendors chose to simply not support block-cyclic; and (ii) library calls at run time handled the block-cyclic distributions correctly, but inefficiently. The end result was that vendors that supported a subset of the language, usually because of block-cyclic, also decided not to implement some other difficult features. Because the vendors did not all make the same choices, different subsets of the language were supported by different compilers. Vendors that chose the runtime approach often focused less on performance and more on completeness, not a winning strategy for a language whose first two initials stand for “high” and “performance”. In fairness, it must be said that these vendors focused on performance going forward.

A second major issue was that potential users of HPF had wildly unrealistic expectations of what HPF would be able to accomplish. Compounding this, many HPF compilers gave little or no feedback to their users about where communication was placed, and why. Thus, programmers would compile their application and get some level of performance, then make a few seemingly small changes and get very different performance. This made it difficult for programmers to work with the compilers to converge on the best performance the compiler could deliver.

6.5.2 CO-ARRAY FORTRAN

Co-Array Fortran is an extension of Fortran 95 and implements an SPMD programming model. It is currently part of the Fortran 2008 standard, released in 2010 [195]. A Co-Array Fortran program is replicated a fixed number of times, and each replication of the program is called an image. When the program is replicated, all data associated with the program is, by default, also replicated. Each image has a unique identifier (its index) that allows the flow of control within an image to be controlled, as desired, based on the identity of the image.

For useful work to be performed in parallel, data should be distributed among the images so that it can be simultaneously operated on by many processes. Co-arrays give that capability to Co-Array Fortran. Consider the following declaration:

Images

This code does the following. Statement S1 declares two co-arrays named A and B, each with N elements. Thus, each image has a portion of the co-array that contains N elements, and a total of N times the number of images elements are allocated. This statement is very similar to a normal Fortran declaration except for the “[*]”, which makes it a co-array.

Statement S2 accesses the co-arrays, and assigns all of the values of the elements of the B array on image P to the local storage for the co-array A. We note that if every image has the same value of P, every local portion of A gets the values of the B array from the same image, and thus a broadcast operation has been specified. As the reader may have surmised, the “:” notation specifies all elements of an array in Fortran 95 syntax, and all elements of the co-array on some image in Co-Array Fortran syntax. In general, for both Co-Array Fortran and Fortran 95, sections of arrays may be specified by triplet notation of the form [l : u : s].

A variety of communication patterns can be specified using co-arrays and, in some cases, Fortran 95 intrinsic functions.2 Thus, the expression

Images

moves the contents of the co-array B on the image executing the statement into the co-array residing on image P. The expression

Images

broadcasts the element values of X on the image executing the statement to every image’s Y co-array. Note that while “(:)” specifies all of the elements of an array, the use of brackets, e.g., “[:]”, specifies all images containing parts of the co-array. As a last example, the expression

Images

uses the intrinsic MAXVAL function to perform a max reduction over all elements of Y on all images. If every image executes this statement, every image will get the maximum value contained in any Y in its local scalar variable S.

It is the task of the programmer to map data into the distributed co-array. It is the task of the compiler to recognize remote accesses, generate the message passing code to create these accesses, and to recognize collective communication operations encoded as assignment statements involving co-arrays. Because communication is more explicit than in HPF, the programmer is more aware of the cost of communication operations that are required, and consequently can better manage the communication, and tune the performance of a program, at the source level.

6.5.3 UNIFIED PARALLEL C (UPC)

Unified Parallel C, as implied by the name, is a set of extensions to the C programming language. These extensions include constructs and/or semantics to enable explicit parallelism, support a shared address space across a distributed memory system, perform synchronization, and maintain strict or relaxed consistency models. Like Co-Array Fortran, it is primarily the responsibility of the programmer to tune the program for good performance, and like Co-Array Fortran it gives programmers enough control to achieve good performance while relieving them of the tedium of performing global-to-local address space calculations when accessing shared data structures. Also like Co-Array Fortran, it assumes an SPMD execution model.

UPC assumes the memory organization shown in Figure 6.12(a). The memory is physically distributed across |P| processes, and each process’s physical memory is logically partitioned into two parts: a globally accessible part treated as shared memory, and a private part only accessible by the process on which the memory resides. Data declared as shared is placed into the logically shared portion of the memory, and data declared as private is placed into the logically private part of the memory. Shared data is distributed across the processes according to a declared block size; by choosing the block size appropriately a programmer can obtain a block, cyclic, or block-cyclic distribution. Figure 6.12(b) shows a program that declares four arrays (v, b, c and d) as private, block, cyclic and block-cyclically distributed, respectively. Figure 6.12(c) shows how these arrays are laid out in memory.


Figure 6.12: Memory organization and data distribution in UPC.
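A small UPC sketch of declarations in the spirit of Figure 6.12(b) follows. The element types, array size, block sizes, and the upc_forall example are illustrative assumptions (the sketch also assumes a static compilation for four threads); it is not the book's figure.

/* A sketch of UPC data distribution via layout qualifiers; assumes the
 * program is compiled for 4 threads. */
#include <upc.h>
#include <stdio.h>

#define N 16

int    v[N];              /* private: every thread has its own copy          */
shared [4] double b[N];   /* block: 4 contiguous elements per thread         */
shared [1] double c[N];   /* cyclic: elements dealt round-robin              */
shared [2] double d[N];   /* block-cyclic: blocks of 2, dealt round-robin    */

int main(void) {
    int i;
    /* upc_forall with affinity &b[i] runs iteration i on the thread that
     * owns b[i], a simple form of the owner computes rule. */
    upc_forall (i = 0; i < N; i++; &b[i])
        b[i] = MYTHREAD;

    upc_barrier;
    if (MYTHREAD == 0)
        printf("b[5] resides on thread %d\n", (int)upc_threadof(&b[5]));
    return 0;
}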

Since UPC extends C, support for pointers into distributed data is mandatory. As can be seen in Figure 6.12(b), the location of data, and the distribution of data that resides in shared memory is expressed as a type. Therefore, UPC supports pointers to shared data.

UPC pointers can reside in either private or shared memory, and can point to either private or shared memory. A pointer in private memory that points to private memory can only be changed by one UPC thread, and points to data in that thread’s private memory. An example of this is pointer p in Figure 6.13(a). A pointer in private memory that points to shared data allows a thread to have a pointer that cannot be changed by other threads and through which shared memory values can be accessed. An example of this is pointer q in Figure 6.13(a). A pointer in shared memory that points to shared storage provides a pointer that can be read and written by all threads, and that points to data accessible by all threads. An example of this is pointer r in Figure 6.13(a). The last combination of pointer location and pointed-to storage is a pointer in shared memory that points to private memory. This combination is prohibited by UPC, since it would allow the address of private data to be accessed by all threads. Moreover, since global pointers contain header information identifying which thread contains the pointed-to data, and pointers to private storage do not, dereferencing such a pointer would likely point into the dereferencing thread’s private storage, even if the pointer value was set by another thread. Figures 6.13(c) and 6.13(d) show pointers to the arrays declared in Figure 6.12(b).


Figure 6.13: An example of UPC pointers and pointer declarations.

Figures 6.13(d) and 6.13(e) show pointers to data with different block sizes, and show that, as in sequential C, pointer arithmetic honors type information. In UPC, the block size is part of the type, and so the block size associated with a pointer, rather than with the pointed-to object, affects the pointer arithmetic.

UPC also supports both relaxed and strict (sequentially consistent) memory models. Critical cycle analysis for SPMD programs, as described in Section 2.6, is used to constrain compiler optimizations, and to guide the insertion of fence instructions that constrain the reordering of instructions by the processor.

In addition to the memory model analysis just mentioned, the UPC compiler performs a mapping from global to local index spaces for distributed arrays. It also must generate communication when accessing non-locally stored data; when accessing locally stored (whether shared or private) data, it should determine, either statically or at run time, that the data is local and use a load or store rather than message passing to access it.

Both UPC and Co-Array Fortran have enjoyed more success than HPF because they give the responsibility for performance to the programmer, provide a programming model that allows the programmer to gain some insight into the performance of the program being written, and relieve the programmer of the tedium of managing data distributions and the details of communication.

1These are really grids of processes.

2Intrinsic functions are Fortran functions that are defined as part of the language specification, and whose semantics are known to the compiler.
