CHAPTER 5

Transformation of iterative and recursive constructs

The dependence information gathered using the techniques of Chapter 2 can be used to determine the legality, and sometimes the utility, of a variety of transformations on loops. Many of the transformations we now discuss have as their primary goal improving program performance rather than transforming programs into a form that is easier to parallelize.

5.1 LOOP BLOCKING OR STRIP MINING

Loop blocking is used to break the iteration space of a loop into blocks of contiguous iterations. The two primary uses for loop blocking are (i) to split a loop into chunks of computation that can each be executed as a single vector instruction, and (ii) to serve as one step in performing loop tiling (see Section 5.6) to achieve better locality.

Figure 5.1 shows a loop before and after blocking. The loop is split into a pair of loops, where the outer loop jumps from block to block of the iteration space, and the inner loop traverses the individual iterations in the block. bs is the block size, that is, the number of iterations in each block.
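For concreteness, a minimal C sketch of the transformation (the array a, the bound n, and the update in the loop body are illustrative, not the code of Figure 5.1):

    /* Original loop. */
    void original(double *a, int n) {
        for (int i = 0; i < n; i++)
            a[i] = a[i] + 1.0;
    }

    /* Blocked version: the outer loop steps from block to block, and the
       inner loop traverses the iterations within the current block.  The
       bound check handles an n that is not a multiple of the block size bs. */
    void blocked(double *a, int n, int bs) {
        for (int ib = 0; ib < n; ib += bs) {
            int hi = (ib + bs < n) ? ib + bs : n;
            for (int i = ib; i < hi; i++)
                a[i] = a[i] + 1.0;
        }
    }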


Figure 5.1: An example of loop blocking.

As can be observed from the code, blocking by itself has no effect on the order of memory accesses in the program, and is therefore always legal. The only adverse effect on performance is the slightly higher loop overhead introduced by the additional inner loop.

5.2 LOOP UNROLLING

Small loop bodies present several problems to a compiler. Compilers are limited in their ability to express optimizations that increase cross-iteration sharing of data values1. Small loop bodies also provide a smaller window over which to do instruction scheduling, fewer instructions over which loads can be moved to mask memory latency, and a smaller window over which register allocation can be performed, potentially leading to additional loads and stores of data values.

Figure 5.2 demonstrates the benefits of loop unrolling. The computation shown is a simplified stencil. Each iteration of the original loop performs loads of a[i], a[i − 1] and a[i + 1], and stores into a[i]. The value of a[i] stored in an iteration is the value of a[i − 1] in the next iteration, and the value of a[i + 1] in an iteration is the value of a[i] in the next iteration. Therefore, it would be more efficient to reuse the values already in registers in the next iteration, and only load a new value of a[i + 1], instead of loading values of a[i], a[i − 1] and a[i + 1] in every iteration. Code to do this is shown in Figure 5.2(c). As can be seen from that code, many common array accesses exist across statements in the unrolled loop that would have been in separate iterations of the original loop. Register allocating across the body of the unrolled loop makes it easy to exploit these common data accesses, whereas tracking register uses across iterations of the rolled loop would have been very difficult.
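A C sketch along these lines (a simplified three-point stencil; the names are illustrative, and the unroll factor of two assumes the trip count n − 2 is even, so no cleanup loop is shown):

    /* Original stencil: three loads and one store per iteration. */
    void stencil(double *a, int n) {
        for (int i = 1; i < n - 1; i++)
            a[i] = a[i - 1] + a[i] + a[i + 1];
    }

    /* Unrolled by two, keeping values in scalars so that each statement
       needs only one new load; the value just stored is reused as the
       a[i - 1] operand of the following statement. */
    void stencil_unrolled(double *a, int n) {
        double left = a[0];             /* plays the role of a[i - 1] */
        double mid  = a[1];             /* plays the role of a[i]     */
        for (int i = 1; i < n - 1; i += 2) {
            double right = a[i + 1];
            double t0 = left + mid + right;
            a[i] = t0;
            left = t0;                  /* reuse the stored value     */
            mid  = right;
            right = a[i + 2];
            double t1 = left + mid + right;
            a[i + 1] = t1;
            left = t1;
            mid  = right;
        }
    }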

Unrolling also opens up a wider window for other optimizations to occur. Thus, in the loop shown, the sum a[i] + a[i + 1] is the sum a[i − 1] + a[i] in the next iteration, and the temporary holding the first value can be reused for the second value. The number of compares and increments of the loop index variable are also reduced, making the loop overhead smaller. At run time, the larger number of operations between branches can also increase the window over which out-of-order processors can schedule code without speculating.

Because loop unrolling does not change the order of fetches and stores, it is always legal to perform.

5.3 LOOP FUSION AND FISSION

Loop fusion and fission (fission is also called loop distribution in the literature [250]) are inverse transformations that merge multiple loops into a single loop, and split a single loop into multiple loops. An example of this is shown in Figure 5.3.

Loop fusion produces a program that is faster for several reasons: (i) loop overhead is lower, since only one loop is started, and since the number of iterations executed is one-half of the total for the two unfused loops, the number of index variable increments and checks is also halved; and (ii) temporal locality is better, since accesses to common storage in the two loops are performed temporally closer together.


Figure 5.2: An example of loop unrolling.

 

Loop fission can be used to create loops where one operation exists per loop, enabling vectorization. As seen in the loop fission example of Figure 5.3(a), when the loop is distributed the dependence between S1 and S2 is enforced by the sequential execution of the two loops. If both of the resulting loops, shown in Figure 5.3(c), are parallelized or vectorized, the barrier after each loop will enforce the dependence.
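A C sketch of the two forms (the arrays and statements are illustrative and are not the code of Figure 5.3; S1 and S2 mark the two statements, with a flow dependence from S1 to S2 through b):

    /* Fissioned (unfused) form: the dependence from S1 to S2 is enforced
       by running S1's loop to completion before S2's loop begins. */
    void fissioned(double *a, double *b, double *c, int n) {
        for (int i = 0; i < n; i++)
            b[i] = a[i] + 1.0;          /* S1 */
        for (int i = 0; i < n; i++)
            c[i] = b[i] * 2.0;          /* S2 */
    }

    /* Fused form: one loop start-up, half the index increments and tests,
       and b[i] is still in a register or the cache when S2 uses it. */
    void fused(double *a, double *b, double *c, int n) {
        for (int i = 0; i < n; i++) {
            b[i] = a[i] + 1.0;          /* S1 */
            c[i] = b[i] * 2.0;          /* S2 */
        }
    }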

Because fusion and fission change the order in which loads and stores are performed, it is necessary to check the dependence relation before and after performing the transformation to determine whether it is legal. In particular, when performing loop fission, every flow dependence from some statement Sp to Sq should be replaced by a def-use edge from Sp to Sq, every anti-dependence from Sp to Sq should be replaced by a use-def edge from Sp to Sq, and every output dependence from Sp to Sq should be replaced by a def-def edge from Sp to Sq. Thus, in the example of the figure, the flow dependence from S1 to S2 is replaced by a def-use chain from S1 to S2 in the new loops. We note that a lexically backwards flow (anti) dependence (i.e., consider if statements S1 and S2 were re-ordered in Figure 5.3(a)) would become a lexically forward def-use (use-def) edge after fission, and a lexically backward output dependence would become a lexically forward def-def edge with the source and sink reversed, thus violating the legality constraints. As described in Section 3.2, acyclic portions of the dependence graph may be sorted so that all dependences are lexically forward, after which a legal fissioning is possible.

A legal fusion requires that all def-use (use-def/def-def) edges from Sp to Sq in the loops being fused become flow (anti/output) dependences from Sp to Sq in the fused loop, and that no dependences other than those that existed in the unfused loops exist in the fused loop.


Figure 5.3: Examples of loop fusion and fission.

5.4 LOOP REVERSAL

Loop reversal is a transformation that reverses the order in which the iterations of a loop are executed. Thus, a loop that originally runs from 0 . . . n − 1 will now run from n − 1 . . . 0, as shown in Figure 5.4. Figures 5.4(a) and (b) show the original pair of loops to be fused, and the Use/Def information for the loops. Figures 5.4(c) and (d) show the fused loop and the dependence graph for that loop. Observe that the Use/Def chain from S1 to S2 in the original loops has become a flow dependence in the fused loop, meaning that the order of reads and writes of elements of a has changed after fusing, and the fusion is illegal.

When the iteration space of a loop is reversed, the direction of dependences within that reversed iteration space is also reversed. Thus, a “<” dependence becomes a “>” dependence, and vice versa. More intuitively, in the notation expressing the dependence as the sign of the direction, the direction of a dependence in the reversed loop is found by multiplying the direction in the original loop by −1. This effect of loop reversal can be used to fuse loops that might otherwise be illegal to fuse. Reversing the iteration space of the loop shown in Figure 5.4(c) produces the loop of Figure 5.4(e). Reversing the direction gives a flow dependence from S2 to S1 with a direction of “−1” or “>”. Since dependences must go from earlier in the loop execution to later, this direction cannot be correct; what has happened is that the flow dependence from S2 to S1 has become an anti dependence from S1 to S2 with a direction of “1” or “<”. This direction corresponds to the flow of data implied by the Use/Def information shown in Figure 5.4(b), and therefore the fusion is now legal.
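A C sketch of this pattern (the statements play the roles of S1 and S2, but the arrays and bounds are illustrative rather than taken from Figure 5.4):

    /* Original pair of loops: S1 reads a[i - 1], S2 later overwrites a[i],
       so every read of a in S1 sees the old value (a use-def relation
       from S1 to S2). */
    void unfused(double *a, double *b, const double *c, int n) {
        for (int i = 1; i < n; i++)
            b[i] = a[i - 1];            /* S1 */
        for (int i = 1; i < n; i++)
            a[i] = c[i];                /* S2 */
    }

    /* Fusing the loops as-is would be illegal: S2's write of a[i] would
       happen before S1's read of the same element in the next iteration,
       turning the use-def relation into a flow dependence.  Fusing the
       reversed loop is legal: iterations run from high to low, so a[i - 1]
       is still unwritten when S1 reads it. */
    void fused_reversed(double *a, double *b, const double *c, int n) {
        for (int i = n - 1; i >= 1; i--) {
            b[i] = a[i - 1];            /* S1: still reads the old value */
            a[i] = c[i];                /* S2 */
        }
    }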


Figure 5.4: An example of how loop reversal aids in fusion.

5.5 LOOP INTERCHANGE

Loop interchange is a transformation that can increase the temporal and spatial locality of array accesses. The transformation takes two loops, iout and iin, where iout (iin) is the outer (inner) loop, and interchanges the positions of iout and iin so that iin becomes the outer loop.

There are two benefits from applying loop interchange. First, the locality, and therefore the cache behavior, of array accesses can be improved by interchange. Languages define the order in which arrays are laid out in memory: Fortran stores elements that are adjacent within a column next to one another in memory (column-major order), while C, C++ and Java store elements that are adjacent within a row next to one another in memory2 (row-major order). Array accesses may occur in the wrong order because library code written for one language is reused from another language with the opposite layout, because algorithms are translated literally from one language to another, or because algorithms are transcribed from a book that was written to emphasize intuitiveness without regard for locality effects. By interchanging the loops involved in indexing the arrays in the wrong order, the access order is changed to one with better cache behavior. If the loop nest also contains array accesses that occur in the proper order in the dimensions indexed by the loops to be interchanged, those accesses will now be performed in the wrong order. In these cases, tiling, discussed in Section 5.6, should be used.
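A standard C illustration of the locality benefit (the array, bounds, and loop body are illustrative):

    /* Poor locality in C: the inner loop walks down a column, so
       consecutive iterations touch elements n doubles apart. */
    void column_order(int n, double a[n][n]) {
        for (int j = 0; j < n; j++)
            for (int i = 0; i < n; i++)
                a[i][j] = 2.0 * a[i][j];
    }

    /* After interchanging the loops, the inner loop walks along a row,
       so consecutive iterations touch adjacent memory locations. */
    void row_order(int n, double a[n][n]) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                a[i][j] = 2.0 * a[i][j];
    }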

Second, if an inner loop is parallel, interchanging it with an outer loop will decrease the number of times the parallel loop is instantiated at run time and will increase the amount of work in the parallel loop. Both of these reduce the overhead of exploiting the parallelism inherent in the loop.

Because loop interchange affects the order of accesses (accesses that previously traversed elements in row order will now traverse them in column order, and vice versa), dependence information must be used to ensure that the transformation is legal to apply. Figures 5.5(a)-(c) show a doubly nested loop that is a candidate for interchange, an iteration space diagram (ISD) for the loop, and the dependence graph for the loop. As can be seen, there is a flow dependence with a “[1, −1]” (“[<, >]”) direction vector on the loop. When interchange is applied to the loop nest, the direction vector elements associated with the interchanged loops are also interchanged. Figures 5.5(c) and (d) show the interchanged loop, its ISD and its dependence graph. After interchange, the flow dependence would have a direction vector of “[−1, 1]” (“[>, <]”). This, however, would imply that the dependence travels backwards in the iteration space, which cannot happen, since dependences always travel forward in the loop nest’s iteration space. Therefore, the interchanged loop now has an anti dependence running from the read to the write of a with a direction of “[1, −1]” (“[<, >]”). In simpler terms, this means that the read of an element of a now occurs after, rather than before, the write. Because these accesses occur in a different order than in the original program, the transformed program may give a different (and wrong) result.

Figures 5.6(a)-(d) show another loop nest and its iteration space diagram before and after performing loop interchange. In this case, there is a flow dependence with a direction of “[1, 0]” (“[<, =]”). After interchange, the direction vector becomes “[0, 1]” (“[=, <]”), which indicates a flow dependence that still moves forward in the iteration space of the loop. Thus, the interchange is legal.

For a two-deep loop nest, the possible direction vectors and what they become after interchange are:

    Original direction vector       After interchange
    [0, 0]   ([=, =])               [0, 0]   ([=, =])
    [0, 1]   ([=, <])               [1, 0]   ([<, =])
    [1, 0]   ([<, =])               [0, 1]   ([=, <])
    [1, 1]   ([<, <])               [1, 1]   ([<, <])
    [1, −1]  ([<, >])               [−1, 1]  ([>, <])

The direction vectors “[−1, 1]” (“[>, <]”), “[−1, 0]” (“[>, =]”) and “[−1, −1]” (“[>, >]”) are not possible because they imply that the dependence sink executes before the source. All of the possible direction vectors except “[1, −1]” (“[<, >]”) yield legal vectors when interchanged, and therefore can exist on the pair of loops to be interchanged; “[1, −1]” (“[<, >]”) is the only direction vector that precludes loop interchange.

The discussion above has assumed that the interchange involves the outermost loop. If it does not, and the direction vectors of all dependences after the interchange are lexicographically greater than [0, 0, . . . , 0], the interchange is legal. In practice, this means that if every direction vector has at least one “<” element corresponding to a loop that is outside of the interchanged loops in the loop nest, the interchange is legal. Thus, in a triply nested loop, if the only dependence has a direction vector of “[1, 1, −1]” (“[<, <, >]”), then after interchanging the two inner loops the direction vector would be “[1, −1, 1]” (“[<, >, <]”), and the interchange is legal.

5.6 TILING

When a loop contains accesses to arrays that are being performed in an inefficient order, as well as accesses that are being performed in the correct order, interchange will enhance the locality of the accesses happening in the wrong order, but will degrade the locality of the accesses already happening in the efficient order. This situation occurs, for example, in matrix multiply and array transpose.

Consider the loop nest shown in Figure 5.7. The a[i1, i2] reference accesses the array in row-major order, and is locality friendly, as shown by the horizontal, dark arrows in Figure 5.7(c). The a[i2, i1] reference accesses the array in column-major order, as shown by the vertical, lighter arrows. Clearly, interchange alone will not improve the overall locality of the accesses. What is desired is to tile the iteration space with non-overlapping sub-spaces that together cover the entire original iteration space, and to reorder the accesses such that all of the array regions accessed in a tile fit into the cache.

Figure 5.7(c) shows a symbolic sequence of tiles being accessed by the program of Figure 5.7(a) after tiling (the top row) and the regions of the a array accessed by each of the tiles (the bottom row). After tiling, the accesses to the array region made by a single reference occur in a contiguous set of iterations of the loop nest, i.e., they are temporally co-located.

The first step in producing code that gives this behavior is to block each loop (see Section 5.1), as shown in Figure 5.7(d). Blocking does not affect the order of accesses and is always legal. To make the accesses occur in a more cache-friendly manner, a second step is performed: the intra-block loop of i1 is interchanged with the block loop of i2. This leaves the outer loops in the loop nest (or in the tiled portion of the loop nest, if there are other loops surrounding the tiled loops) being the i1 and i2 loops that iterate over the different blocks, or tiles, in the nest. The inner loops are the intra-block loops that perform the iterations within each tile. Thus, the iteration space visits each tile in turn, and then traverses the iterations that perform accesses within the tile. This in turn causes each region of the array accessed by each tile to be visited in turn, leading to cache-friendly behavior. We note that tiling is legal whenever the interchange used in the second step of the transformation is legal, as described in Section 5.5.
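A C sketch of the resulting code (the loop body is an assumed stand-in for the computation of Figure 5.7, and the names a, b, n and bs are illustrative):

    /* Tiled loop nest: the two outer loops visit the tiles, and the two
       inner loops perform the iterations within the current tile.  This is
       the result of blocking both loops and then interchanging the
       intra-block loop of i1 with the block loop of i2. */
    void tiled(int n, int bs, double a[n][n], double b[n][n]) {
        for (int t1 = 0; t1 < n; t1 += bs)
            for (int t2 = 0; t2 < n; t2 += bs) {
                int u1 = (t1 + bs < n) ? t1 + bs : n;   /* handle partial tiles */
                int u2 = (t2 + bs < n) ? t2 + bs : n;
                for (int i1 = t1; i1 < u1; i1++)
                    for (int i2 = t2; i2 < u2; i2++)
                        b[i1][i2] = a[i1][i2] + a[i2][i1];
            }
    }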


Figure 5.5: An example of where loop interchange is illegal.


Figure 5.6: An example of where loop interchange is legal.

The tiled code that achieves this behavior is shown in Figure 5.7(c).


Figure 5.7: An example of tiling and its effect on the execution order and array access patterns.

5.7 UNIMODULAR TRANSFORMATIONS

Some, but not all, loop transformations are linear transforms of the iteration spaces and loop bounds. In particular, skewing, reversal and interchange can be represented as such, while tiling, strip mining, fusion, and distribution cannot. However, as we have seen, skewing can change the distances of dependences within a loop nest, and enable these other optimizations. These observations led to the development of a theory of Unimodular loop transformations [21, 242, 243].


Figure 5.8: A loop whose performance can be improved by Unimodular transformations.

Consider the loop nest of Figure 5.8 with an outer i1 and inner i2 loop. The loop nest can be represented by the vector:

\[
\begin{pmatrix} i_1 \\ i_2 \end{pmatrix}
\]

and the interchange of the two indices can be represented by the matrix vector operation

\[
\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}
\begin{pmatrix} i_1 \\ i_2 \end{pmatrix}
=
\begin{pmatrix} i_2 \\ i_1 \end{pmatrix},
\]

i.e., by multiplying the vector of loops by the appropriate permutation matrix. To model the effect of the interchange on the distance vectors of the dependence, the distance vectors can be multiplied by the same permutation matrix. In the example of Figure 5.8, the dependence distance vectors are [0, 0] and [1, −2]. Multiplying the transposed dependence vectors by the permutation matrix yields

\[
\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}
\begin{pmatrix} 0 \\ 0 \end{pmatrix}
=
\begin{pmatrix} 0 \\ 0 \end{pmatrix}
\qquad\text{and}\qquad
\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}
\begin{pmatrix} 1 \\ -2 \end{pmatrix}
=
\begin{pmatrix} -2 \\ 1 \end{pmatrix},
\]

or distance vectors of [0, 0] and [−2, 1]. The [0, 0] vector is legal, but the [−2, 1] is lexicographically less than [0, 0] and indicates a dependence traveling backwards in the iteration space of the loop. Thus, the interchange is illegal.

In a similar fashion, loop reversal (see Section 5.4) can be expressed as a matrix operation. Again using the loop of Figure 5.8 as an example, reversing the inner i2 loop implies the system

\[
\begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}
\begin{pmatrix} 0 \\ 0 \end{pmatrix}
=
\begin{pmatrix} 0 \\ 0 \end{pmatrix}
\qquad\text{and}\qquad
\begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}
\begin{pmatrix} 1 \\ -2 \end{pmatrix}
=
\begin{pmatrix} 1 \\ 2 \end{pmatrix},
\]

or distance vectors of [0, 0] and [1, 2]. Neither distance vector is lexicographically less than [0, 0], and therefore the reversal is legal.

Finally, skewing can be represented by an identity matrix with an additional off-diagonal element holding the skewing factor. Thus, the matrix

\[
\begin{pmatrix} 1 & 0 \\ 2 & 1 \end{pmatrix}
\]

represents skewing the iteration space by twice the distance on the outer i1 loop, and in the example loop would yield distance vectors of [0, 0] and [1, 0].
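The legality check used in these examples is mechanical: apply the 2 × 2 transformation matrix to each distance vector and test that the result is still lexicographically non-negative. A C sketch (the matrices and distance vectors are those of the example above; the skew matrix shown is an assumed form of the skew-by-two transformation):

    #include <stdio.h>

    /* A transformed distance vector is legal if it is lexicographically
       non-negative: [0, 0], or a positive leading nonzero component. */
    static int legal(int d0, int d1) {
        return d0 > 0 || (d0 == 0 && d1 >= 0);
    }

    /* Apply the 2x2 matrix T to each distance vector and report legality. */
    static void check(const char *name, int T[2][2], int dist[][2], int ndist) {
        int ok = 1;
        for (int v = 0; v < ndist; v++) {
            int d0 = T[0][0] * dist[v][0] + T[0][1] * dist[v][1];
            int d1 = T[1][0] * dist[v][0] + T[1][1] * dist[v][1];
            ok = ok && legal(d0, d1);
        }
        printf("%s: %s\n", name, ok ? "legal" : "illegal");
    }

    int main(void) {
        int dist[2][2]        = { {0, 0}, {1, -2} };   /* the example's distance vectors */
        int interchange[2][2] = { {0, 1}, {1, 0} };
        int reverse[2][2]     = { {1, 0}, {0, -1} };   /* reverse the inner loop   */
        int skew[2][2]        = { {1, 0}, {2, 1} };    /* assumed skew-by-two form */
        check("interchange", interchange, dist, 2);    /* illegal: [1, -2] -> [-2, 1] */
        check("reverse inner loop", reverse, dist, 2); /* legal:   [1, -2] -> [1, 2]  */
        check("skew inner by 2", skew, dist, 2);       /* legal:   [1, -2] -> [1, 0]  */
        return 0;
    }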

The above examples assume the dependence distances are known and constant. It is sometimes the case that a dependence direction is known, but the distance is not. The above framework can handle this in several steps. First, “compound” distances such as [image, image] are represented as a pair of directions [0, +] and [+, ±]. This prevents the impossible dependence direction [−, ±] from being considered in what follows. Next, the directions can be represented as ranges of integers, with the two directions above being represented as [0 : 0, 1 : ∞] and [1 : ∞, −∞ : ∞]. An arithmetic can be defined on these vectors of ranges. Let l1, l2, u1, u2 be integers (or ±∞); then l1 : u1 + l2 : u2 = (l1 + l2) : (u1 + u2), and s · (l : u) = (s · l) : (s · u) when s ≥ 0 and (s · u) : (s · l) otherwise. Multiplication or addition of any k and ∞ (−∞) yields ∞ (−∞). Using this representation, directions can be converted to ranges and manipulated similarly to scalars. Loops whose dependence distances include ∞ cannot be parallelized, since no finite skewing factor exists that eliminates the dependence and allows overlap of execution.
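A small C sketch of this range arithmetic (the encoding of ±∞ as LONG_MIN/LONG_MAX and the function names are assumptions made for the sketch, not notation from the cited papers):

    #include <limits.h>
    #include <stdio.h>

    /* A range lo:hi of possible dependence distances; LONG_MIN and LONG_MAX
       stand in for -infinity and +infinity. */
    typedef struct { long lo, hi; } Range;

    static long add_inf(long a, long b) {
        if (a == LONG_MAX || b == LONG_MAX) return LONG_MAX;  /* k + inf  = inf  */
        if (a == LONG_MIN || b == LONG_MIN) return LONG_MIN;  /* k + -inf = -inf */
        return a + b;
    }

    static long mul_inf(long s, long a) {
        if (a == LONG_MAX) return s >= 0 ? LONG_MAX : LONG_MIN;
        if (a == LONG_MIN) return s >= 0 ? LONG_MIN : LONG_MAX;
        return s * a;
    }

    /* l1:u1 + l2:u2 = (l1 + l2):(u1 + u2) */
    static Range range_add(Range a, Range b) {
        return (Range){ add_inf(a.lo, b.lo), add_inf(a.hi, b.hi) };
    }

    /* s * (l:u) = (s*l):(s*u) for s >= 0, and (s*u):(s*l) for s < 0 */
    static Range range_scale(long s, Range r) {
        Range out = { mul_inf(s, r.lo), mul_inf(s, r.hi) };
        if (s < 0) { long t = out.lo; out.lo = out.hi; out.hi = t; }
        return out;
    }

    int main(void) {
        /* Second component of T*d for the matrix row [2, 1] applied to the
           direction vector [0:0, 1:inf]:  2*(0:0) + 1*(1:inf) = 1:inf. */
        Range r = range_add(range_scale(2, (Range){ 0, 0 }),
                            range_scale(1, (Range){ 1, LONG_MAX }));
        printf("lo = %ld, hi = %s\n", r.lo, r.hi == LONG_MAX ? "inf" : "finite");
        return 0;
    }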

The algorithms presented in [21, 242, 243] show how to efficiently find a Unimodular matrix T that represents the combined effects of skewing, reversal and interchange needed to produce the desired loop nest. Manipulation of the T matrix allows computation of the loop bounds for the new nest. While the use of Unimodular transformations gives no greater power than a framework that directly applies skewing, reversal and interchange, it allows these transformations to be performed by manipulating matrices rather than the compiler IR, and therefore allows simpler and easier-to-maintain approaches for representing and performing these transformations within the compiler.

1 Software pipelining, discussed in Section 3.7, gives one way of exploiting cross-iteration reuse of values.

2 Java does not specify a layout for arrays, and prohibits programs from depending on the layout of arrays, but in practice adjacent array elements are contiguous in memory. As well, Java does not define two-dimensional arrays, but builds up multi-dimensional arrays from arrays of references or pointers, as shown in Figure 2.9. This effectively makes adjacent row elements contiguous.
