CHAPTER 4

Transformations to modify and eliminate dependences

If an original, unchanged loop is analyzed, dependences will almost certainly be found that prevent useful transformations, such as tiling, interchange and parallelization, from being performed. In this chapter, several transformations that modify, or completely eliminate, dependences will be described. These transformations typically do not lead to an increase in performance by themselves; on the contrary, they may add loop bounds, subscript index expressions and computation, resulting in more work being performed. They can, however, enable transformations that dramatically increase the performance of the program.

There is another class of transformations, reduction and recurrence recognition, that allow dependences to be ignored, but do not actually remove them, because the operations on the data involved in the dependence are commutative. We discuss reductions in Section 4.8 and give pointers to readings on recurrence recognition in Section 8.5.

4.1 LOOP PEELING AND SPLITTING

Loop peeling [248] is a transformation that removes a small number of iterations from the loop, reproducing them as straight-line code before or after the loop. Loop splitting is essentially the same transformation except that enough iterations are peeled off of the loop that executing them in a second loop is the best way to generate code.

In the loop of Figure 4.1(a), the last value for the array a is captured in the variable last. Standard dependence testing will assume that there are loop carried anti, output and flow dependences on last. By peeling off the last iteration of the loop, the remaining iterations have no loop carried dependences and therefore the loop can be executed in parallel.
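As an illustration, consider the following sketch of the pattern just described; the arrays, bounds and statements are ours, not those of Figure 4.1(a) and (b).

    /* original loop: every iteration writes the scalar last, so dependence
       testing reports loop-carried anti, output and flow dependences on it */
    for (i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
        last = a[i];
    }

    /* after peeling the last iteration, the remaining loop carries no
       dependences and can run in parallel */
    for (i = 0; i < n - 1; i++)
        a[i] = b[i] + c[i];
    a[n - 1] = b[n - 1] + c[n - 1];
    last = a[n - 1];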

In the loop of Figure 4.1(c), there are no loop-carried dependences within the first half of the iteration space, and there are none within the second half of the iteration space, but there are loop-carried dependences from the first half to the second half of the iteration space. The table of Figure 4.1(d) shows the elements of a accessed on each iteration, and the dependence relation between the two halves of the iteration space can be quickly inferred from the references. By using loop splitting to divide the loop into two loops, one running from 0 . . . n/2 − 1 and the other from n/2 . . . n − 1, two loops are formed that have no dependences and can therefore be made fully parallel.
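The following sketch uses a made-up loop of the same shape (it is not the loop of Figure 4.1(c)): the only loop-carried dependences run from the first half of the iteration space to the second half.

    /* original loop */
    for (i = 0; i < n; i++) {
        if (i < n / 2)
            a[i] = b[i] * 2.0;            /* writes the first half of a */
        else
            c[i] = a[i - n / 2] + b[i];   /* reads the first half of a */
    }

    /* after splitting: neither loop has loop-carried dependences, and
       each can be made fully parallel */
    for (i = 0; i < n / 2; i++)
        a[i] = b[i] * 2.0;
    for (i = n / 2; i < n; i++)
        c[i] = a[i - n / 2] + b[i];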

Because the order of access is not changed for any reference, the transformation is always legal. The only difficulty for the compiler is determining when to profitably apply the transformation. This is generally done in a somewhat ad-hoc manner by having the compiler look at suspicious dependences to see if, e.g., a scalar reference involved in loop-carried dependences is control dependent on a condition such that the scalar is only accessed in a single iteration.

Figure 4.1: An example of loop peeling and loop splitting.

4.2 LOOP SKEWING

Loop skewing changes the direction vector of a dependence, enabling loop parallelization and interchange (see Section 5.5). Consider Figure 4.2(a) and (b), which show a loop and its associated iteration space diagram. The dependence on a has a direction of (<, >), which precludes parallelization on both loops and prevents interchanging the i1 and i2 loops for better cache locality.

Increasing the sink iteration of the i2 loop by two iterations would give a direction of (<, =), and by more than two iterations would give a “<” direction. The case where the sink iteration of the i2 loop increases by two is shown in Figure 4.2.

Let i_out be an outer loop, and i_in some inner loop. Let δ be a dependence whose source and sink iterations for some loop i are i_src and i_sink, respectively. The distance of the dependence is D = [. . . , d_iout, . . . , d_iin]. Skewing allows us to shift the sink iteration of the i_in dependence by the amount

k · d_iout,  k ≥ 0

where k = 0 is the case when skewing is not done.

The transformation itself is straightforward. The shifting of iterations is achieved by altering the i_in loop bounds, adding i_out to both the lower and upper bounds. Where previously the lower bound of the i_in loop was L_iin, the new lower bound at the dependence source is L_iin + i_out,src and the new lower bound at the dependence sink is L_iin + i_out,sink. This shifts the iteration space of the instance of the i_in loop at the dependence sink, relative to the instance at the source, by i_out,sink − i_out,src iterations, i.e., by d_iout iterations. Because the values taken on by the loop index variable have changed, the values of the subscript expressions must also change. By changing each i_in term in the subscript expressions to i_in − i_out, the subscript expression values will be the same as in the original loop.

If the resulting dependences still have an undesired “>” direction in the i_in element of the direction vector, the transformation can be repeated, shifting the lower bound of the i_in loop by L_iin + 2 · i_out, L_iin + 3 · i_out, and so on, i.e., by k · i_out for successively larger k. In practice, given the distances d_iout and d_iin, the smallest skew factor that makes the inner distance non-negative,

    k = ⌈−d_iin / d_iout⌉,

is used to perform the skewing.

We note that skewing does not change the order in which dependence sources and sinks execute; it only changes the “labels”, i.e., the values, of those iterations. Thus, skewing is always legal to perform.
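The following sketch shows the bound and subscript changes on a made-up loop nest (it is not the loop of Figure 4.2); here the dependence on a has distance (1, −1), i.e., direction (<, >), and a skew factor of k = 1 suffices.

    /* original nest: iteration (i1, i2) writes a[i1][i2] and reads
       a[i1 - 1][i2 + 1], a dependence with distance (1, -1), direction (<, >) */
    for (i1 = 1; i1 < n; i1++)
        for (i2 = 0; i2 < m - 1; i2++)
            a[i1][i2] = a[i1 - 1][i2 + 1] + 1.0;

    /* after skewing the i2 loop by k = 1: i1 is added to the i2 bounds and
       subtracted from every i2 term in the subscripts; the dependence
       distance becomes (1, 0), i.e., direction (<, =) */
    for (i1 = 1; i1 < n; i1++)
        for (i2 = i1; i2 < m - 1 + i1; i2++)
            a[i1][i2 - i1] = a[i1 - 1][i2 - i1 + 1] + 1.0;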

Loop normalization is a form of skewing (or, more precisely, deskewing [250]) that is often used to make it easier to form dependence equations [244]. Loop normalization transforms the loop in Figure 4.3(a) into the loop of Figure 4.3(b). Since loop normalization is a form of skewing, it too is always legal to perform. As shown in the figure, normalization takes a loop running from L_i ≤ i < U_i and converts it to a loop starting at zero or one, i.e., to a loop running from 0 ≤ i < U_i − L_i. If L_i > 0, which it typically is when it is not equal to zero, the iterations are relabeled by this shift, and normalization runs the risk of making an “=” or “<” dependence into a “>” dependence.
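As an illustration (a made-up loop, not the one of Figure 4.3), assume a lower bound of 4 and a stride of 3.

    /* non-normalized loop: lower bound 4, stride 3 */
    for (i = 4; i < n; i += 3)
        a[i] = a[i] + b[i];

    /* normalized loop: starts at 0 with stride 1; every use of the original
       index is rewritten as 3*i + 4; the bound (n - 4 + 2) / 3 is the
       integer ceiling of (n - 4) / 3 */
    for (i = 0; i < (n - 4 + 2) / 3; i++)
        a[3 * i + 4] = a[3 * i + 4] + b[3 * i + 4];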

Figure 4.2: An example of loop skewing.

Figure 4.3: An example of loop normalization.

4.3 INDUCTION VARIABLE SUBSTITUTION

A common programming idiom is to have an induction variable that is, implicitly or explicitly, a function of the loop index variable. An example of this is the concatenation of one vector onto another, as seen in Figure 4.4(a). When attempting to parallelize the loop, the references to k are problematic. In particular, every iteration of the loop reads and writes the same memory location, causing cross-iteration dependences between the read of k in S2 and the write of k in S3.

It is easily observed that at some iteration i = v, the value of k in S2 will be start + v. Recognizing this, and making the proper substitution into the use of k in S2, enables the elimination of statement S3. This in turn eliminates the cross-iteration anti, output and flow dependences on k. The loop after this transformation is shown in Figure 4.4(b).
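A sketch of this concatenation idiom follows; the names a, b, start and the bounds are illustrative, not those of Figure 4.4(a) and (b).

    /* original loop: k is an induction variable updated in S3 */
    k = start;
    for (i = 0; i < n; i++) {
        b[k] = a[i];      /* S2: use of k */
        k = k + 1;        /* S3: update of k */
    }

    /* after induction variable substitution: k is replaced by start + i,
       S3 is eliminated, and no cross-iteration dependences remain */
    for (i = 0; i < n; i++)
        b[start + i] = a[i];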

More complicated situations can also be handled by induction variable substitution. In Figures 4.4(c) and (d), induction variable substitution is applied to triangular loops, and Figures 4.4(e) and (f) show it applied to multiple loop index variables.

Induction recognition proceeds as follows. First, all variables whose value is a function of only index variables, constants, and, recursively, other variables with the same properties are identified. The form of the expression is analyzed and a closed form expression in terms of loop index variables is generated. Next, the closed form expression is either assigned to a private variable that is substituted into all uses of the variable, or the expression itself is substituted into uses of the induction variable. The process continues until all induction variables have been expressed in terms of constants and loop index variables.
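For the triangular case mentioned above, a hypothetical example (not the loop of Figure 4.4(c)) and its closed form look as follows.

    /* k advances once per inner iteration of a triangular loop nest */
    k = 0;
    for (i = 0; i < n; i++)
        for (j = 0; j <= i; j++) {
            b[k] = a[i][j];
            k = k + 1;
        }

    /* before iteration (i, j), k has been incremented i*(i+1)/2 + j times,
       so substituting the closed form eliminates k entirely */
    for (i = 0; i < n; i++)
        for (j = 0; j <= i; j++)
            b[i * (i + 1) / 2 + j] = a[i][j];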

4.4 FORWARD SUBSTITUTION

Programmers often assign expressions into scalar temporaries to be reused throughout a computation. Compilers often perform the same optimization (called common sub-expression elimination) where repeatedly used expressions, or parts of an expression, are identified, with the value of the expression assigned into a temporary. If this is done within the body of the loop, a cross-iteration output dependence will exist on the scalar temporary. An example of this is shown in Figure 4.5.

Figure 4.4: Examples of induction variable substitution.

Let tmp be the problematic temporary. Let expr be the right-hand side expression, and let Vexpr be the variables read in expr. Assume that no value of any vVexpr (or any variable aliased to v) is changed between the assignment to tmp and a use of tmp. Then expr can be substituted into every use of tmp, eliminating all uses of, and the stores into, tmp, and therefore all dependences involving tmp. This transformation does increase the amount of work done in the loop, and in loops with a great deal of integer arithmetic (e.g., array subscript expressions) this integer arithmetic can form the critical path through the loop. Thus, forward substitution should be used when profitable, i.e., when it enables another transformation.
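A sketch of the transformation follows; the temporary tmp and the arrays are illustrative, not those of Figure 4.5.

    /* tmp creates loop-carried output (and anti) dependences */
    for (i = 0; i < n; i++) {
        tmp = b[i] * c[i];
        x[i] = tmp + 1.0;
        y[i] = tmp - 1.0;
    }

    /* after forward substitution the dependences on tmp are gone, at the
       cost of evaluating the expression twice per iteration */
    for (i = 0; i < n; i++) {
        x[i] = b[i] * c[i] + 1.0;
        y[i] = b[i] * c[i] - 1.0;
    }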

Figure 4.5: An example of forward substitution.

4.5 SCALAR EXPANSION AND PRIVATIZATION

Scalar expansion and privatization attempt to eliminate the same sort of dependences that forward substitution does, but do so by using additional storage rather than additional computation. The basic idea is that if (i) a scalar’s value is always computed in an iteration of the loop before it is used, and (ii) the value read within an iteration is always the value computed in that iteration, then each iteration’s value can be stored in a separate memory location. Since each iteration reads from and writes to a separate location, there are no loop-carried dependences on the variable.

Figure 4.6(a) shows a loop amenable to this transformation, with t being the scalar to be expanded or privatized. Scalar expansion is legal when the following conditions are true:

1. No use of the variable is upward exposed, i.e., the use never reads a value that was assigned outside the loop.

2. No use of the variable is from an assignment in an earlier iteration.

Whether or not these conditions are met can be determined by consulting the use-def information for the variable (see Section 2.5). For the loop of Figure 4.6(a) the conditions are met. Making the data read from and written to t private to an iteration can be accomplished in two ways, as we now describe.

Figure 4.6(b) shows the first way of doing this by expanding a scalar into an array of elements, with one element per loop iteration. Since all writes and reads of a given element of the array are within a single iteration, there are no loop-carried dependences on the array. We observe that (despite what we have said earlier when discussing parallelization) the real problem is not dependences that cross just any iteration, but rather dependences that go from an iteration executing on one thread to an iteration executing on another thread. Because the iterations on a thread execute in-order, dependences spanning iterations within a thread are enforced by the sequential execution of the program within the thread. This observation can be used to generate the code found in Figure 4.6(c), where storage for a private scalar is allocated on a thread local stack, as denoted by the private keyword. This solution is preferable to scalar expansion as it does away with the computational overhead of address calculation when using an array, reduces the storage overhead to be on the order of the number of threads rather than on the order of the number of iterations, and increases locality. Privatization also decreases the cache footprint of the t array and eliminates false sharing across cores and threads.
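The following sketch shows both variants on a made-up loop with the same shape as the one described for Figure 4.6(a); the OpenMP private clause stands in for the private keyword mentioned above, and the array tx is assumed to have one element per iteration.

    /* original loop: t is written before it is read in every iteration */
    for (i = 0; i < n; i++) {
        t = a[i] * a[i];
        b[i] = t + c[i];
    }

    /* scalar expansion: t becomes the array tx, one element per iteration */
    for (i = 0; i < n; i++) {
        tx[i] = a[i] * a[i];
        b[i] = tx[i] + c[i];
    }

    /* privatization: each thread gets its own copy of t on its stack */
    #pragma omp parallel for private(t)
    for (i = 0; i < n; i++) {
        t = a[i] * a[i];
        b[i] = t + c[i];
    }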

Figure 4.6: An example of scalar expansion and privatization.

4.6 ARRAY PRIVATIZATION

Like scalars, arrays are also sometimes used as temporaries within a loop nest and can be profitably privatized. An example of this is shown in Figure 4.7(a) and its dependence graph in Figure 4.7(b).

The problem with the program in the example, and in general, is that elements of a d-dimensional array are repeatedly assigned in a d + k deep loop nest, where k > 0. This creates loop carried dependences on all the loops in the loop nest. Note that scalar privatization, discussed above, is the degenerate case where d = 0. An additional difficulty is that different iterations may assign the final value into different elements of the array, and if these values are used outside of the loop nest they need to be assigned into the shared, global version of the array.

To perform privatization on some array a to eliminate loop carried dependences on loop i, the analysis must first determine that all values read from the array in some iteration of loop i are also written earlier in that iteration. One way to determine this is to compute the sections of the array that must be written in an iteration prior to each read of the array. If it is true that all elements read from the array at each read in the iteration must have also been written earlier in that iteration, then it is legal to perform privatization. Because the analysis is checking for prior writes in the same iteration, the analysis for legality does not need to worry about accesses in other iterations. Thus, a single forward pass through the loop body can be made to gather the data about reads and writes.

Data that is read or written by some reference r can be represented by a regular section descriptor [Lr : Ur : s], where Lr is the first element of the array that is accessed by r, Ur is the last element accessed, and s is the stride, or distance between adjacent elements accessed by r. A non-unit stride can result from either a non-unit coefficient of an index variable, or a non-unit stride for a loop index variable. Sections are able to precisely represent the region of an array accessed by a counted for loop with affine subscript expressions (see Section 2.3.2), i.e., subscripts that are linear expressions in the loop index variables. Thus, in S1 above, the elements written in a single iteration of the i1 loop would be [i1 : i1 : 1], and in S2, the elements read in a single iteration of the i2 loop would be [2 : i1 : 1].

Figure 4.7: An example of array privatization.

Remember that for array privatization to be performed it is necessary that all values read from an array in an iteration must be written earlier in that iteration. Thus, the elements defined (i.e., written) in an iteration of the loop in which the array is being privatized must be a (not necessarily proper) superset of the elements read. Therefore, under-estimating the set of elements written by a reference will lead to lost opportunities for privatization, but not to an incorrect privatization. Similarly, over-estimating the set of elements read is also a conservative approximation of the actual elements read.

This conservative approach allows an accurate dataflow analysis to be developed to determine the legality of privatization. In particular, a single forward pass is made to determine what elements of the array have been written in an iteration. At the join of two control flow paths through the loop (typically because of an if statement) the sections describing the elements that have been written along each path are intersected, leading to a possible under-estimation of the elements defined. Along a path, the sections corresponding to the elements defined by multiple assignments to the array can be formed by unioning the sections if they overlap. Then, at each read, a check is made to see if the set of array elements that have been defined up to that point in the iteration is at least as large as what is being read. If this is true at all reads (uses) of the array in the loop, the array can be legally privatized.
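A minimal sketch of the underlying machinery follows, under the simplifying assumptions noted in the comments; the type and function names are ours, not from the text.

    /* a regular section descriptor [lo : hi : stride] */
    struct section { long lo, hi, stride; };

    /* meet at a control-flow join: keep only elements known to be written on
       both paths; when the sections cannot be intersected exactly, return an
       empty section (lo > hi), the safe under-estimate */
    struct section meet(struct section a, struct section b) {
        struct section empty = { 1, 0, 1 };
        struct section r;
        if (a.stride != b.stride || (a.lo - b.lo) % a.stride != 0)
            return empty;
        r.stride = a.stride;
        r.lo = a.lo > b.lo ? a.lo : b.lo;
        r.hi = a.hi < b.hi ? a.hi : b.hi;
        return r;
    }

    /* sufficient (conservative) test that every element read is covered by
       the elements already written earlier in the same iteration */
    int covers(struct section written, struct section read) {
        return written.lo <= read.lo && read.hi <= written.hi &&
               read.stride % written.stride == 0 &&
               (read.lo - written.lo) % written.stride == 0;
    }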

The issue of storing last writes to an external array must also be handled if the array privatization is to be legal. In the example program, the writes to a in both S1 and S2 store into a[i1]. The write in the i2 loop is the second, and last, write of each iteration, but later iterations may also write the array. In some cases, such as that shown in the example, it can be determined that a particular write to the array being privatized will be the last write. Because the i2 loop writes increasingly large portions of the array that subsume the elements of the array written in the last iteration of the outer i1 loop, it can be determined that the writes to a in the i2 loop when i1 = n − 1 will be the last writes into the elements. More generally, a data structure can be kept that records the iteration in which the last write of each element occurs. When the current iteration is greater than or equal to this recorded iteration, the global array is updated with the value being written to the privatized version of the array. The “equal” part of the test is necessary when multiple writes to an element occur in the same iteration of the loop nest. Again, in this case compiler analysis can often determine which will be the last write in an iteration, and eliminate the test for the earlier writes. This is particularly easy when there are no zero-trip loops or if branches containing writes to the privatized array.
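A sketch of the run-time form of this test follows; last_iter[j], recording the iteration in which element j receives its final write, and the private copy a_priv are assumptions used only for illustration.

    /* at a write of value v to element j of the privatized array,
       executed in iteration i1 of the loop being privatized */
    a_priv[j] = v;
    if (i1 >= last_iter[j])    /* no later iteration writes element j */
        a[j] = a_priv[j];      /* copy out to the shared, global array */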

Privatization is considered profitable if it eliminates dependences. This occurs whenever it is legal and the array being privatized has one or more elements read in different loop iterations.

4.7 NODE SPLITTING

Dependence cycles involving both flow and anti dependences can often be broken with the node splitting transformation. This transformation is enabled by the fact that anti dependences arise from the re-use of storage, i.e., the read of a memory location is followed by a write to the same location. By adding extra storage to save the value that is read, and so ensure it is not overwritten, the read can be moved about the loop, thereby breaking the cycle.

Figure 4.8 gives an example of this. As can be seen in Figure 4.8(a), there is a lexically forward flow dependence from S1 (a[i]) to S2 (a[i-1]), and a lexically backward anti-dependence from S2 (a[i+1]) to S1 (a[i]), leading to the cycle shown in the dependence graph. The transformation identifies cross-iteration anti-dependences and assigns the value at the source of the dependence (a[i+1], in this case) to a temporary variable. As shown in Figure 4.8(b), this alone is insufficient to break the cycle. However, because the anti-dependence is loop-carried, the assignment of the source to a temporary variable can be moved within the body of the loop, i.e., within a particular iteration, without changing the value read. By moving the statement before the use of the temporary variable (i.e., the original source of the anti-dependence), the lexically backward anti-dependence becomes lexically forward, potentially breaking the cycle. This is shown in Figure 4.8(c). Moving the assignment as early as possible in the loop is legal, but placing it as late in the loop as it can be while still making the anti-dependence lexically forward may reduce the time that the value of the temporary is held in a register, and increase the effectiveness of the register allocator.
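A sketch of this pattern follows (the arrays and one possible placement of the copy are ours; the loop of Figure 4.8 is not reproduced here). The temporary t would then be privatized when the loop is parallelized.

    /* original loop: flow dependence from S1 to S2 via a[i-1], and a
       loop-carried, lexically backward anti dependence from S2's read of
       a[i+1] to S1's write of that element in the next iteration */
    for (i = 1; i < n - 1; i++) {
        a[i] = b[i] + c[i];            /* S1 */
        d[i] = a[i - 1] + a[i + 1];    /* S2 */
    }

    /* after node splitting: the anti-dependence source is copied into t
       before S1, making the anti dependence lexically forward and
       breaking the cycle */
    for (i = 1; i < n - 1; i++) {
        t = a[i + 1];                  /* S0 */
        a[i] = b[i] + c[i];            /* S1 */
        d[i] = a[i - 1] + t;           /* S2 */
    }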

While both privatization and node splitting target anti dependences, they are used in different situations. Privatization is useful when a variable is assigned a value that is only used within that iteration. Thus, the potential anti and output dependences do not really exist (in terms of the flow of data values in a sequential execution of the program), but parallelizing the loop and having multiple reads and stores, in an indeterminate order, to the single storage location will introduce races and be incorrect. Node splitting is useful when the anti-dependence may actually exist at run time, and thus the global state of the involved variables must be maintained. Because the two transformations are useful in different situations, they are complementary and can be used together.

4.8 REDUCTION RECOGNITION

The last technique we will discuss to remove dependences is reduction recognition. Reduction recognition exploits associativity to both break dependences and enable code motion and parallelization. Reductions are operations that reduce the dimensionality of at least one input operand using a commutative reduction operation, ⊕. Statement S2 in Figure 4.9(a) shows a reduction which reduces the one-dimensional array a into the scalar s, where the reduction operation is +. This follows the general form of a reduction, s = s ⊕ a[. . .].

Figure 4.9(b) shows the parallelization of the reduction in Figure 4.9(a). The transformation creates a private storage location for each thread involved in the parallel execution of the loop. Each thread then computes the partial sum involving the elements of the reduced variable (a in this case) accessed in the iterations it executes. A second loop sums the partial sums to give the final value of s.
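A sketch of this two-phase scheme in OpenMP follows; NTHREADS, the partial array, and the use of omp_get_thread_num are assumptions of the sketch, not the code of Figure 4.9(b).

    /* requires <omp.h>; phase 1: each thread accumulates a private partial sum */
    double partial[NTHREADS];
    for (int t = 0; t < NTHREADS; t++)
        partial[t] = 0.0;

    #pragma omp parallel num_threads(NTHREADS)
    {
        int t = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < n; i++)
            partial[t] = partial[t] + a[i];
    }

    /* phase 2: a second loop sums the partial sums into s */
    for (int t = 0; t < NTHREADS; t++)
        s = s + partial[t];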

A better parallelization of the reduction is shown in Figure 4.9(c). The summation of the partial sums is itself parallelized, as shown in the more complicated code of statements S8 to S12 for the second loop. Given |T| threads, where |T| is the number of threads, this version totals the partial sums in log2(|T|) steps, as shown in Figure 4.9(e) for 16 threads.
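A sketch of such a combining phase, assuming NTHREADS is a power of two, is shown below; it is a stand-in for statements S8 to S12, which are not reproduced here.

    /* at each step, the first d threads add in the partial sum of the thread
       d positions away; the d additions within a step can run in parallel */
    for (int d = NTHREADS / 2; d > 0; d = d / 2)
        for (int t = 0; t < d; t++)
            partial[t] = partial[t] + partial[t + d];
    s = s + partial[0];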

Figure 4.8: Node splitting example.

To parallelize reductions, a compiler essentially looks for statements of the form s = s ⊕ expr. Two conditions must be met for the reduction to be parallelized. First, the value of expr must be the same regardless of the order in which the loop iterations are evaluated. This is true if expr either contains only variables that are unchanged in the loop, or variables whose values are changed in the same iteration of the loop in which they are used; that is, the flow or anti dependence from the write to the read in the reduction statement has an “=” direction, or a distance of “0”, in the loop being parallelized. Second, the left-hand side s must not be used in other statements. Reads of s in some iteration of the parallel loop will get a partial sum that has a different value than it would in a sequential execution of the loop. Figure 4.9(d) gives an example of a loop that violates both of these conditions.

Figure 4.9: Reduction recognition example.

4.9 WHICH TRANSFORMATIONS ARE MOST IMPORTANT?

As mentioned earlier, most loops cannot be directly parallelized because loop carried data dependences exist on the loops. True dependences by their very nature cannot be eliminated; at best the involved statements can be reordered, with the dependences enforced by barriers or by producer-consumer synchronization. This is the goal of peeling, splitting, and of topologically sorting the dependence graph and performing fission. The effectiveness of these transformations is limited both by the form in which a computation has been expressed and by the data flow properties of the underlying algorithm.

Scalar expansion and array privatization are arguably among the most important transformations a compiler can perform to eliminate dependences. Common programming idioms, such as the use of scalars and arrays within deeply nested loops to hold temporary values, lead to numerous anti and output dependences. Because these dependences are not fundamental to a computation but rather are an artifact of insufficient resources being used, applying transformations that provide additional resources locally within a loop can have a dramatic effect on the number of dependences. It is interesting to observe that register renaming in hardware, which maps architected registers to a combination of architected and unarchitected registers to break output and anti dependences, has eliminated renaming as a transformation in at least some compilers.
