CHAPTER 3

Program parallelization

This chapter discusses the use of dependence information to guide automatic program parallelization. We focus on loops without data dependences that would prevent any parallelization of the loop. In practice, it is often necessary to perform additional transformations to allow a loop to be partially or completely parallelized. Transformations to eliminate dependences, or to allow them to be ignored, are discussed in Chapter 4. Other transformations that may enhance parallelization are discussed in Chapter 5.

This chapter is organized as follows. In Section 3.1, the easy case of parallelizing loops with no dependences that prevent full parallelization of the loop is discussed. In Section 3.2, the parallelization of loops with more complicated dependence structures is discussed, and these capabilities are further expanded with a view towards targeting vector hardware in Section 3.3. Section 3.4 examines how producer/consumer synchronization (in contrast to mutex synchronization) can be used to extract more parallelism than would otherwise be possible from a loop. In Sections 3.5 and 3.6, issues arising from more complicated control structures, in particular from recursion and while loops, are discussed. Finally, Section 3.7 discusses software pipelining, which exposes fine-grained instruction level parallelism to the processor.

3.1 SIMPLE LOOP PARALLELIZATION

As discussed in Section 1.1, for a loop to be parallel without any additional transformations its iterations must be independent, i.e., there can be no flow of values or reuse of memory locations across iterations. More formally, there can be no loop-carried dependences on the loop to be parallelized, i.e., all dependences in the loop to be made fully parallel must have a direction of “0” (“=”). As mentioned above, in some cases problematic dependences can be eliminated or ignored. It is also generally desirable that the parallel loop be the outermost loop in the nest, as this maximizes the amount of work done in the parallel loop, and minimizes the number of instances of the parallel loop that are started, and the ensuing overhead. Loop interchange (Section 5.5) accomplishes this.
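
As a concrete illustration, the sketch below (in C with OpenMP; the array names and bounds are hypothetical) shows a loop whose iterations each write a distinct element and read only data produced before the loop, so it has no loop-carried dependences and can be marked parallel as-is.

```c
#include <stdio.h>

#define N 1024

int main(void) {
    static double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) {   /* initialize the inputs */
        b[i] = i;
        c[i] = 2.0 * i;
    }

    /* Each iteration writes only a[i] and reads only b[i] and c[i], so no
       value flows between iterations and no location is reused by them:
       the loop can be run in parallel without further transformation. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];

    printf("%f\n", a[N - 1]);
    return 0;
}
```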

When loop-carried dependences have non-equal directions on several loops, it is sometimes possible to parallelize an inner loop. More specifically, if all dependences that are carried by the i2 loop, i.e., that have a non-equal direction on the i2 loop, also have a non-equal direction on some outer loop i1, the inner i2 loop can be parallelized. An example of this can be seen in Figure 3.1. In the example, the outer i1 loop is forced to execute sequentially (i.e., is frozen) to enable the instance of the inner i2 loop within an iteration of the i1 loop to execute in parallel. By examining the order of execution of the different i2 loop iterations it can be seen how freezing the i1 loop enables the i2 loop parallelization. Because the source of the dependence is in an earlier iteration of the i1 loop than the sink, and the i1 loop is executing sequentially, the sink of the dependence must execute later than the source, even when the i2 loop is parallel. This forces the dependence to be honored, allowing the i2 loop to be executed in parallel.
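
A minimal sketch of the situation (in C with OpenMP; the subscripts are hypothetical, chosen so that the dependence carried by the i2 loop is also carried by the outer i1 loop): freezing i1 satisfies the dependence, and the inner i2 loop can then be run in parallel.

```c
#include <stdio.h>

#define N 64

int main(void) {
    static double a[N][N];

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = i + j;

    /* The write of a[i1][i2] and the read of a[i1-1][i2+1] form a dependence
       with direction (<, >): it has a non-equal direction on i2, but also on
       the outer i1 loop.  Freezing i1 (running it sequentially) therefore
       guarantees the source executes before the sink, so the i2 loop can be
       executed in parallel within each i1 iteration. */
    for (int i1 = 1; i1 < N; i1++) {          /* frozen: sequential        */
        #pragma omp parallel for
        for (int i2 = 0; i2 < N - 1; i2++)    /* parallel within each i1   */
            a[i1][i2] = a[i1 - 1][i2 + 1] + 1.0;
    }

    printf("%f\n", a[N - 1][0]);
    return 0;
}
```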

Figure 3.1: An example of how freezing outer loops can enable inner loop parallelization.

3.2 PARALLELIZING LOOPS WITH ACYCLIC AND CYCLIC DEPENDENCE GRAPHS

Substantial parallelism can often be extracted from loops whose dependence structure is more complicated than the loops discussed in Section 3.1. In particular, we discuss how to extract parallelism when the dependence graph may be cyclic and loop freezing cannot be used to break the cycles. The following steps will be performed.

1. Construct a dependence graph for the loop nest.

2. Find the strongly connected components (SCCs) formed by cycles of dependences in the graph, and contract the nodes of each SCC into a single node.

3. Mark all nodes in the graph containing a single statement as parallel.

4. Topologically sort the nodes in the graph so that all inter-node dependences are lexically forward.

5. Group independent, unordered nodes that are marked as parallel and that read the same data into new nodes, to optimize data reuse.

6. Perform loop fission (see Section 5.3) to form a new loop for each node in the sorted dependence graph.

7. Mark as parallel all loops resulting from nodes whose statements are marked as parallel in the sorted graph.

These steps will be explained in detail by means of an example in the remainder of this section.

When the dependence graph for a loop is acyclic, as shown in Figure 3.2(a) (showing a program) and Figure 3.2(b) (showing the acyclic dependence graph for the program), the loop can be fully parallelized by breaking it into multiple loops, each of which has no loop-carried dependences. If none of these loops executes until all dependences into it are satisfied (i.e., until after the loop containing the source of each such dependence has executed), then the barrier that is, by default, at the end of every parallel loop (see Section 1.2.1) forces the writes from each instance of the parallel loop to complete before later instructions execute, and the dependences are honored. Thus, the goal of the compiler is to break up the loop with dependences into multiple loops, none of which contains both the source and sink of a loop-carried (cross-iteration) dependence, and then to order these new loops so that each executes only after all of the dependences it is involved in have been satisfied.

To do this requires ensuring that all dependences are lexically forward, i.e., that in branchless code the sink of the dependence is lexically forward of the source of the dependence. This can be done by topologically sorting the dependence graph. Because the dependences form a partial order on the nodes of Figure 3.2(b) there are several possible orderings of the nodes resulting from a topological sort; one valid order is shown in the graph of Figure 3.2(c).

Once the dependence graph is topologically sorted, the compiler reorders the statements of the loop to match the topological sort order. Loop fission (Section 5.3), which is always legal when all dependences are lexically forward, is then performed, yielding the sequence of loops seen in Figure 3.2(d). In the transformed program, no loop has cross-iteration dependences. A more efficient parallelization can be obtained with a different partitioning of statements among loops that is still consistent with the ordering implied by the topological sort: statements that are not related by a loop-carried dependence are kept together in the same loop. Doing this with the original graph yields the ordering S2, S1, S4, S3. Because there is no loop-carried dependence between S1 and S4, they can remain in the same loop, as shown in Figure 3.2(e). This can improve the locality of accesses to the b array, and eliminates the startup and iteration count overhead of one loop.
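
The sketch below (in C with OpenMP) applies the transformation to a hypothetical loop, not the code of Figure 3.2, whose four statements have an acyclic dependence graph: after reordering the statements to S2, S1, S4, S3 and applying fission, each resulting loop is free of loop-carried dependences, and the implicit barrier at the end of each parallel loop satisfies the dependences between loops. S1 and S4 share a loop because the only dependence between them is within an iteration.

```c
#include <stdio.h>

#define N 1000

int main(void) {
    static double a[N], b[N], c[N], d[N], e[N];

    for (int i = 0; i < N; i++) d[i] = i;
    a[0] = b[0] = c[0] = e[0] = 0.0;

    /* Original (sequential) loop; S2 -> S1 and S4 -> S3 are loop-carried:
       for (int i = 1; i < N; i++) {
           a[i] = c[i-1] + 1.0;      // S1
           c[i] = d[i] * 2.0;        // S2
           e[i] = b[i-1] + a[i];     // S3
           b[i] = a[i] + d[i];       // S4
       }                                                            */

    #pragma omp parallel for
    for (int i = 1; i < N; i++)
        c[i] = d[i] * 2.0;            /* S2 */

    #pragma omp parallel for
    for (int i = 1; i < N; i++) {     /* S1 and S4 kept together: only a   */
        a[i] = c[i - 1] + 1.0;        /* S1   same-iteration dependence    */
        b[i] = a[i] + d[i];           /* S4   links them                   */
    }

    #pragma omp parallel for
    for (int i = 1; i < N; i++)
        e[i] = b[i - 1] + a[i];       /* S3 */

    printf("%f\n", e[N - 1]);
    return 0;
}
```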

Figure 3.2: An example of parallelizing loops with cross-iteration dependences.

In loops with cyclic dependence graphs with at least one loop-carried dependence, the cyclically dependent statements will form a strongly connected component (SCC) in the dependence graph. The most straightforward way to deal with the statements in each SCC is to place them in a loop that is executed sequentially.

In loop nests the same strategy is followed, except that it may be necessary to perform fission on all loops in the loop nest on which a pair of statements has loop-carried dependences.

Figure 3.3: An example of decoupled software pipelining.

In Section 3.4, we show how parallelism can sometimes be extracted from these loops using producer-consumer synchronization. Another way of extracting parallelism from these loops is to execute the SCCs in a pipelined fashion. This technique is called decoupled software pipelining, and is described in detail in [189, 190]. An example of this is shown in Figure 3.3. Note that the maximum speedup is constrained by the size of the largest SCC. In Section 8.9, we give further readings that discuss how to overcome this impediment to performance.
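
A minimal sketch of the idea, using POSIX threads and C11 atomics rather than any particular compiler's runtime: the sequential SCC (here a hypothetical recurrence on a[]) runs as one pipeline stage and publishes how far it has progressed, while a second stage concurrently performs the independent work on the values already produced, so the two stages overlap in a pipelined fashion.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define N 100000

static double a[N], b[N];
static atomic_long ready;   /* number of a[] elements produced so far */

/* Stage 1: the strongly connected component, a recurrence that must run
   sequentially. */
static void *scc_stage(void *arg) {
    (void)arg;
    a[0] = 1.0;
    atomic_store_explicit(&ready, 1, memory_order_release);
    for (long i = 1; i < N; i++) {
        a[i] = 0.5 * a[i - 1] + 1.0;
        atomic_store_explicit(&ready, i + 1, memory_order_release);
    }
    return NULL;
}

/* Stage 2: independent work on each a[i]; it waits only when it catches up
   with stage 1. */
static void *work_stage(void *arg) {
    (void)arg;
    for (long i = 0; i < N; i++) {
        while (atomic_load_explicit(&ready, memory_order_acquire) <= i)
            ;   /* spin until a[i] has been produced */
        b[i] = a[i] * a[i];
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, scc_stage, NULL);
    pthread_create(&t2, NULL, work_stage, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%f\n", b[N - 1]);
    return 0;
}
```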

3.3 TARGETING VECTOR HARDWARE

To generate code for vector hardware it is necessary to partition the statements in the original loop into a sequence of loops, each of which contains a single operation to be performed by the vector unit. The resulting loops are then further transformed to execute a sequence of vector operations on data that corresponds to the number of scalar data items operated on by each vector instruction. When performing vectorization the innermost loop will be vectorized. If a loop other than the innermost loop is more amenable to vectorization, loop interchange (see Section 5.5) can sometimes be used to make another loop the innermost loop.

To vectorize an inner loop, the transformation described in Section 3.2 is first applied to the loop to be vectorized, resulting in each strongly connected component in the dependence graph being placed in a separate loop. This yields loops similar to those shown in Figure 3.2(c). In the loops of Figure 3.4(a), loop blocking or strip mining has been applied to each of the resulting loops, so that the inner blocked loop (shown in Figure 3.4(b)) has the same number of iterations as the number of operations performed by a vector instruction. For brevity, the fixup code for only the last loop is shown. The fixup code is necessary because the loops being executed on the vector hardware may not have a number of iterations evenly divisible by the number of operations performed at a time by the vector hardware.
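
A minimal source-level sketch of strip mining, for a hypothetical vector length VL and a trip count that is not a multiple of VL: the inner loop of exactly VL iterations is the unit the back end maps to vector instructions, and the remainder loop is the fixup code.

```c
#include <stdio.h>

#define N  1003   /* deliberately not a multiple of the vector length */
#define VL 8      /* elements per vector instruction (hypothetical)   */

int main(void) {
    static float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { b[i] = i; c[i] = 1.0f; }

    int i;
    for (i = 0; i + VL <= N; i += VL) {
        for (int j = i; j < i + VL; j++)   /* becomes one vector operation */
            a[j] = b[j] + c[j];
    }

    for (; i < N; i++)                     /* fixup for the leftover iterations */
        a[i] = b[i] + c[i];

    printf("%f\n", a[N - 1]);
    return 0;
}
```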

The compiler back-end generates hardware vector instructions for each of the two inner loops. One more small complexity must be handled. In Figure 3.4(b), the statement’s right-hand-side (RHS) contains several operations. Vector hardware typically supports a single operation per instruction (a notable exception is described below). Either the high-level compiler that operates on loops, or the code-generating compiler, must break this RHS into multiple statements. If the statement in the strip-mined loop is viewed as an expression tree, as shown in Figure 3.4(c), code can be easily generated by doing a bottom-up (post-order) traversal of the tree, generating code for each interior node as a separate statement and loop, as shown in Figure 3.4(d).
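
A minimal sketch of this splitting for a hypothetical strip-mined statement a[j] = b[j] * c[j] + d[j]: each interior node of the expression tree becomes its own single-operation statement and loop, with a compiler-generated temporary t[] carrying the intermediate result.

```c
#include <stdio.h>

#define VL 8

int main(void) {
    float a[VL], b[VL], c[VL], d[VL], t[VL];
    for (int j = 0; j < VL; j++) { b[j] = j; c[j] = 2.0f; d[j] = 1.0f; }

    /* Original strip-mined statement: a[j] = b[j] * c[j] + d[j]; */

    for (int j = 0; j < VL; j++)   /* interior node 1: the multiply */
        t[j] = b[j] * c[j];

    for (int j = 0; j < VL; j++)   /* interior node 2: the add      */
        a[j] = t[j] + d[j];

    printf("%f\n", a[VL - 1]);
    return 0;
}
```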

An exception to the “one operation per instruction” rule is made for the fused multiply-add (FMA) operation (y*z+w), which is extremely important in linear algebra. Because the multiply and add operations, and the register-to-register transfers, can occur in less than a clock cycle on most processors, the FMA instruction can stream the result of the multiply in (y*z+w) into an adder that takes w as its second input, returning the result of the multiply and add in a single cycle and giving roughly double the peak performance of performing the multiply and the add as discrete clocked operations.

3.4 PARALLELIZING LOOPS USING PRODUCER/CONSUMER SYNCHRONIZATION

Dependences prevent parallelization because there is no guarantee the order that statements execute on different threads in the parallel version of the program will enforce the dependence, as discussed in Section 1.1. By using producer/consumer synchronization, this ordering can be forced on the program execution, allowing parallelism to be extracted from loops with dependences.

3.4.1 PRODUCER AND CONSUMER SYNCHRONIZATION

Several forms of producer/consumer synchronization (referred to simply as synchronization for the rest of this section) were implemented in the 1980s and early 1990s, during the age of the super mini-computers. Full/empty synchronization, implemented in the Denelcor HEP [126], attaches a full/empty bit to each word of storage. By having accesses set or reset the full/empty bit, reads and writes to a memory word could be sequenced. A memory operation could wait until a word was full before proceeding, and could either clear or set the full/empty bit.

Figure 3.4: An example of vectorization.

The Alliant FX/8 [1, 228] implemented the advance(r, i) and await(r, i) synchronization instructions. The advance instruction sets synchronization register r to i, and the await instruction waits for r to take the value i. Thus, the dependence between a[i] and a[i − 1] in Figure 3.5 could be synchronized by executing advance(r, i) after the write to a[i], and await(r, i − 1) before the read. Variables guarded by the synchronization instructions could also be specified, allowing the compiler to move unsynchronized memory operations past the synchronization instructions.

Compiler exploitation of both of these synchronization instructions, and of general producer/consumer synchronization, can be discussed in terms of post and wait synchronization. The instruction post(r, i, vars) writes the value i to r, and the instruction wait(r, i, vars) waits until the value of r is i. In both cases, it is assumed that no accesses to the memory locations denoted by the variables in the list vars are moved past the synchronization instruction by later compiler passes. The post and wait instructions also have functionality equivalent to a fence instruction, ensuring that the results of all memory accesses before the post or wait are visible before the post or wait completes, and that the hardware does not move instructions past the synchronization operation at run time.

A compiler can synchronize a program directly from the dependence graph. After the source of dependence δ, it inserts the instruction post(rδ, i, vars), where i is the loop index variable, rδ is the synchronization register used for dependence δ, and vars contains the variables whose dependence is being synchronized. Before each dependence sink the compiler inserts the instruction wait(rδ, i − dδ, vars), where dδ is the distance of the dependence on the i loop. Figure 3.5(d) shows the loop after synchronization.
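
A minimal sketch of the result of this insertion, for a hypothetical loop body with a distance-1 flow dependence from S1 to S3. For clarity the synchronization register is modeled as one flag per iteration rather than a single hardware register, and post/wait are built from C11 atomics; the independent statement S2 is the work that overlaps across iterations.

```c
#include <stdatomic.h>
#include <stdio.h>

#define N 4096

static atomic_int done[N];                /* done[i] is set after S1 of iteration i */
static double a[N], b[N], c[N], d[N];

static void post(long i) {                /* post: publish that a[i] has been written */
    atomic_store_explicit(&done[i], 1, memory_order_release);
}

static void wait_for(long i) {            /* wait: block until a[i] is visible */
    while (!atomic_load_explicit(&done[i], memory_order_acquire))
        ;   /* spin */
}

int main(void) {
    for (long i = 0; i < N; i++)
        b[i] = i;
    a[0] = 0.0;
    post(0);

    #pragma omp parallel for schedule(static, 1)
    for (long i = 1; i < N; i++) {
        a[i] = b[i] + 1.0;                /* S1: source of the dependence       */
        post(i);
        c[i] = b[i] * b[i] + 3.0;         /* S2: independent work that overlaps */
        wait_for(i - 1);
        d[i] = a[i - 1] * 2.0;            /* S3: sink of the dependence         */
    }

    printf("%f\n", d[N - 1]);
    return 0;
}
```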

As shown in Figure 3.5(c), parallelism comes from overlapping the execution of the sources of the forward dependence, from S1 to S3, on a. Figure 3.6 shows a program similar to that of Figure 3.5(a), except that the cross-iteration dependence on b from S2 to S3 is not present, and the backward dependence on c now has a distance of two. Synchronizing this loop as described above leads to the execution order shown in Figure 3.6(b), which not only allows the sources of the forward dependence to execute in parallel, but also allows adjacent iterations whose statements are involved in the dependence cycle to execute in parallel. This is because the distance of two on the dependence means that no backward dependence exists between adjacent iterations, and therefore the adjacent iterations can execute in parallel. Loops of this form are often referred to as doacross loops [61, 168] in the literature.

The reasons that producer/consumer synchronization instructions are no longer supported in hardware show the impact that technology and economics have on what is considered a desirable architectural feature. With the rise of the killer micros [107], commodity processors became the economically preferred building block for parallel systems. Specialized synchronization instructions fell out of favor because of the increased latencies required when synchronizing across the system bus between general-purpose processors, and because the RISC principles of instruction set design [105, 172] favored simpler instructions from which post and wait operations could be built, albeit at a higher run-time cost. With the rise of the multicore processor, and, at least for a period of time, relatively low latencies in intra-processor, inter-core communication, specialized synchronization might again be attractive [145]. We note that, except for questions of profitability, the compiler strategy for inserting and optimizing synchronization is indifferent to whether it is implemented in software or hardware.

Figure 3.5: An example of using producer/consumer synchronization.

Figure 3.6: An example of using producer/consumer synchronization for a doacross loop.

3.4.2 OPTIMIZING PRODUCER/CONSUMER SYNCHRONIZATION

A compiler can sometimes reduce the number of synchronization operations needed to synchronize the dependences in a loop. While all dependences must be enforced, this optimization reduces the overhead of enforcing them by allowing a single post/wait pair to synchronize more than one dependence, or a combination of post/wait instructions to synchronize additional dependences. The key observation behind the optimization is that if a dependence is already enforced by a combination of program order and the synchronization operations of other dependences, no additional synchronization is needed for it. Figure 3.7(a) and (b) show a loop with two dependences and the associated dependence graph.

Figure 3.7: An example of synchronization optimization.

The ISD of Figure 3.7 shows the dependences to be enforced as solid lines, and the execution orders implied by the sequential execution of the program as dotted lines. Consider the dependence with distance two from statement S1 in iteration i = 2 to statement S3 in iteration i = 4. Let Sj(k) represent the instance of statement Sj in iteration i = k. There is a path

S1(2) → S3(2) → S2(3) → S3(3) → S2(4) → S3(4)

(where S1(2) → S3(2), S2(3) → S3(3), and S2(4) → S3(4) are program-order edges within an iteration, and S3(2) → S2(3) and S3(3) → S2(4) are instances of the dependence from S3 to S2)

from S1 in iteration 2 to S3 in iteration 4. If the dependence from S3 to S2 has been synchronized, then the existence of this path of enforced orders implies that the dependence from S1(2) to S3(4) is also enforced. Because the distances are constant, the iteration space can be covered by shifting the region in the dashed lines, and therefore every instance of the dependence within the iteration space is synchronized.

In [156] it is shown that this problem can be solved by performing a transitive reduction on the ISD. Because transitive reduction is used, it is possible for multiple dependences to work together to eliminate another dependence. The ISD on which the transitive reduction is performed needs to only contain a subset of the total iteration space (as shown by the dashed box in Figure 3.7(b)). For each loop in the loop nest over which the synchronization elimination is taking place, the number of iterations needed in the ISD for the loop is equal to the least product of the unique prime factors of the dependence distances, plus one.

Another synchronization elimination technique, described in [144], is based on pattern matching and works even if the dependence distances are not constant. The matched patterns identify dependences whose lexical relationship and distances are such that synchronizing one dependence will synchronize the other by forming a path like the one shown in Figure 3.7(b). In the program of Figure 3.7(a), let the forward dependence with a distance of two that is to be eliminated be δe, and let the backward dependence of distance one that is used to eliminate it be δr. One pattern given in [144] is that if (i) there is a path from the source of δe to the source of δr; (ii) the sink of δr reaches the sink of δe; (iii) δr is lexically backward (i.e., its sink precedes its source in the program flow); (iv) the absolute value of the distance of δr is one; and (v) the signs of the distances of δr and δe are the same, then δe can be eliminated. Conditions (i) and (ii) establish the proper flow between δe and δr, (iii) recognizes that δr can be repeatedly followed to reach all iterations that are a multiple of its distance away from the source, and (iv) and (v) ensure that, because the absolute value of the distance is one and the signs of the two distances are equal, the traversal enabled by (iii) will reach the iteration containing the sink of δe.

In [157] these two techniques and others are compared and it is shown that the method of [144] actually performs better than that of [156], and has a lower compile time overhead on the loops examined.

3.5 PARALLELIZING RECURSIVE CONSTRUCTS

Iterative constructs have been the focus of most of the work on automatic parallelization for several reasons. First, in numerical programs, loops contain almost all of the work performed by the program, and as a consequence of Amdahl’s law are the natural place to target for performance improvements, including parallelization. Second, in many loop based numerical programs, the data accesses tend to be amenable to dependence analysis and dependence analysis based optimizations. This in turn provides positive feedback to programmers of these applications to write code amenable to analysis. Finally, many recursive codes operate on trees or other pointer-based data structures (or equivalently, indirect accesses through arrays). These applications are fundamentally more difficult to analyze, for the reasons discussed in Section 2.1.2. Nevertheless, significant results exist for the optimization of application code containing recursion.

Among the earliest work is that of Harrison [109] on the Miprac Scheme compiler [108]. Miprac performs an abstract interpretation over a Lisp/Scheme program’s recursive calls to determine if the calls can be converted to (i) a while loop that determines the exit condition, and (ideally) (ii) a for loop that can be easily parallelized and whose iteration count has been decided by the while loop. In some cases the while loop can also be parallelized (see Section 3.6).

We will briefly discuss two other approaches. Divide-and-conquer algorithms lend themselves to parallelization because at each level of recursion the new function invocations access disjoint sets of data. If a compiler can determine that this is the case, it can allow each of the calls to proceed in parallel. In [90, 92], a dependence analysis similar to Eigenmann and Blume’s Range Analysis [31] is used to prove that the recursive calls in a quicksort operate on disjoint data. Figure 3.8(a) shows the parallelization of the quicksort algorithm using Cilk-like [51] constructs. The spawn constructs create two instances of the Quicksort function and execute them in parallel on two different threads. Because the threads operate on completely disjoint sets of data, no synchronization is needed between the sets of calls until the end of the program, where the sync statement ensures that all of the threads have finished before the program terminates.
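
Plain C has no spawn/sync, so the sketch below expresses the same structure with OpenMP tasks standing in for the Cilk-like constructs; the partition routine, array contents, and lack of a serial cutoff are illustrative choices, not the code of Figure 3.8(a).

```c
#include <stdio.h>
#include <stdlib.h>

/* Partition a[lo..hi] around a pivot and return the pivot's final index. */
static int partition(double *a, int lo, int hi) {
    double pivot = a[hi];
    int i = lo - 1;
    for (int j = lo; j < hi; j++) {
        if (a[j] <= pivot) {
            i++;
            double t = a[i]; a[i] = a[j]; a[j] = t;
        }
    }
    double t = a[i + 1]; a[i + 1] = a[hi]; a[hi] = t;
    return i + 1;
}

/* The two recursive calls operate on disjoint halves of a[], so they can run
   in parallel; taskwait plays the role of sync. */
static void quicksort(double *a, int lo, int hi) {
    if (lo >= hi) return;
    int p = partition(a, lo, hi);
    #pragma omp task                  /* spawn */
    quicksort(a, lo, p - 1);
    #pragma omp task                  /* spawn */
    quicksort(a, p + 1, hi);
    #pragma omp taskwait              /* sync  */
}

int main(void) {
    enum { N = 100000 };
    static double a[N];
    for (int i = 0; i < N; i++) a[i] = rand() / (double)RAND_MAX;

    #pragma omp parallel
    #pragma omp single
    quicksort(a, 0, N - 1);

    printf("sorted: %s\n", a[0] <= a[N - 1] ? "yes" : "no");
    return 0;
}
```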

Figure 3.8: Examples of parallelizable recursive code.

Another approach targets object oriented languages and uses commutativity analysis to determine when operations performed in multiple invocations of a function are commutative [197]. When this is true, all possible interleavings of operations across parallel invocations of the function will give the same result as a sequential execution of the function invocations. To determine commutativity, two tests need to be performed.

1. Test that the values of instance variables are the same whether two operations Op1 and Op2 (which may be method calls) execute in the order Op1 then Op2, or in the order Op2 then Op1. This ensures that the outcomes of the operations are the same regardless of the order in which they execute. An important case where this is true is when the two operations operate on independent data, or the data accessed are read only—such operations always commute.

2. Test that the set of operations performed when executing Op1 and then Op2 is the same as the set of operations performed when executing Op2 and then Op1. This ensures that the multiset of operations executed in either order is the same, and allows the analysis of (1) to show that if all operations are commutative, then, because the multiset of operations executed is the same in either order, the outcome of the complete computation must be the same.

The parallelizable invocations of the Quicksort routine are detected by commutativity analysis because the two function calls access disjoint data sets and are therefore independent, and hence commutative. The function calls of the traverse routine in Figure 3.8(b) are also parallelizable using commutativity analysis, even though the data accessed are not independent. If a naive parallelization of the calls is performed, there is the danger of a race condition on the check of whether a node has been previously visited. By enclosing the read and update of the visited array in a mutex, the race is avoided, and commutativity analysis inserts this synchronization.
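
A minimal sketch of this pattern in C with OpenMP tasks (not the object-oriented setting of [197]): the recursive calls may reach the same nodes, so the read and update of visited[] are enclosed in a lock, the synchronization that commutativity analysis must insert, while the graph itself and the per-node work are hypothetical.

```c
#include <omp.h>
#include <stdio.h>

#define NNODES 8

/* A small fixed graph: succ[n] holds the (at most two) successors of node n,
   with -1 marking an empty slot.  Node 7 points back to node 0, so the
   visited[] test is what keeps the traversal finite. */
static int succ[NNODES][2] = {
    {1, 2}, {3, 4}, {4, 5}, {6, -1}, {6, 7}, {7, -1}, {-1, -1}, {0, -1}
};
static int visited[NNODES];
static omp_lock_t lock;

static void traverse(int n) {
    /* The read and update of visited[n] must appear atomic to the other
       invocations; otherwise two tasks could both decide to process n. */
    omp_set_lock(&lock);
    int first_visit = !visited[n];
    visited[n] = 1;
    omp_unset_lock(&lock);
    if (!first_visit) return;

    /* ... work on node n would go here ... */

    for (int k = 0; k < 2; k++) {
        int s = succ[n][k];
        if (s >= 0) {
            #pragma omp task        /* the recursive calls may run in parallel */
            traverse(s);
        }
    }
    #pragma omp taskwait
}

int main(void) {
    omp_init_lock(&lock);
    #pragma omp parallel
    #pragma omp single
    traverse(0);
    omp_destroy_lock(&lock);

    for (int n = 0; n < NNODES; n++)
        printf("visited[%d] = %d\n", n, visited[n]);
    return 0;
}
```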

3.6 PARALLELIZATION OF WHILE LOOPS

The parallelization of while loops is problematic for two reasons. First, unlike counted loops (e.g., a simple C for loop or Fortran do loop) the number of iterations is usually not known before the loop begins executing, making it difficult to spread the loop iterations across processors and to determine when to terminate the loop. Second, while loops are often used to perform computation on less regular data structures, including pointer based structures such as linked lists, trees, and the like, or array accesses involving non-linear subscripts, often in the form of subscripted subscripts (e.g., a[b[i]]). We will discuss the issues raised by each of these in turn.

Figure 3.9(a) shows a stylized while loop. The loop has four parts—the initialization, shown in S1, the test, shown in S2, the increment, shown in S4, and the work or remainder, shown in S3 and S5. We now discuss two challenges in parallelizing the loop: determining the number and distribution of iterations in the loop, and exploiting parallelism both in determining the loop iterations and in the work done by the body of the loop.

3.6.1 DETERMINING ITERATIONS TO BE EXECUTED BY A THREAD

The form of the update in the stereotypical while loop is shown in Figure 3.9(a). In the best situation, the loop iterations are controlled by a simple induction of the form a · i ± c, where a and c are loop invariants. If the test function τ (see statement S2) is a simple comparison (e.g., i < n), the while loop can be converted to a simple for loop. If the test is more complicated, then code such as that shown in Figure 3.9(b) can be used. The array first captures the first (lowest) iteration in each thread that meets the termination condition. After all threads complete their iterations, or meet the termination condition, the lowest iteration that met the termination condition across all threads is found. Threads that have executed iterations beyond that one will need to roll back their state, as discussed in Section 3.6.2. If iterations are executed in order, then when the terminating iteration is encountered by a thread, the iteration value can be sent to all threads, and they then only need to complete iterations up to that value.

A variant of this form is an iteration sequence defined by a recurrence of the form x(i) = a · x(i − k) + c, where a, k, and c are all loop invariant, as shown in Figure 3.9(c). The loop is split into two parts: the first solves the recurrence and the second does the actual work in the loop. The recurrence can be further optimized by being solved in parallel using a parallel prefix algorithm that finds the necessary partial results of the recurrence in time proportional to the logarithm of the number of iterations in the loop.
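
A minimal sketch of the two-part form for a hypothetical recurrence with k = 1 (the constants are illustrative): the first, sequential, loop solves the recurrence and determines the trip count, and the second loop, now free of cross-iteration dependences, does the work in parallel. The sequential first loop is the part a parallel prefix computation could replace.

```c
#include <stdio.h>

#define MAX_ITERS 100000

int main(void) {
    static double x[MAX_ITERS], result[MAX_ITERS];
    const double a = 1.01, c = 0.5, limit = 1.0e6;

    /* Part 1: solve x(i) = a * x(i-1) + c sequentially and count iterations. */
    long n = 0;
    double xi = 1.0;
    while (xi < limit && n < MAX_ITERS) {
        x[n++] = xi;
        xi = a * xi + c;
    }

    /* Part 2: the original loop body, now a counted loop with no
       cross-iteration dependences, so it can run in parallel. */
    #pragma omp parallel for
    for (long i = 0; i < n; i++)
        result[i] = x[i] * x[i];

    printf("n = %ld, result[n-1] = %f\n", n, result[n - 1]);
    return 0;
}
```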

Figure 3.9: Examples of some issues in parallelizing while loops.

More complicated iteration sequences often require a sequential computation. An example of this is a while loop that iterates over a linked list, with the work function operating on each element of the list. In [193], three ways of dealing with this problem are outlined, as shown in Figure 3.10. Figure 3.10(a) shows the original loop, and the other parts of the figure show three strategies for parallelizing the loop.

Figure 3.10(b) shows a parallel version of this loop. A global (gP) and local (lP) pointer into the linked list are maintained. When a process needs another work item, it atomically assigns what is pointed to by the global pointer to the local pointer, moves the global pointer to the next element in the linked list, and then works on the node pointed to by the local pointer. The overhead of synchronization, and the coarse grained synchronization (i.e., locking the entire list) limit the parallelism of this approach. We note that more intelligent data structures and a compiler that is aware of their semantics [136, 137, 236] might at least partially ameliorate some of these drawbacks.

If there are |T| threads, the technique demonstrated in Figure 3.10(c) has thread ti work on nodes i, i + |T|, i + 2|T|, . . . in the linked list. Each thread visits all of the nodes, but only performs work on those that it is supposed to work on. Thus, the loop at S1a executing in thread ti skips over the first i − 1 nodes to find the first node to work on. Afterwards, the loop at S1e skips over the |T| nodes worked on by the other threads to find the next node to work on. This technique eliminates the synchronization overhead of the first method, but each thread must look at all N nodes, and so the amount of work done by each thread has an O(N) term, which will necessarily limit the achievable speedup.
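
A minimal sketch of this scheme in C with OpenMP (the list contents and the work function are hypothetical): each thread first skips to its starting node and thereafter advances |T| nodes between work items, so no synchronization on the list itself is needed.

```c
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

struct node { double value; struct node *next; };

static void work(struct node *p) { p->value = p->value * 2.0 + 1.0; }

int main(void) {
    /* Build a small list (leaked at exit; this is only a sketch). */
    struct node *head = NULL;
    for (int i = 0; i < 1000; i++) {
        struct node *p = malloc(sizeof *p);
        p->value = i;
        p->next = head;
        head = p;
    }

    #pragma omp parallel
    {
        int t = omp_get_thread_num();
        int T = omp_get_num_threads();
        struct node *p = head;

        /* Corresponds to S1a: skip to this thread's first node. */
        for (int skip = 0; p != NULL && skip < t; skip++)
            p = p->next;

        while (p != NULL) {
            work(p);                       /* this thread's node */
            /* Corresponds to S1e: advance past the nodes handled by the
               other threads to reach this thread's next node. */
            for (int skip = 0; p != NULL && skip < T; skip++)
                p = p->next;
        }
    }

    printf("%f\n", head->value);
    return 0;
}
```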

The third method (Figure 3.10(d)) has the same overheads as the second, but can achieve a better load balance by virtue of assigning nodes to threads as they need work, rather than having an implicit pre-assignment of nodes as seen in the second method of Figure 3.10(c). As each thread gets an iteration i to execute, it examines the previous iteration (stored in prevIter) it executed. The difference between the current and previous iteration is the number of nodes processed by other threads, and these nodes are skipped over in the loop beginning with statement S1c.

3.6.2 DEALING WITH THE EFFECTS OF SPECULATION

In the above examples it was known that a linked list was being accessed, and therefore it is also known that a traversal of the list will visit unique nodes. Depending on other data pointed to by these nodes (and accessed in the work routine) dependences may exist at run time among concurrently executing invocations of the work routine. Moreover, it is sometimes more efficient to speculatively execute iterations of a while loop under the assumption that the loop will not terminate before those iterations are executed. This assumption will not always hold, and in those cases it will be necessary to ensure that the effects of those excess iterations are not reflected in the permanent program state.

Two approaches can be used to solve this problem. First, techniques like transactional memory [101, 158] can be used to buffer the state until it is safe to commit. Unfortunately, software transactional memory carries significant performance penalties, and hardware transactional memory is not yet available on widely deployed processors. A second approach is to checkpoint the state accessed within a loop, and if the loop fails in some way, to restore the checkpoint and execute the loop sequentially. By strip mining the loop (see Section 5.1), its execution can be broken into blocks of contiguous iterations. A checkpoint is taken for each block, and when a block successfully finishes its values are committed, its checkpoint is discarded, and a new checkpoint is taken for the next block. This both reduces the memory footprint required and prevents a failure late in the execution of the loop from discarding all work that has been done so far in the execution of the loop.

Figure 3.10: An example of parallelizing a while loop traversing a pointer-based data structure.

Additional readings on more advanced speculative techniques can be found in Section 8.9.

3.7 SOFTWARE PIPELINING FOR INSTRUCTION LEVEL PARALLELISM

Even after applying the parallelization techniques above, additional instruction-level parallelism can be exposed in loops using software pipelining, which can result in significant additional speedups. Software pipelining rearranges instructions across loop iterations to expose the parallelism to the instruction scheduler within a processor, or to exploit VLIW [69, 81] processors. Consider the loop of Figure 3.11(a) and (b). Assume it takes one cycle to address a[i], one to load an element of a, one to perform a multiply operation, one to store the result, one to check and increment the index, and one to perform the branch. Under these assumptions, one iteration of the loop will take six cycles, and computing and storing n values will take 6n cycles. In the code in Figure 3.11(b), the timing assumptions are the same as before, but the hardware will execute multiple statements on a single line in one cycle. The transformed loop uses instruction-level parallelism to execute an iteration of the loop in five cycles, but each iteration produces two results, for an average of 2.5 cycles per result, or 2.5n cycles total, ignoring the startup code.
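
A minimal source-level sketch of the idea for a hypothetical loop b[i] = a[i] * s (the real transformation is performed on machine instructions, and the pipeline depth here is illustrative): in the steady-state loop the store for iteration i, the multiply for iteration i + 1, and the load for iteration i + 2 are independent and can be issued together by the hardware.

```c
#include <stdio.h>

#define N 1024

int main(void) {
    static double a[N], b[N];
    const double s = 3.0;
    for (int i = 0; i < N; i++) a[i] = i;

    /* Original loop: for (int i = 0; i < N; i++) b[i] = a[i] * s;
       Each iteration's load, multiply, and store form a serial chain. */

    double mul0, load1;

    /* Prologue: fill the pipeline. */
    mul0  = a[0] * s;        /* multiply for iteration 0 */
    load1 = a[1];            /* load for iteration 1     */

    int i;
    for (i = 0; i < N - 2; i++) {
        b[i]  = mul0;        /* store for iteration i         */
        mul0  = load1 * s;   /* multiply for iteration i + 1  */
        load1 = a[i + 2];    /* load for iteration i + 2      */
    }

    /* Epilogue: drain the pipeline. */
    b[N - 2] = mul0;
    b[N - 1] = load1 * s;

    printf("%f\n", b[N - 1]);
    return 0;
}
```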

Figure 3.11: An example of software pipelining.

Software pipelining, like all transformations that alter the order of memory fetches, must ensure that the new access order enforces all dependences. In statement S11 of Figure 3.11(b), the store of a value of a from some iteration i occurs in parallel with the fetch and add of values from iteration i + 2. In statement S12, a store of a value of a from iteration i + 1 occurs in parallel with the fetch and add of a that would normally be done in iteration i + 3. Similarly, statements S13 and S14 perform fetches that would normally be done in iterations i + 4 and i + 5. Depending on the actual subscripts of a, these fetches and stores could violate dependences, which would prevent software pipelining.
