CHAPTER 8

A guide to further reading

Because of the tutorial focus of this book, I have tried, where possible, to give relatively high-level sources to supplement the material.

8.1 COMPILER FUNDAMENTALS

There are a variety of compiler textbooks available, and most of these cover parsing, code generation, and optimizations and analyses for sequential machines. The so-called Dragon Book (named for its cover) is a classic in the field, but fairly technical. Two other, more accessible books are Fischer and Cytron’s Crafting a Compiler [77] and Torczon and Cooper’s Engineering a Compiler [57]. For advanced coverage, Muchnick’s Advanced Compiler Design & Implementation [161] surveys a range of advanced topics. Because of the breadth of the field, all compiler books are somewhat flawed, and the reader with finite time and resources is advised to examine two or three from a library before settling on one.

The development of the Static Single Assignment (SSA) form in the 1980s had a significant impact on the design of intermediate representations for scalar compilers. A good early work on the program dependence graph, a precursor to the SSA form, is the paper by Ferrante, Ottenstein, and Warren [74]. The SSA form itself is discussed in detail in [62], and a detailed development of an algorithm that uses the SSA form to perform constant propagation is given in [240].
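As a minimal illustration (mine, not drawn from the cited papers), the C fragment below shows in its comments how SSA renaming gives each assignment its own name and merges values with a phi function at a join point; it is this single-assignment property that makes transformations such as constant propagation over the SSA form straightforward.

    #include <stdio.h>

    /* A minimal sketch of SSA renaming.  The comments show the SSA names
     * a compiler would introduce; the phi function merges the values
     * reaching the join point after the if statement. */
    int f(int p) {
        int x = 1;            /* SSA: x1 = 1                  */
        if (p)
            x = x + 2;        /* SSA: x2 = x1 + 2             */
        /* join point:           SSA: x3 = phi(x1, x2)        */
        return x * 10;        /* SSA: return x3 * 10          */
    }

    int main(void) {
        printf("%d %d\n", f(0), f(1));   /* prints 10 30 */
        return 0;
    }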

Padua’s 2000 article on The Fortran I Compiler [167] serves as a fascinating look into the first compiler, and gives a good perspective on how the initial technology has been replaced by more advanced and formalized methods.

An alternative approach to developing parallel programs is to write a sequential program on top of a parallel library, with the library providing the parallelism. This has been used successfully for decades in the more limited domain of databases. Two projects along these lines are Rauchwerger’s STAPL [192, 217, 226] and Intel’s Threading Building Blocks [229].

8.2 DEPENDENCE ANALYSIS, DEPENDENCE GRAPHS AND ALIAS ANALYSIS

Good overviews of dependence analysis techniques can be found in the tutorials by Padua and Wolfe [169], Wolfe [249, 251], and Banerjee et al. [23]. Modern dependence analysis grew out of Utpal Banerjee’s PhD thesis at Illinois in the late 1970s. Two books give an in-depth explanation of classical Banerjee-Wolfe dependence analysis [22, 24], as well as of the necessary number theory and of unimodular operations on matrices. The textbooks by Kennedy and Allen [116] and Wolfe [250] also provide a good discussion of this topic. A comparative overview of dependence testing techniques can be found in [104].

In the years following the development of classical Banerjee-Wolfe dependence analysis, a number of different approaches have been taken to make the results less conservative. Early variants were the Power Test [252], which used Fourier-Motzkin elimination to decide whether references are dependent, and the I-Test [125]. Another variant is the Range Test [31], which determines whether the ranges of values taken on by the subscripts of potentially dependent array accesses overlap; if they do not, the subscripts cannot take on the same value and no dependence exists. Even more precise tests are the Omega test [181, 183, 184] and the polyhedral and polytope models [26, 33, 72, 73, 177, 178]. Both of these have significantly higher theoretical overheads than Banerjee-Wolfe analysis, but are practical for use in a compiler.
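As a concrete, if simplified, illustration of what such tests decide, the sketch below implements the classical GCD test for a pair of one-dimensional affine subscripts a*i + b (written) and c*i + d (read): if gcd(a, c) does not divide d - b, the two references can never touch the same element. The function names are mine, not from any of the cited papers, and the real Banerjee, Omega, and polyhedral tests also take loop bounds and direction information into account.

    #include <stdio.h>
    #include <stdlib.h>

    /* Greatest common divisor of two non-negative integers. */
    static int gcd(int a, int b) {
        while (b != 0) { int t = a % b; a = b; b = t; }
        return a;
    }

    /* GCD test for the subscript pair A[a*i + b] (written) and
     * A[c*j + d] (read).  Returns 1 if a dependence is *possible*,
     * 0 if it is proven impossible.  Loop bounds are ignored, so the
     * test is conservative. */
    static int gcd_test(int a, int b, int c, int d) {
        int g = gcd(abs(a), abs(c));
        if (g == 0)                    /* both subscripts are constants */
            return b == d;
        return (d - b) % g == 0;
    }

    int main(void) {
        /* A[2*i] = ...; ... = A[2*i + 1];  -> proven independent */
        printf("%d\n", gcd_test(2, 0, 2, 1));   /* prints 0 */
        /* A[2*i] = ...; ... = A[4*i + 2];  -> dependence possible */
        printf("%d\n", gcd_test(2, 0, 4, 2));   /* prints 1 */
        return 0;
    }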

Readers with an interest in analyzing explicitly parallel programs should first read Adve and Gharachorloo’s tutorial on shared memory consistency models [2]. Many of the issues in compiling for different memory models can be understood using the techniques of [209]. Problems in the original Java memory model are discussed in [185], and the current Java memory model, released with Java 5, is discussed by Manson, Pugh, and Adve [147]. Finally, Boehm and Adve [32] did fundamental work in developing a memory model for C++. The two main compiler efforts targeting the compilation of explicitly parallel programs while preserving sequential consistency are Yelick’s Titanium [127, 128] and UPC [224] compilers, and the Pensieve project [141, 225].
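To see why memory models matter to the compiler, consider the classic two-thread example sketched below (my own illustration, not taken from the cited works): under sequential consistency at most one of r1 and r2 can be 0, but if the compiler or hardware is free to reorder the independent store and load in each thread, as the relaxed atomics here permit, both can be 0. A compiler that promises sequential consistency for such programs must therefore restrict its reorderings or insert fences.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    /* Dekker-style example.  With memory_order_relaxed the compiler and
     * hardware may reorder each thread's store and load, so r1 == r2 == 0
     * is a legal outcome; with memory_order_seq_cst it is not. */
    atomic_int x = 0, y = 0;
    int r1, r2;

    void *t1(void *arg) {
        (void)arg;
        atomic_store_explicit(&x, 1, memory_order_relaxed);
        r1 = atomic_load_explicit(&y, memory_order_relaxed);
        return NULL;
    }

    void *t2(void *arg) {
        (void)arg;
        atomic_store_explicit(&y, 1, memory_order_relaxed);
        r2 = atomic_load_explicit(&x, memory_order_relaxed);
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, t1, NULL);
        pthread_create(&b, NULL, t2, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("r1=%d r2=%d\n", r1, r2);  /* r1=0 r2=0 is allowed here */
        return 0;
    }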

8.3 PROGRAM PARALLELIZATION

For a good overview of parallelization, the sources mentioned earlier [23, 169, 244, 251] should be consulted. Other good overviews include [132]. Vectorization is discussed in the above works as well as in [9, 10, 40, 247]. Much of this work was inspired by the early CDC and Cray machines. More recently, the introduction of short vector instructions in Intel and PowerPC processors has led to renewed efforts in this area, including [70, 196, 214], as has compiling for graphics processors used as accelerators [38, 142, 149, 205, 227] and compilers for the CUDA [60] language.

Related to dependence analysis and automatic parallelization, Rauchwerger et al. have done work on hybrid parallelization, in which some information is developed at compile time but the final decision on whether a dependence exists, and hence whether the loop can be parallelized, is made at run time [202, 203].

Good overviews of producer/consumer synchronization can be found in [157, 246]. Much of the early research on producer/consumer synchronization [144, 156, 223] came from the Cedar Project [131], which implemented flexible synchronization mechanisms. The Rice compiler effort, under Ken Kennedy, also studied the problem of compiler generation of synchronization [42]. In [49], the effects of synchronization and the granularity of parallelism enabled by synchronization are discussed. With multi-core processors there has been renewed interest in this topic, with some recent work described in [115, 206, 257].
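A minimal sketch (mine, not from the cited works) of the kind of post/wait producer/consumer synchronization such compilers generate: the loop carries a dependence a[i] <- a[i-1], so iteration i must wait until the iteration that produced a[i-1] has posted a flag, but different iterations can still proceed on different threads.

    #include <stdatomic.h>
    #include <stdio.h>

    #define N 16

    int main(void) {
        double a[N];
        atomic_int done[N];        /* done[i] == 1 once a[i] is written */

        a[0] = 1.0;
        for (int i = 0; i < N; i++)
            atomic_store(&done[i], i == 0 ? 1 : 0);

        /* Compiler-generated post/wait: wait(i-1) before the use,
         * post(i) after the definition. */
        #pragma omp parallel for schedule(static, 1)
        for (int i = 1; i < N; i++) {
            while (atomic_load(&done[i - 1]) == 0)   /* wait(i-1) */
                ;                                    /* spin      */
            a[i] = a[i - 1] + 1.0;
            atomic_store(&done[i], 1);               /* post(i)   */
        }

        printf("%g\n", a[N - 1]);    /* prints 15 */
        return 0;
    }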

The parallelization of recursive and divide-and-conquer algorithms and of while loops is a related, and intrinsically more difficult, problem than the parallelization of counted loops; in the domain of scientific computing such programs have also been less common than loop-based computations. Some readings in this area are [56, 91, 108, 109, 193, 201].

A good high-level discussion of software pipelining can be found in [20], and [5, 113] compare some of the major techniques. Much of the early work derived from compiling for VLIW machines [4, 68, 80, 81, 138, 171, 191, 199, 232].
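The essence of software pipelining can be seen in the source-level sketch below (my own illustration, not any of the cited algorithms): operations from successive iterations are overlapped, yielding a prologue that fills the pipeline, a kernel in which work from different iterations executes together, and an epilogue that drains it.

    #include <stdio.h>

    #define N 8

    int main(void) {
        double a[N], b[N];
        for (int i = 0; i < N; i++) a[i] = i;

        /* Original loop:  for (i = 0; i < N; i++) b[i] = a[i] * 2.0;
         * Software-pipelined version, overlapping the load for
         * iteration i+1 with the multiply/store for iteration i. */
        double t = a[0];                 /* prologue: first load         */
        for (int i = 0; i < N - 1; i++) {
            double next = a[i + 1];      /* kernel: load for iter i+1    */
            b[i] = t * 2.0;              /*         multiply/store for i */
            t = next;
        }
        b[N - 1] = t * 2.0;              /* epilogue: drain the pipeline */

        printf("%g %g\n", b[0], b[N - 1]);   /* prints 0 14 */
        return 0;
    }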

8.4 TRANSFORMATIONS TO MODIFY AND ELIMINATE DEPENDENCES

A good overview of loop skewing can be found in [244, 250]. Loop skewing is discussed in the context of unimodular transformations in [21, 140, 243]. An example of more recent work on skewing is [112], which also discusses the skewing of multiple loops.
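A small sketch of skewing (my own example): in the original stencil-like nest neither loop can be run in parallel as written, but skewing the inner index by the outer one and then interchanging the loops produces a wavefront in which all iterations of the new inner loop are independent.

    #include <stdio.h>

    #define N 6
    #define M 6

    static int max_i(int a, int b) { return a > b ? a : b; }
    static int min_i(int a, int b) { return a < b ? a : b; }

    int main(void) {
        double A[N][M];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < M; j++)
                A[i][j] = (i == 0 || j == 0) ? 1.0 : 0.0;

        /* Original nest (neither loop parallel as written):
         *   for (i = 1; i < N; i++)
         *     for (j = 1; j < M; j++)
         *       A[i][j] = A[i-1][j] + A[i][j-1];
         * After skewing j by i and interchanging, the iterations on
         * each anti-diagonal (wavefront) w = i + j are independent. */
        for (int w = 2; w <= (N - 1) + (M - 1); w++) {
            #pragma omp parallel for
            for (int i = max_i(1, w - (M - 1)); i <= min_i(N - 1, w - 1); i++) {
                int j = w - i;
                A[i][j] = A[i - 1][j] + A[i][j - 1];
            }
        }

        printf("%g\n", A[N - 1][M - 1]);   /* prints 252 for the 6x6 case */
        return 0;
    }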

Induction variable substitution is discussed in [3, 96, 176, 248, 253]. A discussion of induction variable analysis in recursive programs using abstract interpretation can be found in [15].
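A small before/after sketch of the transformation (mine, not from the cited papers): the auxiliary induction variable k is replaced by a closed-form expression in the loop index, removing the cross-iteration dependence on k and making the loop parallelizable.

    #include <stdio.h>

    #define N 10

    int main(void) {
        double a[2 * N + 5] = {0};

        /* Before: k is an auxiliary induction variable; the update
         * k += 2 carries a dependence from one iteration to the next.
         *
         *     int k = 5;
         *     for (int i = 0; i < N; i++) {
         *         a[k] = i;
         *         k += 2;
         *     }
         *
         * After substitution, k is rewritten as 5 + 2*i and the
         * iterations become independent. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[5 + 2 * i] = i;

        printf("%g\n", a[5 + 2 * (N - 1)]);   /* prints 9 */
        return 0;
    }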

One of the earliest descriptions of the use of forward substitution can be found in [133]. An overview of the algorithm can be found in [169], and its implementation within a symbolic analysis framework is described in [94]. Descriptions of the use of forward substitution in parallelizing compilers are given in [30, 95, 210].
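A minimal illustration of forward substitution (my example): substituting the right-hand side of the assignment to m into the subscript lets a dependence test see that the written and read regions never overlap.

    #include <stdio.h>

    #define N 8

    int main(void) {
        double a[2 * N + 1];
        for (int i = 0; i < 2 * N + 1; i++) a[i] = i;

        /* Before forward substitution the analyzer sees only a[m + i]
         * and cannot compare it with a[i]:
         *
         *     int m = n + 1;
         *     for (int i = 0; i < n; i++)
         *         a[m + i] = a[i];
         *
         * After substituting m = n + 1, the write is a[n + 1 + i], which
         * a dependence test can prove never overlaps a[i] for 0 <= i < n. */
        int n = N;
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            a[n + 1 + i] = a[i];

        printf("%g\n", a[n + 1]);      /* prints 0 */
        return 0;
    }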

The overview works mentioned earlier, e.g., [169, 250], give good coverage of scalar expansion and privatization. The transformation is discussed in the context of distributed memory machines by Banerjee et al. in [170]. The seminal works on array privatization are by Padua and Tu [233, 234].
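A small sketch (mine) of the two transformations: the scalar temporary t is written and read in every iteration, creating anti- and output dependences; privatization gives each thread its own copy, while scalar expansion turns t into an array indexed by the loop variable.

    #include <stdio.h>

    #define N 8

    int main(void) {
        double a[N], b[N], c[N], t_exp[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

        /* Original loop:
         *     for (i = 0; i < N; i++) { t = a[i] + b[i]; c[i] = t * t; }
         *
         * Privatization: each thread gets its own copy of t. */
        double t;
        #pragma omp parallel for private(t)
        for (int i = 0; i < N; i++) {
            t = a[i] + b[i];
            c[i] = t * t;
        }

        /* Scalar expansion: t becomes an array indexed by the loop
         * variable, useful when t is live after the loop or the loop
         * is being vectorized. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            t_exp[i] = a[i] + b[i];
            c[i] = t_exp[i] * t_exp[i];
        }

        printf("%g\n", c[N - 1]);   /* prints 441 */
        return 0;
    }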

8.5 REDUCTION RECOGNITION

Kuck [130] provided one of the earliest discussions of the principles of reduction and recurrence recognition and optimization. Other early works on this topic include [78, 114, 175, 176]. In [79], the closely related topic of generating parallel prefix programs from recurrences and reductions on data is discussed.
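A minimal sketch (mine) of what reduction recognition buys: the update of sum is a loop-carried dependence, but because addition is associative the compiler (or, as shown here, the OpenMP reduction clause) can have each thread accumulate a private partial sum and combine the partial sums at the end.

    #include <stdio.h>

    #define N 1000

    int main(void) {
        double a[N], sum = 0.0;
        for (int i = 0; i < N; i++) a[i] = 1.0;

        /* sum += a[i] is a recognized reduction: private partial sums
         * are computed in parallel and combined after the loop. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += a[i];

        printf("%g\n", sum);   /* prints 1000 */
        return 0;
    }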

8.6 TRANSFORMATION OF ITERATIVE AND RECURSIVE CONSTRUCTS

Loop fusion and unrolling are closely related to the unroll-and-jam transformation [41, 44, 45], which operates on nested loops. Unroll-and-jam has been applied to many problems, often related to locality, as discussed in [150].
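A small sketch of unroll-and-jam (my example, assuming the outer trip count is even): the outer i loop is unrolled by two and the resulting copies of the inner loop are fused ("jammed"), so each b[j] loaded in the inner loop is reused across two rows.

    #include <stdio.h>

    #define N 4
    #define M 4

    int main(void) {
        double a[N][M], b[M], c[N] = {0};
        for (int i = 0; i < N; i++)
            for (int j = 0; j < M; j++)
                a[i][j] = i + j;
        for (int j = 0; j < M; j++) b[j] = 1.0;

        /* Original nest:
         *   for (i = 0; i < N; i++)
         *     for (j = 0; j < M; j++)
         *       c[i] += a[i][j] * b[j];
         *
         * Unroll-and-jam: unroll i by 2 and fuse the two copies of the
         * j loop, so each b[j] is loaded once for two rows. */
        for (int i = 0; i < N; i += 2)
            for (int j = 0; j < M; j++) {
                c[i]     += a[i][j]     * b[j];
                c[i + 1] += a[i + 1][j] * b[j];
            }

        printf("%g %g\n", c[0], c[N - 1]);   /* prints 6 18 */
        return 0;
    }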

8.7 TILING

An example of an early dependence-based transformation to improve locality can be found in [83]. True tiling is developed in [242, 245]. More advanced algorithms were developed by McKinley [150] and Pingali [121]. Related in function, but not in strategy, are various recursive layouts; Gustavson et al. [71, 93] and Chatterjee et al. [48] were pioneers in this area. Because of the difficulty of analytically determining the precise tile sizes that best manage a complex memory hierarchy involving multilevel caches and translation look-aside buffers, automatic tuning tools have been developed that perform an intelligent search over a space of solutions, using run-time profiling, to determine an optimal blocking for a given architecture; ATLAS [18, 254] (for linear algebra) and FFTW [76, 82] (for fast Fourier transforms) are two examples.
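A small sketch of tiling (mine), blocking a matrix-multiplication loop nest with a fixed tile size TILE; the automatic tuning tools mentioned above essentially search over choices of the tile size (and loop order) for a given machine.

    #include <stdio.h>
    #include <string.h>

    #define N 64
    #define TILE 16     /* tile size; ATLAS-style tuners search over this */

    static double A[N][N], B[N][N], C[N][N];

    int main(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 1.0; }
        memset(C, 0, sizeof C);

        /* Tiled (blocked) matrix multiply: the three outer loops walk
         * over TILE x TILE tiles so the data touched by the inner loops
         * stays in cache (assumes TILE divides N). */
        for (int ii = 0; ii < N; ii += TILE)
            for (int jj = 0; jj < N; jj += TILE)
                for (int kk = 0; kk < N; kk += TILE)
                    for (int i = ii; i < ii + TILE; i++)
                        for (int k = kk; k < kk + TILE; k++)
                            for (int j = jj; j < jj + TILE; j++)
                                C[i][j] += A[i][k] * B[k][j];

        printf("%g\n", C[0][0]);   /* prints 64 */
        return 0;
    }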

8.8 COMPILING FOR DISTRIBUTED MEMORY MACHINES

Much of the early work in data distribution, communication generation, and so forth occurred in the context of HPF [117, 123, 124]. Early versions of HPF targeted a variety of message passing systems, but currently the MPI [160, 213] library is used almost exclusively.1 HPF grew out of three earlier projects: the Fortran 90D [35] project at Syracuse, Vienna Fortran [46] at the University of Vienna, and Fortran D [103] at Rice. As mentioned earlier, the block-cyclic distribution was a difficult problem for early HPF and distributed memory compilers, and Chatterjee’s survey paper [239] gives pointers to many of the proposed solutions.
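As a reminder of what a compiler must reason about here, the sketch below (my own notation, not HPF syntax) computes the owning processor and local index of each element under a one-dimensional block-cyclic distribution; generating loop bounds and communication sets requires inverting such mappings symbolically, which is what made block-cyclic distributions hard for early compilers.

    #include <stdio.h>

    /* 1-D block-cyclic distribution over P processors with block size
     * BLK: blocks are dealt out round-robin, so element i lives in
     * block i/BLK on processor (i/BLK) mod P at local offset
     * ((i/BLK)/P)*BLK + i mod BLK. */
    #define P 3
    #define BLK 2

    static int owner(int i)       { return (i / BLK) % P; }
    static int local_index(int i) { return ((i / BLK) / P) * BLK + i % BLK; }

    int main(void) {
        for (int i = 0; i < 12; i++)
            printf("element %2d -> processor %d, local index %d\n",
                   i, owner(i), local_index(i));
        return 0;
    }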

The earliest, and foundational, work on data distribution is Koelbel’s [122] discussion of local functions. Much of the later research focused on relieving the programmer of the task of specifying the data distribution; see the work by Gupta and Banerjee [88] (in the context of the Paradigm compiler) and by Kremer and Kennedy [118]. More advanced generalizations beyond what we have discussed can be found in [86, 235].

Key works in generating communication are again Koelbel’s [122], Gupta and Banerjee’s [88], and Fortran D [103], as well as the Fortran 90D [35] and Vienna Fortran [46] projects.

In [89], an overview is provided of the IBM research prototype that led to the xlHPF compiler. This work gives a good overview of the interaction between data distribution, communication generation, computation partitioning, and the optimization of communication at the interface between the generated code and the runtime library. Two other important industrial compilers are the DEC compiler [98] and the PGI pghpf compiler [36].

Two of the more successful languages and associated compilers targeting distributed memory computers are Co-Array Fortran (CAF) [54, 163] and Unified Parallel C [43, 237]. A comparison of these two global address space languages is given in [55]. Co-Array Fortran is included in the Fortran 2008 standard approved in 2010 [195].

There are several implementations of both of these languages. The Rice Co-Array Fortran compiler is described in [66], and differs from the standard version by introducing additional features, such as processor sub-setting.

8.9 CURRENT AND FUTURE DIRECTIONS IN PARALLELIZING COMPILERS

There have been two main directions in parallelizing compiler technology in the last 10 or 15 years. These directions have been motivated by a desire to overcome the limitations of purely static compiler analysis. These limitations arise when attempting to parallelize non-loop code; loops with non-affine accesses, including pointer-based accesses; or loop-based and non-loop code with dependences that occur in a few executions of the code.

The first direction, which has already been discussed in this work, is the use of languages that provide parallelization hints in one form or another to the compiler. These hints provide information to the compiler above and beyond what it can ascertain via static analysis. The most widely used of these languages is probably OpenMP. The main limitation of this approach is that such languages tend to provide better support for loop- and array-based programs and regular algorithms.

The second direction is the use of speculation, with the earliest compiler-oriented work being that of [56, 63, 64, 193, 194, 218]. Speculative techniques optimistically parallelize program constructs; dependences may therefore exist across code executing in parallel, and some kind of check must be done at runtime to (i) ensure that the code executing in parallel really is independent and that no dependences exist, and (ii) roll back the execution of one or more of the code instances when a dependence is found and re-execute them correctly. Speculation is discussed briefly in Section 3.6.2. The role of the compiler in current and recent work on speculation is to determine what code should be speculatively parallelized, to transform the code to make the speculative execution more efficient, and to insert the necessary library calls, etc., to enable the speculative execution. Another form of speculation is based on value prediction, where the value that will be communicated along a dependence is predicted (and speculated on), allowing the code at the sink of the dependence to execute before the value is actually computed. If the value is mis-predicted, the dependent code must be re-run.

Speculative approaches need to perform several tasks. As identified in [52], these include:

1. Identifying data references that must be speculated on. These are dependences that are either unlikely to actually exist at runtime, or where the source and sink of the dependence are unlikely to execute at the same time.

2. Maintaining information about speculated data locations that have been accessed (e.g., [84]); a small sketch of this bookkeeping is given after the list.

3. Scheduling the threads on which speculation occurs. Like all scheduling strategies, it is necessary to worry about load balance. A major contribution of [52] is the use of a sliding window to enable good load balance while limiting the amount of “roll back” necessary to achieve a correct execution.

4. Committing data when it is safe to do so. If a chunk of code has executed with no conflicts on speculated data, the results of that chunk can be committed to memory.

5. Squashing and restarting threads when conflicts on speculated data are detected. Because we are executing speculatively, it will sometimes be the case that a dependence does exist and is violated by the speculative execution. This must be detected, the violating threads terminated, and execution restarted from a correct point.
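The sketch below (my own, loosely in the spirit of software approaches such as Rauchwerger and Padua's LRPD test rather than any one of the cited systems) illustrates items 2, 4, and 5: the iterations of a possibly dependent loop run speculatively, writing a private copy while read/write shadow arrays record which elements were touched; if no element was written twice or both read and written, the private results are committed, otherwise the speculative run is squashed and the loop re-executed sequentially. A real system would also buffer the speculative writes per thread; this sketch relies on the example's subscripts being a one-to-one mapping.

    #include <stdio.h>
    #include <stdatomic.h>
    #include <string.h>

    #define N 16
    #define M (N / 2)        /* number of loop iterations */

    int main(void) {
        double a[N], a_spec[N];
        int idx[M];
        atomic_int rd[N], wr[N];   /* shadow arrays: elements touched */

        for (int i = 0; i < N; i++) {
            a[i] = i;
            atomic_store(&rd[i], 0);
            atomic_store(&wr[i], 0);
        }
        for (int i = 0; i < M; i++)
            idx[i] = M + i;        /* unknown to the compiler */

        memcpy(a_spec, a, sizeof a);

        /* Speculative parallel execution of:
         *     for (i = 0; i < M; i++) a[idx[i]] = a[i] + 1.0;
         * Each iteration reads the original values and writes the
         * private copy while the shadow arrays record its accesses. */
        #pragma omp parallel for
        for (int i = 0; i < M; i++) {
            atomic_fetch_add(&rd[i], 1);
            atomic_fetch_add(&wr[idx[i]], 1);
            a_spec[idx[i]] = a[i] + 1.0;
        }

        /* Conflict check: an element written twice, or both read and
         * written, may have carried a dependence between iterations. */
        int conflict = 0;
        for (int e = 0; e < N; e++)
            if (atomic_load(&wr[e]) > 1 ||
                (atomic_load(&wr[e]) > 0 && atomic_load(&rd[e]) > 0))
                conflict = 1;

        if (!conflict) {
            memcpy(a, a_spec, sizeof a);     /* commit                  */
        } else {
            for (int i = 0; i < M; i++)      /* squash and re-execute   */
                a[idx[i]] = a[i] + 1.0;      /* sequentially            */
        }

        printf("conflict=%d a[%d]=%g\n", conflict, M, a[M]);
        /* prints: conflict=0 a[8]=1 */
        return 0;
    }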

 

These basic actions can be done either in software or, more efficiently, using special hardware [53, 99]. We note that the first hardware support for transactional memory will likely be in IBM’s Blue Gene/Q processor [97]. In [53], a description of a processor that enables thread-level speculation, and of supporting software, is provided. Unfortunately, software transactional memory systems (STMs) have relatively high costs for the above functionality, and often experience degraded performance, relative to a sequential version of the program, on a small number of processors. One of the better STMs is STMlite [151], which is designed to use profiling information to facilitate automatic parallelization.

A study of the maximum performance that can be gained by speculative execution, given zero-overhead synchronization, etc., and of the effects of loop-level and function-level parallelism is provided in [165]. The authors conclude that function-level speculation is best, although much later work mentioned in this section suggests that both coarse-grained and fine-grained parallelism (loop iterations and other small sections of code) can be effectively exploited. In [28], some issues surrounding the scheduling and selection of code to be executed speculatively, with an emphasis on when to spawn speculative threads, are discussed. Gonzalez et al. [148] also discuss scheduling strategies that use profiling information to allow the compiler to make better decisions. In [67], a cost-model-based framework for selecting and scheduling speculative threads is discussed, and [231] discusses the use of profiling information and machine learning techniques to develop better mappings of speculative threads to different hardware. The POSH compiler [146] uses profiling information and program structures (e.g., loops and functions) to partition programs, and shows that much of the performance improvement comes from overlapping work and from prefetching.

Another strategy is to use program slices [241] (see Section 2.3.1) to execute a fast path through the program that computes essential values and allows the rapid fulfillment of dependences. This technique, in conjunction with profiling, is explored in [186]. It is used to pre-compute addresses (e.g., when following links in a pointer-based data structure) and branch outcomes in [258], and in [259] it is shown that the slice does not have to be correct: if its outcome is wrong, the underlying speculative mechanism allows the mis-calculation to be caught and a correct execution to be performed. This observation allows the slice to be highly optimized, which is important since the time it takes the slice to execute affects the rate at which new work can be spawned, and therefore the potential parallelism.

Speculative support for decoupled software pipelining (discussed in Section 3.2) is one major approach. Decoupled software pipelining executes the strongly connected components (SCCs) of the loop’s dependence graph on different threads; it depends on there being multiple SCCs to form the different pipeline stages, and the performance of the resulting program is gated by the largest SCC. By speculating on some dependences in the loop, more SCCs, better balanced in size, can be formed. Decoupled software pipelining and various optimizations using speculation are discussed in [37, 166, 188, 189, 190, 238].

A significant improvement in the performance of speculatively executed programs can be achieved by making use of high-level semantic information about the operations being executed speculatively. The Galois project [135, 136, 137] makes use of commutativity information as well as high-level iterators and novel locking techniques.

Traditional transformations, along with new techniques, can help with exposing regions of code that are amenable to speculative parallelization [256].

Some early work in the use of value prediction and the communication of values within the processor to enable thread-level speculation is described in [219, 220, 255]. In [65], value prediction as well as speculation on dependences is used to allow more parallelism in the program to be exploited.

1 MPI is also used in almost all manually written distributed memory programs.
