CHAPTER 1

Introduction and overview

Although interest in compiling for parallel architectures has exploded in the last decade with the widespread commercial availability of multicore processors [106, 110], the field has a long history, and predates the advent of the multicore processor by decades. Research has focused on several goals, the most ambitious being support for automatic parallelization. The goal of automatic parallelization is for a compiler to take a “dusty deck” (an unaltered and unannotated sequential program) and to compile it into a parallel program. Although successful automatic parallelization has been achieved for some programs, most attention in recent years has been given to compiler support for languages like OpenMP. OpenMP allows the programmer to tell the compiler about parallel regions of a program and to provide other hints, and then have the compiler generate efficient code for those regions. In both the parallelization of dusty-deck programs and the support for languages with parallel annotations, the focus has been on regular and dense array-based numerical programs. In this monograph we focus largely on the first goal, dusty-deck parallelization, as it allows the development and explanation of much of the knowledge, and many of the fundamental techniques, required to achieve the second goal. Techniques that target non-numerical and irregular numerical programs will be discussed briefly in Section 8.9.

The first real dependence-based auto-parallelizing compiler (hereafter simply referred to as a parallelizing compiler or simply a compiler) was the Parafrase [132] system developed at the University of Illinois. The code and techniques in this compiler led, in turn, to the Ptool compiler [8] at Rice University and the PTRAN project [6] at the IBM T.J. Watson Research Center. Many of the techniques described in this work were present, either full-fledged or in a more primitive form, in these early compilers. It is interesting that a large part of the early motivation for Parafrase was to allow the study of how to develop architectures to exploit the latent parallelism in off-the-shelf “dusty deck” programs [129], as well as how to automatically compile for the Cray-1 [204] machines that dominated supercomputing at the time.

Motivated in part by the early work in parallelizing compilers, a variety of early shared memory mini-supercomputers were developed, chief among them the Alliant FX series [228], Denelcor HEP [211], and Convex [47] machines. These machines were characterized by shared memory processors, interesting architectural innovations, and overall good performance on well-tuned benchmarks. With the rapid progress in microprocessor capabilities, whose performance followed Moore’s law, it became increasingly difficult for the custom processors in the mini-supercomputers to offer a significant performance advantage over cheaper microprocessor based machines. The triumph of the “killer micro” [107] was at hand, and with it the decline of both shared memory machines designed for high performance numerical computing and large-scale parallelization efforts targeting shared memory machines. Thus, a research area begun to understand the architectural implications of latent parallelism in programs was significantly, and negatively, impacted by advances in computer architecture, at least until those advances led to shared memory multicore processors.

Most of the remainder of this lecture focuses on regular numerical programs targeting a shared memory programming model, where the data of each thread are accessible to other threads, and communication is via loads and stores to memory. This is in contrast to a distributed address space programming model, where different processes communicate via inter-process messages, which is discussed in Chapter 6. Although the question of which of the models is “best” is open, the wide availability of shared memory multicore processors makes programming models that target them a research area with both intellectual depth and great practical importance.

1.1 PARALLELISM AND INDEPENDENCE

When a program executes in parallel, different parts of the program are spread across multiple processors or cores and allowed to execute simultaneously. Note that in a modern machine, different processors, and cores on a processor, may execute at different rates and with different capabilities. While this can be the result of differences in the hardware executing the program (e.g., different processors on a parallel machine may have different clock rates or functional units), even on identical hardware the speed at which a program region executes may vary widely across processors or cores. Data and code cache misses, memory bus contention, O/S processes being invoked, and different branches being taken in the program, among other reasons, can cause program performance to vary.

Because of this, if the part of a program executing on one thread produces (or consumes) data read by (or produced by) another thread, no guarantees can be made about the rate of execution of the data producer relative to the data consumer. This is shown graphically in Figure 1.1(b) and (c), which show the parallel loop of Figure 1.1(a) executing in parallel. If some processors are delayed (e.g., the one executing iteration i = 0), then the read of a[1] in iteration i = 2 may not get the correct value.

The task of a parallelizing compiler is four-fold. First, it must recognize when different regions of a program may produce, or consume, data produced by other regions. Second, it should try to transform the program to reduce the amount of interaction among different parts of the program that may prevent parallelization. Third, it should perform additional optimizations on the program so that the sequential threads of execution in the parallelized program execute efficiently. Finally, it should produce a parallel program. We now describe these different tasks with a concrete example.

1.2 PARALLEL EXECUTION

The parallel execution of a program can be viewed in at least two dimensions. The first dimension differentiates between parallelism that targets shared and distributed memory systems, and the second dimension differentiates between the structure of the parallelism that is being exploited, independent of the target machine. We will first discuss the differences between targeting shared and distributed memory computers, and then will discuss the three main structures or forms of parallelism.


Figure 1.1: An example of how independence of references is necessary for parallelism.

1.2.1 SHARED AND DISTRIBUTED MEMORY PARALLELISM

The most fundamental difference between programs executing on shared and distributed memory is how threads (or processes) communicate. In a shared memory environment, threads communicate with one another via reads and writes to shared memory. This is shown in Figure 1.2(a) and (b). In a distributed memory environment, processes communicate with one another via messages sent from one processor to another. This is shown abstractly in Figure 1.3(a) and (b). In both cases the simple loop of Figure 1.1(a) has been split into two loops, with each loop parallelized.

Shared memory execution

In the shared memory version of the program, each thread executes a subset of the iteration space of a parallel loop. A loop’s iteration space is simply the Cartesian space defined by the bounds of the loop, and instantiated at runtime as the loop executes. In Figure 1.4(a) and (b) a singly nested loop and its iteration space are shown, and parts (c) and (d) show the same for a doubly nested loop. Each point, or node, in the iteration space represents all of the statements occurring in the corresponding iteration of the loop. Graphical representations of the iteration spaces (such as what is shown in Figure 1.5) are called Iteration Space Diagrams and are often annotated with additional information, including constraints on how the iterations can execute (see Section 2.3).
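
As a small concrete illustration (a hypothetical loop, not one taken from the figure), the doubly nested loop below has the iteration space {(i, j) : 0 ≤ i < 3, 0 ≤ j < 2}; the program simply enumerates the points of that space in the order the sequential loop visits them.

    #include <stdio.h>

    int main(void) {
        /* The iteration space of this doubly nested loop is the Cartesian
           space {(i, j) : 0 <= i < 3, 0 <= j < 2}; each point (i, j) stands
           for all of the statements executed in that iteration. */
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 2; j++)
                printf("iteration (%d, %d)\n", i, j);
        return 0;
    }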

Parallel parts of the program are placed into a worker function that has, as parameters, all of the variables accessed in the loop as well as the bounds of the loop executed within a thread. This transformation is sometimes called outlining in the literature [50, 230]. The pseudo-code for this is shown in Figure 1.5(a). Each parallel construct (in this case a loop) is terminated, by default, with a barrier. The barrier ensures that all threads executing the parallel construct finish before any part of the program after the parallel construct is executed. This barrier ensures that the results of memory operations in the parallel construct are available for all threads that might execute code after the parallel construct.

As shown in Figure 1.5(b), for execution on a shared memory machine the sequential code preceding the parallel loop is executed by a single thread. When the parallel loop is encountered, a call is made to the function enqueueWork that (i) places an entry in a work queue that contains a pointer to the outlined function and the argument list for the outlined function, and (ii) wakes up worker threads from a pool to execute the different instances of the outlined function. Each thread grabs work from the work queue until it is empty, executes the barrier, and then puts itself to sleep. The initial, or master, thread then continues execution. The process continues with the call to enqueueWork() to execute the outlined function for the loop containing S2.
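
Since the figures are not reproduced here, the following C sketch suggests what the outlined worker for a simple parallel loop might look like; the loop body (a[i] = b[i] + c[i]), the argument struct, and the enqueueWork/barrier calls shown in the trailing comment are illustrative assumptions rather than the actual code of Figure 1.5.

    /* Hypothetical outlined worker for a parallel loop whose (assumed) body is
       a[i] = b[i] + c[i]. The runtime passes each worker the bounds of its
       chunk of the iteration space plus pointers to the shared data. */
    typedef struct {
        double *a, *b, *c;
        int lo, hi;                     /* this thread's portion of the iterations */
    } loop_args;

    static void outlined_loop(void *p) {
        loop_args *args = (loop_args *)p;
        for (int i = args->lo; i < args->hi; i++)
            args->a[i] = args->b[i] + args->c[i];
    }

    /* Conceptually, the master thread then executes something like:
         enqueueWork(outlined_loop, &chunk_args);   -- queue work, wake workers
         barrier();                                 -- wait until all chunks finish */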

Values used by S2 are made available by the stores executed in S1. There is generally no guarantee that the thread executing iteration i = v of some loop is the same as the thread executing iteration j = v of another loop. However, values written and read are available to all threads through the shared memory.

Distributed memory

In the distributed memory version of the program, the data being operated on are distributed across the processes executing the parallel region, as shown in Figure 1.3(a). Each process also contains a copy of the entire program, but the bounds of parallel loops have been shrunk to correspond to the iterations of the parallel loop executed by that process. Sequential parts of the program are executed, by default, on a single processor. Each process also has access to a logical processor id (i.e., pid in the figure) that can be tested to allow only certain processes to execute selected code.


Figure 1.2: An example of data sharing in a shared memory program.


Figure 1.3: An example of data and work distribution in a distributed memory program.

When a parallel region is entered, all processes execute the indicated iterations of the parallel loop. When the parallel loop is finished, each process is responsible for communicating (via a send, in this case) the data it has produced that is needed by other processes. Each process, in turn, is required to accept (via a receive, in this case) the data that it needs and that is sent by other processes. The send and receive operations not only maintain a consistent view of memory for the different processes, but also force the use of the sent data in S2 to wait until the data is available. Thus, the communication also serves the purpose of the barrier operation in the shared memory program by forcing an ordering between accesses. The state of memory after the sends and receives have executed is shown in Figure 1.3(b).
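
As an illustration of this style of execution, the sketch below uses MPI (one common message-passing library; the text's figures are not tied to any particular library) to exchange a boundary element between neighboring processes after a first parallel loop and before a second loop that reads it. The loop bodies are hypothetical; only the send/receive pattern matters.

    #include <mpi.h>

    /* A sketch of SPMD execution with explicit communication. Each process
       owns a[1..n_local]; a[0] is the element owned by the left neighbor.
       After the first (parallel) loop, each process sends the boundary value
       its right neighbor needs and receives the one it needs from its left
       neighbor before executing the second loop. */
    void compute(double *a, double *b, int n_local, int pid, int nprocs) {
        for (int i = 1; i <= n_local; i++)          /* first parallel loop (S1) */
            a[i] = 2.0 * b[i];

        if (pid + 1 < nprocs)                       /* send data a neighbor needs */
            MPI_Send(&a[n_local], 1, MPI_DOUBLE, pid + 1, 0, MPI_COMM_WORLD);
        if (pid > 0)                                /* wait for the data this process needs */
            MPI_Recv(&a[0], 1, MPI_DOUBLE, pid - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        for (int i = 1; i <= n_local; i++)          /* second parallel loop (S2) */
            b[i] = a[i] + a[i - 1];
    }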


Figure 1.4: An example of one and two-dimensional iteration spaces.

1.2.2 STRUCTURES OF PARALLEL COMPUTATION

Three basic structures, or forms, of parallel computation encompass the parallelism needed by most, if not all, applications. These three forms are Single Instruction, Multiple Data (or SIMD), Single Program Multiple Data (or SPMD), and Multiple Instruction Multiple Data (or MIMD).

SIMD execution is characterized by multiple processors or functional units executing the same instruction simultaneously on different data elements. This led to simpler hardware by allowing a single control unit to control many functional units. This is the form of computation used, for example, on the early Illiac machines [34], the Goodyear MPP [25], and the first Thinking Machines computer, the CM-1 [102]. In these machines, a bit for each processing element could be set to signal the processor to act on its data, or to do nothing for this instruction. This allowed a limited form of customization of execution on each processor. Vector processing, using instruction sets such as those supported by Intel’s Streaming SIMD Extensions (SSE) [215, 216] or the Power Altivec [11, 12] instructions, provides a form of SIMD execution, since it applies a single instruction or operation to the elements of a vector. Graphics Processing Units, or GPUs [13, 111, 164], which have become popular as accelerators for numerical applications [38, 60, 142, 149, 205, 227], are another example of modern SIMD machines. The deficiency of this form of parallel execution is that all computation must proceed in lock step, and processing elements not currently executing a particular operation are not free to perform other operations. Thus, at an if, the processing elements that need to execute the true branch might execute first while the processing elements needing to execute the false branch would be idle. Next, the processing elements that executed the true branch would be idle while the false-branch processing elements execute.
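
For example, with the SSE intrinsics mentioned above, a single instruction adds four single-precision values at once; the sketch below assumes, for brevity, that the array length is a multiple of four.

    #include <xmmintrin.h>   /* SSE intrinsics */

    /* A small illustration of SIMD/vector execution: one SSE instruction adds
       four packed single-precision floats at a time. */
    void vec_add(float *c, const float *a, const float *b, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);            /* load 4 floats from a */
            __m128 vb = _mm_loadu_ps(&b[i]);            /* load 4 floats from b */
            _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));   /* c[i..i+3] = a + b    */
        }
    }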


Figure 1.5: Execution of a shared memory program.

The synchronous execution of instructions on these machines has implications for parallelization. Thus, in the example of Figure 1.1, the dependences across iterations would be enforced by the synchronous execution on the machine. Properly designed, these machines exhibit simpler control structures than equivalently powerful machines that execute the next two forms of parallelism.

SPMD execution consists of multiple copies of the same program each independently (i.e., not in lock step) executing against its own data set. The distributed memory execution shown in Figure 1.3(a) and (b) is an example of an SPMD execution—each process contains its own version of the program executing on its data. Shared memory programs executing many identical threads can also be thought of as SPMD programs, although as commonly used the term often refers (perhaps too narrowly) to distributed memory programs. The SPMD structure of a program can also simplify some analyses. The IBM Blue Gene [29, 162] machines, clusters of workstations [173], and Beowulf clusters [27] are examples of systems that execute SPMD programs.

MIMD programs have different code executing on different processors, each operating on its own data. A multithreaded program where not all threads are identical is an example of this. The importance of MIMD programming has increased dramatically in recent years with the dominance of multicore processors, such as the Intel Sandy Bridge (see, for example, [207]), IBM Power 7 [179, 180], and AMD [14] processors, as well as numerous other processors in the general purpose, embedded, and high-performance computing markets.

Finally, the different forms of computation can be mapped onto one another. Thus, by merging all of the different programs in an MIMD program into one program, with control flow picking which processor executes which program, an MIMD program can be executed as an SPMD program. An SPMD program can be executed on an SIMD machine at the expense of many processing elements being idle, such as when one processing element’s program performs extra iterations, and so forth.

1.3 COMPILER FUNDAMENTALS

This section describes the fundamental compiler data structures that are used in many, if not most, compilers. The analyses and transformations whose focus is program parallelization often utilize these structures in their work. As well, many papers that treat the topics of this synthesis lecture in depth (see Chapter 8) often assume a knowledge of these data structures and related compiler basics.

1.3.1 COMPILER PHASES

The high level flow of a compiler is shown in Figure 1.6. Inputs and intermediate files are shown as rounded boxes, and actual phases of the compiler are shown as boxes.


Figure 1.6: High-level structure of a parallelizing compiler.

Input to the compiler is usually a text representation of a source program. In practice, the source program can be a binary file, used in binary instrumentors and binary compilers [19, 174]; a bytecode file, such as is produced by a Java or Python source-to-bytecode compiler; or a file with analysis information for the compilation unit included, to allow further analysis on a compilation unit.

A compilation unit is, as the name implies, the typical unit of a program over which the compiler operates, over which static symbol scopes are defined, and is the scope over which analysis and optimizations occur1. For C, C++, and Fortran, a compilation unit is typically a file. In the C and C++ languages the program has first passed through a preprocessor which inlines header files. These header files contain declarations and/or definitions of variables, types and functions – i.e., they give the name and type information about the object being declared, and may also define storage for the object. This allows better error checking by allowing the compiler to check for type mismatches in assignments and function calls. For Java and some other languages, the compiler will check the declarations of functions in other files to see if they match what is expected. In Java programs, the unit of compilation for type checking is the entire program, whereas for optimization it is often a single function within a class. The javac Java-to-bytecode compiler performs no optimizations; optimizations are performed by the runtime, or dynamic, compiler. Java optimizations are often restricted to a single method or class to minimize the time spent compiling – an important consideration since the compiler is executing while the program is running.

1.3.2 PARSING

When a compilation unit is presented to the compiler, it is lexically analyzed and parsed. Lexical analysis is the process of breaking up the string of characters that represents the source program into tokens, e.g., a variable or procedure name, a binary operator, a keyword, and so forth. Lexical analysis is a process that formally can be performed by a finite state machine. Parsing is the process by which the compiler recognizes different syntactic constructs formed from the tokens recognized by the lexical analyzer—a variable reference, a binary operation, an assignment statement, a function call, and so forth. Formally, parsing for most common languages can be performed by a slightly augmented push-down automaton. Language designers often attempt to design languages that can be processed by these formal structures. Languages so designed can have lexical analyzers and parsers created for them automatically using tools such as Lex, YACC, Flex, Bison [143], and ANTLR [16]. Such languages also tend to be easy for humans to read.
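
For example, lexical analysis turns the character string x = y + 42; into the token sequence

    IDENT(x)  ASSIGN  IDENT(y)  PLUS  INT_LITERAL(42)  SEMICOLON

and the parser then recognizes this token sequence as an assignment statement whose right-hand side is the binary expression y + 42.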

The compiler will generally parse global declarations and definitions, which appear at the top of the file in C programs, and then compile, in turn, each function that appears in the program unit. The output of the parser is an intermediate representation, or IR, that represents the semantics, or meaning, of the program unit.

1.3.3 INTERMEDIATE REPRESENTATIONS

The IR of the program is repeatedly examined by different analysis passes, and modified to reflect the effect of optimizing transformations. The IR is represented as an abstract syntax tree [3, 17, 77], which is a graphical representation of the parsed program. In this work, we will modify this slightly and represent programs as a control flow graph (CFG), which is often imposed on the abstract syntax tree used by a compiler. A CFG is a graph whose nodes bi ∈ B are basic blocks and where an edge bi → bj means that block bj may execute directly after bi. A basic block is a sequence of operations with only one entry and one exit. Variations on the basic block are sometimes used in compilers, e.g., the basic block may contain several entry points (i.e., labels that can be branched to from outside the block) and one exit [87], or the basic block may contain only a single reference to a variable that can be accessed in a different thread [141]. The operations within a basic block may be represented as a sequence of low-level operations in three-address code, similar to what one might find in an assembly language program, or as expressions and statements represented as trees.
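
One possible (deliberately simplified) set of C declarations for such an IR is sketched below; the field and type names are illustrative only, and production compilers record considerably more information.

    /* Simplified IR declarations: a basic block holds a sequence of
       three-address operations and edges to its neighbors in the CFG. */
    typedef struct operation {
        int opcode;                 /* e.g., ADD, LOAD, STORE, BRANCH */
        int dst, src1, src2;        /* three-address form: dst = src1 op src2 */
        struct operation *next;     /* next operation in the block */
    } operation;

    typedef struct basic_block {
        int id;
        operation *first, *last;    /* straight-line code: one entry, one exit */
        struct basic_block **succs; /* edges bi -> bj of the CFG */
        struct basic_block **preds; /* predecessor edges (the reverse CFG) */
        int num_succs, num_preds;
    } basic_block;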

Figure 1.7(a) shows a fragment of a program, and Figure 1.7(b) shows the CFG for the program. Each block in the CFG is a basic block, and is labeled with the name of the block and the statements it contains. Figure 1.7(c) shows the reverse CFG for the program—each edge (a, b) is reversed to be an edge (b, a). The reverse control flow graph is used in the construction of the static single assignment form for programs (discussed shortly), in performing some kinds of dataflow analysis (discussed in Section 2.1), and in computing control dependence information (Section 2.4).


Figure 1.7: Example of control flow and reverse control flow graphs, showing the dominator and post-dominator relationships.


Figure 1.8: The dominator and post-dominator relationships for the program of Figure 1.7.

Static Single Assignment (SSA) form

The control flow graph and operations within the nodes of the basic blocks are sometimes converted to static single assignment (SSA) form [62, 75]. In the SSA form, storage is logically assigned a value once. We say “logically” because the single-assignment property is enforced during compilation and analysis, but the program is converted from the SSA form to a more traditional form for execution. Intuitively, the conversion into SSA form is done by giving the target of each assignment a unique name, and changing all uses of the value to read from that unique name. Thus, in the program of Figure 1.7(a), the target of each assignment into a variable v is renamed vj, vj+1, and so forth. If a program consisted of only straight-line code, then each use of v would be replaced by a reference to vj, where vj is the name given to the immediately preceding assignment to v. Often, arrays are treated similarly to scalars, that is, a write (or read) of an array is assumed to access every element of the array. The Array SSA form of [120] overcomes this deficiency at the cost of some complexity in the compiler data structure and in runtime monitoring of stores to arrays when exact information about these cannot be gathered at compile time. Figure 1.7(d) shows the SSA form of the program of Figure 1.7(a).

Control flow complicates this, and must be handled. For example, in the program fragment of Figure 1.7(a), the value of a read in S8 can come from the assignment to a in either S1 or S7. To deal with this, the SSA form uses φ-functions to choose between the different values of a that arrive at a program point, and assigns the result to a new, uniquely numbered a.
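
A small, generic example (not the program of Figure 1.7) shows the renaming and the placement of a φ-function at a join point; the φ is compiler notation rather than executable code.

    /* Original fragment */
    a = x + 1;
    if (p)
        a = a + 2;
    b = a * 2;

    /* The same fragment in SSA form: each definition gets a unique name,
       and a phi-function at the join selects the definition that reaches it. */
    a1 = x1 + 1;
    if (p1)
        a2 = a1 + 2;
    a3 = phi(a2, a1);
    b1 = a3 * 2;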

The algorithm for placing φ-functions uses the concept of the dominance frontier of a basic block containing an assignment to a. A basic block bj is dominated by a basic block bd if all paths from the start of the program to bj must pass through bd. Figure 1.8(a) and (b) show the dominance relationships for the CFG and reverse CFG, respectively, for the program of Figure 1.7(a). The dominance frontier of some block bj is the set of all basic blocks bdf such that bj dominates a predecessor of bdf in the CFG but does not strictly dominate bdf [62]. More intuitively, the dominance frontier of bj contains the blocks where paths in the control flow graph that bypass bj join with the path that contains bj. One way of finding the dominance frontier for a CFG is to compute the control dependence nodes in the reverse CFG using the concept of dominators and post-dominators in the reverse CFG—these nodes are the dominance frontier. Intuitively, a node X is control dependent on a node Y if the execution of Y controls whether or not X is executed. The concepts of control dependence and dominance frontiers are discussed in Section 2.4.
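
One simple way to compute dominator sets is the classic iterative data-flow formulation sketched below (production compilers usually use faster algorithms, and this is not the algorithm of [62]); for compactness it assumes that a set of blocks fits in a single 64-bit word.

    /* Iterative computation of dominator sets. dom[b] is a bitset in which bit d
       is set if block d dominates block b; preds[b] is a bitset of b's CFG
       predecessors. Assumes nblocks <= 64 so that a set fits in one word. */
    void compute_dominators(unsigned long long dom[], const unsigned long long preds[],
                            int nblocks, int entry) {
        for (int b = 0; b < nblocks; b++)
            dom[b] = ~0ULL;                      /* initially: "dominated by everything" */
        dom[entry] = 1ULL << entry;              /* the entry block dominates only itself */

        for (int changed = 1; changed; ) {
            changed = 0;
            for (int b = 0; b < nblocks; b++) {
                if (b == entry)
                    continue;
                unsigned long long d = ~0ULL;
                for (int p = 0; p < nblocks; p++)    /* intersect dominator sets of preds */
                    if (preds[b] & (1ULL << p))
                        d &= dom[p];
                d |= 1ULL << b;                      /* every block dominates itself */
                if (d != dom[b]) {
                    dom[b] = d;
                    changed = 1;
                }
            }
        }
    }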

The SSA form is useful because it allows a variety of scalar optimizations such as constant propagation [240] (which finds variables that hold constant values) and use-def computation to be done efficiently. Most modern compilers have an IR that is in SSA form.

Other compiler structures in the intermediate representation

The IR of a program maintains other useful information about the program. The two most important data structures not yet discussed are the symbol table and call graph.

The symbol table contains information about every symbol in the program that can be accessed from the function or compilation unit being compiled. This includes classes, variables, functions, structures, and global and external variables and functions. Standard information maintained includes variable types; return types, parameter types, and parameter counts for functions; structure fields and their types; global and external variable types; whether variables are constants; and the member functions and fields of classes. Information such as whether a variable is constant may be derived by the parser when the variable declaration is parsed (if it is declared as “const”), or may be discovered later as a result of analysis.
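
A heavily simplified symbol table entry might look like the following C declaration; the fields shown are illustrative, not a prescription.

    /* A (much simplified) symbol table entry; real compilers record far more. */
    typedef struct symbol {
        char *name;
        int   kind;            /* variable, function, type, field, ...            */
        int   type;            /* index of the symbol's type descriptor           */
        int   is_constant;     /* known-constant flag (declared or discovered)    */
        int   num_params;      /* for functions: number of parameters             */
        int  *param_types;     /* for functions: parameter type descriptors       */
        struct symbol *next;   /* chaining within a scope's hash bucket           */
    } symbol;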

The call graph (CG) has a node for each unique function or method in the program (or compilation unit). An edge ni → nj exists between two nodes in the CG if function ni calls function nj. If there is a path from ni to itself through zero or more other nodes, then ni is recursively invoked. The nodes of the CG often contain a CFG that represents the actions of the function, as described above.

Compiler optimizations for sequential and parallel machines

Compilers perform several different flavors of analysis and optimization. In general, the goals are to reduce the amount of work performed in a computation, to reduce the amount of storage used, and to enable the work that must be performed to execute in parallel. Examples of optimizations that reduce the amount of work performed include, but are not limited to:

dead code elimination which finds code that is never executed and removes it from the program. Dead code often results from flags used to support debugging (e.g., if (debug > 0) . . .) or as a side effect of other compiler transformations;

function inlining which replaces a call to a function by the body of the function, i.e., the function is treated similarly to a macro. In the case of small accessor functions in object oriented programs, the benefits of not having to go through a full function invocation to access a class field can be substantial;

tiling which seeks to reduce the overhead of cache misses;

strength reduction which replaces an operation (e.g., integer division by a power of two) with a cheaper operation (e.g., a right shift). We note that even something as simple as this can become complicated. For example, languages like Java round integer division towards zero, whereas an arithmetic right shift of a negative two’s complement number rounds down. Thus, −5/2 should give a result of −2 in Java, but shifting the four-bit two’s complement representation of −5 right yields 1011 >> 1 → 1101, or −3, as the small example below illustrates. Handling such subtleties manually is error prone and a good reason to rely on compiler optimizations, especially for low-level optimizations.
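
The discrepancy is easy to see with a few lines of C; Java's integer division has the same truncate-toward-zero behavior, while the right shift of a negative value is implementation-defined in C but arithmetic (rounding toward negative infinity) on most platforms.

    #include <stdio.h>

    int main(void) {
        int n = -5;
        /* Integer division truncates toward zero (in Java, and in C since C99),
           while an arithmetic right shift rounds toward negative infinity,
           so the two disagree for negative odd operands. */
        printf("-5 / 2  = %d\n", n / 2);    /* prints -2 */
        printf("-5 >> 1 = %d\n", n >> 1);   /* typically prints -3 (arithmetic shift) */
        return 0;
    }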

An example of a transformation that enables sequential optimizations is function inlining. By bringing the called (or callee) function’s code inside of the calling (or caller) function, intra-procedural analysis techniques can be used to analyze and optimize the code of the callee along with the code of the caller. By creating a new copy of the function at each site where it is called and inlined, optimizations that are legal, or beneficial, only for the invocations at a single call site are enabled.

At a much lower level, register allocation is used to maximize the number of variable reads and writes that are satisfied from registers while minimizing the number of loads and stores of values between registers and memory. Also at the low level, instruction scheduling can be used to order operations so that the hardware is likely to execute them in a way that maximizes instruction level parallelism.

The details of most of these optimizations are beyond the scope of this lecture. In Chapter 8, pointers to papers giving a good overview of these techniques can be found. The bulk of this lecture will discuss compiler support for parallel machines, as described in the next section. Nevertheless, parallel programs consist of collections of sequential code executing in parallel, and compiler techniques that optimize this sequential code also improve the performance of the program.

1.4 COMPILER SUPPORT FOR PARALLEL MACHINES

Ideally, a compiler would take a sequential program and automatically translate it into a parallel program that efficiently makes use of the cores in a multicore processor. With current technologies, a loop can execute in parallel if either (i) each iteration of the loop executes independently of all other iterations, or (ii) properties such as commutativity can be exploited to allow different iterations that write and read the same data to execute in parallel. Figure 1.9 shows a program whose loops can be parallelized by transforming them into loops that meet the above criteria.

Consider first the i loop. Each iteration of the loop writes a unique memory location b[i], and no iteration of the loop reads this location. Thus, no iteration of the loop reads a memory location touched by another iteration, every iteration of the loop is independent of all others, and the loop can easily be run in parallel. For the j loop the situation is more complicated. Every iteration of the loop reads and writes t. Should the loop, as is, be run in parallel, there will be many iterations trying to read and write the same memory location, and an execution of the program will probably not give a correct result. Finally, in the k loop every iteration also reads and writes the variable s.
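
Based on the description above, the loops of Figure 1.9(a) have roughly the following shape (a paraphrase for readability, not the exact code of the figure):

    /* i loop: every iteration writes a distinct element b[i] and reads only
       a[i], so iterations are independent and the loop can run in parallel. */
    for (i = 0; i < n; i++)
        b[i] = a[i] / 2.0;

    /* j loop: every iteration writes and then reads the scalar t, so iterations
       executed concurrently would race on the same location. */
    for (j = 0; j < n; j++) {
        t = a[j] + b[j];
        c[j] = t * t;
    }

    /* k loop: every iteration reads and writes s, but only with a commutative
       and associative operation, so s can be computed as a reduction. */
    for (k = 0; k < n; k++)
        s = s + a[k] * b[k];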


Figure 1.9: An example of a loop with multiple dependences.

Compilers use dependence analysis, described in Chapter 2, to determine if accesses to an array may be to the same element of the array, and if so, how many iterations separate those accesses. In Figure 1.9(b) the dependence graph for the program is shown. The dependence graph summarizes the problematic accesses of memory on t and s described above. δf is a flow dependence, which indicates a location is written, and then later read; δo is an output dependence, which indicates the same location is written, and then later written again one or more times; and δa is an anti dependence, which indicates a location is read, and then later written. We note that flow, anti, and output dependences correspond to read-after-write (RAW), write-after-read (WAR), and write-after-write (WAW) hazards, respectively. The concepts of dependence will be refined and discussed in greater detail in Chapter 2.
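
The three kinds of dependence can be seen in a small straight-line fragment such as the following (an illustrative example, not taken from the figure):

    a = b + c;    /* S1 */
    d = a + 1;    /* S2: flow (RAW) dependence on S1 -- a is written, then read        */
    b = 2 * e;    /* S3: anti (WAR) dependence on S1 -- b is read in S1, then written  */
    a = d + 5;    /* S4: output (WAW) dependence on S1 -- a is written in both S1, S4  */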

In the loop of the example of Figure 1.9, the accesses to t can be made independent by replacing the scalar t with an array t′ that has one element for each iteration of the loop, and replacing each reference to t with a reference to t′[j]2. Doing this leads to the code in Figure 1.9(c). This, and other transformations to eliminate or modify dependences to enable parallelization, are described in Chapter 4.

The behavior of s in the k loop looks similar to that of t, but there is an important difference. Because the only operations on s in the loop are commutative, the final value of s can be computed as a reduction. As with t, s is expanded into an array s′ with one element for each thread. To get the final value of s, however, it is necessary to sum up the numThreads elements of s′ into s, as is done by the sequential k′ loop. Chapter 5 discusses reduction recognition, and how code is generated for reductions.
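
Putting the two transformations together, the transformed program described above might look roughly like the sketch below, where t2 and s2 stand in for the t′ and s′ arrays of the text and thread_id(k) is a hypothetical mapping from an iteration to the thread that executes it.

    /* j loop after scalar expansion: each iteration writes its own t2[j],
       so iterations no longer share a location and may run in parallel. */
    for (j = 0; j < n; j++) {             /* parallel */
        t2[j] = a[j] + b[j];
        c[j] = t2[j] * t2[j];
    }

    /* k loop rewritten as a reduction: each thread accumulates into its own
       element of s2, and a short sequential loop combines the partial sums. */
    for (k = 0; k < n; k++)               /* parallel */
        s2[thread_id(k)] += a[k] * b[k];

    for (kp = 0; kp < numThreads; kp++)   /* sequential k' loop */
        s += s2[kp];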

In Chapter 3, we discuss how the analyses used to parallelize programs can be used to perform other optimizations. The most widely used of these is tiling, which alters the order of array accesses in order to increase cache locality, but the analyses can support a variety of other transformations.

The analyses and optimizations discussed so far, and through Chapter 3, assume the input program is sequential. If it is not, then issues arise that are related to those raised by implementing cache coherence protocols on shared memory processors. In Section 2.6 we discuss how compilers can deal with these issues arising from compiling programs that are already parallel.

An overview of compiling for distributed memory machines is given in Chapter 6. Whereas communication across threads executing in parallel on a shared memory machine is done via loads and stores, communication on a distributed memory machine requires explicit calls to communication library code, and carries a significantly higher startup cost than a load or a store. This, in turn, makes it necessary to minimize the number of communication operations that occur, and makes it essential that each operation deliver as much data as possible, placing a unique burden on the compiler.

In Chapter 7, we discuss the solution of Diophantine equations, which are equations whose coefficients and solutions are integers. Diophantine equations are used in several places by compilers.
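
For example, the linear Diophantine equation a·x + b·y = c has an integer solution exactly when gcd(a, b) divides c, a fact exploited by the GCD dependence test; the extended Euclidean algorithm below computes gcd(a, b) together with one such pair (x, y).

    #include <stdio.h>

    /* Extended Euclid: returns g = gcd(a, b) and sets *x, *y so that a*x + b*y = g.
       The equation a*x + b*y = c then has integer solutions iff g divides c. */
    long long ext_gcd(long long a, long long b, long long *x, long long *y) {
        if (b == 0) { *x = 1; *y = 0; return a; }
        long long x1, y1;
        long long g = ext_gcd(b, a % b, &x1, &y1);
        *x = y1;
        *y = x1 - (a / b) * y1;
        return g;
    }

    int main(void) {
        long long x, y;
        long long g = ext_gcd(12, 8, &x, &y);    /* 12*x + 8*y = gcd(12, 8) = 4 */
        printf("gcd = %lld, x = %lld, y = %lld\n", g, x, y);
        return 0;
    }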

Finally, in Chapter 8, we present a guide for further reading both on the topics discussed in this work, and in related and more advanced topics that were not covered.

1 Note, however, that many compilers also perform optimizations across compilation units, including optimizations across binary files. This is sometimes done by maintaining an intermediate representation (see Section 1.3.3) for multiple compilation units in the compiler. This topic is beyond the scope of this lecture, however.

2 A more efficient allocation of t′ can be made by realizing that only one element is needed per thread, as discussed in Section 4.8.
