CHAPTER 7

Performance Tuning and Optimization

The UPC language has been designed specifically to allow programmers to get the best performance from a wide range of parallel computer architectures. Many features of the UPC language and toolset will aid in this task. But to obtain the best performance, it is essential to have some basic knowledge of the characteristics, system architecture, and performance behavior of parallel computers. The chapter begins with a brief primer on parallel machines and general performance issues in parallel programming; for deeper insight into many of these issues, readers may consult the texts referred to in this chapter. After this general introductory overview of parallel system architecture, three critical factors in achieving performance with UPC are discussed:

  1. The UPC compiler, which analyzes the UPC application program and applies a variety of techniques to it to produce good executable code
  2. The UPC run-time system, which both enables running programs and observes their dynamic behavior in order to undertake actions to improve performance at run time
  3. Hand optimizations that are performed by programmers to enhance application performance

With this arsenal of language constructs and software tools, the programmer has the means to craft parallel programs and make them perform well. The UPC programming model exhibits significant control flexibility to permit some important optimizations. Some specific techniques for achieving these are discussed in this chapter. In particular, methods for taking advantage of data locality, improved synchronization, and latency hiding are presented. All of these useful practices are illustrated through detailed case studies.

7.1 PARALLEL SYSTEM ARCHITECTURES

UPC provides a semantic bridge between the user application problem and the target physical parallel system architecture, through the software medium of compilers and a run-time system. Although semantic correctness remains constant across machines of different scale and class, their performance properties may differ, sometimes dramatically. Several classes of parallel systems are in common use today, and each may vary in scale, perhaps by several orders of magnitude (measured in the number of computing elements). Although this has not always been the case, the prevalent parallel systems all employ a plethora of processors, memory banks, and internal communication networks [system area networks (SANs)]. The different classes of systems differ, however, in the architecture of their microprocessors, the structure or organization of these combined elements, and their logical interrelationships. In this section we describe briefly the dominant classes of parallel system architectures, to provide a foundation with which to consider the driving performance properties and the optimization methods that address them.

Five classes of parallel system architecture are described. Although not an exhaustive set of the entire domain of parallel computing architecture, they represent the vast majority of commercial systems and all of the systems to which UPC compilers have been targeted:

  1. Symmetric multiprocessors (SMPs)
  2. Distributed shared memory multiprocessors (DSMs)
  3. Distributed memory multiprocessors
  4. Commodity clusters
  5. Parallel vector processors (PVPs)

It should be noted that these terms are not defined rigorously, and workers in the field may employ them with slight variation of intent. The following is one reasonable and representative description, but by no means the only credible one.

For the sake of simplicity, the following parallel system descriptions emphasize two architectural properties: degree of coupling and memory model. Degree of coupling is generally characterized on a range from loosely to tightly coupled communication among the separate computing and memory elements comprising the parallel system. This is, however, a simplification of a combination of factors: the total rate at which data can be moved across the system at one time (its bisection bandwidth), the rate at which any one processing node can inject data into or receive data from the SAN (its local bandwidth), the time it takes to move data between any two such nodes (latency), the possible variation of that latency among different node pairings (UMA versus NUMA), and the overhead, or work on the critical execution path, required to send and receive data messages to and from remote nodes.

The memory model of a system determines the visibility of the distributed main memory from the perspective of the processor node hardware. The three possible conditions that discriminate among system classes are:

  1. Full shared memory with cache coherence
  2. Shared memory without cache coherence
  3. Separate local memory partitions

The combination of these memory models and the degree of communication coupling is sufficient to describe and differentiate the first four of the dominant parallel system architecture classes; parallel vector processors are distinguished primarily by their processor architecture rather than by their memory organization.

Symmetric multiprocessors (SMPs), also known as uniform memory access (UMA) architectures, represent a parallel computer architecture in which processors share the same memory physically, such that the access time from any processor to the memory system is the same. In such a system, the set of processors have direct and equal access to the set of memory banks. The SAN connects the memories and the processors such that the processors have the physical and logical access characteristics of true shared memory. Furthermore, coherency across processor caches is maintained, so that changes to the state of a variable by one processor are reflected in the copies of that variable in the caches of any of the other system processors. This permits a true shared memory execution model and provides very low overhead for all memory accesses across the system. Nonetheless, there are two important aspects of SMP systems that impose performance bottlenecks. The first is restricted scale. To support UMA across the memory system and full cache coherence, hardware mechanisms are employed that do not scale indefinitely. Therefore, the majority of SMPs comprise a small number of processors, two to four being typical and few systems exceeding 16 processors. The second is restricted bandwidth to main memory. Because of the need for cache coherence, all memory traffic has to be monitored, and the amount of memory bandwidth available to the processors is limited. When all processors attempt to access main memory simultaneously, significant contention often occurs, degrading overall system performance and efficiency.

Distributed shared memory machines (DSMs), also known as nonuniform memory access (NUMA) architectures, comprise a system in which the memory is physically distributed across the processors but is globally addressable and accessible by all processors. This provides a single machine image. However, a processor can access its local memory modules faster than remote ones, although all of them are shared by all processors. The basic building block in a DSM machine is a node, which can be as simple as one processor with some memory attached to it, or an entire SMP system. These nodes are connected by an interconnection network.

Unlike SMPs, distributed shared memory machines incorporate cache coherence mechanisms that scale. DSM systems with thousands of processors have been deployed commercially. The logical execution model of the DSM is the same as that of the SMP, including cache coherence across processors. But the performance model is different. There is a partition of memory that is relatively local to a given processor node while the remaining memory is relatively remote. Access to local memory is much faster (shorter latency and higher bandwidth) than accesses to remote memory. The difference can be as much as one to two orders of magnitude for very large systems. A local node may incorporate more than one processor and more than one memory bank in an SMP configuration where all processors within a single SMP node have equal access to local memory. The DSM class of system shares with the SMP system the low overhead of memory access and cache coherence, but suffers from nonuniform memory access times even as it benefits from potentially larger system scale.

Now things get a bit complicated. There is a variation of the DSM system that does not have cache coherence, but there is no agreed term for this class of system. Therefore, we will refer to it here as DSMwoCC, but this is not common usage. This is an important class of systems. It has better efficiency (lower overhead) of memory access than that exhibited by ordinary DSMs because it does not incur the additional hardware overhead required for cache coherence. It is also less expensive to build and can scale to larger systems. But DSMwoCC systems also suffer from NUMA behavior and do not guarantee a pure shared memory model, so software must take into consideration potentially inconsistent copies of shared data. DSMwoCC is the closest hardware system to the abstract programming and execution model of UPC and is an ideal target platform for UPC compilers. However, UPC can be mapped as efficiently as other paradigms to other types of architectures.

Distributed memory (DM) systems, also known as multicomputers, are tightly coupled ensembles of independent processing nodes. Each node comprises one or more processors in an SMP configuration. The nodes are interconnected with a high-bandwidth low-latency custom SAN. This network is used to send messages between the nodes. Logically, a node processor can only directly address memory local to that node. No direct remote memory accesses are possible. This is referred to as a partitioned or fragmented memory model. For nodes in a DM system to cooperate on a single computation, explicit message passing between processes on separate nodes must be accomplished. Often, large blocks of contiguous data may be sent in a single message between processes running on separate nodes, amortizing the overhead and latency of communications while exploiting the relatively high bandwidth of the SAN. However, not all applications can benefit from that, as some may be latency sensitive. Although communication incurs software overhead at each end and is thus more costly than in DSM systems, there are many problems whose data access patterns are such that this additional burden is acceptable. DM systems are less expensive than DSM systems and they can scale to much greater sizes. Systems on the order of 10,000 processors have been deployed, and DM systems of tens of thousands of processors are likely to emerge in the near future. Although the hardware logical model does not directly match the UPC execution model, the UPC compiler and run-time system enable application codes written in UPC to perform correctly on DM systems. However, the performance model due to the distributed memory semantics of the machine is different from that of the DSM, and performance tuning and optimization require that such operational distinctions be taken into consideration.

Commodity clusters are collections of stand-alone computer systems manufactured for mainstream data processing markets and integrated using COTS local area or system area networks. Clusters of personal computers, often running Linux, are known as Beowulf cluster systems [STE01], named after the early NASA project that introduced them. Due to the benefits of economy of scale of production, commodity clusters deliver exceptional performance to cost for many (but not all) applications, exceeding that of DSMs, DMs, and vector machines (see below) by as much as an order of magnitude or more. Today, more than half of the world's top 500 computing systems (measured by the Linpack benchmark) are commodity clusters.

Commodity clusters present the same logical model as that of distributed memory systems. Nodes of one or more processors in an SMP configuration communicate by a message-passing protocol between processes running on the separate nodes. The processors of each node can directly access (using load and store operations) only the memory local to that node. But, if anything, the performance model of a cluster is worse than that of a DM class machine. The overheads for remote communication are larger and the bandwidth and latencies of the COTS network are worse than the custom communication network of the DM systems. Latencies can be one to two orders of magnitude longer and bandwidth an order of magnitude lower than DM machines. Acquiring a remote value can require 1000 to 10,000 processor cycles. Performance tuning and optimization is driven by the need to exploit locality and minimize communication, especially fine-grained communication, even at the expense of performing more work or redundant work on each node. Despite these increased challenges, commodity clusters are likely to be a common target platform of UPC applications for some time to come.

Parallel vector processors (PVPs) are typically designed around a proprietary processor that is specially crafted to target vector computations. Vector processors can have vector registers that hold a complete array; their functional units are designed as pipelines that can exploit the overlapped parallelism in processing the elements of a long array. The instruction sets of vector processors are such that an entire loop to fetch, store, or process an array can be replaced by a single vector instruction. Vector processors can sustain very high bandwidths through their networks to memory for certain types of access patterns, including unit-stride and uniform-stride accesses. Advanced systems can also perform sparse accesses efficiently with special gather/scatter hardware. PVPs can be used in all of the system configurations described above, replacing the commodity microprocessors with custom vector processors. When used effectively, they can provide among the highest delivered performance of any parallel system for certain classes of applications. But performance tuning and optimization using UPC presents a different set of code structuring trade-offs than for the other system types described.

UPC does not require any particular computer architecture to run. In fact, there are currently UPC compiler implementations for DSMs, multicomputers, clusters, and vector machines. However, it should be noted that UPC requires a shared memory view. Thus, implementations on DSMs and SMPs are relatively easier and, depending on the implementation, may be more efficient. The shared space view under UPC requires that multicomputer implementations have a layer that can present the underlying architecture to UPC as if it were a shared memory architecture. Among the software libraries that did this successfully are PGAS [HU00] and Elan [QUA03]. The fact that the shared space under UPC is partitioned makes it map more efficiently on multicomputers than strictly shared space programming paradigms.

7.2 PERFORMANCE ISSUES IN PARALLEL PROGRAMMING

Parallel Overhead

While increasing the number of processors of a single machine increases its overall peak computational power, it also adds a number of overheads and latencies that can degrade the potential benefit of parallel processing. To enhance scalability and execution efficiency, parallel programmers have to consider these overheads, latencies, and other system factors that reduce performance. Scaling here means getting more measured performance as the number of processors increases. The main types of potential overheads and latencies that inhibit performance are:

  • Interprocessor communications
  • Processor synchronizations
  • Load imbalance
  • Redundancy

Interprocessor communications are required so that the many processors of a parallel machine can work together effectively on the same computational problem. At the beginning of a computation, initial input data may need to be distributed across the processor nodes. During the computation, processors may need to exchange intermediate results. At the end of the computation, the final result data will need to be integrated and directed ultimately to external outputs. In the uniprocessor case, none of these operations are required. In the parallel case they are necessary, but they take the processor away from its main purpose, which is computing. The overhead of controlling the communications and the latency of moving data between processors both contribute to the potential performance degradation. Therefore, interprocessor communications must be kept to a minimum.

Synchronization deals with the need to order and coordinate the work of processors (threads) to ensure that results obtained from parallel computations are consistent with those we would obtain from a uniprocessor machine. We covered the topic of synchronization in Chapter 6. One thing that should be noted is that synchronization comes at a nontrivial cost to performance, and since it is not part of the actual computing, it is another overhead. Programmers interested in maximum performance should therefore reduce the amount of process and data synchronization, using the minimum needed to guarantee the correctness of the results.

Load imbalance is another source of inefficiency. It is very difficult in many cases to divide the data and work equally among all processors. Sometimes the size of the data is not divisible by the number of the processors, and sometimes some data require more processing than others, as in the case of zero and nonzero elements of a sparse matrix. Also, sometimes the workload changes dynamically at run time. In these cases some of the processors will complete their work much later than the rest, which delays the processing time of the overall machine. Programmers should strive to balance the work across the processors, but this may not always be possible.

Redundancy refers to the fact that in many cases exactly the same computations take place on all processors. For example, consider the problem of initializing a particular variable that is needed at all processors. One option is to initialize it at only one processor and then broadcast it. The other is to initialize it at all processors. In the former case we suffer communication overhead. In the latter, an operation that would be done only once on a sequential machine is performed many times; this is redundancy overhead accepted as the price of parallel execution.
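
The trade-off can be made concrete with a minimal UPC sketch. Assume that every thread needs a constant called scale; the variable names and the use of a shared scalar plus a barrier as the broadcast mechanism are illustrative assumptions, not a prescribed pattern:

#include <upc_relaxed.h>

shared double scale_sh;     /* staging location for the broadcast option */
double scale;               /* private copy used by every thread         */

void init_by_broadcast(void)
{
  /* Option 1: compute once on thread 0, then let every thread read it. */
  if (MYTHREAD == 0)
    scale_sh = 4.0 * 3.14159265358979;   /* computed exactly once        */
  upc_barrier;                           /* make the value visible       */
  scale = scale_sh;                      /* one (possibly remote) read   */
}

void init_redundantly(void)
{
  /* Option 2: every thread recomputes the value locally.               */
  /* No communication at all, but the work is replicated THREADS times. */
  scale = 4.0 * 3.14159265358979;
}

For a cheap scalar computation such as this one, the redundant version is normally preferable; the broadcast option becomes attractive only when the initialization itself is expensive relative to a remote access.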

Machine Imbalances

One problem that aggravates communication overhead is the unbalanced rate of improvements in processor and interprocessor communication technologies. Processor performance has been doubling every 18 to 24 months, but speed of communication is not growing at the same fast pace. The same problem also occurs from the memory perspective, where processing speed is growing at a much higher rate than improvements in memory latency and bandwidth.

The performance of an interconnection network can be analyzed in many ways; however, two important metrics are bisection bandwidth and latency. Bandwidth measures the throughput of the machine; bisection bandwidth is the rate at which data can be transferred between the two halves of the machine when it is bisected across its weakest point. Latency is the amount of time taken to send the minimum transferable unit of data from a source node to a destination node. Network traffic is generally pipelined: the first data packet sees the full latency, but the following ones arrive in sequence immediately after the first packet is received. Programmers should develop an understanding of how to minimize communication overhead and/or its impact.
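
These two metrics can be combined into a simple first-order cost model in which transferring n bytes costs roughly the latency plus n divided by the bandwidth. The following sketch uses placeholder values for latency and bandwidth that do not describe any particular network; it merely illustrates why many small transfers are so much more expensive than one large transfer of the same total size:

#include <stdio.h>

/* First-order transfer-time model: the first packet pays the full
   latency, and the remaining bytes stream at the network bandwidth.
   The constants below are illustrative placeholders only.           */
#define LATENCY_US     10.0   /* assumed one-way latency, microseconds        */
#define BANDWIDTH_MBS 100.0   /* assumed bandwidth, MB/s (1 MB/s = 1 byte/us) */

double transfer_time_us(double nbytes)
{
  return LATENCY_US + nbytes / BANDWIDTH_MBS;
}

int main(void)
{
  /* 1000 bytes sent as one message versus as 1000 one-byte messages */
  printf("one 1000-byte message: %8.1f us\n", transfer_time_us(1000.0));
  printf("1000 1-byte messages : %8.1f us\n", 1000.0 * transfer_time_us(1.0));
  return 0;
}

With these illustrative numbers, the single large message costs about 20 microseconds, whereas the 1000 small messages cost roughly 10,000 microseconds, almost all of it latency.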

There are three ways to minimize the impact of communication overhead: locality exploitation, amortization, and overlapping. Locality exploitation refers to attempting to complete all work associated with a part of the data, before moving to the next. Locality exploitation can reduce the amount of data that needs to be accessed remotely. Locality exploitation also optimizes the operation of the memory hierarchy by increasing the likelihood of finding what we need in the cache.

Since the first packet sees the full latency delay, the programmer should try to make the messages as large as possible. This amortizes the initial overhead over the largest possible number of packets. However, such optimization is required only on architectures whose communication networks suffer from high latency and which lack hardware caching of remote shared data. There are two ways to do this: one is to aggregate all of the data that needs to be communicated, even if it is unrelated, to produce a smaller number of messages. The other strategy is to increase the computational granularity. This can be accomplished by distributing data in the largest possible chunks and attempting to complete the work on each chunk locally prior to any exchange of data. In many applications, aggregating and maximizing the size of communication messages may not be possible. In these cases, UPC continues to have a great advantage in that it can perform short transfers efficiently, due to its shared memory view.
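
As an illustration of aggregation, the following sketch contrasts element-by-element remote shared stores with a single bulk transfer of the same data using upc_memput(). The array dest, the block size M, and the choice of destination thread are assumptions made only for this example:

#include <upc_relaxed.h>

#define M 1024
/* dest is blocked so that each thread owns a contiguous chunk of M elements */
shared [M] double dest[M*THREADS];

void send_fine_grained(double *src)
{
  int i, peer = (MYTHREAD + 1) % THREADS;
  /* One remote shared store per element: on a loosely coupled system,
     each store can pay the per-message overhead separately.           */
  for (i = 0; i < M; i++)
    dest[peer*M + i] = src[i];
}

void send_aggregated(double *src)
{
  int peer = (MYTHREAD + 1) % THREADS;
  /* The same M elements in a single bulk transfer: the startup cost
     is paid roughly once and amortized over the whole message.        */
  upc_memput(&dest[peer*M], src, M * sizeof(double));
}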

In addition, when possible, programmers can overlap communications with computation. Instead of on-demand synchronous communication, programmers can start reading or receiving data asynchronously well ahead of time, ensuring completion of the asynchronous operation only just before the relevant data is needed for processing.
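
Standard UPC does not define asynchronous transfer calls (some implementations offer them as extensions), but the split-phase barrier provides a portable way to hide waiting time: a thread announces that its data is ready with upc_notify, performs independent local work, and only then blocks in upc_wait. The function names below are placeholders for whatever local and boundary computations an application actually performs:

#include <upc_relaxed.h>

extern void compute_interior(void);   /* touches only data with local affinity  */
extern void compute_boundary(void);   /* needs values produced by other threads */

void timestep(void)
{
  upc_notify;            /* signal: my contribution to this step is complete */

  compute_interior();    /* independent local work overlaps the time the     */
                         /* other threads need to reach the barrier          */

  upc_wait;              /* every thread has now issued its notify           */
  compute_boundary();    /* safe to consume the other threads' results       */
}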

7.3 ROLE OF COMPILERS AND RUN-TIME SYSTEMS

Getting a program to run on a computer involves a number of tools and processes that need to be understood, to some degree, to effectively optimize the performance of programs. Of course, it all starts with algorithm development and coding in UPC source code. A compiler is then used to translate the UPC source code to machine code for a specific computer. Next, a run-time system is responsible for loading the machine code into the system and starting execution. Since UPC is a parallel language, the run time is also parallel and provides a number of services to the executing program, such as I/O and parallel synchronization. Finally, some tools help analyze both the correctness of programs and their performance. These components interact in interesting ways that should be understood before effective optimization can be performed.

Compilers

UPC compilers are becoming widely available. Each one is produced by a company or other organization to support the preparation of UPC programs on one or more computer systems. A “correct” UPC compiler must take all “valid” UPC programs and ensure that the program, when executed, performs according to the rules of UPC. Appendix A provides the most recent specification of what is “valid” in terms of both the expression of UPC programs and their execution. Although most readers of this book will not need to study the specification, it is the final arbiter of what a UPC program is.

The compiler always has a second goal beyond the simple correct execution of a program: namely, the efficient execution of the program. This is possible because the UPC specification allows compilers to change UPC code into, potentially, many different sequences of machine instructions as long as it appears as though the program produces a “valid” execution. How does a compiler decide which sequence is both valid and will yield the most efficient use of the computer? Although the answer to this has been the subject of about 40 years of compiler research, some things are clear. First, the compiler must understand which sequences are efficient on the machine. Second, the compiler must be able to “see” all the activity that affects performance. For example, compilers generally have a hard time telling what a program is doing in its subroutines, especially if they are in a library. Of course, compilers will have diverse “skills” at this task, depending on the abilities of those who developed the compiler, how much effort they put into its development, and the quality of the computer that will execute the program.

Runtime

The run-time system works with the machine code produced by the compiler and is responsible for the correct and efficient execution of UPC programs. It is usually created and supported by the group that produced the compiler. The first thing the run-time system does is to load the program into the computer memory and cause its execution to begin. This means that for each thread, the run-time system causes the program's main() routine to be called; it then services requests from the program to perform specific basic functions. For example, run-time support is often needed to implement the upc_notify and upc_wait statements. The run-time system is usually responsible for such duties as interacting with the operating system when performing I/O functions and supporting debugging and performance analysis.

To achieve efficiency, the run-time system will often be able to observe certain characteristics of the program execution in real time and respond by taking anticipatory actions. For example, in the UPC programming model there are many cases in which relaxed shared memory references may be prefetched, cached, deferred, or grouped. On some architectures such optimizations will have a dramatic effect on the performance of applications. Other characteristics of the run time to consider are the relative efficiencies of shared memory accesses, locks, and barriers, so that one can tailor the choice of synchronization mechanisms to achieve the best possible performance. Many UPC implementations have similar performance ratios for these mechanisms, so techniques learned on one implementation are usually relevant to others.

7.4 UPC HAND OPTIMIZATION

Hand optimization is a process by which a programmer attempts to “help” the compiler and/or run-time system to achieve more efficient execution of the program. In the case of a new language such as UPC, programmers might have to employ more hand optimizations until compilers mature enough to incorporate all possible optimizations and offer them to the user automatically, or through simple compile-time mechanisms such as flags. Hand optimizations can be divided into two types: general optimizations that apply to all systems, and machine-specific optimizations that apply only to a single system or a limited class of systems. Because system structures, mechanisms, and timing characteristics differ, sometimes substantially, among machine types, optimizations of parallel codes applied for one system may have deleterious effects on the performance of another system for which the specific optimizations were not targeted. Care is essential in this respect. Apparently innocuous modifications to the program may invoke subtle changes in system operation that yield dramatic consequences in performance, either positive or negative. Finally, it should be noted that hand optimizations often tend to obscure the readability of code, so it may well be harder to reuse or debug.

There are two main areas to consider for general hand optimization: helping the system (both compiler and run time) to determine what is “local” to a thread, and helping the system to determine dependences between both private and shared accesses. As will be seen in the following sections, the determination of what is local is best performed by using the UPC affinity concept. There are many techniques for helping the system understand the true dependences in a program, and they are all needed because the rules of the language (UPC in this case) are stricter than the rules of the algorithm. For example, the compiler must handle the code

for (i = 0; i < n; i++)
  x[y[i]] += 17;

correctly even in the presence of the pathological case where all the elements of the index array y[] contain the same value. To execute the code correctly, it must complete each store into x[] before beginning the next fetch. This can obviously lead to poor performance. However, if the user wrote the code as

for (i = 0; i < n; i += 2)
{
  t1 = x[y[i]];
  t2 = x[y[i+1]];
  x[y[i]]   = t1 + 17;
  x[y[i+1]] = t2 + 17;
}

the compiler is free to fetch two values of x[] at the same time, because the user has implied that the values y[i] and y[i+1] within each pair differ. This would result in better performance on most systems. Of course, this implication could be false and the user might discover a nasty bug. This illustrates that in performance-critical sections of a program, it might pay to write code in a more complex fashion so that the compiler can understand the actual requirements of the algorithm.

The UPC programming model and language provide a plethora of mechanisms for enabling optimization of parallel programs. In this section we present a number of these mechanisms along with an explanation of why they are effective. We begin with an in-depth discussion of data locality exploitation and optimization, followed by the effective use of locks and barriers, with special emphasis on the split-phase barriers that UPC provides. Finally, we discuss the importance of using bulk data transfers, concentrating on the difference between UPC bulk transfers and the message-passing model's use of bulk data transfers.

Memory Locality and Consistency

As discussed above, one of the critical optimizations in programming modern parallel machines is to reduce the number of accesses to memory that is not “close” to the thread issuing the access. UPC's memory model is ideal for controlling such accesses: they may be reduced or eliminated by forming algorithms that access mainly local data or by strategically copying shared data into the private space, where accesses are much more efficient. Figure 2.2 depicts the fundamental UPC memory model introduced in Chapter 2. It shows that each thread has access to two data areas: private and shared. The advantage of the private area is that it is guaranteed to provide efficient access on any UPC system. Shared data provides the means of saving space by sharing data structures among the threads and of enabling complex algorithm interactions.

Another feature of the UPC programming model that is critical to performance enhancement and tuning is the concept of affinity. This concept allows each thread to consider a well-defined portion of shared data that is local to the thread and therefore provides the same performance benefits as its private data. This gives the UPC programmer the ability to share data globally but operate on it locally, both contributing to efficient execution. The main mechanisms for taking advantage of affinity are the “casting” of pointers to shared data into pointers to private data and use of the upc_forall() construct. More on these points follows in the next sections.

The memory consistency model of UPC can be an important aid in achieving good performance on many parallel systems. This model, in which references are marked as either strict or relaxed, allows compilers and run-time systems to deliver a combination of good performance when that is needed and tight consistency semantics when required logically. The key here is that these concepts relate to the order in which other threads see a given thread's shared memory accesses. In general, compilers and run-time systems will operate at higher efficiency when they are allowed more flexibility in the ordering of accesses. To understand this range, let's look at the extreme positions of all strict and relaxed shared references in a program, from both a performance and a consistency point of view.

In the strict case, every thread is guaranteed to observe the results of shared accesses in the order in which they are issued. Generally, this is important only when one thread makes a decision based on a value in the shared space: for example, code such as

strict shared int x;
strict shared int y;

thread 1:       thread 2:
x=0;            while (y == 0)
y=0;            /* do nothing */;
x=1;            printf("%d", x);
y=2;

A logical analysis of this program indicates that the only value that could possibly be printed for x would be 1, since y was set to 2 after x was set to 1 in thread 1. The only way to guarantee that this is the case is to declare x and y strict. To honor this, the implementation of thread 1 must ensure that the value of x is updated before y is set, which has a performance cost whenever enforcing that order takes time, as it does on many systems. Thus, the all-strict case makes analysis of the type of code written above easier and has very well defined behavior, but that comes at a cost to the efficient execution of the program.

Now consider the all-relaxed case. The only guarantees are that at synchronization events (e.g., upc_notify, upc_wait, and upc_fence) the side effects of all previous shared accesses have been completed. Consider the following code:

relaxed shared int x;
relaxed shared int y;

thread 1:      thread 2:
x=0;           while (y == 0)
y=0;           /* do nothing */;
x=1;           printf("%d", x);
upc_fence;
y=2;

The code will again ensure that the result value printed out is 1. But the first accesses may be performed in any order. In fact, the first store of 0 in x does not even have to happen, as it has no effect on the program's execution. Obviously, this code will perform better. In general, it is a good idea to use relaxed references for performance gain except when the execution of some other thread depends on the order of accesses seen, as was just shown.

Locks and Barriers

Beyond the parallel program coordination role played by shared memory as described in the earlier section on shared memory consistency, UPC provides two native coordination mechanisms: locks and barriers (introduced in Chapter 5). A lock is an object in shared space that allows a thread to perform operations called upc_lock() and upc_unlock() to guarantee mutual exclusion with respect to the lock. A barrier is a collective operation which ensures that all threads have completed all operations prior to the barrier operation before any threads continue beyond the barrier operation. The major difficulty with barriers and locks is that they can lead to very inefficient operation if not applied carefully.
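
A common source of such inefficiency is a critical section that is held longer than necessary. The sketch below shows one way to keep it short: each thread accumulates into a private variable and acquires the lock only for a single update of the shared total. The names total and total_lock are chosen for this example only:

#include <upc_relaxed.h>

shared double total;        /* global accumulator                      */
upc_lock_t *total_lock;     /* allocated collectively before first use */

void setup(void)
{
  total_lock = upc_all_lock_alloc();  /* all threads get the same lock */
  if (MYTHREAD == 0)
    total = 0.0;
  upc_barrier;
}

void accumulate(const double *part, int n)
{
  int i;
  double local_sum = 0.0;

  /* Do as much work as possible outside the critical section. */
  for (i = 0; i < n; i++)
    local_sum += part[i];

  upc_lock(total_lock);     /* short critical section: one shared update */
  total += local_sum;
  upc_unlock(total_lock);
}

Because upc_all_lock_alloc() is collective, every thread obtains the same lock pointer without any further coordination; the alternative, upc_lock_alloc(), is called by a single thread, which must then make the pointer available to the others.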

Bulk Transfers and Collectives

The UPC library provides a variety of mechanisms that allow the efficient transfer of data between various combinations of shared and private space and support a rich set of collective operations. The principal benefit of these libraries, all of which could be written as UPC programs, is that they offer semantics that are both useful to programmers and may be implemented quite efficiently on a variety of UPC platforms.

The bulk data movement operations [upc_memget(), upc_memput(), and upc_memcpy()] allow efficient movement of data. In most cases, implementations operate near the hardware bandwidth of the underlying communication network. The upc_memget() function is used to transfer data from shared space to private space, upc_memput() transfers data from private space to shared space, and upc_memcpy() transfers data from one shared space to another. In the next section we show examples of using both upc_memget() and upc_memput().

Optimizing for Locality and Load Balance

We now turn to some more concrete examples of using UPC to achieve efficient execution. These are examples of a set of specific optimizations that are generally applicable to many of today's parallel machines. First to be addressed are techniques to enhance performance by exploiting locality. Then the use of split-phase barriers is addressed in some detail. The section concludes with a discussion of the use of locks.

The most elementary optimization for locality is to make a local copy of a remote shared data item that will be reused often. Consider the following transformation:

shared int x;

original:               optimized:
                        int local_x;
                        local_x = x;
for (i=1; i<big; i++)   for (i=1; i<big; i++)
  f(x);                   f(local_x);

If the compiler could look inside the function f() (and all of the functions it calls) and see that x is never changed, it could do the optimization itself. Unfortunately, compilers often cannot perform this sort of analysis: for example, if the function in question is in a library. The benefit of the optimized version is that it reduces the global number of fetches of shared references. The nonoptimized version fetches x once per loop iteration, while the optimized version fetches x only once, avoiding on the order of big shared references.

The upc_memget() library function is perfect for making local copies of larger data structures efficiently, so it might be good to use that if x were an array, for example. By writing the code like this, the programmer is asserting that f() does not change x. If it did, we would have very bizarre behavior. Another potential problem with copies is that they take some time, and if only a small portion is accessed, it might not yield actual performance improvements. Of course, it will also require more memory for the copies on every thread.
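
As a sketch of that idea, the code below copies one thread's block of a shared array into a private buffer with a single upc_memget() call and then works entirely on the private copy. The array x, the block size CHUNK, and the read-only function f() are assumptions of this example; note that the source region is kept within a single thread's portion of the shared space, as upc_memget() requires:

#include <upc_relaxed.h>

#define CHUNK 1024
/* x is distributed in contiguous blocks of CHUNK elements per thread */
shared [CHUNK] double x[CHUNK*THREADS];

double f(double v);            /* assumed to read, and never write, its input */

double sum_of_chunk(int owner)
{
  double local_x[CHUNK];       /* private copy of one thread's block */
  double s = 0.0;
  int i;

  /* One bulk transfer replaces CHUNK fine-grained remote reads. */
  upc_memget(local_x, &x[owner*CHUNK], CHUNK * sizeof(double));

  for (i = 0; i < CHUNK; i++)
    s += f(local_x[i]);        /* all accesses are now private */
  return s;
}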

If we are unable to make copies of data (either because it might be updated or there is no room for the copy), another approach is to use the affinity concept in UPC. Recall from earlier chapters that each elementary object in the shared space has affinity to a single thread, which allows us to do local computations on shared data. Consider the following transformation:

shared struct xs *xp;
original:
struct xs *mine;
for (i=0; i<big; i++)
  if (i%THREADS == MYTHREAD)
    f(&xp[i]);

optimized:
struct xs *mine;
for (i=0; i<big; i++)
  if (upc_threadof(&xp[i]) == MYTHREAD)
  {
    mine = (struct xs *)&xp[i];
    f(mine);
  }

Here we have instituted an “owner-computes” method in which the thread with affinity to the data structure actually computes the function f(), and in the optimized case, f() is called with a pointer to the shared data that appears to point to private data. This causes f() to perform entirely local accesses, which will be compiled very efficiently. This loop can easily be rewritten using the upc_forall() construct, as shown below. In general, using owner-computes and casting pointers to shared data to private is an effective way of getting better performance. Care is required, though: casting a pointer to shared that does not have affinity to the current thread produces undefined (and usually bad) behavior in your program!
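
For reference, a sketch of that upc_forall() rewrite follows, reusing the declarations from the transformation above (shared struct xs *xp, the bound big, and the function f()). The affinity expression &xp[i] restricts each iteration to the thread that owns xp[i], so the explicit MYTHREAD test disappears:

struct xs *mine;
int i;

/* Owner computes via upc_forall(): iteration i executes only on the
   thread with affinity to xp[i], so no affinity test is needed.      */
upc_forall (i = 0; i < big; i++; &xp[i])
{
  mine = (struct xs *) &xp[i];   /* xp[i] is local to this thread here */
  f(mine);
}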

7.5 CASE STUDIES

The following case studies are examples of real UPC codes that demonstrate a number of very useful techniques of hand optimization for extracting enhanced computing efficiency and performance based on earlier discussions in this chapter.

Space Privatization

Example 7.1 demonstrates how to apply space privatization to our heat conduction example. Local shared accesses are replaced by privatized memory accesses where possible. Such optimization requires the computational block to be split into three parts: computation of the upper face, computation of the middle chunk of faces, and computation of the lower face. The middle chunk of faces, lines 41 through 43, does not require any remote shared accesses; therefore, it can be fully privatized. For the computations over the boundary blocks, however, this is not the case. The upper and lower faces, lines 28 through 34 and 36 through 40, respectively, need to perform both local and remote shared accesses; thus, these accesses can be only partially optimized. Lines 17 through 27 correspond to the case in which each thread is allocated only one face to process. In this case, local and remote shared accesses are also needed, so it is possible to privatize only some of the shared memory accesses.

Example 7.1: heat_conduction7.upc
…
 1. // statically allocate grids [] to be matched to [2][N][N][N]
 2. shared [BLOCKSIZE] double grids [2][N][N][N];
 3. shared double dTmax_local [THREADS];
 4. double (*mygrid [2])[N][N], *mydTmax_ptr;

 5. void initialize(void)
 6. {

      …
 7.   /* set up private pointers */
 8.   mygrid [0]=(double (*) [N][N])
        &grids [0][N*MYTHREAD/THREADS][0][0];
 9.   mygrid [1]=(double (*) [N][N])
        &grids [1][N*MYTHREAD/THREADS][0][0];
10.   mydTmax_ptr =(double *) &dTmax_local[MYTHREAD];

11.   /* sets one edge of the cube to 1.0 (heat) */
12.   if(MYTHREAD == 0)
13.     for(y=1; y<N-1; y++)
14.       for(x=1; x<N-1; x++)
15.         mygrid [0][0][y][x]= mygrid [1][0][y][x]= 1.0;
16. }

17. void process_only_face(int sg, int dg, double *dTmax)

18. { // grid(,z-1,,) and grid(,z+1,,) are remote shared
19.   double T, dT;
20.   int y, x, global_z;

21.   global_z = N*MYTHREAD/THREADS;

      …
22.       T = (grids [sg][global_z+1][y][x] + grids [sg][global_z-1][y][x] +
23.        mygrid [sg][0][y+1][x]+ mygrid [sg][0][y-1][x]+
24.        mygrid [sg][0][y][x+1]+ mygrid [sg][0][y][x-1])/6.0;
25.        dT = T - mygrid [sg][0][y][x];
26.        mygrid [dg][0][y][x]= T;
      …
27. }

28. void process_front_face(int sg, int dg, double *dTmax)
29. { // grid(,z-1,,) is remote shared, all others are mygrid []

    …
30.   global_z = N*MYTHREAD/THREADS;
    …

31.     T = (mygrid [sg][1][y][x]+ grids [sg][global_z-1][y][x]+
32.          mygrid [sg][0][y+1][x]+ mygrid [sg][0][y-1][x]+
33.          mygrid [sg][0][y][x+1]+ mygrid [sg][0][y][x-1])/6.0;

     …
34. }
35.
36. void process_back_face(int sg, int dg, double *dTmax)
37. {// grid(,z+1,,) is remote shared, all others are mygrid []

     …
38.   z = N/THREADS-1;
39.   global_z = N*MYTHREAD/THREADS + z;
    …
40. }

41. void process_interior_face(int z, int sg, int dg,
                             double *dTmax)
42. {// all accesses are done through mygrid [][][][]
    …
43. }

44. int heat_conduction(void)
45. {

    …
46.   do
47.   {
48.     dTmax = 0.0;

49.     if((N/THREADS) == 1)
50.      {
51.        if((MYTHREAD != 0) && (MYTHREAD != (THREADS-1)))
52.          process_only_face(sg, dg, &dTmax);
53.      }
54.    else
55.      {
56.        if(MYTHREAD != 0)
57.          process_front_face(sg, dg, &dTmax);
58.        for(z=1; z<(N/THREADS)-1; z++)
59.          process_interior_face(z, sg, dg, &dTmax);
60.        if(MYTHREAD != (THREADS-1))
61.          process_back_face(sg, dg, &dTmax);
62.      }

63.    *mydTmax_ptr = dTmax;
64.    upc_barrier;

    …
65.  } while(finished == 0);

66.  return nr_iter;
67. }

Aggregation of Remote Accesses

As explained earlier, aggregating remote accesses into one large access has the potential of achieving high network bandwidth. The aggregation of remote shared accesses is illustrated in Example 7.2 with our heat conduction problem. Only changes from the previous version of the example are shown. The main modifications have been made to the functions that operate on the boundaries. Lines 25 through 33 have been modified to use the ghost zones information, cached locally at each iteration by the function update_ghostzones(), line 44. Thus, all computations are done over data with affinity to the processing thread, thereby avoiding remote accesses. To amortize the communication overhead and improve the bandwidth, the ghost zones are refreshed, lines 5 through 17, using string (or bulk) functions, upc_memput() in this case.

Example 7.2: heat_conduction8.upc
1. #define FRONT 0
 2. #define BACK 1
 3. shared [2*N*N] double ghostzones[THREADS][2][N][N];
 4. double (*front_ghostzone)[N], (*back_ghostzone)[N];

 5. void update_ghostzones(int sg)
 6. {
 7.   if(MYTHREAD != 0)
 8.     {// send mygrid[sg][0] to back[MYTHREAD-1]
 9.       upc_memput(ghostzones[MYTHREAD-1][BACK],
10.                 &mygrid[sg][0][0][0],sizeof(double)*N*N);
11.    }

12.  if(MYTHREAD != THREADS-1)
13.    { // send mygrid[sg][N/THREADS-1] to front[MYTHREAD+1]
14.      upc_memput(ghostzones[MYTHREAD+1][FRONT],
15.        &mygrid[sg][N/THREADS-1][0][0],sizeof(double)*N*N);
16.    }
17. }

18. void initialize(void)
19. {

    …
20.   front_ghostzone = (double (*)[N])
21.     &ghostzones[MYTHREAD][FRONT][0][0];
22.   back_ghostzone = (double (*)[N])
23.     &ghostzones[MYTHREAD][BACK][0][0];

     …
24. }

25. void process_only_face(int sg, int dg, double *dTmax)
26. { // uses back_ghostzone[][] and front_ghostzone[][]
    //   instead of doing accesses to grids[][][][]

    …
27. }
28. void process_front_face(int sg, int dg, double *dTmax)
29. {// uses front_ghostzone[][]
30. }

31. void process_back_face(int sg, int dg, double *dTmax)
32. { // uses back_ghostzone[][]

    …
33. }

34. int heat_conduction(void)
35. {

    …
36.   if((N/THREADS) == 1)
37.     { // for this particular case, ensure that initial
        //   conditions are in the ghost zones
38.      update_ghostzones(sg);
39.      upc_barrier;
40.    }

41.   do
42.   {

        …
43.     upc_barrier;
        …
44.     update_ghostzones(dg); // dg is the up-to-date grid[][][]
45.     upc_barrier;

        …
46.   } while(finished == 0);
47.   return nr_iter;
48. }

Parallel Binary Tree

In this study we present the use of locks to access and update a parallel tree data structure. In the parallel binary tree we use the following data structure for the node and access definitions:

struct tree_node {
  shared struct tree_node *up, *left, *right;
  shared void *data;
  upc_lock_t *lock;
};

shared struct tree_node *insert_left
    (shared struct tree_node *node, shared void *data);

shared struct tree_node *insert_right
    (shared struct tree_node *node, shared void *data);

upc_lock_t *find_node_lock
    (shared struct tree_node *node);

shared struct tree_node *trav_right
    (shared struct tree_node *node);

shared struct tree_node *trav_left
    (shared struct tree_node *node);

shared struct tree_node *trav_up
    (shared struct tree_node *node);

The primitives of this data structure splice a new node in as either the left or right child of a given node and traverse up, left, or right. The find_node_lock() routine finds the next-higher node lock, as the density of locks in the tree is a tunable parameter. We start with locating a node's lock and inserting a new item as a left child:

upc_lock_t *find_node_lock (shared struct tree_node *node)
{
  if (node->lock)
    return node->lock;
  return find_node_lock (node->up);
}

shared struct tree_node *insert_left
    (shared struct tree_node *node, shared void *data)
{
  shared struct tree_node *new;
  upc_lock_t *lock;

  new = upc_local_alloc (1, sizeof(struct tree_node));
  new->data = data;
  if (this_node_needs_a_lock)      /* application-defined lock-density policy */
    new->lock = upc_lock_alloc();
  else
    new->lock = NULL;

  lock = find_node_lock (node);
  upc_lock (lock);

  /* splice new between node and its former left subtree */
  new->up = node;
  new->left = node->left;
  new->right = NULL;
  if (new->left)
    new->left->up = new;
  node->left = new;
  upc_unlock (lock);

  return new;
}

shared struct tree_node *trav_left
    (shared struct tree_node *node)
{
  upc_lock_t *lock;
  shared struct tree_node *tmp;

  lock = find_node_lock (node);
  upc_lock (lock);
  tmp = node->left;
  upc_unlock (lock);
  return tmp;
}

7.6 SUMMARY

This chapter focuses on optimizing UPC applications. Performance optimization relies on understanding general computer architecture and programming model issues and trying to write code that masks the weaknesses and exploits the strengths of modern parallel computers. Many performance optimizations can be performed automatically by compilers and run-time systems. However, some must be done explicitly by the programmer. For a new language such as UPC, it will take time before compilers relieve the programmer of the burden of hand performance tuning. However, we expect, based on our experiences, that compilers and run-time systems will continue to improve to the point where direct intervention by the programmer becomes minimal. Until this happens, programmers should employ hand tunings such as local processing of data with affinity, privatization of space, aggregation and prefetching of accesses, reduced synchronization where possible, and load balancing.

EXERCISES

7.1 Integrate the privatization, then the aggregation optimizations, into the split-phase heat conduction program described in Section 6.2.

7.2 Measure the wall time taken by heat conduction processing (this is excluding the time for initialization, displaying results, and producing the file output). The wall time is measured using gettimeofday(). Time multiple versions of the heat conduction problem as shown in Chapters 5, 6, and 7. Quantify the impact of the optimizations over the execution time and the scalability.

7.3 Implement a parallel insert into a distributed binary search tree.


