Chapter 2
Perspectives on Parallel Programming
Contents
2.1 Limits on Parallel Program Performance
2.2 Parallel Programming Models
    2.2.1 Comparing Shared Memory and Message Passing Models
    2.2.2 A Simple Example
    2.2.3 Other Programming Models
2.3 Exercises
In the previous chapter, we mentioned that parallel architectures require effort from programmers to write parallel programs. Writing a parallel program involves breaking the program down into tasks that can be executed correctly and independently of one another; these tasks then need to communicate and cooperate in order to produce the same output as the original sequential program. Given the prevalence of multicore architectures, even desktop and mobile platforms require parallel programming to make use of multiple cores.
In the next three chapters, this book will cover parallel programming, starting with various parallel programming models (this chapter), then the steps in creating shared memory parallel programs (Chapter 3), and how to write parallel programs for applications with linked data structures (Chapter 4). The purpose of these chapters is to give readers sufficient background to gain a perspective on the programming abstractions and primitives that a parallel multicore architecture needs to support, and how efficient such support needs to be. Such background knowledge will be useful when readers move on to later chapters that discuss the architecture of parallel multicores.
This chapter discusses the author's perspectives on parallel programming models. The chapter starts by discussing the limits of parallel program performance. The discussion points out the significant challenges in producing scalable performance on a large number of processor cores, which include the fraction of execution that cannot be parallelized, load imbalance, and various overheads from managing parallel execution, such as communication and synchronization. It then discusses several popular parallel programming models that are widely used, important differences between them, and the relationship between the programming models and the architecture of the system. The choice of programming model has important implications for communication and synchronization overheads, hence it is a critical factor in parallel program performance.
The book's coverage of parallel programming is not meant to be comprehensive; there are numerous other textbooks that cover parallel programming models and techniques. In this book, the chapters place an emphasis on only one programming model, the shared memory programming model, which is chosen because most multicore systems currently support it. Other programming models are useful and widely used in other situations: for example, message passing is commonly used for large systems containing many multicore nodes, MapReduce is commonly used for data center computation on distributed systems, etc. However, in the author's view, the shared memory programming model is a necessary prerequisite for readers interested in learning parallel multicore architectures. Thus, it is selected for deeper coverage.
2.1 Limits on Parallel Program Performance
Fundamentally, what programmers expect from parallel execution of their algorithm or code is a shorter execution time than its sequential execution. One useful tool for reasoning about parallel program execution time is Amdahl's Law. Suppose that the sequential execution of the algorithm or code takes T_1 time units, while parallel execution of the algorithm or code on p processors (assuming p threads) takes T_p time units. Suppose that out of the entire execution of the program, a fraction s is not parallelizable while the remaining 1 - s fraction is parallelizable. Recall from Equation 1.4 that the idealized (or Amdahl's) speedup of the parallel execution compared to the sequential execution is:

\[
\mathit{speedup}_p = \frac{1}{s + \frac{1-s}{p}} \qquad (2.1)
\]

and when we have a very large number of processors:

\[
\mathit{speedup}_\infty = \lim_{p \to \infty} \frac{1}{s + \frac{1-s}{p}} = \frac{1}{s}
\]
The equation above shows that obtaining a high degree of scalability is hard: it requires nearly all parts of a program to be perfectly parallelizable, and the parts that are not parallelized must remain very small for the parallel execution to scale. Another implication is that increasing speedup by increasing the number of processors yields diminishing returns. This can be seen in Figure 2.1, which plots Amdahl's speedups for three values of the sequential fraction: 1%, 5%, and 10%. We can observe the declining steepness of all the speedup curves as the number of threads/processors increases. In other words, there is a declining marginal speedup from increasing the number of processors.
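To make the diminishing returns concrete, the following is a minimal sketch (illustrative code, not taken from the book) that evaluates Equation 2.1 for a few processor counts and prints the 1/s asymptote; the function name and the particular processor counts are arbitrary choices.

/* Evaluate Amdahl's speedup (Equation 2.1) for a few serial fractions and
 * processor counts, illustrating the diminishing marginal speedup that
 * Figure 2.1 plots. */
#include <stdio.h>

/* Amdahl's speedup for serial fraction s on p processors. */
static double amdahl_speedup(double s, int p)
{
    return 1.0 / (s + (1.0 - s) / p);
}

int main(void)
{
    const double fractions[] = { 0.01, 0.05, 0.10 };  /* s = 1%, 5%, 10% */
    const int procs[] = { 2, 8, 32, 128, 1024 };

    for (int i = 0; i < 3; i++) {
        double s = fractions[i];
        printf("s = %.0f%%: ", s * 100.0);
        for (int j = 0; j < 5; j++)
            printf("p=%d -> %.1fx  ", procs[j], amdahl_speedup(s, procs[j]));
        /* The asymptote 1/s bounds the speedup, no matter how many processors. */
        printf("(limit %.0fx)\n", 1.0 / s);
    }
    return 0;
}

For s = 5%, for example, going from 32 to 1,024 processors raises the speedup only from roughly 12.5x to 19.6x, despite using 32 times as many processors.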
In practice, the sequential fraction s is not the only factor affecting speedups. There are other factors that reduce speedups, such as load imbalance, synchronization overheads, communication overheads, and thread management overheads. Many of these overheads increase as the number of processors increases. For example, thread management overheads increase at least linearly with the number of threads because threads need to be spawned, assigned tasks from a protected task queue, etc. Synchronization overheads tend to increase at least quadratically with the number of processors: as the number of processors increases, the time to perform barrier synchronization, acquire locks, etc. becomes longer and takes a larger fraction of the original execution time.

Figure 2.1: Amdahl's speedups illustrated with different values for the sequential fraction s (curves for s = 1%, 5%, and 10%; speedup versus number of processors).

Communication overheads
also tend to increase because a higher number of processors cannot be interconnected with very
short physical distances. The communication latency between any random pair of nodes increases
with the number of processors. Thread management overheads also increase with a higher number
of processors because thread management requires synchronization and communication. Hence,
if we include these overheads into s, s tends to increase as the number of processors increases,
independent of the serial fraction inherent in the algorithm.
Did you know?
Under the perfect situation in which no fraction of the program is non-parallelizable (i.e., s = 0), the speedup becomes speedup_p = p, which is referred to as a linear speedup. Linear speedup is the theoretical maximum speedup attainable. However, in practice some programs have been known to show super-linear speedups on some number of processors, where their speedups exceed the linear speedup. The reason for this is that as the number of processors grows, the combined aggregate resources (e.g., the total cache space or memory space) also grow. At some tipping point, adding extra processors changes the situation from one in which the working set of the program exceeds the total cache space to one in which the working set fits in the total cache space. When this occurs, cache capacity misses are reduced significantly, the execution of all threads becomes much faster, and the result is a super-linear speedup.
Figure 2.2 shows what happens to the speedup curves if we take into account linearly-increasing overheads (s = 0.01 + x/10^4) or quadratically-increasing overheads (s = 0.01 + x^2/10^5), where x is the number of processors. Note that the coefficients in the cost functions are arbitrary, hence we should focus on the trends of the curves rather than their magnitudes. The figure shows that the effect of a linearly-increasing cost is a reduced maximum speedup, which is reached at a smaller number of processors. The same observation can be made with a quadratically-increasing cost. Note also that with these overheads, there is a speedup-optimum number of threads that, when exceeded, causes speedups to decline.
Figure 2.2: Amdahl's speedups if we include parallel overheads in s (curves for s = 1%, s = 1% + x/10,000, and s = 1% + x^2/100,000; speedup versus number of processors).
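The model behind Figure 2.2 can also be explored numerically. The sketch below (an illustration using the same arbitrary coefficients as the figure's legend, not code from the book) sweeps the processor count with a quadratically-growing overhead term and reports where the speedup peaks.

/* Effective serial fraction grows with the processor count x, so the
 * speedup peaks and then falls. The coefficients follow Figure 2.2's
 * legend and are arbitrary, as the text notes. */
#include <stdio.h>

static double speedup(double s, int p)
{
    return 1.0 / (s + (1.0 - s) / p);
}

int main(void)
{
    int best_p = 1;
    double best = 1.0;

    for (int x = 1; x <= 200; x++) {
        double s = 0.01 + (double)x * x / 100000.0;  /* quadratic overhead */
        double sp = speedup(s, x);
        if (sp > best) {
            best = sp;
            best_p = x;
        }
    }
    /* Beyond best_p, adding processors makes the program slower, not faster. */
    printf("peak speedup %.1fx at %d processors\n", best, best_p);
    return 0;
}

With these particular coefficients the peak occurs at a few dozen processors; past that point, adding processors lowers the speedup.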
The diminishing returns from increasing the number of threads imply that there is only a limited amount of low-hanging fruit to be gained from exploiting thread-level parallelism. Beyond an inflection point, it is not economical to keep increasing the number of cores in a multicore chip, as the marginal performance benefit is either too small or negative.
From an energy efficiency perspective, thread-level parallelism has a declining energy efficiency as the number of threads increases. The reason is that power consumption likely increases at least linearly with the number of processors, while speedups increase by a smaller factor; thus, the power-to-performance ratio grows as the number of threads increases, which is equivalent to declining energy efficiency. Only when speedups scale perfectly with the number of threads can we maintain energy efficiency. Although moving from an instruction-level parallelism focus to a thread-level parallelism focus brings an improvement in energy efficiency, the gain in energy efficiency cannot be sustained by simply increasing the number of processors. Therefore, if the power wall problem discussed in Chapter 1 worsens, chip designers will need to rely on other approaches to improve energy efficiency.
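As a rough numerical illustration (the numbers are assumed for the sake of the example, not taken from the text): suppose power grows linearly with the number of cores and the serial fraction is s = 5%. Then for 16 cores,

\[
\mathit{speedup}_{16} = \frac{1}{0.05 + \frac{0.95}{16}} \approx 9.1,
\qquad
\frac{E_{16}}{E_1} = \frac{P_{16}\, T_{16}}{P_1\, T_1} \approx \frac{16 P_1 \cdot T_1 / 9.1}{P_1 T_1} \approx 1.75 .
\]

That is, the 16-core run finishes about nine times sooner but consumes roughly 75% more energy for the same work, so its energy efficiency is lower.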
It is often the case that achieving a highly scalable parallel program requires using a very large
input set. Increasing the size of computation often allows the overheads of synchronization, com-
munication, and thread management to contribute to a smaller fraction of execution compared to
that of computation. Fortunately, many important scientific simulations do have very large inputs. The use of an input size that is scaled proportionally with the number of threads is often referred to as weak scaling, as opposed to strong scaling, where the input size is fixed regardless of the number of threads. There is some validity to the arguments for using weak scaling, since large systems will likely be used for computation with larger input sets.
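As a small illustration (a hypothetical setup, not an example from the book), the sketch below shows how the problem size would be chosen for the two methodologies: under strong scaling each thread's share of a fixed total shrinks as threads are added, while under weak scaling the per-thread share stays constant because the total input grows with the thread count.

/* Hypothetical baseline problem size N0; compare how much work each
 * thread receives under strong versus weak scaling. */
#include <stdio.h>

#define N0 1000000L  /* baseline problem size for one thread (hypothetical) */

int main(void)
{
    for (int p = 1; p <= 64; p *= 4) {
        long strong_total = N0;        /* strong scaling: total work is fixed   */
        long weak_total   = N0 * p;    /* weak scaling: work grows with threads */
        printf("p=%2d  strong: %ld per thread   weak: %ld per thread\n",
               p, strong_total / p, weak_total / p);
    }
    return 0;
}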
Did you know?
In Amdahl's law, there is an ambiguity as to what should constitute the execution time of a sequential execution of the program (T_1). The choices include:
1. The same parallel program used in parallel execution, but run with a single thread on a single processor.
2. The sequential version of the program.
3. The best sequential version of the program.
The first choice is incorrect to use; the sequential execution should not include
parallel overheads, such as thread creation, calls to parallel libraries, etc. For ex-
ample, when a program containing OpenMP directives is compiled with an OpenMP
compiler, it includes calls to OpenMP libraries even when it runs on a single thread.
However, using thread-safe libraries, which include the overheads of entering and ex-
iting critical sections, for sequential execution may be acceptable if they are the only
libraries that are available.
The second choice of using the sequential version of the program is more correct
since parallel overheads are not included, but care must also be taken. A potential
pitfall of the choice is whether the sequential version of the program is a realistic
implementation for sequential execution.
The correct choice is the last option, which uses the best sequential version of the program, both algorithm-wise and code-wise, to measure T_1. For example, some algorithms are developed for the purpose of extracting parallelism, but they may not be the best choice for a sequential implementation. The second aspect is whether the code is the best sequential code given the algorithm. For example, some aggressive compiler optimizations that can benefit sequential execution are often not performed when the program is compiled for parallel execution, but they can produce better-performing sequential code when they are applied.
2.2 Parallel Programming Models
A programming model is an abstraction of the hardware that programmers can assume. It determines how easily programmers can express their algorithms as computation tasks that the hardware and compiler understand, and how efficiently those tasks can be executed by the hardware. For non-parallel systems, the sequential programming model has worked remarkably well in hiding the hardware details from programmers, allowing them to express their algorithms intuitively as a sequence of steps executed one after another. In contrast, for multiprocessor systems, a programming model that hides hardware details from the programmer while simultaneously achieving high efficiency remains elusive.
There are at least two major parallel programming models in wide use: the shared memory and the message passing programming models. The message passing model is more popular for large machines consisting of hundreds to thousands of processor cores, while the shared memory model is more popular for smaller machines. There are also other programming models; we will defer their discussion to a later part of the chapter.
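To preview the contrast before the chapter's own example in Section 2.2.2, the following is a minimal shared memory sketch (illustrative code using POSIX threads, not the book's example): the threads cooperate on a sum simply by reading and writing shared variables, with a lock providing synchronization. A message passing version of the same reduction would instead exchange the partial sums between processes with explicit send and receive operations (e.g., MPI_Send and MPI_Recv), since no memory is shared between tasks.

/* Shared memory abstraction: all threads read and write the same variables,
 * and a lock orders the updates to the shared accumulator.
 * Compile with: gcc sum.c -lpthread */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000

static int a[N];                 /* shared array, visible to every thread   */
static long sum = 0;             /* shared accumulator                      */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *partial_sum(void *arg)
{
    long id = (long)arg;
    long local = 0;
    /* Each thread sums its own contiguous chunk of the shared array. */
    for (int i = id * (N / NTHREADS); i < (id + 1) * (N / NTHREADS); i++)
        local += a[i];
    pthread_mutex_lock(&lock);   /* synchronization protects the shared sum */
    sum += local;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (int i = 0; i < N; i++)
        a[i] = i;
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, partial_sum, (void *)t);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    printf("sum = %ld\n", sum);  /* 0 + 1 + ... + 999 = 499500 */
    return 0;
}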
Let us define a parallel task as a unit of computation that can be executed independently from
others. Parallel tasks can run on different processors or cores. Figure 2.3 illustrates the features
of the programming models. In the shared memory model, the abstraction is that parallel tasks
executed by separate threads or processes can access any location of the memory. Hence, they com-