Chapter 2
Perspectives on Parallel Programming
Contents
2.1 Limits on Parallel Program Performance
2.2 Parallel Programming Models
    2.2.1 Comparing Shared Memory and Message Passing Models
    2.2.2 A Simple Example
    2.2.3 Other Programming Models
2.3 Exercises
In the previous chapter, we mentioned that parallel architectures require effort from programmers to write parallel programs. Writing a parallel program involves breaking the program down into tasks that can be executed correctly and independently of one another; these tasks then need to communicate and cooperate in order to produce the same output as the original sequential program. Given the prevalence of multicore architectures, even desktop and mobile platforms require parallel programming to make use of multiple cores.
In the next three chapters, this book will cover parallel programming, starting with various parallel programming models (this chapter), then the steps in creating shared memory parallel programs (Chapter 3), and how to write parallel programs for applications with linked data structures (Chapter 4). The purpose of these chapters is to give readers sufficient background to gain a perspective on the programming abstractions and primitives that a parallel multicore architecture needs to support, and how efficient such support needs to be. Such background knowledge will be useful when readers move on to later chapters that discuss the architecture of parallel multicores.
This chapter discusses the author's perspectives on parallel programming models. The chapter starts by discussing the limits of parallel program performance. The discussion points out the significant challenges in producing scalable performance on a large number of processor cores, which include the fraction of execution that cannot be parallelized, load imbalance, and various overheads from managing parallel execution, such as communication and synchronization. It then discusses several popular parallel programming models that are widely used, important differences between them, and the relationship between the programming models and the architecture of the system. The choice of programming model has important implications for communication and synchronization overheads, hence it is a critical factor in parallel program performance.
The book's coverage of parallel programming is not meant to be comprehensive; there are numerous other textbooks that cover parallel programming models and techniques. In this book, the chapters place an emphasis on only one programming model, the shared memory programming model, which is chosen because most multicore systems currently support it. Other programming models are useful and widely used in other situations: for example, message passing is commonly used for large systems containing many multicore nodes, MapReduce is commonly used for data center computation on distributed systems, etc. However, in the author's view, the shared memory programming model is a necessary prerequisite for readers interested in learning parallel multicore architectures. Thus, it is selected for deeper coverage.
2.1 Limits on Parallel Program Performance
Fundamentally, what programmers expect from parallel execution of their algorithm or code is a shorter execution time than its sequential execution. One useful tool for reasoning about parallel program execution time is Amdahl's Law. Suppose that the sequential execution of the algorithm or code takes T_1 time units, while parallel execution of the algorithm or code on p processors (assuming p threads) takes T_p time units. Suppose that out of the entire execution of the program, a fraction s is not parallelizable while the remaining 1 - s fraction is parallelizable. Recall from Equation 1.4 that the idealized (or Amdahl's) speedup of the parallel execution compared to the sequential execution is:

\[
\mathit{speedup}_p = \frac{1}{s + \frac{1-s}{p}} \qquad (2.1)
\]

and when we have a very large number of processors:

\[
\mathit{speedup}_\infty = \lim_{p \to \infty} \frac{1}{s + \frac{1-s}{p}} = \frac{1}{s}
\]
The equation above shows that obtaining a high degree of scalability is hard: it requires nearly all parts of a program to be perfectly parallelizable, and the parts that are not parallelized must remain very small for the parallel execution to scale. Another implication is that increasing speedup by increasing the number of processors yields diminishing returns. This can be seen in Figure 2.1, which plots Amdahl's speedups for three values of the sequential fraction: 1%, 5%, and 10%. We can observe the declining steepness of all the speedup curves as the number of threads/processors increases. In other words, there is a declining marginal speedup from increasing the number of processors.
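To make the diminishing returns concrete, the following is a minimal sketch (illustrative code, not taken from the book) that evaluates Equation 2.1 for a few processor counts and prints the 1/s asymptote; the function name and the particular processor counts are arbitrary choices.

/* Evaluate Amdahl's speedup (Equation 2.1) for a few serial fractions and
 * processor counts, illustrating the diminishing marginal speedup that
 * Figure 2.1 plots. */
#include <stdio.h>

/* Amdahl's speedup for serial fraction s on p processors. */
static double amdahl_speedup(double s, int p)
{
    return 1.0 / (s + (1.0 - s) / p);
}

int main(void)
{
    const double fractions[] = { 0.01, 0.05, 0.10 };  /* s = 1%, 5%, 10% */
    const int procs[] = { 2, 8, 32, 128, 1024 };

    for (int i = 0; i < 3; i++) {
        double s = fractions[i];
        printf("s = %.0f%%: ", s * 100.0);
        for (int j = 0; j < 5; j++)
            printf("p=%d -> %.1fx  ", procs[j], amdahl_speedup(s, procs[j]));
        /* The asymptote 1/s bounds the speedup, no matter how many processors. */
        printf("(limit %.0fx)\n", 1.0 / s);
    }
    return 0;
}

For s = 5%, for example, going from 32 to 1,024 processors raises the speedup only from roughly 12.5x to 19.6x, despite using 32 times as many processors.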
In practice, the sequential fraction s is not the only factor affecting speedups. There are other factors that reduce speedups, such as load imbalance, synchronization overheads, communication overheads, and thread management overheads. Many of these overheads increase as the number of processors increases. For example, thread management overheads increase at least linearly with the number of threads because threads need to be spawned, assigned tasks from a protected task queue, etc. Synchronization overheads tend to increase at least quadratically with the number of processors: as the number of processors increases, the time to perform barrier synchronization, acquire locks, etc. becomes longer and takes a larger fraction of the original execution time.

Figure 2.1: Amdahl's speedups illustrated with different values for the sequential fraction s (curves for s = 1%, 5%, and 10%; speedup versus number of processors).

Communication overheads
also tend to increase because a higher number of processors cannot be interconnected with very
short physical distances. The communication latency between any random pair of nodes increases
with the number of processors. Thread management overheads also increase with a higher number
of processors because thread management requires synchronization and communication. Hence,
if we include these overheads into s, s tends to increase as the number of processors increases,
independent of the serial fraction inherent in the algorithm.
Did you know?
Under the perfect situation in which no fraction of the program is non-parallelizable (i.e., s = 0), the speedup becomes speedup_p = p, which is referred to as a linear speedup. Linear speedup is the theoretical maximum speedup attainable. However, in practice some programs have been known to show super-linear speedups on some number of processors, where their speedups exceed the linear speedup. The reason for this is that as the number of processors grows, the combined aggregate resources (e.g., the total cache space or memory space) also grow. At some tipping point, adding extra processors changes the situation from one in which the working set of the program exceeds the total cache space to one in which the working set fits in the total cache space. When this occurs, cache capacity misses are reduced significantly, the execution of all threads becomes much faster, and the result is a super-linear speedup.
Figure 2.2 shows what happens to the speedup curves if we take into account linearly-increasing overheads (s = 0.01 + x/10^4) or quadratically-increasing overheads (s = 0.01 + x^2/10^5), where x is the number of processors. Note that the coefficients in the cost functions are arbitrary, hence we should focus on the trends of the curves rather than their magnitudes. The figure shows that the effect of a linearly-increasing cost is a reduced maximum speedup, which is reached at a smaller number of processors. The same observation can be made with a quadratically-increasing cost. Note also that with these overheads, there is a speedup-optimum number of threads that, when exceeded, causes speedups to decline.
Figure 2.2: Amdahl's speedups if we include parallel overheads in s (curves for s = 1%, s = 1% + x/10,000, and s = 1% + x^2/100,000; speedup versus number of processors).
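The model behind Figure 2.2 can also be explored numerically. The sketch below (an illustration using the same arbitrary coefficients as the figure's legend, not code from the book) sweeps the processor count with a quadratically-growing overhead term and reports where the speedup peaks.

/* Effective serial fraction grows with the processor count x, so the
 * speedup peaks and then falls. The coefficients follow Figure 2.2's
 * legend and are arbitrary, as the text notes. */
#include <stdio.h>

static double speedup(double s, int p)
{
    return 1.0 / (s + (1.0 - s) / p);
}

int main(void)
{
    int best_p = 1;
    double best = 1.0;

    for (int x = 1; x <= 200; x++) {
        double s = 0.01 + (double)x * x / 100000.0;  /* quadratic overhead */
        double sp = speedup(s, x);
        if (sp > best) {
            best = sp;
            best_p = x;
        }
    }
    /* Beyond best_p, adding processors makes the program slower, not faster. */
    printf("peak speedup %.1fx at %d processors\n", best, best_p);
    return 0;
}

With these particular coefficients the peak occurs at a few dozen processors; past that point, adding processors lowers the speedup.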
The diminishing returns from increasing the number of threads imply that there is only a limited amount of low-hanging fruit to be gained from exploiting thread-level parallelism. Beyond an inflection point, it is not economical to keep increasing the number of cores in a multicore chip, as the marginal performance benefit is either too small or negative.
From an energy efficiency perspective, thread-level parallelism has a declining energy efficiency as the number of threads increases. The reason is that power consumption likely increases at least linearly with the number of processors, while speedups increase by a smaller factor; thus, the power-to-performance ratio grows as the number of threads increases, which is equivalent to declining energy efficiency. Only when speedups scale perfectly with the number of threads can we maintain energy efficiency. Although moving from an instruction-level parallelism focus to a thread-level parallelism focus brings an improvement in energy efficiency, the gain in energy efficiency cannot be sustained by simply increasing the number of processors. Therefore, if the power wall problem discussed in Chapter 1 worsens, chip designers will need to rely on other approaches to improve energy efficiency.
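As a rough numerical illustration (the numbers are assumed for the sake of the example, not taken from the text): suppose power grows linearly with the number of cores and the serial fraction is s = 5%. Then for 16 cores,

\[
\mathit{speedup}_{16} = \frac{1}{0.05 + \frac{0.95}{16}} \approx 9.1,
\qquad
\frac{E_{16}}{E_1} = \frac{P_{16}\, T_{16}}{P_1\, T_1} \approx \frac{16 P_1 \cdot T_1 / 9.1}{P_1 T_1} \approx 1.75 .
\]

That is, the 16-core run finishes about nine times sooner but consumes roughly 75% more energy for the same work, so its energy efficiency is lower.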
It is often the case that achieving a highly scalable parallel program requires using a very large
input set. Increasing the size of computation often allows the overheads of synchronization, com-
munication, and thread management to contribute to a smaller fraction of execution compared to
that of computation. Fortunately, many important scientific simulations do have very large inputs. The use of an input size that is scaled proportionally with the number of threads is often referred to as weak scaling, as opposed to strong scaling, where the input size is fixed regardless of the number of threads. There is some validity to the arguments for using weak scaling, since large systems will likely be used for computation with larger input sets.
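As a small illustration (a hypothetical setup, not an example from the book), the sketch below shows how the problem size would be chosen for the two methodologies: under strong scaling each thread's share of a fixed total shrinks as threads are added, while under weak scaling the per-thread share stays constant because the total input grows with the thread count.

/* Hypothetical baseline problem size N0; compare how much work each
 * thread receives under strong versus weak scaling. */
#include <stdio.h>

#define N0 1000000L  /* baseline problem size for one thread (hypothetical) */

int main(void)
{
    for (int p = 1; p <= 64; p *= 4) {
        long strong_total = N0;        /* strong scaling: total work is fixed   */
        long weak_total   = N0 * p;    /* weak scaling: work grows with threads */
        printf("p=%2d  strong: %ld per thread   weak: %ld per thread\n",
               p, strong_total / p, weak_total / p);
    }
    return 0;
}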
Did you know?
In Amdahl's law, there is an ambiguity as to what should constitute the execution time of a sequential execution of the program (T_1). The choices include:
1. The same parallel program used in parallel execution, but run with a single thread on a single processor.
2. The sequential version of the program.
3. The best sequential version of the program.
The first choice is incorrect to use; the sequential execution should not include
parallel overheads, such as thread creation, calls to parallel libraries, etc. For ex-
ample, when a program containing OpenMP directives is compiled with an OpenMP
compiler, it includes calls to OpenMP libraries even when it runs on a single thread.
However, using thread-safe libraries, which include the overheads of entering and ex-
iting critical sections, for sequential execution may be acceptable if they are the only
libraries that are available.
The second choice of using the sequential version of the program is more correct
since parallel overheads are not included, but care must also be taken. A potential
pitfall of the choice is whether the sequential version of the program is a realistic
implementation for sequential execution.
The correct choice is the last option, which uses the best sequential version of the program, both algorithm-wise and code-wise, to measure T_1. For example, some algorithms are developed for the purpose of extracting parallelism, but they may not be the best choice for a sequential implementation. The second aspect is whether the code is the best sequential code given the algorithm. For example, some aggressive compiler optimizations that can benefit sequential execution are often not performed when the program is compiled for parallel execution, but they can produce better-performing sequential code when they are applied.
2.2 Parallel Programming Models
A programming model is an abstraction of the hardware that programmers can assume. It determines how easily programmers can express their algorithms as computation tasks that the hardware and compiler understand, and how efficiently those tasks can be executed by the hardware. For non-parallel systems, the sequential programming model has worked remarkably well in hiding the hardware details from programmers, allowing them to express their algorithms intuitively as a sequence of steps executed one after another. In contrast, for multiprocessor systems, a programming model that hides hardware details from the programmer while simultaneously achieving high efficiency remains elusive.
There are at least two major parallel programming models in wide use: the shared memory and the message passing programming models. The message passing model is more popular for large machines consisting of hundreds to thousands of processor cores, while the shared memory model is more popular for smaller machines. There are also other programming models; we will defer their discussion to a later part of the chapter.
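To preview the contrast before the chapter's own example in Section 2.2.2, the following is a minimal shared memory sketch (illustrative code using POSIX threads, not the book's example): the threads cooperate on a sum simply by reading and writing shared variables, with a lock providing synchronization. A message passing version of the same reduction would instead exchange the partial sums between processes with explicit send and receive operations (e.g., MPI_Send and MPI_Recv), since no memory is shared between tasks.

/* Shared memory abstraction: all threads read and write the same variables,
 * and a lock orders the updates to the shared accumulator.
 * Compile with: gcc sum.c -lpthread */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000

static int a[N];                 /* shared array, visible to every thread   */
static long sum = 0;             /* shared accumulator                      */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *partial_sum(void *arg)
{
    long id = (long)arg;
    long local = 0;
    /* Each thread sums its own contiguous chunk of the shared array. */
    for (int i = id * (N / NTHREADS); i < (id + 1) * (N / NTHREADS); i++)
        local += a[i];
    pthread_mutex_lock(&lock);   /* synchronization protects the shared sum */
    sum += local;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (int i = 0; i < N; i++)
        a[i] = i;
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, partial_sum, (void *)t);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    printf("sum = %ld\n", sum);  /* 0 + 1 + ... + 999 = 499500 */
    return 0;
}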
Let us define a parallel task as a unit of computation that can be executed independently from
others. Parallel tasks can run on different processors or cores. Figure 2.3 illustrates the features
of the programming models. In the shared memory model, the abstraction is that parallel tasks
executed by separate threads or processes can access any location of the memory. Hence, they com-