Table 8.5: Illustration of ABQL mechanism.

Steps                        | next_ticket | can_serve    | P1 my_ticket | P2 my_ticket | P3 my_ticket | Comments
-----------------------------|-------------|--------------|--------------|--------------|--------------|----------------------
Initially                    | 0           | [1, 0, 0, 0] | 0            | 0            | 0            | all initialized to 0
P1: fetch&inc(next_ticket)   | 1           | [1, 0, 0, 0] | 0            | 0            | 0            | P1 tries to acq lock
P2: fetch&inc(next_ticket)   | 2           | [1, 0, 0, 0] | 0            | 1            | 0            | P2 tries to acq lock
P3: fetch&inc(next_ticket)   | 3           | [1, 0, 0, 0] | 0            | 1            | 2            | P3 tries to acq lock
P1: can_serve[1]=1;          | 3           | [0, 1, 0, 0] | 0            | 1            | 2            | P1 rels, P2 acqs lock
    can_serve[0]=0           |             |              |              |              |              |
P2: can_serve[2]=1;          | 3           | [0, 0, 1, 0] | 0            | 1            | 2            | P2 rels, P3 acqs lock
    can_serve[1]=0           |             |              |              |              |              |
P3: can_serve[3]=1;          | 3           | [0, 0, 0, 1] | 0            | 1            | 2            | P3 rels lock
    can_serve[2]=0           |             |              |              |              |              |
Note that since there are O(p) acquisitions and releases, the total lock release traffic scales on the order of O(p), a significant improvement over the ticket lock implementation. However, note that the scalability of fetch_and_inc depends on its underlying implementation. If it is implemented with LL/SC, its scalability may partially restrict the overall scalability of the ABQL implementation.
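To make the mechanism in Table 8.5 concrete, the following is a minimal sketch of ABQL acquire and release routines (illustrative only, not taken from the text). It assumes an atomic fetch_and_inc() primitive and a fixed upper bound MAX_PROC on the number of processors; in a real implementation each element of can_serve would also be padded to reside in its own cache block.

    #define MAX_PROC 4                          /* assumed upper bound on processors */

    volatile int can_serve[MAX_PROC] = {1};     /* slot 0 starts at 1; the rest are 0 */
    volatile int next_ticket = 0;

    void acquire_lock(int *my_ticket)
    {
        /* obtain a unique queue position atomically */
        *my_ticket = fetch_and_inc(&next_ticket) % MAX_PROC;
        while (can_serve[*my_ticket] == 0)
            ;                                   /* spin on this thread's own slot */
    }

    void release_lock(int my_ticket)
    {
        can_serve[(my_ticket + 1) % MAX_PROC] = 1;  /* wake only the next waiter */
        can_serve[my_ticket] = 0;                   /* reset own slot for reuse */
    }

With this structure, a release writes a single array element, so it invalidates at most one waiting processor's cache block, which is the source of the O(1) release traffic noted above.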
8.1.8 Qualitative Comparison of Lock Implementations
Table 8.6 compares the various lock implementations based on several criteria: the uncontended latency, the amount of traffic following a single lock release, the amount of traffic generated while the lock is held by a processor, the storage overhead, and whether a fairness guarantee is provided.
Table 8.6: Comparison of various lock implementations.

Criteria               | test&set | TTSL   | LL/SC  | Ticket | ABQL
-----------------------|----------|--------|--------|--------|-------
Uncontended latency    | Lowest   | Lower  | Lower  | Higher | Higher
1 Release max traffic  | O(p)     | O(p)   | O(p)   | O(p)   | O(1)
Wait traffic           | High     | --     | --     | --     | --
Storage                | O(1)     | O(1)   | O(1)   | O(1)   | O(p)
Fairness guarantee?    | No       | No     | No     | Yes    | Yes
The uncontended latency is lowest for the simpler lock implementations: test-and-set, TTSL, and LL/SC. The ticket lock and ABQL implementations execute more instructions, so their uncontended lock acquisition latency is higher.
In terms of the traffic following a single lock release, assuming all other threads are waiting to acquire the lock next, test-and-set, TTSL, LL/SC, and the ticket lock generate the highest amount of
maximum traffic. This is because all threads spin on the same variable so they may all cache the
same block, and each lock release invalidates all other processors and forces them to suffer misses
to reload the block. On the other hand, in ABQL, a single release invalidates only one other cache, causing only one subsequent cache miss.
In terms of the traffic generated while a lock is held by a thread, test-and-set performs very poorly because all threads keep attempting to acquire the lock using an atomic instruction, and even a failed attempt still invalidates all sharers, which must then suffer misses to reload the block before their subsequent acquisition attempts. On the other hand, TTSL, LL/SC, the ticket lock, and ABQL spin-wait using a load instruction. Hence, once a thread discovers that the lock is held, it does not attempt to execute an atomic instruction.
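To illustrate the difference, here is a hypothetical sketch (not from the text) contrasting the two spin-wait strategies; test_and_set() stands in for whatever atomic read-modify-write instruction the machine provides.

    volatile int lockvar = 0;

    void ts_acquire(void)              /* test&set: every iteration executes the   */
    {                                  /* atomic instruction, generating            */
        while (test_and_set(&lockvar) == 1)   /* invalidations even on failure      */
            ;
    }

    void ttsl_acquire(void)            /* TTSL: spin with plain loads; attempt the */
    {                                  /* atomic instruction only when the lock    */
        while (1) {                    /* appears to be free                       */
            while (lockvar == 1)
                ;                      /* reads served from the local cache        */
            if (test_and_set(&lockvar) == 0)
                break;
        }
    }

While the lock is held, the TTSL loop keeps re-reading a locally cached copy and generates no bus traffic, whereas the test-and-set loop generates traffic on every failed attempt.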
In terms of storage requirements, most of the lock implementations use only one or two shared variables, so their storage requirement is constant in the number of processors. Only ABQL has a storage requirement that scales with the number of processors, due to keeping the array can_serve.
In terms of the guarantee of fairness, only the ticket lock and ABQL provide it, due to their use of a queue. In the other implementations, it is possible (though in reality perhaps not likely) that a processor has a better chance than others of acquiring the lock following a release. For example, this may occur when a thread that releases the lock (invalidating the other caches) quickly re-acquires it to reenter the critical section. In the meantime, other processors have not even had a chance to reattempt a lock acquisition because they are still reloading the invalidated block. In this case, it is possible for a thread to keep acquiring and releasing the lock at the expense of other threads' ability to acquire it. Other causes are also possible. For example, if the bus arbitration logic favors some requesters over others, some processors may be granted the bus faster than others. In a distributed shared memory system, one processor may reload an invalidated block faster than others, for example because it reloads the block from its local memory whereas others must reload it from a remote memory. Following a release, the processor that reloads the block from its local memory may race past the others to attempt a lock acquisition.
Note, however, that the fairness guarantee creates a performance risk. If a thread that is already in the queue waiting for the lock is context switched out, then even when the lock becomes available, that thread will not attempt to acquire it. Other threads with larger ticket numbers cannot acquire the lock either, because they must wait until the thread that was switched out has acquired and released the lock. Thus, the performance of all threads is degraded by the context switch of a single thread. Therefore, care must be taken to ensure that context switches do not occur when using the ticket lock or ABQL implementations.
From the point of view of software, it is not immediately clear which lock implementation is best. For software that exhibits a high degree of lock contention, ABQL offers the highest scalability. However, a high degree of lock contention is often a symptom of a deeper scalability problem, such as a lock granularity that is too coarse. In such a case, using ABQL improves scalability but may not make the program scalable. Better scalability can be obtained by using a finer lock granularity. For example, in the linked list parallelization discussed in Chapter 4, the fine-grain lock approach, in which each node is augmented with its own lock, will likely keep the contention level of any particular lock very low. In that case, ABQL is not only unnecessary, it is not even preferred, due to its high uncontended latency and the fact that the storage overhead of one array per node would be too high.
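As a rough sketch of the fine-grain approach (illustrative only; the node layout is assumed, not taken from Chapter 4), each node simply carries its own lock variable:

    typedef struct node {
        int          key;     /* node data                                    */
        struct node *next;    /* next node in the list                        */
        lock_type    lock;    /* per-node lock; a simple TTSL lock suffices   */
    } node_t;

Because threads operating on different nodes acquire different locks, contention on any single lock remains low, and a simple, low-latency lock implementation is typically the better choice.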
8.2 Barrier Implementations
Barriers are very simple and widely used synchronization primitives in parallel programs. We have
discussed in Chapter 3 that in loop-level parallelization, barriers are often used at the end of a paral-
lel loop to ensure all threads have computed their parts of the computation before the computation
moves on to the next step. In the OpenMP parallel for directive, a barrier is automatically
inserted at the end of the parallel loop, and programmers must explicitly remove it if they believe that the lack of a barrier does not impact the correctness of their computation.
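As an illustration, the snippet below (a sketch, not from the text) uses OpenMP's nowait clause to remove the implicit barrier from the first loop; this is safe here only because the two loops touch disjoint arrays.

    #include <omp.h>

    void scale_and_shift(double *a, double *b, int n)
    {
        #pragma omp parallel
        {
            #pragma omp for nowait      /* remove the implicit barrier at loop end */
            for (int i = 0; i < n; i++)
                a[i] = 2.0 * a[i];

            /* safe without the barrier: this loop only touches b */
            #pragma omp for
            for (int i = 0; i < n; i++)
                b[i] = b[i] + 1.0;
        }
    }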
In this section, we will look at how barriers can be implemented and compare the characteristics of the implementations. Barriers can be implemented in software or directly in hardware. A software barrier implementation is flexible but often inefficient, whereas a hardware barrier implementation restricts the flexibility and portability of the software but is often a lot more efficient. The simplest software barrier implementation is the sense-reversal global barrier. The key mechanism is for threads to enter the barrier and spin on a location that will be set by the last thread arriving at the barrier, releasing all threads that are waiting there. Obviously, spinning on a single location often involves a lot of invalidations when the location is written, restricting scalability. One of the more complex but more scalable barrier implementations is the combining tree barrier, in which the threads that participate in a barrier are organized like a tree, and a thread at a particular tree level spins on a common location only with its siblings. This restricts the number of invalidations since threads spin on different locations. Finally, we will look at hardware barrier support that is implemented in some very large machines with thousands of processors.
The criteria for evaluating barrier performance include:
1. Latency: the time spent from entering the barrier to exiting the barrier. The latency in the
barrier should be as small as possible.
2. Traffic: the amount of bytes communicated between processors as a function of the number
of processors.
In contrast to lock implementations, fairness and storage overhead are not important issues because a barrier is a global construct involving many threads.
8.2.1 Sense-Reversing Centralized Barrier
A software barrier implementation assumes only that lock acquisition and release primitives are provided by the system (through software, hardware, or a combination of the two). For example, as long as one of the test-and-set, TTSL, LL/SC, ticket, or ABQL lock implementations is available, a barrier implementation can be constructed as well.
We note that from the point of view of programmers, a barrier must be simple, that is, a barrier should not require any arguments to be passed to it, whether variable names or the number of processors or threads. In the OpenMP standard, for example, a barrier can be invoked simply as #pragma omp barrier. In the actual implementation, arguments may be used, as long as they are not exposed to the programmers.
The basic barrier implementation is shown in Code 8.7. The implementation makes use of several variables. numArrived keeps track of how many threads have so far arrived at the barrier.
barLock is a lock variable used to protect the modification of shared variables in a critical section. canGo is a flag variable that threads spin on to know whether they can go past the barrier yet. Hence, canGo is set by the last arriving thread to release all waiting threads from the barrier.
Code 8.7 Simple (but incorrect) barrier implementation.

// declaration of shared variables used in a barrier
// and their initial values
int numArrived = 0;
lock_type barLock = 0;
int canGo = 0;

// thread-private variable
int myCount = 0;

// barrier implementation
void barrier() {
  lock(&barLock);
  if (numArrived == 0) {         // first thread resets the flag
    canGo = 0;
  }
  numArrived++;
  myCount = numArrived;
  unlock(&barLock);

  if (myCount < NUM_THREADS) {
    while (canGo == 0) {};       // wait for last thread
  }
  else {                         // last thread to arrive
    numArrived = 0;              // reset for next barrier
    canGo = 1;                   // release all threads
  }
}
For example, suppose that three threads P1, P2, and P3 arrive at the barrier in that order. P1 arrives at the barrier first and enters the critical section. It then initializes the variable canGo to 0 so that it and the other threads will wait until the last thread arrives and sets it to 1. It then increments numArrived to 1, assigns its value to myCount, and exits the critical section. Next, it enters a loop to wait until canGo has a value of 1. The second thread P2 enters the barrier, increments numArrived to 2, discovers that it is not the last thread (its myCount has a value of 2, smaller than NUM_THREADS), and also enters the loop to wait until canGo has a value of 1. Finally, the last thread P3 enters the barrier and increments numArrived to 3. It discovers that its myCount is equal to NUM_THREADS, so it is the last thread to arrive at the barrier. It then resets the value of numArrived to 0 so that it can be used in the next barrier. Then, it releases all waiting threads by setting canGo to 1. This allows all waiting threads to get out of the loop and resume computation past the barrier.
Unfortunately, the code described above does not work correctly across more than one barrier.
For example, when the last thread P3 sets the value of canGo to 1, it invalidates all cached copies
of the block where canGo is located. Following the invalidations, the block resides in P3’s cache,
while P1 and P2 try to reload the block. Suppose that before that can happen, P3 enters the next
barrier, now as the first thread that arrives at that second barrier. As the first thread, it resets the
variable canGo to 0. It can do so quickly since it has the block in its cache in a modified state.
When P1 and P2 reload the block, they discover that the value of canGo is 0 as affected by the
second barrier rather than 1 as released from the first barrier. Therefore, P1 and P2 stay in the loop
of the first barrier and never get released, while P3 waits in the loop of the second barrier, never to
be released either.
One possible solution to the “deadlock” situation discussed earlier is to have a two-step release,
in which before the first thread that enters a barrier initializes the canGo variable, it waits until
all threads have been released from the previous barrier. Implementing such a solution requires
another flag, another counter, and extra code, which can be quite costly. Fortunately, there is a
simpler solution based on an observation that the error arises when the first thread entering the next
barrier resets the value of canGo. If we avoid this reset, then the thread that enters the next barrier
will not prevent other threads from getting out of the first barrier. To avoid resetting the value of
canGo, in the second barrier threads can instead wait until the value of canGo changes back to 0, which is accomplished by the last thread that enters the second barrier. With this approach, the value of canGo that releases threads alternates between 1 in the first barrier, 0 in the second, 1 in the third, 0 in the fourth, and so on. Because the value toggles between barriers, this solution is referred to as the sense-reversing centralized barrier. The code for the barrier is shown in the following (Code 8.8).
Code 8.8 Sense-reversing barrier implementation.

// declaration of shared variables used in a barrier
int numArrived = 0;
lock_type barLock = 0;
int canGo = 0;

// thread-private variables
int valueToWait = 0;
int myCount = 0;

// barrier implementation
void barrier() {
  valueToWait = 1 - valueToWait;   // toggle the sense to wait for
  lock(&barLock);
  numArrived++;
  myCount = numArrived;
  unlock(&barLock);

  if (myCount < NUM_THREADS) {     // wait for last thread
    while (canGo != valueToWait) {};
  }
  else {                           // last thread to arrive
    numArrived = 0;                // reset for next barrier
    canGo = valueToWait;           // release all threads
  }
}
The code shows that each thread, when entering a barrier, first toggles the value that it will wait for, from 0 to 1 or from 1 to 0. Then, it increments the counter and waits until the value of canGo is changed by the last thread. The last thread is the one that toggles the value of canGo.
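As a usage sketch (assuming a pthreads environment, and that the barrier() routine of Code 8.8, its lock primitives, and a thread-private valueToWait are linked in), each thread simply calls barrier() at the end of every computation step:

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4
    #define NUM_STEPS   3

    void barrier(void);             /* the sense-reversing barrier of Code 8.8 */

    void *worker(void *arg)
    {
        long id = (long)arg;
        for (int step = 0; step < NUM_STEPS; step++) {
            /* ... this thread's share of the work for this step ... */
            printf("thread %ld finished step %d\n", id, step);
            barrier();              /* wait until all threads finish the step */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NUM_THREADS];
        for (long i = 0; i < NUM_THREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NUM_THREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }

Because each thread toggles its private valueToWait on every call, consecutive barriers alternate the sense being waited for, so canGo never needs to be explicitly reset between barriers.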
The centralized (or global) barrier implementation uses a critical section inside the barrier routine, so as the number of threads grows, the time spent in the barrier increases linearly. In fact, it may increase super-linearly depending on the underlying lock implementation. In addition, the traffic is