Table 8.5: Illustration of ABQL mechanism.

Steps                        | next_ticket | can_serve    | P1 my_ticket | P2 my_ticket | P3 my_ticket | Comments
-----------------------------|-------------|--------------|--------------|--------------|--------------|----------------------
Initially                    | 0           | [1, 0, 0, 0] | 0            | 0            | 0            | all initialized to 0
P1: fetch&inc(next_ticket)   | 1           | [1, 0, 0, 0] | 0            | 0            | 0            | P1 tries to acq lock
P2: fetch&inc(next_ticket)   | 2           | [1, 0, 0, 0] | 0            | 1            | 0            | P2 tries to acq lock
P3: fetch&inc(next_ticket)   | 3           | [1, 0, 0, 0] | 0            | 1            | 2            | P3 tries to acq lock
P1: can_serve[1]=1;          | 3           | [0, 1, 0, 0] | 0            | 1            | 2            | P1 rels, P2 acqs lock
    can_serve[0]=0           |             |              |              |              |              |
P2: can_serve[2]=1;          | 3           | [0, 0, 1, 0] | 0            | 1            | 2            | P2 rels, P3 acqs lock
    can_serve[1]=0           |             |              |              |              |              |
P3: can_serve[3]=1;          | 3           | [0, 0, 0, 1] | 0            | 1            | 2            | P3 rels lock
    can_serve[2]=0           |             |              |              |              |              |
Note that since there are O(p) acquisitions and releases, the total lock release traffic scales on the order of O(p), a significant improvement over the ticket lock implementation. However, note that the scalability of fetch_and_inc depends on its underlying implementation. If it is implemented with LL/SC, its scalability may partially restrict the overall scalability of the ABQL implementation.
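To make the mechanism in Table 8.5 concrete, the following is a minimal sketch of ABQL acquire and release routines (illustrative only, not taken from the text). It assumes an atomic fetch_and_inc() primitive and a fixed upper bound MAX_PROC on the number of processors; in a real implementation each element of can_serve would also be padded to reside in its own cache block.

    #define MAX_PROC 4                          /* assumed upper bound on processors */

    volatile int can_serve[MAX_PROC] = {1};     /* slot 0 starts at 1; the rest are 0 */
    volatile int next_ticket = 0;

    void acquire_lock(int *my_ticket)
    {
        /* obtain a unique queue position atomically */
        *my_ticket = fetch_and_inc(&next_ticket) % MAX_PROC;
        while (can_serve[*my_ticket] == 0)
            ;                                   /* spin on this thread's own slot */
    }

    void release_lock(int my_ticket)
    {
        can_serve[(my_ticket + 1) % MAX_PROC] = 1;  /* wake only the next waiter */
        can_serve[my_ticket] = 0;                   /* reset own slot for reuse */
    }

With this structure, a release writes a single array element, so it invalidates at most one waiting processor's cache block, which is the source of the O(1) release traffic noted above.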
8.1.8 Qualitative Comparison of Lock Implementations
Table 8.6 compares the various lock implementations based on several criteria: the uncontended latency, the amount of traffic following a single lock release, the amount of traffic generated while the lock is held by a processor, the storage overhead, and whether a fairness guarantee is provided.
Table 8.6: Comparison of various lock implementations.

Criteria               | test&set | TTSL   | LL/SC  | Ticket | ABQL
-----------------------|----------|--------|--------|--------|-------
Uncontended latency    | Lowest   | Lower  | Lower  | Higher | Higher
1 Release max traffic  | O(p)     | O(p)   | O(p)   | O(p)   | O(1)
Wait traffic           | High     | --     | --     | --     | --
Storage                | O(1)     | O(1)   | O(1)   | O(1)   | O(p)
Fairness guarantee?    | No       | No     | No     | Yes    | Yes
The uncontended latency is lowest for the simpler lock implementations: test-and-set, TTSL, and LL/SC. The ticket lock and ABQL implementations execute more instructions, so their uncontended lock acquisition latency is higher.
In terms of the traffic following a single lock release, assuming all other threads are waiting to acquire the lock next, test-and-set, TTSL, LL/SC, and the ticket lock generate the highest amount of
maximum traffic. This is because all threads spin on the same variable so they may all cache the
same block, and each lock release invalidates all other processors and forces them to suffer misses
to reload the block. On the other hand, in ABQL, a single release invalidates only one other cache, causing only one subsequent cache miss.
In terms of the traffic generated while a lock is held by a thread, test-and-set performs very poorly because all threads keep attempting to acquire the lock using an atomic instruction, and even a failed attempt still invalidates all sharers, which must then suffer misses to reload the block before their subsequent acquisition attempts. On the other hand, TTSL, LL/SC, the ticket lock, and ABQL spin-wait using a load instruction. Hence, once a thread discovers that the lock is held, it does not attempt to execute an atomic instruction.
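To illustrate the difference, here is a hypothetical sketch (not from the text) contrasting the two spin-wait strategies; test_and_set() stands in for whatever atomic read-modify-write instruction the machine provides.

    volatile int lockvar = 0;

    void ts_acquire(void)              /* test&set: every iteration executes the   */
    {                                  /* atomic instruction, generating            */
        while (test_and_set(&lockvar) == 1)   /* invalidations even on failure      */
            ;
    }

    void ttsl_acquire(void)            /* TTSL: spin with plain loads; attempt the */
    {                                  /* atomic instruction only when the lock    */
        while (1) {                    /* appears to be free                       */
            while (lockvar == 1)
                ;                      /* reads served from the local cache        */
            if (test_and_set(&lockvar) == 0)
                break;
        }
    }

While the lock is held, the TTSL loop keeps re-reading a locally cached copy and generates no bus traffic, whereas the test-and-set loop generates traffic on every failed attempt.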
In terms of storage requirements, most of the lock implementations use only one or two shared variables, so their storage requirement is constant in the number of processors. Only ABQL has a storage requirement that scales with the number of processors, due to keeping the array can_serve.
In terms of the guarantee of fairness, only the ticket lock and ABQL provide it, due to their use of a queue. In the other implementations, it is possible (though in reality perhaps not likely) that a processor has a better chance than others of acquiring the lock following a release. For example, this may occur when a thread that releases the lock (invalidating the other caches) quickly re-acquires it to reenter the critical section. In the meantime, other processors have not even had a chance to reattempt a lock acquisition because they are still reloading the invalidated block. In this case, it is possible for a thread to keep acquiring and releasing the lock at the expense of other threads' ability to acquire it. Other causes are also possible. For example, if the bus arbitration logic favors some requesters over others, some processors may be granted the bus faster than others. In a distributed shared memory system, one processor may reload an invalidated block faster than others, for example because it reloads the block from its local memory whereas others must reload it from a remote memory. Following a release, the processor that reloads the block from its local memory may race past the others to attempt a lock acquisition.
Note, however, that the fairness guarantee creates a performance risk. If a thread that is already in the queue waiting for the lock is context switched out, then even when the lock becomes available, that thread will not attempt to acquire it. Other threads with larger ticket numbers cannot acquire the lock either, because they must wait until the thread that was switched out has acquired and released the lock. Thus, the performance of all threads is degraded by the context switch of a single thread. Therefore, care must be taken to ensure that context switches do not occur when using the ticket lock or ABQL implementations.
From the point of view of software, it is not immediately clear which lock implementation is best. For software that exhibits a high degree of lock contention, ABQL offers the highest scalability. However, a high degree of lock contention is often a symptom of a deeper scalability problem, such as a lock granularity that is too coarse. In such a case, using ABQL improves scalability but may not make the program scalable. Better scalability can be obtained by using a finer lock granularity. For example, in the linked list parallelization discussed in Chapter 4, the fine-grain lock approach, in which each node is augmented with its own lock, will likely keep the contention level of any particular lock very low. In that case, ABQL is not only unnecessary, it is not even preferred, due to its high uncontended latency and the fact that the storage overhead of one array per node would be too high.
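As a rough sketch of the fine-grain approach (illustrative only; the node layout is assumed, not taken from Chapter 4), each node simply carries its own lock variable:

    typedef struct node {
        int          key;     /* node data                                    */
        struct node *next;    /* next node in the list                        */
        lock_type    lock;    /* per-node lock; a simple TTSL lock suffices   */
    } node_t;

Because threads operating on different nodes acquire different locks, contention on any single lock remains low, and a simple, low-latency lock implementation is typically the better choice.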
8.2 Barrier Implementations
Barriers are very simple and widely used synchronization primitives in parallel programs. We have
discussed in Chapter 3 that in loop-level parallelization, barriers are often used at the end of a paral-
lel loop to ensure all threads have computed their parts of the computation before the computation
moves on to the next step. In the OpenMP parallel for directive, a barrier is automatically
inserted at the end of the parallel loop, and programmers must explicitly remove it if they believe that the lack of a barrier does not impact the correctness of their computation.
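As an illustration, the snippet below (a sketch, not from the text) uses OpenMP's nowait clause to remove the implicit barrier from the first loop; this is safe here only because the two loops touch disjoint arrays.

    #include <omp.h>

    void scale_and_shift(double *a, double *b, int n)
    {
        #pragma omp parallel
        {
            #pragma omp for nowait      /* remove the implicit barrier at loop end */
            for (int i = 0; i < n; i++)
                a[i] = 2.0 * a[i];

            /* safe without the barrier: this loop only touches b */
            #pragma omp for
            for (int i = 0; i < n; i++)
                b[i] = b[i] + 1.0;
        }
    }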
In this section, we will look at how barriers can be implemented and compare the characteristics of the implementations. Barriers can be implemented in software or directly in hardware. A software barrier implementation is flexible but often inefficient, whereas a hardware barrier implementation restricts the flexibility and portability of the software but is often a lot more efficient. The simplest software barrier implementation is the sense-reversal global barrier. The key mechanism is for threads to enter the barrier and spin on a location that will be set by the last thread arriving at the barrier, releasing all threads that are waiting there. Obviously, spinning on a single location often involves a lot of invalidations when the location is written, restricting scalability. One of the more complex but more scalable barrier implementations is the combining tree barrier, in which the threads that participate in a barrier are organized like a tree, and a thread at a particular tree level spins on a common location only with its siblings. This restricts the number of invalidations since threads spin on different locations. Finally, we will look at hardware barrier support that is implemented in some very large machines with thousands of processors.
The criteria for evaluating barrier performance include:
1. Latency: the time spent from entering the barrier to exiting the barrier. The latency in the
barrier should be as small as possible.
2. Traffic: the amount of bytes communicated between processors as a function of the number
of processors.
In contrast to lock implementations, fairness and storage overhead are not important issues because a barrier is a global construct involving many threads.
8.2.1 Sense-Reversing Centralized Barrier
A software barrier implementation assumes only that lock acquisition and release primitives are provided by the system (through software, hardware, or a combination of the two). For example, as long as one of the test-and-set, TTSL, LL/SC, ticket, or ABQL lock implementations is available, a barrier implementation can be constructed as well.
We note that from the point of view of programmers, a barrier must be simple, that is, a barrier should not require any arguments to be passed to it, whether variable names or the number of processors or threads. In the OpenMP standard, for example, a barrier can be invoked simply as #pragma omp barrier. In the actual implementation, arguments may be used, as long as they are not exposed to the programmers.
The basic barrier implementation is shown in Code 8.7. The implementation makes use of several variables. numArrived keeps track of how many threads have so far arrived at the barrier.
barLock is a lock variable used to protect the modification of shared variables in a critical section. canGo is a flag variable that threads spin on to know whether they can go past the barrier yet. Hence, canGo is set by the last arriving thread to release all waiting threads from the barrier.
Code 8.7 Simple (but incorrect) barrier implementation.

// declaration of shared variables used in a barrier
// and their initial values
int numArrived = 0;
lock_type barLock = 0;
int canGo = 0;

// thread-private variable
int myCount = 0;

// barrier implementation
void barrier() {
  lock(&barLock);
  if (numArrived == 0) {         // first thread resets the flag
    canGo = 0;
  }
  numArrived++;
  myCount = numArrived;
  unlock(&barLock);

  if (myCount < NUM_THREADS) {
    while (canGo == 0) {};       // wait for last thread
  }
  else {                         // last thread to arrive
    numArrived = 0;              // reset for next barrier
    canGo = 1;                   // release all threads
  }
}
For example, suppose that three threads P1, P2, and P3 arrive at the barrier in that order. P1 arrives at the barrier first and enters the critical section. It then initializes the variable canGo to 0 so that it and the other threads will wait until the last thread arrives and sets it to 1. It then increments numArrived to 1, assigns its value to myCount, and exits the critical section. Next, it enters a loop to wait until canGo has a value of 1. The second thread P2 enters the barrier, increments numArrived to 2, discovers that it is not the last thread (its myCount has a value of 2, smaller than NUM_THREADS), and also enters the loop to wait until canGo has a value of 1. Finally, the last thread P3 enters the barrier and increments numArrived to 3. It discovers that its myCount is equal to NUM_THREADS, so it is the last thread to arrive at the barrier. It then resets the value of numArrived to 0 so that it can be used in the next barrier. Then, it releases all waiting threads by setting canGo to 1. This allows all waiting threads to get out of the loop and resume computation past the barrier.
Unfortunately, the code described above does not work correctly across more than one barrier.
For example, when the last thread P3 sets the value of canGo to 1, it invalidates all cached copies
of the block where canGo is located. Following the invalidations, the block resides in P3’s cache,
while P1 and P2 try to reload the block. Suppose that before that can happen, P3 enters the next
barrier, now as the first thread that arrives at that second barrier. As the first thread, it resets the
variable canGo to 0. It can do so quickly since it has the block in its cache in a modified state.
When P1 and P2 reload the block, they discover that the value of canGo is 0 as affected by the
second barrier rather than 1 as released from the first barrier. Therefore, P1 and P2 stay in the loop
of the first barrier and never get released, while P3 waits in the loop of the second barrier, never to
be released either.
One possible solution to the “deadlock” situation discussed earlier is to have a two-step release,
in which before the first thread that enters a barrier initializes the canGo variable, it waits until
all threads have been released from the previous barrier. Implementing such a solution requires
another flag, another counter, and extra code, which can be quite costly. Fortunately, there is a
simpler solution based on an observation that the error arises when the first thread entering the next
barrier resets the value of canGo. If we avoid this reset, then the thread that enters the next barrier
will not prevent other threads from getting out of the first barrier. To avoid resetting the value of
canGo, in the second barrier threads can instead wait until the value of canGo changes back to 0, which is accomplished by the last thread that enters the second barrier. With this approach, the value of canGo that releases threads alternates between 1 in the first barrier, 0 in the second, 1 in the third, 0 in the fourth, and so on. Because the value toggles between barriers, this solution is referred to as the sense-reversing centralized barrier. The code for the barrier is shown in the following (Code 8.8).
Code 8.8 Sense-reversing barrier implementation.

// declaration of shared variables used in a barrier
int numArrived = 0;
lock_type barLock = 0;
int canGo = 0;

// thread-private variables
int valueToWait = 0;
int myCount = 0;

// barrier implementation
void barrier() {
  valueToWait = 1 - valueToWait;   // toggle the sense to wait for
  lock(&barLock);
  numArrived++;
  myCount = numArrived;
  unlock(&barLock);

  if (myCount < NUM_THREADS) {     // wait for last thread
    while (canGo != valueToWait) {};
  }
  else {                           // last thread to arrive
    numArrived = 0;                // reset for next barrier
    canGo = valueToWait;           // release all threads
  }
}
The code shows that each thread, when entering a barrier, first toggles the value that it will wait for, from 0 to 1 or from 1 to 0. Then, it increments the counter and waits until the value of canGo is changed by the last thread. The last thread is the one that toggles the value of canGo.
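As a usage sketch (assuming a pthreads environment, and that the barrier() routine of Code 8.8, its lock primitives, and a thread-private valueToWait are linked in), each thread simply calls barrier() at the end of every computation step:

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4
    #define NUM_STEPS   3

    void barrier(void);             /* the sense-reversing barrier of Code 8.8 */

    void *worker(void *arg)
    {
        long id = (long)arg;
        for (int step = 0; step < NUM_STEPS; step++) {
            /* ... this thread's share of the work for this step ... */
            printf("thread %ld finished step %d\n", id, step);
            barrier();              /* wait until all threads finish the step */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NUM_THREADS];
        for (long i = 0; i < NUM_THREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NUM_THREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }

Because each thread toggles its private valueToWait on every call, consecutive barriers alternate the sense being waited for, so canGo never needs to be explicitly reset between barriers.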
The centralized (or global) barrier implementation uses a critical section inside the barrier routine, so as the number of threads grows, the time spent in the barrier increases linearly. In fact, it may increase super-linearly depending on the underlying lock implementation. In addition, the traffic is