8.1. LOCK IMPLEMENTATIONS 281
maximum traffic. This is because all threads spin on the same variable so they may all cache the
same block, and each lock release invalidates all other processors and forces them to suffer misses
to reload the block. On the other hand, in ABQL, a single release only invalidates one other cache
causing only one subsequent cache miss.
In terms of the traffic generated while a lock is held by a thread, test and set performs very
poorly because all threads keep attempting to acquire the lock using an atomic instruction; even
a failed attempt invalidates all sharers and triggers subsequent reacquisition attempts by them.
On the other hand, TTSL, LL/SC, the ticket lock, and ABQL spin-wait using a load
instruction. Hence, once a thread discovers that the lock is held, it does not attempt to execute an
atomic instruction.
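The contrast between spinning with an atomic instruction and spin-waiting with a plain load can be illustrated with a minimal TTSL sketch. This is written with C11 atomics rather than the assembly-level form discussed in the text, and the names (`ttsl_t`, `ttsl_acquire`) are illustrative:

```c
#include <stdatomic.h>

/* Illustrative test-and-test-and-set lock using C11 atomics. */
typedef struct { atomic_int held; } ttsl_t;

void ttsl_init(ttsl_t *l) { atomic_init(&l->held, 0); }

void ttsl_acquire(ttsl_t *l) {
    for (;;) {
        /* Spin with plain loads while the lock is held: each spin hits
           the locally cached copy and generates no bus traffic. */
        while (atomic_load_explicit(&l->held, memory_order_relaxed))
            ;
        /* The lock appears free: only now attempt the atomic exchange.
           Only this step can invalidate other processors' cached copies. */
        if (!atomic_exchange_explicit(&l->held, 1, memory_order_acquire))
            return;  /* exchange returned 0: we took the lock */
    }
}

void ttsl_release(ttsl_t *l) {
    atomic_store_explicit(&l->held, 0, memory_order_release);
}
```

A plain test-and-set lock would call the atomic exchange directly in a loop; the inner load loop is what keeps waiting threads off the bus while the lock is held.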
In terms of storage requirements, all of the lock implementations except ABQL use one or two
shared variables, so their storage requirement is constant regardless of the number of processors.
Only ABQL has a storage requirement that scales with the number of processors, due to keeping
the can serve array.
In terms of guarantee of fairness, only the ticket lock and ABQL provide it, due to the use of a
queue. In other implementations, it is possible (though in reality perhaps not likely) that one
processor has a better chance than others of acquiring the lock following a release. For example, this
may occur when a thread that releases the lock (invalidating other caches in the process) quickly
reacquires it to reenter the critical section. In the meantime, other processors have not even had a chance
to reattempt a lock acquisition because they are still reloading the invalidated block. In this case,
it is possible that a thread keeps on acquiring and releasing a lock at the expense of other threads’
ability to acquire the lock. Other causes are also possible. For example, if the bus arbitration
logic favors some requesters over others, some processors may be granted the bus faster than others.
In a distributed shared memory system, one processor may reload an invalidated block faster than
others, for example due to reloading the block from its local memory whereas others must reload
it from a remote memory. Following a release, the processor that reloads the block from its local
memory may race past others to attempt a lock acquisition.
Note, however, that the fairness guarantee creates a performance risk. If a thread that is already in
the queue waiting to acquire a lock is context switched out, then even when the lock becomes available,
the thread will not attempt to acquire the lock. Other threads with larger ticket numbers cannot
acquire the lock either, because they must wait until the thread that was switched out has acquired and
released the lock. Thus, the performance of all threads is degraded by the context switch of one of
the threads. Therefore, care must be taken to ensure that context switches do not occur when using the
ticket lock or ABQL implementations.
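The queue ordering that provides this fairness (and creates the context-switch hazard) is easy to see in a ticket lock sketch. As with the earlier sketches, this uses C11 atomics and illustrative names (`ticket_lock_t`):

```c
#include <stdatomic.h>

/* Illustrative ticket lock: FIFO fairness via two shared counters. */
typedef struct {
    atomic_uint next_ticket;  /* ticket handed to the next arriving thread */
    atomic_uint now_serving;  /* ticket of the thread allowed to proceed */
} ticket_lock_t;

void ticket_init(ticket_lock_t *l) {
    atomic_init(&l->next_ticket, 0);
    atomic_init(&l->now_serving, 0);
}

void ticket_acquire(ticket_lock_t *l) {
    /* Take a ticket atomically; arrival order fixes acquisition order. */
    unsigned my = atomic_fetch_add_explicit(&l->next_ticket, 1,
                                            memory_order_relaxed);
    /* Spin-wait with loads until it is this thread's turn. If the thread
       holding ticket `my - 1` is context switched out before releasing,
       every later ticket holder stalls here: the fairness hazard. */
    while (atomic_load_explicit(&l->now_serving,
                                memory_order_acquire) != my)
        ;
}

void ticket_release(ticket_lock_t *l) {
    /* Pass the lock to the next ticket in strict FIFO order. */
    atomic_fetch_add_explicit(&l->now_serving, 1, memory_order_release);
}
```

Because `now_serving` advances by exactly one per release, no thread can jump the queue, which is precisely why a descheduled waiter blocks everyone behind it.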
From the point of view of software, it is not immediately clear which lock implementation is the
best. For software showing a high degree of lock contention, ABQL offers the highest scalability.
However, a high degree of lock contention is often a symptom of a deeper scalability problem,
such as a lock granularity that is too coarse. In such a case, using ABQL improves
scalability but may not make the program scalable. Better scalability can be obtained using a finer
lock granularity. For example, in the linked list parallelization discussed in Chapter 4, using the
fine grain lock approach, in which each node is augmented with its own lock, will likely make the
contention level of any particular lock very low. In that case, ABQL is not only unnecessary, it is
not even preferred due to a high uncontended latency and the fact that the storage overhead of one
array for each node will be too high.