31.5ProcessorArchitectureOverviewandMemoryModels 485
void threadB()
{
b = true;
// <- write a, read b reorder possible, needs MemoryBarrier()
while (a)
{
if (!turn)
{
b = false;
while (!turn) { /* spin */ };
b = true;
}
}
// critical section
turn = false;
b = false;
}
Listing 31.3. Dekker’s mutual exclusion algorithm.
Assuming that each thread is running in a different process, Figure 31.4
shows a parallel operation that fails with the write-ordered, store-buffer forward-
ing memory model. Since the read (
cmp) can be reordered relative to the previous
write on each processor, it is possible for
threadA to read b as false on proces-
sor A and
threadB to read a as false on processor B, before any instructions to
set them to
true are finished executing. This causes both threads to enter into the
critical section at the same time, which breaks the algorithm. The way to solve
this is by inserting a CPU memory barrier before the write and read.
Figure 31.4. On modern x86 or x64 processors, a read instruction can be reordered with a
previous write instruction if they are addressing different memory locations.
Processor
A
Processor
B
mov dword ptr [a], 1
cmp dword ptr [b], 0
mov dword ptr [b], 1
cmp dword ptr [a], 0
486 31.ProducerConsumerQueues
Microsoft Visual C++ 2008 has a compiler-specific feature to treat access to
volatile variables with acquire and release semantics [Dawson 2008]. In order to
enforce acquire and release semantics completely on modern x86/64 processors,
the compiler would need to enforce compiler memory barriers up the call stack in
addition to using atomic or lock instructions. Unfortunately, when disassembling
Dekker’s algorithm on a Core2 Duo platform, locking instructions were not used,
thus leaving the possibility of a race condition. Therefore, to be safe, aligned in-
tegers that are shared across threads should be accessed using interlocked instruc-
tions, which provide a full barrier when compiled under Microsoft Visual C++. A
cross-platform implementation may require the use of preprocessor defines or
different code paths depending on the target compiler and target architecture.
31.6LockFreeAlgorithmDesign
On x86 and x64 architectures, lock-free algorithms center around the compare-
and-swap (CAS) operation, which is implemented atomically when supported as
a CPU instruction. Pseudocode for a 32-bit CAS operation is shown in Listing
31.4. If implemented in C++, this operation would not be atomic. However, on
x86 and x64 processors, the 32-bit
CMPXCHG instruction is atomic and introduces
a processor memory barrier (the instruction is prefixed with a
LOCK). There are
also 64-bit (CAS2) and 128-bit (CAS4) versions on supported platforms
(
CMPXCHG8B and CMPXCHG16B). In Win32, these are available as the Inter-
lockedCompareExchange()
and InterlockedCompareExchange64() functions.
At the time of this writing, there is no
InterlockedExchange128() function, but
there is a
_InterlockedCompareExchange128() intrinsic in Visual C++ 2010.
Support for the 64-bit
CMPXCHG4B instruction has been available since the
Pentium, so availability is widespread. The 128-bit
CMPXCHG8B instruction is
available on newer AMD64 architectures and Intel 64 architectures. These fea-
tures can be checked using the
__cpuid() intrinsic function, as shown in
Listing 31.5.
UINT32 CAS(UINT32 *dest, UINT32 old, UINT32 new)
{
UINT32 cur = *dest;
if (cur == old) *dest = new;
return (old);
}
Listing 31.4. CAS pseudocode.
31.7LockFreeImplementationofaFreeList 487
struct ProcessorSupport
{
ProcessorSupport()
{
int cpuInfo[4] = {0};
__cpuid(cpuInfo, 1);
supportsCmpXchg64 = ((cpuInfo[2] & 0x200) != 0);
supportsCmpXchg128 = ((cpuInfo[3] & 0x100) != 0);
}
bool supportsCmpXchg64;
bool supportsCmpXchg128;
};
ProcessorSupport processorSupport;
if (processorSupport.supportsCmpXchg64) { /*...*/ }
Listing 31.5. Verifying CAS processor support.
This chapter uses the 32-bit CAS and 64-bit CAS2 in its design. The ad-
vantage of using a CAS operation is that one can ensure, atomically, that the
memory value being updated is exactly as expected before updating it. This pro-
motes data integrity. Unfortunately, using the CAS operation alone in the design
of a lock-free stack and queue isn’t enough. Many lock-free algorithms suffer
from an “ABA” problem, where an operation on another thread can modify the
state of the stack or queue but not appear visible to the original thread, since the
original item appears to be the same.
This problem can be solved using a version tag (also known as a reference
counter) that is automatically incremented and stored atomically with the item
when a CAS operation is performed. This makes the ABA problem extremely
unlikely (although technically possible, since the version tag could eventually
wrap around). The implementation requires the use of a CAS2 primitive to re-
serve room for an additional version tag.
31.7LockFreeImplementationofaFreeList
The AllocRef data structure in Listing 31.6 contains the union shared by both
the stack and the queue algorithm. It is a maximum of 64 bits wide to accommo-
488 31.ProducerConsumerQueues
declspec(align(8)) union AllocRef
{
enum
{
NullIndex = 0x000FFFFF
};
AllocRef(UINT64 idx, UINT64 ver)
{
arrayIndex = idx;
version = ver;
}
AllocRef(UINT64 idx, UINT64 si, UINT64 v)
{
arrayIndex = idx;
stackIndex = si;
version = v;
}
AllocRef() : arrayIndex(NullIndex), version(0)
{
}
AllocRef(volatile AllocRef& a)
{
val = a.val;
}
struct
{
UINT64 arrayIndex : 20;
UINT64 stackIndex : 20;
UINT64 version : 24;
};
UINT64 val;
};
Listing 31.6. A 64-bit allocation block reference.
31.7LockFreeImplementationofaFreeList 489
date the CAS2 primitive. In addition, it includes a version tag to prevent the ABA
problem.
The
Allocator class in Listing 31.7 implements a free list, which is an algo-
rithm that manages a preallocated pool of memory. The code uses an array-based
lock-free implementation of a stack algorithm by Shafiei [2009]. Since this algo-
rithm also requires a stack index, we pack the stack index, stack value (array in-
dex), and version number in one 64-bit value. This limits the maximum number
of items in the stack and queue to 1,048,575 entries (
20
20 1
) and the version
number to 24 bits. The 20-bit index
0x000FFFFF is reserved as a null index.
__declspec(align(8)) class Allocator
{
public:
Allocator(UINT maxAllocs) : m_maxSize(maxAllocs + 1)
{
m_pFreeList = new AllocRef[m_maxSize];
assert(m_pFreeList != NULL);
m_pFreeList[0].arrayIndex = AllocRef::NullIndex;
m_pFreeList[0].stackIndex = 0;
m_pFreeList[0].version = 0;
for (UINT i = 1; i < m_maxSize; ++i)
{
m_pFreeList[i].arrayIndex = i - 1;
m_pFreeList[i].stackIndex = i;
m_pFreeList[i].version = i;
}
m_top.val = m_pFreeList[m_maxSize - 1].val;
}
Allocator::~Allocator()
{
delete[] m_pFreeList;
}
inline bool IsFull() const
{
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.147.77.208