31. Producer‐Consumer Queues (3/5)

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

31.5ProcessorArchitectureOverviewandMemoryModels 485

void threadB()

{

b = true;

// <- write a, read b reorder possible, needs MemoryBarrier()

while (a)

{

if (!turn)

{

b = false;

while (!turn) { /* spin */ };

b = true;

}

// critical section

turn = false;

b = false;

}

Listing 31.3. Dekker’s mutual exclusion algorithm.

Assuming that each thread is running in a different process, Figure 31.4

shows a parallel operation that fails with the write-ordered, store-buffer forward-

ing memory model. Since the read (

cmp) can be reordered relative to the previous

write on each processor, it is possible for

threadA to read b as false on proces-

sor A and

threadB to read a as false on processor B, before any instructions to

set them to

true are finished executing. This causes both threads to enter into the

critical section at the same time, which breaks the algorithm. The way to solve

this is by inserting a CPU memory barrier before the write and read.

Figure 31.4. On modern x86 or x64 processors, a read instruction can be reordered with a

previous write instruction if they are addressing different memory locations.

Processor

mov dword ptr [a], 1

cmp dword ptr [b], 0

mov dword ptr [b], 1

cmp dword ptr [a], 0

486 31.Producer‐ConsumerQueues

Microsoft Visual C++ 2008 has a compiler-specific feature to treat access to

volatile variables with acquire and release semantics [Dawson 2008]. In order to

enforce acquire and release semantics completely on modern x86/64 processors,

the compiler would need to enforce compiler memory barriers up the call stack in

addition to using atomic or lock instructions. Unfortunately, when disassembling

Dekker’s algorithm on a Core2 Duo platform, locking instructions were not used,

thus leaving the possibility of a race condition. Therefore, to be safe, aligned in-

tegers that are shared across threads should be accessed using interlocked instruc-

tions, which provide a full barrier when compiled under Microsoft Visual C++. A

cross-platform implementation may require the use of preprocessor defines or

different code paths depending on the target compiler and target architecture.

31.6Lock‐FreeAlgorithmDesign

On x86 and x64 architectures, lock-free algorithms center around the compare-

and-swap (CAS) operation, which is implemented atomically when supported as

a CPU instruction. Pseudocode for a 32-bit CAS operation is shown in Listing

31.4. If implemented in C++, this operation would not be atomic. However, on

x86 and x64 processors, the 32-bit

CMPXCHG instruction is atomic and introduces

a processor memory barrier (the instruction is prefixed with a

LOCK). There are

also 64-bit (CAS2) and 128-bit (CAS4) versions on supported platforms

(

CMPXCHG8B and CMPXCHG16B). In Win32, these are available as the Inter-

lockedCompareExchange()

and InterlockedCompareExchange64() functions.

At the time of this writing, there is no

InterlockedExchange128() function, but

there is a

_InterlockedCompareExchange128() intrinsic in Visual C++ 2010.

Support for the 64-bit

CMPXCHG4B instruction has been available since the

Pentium, so availability is widespread. The 128-bit

CMPXCHG8B instruction is

available on newer AMD64 architectures and Intel 64 architectures. These fea-

tures can be checked using the

__cpuid() intrinsic function, as shown in

Listing 31.5.

UINT32 CAS(UINT32 *dest, UINT32 old, UINT32 new)

{

UINT32 cur = *dest;

if (cur == old) *dest = new;

return (old);

}

Listing 31.4. CAS pseudocode.

31.7Lock‐FreeImplementationofaFreeList 487

struct ProcessorSupport

{

ProcessorSupport()

{

int cpuInfo[4] = {0};

__cpuid(cpuInfo, 1);

supportsCmpXchg64 = ((cpuInfo[2] & 0x200) != 0);

supportsCmpXchg128 = ((cpuInfo[3] & 0x100) != 0);

}

bool supportsCmpXchg64;

bool supportsCmpXchg128;

};

ProcessorSupport processorSupport;

if (processorSupport.supportsCmpXchg64) { /*...*/ }

Listing 31.5. Verifying CAS processor support.

This chapter uses the 32-bit CAS and 64-bit CAS2 in its design. The ad-

vantage of using a CAS operation is that one can ensure, atomically, that the

memory value being updated is exactly as expected before updating it. This pro-

motes data integrity. Unfortunately, using the CAS operation alone in the design

of a lock-free stack and queue isn’t enough. Many lock-free algorithms suffer

from an “ABA” problem, where an operation on another thread can modify the

state of the stack or queue but not appear visible to the original thread, since the

original item appears to be the same.

This problem can be solved using a version tag (also known as a reference

counter) that is automatically incremented and stored atomically with the item

when a CAS operation is performed. This makes the ABA problem extremely

unlikely (although technically possible, since the version tag could eventually

wrap around). The implementation requires the use of a CAS2 primitive to re-

serve room for an additional version tag.

31.7Lock‐FreeImplementationofaFreeList

The AllocRef data structure in Listing 31.6 contains the union shared by both

the stack and the queue algorithm. It is a maximum of 64 bits wide to accommo-

488 31.Producer‐ConsumerQueues

declspec(align(8)) union AllocRef

{

enum

{

NullIndex = 0x000FFFFF

};

AllocRef(UINT64 idx, UINT64 ver)

{

arrayIndex = idx;

version = ver;

}

AllocRef(UINT64 idx, UINT64 si, UINT64 v)

{

arrayIndex = idx;

stackIndex = si;

version = v;

}

AllocRef() : arrayIndex(NullIndex), version(0)

{

}

AllocRef(volatile AllocRef& a)

{

val = a.val;

}

struct

{

UINT64 arrayIndex : 20;

UINT64 stackIndex : 20;

UINT64 version : 24;

};

UINT64 val;

};

Listing 31.6. A 64-bit allocation block reference.

31.7Lock‐FreeImplementationofaFreeList 489

date the CAS2 primitive. In addition, it includes a version tag to prevent the ABA

problem.

The

Allocator class in Listing 31.7 implements a free list, which is an algo-

rithm that manages a preallocated pool of memory. The code uses an array-based

lock-free implementation of a stack algorithm by Shafiei [2009]. Since this algo-

rithm also requires a stack index, we pack the stack index, stack value (array in-

dex), and version number in one 64-bit value. This limits the maximum number

of items in the stack and queue to 1,048,575 entries (

20 1



) and the version

number to 24 bits. The 20-bit index

0x000FFFFF is reserved as a null index.

__declspec(align(8)) class Allocator

{

public:

Allocator(UINT maxAllocs) : m_maxSize(maxAllocs + 1)

{

m_pFreeList = new AllocRef[m_maxSize];

assert(m_pFreeList != NULL);

m_pFreeList[0].arrayIndex = AllocRef::NullIndex;

m_pFreeList[0].stackIndex = 0;

m_pFreeList[0].version = 0;

for (UINT i = 1; i < m_maxSize; ++i)

{

m_pFreeList[i].arrayIndex = i - 1;

m_pFreeList[i].stackIndex = i;

m_pFreeList[i].version = i;

}

m_top.val = m_pFreeList[m_maxSize - 1].val;

}

Allocator::~Allocator()

{

delete[] m_pFreeList;

}

inline bool IsFull() const

{

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for 31. Producer‐Consumer Queues (3/5)

Create new playlist

Sign In

Sign Up

Table of Contents for
31. Producer‐Consumer Queues (3/5)