void Insert(const T& item)
{
    for (;;)
    {
        EnterCriticalSection(&m_cs);
        if (m_queue.IsFull())
        {
            LeaveCriticalSection(&m_cs);
            SwitchToThread(); // Queue full; yield and retry.
            continue;
        }
        else if (ReleaseSemaphore(m_sm, 1, NULL))
        {
            m_queue.Insert(item);
        }
        LeaveCriticalSection(&m_cs);
        break; // Returns without inserting if ReleaseSemaphore() failed.
    }
}
T Remove()
{
    T item;
    for (;;)
    {
        // The semaphore count tracks the number of items produced so far.
        if (WAIT_OBJECT_0 != WaitForSingleObject(m_sm, INFINITE))
            break;
        EnterCriticalSection(&m_cs);
        if (!m_queue.IsEmpty())
        {
            item = m_queue.Remove();
            LeaveCriticalSection(&m_cs);
            break;
        }
        else
        {
            LeaveCriticalSection(&m_cs); // Queue empty; wait again.
        }
    }
    return (item);
}
DWORD LastError()
{
    return (::GetLastError());
}

private:
    Fifo<T> m_queue;
    CRITICAL_SECTION m_cs;
    HANDLE m_sm;
};
Listing 31.2. A simple producer-consumer queue.
When the semaphore count is zero, the consumer(s) enter a low-overhead wait state until a producer inserts an item at the head of the queue. Note that in this implementation, the Insert() function can still return without inserting an item, for example, if the ReleaseSemaphore() function fails. Likewise, the Remove() function can still return without removing an item, for example, if the WaitForSingleObject() function fails. These failure paths are rare, and the application can call the LastError() function to get the Win32 error result.
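As a quick usage sketch (the ProducerConsumerQueue class name and the thread entry points below are assumptions for illustration; the listing above begins mid-class and does not show its name):

// Hypothetical usage of the queue in Listing 31.2.
ProducerConsumerQueue<int> g_queue;

DWORD WINAPI ProducerThread(LPVOID)
{
    for (int i = 0; i < 100; ++i)
        g_queue.Insert(i); // On a rare failure, check g_queue.LastError().
    return (0);
}

DWORD WINAPI ConsumerThread(LPVOID)
{
    for (int i = 0; i < 100; ++i)
    {
        int item = g_queue.Remove(); // Blocks until an item is available.
        // ... process item ...
    }
    return (0);
}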
Some designs may prefer to defer to the application to retry the operation on the producer side if the queue is full, for example, by returning false when an item can't be inserted at that moment because the queue is full (a sketch of this variation follows below). In the implementation of a thread-safe queue, one must be careful about how to proceed when the queue is empty or full. For example, when removing an object, the code in Listing 31.2 first requests exclusive access to check whether the queue has at least one item, but backs out and tries again if it has no items to remove. If the check were done before entering the critical section, a thread context switch could occur and another consumer could remove the last item right before the critical section is entered, which would result in trying to remove an item from an empty queue!
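A minimal sketch of that variation, assuming the same m_cs, m_sm, and m_queue members as Listing 31.2 (the TryInsert() name is ours):

// Nonblocking variant: reports failure instead of spinning when full.
bool TryInsert(const T& item)
{
    bool inserted = false;
    EnterCriticalSection(&m_cs);
    if (!m_queue.IsFull() && ReleaseSemaphore(m_sm, 1, NULL))
    {
        m_queue.Insert(item);
        inserted = true;
    }
    LeaveCriticalSection(&m_cs);
    return (inserted); // false: queue full (or semaphore error); caller may retry.
}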
This particular class has a few limitations. For example, there is no way to signal to all the consumers to stop waiting for the producer(s) to insert more data. One way to handle this is to add a manual-reset event that is initially nonsignaled and to wait with the WaitForMultipleObjects() function on both the stop event and the semaphore, with bWaitAll set to FALSE. After a SetEvent() call, all consumers would then wake up, and the return value of the WaitForMultipleObjects() function indicates whether it was the stop event that was signaled.
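One possible shape of that change, assuming an added manual-reset event member m_hStop created with CreateEvent(NULL, TRUE, FALSE, NULL):

// Remove() reworked to also wake on a stop event. Returns false when
// stopped (or on a wait error) instead of blocking forever.
bool Remove(T& item)
{
    HANDLE handles[2] = { m_hStop, m_sm }; // Stop event first: it wins ties.
    for (;;)
    {
        DWORD result = WaitForMultipleObjects(2, handles, FALSE, INFINITE);
        if (result != WAIT_OBJECT_0 + 1)
            return (false); // Stop event signaled, or the wait failed.
        EnterCriticalSection(&m_cs);
        if (!m_queue.IsEmpty())
        {
            item = m_queue.Remove();
            LeaveCriticalSection(&m_cs);
            return (true);
        }
        LeaveCriticalSection(&m_cs); // Queue empty; wait again.
    }
}

Putting the stop event at index 0 means that when both objects are signaled, WaitForMultipleObjects() reports the stop event, so shutdown takes priority over draining the queue.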
Finally, the semaphore object maintains an exact physical count of the objects, but this count is internal to the OS. To let the consumer query the number of items currently in the queue, we can add a member variable that is incremented or decremented whenever the queue is locked for an insertion or removal. However, every time the count is read, it is only a logical snapshot at that point in time, since another thread could insert or remove an item immediately afterward. Such a snapshot can be useful for gauging whether the producer is producing data too quickly (or the consumer is consuming it too quickly), but the only way to ensure that the count remains accurate is to lock the whole queue before checking it.
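A sketch of such a counter, assuming an added LONG m_count member that Insert() increments and Remove() decrements while m_cs is held:

// Returns a logical snapshot; the true count may change immediately after.
LONG Count()
{
    EnterCriticalSection(&m_cs);
    LONG count = m_count;
    LeaveCriticalSection(&m_cs);
    return (count);
}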
31.4 A Second Approach: Lock-Free Algorithms
The goal of every multithreaded algorithm is to maximize parallelism. However, once a resource is requested for exclusive access, all other threads that need it must stall while waiting for it to become available, and this causes contention. Also, many OS locks (such as mutexes) require a switch to kernel mode and can take hundreds of clock cycles or more just to acquire the lock. To improve performance, some multithreaded algorithms can be designed without any OS locks. These are referred to as lock-free algorithms.
Lock-free algorithms have been around for decades, but these techniques are seeing more and more exposure in user-mode code as multicore architectures become commonplace and developers seek to improve parallelism. For example, Valve's Source Engine internally uses lock-free algorithms for many of the multithreaded components in their game engine [Leonard 2007].
The design of new lock-free algorithms is notoriously difficult and should be based on published, peer-reviewed techniques. Most lock-free algorithms are engineered around modern x86 and x64 architectures, which requires compiler and processor support for a specific multiprocessor memory model. Despite the complexity, many algorithms that have high thread contention or low latency requirements can achieve measurable performance increases when switched to a lock-free equivalent. When implemented correctly, lock-free algorithms generally do not suffer from deadlocks, starvation, or priority inversion. There is, however, no guarantee that a lock-free algorithm will outperform one with locks, and the additional complexity often makes it not worth the headache. Finally, another disadvantage is that only a limited range of algorithms can be converted to a lock-free equivalent. Luckily, queues are among those that can.
One of the challenges of accessing variables shared among threads is the reordering of read and write operations, which can break many multithreaded algorithms. Both the compiler and the processor can reorder reads and writes. Using the volatile keyword does not necessarily fix this problem, since the C++ standard makes no guarantee that a memory barrier is created between instructions. A memory barrier is either a compiler intrinsic or a CPU instruction that prevents reads or writes from occurring out of order across the execution point. A full barrier is one that operates at both the compiler level and the instruction (CPU) level. Unfortunately, the volatile keyword was designed for memory-mapped I/O, where access is serialized at the hardware level, not for modern multicore architectures with shared memory caches.
In Microsoft Visual C++, one can prevent the compiler from reordering instructions by using the _ReadBarrier(), _WriteBarrier(), or _ReadWriteBarrier() compiler intrinsics. In Microsoft Visual C++ 2005 and beyond, volatile variables act as a compiler memory barrier as well. Finally, the MemoryBarrier() macro can be used to insert a CPU memory barrier. Interlocked instructions such as InterlockedIncrement() or InterlockedCompareExchange() also act as full barriers.
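As an illustration of these primitives (a sketch; the flag-and-payload pattern and the names are ours, not from this chapter):

#include <intrin.h>
#include <windows.h>

volatile LONG g_ready = 0; // Flag published after the payload.
int g_payload = 0;

void Publish(int value)
{
    g_payload = value;
    _ReadWriteBarrier(); // Compiler barrier: no reordering across this point.
    MemoryBarrier();     // CPU barrier: payload is visible before the flag.
    g_ready = 1;
}

int Consume()
{
    // The interlocked operation is itself a full barrier.
    while (InterlockedCompareExchange(&g_ready, 0, 1) != 1)
        SwitchToThread();
    return (g_payload);
}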
31.5 Processor Architecture Overview and Memory Models
Synchronization requirements for individual memory locations depend on the architecture and memory model. On the x86 architecture, reads and writes to aligned 32-bit values are atomic. On the x64 architecture, reads and writes to aligned 32-bit and 64-bit values are atomic. As long as the compiler does not optimize away the access to memory (use the volatile keyword to ensure this), the value is preserved and the operation completes without being interrupted. However, when it comes to the ordering of reads and writes as observed by a particular processor in a multiprocessor machine, the rules become more complicated.
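For example (a sketch with names of our own choosing), an aligned 32-bit value can be read and written atomically without any lock, although nothing here constrains ordering relative to other memory locations:

// A reader never sees a torn (partially written) value, because the
// aligned 32-bit store and load are each atomic on x86/x64.
__declspec(align(4)) static volatile LONG g_value = 0;

void Writer() { g_value = 42; }      // One atomic store.
LONG Reader() { return (g_value); }  // One atomic load.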
Processors such as the AMD Athlon 64 and Intel Core 2 Duo operate on a “write ordered with store-buffer forwarding” memory model. On modern x86 or x64 multiprocessor architectures, reads are not reordered relative to other reads, and writes are not reordered relative to other writes. However, a read may be reordered with a previous write if the read accesses a different memory address.

During write buffering, it is possible that the order of writes across all processors appears to be committed to memory out of order with respect to a local processor. For example, if processor A writes A.0, A.1, and A.2 to three different memory locations and processor B writes B.3, B.4, and B.5 to three other memory locations in parallel, the actual order of the writes to memory could appear as A.0, A.1, B.3, B.4, A.2, B.5. Note that neither the A.0, A.1, A.2 writes nor the B.3, B.4, B.5 writes are reordered locally, but they may be interleaved with the other processor's writes when committed to memory.
Since a read can be reordered with a write in certain circumstances, and writes can proceed in parallel on multiprocessor machines, the default memory model breaks certain multithreaded algorithms. Due to these memory model intricacies, some lock-free algorithms could fail without a proper CPU memory barrier, such as Dekker's mutual exclusion algorithm, shown in Listing 31.3.
static __declspec(align(4)) volatile bool a = false, b = false,
    turn = false;
void threadA()
{
    a = true;
    // <- write a, read b reorder possible, needs MemoryBarrier()
    while (b)
    {
        if (turn)
        {
            a = false;
            while (turn) { /* spin */ };
            a = true;
        }
    }
    // critical section
    turn = true;
    a = false;
}
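The symmetric threadB() would presumably take the following form (a sketch reconstructed from the standard statement of Dekker's algorithm, not from the original listing):

void threadB()
{
    b = true;
    // <- write b, read a reorder possible, needs MemoryBarrier()
    while (a)
    {
        if (!turn)
        {
            b = false;
            while (!turn) { /* spin */ };
            b = true;
        }
    }
    // critical section
    turn = false;
    b = false;
}

Listing 31.3. Dekker's mutual exclusion algorithm.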