void Insert(const T& item)
{
    for (;;)
    {
        EnterCriticalSection(&m_cs);
        if (m_queue.IsFull())
        {
            LeaveCriticalSection(&m_cs);
            SwitchToThread(); // Queue full; yield and retry.
            continue;
        }
        else if (ReleaseSemaphore(m_sm, 1, NULL))
        {
            m_queue.Insert(item);
        }
        LeaveCriticalSection(&m_cs);
        break; // Returns without inserting if ReleaseSemaphore() failed.
    }
}
T Remove()
{
    T item;
    for (;;)
    {
        // The semaphore count tracks the number of items produced so far.
        if (WAIT_OBJECT_0 != WaitForSingleObject(m_sm, INFINITE))
            break;
        EnterCriticalSection(&m_cs);
        if (!m_queue.IsEmpty())
        {
            item = m_queue.Remove();
            LeaveCriticalSection(&m_cs);
            break;
        }
        else
        {
            LeaveCriticalSection(&m_cs); // Queue empty; wait again.
        }
    }
    return (item);
}
DWORD LastError()
{
    return (::GetLastError());
}

private:
    Fifo<T> m_queue;
    CRITICAL_SECTION m_cs;
    HANDLE m_sm;
};
Listing 31.2. A simple producer-consumer queue.
When the semaphore count is zero, the consumer(s) enter a low-overhead wait state until a producer inserts an item at the head of the queue. Note that in this implementation, the Insert() function can still return without inserting an item, for example, if the ReleaseSemaphore() function fails. Likewise, the Remove() function can still return without removing an item, for example, if the WaitForSingleObject() function fails. These failure paths are rare, and the application can call the LastError() function to get the Win32 error result.
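As a quick usage sketch (the ProducerConsumerQueue class name and the thread entry points below are assumptions for illustration; the listing above begins mid-class and does not show its name):

// Hypothetical usage of the queue in Listing 31.2.
ProducerConsumerQueue<int> g_queue;

DWORD WINAPI ProducerThread(LPVOID)
{
    for (int i = 0; i < 100; ++i)
        g_queue.Insert(i); // On a rare failure, check g_queue.LastError().
    return (0);
}

DWORD WINAPI ConsumerThread(LPVOID)
{
    for (int i = 0; i < 100; ++i)
    {
        int item = g_queue.Remove(); // Blocks until an item is available.
        // ... process item ...
    }
    return (0);
}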
Some designs may prefer to defer to the application to retry the operation on the producer side if the queue is full, for example, by returning false when an item can't be inserted at that moment because the queue is full (a sketch of this variation follows below). In the implementation of a thread-safe queue, one must be careful about how to proceed when the queue is empty or full. For example, when removing an object, the code in Listing 31.2 first requests exclusive access to check whether the queue has at least one item, but backs out and tries again if it has no items to remove. If the check were done before entering the critical section, a thread context switch could occur and another consumer could remove the last item right before the critical section is entered, which would result in trying to remove an item from an empty queue!
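A minimal sketch of that variation, assuming the same m_cs, m_sm, and m_queue members as Listing 31.2 (the TryInsert() name is ours):

// Nonblocking variant: reports failure instead of spinning when full.
bool TryInsert(const T& item)
{
    bool inserted = false;
    EnterCriticalSection(&m_cs);
    if (!m_queue.IsFull() && ReleaseSemaphore(m_sm, 1, NULL))
    {
        m_queue.Insert(item);
        inserted = true;
    }
    LeaveCriticalSection(&m_cs);
    return (inserted); // false: queue full (or semaphore error); caller may retry.
}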
This particular class has a few limitations. For example, there is no way to signal to all the consumers to stop waiting for the producer(s) to insert more data. One way to handle this is to add a manual-reset event that is initially nonsignaled and to wait with the WaitForMultipleObjects() function on both the stop event and the semaphore, with bWaitAll set to FALSE. After a SetEvent() call, all consumers would then wake up, and the return value of the WaitForMultipleObjects() function indicates whether it was the stop event that was signaled.
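One possible shape of that change, assuming an added manual-reset event member m_hStop created with CreateEvent(NULL, TRUE, FALSE, NULL):

// Remove() reworked to also wake on a stop event. Returns false when
// stopped (or on a wait error) instead of blocking forever.
bool Remove(T& item)
{
    HANDLE handles[2] = { m_hStop, m_sm }; // Stop event first: it wins ties.
    for (;;)
    {
        DWORD result = WaitForMultipleObjects(2, handles, FALSE, INFINITE);
        if (result != WAIT_OBJECT_0 + 1)
            return (false); // Stop event signaled, or the wait failed.
        EnterCriticalSection(&m_cs);
        if (!m_queue.IsEmpty())
        {
            item = m_queue.Remove();
            LeaveCriticalSection(&m_cs);
            return (true);
        }
        LeaveCriticalSection(&m_cs); // Queue empty; wait again.
    }
}

Putting the stop event at index 0 means that when both objects are signaled, WaitForMultipleObjects() reports the stop event, so shutdown takes priority over draining the queue.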
Finally, the semaphore object maintains an exact physical count of the objects, but this count is internal to the OS. To let the consumer query the number of items currently in the queue, we can add a member variable that is incremented or decremented whenever the queue is locked for an insertion or removal. However, every time the count is read, it is only a logical snapshot at that point in time, since another thread could insert or remove an item immediately afterward. Such a snapshot can be useful for gauging whether the producer is producing data too quickly (or the consumer is consuming it too quickly), but the only way to ensure that the count remains accurate is to lock the whole queue before checking it.
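A sketch of such a counter, assuming an added LONG m_count member that Insert() increments and Remove() decrements while m_cs is held:

// Returns a logical snapshot; the true count may change immediately after.
LONG Count()
{
    EnterCriticalSection(&m_cs);
    LONG count = m_count;
    LeaveCriticalSection(&m_cs);
    return (count);
}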
31.4 A Second Approach: Lock-Free Algorithms
The goal of every multithreaded algorithm is to maximize parallelism. However, once a resource is requested for exclusive access, all other threads that need it must stall while waiting for it to become available, and this causes contention. Also, many OS locks (such as mutexes) require a switch to kernel mode and can take hundreds of clock cycles or more just to acquire the lock. To improve performance, some multithreaded algorithms can be designed without any OS locks. These are referred to as lock-free algorithms.
Lock-free algorithms have been around for decades, but these techniques are seeing more and more exposure in user-mode code as multicore architectures become commonplace and developers seek to improve parallelism. For example, Valve's Source Engine internally uses lock-free algorithms for many of the multithreaded components in their game engine [Leonard 2007].
The design of new lock-free algorithms is notoriously difficult and should be based on published, peer-reviewed techniques. Most lock-free algorithms are engineered around modern x86 and x64 architectures, which requires compiler and processor support for a specific multiprocessor memory model. Despite the complexity, many algorithms that have high thread contention or low latency requirements can achieve measurable performance increases when switched to a lock-free equivalent. When implemented correctly, lock-free algorithms generally do not suffer from deadlocks, starvation, or priority inversion. There is, however, no guarantee that a lock-free algorithm will outperform one with locks, and the additional complexity often makes it not worth the headache. Finally, another disadvantage is that only a limited range of algorithms can be converted to a lock-free equivalent. Luckily, queues are among those that can.
One of the challenges of accessing variables shared among threads is the reordering of read and write operations, which can break many multithreaded algorithms. Both the compiler and the processor can reorder reads and writes. Using the volatile keyword does not necessarily fix this problem, since the C++ standard makes no guarantee that a memory barrier is created between instructions. A memory barrier is either a compiler intrinsic or a CPU instruction that prevents reads or writes from occurring out of order across the execution point. A full barrier is one that operates at both the compiler level and the instruction (CPU) level. Unfortunately, the volatile keyword was designed for memory-mapped I/O, where access is serialized at the hardware level, not for modern multicore architectures with shared memory caches.
In Microsoft Visual C++, one can prevent the compiler from reordering instructions by using the _ReadBarrier(), _WriteBarrier(), or _ReadWriteBarrier() compiler intrinsics. In Microsoft Visual C++ 2005 and beyond, volatile variables act as a compiler memory barrier as well. Finally, the MemoryBarrier() macro can be used to insert a CPU memory barrier. Interlocked instructions such as InterlockedIncrement() or InterlockedCompareExchange() also act as full barriers.
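As an illustration of these primitives (a sketch; the flag-and-payload pattern and the names are ours, not from this chapter):

#include <intrin.h>
#include <windows.h>

volatile LONG g_ready = 0; // Flag published after the payload.
int g_payload = 0;

void Publish(int value)
{
    g_payload = value;
    _ReadWriteBarrier(); // Compiler barrier: no reordering across this point.
    MemoryBarrier();     // CPU barrier: payload is visible before the flag.
    g_ready = 1;
}

int Consume()
{
    // The interlocked operation is itself a full barrier.
    while (InterlockedCompareExchange(&g_ready, 0, 1) != 1)
        SwitchToThread();
    return (g_payload);
}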
31.5 Processor Architecture Overview and Memory Models
Synchronization requirements for individual memory locations depend on the architecture and memory model. On the x86 architecture, reads and writes to aligned 32-bit values are atomic. On the x64 architecture, reads and writes to aligned 32-bit and 64-bit values are atomic. As long as the compiler does not optimize away the access to memory (use the volatile keyword to ensure this), the value is preserved and the operation completes without being interrupted. However, when it comes to the ordering of reads and writes as observed by a particular processor in a multiprocessor machine, the rules become more complicated.
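For example (a sketch with names of our own choosing), an aligned 32-bit value can be read and written atomically without any lock, although nothing here constrains ordering relative to other memory locations:

// A reader never sees a torn (partially written) value, because the
// aligned 32-bit store and load are each atomic on x86/x64.
__declspec(align(4)) static volatile LONG g_value = 0;

void Writer() { g_value = 42; }      // One atomic store.
LONG Reader() { return (g_value); }  // One atomic load.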
Processors such as the AMD Athlon 64 and Intel Core 2 Duo operate on a “write ordered with store-buffer forwarding” memory model. On modern x86 or x64 multiprocessor architectures, reads are not reordered relative to other reads, and writes are not reordered relative to other writes. However, a read may be reordered with a previous write if the read accesses a different memory address.

During write buffering, it is possible that the order of writes across all processors appears to be committed to memory out of order with respect to a local processor. For example, if processor A writes A.0, A.1, and A.2 to three different memory locations and processor B writes B.3, B.4, and B.5 to three other memory locations in parallel, the actual order of the writes to memory could appear as A.0, A.1, B.3, B.4, A.2, B.5. Note that neither the A.0, A.1, A.2 writes nor the B.3, B.4, B.5 writes are reordered locally, but they may be interleaved with the other processor's writes when committed to memory.
Since a read can be reordered with a write in certain circumstances, and writes can proceed in parallel on multiprocessor machines, the default memory model breaks certain multithreaded algorithms. Due to these memory model intricacies, some lock-free algorithms could fail without a proper CPU memory barrier, such as Dekker's mutual exclusion algorithm, shown in Listing 31.3.
static __declspec(align(4)) volatile bool a = false, b = false,
    turn = false;
void threadA()
{
    a = true;
    // <- write a, read b reorder possible, needs MemoryBarrier()
    while (b)
    {
        if (turn)
        {
            a = false;
            while (turn) { /* spin */ };
            a = true;
        }
    }
    // critical section
    turn = true;
    a = false;
}
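The symmetric threadB() would presumably take the following form (a sketch reconstructed from the standard statement of Dekker's algorithm, not from the original listing):

void threadB()
{
    b = true;
    // <- write b, read a reorder possible, needs MemoryBarrier()
    while (a)
    {
        if (!turn)
        {
            b = false;
            while (!turn) { /* spin */ };
            b = true;
        }
    }
    // critical section
    turn = false;
    b = false;
}

Listing 31.3. Dekker's mutual exclusion algorithm.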