HT Performance Issues

Introduction

Intel® recommends that the following software techniques be used to optimize code for execution on HT-enabled processors:

  • The OS scheduler should schedule threads to be executed on logical processors within different physical processors before scheduling threads to be executed on both of the logical processors within the same physical processor.

  • Eliminate spin-wait loops wherever possible.

  • Execute the HLT instruction on a logical processor when entering an idle period.

  • The OS scheduler should attempt to balance the load placed on each of the logical processors.

  • Attempt to share code and data between the threads executing on each of the logical processors within the same physical processor.

  • Eliminate or decrease the amount of code and data in memory shared between threads executing on different physical processors.

  • Ensure that the code executed on each of the logical processors within the same physical processor uses the WC (write-combining) buffers wisely.

Thread Distribution to Logical Processors

Refer to Figure 39-20 on page 995. The system shown consists of eight logical processors distributed within four Pentium® 4 Xeon MP processors. If both of the logical processors within one physical processor are executing threads, then each of the logical processors only has half of the processor's resources to work with. In addition, the information residing in the L1, L2 and L3 Caches is divided between the lines fetched by each of the two threads.

Figure 39-20. Quad Xeon MP System


For this reason, in order to achieve optimal performance the OS should schedule threads to logical processors on different physical processors before scheduling multiple threads to be executed on the same physical processor. In this way, all of the processor's resources are available to the thread executing on one of the physical processor's logical processors. In addition, if a thread executing on a logical processor within one physical processor enters a spin-wait loop (see “Thread Synchronization” on page 1001 for more information), it will not adversely affect a thread executing on a logical processor within a different physical processor.

If two threads are executing on logical processors within the same physical processor and one of the threads enters a spin-wait loop, the physical processor's resources remain partitioned (even though the spinning logical processor is accomplishing no useful work).
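
As a concrete illustration of this scheduling policy, the sketch below pins two threads to logical processors in different physical packages before any doubling-up occurs. This is a minimal sketch, assuming Linux's pthread_setaffinity_np and a hypothetical topology in which logical CPUs 0 and 1 reside in different physical processors; real code should obtain the CPU-to-package mapping from the OS rather than hard-coding it.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Restrict a thread to a single logical CPU. */
static void pin_to_cpu(pthread_t thread, int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);        /* start with an empty CPU mask  */
    CPU_SET(cpu, &set);    /* permit only the requested CPU */
    pthread_setaffinity_np(thread, sizeof(set), &set);
}

/* Given two threads t1 and t2, place them in different physical
   packages first (CPU numbers assume the hypothetical mapping above):
       pin_to_cpu(t1, 0);
       pin_to_cpu(t2, 1);  */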

Load Balancing

In order to achieve the best possible performance, the OS scheduler should attempt to balance the resource use of the threads executing on the two logical processors within a physical processor.

HT and the Processor Caches

Physical Processors Operating on Separate Data Sets

Whether or not the following scenario occurs is a function of two things:

  • How the OS distributes threads amongst the pool of available logical processors.

  • The areas of system memory that the threads perform data accesses within.

Assume that the threads executing on the logical processors in one physical processor operate on a data set (i.e., a set of data items in system memory) that is in no way shared (i.e., accessed) by the threads executing on the logical processors in another physical processor. Also assume that the memory type for the data area is WB memory (cacheable write-back memory).

In this case, the data cached in one processor is not present in the other processor's caches (and vice versa). This being the case, when a thread executing within a physical processor updates (i.e., stores into) a data item, one of the following situations occurs:

  • The store hits on a line in the processor cache. The line is in one of two states:

    - The E (Exclusive) state. In this case, the line is updated and is transitioned to the Modified (M) state. No transaction is performed on the FSB.

    - The line is already in the M state. In this case, the line is updated and remains in the M state. No transaction is performed on the FSB.

    In both cases, no FSB bandwidth is consumed (that's a good thing) and the update has no effect on the data in the caches of the other physical processor (another good thing).

  • The store misses on the processor's caches. In this case, the processor arbitrates for ownership of the Request Phase signal group and initiates a Memory Read and Invalidate transaction on the FSB. The transaction is snooped in the caches of the other physical processor, but it results in a miss (because the threads executing on the two physical processors operate on separate data sets). The FSB transaction therefore does not cause the eviction of a line from the snooping processor's cache (and that's a good thing).

Data Sharing by Physical Processors
Introduction

The threads executing on the logical processors within one physical processor may access data in system memory that is also accessed by the threads executing on the logical processors within another physical processor. In this case, the updates to data items shared by threads on the separate physical processors can degrade system performance. This discussion assumes that the shared memory area is designated as WB memory.

Before a thread updates a data item in a cache line shared by other processors (i.e., the line is in the S state in the cache), it must gain exclusive ownership of the line. In a WB area of memory, exclusive ownership is attained by performing a Memory Read and Invalidate transaction on the FSB. This transaction is also commonly referred to as an RFO (Read for Ownership) or as an RWITM (Read With Intent To Modify). The logical processor initiates the transaction to read the latest copy of the line (from whoever has it) with the intent to modify it the moment it receives the line.

Upon detecting this transaction, any other processor with a copy of the line must take the action defined by the state of that line in its caches:

- I. No action required.

- E. The line is invalidated.

- S. The line is invalidated.

- M. The line is sourced to the requesting processor and is then invalidated.

Once the line is invalidated in a processor's caches, any subsequent access to any location in that line by a local logical processor results in a cache miss and an FSB transaction to obtain the latest copy of the line.

Thus, accesses to data items in a shared line can cause excessive activity on the FSB.
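
The sketch below illustrates the effect in C. It is illustrative only (the field names are invented, and the 128-byte figure matches the Pentium® 4 L2/L3 line size given under “Solution” later in this section): in the first layout, every store by one processor forces the other processor's copy of the line out of its cache; in the second, each counter lives in its own line and no such traffic results.

#include <stdint.h>

/* Bad: both counters share one cache line, so stores by two processors
   continually steal exclusive ownership of the line from each other. */
struct shared_bad {
    volatile uint32_t counter_a;   /* updated by a thread on processor A */
    volatile uint32_t counter_b;   /* updated by a thread on processor B */
};

/* Better: pad each counter out to its own 128-byte line so each
   processor can keep its line in the E or M state undisturbed. */
struct shared_good {
    volatile uint32_t counter_a;
    char pad[128 - sizeof(uint32_t)];  /* fill the rest of the line */
    volatile uint32_t counter_b;       /* starts in the next line   */
};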

Using a Semaphore to Access a Shared Data Area

Multiple threads may share access to a data area in memory. A semaphore in memory may be used to signal when the memory area is being used by one thread and should not be accessed until the current owner of the data area indicates that it is done.

An Ideal Situation

The best scenario would be that each semaphore (if there are multiple synchronization semaphores) resides in a separate cache line that contains no other semaphores nor any non-semaphore data items.

A Bad Situation

Multiple threads executing on multiple processors use the same semaphore. The following scenario highlights why this is less than desirable:

  1. In a quad-processor system, threads executing on each of the physical processors are competing for a semaphore that protects a critical section of code or data.

  2. Processor A accesses the semaphore in an attempt to lock the critical code or data area.

  3. In order to modify the semaphore, processor A first obtains exclusive ownership of the line containing the semaphore by performing a Memory Read and Invalidate transaction on the FSB.

  4. Processors B, C, and D each invalidate their copy of that cache line (if they have one).

  5. Processor A reads the line into its cache, checks the semaphore value to determine if it's clear (i.e., all zeros) and assuming that it is, stores a non-zero value into it to set the semaphore (thereby signaling that the shared memory area is now in use). The line is now marked Modified (M) in processor A's cache.

  6. When any or all of the threads executing on processors B, C, and D attempt to access the semaphore (to test its state), each experiences a cache miss and is forced to re-read the line containing the semaphore into its cache by performing a Memory Read transaction on the FSB.

  7. When the first Memory Read transaction for the semaphore is performed on the FSB by processor B, C or D, it results in a snoop hit on a modified line in processor A's cache.

  8. Processor A sources the modified line to the requesting processor and simultaneously to the system memory controller which uses the modified line to freshen the line in memory. Since the other processor is only reading the line, processor A changes the state of its copy from M to S.

  9. Each of the threads executing on processors B, C, and D checks the semaphore in turn, and each is forced into a spin-wait loop, repetitively checking the semaphore until processor A clears it back to zero.

  10. The thread executing on processor A completes its accesses to the shared code or data area and must now clear the semaphore to zero to indicate this is the case.

  11. In order to re-gain exclusive ownership of the line (now in the S state) before modifying the semaphore, processor A performs a 0-byte Memory Read and Invalidate to kill the line within the other processors' caches.

  12. It then updates the semaphore to zero in its cache.

  13. As they are still in their spin-wait loops, the threads executing on processors B, C, and D then re-read the cache line and determine that the semaphore has been cleared.

  14. As a result, they all attempt to set the semaphore and one of them “wins” the bus arbitration and performs a Kill transaction (i.e., a Memory Read and Invalidate for 0 bytes) to obtain exclusive ownership of the line before setting the semaphore.

  15. This once again forces the other processors to invalidate their respective copies of the line.

  16. The losing processors return to their respective spin-wait loops.
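
In C, the loop that produces the traffic described above looks something like the sketch below (a sketch only: the GCC/Clang __atomic_exchange_n builtin stands in for the locked XCHG the processors actually execute). Every pass through the exchange is a locked RMW, so each spinning processor demands exclusive ownership of the semaphore's line on every iteration, ping-ponging the line across the FSB.

volatile int semaphore = 0;    /* 0 = area free, non-zero = in use */

void acquire_naive(void)
{
    /* each exchange is a locked RMW: an RFO on every iteration */
    while (__atomic_exchange_n(&semaphore, 1, __ATOMIC_ACQUIRE) != 0)
        ;                      /* spin */
}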

If the Shared Data and the Semaphore Are in the Same Line

To compound the problem just described, assume that some or all of the shared data items accessed and updated by the thread executing on processor A reside in the same line as the semaphore:

- Processors B, C, and D are constantly reading the semaphore in order to test it, thereby causing processor A to lose exclusive ownership of the line.

- When the thread executing on processor A wishes to update any data item in the line that also contains the semaphore, it must first regain exclusive ownership of the line.

- Ownership of the cache line ends up continually passing from one processor to another, causing excessive generation of FSB transactions.

Solution

This situation is avoided by locating frequently accessed semaphores and the shared data they protect in different cache lines: the semaphore resides in one cache line, and the data items updated by the thread reside in another. To do so, the cache line size for the target platform must be known:

- On the P6 processor family, the cache line size is 32 bytes.

- On the Pentium® 4 processor family, the L2 and L3 cache line size is 128 bytes (sub-divided into two 64-byte sectors), and the L1 Data Cache line size is 64 bytes.

The semaphore should be aligned on a cache line boundary.

Data items accessed by the thread should be located in cache lines separate from the semaphore that controls access to the shared data.
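
One way to honor both of these rules in C is sketched below, assuming the GCC/Clang aligned attribute and the Pentium® 4's 128-byte L2/L3 line size; the field names are illustrative.

#define LINE_SIZE 128   /* Pentium 4 L2/L3 cache line size */

struct sync_area {
    /* the alignment places the semaphore at a line boundary, and the
       padding implied by the next member's alignment keeps it alone
       in its line */
    volatile int semaphore __attribute__((aligned(LINE_SIZE)));

    /* the protected data starts in the next cache line */
    int shared_data[64]    __attribute__((aligned(LINE_SIZE)));
};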

Data Sharing by Co-Resident Logical Processors

When the threads executing on the two logical processors in the same physical processor access the same data set, this is a very good thing. It optimizes the use of the processor's L1 Data Cache, as well as the L2 and L3 Caches.

Co-Resident Logical Processors with Separate Data Sets

It's not as good a situation when the two threads executing within a physical processor access completely different data sets. Basically, the data sets associated with the two threads each consume part of the total space within the L1 Data Cache, as well as the L2 and L3 Caches. In addition, an access to a data item in one data set that results in a cache miss can result in the eviction of a data item from the other data set to make room for the new line being read into the cache.

Executing Identical Threads

It would seem that the execution of two identical code threads on the two logical processors in a physical processor would negate HT's most important underlying premise: that the two threads would provide the core with a greater variety of instructions to choose for dispatch and execution.

Performance analysis has shown, however, that HT is able to execute even very similar instruction streams efficiently. A slight shift in the timing of the two instruction streams can result in identical instructions arriving at the schedulers at different times, thereby supplying the schedulers with a better variety of instructions to choose from on a clock-by-clock basis. As an example, such a time shift can occur due to a cache miss stalling one of the two instruction streams.

Halt Usage

When a thread completes execution on a logical processor and the OS scheduler doesn't have another thread ready to execute on it, the scheduler should cause the idle logical processor to execute the HLT instruction. All partitioned resources are then recombined and are available for use by the logical processor that is still executing a thread. On executing the HLT instruction, the physical processor enters the ST0 or ST1 state (if only one of its logical processors is halted) or the AutoHalt PowerDown state (if both logical processors are currently halted). Any interrupt delivered to a halted logical processor returns it to the executing state.

Thread Synchronization

Definition

At some point, two threads running side-by-side on the two logical processors within a physical processor may need to sync up with each other. This can be accomplished by having the thread performing the test go into a tight loop repeatedly reading a variable and checking it for a particular value (see the example that follows). When the other thread arrives at the synchronization point, it signals this by writing the expected value into the synchronization variable. The next time the variable is checked, it contains the expected value and the tight loop is exited.

The following are two examples of spin-wait loops:

do {
} while (sync_var != constant_value);   /* sync_var assumed volatile */


wait_loop: cmp eax, sync_var
           jne wait_loop

The Problem

On the Pentium® 4 family processors, however, this results in severe performance degradation. The following describes the cause of the degradation:

  1. When the load to read the synchronization variable is executed, it is placed in one of the logical processor's Load Buffers to await the return of the read data.

  2. In a spin-wait loop, the logical processor queues up each of the successive reads of the variable in multiple Load Buffers that are each awaiting fulfillment. The processor can dispatch the repetitive loads much faster than the caches can be accessed to fetch the variable.

  3. When the thread executing on the other logical processor finally executes the store to the synchronization variable, it is posted in one of the Store Buffers allocated to that logical processor.

  4. The processor core must ensure that the loads issued by the other logical processor prior to the store receive the pre-store data.

  5. It must then ensure that the load executed by the thread in the spin-loop immediately following the store receives the updated data from the other thread's Store Buffer.

  6. The logical processor executing the store to the variable is not permitted to complete the store until all of the pre-store loads have been retired from the pipeline.

  7. Only then is the store permitted to complete.

  8. And, finally, the post-store load is permitted to complete.

The Fix

The Pentium® 4 introduced the PAUSE instruction to address this issue. When placed in a spin-wait loop, the PAUSE instruction causes a small delay between the issuance of each of the loads to read the synchronization variable. The net result is that there will only be one outstanding load that will be serviced by the store when it occurs. This helps the performance of the thread that performs the store because the store can complete very quickly. There are two additional side-benefits:

  • The number of μops that the schedulers have to handle for the paused thread is dramatically decreased, thereby allowing the schedulers to provide fast dispatch of the other thread's μops.

  • Whenever the number of μops that have to be handled is decreased, the power consumed by the pipeline stages decreases accordingly.

The following is an example of using PAUSE to slow down the rate at which the semaphore is checked:

wait_loop: pause
           cmp eax, sync_var
           jne wait_loop

Performing a pre-check of the synchronization variable can avoid the delay imposed by the PAUSE (in the event that the other thread had already written to the variable):

            cmp eax, sync_var
            je continue
wait_loop:  pause
            cmp eax, sync_var
            jne wait_loop
continue:   ...
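
The same pre-check and PAUSE-throttled wait can be rendered in C, as in the sketch below; it assumes the _mm_pause intrinsic (which compiles to the PAUSE instruction, and executes as a plain NOP on processors that predate it):

#include <emmintrin.h>               /* _mm_pause */

extern volatile int sync_var;        /* written by the other thread */

void wait_for(int constant_value)
{
    if (sync_var == constant_value)  /* pre-check: skip the PAUSE   */
        return;                      /* delay if already signaled   */
    do {
        _mm_pause();                 /* throttle the load stream    */
    } while (sync_var != constant_value);
}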

The following is an example showing a locked RMW (read/modify/write) being used to test and set the variable. This construct is also referred to as a spin-lock.

get_lock:
 mov  eax, 1    ;load the "set" value
 xchg eax, mem  ;read current value and set it
 cmp  eax, 0    ;was it already set?
 jne  spin_loop ;spin if it was, else fall through
critical_section:
 <critical section code>
 mov  mem, 0    ;clear the variable
 jmp  continue
spin_loop:
 pause          ;short delay
 cmp  mem, 0    ;test with a plain read (no bus lock)
 jne  spin_loop
 jmp  get_lock
continue:
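
A C sketch of the same spin-lock follows, assuming the _mm_pause intrinsic and the GCC/Clang __atomic builtins in place of the XCHG and MOV instructions above. Like the assembly version, it retries the locked exchange only after plain reads (which hit in the cache) have seen the variable go clear:

#include <emmintrin.h>               /* _mm_pause */

volatile int lock_var = 0;           /* 0 = free, 1 = held */

void spin_lock(void)
{
    for (;;) {
        /* locked RMW: read the current value and set it in one step */
        if (__atomic_exchange_n(&lock_var, 1, __ATOMIC_ACQUIRE) == 0)
            return;                  /* it was clear; the lock is ours  */
        while (lock_var != 0)        /* spin on plain reads...          */
            _mm_pause();             /* ...with a short delay each pass */
    }
}

void spin_unlock(void)
{
    __atomic_store_n(&lock_var, 0, __ATOMIC_RELEASE);   /* clear it */
}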

When A Thread Is Idle

When a worker thread has run out of work to do, it could enter an idle loop wherein it checks a variable periodically to determine when it is to perform a task. If it might be a relatively long wait until it receives a task to perform, the OS scheduler should put that logical processor to sleep by causing it to execute a HLT instruction. In that way, all of the partitioned resources are recombined and dedicated to the other logical processor.

Spin-Lock Optimization

Spin-locks (see the final example code fragment under “The Fix” on page 1002) are typically used when more than one thread may attempt to modify a synchronization variable simultaneously. In this case, a locked RMW should be used to test and change the variable (in case multiple threads are in a race to change the variable).

When the variable is cleared by the thread that had set it, a number of threads on other processors may be in a race to set it again. As was described in “A Bad Situation” on page 998, this can result in significant performance degradation. Intel® recommends that no more than two threads should have write access to a given synchronization variable. In addition, as shown in the final code fragment in “The Fix” on page 1002, the PAUSE instruction should be included in the wait loop.

WCB Usage

A Pentium® 4 processor implements six Write Combining Buffers (WCBs) in which stores to WC memory are posted [see “Write-Combining (WC) Memory” on page 582 for a complete description of the WC memory type and the WCBs]. At any given moment, however, two of the six WCBs may be in use handling writes to WB memory (see “A Special Use of the WCBs” on page 1080 for a complete description). This means that four of the WCBs are guaranteed to be available to handle writes to WC memory. When HT is enabled, these four WCBs are partitioned, so only two of the four WCBs can be used by each of the logical processors. If one of the logical processors should execute the HLT instruction, the WCBs are recombined and dedicated to the logical processor that is still executing code.

If the thread executing on a logical processor executes stores to more than two 64-byte lines of memory space designated as WC memory, it exceeds the two-WCB usage rule and performance begins to degrade: any additional WC store to a different line of WC memory space performed by that thread is stalled until one or both of the WCBs are dumped to external memory.

For this reason, Intel® recommends that, when a thread is executing on a logical processor, any program loop that performs writes to more than two lines of WC memory space should be divided into more than one loop, each of which writes to no more than two lines of WC memory space. This is referred to as loop fissioning.
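
The sketch below shows what loop fissioning looks like in C. It is illustrative only: the out0 through out3 pointers are assumed to each reference a different region of WC memory, so the first version keeps four WC lines open at once (exceeding the two WCBs available to a logical processor when HT is enabled), while the fissioned version keeps at most two open per loop.

/* Before: one loop writing four separate WC lines per iteration. */
void fill_combined(volatile int *out0, volatile int *out1,
                   volatile int *out2, volatile int *out3, int n)
{
    for (int i = 0; i < n; i++) {
        out0[i] = i;                 /* four write-combining streams: */
        out1[i] = i;                 /* more open WC lines than the   */
        out2[i] = i;                 /* two WCBs available, so lines  */
        out3[i] = i;                 /* are repeatedly flushed early  */
    }
}

/* After: two loops, each writing no more than two WC lines at a time. */
void fill_fissioned(volatile int *out0, volatile int *out1,
                    volatile int *out2, volatile int *out3, int n)
{
    for (int i = 0; i < n; i++) { out0[i] = i; out1[i] = i; }
    for (int i = 0; i < n; i++) { out2[i] = i; out3[i] = i; }
}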
