Hyper-Threading Improved

The 90nm version of the Pentium® 4 processor includes a number of improvements to enhance the performance of the Hyper-Threading feature.

Decreased Possibility of L1 Data Cache Blocking

In the earlier versions of the Pentium® 4 processor, the Data Cache would not block the servicing of load/store requests until four cache misses had occurred. This number has been increased to eight. While this has little effect on a processor executing a single thread, it enhances performance when both logical processors are executing threads (if the threads are accessing different data sets).
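The effect of the higher limit can be illustrated with a small model. This is a sketch, not a description of the actual hardware: it assumes that the limit applies to the misses outstanding at the same time across both logical processors, and the function name and per-thread miss counts are hypothetical.

```python
def data_cache_blocks(outstanding_misses, miss_limit):
    """Return True if the modeled Data Cache stops servicing load/store requests.

    Assumption: the cache blocks once the total number of outstanding
    misses (summed over both logical processors) reaches the limit.
    """
    return sum(outstanding_misses) >= miss_limit


EARLIER_LIMIT = 4   # earlier Pentium 4 versions, per the text
NM90_LIMIT = 8      # 90nm version, per the text

# Hypothetical workload: each thread has three misses outstanding.
one_thread = [3]
two_threads = [3, 3]

print(data_cache_blocks(one_thread, EARLIER_LIMIT))   # single thread, earlier limit
print(data_cache_blocks(two_threads, EARLIER_LIMIT))  # two threads, earlier limit
print(data_cache_blocks(two_threads, NM90_LIMIT))     # two threads, 90nm limit
```

In this model, only the two-thread case under the earlier limit blocks, which matches the observation that the change matters mainly when both logical processors are busy.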

Increased the Size of the μop Queue

Refer to Figure 42-3 on page 1107. The size of the μop Queue that feeds the instruction pipeline has been increased by an unspecified amount to enhance performance when executing multiple threads.

Figure 42-3. The μop Queue Size Has Been Increased


Eliminated Page Table Walk/Split Line Access Conflict

Earlier versions of the Pentium® 4 processor could handle one or the other of the following events, but not both at the same time:

  • A lookup in the Page Directory and a Page Table following a TLB miss.

  • A data access that crosses a cache line boundary.

For a processor executing a single program, this limitation seldom caused a performance problem. When the processor is executing multiple threads, however, this can be a source of performance degradation.

The 90nm version of the Pentium® 4 processor can simultaneously handle both of these events.

Handling Multiple Page Table Walks that Miss All Caches

When a lookup in the Page Directory and a Page Table results in a miss on all of the caches (for either the PDE or the PTE access), the processor must access the Page Directory and possibly the Page Table in system memory. On the earlier versions of the Pentium® 4 processor, the processor could not perform any additional lookups in the Page Directory or a Page Table until the off-chip paging access had completed. While this was not a problem for a processor executing a single thread, it caused performance degradation on a processor executing multiple threads.

The 90nm version of the Pentium® 4 processor can continue to perform additional lookups in the Page Directory and a Page Table while the processor is accessing a PDE or a PTE in system memory.

Trace Cache Responds Quicker to a Thread Stall

When either of the threads experiences a stall condition (e.g., due to an unavailable resource) anywhere in the μop pipeline, the Trace Cache is informed. It then dedicates itself to servicing μop requests issued by the other thread.

In the 90nm version of the Pentium® 4 processor, the Trace Cache responds to this condition faster than it did on the earlier versions of the Pentium® 4 processor.

The Data Cache and Hyper-Threading

Author's Note

The description of the L1 Data Cache in “The L1 Data Cache” on page 1013 assumed that the L1 Data Cache is virtually addressed and that each cache directory entry contains a physical page address tag. This assumption is based on the following statement from an Intel® Technology Journal article entitled “Hyper-Threading Technology Architecture and Microarchitecture”:

“The L1 data cache is 4-way set associative with 64-byte lines. It is a write-through cache, meaning that writes are always copied to the L2 cache. The L1 data cache is virtually addressed and physically tagged.”

The feature described in this section, however, implies that the L1 Data Cache is virtually addressed and that each cache directory entry contains a virtual rather than a physical page address tag.

Introduction

The author believes that this feature was first introduced in the Pentium® M processor. Bit 24 in the IA32_MISC_ENABLE MSR (see Figure 56-21 on page 1373) controls the L1 Data Cache Context Mode:

  • When bit 24 is set to 1, the L1 Data Cache is placed in Shared Mode.

  • When bit 24 is cleared to 0 (its default state after reset), the L1 Data Cache is placed in Adaptive Mode.

To determine if a processor supports this feature, execute a CPUID request type 1 and verify that ECX[10] = 1. If it is not, the processor does not support this feature and software must not alter the state of bit 24 in the IA32_MISC_ENABLE MSR.
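The check-then-modify rule above can be sketched as bit manipulation on raw register values. This is a minimal illustration only: obtaining the CPUID result and reading or writing the MSR require privileged, platform-specific mechanisms that are not shown, and all function and constant names are hypothetical.

```python
CPUID_1_ECX_CONTEXT_BIT = 10      # CPUID request type 1, ECX[10]
MISC_ENABLE_L1_CONTEXT_BIT = 24   # IA32_MISC_ENABLE[24]


def supports_context_mode(cpuid_1_ecx):
    """True if ECX[10] returned by CPUID request type 1 is set."""
    return bool((cpuid_1_ecx >> CPUID_1_ECX_CONTEXT_BIT) & 1)


def context_mode(misc_enable):
    """Decode IA32_MISC_ENABLE[24]: 1 = Shared Mode, 0 = Adaptive Mode."""
    if (misc_enable >> MISC_ENABLE_L1_CONTEXT_BIT) & 1:
        return "Shared"
    return "Adaptive"


def select_context_mode(cpuid_1_ecx, misc_enable, shared):
    """Return the IA32_MISC_ENABLE value that software would write back.

    If the feature is not supported, the value is returned unchanged,
    because software must not alter bit 24 in that case.
    """
    if not supports_context_mode(cpuid_1_ecx):
        return misc_enable
    mask = 1 << MISC_ENABLE_L1_CONTEXT_BIT
    return (misc_enable | mask) if shared else (misc_enable & ~mask)


# Hypothetical register values, for illustration only.
ecx = 1 << 10
msr = select_context_mode(ecx, 0, shared=True)
print(context_mode(msr))
```

Keeping the support check inside `select_context_mode` means a caller cannot change bit 24 on a processor that does not report the feature.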

Shared Mode

When the L1 Data Cache is in Shared Mode, a data cache hit initiated by either logical processor is permitted to access the selected cache line.

Adaptive Mode

Each logical processor has its own CR3 (Page Directory Base Address Register) which identifies the set of page address translation tables associated with the thread executing on a logical processor. If the same base address is in CR3 of both logical processors, then the two threads executing on the two logical processors share the same set of page address translation tables. On the other hand, if the two CR3s contain different Page Directory base addresses, the two threads are using different sets of page address translation tables. The two threads may generate identical 32-bit linear addresses on some memory accesses, but they are most likely mapped to different physical pages in memory for most if not all accesses.

When the processor is in Adaptive Mode, each access to the Data Cache is accompanied by a Context ID bit:

  • If the CR3 values are the same (as indicated by the Context ID bit presented by a logical processor) and the linear address has a match on a cache entry, the logical processor ID is not checked and the line is accessed.

  • If the CR3s are not the same (as indicated by the Context ID bit presented by the logical processor) and the linear address has a match on a cache entry, the access is only permitted if the logical processor ID matches the one stored in the directory entry. If the logical processor ID doesn't match, then it's a cache miss and the request is forwarded upstream (to the L2 ATC).
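The hit/miss decision for Shared Mode and Adaptive Mode can be summarized as a small model. This is a sketch of the rules as stated in the text, not of the actual cache hardware; the class, field, and function names are hypothetical, and the directory entry is reduced to a linear address tag plus the stored logical processor ID.

```python
from dataclasses import dataclass


@dataclass
class DirectoryEntry:
    linear_tag: int   # linear address tag held in the directory entry
    lp_id: int        # logical processor ID stored in the directory entry


def l1_hit(entry, linear_tag, requester_lp, shared_mode, same_cr3):
    """Model the hit/miss decision described in the text.

    shared_mode: True when IA32_MISC_ENABLE[24] = 1
    same_cr3:    the Context ID bit, True when both CR3s hold the same base
    """
    if entry.linear_tag != linear_tag:
        return False   # no tag match: a miss in either mode
    if shared_mode or same_cr3:
        return True    # logical processor ID is not checked
    # Adaptive Mode with different CR3s: only the matching logical processor hits.
    return entry.lp_id == requester_lp


# Hypothetical entry belonging to logical processor 0.
entry = DirectoryEntry(linear_tag=0x12345, lp_id=0)
print(l1_hit(entry, 0x12345, 1, shared_mode=False, same_cr3=False))
print(l1_hit(entry, 0x12345, 1, shared_mode=False, same_cr3=True))
```

In the first call, logical processor 1 presents a matching linear address while the CR3s differ, so the model reports a miss; in the second call, the CR3s are the same and the access hits.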

The MONITOR and MWAIT Instructions

These two instructions are part of the SSE3 instruction set and were introduced in the 90nm version of the Pentium® 4 processor.

Background

When the OS scheduler has no work to do, it typically enters an idle spin-wait loop until there is something to do. This could also be true of any thread that runs out of work to do. The idle thread is accomplishing no useful work and yet the processor's partitioned resources (e.g., the μop Queue, the General μop Queue, the Memory μop Queue, etc.) remain partitioned. When the thread executing on a logical processor enters the idle state, it makes far better sense to execute a HLT instruction thereby causing the processor to recombine the partitioned resources and dedicate all resources to the thread that is still performing useful work. If both threads executing on a physical processor should become idle, it would make sense to place the processor in a low-power state similar to the AutoHalt Powerdown state (see “The AutoHalt Power Down State” on page 686).

When one or both of the logical processors are idle, there must be a way to “wake” an idle logical processor so that it exits the idle state when there is once again useful work (i.e., a thread) for it to perform.

The MONITOR Instruction

Three input parameters are supplied with the MONITOR instruction:

  • EAX = the offset portion of a linear memory address defined by DS:Offset (where the offset portion of the address is specified in the EAX register). This must be an address in WB memory. When a special, hardware-based monitoring facility is subsequently activated, the facility monitors for a memory write within a linear address range starting at this address. The range of addresses covered can be determined by executing a CPUID request type 5 (new in the 90nm Pentium® 4) and checking the byte count specified in EAX[15:0] (see “Request Type 5” on page 1458).

  • ECX = optional extensions to the MONITOR instruction (none of which are currently implemented on the 90nm Pentium® 4). This register must contain zero.

  • EDX = optional hints to the MONITOR instruction (none of which are currently implemented on the 90nm Pentium® 4). This register must contain zero.
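The three input parameters can be modeled as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the CPUID value is an illustrative input rather than a measured one, and the monitored range is taken to start at the supplied address as the text describes (any alignment applied by real hardware is not modeled).

```python
def monitor_range_size(cpuid_5_eax):
    """Byte count of the monitored range: EAX[15:0] of CPUID request type 5."""
    return cpuid_5_eax & 0xFFFF


def monitor_inputs_valid(ecx, edx):
    """On the 90nm Pentium 4, ECX (extensions) and EDX (hints) must both be zero."""
    return ecx == 0 and edx == 0


def store_in_monitored_range(store_addr, monitor_addr, range_size):
    """True if a store falls inside the range that starts at monitor_addr."""
    return monitor_addr <= store_addr < monitor_addr + range_size


# Hypothetical values, for illustration only.
size = monitor_range_size(0x00000040)
print(size)
print(monitor_inputs_valid(ecx=0, edx=0))
print(store_in_monitored_range(0x1010, 0x1000, size))
```

Because any store inside the range counts as a trigger, software typically wants the trigger location to share that range with nothing else that is written frequently.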

When executed, the MONITOR instruction accomplishes the following:

  • The processor uses the three input parameters to set up a hardware-based monitoring facility that, when subsequently activated, will monitor for a memory write (i.e., a store) to any location(s) within the specified memory area.

  • The processor clears its Monitor Event Pending Flag (see “Example Code Usage” on page 1112 and “The Wake Up Call” on page 1112).

The MWAIT Instruction

Two input parameters are supplied with the MWAIT instruction:

  • EAX = optional hints for the MWAIT instruction (none of which are currently implemented on the 90nm Pentium® 4). This register must contain zero.

  • ECX = optional extensions for the MWAIT instruction (none of which are currently implemented on the 90nm Pentium® 4). This register must contain zero.

When executed, the MWAIT instruction accomplishes the following:

  • The logical processor that executed the MWAIT instruction ceases program execution and awaits a wake up call (see “The Wake Up Call” on page 1112).

  • It places the processor in a processor design-specific mode. As an example:

    - If the other logical processor is still actively executing its thread, the processor's partitioned resources are recombined and dedicated to the still active logical processor.

    - If the other logical processor is also in the MWAIT state, the processor enters a low-power state similar to the AutoHalt Powerdown state (see “The AutoHalt Power Down State” on page 686).
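The example behavior can be expressed as a function of the two logical processors' states. This is only a sketch of the example given in the text, since the actual mode is processor design-specific; the state names and the function name are hypothetical.

```python
ACTIVE = "active"    # logical processor is executing its thread
WAITING = "mwait"    # logical processor has executed MWAIT


def physical_processor_mode(lp0_state, lp1_state):
    """Return the modeled mode of the physical processor."""
    states = (lp0_state, lp1_state)
    if any(s not in (ACTIVE, WAITING) for s in states):
        raise ValueError("unknown logical processor state")
    if states == (WAITING, WAITING):
        return "low-power"      # similar to the AutoHalt Powerdown state
    if WAITING in states:
        return "recombined"     # all resources dedicated to the active thread
    return "partitioned"        # both threads are running


print(physical_processor_mode(ACTIVE, ACTIVE))
print(physical_processor_mode(ACTIVE, WAITING))
print(physical_processor_mode(WAITING, WAITING))
```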

Example Code Usage

The MONITOR/MWAIT instruction pair is typically used as illustrated in the following code fragment:

while (!trigger_store_happened) {
     mov  eax,Trigger         ;eax = offset portion of DS:offset
                              ;of store trigger address in WB memory
     xor  ecx,ecx             ;ecx = monitor extensions (must be zero)
     xor  edx,edx             ;edx = monitor hints (must be zero)
     monitor eax,ecx,edx      ;trigger monitoring set up and
                              ;Monitor Event Pending Flag is cleared
     if (!trigger_store_happened) {
          xor  eax,eax        ;eax = mwait hints (must be zero)
          mwait eax,ecx       ;ecx = mwait extensions (still zero);
                              ;enter optimized state & await trigger
     }
}

The Wake Up Call

Any of the following events will cause the logical processor to resume program execution:

  • A store to the WB memory address range being monitored causes the logical processor to fall through to the instruction that immediately follows the MWAIT instruction.

  • Any interrupt to the logical processor causes the processor to jump to the appropriate interrupt handler.

  • An NMI delivered to the logical processor causes the processor to jump to the NMI handler.

  • An SMI delivered to the logical processor causes the processor to jump to the SMI handler.

  • A Debug exception causes the processor to jump to the Debug exception handler.

  • A Machine Check exception causes the processor to jump to the MC handler.

  • The assertion of the processor's BINIT# signal.

  • The assertion of the processor's INIT# signal or the delivery of an INIT interrupt to the logical processor causes the processor to jump to the power-on restart address.

  • The assertion of the processor's RESET# input causes the processor to jump to the power-on restart address.
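The events listed above can be collected into a lookup table that maps each wake-up event to the point at which execution resumes. This is a sketch that restates the list only; the event keys and the function name are hypothetical, and no resume target is recorded for BINIT# because the text does not name one.

```python
WAKE_EVENTS = {
    "monitored_store": "instruction following MWAIT",
    "interrupt": "interrupt handler",
    "nmi": "NMI handler",
    "smi": "SMI handler",
    "debug_exception": "Debug exception handler",
    "machine_check": "MC handler",
    "binit": None,  # the text names no resume target for BINIT#
    "init": "power-on restart address",
    "reset": "power-on restart address",
}


def resume_target(event):
    """Return where the logical processor resumes after the given wake-up event."""
    if event not in WAKE_EVENTS:
        raise ValueError(f"not a listed wake-up event: {event}")
    return WAKE_EVENTS[event]


print(resume_target("monitored_store"))
print(resume_target("nmi"))
```

Only the monitored store resumes at the instruction after MWAIT; every other listed event with a stated target transfers control elsewhere, which is why the code that follows MWAIT rechecks the trigger condition.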
