Store-to-Load Forwarding

Background

The Pentium® 4 processor implements 24 Store Buffers (sometimes referred to as the Store Forwarding Buffers) into which store data is written and held pending its ultimate delivery to the cache or to system memory. If Hyper-Threading is enabled, the Store Buffers are partitioned into two groups of 12 buffers each, with each group dedicated to handling stores performed by one of the logical processors.

When a store is executed, it is posted in the Store Buffer that the Allocator reserved for it. The store cannot be performed to the cache (for a write to cacheable memory) or to system memory (for a write to uncacheable memory) until the instruction is retired. The processor has a very deep instruction pipeline (20 stages), and μops are retired in strict program order at a rate of up to three per clock cycle. As a result, a considerable time can elapse before the data is written to the cache or to memory.

If load μops that will read data produced by stores earlier in the program flow were forced to wait until the data is written into the cache or system memory, load performance would suffer.

Description

If a load requests data produced by a previously-executed store, the data is forwarded from the Store Buffer to fulfill the load, without waiting for the data to be written to the cache or to system memory. This is referred to as Store-to-Load Forwarding (or just Store Forwarding).

Consider the following code fragment:

      ---
      jne  bypass     ;if condition met, jump around store
      ---             ;predicted path
      mov  mem1, eax  ; memory write into Store Buffer
bypass:
      mov  ebx, mem1  ;memory read from buffer or memory

If a load is predicted to be dependent on a previously-executed store, it gets its data from the Store Buffer and tentatively proceeds. If it turns out that the load was not dependent on the store, the load is re-executed (replayed) when the real data has been read from memory (or the cache).

The processor performs Store Forwarding by looking for a partial address match (bits [31:16]) between a load and all of the previously-executed stores in the Store Buffers. This occurs in parallel with the load's L1 Data Cache access. If the load's partial address matches that of a store in the Store Buffers, the load obtains its data from the buffer rather than from the cache. The Store Forwarding comparison has the same latency as a cache lookup. To perform the buffer lookup this quickly, the buffer logic cannot take the time to do a full address and access size check. The Memory Order Buffer (MOB) performs these checks later in the pipeline. The MOB ensures that the load obtained its data from the most recent write to that address. If the buffer logic's partial address check resulted in an incorrect forward from one of the Store Buffers, the load must be re-executed (replayed) after the correct store is retired and has written to the L1 Data Cache. The load then obtains its data from the cache.
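
The two-stage check described above (a fast partial compare, followed by a full check in the MOB) can be sketched in C. This is only an illustrative model, not the processor's actual logic: the function and type names are hypothetical, the bit range is passed in as a parameter so the value quoted in the text can be supplied by the caller, and the access-size part of the MOB's check is omitted for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Build a mask covering bits [hi:lo] of a 32-bit linear address. */
static uint32_t bit_range_mask(unsigned hi, unsigned lo)
{
    assert(hi < 32 && lo <= hi);
    uint32_t width = hi - lo + 1u;
    uint32_t ones = (width == 32u) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    return ones << lo;
}

/* Fast check: do the two addresses agree on the masked bits? */
static bool partial_match(uint32_t load_addr, uint32_t store_addr, uint32_t mask)
{
    return ((load_addr ^ store_addr) & mask) == 0;
}

/* Hypothetical outcome labels for this sketch. */
typedef enum {
    LOAD_FROM_CACHE,   /* no partial match: data comes from the cache   */
    LOAD_FORWARD_OK,   /* partial match confirmed by the full check     */
    LOAD_REPLAY        /* partial match was wrong: load is re-executed  */
} load_outcome;

/* Model one load against one earlier store (address check only). */
static load_outcome classify_load(uint32_t load_addr, uint32_t store_addr,
                                  uint32_t mask)
{
    if (!partial_match(load_addr, store_addr, mask))
        return LOAD_FROM_CACHE;
    if (load_addr == store_addr)      /* stands in for the later MOB check */
        return LOAD_FORWARD_OK;
    return LOAD_REPLAY;
}
```

With the mask built from the bit range given in the text, `bit_range_mask(31, 16)`, two addresses that differ only in the bits outside the mask pass the fast check but fail the full one, which is the case that forces a replay in this model.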

Linear Address Mismatch Allows Load Before Store

Consider the following code fragment:

---
mov  mem1,eax       ;store to linear address 1
mov  eax,mem2       ;load from linear address 2
                    ;because the load is from a different
                    ;linear address than the store, the
                    ;load can be executed before the store

A load can be executed before a store that occurs earlier in the program if it is not predicted to use the same linear address as the store.

Linear Address Match Results in Store Forwarding

Consider the following code fragment:

---
mov  mem1,eax       ;store to linear address 1
mov  eax,mem1       ;load from linear address 1
                    ;Because the load uses the same
                    ;linear address as the store, the
                    ;load must be executed after the
                    ;store. The load then obtains its
                    ;data from the Store Buffer.

If a load will read from the same linear address as an earlier store, it cannot be executed until that store has executed and its write data has been posted in a Store Buffer. The load then obtains its read data from the Store Buffer.
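
As a rough illustration of the two cases above, the following C sketch models a toy store buffer. A load to an address with a pending store is satisfied from the buffer, while a load to any other address reads memory directly. Everything here is an assumption made for illustration except the count of 24 buffers, which comes from the text: the type and function names are hypothetical, memory is a tiny word array, addresses are word indices, and prediction, replay, and access sizes are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_STORE_BUFFERS 24   /* buffer count given in the text          */
#define MEM_WORDS         16   /* hypothetical tiny memory for the sketch */

typedef struct {
    uint32_t addr;             /* word index standing in for a linear address */
    uint32_t data;
} store_entry;

typedef struct {
    store_entry entry[NUM_STORE_BUFFERS];  /* circular queue, oldest at head */
    int         head;
    int         count;
    uint32_t    memory[MEM_WORDS];
} machine;

/* Post an executed store. Returns false if every buffer is occupied. */
static bool post_store(machine *m, uint32_t addr, uint32_t data)
{
    assert(addr < MEM_WORDS);
    if (m->count == NUM_STORE_BUFFERS)
        return false;
    int slot = (m->head + m->count) % NUM_STORE_BUFFERS;
    m->entry[slot].addr = addr;
    m->entry[slot].data = data;
    m->count++;
    return true;
}

/* Retire the oldest store: only now does its data reach memory. */
static bool retire_oldest_store(machine *m)
{
    if (m->count == 0)
        return false;
    m->memory[m->entry[m->head].addr] = m->entry[m->head].data;
    m->head = (m->head + 1) % NUM_STORE_BUFFERS;
    m->count--;
    return true;
}

/* Execute a load: search pending stores from youngest to oldest and
   forward on an address match, otherwise read memory. */
static uint32_t do_load(const machine *m, uint32_t addr, bool *forwarded)
{
    assert(addr < MEM_WORDS);
    for (int i = m->count - 1; i >= 0; i--) {
        const store_entry *e = &m->entry[(m->head + i) % NUM_STORE_BUFFERS];
        if (e->addr == addr) {
            *forwarded = true;
            return e->data;
        }
    }
    *forwarded = false;
    return m->memory[addr];
}
```

Searching from youngest to oldest means that, when several pending stores target the same address, the load sees the most recent one. After `retire_oldest_store` drains the entry, the same load returns the same value, but from memory rather than from the buffer.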

Store Forwarding Rules

When a load obtains its data from a Store Buffer, it does not have to wait for the store to write to the caches or to memory. The data from the Store Buffer can be used to satisfy the load if all of the following conditions are met:

  • Sequence. The data to be forwarded to the load is generated by an earlier store that was already executed.

  • Size. The bytes read must be the same as or a subset of the bytes stored.

  • Alignment:

    - The store cannot cross a cache line boundary, and

    - the linear address used in the load must be the same as that used in the store.

If any of these conditions is not met, the load cannot be executed until the store instruction has been retired and its data has been written to the cache. The load is then executed and obtains its data from the cache or from memory.
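
The rules listed above can be written as a single predicate. The following C sketch is a reading of those rules as stated here, not a description of the hardware: the names `mem_access`, `crosses_line`, and `can_forward` are hypothetical, and the 64-byte cache line size is an assumption that does not appear in the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u   /* assumed line size, not stated in the text */

typedef struct {
    uint32_t addr;             /* linear address of the first byte accessed */
    uint32_t size;             /* access size in bytes                      */
} mem_access;

/* True if the access spans two cache lines. */
static bool crosses_line(mem_access a)
{
    assert(a.size > 0);
    return (a.addr / CACHE_LINE_BYTES) !=
           ((a.addr + a.size - 1u) / CACHE_LINE_BYTES);
}

/* Apply the Sequence, Size, and Alignment rules to one store/load pair. */
static bool can_forward(mem_access store, bool store_executed, mem_access load)
{
    if (!store_executed)           /* Sequence: store must already have executed  */
        return false;
    if (load.size > store.size)    /* Size: load reads the same bytes or a subset */
        return false;
    if (crosses_line(store))       /* Alignment: store stays within one line      */
        return false;
    if (load.addr != store.addr)   /* Alignment: same linear address              */
        return false;
    return true;
}
```

Under this reading, a 4-byte store at address 0x1000 can forward to a 4-byte or 2-byte load at 0x1000, but not to a 2-byte load at 0x1002 (different linear address) or to an 8-byte load at 0x1000 (the load reads bytes the store did not write).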
