μop Fusion

Background

A conventional μop is a single operation that operates on at most two sources. The Instruction Decoder decodes an IA32 instruction into multiple μops whenever:

  • The IA32 instruction operates on more than two sources, or

  • The nature of the operation requires a sequence of unrelated operations.

Many IA32 instructions are decoded into several μops. Two of these cases are:

  • Store operations. IA32 instructions that store data in memory are decoded as two μops:

    - The Store Address μop calculates the memory address to be stored to.

    - The Store Data μop stores the data into a Store Buffer.

    Decoding the store into two μops allows the Store Address μop to dispatch earlier (even in cases where the data to be stored has not yet been generated by another instruction).

  • Load and Operate (read then modify) operations. A typical Load and Operate IA32 instruction (e.g., ADD EAX, [12345]) consists of two μops:

    - The first μop reads the operand from an address in memory, and

    - The second μop calculates the result using the data read from memory and a register operand.

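    A Load and Operate IA32 instruction may use up to three register operands (the base and index registers that form the address, plus the register operand of the operation), which is more than a conventional two-source μop can accommodate. The second μop cannot start until the first μop completes.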
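
The two decodings described above can be sketched in a few lines of Python. This is only an illustrative model, not the actual decoder: the Uop class, the function names, and the temporary register name "tmp" are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Uop:
    """Hypothetical model of a conventional μop: one operation, at most two sources."""
    kind: str
    sources: Tuple[str, ...]
    dest: Optional[str] = None


def decode_store(base: str, index: str, data: str) -> List[Uop]:
    """Model of a store to [base+index]: Store Address μop plus Store Data μop."""
    return [
        Uop("store_address", (base, index)),  # computes the target address
        Uop("store_data", (data,)),           # moves the data toward a Store Buffer
    ]


def decode_load_op(dest: str, base: str, index: str) -> List[Uop]:
    """Model of e.g. ADD dest, [base+index]: Load μop plus Operate μop."""
    return [
        Uop("load", (base, index), dest="tmp"),    # reads the memory operand
        Uop("operate", (dest, "tmp"), dest=dest),  # combines it with the register operand
    ]
```

In this model neither μop ever needs more than two sources, yet the instruction as a whole reads three registers, which is why a single conventional μop cannot represent it.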

Decoding an IA32 instruction into multiple μops carries two penalties:

  • Each additional μop entering the pipeline consumes more resources, including μop Queue entries, ROB entries, alias registers, General and Memory μop Queue entries, and Scheduler Queue entries.

  • Delivering more μops through the pipeline increases the energy required to complete a given instruction sequence (handling a μop at each pipeline stage requires transistors to switch, which consumes energy).

μop Fusion Description

General

When the Pentium® M instruction decoder encounters one of the multi-μop operations mentioned earlier (a Store or a Load and Operate), it decodes it into a single, fused μop.

The fused μop is split into its two component μops only on arrival at the execution units, just prior to execution. This reduces the number of μops in the pipeline by as much as 10%, resulting in faster execution (due to the increased availability of resources) and lower power consumption. To support fused μops, each Reservation Station entry (i.e., each entry in the scheduler queues) can accommodate up to three source operands.

μop fusion yields a typical performance increase of 5% for integer code and 9% for FP code. Store fusion contributes most of the increase for integer code, while both types of fused μops contribute about equally to the increase for FP code.
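
To see where a reduction of this size can come from, the following Python sketch counts μops with and without fusion for a made-up instruction mix. The mix, the class names, and the function name are assumptions chosen purely for illustration; they are not measurements of any real workload.

```python
from typing import Dict, Tuple

# Instruction classes that the text describes as fusable into one μop.
FUSABLE = {"store", "load_op"}


def uop_counts(mix: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """mix maps instruction class -> (instruction count, unfused μops per instruction).

    Returns (μops without fusion, μops with fusion).
    """
    unfused = sum(count * per for count, per in mix.values())
    fused = sum(
        count * (1 if cls in FUSABLE else per)
        for cls, (count, per) in mix.items()
    )
    return unfused, fused


# Hypothetical mix: 90 single-μop instructions, 6 stores, 4 load-and-operate.
mix = {"simple": (90, 1), "store": (6, 2), "load_op": (4, 2)}
unfused, fused = uop_counts(mix)
reduction = 1 - fused / unfused  # fraction of μops removed from the pipeline
```

For this invented mix the pipeline carries 100 μops instead of 110, a reduction of roughly 9%. A mix with a larger share of stores and load-and-operate instructions would show a larger saving.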

The Fused Store

The two μops that comprise a fused store μop can be issued to their execution units in parallel. The dispatch of the Store Address μop to the Address Generation execution unit is performed when its sources (the base and index registers) are ready. The data to be stored may or may not have been generated at this point. The dispatch of the Store Data μop to a Store Buffer can occur independently when its source operand becomes available. The retirement of the fused store μop can occur only after both of the μops have completed execution.
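
The independence of the two halves can be illustrated with a small timing model in Python. It is a sketch under simplifying assumptions: the function name is hypothetical, each μop is assumed to dispatch in the cycle its sources become ready, and both are given the same made-up execution latency.

```python
from typing import Tuple


def fused_store_timeline(addr_ready: int, data_ready: int,
                         latency: int = 1) -> Tuple[int, int, int]:
    """Return (address dispatch, data dispatch, earliest retirement) cycles.

    addr_ready: cycle in which the base and index registers are available.
    data_ready: cycle in which the data to be stored is available.
    latency:    assumed execution latency of each μop (hypothetical value).
    """
    addr_dispatch = addr_ready  # does not wait for the store data
    data_dispatch = data_ready  # does not wait for the address
    # The fused μop can retire only once both halves have completed.
    earliest_retire = max(addr_dispatch, data_dispatch) + latency
    return addr_dispatch, data_dispatch, earliest_retire
```

With the address registers ready in cycle 2 and the data ready in cycle 7, the Store Address half dispatches five cycles ahead of the Store Data half, and retirement is bounded by the later of the two.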

The Fused Load and Operate

The two μops that comprise the fused Load and Operate μop are dispatched serially to their respective execution units. The dispatch of the Load is performed when its sources (the base and index registers) are ready. The dispatch of the Operate portion of the Load and Operate to its execution unit can occur only after the Load completes execution and the memory operand is resident in the Load Buffer. The retirement of the fused Load and Operate μop can occur only after both of the μops have completed execution.
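
The serial dependency can be modeled in the same style. Again this is only a sketch: the function name and the latencies are hypothetical, and the model additionally assumes that the Operate half also waits for its register operand to be ready.

```python
from typing import Tuple


def fused_load_op_timeline(addr_ready: int, reg_ready: int,
                           load_latency: int = 3,
                           op_latency: int = 1) -> Tuple[int, int, int]:
    """Return (load dispatch, operate dispatch, earliest retirement) cycles.

    addr_ready: cycle in which the base and index registers are available.
    reg_ready:  cycle in which the register operand is available (assumption).
    Latencies are made-up values for illustration.
    """
    load_dispatch = addr_ready
    load_done = load_dispatch + load_latency
    # The Operate half cannot dispatch until the Load has completed.
    op_dispatch = max(load_done, reg_ready)
    earliest_retire = op_dispatch + op_latency
    return load_dispatch, op_dispatch, earliest_retire
```

Unlike the fused store, making the register operand available early does not help here: the Operate half is always held back by at least the load latency.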
