Instruction Set Changes

Non-MMX Instructions

CMPXCHG8B

Compare and exchange eight bytes. Compares two 8-byte data objects:

  • Compares 64-bit value in EDX:EAX with a value in memory.

  • If equal, the value in ECX:EBX is stored into the specified memory operand.

  • If unequal, the contents of the memory operand are copied into EDX:EAX.

Can be used with a LOCK prefix to execute it as an atomic operation.

The processor never produces a locked read without producing a subsequent locked write. For that reason, the destination operand always receives a write regardless of the compare result: if the compare fails, the destination's current value is written back; otherwise, the source operand (ECX:EBX) is written into the destination.
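
The compare-and-exchange steps above can be sketched as a small C model. This is only an illustration of the register and memory effects, not the processor's implementation: the Regs structure and the function name are hypothetical, the model is not atomic (the real instruction needs the LOCK prefix for that), and the zero flag behavior (set when the values match, cleared otherwise) comes from Intel's instruction reference rather than from the description above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical register file holding only what CMPXCHG8B touches. */
typedef struct {
    uint32_t eax, ebx, ecx, edx;
    bool zf; /* zero flag: set when the compare succeeds */
} Regs;

/* Models the effect of CMPXCHG8B on a 64-bit memory operand.
   Not atomic: this is a plain C illustration only. */
void cmpxchg8b_model(Regs *r, uint64_t *mem)
{
    uint64_t expected = ((uint64_t)r->edx << 32) | r->eax;
    uint64_t current = *mem;

    if (current == expected) {
        *mem = ((uint64_t)r->ecx << 32) | r->ebx; /* store ECX:EBX */
        r->zf = true;
    } else {
        *mem = current;                      /* value written back unchanged */
        r->edx = (uint32_t)(current >> 32);  /* upper half into EDX */
        r->eax = (uint32_t)current;          /* lower half into EAX */
        r->zf = false;
    }
}
```

A caller would typically loop: load the expected value into EDX:EAX, compute the new value into ECX:EBX, execute the instruction, and retry while the zero flag is clear.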

RDTSC

Loads the current 64-bit TSC value into EDX:EAX (upper 32 bits into EDX and lower 32 bits into EAX). The TSC is incremented every clock cycle and is reset to 0 when the processor is reset.

When in Protected Mode or VM86 mode, CR4[TSD] (Time Stamp Disable) restricts RDTSC use as follows:

  • TSD = 0: RDTSC can be executed at any privilege level.

  • TSD = 1: Can only be executed at privilege level 0.

  • When in Real Mode, use of the RDTSC instruction is always enabled.

TSC can also be read using the RDMSR instruction by code executing at privilege level 0.

RDTSC is not a serializing instruction (see “Serializing Instructions” on page 1079). It does not necessarily wait until all previous instructions are executed before reading the TSC. Subsequent instructions may begin execution before the read is performed. For more information, refer to “Time Stamp Counter” on page 498.
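
As a rough sketch of the rules above, the following C fragment shows how the two register halves combine into one 64-bit count and how the CR4[TSD] restriction could be expressed as a predicate. Both function names are hypothetical, and the code only models the rules; it does not execute RDTSC.

```c
#include <stdbool.h>
#include <stdint.h>

/* Combines the halves returned in EDX (upper) and EAX (lower). */
uint64_t join_edx_eax(uint32_t edx, uint32_t eax)
{
    return ((uint64_t)edx << 32) | eax;
}

/* Returns true if RDTSC may execute under the CR4[TSD] rules.
   cpl is the current privilege level (0 = most privileged). */
bool rdtsc_permitted(bool real_mode, bool cr4_tsd, unsigned cpl)
{
    if (real_mode)
        return true;   /* always enabled in Real Mode */
    if (!cr4_tsd)
        return true;   /* TSD = 0: any privilege level */
    return cpl == 0;   /* TSD = 1: privilege level 0 only */
}
```

An elapsed-cycle measurement is then the difference between two joined values, bearing in mind that RDTSC is not serializing.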

RDMSR and WRMSR
RDMSR

When executed, this instruction loads the contents of the 64-bit MSR specified in ECX into the EDX:EAX register pair (EDX is loaded with the upper 32 MSR bits, and EAX is loaded with the lower 32 bits). It must be executed at privilege level 0 or in Real Mode; otherwise, a GP exception is generated.

The appendix of the Intel® IA32 Architecture Software Developer's Manual Volume 3: System Programming Guide lists all of the MSRs and their addresses. Each processor family has its own set of MSRs.

The CPUID instruction should be used to determine whether the MSRs are supported (EDX[5]=1) before executing this instruction on a processor.

WRMSR

When executed, this instruction writes the contents of the EDX:EAX register pair into the MSR specified by the MSR address in ECX. EDX is written to the upper 32 bits of the MSR and EAX into the lower 32 bits of the MSR. This instruction must be executed at privilege level 0 or in Real Mode (or a GP exception will be generated). When writing to an MTRR (Memory Type and Range Register; these registers are implemented in the Pentium® Pro and subsequent IA32 processors), the TLBs are invalidated, including any global entries (implemented in the Pentium® Pro and subsequent IA32 processors; see “Global Pages” on page 567 for more information). WRMSR is a serializing instruction (for more information, see “Serializing Instructions” on page 1079). The CPUID instruction should be used to determine whether the MSRs are supported (EDX[5] = 1) before using this instruction.
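
The register conventions for RDMSR and WRMSR can be sketched in C as follows. This is a toy model under stated assumptions: MsrFile, MODEL_MSR_COUNT, and both function names are hypothetical, and real MSR addresses are sparse and processor-specific rather than a small dense array. A return value of false stands in for the GP exception.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_MSR_COUNT 32u /* hypothetical: real MSR addresses are sparse */

typedef struct {
    uint64_t msr[MODEL_MSR_COUNT];
} MsrFile;

/* Returns false where the real processor would raise a GP exception:
   CPL != 0 outside Real Mode, or an address the model lacks. */
bool rdmsr_model(const MsrFile *f, bool real_mode, unsigned cpl,
                 uint32_t ecx, uint32_t *edx, uint32_t *eax)
{
    if ((!real_mode && cpl != 0) || ecx >= MODEL_MSR_COUNT)
        return false;
    *edx = (uint32_t)(f->msr[ecx] >> 32); /* upper 32 bits */
    *eax = (uint32_t)f->msr[ecx];         /* lower 32 bits */
    return true;
}

bool wrmsr_model(MsrFile *f, bool real_mode, unsigned cpl,
                 uint32_t ecx, uint32_t edx, uint32_t eax)
{
    if ((!real_mode && cpl != 0) || ecx >= MODEL_MSR_COUNT)
        return false;
    f->msr[ecx] = ((uint64_t)edx << 32) | eax; /* EDX high, EAX low */
    return true;
}
```

The model leaves out the side effects described above (TLB invalidation on MTRR writes and serialization), which have no equivalent in portable C.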

CPUID Instruction
Description

The CPUID instruction made its debut in the Pentium® processor and was also implemented in the later versions of the 486 (as well as in all subsequent IA32 processors). A detailed description can be found in “CPU Identification” on page 1443.

The additional Pentium®-related capabilities (refer to Figure 21-17 on page 519) that were not present in the later versions of the 486 were:

- DE. Debug Extensions.

- TSC. Time Stamp Counter.

- MSR. RDMSR and WRMSR instructions.

- MCE. Machine Check Exception.

- APIC. The Advanced Programmable Interrupt Controller was introduced in the P54C version of the Pentium® processor.

- MMX. The MMX instruction set was introduced in the P55C version of the Pentium® processor.

Figure 21-17. EDX Capability Bit Mask
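
A minimal sketch of testing the capability bits listed above, given the EDX value returned by a CPUID request type 1. The bit positions are the ones Intel documents for the CPUID feature flags (only bits 5 and 23 are stated in this chapter's text); the constant and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Feature bits in EDX after CPUID request type 1. */
enum {
    CPUID_EDX_DE   = 1 << 2,  /* Debug Extensions */
    CPUID_EDX_TSC  = 1 << 4,  /* Time Stamp Counter */
    CPUID_EDX_MSR  = 1 << 5,  /* RDMSR and WRMSR */
    CPUID_EDX_MCE  = 1 << 7,  /* Machine Check Exception */
    CPUID_EDX_APIC = 1 << 9,  /* on-chip APIC */
    CPUID_EDX_MMX  = 1 << 23  /* MMX instruction set */
};

/* Returns true if every bit in feature_mask is set in edx. */
bool cpu_has(uint32_t edx, uint32_t feature_mask)
{
    return (edx & feature_mask) == feature_mask;
}
```

Software would check, for example, cpu_has(edx, CPUID_EDX_MSR) before executing RDMSR or WRMSR.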


MMX Capability

Introduction

The MMX instruction set was introduced in the P55C version of the Pentium® processor and consisted of 47 new instructions. In addition, there are eight MMX data registers (MM[7:0]; see Figure 21-18 on page 520). As shown in the illustration, the lower 64 bits of each of the FP data registers perform double duty:

  • They are used as MMX registers when MMX code is being executed.

  • They are used as FP data registers when x87 FPU code is being executed.

Figure 21-18. MMX Register Set


The Basic Problem

Refer to Figure 21-19 on page 523. As an example, assume that there are two video frame buffers in memory (it should not be assumed, however, that MMX is only intended for processing video data). The current video mode has the following characteristics:

  • Each location in each of the two buffers represents the color of one pixel. The first buffer location corresponds to the first pixel at the left end of the first line of pixels on the screen, the second buffer location corresponds to the second pixel from the left end of that line, etc.

  • A single location contains 8 bits (one byte), so a pixel can be any one of 256 possible colors (as represented by the values 00h-FFh).

  • The video controller is currently operating at a resolution of 1024 x 768, so each of the two video frame buffers consists of 786,432 locations.

Figure 21-19. Example Operation on Dual Frame Buffers


Now assume that the programmer wants to:

  1. Read the byte from the first location of one buffer,

  2. Read the first location of the other buffer,

  3. Add the two bytes together, and

  4. Store the result back into the first location of the second buffer.

  5. Repeat the operation for every pixel in the two frame buffers.

This could be accomplished in the following manner:

  1. Read one byte from one buffer into a one byte register (e.g., AL).

  2. Read the corresponding byte from the other buffer into another one byte register (e.g., BL).

  3. Add AL and BL together and store the result in BL.

  4. Since the add may generate a carry, the programmer has to decide whether to discard the carry or to deal with it. If the carry must be dealt with, the programmer must follow the add with a conditional branch that either jumps to the code that handles the carry or skips over that code.

As indicated in the illustration, this would result in 786,432 x 2 memory reads and 786,432 memory writes. This code would generate a tremendous number of memory accesses which may or may not hit on the processor's internal data cache. Any misses would result in memory transactions being performed on the FSB. This would degrade performance in two ways:

  • In a multiprocessor system, the FSB bandwidth available to the other processor(s) could be substantially impacted.

  • Since the FSB operates at a substantially slower rate of speed than the processor core, the memory accesses would be accomplished quite slowly.

The number of memory accesses could be reduced by reading four bytes at a time from each buffer:

  1. Read four bytes from one buffer into a 32-bit register (e.g., EAX).

  2. Read the corresponding four bytes from the other buffer into another 32-bit register (e.g., EBX).

  3. Add EAX and EBX together and store the result in EBX.

There is a problem here that must be dealt with. The four bytes read into EAX from the first buffer represent four independent pixel values. Likewise, the four bytes read into EBX from the second buffer represent four independent pixel values. If the programmer simply adds EAX and EBX together, the processor treats the contents of the two registers as 32-bit integer values, so a carry out of one pixel's byte spills into the neighboring pixel. The result will therefore be completely bogus. To fix this, the programmer would have to do something like the following (yes, the author realizes this is a simplistic and incomplete approach; it's to make the point that a lot of work is involved):

  mov    i,0
PixelLoop:
  mov    eax,x[i]        ;read four pixels into eax starting at
                         ;the ith location in frame buffer x
  mov    ebx,y[i]        ;read four pixels into ebx starting at
                         ;the ith location in frame buffer y
  call   ShiftIsolateAdd ;isolate least-significant byte of both
                         ;registers and add (the routine is assumed
                         ;to accumulate the 4-pixel result in edx)
  jnc    Byte2           ;check for carry
  call   CarryHandler    ;call handler if carry set
Byte2:
  call   ShiftIsolateAdd ;isolate next byte in both registers
                         ;and add
  jnc    Byte3           ;check for carry
  call   CarryHandler    ;call handler if carry set
Byte3:
  call   ShiftIsolateAdd ;isolate 3rd byte of both registers
                         ;and add
  jnc    Byte4           ;check for carry
  call   CarryHandler    ;call handler if carry set
Byte4:
  call   ShiftIsolateAdd ;isolate most-significant byte in both
                         ;registers and add
  jnc    StoreResult     ;check for carry
  call   CarryHandler    ;call handler if carry set
StoreResult:
  mov    y[i],edx        ;store 4 pixel result in buffer y
  add    i,4             ;point to next dword
  cmp    i,BufferEnd     ;check for end of buffer
  jnz    PixelLoop       ;continue if not done
Done:

Although reading four pixels at a time from each buffer and writing four pixels at a time decreases the number of memory accesses, the programmer has to engage in a fair amount of bit-slinging to accomplish the goal. In addition, a conditional branch is performed after each add to see if a carry resulted and has to be dealt with. On later IA32 processors (starting with the Pentium® Pro), mispredicted branches result in a very deep performance dip. In this case, the conditional branches are dependent on completely unpredictable pixel data being received from a video source, so the misprediction rate will almost certainly be quite high.
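
The carry problem described above is easy to demonstrate in portable C. The sketch below contrasts a plain 32-bit add with an add that treats the dword as four independent pixels and discards each byte's carry. Both function names are hypothetical, and the per-byte loop stands in for the shift-and-isolate work the assembly listing delegates to ShiftIsolateAdd.

```c
#include <stdint.h>

/* Plain 32-bit add: a carry out of one pixel spills into its neighbor. */
uint32_t add_as_dword(uint32_t a, uint32_t b)
{
    return a + b;
}

/* Adds the four bytes independently, discarding each byte's carry. */
uint32_t add_four_pixels(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t pa = (a >> shift) & 0xFFu;      /* isolate one pixel */
        uint32_t pb = (b >> shift) & 0xFFu;
        result |= ((pa + pb) & 0xFFu) << shift;  /* keep low 8 bits */
    }
    return result;
}
```

With a = 102030FFh and b = 01010101h, the plain add yields 11213200h (the second pixel is off by one), whereas the per-pixel add yields 11213100h.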

MMX SIMD Solution

Refer to Figure 21-20 on page 524. MMX instructions can perform a simultaneous operation on bytes, words or dwords that are packed into MMX registers. This is referred to as a Single Instruction operating on Multiple Data items (SIMD). The programmer can read 64-bits (8 bytes, 4 words, or 2 dwords) into an MMX register using one instruction. In the example illustrated, the programmer has loaded eight packed bytes into MMX register MM0, another eight packed bytes into MMX register MM1, and then executes a PADDB instruction (an add on packed bytes). Loading 8 bytes rather than 4 bytes into a register at a time reduces the number of memory accesses that have to be performed. Furthermore, the MMX execution unit has eight independent adders that operate simultaneously on the 8 bytes in each of the registers. This results in a dramatic reduction in the compute time.

Figure 21-20. MMX SIMD Solution Increases Throughput
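
To make the packed add concrete, here is a portable C model of what PADDB computes: eight independent byte additions with wraparound, carried out on two 64-bit values. The function names are hypothetical, and the lane loop only emulates what the MMX execution unit does in parallel with a single instruction.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Models PADDB: eight independent byte adds with wraparound. */
uint64_t paddb_model(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 8; lane++) {
        /* Only the low 8 bits of the sum are kept for this lane. */
        uint8_t sum = (uint8_t)((a >> (8 * lane)) + (b >> (8 * lane)));
        result |= (uint64_t)sum << (8 * lane);
    }
    return result;
}

/* Adds buffer x into buffer y eight pixels at a time. Any tail of
   fewer than eight bytes is left untouched in this sketch. */
void add_buffers(const uint8_t *x, uint8_t *y, size_t count)
{
    for (size_t i = 0; i + 8 <= count; i += 8) {
        uint64_t a, b;
        memcpy(&a, x + i, 8);  /* one 64-bit read from each buffer */
        memcpy(&b, y + i, 8);
        b = paddb_model(a, b);
        memcpy(y + i, &b, 8);  /* one 64-bit write */
    }
}
```

For the 786,432-pixel buffers in the example, this arrangement needs 98,304 loop iterations instead of 786,432.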


Dealing with Unpacked Data

Refer to Figure 21-21 on page 525. Sometimes, the data that the programmer wishes to perform a SIMD packed operation on is stored in memory in unpacked form. As an example, there is a video text mode wherein each text character byte in the video frame buffer is immediately followed by an attribute byte that defines the text character's attributes (e.g., underscore, blink, etc.). The programmer may wish to perform a SIMD operation on just the text characters or on the attributes (in other words, on every other byte).

Figure 21-21. Dealing with Unpacked Data


The MMX instruction set includes instructions that can read unpacked data from memory and pack it into an MMX register. Conversely, instructions are included that take data that is packed into an MMX register and store it to memory in unpacked form.
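
One plausible way to gather just the character bytes from the text-mode example is to mask off the attribute bytes (PAND) and then pack the resulting words into bytes (PACKUSWB). The C sketch below models that combination; it is an assumption about how the task might be done, not the method shown in the figure. The function names are hypothetical, and the character is assumed to occupy the low byte of each 16-bit character/attribute pair.

```c
#include <stdint.h>

/* Models PACKUSWB: packs four signed words from lo and four from hi
   into eight unsigned bytes, saturating each to the range 0..255. */
uint64_t packuswb_model(uint64_t lo, uint64_t hi)
{
    uint64_t result = 0;
    for (int i = 0; i < 8; i++) {
        uint64_t src = (i < 4) ? lo : hi;
        uint16_t raw = (uint16_t)(src >> (16 * (i % 4)));
        /* Reinterpret the 16-bit pattern as a signed value. */
        int32_t word = (raw & 0x8000u) ? (int32_t)raw - 0x10000
                                       : (int32_t)raw;
        uint8_t byte = (word < 0) ? 0 : (word > 255) ? 255 : (uint8_t)word;
        result |= (uint64_t)byte << (8 * i);
    }
    return result;
}

/* Extracts eight character bytes from eight character/attribute
   pairs held in two 64-bit values (four pairs each). */
uint64_t pack_characters(uint64_t pairs_lo, uint64_t pairs_hi)
{
    const uint64_t low_bytes = 0x00FF00FF00FF00FFull; /* PAND mask */
    return packuswb_model(pairs_lo & low_bytes, pairs_hi & low_bytes);
}
```

Because the mask leaves every word in the range 0 to 255, the saturation in the pack step never alters a character value here.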

Dealing with Math Underflows and Overflows

The MMX technology provides three ways of handling out-of-range conditions:

  • Wraparound math. Using wraparound math, a true out-of-range result is truncated. The carry or overflow is ignored and only the least-significant bits of the result are stored in the destination. Wraparound math can be used in applications that control the range of operands to prevent out-of-range results. Care should be taken, however, because, if the range of operands is not controlled, wraparound math can result in large errors (e.g., adding two large, signed numbers can result in positive overflow and produce a negative result).

  • Signed saturation math. Using signed saturation math, out-of-range results are automatically clamped to the representable range of signed integers for the integer size being operated on (see Table 21-2 on page 526). Two examples:

    - If an operation on signed word integers results in a positive overflow, the result is clamped (“saturated”) to 7FFFh, the largest positive integer that can be represented in 16 bits.

    - If in the same scenario negative overflow occurs, the result is saturated to 8000h.

    Table 21-2. Data Range Limits for Saturation

    Data Type        Lower Limit           Upper Limit
                     Hex      Decimal      Hex      Decimal
    Signed Byte      80h      -128         7Fh      127
    Signed Word      8000h    -32,768      7FFFh    32,767
    Unsigned Byte    00h      0            FFh      255
    Unsigned Word    0000h    0            FFFFh    65,535

  • Unsigned saturation math. Using unsigned saturation math, out-of-range results are automatically clamped to the representable range of unsigned integers for the integer size being operated on. Positive overflow when operating on unsigned byte integers results in FFh being returned and negative overflow results in 00h being returned.

Saturated math lends itself well to many overflow scenarios. As an example, when performing color calculations, saturation causes a color to remain pure black or pure white and does not result in color inversion. It also prevents wraparound artifacts from affecting a computation (when operand range checking is not used).
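
The three behaviors can be compared side by side with a short C sketch. Each function models a single lane of the corresponding packed instruction (PADDB, PADDUSB, PSUBUSB, and PADDSW respectively); the function names are hypothetical.

```c
#include <stdint.h>

/* Wraparound: the carry is discarded (one lane of PADDB). */
uint8_t add_wrap_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}

/* Unsigned saturation: clamp to FFh (one lane of PADDUSB). */
uint8_t add_sat_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + b;
    return (sum > 0xFFu) ? 0xFFu : (uint8_t)sum;
}

/* Unsigned saturation: clamp to 00h (one lane of PSUBUSB). */
uint8_t sub_sat_u8(uint8_t a, uint8_t b)
{
    return (a < b) ? 0 : (uint8_t)(a - b);
}

/* Signed saturation: clamp to 8000h..7FFFh (one lane of PADDSW). */
int16_t add_sat_s16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;
    if (sum > INT16_MAX)
        return INT16_MAX;
    if (sum < INT16_MIN)
        return INT16_MIN;
    return (int16_t)sum;
}
```

Adding the byte values 200 and 100 gives 44 with wraparound but 255 with unsigned saturation, which is the difference between a bright pixel turning dark and a bright pixel staying at full intensity.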

It should be noted that MMX instructions do not indicate overflow or underflow by generating exceptions or setting flags in the EFlags register.

Elimination of Conditional Branches
Introduction

As mentioned earlier (see “The Basic Problem” on page 521), the later IA32 processors (starting with the Pentium® Pro) experience a deep performance dip if the processor mispredicts a conditional branch and fetches the wrong instructions into the processor pipeline to be executed after the branch instruction. Conditional branches are especially troublesome if the condition being tested is based on random, unpredictable data (e.g., video data).

Non-MMX Chroma-Key/Blue Screen Compositing Example

When the weather person is shown on TV walking around in front of the weather map, this is really the result of the real-time merging of two video frame buffers: one contains the data received from a camera pointing at the map, while the other contains the data received from a camera pointing at the person walking around in front of a blue background. The program is constantly studying the buffer containing the person pixels and, wherever a blue pixel is detected, it is replaced with the same pixel from the map video buffer.

The code fragment shown in Figure 21-22 on page 528 compares the value that represents the color blue against the ith location in the weather person's video frame buffer. The compare is followed by a conditional branch and one of two actions is taken based on the results of the comparison:

- If the pixel in the weather person's buffer isn't blue, the program jumps to next_pixel (not shown). In next_pixel, the pointer value (i) is incremented and a compare for the end of buffer is performed. If the end of buffer has not been reached, the code fragment shown is repeated again for the next pixel. When the end of buffer is reached, the process starts over again.

- If the pixel in the weather person's buffer is blue, it is replaced with the same pixel from buffer y (i.e., the map buffer). The program would then execute next_pixel again and continue to do so until the entire buffer has been processed.

Figure 21-22. Conditional Branches Can Severely Decrease Performance


This process is slow because it does not use the wider MMX registers to hold the pixel information and perform the comparison. It also results in abysmal performance because comparisons on random video data almost certainly cause a high incidence of mispredicted branches.

MMX Chroma-Keying/Blue Screen Compositing Example

Figure 21-23 on page 529 shows an example MMX code fragment that can be used to accomplish the Chroma-Keying effect. Some things to note:

- Throughput is considerably enhanced by using MMX's SIMD capability to process four pixels at a time.

- There are no conditional branches, thereby eliminating the potential performance degradation that accompanies mispredicted branches.

Figure 21-23. Example MMX Operation (1-of-4)


The code fragment consists of the following instructions:

- movq mm0, x[i]. This instruction moves four packed words (64 bits) from memory starting at the ith location in frame buffer x (i.e., the weather person's buffer) into MMX register MM0.

- pcmpeqw mm0, BLUE. Refer to Figure 21-23 on page 529. The prefix “p” stands for “packed”, while the suffix “w” stands for “words”. This instruction compares the four word values in the memory location labeled BLUE to the four pixels in MMX register MM0. The value BLUE in memory consists of four pixel values (packed into four contiguous memory locations) that represent the color blue. As shown at the bottom of Figure 21-23 on page 529, the instruction produces an array of four 16-bit true/false indicators in MMX register MM0 (FFFFh where the pixel is blue, 0000h where it is not).

- pandn x[i], mm0. Refer to Figure 21-24 on page 530. PANDN inverts one operand (here, the true/false mask in MMX register MM0) and ANDs the result with the other operand (the four pixels from the weather person buffer, starting at the ith location). Each bit of the resulting four 16-bit pixels is therefore set to 1 only if the corresponding mask bit is 0 and the corresponding pixel bit is 1; otherwise, it is set to 0. This has the net effect of zeroing the pixels that are blue in the next four pixel locations in the weather person buffer.

Figure 21-24. Example MMX Operation (2-of-4)

- pand mm0, y[i]. As shown in Figure 21-25 on page 530, this instruction zeros pixels in the next four locations of the map image that must be replaced by the respective pixels from the weather person image.

Figure 21-25. Example MMX Operation (3-of-4)

- por x[i], mm0. As shown in Figure 21-26 on page 531, this instruction combines the non-zero pixels within the next four pixels from the map image with the non-zero pixels (of the person) from the weather person image and stores the resulting four composite pixels in the weather person buffer.

Figure 21-26. Example MMX Operation (4-of-4)
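
The four steps above can be sketched in portable C to show why no branch on pixel data is needed. This is a model of the technique rather than the code in the figures: the function names are hypothetical, the pixels are assumed to be 16 bits wide as in the example, and the loop and comparison inside pcmpeqw_model only emulate what the hardware does in one instruction.

```c
#include <stdint.h>

/* Models PCMPEQW: each 16-bit lane becomes FFFFh if equal, else 0000h. */
uint64_t pcmpeqw_model(uint64_t a, uint64_t b)
{
    uint64_t mask = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint16_t wa = (uint16_t)(a >> (16 * lane));
        uint16_t wb = (uint16_t)(b >> (16 * lane));
        if (wa == wb)
            mask |= (uint64_t)0xFFFFu << (16 * lane);
    }
    return mask;
}

/* Composites four pixels: person pixels that are blue are replaced
   by the corresponding map pixels. */
uint64_t chroma_key4(uint64_t person, uint64_t map, uint64_t blue4)
{
    uint64_t mask = pcmpeqw_model(person, blue4); /* true where blue */
    uint64_t keep = ~mask & person;  /* PANDN step: zero blue pixels */
    uint64_t fill = mask & map;      /* PAND step: keep needed map pixels */
    return keep | fill;              /* POR step: merge the two */
}
```

Because the mask is all ones or all zeros in each lane, the AND and OR operations select whole pixels, and the same instruction sequence runs regardless of the pixel values.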

Detecting MMX Capability

Whether or not a processor supports MMX is detected by executing a CPUID request type 1. The processor's capabilities bit mask is returned in the EDX register (see Figure 21-27 on page 531). Bit 23 = 1 indicates that the processor supports MMX.

Figure 21-27. MMX Capability Is Indicated by EDX[23] = 1


Changes To the Programming Environment

Refer to Figure 21-28 on page 532. The processor does not actually implement a separate, distinct MMX register set. Rather, the eight 64-bit MMX registers, MM[7:0], are aliased over the lower 64 bits of the eight FP data registers.

Figure 21-28. MMX Registers Are Aliased Into the Lower 64-bits of the FP Data Registers


When the processor is executing FP instructions, the data registers are treated as a stack of eight 80-bit FP data registers. When it executes an MMX instruction, the eight data registers are treated as the MMX registers, each of which is 64 bits wide and is mapped into the lower 64 bits of the FP data registers. Also, as indicated in the note at the bottom of the figure, the execution of any MMX instruction (other than EMMS) sets all eight fields in the x87 FPU Tag Word Register (TWR) to 00b. This erroneously indicates that all eight of the x87 data registers contain valid FP data. Before using any of the x87 data registers for FP operations after any of them have been used for MMX operations, the EMMS (Empty MMX State) instruction must be executed to set all eight Tag fields to 11b, indicating that none of the data registers contains valid FP data.
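
The effect on the Tag Word Register can be illustrated with a very small C model. The structure and function names are hypothetical, and the model tracks only the 16-bit tag word (eight 2-bit fields), nothing else in the x87 state.

```c
#include <stdint.h>

/* Hypothetical model holding only the x87 tag word. */
typedef struct {
    uint16_t tag_word; /* eight 2-bit fields, one per data register */
} X87State;

/* Any MMX instruction other than EMMS marks all registers valid. */
void after_mmx_instruction(X87State *s)
{
    s->tag_word = 0x0000u; /* all eight fields = 00b */
}

/* EMMS marks all registers empty so x87 code can use them. */
void emms_model(X87State *s)
{
    s->tag_word = 0xFFFFu; /* all eight fields = 11b */
}

/* Returns the 2-bit tag for data register reg (0..7). */
unsigned tag_of(const X87State *s, unsigned reg)
{
    return (s->tag_word >> (2 * reg)) & 3u;
}
```

In practice this means a routine that uses MMX instructions should execute EMMS before returning to code that may use the x87 FPU.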

Handling a Task Switch

Refer to “Device Not Available Exception (7)” on page 299.

MMX Instruction Set Syntax

The MMX instruction set is summarized in Table 21-3 on page 533 and Table 21-4 on page 534. Note that some of the instructions include a “B”, “W”, or “D” suffix at the end. This indicates that the instruction operates upon packed bytes, words, or dwords. A conversion instruction converts one data type to another, so it has two suffix characters at the end to indicate the “from” and “to” data types. As an example, the PACKUSWB instruction converts eight signed word integers from the two specified 64-bit sources into eight unsigned byte integers using unsigned saturation.

Table 21-3. MMX Instruction Set Summary, Part 1

Category     Instruction Type   Wraparound                        Signed Saturation    Unsigned Saturation
Arithmetic   Addition           PADDB, PADDW, PADDD               PADDSB, PADDSW       PADDUSB, PADDUSW
             Subtraction        PSUBB, PSUBW, PSUBD               PSUBSB, PSUBSW       PSUBUSB, PSUBUSW
             Multiplication     PMULLW, PMULHW
             Multiply and Add   PMADDWD
Comparison   Compare for =      PCMPEQB, PCMPEQW, PCMPEQD
             Compare for >      PCMPGTB, PCMPGTW, PCMPGTD
Conversion   Pack                                                 PACKSSWB, PACKSSDW   PACKUSWB
Unpack       Unpack High        PUNPCKHBW, PUNPCKHWD, PUNPCKHDQ
             Unpack Low         PUNPCKLBW, PUNPCKLWD, PUNPCKLDQ

Table 21-4. MMX Instruction Set Summary, Part 2

Category          Instruction Type         Packed            Full Qword
Logical           And                                        PAND
                  And Not                                    PANDN
                  Or                                         POR
                  Exclusive OR                               PXOR
Shift             Shift Left Logical       PSLLW, PSLLD      PSLLQ
                  Shift Right Logical      PSRLW, PSRLD      PSRLQ
                  Shift Right Arithmetic   PSRAW, PSRAD

                                           Dword Transfers   Qword Transfers
Data Transfer     Register to Register     MOVD              MOVQ
                  Load from Memory         MOVD              MOVQ
                  Store to Memory          MOVD              MOVQ
Empty MMX State                            EMMS              n/a

MMX Execution Unit

Figure 21-29 on page 535 shows the placement of the MMX execution unit in the Pentium® processor.

Figure 21-29. Pentium® MMX Execution Unit

