CMPXCHG8B: compare and exchange eight bytes. The instruction compares two 8-byte data objects:
- It compares the 64-bit value in EDX:EAX with a value in memory.
- If the two are equal, the value in ECX:EBX is stored into the specified memory operand.
- If they are unequal, the contents of the memory operand are copied into EDX:EAX.
The instruction can be used with a LOCK prefix to execute it as an atomic operation.
The processor never produces a locked read without producing a subsequent locked write. The destination operand always receives a write regardless of the compare result. The destination operand is written back if the compare fails; otherwise, the source operand is written into the destination.
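The compare-and-exchange behavior described above can be sketched as a small software model in C. This is only an illustration: the `Regs` structure and the function name `cmpxchg8b_model` are hypothetical, the model does not reproduce the locked bus cycle, and the ZF result (set on a successful compare) is architectural behavior not spelled out in the text above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical register snapshot used only by this model. */
typedef struct {
    uint32_t eax, ebx, ecx, edx;
    bool zf; /* set when the compare succeeds */
} Regs;

/* Models the data movement of CMPXCHG8B m64 (no bus locking is modeled). */
static void cmpxchg8b_model(Regs *r, uint64_t *mem)
{
    uint64_t expected = ((uint64_t)r->edx << 32) | r->eax; /* EDX:EAX */
    uint64_t desired  = ((uint64_t)r->ecx << 32) | r->ebx; /* ECX:EBX */

    if (*mem == expected) {
        *mem = desired;                  /* equal: ECX:EBX -> memory */
        r->zf = true;
    } else {
        r->edx = (uint32_t)(*mem >> 32); /* unequal: memory -> EDX:EAX */
        r->eax = (uint32_t)*mem;
        r->zf = false;
    }
}
```

In portable C11 code, the closest standard analog is `atomic_compare_exchange_strong` on a 64-bit atomic object.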
RDTSC loads the current 64-bit Time Stamp Counter (TSC) value into EDX:EAX (upper 32 bits into EDX and lower 32 bits into EAX). The TSC is incremented every clock cycle and is reset to 0 when the processor is reset.
When in Protected Mode or VM86 mode, CR4[TSD] (Time Stamp Disable) restricts RDTSC use as follows:
- TSD = 0: RDTSC can be executed at any privilege level.
- TSD = 1: RDTSC can only be executed at privilege level 0.
When in Real Mode, use of the RDTSC instruction is always enabled.
TSC can also be read using the RDMSR instruction by code executing at privilege level 0.
RDTSC is not a serializing instruction (see “Serializing Instructions” on page 1079). It does not necessarily wait until all previous instructions are executed before reading the TSC. Subsequent instructions may begin execution before the read is performed. For more information, refer to “Time Stamp Counter” on page 498.
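As a rough illustration of the rules above, the following C sketch models how software would reassemble the 64-bit count from the two register halves and how the CR4[TSD] check gates the instruction. The function names are hypothetical, and the code models the rules rather than executing RDTSC itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Rebuild the 64-bit TSC value: EDX holds the upper half, EAX the lower half. */
static uint64_t tsc_from_edx_eax(uint32_t edx, uint32_t eax)
{
    return ((uint64_t)edx << 32) | eax;
}

/* Model of the CR4[TSD] restriction; cpl is the current privilege level (0-3). */
static bool rdtsc_permitted(bool real_mode, bool cr4_tsd, unsigned cpl)
{
    if (real_mode)
        return true;   /* always enabled in Real Mode */
    if (!cr4_tsd)
        return true;   /* TSD = 0: any privilege level */
    return cpl == 0;   /* TSD = 1: privilege level 0 only */
}
```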
When executed, RDMSR loads the contents of the 64-bit MSR specified in ECX into the EDX:EAX register pair (EDX is loaded with the upper 32 MSR bits, and EAX is loaded with the lower 32 bits). It must be executed at privilege level 0 or in Real Mode; otherwise, a GP exception is generated.
The appendix of the Intel® IA32 Architecture Software Developer's Manual Volume 3: System Programming Guide lists all of the MSRs and their addresses. Each processor family has its own set of MSRs.
The CPUID instruction should be used to determine whether the MSRs are supported (EDX[5]=1) before executing this instruction on a processor.
When executed, WRMSR writes the contents of the EDX:EAX register pair into the MSR specified by the MSR address in ECX. EDX is written to the upper 32 bits of the MSR and EAX to the lower 32 bits. The instruction must be executed at privilege level 0 or in Real Mode; otherwise, a GP exception is generated. When writing to an MTRR (Memory Type and Range Register; these registers are implemented in the Pentium® Pro and subsequent IA32 processors), the TLBs are invalidated, including any global entries (also implemented in the Pentium® Pro and subsequent IA32 processors; see “Global Pages” on page 567 for more information). WRMSR is a serializing instruction (for more information, see “Serializing Instructions” on page 1079). The CPUID instruction should be used to determine whether the MSRs are supported (EDX[5] = 1) before using this instruction.
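The register conventions for the two MSR instructions can be sketched in C as follows. This is a model under stated assumptions, not system code: the helper names are hypothetical, and the `cpuid_edx` argument is assumed to be the EDX feature mask returned by CPUID request type 1.

```c
#include <stdbool.h>
#include <stdint.h>

#define CPUID_EDX_MSR (1u << 5) /* EDX[5] = 1: MSRs are supported */

/* Assumes cpuid_edx is the feature mask from CPUID request type 1. */
static bool cpuid_reports_msr(uint32_t cpuid_edx)
{
    return (cpuid_edx & CPUID_EDX_MSR) != 0;
}

/* RDMSR direction: EDX = upper 32 MSR bits, EAX = lower 32 MSR bits. */
static uint64_t msr_value_from_regs(uint32_t edx, uint32_t eax)
{
    return ((uint64_t)edx << 32) | eax;
}

/* WRMSR direction: split a 64-bit value into the EDX:EAX pair. */
static void msr_value_to_regs(uint64_t value, uint32_t *edx, uint32_t *eax)
{
    *edx = (uint32_t)(value >> 32);
    *eax = (uint32_t)value;
}
```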
The CPUID instruction made its debut in the Pentium® processor and was also implemented in the later versions of the 486 (as well as in all subsequent IA32 processors). A detailed description can be found in “CPU Identification” on page 1443.
The additional Pentium®-related capabilities (refer to Figure 21-17 on page 519) that were not present in the later versions of the 486 were:
The MMX instruction set was introduced in the P55C version of the Pentium® processor and consisted of 47 new instructions. In addition, there are eight MMX data registers (MM[7:0]; see Figure 21-18 on page 520). As shown in the illustration, the lower 64 bits of each of the FP data registers perform double-duty:
- They are used as MMX registers when MMX code is being executed.
- They are used as FP data registers when x87 FPU code is being executed.
Refer to Figure 21-19 on page 523. As an example, assume that there are two video frame buffers in memory (it should not be assumed, however, that MMX is only intended for processing video data). The current video mode has the following characteristics:
Each location in each of the two buffers represents the color of one pixel. The first buffer location corresponds to the first pixel at the left end of the first line of pixels on the screen, the second buffer location corresponds to the second pixel from the left end of that line, etc.
A single location contains 8 bits (one byte), so a pixel can be any one of 256 possible colors (as represented by the values 00h-FFh).
The video controller is currently operating at a resolution of 1024 x 768, so each of the two video frame buffers consists of 786,432 locations.
Now assume that the programmer wants to:
- Read the byte from the first location of one buffer,
- Read the byte from the first location of the other buffer,
- Add the two bytes together, and
- Store the result back into the first location of the second buffer.
- Repeat the operation for every pixel in the two frame buffers.
This could be accomplished in the following manner:
- Read one byte from one buffer into a one-byte register (e.g., AL).
- Read the corresponding byte from the other buffer into another one-byte register (e.g., BL).
- Add AL and BL together and store the result in BL.
Since the add may result in the generation of a carry, the programmer has to decide whether to discard the carry or to deal with it. If the possibility of a carry must be dealt with, the programmer must include a conditional branch after the add that either jumps to the code that handles the carry or skips over it.
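The byte-at-a-time approach just described can be sketched in C as shown below. It is only an illustration: the function name and the choice to count carries (rather than call a handler) are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * y[i] = x[i] + y[i], one pixel (byte) per iteration.
 * Returns how many adds produced a carry out of the byte.
 */
static size_t add_frames_bytewise(const uint8_t *x, uint8_t *y, size_t n)
{
    size_t carries = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned sum = (unsigned)x[i] + y[i]; /* one read from each buffer */
        if (sum > 0xFF)                       /* the conditional branch on carry */
            carries++;
        y[i] = (uint8_t)sum;                  /* one write; the carry is discarded */
    }
    return carries;
}
```

Each iteration performs two one-byte reads and one one-byte write, which is the access pattern the surrounding text counts.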
As indicated in the illustration, this would result in 786,432 x 2 memory reads and 786,432 memory writes. This code would generate a tremendous number of memory accesses which may or may not hit on the processor's internal data cache. Any misses would result in memory transactions being performed on the FSB. This would degrade performance in two ways:
- In a multiprocessor system, the FSB bandwidth available to the other processor(s) could be substantially impacted.
- Since the FSB operates at a substantially slower rate of speed than the processor core, the memory accesses would be accomplished quite slowly.
The number of memory accesses could be reduced by reading four bytes at a time from each buffer:
- Read four bytes from one buffer into a 32-bit register (e.g., EAX).
- Read the corresponding four bytes from the other buffer into another 32-bit register (e.g., EBX).
- Add EAX and EBX together and store the result in EBX.
There is a problem here that must be dealt with. The four bytes read into EAX from the first buffer represent four independent pixel values. Likewise, the four bytes read into EBX from the second buffer represent four independent pixel values. If the programmer simply adds EAX and EBX together, the processor treats the contents of the two registers as 32-bit integer values, so a carry out of one pixel's byte spills into the neighboring pixel's byte. The result will therefore be completely bogus. To fix this, the programmer would have to do something like the following (yes, the author realizes this is a simplistic and incomplete approach; it's to make the point that a lot of work is involved):
            mov  i,0
    Loop:   mov  eax,x[i]           ;read four pixels into eax starting at
                                    ;the ith location in frame buffer x
            mov  ebx,y[i]           ;read four pixels into ebx starting at
                                    ;the ith location in frame buffer y
            call ShiftIsolateAdd    ;isolate LSBs of both registers
                                    ;and add
            jnc  Byte2              ;check for carry
            call CarryHandler       ;call handler if carry set
    Byte2:  call ShiftIsolateAdd    ;isolate next byte in both registers
                                    ;and add
            jnc  Byte3              ;check for carry
            call CarryHandler       ;call handler if carry set
    Byte3:  call ShiftIsolateAdd    ;isolate 3rd byte of both registers
                                    ;and add
            jnc  Byte4              ;check for carry
            call CarryHandler       ;call handler if carry set
    Byte4:  call ShiftIsolateAdd    ;isolate MSB in both registers
                                    ;and add
            jnc  StoreResult        ;check for carry
            call CarryHandler       ;call handler if carry set
    StoreResult:
            mov  y[i],edx           ;store 4 pixel result in buffer y
            add  i,4                ;point to next dword
            cmp  i,BufferEnd        ;check for end of buffer
            jnz  Loop               ;continue if not done
    Done:
Although reading four pixels at a time from each buffer and writing four pixels at a time decreases the number of memory accesses, the programmer has to engage in a fair amount of bit-slinging to accomplish the goal. In addition, a conditional branch is performed after each add to see if a carry resulted and has to be dealt with. On later IA32 processors (starting with the Pentium® Pro), mispredicted branches result in a very deep performance dip. In this case, the conditional branches are dependent on completely unpredictable pixel data being received from a video source, so the misprediction rate will almost certainly be quite high.
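To make the problem with a plain 32-bit add concrete, the following C sketch contrasts it with a lane-by-lane add of the four pixel bytes. The function names are hypothetical, and the per-byte version simply discards each carry (it stands in for the shift-isolate-add work, not for the author's routines).

```c
#include <stdint.h>

/* What a plain 32-bit ADD does: carries cross pixel boundaries. */
static uint32_t add_as_dword(uint32_t a, uint32_t b)
{
    return a + b;
}

/* Isolate each of the four pixel bytes, add them, and discard the carry. */
static uint32_t add_per_byte(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    for (unsigned lane = 0; lane < 4; lane++) {
        unsigned shift = 8u * lane;
        uint8_t pa = (uint8_t)(a >> shift);
        uint8_t pb = (uint8_t)(b >> shift);
        result |= (uint32_t)(uint8_t)(pa + pb) << shift;
    }
    return result;
}
```

With a = 102030FFh and b = 01010101h, the dword add yields 11213200h (the carry out of the lowest pixel corrupts its neighbor), while the per-byte add yields 11213100h.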
Refer to Figure 21-20 on page 524. MMX instructions can perform a simultaneous operation on bytes, words or dwords that are packed into MMX registers. This is referred to as a Single Instruction operating on Multiple Data items (SIMD). The programmer can read 64 bits (8 bytes, 4 words, or 2 dwords) into an MMX register using one instruction. In the example illustrated, the programmer has loaded eight packed bytes into MMX register MM0, another eight packed bytes into MMX register MM1, and then executes a PADDB instruction (an add on packed bytes). Loading 8 bytes rather than 4 bytes into a register at a time reduces the number of memory accesses that have to be performed. Furthermore, the MMX execution unit has eight independent adders that operate simultaneously on the 8 bytes in each of the registers. This results in a dramatic reduction in the compute time.
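The packed-byte add can be approximated in portable C to show the idea. This sketch does not use real MMX instructions: `paddb_model` is a hypothetical name, and it uses a SWAR (SIMD within a register) bit trick on a 64-bit integer to get the same wraparound result per byte that PADDB produces. The frame loop assumes the buffer length is a multiple of 8.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Add eight packed bytes with wraparound; no carry crosses a byte lane. */
static uint64_t paddb_model(uint64_t a, uint64_t b)
{
    const uint64_t low7 = 0x7F7F7F7F7F7F7F7FULL;
    const uint64_t top  = 0x8080808080808080ULL;
    /* Add the low 7 bits of every lane, then fold in the top bits with XOR. */
    return ((a & low7) + (b & low7)) ^ ((a ^ b) & top);
}

/* y[i] = x[i] + y[i], eight pixels per iteration (n assumed multiple of 8). */
static void add_frames_packed(const uint8_t *x, uint8_t *y, size_t n)
{
    for (size_t i = 0; i + 8 <= n; i += 8) {
        uint64_t mm0, mm1;
        memcpy(&mm0, x + i, 8);      /* one 8-byte read from each buffer */
        memcpy(&mm1, y + i, 8);
        mm1 = paddb_model(mm0, mm1); /* eight adds at once, no branches */
        memcpy(y + i, &mm1, 8);      /* one 8-byte write */
    }
}
```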
Refer to Figure 21-21 on page 525. Sometimes, the data that the programmer wishes to perform a SIMD packed operation on is stored in memory in unpacked form. As an example, there is a video text mode wherein each text character byte in the video frame buffer is immediately followed by an attribute byte that defines the text character's attributes (e.g., underscore, blink, etc.). The programmer may wish to perform a SIMD operation on just the text characters or on the attributes (in other words, on every other byte).
The MMX instruction set includes instructions that can read unpacked data from memory and pack it into an MMX register. Conversely, instructions are included that take data that is packed into an MMX register and store it to memory in unpacked form.
The MMX technology provides three ways of handling out-of-range conditions:
Wraparound math. Using wraparound math, a true out-of-range result is truncated. The carry or overflow is ignored and only the lsbs of the result are stored in the destination. Wraparound math can be used in applications that control the range of operands to prevent out-of-range results. Care should be taken, however, because, if the range of operands is not controlled, wraparound math can result in large errors (e.g., adding two large, signed numbers can result in positive overflow and produce a negative result).
Signed saturation math. Using signed saturation math, out-of-range results are automatically clamped to the representable range of signed integers for the integer size being operated on (see Table 21-2 on page 526). Two examples:
- If an operation on signed word integers results in a positive overflow, the result is clamped (“saturated”) to 7FFFh, the largest positive integer that can be represented in 16 bits.
- If in the same scenario negative overflow occurs, the result is saturated to 8000h.
Data Type | Lower Limit (Hex) | Lower Limit (Decimal) | Upper Limit (Hex) | Upper Limit (Decimal) |
---|---|---|---|---|
Signed Byte | 80h | -128 | 7Fh | 127 |
Signed Word | 8000h | -32,768 | 7FFFh | 32,767 |
Unsigned Byte | 00h | 0 | FFh | 255 |
Unsigned Word | 0000h | 0 | FFFFh | 65,535 |
Unsigned saturation math. Using unsigned saturation math, out-of-range results are automatically clamped to the representable range of unsigned integers for the integer size being operated on. Positive overflow when operating on unsigned byte integers results in FFh being returned and negative overflow results in 00h being returned.
Saturated math lends itself well to many overflow scenarios. As an example, when performing color calculations, saturation causes a color to remain pure black or pure white and does not result in color inversion. It also prevents wraparound artifacts from affecting a computation (when operand range checking is not used).
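The three overflow behaviors can be compared with a few scalar helpers in C. These are single-element models with hypothetical names, not the packed MMX instructions themselves; each one mirrors what a single lane of the corresponding packed operation would produce.

```c
#include <stdint.h>

/* Wraparound: the carry is dropped and only the low 8 bits are kept. */
static uint8_t add_wrap_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}

/* Unsigned saturation: positive overflow clamps to FFh. */
static uint8_t add_sat_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + b;
    return sum > 0xFF ? 0xFF : (uint8_t)sum;
}

/* Unsigned saturation: negative overflow clamps to 00h. */
static uint8_t sub_sat_u8(uint8_t a, uint8_t b)
{
    return a < b ? 0 : (uint8_t)(a - b);
}

/* Signed saturation on words: clamp to 8000h (-32,768) .. 7FFFh (32,767). */
static int16_t add_sat_s16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;
    if (sum > INT16_MAX)
        return INT16_MAX;
    if (sum < INT16_MIN)
        return INT16_MIN;
    return (int16_t)sum;
}
```

For example, adding the byte values 200 and 100 wraps to 44 but saturates to 255, which is why a saturated color stays at full intensity instead of turning dark.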
It should be noted that MMX instructions do not indicate overflow or underflow by generating exceptions or setting flags in the EFlags register.
As mentioned earlier (see “The Basic Problem” on page 521), the later IA32 processors (starting with the Pentium® Pro) experience a deep performance dip if the processor mispredicts a conditional branch and fetches the wrong instructions into the processor pipeline to be executed after the branch instruction. Conditional branches are especially troublesome if the condition being tested is based on a test of random, unpredictable data (e.g., video data).
When the weather person is shown on TV walking around in front of the weather map, this is really the result of the real-time merging of two video frame buffers: one contains the data received from a camera pointing at the map, while the other contains the data received from a camera pointing at the person walking around in front of a blue background. The program is constantly studying the buffer containing the person pixels and, wherever a blue pixel is detected, it is replaced with the same pixel from the map video buffer.
The code fragment shown in Figure 21-22 on page 528 compares the value that represents the color blue against the ith location in the weather person's video frame buffer. The compare is followed by a conditional branch and one of two actions is taken based on the results of the comparison:
- If the pixel in the weather person's buffer isn't blue, the program jumps to next_pixel (not shown). In next_pixel, the pointer value (i) is incremented and a compare for the end of buffer is performed. If the end of buffer has not been reached, the code fragment shown is repeated for the next pixel. When the end of buffer is reached, the process starts over again.
- If the pixel in the weather person's buffer is blue, it is replaced with the same pixel from buffer y (i.e., the map buffer). The program would then execute next_pixel again and continue to do so until the entire buffer has been processed.
This process is slow because it does not use the wider MMX registers to read the pixel information or to perform the comparison. It will also result in abysmal performance because comparisons of random video data almost certainly produce a high incidence of mispredicted branches.
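A C rendering of the branch-per-pixel approach might look like the sketch below. It is not the code from the figure: the function name, the parameter names, and the 16-bit pixel width are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Scalar chroma key: wherever the person buffer holds the key color,
 * copy the pixel at the same position from the map buffer.
 * Assumes 16-bit pixels (an assumption of this sketch).
 */
static void chroma_key_scalar(uint16_t *person, const uint16_t *map,
                              size_t n, uint16_t blue)
{
    for (size_t i = 0; i < n; i++) {
        if (person[i] == blue)   /* data-dependent branch on every pixel */
            person[i] = map[i];
    }
}
```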
Figure 21-23 on page 529 shows an example MMX code fragment that can be used to accomplish the Chroma-Keying effect. Some things to note:
- Throughput is considerably enhanced by using MMX's SIMD capability to process four pixels at a time.
- There are no conditional branches, thereby eliminating the potential performance degradation that accompanies mispredicted branches.
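One common way to get this branch-free behavior is a compare that produces a per-pixel mask, followed by logical operations that select between the two sources. The C sketch below models that idea on a 64-bit integer holding four 16-bit pixels. It is an assumption-laden illustration rather than the fragment in the figure: the function names are hypothetical, the pixel width is assumed, and the loop inside `pcmpeqw_model` only emulates what the hardware compare does in a single instruction.

```c
#include <stdint.h>

/* Model of a packed word compare: each equal lane becomes FFFFh, else 0000h. */
static uint64_t pcmpeqw_model(uint64_t a, uint64_t b)
{
    uint64_t mask = 0;
    for (unsigned lane = 0; lane < 4; lane++) {
        unsigned shift = 16u * lane;
        if ((uint16_t)(a >> shift) == (uint16_t)(b >> shift))
            mask |= 0xFFFFULL << shift;
    }
    return mask;
}

/* Select map pixels where person == blue, person pixels elsewhere. */
static uint64_t chroma_key4(uint64_t person, uint64_t map, uint64_t blue4)
{
    uint64_t mask = pcmpeqw_model(person, blue4); /* compare step            */
    uint64_t from_map    = map & mask;            /* AND: keep map pixels    */
    uint64_t from_person = person & ~mask;        /* AND NOT: keep the rest  */
    return from_map | from_person;                /* OR: merge the two       */
}
```

The selection itself contains no data-dependent branch, so its cost does not depend on how predictable the pixel data is.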
The code fragment consists of the following instructions:
Whether or not a processor supports MMX is detected by executing a CPUID request type 1. The processor's capabilities bit mask is returned in the EDX register (see Figure 21-27 on page 531). Bit 23 = 1 indicates that the processor supports MMX.
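Once the EDX value from CPUID request type 1 is in hand, the test reduces to a single bit check. A minimal C sketch, with a hypothetical function name and the EDX value passed in as a parameter rather than obtained by executing CPUID:

```c
#include <stdbool.h>
#include <stdint.h>

#define CPUID_EDX_MMX (1u << 23) /* bit 23 = 1: MMX is supported */

/* cpuid_edx is assumed to be the EDX result of CPUID request type 1. */
static bool cpuid_reports_mmx(uint32_t cpuid_edx)
{
    return (cpuid_edx & CPUID_EDX_MMX) != 0;
}
```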
Refer to Figure 21-28 on page 532. The processor does not actually implement a separate, distinct MMX register set. Rather, the eight 64-bit MMX registers, MM[7:0], are aliased over the lower 64 bits of the eight FP data registers.
When the processor is executing FP instructions, the data registers are treated as a stack of eight, 80-bit FP data registers. When it executes an MMX instruction, the eight data registers are treated as the MMX registers, each of which is 64 bits wide and is mapped into the lower 64 bits of the FP data registers. Also, as indicated in the note at the bottom of the figure, the execution of any MMX instruction sets all eight fields in the x87 FPU Tag Word Register (TWR) = 00b. This erroneously indicates that all eight of the x87 data registers contain valid data. Before using any of the x87 data registers for FP operations after any of them have been used for MMX operations, the EMMS instruction (Empty MMX state) must be executed to set all eight Tag fields = 11b to indicate that none of the data registers contains valid FP data.
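The effect on the Tag Word Register can be modeled in a few lines of C. This is a toy state model with hypothetical names; it tracks only the 16-bit tag word (eight 2-bit fields), not the data registers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model: eight 2-bit tag fields packed into one 16-bit tag word. */
typedef struct {
    uint16_t tag_word;
} X87TagModel;

/* Any MMX instruction sets every tag field to 00b (valid). */
static void after_mmx_instruction(X87TagModel *s)
{
    s->tag_word = 0x0000;
}

/* EMMS sets every tag field to 11b (empty). */
static void emms_model(X87TagModel *s)
{
    s->tag_word = 0xFFFF;
}

/* True when all eight registers are marked empty, i.e. safe for new FP use. */
static bool all_registers_marked_empty(const X87TagModel *s)
{
    return s->tag_word == 0xFFFF;
}
```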
Refer to “Device Not Available Exception (7)” on page 299.
The MMX instruction set is summarized in Table 21-3 on page 533 and Table 21-4 on page 534. Note that some of the instructions include the “B”, “W”, or “D” suffix at end. This indicates that the instruction operates upon packed bytes, words, or dwords. A conversion instruction converts one data type to another, so it has two suffix characters at the end to indicate the “from” and “to” data types. As an example, the PACKUSWB instruction converts eight signed word integers from the two specified 64-bit sources into eight unsigned byte integers using unsigned saturation.
Category | Instruction Type | Wraparound | Signed Saturation | Unsigned Saturation |
---|---|---|---|---|
Arithmetic | Addition | PADDB, PADDW, PADDD | PADDSB, PADDSW | PADDUSB, PADDUSW |
Arithmetic | Subtraction | PSUBB, PSUBW, PSUBD | PSUBSB, PSUBSW | PSUBUSB, PSUBUSW |
Arithmetic | Multiplication | PMULLW, PMULHW | | |
Arithmetic | Multiply and Add | PMADDWD | | |
Comparison | Compare for = | PCMPEQB, PCMPEQW, PCMPEQD | | |
Comparison | Compare for > | PCMPGTB, PCMPGTW, PCMPGTD | | |
Conversion | Pack | | PACKSSWB, PACKSSDW | PACKUSWB |
Unpack | Unpack High | PUNPCKHBW, PUNPCKHWD, PUNPCKHDQ | | |
Unpack | Unpack Low | PUNPCKLBW, PUNPCKLWD, PUNPCKLDQ | | |

Category | Instruction Type | Packed | Full Qword |
---|---|---|---|
Logical | And | | PAND |
Logical | And Not | | PANDN |
Logical | Or | | POR |
Logical | Exclusive OR | | PXOR |
Shift | Shift Left Logical | PSLLW, PSLLD | PSLLQ |
Shift | Shift Right Logical | PSRLW, PSRLD | PSRLQ |
Shift | Shift Right Arithmetic | PSRAW, PSRAD | |

Category | Instruction Type | Dword Transfers | Qword Transfers |
---|---|---|---|
Data Transfer | Register to Register | MOVD | MOVQ |
Data Transfer | Load from Memory | MOVD | MOVQ |
Data Transfer | Store to Memory | MOVD | MOVQ |
Empty MMX State | EMMS | na | |
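The PACKUSWB conversion described before the tables can be modeled in C to show how the two suffix characters play out. The function names are hypothetical; the lane ordering (destination words fill the low four result bytes, source words fill the high four) follows the instruction's documented behavior rather than anything stated in this text.

```c
#include <stdint.h>

/* Treat w as a signed word and clamp it to the unsigned byte range 00h-FFh. */
static uint8_t saturate_word_to_u8(uint16_t w)
{
    if (w & 0x8000u)
        return 0x00;       /* negative word: clamp to 00h */
    if (w > 0x00FFu)
        return 0xFF;       /* above 255: clamp to FFh */
    return (uint8_t)w;
}

/* Pack four words from dst and four from src into eight unsigned bytes. */
static uint64_t packuswb_model(uint64_t dst, uint64_t src)
{
    uint64_t result = 0;
    for (unsigned lane = 0; lane < 4; lane++) {
        uint16_t d = (uint16_t)(dst >> (16u * lane));
        uint16_t s = (uint16_t)(src >> (16u * lane));
        result |= (uint64_t)saturate_word_to_u8(d) << (8u * lane);
        result |= (uint64_t)saturate_word_to_u8(s) << (8u * lane + 32u);
    }
    return result;
}
```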
Figure 21-29 on page 535 shows the placement of the MMX execution unit in the Pentium® processor.