The previous two chapters explored the basics of x86-64 assembly language programming. In these chapters, you learned how to perform simple integer arithmetic using x86-64 instructions. You also learned how to carry out scalar floating-point calculations using AVX instructions. Finally, you studied important x86-64 assembly language programming constructs and concepts including for-loop coding, memory addressing modes, use of condition codes, and function calling convention requirements.
In this chapter, you will discover how to code x86-64 assembly language functions that perform packed integer operations using AVX instructions and 128-bit wide operands. The first section covers basic packed integer arithmetic. The second section details a few image processing algorithms. The source code examples presented in this chapter are adaptations of examples that you saw in Chapter 2. This was done intentionally to highlight the programming similarities that exist between C++ SIMD intrinsic functions and AVX instructions.
Integer Arithmetic
In this section, you will learn how to perform elementary packed integer arithmetic using x86-64 assembly language and AVX instructions. The first example explains packed integer addition and subtraction using 128-bit wide SIMD operands. This is followed by an example that demonstrates packed integer multiplication. The final two examples illustrate packed integer bitwise logical and shift operations.
Addition and Subtraction
Example Ch13_01
The first file in Listing 13-1, Ch13_01.h, contains the function declarations for this example. Note that functions AddI16_Aavx() and SubI16_Aavx() both require pointer arguments of type XmmVal. This is the same C++ SIMD data structure that was introduced in Chapter 2. The file Ch13_01.cpp contains code that performs test case initialization and streams results to std::cout.
The first function in file Ch13_01_fasm.asm, AddI16_Aavx(), illustrates packed integer addition using 16-bit wide elements. Function AddI16_Aavx() begins with a vmovdqa xmm0,xmmword ptr [r8] instruction that loads argument value a into register XMM0. The text xmmword ptr is an assembler operator that conveys the size (128 bits) of the source operand pointed to by R8. The next instruction, vmovdqa xmm1,xmmword ptr [r9], loads argument value b into register XMM1. The ensuing instruction pair, vpaddw xmm2,xmm0,xmm1 (Add Packed Integers) and vpaddsw xmm3,xmm0,xmm1 (Add Packed Integers with Signed Saturation), performs packed integer addition of word elements using wraparound and saturated arithmetic, respectively. The final two AVX instructions of AddI16_Aavx(), vmovdqa xmmword ptr [rcx],xmm2 and vmovdqa xmmword ptr [rdx],xmm3, save the calculated results to the XmmVal buffers pointed to by c1 and c2.
Recall that source code example Ch02_01 included a C++ function named AddI16_Iavx(). This function employed _mm_load_si128() and _mm_store_si128() to perform SIMD load and store operations. In the current example, the assembly language function AddI16_Aavx() (which performs the same operations as AddI16_Iavx()) uses the vmovdqa instruction to perform SIMD loads and stores. Function AddI16_Iavx() also used _mm_add_epi16() and _mm_adds_epi16() to carry out packed integer addition using 16-bit wide integer elements. These C++ SIMD intrinsic functions are the counterparts of the AVX instructions vpaddw and vpaddsw. Most of the C++ SIMD intrinsic functions that you learned about in the first half of this book are essentially wrapper functions for x86-AVX instructions.
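The correspondence between the two instruction forms can be seen side by side in a short C++ fragment. The following is a minimal sketch and not the listing from Ch02_01: the structure Vec8I16 is a hypothetical stand-in for XmmVal, and the name AddI16_Sketch() is an assumption made for illustration. The comments name the AVX instruction that each intrinsic corresponds to; the VEX-encoded forms are typically emitted only when the compiler is told to target AVX.

```cpp
#include <cassert>
#include <cstdint>
#include <immintrin.h>

// Hypothetical stand-in for XmmVal: eight 16-bit lanes on a 16-byte boundary.
struct alignas(16) Vec8I16
{
    std::int16_t v[8];
};

// Sketch only: adds a and b using wraparound (c1) and signed saturation (c2).
void AddI16_Sketch(Vec8I16* c1, Vec8I16* c2, const Vec8I16* a, const Vec8I16* b)
{
    // Aligned loads and stores require 16-byte aligned addresses.
    assert(reinterpret_cast<std::uintptr_t>(a) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(b) % 16 == 0);

    __m128i va = _mm_load_si128(reinterpret_cast<const __m128i*>(a->v)); // vmovdqa (load)
    __m128i vb = _mm_load_si128(reinterpret_cast<const __m128i*>(b->v)); // vmovdqa (load)

    __m128i wrap = _mm_add_epi16(va, vb);    // vpaddw:  wraparound addition
    __m128i sat  = _mm_adds_epi16(va, vb);   // vpaddsw: signed saturated addition

    _mm_store_si128(reinterpret_cast<__m128i*>(c1->v), wrap);            // vmovdqa (store)
    _mm_store_si128(reinterpret_cast<__m128i*>(c2->v), sat);             // vmovdqa (store)
}
```

For the word elements 32000 and 1000, the wraparound sum is 33000 - 65536 = -32536, while the saturated sum is clamped to 32767.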
Multiplication
Example Ch13_02
Bitwise Logical Operations
Example Ch13_03
Arithmetic and Logical Shifts
Example Ch13_04
Function SllU16_Aavx() begins its execution with a vmovdqa xmm0,xmmword ptr [rdx] instruction that loads argument value a into register XMM0. The next instruction, vmovd xmm1,r8d (Move Doubleword), copies the doubleword value in register R8D (argument value count) to XMM1[31:0]. Execution of this instruction also zeros bits YMM1[255:32]; bits ZMM1[511:256] are likewise zeroed if the processor supports AVX-512. The ensuing vpsllw xmm2,xmm0,xmm1 instruction left shifts each word element in XMM0 using the shift count in XMM1[31:0].
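The same three steps (load, move the count into a SIMD register, shift) can be expressed with C++ SIMD intrinsic functions. The fragment below is a sketch under stated assumptions: Vec8U16 is a hypothetical stand-in for XmmVal, and SllU16_Sketch() is an illustrative name rather than a function from the book's source code. Note that _mm_cvtsi32_si128() zeros every bit above bit 31, which mirrors the behavior of vmovd described above.

```cpp
#include <cassert>
#include <cstdint>
#include <immintrin.h>

// Hypothetical stand-in for XmmVal: eight unsigned 16-bit lanes, 16-byte aligned.
struct alignas(16) Vec8U16
{
    std::uint16_t v[8];
};

// Sketch only: logical left shift of each word element in a by count bits.
void SllU16_Sketch(Vec8U16* c, const Vec8U16* a, int count)
{
    assert(reinterpret_cast<std::uintptr_t>(a) % 16 == 0);

    __m128i va = _mm_load_si128(reinterpret_cast<const __m128i*>(a->v)); // vmovdqa
    __m128i vcount = _mm_cvtsi32_si128(count);   // vmovd: bits 31:0 = count, upper bits zero
    __m128i vc = _mm_sll_epi16(va, vcount);      // vpsllw: same count applied to every word

    _mm_store_si128(reinterpret_cast<__m128i*>(c->v), vc);
}
```

A shift count greater than 15 sets every word element of the result to zero, so a caller does not need to mask the count for the operation to be well defined.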
Image Processing Algorithms
In the first part of this book, several source code examples were presented that explained how to exploit C++ SIMD intrinsic functions to perform common image processing techniques. In this section, you will learn how to code a few image processing methods using x86-64 assembly language and AVX. The first source code example illustrates using AVX instructions to find the minimum and maximum values in a pixel buffer. The second source code example describes how to calculate a pixel buffer mean. Note that the AVX instructions and SIMD processing computations demonstrated in this section are also appropriate for use in other functions that carry out calculations using arrays or matrices of integer elements.
Pixel Minimum and Maximum
Example Ch13_05
Near the top of Listing 13-5, function CalcMinMaxU8_Aavx() employs two test instructions to confirm that argument value n is not equal to zero and is an integral multiple of 16. A third test instruction verifies that pixel buffer x is aligned on a 16-byte boundary. Following argument validation, CalcMinMaxU8_Aavx() uses a vpcmpeqb xmm4,xmm4,xmm4 (Compare Packed Data for Equal) instruction to load 0xFF into each byte element of register XMM4. More specifically, vpcmpeqb performs byte element compares using its two source operands and sets the corresponding byte element in the destination operand to 0xFF if the source operand elements are equal. Function CalcMinMaxU8_Aavx() uses vpcmpeqb xmm4,xmm4,xmm4 to set each byte element of XMM4 to 0xFF since this is faster than using a vmovdqa instruction to load a 128-bit constant of all ones from memory. The ensuing vpxor xmm5,xmm5,xmm5 instruction sets each byte element in register XMM5 to 0x00.
The next instruction, mov rax,-NSE, initializes loop index variable i. Register RAX is loaded with -NSE since each iteration of Loop1 begins with an add rax,NSE instruction that calculates i += NSE. This is followed by the instruction pair cmp rax,r9 and jae @F, which terminates Loop1 when i >= n is true. Note that the order of instructions used to initialize and update i in Loop1 prevents a loop-carried dependency condition from occurring. A loop-carried dependency condition arises when calculations in a for-loop are dependent on values computed during a prior iteration. Having a loop-carried dependency in a for-loop sometimes results in slower performance. A for-loop without any loop-carried dependencies provides better opportunities for the processor to perform calculations of successive iterations simultaneously.
During execution of Loop1, function CalcMinMaxU8_Aavx() maintains packed minimums and maximums in registers XMM4 and XMM5, respectively. The first AVX instruction of Loop1, vmovdqa xmm0,xmmword ptr [r8+rax], loads a block of 16 pixels (x[i:i+15]) into register XMM0. This is followed by a vpminub xmm4,xmm4,xmm0 (Minimum of Packed Unsigned Integers) instruction that updates the packed minimum pixel values in XMM4. The ensuing vpmaxub xmm5,xmm5,xmm0 (Maximum of Packed Unsigned Integers) instruction updates the packed maximum values in XMM5.
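The initialization and loop structure described in the preceding paragraphs can be sketched in C++ with SIMD intrinsic functions. This fragment is illustrative only: the name CalcMinMaxU8_Sketch() and its argument order are assumptions, and the final reduction of the 16 packed minimums and maximums to scalar values is deliberately simplified to a scalar scan, which is not necessarily how the assembly language function performs it. The loop index mirrors the mov rax,-NSE and add rax,NSE ordering by starting i at a wrapped value of -NSE.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <immintrin.h>

// Sketch only: finds the minimum and maximum of n unsigned 8-bit pixels.
bool CalcMinMaxU8_Sketch(std::uint8_t* x_min, std::uint8_t* x_max,
                         const std::uint8_t* x, std::size_t n)
{
    constexpr std::size_t NSE = 16;               // elements per 128-bit block

    assert(x_min != nullptr && x_max != nullptr);

    // Argument validation: n != 0, n is a multiple of 16, x is 16-byte aligned.
    if (n == 0 || (n % NSE) != 0)
        return false;
    if ((reinterpret_cast<std::uintptr_t>(x) & 0x0f) != 0)
        return false;

    __m128i vmax = _mm_setzero_si128();           // vpxor: packed maximums start at 0x00
    __m128i vmin = _mm_cmpeq_epi8(vmax, vmax);    // vpcmpeqb idiom: every byte becomes 0xFF

    std::size_t i = 0 - NSE;                      // like mov rax,-NSE (unsigned wraparound)
    for (;;)
    {
        i += NSE;                                 // add rax,NSE
        if (i >= n)                               // cmp rax,r9 / jae
            break;

        __m128i v = _mm_load_si128(reinterpret_cast<const __m128i*>(x + i)); // x[i:i+15]
        vmin = _mm_min_epu8(vmin, v);             // vpminub
        vmax = _mm_max_epu8(vmax, v);             // vpmaxub
    }

    // Simplified reduction (assumption): spill both registers and scan them.
    alignas(16) std::uint8_t lo[NSE];
    alignas(16) std::uint8_t hi[NSE];
    _mm_store_si128(reinterpret_cast<__m128i*>(lo), vmin);
    _mm_store_si128(reinterpret_cast<__m128i*>(hi), vmax);

    std::uint8_t mn = 0xFF, mx = 0x00;
    for (std::size_t k = 0; k < NSE; ++k)
    {
        if (lo[k] < mn) mn = lo[k];
        if (hi[k] > mx) mx = hi[k];
    }

    *x_min = mn;
    *x_max = mx;
    return true;
}
```

Starting the packed minimums at 0xFF and the packed maximums at 0x00 means the first block loaded always replaces both initial values, so no special case is needed for the first iteration.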
Pixel Minimum and Maximum Execution Times (Microseconds), 10,000,000 Pixels
CPU | CalcMinMaxU8_Cpp() | CalcMinMaxU8_Aavx() | CalcMinMaxU8_Iavx() |
---|---|---|---|
Intel Core i7-8700K | 6760 | 388 | 406 |
Intel Core i5-11600K | 7045 | 314 | 304 |
As a reminder, the benchmark timing measurements reported in this and subsequent chapters are intended to provide some helpful insights regarding potential performance gains of an x86-AVX assembly language coded function compared to one coded using standard C++ statements. It is also important to reiterate that this book is an introductory primer about x86 SIMD programming and not benchmarking. Many of the x86 SIMD calculating functions, both C++ and assembly language, are coded to hasten learning and yield significant but not necessarily optimal performance. Chapter 2 contains additional information about the benchmark timing measurements published in this book.
Pixel Mean Intensity
Example Ch13_06
Near the top of file Ch13_06_fasm.asm is the statement extern g_NumElementsMax:qword, which declares g_NumElementsMax as an external quadword variable (the definition of g_NumElementsMax is located in the file Ch13_06_Misc.cpp). The first code block of CalcMeanU8_Aavx() uses the instruction pair test r9,r9 and jz BadArg to ensure that argument value n is not equal to zero. The next instruction pair, cmp r9,[g_NumElementsMax] and ja BadArg, bypasses the calculating code if n > g_NumElementsMax is true. This is followed by the instruction pair test r9,3fh and jnz BadArg, which confirms that n is an integral multiple of 64 (in later examples, you will learn how to process residual pixels). The final check of the first code block, test r8,0fh and jnz BadArg, confirms that pixel buffer x is aligned on a 16-byte boundary.
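The four checks map directly onto ordinary C++ expressions, which can help when reading the test and cmp instructions. The following is a hedged sketch: CheckArgsMean_Sketch() is a hypothetical helper, and the parameter n_max stands in for g_NumElementsMax, whose actual value is defined elsewhere and is not assumed here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sketch only: mirrors the argument checks at the top of CalcMeanU8_Aavx().
// n_max is a hypothetical stand-in for the external variable g_NumElementsMax.
bool CheckArgsMean_Sketch(const std::uint8_t* x, std::size_t n, std::size_t n_max)
{
    // The alignment test alone would accept a null pointer, so guard it here.
    assert(x != nullptr);

    if (n == 0)                                     // test r9,r9 / jz BadArg
        return false;
    if (n > n_max)                                  // cmp r9,[g_NumElementsMax] / ja BadArg
        return false;
    if ((n & 0x3f) != 0)                            // test r9,3fh / jnz BadArg
        return false;
    if ((reinterpret_cast<std::uintptr_t>(x) & 0x0f) != 0)   // test r8,0fh / jnz BadArg
        return false;

    return true;
}
```

The masks 0x3f and 0x0f work because 64 and 16 are powers of two: a value is a multiple of 64 exactly when its six low-order bits are zero, and an address is 16-byte aligned exactly when its four low-order bits are zero.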
Pixel Array Arithmetic Mean Execution Times (Microseconds), 10,000,000 Pixels
CPU | CalcMeanU8_Cpp() | CalcMeanU8_Aavx() | CalcMeanU8_Iavx() |
---|---|---|---|
Intel Core i7-8700K | 2289 | 461 | 462 |
Intel Core i5-11600K | 1856 | 301 | 288 |
Summary
X86 Assembly Language Instruction Summary for Chapter 13
Instruction Mnemonic | Description |
---|---|
vmov[d|q] | Move doubleword or quadword into XMM register |
vpadd[b|w|d|q] | Packed integer addition |
vpadds[b|w] | Packed signed integer addition (saturated) |
vpaddus[b|w] | Packed unsigned integer addition (saturated) |
vpand | Bitwise logical AND |
vpcmpeq[b|w|d|q] | Packed integer compare for equality |
vpextr[b|w|d|q] | Extract integer |
vpmaxs[b|w|d|q] | Packed signed integer maximum |
vpmaxu[b|w|d|q] | Packed unsigned integer maximum |
vpmins[b|w|d|q] | Packed signed integer minimum |
vpminu[b|w|d|q] | Packed unsigned integer minimum |
vpmuldq | Packed signed integer multiplication (quadword results) |
vpmulhw | Packed signed integer multiplication (high results) |
vpmull[w|d|q] | Packed signed integer multiplication (low results) |
vpor | Bitwise logical OR |
vpsll[w|d|q] | Packed integer shift left logical |
vpsra[w|d|q] | Packed integer shift right arithmetic |
vpsrl[w|d|q] | Packed integer shift right logical |
vpslldq | Shift double quadword left logical |
vpsrldq | Shift double quadword right logical |
vpsub[b|w|d|q] | Packed integer subtraction |
vpsubs[b|w] | Packed signed integer subtraction (saturated) |
vpsubus[b|w] | Packed unsigned integer subtraction (saturated) |
vpunpckh[bw|wd|dq|qdq] | Unpack and interleave high-order integers |
vpunpckl[bw|wd|dq|qdq] | Unpack and interleave low-order integers |
vpxor | Bitwise logical exclusive OR |