The previous chapter explained SIMD fundamentals including packed types, arithmetic calculations, and data manipulation operations. It also highlighted a few details regarding the history of x86-AVX and its computational capabilities. The focus of this chapter is AVX integer arithmetic using 128-bit wide operands. The first section contains several concise source code examples that illustrate how to use C++ SIMD intrinsic functions to perform packed integer arithmetic. This is followed by a section that highlights common programming operations with packed integers including bitwise logical operations and shifts. The third and final section includes source code examples that demonstrate elementary image processing tasks using C++ SIMD intrinsic functions. As you will soon see, SIMD techniques are ideal for many types of image processing algorithms.
Integer Arithmetic
In this section, you will learn the basics of x86-AVX packed integer arithmetic using 128-bit wide SIMD operands. It begins with a simple program that demonstrates packed integer addition using both wraparound and saturated arithmetic. This is followed by a similar program that focuses on packed integer subtraction. The final source code example of this section details packed integer multiplication.
Most of the source code examples in this book are shown using a single listing. This is done to minimize the number of listing references in the main text. The actual source code is partitioned into separate files using the naming conventions described in Chapter 1.
Integer Addition
Example Ch02_01
Listing 2-1 begins with the declaration of a C++ structure named XmmVal, which is declared in the header file XmmVal.h. This structure contains a publicly accessible anonymous union whose members correspond to the packed data types that can be used with a 128-bit wide x86-AVX operand. Note that XmmVal is declared using the alignas(16) specifier. This specifier instructs the C++ compiler to align each instance of an XmmVal on a 16-byte boundary. When an x86 processor executes an x86-AVX instruction that references an operand in memory, maximum performance is achieved when the operand is aligned on its natural boundary (e.g., 16-, 32-, or 64-byte boundaries for 128-, 256-, or 512-bit wide data types, respectively). Some x86-AVX instructions require their operands to be properly aligned, and these instructions will raise an exception if they attempt to access a misaligned operand in memory. You will learn more about this later. The structure XmmVal also contains several member functions that format the contents of an XmmVal variable for streaming to std::cout. The source code for these member functions is not shown in Listing 2-1 but is included in the software download package. Structure XmmVal is used in this example and in later source code examples to demonstrate x86-AVX SIMD operations.
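The following is a minimal sketch of what such a structure might look like. It is not the declaration from XmmVal.h: the name XmmValSketch and the member names are assumptions chosen for illustration, and the formatting member functions are omitted.

```cpp
#include <cstdint>

// Hypothetical stand-in for XmmVal (names are assumptions, not the book's code).
// alignas(16) places every instance on a 16-byte boundary.
struct alignas(16) XmmValSketch
{
    // Anonymous union: every member views the same 128 bits.
    union
    {
        int8_t   m_I8[16];
        int16_t  m_I16[8];
        int32_t  m_I32[4];
        int64_t  m_I64[2];
        uint8_t  m_U8[16];
        uint16_t m_U16[8];
        uint32_t m_U32[4];
        uint64_t m_U64[2];
        float    m_F32[4];
        double   m_F64[2];
    };
};

static_assert(sizeof(XmmValSketch) == 16, "holds exactly one 128-bit operand");
static_assert(alignof(XmmValSketch) == 16, "aligned on a 16-byte boundary");
```

Because the alignment is part of the type, local variables, globals, and array elements of this type all satisfy the requirement of the aligned load and store intrinsics.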
The next file in Listing 2-1, Ch02_01.h, incorporates the requisite C++ function declarations for this source code example. Note that the function declarations in this header file use the previously defined XmmVal structure. Also note that file Ch02_01.h begins with a short comment block that includes its name to make it identifiable in the listing.
The file Ch02_01.cpp is next. This file contains the function main() and two other static functions named AddI16() and AddU16(). Function AddI16() begins its execution by initializing two XmmVal variables with packed 16-bit signed integer data. This is followed by a call to the function AddI16_Iavx(), which performs packed 16-bit signed integer addition. The remaining code in AddI16() displays the results calculated by AddI16_Iavx(). The function AddU16() is almost identical to AddI16() except that it uses unsigned instead of signed integers.
The final file in Listing 2-1 is Ch02_01_fcpp.cpp. This file contains two SIMD calculating functions named AddI16_Iavx() and AddU16_Iavx(). Near the top of this file is an #include statement for the header file immintrin.h. This file contains the declarations for the C++ SIMD intrinsic functions that are used in Ch02_01_fcpp.cpp. Function AddI16_Iavx() begins its execution with a call to _mm_load_si128(). This C++ SIMD intrinsic function loads the contents of argument a into a_vals. Note that a_vals is declared as an __m128i, which is a 128-bit wide C++ SIMD intrinsic type that holds packed 8-, 16-, 32-, or 64-bit integers. The _mm_load_si128() function requires its source operand to be properly aligned on a 16-byte boundary. This requirement is satisfied by the alignas(16) specifier that was used in the declaration of XmmVal. Another call to _mm_load_si128() is then employed to initialize b_vals.
Following SIMD variable initialization, AddI16_Iavx() employs the C++ SIMD intrinsic function _mm_add_epi16(), which performs packed 16-bit integer addition using operands a_vals and b_vals. The result of this addition is saved in an __m128i variable named c1_vals. The ensuing call to _mm_adds_epi16() also performs packed 16-bit integer addition but carries out its calculations using saturated instead of wraparound arithmetic. The final two code statements of AddI16_Iavx() employ the C++ SIMD intrinsic function _mm_store_si128(). This function saves a 128-bit wide packed integer value to the specified target buffer, which must be aligned on a 16-byte boundary.
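The sequence described above (aligned loads, wraparound add, saturated add, aligned stores) can be sketched as follows. This is an illustration rather than the code from Ch02_01_fcpp.cpp: the function name AddI16Sketch() is hypothetical, and plain 16-byte aligned arrays are assumed in place of XmmVal.

```cpp
#include <cstdint>
#include <immintrin.h>

// Adds eight 16-bit signed integers two ways.
// c1 receives wraparound sums, c2 receives saturated sums.
// Assumption: all four pointers reference 16-byte aligned storage.
void AddI16Sketch(int16_t* c1, int16_t* c2, const int16_t* a, const int16_t* b)
{
    __m128i a_vals = _mm_load_si128(reinterpret_cast<const __m128i*>(a));
    __m128i b_vals = _mm_load_si128(reinterpret_cast<const __m128i*>(b));

    __m128i c1_vals = _mm_add_epi16(a_vals, b_vals);   // wraparound
    __m128i c2_vals = _mm_adds_epi16(a_vals, b_vals);  // saturated to [-32768, 32767]

    _mm_store_si128(reinterpret_cast<__m128i*>(c1), c1_vals);
    _mm_store_si128(reinterpret_cast<__m128i*>(c2), c2_vals);
}
```

For an element pair such as 30 and 32760, the wraparound result is -32746 while the saturated result is clamped to 32767.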
Other C++ SIMD intrinsic functions are available for packed integer addition. You can use the C++ intrinsic function _mm_add_epi8(), _mm_add_epi32(), or _mm_add_epi64() to perform packed addition using 8-, 32-, or 64-bit wide signed or unsigned integers. You can also use the function _mm_adds_epi8() or _mm_adds_epu8() to carry out packed saturated addition using 8-bit signed or unsigned integers. Note that distinct C++ SIMD intrinsic functions are used for wraparound and saturated integer addition since these operations can generate different results as explained in Chapter 1.
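As a brief illustration of the unsigned saturated variant, the sketch below adds a constant to 16 unsigned 8-bit values and clamps each sum at 255. The function name AddSatU8Sketch() and the in-place update are assumptions made for this example.

```cpp
#include <cstdint>
#include <immintrin.h>

// Adds delta to each of 16 unsigned 8-bit values, clamping at 255.
// Assumption: x references 16-byte aligned storage of at least 16 bytes.
void AddSatU8Sketch(uint8_t* x, uint8_t delta)
{
    __m128i x_vals = _mm_load_si128(reinterpret_cast<const __m128i*>(x));

    // Broadcast delta to all 16 byte lanes.
    __m128i d_vals = _mm_set1_epi8(static_cast<char>(delta));

    // Unsigned saturated addition: sums above 255 become 255.
    __m128i sums = _mm_adds_epu8(x_vals, d_vals);

    _mm_store_si128(reinterpret_cast<__m128i*>(x), sums);
}
```

With wraparound addition (_mm_add_epi8()), 250 + 60 would yield 54; the saturated form yields 255 instead.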
Integer Subtraction
Example Ch02_02
The organization of the source code in example Ch02_02 parallels the previous example. In the file Ch02_02.cpp, the function SubI32() initializes XmmVal variables a and b using 32-bit signed integers. It then calls SubI32_Iavx(), which performs the packed subtraction. Function SubI64() is akin to SubI32() but uses 64-bit signed integers.
You can use the C++ SIMD intrinsic function _mm_sub_epi8() or _mm_sub_epi16() to perform packed subtraction using 8- or 16-bit wide signed or unsigned integers, respectively. To perform saturated packed subtraction, you can use the C++ SIMD intrinsic function _mm_subs_epi8(), _mm_subs_epi16(), _mm_subs_epu8(), or _mm_subs_epu16().
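A minimal sketch of both forms of packed subtraction follows. The function names SubI32Sketch() and SubsU16Sketch() are hypothetical, and 16-byte aligned arrays are assumed in place of XmmVal.

```cpp
#include <cstdint>
#include <immintrin.h>

// Packed 32-bit subtraction (wraparound): c[i] = a[i] - b[i] for four elements.
// Assumption: all pointers reference 16-byte aligned storage.
void SubI32Sketch(int32_t* c, const int32_t* a, const int32_t* b)
{
    __m128i a_vals = _mm_load_si128(reinterpret_cast<const __m128i*>(a));
    __m128i b_vals = _mm_load_si128(reinterpret_cast<const __m128i*>(b));
    _mm_store_si128(reinterpret_cast<__m128i*>(c), _mm_sub_epi32(a_vals, b_vals));
}

// Packed 16-bit unsigned saturated subtraction: differences below 0 become 0.
void SubsU16Sketch(uint16_t* c, const uint16_t* a, const uint16_t* b)
{
    __m128i a_vals = _mm_load_si128(reinterpret_cast<const __m128i*>(a));
    __m128i b_vals = _mm_load_si128(reinterpret_cast<const __m128i*>(b));
    _mm_store_si128(reinterpret_cast<__m128i*>(c), _mm_subs_epu16(a_vals, b_vals));
}
```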
Integer Multiplication
Example Ch02_03
In file Ch02_03.cpp, the C++ function MulI16() contains code that initializes XmmVal variables a and b using 16-bit signed integers. It then calls MulI16_Iavx(), which performs the packed integer multiplication. The results are then streamed to std::cout. The other two static functions in Ch02_03.cpp, MulI32a() and MulI32b(), perform similar initialization tasks for 32-bit packed integer multiplication and then call MulI32a_Iavx() and MulI32b_Iavx(), respectively.
Function MulI32a_Iavx() highlights one method of performing packed 32-bit signed integer multiplication. This function uses the C++ SIMD intrinsic function _mm_mullo_epi32() to calculate the low-order 32 bits of each product. The packed 32-bit integer products are then saved using _mm_store_si128(). This technique is suitable when calculating multiplicative products that will not exceed the value limits of a 32-bit signed integer.
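The low-order product technique can be sketched as shown below. The function name MulI32aSketch() is hypothetical. Since _mm_mullo_epi32() requires at least SSE4.1, the sketch also defines a helper macro, SKETCH_SSE41 (an assumption of this example, not part of the book's code), so that GCC and Clang will accept the intrinsic even when no instruction set option is given on the command line.

```cpp
#include <cstdint>
#include <immintrin.h>

// Assumption: enable SSE4.1 for this one function on GCC/Clang.
// Other compilers (e.g., Visual C++) need no attribute.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_SSE41 __attribute__((target("sse4.1")))
#else
#define SKETCH_SSE41
#endif

// Multiplies four pairs of 32-bit signed integers and keeps the
// low-order 32 bits of each 64-bit product.
// Assumption: all pointers reference 16-byte aligned storage.
SKETCH_SSE41
void MulI32aSketch(int32_t* c, const int32_t* a, const int32_t* b)
{
    __m128i a_vals = _mm_load_si128(reinterpret_cast<const __m128i*>(a));
    __m128i b_vals = _mm_load_si128(reinterpret_cast<const __m128i*>(b));

    __m128i c_vals = _mm_mullo_epi32(a_vals, b_vals);

    _mm_store_si128(reinterpret_cast<__m128i*>(c), c_vals);
}
```

The results are exact only when every product fits in a 32-bit signed integer; larger products are silently truncated to their low-order 32 bits.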
Integer Bitwise Logical and Shift Operations
Besides standard arithmetic operations, x86-AVX also supports other common operations using 128-bit wide packed integer operands. In this section, you will learn how to carry out bitwise logical and shift operations.
Bitwise Logical Operations
Example Ch02_04
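A minimal sketch of packed bitwise logical operations follows, assuming 16-byte aligned arrays of 16-bit unsigned integers. The function name BitOpU16Sketch() and the BitOpSketch enumeration are illustrative and are not taken from the source code of example Ch02_04.

```cpp
#include <cstdint>
#include <immintrin.h>

// Hypothetical selector for the operation to perform.
enum class BitOpSketch { And, Or, Xor };

// Applies a bitwise logical operation to all 128 bits of a and b.
// Assumption: all pointers reference 16-byte aligned storage.
void BitOpU16Sketch(uint16_t* c, const uint16_t* a, const uint16_t* b, BitOpSketch op)
{
    __m128i a_vals = _mm_load_si128(reinterpret_cast<const __m128i*>(a));
    __m128i b_vals = _mm_load_si128(reinterpret_cast<const __m128i*>(b));
    __m128i c_vals;

    switch (op)
    {
        case BitOpSketch::And: c_vals = _mm_and_si128(a_vals, b_vals); break;
        case BitOpSketch::Or:  c_vals = _mm_or_si128(a_vals, b_vals);  break;
        default:               c_vals = _mm_xor_si128(a_vals, b_vals); break;
    }

    _mm_store_si128(reinterpret_cast<__m128i*>(c), c_vals);
}
```

Note that these intrinsics carry the _si128 suffix rather than an element-size suffix, since a bitwise operation produces the same result regardless of how the 128 bits are partitioned.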
Shift Operations
Example Ch02_05
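A minimal sketch of the three packed shift types follows, assuming 16-byte aligned arrays of 16-bit elements and a fixed shift count of four bits. The function name ShiftU16Sketch() is illustrative and is not taken from the source code of example Ch02_05.

```cpp
#include <cstdint>
#include <immintrin.h>

// Shifts each of eight 16-bit elements by 4 bits in three ways.
// Assumption: all pointers reference 16-byte aligned storage.
void ShiftU16Sketch(uint16_t* sll, uint16_t* srl, uint16_t* sra, const uint16_t* a)
{
    __m128i a_vals = _mm_load_si128(reinterpret_cast<const __m128i*>(a));

    __m128i sll_vals = _mm_slli_epi16(a_vals, 4);  // shift left logical (zero fill)
    __m128i srl_vals = _mm_srli_epi16(a_vals, 4);  // shift right logical (zero fill)
    __m128i sra_vals = _mm_srai_epi16(a_vals, 4);  // shift right arithmetic (sign fill)

    _mm_store_si128(reinterpret_cast<__m128i*>(sll), sll_vals);
    _mm_store_si128(reinterpret_cast<__m128i*>(srl), srl_vals);
    _mm_store_si128(reinterpret_cast<__m128i*>(sra), sra_vals);
}
```

For the element 0xF0F0, the logical right shift yields 0x0F0F while the arithmetic right shift yields 0xFF0F, since the latter replicates the sign bit into the vacated positions.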
C++ SIMD Intrinsic Function Naming Conventions
C++ SIMD Intrinsic Function Name Prefixes and Suffixes
String | Type | Description |
---|---|---|
_mm | Prefix | X86-AVX function that uses 128-bit wide operands |
_mm256 | Prefix | X86-AVX function that uses 256-bit wide operands |
_mm512 | Prefix | X86-AVX function that uses 512-bit wide operands |
_epi8 | Suffix | Packed 8-bit signed integers |
_epi16 | Suffix | Packed 16-bit signed integers |
_epi32 | Suffix | Packed 32-bit signed integers |
_epi64 | Suffix | Packed 64-bit signed integers |
_epu8 | Suffix | Packed 8-bit unsigned integers |
_epu16 | Suffix | Packed 16-bit unsigned integers |
_epu32 | Suffix | Packed 32-bit unsigned integers |
_epu64 | Suffix | Packed 64-bit unsigned integers |
_ss | Suffix | Scalar single-precision floating-point |
_sd | Suffix | Scalar double-precision floating-point |
_ps | Suffix | Packed single-precision floating-point |
_pd | Suffix | Packed double-precision floating-point |
It should be noted that many of the C++ SIMD intrinsic functions that carry out their operations using 128-bit wide SIMD operands will also execute on processors that support SSE, SSE2, SSE3, SSSE3, SSE4.1, or SSE4.2. For more information, you can consult the previously mentioned Intel Intrinsics Guide website.
Table 2-2. C++ SIMD Intrinsic Data Types
Type | Description |
---|---|
__m128 | 128-bit wide packed single-precision floating-point |
__m128d | 128-bit wide packed double-precision floating-point |
__m128i | 128-bit wide packed integers |
__m256 | 256-bit wide packed single-precision floating-point |
__m256d | 256-bit wide packed double-precision floating-point |
__m256i | 256-bit wide packed integers |
__m512 | 512-bit wide packed single-precision floating-point |
__m512d | 512-bit wide packed double-precision floating-point |
__m512i | 512-bit wide packed integers |
It is important to keep in mind that none of the C++ SIMD intrinsic functions and data types are defined in any of the ISO C++ standards. Minor discrepancies exist between mainstream compilers such as Visual C++ and GNU C++. Also, these compilers employ different techniques to implement the various SIMD functions and data types. If you are developing code that needs to work on multiple platforms, you should avoid directly referencing any of the internal members of the data types shown in Table 2-2. You can employ the portable SIMD data types used in this book (e.g., XmmVal) or define your own portable SIMD data type.
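One portable way to read an individual element, sketched below, is to store the SIMD value into an ordinary aligned array instead of naming a compiler-specific member of __m128i. The function name GetI32Sketch() is hypothetical.

```cpp
#include <cstdint>
#include <immintrin.h>

// Returns element i (0 to 3) of a packed 32-bit integer value.
// The value is copied to a plain array, so no internal member of
// __m128i is referenced and the code builds with any mainstream compiler.
int32_t GetI32Sketch(__m128i v, int i)
{
    alignas(16) int32_t tmp[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(tmp), v);
    return tmp[i];
}
```

Code that instead reaches into the type directly (for example, through the union members that Visual C++ exposes) will generally not compile with GNU C++, which implements __m128i as a built-in vector type.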
Image Processing Algorithms
The source code examples presented thus far were designed to familiarize you with basic C++ SIMD intrinsic functions and common packed integer operations. To fully exploit the performance benefits of x86-AVX, you must develop complete SIMD functions. The source code examples in this section explain how to code a few simple image processing functions.
In the first example, you will learn how to utilize x86-AVX and C++ SIMD intrinsic functions to find the minimum and maximum value in an array of 8-bit unsigned integers. This example has real-world utility since digital images are often arranged in memory using arrays or matrices of 8-bit unsigned integers. Also, many image processing algorithms (e.g., contrast enhancement) often need to ascertain the minimum (darkest) and maximum (brightest) pixel values in an image. The second source code example illustrates how to calculate the mean value of an array of 8-bit unsigned integers using SIMD arithmetic. This is another example of a realistic algorithm that is directly relevant to the province of image processing. Finally, you will learn some straightforward techniques for benchmarking the performance of a SIMD function.
Pixel Minimum and Maximum
Example Ch02_06
The first file in Listing 2-6 is Ch02_06.h. This file contains the requisite function declarations and a few miscellaneous constants. Note that the function declarations use the fixed-width integer type uint8_t, which is defined in the header file <cstdint>. Some programmers (including me) prefer to use the fixed-width integer types in SIMD calculating functions since it eschews the size ambiguities of the standard C++ integer types char, short, int, long, and long long.
The next file in Listing 2-6, Ch02_06_misc.cpp, contains a simple function named InitArray(). This function fills an array of 8-bit unsigned integers using random values. The actual filling of the array is performed by a template function named MT::FillArray(), which is defined in the header file MT.h. The driver function for this example is named CalcMinMaxU8() and is defined in Ch02_06.cpp. Near the top of CalcMinMaxU8() is the statement AlignedArray<uint8_t> x_aa(n, 16). This statement dynamically allocates an n element array of uint8_t integers that is aligned on a 16-byte boundary. The source code for both MT.h and AlignedMem.h (which contains the template class AlignedMem<T>) is not shown in Listing 2-6 but is included in the software download package.
The principal calculating functions for example Ch02_06 are defined in the file Ch02_06_fcpp.cpp. The first function in this module, CalcMinMaxU8_Cpp(), finds the minimum and maximum value in an array of uint8_t integers. This function is coded using typical C++ statements sans any C++ SIMD intrinsic functions and will be used later for comparison and benchmarking purposes. Note that prior to the start of the for-loop, two error checks are performed. The first error check ensures that n is not equal to zero and is an integral multiple of 16. Requiring n to be an integral multiple of 16 is not as restrictive as it might appear since the number of pixels in a digital camera image is often an integral multiple of 64 due to the processing requirements of the JPEG algorithms. Later examples will include additional code that can process arrays of any size. The second error check ensures that the source pixel buffer x is properly aligned on a 16-byte boundary.
The SIMD counterpart function to CalcMinMaxU8_Cpp() is named CalcMinMaxU8_Iavx(). This function starts its execution by validating n for size and x for proper alignment. The next statement uses the C++ SIMD intrinsic function _mm_set1_epi8() to set each 8-bit element in min_vals to 0xFF. This is also known as a broadcast operation. Unlike the non-SIMD min-max function, the for-loop in CalcMinMaxU8_Iavx() maintains 16 intermediate pixel minimums as it sweeps through pixel buffer x, and the variable min_vals holds these values. The next statement uses _mm_setzero_si128() to initialize each 8-bit element in max_vals to 0x00. This variable holds intermediate pixel maximums during execution of the for-loop.
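The steps described above can be sketched as follows. This is an illustration under stated assumptions, not the code from Ch02_06_fcpp.cpp: the function name CalcMinMaxU8Sketch() is hypothetical, and the final reduction of the 16 intermediate values uses byte shifts and _mm_cvtsi128_si32(), which is one of several possible approaches and may differ from the one used in the listing.

```cpp
#include <cstddef>
#include <cstdint>
#include <immintrin.h>

// Finds the minimum and maximum of n unsigned 8-bit values.
// Returns false if n is zero, n is not a multiple of 16,
// or x is not aligned on a 16-byte boundary.
bool CalcMinMaxU8Sketch(uint8_t* x_min, uint8_t* x_max, const uint8_t* x, size_t n)
{
    if (n == 0 || (n % 16) != 0)
        return false;
    if ((reinterpret_cast<uintptr_t>(x) % 16) != 0)
        return false;

    __m128i min_vals = _mm_set1_epi8(static_cast<char>(0xff));  // broadcast 0xFF
    __m128i max_vals = _mm_setzero_si128();                     // all 0x00

    // Each iteration updates 16 running minimums and 16 running maximums.
    for (size_t i = 0; i < n; i += 16)
    {
        __m128i x_vals = _mm_load_si128(reinterpret_cast<const __m128i*>(&x[i]));
        min_vals = _mm_min_epu8(x_vals, min_vals);
        max_vals = _mm_max_epu8(x_vals, max_vals);
    }

    // Reduce 16 values to 1 by folding the upper half onto the lower half.
    // Only byte 0 is meaningful after the last step.
    min_vals = _mm_min_epu8(min_vals, _mm_srli_si128(min_vals, 8));
    min_vals = _mm_min_epu8(min_vals, _mm_srli_si128(min_vals, 4));
    min_vals = _mm_min_epu8(min_vals, _mm_srli_si128(min_vals, 2));
    min_vals = _mm_min_epu8(min_vals, _mm_srli_si128(min_vals, 1));

    max_vals = _mm_max_epu8(max_vals, _mm_srli_si128(max_vals, 8));
    max_vals = _mm_max_epu8(max_vals, _mm_srli_si128(max_vals, 4));
    max_vals = _mm_max_epu8(max_vals, _mm_srli_si128(max_vals, 2));
    max_vals = _mm_max_epu8(max_vals, _mm_srli_si128(max_vals, 1));

    *x_min = static_cast<uint8_t>(_mm_cvtsi128_si32(min_vals) & 0xff);
    *x_max = static_cast<uint8_t>(_mm_cvtsi128_si32(max_vals) & 0xff);
    return true;
}
```

Initializing min_vals to 0xFF and max_vals to 0x00 guarantees that the first block of pixels replaces every intermediate value, since no unsigned 8-bit pixel can be greater than 0xFF or less than 0x00.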
The final function in Listing 2-6 is named CalcMinMaxU8_bm(). This function contains code that measures the execution times of functions CalcMinMaxU8_Cpp() and CalcMinMaxU8_Iavx(). Most of the timing measurement code is encapsulated in a C++ class named BmThreadTimer. This class includes two member functions, BmThreadTimer::Start() and BmThreadTimer::Stop(), that implement a simple software stopwatch. Class BmThreadTimer also includes a member function named BmThreadTimer::SaveElapsedTimes(), which saves the timing measurements to a comma-separated text file. The source code for class BmThreadTimer is not shown in Listing 2-6 but is included as part of the source code download package.
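The stopwatch idea can be sketched with the standard library alone. The class below is a hypothetical stand-in and is not the BmThreadTimer class from the download package; it assumes std::chrono::steady_clock as the time source and omits the CSV output.

```cpp
#include <chrono>

// Hypothetical minimal stopwatch (not the book's BmThreadTimer).
class StopwatchSketch
{
    using Clock = std::chrono::steady_clock;

public:
    void Start() { m_start = Clock::now(); }
    void Stop()  { m_stop = Clock::now(); }

    // Elapsed time between the last Start() and Stop() in microseconds.
    double ElapsedMicroseconds() const
    {
        std::chrono::duration<double, std::micro> elapsed = m_stop - m_start;
        return elapsed.count();
    }

private:
    Clock::time_point m_start{};
    Clock::time_point m_stop{};
};
```

A benchmarking function would typically call Start() immediately before the function under test, call Stop() immediately after it returns, and repeat the measurement many times so that outliers can be discarded.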
Table 2-3. Pixel Minimum and Maximum Execution Times (Microseconds), 10,000,000 pixels
CPU | CalcMinMaxU8_Cpp() | CalcMinMaxU8_Iavx() |
---|---|---|
Intel Core i7-8700K | 6549 | 406 |
Intel Core i5-11600K | 6783 | 304 |
The values shown in Table 2-3 were computed using the CSV file execution times (500 runs of each algorithm) and the Excel spreadsheet function TRIMMEAN(array,0.10). In example Ch02_06, the C++ SIMD intrinsic function implementation of the pixel minimum-maximum algorithm clearly outperforms the standard C++ version by a wide margin. It is not uncommon to achieve significant speed improvements when using C++ intrinsic functions, especially for algorithms that can fully exploit the SIMD parallelism of an x86 processor. You will see additional examples of accelerated algorithmic performance throughout the remainder of this book.
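For readers who prefer to post-process the timing measurements without a spreadsheet, the sketch below approximates the trimmed mean calculation. The function name TrimMeanSketch() is hypothetical. It assumes the documented behavior of Excel's TRIMMEAN: the requested fraction of data points is excluded in total, split evenly between the smallest and largest values, with the excluded count rounded down to a multiple of 2.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Mean of v after discarding the given fraction of data points,
// half from each end of the sorted data.
// Returns 0.0 for empty input or a fraction outside [0, 1).
double TrimMeanSketch(std::vector<double> v, double fraction)
{
    if (v.empty() || fraction < 0.0 || fraction >= 1.0)
        return 0.0;

    std::sort(v.begin(), v.end());

    // Number of points to drop from each tail (integer division rounds
    // the total excluded count down to a multiple of 2).
    std::size_t k = static_cast<std::size_t>(v.size() * fraction) / 2;

    double sum = std::accumulate(v.begin() + k, v.end() - k, 0.0);
    return sum / static_cast<double>(v.size() - 2 * k);
}
```

With 500 measurements and a fraction of 0.10, the 25 fastest and 25 slowest runs are discarded and the remaining 450 are averaged.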
The benchmark timing measurements cited in this book provide reasonable approximations of function execution times. They are intended to provide some insights regarding the performance of a function coded using standard C++ statements vs. a function coded using C++ SIMD intrinsic functions. Like automobile fuel economy and mobile device battery runtime estimates, software performance benchmarking is not an exact science and is subject to a variety of uncontrollable factors. It is also important to keep in mind that this book is an introductory primer about x86 SIMD programming and not benchmarking. The source code examples are structured to hasten the study of x86 SIMD programming techniques. In addition, the Visual C++ options described earlier were selected mostly for practical reasons and may not yield optimal performance in all cases. Both Visual C++ and GNU C++ include a plethora of code generation options that can affect performance. Benchmark timing measurements should always be construed in a context that is correlated with the software’s purpose. The methods described in this section are generally worthwhile, but measurement results occasionally vary. Appendix A contains additional information about the software tools used to develop the source code examples in this book.
Pixel Mean Intensity
Example Ch02_07
The organization of the code in this example is similar to Ch02_06. File Ch02_07_misc.cpp contains two functions, CheckArgs() and InitArray(), which perform argument checking and array initialization, respectively. Note that the size of the source array must be an integral multiple of 64 and aligned on a 16-byte boundary. The function CheckArgs() also verifies that the number of array elements n is less than g_NumElementsMax. This size restriction enables the C++ SIMD code to perform intermediate calculations using packed 32-bit unsigned integers without any safeguards for arithmetic overflows.
Calculation of an array mean is straightforward; a function must sum the elements of the array and then divide this sum by the total number of elements. The function CalcMeanU8_Cpp() accomplishes this using a simple for-loop and scalar floating-point division.
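A minimal scalar sketch of this calculation is shown below. The function name CalcMeanU8CppSketch() is hypothetical, and the argument checking is reduced to a test for an empty array; the actual function relies on CheckArgs().

```cpp
#include <cstddef>
#include <cstdint>

// Computes the sum and arithmetic mean of n unsigned 8-bit values.
// Returns false if n is zero.
bool CalcMeanU8CppSketch(double* mean, uint64_t* sum, const uint8_t* x, size_t n)
{
    if (n == 0)
        return false;

    uint64_t total = 0;

    for (size_t i = 0; i < n; i++)
        total += x[i];

    *sum = total;
    *mean = static_cast<double>(total) / static_cast<double>(n);  // scalar FP division
    return true;
}
```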
The C++ SIMD counterpart function is named CalcMeanU8_Iavx(). Following argument validation using CheckArgs(), CalcMeanU8_Iavx() initializes variables packed_zero and pixel_sums_u32 to all zeros. The former variable is employed by the for-loop to perform unsigned integer size promotions, and the latter maintains four 32-bit unsigned integer intermediate sum values. The main for-loop is next. Note that each for-loop iteration processes 64 array elements since the index variable i is incremented by num_simd_elements * 4. The reason for doing this is that it reduces the number of 8-bit to 32-bit unsigned integer size promotions required to calculate the final pixel sum.
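The size promotion scheme described above can be sketched as follows. This is an illustration rather than the code from the listing: the function name CalcMeanU8Sketch() and the constant c_NumElementsMaxSketch are hypothetical, the inline checks stand in for CheckArgs(), and the element limit of 64 * 1024 * 1024 is an assumption chosen so that none of the four 32-bit sums can overflow.

```cpp
#include <cstddef>
#include <cstdint>
#include <immintrin.h>

// Assumed upper limit; keeps each 32-bit lane sum below 2^32.
const size_t c_NumElementsMaxSketch = 64 * 1024 * 1024;

// Computes the sum and mean of n unsigned 8-bit values using SIMD arithmetic.
// Requires n to be a nonzero multiple of 64 and x to be 16-byte aligned.
bool CalcMeanU8Sketch(double* mean, uint64_t* sum, const uint8_t* x, size_t n)
{
    if (n == 0 || (n % 64) != 0 || n > c_NumElementsMaxSketch)
        return false;
    if ((reinterpret_cast<uintptr_t>(x) % 16) != 0)
        return false;

    const size_t num_simd_elements = 16;
    __m128i packed_zero = _mm_setzero_si128();
    __m128i pixel_sums_u32 = _mm_setzero_si128();

    for (size_t i = 0; i < n; i += num_simd_elements * 4)
    {
        __m128i sums_u16 = _mm_setzero_si128();

        // Promote 64 pixels from 8 to 16 bits and accumulate.
        // Each 16-bit lane receives 8 pixels, so its maximum is 8 * 255 = 2040.
        for (size_t j = 0; j < 4; j++)
        {
            const uint8_t* p = &x[i + j * num_simd_elements];
            __m128i pixels = _mm_load_si128(reinterpret_cast<const __m128i*>(p));
            sums_u16 = _mm_add_epi16(sums_u16, _mm_unpacklo_epi8(pixels, packed_zero));
            sums_u16 = _mm_add_epi16(sums_u16, _mm_unpackhi_epi8(pixels, packed_zero));
        }

        // One 16- to 32-bit promotion per 64 pixels.
        pixel_sums_u32 = _mm_add_epi32(pixel_sums_u32, _mm_unpacklo_epi16(sums_u16, packed_zero));
        pixel_sums_u32 = _mm_add_epi32(pixel_sums_u32, _mm_unpackhi_epi16(sums_u16, packed_zero));
    }

    // Add the four 32-bit intermediate sums.
    alignas(16) uint32_t lanes[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(lanes), pixel_sums_u32);

    uint64_t total = 0;
    for (int k = 0; k < 4; k++)
        total += lanes[k];

    *sum = total;
    *mean = static_cast<double>(total) / static_cast<double>(n);
    return true;
}
```

Interleaving with packed_zero places a zero byte (or word) above each source element, which is equivalent to an unsigned size promotion. Accumulating in 16-bit lanes first means that only two 32-bit promotions are needed per 64 pixels instead of eight.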
Pixel Array Arithmetic Mean Execution Times (Microseconds), 10,000,000 pixels
CPU | CalcMeanU8_Cpp() | CalcMeanU8_Iavx() |
---|---|---|
Intel Core i7-8700K | 2234 | 462 |
Intel Core i5-11600K | 1856 | 288 |
Summary
C++ SIMD Intrinsic Function Summary for Chapter 2
C++ SIMD Function Names | Description |
---|---|
_mm_add_epi8, _epi16, _epi32, _epi64 | Packed integer addition |
_mm_adds_epi8, _epi16 | Packed signed integer addition (saturated) |
_mm_adds_epu8, _epu16 | Packed unsigned integer addition (saturated) |
_mm_and_si128 | Bitwise logical AND |
_mm_extract_epi8, _epi16, _epi32, _epi64 | Extract integer |
_mm_load_si128 | Load (aligned) 128-bit wide packed integers |
_mm_max_epi8, _epi16, _epi32, _epi64 | Packed signed integer maximum |
_mm_max_epu8, _epu16, _epu32, _epu64 | Packed unsigned integer maximum |
_mm_min_epi8, _epi16, _epi32, _epi64 | Packed signed integer minimum |
_mm_min_epu8, _epu16, _epu32, _epu64 | Packed unsigned integer minimum |
_mm_mul_epi32 | Packed 32-bit signed integer multiplication |
_mm_mul_epu32 | Packed 32-bit unsigned integer multiplication |
_mm_mulhi_epi16 | Packed 16-bit signed integer multiplication (high result) |
_mm_mulhi_epu16 | Packed 16-bit unsigned integer multiplication (high result) |
_mm_mullo_epi16, _epi32, _epi64 | Packed signed integer multiplication (low result) |
_mm_or_si128 | Bitwise logical OR |
_mm_set1_epi8, _epi16, _epi32, _epi64 | Broadcast integer constant to all elements |
_mm_setzero_si128 | Set 128-bit wide SIMD operand to all zeros |
_mm_slli_epi16, _epi32, _epi64 | Packed integer shift left logical |
_mm_slli_si128 | 128-bit wide shift left logical |
_mm_srai_epi16, _epi32, _epi64 | Packed integer shift right arithmetic |
_mm_srli_epi16, _epi32, _epi64 | Packed integer shift right logical |
_mm_srli_si128 | 128-bit wide shift right logical |
_mm_store_si128 | Store (aligned) 128-bit wide packed integers |
_mm_sub_epi8, _epi16, _epi32, _epi64 | Packed integer subtraction |
_mm_subs_epi8, _epi16 | Packed signed integer subtraction (saturated) |
_mm_subs_epu8, _epu16 | Packed unsigned integer subtraction (saturated) |
_mm_unpackhi_epi8, _epi16, _epi32, _epi64 | Unpack and interleave high-order integers |
_mm_unpacklo_epi8, _epi16, _epi32, _epi64 | Unpack and interleave low-order integers |
_mm_xor_si128 | Bitwise logical XOR |