Chapter 1 introduces x86 SIMD fundamentals and essential concepts. It begins with a section that defines SIMD and introduces SIMD arithmetic using a concise source code example. The next section presents a brief historical overview of x86 SIMD instruction set extensions. The principal sections of the chapter follow; these highlight x86 SIMD concepts and programming constructs including data types, arithmetic calculations, and data manipulation operations, and describe important particulars of AVX, AVX2, and AVX-512. Understanding the material presented in this chapter is essential since it provides the foundation needed to comprehend the topics and source code discussed in subsequent chapters.
Before proceeding, a few words about terminology are warranted. In all ensuing discussions, I will use the official acronyms AVX, AVX2, and AVX-512 when explaining specific features or instructions of these x86 SIMD instruction set extensions. I will use the term x86-AVX as an umbrella expression for x86 SIMD instructions or computational resources that pertain to more than one of the aforementioned x86 SIMD extensions. The terms x86-32 and x86-64 are used to signify x86 32-bit and 64-bit processors and execution environments. This book focuses exclusively on the latter, but the former is occasionally mentioned for historical context or comparison purposes.
What Is SIMD?
SIMD (single instruction multiple data) is a parallel computing technique whereby a CPU (or processing element incorporated within a CPU) performs a single operation using multiple data items concurrently. For example, a SIMD-capable CPU can carry out a single arithmetic operation using several elements of a floating-point array simultaneously. SIMD operations are frequently employed to accelerate the performance of computationally intense algorithms and functions in machine learning, image processing, audio/video encoding and decoding, data mining, and computer graphics.
Example Ch01_01
The function CalcZ_Cpp(), shown at the beginning of Listing 1-1, is a straightforward non-SIMD C++ function that calculates z[i] = x[i] + y[i]. However, a modern C++ compiler may generate SIMD code for this function as explained later in this section.
The next function in Listing 1-1, CalcZ_Iavx(), calculates the same result as CalcZ_Cpp() but employs C++ SIMD intrinsic functions to accelerate the computations. In CalcZ_Iavx(), the first for-loop uses the C++ SIMD intrinsic function _mm256_loadu_ps() to load eight consecutive elements from array x (i.e., elements x[i:i+7]) and temporarily saves these elements in an __m256 object named x_vals. An __m256 object is a generic container that holds eight values of type float. The ensuing _mm256_loadu_ps() call performs the same operation using array y. This is followed by a call to _mm256_add_ps() that calculates z[i:i+7] = x[i:i+7] + y[i:i+7]. What makes this code different from the code in the non-SIMD function CalcZ_Cpp() is that _mm256_add_ps() performs all eight array element additions concurrently. The final C++ intrinsic function in the first for-loop, _mm256_storeu_ps(), saves the resulting array element sums to z[i:i+7].
It is important to note that since the first for-loop in CalcZ_Iavx() processes eight array elements per iteration, it must terminate if there are fewer than eight elements remaining to process. The second for-loop handles any remaining (or residual) elements and only executes if n is not an integral multiple of eight. It is also important to mention that the C++ compiler treats C++ SIMD intrinsic function calls differently than normal C++ function calls. In the current example, the C++ compiler directly translates each C++ SIMD intrinsic function call into its corresponding AVX assembly language instruction. The overhead associated with a normal C++ function call is eliminated.
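The loop structure just described can be sketched as follows. This is a hypothetical approximation of CalcZ_Iavx(), not the exact code of Listing 1-1; the `__attribute__((target("avx")))` annotation is a GCC/Clang convenience that enables AVX code generation for this one function without a global compiler flag (Visual C++ compiles AVX intrinsics without it).

```cpp
#include <immintrin.h>
#include <cstddef>

// Hypothetical sketch of a CalcZ_Iavx-style function: a SIMD loop that
// processes eight floats per iteration, followed by a scalar loop for
// any residual elements.
__attribute__((target("avx")))
void CalcZ_Iavx(float* z, const float* x, const float* y, size_t n)
{
    size_t i = 0;

    // Main loop: eight concurrent single-precision additions per iteration
    for (; i + 8 <= n; i += 8)
    {
        __m256 x_vals = _mm256_loadu_ps(&x[i]);     // load x[i:i+7]
        __m256 y_vals = _mm256_loadu_ps(&y[i]);     // load y[i:i+7]
        __m256 z_vals = _mm256_add_ps(x_vals, y_vals);
        _mm256_storeu_ps(&z[i], z_vals);            // save z[i:i+7]
    }

    // Residual loop: executes only if n is not a multiple of eight
    for (; i < n; i++)
        z[i] = x[i] + y[i];
}
```

Note that the same two-loop pattern appears in the assembly language version of this algorithm discussed below.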
The final function in Listing 1-1 is named CalcZ_Aavx(). This is an x86-64 assembly language function that performs the same array calculation as CalcZ_Cpp() and CalcZ_Iavx(). What is noteworthy about this function is that the AVX instructions vmovups and vaddps contained in the code block are the same instructions that the C++ compiler emits for the C++ SIMD intrinsic functions _mm256_loadu_ps() and _mm256_add_ps(), respectively. The remaining code in CalcZ_Aavx() implements the two for-loops that are also implemented in function CalcZ_Iavx().
Do not worry if you are somewhat perplexed by the source code in Listing 1-1. The primary purpose of this book is to teach you how to develop and code SIMD algorithms like this using either C++ SIMD intrinsic functions or x86-64 assembly language. There are two takeaway points from this section. First, the CPU executes most SIMD arithmetic operations on the specified data elements concurrently. Second, similar design patterns are often employed when coding a SIMD algorithm regardless of whether C++ or assembly language is used.
One final note regarding the code in Listing 1-1. Recent versions of mainstream C++ compilers such as Visual C++ and GNU C++ are sometimes capable of automatically generating efficient x86 SIMD code for trivial arithmetic functions like CalcZ_Cpp(). However, these compilers still have difficulty generating efficient SIMD code for more complicated functions, especially ones that employ nested for-loops or nontrivial decision logic. In these cases, functions written using C++ SIMD intrinsic functions or x86-64 assembly language code can often outperform the SIMD code generated by a modern C++ compiler. However, employing C++ SIMD intrinsic functions does not improve performance in all cases. Many programmers will often code computationally intensive algorithms using standard C++ first, benchmark the code, and then recode bottleneck functions using C++ SIMD intrinsic functions or assembly language.
Historical Overview of x86 SIMD
For aspiring x86 SIMD programmers, having a basic understanding about the history of x86 SIMD and its various extensions is extremely beneficial. This section presents a brief overview that focuses on noteworthy x86 SIMD instruction set extensions. It does not discuss x86 SIMD extensions incorporated in special-use processors (e.g., Intel Xeon Phi) or x86 SIMD extensions that were never widely used. If you are interested in a more comprehensive chronicle of x86 SIMD architectures and instruction set extensions, you can consult the references listed in Appendix B.
Intel introduced the first x86 SIMD instruction set extension, called MMX, in 1997. This extension added instructions that facilitated simple SIMD operations using 64-bit wide packed integer operands. The MMX extension did not add any new registers to the x86 platform; it simply repurposed the registers in the x87 floating-point unit for SIMD integer arithmetic and other operations. In 1998, AMD launched an x86 SIMD extension called 3DNow, which facilitated vector operations using single-precision floating-point values. It also added a few new integer SIMD instructions. Like MMX, 3DNow uses x87 FPU registers to hold instruction operands. Both MMX and 3DNow have been superseded by newer x86 SIMD technologies and should not be used to develop new code.
In 1999, Intel launched a new SIMD technology called Streaming SIMD Extensions (SSE). SSE adds 128-bit wide registers to the x86 platform and instructions that perform packed single-precision (32-bit) floating-point arithmetic. SSE also includes a few packed integer instructions. In 2000, SSE2 was launched and extends the floating-point capabilities of SSE to cover packed double-precision (64-bit) values. SSE2 also significantly expands the packed integer capabilities of SSE. Unlike x86-32 processors, all x86-64-compatible processors from both AMD and Intel support the SSE2 instruction set extension. The SIMD extensions that followed SSE2 include SSE3 (2004), SSSE3 (2006), SSE4.1 (2008), and SSE4.2 (2008). These extensions incorporated additional SIMD instructions that perform operations using either packed integer or floating-point operands, but no new registers or data types.
In 2011, Intel introduced processors that supported a new x86 SIMD technology called Advanced Vector Extensions (AVX). AVX adds packed floating-point operations (both single precision and double precision) using 256-bit wide registers. AVX also supports a new three-operand assembly language instruction syntax, which helps reduce the number of register-to-register data transfers that a software function must perform. In 2013, Intel unveiled AVX2, which extends AVX to support packed-integer operations using 256-bit wide registers. AVX2 also adds enhanced data transfer capabilities with its broadcast, gather, and permute instructions. Processors that support AVX or AVX2 may also support fused-multiply-add (FMA) operations. FMA enables software algorithms to perform sum-of-product (e.g., dot product) calculations using a single floating-point rounding operation, which can improve both performance and accuracy.
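The single-rounding property of FMA can be demonstrated in standard C++ using std::fma from &lt;cmath&gt;, which maps to the hardware FMA instruction when one is available. The following sketch (function names are illustrative, not from Listing 1-1) compares a two-rounding multiply-add against a fused one:

```cpp
#include <cmath>

// Two roundings: the product a*b is rounded first, then the sum is rounded.
double TwoStep(double a, double b, double c)
{
    return a * b + c;
}

// One rounding: a*b + c is computed exactly, then rounded once.
double Fused(double a, double b, double c)
{
    return std::fma(a, b, c);
}
```

With a = 0.1, b = 10.0, and c = -1.0, TwoStep() returns exactly 0.0 because the product 0.1 * 10.0 rounds to 1.0, while Fused() returns the small nonzero residue (approximately 5.55e-17) that arises because 0.1 is not exactly representable in binary floating-point.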
Beginning in 2017, high-end desktop and server-oriented processors marketed by Intel included a new SIMD extension called AVX-512. This architectural enhancement supports packed integer and floating-point operations using 512-bit wide registers. AVX-512 also includes SIMD extensions that facilitate instruction-level conditional data merging, floating-point rounding control, and embedded broadcast operations.
In addition to the abovementioned SIMD extensions, numerous non-SIMD instructions have been added to the x86 platform during the past 25 years. This ongoing evolution of the x86 platform presents some challenges to software developers who want to exploit the latest instruction sets and computational resources. Fortunately, there are techniques that you can use to determine which x86 SIMD and non-SIMD instruction set extensions are available during program execution. You will learn about these methods in Chapter 9. To ensure software compatibility with future processors, a software developer should never assume that a particular x86 SIMD or non-SIMD instruction set extension is available based on processor manufacturer, brand name, model number, or underlying microarchitecture.
SIMD Data Types
SIMD Data Types and Maximum Number of Elements
Numerical Type | xmmword | ymmword | zmmword |
---|---|---|---|
8-bit integer | 16 | 32 | 64 |
16-bit integer | 8 | 16 | 32 |
32-bit integer | 4 | 8 | 16 |
64-bit integer | 2 | 4 | 8 |
Single-precision floating-point | 4 | 8 | 16 |
Double-precision floating-point | 2 | 4 | 8 |
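In C++ code, these operand widths correspond to the intrinsic vector types provided by &lt;immintrin.h&gt; (assuming an x86 compiler that supplies this header). The sizes below match the single-precision columns of the table:

```cpp
#include <immintrin.h>
#include <cstddef>

// The C++ intrinsic types that correspond to the operand widths in the
// table above, with their single-precision element counts:
//   __m128 - xmmword, 128 bits, 4 floats
//   __m256 - ymmword, 256 bits, 8 floats
//   __m512 - zmmword, 512 bits, 16 floats
constexpr size_t kXmmFloats = sizeof(__m128) / sizeof(float);
constexpr size_t kYmmFloats = sizeof(__m256) / sizeof(float);
constexpr size_t kZmmFloats = sizeof(__m512) / sizeof(float);
```

The integer counterparts (__m128i, __m256i, __m512i) and double-precision counterparts (__m128d, __m256d, __m512d) have the same widths and hold the element counts shown in the other rows of the table.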
AVX-512 extends the maximum width of an x86 SIMD operand from 256 bits to 512 bits. Many AVX-512 instructions can also be used with 128- and 256-bit wide SIMD operands. However, it should be noted at this point that unlike AVX and AVX2, AVX-512 is not a cohesive x86 SIMD instruction set extension. Rather, it is a collection of interrelated but distinct instruction set extensions. An AVX-512-compliant processor must minimally support 512-bit wide operands of packed floating-point (single-precision or double-precision) and packed integer (32- or 64-bit wide) elements. The AVX-512 instructions that exercise 128- and 256-bit wide operands are a distinct x86 SIMD extension, as are the instructions that support packed 8- and 16-bit wide integers. You will learn more about this in the chapters that explain AVX-512 programming. AVX-512 also adds eight opmask registers that a function can use to perform masked moves or masked zeroing.
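Merge masking with an opmask register can be modeled in ordinary scalar C++. The sketch below is a conceptual illustration of the semantics only, not actual AVX-512 intrinsic code, and the function name is hypothetical: each bit of the opmask decides whether the corresponding destination element receives the computed result or keeps its previous value.

```cpp
#include <cstdint>
#include <cstddef>

// Conceptual scalar model of AVX-512 merge masking (not actual intrinsics):
// bit i of the opmask selects whether element i receives the computed sum
// or retains the previous destination value.
void MaskedAddModel(float* dst, const float* a, const float* b,
                    uint16_t opmask, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        if (opmask & (1u << i))
            dst[i] = a[i] + b[i];   // mask bit set: write the sum
        // mask bit clear: dst[i] is left unchanged (merge masking);
        // with zero masking, dst[i] would be set to 0.0f instead
    }
}
```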
SIMD Arithmetic
Source code example Ch01_01 introduced simple SIMD addition using single-precision floating-point elements. In this section, you will learn more about SIMD arithmetic operations that perform their calculations using either integer or floating-point elements.
SIMD Integer Arithmetic
Wraparound vs. Saturated Arithmetic
One notable feature of x86-AVX is its support for saturated integer arithmetic. When performing saturated integer arithmetic, the processor automatically clips the elements of a SIMD operand to prevent an arithmetic overflow or underflow condition from occurring. This is different from normal (or wraparound) integer arithmetic where an overflow or underflow result is retained. Saturated arithmetic is extremely useful when working with pixel values since it eliminates the need to explicitly check each pixel value for an overflow or underflow. X86-AVX includes instructions that perform packed saturated addition and subtraction using 8- or 16-bit wide integer elements, both signed and unsigned.
Range Limits for Saturated Arithmetic
Integer Type | Lower Limit | Upper Limit |
---|---|---|
8-bit signed | -128 | 127 |
8-bit unsigned | 0 | 255 |
16-bit signed | -32768 | 32767 |
16-bit unsigned | 0 | 65535 |
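The difference between wraparound and saturated addition can be modeled in scalar C++ using the 8-bit unsigned limits from the table above. This is a conceptual per-element sketch with hypothetical function names, not intrinsic code; the packed instructions perform the same clipping on every element of a SIMD operand simultaneously.

```cpp
#include <cstdint>
#include <algorithm>

// Saturated unsigned 8-bit addition: the sum is clipped to the
// [0, 255] range instead of wrapping around.
uint8_t SatAddU8(uint8_t a, uint8_t b)
{
    int sum = int(a) + int(b);
    return uint8_t(std::min(sum, 255));
}

// Wraparound unsigned 8-bit addition: the sum is reduced modulo 256.
uint8_t WrapAddU8(uint8_t a, uint8_t b)
{
    return uint8_t(a + b);
}
```

For example, adding the pixel values 200 and 100 yields 255 with saturated arithmetic but wraps around to 44 with ordinary arithmetic, which is why saturation is preferred for image processing.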
SIMD Floating-Point Arithmetic
SIMD Data Manipulation Operations
Besides arithmetic calculations, many algorithms frequently employ SIMD data manipulation operations. X86-AVX SIMD data manipulation operations include element compares, shuffles, permutations, blends, conditional moves, broadcasts, size promotions/reductions, and type conversions. You will learn more about these operations in the programming chapters of this book. A few common SIMD data manipulation operations are, however, employed frequently enough to warrant some preliminary comments in this chapter.
One indispensable SIMD data manipulation operation is a data compare. Like a SIMD arithmetic calculation, the operations performed during a SIMD compare are carried out simultaneously using all operand elements. However, the results generated by a SIMD compare are different from those produced by an ordinary scalar compare. When performing a scalar compare such as a > b, the processor conveys the result using status bits in a flags register (on x86-64 processors, this flags register is named RFLAGS). A SIMD compare is different in that it needs to report the results of multiple compare operations, which means a single set of status bits in a flags register is inadequate. To overcome this limitation, SIMD compares return a mask value that signifies the result of each SIMD element compare operation.
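The mask-producing behavior of a SIMD compare can be modeled in scalar C++. The sketch below is conceptual with a hypothetical function name, not an actual intrinsic: each element compare yields an all-ones mask (true) or an all-zeros mask (false) rather than setting status flags.

```cpp
#include <cstdint>
#include <cstddef>

// Conceptual scalar model of a SIMD greater-than compare: each element
// compare produces an all-ones mask (0xFFFFFFFF) if true or an all-zeros
// mask (0x00000000) if false, instead of setting bits in a flags register.
void CmpGtMask(uint32_t* mask, const float* a, const float* b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        mask[i] = (a[i] > b[i]) ? 0xFFFFFFFFu : 0x00000000u;
}
```

All-ones/all-zeros masks of this kind are convenient because they can be combined with bitwise operations to implement branch-free conditional moves and blends, as later chapters demonstrate.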
SIMD Programming
As mentioned in the Introduction, the primary objective of this book is to help you learn x86 SIMD programming using C++ SIMD intrinsic functions and x86-64 assembly language. The source code examples that you will see in the subsequent chapters are structured to help you achieve this goal.
Source Code File Name Suffixes
File Name Suffix | Description |
---|---|
.h | Standard C++ header file |
.cpp | Standard C++ source code file |
_fcpp.cpp | C++ algorithm code (non-SIMD and SIMD) |
_misc.cpp | Miscellaneous C++ functions |
_bm.cpp | Benchmarking code |
_fasm.asm | Assembly language algorithm code (SIMD) |
Source Code Function Name Suffixes
Function Name Suffix | Description |
---|---|
_Cpp (or no suffix) | Function that uses standard C++ statements |
_Iavx | Function that uses C++ AVX intrinsic functions |
_Iavx2 | Function that uses C++ AVX2 intrinsic functions |
_Iavx512 | Function that uses C++ AVX-512 intrinsic functions |
_Aavx | Function that uses AVX assembly language instructions |
_Aavx2 | Function that uses AVX2 assembly language instructions |
_Aavx512 | Function that uses AVX-512 assembly language instructions |
The most important code resides in files with the suffix names _fcpp.cpp and _fasm.asm. The code in files with other suffix names is somewhat ancillary but still necessary to create an executable program. Note that function names incorporating the substrings avx, avx2, and avx512 will only work on processors that support the AVX, AVX2, and AVX-512 instruction set extensions, respectively. You can use one of the free utilities listed in Appendix B to verify the processing capabilities of your computer.
Finally, it should be noted that the C++ SIMD intrinsic functions used in the source code examples were originally developed by Intel for their compilers. Most of these functions are also supported by other x86-64 C++ compilers including Visual Studio C++ and GNU C++. Appendix A contains additional information about the source code including download and build instructions for both Visual Studio and the GNU toolchain. Depending on your personal preference, you may want to download and install the source code first before proceeding to the next chapter.
Summary
SIMD is a parallel computing technique that carries out concurrent calculations using multiple data items.
AVX supports 128- and 256-bit wide packed floating-point operands. It also supports packed 128-bit wide integer operands.
AVX2 extends AVX to support 256-bit wide integer operands. It also adds additional broadcast and permutation instructions.
AVX-512 minimally supports 512-bit wide packed operands of single-precision or double-precision floating-point values. It also supports 512-bit wide operands of packed 32- and 64-bit wide integers.
The terms xmmword, ymmword, and zmmword are x86 assembly language expressions for 128-, 256-, and 512-bit wide SIMD data types and operands.
The terms byte, word, doubleword, and quadword are x86 assembly language designations for 8-, 16-, 32-, and 64-bit integers.
X86-AVX supports both wraparound and saturated arithmetic for packed 8- and 16-bit integers, both signed and unsigned.