© Daniel Kusswurm 2018
Daniel Kusswurm, Modern X86 Assembly Language Programming, https://doi.org/10.1007/978-1-4842-4063-2_4

4. Advanced Vector Extensions

Daniel Kusswurm, Geneva, IL, USA

In the first three chapters of this book, you learned about the core x86-64 platform including its data types, general-purpose registers, and memory addressing modes. You also examined a cornucopia of sample code that illustrated the fundamentals of x86-64 assembly language programming, including basic operands, integer arithmetic, compare operations, conditional jumps, and manipulation of common data structures.

This chapter introduces Advanced Vector Extensions (AVX). It begins with a brief overview of AVX technologies and SIMD (Single Instruction Multiple Data) processing concepts. This is followed by an examination of the AVX execution environment that covers register sets, data types, and instruction syntax. The chapter also includes discussions of AVX’s scalar floating-point capabilities and its SIMD computational resources. The material presented in this chapter is relevant not only to AVX but also provides the necessary background information to understand AVX2 and AVX-512, which are explained in later chapters.

In the discussions that follow in this and subsequent chapters, the term x86-AVX is used to describe general characteristics and computing resources of Advanced Vector Extensions. The acronyms AVX, AVX2, and AVX-512 are employed when examining attributes or instructions related to a specific x86 feature set enhancement.

AVX Overview

AMD and Intel first incorporated AVX into their CPUs starting in 2011. AVX extends the packed single-precision and double-precision floating-point capabilities of x86-SSE from 128 bits to 256 bits. Unlike general-purpose register instructions, AVX instructions use a three-operand syntax that employs non-destructive source operands, which simplifies assembly language programming considerably. Programmers can use this new instruction syntax with packed 128-bit integer, packed 128-bit floating-point, and packed 256-bit floating-point operands. The three-operand instruction syntax can also be exploited to perform scalar single-precision and double-precision floating-point arithmetic.

In 2013 Intel launched processors that included AVX2. This architectural enhancement extends the packed integer capabilities of AVX from 128 bits to 256 bits. AVX2 adds new data broadcast, blend, and permute instructions to the x86 platform. It also introduces a new vector-index addressing mode that facilitates memory loads (or gathers) of data elements from non-contiguous locations. The most recent x86-AVX extension is called AVX-512, which expands the SIMD capabilities of AVX and AVX2 from 256 bits to 512 bits. AVX-512 also adds eight new opmask registers named K0–K7 to the x86 platform. These registers facilitate conditional instruction execution and data merging operations using per-element granularity. Table 4-1 summarizes current x86-AVX technologies. In this table (and subsequent tables), the acronyms SPFP and DPFP are used to signify single-precision floating-point and double-precision floating-point, respectively.
Table 4-1. Summary of x86-AVX Technologies

| Feature | AVX | AVX2 | AVX-512 |
|---|---|---|---|
| Three-operand syntax; non-destructive source operands | Yes | Yes | Yes |
| SIMD operations using 128-bit packed integers | Yes | Yes | Yes |
| SIMD operations using 256-bit packed integers | No | Yes | Yes |
| SIMD operations using 512-bit packed integers | No | No | Yes |
| SIMD operations using 128-bit packed SPFP, DPFP | Yes | Yes | Yes |
| SIMD operations using 256-bit packed SPFP, DPFP | Yes | Yes | Yes |
| SIMD operations using 512-bit packed SPFP, DPFP | No | No | Yes |
| Scalar SPFP, DPFP arithmetic | Yes | Yes | Yes |
| Enhanced SPFP, DPFP compare operations | Yes | Yes | Yes |
| Basic SPFP, DPFP broadcast and permute | Yes | Yes | Yes |
| Enhanced SPFP, DPFP broadcast and permute | No | Yes | Yes |
| Packed integer broadcast | No | Yes | Yes |
| Enhanced packed integer broadcast, compare, permute, conversions | No | No | Yes |
| Instruction-level broadcast and rounding control | No | No | Yes |
| Fused-multiply-add | No | Yes | Yes |
| Data gather | No | Yes | Yes |
| Data scatter | No | No | Yes |
| Conditional execution and data merging using opmask registers | No | No | Yes |

It should be noted that fused-multiply-add is a distinct x86 platform feature extension that was introduced in tandem with AVX2. A program must confirm the presence of this feature extension by testing the CPUID FMA feature flag before using any of the corresponding instructions. You’ll learn how to do this in Chapter 16. The remainder of this chapter focuses primarily on AVX. Chapters 8 and 12 discuss the particulars of AVX2 and AVX-512 in greater detail.

SIMD Programming Concepts

As implied by the words of the acronym, a SIMD computing element executes the same operation on multiple data items simultaneously. Universal SIMD operations include basic arithmetic such as addition, subtraction, multiplication, and division. SIMD processing techniques can also be applied to a variety of other computational tasks including data compares, conversions, Boolean calculations, permutations, and bit shifts. Processors facilitate SIMD operations by reinterpreting the bits of an operand in a register or memory location. For example, a 128-bit wide operand can hold two independent 64-bit integer values. It is also capable of accommodating four 32-bit integers, eight 16-bit integers, or sixteen 8-bit integers, as illustrated in Figure 4-1.
Figure 4-1. 128-bit wide operand using distinct integers

Figure 4-2 exemplifies a few SIMD arithmetic operations in greater detail. In this figure, integer addition is illustrated using two 64-bit integers, four 32-bit integers, or eight 16-bit integers. Algorithmic processing is faster when multiple data items are processed at once, since the CPU can carry out the necessary operations in parallel. For example, when an instruction specifies 16-bit integer operands, the CPU performs all eight 16-bit integer additions simultaneously.
Figure 4-2. SIMD integer addition

Wraparound vs. Saturated Arithmetic

One extremely useful feature of x86-AVX technology is its support for saturated integer arithmetic. In saturated integer arithmetic, computational results are automatically clipped by the processor to prevent overflow and underflow conditions. This differs from normal wraparound integer arithmetic where an overflow or underflow result is retained (as you’ll soon see). Saturated arithmetic is handy when working with pixel values since it automatically clips values and eliminates the need to explicitly check the result of each pixel calculation for an overflow or underflow condition. X86-AVX includes instructions that perform saturated arithmetic using 8-bit and 16-bit integers, both signed and unsigned.

Let’s take a closer look at some examples of both wraparound and saturated arithmetic. Figure 4-3 shows an example of 16-bit signed integer addition using wraparound and saturated arithmetic. An overflow condition occurs if the two 16-bit signed integers are added using wraparound arithmetic. With saturated arithmetic, however, the result is clipped to the largest possible 16-bit signed integer value. Figure 4-4 illustrates a similar example using 8-bit unsigned integers. Besides addition, x86-AVX also supports saturated integer subtraction, as shown in Figure 4-5. Table 4-2 summarizes the saturated arithmetic range limits for all supported integer sizes and sign types.
Figure 4-3. 16-bit signed integer addition using wraparound and saturated arithmetic

Figure 4-4. 8-bit unsigned integer addition using wraparound and saturated arithmetic

Figure 4-5. 16-bit signed integer subtraction using wraparound and saturated arithmetic

Table 4-2. Range Limits for Saturated Arithmetic

| Integer Type | Lower Limit | Upper Limit |
|---|---|---|
| 8-bit signed | -128 (0x80) | +127 (0x7f) |
| 8-bit unsigned | 0 | +255 (0xff) |
| 16-bit signed | -32768 (0x8000) | +32767 (0x7fff) |
| 16-bit unsigned | 0 | +65535 (0xffff) |

AVX Execution Environment

In this section you’ll learn about the x86-AVX execution environment. Included are explanations of the AVX register set, its data types, and instruction syntax. As mentioned earlier, x86-AVX is an architectural enhancement that extends x86-SSE technology to support SIMD operations using either 256-bit or 128-bit wide operands. The material that’s presented in this section assumes no previous knowledge or experience with x86-SSE.

Register Set

X86-64 processors that support AVX incorporate 16 256-bit wide registers named YMM0 – YMM15. The low-order 128 bits of each YMM register are aliased to a corresponding XMM register, as illustrated in Figure 4-6. Most AVX instructions can use any of the XMM or YMM registers as SIMD operands. The XMM registers can also be employed to carry out scalar floating-point calculations using either single-precision or double-precision values similar to x86-SSE. Programmers with assembly language experience using x86-SSE need to be aware of some minor execution differences between this earlier instruction set extension and x86-AVX. These differences are explained later in this chapter.
Figure 4-6. AVX register set

The x86-AVX execution environment also includes a control-status register named MXCSR. This register contains status flags that facilitate the detection of error conditions caused by floating-point arithmetic operations. It also includes control bits that programs can use to enable or disable floating-point exceptions and specify rounding options. You’ll learn more about the MXCSR register later in this chapter.

Data Types

As previously mentioned, AVX supports SIMD operations using 256-bit and 128-bit wide packed single-precision or packed double-precision floating-point operands. A 256-bit wide YMM register or memory location can hold eight single-precision or four double-precision values, as shown in Figure 4-7. When used with a 128-bit wide XMM register or memory location, an AVX instruction can process four single-precision or two double-precision values. Like SSE and SSE2, AVX instructions use the low-order doubleword or quadword of an XMM register to carry out scalar single-precision or double-precision floating-point arithmetic.
Figure 4-7. AVX and AVX2 data types

AVX also includes instructions that use the XMM registers to perform SIMD operations using a variety of packed integer operands including bytes, words, doublewords, and quadwords. AVX2 extends the packed integer processing capabilities of AVX to the YMM registers and 256-bit wide operands in memory. Figure 4-7 also shows these data types.

Instruction Syntax

Perhaps the most noteworthy programming facet of x86-AVX is its use of a contemporary assembly language instruction syntax. Most x86-AVX instructions use a three-operand format that consists of two source operands and one destination operand. The general syntax employed by x86-AVX instructions is InstrMnemonic DesOp,SrcOp1,SrcOp2. Here, InstrMnemonic signifies the instruction mnemonic, DesOp represents the destination operand, and SrcOp1 and SrcOp2 denote the source operands. A small subset of x86-AVX instructions employs one or three source operands along with a destination operand. Nearly all x86-AVX instruction source operands are non-destructive, which means that source operands are not modified during instruction execution, except in cases where the destination operand register is the same as one of the source operand registers. The use of non-destructive source operands often results in simpler and slightly faster code since it reduces the number of register-to-register data transfers that a function must perform.

X86-AVX’s ability to support a three-operand instruction syntax is due to a new instruction-encoding prefix. The vector extension (VEX) prefix enables x86-AVX instructions to be encoded using a more efficient format than the prefixes used for x86-SSE instructions. The VEX prefix has also been used to add new general-purpose register instructions to the x86 platform. You’ll learn about these instructions in Chapter 8.

AVX Scalar Floating-Point

This section examines the scalar floating-point capabilities of AVX. It begins with a short explanation of some important floating-point concepts including data types, bit encodings, and special values. Software developers who understand these concepts are often able to improve the performance of algorithms that make heavy use of floating-point arithmetic and minimize potential floating-point errors. The AVX scalar floating-point registers are also explained in this section, including descriptions of the XMM registers and the MXCSR control-status register. The section concludes with an overview of the AVX scalar floating-point instruction set.

Floating-Point Programming Concepts

In mathematics a real-number system depicts an infinite continuum of all possible positive and negative numbers including integers, rational numbers, and irrational numbers. Given their finite resources, modern computing architectures typically employ a floating-point system to approximate a real-number system. Like many other computing platforms, the x86’s floating-point system is based on the IEEE 754 standard for binary floating-point arithmetic. This standard includes specifications that define bit encodings, range limits, and precisions for scalar floating-point values. The IEEE 754 standard also specifies important details related to floating-point arithmetic operations, rounding rules, and numerical exceptions.

The AVX instruction set supports common floating-point operations using single precision (32-bit) and double precision (64-bit) values. Many C++ compilers including Visual C++ use the x86’s intrinsic single-precision and double-precision types to implement the C++ types float and double. Figure 4-8 illustrates the memory organization of both single-precision and double-precision floating-point values. This figure also includes common integer types for comparison purposes.
Figure 4-8. Memory organization of floating-point values

The binary encoding of a floating-point value requires three distinct fields: a significand, an exponent, and a sign bit. The significand field represents a number’s significant digits (or fractional part). The exponent specifies the location of the binary “decimal” point in the significand, which determines the magnitude. The sign bit indicates whether the number is positive (s = 0) or negative (s = 1). Table 4-3 lists the various size parameters that are used to encode single-precision and double-precision floating-point values.
Table 4-3. Floating-Point Size Parameters

| Parameter | Single-Precision | Double-Precision |
|---|---|---|
| Total width | 32 bits | 64 bits |
| Significand width | 23 bits | 52 bits |
| Exponent width | 8 bits | 11 bits |
| Sign width | 1 bit | 1 bit |
| Exponent bias | +127 | +1023 |
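Putting the three fields together, a normalized value decodes according to the standard IEEE 754 rule (a restatement using s for the sign bit, f for the significand field, and e for the biased exponent):

```latex
v = (-1)^{s} \times 1.f \times 2^{\,e - \mathrm{bias}}
```

For single-precision values the bias is +127, so a stored exponent of 10000110b (134) represents a true exponent of +7.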

Figure 4-9 illustrates how to convert a decimal number into an x86-compatible floating-point encoded value. In this example, the number 237.8125 is transformed from a decimal number to its single-precision floating-point encoding. The process starts by converting the number from base 10 to base 2. Next, the base 2 value is transformed to a binary scientific number. The value to the right of the E2 symbol is the binary exponent. A properly encoded floating-point value uses a biased exponent instead of the true exponent since this expedites floating-point compare operations. For a single-precision floating-point number, the bias value is +127. Adding the exponent bias value to the true exponent generates a binary scientific number with a biased exponent. In the example shown in Figure 4-9, adding 111b (+7) to 1111111b (+127) yields a binary scientific number with a biased exponent of 10000110b (+134).
Figure 4-9. Single-precision floating-point encoding process

When encoding a single-precision or double-precision floating-point value, the leading 1 digit of the significand is implied and not included in the final binary representation. Dropping the leading 1 digit forms a normalized significand. The three fields required for an IEEE 754 compliant encoding are now available, as shown in Table 4-4. Reading the bit fields in this table from left to right yields the 32-bit value 0x436DD000, which is the final single-precision floating-point encoding of 237.8125.
Table 4-4. Bit Fields for IEEE 754 Compliant Encoding of 237.8125

| Sign | Biased Exponent | Normalized Significand |
|---|---|---|
| 0 | 10000110 | 11011011101000000000000 |

The IEEE 754 floating-point encoding scheme reserves a small set of bit patterns for special values that are used to handle certain processing conditions. The first group of special values includes denormalized numbers (or denormals). As shown in the earlier encoding example, the standard encoding of a floating-point number assumes that the leading digit of the significand is always a 1. One limitation of the IEEE 754 floating-point encoding scheme is its inability to accurately represent numbers very close to zero. In these cases, values get encoded using a non-normalized format, which enables tiny numbers close to zero (both positive and negative) to be encoded using less precision. Denormals rarely occur, but when they do, the CPU can still process them. In algorithms where the use of a denormal is problematic, a function can test a floating-point value to ascertain its denormal state, or the processor can be configured to generate an underflow or denormal exception.

Another application of special values involves the encodings that are used for floating-point zero. The IEEE 754 standard supports two different representations of floating-point zero: positive zero (+0.0) and negative zero (–0.0). A negative zero can be generated either algorithmically or as a side effect of the floating-point rounding mode. Computationally, the processor treats positive and negative zero the same, and the programmer typically does not need to distinguish between them.

The IEEE 754 encoding scheme also supports positive and negative representations of infinity. Infinities are produced by certain numerical algorithms, overflow conditions, or division by zero. As discussed later in this chapter, the processor can be configured to generate an exception whenever a floating-point overflow occurs or if a program attempts to divide a number by zero.

The final special value type is called Not a Number (NaN). NaNs are floating-point encodings that represent invalid numbers. The IEEE 754 standard defines two types of NaNs: signaling NaN (SNaN) and quiet NaN (QNaN). SNaNs are created by software; an x86-64 CPU will not create a SNaN during any arithmetic operation. Any attempt by an instruction to use a SNaN will cause an invalid operation exception, unless the exception is masked. SNaNs are useful for testing exception handlers. They can also be exploited by an application program for proprietary numerical-processing purposes. An x86 CPU uses QNaNs as a default response to certain invalid arithmetic operations whose exceptions are masked. For example, one unique encoding of a QNaN, called an indefinite, is substituted for a result whenever a function uses one of the scalar square root instructions with a negative value. QNaNs also can be used by programs to signify algorithm-specific errors or other unusual numerical conditions. When QNaNs are used as operands, they enable continued processing without generating an exception.

When developing software that performs floating-point calculations, it is important to keep in mind that the employed encoding scheme is simply an approximation of a real-number system. It is impossible for any floating-point encoding system to represent an infinite number of values using a finite number of bits. This leads to floating-point rounding errors that can affect the accuracy of a calculation. Also, some mathematical properties that hold true for integers and real numbers are not necessarily true for floating-point numbers. For example, floating-point multiplication is not necessarily associative; (a * b) * c may not equal a * (b * c) depending on the values of a, b, and c. Developers of algorithms that require high levels of floating-point accuracy must be aware of these issues. Appendix A contains a list of references that explain this and other potential pitfalls of floating-point arithmetic in greater detail. Chapter 9 also includes a source code example that exemplifies floating-point non-associativity.

Scalar Floating-Point Register Set

As previously shown in Figure 4-6, all x86-64 compatible processors include 16 128-bit registers named XMM0 – XMM15. A program can use any of the XMM registers to perform scalar floating-point operations including common arithmetic calculations, data transfers, comparisons, and type conversions. The CPU uses the low-order 32 bits of an XMM register to carry out single-precision floating-point calculations. Double-precision floating-point operations employ the low-order 64 bits. Figure 4-10 illustrates these register locations in greater detail. Programs cannot use the high-order bits of an XMM register to perform scalar floating-point calculations. However, when used as a destination operand, the values of these bits might be modified during the execution of an AVX scalar floating-point instruction as explained later in this section.
Figure 4-10. Scalar floating-point values when loaded in an XMM register

Control-Status Register

In addition to the XMM registers, x86-64 processors include a 32-bit control-status register named MXCSR. This register contains a series of control flags that enable a program to specify options for floating-point calculations and exceptions. It also includes a set of status flags that can be tested to detect floating-point error conditions. Figure 4-11 shows the organization of the bits in MXCSR; Table 4-5 describes the purpose of each bit field.
Figure 4-11. MXCSR control and status register

Table 4-5. Description of MXCSR Register Bit Fields

| Bit | Field Name | Description |
|---|---|---|
| IE | Invalid operation flag | Floating-point invalid operation error flag. |
| DE | Denormal flag | Floating-point denormal error flag. |
| ZE | Divide-by-zero flag | Floating-point division-by-zero error flag. |
| OE | Overflow flag | Floating-point overflow error flag. |
| UE | Underflow flag | Floating-point underflow error flag. |
| PE | Precision flag | Floating-point precision error flag. |
| DAZ | Denormals are zero | When set to 1, forcibly converts a denormal source operand to zero prior to its use in a calculation. |
| IM | Invalid operation mask | Floating-point invalid operation error exception mask. |
| DM | Denormal mask | Floating-point denormal error exception mask. |
| ZM | Divide-by-zero mask | Floating-point divide-by-zero error exception mask. |
| OM | Overflow mask | Floating-point overflow error exception mask. |
| UM | Underflow mask | Floating-point underflow error exception mask. |
| PM | Precision mask | Floating-point precision error exception mask. |
| RC | Rounding control | Specifies the method for rounding floating-point results. Valid options include round to nearest (00b), round down toward -∞ (01b), round up toward +∞ (10b), and round toward zero or truncate (11b). |
| FTZ | Flush to zero | When set to 1, forces a zero result if the underflow exception is masked and a floating-point underflow error occurs. |

An application program can modify any of the MXCSR’s control flags or status bits to accommodate its specific SIMD floating-point processing requirements. Any attempt to write a non-zero value to a reserved bit position will cause the processor to generate an exception. The processor sets an MXCSR error flag to 1 following the occurrence of an error condition. MXCSR error flags are not automatically cleared by the processor after an error is detected; they must be manually reset. The control flags and status bits of the MXCSR register can be modified using the vldmxcsr (Load MXCSR Register) instruction. Setting a mask bit to 1 disables the corresponding exception. The vstmxcsr (Store MXCSR Register) instruction can be used to save the current MXCSR state. An application program cannot directly access the internal processor tables that specify floating-point exception handlers. However, most C++ compilers provide a library function that allows an application program to designate a callback function that gets invoked whenever a floating-point exception occurs.

The MXCSR includes two control flags that can be used to speed up certain floating-point calculations. Setting the MXCSR.DAZ control flag to 1 can improve the performance of algorithms where the rounding of a denormal value to zero is acceptable. Similarly, the MXCSR.FTZ control flag can be used to accelerate computations where floating-point underflows are common. The downside of enabling either of these options is non-compliance with the IEEE 754 floating-point standard.

Instruction Set Overview

Table 4-6 lists in alphabetical order commonly used AVX scalar floating-point instructions. In this table, mnemonic text [d|s] signifies that an instruction can be used with either double-precision floating-point or single-precision floating-point operands. You’ll learn how to use many of these instructions in Chapter 5.
Table 4-6. Overview of Commonly Used AVX Scalar Floating-Point Instructions

vadds[d|s] - Scalar floating-point addition
vbroadcasts[d|s] - Broadcast scalar floating-point value
vcmps[d|s] - Scalar floating-point compare
vcomis[d|s] - Ordered scalar floating-point compare and set RFLAGS
vcvts[d|s]2si - Convert scalar floating-point to doubleword signed integer
vcvtsd2ss - Convert scalar DPFP to scalar SPFP
vcvtsi2s[d|s] - Convert signed doubleword integer to scalar floating-point
vcvtss2sd - Convert scalar SPFP to scalar DPFP
vcvtts[d|s]2si - Convert with truncation scalar floating-point to signed integer
vdivs[d|s] - Scalar floating-point division
vmaxs[d|s] - Scalar floating-point maximum
vmins[d|s] - Scalar floating-point minimum
vmovs[d|s] - Move scalar floating-point value
vmuls[d|s] - Scalar floating-point multiplication
vrounds[d|s] - Round scalar floating-point value
vsqrts[d|s] - Scalar floating-point square root
vsubs[d|s] - Scalar floating-point subtraction
vucomis[d|s] - Unordered scalar floating-point compare and set RFLAGS

Table 4-7 illustrates the operation of the AVX scalar floating-point instructions vadds[d|s] and vsqrts[d|s]. In these examples, the colon notation designates bit position ranges within a register (e.g., 31:0 designates bit positions 31 through 0 inclusive). Note that execution of an AVX scalar floating-point instruction also copies the unused bits of the first source operand to the destination operand. Also note that the upper 128 bits of the corresponding YMM register are set to zero.
Table 4-7. AVX Scalar Floating-Point Instruction Examples

vaddss xmm0,xmm1,xmm2
  xmm0[31:0] = xmm1[31:0] + xmm2[31:0]
  xmm0[127:32] = xmm1[127:32]
  ymm0[255:128] = 0

vaddsd xmm0,xmm1,xmm2
  xmm0[63:0] = xmm1[63:0] + xmm2[63:0]
  xmm0[127:64] = xmm1[127:64]
  ymm0[255:128] = 0

vsqrtss xmm0,xmm1,xmm2
  xmm0[31:0] = sqrt(xmm2[31:0])
  xmm0[127:32] = xmm1[127:32]
  ymm0[255:128] = 0

vsqrtsd xmm0,xmm1,xmm2
  xmm0[63:0] = sqrt(xmm2[63:0])
  xmm0[127:64] = xmm1[127:64]
  ymm0[255:128] = 0

AVX Packed Floating-Point

AVX supports packed floating-point operations using either 128-bit wide or 256-bit wide operands. Figures 4-12 and 4-13 illustrate common packed floating-point arithmetic operations using 256-bit wide operands with single-precision and double-precision elements. Similar to AVX scalar floating-point, rounding for AVX packed floating-point arithmetic operations is specified by the MXCSR’s rounding control field, as defined in Table 4-5. The processor also uses the MXCSR’s status flags to signal the occurrence of a packed floating-point error condition.
Figure 4-12. AVX packed single-precision floating-point addition

Figure 4-13. AVX packed double-precision floating-point multiplication

Most AVX arithmetic instructions perform their operations using the corresponding element positions of the two source operands. AVX also supports horizontal arithmetic operations using either packed floating-point or packed integer operands. A horizontal arithmetic operation carries out its computations using the adjacent elements of a packed data type. Figure 4-14 illustrates horizontal addition using single-precision floating-point and horizontal subtraction using double-precision floating-point operands. The AVX instruction set also supports integer horizontal addition and subtraction using packed words and doublewords. Horizontal operations are typically used to reduce a packed data operand that contains multiple intermediate values to a single final result.
Figure 4-14. AVX horizontal addition and subtraction using single-precision and double-precision elements

Instruction Set Overview

Table 4-8 lists in alphabetical order commonly used AVX packed floating-point instructions. Similar to the scalar floating-point table that you saw in the previous section, the mnemonic text [d|s] signifies that an instruction can be used with either packed double-precision floating-point or packed single-precision floating-point operands. You’ll learn how to use many of these instructions in Chapter 6.
Table 4-8. Overview of Commonly-Used AVX Packed Floating-Point Instructions

vaddp[d|s] - Packed floating-point addition
vaddsubp[d|s] - Packed floating-point add-subtract
vandp[d|s] - Packed floating-point bitwise AND
vandnp[d|s] - Packed floating-point bitwise AND NOT
vblendp[d|s] - Packed floating-point blend
vblendvp[d|s] - Variable packed floating-point blend
vcmpp[d|s] - Packed floating-point compare
vcvtdq2p[d|s] - Convert packed signed doubleword integers to floating-point
vcvtp[d|s]2dq - Convert packed floating-point to signed doublewords
vcvtpd2ps - Convert packed DPFP to packed SPFP
vcvtps2pd - Convert packed SPFP to packed DPFP
vdivp[d|s] - Packed floating-point division
vdpp[d|s] - Packed dot product
vhaddp[d|s] - Horizontal packed floating-point addition
vhsubp[d|s] - Horizontal packed floating-point subtraction
vmaskmovp[d|s] - Packed floating-point conditional load and store
vmaxp[d|s] - Packed floating-point maximum
vminp[d|s] - Packed floating-point minimum
vmovap[d|s] - Move aligned packed floating-point values
vmovmskp[d|s] - Extract packed floating-point sign bitmask
vmovup[d|s] - Move unaligned packed floating-point values
vmulp[d|s] - Packed floating-point multiplication
vorp[d|s] - Packed floating-point bitwise inclusive OR
vpermilp[d|s] - Permute in-lane packed floating-point elements
vroundp[d|s] - Round packed floating-point values
vshufp[d|s] - Shuffle packed floating-point values
vsqrtp[d|s] - Packed floating-point square root
vsubp[d|s] - Packed floating-point subtraction
vunpckhp[d|s] - Unpack and interleave high packed floating-point values
vunpcklp[d|s] - Unpack and interleave low packed floating-point values
vxorp[d|s] - Packed floating-point bitwise exclusive OR

AVX Packed Integer

AVX supports packed integer operations using 128-bit wide operands. A 128-bit wide operand facilitates packed integer operations using two quadwords, four doublewords, eight words, or sixteen bytes, as shown in Figure 4-15. In this figure, the vpaddb (Add Packed Integers) instruction illustrates packed 8-bit integer addition. The vpmaxsw (Packed Signed Integer Maximums) instruction saves the maximum signed word value of each element pair to the specified destination operand. The vpmulld (Multiply Packed Integers and Store Low Result) instruction carries out packed signed doubleword multiplication and saves the low-order 32 bits of each result. Finally, the vpsllq (Shift Packed Data Left Logical) instruction performs a logical left shift of each quadword element using the bit count that's specified by the immediate operand.
../images/326959_2_En_4_Chapter/326959_2_En_4_Fig15_HTML.jpg
Figure 4-15.

Example AVX packed integer operations
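The Figure 4-15 operations can be sketched as the following MASM-style fragment. This is an illustrative sketch, not a complete function; the register contents are assumed to have been loaded beforehand.

```asm
; Packed integer operations from Figure 4-15 (register contents assumed)
        vpaddb  xmm0,xmm1,xmm2      ; 16 byte-wise adds: xmm0[i] = xmm1[i] + xmm2[i]
        vpmaxsw xmm0,xmm1,xmm2      ; 8 signed word maximums: xmm0[i] = max(xmm1[i],xmm2[i])
        vpmulld xmm0,xmm1,xmm2      ; 4 doubleword multiplies, low 32 bits of each product saved
        vpsllq  xmm0,xmm1,4         ; shift each quadword in xmm1 left by 4 bits
```

Note that each instruction uses the three-operand AVX syntax: the two source operands (xmm1 and xmm2) are not modified.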

Most AVX packed integer instructions do not update the status flags in the RFLAGS register. This means that error conditions such as arithmetic overflow and underflow are not reported. It also means that the results of a packed integer operation do not directly affect execution of the conditional instructions cmovcc, jcc, and setcc. However, programs can employ SIMD-specific techniques to make logical decisions based on the outcome of a packed integer operation. You'll see examples of these techniques in Chapter 7.
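One such technique is to use a packed compare to build an element-wise mask and then combine the sources with bitwise operations instead of branching. The following sketch (register contents assumed) computes a packed signed doubleword maximum this way:

```asm
; Branchless per-element maximum using a compare mask (registers assumed)
        vpcmpgtd xmm3,xmm0,xmm1     ; mask: all 1s where xmm0[i] > xmm1[i], else all 0s
        vpand    xmm4,xmm0,xmm3     ; keep xmm0 elements where the mask is set
        vpandn   xmm5,xmm3,xmm1     ; keep xmm1 elements where the mask is clear
        vpor     xmm0,xmm4,xmm5     ; xmm0[i] = max(xmm0[i],xmm1[i]) per doubleword
```

No conditional jump is needed; the decision is encoded entirely in the mask produced by vpcmpgtd. (In practice, vpmaxsd performs this particular operation directly; the sequence above illustrates the general mask-and-merge pattern.)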

Instruction Set Overview

Table 4-9 lists, in alphabetical order, commonly used AVX packed integer instructions. In this table, the mnemonic text [b|w|d|q] signifies the size (byte, word, doubleword, or quadword) of the elements that are processed. You'll learn how to use many of these instructions in Chapter 7.
Table 4-9. Overview of Commonly Used AVX Packed Integer Instructions

Instruction         Description
vmov[d|q]           Move to/from XMM register
vmovdqa             Move aligned packed integer values
vmovdqu             Move unaligned packed integer values
vpabs[b|w|d]        Packed integer absolute value
vpackss[dw|wb]      Pack with signed saturation
vpackus[dw|wb]      Pack with unsigned saturation
vpadd[b|w|d|q]      Packed integer addition
vpadds[b|w]         Packed integer addition with signed saturation
vpaddus[b|w]        Packed integer addition with unsigned saturation
vpand               Packed bitwise AND
vpandn              Packed bitwise AND NOT
vpcmpeq[b|w|d|q]    Packed integer compare for equality
vpcmpgt[b|w|d|q]    Packed signed integer compare for greater than
vpextr[b|w|d|q]     Extract integer from XMM register
vphadd[w|d]         Horizontal packed addition
vphsub[w|d]         Horizontal packed subtraction
vpinsr[b|w|d|q]     Insert integer into XMM register
vpmaxs[b|w|d]       Packed signed integer maximum
vpmaxu[b|w|d]       Packed unsigned integer maximum
vpmins[b|w|d]       Packed signed integer minimum
vpminu[b|w|d]       Packed unsigned integer minimum
vpmovsx             Packed integer move with sign extend
vpmovzx             Packed integer move with zero extend
vpmuldq             Packed signed doubleword multiplication
vpmulhuw            Packed unsigned word multiplication, save high result
vpmul[h|l]w         Packed signed word multiplication, save [high|low] result
vpmull[d|w]         Packed signed multiplication, save low result
vpmuludq            Packed unsigned doubleword multiplication
vpshuf[b|d]         Shuffle packed integers
vpshuf[h|l]w        Shuffle [high|low] packed words
vpslldq             Shift logical left double quadword
vpsll[w|d|q]        Packed logical shift left
vpsra[w|d]          Packed arithmetic shift right
vpsrldq             Shift logical right double quadword
vpsrl[w|d|q]        Packed logical shift right
vpsub[b|w|d|q]      Packed integer subtraction
vpsubs[b|w]         Packed integer subtraction with signed saturation
vpsubus[b|w]        Packed integer subtraction with unsigned saturation
vpunpckh[bw|wd|dq]  Unpack high data
vpunpckl[bw|wd|dq]  Unpack low data

Differences Between x86-AVX and x86-SSE

If you have any previous experience with x86-SSE assembly language programming, you have undoubtedly noticed that a high degree of symmetry exists between this execution environment and x86-AVX. Most x86-SSE instructions have an x86-AVX equivalent that can use either 256-bit or 128-bit wide operands. There are, however, a few important differences between the x86-SSE and x86-AVX execution environments. The remainder of this section explains these differences. Even if you don’t have any previous experience with x86-SSE, I still recommend reading this section since it elucidates important details that you need to be aware of when writing code that uses the x86-AVX instruction set.

Within an x86-64 processor that supports x86-AVX, each 256-bit YMM register is partitioned into an upper and lower 128-bit lane. Many x86-AVX instructions carry out their operations using same-lane source and destination operand elements. This independent lane execution tends to be inconspicuous when using x86-AVX instructions that perform arithmetic calculations. However, when using instructions that re-order the data elements of a packed quantity, the effect of separate execution lanes is more evident. For example, the vshufps (Packed Interleave Shuffle of Single-Precision Values) instruction rearranges the elements of its source operands according to a control mask that’s specified as an immediate operand. The vpunpcklwd (Unpack Low Data) instruction interleaves the low-order elements in its two source operands. Figure 4-16 illustrates the in-lane effect of these instructions in greater detail. Note that the floating-point shuffle and unpack operations are carried out independently in both the upper (bits 255:128) and lower (bits 127:0) double quadwords. You’ll learn more about the vshufps and vpunpcklwd instructions in Chapters 6 and 7.
../images/326959_2_En_4_Chapter/326959_2_En_4_Fig16_HTML.jpg
Figure 4-16.

Examples of x86-AVX instruction execution using independent lanes
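The in-lane behavior described above can be sketched as follows. This is an illustrative fragment with assumed register contents; each 2-bit field of the vshufps immediate selects a source element, but only from within the same 128-bit lane.

```asm
; In-lane element reordering (register contents assumed)
        vshufps    ymm0,ymm1,ymm2,44h   ; per lane: ymm0 = [ymm1[0],ymm1[1],ymm2[0],ymm2[1]]
                                        ; low lane built only from bits 127:0 of the sources,
                                        ; high lane built only from bits 255:128
        vpunpcklwd xmm3,xmm4,xmm5       ; interleave the four low-order words of xmm4 and xmm5
```

Because the same immediate is applied to both lanes of a 256-bit vshufps, no element can cross from one 128-bit lane to the other.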

The aliasing of the XMM and YMM register sets introduces a few programming issues that software developers need to keep in mind. The first issue relates to the processor’s handling of a YMM register’s high-order 128 bits when the corresponding XMM register is used as a destination operand. When executing on a processor that supports x86-AVX technology, an x86-SSE instruction that uses an XMM register as a destination operand will never modify the upper 128 bits of the corresponding YMM register. However, the equivalent x86-AVX instruction will zero the upper 128 bits of the respective YMM register. Consider, for example, the following instances of the (v)cvtps2pd (Convert Packed Single-Precision to Packed Double-Precision) instruction:
cvtps2pd xmm0,xmm1
vcvtps2pd xmm0,xmm1
vcvtps2pd ymm0,ymm1

The x86-SSE cvtps2pd instruction converts the two packed single-precision floating-point values in the low-order quadword of XMM1 to double-precision floating-point and saves the result in register XMM0. This instruction does not modify the high-order 128 bits of register YMM0. The first vcvtps2pd instruction performs the same packed single-precision to packed double-precision conversion operation; it also zeros the high-order 128 bits of YMM0. The second vcvtps2pd instruction converts the four packed single-precision floating-point values in the low-order 128 bits of YMM1 to packed double-precision floating-point values and saves the result to YMM0.

X86-AVX relaxes the alignment requirements of x86-SSE for packed operands in memory. Except for instructions that explicitly specify an aligned operand (e.g., vmovaps, vmovdqa, etc.), proper alignment of a 128-bit or 256-bit wide operand in memory is not mandatory. However, 128-bit and 256-bit wide operands should always be properly aligned whenever possible in order to prevent processing delays that can occur when the processor accesses unaligned operands in memory.
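The difference between the relaxed and explicit alignment requirements can be sketched as follows (the addresses in rcx, rdx, and r8 are assumed):

```asm
; Aligned vs. unaligned AVX memory access (addresses assumed)
        vmovups ymm0,ymmword ptr [rcx]      ; unaligned load, valid for any address
        vmovaps ymm1,ymmword ptr [rdx]      ; faults unless rdx is 32-byte aligned
        vaddps  ymm2,ymm0,ymmword ptr [r8]  ; AVX memory operands need not be aligned
```

Note that the memory operand of vaddps is not required to be aligned, unlike its x86-SSE counterpart addps, which mandates a 16-byte aligned operand.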

The last issue that programmers need to be aware of involves the intermixing of x86-AVX and x86-SSE code. Programs are allowed to intermix x86-AVX and x86-SSE instructions, but any intermixing should be kept to a minimum in order to avoid internal processor state transition penalties that can affect performance. These penalties can occur if the processor is required to preserve the upper 128 bits of each YMM register during a transition from executing x86-AVX to executing x86-SSE instructions. State transition penalties can be completely avoided by using the vzeroupper (Zero Upper Bits of YMM Registers) instruction, which zeroes the upper 128 bits of all YMM registers. This instruction should be used prior to any transition from 256-bit x86-AVX code (i.e., any x86-AVX code that uses a YMM register) to x86-SSE code.

A common use of the vzeroupper instruction is in public functions that employ 256-bit x86-AVX instructions. These functions should include a vzeroupper instruction prior to the execution of any ret instruction since this prevents processor state transition penalties from occurring in any high-level language code that uses x86-SSE instructions. The vzeroupper instruction should also be employed before calling any library functions that might contain x86-SSE code. Later in this book, you'll see several source code examples that demonstrate proper use of the vzeroupper instruction. Functions can also use the vzeroall (Zero All YMM Registers) instruction instead of vzeroupper to avoid potential x86-AVX/x86-SSE state transition penalties.
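A typical epilogue for such a function can be sketched as follows. The function name is hypothetical; the point is the placement of vzeroupper immediately before ret.

```asm
; Hypothetical public function that uses 256-bit x86-AVX instructions
Avx256Calc proc
        ; ... 256-bit AVX computation using ymm0-ymm5 ...
        vzeroupper                  ; zero bits 255:128 of all YMM registers
        ret                         ; caller's x86-SSE code incurs no transition penalty
Avx256Calc endp
```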

Summary

Here are the key learning points for Chapter 4:
  • AVX technology is an x86 platform architectural enhancement that facilitates SIMD operations using 128-bit and 256-bit wide packed floating-point operands, both single-precision and double-precision.

  • AVX also supports SIMD operations using 128-bit wide packed integer and scalar floating-point operands. AVX2 extends the AVX instruction set to support SIMD operations using 256-bit wide packed integer operands.

  • AVX adds 16 YMM (256-bit) and XMM (128-bit) registers to the x86-64 platform. Each XMM register is aliased with the low-order 128 bits of its corresponding YMM register.

  • Most AVX instructions use a three-operand syntax that includes two non-destructive source operands.

  • AVX floating-point operations conform to the IEEE 754 standard for floating-point arithmetic.

  • Programs can use the control and status flags in the MXCSR register to enable floating-point exceptions, detect floating-point error conditions, and configure floating-point rounding.

  • Except for instructions that explicitly specify aligned operands, 128-bit and 256-bit wide operands in memory need not be properly aligned. However, SIMD operands in memory should always be properly aligned whenever possible to avoid delays that can occur when the processor accesses an unaligned operand in memory.

  • A vzeroupper or vzeroall instruction should be used in any function that uses a YMM register as an operand in order to avoid potential x86-AVX to x86-SSE state transition performance penalties.
