© Daniel Kusswurm 2018
Daniel Kusswurm, Modern X86 Assembly Language Programming, https://doi.org/10.1007/978-1-4842-4063-2_4

4. Advanced Vector Extensions

Daniel Kusswurm, Geneva, IL, USA

In the first three chapters of this book, you learned about the core x86-64 platform including its data types, general-purpose registers, and memory addressing modes. You also examined a cornucopia of sample code that illustrated the fundamentals of x86-64 assembly language programming, including basic operands, integer arithmetic, compare operations, conditional jumps, and manipulation of common data structures.

This chapter introduces Advanced Vector Extensions (AVX). It begins with a brief overview of AVX technologies and SIMD (Single Instruction Multiple Data) processing concepts. This is followed by an examination of the AVX execution environment that covers register sets, data types, and instruction syntax. The chapter also includes discussions of AVX’s scalar floating-point capabilities and its SIMD computational resources. The material presented in this chapter is relevant not only to AVX but also provides the necessary background information to understand AVX2 and AVX-512, which are explained in later chapters.

In the discussions that follow in this and subsequent chapters, the term x86-AVX is used to describe general characteristics and computing resources of Advanced Vector Extensions. The acronyms AVX, AVX2, and AVX-512 are employed when examining attributes or instructions related to a specific x86 feature set enhancement.

AVX Overview

AMD and Intel first incorporated AVX into their CPUs starting in 2011. AVX extends the packed single-precision and double-precision floating-point capabilities of x86-SSE from 128 bits to 256 bits. Unlike general-purpose register instructions, AVX instructions use a three-operand syntax that employs non-destructive source operands, which simplifies assembly language programming considerably. Programmers can use this new instruction syntax with packed 128-bit integer, packed 128-bit floating-point, and packed 256-bit floating-point operands. The three-operand instruction syntax can also be exploited to perform scalar single-precision and double-precision floating-point arithmetic.

In 2013 Intel launched processors that included AVX2. This architectural enhancement extends the packed integer capabilities of AVX from 128 bits to 256 bits. AVX2 adds new data broadcast, blend, and permute instructions to the x86 platform. It also introduces a new vector-index addressing mode that facilitates memory loads (or gathers) of data elements from non-contiguous locations. The most recent x86-AVX extension is called AVX-512, which expands the SIMD capabilities of AVX and AVX2 from 256 bits to 512 bits. AVX-512 also adds eight new opmask registers named K0–K7 to the x86 platform. These registers facilitate conditional instruction execution and data merging operations using per-element granularity. Table 4-1 summarizes current x86-AVX technologies. In this table (and subsequent tables), the acronyms SPFP and DPFP are used to signify single-precision floating-point and double-precision floating-point, respectively.
Table 4-1. Summary of x86-AVX Technologies

| Feature | AVX | AVX2 | AVX-512 |
|---|---|---|---|
| Three-operand syntax; non-destructive source operands | Yes | Yes | Yes |
| SIMD operations using 128-bit packed integers | Yes | Yes | Yes |
| SIMD operations using 256-bit packed integers | No | Yes | Yes |
| SIMD operations using 512-bit packed integers | No | No | Yes |
| SIMD operations using 128-bit packed SPFP, DPFP | Yes | Yes | Yes |
| SIMD operations using 256-bit packed SPFP, DPFP | Yes | Yes | Yes |
| SIMD operations using 512-bit packed SPFP, DPFP | No | No | Yes |
| Scalar SPFP, DPFP arithmetic | Yes | Yes | Yes |
| Enhanced SPFP, DPFP compare operations | Yes | Yes | Yes |
| Basic SPFP, DPFP broadcast and permute | Yes | Yes | Yes |
| Enhanced SPFP, DPFP broadcast and permute | No | Yes | Yes |
| Packed integer broadcast | No | Yes | Yes |
| Enhanced packed integer broadcast, compare, permute, conversions | No | No | Yes |
| Instruction-level broadcast and rounding control | No | No | Yes |
| Fused-multiply-add | No | Yes | Yes |
| Data gather | No | Yes | Yes |
| Data scatter | No | No | Yes |
| Conditional execution and data merging using opmask registers | No | No | Yes |

It should be noted that fused-multiply-add is a distinct x86 platform feature extension that was introduced in tandem with AVX2. A program must confirm the presence of this feature extension by testing the CPUID FMA feature flag before using any of the corresponding instructions. You’ll learn how to do this in Chapter 16. The remainder of this chapter focuses primarily on AVX. Chapters 8 and 12 discuss the particulars of AVX2 and AVX-512 in greater detail.

SIMD Programming Concepts

As implied by the words of the acronym, a SIMD computing element executes the same operation on multiple data items simultaneously. Universal SIMD operations include basic arithmetic such as addition, subtraction, multiplication, and division. SIMD processing techniques can also be applied to a variety of other computational tasks including data compares, conversions, Boolean calculations, permutations, and bit shifts. Processors facilitate SIMD operations by reinterpreting the bits of an operand in a register or memory location. For example, a 128-bit wide operand can hold two independent 64-bit integer values. It is also capable of accommodating four 32-bit integers, eight 16-bit integers, or sixteen 8-bit integers, as illustrated in Figure 4-1.
Figure 4-1. 128-bit wide operand using distinct integers

Figure 4-2 exemplifies a few SIMD arithmetic operations in greater detail. In this figure, integer addition is illustrated using two 64-bit integers, four 32-bit integers, or eight 16-bit integers. Algorithmic processing is faster when multiple data items are processed at once, since the CPU can carry out the necessary operations in parallel. For example, when an instruction specifies 16-bit integer operands, the CPU performs all eight 16-bit integer additions simultaneously.
Figure 4-2. SIMD integer addition

Wraparound vs. Saturated Arithmetic

One extremely useful feature of x86-AVX technology is its support for saturated integer arithmetic. In saturated integer arithmetic, computational results are automatically clipped by the processor to prevent overflow and underflow conditions. This differs from normal wraparound integer arithmetic where an overflow or underflow result is retained (as you’ll soon see). Saturated arithmetic is handy when working with pixel values since it automatically clips values and eliminates the need to explicitly check the result of each pixel calculation for an overflow or underflow condition. X86-AVX includes instructions that perform saturated arithmetic using 8-bit and 16-bit integers, both signed and unsigned.

Let’s take a closer look at some examples of both wraparound and saturated arithmetic. Figure 4-3 shows an example of 16-bit signed integer addition using wraparound and saturated arithmetic. An overflow condition occurs if the two 16-bit signed integers are added using wraparound arithmetic. With saturated arithmetic, however, the result is clipped to the largest possible 16-bit signed integer value. Figure 4-4 illustrates a similar example using 8-bit unsigned integers. Besides addition, x86-AVX also supports saturated integer subtraction, as shown in Figure 4-5. Table 4-2 summarizes the saturated arithmetic range limits for all supported integer sizes and sign types.
Figure 4-3. 16-bit signed integer addition using wraparound and saturated arithmetic

Figure 4-4. 8-bit unsigned integer addition using wraparound and saturated arithmetic

Figure 4-5. 16-bit signed integer subtraction using wraparound and saturated arithmetic

Table 4-2. Range Limits for Saturated Arithmetic

| Integer Type | Lower Limit | Upper Limit |
|---|---|---|
| 8-bit signed | -128 (0x80) | +127 (0x7f) |
| 8-bit unsigned | 0 | +255 (0xff) |
| 16-bit signed | -32768 (0x8000) | +32767 (0x7fff) |
| 16-bit unsigned | 0 | +65535 (0xffff) |

AVX Execution Environment

In this section you’ll learn about the x86-AVX execution environment. Included are explanations of the AVX register set, its data types, and instruction syntax. As mentioned earlier, x86-AVX is an architectural enhancement that extends x86-SSE technology to support SIMD operations using either 256-bit or 128-bit wide operands. The material that’s presented in this section assumes no previous knowledge or experience with x86-SSE.

Register Set

X86-64 processors that support AVX incorporate 16 256-bit wide registers named YMM0 – YMM15. The low-order 128 bits of each YMM register are aliased to a corresponding XMM register, as illustrated in Figure 4-6. Most AVX instructions can use any of the XMM or YMM registers as SIMD operands. The XMM registers can also be employed to carry out scalar floating-point calculations using either single-precision or double-precision values similar to x86-SSE. Programmers with assembly language experience using x86-SSE need to be aware of some minor execution differences between this earlier instruction set extension and x86-AVX. These differences are explained later in this chapter.
Figure 4-6. AVX register set

The x86-AVX execution environment also includes a control-status register named MXCSR. This register contains status flags that facilitate the detection of error conditions caused by floating-point arithmetic operations. It also includes control bits that programs can use to enable or disable floating-point exceptions and specify rounding options. You’ll learn more about the MXCSR register later in this chapter.

Data Types

As previously mentioned, AVX supports SIMD operations using 256-bit and 128-bit wide packed single-precision or packed double-precision floating-point operands. A 256-bit wide YMM register or memory location can hold eight single-precision or four double-precision values, as shown in Figure 4-7. When used with a 128-bit wide XMM register or memory location, an AVX instruction can process four single-precision or two double-precision values. Like SSE and SSE2, AVX instructions use the low-order doubleword or quadword of an XMM register to carry out scalar single-precision or double-precision floating-point arithmetic.
Figure 4-7. AVX and AVX2 data types

AVX also includes instructions that use the XMM registers to perform SIMD operations using a variety of packed integer operands including bytes, words, doublewords, and quadwords. AVX2 extends the packed integer processing capabilities of AVX to the YMM registers and 256-bit wide operands in memory. Figure 4-7 also shows these data types.

Instruction Syntax

Perhaps the most noteworthy programming facet of x86-AVX is its use of a contemporary assembly language instruction syntax. Most x86-AVX instructions use a three-operand format that consists of two source operands and one destination operand. The general syntax employed by x86-AVX instructions is InstrMnemonic DesOp,SrcOp1,SrcOp2. Here, InstrMnemonic signifies the instruction mnemonic, DesOp represents the destination operand, and SrcOp1 and SrcOp2 denote the source operands. A small subset of x86-AVX instructions employs one or three source operands along with a destination operand. Nearly all x86-AVX instruction source operands are non-destructive, which means that source operands are not modified during instruction execution, except in cases where the destination operand register is the same as one of the source operand registers. The use of non-destructive source operands often results in simpler and slightly faster code since it reduces the number of register-to-register data transfers that a function must perform.

X86-AVX’s ability to support a three-operand instruction syntax is due to a new instruction-encoding prefix. The vector extension (VEX) prefix enables x86-AVX instructions to be encoded using a more efficient format than the prefixes used for x86-SSE instructions. The VEX prefix has also been used to add new general-purpose register instructions to the x86 platform. You’ll learn about these instructions in Chapter 8.

AVX Scalar Floating-Point

This section examines the scalar floating-point capabilities of AVX. It begins with a short explanation of some important floating-point concepts including data types, bit encodings, and special values. Software developers who understand these concepts are often able to improve the performance of algorithms that make heavy use of floating-point arithmetic and minimize potential floating-point errors. The AVX scalar floating-point registers are also explained in this section, including descriptions of the XMM registers and the MXCSR control-status register. The section concludes with an overview of the AVX scalar floating-point instruction set.

Floating-Point Programming Concepts

In mathematics a real-number system depicts an infinite continuum of all possible positive and negative numbers including integers, rational numbers, and irrational numbers. Given their finite resources, modern computing architectures typically employ a floating-point system to approximate a real-number system. Like many other computing platforms, the x86’s floating-point system is based on the IEEE 754 standard for binary floating-point arithmetic. This standard includes specifications that define bit encodings, range limits, and precisions for scalar floating-point values. The IEEE 754 standard also specifies important details related to floating-point arithmetic operations, rounding rules, and numerical exceptions.

The AVX instruction set supports common floating-point operations using single precision (32-bit) and double precision (64-bit) values. Many C++ compilers including Visual C++ use the x86’s intrinsic single-precision and double-precision types to implement the C++ types float and double. Figure 4-8 illustrates the memory organization of both single-precision and double-precision floating-point values. This figure also includes common integer types for comparison purposes.
Figure 4-8. Memory organization of floating-point values

The binary encoding of a floating-point value requires three distinct fields: a significand, an exponent, and a sign bit. The significand field represents a number’s significant digits (or fractional part). The exponent specifies the location of the binary “decimal” point in the significand, which determines the magnitude. The sign bit indicates whether the number is positive (s = 0) or negative (s = 1). Table 4-3 lists the various size parameters that are used to encode single-precision and double-precision floating-point values.
Table 4-3. Floating-Point Size Parameters

| Parameter | Single-Precision | Double-Precision |
|---|---|---|
| Total width | 32 bits | 64 bits |
| Significand width | 23 bits | 52 bits |
| Exponent width | 8 bits | 11 bits |
| Sign width | 1 bit | 1 bit |
| Exponent bias | +127 | +1023 |
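Putting the three fields together, a normalized value decodes according to the standard IEEE 754 rule (a restatement using s for the sign bit, f for the significand field, and e for the biased exponent):

```latex
v = (-1)^{s} \times 1.f \times 2^{\,e - \mathrm{bias}}
```

For single-precision values the bias is +127, so a stored exponent of 10000110b (134) represents a true exponent of +7.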

Figure 4-9 illustrates how to convert a decimal number into an x86-compatible floating-point encoded value. In this example, the number 237.8125 is transformed from a decimal number to its single-precision floating-point encoding. The process starts by converting the number from base 10 to base 2. Next, the base 2 value is transformed to a binary scientific number. The value to the right of the E2 symbol is the binary exponent. A properly encoded floating-point value uses a biased exponent instead of the true exponent since this expedites floating-point compare operations. For a single-precision floating-point number, the bias value is +127. Adding the exponent bias value to the true exponent generates a binary scientific number with a biased exponent. In the example shown in Figure 4-9, adding 111b (+7) to 1111111b (+127) yields a binary scientific number with a biased exponent of 10000110b (+134).
Figure 4-9. Single-precision floating-point encoding process

When encoding a single-precision or double-precision floating-point value, the leading 1 digit of the significand is implied and not included in the final binary representation. Dropping the leading 1 digit forms a normalized significand. The three fields required for an IEEE 754 compliant encoding are now available, as shown in Table 4-4. Reading the bit fields in this table from left to right yields the 32-bit value 0x436DD000, which is the final single-precision floating-point encoding of 237.8125.
Table 4-4. Bit Fields for IEEE 754 Compliant Encoding of 237.8125

| Sign | Biased Exponent | Normalized Significand |
|---|---|---|
| 0 | 10000110 | 11011011101000000000000 |

The IEEE 754 floating-point encoding scheme reserves a small set of bit patterns for special values that are used to handle certain processing conditions. The first group of special values includes denormalized numbers (or denormals). As shown in the earlier encoding example, the standard encoding of a floating-point number assumes that the leading digit of the significand is always a 1. One limitation of the IEEE 754 floating-point encoding scheme is its inability to accurately represent numbers very close to zero. In these cases, values get encoded using a non-normalized format, which enables tiny numbers close to zero (both positive and negative) to be encoded using less precision. Denormals rarely occur, but when they do, the CPU can still process them. In algorithms where the use of a denormal is problematic, a function can test a floating-point value to ascertain its denormal state, or the processor can be configured to generate an underflow or denormal exception.

Another application of special values involves the encodings that are used for floating-point zero. The IEEE 754 standard supports two different representations of floating-point zero: positive zero (+0.0) and negative zero (–0.0). A negative zero can be generated either algorithmically or as a side effect of the floating-point rounding mode. Computationally, the processor treats positive and negative zero the same, and the programmer typically does not need to distinguish between them.

The IEEE 754 encoding scheme also supports positive and negative representations of infinity. Infinities are produced by certain numerical algorithms, overflow conditions, or division by zero. As discussed later in this chapter, the processor can be configured to generate an exception whenever a floating-point overflow occurs or if a program attempts to divide a number by zero.

The final special value type is called Not a Number (NaN). NaNs are floating-point encodings that represent invalid numbers. The IEEE 754 standard defines two types of NaNs: signaling NaN (SNaN) and quiet NaN (QNaN). SNaNs are created by software; an x86-64 CPU will not create a SNaN during any arithmetic operation. Any attempt by an instruction to use a SNaN will cause an invalid operation exception, unless the exception is masked. SNaNs are useful for testing exception handlers. They can also be exploited by an application program for proprietary numerical-processing purposes. An x86 CPU uses QNaNs as a default response to certain invalid arithmetic operations whose exceptions are masked. For example, one unique encoding of a QNaN, called an indefinite, is substituted for a result whenever a function uses one of the scalar square root instructions with a negative value. QNaNs also can be used by programs to signify algorithm-specific errors or other unusual numerical conditions. When QNaNs are used as operands, they enable continued processing without generating an exception.

When developing software that performs floating-point calculations, it is important to keep in mind that the employed encoding scheme is simply an approximation of a real-number system. It is impossible for any floating-point encoding system to represent an infinite number of values using a finite number of bits. This leads to floating-point rounding errors that can affect the accuracy of a calculation. Also, some mathematical properties that hold true for integers and real numbers are not necessarily true for floating-point numbers. For example, floating-point multiplication is not necessarily associative; (a * b) * c may not equal a * (b * c) depending on the values of a, b, and c. Developers of algorithms that require high levels of floating-point accuracy must be aware of these issues. Appendix A contains a list of references that explain this and other potential pitfalls of floating-point arithmetic in greater detail. Chapter 9 also includes a source code example that exemplifies floating-point non-associativity.

Scalar Floating-Point Register Set

As previously shown in Figure 4-6, all x86-64 compatible processors include 16 128-bit registers named XMM0 – XMM15. A program can use any of the XMM registers to perform scalar floating-point operations including common arithmetic calculations, data transfers, comparisons, and type conversions. The CPU uses the low-order 32 bits of an XMM register to carry out single-precision floating-point calculations. Double-precision floating-point operations employ the low-order 64 bits. Figure 4-10 illustrates these register locations in greater detail. Programs cannot use the high-order bits of an XMM register to perform scalar floating-point calculations. However, when used as a destination operand, the values of these bits might be modified during the execution of an AVX scalar floating-point instruction as explained later in this section.
Figure 4-10. Scalar floating-point values when loaded in an XMM register

Control-Status Register

In addition to the XMM registers, x86-64 processors include a 32-bit control-status register named MXCSR. This register contains a series of control flags that enable a program to specify options for floating-point calculations and exceptions. It also includes a set of status flags that can be tested to detect floating-point error conditions. Figure 4-11 shows the organization of the bits in MXCSR; Table 4-5 describes the purpose of each bit field.
Figure 4-11. MXCSR control and status register

Table 4-5. Description of MXCSR Register Bit Fields

| Bit | Field Name | Description |
|---|---|---|
| IE | Invalid operation flag | Floating-point invalid operation error flag. |
| DE | Denormal flag | Floating-point denormal error flag. |
| ZE | Divide-by-zero flag | Floating-point division-by-zero error flag. |
| OE | Overflow flag | Floating-point overflow error flag. |
| UE | Underflow flag | Floating-point underflow error flag. |
| PE | Precision flag | Floating-point precision error flag. |
| DAZ | Denormals are zero | When set to 1, forcibly converts a denormal source operand to zero prior to its use in a calculation. |
| IM | Invalid operation mask | Floating-point invalid operation error exception mask. |
| DM | Denormal mask | Floating-point denormal error exception mask. |
| ZM | Divide-by-zero mask | Floating-point divide-by-zero error exception mask. |
| OM | Overflow mask | Floating-point overflow error exception mask. |
| UM | Underflow mask | Floating-point underflow error exception mask. |
| PM | Precision mask | Floating-point precision error exception mask. |
| RC | Rounding control | Specifies the method for rounding floating-point results. Valid options include round to nearest (00b), round down toward -∞ (01b), round up toward +∞ (10b), and round toward zero or truncate (11b). |
| FTZ | Flush to zero | When set to 1, forces a zero result if the underflow exception is masked and a floating-point underflow error occurs. |

An application program can modify any of the MXCSR’s control flags or status bits to accommodate its specific SIMD floating-point processing requirements. Any attempt to write a non-zero value to a reserved bit position will cause the processor to generate an exception. The processor sets an MXCSR error flag to 1 following the occurrence of an error condition. MXCSR error flags are not automatically cleared by the processor after an error is detected; they must be manually reset. The control flags and status bits of the MXCSR register can be modified using the vldmxcsr (Load MXCSR Register) instruction. Setting a mask bit to 1 disables the corresponding exception. The vstmxcsr (Store MXCSR Register) instruction can be used to save the current MXCSR state. An application program cannot directly access the internal processor tables that specify floating-point exception handlers. However, most C++ compilers provide a library function that allows an application program to designate a callback function that gets invoked whenever a floating-point exception occurs.

The MXCSR includes two control flags that can be used to speed up certain floating-point calculations. Setting the MXCSR.DAZ control flag to 1 can improve the performance of algorithms where the rounding of a denormal value to zero is acceptable. Similarly, the MXCSR.FTZ control flag can be used to accelerate computations where floating-point underflows are common. The downside of enabling either of these options is non-compliance with the IEEE 754 floating-point standard.

Instruction Set Overview

Table 4-6 lists in alphabetical order commonly used AVX scalar floating-point instructions. In this table, mnemonic text [d|s] signifies that an instruction can be used with either double-precision floating-point or single-precision floating-point operands. You’ll learn how to use many of these instructions in Chapter 5.
Table 4-6. Overview of Commonly Used AVX Scalar Floating-Point Instructions

vadds[d|s] - Scalar floating-point addition
vbroadcasts[d|s] - Broadcast scalar floating-point value
vcmps[d|s] - Scalar floating-point compare
vcomis[d|s] - Ordered scalar floating-point compare and set RFLAGS
vcvts[d|s]2si - Convert scalar floating-point to doubleword signed integer
vcvtsd2ss - Convert scalar DPFP to scalar SPFP
vcvtsi2s[d|s] - Convert signed doubleword integer to scalar floating-point
vcvtss2sd - Convert scalar SPFP to scalar DPFP
vcvtts[d|s]2si - Convert with truncation scalar floating-point to signed integer
vdivs[d|s] - Scalar floating-point division
vmaxs[d|s] - Scalar floating-point maximum
vmins[d|s] - Scalar floating-point minimum
vmovs[d|s] - Move scalar floating-point value
vmuls[d|s] - Scalar floating-point multiplication
vrounds[d|s] - Round scalar floating-point value
vsqrts[d|s] - Scalar floating-point square root
vsubs[d|s] - Scalar floating-point subtraction
vucomis[d|s] - Unordered scalar floating-point compare and set RFLAGS

Table 4-7 illustrates the operation of the AVX scalar floating-point instructions vadds[d|s] and vsqrts[d|s]. In these examples, the colon notation designates bit position ranges within a register (e.g., 31:0 designates bit positions 31 through 0 inclusive). Note that execution of an AVX scalar floating-point instruction also copies the unused bits of the first source operand to the destination operand. Also note that the upper 128 bits of the corresponding YMM register are set to zero.
Table 4-7. AVX Scalar Floating-Point Instruction Examples

vaddss xmm0,xmm1,xmm2
  xmm0[31:0] = xmm1[31:0] + xmm2[31:0]
  xmm0[127:32] = xmm1[127:32]
  ymm0[255:128] = 0

vaddsd xmm0,xmm1,xmm2
  xmm0[63:0] = xmm1[63:0] + xmm2[63:0]
  xmm0[127:64] = xmm1[127:64]
  ymm0[255:128] = 0

vsqrtss xmm0,xmm1,xmm2
  xmm0[31:0] = sqrt(xmm2[31:0])
  xmm0[127:32] = xmm1[127:32]
  ymm0[255:128] = 0

vsqrtsd xmm0,xmm1,xmm2
  xmm0[63:0] = sqrt(xmm2[63:0])
  xmm0[127:64] = xmm1[127:64]
  ymm0[255:128] = 0

AVX Packed Floating-Point

AVX supports packed floating-point operations using either 128-bit wide or 256-bit wide operands. Figures 4-12 and 4-13 illustrate common packed floating-point arithmetic operations using 256-bit wide operands with single-precision and double-precision elements. Similar to AVX scalar floating-point, rounding for AVX packed floating-point arithmetic operations is specified by the MXCSR’s rounding control field, as defined in Table 4-5. The processor also uses the MXCSR’s status flags to signal the occurrence of a packed floating-point error condition.
Figure 4-12. AVX packed single-precision floating-point addition

Figure 4-13. AVX packed double-precision floating-point multiplication

Most AVX arithmetic instructions perform their operations using the corresponding element positions of the two source operands. AVX also supports horizontal arithmetic operations using either packed floating-point or packed integer operands. A horizontal arithmetic operation carries out its computations using the adjacent elements of a packed data type. Figure 4-14 illustrates horizontal addition using single-precision floating-point and horizontal subtraction using double-precision floating-point operands. The AVX instruction set also supports integer horizontal addition and subtraction using packed words and doublewords. Horizontal operations are typically used to reduce a packed data operand that contains multiple intermediate values to a single final result.
Figure 4-14. AVX horizontal addition and subtraction using single-precision and double-precision elements

Instruction Set Overview

Table 4-8 lists in alphabetical order commonly used AVX packed floating-point instructions. Similar to the scalar floating-point table that you saw in the previous section, the mnemonic text [d|s] signifies that an instruction can be used with either packed double-precision floating-point or packed single-precision floating-point operands. You’ll learn how to use many of these instructions in Chapter 6.
Table 4-8. Overview of Commonly-Used AVX Packed Floating-Point Instructions

vaddp[d|s] - Packed floating-point addition
vaddsubp[d|s] - Packed floating-point add-subtract
vandp[d|s] - Packed floating-point bitwise AND
vandnp[d|s] - Packed floating-point bitwise AND NOT
vblendp[d|s] - Packed floating-point blend
vblendvp[d|s] - Variable packed floating-point blend
vcmpp[d|s] - Packed floating-point compare
vcvtdq2p[d|s] - Convert packed signed doubleword integers to floating-point
vcvtp[d|s]2dq - Convert packed floating-point to signed doublewords
vcvtpd2ps - Convert packed DPFP to packed SPFP
vcvtps2pd - Convert packed SPFP to packed DPFP
vdivp[d|s] - Packed floating-point division
vdpp[d|s] - Packed dot product
vhaddp[d|s] - Horizontal packed floating-point addition
vhsubp[d|s] - Horizontal packed floating-point subtraction
vmaskmovp[d|s] - Packed floating-point conditional load and store
vmaxp[d|s] - Packed floating-point maximum
vminp[d|s] - Packed floating-point minimum
vmovap[d|s] - Move aligned packed floating-point values
vmovmskp[d|s] - Extract packed floating-point sign bitmask
vmovup[d|s] - Move unaligned packed floating-point values
vmulp[d|s] - Packed floating-point multiplication
vorp[d|s] - Packed floating-point bitwise inclusive OR
vpermilp[d|s] - Permute in-lane packed floating-point elements
vroundp[d|s] - Round packed floating-point values
vshufp[d|s] - Shuffle packed floating-point values
vsqrtp[d|s] - Packed floating-point square root
vsubp[d|s] - Packed floating-point subtraction
vunpckhp[d|s] - Unpack and interleave high packed floating-point values
vunpcklp[d|s] - Unpack and interleave low packed floating-point values
vxorp[d|s] - Packed floating-point bitwise exclusive OR

AVX Packed Integer

AVX supports packed integer operations using 128-bit wide operands. A 128-bit wide operand facilitates packed integer operations using two quadwords, four doublewords, eight words, or sixteen bytes, as shown in Figure 4-15. In this figure, the vpaddb (Add Packed Integers) instruction illustrates packed 8-bit integer addition. The vpmaxsw (Packed Signed Integer Maximums) instruction saves the maximum signed word value of each element pair to the specified destination operand. The vpmulld (Multiply Packed Integers and Store Low Result) instruction carries out packed signed doubleword multiplication and saves the low-order 32 bits of each result. Finally, the vpsllq (Shift Packed Data Left Logical) instruction performs a logical left shift of each quadword element using the bit count that's specified by the immediate operand.
../images/326959_2_En_4_Chapter/326959_2_En_4_Fig15_HTML.jpg
Figure 4-15.

Example AVX packed integer operations
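The Figure 4-15 operations can be sketched as the following MASM-style fragment. This is an illustrative sketch, not a complete function; the register contents are assumed to have been loaded beforehand.

```asm
; Packed integer operations from Figure 4-15 (register contents assumed)
        vpaddb  xmm0,xmm1,xmm2      ; 16 byte-wise adds: xmm0[i] = xmm1[i] + xmm2[i]
        vpmaxsw xmm0,xmm1,xmm2      ; 8 signed word maximums: xmm0[i] = max(xmm1[i],xmm2[i])
        vpmulld xmm0,xmm1,xmm2      ; 4 doubleword multiplies, low 32 bits of each product saved
        vpsllq  xmm0,xmm1,4         ; shift each quadword in xmm1 left by 4 bits
```

Note that each instruction uses the three-operand AVX syntax: the two source operands (xmm1 and xmm2) are not modified.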

Most AVX packed integer instructions do not update the status flags in the RFLAGS register. This means that error conditions such as arithmetic overflow and underflow are not reported. It also means that the results of a packed integer operation do not directly affect execution of the conditional instructions cmovcc, jcc, and setcc. However, programs can employ SIMD-specific techniques to make logical decisions based on the outcome of a packed integer operation. You'll see examples of these techniques in Chapter 7.
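One such technique is to use a packed compare to build an element-wise mask and then combine the sources with bitwise operations instead of branching. The following sketch (register contents assumed) computes a packed signed doubleword maximum this way:

```asm
; Branchless per-element maximum using a compare mask (registers assumed)
        vpcmpgtd xmm3,xmm0,xmm1     ; mask: all 1s where xmm0[i] > xmm1[i], else all 0s
        vpand    xmm4,xmm0,xmm3     ; keep xmm0 elements where the mask is set
        vpandn   xmm5,xmm3,xmm1     ; keep xmm1 elements where the mask is clear
        vpor     xmm0,xmm4,xmm5     ; xmm0[i] = max(xmm0[i],xmm1[i]) per doubleword
```

No conditional jump is needed; the decision is encoded entirely in the mask produced by vpcmpgtd. (In practice, vpmaxsd performs this particular operation directly; the sequence above illustrates the general mask-and-merge pattern.)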

Instruction Set Overview

Table 4-9 lists, in alphabetical order, commonly used AVX packed integer instructions. In this table, the mnemonic text [b|w|d|q] signifies the size (byte, word, doubleword, or quadword) of the elements that are processed. You'll learn how to use many of these instructions in Chapter 7.
Table 4-9. Overview of Commonly Used AVX Packed Integer Instructions

Instruction         Description
vmov[d|q]           Move to/from XMM register
vmovdqa             Move aligned packed integer values
vmovdqu             Move unaligned packed integer values
vpabs[b|w|d]        Packed integer absolute value
vpackss[dw|wb]      Pack with signed saturation
vpackus[dw|wb]      Pack with unsigned saturation
vpadd[b|w|d|q]      Packed integer addition
vpadds[b|w]         Packed integer addition with signed saturation
vpaddus[b|w]        Packed integer addition with unsigned saturation
vpand               Packed bitwise AND
vpandn              Packed bitwise AND NOT
vpcmpeq[b|w|d|q]    Packed integer compare for equality
vpcmpgt[b|w|d|q]    Packed signed integer compare for greater than
vpextr[b|w|d|q]     Extract integer from XMM register
vphadd[w|d]         Horizontal packed addition
vphsub[w|d]         Horizontal packed subtraction
vpinsr[b|w|d|q]     Insert integer into XMM register
vpmaxs[b|w|d]       Packed signed integer maximum
vpmaxu[b|w|d]       Packed unsigned integer maximum
vpmins[b|w|d]       Packed signed integer minimum
vpminu[b|w|d]       Packed unsigned integer minimum
vpmovsx             Packed integer move with sign extend
vpmovzx             Packed integer move with zero extend
vpmuldq             Packed signed doubleword multiplication
vpmulhuw            Packed unsigned word multiplication, save high result
vpmul[h|l]w         Packed signed word multiplication, save [high|low] result
vpmull[d|w]         Packed signed multiplication, save low result
vpmuludq            Packed unsigned doubleword multiplication
vpshuf[b|d]         Shuffle packed integers
vpshuf[h|l]w        Shuffle [high|low] packed words
vpslldq             Shift logical left double quadword
vpsll[w|d|q]        Packed logical shift left
vpsra[w|d]          Packed arithmetic shift right
vpsrldq             Shift logical right double quadword
vpsrl[w|d|q]        Packed logical shift right
vpsub[b|w|d|q]      Packed integer subtraction
vpsubs[b|w]         Packed integer subtraction with signed saturation
vpsubus[b|w]        Packed integer subtraction with unsigned saturation
vpunpckh[bw|wd|dq]  Unpack high data
vpunpckl[bw|wd|dq]  Unpack low data

Differences Between x86-AVX and x86-SSE

If you have any previous experience with x86-SSE assembly language programming, you have undoubtedly noticed that a high degree of symmetry exists between this execution environment and x86-AVX. Most x86-SSE instructions have an x86-AVX equivalent that can use either 256-bit or 128-bit wide operands. There are, however, a few important differences between the x86-SSE and x86-AVX execution environments. The remainder of this section explains these differences. Even if you don’t have any previous experience with x86-SSE, I still recommend reading this section since it elucidates important details that you need to be aware of when writing code that uses the x86-AVX instruction set.

Within an x86-64 processor that supports x86-AVX, each 256-bit YMM register is partitioned into an upper and lower 128-bit lane. Many x86-AVX instructions carry out their operations using same-lane source and destination operand elements. This independent lane execution tends to be inconspicuous when using x86-AVX instructions that perform arithmetic calculations. However, when using instructions that re-order the data elements of a packed quantity, the effect of separate execution lanes is more evident. For example, the vshufps (Packed Interleave Shuffle of Single-Precision Values) instruction rearranges the elements of its source operands according to a control mask that’s specified as an immediate operand. The vpunpcklwd (Unpack Low Data) instruction interleaves the low-order elements in its two source operands. Figure 4-16 illustrates the in-lane effect of these instructions in greater detail. Note that the floating-point shuffle and unpack operations are carried out independently in both the upper (bits 255:128) and lower (bits 127:0) double quadwords. You’ll learn more about the vshufps and vpunpcklwd instructions in Chapters 6 and 7.
../images/326959_2_En_4_Chapter/326959_2_En_4_Fig16_HTML.jpg
Figure 4-16.

Examples of x86-AVX instruction execution using independent lanes
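The in-lane behavior described above can be sketched as follows. This is an illustrative fragment with assumed register contents; each 2-bit field of the vshufps immediate selects a source element, but only from within the same 128-bit lane.

```asm
; In-lane element reordering (register contents assumed)
        vshufps    ymm0,ymm1,ymm2,44h   ; per lane: ymm0 = [ymm1[0],ymm1[1],ymm2[0],ymm2[1]]
                                        ; low lane built only from bits 127:0 of the sources,
                                        ; high lane built only from bits 255:128
        vpunpcklwd xmm3,xmm4,xmm5       ; interleave the four low-order words of xmm4 and xmm5
```

Because the same immediate is applied to both lanes of a 256-bit vshufps, no element can cross from one 128-bit lane to the other.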

The aliasing of the XMM and YMM register sets introduces a few programming issues that software developers need to keep in mind. The first issue relates to the processor’s handling of a YMM register’s high-order 128 bits when the corresponding XMM register is used as a destination operand. When executing on a processor that supports x86-AVX technology, an x86-SSE instruction that uses an XMM register as a destination operand will never modify the upper 128 bits of the corresponding YMM register. However, the equivalent x86-AVX instruction will zero the upper 128 bits of the respective YMM register. Consider, for example, the following instances of the (v)cvtps2pd (Convert Packed Single-Precision to Packed Double-Precision) instruction:
cvtps2pd xmm0,xmm1
vcvtps2pd xmm0,xmm1
vcvtps2pd ymm0,ymm1

The x86-SSE cvtps2pd instruction converts the two packed single-precision floating-point values in the low-order quadword of XMM1 to double-precision floating-point and saves the result in register XMM0. This instruction does not modify the high-order 128 bits of register YMM0. The first vcvtps2pd instruction performs the same packed single-precision to packed double-precision conversion operation; it also zeros the high-order 128 bits of YMM0. The second vcvtps2pd instruction converts the four packed single-precision floating-point values in the low-order 128 bits of YMM1 to packed double-precision floating-point values and saves the result to YMM0.

X86-AVX relaxes the alignment requirements of x86-SSE for packed operands in memory. Except for instructions that explicitly specify an aligned operand (e.g., vmovaps, vmovdqa, etc.), proper alignment of a 128-bit or 256-bit wide operand in memory is not mandatory. However, 128-bit and 256-bit wide operands should always be properly aligned whenever possible in order to prevent processing delays that can occur when the processor accesses unaligned operands in memory.
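The difference between the relaxed and explicit alignment requirements can be sketched as follows (the addresses in rcx, rdx, and r8 are assumed):

```asm
; Aligned vs. unaligned AVX memory access (addresses assumed)
        vmovups ymm0,ymmword ptr [rcx]      ; unaligned load, valid for any address
        vmovaps ymm1,ymmword ptr [rdx]      ; faults unless rdx is 32-byte aligned
        vaddps  ymm2,ymm0,ymmword ptr [r8]  ; AVX memory operands need not be aligned
```

Note that the memory operand of vaddps is not required to be aligned, unlike its x86-SSE counterpart addps, which mandates a 16-byte aligned operand.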

The last issue that programmers need to be aware of involves the intermixing of x86-AVX and x86-SSE code. Programs are allowed to intermix x86-AVX and x86-SSE instructions, but any intermixing should be kept to a minimum in order to avoid internal processor state transition penalties that can affect performance. These penalties can occur if the processor is required to preserve the upper 128 bits of each YMM register during a transition from executing x86-AVX to executing x86-SSE instructions. State transition penalties can be completely avoided by using the vzeroupper (Zero Upper Bits of YMM Registers) instruction, which zeroes the upper 128 bits of all YMM registers. This instruction should be used prior to any transition from 256-bit x86-AVX code (i.e., any x86-AVX code that uses a YMM register) to x86-SSE code.

A common use of the vzeroupper instruction is in public functions that employ 256-bit x86-AVX instructions. These functions should include a vzeroupper instruction prior to the execution of any ret instruction since this prevents processor state transition penalties from occurring in any high-level language code that uses x86-SSE instructions. The vzeroupper instruction should also be employed before calling any library functions that might contain x86-SSE code. Later in this book, you'll see several source code examples that demonstrate proper use of the vzeroupper instruction. Functions can also use the vzeroall (Zero All YMM Registers) instruction instead of vzeroupper to avoid potential x86-AVX/x86-SSE state transition penalties.
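A typical epilogue for such a function can be sketched as follows. The function name is hypothetical; the point is the placement of vzeroupper immediately before ret.

```asm
; Hypothetical public function that uses 256-bit x86-AVX instructions
Avx256Calc proc
        ; ... 256-bit AVX computation using ymm0-ymm5 ...
        vzeroupper                  ; zero bits 255:128 of all YMM registers
        ret                         ; caller's x86-SSE code incurs no transition penalty
Avx256Calc endp
```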

Summary

Here are the key learning points for Chapter 4:
  • AVX technology is an x86 platform architectural enhancement that facilitates SIMD operations using 128-bit and 256-bit wide packed floating-point operands, both single-precision and double-precision.

  • AVX also supports SIMD operations using 128-bit wide packed integer and scalar floating-point operands. AVX2 extends the AVX instruction set to support SIMD operations using 256-bit wide packed integer operands.

  • AVX adds 16 YMM (256-bit) and XMM (128-bit) registers to the x86-64 platform. Each XMM register is aliased with the low-order 128 bits of its corresponding YMM register.

  • Most AVX instructions use a three-operand syntax that includes two non-destructive source operands.

  • AVX floating-point operations conform to the IEEE 754 standard for floating-point arithmetic.

  • Programs can use the control and status flags in the MXCSR register to enable floating-point exceptions, detect floating-point error conditions, and configure floating-point rounding.

  • Except for instructions that explicitly specify aligned operands, 128-bit and 256-bit wide operands in memory need not be properly aligned. However, SIMD operands in memory should always be properly aligned whenever possible to avoid delays that can occur when the processor accesses an unaligned operand in memory.

  • A vzeroupper or vzeroall instruction should be used in any function that uses a YMM register as an operand in order to avoid potential x86-AVX to x86-SSE state transition performance penalties.
