Chapter 10

Advanced SIMD instructions

Abstract

This chapter covers the Advanced SIMD instructions provided by AArch64. The sine function is used again as an example to show how these instructions can be used to provide increased performance.

Keywords

SIMD; Advanced SIMD load and store instructions; Advanced SIMD data movement instructions; Advanced SIMD bitwise logical operations; Advanced SIMD basic arithmetic instructions; Advanced SIMD multiplication; Advanced SIMD division; Advanced SIMD shift instructions; Advanced SIMD comparison operations

In addition to the FP/NEON instructions described in the previous chapter, AArch64 also supports Advanced SIMD instructions, which allow the programmer to treat the FP/NEON registers as vectors (arrays) of data. Advanced SIMD uses the same set of registers described in Chapter 9, but adds new views to provide the ability to access the registers in more ways. Advanced SIMD adds about 125 instructions and pseudo-instructions to support not only floating point, but also integer and fixed point.

A single Advanced SIMD instruction can operate on up to 128 bits, which may represent multiple integer, fixed point, or floating point numbers. For example, if two of the 128-bit registers each contain eight 16-bit integers, then a single Advanced SIMD instruction can add all eight integers from one register to the corresponding integers in the other register, resulting in eight simultaneous additions. For certain applications, this vector architecture can result in extremely fast and efficient implementations. Advanced SIMD is particularly useful for handling streaming video and audio, but can also give very good performance on floating point intensive tasks.
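The eight simultaneous additions described above can be modeled in a few lines of Python. This is only an illustrative sketch, not AArch64 code: a 128-bit register is represented as a Python integer, and the helper names (unpack, pack, vector_add_8h) are hypothetical. The point it shows is that each 16-bit lane wraps independently, so a carry out of one lane never reaches its neighbor.

```python
# Illustrative model only: a 128-bit register is represented as a Python int.
LANE_BITS = 16
LANES = 128 // LANE_BITS            # eight 16-bit lanes
MASK = (1 << LANE_BITS) - 1

def unpack(reg):
    """Split a 128-bit value into eight 16-bit lanes, lane 0 first."""
    return [(reg >> (i * LANE_BITS)) & MASK for i in range(LANES)]

def pack(lanes):
    """Combine eight 16-bit lanes back into one 128-bit value."""
    reg = 0
    for i, lane in enumerate(lanes):
        reg |= (lane & MASK) << (i * LANE_BITS)
    return reg

def vector_add_8h(vn, vm):
    """Add corresponding 16-bit lanes; each lane wraps modulo 2**16."""
    return pack([(a + b) & MASK for a, b in zip(unpack(vn), unpack(vm))])

v0 = pack([1, 2, 3, 4, 5, 6, 7, 0xFFFF])
v1 = pack([10, 20, 30, 40, 50, 60, 70, 1])
result = unpack(vector_add_8h(v0, v1))
```

In this model the last lane overflows to zero without disturbing the other seven sums.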

The 32 FP/NEON/Advanced SIMD registers, originally introduced in Chapter 9, can be accessed using various views. Some SIMD instructions use the byte, half-word, word, and double-word views from Chapter 9, but most of them use the Advanced SIMD views. Fig. 10.1 shows the different ways of viewing an Advanced SIMD register. Each register can be viewed as containing a vector of 2, 4, 8, or 16 elements, all of the same size and type. Individual elements of each vector can also be accessed by some instructions. The scalar register names and views introduced in Chapter 9 are also used by some instructions. A scalar can be 8 bits, 16 bits, 32 bits, or 64 bits. The instruction syntax is extended to refer to elements of a vector register by using an index, Image. For example Image is element Image in register Image, where Image is treated as a vector of four single-word (32-bit) elements.

Image
Figure 10.1 All of the possible views for register Vn. Valid names depend on which instruction is being used.

10.1 Instruction syntax

The syntax of Advanced SIMD instructions can be described using an extension of the notation used throughout this book. Each instruction operates on certain types of register(s), and there are many registers. Advanced SIMD instruction syntax may use any of the following register definitions:

Xy Refers to a 64-bit AArch64 integer register.

Wy Refers to a 32-bit AArch64 integer register.

By Refers to the lower 8 bits, or byte, of an Advanced SIMD register.

Hy Refers to the lower 16 bits, or half-word, of an Advanced SIMD register.

Sy Refers to the lower 32 bits, or single-word, of an Advanced SIMD register.

Dy Refers to the lower 64 bits, or double-word, of an Advanced SIMD register.

Fy Is used to indicate either a single-word or double-word FP/NEON register. F must be either Image for a single-word register or Image for a double-word register.

Vy A 128-bit Advanced SIMD register. Image can be any valid register number.

Vy.T A 128-bit Advanced SIMD register, treated as a vector of elements of type T, where T may be one of:

b A vector of 16 bytes.

h A vector of 8 half-words.

s A vector of 4 words.

d A vector of 2 double-words.

Some instructions allow only a subset of these types.

Vy.nT A 128-bit Advanced SIMD register, or the lower 64 bits of an Advanced SIMD register, treated as a vector of n elements of type T, where nT may be one of:

16b A 128-bit Advanced SIMD register, treated as a vector of sixteen bytes.

8b The lower 64 bits of an Advanced SIMD register, treated as a vector of eight bytes.

8h A 128-bit Advanced SIMD register, treated as a vector of eight half-words.

4h The lower 64 bits of an Advanced SIMD register, treated as a vector of four half-words.

4s A 128-bit Advanced SIMD register, treated as a vector of four words.

2s The lower 64 bits of an Advanced SIMD register, treated as a vector of two words.

2d A 128-bit Advanced SIMD register, treated as a vector of two double-words.

Some instructions allow only a subset of these types.

Vy.nT[x] Element x of an Advanced SIMD register, treated as a vector of type Image.

Each instruction has its own set of restrictions on legal values for the registers and types used. For example, one possible form of the Image instruction is:

Image

which indicates that a 32-bit AArch64 integer register is used as the source operand, and any element of any Advanced SIMD register may be used as the destination. However, the instruction further requires that Image must be either Image or Image, in order to match the size of Image.

Instructions may have several forms. In those cases, the following syntax is used to specify possible forms:

{opt} Braces around a string indicate that the string is optional. For example, several operations have an optional r which indicates that the result is rounded instead of truncated.

(s|u) Parentheses indicate a choice between two or more possible characters or strings, separated by the pipe “|” character. For example, (s|u)shr would describe two forms for the shr instruction: sshr and ushr.

<Tn> A string inside the < and > symbols indicates a choice or special syntax that is too complex to be easily described using the parenthesis and pipe (a|b) syntax, and is described in the following text. It is also used to define a syntactical token when simply using a character would lead to confusion.

The following function definitions are used in describing the effects of many of the instructions:

⌊x⌋ The floor function maps a real number, x, to the largest integer that is less than or equal to x.

Image The saturate function limits the value of x to the highest or lowest value that can be stored in the destination register. Saturation is a method used to prevent overflow.

xImage The round function maps a real number, x, to the nearest integer.

xImage The narrow function reduces a 2n bit number to an n bit number, by taking the n least significant bits.

xImage The extend function converts an n bit number to a 2n bit number, performing zero extension if the number is unsigned, or sign extension if the number is signed.
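The helper functions defined above can be written out as a short Python sketch. The function names and the signed/unsigned flag are choices made for this illustration. Values are handled as bit patterns held in non-negative Python integers, except in saturate, which works on ordinary signed or unsigned values.

```python
import math

def floor(x):
    """Largest integer less than or equal to x."""
    return math.floor(x)

def saturate(x, bits, signed=True):
    """Clamp x to the range representable in the given number of bits."""
    if signed:
        lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    else:
        lo, hi = 0, (1 << bits) - 1
    return max(lo, min(hi, x))

def narrow(x, n):
    """Reduce a 2n-bit pattern to n bits by keeping the n least significant bits."""
    return x & ((1 << n) - 1)

def extend(x, n, signed=True):
    """Widen an n-bit pattern to 2n bits using sign or zero extension."""
    x &= (1 << n) - 1
    if signed and x & (1 << (n - 1)):
        x |= ((1 << n) - 1) << n      # copy the sign bit into the upper half
    return x
```

For example, saturating 300 to a signed byte gives 127, while narrowing the 16-bit pattern 0x1234 to eight bits simply discards the upper byte and gives 0x34.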

10.2 Load and store instructions

These instructions can be used to perform interleaving of data when structured data is loaded or stored. The data should be properly aligned for best performance. These instructions are very useful for common multimedia data types.

For example, image data is typically stored in arrays of pixels, where each pixel is a small data structure such as the Image struct shown in Listing 5.38. Since each pixel is three bytes, and a Image register is 8 bytes, loading a single pixel into one register would be inefficient. It would be much better to load multiple pixels at once, but a whole number of three-byte pixels will not exactly fill a register. It takes three doubleword or quadword registers to hold a whole number of pixels without wasting space, as shown in Fig. 10.2. This is the way data would be loaded using an Advanced SIMD Image instruction. Many image processing operations work best if each color “channel” is processed separately. The SIMD load and store instructions can be used to split the image data into color channels, where each channel is stored in a different register, as shown in Fig. 10.3.

Image
Figure 10.2 Pixel data interleaved in three doubleword registers.
Image
Figure 10.3 Pixel data de-interleaved in three doubleword registers.

Other examples of interleaved data include stereo audio, which is two interleaved channels, and surround sound, which may have up to nine interleaved channels. In all of these cases, most processing operations are simplified when the data is separated into non-interleaved channels.
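The de-interleaving performed by a three-register structure load, and the matching interleave performed by the store, can be sketched in Python as follows. This is a model of the data movement only. The function names are hypothetical, and the channel names red, green, and blue are assumed for illustration, since the pixel struct itself is not reproduced here.

```python
def ld3_deinterleave(memory, lanes=8):
    """Model of a three-register structure load.

    Byte i of pixel k ends up in register i, lane k.
    """
    chunk = memory[:3 * lanes]
    return [list(chunk[i::3]) for i in range(3)]

def st3_interleave(regs):
    """Model of the matching store: lane k of every register forms pixel k."""
    out = []
    for elems in zip(*regs):
        out.extend(elems)
    return out

# Eight three-byte pixels laid out consecutively in memory.
pixels = [c for k in range(8) for c in (k, 100 + k, 200 + k)]
red, green, blue = ld3_deinterleave(pixels)
restored = st3_interleave([red, green, blue])
```

After the load, each list holds one channel, so an operation such as brightening only the red channel touches a single register.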

10.2.1 Load or store single structure using one lane

These instructions are used to load and store structured data across multiple registers:

ld<n> Load Structured Data, and

st<n> Store Structured Data.

They can be used for interleaving or deinterleaving the data as it is loaded or stored, as shown in Fig. 10.3.

10.2.1.1 Syntax

Image

  •  Image must be either Image or Image.
  •  Image must be one of Image, Image, Image, or Image.
  •  Image specifies the list of registers. There are four list formats:
    1.  {Vt.T}
    2.  {Vt.T, V(t+1).T} or {Vt.T-V(t+1).T}
    3.  {Vt.T, V(t+1).T, V(t+2).T} or {Vt.T-V(t+2).T}
    4.  {Vt.T, V(t+1).T, V(t+2).T, V(t+3).T} or {Vt.T-V(t+3).T}
    The registers must be consecutive. Register numbers wrap around, so register 0 follows register 31.
  •  Image must be b, h, s, or d.
  •  The immediate Image specifies which element of each register is to be used, and must be appropriate for the data size specified by Image. The same element will be used for all registers.
  •  Image is the AArch64 register containing the base address.
  •  Image is the AArch64 register containing an offset.
  •  If a register or immediate offset is given, then the base register, Image, will be post-incremented.
  •  The post-increment immediate offset, if present, must equal the total number of bytes transferred: 1, 2, 3, 4, 6, 8, 12, 16, 24, or 32, depending on the number of elements transferred and the size specified by Image.

10.2.1.2 Operations

Image

10.2.1.3 Examples

Image

10.2.2 Load or store multiple structures

These instructions are used to load and store multiple data structures across multiple registers with interleaving or deinterleaving:

ld<n> Load Multiple Structured Data, and

st<n> Store Multiple Structured Data.

10.2.2.1 Syntax

Image

  •  Image must be either Image or Image.
  •  Image must be one of Image, Image, Image, or Image.
  •  Image specifies the list of registers. There are four list formats:
    1.  {Vt.T}
    2.  {Vt.T, V(t+1).T} or {Vt.T-V(t+1).T}
    3.  {Vt.T, V(t+1).T, V(t+2).T} or {Vt.T-V(t+2).T}
    4.  {Vt.T, V(t+1).T, V(t+2).T, V(t+3).T} or {Vt.T-V(t+3).T}
    The registers must be consecutive. Register numbers wrap around, so register 0 follows register 31.
  •  Image must be 16b, 8b, 8h, 4h, 4s, 2s, or 2d. If Image is 1, then Image can be 1d.
  •  Image is the AArch64 register containing the base address.
  •  Image is the AArch64 register containing an offset.
  •  If a register or immediate offset is given, then the base register, Image, will be post-incremented.
  •  The post-increment immediate offset, if present, must be 8, 16, 24, 32, 48, or 64, depending on the number of elements transferred and the size specified by Image.

10.2.2.2 Operations

Image

10.2.2.3 Examples

Image

10.2.3 Load copies of a structure to all lanes

This instruction is used to load multiple copies of structured data across multiple registers:

ld<n>r Load Copies of Structured Data.

The data is copied to all lanes. This instruction is useful for initializing vectors for use in later instructions.

10.2.3.1 Syntax

Image

  •  Image must be one of Image, Image, Image, or Image.
  •  Image specifies the list of registers. There are four list formats:
    1.  {Vt.T}
    2.  {Vt.T, V(t+1).T} or {Vt.T-V(t+1).T}
    3.  {Vt.T, V(t+1).T, V(t+2).T} or {Vt.T-V(t+2).T}
    4.  {Vt.T, V(t+1).T, V(t+2).T, V(t+3).T} or {Vt.T-V(t+3).T}
    The registers must be consecutive. Register numbers wrap around, so register 0 follows register 31.
  •  Image must be 16b, 8b, 8h, 4h, 4s, 2s, or 2d. If Image is 1, then Image can be 1d.
  •  Image is the AArch64 register containing the base address.
  •  Image is the AArch64 register containing an offset.
  •  If a register or immediate offset is given, then the base register, Image, will be post-incremented.
  •  The post-increment immediate offset, if present, must be 1, 2, 3, 4, 6, 8, 12, 16, 24, or 32, depending on the number of elements transferred and the size specified by Image.

10.2.3.2 Operations

Image

10.2.3.3 Examples

Image

10.3 Data movement instructions

With the additional register views added by Advanced SIMD, there are many more ways to specify data movement. Instructions are provided to move data using the Advanced SIMD views, the FP/NEON views, and the AArch64 integer register views. This results in a large number of possible move instructions.

10.3.1 Duplicate scalar

The duplicate instruction copies a scalar into every element of the destination vector. The scalar can be in an Advanced SIMD register or an AArch64 integer register. The instruction is:

dup Duplicate Scalar.

10.3.1.1 Syntax

Image

  •  Td1 may be 8b, 16b, 4h, 8h, 2s, or 4s. The lowest n bits of Wn will be used, where n is the number of bits specified by Td1.
  •  Td2 may be 8b, 16b, 4h, 8h, 2s, 4s, or 2d.
  •  Ts can be one of b, h, s, or d, and must match Td2.
  •  Fd may be any of the FP/NEON register names used in Chapter 9.
  •  Td3 may be b, h, s, or d, and must match Fd.
  •  The immediate index must be valid for the type of vector element specified.
  •  MOV Fd,Vn.Td[index] is an alias for DUP Fd,Vn.Td[index].
  •  The MOV Fd,Fn instruction, which was introduced in Chapter 9, is an alias for Image.

10.3.1.2 Operations

Image

10.3.1.3 Examples

Image

10.3.2 Move vector element

These instructions copy a single element into a vector, or from a vector into an AArch64 integer register:

mov Copy element into vector,

umov Copy unsigned integer element from vector to AArch64 register, and

smov Copy signed integer element from vector to AArch64 register.

10.3.2.1 Syntax

Image

  •  T may be 8b, 16b, 4h, 8h, 2s, or 4s.
  •  The lowest n bits of Wn will be used, where n is the number of bits specified by T.
  •  The type T2 may be b, h, s, or d.
  •  The type T3 may be b or h.
  •  Ts may be 8b, 16b, 4h, 8h, 2s, or 4s.
  •  T4 may be b, h, or s.
  •  Both immediates, index and index2, must be valid for the type of vector element specified.
  •  MOV Vd.T[index],Wn is an alias for INS Vd.T[index],Wn.
  •  MOV Vd.2d[index],Xn is an alias for INS Vd.D[index],Xn.
  •  MOV Vd.T[index],Vn.T[index2] is an alias for INS Vd.T[index],Vn.T[index2].

10.3.2.2 Operations

Image

10.3.2.3 Examples

Image

10.3.3 Move immediate

These instructions are used to load immediate data into the vector registers:

movi Vector Move Immediate,

mvni Vector Move NOT Immediate, and

fmov Vector Floating Point Move Immediate.

10.3.3.1 Syntax

Image

  •  sop may be lsl or msl, where msl is a left shift which fills the low order bits with ones instead of zeros.
  •  If sop is not present, then the shift is assumed to be an lsl with shift amount of zero. The valid combinations of T and shift are given by the following table:

    sop   T          shift             Description
    lsl   4h or 8h   0 or 8            Replicate LSL(uimm8,shift) into each 16-bit element.
    lsl   2s or 4s   0, 8, 16, or 24   Replicate LSL(uimm8,shift) into each 32-bit element.
    msl   2s or 4s   8 or 16           Replicate MSL(uimm8,shift) into each 32-bit element.

Image

  •  For Image, if sop is not present, then T may be 8b or 16b, in addition to the values shown in the previous table.
  •  uimm64 may be either 0 or 0xFFFFFFFFFFFFFFFF.
  •  Td may be 2s, 4s, or 2d.
  •  fpimm may be specified either in decimal notation or in hexadecimal using its IEEE754 encoding. The value must be expressible as $\pm n \div 16 \times 2^{r}$, where $n$ and $r$ are integers such that $16 \le n \le 31$ and $-3 \le r \le 4$. It is encoded as a normalized binary floating point number with sign, 4 bits of fraction, and a 3-bit exponent.
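The two shift operators and the restriction on floating point immediates can be checked with a small Python sketch. The function names are hypothetical and the code is a model of the rules stated above, not an assembler. Exact rational arithmetic is used so that the test for a valid immediate does not depend on rounding.

```python
from fractions import Fraction

def lsl(uimm8, shift):
    """Logical shift left: vacated low bits are filled with zeros."""
    return uimm8 << shift

def msl(uimm8, shift):
    """Shift left, filling the vacated low bits with ones."""
    return (uimm8 << shift) | ((1 << shift) - 1)

def fpimm_values():
    """All positive values n/16 * 2**r with 16 <= n <= 31 and -3 <= r <= 4."""
    return {Fraction(n, 16) * Fraction(2) ** r
            for n in range(16, 32) for r in range(-3, 5)}

def is_valid_fpimm(x):
    """True if x can be used as a floating point immediate under these rules."""
    return abs(Fraction(x)) in fpimm_values()
```

There are 16 choices for n and 8 for r, giving 128 distinct magnitudes, which together with the sign accounts for the 8 bits of the encoding. Values such as 0.0 and 0.1 are not in the set, so they would have to be loaded some other way.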

10.3.3.2 Operations

Image

10.3.3.3 Examples

Image

10.3.4 Transpose matrix

Advanced SIMD provides two versions of the transpose instruction that can be used together for transposing 2 × 2 matrices. Fig. 10.4 shows two examples of this instruction. The instruction is:

trn Transpose Matrix.

Image
Figure 10.4 Examples of the Image instruction.

10.3.4.1 Syntax

Image

  •  T must be 8b, 16b, 4h, 8h, 2s, 4s, or 2d.
  •  Larger matrices can be transposed using a divide-and-conquer approach.
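The effect of the two transpose variants, trn1 and trn2 in the A64 instruction set, can be modeled on Python lists. This is a sketch of the element movement only: each list stands for one vector register, with element 0 first.

```python
def trn1(vn, vm):
    """Take the even-numbered elements of vn and vm, interleaved."""
    out = []
    for i in range(0, len(vn), 2):
        out += [vn[i], vm[i]]
    return out

def trn2(vn, vm):
    """Take the odd-numbered elements of vn and vm, interleaved."""
    out = []
    for i in range(0, len(vn), 2):
        out += [vn[i + 1], vm[i + 1]]
    return out

# A 2 x 2 matrix stored one row per register.
row0, row1 = [1, 2], [3, 4]
col0, col1 = trn1(row0, row1), trn2(row0, row1)
```

Used together, the two results hold the columns of the original matrix, which are the rows of its transpose. With longer vectors, the same pair of instructions transposes every adjacent 2 × 2 block at once.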

10.3.4.2 Operation

Image

10.3.4.3 Examples

Image

Fig. 10.5 shows how the Image instruction can be used to transpose a 3 × 3 matrix.

Image
Figure 10.5 Transpose of a 3 × 3 matrix.

10.3.5 Vector permute

These instructions are used to interleave or deinterleave the data from two vectors, or to extract bytes from a pair of vectors:

zip Zip Vectors,

uzp Unzip Vectors, and

ext Byte Extract.

Fig. 10.6 gives an example of the Image instruction. The Image instruction performs the inverse operation.

Image
Figure 10.6 Example of Image. The Image instruction does the same thing, but uses the odd elements of the source registers, rather than the even elements.

10.3.5.1 Syntax

Image

  •  T is 8b, 16b, 4h, 8h, 2s, 4s, or 2d.
  •  For Image and Image:
    –  If 1 is present, use lower half of source registers.
    –  If 2 is present, use upper half of source registers.
  •  Ta is either 8b (use only 64 bits of each register) or 16b (use all 128 bits of each register).
  •  index is an immediate value in the range 0 to nelem(T) − 1.
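The permute operations can be modeled on Python lists, one list per register with element 0 first. This is an illustrative sketch following the A64 definitions of zip1, zip2, uzp1, uzp2, and ext. It is not taken from the text, and for ext each list element stands for one byte.

```python
def zip1(vn, vm):
    """Interleave the lower halves of vn and vm."""
    half = len(vn) // 2
    return [x for pair in zip(vn[:half], vm[:half]) for x in pair]

def zip2(vn, vm):
    """Interleave the upper halves of vn and vm."""
    half = len(vn) // 2
    return [x for pair in zip(vn[half:], vm[half:]) for x in pair]

def uzp1(vn, vm):
    """Even-numbered elements of vn followed by those of vm."""
    return (vn + vm)[0::2]

def uzp2(vn, vm):
    """Odd-numbered elements of vn followed by those of vm."""
    return (vn + vm)[1::2]

def ext(vn, vm, index):
    """Bytes of vn starting at index, completed with the low bytes of vm."""
    return (vn + vm)[index:index + len(vn)]

a, b = [0, 1, 2, 3], [10, 11, 12, 13]
lo, hi = zip1(a, b), zip2(a, b)
```

Applying uzp1 and uzp2 to the two zipped results recovers the original vectors, which is the sense in which unzip is the inverse of zip.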

10.3.5.2 Operations

Image

10.3.5.3 Examples

Image

10.3.6 Table lookup

The table lookup instructions use indices held in one vector to look up values from a table held in one or more other vectors. The resulting values are stored in the destination vector. The table lookup instructions are:

tbl Table Lookup, and

tbx Table Lookup with Extend.

10.3.6.1 Syntax

Image

  •  Image is one of Image or Image.
  •  T may be 8b or 16b.
  •  Image specifies the list of registers. There are four list formats:
    1.  {Vn.16b},
    2.  {Vn.16b, V(n+1).16b},
    3.  {Vn.16b, V(n+1).16b, V(n+2).16b}, or
    4.  {Vn.16b, V(n+1).16b, V(n+2).16b, V(n+3).16b}.
  •  A dash “-” can be used to specify a range of registers, as shown in the examples below.
  •  Image is the register holding the indices.
  •  The table can contain up to 64 bytes.
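The difference between the two lookup instructions lies in how an out-of-range index is handled, which the following Python sketch makes explicit. It follows the A64 definitions of tbl and tbx. The table argument stands for the concatenated bytes of the one to four table registers.

```python
def tbl(table, indices):
    """Table lookup: an out-of-range index produces zero."""
    return [table[i] if i < len(table) else 0 for i in indices]

def tbx(dest, table, indices):
    """Table lookup extension: an out-of-range index leaves the destination byte unchanged."""
    return [table[i] if i < len(table) else d for d, i in zip(dest, indices)]

table = list(range(100, 116))        # one 16-byte table register
indices = [0, 15, 16, 255]
```

Because tbx preserves destination bytes for out-of-range indices, several lookups with different index offsets can be chained to search a table larger than 64 bytes.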

10.3.6.2 Operations

Image

10.3.6.3 Examples

Image

10.4 Data conversion

When high precision is not required, the IEEE half-precision format can be used to store floating point numbers in memory. This can reduce memory requirements by up to 50%. It can also result in a significant performance improvement, since only half as much data needs to be moved between the CPU and main memory. However, on most processors half-precision data must be converted to single or double precision before it is used in calculations. Advanced SIMD provides instructions to support conversion to and from IEEE half precision. There are also instructions to perform integer or fixed-point to floating-point conversions, and to convert between IEEE single and double precision.

10.4.1 Convert between integer or fixed point and floating point

These instructions can be used to perform data conversion between floating point and fixed point (or integer) on each element in a vector:

fcvt Vector convert floating point to integer or fixed point, and

cvtf Vector convert integer or fixed point to floating point.

The elements in the result vector must be the same size as the elements in the source vector. An out of range integer or fixed-point result will be saturated to the destination size.

Fixed point (or integer) arithmetic operations are up to twice as fast as floating point operations. In some cases it is much more efficient to make this conversion, perform the calculations, then convert the results back to floating point.

10.4.1.1 Syntax

Image

  •  Image is a single character which specifies the rounding mode:

    N: round to nearest with ties to even,

    A: round to nearest with ties away from zero,

    P: round towards +∞,

    M: round towards −∞, or

    Z: round towards zero.

  •  T may be either 2s, 4s, or 2d.
  •  Image must be either Image or Image.
  •  Image specifies the number of fraction bits for a fixed point number, and must be between 1 and the number of bits specified by T. If it is omitted, then it is assumed to be zero.
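The five rounding modes and the saturating behavior of the conversion can be modeled in Python. The sketch below is an approximation for illustration: the function names are hypothetical, and it accepts a fraction-bit count with any rounding mode for simplicity, which may be more permissive than the actual instruction encodings.

```python
import math

def round_mode(x, mode):
    """Round x to an integer using one of the modes N, A, P, M, or Z."""
    if mode == "N":
        return round(x)                       # Python rounds ties to even
    if mode == "A":
        return int(math.copysign(math.floor(abs(x) + 0.5), x))
    if mode == "P":
        return math.ceil(x)                   # towards +infinity
    if mode == "M":
        return math.floor(x)                  # towards -infinity
    if mode == "Z":
        return math.trunc(x)                  # towards zero
    raise ValueError("unknown rounding mode: " + mode)

def fcvt_signed(x, mode="Z", fbits=0, bits=32):
    """Convert x to signed fixed point with fbits fraction bits, saturating."""
    scaled = round_mode(x * 2 ** fbits, mode)
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, scaled))
```

The modes differ only on values that are not already integers. For 2.5 the results are 2, 3, 3, 2, and 2 for N, A, P, M, and Z respectively.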

10.4.1.2 Operations

Name       Effect                     Description
fcvt<x>s   Vd[]←fixed(Vm[],fbits)     Convert single precision to 32-bit signed fixed point or integer.
fcvt<x>u   Vd[]←ufixed(Vm[],fbits)    Convert single precision to 32-bit unsigned fixed point or integer.
scvtf      Vd[]←float(Vm[])           Convert signed 32-bit fixed point or integer to single precision.
ucvtf      Vd[]←single(Vm[])          Convert unsigned 32-bit fixed point or integer to single precision.

10.4.1.3 Examples

Image

10.4.2 Convert between half, single, and double precision

The following instructions can be used to convert between floating point formats:

fcvtl Vector convert from half to single precision,

fcvtn Vector convert from single to half precision, and

fcvtxn Vector convert from double to single precision.

These instructions operate on vectors. There are additional conversion instructions available, but they only operate on scalar values.

10.4.2.1 Syntax

Image

  •  If 2 is present, then the upper 64 bits of the register containing the smaller elements will be used. Otherwise, the lower 64 bits are used.
  •  Td/Ts may be 4s/4h or 2d/2s.
  •  Td2/Ts2 may be 4s/8h or 2d/4s.
  •  Td3/Ts3 may be 4h/4s or 2s/2d.
  •  Td4/Ts4 may be 8h/4s or 4s/2d.

10.4.2.2 Operations

Name     Effect               Description
fcvtl    Vd[]←single(Vn[])    Convert half precision to single precision.
fcvtn    Vd[]←half(Vn[])      Convert single precision to half precision.
fcvtxn   Vd[]←single(Vn[])    Convert double precision to single precision.

10.4.2.3 Examples

Image

10.4.3 Round floating point to integer

The following instruction can be used to round a vector of floating point values to integers:

frint Round Floating Point to Integer.

10.4.3.1 Syntax

Image

  •  T is 2s, 4s, or 2d.
  •  <x> selects the rounding mode. It may be one of the following:

    N: round to nearest with ties to even,

    A: round to nearest with ties away from zero,

    P: round towards +∞,

    M: round towards −∞,

    Z: round towards zero,

    I: round using FPCR rounding mode, and

    X: round using FPCR rounding mode with exactness test.

10.4.3.2 Operations

Name       Effect                  Description
frint<x>   Vd[]←roundx(Vn[],x)     Round to integer using specified rounding mode.

10.4.3.3 Examples

Image

10.5 Bitwise logical operations

Advanced SIMD provides instructions to perform bitwise logical operations on the vector register set. These operations add a great deal of power to the AArch64 processor.

10.5.1 Vector logical operations

The bitwise logical operations are:

and Vector bitwise AND,

orr Vector bitwise OR,

orn Vector bitwise OR NOT,

eor Vector bitwise Exclusive-OR,

bic Vector bit clear,

bif Vector insert if false,

bit Vector insert if true, and

bsl Vector bitwise select.

10.5.1.1 Syntax

Image

  •  Image must be one of Image, Image, Image, Image, Image, Image, Image, or Image.
  •  T is 8b or 16b (though an assembler should accept any other equivalent format).
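The less familiar operations in this group are easiest to understand from their bit-level definitions. The Python sketch below follows the A64 definitions, treating each 128-bit register as a Python integer. The argument order (destination, first source, second source) is a choice made for this illustration.

```python
MASK = (1 << 128) - 1        # keep results within 128 bits

def orn(vn, vm):
    """OR NOT: vn OR (NOT vm)."""
    return (vn | ~vm) & MASK

def bic(vn, vm):
    """Bit clear: vn AND (NOT vm)."""
    return vn & ~vm & MASK

def bsl(vd, vn, vm):
    """Bitwise select: where vd has a 1 take the bit from vn, otherwise from vm."""
    return ((vd & vn) | (~vd & vm)) & MASK

def bit(vd, vn, vm):
    """Insert if true: copy bits of vn into vd where vm has a 1."""
    return ((vd & ~vm) | (vn & vm)) & MASK

def bif(vd, vn, vm):
    """Insert if false: copy bits of vn into vd where vm has a 0."""
    return ((vd & vm) | (vn & ~vm)) & MASK
```

These three selection operations are commonly paired with the comparison instructions, which produce all-ones or all-zeros masks, to choose between two results without branching.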

10.5.1.2 Operations

Image

10.5.1.3 Examples

Image

10.5.2 Bitwise logical operations with immediate data

Advanced SIMD provides vector versions of the logical OR and bit clear instructions:

orr Vector bitwise OR immediate, and

bic Vector Bit clear immediate.

10.5.2.1 Syntax

Image

  •  Image must be either Image or Image.
  •  T may be either 2s, 4s, 4h, or 8h.
  •  If T is 2s or 4s, then shift may be 0, 8, 16, or 24.
  •  If T is 4h or 8h, then shift may be 0 or 8.
  •  If shift is not specified, then it is assumed to be zero.
  •  Image is an 8-bit unsigned immediate value, which is shifted left by shift bits to create the desired pattern for the Image or Image operation on each vector element.

10.5.2.2 Operations

Name   Effect                          Description
orr    Vd[]←Vd[]∨(uimm8 ≪ shift)      Logical OR
bic    Vd[]←Vd[]∧¬(uimm8 ≪ shift)     Bit Clear

10.5.2.3 Examples

Image

10.6 Basic arithmetic instructions

Advanced SIMD provides many instructions for addition, subtraction, and multiplication, but does not provide an integer divide instruction. When division cannot be avoided, it is performed by multiplying by the reciprocal, as was described in Chapter 7 and Chapter 8. When dividing by a constant, the reciprocal can be calculated in advance. For dividing by variables, Advanced SIMD provides instructions for quickly calculating the reciprocals of the elements in a vector. In most cases, this is faster than using a divide instruction. For floating point numbers, the FP/NEON divide instructions can be used.
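Division by a constant using a precomputed reciprocal can be illustrated with a short Python sketch. The constant 0xCCCCCCCD and the shift of 35 are a widely used choice for unsigned 32-bit division by 10. They are chosen here for illustration and are not taken from the text. The function name is hypothetical.

```python
def div10_by_reciprocal(x):
    """Compute x // 10 for an unsigned 32-bit x using a multiply and a shift."""
    RECIP = 0xCCCCCCCD        # ceil(2**35 / 10), a fixed point reciprocal of 10
    return (x * RECIP) >> 35
```

The multiplication produces a 64-bit product, so a vector implementation would need a widening multiply followed by a right shift, applied to every element at once.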

10.6.1 Vector add and subtract

The Image and Image instructions add corresponding elements in two vectors and store the results in the corresponding elements of the destination register. The Image and Image instructions subtract elements in one vector from corresponding elements in another vector and store the results in the corresponding elements of the destination register. Other versions of the add and subtract instructions allow mismatched operand and destination sizes, and the saturating versions prevent overflow by limiting the range of the results. The following instructions perform vector addition and subtraction:

add Vector integer add,

fadd Vector floating point add,

qadd Vector saturating add,

addl Vector add long,

addw Vector add wide,

sub Vector integer subtract,

fsub Vector floating point subtract,

qsub Vector saturating subtract,

subl Vector subtract long, and

subw Vector subtract wide.

10.6.1.1 Syntax

Image

  •  <op> can be add or sub.
  •  If double word registers are specified (Dd, Dn, Dm), then the operation is a simple add or subtract of scalar 64-bit integer values, and not a vector operation.
  •  The valid choices for T are given in the following table:

    Opcode       Valid Values for T
    <op>         8b, 16b, 4h, 8h, 2s, 4s, or 2d
    f<op>        2s, 4s, or 2d
    (s|u)q<op>   8b, 16b, 4h, 8h, 2s, 4s, or 2d

  •  <sop> can be uadd, sadd, usub, or ssub.
  •  If the modifier 2 is present, then the operation is performed using the upper 64 bits of the registers holding the narrower elements.
  •  The valid choices for Td/Ts are given in the following table:

    Opcode    Valid Values for Td/Ts
    <sop>l    8h/8b, 4s/4h, 2d/2s
    <sop>l2   8h/16b, 4s/8h, 2d/4s
    <sop>w    8h/8b, 4s/4h, 2d/2s
    <sop>w2   8h/16b, 4s/8h, 2d/4s
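The three ways of dealing with overflow (wrapping, saturating, and widening) can be compared using a Python model of the unsigned byte forms. The function names mirror the A64 mnemonics, but the code is only a sketch of the arithmetic, with each register represented as a list of sixteen byte values.

```python
def add(vn, vm, bits=8):
    """Ordinary add: each element wraps modulo 2**bits."""
    mask = (1 << bits) - 1
    return [(a + b) & mask for a, b in zip(vn, vm)]

def uqadd(vn, vm, bits=8):
    """Unsigned saturating add: results are clamped to the largest value."""
    limit = (1 << bits) - 1
    return [min(a + b, limit) for a, b in zip(vn, vm)]

def uaddl(vn, vm, upper=False):
    """Add long: narrow + narrow -> wide, using the lower or upper half."""
    half = len(vn) // 2
    part = slice(half, None) if upper else slice(0, half)
    return [a + b for a, b in zip(vn[part], vm[part])]

def uaddw(vn_wide, vm, upper=False):
    """Add wide: wide + narrow -> wide, using the lower or upper half of vm."""
    half = len(vm) // 2
    part = slice(half, None) if upper else slice(0, half)
    return [a + b for a, b in zip(vn_wide, vm[part])]

vn = [200] * 8 + [255] * 8
vm = [100] * 8 + [1] * 8
```

Setting upper=True corresponds to the forms with the modifier 2, which read the upper 64 bits of the narrower source registers.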

10.6.1.2 Operations

Image

10.6.1.3 Examples

Image

10.6.2 Vector add and subtract with narrowing

These instructions add or subtract the corresponding elements of two vectors and narrow the resulting elements by taking the upper (most significant) half:

addhn Vector add and narrow,

raddhn Vector add, round, and narrow,

subhn Vector subtract and narrow, and

rsubhn Vector subtract, round, and narrow.

The results are stored in the corresponding elements of the destination register. Results can be optionally rounded instead of truncated.

10.6.2.1 Syntax

Image

  •  Image is either Image or Image.
  •  If Image is present, then the result is rounded instead of truncated.
  •  If Image is present, then the upper 64 bits of the destination vector are used.
  •  The valid choices for Td/Ts are given in the following table:

    Opcode      Valid Values for Td/Ts
    r<op>hn     8b/8h, 4h/4s, 2s/2d
    r<op>hn2    16b/8h, 8h/4s, 4s/2d
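The narrowing step keeps the upper half of each sum or difference, and the rounding forms add half of the discarded range before the upper half is taken. A Python sketch of this arithmetic, with hypothetical helper names and 16-bit source elements by default, is:

```python
def _high_narrow(values, src_bits, rounding):
    """Keep the upper half of each src_bits-wide value, optionally rounding."""
    half = src_bits // 2
    mask = (1 << src_bits) - 1
    bias = (1 << (half - 1)) if rounding else 0
    return [((v + bias) & mask) >> half for v in values]

def addhn(vn, vm, src_bits=16, rounding=False):
    """Add, then narrow to the most significant half of each element."""
    return _high_narrow([a + b for a, b in zip(vn, vm)], src_bits, rounding)

def subhn(vn, vm, src_bits=16, rounding=False):
    """Subtract, then narrow to the most significant half of each element."""
    return _high_narrow([a - b for a, b in zip(vn, vm)], src_bits, rounding)
```

For the sum 0x1280, truncation gives 0x12, while rounding gives 0x13 because the discarded low byte is at least half of its range.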

10.6.2.2 Operations

Image

10.6.2.3 Examples

Image

10.6.3 Add or subtract and divide by two

These instructions add or subtract corresponding integer elements from two vectors, then shift the result right by one bit:

hadd Vector halving add,

rhadd Vector halving add and round, and

hsub Vector halving subtract.

The results are stored in corresponding elements of the destination vector.

10.6.3.1 Syntax

Image

  •  If Image is specified, then the result is rounded instead of truncated.
  •  T must be 8b, 16b, 4h, 8h, 2s, or 4s.
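The advantage of the halving instructions is that the intermediate sum is computed with an extra bit, so the average of two large values does not overflow. The following Python model of the unsigned forms (function names follow the A64 mnemonics, but this is only a sketch) shows the arithmetic:

```python
def uhadd(vn, vm):
    """Halving add: (a + b) >> 1, computed without intermediate overflow."""
    return [(a + b) >> 1 for a, b in zip(vn, vm)]

def urhadd(vn, vm):
    """Rounding halving add: (a + b + 1) >> 1."""
    return [(a + b + 1) >> 1 for a, b in zip(vn, vm)]

def uhsub(vn, vm, bits=8):
    """Halving subtract: (a - b) >> 1, truncated to the element size."""
    mask = (1 << bits) - 1
    return [((a - b) >> 1) & mask for a, b in zip(vn, vm)]
```

A wrapping add followed by a shift would give 127 for the average of 255 and 255, whereas the halving add gives the correct result of 255.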

10.6.3.2 Operations

Image

10.6.3.3 Examples

Image

10.6.4 Add elements pairwise

These instructions add vector elements pairwise:

addp Vector add pairwise,

addlp Vector add long pairwise, and

adalp Vector add and accumulate long pairwise.

The long versions can be used to prevent overflow.

10.6.4.1 Syntax

Image

  •  T must be 8b, 16b, 4h, 8h, 2s, 4s, or 2d.
  •  Tf must be 2s, 4s, or 2d.
  •  Td/Ts must be 4h/8b, 8h/16b, 2s/4h, 4s/8h, 1d/2s, or 2d/4s.
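Pairwise operations combine adjacent elements rather than corresponding elements. The Python sketch below models the unsigned forms following the A64 definitions. For simplicity, the long forms do not model wraparound of the wide result.

```python
def addp(vn, vm, bits=8):
    """Add pairwise: adjacent pairs of vn, then adjacent pairs of vm."""
    mask = (1 << bits) - 1
    src = vn + vm
    return [(src[i] + src[i + 1]) & mask for i in range(0, len(src), 2)]

def uaddlp(vn):
    """Add long pairwise: each adjacent pair yields one double-width sum."""
    return [vn[i] + vn[i + 1] for i in range(0, len(vn), 2)]

def uadalp(vd, vn):
    """Add and accumulate long pairwise: add the pair sums into vd."""
    return [d + s for d, s in zip(vd, uaddlp(vn))]
```

Repeated pairwise addition is a common way to sum all of the elements of a vector, since each step halves the number of partial sums.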

10.6.4.2 Operations

Image

10.6.4.3 Examples

Image

10.6.5 Absolute difference

These instructions subtract the elements of one vector from another and store or accumulate the absolute value of the results:

abd Vector integer absolute difference,

fabd Vector floating point absolute difference,

aba Vector integer absolute difference and accumulate,

faba Vector floating point absolute difference and accumulate,

abal Vector absolute difference and accumulate long, and

abdl Vector absolute difference long.

The long versions can be used to prevent overflow.

10.6.5.1 Syntax

Image

  •  Image is either Image or Image.
  •  When a scalar register is specified (F is S or D), a scalar operation is performed instead of a vector operation.
  •  If Image is present, then the upper 64 bits of the source registers are used.
  •  T must be 8b, 16b, 4h, 8h, 2s, or 4s.
  •  Tf must be 2s, 4s, or 2d.
  •  Td/Ts must be one of 4s/8h, 8h/16b, or 2d/4s.
  •  The valid choices for Td/Ts are given in the following table:

    Opcode         Valid Types for Td/Ts
    (s|u)<op>l     8h/8b, 4s/4h, 2d/2s
    (s|u)<op>l2    8h/16b, 4s/8h, 2d/4s

10.6.5.2 Operations

Image

10.6.5.3 Examples

Image

10.6.6 Absolute value and negate

These operations compute the absolute value or negate each element in a vector:

abs Vector absolute value,

neg Vector negate,

fabs Vector floating point absolute value, and

fneg Vector floating point negate.

The saturating versions can be used to prevent overflow.

10.6.6.1 Syntax

Image

  •  Image is either Image or Image.
  •  T may be 8b, 16b, 4h, 8h, 2s, 4s, or 2d.
  •  Tf may be 2s, 4s, or 2d.

10.6.6.2 Operations

Image

10.6.6.3 Examples

Image

10.6.7 Get maximum or minimum elements

The following instructions select the maximum or minimum elements and store the results in the destination vector:

max Vector integer maximum,

min Vector integer minimum,

fmax Vector floating point maximum,

fmin Vector floating point minimum,

maxp Vector integer pairwise maximum,

minp Vector integer pairwise minimum,

fmaxp Vector floating point pairwise maximum,

fminp Vector floating point pairwise minimum,

fmaxnm Vector floating point maxnum,

fminnm Vector floating point minnum,

fmaxnmp Vector floating point pairwise maxnum, and

fminnmp Vector floating point pairwise minnum.

10.6.7.1 Syntax

Image

  •  Image is either Image or Image.
  •  T must be 8b, 16b, 4h, 8h, 2s, or 4s.
  •  Tf must be 2s, 4s, or 2d.
  •  If nm is present, then the result is as described in Chapter 9, Section 9.7.5, on page 311.

10.6.7.2 Operations

Image

10.6.7.3 Examples

Image

10.6.8 Count bits

These instructions can be used to count leading sign bits or zeros, or to count the number of bits that are set, for each element in a vector:

cls Vector count leading sign bits,

clz Vector count leading zero bits, and

cnt Vector count set bits.

10.6.8.1 Syntax

Image

  •  T must be 8b, 16b, 4h, 8h, 2s, or 4s.
  •  Tn must be 8b or 16b.
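The three counting operations can be modeled for a single element in Python. This sketch follows the A64 definitions, in which the count of leading sign bits does not include the sign bit itself. Elements are treated as unsigned bit patterns of the given width.

```python
def clz(x, bits=8):
    """Count leading zero bits in a bits-wide element."""
    return bits - x.bit_length()

def cls(x, bits=8):
    """Count the bits after the sign bit that are equal to the sign bit."""
    mask = (1 << bits) - 1
    if (x >> (bits - 1)) & 1:
        x = ~x & mask             # for negative values, count leading ones
    return clz(x, bits) - 1

def cnt(x):
    """Count the bits that are set in a byte."""
    return bin(x).count("1")
```

The leading sign bit count indicates how far a signed value can be shifted left without changing its sign, which is useful when normalizing fixed point data.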

10.6.8.2 Operations

Image

10.6.8.3 Examples

Image

10.6.9 Scalar saturating operations

The following instructions perform basic saturating operations on scalars:

qadd Scalar saturating add,

qsub Scalar saturating subtract,

qdmulh Scalar saturating multiply (high half), and

qshl Scalar saturating shift left.

10.6.9.1 Syntax

Image

  •  F is b, h, s, or d.
  •  <Fx> is h or s.

10.6.9.2 Operations

Image

10.6.9.3 Examples

Image

10.7 Multiplication and division

There is no integer divide instruction in Advanced SIMD. Integer division is accomplished with multiplication by the reciprocal, as was described in Chapter 7 and Chapter 8. For division by a constant, the constant reciprocal can be computed in advance, and simply loaded into a register. For division by a variable, special instructions are provided for computing the reciprocal.
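
Division by a constant through a precomputed reciprocal can be sketched as follows. The constants are an assumption chosen for this illustration (unsigned 16-bit dividends and a divisor of 10), not values taken from the text; the reciprocal is scaled by 2^19 and rounded up.

```python
RECIP = 52429   # ceil(2**19 / 10), computed in advance
SHIFT = 19

def div10_u16(x):
    """Divide an unsigned 16-bit integer by 10 with a multiply and a shift."""
    return (x * RECIP) >> SHIFT

# Exhaustive check over every 16-bit dividend.
print(all(div10_u16(x) == x // 10 for x in range(65536)))
```

In vector code, the same idea would use a widening multiply followed by a narrowing right shift, so that the intermediate product does not overflow the element size.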

10.7.1 Vector multiply and divide

These instructions are used to multiply (or, for fdiv, divide) the corresponding elements from two vectors:

mul Vector integer multiply,

mla Vector integer multiply accumulate,

mls Vector integer multiply subtract,

fmul Vector floating point multiply,

fdiv Vector floating point divide,

fmla Vector floating point multiply accumulate,

fmls Vector floating point multiply subtract,

mull Vector multiply long,

mlal Vector multiply accumulate long,

mlsl Vector multiply subtract long,

pmul Vector polynomial multiply, and

pmull Vector polynomial multiply long.

10.7.1.1 Syntax

Image

  • •  T may be 8b, 16b, 4h, 8h, 2s, or 4s.
  • •  Tf may be 2s, 4s, or 2d.
  • •  T2 may be 8b or 16b.
  • •  If x is present, then $0 \times \pm\infty = \pm 2$ (vector).
  • •  If 2 is present, then
    1. –  Td /Ts may be 8h /16b, 4s /8h, or 2d /4s.
    2. –  the upper 64 bits of the source vectors are used.
  • •  If 2 is not present, then Td /Ts may be 8h /8b, 4s /4h, or 2d /2s.

10.7.1.2 Operations

Image

10.7.1.3 Examples

Image
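
The following Python sketch models one lane of three of these operations, to show why the long forms exist and what a polynomial multiply does. The function names are hypothetical, and unsigned 8-bit source elements are assumed.

```python
def mul_lane(a, b, bits=8):
    """Ordinary multiply: only the low bits of the product are kept."""
    return (a * b) & ((1 << bits) - 1)

def mull_lane(a, b):
    """Multiply long: the destination element is twice as wide,
    so the full product is kept."""
    return a * b

def pmull_lane(a, b, bits=8):
    """Polynomial (carry-less) multiply long: partial products are
    combined with XOR instead of addition."""
    result = 0
    for i in range(bits):
        if (b >> i) & 1:
            result ^= a << i
    return result

def pmul_lane(a, b, bits=8):
    """Polynomial multiply: low half of the carry-less product."""
    return pmull_lane(a, b, bits) & ((1 << bits) - 1)

print(mul_lane(200, 200), mull_lane(200, 200), hex(pmull_lane(0xFF, 0xFF)))
```

Here 200 × 200 does not fit in eight bits, so the ordinary multiply loses the upper part of the product while the long form keeps it.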

10.7.2 Multiply vector by element

These instructions are used to multiply each element in a vector by a scalar:

mul Vector by scalar integer multiply,

mla Vector by scalar integer multiply accumulate,

mls Vector by scalar integer multiply subtract,

fmul Vector by scalar floating point multiply,

fmla Vector by scalar floating point multiply accumulate,

fmls Vector by scalar floating point multiply subtract,

mull Vector by scalar multiply long,

mlal Vector by scalar multiply accumulate long, and

mlsl Vector by scalar multiply subtract long.

10.7.2.1 Syntax

Image

  • •  <op> is either mul, mla, or mls.
  • •  T /Ts must be 4h /h, 8h /h, 2s /s, or 4s /s.
  • •  Tf /Ts2 must be 2s /s, 4s /s, or 2d /d.
  • •  If x is present, then $0 \times \pm\infty = \pm 2$ (vector).
  • •  If 2 is present, then
    1. –  Ta/Tb/Tc is 4s /8h /h or 2d /4s /s.
    2. –  the upper 64 bits of the source vectors are used.
  • •  If 2 is not present, then Ta/Tb/Tc is 4s /4h /h or 2d /2s /s.

10.7.2.2 Operations

Image

10.7.2.3 Examples

Image

10.7.3 Saturating vector multiply and double

These instructions perform multiplication, double the results, and perform saturation:

sqdmull Saturating Multiply Double,

sqdmlal Saturating Multiply Double Accumulate, and

sqdmlsl Saturating Multiply Double Subtract.

10.7.3.1 Syntax

Image

  • •  Image is either Image, Image, or Image.
  • •  If 2 is present, then Td /Ts is 4s /8h or 2d /4s and the upper half of Vn and Vm are used. Otherwise, Td /Ts is 4s /4h or 2d /2s and the lower half of Vn and Vm are used.
  • •  If 2 is present, then Td /Ta /Tb is 4s /8h /h or 2d /4s /s, and the upper half of Vn is used. Otherwise Td /Ta /Tb is 4s /4h /h or 2d /2s /s, and the lower half of Vn is used.
  • •  If the third operand is a scalar (Image is specified) and Tb is h, then Vm must be in the range Image.

10.7.3.2 Operations

Image

10.7.3.3 Examples

Image
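
One lane of the saturating doubling multiply can be modeled in a few lines of Python. The sketch assumes 16-bit signed sources and a 32-bit destination, and the function name is hypothetical. The remark about Q15 fixed point is a common motivation for the doubling, offered here as an assumption rather than something stated in the text.

```python
def sqdmull_lane(a, b, bits=16):
    """Multiply two signed bits-wide values, double the product, and
    saturate to a signed element twice as wide."""
    wide = 2 * bits
    lo, hi = -(1 << (wide - 1)), (1 << (wide - 1)) - 1
    return max(lo, min(hi, 2 * a * b))

# If the inputs are read as Q15 fractions, the doubled product is Q31:
# 0.5 * 0.5 = 0.25, i.e. 16384 * 16384 * 2 = 2**29.
print(sqdmull_lane(16384, 16384) == 2**29)
print(sqdmull_lane(-32768, -32768))
```

Saturation is only needed in one case: when both inputs are the most negative value, the doubled product is one larger than the destination can hold.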

10.7.4 Saturating multiply and double (high)

These instructions perform multiplication, double the results, perform saturation, and store the high half of the results:

sqdmulh Saturating Multiply Double (High), and

sqrdmulh Saturating Multiply Double and Round (High).

10.7.4.1 Syntax

Image

  • •  T must be 4h, 8h, 2s, or 4s.
  • •  Td /Ts must be 4h /h, 8h /h, 2s /s, or 4s /s.
  • •  If Ts is h, then Vm must be in the range Image.

10.7.4.2 Operations

Image

10.7.4.3 Examples

Image
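
The effect of keeping only the high half, with and without rounding, can be sketched for one 16-bit lane as follows. The function name and the `rounding` flag are hypothetical conveniences of this model.

```python
def sqdmulh_lane(a, b, bits=16, rounding=False):
    """Multiply, double, optionally round, keep the high half, saturate."""
    product = 2 * a * b
    if rounding:
        product += 1 << (bits - 1)      # add half of the discarded range
    high = product >> bits              # keep the high half
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, high))

print(sqdmulh_lane(16384, 16384))               # 0.5 * 0.5 in Q15
print(sqdmulh_lane(1, 16384))                   # truncated
print(sqdmulh_lane(1, 16384, rounding=True))    # rounded
print(sqdmulh_lane(-32768, -32768))             # saturated
```

When the operands are interpreted as Q15 fractions, the result is again a Q15 fraction, which is why this form is convenient for fixed point signal processing.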

10.7.5 Estimate reciprocals

In general, multiplication is faster than division. In many cases of vector arithmetic, it is faster to calculate reciprocals and use multiplication. These instructions perform the initial estimates of the reciprocal values:

recpe Reciprocal Estimate, and

rsqrte Reciprocal Square Root Estimate.

These work on floating point and unsigned fixed point vectors. The estimates from these instructions are accurate to within about eight bits. If higher accuracy is desired, then the Newton-Raphson method can be used to improve the initial estimates. For more information, see the Reciprocal Step instructions on page 369.

10.7.5.1 Syntax

Image

  • •  Image is either Image or Image.
  • •  Ta must be 2s or 4s.
  • •  Tb must be 2s, 4s, or 2d.

10.7.5.2 Operations

Image

10.7.5.3 Examples

Image

10.7.6 Reciprocal step

These instructions are used to perform one Newton-Raphson step for improving the reciprocal estimates:

frecps Reciprocal Step, and

frsqrts Reciprocal Square Root Step.

For each element in the vector, the following equation can be used to improve the estimates of the reciprocals:

$$x_{n+1} = x_n (2 - d x_n),$$

where $x_n$ is the estimated reciprocal from the previous step, and $d$ is the number for which the reciprocal is desired. This iteration converges to $1/d$ if $x_0$ is obtained using the reciprocal estimate instruction on $d$. The reciprocal step instruction computes only the factor

$$2 - d x_n,$$

so one additional multiplication, by $x_n$, is required to complete the update step. The initial estimate $x_0$ must be obtained using the reciprocal estimate instruction.

For each element in the vector, the following equation can be used to improve the estimates of the reciprocals of the square roots:

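$$x_{n+1} = x_n \frac{3 - d x_n^2}{2},$$

where $x_n$ is the estimated reciprocal square root from the previous step, and $d$ is the number for which the reciprocal square root is desired. This iteration converges to $1/\sqrt{d}$ if $x_0$ is obtained using the reciprocal square root estimate instruction on $d$. The reciprocal square root step instruction computes $(3 - ab)/2$ for its two operands $a$ and $b$. With $a = d x_n$ and $b = x_n$, this is the factor

$$\frac{3 - d x_n^2}{2},$$

so two additional multiplications are required to complete the update step: one to form $d x_n$ beforehand, and one to multiply the result by $x_n$. The initial estimate $x_0$ must be obtained using the reciprocal square root estimate instruction.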

10.7.6.1 Syntax

Image

  • •  Image is either Image or Image.
  • •  T must be 2s, 4s, or 2d.
  • •  F is s or d.

10.7.6.2 Operations

Image

10.7.6.3 Examples

Image
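
The refinement procedure can be tried out numerically with a short Python sketch. The `estimate` helper is a stand-in invented for this example: it simply rounds the true value to about eight significant bits, and does not reproduce the hardware's estimate. The function names and the number of steps are likewise assumptions.

```python
import math

def estimate(value, bits=8):
    """Stand-in for a hardware estimate: round to `bits` significant bits."""
    mantissa, exponent = math.frexp(value)
    scale = 1 << bits
    return math.ldexp(round(mantissa * scale) / scale, exponent)

def reciprocal(d, steps=2):
    x = estimate(1.0 / d)
    for _ in range(steps):
        factor = 2.0 - d * x            # the reciprocal step
        x = x * factor                  # one additional multiplication
    return x

def reciprocal_sqrt(d, steps=2):
    x = estimate(1.0 / math.sqrt(d))
    for _ in range(steps):
        dx = d * x                      # first additional multiplication
        factor = (3.0 - dx * x) / 2.0   # the reciprocal square root step
        x = x * factor                  # second additional multiplication
    return x

print(reciprocal(3.0), reciprocal_sqrt(2.0))
```

Each Newton-Raphson step roughly doubles the number of correct bits, so two steps starting from an eight-bit estimate are enough for single precision in this model.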

10.7.7 Multiply scalar by element

These instructions are used to multiply a scalar by one element of a vector:

fmul Vector by scalar floating point multiply,

fmla Vector by scalar floating point multiply accumulate,

fmls Vector by scalar floating point multiply subtract.

10.7.7.1 Syntax

Image

  • •  Ts must be s or d.
  • •  If Ts is s, then Vm must be in the range s0 to s15.
  • •  If x is present, then $0 \times \pm\infty = \pm 2$ (vector).

10.7.7.2 Operations

Image

10.7.7.3 Examples

Image

10.7.8 Saturating multiply scalar by element and double

These instructions perform multiplication, double the results, and perform saturation:

sqdmull Saturating Multiply Scalar by Element and Double,

sqdmlal Saturating Multiply Scalar by Element, Double, and Accumulate, and

sqdmlsl Saturating Multiply Scalar by Element, Double, and Subtract.

10.7.8.1 Syntax

Image

  • •  Image is either Image, Image, or Image.
  • •  Fd /Fn /Ts must be s /h /h or d /s /s.
  • •  If Ts is h, then Vm must be in the range Image.
  • •  F /Tc must be h /h or s /s.
  • •  If Tc is h, then Vc must be in the range Image.

10.7.8.2 Operations

Image

10.7.8.3 Examples

Image

10.8 Shift instructions

The Advanced SIMD shift instructions operate on vectors. Shifts are often used for multiplication and division by powers of two. The result of a left shift may be too large for the destination element, resulting in overflow. A right shift is equivalent to division by a power of two. In some cases, it may be useful to round the result of a division, rather than truncating. Advanced SIMD provides versions of the shift instructions which perform saturation and/or rounding of the result.
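
The overflow and rounding behaviour described above can be illustrated with a small Python model of a single element. The function names are hypothetical, and the rounding rule shown (add half of the discarded range before shifting) is the conventional one, assumed here for illustration.

```python
def shl_lane(x, n, bits):
    """Left shift of an unsigned element: bits shifted out are lost."""
    return (x << n) & ((1 << bits) - 1)

def shr_lane(x, n):
    """Arithmetic right shift: discarded bits are simply dropped."""
    return x >> n

def rshr_lane(x, n):
    """Rounding right shift (n >= 1): add half of the discarded range first."""
    return (x + (1 << (n - 1))) >> n

print(hex(shl_lane(0x81, 1, 8)))        # the top bit overflows
print(shr_lane(7, 1), rshr_lane(7, 1))  # 3.5 truncated versus rounded
```

Dividing 7 by 2 gives 3 when the discarded bit is dropped and 4 when the result is rounded.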

10.8.1 Vector shift left by immediate

These instructions shift each element in a vector left by an immediate value:

shl Unsigned Shift Left Immediate,

qshl Saturating Signed or Unsigned Shift Left Immediate,

sqshlu Saturating Signed Shift Left Immediate Unsigned, and

shll Signed or Unsigned Shift Left Immediate Long.

Overflow conditions can be avoided by using the saturating version, or by using the long version, in which case the destination is twice the size of the source.

10.8.1.1 Syntax

Image

  • •  Image is an alias for Image.
  • •  Image is an alias for Image.
  • •  Image is an alias for Image.
  • •  Image is an alias for Image.
  • •  T is 8b, 16b, 4h, 8h, 2s, 4s, or 2d.
  • •  If 2 is present, then Td /Ts is 8h /16b, 4s /8h, or 2d /4s.
  • •  If 2 is not present, then Td /Ts is 8h /8b, 4s /4h, or 2d /2s.
  • •  shift is in the range 0 to size(T) − 1.
  • •  If the instruction begins with Image, then the elements are treated as unsigned integers.
  • •  If Image is present, then the elements are treated as signed integers.

10.8.1.2 Operations

Image

10.8.1.3 Examples

Image

10.8.2 Vector shift right by immediate

These instructions shift each element in a vector right by an immediate value:

shr Shift Right Immediate,

rshr Shift Right Immediate and Round,

shrn Shift Right Immediate and Narrow,

rshrn Shift Right Immediate Round and Narrow,

sra Shift Right and Accumulate Immediate, and

rsra Shift Right Round and Accumulate Immediate.

10.8.2.1 Syntax

Image

  • •  T is 8b, 16b, 4h, 8h, 2s, 4s, or 2d.
  • •  If 2 is present, then Td /Ts is 16b /8h, 8h /4s, or 4s /2d.
  • •  If 2 is not present, then Td /Ts is 8h /8b, 4s /4h, or 2d /2s.
  • •  shift is in the range 1 to size(T) (or 1 to size(Td) for the narrowing forms).

10.8.2.2 Operations

Image

10.8.2.3 Examples

Image

10.8.3 Vector saturating shift right by immediate

These instructions shift each element in a quad word vector right by an immediate value:

qshrn Saturating Shift Right Narrow,

qrshrn Saturating Rounding Shift Right Narrow,

sqshrun Signed Saturating Shift Right Unsigned Narrow, and

sqrshrun Signed Saturating Rounding Shift Right Unsigned Immediate.

10.8.3.1 Syntax

Image

  • •  If Image is present, then Td /Ts is 16b /8h, 8h /4s, or 4s /2d.
  • •  If Image is not present, then Td /Ts is 8b /8h, 4h /4s, or 2s /2d.
  • •  shift is in the range 1 to size(Td).

10.8.3.2 Operations

Image

10.8.3.3 Examples

Image
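
The following Python sketch models one lane of a saturating narrowing shift, assuming a signed 16-bit source and an 8-bit destination. The function names follow the mnemonics loosely and are otherwise hypothetical.

```python
def sqshrn_lane(x, shift, dest_bits=8):
    """Shift a signed value right, then saturate to a signed narrow element."""
    lo, hi = -(1 << (dest_bits - 1)), (1 << (dest_bits - 1)) - 1
    return max(lo, min(hi, x >> shift))

def sqshrun_lane(x, shift, dest_bits=8):
    """Shift a signed value right, then saturate to an unsigned narrow
    element: negative results become zero."""
    return max(0, min((1 << dest_bits) - 1, x >> shift))

print(sqshrn_lane(1000, 1), sqshrn_lane(-1000, 1))
print(sqshrun_lane(1000, 1), sqshrun_lane(-5, 1))
```

The signed-to-unsigned form is handy when converting signed intermediate results into unsigned 8-bit data, since out-of-range values are clamped rather than wrapped.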

10.8.4 Shift left or right by variable

These instructions shift each element in a vector left or right, using the least significant byte of the corresponding element of a second vector as the shift amount:

shl Shift Left or Right by Variable,

rshl Shift Left or Right by Variable and Round,

qshl Saturating Shift Left or Right by Variable, and

qrshl Saturating Shift Left or Right by Variable and Round.

10.8.4.1 Syntax

Image

  • •  T is 8b, 16b, 4h, 8h, 2s, 4s, or 2d.
  • •  If double precision registers are specified (Dd, Dn, Dm ), then the operation is a scalar operation instead of a vector operation.
  • •  If the shift value is positive, the operation is a left shift.
  • •  If the shift value is negative, then it is a right shift.
  • •  A shift value of zero is equivalent to a move.
  • •  If the operation is a right shift, and Image is specified, then the result is rounded rather than truncated.
  • •  Results are saturated if Image is specified.

10.8.4.2 Operations

Image

10.8.4.3 Examples

Image
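
A signed shift by a variable amount can be modeled for one lane as below. This is a sketch under the assumptions listed in the syntax notes: the shift amount is the least significant byte of the second operand, interpreted as a signed number. The function names are hypothetical, and rounding and saturation are not modeled.

```python
def signed_low_byte(value):
    """Interpret the least significant byte of value as a signed number."""
    byte = value & 0xFF
    return byte - 256 if byte >= 128 else byte

def sshl_lane(x, shift_element, bits):
    """Shift x left for a positive amount, right for a negative amount."""
    amount = signed_low_byte(shift_element)
    result = x << amount if amount >= 0 else x >> -amount
    result &= (1 << bits) - 1               # keep only the element's bits
    if result >= 1 << (bits - 1):           # reinterpret as signed
        result -= 1 << bits
    return result

print(sshl_lane(16, 2, 16))       # left by 2
print(sshl_lane(16, 0xFE, 16))    # 0xFE is -2: right by 2
print(sshl_lane(16, 0x0100, 16))  # low byte is 0: unchanged
```

Because each lane takes its own shift amount, one instruction can shift some lanes left and others right.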

10.8.5 Shift and insert

These instructions perform bitwise shifting of each element in a vector, then combine bits from the source with bits from the destination. Fig. 10.7 provides an example. The instructions are:

sli Shift Left and Insert, and

sri Shift Right and Insert.

Image
Figure 10.7 Effects of sli v4.2d,v9.2d,#6.

10.8.5.1 Syntax

Image

  • •  T is 8b, 16b, 4h, 8h, 2s, 4s, or 2d.
  • •  Image must be l for a left shift, or r for a right shift.
  • •  Image is the amount that elements are to be shifted, and must be between zero and size(T) − 1 for a left shift, or between one and size(T) for a right shift.

10.8.5.2 Operations

Image

10.8.5.3 Examples

Image
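
The way shifted source bits are merged with existing destination bits can be sketched per element in Python. The function names are hypothetical, and the element width is passed explicitly as an assumption of the model.

```python
def sli_lane(dest, src, shift, bits):
    """Shift src left and insert: the low `shift` bits of dest survive."""
    mask = (1 << bits) - 1
    keep = (1 << shift) - 1
    return (dest & keep) | ((src << shift) & mask)

def sri_lane(dest, src, shift, bits):
    """Shift src right and insert: the high `shift` bits of dest survive."""
    mask = (1 << bits) - 1
    keep = mask & ~(mask >> shift)
    return (dest & keep) | ((src & mask) >> shift)

print(hex(sli_lane(0xFF, 0x01, 4, 8)))
print(hex(sri_lane(0xFF, 0x10, 4, 8)))
print(hex(sli_lane(0x3F, 0x1, 6, 64)))   # 64-bit element, shift of 6
```

These instructions are useful for packing bit fields, because two values can be combined without a separate mask and OR.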

10.8.6 Scalar shift left by immediate

These instructions shift a scalar left by an immediate value:

shl Unsigned Shift Left Immediate,

qshl Saturating Signed or Unsigned Shift Left Immediate,

sqshlu Saturating Signed Shift Left Immediate Unsigned, and

shll Signed or Unsigned Shift Left Immediate Long.

Overflow conditions can be avoided by using the saturating version, or by using the long version, in which case the destination is twice the size of the source.

10.8.6.1 Syntax

Image

  • •  F may be b, h, s, or d.
  • •  If the instruction begins with Image, then the scalars are treated as unsigned integers.
  • •  If Image is present, then the scalars are treated as signed integers.

10.8.6.2 Operations

Image

10.8.6.3 Examples

Image

10.8.7 Scalar shift right by immediate

These instructions shift a scalar right by an immediate value:

shr Shift Right Immediate,

rshr Shift Right Immediate and Round,

sra Shift Right and Accumulate Immediate, and

rsra Shift Right Round and Accumulate Immediate.

10.8.7.1 Syntax

Image

  • •  shift is in the range 1 to size(T) (or 1 to size(Td) for the narrowing forms).

10.8.7.2 Operations

Image

10.8.7.3 Examples

Image

10.8.8 Scalar saturating shift right by immediate

These instructions shift a scalar right by an immediate value, then saturate and narrow the result:

qshrn Saturating Shift Right Narrow,

qrshrn Saturating Rounding Shift Right Narrow,

sqshrun Signed Saturating Shift Right Unsigned Narrow, and

sqrshrun Signed Saturating Rounding Shift Right Unsigned Immediate.

10.8.8.1 Syntax

Image

  • •  <Fd>/<Fn> must be b /h, h /s, or s /d.

10.8.8.2 Operations

Image

10.8.8.3 Examples

Image

10.9 Unary arithmetic

Advanced SIMD provides several unary operations for integer and floating point values. It provides instructions for bitwise complement, negation, reversing bits and elements, and other operations.

10.9.1 Vector unary arithmetic

not Vector 1's Complement,

qadd Vector Saturating Accumulate,

fsqrt Vector Floating Point Square Root,

rbit Vector bit reverse,

rev Reverse Elements.

10.9.1.1 Syntax

Image

  • •  T is 8b or 16b.
  • •  Ta is 8b, 16b, 4h, 8h, 2s, 4s, or 2d.
  • •  Tb is 2s, 4s, or 2d.
  • •  Tc is 8b, 16b, 4h, or 8h.
  • •  Td is 8b, 16b, 4h, 8h, 2s, or 4s.

10.9.1.2 Operations

Image

10.9.1.3 Examples

Fig. 10.8 provides some illustrated examples of the rev instructions.

Image

Image
Figure 10.8 Examples of the rev instruction.
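
Element reversal can also be described with a short Python model, which may help when reading the figure. The register is modeled as a list of bytes in lane order; the function names and the way the container and element sizes are passed are assumptions of this sketch.

```python
def rev(data, container_bytes, elem_bytes):
    """Reverse the order of elem_bytes-sized elements inside each
    container_bytes-sized group of the byte list `data`."""
    out = []
    for base in range(0, len(data), container_bytes):
        group = data[base:base + container_bytes]
        elems = [group[i:i + elem_bytes]
                 for i in range(0, container_bytes, elem_bytes)]
        for elem in reversed(elems):
            out.extend(elem)
    return out

def rbit_byte(x):
    """Reverse the order of the bits within one byte."""
    return int(format(x & 0xFF, "08b")[::-1], 2)

lanes = [0, 1, 2, 3, 4, 5, 6, 7]
print(rev(lanes, 4, 1))   # bytes reversed within each 32-bit word
print(rev(lanes, 8, 2))   # halfwords reversed within a 64-bit doubleword
print(hex(rbit_byte(0x01)))
```

Reversing bytes within a word or doubleword is the usual way to convert between big-endian and little-endian data.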

10.9.2 Scalar unary arithmetic

abs Integer absolute value,

neg Integer negate,

qadd Vector Saturating Accumulate,

fsqrt Vector Floating Point Square Root,

rbit Vector bit reverse, and

rev Reverse Elements.

10.9.2.1 Syntax

Image

  • •  F is b, h, s, or d.

10.9.2.2 Operations

Image

10.9.2.3 Examples

Image

10.10 Vector reduce instructions

These instructions operate across all lanes in a vector, and produce a scalar.

10.10.1 Reduce across lanes

Advanced SIMD provides instructions for summing the elements in a vector, and for getting the maximum or minimum value from a vector. These instructions are:

addv Integer Sum Elements to Scalar,

maxv Integer Maximum Element to Scalar,

minv Integer Minimum Element to Scalar,

fmaxv Floating Point Maximum Element to Scalar, and

fminv Floating Point Minimum Element to Scalar.

There are long versions of the Image instruction which will prevent overflow.

10.10.1.1 Syntax

Image

  • •  F /T may be b /8b, b /16b, h /4h, h /8h, s /2s, or s /4s.
  • •  Fa /Tn may be h /8b, h /16b, s /4h, s /8h, d /2s, or d /4s.
  • •  If nm is present, then a comparison between a NaN and a numerical value will return the numerical value.

10.10.1.2 Operations

Image

10.10.1.3 Examples

Image
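
The reductions, and the reason a long version of the sum is useful, can be sketched in Python. Vectors are modeled as lists of unsigned elements, and the function names are hypothetical.

```python
def add_across(vec, bits):
    """Sum all lanes into a scalar of the same element size (wraps)."""
    return sum(vec) & ((1 << bits) - 1)

def add_long_across(vec, bits):
    """Sum all lanes into a scalar twice as wide as the elements."""
    return sum(vec) & ((1 << (2 * bits)) - 1)

def max_across(vec):
    """Largest lane."""
    return max(vec)

lanes = [200, 100, 50, 10]          # four unsigned 8-bit lanes
print(add_across(lanes, 8), add_long_across(lanes, 8), max_across(lanes))
```

The true sum, 360, does not fit in eight bits, so only the long form returns it intact.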

10.10.2 Reduce pairwise

The pairwise instructions are similar to the vector reduce instructions, but always operate on two elements of the source vector. These instructions are:

addp Integer Sum Elements to Scalar,

faddp Floating Point Sum Elements to Scalar,

fmaxp Floating Point Maximum Element to Scalar, and

fminp Floating Point Minimum Element to Scalar.

There are long versions of the Image instruction which will prevent overflow.

10.10.2.1 Syntax

Image

  • •  F /T must be either s /2s or d /2d.
  • •  If nm is present, then a comparison between a NaN and a numerical value will return the numerical value.

10.10.2.2 Operations

Image

10.10.2.3 Examples

Image

10.11 Comparison operations

Advanced SIMD provides instructions to perform comparisons between vectors. Since there are multiple pairs of items to be compared, the comparison instructions set one element in a result vector for each pair of items. After the comparison operation, each element of the result vector will have every bit set to zero (for false) or one (for true). If the elements of the result vector are interpreted as signed two's-complement numbers, then the value 0 represents false and the value −1 represents true. Consequently, summing the elements of the result vector (as signed integers) gives the negative of the number of comparisons which were true.
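
A small Python model makes the mask representation concrete. It assumes 32-bit lanes and a signed greater-than comparison; the function names are hypothetical, and `select` is a simplified stand-in for a bitwise select that uses the mask to choose between two vectors.

```python
BITS = 32
ONES = (1 << BITS) - 1

def compare_gt(n, m):
    """Per-lane mask: all ones where n > m, all zeros elsewhere."""
    return [ONES if a > b else 0 for a, b in zip(n, m)]

def as_signed(x):
    """Reinterpret a 32-bit pattern as a signed integer."""
    return x - (1 << BITS) if x >> (BITS - 1) else x

def select(mask, a, b):
    """Take bits from a where the mask is one, from b where it is zero."""
    return [(x & k) | (y & ~k & ONES) for k, x, y in zip(mask, a, b)]

n, m = [5, 1, 9, 3], [2, 4, 9, 0]
mask = compare_gt(n, m)
print([as_signed(k) for k in mask])          # -1 for true, 0 for false
print(-sum(as_signed(k) for k in mask))      # number of true comparisons
print(select(mask, n, m))                    # per-lane maximum, branch free
```

Masks like these replace conditional branches in vector code: both alternatives are computed, and the mask picks the wanted one in each lane.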

10.11.1 Vector compare mask

The following instructions perform comparisons of all of the corresponding elements of two vectors in parallel:

cm Vector integer compare mask, and

fcm Vector floating point compare mask.

10.11.1.1 Syntax

Image

  • •  <op> is one of eq, hs, ge, hi, gt, ls, le, lo, or lt.
  • •  <op2> is one of eq, ge, gt, le, or lt.
  • •  T is 8b, 16b, 4h, 8h, 2s, 4s, or 2d.
  • •  Tf is 2s, 4s, or 2d.
  • •  If the third operand is Image, then it is treated as a vector of the correct size in which every element is zero.
  • •  Image can be Image or Image.

10.11.1.2 Operations

Image

10.11.1.3 Examples

Image

10.11.2 Vector absolute compare mask

The following instruction performs comparisons between the absolute values of all of the corresponding elements of two vectors in parallel:

fac Vector floating point absolute compare mask.

10.11.2.1 Syntax

Image

  • •  <op> is one of ge, gt, le, or lt.
  • •  T is 2s, 4s, or 2d.

10.11.2.2 Operations

Image

10.11.2.3 Examples

Image

10.11.3 Vector test bits mask

Advanced SIMD provides the following vector version of the ARM Image instruction:

cmtst Vector test bits compare mask.

10.11.3.1 Syntax

Image

  • •  T is 8b, 16b, 4h, 8h, 2s, 4s, or 2d.

10.11.3.2 Operations

Image

10.11.3.3 Examples

Image

10.11.4 Scalar compare mask

The following instructions perform comparisons of the specified scalar registers:

cm Scalar integer compare mask, and

fcm Scalar floating point compare mask.

If the comparison is true, then all bits in the destination register are set to one. Otherwise, all bits in the destination register are set to zero.

10.11.4.1 Syntax

Image

  • •  The integer comparison can only operate on 64-bit integers.
  • •  F can be either s for single precision, or d for double precision floating point.
  • •  <op> is one of eq, hs, ge, hi, gt, ls, le, lo, or lt.
  • •  <op2> is one of eq, ge, gt, le, or lt.

10.11.4.2 Operations

Image

10.11.4.3 Examples

Image

10.11.5 Scalar absolute compare mask

The following instruction performs comparisons between the absolute values of two scalars:

fac Scalar floating point absolute compare mask.

10.11.5.1 Syntax

Image

  • •  <op> is one of ge, gt, le, or lt.
  • •  T is 2s, 4s, or 2d.

10.11.5.2 Operations

Image

10.11.5.3 Examples

Image

10.11.6 Scalar test bits mask

The scalar test bits instruction performs a logical AND operation between two source registers. If the result is not zero, then every bit in the result register is set to one. Otherwise, every bit in the result register is set to zero. The instruction is:

cmtst Scalar test bits compare mask.

10.11.6.1 Syntax

Image

  • •  Only 64-bit comparisons can be performed.

10.11.6.2 Operations

Image

10.11.6.3 Examples

Image

10.12 Performance mathematics: a final look at sine

In Chapter 9, two versions of the sine function were given. Those implementations used scalar FP/NEON instructions for single-precision and double-precision. Those previous implementations are already faster than the implementations provided by GCC. However, it may be possible to gain a little more performance by taking advantage of the Advanced SIMD architecture.

10.12.1 Single precision

Image

Listing 10.1 shows a single precision floating point implementation of the sine function, using Advanced SIMD vector instructions. It performs the same operations as the previous implementations of the sine function, but performs many of the calculations in parallel. This implementation is slightly faster than the previous version. In addition to being faster, it also uses nine terms of the Taylor series, so it should be more accurate as well.

10.12.2 Double precision

Image

Listing 10.2 shows a double precision floating point implementation of the sine function. It also uses Advanced SIMD vector instructions. Both of the implementations in this chapter are faster than the corresponding implementations in Chapter 9 because they use a large number of registers, do not contain loops, and are written carefully to use vector instructions, ordered so that multiple instructions can be at different stages in the pipeline at the same time. This technique of gaining performance is known as loop unrolling. In addition to being faster, the vector implementations use more terms of the Taylor series, so they may also be more accurate.

10.12.3 Performance comparison

Table 10.1 compares the implementations from Listing 10.1 and Listing 10.2 with the FP/NEON implementations from Chapter 9 and the sine function provided by GCC. When compiler optimization is not used, the single precision scalar FP/NEON implementation achieves a speedup of about 3.41, and the Advanced SIMD implementation achieves a speedup of about 4.21 compared to the GCC implementation. The double precision scalar FP/NEON implementation achieves a speedup of about 2.22, and the Advanced SIMD achieves a speedup of about 2.41 compared to the GCC implementation.

Table 10.1

Performance of sine function with various implementations.

| Optimization | Implementation | CPU seconds |
|---|---|---|
| None | Single Precision C | 6.958 |
| None | Single Precision FP/NEON scalar Assembly | 2.038 |
| None | Single Precision Advanced SIMD Assembly | 1.654 |
| None | Double Precision C | 6.682 |
| None | Double Precision FP/NEON scalar Assembly | 3.014 |
| None | Double Precision Advanced SIMD Assembly | 2.778 |
| Full | Single Precision C | 4.409 |
| Full | Single Precision FP/NEON scalar Assembly | 1.449 |
| Full | Single Precision Advanced SIMD Assembly | 1.142 |
| Full | Double Precision C | 5.758 |
| Full | Double Precision FP/NEON scalar Assembly | 2.134 |
| Full | Double Precision Advanced SIMD Assembly | 1.941 |

When the compiler optimization is used (-Ofast), the single precision scalar FP/NEON implementation achieves a speedup of about 3.04, and the Advanced SIMD implementation achieves a speedup of about 3.86 compared to the GCC implementation. The double precision scalar FP/NEON implementation achieves a speedup of about 2.70, and the Advanced SIMD implementation achieves a speedup of about 2.97 compared to the GCC implementation. The single precision Advanced SIMD version was 1.27 times as fast as the FP/NEON scalar version, and the double precision Advanced SIMD implementation was 1.10 times as fast as the FP/NEON scalar implementation.
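
The speedups quoted in this section follow directly from the CPU times in Table 10.1, and can be recomputed with a few lines of Python. The dictionary layout and function name are conveniences of this sketch; the numbers are copied from the table.

```python
# CPU seconds from Table 10.1, keyed by (optimization, precision).
CPU_SECONDS = {
    ("none", "single"): {"c": 6.958, "scalar": 2.038, "simd": 1.654},
    ("none", "double"): {"c": 6.682, "scalar": 3.014, "simd": 2.778},
    ("full", "single"): {"c": 4.409, "scalar": 1.449, "simd": 1.142},
    ("full", "double"): {"c": 5.758, "scalar": 2.134, "simd": 1.941},
}

def speedup(optimization, precision, implementation, baseline="c"):
    """Ratio of the baseline time to the implementation's time."""
    row = CPU_SECONDS[(optimization, precision)]
    return row[baseline] / row[implementation]

for opt, prec in CPU_SECONDS:
    print(opt, prec,
          round(speedup(opt, prec, "scalar"), 2),
          round(speedup(opt, prec, "simd"), 2),
          round(speedup(opt, prec, "simd", baseline="scalar"), 2))
```

The last column of the output is the gain of the Advanced SIMD version over the scalar FP/NEON version, which is the modest improvement discussed above.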

Although the FP/NEON versions of the sine functions were already much faster than the C standard library, re-writing them using Advanced SIMD resulted in further performance improvement. The take-away lesson is that a programmer can improve performance by writing some functions in assembly that are specifically targeted to run on a specific platform. To achieve optimal or near-optimal performance, it is important for the programmer to be aware of advanced features available on the hardware platform that is being used.

10.13 Alphabetized list of advanced SIMD instructions

| Name | Page | Operation |
|---|---|---|
| aba | 354 | Vector integer absolute difference and accumulate |
| abal | 355 | Vector absolute difference and accumulate long |
| abd | 354 | Vector integer absolute difference |
| abdl | 355 | Vector absolute difference long |
| abs | 383 | Integer absolute value |
| abs | 356 | Vector absolute value |
| adalp | 353 | Vector add and accumulate long pairwise |
| add | 348 | Vector integer add |
| addhn | 351 | Vector add and narrow |
| addl | 348 | Vector add long |
| addlp | 353 | Vector add long pairwise |
| addp | 386 | Integer Sum Elements to Scalar |
| addp | 353 | Vector add pairwise |
| addv | 385 | Integer Sum Elements to Scalar |
| addw | 349 | Vector add wide |
| and | 346 | Vector bitwise AND |
| bic | 347 | Vector Bit clear immediate |
| bic | 346 | Vector bit clear |
| bif | 346 | Vector insert if false |
| bit | 346 | Vector insert if true |
| bsl | 346 | Vector bitwise select |
| cls | 359 | Vector count leading sign bits |
| clz | 359 | Vector count leading zero bits |
| cm | 391 | Scalar integer compare mask |
| cm | 388 | Vector integer compare mask |
| cmtst | 393 | Scalar test bits compare mask |
| cmtst | 390 | Vector test bits compare mask |
| cnt | 359 | Vector count set bits |
| cvtf | 343 | Vector convert integer or fixed point to floating point |
| dup | 333 | Duplicate Scalar |
| eor | 346 | Vector bitwise Exclusive-OR |
| ext | 339 | Byte Extract |
| faba | 355 | Vector floating point absolute difference and accumulate |
| fabd | 354 | Vector floating point absolute difference |
| fabs | 356 | Vector floating point absolute value |
| fac | 392 | Scalar floating point absolute compare mask |
| fac | 389 | Vector floating point absolute compare mask |
| fadd | 348 | Vector floating point add |
| faddp | 386 | Floating Point Sum Elements to Scalar |
| fcm | 391 | Scalar floating point compare mask |
| fcm | 388 | Vector floating point compare mask |
| fcvt | 343 | Vector convert floating point to integer or fixed point |
| fcvtl | 344 | Vector convert from half to single precision |
| fcvtn | 344 | Vector convert from single to half precision |
| fcvtxn | 344 | Vector convert from double to single precision |
| fdiv | 362 | Vector floating point divide |
| fmadd | 310 | Fused Multiply Accumulate |
| fmax | 357 | Vector floating point maximum |
| fmaxnm | 357 | Vector floating point maxnum |
| fmaxnmp | 357 | Vector floating point pairwise maxnum |
| fmaxp | 386 | Floating Point Maximum Element to Scalar |
| fmaxp | 357 | Vector floating point pairwise maximum |
| fmaxv | 385 | Floating Point Maximum Element to Scalar |
| fmin | 357 | Vector floating point minimum |
| fminnm | 357 | Vector floating point minnum |
| fminnmp | 357 | Vector floating point pairwise minnum |
| fminp | 387 | Floating Point Minimum Element to Scalar |
| fminp | 357 | Vector floating point pairwise minimum |
| fminv | 385 | Floating Point Minimum Element to Scalar |
| fmla | 364 | Vector by scalar floating point multiply accumulate |
| fmla | 370 | Vector by scalar floating point multiply accumulate |
| fmla | 362 | Vector floating point multiply accumulate |
| fmls | 364 | Vector by scalar floating point multiply subtract |
| fmls | 370 | Vector by scalar floating point multiply subtract |
| fmls | 362 | Vector floating point multiply subtract |
| fmov | 336 | Vector Floating Point Move Immediate |
| fmsub | 310 | Fused Multiply Subtract |
| fmul | 364 | Vector by scalar floating point multiply |
| fmul | 370 | Vector by scalar floating point multiply |
| fmul | 362 | Vector floating point multiply |
| fneg | 356 | Vector floating point negate |
| fnmadd | 310 | Fused Multiply Accumulate and Negate |
| fnmsub | 310 | Fused Multiply Subtract and Negate |
| frecps | 369 | Reciprocal Step |
| frint | 345 | Round Floating Point to Integer |
| frsqrts | 369 | Reciprocal Square Root Step |
| fsqrt | 382 | Vector Floating Point Square Root |
| fsqrt | 383 | Vector Floating Point Square Root |
| fsub | 349 | Vector floating point subtract |
| hadd | 352 | Vector halving add |
| hsub | 352 | Vector halving subtract |
| ld<n> | 329 | Load Multiple Structured Data |
| ld<n> | 327 | Load Structured Data |
| ld<n>r | 331 | Load Copies of Structured Data |
| max | 357 | Vector integer maximum |
| maxp | 357 | Vector integer pairwise maximum |
| maxv | 385 | Integer Maximum Element to Scalar |
| min | 357 | Vector integer minimum |
| minp | 357 | Vector integer pairwise minimum |
| minv | 385 | Integer Minimum Element to Scalar |
| mla | 364 | Vector by scalar integer multiply accumulate |
| mla | 362 | Vector integer multiply accumulate |
| mlal | 364 | Vector by scalar multiply accumulate long |
| mlal | 362 | Vector multiply accumulate long |
| mls | 364 | Vector by scalar integer multiply subtract |
| mls | 362 | Vector integer multiply subtract |
| mlsl | 364 | Vector by scalar multiply subtract long |
| mlsl | 362 | Vector multiply subtract long |
| mov | 334 | Copy element into vector |
| movi | 336 | Vector Move Immediate |
| mul | 364 | Vector by scalar integer multiply |
| mul | 362 | Vector integer multiply |
| mull | 364 | Vector by scalar multiply long |
| mull | 362 | Vector multiply long |
| mvni | 336 | Vector Move NOT Immediate |
| neg | 383 | Integer negate |
| neg | 356 | Vector negate |
| not | 382 | Vector 1's Complement |
| orn | 346 | Vector bitwise OR NOT |
| orr | 346 | Vector bitwise OR |
| orr | 347 | Vector bitwise OR immediate |
| pmul | 362 | Vector polynomial multiply |
| pmull | 362 | Vector polynomial multiply long |
| qadd | 360 | Scalar saturating add |
| qadd | 382 | Vector Saturating Accumulate |
| qadd | 383 | Vector Saturating Accumulate |
| qadd | 348 | Vector saturating add |
| qdmulh | 360 | Scalar saturating multiply (high half) |
| qrshl | 377 | Saturating Shift Left or Right by Variable and Round |
| qrshrn | 376 | Saturating Rounding Shift Right Narrow |
| qrshrn | 381 | Saturating Rounding Shift Right Narrow |
| qshl | 377 | Saturating Shift Left or Right by Variable |
| qshl | 373 | Saturating Signed or Unsigned Shift Left Immediate |
| qshl | 379 | Saturating Signed or Unsigned Shift Left Immediate |
| qshl | 360 | Scalar saturating shift left |
| qshrn | 376 | Saturating Shift Right Narrow |
| qshrn | 381 | Saturating Shift Right Narrow |
| qsub | 360 | Scalar saturating subtract |
| qsub | 349 | Vector saturating subtract |
| raddhn | 351 | Vector add, round, and narrow |
| rbit | 382 | Vector bit reverse |
| rbit | 384 | Vector bit reverse |
| recpe | 368 | Reciprocal Estimate |
| rev | 382 | Reverse Elements |
| rev | 384 | Reverse Elements |
| rhadd | 352 | Vector halving add and round |
| rshl | 377 | Shift Left or Right by Variable and Round |
| rshr | 374 | Shift Right Immediate and Round |
| rshr | 380 | Shift Right Immediate and Round |
| rshrn | 374 | Shift Right Immediate Round and Narrow |
| rsqrte | 368 | Reciprocal Square Root Estimate |
| rsra | 374 | Shift Right Round and Accumulate Immediate |
| rsra | 380 | Shift Right Round and Accumulate Immediate |
| rsubhn | 351 | Vector subtract, round, and narrow |
| shl | 377 | Shift Left or Right by Variable |
| shl | 373 | Unsigned Shift Left Immediate |
| shl | 379 | Unsigned Shift Left Immediate |
| shll | 373 | Signed or Unsigned Shift Left Immediate Long |
| shll | 379 | Signed or Unsigned Shift Left Immediate Long |
| shr | 374 | Shift Right Immediate |
| shr | 380 | Shift Right Immediate |
| shrn | 374 | Shift Right Immediate and Narrow |
| sli | 378 | Shift Left and Insert |
| smov | 334 | Copy signed integer element from vector to AARCH64 register |
| sqdmlal | 365 | Saturating Multiply Double Accumulate |
| sqdmlal | 371 | Saturating Multiply Scalar by Element, Double, and Accumulate |
| sqdmlsl | 365 | Saturating Multiply Double Subtract |
| sqdmlsl | 371 | Saturating Multiply Scalar by Element, Double, and Subtract |
| sqdmulh | 366 | Saturating Multiply Double (High) |
| sqdmull | 365 | Saturating Multiply Double |
| sqdmull | 371 | Saturating Multiply Scalar by Element and Double |
| sqrdmulh | 367 | Saturating Multiply Double and Round (High) |
| sqrshrun | 376 | Signed Saturating Rounding Shift Right Unsigned Immediate |
| sqrshrun | 381 | Signed Saturating Rounding Shift Right Unsigned Immediate |
| sqshlu | 373 | Saturating Signed Shift Left Immediate Unsigned |
| sqshlu | 379 | Saturating Signed Shift Left Immediate Unsigned |
| sqshrun | 376 | Signed Saturating Shift Right Unsigned Narrow |
| sqshrun | 381 | Signed Saturating Shift Right Unsigned Narrow |
| sra | 374 | Shift Right and Accumulate Immediate |
| sra | 380 | Shift Right and Accumulate Immediate |
| sri | 378 | Shift Right and Insert |
| st<n> | 329 | Store Multiple Structured Data |
| st<n> | 327 | Store Structured Data |
| sub | 349 | Vector integer subtract |
| subhn | 351 | Vector subtract and narrow |
| subl | 349 | Vector subtract long |
| subw | 349 | Vector subtract wide |
| tbl | 341 | Table Lookup |
| tbx | 341 | Table Lookup with Extend |
| trn | 337 | Transpose Matrix |
| umov | 334 | Copy unsigned integer element from vector to AARCH64 register |
| uzp | 339 | Unzip Vectors |
| zip | 339 | Zip Vectors |

10.14 Advanced SIMD intrinsics

The C compiler may provide C (and C++) programs direct access to the Advanced SIMD instructions through the Advanced SIMD intrinsics library. The intrinsics are a large set of functions that are built into the compiler. Most of the intrinsic functions map to one Advanced SIMD instruction. Additional functions are provided for typecasting (reinterpreting) SIMD vectors, so that the C compiler does not complain about mismatched types. It is usually shorter and more efficient to write the Advanced SIMD code directly as assembly language functions and link them to the C code. However, doing so requires knowledge of assembly language.

10.15 Chapter summary

Advanced SIMD can dramatically improve performance of algorithms that can take advantage of data parallelism. However, compiler support for automatically vectorizing and using Advanced SIMD instructions is still immature. Advanced SIMD intrinsics allow C and C++ programmers to access these instructions, by making them look like C functions. It is usually just as easy, and more concise, to write AArch64 assembly code as it is to use the intrinsic functions. A careful assembly language programmer can usually beat the compiler, sometimes by a wide margin.

Exercises

  1. 10.1.  What is the advantage of using IEEE half-precision? What is the disadvantage?
  2. 10.2.  Advanced SIMD achieved relatively modest performance gains on the sine function, when compared to FP/NEON.
    1. a.  Why?
    2. b.  List some tasks for which Advanced SIMD could significantly outperform scalar FP/NEON.
  3. 10.3.  There are some limitations on the size of the structure that can be loaded or stored using the Image and Image instructions. What are the limitations?
  4. 10.4.  The sine function in Listing 10.2 uses a technique known as “loop unrolling” to achieve higher performance. Name at least three reasons why this code is more efficient than using a loop.
  5. 10.5.  Reimplement the fixed-point sine function from Listing 8.7 using Advanced SIMD instructions. Hint: you should not need to use a loop. Compare the performance of your Advanced SIMD implementation with the performance of the original implementation.
  6. 10.6.  Reimplement Exercise 9.10 using Advanced SIMD instructions.
  7. 10.7.  Fixed point operations may be faster than floating point operations. Modify your code from the previous exercise so that it uses the following definitions for points and transformation matrices:
Image
     Use saturating instructions and/or any other techniques necessary to prevent overflow. Compare the performance of the two implementations.
