SIMD; Advanced SIMD load and store instructions; Advanced SIMD data movement instructions; Advanced SIMD bitwise logical operations; Advanced SIMD basic arithmetic instructions; Advanced SIMD multiplication; Advanced SIMD division; Advanced SIMD shift instructions; Advanced SIMD comparison operations
In addition to the FP/NEON instructions described in the previous chapter, AArch64 also supports Advanced SIMD instructions, which allow the programmer to treat the FP/NEON registers as vectors (arrays) of data. Advanced SIMD uses the same set of registers described in Chapter 9, but adds new views to provide the ability to access the registers in more ways. Advanced SIMD adds about 125 instructions and pseudo-instructions to support not only floating point, but also integer and fixed point.
A single Advanced SIMD instruction can operate on up to 128 bits, which may represent multiple integer, fixed point, or floating point numbers. For example, if two of the 128-bit registers each contain eight 16-bit integers, then a single Advanced SIMD instruction can add all eight integers from one register to the corresponding integers in the other register, resulting in eight simultaneous additions. For certain applications, this vector architecture can result in extremely fast and efficient implementations. Advanced SIMD is particularly useful for handling streaming video and audio, but can also give very good performance on floating point intensive tasks.
The 32 FP/NEON/Advanced SIMD registers, originally introduced in Chapter 9, can be accessed using various views. Some SIMD instructions use the byte, half-word, word, and double-word views from Chapter 9, but most of them use the Advanced SIMD views. Fig. 10.1 shows the different ways of viewing an Advanced SIMD register. Each register can be viewed as containing a vector of 2, 4, 8, or 16 elements, all of the same size and type. Individual elements of each vector can also be accessed by some instructions. The scalar register names and views introduced in Chapter 9 are also used by some instructions. A scalar can be 8 bits, 16 bits, 32 bits, or 64 bits. The instruction syntax is extended to refer to elements of a vector register by using an index, x. For example, Vy.4s[x] is element x in register Vy, where Vy is treated as a vector of four single-word (32-bit) elements.
The syntax of Advanced SIMD instructions can be described using an extension of the notation used throughout this book. Each instruction operates on certain types of register(s), and there are many registers. Advanced SIMD instruction syntax may use any of the following register definitions:
Xy Refers to a 64-bit AArch64 integer register.
Wy Refers to a 32-bit AArch64 integer register.
By Refers to the lower 8-bits, or byte, of an Advanced SIMD register.
Hy Refers to the lower 16-bits, or half-word, of an Advanced SIMD register.
Sy Refers to the lower 32-bits, or single-word, of an Advanced SIMD register.
Dy Refers to the lower 64-bits, or double-word, of an Advanced SIMD register.
Fy Is used to indicate either a single-word or double-word FP/NEON register. F must be either s for a single-word register, or d for a double-word register.
Vy A 128-bit Advanced SIMD register. y can be any valid register number.
Vy.T A 128-bit Advanced SIMD register, treated as a vector of elements of type T, where T may be one of:
b A vector of 16 bytes.
h A vector of 8 half-words.
s A vector of 4 words.
d A vector of 2 double-words.
Some instructions can only allow a subset of these types.
Vy.nT A 128-bit Advanced SIMD register, or the lower 64 bits of an Advanced SIMD register, treated as a vector of elements of type T, where T may be one of:
16b A 128-bit Advanced SIMD register, treated as a vector of sixteen bytes.
8b The lower 64 bits of an Advanced SIMD register, treated as a vector of eight bytes.
8h A 128-bit Advanced SIMD register, treated as a vector of eight half-words.
4h The lower 64 bits of an Advanced SIMD register, treated as a vector of four half-words.
4s A 128-bit Advanced SIMD register, treated as a vector of four words.
2s The lower 64 bits of an Advanced SIMD register, treated as a vector of two words.
2d A 128-bit Advanced SIMD register, treated as a vector of two double-words.
Some instructions can only allow a subset of these types.
Vy.nT[x] Element x of an Advanced SIMD register, treated as a vector of type nT.
Each instruction has its own set of restrictions on legal values for the registers and types used. For example, one form of an instruction that copies an integer register into a vector element uses a 32-bit AArch64 integer register as the source operand, and any element of any Advanced SIMD register as the destination. However, that form further requires the element type to be a 32-bit word, in order to match the size of the source register.
Instructions may have several forms. In those cases, the following syntax is used to specify possible forms:
{opt} Braces around a string indicate that the string is optional. For example, several operations have an optional r which indicates that the result is rounded instead of truncated.
(s|u) Parentheses indicate a choice between two or more possible characters or strings, separated by the pipe “|” character. For example, (s|u)shr would describe two forms for the shr instruction: sshr and ushr.
<Tn> A string inside the < and > symbols indicates a choice or special syntax that is too complex to be easily described using the parenthesis and pipe (a|b) syntax, and is described in the following text. It is also used to define a syntactical token when simply using a character would lead to confusion.
The following function definitions are used in describing the effects of many of the instructions:
The floor function maps a real number, x, to the largest integer that is less than or equal to x.
The saturate function limits the value of x to the highest or lowest value that can be stored in the destination register. Saturation is a method used to prevent overflow.
The round function maps a real number, x, to the nearest integer.
The narrow function reduces a 2n bit number to an n bit number, by taking the n least significant bits.
The extend function converts an n bit number to a 2n bit number, performing zero extension if the number is unsigned, or sign extension if the number is signed.
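The five helper functions above can be modeled in a few lines of Python. This is only a sketch for experimenting with the definitions: the function names and the `bits`, `n`, and `signed` parameters are illustrative choices, and the tie-breaking rule in `round_nearest` (ties round up) is an assumption, since the text only says "nearest integer".

```python
import math

def floor(x):
    """Largest integer that is less than or equal to x."""
    return math.floor(x)

def saturate(x, bits, signed=True):
    """Clamp x to the range of a `bits`-wide signed or unsigned integer."""
    if signed:
        lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    else:
        lo, hi = 0, (1 << bits) - 1
    return max(lo, min(hi, x))

def round_nearest(x):
    """Nearest integer; ties round up (an assumed convention)."""
    return math.floor(x + 0.5)

def narrow(x, n):
    """Keep the n least significant bits of a 2n-bit value."""
    return x & ((1 << n) - 1)

def extend(x, n, signed=True):
    """Widen an n-bit pattern to 2n bits by sign or zero extension."""
    x &= (1 << n) - 1
    if signed and x >> (n - 1):
        x -= 1 << n                      # reinterpret the pattern as negative
    return x & ((1 << (2 * n)) - 1)      # return the 2n-bit pattern
```

For instance, `saturate(200, 8)` clamps to the largest signed byte, while `extend(0x80, 8)` copies the sign bit into the upper byte.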
These instructions can be used to perform interleaving of data when structured data is loaded or stored. The data should be properly aligned for best performance. These instructions are very useful for common multimedia data types.
For example, image data is typically stored in arrays of pixels, where each pixel is a small data structure such as the struct shown in Listing 5.38. Since each pixel is three bytes, and a register is 8 bytes, loading a single pixel into one register would be inefficient. It would be much better to load multiple pixels at once, but a whole number of pixels will not fit in a register. It will take three doubleword or quadword registers to hold a whole number of pixels without wasting space, as shown in Fig. 10.2. This is the way data would be loaded using an Advanced SIMD instruction. Many image processing operations work best if each color “channel” is processed separately. The SIMD load and store instructions can be used to split the image data into color channels, where each channel is stored in a different register, as shown in Fig. 10.3.
Other examples of interleaved data include stereo audio, which is two interleaved channels, and surround sound, which may have up to nine interleaved channels. In all of these cases, most processing operations are simplified when the data is separated into non-interleaved channels.
These instructions are used to load and store structured data across multiple registers:
ld<n> Load Structured Data, and
st<n> Store Structured Data.
They can be used for interleaving or deinterleaving the data as it is loaded or stored, as shown in Fig. 10.3.
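The effect of deinterleaving on load and interleaving on store can be illustrated with a small Python model. It treats memory as a flat list of bytes holding three-byte pixels and mimics what a three-register structured load and store would do; the function names `ld3` and `st3`, the `lanes` parameter, and the sample pixel values are illustrative assumptions, not the instruction syntax.

```python
def ld3(memory, lanes=8):
    """Split 3-element structures into three vectors, one per channel."""
    chunk = memory[:3 * lanes]
    return chunk[0::3], chunk[1::3], chunk[2::3]

def st3(v0, v1, v2):
    """Merge three channel vectors back into structure order."""
    out = []
    for a, b, c in zip(v0, v1, v2):
        out += [a, b, c]
    return out

# Eight pixels stored as r, g, b, r, g, b, ...
pixels = [c for i in range(8) for c in (10 + i, 20 + i, 30 + i)]
red, green, blue = ld3(pixels)
```

After the load, `red`, `green`, and `blue` each hold one channel for eight pixels, and `st3(red, green, blue)` reproduces the original memory layout.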
These instructions are used to load and store multiple data structures across multiple registers with interleaving or deinterleaving:
ld<n> Load Multiple Structured Data, and
st<n> Store Multiple Structured Data.
10.2.2.1 Syntax
The registers must be consecutive. Register numbering wraps around, so register 0 is consecutive to register 31.
10.2.2.2 Operations
10.2.2.3 Examples
This instruction is used to load multiple copies of structured data across multiple registers:
ld<n>r Load Copies of Structured Data.
The data is copied to all lanes. This instruction is useful for initializing vectors for use in later instructions.
With the additional register views added by Advanced SIMD, there are many more ways to specify data movement. Instructions are provided to move data using the Advanced SIMD views, the FP/NEON views, and the AArch64 integer register views. This results in a large number of possible move instructions.
The duplicate instruction copies a scalar into every element of the destination vector. The scalar can be in an Advanced SIMD register or an AArch64 integer register. The instruction is:
dup Duplicate Scalar.
10.3.1.1 Syntax
10.3.1.2 Operations
10.3.1.3 Examples
These instructions copy a single element into or out of a vector:
mov Copy element into vector,
umov Copy unsigned integer element from vector to AArch64 register, and
smov Copy signed integer element from vector to AArch64 register.
10.3.2.1 Syntax
10.3.2.2 Operations
10.3.2.3 Examples
These instructions are used to load immediate data into the vector registers:
movi Vector Move Immediate,
mvni Vector Move NOT Immediate, and
fmov Vector Floating Point Move Immediate.
10.3.3.1 Syntax
sop | T | shift | Description |
---|---|---|---|
lsl | 4h or 8h | 0 or 8 | Replicate LSL(uimm8,shift) into each 16-bit element. |
lsl | 2s or 4s | 0, 8, 16, or 24 | Replicate LSL(uimm8,shift) into each 32-bit element. |
msl | 2s or 4s | 8 or 16 | Replicate MSL(uimm8,shift) into each 32-bit element. |
10.3.3.2 Operations
10.3.3.3 Examples
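As a rough illustration of the shifted-immediate forms, the following Python sketch models how an 8-bit immediate is shifted and replicated into each element. LSL fills the vacated low bits with zeros, while MSL fills them with ones. The function names and the `lanes`, `bits`, and `sop` parameters are hypothetical and exist only for this model.

```python
def lsl(uimm8, shift):
    """Shift left, filling the vacated low bits with zeros."""
    return uimm8 << shift

def msl(uimm8, shift):
    """Shift left, filling the vacated low bits with ones."""
    return (uimm8 << shift) | ((1 << shift) - 1)

def movi(uimm8, shift, lanes, bits, sop=lsl):
    """Replicate the shifted immediate into every element."""
    mask = (1 << bits) - 1
    return [sop(uimm8, shift) & mask] * lanes

def mvni(uimm8, shift, lanes, bits, sop=lsl):
    """Same as movi, but every element is bitwise inverted."""
    mask = (1 << bits) - 1
    return [~v & mask for v in movi(uimm8, shift, lanes, bits, sop)]
```

With an immediate of 0x12 and a shift of 8, the LSL form produces 0x1200 in each element and the MSL form produces 0x12FF.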
Advanced SIMD provides two versions of the transpose instruction that can be used together for transposing matrices. Fig. 10.4 shows two examples of this instruction. The instruction is:
trn Transpose Matrix.
Fig. 10.5 shows how the instruction can be used to transpose a matrix.
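The two versions of the transpose instruction (trn1 and trn2 in AArch64) can be modeled in Python as follows. The sketch treats each vector as a list of elements; it is meant to show the data movement only, and the 2×2 example matrix is an arbitrary choice.

```python
def trn1(a, b):
    """Interleave the even-numbered elements of a and b."""
    out = []
    for i in range(0, len(a), 2):
        out += [a[i], b[i]]
    return out

def trn2(a, b):
    """Interleave the odd-numbered elements of a and b."""
    out = []
    for i in range(0, len(a), 2):
        out += [a[i + 1], b[i + 1]]
    return out

# Rows of a 2x2 matrix; the two results are the rows of its transpose.
row0, row1 = [1, 2], [3, 4]
t0, t1 = trn1(row0, row1), trn2(row0, row1)
```

Larger matrices are handled by applying the same pair of operations to 2×2 blocks at successively larger element sizes.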
These instructions are used to interleave or deinterleave the data from two vectors, or to extract a run of consecutive bytes from a pair of vectors:
zip Zip Vectors,
uzp Unzip Vectors, and
ext Byte Extract.
Fig. 10.6 gives an example of the zip instruction. The uzp instruction performs the inverse operation.
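A short Python model may help to show how zipping, unzipping, and extraction rearrange elements. AArch64 provides each of zip and uzp in two variants (suffix 1 and 2) that produce the two halves of the full result; the list-based functions below are an illustrative sketch of that behavior, not the instruction syntax.

```python
def zip1(a, b):
    """Interleave the lower halves of a and b."""
    half = len(a) // 2
    return [x for pair in zip(a[:half], b[:half]) for x in pair]

def zip2(a, b):
    """Interleave the upper halves of a and b."""
    half = len(a) // 2
    return [x for pair in zip(a[half:], b[half:]) for x in pair]

def uzp1(a, b):
    """Even-numbered elements of a followed by b."""
    return (a + b)[0::2]

def uzp2(a, b):
    """Odd-numbered elements of a followed by b."""
    return (a + b)[1::2]

def ext(a, b, index):
    """Take len(a) consecutive elements of a followed by b, from index."""
    return (a + b)[index:index + len(a)]
```

Unzipping the two zipped results recovers the original vectors, which is the sense in which the operations are inverses.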
The table lookup instructions use indices held in one vector to lookup values from a table held in one or more other vectors. The resulting values are stored in the destination vector. The table lookup instructions are:
tbl Table Lookup, and
tbx Table Lookup with Extend.
10.3.6.1 Syntax
10.3.6.2 Operations
10.3.6.3 Examples
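The difference between the two table lookup instructions lies in how an out-of-range index is handled: tbl writes zero, while tbx leaves the destination element unchanged. The Python sketch below models that behavior with lists of byte values; the function signatures are illustrative only.

```python
def tbl(table, indices):
    """Look up each index; out-of-range indices give 0."""
    return [table[i] if i < len(table) else 0 for i in indices]

def tbx(dest, table, indices):
    """Look up each index; out-of-range indices keep the old value."""
    return [table[i] if i < len(table) else d
            for d, i in zip(dest, indices)]

table = [10, 11, 12, 13]
indices = [3, 0, 9, 1]          # index 9 is outside the table
```

Here `tbl(table, indices)` yields a zero in the third position, whereas `tbx` preserves whatever the destination held there.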
When high precision is not required, the IEEE half-precision format can be used to store floating point numbers in memory. This can reduce memory requirements by up to 50%. This can also result in a significant performance improvement, since only half as much data needs to be moved between the CPU and main memory. However, on most processors half-precision data must be converted to single or double precision before it is used in calculations. Advanced SIMD provides instructions to support conversion to and from IEEE half precision. There are also instructions to perform integer or fixed-point to floating-point conversions, and to convert between IEEE single and double precision.
These instructions can be used to perform data conversion between floating point and fixed point (or integer) on each element in a vector:
fcvt Vector convert floating point to integer or fixed point, and
cvtf Vector convert integer or fixed point to floating point.
The elements in the result vector must be the same size as the elements in the source vector. An out of range integer or fixed-point result will be saturated to the destination size.
Fixed point (or integer) arithmetic operations are up to twice as fast as floating point operations. In some cases it is much more efficient to make this conversion, perform the calculations, then convert the results back to floating point.
The rounding mode, written as <x> in the instruction names, may be one of:
N: round to nearest with ties to even,
A: round to nearest with ties away from zero,
P: round towards +∞,
M: round towards −∞, or
Z: round towards zero.
Name | Effect | Description |
---|---|---|
fcvt<x>s | Vd[]←fixed(Vm[],fbits) | Convert single precision to 32-bit signed fixed point or integer. |
fcvt<x>u | Vd[]←ufixed(Vm[],fbits) | Convert single precision to 32-bit unsigned fixed point or integer. |
scvtf | Vd[]←float(Vm[]) | Convert signed 32-bit fixed point or integer to single precision |
ucvtf | Vd[]←single(Vm[]) | Convert unsigned 32-bit fixed point or integer to single precision |
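The rounding modes and the saturating conversion can be sketched in Python for a single element. This is a simplified model of the effects listed in the table: `fbits` is the number of fraction bits, and the function names, default arguments, and the `bits` and `signed` parameters are assumptions made for the illustration.

```python
import math

def round_mode(x, mode):
    """Round x to an integer using one of the modes N, A, P, M, Z."""
    if mode == "N":                  # nearest, ties to even
        return round(x)              # Python's round() uses ties-to-even
    if mode == "A":                  # nearest, ties away from zero
        return int(math.copysign(math.floor(abs(x) + 0.5), x))
    if mode == "P":                  # towards +infinity
        return math.ceil(x)
    if mode == "M":                  # towards -infinity
        return math.floor(x)
    if mode == "Z":                  # towards zero
        return math.trunc(x)
    raise ValueError(mode)

def to_fixed(x, mode="Z", fbits=0, bits=32, signed=True):
    """Scale by 2**fbits, round, and saturate to the destination size."""
    value = round_mode(x * (1 << fbits), mode)
    if signed:
        lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    else:
        lo, hi = 0, (1 << bits) - 1
    return max(lo, min(hi, value))

def from_fixed(value, fbits=0):
    """Convert a fixed-point integer back to floating point."""
    return value / (1 << fbits)
```

For example, 1.75 with eight fraction bits becomes the integer 448, and a value that is too large for the destination is clamped rather than wrapped.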
The following instructions can be used to convert between floating point formats:
fcvtl Vector convert from half to single precision,
fcvtn Vector convert from single to half precision, and
fcvtxn Vector convert from double to single precision.
These instructions operate on vectors. There are additional conversion instructions available, but they only operate on scalar values.
Name | Effect | Description |
---|---|---|
fcvtl | Vd[]←single(Vn[]) | Convert half precision to single precision. |
fcvtn | Vd[]←half(Vn[]) | Convert single precision to half precision. |
fcvtxn | Vd[]←single(Vn[]) | Convert double precision to single precision. |
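The storage saving and the loss of precision from half precision can be observed with Python's standard `struct` module, which supports the IEEE half-precision format through the `e` format code. The sample values below are arbitrary; the sketch only demonstrates the format, not the conversion instructions themselves.

```python
import struct

values = [0.5, 1.0, 3.14159, 1000.0]

as_single = struct.pack("<4f", *values)   # four single-precision values
as_half = struct.pack("<4e", *values)     # the same values in half precision

# Converting back shows which values survived exactly.
restored = struct.unpack("<4e", as_half)
```

The half-precision buffer is half the size of the single-precision one. Values such as 0.5 and 1000.0 are preserved exactly, while 3.14159 is rounded to the nearest representable half-precision value.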
The following instruction can be used to round a vector of floating point values to integers:
frint Round Floating Point to Integer.
The rounding mode is selected by appending one of the following letters to the instruction name:
N: round to nearest with ties to even,
A: round to nearest with ties away from zero,
P: round towards +∞,
M: round towards −∞,
Z: round towards zero,
I: round using FPCR rounding mode, and
X: round using FPCR rounding mode with exactness test.
10.4.3.1 Syntax
10.4.3.2 Operations
Name | Effect | Description |
---|---|---|
frintx | Vd[]←roundx(Vn[],x) | Round to integer using specified rounding mode. |
10.4.3.3 Examples
Advanced SIMD provides instructions to perform bitwise logical operations on the vector register set. These operations add a great deal of power to the AArch64 processor.
The bitwise logical operations are:
and Vector bitwise AND,
orr Vector bitwise OR,
orn Vector bitwise OR NOT,
eor Vector bitwise Exclusive-OR,
bic Vector bit clear,
bif Vector insert if false,
bit Vector insert if true, and
bsl Vector bitwise select.
10.5.1.1 Syntax
10.5.1.2 Operations
10.5.1.3 Examples
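The less familiar bitwise operations can be modeled on a single 8-bit lane in Python. In this sketch, `d` stands for the old destination value and `n` and `m` for the first and second source operands; these names and the one-lane simplification are assumptions made for illustration.

```python
MASK = 0xFF  # one 8-bit lane is enough to show the bit patterns

def orn(n, m):
    """OR NOT: n OR (NOT m)."""
    return (n | ~m) & MASK

def bic(n, m):
    """Bit clear: n AND (NOT m)."""
    return n & ~m & MASK

def bsl(d, n, m):
    """Select: where d has a 1 take the bit from n, otherwise from m."""
    return ((d & n) | (~d & m)) & MASK

def bit(d, n, m):
    """Insert if true: take the bit from n where m has a 1."""
    return ((d & ~m) | (n & m)) & MASK

def bif(d, n, m):
    """Insert if false: take the bit from n where m has a 0."""
    return ((d & m) | (n & ~m)) & MASK
```

The three insert and select operations differ only in which operand acts as the selection mask and which operand is overwritten.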
Advanced SIMD provides immediate versions of the vector logical OR and bit clear instructions:
orr Vector bitwise OR immediate, and
bic Vector Bit clear immediate.
10.5.2.1 Syntax
10.5.2.2 Operations
Name | Effect | Description |
---|---|---|
orr | Vd[]←Vd[]∨(uimm8 ≪ shift) | Logical OR |
bic | Vd[]←Vd[]∧¬(uimm8 ≪ shift) | Bit Clear |
10.5.2.3 Examples
Advanced SIMD provides many instructions for addition, subtraction, and multiplication, but does not provide an integer divide instruction. When division cannot be avoided, it is performed by multiplying by the reciprocal, as was described in Chapter 7 and Chapter 8. When dividing by a constant, the reciprocal can be calculated in advance. For dividing by variables, Advanced SIMD provides instructions for quickly calculating the reciprocals of the elements in a vector. In most cases, this is faster than using a divide instruction. For floating point numbers, the FP/NEON divide instructions can be used.
The add and fadd instructions add corresponding elements in two vectors and store the results in the corresponding elements of the destination register. The sub and fsub instructions subtract elements in one vector from corresponding elements in another vector and store the results in the corresponding elements of the destination register. Other versions of the add and subtract instructions allow mismatched operand and destination sizes, and the saturating versions prevent overflow by limiting the range of the results. The following instructions perform vector addition and subtraction:
add Vector integer add,
fadd Vector floating point add,
qadd Vector saturating add,
addl Vector add long,
addw Vector add wide,
sub Vector integer subtract,
fsub Vector floating point subtract,
qsub Vector saturating subtract,
subl Vector subtract long, and
subw Vector subtract wide.
Opcode | Valid Values for T |
---|---|
<op> | 8b, 16b, 4h, 8h, 2s, 4s, or 2d |
f<op> | 2s, 4s, 2d |
(s|u)q<op> | 8b, 16b, 4h, 8h, 2s, 4s, 2d |
Opcode | Valid Values for Td /Ts |
---|---|
<sop>l | 8h /8b, 4s /4h, 2d /2s |
<sop>l2 | 8h /16b, 4s /8h, 2d /4s |
<sop>w | 8h /8b, 4s /4h, 2d /2s |
<sop>w2 | 8h /16b, 4s /8h, 2d /4s |
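The difference between the ordinary, saturating, and long forms of addition is easiest to see when a sum does not fit in the element. The following Python sketch models the unsigned versions on 8-bit elements; the function names follow the mnemonics, and the `bits` parameter is an assumption of the model.

```python
def wrap(x, bits):
    """Keep only the low `bits` bits, as a fixed-width register would."""
    return x & ((1 << bits) - 1)

def add(a, b, bits=8):
    """Ordinary add: results wrap around on overflow."""
    return [wrap(x + y, bits) for x, y in zip(a, b)]

def uqadd(a, b, bits=8):
    """Unsigned saturating add: results stop at the largest value."""
    top = (1 << bits) - 1
    return [min(x + y, top) for x, y in zip(a, b)]

def uaddl(a, b, bits=8):
    """Unsigned add long: results are 2*bits wide, so no carry is lost."""
    return [wrap(x + y, 2 * bits) for x, y in zip(a, b)]
```

Adding 250 and 10 as bytes wraps to 4 with the ordinary add, saturates to 255 with the saturating add, and gives the exact sum 260 with the long add.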
These instructions add or subtract the corresponding elements of two vectors and narrow the resulting elements by taking the upper (most significant) half:
addhn Vector add and narrow,
raddhn Vector add, round, and narrow,
subhn Vector subtract and narrow, and
rsubhn Vector subtract, round, and narrow.
The results are stored in the corresponding elements of the destination register. Results can be optionally rounded instead of truncated.
Opcode | Valid Values for Td /Ts |
---|---|
{r}<op>hn | 8b /8h, 4h /4s, 2s /2d |
{r}<op>hn2 | 16b /8h, 8h /4s, 4s /2d |
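The narrowing add can be sketched in Python as an add followed by keeping the upper half of each sum. The rounding version adds half of the discarded range before the upper half is taken. The single function below, with its `rounding` flag and `bits` parameter, is an illustrative model rather than the instruction syntax.

```python
def addhn(a, b, bits=16, rounding=False):
    """Add elements, then keep the upper half of each `bits`-wide sum."""
    half = bits // 2
    bias = (1 << (half - 1)) if rounding else 0   # half of the dropped range
    mask = (1 << bits) - 1
    return [((x + y + bias) & mask) >> half for x, y in zip(a, b)]
```

With 16-bit sources, a sum of 0x1280 narrows to 0x12 when truncated and to 0x13 when rounded.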
These instructions add or subtract corresponding integer elements from two vectors, then shift the result right by one bit:
hadd Vector halving add,
rhadd Vector halving add and round, and
hsub Vector halving subtract.
The results are stored in corresponding elements of the destination vector.
These instructions add vector elements pairwise:
addp Vector add pairwise,
addlp Vector add long pairwise, and
adalp Vector add and accumulate long pairwise.
The long versions can be used to prevent overflow.
These instructions subtract the elements of one vector from another and store or accumulate the absolute value of the results:
abd Vector integer absolute difference,
fabd Vector floating point absolute difference,
aba Vector integer absolute difference and accumulate,
faba Vector floating point absolute difference and accumulate,
abal Vector absolute difference and accumulate long, and
abdl Vector absolute difference long.
The long versions can be used to prevent overflow.
Opcode | Valid Types for Td /Ts |
---|---|
(s|u)<op>l | 8h /8b, 4s /4h, 2d /2s |
(s|u)<op>l2 | 8h /16b, 4s /8h, 2d /4s |
These operations compute the absolute value or negate each element in a vector:
abs Vector absolute value,
neg Vector negate,
fabs Vector floating point absolute value, and
fneg Vector floating point negate.
The saturating versions can be used to prevent overflow.
The following instructions select the maximum or minimum elements and store the results in the destination vector:
max Vector integer maximum,
min Vector integer minimum,
fmax Vector floating point maximum,
fmin Vector floating point minimum,
maxp Vector integer pairwise maximum,
minp Vector integer pairwise minimum,
fmaxp Vector floating point pairwise maximum,
fminp Vector floating point pairwise minimum,
fmaxnm Vector floating point maxnum,
fminnm Vector floating point minnum,
fmaxnmp Vector floating point pairwise maxnum, and
fminnmp Vector floating point pairwise minnum.
These instructions can be used to count leading sign bits or zeros, or to count the number of bits that are set, for each element in a vector:
cls Vector count leading sign bits,
clz Vector count leading zero bits, and
cnt Vector count set bits.
10.6.8.1 Syntax
10.6.8.2 Operations
10.6.8.3 Examples
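The three counting operations can be modeled for one element in Python. Counting leading sign bits means counting the bits after the sign bit that are equal to it. The function names follow the mnemonics, and the `bits` parameter (the element width) is an assumption of this sketch.

```python
def clz(x, bits=8):
    """Count leading zero bits in a `bits`-wide value."""
    x &= (1 << bits) - 1
    return bits - x.bit_length()

def cls(x, bits=8):
    """Count the bits after the sign bit that match the sign bit."""
    x &= (1 << bits) - 1
    if x >> (bits - 1):
        x = ~x & ((1 << bits) - 1)   # fold negative patterns onto positive ones
    return clz(x, bits) - 1

def cnt(x, bits=8):
    """Count the bits that are set."""
    return bin(x & ((1 << bits) - 1)).count("1")
```

For the byte 0xF0, three bits after the sign bit match it, so the leading sign bit count is 3, while the leading zero count is 0.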
The following instructions perform basic saturating operations on scalars:
qadd Scalar saturating add,
qsub Scalar saturating subtract,
qdmulh Scalar saturating multiply (high half), and
qshl Scalar saturating shift left.
10.6.9.1 Syntax
10.6.9.2 Operations
10.6.9.3 Examples
There is no integer divide instruction in Advanced SIMD. Integer division is accomplished with multiplication by the reciprocal, as was described in Chapter 7 and Chapter 8. For division by a constant, the constant reciprocal can be computed in advance, and simply loaded into a register. For division by a variable, special instructions are provided for computing the reciprocal.
These instructions are used to multiply the corresponding elements from two vectors:
mul Vector integer multiply,
mla Vector integer multiply accumulate,
mls Vector integer multiply subtract,
fmul Vector floating point multiply,
fdiv Vector floating point divide,
fmla Vector floating point multiply accumulate,
fmls Vector floating point multiply subtract,
mull Vector multiply long,
mlal Vector multiply accumulate long,
mlsl Vector multiply subtract long,
pmul Vector polynomial multiply, and
pmull Vector polynomial multiply long.
10.7.1.1 Syntax
10.7.1.2 Operations
10.7.1.3 Examples
These instructions are used to multiply each element in a vector by a scalar:
mul Vector by scalar integer multiply,
mla Vector by scalar integer multiply accumulate,
mls Vector by scalar integer multiply subtract,
fmul Vector by scalar floating point multiply,
fmla Vector by scalar floating point multiply accumulate,
fmls Vector by scalar floating point multiply subtract,
mull Vector by scalar multiply long,
mlal Vector by scalar multiply accumulate long, and
mlsl Vector by scalar multiply subtract long.
10.7.2.1 Syntax
10.7.2.2 Operations
10.7.2.3 Examples
These instructions perform multiplication, double the results, and perform saturation:
sqdmull Saturating Multiply Double,
sqdmlal Saturating Multiply Double Accumulate, and
sqdmlsl Saturating Multiply Double Subtract.
10.7.3.1 Syntax
10.7.3.2 Operations
10.7.3.3 Examples
These instructions perform multiplication, double the results, perform saturation, and store the high half of the results:
sqdmulh Saturating Multiply Double (High), and
sqrdmulh Saturating Multiply Double and Round (High).
10.7.4.1 Syntax
10.7.4.2 Operations
10.7.4.3 Examples
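A common use of the doubling multiply with high-half result is fixed-point multiplication of 16-bit fractions, where 0.5 is represented as 16384. The Python sketch below models that case for signed 16-bit elements; the helper `sat16` and the `rounding` flag are illustrative assumptions.

```python
def sat16(x):
    """Clamp to the signed 16-bit range."""
    return max(-32768, min(32767, x))

def sqdmulh(a, b, rounding=False):
    """Multiply, double, optionally round, and keep the high 16 bits."""
    bias = 0x8000 if rounding else 0
    return [sat16((2 * x * y + bias) >> 16) for x, y in zip(a, b)]
```

Multiplying 16384 by 16384 gives 8192, which represents 0.25. The only case that saturates is −32768 times −32768, whose doubled product does not fit in the result.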
In general, multiplication is faster than division. In many cases of vector arithmetic, it is faster to calculate reciprocals and use multiplication. These instructions perform the initial estimates of the reciprocal values:
recpe Reciprocal Estimate, and
rsqrte Reciprocal Square Root Estimate.
These work on floating point and unsigned fixed point vectors. The estimates from these instructions are accurate to within about eight bits. If higher accuracy is desired, then the Newton-Raphson method can be used to improve the initial estimates. For more information, see the Reciprocal Step instructions on page 369.
These instructions are used to perform one Newton-Raphson step for improving the reciprocal estimates:
frecps Reciprocal Step, and
frsqrts Reciprocal Square Root Step.
For each element in the vector, the following equation can be used to improve the estimates of the reciprocals:
$$x_{n+1} = x_n \left(2 - d\,x_n\right)$$
where $x_n$ is the estimated reciprocal from the previous step, and $d$ is the number for which the reciprocal is desired. This equation converges to $1/d$ if $x_0$ is obtained using recpe on $d$. The frecps instruction computes
$$2 - d\,x_n,$$
so one additional multiplication is required to complete the update step. The initial estimate $x_0$ must be obtained using the recpe instruction.
For each element in the vector, the following equation can be used to improve the estimates of the reciprocals of the square roots:
$$x_{n+1} = x_n \, \frac{3 - d\,x_n^2}{2}$$
where $x_n$ is the estimated reciprocal square root from the previous step, and $d$ is the number for which the reciprocal square root is desired. This equation converges to $1/\sqrt{d}$ if $x_0$ is obtained using rsqrte on $d$. The frsqrts instruction computes
$$\frac{3 - d\,x_n^2}{2},$$
so two additional multiplications are required to complete the update step. The initial estimate $x_0$ must be obtained using the rsqrte instruction.
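The refinement process can be tried out in Python. In this sketch, `frecps` and `frsqrts` return 2 − ab and (3 − ab)/2 for their two operands, and the surrounding multiplications complete each Newton-Raphson step. The function `coarse`, which keeps about eight significant bits of the true value, is only a stand-in for the hardware estimate instructions and is an assumption of this model.

```python
import math

def coarse(value, bits=8):
    """Keep about `bits` significant bits (stand-in for an estimate)."""
    m, e = math.frexp(value)
    return math.ldexp(round(m * (1 << bits)) / (1 << bits), e)

def frecps(a, b):
    """Reciprocal step: 2 - a*b."""
    return 2.0 - a * b

def frsqrts(a, b):
    """Reciprocal square root step: (3 - a*b) / 2."""
    return (3.0 - a * b) / 2.0

def reciprocal(d, steps=2):
    x = coarse(1.0 / d)                  # initial estimate
    for _ in range(steps):
        x = x * frecps(d, x)             # one extra multiplication per step
    return x

def reciprocal_sqrt(d, steps=2):
    x = coarse(1.0 / math.sqrt(d))       # initial estimate
    for _ in range(steps):
        x = x * frsqrts(d, x * x)        # two extra multiplications per step
    return x
```

Because each step roughly doubles the number of correct bits, two steps starting from an eight-bit estimate are enough to exceed single-precision accuracy in this model.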
These instructions are used to multiply each element in a vector by a scalar:
fmul Vector by scalar floating point multiply,
fmla Vector by scalar floating point multiply accumulate,
fmls Vector by scalar floating point multiply subtract.
10.7.7.1 Syntax
10.7.7.2 Operations
10.7.7.3 Examples
These instructions perform multiplication, double the results, and perform saturation:
sqdmull Saturating Multiply Scalar by Element and Double,
sqdmlal Saturating Multiply Scalar by Element, Double, and Accumulate, and
sqdmlsl Saturating Multiply Scalar by Element, Double, and Subtract.
10.7.8.1 Syntax
10.7.8.2 Operations
10.7.8.3 Examples
The Advanced SIMD shift instructions operate on vectors. Shifts are often used for multiplication and division by powers of two. The results of a left shift may be larger than the destination register, resulting in overflow. A shift right is equivalent to division by a power of two. In some cases, it may be useful to round the result of a division, rather than truncating. Advanced SIMD provides versions of the shift instruction which perform saturation and/or rounding of the result.
These instructions shift each element in a vector left by an immediate value:
shl Unsigned Shift Left Immediate,
qshl Saturating Signed or Unsigned Shift Left Immediate,
sqshlu Saturating Signed Shift Left Immediate Unsigned, and
shll Signed or Unsigned Shift Left Immediate Long.
Overflow conditions can be avoided by using the saturating version, or by using the long version, in which case the destination is twice the size of the source.
These instructions shift each element in a vector right by an immediate value:
shr Shift Right Immediate,
rshr Shift Right Immediate and Round,
shrn Shift Right Immediate and Narrow,
rshrn Shift Right Immediate Round and Narrow,
sra Shift Right and Accumulate Immediate, and
rsra Shift Right Round and Accumulate Immediate.
10.8.2.1 Syntax
10.8.2.2 Operations
10.8.2.3 Examples
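The effect of rounding, narrowing, and accumulating on a right shift can be modeled in Python for unsigned elements. The function names below are the unsigned mnemonics where they exist, and the `bits` parameters are assumptions of this sketch.

```python
def ushr(v, n):
    """Shift right: truncates, like integer division by 2**n."""
    return [x >> n for x in v]

def urshr(v, n):
    """Rounding shift right: add half of the divisor before shifting."""
    return [(x + (1 << (n - 1))) >> n for x in v]

def shrn(v, n, bits=16):
    """Shift right, then keep the low half of each `bits`-wide element."""
    return [(x >> n) & ((1 << (bits // 2)) - 1) for x in v]

def usra(acc, v, n, bits=8):
    """Shift right and add the result to the accumulator (wrapping)."""
    mask = (1 << bits) - 1
    return [(a + (x >> n)) & mask for a, x in zip(acc, v)]
```

Shifting 7 right by two bits gives 1 when truncated and 2 when rounded, since 7/4 is closer to 2.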
These instructions shift each element in a quad word vector right by an immediate value:
qshrn Saturating Shift Right Narrow,
qrshrn Saturating Rounding Shift Right Narrow,
sqshrun Signed Saturating Shift Right Unsigned Narrow, and
sqrshrun Signed Saturating Rounding Shift Right Unsigned Immediate.
10.8.3.1 Syntax
10.8.3.2 Operations
10.8.3.3 Examples
These instructions shift each element in a vector left or right, using the least significant byte of the corresponding element of a second vector as the shift amount:
shl Shift Left or Right by Variable,
rshl Shift Left or Right by Variable and Round,
qshl Saturating Shift Left or Right by Variable, and
qrshl Saturating Shift Left or Right by Variable and Round.
10.8.4.1 Syntax
10.8.4.2 Operations
10.8.4.3 Examples
These instructions perform bitwise shifting of each element in a vector, then combine bits from the source with bits from the destination. Fig. 10.7 provides an example. The instructions are:
sli Shift Left and Insert, and
sri Shift Right and Insert.
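The insert operations can be modeled in Python as a shift followed by a merge: the bits that the shift would normally fill with zeros are instead kept from the old destination value. In this sketch `d` is the old destination vector and `n` is the source vector; the `bits` parameter is an assumption of the model.

```python
def sli(d, n, shift, bits=8):
    """Shift left and insert: the low `shift` bits of d are preserved."""
    mask = (1 << bits) - 1
    keep = (1 << shift) - 1
    return [((x << shift) & mask) | (y & keep) for y, x in zip(d, n)]

def sri(d, n, shift, bits=8):
    """Shift right and insert: the high `shift` bits of d are preserved."""
    mask = (1 << bits) - 1
    keep = mask & ~(mask >> shift)
    return [(x >> shift) | (y & keep) for y, x in zip(d, n)]
```

These operations are convenient for packing several small fields into one element without a separate mask-and-OR sequence.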
These instructions shift each element in a vector left by an immediate value:
shl Unsigned Shift Left Immediate,
qshl Saturating Signed or Unsigned Shift Left Immediate,
sqshlu Saturating Signed Shift Left Immediate Unsigned, and
shll Signed or Unsigned Shift Left Immediate Long.
Overflow conditions can be avoided by using the saturating version, or by using the long version, in which case the destination is twice the size of the source.
These instructions shift each element in a vector right by an immediate value:
shr Shift Right Immediate,
rshr Shift Right Immediate and Round,
sra Shift Right and Accumulate Immediate, and
rsra Shift Right Round and Accumulate Immediate.
10.8.7.1 Syntax
10.8.7.2 Operations
10.8.7.3 Examples
These instructions shift each element in a quad word vector right by an immediate value:
qshrn Saturating Shift Right Narrow,
qrshrn Saturating Rounding Shift Right Narrow,
sqshrun Signed Saturating Shift Right Unsigned Narrow, and
sqrshrun Signed Saturating Rounding Shift Right Unsigned Immediate.
10.8.8.1 Syntax
10.8.8.2 Operations
10.8.8.3 Examples
Advanced SIMD provides several unary operations for integer and floating point values. It provides instructions for bitwise complement, negation, reversing bits and elements, and other operations.
not Vector 1's Complement,
qadd Vector Saturating Accumulate,
fsqrt Vector Floating Point Square Root,
rbit Vector bit reverse,
rev Reverse Elements.
Fig. 10.8 provides some illustrated examples of the instructions.
10.9.1.1 Syntax
10.9.1.2 Operations
10.9.1.3 Examples
abs Integer absolute value,
neg Integer negate,
qadd Vector Saturating Accumulate,
fsqrt Vector Floating Point Square Root,
rbit Vector bit reverse, and
rev Reverse Elements.
10.9.2.1 Syntax
10.9.2.2 Operations
10.9.2.3 Examples
These instructions operate across all lanes in a vector, and produce a scalar.
Advanced SIMD provides instructions for summing the elements in a vector, and for getting the maximum or minimum value from a vector. These instructions are:
addv Integer Sum Elements to Scalar,
maxv Integer Maximum Element to Scalar,
minv Integer Minimum Element to Scalar,
fmaxv Floating Point Maximum Element to Scalar, and
fminv Floating Point Minimum Element to Scalar.
There are long versions of the instruction which will prevent overflow.
The pairwise instructions are similar to the vector reduce instructions, but always operate on two elements of the source vector. These instructions are:
addp Integer Sum Elements to Scalar,
faddp Floating Point Sum Elements to Scalar,
fmaxp Floating Point Maximum Element to Scalar, and
fminp Floating Point Minimum Element to Scalar.
There are long versions of the instruction which will prevent overflow.
Advanced SIMD provides instructions to perform comparisons between vectors. Since there are multiple pairs of items to be compared, the comparison instructions set one element in a result vector for each pair of items. After the comparison operation, each element of the result vector will have every bit set to zero (for false) or one (for true). Note that if the elements of the result vector are interpreted as signed two's-complement numbers, then the value 0 represents false and the value −1 represents true. Note that summing the elements of the result vector (as signed integers) will give the two's complement of the number of comparisons which were true.
The following instructions perform comparisons of all of the corresponding elements of two vectors in parallel:
cm Vector integer compare mask, and
fcm Vector floating point compare mask.
10.11.1.1 Syntax
10.11.1.2 Operations
10.11.1.3 Examples
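The all-ones and all-zeros results of a vector comparison can be modeled in Python. The sketch uses a signed greater-than comparison (cmgt in AArch64) on 16-bit elements and then demonstrates the observation that summing the mask elements as signed integers yields the negated count of true comparisons. The helper `as_signed` and the sample data are assumptions for illustration.

```python
def cmgt(a, b, bits=16):
    """Per-element compare: all ones where a > b, otherwise all zeros."""
    ones = (1 << bits) - 1
    return [ones if x > y else 0 for x, y in zip(a, b)]

def as_signed(x, bits=16):
    """Interpret a bit pattern as a two's-complement signed integer."""
    return x - (1 << bits) if x >> (bits - 1) else x

a = [5, -3, 7, 0]
b = [1, 4, 7, -2]
mask = cmgt(a, b)
true_count = -sum(as_signed(m) for m in mask)
```

The resulting mask can also be used directly as the selection operand of a bitwise select, so that a comparison and a select together implement an element-wise conditional without branching.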
The following instruction performs comparisons between the absolute values of all of the corresponding elements of two vectors in parallel:
fac Vector floating point absolute compare mask.
10.11.2.1 Syntax
10.11.2.2 Operations
10.11.2.3 Examples
Advanced SIMD provides the following vector version of the ARM tst instruction:
cmtst Vector test bits compare mask.
10.11.3.1 Syntax
10.11.3.2 Operations
10.11.3.3 Examples
The following instructions perform comparisons of the specified scalar registers:
cm Scalar integer compare mask, and
fcm Scalar floating point compare mask.
If the comparison is true, then all bits in the destination register are set to one. Otherwise, all bits in the destination register are set to zero.
The following instruction performs comparisons between the absolute values of two scalars:
fac Scalar floating point absolute compare mask.
10.11.5.1 Syntax
10.11.5.2 Operations
10.11.5.3 Examples
The scalar test bits instruction performs a logical AND operation between two source registers. If the result is not zero, then every bit in the result register is set to one. Otherwise, every bit in the result register is set to zero. The instruction is:
cmtst Scalar test bits compare mask.
10.11.6.1 Syntax
10.11.6.2 Operations
10.11.6.3 Examples
In Chapter 9, two versions of the sine function were given. Those implementations used scalar FP/NEON instructions for single-precision and double-precision. Those previous implementations are already faster than the implementations provided by GCC. However, it may be possible to gain a little more performance by taking advantage of the Advanced SIMD architecture.
Listing 10.1 shows a single precision floating point implementation of the sine function, using Advanced SIMD vector instructions. It performs the same operations as the previous implementations of the sine function, but performs many of the calculations in parallel. This implementation is slightly faster than the previous version. In addition to being faster, it also uses nine terms of the Taylor series, so it should be more accurate as well.
Listing 10.2 shows a double precision floating point implementation of the sine function, which also uses Advanced SIMD vector instructions. Both of the implementations in this chapter are faster than the corresponding implementations in Chapter 9 because they use a large number of registers, do not contain loops, and are carefully written so that the vector instructions are ordered to allow multiple instructions to be at different stages in the pipeline at the same time. Replacing a loop with straight-line code in this way is known as loop unrolling. In addition to being faster, the vector implementations use more terms of the Taylor series, so they may also be more accurate.
Table 10.1 compares the implementations from Listing 10.1 and Listing 10.2 with the FP/NEON implementations from Chapter 9 and the sine function provided by GCC. When compiler optimization is not used, the single precision scalar FP/NEON implementation achieves a speedup of about 3.41, and the single precision Advanced SIMD implementation achieves a speedup of about 4.21 compared to the GCC implementation. The double precision scalar FP/NEON implementation achieves a speedup of about 2.22, and the double precision Advanced SIMD implementation achieves a speedup of about 2.41 compared to the GCC implementation.
Table 10.1
Performance of sine function with various implementations.
Optimization | Implementation | CPU seconds |
---|---|---|
None | Single Precision C | 6.958 |
None | Single Precision FP/NEON scalar Assembly | 2.038 |
None | Single Precision Advanced SIMD Assembly | 1.654 |
None | Double Precision C | 6.682 |
None | Double Precision FP/NEON scalar Assembly | 3.014 |
None | Double Precision Advanced SIMD Assembly | 2.778 |
Full | Single Precision C | 4.409 |
Full | Single Precision FP/NEON scalar Assembly | 1.449 |
Full | Single Precision Advanced SIMD Assembly | 1.142 |
Full | Double Precision C | 5.758 |
Full | Double Precision FP/NEON scalar Assembly | 2.134 |
Full | Double Precision Advanced SIMD Assembly | 1.941 |
When the compiler optimization is used (-Ofast), the single precision scalar FP/NEON implementation achieves a speedup of about 3.04, and the Advanced SIMD implementation achieves a speedup of about 3.86 compared to the GCC implementation. The double precision scalar FP/NEON implementation achieves a speedup of about 2.70, and the Advanced SIMD implementation achieves a speedup of about 2.97 compared to the GCC implementation. The single precision Advanced SIMD version was 1.27 times as fast as the FP/NEON scalar version, and the double precision Advanced SIMD implementation was 1.10 times as fast as the FP/NEON scalar implementation.
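The speedup figures quoted for Table 10.1 are ratios of CPU times: the time of the baseline implementation divided by the time of the implementation being evaluated. The following small C function, whose name `speedup` is hypothetical, shows the arithmetic.

```c
/* Speedup of a candidate implementation relative to a baseline:
   baseline time divided by candidate time. Values above 1.0 mean
   the candidate is faster. */
double speedup(double baseline_seconds, double candidate_seconds)
{
    return baseline_seconds / candidate_seconds;
}
```

Using the times from Table 10.1, `speedup(6.958, 2.038)` is about 3.41 for the unoptimized single precision scalar case, `speedup(4.409, 1.142)` is about 3.86 for the optimized single precision Advanced SIMD case, and `speedup(1.449, 1.142)` is about 1.27 when the optimized Advanced SIMD version is compared with the scalar assembly version.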
Although the FP/NEON versions of the sine functions were already much faster than the C standard library, rewriting them using Advanced SIMD resulted in further performance improvement. The take-away lesson is that a programmer can improve performance by writing some functions in assembly, targeted at the specific platform on which they will run. To achieve optimal or near-optimal performance, it is important for the programmer to be aware of advanced features available on the hardware platform that is being used.
Name | Page | Operation |
---|---|---|
aba | 354 | Vector integer absolute difference and accumulate |
abal | 355 | Vector absolute difference and accumulate long |
abd | 354 | Vector integer absolute difference |
abdl | 355 | Vector absolute difference long |
abs | 383 | Integer absolute value |
abs | 356 | Vector absolute value |
adalp | 353 | Vector add and accumulate long pairwise |
add | 348 | Vector integer add |
addhn | 351 | Vector add and narrow |
addl | 348 | Vector add long |
addlp | 353 | Vector add long pairwise |
addp | 386 | Integer Sum Elements to Scalar |
addp | 353 | Vector add pairwise |
addv | 385 | Integer Sum Elements to Scalar |
addw | 349 | Vector add wide |
and | 346 | Vector bitwise AND |
bic | 347 | Vector Bit clear immediate |
bic | 346 | Vector bit clear |
bif | 346 | Vector insert if false |
bit | 346 | Vector insert if true |
bsl | 346 | Vector bitwise select |
cls | 359 | Vector count leading sign bits |
clz | 359 | Vector count leading zero bits |
cm | 391 | Scalar integer compare mask |
cm | 388 | Vector integer compare mask |
cmtst | 393 | Scalar test bits compare mask |
cmtst | 390 | Vector test bits compare mask |
cnt | 359 | Vector count set bits |
cvtf | 343 | Vector convert integer or fixed point to floating point |
dup | 333 | Duplicate Scalar |
eor | 346 | Vector bitwise Exclusive-OR |
ext | 339 | Byte Extract |
faba | 355 | Vector floating point absolute difference and accumulate |
fabd | 354 | Vector floating point absolute difference |
fabs | 356 | Vector floating point absolute value |
fac | 392 | Scalar floating point absolute compare mask |
fac | 389 | Vector floating point absolute compare mask |
fadd | 348 | Vector floating point add |
faddp | 386 | Floating Point Sum Elements to Scalar |
fcm | 391 | Scalar floating point compare mask |
fcm | 388 | Vector floating point compare mask |
fcvt | 343 | Vector convert floating point to integer or fixed point |
fcvtl | 344 | Vector convert from half to single precision |
fcvtn | 344 | Vector convert from single to half precision |
fcvtxn | 344 | Vector convert from double to single precision |
fdiv | 362 | Vector floating point divide |
fmadd | 310 | Fused Multiply Accumulate |
fmax | 357 | Vector floating point maximum |
fmaxnm | 357 | Vector floating point maxnum |
fmaxnmp | 357 | Vector floating point pairwise maxnum |
fmaxp | 386 | Floating Point Maximum Element to Scalar |
fmaxp | 357 | Vector floating point pairwise maximum |
fmaxv | 385 | Floating Point Maximum Element to Scalar |
fmin | 357 | Vector floating point minimum |
fminnm | 357 | Vector floating point minnum |
fminnmp | 357 | Vector floating point pairwise minnum |
fminp | 387 | Floating Point Minimum Element to Scalar |
fminp | 357 | Vector floating point pairwise minimum |
fminv | 385 | Floating Point Minimum Element to Scalar |
fmla | 364 | Vector by scalar floating point multiply accumulate |
fmla | 370 | Vector by scalar floating point multiply accumulate |
fmla | 362 | Vector floating point multiply accumulate |
fmls | 364 | Vector by scalar floating point multiply subtract |
fmls | 370 | Vector by scalar floating point multiply subtract |
fmls | 362 | Vector floating point multiply subtract |
fmov | 336 | Vector Floating Point Move Immediate |
fmsub | 310 | Fused Multiply Subtract |
fmul | 364 | Vector by scalar floating point multiply |
fmul | 370 | Vector by scalar floating point multiply |
fmul | 362 | Vector floating point multiply |
fneg | 356 | Vector floating point negate |
fnmadd | 310 | Fused Multiply Accumulate and Negate |
fnmsub | 310 | Fused Multiply Subtract and Negate |
frecps | 369 | Reciprocal Step |
frint | 345 | Round Floating Point to Integer |
frsqrts | 369 | Reciprocal Square Root Step |
fsqrt | 382 | Vector Floating Point Square Root |
fsqrt | 383 | Vector Floating Point Square Root |
fsub | 349 | Vector floating point subtract |
hadd | 352 | Vector halving add |
hsub | 352 | Vector halving subtract |
ld<n> | 329 | Load Multiple Structured Data |
ld<n> | 327 | Load Structured Data |
ld<n>r | 331 | Load Copies of Structured Data |
max | 357 | Vector integer maximum |
maxp | 357 | Vector integer pairwise maximum |
maxv | 385 | Integer Maximum Element to Scalar |
min | 357 | Vector integer minimum |
minp | 357 | Vector integer pairwise minimum |
minv | 385 | Integer Minimum Element to Scalar |
mla | 364 | Vector by scalar integer multiply accumulate |
mla | 362 | Vector integer multiply accumulate |
mlal | 364 | Vector by scalar multiply accumulate long |
mlal | 362 | Vector multiply accumulate long |
mls | 364 | Vector by scalar integer multiply subtract |
mls | 362 | Vector integer multiply subtract |
mlsl | 364 | Vector by scalar multiply subtract long |
mlsl | 362 | Vector multiply subtract long |
mov | 334 | Copy element into vector |
movi | 336 | Vector Move Immediate |
mul | 364 | Vector by scalar integer multiply |
mul | 362 | Vector integer multiply |
mull | 364 | Vector by scalar multiply long |
mull | 362 | Vector multiply long |
mvni | 336 | Vector Move NOT Immediate |
neg | 383 | Integer negate |
neg | 356 | Vector negate |
not | 382 | Vector 1's Complement |
orn | 346 | Vector bitwise OR NOT |
orr | 346 | Vector bitwise OR |
orr | 347 | Vector bitwise OR immediate |
pmul | 362 | Vector polynomial multiply |
pmull | 362 | Vector polynomial multiply long |
qadd | 360 | Scalar saturating add |
qadd | 382 | Vector Saturating Accumulate |
qadd | 383 | Vector Saturating Accumulate |
qadd | 348 | Vector saturating add |
qdmulh | 360 | Scalar saturating multiply (high half) |
qrshl | 377 | Saturating Shift Left or Right by Variable and Round |
qrshrn | 376 | Saturating Rounding Shift Right Narrow |
qrshrn | 381 | Saturating Rounding Shift Right Narrow |
qshl | 377 | Saturating Shift Left or Right by Variable |
qshl | 373 | Saturating Signed or Unsigned Shift Left Immediate |
qshl | 379 | Saturating Signed or Unsigned Shift Left Immediate |
qshl | 360 | Scalar saturating shift left |
qshrn | 376 | Saturating Shift Right Narrow |
qshrn | 381 | Saturating Shift Right Narrow |
qsub | 360 | Scalar saturating subtract |
qsub | 349 | Vector saturating subtract |
raddhn | 351 | Vector add, round, and narrow |
rbit | 382 | Vector bit reverse |
rbit | 384 | Vector bit reverse |
recpe | 368 | Reciprocal Estimate |
rev | 382 | Reverse Elements |
rev | 384 | Reverse Elements |
rhadd | 352 | Vector halving add and round |
rshl | 377 | Shift Left or Right by Variable and Round |
rshr | 374 | Shift Right Immediate and Round |
rshr | 380 | Shift Right Immediate and Round |
rshrn | 374 | Shift Right Immediate Round and Narrow |
rsqrte | 368 | Reciprocal Square Root Estimate |
rsra | 374 | Shift Right Round and Accumulate Immediate |
rsra | 380 | Shift Right Round and Accumulate Immediate |
rsubhn | 351 | Vector subtract, round, and narrow |
shl | 377 | Shift Left or Right by Variable |
shl | 373 | Unsigned Shift Left Immediate |
shl | 379 | Unsigned Shift Left Immediate |
shll | 373 | Signed or Unsigned Shift Left Immediate Long |
shll | 379 | Signed or Unsigned Shift Left Immediate Long |
shr | 374 | Shift Right Immediate |
shr | 380 | Shift Right Immediate |
shrn | 374 | Shift Right Immediate and Narrow |
sli | 378 | Shift Left and Insert |
smov | 334 | Copy signed integer element from vector to AARCH64 register |
sqdmlal | 365 | Saturating Multiply Double Accumulate |
sqdmlal | 371 | Saturating Multiply Scalar by Element, Double, and Accumulate |
sqdmlsl | 365 | Saturating Multiply Double Subtract |
sqdmlsl | 371 | Saturating Multiply Scalar by Element, Double, and Subtract |
sqdmulh | 366 | Saturating Multiply Double (High) |
sqdmull | 365 | Saturating Multiply Double |
sqdmull | 371 | Saturating Multiply Scalar by Element and Double |
sqrdmulh | 367 | Saturating Multiply Double and Round (High) |
sqrshrun | 376 | Signed Saturating Rounding Shift Right Unsigned Immediate |
sqrshrun | 381 | Signed Saturating Rounding Shift Right Unsigned Immediate |
sqshlu | 373 | Saturating Signed Shift Left Immediate Unsigned |
sqshlu | 379 | Saturating Signed Shift Left Immediate Unsigned |
sqshrun | 376 | Signed Saturating Shift Right Unsigned Narrow |
sqshrun | 381 | Signed Saturating Shift Right Unsigned Narrow |
sra | 374 | Shift Right and Accumulate Immediate |
sra | 380 | Shift Right and Accumulate Immediate |
sri | 378 | Shift Right and Insert |
st<n> | 329 | Store Multiple Structured Data |
st<n> | 327 | Store Structured Data |
sub | 349 | Vector integer subtract |
subhn | 351 | Vector subtract and narrow |
subl | 349 | Vector subtract long |
subw | 349 | Vector subtract wide |
tbl | 341 | Table Lookup |
tbx | 341 | Table Lookup with Extend |
trn | 337 | Transpose Matrix |
umov | 334 | Copy unsigned integer element from vector to AARCH64 register |
uzp | 339 | Unzip Vectors |
zip | 339 | Zip Vectors |
The C compiler may provide C (and C++) programs direct access to the Advanced SIMD instructions through the Advanced SIMD intrinsics library. The intrinsics are a large set of functions that are built into the compiler. Most of the intrinsic functions map to one Advanced SIMD instruction. There are additional functions provided for typecasting (reinterpreting) SIMD vectors, so that the C compiler does not complain about mismatched types. It is usually shorter and more efficient to write the Advanced SIMD code directly as assembly language functions and link them to the C code. However, that option is only available to programmers who know assembly language.
Advanced SIMD can dramatically improve performance of algorithms that can take advantage of data parallelism. However, compiler support for automatically vectorizing and using Advanced SIMD instructions is still immature. Advanced SIMD intrinsics allow C and C++ programmers to access these instructions by making them look like C functions. Writing AArch64 assembly code is usually just as easy as using the intrinsic functions, and the result is more concise. A careful assembly language programmer can usually beat the compiler, sometimes by a wide margin.
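As a brief illustration of what the intrinsics look like, the sketch below adds two arrays of single precision values. It assumes a compiler that provides the `arm_neon.h` header and defines the `__ARM_NEON` macro when targeting AArch64; on any other target the same function falls back to plain scalar C. The function name `add_arrays` is hypothetical. The intrinsics `vld1q_f32`, `vaddq_f32`, and `vst1q_f32` load, add, and store four 32-bit floating point elements at a time.

```c
#include <stddef.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Adds two float arrays element by element: dst[i] = a[i] + b[i]. */
void add_arrays(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;

#if defined(__ARM_NEON)
    /* Process four elements per iteration using Advanced SIMD. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);      /* load four floats from a */
        float32x4_t vb = vld1q_f32(b + i);      /* load four floats from b */
        vst1q_f32(dst + i, vaddq_f32(va, vb));  /* add and store four results */
    }
#endif

    /* Handle any remaining elements (or all of them on non-NEON targets). */
    for (; i < n; i++) {
        dst[i] = a[i] + b[i];
    }
}
```

Each intrinsic call in the vector loop corresponds closely to a single Advanced SIMD instruction, although the compiler remains free to choose registers and to schedule the instructions.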