© Daniel Kusswurm 2018
Daniel Kusswurm, Modern X86 Assembly Language Programming, https://doi.org/10.1007/978-1-4842-4063-2_9

9. AVX2 Programming – Packed Floating-Point

Daniel Kusswurm, Geneva, IL, USA

In Chapter 6, you learned how to use the AVX instruction set to perform packed floating-point operations using the XMM register set and 128-bit wide operands. In this chapter, you learn how to carry out packed floating-point operations using the YMM register set and 256-bit wide operands. The chapter begins with a simple example that demonstrates the basics of packed floating-point arithmetic and YMM register use. This is followed by three source code examples that illustrate how to perform packed calculations with floating-point arrays.

Chapter 6 also presented source code examples that exploited the AVX instruction set to accelerate matrix transposition and multiplication using single-precision floating-point values. In this chapter, you learn how to perform these same calculations using double-precision floating-point values. You also study a source code example that computes the inverse of a matrix. The final two source code examples in this chapter explain how to perform data blends, permutes, and gathers using packed floating-point operands.

You may recall that the source code examples in Chapter 6 used only XMM register operands with AVX instructions. This was done to avoid information overload and maintain a reasonable chapter length. Nearly all AVX floating-point instructions can use either the XMM or YMM registers as operands. Many of the source code examples in this chapter will run on a processor that supports AVX. The function names in these examples use the prefix Avx. Similarly, source code examples that require an AVX2-compatible processor use the function name prefix Avx2. You can use one of the freely available tools listed in Appendix A to determine whether your computer supports only AVX or both AVX and AVX2.

Packed Floating-Point Arithmetic

Listing 9-1 shows the source code for example Ch09_01. This example illustrates how to perform common arithmetic operations using 256-bit wide single-precision and double-precision floating-point operands. It also illustrates how to use the vzeroupper instruction and several MASM directives for 256-bit wide operands.
//------------------------------------------------
//        YmmVal.h
//------------------------------------------------
#pragma once
#include <string>
#include <cstdint>
#include <sstream>
#include <iomanip>
struct YmmVal
{
public:
  union
  {
    int8_t m_I8[32];
    int16_t m_I16[16];
    int32_t m_I32[8];
    int64_t m_I64[4];
    uint8_t m_U8[32];
    uint16_t m_U16[16];
    uint32_t m_U32[8];
    uint64_t m_U64[4];
    float m_F32[8];
    double m_F64[4];
  };
  // Formatting and display member functions (e.g., ToStringF32 and ToStringF64) are not shown
};
//------------------------------------------------
//        Ch09_01.cpp
//------------------------------------------------
#include "stdafx.h"
#include <iostream>
#include <iomanip>
#define _USE_MATH_DEFINES
#include <math.h>
#include "YmmVal.h"
using namespace std;
extern "C" void AvxPackedMathF32_(const YmmVal& a, const YmmVal& b, YmmVal c[8]);
extern "C" void AvxPackedMathF64_(const YmmVal& a, const YmmVal& b, YmmVal c[8]);
void AvxPackedMathF32(void)
{
  alignas(32) YmmVal a;
  alignas(32) YmmVal b;
  alignas(32) YmmVal c[8];
  a.m_F32[0] = 36.0f;         b.m_F32[0] = -0.1111111f;
  a.m_F32[1] = 0.03125f;       b.m_F32[1] = 64.0f;
  a.m_F32[2] = 2.0f;         b.m_F32[2] = -0.0625f;
  a.m_F32[3] = 42.0f;         b.m_F32[3] = 8.666667f;
  a.m_F32[4] = 7.0f;         b.m_F32[4] = -18.125f;
  a.m_F32[5] = 20.5f;         b.m_F32[5] = 56.0f;
  a.m_F32[6] = 36.125f;        b.m_F32[6] = 24.0f;
  a.m_F32[7] = 0.5f;         b.m_F32[7] = -98.6f;
  AvxPackedMathF32_(a, b, c);
  cout << ("\nResults for AvxPackedMathF32\n");
  cout << "a[0]:    " << a.ToStringF32(0) << '\n';
  cout << "b[0]:    " << b.ToStringF32(0) << '\n';
  cout << "addps[0]:  " << c[0].ToStringF32(0) << '\n';
  cout << "subps[0]:  " << c[1].ToStringF32(0) << '\n';
  cout << "mulps[0]:  " << c[2].ToStringF32(0) << '\n';
  cout << "divps[0]:  " << c[3].ToStringF32(0) << '\n';
  cout << "absps b[0]: " << c[4].ToStringF32(0) << '\n';
  cout << "sqrtps a[0]:" << c[5].ToStringF32(0) << '\n';
  cout << "minps[0]:  " << c[6].ToStringF32(0) << '\n';
  cout << "maxps[0]:  " << c[7].ToStringF32(0) << '\n';
  cout << '\n';
  cout << "a[1]:    " << a.ToStringF32(1) << '\n';
  cout << "b[1]:    " << b.ToStringF32(1) << '\n';
  cout << "addps[1]:  " << c[0].ToStringF32(1) << '\n';
  cout << "subps[1]:  " << c[1].ToStringF32(1) << '\n';
  cout << "mulps[1]:  " << c[2].ToStringF32(1) << '\n';
  cout << "divps[1]:  " << c[3].ToStringF32(1) << '\n';
  cout << "absps b[1]: " << c[4].ToStringF32(1) << '\n';
  cout << "sqrtps a[1]:" << c[5].ToStringF32(1) << '\n';
  cout << "minps[1]:  " << c[6].ToStringF32(1) << '\n';
  cout << "maxps[1]:  " << c[7].ToStringF32(1) << '\n';
}
void AvxPackedMathF64(void)
{
  alignas(32) YmmVal a;
  alignas(32) YmmVal b;
  alignas(32) YmmVal c[8];
  a.m_F64[0] = 2.0;      b.m_F64[0] = M_PI;
  a.m_F64[1] = 4.0 ;     b.m_F64[1] = M_E;
  a.m_F64[2] = 7.5;      b.m_F64[2] = -9.125;
  a.m_F64[3] = 3.0;      b.m_F64[3] = -M_PI;
  AvxPackedMathF64_(a, b, c);
  cout << ("\nResults for AvxPackedMathF64\n");
  cout << "a[0]:    " << a.ToStringF64(0) << '\n';
  cout << "b[0]:    " << b.ToStringF64(0) << '\n';
  cout << "addpd[0]:  " << c[0].ToStringF64(0) << '\n';
  cout << "subpd[0]:  " << c[1].ToStringF64(0) << '\n';
  cout << "mulpd[0]:  " << c[2].ToStringF64(0) << '\n';
  cout << "divpd[0]:  " << c[3].ToStringF64(0) << '\n';
  cout << "abspd b[0]: " << c[4].ToStringF64(0) << '\n';
  cout << "sqrtpd a[0]:" << c[5].ToStringF64(0) << '\n';
  cout << "minpd[0]:  " << c[6].ToStringF64(0) << '\n';
  cout << "maxpd[0]:  " << c[7].ToStringF64(0) << '\n';
  cout << '\n';
  cout << "a[1]:    " << a.ToStringF64(1) << '\n';
  cout << "b[1]:    " << b.ToStringF64(1) << '\n';
  cout << "addpd[1]:  " << c[0].ToStringF64(1) << '\n';
  cout << "subpd[1]:  " << c[1].ToStringF64(1) << '\n';
  cout << "mulpd[1]:  " << c[2].ToStringF64(1) << '\n';
  cout << "divpd[1]:  " << c[3].ToStringF64(1) << '\n';
  cout << "abspd b[1]: " << c[4].ToStringF64(1) << '\n';
  cout << "sqrtpd a[1]:" << c[5].ToStringF64(1) << '\n';
  cout << "minpd[1]:  " << c[6].ToStringF64(1) << '\n';
  cout << "maxpd[1]:  " << c[7].ToStringF64(1) << '\n';
}
int main()
{
  AvxPackedMathF32();
  AvxPackedMathF64();
  return 0;
}
;-------------------------------------------------
;        Ch09_01.asm
;-------------------------------------------------
; Mask values used to calculate floating-point absolute values
      .const
AbsMaskF32 dword 8 dup(7fffffffh)
AbsMaskF64 qword 4 dup(7fffffffffffffffh)
; extern "C" void AvxPackedMathF32_(const YmmVal& a, const YmmVal& b, YmmVal c[8]);
      .code
AvxPackedMathF32_ proc
; Load packed SP floating-point values
    vmovaps ymm0,ymmword ptr [rcx]   ;ymm0 = *a
    vmovaps ymm1,ymmword ptr [rdx]   ;ymm1 = *b
; Packed SP floating-point addition
    vaddps ymm2,ymm0,ymm1
    vmovaps ymmword ptr [r8],ymm2
; Packed SP floating-point subtraction
    vsubps ymm2,ymm0,ymm1
    vmovaps ymmword ptr [r8+32],ymm2
; Packed SP floating-point multiplication
    vmulps ymm2,ymm0,ymm1
    vmovaps ymmword ptr [r8+64],ymm2
; Packed SP floating-point division
    vdivps ymm2,ymm0,ymm1
    vmovaps ymmword ptr [r8+96],ymm2
; Packed SP floating-point absolute value (b)
    vandps ymm2,ymm1,ymmword ptr [AbsMaskF32]
    vmovaps ymmword ptr [r8+128],ymm2
; Packed SP floating-point square root (a)
    vsqrtps ymm2,ymm0
    vmovaps ymmword ptr [r8+160],ymm2
; Packed SP floating-point minimum
    vminps ymm2,ymm0,ymm1
    vmovaps ymmword ptr [r8+192],ymm2
; Packed SP floating-point maximum
    vmaxps ymm2,ymm0,ymm1
    vmovaps ymmword ptr [r8+224],ymm2
    vzeroupper
    ret
AvxPackedMathF32_ endp
; extern "C" void AvxPackedMathF64_(const YmmVal& a, const YmmVal& b, YmmVal c[8]);
AvxPackedMathF64_ proc
; Load packed DP floating-point values
    vmovapd ymm0,ymmword ptr [rcx]    ;ymm0 = *a
    vmovapd ymm1,ymmword ptr [rdx]    ;ymm1 = *b
; Packed DP floating-point addition
    vaddpd ymm2,ymm0,ymm1
    vmovapd ymmword ptr [r8],ymm2
; Packed DP floating-point subtraction
    vsubpd ymm2,ymm0,ymm1
    vmovapd ymmword ptr [r8+32],ymm2
; Packed DP floating-point multiplication
    vmulpd ymm2,ymm0,ymm1
    vmovapd ymmword ptr [r8+64],ymm2
; Packed DP floating-point division
    vdivpd ymm2,ymm0,ymm1
    vmovapd ymmword ptr [r8+96],ymm2
; Packed DP floating-point absolute value (b)
    vandpd ymm2,ymm1,ymmword ptr [AbsMaskF64]
    vmovapd ymmword ptr [r8+128],ymm2
; Packed DP floating-point square root (a)
    vsqrtpd ymm2,ymm0
    vmovapd ymmword ptr [r8+160],ymm2
; Packed DP floating-point minimum
    vminpd ymm2,ymm0,ymm1
    vmovapd ymmword ptr [r8+192],ymm2
; Packed DP floating-point maximum
    vmaxpd ymm2,ymm0,ymm1
    vmovapd ymmword ptr [r8+224],ymm2
    vzeroupper
    ret
AvxPackedMathF64_ endp
    end
Listing 9-1. Example Ch09_01

Listing 9-1 begins with the C++ structure YmmVal, which is declared in the header file YmmVal.h. This structure is similar to the XmmVal structure that you saw in Chapter 6. YmmVal contains a publicly accessible anonymous union that facilitates packed operand data exchange between functions written in C++ and x86 assembly language. The members of this union correspond to the packed data types that can be used with a YMM register. The structure YmmVal also includes several formatting and display functions (the source code for these member functions is not shown).

The C++ code for example Ch09_01 starts with declarations for the assembly language functions AvxPackedMathF32_ and AvxPackedMathF64_. These functions carry out various packed single-precision and double-precision floating-point arithmetic operations using the supplied YmmVal arguments. Following the assembly language function declarations is the function AvxPackedMathF32. This function starts by initializing YmmVal variables a and b. Note that the C++ specifier alignas(32) is used with each YmmVal declaration. This specifier instructs the C++ compiler to align each YmmVal variable on a 32-byte boundary. Following YmmVal variable initialization, AvxPackedMathF32 calls the assembly language function AvxPackedMathF32_ to perform the required arithmetic. The results are then streamed to cout. The function AvxPackedMathF64 is the double-precision floating-point counterpart of AvxPackedMathF32.
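The role of the union and of alignas(32) can be illustrated with a minimal sketch. This is not the book's YmmVal: the type Ymm256 and the helpers IsAligned32 and LaneOffsetF32 are hypothetical names introduced only for this illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for YmmVal: one 256-bit value viewed as
// eight floats, four doubles, or eight 32-bit unsigned integers.
struct Ymm256
{
  union
  {
    float m_F32[8];
    double m_F64[4];
    std::uint32_t m_U32[8];
  };
};

static_assert(sizeof(Ymm256) == 32, "a YMM-sized value occupies 32 bytes");

// Returns true if p lies on a 32-byte boundary, which is what an
// aligned move such as vmovaps requires for a 256-bit memory operand.
inline bool IsAligned32(const void* p)
{
  return reinterpret_cast<std::uintptr_t>(p) % 32 == 0;
}

// Byte offset of element i in the single-precision view.
inline std::size_t LaneOffsetF32(std::size_t i)
{
  return i * sizeof(float);
}
```

For a variable declared as alignas(32) Ymm256 a, IsAligned32(&a) is expected to return true. Without the specifier, the compiler only has to honor the alignment of the union's strictest member (typically 8 bytes), so the check may or may not succeed.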

Near the top of the assembly language code in Listing 9-1 is a .const section that defines packed constant values for calculating floating-point absolute values. The text dup is a MASM operator that allocates and optionally initializes multiple data values. In the current example, the statement AbsMaskF32 dword 8 dup(7fffffffh) allocates storage space for eight doubleword values and each value is initialized to 0x7fffffff. The following statement, AbsMaskF64 qword 4 dup(7fffffffffffffffh), allocates four quadwords of 0x7fffffffffffffff. Note that neither of these 256-bit wide operands is preceded by an align statement, which means that they may not be properly aligned in memory. The reason for this is that the MASM align directive does not support 32-byte alignment within a .const, .data, or .code section. The vandps and vandpd instructions that reference these masks still execute correctly, since AVX instructions other than the explicitly aligned moves do not require aligned memory operands, but a misaligned operand may be slower to access. Later in this chapter, you learn how to define a custom segment of constant values that supports 32-byte alignment.
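The effect of these masks on a single element can be sketched in C++. The example below assumes IEEE 754 binary32 and binary64 encodings for float and double; the function name AbsViaMask is hypothetical and does not appear in the book's source code.

```cpp
#include <cstdint>
#include <cstring>

// Clears the sign bit (bit 31) of a single-precision value. This is the
// per-element operation that a bitwise AND with 0x7fffffff performs.
inline float AbsViaMask(float x)
{
  std::uint32_t bits;
  std::memcpy(&bits, &x, sizeof(bits));   // view the float as raw bits
  bits &= 0x7fffffffu;                    // keep exponent and significand
  std::memcpy(&x, &bits, sizeof(x));
  return x;
}

// Double-precision counterpart: the sign bit is bit 63.
inline double AbsViaMask(double x)
{
  std::uint64_t bits;
  std::memcpy(&bits, &x, sizeof(bits));
  bits &= 0x7fffffffffffffffull;
  std::memcpy(&x, &bits, sizeof(x));
  return x;
}
```

Because only the sign bit changes, the magnitude is preserved exactly; for example, AbsViaMask(-0.0625f) yields 0.0625f and AbsViaMask(-9.125) yields 9.125.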

Following the .const section, the first instruction of AvxPackedMathF32_, vmovaps ymm0,ymmword ptr [rcx], loads argument a (i.e., the eight floating-point values of YmmVal a) into register YMM0. The vmovaps instruction can be used here since YmmVal a was defined using the alignas(32) specifier in the C++ code. The operator ymmword ptr directs the assembler to treat the memory location pointed to by RCX as a 256-bit wide operand. Use of the ymmword ptr operator is optional in this instance and employed to improve code readability. The ensuing vmovaps ymm1,ymmword ptr [rdx] instruction loads b into register YMM1. The vaddps ymm2,ymm0,ymm1 instruction that follows sums the packed single-precision floating-point values in YMM0 and YMM1; it then saves the result to YMM2. The vmovaps ymmword ptr [r8],ymm2 instruction saves the packed sums to c[0].

The ensuing vsubps, vmulps, and vdivps instructions carry out packed single-precision floating-point subtraction, multiplication, and division. This is followed by a vandps ymm2,ymm1,ymmword ptr [AbsMaskF32] instruction that calculates packed absolute values using argument b. The remaining instructions in AvxPackedMathF32_ calculate packed single-precision floating-point square roots, minimums, and maximums.

Prior to its ret instruction, the function AvxPackedMathF32_ uses a vzeroupper instruction, which zeros the high-order 128 bits of each YMM register. As explained in Chapter 4, the vzeroupper instruction is needed here to avoid potential performance delays that can occur whenever the processor transitions from executing x86-AVX instructions that use 256-bit wide operands to executing x86-SSE instructions. Any assembly language function that uses one or more YMM registers and is callable from code that potentially uses x86-SSE instructions should always ensure that a vzeroupper instruction is executed before program control is transferred back to the calling function. You’ll see additional examples of vzeroupper instruction use in this and subsequent chapters.

The organization of function AvxPackedMathF64_ is similar to AvxPackedMathF32_. AvxPackedMathF64_ carries out its calculations using the double-precision versions of the same instructions that are used in AvxPackedMathF32_. Here is the output for source code example Ch09_01:
Results for AvxPackedMathF32
a[0]:       36.000000    0.031250  |    2.000000    42.000000
b[0]:       -0.111111    64.000000  |    -0.062500    8.666667
addps[0]:     35.888889    64.031250  |    1.937500    50.666668
subps[0]:     36.111111   -63.968750  |    2.062500    33.333332
mulps[0]:     -4.000000    2.000000  |    -0.125000   364.000000
divps[0]:    -324.000031    0.000488  |   -32.000000    4.846154
absps b[0]:     0.111111    64.000000  |    0.062500    8.666667
sqrtps a[0]:    6.000000    0.176777  |    1.414214    6.480741
minps[0]:     -0.111111    0.031250  |    -0.062500    8.666667
maxps[0]:     36.000000    64.000000  |    2.000000    42.000000
a[1]:        7.000000    20.500000  |    36.125000    0.500000
b[1]:       -18.125000    56.000000  |    24.000000   -98.599998
addps[1]:     -11.125000    76.500000  |    60.125000   -98.099998
subps[1]:     25.125000   -35.500000  |    12.125000    99.099998
mulps[1]:    -126.875000   1148.000000  |   867.000000   -49.299999
divps[1]:     -0.386207    0.366071  |    1.505208    -0.005071
absps b[1]:    18.125000    56.000000  |    24.000000    98.599998
sqrtps a[1]:    2.645751    4.527693  |    6.010407    0.707107
minps[1]:     -18.125000    20.500000  |    24.000000   -98.599998
maxps[1]:      7.000000    56.000000  |    36.125000    0.500000
Results for AvxPackedMathF64
a[0]:             2.000000000000  |         4.000000000000
b[0]:             3.141592653590  |         2.718281828459
addpd[0]:           5.141592653590  |         6.718281828459
subpd[0]:          -1.141592653590  |         1.281718171541
mulpd[0]:           6.283185307180  |         10.873127313836
divpd[0]:           0.636619772368  |         1.471517764686
abspd b[0]:          3.141592653590  |         2.718281828459
sqrtpd a[0]:         1.414213562373  |         2.000000000000
minpd[0]:           2.000000000000  |         2.718281828459
maxpd[0]:           3.141592653590  |         4.000000000000
a[1]:             7.500000000000  |         3.000000000000
b[1]:            -9.125000000000  |         -3.141592653590
addpd[1]:          -1.625000000000  |         -0.141592653590
subpd[1]:          16.625000000000  |         6.141592653590
mulpd[1]:          -68.437500000000  |         -9.424777960769
divpd[1]:          -0.821917808219  |         -0.954929658551
abspd b[1]:          9.125000000000  |         3.141592653590
sqrtpd a[1]:         2.738612787526  |         1.732050807569
minpd[1]:          -9.125000000000  |         -3.141592653590
maxpd[1]:           7.500000000000  |         3.000000000000
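
The single-precision results above can be cross-checked against a scalar reference. The following is a minimal sketch, not part of the book's source code; the alias Lanes8 and the function PackedMathF32Ref are hypothetical names. The minimum and maximum are written as explicit comparisons so that, like vminps and vmaxps, the second operand is returned whenever the comparison is false.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

using Lanes8 = std::array<float, 8>;

// Scalar reference for the eight packed operations. Results are stored
// in the same order that the assembly language function writes c[0]..c[7].
inline std::array<Lanes8, 8> PackedMathF32Ref(const Lanes8& a, const Lanes8& b)
{
  std::array<Lanes8, 8> c{};
  for (std::size_t i = 0; i < 8; i++)
  {
    c[0][i] = a[i] + b[i];                   // vaddps
    c[1][i] = a[i] - b[i];                   // vsubps
    c[2][i] = a[i] * b[i];                   // vmulps
    c[3][i] = a[i] / b[i];                   // vdivps
    c[4][i] = std::fabs(b[i]);               // vandps with AbsMaskF32
    c[5][i] = std::sqrt(a[i]);               // vsqrtps
    c[6][i] = (a[i] < b[i]) ? a[i] : b[i];   // vminps
    c[7][i] = (a[i] > b[i]) ? a[i] : b[i];   // vmaxps
  }
  return c;
}
```

Calling PackedMathF32Ref with the a and b values from AvxPackedMathF32 gives, for example, c[2][1] equal to 2.0f and c[3][2] equal to -32.0f, which agrees with the mulps and divps rows of the output.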

Packed Floating-Point Arrays

In previous chapters, you learned how to carry out integer and floating-point array calculations using the general-purpose and XMM register sets. In this section, you learn how to perform floating-point array operations using the YMM register set.

Simple Calculations

Listing 9-2 shows the source code for example Ch09_02. This example illustrates how to perform simple array calculations using 256-bit wide packed floating-point operands. It also demonstrates how to detect and exclude invalid array elements from packed calculations. Source code example Ch09_02 is an array implementation of example Ch05_02 from Chapter 5, which calculated sphere surface areas and volumes. In that example, the assembly language function CalcSphereAreaVolume_ computed the surface area and volume of a single sphere. In this example, the sphere radii are passed via an array to calculating functions coded using C++ and assembly language. To make the example a little more interesting, both the C++ and assembly language calculating functions test for radii less than zero. If an invalid radius is detected, the calculating functions set the corresponding elements in the surface area and volume arrays to QNaN.
//------------------------------------------------
//        Ch09_02.cpp
//------------------------------------------------
#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include <random>
#include <limits>
#define _USE_MATH_DEFINES
#include <math.h>
using namespace std;
extern "C" void AvxCalcSphereAreaVolume_(float* sa, float* vol, const float* r, size_t n);
extern "C" float c_PI_F32 = (float)M_PI;
extern "C" float c_QNaN_F32 = numeric_limits<float>::quiet_NaN();
void Init(float* r, size_t n, unsigned int seed)
{
  uniform_int_distribution<> ui_dist {1, 100};
  default_random_engine rng {seed};
  for (size_t i = 0; i < n; i++)
    r[i] = (float)ui_dist(rng) / 10.0f;
  // Set invalid radii for test purposes
  if (n > 2)
  {
    r[2] = -r[2];
    r[n / 4] = -r[n / 4];
    r[n / 2] = -r[n / 2];
    r[n / 4 * 3] = -r[n / 4 * 3];
    r[n - 2] = -r[n - 2];
  }
}
void AvxCalcSphereAreaVolumeCpp(float* sa, float* vol, const float* r, size_t n)
{
  for (size_t i = 0; i < n; i++)
  {
    if (r[i] < 0.0f)
      sa[i] = vol[i] = c_QNaN_F32;
    else
    {
      sa[i] = r[i] * r[i] * 4.0f * c_PI_F32;
      vol[i] = sa[i] * r[i] / 3.0f;
    }
  }
}
void AvxCalcSphereAreaVolume(void)
{
  const size_t n = 21;
  alignas(32) float r[n];
  alignas(32) float sa1[n];
  alignas(32) float vol1[n];
  alignas(32) float sa2[n];
  alignas(32) float vol2[n];
  Init(r, n, 93);
  AvxCalcSphereAreaVolumeCpp(sa1, vol1, r, n);
  AvxCalcSphereAreaVolume_(sa2, vol2, r, n);
  cout << "\nResults for AvxCalcSphereAreaVolume\n";
  cout << fixed;
  const float eps = 1.0e-6f;
  for (size_t i = 0; i < n; i++)
  {
    cout << setw(2) << i << ": ";
    cout << setprecision(2);
    cout << setw(5) << r[i] << " | ";
    cout << setprecision(6);
    cout << setw(12) << sa1[i] << "  ";
    cout << setw(12) << sa2[i] << " | ";
    cout << setw(12) << vol1[i] << "  ";
    cout << setw(12) << vol2[i];
    bool b0 = (fabs(sa1[i] - sa2[i]) > eps);
    bool b1 = (fabs(vol1[i] - vol2[i]) > eps);
    if (b0 || b1)
      cout << " Compare discrepancy";
    cout << '\n';
  }
}
int main()
{
  AvxCalcSphereAreaVolume();
  return 0;
}
;-------------------------------------------------
;        Ch09_02.asm
;-------------------------------------------------
    include <cmpequ.asmh>
    include <MacrosX86-64-AVX.asmh>
    .const
r4_3p0 real4 3.0
r4_4p0 real4 4.0
    extern c_PI_F32:real4
    extern c_QNaN_F32:real4
; extern "C" void AvxCalcSphereAreaVolume_(float* sa, float* vol, const float* r, size_t n);
    .code
AvxCalcSphereAreaVolume_ proc frame
    _CreateFrame CC_,0,64
    _SaveXmmRegs xmm6,xmm7,xmm8,xmm9
    _EndProlog
; Initialize
    vbroadcastss ymm0,real4 ptr [r4_4p0]    ;packed 4.0
    vbroadcastss ymm1,real4 ptr [c_PI_F32]   ;packed PI
    vmulps ymm6,ymm0,ymm1            ;packed 4.0 * PI
    vbroadcastss ymm7,real4 ptr [r4_3p0]    ;packed 3.0
    vbroadcastss ymm8,real4 ptr [c_QNaN_F32]  ;packed QNaN
    vxorps ymm9,ymm9,ymm9            ;packed 0.0
    xor eax,eax               ;common offset for arrays
    cmp r9,8
    jb FinalR                ;skip main loop if n < 8
; Calculate surface area and volume values using packed arithmetic
@@:   vmovdqa ymm0,ymmword ptr [r8+rax]    ;load next 8 radii
    vmulps ymm2,ymm6,ymm0          ;4.0 * PI * r
    vmulps ymm3,ymm2,ymm0          ;4.0 * PI * r * r
    vcmpps ymm1,ymm0,ymm9,CMP_LT      ;ymm1 = mask of radii < 0.0
    vandps ymm4,ymm1,ymm8          ;set surface area to QNaN for radii < 0.0
    vandnps ymm5,ymm1,ymm3         ;keep surface area for radii >= 0.0
    vorps ymm5,ymm4,ymm5          ;final packed surface area
    vmovaps ymmword ptr[rcx+rax],ymm5    ;save packed surface area
    vmulps ymm2,ymm3,ymm0          ;4.0 * PI * r * r * r
    vdivps ymm3,ymm2,ymm7          ;4.0 * PI * r * r * r / 3.0
    vandps ymm4,ymm1,ymm8          ;set volume to QNaN for radii < 0.0
    vandnps ymm5,ymm1,ymm3         ;keep volume for radii >= 0.0
    vorps ymm5,ymm4,ymm5          ;final packed volume
    vmovaps ymmword ptr[rdx+rax],ymm5    ;save packed volume
    add rax,32               ;rax = offset to next set of radii
    sub r9,8
    cmp r9,8
    jae @B                 ;repeat until n < 8
; Perform final calculations using scalar arithmetic
FinalR: test r9,r9
    jz Done                 ;skip loop if no more elements
@@:   vmovss xmm0,real4 ptr [r8+rax]
    vmulss xmm2,xmm6,xmm0          ;4.0 * PI * r
    vmulss xmm3,xmm2,xmm0          ;4.0 * PI * r * r
    vcmpss xmm1,xmm0,xmm9,CMP_LT
    vandps xmm4,xmm1,xmm8
    vandnps xmm5,xmm1,xmm3
    vorps xmm5,xmm4,xmm5
    vmovss real4 ptr[rcx+rax],xmm5     ;save surface area
    vmulss xmm2,xmm3,xmm0          ;4.0 * PI * r * r * r
    vdivss xmm3,xmm2,xmm7          ;4.0 * PI * r * r * r / 3.0
    vandps xmm4,xmm1,xmm8
    vandnps xmm5,xmm1,xmm3
    vorps xmm5,xmm4,xmm5
    vmovss real4 ptr[rdx+rax],xmm5     ;save volume
    add rax,4
    dec r9
    jnz @B                 ;repeat until done
Done:  vzeroupper
    _RestoreXmmRegs xmm6,xmm7,xmm8,xmm9
    _DeleteFrame
    ret
AvxCalcSphereAreaVolume_ endp
    end
Listing 9-2. Example Ch09_02

The C++ code in Listing 9-2 includes a function named AvxCalcSphereAreaVolumeCpp. This function calculates sphere surface areas and volumes. The sphere radii are passed to AvxCalcSphereAreaVolumeCpp via an array. Prior to calculating a surface area or volume, the sphere’s radius (r[i]) is tested to verify that it’s not negative. If the radius is negative, the corresponding elements in the surface area and volume arrays (sa[i] and vol[i]) are set to c_QNaN_F32. The remaining C++ code performs the necessary initializations, exercises the C++ and assembly language calculating functions, and displays the results. Note that the function AvxCalcSphereAreaVolume employs the alignas(32) specifier with each array declaration.

The assembly language function AvxCalcSphereAreaVolume_ performs the same calculations as its C++ counterpart. Following its prolog, AvxCalcSphereAreaVolume_ uses a series of vbroadcastss instructions to initialize packed versions of the required constants. Prior to the start of the processing loop, a cmp r9,8 instruction checks the value of n. The reason for this check is that the processing loop carries out eight surface area and volume calculations simultaneously using 256-bit wide operands. The jb FinalR conditional jump instruction skips the processing loop if there are fewer than eight radii to process.
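
The division of work between the packed processing loop and the scalar loop that follows it can be sketched in C++. This is only an illustration of the loop control described above; the names LoopSplit and SplitForWidth8 are hypothetical.

```cpp
#include <cstddef>

// Number of trips through each loop for an array of n elements.
struct LoopSplit
{
  std::size_t packed_iterations;   // iterations of the 8-wide loop
  std::size_t scalar_iterations;   // iterations of the scalar loop
};

inline LoopSplit SplitForWidth8(std::size_t n)
{
  LoopSplit s{0, 0};
  while (n >= 8)               // mirrors cmp r9,8 and the jb / jae tests
  {
    s.packed_iterations++;     // eight radii processed per iteration
    n -= 8;                    // mirrors sub r9,8
  }
  s.scalar_iterations = n;     // 0 to 7 radii left for the scalar loop
  return s;
}
```

For the value n = 21 used in Listing 9-2, this yields two packed iterations (16 radii) followed by five scalar iterations.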

Each processing loop iteration begins with a vmovdqa ymm0,ymmword ptr [r8+rax] instruction that loads eight single-precision floating-point radii into register YMM0. The ensuing vmulps instructions calculate the sphere surface areas. The next instruction, vcmpps ymm1,ymm0,ymm9,CMP_LT, tests each sphere radius for a value less than 0.0 (register YMM9 contains packed 0.0). Recall that the vcmpps instruction signifies its results by setting elements in the destination operand to either 0x00000000 (false compare predicate) or 0xffffffff (true compare predicate). The vandps, vandnps, and vorps instructions that follow set the surface area of each sphere whose radius is less than 0.0 to c_QNaN_F32. Figure 9-1 illustrates this operation in greater detail. A vmovaps ymmword ptr[rcx+rax],ymm5 instruction saves the eight sphere surface area values to the array sa.
Figure 9-1. Surface area QNaN assignment for spheres with radius less than 0.0

Following the calculation of the surface areas, the vmulps ymm2,ymm3,ymm0 and vdivps ymm3,ymm2,ymm7 instructions compute the sphere volumes. The processing loop uses another vandps, vandnps, and vorps instruction sequence to set the volume of any negative-radius sphere to c_QNaN_F32. These values are then saved to the array vol. The processing loop repeats until there are fewer than eight remaining radii.
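
The compare and Boolean sequence used for both the surface areas and the volumes can be emulated for a single element in C++. The sketch below assumes an IEEE 754 binary32 float; the function name SelectQNaNIfNegative is hypothetical and is not part of the book's source code.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

// Per-element emulation of vcmpps (CMP_LT) followed by vandps, vandnps,
// and vorps. Returns QNaN if r < 0.0, otherwise returns value unchanged.
inline float SelectQNaNIfNegative(float r, float value)
{
  const float qnan = std::numeric_limits<float>::quiet_NaN();
  std::uint32_t value_bits, qnan_bits;
  std::memcpy(&value_bits, &value, sizeof(value_bits));
  std::memcpy(&qnan_bits, &qnan, sizeof(qnan_bits));

  std::uint32_t mask = (r < 0.0f) ? 0xffffffffu : 0x00000000u;  // vcmpps
  std::uint32_t t1 = mask & qnan_bits;       // vandps: QNaN where r < 0.0
  std::uint32_t t2 = ~mask & value_bits;     // vandnps: value where r >= 0.0
  std::uint32_t out_bits = t1 | t2;          // vorps: merge the two cases

  float out;
  std::memcpy(&out, &out_bits, sizeof(out));
  return out;
}
```

Since the mask is either all ones or all zeros, exactly one of the two intermediate values is nonzero, and the OR operation selects it without a conditional jump.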

The next block of code computes sphere surface areas and volumes for the remaining (1–7) radii. Note that AvxCalcSphereAreaVolume_ carries out these calculations using scalar single-precision floating-point arithmetic. The scalar processing loop performs the same arithmetic and Boolean operations as the packed processing loop. Similar to the previous example, AvxCalcSphereAreaVolume_ uses a vzeroupper instruction immediately after the scalar processing loop. This instruction is needed since AvxCalcSphereAreaVolume_ carried out its calculations using the YMM register set. When a vzeroupper instruction is required, it should always be positioned before any function epilog macros (e.g., _RestoreXmmRegs and _DeleteFrame) and the ret instruction. Here are the results for source code example Ch09_02:
Results for AvxCalcSphereAreaVolume
 0:  3.80 |  181.458389   181.458389 |  229.847290   229.847290
 1: 10.00 | 1256.637085  1256.637085 | 4188.790527  4188.790527
 2: -6.10 |     nan      nan |     nan      nan
 3:  3.70 |  172.033630   172.033630 |  212.174805   212.174805
 4:  9.60 | 1158.116821  1158.116821 | 3705.973877  3705.973877
 5: -6.60 |     nan      nan |     nan      nan
 6:  2.60 |  84.948662   84.948654 |  73.622169   73.622162 Compare discrepancy
 7:  9.30 | 1086.865479  1086.865479 | 3369.283203  3369.283203
 8:  9.00 | 1017.876038  1017.876038 | 3053.628174  3053.628174
 9:  5.80 |  422.732758   422.732758 |  817.283386   817.283386
10: -2.90 |     nan      nan |     nan      nan
11:  8.10 |  824.479675   824.479675 | 2226.095215  2226.095215
12:  3.00 |  113.097336   113.097336 |  113.097328   113.097328
13:  8.00 |  804.247742   804.247742 | 2144.660645  2144.660645
14:  1.40 |  24.630087   24.630085 |  11.494040   11.494039 Compare discrepancy
15: -1.80 |     nan      nan |     nan      nan
16:  4.30 |  232.352219   232.352219 |  333.038177   333.038177
17:  6.60 |  547.391113   547.391113 | 1204.260376  1204.260376
18:  4.50 |  254.469009   254.469009 |  381.703522   381.703522
19: -1.20 |     nan      nan |     nan      nan
20:  4.50 |  254.469009   254.469009 |  381.703522   381.703522

The output for source code example Ch09_02 includes a couple of lines with the text “Compare discrepancy.” This text was generated by the compare code in AvxCalcSphereAreaVolume to exemplify the non-associativity of floating-point arithmetic. In this example, the functions AvxCalcSphereAreaVolumeCpp and AvxCalcSphereAreaVolume_ carried out their respective floating-point calculations using different operand orderings. For each sphere surface area, the C++ code calculates sa[i] = r[i] * r[i] * 4.0f * c_PI_F32, while the assembly language code calculates sa[i] = 4.0f * c_PI_F32 * r[i] * r[i]. The compare code uses an absolute tolerance of 1.0e-6f, which is smaller than the spacing between adjacent single-precision values near 84.9 (about 7.6e-6), so a difference of a single unit in the last place is reported. Tiny numerical discrepancies like this are not unusual when comparing floating-point values that are calculated using different operand orderings irrespective of the programming language. This is something that you should keep in mind if you’re developing production code that includes multiple versions of the same calculating function (e.g., one coded using C++ and an AVX/AVX2 accelerated version that’s implemented using x86 assembly language).
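
The two operand orderings can be placed side by side in a short C++ sketch, together with one possible way to compare the results using a relative rather than an absolute tolerance. The function names AreaCppOrder, AreaAsmOrder, and NearlyEqual, along with the default tolerance of four multiples of FLT_EPSILON, are assumptions made for this illustration and do not come from the book's source code.

```cpp
#include <algorithm>
#include <cfloat>
#include <cmath>

// Surface area using the operand order of the C++ function.
inline float AreaCppOrder(float r, float pi)
{
  return r * r * 4.0f * pi;
}

// Surface area using the operand order of the assembly language function.
inline float AreaAsmOrder(float r, float pi)
{
  return 4.0f * pi * r * r;
}

// Hypothetical helper: true if a and b differ by no more than tol_eps
// multiples of FLT_EPSILON, scaled by the larger of the two magnitudes.
inline bool NearlyEqual(float a, float b, float tol_eps = 4.0f)
{
  float scale = std::max(std::fabs(a), std::fabs(b));
  return std::fabs(a - b) <= tol_eps * FLT_EPSILON * scale;
}
```

With a relative tolerance of this kind, results that differ only in the last bit or two because of operand ordering compare as equal, while genuinely different values such as 1.0f and 1.001f do not.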

Finally, you may have noticed that the function AvxCalcSphereAreaVolume_ handled invalid radii sans any x86 conditional jump instructions. Minimizing the number of conditional jump instructions in a function, especially data-dependent ones, often results in faster executing code. You’ll learn more about jump instruction optimization techniques in Chapter 15.

Column Means

Listing 9-3 shows the source code for example Ch09_03. This example illustrates how to calculate the arithmetic mean of each column in a two-dimensional array of double-precision floating-point values.
//------------------------------------------------
//        Ch09_03.cpp
//------------------------------------------------
#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include <random>
#include <memory>
using namespace std;
extern "C" size_t c_NumRowsMax = 1024 * 1024;
extern "C" size_t c_NumColsMax = 1024 * 1024;
extern "C" bool AvxCalcColumnMeans_(const double* x, size_t nrows, size_t ncols, double* col_means);
void Init(double* x, size_t n, unsigned int seed)
{
  uniform_int_distribution<> ui_dist {1, 2000};
  default_random_engine rng {seed};
  for (size_t i = 0; i < n; i++)
    x[i] = (double)ui_dist(rng) / 10.0;
}
bool AvxCalcColumnMeansCpp(const double* x, size_t nrows, size_t ncols, double* col_means)
{
  // Make sure nrows and ncols are valid
  if (nrows == 0 || nrows > c_NumRowsMax)
    return false;
  if (ncols == 0 || ncols > c_NumColsMax)
    return false;
  // Set initial column means to zero
  for (size_t i = 0; i < ncols; i++)
    col_means[i] = 0.0;
  // Calculate column means
  for (size_t i = 0; i < nrows; i++)
  {
    for (size_t j = 0; j < ncols; j++)
      col_means[j] += x[i * ncols + j];
  }
  for (size_t j = 0; j < ncols; j++)
    col_means[j] /= nrows;
  return true;
}
void AvxCalcColumnMeans(void)
{
  const size_t nrows = 20;
  const size_t ncols = 11;
  unique_ptr<double[]> x {new double[nrows * ncols]};
  unique_ptr<double[]> col_means1 {new double[ncols]};
  unique_ptr<double[]> col_means2 {new double[ncols]};
  Init(x.get(), nrows * ncols, 47);
  bool rc1 = AvxCalcColumnMeansCpp(x.get(), nrows, ncols, col_means1.get());
  bool rc2 = AvxCalcColumnMeans_(x.get(), nrows, ncols, col_means2.get());
  cout << "Results for AvxCalcColumnMeans\n";
  if (!rc1 || !rc2)
  {
    cout << "Invalid return code: ";
    cout << "rc1 = " << boolalpha << rc1 << ", ";
    cout << "rc2 = " << boolalpha << rc2 << '\n';
    return;
  }
  cout << "\nTest Matrix\n";
  cout << fixed << setprecision(1);
  for (size_t i = 0; i < nrows; i++)
  {
    cout << "row " << setw(2) << i;
    for (size_t j = 0; j < ncols; j++)
      cout << setw(7) << x[i * ncols + j];
    cout << '\n';
  }
  cout << "\nColumn Means\n";
  cout << setprecision(2);
  for (size_t j = 0; j < ncols; j++)
  {
    cout << "col_means1[" << setw(2) << j << "] =";
    cout << setw(10) << col_means1[j] << "  ";
    cout << "col_means2[" << setw(2) << j << "] =";
    cout << setw(10) << col_means2[j] << '\n';
  }
}
int main()
{
  AvxCalcColumnMeans();
  return 0;
}
;-------------------------------------------------
;        Ch09_03.asm
;-------------------------------------------------
; extern "C" bool AvxCalcColumnMeans_(const double* x, size_t nrows, size_t ncols, double* col_means)
    extern c_NumRowsMax:qword
    extern c_NumColsMax:qword
    .code
AvxCalcColumnMeans_ proc
; Validate nrows and ncols
    xor eax,eax             ;error return code (also col_mean index)
    test rdx,rdx
    jz Done               ;jump if nrows is zero
    cmp rdx,[c_NumRowsMax]
    ja Done               ;jump if nrows is too large
    test r8,r8
    jz Done               ;jump if ncols is zero
    cmp r8,[c_NumColsMax]
    ja Done               ;jump if ncols is too large
; Initialize elements of col_means to zero
    vxorpd xmm0,xmm0,xmm0        ;xmm0[63:0] = 0.0
@@:   vmovsd real8 ptr[r9+rax*8],xmm0   ;col_means[i] = 0.0
    inc rax
    cmp rax,r8
    jb @B                ;repeat until done
    vcvtsi2sd xmm2,xmm2,rdx       ;convert nrows for later use
; Compute the sum of each column in x
LP1:  mov r11,r9             ;r11 = ptr to col_means
    xor r10,r10             ;r10 = col_index
LP2:  mov rax,r10             ;rax = col_index
    add rax,4
    cmp rax,r8             ;4 or more columns remaining?
    ja @F                ;jump if no (col_index + 4 > ncols)
; Update col_means using next four columns
    vmovupd ymm0,ymmword ptr [rcx]   ;load next 4 columns of current row
    vaddpd ymm1,ymm0,ymmword ptr [r11] ;add to col_means
    vmovupd ymmword ptr [r11],ymm1   ;save updated col_means
    add r10,4              ;col_index += 4
    add rcx,32             ;update x ptr
    add r11,32             ;update col_means ptr
    jmp NextColSet
@@:   sub rax,2
    cmp rax,r8             ;2 or more columns remaining?
    ja @F                ;jump if no (col_index + 2 > ncols)
; Update col_means using next two columns
    vmovupd xmm0,xmmword ptr [rcx]   ;load next 2 columns of current row
    vaddpd xmm1,xmm0,xmmword ptr [r11] ;add to col_means
    vmovupd xmmword ptr [r11],xmm1   ;save updated col_means
    add r10,2              ;col_index += 2
    add rcx,16             ;update x ptr
    add r11,16             ;update col_means ptr
    jmp NextColSet
; Update col_means using next column (or last column in the current row)
@@:   vmovsd xmm0,real8 ptr [rcx]     ;load x from last column
    vaddsd xmm1,xmm0,real8 ptr [r11]  ;add to col_means
    vmovsd real8 ptr [r11],xmm1     ;save updated col_means
    inc r10               ;col_index += 1
    add rcx,8              ;update x ptr
NextColSet:
    cmp r10,r8             ;more columns in current row?
    jb LP2               ;jump if yes
    dec rdx               ;nrows -= 1
    jnz LP1               ;jump if more rows
; Compute the final col_means
@@:   vmovsd xmm0,real8 ptr [r9]     ;xmm0 = col_means[i]
    vdivsd xmm1,xmm0,xmm2        ;compute final mean
    vmovsd real8 ptr [r9],xmm1     ;save col_mean[i]
    add r9,8              ;update col_means ptr
    dec r8               ;ncols -= 1
    jnz @B               ;repeat until done
    mov eax,1              ;set success return code
Done:  vzeroupper
    ret
AvxCalcColumnMeans_ endp
    end
Listing 9-3.

Example Ch09_03

Toward the top of the C++ code is a function named AvxCalcColumnMeansCpp. This function calculates the column means of a two-dimensional array using a straightforward set of nested for loops and some simple arithmetic. The function AvxCalcColumnMeans contains code that uses the C++ smart pointer class unique_ptr<> to help manage its dynamically-allocated arrays. Note that storage space for the test array x is allocated using the C++ new operator, which means that the array may not be aligned on a 16- or 32-byte boundary. In this particular example, aligning the start of array x to a specific boundary would be of little benefit since it’s not possible to align the individual rows or columns of a standard C++ two-dimensional array (recall that the elements of a two-dimensional C++ array are stored in a contiguous block of memory using row-major ordering as described in Chapter 2).

The function AvxCalcColumnMeans also uses class unique_ptr<> and the new operator for the one-dimensional arrays col_means1 and col_means2. Using unique_ptr<> in this example simplifies the C++ code somewhat since its destructor automatically invokes the delete[] operator to release the storage space that was allocated by the new operator. If you’re interested in learning more about the smart pointer class unique_ptr<>, Appendix A contains a list of C++ references that you can consult. The remaining code in AvxCalcColumnMeans invokes the C++ and assembly language column-mean calculating functions and streams the results to cout.

Following argument validation, the assembly language function AvxCalcColumnMeans_ initializes each element in col_means to 0.0. These elements will maintain the intermediate column sums. In order to maximize throughput, the column summation code uses slightly different instruction sequences depending on the current column and the total number of columns in the array. For example, assume that array x contains seven columns. For each row, the elements of the first four columns in x can be added to col_means using 256-bit wide packed addition; the elements of the next two columns can be added to col_means using 128-bit wide packed addition; and the final column element must be added to col_means using scalar addition. Figure 9-2 illustrates this technique in greater detail.
../images/326959_2_En_9_Chapter/326959_2_En_9_Fig2_HTML.jpg
Figure 9-2.

Updating the col_means array using different operand sizes
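The chunk-selection logic illustrated in Figure 9-2 can be modeled in portable C++. The following is a minimal sketch and not the book's implementation: the names AddChunk and CalcColumnMeansChunked are hypothetical, the inner helper merely stands in for a 256-bit, 128-bit, or scalar addition, and the checks against c_NumRowsMax and c_NumColsMax are omitted for brevity.

```cpp
#include <cstddef>

// Hypothetical helper (not from the book): adds 'width' consecutive elements
// of a row to the running column sums. It stands in for a 256-bit (width 4),
// 128-bit (width 2), or scalar (width 1) addition.
static void AddChunk(const double* row, double* sums, std::size_t width)
{
    for (std::size_t k = 0; k < width; k++)
        sums[k] += row[k];
}

// Hypothetical function: sketch of the column-sum loop structure.
// Returns false if nrows or ncols is zero.
bool CalcColumnMeansChunked(const double* x, std::size_t nrows,
                            std::size_t ncols, double* col_means)
{
    if (nrows == 0 || ncols == 0)
        return false;

    for (std::size_t j = 0; j < ncols; j++)
        col_means[j] = 0.0;

    for (std::size_t i = 0; i < nrows; i++)
    {
        const double* row = x + i * ncols;
        std::size_t j = 0;

        while (j < ncols)
        {
            std::size_t width;

            if (j + 4 <= ncols)
                width = 4;          // four or more columns remain
            else if (j + 2 <= ncols)
                width = 2;          // two or three columns remain
            else
                width = 1;          // only the last column remains

            AddChunk(row + j, col_means + j, width);
            j += width;
        }
    }

    // Convert the column sums to column means
    for (std::size_t j = 0; j < ncols; j++)
        col_means[j] /= static_cast<double>(nrows);

    return true;
}
```

With seven columns, each row is consumed as one chunk of four, one chunk of two, and one single element, which mirrors the sequence shown in Figure 9-2.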

The mov r11,r9 instruction next to the label LP1 is the starting point for adding elements in the current row of x to col_means. This instruction initializes R11 to point to the first entry in col_means. The col_index counter in register R10 is then set to zero. The instruction group near the label LP2 determines the number of columns remaining to be processed in the current row. If four or more columns remain, the next four elements from the current row are added to the column sums in col_means. A vmovupd ymm0,ymmword ptr [rcx] instruction loads four double-precision floating-point values from x into YMM0 (a vmovapd instruction is not used here since alignment of the elements is unknown). The ensuing vaddpd ymm1,ymm0,ymmword ptr [r11] instruction sums the current array elements with the corresponding elements in col_means, and the vmovupd ymmword ptr [r11],ymm1 instruction saves the updated results back to col_means. The function’s various pointers and counters are then updated in preparation for the next set of elements from the current row of x.

The summation code repeats the steps described in the previous paragraph until the number of array elements that remain in the current row is less than four. As soon as this condition is met, the elements in the remaining columns (if any) must be processed using 128-bit wide or 64-bit wide operands. This is the reason for the distinct blocks of code in AvxCalcColumnMeans_ that process four elements, two elements, or a single element per row. Following computation of the column sums, each element in col_means is divided by nrows, which yields the final column mean. Here are the results for source code example Ch09_03:
Results for AvxCalcColumnMeans
Test Matrix
row 0 125.6  59.9 100.0 170.5 140.1 197.2  73.7  15.2  92.4 155.3 159.2
row 1  77.6 105.4  45.0 176.8  65.9  12.3 189.1 102.0  56.2 112.8  17.2
row 2 198.9 199.3  74.6 137.9  65.0 125.0  19.8  32.1  58.6  94.1 123.5
row 3  1.7  29.1  99.1 200.0 109.0 123.7 130.0 125.3 146.2  90.6  52.2
row 4  8.7  88.7  84.8 174.6 164.4 106.2 114.0 151.8 130.8 101.9 116.2
row 5  42.7 130.5 180.4 199.4 196.6  99.7 163.6  34.2  5.5 146.1 108.5
row 6 120.0 159.5  26.0  83.4  58.7  10.1 170.1  20.5  10.8  48.3 121.9
row 7 148.9 148.4 142.0 106.6 198.4  60.3  72.1 137.8  74.5  75.7  44.8
row 8  25.7 192.0  12.1  23.4  98.7 145.3 196.8  43.9 143.1  25.1 122.6
row 9  5.4 134.7 165.1  61.8  46.7 183.3 173.7 146.9  76.5 186.2  24.9
row 10 174.5 158.9 127.8  58.9  42.9 182.9  7.8  50.3  68.0  62.0  66.1
row 11  47.3 166.2  8.2  71.2  98.5  12.4 179.0 100.2  29.7 167.4 155.2
row 12  23.9 196.6 148.7  7.1 128.2 128.8  66.3 153.7  60.7 115.4  71.6
row 13 103.4 184.3 161.5  57.9 199.2  79.3  28.1  73.1  12.5  71.3 100.4
row 14 130.3 154.2 127.5  29.7 198.2 170.3 121.9  80.4 159.8  70.0  82.6
row 15  26.7  45.6  67.7 109.7  5.1  96.2 188.7 100.7  48.3 164.2  75.4
row 16 115.4  25.5  58.8 148.5  80.7 149.1 156.7 153.8  42.0 103.7  4.2
row 17  67.9 161.5  16.9 102.1  77.3  3.9 104.7  97.2 181.8 182.0 155.1
row 18 169.5 122.4 102.2  5.5  14.5 105.1 181.5  83.3 117.6  52.1 111.2
row 19  47.1 146.9  21.0  8.6 130.3  24.7  95.7  6.7 159.9  38.8  82.6
Column Means
col_means1[ 0] =   83.06  col_means2[ 0] =   83.06
col_means1[ 1] =  130.48  col_means2[ 1] =  130.48
col_means1[ 2] =   88.47  col_means2[ 2] =   88.47
col_means1[ 3] =   96.68  col_means2[ 3] =   96.68
col_means1[ 4] =  105.92  col_means2[ 4] =  105.92
col_means1[ 5] =  100.79  col_means2[ 5] =  100.79
col_means1[ 6] =  121.66  col_means2[ 6] =  121.66
col_means1[ 7] =   85.46  col_means2[ 7] =   85.46
col_means1[ 8] =   83.75  col_means2[ 8] =   83.75
col_means1[ 9] =  103.15  col_means2[ 9] =  103.15
col_means1[10] =   89.77  col_means2[10] =   89.77

Correlation Coefficient

The next source code example illustrates how to calculate a correlation coefficient using packed double-precision floating-point arithmetic. This example also demonstrates how to perform a few common auxiliary operations with packed floating-point operands, including 128-bit wide extractions and horizontal addition. Listing 9-4 shows the source code for example Ch09_04.
//------------------------------------------------
//        Ch09_04.cpp
//------------------------------------------------
#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include <string>
#include <random>
#include <cmath>
#include "AlignedMem.h"
using namespace std;
extern "C" bool AvxCalcCorrCoef_(const double* x, const double* y, size_t n, double sums[5], double epsilon, double* rho);
void Init(double* x, double* y, size_t n, unsigned int seed)
{
  uniform_int_distribution<> ui_dist {1, 999};
  default_random_engine rng {seed};
  for (size_t i = 0; i < n; i++)
  {
    x[i] = (double)ui_dist(rng);
    y[i] = x[i] + (ui_dist(rng) % 6000) - 3000;
  }
}
bool AvxCalcCorrCoefCpp(const double* x, const double* y, size_t n, double sums[5], double epsilon, double* rho)
{
  const size_t alignment = 32;
  // Make sure n is valid
  if (n == 0)
    return false;
  // Make sure x and y are properly aligned
  if (!AlignedMem::IsAligned(x, alignment))
    return false;
  if (!AlignedMem::IsAligned(y, alignment))
    return false;
  // Calculate and save sum variables
  double sum_x = 0, sum_y = 0, sum_xx = 0, sum_yy = 0, sum_xy = 0;
  for (size_t i = 0; i < n; i++)
  {
    sum_x += x[i];
    sum_y += y[i];
    sum_xx += x[i] * x[i];
    sum_yy += y[i] * y[i];
    sum_xy += x[i] * y[i];
  }
  sums[0] = sum_x;
  sums[1] = sum_y;
  sums[2] = sum_xx;
  sums[3] = sum_yy;
  sums[4] = sum_xy;
  // Calculate rho
  double rho_num = n * sum_xy - sum_x * sum_y;
  double rho_den = sqrt(n * sum_xx - sum_x * sum_x) * sqrt(n * sum_yy - sum_y * sum_y);
  if (rho_den >= epsilon)
  {
    *rho = rho_num / rho_den;
    return true;
  }
  else
  {
    *rho = 0;
    return false;
  }
}
int main()
{
  const size_t n = 103;
  const size_t alignment = 32;
  AlignedArray<double> x_aa(n, alignment);
  AlignedArray<double> y_aa(n, alignment);
  double sums1[5], sums2[5];
  double rho1, rho2;
  double epsilon = 1.0e-12;
  double* x = x_aa.Data();
  double* y = y_aa.Data();
  Init(x, y, n, 71);
  bool rc1 = AvxCalcCorrCoefCpp(x, y, n, sums1, epsilon, &rho1);
  bool rc2 = AvxCalcCorrCoef_(x, y, n, sums2, epsilon, &rho2);
  cout << "Results for AvxCalcCorrCoef\n";
  if (!rc1 || !rc2)
  {
    cout << "Invalid return code ";
    cout << "rc1 = " << boolalpha << rc1 << ", ";
    cout << "rc2 = " << boolalpha << rc2 << '\n';
    return 1;
  }
  int w = 14;
  string sep(w * 3, '-');
  cout << fixed << setprecision(8);
  cout << "Value  " << setw(w) << "C++" << " " << setw(w) << "x86-AVX" << '\n';
  cout << sep << '\n';
  cout << "rho:   " << setw(w) << rho1 << " " << setw(w) << rho2 << "\n";
  cout << setprecision(1);
  cout << "sum_x:  " << setw(w) << sums1[0] << " " << setw(w) << sums2[0] << '\n';
  cout << "sum_y:  " << setw(w) << sums1[1] << " " << setw(w) << sums2[1] << '\n';
  cout << "sum_xx: " << setw(w) << sums1[2] << " " << setw(w) << sums2[2] << '\n';
  cout << "sum_yy: " << setw(w) << sums1[3] << " " << setw(w) << sums2[3] << '\n';
  cout << "sum_xy: " << setw(w) << sums1[4] << " " << setw(w) << sums2[4] << '\n';
  return 0;
}
;-------------------------------------------------
;        Ch09_04.asm
;-------------------------------------------------
    include <MacrosX86-64-AVX.asmh>
; extern "C" bool AvxCalcCorrCoef_(const double* x, const double* y, size_t n, double sums[5], double epsilon, double* rho)
;
; Returns    0 = error, 1 = success
    .code
AvxCalcCorrCoef_ proc frame
    _CreateFrame CC_,0,32
    _SaveXmmRegs xmm6,xmm7
    _EndProlog
; Validate arguments
    or r8,r8
    jz BadArg              ;jump if n == 0
    test rcx,1fh
    jnz BadArg             ;jump if x is not aligned
    test rdx,1fh
    jnz BadArg             ;jump if y is not aligned
; Initialize sum variables to zero
    vxorpd ymm3,ymm3,ymm3        ;ymm3 = packed sum_x
    vxorpd ymm4,ymm4,ymm4        ;ymm4 = packed sum_y
    vxorpd ymm5,ymm5,ymm5        ;ymm5 = packed sum_xx
    vxorpd ymm6,ymm6,ymm6        ;ymm6 = packed sum_yy
    vxorpd ymm7,ymm7,ymm7        ;ymm7 = packed sum_xy
    mov r10,r8             ;r10 = n
    cmp r8,4
    jb LP2               ;jump if n >= 1 && n <= 3
; Calculate intermediate packed sum variables
LP1:  vmovapd ymm0,ymmword ptr [rcx]   ;ymm0 = packed x values
    vmovapd ymm1,ymmword ptr [rdx]   ;ymm1 = packed y values
    vaddpd ymm3,ymm3,ymm0        ;update packed sum_x
    vaddpd ymm4,ymm4,ymm1        ;update packed sum_y
    vmulpd ymm2,ymm0,ymm1        ;ymm2 = packed xy values
    vaddpd ymm7,ymm7,ymm2        ;update packed sum_xy
    vmulpd ymm0,ymm0,ymm0        ;ymm0 = packed xx values
    vmulpd ymm1,ymm1,ymm1        ;ymm1 = packed yy values
    vaddpd ymm5,ymm5,ymm0        ;update packed sum_xx
    vaddpd ymm6,ymm6,ymm1        ;update packed sum_yy
    add rcx,32             ;update x ptr
    add rdx,32             ;update y ptr
    sub r8,4              ;n -= 4
    cmp r8,4              ;is n >= 4?
    jae LP1               ;jump if yes
    or r8,r8              ;is n == 0?
    jz FSV               ;jump if yes
; Update sum variables with final x & y values
LP2:  vmovsd xmm0,real8 ptr [rcx]     ;xmm0[63:0] = x[i], ymm0[255:64] = 0
    vmovsd xmm1,real8 ptr [rdx]     ;xmm1[63:0] = y[i], ymm1[255:64] = 0
    vaddpd ymm3,ymm3,ymm0        ;update packed sum_x
    vaddpd ymm4,ymm4,ymm1        ;update packed sum_y
    vmulpd ymm2,ymm0,ymm1        ;ymm2 = packed xy values
    vaddpd ymm7,ymm7,ymm2        ;update packed sum_xy
    vmulpd ymm0,ymm0,ymm0        ;ymm0 = packed xx values
    vmulpd ymm1,ymm1,ymm1        ;ymm1 = packed yy values
    vaddpd ymm5,ymm5,ymm0        ;update packed sum_xx
    vaddpd ymm6,ymm6,ymm1        ;update packed sum_yy
    add rcx,8              ;update x ptr
    add rdx,8              ;update y ptr
    sub r8,1              ;n -= 1
    jnz LP2               ;repeat until done
; Calculate final sum variables
FSV:  vextractf128 xmm0,ymm3,1
    vaddpd xmm1,xmm0,xmm3
    vhaddpd xmm3,xmm1,xmm1       ;xmm3[63:0] = sum_x
    vextractf128 xmm0,ymm4,1
    vaddpd xmm1,xmm0,xmm4
    vhaddpd xmm4,xmm1,xmm1       ;xmm4[63:0] = sum_y
    vextractf128 xmm0,ymm5,1
    vaddpd xmm1,xmm0,xmm5
    vhaddpd xmm5,xmm1,xmm1       ;xmm5[63:0] = sum_xx
    vextractf128 xmm0,ymm6,1
    vaddpd xmm1,xmm0,xmm6
    vhaddpd xmm6,xmm1,xmm1       ;xmm6[63:0] = sum_yy
    vextractf128 xmm0,ymm7,1
    vaddpd xmm1,xmm0,xmm7
    vhaddpd xmm7,xmm1,xmm1       ;xmm7[63:0] = sum_xy
; Save final sum variables
    vmovsd real8 ptr [r9],xmm3     ;save sum_x
    vmovsd real8 ptr [r9+8],xmm4    ;save sum_y
    vmovsd real8 ptr [r9+16],xmm5    ;save sum_xx
    vmovsd real8 ptr [r9+24],xmm6    ;save sum_yy
    vmovsd real8 ptr [r9+32],xmm7    ;save sum_xy
; Calculate rho numerator
; rho_num = n * sum_xy - sum_x * sum_y;
    vcvtsi2sd xmm2,xmm2,r10       ;xmm2 = n
    vmulsd xmm0,xmm2,xmm7        ;xmm0 = n * sum_xy
    vmulsd xmm1,xmm3,xmm4        ;xmm1 = sum_x * sum_y
    vsubsd xmm7,xmm0,xmm1        ;xmm7 = rho_num
; Calculate rho denominator
; t1 = sqrt(n * sum_xx - sum_x * sum_x)
; t2 = sqrt(n * sum_yy - sum_y * sum_y)
; rho_den = t1 * t2
    vmulsd xmm0,xmm2,xmm5    ;xmm0 = n * sum_xx
    vmulsd xmm3,xmm3,xmm3    ;xmm3 = sum_x * sum_x
    vsubsd xmm3,xmm0,xmm3    ;xmm3 = n * sum_xx - sum_x * sum_x
    vsqrtsd xmm3,xmm3,xmm3   ;xmm3 = t1
    vmulsd xmm0,xmm2,xmm6    ;xmm0 = n * sum_yy
    vmulsd xmm4,xmm4,xmm4    ;xmm4 = sum_y * sum_y
    vsubsd xmm4,xmm0,xmm4    ;xmm4 = n * sum_yy - sum_y * sum_y
    vsqrtsd xmm4,xmm4,xmm4   ;xmm4 = t2
    vmulsd xmm0,xmm3,xmm4    ;xmm0 = rho_den
; Calculate and save final rho
    xor eax,eax
    vcomisd xmm0,real8 ptr [rbp+CC_OffsetStackArgs] ;rho_den < epsilon?
    setae al                ;set return code
    jb BadRho                ;jump if rho_den < epsilon
    vdivsd xmm1,xmm7,xmm0          ;xmm1 = rho
SavRho: mov rdx,[rbp+CC_OffsetStackArgs+8]   ;rdx = ptr to rho
    vmovsd real8 ptr [rdx],xmm1       ;save rho
Done:  vzeroupper
    _RestoreXmmRegs xmm6,xmm7
    _DeleteFrame
    ret
; Error handling code
BadRho: vxorpd xmm1,xmm1,xmm1        ;rho = 0
    jmp SavRho
BadArg: xor eax,eax             ;eax = invalid arg ret code
    jmp Done
AvxCalcCorrCoef_ endp
    end
Listing 9-4.

Example Ch09_04

A correlation coefficient measures the strength of association between two variables. Correlation coefficients can range in value from -1.0 to +1.0; the two endpoints signify a perfect negative or positive linear relationship between the two variables. Real-world correlation coefficients are rarely equal to these theoretical limits. A correlation coefficient of 0.0 indicates that the data variables are not linearly associated. The C++ and assembly language code in this example calculate the well-known Pearson correlation coefficient using the following equation:
$$ \rho =\frac{n\sum \limits_i{x}_i{y}_i-\sum \limits_i{x}_i\sum \limits_i{y}_i}{\sqrt{n\sum \limits_i{x}_i^2-{\left(\sum \limits_i{x}_i\right)}^2}\sqrt{n\sum \limits_i{y}_i^2-{\left(\sum \limits_i{y}_i\right)}^2}} $$
In order to calculate a correlation coefficient using this formula, a function must compute the following five sum variables:
$$ \mathit{sum\_x}=\sum \limits_i{x}_i $$
$$ \mathit{sum\_y}=\sum \limits_i{y}_i $$
$$ \mathit{sum\_xx}=\sum \limits_i{x}_i^2 $$
$$ \mathit{sum\_yy}=\sum \limits_i{y}_i^2 $$
$$ \mathit{sum\_xy}=\sum \limits_i{x}_i{y}_i $$
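
As a quick check of the formula, consider a small hypothetical data set (not taken from the example program) with n = 3, x = (1, 2, 3), and y = (2, 4, 6). Since each y value is exactly twice the corresponding x value, the correlation coefficient should equal +1.0:
$$ \mathit{sum\_x}=6,\quad \mathit{sum\_y}=12,\quad \mathit{sum\_xx}=14,\quad \mathit{sum\_yy}=56,\quad \mathit{sum\_xy}=28 $$
$$ \rho =\frac{3\cdot 28-6\cdot 12}{\sqrt{3\cdot 14-{6}^2}\sqrt{3\cdot 56-{12}^2}}=\frac{12}{\sqrt{6}\sqrt{24}}=\frac{12}{12}=1 $$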

The C++ function AvxCalcCorrCoefCpp shows how to calculate a correlation coefficient. This function begins by checking the value of n to make sure it’s greater than zero. It also validates the two data arrays x and y for proper alignment. The aforementioned sum variables are then calculated using a simple for loop. Following completion of the for loop, the function AvxCalcCorrCoefCpp saves the sum variables to the array sums for comparison and display purposes. It then computes the intermediate values rho_num and rho_den. Before computing the final correlation coefficient rho, rho_den is tested to confirm that it’s greater than or equal to epsilon.

Following its prolog, the assembly language function AvxCalcCorrCoef_ performs the same size and alignment checks as its C++ counterpart. It then initializes packed versions of sum_x, sum_y, sum_xx, sum_yy, and sum_xy to zero in registers YMM3–YMM7. During each iteration, the loop labeled LP1 processes four elements from arrays x and y using packed double-precision floating-point arithmetic. This means that registers YMM3–YMM7 maintain four distinct intermediate values for each sum variable. Execution of loop LP1 continues until there are fewer than four elements remaining to process.
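
The alignment test performed by the test rcx,1fh and test rdx,1fh instructions amounts to checking that the low-order five bits of each address are zero. A minimal C++ sketch of the same check is shown below. The helper name IsAligned32 is hypothetical and is not the book's AlignedMem::IsAligned function, whose implementation is not shown in this chapter.

```cpp
#include <cstdint>

// Hypothetical helper (not the book's AlignedMem::IsAligned): returns true
// when the address held in p is a multiple of 32. This is the same condition
// that the assembly language code checks with test rcx,1fh.
inline bool IsAligned32(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 0x1f) == 0;
}
```

For an array of doubles that starts on a 32-byte boundary, every fourth element also resides on a 32-byte boundary, which is why loop LP1 can keep using vmovapd after each 32-byte pointer update.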

Following completion of loop LP1, the loop labeled LP2 processes the final one to three entries in arrays x and y. The vmovsd xmm0,real8 ptr [rcx] and vmovsd xmm1,real8 ptr [rdx] instructions load x[i] and y[i] into registers XMM0 and XMM1, respectively. Note that these vmovsd instructions also zero out bits YMM0[255:64] and YMM1[255:64], which means that the same chain of vaddpd and vmulpd instructions used in loop LP1 to update the intermediate sum variables can also be used in loop LP2 (the scalar instructions vaddsd and vmulsd cannot be used here to update the sum variables without extra code since these instructions set bits 255:128 of their destination operand register to zero). Following completion of loop LP2, each packed sum variable is reduced to a single value using a vextractf128, vaddpd, and vhaddpd instruction, as illustrated in Figure 9-3. The final sum values are then saved to the sums array.
../images/326959_2_En_9_Chapter/326959_2_En_9_Fig3_HTML.jpg
Figure 9-3.

Calculation of sum_x using vextractf128, vaddpd, and vhaddpd
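
The packed accumulation and reduction steps described above can be modeled in portable C++. The sketch below is only an illustration under stated assumptions: it uses a four-element std::array to stand in for a YMM register, computes sum_x only, and the names Lanes, AddPd, ReduceSum, and PackedSum are hypothetical rather than taken from the book.

```cpp
#include <array>
#include <cstddef>

using Lanes = std::array<double, 4>;    // models the four doubles of a YMM register

// Models vaddpd ymm,ymm,ymm: element-wise addition of all four lanes.
static Lanes AddPd(const Lanes& a, const Lanes& b)
{
    return {{ a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3] }};
}

// Models the reduction in Figure 9-3.
static double ReduceSum(const Lanes& v)
{
    // vextractf128 + vaddpd: add the upper 128 bits to the lower 128 bits
    double lo = v[0] + v[2];
    double hi = v[1] + v[3];

    // vhaddpd: add the two remaining lanes
    return lo + hi;
}

// Hypothetical function: computes sum_x using the LP1/LP2 loop structure.
double PackedSum(const double* x, std::size_t n)
{
    Lanes acc = {{ 0.0, 0.0, 0.0, 0.0 }};
    std::size_t i = 0;

    // Loop LP1: four elements per iteration
    for (; i + 4 <= n; i += 4)
        acc = AddPd(acc, Lanes{{ x[i], x[i + 1], x[i + 2], x[i + 3] }});

    // Loop LP2: vmovsd loads one element and zeroes the other three lanes,
    // so the same packed addition can be reused
    for (; i < n; i++)
        acc = AddPd(acc, Lanes{{ x[i], 0.0, 0.0, 0.0 }});

    return ReduceSum(acc);
}
```

Note that this approach adds the elements in a different order than a simple scalar loop. With values that are not exactly representable, the packed and scalar sums may therefore differ slightly in their low-order bits.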

Function AvxCalcCorrCoef_ uses simple scalar arithmetic to compute the intermediate values rho_num and rho_den. Like the corresponding C++ function, AvxCalcCorrCoef_ compares rho_den to epsilon (a value below epsilon is likely a rounding error and considered too close to zero to be valid). If rho_den is valid, the correlation coefficient rho is calculated and saved; otherwise, rho is set to zero and the function returns an error code. Here are the results for source code example Ch09_04:
Results for AvxCalcCorrCoef
Value        C++    x86-AVX
------------------------------------------
rho:     0.70128193   0.70128193
sum_x:     53081.0    53081.0
sum_y:    -199158.0   -199158.0
sum_xx:   35732585.0   35732585.0
sum_yy:   401708868.0  401708868.0
sum_xy:   -94360528.0  -94360528.0

Matrix Multiplication and Transposition

In Chapter 6, you learned how to perform 4 × 4 matrix transposition and multiplication using single-precision floating-point values (see source code examples Ch06_07 and Ch06_08). The source code example in this section illustrates how to carry out these same matrix operations using double-precision floating-point values. Listing 9-5 shows the source code for example Ch09_05. The fundamentals of matrix transposition and multiplication are explained in Chapter 6. If your understanding of these mathematical operations is lacking, you may want to review the relevant sections in Chapter 6 before proceeding.
//------------------------------------------------
//        Ch09_05.cpp
//------------------------------------------------
#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include "Ch09_05.h"
#include "Matrix.h"
using namespace std;
void AvxMat4x4TransposeF64(Matrix<double>& m_src1)
{
  const size_t nr = 4;
  const size_t nc = 4;
  Matrix<double> m_des1(nr ,nc);
  Matrix<double> m_des2(nr ,nc);
  Matrix<double>::Transpose(m_des1, m_src1);
  AvxMat4x4TransposeF64_(m_des2.Data(), m_src1.Data());
  cout << fixed << setprecision(1);
  m_src1.SetOstream(12, " ");
  m_des1.SetOstream(12, " ");
  m_des2.SetOstream(12, " ");
  cout << "Results for AvxMat4x4TransposeF64\n";
  cout << "Matrix m_src1\n" << m_src1 << '\n';
  cout << "Matrix m_des1\n" << m_des1 << '\n';
  cout << "Matrix m_des2\n" << m_des2 << '\n';
  if (m_des1 != m_des2)
    cout << "\nMatrix compare failed - AvxMat4x4TransposeF64\n";
}
void AvxMat4x4MulF64(Matrix<double>& m_src1, Matrix<double>& m_src2)
{
  const size_t nr = 4;
  const size_t nc = 4;
  Matrix<double> m_des1(nr ,nc);
  Matrix<double> m_des2(nr ,nc);
  Matrix<double>::Mul(m_des1, m_src1, m_src2);
  AvxMat4x4MulF64_(m_des2.Data(), m_src1.Data(), m_src2.Data());
  cout << fixed << setprecision(1);
  m_src1.SetOstream(12, " ");
  m_src2.SetOstream(12, " ");
  m_des1.SetOstream(12, " ");
  m_des2.SetOstream(12, " ");
  cout << "\nResults for AvxMat4x4MulF64\n";
  cout << "Matrix m_src1\n" << m_src1 << '\n';
  cout << "Matrix m_src2\n" << m_src2 << '\n';
  cout << "Matrix m_des1\n" << m_des1 << '\n';
  cout << "Matrix m_des2\n" << m_des2 << '\n';
  if (m_des1 != m_des2)
    cout << "\nMatrix compare failed - AvxMat4x4MulF64\n";
}
int main()
{
  const size_t nr = 4;
  const size_t nc = 4;
  Matrix<double> m_src1(nr ,nc);
  Matrix<double> m_src2(nr ,nc);
  const double src1_row0[] = { 10, 11, 12, 13 };
  const double src1_row1[] = { 20, 21, 22, 23 };
  const double src1_row2[] = { 30, 31, 32, 33 };
  const double src1_row3[] = { 40, 41, 42, 43 };
  const double src2_row0[] = { 100, 101, 102, 103 };
  const double src2_row1[] = { 200, 201, 202, 203 };
  const double src2_row2[] = { 300, 301, 302, 303 };
  const double src2_row3[] = { 400, 401, 402, 403 };
  m_src1.SetRow(0, src1_row0);
  m_src1.SetRow(1, src1_row1);
  m_src1.SetRow(2, src1_row2);
  m_src1.SetRow(3, src1_row3);
  m_src2.SetRow(0, src2_row0);
  m_src2.SetRow(1, src2_row1);
  m_src2.SetRow(2, src2_row2);
  m_src2.SetRow(3, src2_row3);
  // Test functions
  AvxMat4x4TransposeF64(m_src1);
  AvxMat4x4MulF64(m_src1, m_src2);
  // Benchmark functions
  AvxMat4x4TransposeF64_BM();
  AvxMat4x4MulF64_BM();
  return 0;
}
;-------------------------------------------------
;        Ch09_05.asm
;-------------------------------------------------
    include <MacrosX86-64-AVX.asmh>
; _Mat4x4TransposeF64 macro
;
; Description: This macro computes the transpose of a 4x4
;        double-precision floating-point matrix.
;
; Input Matrix          Output Matrix
; ---------------------------------------------------
; ymm0  a3 a2 a1 a0       ymm0  d0 c0 b0 a0
; ymm1  b3 b2 b1 b0       ymm1  d1 c1 b1 a1
; ymm2  c3 c2 c1 c0       ymm2  d2 c2 b2 a2
; ymm3  d3 d2 d1 d0       ymm3  d3 c3 b3 a3
;
_Mat4x4TransposeF64 macro
    vunpcklpd ymm4,ymm0,ymm1      ;ymm4 = b2 a2 b0 a0
    vunpckhpd ymm5,ymm0,ymm1      ;ymm5 = b3 a3 b1 a1
    vunpcklpd ymm6,ymm2,ymm3      ;ymm6 = d2 c2 d0 c0
    vunpckhpd ymm7,ymm2,ymm3      ;ymm7 = d3 c3 d1 c1
    vperm2f128 ymm0,ymm4,ymm6,20h    ;ymm0 = d0 c0 b0 a0
    vperm2f128 ymm1,ymm5,ymm7,20h    ;ymm1 = d1 c1 b1 a1
    vperm2f128 ymm2,ymm4,ymm6,31h    ;ymm2 = d2 c2 b2 a2
    vperm2f128 ymm3,ymm5,ymm7,31h    ;ymm3 = d3 c3 b3 a3
    endm
; extern "C" void AvxMat4x4TransposeF64_(double* m_des, const double* m_src1)
    .code
AvxMat4x4TransposeF64_ proc frame
    _CreateFrame MT_,0,32
    _SaveXmmRegs xmm6,xmm7
    _EndProlog
; Transpose matrix m_src1
    vmovaps ymm0,[rdx]         ;ymm0 = m_src1.row_0
    vmovaps ymm1,[rdx+32]        ;ymm1 = m_src1.row_1
    vmovaps ymm2,[rdx+64]        ;ymm2 = m_src1.row_2
    vmovaps ymm3,[rdx+96]        ;ymm3 = m_src1.row_3
    _Mat4x4TransposeF64
    vmovaps [rcx],ymm0         ;save m_des.row_0
    vmovaps [rcx+32],ymm1        ;save m_des.row_1
    vmovaps [rcx+64],ymm2        ;save m_des.row_2
    vmovaps [rcx+96],ymm3        ;save m_des.row_3
    vzeroupper
Done:  _RestoreXmmRegs xmm6,xmm7
    _DeleteFrame
    ret
AvxMat4x4TransposeF64_ endp
; _Mat4x4MulCalcRowF64 macro
;
; Description: This macro computes one row of a 4x4 matrix multiplication.
;
; Registers:  ymm0 = m_src2.row0
;        ymm1 = m_src2.row1
;        ymm2 = m_src2.row2
;        ymm3 = m_src2.row3
;        rcx = m_des ptr
;        rdx = m_src1 ptr
;        ymm4 - ymm7 = scratch registers
_Mat4x4MulCalcRowF64 macro disp
    vbroadcastsd ymm4,real8 ptr [rdx+disp]   ;broadcast m_src1[i][0]
    vbroadcastsd ymm5,real8 ptr [rdx+disp+8]  ;broadcast m_src1[i][1]
    vbroadcastsd ymm6,real8 ptr [rdx+disp+16]  ;broadcast m_src1[i][2]
    vbroadcastsd ymm7,real8 ptr [rdx+disp+24]  ;broadcast m_src1[i][3]
    vmulpd ymm4,ymm4,ymm0            ;m_src1[i][0] * m_src2.row_0
    vmulpd ymm5,ymm5,ymm1            ;m_src1[i][1] * m_src2.row_1
    vmulpd ymm6,ymm6,ymm2            ;m_src1[i][2] * m_src2.row_2
    vmulpd ymm7,ymm7,ymm3            ;m_src1[i][3] * m_src2.row_3
    vaddpd ymm4,ymm4,ymm5            ;calc m_des.row_i
    vaddpd ymm6,ymm6,ymm7
    vaddpd ymm4,ymm4,ymm6
    vmovapd [rcx+disp],ymm4           ;save m_des.row_i
    endm
; extern "C" void AvxMat4x4MulF64_(double* m_des, const double* m_src1, const double* m_src2)
AvxMat4x4MulF64_ proc frame
    _CreateFrame MM_,0,32
    _SaveXmmRegs xmm6,xmm7
    _EndProlog
; Load m_src2 into YMM3:YMM0
    vmovapd ymm0,[r8]          ;ymm0 = m_src2.row_0
    vmovapd ymm1,[r8+32]        ;ymm1 = m_src2.row_1
    vmovapd ymm2,[r8+64]        ;ymm2 = m_src2.row_2
    vmovapd ymm3,[r8+96]        ;ymm3 = m_src2.row_3
; Compute matrix product
    _Mat4x4MulCalcRowF64 0       ;calculate m_des.row_0
    _Mat4x4MulCalcRowF64 32       ;calculate m_des.row_1
    _Mat4x4MulCalcRowF64 64       ;calculate m_des.row_2
    _Mat4x4MulCalcRowF64 96       ;calculate m_des.row_3
    vzeroupper
Done:  _RestoreXmmRegs xmm6,xmm7
    _DeleteFrame
    ret
AvxMat4x4MulF64_ endp
    end
Listing 9-5.

Example Ch09_05

The C++ source code that’s shown in Listing 9-5 is very similar to what you saw in Chapter 6. It begins with a function named AvxMat4x4TransposeF64 that exercises both the C++ and assembly language matrix transposition calculating routines and displays the results. The function that follows, AvxMat4x4MulF64, implements the same tasks for matrix multiplication. Similar to the source code examples in Chapter 6, the C++ versions of matrix transposition and multiplication are implemented by the template functions Matrix<>::Transpose and Matrix<>::Mul, respectively. Chapter 6 contains additional details regarding these template functions.

Near the top of the assembly language code is a macro named _Mat4x4TransposeF64. This macro contains instructions that transpose a 4 × 4 matrix of double-precision floating-point values. The four rows of the source double-precision floating-point matrix must be loaded in registers YMM0–YMM3 prior to its use. Macro _Mat4x4TransposeF64 uses the vperm2f128 instruction to permute the 128-bit wide floating-point fields of its two source operands. This instruction uses an immediate 8-bit control mask to select which fields are copied from the source operands to the destination operand, as outlined in Table 9-1. Figure 9-4 shows the entire 4 × 4 matrix transposition operation in greater detail. The assembly language function AvxMat4x4TransposeF64_ uses the macro _Mat4x4TransposeF64 to transpose a 4 × 4 matrix of double-precision floating-point values.
Table 9-1.

Field Selection for vperm2f128 ymm0,ymm1,ymm2,imm8 Instruction

Destination Field    Source Field     imm8[1:0]    imm8[5:4]
ymm0[127:0]          ymm1[127:0]      0
                     ymm1[255:128]    1
                     ymm2[127:0]      2
                     ymm2[255:128]    3
ymm0[255:128]        ymm1[127:0]                   0
                     ymm1[255:128]                 1
                     ymm2[127:0]                   2
                     ymm2[255:128]                 3

../images/326959_2_En_9_Chapter/326959_2_En_9_Fig4_HTML.jpg
Figure 9-4.

Instruction sequence used by _Mat4x4TransposeF64 to transpose a 4 × 4 matrix of double-precision floating-point values
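
The instruction sequence used by the transposition macro can be emulated in portable C++ to see how the unpack and permute steps combine. The following sketch is an illustration only. It assumes the documented behavior of vperm2f128, where imm8[1:0] selects the source of the low 128 bits and imm8[5:4] selects the source of the high 128 bits, and it ignores the instruction's zeroing control bits since the macro does not use them. The names Ymm, UnpackLo, UnpackHi, Perm2f128, and Transpose4x4 are hypothetical.

```cpp
#include <array>
#include <cstdint>

using Ymm = std::array<double, 4>;      // element 0 models bits 63:0

// Models vunpcklpd: interleaves the low double of each 128-bit lane.
static Ymm UnpackLo(const Ymm& a, const Ymm& b)
{
    return {{ a[0], b[0], a[2], b[2] }};
}

// Models vunpckhpd: interleaves the high double of each 128-bit lane.
static Ymm UnpackHi(const Ymm& a, const Ymm& b)
{
    return {{ a[1], b[1], a[3], b[3] }};
}

// Models vperm2f128 (field selection only; zeroing bits are ignored).
static Ymm Perm2f128(const Ymm& a, const Ymm& b, std::uint8_t imm8)
{
    auto select = [&](unsigned sel, unsigned k) -> double
    {
        const Ymm& src = (sel & 2) ? b : a;     // bit 1 picks the source operand
        return src[(sel & 1) * 2 + k];          // bit 0 picks the 128-bit half
    };

    unsigned lo = imm8 & 3;                     // imm8[1:0]
    unsigned hi = (imm8 >> 4) & 3;              // imm8[5:4]
    return {{ select(lo, 0), select(lo, 1), select(hi, 0), select(hi, 1) }};
}

// Transposes the 4x4 matrix held in r[0]..r[3] using the macro's sequence.
void Transpose4x4(std::array<Ymm, 4>& r)
{
    Ymm t4 = UnpackLo(r[0], r[1]);      // b2 a2 b0 a0
    Ymm t5 = UnpackHi(r[0], r[1]);      // b3 a3 b1 a1
    Ymm t6 = UnpackLo(r[2], r[3]);      // d2 c2 d0 c0
    Ymm t7 = UnpackHi(r[2], r[3]);      // d3 c3 d1 c1

    r[0] = Perm2f128(t4, t6, 0x20);     // d0 c0 b0 a0
    r[1] = Perm2f128(t5, t7, 0x20);     // d1 c1 b1 a1
    r[2] = Perm2f128(t4, t6, 0x31);     // d2 c2 b2 a2
    r[3] = Perm2f128(t5, t7, 0x31);     // d3 c3 b3 a3
}
```

The control value 20h combines the low halves of both source operands, while 31h combines the high halves, which is why two permutes per control value are sufficient to finish the transposition.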

In Listing 9-5, the macro definition _Mat4x4MulCalcRowF64 follows the function AvxMat4x4TransposeF64_. This macro contains instructions that calculate a single row of a 4 × 4 matrix multiplication. The row-multiplication technique that’s used here is identical to the one that was used in source code example Ch06_08 in Chapter 6 (see Figure 6-7). The function AvxMat4x4MulF64_ uses the macro _Mat4x4MulCalcRowF64 to multiply two 4 × 4 double-precision floating-point matrices. Here are the results for source code example Ch09_05:
Results for AvxMat4x4TransposeF64
Matrix m_src1
    10.0     11.0     12.0     13.0
    20.0     21.0     22.0     23.0
    30.0     31.0     32.0     33.0
    40.0     41.0     42.0     43.0
Matrix m_des1
    10.0     20.0     30.0     40.0
    11.0     21.0     31.0     41.0
    12.0     22.0     32.0     42.0
    13.0     23.0     33.0     43.0
Matrix m_des2
    10.0     20.0     30.0     40.0
    11.0     21.0     31.0     41.0
    12.0     22.0     32.0     42.0
    13.0     23.0     33.0     43.0
Results for AvxMat4x4MulF64
Matrix m_src1
    10.0     11.0     12.0     13.0
    20.0     21.0     22.0     23.0
    30.0     31.0     32.0     33.0
    40.0     41.0     42.0     43.0
Matrix m_src2
    100.0     101.0     102.0     103.0
    200.0     201.0     202.0     203.0
    300.0     301.0     302.0     303.0
    400.0     401.0     402.0     403.0
Matrix m_des1
   12000.0    12046.0    12092.0    12138.0
   22000.0    22086.0    22172.0    22258.0
   32000.0    32126.0    32252.0    32378.0
   42000.0    42166.0    42332.0    42498.0
Matrix m_des2
   12000.0    12046.0    12092.0    12138.0
   22000.0    22086.0    22172.0    22258.0
   32000.0    32126.0    32252.0    32378.0
   42000.0    42166.0    42332.0    42498.0
Running benchmark function AvxMat4x4TransposeF64_BM - please wait
Benchmark times save to file Ch09_05_AvxMat4x4TransposeF64_BM_CHROMIUM.csv
Running benchmark function AvxMat4x4MulF64_BM - please wait
Benchmark times save to file Ch09_05_AvxMat4x4MulF64_BM_CHROMIUM.csv
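The row-oriented multiplication technique can also be expressed in scalar C++, which makes the role of each broadcast value easier to see. This is an illustrative sketch only: the names Row4, Mat4x4, and Mat4x4MulByRows are hypothetical, and the loop accumulates the four products sequentially, whereas the macro adds them pairwise, so the last bit of a result could differ for values that are not exactly representable.

```cpp
#include <array>
#include <cstddef>

using Row4 = std::array<double, 4>;      // hypothetical type: one matrix row
using Mat4x4 = std::array<Row4, 4>;      // hypothetical type: 4 x 4 matrix

// Computes des = src1 * src2 one destination row at a time:
// des.row(i) = sum over k of src1[i][k] * src2.row(k)
Mat4x4 Mat4x4MulByRows(const Mat4x4& src1, const Mat4x4& src2)
{
    Mat4x4 des{};                        // all elements start at 0.0

    for (std::size_t i = 0; i < 4; i++)
    {
        for (std::size_t k = 0; k < 4; k++)
        {
            // The scalar that a broadcast instruction would replicate
            // across all four elements of a register.
            const double s = src1[i][k];

            // Models one packed multiply and add across an entire row.
            for (std::size_t j = 0; j < 4; j++)
                des[i][j] += s * src2[k][j];
        }
    }

    return des;
}
```

Applying Mat4x4MulByRows to the m_src1 and m_src2 matrices shown in the output reproduces the m_des1 values, for example 12000.0 in element [0][0].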
Tables 9-2 and 9-3 contain benchmark timing measurements for the matrix transposition and multiplication functions presented in this section. These measurements were made using the procedure that’s described in Chapter 6.
Table 9-2. Matrix Transposition Mean Execution Times (Microseconds), 1,000,000 Transpositions

CPU         C++      Assembly Language
i7-4790S    15562    2670
i9-7900X    13167    2112
i7-8700K    12194    1963

Table 9-3. Matrix Multiplication Mean Execution Times (Microseconds), 1,000,000 Multiplications

CPU         C++      Assembly Language
i7-4790S    55652    5874
i9-7900X    46910    5286
i7-8700K    43118    4505

Matrix Inversion

Besides transposition and multiplication, matrix inversion is another common operation that’s often applied to 4 × 4 matrices. In this section, you examine a program that calculates the inverse of a 4 × 4 matrix of double-precision floating-point values. Listing 9-6 shows the source code for example Ch09_06.
//------------------------------------------------
//        Ch09_06.cpp
//------------------------------------------------
#include "stdafx.h"
#include <cmath>
#include "Ch09_06.h"
#include "Matrix.h"
using namespace std;
bool Avx2Mat4x4InvF64Cpp(Matrix<double>& m_inv, const Matrix<double>& m, double epsilon, bool* is_singular)
{
  // The intermediate matrices below are declared static for benchmarking purposes.
  static const size_t nrows = 4;
  static const size_t ncols = 4;
  static Matrix<double> m2(nrows, ncols);
  static Matrix<double> m3(nrows, ncols);
  static Matrix<double> m4(nrows, ncols);
  static Matrix<double> I(nrows, ncols, true);
  static Matrix<double> tempA(nrows, ncols);
  static Matrix<double> tempB(nrows, ncols);
  static Matrix<double> tempC(nrows, ncols);
  static Matrix<double> tempD(nrows, ncols);
  Matrix<double>::Mul(m2, m, m);
  Matrix<double>::Mul(m3, m2, m);
  Matrix<double>::Mul(m4, m3, m);
  double t1 = m.Trace();
  double t2 = m2.Trace();
  double t3 = m3.Trace();
  double t4 = m4.Trace();
  double c1 = -t1;
  double c2 = -1.0 / 2.0 * (c1 * t1 + t2);
  double c3 = -1.0 / 3.0 * (c2 * t1 + c1 * t2 + t3);
  double c4 = -1.0 / 4.0 * (c3 * t1 + c2 * t2 + c1 * t3 + t4);
  // Make sure matrix is not singular
  *is_singular = (fabs(c4) < epsilon);
  if (*is_singular)
    return false;
  // Calculate = -1.0 / c4 * (m3 + c1 * m2 + c2 * m + c3 * I)
  Matrix<double>::MulScalar(tempA, I, c3);
  Matrix<double>::MulScalar(tempB, m, c2);
  Matrix<double>::MulScalar(tempC, m2, c1);
  Matrix<double>::Add(tempD, tempA, tempB);
  Matrix<double>::Add(tempD, tempD, tempC);
  Matrix<double>::Add(tempD, tempD, m3);
  Matrix<double>::MulScalar(m_inv, tempD, -1.0 / c4);
  return true;
}
void Avx2Mat4x4InvF64(const Matrix<double>& m, const char* msg)
{
  cout << '\n' << msg << " - Test Matrix\n";
  cout << m << '\n';
  const double epsilon = 1.0e-9;
  const size_t nrows = m.GetNumRows();
  const size_t ncols = m.GetNumCols();
  Matrix<double> m_inv_a(nrows, ncols);
  Matrix<double> m_ver_a(nrows, ncols);
  Matrix<double> m_inv_b(nrows, ncols);
  Matrix<double> m_ver_b(nrows, ncols);
  for (int i = 0; i <= 1; i++)
  {
    string fn;
    const size_t nrows = m.GetNumRows();
    const size_t ncols = m.GetNumCols();
    Matrix<double> m_inv(nrows, ncols);
    Matrix<double> m_ver(nrows, ncols);
    bool rc, is_singular;
    if (i == 0)
    {
      fn = "Avx2Mat4x4InvF64Cpp";
      rc = Avx2Mat4x4InvF64Cpp(m_inv, m, epsilon, &is_singular);
      if (rc)
        Matrix<double>::Mul(m_ver, m_inv, m);
    }
    else
    {
      fn = "Avx2Mat4x4InvF64_";
      rc = Avx2Mat4x4InvF64_(m_inv.Data(), m.Data(), epsilon, &is_singular);
      if (rc)
        Avx2Mat4x4MulF64_(m_ver.Data(), m_inv.Data(), m.Data());
    }
    if (rc)
    {
      cout << msg << " - " << fn << " - Inverse Matrix\n";
      cout << m_inv << '\n';
      // Round to zero used for display purposes, can be removed.
      cout << msg << " - " << fn << " - Verify Matrix\n";
      m_ver.RoundToZero(epsilon);
      cout << m_ver << '\n';
    }
    else
    {
      if (is_singular)
        cout << msg << " - " << fn << " - Singular Matrix\n";
      else
        cout << msg << " - " << fn << " - Unexpected error occurred\n";
    }
  }
}
int main()
{
  cout << "\nResults for Avx2Mat4x4InvF64\n";
  // Test Matrix #1 - Non-Singular
  Matrix<double> m1(4, 4);
  const double m1_row0[] = { 2, 7, 3, 4 };
  const double m1_row1[] = { 5, 9, 6, 4.75 };
  const double m1_row2[] = { 6.5, 3, 4, 10 };
  const double m1_row3[] = { 7, 5.25, 8.125, 6 };
  m1.SetRow(0, m1_row0);
  m1.SetRow(1, m1_row1);
  m1.SetRow(2, m1_row2);
  m1.SetRow(3, m1_row3);
  // Test Matrix #2 - Non-Singular
  Matrix<double> m2(4, 4);
  const double m2_row0[] = { 0.5, 12, 17.25, 4 };
  const double m2_row1[] = { 5, 2, 6.75, 8 };
  const double m2_row2[] = { 13.125, 1, 3, 9.75 };
  const double m2_row3[] = { 16, 1.625, 7, 0.25 };
  m2.SetRow(0, m2_row0);
  m2.SetRow(1, m2_row1);
  m2.SetRow(2, m2_row2);
  m2.SetRow(3, m2_row3);
  // Test Matrix #3 - Singular
  Matrix<double> m3(4, 4);
  const double m3_row0[] = { 2, 0, 0, 1 };
  const double m3_row1[] = { 0, 4, 5, 0 };
  const double m3_row2[] = { 0, 0, 0, 7 };
  const double m3_row3[] = { 0, 0, 0, 6 };
  m3.SetRow(0, m3_row0);
  m3.SetRow(1, m3_row1);
  m3.SetRow(2, m3_row2);
  m3.SetRow(3, m3_row3);
  Avx2Mat4x4InvF64(m1, "Test #1");
  Avx2Mat4x4InvF64(m2, "Test #2");
  Avx2Mat4x4InvF64(m3, "Test #3");
  Avx2Mat4x4InvF64_BM(m1);
  return 0;
}
;-------------------------------------------------
;        Ch09_06.asm
;-------------------------------------------------
    include <MacrosX86-64-AVX.asmh>
; Custom segment for constants
ConstVals segment readonly align(32) 'const'
Mat4x4I real8 1.0, 0.0, 0.0, 0.0
    real8 0.0, 1.0, 0.0, 0.0
    real8 0.0, 0.0, 1.0, 0.0
    real8 0.0, 0.0, 0.0, 1.0
r8_SignBitMask qword 4 dup (8000000000000000h)
r8_AbsMask   qword 4 dup (7fffffffffffffffh)
r8_1p0     real8 1.0
r8_N1p0     real8 -1.0
r8_N0p5     real8 -0.5
r8_N0p3333   real8 -0.33333333333333
r8_N0p25    real8 -0.25
ConstVals ends
    .code
; _Max4x4TraceF64 macro
;
; Description: This macro contains instructions that compute the trace
;        of the 4x4 double-precision floating-point matrix in ymm3:ymm0.
_Max4x4TraceF64 macro
    vblendpd ymm0,ymm0,ymm1,00000010b    ;ymm0[127:0] = row 1,0 diag vals
    vblendpd ymm1,ymm2,ymm3,00001000b    ;ymm1[255:128] = row 3,2 diag vals
    vperm2f128 ymm2,ymm1,ymm1,00000001b   ;ymm2[127:0] = row 3,2 diag vals
    vaddpd ymm3,ymm0,ymm2
    vhaddpd ymm0,ymm3,ymm3         ;xmm0[63:0] = trace
    endm
; extern "C" double Avx2Mat4x4TraceF64_(const double* m_src1)
;
; Description: The following function computes the trace of a
;        4x4 double-precision floating-point array.
Avx2Mat4x4TraceF64_ proc
      vmovapd ymm0,[rcx]       ;ymm0 = m_src1.row_0
      vmovapd ymm1,[rcx+32]      ;ymm1 = m_src1.row_1
      vmovapd ymm2,[rcx+64]      ;ymm2 = m_src1.row_2
      vmovapd ymm3,[rcx+96]      ;ymm3 = m_src1.row_3
      _Max4x4TraceF64         ;xmm0[63:0] = m_src1.trace()
      vzeroupper
      ret
Avx2Mat4x4TraceF64_ endp
; _Mat4x4MulCalcRowF64 macro
;
; Description: This macro is used to compute one row of a 4x4 matrix
;        multiply.
;
; Registers:  ymm0 = m_src2.row0
;        ymm1 = m_src2.row1
;        ymm2 = m_src2.row2
;        ymm3 = m_src2.row3
;        ymm4 - ymm7 = scratch registers
_Mat4x4MulCalcRowF64 macro dreg,sreg,disp
    vbroadcastsd ymm4,real8 ptr [sreg+disp]   ;broadcast m_src1[i][0]
    vbroadcastsd ymm5,real8 ptr [sreg+disp+8]  ;broadcast m_src1[i][1]
    vbroadcastsd ymm6,real8 ptr [sreg+disp+16] ;broadcast m_src1[i][2]
    vbroadcastsd ymm7,real8 ptr [sreg+disp+24] ;broadcast m_src1[i][3]
    vmulpd ymm4,ymm4,ymm0            ;m_src1[i][0] * m_src2.row_0
    vmulpd ymm5,ymm5,ymm1            ;m_src1[i][1] * m_src2.row_1
    vmulpd ymm6,ymm6,ymm2            ;m_src1[i][2] * m_src2.row_2
    vmulpd ymm7,ymm7,ymm3            ;m_src1[i][3] * m_src2.row_3
    vaddpd ymm4,ymm4,ymm5            ;calc m_des.row_i
    vaddpd ymm6,ymm6,ymm7
    vaddpd ymm4,ymm4,ymm6
    vmovapd [dreg+disp],ymm4          ;save m_des.row_i
    endm
; extern "C" void Avx2Mat4x4MulF64_(double* m_des, const double* m_src1, const double* m_src2)
Avx2Mat4x4MulF64_ proc frame
    _CreateFrame MM_,0,32
    _SaveXmmRegs xmm6,xmm7
    _EndProlog
    vmovapd ymm0,[r8]          ;ymm0 = m_src2.row_0
    vmovapd ymm1,[r8+32]        ;ymm1 = m_src2.row_1
    vmovapd ymm2,[r8+64]        ;ymm2 = m_src2.row_2
    vmovapd ymm3,[r8+96]        ;ymm3 = m_src2.row_3
    _Mat4x4MulCalcRowF64 rcx,rdx,0   ;calculate m_des.row_0
    _Mat4x4MulCalcRowF64 rcx,rdx,32   ;calculate m_des.row_1
    _Mat4x4MulCalcRowF64 rcx,rdx,64   ;calculate m_des.row_2
    _Mat4x4MulCalcRowF64 rcx,rdx,96   ;calculate m_des.row_3
    vzeroupper
    _RestoreXmmRegs xmm6,xmm7
    _DeleteFrame
    ret
Avx2Mat4x4MulF64_ endp
; extern "C" bool Avx2Mat4x4InvF64_(double* m_inv, const double* m, double epsilon, bool* is_singular);
; Offsets of intermediate matrices on stack relative to rsp
OffsetM2 equ 32
OffsetM3 equ 160
OffsetM4 equ 288
Avx2Mat4x4InvF64_ proc frame
    _CreateFrame MI_,0,160
    _SaveXmmRegs xmm6,xmm7,xmm8,xmm9,xmm10,xmm11,xmm12,xmm13,xmm14,xmm15
    _EndProlog
; Save args to home area for later use
    mov qword ptr [rbp+MI_OffsetHomeRCX],rcx    ;save m_inv ptr
    mov qword ptr [rbp+MI_OffsetHomeRDX],rdx    ;save m ptr
    vmovsd real8 ptr [rbp+MI_OffsetHomeR8],xmm2   ;save epsilon
    mov qword ptr [rbp+MI_OffsetHomeR9],r9     ;save is_singular ptr
; Allocate 384 bytes of stack space for temp matrices + 32 bytes for function calls
    and rsp,0ffffffe0h         ;align rsp to 32-byte boundary
    sub rsp,416             ;alloc stack space
; Calculate m2
    lea rcx,[rsp+OffsetM2]       ;rcx = m2 ptr
    mov r8,rdx             ;rdx, r8 = m ptr
    call Avx2Mat4x4MulF64_       ;calculate and save m2
; Calculate m3
    lea rcx,[rsp+OffsetM3]       ;rcx = m3 ptr
    lea rdx,[rsp+OffsetM2]       ;rdx = m2 ptr
    mov r8,[rbp+MI_OffsetHomeRDX]    ;r8 = m
    call Avx2Mat4x4MulF64_       ;calculate and save m3
; Calculate m4
    lea rcx,[rsp+OffsetM4]       ;rcx = m4 ptr
    lea rdx,[rsp+OffsetM3]       ;rdx = m3 ptr
    mov r8,[rbp+MI_OffsetHomeRDX]    ;r8 = m
    call Avx2Mat4x4MulF64_       ;calculate and save m4
; Calculate trace of m, m2, m3, and m4
    mov rcx,[rbp+MI_OffsetHomeRDX]
    call Avx2Mat4x4TraceF64_
    vmovsd xmm8,xmm8,xmm0        ;xmm8 = t1
    lea rcx,[rsp+OffsetM2]
    call Avx2Mat4x4TraceF64_
    vmovsd xmm9,xmm9,xmm0        ;xmm9 = t2
    lea rcx,[rsp+OffsetM3]
    call Avx2Mat4x4TraceF64_
    vmovsd xmm10,xmm10,xmm0       ;xmm10 = t3
    lea rcx,[rsp+OffsetM4]
    call Avx2Mat4x4TraceF64_
    vmovsd xmm11,xmm11,xmm0       ;xmm11 = t4
; Calculate the required coefficients
; c1 = -t1;
; c2 = -1.0f / 2.0f * (c1 * t1 + t2);
; c3 = -1.0f / 3.0f * (c2 * t1 + c1 * t2 + t3);
; c4 = -1.0f / 4.0f * (c3 * t1 + c2 * t2 + c1 * t3 + t4);
;
; Registers used:
;  t1-t4 = xmm8-xmm11
;  c1-c4 = xmm12-xmm15
    vxorpd xmm12,xmm8,real8 ptr [r8_SignBitMask]  ;xmm12 = c1
    vmulsd xmm13,xmm12,xmm8     ;c1 * t1
    vaddsd xmm13,xmm13,xmm9     ;c1 * t1 + t2
    vmulsd xmm13,xmm13,[r8_N0p5]  ;c2
    vmulsd xmm14,xmm13,xmm8     ;c2 * t1
    vmulsd xmm0,xmm12,xmm9     ;c1 * t2
    vaddsd xmm14,xmm14,xmm0     ;c2 * t1 + c1 * t2
    vaddsd xmm14,xmm14,xmm10    ;c2 * t1 + c1 * t2 + t3
    vmulsd xmm14,xmm14,[r8_N0p3333] ;c3
    vmulsd xmm15,xmm14,xmm8     ;c3 * t1
    vmulsd xmm0,xmm13,xmm9     ;c2 * t2
    vmulsd xmm1,xmm12,xmm10     ;c1 * t3
    vaddsd xmm2,xmm0,xmm1      ;c2 * t2 + c1 * t3
    vaddsd xmm15,xmm15,xmm2     ;c3 * t1 + c2 * t2 + c1 * t3
    vaddsd xmm15,xmm15,xmm11    ;c3 * t1 + c2 * t2 + c1 * t3 + t4
    vmulsd xmm15,xmm15,[r8_N0p25]  ;c4
; Make sure matrix is not singular
    vandpd xmm0,xmm15,[r8_AbsMask]         ;compute fabs(c4)
    vmovsd xmm1,real8 ptr [rbp+MI_OffsetHomeR8]
    vcomisd xmm0,real8 ptr [rbp+MI_OffsetHomeR8]  ;compare against epsilon
    setp al                     ;al = 1 if unordered (NaN operand)
    setb ah                     ;ah = 1 if fabs(c4) < epsilon
    or al,ah                    ;al = is_singular
    mov rcx,[rbp+MI_OffsetHomeR9]          ;rcx = is_singular ptr
    mov [rcx],al                  ;save is_singular state
    jnz Error                    ;jump if singular
; Calculate m_inv = -1.0 / c4 * (m3 + c1 * m2 + c2 * m1 + c3 * I)
    vbroadcastsd ymm14,xmm14            ;ymm14 = packed c3
    lea rcx,[Mat4x4I]                ;rcx = I ptr
    vmulpd ymm0,ymm14,ymmword ptr [rcx]
    vmulpd ymm1,ymm14,ymmword ptr [rcx+32]
    vmulpd ymm2,ymm14,ymmword ptr [rcx+64]
    vmulpd ymm3,ymm14,ymmword ptr [rcx+96]     ;c3 * I
    vbroadcastsd ymm13,xmm13            ;ymm13 = packed c2
    mov rcx,[rbp+MI_OffsetHomeRDX]         ;rcx = m ptr
    vmulpd ymm4,ymm13,ymmword ptr [rcx]
    vmulpd ymm5,ymm13,ymmword ptr [rcx+32]
    vmulpd ymm6,ymm13,ymmword ptr [rcx+64]
    vmulpd ymm7,ymm13,ymmword ptr [rcx+96]     ;c2 * m1
    vaddpd ymm0,ymm0,ymm4
    vaddpd ymm1,ymm1,ymm5
    vaddpd ymm2,ymm2,ymm6
    vaddpd ymm3,ymm3,ymm7              ;c2 * m1 + c3 * I
    vbroadcastsd ymm12,xmm12            ;ymm12 = packed c1
    lea rcx,[rsp+OffsetM2]             ;rcx = m2 ptr
    vmulpd ymm4,ymm12,ymmword ptr [rcx]
    vmulpd ymm5,ymm12,ymmword ptr [rcx+32]
    vmulpd ymm6,ymm12,ymmword ptr [rcx+64]
    vmulpd ymm7,ymm12,ymmword ptr [rcx+96]     ;c1 * m2
    vaddpd ymm0,ymm0,ymm4
    vaddpd ymm1,ymm1,ymm5
    vaddpd ymm2,ymm2,ymm6
    vaddpd ymm3,ymm3,ymm7              ;c1 * m2 + c2 * m1 + c3 * I
    lea rcx,[rsp+OffsetM3]             ;rcx = m3 ptr
    vaddpd ymm0,ymm0,ymmword ptr [rcx]
    vaddpd ymm1,ymm1,ymmword ptr [rcx+32]
    vaddpd ymm2,ymm2,ymmword ptr [rcx+64]
    vaddpd ymm3,ymm3,ymmword ptr [rcx+96]      ;m3 + c1 * m2 + c2 * m1 + c3 * I
    vmovsd xmm4,[r8_N1p0]
    vdivsd xmm4,xmm4,xmm15       ;xmm4 = -1.0 / c4
    vbroadcastsd ymm4,xmm4
    vmulpd ymm0,ymm0,ymm4
    vmulpd ymm1,ymm1,ymm4
    vmulpd ymm2,ymm2,ymm4
    vmulpd ymm3,ymm3,ymm4        ;ymm3:ymm0 = m_inv
; Save m_inv
    mov rcx,[rbp+MI_OffsetHomeRCX]
    vmovapd ymmword ptr [rcx],ymm0
    vmovapd ymmword ptr [rcx+32],ymm1
    vmovapd ymmword ptr [rcx+64],ymm2
    vmovapd ymmword ptr [rcx+96],ymm3
    mov eax,1              ;set success return code
Done:  vzeroupper
    _RestoreXmmRegs xmm6,xmm7,xmm8,xmm9,xmm10,xmm11,xmm12,xmm13,xmm14,xmm15
    _DeleteFrame
    ret
Error: xor eax,eax
    jmp Done
Avx2Mat4x4InvF64_ endp
    end
Listing 9-6. Example Ch09_06

The multiplicative inverse of a matrix is defined as follows: Let A and X represent n × n matrices. Matrix X is an inverse of A if AX = XA = I, where I denotes an n × n identity matrix (i.e., a matrix of all zeros except for the diagonal elements, which are equal to one). Figure 9-5 shows an example of an inverse matrix. It is important to note that inverses do not exist for all n × n matrices. A matrix without an inverse is called a singular matrix.
Figure 9-5. Matrix A and its multiplicative inverse Matrix X

The inverse of a 4 × 4 matrix can be calculated using a variety of mathematical techniques. Source code example Ch09_06 uses a computational method based on the Cayley-Hamilton theorem, which employs common matrix operations that are relatively easy to carry out using SIMD arithmetic. Here are the required equations:
$$ \mathbf{A}^1=\mathbf{A};\; \mathbf{A}^2=\mathbf{A}\mathbf{A};\; \mathbf{A}^3=\mathbf{A}\mathbf{A}\mathbf{A};\; \mathbf{A}^4=\mathbf{A}\mathbf{A}\mathbf{A}\mathbf{A} $$
$$ \operatorname{trace}\left(\mathbf{A}\right)=\sum\limits_i a_{ii} $$
$$ t_n=\operatorname{trace}\left(\mathbf{A}^n\right) $$
$$ c_1=-t_1 $$
$$ c_2=-\frac{1}{2}\left(c_1 t_1+t_2\right) $$
$$ c_3=-\frac{1}{3}\left(c_2 t_1+c_1 t_2+t_3\right) $$
$$ c_4=-\frac{1}{4}\left(c_3 t_1+c_2 t_2+c_1 t_3+t_4\right) $$
$$ \mathbf{A}^{-1}=-\frac{1}{c_4}\left(\mathbf{A}^3+c_1\mathbf{A}^2+c_2\mathbf{A}+c_3\mathbf{I}\right) $$
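The equations above translate almost directly into code. The following self-contained sketch is an assumption-laden illustration rather than the chapter's implementation: it replaces class Matrix<> with a plain std::array type, and the names Mat4, Mul, Trace, and Mat4InvCayleyHamilton are hypothetical. It computes the powers of A, the four traces, the coefficients c1 through c4, and finally the inverse.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

using Mat4 = std::array<std::array<double, 4>, 4>;   // hypothetical 4 x 4 matrix type

// Plain triple-loop matrix product c = a * b.
static Mat4 Mul(const Mat4& a, const Mat4& b)
{
    Mat4 c{};
    for (std::size_t i = 0; i < 4; i++)
        for (std::size_t j = 0; j < 4; j++)
            for (std::size_t k = 0; k < 4; k++)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

// Sum of the diagonal elements.
static double Trace(const Mat4& a)
{
    return a[0][0] + a[1][1] + a[2][2] + a[3][3];
}

// Returns false (leaving inv unchanged) when |c4| < epsilon,
// which is how the equations above signal a singular matrix.
bool Mat4InvCayleyHamilton(Mat4& inv, const Mat4& a, double epsilon)
{
    const Mat4 a2 = Mul(a, a);
    const Mat4 a3 = Mul(a2, a);
    const Mat4 a4 = Mul(a3, a);

    const double t1 = Trace(a);
    const double t2 = Trace(a2);
    const double t3 = Trace(a3);
    const double t4 = Trace(a4);

    const double c1 = -t1;
    const double c2 = -(c1 * t1 + t2) / 2.0;
    const double c3 = -(c2 * t1 + c1 * t2 + t3) / 3.0;
    const double c4 = -(c3 * t1 + c2 * t2 + c1 * t3 + t4) / 4.0;

    if (std::fabs(c4) < epsilon)
        return false;

    // inv = -1 / c4 * (A^3 + c1 * A^2 + c2 * A + c3 * I)
    for (std::size_t i = 0; i < 4; i++)
    {
        for (std::size_t j = 0; j < 4; j++)
        {
            const double identity = (i == j) ? 1.0 : 0.0;
            inv[i][j] = -(a3[i][j] + c1 * a2[i][j] + c2 * a[i][j] + c3 * identity) / c4;
        }
    }

    return true;
}
```

A quick way to check a result is to multiply the returned inverse by the original matrix and confirm that the product is close to the identity matrix.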

Toward the top of the C++ code is a function named Avx2Mat4x4InvF64Cpp. This function calculates the inverse of a 4 × 4 matrix of double-precision floating-point values using the aforementioned equations. Function Avx2Mat4x4InvF64Cpp uses the C++ class Matrix<> to perform many of the required intermediate computations, including matrix addition, multiplication, and trace. The source code for class Matrix<> is not shown but is included with the chapter download package. Note that the intermediate matrices are declared using the static qualifier in order to avoid constructor overhead when performing benchmark timing measurements. The drawback of using the static qualifier here is that the function is not thread-safe (a thread-safe function can be used simultaneously by multiple threads). Following calculation of the trace values t1–t4, Avx2Mat4x4InvF64Cpp computes c1–c4 using simple scalar arithmetic. It then checks that the source matrix m is not singular by comparing the absolute value of c4 against epsilon. If matrix m is not singular, the final inverse is calculated. The remaining C++ code performs test case initialization and exercises both the C++ and assembly language matrix inversion functions.

The assembly language code in Listing 9-6 begins with a custom segment that contains definitions of the constant values needed by the assembly language matrix inversion functions. The statement ConstVals segment readonly align(32) 'const' marks the start of a segment that begins on a 32-byte boundary and contains read-only data. The reason for using a custom segment here is that the MASM align directive does not support aligning data items on a 32-byte boundary. In this example, proper alignment of the packed constants is essential in order to maximize performance. Note that the scalar double-precision floating-point constants are defined after the 256-bit wide packed constants and are aligned on an 8-byte boundary. The MASM statement ConstVals ends terminates the custom segment.

Following the custom constant segment is the macro _Max4x4TraceF64. This macro contains instructions that calculate the trace of a 4 × 4 matrix of double-precision floating-point values. Macro _Max4x4TraceF64 requires the four rows of the source matrix to be loaded in registers YMM0–YMM3 and uses the vblendpd, vperm2f128, and vhaddpd instructions to calculate the matrix trace, as shown in Figure 9-6. The vblendpd (Blend Packed Double-Precision Floating-Point Values) instruction merges values from its two source operands according to an immediate control mask. If bit 0 of the control mask equals 0, element 0 (i.e., bits 63:0) from the first source operand is copied to the corresponding element position in the destination operand; otherwise, element 0 from the second source operand is copied to the destination operand. Bits 1–3 of the control mask are used in a similar manner for the other three elements. Register XMM0[63:0] contains the trace value following execution of the vhaddpd instruction.
Figure 9-6. Trace calculation for a 4 × 4 matrix
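The instruction sequence in macro _Max4x4TraceF64 can be traced step by step with a scalar model. The sketch below is illustrative only; the type Ymm and the helper functions BlendPd, SwapHalves, HaddPd, and Mat4x4TraceModel are hypothetical names, and each helper models just the behavior that the macro relies on.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a YMM register holding four doubles.
using Ymm = std::array<double, 4>;

// Models vblendpd des,a,b,imm8: bit i of imm8 selects a[i] (0) or b[i] (1).
static Ymm BlendPd(const Ymm& a, const Ymm& b, std::uint8_t imm8)
{
    Ymm des{};
    for (std::size_t i = 0; i < 4; i++)
        des[i] = ((imm8 >> i) & 1) ? b[i] : a[i];
    return des;
}

// Models vperm2f128 des,a,a,00000001b, which swaps the two 128-bit halves of a.
static Ymm SwapHalves(const Ymm& a)
{
    return { a[2], a[3], a[0], a[1] };
}

// Models vhaddpd des,a,b: adjacent pairs are summed within each 128-bit lane.
static Ymm HaddPd(const Ymm& a, const Ymm& b)
{
    return { a[0] + a[1], b[0] + b[1], a[2] + a[3], b[2] + b[3] };
}

// Follows the macro's five instructions; r0-r3 are the matrix rows.
double Mat4x4TraceModel(const Ymm& r0, const Ymm& r1, const Ymm& r2, const Ymm& r3)
{
    const Ymm y0 = BlendPd(r0, r1, 0x02);    // y0[0] = a00, y0[1] = a11
    const Ymm y1 = BlendPd(r2, r3, 0x08);    // y1[2] = a22, y1[3] = a33
    const Ymm y2 = SwapHalves(y1);           // y2[0] = a22, y2[1] = a33

    Ymm y3{};                                // models vaddpd y3,y0,y2
    for (std::size_t i = 0; i < 4; i++)
        y3[i] = y0[i] + y2[i];

    return HaddPd(y3, y3)[0];                // (a00 + a22) + (a11 + a33)
}
```

Only element 0 of the final horizontal add is meaningful; the other elements contain sums of off-diagonal values that the macro ignores.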

The assembly language function Avx2Mat4x4InvF64_ calculates an inverse matrix using the same technique as the corresponding C++ function. Following its prolog, the function Avx2Mat4x4InvF64_ saves its argument values to the home area for later use. It then allocates storage space on the stack to hold intermediate results. More specifically, the and rsp,0ffffffe0h instruction aligns RSP to a 32-byte boundary, and the sub rsp,416 instruction allocates local stack space that’s required for the intermediate matrices m2, m3, and m4 (3 × 128 = 384 bytes) plus 32 bytes for function calls. Next, a series of calls are made to the functions Avx2Mat4x4MulF64_ and Avx2Mat4x4TraceF64_ to calculate the intermediate matrices and the trace values t1–t4. The matrix multiplication code that’s used in this example is basically the same code that you saw in example Ch09_05. The algorithm coefficients c1–c4 are calculated next using simple scalar floating-point arithmetic. Coefficient c4 is then tested to verify that the source matrix is not singular. If the source matrix is not singular, the function calculates the inverse matrix m_inv. Note that all of the arithmetic required to calculate m_inv is carried out using straightforward packed double-precision floating-point multiplication and addition. Here is the output for source code example Ch09_06:
Results for Avx2Mat4x4InvF64
Test #1 - Test Matrix
     2      7      3      4
     5      9      6    4.75
    6.5      3      4     10
     7    5.25    8.125      6
Test #1 - Avx2Mat4x4InvF64Cpp - Inverse Matrix
 -0.943926   0.91657  0.197547  -0.425579
-0.0568818  0.251148 0.00302831  -0.165952
 0.545399  -0.647656  -0.213597  0.505123
 0.412456  -0.412053  0.0561248  0.124363
Test #1 - Avx2Mat4x4InvF64Cpp - Verify Matrix
     1      0      0      0
     0      1      0      0
     0      0      1      0
     0      0      0      1
Test #1 - Avx2Mat4x4InvF64_ - Inverse Matrix
 -0.943926   0.91657  0.197547  -0.425579
-0.0568818  0.251148 0.00302831  -0.165952
 0.545399  -0.647656  -0.213597  0.505123
 0.412456  -0.412053  0.0561248  0.124363
Test #1 - Avx2Mat4x4InvF64_ - Verify Matrix
     1      0      0      0
     0      1      0      0
     0      0      1      0
     0      0      0      1
Test #2 - Test Matrix
    0.5     12    17.25      4
     5      2    6.75      8
  13.125      1      3    9.75
    16    1.625      7    0.25
Test #2 - Avx2Mat4x4InvF64Cpp - Inverse Matrix
0.00165165 -0.0690239  0.0549591  0.0389347
 0.135369  -0.359846  0.242038 -0.0903252
-0.0350097  0.239298  -0.183964  0.0772214
-0.0053352  0.056194  0.0603606 -0.0669085
Test #2 - Avx2Mat4x4InvF64Cpp - Verify Matrix
     1      0      0      0
     0      1      0      0
     0      0      1      0
     0      0      0      1
Test #2 - Avx2Mat4x4InvF64_ - Inverse Matrix
0.00165165 -0.0690239  0.0549591  0.0389347
 0.135369  -0.359846  0.242038 -0.0903252
-0.0350097  0.239298  -0.183964  0.0772214
-0.0053352  0.056194  0.0603606 -0.0669085
Test #2 - Avx2Mat4x4InvF64_ - Verify Matrix
     1      0      0      0
     0      1      0      0
     0      0      1      0
     0      0      0      1
Test #3 - Test Matrix
     2      0      0      1
     0      4      5      0
     0      0      0      7
     0      0      0      6
Test #3 - Avx2Mat4x4InvF64Cpp - Singular Matrix
Test #3 - Avx2Mat4x4InvF64_ - Singular Matrix
Running benchmark function Avx2Mat4x4InvF64_BM - please wait
Benchmark times save to file Ch09_06_Avx2Mat4x4InvF64_BM_CHROMIUM.csv
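One detail of the singularity test in Avx2Mat4x4InvF64_ is easy to miss. The setp and setb instructions that follow vcomisd also report the matrix as singular when the comparison is unordered (for example, when c4 is a NaN), whereas the expression fabs(c4) < epsilon in the C++ function evaluates to false in that case. The following sketch contrasts the two predicates. The function names IsSingularCpp and IsSingularAsm are hypothetical, and the second function is only a scalar model of the flag logic, assuming the documented vcomisd behavior of setting both the parity and carry flags for unordered operands.

```cpp
#include <cmath>

// Mirrors the test in the C++ function: a NaN value of c4 compares
// false and is therefore reported as non-singular.
bool IsSingularCpp(double c4, double epsilon)
{
    return std::fabs(c4) < epsilon;
}

// Models the setp/setb/or sequence in the assembly language function:
// the result is true for an unordered compare or when |c4| < epsilon.
bool IsSingularAsm(double c4, double epsilon)
{
    const double abs_c4 = std::fabs(c4);
    const bool unordered = std::isnan(abs_c4) || std::isnan(epsilon);

    return unordered || (abs_c4 < epsilon);
}
```

For ordinary finite inputs, such as the three test matrices used in this example, the two predicates agree.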
Table 9-4 contains benchmark timing measurements for the matrix inversion functions.
Table 9-4. Matrix Inverse Mean Execution Times (Microseconds), 100,000 Inversions

CPU         C++      Assembly Language
i7-4790S    30417    4168
i9-7900X    26646    3773
i7-8700K    24485    2941


Blend and Permute Instructions

A data blend operation conditionally copies elements from two packed source operands to a packed destination operand using a control mask that specifies which elements to copy. A data permute operation rearranges the elements of a packed source operand according to a control mask. You’ve already seen several source code examples in this chapter that exploited data blend and permute operations. The next source code example is named Ch09_07 and includes code that demonstrates how to use additional blend and permute instructions. Listing 9-7 shows the source code for example Ch09_07.
//------------------------------------------------
//        Ch09_07.cpp
//------------------------------------------------
#include "stdafx.h"
#include <cstdint>
#include <iostream>
#include <iomanip>
#include "YmmVal.h"
using namespace std;
extern "C" void AvxBlendF32_(YmmVal* des1, YmmVal* src1, YmmVal* src2, YmmVal* idx1);
extern "C" void Avx2PermuteF32_(YmmVal* des1, YmmVal* src1, YmmVal* idx1, YmmVal* des2, YmmVal* src2, YmmVal* idx2);
void AvxBlendF32(void)
{
  const uint32_t sel0 = 0x00000000;
  const uint32_t sel1 = 0x80000000;
  alignas(32) YmmVal des1, src1, src2, idx1;
  src1.m_F32[0] = 10.0f; src2.m_F32[0] = 100.0f; idx1.m_I32[0] = sel1;
  src1.m_F32[1] = 20.0f; src2.m_F32[1] = 200.0f; idx1.m_I32[1] = sel0;
  src1.m_F32[2] = 30.0f; src2.m_F32[2] = 300.0f; idx1.m_I32[2] = sel0;
  src1.m_F32[3] = 40.0f; src2.m_F32[3] = 400.0f; idx1.m_I32[3] = sel1;
  src1.m_F32[4] = 50.0f; src2.m_F32[4] = 500.0f; idx1.m_I32[4] = sel1;
  src1.m_F32[5] = 60.0f; src2.m_F32[5] = 600.0f; idx1.m_I32[5] = sel0;
  src1.m_F32[6] = 70.0f; src2.m_F32[6] = 700.0f; idx1.m_I32[6] = sel1;
  src1.m_F32[7] = 80.0f; src2.m_F32[7] = 800.0f; idx1.m_I32[7] = sel0;
  AvxBlendF32_(&des1, &src1, &src2, &idx1);
  cout << "\nResults for AvxBlendF32 (vblendvps)\n";
  cout << fixed << setprecision(1);
  for (size_t i = 0; i < 8; i++)
  {
    cout << "i: " << setw(2) << i << " ";
    cout << "src1: " << setw(8) << src1.m_F32[i] << " ";
    cout << "src2: " << setw(8) << src2.m_F32[i] << " ";
    cout << setfill('0');
    cout << "idx1: 0x" << setw(8) << hex << idx1.m_U32[i] << " ";
    cout << setfill(' ');
    cout << "des1: " << setw(8) << des1.m_F32[i] << '\n';
  }
}
void Avx2PermuteF32(void)
{
  alignas(32) YmmVal des1, src1, idx1;
  alignas(32) YmmVal des2, src2, idx2;
  // idx1 values must be between 0 and 7.
  src1.m_F32[0] = 100.0f;    idx1.m_I32[0] = 3;
  src1.m_F32[1] = 200.0f;    idx1.m_I32[1] = 7;
  src1.m_F32[2] = 300.0f;    idx1.m_I32[2] = 0;
  src1.m_F32[3] = 400.0f;    idx1.m_I32[3] = 4;
  src1.m_F32[4] = 500.0f;    idx1.m_I32[4] = 6;
  src1.m_F32[5] = 600.0f;    idx1.m_I32[5] = 6;
  src1.m_F32[6] = 700.0f;    idx1.m_I32[6] = 1;
  src1.m_F32[7] = 800.0f;    idx1.m_I32[7] = 2;
  // idx2 values must be between 0 and 3.
  src2.m_F32[0] = 100.0f;    idx2.m_I32[0] = 3;
  src2.m_F32[1] = 200.0f;    idx2.m_I32[1] = 1;
  src2.m_F32[2] = 300.0f;    idx2.m_I32[2] = 1;
  src2.m_F32[3] = 400.0f;    idx2.m_I32[3] = 2;
  src2.m_F32[4] = 500.0f;    idx2.m_I32[4] = 3;
  src2.m_F32[5] = 600.0f;    idx2.m_I32[5] = 2;
  src2.m_F32[6] = 700.0f;    idx2.m_I32[6] = 0;
  src2.m_F32[7] = 800.0f;    idx2.m_I32[7] = 0;
  Avx2PermuteF32_(&des1, &src1, &idx1, &des2, &src2, &idx2);
  cout << "\nResults for Avx2PermuteF32 (vpermps)\n";
  cout << fixed << setprecision(1);
  for (size_t i = 0; i < 8; i++)
  {
    cout << "i: " << setw(2) << i << " ";
    cout << "src1: " << setw(8) << src1.m_F32[i] << " ";
    cout << "idx1: " << setw(8) << idx1.m_I32[i] << " ";
    cout << "des1: " << setw(8) << des1.m_F32[i] << '\n';
  }
  cout << "\nResults for Avx2PermuteF32 (vpermilps)\n";
  for (size_t i = 0; i < 8; i++)
  {
    cout << "i: " << setw(2) << i << " ";
    cout << "src2: " << setw(8) << src2.m_F32[i] << " ";
    cout << "idx2: " << setw(8) << idx2.m_I32[i] << " ";
    cout << "des2: " << setw(8) << des2.m_F32[i] << '\n';
  }
}
int main()
{
  AvxBlendF32();
  Avx2PermuteF32();
  return 0;
}
;-------------------------------------------------
;        Ch09_07.asm
;-------------------------------------------------
; extern "C" void AvxBlendF32_(YmmVal* des1, YmmVal* src1, YmmVal* src2, YmmVal* idx1)
    .code
AvxBlendF32_ proc
    vmovaps ymm0,ymmword ptr [rdx] ;ymm0 = src1
    vmovaps ymm1,ymmword ptr [r8]  ;ymm1 = src2
    vmovdqa ymm2,ymmword ptr [r9]  ;ymm2 = idx1
    vblendvps ymm3,ymm0,ymm1,ymm2  ;blend ymm0 & ymm1, ymm2 "indices"
    vmovaps ymmword ptr [rcx],ymm3 ;Save result to des1
    vzeroupper
    ret
AvxBlendF32_ endp
; extern "C" void Avx2PermuteF32_(YmmVal* des1, YmmVal* src1, YmmVal* idx1, YmmVal* des2, YmmVal* src2, YmmVal* idx2)
Avx2PermuteF32_ proc
; Perform vpermps permutation
    vmovaps ymm0,ymmword ptr [rdx]   ;ymm0 = src1
    vmovdqa ymm1,ymmword ptr [r8]    ;ymm1 = idx1
    vpermps ymm2,ymm1,ymm0       ;permute ymm0 using ymm1 indices
    vmovaps ymmword ptr [rcx],ymm2   ;save result to des1
; Perform vpermilps permutation
    mov rdx,[rsp+40]          ;rdx = src2 ptr
    mov r8,[rsp+48]           ;r8 = idx2 ptr
    vmovaps ymm3,ymmword ptr [rdx]   ;ymm3 = src2
    vmovdqa ymm4,ymmword ptr [r8]    ;ymm4 = idx2
    vpermilps ymm5,ymm3,ymm4      ;permute ymm3 using ymm4 indices
    vmovaps ymmword ptr [r9],ymm5    ;save result to des2
    vzeroupper
    ret
Avx2PermuteF32_ endp
    end
Listing 9-7. Example Ch09_07

The C++ code in Listing 9-7 begins with a function named AvxBlendF32 that initializes YmmVal variables src1 and src2 using single-precision floating-point values. It also initializes a third YmmVal variable named idx1 for use as a blend control mask. The high-order bit of each doubleword element in idx1 specifies whether the corresponding element from src1 (high-order bit = 0) or src2 (high-order bit = 1) is copied to the destination operand. These three source operands are used by the vblendvps (Variable Blend Packed Single-Precision Floating-Point Values) instruction, which is located in the assembly language function AvxBlendF32_. Following execution of this function, the results are streamed to cout.

The C++ code in Listing 9-7 also includes a function named Avx2PermuteF32. This function initializes several YmmVal variables that demonstrate use of the vpermps and vpermilps instructions. Both of these instructions require a set of indices that specify which source operand elements are copied to the destination operand. For example, the statement idx1.m_I32[0] = 3 is used to direct the vpermps instruction in Avx2PermuteF32_ to perform des1.m_F32[0] = src1.m_F32[3]. The vpermps instruction requires each index in idx1 to be between zero and seven. An index can be used more than once in idx1 in order to copy an element from src1 to multiple locations in des1. The vpermilps instruction requires its indices to be between zero and three.

The assembly language function AvxBlendF32_ begins by loading the source data operands into registers YMM0 and YMM1 using two vmovaps instructions. The vmovdqa instruction that follows loads the blend control mask into register YMM2. The ensuing vblendvps ymm3,ymm0,ymm1,ymm2 instruction blends elements from registers YMM0 and YMM1 into YMM3 according to the control values in YMM2. The high-order bit of each doubleword element in YMM2 specifies whether the corresponding element from YMM0 (high-order bit = 0) or YMM1 (high-order bit = 1) is copied to YMM3. Figure 9-7 illustrates the execution of this instruction in greater detail. The vblendvps instruction and its double-precision counterpart vblendvpd are examples of AVX instructions that require three source operands. Floating-point blend operations using an immediate control mask are also possible with the vblendp[d|s] instructions.
Figure 9-7. Execution of the vblendvps instruction
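The element selection performed by vblendvps can be summarized with a short scalar model. This is a sketch for illustration, not code from the example; the types F32x8 and U32x8 and the function BlendvPs are hypothetical names.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using F32x8 = std::array<float, 8>;           // hypothetical: eight packed floats
using U32x8 = std::array<std::uint32_t, 8>;   // hypothetical: eight control doublewords

// Scalar model of: vblendvps des,src1,src2,mask
// Only the high-order bit of each mask element is examined.
F32x8 BlendvPs(const F32x8& src1, const F32x8& src2, const U32x8& mask)
{
    F32x8 des{};

    for (std::size_t i = 0; i < 8; i++)
        des[i] = (mask[i] & 0x80000000u) ? src2[i] : src1[i];

    return des;
}
```

Because only bit 31 of each control element matters, any value with that bit set (not just 0x80000000) selects the element from the second source operand.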

Following AvxBlendF32_ in Listing 9-7 is the function Avx2PermuteF32_, which demonstrates use of the vpermps and vpermilps instructions. The vpermps instruction permutes (or rearranges) the elements of its second source operand (which is 256 bits wide and contains eight single-precision floating-point values) according to the indices in its first source operand. In the instruction vpermps ymm2,ymm1,ymm0, for example, register YMM1 contains the indices and register YMM0 contains the values to permute. The vpermilps (In-Lane Permute of Single-Precision Floating-Point Values) instruction performs its permutations using two independent 128-bit wide lanes (i.e., bits [255:128] and bits [127:0]). Unlike vpermps, its first source operand contains the values to permute and its second source operand contains the indices. The control indices for an in-lane permutation must range between zero and three, and each lane uses its own distinct set of indices. Figure 9-8 illustrates the execution of these instructions in greater detail. AVX and AVX2 also include the double-precision floating-point permute instructions vpermilpd and vpermpd.
Figure 9-8. Execution of the vpermps and vpermilps instructions
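The difference between a full-width permute and an in-lane permute can be sketched with two scalar loops. This is an illustrative model, not the book's implementation; the names Vec8F, Vec8U, PermuteF32Model, and PermuteInLaneF32Model are hypothetical. The asserts enforce the index ranges stated in the text.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Vec8F = std::array<float, 8>;          // hypothetical stand-in for a YMM register of floats
using Vec8U = std::array<std::uint32_t, 8>;  // hypothetical stand-in for the index operand

// Scalar model of vpermps: any source element may be copied to any
// destination position, so indices range from 0 to 7.
Vec8F PermuteF32Model(const Vec8F& src, const Vec8U& idx)
{
    Vec8F des{};
    for (std::size_t i = 0; i < des.size(); i++)
    {
        assert(idx[i] < 8);
        des[i] = src[idx[i]];
    }
    return des;
}

// Scalar model of vpermilps (register-index form): each 128-bit lane is
// permuted on its own, so an index (0 to 3) is relative to the lane start.
Vec8F PermuteInLaneF32Model(const Vec8F& src, const Vec8U& idx)
{
    Vec8F des{};
    for (std::size_t i = 0; i < des.size(); i++)
    {
        assert(idx[i] < 4);
        std::size_t lane_base = (i / 4) * 4;     // 0 for elements 0-3, 4 for elements 4-7
        des[i] = src[lane_base + idx[i]];
    }
    return des;
}
```

With the index sets from the Ch09_07 output, index 3 in position 4 of the in-lane permute selects src2[7] rather than src2[3], which is why the upper four results only ever contain values from the upper lane.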

Here is the output for source code example Ch09_07:
Results for AvxBlendF32 (vblendvps)
i: 0 src1:   10.0 src2:  100.0 idx1: 0x80000000 des1:  100.0
i: 1 src1:   20.0 src2:  200.0 idx1: 0x00000000 des1:   20.0
i: 2 src1:   30.0 src2:  300.0 idx1: 0x00000000 des1:   30.0
i: 3 src1:   40.0 src2:  400.0 idx1: 0x80000000 des1:  400.0
i: 4 src1:   50.0 src2:  500.0 idx1: 0x80000000 des1:  500.0
i: 5 src1:   60.0 src2:  600.0 idx1: 0x00000000 des1:   60.0
i: 6 src1:   70.0 src2:  700.0 idx1: 0x80000000 des1:  700.0
i: 7 src1:   80.0 src2:  800.0 idx1: 0x00000000 des1:   80.0
Results for Avx2PermuteF32 (vpermps)
i: 0 src1:  100.0 idx1:    3 des1:  400.0
i: 1 src1:  200.0 idx1:    7 des1:  800.0
i: 2 src1:  300.0 idx1:    0 des1:  100.0
i: 3 src1:  400.0 idx1:    4 des1:  500.0
i: 4 src1:  500.0 idx1:    6 des1:  700.0
i: 5 src1:  600.0 idx1:    6 des1:  700.0
i: 6 src1:  700.0 idx1:    1 des1:  200.0
i: 7 src1:  800.0 idx1:    2 des1:  300.0
Results for Avx2PermuteF32 (vpermilps)
i: 0 src2:  100.0 idx2:    3 des2:  400.0
i: 1 src2:  200.0 idx2:    1 des2:  200.0
i: 2 src2:  300.0 idx2:    1 des2:  200.0
i: 3 src2:  400.0 idx2:    2 des2:  300.0
i: 4 src2:  500.0 idx2:    3 des2:  800.0
i: 5 src2:  600.0 idx2:    2 des2:  700.0
i: 6 src2:  700.0 idx2:    0 des2:  500.0
i: 7 src2:  800.0 idx2:    0 des2:  500.0

Data Gather Instructions

The final source code example of this chapter, Ch09_08, explains how to use the AVX2 gather instructions. A gather instruction conditionally loads elements from non-contiguous memory locations (typically an array) into an XMM or YMM register. A gather instruction requires a set of indices and a merge control mask that specifies which elements to copy. Listing 9-8 shows the source code for example Ch09_08. Chapter 8 presented an overview of the AVX2 gather instructions, including a graphic (see Figure 8-1) that elucidated execution of the vgatherdps instruction. You may find it helpful to review that material prior to perusing the source code and discussions in this section.
//------------------------------------------------
//        Ch09_08.cpp
//------------------------------------------------
#include "stdafx.h"
#include <string>
#include <cstdint>
#include <iostream>
#include <iomanip>
#include <array>
#include <stdexcept>
using namespace std;
extern "C" void Avx2Gather8xF32_I32_(float* y, const float* x,
  const int32_t* indices, const int32_t* masks);
extern "C" void Avx2Gather8xF32_I64_(float* y, const float* x,
  const int64_t* indices, const int32_t* masks);
extern "C" void Avx2Gather8xF64_I32_(double* y, const double* x,
  const int32_t* indices, const int64_t* masks);
extern "C" void Avx2Gather8xF64_I64_(double* y, const double* x,
  const int64_t* indices, const int64_t* masks);
template <typename T, typename I, typename M, size_t N>
  void Print(const string& msg, const array<T, N>& y, const array<I, N>& indices,
  const array<M, N>& merge)
{
  if (y.size() != indices.size() || y.size() != merge.size())
    throw runtime_error("Non-conforming arrays - Print");
  cout << '\n' << msg << '\n';
  for (size_t i = 0; i < y.size(); i++)
  {
    string merge_s = (merge[i] == 1) ? "Yes" : "No";
    cout << "i: " << setw(2) << i << "  ";
    cout << "y: " << setw(10) << y[i] << "  ";
    cout << "index: " << setw(4) << indices[i] << "  ";
    cout << "merge: " << setw(4) << merge_s << '\n';
  }
}
void Avx2Gather8xF32_I32()
{
  array<float, 20> x;
  for (size_t i = 0; i < x.size(); i++)
    x[i] = (float)(i * 10);
  array<float, 8> y { -1, -1, -1, -1, -1, -1, -1, -1 };
  array<int32_t, 8> indices { 2, 1, 6, 5, 4, 13, 11, 9 };
  array<int32_t, 8> merge { 1, 1, 0, 1, 1, 0, 1, 1 };
  cout << fixed << setprecision(1);
  cout << "\nResults for Avx2Gather8xF32_I32\n";
  Print("Values before", y, indices, merge);
  Avx2Gather8xF32_I32_(y.data(), x.data(), indices.data(), merge.data());
  Print("Values after", y, indices, merge);
}
void Avx2Gather8xF32_I64()
{
  array<float, 20> x;
  for (size_t i = 0; i < x.size(); i++)
    x[i] = (float)(i * 10);
  array<float, 8> y { -1, -1, -1, -1, -1, -1, -1, -1 };
  array<int64_t, 8> indices { 19, 1, 0, 5, 4, 3, 11, 11 };
  array<int32_t, 8> merge { 1, 1, 1, 1, 0, 0, 1, 1 };
  cout << fixed << setprecision(1);
  cout << "\nResults for Avx2Gather8xF32_I64\n";
  Print("Values before", y, indices, merge);
  Avx2Gather8xF32_I64_(y.data(), x.data(), indices.data(), merge.data());
  Print("Values after", y, indices, merge);
}
void Avx2Gather8xF64_I32()
{
  array<double, 20> x;
  for (size_t i = 0; i < x.size(); i++)
    x[i] = (double)(i * 10);
  array<double, 8> y { -1, -1, -1, -1, -1, -1, -1, -1 };
  array<int32_t, 8> indices { 12, 11, 6, 15, 4, 13, 18, 3 };
  array<int64_t, 8> merge { 1, 1, 0, 1, 1, 0, 1, 0 };
  cout << fixed << setprecision(1);
  cout << "\nResults for Avx2Gather8xF64_I32\n";
  Print("Values before", y, indices, merge);
  Avx2Gather8xF64_I32_(y.data(), x.data(), indices.data(), merge.data());
  Print("Values after", y, indices, merge);
}
void Avx2Gather8xF64_I64()
{
  array<double, 20> x;
  for (size_t i = 0; i < x.size(); i++)
    x[i] = (double)(i * 10);
  array<double, 8> y { -1, -1, -1, -1, -1, -1, -1, -1 };
  array<int64_t, 8> indices { 11, 17, 1, 6, 14, 13, 8, 8 };
  array<int64_t, 8> merge { 1, 0, 1, 1, 1, 0, 1, 1 };
  cout << fixed << setprecision(1);
  cout << "\nResults for Avx2Gather8xF64_I64\n";
  Print("Values before", y, indices, merge);
  Avx2Gather8xF64_I64_(y.data(), x.data(), indices.data(), merge.data());
  Print("Values after", y, indices, merge);
}
int main()
{
  Avx2Gather8xF32_I32();
  Avx2Gather8xF32_I64();
  Avx2Gather8xF64_I32();
  Avx2Gather8xF64_I64();
  return 0;
}
;-------------------------------------------------
;        Ch09_08.asm
;-------------------------------------------------
; For each of the following functions, the contents of y are loaded
; into ymm0 prior to execution of the vgatherXXX instruction in order to
; demonstrate the effects of conditional merging.
    .code
; extern "C" void Avx2Gather8xF32_I32_(float* y, const float* x, const int32_t* indices, const int32_t* merge)
Avx2Gather8xF32_I32_ proc
    vmovups ymm0,ymmword ptr [rcx]   ;ymm0 = y[7]:y[0]
    vmovdqu ymm1,ymmword ptr [r8]    ;ymm1 = indices[7]:indices[0]
    vmovdqu ymm2,ymmword ptr [r9]    ;ymm2 = merge[7]:merge[0]
    vpslld ymm2,ymm2,31         ;shift merge vals to high-order bits
    vgatherdps ymm0,[rdx+ymm1*4],ymm2  ;ymm0 = gathered elements
    vmovups ymmword ptr [rcx],ymm0   ;save gathered elements
    vzeroupper
    ret
Avx2Gather8xF32_I32_ endp
; extern "C" void Avx2Gather8xF32_I64_(float* y, const float* x, const int64_t* indices, const int32_t* merge)
Avx2Gather8xF32_I64_ proc
    vmovups xmm0,xmmword ptr [rcx]   ;xmm0 = y[3]:y[0]
    vmovdqu ymm1,ymmword ptr [r8]    ;ymm1 = indices[3]:indices[0]
    vmovdqu xmm2,xmmword ptr [r9]    ;xmm2 = merge[3]:merge[0]
    vpslld xmm2,xmm2,31         ;shift merge vals to high-order bits
    vgatherqps xmm0,[rdx+ymm1*4],xmm2  ;xmm0 = gathered elements
    vmovups xmmword ptr [rcx],xmm0   ;save gathered elements
    vmovups xmm3,xmmword ptr [rcx+16]  ;xmm3 = y[7]:y[4]
    vmovdqu ymm1,ymmword ptr [r8+32]  ;ymm1 = indices[7]:indices[4]
    vmovdqu xmm2,xmmword ptr [r9+16]  ;xmm2 = merge[7]:merge[4]
    vpslld xmm2,xmm2,31         ;shift merge vals to high-order bits
    vgatherqps xmm3,[rdx+ymm1*4],xmm2  ;xmm3 = gathered elements
    vmovups xmmword ptr [rcx+16],xmm3  ;save gathered elements
    vzeroupper
    ret
Avx2Gather8xF32_I64_ endp
; extern "C" void Avx2Gather8xF64_I32_(double* y, const double* x, const int32_t* indices, const int64_t* merge)
Avx2Gather8xF64_I32_ proc
    vmovupd ymm0,ymmword ptr [rcx]   ;ymm0 = y[3]:y[0]
    vmovdqu xmm1,xmmword ptr [r8]    ;xmm1 = indices[3]:indices[0]
    vmovdqu ymm2,ymmword ptr [r9]    ;ymm2 = merge[3]:merge[0]
    vpsllq ymm2,ymm2,63         ;shift merge vals to high-order bits
    vgatherdpd ymm0,[rdx+xmm1*8],ymm2  ;ymm0 = gathered elements
    vmovupd ymmword ptr [rcx],ymm0   ;save gathered elements
    vmovupd ymm0,ymmword ptr [rcx+32]  ;ymm0 = y[7]:y[4]
    vmovdqu xmm1,xmmword ptr [r8+16]  ;xmm1 = indices[7]:indices[4]
    vmovdqu ymm2,ymmword ptr [r9+32]  ;ymm2 = merge[7]:merge[4]
    vpsllq ymm2,ymm2,63         ;shift merge vals to high-order bits
    vgatherdpd ymm0,[rdx+xmm1*8],ymm2  ;ymm0 = gathered elements
    vmovupd ymmword ptr [rcx+32],ymm0  ;save gathered elements
    vzeroupper
    ret
Avx2Gather8xF64_I32_ endp
; extern "C" void Avx2Gather8xF64_I64_(double* y, const double* x, const int64_t* indices, const int64_t* merge)
Avx2Gather8xF64_I64_ proc
    vmovupd ymm0,ymmword ptr [rcx]   ;ymm0 = y[3]:y[0]
    vmovdqu ymm1,ymmword ptr [r8]    ;ymm1 = indices[3]:indices[0]
    vmovdqu ymm2,ymmword ptr [r9]    ;ymm2 = merge[3]:merge[0]
    vpsllq ymm2,ymm2,63         ;shift merge vals to high-order bits
    vgatherqpd ymm0,[rdx+ymm1*8],ymm2  ;ymm0 = gathered elements
    vmovupd ymmword ptr [rcx],ymm0   ;save gathered elements
    vmovupd ymm0,ymmword ptr [rcx+32]  ;ymm0 = y[7]:y[4]
    vmovdqu ymm1,ymmword ptr [r8+32]  ;ymm1 = indices[7]:indices[4]
    vmovdqu ymm2,ymmword ptr [r9+32]  ;ymm2 = merge[7]:merge[4]
    vpsllq ymm2,ymm2,63         ;shift merge vals to high-order bits
    vgatherqpd ymm0,[rdx+ymm1*8],ymm2  ;ymm0 = gathered elements
    vmovupd ymmword ptr [rcx+32],ymm0  ;save gathered elements
    vzeroupper
    ret
Avx2Gather8xF64_I64_ endp
    end
Listing 9-8. Example Ch09_08

The C++ source code in example Ch09_08 includes four functions that initialize test cases to perform single-precision and double-precision floating-point gather operations using signed doubleword or quadword indices. The function Avx2Gather8xF32_I32 begins by initializing the elements of array x (the source array) with test values. Note that this function uses the STL class array<> instead of a raw C++ array to demonstrate use of the former with an assembly language function. Appendix A contains a list of C++ references that you can consult if you’re interested in learning more about this class. Next, each element in array y (the destination array) is set to -1.0 in order to illustrate the effects of conditional merging. The arrays indices and merge are also primed with the required gather instruction indices and merge control mask values, respectively. The assembly language function Avx2Gather8xF32_I32_ is then called to carry out the gather operation. Note that raw pointers for the various STL arrays are obtained using the member function array<>::data. The other C++ functions in this source example (Avx2Gather8xF32_I64, Avx2Gather8xF64_I32, and Avx2Gather8xF64_I64) are similarly structured.

The assembly language function Avx2Gather8xF32_I32_ begins by loading registers YMM0, YMM1, and YMM2 with the test arrays y, indices, and merge, respectively. Register RDX contains a pointer to the source array x. The vpslld ymm2,ymm2,31 instruction shifts the merge control mask values (each value in this mask is zero or one) to the high-order bit of each doubleword element. The ensuing vgatherdps ymm0,[rdx+ymm1*4],ymm2 instruction loads up to eight single-precision floating-point values from array x into register YMM0. The merge control mask in YMM2 dictates which array elements are actually copied into the destination operand YMM0. If the high-order bit of a merge control mask doubleword element is set to 1, the corresponding element in YMM0 is updated; otherwise, it is not changed. Subsequent to the successful load of an array element, the vgatherdps instruction sets the corresponding doubleword element in the merge control mask to zero. The vmovups ymmword ptr [rcx],ymm0 instruction then saves the gather result to y.
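The sequence just described (shift the merge flags into the high-order bit, then conditionally load) can be modeled with scalar C++. This is a sketch for reasoning about the results, not the instruction's actual implementation; the function names MergeFlagsToMask and GatherF32Model and the x_count parameter are hypothetical. The bounds assert belongs to the model only, since the hardware performs no such check.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Model of vpslld ymm2,ymm2,31: moves each 0/1 merge flag to bit 31.
std::array<std::uint32_t, 8> MergeFlagsToMask(const std::array<std::int32_t, 8>& merge)
{
    std::array<std::uint32_t, 8> mask{};
    for (std::size_t i = 0; i < mask.size(); i++)
        mask[i] = static_cast<std::uint32_t>(merge[i]) << 31;
    return mask;
}

// Scalar model of vgatherdps ymm0,[x+indices*4],mask.
// y is merged in place; mask is cleared as the gather completes.
void GatherF32Model(std::array<float, 8>& y, const float* x, std::size_t x_count,
                    const std::array<std::int32_t, 8>& indices,
                    std::array<std::uint32_t, 8>& mask)
{
    for (std::size_t i = 0; i < y.size(); i++)
    {
        if (mask[i] & 0x80000000u)
        {
            // Model-only safety check (assumption: indices are non-negative here).
            assert(indices[i] >= 0 && static_cast<std::size_t>(indices[i]) < x_count);
            y[i] = x[indices[i]];       // element is loaded and merged
        }
        // Elements whose mask bit is clear keep their previous value in y.
        mask[i] = 0;                    // mask reads as zero after a fault-free gather
    }
}
```

Feeding this model the x, indices, and merge values from Avx2Gather8xF32_I32 should leave -1.0 in positions 2 and 5 of y, matching the conditional merging that the assembly language function demonstrates. Note that the mask is consumed by the operation, so code that issues several gathers must reload or rebuild it each time, as the functions in Listing 9-8 do.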

The assembly language functions Avx2Gather8xF32_I64_, Avx2Gather8xF64_I32_, and Avx2Gather8xF64_I64_ are analogous to Avx2Gather8xF32_I32_. Note that the gather instructions used in these functions—vgatherqps, vgatherdpd, and vgatherqpd—gather only four elements, which explains why they’re used twice. Here are the results for source code example Ch09_08:
Results for Avx2Gather8xF32_I32
Values before
i: 0  y:    -1.0  index:  2  merge: Yes
i: 1  y:    -1.0  index:  1  merge: Yes
i: 2  y:    -1.0  index:  6  merge:  No
i: 3  y:    -1.0  index:  5  merge: Yes
i: 4  y:    -1.0  index:  4  merge: Yes
i: 5  y:    -1.0  index:  13  merge:  No
i: 6  y:    -1.0  index:  11  merge: Yes
i: 7  y:    -1.0  index:  9  merge: Yes
Values after
i: 0  y:    20.0  index:  2  merge: Yes
i: 1  y:    10.0  index:  1  merge: Yes
i: 2  y:    -1.0  index:  6  merge:  No
i: 3  y:    50.0  index:  5  merge: Yes
i: 4  y:    40.0  index:  4  merge: Yes
i: 5  y:    -1.0  index:  13  merge:  No
i: 6  y:   110.0  index:  11  merge: Yes
i: 7  y:    90.0  index:  9  merge: Yes
Results for Avx2Gather8xF32_I64
Values before
i: 0  y:    -1.0  index:  19  merge: Yes
i: 1  y:    -1.0  index:  1  merge: Yes
i: 2  y:    -1.0  index:  0  merge: Yes
i: 3  y:    -1.0  index:  5  merge: Yes
i: 4  y:    -1.0  index:  4  merge:  No
i: 5  y:    -1.0  index:  3  merge:  No
i: 6  y:    -1.0  index:  11  merge: Yes
i: 7  y:    -1.0  index:  11  merge: Yes
Values after
i: 0  y:   190.0  index:  19  merge: Yes
i: 1  y:    10.0  index:  1  merge: Yes
i: 2  y:    0.0  index:  0  merge: Yes
i: 3  y:    50.0  index:  5  merge: Yes
i: 4  y:    -1.0  index:  4  merge:  No
i: 5  y:    -1.0  index:  3  merge:  No
i: 6  y:   110.0  index:  11  merge: Yes
i: 7  y:   110.0  index:  11  merge: Yes
Results for Avx2Gather8xF64_I32
Values before
i: 0  y:    -1.0  index:  12  merge: Yes
i: 1  y:    -1.0  index:  11  merge: Yes
i: 2  y:    -1.0  index:  6  merge:  No
i: 3  y:    -1.0  index:  15  merge: Yes
i: 4  y:    -1.0  index:  4  merge: Yes
i: 5  y:    -1.0  index:  13  merge:  No
i: 6  y:    -1.0  index:  18  merge: Yes
i: 7  y:    -1.0  index:  3  merge:  No
Values after
i: 0  y:   120.0  index:  12  merge: Yes
i: 1  y:   110.0  index:  11  merge: Yes
i: 2  y:    -1.0  index:  6  merge:  No
i: 3  y:   150.0  index:  15  merge: Yes
i: 4  y:    40.0  index:  4  merge: Yes
i: 5  y:    -1.0  index:  13  merge:  No
i: 6  y:   180.0  index:  18  merge: Yes
i: 7  y:    -1.0  index:  3  merge:  No
Results for Avx2Gather8xF64_I64
Values before
i: 0  y:    -1.0  index:  11  merge: Yes
i: 1  y:    -1.0  index:  17  merge:  No
i: 2  y:    -1.0  index:  1  merge: Yes
i: 3  y:    -1.0  index:  6  merge: Yes
i: 4  y:    -1.0  index:  14  merge: Yes
i: 5  y:    -1.0  index:  13  merge:  No
i: 6  y:    -1.0  index:  8  merge: Yes
i: 7  y:    -1.0  index:  8  merge: Yes
Values after
i: 0  y:   110.0  index:  11  merge: Yes
i: 1  y:    -1.0  index:  17  merge:  No
i: 2  y:    10.0  index:  1  merge: Yes
i: 3  y:    60.0  index:  6  merge: Yes
i: 4  y:   140.0  index:  14  merge: Yes
i: 5  y:    -1.0  index:  13  merge:  No
i: 6  y:    80.0  index:  8  merge: Yes
i: 7  y:    80.0  index:  8  merge: Yes
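The reason three of the four functions issue two gather instructions follows from simple register arithmetic: neither the gathered values nor the indices can occupy more than one 256-bit YMM register, so the wider of the two element types limits the count. The helper below is a hypothetical illustration of that arithmetic (the name GatherElementCount is not part of the book's code).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Elements moved by one AVX2 gather in its widest form, given the size in
// bytes of a gathered value (4 = float, 8 = double) and of an index
// (4 = doubleword, 8 = quadword).
std::size_t GatherElementCount(std::size_t value_bytes, std::size_t index_bytes)
{
    assert(value_bytes == 4 || value_bytes == 8);
    assert(index_bytes == 4 || index_bytes == 8);
    const std::size_t ymm_bytes = 32;            // one YMM register holds 256 bits
    return ymm_bytes / std::max(value_bytes, index_bytes);
}
```

Under this model, GatherElementCount(4, 4) corresponds to vgatherdps and gives eight elements, while the combinations for vgatherqps, vgatherdpd, and vgatherqpd each give four. Eight-element arrays therefore need one vgatherdps but two of any of the other three instructions.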

Summary

Here are the key learning points of Chapter 9:
  • Nearly all AVX packed single-precision and double-precision floating-point instructions can be used with either 128-bit or 256-bit wide operands. Packed floating-point operands should be properly aligned whenever possible, as described in this chapter.

  • The MASM align directive cannot be used to align a 256-bit wide operand on a 32-byte boundary. Assembly language code can align 256-bit wide constant or mutable operands on a 32-byte boundary using the MASM segment directive.

  • When performing packed arithmetic operations, the vcmpp[d|s] instructions can be used with the vandp[d|s], vandnp[d|s], and vorp[d|s] instructions to make logical decisions without any conditional jump instructions.

  • The non-associativity of floating-point arithmetic means that minute numerical discrepancies may occur when comparing values calculated using C++ and assembly language functions.

  • Assembly language functions can use the vperm2f128, vpermp[d|s], and vpermilp[d|s] instructions to rearrange the elements of a packed floating-point operand.

  • Assembly language functions can use the vblendp[d|s] and vblendvp[d|s] instructions to interleave the elements of two packed floating-point operands.

  • Assembly language functions can use the vgatherdp[d|s] and vgatherqp[d|s] instructions to conditionally load floating-point values from non-contiguous memory locations into an XMM or YMM register.

  • Assembly language functions that perform calculations using a YMM register should also use a vzeroupper instruction prior to any epilog code or the ret instruction in order to avoid potential x86-AVX to x86-SSE state transition performance delays.
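The compare-and-mask technique mentioned in the learning points above can be sketched for a single element in C++. This is a scalar model of what happens in each lane, assuming a less-than compare predicate; the names ToBits, FromBits, SelectSmaller, and SelectSmallerSelfCheck are hypothetical and do not appear in the chapter's source code.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Bit-level views of a float, needed because the logical instructions
// operate on raw bit patterns.
static std::uint32_t ToBits(float f)
{
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

static float FromBits(std::uint32_t u)
{
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Per-lane model of: mask = vcmpps(a, b, LT); t1 = vandps(mask, a);
// t2 = vandnps(mask, b); result = vorps(t1, t2).
float SelectSmaller(float a, float b)
{
    // A packed compare writes all ones (true) or all zeros (false) to the lane.
    std::uint32_t mask = (a < b) ? 0xFFFFFFFFu : 0x00000000u;
    std::uint32_t keep_a = mask & ToBits(a);     // vandps:  a where mask is set
    std::uint32_t keep_b = ~mask & ToBits(b);    // vandnps: b where mask is clear
    return FromBits(keep_a | keep_b);            // vorps:   combine the two halves
}

// Small self-check of the model.
void SelectSmallerSelfCheck()
{
    assert(SelectSmaller(1.0f, 2.0f) == 1.0f);
    assert(SelectSmaller(5.0f, 3.0f) == 3.0f);
    assert(SelectSmaller(-4.0f, -4.0f) == -4.0f);
}
```

In the packed version, the same three logical operations are applied to every lane at once, so each element can take a different "branch" without any conditional jump instruction.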
