Daniel Kusswurm, Modern X86 Assembly Language Programming, https://doi.org/10.1007/978-1-4842-4063-2_9

9. AVX2 Programming – Packed Floating-Point

Daniel Kusswurm¹

(1)

Geneva, IL, USA

In Chapter 6, you learned how to use the AVX instruction set to perform packed floating-point operations using the XMM register set and 128-bit wide operands. In this chapter, you learn how to carry out packed floating-point operations using the YMM register set and 256-bit wide operands. The chapter begins with a simple example that demonstrates the basics of packed floating-point arithmetic and YMM register use. This is followed by three source code examples that illustrate how to perform packed calculations with floating-point arrays.

Chapter 6 also presented source code examples that exploited the AVX instruction set to accelerate matrix transposition and multiplication using single-precision floating-point values. In this chapter, you learn how to perform these same calculations using double-precision floating-point values. You also study a source code example that computes the inverse of a matrix. The final two source code examples in this chapter explain how to perform data blends, permutes, and gathers using packed floating-point operands.

You may recall that the source code examples in Chapter 6 used only XMM register operands with AVX instructions. This was done to avoid information overload and maintain a reasonable chapter length. Nearly all AVX floating-point instructions can use either the XMM or YMM registers as operands. Many of the source code examples in this chapter will run on a processor that supports AVX. The function names in these examples use the prefix Avx. Similarly, source code examples that require an AVX2-compatible processor use the function name prefix Avx2. You can use one of the freely available tools listed in Appendix A to determine whether your computer supports only AVX or both AVX and AVX2.

Packed Floating-Point Arithmetic

Listing 9-1 shows the source code for example Ch09_01. This example illustrates how to perform common arithmetic operations using 256-bit wide single-precision and double-precision floating-point operands. It also illustrates how to use the vzeroupper instruction and several MASM directives for 256-bit wide operands.

//------------------------------------------------

// YmmVal.h

//------------------------------------------------

#pragma once

#include <string>

#include <cstdint>

#include <sstream>

#include <iomanip>

struct YmmVal

{

public:

union

{

int8_t m_I8[32];

int16_t m_I16[16];

int32_t m_I32[8];

int64_t m_I64[4];

uint8_t m_U8[32];

uint16_t m_U16[16];

uint32_t m_U32[8];

uint64_t m_U64[4];

float m_F32[8];

double m_F64[4];

};

// Formatting and display member functions (ToStringF32, ToStringF64, etc.) are not shown

};

//------------------------------------------------

// Ch09_01.cpp

//------------------------------------------------

#include "stdafx.h"

#include <iostream>

#include <iomanip>

#define _USE_MATH_DEFINES

#include <math.h>

#include "YmmVal.h"

using namespace std;

extern "C" void AvxPackedMathF32_(const YmmVal& a, const YmmVal& b, YmmVal c[8]);

extern "C" void AvxPackedMathF64_(const YmmVal& a, const YmmVal& b, YmmVal c[8]);

void AvxPackedMathF32(void)

{

alignas(32) YmmVal a;

alignas(32) YmmVal b;

alignas(32) YmmVal c[8];

a.m_F32[0] = 36.0f; b.m_F32[0] = -0.1111111f;

a.m_F32[1] = 0.03125f; b.m_F32[1] = 64.0f;

a.m_F32[2] = 2.0f; b.m_F32[2] = -0.0625f;

a.m_F32[3] = 42.0f; b.m_F32[3] = 8.666667f;

a.m_F32[4] = 7.0f; b.m_F32[4] = -18.125f;

a.m_F32[5] = 20.5f; b.m_F32[5] = 56.0f;

a.m_F32[6] = 36.125f; b.m_F32[6] = 24.0f;

a.m_F32[7] = 0.5f; b.m_F32[7] = -98.6f;

AvxPackedMathF32_(a, b, c);

cout << ("\nResults for AvxPackedMathF32\n");

cout << "a[0]: " << a.ToStringF32(0) << '\n';

cout << "b[0]: " << b.ToStringF32(0) << '\n';

cout << "addps[0]: " << c[0].ToStringF32(0) << '\n';

cout << "subps[0]: " << c[1].ToStringF32(0) << '\n';

cout << "mulps[0]: " << c[2].ToStringF32(0) << '\n';

cout << "divps[0]: " << c[3].ToStringF32(0) << '\n';

cout << "absps b[0]: " << c[4].ToStringF32(0) << '\n';

cout << "sqrtps a[0]:" << c[5].ToStringF32(0) << '\n';

cout << "minps[0]: " << c[6].ToStringF32(0) << '\n';

cout << "maxps[0]: " << c[7].ToStringF32(0) << '\n';

cout << '\n';

cout << "a[1]: " << a.ToStringF32(1) << '\n';

cout << "b[1]: " << b.ToStringF32(1) << '\n';

cout << "addps[1]: " << c[0].ToStringF32(1) << '\n';

cout << "subps[1]: " << c[1].ToStringF32(1) << '\n';

cout << "mulps[1]: " << c[2].ToStringF32(1) << '\n';

cout << "divps[1]: " << c[3].ToStringF32(1) << '\n';

cout << "absps b[1]: " << c[4].ToStringF32(1) << '\n';

cout << "sqrtps a[1]:" << c[5].ToStringF32(1) << '\n';

cout << "minps[1]: " << c[6].ToStringF32(1) << '\n';

cout << "maxps[1]: " << c[7].ToStringF32(1) << '\n';

}

void AvxPackedMathF64(void)

{

alignas(32) YmmVal a;

alignas(32) YmmVal b;

alignas(32) YmmVal c[8];

a.m_F64[0] = 2.0; b.m_F64[0] = M_PI;

a.m_F64[1] = 4.0 ; b.m_F64[1] = M_E;

a.m_F64[2] = 7.5; b.m_F64[2] = -9.125;

a.m_F64[3] = 3.0; b.m_F64[3] = -M_PI;

AvxPackedMathF64_(a, b, c);

cout << ("\nResults for AvxPackedMathF64\n");

cout << "a[0]: " << a.ToStringF64(0) << '\n';

cout << "b[0]: " << b.ToStringF64(0) << '\n';

cout << "addpd[0]: " << c[0].ToStringF64(0) << '\n';

cout << "subpd[0]: " << c[1].ToStringF64(0) << '\n';

cout << "mulpd[0]: " << c[2].ToStringF64(0) << '\n';

cout << "divpd[0]: " << c[3].ToStringF64(0) << '\n';

cout << "abspd b[0]: " << c[4].ToStringF64(0) << '\n';

cout << "sqrtpd a[0]:" << c[5].ToStringF64(0) << '\n';

cout << "minpd[0]: " << c[6].ToStringF64(0) << '\n';

cout << "maxpd[0]: " << c[7].ToStringF64(0) << '\n';

cout << '\n';

cout << "a[1]: " << a.ToStringF64(1) << '\n';

cout << "b[1]: " << b.ToStringF64(1) << '\n';

cout << "addpd[1]: " << c[0].ToStringF64(1) << '\n';

cout << "subpd[1]: " << c[1].ToStringF64(1) << '\n';

cout << "mulpd[1]: " << c[2].ToStringF64(1) << '\n';

cout << "divpd[1]: " << c[3].ToStringF64(1) << '\n';

cout << "abspd b[1]: " << c[4].ToStringF64(1) << '\n';

cout << "sqrtpd a[1]:" << c[5].ToStringF64(1) << '\n';

cout << "minpd[1]: " << c[6].ToStringF64(1) << '\n';

cout << "maxpd[1]: " << c[7].ToStringF64(1) << '\n';

}

int main()

{

AvxPackedMathF32();

AvxPackedMathF64();

return 0;

}

;-------------------------------------------------

; Ch09_01.asm

;-------------------------------------------------

; Mask values used to calculate floating-point absolute values

.const

AbsMaskF32 dword 8 dup(7fffffffh)

AbsMaskF64 qword 4 dup(7fffffffffffffffh)

; extern "C" void AvxPackedMathF32_(const YmmVal& a, const YmmVal& b, YmmVal c[8]);

.code

AvxPackedMathF32_ proc

; Load packed SP floating-point values

vmovaps ymm0,ymmword ptr [rcx] ;ymm0 = *a

vmovaps ymm1,ymmword ptr [rdx] ;ymm1 = *b

; Packed SP floating-point addition

vaddps ymm2,ymm0,ymm1

vmovaps ymmword ptr [r8],ymm2

; Packed SP floating-point subtraction

vsubps ymm2,ymm0,ymm1

vmovaps ymmword ptr [r8+32],ymm2

; Packed SP floating-point multiplication

vmulps ymm2,ymm0,ymm1

vmovaps ymmword ptr [r8+64],ymm2

; Packed SP floating-point division

vdivps ymm2,ymm0,ymm1

vmovaps ymmword ptr [r8+96],ymm2

; Packed SP floating-point absolute value (b)

vandps ymm2,ymm1,ymmword ptr [AbsMaskF32]

vmovaps ymmword ptr [r8+128],ymm2

; Packed SP floating-point square root (a)

vsqrtps ymm2,ymm0

vmovaps ymmword ptr [r8+160],ymm2

; Packed SP floating-point minimum

vminps ymm2,ymm0,ymm1

vmovaps ymmword ptr [r8+192],ymm2

; Packed SP floating-point maximum

vmaxps ymm2,ymm0,ymm1

vmovaps ymmword ptr [r8+224],ymm2

vzeroupper

ret

AvxPackedMathF32_ endp

; extern "C" void AvxPackedMathF64_(const YmmVal& a, const YmmVal& b, YmmVal c[8]);

AvxPackedMathF64_ proc

; Load packed DP floating-point values

vmovapd ymm0,ymmword ptr [rcx] ;ymm0 = *a

vmovapd ymm1,ymmword ptr [rdx] ;ymm1 = *b

; Packed DP floating-point addition

vaddpd ymm2,ymm0,ymm1

vmovapd ymmword ptr [r8],ymm2

; Packed DP floating-point subtraction

vsubpd ymm2,ymm0,ymm1

vmovapd ymmword ptr [r8+32],ymm2

; Packed DP floating-point multiplication

vmulpd ymm2,ymm0,ymm1

vmovapd ymmword ptr [r8+64],ymm2

; Packed DP floating-point division

vdivpd ymm2,ymm0,ymm1

vmovapd ymmword ptr [r8+96],ymm2

; Packed DP floating-point absolute value (b)

vandpd ymm2,ymm1,ymmword ptr [AbsMaskF64]

vmovapd ymmword ptr [r8+128],ymm2

; Packed DP floating-point square root (a)

vsqrtpd ymm2,ymm0

vmovapd ymmword ptr [r8+160],ymm2

; Packed DP floating-point minimum

vminpd ymm2,ymm0,ymm1

vmovapd ymmword ptr [r8+192],ymm2

; Packed DP floating-point maximum

vmaxpd ymm2,ymm0,ymm1

vmovapd ymmword ptr [r8+224],ymm2

vzeroupper

ret

AvxPackedMathF64_ endp

end

Listing 9-1.

Example Ch09_01

Listing 9-1 begins with a C++ structure named YmmVal, which is declared in the header file YmmVal.h. This structure is similar to the XmmVal structure that you saw in Chapter 6. YmmVal contains a publicly-accessible anonymous union that facilitates packed operand data exchange between functions written in C++ and x86 assembly language. The members of this union correspond to the packed data types that can be used with a YMM register. The structure YmmVal also includes several formatting and display functions (the source code for these member functions is not shown).
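Since the formatting member functions are not shown, the following is only a rough sketch of what a helper in the spirit of ToStringF32 might look like. The free function name FormatF32Lane, the field width, and the separator placement are assumptions inferred from the sample output later in this section; this is not the author's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper (not from YmmVal.h): formats the four floats in one
// 128-bit half of a 256-bit value. lane 0 = elements 0-3, lane 1 = elements 4-7.
std::string FormatF32Lane(const float (&v)[8], std::size_t lane)
{
    assert(lane < 2);

    std::ostringstream oss;
    oss << std::fixed << std::setprecision(6);

    for (std::size_t i = 0; i < 4; i++)
    {
        oss << std::setw(16) << v[lane * 4 + i];   // width 16 is an assumption

        if (i == 1)
            oss << "   |";                         // separator between element pairs
    }

    return oss.str();
}
```

Calling FormatF32Lane(a.m_F32, 0) would format elements 0 through 3, which corresponds to the a[0] line in the output for this example.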

The C++ code for example Ch09_01 starts with declarations for the assembly language functions AvxPackedMathF32_ and AvxPackedMathF64_. These functions carry out various packed single-precision and double-precision floating-point arithmetic operations using the supplied YmmVal arguments. Following the assembly language function declarations is the function AvxPackedMathF32. This function starts by initializing YmmVal variables a and b. Note that the C++ specifier alignas(32) is used with each YmmVal declaration. This specifier instructs the C++ compiler to align each YmmVal variable on a 32-byte boundary. Following YmmVal variable initialization, AvxPackedMathF32 calls the assembly language function AvxPackedMathF32_ to perform the required arithmetic. The results are then streamed to cout. The function AvxPackedMathF64 is the double-precision floating-point counterpart of AvxPackedMathF32.
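The effect of the alignas(32) specifier can be checked at runtime. The short sketch below assumes a hypothetical helper named IsAligned32 (it is not part of the book's source code) that tests whether an address falls on a 32-byte boundary, which is the requirement for the vmovaps and vmovapd instructions when they are used with 256-bit wide memory operands.

```cpp
#include <cstdint>

// Hypothetical helper: returns true if p is located on a 32-byte boundary.
inline bool IsAligned32(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) % 32) == 0;
}
```

For a variable declared as alignas(32) float a[8], IsAligned32(a) is expected to return true, while the address of a[1] (four bytes later) is not on a 32-byte boundary.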

Near the top of the assembly language code in Listing 9-1 is a .const section that defines packed constant values for calculating floating-point absolute values. The text dup is a MASM operator that allocates and optionally initializes multiple data values. In the current example, the statement AbsMaskF32 dword 8 dup(7fffffffh) allocates storage space for eight doubleword values and each value is initialized to 0x7fffffff. The following statement, AbsMaskF64 qword 4 dup(7fffffffffffffffh), allocates four quadwords of 0x7fffffffffffffff. Note that neither of these 256-bit wide operands is preceded by an align statement, which means that they may not be aligned on a 32-byte boundary. The reason for this is that the MASM align directive does not support 32-byte alignment within a .const, .data, or .code section. This is harmless here because the vandps and vandpd instructions that reference the masks, like most VEX-encoded instructions, do not require aligned memory operands; only explicitly aligned moves such as vmovaps and vmovapd do. Later in this chapter, you learn how to define a custom segment of constant values that supports 32-byte alignment.
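To see why a bitwise AND with 0x7fffffff yields an absolute value, consider the scalar sketch below. It assumes IEEE 754 binary32 and binary64 encodings, where the most significant bit is the sign bit. The function names AbsViaMaskF32 and AbsViaMaskF64 are hypothetical and used only for illustration; each one performs on a single value what vandps or vandpd performs on every element of a YMM register.

```cpp
#include <cstdint>
#include <cstring>

// Clears the sign bit of a single-precision value (same mask as AbsMaskF32).
inline float AbsViaMaskF32(float x)
{
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));      // reinterpret float as raw bits
    bits &= 0x7fffffffu;                       // clear bit 31 (sign)
    std::memcpy(&x, &bits, sizeof(x));
    return x;
}

// Clears the sign bit of a double-precision value (same mask as AbsMaskF64).
inline double AbsViaMaskF64(double x)
{
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    bits &= 0x7fffffffffffffffull;             // clear bit 63 (sign)
    std::memcpy(&x, &bits, sizeof(x));
    return x;
}
```

The exponent and significand bits pass through the mask unchanged, so the magnitude of the value is preserved exactly.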

Following the .const section, the first instruction of AvxPackedMathF32_, vmovaps ymm0,ymmword ptr [rcx], loads argument a (i.e., the eight floating-point values of YmmVal a) into register YMM0. The vmovaps can be used here since YmmVal a was defined using the alignas(32) specifier in the C++ code. The operator ymmword ptr directs the assembler to treat the memory location pointed to by RCX as a 256-bit wide operand. Use of the ymmword ptr operator is optional in this instance and employed to improve code readability. The ensuing vmovaps ymm1,ymmword ptr [rdx] instruction loads b into register YMM1. The vaddps ymm2,ymm0,ymm1 instruction that follows sums the packed single-precision floating-point values in YMM0 and YMM1; it then saves the result to YMM2. The vmovaps ymmword ptr [r8],ymm2 instruction saves the packed sums to c[0].

The ensuing vsubps, vmulps, and vdivps instructions carry out packed single-precision floating-point subtraction, multiplication, and division. This is followed by a vandps ymm2,ymm1,ymmword ptr [AbsMaskF32] instruction that calculates packed absolute values using argument b. The remaining instructions in AvxPackedMathF32_ calculate packed single-precision floating-point square roots, minimums, and maximums.

Prior to its ret instruction, the function AvxPackedMathF32_ uses a vzeroupper instruction, which zeros the high-order 128 bits of each YMM register. As explained in Chapter 4, the vzeroupper instruction is needed here to avoid potential performance delays that can occur whenever the processor transitions from executing x86-AVX instructions that use 256-bit wide operands to executing x86-SSE instructions. Any assembly language function that uses one or more YMM registers and is callable from code that potentially uses x86-SSE instructions should always ensure that a vzeroupper instruction is executed before program control is transferred back to the calling function. You’ll see additional examples of vzeroupper instruction use in this and subsequent chapters.

The organization of function AvxPackedMathF64_ is similar to AvxPackedMathF32_. AvxPackedMathF64_ carries out its calculations using the double-precision versions of the same instructions that are used in AvxPackedMathF32_. Here is the output for source code example Ch09_01:

Results for AvxPackedMathF32

a[0]: 36.000000 0.031250 | 2.000000 42.000000

b[0]: -0.111111 64.000000 | -0.062500 8.666667

addps[0]: 35.888889 64.031250 | 1.937500 50.666668

subps[0]: 36.111111 -63.968750 | 2.062500 33.333332

mulps[0]: -4.000000 2.000000 | -0.125000 364.000000

divps[0]: -324.000031 0.000488 | -32.000000 4.846154

absps b[0]: 0.111111 64.000000 | 0.062500 8.666667

sqrtps a[0]: 6.000000 0.176777 | 1.414214 6.480741

minps[0]: -0.111111 0.031250 | -0.062500 8.666667

maxps[0]: 36.000000 64.000000 | 2.000000 42.000000

a[1]: 7.000000 20.500000 | 36.125000 0.500000

b[1]: -18.125000 56.000000 | 24.000000 -98.599998

addps[1]: -11.125000 76.500000 | 60.125000 -98.099998

subps[1]: 25.125000 -35.500000 | 12.125000 99.099998

mulps[1]: -126.875000 1148.000000 | 867.000000 -49.299999

divps[1]: -0.386207 0.366071 | 1.505208 -0.005071

absps b[1]: 18.125000 56.000000 | 24.000000 98.599998

sqrtps a[1]: 2.645751 4.527693 | 6.010407 0.707107

minps[1]: -18.125000 20.500000 | 24.000000 -98.599998

maxps[1]: 7.000000 56.000000 | 36.125000 0.500000

Results for AvxPackedMathF64

a[0]: 2.000000000000 | 4.000000000000

b[0]: 3.141592653590 | 2.718281828459

addpd[0]: 5.141592653590 | 6.718281828459

subpd[0]: -1.141592653590 | 1.281718171541

mulpd[0]: 6.283185307180 | 10.873127313836

divpd[0]: 0.636619772368 | 1.471517764686

abspd b[0]: 3.141592653590 | 2.718281828459

sqrtpd a[0]: 1.414213562373 | 2.000000000000

minpd[0]: 2.000000000000 | 2.718281828459

maxpd[0]: 3.141592653590 | 4.000000000000

a[1]: 7.500000000000 | 3.000000000000

b[1]: -9.125000000000 | -3.141592653590

addpd[1]: -1.625000000000 | -0.141592653590

subpd[1]: 16.625000000000 | 6.141592653590

mulpd[1]: -68.437500000000 | -9.424777960769

divpd[1]: -0.821917808219 | -0.954929658551

abspd b[1]: 9.125000000000 | 3.141592653590

sqrtpd a[1]: 2.738612787526 | 1.732050807569

minpd[1]: -9.125000000000 | -3.141592653590

maxpd[1]: 7.500000000000 | 3.000000000000

Packed Floating-Point Arrays

In previous chapters, you learned how to carry out integer and floating-point array calculations using the general-purpose and XMM register sets. In this section, you learn how to perform floating-point array operations using the YMM register set.

Simple Calculations

Listing 9-2 shows the source code for example Ch09_02. This example illustrates how to perform simple array calculations using 256-bit wide packed floating-point operands. It also demonstrates how to detect and exclude invalid array elements from packed calculations. Source code example Ch09_02 is an array implementation of example Ch05_02 from Chapter 5, which calculated sphere surface areas and volumes. In that example, the assembly language function CalcSphereAreaVolume_ computed the surface area and volume of a single sphere. In this example, the sphere radii are passed via an array to calculating functions coded using C++ and assembly language. To make the example a little more interesting, both the C++ and assembly language calculating functions test for radii less than zero. If an invalid radius is detected, the calculating functions set the corresponding elements in the surface area and volume arrays to QNaN.

//------------------------------------------------

// Ch09_02.cpp

//------------------------------------------------

#include "stdafx.h"

#include <iostream>

#include <iomanip>

#include <random>

#include <limits>

#define _USE_MATH_DEFINES

#include <math.h>

using namespace std;

extern "C" void AvxCalcSphereAreaVolume_(float* sa, float* vol, const float* r, size_t n);

extern "C" float c_PI_F32 = (float)M_PI;

extern "C" float c_QNaN_F32 = numeric_limits<float>::quiet_NaN();

void Init(float* r, size_t n, unsigned int seed)

{

uniform_int_distribution<> ui_dist {1, 100};

default_random_engine rng {seed};

for (size_t i = 0; i < n; i++)

r[i] = (float)ui_dist(rng) / 10.0f;

// Set invalid radii for test purposes

if (n > 2)

{

r[2] = -r[2];

r[n / 4] = -r[n / 4];

r[n / 2] = -r[n / 2];

r[n / 4 * 3] = -r[n / 4 * 3];

r[n - 2] = -r[n - 2];

}

}

void AvxCalcSphereAreaVolumeCpp(float* sa, float* vol, const float* r, size_t n)

{

for (size_t i = 0; i < n; i++)

{

if (r[i] < 0.0f)

sa[i] = vol[i] = c_QNaN_F32;

else

{

sa[i] = r[i] * r[i] * 4.0f * c_PI_F32;

vol[i] = sa[i] * r[i] / 3.0f;

}

}

}

void AvxCalcSphereAreaVolume(void)

{

const size_t n = 21;

alignas(32) float r[n];

alignas(32) float sa1[n];

alignas(32) float vol1[n];

alignas(32) float sa2[n];

alignas(32) float vol2[n];

Init(r, n, 93);

AvxCalcSphereAreaVolumeCpp(sa1, vol1, r, n);

AvxCalcSphereAreaVolume_(sa2, vol2, r, n);

cout << "\nResults for AvxCalcSphereAreaVolume\n";

cout << fixed;

const float eps = 1.0e-6f;

for (size_t i = 0; i < n; i++)

{

cout << setw(2) << i << ": ";

cout << setprecision(2);

cout << setw(5) << r[i] << " | ";

cout << setprecision(6);

cout << setw(12) << sa1[i] << " ";

cout << setw(12) << sa2[i] << " | ";

cout << setw(12) << vol1[i] << " ";

cout << setw(12) << vol2[i];

bool b0 = (fabs(sa1[i] - sa2[i]) > eps);

bool b1 = (fabs(vol1[i] - vol2[i]) > eps);

if (b0 || b1)

cout << " Compare discrepancy";

cout << '\n';

}

}

int main()

{

AvxCalcSphereAreaVolume();

return 0;

}

;-------------------------------------------------

; Ch09_02.asm

;-------------------------------------------------

include <cmpequ.asmh>

include <MacrosX86-64-AVX.asmh>

.const

r4_3p0 real4 3.0

r4_4p0 real4 4.0

extern c_PI_F32:real4

extern c_QNaN_F32:real4

; extern "C" void AvxCalcSphereAreaVolume_(float* sa, float* vol, const float* r, size_t n);

.code

AvxCalcSphereAreaVolume_ proc frame

_CreateFrame CC_,0,64

_SaveXmmRegs xmm6,xmm7,xmm8,xmm9

_EndProlog

; Initialize

vbroadcastss ymm0,real4 ptr [r4_4p0] ;packed 4.0

vbroadcastss ymm1,real4 ptr [c_PI_F32] ;packed PI

vmulps ymm6,ymm0,ymm1 ;packed 4.0 * PI

vbroadcastss ymm7,real4 ptr [r4_3p0] ;packed 3.0

vbroadcastss ymm8,real4 ptr [c_QNaN_F32] ;packed QNaN

vxorps ymm9,ymm9,ymm9 ;packed 0.0

xor eax,eax ;common offset for arrays

cmp r9,8

jb FinalR ;skip main loop if n < 8

; Calculate surface area and volume values using packed arithmetic

@@: vmovdqa ymm0,ymmword ptr [r8+rax] ;load next 8 radii

vmulps ymm2,ymm6,ymm0 ;4.0 * PI * r

vmulps ymm3,ymm2,ymm0 ;4.0 * PI * r * r

vcmpps ymm1,ymm0,ymm9,CMP_LT ;ymm1 = mask of radii < 0.0

vandps ymm4,ymm1,ymm8 ;set surface area to QNaN for radii < 0.0

vandnps ymm5,ymm1,ymm3 ;keep surface area for radii >= 0.0

vorps ymm5,ymm4,ymm5 ;final packed surface area

vmovaps ymmword ptr[rcx+rax],ymm5 ;save packed surface area

vmulps ymm2,ymm3,ymm0 ;4.0 * PI * r * r * r

vdivps ymm3,ymm2,ymm7 ;4.0 * PI * r * r * r / 3.0

vandps ymm4,ymm1,ymm8 ;set volume to QNaN for radii < 0.0

vandnps ymm5,ymm1,ymm3 ;keep volume for radii >= 0.0

vorps ymm5,ymm4,ymm5 ;final packed volume

vmovaps ymmword ptr[rdx+rax],ymm5 ;save packed volume

add rax,32 ;rax = offset to next set of radii

sub r9,8

cmp r9,8

jae @B ;repeat until n < 8

; Perform final calculations using scalar arithmetic

FinalR: test r9,r9

jz Done ;skip loop if no more elements

@@: vmovss xmm0,real4 ptr [r8+rax]

vmulss xmm2,xmm6,xmm0 ;4.0 * PI * r

vmulss xmm3,xmm2,xmm0 ;4.0 * PI * r * r

vcmpss xmm1,xmm0,xmm9,CMP_LT

vandps xmm4,xmm1,xmm8

vandnps xmm5,xmm1,xmm3

vorps xmm5,xmm4,xmm5

vmovss real4 ptr[rcx+rax],xmm5 ;save surface area

vmulss xmm2,xmm3,xmm0 ;4.0 * PI * r * r * r

vdivss xmm3,xmm2,xmm7 ;4.0 * PI * r * r * r / 3.0

vandps xmm4,xmm1,xmm8

vandnps xmm5,xmm1,xmm3

vorps xmm5,xmm4,xmm5

vmovss real4 ptr[rdx+rax],xmm5 ;save volume

add rax,4

dec r9

jnz @B ;repeat until done

Done: vzeroupper

_RestoreXmmRegs xmm6,xmm7,xmm8,xmm9

_DeleteFrame

ret

AvxCalcSphereAreaVolume_ endp

end

Listing 9-2.

Example Ch09_02

The C++ code in Listing 9-2 includes a function named AvxCalcSphereAreaVolumeCpp. This function calculates sphere surface areas and volumes. The sphere radii are passed to AvxCalcSphereAreaVolumeCpp via an array. Prior to calculating a surface area or volume, the sphere’s radius (r[i]) is tested to verify that it’s not negative. If the radius is negative, the corresponding elements in the surface area and volume arrays (sa[i] and vol[i]) are set to c_QNaN_F32. The remaining C++ code performs the necessary initializations, exercises the C++ and assembly language calculating functions, and displays the results. Note that the function AvxCalcSphereAreaVolume employs the alignas(32) specifier with each array declaration.

The assembly language function AvxCalcSphereAreaVolume_ performs the same calculations as its C++ counterpart. Following its prolog, AvxCalcSphereAreaVolume_ uses a series of vbroadcastss instructions to initialize packed versions of the required constants. Prior to the start of the processing loop, a cmp r9,8 instruction checks the value of n. The reason for this check is that the processing loop carries out eight surface area and volume calculations simultaneously using 256-bit wide operands. The jb FinalR conditional jump instruction skips the processing loop if there are fewer than eight radii to process.
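The split between the packed processing loop and the remaining elements can be sketched in C++ as follows. The struct LoopSplit and the function SplitForYmmF32 are hypothetical names used only to illustrate the control flow; they are not part of the example's source code.

```cpp
#include <cstddef>

// Hypothetical illustration of how n elements are divided between the
// packed loop (eight floats per iteration) and the scalar remainder loop.
struct LoopSplit
{
    std::size_t packed_iters;   // iterations of the 256-bit packed loop
    std::size_t scalar_iters;   // elements left for scalar processing
};

inline LoopSplit SplitForYmmF32(std::size_t n)
{
    LoopSplit s {0, 0};

    while (n >= 8)              // mirrors cmp r9,8 with jb/jae
    {
        s.packed_iters++;       // one pass handles eight radii
        n -= 8;                 // mirrors sub r9,8
    }

    s.scalar_iters = n;         // 0 to 7 elements remain
    return s;
}
```

For the test array in this example (n = 21), the packed loop would execute twice and five radii would be left for scalar processing.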

Each processing loop iteration begins with a vmovdqa ymm0,ymmword ptr [r8+rax] instruction that loads eight single-precision floating-point radii into register YMM0 (vmovdqa is nominally an integer move; vmovaps would load the same 256 bits). The ensuing vmulps instructions calculate the sphere surface areas. The next instruction, vcmpps ymm1,ymm0,ymm9,CMP_LT, tests each sphere radius for a value less than 0.0 (register YMM9 contains packed 0.0). Recall that the vcmpps instruction signifies its results by setting elements in the destination operand to either 0x00000000 (false compare predicate) or 0xffffffff (true compare predicate). The vandps, vandnps, and vorps instructions that follow set the surface area of each sphere whose radius is less than 0.0 to c_QNaN_F32. Figure 9-1 illustrates this operation in greater detail. A vmovaps ymmword ptr[rcx+rax],ymm5 instruction saves the eight sphere surface area values to the array sa.

Figure 9-1. Surface area QNaN assignment for spheres with radius less than 0.0

Following the calculation of the surface areas, the vmulps ymm2,ymm3,ymm0 and vdivps ymm3,ymm2,ymm7 instructions compute the sphere volumes. The processing loop uses another vandps, vandnps, and vorps instruction sequence to set the volume of any negative-radius sphere to c_QNaN_F32. These values are then saved to the array vol. The processing loop repeats until there are fewer than eight remaining radii.
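The compare-and-merge sequence used for both the surface areas and the volumes can be emulated for a single element with ordinary integer operations. The sketch below assumes IEEE 754 single-precision encoding; the function name SelectQNaNIfNegative is hypothetical. Each statement is annotated with the packed instruction that it imitates.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical single-element emulation of the vcmpps/vandps/vandnps/vorps
// sequence: returns QNaN if r < 0.0, otherwise returns value unchanged.
inline float SelectQNaNIfNegative(float r, float value)
{
    const float qnan = std::numeric_limits<float>::quiet_NaN();

    // vcmpps ...,CMP_LT: all ones if r < 0.0, all zeros otherwise
    std::uint32_t mask = (r < 0.0f) ? 0xffffffffu : 0x00000000u;

    std::uint32_t qnan_bits, value_bits;
    std::memcpy(&qnan_bits, &qnan, sizeof(qnan_bits));
    std::memcpy(&value_bits, &value, sizeof(value_bits));

    std::uint32_t t1 = mask & qnan_bits;       // vandps:  QNaN where r < 0.0
    std::uint32_t t2 = ~mask & value_bits;     // vandnps: value where r >= 0.0
    std::uint32_t merged = t1 | t2;            // vorps:   final element

    float result;
    std::memcpy(&result, &merged, sizeof(result));
    return result;
}
```

Because exactly one of the two intermediate values is non-zero for any given element, the OR operation selects between them without a conditional jump.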

The next block of code computes sphere surface areas and volumes for the remaining (1–7) radii. Note that AvxCalcSphereAreaVolume_ carries out these calculations using scalar single-precision floating-point arithmetic. The scalar processing loop performs the same arithmetic and Boolean operations as the packed processing loop. Similar to the previous example, AvxCalcSphereAreaVolume_ uses a vzeroupper instruction immediately after the scalar processing loop. This instruction is needed since AvxCalcSphereAreaVolume_ carried out its calculations using the YMM register set. When a vzeroupper instruction is required, it should always be positioned before any function epilog macros (e.g., _RestoreXmmRegs and _DeleteFrame) and the ret instruction. Here are the results for source code example Ch09_02:

Results for AvxCalcSphereAreaVolume

0: 3.80 | 181.458389 181.458389 | 229.847290 229.847290

1: 10.00 | 1256.637085 1256.637085 | 4188.790527 4188.790527

2: -6.10 | nan nan | nan nan

3: 3.70 | 172.033630 172.033630 | 212.174805 212.174805

4: 9.60 | 1158.116821 1158.116821 | 3705.973877 3705.973877

5: -6.60 | nan nan | nan nan

6: 2.60 | 84.948662 84.948654 | 73.622169 73.622162 Compare discrepancy

7: 9.30 | 1086.865479 1086.865479 | 3369.283203 3369.283203

8: 9.00 | 1017.876038 1017.876038 | 3053.628174 3053.628174

9: 5.80 | 422.732758 422.732758 | 817.283386 817.283386

10: -2.90 | nan nan | nan nan

11: 8.10 | 824.479675 824.479675 | 2226.095215 2226.095215

12: 3.00 | 113.097336 113.097336 | 113.097328 113.097328

13: 8.00 | 804.247742 804.247742 | 2144.660645 2144.660645

14: 1.40 | 24.630087 24.630085 | 11.494040 11.494039 Compare discrepancy

15: -1.80 | nan nan | nan nan

16: 4.30 | 232.352219 232.352219 | 333.038177 333.038177

17: 6.60 | 547.391113 547.391113 | 1204.260376 1204.260376

18: 4.50 | 254.469009 254.469009 | 381.703522 381.703522

19: -1.20 | nan nan | nan nan

20: 4.50 | 254.469009 254.469009 | 381.703522 381.703522

The output for source code example Ch09_02 includes a couple of lines with the text “Compare discrepancy.” This text was generated by the compare code in AvxCalcSphereAreaVolume to exemplify the non-associativity of floating-point arithmetic. In this example, the functions AvxCalcSphereAreaVolumeCpp and AvxCalcSphereAreaVolume_ carried out their respective floating-point calculations using different operand orderings. For each sphere surface area, the C++ code calculates sa[i] = r[i] * r[i] * 4.0f * c_PI_F32, while the assembly language code calculates sa[i] = 4.0f * c_PI_F32 * r[i] * r[i]. Tiny numerical discrepancies like these are not unusual when comparing floating-point values that are calculated using different operand orderings irrespective of the programming language. This is something that you should keep in mind if you’re developing production code that includes multiple versions of the same calculating function (e.g., one coded using C++ and an AVX/AVX2 accelerated version that’s implemented using x86 assembly language).
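The effect of operand ordering can be explored without any assembly language code. The following sketch evaluates both orderings in C++ and reports how often, and by how much, the results differ. The names OrderingStats and CompareAreaOrderings are hypothetical, and the number of mismatches observed may vary with the compiler and its floating-point settings.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Hypothetical summary of the differences between the two operand orderings.
struct OrderingStats
{
    std::size_t mismatches;     // number of radii whose two results differ
    float max_rel_diff;         // largest relative difference observed
};

inline OrderingStats CompareAreaOrderings(const float* r, std::size_t n, float pi)
{
    OrderingStats s {0, 0.0f};

    for (std::size_t i = 0; i < n; i++)
    {
        float a1 = r[i] * r[i] * 4.0f * pi;    // ordering used by the C++ function
        float a2 = 4.0f * pi * r[i] * r[i];    // ordering used by the assembly function

        if (a1 != a2)
        {
            float denom = std::max(std::fabs(a1), std::fabs(a2));
            s.mismatches++;
            s.max_rel_diff = std::max(s.max_rel_diff, std::fabs(a1 - a2) / denom);
        }
    }

    return s;
}
```

Each ordering performs three rounded multiplications, so any difference is limited to a few units in the last place. This is why a comparison that uses a fixed absolute tolerance such as 1.0e-6f can flag values that agree to about seven significant digits.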

Finally, you may have noticed that the function AvxCalcSphereAreaVolume_ handled invalid radii sans any x86 conditional jump instructions. Minimizing the number of conditional jump instructions in a function, especially data-dependent ones, often results in faster executing code. You’ll learn more about jump instruction optimization techniques in Chapter 15.

Column Means

Listing 9-3 shows the source code for example Ch09_03. This example illustrates how to calculate the arithmetic mean of each column in a two-dimensional array of double-precision floating-point values.

//------------------------------------------------

// Ch09_03.cpp

//------------------------------------------------

#include "stdafx.h"

#include <iostream>

#include <iomanip>

#include <random>

#include <memory>

using namespace std;

extern "C" size_t c_NumRowsMax = 1024 * 1024;

extern "C" size_t c_NumColsMax = 1024 * 1024;

extern "C" bool AvxCalcColumnMeans_(const double* x, size_t nrows, size_t ncols, double* col_means);

void Init(double* x, size_t n, unsigned int seed)

{

uniform_int_distribution<> ui_dist {1, 2000};

default_random_engine rng {seed};

for (size_t i = 0; i < n; i++)

x[i] = (double)ui_dist(rng) / 10.0;

}

bool AvxCalcColumnMeansCpp(const double* x, size_t nrows, size_t ncols, double* col_means)

{

// Make sure nrows and ncols are valid

if (nrows == 0 || nrows > c_NumRowsMax)

return false;

if (ncols == 0 || ncols > c_NumColsMax)

return false;

// Set initial column means to zero

for (size_t i = 0; i < ncols; i++)

col_means[i] = 0.0;

// Calculate column means

for (size_t i = 0; i < nrows; i++)

{

for (size_t j = 0; j < ncols; j++)

col_means[j] += x[i * ncols + j];

}

for (size_t j = 0; j < ncols; j++)

col_means[j] /= nrows;

return true;

}

void AvxCalcColumnMeans(void)

{

const size_t nrows = 20;

const size_t ncols = 11;

unique_ptr<double[]> x {new double[nrows * ncols]};

unique_ptr<double[]> col_means1 {new double[ncols]};

unique_ptr<double[]> col_means2 {new double[ncols]};

Init(x.get(), nrows * ncols, 47);

bool rc1 = AvxCalcColumnMeansCpp(x.get(), nrows, ncols, col_means1.get());

bool rc2 = AvxCalcColumnMeans_(x.get(), nrows, ncols, col_means2.get());

cout << "Results for AvxCalcColumnMeans\n";

if (!rc1 || !rc2)

{

cout << "Invalid return code: ";

cout << "rc1 = " << boolalpha << rc1 << ", ";

cout << "rc2 = " << boolalpha << rc2 << '\n';

return;

}

cout << "\nTest Matrix\n";

cout << fixed << setprecision(1);

for (size_t i = 0; i < nrows; i++)

{

cout << "row " << setw(2) << i;

for (size_t j = 0; j < ncols; j++)

cout << setw(7) << x[i * ncols + j];

cout << '\n';

}

cout << "\nColumn Means\n";

cout << setprecision(2);

for (size_t j = 0; j < ncols; j++)

{

cout << "col_means1[" << setw(2) << j << "] =";

cout << setw(10) << col_means1[j] << " ";

cout << "col_means2[" << setw(2) << j << "] =";

cout << setw(10) << col_means2[j] << '\n';

}

}

int main()

{

AvxCalcColumnMeans();

return 0;

}

;-------------------------------------------------

; Ch09_03.asm

;-------------------------------------------------

; extern "C" bool AvxCalcColumnMeans_(const double* x, size_t nrows, size_t ncols, double* col_means)

extern c_NumRowsMax:qword

extern c_NumColsMax:qword

.code

AvxCalcColumnMeans_ proc

; Validate nrows and ncols

xor eax,eax ;error return code (also col_mean index)

test rdx,rdx

jz Done ;jump if nrows is zero

cmp rdx,[c_NumRowsMax]

ja Done ;jump if nrows is too large

test r8,r8

jz Done ;jump if ncols is zero

cmp r8,[c_NumColsMax]

ja Done ;jump if ncols is too large

; Initialize elements of col_means to zero

vxorpd xmm0,xmm0,xmm0 ;xmm0[63:0] = 0.0

@@: vmovsd real8 ptr[r9+rax*8],xmm0 ;col_means[i] = 0.0

inc rax

cmp rax,r8

jb @B ;repeat until done

vcvtsi2sd xmm2,xmm2,rdx ;convert nrows for later use

; Compute the sum of each column in x

LP1: mov r11,r9 ;r11 = ptr to col_means

xor r10,r10 ;r10 = col_index

LP2: mov rax,r10 ;rax = col_index

add rax,4

cmp rax,r8 ;4 or more columns remaining?

ja @F ;jump if no (col_index + 4 > ncols)

; Update col_means using next four columns

vmovupd ymm0,ymmword ptr [rcx] ;load next 4 columns of current row

vaddpd ymm1,ymm0,ymmword ptr [r11] ;add to col_means

vmovupd ymmword ptr [r11],ymm1 ;save updated col_means

add r10,4 ;col_index += 4

add rcx,32 ;update x ptr

add r11,32 ;update col_means ptr

jmp NextColSet

@@: sub rax,2

cmp rax,r8 ;2 or more columns remaining?

ja @F ;jump if no (col_index + 2 > ncols)

; Update col_means using next two columns

vmovupd xmm0,xmmword ptr [rcx] ;load next 2 columns of current row

vaddpd xmm1,xmm0,xmmword ptr [r11] ;add to col_means

vmovupd xmmword ptr [r11],xmm1 ;save updated col_means

add r10,2 ;col_index += 2

add rcx,16 ;update x ptr

add r11,16 ;update col_means ptr

jmp NextColSet

; Update col_means using next column (or last column in the current row)

@@: vmovsd xmm0,real8 ptr [rcx] ;load x from last column

vaddsd xmm1,xmm0,real8 ptr [r11] ;add to col_means

vmovsd real8 ptr [r11],xmm1 ;save updated col_means

inc r10 ;col_index += 1

add rcx,8 ;update x ptr

NextColSet:

cmp r10,r8 ;more columns in current row?

jb LP2 ;jump if yes

dec rdx ;nrows -= 1

jnz LP1 ;jump if more rows

; Compute the final col_means

@@: vmovsd xmm0,real8 ptr [r9] ;xmm0 = col_means[i]

vdivsd xmm1,xmm0,xmm2 ;compute final mean

vmovsd real8 ptr [r9],xmm1 ;save col_mean[i]

add r9,8 ;update col_means ptr

dec r8 ;ncols -= 1

jnz @B ;repeat until done

mov eax,1 ;set success return code

Done: vzeroupper

ret

AvxCalcColumnMeans_ endp

end

Listing 9-3.

Example Ch09_03

Toward the top of the C++ code is a function named AvxCalcColumnMeansCpp. This function calculates the column means of a two-dimensional array using a straightforward set of nested for loops and some simple arithmetic. The function AvxCalcColumnMeans contains code that uses the C++ smart pointer class unique_ptr<> to help manage its dynamically-allocated arrays. Note that storage space for the test array x is allocated using the C++ new operator, which means that the array may not be aligned on a 16- or 32-byte boundary. In this particular example, aligning the start of array x to a specific boundary would be of little benefit since it’s not possible to align the individual rows or columns of a standard C++ two-dimensional array (recall that the elements of a two-dimensional C++ array are stored in a contiguous block of memory using row-major ordering as described in Chapter 2).

The function AvxCalcColumnMeans also uses class unique_ptr<> and the new operator for the one-dimensional arrays col_means1 and col_means2. Using unique_ptr<> in this example simplifies the C++ code somewhat since its destructor automatically invokes the delete[] operator to release the storage space that was allocated by the new operator. If you’re interested in learning more about the smart pointer class unique_ptr<>, Appendix A contains a list of C++ references that you can consult. The remaining code in AvxCalcColumnMeans invokes the C++ and assembly language column-mean calculating functions and streams the results to cout.

Following argument validation, the assembly language function AvxCalcColumnMeans_ initializes each element in col_means to 0.0. These elements will maintain the intermediate column sums. In order to maximize throughput, the column summation code uses slightly different instruction sequences depending on the current column and the total number of columns in the array. For example, assume that array x contains seven columns. For each row, the elements of the first four columns in x can be added to col_means using 256-bit wide packed addition; the elements of the next two columns can be added to col_means using 128-bit wide packed addition; and the final column element must be added to col_means using scalar addition. Figure 9-2 illustrates this technique in greater detail.

Figure 9-2.

Updating the col_means array using different operand sizes
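The operand-size selection shown in Figure 9-2 can also be expressed in plain C++. The following sketch is not part of Listing 9-3. The function name CalcColumnMeansChunked is hypothetical, the c_NumRowsMax and c_NumColsMax checks are omitted, and ordinary scalar additions stand in for the 256-bit and 128-bit packed additions. It is meant only to mirror the control flow of the assembly language code, which picks the widest group of columns (four, two, or one) that still fits in the current row.

```cpp
#include <cstddef>

// Hypothetical helper (not from Listing 9-3): column means of a row-major
// nrows x ncols array, walking each row in groups of 4, 2, or 1 columns.
bool CalcColumnMeansChunked(const double* x, std::size_t nrows, std::size_t ncols, double* col_means)
{
    if (nrows == 0 || ncols == 0)
        return false;

    // col_means first holds the running column sums
    for (std::size_t j = 0; j < ncols; j++)
        col_means[j] = 0.0;

    for (std::size_t i = 0; i < nrows; i++)
    {
        const double* row = x + i * ncols;
        std::size_t j = 0;

        while (j < ncols)
        {
            // Widest group that fits: 4 (YMM-sized), 2 (XMM-sized), or 1 (scalar)
            const std::size_t width = (j + 4 <= ncols) ? 4 : (j + 2 <= ncols) ? 2 : 1;

            for (std::size_t k = 0; k < width; k++)
                col_means[j + k] += row[j + k];

            j += width;
        }
    }

    // Convert the column sums to means
    for (std::size_t j = 0; j < ncols; j++)
        col_means[j] /= static_cast<double>(nrows);

    return true;
}
```

For an array with seven columns, the inner while loop executes three times per row using group widths of 4, 2, and 1, which matches the seven-column case described above.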

The mov r11,r9 instruction next to the label LP1 is the starting point for adding elements in the current row of x to col_means. This instruction points R11 to the first entry in col_means. The col_index counter in register R10 is then set to zero. The instruction group near the label LP2 determines the number of columns remaining to be processed in the current row. If four or more columns remain, the next four elements from the current row are added to the column sums in col_means. A vmovupd ymm0,ymmword ptr [rcx] instruction loads four double-precision floating-point values from x into YMM0 (a vmovapd instruction is not used here since alignment of the elements is unknown). The ensuing vaddpd ymm1,ymm0,ymmword ptr [r11] instruction sums the current array elements with the corresponding elements in col_means, and the vmovupd ymmword ptr [r11],ymm1 instruction saves the updated results back to col_means. The function’s various pointers and counters are then updated in preparation for the next set of elements from the current row of x.

The summation code repeats the steps described in the previous paragraph until the number of array elements that remain in the current row is less than four. As soon as this condition is met, the elements in the remaining columns (if any) must be processed using 128-bit wide or 64-bit wide operands. This is the reason for the distinct blocks of code in AvxCalcColumnMeans_ that process four elements, two elements, or a single element per row. Following computation of the column sums, each element in col_means is divided by nrows, which yields the final column mean. Here are the results for source code example Ch09_03:

Results for AvxCalcColumnMeans

Test Matrix

row 0 125.6 59.9 100.0 170.5 140.1 197.2 73.7 15.2 92.4 155.3 159.2

row 1 77.6 105.4 45.0 176.8 65.9 12.3 189.1 102.0 56.2 112.8 17.2

row 2 198.9 199.3 74.6 137.9 65.0 125.0 19.8 32.1 58.6 94.1 123.5

row 3 1.7 29.1 99.1 200.0 109.0 123.7 130.0 125.3 146.2 90.6 52.2

row 4 8.7 88.7 84.8 174.6 164.4 106.2 114.0 151.8 130.8 101.9 116.2

row 5 42.7 130.5 180.4 199.4 196.6 99.7 163.6 34.2 5.5 146.1 108.5

row 6 120.0 159.5 26.0 83.4 58.7 10.1 170.1 20.5 10.8 48.3 121.9

row 7 148.9 148.4 142.0 106.6 198.4 60.3 72.1 137.8 74.5 75.7 44.8

row 8 25.7 192.0 12.1 23.4 98.7 145.3 196.8 43.9 143.1 25.1 122.6

row 9 5.4 134.7 165.1 61.8 46.7 183.3 173.7 146.9 76.5 186.2 24.9

row 10 174.5 158.9 127.8 58.9 42.9 182.9 7.8 50.3 68.0 62.0 66.1

row 11 47.3 166.2 8.2 71.2 98.5 12.4 179.0 100.2 29.7 167.4 155.2

row 12 23.9 196.6 148.7 7.1 128.2 128.8 66.3 153.7 60.7 115.4 71.6

row 13 103.4 184.3 161.5 57.9 199.2 79.3 28.1 73.1 12.5 71.3 100.4

row 14 130.3 154.2 127.5 29.7 198.2 170.3 121.9 80.4 159.8 70.0 82.6

row 15 26.7 45.6 67.7 109.7 5.1 96.2 188.7 100.7 48.3 164.2 75.4

row 16 115.4 25.5 58.8 148.5 80.7 149.1 156.7 153.8 42.0 103.7 4.2

row 17 67.9 161.5 16.9 102.1 77.3 3.9 104.7 97.2 181.8 182.0 155.1

row 18 169.5 122.4 102.2 5.5 14.5 105.1 181.5 83.3 117.6 52.1 111.2

row 19 47.1 146.9 21.0 8.6 130.3 24.7 95.7 6.7 159.9 38.8 82.6

Column Means

col_means1[ 0] = 83.06 col_means2[ 0] = 83.06

col_means1[ 1] = 130.48 col_means2[ 1] = 130.48

col_means1[ 2] = 88.47 col_means2[ 2] = 88.47

col_means1[ 3] = 96.68 col_means2[ 3] = 96.68

col_means1[ 4] = 105.92 col_means2[ 4] = 105.92

col_means1[ 5] = 100.79 col_means2[ 5] = 100.79

col_means1[ 6] = 121.66 col_means2[ 6] = 121.66

col_means1[ 7] = 85.46 col_means2[ 7] = 85.46

col_means1[ 8] = 83.75 col_means2[ 8] = 83.75

col_means1[ 9] = 103.15 col_means2[ 9] = 103.15

col_means1[10] = 89.77 col_means2[10] = 89.77

Correlation Coefficient

The next source code example illustrates how to calculate a correlation coefficient using packed double-precision floating-point arithmetic. This example also demonstrates how to perform a few common auxiliary operations with packed floating-point operands, including 128-bit wide extractions and horizontal addition. Listing 9-4 shows the source code for example Ch09_04.

//------------------------------------------------

// Ch09_04.cpp

//------------------------------------------------

#include "stdafx.h"

#include <iostream>

#include <iomanip>

#include <string>

#include <random>

#include <cmath>

#include "AlignedMem.h"

using namespace std;

extern "C" bool AvxCalcCorrCoef_(const double* x, const double* y, size_t n, double sums[5], double epsilon, double* rho);

void Init(double* x, double* y, size_t n, unsigned int seed)

{

uniform_int_distribution<> ui_dist {1, 999};

default_random_engine rng {seed};

for (size_t i = 0; i < n; i++)

{

x[i] = (double)ui_dist(rng);

y[i] = x[i] + (ui_dist(rng) % 6000) - 3000;

}

}

bool AvxCalcCorrCoefCpp(const double* x, const double* y, size_t n, double sums[5], double epsilon, double* rho)

{

const size_t alignment = 32;

// Make sure n is valid

if (n == 0)

return false;

// Make sure x and y are properly aligned

if (!AlignedMem::IsAligned(x, alignment))

return false;

if (!AlignedMem::IsAligned(y, alignment))

return false;

// Calculate and save sum variables

double sum_x = 0, sum_y = 0, sum_xx = 0, sum_yy = 0, sum_xy = 0;

for (size_t i = 0; i < n; i++)

{

sum_x += x[i];

sum_y += y[i];

sum_xx += x[i] * x[i];

sum_yy += y[i] * y[i];

sum_xy += x[i] * y[i];

}

sums[0] = sum_x;

sums[1] = sum_y;

sums[2] = sum_xx;

sums[3] = sum_yy;

sums[4] = sum_xy;

// Calculate rho

double rho_num = n * sum_xy - sum_x * sum_y;

double rho_den = sqrt(n * sum_xx - sum_x * sum_x) * sqrt(n * sum_yy - sum_y * sum_y);

if (rho_den >= epsilon)

{

*rho = rho_num / rho_den;

return true;

}

else

{

*rho = 0;

return false;

}

}

int main()

{

const size_t n = 103;

const size_t alignment = 32;

AlignedArray<double> x_aa(n, alignment);

AlignedArray<double> y_aa(n, alignment);

double sums1[5], sums2[5];

double rho1, rho2;

double epsilon = 1.0e-12;

double* x = x_aa.Data();

double* y = y_aa.Data();

Init(x, y, n, 71);

bool rc1 = AvxCalcCorrCoefCpp(x, y, n, sums1, epsilon, &rho1);

bool rc2 = AvxCalcCorrCoef_(x, y, n, sums2, epsilon, &rho2);

cout << "Results for AvxCalcCorrCoef\n";

if (!rc1 || !rc2)

{

cout << "Invalid return code\n";

cout << "rc1 = " << boolalpha << rc1 << ", ";

cout << "rc2 = " << boolalpha << rc2 << '\n';

return 1;

}

int w = 14;

string sep(w * 3, '-');

cout << fixed << setprecision(8);

cout << "Value " << setw(w) << "C++" << " " << setw(w) << "x86-AVX" << '\n';

cout << sep << '\n';

cout << "rho: " << setw(w) << rho1 << " " << setw(w) << rho2 << "\n";

cout << setprecision(1);

cout << "sum_x: " << setw(w) << sums1[0] << " " << setw(w) << sums2[0] << '\n';

cout << "sum_y: " << setw(w) << sums1[1] << " " << setw(w) << sums2[1] << '\n';

cout << "sum_xx: " << setw(w) << sums1[2] << " " << setw(w) << sums2[2] << '\n';

cout << "sum_yy: " << setw(w) << sums1[3] << " " << setw(w) << sums2[3] << '\n';

cout << "sum_xy: " << setw(w) << sums1[4] << " " << setw(w) << sums2[4] << '\n';

return 0;

}

;-------------------------------------------------

; Ch09_04.asm

;-------------------------------------------------

include <MacrosX86-64-AVX.asmh>

; extern "C" bool AvxCalcCorrCoef_(const double* x, const double* y, size_t n, double sums[5], double epsilon, double* rho)

;

; Returns 0 = error, 1 = success

.code

AvxCalcCorrCoef_ proc frame

_CreateFrame CC_,0,32

_SaveXmmRegs xmm6,xmm7

_EndProlog

; Validate arguments

or r8,r8

jz BadArg ;jump if n == 0

test rcx,1fh

jnz BadArg ;jump if x is not aligned

test rdx,1fh

jnz BadArg ;jump if y is not aligned

; Initialize sum variables to zero

vxorpd ymm3,ymm3,ymm3 ;ymm3 = packed sum_x

vxorpd ymm4,ymm4,ymm4 ;ymm4 = packed sum_y

vxorpd ymm5,ymm5,ymm5 ;ymm5 = packed sum_xx

vxorpd ymm6,ymm6,ymm6 ;ymm6 = packed sum_yy

vxorpd ymm7,ymm7,ymm7 ;ymm7 = packed sum_xy

mov r10,r8 ;r10 = n

cmp r8,4

jb LP2 ;jump if n >= 1 && n <= 3

; Calculate intermediate packed sum variables

LP1: vmovapd ymm0,ymmword ptr [rcx] ;ymm0 = packed x values

vmovapd ymm1,ymmword ptr [rdx] ;ymm1 = packed y values

vaddpd ymm3,ymm3,ymm0 ;update packed sum_x

vaddpd ymm4,ymm4,ymm1 ;update packed sum_y

vmulpd ymm2,ymm0,ymm1 ;ymm2 = packed xy values

vaddpd ymm7,ymm7,ymm2 ;update packed sum_xy

vmulpd ymm0,ymm0,ymm0 ;ymm0 = packed xx values

vmulpd ymm1,ymm1,ymm1 ;ymm1 = packed yy values

vaddpd ymm5,ymm5,ymm0 ;update packed sum_xx

vaddpd ymm6,ymm6,ymm1 ;update packed sum_yy

add rcx,32 ;update x ptr

add rdx,32 ;update y ptr

sub r8,4 ;n -= 4

cmp r8,4 ;is n >= 4?

jae LP1 ;jump if yes

or r8,r8 ;is n == 0?

jz FSV ;jump if yes

; Update sum variables with final x & y values

LP2: vmovsd xmm0,real8 ptr [rcx] ;xmm0[63:0] = x[i], ymm0[255:64] = 0

vmovsd xmm1,real8 ptr [rdx] ;xmm1[63:0] = y[i], ymm1[255:64] = 0

vaddpd ymm3,ymm3,ymm0 ;update packed sum_x

vaddpd ymm4,ymm4,ymm1 ;update packed sum_y

vmulpd ymm2,ymm0,ymm1 ;ymm2 = packed xy values

vaddpd ymm7,ymm7,ymm2 ;update packed sum_xy

vmulpd ymm0,ymm0,ymm0 ;ymm0 = packed xx values

vmulpd ymm1,ymm1,ymm1 ;ymm1 = packed yy values

vaddpd ymm5,ymm5,ymm0 ;update packed sum_xx

vaddpd ymm6,ymm6,ymm1 ;update packed sum_yy

add rcx,8 ;update x ptr

add rdx,8 ;update y ptr

sub r8,1 ;n -= 1

jnz LP2 ;repeat until done

; Calculate final sum variables

FSV: vextractf128 xmm0,ymm3,1

vaddpd xmm1,xmm0,xmm3

vhaddpd xmm3,xmm1,xmm1 ;xmm3[63:0] = sum_x

vextractf128 xmm0,ymm4,1

vaddpd xmm1,xmm0,xmm4

vhaddpd xmm4,xmm1,xmm1 ;xmm4[63:0] = sum_y

vextractf128 xmm0,ymm5,1

vaddpd xmm1,xmm0,xmm5

vhaddpd xmm5,xmm1,xmm1 ;xmm5[63:0] = sum_xx

vextractf128 xmm0,ymm6,1

vaddpd xmm1,xmm0,xmm6

vhaddpd xmm6,xmm1,xmm1 ;xmm6[63:0] = sum_yy

vextractf128 xmm0,ymm7,1

vaddpd xmm1,xmm0,xmm7

vhaddpd xmm7,xmm1,xmm1 ;xmm7[63:0] = sum_xy

; Save final sum variables

vmovsd real8 ptr [r9],xmm3 ;save sum_x

vmovsd real8 ptr [r9+8],xmm4 ;save sum_y

vmovsd real8 ptr [r9+16],xmm5 ;save sum_xx

vmovsd real8 ptr [r9+24],xmm6 ;save sum_yy

vmovsd real8 ptr [r9+32],xmm7 ;save sum_xy

; Calculate rho numerator

; rho_num = n * sum_xy - sum_x * sum_y;

vcvtsi2sd xmm2,xmm2,r10 ;xmm2 = n

vmulsd xmm0,xmm2,xmm7 ;xmm0 = n * sum_xy

vmulsd xmm1,xmm3,xmm4 ;xmm1 = sum_x * sum_y

vsubsd xmm7,xmm0,xmm1 ;xmm7 = rho_num

; Calculate rho denominator

; t1 = sqrt(n * sum_xx - sum_x * sum_x)

; t2 = sqrt(n * sum_yy - sum_y * sum_y)

; rho_den = t1 * t2

vmulsd xmm0,xmm2,xmm5 ;xmm0 = n * sum_xx

vmulsd xmm3,xmm3,xmm3 ;xmm3 = sum_x * sum_x

vsubsd xmm3,xmm0,xmm3 ;xmm3 = n * sum_xx - sum_x * sum_x

vsqrtsd xmm3,xmm3,xmm3 ;xmm3 = t1

vmulsd xmm0,xmm2,xmm6 ;xmm0 = n * sum_yy

vmulsd xmm4,xmm4,xmm4 ;xmm4 = sum_y * sum_y

vsubsd xmm4,xmm0,xmm4 ;xmm4 = n * sum_yy - sum_y * sum_y

vsqrtsd xmm4,xmm4,xmm4 ;xmm4 = t2

vmulsd xmm0,xmm3,xmm4 ;xmm0 = rho_den

; Calculate and save final rho

xor eax,eax

vcomisd xmm0,real8 ptr [rbp+CC_OffsetStackArgs] ;rho_den < epsilon?

setae al ;set return code

jb BadRho ;jump if rho_den < epsilon

vdivsd xmm1,xmm7,xmm0 ;xmm1 = rho

SavRho: mov rdx,[rbp+CC_OffsetStackArgs+8] ;rdx = ptr to rho

vmovsd real8 ptr [rdx],xmm1 ;save rho

Done: vzeroupper

_RestoreXmmRegs xmm6,xmm7

_DeleteFrame

ret

; Error handling code

BadRho: vxorpd xmm1,xmm1,xmm1 ;rho = 0

jmp SavRho

BadArg: xor eax,eax ;eax = invalid arg ret code

jmp Done

AvxCalcCorrCoef_ endp

end

Listing 9-4.

Example Ch09_04

A correlation coefficient measures the strength of association between two variables. Correlation coefficients can range in value from -1.0 to +1.0, signifying a perfect negative or positive relationship between the two variables. Real-world correlation coefficients are rarely equal to these theoretical limits. A correlation coefficient of 0.0 indicates that the data variables are not associated. The C++ and assembly language code in this example calculate the well-known Pearson correlation coefficient using the following equation:

$\rho =\frac{n\sum \limits_i{x}_i{y}_i-\sum \limits_i{x}_i\sum \limits_i{y}_i}{\sqrt{n\sum \limits_i{x}_i^2-{\left(\sum \limits_i{x}_i\right)}^2}\sqrt{n\sum \limits_i{y}_i^2-{\left(\sum \limits_i{y}_i\right)}^2}}$

In order to calculate a correlation coefficient using this formula, a function must compute the following five sum variables:

$sum\_x=\sum \limits_i{x}_i$

$sum\_y=\sum \limits_i{y}_i$

$sum\_xx=\sum \limits_i{x}_i^2$

$sum\_yy=\sum \limits_i{y}_i^2$

$sum\_xy=\sum \limits_i{x}_i{y}_i$
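As a quick check of these definitions, consider a small illustrative data set (not taken from the test data used by example Ch09_04) with n = 4, x = (1, 2, 3, 4), and y = (2, 4, 6, 8). The five sum variables are:

$sum\_x=10,\quad sum\_y=20,\quad sum\_xx=30,\quad sum\_yy=120,\quad sum\_xy=60$

Substituting these values into the correlation coefficient equation yields:

$\rho =\frac{4\cdot 60-10\cdot 20}{\sqrt{4\cdot 30-{10}^2}\sqrt{4\cdot 120-{20}^2}}=\frac{40}{\sqrt{20}\sqrt{80}}=\frac{40}{40}=1$

Since each y value is exactly twice the corresponding x value, the result is the expected perfect positive correlation of +1.0.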

The C++ function AvxCalcCorrCoefCpp shows how to calculate a correlation coefficient. This function begins by checking the value of n to make sure it’s greater than zero. It also validates the two data arrays x and y for proper alignment. The aforementioned sum variables are then calculated using a simple for loop. Following completion of the for loop, the function AvxCalcCorrCoefCpp saves the sum variables to the array sums for comparison and display purposes. It then computes the intermediate values rho_num and rho_den. Before computing the final correlation coefficient rho, rho_den is tested to confirm that it’s greater than or equal to epsilon.

Following its prolog, the assembly language function AvxCalcCorrCoef_ performs the same size and alignment checks as its C++ counterpart. It then initializes packed versions of sum_x, sum_y, sum_xx, sum_yy, and sum_xy to zero in registers YMM3–YMM7. During each iteration, the loop labeled LP1 processes four elements from arrays x and y using packed double-precision floating-point arithmetic. This means that registers YMM3–YMM7 maintain four distinct intermediate values for each sum variable. Execution of loop LP1 continues until there are fewer than four elements remaining to process.

Following completion of loop LP1, the loop labeled LP2 processes the final one to three entries (if any) in arrays x and y. The vmovsd xmm0,real8 ptr [rcx] and vmovsd xmm1,real8 ptr [rdx] instructions load x[i] and y[i] into registers XMM0 and XMM1, respectively. Note that these vmovsd instructions also zero out bits YMM0[255:64] and YMM1[255:64], which means that the same chain of vaddpd and vmulpd instructions used in loop LP1 to update the intermediate sum variables can also be used in loop LP2 (the scalar instructions vaddsd and vmulsd cannot be used here to update the sum variables without extra code since these instructions set bits 255:128 of their destination operand register to zero). Following completion of loop LP2, each packed sum variable is reduced to a single value using a vextractf128, vaddpd, and vhaddpd instruction, as illustrated in Figure 9-3. The final sum values are then saved to the sums array.

Figure 9-3.

Calculation of sum_x using vextractf128, vaddpd, and vhaddpd
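The accumulate-then-reduce pattern can be modeled in portable C++ by treating one YMM register as an array of four doubles. The sketch below is an illustration under that assumption and is not code from Listing 9-4. The function name SumWithFourLanes is hypothetical, and it computes only sum_x; the other four sum variables follow the same pattern.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper: sums n doubles using four independent lane
// accumulators, modeling one YMM register of packed partial sums.
double SumWithFourLanes(const double* x, std::size_t n)
{
    std::array<double, 4> lanes = {0.0, 0.0, 0.0, 0.0};
    std::size_t i = 0;

    // Analogue of loop LP1: each lane accumulates every fourth element
    for (; i + 4 <= n; i += 4)
        for (std::size_t k = 0; k < 4; k++)
            lanes[k] += x[i + k];

    // Analogue of loop LP2: vmovsd zeroes the upper three elements of its
    // destination, so the packed add changes lane 0 only
    for (; i < n; i++)
        lanes[0] += x[i];

    // Reduction: vextractf128 and vaddpd fold the upper 128 bits onto the
    // lower 128 bits; vhaddpd then adds the two remaining values
    const double lo = lanes[0] + lanes[2];
    const double hi = lanes[1] + lanes[3];
    return lo + hi;
}
```

Note that this technique adds the elements in a different order than a simple sequential loop, so in general the two approaches can produce results that differ by a small rounding error. The test data in example Ch09_04 consists of integer-valued doubles, which is why the C++ and assembly language sums in the results match exactly.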

Function AvxCalcCorrCoef_ uses simple scalar arithmetic to compute the intermediate values rho_num and rho_den. Like the corresponding C++ function, AvxCalcCorrCoef_ compares rho_den with epsilon (a value below epsilon is likely the result of rounding error and is considered too close to zero to be valid). If rho_den is valid, the correlation coefficient rho is calculated and saved. Here are the results for source code example Ch09_04:

Results for AvxCalcCorrCoef

Value C++ x86-AVX

------------------------------------------

rho: 0.70128193 0.70128193

sum_x: 53081.0 53081.0

sum_y: -199158.0 -199158.0

sum_xx: 35732585.0 35732585.0

sum_yy: 401708868.0 401708868.0

sum_xy: -94360528.0 -94360528.0

Matrix Multiplication and Transposition

In Chapter 6, you learned how to perform 4 × 4 matrix transposition and multiplication using single-precision floating-point values (see source code examples Ch06_07 and Ch06_08). The source code example in this section illustrates how to carry out these same matrix operations using double-precision floating-point values. Listing 9-5 shows the source code for example Ch09_05. The fundamentals of matrix transposition and multiplication are explained in Chapter 6. If your understanding of these mathematical operations is lacking, you may want to review the relevant sections in Chapter 6 before proceeding.

//------------------------------------------------

// Ch09_05.cpp

//------------------------------------------------

#include "stdafx.h"

#include <iostream>

#include <iomanip>

#include "Ch09_05.h"

#include "Matrix.h"

using namespace std;

void AvxMat4x4TransposeF64(Matrix<double>& m_src1)

{

const size_t nr = 4;

const size_t nc = 4;

Matrix<double> m_des1(nr ,nc);

Matrix<double> m_des2(nr ,nc);

Matrix<double>::Transpose(m_des1, m_src1);

AvxMat4x4TransposeF64_(m_des2.Data(), m_src1.Data());

cout << fixed << setprecision(1);

m_src1.SetOstream(12, " ");

m_des1.SetOstream(12, " ");

m_des2.SetOstream(12, " ");

cout << "Results for AvxMat4x4TransposeF64\n";

cout << "Matrix m_src1\n" << m_src1 << '\n';

cout << "Matrix m_des1\n" << m_des1 << '\n';

cout << "Matrix m_des2\n" << m_des2 << '\n';

if (m_des1 != m_des2)

cout << "\nMatrix compare failed - AvxMat4x4TransposeF64\n";

}

void AvxMat4x4MulF64(Matrix<double>& m_src1, Matrix<double>& m_src2)

{

const size_t nr = 4;

const size_t nc = 4;

Matrix<double> m_des1(nr ,nc);

Matrix<double> m_des2(nr ,nc);

Matrix<double>::Mul(m_des1, m_src1, m_src2);

AvxMat4x4MulF64_(m_des2.Data(), m_src1.Data(), m_src2.Data());

cout << fixed << setprecision(1);

m_src1.SetOstream(12, " ");

m_src2.SetOstream(12, " ");

m_des1.SetOstream(12, " ");

m_des2.SetOstream(12, " ");

cout << "\nResults for AvxMat4x4MulF64\n";

cout << "Matrix m_src1\n" << m_src1 << '\n';

cout << "Matrix m_src2\n" << m_src2 << '\n';

cout << "Matrix m_des1\n" << m_des1 << '\n';

cout << "Matrix m_des2\n" << m_des2 << '\n';

if (m_des1 != m_des2)

cout << "\nMatrix compare failed - AvxMat4x4MulF64\n";

}

int main()

{

const size_t nr = 4;

const size_t nc = 4;

Matrix<double> m_src1(nr ,nc);

Matrix<double> m_src2(nr ,nc);

const double src1_row0[] = { 10, 11, 12, 13 };

const double src1_row1[] = { 20, 21, 22, 23 };

const double src1_row2[] = { 30, 31, 32, 33 };

const double src1_row3[] = { 40, 41, 42, 43 };

const double src2_row0[] = { 100, 101, 102, 103 };

const double src2_row1[] = { 200, 201, 202, 203 };

const double src2_row2[] = { 300, 301, 302, 303 };

const double src2_row3[] = { 400, 401, 402, 403 };

m_src1.SetRow(0, src1_row0);

m_src1.SetRow(1, src1_row1);

m_src1.SetRow(2, src1_row2);

m_src1.SetRow(3, src1_row3);

m_src2.SetRow(0, src2_row0);

m_src2.SetRow(1, src2_row1);

m_src2.SetRow(2, src2_row2);

m_src2.SetRow(3, src2_row3);

// Test functions

AvxMat4x4TransposeF64(m_src1);

AvxMat4x4MulF64(m_src1, m_src2);

// Benchmark functions

AvxMat4x4TransposeF64_BM();

AvxMat4x4MulF64_BM();

return 0;

}

;-------------------------------------------------

; Ch09_05.asm

;-------------------------------------------------

include <MacrosX86-64-AVX.asmh>

; _Mat4x4TransposeF64 macro

;

; Description: This macro computes the transpose of a 4x4

; double-precision floating-point matrix.

;

; Input Matrix Output Matrix

; ---------------------------------------------------

; ymm0 a3 a2 a1 a0 ymm0 d0 c0 b0 a0

; ymm1 b3 b2 b1 b0 ymm1 d1 c1 b1 a1

; ymm2 c3 c2 c1 c0 ymm2 d2 c2 b2 a2

; ymm3 d3 d2 d1 d0 ymm3 d3 c3 b3 a3

;

_Mat4x4TransposeF64 macro

vunpcklpd ymm4,ymm0,ymm1 ;ymm4 = b2 a2 b0 a0

vunpckhpd ymm5,ymm0,ymm1 ;ymm5 = b3 a3 b1 a1

vunpcklpd ymm6,ymm2,ymm3 ;ymm6 = d2 c2 d0 c0

vunpckhpd ymm7,ymm2,ymm3 ;ymm7 = d3 c3 d1 c1

vperm2f128 ymm0,ymm4,ymm6,20h ;ymm0 = d0 c0 b0 a0

vperm2f128 ymm1,ymm5,ymm7,20h ;ymm1 = d1 c1 b1 a1

vperm2f128 ymm2,ymm4,ymm6,31h ;ymm2 = d2 c2 b2 a2

vperm2f128 ymm3,ymm5,ymm7,31h ;ymm3 = d3 c3 b3 a3

endm

; extern "C" void AvxMat4x4TransposeF64_(double* m_des, const double* m_src1)

.code

AvxMat4x4TransposeF64_ proc frame

_CreateFrame MT_,0,32

_SaveXmmRegs xmm6,xmm7

_EndProlog

; Transpose matrix m_src1

vmovaps ymm0,[rdx] ;ymm0 = m_src1.row_0

vmovaps ymm1,[rdx+32] ;ymm1 = m_src1.row_1

vmovaps ymm2,[rdx+64] ;ymm2 = m_src1.row_2

vmovaps ymm3,[rdx+96] ;ymm3 = m_src1.row_3

_Mat4x4TransposeF64

vmovaps [rcx],ymm0 ;save m_des.row_0

vmovaps [rcx+32],ymm1 ;save m_des.row_1

vmovaps [rcx+64],ymm2 ;save m_des.row_2

vmovaps [rcx+96],ymm3 ;save m_des.row_3

vzeroupper

Done: _RestoreXmmRegs xmm6,xmm7

_DeleteFrame

ret

AvxMat4x4TransposeF64_ endp

; _Mat4x4MulCalcRowF64 macro

;

; Description: This macro computes one row of a 4x4 matrix multiplication.

;

; Registers: ymm0 = m_src2.row0

; ymm1 = m_src2.row1

; ymm2 = m_src2.row2

; ymm3 = m_src2.row3

; rcx = m_des ptr

; rdx = m_src1 ptr

; ymm4 - ymm7 = scratch registers

_Mat4x4MulCalcRowF64 macro disp

vbroadcastsd ymm4,real8 ptr [rdx+disp] ;broadcast m_src1[i][0]

vbroadcastsd ymm5,real8 ptr [rdx+disp+8] ;broadcast m_src1[i][1]

vbroadcastsd ymm6,real8 ptr [rdx+disp+16] ;broadcast m_src1[i][2]

vbroadcastsd ymm7,real8 ptr [rdx+disp+24] ;broadcast m_src1[i][3]

vmulpd ymm4,ymm4,ymm0 ;m_src1[i][0] * m_src2.row_0

vmulpd ymm5,ymm5,ymm1 ;m_src1[i][1] * m_src2.row_1

vmulpd ymm6,ymm6,ymm2 ;m_src1[i][2] * m_src2.row_2

vmulpd ymm7,ymm7,ymm3 ;m_src1[i][3] * m_src2.row_3

vaddpd ymm4,ymm4,ymm5 ;calc m_des.row_i

vaddpd ymm6,ymm6,ymm7

vaddpd ymm4,ymm4,ymm6

vmovapd [rcx+disp],ymm4 ;save m_des.row_i

endm

; extern "C" void AvxMat4x4MulF64_(double* m_des, const double* m_src1, const double* m_src2)

AvxMat4x4MulF64_ proc frame

_CreateFrame MM_,0,32

_SaveXmmRegs xmm6,xmm7

_EndProlog

; Load m_src2 into YMM3:YMM0

vmovapd ymm0,[r8] ;ymm0 = m_src2.row_0

vmovapd ymm1,[r8+32] ;ymm1 = m_src2.row_1

vmovapd ymm2,[r8+64] ;ymm2 = m_src2.row_2

vmovapd ymm3,[r8+96] ;ymm3 = m_src2.row_3

; Compute matrix product

_Mat4x4MulCalcRowF64 0 ;calculate m_des.row_0

_Mat4x4MulCalcRowF64 32 ;calculate m_des.row_1

_Mat4x4MulCalcRowF64 64 ;calculate m_des.row_2

_Mat4x4MulCalcRowF64 96 ;calculate m_des.row_3

vzeroupper

Done: _RestoreXmmRegs xmm6,xmm7

_DeleteFrame

ret

AvxMat4x4MulF64_ endp

end

Listing 9-5.

Example Ch09_05

The C++ source code that’s shown in Listing 9-5 is very similar to what you saw in Chapter 6. It begins with a function named AvxMat4x4TransposeF64 that exercises both the C++ and assembly language matrix transposition calculating routines and displays the results. The function that follows, AvxMat4x4MulF64, implements the same tasks for matrix multiplication. Similar to the source code examples in Chapter 6, the C++ versions of matrix transposition and multiplication are implemented by the template functions Matrix<>::Transpose and Matrix<>::Mul, respectively. Chapter 6 contains additional details regarding these template functions.

Near the top of the assembly language code is a macro named _Mat4x4TransposeF64. This macro contains instructions that transpose a 4 × 4 matrix of double-precision floating-point values. The four rows of the source matrix must be loaded into registers YMM0–YMM3 before the macro is used. Macro _Mat4x4TransposeF64 uses the vperm2f128 instruction to permute the 128-bit wide floating-point fields of its two source operands. This instruction uses an immediate 8-bit control mask to select which fields are copied from the source operands to the destination operand, as outlined in Table 9-1. Figure 9-4 shows the entire 4 × 4 matrix transposition operation in greater detail. The assembly language function AvxMat4x4TransposeF64_ uses the macro _Mat4x4TransposeF64 to transpose a 4 × 4 matrix of double-precision floating-point values.

Table 9-1.

Field Selection for vperm2f128 ymm0,ymm1,ymm2,imm8 Instruction

Destination Field	Source Field	imm8[1:0]	imm8[5:4]
ymm0[127:0]	ymm1[127:0]	0
	ymm1[255:128]	1
	ymm2[127:0]	2
	ymm2[255:128]	3
ymm0[255:128]	ymm1[127:0]		0
	ymm1[255:128]		1
	ymm2[127:0]		2
	ymm2[255:128]		3

Figure 9-4.

Instruction sequence used by _Mat4x4TransposeF64 to transpose a 4 × 4 matrix of double-precision floating-point values
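To make the data movement easier to follow, the unpack and permute steps can be modeled in portable C++ with a four-element array standing in for each YMM register (index 0 corresponds to bits 63:0). The sketch below is an illustration and not part of Listing 9-5. The names Vec4, UnpackLo, UnpackHi, Perm2f128, and Transpose4x4 are hypothetical, and the model of vperm2f128 ignores the zeroing bits imm8[3] and imm8[7], which the macro does not use. In this model, imm8[1:0] selects the source of the low 128 bits and imm8[5:4] selects the source of the high 128 bits, which is how the immediate values 20h and 31h in the macro are decoded.

```cpp
#include <array>

using Vec4 = std::array<double, 4>;   // stands in for one YMM register

// Model of vunpcklpd: interleave the low element of each 128-bit lane
Vec4 UnpackLo(const Vec4& a, const Vec4& b) { return {a[0], b[0], a[2], b[2]}; }

// Model of vunpckhpd: interleave the high element of each 128-bit lane
Vec4 UnpackHi(const Vec4& a, const Vec4& b) { return {a[1], b[1], a[3], b[3]}; }

// Model of vperm2f128 (zeroing bits imm8[3] and imm8[7] are not modeled).
// Selector values: 0 = a low half, 1 = a high half, 2 = b low half, 3 = b high half
Vec4 Perm2f128(const Vec4& a, const Vec4& b, unsigned imm8)
{
    auto half = [&](unsigned sel, unsigned k) {
        const Vec4& src = (sel & 2) ? b : a;
        return src[(sel & 1) * 2 + k];
    };

    const unsigned lo = imm8 & 3;          // imm8[1:0]
    const unsigned hi = (imm8 >> 4) & 3;   // imm8[5:4]
    return {half(lo, 0), half(lo, 1), half(hi, 0), half(hi, 1)};
}

// Same operation order as the macro _Mat4x4TransposeF64
void Transpose4x4(std::array<Vec4, 4>& m)
{
    const Vec4 t4 = UnpackLo(m[0], m[1]);
    const Vec4 t5 = UnpackHi(m[0], m[1]);
    const Vec4 t6 = UnpackLo(m[2], m[3]);
    const Vec4 t7 = UnpackHi(m[2], m[3]);

    m[0] = Perm2f128(t4, t6, 0x20);
    m[1] = Perm2f128(t5, t7, 0x20);
    m[2] = Perm2f128(t4, t6, 0x31);
    m[3] = Perm2f128(t5, t7, 0x31);
}
```

With control value 20h, the destination receives the low halves of both source operands; with 31h, it receives the high halves. Applying Transpose4x4 to the matrix m_src1 used in example Ch09_05 turns its first row 10, 11, 12, 13 into the first column of the result.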

In Listing 9-5, the macro definition _Mat4x4MulCalcRowF64 follows the function AvxMat4x4TransposeF64_. This macro contains instructions that calculate a single row of a 4 × 4 matrix multiplication. The row-multiplication technique that’s used here is identical to the one that was used in source code example Ch06_08 in Chapter 6 (see Figure 6-7). The function AvxMat4x4MulF64_ uses the macro _Mat4x4MulCalcRowF64 to multiply two 4 × 4 double-precision floating-point matrices. Here are the results for source code example Ch09_05:

Results for AvxMat4x4TransposeF64

Matrix m_src1

10.0 11.0 12.0 13.0

20.0 21.0 22.0 23.0

30.0 31.0 32.0 33.0

40.0 41.0 42.0 43.0

Matrix m_des1

10.0 20.0 30.0 40.0

11.0 21.0 31.0 41.0

12.0 22.0 32.0 42.0

13.0 23.0 33.0 43.0

Matrix m_des2

10.0 20.0 30.0 40.0

11.0 21.0 31.0 41.0

12.0 22.0 32.0 42.0

13.0 23.0 33.0 43.0

Results for AvxMat4x4MulF64

Matrix m_src1

10.0 11.0 12.0 13.0

20.0 21.0 22.0 23.0

30.0 31.0 32.0 33.0

40.0 41.0 42.0 43.0

Matrix m_src2

100.0 101.0 102.0 103.0

200.0 201.0 202.0 203.0

300.0 301.0 302.0 303.0

400.0 401.0 402.0 403.0

Matrix m_des1

12000.0 12046.0 12092.0 12138.0

22000.0 22086.0 22172.0 22258.0

32000.0 32126.0 32252.0 32378.0

42000.0 42166.0 42332.0 42498.0

Matrix m_des2

12000.0 12046.0 12092.0 12138.0

22000.0 22086.0 22172.0 22258.0

32000.0 32126.0 32252.0 32378.0

42000.0 42166.0 42332.0 42498.0

Running benchmark function AvxMat4x4TransposeF64_BM - please wait

Benchmark times save to file Ch09_05_AvxMat4x4TransposeF64_BM_CHROMIUM.csv

Running benchmark function AvxMat4x4MulF64_BM - please wait

Benchmark times save to file Ch09_05_AvxMat4x4MulF64_BM_CHROMIUM.csv

Tables 9-2 and 9-3 contain benchmark timing measurements for the matrix transposition and multiplication functions presented in this section. These measurements were made using the procedure that’s described in Chapter 6.

Table 9-2.

Matrix Transposition Mean Execution Times (Microseconds), 1,000,000 Transpositions

CPU	C++	Assembly Language
i7-4790S	15562	2670
i9-7900X	13167	2112
i7-8700K	12194	1963

Table 9-3.

Matrix Multiplication Mean Execution Times (Microseconds), 1,000,000 Multiplications

CPU	C++	Assembly Language
i7-4790S	55652	5874
i9-7900X	46910	5286
i7-8700K	43118	4505
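The row-oriented multiplication technique used by the macro _Mat4x4MulCalcRowF64 can also be sketched in scalar C++. The code below is an illustration and is not taken from Listing 9-5 or from the Matrix<> template class. The function name Mat4x4MulRowBroadcast is hypothetical, and the matrices are assumed to be row-major arrays of 16 doubles. Each element of row i in m_src1 plays the role of a vbroadcastsd operand that scales an entire row of m_src2, and the four scaled rows are summed to form row i of the product.

```cpp
#include <cstddef>

// Hypothetical scalar model of the macro _Mat4x4MulCalcRowF64.
// All matrices are row-major arrays of 16 doubles.
void Mat4x4MulRowBroadcast(double* m_des, const double* m_src1, const double* m_src2)
{
    for (std::size_t i = 0; i < 4; i++)
    {
        double row[4] = {0.0, 0.0, 0.0, 0.0};

        for (std::size_t k = 0; k < 4; k++)
        {
            // "Broadcast" m_src1[i][k], then scale row k of m_src2 by it
            const double s = m_src1[i * 4 + k];

            for (std::size_t j = 0; j < 4; j++)
                row[j] += s * m_src2[k * 4 + j];
        }

        // Save row i of the product
        for (std::size_t j = 0; j < 4; j++)
            m_des[i * 4 + j] = row[j];
    }
}
```

This sketch adds the four scaled rows one after another, whereas the macro adds them in pairs. For the integer-valued test matrices used in example Ch09_05, both orders produce the same values.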

Matrix Inversion

Besides transposition and multiplication, matrix inversion is another common operation that’s often applied to 4 × 4 matrices. In this section, you examine a program that calculates the inverse of a 4 × 4 matrix of double-precision floating-point values. Listing 9-6 shows the source code for example Ch09_06.

//------------------------------------------------

// Ch09_06.cpp

//------------------------------------------------

#include "stdafx.h"

#include <cmath>

#include "Ch09_06.h"

#include "Matrix.h"

using namespace std;

bool Avx2Mat4x4InvF64Cpp(Matrix<double>& m_inv, const Matrix<double>& m, double epsilon, bool* is_singular)

{

// The intermediate matrices below are declared static for benchmarking purposes.

static const size_t nrows = 4;

static const size_t ncols = 4;

static Matrix<double> m2(nrows, ncols);

static Matrix<double> m3(nrows, ncols);

static Matrix<double> m4(nrows, ncols);

static Matrix<double> I(nrows, ncols, true);

static Matrix<double> tempA(nrows, ncols);

static Matrix<double> tempB(nrows, ncols);

static Matrix<double> tempC(nrows, ncols);

static Matrix<double> tempD(nrows, ncols);

Matrix<double>::Mul(m2, m, m);

Matrix<double>::Mul(m3, m2, m);

Matrix<double>::Mul(m4, m3, m);

double t1 = m.Trace();

double t2 = m2.Trace();

double t3 = m3.Trace();

double t4 = m4.Trace();

double c1 = -t1;

double c2 = -1.0 / 2.0 * (c1 * t1 + t2);

double c3 = -1.0 / 3.0 * (c2 * t1 + c1 * t2 + t3);

double c4 = -1.0 / 4.0 * (c3 * t1 + c2 * t2 + c1 * t3 + t4);

// Make sure matrix is not singular

*is_singular = (fabs(c4) < epsilon);

if (*is_singular)

return false;

// Calculate m_inv = -1.0 / c4 * (m3 + c1 * m2 + c2 * m + c3 * I)

Matrix<double>::MulScalar(tempA, I, c3);

Matrix<double>::MulScalar(tempB, m, c2);

Matrix<double>::MulScalar(tempC, m2, c1);

Matrix<double>::Add(tempD, tempA, tempB);

Matrix<double>::Add(tempD, tempD, tempC);

Matrix<double>::Add(tempD, tempD, m3);

Matrix<double>::MulScalar(m_inv, tempD, -1.0 / c4);

return true;

}

void Avx2Mat4x4InvF64(const Matrix<double>& m, const char* msg)

{

cout << '\n' << msg << " - Test Matrix\n";

cout << m << '\n';

const double epsilon = 1.0e-9;

const size_t nrows = m.GetNumRows();

const size_t ncols = m.GetNumCols();

Matrix<double> m_inv_a(nrows, ncols);

Matrix<double> m_ver_a(nrows, ncols);

Matrix<double> m_inv_b(nrows, ncols);

Matrix<double> m_ver_b(nrows, ncols);

for (int i = 0; i <= 1; i++)

{

string fn;

const size_t nrows = m.GetNumRows();

const size_t ncols = m.GetNumCols();

Matrix<double> m_inv(nrows, ncols);

Matrix<double> m_ver(nrows, ncols);

bool rc, is_singular;

if (i == 0)

{

fn = "Avx2Mat4x4InvF64Cpp";

rc = Avx2Mat4x4InvF64Cpp(m_inv, m, epsilon, &is_singular);

if (rc)

Matrix<double>::Mul(m_ver, m_inv, m);

}

else

{

fn = "Avx2Mat4x4InvF64_";

rc = Avx2Mat4x4InvF64_(m_inv.Data(), m.Data(), epsilon, &is_singular);

if (rc)

Avx2Mat4x4MulF64_(m_ver.Data(), m_inv.Data(), m.Data());

}

if (rc)

{

cout << msg << " - " << fn << " - Inverse Matrix\n";

cout << m_inv << '\n';

// Round to zero used for display purposes, can be removed.

cout << msg << " - " << fn << " - Verify Matrix\n";

m_ver.RoundToZero(epsilon);

cout << m_ver << '\n';

}

else

{

if (is_singular)

cout << msg << " - " << fn << " - Singular Matrix\n";

else

cout << msg << " - " << fn << " - Unexpected error occurred\n";

}

}

}

int main()

{

cout << "\nResults for Avx2Mat4x4InvF64\n";

// Test Matrix #1 - Non-Singular

Matrix<double> m1(4, 4);

const double m1_row0[] = { 2, 7, 3, 4 };

const double m1_row1[] = { 5, 9, 6, 4.75 };

const double m1_row2[] = { 6.5, 3, 4, 10 };

const double m1_row3[] = { 7, 5.25, 8.125, 6 };

m1.SetRow(0, m1_row0);

m1.SetRow(1, m1_row1);

m1.SetRow(2, m1_row2);

m1.SetRow(3, m1_row3);

// Test Matrix #2 - Non-Singular

Matrix<double> m2(4, 4);

const double m2_row0[] = { 0.5, 12, 17.25, 4 };

const double m2_row1[] = { 5, 2, 6.75, 8 };

const double m2_row2[] = { 13.125, 1, 3, 9.75 };

const double m2_row3[] = { 16, 1.625, 7, 0.25 };

m2.SetRow(0, m2_row0);

m2.SetRow(1, m2_row1);

m2.SetRow(2, m2_row2);

m2.SetRow(3, m2_row3);

// Test Matrix #3 - Singular

Matrix<double> m3(4, 4);

const double m3_row0[] = { 2, 0, 0, 1 };

const double m3_row1[] = { 0, 4, 5, 0 };

const double m3_row2[] = { 0, 0, 0, 7 };

const double m3_row3[] = { 0, 0, 0, 6 };

m3.SetRow(0, m3_row0);

m3.SetRow(1, m3_row1);

m3.SetRow(2, m3_row2);

m3.SetRow(3, m3_row3);

Avx2Mat4x4InvF64(m1, "Test #1");

Avx2Mat4x4InvF64(m2, "Test #2");

Avx2Mat4x4InvF64(m3, "Test #3");

Avx2Mat4x4InvF64_BM(m1);

return 0;

}

;-------------------------------------------------

; Ch09_06.asm

;-------------------------------------------------

include <MacrosX86-64-AVX.asmh>

; Custom segment for constants

ConstVals segment readonly align(32) 'const'

Mat4x4I real8 1.0, 0.0, 0.0, 0.0

real8 0.0, 1.0, 0.0, 0.0

real8 0.0, 0.0, 1.0, 0.0

real8 0.0, 0.0, 0.0, 1.0

r8_SignBitMask qword 4 dup (8000000000000000h)

r8_AbsMask qword 4 dup (7fffffffffffffffh)

r8_1p0 real8 1.0

r8_N1p0 real8 -1.0

r8_N0p5 real8 -0.5

r8_N0p3333 real8 -0.33333333333333

r8_N0p25 real8 -0.25

ConstVals ends

.code

; _Max4x4TraceF64 macro

;

; Description: This macro contains instructions that compute the trace

; of the 4x4 double-precision floating-point matrix in ymm3:ymm0.

_Max4x4TraceF64 macro

vblendpd ymm0,ymm0,ymm1,00000010b ;ymm0[127:0] = row 1,0 diag vals

vblendpd ymm1,ymm2,ymm3,00001000b ;ymm1[255:128] = row 3,2 diag vals

vperm2f128 ymm2,ymm1,ymm1,00000001b ;ymm2[127:0] = row 3,2 diag vals

vaddpd ymm3,ymm0,ymm2

vhaddpd ymm0,ymm3,ymm3 ;xmm0[63:0] = trace

endm

; extern "C" double Avx2Mat4x4TraceF64_(const double* m_src1)

;

; Description: The following function computes the trace of a

; 4x4 double-precision floating-point array.

Avx2Mat4x4TraceF64_ proc

vmovapd ymm0,[rcx] ;ymm0 = m_src1.row_0

vmovapd ymm1,[rcx+32] ;ymm1 = m_src1.row_1

vmovapd ymm2,[rcx+64] ;ymm2 = m_src1.row_2

vmovapd ymm3,[rcx+96] ;ymm3 = m_src1.row_3

_Max4x4TraceF64 ;xmm0[63:0] = m_src1.trace()

vzeroupper

ret

Avx2Mat4x4TraceF64_ endp

; _Mat4x4MulCalcRowF64 macro

;

; Description: This macro is used to compute one row of a 4x4 matrix

; multiply.

;

; Registers: ymm0 = m_src2.row0

; ymm1 = m_src2.row1

; ymm2 = m_src2.row2

; ymm3 = m_src2.row3

; ymm4 - ymm7 = scratch registers

_Mat4x4MulCalcRowF64 macro dreg,sreg,disp

vbroadcastsd ymm4,real8 ptr [sreg+disp] ;broadcast m_src1[i][0]

vbroadcastsd ymm5,real8 ptr [sreg+disp+8] ;broadcast m_src1[i][1]

vbroadcastsd ymm6,real8 ptr [sreg+disp+16] ;broadcast m_src1[i][2]

vbroadcastsd ymm7,real8 ptr [sreg+disp+24] ;broadcast m_src1[i][3]

vmulpd ymm4,ymm4,ymm0 ;m_src1[i][0] * m_src2.row_0

vmulpd ymm5,ymm5,ymm1 ;m_src1[i][1] * m_src2.row_1

vmulpd ymm6,ymm6,ymm2 ;m_src1[i][2] * m_src2.row_2

vmulpd ymm7,ymm7,ymm3 ;m_src1[i][3] * m_src2.row_3

vaddpd ymm4,ymm4,ymm5 ;calc m_des.row_i

vaddpd ymm6,ymm6,ymm7

vaddpd ymm4,ymm4,ymm6

vmovapd [dreg+disp],ymm4 ;save m_des.row_i

endm

; extern "C" void Avx2Mat4x4MulF64_(double* m_des, const double* m_src1, const double* m_src2)

Avx2Mat4x4MulF64_ proc frame

_CreateFrame MM_,0,32

_SaveXmmRegs xmm6,xmm7

_EndProlog

vmovapd ymm0,[r8] ;ymm0 = m_src2.row_0

vmovapd ymm1,[r8+32] ;ymm1 = m_src2.row_1

vmovapd ymm2,[r8+64] ;ymm2 = m_src2.row_2

vmovapd ymm3,[r8+96] ;ymm3 = m_src2.row_3

_Mat4x4MulCalcRowF64 rcx,rdx,0 ;calculate m_des.row_0

_Mat4x4MulCalcRowF64 rcx,rdx,32 ;calculate m_des.row_1

_Mat4x4MulCalcRowF64 rcx,rdx,64 ;calculate m_des.row_2

_Mat4x4MulCalcRowF64 rcx,rdx,96 ;calculate m_des.row_3

vzeroupper

_RestoreXmmRegs xmm6,xmm7

_DeleteFrame

ret

Avx2Mat4x4MulF64_ endp

; extern "C" bool Avx2Mat4x4InvF64_(double* m_inv, const double* m, double epsilon, bool* is_singular);

; Offsets of intermediate matrices on stack relative to rsp

OffsetM2 equ 32

OffsetM3 equ 160

OffsetM4 equ 288

Avx2Mat4x4InvF64_ proc frame

_CreateFrame MI_,0,160

_SaveXmmRegs xmm6,xmm7,xmm8,xmm9,xmm10,xmm11,xmm12,xmm13,xmm14,xmm15

_EndProlog

; Save args to home area for later use

mov qword ptr [rbp+MI_OffsetHomeRCX],rcx ;save m_inv ptr

mov qword ptr [rbp+MI_OffsetHomeRDX],rdx ;save m ptr

vmovsd real8 ptr [rbp+MI_OffsetHomeR8],xmm2 ;save epsilon

mov qword ptr [rbp+MI_OffsetHomeR9],r9 ;save is_singular ptr

; Allocate 384 bytes of stack space for temp matrices + 32 bytes for function calls

and rsp,0ffffffe0h ;align rsp to 32-byte boundary

sub rsp,416 ;alloc stack space

; Calculate m2

lea rcx,[rsp+OffsetM2] ;rcx = m2 ptr

mov r8,rdx ;rdx, r8 = m ptr

call Avx2Mat4x4MulF64_ ;calculate and save m2

; Calculate m3

lea rcx,[rsp+OffsetM3] ;rcx = m3 ptr

lea rdx,[rsp+OffsetM2] ;rdx = m2 ptr

mov r8,[rbp+MI_OffsetHomeRDX] ;r8 = m

call Avx2Mat4x4MulF64_ ;calculate and save m3

; Calculate m4

lea rcx,[rsp+OffsetM4] ;rcx = m4 ptr

lea rdx,[rsp+OffsetM3] ;rdx = m3 ptr

mov r8,[rbp+MI_OffsetHomeRDX] ;r8 = m

call Avx2Mat4x4MulF64_ ;calculate and save m4

; Calculate trace of m, m2, m3, and m4

mov rcx,[rbp+MI_OffsetHomeRDX]

call Avx2Mat4x4TraceF64_

vmovsd xmm8,xmm8,xmm0 ;xmm8 = t1

lea rcx,[rsp+OffsetM2]

call Avx2Mat4x4TraceF64_

vmovsd xmm9,xmm9,xmm0 ;xmm9 = t2

lea rcx,[rsp+OffsetM3]

call Avx2Mat4x4TraceF64_

vmovsd xmm10,xmm10,xmm0 ;xmm10 = t3

lea rcx,[rsp+OffsetM4]

call Avx2Mat4x4TraceF64_

vmovsd xmm11,xmm11,xmm0 ;xmm11 = t4

; Calculate the required coefficients

; c1 = -t1;

; c2 = -1.0 / 2.0 * (c1 * t1 + t2);

; c3 = -1.0 / 3.0 * (c2 * t1 + c1 * t2 + t3);

; c4 = -1.0 / 4.0 * (c3 * t1 + c2 * t2 + c1 * t3 + t4);

;

; Registers used:

; t1-t4 = xmm8-xmm11

; c1-c4 = xmm12-xmm15

vxorpd xmm12,xmm8,real8 ptr [r8_SignBitMask] ;xmm12 = c1

vmulsd xmm13,xmm12,xmm8 ;c1 * t1

vaddsd xmm13,xmm13,xmm9 ;c1 * t1 + t2

vmulsd xmm13,xmm13,[r8_N0p5] ;c2

vmulsd xmm14,xmm13,xmm8 ;c2 * t1

vmulsd xmm0,xmm12,xmm9 ;c1 * t2

vaddsd xmm14,xmm14,xmm0 ;c2 * t1 + c1 * t2

vaddsd xmm14,xmm14,xmm10 ;c2 * t1 + c1 * t2 + t3

vmulsd xmm14,xmm14,[r8_N0p3333] ;c3

vmulsd xmm15,xmm14,xmm8 ;c3 * t1

vmulsd xmm0,xmm13,xmm9 ;c2 * t2

vmulsd xmm1,xmm12,xmm10 ;c1 * t3

vaddsd xmm2,xmm0,xmm1 ;c2 * t2 + c1 * t3

vaddsd xmm15,xmm15,xmm2 ;c3 * t1 + c2 * t2 + c1 * t3

vaddsd xmm15,xmm15,xmm11 ;c3 * t1 + c2 * t2 + c1 * t3 + t4

vmulsd xmm15,xmm15,[r8_N0p25] ;c4

; Make sure matrix is not singular

vandpd xmm0,xmm15,[r8_AbsMask] ;compute fabs(c4)

vmovsd xmm1,real8 ptr [rbp+MI_OffsetHomeR8]

vcomisd xmm0,real8 ptr [rbp+MI_OffsetHomeR8] ;compare against epsilon

setp al ;al = 1 if unordered

setb ah ;ah = 1 if fabs(c4) < epsilon

or al,ah ;al = is_singular

mov rcx,[rbp+MI_OffsetHomeR9] ;rcx = is_singular ptr

mov [rcx],al ;save is_singular state

jnz Error ;jump if singular

; Calculate m_inv = -1.0 / c4 * (m3 + c1 * m2 + c2 * m1 + c3 * I)

vbroadcastsd ymm14,xmm14 ;ymm14 = packed c3

lea rcx,[Mat4x4I] ;rcx = I ptr

vmulpd ymm0,ymm14,ymmword ptr [rcx]

vmulpd ymm1,ymm14,ymmword ptr [rcx+32]

vmulpd ymm2,ymm14,ymmword ptr [rcx+64]

vmulpd ymm3,ymm14,ymmword ptr [rcx+96] ;c3 * I

vbroadcastsd ymm13,xmm13 ;ymm13 = packed c2

mov rcx,[rbp+MI_OffsetHomeRDX] ;rcx = m ptr

vmulpd ymm4,ymm13,ymmword ptr [rcx]

vmulpd ymm5,ymm13,ymmword ptr [rcx+32]

vmulpd ymm6,ymm13,ymmword ptr [rcx+64]

vmulpd ymm7,ymm13,ymmword ptr [rcx+96] ;c2 * m1

vaddpd ymm0,ymm0,ymm4

vaddpd ymm1,ymm1,ymm5

vaddpd ymm2,ymm2,ymm6

vaddpd ymm3,ymm3,ymm7 ;c2 * m1 + c3 * I

vbroadcastsd ymm12,xmm12 ;ymm12 = packed c1

lea rcx,[rsp+OffsetM2] ;rcx = m2 ptr

vmulpd ymm4,ymm12,ymmword ptr [rcx]

vmulpd ymm5,ymm12,ymmword ptr [rcx+32]

vmulpd ymm6,ymm12,ymmword ptr [rcx+64]

vmulpd ymm7,ymm12,ymmword ptr [rcx+96] ;c1 * m2

vaddpd ymm0,ymm0,ymm4

vaddpd ymm1,ymm1,ymm5

vaddpd ymm2,ymm2,ymm6

vaddpd ymm3,ymm3,ymm7 ;c1 * m2 + c2 * m1 + c3 * I

lea rcx,[rsp+OffsetM3] ;rcx = m3 ptr

vaddpd ymm0,ymm0,ymmword ptr [rcx]

vaddpd ymm1,ymm1,ymmword ptr [rcx+32]

vaddpd ymm2,ymm2,ymmword ptr [rcx+64]

vaddpd ymm3,ymm3,ymmword ptr [rcx+96] ;m3 + c1 * m2 + c2 * m1 + c3 * I

vmovsd xmm4,[r8_N1p0]

vdivsd xmm4,xmm4,xmm15 ;xmm4 = -1.0 / c4

vbroadcastsd ymm4,xmm4

vmulpd ymm0,ymm0,ymm4

vmulpd ymm1,ymm1,ymm4

vmulpd ymm2,ymm2,ymm4

vmulpd ymm3,ymm3,ymm4 ;ymm3:ymm0 = m_inv

; Save m_inv

mov rcx,[rbp+MI_OffsetHomeRCX]

vmovapd ymmword ptr [rcx],ymm0

vmovapd ymmword ptr [rcx+32],ymm1

vmovapd ymmword ptr [rcx+64],ymm2

vmovapd ymmword ptr [rcx+96],ymm3

mov eax,1 ;set success return code

Done: vzeroupper

_RestoreXmmRegs xmm6,xmm7,xmm8,xmm9,xmm10,xmm11,xmm12,xmm13,xmm14,xmm15

_DeleteFrame

ret

Error: xor eax,eax

jmp Done

Avx2Mat4x4InvF64_ endp

end

Listing 9-6.

Example Ch09_06

The multiplicative inverse of a matrix is defined as follows: Let A and X represent n × n matrices. Matrix X is an inverse of A if AX = XA = I, where I denotes an n × n identity matrix (i.e., a matrix of all zeros except for the diagonal elements, which are equal to one). Figure 9-5 shows an example of an inverse matrix. It is important to note that inverses do not exist for all n × n matrices. A matrix without an inverse is called a singular matrix.

Figure 9-5. Matrix A and its multiplicative inverse Matrix X

The inverse of a 4 × 4 matrix can be calculated using a variety of mathematical techniques. Source code example Ch09_06 uses a computational method based on the Cayley-Hamilton theorem, which employs common matrix operations that are relatively easy to carry out using SIMD arithmetic. Here are the required equations:

$\mathbf{A}^1=\mathbf{A};\ \mathbf{A}^2=\mathbf{A}\mathbf{A};\ \mathbf{A}^3=\mathbf{A}\mathbf{A}\mathbf{A};\ \mathbf{A}^4=\mathbf{A}\mathbf{A}\mathbf{A}\mathbf{A}$

$\mathrm{trace}\left(\mathbf{A}\right)=\sum\limits_i a_{ii}$

$t_n=\mathrm{trace}\left(\mathbf{A}^n\right)$

$c_1=-t_1$

$c_2=-\frac{1}{2}\left(c_1 t_1+t_2\right)$

$c_3=-\frac{1}{3}\left(c_2 t_1+c_1 t_2+t_3\right)$

$c_4=-\frac{1}{4}\left(c_3 t_1+c_2 t_2+c_1 t_3+t_4\right)$

$\mathbf{A}^{-1}=-\frac{1}{c_4}\left(\mathbf{A}^3+c_1\mathbf{A}^2+c_2\mathbf{A}+c_3\mathbf{I}\right)$

Toward the top of the C++ code is a function named Avx2Mat4x4InvF64Cpp. This function calculates the inverse of a 4 × 4 matrix of double-precision floating-point values using the aforementioned equations. Function Avx2Mat4x4InvF64Cpp uses the C++ class Matrix<> to perform many of the required intermediate computations, including matrix addition, multiplication, and trace. The source code for class Matrix<> is not shown but is included with the chapter download package. Note that the intermediate matrices are declared using the static qualifier in order to avoid constructor overhead when performing benchmark timing measurements. The drawback of using the static qualifier here is that the function is not thread-safe (a thread-safe function can be used simultaneously by multiple threads). Following calculation of the trace values t1–t4, Avx2Mat4x4InvF64Cpp computes c1–c4 using simple scalar arithmetic. It then checks that the source matrix m is not singular by comparing the absolute value of c4 against epsilon. If matrix m is not singular, the final inverse is calculated. The remaining C++ code performs test case initialization and exercises both the C++ and assembly language matrix inversion functions.
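Because the source code for class Matrix<> is not shown, the following minimal sketch restates the same calculation using plain arrays. The type Mat4 and the functions Mul, Trace, and InvertCayleyHamilton are hypothetical names introduced only for this illustration; they are not part of the chapter download package, and the sketch makes no attempt to match the performance of the book's code.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper type: a 4x4 matrix stored in row-major order.
using Mat4 = std::array<double, 16>;

// Returns the matrix product a * b.
Mat4 Mul(const Mat4& a, const Mat4& b)
{
    Mat4 c{};
    for (std::size_t i = 0; i < 4; i++)
        for (std::size_t j = 0; j < 4; j++)
            for (std::size_t k = 0; k < 4; k++)
                c[i * 4 + j] += a[i * 4 + k] * b[k * 4 + j];
    return c;
}

// Returns the sum of the diagonal elements.
double Trace(const Mat4& a)
{
    return a[0] + a[5] + a[10] + a[15];
}

// Returns false if fabs(c4) < epsilon (matrix treated as singular).
bool InvertCayleyHamilton(Mat4& inv, const Mat4& m, double epsilon)
{
    assert(epsilon >= 0.0);

    const Mat4 m2 = Mul(m, m);
    const Mat4 m3 = Mul(m2, m);
    const Mat4 m4 = Mul(m3, m);

    const double t1 = Trace(m), t2 = Trace(m2), t3 = Trace(m3), t4 = Trace(m4);
    const double c1 = -t1;
    const double c2 = -1.0 / 2.0 * (c1 * t1 + t2);
    const double c3 = -1.0 / 3.0 * (c2 * t1 + c1 * t2 + t3);
    const double c4 = -1.0 / 4.0 * (c3 * t1 + c2 * t2 + c1 * t3 + t4);

    if (std::fabs(c4) < epsilon)
        return false;

    for (std::size_t i = 0; i < 16; i++)
    {
        // Diagonal elements sit at indices 0, 5, 10, and 15.
        const double identity = (i % 5 == 0) ? 1.0 : 0.0;
        inv[i] = -1.0 / c4 * (m3[i] + c1 * m2[i] + c2 * m[i] + c3 * identity);
    }
    return true;
}
```

A caller can verify a result the same way the test code in Listing 9-6 does, by multiplying the computed inverse by the source matrix and comparing the product against the identity matrix.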

The assembly language code in Listing 9-6 begins with a custom segment that contains definitions of the constant values needed by the assembly language matrix inversion functions. The statement ConstVals segment readonly align(32) 'const' marks the start of a segment that begins on a 32-byte boundary and contains read-only data. The reason for using a custom segment here is that the MASM align directive does not support aligning data items on a 32-byte boundary. In this example, proper alignment of the packed constants is essential in order to maximize performance. Note that the scalar double-precision floating-point constants are defined after the 256-bit wide packed constants and are aligned on an 8-byte boundary. The MASM statement ConstVals ends terminates the custom segment.
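For readers who think more readily in C++, the alignment requirement described above can be sketched with the alignas specifier. The names kMat4x4I, IsAligned32, and Row are hypothetical and exist only for this illustration; the sketch assumes an 8-byte double, so each matrix row occupies 32 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical C++ counterpart of the Mat4x4I constant: a 4x4 identity matrix
// whose storage starts on a 32-byte boundary.
alignas(32) static const double kMat4x4I[16] = {
    1.0, 0.0, 0.0, 0.0,
    0.0, 1.0, 0.0, 0.0,
    0.0, 0.0, 1.0, 0.0,
    0.0, 0.0, 0.0, 1.0
};

// Returns true if p lies on a 32-byte boundary.
bool IsAligned32(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) % 32) == 0;
}

// Returns a pointer to row i; each row is 4 * sizeof(double) = 32 bytes.
const double* Row(std::size_t i)
{
    assert(i < 4);
    return kMat4x4I + 4 * i;
}
```

Since the first row is aligned and every row is exactly 32 bytes long, each of the four rows begins on a 32-byte boundary, which is the property the packed constants in the custom segment rely on.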

Following the custom constant segment is the macro _Max4x4TraceF64. This macro contains instructions that calculate the trace of a 4 × 4 matrix of double-precision floating-point values. Macro _Max4x4TraceF64 requires the four rows of the source matrix to be loaded in registers YMM0–YMM3 and uses the vblendpd, vperm2f128, and vhaddpd instructions to calculate the matrix trace, as shown in Figure 9-6. The vblendpd (Blend Packed Double-Precision Floating-Point Values) instruction merges values from its two source operands according to an immediate control mask. If bit 0 of the control mask equals 0, element 0 (i.e., bits 63:0) from the first source operand is copied to the corresponding element position in the destination operand; otherwise, element 0 from the second source operand is copied to the destination operand. Bits 1–3 of the control mask are used in a similar manner for the other three elements. Register XMM0[63:0] contains the trace value following execution of the vhaddpd instruction.

Figure 9-6. Trace calculation for a 4 × 4 matrix
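The data movement performed by the trace macro can be easier to follow in a scalar model. The sketch below mimics each instruction of the macro with ordinary C++; the type Ymm and the functions BlendPd, SwapLanes, HaddPd, and Trace4x4 are hypothetical names used only for this illustration and do not appear in the book's source code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Models one YMM register holding four doubles (element 0 = bits 63:0).
using Ymm = std::array<double, 4>;

// Scalar model of vblendpd: bit i of mask selects b[i] (1) or a[i] (0).
Ymm BlendPd(const Ymm& a, const Ymm& b, unsigned mask)
{
    assert(mask < 16);
    Ymm r{};
    for (std::size_t i = 0; i < 4; i++)
        r[i] = ((mask >> i) & 1u) ? b[i] : a[i];
    return r;
}

// Scalar model of vperm2f128 with control 00000001b and identical sources:
// the two 128-bit halves trade places.
Ymm SwapLanes(const Ymm& a)
{
    return { a[2], a[3], a[0], a[1] };
}

// Scalar model of vhaddpd: adjacent pairs are summed within each 128-bit lane.
Ymm HaddPd(const Ymm& a, const Ymm& b)
{
    return { a[0] + a[1], b[0] + b[1], a[2] + a[3], b[2] + b[3] };
}

double Trace4x4(const Ymm& row0, const Ymm& row1, const Ymm& row2, const Ymm& row3)
{
    const Ymm lo = BlendPd(row0, row1, 0x2);   // lo[0] = row0[0], lo[1] = row1[1]
    const Ymm hi = BlendPd(row2, row3, 0x8);   // hi[2] = row2[2], hi[3] = row3[3]
    const Ymm sw = SwapLanes(hi);              // sw[0] = row2[2], sw[1] = row3[3]

    Ymm sum{};
    for (std::size_t i = 0; i < 4; i++)
        sum[i] = lo[i] + sw[i];                // models vaddpd

    return HaddPd(sum, sum)[0];                // element 0 holds the trace
}
```

Only elements 0 and 1 of the intermediate sum matter; the upper two elements carry leftover values that the final horizontal add never moves into element 0.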

The assembly language function Avx2Mat4x4InvF64_ calculates an inverse matrix using the same technique as the corresponding C++ function. Following its prolog, the function Avx2Mat4x4InvF64_ saves its argument values to the home area for later use. It then allocates storage space on the stack to hold intermediate results. More specifically, the and rsp,0ffffffe0h instruction aligns RSP to a 32-byte boundary, and the sub rsp,416 instruction allocates local stack space that’s required for the intermediate matrices m2, m3, and m4 plus 32 bytes for function calls. Next, a series of calls are made to the functions Avx2Mat4x4MulF64_ and Avx2Mat4x4TraceF64_ to calculate the trace values t1–t4. The matrix multiplication code that’s used in this example is basically the same code that you saw in example Ch09_05. The algorithm coefficients c1–c4 are calculated next using simple scalar floating-point arithmetic. Coefficient c4 is then tested to verify that the source matrix is not singular. If the source matrix is not singular, the function calculates the inverse matrix m_inv. Note that all of the arithmetic required to calculate m_inv is carried out using straightforward packed double-precision floating-point multiplication and addition. Here is the output for source code example Ch09_06:

Results for Avx2Mat4x4InvF64

Test #1 - Test Matrix

2 7 3 4

5 9 6 4.75

6.5 3 4 10

7 5.25 8.125 6

Test #1 - Avx2Mat4x4InvF64Cpp - Inverse Matrix

-0.943926 0.91657 0.197547 -0.425579

-0.0568818 0.251148 0.00302831 -0.165952

0.545399 -0.647656 -0.213597 0.505123

0.412456 -0.412053 0.0561248 0.124363

Test #1 - Avx2Mat4x4InvF64Cpp - Verify Matrix

1 0 0 0

0 1 0 0

0 0 1 0

0 0 0 1

Test #1 - Avx2Mat4x4InvF64_ - Inverse Matrix

-0.943926 0.91657 0.197547 -0.425579

-0.0568818 0.251148 0.00302831 -0.165952

0.545399 -0.647656 -0.213597 0.505123

0.412456 -0.412053 0.0561248 0.124363

Test #1 - Avx2Mat4x4InvF64_ - Verify Matrix

1 0 0 0

0 1 0 0

0 0 1 0

0 0 0 1

Test #2 - Test Matrix

0.5 12 17.25 4

5 2 6.75 8

13.125 1 3 9.75

16 1.625 7 0.25

Test #2 - Avx2Mat4x4InvF64Cpp - Inverse Matrix

0.00165165 -0.0690239 0.0549591 0.0389347

0.135369 -0.359846 0.242038 -0.0903252

-0.0350097 0.239298 -0.183964 0.0772214

-0.0053352 0.056194 0.0603606 -0.0669085

Test #2 - Avx2Mat4x4InvF64Cpp - Verify Matrix

1 0 0 0

0 1 0 0

0 0 1 0

0 0 0 1

Test #2 - Avx2Mat4x4InvF64_ - Inverse Matrix

0.00165165 -0.0690239 0.0549591 0.0389347

0.135369 -0.359846 0.242038 -0.0903252

-0.0350097 0.239298 -0.183964 0.0772214

-0.0053352 0.056194 0.0603606 -0.0669085

Test #2 - Avx2Mat4x4InvF64_ - Verify Matrix

1 0 0 0

0 1 0 0

0 0 1 0

0 0 0 1

Test #3 - Test Matrix

2 0 0 1

0 4 5 0

0 0 0 7

0 0 0 6

Test #3 - Avx2Mat4x4InvF64Cpp - Singular Matrix

Test #3 - Avx2Mat4x4InvF64_ - Singular Matrix

Running benchmark function Avx2Mat4x4InvF64_BM - please wait

Benchmark times save to file Ch09_06_Avx2Mat4x4InvF64_BM_CHROMIUM.csv

Table 9-4 contains benchmark timing measurements for the matrix inversion functions.

Table 9-4.

Matrix Inverse Mean Execution Times (Microseconds), 100,000 Inversions

CPU	C++	Assembly Language
i7-4790S	30417	4168
i9-7900X	26646	3773
i7-8700K	24485	2941

Blend and Permute Instructions

A data blend operation conditionally copies elements from two packed source operands to a packed destination operand using a control mask that specifies which elements to copy. A data permute operation rearranges the elements of a packed source operand according to a control mask. You’ve already seen several source code examples in this chapter that exploited data blend and permute operations. The next source code example is named Ch09_07 and includes code that demonstrates how to use additional blend and permute instructions. Listing 9-7 shows the source code for example Ch09_07.

//------------------------------------------------

// Ch09_07.cpp

//------------------------------------------------

#include "stdafx.h"

#include <cstdint>

#include <iostream>

#include <iomanip>

#include "YmmVal.h"

using namespace std;

extern "C" void AvxBlendF32_(YmmVal* des1, YmmVal* src1, YmmVal* src2, YmmVal* idx1);

extern "C" void Avx2PermuteF32_(YmmVal* des1, YmmVal* src1, YmmVal* idx1, YmmVal* des2, YmmVal* src2, YmmVal* idx2);

void AvxBlendF32(void)

{

const uint32_t sel0 = 0x00000000;

const uint32_t sel1 = 0x80000000;

alignas(32) YmmVal des1, src1, src2, idx1;

src1.m_F32[0] = 10.0f; src2.m_F32[0] = 100.0f; idx1.m_I32[0] = sel1;

src1.m_F32[1] = 20.0f; src2.m_F32[1] = 200.0f; idx1.m_I32[1] = sel0;

src1.m_F32[2] = 30.0f; src2.m_F32[2] = 300.0f; idx1.m_I32[2] = sel0;

src1.m_F32[3] = 40.0f; src2.m_F32[3] = 400.0f; idx1.m_I32[3] = sel1;

src1.m_F32[4] = 50.0f; src2.m_F32[4] = 500.0f; idx1.m_I32[4] = sel1;

src1.m_F32[5] = 60.0f; src2.m_F32[5] = 600.0f; idx1.m_I32[5] = sel0;

src1.m_F32[6] = 70.0f; src2.m_F32[6] = 700.0f; idx1.m_I32[6] = sel1;

src1.m_F32[7] = 80.0f; src2.m_F32[7] = 800.0f; idx1.m_I32[7] = sel0;

AvxBlendF32_(&des1, &src1, &src2, &idx1);

cout << "\nResults for AvxBlendF32 (vblendvps)\n";

cout << fixed << setprecision(1);

for (size_t i = 0; i < 8; i++)

{

cout << "i: " << setw(2) << i << " ";

cout << "src1: " << setw(8) << src1.m_F32[i] << " ";

cout << "src2: " << setw(8) << src2.m_F32[i] << " ";

cout << setfill('0');

cout << "idx1: 0x" << setw(8) << hex << idx1.m_U32[i] << " ";

cout << setfill(' ');

cout << "des1: " << setw(8) << des1.m_F32[i] << '\n';

}

}

void Avx2PermuteF32(void)

{

alignas(32) YmmVal des1, src1, idx1;

alignas(32) YmmVal des2, src2, idx2;

// idx1 values must be between 0 and 7.

src1.m_F32[0] = 100.0f; idx1.m_I32[0] = 3;

src1.m_F32[1] = 200.0f; idx1.m_I32[1] = 7;

src1.m_F32[2] = 300.0f; idx1.m_I32[2] = 0;

src1.m_F32[3] = 400.0f; idx1.m_I32[3] = 4;

src1.m_F32[4] = 500.0f; idx1.m_I32[4] = 6;

src1.m_F32[5] = 600.0f; idx1.m_I32[5] = 6;

src1.m_F32[6] = 700.0f; idx1.m_I32[6] = 1;

src1.m_F32[7] = 800.0f; idx1.m_I32[7] = 2;

// idx2 values must be between 0 and 3.

src2.m_F32[0] = 100.0f; idx2.m_I32[0] = 3;

src2.m_F32[1] = 200.0f; idx2.m_I32[1] = 1;

src2.m_F32[2] = 300.0f; idx2.m_I32[2] = 1;

src2.m_F32[3] = 400.0f; idx2.m_I32[3] = 2;

src2.m_F32[4] = 500.0f; idx2.m_I32[4] = 3;

src2.m_F32[5] = 600.0f; idx2.m_I32[5] = 2;

src2.m_F32[6] = 700.0f; idx2.m_I32[6] = 0;

src2.m_F32[7] = 800.0f; idx2.m_I32[7] = 0;

Avx2PermuteF32_(&des1, &src1, &idx1, &des2, &src2, &idx2);

cout << "\nResults for Avx2PermuteF32 (vpermps)\n";

cout << fixed << setprecision(1);

for (size_t i = 0; i < 8; i++)

{

cout << "i: " << setw(2) << i << " ";

cout << "src1: " << setw(8) << src1.m_F32[i] << " ";

cout << "idx1: " << setw(8) << idx1.m_I32[i] << " ";

cout << "des1: " << setw(8) << des1.m_F32[i] << '\n';

}

cout << "\nResults for Avx2PermuteF32 (vpermilps)\n";

for (size_t i = 0; i < 8; i++)

{

cout << "i: " << setw(2) << i << " ";

cout << "src2: " << setw(8) << src2.m_F32[i] << " ";

cout << "idx2: " << setw(8) << idx2.m_I32[i] << " ";

cout << "des2: " << setw(8) << des2.m_F32[i] << '\n';

}

}

int main()

{

AvxBlendF32();

Avx2PermuteF32();

return 0;

}

;-------------------------------------------------

; Ch09_07.asm

;-------------------------------------------------

; extern "C" void AvxBlendF32_(YmmVal* des1, YmmVal* src1, YmmVal* src2, YmmVal* idx1)

.code

AvxBlendF32_ proc

vmovaps ymm0,ymmword ptr [rdx] ;ymm0 = src1

vmovaps ymm1,ymmword ptr [r8] ;ymm1 = src2

vmovdqa ymm2,ymmword ptr [r9] ;ymm2 = idx1

vblendvps ymm3,ymm0,ymm1,ymm2 ;blend ymm0 & ymm1, ymm2 "indices"

vmovaps ymmword ptr [rcx],ymm3 ;Save result to des1

vzeroupper

ret

AvxBlendF32_ endp

; extern "C" void Avx2PermuteF32_(YmmVal* des1, YmmVal* src1, YmmVal* idx1, YmmVal* des2, YmmVal* src2, YmmVal* idx2)

Avx2PermuteF32_ proc

; Perform vpermps permutation

vmovaps ymm0,ymmword ptr [rdx] ;ymm0 = src1

vmovdqa ymm1,ymmword ptr [r8] ;ymm1 = idx1

vpermps ymm2,ymm1,ymm0 ;permute ymm0 using ymm1 indices

vmovaps ymmword ptr [rcx],ymm2 ;save result to des1

; Perform vpermilps permutation

mov rdx,[rsp+40] ;rdx = src2 ptr

mov r8,[rsp+48] ;r8 = idx2 ptr

vmovaps ymm3,ymmword ptr [rdx] ;ymm3 = src2

vmovdqa ymm4,ymmword ptr [r8] ;ymm4 = idx2

vpermilps ymm5,ymm3,ymm4 ;permute ymm3 using ymm4 indices

vmovaps ymmword ptr [r9],ymm5 ;save result to des2

vzeroupper

ret

Avx2PermuteF32_ endp

end

Listing 9-7.

Example Ch09_07

The C++ code in Listing 9-7 begins with a function named AvxBlendF32 that initializes YmmVal variables src1 and src2 using single-precision floating-point values. It also initializes a third YmmVal variable named idx1 for use as a blend control mask. The high-order bit of each doubleword element in idx1 specifies whether the corresponding element from src1 (high-order bit = 0) or src2 (high-order bit = 1) is copied to the destination operand. These three source operands are used by the vblendvps (Variable Blend Packed Single-Precision Floating-Point Values) instruction, which is located in the assembly language function AvxBlendF32_. Following execution of this function, the results are streamed to cout.

The C++ code in Listing 9-7 also includes a function named Avx2PermuteF32. This function initializes several YmmVal variables that demonstrate use of the vpermps and vpermilps instructions. Both of these instructions require a set of indices that specify which source operand elements are copied to the destination operand. For example, the statement idx1.m_I32[0] = 3 is used to direct the vpermps instruction in Avx2PermuteF32_ to perform des1.m_F32[0] = src1.m_F32[3]. The vpermps instruction requires each index in idx1 to be between zero and seven. An index can be used more than once in idx1 in order to copy an element from src1 to multiple locations in des1. The vpermilps instruction requires its indices to be between zero and three.

The assembly language function AvxBlendF32_ begins by loading the source data operands into registers YMM0 and YMM1 using two vmovaps instructions. The vmovdqa instruction that follows loads the blend control mask into register YMM2. The ensuing vblendvps ymm3,ymm0,ymm1,ymm2 instruction blends elements from registers YMM0 and YMM1 into YMM3 according to the control values in YMM2. The high-order bit of each doubleword element in YMM2 specifies whether the corresponding element from YMM0 (high-order bit = 0) or YMM1 (high-order bit = 1) is copied to YMM3. Figure 9-7 illustrates the execution of this instruction in greater detail. The vblendvps instruction and its double-precision counterpart vblendvpd are examples of AVX instructions that require three source operands. Floating-point blend operations using an immediate control mask are also possible with the vblendp[d|s] instructions.

Figure 9-7. Execution of the vblendvps instruction
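The element selection rule used by vblendvps can also be expressed as a short scalar model. In the sketch below, the functions BlendVps and BlendVpsExample are hypothetical names introduced for illustration; the operand values are the ones that Listing 9-7 assigns to src1, src2, and idx1.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar model of vblendvps for eight single-precision elements.
// The high-order bit of each control doubleword selects src2 (1) or src1 (0).
std::array<float, 8> BlendVps(const std::array<float, 8>& src1,
                              const std::array<float, 8>& src2,
                              const std::array<std::uint32_t, 8>& ctrl)
{
    std::array<float, 8> des{};
    for (std::size_t i = 0; i < des.size(); i++)
        des[i] = (ctrl[i] & 0x80000000u) ? src2[i] : src1[i];
    return des;
}

// Repeats the blend from Listing 9-7 and checks the result.
void BlendVpsExample()
{
    const std::uint32_t sel0 = 0x00000000;
    const std::uint32_t sel1 = 0x80000000;

    const std::array<float, 8> src1 = { 10, 20, 30, 40, 50, 60, 70, 80 };
    const std::array<float, 8> src2 = { 100, 200, 300, 400, 500, 600, 700, 800 };
    const std::array<std::uint32_t, 8> ctrl = { sel1, sel0, sel0, sel1, sel1, sel0, sel1, sel0 };

    const std::array<float, 8> expected = { 100, 20, 30, 400, 500, 60, 700, 80 };
    assert(BlendVps(src1, src2, ctrl) == expected);
}
```

Only the high-order bit of each control element is examined, so any control value with that bit set selects the element from src2.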

Following AvxBlendF32_ in Listing 9-7 is the function Avx2PermuteF32_, which demonstrates use of the vpermps and vpermilps instructions. The vpermps instruction permutes (or rearranges) the elements of its second source operand (which is 256 bits wide and contains eight single-precision floating-point values) according to the indices in its first source operand; in the instruction vpermps ymm2,ymm1,ymm0, register YMM1 holds the indices and YMM0 holds the data. The vpermilps (In-Lane Permute of Single-Precision Floating-Point Values) instruction performs its permutations using two independent 128-bit wide lanes (i.e., bits [255:128] and bits [127:0]). Unlike vpermps, it takes its data from the first source operand and its indices from the second, as in vpermilps ymm5,ymm3,ymm4. The control indices for an in-lane permutation must range between zero and three, and each lane uses its own distinct set of indices. Figure 9-8 illustrates the execution of these instructions in greater detail. AVX and AVX2 also include the double-precision floating-point permute instructions vpermilpd and vpermpd.

Figure 9-8. Execution of the vpermps and vpermilps instructions
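The difference between a full-width permute and an in-lane permute can be sketched with two scalar models. The type aliases F32x8 and I32x8 and the functions PermPs, PermIlPs, and PermuteExample are hypothetical names used only for this illustration; the operand values are the ones that Listing 9-7 assigns to src1, idx1, src2, and idx2.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using F32x8 = std::array<float, 8>;
using I32x8 = std::array<std::int32_t, 8>;

// Scalar model of vpermps: each destination element may come from any of the
// eight source elements. Only the low three bits of each index are used.
F32x8 PermPs(const F32x8& src, const I32x8& idx)
{
    F32x8 des{};
    for (std::size_t i = 0; i < 8; i++)
        des[i] = src[static_cast<std::size_t>(idx[i] & 7)];
    return des;
}

// Scalar model of vpermilps (variable form): each destination element comes
// from the same 128-bit lane. Only the low two bits of each index are used.
F32x8 PermIlPs(const F32x8& src, const I32x8& idx)
{
    F32x8 des{};
    for (std::size_t i = 0; i < 8; i++)
    {
        const std::size_t lane_base = (i < 4) ? 0 : 4;   // first element of the lane
        des[i] = src[lane_base + static_cast<std::size_t>(idx[i] & 3)];
    }
    return des;
}

// Repeats the permutations from Listing 9-7 and checks the results.
void PermuteExample()
{
    const F32x8 src = { 100, 200, 300, 400, 500, 600, 700, 800 };
    const I32x8 idx1 = { 3, 7, 0, 4, 6, 6, 1, 2 };
    const I32x8 idx2 = { 3, 1, 1, 2, 3, 2, 0, 0 };

    const F32x8 expected1 = { 400, 800, 100, 500, 700, 700, 200, 300 };
    const F32x8 expected2 = { 400, 200, 200, 300, 800, 700, 500, 500 };

    assert(PermPs(src, idx1) == expected1);
    assert(PermIlPs(src, idx2) == expected2);
}
```

Note how index 3 selects element 3 in the lower lane but element 7 in the upper lane of the in-lane model, whereas the full-width model always indexes from element 0.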

Here is the output for source code example Ch09_07:

Results for AvxBlendF32 (vblendvps)

i: 0 src1: 10.0 src2: 100.0 idx1: 0x80000000 des1: 100.0

i: 1 src1: 20.0 src2: 200.0 idx1: 0x00000000 des1: 20.0

i: 2 src1: 30.0 src2: 300.0 idx1: 0x00000000 des1: 30.0

i: 3 src1: 40.0 src2: 400.0 idx1: 0x80000000 des1: 400.0

i: 4 src1: 50.0 src2: 500.0 idx1: 0x80000000 des1: 500.0

i: 5 src1: 60.0 src2: 600.0 idx1: 0x00000000 des1: 60.0

i: 6 src1: 70.0 src2: 700.0 idx1: 0x80000000 des1: 700.0

i: 7 src1: 80.0 src2: 800.0 idx1: 0x00000000 des1: 80.0

Results for Avx2PermuteF32 (vpermps)

i: 0 src1: 100.0 idx1: 3 des1: 400.0

i: 1 src1: 200.0 idx1: 7 des1: 800.0

i: 2 src1: 300.0 idx1: 0 des1: 100.0

i: 3 src1: 400.0 idx1: 4 des1: 500.0

i: 4 src1: 500.0 idx1: 6 des1: 700.0

i: 5 src1: 600.0 idx1: 6 des1: 700.0

i: 6 src1: 700.0 idx1: 1 des1: 200.0

i: 7 src1: 800.0 idx1: 2 des1: 300.0

Results for Avx2PermuteF32 (vpermilps)

i: 0 src2: 100.0 idx2: 3 des2: 400.0

i: 1 src2: 200.0 idx2: 1 des2: 200.0

i: 2 src2: 300.0 idx2: 1 des2: 200.0

i: 3 src2: 400.0 idx2: 2 des2: 300.0

i: 4 src2: 500.0 idx2: 3 des2: 800.0

i: 5 src2: 600.0 idx2: 2 des2: 700.0

i: 6 src2: 700.0 idx2: 0 des2: 500.0

i: 7 src2: 800.0 idx2: 0 des2: 500.0

Data Gather Instructions

The final source code example of this chapter, Ch09_08, explains how to use the AVX2 gather instructions. A gather instruction conditionally loads elements from non-contiguous memory locations (typically an array) into an XMM or YMM register. A gather instruction requires a set of indices and a merge control mask that specifies which elements to copy. Listing 9-8 shows the source code for example Ch09_08. Chapter 8 presented an overview of the AVX2 gather instructions, including a graphic (see Figure 8-1) that elucidated execution of the vgatherdps instruction. You may find it helpful to review that material prior to perusing the source code and discussions in this section.
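As a point of reference, the conditional load performed by a gather can be sketched as a scalar loop. The function GatherF32 below is a hypothetical model written for this illustration and is not part of example Ch09_08. It treats a merge value of 1 as "load this element", which mirrors the merge arrays used by the C++ code in Listing 9-8; the actual instructions test the high-order bit of each mask element, which is why the assembly language code shifts the merge values left before using them.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar model of a conditional gather: y[i] = x[indices[i]] where merge[i] == 1.
// Elements of y whose merge value is not 1 keep their previous contents.
template <std::size_t N, std::size_t M>
void GatherF32(std::array<float, N>& y, const std::array<float, M>& x,
               const std::array<std::int32_t, N>& indices,
               const std::array<std::int32_t, N>& merge)
{
    for (std::size_t i = 0; i < N; i++)
    {
        if (merge[i] != 1)
            continue;   // element not selected, leave y[i] unchanged

        // The model checks bounds; the hardware instruction does not.
        assert(indices[i] >= 0 && static_cast<std::size_t>(indices[i]) < M);
        y[i] = x[static_cast<std::size_t>(indices[i])];
    }
}
```

With x[i] = i * 10, indices { 2, 1, 6, 5, 4, 13, 11, 9 }, merge { 1, 1, 0, 1, 1, 0, 1, 1 }, and every element of y initially set to -1, the model leaves y[2] and y[5] at -1 and loads the remaining six elements from x.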

//------------------------------------------------

// Ch09_08.cpp

//------------------------------------------------

#include "stdafx.h"

#include <string>

#include <cstdint>

#include <iostream>

#include <iomanip>

#include <array>

#include <stdexcept>

using namespace std;

extern "C" void Avx2Gather8xF32_I32_(float* y, const float* x,

const int32_t* indices, const int32_t* masks);

extern "C" void Avx2Gather8xF32_I64_(float* y, const float* x,

const int64_t* indices, const int32_t* masks);

extern "C" void Avx2Gather8xF64_I32_(double* y, const double* x,

const int32_t* indices, const int64_t* masks);

extern "C" void Avx2Gather8xF64_I64_(double* y, const double* x,

const int64_t* indices, const int64_t* masks);

template <typename T, typename I, typename M, size_t N>

void Print(const string& msg, const array<T, N>& y, const array<I, N>& indices,

const array<M, N>& merge)

{

if (y.size() != indices.size() || y.size() != merge.size())

throw runtime_error("Non-conforming arrays - Print");

cout << '\n' << msg << '\n';

for (size_t i = 0; i < y.size(); i++)

{

string merge_s = (merge[i] == 1) ? "Yes" : "No";

cout << "i: " << setw(2) << i << " ";

cout << "y: " << setw(10) << y[i] << " ";

cout << "index: " << setw(4) << indices[i] << " ";

cout << "merge: " << setw(4) << merge_s << '\n';

}

}

void Avx2Gather8xF32_I32()

{

array<float, 20> x;

for (size_t i = 0; i < x.size(); i++)

x[i] = (float)(i * 10);

array<float, 8> y { -1, -1, -1, -1, -1, -1, -1, -1 };

array<int32_t, 8> indices { 2, 1, 6, 5, 4, 13, 11, 9 };

array<int32_t, 8> merge { 1, 1, 0, 1, 1, 0, 1, 1 };

cout << fixed << setprecision(1);

cout << "\nResults for Avx2Gather8xF32_I32\n";

Print("Values before", y, indices, merge);

Avx2Gather8xF32_I32_(y.data(), x.data(), indices.data(), merge.data());

Print("Values after", y, indices, merge);

}

void Avx2Gather8xF32_I64()

{

array<float, 20> x;

for (size_t i = 0; i < x.size(); i++)

x[i] = (float)(i * 10);

array<float, 8> y { -1, -1, -1, -1, -1, -1, -1, -1 };

array<int64_t, 8> indices { 19, 1, 0, 5, 4, 3, 11, 11 };

array<int32_t, 8> merge { 1, 1, 1, 1, 0, 0, 1, 1 };

cout << fixed << setprecision(1);

cout << "\nResults for Avx2Gather8xF32_I64\n";

Print("Values before", y, indices, merge);

Avx2Gather8xF32_I64_(y.data(), x.data(), indices.data(), merge.data());

Print("Values after", y, indices, merge);

}

void Avx2Gather8xF64_I32()

{

array<double, 20> x;

for (size_t i = 0; i < x.size(); i++)

x[i] = (double)(i * 10);

array<double, 8> y { -1, -1, -1, -1, -1, -1, -1, -1 };

array<int32_t, 8> indices { 12, 11, 6, 15, 4, 13, 18, 3 };

array<int64_t, 8> merge { 1, 1, 0, 1, 1, 0, 1, 0 };

cout << fixed << setprecision(1);

cout << "\nResults for Avx2Gather8xF64_I32\n";

Print("Values before", y, indices, merge);

Avx2Gather8xF64_I32_(y.data(), x.data(), indices.data(), merge.data());

Print("Values after", y, indices, merge);

}

void Avx2Gather8xF64_I64()

{

array<double, 20> x;

for (size_t i = 0; i < x.size(); i++)

x[i] = (double)(i * 10);

array<double, 8> y { -1, -1, -1, -1, -1, -1, -1, -1 };

array<int64_t, 8> indices { 11, 17, 1, 6, 14, 13, 8, 8 };

array<int64_t, 8> merge { 1, 0, 1, 1, 1, 0, 1, 1 };

cout << fixed << setprecision(1);

cout << "\nResults for Avx2Gather8xF64_I64\n";

Print("Values before", y, indices, merge);

Avx2Gather8xF64_I64_(y.data(), x.data(), indices.data(), merge.data());

Print("Values after", y, indices, merge);

}

int main()

{

Avx2Gather8xF32_I32();

Avx2Gather8xF32_I64();

Avx2Gather8xF64_I32();

Avx2Gather8xF64_I64();

return 0;

}

;-------------------------------------------------
;               Ch09_08.asm
;-------------------------------------------------

; For each of the following functions, the contents of y are loaded
; into ymm0 prior to execution of the vgatherXXX instruction in order to
; demonstrate the effects of conditional merging.

        .code

; extern "C" void Avx2Gather8xF32_I32_(float* y, const float* x, const int32_t* indices, const int32_t* merge)

Avx2Gather8xF32_I32_ proc
        vmovups ymm0,ymmword ptr [rcx]          ;ymm0 = y[7]:y[0]
        vmovdqu ymm1,ymmword ptr [r8]           ;ymm1 = indices[7]:indices[0]
        vmovdqu ymm2,ymmword ptr [r9]           ;ymm2 = merge[7]:merge[0]
        vpslld ymm2,ymm2,31                     ;shift merge vals to high-order bits
        vgatherdps ymm0,[rdx+ymm1*4],ymm2       ;ymm0 = gathered elements
        vmovups ymmword ptr [rcx],ymm0          ;save gathered elements

        vzeroupper
        ret
Avx2Gather8xF32_I32_ endp

; extern "C" void Avx2Gather8xF32_I64_(float* y, const float* x, const int64_t* indices, const int32_t* merge)

Avx2Gather8xF32_I64_ proc
        vmovups xmm0,xmmword ptr [rcx]          ;xmm0 = y[3]:y[0]
        vmovdqu ymm1,ymmword ptr [r8]           ;ymm1 = indices[3]:indices[0]
        vmovdqu xmm2,xmmword ptr [r9]           ;xmm2 = merge[3]:merge[0]
        vpslld xmm2,xmm2,31                     ;shift merge vals to high-order bits
        vgatherqps xmm0,[rdx+ymm1*4],xmm2       ;xmm0 = gathered elements
        vmovups xmmword ptr [rcx],xmm0          ;save gathered elements

        vmovups xmm3,xmmword ptr [rcx+16]       ;xmm3 = y[7]:y[4]
        vmovdqu ymm1,ymmword ptr [r8+32]        ;ymm1 = indices[7]:indices[4]
        vmovdqu xmm2,xmmword ptr [r9+16]        ;xmm2 = merge[7]:merge[4]
        vpslld xmm2,xmm2,31                     ;shift merge vals to high-order bits
        vgatherqps xmm3,[rdx+ymm1*4],xmm2       ;xmm3 = gathered elements
        vmovups xmmword ptr [rcx+16],xmm3       ;save gathered elements

        vzeroupper
        ret
Avx2Gather8xF32_I64_ endp

; extern "C" void Avx2Gather8xF64_I32_(double* y, const double* x, const int32_t* indices, const int64_t* merge)

Avx2Gather8xF64_I32_ proc
        vmovupd ymm0,ymmword ptr [rcx]          ;ymm0 = y[3]:y[0]
        vmovdqu xmm1,xmmword ptr [r8]           ;xmm1 = indices[3]:indices[0]
        vmovdqu ymm2,ymmword ptr [r9]           ;ymm2 = merge[3]:merge[0]
        vpsllq ymm2,ymm2,63                     ;shift merge vals to high-order bits
        vgatherdpd ymm0,[rdx+xmm1*8],ymm2       ;ymm0 = gathered elements
        vmovupd ymmword ptr [rcx],ymm0          ;save gathered elements

        vmovupd ymm0,ymmword ptr [rcx+32]       ;ymm0 = y[7]:y[4]
        vmovdqu xmm1,xmmword ptr [r8+16]        ;xmm1 = indices[7]:indices[4]
        vmovdqu ymm2,ymmword ptr [r9+32]        ;ymm2 = merge[7]:merge[4]
        vpsllq ymm2,ymm2,63                     ;shift merge vals to high-order bits
        vgatherdpd ymm0,[rdx+xmm1*8],ymm2       ;ymm0 = gathered elements
        vmovupd ymmword ptr [rcx+32],ymm0       ;save gathered elements

        vzeroupper
        ret
Avx2Gather8xF64_I32_ endp

; extern "C" void Avx2Gather8xF64_I64_(double* y, const double* x, const int64_t* indices, const int64_t* merge)

Avx2Gather8xF64_I64_ proc
        vmovupd ymm0,ymmword ptr [rcx]          ;ymm0 = y[3]:y[0]
        vmovdqu ymm1,ymmword ptr [r8]           ;ymm1 = indices[3]:indices[0]
        vmovdqu ymm2,ymmword ptr [r9]           ;ymm2 = merge[3]:merge[0]
        vpsllq ymm2,ymm2,63                     ;shift merge vals to high-order bits
        vgatherqpd ymm0,[rdx+ymm1*8],ymm2       ;ymm0 = gathered elements
        vmovupd ymmword ptr [rcx],ymm0          ;save gathered elements

        vmovupd ymm0,ymmword ptr [rcx+32]       ;ymm0 = y[7]:y[4]
        vmovdqu ymm1,ymmword ptr [r8+32]        ;ymm1 = indices[7]:indices[4]
        vmovdqu ymm2,ymmword ptr [r9+32]        ;ymm2 = merge[7]:merge[4]
        vpsllq ymm2,ymm2,63                     ;shift merge vals to high-order bits
        vgatherqpd ymm0,[rdx+ymm1*8],ymm2       ;ymm0 = gathered elements
        vmovupd ymmword ptr [rcx+32],ymm0       ;save gathered elements

        vzeroupper
        ret
Avx2Gather8xF64_I64_ endp

end

Listing 9-8.

Example Ch09_08

The C++ source code in example Ch09_08 includes four functions that initialize test cases to perform single-precision and double-precision floating-point gather operations using signed doubleword or quadword indices. The function Avx2Gather8xF32_I32 begins by initializing the elements of array x (the source array) with test values. Note that this function uses the STL class array<> instead of a raw C++ array to demonstrate use of the former with an assembly language function. Appendix A contains a list of C++ references that you can consult if you’re interested in learning more about this class. Next, each element in array y (the destination array) is set to -1.0 in order to illustrate the effects of conditional merging. The arrays indices and merge are also primed with the required gather instruction indices and merge control mask values, respectively. The assembly language function Avx2Gather8xF32_I32_ is then called to carry out the gather operation. Note that raw pointers for the various STL arrays are obtained using the member function array<>::data. The other C++ functions in this source example (Avx2Gather8xF32_I64, Avx2Gather8xF64_I32, and Avx2Gather8xF64_I64) are similarly structured.
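The following minimal sketch illustrates the array<>::data technique in isolation. The function CopyScaledF32 is a hypothetical stand-in for an assembly language function (it is not part of example Ch09_08); like the gather functions, it receives only raw pointers and has no knowledge of the STL container that owns the storage.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for an assembly language function. It sees only raw
// pointers and an element count, never the std::array objects themselves.
extern "C" void CopyScaledF32(float* y, const float* x, size_t n, float scale)
{
    assert(y != nullptr && x != nullptr);

    for (size_t i = 0; i < n; i++)
        y[i] = x[i] * scale;
}

// Calls the pointer-based function using std::array storage.
std::array<float, 4> ScaleExample()
{
    std::array<float, 4> x { 1.0f, 2.0f, 3.0f, 4.0f };
    std::array<float, 4> y {};

    // data() returns a pointer to the first element of the array's
    // contiguous storage, which is what a C-style interface expects.
    CopyScaledF32(y.data(), x.data(), x.size(), 10.0f);
    return y;
}
```

Because the elements of an array<> object are stored contiguously, the pointer returned by data() can be indexed by the callee exactly like a pointer to a raw C++ array.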

The assembly language function Avx2Gather8xF32_I32_ begins by loading registers YMM0, YMM1, and YMM2 with the test arrays y, indices, and merge, respectively. Register RDX contains a pointer to the source array x. The vpslld ymm2,ymm2,31 instruction shifts the merge control mask values (each value in this mask is zero or one) to the high-order bit of each doubleword element. The ensuing vgatherdps ymm0,[rdx+ymm1*4],ymm2 instruction loads up to eight single-precision floating-point values from array x into register YMM0; the scale factor of 4 converts each index into a byte offset for a single-precision element. The merge control mask in YMM2 dictates which array elements are actually copied into the destination operand YMM0. If the high-order bit of a merge control mask doubleword element is set to 1, the corresponding element in YMM0 is updated; otherwise, it is not changed. Subsequent to the successful load of an array element, the vgatherdps instruction sets the corresponding doubleword element in the merge control mask to zero. The vmovups ymmword ptr [rcx],ymm0 instruction then saves the gather result to y.
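The conditional merging described above can be modeled in scalar C++. The sketch below is illustrative only; the names MakeMask and GatherF32I32Model are hypothetical and do not appear in example Ch09_08. It mirrors the vpslld shift and the per-element behavior of vgatherdps, assuming every index selected by the mask refers to a valid element of x.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Converts 0/1 merge flags into mask form by moving each flag to bit 31,
// mirroring the effect of vpslld ymm2,ymm2,31.
std::array<uint32_t, 8> MakeMask(const std::array<int32_t, 8>& merge)
{
    std::array<uint32_t, 8> mask {};

    for (size_t i = 0; i < merge.size(); i++)
        mask[i] = static_cast<uint32_t>(merge[i]) << 31;
    return mask;
}

// Scalar model of vgatherdps ymm0,[rdx+ymm1*4],ymm2.
// y plays the role of YMM0, indices of YMM1, and mask of YMM2.
void GatherF32I32Model(std::array<float, 8>& y, const float* x,
                       const std::array<int32_t, 8>& indices,
                       std::array<uint32_t, 8>& mask)
{
    assert(x != nullptr);

    for (size_t i = 0; i < y.size(); i++)
    {
        // Only the high-order bit of each mask element is examined.
        if (mask[i] & 0x80000000u)
            y[i] = x[indices[i]];       // element is gathered

        // Elements whose mask bit is clear keep their previous value.
        // The mask element itself ends up as zero in either case.
        mask[i] = 0;
    }
}
```

Applying this model to the test values used by Avx2Gather8xF32_I32 (indices 2, 1, 6, 5, 4, 13, 11, 9 and merge flags 1, 1, 0, 1, 1, 0, 1, 1) reproduces the "Values after" column shown in the results for that function.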

The assembly language functions Avx2Gather8xF32_I64_, Avx2Gather8xF64_I32_, and Avx2Gather8xF64_I64_ are analogous to Avx2Gather8xF32_I32_. Note that the gather instructions used in these functions (vgatherqps, vgatherdpd, and vgatherqpd) gather only four elements per execution, which explains why each one is used twice. Here are the results for source code example Ch09_08:
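The four-element limit follows from operand sizes: a 256-bit register holds eight doubleword or four quadword elements, and a single gather can process no more elements than the smaller of its index and data operands can hold. The sketch below expresses this reasoning; the helper name GatherCount is hypothetical and is used here only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Number of elements processed by one gather instruction in its widest
// AVX2 form, given the index and data element sizes in bytes.
size_t GatherCount(size_t index_bytes, size_t data_bytes)
{
    assert(index_bytes == 4 || index_bytes == 8);
    assert(data_bytes == 4 || data_bytes == 8);

    const size_t ymm_bytes = 32;                        // 256 bits
    const size_t max_indices = ymm_bytes / index_bytes; // indices per register
    const size_t max_data = ymm_bytes / data_bytes;     // values per register

    // The operand that holds fewer elements limits the gather.
    return std::min(max_indices, max_data);
}
```

Under this model, GatherCount(4, 4) corresponds to vgatherdps, GatherCount(8, 4) to vgatherqps, GatherCount(4, 8) to vgatherdpd, and GatherCount(8, 8) to vgatherqpd. Only the first combination yields eight elements, which is consistent with the single vgatherdps instruction in Avx2Gather8xF32_I32_.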

Results for Avx2Gather8xF32_I32

Values before

i: 0 y: -1.0 index: 2 merge: Yes

i: 1 y: -1.0 index: 1 merge: Yes

i: 2 y: -1.0 index: 6 merge: No

i: 3 y: -1.0 index: 5 merge: Yes

i: 4 y: -1.0 index: 4 merge: Yes

i: 5 y: -1.0 index: 13 merge: No

i: 6 y: -1.0 index: 11 merge: Yes

i: 7 y: -1.0 index: 9 merge: Yes

Values after

i: 0 y: 20.0 index: 2 merge: Yes

i: 1 y: 10.0 index: 1 merge: Yes

i: 2 y: -1.0 index: 6 merge: No

i: 3 y: 50.0 index: 5 merge: Yes

i: 4 y: 40.0 index: 4 merge: Yes

i: 5 y: -1.0 index: 13 merge: No

i: 6 y: 110.0 index: 11 merge: Yes

i: 7 y: 90.0 index: 9 merge: Yes

Results for Avx2Gather8xF32_I64

Values before

i: 0 y: -1.0 index: 19 merge: Yes

i: 1 y: -1.0 index: 1 merge: Yes

i: 2 y: -1.0 index: 0 merge: Yes

i: 3 y: -1.0 index: 5 merge: Yes

i: 4 y: -1.0 index: 4 merge: No

i: 5 y: -1.0 index: 3 merge: No

i: 6 y: -1.0 index: 11 merge: Yes

i: 7 y: -1.0 index: 11 merge: Yes

Values after

i: 0 y: 190.0 index: 19 merge: Yes

i: 1 y: 10.0 index: 1 merge: Yes

i: 2 y: 0.0 index: 0 merge: Yes

i: 3 y: 50.0 index: 5 merge: Yes

i: 4 y: -1.0 index: 4 merge: No

i: 5 y: -1.0 index: 3 merge: No

i: 6 y: 110.0 index: 11 merge: Yes

i: 7 y: 110.0 index: 11 merge: Yes

Results for Avx2Gather8xF64_I32

Values before

i: 0 y: -1.0 index: 12 merge: Yes

i: 1 y: -1.0 index: 11 merge: Yes

i: 2 y: -1.0 index: 6 merge: No

i: 3 y: -1.0 index: 15 merge: Yes

i: 4 y: -1.0 index: 4 merge: Yes

i: 5 y: -1.0 index: 13 merge: No

i: 6 y: -1.0 index: 18 merge: Yes

i: 7 y: -1.0 index: 3 merge: No

Values after

i: 0 y: 120.0 index: 12 merge: Yes

i: 1 y: 110.0 index: 11 merge: Yes

i: 2 y: -1.0 index: 6 merge: No

i: 3 y: 150.0 index: 15 merge: Yes

i: 4 y: 40.0 index: 4 merge: Yes

i: 5 y: -1.0 index: 13 merge: No

i: 6 y: 180.0 index: 18 merge: Yes

i: 7 y: -1.0 index: 3 merge: No

Results for Avx2Gather8xF64_I64

Values before

i: 0 y: -1.0 index: 11 merge: Yes

i: 1 y: -1.0 index: 17 merge: No

i: 2 y: -1.0 index: 1 merge: Yes

i: 3 y: -1.0 index: 6 merge: Yes

i: 4 y: -1.0 index: 14 merge: Yes

i: 5 y: -1.0 index: 13 merge: No

i: 6 y: -1.0 index: 8 merge: Yes

i: 7 y: -1.0 index: 8 merge: Yes

Values after

i: 0 y: 110.0 index: 11 merge: Yes

i: 1 y: -1.0 index: 17 merge: No

i: 2 y: 10.0 index: 1 merge: Yes

i: 3 y: 60.0 index: 6 merge: Yes

i: 4 y: 140.0 index: 14 merge: Yes

i: 5 y: -1.0 index: 13 merge: No

i: 6 y: 80.0 index: 8 merge: Yes

i: 7 y: 80.0 index: 8 merge: Yes

Summary

Here are the key learning points of Chapter 9:

Nearly all AVX packed single-precision and double-precision floating-point instructions can be used with either 128-bit or 256-bit wide operands. Packed floating-point operands should be properly aligned whenever possible, as described in this chapter.
The MASM align directive cannot be used to align a 256-bit wide operand on a 32-byte boundary. Assembly language code can align 256-bit wide constant or mutable operands on a 32-byte boundary using the MASM segment directive.
When performing packed arithmetic operations, the vcmpp[d|s] instructions can be used with the vandp[d|s], vandnp[d|s], and vorp[d|s] instructions to make logical decisions without any conditional jump instructions.
The non-associativity of floating-point arithmetic means that minute numerical discrepancies may occur when comparing values calculated using C++ and assembly language functions.
Assembly language functions can use the vperm2f128, vpermp[d|s], and vpermilp[d|s] instructions to rearrange the elements of a packed floating-point operand.
Assembly language functions can use the vblendp[d|s] and vblendvp[d|s] instructions to interleave the elements of two packed floating-point operands.
Assembly language functions can use the vgatherdp[d|s] and vgatherqp[d|s] instructions to conditionally load floating-point values from non-contiguous memory locations into an XMM or YMM register.
Assembly language functions that perform calculations using a YMM register should also use a vzeroupper instruction prior to any epilog code or the ret instruction in order to avoid potential x86-AVX to x86-SSE state transition performance delays.
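The learning point about non-associativity can be illustrated with a short sketch. The function names below are hypothetical; SumTwoLanes imitates the way a packed implementation accumulates partial sums in separate lanes before combining them, whereas SumSequential adds the values in array order, as a simple C++ loop would.

```cpp
#include <cassert>
#include <cstddef>

// Adds the values strictly in array order.
double SumSequential(const double* v, size_t n)
{
    assert(v != nullptr);

    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum;
}

// Accumulates even-indexed and odd-indexed values in separate partial sums
// (imitating two SIMD lanes), then combines the lanes at the end.
double SumTwoLanes(const double* v, size_t n)
{
    assert(v != nullptr);

    double lane0 = 0.0;
    double lane1 = 0.0;

    for (size_t i = 0; i < n; i++)
    {
        if (i % 2 == 0)
            lane0 += v[i];
        else
            lane1 += v[i];
    }
    return lane0 + lane1;
}
```

For the values 1e20, 1.0, -1e20, 1.0, the sequential sum loses the first 1.0 when it is added to 1e20, while the two-lane sum keeps the large values in one lane and the small values in the other. The two functions therefore return different results for the same input, even though both perform only additions.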
