D. Kusswurm, Modern Parallel Programming with C++ and Assembly Language, https://doi.org/10.1007/978-1-4842-7918-2_12

12. Core Assembly Language Programming: Part 2

Daniel Kusswurm

Geneva, IL, USA

In the previous chapter, you were introduced to the fundamentals of x86-64 assembly language programming. You learned how to use elementary instructions that performed integer addition, subtraction, multiplication, and division. You also acquired valuable knowledge regarding memory addressing modes, condition codes, and assembly language programming syntax. The chapter that you are about to read is a continuation of the previous chapter. Topics discussed include scalar floating-point arithmetic, compares, and conversions. This chapter also provides additional details regarding the Visual C++ calling convention including volatile and nonvolatile registers, stack frames, and function prologues and epilogues.

Scalar Floating-Point Arithmetic

Besides its SIMD capabilities, AVX also includes instructions that perform scalar floating-point operations including basic arithmetic, compares, and conversions. Many modern programs use AVX scalar floating-point instructions instead of legacy SSE2 or x87 FPU instructions. The primary reason for this is that most AVX instructions employ three operands: two nondestructive source operands and one destination operand. The use of nondestructive source operands often reduces the number of register-to-register transfers that a function must perform, which yields more efficient code. In this section, you will learn how to code functions that perform scalar floating-point operations using AVX. You will also learn how to pass floating-point arguments and return values between a C++ and assembly language function.

Single-Precision Arithmetic

Listing 12-1 shows the source code for example Ch12_01. This example illustrates how to perform temperature conversions between Fahrenheit and Celsius using AVX and single-precision floating-point values. It also explains how to define and use floating-point constants in an assembly language function.

//------------------------------------------------
//               Ch12_01.h
//------------------------------------------------

#pragma once

// Ch12_01_fasm.asm
extern "C" float ConvertFtoC_Aavx(float deg_f);
extern "C" float ConvertCtoF_Aavx(float deg_c);

//------------------------------------------------
//               Ch12_01.cpp
//------------------------------------------------

#include <iostream>
#include <iomanip>
#include "Ch12_01.h"

static void ConvertFtoC(void);
static void ConvertCtoF(void);

int main()
{
    ConvertFtoC();
    ConvertCtoF();
    return 0;
}

static void ConvertFtoC(void)
{
    const size_t w = 10;
    float deg_fvals[] = {-459.67f, -40.0f, 0.0f, 32.0f, 72.0f, 98.6f, 212.0f};
    size_t n = sizeof(deg_fvals) / sizeof(float);

    std::cout << "\n-------- ConvertFtoC Results --------\n";
    std::cout << std::fixed << std::setprecision(4);

    for (size_t i = 0; i < n; i++)
    {
        float deg_c = ConvertFtoC_Aavx(deg_fvals[i]);

        std::cout << " i: " << i << " ";
        std::cout << "f: " << std::setw(w) << deg_fvals[i] << " ";
        std::cout << "c: " << std::setw(w) << deg_c << '\n';
    }
}

static void ConvertCtoF(void)
{
    const size_t w = 10;
    float deg_cvals[] = {-273.15f, -40.0f, -17.777778f, 0.0f, 25.0f, 37.0f, 100.0f};
    size_t n = sizeof(deg_cvals) / sizeof(float);

    std::cout << "\n-------- ConvertCtoF Results --------\n";
    std::cout << std::fixed << std::setprecision(4);

    for (size_t i = 0; i < n; i++)
    {
        float deg_f = ConvertCtoF_Aavx(deg_cvals[i]);

        std::cout << " i: " << i << " ";
        std::cout << "c: " << std::setw(w) << deg_cvals[i] << " ";
        std::cout << "f: " << std::setw(w) << deg_f << '\n';
    }
}

;-------------------------------------------------
;               Ch12_01_fasm.asm
;-------------------------------------------------

                .const
r4_ScaleFtoC    real4 0.55555556            ;5 / 9
r4_ScaleCtoF    real4 1.8                   ;9 / 5
r4_32p0         real4 32.0

;--------------------------------------------------------------------------
; extern "C" float ConvertFtoC_Aavx(float deg_f);
;--------------------------------------------------------------------------

        .code
ConvertFtoC_Aavx proc
        vmovss xmm1,[r4_32p0]               ;xmm1 = 32
        vsubss xmm2,xmm0,xmm1               ;xmm2 = f - 32

        vmovss xmm1,[r4_ScaleFtoC]          ;xmm1 = 5 / 9
        vmulss xmm0,xmm2,xmm1               ;xmm0 = (f - 32) * 5 / 9
        ret
ConvertFtoC_Aavx endp

;--------------------------------------------------------------------------
; extern "C" float ConvertCtoF_Aavx(float deg_c);
;--------------------------------------------------------------------------

ConvertCtoF_Aavx proc
        vmulss xmm0,xmm0,[r4_ScaleCtoF]     ;xmm0 = c * 9 / 5
        vaddss xmm0,xmm0,[r4_32p0]          ;xmm0 = c * 9 / 5 + 32
        ret
ConvertCtoF_Aavx endp
        end

Listing 12-1

Example Ch12_01

In Listing 12-1, the header file Ch12_01.h includes declaration statements for the functions ConvertFtoC_Aavx() and ConvertCtoF_Aavx(). Note that these functions require a single argument value of type float. Both functions also return a value of type float. The file Ch12_01.cpp includes a function named ConvertFtoC() that performs test case initialization for ConvertFtoC_Aavx() and displays the calculated results. File Ch12_01.cpp also includes the function ConvertCtoF(), which is the Celsius to Fahrenheit counterpart of ConvertFtoC().

The assembly language code in Ch12_01_fasm.asm starts with a .const section that defines the constants needed to convert a temperature value from Fahrenheit to Celsius and vice versa. The text real4 is a MASM directive that allocates storage space for a single-precision floating-point value (the directive real8 can be used for double-precision floating-point values). Following the .const section is the code for function ConvertFtoC_Aavx(). The first instruction of this function, vmovss xmm1,[r4_32p0] (Move or Merge Scalar SPFP¹ Value), loads the single-precision floating-point value 32.0 from memory into register XMM1 (more precisely into XMM1[31:0]). A memory operand is used here since AVX does not support using immediate operands for scalar floating-point constants.

Per the Visual C++ calling convention, the first four floating-point argument values are passed to a function using registers XMM0, XMM1, XMM2, and XMM3. This means that upon entry to function ConvertFtoC_Aavx(), register XMM0 contains argument value deg_f. Following execution of the vmovss instruction, the vsubss xmm2,xmm0,xmm1 (Subtract Scalar SPFP Value) instruction calculates deg_f - 32.0 and saves the result in XMM2[31:0]. Execution of vsubss does not modify the contents of source operands XMM0 and XMM1. However, this instruction copies bits XMM0[127:32] to XMM2[127:32] (other AVX scalar arithmetic instructions also perform this copy operation). The ensuing vmovss xmm1,[r4_ScaleFtoC] loads the constant value 0.55555556 (or 5 / 9) into register XMM1. This is followed by a vmulss xmm0,xmm2,xmm1 (Multiply Scalar SPFP Value) instruction that computes (deg_f - 32.0) * 0.55555556 and saves the result (i.e., the converted temperature in Celsius) in XMM0. The Visual C++ calling convention designates register XMM0 for floating-point return values. Since the return value is already in XMM0, no additional vmovss instructions are necessary.

The assembly language function ConvertCtoF_Aavx() follows next. The code for this function differs slightly from ConvertFtoC_Aavx() in that the AVX scalar floating-point arithmetic instructions use memory operands to reference the required conversion constants. At entry to ConvertCtoF_Aavx(), register XMM0 contains argument value deg_c. The instruction vmulss xmm0,xmm0,[r4_ScaleCtoF] calculates deg_c * 1.8. This is followed by a vaddss xmm0,xmm0,[r4_32p0] (Add Scalar SPFP Value) instruction that calculates deg_c * 1.8 + 32.0. Note that neither ConvertFtoC_Aavx() nor ConvertCtoF_Aavx() performs any validity checks for argument values that are physically impossible (e.g., a temperature of -1000 degrees Fahrenheit). Such checks require floating-point compare instructions, and you will learn about these instructions later in this chapter. Here are the results for source code example Ch12_01:

-------- ConvertFtoC Results --------

i: 0 f: -459.6700 c: -273.1500

i: 1 f: -40.0000 c: -40.0000

i: 2 f: 0.0000 c: -17.7778

i: 3 f: 32.0000 c: 0.0000

i: 4 f: 72.0000 c: 22.2222

i: 5 f: 98.6000 c: 37.0000

i: 6 f: 212.0000 c: 100.0000

-------- ConvertCtoF Results --------

i: 0 c: -273.1500 f: -459.6700

i: 1 c: -40.0000 f: -40.0000

i: 2 c: -17.7778 f: 0.0000

i: 3 c: 0.0000 f: 32.0000

i: 4 c: 25.0000 f: 77.0000

i: 5 c: 37.0000 f: 98.6000

i: 6 c: 100.0000 f: 212.0000

Double-Precision Arithmetic

Listing 12-2 shows the source code for example Ch12_02. This example calculates 3D distances using AVX scalar arithmetic and double-precision floating-point values.

//------------------------------------------------
//               Ch12_02.h
//------------------------------------------------

#pragma once

// Ch12_02_fcpp.cpp
extern double CalcDistance_Cpp(double x1, double y1, double z1, double x2,
    double y2, double z2);

// Ch12_02_fasm.asm
extern "C" double CalcDistance_Aavx(double x1, double y1, double z1, double x2,
    double y2, double z2);

// Ch12_02_misc.cpp
extern void InitArrays(double* x, double* y, double* z, size_t n,
    unsigned int rng_seed);

//------------------------------------------------
//               Ch12_02.cpp
//------------------------------------------------

#include <iostream>
#include <iomanip>
#include "Ch12_02.h"

static void CalcDistance(void);

int main()
{
    CalcDistance();
    return 0;
}

static void CalcDistance(void)
{
    const size_t n = 20;
    double x1[n], y1[n], z1[n], dist1[n];
    double x2[n], y2[n], z2[n], dist2[n];

    InitArrays(x1, y1, z1, n, 29);
    InitArrays(x2, y2, z2, n, 37);

    for (size_t i = 0; i < n; i++)
    {
        dist1[i] = CalcDistance_Cpp(x1[i], y1[i], z1[i], x2[i], y2[i], z2[i]);
        dist2[i] = CalcDistance_Aavx(x1[i], y1[i], z1[i], x2[i], y2[i], z2[i]);
    }

    size_t w1 = 3, w2 = 8;
    std::cout << std::fixed;

    for (size_t i = 0; i < n; i++)
    {
        std::cout << "i: " << std::setw(w1) << i << " ";

        std::cout << std::setprecision(0);
        std::cout << "p1(";
        std::cout << std::setw(w1) << x1[i] << ",";
        std::cout << std::setw(w1) << y1[i] << ",";
        std::cout << std::setw(w1) << z1[i] << ") | ";
        std::cout << "p2(";
        std::cout << std::setw(w1) << x2[i] << ",";
        std::cout << std::setw(w1) << y2[i] << ",";
        std::cout << std::setw(w1) << z2[i] << ") | ";

        std::cout << std::setprecision(4);
        std::cout << "dist1: " << std::setw(w2) << dist1[i] << " | ";
        std::cout << "dist2: " << std::setw(w2) << dist2[i] << '\n';
    }
}

;-------------------------------------------------
;               Ch12_02_fasm.asm
;-------------------------------------------------

;--------------------------------------------------------------------------
; extern "C" double CalcDistance_Aavx(double x1, double y1, double z1, double x2,
;   double y2, double z2);
;--------------------------------------------------------------------------

        .code
CalcDistance_Aavx proc

; Load arguments from stack
        vmovsd xmm4,real8 ptr [rsp+40]      ;xmm4 = y2
        vmovsd xmm5,real8 ptr [rsp+48]      ;xmm5 = z2

; Calculate squares of coordinate distances
        vsubsd xmm0,xmm3,xmm0               ;xmm0 = x2 - x1
        vmulsd xmm0,xmm0,xmm0               ;xmm0 = (x2 - x1) * (x2 - x1)

        vsubsd xmm1,xmm4,xmm1               ;xmm1 = y2 - y1
        vmulsd xmm1,xmm1,xmm1               ;xmm1 = (y2 - y1) * (y2 - y1)

        vsubsd xmm2,xmm5,xmm2               ;xmm2 = z2 - z1
        vmulsd xmm2,xmm2,xmm2               ;xmm2 = (z2 - z1) * (z2 - z1)

; Calculate final distance
        vaddsd xmm3,xmm0,xmm1
        vaddsd xmm4,xmm2,xmm3               ;xmm4 = sum of squares
        vsqrtsd xmm0,xmm0,xmm4              ;xmm0 = final distance value
        ret

CalcDistance_Aavx endp
        end

Listing 12-2

Example Ch12_02

The Euclidean distance between two 3D points can be calculated using the following equation:

$dist=\sqrt{{\left({x}_2-{x}_1\right)}^2+{\left({y}_2-{y}_1\right)}^2+{\left({z}_2-{z}_1\right)}^2}$
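The header file Ch12_02.h declares a C++ counterpart named CalcDistance_Cpp(), but the contents of Ch12_02_fcpp.cpp are not part of Listing 12-2. The following is a minimal sketch of what such a function might look like, written directly from the distance equation; the local variable names are assumptions.

```cpp
#include <cmath>

// Hypothetical sketch of the C++ counterpart declared in Ch12_02.h.
// Computes the Euclidean distance between (x1, y1, z1) and (x2, y2, z2).
double CalcDistance_Cpp(double x1, double y1, double z1, double x2,
    double y2, double z2)
{
    double tx = (x2 - x1) * (x2 - x1);      // square of x distance
    double ty = (y2 - y1) * (y2 - y1);      // square of y distance
    double tz = (z2 - z1) * (z2 - z1);      // square of z distance

    return std::sqrt(tx + ty + tz);
}
```

As a quick check, and assuming the coordinates displayed in the results for this example are exact integers, the points (3, 90, 97) and (89, 52, 38) give sqrt(86 * 86 + 38 * 38 + 59 * 59) = sqrt(12321) = 111.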

If you examine the declaration of function CalcDistance_Aavx(), you will notice that it specifies six argument values of type double. Argument values x1, y1, z1, and x2 are passed in registers XMM0, XMM1, XMM2, and XMM3, respectively. The final two argument values, y2 and z2, are passed on the stack as illustrated in Figure 12-1. Note that this figure shows only the low-order quadword (bits 63:0) of each XMM register; the high-order quadword (bits 127:64) of each XMM register is undefined. Registers RCX, RDX, R8, and R9 are also undefined since CalcDistance_Aavx() does not utilize any integer or pointer arguments.
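The stack offsets used for y2 and z2 follow from the layout at function entry: the return address occupies [rsp], and the next four quadwords are the home area that the caller reserves for the register arguments. The short sketch below works out the arithmetic; StackArgOffset() is a hypothetical helper introduced only for this illustration, and it applies only at entry, before the function changes RSP.

```cpp
#include <cstddef>

// Visual C++ x64 calling convention, at function entry:
//   [rsp]      return address
//   [rsp+8]    home area slot for argument 1
//   [rsp+16]   home area slot for argument 2
//   [rsp+24]   home area slot for argument 3
//   [rsp+32]   home area slot for argument 4
//   [rsp+40]   argument 5, [rsp+48] argument 6, ...
// arg_index is 1-based. Hypothetical helper for illustration.
constexpr std::size_t StackArgOffset(std::size_t arg_index)
{
    // 8 bytes for the return address plus one 8-byte slot
    // for each preceding argument
    return 8 + (arg_index - 1) * 8;
}

static_assert(StackArgOffset(5) == 40, "y2 is at [rsp+40]");
static_assert(StackArgOffset(6) == 48, "z2 is at [rsp+48]");
```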

Figure 12-1
Stack layout and argument registers at entry to CalcDistance_Aavx()

The function CalcDistance_Aavx() begins with a vmovsd xmm4,real8 ptr [rsp+40] (Move or Merge Scalar DPFP² Value) instruction that loads argument value y2 from the stack into register XMM4 (more precisely into XMM4[63:0]). This is followed by a vmovsd xmm5,real8 ptr [rsp+48] instruction that loads argument value z2 into register XMM5. The next two instructions, vsubsd xmm0,xmm3,xmm0 (Subtract Scalar DPFP Value) and vmulsd xmm0,xmm0,xmm0 (Multiply Scalar DPFP Value), calculate (x2 – x1) * (x2 – x1). Similar sequences of instructions are then employed to calculate (y2 – y1) * (y2 – y1) and (z2 – z1) * (z2 – z1). This is followed by two vaddsd (Add Scalar DPFP Value) instructions that sum the three coordinate squares. A vsqrtsd xmm0,xmm0,xmm4 (Compute Square Root of Scalar DPFP Value) instruction computes the final distance. It is important to note that vsqrtsd computes the square root of its second source operand. Like other scalar double-precision floating-point arithmetic instructions, vsqrtsd also copies bits 127:64 of its first source operand to the same bit positions of the destination operand. Here are the results for source example Ch12_02:

i: 0 p1( 24, 4, 45) | p2( 8, 45, 20) | dist1: 50.6162 | dist2: 50.6162

i: 1 p1( 54, 59, 33) | p2( 22, 20, 81) | dist1: 69.6348 | dist2: 69.6348

i: 2 p1( 25, 23, 61) | p2( 83, 20, 44) | dist1: 60.5145 | dist2: 60.5145

i: 3 p1( 83, 4, 22) | p2( 98, 20, 62) | dist1: 45.6180 | dist2: 45.6180

i: 4 p1( 81, 21, 12) | p2( 73, 49, 64) | dist1: 59.5987 | dist2: 59.5987

i: 5 p1( 81, 97, 22) | p2( 70, 48, 45) | dist1: 55.2359 | dist2: 55.2359

i: 6 p1( 24, 62, 77) | p2( 20, 32, 15) | dist1: 68.9928 | dist2: 68.9928

i: 7 p1( 97, 81, 45) | p2( 20, 79, 18) | dist1: 81.6211 | dist2: 81.6211

i: 8 p1( 94, 81, 17) | p2( 8, 89, 87) | dist1: 111.1755 | dist2: 111.1755

i: 9 p1( 53, 82, 62) | p2( 43, 31, 84) | dist1: 56.4358 | dist2: 56.4358

i: 10 p1( 90, 72, 88) | p2( 25, 27, 30) | dist1: 98.0510 | dist2: 98.0510

i: 11 p1( 32, 4, 46) | p2( 62, 33, 53) | dist1: 42.3084 | dist2: 42.3084

i: 12 p1( 7, 88, 13) | p2( 12, 75, 30) | dist1: 21.9773 | dist2: 21.9773

i: 13 p1( 3, 90, 97) | p2( 89, 52, 38) | dist1: 111.0000 | dist2: 111.0000

i: 14 p1( 60, 95, 54) | p2( 91, 51, 33) | dist1: 57.7754 | dist2: 57.7754

i: 15 p1( 16, 10, 52) | p2( 2, 32, 50) | dist1: 26.1534 | dist2: 26.1534

i: 16 p1( 87, 2, 68) | p2( 53, 20, 75) | dist1: 39.1024 | dist2: 39.1024

i: 17 p1( 32, 10, 37) | p2( 8, 41, 13) | dist1: 45.9674 | dist2: 45.9674

i: 18 p1( 62, 29, 84) | p2( 62, 37, 35) | dist1: 49.6488 | dist2: 49.6488

i: 19 p1( 16, 32, 31) | p2( 85, 19, 17) | dist1: 71.5961 | dist2: 71.5961

The final two letters of many x86-AVX arithmetic instruction mnemonics denote the operand type. You have already seen the instructions vaddss and vaddsd, which perform scalar single-precision and double-precision floating-point addition. In these instructions, the suffixes ss and sd denote scalar single-precision and double-precision values, respectively. X86-AVX instructions also use the mnemonic suffixes ps and pd to signify packed single-precision and double-precision values. X86-AVX instructions that manipulate more than one data type often include multiple data type characters in their mnemonics.

Compares

Listing 12-3 shows the source code for example Ch12_03, which demonstrates the use of the floating-point compare instruction vcomiss (Compare Scalar SPFP Values). The vcomiss instruction compares two single-precision floating-point values and sets status flags in RFLAGS to signify a result of less than, equal, greater than, or unordered. The vcomisd (Compare Scalar DPFP Values) instruction is the double-precision counterpart of vcomiss.

//------------------------------------------------
//               Ch12_03.h
//------------------------------------------------

#pragma once
#include <cstdint>

// Ch12_03_fasm.asm
extern "C" void CompareF32_Aavx(float a, float b, uint8_t* results);

// Ch12_03_misc.cpp
extern void DisplayResults(float a, float b, const uint8_t* cmp_results);

// Miscellaneous constants
const size_t c_NumCmpOps = 7;

//------------------------------------------------
//               Ch12_03.cpp
//------------------------------------------------

#include <iostream>
#include <iomanip>
#include <limits>
#include <string>
#include "Ch12_03.h"

static void CompareF32(void);

int main()
{
    CompareF32();
    return 0;
}

static void CompareF32(void)
{
    const size_t n = 6;
    float a[n] {120.0, 250.0, 300.0, -18.0, -81.0, 42.0};
    float b[n] {130.0, 240.0, 300.0, 32.0, -100.0, 0.0};

    // Set NAN test value
    b[n - 1] = std::numeric_limits<float>::quiet_NaN();

    std::cout << "\n----- Results for CompareF32 -----\n";

    for (size_t i = 0; i < n; i++)
    {
        uint8_t cmp_results[c_NumCmpOps];

        CompareF32_Aavx(a[i], b[i], cmp_results);
        DisplayResults(a[i], b[i], cmp_results);
    }
}

;-------------------------------------------------
;               Ch12_03_fasm.asm
;-------------------------------------------------

;--------------------------------------------------------------------------
; extern "C" void CompareF32_Aavx(float a, float b, uint8_t* results);
;--------------------------------------------------------------------------

        .code
CompareF32_Aavx proc

; Set result flags based on compare status
        vcomiss xmm0,xmm1
        setp byte ptr [r8]                  ;RFLAGS.PF = 1 if unordered
        jnp @F

        xor al,al
        mov byte ptr [r8+1],al              ;set remaining elements in array
        mov byte ptr [r8+2],al              ;results[] to 0
        mov byte ptr [r8+3],al
        mov byte ptr [r8+4],al
        mov byte ptr [r8+5],al
        mov byte ptr [r8+6],al
        ret

@@:     setb byte ptr [r8+1]                ;set byte if a < b
        setbe byte ptr [r8+2]               ;set byte if a <= b
        sete byte ptr [r8+3]                ;set byte if a == b
        setne byte ptr [r8+4]               ;set byte if a != b
        seta byte ptr [r8+5]                ;set byte if a > b
        setae byte ptr [r8+6]               ;set byte if a >= b
        ret

CompareF32_Aavx endp
        end

Listing 12-3

Example Ch12_03

The function CompareF32_Aavx() accepts three argument values: two of type float and a pointer to an array of uint8_t values for the results. The first instruction of CompareF32_Aavx(), vcomiss xmm0,xmm1, performs a single-precision floating-point compare of argument values a and b. Note that these values were passed to CompareF32_Aavx() in registers XMM0 and XMM1, respectively. Execution of vcomiss sets RFLAGS.ZF, RFLAGS.PF, and RFLAGS.CF as shown in Table 12-1. The setting of these status flags facilitates the use of the conditional instructions cmovcc, jcc, and setcc (Set Byte on Condition) as shown in Table 12-2.

Table 12-1

Status Flags Set by vcomis[d|s]

Condition	RFLAGS.ZF	RFLAGS.PF	RFLAGS.CF
XMM0 > XMM1	0	0	0
XMM0 == XMM1	1	0	0
XMM0 < XMM1	0	0	1
Unordered	1	1	1

Table 12-2

Condition Codes Following Execution of vcomis[d|s]

Relational Operator	Condition Code	RFLAGS Test Condition
XMM0 < XMM1	Below (b)	CF == 1
XMM0 <= XMM1	Below or equal (be)	CF == 1 || ZF == 1
XMM0 == XMM1	Equal (e or z)	ZF == 1
XMM0 != XMM1	Not Equal (ne or nz)	ZF == 0
XMM0 > XMM1	Above (a)	CF == 0 && ZF == 0
XMM0 >= XMM1	Above or Equal (ae)	CF == 0
Unordered	Parity (p)	PF == 1

It should be noted that the status flags shown in Table 12-1 are set only if floating-point exceptions are masked (the default state for Visual C++ and most other C++ compilers). If floating-point invalid operation or denormal exceptions are unmasked (MXCSR.IM = 0 or MXCSR.DM = 0) and one of the compare operands is a QNaN, SNaN, or denormal, the processor will generate an exception without updating the status flags in RFLAGS.

Following execution of the vcomiss xmm0,xmm1 instruction, CompareF32_Aavx() uses a series of setcc instructions to highlight the relational operators shown in Table 12-2. The setp byte ptr [r8] instruction sets the destination operand byte pointed to by R8 to 1 if RFLAGS.PF is set (i.e., one of the operands is a QNaN or SNaN); otherwise, the destination operand byte is set to 0. If the compare was ordered, the remaining setcc instructions in CompareF32_Aavx() save all possible compare outcomes by setting each entry in array results to 0 or 1. As previously mentioned, a function can also use a jcc or cmovcc instruction following execution of a vcomis[d|s] instruction to perform conditional jumps or moves based on the outcome of a floating-point compare. Here is the output for source code example Ch12_03:

----- Results for CompareF32 -----

a = 120, b = 130

UO=0 LT=1 LE=1 EQ=0 NE=1 GT=0 GE=0

a = 250, b = 240

UO=0 LT=0 LE=0 EQ=0 NE=1 GT=1 GE=1

a = 300, b = 300

UO=0 LT=0 LE=1 EQ=1 NE=0 GT=0 GE=1

a = -18, b = 32

UO=0 LT=1 LE=1 EQ=0 NE=1 GT=0 GE=0

a = -81, b = -100

UO=0 LT=0 LE=0 EQ=0 NE=1 GT=1 GE=1

a = 42, b = nan

UO=1 LT=0 LE=0 EQ=0 NE=0 GT=0 GE=0
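The flag settings in Table 12-1 and the condition tests in Table 12-2 can be cross-checked with a small C++ model. The sketch below is an illustration only: CmpFlags, VcomissModel(), and CompareF32_Model() are hypothetical names, and the model assumes masked floating-point exceptions, so it never raises an exception for NaN operands.

```cpp
#include <array>
#include <cmath>
#include <cstdint>

// Hypothetical model of the three status flags written by vcomiss
// (see Table 12-1).
struct CmpFlags
{
    bool zf, pf, cf;
};

CmpFlags VcomissModel(float a, float b)
{
    if (std::isnan(a) || std::isnan(b))
        return {true, true, true};          // unordered
    if (a > b)
        return {false, false, false};
    if (a == b)
        return {true, false, false};
    return {false, false, true};            // a < b
}

// Mirrors CompareF32_Aavx(): element 0 = unordered, followed by
// LT, LE, EQ, NE, GT, GE using the flag tests in Table 12-2.
std::array<uint8_t, 7> CompareF32_Model(float a, float b)
{
    const CmpFlags f = VcomissModel(a, b);
    std::array<uint8_t, 7> r {};            // all elements start at 0

    r[0] = f.pf;                            // setp
    if (f.pf)
        return r;                           // unordered: other entries stay 0

    r[1] = f.cf;                            // setb
    r[2] = f.cf || f.zf;                    // setbe
    r[3] = f.zf;                            // sete
    r[4] = !f.zf;                           // setne
    r[5] = !f.cf && !f.zf;                  // seta
    r[6] = !f.cf;                           // setae
    return r;
}
```

The model shows why the assembly language code tests PF first. An unordered compare sets ZF, PF, and CF to 1, so the b, be, and e tests would also succeed for a NaN operand if they were evaluated unconditionally. Note also that this convention differs from the C++ operators: the expression a != b evaluates to true when either operand is a NaN, whereas CompareF32_Aavx() reports NE=0 for an unordered compare.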

Conversions

Most C++ programs perform type conversions. For example, it is often necessary to cast a single-precision or double-precision floating-point value to an integer or vice versa. A function may also need to size-promote a single-precision floating-point value to double precision or narrow a double-precision floating-point value to single precision. AVX includes several instructions that perform conversions using either scalar or packed operands. Listing 12-4 shows the source code for example Ch12_04. This example illustrates the use of AVX scalar conversion instructions. Source code example Ch12_04 also introduces macros and explains how to change the rounding control bits in the MXCSR register.

//------------------------------------------------
//               Ch12_04.h
//------------------------------------------------

#pragma once

// Simple union for data exchange
union Uval
{
    int32_t m_I32;
    int64_t m_I64;
    float m_F32;
    double m_F64;
};

// The order of values in enum CvtOp must match the jump table
// that's defined in the .asm file.
enum class CvtOp : unsigned int
{
    I32_F32,    // int32_t to float
    F32_I32,    // float to int32_t
    I32_F64,    // int32_t to double
    F64_I32,    // double to int32_t
    I64_F32,    // int64_t to float
    F32_I64,    // float to int64_t
    I64_F64,    // int64_t to double
    F64_I64,    // double to int64_t
    F32_F64,    // float to double
    F64_F32,    // double to float
};

// Enumerated type for rounding control
enum class RC : unsigned int
{
    Nearest, Down, Up, Zero     // Do not change order
};

// Ch12_04_fasm.asm
extern "C" bool ConvertScalar_Aavx(Uval* a, Uval* b, CvtOp cvt_op, RC rc);

//------------------------------------------------
//               Ch12_04.cpp
//------------------------------------------------

#include <iostream>
#include <iomanip>
#include <cstdint>
#include <string>
#include <limits>
#define _USE_MATH_DEFINES
#include <math.h>
#include "Ch12_04.h"

const std::string c_RcStrings[] = {"Nearest", "Down", "Up", "Zero"};
const RC c_RcVals[] = {RC::Nearest, RC::Down, RC::Up, RC::Zero};
const size_t c_NumRC = sizeof(c_RcVals) / sizeof (RC);

static void ConvertScalars(void);

int main()
{
    ConvertScalars();
    return 0;
}

static void ConvertScalars(void)
{
    const char nl = '\n';
    Uval src1, src2, src3, src4, src5, src6, src7;

    src1.m_F32 = (float)M_PI;
    src2.m_F32 = (float)-M_E;
    src3.m_F64 = M_SQRT2;
    src4.m_F64 = M_SQRT1_2;
    src5.m_F64 = 1.0 + DBL_EPSILON;
    src6.m_I32 = std::numeric_limits<int>::max();
    src7.m_I64 = std::numeric_limits<long long>::max();

    std::cout << "----- Results for ConvertScalars() -----\n";

    for (size_t i = 0; i < c_NumRC; i++)
    {
        RC rc = c_RcVals[i];
        Uval des1, des2, des3, des4, des5, des6, des7;

        ConvertScalar_Aavx(&des1, &src1, CvtOp::F32_I32, rc);
        ConvertScalar_Aavx(&des2, &src2, CvtOp::F32_I64, rc);
        ConvertScalar_Aavx(&des3, &src3, CvtOp::F64_I32, rc);
        ConvertScalar_Aavx(&des4, &src4, CvtOp::F64_I64, rc);
        ConvertScalar_Aavx(&des5, &src5, CvtOp::F64_F32, rc);
        ConvertScalar_Aavx(&des6, &src6, CvtOp::I32_F32, rc);
        ConvertScalar_Aavx(&des7, &src7, CvtOp::I64_F64, rc);

        std::cout << std::fixed;
        std::cout << " Rounding control = " << c_RcStrings[(int)rc] << nl;

        std::cout << " F32_I32: " << std::setprecision(8);
        std::cout << src1.m_F32 << " --> " << des1.m_I32 << nl;

        std::cout << " F32_I64: " << std::setprecision(8);
        std::cout << src2.m_F32 << " --> " << des2.m_I64 << nl;

        std::cout << " F64_I32: " << std::setprecision(8);
        std::cout << src3.m_F64 << " --> " << des3.m_I32 << nl;

        std::cout << " F64_I64: " << std::setprecision(8);
        std::cout << src4.m_F64 << " --> " << des4.m_I64 << nl;

        std::cout << " F64_F32: ";
        std::cout << std::setprecision(16) << src5.m_F64 << " --> ";
        std::cout << std::setprecision(8) << des5.m_F32 << nl;

        std::cout << " I32_F32: " << std::setprecision(8);
        std::cout << src6.m_I32 << " --> " << des6.m_F32 << nl;

        std::cout << " I64_F64: " << std::setprecision(8);
        std::cout << src7.m_I64 << " --> " << des7.m_F64 << nl;
    }
}

;-------------------------------------------------
;               Ch12_04_fasm.asm
;-------------------------------------------------

MxcsrRcMask     equ 9fffh                   ;bit mask for MXCSR.RC
MxcsrRcShift    equ 13                      ;shift count for MXCSR.RC

;--------------------------------------------------------------------------
; Macro GetRC_M - copies MXCSR.RC to r10d[1:0]
;--------------------------------------------------------------------------

GetRC_M macro
        vstmxcsr dword ptr [rsp+8]          ;save mxcsr register
        mov r10d,[rsp+8]
        shr r10d,MxcsrRcShift               ;r10d[1:0] = MXCSR.RC bits
        and r10d,3                          ;clear unused bits
        endm

;--------------------------------------------------------------------------
; Macro SetRC_M - sets MXCSR.RC to RcReg[1:0]
;--------------------------------------------------------------------------

SetRC_M macro RcReg
        vstmxcsr dword ptr [rsp+8]          ;save current MXCSR
        mov eax,[rsp+8]

        and RcReg,3                         ;clear unused bits
        shl RcReg,MxcsrRcShift              ;RcReg[14:13] = rc

        and eax,MxcsrRcMask                 ;clear old MXCSR.RC bits
        or eax,RcReg                        ;insert new MXCSR.RC

        mov [rsp+8],eax
        vldmxcsr dword ptr [rsp+8]          ;load updated MXCSR
        endm

;--------------------------------------------------------------------------
; extern "C" bool ConvertScalar_Aavx(Uval* des, const Uval* src, CvtOp cvt_op, RC rc)
;
; Note:     This function requires linker option /LARGEADDRESSAWARE:NO
;--------------------------------------------------------------------------

        .code
ConvertScalar_Aavx proc

; Make sure cvt_op is valid
        cmp r8d,CvtOpTableCount             ;is cvt_op >= CvtOpTableCount
        jae BadCvtOp                        ;jump if cvt_op is invalid

; Save current MXCSR.RC
        GetRC_M                             ;r10d = current RC

; Set new rounding mode
        SetRC_M r9d                         ;set new MXCSR.RC

; Jump to target conversion code block
        mov eax,r8d                         ;rax = cvt_op
        jmp [CvtOpTable+rax*8]

; Conversions between int32_t and float/double

I32_F32:
        mov eax,[rdx]                       ;load integer value
        vcvtsi2ss xmm0,xmm0,eax             ;convert to float
        vmovss real4 ptr [rcx],xmm0         ;save result
        jmp Done

F32_I32:
        vmovss xmm0,real4 ptr [rdx]         ;load float value
        vcvtss2si eax,xmm0                  ;convert to integer
        mov [rcx],eax                       ;save result
        jmp Done

I32_F64:
        mov eax,[rdx]                       ;load integer value
        vcvtsi2sd xmm0,xmm0,eax             ;convert to double
        vmovsd real8 ptr [rcx],xmm0         ;save result
        jmp Done

F64_I32:
        vmovsd xmm0,real8 ptr [rdx]         ;load double value
        vcvtsd2si eax,xmm0                  ;convert to integer
        mov [rcx],eax                       ;save result
        jmp Done

; Conversions between int64_t and float/double

I64_F32:
        mov rax,[rdx]                       ;load integer value
        vcvtsi2ss xmm0,xmm0,rax             ;convert to float
        vmovss real4 ptr [rcx],xmm0         ;save result
        jmp Done

F32_I64:
        vmovss xmm0,real4 ptr [rdx]         ;load float value
        vcvtss2si rax,xmm0                  ;convert to integer
        mov [rcx],rax                       ;save result
        jmp Done

I64_F64:
        mov rax,[rdx]                       ;load integer value
        vcvtsi2sd xmm0,xmm0,rax             ;convert to double
        vmovsd real8 ptr [rcx],xmm0         ;save result
        jmp Done

F64_I64:
        vmovsd xmm0,real8 ptr [rdx]         ;load double value
        vcvtsd2si rax,xmm0                  ;convert to integer
        mov [rcx],rax                       ;save result
        jmp Done

; Conversions between float and double

F32_F64:
        vmovss xmm0,real4 ptr [rdx]         ;load float value
        vcvtss2sd xmm1,xmm1,xmm0            ;convert to double
        vmovsd real8 ptr [rcx],xmm1         ;save result
        jmp Done

F64_F32:
        vmovsd xmm0,real8 ptr [rdx]         ;load double value
        vcvtsd2ss xmm1,xmm1,xmm0            ;convert to float
        vmovss real4 ptr [rcx],xmm1         ;save result
        jmp Done

BadCvtOp:
        xor eax,eax                         ;set error return code
        ret

Done:   SetRC_M r10d                        ;restore original MXCSR.RC
        mov eax,1                           ;set success return code
        ret

; The order of values in following table must match enum CvtOp
; that's defined in the .h file.

        align 8
CvtOpTable equ $
        qword I32_F32, F32_I32
        qword I32_F64, F64_I32
        qword I64_F32, F32_I64
        qword I64_F64, F64_I64
        qword F32_F64, F64_F32
CvtOpTableCount equ ($ - CvtOpTable) / size qword

ConvertScalar_Aavx endp
        end

Listing 12-4

Example Ch12_04

Near the top of Listing 12-4 is the definition of a union named Uval. Source code example Ch12_04 uses this union to simplify data exchange between the C++ and assembly language code. Following Uval is an enum named CvtOp, which defines symbolic names for the conversions. Also included in file Ch12_04.h is the enum RC. This type defines symbolic names for the floating-point rounding modes. Recall from the discussions in Chapter 10 that the MXCSR register contains a two-bit field that specifies the rounding method for floating-point operations (see Table 10-4).
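The effect of the rounding mode on a floating-point to integer conversion can also be observed from standard C++ by using the <cfenv> facilities. The sketch below is not part of example Ch12_04; RoundWithMode() is a hypothetical helper that follows the same save, set, convert, and restore pattern. It relies on std::lrint(), which rounds according to the current rounding mode, whereas a C++ cast such as (long)x always truncates toward zero.

```cpp
#include <cfenv>
#include <cmath>

// Converts x to an integer using the specified rounding mode and then
// restores the previous mode. mode is one of FE_TONEAREST, FE_DOWNWARD,
// FE_UPWARD, or FE_TOWARDZERO. Hypothetical helper for illustration.
long RoundWithMode(double x, int mode)
{
    const int old_mode = std::fegetround();     // save current rounding mode
    std::fesetround(mode);                      // set new rounding mode

    // The volatile objects keep the compiler from evaluating the
    // conversion at compile time or moving it across the mode changes.
    volatile double vx = x;
    volatile long result = std::lrint(vx);      // rounds using current mode

    std::fesetround(old_mode);                  // restore original mode
    return result;
}
```

With this helper, a value such as 3.14159 converts to 3 under FE_TONEAREST, FE_DOWNWARD, and FE_TOWARDZERO, and to 4 under FE_UPWARD.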

Also shown in Listing 12-4 is the file Ch12_04.cpp. This file includes the driver function ConvertScalars(), which performs test case initialization and streams results to std::cout. Note that each use of the assembly language function ConvertScalar_Aavx() requires two pointer arguments of type Uval*, one argument of type CvtOp, and one argument of type RC.

Assembly language source code files often employ the equ (equate) directive to define symbolic names for numerical expressions. The equ directive is somewhat analogous to a C++ const definition (e.g., const int x = 100;). The first noncomment statement in Ch12_04_fasm.asm, MxcsrRcMask equ 9fffh, defines a symbolic name for a mask that will be used to modify bits MXCSR.RC. This is followed by another equ directive MxcsrRcShift equ 13 that defines a shift count for bits MXCSR.RC.

Immediately following the two equate statements is the definition of a macro named GetRC_M. A macro is a text substitution mechanism that enables a programmer to represent a sequence of assembly language instructions, data, or other statements using a single text string. Assembly language macros are typically employed to generate sequences of instructions that will be used more than once. Macros are also frequently exercised to factor out and reuse code without the performance overhead of a function call.

Macro GetRC_M emits a sequence of assembly language instructions that obtain the current value of MXCSR.RC. The first instruction of this macro, vstmxcsr dword ptr [rsp+8] (Store MXCSR Register State), saves the contents of register MXCSR on the stack. The reason for saving MXCSR on the stack is that vstmxcsr only supports memory operands. The next instruction, mov r10d,[rsp+8], copies this value from the stack and loads it into register R10D. The ensuing instruction pair, shr r10d,MxcsrRcShift and and r10d,3, relocates the rounding control bits to bits 1:0 of register R10D; all other bits in R10D are set to zero. The text endm is an assembler directive that signifies the end of macro GetRC_M.

Following the definition of macro GetRC_M is another macro named SetRC_M. This macro emits instructions that modify MXCSR.RC. Note that macro SetRC_M includes an argument named RcReg. This is a symbolic name for the general-purpose register that contains the new value for MXCSR.RC. More on this in a moment. Macro SetRC_M also begins with the instruction sequence vstmxcsr dword ptr [rsp+8] and mov eax,[rsp+8] to obtain the current contents of MXCSR. It then employs the instruction pair and RcReg,3 and shl RcReg,MxcsrRcShift. These instructions shift the new bits for MXCSR.RC into the correct position. During macro expansion, the assembler replaces macro argument RcReg with the actual register name as you will soon see. The ensuing and eax,MxcsrRcMask and or eax,RcReg instructions update MXCSR.RC with the new rounding mode. The next instruction pair, mov [rsp+8],eax and vldmxcsr dword ptr [rsp+8] (Load MXCSR Register State), loads the new RC control bits into MXCSR.RC. Note that the instruction sequence used in SetRC_M preserves all other bits in the MXCSR register.

Function ConvertScalar_Aavx() begins its execution with the instruction pair cmp r8d,CvtOpTableCount and jae BadCvtOp that validates argument value cvt_op. If cvt_op is valid, ConvertScalar_Aavx() uses GetRC_M and SetRC_M r9d to modify MXCSR.RC. Note that register R9D contains the new rounding mode. Figure 12-2 contains a portion of the MASM listing file (with some minor edits to improve readability) that shows the expansion of macros GetRC_M and SetRC_M. The MASM listing file denotes macro expanded instructions with a ‘1’ in a column located to the left of each instruction mnemonic. Note that in the expansion of macro SetRC_M, register r9d is substituted for macro argument RcReg.

Figure 12-2
Expansion of macros GetRC_M and SetRC_M

Function ConvertScalar_Aavx() uses argument value cvt_op and a jump table to select a conversion code block. This construct is akin to a C++ switch statement. Immediately after the ret instruction is a jump table named CvtOpTable. The align 8 statement that appears just before the start of CvtOpTable is an assembler directive that instructs the assembler to align the start of CvtOpTable on a quadword boundary. The align 8 directive is used here since CvtOpTable contains quadword elements of labels defined in ConvertScalar_Aavx(). The labels correspond to code blocks that perform a specific numerical conversion. The instruction jmp [CvtOpTable+rax*8] transfers program control to the code block specified by cvt_op, which was copied into RAX. More specifically, execution of the jmp [CvtOpTable+rax*8] instruction loads RIP with the quadword value stored in memory location CvtOpTable + rax * 8.
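The jump table construct described above can be approximated in C++ with an array of function pointers. The sketch below is illustrative only: the operations, the table OpTableSketch, and the function DispatchSketch() are hypothetical stand-ins for the conversion code blocks and CvtOpTable. It mirrors the range check (cmp/jae) and the indexed indirect jump.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical handlers; each stands in for one labeled code block.
static int64_t OpDouble(int64_t x) { return x * 2; }
static int64_t OpSquare(int64_t x) { return x * x; }
static int64_t OpNegate(int64_t x) { return -x; }

// Table of code addresses, analogous to the quadword entries of a jump table.
using OpHandler = int64_t (*)(int64_t);
static const OpHandler OpTableSketch[] = { OpDouble, OpSquare, OpNegate };
constexpr std::size_t OpTableCountSketch = sizeof(OpTableSketch) / sizeof(OpTableSketch[0]);

// Validates op, then transfers control through the table.
// Returns false if op is out of range (the BadCvtOp case).
bool DispatchSketch(uint32_t op, int64_t x, int64_t* result)
{
    assert(result != nullptr);
    if (op >= OpTableCountSketch)       // unsigned compare, like cmp r8d,CvtOpTableCount / jae
        return false;
    *result = OpTableSketch[op](x);     // indexed indirect transfer, like jmp [CvtOpTable+rax*8]
    return true;
}
```

Note that a single unsigned comparison rejects every out-of-range index, including values that would be negative if interpreted as signed integers.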

Each conversion code block in ConvertScalar_Aavx() uses a different AVX instruction to carry out a specific conversion operation. For example, the code block that follows label I32_F32 uses the instruction vcvtsi2ss (Convert Doubleword Integer to SPFP Value) to convert a 32-bit signed integer to single-precision floating-point. Table 12-3 summarizes the scalar floating-point conversion instructions used in example Ch12_04.

Table 12-3

AVX Scalar Floating-Point Conversion Instructions

Instruction Mnemonic	Description
vcvtsi2ss	Convert 32- or 64-bit signed integer to SPFP
vcvtsi2sd	Convert 32- or 64-bit signed integer to DPFP
vcvtss2si	Convert SPFP to 32- or 64-bit signed integer
vcvtsd2si	Convert DPFP to 32- or 64-bit signed integer
vcvtss2sd	Convert SPFP to DPFP
vcvtsd2ss	Convert DPFP to SPFP

The last instruction of each conversion code block is a jmp Done instruction. The label Done is located near the end of function ConvertScalar_Aavx(). At label Done, function ConvertScalar_Aavx() uses SetRC_M r10d to restore the original value of MXCSR.RC. The Visual C++ calling convention requires MXCSR.RC to be preserved across function boundaries. You will learn more about this later in this chapter. Here are the results for source code example Ch12_04:

----- Results for ConvertScalars() -----

Rounding control = Nearest

F32_I32: 3.14159274  3

F32_I64: -2.71828175 --> -3

F64_I32: 1.41421356 --> 1

F64_I64: 0.70710678 --> 1

F64_F32: 1.0000000000000002 --> 1.00000000

I32_F32: 2147483647 --> 2147483648.00000000

I64_F64: 9223372036854775807 --> 9223372036854775808.00000000

Rounding control = Down

F32_I32: 3.14159274 --> 3

F32_I64: -2.71828175 --> -3

F64_I32: 1.41421356 --> 1

F64_I64: 0.70710678 --> 0

F64_F32: 1.0000000000000002 --> 1.00000000

I32_F32: 2147483647 --> 2147483520.00000000

I64_F64: 9223372036854775807 --> 9223372036854774784.00000000

Rounding control = Up

F32_I32: 3.14159274 --> 4

F32_I64: -2.71828175 --> -2

F64_I32: 1.41421356 --> 2

F64_I64: 0.70710678 --> 1

F64_F32: 1.0000000000000002 --> 1.00000012

I32_F32: 2147483647 --> 2147483648.00000000

I64_F64: 9223372036854775807 --> 9223372036854775808.00000000

Rounding control = Zero

F32_I32: 3.14159274 --> 3

F32_I64: -2.71828175 --> -2

F64_I32: 1.41421356 --> 1

F64_I64: 0.70710678 --> 0

F64_F32: 1.0000000000000002 --> 1.00000000

I32_F32: 2147483647 --> 2147483520.00000000

I64_F64: 9223372036854775807 --> 9223372036854774784.00000000
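The effect of the rounding mode on floating-point to integer conversions can also be observed from C++ using the standard <cfenv> facilities. The following sketch assumes an implementation where std::fesetround() and std::lrint() honor the dynamic rounding mode; the function name RoundWithModeSketch() is hypothetical. The volatile qualifiers are there only to discourage the compiler from evaluating or reordering the conversion at compile time.

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>

// Converts x to an integer using rounding mode rm (FE_TONEAREST, FE_DOWNWARD,
// FE_UPWARD, or FE_TOWARDZERO), then restores the previous rounding mode.
long RoundWithModeSketch(double x, int rm)
{
    const int old_rm = std::fegetround();   // save current rounding mode
    const int rc = std::fesetround(rm);     // select new rounding mode
    assert(rc == 0);
    (void)rc;

    volatile double v = x;                  // force a run-time conversion
    volatile long result = std::lrint(v);   // rounds using the current mode

    std::fesetround(old_rm);                // restore original rounding mode
    return result;
}
```

Calling RoundWithModeSketch(-2.71828175, FE_TOWARDZERO) is expected to yield -2, whereas FE_DOWNWARD yields -3, which matches the pattern in the results above.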

Scalar Floating-Point Arrays

Listing 12-5 shows the source code for example Ch12_05. This example illustrates how to calculate the mean and standard deviation of an array of single-precision floating-point values. Listing 12-5 only shows the assembly language code for example Ch12_05 since most of the other code is identical to what you saw in example Ch03_04. The equations used to calculate the mean and standard deviation are also the same.

;-------------------------------------------------

; Ch12_05_fasm.asm

;-------------------------------------------------

;--------------------------------------------------------------------------

; extern "C" bool CalcMeanF32_Aavx(float* mean, const float* x, size_t n);

;--------------------------------------------------------------------------

.code

CalcMeanF32_Aavx proc

; Make sure n is valid

cmp r8,2 ;is n >= 2?

jae @F ;jump if yes

xor eax,eax ;set error return code

ret

; Initialize

@@: vxorps xmm0,xmm0,xmm0 ;sum = 0.0f

mov rax,-1 ;i = -1

; Sum the elements of x

Loop1: inc rax ;i += 1

cmp rax,r8 ;is i >= n?

jae CalcM ;jump if yes

vaddss xmm0,xmm0,real4 ptr [rdx+rax*4] ;sum += x[i]

jmp Loop1

; Calculate and save the mean

CalcM: vcvtsi2ss xmm1,xmm1,r8 ;convert n to SPFP

vdivss xmm1,xmm0,xmm1 ;xmm1 = mean = sum / n

vmovss real4 ptr [rcx],xmm1 ;save mean

mov eax,1 ;set success return code

ret

CalcMeanF32_Aavx endp

;--------------------------------------------------------------------------

; extern "C" bool CalcStDevF32_Aavx(float* st_dev, const float* x, size_t n, float mean);

;--------------------------------------------------------------------------

CalcStDevF32_Aavx proc

; Make sure n is valid

cmp r8,2 ;is n >= 2?

jae @F ;jump if yes

xor eax,eax ;set error return code

ret

; Initialize

@@: vxorps xmm0,xmm0,xmm0 ;sum_squares = 0.0f

mov rax,-1 ;i = -1

; Sum the elements of x

Loop1: inc rax ;i += 1

cmp rax,r8 ;is i >= n?

jae CalcSD ;jump if yes

vmovss xmm1,real4 ptr [rdx+rax*4] ;xmm1 = x[i]

vsubss xmm2,xmm1,xmm3 ;xmm2 = x[i] - mean

vmulss xmm2,xmm2,xmm2 ;xmm2 = (x[i] - mean) ** 2

vaddss xmm0,xmm0,xmm2 ;update sum_squares

jmp Loop1

; Calculate and save standard deviation

CalcSD: dec r8 ;r8 = n - 1

vcvtsi2ss xmm1,xmm1,r8 ;convert n - 1 to SPFP

vdivss xmm0,xmm0,xmm1 ;xmm0 = sum_squares / (n - 1)

vsqrtss xmm0,xmm0,xmm0 ;xmm0 = st_dev

vmovss real4 ptr [rcx],xmm0 ;save st_dev

mov eax,1 ;set success return code

ret

CalcStDevF32_Aavx endp

end

Listing 12-5

Example Ch12_05

Listing 12-5 begins with the definition of assembly language function CalcMeanF32_Aavx(). The first code block of this function verifies that n >= 2 is true. Following validation of n, CalcMeanF32_Aavx() uses a vxorps xmm0,xmm0,xmm0 (Bitwise Logical XOR of Packed SPFP Values) instruction to set sum = 0.0. The next instruction, mov rax,-1, initializes loop index variable i to -1. Each iteration of Loop1 begins with an inc rax instruction that calculates i += 1. The ensuing instruction pair, cmp rax,r8 and jae CalcM, terminates Loop1 when i >= n is true. The vaddss xmm0,xmm0,real4 ptr [rdx+rax*4] instruction computes sum += x[i]. Following the calculation of sum, CalcMeanF32_Aavx() converts n to a single-precision floating-point value using the AVX instruction vcvtsi2ss xmm1,xmm1,r8. The next two instructions, vdivss xmm1,xmm0,xmm1 and vmovss real4 ptr [rcx],xmm1, calculate and save mean.

Function CalcStDevF32_Aavx() uses a similar for-loop construct to calculate the standard deviation. Inside Loop1, CalcStDevF32_Aavx() calculates sum_squares using the AVX instructions vsubss, vmulss, and vaddss. Note that argument value mean was passed in register XMM3. Following execution of Loop1, CalcStDevF32_Aavx() calculates the standard deviation using the instructions dec r8 (to calculate n - 1), vcvtsi2ss, vdivss, and vsqrtss.
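For comparison, the calculations performed by the two assembly language functions can be sketched in C++ as follows. This is not the book's C++ code from example Ch03_04; the function names carry a hypothetical _Sketch suffix, and the code simply follows the assembly language instruction sequence (validate n, sum, divide, and for the standard deviation divide by n - 1 before taking the square root).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Mean of x[0..n-1]; returns false if n < 2 (same check as cmp r8,2 / jae).
bool CalcMeanF32_Sketch(float* mean, const float* x, std::size_t n)
{
    assert(mean != nullptr && x != nullptr);
    if (n < 2)
        return false;

    float sum = 0.0f;                               // vxorps xmm0,xmm0,xmm0
    for (std::size_t i = 0; i < n; i++)
        sum += x[i];                                // vaddss

    *mean = sum / static_cast<float>(n);            // vcvtsi2ss + vdivss
    return true;
}

// Sample standard deviation of x[0..n-1] using a previously calculated mean.
bool CalcStDevF32_Sketch(float* st_dev, const float* x, std::size_t n, float mean)
{
    assert(st_dev != nullptr && x != nullptr);
    if (n < 2)
        return false;

    float sum_squares = 0.0f;
    for (std::size_t i = 0; i < n; i++)
    {
        float t = x[i] - mean;                      // vsubss
        sum_squares += t * t;                       // vmulss + vaddss
    }

    *st_dev = std::sqrt(sum_squares / static_cast<float>(n - 1));   // vdivss + vsqrtss
    return true;
}
```

For the five values 1, 2, 3, 4, 5, this sketch produces a mean of 3 and a standard deviation of approximately 1.5811.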

The assembly language code in Listing 12-5 can be easily modified to create the double-precision counterpart functions CalcMeanF64_Aavx() and CalcStDevF64_Aavx(). Simply switch the single-precision (ss suffix) instructions to their double-precision (sd suffix) counterparts. Instructions that reference operands in memory will also need to use real8 ptr and a scale factor of 8 instead of real4 ptr and 4. Here are the results for source code example Ch12_05:

Results for CalcMeanF32_Cpp and CalcStDevF32_Cpp

mean1: 49.602146 st_dev1: 27.758242

Results for CalcMeanF32_Aavx and CalcStDevF32_Aavx

mean2: 49.602146 st_dev2: 27.758242

Calling Convention: Part 2

The source code presented thus far has informally discussed various aspects of the Visual C++ calling convention. In this section, the calling convention is formally explained. It reiterates some earlier elucidations and introduces new requirements that have not been discussed. A basic understanding of the calling convention is necessary since it is used extensively in subsequent chapters that explain x86-AVX SIMD programming using x86-64 assembly language.

Note

As a reminder, if you are reading this book to learn x86-64 assembly language programming and plan on using it with a different operating system or high-level language, you should consult the appropriate documentation for more information regarding the particulars of that calling convention.

The Visual C++ calling convention designates each x86-64 processor general-purpose register as volatile or nonvolatile. It also applies a volatile or nonvolatile classification to each XMM register. An x86-64 assembly language function can modify the contents of any volatile register but must preserve the contents of any nonvolatile register it uses. Table 12-4 lists the volatile and nonvolatile general-purpose and XMM registers.

Table 12-4

Visual C++ 64-Bit Volatile and Nonvolatile Registers

Register Group	Volatile Registers	Nonvolatile Registers
General-purpose	RAX, RCX, RDX, R8, R9, R10, R11	RBX, RSI, RDI, RBP, RSP, R12, R13, R14, R15
Floating point and SIMD	XMM0–XMM5	XMM6–XMM15

On systems that support AVX or AVX2, the high-order 128 bits of registers YMM0–YMM15 are classified as volatile. Similarly, the high-order 384 bits of registers ZMM0–ZMM15 are classified as volatile on systems that support AVX-512. Registers ZMM16–ZMM31 and their corresponding YMM and XMM registers are also designated as volatile and need not be preserved. The legacy x87 FPU register stack is classified as volatile. All control bits in RFLAGS and MXCSR must be preserved across function boundaries. For example, assume function Foo() changes MXCSR.RC prior to performing a floating-point calculation. It then needs to call the C++ library function cos() to perform another calculation. Function Foo() must restore the original contents of MXCSR.RC before calling cos().
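The Foo() scenario can be sketched in C++ using a small scope guard built on the standard <cfenv> functions. The class RoundingModeGuardSketch and function FooSketch() are hypothetical names used for illustration, and the sketch assumes an implementation where std::fesetround() changes the rounding mode used by std::nearbyint(). The guard's destructor restores the original rounding mode before cos() is called.

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>

// Saves the current rounding mode, selects a new one, and restores the
// saved mode when the guard goes out of scope.
class RoundingModeGuardSketch
{
public:
    explicit RoundingModeGuardSketch(int new_mode) : m_saved(std::fegetround())
    {
        const int rc = std::fesetround(new_mode);
        assert(rc == 0);
        (void)rc;
    }

    ~RoundingModeGuardSketch() { std::fesetround(m_saved); }

    RoundingModeGuardSketch(const RoundingModeGuardSketch&) = delete;
    RoundingModeGuardSketch& operator=(const RoundingModeGuardSketch&) = delete;

private:
    int m_saved;
};

double FooSketch(double x)
{
    double t;
    {
        RoundingModeGuardSketch guard(FE_TOWARDZERO);   // change rounding mode
        volatile double v = x;                          // force run-time evaluation
        volatile double r = std::nearbyint(v);          // rounds toward zero here
        t = r;
    }                                                   // original mode restored here
    return std::cos(t);                                 // library call sees original mode
}
```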

The programming requirements imposed on an x86-64 assembly language function by the Visual C++ calling convention vary depending on whether the function is a leaf or nonleaf function. Leaf functions are functions that

Do not call any other functions
Do not modify the contents of register RSP
Do not allocate any local stack space
Do not modify any of the nonvolatile general-purpose or XMM registers
Do not use exception handling

X86-64 assembly language leaf functions are easier to code, but they are only suitable for relatively simple computations. A nonleaf function can use the entire x86-64 register set, create a stack frame, or allocate local stack space. The preservation of nonvolatile registers and local stack space allocation is typically performed at the beginning of a function in a code block known as the prologue. Functions that utilize a prologue must also include a corresponding epilogue. A function epilogue releases any locally allocated stack space and restores any prologue preserved nonvolatile registers.

In the remainder of this section, you will examine four source code examples. The first three examples illustrate how to code nonleaf functions using explicit x86-64 assembly language instructions and assembler directives. These examples also convey critical programming information regarding the organization of a nonleaf function stack frame. The fourth example demonstrates how to use several prologue and epilogue macros. These macros help automate most of the programming labor that is associated with a nonleaf function. The source code listings in this section include only the C++ header file and the x86-64 assembly language code. The C++ code that performs test case initialization, argument checking, displaying of results, etc., is not shown to streamline the elucidations. The software download package includes the complete source code for each example.

Stack Frames

Listing 12-6 shows the source code for example Ch12_06. This example demonstrates how to create and use a stack frame pointer in an assembly language function. Source code example Ch12_06 also illustrates some of the programming protocols that an assembly language function prologue and epilogue must observe.

//------------------------------------------------

// Ch12_06.h

//------------------------------------------------

#pragma once

#include <cstdint>

// Ch12_06_fasm.asm

extern "C" int64_t SumIntegers_A(int8_t a, int16_t b, int32_t c, int64_t d,

int8_t e, int16_t f, int32_t g, int64_t h);

;-------------------------------------------------

; Ch12_06_fasm.asm

;-------------------------------------------------

;--------------------------------------------------------------------------

; extern "C" int64_t SumIntegers_A(int8_t a, int16_t b, int32_t c, int64_t d,

; int8_t e, int16_t f, int32_t g, int64_t h);

;--------------------------------------------------------------------------

; Named expressions for constant values:

;

; RBP_RA = number of bytes between RBP and return address on stack

; STK_LOCAL = size of local stack space

RBP_RA = 24

STK_LOCAL = 16

.code

SumIntegers_A proc frame

; Function prologue

push rbp ;save caller's rbp register

.pushreg rbp

sub rsp,STK_LOCAL ;allocate local stack space

.allocstack STK_LOCAL

mov rbp,rsp ;set frame pointer

.setframe rbp,0

.endprolog ;mark end of prologue

; Save argument registers to home area (optional)

mov [rbp+RBP_RA+8],rcx

mov [rbp+RBP_RA+16],rdx

mov [rbp+RBP_RA+24],r8

mov [rbp+RBP_RA+32],r9

; Calculate a + b + c + d

movsx rcx,cl ;rcx = a

movsx rdx,dx ;rdx = b

movsxd r8,r8d ;r8 = c

add rcx,rdx ;rcx = a + b

add r8,r9 ;r8 = c + d

add r8,rcx ;r8 = a + b + c + d

mov [rbp],r8 ;save a + b + c + d on stack

; Calculate e + f + g + h

movsx rcx,byte ptr [rbp+RBP_RA+40] ;rcx = e

movsx rdx,word ptr [rbp+RBP_RA+48] ;rdx = f

movsxd r8,dword ptr [rbp+RBP_RA+56] ;r8 = g

add rcx,rdx ;rcx = e + f

add r8,qword ptr [rbp+RBP_RA+64] ;r8 = g + h

add r8,rcx ;r8 = e + f + g + h

; Compute final sum

mov rax,[rbp] ;rax = a + b + c + d

add rax,r8 ;rax = final sum

; Function epilogue

add rsp,16 ;release local stack space

pop rbp ;restore caller's rbp register

ret

SumIntegers_A endp

end

Listing 12-6

Example Ch12_06

Functions that need to reference both argument values and local variables on the stack often create a stack frame during execution of their prologues. During creation of a stack frame, register RBP is typically initialized as a stack frame pointer. Following stack frame initialization, the remaining code in a function can access items on the stack using RBP as a base register.

Near the top of file Ch12_06_fasm.asm are the statements RBP_RA = 24 and STK_LOCAL = 16. The = symbol is an assembler directive that defines a symbolic name for a numerical value. Unlike the equ directive, symbolic names defined using the = directive can be redefined. RBP_RA denotes the number of bytes between RBP and the return address on stack (it also equals the number of extra bytes needed to reference the stack home area). STK_LOCAL represents the number of bytes allocated on the stack for local storage. More on these values in a moment.

Following definition of RBP_RA and STK_LOCAL is the statement SumIntegers_A proc frame, which defines the beginning of function SumIntegers_A(). The frame attribute notifies the assembler that the function SumIntegers_A uses a stack frame pointer. It also instructs the assembler to generate static table data that the Visual C++ runtime environment uses to process exceptions. The ensuing push rbp instruction saves the caller’s RBP register on the stack since function SumIntegers_A() uses this register as its stack frame pointer. The .pushreg rbp statement that follows is an assembler directive that saves offset information about the push rbp instruction in an assembler-maintained exception handling table (see example Ch11_08 for more information about why this is necessary). It is important to keep in mind that assembler directives are not executable instructions; they are directions to the assembler on how to perform specific actions during assembly of the source code.

The sub rsp,STK_LOCAL instruction allocates STK_LOCAL bytes of space on the stack for local variables. Function SumIntegers_A() only uses eight bytes of this space, but the Visual C++ calling convention for 64-bit programs requires nonleaf functions to maintain double quadword (16-byte) alignment of the stack pointer outside of the prologue. You will learn more about stack pointer alignment requirements later in this section. The next statement, .allocstack STK_LOCAL, is an assembler directive that saves local stack size allocation information in the Visual C++ runtime exception handling tables.

The mov rbp,rsp instruction initializes register RBP as the stack frame pointer, and the .setframe rbp,0 directive notifies the assembler of this action. The offset value 0 that is included in the .setframe directive is the difference in bytes between RSP and RBP. In function SumIntegers_A(), registers RSP and RBP are the same, so the offset value is zero. Later in this section, you will learn more about the .setframe directive. It should be noted that x86-64 assembly language functions can use any nonvolatile register as a stack frame pointer. Using RBP provides consistency between x86-64 and x86-32 assembly language code, which uses register EBP. The final assembler directive, .endprolog, signifies the end of the prologue for function SumIntegers_A(). Figure 12-3 shows the stack layout and argument registers following execution of the prologue.

Figure 12-3
Stack layout and registers of function SumIntegers_A() following execution of the prologue

The next code block contains a series of mov instructions that save registers RCX, RDX, R8, and R9 to their respective home areas on the stack. This step is optional and included in SumIntegers_A() for demonstration purposes. Note that the offset of each mov instruction includes the symbolic constant RBP_RA. Another option allowed by the Visual C++ calling convention is to save an argument register to its corresponding home area prior to the push rbp instruction using RSP as a base register (e.g., mov [rsp+8],rcx, mov [rsp+16],rdx, and so on). Also keep in mind that a function can use its home area to store other temporary values. When used for alternative storage purposes, the home area should not be referenced by an assembly language instruction until after the .endprolog directive per the Visual C++ calling convention.

Following the home area save operation, the function SumIntegers_A() sums argument values a, b, c, and d. It then saves this intermediate sum to LocalVar1 on the stack using a mov [rbp],r8 instruction. Note that the summation calculation sign-extends argument values a, b, and c using a movsx or movsxd instruction. A similar sequence of instructions is used to sum argument values e, f, g, and h, which are located on the stack and referenced using the stack frame pointer RBP and a constant offset. The symbolic name RBP_RA is also used here to account for the extra stack space needed to reference argument values on the stack. The two intermediate sums are then added to produce the final sum in register RAX.
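The stack offsets used in SumIntegers_A() follow from simple arithmetic, which the C++ sketch below reproduces. The constant PUSH_BYTES and the helper names are hypothetical; STK_LOCAL and RBP_RA carry the same values as in the listing. The alignment check assumes that RSP equals 8 modulo 16 on entry, since the call instruction pushed an 8-byte return address onto a 16-byte aligned stack.

```cpp
#include <cassert>
#include <cstdint>

// Values taken from Ch12_06_fasm.asm.
constexpr uint64_t STK_LOCAL = 16;                      // sub rsp,STK_LOCAL
constexpr uint64_t PUSH_BYTES = 8;                      // push rbp (hypothetical name)
constexpr uint64_t RBP_RA = PUSH_BYTES + STK_LOCAL;     // bytes from RBP to return address

static_assert(RBP_RA == 24, "matches RBP_RA = 24 in the listing");

// Offset from RBP of argument arg_index (0 = a ... 7 = h). Indices 0-3 are
// the home area slots for RCX, RDX, R8, and R9; indices 4-7 are the
// stack-passed arguments e, f, g, and h.
constexpr uint64_t ArgOffsetFromRbp(uint64_t arg_index)
{
    return RBP_RA + 8 + 8 * arg_index;
}

// True if RSP is 16-byte aligned after the prologue has executed.
constexpr bool RspAlignedAfterPrologue(uint64_t rsp_at_entry)
{
    return ((rsp_at_entry - PUSH_BYTES - STK_LOCAL) % 16) == 0;
}

static_assert(ArgOffsetFromRbp(0) == RBP_RA + 8, "home area slot for RCX");
static_assert(ArgOffsetFromRbp(4) == RBP_RA + 40, "argument e");
static_assert(ArgOffsetFromRbp(7) == RBP_RA + 64, "argument h");
```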

A function epilogue must release any local stack storage space that was allocated in the prologue, restore any nonvolatile registers that were saved on the stack, and execute a function return. The add rsp,16 instruction releases the 16 bytes of stack space that SumIntegers_A() allocated in its prologue. This is followed by a pop rbp instruction, which restores the caller’s RBP register. The obligatory ret instruction is next. Here are the results for source code example Ch12_06:

----- Results for SumIntegers_A -----

a: 10

b: -200

c: -300

d: 4000

e: -20

f: 400

g: -600

h: -8000

sum: -4710
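The result above can be cross-checked with a short C++ sketch. The function name SumIntegers_Sketch() is hypothetical; each static_cast to int64_t plays the role of the movsx or movsxd sign-extension in the assembly language code.

```cpp
#include <cassert>
#include <cstdint>

// C++ counterpart of SumIntegers_A() (illustrative sketch only).
int64_t SumIntegers_Sketch(int8_t a, int16_t b, int32_t c, int64_t d,
                           int8_t e, int16_t f, int32_t g, int64_t h)
{
    // a + b + c + d: sign-extend the narrow arguments before adding
    int64_t sum1 = static_cast<int64_t>(a) + static_cast<int64_t>(b)
                 + static_cast<int64_t>(c) + d;

    // e + f + g + h: same pattern for the stack-passed arguments
    int64_t sum2 = static_cast<int64_t>(e) + static_cast<int64_t>(f)
                 + static_cast<int64_t>(g) + h;

    return sum1 + sum2;
}
```

Using the argument values shown in the results, the two intermediate sums are 3510 and -8220, which gives the final sum of -4710.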

Using Nonvolatile General-Purpose Registers

The next source code example, Ch12_07, demonstrates how to use nonvolatile general-purpose registers in an x86-64 assembly language function. It also provides additional programming details regarding stack frames and the use of local variables. Listing 12-7 shows the header file and assembly language source code for source code example Ch12_07.

//------------------------------------------------

// Ch12_07.h

//------------------------------------------------

#pragma once

#include <cstdint>

// Ch12_07_fasm.asm

extern "C" void CalcSumProd_A(const int64_t* a, const int64_t* b, int32_t n,

int64_t* sum_a, int64_t* sum_b, int64_t* prod_a, int64_t* prod_b);

;-------------------------------------------------

; Ch12_07_fasm.asm

;-------------------------------------------------

;--------------------------------------------------------------------------

; extern "C" void CalcSumProd_A(const int64_t* a, const int64_t* b, int32_t n,

; int64_t* sum_a, int64_t* sum_b, int64_t* prod_a, int64_t* prod_b);

;--------------------------------------------------------------------------

; Named expressions for constant values:

;

; NUM_PUSHREG = number of prolog non-volatile register pushes

; STK_LOCAL1 = size in bytes of STK_LOCAL1 area (see figure in text)

; STK_LOCAL2 = size in bytes of STK_LOCAL2 area (see figure in text)

; STK_PAD = extra bytes (0 or 8) needed to 16-byte align RSP

; STK_TOTAL = total size in bytes of local stack

; RBP_RA = number of bytes between RBP and return address on stack

NUM_PUSHREG = 4

STK_LOCAL1 = 32

STK_LOCAL2 = 16

STK_PAD = ((NUM_PUSHREG AND 1) XOR 1) * 8

STK_TOTAL = STK_LOCAL1 + STK_LOCAL2 + STK_PAD

RBP_RA = NUM_PUSHREG * 8 + STK_LOCAL1 + STK_PAD

.const

TestVal db 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15

.code

CalcSumProd_A proc frame

; Function prologue

push rbp ;save non-volatile register RBP

.pushreg rbp

push rbx ;save non-volatile register RBX

.pushreg rbx

push r12 ;save non-volatile register R12

.pushreg r12

push r13 ;save non-volatile register R13

.pushreg r13

sub rsp,STK_TOTAL ;allocate local stack space

.allocstack STK_TOTAL

lea rbp,[rsp+STK_LOCAL2] ;set frame pointer

.setframe rbp,STK_LOCAL2

.endprolog ;end of prologue

; Initialize local variables on the stack (demonstration only)

vmovdqu xmm5, xmmword ptr [TestVal]

vmovdqa xmmword ptr [rbp-16],xmm5 ;save xmm5 to LocalVar2A/2B

mov qword ptr [rbp],0aah ;save 0xaa to LocalVar1A

mov qword ptr [rbp+8],0bbh ;save 0xbb to LocalVar1B

mov qword ptr [rbp+16],0cch ;save 0xcc to LocalVar1C

mov qword ptr [rbp+24],0ddh ;save 0xdd to LocalVar1D

; Save argument values to home area (optional)

mov qword ptr [rbp+RBP_RA+8],rcx

mov qword ptr [rbp+RBP_RA+16],rdx

mov qword ptr [rbp+RBP_RA+24],r8

mov qword ptr [rbp+RBP_RA+32],r9

; Perform required initializations for processing loop

test r8d,r8d ;is n <= 0?

jle Done ;jump if n <= 0

mov rbx,-8 ;rbx = offset to array elements

xor r10,r10 ;r10 = sum_a

xor r11,r11 ;r11 = sum_b

mov r12,1 ;r12 = prod_a

mov r13,1 ;r13 = prod_b

; Compute the array sums and products

@@: add rbx,8 ;rbx = offset to next elements

mov rax,[rcx+rbx] ;rax = a[i]

add r10,rax ;update sum_a

imul r12,rax ;update prod_a

mov rax,[rdx+rbx] ;rax = b[i]

add r11,rax ;update sum_b

imul r13,rax ;update prod_b

dec r8d ;adjust count

jnz @B ;repeat until done

; Save the final results

mov [r9],r10 ;save sum_a

mov rax,[rbp+RBP_RA+40] ;rax = ptr to sum_b

mov [rax],r11 ;save sum_b

mov rax,[rbp+RBP_RA+48] ;rax = ptr to prod_a

mov [rax],r12 ;save prod_a

mov rax,[rbp+RBP_RA+56] ;rax = ptr to prod_b

mov [rax],r13 ;save prod_b

; Function epilogue

Done: lea rsp,[rbp+STK_LOCAL1+STK_PAD] ;restore rsp

pop r13 ;restore non-volatile GP registers

pop r12

pop rbx

pop rbp

ret

CalcSumProd_A endp

end

Listing 12-7

Example Ch12_07

Toward the top of the assembly language code is a series of named constants that control how much stack space is allocated in the prologue of function CalcSumProd_A(). Like the previous example, the function CalcSumProd_A() includes the frame attribute as part of its proc statement to indicate that it uses a stack frame pointer. A series of push instructions saves nonvolatile registers RBP, RBX, R12, and R13 on the stack. Note that a .pushreg directive follows each x86-64 push instruction, which instructs the assembler to add information about each push instruction to the Visual C++ runtime exception handling tables.

A sub rsp,STK_TOTAL instruction allocates space on the stack for local variables, and the required .allocstack STK_TOTAL directive follows next. Register RBP is then initialized as the function’s stack frame pointer using a lea rbp,[rsp+STK_LOCAL2] (Load Effective Address) instruction, which loads rsp + STK_LOCAL2 into register RBP. Figure 12-4 illustrates the layout of the stack following execution of the lea instruction. Positioning RBP so that it “splits” the local stack area into two sections enables the assembler to generate machine code that is slightly more efficient since a larger portion of the local stack area can be referenced using 8-bit signed instead of 32-bit signed displacements. It also simplifies the saving and restoring of nonvolatile XMM registers, which is discussed later in this chapter. Following the lea instruction is a .setframe rbp,STK_LOCAL2 directive that enables the assembler to properly configure the runtime exception handling tables. The size parameter of a .setframe directive must be an even multiple of 16 and less than or equal to 240. The .endprolog directive signifies the end of the prologue for function CalcSumProd_A().

Figure 12-4
Stack layout and argument registers following execution of lea rbp,[rsp+STK_LOCAL2] in function CalcSumProd_A()
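The named constants at the top of Ch12_07_fasm.asm can be evaluated by hand, and the C++ sketch below carries out the same arithmetic. The structure FrameLayoutSketch and both functions are hypothetical names used for illustration. The alignment check assumes that RSP equals 8 modulo 16 on function entry because of the return address that the call instruction pushed.

```cpp
#include <cassert>
#include <cstdint>

struct FrameLayoutSketch
{
    uint64_t stk_pad;       // STK_PAD
    uint64_t stk_total;     // STK_TOTAL
    uint64_t rbp_ra;        // RBP_RA
};

// Evaluates the same expressions as the named constants in the listing.
FrameLayoutSketch CalcFrameLayoutSketch(uint64_t num_pushreg, uint64_t stk_local1,
                                        uint64_t stk_local2)
{
    // Both local areas must be multiples of 16 for the padding rule to work.
    assert(stk_local1 % 16 == 0 && stk_local2 % 16 == 0);

    FrameLayoutSketch f;
    f.stk_pad = ((num_pushreg & 1) ^ 1) * 8;                // 8 if push count is even
    f.stk_total = stk_local1 + stk_local2 + f.stk_pad;
    f.rbp_ra = num_pushreg * 8 + stk_local1 + f.stk_pad;
    return f;
}

// True if RSP is 16-byte aligned after the pushes and sub rsp,STK_TOTAL.
bool IsRspAlignedSketch(uint64_t num_pushreg, const FrameLayoutSketch& f)
{
    uint64_t bytes_below_entry = 8 + num_pushreg * 8 + f.stk_total; // 8 = return address
    return (bytes_below_entry % 16) == 0;
}
```

For CalcSumProd_A(), where NUM_PUSHREG = 4, STK_LOCAL1 = 32, and STK_LOCAL2 = 16, the sketch yields STK_PAD = 8, STK_TOTAL = 56, and RBP_RA = 72.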

The next code block contains instructions that initialize several local variables on the stack. These instructions are for demonstration purposes only. Note that the vmovdqa [rbp-16],xmm5 (Move Aligned Packed Integer Values) instruction requires its destination operand to be aligned on a 16-byte boundary. Following initialization of the local variables, the argument registers are saved to their home locations, also just for demonstration purposes.

Function CalcSumProd_A() computes sums and products using the elements of two integer arrays. Prior to the start of the for-loop, the instruction pair test r8d,r8d (Logical Compare) and jle Done skips over the for-loop if n <= 0 is true. The test instruction performs a bitwise logical AND of its two operands and updates the status flags in RFLAGS; the result of the bitwise AND operation is discarded. Following validation of argument value n, the function CalcSumProd_A() initializes the intermediate values sum_a (R10) and sum_b (R11) to zero and prod_a (R12) and prod_b (R13) to one. It then calculates the sum and product of the input arrays a and b. The results are saved to the memory locations specified by the caller. Note that the pointers for sum_b, prod_a, and prod_b were passed to CalcSumProd_A() via the stack as shown in Figure 12-4.

The epilogue of function CalcSumProd_A() begins with a lea rsp,[rbp+STK_LOCAL1+STK_PAD] instruction that restores register RSP to the value it had immediately after execution of the push r13 instruction in the prologue. When restoring RSP in an epilogue, the Visual C++ calling convention specifies that either a lea rsp,[RFP+X] or add rsp,X instruction must be used, where RFP denotes the frame pointer register and X is a constant value. This limits the number of instruction patterns that the runtime exception handler must identify. The subsequent pop instructions restore the nonvolatile general-purpose registers prior to execution of the ret instruction. According to the Visual C++ calling convention, function epilogues must be void of any processing logic including the setting of a return value. Here are the results for source code example Ch12_07:

----- Results for CalcSumProd_A -----

i: 0 a: 2 b: 3

i: 1 a: -2 b: 5

i: 2 a: -6 b: -7

i: 3 a: 7 b: 8

i: 4 a: 12 b: 4

i: 5 a: 5 b: 9

sum_a = 18 sum_b = 22

prod_a = 10080 prod_b = -30240
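The results above can be verified with a C++ sketch of the same calculation. The function name CalcSumProd_Sketch() is hypothetical. Like the assembly language function, the sketch leaves the output values untouched when n <= 0.

```cpp
#include <cassert>
#include <cstdint>

// C++ counterpart of CalcSumProd_A() (illustrative sketch only).
void CalcSumProd_Sketch(const int64_t* a, const int64_t* b, int32_t n,
                        int64_t* sum_a, int64_t* sum_b,
                        int64_t* prod_a, int64_t* prod_b)
{
    assert(a != nullptr && b != nullptr);

    if (n <= 0)                     // test r8d,r8d / jle Done
        return;

    int64_t sa = 0, sb = 0;         // sums start at zero
    int64_t pa = 1, pb = 1;         // products start at one

    for (int32_t i = 0; i < n; i++)
    {
        sa += a[i];                 // add r10,rax
        pa *= a[i];                 // imul r12,rax
        sb += b[i];                 // add r11,rax
        pb *= b[i];                 // imul r13,rax
    }

    *sum_a = sa;
    *sum_b = sb;
    *prod_a = pa;
    *prod_b = pb;
}
```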

Using Nonvolatile SIMD Registers

Earlier in this chapter, you learned how to use XMM registers to perform scalar floating-point arithmetic. The next source code example, named Ch12_08, illustrates the prologue and epilogue conventions that must be observed before a function can use any of the nonvolatile XMM registers. Listing 12-8 shows the source code for example Ch12_08.

//------------------------------------------------

// Ch12_08.h

//------------------------------------------------

#pragma once

// Ch12_08_fcpp.cpp

extern bool CalcConeAreaVol_Cpp(const double* r, const double* h, int n,

double* sa_cone, double* vol_cone);

// Ch12_08_fasm.asm

extern "C" bool CalcConeAreaVol_A(const double* r, const double* h, int n,

double* sa_cone, double* vol_cone);

;-------------------------------------------------

; Ch12_08_fasm.asm

;-------------------------------------------------

;--------------------------------------------------------------------------

; extern "C" bool CalcConeAreaVol_A(const double* r, const double* h, int n,

; double* sa_cone, double* vol_cone);

;--------------------------------------------------------------------------

; Named expressions for constant values

;

; NUM_PUSHREG = number of prolog non-volatile register pushes

; STK_LOCAL1 = size in bytes of STK_LOCAL1 area (see figure in text)

; STK_LOCAL2 = size in bytes of STK_LOCAL2 area (see figure in text)

; STK_PAD = extra bytes (0 or 8) needed to 16-byte align RSP

; STK_TOTAL = total size in bytes of local stack

; RBP_RA = number of bytes between RBP and ret addr on stack

NUM_PUSHREG = 7

STK_LOCAL1 = 16

STK_LOCAL2 = 64

STK_PAD = ((NUM_PUSHREG AND 1) XOR 1) * 8

STK_TOTAL = STK_LOCAL1 + STK_LOCAL2 + STK_PAD

RBP_RA = NUM_PUSHREG * 8 + STK_LOCAL1 + STK_PAD

.const

r8_3p0 real8 3.0

r8_pi real8 3.14159265358979323846

.code

CalcConeAreaVol_A proc frame

; Save non-volatile general-purpose registers

push rbp

.pushreg rbp

push rbx

.pushreg rbx

push rsi

.pushreg rsi

push r12

.pushreg r12

push r13

.pushreg r13

push r14

.pushreg r14

push r15

.pushreg r15

; Allocate local stack space and initialize frame pointer

sub rsp,STK_TOTAL ;allocate local stack space

.allocstack STK_TOTAL

lea rbp,[rsp+STK_LOCAL2] ;rbp = stack frame pointer

.setframe rbp,STK_LOCAL2

; Save non-volatile registers XMM12 - XMM15. Note that STK_LOCAL2 must

; be greater than or equal to the number of XMM register saves times 16.

vmovdqa xmmword ptr [rbp-STK_LOCAL2+48],xmm12

.savexmm128 xmm12,48

vmovdqa xmmword ptr [rbp-STK_LOCAL2+32],xmm13

.savexmm128 xmm13,32

vmovdqa xmmword ptr [rbp-STK_LOCAL2+16],xmm14

.savexmm128 xmm14,16

vmovdqa xmmword ptr [rbp-STK_LOCAL2],xmm15

.savexmm128 xmm15,0

.endprolog

; Access local variables on the stack (demonstration only)

mov qword ptr [rbp],-1 ;LocalVar1A = -1

mov qword ptr [rbp+8],-2 ;LocalVar1B = -2

; Initialize the processing loop variables. Note that many of the

; register initializations below are performed merely to illustrate

; use of the non-volatile GP and XMM registers.

mov esi,r8d ;esi = n

test esi,esi ;is n > 0?

jg @F ;jump if n > 0

xor eax,eax ;set error return code

jmp Done

@@: mov rbx,-8 ;rbx = offset to array elements

mov r12,rcx ;r12 = ptr to r

mov r13,rdx ;r13 = ptr to h

mov r14,r9 ;r14 = ptr to sa_cone

mov r15,[rbp+RBP_RA+40] ;r15 = ptr to vol_cone

vmovsd xmm14,real8 ptr [r8_pi] ;xmm14 = pi

vmovsd xmm15,real8 ptr [r8_3p0] ;xmm15 = 3.0

; Calculate cone surface areas and volumes

; sa = pi * r * (r + sqrt(r * r + h * h))

; vol = pi * r * r * h / 3

@@: add rbx,8 ;rbx = offset to next elements

vmovsd xmm0,real8 ptr [r12+rbx] ;xmm0 = r

vmovsd xmm1,real8 ptr [r13+rbx] ;xmm1 = h

vmovsd xmm12,xmm12,xmm0 ;xmm12 = r

vmovsd xmm13,xmm13,xmm1 ;xmm13 = h

vmulsd xmm0,xmm0,xmm0 ;xmm0 = r * r

vmulsd xmm1,xmm1,xmm1 ;xmm1 = h * h

vaddsd xmm0,xmm0,xmm1 ;xmm0 = r * r + h * h

vsqrtsd xmm0,xmm0,xmm0 ;xmm0 = sqrt(r * r + h * h)

vaddsd xmm0,xmm0,xmm12 ;xmm0 = r + sqrt(r * r + h * h)

vmulsd xmm0,xmm0,xmm12 ;xmm0 = r * (r + sqrt(r * r + h * h))

vmulsd xmm0,xmm0,xmm14 ;xmm0 = pi * r * (r + sqrt(r * r + h * h))

vmulsd xmm12,xmm12,xmm12 ;xmm12 = r * r

vmulsd xmm13,xmm13,xmm14 ;xmm13 = h * pi

vmulsd xmm13,xmm13,xmm12 ;xmm13 = pi * r * r * h

vdivsd xmm13,xmm13,xmm15 ;xmm13 = pi * r * r * h / 3

vmovsd real8 ptr [r14+rbx],xmm0 ;save surface area

vmovsd real8 ptr [r15+rbx],xmm13 ;save volume

dec esi ;update counter

jnz @B ;repeat until done

mov eax,1 ;set success return code

; Restore non-volatile XMM registers

Done: vmovdqa xmm12,xmmword ptr [rbp-STK_LOCAL2+48]

vmovdqa xmm13,xmmword ptr [rbp-STK_LOCAL2+32]

vmovdqa xmm14,xmmword ptr [rbp-STK_LOCAL2+16]

vmovdqa xmm15,xmmword ptr [rbp-STK_LOCAL2]

; Restore non-volatile general-purpose registers

lea rsp,[rbp+STK_LOCAL1+STK_PAD] ;restore rsp

pop r15

pop r14

pop r13

pop r12

pop rsi

pop rbx

pop rbp

ret

CalcConeAreaVol_A endp

end

Listing 12-8

Example Ch12_08

The assembly language function CalcConeAreaVol_A() calculates surface areas and volumes of right-circular cones. The following formulas are used to calculate these values:

$sa=\pi r\left(r+\sqrt{r^2+h^2}\right)$

$vol=\pi r^2h/3$
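For comparison, the two formulas can be written as a short C++ routine. The following is only a minimal sketch: the function name ConeAreaVolSketch is hypothetical, and the code is not the book's CalcConeAreaVol_Cpp() (whose source is not shown here). It mirrors the n > 0 check that the assembly language code performs.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical reference routine (name is illustrative, not from the book).
// Computes sa = pi * r * (r + sqrt(r * r + h * h)) and vol = pi * r * r * h / 3
// for each pair of array elements.
bool ConeAreaVolSketch(const double* r, const double* h, int n,
                       double* sa_cone, double* vol_cone)
{
    if (n <= 0)
        return false;                   // same validity check as the assembly code

    assert(r != nullptr && h != nullptr);
    assert(sa_cone != nullptr && vol_cone != nullptr);

    const double pi = 3.14159265358979323846;

    for (int i = 0; i < n; i++)
    {
        const double slant = std::sqrt(r[i] * r[i] + h[i] * h[i]);

        sa_cone[i] = pi * r[i] * (r[i] + slant);
        vol_cone[i] = pi * r[i] * r[i] * h[i] / 3.0;
    }

    return true;
}
```

Calling this routine with r = 1.0 and h = 1.0 should yield values that agree with the first entry of the results shown later in this section (sa of about 7.584476 and vol of about 1.047198).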

The function CalcConeAreaVol_A() begins by saving the nonvolatile general-purpose registers that it uses on the stack. It then allocates the specified amount of local stack space and initializes RBP as the stack frame pointer. The next code block saves nonvolatile registers XMM12-XMM15 on the stack using a series of vmovdqa instructions. A .savexmm128 directive must be used after each vmovdqa instruction. Like the other prologue directives, the .savexmm128 directive instructs the assembler to store information regarding the preservation of a nonvolatile XMM register in its exception handling tables. The offset argument of a .savexmm128 directive represents the displacement of the saved XMM register on the stack relative to register RSP. Note that the size of STK_LOCAL2 must be greater than or equal to the number of saved XMM registers multiplied by 16. Figure 12-5 illustrates the layout of the stack following execution of the vmovdqa xmmword ptr [rbp-STK_LOCAL2],xmm15 instruction.
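The arithmetic behind the named constants in Listing 12-8 can be checked at compile time. The sketch below restates the equates as C++ constexpr functions; all function names are hypothetical and exist only for illustration. It assumes that RSP modulo 16 equals 8 on entry to the function (the caller's stack is 16-byte aligned before the call instruction pushes the return address) and that both local areas are multiples of 16.

```cpp
// Hypothetical helpers that mirror the equates STK_PAD, STK_TOTAL, and RBP_RA
// from Listing 12-8. None of these names come from the book's source code.

// Extra bytes (0 or 8) needed so that RSP is 16-byte aligned after the prologue
constexpr int StkPad(int num_pushreg)
{
    return ((num_pushreg & 1) ^ 1) * 8;
}

// Total number of bytes subtracted from RSP by the prologue
constexpr int StkTotal(int num_pushreg, int stk_local1, int stk_local2)
{
    return stk_local1 + stk_local2 + StkPad(num_pushreg);
}

// Number of bytes between RBP and the return address
constexpr int RbpRa(int num_pushreg, int stk_local1)
{
    return num_pushreg * 8 + stk_local1 + StkPad(num_pushreg);
}

// RSP modulo 16 after the prologue, assuming RSP % 16 == 8 on entry
constexpr int RspMod16AfterProlog(int num_pushreg, int stk_local1, int stk_local2)
{
    return (((8 - num_pushreg * 8
                - StkTotal(num_pushreg, stk_local1, stk_local2)) % 16) + 16) % 16;
}

// True if the XMM save area can hold the requested number of 16-byte saves
constexpr bool XmmSaveAreaFits(int num_xmm_saves, int stk_local2)
{
    return stk_local2 >= num_xmm_saves * 16;
}

// Values used in Listing 12-8: NUM_PUSHREG = 7, STK_LOCAL1 = 16, STK_LOCAL2 = 64
static_assert(StkPad(7) == 0, "seven pushes need no padding");
static_assert(StkTotal(7, 16, 64) == 80, "STK_TOTAL");
static_assert(RbpRa(7, 16) == 72, "RBP_RA");
static_assert(RspMod16AfterProlog(7, 16, 64) == 0, "RSP is 16-byte aligned");
static_assert(XmmSaveAreaFits(4, 64), "XMM12-XMM15 fit in STK_LOCAL2");
```

With an even number of register pushes, for example six, StkPad() evaluates to 8, which keeps RSP aligned on a 16-byte boundary.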

Figure 12-5
Stack layout and argument registers following execution of vmovdqa xmmword ptr [rbp-STK_LOCAL2],xmm15 in function CalcConeAreaVol_A()

Following the prologue, local variables LocalVar1A and LocalVar1B are accessed for demonstration purposes. Initialization of the registers used by the main processing loop occurs next. Note that many of these initializations are either suboptimal or superfluous; they are performed merely to highlight the use of nonvolatile registers, both general purpose and XMM. Calculation of the cone surface areas and volumes is then carried out using AVX double-precision floating-point arithmetic.

Upon completion of the processing loop, the nonvolatile XMM registers are restored using a series of vmovdqa instructions. The function CalcConeAreaVol_A() then releases its local stack space and restores the previously saved nonvolatile general-purpose registers that it used. Here are the results for source code example Ch12_08:

----- Results for CalcConeAreaVol -----

r/h: 1.00 1.00

sa: 7.584476 7.584476

vol: 1.047198 1.047198

r/h: 1.00 2.00

sa: 10.166407 10.166407

vol: 2.094395 2.094395

r/h: 2.00 3.00

sa: 35.220717 35.220717

vol: 12.566371 12.566371

r/h: 2.00 4.00

sa: 40.665630 40.665630

vol: 16.755161 16.755161

r/h: 3.00 5.00

sa: 83.229761 83.229761

vol: 47.123890 47.123890

r/h: 3.00 10.00

sa: 126.671905 126.671905

vol: 94.247780 94.247780

r/h: 4.25 12.50

sa: 233.025028 233.025028

vol: 236.437572 236.437572

Macros for Function Prologues and Epilogues

The purpose of the three previous source code examples was to explicate the requirements of the Visual C++ calling convention for 64-bit nonleaf functions. The calling convention’s rigid requisites for function prologues and epilogues are somewhat lengthy and a potential source of programming errors. It is important to recognize that the stack layout of a nonleaf function is primarily determined by the number of nonvolatile (both general-purpose and XMM) registers that must be preserved and the amount of local stack space that is needed. A method is needed to automate most of the coding drudgery associated with the calling convention.

Listing 12-9 shows the assembly language source code for example Ch12_09. This source code example demonstrates how to use several macros that I have written to simplify prologue and epilogue coding in a nonleaf function. This example also illustrates how to call a C++ library function from an x86-64 assembly language function.

//------------------------------------------------

// Ch12_09.h

//------------------------------------------------

#pragma once

// Ch12_09_fcpp.cpp

extern bool CalcBSA_Cpp(const double* ht, const double* wt, int n,

double* bsa1, double* bsa2, double* bsa3);

// Ch12_09_fasm.asm

extern "C" bool CalcBSA_Aavx(const double* ht, const double* wt, int n,

double* bsa1, double* bsa2, double* bsa3);

;-------------------------------------------------

; Ch12_09_fasm.asm

;-------------------------------------------------

include <MacrosX86-64-AVX.asmh>

;--------------------------------------------------------------------------

; extern "C" bool CalcBSA_Aavx(const double* ht, const double* wt, int n,

; double* bsa1, double* bsa2, double* bsa3);

;--------------------------------------------------------------------------

.const

r8_0p007184 real8 0.007184

r8_0p725 real8 0.725

r8_0p425 real8 0.425

r8_0p0235 real8 0.0235

r8_0p42246 real8 0.42246

r8_0p51456 real8 0.51456

r8_3600p0 real8 3600.0

.code

extern pow:proc

CalcBSA_Aavx proc frame

CreateFrame_M BSA_,16,64,rbx,rsi,r12,r13,r14,r15

SaveXmmRegs_M xmm6,xmm7,xmm8,xmm9

EndProlog_M

; Save argument registers to home area (optional). Note that the home

; area can also be used to store other transient data values.

mov qword ptr [rbp+BSA_OffsetHomeRCX],rcx

mov qword ptr [rbp+BSA_OffsetHomeRDX],rdx

mov qword ptr [rbp+BSA_OffsetHomeR8],r8

mov qword ptr [rbp+BSA_OffsetHomeR9],r9

; Initialize processing loop pointers. Note that the pointers are

; maintained in non-volatile registers, which eliminates reloads after

; the calls to pow().

test r8d,r8d ;is n > 0?

jg @F ;jump if n > 0

xor eax,eax ;set error return code

jmp Done

@@: mov [rbp],r8d ;save n to local var

mov r12,rcx ;r12 = ptr to ht

mov r13,rdx ;r13 = ptr to wt

mov r14,r9 ;r14 = ptr to bsa1

mov r15,[rbp+BSA_OffsetStackArgs] ;r15 = ptr to bsa2

mov rbx,[rbp+BSA_OffsetStackArgs+8] ;rbx = ptr to bsa3

mov rsi,-8 ;rsi = array element offset

; Allocate home space on stack for use by pow()

sub rsp,32

; Calculate bsa1 = 0.007184 * pow(ht, 0.725) * pow(wt, 0.425);

@@: add rsi,8 ;rsi = next offset

vmovsd xmm0,real8 ptr [r12+rsi] ;xmm0 = ht

vmovsd xmm8,xmm8,xmm0 ;xmm8 = ht (preserved across calls to pow)

vmovsd xmm1,real8 ptr [r8_0p725]

call pow ;xmm0 = pow(ht, 0.725)

vmovsd xmm6,xmm6,xmm0 ;xmm6 = pow(ht, 0.725)

vmovsd xmm0,real8 ptr [r13+rsi] ;xmm0 = wt

vmovsd xmm9,xmm9,xmm0 ;xmm9 = wt (preserved across calls to pow)

vmovsd xmm1,real8 ptr [r8_0p425]

call pow ;xmm0 = pow(wt, 0.425)

vmulsd xmm6,xmm6,real8 ptr [r8_0p007184]

vmulsd xmm6,xmm6,xmm0 ;xmm6 = bsa1

; Calculate bsa2 = 0.0235 * pow(ht, 0.42246) * pow(wt, 0.51456);

vmovsd xmm0,xmm0,xmm8 ;xmm0 = ht

vmovsd xmm1,real8 ptr [r8_0p42246]

call pow ;xmm0 = pow(ht, 0.42246)

vmovsd xmm7,xmm7,xmm0 ;xmm7 = pow(ht, 0.42246)

vmovsd xmm0,xmm0,xmm9 ;xmm0 = wt

vmovsd xmm1,real8 ptr [r8_0p51456]

call pow ;xmm0 = pow(wt, 0.51456)

vmulsd xmm7,xmm7,real8 ptr [r8_0p0235]

vmulsd xmm7,xmm7,xmm0 ;xmm7 = bsa2

; Calculate bsa3 = sqrt(ht * wt / 3600.0);

vmulsd xmm8,xmm8,xmm9 ;xmm8 = ht * wt

vdivsd xmm8,xmm8,real8 ptr [r8_3600p0] ;xmm8 = ht * wt / 3600

vsqrtsd xmm8,xmm8,xmm8 ;xmm8 = bsa3

; Save BSA results

vmovsd real8 ptr [r14+rsi],xmm6 ;save bsa1 result

vmovsd real8 ptr [r15+rsi],xmm7 ;save bsa2 result

vmovsd real8 ptr [rbx+rsi],xmm8 ;save bsa3 result

dec dword ptr [rbp] ;n -= 1

jnz @B

mov eax,1 ;set success return code

Done: RestoreXmmRegs_M xmm6,xmm7,xmm8,xmm9

DeleteFrame_M rbx,rsi,r12,r13,r14,r15

ret

CalcBSA_Aavx endp

end

Listing 12-9

Example Ch12_09

In Listing 12-9, the assembly language code begins with the statement include <MacrosX86-64-AVX.asmh>, which incorporates the contents of file MacrosX86-64-AVX.asmh into Ch12_09_fasm.asm during assembly. This file (source code not shown but included in the software download package) contains several macros that help automate much of the coding grunt work associated with the Visual C++ calling convention. Using an assembly language include file is analogous to using a C++ include file. The angle brackets that surround the file name can be omitted in some cases, but it is usually simpler and more consistent to always use them. Note that there is no standard file name extension for x86 assembly language header files; I use .asmh but .inc is also used.

Figure 12-6 shows a generic stack layout diagram for a nonleaf function. Note the similarities between this figure and the more detailed stack layouts of Figures 12-4 and 12-5. The macros defined in MacrosX86-64-AVX.asmh assume that a function’s stack layout will conform to what is shown in Figure 12-6. They enable a function to tailor a custom stack frame by specifying the amount of local stack space that is needed and which nonvolatile registers must be preserved. The macros also perform most of the critical stack offset calculations, which reduces the risk of a programming error in a function prologue or epilogue.

Figure 12-6
Generic stack layout for a nonleaf function

Function CalcBSA_Aavx() computes body surface areas (BSA) using the same equations that were used in example Ch09_03 (see Table 9-1). Following the include statement in Listing 12-9 is a .const section that contains definitions for the various floating-point constant values used in the BSA equations. The line extern pow:proc enables the use of the external C++ library function pow(). Following the CalcBSA_Aavx proc frame statement, the macro CreateFrame_M emits assembly language code that initializes the stack frame. It also saves the specified nonvolatile general-purpose registers on the stack. Macro CreateFrame_M requires several parameters including a prefix string and the size in bytes of StkSizeLocal1 and StkSizeLocal2 (see Figure 12-6). Macro CreateFrame_M uses the specified prefix string to generate symbolic names that can be employed to reference items on the stack. It is convenient to use a shortened version of the function name as the prefix string, but any file-unique text string can be used. Both StkSizeLocal1 and StkSizeLocal2 must be evenly divisible by 16. StkSizeLocal2 must also be less than or equal to 240 and greater than or equal to the number of saved XMM registers multiplied by 16.

The next statement makes use of the SaveXmmRegs_M macro to save the specified nonvolatile XMM registers to the XMM save area on the stack. This is followed by the EndProlog_M macro, which signifies the end of the function’s prologue. At this point, register RBP is configured as the function’s stack frame pointer. It is also safe to use any of the saved nonvolatile general-purpose or XMM registers.

The code block that follows EndProlog_M saves argument registers RCX, RDX, R8, and R9 to their home locations on the stack. Note that each mov instruction includes a symbolic name that equates to the offset of the register’s home area on the stack relative to the RBP register. The symbolic names and the corresponding offset values were automatically generated by the CreateFrame_M macro. The home area can also be used to store temporary data instead of the argument registers, as mentioned earlier in this chapter.

Initialization of the processing for-loop variables occurs next. Argument value n in register R8D is checked for validity and then saved on the stack as a local variable. Several nonvolatile registers are then initialized as pointer registers. Nonvolatile registers are used to avoid register reloads following each call to the C++ library function pow(). Note that the pointer to array bsa2 is loaded from the stack using a mov r15,[rbp+BSA_OffsetStackArgs] instruction. The symbolic constant BSA_OffsetStackArgs also was automatically generated by the macro CreateFrame_M and equates to the offset of the first stack argument relative to the RBP register. A mov rbx,[rbp+BSA_OffsetStackArgs+8] instruction loads argument bsa3 into register RBX; the constant 8 is included as part of the source operand displacement since bsa3 is the second argument passed via the stack.
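The displacements used to reach stack arguments follow from the calling convention's fixed layout above the return address. The following sketch expresses that layout for integer and pointer arguments as constexpr arithmetic. The function names are hypothetical and are not part of MacrosX86-64-AVX.asmh; the value 72 is the RBP_RA value from Listing 12-8 and is used here only as an example.

```cpp
// Hypothetical helpers (names are illustrative only).

// Byte offset of argument k (0-based) from the stack location that holds the
// return address, as seen by the called function on entry. Arguments 0-3 are
// passed in RCX, RDX, R8, and R9 and own the home slots at offsets 8 to 32;
// arguments 4 and above are passed on the stack.
constexpr int ArgOffsetFromRetAddr(int k)
{
    return 8 + 8 * k;
}

// Offset of the same argument from a frame pointer that is located
// rbp_ra bytes below the return address
constexpr int ArgOffsetFromRbp(int rbp_ra, int k)
{
    return rbp_ra + ArgOffsetFromRetAddr(k);
}

// Fifth argument (vol_cone in Listing 12-8) is read from [rbp+RBP_RA+40]
static_assert(ArgOffsetFromRetAddr(4) == 40, "first stack argument");

// Sixth argument (bsa3 in Listing 12-9) follows the fifth (bsa2) by 8 bytes
static_assert(ArgOffsetFromRetAddr(5) - ArgOffsetFromRetAddr(4) == 8,
              "stack arguments are 8 bytes apart");

// With RBP_RA = 72 as in Listing 12-8, the fifth argument is at [rbp+112]
static_assert(ArgOffsetFromRbp(72, 4) == 112, "fifth argument relative to RBP");
```

This is why the second stack argument is loaded with a displacement that is 8 bytes larger than the displacement of the first one.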

The Visual C++ calling convention requires the caller of a function to allocate that function’s home area on the stack. The sub rsp,32 instruction performs this operation for function pow(). The ensuing code block calculates BSA values using the equations shown in Table 9-1. Note that registers XMM0 and XMM1 are loaded with the necessary argument values prior to each call to pow(). Also note that some of the return values from pow() are preserved in nonvolatile XMM registers prior to their actual use.
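The three BSA calculations carried out by the processing loop are easier to follow when written in C++. The sketch below is based on the equations that appear in the comments of Listing 12-9. The function name CalcBsaSketch is hypothetical, and the code is not the book's CalcBSA_Cpp(), whose source is not shown here.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical reference routine (name is illustrative, not from the book).
// ht is height in centimeters, wt is weight in kilograms.
bool CalcBsaSketch(const double* ht, const double* wt, int n,
                   double* bsa1, double* bsa2, double* bsa3)
{
    if (n <= 0)
        return false;                   // same validity check as the assembly code

    assert(ht != nullptr && wt != nullptr);
    assert(bsa1 != nullptr && bsa2 != nullptr && bsa3 != nullptr);

    for (int i = 0; i < n; i++)
    {
        // Each call to std::pow corresponds to one 'call pow' in Listing 12-9
        bsa1[i] = 0.007184 * std::pow(ht[i], 0.725) * std::pow(wt[i], 0.425);
        bsa2[i] = 0.0235 * std::pow(ht[i], 0.42246) * std::pow(wt[i], 0.51456);
        bsa3[i] = std::sqrt(ht[i] * wt[i] / 3600.0);
    }

    return true;
}
```

Each loop iteration performs four calls to pow(), which explains why the assembly language code keeps ht, wt, and the intermediate products in nonvolatile XMM registers: the volatile XMM registers may be modified by each call.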

Following completion of the BSA processing for-loop is the epilogue for CalcBSA_Aavx(). Before execution of the ret instruction, function CalcBSA_Aavx() must restore all nonvolatile XMM and general-purpose registers that it saved in the prologue. The stack frame must also be properly deleted. The RestoreXmmRegs_M macro restores the nonvolatile XMM registers. Note that this macro requires the order of the registers in its argument list to match the register list that was used with the SaveXmmRegs_M macro. Stack frame cleanup and general-purpose register restores are handled by the DeleteFrame_M macro. The order of the registers specified in this macro’s argument list must be identical to the order used with the prologue’s CreateFrame_M macro. The DeleteFrame_M macro also restores RSP from RBP, which means that it is not necessary to code an explicit add rsp,32 instruction to release the home area that was allocated on the stack for pow(). You will see additional examples of function prologue and epilogue macro usage in subsequent chapters. Here are the results for source code example Ch12_09:

----- Results for CalcBSA -----

height: 150.0 (cm)

weight: 50.0 (kg)

BSA (C++): 1.432500 1.460836 1.443376 (sq. m)

BSA (AVX): 1.432500 1.460836 1.443376 (sq. m)

height: 160.0 (cm)

weight: 60.0 (kg)

BSA (C++): 1.622063 1.648868 1.632993 (sq. m)

BSA (AVX): 1.622063 1.648868 1.632993 (sq. m)

height: 170.0 (cm)

weight: 70.0 (kg)

BSA (C++): 1.809708 1.831289 1.818119 (sq. m)

BSA (AVX): 1.809708 1.831289 1.818119 (sq. m)

height: 180.0 (cm)

weight: 80.0 (kg)

BSA (C++): 1.996421 2.009483 2.000000 (sq. m)

BSA (AVX): 1.996421 2.009483 2.000000 (sq. m)

height: 190.0 (cm)

weight: 90.0 (kg)

BSA (C++): 2.182809 2.184365 2.179449 (sq. m)

BSA (AVX): 2.182809 2.184365 2.179449 (sq. m)

height: 200.0 (cm)

weight: 100.0 (kg)

BSA (C++): 2.369262 2.356574 2.357023 (sq. m)

BSA (AVX): 2.369262 2.356574 2.357023 (sq. m)

If the discussions of this section have left you feeling a little bewildered, don’t worry. In this book’s remaining chapters, you will see an abundance of x86-64 assembly language source code that demonstrates proper use of the Visual C++ calling convention and its programming requirements.

Summary

Table 12-5 summarizes the x86 assembly language instructions introduced in this chapter. This table also includes closely related instructions. Before proceeding to the next chapter, make sure you understand the operation that is performed by each instruction shown in Table 12-5.

Table 12-5

X86 Assembly Language Instruction Summary for Chapter 12

Instruction Mnemonic	Description
call	Call procedure/function
lea	Load effective address
setcc	Set byte if condition is true; clear otherwise
test	Logical compare (bitwise logical AND to set RFLAGS)
vadds[d|s]	Scalar floating-point addition
vcvtsd2ss	Convert scalar DPFP value to SPFP
vcomis[d|s]	Scalar floating-point compare
vcvts[d|s]2si	Convert scalar floating-point to signed integer
vcvtsi2s[d|s]	Convert signed integer to scalar floating-point
vcvtss2sd	Convert scalar SPFP to scalar DPFP
vdivs[d|s]	Scalar floating-point division
vldmxcsr	Load MXCSR register
vmovdqa	Move double quadword (aligned)
vmovdqu	Move double quadword (unaligned)
vmovs[d|s]	Move scalar floating-point value
vmuls[d|s]	Scalar floating-point multiplication
vsqrts[d|s]	Scalar floating-point square root
vstmxcsr	Store MXCSR register
vsubs[d|s]	Scalar floating-point subtraction
vxorp[d|s]	Packed floating-point bitwise logical exclusive OR
