© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
D. Kusswurm, Modern Parallel Programming with C++ and Assembly Language, https://doi.org/10.1007/978-1-4842-7918-2_12

12. Core Assembly Language Programming: Part 2

Daniel Kusswurm
Geneva, IL, USA

In the previous chapter, you were introduced to the fundamentals of x86-64 assembly language programming. You learned how to use elementary instructions that performed integer addition, subtraction, multiplication, and division. You also acquired valuable knowledge regarding memory addressing modes, condition codes, and assembly language programming syntax. The chapter that you are about to read is a continuation of the previous chapter. Topics discussed include scalar floating-point arithmetic, compares, and conversions. This chapter also provides additional details regarding the Visual C++ calling convention including volatile and nonvolatile registers, stack frames, and function prologues and epilogues.

Scalar Floating-Point Arithmetic

Besides its SIMD capabilities, AVX also includes instructions that perform scalar floating-point operations including basic arithmetic, compares, and conversions. Many modern programs use AVX scalar floating-point instructions instead of legacy SSE2 or x87 FPU instructions. The primary reason for this is that most AVX instructions employ three operands: two nondestructive source operands and one destination operand. The use of nondestructive source operands often reduces the number of register-to-register transfers that a function must perform, which yields more efficient code. In this section, you will learn how to code functions that perform scalar floating-point operations using AVX. You will also learn how to pass floating-point arguments and return values between a C++ and assembly language function.

Single-Precision Arithmetic

Listing 12-1 shows the source code for example Ch12_01. This example illustrates how to perform temperature conversions between Fahrenheit and Celsius using AVX and single-precision floating-point values. It also explains how to define and use floating-point constants in an assembly language function.
//------------------------------------------------
//               Ch12_01.h
//------------------------------------------------
#pragma once
// Ch12_01_fasm.asm
extern "C" float ConvertFtoC_Aavx(float deg_f);
extern "C" float ConvertCtoF_Aavx(float deg_c);
//------------------------------------------------
//               Ch12_01.cpp
//------------------------------------------------
#include <iostream>
#include <iomanip>
#include "Ch12_01.h"
static void ConvertFtoC(void);
static void ConvertCtoF(void);
int main()
{
    ConvertFtoC();
    ConvertCtoF();
    return 0;
}
static void ConvertFtoC(void)
{
    const size_t w = 10;
    float deg_fvals[] = {-459.67f, -40.0f, 0.0f, 32.0f, 72.0f, 98.6f, 212.0f};
    size_t n = sizeof(deg_fvals) / sizeof(float);
    std::cout << "\n-------- ConvertFtoC Results --------\n";
    std::cout << std::fixed << std::setprecision(4);
    for (size_t i = 0; i < n; i++)
    {
        float deg_c = ConvertFtoC_Aavx(deg_fvals[i]);
        std::cout << "  i: " << i << "  ";
        std::cout << "f: " << std::setw(w) << deg_fvals[i] << "  ";
        std::cout << "c: " << std::setw(w) << deg_c << '\n';
    }
}
static void ConvertCtoF(void)
{
    const size_t w = 10;
    float deg_cvals[] = {-273.15f, -40.0f, -17.777778f, 0.0f, 25.0f, 37.0f, 100.0f};
    size_t n = sizeof(deg_cvals) / sizeof(float);
    std::cout << "\n-------- ConvertCtoF Results --------\n";
    std::cout << std::fixed << std::setprecision(4);
    for (size_t i = 0; i < n; i++)
    {
        float deg_f = ConvertCtoF_Aavx(deg_cvals[i]);
        std::cout << "  i: " << i << "  ";
        std::cout << "c: " << std::setw(w) << deg_cvals[i] << "  ";
        std::cout << "f: " << std::setw(w) << deg_f << '\n';
    }
}
;-------------------------------------------------
;               Ch12_01_fasm.asm
;-------------------------------------------------
               .const
r4_ScaleFtoC   real4 0.55555556                ; 5 / 9
r4_ScaleCtoF   real4 1.8                       ; 9 / 5
r4_32p0        real4 32.0
;--------------------------------------------------------------------------
; extern "C" float ConvertFtoC_Aavx(float deg_f);
;--------------------------------------------------------------------------
        .code
ConvertFtoC_Aavx proc
        vmovss xmm1,[r4_32p0]               ;xmm1 = 32
        vsubss xmm2,xmm0,xmm1               ;xmm2 = f - 32
        vmovss xmm1,[r4_ScaleFtoC]          ;xmm1 = 5 / 9
        vmulss xmm0,xmm2,xmm1               ;xmm0 = (f - 32) * 5 / 9
        ret
ConvertFtoC_Aavx endp
;--------------------------------------------------------------------------
; extern "C" float ConvertCtoF_Aavx(float deg_c);
;--------------------------------------------------------------------------
ConvertCtoF_Aavx proc
        vmulss xmm0,xmm0,[r4_ScaleCtoF]     ;xmm0 = c * 9 / 5
        vaddss xmm0,xmm0,[r4_32p0]          ;xmm0 = c * 9 / 5 + 32
        ret
ConvertCtoF_Aavx endp
        end
Listing 12-1

Example Ch12_01

In Listing 12-1, the header file Ch12_01.h includes declaration statements for the functions ConvertFtoC_Aavx() and ConvertCtoF_Aavx(). Note that these functions require a single argument value of type float. Both functions also return a value of type float. The file Ch12_01.cpp includes a function named ConvertFtoC() that performs test case initialization for ConvertFtoC_Aavx() and displays the calculated results. File Ch12_01.cpp also includes the function ConvertCtoF(), which is the Celsius to Fahrenheit counterpart of ConvertFtoC().

The assembly language code in Ch12_01_fasm.asm starts with a .const section that defines the constants needed to convert a temperature value from Fahrenheit to Celsius and vice versa. The text real4 is a MASM directive that allocates storage space for a single-precision floating-point value (the directive real8 can be used for double-precision floating-point values). Following the .const section is the code for function ConvertFtoC_Aavx(). The first instruction of this function, vmovss xmm1,[r4_32p0] (Move or Merge Scalar SPFP Value, where SPFP denotes single-precision floating-point), loads the single-precision floating-point value 32.0 from memory into register XMM1 (more precisely into XMM1[31:0]). A memory operand is used here since AVX does not support using immediate operands for scalar floating-point constants.

Per the Visual C++ calling convention, the first four floating-point argument values are passed to a function using registers XMM0, XMM1, XMM2, and XMM3. This means that upon entry to function ConvertFtoC_Aavx(), register XMM0 contains argument value deg_f. Following execution of the vmovss instruction, the vsubss xmm2,xmm0,xmm1 (Subtract Scalar SPFP Value) instruction calculates deg_f - 32.0 and saves the result in XMM2[31:0]. Execution of vsubss does not modify the contents of source operands XMM0 and XMM1. However, this instruction copies bits XMM0[127:32] to XMM2[127:32] (other AVX scalar arithmetic instructions also perform this copy operation). The ensuing vmovss xmm1,[r4_ScaleFtoC] loads the constant value 0.55555556 (or 5 / 9) into register XMM1. This is followed by a vmulss xmm0,xmm2,xmm1 (Multiply Scalar SPFP Value) instruction that computes (deg_f - 32.0) * 0.55555556 and saves the result (i.e., the converted temperature in Celsius) in XMM0. The Visual C++ calling convention designates register XMM0 for floating-point return values. Since the return value is already in XMM0, no additional vmovss instructions are necessary.
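The copy behavior of vsubss described above can be modeled in plain C++. The following is only a minimal sketch: the type XmmF32 and the function SubScalarModel() are hypothetical names introduced for illustration, and the array merely stands in for the four single-precision elements of a 128-bit XMM register (element 0 corresponds to bits 31:0).

```cpp
#include <array>

// Hypothetical model of an XMM register that holds four float elements.
// Element 0 corresponds to bits 31:0, element 3 to bits 127:96.
using XmmF32 = std::array<float, 4>;

// Models vsubss des,src1,src2. Only element 0 is calculated; elements 1-3
// (bits 127:32) are copied from the first source operand. Neither source
// operand is modified.
XmmF32 SubScalarModel(const XmmF32& src1, const XmmF32& src2)
{
    XmmF32 des = src1;              // copies bits 127:32 from src1
    des[0] = src1[0] - src2[0];     // low element = src1[0] - src2[0]
    return des;
}
```

In this model, the upper elements of the second source operand never influence the result, which matches the description of vsubss xmm2,xmm0,xmm1 given above.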

The assembly language function ConvertCtoF_Aavx() follows next. The code for this function differs slightly from ConvertFtoC_Aavx() in that the AVX scalar floating-point arithmetic instructions use memory operands to reference the required conversion constants. At entry to ConvertCtoF_Aavx(), register XMM0 contains argument value deg_c. The instruction vmulss xmm0,xmm0,[r4_ScaleCtoF] calculates deg_c * 1.8. This is followed by a vaddss xmm0,xmm0,[r4_32p0] (Add Scalar SPFP Value) instruction that calculates deg_c * 1.8 + 32.0. It should be noted at this point that neither ConvertFtoC_Aavx() nor ConvertCtoF_Aavx() performs any validity checks for argument values that are physically impossible (e.g., a temperature of -1000 degrees Fahrenheit). Such checks require floating-point compare instructions, and you will learn about these instructions later in this chapter. Here are the results for source code example Ch12_01:
-------- ConvertFtoC Results --------
  i: 0  f:  -459.6700  c:  -273.1500
  i: 1  f:   -40.0000  c:   -40.0000
  i: 2  f:     0.0000  c:   -17.7778
  i: 3  f:    32.0000  c:     0.0000
  i: 4  f:    72.0000  c:    22.2222
  i: 5  f:    98.6000  c:    37.0000
  i: 6  f:   212.0000  c:   100.0000
-------- ConvertCtoF Results --------
  i: 0  c:  -273.1500  f:  -459.6700
  i: 1  c:   -40.0000  f:   -40.0000
  i: 2  c:   -17.7778  f:     0.0000
  i: 3  c:     0.0000  f:    32.0000
  i: 4  c:    25.0000  f:    77.0000
  i: 5  c:    37.0000  f:    98.6000
  i: 6  c:   100.0000  f:   212.0000

Double-Precision Arithmetic

Listing 12-2 shows the source code for example Ch12_02. This example calculates 3D distances using AVX scalar arithmetic and double-precision floating-point values.
//------------------------------------------------
//               Ch12_02.h
//------------------------------------------------
#pragma once
// Ch12_02_fcpp.cpp
extern double CalcDistance_Cpp(double x1, double y1, double z1, double x2,
    double y2, double z2);
// Ch12_02_fasm.asm
extern "C" double CalcDistance_Aavx(double x1, double y1, double z1, double x2,
    double y2, double z2);
// Ch12_02_misc.cpp
extern void InitArrays(double* x, double* y, double* z, size_t n,
    unsigned int rng_seed);
//------------------------------------------------
//               Ch12_02.cpp
//------------------------------------------------
#include <iostream>
#include <iomanip>
#include "Ch12_02.h"
static void CalcDistance(void);
int main()
{
    CalcDistance();
    return 0;
}
static void CalcDistance(void)
{
    const size_t n = 20;
    double x1[n], y1[n], z1[n], dist1[n];
    double x2[n], y2[n], z2[n], dist2[n];
    InitArrays(x1, y1, z1, n, 29);
    InitArrays(x2, y2, z2, n, 37);
    for (size_t i = 0; i < n; i++)
    {
        dist1[i] = CalcDistance_Cpp(x1[i], y1[i], z1[i], x2[i], y2[i], z2[i]);
        dist2[i] = CalcDistance_Aavx(x1[i], y1[i], z1[i], x2[i], y2[i], z2[i]);
    }
    size_t w1 = 3, w2 = 8;
    std::cout << std::fixed;
    for (size_t i = 0; i < n; i++)
    {
        std::cout << "i: " << std::setw(w1) << i << "  ";
        std::cout << std::setprecision(0);
        std::cout << "p1(";
        std::cout << std::setw(w1) << x1[i] << ",";
        std::cout << std::setw(w1) << y1[i] << ",";
        std::cout << std::setw(w1) << z1[i] << ") | ";
        std::cout << "p2(";
        std::cout << std::setw(w1) << x2[i] << ",";
        std::cout << std::setw(w1) << y2[i] << ",";
        std::cout << std::setw(w1) << z2[i] << ") | ";
        std::cout << std::setprecision(4);
        std::cout << "dist1: " << std::setw(w2) << dist1[i] << " | ";
        std::cout << "dist2: " << std::setw(w2) << dist2[i] << '\n';
    }
}
;-------------------------------------------------
;               Ch12_02_fasm.asm
;-------------------------------------------------
;--------------------------------------------------------------------------
; extern "C" double CalcDistance_Aavx(double x1, double y1, double z1, double x2,
;   double y2, double z2);
;--------------------------------------------------------------------------
        .code
CalcDistance_Aavx proc
; Load arguments from stack
        vmovsd xmm4,real8 ptr [rsp+40]      ;xmm4 = y2
        vmovsd xmm5,real8 ptr [rsp+48]      ;xmm5 = z2
; Calculate squares of coordinate distances
        vsubsd xmm0,xmm3,xmm0               ;xmm0 = x2 - x1
        vmulsd xmm0,xmm0,xmm0               ;xmm0 = (x2 - x1) * (x2 - x1)
        vsubsd xmm1,xmm4,xmm1               ;xmm1 = y2 - y1
        vmulsd xmm1,xmm1,xmm1               ;xmm1 = (y2 - y1) * (y2 - y1)
        vsubsd xmm2,xmm5,xmm2               ;xmm2 = z2 - z1
        vmulsd xmm2,xmm2,xmm2               ;xmm2 = (z2 - z1) * (z2 - z1)
; Calculate final distance
        vaddsd xmm3,xmm0,xmm1
        vaddsd xmm4,xmm2,xmm3               ;xmm4 = sum of squares
        vsqrtsd xmm0,xmm0,xmm4              ;xmm0 = final distance value
        ret
CalcDistance_Aavx endp
        end
Listing 12-2

Example Ch12_02

The Euclidean distance between two 3D points can be calculated using the following equation:
$$ dist=\sqrt{{\left({x}_2-{x}_1\right)}^2+{\left({y}_2-{y}_1\right)}^2+{\left({z}_2-{z}_1\right)}^2} $$
If you examine the declaration of function CalcDistance_Aavx(), you will notice that it specifies six argument values of type double. Argument values x1, y1, z1, and x2 are passed in registers XMM0, XMM1, XMM2, and XMM3, respectively. The final two argument values, y2 and z2, are passed on the stack as illustrated in Figure 12-1. Note that this figure shows only the low-order quadword (bits 63:0) of each XMM register; the high-order quadword (bits 127:64) of each XMM register is undefined. Registers RCX, RDX, R8, and R9 are also undefined since CalcDistance_Aavx() does not utilize any integer or pointer arguments.
Figure 12-1

Stack layout and argument registers at entry to CalcDistance_Aavx()

The function CalcDistance_Aavx() begins with a vmovsd xmm4,real8 ptr [rsp+40] (Move or Merge Scalar DPFP Value, where DPFP denotes double-precision floating-point) instruction that loads argument value y2 from the stack into register XMM4 (more precisely into XMM4[63:0]). This is followed by a vmovsd xmm5,real8 ptr [rsp+48] instruction that loads argument value z2 into register XMM5. The next two instructions, vsubsd xmm0,xmm3,xmm0 (Subtract Scalar DPFP Value) and vmulsd xmm0,xmm0,xmm0 (Multiply Scalar DPFP Value), calculate (x2 – x1) * (x2 – x1). Similar sequences of instructions are then employed to calculate (y2 – y1) * (y2 – y1) and (z2 – z1) * (z2 – z1). This is followed by two vaddsd (Add Scalar DPFP Value) instructions that sum the three coordinate squares. A vsqrtsd xmm0,xmm0,xmm4 (Compute Square Root of Scalar DPFP Value) instruction computes the final distance. It is important to note that vsqrtsd computes the square root of its second source operand. Like other scalar double-precision floating-point arithmetic instructions, vsqrtsd also copies bits 127:64 of its first source operand to the same bit positions of the destination operand. Here are the results for source example Ch12_02:
i:   0  p1( 24,  4, 45) | p2(  8, 45, 20) | dist1:  50.6162 | dist2:  50.6162
i:   1  p1( 54, 59, 33) | p2( 22, 20, 81) | dist1:  69.6348 | dist2:  69.6348
i:   2  p1( 25, 23, 61) | p2( 83, 20, 44) | dist1:  60.5145 | dist2:  60.5145
i:   3  p1( 83,  4, 22) | p2( 98, 20, 62) | dist1:  45.6180 | dist2:  45.6180
i:   4  p1( 81, 21, 12) | p2( 73, 49, 64) | dist1:  59.5987 | dist2:  59.5987
i:   5  p1( 81, 97, 22) | p2( 70, 48, 45) | dist1:  55.2359 | dist2:  55.2359
i:   6  p1( 24, 62, 77) | p2( 20, 32, 15) | dist1:  68.9928 | dist2:  68.9928
i:   7  p1( 97, 81, 45) | p2( 20, 79, 18) | dist1:  81.6211 | dist2:  81.6211
i:   8  p1( 94, 81, 17) | p2(  8, 89, 87) | dist1: 111.1755 | dist2: 111.1755
i:   9  p1( 53, 82, 62) | p2( 43, 31, 84) | dist1:  56.4358 | dist2:  56.4358
i:  10  p1( 90, 72, 88) | p2( 25, 27, 30) | dist1:  98.0510 | dist2:  98.0510
i:  11  p1( 32,  4, 46) | p2( 62, 33, 53) | dist1:  42.3084 | dist2:  42.3084
i:  12  p1(  7, 88, 13) | p2( 12, 75, 30) | dist1:  21.9773 | dist2:  21.9773
i:  13  p1(  3, 90, 97) | p2( 89, 52, 38) | dist1: 111.0000 | dist2: 111.0000
i:  14  p1( 60, 95, 54) | p2( 91, 51, 33) | dist1:  57.7754 | dist2:  57.7754
i:  15  p1( 16, 10, 52) | p2(  2, 32, 50) | dist1:  26.1534 | dist2:  26.1534
i:  16  p1( 87,  2, 68) | p2( 53, 20, 75) | dist1:  39.1024 | dist2:  39.1024
i:  17  p1( 32, 10, 37) | p2(  8, 41, 13) | dist1:  45.9674 | dist2:  45.9674
i:  18  p1( 62, 29, 84) | p2( 62, 37, 35) | dist1:  49.6488 | dist2:  49.6488
i:  19  p1( 16, 32, 31) | p2( 85, 19, 17) | dist1:  71.5961 | dist2:  71.5961
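Listing 12-2 declares CalcDistance_Cpp() but does not show its definition, which resides in Ch12_02_fcpp.cpp. The following is a minimal sketch of what such a function might look like, assuming it evaluates the distance equation directly; the name CalcDistance_Sketch() is hypothetical and the code is not taken from the example's source files.

```cpp
#include <cmath>

// Hypothetical C++ counterpart of CalcDistance_Aavx(). The calculation
// order follows the assembly language code: coordinate differences,
// squares, sum of squares, and square root.
double CalcDistance_Sketch(double x1, double y1, double z1,
    double x2, double y2, double z2)
{
    double dx = x2 - x1;
    double dy = y2 - y1;
    double dz = z2 - z1;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}
```

For test case 13 in the preceding output, the coordinate differences are 86, -38, and -59. The sum of squares is 7396 + 1444 + 3481 = 12321, and the square root of this value is 111, which agrees with the displayed distance.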

The final two letters of many x86-AVX arithmetic instruction mnemonics denote the operand type. You have already seen the instructions vaddss and vaddsd, which perform scalar single-precision and double-precision floating-point addition. In these instructions, the suffixes ss and sd denote scalar single-precision and double-precision values, respectively. X86-AVX instructions also use the mnemonic suffixes ps and pd to signify packed single-precision and double-precision values. X86-AVX instructions that manipulate more than one data type often include multiple data type characters in their mnemonics.

Compares

Listing 12-3 shows the source code for example Ch12_03, which demonstrates the use of the floating-point compare instruction vcomiss (Compare Scalar SPFP Values). The vcomiss instruction compares two single-precision floating-point values and sets status flags in RFLAGS to signify a result of less than, equal, greater than, or unordered. The vcomisd (Compare Scalar DPFP Values) instruction is the double-precision counterpart of vcomiss.
//------------------------------------------------
//               Ch12_03.h
//------------------------------------------------
#pragma once
#include <cstdint>
// Ch12_03_fasm.asm
extern "C" void CompareF32_Aavx(float a, float b, uint8_t* results);
// Ch12_03_misc.cpp
extern void DisplayResults(float a, float b, const uint8_t* cmp_results);
// Miscellaneous constants
const size_t c_NumCmpOps = 7;
//------------------------------------------------
//               Ch12_03.cpp
//------------------------------------------------
#include <iostream>
#include <iomanip>
#include <limits>
#include <string>
#include "Ch12_03.h"
static void CompareF32(void);
int main()
{
    CompareF32();
    return 0;
}
static void CompareF32(void)
{
    const size_t n = 6;
    float a[n] {120.0, 250.0, 300.0, -18.0, -81.0, 42.0};
    float b[n] {130.0, 240.0, 300.0, 32.0, -100.0, 0.0};
    // Set NAN test value
    b[n - 1] = std::numeric_limits<float>::quiet_NaN();
    std::cout << "\n----- Results for CompareF32 -----\n";
    for (size_t i = 0; i < n; i++)
    {
        uint8_t cmp_results[c_NumCmpOps];
        CompareF32_Aavx(a[i], b[i], cmp_results);
        DisplayResults(a[i], b[i], cmp_results);
    }
}
;-------------------------------------------------
;               Ch12_03_fasm.asm
;-------------------------------------------------
;--------------------------------------------------------------------------
; extern "C" void CompareF32_Aavx(float a, float b, uint8_t* results);
;--------------------------------------------------------------------------
        .code
CompareF32_Aavx proc
; Set result flags based on compare status
        vcomiss xmm0,xmm1
        setp byte ptr [r8]                  ;RFLAGS.PF = 1 if unordered
        jnp @F
        xor al,al
        mov byte ptr [r8+1],al              ;set remaining elements in array
        mov byte ptr [r8+2],al              ;result[] to 0
        mov byte ptr [r8+3],al
        mov byte ptr [r8+4],al
        mov byte ptr [r8+5],al
        mov byte ptr [r8+6],al
        ret
@@:     setb byte ptr [r8+1]                ;set byte if a < b
        setbe byte ptr [r8+2]               ;set byte if a <= b
        sete byte ptr [r8+3]                ;set byte if a == b
        setne byte ptr [r8+4]               ;set byte if a != b
        seta byte ptr [r8+5]                ;set byte if a > b
        setae byte ptr [r8+6]               ;set byte if a >= b
        ret
CompareF32_Aavx endp
        end
Listing 12-3

Example Ch12_03

The function CompareF32_Aavx() accepts three argument values: two of type float and a pointer to an array of uint8_t values for the results. The first instruction of CompareF32_Aavx(), vcomiss xmm0,xmm1, performs a single-precision floating-point compare of argument values a and b. Note that these values were passed to CompareF32_Aavx() in registers XMM0 and XMM1, respectively. Execution of vcomiss sets RFLAGS.ZF, RFLAGS.PF, and RFLAGS.CF as shown in Table 12-1. The setting of these status flags facilitates the use of the conditional instructions cmovcc, jcc, and setcc (Set Byte on Condition) as shown in Table 12-2.
Table 12-1

Status Flags Set by vcomis[d|s]

Condition       RFLAGS.ZF   RFLAGS.PF   RFLAGS.CF
XMM0 > XMM1     0           0           0
XMM0 == XMM1    1           0           0
XMM0 < XMM1     0           0           1
Unordered       1           1           1

Table 12-2

Condition Codes Following Execution of vcomis[d|s]

Relational Operator   Condition Code          RFLAGS Test Condition
XMM0 < XMM1           Below (b)               CF == 1
XMM0 <= XMM1          Below or equal (be)     CF == 1 || ZF == 1
XMM0 == XMM1          Equal (e or z)          ZF == 1
XMM0 != XMM1          Not Equal (ne or nz)    ZF == 0
XMM0 > XMM1           Above (a)               CF == 0 && ZF == 0
XMM0 >= XMM1          Above or Equal (ae)     CF == 0
Unordered             Parity (p)              PF == 1

It should be noted that the status flags shown in Table 12-1 are set only if floating-point exceptions are masked (the default state for Visual C++ and most other C++ compilers). If floating-point invalid operation or denormal exceptions are unmasked (MXCSR.IM = 0 or MXCSR.DM = 0) and one of the compare operands is a QNaN, SNaN, or denormal, the processor will generate an exception without updating the status flags in RFLAGS.

Following execution of the vcomiss xmm0,xmm1 instruction, CompareF32_Aavx() uses a series of setcc instructions to highlight the relational operators shown in Table 12-2. The setp byte ptr [r8] instruction sets the destination operand byte pointed to by R8 to 1 if RFLAGS.PF is set (i.e., one of the operands is a QNaN or SNaN); otherwise, the destination operand byte is set to 0. If the compare was ordered, the remaining setcc instructions in CompareF32_Aavx() save all possible compare outcomes by setting each entry in array results to 0 or 1. As previously mentioned, a function can also use a jcc or cmovcc instruction following execution of a vcomis[d|s] instruction to perform conditional jumps or moves based on the outcome of a floating-point compare. Here is the output for source code example Ch12_03:
----- Results for CompareF32 -----
a = 120, b = 130
UO=0      LT=1      LE=1      EQ=0      NE=1      GT=0      GE=0
a = 250, b = 240
UO=0      LT=0      LE=0      EQ=0      NE=1      GT=1      GE=1
a = 300, b = 300
UO=0      LT=0      LE=1      EQ=1      NE=0      GT=0      GE=1
a = -18, b = 32
UO=0      LT=1      LE=1      EQ=0      NE=1      GT=0      GE=0
a = -81, b = -100
UO=0      LT=0      LE=0      EQ=0      NE=1      GT=1      GE=1
a = 42, b = nan
UO=1      LT=0      LE=0      EQ=0      NE=0      GT=0      GE=0
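The relationship between Table 12-1, Table 12-2, and the preceding output can be reproduced with a small C++ model. This is a sketch only: CmpFlags, ComissModel(), and CompareF32_Model() are hypothetical names, the model assumes that floating-point exceptions are masked, and it imitates the flag logic rather than executing a vcomiss instruction.

```cpp
#include <array>
#include <cmath>
#include <cstdint>

// Hypothetical container for the three status flags updated by vcomiss
struct CmpFlags
{
    bool zf;
    bool pf;
    bool cf;
};

// Models the status flag settings of Table 12-1
CmpFlags ComissModel(float a, float b)
{
    if (std::isnan(a) || std::isnan(b))
        return {true, true, true};          // unordered
    if (a > b)
        return {false, false, false};       // greater than
    if (a < b)
        return {false, false, true};        // less than
    return {true, false, false};            // equal
}

// Models CompareF32_Aavx(); result order is UO, LT, LE, EQ, NE, GT, GE
std::array<std::uint8_t, 7> CompareF32_Model(float a, float b)
{
    CmpFlags f = ComissModel(a, b);
    std::array<std::uint8_t, 7> r {};       // all elements start at 0

    r[0] = f.pf;                            // setp
    if (f.pf)
        return r;                           // unordered: other elements stay 0

    r[1] = f.cf;                            // setb:  CF == 1
    r[2] = f.cf || f.zf;                    // setbe: CF == 1 || ZF == 1
    r[3] = f.zf;                            // sete:  ZF == 1
    r[4] = !f.zf;                           // setne: ZF == 0
    r[5] = !f.cf && !f.zf;                  // seta:  CF == 0 && ZF == 0
    r[6] = !f.cf;                           // setae: CF == 0
    return r;
}
```

The early return for the unordered case mirrors the jnp @F branch in Listing 12-3. Without it, an unordered compare (ZF = PF = CF = 1) would also report LT, LE, and EQ as true.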

Conversions

Most C++ programs perform type conversions. For example, it is often necessary to cast a single-precision or double-precision floating-point value to an integer or vice versa. A function may also need to size-promote a single-precision floating-point value to double precision or narrow a double-precision floating-point value to single precision. AVX includes several instructions that perform conversions using either scalar or packed operands. Listing 12-4 shows the source code for example Ch12_04. This example illustrates the use of AVX scalar conversion instructions. Source code example Ch12_04 also introduces macros and explains how to change the rounding control bits in the MXCSR register.
//------------------------------------------------
//               Ch12_04.h
//------------------------------------------------
#pragma once
// Simple union for data exchange
union Uval
{
    int32_t m_I32;
    int64_t m_I64;
    float m_F32;
    double m_F64;
};
// The order of values in enum CvtOp must match the jump table
// that's defined in the .asm file.
enum class CvtOp : unsigned int
{
    I32_F32,       // int32_t to float
    F32_I32,       // float to int32_t
    I32_F64,       // int32_t to double
    F64_I32,       // double to int32_t
    I64_F32,       // int64_t to float
    F32_I64,       // float to int64_t
    I64_F64,       // int64_t to double
    F64_I64,       // double to int64_t
    F32_F64,       // float to double
    F64_F32,       // double to float
};
// Enumerated type for rounding control
enum class RC : unsigned int
{
    Nearest, Down, Up, Zero     // Do not change order
};
// Ch12_04_fasm.asm
extern "C" bool ConvertScalar_Aavx(Uval* a, Uval* b, CvtOp cvt_op, RC rc);
//------------------------------------------------
//               Ch12_04.cpp
//------------------------------------------------
#include <iostream>
#include <iomanip>
#include <cstdint>
#include <string>
#include <limits>
#define _USE_MATH_DEFINES
#include <math.h>
#include "Ch12_04.h"
const std::string c_RcStrings[] = {"Nearest", "Down", "Up", "Zero"};
const RC c_RcVals[] = {RC::Nearest, RC::Down, RC::Up, RC::Zero};
const size_t c_NumRC = sizeof(c_RcVals) / sizeof (RC);
static void ConvertScalars(void);
int main()
{
    ConvertScalars();
    return 0;
}
static void ConvertScalars(void)
{
    const char nl = '\n';
    Uval src1, src2, src3, src4, src5, src6, src7;
    src1.m_F32 = (float)M_PI;
    src2.m_F32 = (float)-M_E;
    src3.m_F64 = M_SQRT2;
    src4.m_F64 = M_SQRT1_2;
    src5.m_F64 = 1.0 + DBL_EPSILON;
    src6.m_I32 = std::numeric_limits<int>::max();
    src7.m_I64 = std::numeric_limits<long long>::max();
    std::cout << "----- Results for ConvertScalars() -----\n";
    for (size_t i = 0; i < c_NumRC; i++)
    {
        RC rc = c_RcVals[i];
        Uval des1, des2, des3, des4, des5, des6, des7;
        ConvertScalar_Aavx(&des1, &src1, CvtOp::F32_I32, rc);
        ConvertScalar_Aavx(&des2, &src2, CvtOp::F32_I64, rc);
        ConvertScalar_Aavx(&des3, &src3, CvtOp::F64_I32, rc);
        ConvertScalar_Aavx(&des4, &src4, CvtOp::F64_I64, rc);
        ConvertScalar_Aavx(&des5, &src5, CvtOp::F64_F32, rc);
        ConvertScalar_Aavx(&des6, &src6, CvtOp::I32_F32, rc);
        ConvertScalar_Aavx(&des7, &src7, CvtOp::I64_F64, rc);
        std::cout << std::fixed;
        std::cout << "\nRounding control = " << c_RcStrings[(int)rc] << nl;
        std::cout << "  F32_I32: " << std::setprecision(8);
        std::cout << src1.m_F32 << " --> " << des1.m_I32 << nl;
        std::cout << "  F32_I64: " << std::setprecision(8);
        std::cout << src2.m_F32 << " --> " << des2.m_I64 << nl;
        std::cout << "  F64_I32: " << std::setprecision(8);
        std::cout << src3.m_F64 << " --> " << des3.m_I32 << nl;
        std::cout << "  F64_I64: " << std::setprecision(8);
        std::cout << src4.m_F64 << " --> " << des4.m_I64 << nl;
        std::cout << "  F64_F32: ";
        std::cout << std::setprecision(16) << src5.m_F64 << " --> ";
        std::cout << std::setprecision(8) << des5.m_F32 << nl;
        std::cout << "  I32_F32: " << std::setprecision(8);
        std::cout << src6.m_I32 << " --> " << des6.m_F32 << nl;
        std::cout << "  I64_F64: " << std::setprecision(8);
        std::cout << src7.m_I64 << " --> " << des7.m_F64 << nl;
    }
}
;-------------------------------------------------
;               Ch12_04_fasm.asm
;-------------------------------------------------
MxcsrRcMask     equ 9fffh                   ;bit mask for MXCSR.RC
MxcsrRcShift    equ 13                      ;shift count for MXCSR.RC
;--------------------------------------------------------------------------
; Macro GetRC_M - copies MXCSR.RC to r10d[1:0]
;--------------------------------------------------------------------------
GetRC_M macro
        vstmxcsr dword ptr [rsp+8]          ;save mxcsr register
        mov r10d,[rsp+8]
        shr r10d,MxcsrRcShift               ;r10d[1:0] = MXCSR.RC bits
        and r10d,3                          ;clear unused bits
        endm
;--------------------------------------------------------------------------
; Macro SetRC_M - sets MXCSR.RC to RcReg[1:0]
;--------------------------------------------------------------------------
SetRC_M macro RcReg
        vstmxcsr dword ptr [rsp+8]          ;save current MXCSR
        mov eax,[rsp+8]
        and RcReg,3                         ;clear unused bits
        shl RcReg,MxcsrRcShift              ;RcReg[14:13] = rc
        and eax,MxcsrRcMask                 ;clear MXCSR.RC bits, keep the rest
        or eax,RcReg                        ;insert new MXCSR.RC
        mov [rsp+8],eax
        vldmxcsr dword ptr [rsp+8]          ;load updated MXCSR
        endm
;--------------------------------------------------------------------------
; extern "C" bool ConvertScalar_Aavx(Uval* des, const Uval* src, CvtOp cvt_op, RC rc)
;
; Note:     This function requires linker option /LARGEADDRESSAWARE:NO
;--------------------------------------------------------------------------
        .code
ConvertScalar_Aavx proc
; Make sure cvt_op is valid
        cmp r8d,CvtOpTableCount             ;is cvt_op >= CvtOpTableCount
        jae BadCvtOp                        ;jump if cvt_op is invalid
; Save current MXCSR.RC
        GetRC_M                             ;r10d = current RC
; Set new rounding mode
        SetRC_M r9d                         ;set new MXCSR.RC
; Jump to target conversion code block
        mov eax,r8d                         ;rax = cvt_op
        jmp [CvtOpTable+rax*8]
; Conversions between int32_t and float/double
I32_F32:
        mov eax,[rdx]                       ;load integer value
        vcvtsi2ss xmm0,xmm0,eax             ;convert to float
        vmovss real4 ptr [rcx],xmm0         ;save result
        jmp Done
F32_I32:
        vmovss xmm0,real4 ptr [rdx]         ;load float value
        vcvtss2si eax,xmm0                  ;convert to integer
        mov [rcx],eax                       ;save result
        jmp Done
I32_F64:
        mov eax,[rdx]                       ;load integer value
        vcvtsi2sd xmm0,xmm0,eax             ;convert to double
        vmovsd real8 ptr [rcx],xmm0         ;save result
        jmp Done
F64_I32:
        vmovsd xmm0,real8 ptr [rdx]         ;load double value
        vcvtsd2si eax,xmm0                  ;convert to integer
        mov [rcx],eax                       ;save result
        jmp Done
; Conversions between int64_t and float/double
I64_F32:
        mov rax,[rdx]                       ;load integer value
        vcvtsi2ss xmm0,xmm0,rax             ;convert to float
        vmovss real4 ptr [rcx],xmm0         ;save result
        jmp Done
F32_I64:
        vmovss xmm0,real4 ptr [rdx]         ;load float value
        vcvtss2si rax,xmm0                  ;convert to integer
        mov [rcx],rax                       ;save result
        jmp Done
I64_F64:
        mov rax,[rdx]                       ;load integer value
        vcvtsi2sd xmm0,xmm0,rax             ;convert to double
        vmovsd real8 ptr [rcx],xmm0         ;save result
        jmp Done
F64_I64:
        vmovsd xmm0,real8 ptr [rdx]         ;load double value
        vcvtsd2si rax,xmm0                  ;convert to integer
        mov [rcx],rax                       ;save result
        jmp Done
; Conversions between float and double
F32_F64:
        vmovss xmm0,real4 ptr [rdx]         ;load float value
        vcvtss2sd xmm1,xmm1,xmm0            ;convert to double
        vmovsd real8 ptr [rcx],xmm1         ;save result
        jmp Done
F64_F32:
        vmovsd xmm0,real8 ptr [rdx]         ;load double value
        vcvtsd2ss xmm1,xmm1,xmm0            ;convert to float
        vmovss real4 ptr [rcx],xmm1         ;save result
        jmp Done
BadCvtOp:
        xor eax,eax                         ;set error return code
        ret
Done:   SetRC_M r10d                        ;restore original MXCSR.RC
        mov eax,1                           ;set success return code
        ret
; The order of values in following table must match enum CvtOp
; that's defined in the .h file.
        align 8
CvtOpTable equ $
        qword I32_F32, F32_I32
        qword I32_F64, F64_I32
        qword I64_F32, F32_I64
        qword I64_F64, F64_I64
        qword F32_F64, F64_F32
CvtOpTableCount equ ($ - CvtOpTable) / size qword
ConvertScalar_Aavx endp
        end
Listing 12-4

Example Ch12_04

Near the top of Listing 12-4 is the definition of a union named Uval. Source code example Ch12_04 uses this union to simplify data exchange between the C++ and assembly language code. Following Uval is an enum named CvtOp, which defines symbolic names for the conversions. Also included in file Ch12_04.h is the enum RC. This type defines symbolic names for the floating-point rounding modes. Recall from the discussions in Chapter 10 that the MXCSR register contains a two-bit field that specifies the rounding method for floating-point operations (see Table 10-4).

Also shown in Listing 12-4 is the file Ch12_04.cpp. This file includes the driver function ConvertScalars(), which performs test case initialization and streams results to std::cout. Note that each use of the assembly language function ConvertScalar_Aavx() requires two argument values of type Uval, one argument of type CvtOp, and one argument of type RC.

Assembly language source code files often employ the equ (equate) directive to define symbolic names for numerical expressions. The equ directive is somewhat analogous to a C++ const definition (e.g., const int x = 100;). The first noncomment statement in Ch12_04_fasm.asm, MxcsrRcMask equ 9fffh, defines a symbolic name for a mask that will be used to modify bits MXCSR.RC. This is followed by another equ directive MxcsrRcShift equ 13 that defines a shift count for bits MXCSR.RC.

Immediately following the two equate statements is the definition of a macro named GetRC_M. A macro is a text substitution mechanism that enables a programmer to represent a sequence of assembly language instructions, data, or other statements using a single text string. Assembly language macros are typically employed to generate sequences of instructions that will be used more than once. Macros are also frequently exercised to factor out and reuse code without the performance overhead of a function call.

Macro GetRC_M emits a sequence of assembly language instructions that obtain the current value of MXCSR.RC. The first instruction of this macro, vstmxcsr dword ptr [rsp+8] (Store MXCSR Register State), saves the contents of register MXCSR on the stack. The reason for saving MXCSR on the stack is that vstmxcsr only supports memory operands. The next instruction, mov r10d,[rsp+8], copies this value from the stack and loads it into register R10D. The ensuing instruction pair, shr r10d,MxcsrRcShift and and r10d,3, relocates the rounding control bits to bits 1:0 of register R10D; all other bits in R10D are set to zero. The text endm is an assembler directive that signifies the end of macro GetRC_M.

Following the definition of macro GetRC_M is another macro named SetRC_M. This macro emits instructions that modify MXCSR.RC. Note that macro SetRC_M includes an argument named RcReg. This is a symbolic name for the general-purpose register that contains the new value for MXCSR.RC. More on this in a moment. Macro SetRC_M also begins with the instruction sequence vstmxcsr dword ptr [rsp+8] and mov eax,[rsp+8] to obtain the current contents of MXCSR. It then employs the instruction pair and RcReg,3 and shl RcReg,MxcsrRcShift. These instructions shift the new bits for MXCSR.RC into the correct position. During macro expansion, the assembler replaces macro argument RcReg with the actual register name as you will soon see. The ensuing and eax,MxcsrRcMask and or eax,RcReg instructions update MXCSR.RC with the new rounding mode. The next instruction pair, mov [rsp+8],eax and vldmxcsr dword ptr [rsp+8] (Load MXCSR Register State), loads the new RC control bits into MXCSR.RC. Note that the instruction sequence used in SetRC_M preserves all other bits in the MXCSR register.
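The bit manipulation performed by macros GetRC_M and SetRC_M can also be expressed in C++. The following is a minimal sketch that operates on a plain 32-bit copy of MXCSR instead of the register itself; the function names GetRc() and SetRc() are hypothetical and do not appear in the book's source code.

```cpp
#include <cstdint>

// Constants that mirror the equ directives in Ch12_04_fasm.asm.
constexpr uint32_t MxcsrRcMask = 0x9fff;    // clears bits 14:13 (MXCSR.RC)
constexpr uint32_t MxcsrRcShift = 13;       // bit position of MXCSR.RC

// Same logic as macro GetRC_M: move RC into bits 1:0, zero everything else.
// (Hypothetical helper, operates on a copy of MXCSR.)
constexpr uint32_t GetRc(uint32_t mxcsr)
{
    return (mxcsr >> MxcsrRcShift) & 3;
}

// Same logic as macro SetRC_M: replace RC and preserve the other MXCSR bits.
// (Hypothetical helper, operates on a copy of MXCSR.)
constexpr uint32_t SetRc(uint32_t mxcsr, uint32_t rc)
{
    return (mxcsr & MxcsrRcMask) | ((rc & 3) << MxcsrRcShift);
}
```

For example, starting from the power-on default MXCSR value 0x1f80 (RC = 0, round to nearest), SetRc(0x1f80, 3) yields 0x7f80, and GetRc(0x7f80) returns 3.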

Function ConvertScalar_Aavx() begins its execution with the instruction pair cmp r8d,CvtOpTableCount and jae BadCvtOp that validates argument value cvt_op. If cvt_op is valid, ConvertScalar_Aavx() uses GetRC_M and SetRC_M r9d to modify MXCSR.RC. Note that register R9D contains the new rounding mode. Figure 12-2 contains a portion of the MASM listing file (with some minor edits to improve readability) that shows the expansion of macros GetRC_M and SetRC_M. The MASM listing file denotes macro expanded instructions with a ‘1’ in a column located to the left of each instruction mnemonic. Note that in the expansion of macro SetRC_M, register r9d is substituted for macro argument RcReg.
Figure 12-2

Expansion of macros GetRC_M and SetRC_M

Function ConvertScalar_Aavx() uses argument value cvt_op and a jump table to select a conversion code block. This construct is akin to a C++ switch statement. Immediately after the final ret instruction is a jump table named CvtOpTable. The align 8 statement that appears just before the start of CvtOpTable is an assembler directive that instructs the assembler to align the start of CvtOpTable on a quadword boundary. The align 8 directive is used here since each element of CvtOpTable is a quadword that holds the address of a label defined in ConvertScalar_Aavx(). The labels correspond to code blocks that perform a specific numerical conversion. The instruction jmp [CvtOpTable+rax*8] transfers program control to the code block specified by cvt_op, which was copied into RAX. More specifically, execution of the jmp [CvtOpTable+rax*8] instruction loads RIP with the quadword value stored in memory location CvtOpTable + rax * 8.
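If you are more familiar with C++, the jump table technique resembles an array of function pointers that is indexed by an enum value after a range check. The sketch below illustrates the idea only; the handler functions and the names OpTable, OpTableCount, and Dispatch() are hypothetical stand-ins for the labeled code blocks, CvtOpTable, CvtOpTableCount, and the jmp instruction.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical handlers that stand in for the labeled conversion code blocks.
static int64_t Negate(int64_t x) { return -x; }
static int64_t Twice(int64_t x) { return x * 2; }
static int64_t Square(int64_t x) { return x * x; }

using Handler = int64_t (*)(int64_t);

// The order of entries must match the order of the selecting enum,
// just as CvtOpTable must match enum CvtOp.
static const Handler OpTable[] = { Negate, Twice, Square };
constexpr std::size_t OpTableCount = sizeof(OpTable) / sizeof(OpTable[0]);

// Returns false for an out-of-range index (the BadCvtOp path);
// otherwise dispatches through the table and returns true.
bool Dispatch(std::size_t op, int64_t x, int64_t* result)
{
    if (op >= OpTableCount)
        return false;

    *result = OpTable[op](x);
    return true;
}
```

One difference worth noting is that a call through a function pointer returns to the caller, whereas the jmp instruction in ConvertScalar_Aavx() simply transfers control; each code block must then jump to label Done on its own.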

Each conversion code block in ConvertScalar_Aavx() uses a different AVX instruction to carry out a specific conversion operation. For example, the code block that follows label I32_F32 uses the instruction vcvtsi2ss (Convert Doubleword Integer to SPFP Value) to convert a 32-bit signed integer to single-precision floating-point. Table 12-3 summarizes the scalar floating-point conversion instructions used in example Ch12_04.
Table 12-3

AVX Scalar Floating-Point Conversion Instructions

Instruction Mnemonic    Description
vcvtsi2ss               Convert 32- or 64-bit signed integer to SPFP
vcvtsi2sd               Convert 32- or 64-bit signed integer to DPFP
vcvtss2si               Convert SPFP to 32- or 64-bit signed integer
vcvtsd2si               Convert DPFP to 32- or 64-bit signed integer
vcvtss2sd               Convert SPFP to DPFP
vcvtsd2ss               Convert DPFP to SPFP

The last instruction of each conversion code block is a jmp Done instruction. The label Done is located near the end of function ConvertScalar_Aavx(). At label Done, function ConvertScalar_Aavx() uses SetRC_M r10d to restore the original value of MXCSR.RC. The Visual C++ calling convention requires MXCSR.RC to be preserved across function boundaries. You will learn more about this later in this chapter. Here are the results for source code example Ch12_04:
----- Results for ConvertScalars() -----
Rounding control = Nearest
  F32_I32: 3.14159274 --> 3
  F32_I64: -2.71828175 --> -3
  F64_I32: 1.41421356 --> 1
  F64_I64: 0.70710678 --> 1
  F64_F32: 1.0000000000000002 --> 1.00000000
  I32_F32: 2147483647 --> 2147483648.00000000
  I64_F64: 9223372036854775807 --> 9223372036854775808.00000000
Rounding control = Down
  F32_I32: 3.14159274 --> 3
  F32_I64: -2.71828175 --> -3
  F64_I32: 1.41421356 --> 1
  F64_I64: 0.70710678 --> 0
  F64_F32: 1.0000000000000002 --> 1.00000000
  I32_F32: 2147483647 --> 2147483520.00000000
  I64_F64: 9223372036854775807 --> 9223372036854774784.00000000
Rounding control = Up
  F32_I32: 3.14159274 --> 4
  F32_I64: -2.71828175 --> -2
  F64_I32: 1.41421356 --> 2
  F64_I64: 0.70710678 --> 1
  F64_F32: 1.0000000000000002 --> 1.00000012
  I32_F32: 2147483647 --> 2147483648.00000000
  I64_F64: 9223372036854775807 --> 9223372036854775808.00000000
Rounding control = Zero
  F32_I32: 3.14159274 --> 3
  F32_I64: -2.71828175 --> -2
  F64_I32: 1.41421356 --> 1
  F64_I64: 0.70710678 --> 0
  F64_F32: 1.0000000000000002 --> 1.00000000
  I32_F32: 2147483647 --> 2147483520.00000000
  I64_F64: 9223372036854775807 --> 9223372036854774784.00000000
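The rounding-mode dependence visible in these results can also be observed from C++ using the standard <cfenv> facilities. The following is a minimal sketch, not the book's test code; the function name RoundWithMode() is hypothetical. Compilers differ in how strictly they honor runtime changes to the rounding mode, so the volatile qualifiers are used here to discourage compile-time evaluation and reordering.

```cpp
#include <cfenv>
#include <cmath>

// Rounds x to an integer using the specified <cfenv> rounding mode
// (FE_TONEAREST, FE_DOWNWARD, FE_UPWARD, or FE_TOWARDZERO) and then
// restores the previous mode, much like SetRC_M r10d does at label Done.
// (Hypothetical helper for illustration.)
long RoundWithMode(double x, int mode)
{
    const int old_mode = std::fegetround();
    std::fesetround(mode);

    volatile double v = x;                  // force a runtime conversion
    volatile long result = std::lrint(v);   // uses the current rounding mode

    std::fesetround(old_mode);
    return result;
}
```

With this sketch, the value 0.70710678 is expected to round to 1 when the mode is FE_TONEAREST or FE_UPWARD and to 0 when the mode is FE_DOWNWARD or FE_TOWARDZERO, which agrees with the F64_I64 lines in the output.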

Scalar Floating-Point Arrays

Listing 12-5 shows the source code for example Ch12_05. This example illustrates how to calculate the mean and standard deviation of an array of single-precision floating-point values. Listing 12-5 only shows the assembly language code for example Ch12_05 since most of the other code is identical to what you saw in example Ch03_04. The equations used to calculate the mean and standard deviation are also the same.
;-------------------------------------------------
;               Ch12_05_fasm.asm
;-------------------------------------------------
;--------------------------------------------------------------------------
; extern "C" bool CalcMeanF32_Aavx(float* mean, const float* x, size_t n);
;--------------------------------------------------------------------------
        .code
CalcMeanF32_Aavx proc
; Make sure n is valid
        cmp r8,2                            ;is n >= 2?
        jae @F                              ;jump if yes
        xor eax,eax                         ;set error return code
        ret
; Initialize
@@:     vxorps xmm0,xmm0,xmm0               ;sum = 0.0f
        mov rax,-1                          ;i = -1
; Sum the elements of x
Loop1:  inc rax                             ;i += 1
        cmp rax,r8                          ;is i >= n?
        jae CalcM                           ;jump if yes
        vaddss xmm0,xmm0,real4 ptr [rdx+rax*4]  ;sum += x[i]
        jmp Loop1
; Calculate and save the mean
CalcM:  vcvtsi2ss xmm1,xmm1,r8              ;convert n to SPFP
        vdivss xmm1,xmm0,xmm1               ;xmm1 = mean = sum / n
        vmovss real4 ptr [rcx],xmm1         ;save mean
        mov eax,1                           ;set success return code
        ret
CalcMeanF32_Aavx endp
;--------------------------------------------------------------------------
; extern "C" bool CalcStDevF32_Aavx(float* st_dev, const float* x, size_t n, float mean);
;--------------------------------------------------------------------------
CalcStDevF32_Aavx proc
; Make sure n is valid
        cmp r8,2                            ;is n >= 2?
        jae @F                              ;jump if yes
        xor eax,eax                         ;set error return code
        ret
; Initialize
@@:     vxorps xmm0,xmm0,xmm0               ;sum_squares = 0.0f
        mov rax,-1                          ;i = -1
; Sum the elements of x
Loop1:  inc rax                             ;i += 1
        cmp rax,r8                          ;is i >= n?
        jae CalcSD                          ;jump if yes
        vmovss xmm1,real4 ptr [rdx+rax*4]   ;xmm1 = x[i]
        vsubss xmm2,xmm1,xmm3               ;xmm2 = x[i] - mean
        vmulss xmm2,xmm2,xmm2               ;xmm2 = (x[i] - mean) ** 2
        vaddss xmm0,xmm0,xmm2               ;update sum_squares
        jmp Loop1
; Calculate and save standard deviation
CalcSD: dec r8                              ;r8 = n - 1
        vcvtsi2ss xmm1,xmm1,r8              ;convert n - 1 to SPFP
        vdivss xmm0,xmm0,xmm1               ;xmm0 = sum_squares / (n - 1)
        vsqrtss xmm0,xmm0,xmm0              ;xmm0 = st_dev
        vmovss real4 ptr [rcx],xmm0         ;save st_dev
        mov eax,1                           ;set success return code
        ret
CalcStDevF32_Aavx endp
        end
Listing 12-5

Example Ch12_05

Listing 12-5 begins with the definition of assembly language function CalcMeanF32_Aavx(). The first code block of this function verifies that n >= 2 is true. Following validation of n, CalcMeanF32_Aavx() uses a vxorps xmm0,xmm0,xmm0 (Bitwise Logical XOR of Packed SPFP Values) instruction to set sum = 0.0. The next instruction, mov rax,-1, initializes loop index variable i to -1. Each iteration of Loop1 begins with an inc rax instruction that calculates i += 1. The ensuing instruction pair, cmp rax,r8 and jae CalcM, terminates Loop1 when i >= n is true. The vaddss xmm0,xmm0,real4 ptr [rdx+rax*4] instruction computes sum += x[i]. Following the calculation of sum, CalcMeanF32_Aavx() converts n to a single-precision floating-point value using the AVX instruction vcvtsi2ss xmm1,xmm1,r8. The next two instructions, vdivss xmm1,xmm0,xmm1 and vmovss real4 ptr [rcx],xmm1, calculate and save mean.

Function CalcStDevF32_Aavx() uses a similar for-loop construct to calculate the standard deviation. Inside Loop1, CalcStDevF32_Aavx() calculates sum_squares using the AVX instructions vsubss, vmulss, and vaddss. Note that argument value mean was passed in register XMM3. Following execution of Loop1, CalcStDevF32_Aavx() calculates the standard deviation using the instructions dec r8 (to calculate n - 1), vcvtsi2ss, vdivss, and vsqrtss.

The assembly language code in Listing 12-5 can be easily modified to create the double-precision counterpart functions CalcMeanF64_Aavx() and CalcStDevF64_Aavx(). Simply switch the single-precision (ss suffix) instructions to their double-precision (sd suffix) counterparts. Instructions that reference operands in memory will also need to use real8 ptr and a scale factor of 8 instead of real4 ptr and 4. Here are the results for source code example Ch12_05:
Results for CalcMeanF32_Cpp and CalcStDevF32_Cpp
mean1:    49.602146  st_dev1:  27.758242
Results for CalcMeanF32_Aavx and CalcStDevF32_Aavx
mean2:    49.602146  st_dev2:  27.758242
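For reference, the arithmetic carried out by the two assembly language functions can be sketched in C++ as shown below. This is an illustrative approximation rather than the code used in example Ch03_04; the function names MeanF32() and StDevF32() are hypothetical. Like the assembly language code, the sketch rejects n < 2 and divides the sum of squares by n - 1.

```cpp
#include <cmath>
#include <cstddef>

// Calculates the arithmetic mean of x[0..n-1]; returns false if n < 2.
// (Hypothetical C++ counterpart of CalcMeanF32_Aavx.)
bool MeanF32(float* mean, const float* x, std::size_t n)
{
    if (n < 2)
        return false;

    float sum = 0.0f;
    for (std::size_t i = 0; i < n; i++)
        sum += x[i];

    *mean = sum / static_cast<float>(n);
    return true;
}

// Calculates the sample standard deviation of x[0..n-1] using a
// previously calculated mean; returns false if n < 2.
// (Hypothetical C++ counterpart of CalcStDevF32_Aavx.)
bool StDevF32(float* st_dev, const float* x, std::size_t n, float mean)
{
    if (n < 2)
        return false;

    float sum_squares = 0.0f;
    for (std::size_t i = 0; i < n; i++)
    {
        const float temp = x[i] - mean;
        sum_squares += temp * temp;
    }

    *st_dev = std::sqrt(sum_squares / static_cast<float>(n - 1));
    return true;
}
```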

Calling Convention: Part 2

The source code presented thus far has informally discussed various aspects of the Visual C++ calling convention. In this section, the calling convention is formally explained. It reiterates some earlier elucidations and introduces new requirements that have not been discussed. A basic understanding of the calling convention is necessary since it is used extensively in subsequent chapters that explain x86-AVX SIMD programming using x86-64 assembly language.

Note

As a reminder, if you are reading this book to learn x86-64 assembly language programming and plan on using it with a different operating system or high-level language, you should consult the appropriate documentation for more information regarding the particulars of that calling convention.

The Visual C++ calling convention designates each x86-64 processor general-purpose register as volatile or nonvolatile. It also applies a volatile or nonvolatile classification to each XMM register. An x86-64 assembly language function can modify the contents of any volatile register but must preserve the contents of any nonvolatile register it uses. Table 12-4 lists the volatile and nonvolatile general-purpose and XMM registers.
Table 12-4

Visual C++ 64-Bit Volatile and Nonvolatile Registers

Register Group            Volatile Registers                 Nonvolatile Registers
General-purpose           RAX, RCX, RDX, R8, R9, R10, R11    RBX, RSI, RDI, RBP, RSP, R12, R13, R14, R15
Floating point and SIMD   XMM0–XMM5                          XMM6–XMM15

On systems that support AVX or AVX2, the high-order 128 bits of registers YMM0–YMM15 are classified as volatile. Similarly, the high-order 384 bits of registers ZMM0–ZMM15 are classified as volatile on systems that support AVX-512. Registers ZMM16–ZMM31 and their corresponding YMM and XMM registers are also designated as volatile and need not be preserved. The legacy x87 FPU register stack is classified as volatile. All control bits in RFLAGS and MXCSR must be preserved across function boundaries. For example, assume function Foo() changes MXCSR.RC prior to performing a floating-point calculation. It then needs to call the C++ library function cos() to perform another calculation. Function Foo() must restore the original contents of MXCSR.RC before calling cos().
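The save-and-restore obligation described for function Foo() has a natural C++ analogue. The sketch below uses a small scope guard built on the standard <cfenv> functions; the class name RoundingModeGuard is hypothetical. It assumes that the C++ runtime maps the <cfenv> rounding mode onto MXCSR.RC, which is the typical arrangement on x86-64 platforms but should be confirmed for your toolchain.

```cpp
#include <cfenv>

// Saves the current rounding mode on construction and restores it on
// destruction. (Hypothetical helper, not part of the book's source code.)
class RoundingModeGuard
{
public:
    explicit RoundingModeGuard(int new_mode) : m_OldMode(std::fegetround())
    {
        std::fesetround(new_mode);
    }

    ~RoundingModeGuard()
    {
        std::fesetround(m_OldMode);     // restore the caller's rounding mode
    }

    RoundingModeGuard(const RoundingModeGuard&) = delete;
    RoundingModeGuard& operator=(const RoundingModeGuard&) = delete;

private:
    int m_OldMode;
};
```

A function would create a RoundingModeGuard object in an inner scope, perform its calculation, and close the scope before calling any library function, so that the library function executes with the original rounding mode.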

The programming requirements imposed on an x86-64 assembly language function by the Visual C++ calling convention vary depending on whether the function is a leaf or nonleaf function. Leaf functions are functions that
  • Do not call any other functions

  • Do not modify the contents of register RSP

  • Do not allocate any local stack space

  • Do not modify any of the nonvolatile general-purpose or XMM registers

  • Do not use exception handling

x86-64 assembly language leaf functions are easier to code, but they are only suitable for relatively simple computations. A nonleaf function can use the entire x86-64 register set, create a stack frame, or allocate local stack space. The preservation of nonvolatile registers and local stack space allocation is typically performed at the beginning of a function in a code block known as the prologue. Functions that utilize a prologue must also include a corresponding epilogue. A function epilogue releases any locally allocated stack space and restores any prologue preserved nonvolatile registers.

In the remainder of this section, you will examine four source code examples. The first three examples illustrate how to code nonleaf functions using explicit x86-64 assembly language instructions and assembler directives. These examples also convey critical programming information regarding the organization of a nonleaf function stack frame. The fourth example demonstrates how to use several prologue and epilogue macros. These macros help automate most of the programming labor that is associated with a nonleaf function. The source code listings in this section include only the C++ header file and the x86-64 assembly language code. The C++ code that performs test case initialization, argument checking, displaying of results, etc., is not shown to streamline the elucidations. The software download package includes the complete source code for each example.

Stack Frames

Listing 12-6 shows the source code for example Ch12_06. This example demonstrates how to create and use a stack frame pointer in an assembly language function. Source code example Ch12_06 also illustrates some of the programming protocols that an assembly language function prologue and epilogue must observe.
//------------------------------------------------
//               Ch12_06.h
//------------------------------------------------
#pragma once
#include <cstdint>
// Ch12_06_fasm.asm
extern "C" int64_t SumIntegers_A(int8_t a, int16_t b, int32_t c, int64_t d,
    int8_t e, int16_t f, int32_t g, int64_t h);
;-------------------------------------------------
;               Ch12_06_fasm.asm
;-------------------------------------------------
;--------------------------------------------------------------------------
; extern "C" int64_t SumIntegers_A(int8_t a, int16_t b, int32_t c, int64_t d,
;   int8_t e, int16_t f, int32_t g, int64_t h);
;--------------------------------------------------------------------------
; Named expressions for constant values:
;
; RBP_RA        = number of bytes between RBP and return address on stack
; STK_LOCAL     = size of local stack space
RBP_RA = 24
STK_LOCAL = 16
        .code
SumIntegers_A proc frame
; Function prologue
        push rbp                            ;save caller's rbp register
        .pushreg rbp
        sub rsp,STK_LOCAL                   ;allocate local stack space
        .allocstack STK_LOCAL
        mov rbp,rsp                         ;set frame pointer
        .setframe rbp,0
        .endprolog                          ;mark end of prologue
; Save argument registers to home area (optional)
        mov [rbp+RBP_RA+8],rcx
        mov [rbp+RBP_RA+16],rdx
        mov [rbp+RBP_RA+24],r8
        mov [rbp+RBP_RA+32],r9
; Calculate a + b + c + d
        movsx rcx,cl                        ;rcx = a
        movsx rdx,dx                        ;rdx = b
        movsxd r8,r8d                       ;r8 = c
        add rcx,rdx                         ;rcx = a + b
        add r8,r9                           ;r8 = c + d
        add r8,rcx                          ;r8 = a + b + c + d
        mov [rbp],r8                        ;save a + b + c + d on stack
; Calculate e + f + g + h
        movsx rcx,byte ptr [rbp+RBP_RA+40]  ;rcx = e
        movsx rdx,word ptr [rbp+RBP_RA+48]  ;rdx = f
        movsxd r8,dword ptr [rbp+RBP_RA+56] ;r8 = g
        add rcx,rdx                         ;rcx = e + f
        add r8,qword ptr [rbp+RBP_RA+64]    ;r8 = g + h
        add r8,rcx                          ;r8 = e + f + g + h
; Compute final sum
        mov rax,[rbp]                       ;rax = a + b + c + d
        add rax,r8                          ;rax = final sum
; Function epilogue
        add rsp,16                          ;release local stack space
        pop rbp                             ;restore caller's rbp register
        ret
SumIntegers_A endp
        end
Listing 12-6

Example Ch12_06

Functions that need to reference both argument values and local variables on the stack often create a stack frame during execution of their prologues. During creation of a stack frame, register RBP is typically initialized as a stack frame pointer. Following stack frame initialization, the remaining code in a function can access items on the stack using RBP as a base register.

Near the top of file Ch12_06_fasm.asm are the statements RBP_RA = 24 and STK_LOCAL = 16. The = symbol is an assembler directive that defines a symbolic name for a numerical value. Unlike the equ directive, symbolic names defined using the = directive can be redefined. RBP_RA denotes the number of bytes between RBP and the return address on the stack (it also equals the number of extra bytes needed to reference the stack home area). STK_LOCAL represents the number of bytes allocated on the stack for local storage. More on these values in a moment.

Following definition of RBP_RA and STK_LOCAL is the statement SumIntegers_A proc frame, which defines the beginning of function SumIntegers_A(). The frame attribute notifies the assembler that the function SumIntegers_A uses a stack frame pointer. It also instructs the assembler to generate static table data that the Visual C++ runtime environment uses to process exceptions. The ensuing push rbp instruction saves the caller’s RBP register on the stack since function SumIntegers_A() uses this register as its stack frame pointer. The .pushreg rbp statement that follows is an assembler directive that saves offset information about the push rbp instruction in an assembler-maintained exception handling table (see example Ch11_08 for more information about why this is necessary). It is important to keep in mind that assembler directives are not executable instructions; they are directions to the assembler on how to perform specific actions during assembly of the source code.

The sub rsp,STK_LOCAL instruction allocates STK_LOCAL bytes of space on the stack for local variables. Function SumIntegers_A() only uses eight bytes of this space, but the Visual C++ calling convention for 64-bit programs requires nonleaf functions to maintain double quadword (16-byte) alignment of the stack pointer outside of the prologue. You will learn more about stack pointer alignment requirements later in this section. The next statement, .allocstack STK_LOCAL, is an assembler directive that saves local stack size allocation information in the Visual C++ runtime exception handling tables.

The mov rbp,rsp instruction initializes register RBP as the stack frame pointer, and the .setframe rbp,0 directive notifies the assembler of this action. The offset value 0 that is included in the .setframe directive is the difference in bytes between RSP and RBP. In function SumIntegers_A(), registers RSP and RBP are the same, so the offset value is zero. Later in this section, you will learn more about the .setframe directive. It should be noted that x86-64 assembly language functions can use any nonvolatile register as a stack frame pointer. Using RBP provides consistency between x86-64 and x86-32 assembly language code, which uses register EBP. The final assembler directive, .endprolog, signifies the end of the prologue for function SumIntegers_A(). Figure 12-3 shows the stack layout and argument registers following execution of the prologue.
Figure 12-3

Stack layout and registers of function SumIntegers_A() following execution of the prologue

The next code block contains a series of mov instructions that save registers RCX, RDX, R8, and R9 to their respective home areas on the stack. This step is optional and included in SumIntegers_A() for demonstration purposes. Note that the offset of each mov instruction includes the symbolic constant RBP_RA. Another option allowed by the Visual C++ calling convention is to save an argument register to its corresponding home area prior to the push rbp instruction using RSP as a base register (e.g., mov [rsp+8],rcx, mov [rsp+16],rdx, and so on). Also keep in mind that a function can use its home area to store other temporary values. When used for alternative storage purposes, the home area should not be referenced by an assembly language instruction until after the .endprolog directive per the Visual C++ calling convention.

Following the home area save operation, the function SumIntegers_A() sums argument values a, b, c, and d. It then saves this intermediate sum to LocalVar1 on the stack using a mov [rbp],r8 instruction. Note that the summation calculation sign-extends argument values a, b, and c using a movsx or movsxd instruction. A similar sequence of instructions is used to sum argument values e, f, g, and h, which are located on the stack and referenced using the stack frame pointer RBP and a constant offset. The symbolic name RBP_RA is also used here to account for the extra stack space needed to reference argument values on the stack. The two intermediate sums are then added to produce the final sum in register RAX.
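The offsets used in SumIntegers_A() follow directly from the prologue: the push rbp instruction consumes 8 bytes and the sub rsp,STK_LOCAL instruction consumes 16 bytes, which places the return address 24 bytes above RBP. The following sketch checks this arithmetic at compile time. The names SAVED_RBP and ArgOffset() are hypothetical and were introduced only for illustration.

```cpp
#include <cstddef>

// Byte offsets relative to RBP after the prologue of SumIntegers_A().
constexpr std::size_t STK_LOCAL = 16;                   // local stack space
constexpr std::size_t SAVED_RBP = 8;                    // from push rbp (hypothetical name)
constexpr std::size_t RBP_RA = STK_LOCAL + SAVED_RBP;   // RBP to return address

// Offset from RBP of argument slot 'index' (0-based). Slots 0-3 are the
// home area for RCX, RDX, R8, and R9; slots 4-7 hold arguments e, f, g, h.
// (Hypothetical helper.)
constexpr std::size_t ArgOffset(std::size_t index)
{
    return RBP_RA + 8 * (index + 1);
}

static_assert(RBP_RA == 24, "matches RBP_RA = 24 in the listing");
static_assert(ArgOffset(0) == RBP_RA + 8, "home area slot for RCX");
static_assert(ArgOffset(3) == RBP_RA + 32, "home area slot for R9");
static_assert(ArgOffset(4) == RBP_RA + 40, "argument e");
static_assert(ArgOffset(7) == RBP_RA + 64, "argument h");
```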

A function epilogue must release any local stack storage space that was allocated in the prologue, restore any nonvolatile registers that were saved on the stack, and execute a function return. The add rsp,16 instruction releases the 16 bytes of stack space that SumIntegers_A() allocated in its prologue. This is followed by a pop rbp instruction, which restores the caller’s RBP register. The obligatory ret instruction is next. Here are the results for source code example Ch12_06:
----- Results for SumIntegers_A -----
a:        10
b:      -200
c:      -300
d:      4000
e:       -20
f:       400
g:      -600
h:     -8000
sum:   -4710
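A C++ rendering of the same calculation makes the role of the movsx and movsxd instructions easier to see: each narrow signed argument is widened to 64 bits before it participates in an addition. The sketch below is for comparison only, and the function name SumIntegers() is hypothetical.

```cpp
#include <cstdint>

// Sums eight signed integers of mixed widths.
// (Hypothetical C++ counterpart of SumIntegers_A.)
int64_t SumIntegers(int8_t a, int16_t b, int32_t c, int64_t d,
    int8_t e, int16_t f, int32_t g, int64_t h)
{
    // The explicit casts correspond to the movsx/movsxd sign extensions.
    const int64_t sum1 = static_cast<int64_t>(a) + static_cast<int64_t>(b) +
                         static_cast<int64_t>(c) + d;
    const int64_t sum2 = static_cast<int64_t>(e) + static_cast<int64_t>(f) +
                         static_cast<int64_t>(g) + h;

    return sum1 + sum2;
}
```

Calling SumIntegers(10, -200, -300, 4000, -20, 400, -600, -8000) returns -4710, which matches the result shown above.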

Using Nonvolatile General-Purpose Registers

The next source code example, Ch12_07, demonstrates how to use nonvolatile general-purpose registers in an x86-64 assembly language function. It also provides additional programming details regarding stack frames and the use of local variables. Listing 12-7 shows the header file and assembly language source code for source code example Ch12_07.
//------------------------------------------------
//               Ch12_07.h
//------------------------------------------------
#pragma once
#include <cstdint>
// Ch12_07_fasm.asm
extern "C" void CalcSumProd_A(const int64_t* a, const int64_t* b, int32_t n,
    int64_t* sum_a, int64_t* sum_b, int64_t* prod_a, int64_t* prod_b);
;-------------------------------------------------
;               Ch12_07_fasm.asm
;-------------------------------------------------
;--------------------------------------------------------------------------
; extern "C" void CalcSumProd_A(const int64_t* a, const int64_t* b, int32_t n,
;    int64_t* sum_a, int64_t* sum_b, int64_t* prod_a, int64_t* prod_b);
;--------------------------------------------------------------------------
; Named expressions for constant values:
;
; NUM_PUSHREG   = number of prolog non-volatile register pushes
; STK_LOCAL1    = size in bytes of STK_LOCAL1 area (see figure in text)
; STK_LOCAL2    = size in bytes of STK_LOCAL2 area (see figure in text)
; STK_PAD       = extra bytes (0 or 8) needed to 16-byte align RSP
; STK_TOTAL     = total size in bytes of local stack
; RBP_RA        = number of bytes between RBP and return address on stack
NUM_PUSHREG     = 4
STK_LOCAL1      = 32
STK_LOCAL2      = 16
STK_PAD         = ((NUM_PUSHREG AND 1) XOR 1) * 8
STK_TOTAL       = STK_LOCAL1 + STK_LOCAL2 + STK_PAD
RBP_RA          = NUM_PUSHREG * 8 + STK_LOCAL1 + STK_PAD
        .const
TestVal db 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
        .code
CalcSumProd_A proc frame
; Function prologue
        push rbp                            ;save non-volatile register RBP
        .pushreg rbp
        push rbx                            ;save non-volatile register RBX
        .pushreg rbx
        push r12                            ;save non-volatile register R12
        .pushreg r12
        push r13                            ;save non-volatile register R13
        .pushreg r13
        sub rsp,STK_TOTAL                   ;allocate local stack space
        .allocstack STK_TOTAL
        lea rbp,[rsp+STK_LOCAL2]            ;set frame pointer
        .setframe rbp,STK_LOCAL2
        .endprolog                          ;end of prologue
; Initialize local variables on the stack (demonstration only)
        vmovdqu xmm5,xmmword ptr [TestVal]
        vmovdqa xmmword ptr [rbp-16],xmm5   ;save xmm5 to LocalVar2A/2B
        mov qword ptr [rbp],0aah            ;save 0xaa to LocalVar1A
        mov qword ptr [rbp+8],0bbh          ;save 0xbb to LocalVar1B
        mov qword ptr [rbp+16],0cch         ;save 0xcc to LocalVar1C
        mov qword ptr [rbp+24],0ddh         ;save 0xdd to LocalVar1D
; Save argument values to home area (optional)
        mov qword ptr [rbp+RBP_RA+8],rcx
        mov qword ptr [rbp+RBP_RA+16],rdx
        mov qword ptr [rbp+RBP_RA+24],r8
        mov qword ptr [rbp+RBP_RA+32],r9
; Perform required initializations for processing loop
        test r8d,r8d                        ;is n <= 0?
        jle Done                            ;jump if n <= 0
        mov rbx,-8                          ;rbx = offset to array elements
        xor r10,r10                         ;r10 = sum_a
        xor r11,r11                         ;r11 = sum_b
        mov r12,1                           ;r12 = prod_a
        mov r13,1                           ;r13 = prod_b
; Compute the array sums and products
@@:     add rbx,8                           ;rbx = offset to next elements
        mov rax,[rcx+rbx]                   ;rax = a[i]
        add r10,rax                         ;update sum_a
        imul r12,rax                        ;update prod_a
        mov rax,[rdx+rbx]                   ;rax = b[i]
        add r11,rax                         ;update sum_b
        imul r13,rax                        ;update prod_b
        dec r8d                             ;adjust count
        jnz @B                              ;repeat until done
; Save the final results
        mov [r9],r10                        ;save sum_a
        mov rax,[rbp+RBP_RA+40]             ;rax = ptr to sum_b
        mov [rax],r11                       ;save sum_b
        mov rax,[rbp+RBP_RA+48]             ;rax = ptr to prod_a
        mov [rax],r12                       ;save prod_a
        mov rax,[rbp+RBP_RA+56]             ;rax = ptr to prod_b
        mov [rax],r13                       ;save prod_b
; Function epilogue
Done:   lea rsp,[rbp+STK_LOCAL1+STK_PAD]    ;restore rsp
        pop r13                             ;restore non-volatile GP registers
        pop r12
        pop rbx
        pop rbp
        ret
CalcSumProd_A endp
        end
Listing 12-7

Example Ch12_07

Toward the top of the assembly language code is a series of named constants that control how much stack space is allocated in the prologue of function CalcSumProd_A(). Like the previous example, the function CalcSumProd_A() includes the frame attribute as part of its proc statement to indicate that it uses a stack frame pointer. A series of push instructions saves nonvolatile registers RBP, RBX, R12, and R13 on the stack. Note that a .pushreg directive follows each x86-64 push instruction, which instructs the assembler to add information about each push instruction to the Visual C++ runtime exception handling tables.

A sub rsp,STK_TOTAL instruction allocates space on the stack for local variables, and the required .allocstack STK_TOTAL directive follows next. Register RBP is then initialized as the function’s stack frame pointer using a lea rbp,[rsp+STK_LOCAL2] (Load Effective Address) instruction, which loads rsp + STK_LOCAL2 into register RBP. Figure 12-4 illustrates the layout of the stack following execution of the lea instruction. Positioning RBP so that it “splits” the local stack area into two sections enables the assembler to generate machine code that is slightly more efficient since a larger portion of the local stack area can be referenced using 8-bit signed instead of 32-bit signed displacements. It also simplifies the saving and restoring of nonvolatile XMM registers, which is discussed later in this chapter. Following the lea instruction is a .setframe rbp,STK_LOCAL2 directive that enables the assembler to properly configure the runtime exception handling tables. The size parameter of a .setframe directive must be a multiple of 16 and less than or equal to 240. The .endprolog directive signifies the end of the prologue for function CalcSumProd_A().
Figure 12-4

Stack layout and argument registers following execution of lea rbp,[rsp+STK_LOCAL2] in function CalcSumProd_A()
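
The benefit of placing RBP inside the local stack area can be illustrated with a short C++ sketch. The helpers FitsDisp8 and ReachableBytes are hypothetical, as are any sizes passed to them; the only fact relied on is that an 8-bit signed displacement covers offsets from -128 through 127.

```cpp
// True if an offset from the frame pointer can be encoded as an 8-bit
// signed displacement; anything else needs a 32-bit displacement.
constexpr bool FitsDisp8(int offset)
{
    return offset >= -128 && offset <= 127;
}

// Counts how many bytes of a local area of 'total' bytes can be reached with
// 8-bit displacements when the frame pointer sits 'fp_pos' bytes above the
// bottom of that area. (Illustrative helper; sizes are assumptions.)
constexpr int ReachableBytes(int total, int fp_pos)
{
    int count = 0;
    for (int i = 0; i < total; i++)
    {
        if (FitsDisp8(i - fp_pos))
            count++;
    }
    return count;
}
```

For a hypothetical 200-byte local area, ReachableBytes(200, 0) yields 128, whereas ReachableBytes(200, 100) yields 200, since every byte then lies within -100 to +99 of the frame pointer.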

The next code block contains instructions that initialize several local variables on the stack. These instructions are for demonstration purposes only. Note that the vmovdqa [rbp-16],xmm5 (Move Aligned Packed Integer Values) instruction requires its destination operand to be aligned on a 16-byte boundary. Following initialization of the local variables, the argument registers are saved to their home locations, also just for demonstration purposes.

Function CalcSumProd_A() computes sums and products using the elements of two integer arrays. Prior to the start of the for-loop, the instruction pair test r8d,r8d (Logical Compare) and jle Done skips over the for-loop if n <= 0 is true. The test instruction performs a bitwise logical AND of its two operands and updates the status flags in RFLAGS; the result of the bitwise AND operation is discarded. Following validation of argument value n, the function CalcSumProd_A() initializes the intermediate values sum_a (R10) and sum_b (R11) to zero and prod_a (R12) and prod_b (R13) to one. It then calculates the sum and product of the input arrays a and b. The results are saved to the memory locations specified by the caller. Note that the pointers for sum_b, prod_a, and prod_b were passed to CalcSumProd_A() via the stack as shown in Figure 12-4.
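
For comparison, the arithmetic performed by the processing loop can be sketched in C++. The function name CalcSumProdRef, the 64-bit element type, and the decision to leave the outputs unmodified when n <= 0 are assumptions made for illustration; the argument order follows the register and stack usage described above.

```cpp
// Hypothetical C++ counterpart of the loop in CalcSumProd_A(). The name and
// the element type (long long) are assumptions, not the book's code.
void CalcSumProdRef(const long long* a, const long long* b, int n,
    long long* sum_a, long long* sum_b, long long* prod_a, long long* prod_b)
{
    if (n <= 0)                     // same check as test r8d,r8d / jle Done
        return;                     // assumption: outputs are left untouched

    long long sa = 0, sb = 0;       // sums start at zero (R10, R11)
    long long pa = 1, pb = 1;       // products start at one (R12, R13)

    for (int i = 0; i < n; i++)
    {
        sa += a[i];
        sb += b[i];
        pa *= a[i];
        pb *= b[i];
    }

    // Save the results to the locations supplied by the caller
    *sum_a = sa;
    *sum_b = sb;
    *prod_a = pa;
    *prod_b = pb;
}
```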

The epilogue of function CalcSumProd_A() begins with a lea rsp,[rbp+STK_LOCAL1+STK_PAD] instruction that restores register RSP to the value it had immediately after execution of the push r13 instruction in the prologue. When restoring RSP in an epilogue, the Visual C++ calling convention specifies that either a lea rsp,[RFP+X] or add rsp,X instruction must be used, where RFP denotes the frame pointer register and X is a constant value. This limits the number of instruction patterns that the runtime exception handler must identify. The subsequent pop instructions restore the nonvolatile general-purpose registers prior to execution of the ret instruction. According to the Visual C++ calling convention, function epilogues must be devoid of any processing logic including the setting of a return value. Here are the results for source code example Ch12_07:
----- Results for CalcSumProd_A -----
i:      0   a:      2   b:      3
i:      1   a:     -2   b:      5
i:      2   a:     -6   b:     -7
i:      3   a:      7   b:      8
i:      4   a:     12   b:      4
i:      5   a:      5   b:      9
sum_a =      18   sum_b =      22
prod_a =  10080   prod_b = -30240

Using Nonvolatile SIMD Registers

Earlier in this chapter, you learned how to use XMM registers to perform scalar floating-point arithmetic. The next source code example, named Ch12_08, illustrates the prologue and epilogue conventions that must be observed before a function can use any of the nonvolatile XMM registers. Listing 12-8 shows the source code for example Ch12_08.
//------------------------------------------------
//               Ch12_08.h
//------------------------------------------------
#pragma once
// Ch12_08_fcpp.cpp
extern bool CalcConeAreaVol_Cpp(const double* r, const double* h, int n,
    double* sa_cone, double* vol_cone);
// Ch12_08_fasm.asm
extern "C" bool CalcConeAreaVol_A(const double* r, const double* h, int n,
    double* sa_cone, double* vol_cone);
;-------------------------------------------------
;               Ch12_08_fasm.asm
;-------------------------------------------------
;--------------------------------------------------------------------------
; extern "C" bool CalcConeAreaVol_A(const double* r, const double* h, int n,
; double* sa_cone, double* vol_cone);
;--------------------------------------------------------------------------
; Named expressions for constant values
;
; NUM_PUSHREG   = number of prolog non-volatile register pushes
; STK_LOCAL1    = size in bytes of STK_LOCAL1 area (see figure in text)
; STK_LOCAL2    = size in bytes of STK_LOCAL2 area (see figure in text)
; STK_PAD       = extra bytes (0 or 8) needed to 16-byte align RSP
; STK_TOTAL     = total size in bytes of local stack
; RBP_RA        = number of bytes between RBP and ret addr on stack
NUM_PUSHREG     = 7
STK_LOCAL1      = 16
STK_LOCAL2      = 64
STK_PAD         = ((NUM_PUSHREG AND 1) XOR 1) * 8
STK_TOTAL       = STK_LOCAL1 + STK_LOCAL2 + STK_PAD
RBP_RA          = NUM_PUSHREG * 8 + STK_LOCAL1 + STK_PAD
            .const
r8_3p0      real8 3.0
r8_pi       real8 3.14159265358979323846
        .code
CalcConeAreaVol_A proc frame
; Save non-volatile general-purpose registers
        push rbp
        .pushreg rbp
        push rbx
        .pushreg rbx
        push rsi
        .pushreg rsi
        push r12
        .pushreg r12
        push r13
        .pushreg r13
        push r14
        .pushreg r14
        push r15
        .pushreg r15
; Allocate local stack space and initialize frame pointer
        sub rsp,STK_TOTAL                   ;allocate local stack space
        .allocstack STK_TOTAL
        lea rbp,[rsp+STK_LOCAL2]            ;rbp = stack frame pointer
        .setframe rbp,STK_LOCAL2
; Save non-volatile registers XMM12 - XMM15. Note that STK_LOCAL2 must
; be greater than or equal to the number of XMM register saves times 16.
        vmovdqa xmmword ptr [rbp-STK_LOCAL2+48],xmm12
        .savexmm128 xmm12,48
        vmovdqa xmmword ptr [rbp-STK_LOCAL2+32],xmm13
        .savexmm128 xmm13,32
        vmovdqa xmmword ptr [rbp-STK_LOCAL2+16],xmm14
        .savexmm128 xmm14,16
        vmovdqa xmmword ptr [rbp-STK_LOCAL2],xmm15
        .savexmm128 xmm15,0
        .endprolog
; Access local variables on the stack (demonstration only)
        mov qword ptr [rbp],-1              ;LocalVar1A = -1
        mov qword ptr [rbp+8],-2            ;LocalVar1B = -2
; Initialize the processing loop variables. Note that many of the
; register initializations below are performed merely to illustrate
; use of the non-volatile GP and XMM registers.
        mov esi,r8d                         ;esi = n
        test esi,esi                        ;is n > 0?
        jg @F                               ;jump if n > 0
        xor eax,eax                         ;set error return code
        jmp Done
@@:     mov rbx,-8                          ;rbx = offset to array elements
        mov r12,rcx                         ;r12 = ptr to r
        mov r13,rdx                         ;r13 = ptr to h
        mov r14,r9                          ;r14 = ptr to sa_cone
        mov r15,[rbp+RBP_RA+40]             ;r15 = ptr to vol_cone
        vmovsd xmm14,real8 ptr [r8_pi]      ;xmm14 = pi
        vmovsd xmm15,real8 ptr [r8_3p0]     ;xmm15 = 3.0
; Calculate cone surface areas and volumes
; sa = pi * r * (r + sqrt(r * r + h * h))
; vol = pi * r * r * h / 3
@@:     add rbx,8                           ;rbx = offset to next elements
        vmovsd xmm0,real8 ptr [r12+rbx]     ;xmm0 = r
        vmovsd xmm1,real8 ptr [r13+rbx]     ;xmm1 = h
        vmovsd xmm12,xmm12,xmm0             ;xmm12 = r
        vmovsd xmm13,xmm13,xmm1             ;xmm13 = h
        vmulsd xmm0,xmm0,xmm0               ;xmm0 = r * r
        vmulsd xmm1,xmm1,xmm1               ;xmm1 = h * h
        vaddsd xmm0,xmm0,xmm1               ;xmm0 = r * r + h * h
        vsqrtsd xmm0,xmm0,xmm0              ;xmm0 = sqrt(r * r + h * h)
        vaddsd xmm0,xmm0,xmm12              ;xmm0 = r + sqrt(r * r + h * h)
        vmulsd xmm0,xmm0,xmm12              ;xmm0 = r * (r + sqrt(r * r + h * h))
        vmulsd xmm0,xmm0,xmm14              ;xmm0 = pi * r * (r + sqrt(r * r + h * h))
        vmulsd xmm12,xmm12,xmm12            ;xmm12 = r * r
        vmulsd xmm13,xmm13,xmm14            ;xmm13 = h * pi
        vmulsd xmm13,xmm13,xmm12            ;xmm13 = pi * r * r * h
        vdivsd xmm13,xmm13,xmm15            ;xmm13 = pi * r * r * h / 3
        vmovsd real8 ptr [r14+rbx],xmm0     ;save surface area
        vmovsd real8 ptr [r15+rbx],xmm13    ;save volume
        dec esi                             ;update counter
        jnz @B                              ;repeat until done
        mov eax,1                           ;set success return code
; Restore non-volatile XMM registers
Done:   vmovdqa xmm12,xmmword ptr [rbp-STK_LOCAL2+48]
        vmovdqa xmm13,xmmword ptr [rbp-STK_LOCAL2+32]
        vmovdqa xmm14,xmmword ptr [rbp-STK_LOCAL2+16]
        vmovdqa xmm15,xmmword ptr [rbp-STK_LOCAL2]
; Restore non-volatile general-purpose registers
        lea rsp,[rbp+STK_LOCAL1+STK_PAD]    ;restore rsp
        pop r15
        pop r14
        pop r13
        pop r12
        pop rsi
        pop rbx
        pop rbp
        ret
CalcConeAreaVol_A endp
        end
Listing 12-8

Example Ch12_08

The assembly language function CalcConeAreaVol_A() calculates surface areas and volumes of right-circular cones. The following formulas are used to calculate these values:
$$ sa=\pi r\left(r+\sqrt{r^2+h^2}\right) $$
$$ vol=\pi r^2h/3 $$
The function CalcConeAreaVol_A() begins by saving the nonvolatile general-purpose registers that it uses on the stack. It then allocates the specified amount of local stack space and initializes RBP as the stack frame pointer. The next code block saves nonvolatile registers XMM12-XMM15 on the stack using a series of vmovdqa instructions. A .savexmm128 directive must be used after each vmovdqa instruction. Like the other prologue directives, the .savexmm128 directive instructs the assembler to store information regarding the preservation of a nonvolatile XMM register in its exception handling tables. The offset argument of a .savexmm128 directive represents the displacement of the saved XMM register on the stack relative to register RSP. Note that the size of STK_LOCAL2 must be greater than or equal to the number of saved XMM registers multiplied by 16. Figure 12-5 illustrates the layout of the stack following execution of the vmovdqa xmmword ptr [rbp-STK_LOCAL2],xmm15 instruction.
Figure 12-5

Stack layout and argument registers following execution of vmovdqa xmmword ptr [rbp-STK_LOCAL2],xmm15 in function CalcConeAreaVol_A()
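
The arithmetic behind the named constants in Listing 12-8 can be checked with a small C++ sketch. The constant values are copied from the listing; the names NUM_XMM, BYTES_BELOW_CALLER, and OFFSET_VOL_CONE are introduced here only for illustration. The sketch relies on the calling convention's requirement that RSP is 16-byte aligned before a call, so it is 8 bytes past a 16-byte boundary on function entry.

```cpp
// Values copied from Listing 12-8
constexpr int NUM_PUSHREG = 7;      // rbp, rbx, rsi, r12, r13, r14, r15
constexpr int STK_LOCAL1  = 16;
constexpr int STK_LOCAL2  = 64;

// Illustrative name: number of nonvolatile XMM registers saved (xmm12 - xmm15)
constexpr int NUM_XMM     = 4;

// An odd number of 8-byte pushes realigns RSP; an even number needs 8 pad bytes
constexpr int STK_PAD   = ((NUM_PUSHREG & 1) ^ 1) * 8;
constexpr int STK_TOTAL = STK_LOCAL1 + STK_LOCAL2 + STK_PAD;
constexpr int RBP_RA    = NUM_PUSHREG * 8 + STK_LOCAL1 + STK_PAD;

// Return address + pushed registers + local stack space
constexpr int BYTES_BELOW_CALLER = 8 + NUM_PUSHREG * 8 + STK_TOTAL;

static_assert(BYTES_BELOW_CALLER % 16 == 0, "RSP must be 16-byte aligned");
static_assert(STK_LOCAL2 >= NUM_XMM * 16, "XMM save area is too small");

// Fifth argument (vol_cone): return address, then the 32-byte home area
constexpr int OFFSET_VOL_CONE = RBP_RA + 40;
```

With these values STK_PAD is 0, STK_TOTAL is 80, and RBP_RA is 72, so the mov r15,[rbp+RBP_RA+40] instruction reads the vol_cone pointer from rbp+112.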

Following the prologue, local variables LocalVar1A and LocalVar1B are accessed for demonstration purposes. Initialization of the registers used by the main processing loop occurs next. Note that many of these initializations are either suboptimal or superfluous; they are performed merely to highlight the use of nonvolatile registers, both general purpose and XMM. Calculation of the cone surface areas and volumes is then carried out using AVX double-precision floating-point arithmetic.
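
As a point of reference, the same calculations can be sketched in C++. The function below is a hypothetical stand-in named CalcConeAreaVolRef (the source code for CalcConeAreaVol_Cpp() is not shown in this section); it applies the two formulas element by element and mirrors the assembly language function's return codes.

```cpp
#include <cmath>

// Hypothetical reference routine; not the book's CalcConeAreaVol_Cpp().
bool CalcConeAreaVolRef(const double* r, const double* h, int n,
    double* sa_cone, double* vol_cone)
{
    if (n <= 0)
        return false;                           // error return code

    const double pi = 3.14159265358979323846;   // same value as r8_pi

    for (int i = 0; i < n; i++)
    {
        // sa = pi * r * (r + sqrt(r * r + h * h))
        sa_cone[i] = pi * r[i] * (r[i] + std::sqrt(r[i] * r[i] + h[i] * h[i]));

        // vol = pi * r * r * h / 3
        vol_cone[i] = pi * r[i] * r[i] * h[i] / 3.0;
    }

    return true;                                // success return code
}
```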

Upon completion of the processing loop, the nonvolatile XMM registers are restored using a series of vmovdqa instructions. The function CalcConeAreaVol_A() then releases its local stack space and restores the previously saved nonvolatile general-purpose registers that it used. Here are the results for source code example Ch12_08:
----- Results for CalcConeAreaVol -----
r/h:           1.00           1.00
sa:        7.584476       7.584476
vol:       1.047198       1.047198
r/h:           1.00           2.00
sa:       10.166407      10.166407
vol:       2.094395       2.094395
r/h:           2.00           3.00
sa:       35.220717      35.220717
vol:      12.566371      12.566371
r/h:           2.00           4.00
sa:       40.665630      40.665630
vol:      16.755161      16.755161
r/h:           3.00           5.00
sa:       83.229761      83.229761
vol:      47.123890      47.123890
r/h:           3.00          10.00
sa:      126.671905     126.671905
vol:      94.247780      94.247780
r/h:           4.25          12.50
sa:      233.025028     233.025028
vol:     236.437572     236.437572

Macros for Function Prologues and Epilogues

The purpose of the three previous source code examples was to explicate the requirements of the Visual C++ calling convention for 64-bit nonleaf functions. The calling convention’s rigid requisites for function prologues and epilogues are somewhat lengthy and a potential source of programming errors. It is important to recognize that the stack layout of a nonleaf function is primarily determined by the number of nonvolatile (both general-purpose and XMM) registers that must be preserved and the amount of local stack space that is needed. A method is needed to automate most of the coding drudgery associated with the calling convention.

Listing 12-9 shows the assembly language source code for example Ch12_09. This source code example demonstrates how to use several macros that I have written to simplify prologue and epilogue coding in a nonleaf function. This example also illustrates how to call a C++ library function from an x86-64 assembly language function.
//------------------------------------------------
//               Ch12_09.h
//------------------------------------------------
#pragma once
// Ch12_09_fcpp.cpp
extern bool CalcBSA_Cpp(const double* ht, const double* wt, int n,
    double* bsa1, double* bsa2, double* bsa3);
// Ch12_09_fasm.asm
extern "C" bool CalcBSA_Aavx(const double* ht, const double* wt, int n,
    double* bsa1, double* bsa2, double* bsa3);
;-------------------------------------------------
;               Ch12_09_fasm.asm
;-------------------------------------------------
        include <MacrosX86-64-AVX.asmh>
;--------------------------------------------------------------------------
; extern "C" bool CalcBSA_Aavx(const double* ht, const double* wt, int n,
;   double* bsa1, double* bsa2, double* bsa3);
;--------------------------------------------------------------------------
                .const
r8_0p007184     real8 0.007184
r8_0p725        real8 0.725
r8_0p425        real8 0.425
r8_0p0235       real8 0.0235
r8_0p42246      real8 0.42246
r8_0p51456      real8 0.51456
r8_3600p0       real8 3600.0
        .code
        extern pow:proc
CalcBSA_Aavx proc frame
        CreateFrame_M BSA_,16,64,rbx,rsi,r12,r13,r14,r15
        SaveXmmRegs_M xmm6,xmm7,xmm8,xmm9
        EndProlog_M
; Save argument registers to home area (optional). Note that the home
; area can also be used to store other transient data values.
        mov qword ptr [rbp+BSA_OffsetHomeRCX],rcx
        mov qword ptr [rbp+BSA_OffsetHomeRDX],rdx
        mov qword ptr [rbp+BSA_OffsetHomeR8],r8
        mov qword ptr [rbp+BSA_OffsetHomeR9],r9
; Initialize processing loop pointers. Note that the pointers are
; maintained in non-volatile registers, which eliminates reloads after
; the calls to pow().
        test r8d,r8d                            ;is n > 0?
        jg @F                                   ;jump if n > 0
        xor eax,eax                             ;set error return code
        jmp Done
@@:     mov [rbp],r8d                           ;save n to local var
        mov r12,rcx                             ;r12 = ptr to ht
        mov r13,rdx                             ;r13 = ptr to wt
        mov r14,r9                              ;r14 = ptr to bsa1
        mov r15,[rbp+BSA_OffsetStackArgs]       ;r15 = ptr to bsa2
        mov rbx,[rbp+BSA_OffsetStackArgs+8]     ;rbx = ptr to bsa3
        mov rsi,-8                              ;rsi = array element offset
; Allocate home space on stack for use by pow()
        sub rsp,32
; Calculate bsa1 = 0.007184 * pow(ht, 0.725) * pow(wt, 0.425);
@@:     add rsi,8                                   ;rsi = next offset
        vmovsd xmm0,real8 ptr [r12+rsi]             ;xmm0 = ht
        vmovsd xmm8,xmm8,xmm0
        vmovsd xmm1,real8 ptr [r8_0p725]
        call pow                                    ;xmm0 = pow(ht, 0.725)
        vmovsd xmm6,xmm6,xmm0
        vmovsd xmm0,real8 ptr [r13+rsi]             ;xmm0 = wt
        vmovsd xmm9,xmm9,xmm0
        vmovsd xmm1,real8 ptr [r8_0p425]
        call pow                                    ;xmm0 = pow(wt, 0.425)
        vmulsd xmm6,xmm6,real8 ptr [r8_0p007184]
        vmulsd xmm6,xmm6,xmm0                       ;xmm6 = bsa1
; Calculate bsa2 = 0.0235 * pow(ht, 0.42246) * pow(wt, 0.51456);
        vmovsd xmm0,xmm0,xmm8                       ;xmm0 = ht
        vmovsd xmm1,real8 ptr [r8_0p42246]
        call pow                                    ;xmm0 = pow(ht, 0.42246)
        vmovsd xmm7,xmm7,xmm0
        vmovsd xmm0,xmm0,xmm9                       ;xmm0 = wt
        vmovsd xmm1,real8 ptr [r8_0p51456]
        call pow                                    ;xmm0 = pow(wt, 0.51456)
        vmulsd xmm7,xmm7,real8 ptr [r8_0p0235]
        vmulsd xmm7,xmm7,xmm0                       ;xmm7 = bsa2
; Calculate bsa3 = sqrt(ht * wt / 3600.0);
        vmulsd xmm8,xmm8,xmm9                   ;xmm8 = ht * wt
        vdivsd xmm8,xmm8,real8 ptr [r8_3600p0]  ;xmm8 = ht * wt / 3600
        vsqrtsd xmm8,xmm8,xmm8                  ;xmm8 = bsa3
; Save BSA results
        vmovsd real8 ptr [r14+rsi],xmm6         ;save bsa1 result
        vmovsd real8 ptr [r15+rsi],xmm7         ;save bsa2 result
        vmovsd real8 ptr [rbx+rsi],xmm8         ;save bsa3 result
        dec dword ptr [rbp]                     ;n -= 1
        jnz @B
        mov eax,1                               ;set success return code
Done:   RestoreXmmRegs_M xmm6,xmm7,xmm8,xmm9
        DeleteFrame_M rbx,rsi,r12,r13,r14,r15
        ret
CalcBSA_Aavx endp
        end
Listing 12-9

Example Ch12_09

In Listing 12-9, the assembly language code begins with the statement include <MacrosX86-64-AVX.asmh>, which incorporates the contents of file MacrosX86-64-AVX.asmh into Ch12_09_fasm.asm during assembly. This file (source code not shown but included in the software download package) contains several macros that help automate much of the coding grunt work associated with the Visual C++ calling convention. Using an assembly language include file is analogous to using a C++ include file. The angle brackets that surround the file name can be omitted in some cases, but it is usually simpler and more consistent to just always use them. Note that there is no standard file name extension for x86 assembly language header files; I use .asmh but .inc is also used.

Figure 12-6 shows a generic stack layout diagram for a nonleaf function. Note the similarities between this figure and the more detailed stack layouts of Figures 12-4 and 12-5. The macros defined in MacrosX86-64-AVX.asmh assume that a function’s stack layout will conform to what is shown in Figure 12-6. They enable a function to tailor a custom stack frame by specifying the amount of local stack space that is needed and which nonvolatile registers must be preserved. The macros also perform most of the critical stack offset calculations, which reduces the risk of a programming error in a function prologue or epilogue.
Figure 12-6

Generic stack layout for a nonleaf function

Function CalcBSA_Aavx() computes body surface areas (BSA) using the same equations that were used in example Ch09_03 (see Table 9-1). Following the include statement in Listing 12-9 is a .const section that contains definitions for the various floating-point constant values used in the BSA equations. The line extern pow:proc enables the use of the external C++ library function pow(). Following the CalcBSA_Aavx proc frame statement, the macro CreateFrame_M emits assembly language code that initializes the stack frame. It also saves the specified nonvolatile general-purpose registers on the stack. Macro CreateFrame_M requires several parameters including a prefix string and the size in bytes of StkSizeLocal1 and StkSizeLocal2 (see Figure 12-6). Macro CreateFrame_M uses the specified prefix string to generate symbolic names that can be employed to reference items on the stack. It is somewhat convenient to use a shortened version of the function name as the prefix string, but any file-unique text string can be used. Both StkSizeLocal1 and StkSizeLocal2 must be evenly divisible by 16. StkSizeLocal2 must also be less than or equal to 240 and greater than or equal to the number of saved XMM registers multiplied by 16.
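
The size constraints stated above can be summarized as a small C++ predicate. This is only a sketch: the function IsValidFrameSizes and its parameter names are hypothetical and are not part of MacrosX86-64-AVX.asmh.

```cpp
// Returns true if the two local stack area sizes satisfy the constraints
// described in the text. (Hypothetical helper for illustration only.)
constexpr bool IsValidFrameSizes(int stk_size_local1, int stk_size_local2,
    int num_saved_xmm)
{
    return stk_size_local1 >= 0 && stk_size_local1 % 16 == 0    // multiple of 16
        && stk_size_local2 >= 0 && stk_size_local2 % 16 == 0    // multiple of 16
        && stk_size_local2 <= 240                               // frame offset limit
        && stk_size_local2 >= num_saved_xmm * 16;               // room for XMM saves
}
```

The values used in Listing 12-9 (16, 64, and four saved XMM registers) satisfy the predicate; reducing StkSizeLocal2 to 48 would not, since four XMM registers require 64 bytes.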

The next statement makes use of the SaveXmmRegs_M macro to save the specified nonvolatile XMM registers to the XMM save area on the stack. This is followed by the EndProlog_M macro, which signifies the end of the function’s prologue. At this point, register RBP is configured as the function’s stack frame pointer. It is also safe to use any of the saved nonvolatile general-purpose or XMM registers.

The code block that follows EndProlog_M saves argument registers RCX, RDX, R8, and R9 to their home locations on the stack. Note that each mov instruction includes a symbolic name that equates to the offset of the register’s home area on the stack relative to the RBP register. The symbolic names and the corresponding offset values were automatically generated by the CreateFrame_M macro. The home area can also be used to store temporary data instead of the argument registers, as mentioned earlier in this chapter.

Initialization of the processing for-loop variables occurs next. Argument value n in register R8D is checked for validity and then saved on the stack as a local variable. Several nonvolatile registers are then initialized as pointer registers. Nonvolatile registers are used to avoid register reloads following each call to the C++ library function pow(). Note that the pointer to array bsa2 is loaded from the stack using a mov r15,[rbp+BSA_OffsetStackArgs] instruction. The symbolic constant BSA_OffsetStackArgs also was automatically generated by the macro CreateFrame_M and equates to the offset of the first stack argument relative to the RBP register. A mov rbx,[rbp+BSA_OffsetStackArgs+8] instruction loads argument bsa3 into register RBX; the constant 8 is included as part of the source operand displacement since bsa3 is the second argument passed via the stack.

The Visual C++ calling convention requires the caller of a function to allocate that function’s home area on the stack. The sub rsp,32 instruction performs this operation for function pow(). The ensuing code block calculates BSA values using the equations shown in Table 9-1. Note that registers XMM0 and XMM1 are loaded with the necessary argument values prior to each call to pow(). Also note that some of the return values from pow() are preserved in nonvolatile XMM registers prior to their actual use.
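
The three BSA calculations performed inside the loop can be sketched in C++ as follows. The equations and constants are taken from the comments and .const section of Listing 12-9; the function name CalcBSARef is hypothetical and is not the book's CalcBSA_Cpp(), whose source code is not shown in this section.

```cpp
#include <cmath>

// Hypothetical reference routine that evaluates the three BSA equations
// for each height (cm) and weight (kg) pair.
bool CalcBSARef(const double* ht, const double* wt, int n,
    double* bsa1, double* bsa2, double* bsa3)
{
    if (n <= 0)
        return false;                           // error return code

    for (int i = 0; i < n; i++)
    {
        bsa1[i] = 0.007184 * std::pow(ht[i], 0.725) * std::pow(wt[i], 0.425);
        bsa2[i] = 0.0235 * std::pow(ht[i], 0.42246) * std::pow(wt[i], 0.51456);
        bsa3[i] = std::sqrt(ht[i] * wt[i] / 3600.0);
    }

    return true;                                // success return code
}
```

Each loop iteration makes four calls to pow(), which is why the assembly language version keeps its pointers, and the ht and wt values, in nonvolatile registers.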

Following completion of the BSA processing for-loop is the epilogue for CalcBSA_Aavx(). Before execution of the ret instruction, function CalcBSA_Aavx() must restore all nonvolatile XMM and general-purpose registers that it saved in the prologue. The stack frame must also be properly deleted. The RestoreXmmRegs_M macro restores the nonvolatile XMM registers. Note that this macro requires the order of the registers in its argument list to match the register list that was used with the SaveXmmRegs_M macro. Stack frame cleanup and general-purpose register restores are handled by the DeleteFrame_M macro. The order of the registers specified in this macro’s argument list must be identical to the prologue’s CreateFrame_M macro. The DeleteFrame_M macro also restores RSP from RBP, which means that it is not necessary to code an explicit add rsp,32 instruction to release the home area that was allocated on the stack for pow(). You will see additional examples of function prologue and epilogue macro usage in subsequent chapters. Here are the results for source code example Ch12_09:
----- Results for CalcBSA -----
height:  150.0 (cm)
weight:   50.0 (kg)
BSA (C++):  1.432500  1.460836  1.443376 (sq. m)
BSA (AVX):  1.432500  1.460836  1.443376 (sq. m)
height:  160.0 (cm)
weight:   60.0 (kg)
BSA (C++):  1.622063  1.648868  1.632993 (sq. m)
BSA (AVX):  1.622063  1.648868  1.632993 (sq. m)
height:  170.0 (cm)
weight:   70.0 (kg)
BSA (C++):  1.809708  1.831289  1.818119 (sq. m)
BSA (AVX):  1.809708  1.831289  1.818119 (sq. m)
height:  180.0 (cm)
weight:   80.0 (kg)
BSA (C++):  1.996421  2.009483  2.000000 (sq. m)
BSA (AVX):  1.996421  2.009483  2.000000 (sq. m)
height:  190.0 (cm)
weight:   90.0 (kg)
BSA (C++):  2.182809  2.184365  2.179449 (sq. m)
BSA (AVX):  2.182809  2.184365  2.179449 (sq. m)
height:  200.0 (cm)
weight:  100.0 (kg)
BSA (C++):  2.369262  2.356574  2.357023 (sq. m)
BSA (AVX):  2.369262  2.356574  2.357023 (sq. m)

If the discussions of this section have left you feeling a little bewildered, don’t worry. In this book’s remaining chapters, you will see an abundance of x86-64 assembly language source code that demonstrates proper use of the Visual C++ calling convention and its programming requirements.

Summary

Table 12-5 summarizes the x86 assembly language instructions introduced in this chapter. This table also includes closely related instructions. Before proceeding to the next chapter, make sure you understand the operation that is performed by each instruction shown in Table 12-5.
Table 12-5

X86 Assembly Language Instruction Summary for Chapter 12

Instruction Mnemonic    Description
--------------------    -----------
call                    Call procedure/function
lea                     Load effective address
setcc                   Set byte if condition is true; clear otherwise
test                    Logical compare (bitwise logical AND to set RFLAGS)
vadds[d|s]              Scalar floating-point addition
vcvtsd2ss               Convert scalar DPFP value to SPFP
vcomis[d|s]             Scalar floating-point compare
vcvts[d|s]2si           Convert scalar floating-point to signed integer
vcvtsi2s[d|s]           Convert signed integer to scalar floating-point
vcvtss2sd               Convert scalar SPFP to scalar DPFP
vdivs[d|s]              Scalar floating-point division
vldmxcsr                Load MXCSR register
vmovdqa                 Move double quadword (aligned)
vmovdqu                 Move double quadword (unaligned)
vmovs[d|s]              Move scalar floating-point value
vmuls[d|s]              Scalar floating-point multiplication
vsqrts[d|s]             Scalar floating-point square root
vstmxcsr                Store MXCSR register
vsubs[d|s]              Scalar floating-point subtraction
vxorp[d|s]              Packed floating-point bitwise logical exclusive OR
