D. Kusswurm, Modern Parallel Programming with C++ and Assembly Language, https://doi.org/10.1007/978-1-4842-7918-2_9

9. Supplemental C++ SIMD Programming

Daniel Kusswurm, Geneva, IL, USA

In the previous eight chapters, you learned critical programming details about AVX, AVX2, and AVX-512. You also discovered how to create SIMD calculating functions that exploited the computational resources of these x86 instruction set extensions. This chapter focuses on supplemental x86 C++ SIMD programming topics. It begins with a source code example that demonstrates utilization of the cpuid instruction and how to exercise this instruction to detect x86 instruction set extensions such as AVX, AVX2, and AVX-512. This is followed by a section that explains how to use SIMD versions of common C++ math library routines.

Using CPUID

It has been mentioned several times already in this book, but it bears repeating one more time: a program should never assume that a specific instruction set extension such as FMA, AVX, AVX2, or AVX-512 is available on its host processor. To ensure software compatibility with both current and future x86 processors, a program should always use the x86 cpuid instruction (or an equivalent C++ intrinsic function) to verify that any required x86 instruction set extensions are available. An application program will crash or be terminated by the host operating system if it attempts to execute a nonsupported x86-AVX instruction. Besides x86 instruction set extensions, the cpuid instruction can also be directed to obtain supplemental feature information about a processor. The focus of this section is the use of cpuid to detect the presence of x86 instruction set extensions and a few basic processor features. If you are interested in learning how to use cpuid to detect other processor features, you should consult the AMD and Intel programming reference manuals listed in Appendix B.
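
One common way to act on this advice is to pick an implementation once, at startup, based on a detected feature flag. The following is a minimal sketch of that idea, not code from the book's download package. The names SumF32_Cpp(), SumF32_Iavx2(), and SelectSumF32() are hypothetical, and the "AVX2" function is written in plain C++ as a stand-in so the sketch compiles on any target.

```cpp
#include <cassert>
#include <cstddef>

// Baseline implementation that can run on any processor.
float SumF32_Cpp(const float* x, size_t n)
{
    assert(x != nullptr || n == 0);
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum;
}

// Hypothetical stand-in for a function coded with AVX2 intrinsics.
// It uses plain C++ here (four partial sums) so the sketch stays portable.
float SumF32_Iavx2(const float* x, size_t n)
{
    assert(x != nullptr || n == 0);
    float sums[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
    size_t i = 0;

    for (; i + 4 <= n; i += 4)
    {
        for (size_t j = 0; j < 4; j++)
            sums[j] += x[i + j];
    }

    float sum = sums[0] + sums[1] + sums[2] + sums[3];

    for (; i < n; i++)      // process any residual elements
        sum += x[i];
    return sum;
}

using SumF32Func = float (*)(const float*, size_t);

// Select an implementation from a feature flag that was detected earlier.
SumF32Func SelectSumF32(bool have_avx2)
{
    return have_avx2 ? SumF32_Iavx2 : SumF32_Cpp;
}
```

In a real program, the have_avx2 argument would come from a cpuid-based test, and the selected function pointer would be saved for later calls.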

Source code example Ch09_01 demonstrates how to use the cpuid instruction to detect x86 processor instruction set extensions. It also illustrates using cpuid to obtain useful processor feature information including vendor name, processor brand, and cache sizes. Listing 9-1 includes the principal C++ data structures and functions for example Ch09_01. The complete source code for this example is included in the software download package.

//------------------------------------------------

// Cpuid__.h

//------------------------------------------------

#pragma once

#include <cstdint>

struct CpuidRegs

{

uint32_t EAX;

uint32_t EBX;

uint32_t ECX;

uint32_t EDX;

};

// Cpuid__.cpp

extern uint32_t Cpuid__(uint32_t r_eax, uint32_t r_ecx, CpuidRegs* r_out);

extern void Xgetbv__(uint32_t r_ecx, uint32_t* r_eax, uint32_t* r_edx);

//------------------------------------------------

// Cpuid__.cpp

//------------------------------------------------

#include <string>

#include <cassert>

#include <immintrin.h>

#include "Cpuid__.h"

#if defined(_MSC_VER)

#include <intrin.h>

#elif defined (__GNUG__)

#include <cpuid.h>

#include <x86intrin.h>

#else

#error Unknown target in Cpuid__.cpp

#endif

uint32_t Cpuid__(uint32_t r_eax, uint32_t r_ecx, CpuidRegs* r_out)

{

#if defined(_MSC_VER)

int cpuid_info[4];

cpuid_info[0] = cpuid_info[1] = cpuid_info[2] = cpuid_info[3] = 0;

__cpuidex(cpuid_info, r_eax, r_ecx);

#endif

#if defined (__GNUG__)

uint32_t cpuid_info[4];

cpuid_info[0] = cpuid_info[1] = cpuid_info[2] = cpuid_info[3] = 0;

__get_cpuid_count(r_eax, r_ecx, &cpuid_info[0], &cpuid_info[1],

&cpuid_info[2], &cpuid_info[3]);

#endif

r_out->EAX = cpuid_info[0];

r_out->EBX = cpuid_info[1];

r_out->ECX = cpuid_info[2];

r_out->EDX = cpuid_info[3];

uint32_t rc = cpuid_info[0] | cpuid_info[1] | cpuid_info[2] | cpuid_info[3];

return rc;

}

void Xgetbv__(uint32_t r_ecx, uint32_t* r_eax, uint32_t* r_edx)

{

uint64_t x = _xgetbv(r_ecx);

*r_eax = (uint32_t)(x & 0xFFFFFFFF);

*r_edx = (uint32_t)((x & 0xFFFFFFFF00000000) >> 32);

}

//------------------------------------------------

// CpuidInfo.h

//------------------------------------------------

#pragma once

#include <cstdint>

#include <vector>

#include <string>

#include "Cpuid__.h"

class CpuidInfo

{

public:

class CacheInfo

{

public:

enum class Type

{

Unknown, Data, Instruction, Unified

};

private:

uint32_t m_Level = 0;

Type m_Type = Type::Unknown;

uint32_t m_Size = 0;

public:

uint32_t GetLevel(void) const { return m_Level; }

uint32_t GetSize(void) const { return m_Size; }

Type GetType(void) const { return m_Type; }

// These are defined in CacheInfo.cpp

CacheInfo(uint32_t level, uint32_t type, uint32_t size);

std::string GetTypeString(void) const;

};

private:

uint32_t m_MaxEax; // Max EAX for basic CPUID

uint32_t m_MaxEaxExt; // Max EAX for extended CPUID

uint64_t m_FeatureFlags; // Processor feature flags

std::vector<CpuidInfo::CacheInfo> m_CacheInfo; // Processor cache information

char m_VendorId[13]; // Processor vendor ID string

char m_ProcessorBrand[49]; // Processor brand string

bool m_OsXsave; // XSAVE is enabled for app use

bool m_OsAvxState; // AVX state is enabled by OS

bool m_OsAvx512State; // AVX-512 state is enabled by OS

void Init(void);

void InitProcessorBrand(void);

void LoadInfo0(void);

void LoadInfo1(void);

void LoadInfo2(void);

void LoadInfo3(void);

void LoadInfo4(void);

void LoadInfo5(void);

public:

enum class FF : uint64_t

{

FXSR = (uint64_t)1 << 0,

MMX = (uint64_t)1 << 1,

MOVBE = (uint64_t)1 << 2,

SSE = (uint64_t)1 << 3,

SSE2 = (uint64_t)1 << 4,

SSE3 = (uint64_t)1 << 5,

SSSE3 = (uint64_t)1 << 6,

SSE4_1 = (uint64_t)1 << 7,

SSE4_2 = (uint64_t)1 << 8,

PCLMULQDQ = (uint64_t)1 << 9,

POPCNT = (uint64_t)1 << 10,

PREFETCHW = (uint64_t)1 << 11,

PREFETCHWT1 = (uint64_t)1 << 12,

RDRAND = (uint64_t)1 << 13,

RDSEED = (uint64_t)1 << 14,

ERMSB = (uint64_t)1 << 15,

AVX = (uint64_t)1 << 16,

AVX2 = (uint64_t)1 << 17,

F16C = (uint64_t)1 << 18,

FMA = (uint64_t)1 << 19,

BMI1 = (uint64_t)1 << 20,

BMI2 = (uint64_t)1 << 21,

LZCNT = (uint64_t)1 << 22,

ADX = (uint64_t)1 << 23,

AVX512F = (uint64_t)1 << 24,

AVX512ER = (uint64_t)1 << 25,

AVX512PF = (uint64_t)1 << 26,

AVX512DQ = (uint64_t)1 << 27,

AVX512CD = (uint64_t)1 << 28,

AVX512BW = (uint64_t)1 << 29,

AVX512VL = (uint64_t)1 << 30,

AVX512_IFMA = (uint64_t)1 << 31,

AVX512_VBMI = (uint64_t)1 << 32,

AVX512_4FMAPS = (uint64_t)1 << 33,

AVX512_4VNNIW = (uint64_t)1 << 34,

AVX512_VPOPCNTDQ = (uint64_t)1 << 35,

AVX512_VNNI = (uint64_t)1 << 36,

AVX512_VBMI2 = (uint64_t)1 << 37,

AVX512_BITALG = (uint64_t)1 << 38,

AVX512_BF16 = (uint64_t)1 << 39,

AVX512_VP2INTERSECT = (uint64_t)1 << 40,

CLWB = (uint64_t)1 << 41,

GFNI = (uint64_t)1 << 42,

AESNI = (uint64_t)1 << 43,

VAES = (uint64_t)1 << 44,

VPCLMULQDQ = (uint64_t)1 << 45,

AVX_VNNI = (uint64_t)1 << 46,

AVX512_FP16 = (uint64_t)1 << 47,

};

CpuidInfo(void) { Init(); };

~CpuidInfo() {};

const std::vector<CpuidInfo::CacheInfo>& GetCacheInfo(void) const

{

return m_CacheInfo;

}

bool GetFF(FF flag) const

{

return (m_FeatureFlags & (uint64_t)flag) != 0;

}

std::string GetProcessorBrand(void) const { return std::string(m_ProcessorBrand); }

std::string GetProcessorVendor(void) const { return std::string(m_VendorId); }

void LoadInfo(void);

};

//------------------------------------------------

// Ch09_01.cpp

//------------------------------------------------

#include <iostream>

#include <string>

#include "CpuidInfo.h"

static void DisplayProcessorInfo(const CpuidInfo& ci);

static void DisplayCacheInfo(const CpuidInfo& ci);

static void DisplayFeatureFlags(const CpuidInfo& ci);

int main()

{

CpuidInfo ci;

ci.LoadInfo();

DisplayProcessorInfo(ci);

DisplayCacheInfo(ci);

DisplayFeatureFlags(ci);

return 0;

}

static void DisplayProcessorInfo(const CpuidInfo& ci)

{

const char nl = '\n';

std::cout << "\n----- Processor Info -----" << nl;

std::cout << "Processor vendor: " << ci.GetProcessorVendor() << nl;

std::cout << "Processor brand: " << ci.GetProcessorBrand() << nl;

}

static void DisplayCacheInfo(const CpuidInfo& ci)

{

const char nl = '\n';

const std::vector<CpuidInfo::CacheInfo>& cache_info = ci.GetCacheInfo();

std::cout << "\n----- Cache Info -----" << nl;

for (const CpuidInfo::CacheInfo& x : cache_info)

{

uint32_t cache_size = x.GetSize();

uint32_t cache_size_kb = cache_size / 1024;

std::cout << "Cache L" << x.GetLevel() << ": ";

std::cout << cache_size_kb << " KB - ";

std::cout << x.GetTypeString() << nl;

}

}

static void DisplayFeatureFlags(const CpuidInfo& ci)

{

const char nl = '\n';

std::cout << "\n----- Processor CPUID Feature Flags -----" << nl;

std::cout << "FMA: " << ci.GetFF(CpuidInfo::FF::FMA) << nl;

std::cout << "AVX: " << ci.GetFF(CpuidInfo::FF::AVX) << nl;

std::cout << "AVX2: " << ci.GetFF(CpuidInfo::FF::AVX2) << nl;

std::cout << "AVX512F: " << ci.GetFF(CpuidInfo::FF::AVX512F) << nl;

std::cout << "AVX512CD: " << ci.GetFF(CpuidInfo::FF::AVX512CD) << nl;

std::cout << "AVX512DQ: " << ci.GetFF(CpuidInfo::FF::AVX512DQ) << nl;

std::cout << "AVX512BW: " << ci.GetFF(CpuidInfo::FF::AVX512BW) << nl;

std::cout << "AVX512VL: " << ci.GetFF(CpuidInfo::FF::AVX512VL) << nl;

std::cout << "AVX512_IFMA: " << ci.GetFF(CpuidInfo::FF::AVX512_IFMA) << nl;

std::cout << "AVX512_VBMI: " << ci.GetFF(CpuidInfo::FF::AVX512_VBMI) << nl;

std::cout << "AVX512_VNNI: " << ci.GetFF(CpuidInfo::FF::AVX512_VNNI) << nl;

std::cout << "AVX512_VPOPCNTDQ: " << ci.GetFF(CpuidInfo::FF::AVX512_VPOPCNTDQ) << nl;

std::cout << "AVX512_VBMI2: " << ci.GetFF(CpuidInfo::FF::AVX512_VBMI2) << nl;

std::cout << "AVX512_BITALG: " << ci.GetFF(CpuidInfo::FF::AVX512_BITALG) << nl;

std::cout << "AVX512_BF16: " << ci.GetFF(CpuidInfo::FF::AVX512_BF16) << nl;

std::cout << "AVX512_VP2INTERSECT: " << ci.GetFF(CpuidInfo::FF::AVX512_VP2INTERSECT) << nl;

std::cout << "AVX512_FP16: " << ci.GetFF(CpuidInfo::FF::AVX512_FP16) << nl;

}

//------------------------------------------------

// CpuidInfo.cpp

//------------------------------------------------

#include <string>

#include <cstring>

#include <vector>

#include "CpuidInfo.h"

void CpuidInfo::LoadInfo(void)

{

// Note: LoadInfo0 must be called first

LoadInfo0();

LoadInfo1();

LoadInfo2();

LoadInfo3();

LoadInfo4();

LoadInfo5();

}

void CpuidInfo::LoadInfo4(void)

{

CpuidRegs r_eax01h;

CpuidRegs r_eax07h;

CpuidRegs r_eax07h_ecx01h;

if (m_MaxEax < 7)

return;

Cpuid__(1, 0, &r_eax01h);

Cpuid__(7, 0, &r_eax07h);

Cpuid__(7, 1, &r_eax07h_ecx01h);

// Test CPUID.(EAX=01H, ECX=00H):ECX.OSXSAVE[bit 27] to verify use of XGETBV

m_OsXsave = (r_eax01h.ECX & (0x1 << 27)) ? true : false;

if (m_OsXsave)

{

// Use XGETBV to obtain following information

// AVX state is enabled by OS if (XCR0[2:1] == '11b') is true

// AVX-512 state is enabled by OS if (XCR0[7:5] == '111b') is true

uint32_t xgetbv_eax, xgetbv_edx;

Xgetbv__(0, &xgetbv_eax, &xgetbv_edx);

m_OsAvxState = (((xgetbv_eax >> 1) & 0x03) == 0x03) ? true : false;

if (m_OsAvxState)

{

// CPUID.(EAX=01H, ECX=00H):ECX.AVX[bit 28]

if (r_eax01h.ECX & (0x1 << 28))

{

m_FeatureFlags |= (uint64_t)FF::AVX;

// Decode ECX flags

// CPUID.(EAX=07H, ECX=00H):EBX.AVX2[bit 5]

if (r_eax07h.EBX & (0x1 << 5))

m_FeatureFlags |= (uint64_t)FF::AVX2;

// CPUID.(EAX=07H, ECX=00H):ECX.VAES[bit 9]

if (r_eax07h.ECX & (0x1 << 9))

m_FeatureFlags |= (uint64_t)FF::VAES;

// CPUID.(EAX=07H, ECX=00H):ECX.VPCLMULQDQ[bit 10]

if (r_eax07h.ECX & (0x1 << 10))

m_FeatureFlags |= (uint64_t)FF::VPCLMULQDQ;

// CPUID.(EAX=01H, ECX=00H):ECX.FMA[bit 12]

if (r_eax01h.ECX & (0x1 << 12))

m_FeatureFlags |= (uint64_t)FF::FMA;

// CPUID.(EAX=01H, ECX=00H):ECX.F16C[bit 29]

if (r_eax01h.ECX & (0x1 << 29))

m_FeatureFlags |= (uint64_t)FF::F16C;

// Decode EAX flags (subleaf 1)

// CPUID.(EAX=07H, ECX=01H):EAX.AVX_VNNI[bit 4]

if (r_eax07h_ecx01h.EAX & (0x1 << 4))

m_FeatureFlags |= (uint64_t)FF::AVX_VNNI;

m_OsAvx512State = (((xgetbv_eax >> 5) & 0x07) == 0x07) ? true : false;

if (m_OsAvx512State)

{

// CPUID.(EAX=07H, ECX=00H):EBX.AVX512F[bit 16]

if (r_eax07h.EBX & (0x1 << 16))

{

m_FeatureFlags |= (uint64_t)FF::AVX512F;

// Decode EBX flags

// CPUID.(EAX=07H, ECX=00H):EBX.AVX512DQ[bit 17]

if (r_eax07h.EBX & (0x1 << 17))

m_FeatureFlags |= (uint64_t)FF::AVX512DQ;

// CPUID.(EAX=07H, ECX=00H):EBX.AVX512_IFMA[bit 21]

if (r_eax07h.EBX & (0x1 << 21))

m_FeatureFlags |= (uint64_t)FF::AVX512_IFMA;

// CPUID.(EAX=07H, ECX=00H):EBX.AVX512PF[bit 26]

if (r_eax07h.EBX & (0x1 << 26))

m_FeatureFlags |= (uint64_t)FF::AVX512PF;

// CPUID.(EAX=07H, ECX=00H):EBX.AVX512ER[bit 27]

if (r_eax07h.EBX & (0x1 << 27))

m_FeatureFlags |= (uint64_t)FF::AVX512ER;

// CPUID.(EAX=07H, ECX=00H):EBX.AVX512CD[bit 28]

if (r_eax07h.EBX & (0x1 << 28))

m_FeatureFlags |= (uint64_t)FF::AVX512CD;

// CPUID.(EAX=07H, ECX=00H):EBX.AVX512BW[bit 30]

if (r_eax07h.EBX & (0x1 << 30))

m_FeatureFlags |= (uint64_t)FF::AVX512BW;

// CPUID.(EAX=07H, ECX=00H):EBX.AVX512VL[bit 31]

if (r_eax07h.EBX & (0x1u << 31))

m_FeatureFlags |= (uint64_t)FF::AVX512VL;

// Decode ECX flags

// CPUID.(EAX=07H, ECX=00H):ECX.AVX512_VBMI[bit 1]

if (r_eax07h.ECX & (0x1 << 1))

m_FeatureFlags |= (uint64_t)FF::AVX512_VBMI;

// CPUID.(EAX=07H, ECX=00H):ECX.AVX512_VBMI2[bit 6]

if (r_eax07h.ECX & (0x1 << 6))

m_FeatureFlags |= (uint64_t)FF::AVX512_VBMI2;

// CPUID.(EAX=07H, ECX=00H):ECX.AVX512_VNNI[bit 11]

if (r_eax07h.ECX & (0x1 << 11))

m_FeatureFlags |= (uint64_t)FF::AVX512_VNNI;

// CPUID.(EAX=07H, ECX=00H):ECX.AVX512_BITALG[bit 12]

if (r_eax07h.ECX & (0x1 << 12))

m_FeatureFlags |= (uint64_t)FF::AVX512_BITALG;

// CPUID.(EAX=07H, ECX=00H):ECX.AVX512_VPOPCNTDQ[bit 14]

if (r_eax07h.ECX & (0x1 << 14))

m_FeatureFlags |= (uint64_t)FF::AVX512_VPOPCNTDQ;

// Decode EDX flags

// CPUID.(EAX=07H, ECX=00H):EDX.AVX512_4FMAPS[bit 2]

if (r_eax07h.EDX & (0x1 << 2))

m_FeatureFlags |= (uint64_t)FF::AVX512_4FMAPS;

// CPUID.(EAX=07H, ECX=00H):EDX.AVX512_4VNNIW[bit 3]

if (r_eax07h.EDX & (0x1 << 3))

m_FeatureFlags |= (uint64_t)FF::AVX512_4VNNIW;

// CPUID.(EAX=07H, ECX=00H):EDX.AVX512_VP2INTERSECT[bit 8]

if (r_eax07h.EDX & (0x1 << 8))

m_FeatureFlags |= (uint64_t)FF::AVX512_VP2INTERSECT;

// CPUID.(EAX=07H, ECX=00H):EDX.AVX512_FP16[bit 23]

if (r_eax07h.EDX & (0x1 << 23))

m_FeatureFlags |= (uint64_t)FF::AVX512_FP16;

// Decode EAX flags (subleaf 1)

// CPUID.(EAX=07H, ECX=01H):EAX.AVX512_BF16[bit 5]

if (r_eax07h_ecx01h.EAX & (0x1 << 5))

m_FeatureFlags |= (uint64_t)FF::AVX512_BF16;

}

}

}

}

}

}

Listing 9-1. Example Ch09_01

Before examining the source code in Listing 9-1, a few words regarding x86 registers and cpuid instruction usage are necessary. A register is a storage area within a processor that contains data. Most x86 processor instructions carry out their operations using one or more registers as operands. A register can also be used to temporarily store an intermediate result instead of saving it to memory. The cpuid instruction uses four 32-bit wide x86 registers named EAX, EBX, ECX, and EDX to query and return processor feature information. You will learn more about x86 processor registers in Chapter 10.

Prior to using the cpuid instruction, the calling function must load a “leaf” value into the processor’s EAX register. The leaf value specifies what information the cpuid instruction should return. The function may also need to load a “sub-leaf” value into register ECX before using cpuid. The cpuid instruction returns its results in registers EAX, EBX, ECX, and EDX. The calling function must then decipher the values in these registers to ascertain processor feature or instruction set availability. It is often necessary for a program to use cpuid multiple times. Most programs typically exercise the cpuid instruction during initialization and save the results for later use. The reason for this is that cpuid is a serializing instruction. A serializing instruction forces the processor to finish executing all previously fetched instructions and perform any pending memory writes before fetching the next instruction. In other words, it takes the processor a long time to execute a cpuid instruction.
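
The leaf mechanism and the "query once, save the result" habit described above can be sketched with leaf 0, which returns the vendor ID string in registers EBX, EDX, and ECX (in that order). This is a minimal illustration, not the book's Cpuid__() wrapper. The function names QueryVendorString() and CachedVendorString() are hypothetical, and the sketch assumes an x86 target compiled with Visual C++, GCC, or Clang. On other targets it returns an empty string.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

#if defined(_MSC_VER)
#include <intrin.h>
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#endif

// Executes cpuid with leaf 0 and assembles the 12-character vendor string.
// Returns an empty string if cpuid is not available on this target.
std::string QueryVendorString()
{
    uint32_t regs[4] = { 0, 0, 0, 0 };      // EAX, EBX, ECX, EDX

#if defined(_MSC_VER)
    int info[4] = { 0, 0, 0, 0 };
    __cpuid(info, 0);
    for (int i = 0; i < 4; i++)
        regs[i] = (uint32_t)info[i];
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    if (!__get_cpuid(0, &regs[0], &regs[1], &regs[2], &regs[3]))
        return std::string();
#else
    return std::string();
#endif

    char vendor[12];
    std::memcpy(vendor + 0, &regs[1], 4);   // EBX holds characters 0-3
    std::memcpy(vendor + 4, &regs[3], 4);   // EDX holds characters 4-7
    std::memcpy(vendor + 8, &regs[2], 4);   // ECX holds characters 8-11
    return std::string(vendor, sizeof(vendor));
}

// Runs the query only on the first call; later calls reuse the saved string.
const std::string& CachedVendorString()
{
    static const std::string s_vendor = QueryVendorString();
    return s_vendor;
}
```

The function-local static in CachedVendorString() means the serializing cpuid instruction executes at most once, no matter how often the string is requested.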

Listing 9-1 begins with the definition of a simple C++ structure named CpuidRegs, which is located in file Cpuid__.h. This structure contains four uint32_t members named EAX, EBX, ECX, and EDX. Source code example Ch09_01 uses the CpuidRegs structure to hold cpuid instruction leaf and sub-leaf values. It also uses CpuidRegs to obtain and process the information returned from cpuid.

The next file in Listing 9-1 is named Cpuid__.cpp. The first function in this file, Cpuid__(), is a wrapper function that hides implementation differences between Windows and Linux. This function uses C++ compiler preprocessor definitions to select which cpuid intrinsic function, __cpuidex (Windows) or __get_cpuid_count (Linux), to use. The other function in Cpuid__.cpp is named Xgetbv__(). This is a wrapper function for the x86 xgetbv (Get Value of Extended Control Register) instruction. Function Xgetbv__() obtains state information from the processor that indicates whether the host operating system has enabled support for AVX, AVX2, or AVX-512.

Following Cpuid__.cpp in Listing 9-1 is a file named CpuidInfo.h. This file contains the declaration of class CpuidInfo. Class CpuidInfo begins with the declaration of a nested class named CacheInfo. As implied by its name, CpuidInfo::CacheInfo includes a public interface that provides information about the processor’s on-chip memory caches. Following the declaration of CpuidInfo::CacheInfo are the private data values for class CpuidInfo. These values maintain the data that is returned by various executions of the cpuid instruction.

Class CpuidInfo also includes a public interface that a program can use to obtain information returned by the cpuid instruction. The type CpuidInfo::FF defines symbolic names for x86 instruction set extensions that are commonly used in application programs. Note that the public function CpuidInfo::GetFF() requires a single argument value of type CpuidInfo::FF. This function returns a bool value that signifies whether the host processor (and host operating system) supports the specified instruction set extension. Class CpuidInfo also includes other useful public member functions. The functions CpuidInfo::GetProcessorBrand() and CpuidInfo::GetProcessorVendor() return text strings that report processor brand and vendor information. The member function CpuidInfo::GetCacheInfo() obtains information about the processor’s memory caches. Finally, the member function CpuidInfo::LoadInfo() performs one-time data initialization tasks. Calling this function triggers multiple executions of the cpuid instruction.

The next file in Listing 9-1 is Ch09_01.cpp. The code in this file demonstrates how to properly use class CpuidInfo. Function main() begins with the instantiation of a CpuidInfo object named ci. The ci.LoadInfo() call that follows initializes the private data members of ci. Note that CpuidInfo::LoadInfo() must be called prior to calling any other public member functions of class CpuidInfo. A program typically requires only one CpuidInfo instance, but multiple instances can be created. Following the ci.LoadInfo() call, function main() calls DisplayProcessorInfo(), DisplayCacheInfo(), and DisplayFeatureFlags(). These functions stream processor feature information obtained during the execution of CpuidInfo::LoadInfo() to std::cout.

The final file in Listing 9-1 is CpuidInfo.cpp. Near the top of this file is the definition of member function CpuidInfo::LoadInfo(). Recall that function main() calls CpuidInfo::LoadInfo() to initialize the private members of CpuidInfo instance ci. During its execution, CpuidInfo::LoadInfo() calls six private member functions named CpuidInfo::LoadInfo0() – CpuidInfo::LoadInfo5(). These functions exercise the previously described functions Cpuid__() and Xgetbv__() to determine processor support for the various x86 instruction set extensions enumerated by CpuidInfo::FF. Listing 9-1 only shows the source code for CpuidInfo::LoadInfo4(), which ascertains processor support for FMA, AVX, AVX2, AVX-512, and several other recent x86 instruction set extensions. Due to their length, the source code for the five other CpuidInfo::LoadInfoX() functions is not shown in Listing 9-1, but this code is included in the download software package.

Function CpuidInfo::LoadInfo4() begins its execution with three calls to Cpuid__(). Note that Cpuid__() requires three arguments: a leaf value, a sub-leaf value, and a pointer to a CpuidRegs structure. The specific leaf and sub-leaf values employed here direct Cpuid__() to obtain status flags that facilitate detection of x86-AVX instruction set extensions. The AMD and Intel programming reference manuals contain additional details regarding permissible cpuid instruction leaf and sub-leaf values.

Following execution of the three Cpuid__() calls, CpuidInfo::LoadInfo4() determines if the host operating system allows an application program to use the xgetbv instruction (or _xgetbv() intrinsic function), which is used in Xgetbv__(). Function Xgetbv__() sets status flags in xgetbv_eax that indicate whether the host operating system has enabled the internal processor states necessary for AVX and AVX2. If m_OsAvxState is true, function CpuidInfo::LoadInfo4() initiates a series of brute-force flag checks that test for AVX, AVX2, FMA, and several other x86 instruction set extensions. Note that each successful test sets a status flag in CpuidInfo::m_FeatureFlags to indicate availability of a specific x86 instruction set extension. These flags are the same ones returned by CpuidInfo::GetFF().
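
The XCR0 bit tests that LoadInfo4() performs on xgetbv_eax can be isolated as two small predicates. The following is a sketch under the bit assignments quoted in the Listing 9-1 comments (XCR0[2:1] for AVX state, XCR0[7:5] for AVX-512 state). The constant and function names are hypothetical, and the AVX-512 predicate also requires the AVX state bits, mirroring the nesting in LoadInfo4().

```cpp
#include <cassert>
#include <cstdint>

// XCR0[2:1] == 11b: the OS saves and restores the SSE and AVX register state.
constexpr uint64_t c_Xcr0AvxMask = 0x06;

// XCR0[7:5] == 111b: the OS saves and restores the AVX-512 register state.
constexpr uint64_t c_Xcr0Avx512Mask = 0xE0;

// True if every AVX state bit is set in the supplied XCR0 value.
constexpr bool OsAvxStateEnabled(uint64_t xcr0)
{
    return (xcr0 & c_Xcr0AvxMask) == c_Xcr0AvxMask;
}

// True if both the AVX and the AVX-512 state bits are set.
constexpr bool OsAvx512StateEnabled(uint64_t xcr0)
{
    return OsAvxStateEnabled(xcr0) &&
           (xcr0 & c_Xcr0Avx512Mask) == c_Xcr0Avx512Mask;
}

// Compile-time checks using sample XCR0 values.
static_assert(OsAvxStateEnabled(0x07), "x87 + SSE + AVX state enabled");
static_assert(!OsAvx512StateEnabled(0x07), "no AVX-512 state bits set");
```

Because the predicates take the XCR0 value as an argument, they can be exercised with sample values on any machine. A real caller would pass the value returned by the xgetbv instruction, and only after confirming the OSXSAVE bit.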

If m_OsAvxState is true, CpuidInfo::LoadInfo4() also checks additional status bits in xgetbv_eax to ascertain host operating system support for AVX-512. If m_OsAvx512State is true, CpuidInfo::LoadInfo4() initiates another series of brute-force flag tests to determine which AVX-512 instruction set extensions (see Table 7-1) are available. These tests also update CpuidInfo::m_FeatureFlags to indicate processor support for specific AVX-512 instruction set extensions. Here are the results for source code Ch09_01 that were obtained using Intel Core i7-8700K and Intel Core i5-11600K:

----- Processor Info -----

Processor vendor: GenuineIntel

Processor brand: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz

----- Cache Info -----

Cache L1: 32 KB - Data

Cache L1: 32 KB - Instruction

Cache L2: 256 KB - Unified

Cache L3: 12288 KB - Unified

----- Processor CPUID Feature Flags -----

FMA: 1

AVX: 1

AVX2: 1

AVX512F: 0

AVX512CD: 0

AVX512DQ: 0

AVX512BW: 0

AVX512VL: 0

AVX512_IFMA: 0

AVX512_VBMI: 0

AVX512_VNNI: 0

AVX512_VPOPCNTDQ: 0

AVX512_VBMI2: 0

AVX512_BITALG: 0

AVX512_BF16: 0

AVX512_VP2INTERSECT: 0

AVX512_FP16: 0

----- Processor Info -----

Processor vendor: GenuineIntel

Processor brand: 11th Gen Intel(R) Core(TM) i5-11600K @ 3.90GHz

----- Cache Info -----

Cache L1: 48 KB - Data

Cache L1: 32 KB - Instruction

Cache L2: 512 KB - Unified

Cache L3: 12288 KB - Unified

----- Processor CPUID Feature Flags -----

FMA: 1

AVX: 1

AVX2: 1

AVX512F: 1

AVX512CD: 1

AVX512DQ: 1

AVX512BW: 1

AVX512VL: 1

AVX512_IFMA: 1

AVX512_VBMI: 1

AVX512_VNNI: 1

AVX512_VPOPCNTDQ: 1

AVX512_VBMI2: 1

AVX512_BITALG: 1

AVX512_BF16: 0

AVX512_VP2INTERSECT: 0

AVX512_FP16: 0

Source code example Ch09_01 illustrates how to code a comprehensive x86 instruction set extension detection class. Code fragments from this example can be extracted to create a streamlined x86 instruction set detection class with fewer detection capabilities (e.g., only AVX, AVX2, and FMA). Finally, it should also be noted that many AVX-512 instructions (and their corresponding C++ SIMD intrinsic functions) can only be used if the host processor supports multiple AVX-512 instruction set extensions. For example, the AVX-512 C++ SIMD intrinsic function _mm256_mask_sub_epi16() will only execute on a processor that supports AVX512F, AVX512BW, and AVX512VL. One programming strategy to overcome the inconvenience of having to test multiple AVX-512 instruction set extensions is to create a single application-level status flag that logically ANDs any required AVX-512 instruction-set-extension status flags into a single Boolean variable. The Intel programming reference manuals listed in Appendix B contain additional information about this topic. Appendix B also contains a list of open-source libraries that you can use to determine x86 processor instruction set availability.
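
The single-flag strategy mentioned above might be sketched as follows. The FF enumeration here is a hypothetical three-member stand-in that copies the bit positions of CpuidInfo::FF from Listing 9-1 so that the example is self-contained. The names c_RequiredFlags and HasAllFlags() are also assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for three members of CpuidInfo::FF (same bit positions as Listing 9-1).
enum class FF : uint64_t
{
    AVX512F  = (uint64_t)1 << 24,
    AVX512BW = (uint64_t)1 << 29,
    AVX512VL = (uint64_t)1 << 30,
};

// Extensions required by _mm256_mask_sub_epi16() per the text above.
constexpr uint64_t c_RequiredFlags =
    (uint64_t)FF::AVX512F | (uint64_t)FF::AVX512BW | (uint64_t)FF::AVX512VL;

// True only if every bit in 'required' is also set in 'feature_flags'.
constexpr bool HasAllFlags(uint64_t feature_flags, uint64_t required)
{
    return (feature_flags & required) == required;
}
```

With the CpuidInfo class from Listing 9-1, an equivalent one-time test would AND three ci.GetFF() results into one bool that the rest of the application consults.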

Short Vector Math Library

Many numerically oriented algorithms use standard C++ math library routines such as exp(), log(), log10(), pow(), sin(), cos(), and tan(). These functions carry out their calculations using scalar single-precision or double-precision floating-point values. The Short Vector Math Library (SVML), originally developed by Intel for their C/C++ compilers, contains SIMD versions of most standard C++ math library routines. SVML functions can also be used in application programs that are developed using Visual Studio 2019 or later. In this section, you will learn how to code some SIMD calculating functions that exploit SVML. The first example demonstrates converting an array of rectangular coordinates into polar coordinates. This is followed by an example that calculates body surface areas using arrays of patient heights and weights.

Rectangular to Polar Coordinates

A point on a two-dimensional plane can be uniquely specified using an ordered (x, y) pair. The values x and y represent signed distances from an origin point, which is located at the intersection of two perpendicular axes. An ordered (x, y) pair is called a rectangular (or Cartesian) coordinate. A point on a two-dimensional plane can also be uniquely specified using a radius vector r and angle θ as illustrated in Figure 9-1. An ordered (r, θ) pair is called a polar coordinate.

Figure 9-1. Specification of a point using rectangular and polar coordinates

A rectangular coordinate can be converted to a polar coordinate using the following equations:

$r=\sqrt{x^2+y^2}$

$\theta = \operatorname{atan2}(y, x), \quad \text{where } \theta \in [-\pi, \pi]$
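
As a quick numeric check of the rectangular-to-polar equations, here is a minimal scalar sketch. The PolarF32 structure and the RectToPolar() function are hypothetical names, not part of example Ch09_02. The angle is converted to degrees, which matches the radians-to-degrees scaling used in the example's C++ conversion code.

```cpp
#include <cassert>
#include <cmath>

struct PolarF32
{
    float r;        // radius
    float a_deg;    // angle in degrees, in the range [-180, 180]
};

// Converts one rectangular coordinate (x, y) to polar form.
inline PolarF32 RectToPolar(float x, float y)
{
    const float rad_to_deg = 180.0f / 3.14159265358979f;

    PolarF32 p;
    p.r = std::sqrt(x * x + y * y);
    p.a_deg = std::atan2(y, x) * rad_to_deg;    // note the (y, x) argument order
    return p;
}
```

For the point (3, 4), the radius evaluates to 5. A point on the positive y-axis yields an angle of 90 degrees, and a point on the negative x-axis yields 180 degrees.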

A polar coordinate can be converted to a rectangular coordinate using the following equations:

$x = r\cos\theta$

$y = r\sin\theta$

Listing 9-2 shows the source code for example Ch09_02. This example illustrates how to use several SVML functions to convert arrays of rectangular coordinates to polar coordinates and vice versa.

//------------------------------------------------

// Ch09_02.h

//------------------------------------------------

#pragma once

#include <vector>

// Ch09_02_fcpp.cpp

extern void ConvertRectToPolarF32_Cpp(std::vector<float>& r, std::vector<float>& a,

const std::vector<float>& x, const std::vector<float>& y);

extern void ConvertRectToPolarF32_Iavx(std::vector<float>& r, std::vector<float>& a,

const std::vector<float>& x, const std::vector<float>& y);

extern void ConvertPolarToRectF32_Cpp(std::vector<float>& x, std::vector<float>& y,

const std::vector<float>& r, const std::vector<float>& a);

extern void ConvertPolarToRectF32_Iavx(std::vector<float>& x, std::vector<float>& y,

const std::vector<float>& r, const std::vector<float>& a);

// Ch09_02_misc.cpp

extern bool CheckArgs(const std::vector<float>& v1, const std::vector<float>& v2,

const std::vector<float>& v3, const std::vector<float>& v4);

extern bool CompareResults(const std::vector<float>& v1,

const std::vector<float>& v2);

extern void FillVectorsRect(std::vector<float>& x, std::vector<float>& y);

extern void FillVectorsPolar(std::vector<float>& r, std::vector<float>& a);

//------------------------------------------------

// Ch09_02_misc.cpp

//------------------------------------------------

#include <vector>

#include <stdexcept>

#include <cmath>

#include "Ch09_02.h"

#include "MT.h"

bool CheckArgs(const std::vector<float>& v1, const std::vector<float>& v2,

const std::vector<float>& v3, const std::vector<float>& v4)

{

size_t n = v1.size();

return (n == v2.size() && n == v3.size() && n == v4.size());

}

bool CompareResults(const std::vector<float>& v1, const std::vector<float>& v2)

{

float epsilon = 1.0e-4f;

if (v1.size() != v2.size())

return false;

size_t n = v1.size();

for (size_t i = 0; i < n; i++)

{

if (fabs(v1[i] - v2[i]) > epsilon)

return false;

}

return true;

}

void FillVectorsRect(std::vector<float>& x, std::vector<float>& y)

{

if (x.size() != y.size())

throw std::runtime_error("FillVectorsRect() - non-conforming vectors");

const int rng_min = -25;

const int rng_max = 25;

const unsigned int rng_seed_x = 699;

const unsigned int rng_seed_y = 701;

MT::FillArray(x.data(), x.size(), rng_min, rng_max, rng_seed_x, true);

MT::FillArray(y.data(), y.size(), rng_min, rng_max, rng_seed_y, true);

}

void FillVectorsPolar(std::vector<float>& r, std::vector<float>& a)

{

if (r.size() != a.size())

throw std::runtime_error("FillVectorsPolar() - non-conforming vectors");

const int rng_min_r = 1;

const int rng_max_r = 50;

const int rng_min_a = -359;

const int rng_max_a = 359;

const unsigned int rng_seed_r = 703;

const unsigned int rng_seed_a = 707;

MT::FillArray(r.data(), r.size(), rng_min_r, rng_max_r, rng_seed_r, true);

MT::FillArray(a.data(), a.size(), rng_min_a, rng_max_a, rng_seed_a, true);

}

//------------------------------------------------

// Ch09_02.cpp

//------------------------------------------------

#include <iostream>

#include <iomanip>

#include <vector>

#include <stdexcept>

#include "Ch09_02.h"

static void ConvertRectToPolar(void);

static void ConvertPolarToRect(void);

int main()

{

try

{

ConvertRectToPolar();

ConvertPolarToRect();

}

catch (std::exception& ex)

{

std::cout << "Ch09_02 exception: " << ex.what() << '\n';

}

return 0;

}

static void ConvertRectToPolar(void)

{

const size_t n = 19;

std::vector<float> x(n), y(n);

std::vector<float> r1(n), a1(n);

std::vector<float> r2(n), a2(n);

FillVectorsRect(x, y);

ConvertRectToPolarF32_Cpp(r1, a1, x, y);

ConvertRectToPolarF32_Iavx(r2, a2, x, y);

size_t w = 10;

std::cout << std::fixed << std::setprecision(4);

std::cout << "\n----- Results for ConvertRectToPolar -----\n";

for (size_t i = 0; i < n; i++)

{

std::cout << std::setw(4) << i << ": ";

std::cout << std::setw(w) << x[i] << ", ";

std::cout << std::setw(w) << y[i] << " | ";

std::cout << std::setw(w) << r1[i] << ", ";

std::cout << std::setw(w) << a1[i] << " | ";

std::cout << std::setw(w) << r2[i] << ", ";

std::cout << std::setw(w) << a2[i] << '\n';

}

if (!CompareResults(r1, r2) || !CompareResults(a1, a2))

throw std::runtime_error("CompareResults() failed");

}

static void ConvertPolarToRect(void)

{

const size_t n = 19;

std::vector<float> r(n), a(n);

std::vector<float> x1(n), y1(n);

std::vector<float> x2(n), y2(n);

FillVectorsPolar(r, a);

ConvertPolarToRectF32_Cpp(x1, y1, r, a);

ConvertPolarToRectF32_Iavx(x2, y2, r, a);

size_t w = 10;

std::cout << std::fixed << std::setprecision(4);

std::cout << "\n----- Results for ConvertPolarToRect -----\n";

for (size_t i = 0; i < n; i++)

{

std::cout << std::setw(4) << i << ": ";

std::cout << std::setw(w) << r[i] << ", ";

std::cout << std::setw(w) << a[i] << " | ";

std::cout << std::setw(w) << x1[i] << ", ";

std::cout << std::setw(w) << y1[i] << " | ";

std::cout << std::setw(w) << x2[i] << ", ";

std::cout << std::setw(w) << y2[i] << '\n';

}

if (!CompareResults(x1, x2) || !CompareResults(y1, y2))

throw std::runtime_error("CompareResults() failed");

}

//------------------------------------------------

// SimdMath.h

//------------------------------------------------

#if _MSC_VER >= 1921 // VS 2019 or later

#include <cmath>

#include <immintrin.h>

#elif defined(__GNUG__)

#include <cmath>

#include <immintrin.h>

#else

#error Unknown target in SimdMath.h

#endif

inline __m256 atan2_f32x8(__m256 y, __m256 x)

{

#if _MSC_VER >= 1921

return _mm256_atan2_ps(y, x);

#endif

#if defined(__GNUG__)

__m256 atan2_vals;

for (size_t i = 0; i < 8; i++)

atan2_vals[i] = atan2(y[i], x[i]);

return atan2_vals;

#endif

}

inline __m256 cos_f32x8(__m256 x)

{

#if _MSC_VER >= 1921

return _mm256_cos_ps(x);

#endif

#if defined(__GNUG__)

__m256 cos_vals;

for (size_t i = 0; i < 8; i++)

cos_vals[i] = cos(x[i]);

return cos_vals;

#endif

}

inline __m256d pow_f64x4(__m256d x, __m256d y)

{

#if _MSC_VER >= 1921

return _mm256_pow_pd(x, y);

#endif

#if defined(__GNUG__)

__m256d pow_vals;

for (size_t i = 0; i < 4; i++)

pow_vals[i] = pow(x[i], y[i]);

return pow_vals;

#endif

}

inline __m256 sin_f32x8(__m256 x)

{

#if _MSC_VER >= 1921

return _mm256_sin_ps(x);

#endif

#if defined(__GNUG__)

__m256 sin_vals;

for (size_t i = 0; i < 8; i++)

sin_vals[i] = sin(x[i]);

return sin_vals;

#endif

}

//------------------------------------------------

// Ch09_02_fcpp.cpp

//------------------------------------------------

#include <iostream>

#include <stdexcept>

#include <immintrin.h>

#define _USE_MATH_DEFINES

#include <math.h>

#include "Ch09_02.h"

#include "SimdMath.h"

const float c_DegToRad = (float)(M_PI / 180.0);

const float c_RadToDeg = (float)(180.0 / M_PI);

void ConvertRectToPolarF32_Cpp(std::vector<float>& r, std::vector<float>& a,

const std::vector<float>& x, const std::vector<float>& y)

{

if (!CheckArgs(r, a, x, y))

throw std::runtime_error("ConvertRectToPolarF32_Cpp() - CheckArgs failed");

size_t n = r.size();

for (size_t i = 0; i < n; i++)

{

r[i] = sqrt(x[i] * x[i] + y[i] * y[i]);

a[i] = atan2(y[i], x[i]) * c_RadToDeg;

}

}

void ConvertPolarToRectF32_Cpp(std::vector<float>& x, std::vector<float>& y,

const std::vector<float>& r, const std::vector<float>& a)

{

if (!CheckArgs(x, y, r, a))

throw std::runtime_error("ConvertPolarToRectF32_Cpp() - CheckArgs failed");

size_t n = x.size();

for (size_t i = 0; i < n; i++)

{

x[i] = r[i] * cos(a[i] * c_DegToRad);

y[i] = r[i] * sin(a[i] * c_DegToRad);

}

}

void ConvertRectToPolarF32_Iavx(std::vector<float>& r, std::vector<float>& a,

const std::vector<float>& x, const std::vector<float>& y)

{

if (!CheckArgs(r, a, x, y))

throw std::runtime_error("ConvertRectToPolarF32_Iavx() - CheckArgs failed");

size_t n = r.size();

__m256 rad_to_deg = _mm256_set1_ps(c_RadToDeg);

size_t i = 0;

const size_t num_simd_elements = 8;

for (; n - i >= num_simd_elements; i += num_simd_elements)

{

__m256 x_vals = _mm256_loadu_ps(&x[i]);

__m256 y_vals = _mm256_loadu_ps(&y[i]);

__m256 x_vals2 = _mm256_mul_ps(x_vals, x_vals);

__m256 y_vals2 = _mm256_mul_ps(y_vals, y_vals);

__m256 temp = _mm256_add_ps(x_vals2, y_vals2);

__m256 r_vals = _mm256_sqrt_ps(temp);

_mm256_storeu_ps(&r[i], r_vals);

__m256 a_vals_rad = atan2_f32x8(y_vals, x_vals);

__m256 a_vals_deg = _mm256_mul_ps(a_vals_rad, rad_to_deg);

_mm256_storeu_ps(&a[i], a_vals_deg);

}

for (; i < n; i++)

{

r[i] = sqrt(x[i] * x[i] + y[i] * y[i]);

a[i] = atan2(y[i], x[i]) * c_RadToDeg;

}

}

void ConvertPolarToRectF32_Iavx(std::vector<float>& x, std::vector<float>& y,

const std::vector<float>& r, const std::vector<float>& a)

{

if (!CheckArgs(x, y, r, a))

throw std::runtime_error("ConvertPolarToRectF32_Iavx() - CheckArgs failed");

size_t n = x.size();

__m256 deg_to_rad = _mm256_set1_ps(c_DegToRad);

size_t i = 0;

const size_t num_simd_elements = 8;

for (; n - i >= num_simd_elements; i += num_simd_elements)

{

__m256 r_vals = _mm256_loadu_ps(&r[i]);

__m256 a_vals_deg = _mm256_loadu_ps(&a[i]);

__m256 a_vals_rad = _mm256_mul_ps(a_vals_deg, deg_to_rad);

__m256 x_vals_temp = cos_f32x8(a_vals_rad);

__m256 x_vals = _mm256_mul_ps(r_vals, x_vals_temp);

_mm256_storeu_ps(&x[i], x_vals);

__m256 y_vals_temp = sin_f32x8(a_vals_rad);

__m256 y_vals = _mm256_mul_ps(r_vals, y_vals_temp);

_mm256_storeu_ps(&y[i], y_vals);

}

for (; i < n; i++)

{

x[i] = r[i] * cos(a[i] * c_DegToRad);

y[i] = r[i] * sin(a[i] * c_DegToRad);

}

}

Listing 9-2

Example Ch09_02

Listing 9-2 starts with the file Ch09_02.h. Note that the function declarations in this file use arguments of type std::vector<float> for the various coordinate arrays. The next file in Listing 9-2, Ch09_02_misc.cpp, contains assorted functions that perform argument checking and vector initialization. Also shown in Listing 9-2 is the file Ch09_02.cpp. This file contains a function named ConvertRectToPolar(), which performs test case initialization. Function ConvertRectToPolar() also exercises the SIMD rectangular to polar coordinate conversion function and streams results to std::cout. The polar to rectangular counterpart of function ConvertRectToPolar() is named ConvertPolarToRect() and is also located in file Ch09_02.cpp.

The next file in Listing 9-2, SimdMath.h, defines several inline functions that perform common math operations using SIMD arguments of type __m256 or __m256d. Note that this file includes preprocessor definitions that enable different code blocks for Visual C++ and GNU C++. The Visual C++ sections emit SVML library function calls since these are directly supported in Visual Studio 2019 and later. The GNU C++ sections substitute simple for-loops for the SVML functions since SVML is not directly supported. If you are interested in using SVML with GNU C++ and Linux, you should consult the Intel C++ compiler and GNU C++ compiler references listed in Appendix B.

The final file in Listing 9-2, Ch09_02_fcpp.cpp, begins with the definitions of functions ConvertRectToPolarF32_Cpp() and ConvertPolarToRectF32_Cpp(). These functions perform rectangular to polar and polar to rectangular coordinate conversions using standard C++ statements and math library functions. The next function, ConvertRectToPolarF32_Iavx(), performs rectangular to polar coordinate conversions using AVX and C++ SIMD intrinsic functions. Following argument validation, ConvertRectToPolarF32_Iavx() employs _mm256_set1_ps() to create a packed version of the constant 180.0 / M_PI for radian to degree conversions. The first for-loop in ConvertRectToPolarF32_Iavx() uses C++ SIMD intrinsic functions that you have already seen. Note the use of atan2_f32x8(), which is defined in SimdMath.h. This function calculates eight polar coordinate angle components. The second for-loop in ConvertRectToPolarF32_Iavx() processes any residual coordinates using standard C++ math library functions.
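The division of work between the two for-loops can be traced without any SIMD hardware. The following minimal sketch reproduces only the loop control logic of ConvertRectToPolarF32_Iavx() and counts how many elements each loop handles. The names LoopCounts and CountLoopElements are hypothetical and are not part of the book's source code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type: number of elements processed by each loop.
struct LoopCounts
{
    std::size_t simd_elements;      // handled by the first (SIMD) for-loop
    std::size_t residual_elements;  // handled by the second (scalar) for-loop
};

// Mirrors only the loop control logic of ConvertRectToPolarF32_Iavx().
LoopCounts CountLoopElements(std::size_t n, std::size_t num_simd_elements)
{
    assert(num_simd_elements > 0);

    LoopCounts counts { 0, 0 };
    std::size_t i = 0;

    // First loop: runs while at least one complete SIMD block remains.
    // Since i never exceeds n, the unsigned subtraction n - i cannot wrap.
    for (; n - i >= num_simd_elements; i += num_simd_elements)
        counts.simd_elements += num_simd_elements;

    // Second loop: picks up the elements that do not fill a SIMD block.
    for (; i < n; i++)
        counts.residual_elements++;

    return counts;
}
```

With the values used in Ch09_02 (n = 19 and num_simd_elements = 8), CountLoopElements(19, 8) reports 16 elements for the first loop (two iterations of eight) and 3 residual elements for the second loop.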

Also included in file Ch09_02_fcpp.cpp is the polar to rectangular coordinate conversion function ConvertPolarToRectF32_Iavx(). The layout of this function is akin to ConvertRectToPolarF32_Iavx(). Note that the first for-loop in ConvertPolarToRectF32_Iavx() uses the SIMD math functions cos_f32x8() and sin_f32x8(), also defined in SimdMath.h. It is important to keep in mind that minor value discrepancies may exist between a standard C++ math library function and its SVML counterpart. Here are the results for source code example Ch09_02:

----- Results for ConvertRectToPolar -----

0: -1.0000, 13.0000 | 13.0384, 94.3987 | 13.0384, 94.3987

1: 5.0000, -6.0000 | 7.8102, -50.1944 | 7.8102, -50.1944

2: 21.0000, -6.0000 | 21.8403, -15.9454 | 21.8403, -15.9454

3: -16.0000, -4.0000 | 16.4924, -165.9638 | 16.4924, -165.9638

4: 11.0000, 20.0000 | 22.8254, 61.1892 | 22.8254, 61.1892

5: 22.0000, -14.0000 | 26.0768, -32.4712 | 26.0768, -32.4712

6: 24.0000, -2.0000 | 24.0832, -4.7636 | 24.0832, -4.7636

7: -9.0000, -5.0000 | 10.2956, -150.9454 | 10.2956, -150.9454

8: -23.0000, 3.0000 | 23.1948, 172.5686 | 23.1948, 172.5686

9: 23.0000, 17.0000 | 28.6007, 36.4692 | 28.6007, 36.4692

10: -2.0000, -4.0000 | 4.4721, -116.5650 | 4.4721, -116.5650

11: 23.0000, 21.0000 | 31.1448, 42.3974 | 31.1448, 42.3974

12: 25.0000, -17.0000 | 30.2324, -34.2157 | 30.2324, -34.2157

13: -4.0000, -12.0000 | 12.6491, -108.4350 | 12.6491, -108.4350

14: 21.0000, -2.0000 | 21.0950, -5.4403 | 21.0950, -5.4403

15: 17.0000, -7.0000 | 18.3848, -22.3801 | 18.3848, -22.3801

16: -3.0000, -19.0000 | 19.2354, -98.9726 | 19.2354, -98.9726

17: -25.0000, 21.0000 | 32.6497, 139.9697 | 32.6497, 139.9697

18: 5.0000, 16.0000 | 16.7631, 72.6460 | 16.7631, 72.6460

----- Results for ConvertPolarToRect -----

0: 43.0000, -64.0000 | 18.8500, -38.6481 | 18.8500, -38.6481

1: 22.0000, 194.0000 | -21.3465, -5.3223 | -21.3465, -5.3223

2: 11.0000, -81.0000 | 1.7208, -10.8646 | 1.7208, -10.8646

3: 47.0000, 149.0000 | -40.2869, 24.2068 | -40.2869, 24.2068

4: 34.0000, -217.0000 | -27.1536, 20.4617 | -27.1536, 20.4617

5: 8.0000, 194.0000 | -7.7624, -1.9354 | -7.7624, -1.9354

6: 12.0000, 158.0000 | -11.1262, 4.4953 | -11.1262, 4.4953

7: 11.0000, 90.0000 | -0.0000, 11.0000 | -0.0000, 11.0000

8: 46.0000, 111.0000 | -16.4849, 42.9447 | -16.4849, 42.9447

9: 34.0000, 92.0000 | -1.1866, 33.9793 | -1.1866, 33.9793

10: 14.0000, 84.0000 | 1.4634, 13.9233 | 1.4634, 13.9233

11: 17.0000, -37.0000 | 13.5768, -10.2309 | 13.5768, -10.2309

12: 26.0000, -61.0000 | 12.6050, -22.7401 | 12.6050, -22.7401

13: 14.0000, -76.0000 | 3.3869, -13.5841 | 3.3869, -13.5841

14: 27.0000, 197.0000 | -25.8202, -7.8940 | -25.8202, -7.8940

15: 3.0000, -36.0000 | 2.4271, -1.7634 | 2.4271, -1.7634

16: 5.0000, 196.0000 | -4.8063, -1.3782 | -4.8063, -1.3782

17: 40.0000, -149.0000 | -34.2867, -20.6015 | -34.2867, -20.6015

18: 27.0000, 354.0000 | 26.8521, -2.8223 | 26.8521, -2.8223
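Individual rows of these results can be checked by hand with the scalar formulas from ConvertRectToPolarF32_Cpp() and ConvertPolarToRectF32_Cpp(). The sketch below does this for row 0 of each table. The types Polar and Rect, the constant c_Pi, and the functions RectToPolar(), PolarToRect(), MatchesPrinted(), and CheckFirstRows() are hypothetical names introduced for illustration only; the tolerance of 1.0e-3 is an assumption chosen because the output is printed to four decimal places.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper types for this sketch; not part of Ch09_02.
struct Polar { float r; float a_deg; };
struct Rect  { float x; float y; };

const float c_Pi = 3.14159265358979323846f;
const float c_DegToRad = c_Pi / 180.0f;
const float c_RadToDeg = 180.0f / c_Pi;

// Same arithmetic as the loop body of ConvertRectToPolarF32_Cpp().
Polar RectToPolar(float x, float y)
{
    Polar p;
    p.r = std::sqrt(x * x + y * y);
    p.a_deg = std::atan2(y, x) * c_RadToDeg;
    return p;
}

// Same arithmetic as the loop body of ConvertPolarToRectF32_Cpp().
Rect PolarToRect(float r, float a_deg)
{
    Rect q;
    q.x = r * std::cos(a_deg * c_DegToRad);
    q.y = r * std::sin(a_deg * c_DegToRad);
    return q;
}

// True if a computed value agrees with a value printed to four decimals.
bool MatchesPrinted(float computed, float printed)
{
    return std::fabs(computed - printed) < 1.0e-3f;
}

// Checks row 0 of each results table shown above.
void CheckFirstRows()
{
    Polar p = RectToPolar(-1.0f, 13.0f);
    assert(MatchesPrinted(p.r, 13.0384f));
    assert(MatchesPrinted(p.a_deg, 94.3987f));

    Rect q = PolarToRect(43.0f, -64.0f);
    assert(MatchesPrinted(q.x, 18.8500f));
    assert(MatchesPrinted(q.y, -38.6481f));
}
```

Comparing against a small tolerance rather than testing for exact equality follows from the earlier remark that a standard C++ math library function and its SVML counterpart may return slightly different values.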

Body Surface Area

Healthcare professionals often use body surface area (BSA) to establish chemotherapy dosages for cancer patients. Table 9-1 lists three well-known equations that calculate BSA. In this table, each equation uses the symbol H for patient height in centimeters, W for patient weight in kilograms, and BSA for patient body surface area in square meters.

Table 9-1

Body Surface Area Equations

Method	Equation
DuBois and DuBois	BSA = 0.007184 × H^0.725 × W^0.425
Gehan and George	BSA = 0.0235 × H^0.42246 × W^0.51456
Mosteller	BSA = √(H × W / 3600)
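As a quick worked example, the three BSA equations can be evaluated for a single patient with ordinary scalar C++. The sketch below uses the DuBois and DuBois and the Gehan and George equations from Table 9-1, plus the Mosteller form BSA = sqrt(H × W / 3600) that the code in Listing 9-3 uses. The type BsaValues and the function CalcBsaOnePatient() are hypothetical names that do not appear in the book's source code.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical result type for this sketch; not part of Ch09_03.
struct BsaValues
{
    double dubois;     // DuBois and DuBois
    double gehan;      // Gehan and George
    double mosteller;  // Mosteller
};

// h: height in centimeters, w: weight in kilograms.
// All three results are in square meters.
BsaValues CalcBsaOnePatient(double h, double w)
{
    assert(h > 0.0 && w > 0.0);

    BsaValues v;
    v.dubois = 0.007184 * std::pow(h, 0.725) * std::pow(w, 0.425);
    v.gehan = 0.0235 * std::pow(h, 0.42246) * std::pow(w, 0.51456);
    v.mosteller = std::sqrt(h * w / 3600.0);
    return v;
}
```

For H = 175 cm and W = 84 kg, the three results are approximately 1.9970, 2.0362, and 2.0207 square meters, which agrees with row 2 of the Ch09_03 output shown later in this section.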

Listing 9-3 shows the source code for example Ch09_03. This example implements the three BSA equations shown in Table 9-1 using SIMD arithmetic and arrays of double-precision floating-point heights and weights.

//------------------------------------------------

// Ch09_03.h

//------------------------------------------------

#pragma once

#include <vector>

// Ch09_03_fcpp.cpp

extern void CalcBSA_F64_Cpp(std::vector<double>& bsa, const std::vector<double>& ht,

const std::vector<double>& wt);

extern void CalcBSA_F64_Iavx(std::vector<double>& bsa, const std::vector<double>& ht,

const std::vector<double>& wt);

// Ch09_03_misc.cpp

extern bool CheckArgs(const std::vector<double>& bsa,

const std::vector<double>& ht, const std::vector<double>& wt);

extern bool CompareResults(const std::vector<double>& bsa1,

const std::vector<double>& bsa2);

extern void FillHeightWeightVectors(std::vector<double>& ht,

std::vector<double>& wt);

// Ch09_03_bm.cpp

void CalcBSA_bm(void);

//------------------------------------------------

// Ch09_03_misc.cpp

//------------------------------------------------

#include <vector>

#include <algorithm>

#include <stdexcept>

#include <cmath>

#include "Ch09_03.h"

#include "MT.h"

bool CheckArgs(const std::vector<double>& bsa, const std::vector<double>& ht,

const std::vector<double>& wt)

{

if (ht.size() != wt.size())

return false;

if (bsa.size() != ht.size() * 3)

return false;

return true;

}

bool CompareResults(const std::vector<double>& bsa1,

const std::vector<double>& bsa2)

{

double epsilon = 1.0e-9;

if (bsa1.size() != bsa2.size())

return false;

size_t n = bsa1.size();

for (size_t i = 0; i < n; i++)

{

if (fabs(bsa1[i] - bsa2[i]) > epsilon)

return false;

}

return true;

}

void FillHeightWeightVectors(std::vector<double>& ht, std::vector<double>& wt)

{

const int rng_min_ht = 140; // cm

const int rng_max_ht = 204; // cm

const int rng_min_wt = 40; // kg

const int rng_max_wt = 140; // kg

const unsigned int rng_seed_ht = 803;

const unsigned int rng_seed_wt = 807;

MT::FillArray(ht.data(), ht.size(), rng_min_ht, rng_max_ht, rng_seed_ht);

MT::FillArray(wt.data(), wt.size(), rng_min_wt, rng_max_wt, rng_seed_wt);

}

//------------------------------------------------

// Ch09_03.cpp

//------------------------------------------------

#include <iostream>

#include <iomanip>

#include <vector>

#include <string>

#include "Ch09_03.h"

static void CalcBSA(void);

int main()

{

try

{

CalcBSA();

CalcBSA_bm();

}

catch (std::exception& ex)

{

std::cout << "Ch09_03 exception: " << ex.what() << '\n';

}

return 0;

}

static void CalcBSA(void)

{

const size_t n = 19;

std::vector<double> heights(n);

std::vector<double> weights(n);

std::vector<double> bsa1(n * 3);

std::vector<double> bsa2(n * 3);

FillHeightWeightVectors(heights, weights);

CalcBSA_F64_Cpp(bsa1, heights, weights);

CalcBSA_F64_Iavx(bsa2, heights, weights);

size_t w = 8;

std::cout << std::fixed;

std::cout << "----- Results for CalcBSA -----\n";

std::cout << " ht(cm) wt(kg)";

std::cout << " CppAlg0 CppAlg1 CppAlg2";

std::cout << " AvxAlg0 AvxAlg1 AvxAlg2";

std::cout << '\n' << std::string(86, '-') << '\n';

for (size_t i = 0; i < n; i++)

{

std::cout << std::setw(4) << i << ": ";

std::cout << std::setprecision(2);

std::cout << std::setw(w) << heights[i] << " ";

std::cout << std::setw(w) << weights[i] << " | ";

std::cout << std::setprecision(4);

std::cout << std::setw(w) << bsa1[(n * 0) + i] << " ";

std::cout << std::setw(w) << bsa1[(n * 1) + i] << " ";

std::cout << std::setw(w) << bsa1[(n * 2) + i] << " | ";

std::cout << std::setw(w) << bsa2[(n * 0) + i] << " ";

std::cout << std::setw(w) << bsa2[(n * 1) + i] << " ";

std::cout << std::setw(w) << bsa2[(n * 2) + i] << '\n';

}

if (!CompareResults(bsa1, bsa2))

throw std::runtime_error("CompareResults() failed");

}

//------------------------------------------------

// Ch09_03_fcpp.cpp

//------------------------------------------------

#include <iostream>

#include <immintrin.h>

#include <cmath>

#include <stdexcept>

#include "Ch09_03.h"

#include "SimdMath.h"

void CalcBSA_F64_Cpp(std::vector<double>& bsa, const std::vector<double>& ht,

const std::vector<double>& wt)

{

if (!CheckArgs(bsa, ht, wt))

throw std::runtime_error("CalcBSA_F64_Cpp() - CheckArgs failed");

size_t n = ht.size();

for (size_t i = 0; i < n; i++)

{

bsa[(n * 0) + i] = 0.007184 * pow(ht[i], 0.725) * pow(wt[i], 0.425);

bsa[(n * 1) + i] = 0.0235 * pow(ht[i], 0.42246) * pow(wt[i], 0.51456);

bsa[(n * 2) + i] = sqrt(ht[i] * wt[i] / 3600.0);

}

}

void CalcBSA_F64_Iavx(std::vector<double>& bsa, const std::vector<double>& ht,

const std::vector<double>& wt)

{

if (!CheckArgs(bsa, ht, wt))

throw std::runtime_error("CalcBSA_F64_Iavx() - CheckArgs failed");

__m256d f64_0p007184 = _mm256_set1_pd(0.007184);

__m256d f64_0p725 = _mm256_set1_pd(0.725);

__m256d f64_0p425 = _mm256_set1_pd(0.425);

__m256d f64_0p0235 = _mm256_set1_pd(0.0235);

__m256d f64_0p42246 = _mm256_set1_pd(0.42246);

__m256d f64_0p51456 = _mm256_set1_pd(0.51456);

__m256d f64_3600p0 = _mm256_set1_pd(3600.0);

size_t i = 0;

size_t n = ht.size();

const size_t num_simd_elements = 4;

for (; n - i >= num_simd_elements; i += num_simd_elements)

{

__m256d ht_vals = _mm256_loadu_pd(&ht[i]);

__m256d wt_vals = _mm256_loadu_pd(&wt[i]);

__m256d temp1 = pow_f64x4(ht_vals, f64_0p725);

__m256d temp2 = pow_f64x4(wt_vals, f64_0p425);

__m256d temp3 = _mm256_mul_pd(temp1, temp2);

__m256d bsa_vals = _mm256_mul_pd(f64_0p007184, temp3);

_mm256_storeu_pd(&bsa[(n * 0) + i], bsa_vals);

temp1 = pow_f64x4(ht_vals, f64_0p42246);

temp2 = pow_f64x4(wt_vals, f64_0p51456);

temp3 = _mm256_mul_pd(temp1, temp2);

bsa_vals = _mm256_mul_pd(f64_0p0235, temp3);

_mm256_storeu_pd(&bsa[(n * 1) + i], bsa_vals);

temp1 = _mm256_mul_pd(ht_vals, wt_vals);

temp2 = _mm256_div_pd(temp1, f64_3600p0);

bsa_vals = _mm256_sqrt_pd(temp2);

_mm256_storeu_pd(&bsa[(n * 2) + i], bsa_vals);

}

for (; i < n; i++)

{

bsa[(n * 0) + i] = 0.007184 * pow(ht[i], 0.725) * pow(wt[i], 0.425);

bsa[(n * 1) + i] = 0.0235 * pow(ht[i], 0.42246) * pow(wt[i], 0.51456);

bsa[(n * 2) + i] = sqrt(ht[i] * wt[i] / 3600.0);

}

}

Listing 9-3

Example Ch09_03

The first file in Listing 9-3, Ch09_03.h, includes the requisite function declarations for this example. Note that the BSA calculating functions require arrays of type std::vector<double> for the heights, weights, and BSAs. File Ch09_03_misc.cpp contains functions that validate arguments and initialize test data vectors. In function CheckArgs(), note that the size of array bsa must be three times the size of array ht since the results for all three BSA equations are saved in bsa. The function CalcBSA(), located in file Ch09_03.cpp, allocates the test data vectors, invokes the BSA calculating functions, and displays results.
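The layout of array bsa follows from the index expressions (n * 0) + i, (n * 1) + i, and (n * 2) + i in Listing 9-3: the n results for each equation are stored contiguously, one equation after another. The sketch below makes this layout explicit. The functions BsaIndex() and ExtractEquation() are hypothetical helpers written for illustration and are not part of Ch09_03.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Position in bsa of the result of equation eq (0, 1, or 2) for patient i,
// given n patients. Matches the expression (n * eq) + i in Listing 9-3.
std::size_t BsaIndex(std::size_t n, std::size_t eq, std::size_t i)
{
    assert(eq < 3 && i < n);
    return (n * eq) + i;
}

// Copies the n results of one equation out of the combined bsa array.
std::vector<double> ExtractEquation(const std::vector<double>& bsa,
    std::size_t n, std::size_t eq)
{
    // Same size requirement that CheckArgs() enforces.
    assert(bsa.size() == n * 3);

    std::vector<double> result(n);

    for (std::size_t i = 0; i < n; i++)
        result[i] = bsa[BsaIndex(n, eq, i)];

    return result;
}
```

With n = 19, as in function CalcBSA(), the DuBois and DuBois results occupy elements 0 through 18, the Gehan and George results occupy elements 19 through 37, and the Mosteller results occupy elements 38 through 56.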

The first function in file Ch09_03_fcpp.cpp, CalcBSA_F64_Cpp(), calculates BSA values using standard C++ statements and is included for comparison purposes. The SIMD counterpart of function CalcBSA_F64_Cpp() is named CalcBSA_F64_Iavx(). This function begins its execution with a series of _mm256_set1_pd() calls that initialize packed versions of the constants used in the BSA equations. In the first for-loop, each iteration begins with two calls to _mm256_loadu_pd() that load four pairs of heights and weights from arrays ht and wt, respectively. The next three code blocks calculate the equations shown in Table 9-1 using C++ SIMD intrinsic functions that you have already seen. Note that function pow_f64x4() is defined in the header file SimdMath.h. Following calculation of each BSA equation, CalcBSA_F64_Iavx() uses the C++ SIMD intrinsic function _mm256_storeu_pd() to save the computed BSAs in array bsa. Here is the output for source code example Ch09_03:

----- Results for CalcBSA -----

ht(cm) wt(kg) CppAlg0 CppAlg1 CppAlg2 AvxAlg0 AvxAlg1 AvxAlg2

--------------------------------------------------------------------------------------

0: 151.00 116.00 | 2.0584 2.2588 2.2058 | 2.0584 2.2588 2.2058

1: 192.00 136.00 | 2.6213 2.7133 2.6932 | 2.6213 2.7133 2.6932

2: 175.00 84.00 | 1.9970 2.0362 2.0207 | 1.9970 2.0362 2.0207

3: 187.00 52.00 | 1.7090 1.6361 1.6435 | 1.7090 1.6361 1.6435

4: 165.00 51.00 | 1.5480 1.5364 1.5289 | 1.5480 1.5364 1.5289

5: 184.00 44.00 | 1.5734 1.4911 1.4996 | 1.5734 1.4911 1.4996

6: 145.00 59.00 | 1.4996 1.5681 1.5416 | 1.4996 1.5681 1.5416

7: 154.00 56.00 | 1.5321 1.5659 1.5478 | 1.5321 1.5659 1.5478

8: 154.00 59.00 | 1.5665 1.6085 1.5887 | 1.5665 1.6085 1.5887

9: 167.00 127.00 | 2.3012 2.4695 2.4272 | 2.3012 2.4695 2.4272

10: 142.00 119.00 | 1.9901 2.2301 2.1665 | 1.9901 2.2301 2.1665

11: 183.00 65.00 | 1.8498 1.8185 1.8177 | 1.8498 1.8185 1.8177

12: 186.00 132.00 | 2.5293 2.6364 2.6115 | 2.5293 2.6364 2.6115

13: 154.00 117.00 | 2.0955 2.2878 2.2372 | 2.0955 2.2878 2.2372

14: 185.00 85.00 | 2.0896 2.0973 2.0900 | 2.0896 2.0973 2.0900

15: 191.00 103.00 | 2.3204 2.3466 2.3377 | 2.3204 2.3466 2.3377

16: 148.00 97.00 | 1.8801 2.0428 1.9969 | 1.8801 2.0428 1.9969

17: 192.00 62.00 | 1.8773 1.8112 1.8184 | 1.8773 1.8112 1.8184

18: 198.00 93.00 | 2.2806 2.2606 2.2616 | 2.2806 2.2606 2.2616

Running benchmark function CalcBSA_bm - please wait

Benchmark times save to file Ch09_03_CalcBSA_bm_OXYGEN4.csv

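Table 9-2 shows some benchmark timing measurements for source code example Ch09_03. In these measurements, CalcBSA_F64_Iavx() runs roughly 5.5 to 6.2 times faster than CalcBSA_F64_Cpp().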

Table 9-2

BSA Execution Times (Microseconds), 200,000 Heights and Weights

CPU	CalcBSA_F64_Cpp()	CalcBSA_F64_Iavx()
Intel Core i7-8700K	19628	3173
Intel Core i5-11600K	13347	2409

Summary

In this chapter, you learned how to exercise the cpuid instruction to ascertain availability of x86 processor extensions such as AVX, AVX2, and AVX-512. The AMD and Intel programming reference manuals listed in Appendix B contain additional information regarding proper use of the cpuid instruction. You are strongly encouraged to consult these guides before using the cpuid instruction (or an equivalent C++ intrinsic function) in your own programs.

The SVML includes hundreds of functions that implement SIMD versions of common C++ math library routines. It also includes dozens of other useful SIMD functions. You can obtain more information about SVML at https://software.intel.com/sites/landingpage/IntrinsicsGuide/#!=undefined&techs=SVML.
