Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Daniel KusswurmModern X86 Assembly Language Programminghttps://doi.org/10.1007/978-1-4842-4063-2_16

16. Advanced Programming

Daniel Kusswurm¹

(1)

Geneva, IL, USA

The final chapter of this book reviews several source code examples that demonstrate advanced x86 assembly language programming techniques. The first example explains how to use the cpuid instruction to detect specific x86 instruction set extensions. This is followed by two examples that illustrate how to accelerate SIMD processing functions using non-temporal memory stores and data prefetch instructions. The concluding example elucidates the use of an assembly language calculating function in a multithreaded application.

CPUID Instruction

It’s been mentioned several times in this book that an application program should never assume that a specific instruction set extension such as AVX, AVX2, or AVX-512 is available simply by knowing the processor’s microarchitecture, model number, or brand name. An application program should always test for the presence of an instruction set extension using the cupid (CPU Identification) instruction. Application programs can use this instruction to verify that a processor supports one of the previously-mentioned x86-AVX instruction set extensions. The cpuid instruction can also be used to obtain additional processor feature information that’s useful or needed in both application programs and operating system software.

Listing 16-1 shows the source code for example Ch16_01. This example demonstrates how to use the cupid instruction to determine processor support for various instruction set extensions. The source code for example Ch16_01 focuses on using cpuid to detect architectural features and instruction set extensions that are allied with this book’s content. If you're interested in learning how to use the cpuid instruction to identify other processor features, you should consult the AMD and Intel reference manuals that are listed in Appendix A.

//------------------------------------------------

// CpuidInfo.h

//------------------------------------------------

#pragma once

#include <cstdint>

#include <vector>

#include <string>

struct CpuidRegs

{

uint32_t EAX;

uint32_t EBX;

uint32_t ECX;

uint32_t EDX;

};

class CpuidInfo

{

public:

class CacheInfo

{

public:

enum class Type

{

Unknown, Data, Instruction, Unified

};

private:

uint32_t m_Level = 0;

Type m_Type = Type::Unknown;

uint32_t m_Size = 0;

public:

uint32_t GetLevel(void) const { return m_Level; }

uint32_t GetSize(void) const { return m_Size; }

Type GetType(void) const { return m_Type; }

// These are defined in CacheInfo.cpp

CacheInfo(uint32_t level, uint32_t type, uint32_t size);

std::string GetTypeString(void) const;

};

private:

uint32_t m_MaxEax; // Max EAX for basic CPUID

uint32_t m_MaxEaxExt; // Max EAX for extended CPUID

uint64_t m_FeatureFlags; // Processor feature flags

std::vector<CpuidInfo::CacheInfo> m_CacheInfo; // Processor cache information

char m_VendorId[13]; // Processor vendor ID string

char m_ProcessorBrand[49]; // Processor brand string

bool m_OsXsave; // XSAVE is enabled for app use

bool m_OsAvxState; // AVX state is enabled by OS

bool m_OsAvx512State; // AVX-512 state is enabled by OS

void Init(void);

void InitProcessorBrand(void);

void LoadInfo0(void);

void LoadInfo1(void);

void LoadInfo2(void);

void LoadInfo3(void);

void LoadInfo4(void);

void LoadInfo5(void);

public:

enum class FF : uint64_t

{

FXSR = (uint64_t)1 << 0,

MMX = (uint64_t)1 << 1,

MOVBE = (uint64_t)1 << 2,

SSE = (uint64_t)1 << 3,

SSE2 = (uint64_t)1 << 4,

SSE3 = (uint64_t)1 << 5,

SSSE3 = (uint64_t)1 << 6,

SSE4_1 = (uint64_t)1 << 7,

SSE4_2 = (uint64_t)1 << 8,

PCLMULQDQ = (uint64_t)1 << 9,

POPCNT = (uint64_t)1 << 10,

PREFETCHW = (uint64_t)1 << 11,

PREFETCHWT1 = (uint64_t)1 << 12,

RDRAND = (uint64_t)1 << 13,

RDSEED = (uint64_t)1 << 14,

ERMSB = (uint64_t)1 << 15,

AVX = (uint64_t)1 << 16,

AVX2 = (uint64_t)1 << 17,

F16C = (uint64_t)1 << 18,

FMA = (uint64_t)1 << 19,

BMI1 = (uint64_t)1 << 20,

BMI2 = (uint64_t)1 << 21,

LZCNT = (uint64_t)1 << 22,

ADX = (uint64_t)1 << 23,

AVX512F = (uint64_t)1 << 24,

AVX512ER = (uint64_t)1 << 25,

AVX512PF = (uint64_t)1 << 26,

AVX512DQ = (uint64_t)1 << 27,

AVX512CD = (uint64_t)1 << 28,

AVX512BW = (uint64_t)1 << 29,

AVX512VL = (uint64_t)1 << 30,

AVX512_IFMA = (uint64_t)1 << 31,

AVX512_VBMI = (uint64_t)1 << 32,

AVX512_4FMAPS = (uint64_t)1 << 33,

AVX512_4VNNIW = (uint64_t)1 << 34,

AVX512_VPOPCNTDQ = (uint64_t)1 << 35,

AVX512_VNNI = (uint64_t)1 << 36,

AVX512_VBMI2 = (uint64_t)1 << 37,

AVX512_BITALG = (uint64_t)1 << 38,

CLWB = (uint64_t)1 << 39,

};

CpuidInfo(void) { Init(); };

∼CpuidInfo() {};

const std::vector<CpuidInfo::CacheInfo>& GetCacheInfo(void) const

{

return m_CacheInfo;

}

bool GetFF(FF flag) const

{

return ((m_FeatureFlags & (uint64_t)flag) != 0) ? true : false;

}

std::string GetProcessorBrand(void) const { return std::string(m_ProcessorBrand); }

std::string GetVendorId(void) const { return std::string(m_VendorId); }

void LoadInfo(void);

};

// Cpuinfo_.asm

extern "C" void Xgetbv_(uint32_t r_ecx, uint32_t* r_eax, uint32_t* r_edx);

extern "C" uint32_t Cpuid_(uint32_t r_eax, uint32_t r_ecx, CpuidRegs* r_out);

;-------------------------------------------------

; CpuidInfo_.asm

;-------------------------------------------------

; The following structures must agree with the CpuidRegs structure

; that's defined in CpuidInfo.h

CpuidRegs struct

RegEAX dword ?

RegEBX dword ?

RegECX dword ?

RegEDX dword ?

CpuidRegs ends

; extern "C" uint32_t Cpuid_(uint32_t r_eax, uint32_t r_ecx, CpuidRegs* r_out);

;

; Returns: eax == 0 Unsupported CPUID leaf

; eax != 0 Supported CPUID leaf

;

; Note: the return code is valid only if r_eax <= MaxEAX.

.code

Cpuid_ proc frame

push rbx

.pushreg rbx

.endprolog

; Load eax and ecx

mov eax,ecx

mov ecx,edx

; Get cpuid info & save results

cpuid

mov dword ptr [r8+CpuidRegs.RegEAX],eax

mov dword ptr [r8+CpuidRegs.RegEBX],ebx

mov dword ptr [r8+CpuidRegs.RegECX],ecx

mov dword ptr [r8+CpuidRegs.RegEDX],edx

; Test for unsupported CPUID leaf

or eax,ebx

or ecx,edx

or eax,ecx ;eax = return code

pop rbx

ret

Cpuid_ endp

; extern "C" void Xgetbv_(uint32_t r_ecx, uint32_t* r_eax, uint32_t* r_edx);

Xgetbv_ proc

mov r9,rdx ;r9 = r_eax ptr

xgetbv

mov dword ptr [r9],eax ;save low word result

mov dword ptr [r8],edx ;save high word result

ret

Xgetbv_ endp

end

//------------------------------------------------

// Ch16_01.cpp

//------------------------------------------------

#include "stdafx.h"

#include <iostream>

#include <iomanip>

#include <string>

#include "CpuidInfo.h"

using namespace std;

static void DisplayCacheInfo(const CpuidInfo& ci);

static void DisplayFeatureFlags(const CpuidInfo& ci);

int main()

{

CpuidInfo ci;

ci.LoadInfo();

cout << ci.GetVendorId() << ' ';

cout << ci.GetProcessorBrand() << ' ';

DisplayCacheInfo(ci);

DisplayFeatureFlags(ci);

return 0;

}

static void DisplayCacheInfo(const CpuidInfo& ci)

{

const vector<CpuidInfo::CacheInfo>& cache_info = ci.GetCacheInfo();

for (const CpuidInfo::CacheInfo& x : cache_info)

{

uint32_t cache_size = x.GetSize();

string cache_size_str;

if (cache_size < 1024 * 1024)

{

cache_size /= 1024;

cache_size_str = "KB";

}

else

{

cache_size /= 1024 * 1024;

cache_size_str = "MB";

}

cout << "Cache L" << x.GetLevel() << ": ";

cout << cache_size << cache_size_str << ' ';

cout << x.GetTypeString() << ' ';

}

static void DisplayFeatureFlags(const CpuidInfo& ci)

{

const char nl = ' ';

cout << "----- CPUID Feature Flags -----" << nl;

cout << "ADX: " << ci.GetFF(CpuidInfo::FF::ADX) << nl;

cout << "AVX: " << ci.GetFF(CpuidInfo::FF::AVX) << nl;

cout << "AVX2: " << ci.GetFF(CpuidInfo::FF::AVX2) << nl;

cout << "AVX512F: " << ci.GetFF(CpuidInfo::FF::AVX512F) << nl;

cout << "AVX512BW: " << ci.GetFF(CpuidInfo::FF::AVX512BW) << nl;

cout << "AVX512CD: " << ci.GetFF(CpuidInfo::FF::AVX512CD) << nl;

cout << "AVX512DQ: " << ci.GetFF(CpuidInfo::FF::AVX512DQ) << nl;

cout << "AVX512ER: " << ci.GetFF(CpuidInfo::FF::AVX512ER) << nl;

cout << "AVX512PF: " << ci.GetFF(CpuidInfo::FF::AVX512PF) << nl;

cout << "AVX512VL: " << ci.GetFF(CpuidInfo::FF::AVX512VL) << nl;

cout << "AVX512_IFMA: " << ci.GetFF(CpuidInfo::FF::AVX512_IFMA) << nl;

cout << "AVX512_VBMI: " << ci.GetFF(CpuidInfo::FF::AVX512_VBMI) << nl;

cout << "BMI1: " << ci.GetFF(CpuidInfo::FF::BMI1) << nl;

cout << "BMI2: " << ci.GetFF(CpuidInfo::FF::BMI2) << nl;

cout << "F16C: " << ci.GetFF(CpuidInfo::FF::F16C) << nl;

cout << "FMA: " << ci.GetFF(CpuidInfo::FF::FMA) << nl;

cout << "LZCNT: " << ci.GetFF(CpuidInfo::FF::LZCNT) << nl;

cout << "POPCNT: " << ci.GetFF(CpuidInfo::FF::POPCNT) << nl;

}

//------------------------------------------------

// CpuidInfo.cpp

//------------------------------------------------

#include "stdafx.h"

#include <string>

#include <cstring>

#include <vector>

#include "CpuidInfo.h"

using namespace std;

void CpuidInfo::LoadInfo(void)

{

// Note: LoadInfo0 must be called first

LoadInfo0();

LoadInfo1();

LoadInfo2();

LoadInfo3();

LoadInfo4();

LoadInfo5();

}

void CpuidInfo::LoadInfo0(void)

{

CpuidRegs r1;

// Perform required initializations

Init();

// Get MaxEax and VendorID

Cpuid_(0, 0, &r1);

m_MaxEax = r1.EAX;

*(uint32_t *)(m_VendorId + 0) = r1.EBX;

*(uint32_t *)(m_VendorId + 4) = r1.EDX;

*(uint32_t *)(m_VendorId + 8) = r1.ECX;

m_VendorId[sizeof(m_VendorId) - 1] = '';

// Get MaxEaxExt

Cpuid_(0x80000000, 0, &r1);

m_MaxEaxExt = r1.EAX;

// Initialize processor brand string

InitProcessorBrand();

}

void CpuidInfo::LoadInfo1(void)

{

CpuidRegs r;

if (m_MaxEax < 1)

return;

Cpuid_(1, 0, &r);

// Decode r.ECX flags

// CPUID.(EAX=01H, ECX=00H):ECX.SSE3[bit 0]

if (r.ECX & (0x1 << 0))

m_FeatureFlags |= (uint64_t)FF::SSE3;

// CPUID.(EAX=01H, ECX=00H):ECX.PCLMULQDQ[bit 1]

if (r.ECX & (0x1 << 1))

m_FeatureFlags |= (uint64_t)FF::PCLMULQDQ;

// CPUID.(EAX=01H, ECX=00H):ECX.SSSE3[bit 9]

if (r.ECX & (0x1 << 9))

m_FeatureFlags |= (uint64_t)FF::SSSE3;

// CPUID.(EAX=01H, ECX=00H):ECX.SSE4.1[bit 19]

if (r.ECX & (0x1 << 19))

m_FeatureFlags |= (uint64_t)FF::SSE4_1;

// CPUID.(EAX=01H, ECX=00H):ECX.SSE4.2[bit 20]

if (r.ECX & (0x1 << 20))

m_FeatureFlags |= (uint64_t)FF::SSE4_2;

// CPUID.(EAX=01H, ECX=00H):ECX.MOVBE[bit 22]

if (r.ECX & (0x1 << 22))

m_FeatureFlags |= (uint64_t)FF::MOVBE;

// CPUID.(EAX=01H, ECX=00H):ECX.POPCNT[bit 23]

if (r.ECX & (0x1 << 23))

m_FeatureFlags |= (uint64_t)FF::POPCNT;

// CPUID.(EAX=01H, ECX=00H):ECX.RDRAND[bit 30]

if (r.ECX & (0x1 << 30))

m_FeatureFlags |= (uint64_t)FF::RDRAND;

// Decode r.RDX flags

// CPUID.(EAX=01H, ECX=00H):EDX.MMX[bit 23]

if (r.EDX & (0x1 << 23))

m_FeatureFlags |= (uint64_t)FF::MMX;

// CPUID.(EAX=01H, ECX=00H):EDX.FXSR[bit 24]

if (r.EDX & (0x1 << 24))

m_FeatureFlags |= (uint64_t)FF::FXSR;

// CPUID.(EAX=01H, ECX=00H):EDX.SSE[bit 25]

if (r.EDX & (0x1 << 25))

m_FeatureFlags |= (uint64_t)FF::SSE;

// CPUID.(EAX=01H, ECX=00H):EDX.SSE2[bit 26]

if (r.EDX & (0x1 << 26))

m_FeatureFlags |= (uint64_t)FF::SSE2;

}

void CpuidInfo::LoadInfo2(void)

{

CpuidRegs r;

if (m_MaxEax < 7)

return;

Cpuid_(7, 0, &r);

// CPUID.(EAX=07H, ECX=00H):ECX.PREFETCHWT1[bit 0]

if (r.ECX & (0x1 << 0))

m_FeatureFlags |= (uint64_t)FF::PREFETCHWT1;

// CPUID.(EAX=07H, ECX=00H):EBX.BMI1[bit 3]

if (r.EBX & (0x1 << 3))

m_FeatureFlags |= (uint64_t)FF::BMI1;

// CPUID.(EAX=07H, ECX=00H):EBX.BMI2[bit 8]

if (r.EBX & (0x1 << 8))

m_FeatureFlags |= (uint64_t)FF::BMI2;

// CPUID.(EAX=07H, ECX=00H):EBX.ERMSB[bit 9]

// ERMSB = Enhanced REP MOVSB/STOSB

if (r.EBX & (0x1 << 9))

m_FeatureFlags |= (uint64_t)FF::ERMSB;

// CPUID.(EAX=07H, ECX=00H):EBX.RDSEED[bit 18]

if (r.EBX & (0x1 << 18))

m_FeatureFlags |= (uint64_t)FF::RDSEED;

// CPUID.(EAX=07H, ECX=00H):EBX.ADX[bit 19]

if (r.EBX & (0x1 << 19))

m_FeatureFlags |= (uint64_t)FF::ADX;

// CPUID.(EAX=07H, ECX=00H):EBX.CLWB[bit 24]

if (r.EBX & (0x1 << 24))

m_FeatureFlags |= (uint64_t)FF::CLWB;

}

void CpuidInfo::LoadInfo3(void)

{

CpuidRegs r;

if (m_MaxEaxExt < 0x80000001)

return;

Cpuid_(0x80000001, 0, &r);

// CPUID.(EAX=80000001H, ECX=00H):ECX.LZCNT[bit 5]

if (r.ECX & (0x1 << 5))

m_FeatureFlags |= (uint64_t)FF::LZCNT;

// CPUID.(EAX=80000001H, ECX=00H):ECX.PREFETCHW[bit 8]

if (r.ECX & (0x1 << 8))

m_FeatureFlags |= (uint64_t)FF::PREFETCHW;

}

void CpuidInfo::LoadInfo4(void)

{

CpuidRegs r_eax01h;

CpuidRegs r_eax07h;

if (m_MaxEax < 7)

return;

Cpuid_(1, 0, &r_eax01h);

Cpuid_(7, 0, &r_eax07h);

// Test CPUID.(EAX=01H, ECX=00H):ECX.OSXSAVE[bit 27] to verify use of XGETBV

m_OsXsave = (r_eax01h.ECX & (0x1 << 27)) ? true : false;

if (m_OsXsave)

{

// Use XGETBV to obtain following information

// AVX state is enabled by OS if (XCR0[2:1] == '11b') is true

// AVX512 state is enabled by OS if (XCR0[7:5] == '111b') is true

uint32_t xgetbv_eax, xgetbv_edx;

Xgetbv_(0, &xgetbv_eax, &xgetbv_edx);

m_OsAvxState = (((xgetbv_eax >> 1) & 0x03) == 0x03) ? true : false;

if (m_OsAvxState)

{

// CPUID.(EAX=01H, ECX=00H):ECX.AVX[bit 28]

if (r_eax01h.ECX & (0x1 << 28))

{

m_FeatureFlags |= (uint64_t)FF::AVX;

// CPUID.(EAX=01H, ECX=00H):ECX.FMA[bit 12]

if (r_eax01h.ECX & (0x1 << 12))

m_FeatureFlags |= (uint64_t)FF::FMA;

// CPUID.(EAX=01H, ECX=00H):ECX.F16C[bit 29]

if (r_eax01h.ECX & (0x1 << 29))

m_FeatureFlags |= (uint64_t)FF::F16C;

// CPUID.(EAX=07H, ECX=00H):EBX.AVX2[bit 5]

if (r_eax07h.EBX & (0x1 << 5))

m_FeatureFlags |= (uint64_t)FF::AVX2;

m_OsAvx512State = (((xgetbv_eax >> 5) & 0x07) == 0x07) ? true : false;

if (m_OsAvx512State)

{

// CPUID.(EAX=07H, ECX=00H):EBX.AVX512F[bit 16]

if (r_eax07h.EBX & (0x1 << 16))

{

m_FeatureFlags |= (uint64_t)FF::AVX512F;

// Decode EBX flags

// CPUID.(EAX=07H, ECX=00H):EBX.AVX512DQ[bit 17]

if (r_eax07h.EBX & (0x1 << 17))

m_FeatureFlags |= (uint64_t)FF::AVX512DQ;

// CPUID.(EAX=07H, ECX=00H):EBX.AVX512_IFMA[bit 21]

if (r_eax07h.EBX & (0x1 << 21))

m_FeatureFlags |= (uint64_t)FF::AVX512_IFMA;

// CPUID.(EAX=07H, ECX=00H):EBX.AVX512PF[bit 26]

if (r_eax07h.EBX & (0x1 << 26))

m_FeatureFlags |= (uint64_t)FF::AVX512PF;

// CPUID.(EAX=07H, ECX=00H):EBX.AVX512ER[bit 27]

if (r_eax07h.EBX & (0x1 << 27))

m_FeatureFlags |= (uint64_t)FF::AVX512ER;

// CPUID.(EAX=07H, ECX=00H):EBX.AVX512CD[bit 28]

if (r_eax07h.EBX & (0x1 << 28))

m_FeatureFlags |= (uint64_t)FF::AVX512CD;

// CPUID.(EAX=07H, ECX=00H):EBX.AVX512BW[bit 30]

if (r_eax07h.EBX & (0x1 << 30))

m_FeatureFlags |= (uint64_t)FF::AVX512BW;

// CPUID.(EAX=07H, ECX=00H):EBX.AVX512VL[bit 31]

if (r_eax07h.EBX & (0x1 << 31))

m_FeatureFlags |= (uint64_t)FF::AVX512VL;

// Decode ECX flags

// CPUID.(EAX=07H, ECX=00H):ECX.AVX512_VBMI[bit 1]

if (r_eax07h.ECX & (0x1 << 1))

m_FeatureFlags |= (uint64_t)FF::AVX512_VBMI;

// CPUID.(EAX=07H, ECX=00H):ECX.AVX512_VBMI2[bit 6]

if (r_eax07h.ECX & (0x1 << 6))

m_FeatureFlags |= (uint64_t)FF::AVX512_VBMI2;

// CPUID.(EAX=07H, ECX=00H):ECX.AVX512_VNNI[bit 11]

if (r_eax07h.ECX & (0x1 << 11))

m_FeatureFlags |= (uint64_t)FF::AVX512_VNNI;

// CPUID.(EAX=07H, ECX=00H):ECX.AVX512_BITALG[bit 12]

if (r_eax07h.ECX & (0x1 << 12))

m_FeatureFlags |= (uint64_t)FF::AVX512_BITALG;

// CPUID.(EAX=07H, ECX=00H):ECX.AVX512_VPOPCNTDQ[bit 14]

if (r_eax07h.ECX & (0x1 << 14))

m_FeatureFlags |= (uint64_t)FF::AVX512_VPOPCNTDQ;

// Decode EDX flags

// CPUID.(EAX=07H, ECX=00H):EDX.AVX512_4FMAPS[bit 2]

if (r_eax07h.EDX & (0x1 << 2))

m_FeatureFlags |= (uint64_t)FF::AVX512_4FMAPS;

// CPUID.(EAX=07H, ECX=00H):EDX.AVX512_4VNNIW[bit 3]

if (r_eax07h.EDX & (0x1 << 3))

m_FeatureFlags |= (uint64_t)FF::AVX512_4VNNIW;

}

void CpuidInfo::LoadInfo5(void)

{

if (m_MaxEax < 4)

return;

bool done = false;

uint32_t index = 0;

while (!done)

{

CpuidRegs r;

Cpuid_(4, index, &r);

uint32_t cache_type = r.EAX & 0x1f;

uint32_t cache_level = ((r.EAX >> 5) & 0x3);

if (cache_type == 0)

done = true;

else

{

uint32_t ways = ((r.EBX >> 22) & 0x3ff) + 1;

uint32_t partitions = ((r.EBX >> 12) & 0x3ff) + 1;

uint32_t line_size = (r.EBX & 0xfff) + 1;

uint32_t sets = r.ECX + 1;

uint32_t cache_size = ways * partitions * line_size * sets;

CacheInfo ci(cache_level, cache_type, cache_size);

m_CacheInfo.push_back(ci);

index++;

}

void CpuidInfo::Init(void)

{

m_MaxEax = 0;

m_MaxEaxExt = 0;

m_FeatureFlags = 0;

m_OsXsave = false;

m_OsAvxState = false;

m_OsAvx512State = false;

m_VendorId[0] = '';

m_ProcessorBrand[0] = '';

m_CacheInfo.clear();

}

void CpuidInfo::InitProcessorBrand(void)

{

if (m_MaxEaxExt >= 0x80000004)

{

CpuidRegs r2, r3, r4;

char* p = m_ProcessorBrand;

Cpuid_(0x80000002, 0, &r2);

Cpuid_(0x80000003, 0, &r3);

Cpuid_(0x80000004, 0, &r4);

*(uint32_t *)(p + 0) = r2.EAX;

*(uint32_t *)(p + 4) = r2.EBX;

*(uint32_t *)(p + 8) = r2.ECX;

*(uint32_t *)(p + 12) = r2.EDX;

*(uint32_t *)(p + 16) = r3.EAX;

*(uint32_t *)(p + 20) = r3.EBX;

*(uint32_t *)(p + 24) = r3.ECX;

*(uint32_t *)(p + 28) = r3.EDX;

*(uint32_t *)(p + 32) = r4.EAX;

*(uint32_t *)(p + 36) = r4.EBX;

*(uint32_t *)(p + 40) = r4.ECX;

*(uint32_t *)(p + 44) = r4.EDX;

m_ProcessorBrand[sizeof(m_ProcessorBrand) - 1] = '';

}

else

strcpy_s(m_ProcessorBrand, "Unknown");

}

Listing 16-1.

Example Ch16_01

Before examining the source code, it will be helpful to have a basic understanding of the cupid instruction and how it works. Prior to using cpuid, a function must load register EAX with a “leaf” value that specifies what information the cpuid instruction should return. A second or “sub-leaf” value may also be required in register ECX. The cpuid instruction returns its results in registers EAX, EBX, ECX, and EDX. The calling function must then decode the values in these registers to ascertain processor support for specific features. As you will soon see, it is often necessary for a program to employ the cupid instruction multiple times. Most application programs typically use cupid during initialization and save the results for later use. The reason for this is that cupid is a serializing instruction, which means that it forces the processor to finish executing all previously fetched instructions and perform any pending memory writes before fetching the next instruction. In other words, the cupid instruction takes a long time to complete its execution.

Listing 16-1 begins with the header file CpuidInfo.h. Near the top of this file is a structure named CpuidRegs, which is used to save the results returned by cupid. Following CpuidRegs is a C++ class named CpuidInfo. This class contains the code and data that’s associated with cpuid instruction use. The public portion of CpuidInfo includes a subclass named CacheInfo. This class is employed to report information about a processor’s memory caches. Class CpuidInfo also includes an enum named FF. An application program can use this enum as an argument value with the member function CpuidInfo::GetFF to determine if the host processor supports a specific instruction set. You’ll see how this works later in this section. Toward the bottom of header file CpuidInfo.h are two declaration statements for the assembly language functions Cpuid_ and Xgetbv_. These functions execute the cpuid and xgetbv (Get Value of Extended Control Register) instructions, respectively.

Following CpuidInfo.h in Listing 16-1 is the source code file CpuidInfo_.asm. This file contains the assembly language functions Cpuid_ and Xgetbv_, which are simple wrapper functions for the x86 instructions cupid and xgetbv. The function Cpuid_ begins its execution by saving register RBX on the stack. It then loads argument values r_eax and r_ecx into registers EAX and ECX. The actual cupid instruction follows the loading of registers EAX and ECX. Following the execution of cpuid, the results in registers EAX, EBX, ECX, and EDX are saved to the specified CpuidRegs structure. The assembly language function Xgetbv_ executes the xgetbv instruction. This instruction loads the contents of the extended processor control register that’s specified by ECX into register pair EDX:EAX. The xgetbv instruction allows an application program to determine if the host operating system supports AVX, AVX2, or AVX-512, as explained later in this section.

The next file in Listing 16-1 is Ch16_01.cpp. The function main contains code illustrates how to use the C++ class CpuidInfo. The statement ci.LoadInfo() invokes the member function CpuidInfo::LoadInfo, which generates multiple executions of cpuid to obtain information about the processor. Note that CpuidInfo::LoadInfo is only called once. The function DisplayCacheInfo streams information about the processor’s memory caches to cout. This function invokes CpuidInfo::GetCacheInfo to report cache information that was obtained during execution of CpuidInfo::LoadInfo. The function DisplayFeatureFlags shows information about some of the instruction set extensions that the processor supports. Each cout statement in this function uses CpuidInfo::GetFF with a different CpuidInfo::FF value. The member function CpuidInfo::GetFF returns a single bool value that indicates whether the processor supports instruction set extension that’s specified by its argument value. Like the cache data, the processor instruction set extension information was obtained and saved during the call to CpuidInfo::LoadInfo. Note that CpuidInfo is structured to allow an application program to make multiple CpuidInfo::GetFF calls without triggering additional executions of cupid.

Following the file Ch16_01.cpp in Listing 16-1 is the source code for CpuidInfo.cpp, which contains the non-trivial member functions for class CpuidInfo. The member function CpuidInfo::LoadInfo that was discussed earlier invokes six private member functions that perform a multitude of cupid queries. The first of these functions, CpuidInfo::LoadInfo0, begins its execution by calling CpuidInfo::Init to carry out the requisite initializations. It then invokes the assembly language function Cpuid_ to obtain the maximum cpuid leaf value that’s supported by the processor and the processor vendor ID string. Another call to Cpuid_ is then used to obtain the maximum leaf value for extended cupid information. This is followed by a call to CpuidInfo::InitProcessorBrand, which uses several Cpuid_ calls to query and save the processor brand string. The source code for this function is located toward the end of the file CpuidInfo.cpp.

The member functions CpuidInfo::LoadInfo1, CpuidInfo::LoadInfo2, and CpuidInfo::LoadInfo3 also exploit Cpuid_ to ascertain processor support for a variety of instruction set extensions. The code that’s contained in these member functions is mostly brute-force decoding of the various Cpuid_ results. The AMD and Intel programming reference manuals contain additional information about the cpuid feature flag bits that are used to indicate processor support for a specific instruction set extension. The private member function CpuidInfo::LoadInfo4 contains the code that checks for AVX, AVX2, and AVX-512. This member function warrants closer examination.

An application program can use the computational resources of x86-AVX only if it’s supported by both the processor and its host operating system. The Xgetbv_ function can be employed to determine host operating system support. Before using Xgetbv_, the cpuid flag OSXSAVE must be tested to ensure that it’s safe for an application program to use the xgetbv instruction. If OSXSAVE is set to true, the function CpuidInfo::LoadInfo4 invokes Xgetbv_ to obtain information regarding OS support for x86-AVX state information (i.e., whether the OS properly preserves the XMM, YMM, and ZMM registers during a task switch). When using the Xgetbv_ function, the processor will generate an exception if the extended control register number is invalid or if the processor’s OSXSAVE flag is set to false. This explains why the software flag m_OsXsave is checked in prior to calling Xgetbv_. If the host operating system supports x86-AVX state information, the function CpuidInfo::LoadInfo4 proceeds to decode the cpuid feature flags related to AVX and AVX2. Note that the feature flags FMA and F16C are also tested here. The remaining code in CpuidInfo::LoadInfo4 decodes the cupid feature flags that signify support for the various AVX-512 instruction set extensions.

The final private member function that’s called by CpuidInfo::LoadInfo is named CpuidInfo::LoadInfo5. This member function uses cpuid and the class CpuidInfo::CacheInfo to save type and size information about the processor’s memory caches. The ancillary code for class CpuidInfo::CacheInfo is not shown in Listing 16-1 but included with the chapter download package. Here is the output for source code example Ch16_01:

GenuineIntel

Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz

Cache L1: 32KB Data

Cache L1: 32KB Instruction

Cache L2: 1MB Unified

Cache L3: 13MB Unified

----- CPUID Feature Flags -----

ADX: 1

AVX: 1

AVX2: 1

AVX512F: 1

AVX512BW: 1

AVX512CD: 1

AVX512DQ: 1

AVX512ER: 0

AVX512PF: 0

AVX512VL: 1

AVX512_IFMA: 0

AVX512_VBMI: 0

BMI1: 1

BMI2: 1

F16C: 1

FMA: 1

LZCNT: 1

POPCNT: 1

Table 16-1 shows a summary of cpuid information for several Intel processors. Before moving on to the next source code example, it should be noted that when using the cpuid instruction to determine processor support for the various AVX-512 instruction set extensions, it is often necessary for an application program to test multiple feature flags . For example, an application program must verify that the AVX512F, AVX512DQ, and AVX512VL feature flags are all set before using any AVX512DQ instructions with 256-bit or 128-bit wide operands. The Intel 64 and IA-32 Architectures Software Developer’s Manual (Volume 1) contains additional information regarding cpuid instruction use and feature flag testing.

Table 16-1.

Summary of Information from cpuid Instruction for Select Intel Processors

CPUID Feature	i3-2310m	i7-4790s	i9-7900x	i7-8700k
L1 Data (KB, per core)	32	32	32	32
L1 Instruction (KB, per core)	32	32	32	32
L2 Unified (KB, per core)	256	256	1024	256
L3 Unified (MB)	3	8	13	12
ADX	0	0	1	1
AVX	1	1	1	1
AVX2	0	1	1	1
AXV512F	0	0	1	0
AVX512BW	0	0	1	0
AVX512CD	0	0	1	0
AVX512DQ	0	0	1	0
AVX512ER	0	0	0	0
AVX512PF	0	0	0	0
AVX512VL	0	0	1	0
AVX512_IFMA	0	0	0	0
AVX512_VBMI	0	0	0	0
BMI1	0	1	1	1
BMI2	0	1	1	1
F16C	0	1	1	1
FMA	0	1	1	1
LZCNT	0	1	1	1
POPCNT	1	1	1	1

Non-Temporal Memory Stores

From the perspective of a memory cache, data can be classified as either temporal or non-temporal. Temporal data is any value that is accessed more than once within a short period of time. Examples of temporal data include the elements of an array or data structure that are referenced multiple times during execution of a program loop. It also includes the instruction bytes of a program. Non-temporal data is any value that is accessed once and not immediately reused. The destination arrays of many SIMD processing algorithms often contain non-temporal data. The differentiation between temporal and non-temporal data is important since processor performance often degrades if its memory caches contain excessive amounts of non-temporal data. This condition is commonly called cache pollution . Ideally, a processor’s memory caches contain only temporal data since it makes little sense to cache items that are only used once.

Listing 16-2 shows the source code for example Ch16_02. This example illustrates the use of the non-temporal store instruction vmovntps. It also compares the performance of this instruction to the standard vmovaps instruction.

//------------------------------------------------

// Ch16_02.cpp

//------------------------------------------------

#include "stdafx.h"

#include <iostream>

#include <iomanip>

#include <string>

#include <random>

#include "Ch16_02.h"

#include "AlignedMem.h"

using namespace std;

void Init(float* x, size_t n, unsigned int seed)

{

uniform_int_distribution<> ui_dist {1, 1000};

default_random_engine rng {seed};

for (size_t i = 0; i < n; i++)

x[i] = (float)ui_dist(rng);

}

bool CalcResultCpp(float* c, const float* a, const float* b, size_t n)

{

size_t align = 32;

if ((n == 0) || ((n & 0x0f) != 0))

return false;

if (!AlignedMem::IsAligned(a, align))

return false;

if (!AlignedMem::IsAligned(b, align))

return false;

if (!AlignedMem::IsAligned(b, align))

return false;

for (size_t i = 0; i < n; i++)

c[i] = sqrt(a[i] * a[i] + b[i] * b[i]);

return true;

}

void CompareResults(const float* c1, const float* c2a, const float*c2b, size_t n)

{

bool compare_ok = true;

const float epsilon = 1.0e-9f;

cout << fixed << setprecision(4);

for (size_t i = 0; i < n && compare_ok; i++)

{

bool b1 = fabs(c1[i] - c2a[i]) > epsilon;

bool b2 = fabs(c1[i] - c2b[i]) > epsilon;

cout << setw(2) << i << " - ";

cout << setw(10) << c1[i] << ' ';

cout << setw(10) << c2a[i] << ' ';

cout << setw(10) << c2b[i] << ' ';

if (b1 || b2)

compare_ok = false;

}

if (compare_ok)

cout << "Array compare OK ";

else

cout << "Array compare FAILED ";

}

void NonTemporalStore(void)

{

const size_t n = 16;

const size_t align = 32;

AlignedArray<float> a_aa(n, align);

AlignedArray<float> b_aa(n, align);

AlignedArray<float> c1_aa(n, align);

AlignedArray<float> c2a_aa(n, align);

AlignedArray<float> c2b_aa(n, align);

float* a = a_aa.Data();

float* b = b_aa.Data();

float* c1 = c1_aa.Data();

float* c2a = c2a_aa.Data();

float* c2b = c2b_aa.Data();

Init(a, n, 67);

Init(b, n, 79);

bool rc1 = CalcResultCpp(c1, a, b, n);

bool rc2 = CalcResultA_(c2a, a, b, n);

bool rc3 = CalcResultB_(c2b, a, b, n);

if (!rc1 || !rc2 || !rc3)

{

cout << "Invalid return code ";

cout << "rc1 = " << boolalpha << rc1 << ' ';

cout << "rc2 = " << boolalpha << rc2 << ' ';

cout << "rc3 = " << boolalpha << rc3 << ' ';

return;

}

cout << "Results for NonTemporalStore ";

CompareResults(c1, c2a, c2b, n);

}

int main()

{

NonTemporalStore();

NonTemporalStore_BM();

return 0;

}

;-------------------------------------------------

; Ch16_02.asm

;-------------------------------------------------

; _CalcResult Macro

;

; The following macro contains a simple calculating loop that is used

; to compare performance of the vmovaps and vmovntps instructions.

_CalcResult macro MovInstr

; Load and validate arguments

xor eax,eax ;set error code

test r9,r9

jz Done ;jump if n <= 0

test r9,0fh

jnz Done ;jump if (n % 16) != 0

test rcx,1fh

jnz Done ;jump if c is not aligned

test rdx,1fh

jnz Done ;jump if a is not aligned

test r8,1fh

jnz Done ;jump if b is not aligned

; Calculate c[i] = sqrt(a[i] * a[i] + b[i] * b[i])

align 16

@@: vmovaps ymm0,ymmword ptr [rdx+rax] ;ymm0 = a[i+7]:a[i]

vmovaps ymm1,ymmword ptr [r8+rax] ;ymm1 = b[i+7]:b[i]

vmulps ymm2,ymm0,ymm0 ;ymm2 = a[i] * a[i]

vmulps ymm3,ymm1,ymm1 ;ymm3 = b[i] * b[i]

vaddps ymm4,ymm2,ymm3 ;ymm4 = sum

vsqrtps ymm5,ymm4 ;ymm5 = final result

MovInstr ymmword ptr [rcx+rax],ymm5 ;save final values to c

vmovaps ymm0,ymmword ptr [rdx+rax+32] ;ymm0 = a[i+15]:a[i+8]

vmovaps ymm1,ymmword ptr [r8+rax+32] ;ymm1 = b[i+15]:b[i+8]

vmulps ymm2,ymm0,ymm0 ;ymm2 = a[i] * a[i]

vmulps ymm3,ymm1,ymm1 ;ymm3 = b[i] * b[i]

vaddps ymm4,ymm2,ymm3 ;ymm4 = sum

vsqrtps ymm5,ymm4 ;ymm5 = final result

MovInstr ymmword ptr [rcx+rax+32],ymm5 ;save final values to c

add rax,64 ;update offset

sub r9,16 ;update counter

jnz @B

mov eax,1 ;set success return code

Done: vzeroupper

ret

endm

; extern bool CalcResultA_(float* c, const float* a, const float* b, size_t n)

.code

CalcResultA_ proc

_CalcResult vmovaps

CalcResultA_ endp

; extern bool CalcResultB_(float* c, const float* a, const float* b, int n)

CalcResultB_ proc

_CalcResult vmovntps

CalcResultB_ endp

end

Listing 16-2.

Example Ch16_02

Near the top of Listing 16-2 is the C++ function CalcResultCpp. This function performs a simple arithmetic calculation using the elements of two single-precision floating-point source arrays. It then saves the result to a destination array. The next C++ function in Listing 16-2 is named CompareResults. This function verifies equivalence between the C++ and assembly language output arrays. The function NonTemporalStore allocates and initializes the test arrays. It then invokes the C++ and assembly language calculating functions. The output arrays of the three calculating functions are then compared for any discrepancies.

The assembly language code in Listing 16-2 begin with the definition of a macro named _CalcResult. This macro generates AVX instructions that perform the exact same calculation as the C++ function CalcResultCpp. The macro _CalcResult is used in the assembly language functions CalcResultA_ and CalcResultB_. Note that CalcResultA_ supplies the instruction vmovaps for the macro parameter MovInstr while CalcResultB_ provides vmovntaps. This means that the code executed by functions CalcResultA_ and CalcResultB_ is identical, except for the move instruction that saves results to the destination array. Here is the output for source code example Ch16_02;

Results for NonTemporalStore

0 - 240.8319 240.8319 240.8319

1 - 747.1814 747.1814 747.1814

2 - 285.1561 285.1561 285.1561

3 - 862.3062 862.3062 862.3062

4 - 604.8810 604.8810 604.8810

5 - 1102.4504 1102.4504 1102.4504

6 - 347.1441 347.1441 347.1441

7 - 471.8315 471.8315 471.8315

8 - 890.6739 890.6739 890.6739

9 - 729.0878 729.0878 729.0878

10 - 458.3536 458.3536 458.3536

11 - 639.8031 639.8031 639.8031

12 - 1053.1063 1053.1063 1053.1063

13 - 1016.0079 1016.0079 1016.0079

14 - 610.4507 610.4507 610.4507

15 - 1161.7935 1161.7935 1161.7935

Array compare OK

Running benchmark function NonTemporalStore_BM - please wait

Benchmark times save to file Ch16_02_NonTemporalStore_BM_CHROMIUM.csv

Table 16-2 shows benchmark timing measurements for source code example Ch16_02 using several different Intel processors. In this example, using a vmovntps instruction instead of a vmovaps instruction yielded notable performance improvements on all three computers. It should be noted that the x86’s non-temporal move instructions only provide a hint to the processor regarding memory use. They do not guarantee improved performance in all cases. Any performance gains are determined by the specific memory access pattern and the processor’s underlying microarchitecture.

Table 16-2.

Mean Execution Times (Microseconds) for Functions CalcResultCpp, CalcResultA_, and CalcResultB_ (n = 2,000,000)

CPU	CalcResultCpp	CalcResultA_(uses vmovaps)	CalcResultB_(uses vmovntps)
i7-4790s	1553	1554	1242
i9-7900x	1173	1139	934
i7-8700k	847	801	590

Data Prefetch

An application program can also use the prefetch (Prefetch Data Into Caches) instruction to improve the performance of certain algorithms. This instruction facilitates pre-loading of expected-use data into the processor’s cache hierarchy. There are two basic forms of the prefetch instruction. The first form, prefetcht[0|1|2], pre-loads temporal data into a specific cache level. The second form, prefetchnta, pre-loads non-temporal data while minimizing cache pollution. Both forms of the prefetch instruction provide hints to the processor about the data that a program expects to use; a processor may choose to perform the prefetch operation or ignore the hint.

The prefetch instructions are suitable for use with a variety of data structures, including large arrays and linked lists. A linked list is sequentially-ordered collection of nodes. Each node includes a data section and one or more pointers (or links) to its adjacent nodes. Figure 16-1 illustrates a simple linked list. Linked lists are useful since their size can grow or shrink (i.e., nodes can be added or deleted) depending on data storage requirements. One drawback of a linked list is that the nodes are usually not stored in a contiguously-allocated block of memory. This tends to increase access times when traversing the nodes a linked list.

../images/326959_2_En_16_Chapter/326959_2_En_16_Fig1_HTML.jpg — Figure 16-1.
Simple linked list

Source code example Ch16_03 illustrates how to perform linked list traversals both with and without the prefetchnta instruction. Listings 16-3 shows the C++ and assembly language source code for this example.

//------------------------------------------------

// Ch16_03.h

//------------------------------------------------

#pragma once

#include <cstdint>

// This structure must match the corresponding structure definition in Ch16_03.asmh

struct LlNode

{

double ValA[4];

double ValB[4];

double ValC[4];

double ValD[4];

uint8_t FreeSpace[376];

LlNode* Link;

};

// Ch16_03_Misc.cpp

extern bool LlCompare(int num_nodes, LlNode* l1, LlNode* l2, LlNode* l3, int* node_fail);

extern LlNode* LlCreate(int num_nodes);

extern void LlDelete(LlNode* p);

extern bool LlPrint(LlNode* p, const char* fn, const char* msg, bool append);

extern void LlTraverse(LlNode* p);

// Ch16_03_.asm

extern "C" void LlTraverseA_(LlNode* p);

extern "C" void LlTraverseB_(LlNode* p);

// Ch16_03_BM.cpp

extern void LinkedListPrefetch_BM(void);

//------------------------------------------------

// Ch16_03.cpp

//------------------------------------------------

#include "stdafx.h"

#include <iostream>

#include <cmath>

#include <random>

#include "Ch16_03.h"

#include "AlignedMem.h"

using namespace std;

void LinkedListPrefetch(void)

{

const int num_nodes = 8;

LlNode* list1 = LlCreate(num_nodes);

LlNode* list2a = LlCreate(num_nodes);

LlNode* list2b = LlCreate(num_nodes);

LlTraverse(list1);

LlTraverseA_(list2a);

LlTraverseB_(list2b);

int node_fail;

const char* fn = "Ch16_03_LinkedListPrefetchResults.txt";

cout << "Results for LinkedListPrefetch ";

if (LlCompare(num_nodes, list1, list2a, list2b, &node_fail))

cout << "Linked list compare OK ";

else

cout << "Linked list compare FAILED - node_fail = " << node_fail << ' ';

LlPrint(list1, fn, "----- list1 -----", 0);

LlPrint(list2a, fn, "----- list2a -----", 1);

LlPrint(list2b, fn, "----- list2b -----", 1);

cout << "Linked list results saved to file " << fn << ' ';

LlDelete(list1);

LlDelete(list2a);

LlDelete(list2b);

}

int main()

{

LinkedListPrefetch();

LinkedListPrefetch_BM();

return 0;

}

//------------------------------------------------

// Ch16_03_Misc.cpp

//------------------------------------------------

#include "stdafx.h"

#include <iostream>

#include <cmath>

#include <random>

#include "Ch16_03.h"

#include "AlignedMem.h"

using namespace std;

bool LlCompare(int num_nodes, LlNode* l1, LlNode* l2, LlNode* l3, int* node_fail)

{

const double epsilon = 1.0e-9;

for (int i = 0; i < num_nodes; i++)

{

*node_fail = i;

if ((l1 == nullptr) || (l2 == nullptr) || (l3 == nullptr))

return false;

for (int j = 0; j < 4; j++)

{

bool b12_c = fabs(l1->ValC[j] - l2->ValC[j]) > epsilon;

bool b13_c = fabs(l1->ValC[j] - l3->ValC[j]) > epsilon;

if (b12_c || b13_c)

return false;

bool b12_d = fabs(l1->ValD[j] - l2->ValD[j]) > epsilon;

bool b13_d = fabs(l1->ValD[j] - l3->ValD[j]) > epsilon;

if (b12_d || b13_d)

return false;

}

l1 = l1->Link;

l2 = l2->Link;

l3 = l3->Link;

}

*node_fail = -2;

if ((l1 != nullptr) || (l2 != nullptr) || (l3 != nullptr))

return false;

*node_fail = -1;

return true;

}

LlNode* LlCreate(int num_nodes)

{

const size_t align = 64;

const unsigned int seed = 83;

LlNode* first = nullptr;

LlNode* last = nullptr;

uniform_int_distribution<> ui_dist {1, 500};

default_random_engine rng {seed};

for (int i = 0; i < num_nodes; i++)

{

LlNode* p = (LlNode*)AlignedMem::Allocate(sizeof(LlNode), align);

p->Link = nullptr;

if (i == 0)

first = last = p;

else

{

last->Link = p;

last = p;

}

for (int j = 0; j < 4; j++)

{

p->ValA[j] = (double)ui_dist(rng);

p->ValB[j] = (double)ui_dist(rng);

p->ValC[j] = 0;

p->ValD[j] = 0;

}

return first;

}

void LlDelete(LlNode* p)

{

while (p != nullptr)

{

LlNode* q = p->Link;

AlignedMem::Release(p);

p = q;

}

bool LlPrint(LlNode* p, const char* fn, const char* msg, bool append)

{

FILE* fp;

const char* mode = (append) ? "at" : "wt";

if (fopen_s(&fp, fn, mode) != 0)

return false;

int i = 0;

const char* fs = "%14.4lf %14.4lf %14.4lf %14.4lf ";

if (msg != nullptr)

fprintf(fp, " %s ", msg);

while (p != nullptr)

{

fprintf(fp, " LlNode %d [0x%p] ", i, p);

fprintf(fp, " ValA: ");

fprintf(fp, fs, p->ValA[0], p->ValA[1], p->ValA[2], p->ValA[3]);

fprintf(fp, " ValB: ");

fprintf(fp, fs, p->ValB[0], p->ValB[1], p->ValB[2], p->ValB[3]);

fprintf(fp, " ValC: ");

fprintf(fp, fs, p->ValC[0], p->ValC[1], p->ValC[2], p->ValC[3]);

fprintf(fp, " ValD: ");

fprintf(fp, fs, p->ValD[0], p->ValD[1], p->ValD[2], p->ValD[3]);

i++;

p = p->Link;

}

fclose(fp);

return true;

}

void LlTraverse(LlNode* p)

{

while (p != nullptr)

{

for (int i = 0; i < 4; i++)

{

p->ValC[i] = sqrt(p->ValA[i] * p->ValA[i] + p->ValB[i] * p->ValB[i]);

p->ValD[i] = sqrt(p->ValA[i] / p->ValB[i] + p->ValB[i] / p->ValA[i]);

}

p = p->Link;

}

;-------------------------------------------------

; Ch16_03_.asmh

;-------------------------------------------------

; This structure must match the corresponding structure definition in Ch16_03.h

LlNode struct

ValA real8 4 dup(?)

ValB real8 4 dup(?)

ValC real8 4 dup(?)

ValD real8 4 dup(?)

FreeSpace byte 376 dup(?)

Link qword ?

LlNode ends

;-------------------------------------------------

; Ch16_03_.asm

;-------------------------------------------------

include <Ch16_03_.asmh>

; Macro _LlTraverse

;

; The following macro generates linked list traversal code using the

; prefetchnta instruction if UsePrefetch is equal to 'Y'.

_LlTraverse macro UsePrefetch

mov rax,rcx ;rax = ptr to 1st node

test rax,rax

jz Done ;jump if empty list

align 16

@@:: mov rcx,[rax+LlNode.Link] ;rcx = next node

vmovapd ymm0,ymmword ptr [rax+LlNode.ValA] ;ymm0 = ValA

vmovapd ymm1,ymmword ptr [rax+LLNode.ValB] ;ymm1 = ValB

IFIDNI <UsePrefetch>,<Y>

mov rdx,rcx

test rdx,rdx ;is there another node?

cmovz rdx,rax ;avoid prefetch of nullptr

prefetchnta [rdx] ;prefetch start of next node

ENDIF

; Calculate ValC[i] = sqrt(ValA[i] * ValA[i] + ValB[i] * ValB[i])

vmulpd ymm2,ymm0,ymm0 ;ymm2 = ValA * ValA

vmulpd ymm3,ymm1,ymm1 ;ymm3 = ValB * ValB

vaddpd ymm4,ymm2,ymm3 ;ymm4 = sums

vsqrtpd ymm5,ymm4 ;ymm5 = square roots

vmovntpd ymmword ptr [rax+LlNode.ValC],ymm5 ;save result

; Calculate ValD[i] = sqrt(ValA[i] / ValB[i] + ValB[i] / ValA[i]);

vdivpd ymm2,ymm0,ymm1 ;ymm2 = ValA / ValB

vdivpd ymm3,ymm1,ymm0 ;ymm3 = ValB / ValA

vaddpd ymm4,ymm2,ymm3 ;ymm4 = sums

vsqrtpd ymm5,ymm4 ;ymm5 = square roots

vmovntpd ymmword ptr [rax+LlNode.ValD],ymm5 ;save result

mov rax,rcx ;rax = ptr to next node

test rax,rax

jnz @B

Done: vzeroupper

ret

endm

; extern "C" void LlTraverseA_(LlNode* first);

.code

LlTraverseA_ proc

_LlTraverse n

LlTraverseA_ endp

; extern "C" void LlTraverseB_(LlNode* first);

LlTraverseB_ proc

_LlTraverse y

LlTraverseB_ endp

end

Listing 16-3.

Example Ch16_03

Listing 16-3 begins with the header file Ch16_03.h. The declaration of structure LlNode is located near the top of this file. The C++ code uses this structure to construct linked lists of test data. Structure members ValA through ValD hold the data values that are manipulated by the linked list traversal functions. The member FreeSpace is included to increase the size of LlNode for demonstration purposes since prefetching works best with larger data structures. A real-world implementation of LlNode could use this space for additional data items. The final member of LlNode is a pointer named Link, which points to the next LlNode structure. The assembly language counterpart of LlNode is declared in the file Ch16_03_.asmh.

The base function for source code example Ch16_03 is named LinkedListPrefetch and can be found in source code file Ch16_03.cpp. This function builds several test linked lists, invokes the C++ and assembly language traversal functions, and validates the results. The source code file Ch16_03_misc.cpp contains a set of miscellaneous functions that implement basic linked list processing operations. The function LlCompare compares the data nodes of its argument linked lists for equivalence. The functions LlCreate and LlDelete perform linked list allocation and deletion. LlPrint dumps the data contents of a linked list to a file. Finally, LlTraverse traverses a linked list and performs a simulated calculation using LlNode data elements ValA, ValB, ValC, and ValD.

Toward the top of the file Ch16_03_.asm is a macro named _LlTraverse. This macro generates code that performs the same linked list traversal and simulated calculation as the C++ function LlTraverse. The macro _LlTraverse requires one parameter that enables or disables data prefetching. If prefetching is enabled, the macro generates a short block code that includes the instruction prefetchnta [rdx]. This instruction directs the processor to prefetch the non-temporal data pointed to by register RDX. In this example, RDX points to the next LlNode in the linked list. The actual number of bytes fetched by this instruction varies depending on the underlying microarchitecture; Intel processors fetch a minimum of 32 bytes. It is extremely important to note that the prefetchnta [rdx] is positioned before the floating-point calculating instructions. Doing this gives the processor an opportunity to fetch the data for the next node while at the same time carrying out the arithmetic calculations for the current node. Also note that prior to execution of the prefetchnta [rdx] instruction, RDX is tested to avoid using prefetchnta with a nullptr (or zero) memory address since this degrades processor performance. This is important since nullptr is used as the end-of-list terminator value. The assembly language functions LlTraverseA_ and LlTraverseB_ use macro _LlTraverse to perform linked list traversals both without and with prefetching, respectively. Here is the output for source code example Ch16_03.

Results for LinkedListPrefetch

Linked list compare OK

Linked list results saved to file Ch16_03_LinkedListPrefetchResults.txt

Running benchmark function LinkedListPrefetch_BM - please wait

Benchmark times save to file Ch16_03_LinkedListPrefetch_BM_CHROMIUM.csv

Table 16-3 shows benchmark timing measurements for several Intel processors. It is important to keep in mind that any performance benefits provided by the prefetch instructions are highly dependent on current processor load, data access patterns, and the underlying microarchitecture. According to the Intel 64 and IA-32 Architectures Optimization Reference Manual, the data prefetch instructions are “implementation specific.” This means that to maximize prefetch performance, an algorithm must be “tuned to each implementation” or microarchitecture. You are encouraged to consult the aforementioned reference manual for additional information regarding use of the x86’s data prefetch instructions.

Table 16-3.

Mean Execution Times (Microseconds) for Linked List Traversal Functions (num_nodes = 50,000) .

CPU	LlTraverse(C++)	LlTraverseA_(without prefetchnta)	LlTraverseB_(with prefetchnta)
i7-4790s	5685	3093	2680
i9-7900x	5885	3064	2842
i7-8700k	5031	2384	2319

Multiple Threads

All the source code examples presented in this book thus far have shared one common characteristic: they all contain single-threaded code. The mere fact that you are reading this book probably means that you already know that most modern software applications utilize a least a few threads to better exploit the multiple cores of modern processors. For example, many high-performance computing applications frequently perform arithmetic calculations using large data arrays that contain millions of floating-point elements. One strategy that’s often employed to accelerate the performance of these types of calculations is to distribute the array elements across multiple threads and have each thread carry out a subset of the required calculations. The next source code example demonstrates how to perform an arithmetic calculation using large floating-point arrays and multiple threads. Listing 16-4 shows the source code for example Ch16_04.

Caution

A processor can become extremely hot while executing multithreaded code that makes extensive use of x86-AVX instructions. Before running the code of example Ch16_04, you should verify that the processor in your computer has an adequate cooling system.

//------------------------------------------------

// Ch16_04.h

//------------------------------------------------

#pragma once

#include <vector>

struct CalcInfo

{

double* m_X1;

double* m_X2;

double* m_Y1;

double* m_Y2;

double* m_Z1;

double* m_Z2;

double* m_Result;

size_t m_Index0;

size_t m_Index1;

int m_Status;

};

struct CoutInfo

{

bool m_ThreadMsgEnable;

size_t m_Iteration;

size_t m_NumElements;

size_t m_ThreadId;

size_t m_NumThreads;

};

// Ch16_04_Misc.cpp

extern size_t CompareResults(const double* a, const double* b, size_t n);

extern void DisplayThreadMsg(const CalcInfo* ci, const CoutInfo* cout_info, const char* msg);

extern void Init(double* a1, double* a2, size_t n, unsigned int seed);

std::vector<size_t> GetNumElementsVec(size_t* num_elements_max);

std::vector<size_t> GetNumThreadsVec(void);

// Ch16_04_WinApi.cpp

extern bool GetAvailableMemory(size_t* mem_size);

// Ch16_04_.asm

extern "C" void CalcResult_(CalcInfo* ci);

// Miscellaneous constants

const size_t c_ElementSize = sizeof(double);

const size_t c_NumArrays = 8; // Total number of allocated arrays

const size_t c_Align = 32; // Alignment boundary (update Ch16_04_.asm if changed)

const size_t c_BlockSize = 8; // Elements per iteration (update Ch16_04_.asm if changed)

//------------------------------------------------

// Ch16_04_Misc.cpp

//------------------------------------------------

#include "stdafx.h"

#include <iostream>

#include <random>

#include <memory.h>

#include <cmath>

#include <mutex>

#include <vector>

#include <algorithm>

#include "Ch16_04.h"

using namespace std;

void Init(double* a1, double* a2, size_t n, unsigned int seed)

{

uniform_int_distribution<> ui_dist {1, 2000};

default_random_engine rng {seed};

for (size_t i = 0; i < n; i++)

{

a1[i] = (double)ui_dist(rng);

a2[i] = (double)ui_dist(rng);

}

size_t CompareResults(const double* a, const double* b, size_t n)

{

if (memcmp(a, b, n * sizeof(double)) == 0)

return n;

const double epsilon = 1.0e-15;

for (size_t i = 0; i < n; i++)

{

if (fabs(a[i] - b[i]) > epsilon)

return i;

}

return n;

}

void DisplayThreadMsg(const CalcInfo* ci, const CoutInfo* cout_info, const char* msg)

{

static mutex mutex_cout;

static const char nl = ' ';

mutex_cout.lock();

cout << nl << msg << nl;

cout << " m_Iteration: " << cout_info->m_Iteration << nl;

cout << " m_NumElements: " << cout_info->m_NumElements << nl;

cout << " m_ThreadId: " << cout_info->m_ThreadId << nl;

cout << " m_NumThreads: " << cout_info->m_NumThreads << nl;

cout << " m_Index0: " << ci->m_Index0 << nl;

cout << " m_Index1: " << ci->m_Index1 << nl;

mutex_cout.unlock();

}

vector<size_t> GetNumElementsVec(size_t* num_elements_max)

{

// vector<size_t> ne_vec {64, 192, 384, 512}; // Requires 32GB + extra

vector<size_t> ne_vec {64, 128, 192, 256}; // Requires 16GB + extra

// vector<size_t> ne_vec {64, 96, 128, 160}; // Requires 10GB + extra

size_t mem_size_extra_gb = 2; // Used to avoid allocating all available mem

size_t ne_max = *std::max_element(ne_vec.begin(), ne_vec.end());

if ((ne_max % c_BlockSize) != 0)

throw runtime_error("ne_max must be an integer multiple of c_BlockSize");

size_t mem_size;

if (!GetAvailableMemory(&mem_size))

throw runtime_error ("GetAvailableMemory failed");

size_t mem_size_gb = mem_size / (1024 * 1024 * 1024);

size_t mem_size_min = ne_max * 1024 * 1024 * c_ElementSize * c_NumArrays;

size_t mem_size_min_gb = mem_size_min / (1024 * 1024 * 1024);

if (mem_size_gb < mem_size_min_gb + mem_size_extra_gb)

throw runtime_error ("Not enough available memory");

*num_elements_max = ne_max * 1024 * 1024;

return ne_vec;

}

vector<size_t> GetNumThreadsVec(void)

{

vector<size_t> num_threads_vec {1, 2, 4, 6, 8};

return num_threads_vec;

}

//------------------------------------------------

// Ch16_04.cpp

//------------------------------------------------

#include "stdafx.h"

#include <iostream>

#include <iomanip>

#include <sstream>

#include <stdexcept>

#include <thread>

#include <vector>

#include "Ch16_04.h"

#include "AlignedMem.h"

#include "BmThreadTimer.h"

using namespace std;

// Control flag for streaming thread status information to cout.

const bool c_ThreadMsgEnable = false;

void CalcResultCpp(CalcInfo* ci)

{

size_t al = c_Align;

size_t i0 = ci->m_Index0;

size_t i1 = ci->m_Index1;

size_t num_elements = i1 - i0 + 1;

ci->m_Status = 0;

if (num_elements == 0 || (num_elements % c_BlockSize) != 0)

return;

for (size_t i = i0; i <= i1; i++)

{

double xx = ci->m_X1[i] - ci->m_X2[i];

double yy = ci->m_Y1[i] - ci->m_Y2[i];

double zz = ci->m_Z1[i] - ci->m_Z2[i];

ci->m_Result[i] = sqrt(1.0 / sqrt(xx * xx + yy * yy + zz * zz));

}

ci->m_Status = 1;

}

static void CalcResultThread(CalcInfo* ci, CoutInfo* cout_info)

{

if (cout_info->m_ThreadMsgEnable)

DisplayThreadMsg(ci, cout_info, "ENTER CalcResultThread()");

CalcResult_(ci);

if (cout_info->m_ThreadMsgEnable)

DisplayThreadMsg(ci, cout_info, "EXIT CalcResultThread()");

}

void RunMultipleThreads(bool thread_msg_enable)

{

// Code section #1

size_t align = c_Align;

size_t num_elements_max;

vector<size_t> num_elements_vec = GetNumElementsVec(&num_elements_max);

vector<size_t> num_threads_vec = GetNumThreadsVec();

AlignedArray<double> x1_aa(num_elements_max, align);

AlignedArray<double> x2_aa(num_elements_max, align);

AlignedArray<double> y1_aa(num_elements_max, align);

AlignedArray<double> y2_aa(num_elements_max, align);

AlignedArray<double> z1_aa(num_elements_max, align);

AlignedArray<double> z2_aa(num_elements_max, align);

AlignedArray<double> result1_aa(num_elements_max, align);

AlignedArray<double> result2_aa(num_elements_max, align);

double* x1 = x1_aa.Data();

double* x2 = x2_aa.Data();

double* y1 = y1_aa.Data();

double* y2 = y2_aa.Data();

double* z1 = z1_aa.Data();

double* z2 = z2_aa.Data();

double* result1 = result1_aa.Data();

double* result2 = result2_aa.Data();

cout << "Begin initialization of test arrays ";

cout << " Initializing test arrays x1, x2 ";

Init(x1, x2, num_elements_max, 307);

cout << " Initializing test arrays y1, y2 ";

Init(y1, y2, num_elements_max, 401);

cout << " Initializing test arrays z1, z2 ";

Init(z1, z2, num_elements_max, 503);

cout << "Finished initialization of test arrays ";

CalcInfo ci1;

ci1.m_X1 = x1; ci1.m_X2 = x2;

ci1.m_Y1 = y1; ci1.m_Y2 = y2;

ci1.m_Z1 = z1; ci1.m_Z2 = z2;

ci1.m_Result = result1;

ci1.m_Index0 = 0;

ci1.m_Index1 = num_elements_max - 1;

ci1.m_Status = -1;

// CalcResultCpp used for verification purposes

cout << "Begin execution of CalcResultCpp ";

CalcResultCpp(&ci1);

cout << "Finished execution of CalcResultCpp ";

size_t iteration = 0;

const size_t block_size = c_BlockSize;

BmThreadTimer bmtt(num_elements_vec.size(), num_threads_vec.size());

// Code section #2

cout << "Begin execution of calculating threads ";

for (size_t i = 0; i < num_elements_vec.size(); i++)

{

size_t num_elements = num_elements_vec[i] * 1024 * 1024;

size_t num_blocks = num_elements / block_size;

size_t num_blocks_rem = num_elements % block_size;

if (num_blocks_rem != 0)

throw runtime_error("num_elements must be an integer multiple of block_size");

for (size_t j = 0; j < num_threads_vec.size(); j++)

{

size_t num_threads = num_threads_vec[j];

bmtt.Start(i, j);

size_t num_blocks_per_thread = num_blocks / num_threads;

size_t num_blocks_per_thread_rem = num_blocks % num_threads;

vector<CalcInfo> ci2(num_threads);

vector<CoutInfo> cout_info(num_threads);

vector<thread*> threads(num_threads);

// Thread start code

for (size_t k = 0; k < num_threads; k++)

{

ci2[k].m_X1 = x1; ci2[k].m_X2 = x2;

ci2[k].m_Y1 = y1; ci2[k].m_Y2 = y2;

ci2[k].m_Z1 = z1; ci2[k].m_Z2 = z2;

ci2[k].m_Result = result2;

ci2[k].m_Index0 = k * num_blocks_per_thread * block_size;

ci2[k].m_Index1 = (k + 1) * num_blocks_per_thread * block_size - 1;

ci2[k].m_Status = -1;

if ((k + 1) == num_threads)

ci2[k].m_Index1 += num_blocks_per_thread_rem * block_size;

cout_info[k].m_ThreadMsgEnable = thread_msg_enable;

cout_info[k].m_Iteration = iteration;

cout_info[k].m_NumElements = num_elements;

cout_info[k].m_NumThreads = num_threads;

cout_info[k].m_ThreadId = k;

threads[k] = new thread(CalcResultThread, &ci2[k], &cout_info[k]);

}

// Wait for all threads to complete

for (size_t k = 0; k < num_threads; k++)

threads[k]->join();

bmtt.Stop(i, j);

size_t cmp_index = CompareResults(result1, result2, num_elements);

if (cmp_index != num_elements)

{

ostringstream oss;

oss << " compare error detected at index " << cmp_index;

throw runtime_error(oss.str());

}

for (size_t k = 0; k < num_threads; k++)

{

if (ci2[k].m_Status != 1)

{

ostringstream oss;

oss << " invalid status code " << ci2[k].m_Status;

throw runtime_error(oss.str());

}

delete threads[k];

}

iteration++;

}

cout << "Finished execution of calculating threads ";

string fn = bmtt.BuildCsvFilenameString("Ch16_04_MultipleThreads_BM");

bmtt.SaveElapsedTimes(fn, BmThreadTimer::EtUnit::MilliSec, 0);

cout << "Benchmark times save to file " << fn << ' ';

}

int main()

{

try

{

RunMultipleThreads(c_ThreadMsgEnable);

}

catch (runtime_error& rte)

{

cout << "'runtime_error' exception has occurred - " << rte.what() << ' ';

}

catch (...)

{

cout << "Unexpected exception has occurred ";

}

return 0;

}

;-------------------------------------------------

; Ch16_04_.asm

;-------------------------------------------------

include <MacrosX86-64-AVX.asmh>

CalcInfo struct

X1 qword ?

X2 qword ?

Y1 qword ?

Y2 qword ?

Z1 qword ?

Z2 qword ?

Result qword ?

Index0 qword ?

Index1 qword ?

Status dword ?

CalcInfo ends

.const

r8_1p0 real8 1.0

; extern "C" void CalcResult_(CalcInfo* ci)

.code

CalcResult_ proc frame

_CreateFrame CR,0,16,r12,r13,r14,r15

_SaveXmmRegs xmm6

_EndProlog

mov dword ptr [rcx+CalcInfo.Status],0

; Make sure num_elements is valid

mov rax,[rcx+CalcInfo.Index0] ;rax = start index

mov rdx,[rcx+CalcInfo.Index1] ;rdx = stop index

sub rdx,rax

add rdx,1 ;rdx = num_elements

test rdx,rdx

jz Done ;jump if num_elements == 0

test rdx,7

jnz Done ;jump if num_elements % 8 != 0

; Make sure all arrays are properly aligned

mov r8d,1fh

mov r9,[rcx+CalcInfo.Result]

test r9,r8

jnz Done

mov r10,[rcx+CalcInfo.X1]

test r10,r8

jnz Done

mov r11,[rcx+CalcInfo.X2]

test r11,r8

jnz Done

mov r12,[rcx+CalcInfo.Y1]

test r12,r8

jnz Done

mov r13,[rcx+CalcInfo.Y2]

test r13,r8

jnz Done

mov r14,[rcx+CalcInfo.Z1]

test r14,r8

jnz Done

mov r15,[rcx+CalcInfo.Z2]

test r15,r8

jnz Done

vbroadcastsd ymm6,real8 ptr [r8_1p0] ;ymm6 = packed 1.0 (DPFP)

; Perform simulated calculation

align 16

LP1: vmovapd ymm0,ymmword ptr [r10+rax*8]

vmovapd ymm1,ymmword ptr [r12+rax*8]

vmovapd ymm2,ymmword ptr [r14+rax*8]

vsubpd ymm0,ymm0,ymmword ptr [r11+rax*8]

vsubpd ymm1,ymm1,ymmword ptr [r13+rax*8]

vsubpd ymm2,ymm2,ymmword ptr [r15+rax*8]

vmulpd ymm3,ymm0,ymm0

vmulpd ymm4,ymm1,ymm1

vmulpd ymm5,ymm2,ymm2

vaddpd ymm0,ymm3,ymm4

vaddpd ymm1,ymm0,ymm5

vsqrtpd ymm2,ymm1

vdivpd ymm3,ymm6,ymm2

vsqrtpd ymm4,ymm3

vmovntpd ymmword ptr [r9+rax*8],ymm4

add rax,4

vmovapd ymm0,ymmword ptr [r10+rax*8]

vmovapd ymm1,ymmword ptr [r12+rax*8]

vmovapd ymm2,ymmword ptr [r14+rax*8]

vsubpd ymm0,ymm0,ymmword ptr [r11+rax*8]

vsubpd ymm1,ymm1,ymmword ptr [r13+rax*8]

vsubpd ymm2,ymm2,ymmword ptr [r15+rax*8]

vmulpd ymm3,ymm0,ymm0

vmulpd ymm4,ymm1,ymm1

vmulpd ymm5,ymm2,ymm2

vaddpd ymm0,ymm3,ymm4

vaddpd ymm1,ymm0,ymm5

vsqrtpd ymm2,ymm1

vdivpd ymm3,ymm6,ymm2

vsqrtpd ymm4,ymm3

vmovntpd ymmword ptr [r9+rax*8],ymm4

add rax,4

sub rdx,8

jnz LP1

mov dword ptr [rcx+CalcInfo.Status],1

Done: vzeroupper

_RestoreXmmRegs xmm6

_DeleteFrame r12,r13,r14,r15

ret

CalcResult_ endp

end

Listing 16-4.

Example Ch16_04

Source code example Ch16_04 performs a simulated calculation using large arrays of double-precision floating-point values across multiple threads. It uses the C++ STL class thread to run multiple instances of an AVX2 assembly language calculating function. Each thread performs its calculations using only a portion of the array data. The C++ driver routine exercises the calculating algorithm using various combinations of array sizes and simultaneously executing threads. It also implements benchmark timing measurements to quantify the performance benefits of the multithreaded technique.

Listing 16-4 begins the header file Ch16_04.h. In this file, the structure CalcInfo contains the data that each thread needs to carry out its calculations. The structure members m_X1, m_X2, m_Y1, m_Y2, m_Z2, and m_Z2, point to the source arrays, while m_Result points to the destination array. Members m_Index0 and m_Index1 are array indices that define a range of unique elements for each calculating thread. Header file Ch16_04.h also includes a structure named CoutInfo, which contains status information that’s optionally displayed during program execution.

The next file in Listing 16-4 is Ch16_04_Misc.cpp. This file contains the source code for the program’s ancillary functions. The functions Init and CompareResults perform array initialization and verification, respectively. The next function is DisplayThreadMsg. This function displays status information for each executing thread. Note that the cout statements in DisplayThreadMsg are synchronized with a C++ STL mutex . A mutex is a synchronization object that facilitates controlled access to a single resource by multiple threads. When mutex_cout is locked, only one thread can stream its status results to cout. The other executing threads are blocked from streaming their results to cout until mutex_cout becomes unlocked. Without this mutex, status information text from the executing threads would be intermingled on the display (if you’re interested in seeing what happens, try commenting out the mutex_cout.lock and mutext_cout.unlock statements).

The function GetNumElementsVec returns a vector that contains the sizes of the test arrays. Note that the amount of memory required by the largest test array should be less than the amount of available memory plus a small fudge factor. The fudge factor prevents the program from allocating all available memory. Also note that GetNumElementsVec throws an exception if insufficient memory is available since running the program in this manner is very slow due to the amount of page swapping that occurs. The function GetNumThreadsVec returns a vector of test thread counts. You can change the values in num_threads_vec to experiment with different thread count values.

The source code file Ch16_04.cpp contains the driver routines for source code example Ch16_04. The function CalcResultCpp is a C++ implementation of the simulated calculating algorithm and is used for result verification purposes. The next function, CalcResultThread, is the main thread function. This function invokes the assembly language calculating function CalcResult_. It also displays thread status messages if they’re enabled.

Following CalcResultThread is the function RunMultipleThreads. This function exercises the calculating algorithm using the specified combinations of array sizes and number of simultaneously executing threads. The first code section of RunMultipleThreads performs array allocation and element initialization. It also calls the function CalcResultCpp to calculate results values for algorithm verification purposes. Note that prior to calling this function, an instance of CalcInfo is initialized with the data that’s necessary to carry out the required calculations.

The second code section of RunMultipleThreads, which starts immediate after the comment line Code section #2, runs the calculating algorithm by distributing the test array elements across multiple threads. The test array elements are spilt into groups based on the number of threads that will execute. For example, if the number of test array elements equals 64 million, launching four threads will result in each thread processing 16 million elements. The inner most for loop that follows the comment line Thread start code begins each iteration by initializing an instance of CalcInfo for the next thread. The statement threads[k] = new thread(CalcResultThread, &ci2[k], &cout_info[k]) constructs a new thread object and starts execution of the thread function CalcResultThread using argument values &ci2[k] and &cout_info[k]. This inner for loop repeats until the required number of executing thread have been launched. While the threads are executing, the function RunMultipleThreads executes a small for loop that invokes the function thread::join. This effectively forces RunMultipleThreads to wait until all executing threads have finished. The remaining code in RunMultipleThreads performs data verification and object cleanup.

Before reviewing the assembly language code, the C++ code in RunMultipleThreads merits a few additional comments. The first thing to note is that the benchmarking code measures both the time it takes to carry out the required calculations and the overhead that’s associated with thread management. If the algorithm that’s employed by RunMultipleThreads were to be used in a real-world application, any benchmark timing measurements would be meaningless without factoring in this overhead. It should also be noted that RunMultipleThreads implements an extremely rudimentary form of multithreading that omits many important real-world operations to simplify the code for this example. If you’re interested in learning more about the C++ STL thread class and the other STL classes that facilitate multithreaded processing, you are strongly encouraged to consult the references listed in Appendix A.

The final file in Listing 16-4 is Ch16_04_.asm. Near the top of this file is the assembly language counterpart of the data structure CalcInfo. The assembly language function CalcResult_ is next. Following its prolog, CalcResult_ uses the instructions mov rax,[rcx+CalcInfo.Index0] and mov rdx,[rcx+CalcInfo.Index1] to load registers RAX and RDX with the indices of the first and last array elements. It then calculates and validates num_elements. Next, the test arrays are validated for proper alignment. The processing loop in CalcResult_ uses AVX packed double-precision floating-point arithmetic to carry out the same simulated calculation as CalcResultCpp. The output for source code example Ch16_04 follows this paragraph. Note that this output was produced with c_ThreadMsgEnable set to false. Setting this flag to true ultimately directs the function CalcResultThread to display status messages for each executing thread. The flag c_ThreadMsgEnable is defined near the top of Ch16_04.cpp.

Begin initialization of test arrays

Initializing test arrays x1, x2

Initializing test arrays y1, y2

Initializing test arrays z1, z2

Finished initialization of test arrays

Begin execution of CalcResultCpp

Finished execution of CalcResultCpp

Begin execution of calculating threads

Finished execution of calculating threads

Benchmark times save to file Ch16_04_MultipleThreads_BM_CHROMIUM.csv

Tables 16-4, 16-5, and 16-6 contain the benchmark timing measurements for source code example Ch16_04. The measurements shown in these tables are the mean execution times from 10 separate runs and were made with 32 GB of SDRAM installed in each test computer. All three test computers show a significant improvement in performance when multiple threads are used to carry out the simulated calculation. For the i7-4900s and i7-8700k test computers, optimal performance is attained using four threads. The i9-7900x test computer shows meaningful performance gains when using six or eight threads. It is interesting to compare the timing measurements for the i9-7900x and i7-8700k systems. When using one or two threads, the i7-8700k out performs the i9-7900x while the opposite is true when four or more threads are employed. These measurements make sense when considering the hardware differences between the test processors, which are shown in Table 16-7. Even though the i7-8700k employs higher-clock frequencies, the extra memory channels of the i9-7900x enable it to better utilize its CPU cores and complete the required calculations more quickly.

Table 16-4.

Benchmark Timing Measurements (Milliseconds) for RunMultipleThreads Using an Intel i7-4790s Processor

	Number of Threads
Number of Elements (Millions)	1	2	4	6	8
64	686	226	178	180	172
128	1146	491	345	355	347
192	1592	702	513	530	516
256	2054	942	679	714	688

Table 16-5.

Benchmark Timing Measurements (Milliseconds) for RunMultipleThreads Using an Intel i9-7900x Processor

	Number of Threads
Number of Elements (Millions)	1	2	4	6	8
64	492	137	84	69	61
128	765	300	163	131	121
192	1110	454	233	193	178
256	1330	582	313	260	238

Table 16-6.

Benchmark Timing Measurements (Milliseconds) for RunMultipleThreads Using an Intel i7-8700k Processor

	Number of Threads
Number of Elements (Millions)	1	2	4	6	8
64	332	125	120	123	123
128	522	265	240	245	246
192	839	387	363	366	369
256	919	499	478	484	492

Table 16-7.

Summary of Hardware Features for Test Processors Used in Example Ch16_04

Hardware Feature	i7-4790s	i9-7900x	i7-8700k
Number of cores	4	10	6
Number of threads	8	20	12
Base frequency (GHz)	3.2	3.3	3.7
Maximum frequency (GHz)	4.0	4.5	4.7
Memory type	DDR3-1600	DDR4-2666	DDR4-2666
Number of memory channels	2	4	2

Summary

Here are the key learning points for Chapter 16:

An application program should always use the cpuid instruction to verify processor support for specific instruction set extensions. This is extremely important for software compatibility with future processors from both AMD and Intel.
An assembly language function can use the non-temporal store instructions vmovntp[d|s] instead of the vmovap[d|s] instructions to improve the performance of algorithms that carry out calculations using large arrays of non-temporal floating-point data.
An assembly language function can use the prefetch[0|1|2] instructions to pre-load temporal data into the processor’s cache hierarchy. A function can also use the prefetchnta instruction to pre-load non-temporal data and minimize cache pollution. The performance benefits of the prefetch instructions vary depending on data access patterns and the processor’s underlying microarchitecture.
A multithreaded algorithm that’s implemented in a high-level language such as C++ can exploit AVX, AVX2, or AVX-512 assembly language calculating functions to accelerate an algorithm’s overall performance.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for 16. Advanced Programming

Create new playlist

Sign In

Sign Up

16. Advanced Programming

CPUID Instruction

Non-Temporal Memory Stores

Data Prefetch

Multiple Threads

Caution

Summary

Table of Contents for
16. Advanced Programming