© Daniel Kusswurm 2018
Daniel Kusswurm, Modern X86 Assembly Language Programming, https://doi.org/10.1007/978-1-4842-4063-2_16

16. Advanced Programming

Daniel Kusswurm (Geneva, IL, USA)

The final chapter of this book reviews several source code examples that demonstrate advanced x86 assembly language programming techniques. The first example explains how to use the cpuid instruction to detect specific x86 instruction set extensions. This is followed by two examples that illustrate how to accelerate SIMD processing functions using non-temporal memory stores and data prefetch instructions. The concluding example elucidates the use of an assembly language calculating function in a multithreaded application.

CPUID Instruction

It’s been mentioned several times in this book that an application program should never assume a specific instruction set extension such as AVX, AVX2, or AVX-512 is available simply because it knows the processor’s microarchitecture, model number, or brand name. An application program should always test for the presence of an instruction set extension using the cpuid (CPU Identification) instruction. Application programs can use this instruction to verify that a processor supports one of the previously mentioned x86-AVX instruction set extensions. The cpuid instruction can also be used to obtain additional processor feature information that’s useful or needed in both application programs and operating system software.
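As a minimal sketch of such a runtime test (assuming GCC or Clang on an x86 processor; the <cpuid.h> header and the __get_cpuid helper are compiler-specific conveniences and do not appear in this chapter's listings):

```cpp
#include <cpuid.h>    // GCC/Clang CPUID helper (compiler-specific)
#include <cstring>
#include <string>

// Return the 12-character vendor ID string reported by CPUID leaf 0,
// or an empty string if the query fails.
std::string GetVendorId()
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return std::string();
    char id[13];
    std::memcpy(id + 0, &ebx, 4);   // vendor bytes arrive in EBX, EDX, ECX order
    std::memcpy(id + 4, &edx, 4);
    std::memcpy(id + 8, &ecx, 4);
    id[12] = '\0';
    return std::string(id);
}
```

On an Intel processor this returns "GenuineIntel"; on an AMD processor, "AuthenticAMD".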

Listing 16-1 shows the source code for example Ch16_01, which demonstrates how to use the cpuid instruction to determine processor support for various instruction set extensions. The source code focuses on using cpuid to detect architectural features and instruction set extensions that are relevant to this book’s content. If you’re interested in learning how to use the cpuid instruction to identify other processor features, you should consult the AMD and Intel reference manuals that are listed in Appendix A.
//------------------------------------------------
//        CpuidInfo.h
//------------------------------------------------
#pragma once
#include <cstdint>
#include <vector>
#include <string>
struct CpuidRegs
{
  uint32_t EAX;
  uint32_t EBX;
  uint32_t ECX;
  uint32_t EDX;
};
class CpuidInfo
{
public:
  class CacheInfo
  {
  public:
    enum class Type
    {
      Unknown, Data, Instruction, Unified
    };
  private:
    uint32_t m_Level = 0;
    Type m_Type = Type::Unknown;
    uint32_t m_Size = 0;
  public:
    uint32_t GetLevel(void) const       { return m_Level; }
    uint32_t GetSize(void) const       { return m_Size; }
    Type GetType(void) const         { return m_Type; }
    // These are defined in CacheInfo.cpp
    CacheInfo(uint32_t level, uint32_t type, uint32_t size);
    std::string GetTypeString(void) const;
  };
private:
  uint32_t m_MaxEax;               // Max EAX for basic CPUID
  uint32_t m_MaxEaxExt;              // Max EAX for extended CPUID
  uint64_t m_FeatureFlags;            // Processor feature flags
  std::vector<CpuidInfo::CacheInfo> m_CacheInfo; // Processor cache information
  char m_VendorId[13];              // Processor vendor ID string
  char m_ProcessorBrand[49];           // Processor brand string
  bool m_OsXsave;                 // XSAVE is enabled for app use
  bool m_OsAvxState;               // AVX state is enabled by OS
  bool m_OsAvx512State;              // AVX-512 state is enabled by OS
  void Init(void);
  void InitProcessorBrand(void);
  void LoadInfo0(void);
  void LoadInfo1(void);
  void LoadInfo2(void);
  void LoadInfo3(void);
  void LoadInfo4(void);
  void LoadInfo5(void);
public:
  enum class FF : uint64_t
  {
    FXSR        = (uint64_t)1 << 0,
    MMX         = (uint64_t)1 << 1,
    MOVBE        = (uint64_t)1 << 2,
    SSE         = (uint64_t)1 << 3,
    SSE2        = (uint64_t)1 << 4,
    SSE3        = (uint64_t)1 << 5,
    SSSE3        = (uint64_t)1 << 6,
    SSE4_1       = (uint64_t)1 << 7,
    SSE4_2       = (uint64_t)1 << 8,
    PCLMULQDQ      = (uint64_t)1 << 9,
    POPCNT       = (uint64_t)1 << 10,
    PREFETCHW      = (uint64_t)1 << 11,
    PREFETCHWT1     = (uint64_t)1 << 12,
    RDRAND       = (uint64_t)1 << 13,
    RDSEED       = (uint64_t)1 << 14,
    ERMSB        = (uint64_t)1 << 15,
    AVX         = (uint64_t)1 << 16,
    AVX2        = (uint64_t)1 << 17,
    F16C        = (uint64_t)1 << 18,
    FMA         = (uint64_t)1 << 19,
    BMI1        = (uint64_t)1 << 20,
    BMI2        = (uint64_t)1 << 21,
    LZCNT        = (uint64_t)1 << 22,
    ADX         = (uint64_t)1 << 23,
    AVX512F       = (uint64_t)1 << 24,
    AVX512ER      = (uint64_t)1 << 25,
    AVX512PF      = (uint64_t)1 << 26,
    AVX512DQ      = (uint64_t)1 << 27,
    AVX512CD      = (uint64_t)1 << 28,
    AVX512BW      = (uint64_t)1 << 29,
    AVX512VL      = (uint64_t)1 << 30,
    AVX512_IFMA     = (uint64_t)1 << 31,
    AVX512_VBMI     = (uint64_t)1 << 32,
    AVX512_4FMAPS    = (uint64_t)1 << 33,
    AVX512_4VNNIW    = (uint64_t)1 << 34,
    AVX512_VPOPCNTDQ  = (uint64_t)1 << 35,
    AVX512_VNNI     = (uint64_t)1 << 36,
    AVX512_VBMI2    = (uint64_t)1 << 37,
    AVX512_BITALG    = (uint64_t)1 << 38,
    CLWB        = (uint64_t)1 << 39,
  };
  CpuidInfo(void) { Init(); };
  ~CpuidInfo() {};
  const std::vector<CpuidInfo::CacheInfo>& GetCacheInfo(void) const
  {
    return m_CacheInfo;
  }
  bool GetFF(FF flag) const
  {
    return ((m_FeatureFlags & (uint64_t)flag) != 0) ? true : false;
  }
  std::string GetProcessorBrand(void) const  { return std::string(m_ProcessorBrand); }
  std::string GetVendorId(void) const     { return std::string(m_VendorId); }
  void LoadInfo(void);
};
// CpuidInfo_.asm
extern "C" void Xgetbv_(uint32_t r_ecx, uint32_t* r_eax, uint32_t* r_edx);
extern "C" uint32_t Cpuid_(uint32_t r_eax, uint32_t r_ecx, CpuidRegs* r_out);
;-------------------------------------------------
;        CpuidInfo_.asm
;-------------------------------------------------
; The following structures must agree with the CpuidRegs structure
; that's defined in CpuidInfo.h
CpuidRegs  struct
RegEAX   dword ?
RegEBX   dword ?
RegECX   dword ?
RegEDX   dword ?
CpuidRegs  ends
; extern "C" uint32_t Cpuid_(uint32_t r_eax, uint32_t r_ecx, CpuidRegs* r_out);
;
; Returns:   eax == 0   Unsupported CPUID leaf
;        eax != 0   Supported CPUID leaf
;
;        Note: the return code is valid only if r_eax <= MaxEAX.
    .code
Cpuid_ proc frame
    push rbx
    .pushreg rbx
    .endprolog
; Load eax and ecx
    mov eax,ecx
    mov ecx,edx
; Get cpuid info & save results
    cpuid
    mov dword ptr [r8+CpuidRegs.RegEAX],eax
    mov dword ptr [r8+CpuidRegs.RegEBX],ebx
    mov dword ptr [r8+CpuidRegs.RegECX],ecx
    mov dword ptr [r8+CpuidRegs.RegEDX],edx
; Test for unsupported CPUID leaf
    or eax,ebx
    or ecx,edx
    or eax,ecx             ;eax = return code
    pop rbx
    ret
Cpuid_ endp
; extern "C" void Xgetbv_(uint32_t r_ecx, uint32_t* r_eax, uint32_t* r_edx);
Xgetbv_ proc
    mov r9,rdx             ;r9 = r_eax ptr
    xgetbv
    mov dword ptr [r9],eax       ;save low word result
    mov dword ptr [r8],edx       ;save high word result
    ret
Xgetbv_ endp
    end
//------------------------------------------------
//        Ch16_01.cpp
//------------------------------------------------
#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include <string>
#include "CpuidInfo.h"
using namespace std;
static void DisplayCacheInfo(const CpuidInfo& ci);
static void DisplayFeatureFlags(const CpuidInfo& ci);
int main()
{
  CpuidInfo ci;
  ci.LoadInfo();
  cout << ci.GetVendorId() << '\n';
  cout << ci.GetProcessorBrand() << '\n';
  DisplayCacheInfo(ci);
  DisplayFeatureFlags(ci);
  return 0;
}
static void DisplayCacheInfo(const CpuidInfo& ci)
{
  const vector<CpuidInfo::CacheInfo>& cache_info = ci.GetCacheInfo();
  for (const CpuidInfo::CacheInfo& x : cache_info)
  {
    uint32_t cache_size = x.GetSize();
    string cache_size_str;
    if (cache_size < 1024 * 1024)
    {
      cache_size /= 1024;
      cache_size_str = "KB";
    }
    else
    {
      cache_size /= 1024 * 1024;
      cache_size_str = "MB";
    }
    cout << "Cache L" << x.GetLevel() << ": ";
    cout << cache_size << cache_size_str << ' ';
    cout << x.GetTypeString() << '\n';
  }
}
static void DisplayFeatureFlags(const CpuidInfo& ci)
{
  const char nl = '\n';
  cout << "----- CPUID Feature Flags -----" << nl;
  cout << "ADX:     " << ci.GetFF(CpuidInfo::FF::ADX) << nl;
  cout << "AVX:     " << ci.GetFF(CpuidInfo::FF::AVX) << nl;
  cout << "AVX2:    " << ci.GetFF(CpuidInfo::FF::AVX2) << nl;
  cout << "AVX512F:   " << ci.GetFF(CpuidInfo::FF::AVX512F) << nl;
  cout << "AVX512BW:  " << ci.GetFF(CpuidInfo::FF::AVX512BW) << nl;
  cout << "AVX512CD:  " << ci.GetFF(CpuidInfo::FF::AVX512CD) << nl;
  cout << "AVX512DQ:  " << ci.GetFF(CpuidInfo::FF::AVX512DQ) << nl;
  cout << "AVX512ER:  " << ci.GetFF(CpuidInfo::FF::AVX512ER) << nl;
  cout << "AVX512PF:  " << ci.GetFF(CpuidInfo::FF::AVX512PF) << nl;
  cout << "AVX512VL:  " << ci.GetFF(CpuidInfo::FF::AVX512VL) << nl;
  cout << "AVX512_IFMA: " << ci.GetFF(CpuidInfo::FF::AVX512_IFMA) << nl;
  cout << "AVX512_VBMI: " << ci.GetFF(CpuidInfo::FF::AVX512_VBMI) << nl;
  cout << "BMI1:    " << ci.GetFF(CpuidInfo::FF::BMI1) << nl;
  cout << "BMI2:    " << ci.GetFF(CpuidInfo::FF::BMI2) << nl;
  cout << "F16C:    " << ci.GetFF(CpuidInfo::FF::F16C) << nl;
  cout << "FMA:     " << ci.GetFF(CpuidInfo::FF::FMA) << nl;
  cout << "LZCNT:    " << ci.GetFF(CpuidInfo::FF::LZCNT) << nl;
  cout << "POPCNT:   " << ci.GetFF(CpuidInfo::FF::POPCNT) << nl;
}
//------------------------------------------------
//        CpuidInfo.cpp
//------------------------------------------------
#include "stdafx.h"
#include <string>
#include <cstring>
#include <vector>
#include "CpuidInfo.h"
using namespace std;
void CpuidInfo::LoadInfo(void)
{
  // Note: LoadInfo0 must be called first
  LoadInfo0();
  LoadInfo1();
  LoadInfo2();
  LoadInfo3();
  LoadInfo4();
  LoadInfo5();
}
void CpuidInfo::LoadInfo0(void)
{
  CpuidRegs r1;
  // Perform required initializations
  Init();
  // Get MaxEax and VendorID
  Cpuid_(0, 0, &r1);
  m_MaxEax = r1.EAX;
  *(uint32_t *)(m_VendorId + 0) = r1.EBX;
  *(uint32_t *)(m_VendorId + 4) = r1.EDX;
  *(uint32_t *)(m_VendorId + 8) = r1.ECX;
  m_VendorId[sizeof(m_VendorId) - 1] = '\0';
  // Get MaxEaxExt
  Cpuid_(0x80000000, 0, &r1);
  m_MaxEaxExt = r1.EAX;
  // Initialize processor brand string
  InitProcessorBrand();
}
void CpuidInfo::LoadInfo1(void)
{
  CpuidRegs r;
  if (m_MaxEax < 1)
    return;
  Cpuid_(1, 0, &r);
  //
  // Decode r.ECX flags
  //
  // CPUID.(EAX=01H, ECX=00H):ECX.SSE3[bit 0]
  if (r.ECX & (0x1 << 0))
    m_FeatureFlags |= (uint64_t)FF::SSE3;
  // CPUID.(EAX=01H, ECX=00H):ECX.PCLMULQDQ[bit 1]
  if (r.ECX & (0x1 << 1))
    m_FeatureFlags |= (uint64_t)FF::PCLMULQDQ;
  // CPUID.(EAX=01H, ECX=00H):ECX.SSSE3[bit 9]
  if (r.ECX & (0x1 << 9))
    m_FeatureFlags |= (uint64_t)FF::SSSE3;
  // CPUID.(EAX=01H, ECX=00H):ECX.SSE4.1[bit 19]
  if (r.ECX & (0x1 << 19))
    m_FeatureFlags |= (uint64_t)FF::SSE4_1;
  // CPUID.(EAX=01H, ECX=00H):ECX.SSE4.2[bit 20]
  if (r.ECX & (0x1 << 20))
    m_FeatureFlags |= (uint64_t)FF::SSE4_2;
  // CPUID.(EAX=01H, ECX=00H):ECX.MOVBE[bit 22]
  if (r.ECX & (0x1 << 22))
    m_FeatureFlags |= (uint64_t)FF::MOVBE;
  // CPUID.(EAX=01H, ECX=00H):ECX.POPCNT[bit 23]
  if (r.ECX & (0x1 << 23))
    m_FeatureFlags |= (uint64_t)FF::POPCNT;
  // CPUID.(EAX=01H, ECX=00H):ECX.RDRAND[bit 30]
  if (r.ECX & (0x1 << 30))
    m_FeatureFlags |= (uint64_t)FF::RDRAND;
  //
  // Decode r.RDX flags
  //
  // CPUID.(EAX=01H, ECX=00H):EDX.MMX[bit 23]
  if (r.EDX & (0x1 << 23))
    m_FeatureFlags |= (uint64_t)FF::MMX;
  // CPUID.(EAX=01H, ECX=00H):EDX.FXSR[bit 24]
  if (r.EDX & (0x1 << 24))
    m_FeatureFlags |= (uint64_t)FF::FXSR;
  // CPUID.(EAX=01H, ECX=00H):EDX.SSE[bit 25]
  if (r.EDX & (0x1 << 25))
    m_FeatureFlags |= (uint64_t)FF::SSE;
  // CPUID.(EAX=01H, ECX=00H):EDX.SSE2[bit 26]
  if (r.EDX & (0x1 << 26))
    m_FeatureFlags |= (uint64_t)FF::SSE2;
}
void CpuidInfo::LoadInfo2(void)
{
  CpuidRegs  r;
  if (m_MaxEax < 7)
    return;
  Cpuid_(7, 0, &r);
  // CPUID.(EAX=07H, ECX=00H):ECX.PREFETCHWT1[bit 0]
  if (r.ECX & (0x1 << 0))
    m_FeatureFlags |= (uint64_t)FF::PREFETCHWT1;
  // CPUID.(EAX=07H, ECX=00H):EBX.BMI1[bit 3]
  if (r.EBX & (0x1 << 3))
    m_FeatureFlags |= (uint64_t)FF::BMI1;
  // CPUID.(EAX=07H, ECX=00H):EBX.BMI2[bit 8]
  if (r.EBX & (0x1 << 8))
    m_FeatureFlags |= (uint64_t)FF::BMI2;
  // CPUID.(EAX=07H, ECX=00H):EBX.ERMSB[bit 9]
  // ERMSB = Enhanced REP MOVSB/STOSB
  if (r.EBX & (0x1 << 9))
    m_FeatureFlags |= (uint64_t)FF::ERMSB;
  // CPUID.(EAX=07H, ECX=00H):EBX.RDSEED[bit 18]
  if (r.EBX & (0x1 << 18))
    m_FeatureFlags |= (uint64_t)FF::RDSEED;
  // CPUID.(EAX=07H, ECX=00H):EBX.ADX[bit 19]
  if (r.EBX & (0x1 << 19))
    m_FeatureFlags |= (uint64_t)FF::ADX;
  // CPUID.(EAX=07H, ECX=00H):EBX.CLWB[bit 24]
  if (r.EBX & (0x1 << 24))
    m_FeatureFlags |= (uint64_t)FF::CLWB;
}
void CpuidInfo::LoadInfo3(void)
{
  CpuidRegs r;
  if (m_MaxEaxExt < 0x80000001)
    return;
  Cpuid_(0x80000001, 0, &r);
  // CPUID.(EAX=80000001H, ECX=00H):ECX.LZCNT[bit 5]
  if (r.ECX & (0x1 << 5))
    m_FeatureFlags |= (uint64_t)FF::LZCNT;
  // CPUID.(EAX=80000001H, ECX=00H):ECX.PREFETCHW[bit 8]
  if (r.ECX & (0x1 << 8))
    m_FeatureFlags |= (uint64_t)FF::PREFETCHW;
}
void CpuidInfo::LoadInfo4(void)
{
  CpuidRegs r_eax01h;
  CpuidRegs r_eax07h;
  if (m_MaxEax < 7)
    return;
  Cpuid_(1, 0, &r_eax01h);
  Cpuid_(7, 0, &r_eax07h);
  // Test CPUID.(EAX=01H, ECX=00H):ECX.OSXSAVE[bit 27] to verify use of XGETBV
  m_OsXsave = (r_eax01h.ECX & (0x1 << 27)) ? true : false;
  if (m_OsXsave)
  {
    // Use XGETBV to obtain following information
    // AVX state is enabled by OS if (XCR0[2:1] == '11b') is true
    // AVX512 state is enabled by OS if (XCR0[7:5] == '111b') is true
    uint32_t xgetbv_eax, xgetbv_edx;
    Xgetbv_(0, &xgetbv_eax, &xgetbv_edx);
    m_OsAvxState = (((xgetbv_eax >> 1) & 0x03) == 0x03) ? true : false;
    if (m_OsAvxState)
    {
      // CPUID.(EAX=01H, ECX=00H):ECX.AVX[bit 28]
      if (r_eax01h.ECX & (0x1 << 28))
      {
        m_FeatureFlags |= (uint64_t)FF::AVX;
        // CPUID.(EAX=01H, ECX=00H):ECX.FMA[bit 12]
        if (r_eax01h.ECX & (0x1 << 12))
          m_FeatureFlags |= (uint64_t)FF::FMA;
        // CPUID.(EAX=01H, ECX=00H):ECX.F16C[bit 29]
        if (r_eax01h.ECX & (0x1 << 29))
          m_FeatureFlags |= (uint64_t)FF::F16C;
        // CPUID.(EAX=07H, ECX=00H):EBX.AVX2[bit 5]
        if (r_eax07h.EBX & (0x1 << 5))
          m_FeatureFlags |= (uint64_t)FF::AVX2;
        m_OsAvx512State = (((xgetbv_eax >> 5) & 0x07) == 0x07) ? true : false;
        if (m_OsAvx512State)
        {
          // CPUID.(EAX=07H, ECX=00H):EBX.AVX512F[bit 16]
          if (r_eax07h.EBX & (0x1 << 16))
          {
            m_FeatureFlags |= (uint64_t)FF::AVX512F;
            //
            // Decode EBX flags
            //
            // CPUID.(EAX=07H, ECX=00H):EBX.AVX512DQ[bit 17]
            if (r_eax07h.EBX & (0x1 << 17))
              m_FeatureFlags |= (uint64_t)FF::AVX512DQ;
            // CPUID.(EAX=07H, ECX=00H):EBX.AVX512_IFMA[bit 21]
            if (r_eax07h.EBX & (0x1 << 21))
              m_FeatureFlags |= (uint64_t)FF::AVX512_IFMA;
            // CPUID.(EAX=07H, ECX=00H):EBX.AVX512PF[bit 26]
            if (r_eax07h.EBX & (0x1 << 26))
              m_FeatureFlags |= (uint64_t)FF::AVX512PF;
            // CPUID.(EAX=07H, ECX=00H):EBX.AVX512ER[bit 27]
            if (r_eax07h.EBX & (0x1 << 27))
              m_FeatureFlags |= (uint64_t)FF::AVX512ER;
            // CPUID.(EAX=07H, ECX=00H):EBX.AVX512CD[bit 28]
            if (r_eax07h.EBX & (0x1 << 28))
              m_FeatureFlags |= (uint64_t)FF::AVX512CD;
            // CPUID.(EAX=07H, ECX=00H):EBX.AVX512BW[bit 30]
            if (r_eax07h.EBX & (0x1 << 30))
              m_FeatureFlags |= (uint64_t)FF::AVX512BW;
            // CPUID.(EAX=07H, ECX=00H):EBX.AVX512VL[bit 31]
            if (r_eax07h.EBX & (0x1 << 31))
              m_FeatureFlags |= (uint64_t)FF::AVX512VL;
            //
            // Decode ECX flags
            //
            // CPUID.(EAX=07H, ECX=00H):ECX.AVX512_VBMI[bit 1]
            if (r_eax07h.ECX & (0x1 << 1))
              m_FeatureFlags |= (uint64_t)FF::AVX512_VBMI;
            // CPUID.(EAX=07H, ECX=00H):ECX.AVX512_VBMI2[bit 6]
            if (r_eax07h.ECX & (0x1 << 6))
              m_FeatureFlags |= (uint64_t)FF::AVX512_VBMI2;
            // CPUID.(EAX=07H, ECX=00H):ECX.AVX512_VNNI[bit 11]
            if (r_eax07h.ECX & (0x1 << 11))
              m_FeatureFlags |= (uint64_t)FF::AVX512_VNNI;
            // CPUID.(EAX=07H, ECX=00H):ECX.AVX512_BITALG[bit 12]
            if (r_eax07h.ECX & (0x1 << 12))
              m_FeatureFlags |= (uint64_t)FF::AVX512_BITALG;
            // CPUID.(EAX=07H, ECX=00H):ECX.AVX512_VPOPCNTDQ[bit 14]
            if (r_eax07h.ECX & (0x1 << 14))
              m_FeatureFlags |= (uint64_t)FF::AVX512_VPOPCNTDQ;
            //
            // Decode EDX flags
            //
            // CPUID.(EAX=07H, ECX=00H):EDX.AVX512_4FMAPS[bit 2]
            if (r_eax07h.EDX & (0x1 << 2))
              m_FeatureFlags |= (uint64_t)FF::AVX512_4FMAPS;
            // CPUID.(EAX=07H, ECX=00H):EDX.AVX512_4VNNIW[bit 3]
            if (r_eax07h.EDX & (0x1 << 3))
              m_FeatureFlags |= (uint64_t)FF::AVX512_4VNNIW;
          }
        }
      }
    }
  }
}
void CpuidInfo::LoadInfo5(void)
{
  if (m_MaxEax < 4)
    return;
  bool done = false;
  uint32_t index = 0;
  while (!done)
  {
    CpuidRegs r;
    Cpuid_(4, index, &r);
    uint32_t cache_type = r.EAX & 0x1f;
    uint32_t cache_level = ((r.EAX >> 5) & 0x3);
    if (cache_type == 0)
      done = true;
    else
    {
      uint32_t ways = ((r.EBX >> 22) & 0x3ff) + 1;
      uint32_t partitions = ((r.EBX >> 12) & 0x3ff) + 1;
      uint32_t line_size = (r.EBX & 0xfff) + 1;
      uint32_t sets = r.ECX + 1;
      uint32_t cache_size = ways * partitions * line_size * sets;
      CacheInfo ci(cache_level, cache_type, cache_size);
      m_CacheInfo.push_back(ci);
      index++;
    }
  }
}
void CpuidInfo::Init(void)
{
  m_MaxEax = 0;
  m_MaxEaxExt = 0;
  m_FeatureFlags = 0;
  m_OsXsave = false;
  m_OsAvxState = false;
  m_OsAvx512State = false;
  m_VendorId[0] = '\0';
  m_ProcessorBrand[0] = '\0';
  m_CacheInfo.clear();
}
void CpuidInfo::InitProcessorBrand(void)
{
  if (m_MaxEaxExt >= 0x80000004)
  {
    CpuidRegs r2, r3, r4;
    char* p = m_ProcessorBrand;
    Cpuid_(0x80000002, 0, &r2);
    Cpuid_(0x80000003, 0, &r3);
    Cpuid_(0x80000004, 0, &r4);
    *(uint32_t *)(p + 0) = r2.EAX;
    *(uint32_t *)(p + 4) = r2.EBX;
    *(uint32_t *)(p + 8) = r2.ECX;
    *(uint32_t *)(p + 12) = r2.EDX;
    *(uint32_t *)(p + 16) = r3.EAX;
    *(uint32_t *)(p + 20) = r3.EBX;
    *(uint32_t *)(p + 24) = r3.ECX;
    *(uint32_t *)(p + 28) = r3.EDX;
    *(uint32_t *)(p + 32) = r4.EAX;
    *(uint32_t *)(p + 36) = r4.EBX;
    *(uint32_t *)(p + 40) = r4.ECX;
    *(uint32_t *)(p + 44) = r4.EDX;
    m_ProcessorBrand[sizeof(m_ProcessorBrand) - 1] = '\0';
  }
  else
    strcpy_s(m_ProcessorBrand, "Unknown");
}
Listing 16-1. Example Ch16_01

Before examining the source code, it will be helpful to have a basic understanding of the cpuid instruction and how it works. Prior to using cpuid, a function must load register EAX with a “leaf” value that specifies what information the cpuid instruction should return. A second or “sub-leaf” value may also be required in register ECX. The cpuid instruction returns its results in registers EAX, EBX, ECX, and EDX. The calling function must then decode the values in these registers to ascertain processor support for specific features. As you will soon see, a program often needs to execute cpuid multiple times. Application programs typically use cpuid during initialization and save the results for later use. The reason for this is that cpuid is a serializing instruction: it forces the processor to finish executing all previously fetched instructions and complete any pending memory writes before fetching the next instruction. In other words, the cpuid instruction takes a long time to execute.
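The leaf/sub-leaf protocol and the subsequent bit decoding can be sketched with the GCC/Clang helpers declared in <cpuid.h> (these helpers are compiler-specific and are not part of Listing 16-1; the value 0x00080008 in the test below is a hypothetical sample, not live cpuid output):

```cpp
#include <cpuid.h>    // GCC/Clang CPUID helpers (compiler-specific)
#include <cstdint>

// Test a single bit of a CPUID result register.
bool CpuidBit(uint32_t reg, unsigned bit)
{
    return (reg & (1u << bit)) != 0;
}

// Query leaf 7, sub-leaf 0 (structured extended feature flags).
// Returns false if the leaf is not supported by the processor.
bool QueryLeaf7(uint32_t& ebx, uint32_t& ecx, uint32_t& edx)
{
    unsigned int a, b, c, d;
    if (!__get_cpuid_count(7, 0, &a, &b, &c, &d))
        return false;
    ebx = b; ecx = c; edx = d;
    return true;
}
```

For example, CpuidBit(ebx, 3) after a successful QueryLeaf7 call corresponds to the BMI1 test performed by CpuidInfo::LoadInfo2.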

Listing 16-1 begins with the header file CpuidInfo.h. Near the top of this file is a structure named CpuidRegs, which is used to save the results returned by cpuid. Following CpuidRegs is a C++ class named CpuidInfo. This class contains the code and data associated with cpuid instruction use. The public portion of CpuidInfo includes a subclass named CacheInfo, which is employed to report information about a processor’s memory caches. Class CpuidInfo also includes an enum named FF. An application program can use this enum as an argument value with the member function CpuidInfo::GetFF to determine whether the host processor supports a specific instruction set extension. You’ll see how this works later in this section. Toward the bottom of header file CpuidInfo.h are two declaration statements for the assembly language functions Cpuid_ and Xgetbv_. These functions execute the cpuid and xgetbv (Get Value of Extended Control Register) instructions, respectively.

Following CpuidInfo.h in Listing 16-1 is the source code file CpuidInfo_.asm. This file contains the assembly language functions Cpuid_ and Xgetbv_, which are simple wrapper functions for the x86 instructions cpuid and xgetbv. The function Cpuid_ begins by saving register RBX on the stack. It then loads the argument values r_eax and r_ecx into registers EAX and ECX and executes cpuid. Following the execution of cpuid, the results in registers EAX, EBX, ECX, and EDX are saved to the specified CpuidRegs structure. The assembly language function Xgetbv_ executes the xgetbv instruction, which loads the contents of the extended control register specified by ECX into register pair EDX:EAX. The xgetbv instruction allows an application program to determine whether the host operating system supports AVX, AVX2, or AVX-512, as explained later in this section.

The next file in Listing 16-1 is Ch16_01.cpp. The function main contains code that illustrates how to use the C++ class CpuidInfo. The statement ci.LoadInfo() invokes the member function CpuidInfo::LoadInfo, which executes cpuid multiple times to obtain information about the processor. Note that CpuidInfo::LoadInfo is called only once. The function DisplayCacheInfo streams information about the processor’s memory caches to cout. This function invokes CpuidInfo::GetCacheInfo to report cache information that was obtained during execution of CpuidInfo::LoadInfo. The function DisplayFeatureFlags shows information about some of the instruction set extensions that the processor supports. Each cout statement in this function uses CpuidInfo::GetFF with a different CpuidInfo::FF value. The member function CpuidInfo::GetFF returns a single bool value that indicates whether the processor supports the instruction set extension specified by its argument value. Like the cache data, the processor instruction set extension information was obtained and saved during the call to CpuidInfo::LoadInfo. Note that CpuidInfo is structured to allow an application program to make multiple CpuidInfo::GetFF calls without triggering additional executions of cpuid.
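The query-once, dispatch-many pattern that main relies on can be sketched as follows. SumAvx2 is a hypothetical stand-in (here identical to the scalar version); in a real program it would point at a SIMD implementation, and has_avx2 would come from CpuidInfo::GetFF(CpuidInfo::FF::AVX2):

```cpp
// Function-pointer table selected once at startup based on CPUID results.
struct Dispatch
{
    int (*SumKernel)(const int*, int);
};

// Portable scalar implementation.
int SumScalar(const int* p, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += p[i];
    return s;
}

// Hypothetical stand-in for an AVX2 implementation; it must compute the
// same result as the scalar version, just via a different code path.
int SumAvx2(const int* p, int n)
{
    return SumScalar(p, n);
}

// Choose kernels once; callers never re-query CPUID.
Dispatch SelectKernels(bool has_avx2)
{
    Dispatch d;
    d.SumKernel = has_avx2 ? SumAvx2 : SumScalar;
    return d;
}
```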

Following the file Ch16_01.cpp in Listing 16-1 is the source code for CpuidInfo.cpp, which contains the non-trivial member functions of class CpuidInfo. The member function CpuidInfo::LoadInfo discussed earlier invokes six private member functions that perform a multitude of cpuid queries. The first of these, CpuidInfo::LoadInfo0, begins by calling CpuidInfo::Init to carry out the requisite initializations. It then invokes the assembly language function Cpuid_ to obtain the maximum cpuid leaf value that the processor supports along with the processor vendor ID string. Another call to Cpuid_ obtains the maximum leaf value for extended cpuid information. This is followed by a call to CpuidInfo::InitProcessorBrand, which uses several Cpuid_ calls to query and save the processor brand string. The source code for this function is located toward the end of the file CpuidInfo.cpp.

The member functions CpuidInfo::LoadInfo1, CpuidInfo::LoadInfo2, and CpuidInfo::LoadInfo3 also exploit Cpuid_ to ascertain processor support for a variety of instruction set extensions. The code that’s contained in these member functions is mostly brute-force decoding of the various Cpuid_ results. The AMD and Intel programming reference manuals contain additional information about the cpuid feature flag bits that are used to indicate processor support for a specific instruction set extension. The private member function CpuidInfo::LoadInfo4 contains the code that checks for AVX, AVX2, and AVX-512. This member function warrants closer examination.

An application program can use the computational resources of x86-AVX only if they’re supported by both the processor and the host operating system. The Xgetbv_ function can be employed to determine host operating system support. Before using Xgetbv_, the cpuid flag OSXSAVE must be tested to ensure that it’s safe for an application program to use the xgetbv instruction. The processor generates an exception if xgetbv is executed with an invalid extended control register number or while OSXSAVE is disabled, which explains why the software flag m_OsXsave is checked prior to calling Xgetbv_. If OSXSAVE is set, the function CpuidInfo::LoadInfo4 invokes Xgetbv_ to determine whether the OS supports x86-AVX state information (i.e., whether the OS properly preserves the XMM, YMM, and ZMM registers during a task switch). If the host operating system supports x86-AVX state information, the function CpuidInfo::LoadInfo4 proceeds to decode the cpuid feature flags related to AVX and AVX2. Note that the feature flags FMA and F16C are also tested here. The remaining code in CpuidInfo::LoadInfo4 decodes the cpuid feature flags that signify support for the various AVX-512 instruction set extensions.
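The XCR0 tests that CpuidInfo::LoadInfo4 performs can be expressed as standalone predicates. The bit positions follow the comments in Listing 16-1 (XCR0[2:1] for SSE/AVX state, XCR0[7:5] for opmask and ZMM state); the sample XCR0 values in the test are hypothetical:

```cpp
#include <cstdint>

// Decode OS-enabled state bits from an XCR0 value (the EAX half of the
// EDX:EAX pair returned by xgetbv with ECX = 0).

// AVX state is enabled by the OS if XCR0[2:1] == 11b.
bool OsAvxStateEnabled(uint32_t xcr0_eax)
{
    return ((xcr0_eax >> 1) & 0x03) == 0x03;
}

// AVX-512 state is enabled by the OS if XCR0[7:5] == 111b.
bool OsAvx512StateEnabled(uint32_t xcr0_eax)
{
    return ((xcr0_eax >> 5) & 0x07) == 0x07;
}
```

For example, an XCR0 value of 0x07 (x87, SSE, and AVX state) enables AVX but not AVX-512, while 0xE7 additionally enables the three AVX-512 state components.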

The final private member function called by CpuidInfo::LoadInfo is CpuidInfo::LoadInfo5. This member function uses cpuid and the class CpuidInfo::CacheInfo to save type and size information about the processor’s memory caches. The ancillary code for class CpuidInfo::CacheInfo is not shown in Listing 16-1 but is included with the chapter download package. Here is the output for source code example Ch16_01:
GenuineIntel
Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz
Cache L1: 32KB Data
Cache L1: 32KB Instruction
Cache L2: 1MB Unified
Cache L3: 13MB Unified
----- CPUID Feature Flags -----
ADX:     1
AVX:     1
AVX2:    1
AVX512F:   1
AVX512BW:  1
AVX512CD:  1
AVX512DQ:  1
AVX512ER:  0
AVX512PF:  0
AVX512VL:  1
AVX512_IFMA: 0
AVX512_VBMI: 0
BMI1:    1
BMI2:    1
F16C:    1
FMA:     1
LZCNT:    1
POPCNT:   1
Table 16-1 shows a summary of cpuid information for several Intel processors. Before moving on to the next source code example, it should be noted that when using the cpuid instruction to determine processor support for the various AVX-512 instruction set extensions, it is often necessary for an application program to test multiple feature flags. For example, an application program must verify that the AVX512F, AVX512DQ, and AVX512VL feature flags are all set before using any AVX512DQ instructions with 256-bit or 128-bit wide operands. The Intel 64 and IA-32 Architectures Software Developer’s Manual (Volume 1) contains additional information regarding cpuid instruction use and feature flag testing.
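A combined test of this kind can be sketched using the FF bit positions defined in Listing 16-1:

```cpp
#include <cstdint>

// Feature-flag bit positions, mirroring CpuidInfo::FF in Listing 16-1.
constexpr uint64_t FF_AVX512F  = 1ull << 24;
constexpr uint64_t FF_AVX512DQ = 1ull << 27;
constexpr uint64_t FF_AVX512VL = 1ull << 30;

// An AVX512DQ instruction with 256- or 128-bit wide operands requires
// AVX512F, AVX512DQ, and AVX512VL to all be present.
bool CanUseAvx512DQ_VL(uint64_t feature_flags)
{
    const uint64_t required = FF_AVX512F | FF_AVX512DQ | FF_AVX512VL;
    return (feature_flags & required) == required;
}
```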
Table 16-1. Summary of Information from cpuid Instruction for Select Intel Processors

CPUID Feature                  i3-2310M   i7-4790S   i9-7900X   i7-8700K
L1 Data (KB, per core)         32         32         32         32
L1 Instruction (KB, per core)  32         32         32         32
L2 Unified (KB, per core)      256        256        1024       256
L3 Unified (MB)                3          8          13         12
ADX                            0          0          1          1
AVX                            1          1          1          1
AVX2                           0          1          1          1
AVX512F                        0          0          1          0
AVX512BW                       0          0          1          0
AVX512CD                       0          0          1          0
AVX512DQ                       0          0          1          0
AVX512ER                       0          0          0          0
AVX512PF                       0          0          0          0
AVX512VL                       0          0          1          0
AVX512_IFMA                    0          0          0          0
AVX512_VBMI                    0          0          0          0
BMI1                           0          1          1          1
BMI2                           0          1          1          1
F16C                           0          1          1          1
FMA                            0          1          1          1
LZCNT                          0          1          1          1
POPCNT                         1          1          1          1

Non-Temporal Memory Stores

From the perspective of a memory cache, data can be classified as either temporal or non-temporal. Temporal data is any value that is accessed more than once within a short period of time. Examples of temporal data include the elements of an array or data structure that are referenced multiple times during execution of a program loop; the instruction bytes of a program are also temporal. Non-temporal data is any value that is accessed once and not immediately reused. The destination arrays of many SIMD processing algorithms often contain non-temporal data. The distinction between temporal and non-temporal data is important since processor performance often degrades when the memory caches contain excessive amounts of non-temporal data. This condition is commonly called cache pollution. Ideally, a processor’s memory caches contain only temporal data since it makes little sense to cache items that are used only once.
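A small illustration of a non-temporal store, using the SSE intrinsic _mm_stream_ps rather than the vmovntps instruction that example Ch16_02 employs (a sketch assuming an x86 compiler; the function name FillStream is hypothetical):

```cpp
#include <xmmintrin.h>   // SSE intrinsics (baseline on x86-64)

// Write n floats using non-temporal stores. The streamed stores bypass the
// cache hierarchy, avoiding pollution of the caches with write-once output.
// Requirements: p must be 16-byte aligned and n a multiple of 4.
void FillStream(float* p, float value, int n)
{
    __m128 v = _mm_set1_ps(value);
    for (int i = 0; i < n; i += 4)
        _mm_stream_ps(p + i, v);     // non-temporal 128-bit store
    _mm_sfence();                    // make streamed stores globally visible
}
```

A caller must supply an aligned buffer, e.g. `alignas(16) float buf[16];` followed by `FillStream(buf, 2.5f, 16);`.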

Listing 16-2 shows the source code for example Ch16_02. This example illustrates the use of the non-temporal store instruction vmovntps. It also compares the performance of this instruction to the standard vmovaps instruction.
//------------------------------------------------
//        Ch16_02.cpp
//------------------------------------------------
#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include <cmath>
#include <string>
#include <random>
#include "Ch16_02.h"
#include "AlignedMem.h"
using namespace std;
void Init(float* x, size_t n, unsigned int seed)
{
  uniform_int_distribution<> ui_dist {1, 1000};
  default_random_engine rng {seed};
  for (size_t i = 0; i < n; i++)
    x[i] = (float)ui_dist(rng);
}
bool CalcResultCpp(float* c, const float* a, const float* b, size_t n)
{
  size_t align = 32;
  if ((n == 0) || ((n & 0x0f) != 0))
    return false;
  if (!AlignedMem::IsAligned(a, align))
    return false;
  if (!AlignedMem::IsAligned(b, align))
    return false;
  if (!AlignedMem::IsAligned(c, align))
    return false;
  for (size_t i = 0; i < n; i++)
    c[i] = sqrt(a[i] * a[i] + b[i] * b[i]);
  return true;
}
void CompareResults(const float* c1, const float* c2a, const float*c2b, size_t n)
{
  bool compare_ok = true;
  const float epsilon = 1.0e-9f;
  cout << fixed << setprecision(4);
  for (size_t i = 0; i < n && compare_ok; i++)
  {
    bool b1 = fabs(c1[i] - c2a[i]) > epsilon;
    bool b2 = fabs(c1[i] - c2b[i]) > epsilon;
    cout << setw(2) << i << " - ";
    cout << setw(10) << c1[i] << ' ';
    cout << setw(10) << c2a[i] << ' ';
    cout << setw(10) << c2b[i] << '\n';
    if (b1 || b2)
      compare_ok = false;
  }
  if (compare_ok)
    cout << "Array compare OK\n";
  else
    cout << "Array compare FAILED\n";
}
void NonTemporalStore(void)
{
  const size_t n = 16;
  const size_t align = 32;
  AlignedArray<float> a_aa(n, align);
  AlignedArray<float> b_aa(n, align);
  AlignedArray<float> c1_aa(n, align);
  AlignedArray<float> c2a_aa(n, align);
  AlignedArray<float> c2b_aa(n, align);
  float* a = a_aa.Data();
  float* b = b_aa.Data();
  float* c1 = c1_aa.Data();
  float* c2a = c2a_aa.Data();
  float* c2b = c2b_aa.Data();
  Init(a, n, 67);
  Init(b, n, 79);
  bool rc1 = CalcResultCpp(c1, a, b, n);
  bool rc2 = CalcResultA_(c2a, a, b, n);
  bool rc3 = CalcResultB_(c2b, a, b, n);
  if (!rc1 || !rc2 || !rc3)
  {
    cout << "Invalid return code\n";
    cout << "rc1 = " << boolalpha << rc1 << '\n';
    cout << "rc2 = " << boolalpha << rc2 << '\n';
    cout << "rc3 = " << boolalpha << rc3 << '\n';
    return;
  }
  cout << "Results for NonTemporalStore ";
  CompareResults(c1, c2a, c2b, n);
}
int main()
{
  NonTemporalStore();
  NonTemporalStore_BM();
  return 0;
}
;-------------------------------------------------
;        Ch16_02.asm
;-------------------------------------------------
; _CalcResult Macro
;
; The following macro contains a simple calculating loop that is used
; to compare performance of the vmovaps and vmovntps instructions.
_CalcResult macro MovInstr
; Load and validate arguments
    xor eax,eax             ;set error code
    test r9,r9
    jz Done               ;jump if n <= 0
    test r9,0fh
    jnz Done              ;jump if (n % 16) != 0
    test rcx,1fh
    jnz Done              ;jump if c is not aligned
    test rdx,1fh
    jnz Done              ;jump if a is not aligned
    test r8,1fh
    jnz Done              ;jump if b is not aligned
; Calculate c[i] = sqrt(a[i] * a[i] + b[i] * b[i])
    align 16
@@:   vmovaps ymm0,ymmword ptr [rdx+rax]   ;ymm0 = a[i+7]:a[i]
    vmovaps ymm1,ymmword ptr [r8+rax]    ;ymm1 = b[i+7]:b[i]
    vmulps ymm2,ymm0,ymm0          ;ymm2 = a[i] * a[i]
    vmulps ymm3,ymm1,ymm1          ;ymm3 = b[i] * b[i]
    vaddps ymm4,ymm2,ymm3          ;ymm4 = sum
    vsqrtps ymm5,ymm4            ;ymm5 = final result
    MovInstr ymmword ptr [rcx+rax],ymm5   ;save final values to c
    vmovaps ymm0,ymmword ptr [rdx+rax+32]  ;ymm0 = a[i+15]:a[i+8]
    vmovaps ymm1,ymmword ptr [r8+rax+32]  ;ymm1 = b[i+15]:b[i+8]
    vmulps ymm2,ymm0,ymm0          ;ymm2 = a[i] * a[i]
    vmulps ymm3,ymm1,ymm1          ;ymm3 = b[i] * b[i]
    vaddps ymm4,ymm2,ymm3          ;ymm4 = sum
    vsqrtps ymm5,ymm4            ;ymm5 = final result
    MovInstr ymmword ptr [rcx+rax+32],ymm5 ;save final values to c
    add rax,64               ;update offset
    sub r9,16                ;update counter
    jnz @B
    mov eax,1               ;set success return code
Done:  vzeroupper
    ret
    endm
; extern bool CalcResultA_(float* c, const float* a, const float* b, size_t n)
    .code
CalcResultA_ proc
    _CalcResult vmovaps
CalcResultA_ endp
; extern bool CalcResultB_(float* c, const float* a, const float* b, size_t n)
CalcResultB_ proc
    _CalcResult vmovntps
CalcResultB_ endp
    end
Listing 16-2.

Example Ch16_02

Near the top of Listing 16-2 is the C++ function CalcResultCpp. This function performs a simple arithmetic calculation using the elements of two single-precision floating-point source arrays. It then saves the result to a destination array. The next C++ function in Listing 16-2 is named CompareResults. This function verifies equivalence between the C++ and assembly language output arrays. The function NonTemporalStore allocates and initializes the test arrays. It then invokes the C++ and assembly language calculating functions. The output arrays of the three calculating functions are then compared for any discrepancies.
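The macro technique used later in Listing 16-2 can be mimicked in C++ with a template, which may make the assembly language easier to follow. The following sketch is not from the book; CalcResultT and its StoreOp parameter are illustrative names. The store operation is a template parameter, so both variants share one body, just as vmovaps and vmovntps share the body of the _CalcResult macro.

```cpp
#include <cmath>
#include <cstddef>

// Hedged C++ analogue of the _CalcResult macro: the store operation is a
// compile-time parameter, so the two variants differ only in how results
// are written to the destination array.
template <typename StoreOp>
bool CalcResultT(float* c, const float* a, const float* b, size_t n, StoreOp store)
{
    if (n == 0 || (n & 0x0f) != 0)      // n must be a non-zero multiple of 16
        return false;
    for (size_t i = 0; i < n; i++)
        store(&c[i], std::sqrt(a[i] * a[i] + b[i] * b[i]));
    return true;
}
```

Passing a plain-assignment lambda such as `[](float* p, float v) { *p = v; }` corresponds to the vmovaps variant; a streaming-store lambda would correspond to vmovntps.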

The assembly language code in Listing 16-2 begins with the definition of a macro named _CalcResult. This macro generates AVX instructions that perform the same calculation as the C++ function CalcResultCpp. The macro _CalcResult is used in the assembly language functions CalcResultA_ and CalcResultB_. Note that CalcResultA_ supplies the instruction vmovaps for the macro parameter MovInstr, while CalcResultB_ provides vmovntps. This means that the code executed by functions CalcResultA_ and CalcResultB_ is identical except for the move instruction that saves results to the destination array. Here is the output for source code example Ch16_02:
Results for NonTemporalStore
 0 -  240.8319  240.8319  240.8319
 1 -  747.1814  747.1814  747.1814
 2 -  285.1561  285.1561  285.1561
 3 -  862.3062  862.3062  862.3062
 4 -  604.8810  604.8810  604.8810
 5 - 1102.4504 1102.4504 1102.4504
 6 -  347.1441  347.1441  347.1441
 7 -  471.8315  471.8315  471.8315
 8 -  890.6739  890.6739  890.6739
 9 -  729.0878  729.0878  729.0878
10 -  458.3536  458.3536  458.3536
11 -  639.8031  639.8031  639.8031
12 - 1053.1063 1053.1063 1053.1063
13 - 1016.0079 1016.0079 1016.0079
14 -  610.4507  610.4507  610.4507
15 - 1161.7935 1161.7935 1161.7935
Array compare OK
Running benchmark function NonTemporalStore_BM - please wait
Benchmark times saved to file Ch16_02_NonTemporalStore_BM_CHROMIUM.csv
Table 16-2 shows benchmark timing measurements for source code example Ch16_02 using several different Intel processors. In this example, using a vmovntps instruction instead of a vmovaps instruction yielded notable performance improvements on all three computers. It should be noted that the x86’s non-temporal move instructions only provide a hint to the processor regarding memory use. They do not guarantee improved performance in all cases. Any performance gains are determined by the specific memory access pattern and the processor’s underlying microarchitecture.
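The same vmovaps versus vmovntps choice is also exposed at the C++ level through compiler intrinsics. The following sketch is an assumption-laden illustration, not code from the book: it uses 128-bit SSE intrinsics (baseline on x86-64, so no extra compiler flags are needed) rather than the 256-bit AVX instructions of Listing 16-2, and the function name is hypothetical. The pointers are assumed to be 16-byte aligned.

```cpp
#include <xmmintrin.h>   // _mm_load_ps, _mm_stream_ps, _mm_sfence (SSE)
#include <cstddef>

// Computes c[i] = sqrt(a[i]*a[i] + b[i]*b[i]) using a non-temporal store
// (movntps), the SSE counterpart of the vmovntps instruction in Listing 16-2.
// All pointers must be 16-byte aligned; n must be a multiple of 4.
void HypotStream(float* c, const float* a, const float* b, std::size_t n)
{
    for (std::size_t i = 0; i < n; i += 4)
    {
        __m128 va  = _mm_load_ps(&a[i]);
        __m128 vb  = _mm_load_ps(&b[i]);
        __m128 sum = _mm_add_ps(_mm_mul_ps(va, va), _mm_mul_ps(vb, vb));
        _mm_stream_ps(&c[i], _mm_sqrt_ps(sum));   // non-temporal store to c
    }
    _mm_sfence();   // ensure the streaming stores are globally visible
}
```

Replacing `_mm_stream_ps` with `_mm_store_ps` yields the temporal (vmovaps-style) variant.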
Table 16-2.

Mean Execution Times (Microseconds) for Functions CalcResultCpp, CalcResultA_, and CalcResultB_ (n = 2,000,000)

CPU

CalcResultCpp

CalcResultA_(uses vmovaps)

CalcResultB_(uses vmovntps)

i7-4790s

1553

1554

1242

i9-7900x

1173

1139

934

i7-8700k

847

801

590

Data Prefetch

An application program can also use the prefetch (Prefetch Data Into Caches) instruction to improve the performance of certain algorithms. This instruction facilitates pre-loading of expected-use data into the processor’s cache hierarchy. There are two basic forms of the prefetch instruction. The first form, prefetcht[0|1|2], pre-loads temporal data into a specific cache level. The second form, prefetchnta, pre-loads non-temporal data while minimizing cache pollution. Both forms of the prefetch instruction provide hints to the processor about the data that a program expects to use; a processor may choose to perform the prefetch operation or ignore the hint.
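The prefetch hints are also available from C++ via the `_mm_prefetch` intrinsic. The sketch below is illustrative, not from the book (the Node structure and function name are hypothetical); it previews the placement technique used in Listing 16-3: hint the next node before performing the current node's arithmetic, so the fetch can overlap the computation, and never prefetch a null pointer.

```cpp
#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_NTA (SSE, baseline on x86-64)
#include <cmath>

struct Node
{
    double a, b, c;
    Node* link;
};

// Traverse a linked list, hinting the next node as non-temporal data
// before computing with the current node.
void TraverseWithPrefetch(Node* p)
{
    while (p != nullptr)
    {
        Node* next = p->link;
        if (next != nullptr)                    // avoid prefetching nullptr
            _mm_prefetch(reinterpret_cast<const char*>(next), _MM_HINT_NTA);
        p->c = std::sqrt(p->a * p->a + p->b * p->b);  // work that hides the fetch
        p = next;
    }
}
```

The other hint levels (`_MM_HINT_T0`, `_MM_HINT_T1`, `_MM_HINT_T2`) correspond to the prefetcht0/t1/t2 forms.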

The prefetch instructions are suitable for use with a variety of data structures, including large arrays and linked lists. A linked list is a sequentially-ordered collection of nodes. Each node includes a data section and one or more pointers (or links) to its adjacent nodes. Figure 16-1 illustrates a simple linked list. Linked lists are useful since their size can grow or shrink (i.e., nodes can be added or deleted) depending on data storage requirements. One drawback of a linked list is that its nodes are usually not stored in a contiguously-allocated block of memory. This tends to increase access times when traversing the nodes of a linked list.
Figure 16-1.

Simple linked list

Source code example Ch16_03 illustrates how to perform linked list traversals both with and without the prefetchnta instruction. Listing 16-3 shows the C++ and assembly language source code for this example.
//------------------------------------------------
//        Ch16_03.h
//------------------------------------------------
#pragma once
#include <cstdint>
// This structure must match the corresponding structure definition in Ch16_03.asmh
struct LlNode
{
  double ValA[4];
  double ValB[4];
  double ValC[4];
  double ValD[4];
  uint8_t FreeSpace[376];
  LlNode* Link;
};
// Ch16_03_Misc.cpp
extern bool LlCompare(int num_nodes, LlNode* l1, LlNode* l2, LlNode* l3, int* node_fail);
extern LlNode* LlCreate(int num_nodes);
extern void LlDelete(LlNode* p);
extern bool LlPrint(LlNode* p, const char* fn, const char* msg, bool append);
extern void LlTraverse(LlNode* p);
// Ch16_03_.asm
extern "C" void LlTraverseA_(LlNode* p);
extern "C" void LlTraverseB_(LlNode* p);
// Ch16_03_BM.cpp
extern void LinkedListPrefetch_BM(void);
//------------------------------------------------
//        Ch16_03.cpp
//------------------------------------------------
#include "stdafx.h"
#include <iostream>
#include <cmath>
#include <random>
#include "Ch16_03.h"
#include "AlignedMem.h"
using namespace std;
void LinkedListPrefetch(void)
{
  const int num_nodes = 8;
  LlNode* list1 = LlCreate(num_nodes);
  LlNode* list2a = LlCreate(num_nodes);
  LlNode* list2b = LlCreate(num_nodes);
  LlTraverse(list1);
  LlTraverseA_(list2a);
  LlTraverseB_(list2b);
  int node_fail;
  const char* fn = "Ch16_03_LinkedListPrefetchResults.txt";
  cout << "Results for LinkedListPrefetch ";
  if (LlCompare(num_nodes, list1, list2a, list2b, &node_fail))
    cout << "Linked list compare OK ";
  else
    cout << "Linked list compare FAILED - node_fail = " << node_fail << ' ';
  LlPrint(list1, fn, "----- list1 -----", 0);
  LlPrint(list2a, fn, "----- list2a -----", 1);
  LlPrint(list2b, fn, "----- list2b -----", 1);
  cout << "Linked list results saved to file " << fn << ' ';
  LlDelete(list1);
  LlDelete(list2a);
  LlDelete(list2b);
}
int main()
{
  LinkedListPrefetch();
  LinkedListPrefetch_BM();
  return 0;
}
//------------------------------------------------
//        Ch16_03_Misc.cpp
//------------------------------------------------
#include "stdafx.h"
#include <iostream>
#include <cmath>
#include <random>
#include "Ch16_03.h"
#include "AlignedMem.h"
using namespace std;
bool LlCompare(int num_nodes, LlNode* l1, LlNode* l2, LlNode* l3, int* node_fail)
{
  const double epsilon = 1.0e-9;
  for (int i = 0; i < num_nodes; i++)
  {
    *node_fail = i;
    if ((l1 == nullptr) || (l2 == nullptr) || (l3 == nullptr))
      return false;
    for (int j = 0; j < 4; j++)
    {
      bool b12_c = fabs(l1->ValC[j] - l2->ValC[j]) > epsilon;
      bool b13_c = fabs(l1->ValC[j] - l3->ValC[j]) > epsilon;
      if (b12_c || b13_c)
        return false;
      bool b12_d = fabs(l1->ValD[j] - l2->ValD[j]) > epsilon;
      bool b13_d = fabs(l1->ValD[j] - l3->ValD[j]) > epsilon;
      if (b12_d || b13_d)
        return false;
    }
    l1 = l1->Link;
    l2 = l2->Link;
    l3 = l3->Link;
  }
  *node_fail = -2;
  if ((l1 != nullptr) || (l2 != nullptr) || (l3 != nullptr))
    return false;
  *node_fail = -1;
  return true;
}
LlNode* LlCreate(int num_nodes)
{
  const size_t align = 64;
  const unsigned int seed = 83;
  LlNode* first = nullptr;
  LlNode* last = nullptr;
  uniform_int_distribution<> ui_dist {1, 500};
  default_random_engine rng {seed};
  for (int i = 0; i < num_nodes; i++)
  {
    LlNode* p = (LlNode*)AlignedMem::Allocate(sizeof(LlNode), align);
    p->Link = nullptr;
    if (i == 0)
      first = last = p;
    else
    {
      last->Link = p;
      last = p;
    }
    for (int j = 0; j < 4; j++)
    {
      p->ValA[j] = (double)ui_dist(rng);
      p->ValB[j] = (double)ui_dist(rng);
      p->ValC[j] = 0;
      p->ValD[j] = 0;
    }
  }
  return first;
}
void LlDelete(LlNode* p)
{
  while (p != nullptr)
  {
    LlNode* q = p->Link;
    AlignedMem::Release(p);
    p = q;
  }
}
bool LlPrint(LlNode* p, const char* fn, const char* msg, bool append)
{
  FILE* fp;
  const char* mode = (append) ? "at" : "wt";
  if (fopen_s(&fp, fn, mode) != 0)
    return false;
  int i = 0;
  const char* fs = "%14.4lf %14.4lf %14.4lf %14.4lf ";
  if (msg != nullptr)
    fprintf(fp, " %s ", msg);
  while (p != nullptr)
  {
    fprintf(fp, " LlNode %d [0x%p] ", i, p);
    fprintf(fp, " ValA: ");
    fprintf(fp, fs, p->ValA[0], p->ValA[1], p->ValA[2], p->ValA[3]);
    fprintf(fp, " ValB: ");
    fprintf(fp, fs, p->ValB[0], p->ValB[1], p->ValB[2], p->ValB[3]);
    fprintf(fp, " ValC: ");
    fprintf(fp, fs, p->ValC[0], p->ValC[1], p->ValC[2], p->ValC[3]);
    fprintf(fp, " ValD: ");
    fprintf(fp, fs, p->ValD[0], p->ValD[1], p->ValD[2], p->ValD[3]);
    i++;
    p = p->Link;
  }
  fclose(fp);
  return true;
}
void LlTraverse(LlNode* p)
{
  while (p != nullptr)
  {
    for (int i = 0; i < 4; i++)
    {
      p->ValC[i] = sqrt(p->ValA[i] * p->ValA[i] + p->ValB[i] * p->ValB[i]);
      p->ValD[i] = sqrt(p->ValA[i] / p->ValB[i] + p->ValB[i] / p->ValA[i]);
    }
    p = p->Link;
  }
}
;-------------------------------------------------
;        Ch16_03_.asmh
;-------------------------------------------------
; This structure must match the corresponding structure definition in Ch16_03.h
LlNode   struct
ValA    real8 4 dup(?)
ValB    real8 4 dup(?)
ValC    real8 4 dup(?)
ValD    real8 4 dup(?)
FreeSpace  byte 376 dup(?)
Link    qword ?
LlNode   ends
;-------------------------------------------------
;        Ch16_03_.asm
;-------------------------------------------------
    include <Ch16_03_.asmh>
; Macro _LlTraverse
;
; The following macro generates linked list traversal code using the
; prefetchnta instruction if UsePrefetch is equal to 'Y'.
_LlTraverse macro UsePrefetch
    mov rax,rcx                 ;rax = ptr to 1st node
    test rax,rax
    jz Done                   ;jump if empty list
    align 16
@@:   mov rcx,[rax+LlNode.Link]          ;rcx = next node
    vmovapd ymm0,ymmword ptr [rax+LlNode.ValA] ;ymm0 = ValA
    vmovapd ymm1,ymmword ptr [rax+LlNode.ValB] ;ymm1 = ValB
IFIDNI <UsePrefetch>,<Y>
    mov rdx,rcx
    test rdx,rdx            ;is there another node?
    cmovz rdx,rax            ;avoid prefetch of nullptr
    prefetchnta [rdx]          ;prefetch start of next node
ENDIF
; Calculate ValC[i] = sqrt(ValA[i] * ValA[i] + ValB[i] * ValB[i])
    vmulpd ymm2,ymm0,ymm0        ;ymm2 = ValA * ValA
    vmulpd ymm3,ymm1,ymm1        ;ymm3 = ValB * ValB
    vaddpd ymm4,ymm2,ymm3        ;ymm4 = sums
    vsqrtpd ymm5,ymm4          ;ymm5 = square roots
    vmovntpd ymmword ptr [rax+LlNode.ValC],ymm5 ;save result
; Calculate ValD[i] = sqrt(ValA[i] / ValB[i] + ValB[i] / ValA[i]);
    vdivpd ymm2,ymm0,ymm1        ;ymm2 = ValA / ValB
    vdivpd ymm3,ymm1,ymm0        ;ymm3 = ValB / ValA
    vaddpd ymm4,ymm2,ymm3        ;ymm4 = sums
    vsqrtpd ymm5,ymm4          ;ymm5 = square roots
    vmovntpd ymmword ptr [rax+LlNode.ValD],ymm5 ;save result
    mov rax,rcx             ;rax = ptr to next node
    test rax,rax
    jnz @B
Done:  vzeroupper
    ret
    endm
; extern "C" void LlTraverseA_(LlNode* first);
    .code
LlTraverseA_ proc
    _LlTraverse n
LlTraverseA_ endp
; extern "C" void LlTraverseB_(LlNode* first);
LlTraverseB_ proc
    _LlTraverse y
LlTraverseB_ endp
    end
Listing 16-3.

Example Ch16_03

Listing 16-3 begins with the header file Ch16_03.h. The declaration of structure LlNode is located near the top of this file. The C++ code uses this structure to construct linked lists of test data. Structure members ValA through ValD hold the data values that are manipulated by the linked list traversal functions. The member FreeSpace is included to increase the size of LlNode for demonstration purposes since prefetching works best with larger data structures. A real-world implementation of LlNode could use this space for additional data items. The final member of LlNode is a pointer named Link, which points to the next LlNode structure. The assembly language counterpart of LlNode is declared in the file Ch16_03_.asmh.
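Because LlNode must stay byte-for-byte identical in the C++ and assembly language files, a compile-time size check is a useful safeguard. The static_assert below is a suggested addition, not part of the book's listing; it verifies the size arithmetic: 4 arrays x 4 doubles = 128 bytes, plus 376 pad bytes and an 8-byte link pointer, giving a 512-byte node that spans exactly eight 64-byte cache lines.

```cpp
#include <cstdint>

struct LlNode
{
    double ValA[4];
    double ValB[4];
    double ValC[4];
    double ValD[4];
    uint8_t FreeSpace[376];
    LlNode* Link;
};

// Compile-time guard: if LlNode drifts out of sync with Ch16_03_.asmh,
// the build fails instead of silently corrupting the traversal.
static_assert(sizeof(LlNode) == 512, "LlNode must match the asm structure (512 bytes)");
static_assert(sizeof(LlNode) % 64 == 0, "LlNode should be a multiple of a cache line");
```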

The base function for source code example Ch16_03 is named LinkedListPrefetch and can be found in source code file Ch16_03.cpp. This function builds several test linked lists, invokes the C++ and assembly language traversal functions, and validates the results. The source code file Ch16_03_Misc.cpp contains a set of miscellaneous functions that implement basic linked list processing operations. The function LlCompare compares the data nodes of its argument linked lists for equivalence. The functions LlCreate and LlDelete perform linked list allocation and deletion. LlPrint dumps the data contents of a linked list to a file. Finally, LlTraverse traverses a linked list and performs a simulated calculation using LlNode data elements ValA, ValB, ValC, and ValD.

Toward the top of the file Ch16_03_.asm is a macro named _LlTraverse. This macro generates code that performs the same linked list traversal and simulated calculation as the C++ function LlTraverse. The macro _LlTraverse requires one parameter that enables or disables data prefetching. If prefetching is enabled, the macro generates a short block of code that includes the instruction prefetchnta [rdx]. This instruction directs the processor to prefetch the non-temporal data pointed to by register RDX. In this example, RDX points to the next LlNode in the linked list. The actual number of bytes fetched by this instruction varies depending on the underlying microarchitecture; Intel processors fetch a minimum of 32 bytes. It is extremely important to note that the prefetchnta [rdx] instruction is positioned before the floating-point calculating instructions. Doing this gives the processor an opportunity to fetch the data for the next node while carrying out the arithmetic calculations for the current node. Also note that prior to execution of the prefetchnta [rdx] instruction, RDX is tested to avoid using prefetchnta with a nullptr (or zero) memory address, since this degrades processor performance. This is important since nullptr is used as the end-of-list terminator value. The assembly language functions LlTraverseA_ and LlTraverseB_ use macro _LlTraverse to perform linked list traversals without and with prefetching, respectively. Here is the output for source code example Ch16_03.
Results for LinkedListPrefetch
Linked list compare OK
Linked list results saved to file Ch16_03_LinkedListPrefetchResults.txt
Running benchmark function LinkedListPrefetch_BM - please wait
Benchmark times saved to file Ch16_03_LinkedListPrefetch_BM_CHROMIUM.csv
Table 16-3 shows benchmark timing measurements for several Intel processors. It is important to keep in mind that any performance benefits provided by the prefetch instructions are highly dependent on current processor load, data access patterns, and the underlying microarchitecture. According to the Intel 64 and IA-32 Architectures Optimization Reference Manual, the data prefetch instructions are “implementation specific.” This means that to maximize prefetch performance, an algorithm must be “tuned to each implementation” or microarchitecture. You are encouraged to consult the aforementioned reference manual for additional information regarding use of the x86’s data prefetch instructions.
Table 16-3.

Mean Execution Times (Microseconds) for Linked List Traversal Functions (num_nodes = 50,000)

CPU

LlTraverse(C++)

LlTraverseA_(without prefetchnta)

LlTraverseB_(with prefetchnta)

i7-4790s

5685

3093

2680

i9-7900x

5885

3064

2842

i7-8700k

5031

2384

2319

Multiple Threads

All the source code examples presented in this book thus far have shared one common characteristic: they all contain single-threaded code. The mere fact that you are reading this book probably means that you already know that most modern software applications utilize at least a few threads to better exploit the multiple cores of modern processors. For example, many high-performance computing applications frequently perform arithmetic calculations using large data arrays that contain millions of floating-point elements. One strategy that's often employed to accelerate these types of calculations is to distribute the array elements across multiple threads and have each thread carry out a subset of the required calculations. The next source code example demonstrates how to perform an arithmetic calculation using large floating-point arrays and multiple threads. Listing 16-4 shows the source code for example Ch16_04.
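The partitioning strategy described above can be reduced to a few lines of C++. The following is a minimal sketch with hypothetical names (it is far simpler than the book's driver in Listing 16-4): the index range is split into one contiguous slice per thread, the last thread absorbs any remainder, and no locking is needed because the slices are disjoint.

```cpp
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// Compute r[i] = sqrt(x[i]) for n elements using num_threads worker threads,
// each handling a contiguous slice of the arrays.
void SqrtAll(double* r, const double* x, std::size_t n, std::size_t num_threads)
{
    std::vector<std::thread> threads;
    std::size_t chunk = n / num_threads;
    for (std::size_t t = 0; t < num_threads; t++)
    {
        std::size_t i0 = t * chunk;
        std::size_t i1 = (t + 1 == num_threads) ? n : i0 + chunk;  // last slice takes the remainder
        threads.emplace_back([r, x, i0, i1]
        {
            for (std::size_t i = i0; i < i1; i++)
                r[i] = std::sqrt(x[i]);   // disjoint slices: no synchronization required
        });
    }
    for (auto& th : threads)
        th.join();
}
```

The book's example applies the same idea, but the per-slice work is performed by the AVX2 assembly language function CalcResult_.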

Caution

A processor can become extremely hot while executing multithreaded code that makes extensive use of x86-AVX instructions. Before running the code of example Ch16_04, you should verify that the processor in your computer has an adequate cooling system.

//------------------------------------------------
//        Ch16_04.h
//------------------------------------------------
#pragma once
#include <vector>
struct CalcInfo
{
  double* m_X1;
  double* m_X2;
  double* m_Y1;
  double* m_Y2;
  double* m_Z1;
  double* m_Z2;
  double* m_Result;
  size_t m_Index0;
  size_t m_Index1;
  int m_Status;
};
struct CoutInfo
{
  bool m_ThreadMsgEnable;
  size_t m_Iteration;
  size_t m_NumElements;
  size_t m_ThreadId;
  size_t m_NumThreads;
};
// Ch16_04_Misc.cpp
extern size_t CompareResults(const double* a, const double* b, size_t n);
extern void DisplayThreadMsg(const CalcInfo* ci, const CoutInfo* cout_info, const char* msg);
extern void Init(double* a1, double* a2, size_t n, unsigned int seed);
std::vector<size_t> GetNumElementsVec(size_t* num_elements_max);
std::vector<size_t> GetNumThreadsVec(void);
// Ch16_04_WinApi.cpp
extern bool GetAvailableMemory(size_t* mem_size);
// Ch16_04_.asm
extern "C" void CalcResult_(CalcInfo* ci);
// Miscellaneous constants
const size_t c_ElementSize = sizeof(double);
const size_t c_NumArrays = 8;  // Total number of allocated arrays
const size_t c_Align = 32;   // Alignment boundary (update Ch16_04_.asm if changed)
const size_t c_BlockSize = 8;  // Elements per iteration (update Ch16_04_.asm if changed)
//------------------------------------------------
//        Ch16_04_Misc.cpp
//------------------------------------------------
#include "stdafx.h"
#include <iostream>
#include <random>
#include <memory.h>
#include <cmath>
#include <mutex>
#include <vector>
#include <algorithm>
#include "Ch16_04.h"
using namespace std;
void Init(double* a1, double* a2, size_t n, unsigned int seed)
{
  uniform_int_distribution<> ui_dist {1, 2000};
  default_random_engine rng {seed};
  for (size_t i = 0; i < n; i++)
  {
    a1[i] = (double)ui_dist(rng);
    a2[i] = (double)ui_dist(rng);
  }
}
size_t CompareResults(const double* a, const double* b, size_t n)
{
  if (memcmp(a, b, n * sizeof(double)) == 0)
    return n;
  const double epsilon = 1.0e-15;
  for (size_t i = 0; i < n; i++)
  {
    if (fabs(a[i] - b[i]) > epsilon)
      return i;
  }
  return n;
}
void DisplayThreadMsg(const CalcInfo* ci, const CoutInfo* cout_info, const char* msg)
{
  static mutex mutex_cout;
  static const char nl = '\n';
  mutex_cout.lock();
  cout << nl << msg << nl;
  cout << " m_Iteration:  " << cout_info->m_Iteration << nl;
  cout << " m_NumElements: " << cout_info->m_NumElements << nl;
  cout << " m_ThreadId:  " << cout_info->m_ThreadId << nl;
  cout << " m_NumThreads: " << cout_info->m_NumThreads << nl;
  cout << " m_Index0:   " << ci->m_Index0 << nl;
  cout << " m_Index1:   " << ci->m_Index1 << nl;
  mutex_cout.unlock();
}
vector<size_t> GetNumElementsVec(size_t* num_elements_max)
{
//  vector<size_t> ne_vec {64, 192, 384, 512};  // Requires 32GB + extra
   vector<size_t> ne_vec {64, 128, 192, 256};  // Requires 16GB + extra
//  vector<size_t> ne_vec {64, 96, 128, 160};   // Requires 10GB + extra
  size_t mem_size_extra_gb = 2;    // Used to avoid allocating all available mem
  size_t ne_max = *std::max_element(ne_vec.begin(), ne_vec.end());
  if ((ne_max % c_BlockSize) != 0)
    throw runtime_error("ne_max must be an integer multiple of c_BlockSize");
  size_t mem_size;
  if (!GetAvailableMemory(&mem_size))
    throw runtime_error ("GetAvailableMemory failed");
  size_t mem_size_gb = mem_size / (1024 * 1024 * 1024);
  size_t mem_size_min = ne_max * 1024 * 1024 * c_ElementSize * c_NumArrays;
  size_t mem_size_min_gb = mem_size_min / (1024 * 1024 * 1024);
  if (mem_size_gb < mem_size_min_gb + mem_size_extra_gb)
    throw runtime_error ("Not enough available memory");
  *num_elements_max = ne_max * 1024 * 1024;
  return ne_vec;
}
vector<size_t> GetNumThreadsVec(void)
{
  vector<size_t> num_threads_vec {1, 2, 4, 6, 8};
  return num_threads_vec;
}
//------------------------------------------------
//        Ch16_04.cpp
//------------------------------------------------
#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include <sstream>
#include <stdexcept>
#include <thread>
#include <vector>
#include "Ch16_04.h"
#include "AlignedMem.h"
#include "BmThreadTimer.h"
using namespace std;
// Control flag for streaming thread status information to cout.
const bool c_ThreadMsgEnable = false;
void CalcResultCpp(CalcInfo* ci)
{
  size_t al = c_Align;
  size_t i0 = ci->m_Index0;
  size_t i1 = ci->m_Index1;
  size_t num_elements = i1 - i0 + 1;
  ci->m_Status = 0;
  if (num_elements == 0 || (num_elements % c_BlockSize) != 0)
    return;
  for (size_t i = i0; i <= i1; i++)
  {
    double xx = ci->m_X1[i] - ci->m_X2[i];
    double yy = ci->m_Y1[i] - ci->m_Y2[i];
    double zz = ci->m_Z1[i] - ci->m_Z2[i];
    ci->m_Result[i] = sqrt(1.0 / sqrt(xx * xx + yy * yy + zz * zz));
  }
  ci->m_Status = 1;
}
static void CalcResultThread(CalcInfo* ci, CoutInfo* cout_info)
{
  if (cout_info->m_ThreadMsgEnable)
    DisplayThreadMsg(ci, cout_info, "ENTER CalcResultThread()");
  CalcResult_(ci);
  if (cout_info->m_ThreadMsgEnable)
    DisplayThreadMsg(ci, cout_info, "EXIT CalcResultThread()");
}
void RunMultipleThreads(bool thread_msg_enable)
{
  // Code section #1
  size_t align = c_Align;
  size_t num_elements_max;
  vector<size_t> num_elements_vec = GetNumElementsVec(&num_elements_max);
  vector<size_t> num_threads_vec = GetNumThreadsVec();
  AlignedArray<double> x1_aa(num_elements_max, align);
  AlignedArray<double> x2_aa(num_elements_max, align);
  AlignedArray<double> y1_aa(num_elements_max, align);
  AlignedArray<double> y2_aa(num_elements_max, align);
  AlignedArray<double> z1_aa(num_elements_max, align);
  AlignedArray<double> z2_aa(num_elements_max, align);
  AlignedArray<double> result1_aa(num_elements_max, align);
  AlignedArray<double> result2_aa(num_elements_max, align);
  double* x1 = x1_aa.Data();
  double* x2 = x2_aa.Data();
  double* y1 = y1_aa.Data();
  double* y2 = y2_aa.Data();
  double* z1 = z1_aa.Data();
  double* z2 = z2_aa.Data();
  double* result1 = result1_aa.Data();
  double* result2 = result2_aa.Data();
  cout << "Begin initialization of test arrays ";
  cout << " Initializing test arrays x1, x2 ";
  Init(x1, x2, num_elements_max, 307);
  cout << " Initializing test arrays y1, y2 ";
  Init(y1, y2, num_elements_max, 401);
  cout << " Initializing test arrays z1, z2 ";
  Init(z1, z2, num_elements_max, 503);
  cout << "Finished initialization of test arrays ";
  CalcInfo ci1;
  ci1.m_X1 = x1; ci1.m_X2 = x2;
  ci1.m_Y1 = y1; ci1.m_Y2 = y2;
  ci1.m_Z1 = z1; ci1.m_Z2 = z2;
  ci1.m_Result = result1;
  ci1.m_Index0 = 0;
  ci1.m_Index1 = num_elements_max - 1;
  ci1.m_Status = -1;
  // CalcResultCpp used for verification purposes
  cout << "Begin execution of CalcResultCpp ";
  CalcResultCpp(&ci1);
  cout << "Finished execution of CalcResultCpp ";
  size_t iteration = 0;
  const size_t block_size = c_BlockSize;
  BmThreadTimer bmtt(num_elements_vec.size(), num_threads_vec.size());
  // Code section #2
  cout << "Begin execution of calculating threads ";
  for (size_t i = 0; i < num_elements_vec.size(); i++)
  {
    size_t num_elements = num_elements_vec[i] * 1024 * 1024;
    size_t num_blocks = num_elements / block_size;
    size_t num_blocks_rem = num_elements % block_size;
    if (num_blocks_rem != 0)
      throw runtime_error("num_elements must be an integer multiple of block_size");
    for (size_t j = 0; j < num_threads_vec.size(); j++)
    {
      size_t num_threads = num_threads_vec[j];
      bmtt.Start(i, j);
      size_t num_blocks_per_thread = num_blocks / num_threads;
      size_t num_blocks_per_thread_rem = num_blocks % num_threads;
      vector<CalcInfo> ci2(num_threads);
      vector<CoutInfo> cout_info(num_threads);
      vector<thread*> threads(num_threads);
      // Thread start code
      for (size_t k = 0; k < num_threads; k++)
      {
        ci2[k].m_X1 = x1;  ci2[k].m_X2 = x2;
        ci2[k].m_Y1 = y1;  ci2[k].m_Y2 = y2;
        ci2[k].m_Z1 = z1;  ci2[k].m_Z2 = z2;
        ci2[k].m_Result = result2;
        ci2[k].m_Index0 = k * num_blocks_per_thread * block_size;
        ci2[k].m_Index1 = (k + 1) * num_blocks_per_thread * block_size - 1;
        ci2[k].m_Status = -1;
        if ((k + 1) == num_threads)
          ci2[k].m_Index1 += num_blocks_per_thread_rem * block_size;
        cout_info[k].m_ThreadMsgEnable = thread_msg_enable;
        cout_info[k].m_Iteration = iteration;
        cout_info[k].m_NumElements = num_elements;
        cout_info[k].m_NumThreads = num_threads;
        cout_info[k].m_ThreadId = k;
        threads[k] = new thread(CalcResultThread, &ci2[k], &cout_info[k]);
      }
      // Wait for all threads to complete
      for (size_t k = 0; k < num_threads; k++)
        threads[k]->join();
      bmtt.Stop(i, j);
      size_t cmp_index = CompareResults(result1, result2, num_elements);
      if (cmp_index != num_elements)
      {
        ostringstream oss;
        oss << " compare error detected at index " << cmp_index;
        throw runtime_error(oss.str());
      }
      for (size_t k = 0; k < num_threads; k++)
      {
        if (ci2[k].m_Status != 1)
        {
          ostringstream oss;
          oss << " invalid status code " << ci2[k].m_Status;
          throw runtime_error(oss.str());
        }
        delete threads[k];
      }
    }
    iteration++;
  }
  cout << "Finished execution of calculating threads ";
  string fn = bmtt.BuildCsvFilenameString("Ch16_04_MultipleThreads_BM");
  bmtt.SaveElapsedTimes(fn, BmThreadTimer::EtUnit::MilliSec, 0);
  cout << "Benchmark times saved to file " << fn << ' ';
}
int main()
{
  try
  {
    RunMultipleThreads(c_ThreadMsgEnable);
  }
  catch (runtime_error& rte)
  {
    cout << "'runtime_error' exception has occurred - " << rte.what() << ' ';
  }
  catch (...)
  {
    cout << "Unexpected exception has occurred ";
  }
  return 0;
}
;-------------------------------------------------
;        Ch16_04_.asm
;-------------------------------------------------
    include <MacrosX86-64-AVX.asmh>
CalcInfo struct
  X1 qword ?
  X2 qword ?
  Y1 qword ?
  Y2 qword ?
  Z1 qword ?
  Z2 qword ?
  Result qword ?
  Index0 qword ?
  Index1 qword ?
  Status dword ?
CalcInfo ends
    .const
r8_1p0 real8 1.0
; extern "C" void CalcResult_(CalcInfo* ci)
    .code
CalcResult_ proc frame
    _CreateFrame CR,0,16,r12,r13,r14,r15
    _SaveXmmRegs xmm6
    _EndProlog
    mov dword ptr [rcx+CalcInfo.Status],0
; Make sure num_elements is valid
    mov rax,[rcx+CalcInfo.Index0]    ;rax = start index
    mov rdx,[rcx+CalcInfo.Index1]    ;rdx = stop index
    sub rdx,rax
    add rdx,1              ;rdx = num_elements
    test rdx,rdx
    jz Done               ;jump if num_elements == 0
    test rdx,7
    jnz Done              ;jump if num_elements % 8 != 0
; Make sure all arrays are properly aligned
    mov r8d,1fh
    mov r9,[rcx+CalcInfo.Result]
    test r9,r8
    jnz Done
    mov r10,[rcx+CalcInfo.X1]
    test r10,r8
    jnz Done
    mov r11,[rcx+CalcInfo.X2]
    test r11,r8
    jnz Done
    mov r12,[rcx+CalcInfo.Y1]
    test r12,r8
    jnz Done
    mov r13,[rcx+CalcInfo.Y2]
    test r13,r8
    jnz Done
    mov r14,[rcx+CalcInfo.Z1]
    test r14,r8
    jnz Done
    mov r15,[rcx+CalcInfo.Z2]
    test r15,r8
    jnz Done
    vbroadcastsd ymm6,real8 ptr [r8_1p0]    ;ymm6 = packed 1.0 (DPFP)
; Perform simulated calculation
    align 16
LP1:  vmovapd ymm0,ymmword ptr [r10+rax*8]
    vmovapd ymm1,ymmword ptr [r12+rax*8]
    vmovapd ymm2,ymmword ptr [r14+rax*8]
    vsubpd ymm0,ymm0,ymmword ptr [r11+rax*8]
    vsubpd ymm1,ymm1,ymmword ptr [r13+rax*8]
    vsubpd ymm2,ymm2,ymmword ptr [r15+rax*8]
    vmulpd ymm3,ymm0,ymm0
    vmulpd ymm4,ymm1,ymm1
    vmulpd ymm5,ymm2,ymm2
    vaddpd ymm0,ymm3,ymm4
    vaddpd ymm1,ymm0,ymm5
    vsqrtpd ymm2,ymm1
    vdivpd ymm3,ymm6,ymm2
    vsqrtpd ymm4,ymm3
    vmovntpd ymmword ptr [r9+rax*8],ymm4
    add rax,4
    vmovapd ymm0,ymmword ptr [r10+rax*8]
    vmovapd ymm1,ymmword ptr [r12+rax*8]
    vmovapd ymm2,ymmword ptr [r14+rax*8]
    vsubpd ymm0,ymm0,ymmword ptr [r11+rax*8]
    vsubpd ymm1,ymm1,ymmword ptr [r13+rax*8]
    vsubpd ymm2,ymm2,ymmword ptr [r15+rax*8]
    vmulpd ymm3,ymm0,ymm0
    vmulpd ymm4,ymm1,ymm1
    vmulpd ymm5,ymm2,ymm2
    vaddpd ymm0,ymm3,ymm4
    vaddpd ymm1,ymm0,ymm5
    vsqrtpd ymm2,ymm1
    vdivpd ymm3,ymm6,ymm2
    vsqrtpd ymm4,ymm3
    vmovntpd ymmword ptr [r9+rax*8],ymm4
    add rax,4
    sub rdx,8
    jnz LP1
    mov dword ptr [rcx+CalcInfo.Status],1
Done:  vzeroupper
    _RestoreXmmRegs xmm6
    _DeleteFrame r12,r13,r14,r15
    ret
CalcResult_ endp
    end
Listing 16-4.

Example Ch16_04

Source code example Ch16_04 performs a simulated calculation using large arrays of double-precision floating-point values across multiple threads. It uses the C++ STL class thread to run multiple instances of an AVX2 assembly language calculating function. Each thread performs its calculations using only a portion of the array data. The C++ driver routine exercises the calculating algorithm using various combinations of array sizes and simultaneously executing threads. It also implements benchmark timing measurements to quantify the performance benefits of the multithreaded technique.

Listing 16-4 begins with the header file Ch16_04.h. In this file, the structure CalcInfo contains the data that each thread needs to carry out its calculations. The structure members m_X1, m_X2, m_Y1, m_Y2, m_Z1, and m_Z2 point to the source arrays, while m_Result points to the destination array. Members m_Index0 and m_Index1 are array indices that define a range of unique elements for each calculating thread. Header file Ch16_04.h also includes a structure named CoutInfo, which contains status information that’s optionally displayed during program execution.
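The layout described above can be sketched as follows. This is a hypothetical reconstruction of CalcInfo, not the book's actual header; the member names follow the text, but the member types and ordering are assumptions.

```cpp
#include <cstddef>
#include <cassert>

// Hypothetical sketch of structure CalcInfo from Ch16_04.h; types and
// member order are assumptions inferred from the surrounding text.
struct CalcInfo
{
    double* m_X1;       // source arrays
    double* m_X2;
    double* m_Y1;
    double* m_Y2;
    double* m_Z1;
    double* m_Z2;
    double* m_Result;   // destination array
    size_t m_Index0;    // index of first element for this thread
    size_t m_Index1;    // index of last element for this thread
    int m_Status;       // set to 1 by the calculating function on success
};

// Number of elements a thread processes, per the index convention above.
size_t NumElements(const CalcInfo& ci)
{
    return ci.m_Index1 - ci.m_Index0 + 1;
}
```

Because m_Index0 and m_Index1 are inclusive bounds, a thread assigned indices 0 through 7 processes eight elements, which matches the num_elements computation in CalcResult_.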

The next file in Listing 16-4 is Ch16_04_Misc.cpp. This file contains the source code for the program’s ancillary functions. The functions Init and CompareResults perform array initialization and verification, respectively. The next function is DisplayThreadMsg. This function displays status information for each executing thread. Note that the cout statements in DisplayThreadMsg are synchronized with a C++ STL mutex. A mutex is a synchronization object that facilitates controlled access to a single resource by multiple threads. When mutex_cout is locked, only one thread can stream its status results to cout. The other executing threads are blocked from streaming their results to cout until mutex_cout becomes unlocked. Without this mutex, status information text from the executing threads would be intermingled on the display (if you’re interested in seeing what happens, try commenting out the mutex_cout.lock and mutex_cout.unlock statements).
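The mutex-guarded output pattern just described can be sketched as shown below. The names mutex_cout and DisplayThreadMsg mirror those in Ch16_04_Misc.cpp, but the function signature and message text here are invented for illustration.

```cpp
#include <atomic>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

std::mutex mutex_cout;                  // guards access to cout
std::atomic<int> message_count{0};      // added here so the demo is checkable

void DisplayThreadMsg(int thread_id, const char* msg)
{
    mutex_cout.lock();                  // only one thread may stream at a time
    std::cout << "Thread " << thread_id << ": " << msg << '\n';
    message_count++;
    mutex_cout.unlock();                // blocked threads may now proceed
}

// Launches n threads that each emit one status line, then waits for them.
void RunDisplayDemo(int n)
{
    std::vector<std::thread> threads;
    for (int i = 0; i < n; i++)
        threads.emplace_back(DisplayThreadMsg, i, "starting calculation");
    for (auto& t : threads)
        t.join();
}
```

Removing the lock and unlock calls makes the individual `<<` insertions from different threads free to interleave, which produces the scrambled output the text warns about.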

The function GetNumElementsVec returns a vector that contains the sizes of the test arrays. Note that the amount of memory required by the largest test array should be less than the amount of available memory minus a small fudge factor. The fudge factor prevents the program from allocating all available memory. Also note that GetNumElementsVec throws an exception if insufficient memory is available, since running the program in this manner is very slow due to the amount of page swapping that occurs. The function GetNumThreadsVec returns a vector of test thread counts. You can change the values in num_threads_vec to experiment with different thread count values.

The source code file Ch16_04.cpp contains the driver routines for source code example Ch16_04. The function CalcResultCpp is a C++ implementation of the simulated calculating algorithm and is used for result verification purposes. The next function, CalcResultThread, is the main thread function. This function invokes the assembly language calculating function CalcResult_. It also displays thread status messages if they’re enabled.
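The simulated calculation itself can be sketched in scalar C++ as shown below. This is a reconstruction inferred from the AVX2 processing loop in Ch16_04_.asm (difference, sum of squares, square root, reciprocal, square root); the actual CalcResultCpp in the book may differ in detail.

```cpp
#include <cmath>
#include <cstddef>

// Scalar sketch of the simulated calculation, inferred from the assembly
// language processing loop: result[i] = sqrt(1.0 / sqrt(dx^2 + dy^2 + dz^2)).
void CalcResultScalar(double* result,
                      const double* x1, const double* x2,
                      const double* y1, const double* y2,
                      const double* z1, const double* z2,
                      size_t index0, size_t index1)
{
    for (size_t i = index0; i <= index1; i++)
    {
        double dx = x1[i] - x2[i];
        double dy = y1[i] - y2[i];
        double dz = z1[i] - z2[i];
        double dist = std::sqrt(dx * dx + dy * dy + dz * dz);
        result[i] = std::sqrt(1.0 / dist);
    }
}
```

The AVX2 version in CalcResult_ performs the same sequence of operations on four double-precision elements per instruction, with the loop unrolled to process eight elements per iteration.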

Following CalcResultThread is the function RunMultipleThreads. This function exercises the calculating algorithm using the specified combinations of array sizes and number of simultaneously executing threads. The first code section of RunMultipleThreads performs array allocation and element initialization. It also calls the function CalcResultCpp to calculate result values for algorithm verification purposes. Note that prior to calling this function, an instance of CalcInfo is initialized with the data that’s necessary to carry out the required calculations.

The second code section of RunMultipleThreads, which starts immediately after the comment line Code section #2, runs the calculating algorithm by distributing the test array elements across multiple threads. The test array elements are split into groups based on the number of threads that will execute. For example, if the number of test array elements equals 64 million, launching four threads will result in each thread processing 16 million elements. The innermost for loop that follows the comment line Thread start code begins each iteration by initializing an instance of CalcInfo for the next thread. The statement threads[k] = new thread(CalcResultThread, &ci2[k], &cout_info[k]) constructs a new thread object and starts execution of the thread function CalcResultThread using argument values &ci2[k] and &cout_info[k]. This inner for loop repeats until the required number of executing threads have been launched. While the threads are executing, the function RunMultipleThreads executes a small for loop that invokes the function thread::join. This effectively forces RunMultipleThreads to wait until all executing threads have finished. The remaining code in RunMultipleThreads performs data verification and object cleanup.
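The launch-and-join pattern just described can be sketched as follows. This is a minimal stand-in, assuming num_elements divides evenly by num_threads as it does in the example's test cases; the worker function here is a trivial placeholder for CalcResultThread.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Placeholder for the per-thread range bookkeeping carried in CalcInfo.
struct Range { size_t Index0, Index1; };

// Trivial stand-in worker: sums the indices in its assigned range.
void Worker(const Range* r, long long* sum)
{
    long long s = 0;
    for (size_t i = r->Index0; i <= r->Index1; i++)
        s += static_cast<long long>(i);
    *sum = s;
}

// Distributes num_elements across num_threads, launches the threads with
// operator new (as in RunMultipleThreads), then joins and cleans up.
long long RunThreadsDemo(size_t num_elements, size_t num_threads)
{
    size_t per_thread = num_elements / num_threads;
    std::vector<Range> ranges(num_threads);
    std::vector<long long> sums(num_threads);
    std::vector<std::thread*> threads(num_threads);

    for (size_t k = 0; k < num_threads; k++)
    {
        ranges[k].Index0 = k * per_thread;              // first element
        ranges[k].Index1 = (k + 1) * per_thread - 1;    // last element
        threads[k] = new std::thread(Worker, &ranges[k], &sums[k]);
    }

    long long total = 0;
    for (size_t k = 0; k < num_threads; k++)
    {
        threads[k]->join();                             // wait for thread k
        total += sums[k];
        delete threads[k];
    }
    return total;
}
```

Because join blocks until its thread finishes, the second loop cannot complete until every worker has run to completion, which is what lets RunMultipleThreads safely verify the result arrays afterward.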

Before reviewing the assembly language code, the C++ code in RunMultipleThreads merits a few additional comments. The first thing to note is that the benchmarking code measures both the time it takes to carry out the required calculations and the overhead that’s associated with thread management. If the algorithm that’s employed by RunMultipleThreads were to be used in a real-world application, any benchmark timing measurements would be meaningless without factoring in this overhead. It should also be noted that RunMultipleThreads implements an extremely rudimentary form of multithreading that omits many important real-world operations to simplify the code for this example. If you’re interested in learning more about the C++ STL thread class and the other STL classes that facilitate multithreaded processing, you are strongly encouraged to consult the references listed in Appendix A.

The final file in Listing 16-4 is Ch16_04_.asm. Near the top of this file is the assembly language counterpart of the data structure CalcInfo. The assembly language function CalcResult_ is next. Following its prolog, CalcResult_ uses the instructions mov rax,[rcx+CalcInfo.Index0] and mov rdx,[rcx+CalcInfo.Index1] to load registers RAX and RDX with the indices of the first and last array elements. It then calculates and validates num_elements. Next, the test arrays are validated for proper alignment. The processing loop in CalcResult_ uses AVX packed double-precision floating-point arithmetic to carry out the same simulated calculation as CalcResultCpp. The output for source code example Ch16_04 follows this paragraph. Note that this output was produced with c_ThreadMsgEnable set to false. Setting this flag to true ultimately directs the function CalcResultThread to display status messages for each executing thread. The flag c_ThreadMsgEnable is defined near the top of Ch16_04.cpp.
Begin initialization of test arrays
 Initializing test arrays x1, x2
 Initializing test arrays y1, y2
 Initializing test arrays z1, z2
Finished initialization of test arrays
Begin execution of CalcResultCpp
Finished execution of CalcResultCpp
Begin execution of calculating threads
Finished execution of calculating threads
Benchmark times save to file Ch16_04_MultipleThreads_BM_CHROMIUM.csv
Tables 16-4, 16-5, and 16-6 contain the benchmark timing measurements for source code example Ch16_04. The measurements shown in these tables are the mean execution times from 10 separate runs and were made with 32 GB of SDRAM installed in each test computer. All three test computers show a significant improvement in performance when multiple threads are used to carry out the simulated calculation. For the i7-4790s and i7-8700k test computers, optimal performance is attained using four threads. The i9-7900x test computer shows meaningful performance gains when using six or eight threads. It is interesting to compare the timing measurements for the i9-7900x and i7-8700k systems. When using one or two threads, the i7-8700k outperforms the i9-7900x, while the opposite is true when four or more threads are employed. These measurements make sense when considering the hardware differences between the test processors, which are shown in Table 16-7. Even though the i7-8700k employs higher clock frequencies, the extra memory channels of the i9-7900x enable it to better utilize its CPU cores and complete the required calculations more quickly.
Table 16-4.

Benchmark Timing Measurements (Milliseconds) for RunMultipleThreads Using an Intel i7-4790s Processor

Number of Elements           Number of Threads
(Millions)              1       2       4       6       8
64                    686     226     178     180     172
128                  1146     491     345     355     347
192                  1592     702     513     530     516
256                  2054     942     679     714     688

Table 16-5.

Benchmark Timing Measurements (Milliseconds) for RunMultipleThreads Using an Intel i9-7900x Processor

Number of Elements           Number of Threads
(Millions)              1       2       4       6       8
64                    492     137      84      69      61
128                   765     300     163     131     121
192                  1110     454     233     193     178
256                  1330     582     313     260     238

Table 16-6.

Benchmark Timing Measurements (Milliseconds) for RunMultipleThreads Using an Intel i7-8700k Processor

Number of Elements           Number of Threads
(Millions)              1       2       4       6       8
64                    332     125     120     123     123
128                   522     265     240     245     246
192                   839     387     363     366     369
256                   919     499     478     484     492

Table 16-7.

Summary of Hardware Features for Test Processors Used in Example Ch16_04

Hardware Feature             i7-4790s     i9-7900x     i7-8700k
Number of cores              4            10           6
Number of threads            8            20           12
Base frequency (GHz)         3.2          3.3          3.7
Maximum frequency (GHz)      4.0          4.5          4.7
Memory type                  DDR3-1600    DDR4-2666    DDR4-2666
Number of memory channels    2            4            2

Summary

Here are the key learning points for Chapter 16:
  • An application program should always use the cpuid instruction to verify processor support for specific instruction set extensions. This is extremely important for software compatibility with future processors from both AMD and Intel.

  • An assembly language function can use the non-temporal store instructions vmovntp[d|s] instead of the vmovap[d|s] instructions to improve the performance of algorithms that carry out calculations using large arrays of non-temporal floating-point data.

  • An assembly language function can use the prefetch[0|1|2] instructions to pre-load temporal data into the processor’s cache hierarchy. A function can also use the prefetchnta instruction to pre-load non-temporal data and minimize cache pollution. The performance benefits of the prefetch instructions vary depending on data access patterns and the processor’s underlying microarchitecture.

  • A multithreaded algorithm that’s implemented in a high-level language such as C++ can exploit AVX, AVX2, or AVX-512 assembly language calculating functions to accelerate an algorithm’s overall performance.
