© Daniel Kusswurm 2018
Daniel Kusswurm, Modern X86 Assembly Language Programming, https://doi.org/10.1007/978-1-4842-4063-2_10

10. AVX2 Programming – Packed Integers

Daniel Kusswurm, Geneva, IL, USA

In Chapter 7, you learned how to use the AVX instruction set to perform packed integer operations using 128-bit wide operands and the XMM register set. In this chapter, you learn how to carry out similar operations using AVX2 instructions with 256-bit wide operands and the YMM register set. Chapter 10’s source code examples are divided into two major sections. The first section contains elementary examples that illustrate basic operations using AVX2 instructions and 256-bit wide packed integer operands. The second section includes examples that are a continuation of the image processing techniques first presented in Chapter 7.

All of the source code examples in this chapter require a processor and operating system that support AVX2. You can use one of the free utilities listed in Appendix A to verify the processing capabilities of your system.

Packed Integer Fundamentals

In this section, you learn how to perform fundamental packed integer operations using AVX2 instructions. The first source code example expounds basic arithmetic using 256-bit wide operands and the YMM register set. The second source code example demonstrates AVX2 instructions that carry out integer pack and unpack operations. This example also explains how to return a structure by value from an assembly language function. The final source code example illuminates AVX2 instructions that execute packed integer size promotions using zero or sign extended values.

Basic Arithmetic

Listing 10-1 shows the source code for example Ch10_01. This example illustrates how to perform basic arithmetic operations using packed word and doubleword operands.
//------------------------------------------------
//        Ch10_01.cpp
//------------------------------------------------
#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include "YmmVal.h"
using namespace std;
extern "C" void Avx2PackedMathI16_(const YmmVal& a, const YmmVal& b, YmmVal c[6]);
extern "C" void Avx2PackedMathI32_(const YmmVal& a, const YmmVal& b, YmmVal c[6]);
void Avx2PackedMathI16(void)
{
  alignas(32) YmmVal a;
  alignas(32) YmmVal b;
  alignas(32) YmmVal c[6];
  a.m_I16[0] = 10;    b.m_I16[0] = 1000;
  a.m_I16[1] = 20;    b.m_I16[1] = 2000;
  a.m_I16[2] = 3000;   b.m_I16[2] = 30;
  a.m_I16[3] = 4000;   b.m_I16[3] = 40;
  a.m_I16[4] = 30000;  b.m_I16[4] = 3000;    // add overflow
  a.m_I16[5] = 6000;   b.m_I16[5] = 32000;   // add overflow
  a.m_I16[6] = 2000;   b.m_I16[6] = -31000;   // sub overflow
  a.m_I16[7] = 4000;   b.m_I16[7] = -30000;   // sub overflow
  a.m_I16[8] = 4000;   b.m_I16[8] = -2500;
  a.m_I16[9] = 3600;   b.m_I16[9] = -1200;
  a.m_I16[10] = 6000;  b.m_I16[10] = 9000;
  a.m_I16[11] = -20000; b.m_I16[11] = -20000;
  a.m_I16[12] = -25000; b.m_I16[12] = -27000;  // add overflow
  a.m_I16[13] = 8000;  b.m_I16[13] = 28700;   // add overflow
  a.m_I16[14] = 3;    b.m_I16[14] = -32766;  // sub overflow
  a.m_I16[15] = -15000; b.m_I16[15] = 24000;   // sub overflow
  Avx2PackedMathI16_(a, b, c);
  cout << "\nResults for Avx2PackedMathI16_\n";
  cout << " i        a        b   vpaddw  vpaddsw   vpsubw  vpsubsw  vpminsw  vpmaxsw\n";
  cout << "--------------------------------------------------------------------------\n";
  for (int i = 0; i < 16; i++)
  {
    cout << setw(2) << i << ' ';
    cout << setw(8) << a.m_I16[i] << ' ';
    cout << setw(8) << b.m_I16[i] << ' ';
    cout << setw(8) << c[0].m_I16[i] << ' ';
    cout << setw(8) << c[1].m_I16[i] << ' ';
    cout << setw(8) << c[2].m_I16[i] << ' ';
    cout << setw(8) << c[3].m_I16[i] << ' ';
    cout << setw(8) << c[4].m_I16[i] << ' ';
    cout << setw(8) << c[5].m_I16[i] << '\n';
  }
}
void Avx2PackedMathI32(void)
{
  alignas(32) YmmVal a;
  alignas(32) YmmVal b;
  alignas(32) YmmVal c[6];
  a.m_I32[0] = 64;    b.m_I32[0] = 4;
  a.m_I32[1] = 1024;   b.m_I32[1] = 5;
  a.m_I32[2] = -2048;   b.m_I32[2] = 2;
  a.m_I32[3] = 8192;   b.m_I32[3] = 5;
  a.m_I32[4] = -256;   b.m_I32[4] = 8;
  a.m_I32[5] = 4096;   b.m_I32[5] = 7;
  a.m_I32[6] = 16;    b.m_I32[6] = 3;
  a.m_I32[7] = 512;    b.m_I32[7] = 6;
  Avx2PackedMathI32_(a, b, c);
  cout << "\nResults for Avx2PackedMathI32\n";
  cout << " i      a      b   vpaddd   vpsubd  vpmulld  vpsllvd  vpsravd   vpabsd\n";
  cout << "----------------------------------------------------------------------\n";
  for (int i = 0; i < 8; i++)
  {
    cout << setw(2) << i << ' ';
    cout << setw(6) << a.m_I32[i] << ' ';
    cout << setw(6) << b.m_I32[i] << ' ';
    cout << setw(8) << c[0].m_I32[i] << ' ';
    cout << setw(8) << c[1].m_I32[i] << ' ';
    cout << setw(8) << c[2].m_I32[i] << ' ';
    cout << setw(8) << c[3].m_I32[i] << ' ';
    cout << setw(8) << c[4].m_I32[i] << ' ';
    cout << setw(8) << c[5].m_I32[i] << '\n';
  }
}
int main()
{
  Avx2PackedMathI16();
  Avx2PackedMathI32();
  return 0;
}
;-------------------------------------------------
;        Ch10_01.asm
;-------------------------------------------------
; extern "C" void Avx2PackedMathI16_(const YmmVal& a, const YmmVal& b, YmmVal c[6])
    .code
Avx2PackedMathI16_ proc
; Load values a and b, which must be properly aligned
    vmovdqa ymm0,ymmword ptr [rcx]   ;ymm0 = a
    vmovdqa ymm1,ymmword ptr [rdx]   ;ymm1 = b
; Perform packed arithmetic operations
    vpaddw ymm2,ymm0,ymm1        ;add
    vmovdqa ymmword ptr [r8],ymm2    ;save vpaddw result
    vpaddsw ymm2,ymm0,ymm1       ;add with signed saturation
    vmovdqa ymmword ptr [r8+32],ymm2  ;save vpaddsw result
    vpsubw ymm2,ymm0,ymm1        ;sub
    vmovdqa ymmword ptr [r8+64],ymm2  ;save vpsubw result
    vpsubsw ymm2,ymm0,ymm1       ;sub with signed saturation
    vmovdqa ymmword ptr [r8+96],ymm2  ;save vpsubsw result
    vpminsw ymm2,ymm0,ymm1       ;signed minimums
    vmovdqa ymmword ptr [r8+128],ymm2  ;save vpminsw result
    vpmaxsw ymm2,ymm0,ymm1       ;signed maximums
    vmovdqa ymmword ptr [r8+160],ymm2  ;save vpmaxsw result
    vzeroupper
    ret
Avx2PackedMathI16_ endp
; extern "C" void Avx2PackedMathI32_(const YmmVal& a, const YmmVal& b, YmmVal c[6])
Avx2PackedMathI32_ proc
; Load values a and b, which must be properly aligned
    vmovdqa ymm0,ymmword ptr [rcx]   ;ymm0 = a
    vmovdqa ymm1,ymmword ptr [rdx]   ;ymm1 = b
; Perform packed arithmetic operations
    vpaddd ymm2,ymm0,ymm1        ;add
    vmovdqa ymmword ptr [r8],ymm2    ;save vpaddd result
    vpsubd ymm2,ymm0,ymm1        ;sub
    vmovdqa ymmword ptr [r8+32],ymm2  ;save vpsubd result
    vpmulld ymm2,ymm0,ymm1       ;signed mul (low 32 bits)
    vmovdqa ymmword ptr [r8+64],ymm2  ;save vpmulld result
    vpsllvd ymm2,ymm0,ymm1       ;shift left logical
    vmovdqa ymmword ptr [r8+96],ymm2  ;save vpsllvd result
    vpsravd ymm2,ymm0,ymm1       ;shift right arithmetic
    vmovdqa ymmword ptr [r8+128],ymm2  ;save vpsravd result
    vpabsd ymm2,ymm0          ;absolute value
    vmovdqa ymmword ptr [r8+160],ymm2  ;save vpabsd result
    vzeroupper
    ret
Avx2PackedMathI32_ endp
    end
Listing 10-1. Example Ch10_01

The C++ function Avx2PackedMathI16 contains code that demonstrates packed signed word arithmetic. This function begins with the definitions of YmmVal variables a, b, and c. Note that the C++ specifier alignas(32) is used with each YmmVal definition to ensure alignment on a 32-byte boundary. The signed word elements of both a and b are then initialized with test values. Following variable initialization, Avx2PackedMathI16 calls the assembly language function Avx2PackedMathI16_, which performs several packed arithmetic operations. The results are then streamed to cout. The C++ function Avx2PackedMathI32 is next. The structure of this function is similar to Avx2PackedMathI16, with the main difference being that it exercises packed doubleword operands.
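The effect of the alignas(32) specifier can be checked directly from C++. The following is a minimal sketch rather than code from the example: YmmValSketch is a hypothetical stand-in for YmmVal (whose real declaration lives in YmmVal.h and is not reproduced in this chapter), and IsAligned32 and LocalsAreAligned32 are illustrative helper names.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for YmmVal (the real type is declared in YmmVal.h).
// It overlays several element views on the same 32 bytes.
struct YmmValSketch
{
    union
    {
        std::int16_t m_I16[16];
        std::int32_t m_I32[8];
        std::uint8_t m_U8[32];
    };
};

static_assert(sizeof(YmmValSketch) == 32, "one YMM register is 32 bytes wide");

// True if p lies on a 32-byte boundary, the requirement that vmovdqa
// places on a ymmword memory operand.
inline bool IsAligned32(const void* p)
{
    assert(p != nullptr);
    return reinterpret_cast<std::uintptr_t>(p) % 32 == 0;
}

// Mirrors the declarations in Avx2PackedMathI16: alignas(32) on each variable.
inline bool LocalsAreAligned32()
{
    alignas(32) YmmValSketch a;
    alignas(32) YmmValSketch c[6];
    // c[0] is aligned and each element is 32 bytes, so every c[i] is aligned too.
    return IsAligned32(&a) && IsAligned32(&c[0]) && IsAligned32(&c[5]);
}
```

Because each element of the array occupies exactly 32 bytes, aligning the array as a whole is enough to align every element that the assembly language function writes.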

The assembly language function Avx2PackedMathI16_ begins with a vmovdqa ymm0,ymmword ptr [rcx] instruction that loads YmmVal a into register YMM0. The ensuing vmovdqa ymm1,ymmword ptr [rdx] instruction loads YmmVal b into register YMM1. This is followed by a vpaddw ymm2,ymm0,ymm1 instruction that performs packed word addition of a and b. The vmovdqa ymmword ptr [r8],ymm2 instruction then saves the packed word sums to c[0]. The remaining assembly language code in Avx2PackedMathI16_ exercises the instructions vpaddsw, vpsubw, vpsubsw, vpminsw, and vpmaxsw to carry out additional arithmetic operations, saving each result to the next element of c. Similar to the source code examples that you saw in Chapter 9, Avx2PackedMathI16_ uses a vzeroupper instruction before its ret instruction. This avoids potential performance penalties that can occur when the processor transitions from executing x86-AVX instructions to x86-SSE instructions, as explained in Chapter 8. The assembly language function Avx2PackedMathI32_ employs a similar structure to exercise commonly used packed doubleword instructions including vpaddd, vpsubd, vpmulld, vpsllvd, vpsravd, and vpabsd. Here are the results for source code example Ch10_01:
Results for Avx2PackedMathI16_
 i        a        b   vpaddw  vpaddsw   vpsubw  vpsubsw  vpminsw  vpmaxsw
--------------------------------------------------------------------------
 0       10     1000     1010     1010     -990     -990       10     1000
 1       20     2000     2020     2020    -1980    -1980       20     2000
 2     3000       30     3030     3030     2970     2970       30     3000
 3     4000       40     4040     4040     3960     3960       40     4000
 4    30000     3000   -32536    32767    27000    27000     3000    30000
 5     6000    32000   -27536    32767   -26000   -26000     6000    32000
 6     2000   -31000   -29000   -29000   -32536    32767   -31000     2000
 7     4000   -30000   -26000   -26000   -31536    32767   -30000     4000
 8     4000    -2500     1500     1500     6500     6500    -2500     4000
 9     3600    -1200     2400     2400     4800     4800    -1200     3600
10     6000     9000    15000    15000    -3000    -3000     6000     9000
11   -20000   -20000    25536   -32768        0        0   -20000   -20000
12   -25000   -27000    13536   -32768     2000     2000   -27000   -25000
13     8000    28700   -28836    32767   -20700   -20700     8000    28700
14        3   -32766   -32763   -32763   -32767    32767   -32766        3
15   -15000    24000     9000     9000    26536   -32768   -15000    24000
Results for Avx2PackedMathI32
 i      a      b   vpaddd   vpsubd  vpmulld  vpsllvd  vpsravd   vpabsd
----------------------------------------------------------------------
 0     64      4       68       60      256     1024        4       64
 1   1024      5     1029     1019     5120    32768       32     1024
 2  -2048      2    -2046    -2050    -4096    -8192     -512     2048
 3   8192      5     8197     8187    40960   262144      256     8192
 4   -256      8     -248     -264    -2048   -65536       -1      256
 5   4096      7     4103     4089    28672   524288       32     4096
 6     16      3       19       13       48      128        2       16
 7    512      6      518      506     3072    32768        8      512
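The rows marked as overflow cases show the difference between wraparound and saturated arithmetic. The following scalar C++ sketch models a single word element of vpaddw, vpaddsw, vpsubw, and vpsubsw. It is not part of example Ch10_01, and the function names are hypothetical; the values in CheckAgainstCh10_01 are taken from rows 4, 6, 11, and 15 of the output above.

```cpp
#include <cassert>
#include <cstdint>

// Reduce a 32-bit intermediate to 16 bits the way vpaddw/vpsubw do:
// keep the low 16 bits and reinterpret them as a signed word.
inline std::int16_t WrapToI16(std::int32_t x)
{
    const std::uint16_t u = static_cast<std::uint16_t>(x);  // modulo 2^16
    return (u < 0x8000) ? static_cast<std::int16_t>(u)
                        : static_cast<std::int16_t>(static_cast<std::int32_t>(u) - 0x10000);
}

// Reduce a 32-bit intermediate the way vpaddsw/vpsubsw do:
// clamp to the signed word range [-32768, 32767].
inline std::int16_t SaturateToI16(std::int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return static_cast<std::int16_t>(x);
}

inline std::int16_t AddWrapI16(std::int16_t a, std::int16_t b) { return WrapToI16(std::int32_t(a) + b); }
inline std::int16_t AddSatI16(std::int16_t a, std::int16_t b)  { return SaturateToI16(std::int32_t(a) + b); }
inline std::int16_t SubWrapI16(std::int16_t a, std::int16_t b) { return WrapToI16(std::int32_t(a) - b); }
inline std::int16_t SubSatI16(std::int16_t a, std::int16_t b)  { return SaturateToI16(std::int32_t(a) - b); }

// Spot checks against the Ch10_01 output.
inline void CheckAgainstCh10_01()
{
    assert(AddWrapI16(30000, 3000) == -32536);     // row 4, vpaddw
    assert(AddSatI16(30000, 3000) == 32767);       // row 4, vpaddsw
    assert(SubWrapI16(2000, -31000) == -32536);    // row 6, vpsubw
    assert(SubSatI16(2000, -31000) == 32767);      // row 6, vpsubsw
    assert(AddWrapI16(-20000, -20000) == 25536);   // row 11, vpaddw
    assert(AddSatI16(-20000, -20000) == -32768);   // row 11, vpaddsw
    assert(SubSatI16(-15000, 24000) == -32768);    // row 15, vpsubsw
}
```

When no overflow occurs, the wraparound and saturated forms return the same value, which is why the vpaddw and vpaddsw columns agree on most rows.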

On systems that support AVX2, most of the instructions exercised in this example can be used with a variety of 256-bit wide packed integer operands. For example, the vpadd[b|q] and vpsub[b|q] instructions carry out addition and subtraction using 256-bit wide packed byte or quadword operands. The vpaddsb and vpsubsb instructions perform signed saturated addition and subtraction using packed byte operands. The instructions vpmins[b|d] and vpmaxs[b|d] calculate packed signed minimums and maximums, respectively. The variable bit shift instructions vpsllv[d|q], vpsravd, and vpsrlv[d|q] are new AVX2 instructions. These instructions are not available on systems that only support AVX.
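Unlike the older shift instructions, which apply one count to every element, the variable shift instructions take a separate count for each element. The scalar C++ sketch below models vpsllvd and vpsravd under that reading; the type alias and function names are hypothetical and the code is an illustration, not part of the example. Counts greater than 31 yield zero for vpsllvd and a sign-filled result for vpsravd.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using I32x8 = std::array<std::int32_t, 8>;

// Reinterpret 32 bits as a signed doubleword without relying on
// implementation-defined narrowing conversions.
inline std::int32_t BitsToI32(std::uint32_t u)
{
    return (u < 0x80000000u)
        ? static_cast<std::int32_t>(u)
        : static_cast<std::int32_t>(static_cast<std::int64_t>(u) - 0x100000000LL);
}

// Model of vpsllvd: element i of a is shifted left by element i of count.
inline I32x8 ShiftLeftLogicalVar(const I32x8& a, const I32x8& count)
{
    I32x8 r{};
    for (std::size_t i = 0; i < r.size(); i++)
    {
        const std::uint32_t n = static_cast<std::uint32_t>(count[i]);
        const std::uint32_t v = static_cast<std::uint32_t>(a[i]);
        r[i] = BitsToI32(n > 31 ? 0u : v << n);
    }
    return r;
}

// Model of vpsravd: arithmetic (sign-preserving) right shift per element.
inline I32x8 ShiftRightArithVar(const I32x8& a, const I32x8& count)
{
    I32x8 r{};
    for (std::size_t i = 0; i < r.size(); i++)
    {
        const std::uint32_t n = static_cast<std::uint32_t>(count[i]);
        const std::uint32_t s = (n > 31) ? 31u : n;
        // Negative inputs are shifted via their complement so that the
        // code never right-shifts a negative value.
        r[i] = (a[i] >= 0) ? (a[i] >> s) : ~((~a[i]) >> s);
    }
    return r;
}

// Reproduces the vpsllvd and vpsravd columns of the Ch10_01 output.
inline void CheckShiftsAgainstCh10_01()
{
    const I32x8 a = { 64, 1024, -2048, 8192, -256, 4096, 16, 512 };
    const I32x8 b = { 4, 5, 2, 5, 8, 7, 3, 6 };
    const I32x8 sll = { 1024, 32768, -8192, 262144, -65536, 524288, 128, 32768 };
    const I32x8 sra = { 4, 32, -512, 256, -1, 32, 2, 8 };
    assert(ShiftLeftLogicalVar(a, b) == sll);
    assert(ShiftRightArithVar(a, b) == sra);
}
```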

Pack and Unpack

The next source code example illustrates how to perform integer pack and unpack operations. These operations are often employed to size-reduce or size-promote packed integer operands. This example also explains how to return a structure by value from an assembly language function. Listing 10-2 shows the source code for example Ch10_02.
//------------------------------------------------
//        Ch10_02.cpp
//------------------------------------------------
#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include "YmmVal.h"
using namespace std;
struct alignas(32) YmmVal2
{
  YmmVal m_YmmVal0;
  YmmVal m_YmmVal1;
};
extern "C" YmmVal2 Avx2UnpackU32_U64_(const YmmVal& a, const YmmVal& b);
extern "C" void Avx2PackI32_I16_(const YmmVal& a, const YmmVal& b, YmmVal* c);
void Avx2UnpackU32_U64(void)
{
  alignas(32) YmmVal a;
  alignas(32) YmmVal b;
  a.m_U32[0] = 0x00000000; b.m_U32[0] = 0x88888888;
  a.m_U32[1] = 0x11111111; b.m_U32[1] = 0x99999999;
  a.m_U32[2] = 0x22222222; b.m_U32[2] = 0xaaaaaaaa;
  a.m_U32[3] = 0x33333333; b.m_U32[3] = 0xbbbbbbbb;
  a.m_U32[4] = 0x44444444; b.m_U32[4] = 0xcccccccc;
  a.m_U32[5] = 0x55555555; b.m_U32[5] = 0xdddddddd;
  a.m_U32[6] = 0x66666666; b.m_U32[6] = 0xeeeeeeee;
  a.m_U32[7] = 0x77777777; b.m_U32[7] = 0xffffffff;
  YmmVal2 c = Avx2UnpackU32_U64_(a, b);
  cout << "\nResults for Avx2UnpackU32_U64\n";
  cout << "a lo      " << a.ToStringX32(0) << '\n';
  cout << "b lo      " << b.ToStringX32(0) << '\n';
  cout << '\n';
  cout << "a hi      " << a.ToStringX32(1) << '\n';
  cout << "b hi      " << b.ToStringX32(1) << '\n';
  cout << "\nvpunpckldq result\n";
  cout << "c.m_YmmVal0 lo " << c.m_YmmVal0.ToStringX64(0) << '\n';
  cout << "c.m_YmmVal0 hi " << c.m_YmmVal0.ToStringX64(1) << '\n';
  cout << "\nvpunpckhdq result\n";
  cout << "c.m_YmmVal1 lo " << c.m_YmmVal1.ToStringX64(0) << '\n';
  cout << "c.m_YmmVal1 hi " << c.m_YmmVal1.ToStringX64(1) << '\n';
}
void Avx2PackI32_I16(void)
{
  alignas(32) YmmVal a;
  alignas(32) YmmVal b;
  alignas(32) YmmVal c;
  a.m_I32[0] = 10;     b.m_I32[0] = 32768;
  a.m_I32[1] = -200000;   b.m_I32[1] = 6500;
  a.m_I32[2] = 300000;   b.m_I32[2] = 42000;
  a.m_I32[3] = -4000;    b.m_I32[3] = -68000;
  a.m_I32[4] = 9000;    b.m_I32[4] = 25000;
  a.m_I32[5] = 80000;    b.m_I32[5] = 500000;
  a.m_I32[6] = 200;     b.m_I32[6] = -7000;
  a.m_I32[7] = -32769;   b.m_I32[7] = 12500;
  Avx2PackI32_I16_(a, b, &c);
  cout << "\nResults for Avx2PackI32_I16\n";
  cout << "a lo " << a.ToStringI32(0) << '\n';
  cout << "a hi " << a.ToStringI32(1) << '\n';
  cout << '\n';
  cout << "b lo " << b.ToStringI32(0) << '\n';
  cout << "b hi " << b.ToStringI32(1) << '\n';
  cout << '\n';
  cout << "c lo " << c.ToStringI16(0) << '\n';
  cout << "c hi " << c.ToStringI16(1) << '\n';
  cout << '\n';
}
int main()
{
  Avx2UnpackU32_U64();
  Avx2PackI32_I16();
  return 0;
}
;-------------------------------------------------
;        Ch10_02.asm
;-------------------------------------------------
; extern "C" YmmVal2 Avx2UnpackU32_U64_(const YmmVal& a, const YmmVal& b);
    .code
Avx2UnpackU32_U64_ proc
; Load argument values
    vmovdqa ymm0,ymmword ptr [rdx]   ;ymm0 = a
    vmovdqa ymm1,ymmword ptr [r8]    ;ymm1 = b
; Perform dword to qword unpacks
    vpunpckldq ymm2,ymm0,ymm1      ;unpack low doublewords
    vpunpckhdq ymm3,ymm0,ymm1      ;unpack high doublewords
; Save result to YmmVal2 buffer
    vmovdqa ymmword ptr [rcx],ymm2   ;save low result
    vmovdqa ymmword ptr [rcx+32],ymm3  ;save high result
    mov rax,rcx             ;rax = ptr to YmmVal2
    vzeroupper
    ret
Avx2UnpackU32_U64_ endp
; extern "C" void Avx2PackI32_I16_(const YmmVal& a, const YmmVal& b, YmmVal* c);
Avx2PackI32_I16_ proc
; Load argument values
    vmovdqa ymm0,ymmword ptr [rcx]   ;ymm0 = a
    vmovdqa ymm1,ymmword ptr [rdx]   ;ymm1 = b
; Perform pack dword to word with signed saturation
    vpackssdw ymm2,ymm0,ymm1      ;ymm2 = packed words
    vmovdqa ymmword ptr [r8],ymm2    ;save result
    vzeroupper
    ret
Avx2PackI32_I16_ endp
Foo1_ proc
    ret
Foo1_ endp
    end
Listing 10-2. Example Ch10_02

The C++ code in Listing 10-2 begins with the declaration of a structure named YmmVal2. This structure contains two YmmVal members: m_YmmVal0 and m_YmmVal1. Note that the alignas(32) specifier is used immediately after the keyword struct. Using this specifier ensures that all instances of YmmVal2 are aligned on a 32-byte boundary, including temporary instances created by the compiler. More on this in a moment. The assembly language function Avx2UnpackU32_U64_, whose declaration follows, returns an instance of YmmVal2 by value.

The C++ function Avx2UnpackU32_U64 begins by initializing the unsigned doubleword elements of YmmVal variables a and b. Following variable initialization is the statement YmmVal2 c = Avx2UnpackU32_U64_(a, b), which calls the assembly language function Avx2UnpackU32_U64_ to unpack the elements of a and b from doublewords to quadwords. Unlike previous examples, Avx2UnpackU32_U64_ returns its YmmVal2 result by value. Before proceeding, it is important to note that in most cases, returning a user-defined structure like YmmVal2 by value is less efficient than passing a pointer argument to a variable of type YmmVal2. The function Avx2UnpackU32_U64_ uses return-by-value principally for demonstration purposes and to elucidate the Visual C++ calling convention protocols that an assembly language function must observe when returning a structure by value is warranted. The remaining statements in Avx2UnpackU32_U64 stream the results from Avx2UnpackU32_U64_ to cout.

Following Avx2UnpackU32_U64 is the C++ function Avx2PackI32_I16. This function initializes the signed doubleword elements of YmmVal variables a and b. These values will be size-reduced to packed words. Subsequent to YmmVal variable initialization, Avx2PackI32_I16 calls the assembly language function Avx2PackI32_I16_ to carry out the aforementioned size reduction. The results are then streamed to cout.

The calling convention that Visual C++ uses for functions that return a structure by value varies somewhat from the normal calling convention. Upon entry to the assembly language function Avx2UnpackU32_U64_, register RCX points to a temporary buffer where Avx2UnpackU32_U64_ must store its YmmVal2 return result. It is important to note that this buffer is not necessarily the same memory location as the destination YmmVal2 variable in the C++ statement that called Avx2UnpackU32_U64_. In order to implement expression evaluation and operator overloading, a C++ compiler often generates code that allocates temporary variables (or rvalues) to hold intermediate results. An rvalue that needs to be saved is ultimately copied to a named variable (or lvalue) using either a default or overloaded assignment operator. This copy operation is the reason why returning a structure by value is usually slower than passing a pointer argument. The alignas(32) specifier that’s used in the declaration of struct YmmVal2 directs the Visual C++ compiler to align all variables of type YmmVal2 including rvalues on a 32-byte boundary.

If the subject matter of the preceding paragraph seems a little abstract, don’t worry. Temporary storage space allocation for return-by-value structures is handled automatically by the C++ compiler. It’s more important to understand the following Visual C++ calling convention requirements that must be observed by any function that returns a large structure (any structure whose size is greater than eight bytes) by value:
  • The caller of a function that returns a large structure by value must allocate storage space for the returned structure. A pointer to this storage space must be passed to the called function in register RCX.

  • The normal calling convention argument registers are “right-shifted” by one. This means that the first three arguments are passed using registers RDX/XMM1, R8/XMM2, and R9/XMM3. Any remaining arguments are passed on the stack.

  • Prior to returning, the called function must load register RAX with a pointer to the returned structure.

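If the size of a return-by-value structure is 1, 2, 4, or 8 bytes, it must be returned in register RAX (the Visual C++ documentation additionally requires that such a type have no user-defined constructor, destructor, or copy assignment operator). The normal calling convention argument registers are used in these situations.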
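One way to picture these requirements is to write out, in C++, the pointer-based function that a return-by-value function is effectively lowered to. The sketch below is conceptual only: Pair256 is a hypothetical stand-in for YmmVal2, and MakePair and MakePairLowered are illustrative names that do not appear in example Ch10_02.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 64-byte structure standing in for YmmVal2.
struct alignas(32) Pair256
{
    std::uint32_t m_Lo[8];
    std::uint32_t m_Hi[8];
};

// Conceptual lowering of:  Pair256 MakePair(uint32_t x, uint32_t y);
// The caller supplies the return buffer as a hidden first argument (RCX),
// x and y move one register to the right (RDX and R8), and the buffer
// address is handed back to the caller (RAX).
inline Pair256* MakePairLowered(Pair256* ret, std::uint32_t x, std::uint32_t y)
{
    // alignas(32) on the struct guarantees this for every instance,
    // including compiler-generated temporaries.
    assert(reinterpret_cast<std::uintptr_t>(ret) % 32 == 0);

    for (int i = 0; i < 8; i++)
    {
        ret->m_Lo[i] = x;
        ret->m_Hi[i] = y;
    }
    return ret;     // plays the role of mov rax,rcx
}

// The by-value form that the C++ programmer writes.
inline Pair256 MakePair(std::uint32_t x, std::uint32_t y)
{
    Pair256 result;
    MakePairLowered(&result, x, y);
    return result;
}
```

An assembly language function that returns a large structure by value has the shape of MakePairLowered: it writes through the buffer pointer it receives and returns that same pointer.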

Returning to the code, function Avx2UnpackU32_U64_ begins with a vmovdqa ymm0,ymmword ptr [rdx] instruction that loads YmmVal a (the first function argument) into register YMM0. The ensuing vmovdqa ymm1,ymmword ptr [r8] instruction loads YmmVal b (the second function argument) into register YMM1. The next two instructions, vpunpckldq ymm2,ymm0,ymm1 and vpunpckhdq ymm3,ymm0,ymm1, unpack the doublewords into quadwords, as shown in Figure 10-1. The results are then saved to the YmmVal2 buffer pointed to by RCX using two vmovdqa instructions. Note that two vmovdqu instructions would be required here if the structure YmmVal2 were declared without the alignas(32) specifier. As previously mentioned, the Visual C++ calling convention requires any function that returns a structure by value to load a copy of the structure buffer pointer into register RAX prior to returning. The mov rax,rcx instruction fulfills this requirement (recall that RCX contains a pointer to the structure buffer).
Figure 10-1. Execution of the vpunpckldq and vpunpckhdq instructions
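The lane-by-lane behavior of the two unpack instructions can also be written as a short scalar model. The C++ sketch below is an illustration under the assumption that each 256-bit operand is viewed as eight doublewords; the type aliases and function names are hypothetical. The check function uses the same test values as Avx2UnpackU32_U64.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using U32x8 = std::array<std::uint32_t, 8>;
using U64x4 = std::array<std::uint64_t, 4>;

// highHalf == false models vpunpckldq, highHalf == true models vpunpckhdq.
// Each 128-bit lane (four doublewords) is processed independently.
inline U64x4 UnpackDwords(const U32x8& a, const U32x8& b, bool highHalf)
{
    U64x4 r{};
    for (std::size_t lane = 0; lane < 2; lane++)
    {
        // First source doubleword used within this lane.
        const std::size_t src = lane * 4 + (highHalf ? 2 : 0);
        for (std::size_t j = 0; j < 2; j++)
        {
            // Interleave: a supplies the low doubleword, b the high doubleword.
            r[lane * 2 + j] = (static_cast<std::uint64_t>(b[src + j]) << 32) | a[src + j];
        }
    }
    return r;
}

inline void CheckUnpackAgainstCh10_02()
{
    U32x8 a{}, b{};
    for (std::uint32_t i = 0; i < 8; i++)
    {
        a[i] = 0x11111111u * i;         // 0x00000000 ... 0x77777777
        b[i] = 0x11111111u * (i + 8);   // 0x88888888 ... 0xffffffff
    }
    const U64x4 lo = UnpackDwords(a, b, false);
    const U64x4 hi = UnpackDwords(a, b, true);
    assert(lo[0] == 0x8888888800000000ull && lo[1] == 0x9999999911111111ull);
    assert(lo[2] == 0xCCCCCCCC44444444ull && lo[3] == 0xDDDDDDDD55555555ull);
    assert(hi[0] == 0xAAAAAAAA22222222ull && hi[1] == 0xBBBBBBBB33333333ull);
    assert(hi[2] == 0xEEEEEEEE66666666ull && hi[3] == 0xFFFFFFFF77777777ull);
}
```

Note that the low-order unpack draws from doublewords 0, 1, 4, and 5 of each source rather than from doublewords 0 through 3, which is a direct consequence of the two independent lanes.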

The assembly language function Avx2PackI32_I16_ demonstrates use of the vpackssdw (Pack with Signed Saturation) instruction. In this function, the vpackssdw ymm2,ymm0,ymm1 instruction converts the 16 doubleword integers in registers YMM0 and YMM1 to word integers using signed saturation. It then saves the 16 word integers in register YMM2. Figure 10-2 illustrates the execution of this instruction. X86-AVX also includes a vpacksswb instruction that performs signed word to byte size reductions. The vpackus[dw|wb] instructions perform similar size reductions (doubleword to word and word to byte) using unsigned saturation.
Figure 10-2. Execution of the vpackssdw instruction
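The element ordering produced by vpackssdw is easy to get wrong, so a scalar model may help. The C++ sketch below is illustrative only, with hypothetical type aliases and function names; the check function uses the test values from Avx2PackI32_I16.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using I32x8  = std::array<std::int32_t, 8>;
using I16x16 = std::array<std::int16_t, 16>;

// Clamp a doubleword to the signed word range [-32768, 32767].
inline std::int16_t SaturateI32ToI16(std::int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return static_cast<std::int16_t>(x);
}

// Model of vpackssdw with 256-bit operands. Within each 128-bit lane,
// the four doublewords of a come first, followed by the four of b.
inline I16x16 PackI32ToI16(const I32x8& a, const I32x8& b)
{
    I16x16 r{};
    for (std::size_t lane = 0; lane < 2; lane++)
    {
        for (std::size_t j = 0; j < 4; j++)
        {
            r[lane * 8 + j]     = SaturateI32ToI16(a[lane * 4 + j]);
            r[lane * 8 + 4 + j] = SaturateI32ToI16(b[lane * 4 + j]);
        }
    }
    return r;
}

inline void CheckPackAgainstCh10_02()
{
    const I32x8 a = { 10, -200000, 300000, -4000, 9000, 80000, 200, -32769 };
    const I32x8 b = { 32768, 6500, 42000, -68000, 25000, 500000, -7000, 12500 };
    const I16x16 expected = {
        10, -32768, 32767, -4000, 32767, 6500, 32767, -32768,      // lane 0
        9000, 32767, 200, -32768, 25000, 32767, -7000, 12500 };    // lane 1
    assert(PackI32ToI16(a, b) == expected);
}
```

The result is therefore not simply all of a followed by all of b; the two sources alternate in groups of four words, one group per lane.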

Note that in Figures 10-1 and 10-2, the vpunpckldq, vpunpckhdq, and vpackssdw instructions carry out their operations using two 128-bit wide independent lanes, as explained in Chapter 4. Here are the results for source code example Ch10_02:
Results for Avx2UnpackU32_U64
a lo          00000000    11111111  |    22222222    33333333
b lo          88888888    99999999  |    AAAAAAAA    BBBBBBBB
a hi          44444444    55555555  |    66666666    77777777
b hi          CCCCCCCC    DDDDDDDD  |    EEEEEEEE    FFFFFFFF
vpunpckldq result
c.m_YmmVal0 lo         8888888800000000  |        9999999911111111
c.m_YmmVal0 hi         CCCCCCCC44444444  |        DDDDDDDD55555555
vpunpckhdq result
c.m_YmmVal1 lo         AAAAAAAA22222222  |        BBBBBBBB33333333
c.m_YmmVal1 hi         EEEEEEEE66666666  |        FFFFFFFF77777777
Results for Avx2PackI32_I16
a lo        10     -200000  |     300000      -4000
a hi       9000      80000  |       200     -32769
b lo      32768      6500  |      42000     -68000
b hi      25000     500000  |      -7000      12500
c lo    10 -32768  32767  -4000  |  32767  6500  32767 -32768
c hi   9000  32767   200 -32768  |  25000  32767  -7000  12500

Size Promotions

In Chapter 7, you learned how to use the vpunpckl[bw|wd] and vpunpckh[bw|wd] instructions to size-promote packed integers (see source code examples Ch07_05, Ch07_06, and Ch07_08). The next source code example, Ch10_03, demonstrates how to employ the vpmovzx[bw|bd] and vpmovsx[wd|wq] instructions to size-promote packed integers using either zero or sign extension. Listing 10-3 shows the source code for example Ch10_03.
//------------------------------------------------
//        Ch10_03.cpp
//------------------------------------------------
#include "stdafx.h"
#include <cstdint>
#include <iostream>
#include <string>
#include "YmmVal.h"
using namespace std;
extern "C" void Avx2ZeroExtU8_U16_(YmmVal*a, YmmVal b[2]);
extern "C" void Avx2ZeroExtU8_U32_(YmmVal*a, YmmVal b[4]);
extern "C" void Avx2SignExtI16_I32_(YmmVal*a, YmmVal b[2]);
extern "C" void Avx2SignExtI16_I64_(YmmVal*a, YmmVal b[4]);
const string c_Line(80, '-');
void Avx2ZeroExtU8_U16(void)
{
  alignas(32) YmmVal a;
  alignas(32) YmmVal b[2];
  for (int i = 0; i < 32; i++)
    a.m_U8[i] = (uint8_t)(i * 8);
  Avx2ZeroExtU8_U16_(&a, b);
  cout << "\nResults for Avx2ZeroExtU8_U16_\n";
  cout << c_Line << '\n';
  cout << "a (0:15):  " << a.ToStringU8(0) << '\n';
  cout << "a (16:31): " << a.ToStringU8(1) << '\n';
  cout << '\n';
  cout << "b (0:7):  " << b[0].ToStringU16(0) << '\n';
  cout << "b (8:15):  " << b[0].ToStringU16(1) << '\n';
  cout << "b (16:23): " << b[1].ToStringU16(0) << '\n';
  cout << "b (24:31): " << b[1].ToStringU16(1) << '\n';
}
void Avx2ZeroExtU8_U32(void)
{
  alignas(32) YmmVal a;
  alignas(32) YmmVal b[4];
  for (int i = 0; i < 32; i++)
    a.m_U8[i] = (uint8_t)(255 - i * 8);
  Avx2ZeroExtU8_U32_(&a, b);
  cout << "\nResults for Avx2ZeroExtU8_U32_\n";
  cout << c_Line << '\n';
  cout << "a (0:15):  " << a.ToStringU8(0) << '\n';
  cout << "a (16:31): " << a.ToStringU8(1) << '\n';
  cout << '\n';
  cout << "b (0:3):  " << b[0].ToStringU32(0) << '\n';
  cout << "b (4:7):  " << b[0].ToStringU32(1) << '\n';
  cout << "b (8:11):  " << b[1].ToStringU32(0) << '\n';
  cout << "b (12:15): " << b[1].ToStringU32(1) << '\n';
  cout << "b (16:19): " << b[2].ToStringU32(0) << '\n';
  cout << "b (20:23): " << b[2].ToStringU32(1) << '\n';
  cout << "b (24:27): " << b[3].ToStringU32(0) << '\n';
  cout << "b (28:31): " << b[3].ToStringU32(1) << '\n';
}
void Avx2SignExtI16_I32()
{
  alignas(32) YmmVal a;
  alignas(32) YmmVal b[2];
  for (int i = 0; i < 16; i++)
    a.m_I16[i] = (int16_t)(-32768 + i * 4000);
  Avx2SignExtI16_I32_(&a, b);
  cout << "\nResults for Avx2SignExtI16_I32_\n";
  cout << c_Line << '\n';
  cout << "a (0:7):  " << a.ToStringI16(0) << '\n';
  cout << "a (8:15):  " << a.ToStringI16(1) << '\n';
  cout << '\n';
  cout << "b (0:3):  " << b[0].ToStringI32(0) << '\n';
  cout << "b (4:7):  " << b[0].ToStringI32(1) << '\n';
  cout << "b (8:11):  " << b[1].ToStringI32(0) << '\n';
  cout << "b (12:15): " << b[1].ToStringI32(1) << '\n';
}
void Avx2SignExtI16_I64()
{
  alignas(32) YmmVal a;
  alignas(32) YmmVal b[4];
  for (int i = 0; i < 16; i++)
    a.m_I16[i] = (int16_t)(32767 - i * 4000);
  Avx2SignExtI16_I64_(&a, b);
  cout << "\nResults for Avx2SignExtI16_I64_\n";
  cout << c_Line << '\n';
  cout << "a (0:7):  " << a.ToStringI16(0) << '\n';
  cout << "a (8:15):  " << a.ToStringI16(1) << '\n';
  cout << '\n';
  cout << "b (0:1):  " << b[0].ToStringI64(0) << '\n';
  cout << "b (2:3):  " << b[0].ToStringI64(1) << '\n';
  cout << "b (4:5):  " << b[1].ToStringI64(0) << '\n';
  cout << "b (6:7):  " << b[1].ToStringI64(1) << '\n';
  cout << "b (8:9):  " << b[2].ToStringI64(0) << '\n';
  cout << "b (10:11): " << b[2].ToStringI64(1) << '\n';
  cout << "b (12:13): " << b[3].ToStringI64(0) << '\n';
  cout << "b (14:15): " << b[3].ToStringI64(1) << '\n';
}
int main()
{
  Avx2ZeroExtU8_U16();
  Avx2ZeroExtU8_U32();
  Avx2SignExtI16_I32();
  Avx2SignExtI16_I64();
  return 0;
}
;-------------------------------------------------
;        Ch10_03.asm
;-------------------------------------------------
; extern "C" void Avx2ZeroExtU8_U16_(YmmVal*a, YmmVal b[2]);
    .code
Avx2ZeroExtU8_U16_ proc
    vpmovzxbw ymm0,xmmword ptr [rcx]    ;zero extend a[0] - a[15]
    vpmovzxbw ymm1,xmmword ptr [rcx+16]   ;zero extend a[16] - a[31]
    vmovdqa ymmword ptr [rdx],ymm0     ;save results
    vmovdqa ymmword ptr [rdx+32],ymm1
    vzeroupper
    ret
Avx2ZeroExtU8_U16_ endp
; extern "C" void Avx2ZeroExtU8_U32_(YmmVal*a, YmmVal b[4]);
Avx2ZeroExtU8_U32_ proc
    vpmovzxbd ymm0,qword ptr [rcx]     ;zero extend a[0] - a[7]
    vpmovzxbd ymm1,qword ptr [rcx+8]    ;zero extend a[8] - a[15]
    vpmovzxbd ymm2,qword ptr [rcx+16]    ;zero extend a[16] - a[23]
    vpmovzxbd ymm3,qword ptr [rcx+24]    ;zero extend a[24] - a[31]
    vmovdqa ymmword ptr [rdx],ymm0     ;save results
    vmovdqa ymmword ptr [rdx+32],ymm1
    vmovdqa ymmword ptr [rdx+64],ymm2
    vmovdqa ymmword ptr [rdx+96],ymm3
    vzeroupper
    ret
Avx2ZeroExtU8_U32_ endp
; extern "C" void Avx2SignExtI16_I32_(YmmVal*a, YmmVal b[2])
Avx2SignExtI16_I32_ proc
    vpmovsxwd ymm0,xmmword ptr [rcx]    ;sign extend a[0] - a[7]
    vpmovsxwd ymm1,xmmword ptr [rcx+16]   ;sign extend a[8] - a[15]
    vmovdqa ymmword ptr [rdx],ymm0     ;save results
    vmovdqa ymmword ptr [rdx+32],ymm1
    vzeroupper
    ret
Avx2SignExtI16_I32_ endp
; extern "C" void Avx2SignExtI16_I64_(YmmVal*a, YmmVal b[4])
Avx2SignExtI16_I64_ proc
    vpmovsxwq ymm0,qword ptr [rcx]     ;sign extend a[0] - a[3]
    vpmovsxwq ymm1,qword ptr [rcx+8]    ;sign extend a[4] - a[7]
    vpmovsxwq ymm2,qword ptr [rcx+16]    ;sign extend a[8] - a[11]
    vpmovsxwq ymm3,qword ptr [rcx+24]    ;sign extend a[12] - a[15]
    vmovdqa ymmword ptr [rdx],ymm0     ;save results
    vmovdqa ymmword ptr [rdx+32],ymm1
    vmovdqa ymmword ptr [rdx+64],ymm2
    vmovdqa ymmword ptr [rdx+96],ymm3
    vzeroupper
    ret
Avx2SignExtI16_I64_ endp
    end
Listing 10-3. Example Ch10_03

The C++ code in Listing 10-3 contains four functions that initialize test cases for various packed size-promotion operations. The first function, Avx2ZeroExtU8_U16, begins by initializing the unsigned byte elements of YmmVal a. It then calls the assembly language function Avx2ZeroExtU8_U16_ to size-promote the packed unsigned bytes into packed unsigned words. The function Avx2ZeroExtU8_U32 performs a similar set of initializations to demonstrate packed unsigned byte to packed unsigned doubleword promotions. The functions Avx2SignExtI16_I32 and Avx2SignExtI16_I64 initialize test cases for packed signed word to packed signed doubleword and packed signed quadword size promotions.

The first instruction in the assembly language function Avx2ZeroExtU8_U16_, vpmovzxbw ymm0,xmmword ptr [rcx], loads and zero-extends the 16 low-order bytes of YmmVal a (pointed to by register RCX) and saves these values in register YMM0. The ensuing vpmovzxbw ymm1,xmmword ptr [rcx+16] instruction performs the same operation using the 16 high-order bytes of YmmVal a. The function Avx2ZeroExtU8_U16_ then uses two vmovdqa instructions to save the size-promoted results.

The assembly language function Avx2ZeroExtU8_U32_ performs packed byte to doubleword size promotions. The first instruction, vpmovzxbd ymm0,qword ptr [rcx], loads and zero-extends the eight low-order bytes of YmmVal a into doublewords and saves these values in register YMM0. The three ensuing vpmovzxbd instructions size-promote the remaining byte values in YmmVal a. The results are then saved using a series of vmovdqa instructions. When working with unsigned 8-bit values, it is sometimes (depending on the algorithm) more expedient to use the vpmovzxbd instruction to perform a packed byte to packed doubleword size promotion instead of a semantically equivalent series of vpunpckl[bw|wd] and vpunpckh[bw|wd] instructions. You will see an example of this in Chapter 14.

The assembly language functions Avx2SignExtI16_I32_ and Avx2SignExtI16_I64_ demonstrate how to use the vpmovsxwd and vpmovsxwq instructions, respectively. These instructions size-promote and sign-extend packed word integers to doublewords and quadwords. X86-AVX also includes the packed move with sign extension instructions vpmovsx[bw|bd|bq] and vpmovsxdq. Here is the output for source code example Ch10_03:
Results for Avx2ZeroExtU8_U16_
--------------------------------------------------------------------------------
a (0:15):   0  8 16 24 32 40 48 56  | 64 72 80 88 96 104 112 120
a (16:31):  128 136 144 152 160 168 176 184  | 192 200 208 216 224 232 240 248
b (0:7):      0    8   16   24  |   32   40   48   56
b (8:15):     64   72   80   88  |   96   104   112   120
b (16:23):    128   136   144   152  |   160   168   176   184
b (24:31):    192   200   208   216  |   224   232   240   248
Results for Avx2ZeroExtU8_U32_
--------------------------------------------------------------------------------
a (0:15):  255 247 239 231 223 215 207 199  | 191 183 175 167 159 151 143 135
a (16:31):  127 119 111 103 95 87 79 71  | 63 55 47 39 31 23 15  7
b (0:3):         255       247  |       239       231
b (4:7):         223       215  |       207       199
b (8:11):        191       183  |       175       167
b (12:15):        159       151  |       143       135
b (16:19):        127       119  |       111       103
b (20:23):        95       87  |       79       71
b (24:27):        63       55  |       47       39
b (28:31):        31       23  |       15        7
Results for Avx2SignExtI16_I32_
--------------------------------------------------------------------------------
a (0:7):   -32768 -28768 -24768 -20768  | -16768 -12768  -8768  -4768
a (8:15):    -768  3232  7232  11232  |  15232  19232  23232  27232
b (0:3):       -32768     -28768  |     -24768     -20768
b (4:7):       -16768     -12768  |      -8768      -4768
b (8:11):        -768      3232  |      7232      11232
b (12:15):       15232      19232  |      23232      27232
Results for Avx2SignExtI16_I64_
--------------------------------------------------------------------------------
a (0:7):    32767  28767  24767  20767  |  16767  12767  8767  4767
a (8:15):    767  -3233  -7233 -11233  | -15233 -19233 -23233 -27233
b (0:1):                32767  |              28767
b (2:3):                24767  |              20767
b (4:5):                16767  |              12767
b (6:7):                8767  |              4767
b (8:9):                 767  |              -3233
b (10:11):               -7233  |             -11233
b (12:13):              -15233  |             -19233
b (14:15):              -23233  |             -27233
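
The difference between the zero-extending and sign-extending promotions can be modeled in portable C++. The following is a minimal scalar sketch, not code from the example: the helper names ZeroExtU8ToU32 and SignExtI16ToI32 are hypothetical, and each loop iteration stands in for one element of a vpmovzxbd or vpmovsxwd instruction.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model of vpmovzxbd: each unsigned byte becomes a
// doubleword whose upper 24 bits are zero.
std::array<uint32_t, 8> ZeroExtU8ToU32(const std::array<uint8_t, 8>& src)
{
    std::array<uint32_t, 8> des {};
    for (size_t i = 0; i < src.size(); i++)
        des[i] = static_cast<uint32_t>(src[i]);
    return des;
}

// Hypothetical scalar model of vpmovsxwd: each signed word becomes a
// doubleword whose upper 16 bits replicate the word's sign bit.
std::array<int32_t, 8> SignExtI16ToI32(const std::array<int16_t, 8>& src)
{
    std::array<int32_t, 8> des {};
    for (size_t i = 0; i < src.size(); i++)
        des[i] = static_cast<int32_t>(src[i]);
    return des;
}
```

For example, the byte value 255 promotes to the doubleword value 255 under zero extension, while the word value -768 (0xFD00) promotes to 0xFFFFFD00 under sign extension, which is still -768.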

Packed Integer Image Processing

In Chapter 7, you learned how to use the AVX instruction set to perform some common image processing operations using 128-bit wide packed unsigned integer operands. The source code examples of this section demonstrate additional image processing methods using AVX2 instructions with 256-bit wide packed unsigned integer operands. The first source code example illustrates how to clip the pixel intensity values of a grayscale image. This is followed by an example that determines the minimum and maximum pixel intensity values of an RGB image. The final source code example uses the AVX2 instruction set to perform RGB to grayscale image conversion.

Pixel Clipping

Pixel clipping is an image processing technique that bounds the intensity values of each pixel in an image between two threshold limits. This technique is often used to reduce the dynamic range of an image by eliminating its extremely dark and light pixels. Source code example Ch10_04 illustrates how to use the AVX2 instruction set to clip the pixels of an 8-bit grayscale image. Listing 10-4 shows the C++ and assembly language source code for example Ch10_04.
//------------------------------------------------
//        Ch10_04.h
//------------------------------------------------
#pragma once
#include <cstdint>
// The following structure must match the structure that's declared in the .asm file
struct ClipData
{
  uint8_t* m_Src;         // source buffer pointer
  uint8_t* m_Des;         // destination buffer pointer
  uint64_t m_NumPixels;      // number of pixels
  uint64_t m_NumClippedPixels;  // number of clipped pixels
  uint8_t m_ThreshLo;       // low threshold
  uint8_t m_ThreshHi;       // high threshold
};
// Functions defined in Ch10_04.cpp
extern void Init(uint8_t* x, uint64_t n, unsigned int seed);
extern bool Avx2ClipPixelsCpp(ClipData* cd);
// Functions defined in Ch10_04_.asm
extern "C" bool Avx2ClipPixels_(ClipData* cd);
// Functions defined in Ch10_04_BM.cpp
extern void Avx2ClipPixels_BM(void);
//------------------------------------------------
//        Ch10_04.cpp
//------------------------------------------------
#include "stdafx.h"
#include <iostream>
#include <random>
#include <memory.h>
#include <limits>
#include "Ch10_04.h"
#include "AlignedMem.h"
using namespace std;
void Init(uint8_t* x, uint64_t n, unsigned int seed)
{
  uniform_int_distribution<> ui_dist {0, 255};
  default_random_engine rng {seed};
  for (size_t i = 0; i < n; i++)
    x[i] = (uint8_t)ui_dist(rng);
}
bool Avx2ClipPixelsCpp(ClipData* cd)
{
  uint8_t* src = cd->m_Src;
  uint8_t* des = cd->m_Des;
  uint64_t num_pixels = cd->m_NumPixels;
  if (num_pixels == 0 || (num_pixels % 32) != 0)
    return false;
  if (!AlignedMem::IsAligned(src, 32) || !AlignedMem::IsAligned(des, 32))
    return false;
  uint64_t num_clipped_pixels = 0;
  uint8_t thresh_lo = cd->m_ThreshLo;
  uint8_t thresh_hi = cd->m_ThreshHi;
  for (uint64_t i = 0; i < num_pixels; i++)
  {
    uint8_t pixel = src[i];
    if (pixel < thresh_lo)
    {
      des[i] = thresh_lo;
      num_clipped_pixels++;
    }
    else if (pixel > thresh_hi)
    {
      des[i] = thresh_hi;
      num_clipped_pixels++;
    }
    else
      des[i] = src[i];
  }
  cd->m_NumClippedPixels = num_clipped_pixels;
  return true;
}
void Avx2ClipPixels(void)
{
  const uint8_t thresh_lo = 10;
  const uint8_t thresh_hi = 245;
  const uint64_t num_pixels = 4 * 1024 * 1024;
  AlignedArray<uint8_t> src(num_pixels, 32);
  AlignedArray<uint8_t> des1(num_pixels, 32);
  AlignedArray<uint8_t> des2(num_pixels, 32);
  Init(src.Data(), num_pixels, 157);
  ClipData cd1;
  ClipData cd2;
  cd1.m_Src = src.Data();
  cd1.m_Des = des1.Data();
  cd1.m_NumPixels = num_pixels;
  cd1.m_NumClippedPixels = numeric_limits<uint64_t>::max();
  cd1.m_ThreshLo = thresh_lo;
  cd1.m_ThreshHi = thresh_hi;
  cd2.m_Src = src.Data();
  cd2.m_Des = des2.Data();
  cd2.m_NumPixels = num_pixels;
  cd2.m_NumClippedPixels = numeric_limits<uint64_t>::max();
  cd2.m_ThreshLo = thresh_lo;
  cd2.m_ThreshHi = thresh_hi;
  Avx2ClipPixelsCpp(&cd1);
  Avx2ClipPixels_(&cd2);
  cout << "\nResults for Avx2ClipPixels\n";
  cout << " cd1.m_NumClippedPixels1: " << cd1.m_NumClippedPixels << '\n';
  cout << " cd2.m_NumClippedPixels2: " << cd2.m_NumClippedPixels << '\n';
  if (cd1.m_NumClippedPixels != cd2.m_NumClippedPixels)
    cout << " NumClippedPixels compare error\n";
  if (memcmp(des1.Data(), des2.Data(), num_pixels) == 0)
    cout << " Pixel buffer memory compare passed\n";
  else
    cout << " Pixel buffer memory compare failed\n";
}
int main(void)
{
  Avx2ClipPixels();
  Avx2ClipPixels_BM();
  return 0;
}
;-------------------------------------------------
;        Ch10_04.asm
;-------------------------------------------------
; The following structure must match the structure that's declared in the .h file
ClipData      struct
Src         qword ?       ;source buffer pointer
Des         qword ?       ;destination buffer pointer
NumPixels      qword ?       ;number of pixels
NumClippedPixels  qword ?       ;number of clipped pixels
ThreshLo      byte ?       ;low threshold
ThreshHi      byte ?       ;high threshold
ClipData      ends
; extern "C" bool Avx2ClipPixels_(ClipData* cd)
      .code
Avx2ClipPixels_ proc
; Load and validate arguments
    xor eax,eax             ;set error return code
    xor r8d,r8d             ;r8 = number of clipped pixels
    mov rdx,[rcx+ClipData.NumPixels]  ;rdx = num_pixels
    or rdx,rdx
    jz Done               ;jump if num_pixels is zero
    test rdx,1fh
    jnz Done              ;jump if num_pixels % 32 != 0
    mov r10,[rcx+ClipData.Src]     ;r10 = Src
    test r10,1fh
    jnz Done              ;jump if Src is misaligned
    mov r11,[rcx+ClipData.Des]     ;r11 = Des
    test r11,1fh
    jnz Done              ;jump if Des is misaligned
; Create packed thresh_lo and thresh_hi data values
    vpbroadcastb ymm4,[rcx+ClipData.ThreshLo]  ;ymm4 = packed thresh_lo
    vpbroadcastb ymm5,[rcx+ClipData.ThreshHi]  ;ymm5 = packed thresh_hi
; Clip pixels to threshold values
@@:   vmovdqa ymm0,ymmword ptr [r10]   ;ymm0 = 32 pixels
    vpmaxub ymm1,ymm0,ymm4       ;clip to thresh_lo
    vpminub ymm2,ymm1,ymm5       ;clip to thresh_hi
    vmovdqa ymmword ptr [r11],ymm2   ;save clipped pixels
; Count number of clipped pixels
    vpcmpeqb ymm3,ymm2,ymm0       ;compare clipped pixels to original
    vpmovmskb eax,ymm3         ;eax = mask of non-clipped pixels
    not eax               ;eax = mask of clipped pixels
    popcnt eax,eax           ;eax = number of clipped pixels
    add r8,rax             ;update clipped pixel count
; Update pointers and loop counter
    add r10,32             ;update Src ptr
    add r11,32             ;update Des ptr
    sub rdx,32             ;update loop counter
    jnz @B               ;repeat if not done
    mov eax,1              ;set success return code
; Save num_clipped_pixels
Done:  mov [rcx+ClipData.NumClippedPixels],r8 ;save num_clipped_pixels
    vzeroupper
    ret
Avx2ClipPixels_ endp
    end
Listing 10-4. Example Ch10_04

The C++ code begins with the declaration of a structure named ClipData. This structure and its assembly language equivalent are used to maintain the pixel-clipping algorithm’s data. Following the function declarations in the header file Ch10_04.h is the definition of a C++ function named Init. This function initializes the elements of a uint8_t array using random values, which simulates the pixel values of a grayscale image. The function Avx2ClipPixelsCpp is a C++ implementation of the pixel clipping algorithm. This function starts by verifying that num_pixels is nonzero and evenly divisible by 32. Restricting the algorithm to images that contain an integral multiple of 32 pixels is not as inflexible as it might appear. Most digital camera images are sized using multiples of 64 pixels due to the processing requirements of the JPEG compression algorithms. Following validation of num_pixels, the source and destination pixel buffers are checked for proper alignment.

The procedure used in Avx2ClipPixelsCpp to perform pixel clipping is straightforward. A simple for loop examines each pixel element in the source image buffer. If a source image pixel’s intensity value is found to be below thresh_lo or above thresh_hi, the corresponding threshold limit is saved in the destination buffer. Source image pixels whose intensity values lie between the two threshold limits are copied to the destination pixel buffer unaltered. The processing loop in Avx2ClipPixelsCpp also counts the number of clipped pixels for comparison purposes with the assembly language version of the algorithm.

Function Avx2ClipPixels exploits the C++ template class AlignedArray to allocate and manage the required image pixel buffers (see Chapter 7 for a description of this class). Following source image pixel buffer initialization, Avx2ClipPixels primes two instances of ClipData (cd1 and cd2) for use by the pixel clipping functions Avx2ClipPixelsCpp and Avx2ClipPixels_. It then invokes these functions and compares the results for any discrepancies.

Toward the top of the assembly language code is the declaration for data structure ClipData, which is semantically equivalent to its C++ counterpart. The function Avx2ClipPixels_ begins its execution by validating num_pixels for size and divisibility by 32. It then checks the source and destination pixel buffers for proper alignment. Following argument validation, Avx2ClipPixels_ employs two vpbroadcastb instructions to create packed versions of the threshold limit values thresh_lo and thresh_hi in registers YMM4 and YMM5, respectively. During each processing loop iteration, the vmovdqa ymm0,ymmword ptr [r10] instruction loads 32 pixel values from the source image pixel buffer into register YMM0. The ensuing vpmaxub ymm1,ymm0,ymm4 instruction clips the pixel values in YMM0 to thresh_lo. This is followed by a vpminub ymm2,ymm1,ymm5 instruction that clips the pixel values to thresh_hi. The vmovdqa ymmword ptr [r11],ymm2 instruction then saves the clipped pixel intensity values to the destination image pixel buffer.
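
The two-instruction clipping step can be modeled in scalar C++. This is a minimal sketch under stated assumptions: the function name ClipBlock32 is hypothetical, std::max and std::min stand in for one byte element of vpmaxub and vpminub, and thresh_lo is assumed not to exceed thresh_hi.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model of one loop iteration in Avx2ClipPixels_.
// Clips 32 unsigned byte pixels to [thresh_lo, thresh_hi] using a max
// followed by a min instead of comparing against each limit explicitly.
void ClipBlock32(uint8_t* des, const uint8_t* src, uint8_t thresh_lo, uint8_t thresh_hi)
{
    assert(thresh_lo <= thresh_hi);     // assumption of this sketch

    for (size_t i = 0; i < 32; i++)
    {
        uint8_t t = std::max(src[i], thresh_lo);    // models vpmaxub (clip to thresh_lo)
        des[i] = std::min(t, thresh_hi);            // models vpminub (clip to thresh_hi)
    }
}
```

With thresh_lo = 10 and thresh_hi = 245, a source pixel of 3 becomes 10, a pixel of 250 becomes 245, and a pixel of 128 is left unchanged.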

Avx2ClipPixels_ counts the number of clipped pixels using a straightforward sequence of instructions. The vpcmpeqb ymm3,ymm2,ymm0 instruction compares the original pixel values in YMM0 to the clipped pixel values in YMM2 for equality. Each byte element in YMM3 is set to 0xff if the original and clipped pixel intensity values are equal; otherwise, the YMM3 byte element is set to 0x00. The vpmovmskb eax,ymm3 instruction that follows creates a mask of the most significant bit of each byte element in YMM3 and saves this mask to register EAX. More specifically, this instruction computes eax[i] = ymm3[i*8+7] for i = 0, 1, 2, … 31, which means that each 1 bit in register EAX signifies a non-clipped pixel. The ensuing not eax instruction converts the bit pattern in EAX to a mask of clipped pixels, and the popcnt eax,eax instruction counts the number of 1 bits in EAX. This count value, which corresponds to the number of clipped pixels in YMM2, is then added to the total number of clipped pixels in register R8. The processing loop repeats until all pixels have been processed. Here are the results for source code example Ch10_04:
Results for Avx2ClipPixels
 cd1.m_NumClippedPixels1: 328090
 cd2.m_NumClippedPixels2: 328090
 Pixel buffer memory compare passed
Running benchmark function Avx2ClipPixels_BM - please wait
Benchmark times save to file Ch10_04_Avx2ClipPixels_BM_CHROMIUM.csv
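
The clipped-pixel counting sequence can also be modeled in scalar C++. The following is a minimal sketch, not code from the example: the function name CountClipped32 is hypothetical, and the loop that builds the mask stands in for the vpcmpeqb and vpmovmskb instructions.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model of the vpcmpeqb / vpmovmskb / not / popcnt sequence.
// Returns the number of pixels in a 32-pixel block that were changed by clipping.
uint32_t CountClipped32(const uint8_t* original, const uint8_t* clipped)
{
    uint32_t mask = 0;

    // Bit i of mask is set when pixel i is unchanged (models vpcmpeqb + vpmovmskb)
    for (size_t i = 0; i < 32; i++)
    {
        if (original[i] == clipped[i])
            mask |= (uint32_t {1} << i);
    }

    mask = ~mask;       // models not eax: bit i is now set when pixel i was clipped

    // Count the 1 bits (models popcnt eax,eax)
    return static_cast<uint32_t>(std::bitset<32>(mask).count());
}
```

Because the mask holds exactly one bit per pixel, a single 32-bit population count yields the number of clipped pixels in the block without a per-pixel branch.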
Table 10-1 shows the benchmark timing measurements for the pixel clipping functions Avx2ClipPixelsCpp and Avx2ClipPixels_.
Table 10-1. Mean Execution Times (Microseconds) for Pixel Clipping Functions (Image Buffer Size = 8 MB)

CPU         Avx2ClipPixelsCpp   Avx2ClipPixels_
i7-4790S    13005               1078
i9-7900X    11617               719
i7-8700K    11252               644

RGB Pixel Min-Max Values

Listing 10-5 shows the C++ and assembly language source code for example Ch10_05, which illustrates how to calculate the minimum and maximum pixel intensity values in an RGB image. This example also explains how to exploit some of MASM’s advanced macro processing capabilities.
//------------------------------------------------
//        Ch10_05.cpp
//------------------------------------------------
#include "stdafx.h"
#include <cstdint>
#include <iostream>
#include <iomanip>
#include <random>
#include "AlignedMem.h"
using namespace std;
extern "C" bool Avx2CalcRgbMinMax_(uint8_t* rgb[3], size_t num_pixels, uint8_t min_vals[3], uint8_t max_vals[3]);
void Init(uint8_t* rgb[3], size_t n, unsigned int seed)
{
  uniform_int_distribution<> ui_dist {5, 250};
  default_random_engine rng {seed};
  for (size_t i = 0; i < n; i++)
  {
    rgb[0][i] = (uint8_t)ui_dist(rng);
    rgb[1][i] = (uint8_t)ui_dist(rng);
    rgb[2][i] = (uint8_t)ui_dist(rng);
  }
  // Set known min & max values for validation purposes
  rgb[0][n / 4] = 4;  rgb[1][n / 2] = 1;    rgb[2][3 * n / 4] = 3;
  rgb[0][n / 3] = 254; rgb[1][2 * n / 5] = 251; rgb[2][n - 1] = 252;
}
bool Avx2CalcRgbMinMaxCpp(uint8_t* rgb[3], size_t num_pixels, uint8_t min_vals[3], uint8_t max_vals[3])
{
  // Make sure num_pixels is valid
  if ((num_pixels == 0) || (num_pixels % 32 != 0))
    return false;
  if (!AlignedMem::IsAligned(rgb[0], 32))
    return false;
  if (!AlignedMem::IsAligned(rgb[1], 32))
    return false;
  if (!AlignedMem::IsAligned(rgb[2], 32))
    return false;
  // Find the min and max of each color plane
  min_vals[0] = min_vals[1] = min_vals[2] = 255;
  max_vals[0] = max_vals[1] = max_vals[2] = 0;
  for (size_t i = 0; i < 3; i++)
  {
    for (size_t j = 0; j < num_pixels; j++)
    {
      if (rgb[i][j] < min_vals[i])
        min_vals[i] = rgb[i][j];
      if (rgb[i][j] > max_vals[i])    // tested independently; one pixel can update both min and max
        max_vals[i] = rgb[i][j];
    }
  }
  return true;
}
int main(void)
{
  const size_t n = 1024;
  uint8_t* rgb[3];
  uint8_t min_vals1[3], max_vals1[3];
  uint8_t min_vals2[3], max_vals2[3];
  AlignedArray<uint8_t> r(n, 32);
  AlignedArray<uint8_t> g(n, 32);
  AlignedArray<uint8_t> b(n, 32);
  rgb[0] = r.Data();
  rgb[1] = g.Data();
  rgb[2] = b.Data();
  Init(rgb, n, 219);
  Avx2CalcRgbMinMaxCpp(rgb, n, min_vals1, max_vals1);
  Avx2CalcRgbMinMax_(rgb, n, min_vals2, max_vals2);
  cout << "Results for Avx2CalcRgbMinMax\n";
  cout << "        R  G  B\n";
  cout << "-------------------------\n";
  cout << "min_vals1: ";
  cout << setw(4) << (int)min_vals1[0] << ' ';
  cout << setw(4) << (int)min_vals1[1] << ' ';
  cout << setw(4) << (int)min_vals1[2] << '\n';
  cout << "min_vals2: ";
  cout << setw(4) << (int)min_vals2[0] << ' ';
  cout << setw(4) << (int)min_vals2[1] << ' ';
  cout << setw(4) << (int)min_vals2[2] << "\n";
  cout << "max_vals1: ";
  cout << setw(4) << (int)max_vals1[0] << ' ';
  cout << setw(4) << (int)max_vals1[1] << ' ';
  cout << setw(4) << (int)max_vals1[2] << '\n';
  cout << "max_vals2: ";
  cout << setw(4) << (int)max_vals2[0] << ' ';
  cout << setw(4) << (int)max_vals2[1] << ' ';
  cout << setw(4) << (int)max_vals2[2] << "\n";
  return 0;
}
;-------------------------------------------------
;        Ch10_05.asm
;-------------------------------------------------
    include <MacrosX86-64-AVX.asmh>
; 256-bit wide constants
ConstVals    segment readonly align(32) 'const'
InitialPminVal db 32 dup(0ffh)
InitialPmaxVal db 32 dup(00h)
ConstVals    ends
; Macro _YmmVpextrMinub
;
; This macro generates code that extracts the smallest unsigned byte from register YmmSrc.
_YmmVpextrMinub macro GprDes,YmmSrc,YmmTmp
; Make sure YmmSrc and YmmTmp are different
.erridni <YmmSrc>, <YmmTmp>, <Invalid registers>
; Construct text strings for the corresponding XMM registers
    YmmSrcSuffix SUBSTR <YmmSrc>,2
    XmmSrc CATSTR <X>,YmmSrcSuffix
    YmmTmpSuffix SUBSTR <YmmTmp>,2
    XmmTmp CATSTR <X>,YmmTmpSuffix
; Reduce the 32 byte values in YmmSrc to the smallest value
    vextracti128 XmmTmp,YmmSrc,1
    vpminub XmmSrc,XmmSrc,XmmTmp    ;XmmSrc = final 16 min values
    vpsrldq XmmTmp,XmmSrc,8
    vpminub XmmSrc,XmmSrc,XmmTmp    ;XmmSrc = final 8 min values
    vpsrldq XmmTmp,XmmSrc,4
    vpminub XmmSrc,XmmSrc,XmmTmp    ;XmmSrc = final 4 min values
    vpsrldq XmmTmp,XmmSrc,2
    vpminub XmmSrc,XmmSrc,XmmTmp    ;XmmSrc = final 2 min values
    vpsrldq XmmTmp,XmmSrc,1
    vpminub XmmSrc,XmmSrc,XmmTmp    ;XmmSrc = final 1 min value
    vpextrb GprDes,XmmSrc,0       ;mov final min value to Gpr
    endm
; Macro _YmmVpextrMaxub
;
; This macro generates code that extracts the largest unsigned byte from register YmmSrc.
_YmmVpextrMaxub macro GprDes,YmmSrc,YmmTmp
; Make sure YmmSrc and YmmTmp are different
.erridni <YmmSrc>, <YmmTmp>, <Invalid registers>
; Construct text strings for the corresponding XMM registers
    YmmSrcSuffix SUBSTR <YmmSrc>,2
    XmmSrc CATSTR <X>,YmmSrcSuffix
    YmmTmpSuffix SUBSTR <YmmTmp>,2
    XmmTmp CATSTR <X>,YmmTmpSuffix
; Reduce the 32 byte values in YmmSrc to the largest value
    vextracti128 XmmTmp,YmmSrc,1
    vpmaxub XmmSrc,XmmSrc,XmmTmp    ;XmmSrc = final 16 max values
    vpsrldq XmmTmp,XmmSrc,8
    vpmaxub XmmSrc,XmmSrc,XmmTmp    ;XmmSrc = final 8 max values
    vpsrldq XmmTmp,XmmSrc,4
    vpmaxub XmmSrc,XmmSrc,XmmTmp    ;XmmSrc = final 4 max values
    vpsrldq XmmTmp,XmmSrc,2
    vpmaxub XmmSrc,XmmSrc,XmmTmp    ;XmmSrc = final 2 max values
    vpsrldq XmmTmp,XmmSrc,1
    vpmaxub XmmSrc,XmmSrc,XmmTmp    ;XmmSrc = final 1 max value
    vpextrb GprDes,XmmSrc,0       ;mov final max value to Gpr
    endm
; extern "C" bool Avx2CalcRgbMinMax_(uint8_t* rgb[3], size_t num_pixels, uint8_t min_vals[3], uint8_t max_vals[3])
    .code
Avx2CalcRgbMinMax_ proc frame
    _CreateFrame CalcMinMax_,0,48,r12
    _SaveXmmRegs xmm6,xmm7,xmm8
    _EndProlog
; Make sure num_pixels and the color plane arrays are valid
    xor eax,eax             ;set error code
    test rdx,rdx
    jz Done               ;jump if num_pixels == 0
    test rdx,01fh
    jnz Done              ;jump if num_pixels % 32 != 0
    mov r10,[rcx]            ;r10 = color plane R
    test r10,1fh
    jnz Done              ;jump if color plane R is not aligned
    mov r11,[rcx+8]           ;r11 = color plane G
    test r11,1fh
    jnz Done              ;jump if color plane G is not aligned
    mov r12,[rcx+16]          ;r12 = color plane B
    test r12,1fh
    jnz Done              ;jump if color plane B is not aligned
; Initialize the processing loop registers
    vmovdqa ymm3,ymmword ptr [InitialPminVal]  ;ymm3 = R minimums
    vmovdqa ymm4,ymm3              ;ymm4 = G minimums
    vmovdqa ymm5,ymm3              ;ymm5 = B minimums
    vmovdqa ymm6,ymmword ptr [InitialPmaxVal]  ;ymm6 = R maximums
    vmovdqa ymm7,ymm6              ;ymm7 = G maximums
    vmovdqa ymm8,ymm6              ;ymm8 = B maximums
    xor rcx,rcx             ;rcx = common array offset
; Scan RGB color plane arrays for packed minimums and maximums
    align 16
@@:   vmovdqa ymm0,ymmword ptr [r10+rcx] ;ymm0 = R pixels
    vmovdqa ymm1,ymmword ptr [r11+rcx] ;ymm1 = G pixels
    vmovdqa ymm2,ymmword ptr [r12+rcx] ;ymm2 = B pixels
    vpminub ymm3,ymm3,ymm0       ;update R minimums
    vpminub ymm4,ymm4,ymm1       ;update G minimums
    vpminub ymm5,ymm5,ymm2       ;update B minimums
    vpmaxub ymm6,ymm6,ymm0       ;update R maximums
    vpmaxub ymm7,ymm7,ymm1       ;update G maximums
    vpmaxub ymm8,ymm8,ymm2       ;update B maximums
    add rcx,32
    sub rdx,32
    jnz @B
; Calculate the final RGB minimum values
    _YmmVpextrMinub rax,ymm3,ymm0
    mov byte ptr [r8],al        ;save min R
    _YmmVpextrMinub rax,ymm4,ymm0
    mov byte ptr [r8+1],al       ;save min G
    _YmmVpextrMinub rax,ymm5,ymm0
    mov byte ptr [r8+2],al       ;save min B
; Calculate the final RGB maximum values
    _YmmVpextrMaxub rax,ymm6,ymm1
    mov byte ptr [r9],al        ;save max R
    _YmmVpextrMaxub rax,ymm7,ymm1
    mov byte ptr [r9+1],al       ;save max G
    _YmmVpextrMaxub rax,ymm8,ymm1
    mov byte ptr [r9+2],al       ;save max B
    mov eax,1              ;set success return code
Done:  vzeroupper
    _RestoreXmmRegs xmm6,xmm7,xmm8
    _DeleteFrame r12
    ret
Avx2CalcRgbMinMax_ endp
    end
Listing 10-5. Example Ch10_05

The function Avx2CalcRgbMinMaxCpp that’s shown in Listing 10-5 is a C++ implementation of the RGB min-max algorithm. This function employs a set of nested for loops to determine the minimum and maximum pixel intensity values for each color plane. These values are maintained in the arrays min_vals and max_vals. The function main uses the C++ template class AlignedArray to allocate three arrays that simulate the color plane buffers of an RGB image. These buffers are loaded with random values by the function Init. Note that function Init assigns known values to several elements in each color plane buffer. These known values are used to verify correct execution of both the C++ and assembly language min-max functions.

Toward the top of the assembly language code is a custom constant segment named ConstVals that defines packed versions of the initial pixel minimum and maximum values. A custom segment is used here to ensure alignment of the 256-bit wide packed values on a 32-byte boundary, as explained in Chapter 9. The macro definitions _YmmVpextrMinub and _YmmVpextrMaxub are next. These macros contain instructions that extract the smallest and largest byte values from a YMM register. The inner workings of these macros will be explained shortly.

The function Avx2CalcRgbMinMax_ uses registers YMM3-YMM5 and YMM6-YMM8 to maintain the RGB minimum and maximum values, respectively. During each iteration of the main processing loop, a series of vpminub and vpmaxub instructions update the current RGB minimums and maximums. Upon completion of the main processing loop, the aforementioned YMM registers contain 32 minimum and maximum pixel intensity values for each color component. The _YmmVpextrMinub and _YmmVpextrMaxub macros are then used to extract the final RGB minimum and maximum pixel values. These values are then saved to the result arrays min_vals and max_vals, respectively.
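
The role of the packed accumulators can be illustrated with a scalar C++ model for a single color plane. This is a sketch under stated assumptions: the names LaneMinMax and CalcLaneMinMax are hypothetical, each array element stands in for one byte element of a YMM register, and num_pixels is assumed to be a nonzero multiple of 32.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model of the pair of YMM accumulators used for one color plane
struct LaneMinMax
{
    std::array<uint8_t, 32> m_Min;
    std::array<uint8_t, 32> m_Max;
};

LaneMinMax CalcLaneMinMax(const uint8_t* plane, size_t num_pixels)
{
    assert(num_pixels != 0 && num_pixels % 32 == 0);    // assumption of this sketch

    LaneMinMax lanes;
    lanes.m_Min.fill(0xff);     // same role as InitialPminVal
    lanes.m_Max.fill(0x00);     // same role as InitialPmaxVal

    // Each outer iteration models one pass through the processing loop
    for (size_t i = 0; i < num_pixels; i += 32)
    {
        for (size_t j = 0; j < 32; j++)
        {
            lanes.m_Min[j] = std::min(lanes.m_Min[j], plane[i + j]);    // models vpminub
            lanes.m_Max[j] = std::max(lanes.m_Max[j], plane[i + j]);    // models vpmaxub
        }
    }

    return lanes;
}
```

After the loop, element j holds the minimum and maximum of pixels j, j + 32, j + 64, and so on. A further reduction across the 32 elements is still required to obtain the minimum and maximum of the entire color plane.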

The macro definitions _YmmVpextrMinub and _YmmVpextrMaxub are identical, except for the instructions vpminub and vpmaxub. In the text that follows, all explanatory comments made about _YmmVpextrMinub also apply to _YmmVpextrMaxub. The _YmmVpextrMinub macro requires three parameters: a destination general-purpose register (GprDes), a source YMM register (YmmSrc), and a temporary YMM register (YmmTmp). Note that macro parameters YmmSrc and YmmTmp must be different registers. If they’re the same, the .erridni directive (Error if Text Items are Identical, Case Insensitive) generates an error message during assembly. MASM also supports several other conditional error directives besides .erridni, and these are described in the Visual Studio documentation.

In order to generate the correct assembly language code, the macro _YmmVpextrMinub requires an XMM register text string (XmmSrc) that corresponds to the low-order portion of the specified YmmSrc register. For example, if YmmSrc equals YMM0, then XmmSrc must equal XMM0. The MASM directives substr (Return Substring of Text Item) and catstr (Concatenate Text Items) are used to initialize XmmSrc. The statement YmmSrcSuffix SUBSTR <YmmSrc>,2 assigns a text string value to YmmSrcSuffix that excludes the leading character of macro parameter YmmSrc. For example, if YmmSrc equals YMM0, then YmmSrcSuffix equals MM0. The next statement, XmmSrc CATSTR <X>,YmmSrcSuffix, adds a leading X to the value of YmmSrcSuffix and assigns it to XmmSrc. Continuing with the earlier example, this means that the text string XMM0 is assigned to XmmSrc. The SUBSTR and CATSTR directives are then used to assign a text string value to XmmTmp.
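
The text manipulation performed by these directives can be mimicked with C++ strings. This is only an illustrative sketch: the function name YmmToXmmName is hypothetical, and std::string::substr uses zero-based positions, so position 1 here corresponds to position 2 in the MASM SUBSTR directive.

```cpp
#include <cassert>
#include <string>

// Hypothetical C++ analogue of the macro's SUBSTR and CATSTR statements.
// Converts a YMM register name into the name of the corresponding XMM register.
std::string YmmToXmmName(const std::string& ymm_name)
{
    assert(ymm_name.size() >= 2);               // need one character to drop and one to keep

    std::string suffix = ymm_name.substr(1);    // "YMM0" -> "MM0"   (SUBSTR <YmmSrc>,2)
    return "X" + suffix;                        // "MM0"  -> "XMM0"  (CATSTR <X>,YmmSrcSuffix)
}
```

Called with "ymm3", the function returns "Xmm3". The mixed case that results from a lowercase macro argument is harmless in MASM because register names are case insensitive.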

Following initialization of the required macro text strings are the instructions that extract the smallest byte value from the specified YMM register. The vextracti128 XmmTmp,YmmSrc,1 instruction copies the high-order 16 bytes of register YmmSrc to XmmTmp. (The vextracti128 instruction also supports using an immediate operand of 0 to copy the low-order 16 bytes.) A vpminub XmmSrc,XmmSrc,XmmTmp instruction loads the final 16 minimum values into XmmSrc. The vpsrldq XmmTmp,XmmSrc,8 instruction shifts a copy of the value that’s in XmmSrc to the right by eight bytes and saves the result to XmmTmp. This facilitates the use of another vpminub instruction that reduces the number of minimum byte values from 16 to 8. Repeated sets of the vpsrldq and vpminub instructions are then employed until the final minimum value resides in the low-order byte of XmmSrc. A vpextrb GprDes,XmmSrc,0 instruction copies the final minimum value to the specified general-purpose register. Here are the results for source code example Ch10_05:
Results for Avx2CalcRgbMinMax
        R  G  B
-------------------------
min_vals1:  4  1  3
min_vals2:  4  1  3
max_vals1: 254 251 252
max_vals2: 254 251 252
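
The halving reduction performed by _YmmVpextrMinub can be modeled in scalar C++. The following is a minimal sketch, not code from the example: the function name ReduceMinU8 is hypothetical, and the arrays xmm_src and xmm_tmp stand in for the XmmSrc and XmmTmp registers.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model of the _YmmVpextrMinub instruction sequence.
// Returns the smallest of the 32 unsigned bytes in ymm_src.
uint8_t ReduceMinU8(const std::array<uint8_t, 32>& ymm_src)
{
    std::array<uint8_t, 16> xmm_src {};
    std::array<uint8_t, 16> xmm_tmp {};

    // Models vextracti128 + vpminub: fold the high 16 bytes onto the low 16 bytes
    for (size_t i = 0; i < 16; i++)
        xmm_src[i] = std::min(ymm_src[i], ymm_src[i + 16]);

    // Models the vpsrldq + vpminub pairs that use byte shift counts of 8, 4, 2, and 1
    for (size_t shift = 8; shift >= 1; shift /= 2)
    {
        // vpsrldq shifts right by 'shift' bytes and fills the vacated bytes with zeros
        for (size_t i = 0; i < 16; i++)
            xmm_tmp[i] = (i + shift < 16) ? xmm_src[i + shift] : static_cast<uint8_t>(0);

        // After this step only the low-order 'shift' bytes remain meaningful
        for (size_t i = 0; i < 16; i++)
            xmm_src[i] = std::min(xmm_src[i], xmm_tmp[i]);
    }

    return xmm_src[0];      // models vpextrb GprDes,XmmSrc,0
}
```

The zeros shifted in by vpsrldq only reach byte positions that no longer contribute to the result, which is why the low-order byte still holds the true minimum. Replacing std::min with std::max gives the corresponding model of _YmmVpextrMaxub.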

RGB to Grayscale Conversion

The final source code example of this chapter, Ch10_06, explains how to perform an RGB to grayscale image conversion. This example intermixes the packed integer capabilities of AVX2 that you have learned in this chapter with the packed floating-point techniques presented in Chapter 9. Listing 10-6 shows the source code for example Ch10_06.
//------------------------------------------------
//        ImageMatrix.h
//------------------------------------------------
struct RGB32
{
  uint8_t m_R;
  uint8_t m_G;
  uint8_t m_B;
  uint8_t m_A;
};
//------------------------------------------------
//        Ch10_06.cpp
//------------------------------------------------
#include "stdafx.h"
#include <iostream>
#include <stdexcept>
#include "Ch10_06.h"
#include "AlignedMem.h"
#include "ImageMatrix.h"
using namespace std;
// Image size limits
extern "C" const int c_NumPixelsMin = 32;
extern "C" const int c_NumPixelsMax = 256 * 1024 * 1024;
// RGB to grayscale conversion coefficients
const float c_Coef[4] {0.2126f, 0.7152f, 0.0722f, 0.0f};
bool CompareGsImages(const uint8_t* pb_gs1, const uint8_t* pb_gs2, int num_pixels)
{
  for (int i = 0; i < num_pixels; i++)
  {
    if (abs((int)pb_gs1[i] - (int)pb_gs2[i]) > 1)
      return false;
  }
  return true;
}
bool Avx2ConvertRgbToGsCpp(uint8_t* pb_gs, const RGB32* pb_rgb, int num_pixels, const float coef[4])
{
  if (num_pixels < c_NumPixelsMin || num_pixels > c_NumPixelsMax)
    return false;
  if (num_pixels % 8 != 0)
    return false;
  if (!AlignedMem::IsAligned(pb_gs, 32))
    return false;
  if (!AlignedMem::IsAligned(pb_rgb, 32))
    return false;
  for (int i = 0; i < num_pixels; i++)
  {
    uint8_t r = pb_rgb[i].m_R;
    uint8_t g = pb_rgb[i].m_G;
    uint8_t b = pb_rgb[i].m_B;
    float gs_temp = r * coef[0] + g * coef[1] + b * coef[2] + 0.5f;
    if (gs_temp < 0.0f)
      gs_temp = 0.0f;
    else if (gs_temp > 255.0f)
      gs_temp = 255.0f;
    pb_gs[i] = (uint8_t)gs_temp;
  }
  return true;
}
void Avx2ConvertRgbToGs(void)
{
  const wchar_t* fn_rgb = L"..\\Ch10_Data\\TestImage3.bmp";
  const wchar_t* fn_gs1 = L"Ch10_06_Avx2ConvertRgbToGs_TestImage3_GS1.bmp";
  const wchar_t* fn_gs2 = L"Ch10_06_Avx2ConvertRgbToGs_TestImage3_GS2.bmp";
  ImageMatrix im_rgb(fn_rgb);
  int im_h = im_rgb.GetHeight();
  int im_w = im_rgb.GetWidth();
  int num_pixels = im_h * im_w;
  ImageMatrix im_gs1(im_h, im_w, PixelType::Gray8);
  ImageMatrix im_gs2(im_h, im_w, PixelType::Gray8);
  RGB32* pb_rgb = im_rgb.GetPixelBuffer<RGB32>();
  uint8_t* pb_gs1 = im_gs1.GetPixelBuffer<uint8_t>();
  uint8_t* pb_gs2 = im_gs2.GetPixelBuffer<uint8_t>();
  cout << "Results for Avx2ConvertRgbToGs\n";
  wcout << "Converting RGB image " << fn_rgb << '\n';
  cout << " im_h = " << im_h << " pixels\n";
  cout << " im_w = " << im_w << " pixels\n";
  // Exercise conversion functions
  bool rc1 = Avx2ConvertRgbToGsCpp(pb_gs1, pb_rgb, num_pixels, c_Coef);
  bool rc2 = Avx2ConvertRgbToGs_(pb_gs2, pb_rgb, num_pixels, c_Coef);
  if (rc1 && rc2)
  {
    wcout << "Saving grayscale image #1 - " << fn_gs1 << '\n';
    im_gs1.SaveToBitmapFile(fn_gs1);
    wcout << "Saving grayscale image #2 - " << fn_gs2 << '\n';
    im_gs2.SaveToBitmapFile(fn_gs2);
    if (CompareGsImages(pb_gs1, pb_gs2, num_pixels))
      cout << "Grayscale image compare OK\n";
    else
      cout << "Grayscale image compare failed\n";
  }
  else
    cout << "Invalid return code\n";
}
int main()
{
  try
  {
    Avx2ConvertRgbToGs();
    Avx2ConvertRgbToGs_BM();
  }
  catch (runtime_error& rte)
  {
    cout << "'runtime_error' exception has occurred - " << rte.what() << '\n';
  }
  catch (...)
  {
    cout << "Unexpected exception has occurred\n";
  }
  return 0;
}
;-------------------------------------------------
;        Ch10_06.asm
;-------------------------------------------------
    include <MacrosX86-64-AVX.asmh>
        .const
GsMask     dword 0ffffffffh, 0, 0, 0, 0ffffffffh, 0, 0, 0
r4_0p5     real4 0.5
r4_255p0    real4 255.0
        extern c_NumPixelsMin:dword
        extern c_NumPixelsMax:dword
; extern "C" bool Avx2ConvertRgbToGs_(uint8_t* pb_gs, const RGB32* pb_rgb, int num_pixels, const float coef[4])
;
; Note: Memory pointed to by pb_rgb is ordered as follows:
;    R(0,0), G(0,0), B(0,0), A(0,0), R(0,1), G(0,1), B(0,1), A(0,1), ...
    .code
Avx2ConvertRgbToGs_ proc frame
    _CreateFrame RGBGS_,0,112
    _SaveXmmRegs xmm6,xmm7,xmm11,xmm12,xmm13,xmm14,xmm15
    _EndProlog
; Validate argument values
    xor eax,eax             ;set error return code
    cmp r8d,[c_NumPixelsMin]
    jl Done               ;jump if num_pixels < min value
    cmp r8d,[c_NumPixelsMax]
    jg Done               ;jump if num_pixels > max value
    test r8d,7
    jnz Done              ;jump if (num_pixels % 8) != 0
    test rcx,1fh
    jnz Done              ;jump if pb_gs is not aligned
    test rdx,1fh
    jnz Done              ;jump if pb_rgb is not aligned
; Perform required initializations
    vbroadcastss ymm11,real4 ptr [r4_255p0]   ;ymm11 = packed 255.0
    vbroadcastss ymm12,real4 ptr [r4_0p5]    ;ymm12 = packed 0.5
    vpxor ymm13,ymm13,ymm13           ;ymm13 = packed zero
    vmovups xmm0,xmmword ptr [r9]
    vperm2f128 ymm14,ymm0,ymm0,00000000b    ;ymm14 = packed coef
    vmovups ymm15,ymmword ptr [GsMask]     ;ymm15 = GsMask (SPFP)
; Load next 8 RGB32 pixel values (P0 - P7)
    align 16
@@:   vmovdqa ymm0,ymmword ptr [rdx]   ;ymm0 = 8 rgb32 pixels (P7 - P0)
; Size-promote RGB32 color components from bytes to dwords
; (the unpack instructions operate on each 128-bit lane independently)
    vpunpcklbw ymm1,ymm0,ymm13     ;ymm1 = P5, P4 | P1, P0 (word)
    vpunpckhbw ymm2,ymm0,ymm13     ;ymm2 = P7, P6 | P3, P2 (word)
    vpunpcklwd ymm3,ymm1,ymm13     ;ymm3 = P4, P0 (dword)
    vpunpckhwd ymm4,ymm1,ymm13     ;ymm4 = P5, P1 (dword)
    vpunpcklwd ymm5,ymm2,ymm13     ;ymm5 = P6, P2 (dword)
    vpunpckhwd ymm6,ymm2,ymm13     ;ymm6 = P7, P3 (dword)
; Convert color component values to single-precision floating-point
    vcvtdq2ps ymm0,ymm3         ;ymm0 = P4, P0 (SPFP)
    vcvtdq2ps ymm1,ymm4         ;ymm1 = P5, P1 (SPFP)
    vcvtdq2ps ymm2,ymm5         ;ymm2 = P6, P2 (SPFP)
    vcvtdq2ps ymm3,ymm6         ;ymm3 = P7, P3 (SPFP)
; Multiply color component values by color conversion coefficients
    vmulps ymm0,ymm0,ymm14
    vmulps ymm1,ymm1,ymm14
    vmulps ymm2,ymm2,ymm14
    vmulps ymm3,ymm3,ymm14
; Sum weighted color components for final grayscale values
    vhaddps ymm4,ymm0,ymm0
    vhaddps ymm4,ymm4,ymm4       ;ymm4[159:128] = P4, ymm4[31:0] = P0
    vhaddps ymm5,ymm1,ymm1
    vhaddps ymm5,ymm5,ymm5       ;ymm5[159:128] = P5, ymm5[31:0] = P1
    vhaddps ymm6,ymm2,ymm2
    vhaddps ymm6,ymm6,ymm6       ;ymm6[159:128] = P6, ymm6[31:0] = P2
    vhaddps ymm7,ymm3,ymm3
    vhaddps ymm7,ymm7,ymm7       ;ymm7[159:128] = P7, ymm7[31:0] = P3
; Merge SPFP grayscale values into a single YMM register
    vandps ymm4,ymm4,ymm15       ;mask out unneeded SPFP values
    vandps ymm5,ymm5,ymm15
    vandps ymm6,ymm6,ymm15
    vandps ymm7,ymm7,ymm15
    vpslldq ymm5,ymm5,4
    vpslldq ymm6,ymm6,8
    vpslldq ymm7,ymm7,12
    vorps ymm0,ymm4,ymm5        ;merge values
    vorps ymm1,ymm6,ymm7
    vorps ymm2,ymm0,ymm1        ;ymm2 = 8 GS pixel values (SPFP)
; Add 0.5 rounding factor and clip to 0.0 - 255.0
    vaddps ymm2,ymm2,ymm12       ;add 0.5f rounding factor
    vminps ymm3,ymm2,ymm11       ;clip pixels above 255.0
    vmaxps ymm4,ymm3,ymm13       ;clip pixels below 0.0
; Convert SPFP values to bytes and save
    vcvtps2dq ymm3,ymm4         ;convert clipped GS SPFP to dwords
    vpackusdw ymm4,ymm3,ymm13      ;convert GS dwords to words
    vpackuswb ymm5,ymm4,ymm13      ;convert GS words to bytes
    vperm2i128 ymm6,ymm13,ymm5,3    ;xmm5 = GS P3:P0, xmm6 = GS P7:P4
    vmovd dword ptr [rcx],xmm5     ;save P3 - P0
    vmovd dword ptr [rcx+4],xmm6    ;save P7 - P4
    add rdx,32             ;update pb_rgb to next block
    add rcx,8              ;update pb_gs to next block
    sub r8d,8              ;num_pixels -= 8
    jnz @B               ;repeat until done
    mov eax,1              ;set success return code
Done: vzeroupper
    _RestoreXmmRegs xmm6,xmm7,xmm11,xmm12,xmm13,xmm14,xmm15
    _DeleteFrame
    ret
Avx2ConvertRgbToGs_ endp
    end
Listing 10-6. Example Ch10_06

A variety of algorithms exist to convert an RGB image into a grayscale image. One frequently used technique calculates grayscale pixel values using a weighted sum of the RGB color components. In this source code example, RGB pixels are converted to grayscale pixels using the following equation:
$$ GS\left(x,y\right)=R\left(x,y\right){W}_r+G\left(x,y\right){W}_g+B\left(x,y\right){W}_b $$

Each RGB color component weight (or coefficient) is a floating-point number between 0.0 and 1.0, and the sum of the three component coefficients normally equals 1.0. The exact values used for the color component coefficients are usually based on published standards that reflect a multitude of visual factors including properties of the target color space, display device characteristics, and perceived image quality. If you’re interested in learning more about RGB to grayscale image conversion, Appendix A contains some references that you can consult.

Source code Ch10_06 opens with the structure declaration RGB32. This structure is declared in the header file ImageMatrix.h and specifies the color component ordering scheme of each RGB pixel. The function Avx2ConvertRgbToGsCpp contains a C++ implementation of the RGB to grayscale conversion algorithm. This function uses an ordinary for loop that sweeps through the RGB32 image buffer pb_rgb and computes grayscale pixel values using the aforementioned conversion equation. Note that RGB32 element m_A is not used in any of the calculations in this example. Each calculated grayscale pixel value is adjusted by a rounding factor and clipped to [0.0, 255.0] before it is saved to the grayscale image buffer pointed to by pb_gs.
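The scalar algorithm described above can be sketched in a few lines of C++. This is a minimal illustration, not the book's Avx2ConvertRgbToGsCpp: the function name ConvertRgbToGsScalar is hypothetical, and the RGB32 layout shown here is assumed from the component ordering note in the assembly language listing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed layout: one byte per component in R, G, B, A order.
struct RGB32
{
    uint8_t m_R, m_G, m_B, m_A;
};

// Hypothetical helper that mirrors the steps in the text.
// coef[0..2] hold the R, G, B weights; coef[3] and m_A are ignored.
inline void ConvertRgbToGsScalar(uint8_t* pb_gs, const RGB32* pb_rgb,
                                 size_t num_pixels, const float coef[4])
{
    assert(pb_gs != nullptr && pb_rgb != nullptr);

    for (size_t i = 0; i < num_pixels; i++)
    {
        // Weighted sum of the three color components
        float gs = pb_rgb[i].m_R * coef[0]
                 + pb_rgb[i].m_G * coef[1]
                 + pb_rgb[i].m_B * coef[2];

        gs += 0.5f;                             // rounding factor
        if (gs > 255.0f) gs = 255.0f;           // clip high
        if (gs < 0.0f) gs = 0.0f;               // clip low
        pb_gs[i] = static_cast<uint8_t>(gs);    // truncation completes the rounding
    }
}
```

Adding 0.5 before the truncating cast rounds each result to the nearest integer, and the clip keeps the cast within the range of an unsigned byte.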

The assembly language code begins with a .const section that defines the necessary constants. Following its prolog, the function Avx2ConvertRgbToGs_ performs the customary image size and buffer alignment checks. It then loads the algorithm’s required packed constants into registers YMM11–YMM15. Note that register YMM14 contains a packed version of the color conversion coefficients, as illustrated in Figure 10-3. The assembly language processing loop begins with a vmovdqa ymm0,ymmword ptr [rdx] instruction that loads eight RGB32 pixel values into register YMM0. The color components of these pixels are then size-promoted to doublewords using a series of vpunpck[l|h]bw and vpunpck[l|h]wd instructions. The ensuing vcvtdq2ps instructions convert the pixel color components from doublewords to single-precision floating-point values. Following execution of the four vcvtdq2ps instructions, registers YMM0–YMM3 each contain two RGB32 pixels (one per 128-bit lane), and each color component is a single-precision floating-point value. Figure 10-3 also shows the RGB32 size promotions and conversions discussed in this paragraph.
Figure 10-3. RGB32 pixel color component size promotions and conversions
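The size promotions are easier to follow with a small model. The sketch below is portable scalar C++ that imitates the lane-wise behavior of the 256-bit unpack instructions; it is not AVX2 code, and the names Bytes32, UnpackBytes, UnpackWords, Promoted, and PromoteToDwords are hypothetical. Because the 256-bit forms of vpunpck[l|h]bw and vpunpck[l|h]wd process each 128-bit lane independently, every promoted register ends up holding one pixel from P0 - P3 in its low lane and one pixel from P4 - P7 in its high lane.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Bytes32 = std::array<uint8_t, 32>;    // stand-in for one YMM register

// Model of vpunpcklbw / vpunpckhbw: within each 128-bit lane, interleave
// the low (or high) eight bytes of a with the matching bytes of b.
inline Bytes32 UnpackBytes(const Bytes32& a, const Bytes32& b, bool high)
{
    Bytes32 r{};
    for (size_t lane = 0; lane < 2; lane++)
    {
        size_t base = lane * 16;
        size_t src = base + (high ? 8 : 0);
        for (size_t i = 0; i < 8; i++)
        {
            r[base + 2 * i] = a[src + i];
            r[base + 2 * i + 1] = b[src + i];
        }
    }
    return r;
}

// Model of vpunpcklwd / vpunpckhwd: same idea using 16-bit words.
inline Bytes32 UnpackWords(const Bytes32& a, const Bytes32& b, bool high)
{
    Bytes32 r{};
    for (size_t lane = 0; lane < 2; lane++)
    {
        size_t base = lane * 16;
        size_t src = base + (high ? 8 : 0);
        for (size_t i = 0; i < 4; i++)
        {
            for (size_t k = 0; k < 2; k++)
            {
                r[base + 4 * i + k] = a[src + 2 * i + k];
                r[base + 4 * i + 2 + k] = b[src + 2 * i + k];
            }
        }
    }
    return r;
}

struct Promoted { Bytes32 r3, r4, r5, r6; };    // models ymm3 - ymm6

// Zero-extend eight RGB32 pixels (32 bytes) to dword components.
inline Promoted PromoteToDwords(const Bytes32& pixels)
{
    const Bytes32 zero{};
    Bytes32 lo = UnpackBytes(pixels, zero, false);  // vpunpcklbw ymm1,ymm0,ymm13
    Bytes32 hi = UnpackBytes(pixels, zero, true);   // vpunpckhbw ymm2,ymm0,ymm13
    return { UnpackWords(lo, zero, false),          // vpunpcklwd ymm3,ymm1,ymm13
             UnpackWords(lo, zero, true),           // vpunpckhwd ymm4,ymm1,ymm13
             UnpackWords(hi, zero, false),          // vpunpcklwd ymm5,ymm2,ymm13
             UnpackWords(hi, zero, true) };         // vpunpckhwd ymm6,ymm2,ymm13
}
```

Filling the input with a distinct tag per byte and inspecting the four results is a quick way to confirm which pixel lands in which lane before you commit to a merge sequence.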

The four vmulps instructions multiply the eight RGB32 pixels by the color conversion coefficients. The ensuing vhaddps instructions sum the weighted color components of each pixel to generate the required grayscale values. Following execution of these instructions, registers YMM4–YMM7 each contain two single-precision floating-point grayscale pixel values, one in element position [31:0] and the other in [159:128], as shown in Figure 10-4. The eight grayscale values in YMM4–YMM7 are then merged into YMM2 using a series of vandps, vpslldq, and vorps instructions. Figure 10-4 also shows the final merged result. The vaddps, vminps, and vmaxps instructions that follow add in the rounding factor (0.5) and clip the grayscale pixels to [0.0, 255.0]. These values are then converted to unsigned bytes using the instructions vcvtps2dq, vpackusdw, and vpackuswb. The two vmovd instructions save the four unsigned byte pixel values held in each of XMM5[31:0] and XMM6[31:0] to the grayscale image buffer.
Figure 10-4. Grayscale single-precision floating-point pixel values before and after merging
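The summation and merge steps can also be modeled in scalar C++. The sketch below is an illustration only; Floats8, HorizontalAdd, SumLanes, and MergeGrayscale are hypothetical names, and plain element assignment stands in for the vandps, vpslldq, and vorps sequence used by the assembly language code.

```cpp
#include <array>
#include <cstddef>

using Floats8 = std::array<float, 8>;   // stand-in for one YMM register (SPFP)

// Model of vhaddps with 256-bit operands: within each 128-bit lane, the
// adjacent pairs of a fill elements 0-1 and those of b fill elements 2-3.
inline Floats8 HorizontalAdd(const Floats8& a, const Floats8& b)
{
    Floats8 r{};
    for (size_t base = 0; base < 8; base += 4)
    {
        r[base + 0] = a[base + 0] + a[base + 1];
        r[base + 1] = a[base + 2] + a[base + 3];
        r[base + 2] = b[base + 0] + b[base + 1];
        r[base + 3] = b[base + 2] + b[base + 3];
    }
    return r;
}

// Two horizontal adds leave the sum of a lane's four elements
// in every element of that lane.
inline Floats8 SumLanes(const Floats8& weighted)
{
    Floats8 t = HorizontalAdd(weighted, weighted);
    return HorizontalAdd(t, t);
}

// Merge four registers of weighted components into one register of
// grayscale values. Register 'slot' contributes its low-lane sum to
// element 'slot' and its high-lane sum to element 4 + 'slot', which is
// where a byte shift of 4 * slot within each lane would place them.
inline Floats8 MergeGrayscale(const std::array<Floats8, 4>& weighted)
{
    Floats8 merged{};
    for (size_t slot = 0; slot < 4; slot++)
    {
        Floats8 sums = SumLanes(weighted[slot]);
        merged[slot] = sums[0];         // low lane, element 0 kept by the mask
        merged[4 + slot] = sums[4];     // high lane, element 0 kept by the mask
    }
    return merged;
}
```

Each grayscale value costs two horizontal adds plus mask, shift, and OR operations, which is why this part of the loop dominates the instruction count.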

Here are the results of source code example Ch10_06:
Results for Avx2ConvertRgbToGs
Converting RGB image ..Ch10_DataTestImage3.bmp
 im_h = 960 pixels
 im_w = 640 pixels
Saving grayscale image #1 - Ch10_06_Avx2ConvertRgbToGs_TestImage3_GS1.bmp
Saving grayscale image #2 - Ch10_06_Avx2ConvertRgbToGs_TestImage3_GS2.bmp
Grayscale image compare OK
Running benchmark function Avx2ConvertRgbToGs_BM - please wait
Benchmark times save to file Ch10_06_Avx2ConvertRgbToGs_BM_CHROMIUM.csv
Table 10-2 shows the benchmark timing measurements for the RGB to grayscale image conversion functions Avx2ConvertRgbToGsCpp and Avx2ConvertRgbToGs_. The assembly language function is roughly 1.8 times faster than its C++ counterpart on all three processors, which is a modest gain compared to some of the other examples in this book. The reason for this is that the RGB32 color components in the source image buffer are interleaved with each other, which necessitates the use of slower horizontal arithmetic. Rearranging the RGB32 data so that the pixels of each color component reside in separate image buffers often results in significantly faster performance. You will see an example of this in Chapter 14.
Table 10-2. Mean Execution Times (Microseconds) for RGB to Grayscale Image Conversion Using TestImage3.bmp

CPU         Avx2ConvertRgbToGsCpp    Avx2ConvertRgbToGs_
i7-4790S    1504                     843
i9-7900X    1075                     593
i7-8700K    1031                     565
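The planar arrangement suggested in the discussion of Table 10-2 can be sketched in C++ as follows. This is a hedged illustration of the data layout only, not the Chapter 14 implementation; the structure PlanarRgb and the function ConvertPlanarRgbToGs are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical planar layout: one buffer per color component.
struct PlanarRgb
{
    std::vector<uint8_t> r, g, b;
};

inline std::vector<uint8_t> ConvertPlanarRgbToGs(const PlanarRgb& im, const float coef[3])
{
    assert(im.r.size() == im.g.size() && im.g.size() == im.b.size());

    std::vector<uint8_t> gs(im.r.size());

    for (size_t i = 0; i < gs.size(); i++)
    {
        // Element i of each plane belongs to the same pixel, so the
        // weighted sum only combines corresponding elements of three
        // arrays. A SIMD version needs no horizontal adds or merges.
        float t = im.r[i] * coef[0] + im.g[i] * coef[1] + im.b[i] * coef[2] + 0.5f;

        t = (t > 255.0f) ? 255.0f : t;      // clip high
        t = (t < 0.0f) ? 0.0f : t;          // clip low
        gs[i] = static_cast<uint8_t>(t);
    }

    return gs;
}
```

With this layout, eight consecutive red values can be loaded, converted, and multiplied by a broadcast coefficient as one packed operation, and the same holds for the green and blue planes.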

Summary

Here are the key learning points of Chapter 10:
  • AVX2 extends the packed integer capabilities of AVX. Most x86-AVX packed integer instructions can be used with either 128-bit or 256-bit wide operands. These operands should be properly aligned whenever possible.

  • Similar to x86-AVX floating-point, assembly language functions that perform packed integer calculations using a YMM register should use a vzeroupper instruction prior to any epilog code or the ret instruction. This avoids potential performance delays that can occur when the processor transitions from executing x86-AVX instructions to x86-SSE instructions.

  • The Visual C++ calling convention differs for assembly language functions that return a structure by value. A function that returns a structure by value must copy a large structure (one greater than eight bytes) to the buffer pointed to by the RCX register. The normal calling convention registers are also “right-shifted” as explained in this chapter.

  • Assembly language functions can use the vpunpckl[bw|wd|dq] and vpunpckh[bw|wd|dq] instructions to unpack 128-bit or 256-bit wide integer operands.

  • Assembly language functions can use the vpackss[dw|wb] and vpackus[dw|wb] instructions to pack 128-bit or 256-bit wide integer operands using signed or unsigned saturation.

  • Assembly language functions can use the vmovzx[bw|bd|bq|wd|wq|dq] and vmovsx[bw|bd|bq|wd|wq|dq] instructions to perform zero or sign extended packed integer size promotions.

  • MASM supports directives that can perform rudimentary string processing operations, which can be employed to construct text strings for macro instruction mnemonics, operands, and labels. MASM also supports conditional error directives that can be used to signal error conditions during source code assembly.
