Daniel Kusswurm, Modern X86 Assembly Language Programming, https://doi.org/10.1007/978-1-4842-4063-2_10

10. AVX2 Programming – Packed Integers

Daniel Kusswurm, Geneva, IL, USA

In Chapter 7, you learned how to use the AVX instruction set to perform packed integer operations using 128-bit wide operands and the XMM register set. In this chapter, you learn how to carry out similar operations using AVX2 instructions with 256-bit wide operands and the YMM register set. Chapter 10’s source code examples are divided into two major sections. The first section contains elementary examples that illustrate basic operations using AVX2 instructions and 256-bit wide packed integer operands. The second section includes examples that are a continuation of the image processing techniques first presented in Chapter 7.

All of the source code examples in this chapter require a processor and operating system that supports AVX2. You can use one of the free utilities listed in Appendix A to verify the processing capabilities of your system.

Packed Integer Fundamentals

In this section, you learn how to perform fundamental packed integer operations using AVX2 instructions. The first source code example expounds basic arithmetic using 256-bit wide operands and the YMM register set. The second source code example demonstrates AVX2 instructions that carry out integer pack and unpack operations. This example also explains how to return a structure by value from an assembly language function. The final source code example illuminates AVX2 instructions that execute packed integer size promotions using zero or sign extended values.

Basic Arithmetic

Listing 10-1 shows the source code for example Ch10_01. This example illustrates how to perform basic arithmetic operations using packed word and doubleword operands.

//------------------------------------------------

// Ch10_01.cpp

//------------------------------------------------

#include "stdafx.h"

#include <iostream>

#include <iomanip>

#include "YmmVal.h"

using namespace std;

extern "C" void Avx2PackedMathI16_(const YmmVal& a, const YmmVal& b, YmmVal c[6]);

extern "C" void Avx2PackedMathI32_(const YmmVal& a, const YmmVal& b, YmmVal c[6]);

void Avx2PackedMathI16(void)

{

alignas(32) YmmVal a;

alignas(32) YmmVal b;

alignas(32) YmmVal c[6];

a.m_I16[0] = 10; b.m_I16[0] = 1000;

a.m_I16[1] = 20; b.m_I16[1] = 2000;

a.m_I16[2] = 3000; b.m_I16[2] = 30;

a.m_I16[3] = 4000; b.m_I16[3] = 40;

a.m_I16[4] = 30000; b.m_I16[4] = 3000; // add overflow

a.m_I16[5] = 6000; b.m_I16[5] = 32000; // add overflow

a.m_I16[6] = 2000; b.m_I16[6] = -31000; // sub overflow

a.m_I16[7] = 4000; b.m_I16[7] = -30000; // sub overflow

a.m_I16[8] = 4000; b.m_I16[8] = -2500;

a.m_I16[9] = 3600; b.m_I16[9] = -1200;

a.m_I16[10] = 6000; b.m_I16[10] = 9000;

a.m_I16[11] = -20000; b.m_I16[11] = -20000; // add overflow

a.m_I16[12] = -25000; b.m_I16[12] = -27000; // add overflow

a.m_I16[13] = 8000; b.m_I16[13] = 28700; // add overflow

a.m_I16[14] = 3; b.m_I16[14] = -32766; // sub overflow

a.m_I16[15] = -15000; b.m_I16[15] = 24000; // sub overflow

Avx2PackedMathI16_(a, b, c);

cout << "\nResults for Avx2PackedMathI16_\n";

cout << " i        a        b   vpaddw  vpaddsw   vpsubw  vpsubsw  vpminsw  vpmaxsw\n";

cout << "--------------------------------------------------------------------------\n";

for (int i = 0; i < 16; i++)

{

cout << setw(2) << i << ' ';

cout << setw(8) << a.m_I16[i] << ' ';

cout << setw(8) << b.m_I16[i] << ' ';

cout << setw(8) << c[0].m_I16[i] << ' ';

cout << setw(8) << c[1].m_I16[i] << ' ';

cout << setw(8) << c[2].m_I16[i] << ' ';

cout << setw(8) << c[3].m_I16[i] << ' ';

cout << setw(8) << c[4].m_I16[i] << ' ';

cout << setw(8) << c[5].m_I16[i] << '\n';

}

}

void Avx2PackedMathI32(void)

{

alignas(32) YmmVal a;

alignas(32) YmmVal b;

alignas(32) YmmVal c[6];

a.m_I32[0] = 64; b.m_I32[0] = 4;

a.m_I32[1] = 1024; b.m_I32[1] = 5;

a.m_I32[2] = -2048; b.m_I32[2] = 2;

a.m_I32[3] = 8192; b.m_I32[3] = 5;

a.m_I32[4] = -256; b.m_I32[4] = 8;

a.m_I32[5] = 4096; b.m_I32[5] = 7;

a.m_I32[6] = 16; b.m_I32[6] = 3;

a.m_I32[7] = 512; b.m_I32[7] = 6;

Avx2PackedMathI32_(a, b, c);

cout << "\nResults for Avx2PackedMathI32\n";

cout << " i      a      b   vpaddd   vpsubd  vpmulld  vpsllvd  vpsravd   vpabsd\n";

cout << "----------------------------------------------------------------------\n";

for (int i = 0; i < 8; i++)

{

cout << setw(2) << i << ' ';

cout << setw(6) << a.m_I32[i] << ' ';

cout << setw(6) << b.m_I32[i] << ' ';

cout << setw(8) << c[0].m_I32[i] << ' ';

cout << setw(8) << c[1].m_I32[i] << ' ';

cout << setw(8) << c[2].m_I32[i] << ' ';

cout << setw(8) << c[3].m_I32[i] << ' ';

cout << setw(8) << c[4].m_I32[i] << ' ';

cout << setw(8) << c[5].m_I32[i] << '\n';

}

}

int main()

{

Avx2PackedMathI16();

Avx2PackedMathI32();

return 0;

}

;-------------------------------------------------

; Ch10_01.asm

;-------------------------------------------------

; extern "C" void Avx2PackedMathI16_(const YmmVal& a, const YmmVal& b, YmmVal c[6])

.code

Avx2PackedMathI16_ proc

; Load values a and b, which must be properly aligned

vmovdqa ymm0,ymmword ptr [rcx] ;ymm0 = a

vmovdqa ymm1,ymmword ptr [rdx] ;ymm1 = b

; Perform packed arithmetic operations

vpaddw ymm2,ymm0,ymm1 ;add

vmovdqa ymmword ptr [r8],ymm2 ;save vpaddw result

vpaddsw ymm2,ymm0,ymm1 ;add with signed saturation

vmovdqa ymmword ptr [r8+32],ymm2 ;save vpaddsw result

vpsubw ymm2,ymm0,ymm1 ;sub

vmovdqa ymmword ptr [r8+64],ymm2 ;save vpsubw result

vpsubsw ymm2,ymm0,ymm1 ;sub with signed saturation

vmovdqa ymmword ptr [r8+96],ymm2 ;save vpsubsw result

vpminsw ymm2,ymm0,ymm1 ;signed minimums

vmovdqa ymmword ptr [r8+128],ymm2 ;save vpminsw result

vpmaxsw ymm2,ymm0,ymm1 ;signed maximums

vmovdqa ymmword ptr [r8+160],ymm2 ;save vpmaxsw result

vzeroupper

ret

Avx2PackedMathI16_ endp

; extern "C" void Avx2PackedMathI32_(const YmmVal& a, const YmmVal& b, YmmVal c[6])

Avx2PackedMathI32_ proc

; Load values a and b, which must be properly aligned

vmovdqa ymm0,ymmword ptr [rcx] ;ymm0 = a

vmovdqa ymm1,ymmword ptr [rdx] ;ymm1 = b

; Perform packed arithmetic operations

vpaddd ymm2,ymm0,ymm1 ;add

vmovdqa ymmword ptr [r8],ymm2 ;save vpaddd result

vpsubd ymm2,ymm0,ymm1 ;sub

vmovdqa ymmword ptr [r8+32],ymm2 ;save vpsubd result

vpmulld ymm2,ymm0,ymm1 ;signed mul (low 32 bits)

vmovdqa ymmword ptr [r8+64],ymm2 ;save vpmulld result

vpsllvd ymm2,ymm0,ymm1 ;shift left logical

vmovdqa ymmword ptr [r8+96],ymm2 ;save vpsllvd result

vpsravd ymm2,ymm0,ymm1 ;shift right arithmetic

vmovdqa ymmword ptr [r8+128],ymm2 ;save vpsravd result

vpabsd ymm2,ymm0 ;absolute value

vmovdqa ymmword ptr [r8+160],ymm2 ;save vpabsd result

vzeroupper

ret

Avx2PackedMathI32_ endp

end

Listing 10-1. Example Ch10_01

The C++ function Avx2PackedMathI16 contains code that demonstrates packed signed word arithmetic. This function begins with the definitions of YmmVal variables a, b, and c. Note that the C++ specifier alignas(32) is used with each YmmVal definition to ensure alignment on a 32-byte boundary. The signed word elements of both a and b are then initialized with test values. Following variable initialization, Avx2PackedMathI16 calls the assembly language function Avx2PackedMathI16_, which performs several packed arithmetic operations. The results are then streamed to cout. The C++ function Avx2PackedMathI32 is next. The structure of this function is similar to Avx2PackedMathI16, with the main difference being that it exercises packed doubleword operands.
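To see what the alignas(32) specifier provides, the following minimal sketch checks variable addresses at run time. The union YmmLike and the helper IsAligned32 are hypothetical names introduced only for this illustration; the book's actual YmmVal type is declared in YmmVal.h, which is not reproduced here. The check matters because vmovdqa faults if a 256-bit memory operand is not aligned on a 32-byte boundary.

```cpp
#include <cstdint>

// Hypothetical stand-in for YmmVal (assumption: the real type is also 32 bytes wide).
union YmmLike
{
    int16_t m_I16[16];
    int32_t m_I32[8];
};

// Returns true when p lies on a 32-byte boundary, which is what
// vmovdqa requires for a ymmword memory operand.
inline bool IsAligned32(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) % 32) == 0;
}
```

Declaring a variable as alignas(32) YmmLike c[6] aligns the start of the array; because each element is exactly 32 bytes, every element c[0] through c[5] is then aligned as well.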

The assembly language function Avx2PackedMathI16_ begins with a vmovdqa ymm0,ymmword ptr [rcx] instruction that loads YmmVal a into register YMM0. The ensuing vmovdqa ymm1,ymmword ptr [rdx] instruction loads YmmVal b into register YMM1. This is followed by a vpaddw ymm2,ymm0,ymm1 instruction that performs packed word addition of a and b. The vmovdqa ymmword ptr [r8],ymm2 instruction then saves the packed word sums to c[0]. The remaining assembly language code in Avx2PackedMathI16_ exercises the instructions vpaddsw, vpsubw, vpsubsw, vpminsw, and vpmaxsw to carry out additional arithmetic operations. Similar to the source code examples that you saw in Chapter 9, Avx2PackedMathI16_ uses a vzeroupper instruction before its ret instruction. This avoids potential performance penalties that can occur when the processor transitions from executing x86-AVX instructions to x86-SSE instructions, as explained in Chapter 8. The assembly language function Avx2PackedMathI32_ employs a similar structure to exercise commonly used packed doubleword instructions including vpaddd, vpsubd, vpmulld, vpsllvd, vpsravd, and vpabsd. Here are the results for source code example Ch10_01:

Results for Avx2PackedMathI16_

 i        a        b   vpaddw  vpaddsw   vpsubw  vpsubsw  vpminsw  vpmaxsw
--------------------------------------------------------------------------
 0       10     1000     1010     1010     -990     -990       10     1000
 1       20     2000     2020     2020    -1980    -1980       20     2000
 2     3000       30     3030     3030     2970     2970       30     3000
 3     4000       40     4040     4040     3960     3960       40     4000
 4    30000     3000   -32536    32767    27000    27000     3000    30000
 5     6000    32000   -27536    32767   -26000   -26000     6000    32000
 6     2000   -31000   -29000   -29000   -32536    32767   -31000     2000
 7     4000   -30000   -26000   -26000   -31536    32767   -30000     4000
 8     4000    -2500     1500     1500     6500     6500    -2500     4000
 9     3600    -1200     2400     2400     4800     4800    -1200     3600
10     6000     9000    15000    15000    -3000    -3000     6000     9000
11   -20000   -20000    25536   -32768        0        0   -20000   -20000
12   -25000   -27000    13536   -32768     2000     2000   -27000   -25000
13     8000    28700   -28836    32767   -20700   -20700     8000    28700
14        3   -32766   -32763   -32763   -32767    32767   -32766        3
15   -15000    24000     9000     9000    26536   -32768   -15000    24000

Results for Avx2PackedMathI32

 i      a      b   vpaddd   vpsubd  vpmulld  vpsllvd  vpsravd   vpabsd
----------------------------------------------------------------------
 0     64      4       68       60      256     1024        4       64
 1   1024      5     1029     1019     5120    32768       32     1024
 2  -2048      2    -2046    -2050    -4096    -8192     -512     2048
 3   8192      5     8197     8187    40960   262144      256     8192
 4   -256      8     -248     -264    -2048   -65536       -1      256
 5   4096      7     4103     4089    28672   524288       32     4096
 6     16      3       19       13       48      128        2       16
 7    512      6      518      506     3072    32768        8      512
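The difference between the wraparound columns (vpaddw, vpsubw) and the saturated columns (vpaddsw, vpsubsw) in the word results can be reproduced one element at a time in scalar C++. The sketch below is only a model of a single 16-bit element, not the AVX2 instructions themselves, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>

// Wraparound add: keep only the low 16 bits of the sum (one element of vpaddw).
// The uint16_t -> int16_t conversion wraps on mainstream compilers and is
// guaranteed to do so from C++20 onward.
inline int16_t AddWrap16(int16_t a, int16_t b)
{
    return static_cast<int16_t>(static_cast<uint16_t>(a + b));
}

// Wraparound subtract (one element of vpsubw).
inline int16_t SubWrap16(int16_t a, int16_t b)
{
    return static_cast<int16_t>(static_cast<uint16_t>(a - b));
}

// Signed saturated add: clamp the exact sum to [-32768, 32767] (one element of vpaddsw).
inline int16_t AddSat16(int16_t a, int16_t b)
{
    int sum = a + b;    // computed in int, so it cannot overflow
    return static_cast<int16_t>(std::min(std::max(sum, -32768), 32767));
}

// Signed saturated subtract (one element of vpsubsw).
inline int16_t SubSat16(int16_t a, int16_t b)
{
    int diff = a - b;
    return static_cast<int16_t>(std::min(std::max(diff, -32768), 32767));
}
```

Using the values from row 4 of the word results, AddWrap16(30000, 3000) yields -32536 while AddSat16(30000, 3000) yields 32767, matching the vpaddw and vpaddsw columns.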

On systems that support AVX2, most of the instructions exercised in this example can be used with a variety of 256-bit wide packed integer operands. For example, the vpadd[b|q] and vpsub[b|q] instructions carry out addition and subtraction using 256-bit wide packed byte or quadword operands. The vpaddsb and vpsubsb instructions perform signed saturated addition and subtraction using packed byte operands. The instructions vpmins[b|d] and vpmaxs[b|d] calculate packed signed minimums and maximums, respectively. The variable bit shift instructions vpsllv[d|q], vpsravd, and vpsrlv[d|q] are new AVX2 instructions. These instructions are not available on systems that only support AVX.
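What distinguishes the variable bit shift instructions is that each element carries its own shift count, taken from the corresponding element of the second source operand. The following scalar sketch models one doubleword element of vpsllvd and vpsravd, including the behavior for counts greater than 31. The function names are hypothetical, and this is a model rather than the instructions themselves.

```cpp
#include <cstdint>

// One element of vpsllvd: logical left shift by a per-element count.
// Counts greater than 31 produce zero.
inline int32_t ShiftLeftVar32(int32_t a, uint32_t count)
{
    if (count > 31)
        return 0;
    return static_cast<int32_t>(static_cast<uint32_t>(a) << count);
}

// One element of vpsravd: arithmetic right shift by a per-element count.
// Counts greater than 31 fill the element with copies of the sign bit.
inline int32_t ShiftRightArithVar32(int32_t a, uint32_t count)
{
    if (count > 31)
        return (a < 0) ? -1 : 0;
    // Right shift of a negative value is arithmetic on mainstream compilers
    // (guaranteed from C++20 onward).
    return a >> count;
}
```

With the values from row 4 of the doubleword results, ShiftLeftVar32(-256, 8) yields -65536 and ShiftRightArithVar32(-256, 8) yields -1.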

Pack and Unpack

The next source code example illustrates how to perform integer pack and unpack operations. These operations are often employed to size-reduce or size-promote packed integer operands. This example also explains how to return a structure by value from an assembly language function. Listing 10-2 shows the source code for example Ch10_02.

//------------------------------------------------

// Ch10_02.cpp

//------------------------------------------------

#include "stdafx.h"

#include <iostream>

#include <iomanip>

#include "YmmVal.h"

using namespace std;

struct alignas(32) YmmVal2

{

YmmVal m_YmmVal0;

YmmVal m_YmmVal1;

};

extern "C" YmmVal2 Avx2UnpackU32_U64_(const YmmVal& a, const YmmVal& b);

extern "C" void Avx2PackI32_I16_(const YmmVal& a, const YmmVal& b, YmmVal* c);

void Avx2UnpackU32_U64(void)

{

alignas(32) YmmVal a;

alignas(32) YmmVal b;

a.m_U32[0] = 0x00000000; b.m_U32[0] = 0x88888888;

a.m_U32[1] = 0x11111111; b.m_U32[1] = 0x99999999;

a.m_U32[2] = 0x22222222; b.m_U32[2] = 0xaaaaaaaa;

a.m_U32[3] = 0x33333333; b.m_U32[3] = 0xbbbbbbbb;

a.m_U32[4] = 0x44444444; b.m_U32[4] = 0xcccccccc;

a.m_U32[5] = 0x55555555; b.m_U32[5] = 0xdddddddd;

a.m_U32[6] = 0x66666666; b.m_U32[6] = 0xeeeeeeee;

a.m_U32[7] = 0x77777777; b.m_U32[7] = 0xffffffff;

YmmVal2 c = Avx2UnpackU32_U64_(a, b);

cout << "\nResults for Avx2UnpackU32_U64\n";

cout << "a lo " << a.ToStringX32(0) << '\n';

cout << "b lo " << b.ToStringX32(0) << '\n';

cout << '\n';

cout << "a hi " << a.ToStringX32(1) << '\n';

cout << "b hi " << b.ToStringX32(1) << '\n';

cout << "\nvpunpckldq result\n";

cout << "c.m_YmmVal0 lo " << c.m_YmmVal0.ToStringX64(0) << '\n';

cout << "c.m_YmmVal0 hi " << c.m_YmmVal0.ToStringX64(1) << '\n';

cout << "\nvpunpckhdq result\n";

cout << "c.m_YmmVal1 lo " << c.m_YmmVal1.ToStringX64(0) << '\n';

cout << "c.m_YmmVal1 hi " << c.m_YmmVal1.ToStringX64(1) << '\n';

}

void Avx2PackI32_I16(void)

{

alignas(32) YmmVal a;

alignas(32) YmmVal b;

alignas(32) YmmVal c;

a.m_I32[0] = 10; b.m_I32[0] = 32768;

a.m_I32[1] = -200000; b.m_I32[1] = 6500;

a.m_I32[2] = 300000; b.m_I32[2] = 42000;

a.m_I32[3] = -4000; b.m_I32[3] = -68000;

a.m_I32[4] = 9000; b.m_I32[4] = 25000;

a.m_I32[5] = 80000; b.m_I32[5] = 500000;

a.m_I32[6] = 200; b.m_I32[6] = -7000;

a.m_I32[7] = -32769; b.m_I32[7] = 12500;

Avx2PackI32_I16_(a, b, &c);

cout << "\nResults for Avx2PackI32_I16\n";

cout << "a lo " << a.ToStringI32(0) << '\n';

cout << "a hi " << a.ToStringI32(1) << '\n';

cout << '\n';

cout << "b lo " << b.ToStringI32(0) << '\n';

cout << "b hi " << b.ToStringI32(1) << '\n';

cout << '\n';

cout << "c lo " << c.ToStringI16(0) << '\n';

cout << "c hi " << c.ToStringI16(1) << '\n';

cout << '\n';

}

int main()

{

Avx2UnpackU32_U64();

Avx2PackI32_I16();

return 0;

}

;-------------------------------------------------

; Ch10_02.asm

;-------------------------------------------------

; extern "C" YmmVal2 Avx2UnpackU32_U64_(const YmmVal& a, const YmmVal& b);

.code

Avx2UnpackU32_U64_ proc

; Load argument values

vmovdqa ymm0,ymmword ptr [rdx] ;ymm0 = a

vmovdqa ymm1,ymmword ptr [r8] ;ymm1 = b

; Perform dword to qword unpacks

vpunpckldq ymm2,ymm0,ymm1 ;unpack low doublewords

vpunpckhdq ymm3,ymm0,ymm1 ;unpack high doublewords

; Save result to YmmVal2 buffer

vmovdqa ymmword ptr [rcx],ymm2 ;save low result

vmovdqa ymmword ptr [rcx+32],ymm3 ;save high result

mov rax,rcx ;rax = ptr to YmmVal2

vzeroupper

ret

Avx2UnpackU32_U64_ endp

; extern "C" void Avx2PackI32_I16_(const YmmVal& a, const YmmVal& b, YmmVal* c);

Avx2PackI32_I16_ proc

; Load argument values

vmovdqa ymm0,ymmword ptr [rcx] ;ymm0 = a

vmovdqa ymm1,ymmword ptr [rdx] ;ymm1 = b

; Perform pack dword to word with signed saturation

vpackssdw ymm2,ymm0,ymm1 ;ymm2 = packed words

vmovdqa ymmword ptr [r8],ymm2 ;save result

vzeroupper

ret

Avx2PackI32_I16_ endp

Foo1_ proc

ret

Foo1_ endp

end

Listing 10-2. Example Ch10_02

The C++ code in Listing 10-2 begins with the declaration of a structure named YmmVal2. This structure contains two YmmVal members: m_YmmVal0 and m_YmmVal1. Note that the alignas(32) specifier is used immediately after the keyword struct. Using this specifier ensures that all instances of YmmVal2 are aligned on a 32-byte boundary, including temporary instances created by the compiler. More on this in a moment. The assembly language function Avx2UnpackU32_U64_, whose declaration follows, returns an instance of YmmVal2 by value.

The C++ function Avx2UnpackU32_U64 begins by initializing the unsigned doubleword elements of YmmVal variables a and b. Following variable initialization is the statement YmmVal2 c = Avx2UnpackU32_U64_(a, b), which calls the assembly language function Avx2UnpackU32_U64_ to unpack the elements of a and b from doublewords to quadwords. Unlike previous examples, Avx2UnpackU32_U64_ returns its YmmVal2 result by value. Before proceeding, it is important to note that in most cases, returning a user-defined structure like YmmVal2 by value is less efficient than passing a pointer argument to a variable of type YmmVal2. The function Avx2UnpackU32_U64_ uses return-by-value principally for demonstration purposes and to elucidate the Visual C++ calling convention protocols that an assembly language function must observe when returning a structure by value is warranted. The remaining statements in Avx2UnpackU32_U64 stream the results from Avx2UnpackU32_U64_ to cout.

Following Avx2UnpackU32_U64 is the C++ function Avx2PackI32_I16. This function initializes the signed doubleword elements of YmmVal variables a and b. These values will be size-reduced to packed words. After initializing the YmmVal variables, Avx2PackI32_I16 calls the assembly language function Avx2PackI32_I16_ to carry out this size reduction. The results are then streamed to cout.

The calling convention that Visual C++ uses for functions that return a structure by value varies somewhat from the normal calling convention. Upon entry to the assembly language function Avx2UnpackU32_U64_, register RCX points to a temporary buffer where Avx2UnpackU32_U64_ must store its YmmVal2 return result. It is important to note that this buffer is not necessarily the same memory location as the destination YmmVal2 variable in the C++ statement that called Avx2UnpackU32_U64_. In order to implement expression evaluation and operator overloading, a C++ compiler often generates code that allocates temporary variables (or rvalues) to hold intermediate results. An rvalue that needs to be saved is ultimately copied to a named variable (or lvalue) using either a default or overloaded assignment operator. This copy operation is the reason why returning a structure by value is usually slower than passing a pointer argument. The alignas(32) specifier that’s used in the declaration of struct YmmVal2 directs the Visual C++ compiler to align all variables of type YmmVal2 including rvalues on a 32-byte boundary.

If the subject matter of the preceding paragraph seems a little abstract, don’t worry. Temporary storage space allocation for return-by-value structures is handled automatically by the C++ compiler. It’s more important to understand the following Visual C++ calling convention requirements that must be observed by any function that returns a large structure (any structure whose size is greater than eight bytes) by value:

The caller of a function that returns a large structure by value must allocate storage space for the returned structure. A pointer to this storage space must be passed to the called function in register RCX.
The normal calling convention argument registers are “right-shifted” by one. This means that the first three arguments are passed using registers RDX/XMM1, R8/XMM2, and R9/XMM3. Any remaining arguments are passed on the stack.
Prior to returning, the called function must load register RAX with a pointer to the returned structure.

If the size of a return-by-value structure is less than or equal to eight bytes, it must be returned in register RAX. The normal calling convention argument registers are used in these situations.
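One way to picture the requirements listed above is to write the hidden pointer out explicitly. The sketch below is a conceptual model only: the types Ymm32Bytes and YmmPair and the function UnpackExplicit are hypothetical stand-ins for YmmVal, YmmVal2, and the assembly language function, and the function body merely copies its arguments rather than performing an unpack.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the book's YmmVal and YmmVal2 types.
struct alignas(32) Ymm32Bytes { uint32_t m_U32[8]; };
struct alignas(32) YmmPair { Ymm32Bytes m_Val0; Ymm32Bytes m_Val1; };

// The caller writes a by-value call such as:
//     YmmPair c = SomeFunction(a, b);
// Under the convention described above, the callee effectively sees:
//     ret_buf in RCX (caller-allocated storage), a in RDX, b in R8,
// and it must hand the same pointer back in RAX.
YmmPair* UnpackExplicit(YmmPair* ret_buf, const Ymm32Bytes& a, const Ymm32Bytes& b)
{
    ret_buf->m_Val0 = a;    // placeholder payload; the real function stores unpacked data
    ret_buf->m_Val1 = b;
    return ret_buf;         // corresponds to mov rax,rcx
}
```

Because YmmPair is declared with alignas(32), any buffer the caller allocates for it, temporary or named, starts on a 32-byte boundary, which is what permits aligned 256-bit stores into it.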

Returning to the code, function Avx2UnpackU32_U64_ begins with a vmovdqa ymm0,ymmword ptr [rdx] instruction that loads YmmVal a (the first function argument) into register YMM0. The ensuing vmovdqa ymm1,ymmword ptr [r8] instruction loads YmmVal b (the second function argument) into register YMM1. The next two instructions, vpunpckldq ymm2,ymm0,ymm1 and vpunpckhdq ymm3,ymm0,ymm1, unpack the doublewords into quadwords, as shown in Figure 10-1. The results are then saved to the YmmVal2 buffer pointed to by RCX using two vmovdqa instructions. Note that two vmovdqu instructions would be required here if the structure YmmVal2 were declared without the alignas(32) specifier. As previously mentioned, the Visual C++ calling convention requires any function that returns a structure by value to load a copy of the structure buffer pointer into register RAX prior to returning. The mov rax,rcx instruction fulfills this requirement (recall that RCX contains a pointer to the structure buffer).

Figure 10-1. Execution of the vpunpckldq and vpunpckhdq instructions
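The per-lane interleaving performed by vpunpckldq and vpunpckhdq can also be expressed as a scalar model. The sketch below is an illustration under assumed names (U32x8, U64x4, MakeQword, UnpackLowDq, UnpackHighDq are all hypothetical); it treats a 256-bit operand as two independent 128-bit lanes of four doublewords each.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using U32x8 = std::array<uint32_t, 8>;   // eight doublewords = one 256-bit operand
using U64x4 = std::array<uint64_t, 4>;   // four quadwords = one 256-bit result

// Build a quadword with lo in bits 31:0 and hi in bits 63:32.
inline uint64_t MakeQword(uint32_t lo, uint32_t hi)
{
    return (static_cast<uint64_t>(hi) << 32) | lo;
}

// Model of vpunpckldq: interleave the two low doublewords of each 128-bit lane.
inline U64x4 UnpackLowDq(const U32x8& a, const U32x8& b)
{
    U64x4 r{};
    for (std::size_t lane = 0; lane < 2; lane++)
    {
        std::size_t d = lane * 4;   // first doubleword of this lane
        std::size_t q = lane * 2;   // first quadword of this lane
        r[q + 0] = MakeQword(a[d + 0], b[d + 0]);
        r[q + 1] = MakeQword(a[d + 1], b[d + 1]);
    }
    return r;
}

// Model of vpunpckhdq: interleave the two high doublewords of each 128-bit lane.
inline U64x4 UnpackHighDq(const U32x8& a, const U32x8& b)
{
    U64x4 r{};
    for (std::size_t lane = 0; lane < 2; lane++)
    {
        std::size_t d = lane * 4;
        std::size_t q = lane * 2;
        r[q + 0] = MakeQword(a[d + 2], b[d + 2]);
        r[q + 1] = MakeQword(a[d + 3], b[d + 3]);
    }
    return r;
}
```

Note that the low unpack draws from doublewords 0, 1, 4, and 5 of each source rather than from doublewords 0 through 3, which is the practical consequence of the two-lane design.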

The assembly language function Avx2PackI32_I16_ demonstrates use of the vpackssdw (Pack with Signed Saturation) instruction. In this function, the vpackssdw ymm2,ymm0,ymm1 instruction converts the 16 doubleword integers in registers YMM0 and YMM1 to word integers using signed saturation and saves the resulting 16 word integers in register YMM2. Figure 10-2 illustrates the execution of this instruction. X86-AVX also includes a vpacksswb instruction that performs signed word to byte size reductions. The vpackus[dw|wb] instructions perform the same size reductions using unsigned saturation.

Figure 10-2. Execution of the vpackssdw instruction
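The element ordering produced by vpackssdw is easy to get wrong, so a scalar model may help. The following sketch uses hypothetical names (I32x8, I16x16, SaturateToI16, PackSsDw) and models the instruction as two independent 128-bit lanes: within each lane, the four saturated words from the first source come first, followed by the four saturated words from the second source.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

using I32x8 = std::array<int32_t, 8>;     // eight doublewords = one 256-bit operand
using I16x16 = std::array<int16_t, 16>;   // sixteen words = one 256-bit result

// Clamp a doubleword to the signed word range [-32768, 32767].
inline int16_t SaturateToI16(int32_t x)
{
    return static_cast<int16_t>(std::min<int32_t>(std::max<int32_t>(x, -32768), 32767));
}

// Model of vpackssdw for 256-bit operands.
inline I16x16 PackSsDw(const I32x8& a, const I32x8& b)
{
    I16x16 r{};
    for (std::size_t lane = 0; lane < 2; lane++)
    {
        for (std::size_t i = 0; i < 4; i++)
        {
            r[lane * 8 + i]     = SaturateToI16(a[lane * 4 + i]);   // words 0-3 of the lane come from a
            r[lane * 8 + 4 + i] = SaturateToI16(b[lane * 4 + i]);   // words 4-7 of the lane come from b
        }
    }
    return r;
}
```

The result is therefore not simply all of a followed by all of b; the two sources alternate in groups of four words, one group pair per lane.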

Note that in Figures 10-1 and 10-2, the vpunpckldq, vpunpckhdq, and vpackssdw instructions carry out their operations using two 128-bit wide independent lanes, as explained in Chapter 4. Here are the results for source code example Ch10_02:

Results for Avx2UnpackU32_U64

a lo 00000000 11111111 | 22222222 33333333

b lo 88888888 99999999 | AAAAAAAA BBBBBBBB

a hi 44444444 55555555 | 66666666 77777777

b hi CCCCCCCC DDDDDDDD | EEEEEEEE FFFFFFFF

vpunpckldq result

c.m_YmmVal0 lo 8888888800000000 | 9999999911111111

c.m_YmmVal0 hi CCCCCCCC44444444 | DDDDDDDD55555555

vpunpckhdq result

c.m_YmmVal1 lo AAAAAAAA22222222 | BBBBBBBB33333333

c.m_YmmVal1 hi EEEEEEEE66666666 | FFFFFFFF77777777

Results for Avx2PackI32_I16

a lo 10 -200000 | 300000 -4000

a hi 9000 80000 | 200 -32769

b lo 32768 6500 | 42000 -68000

b hi 25000 500000 | -7000 12500

c lo 10 -32768 32767 -4000 | 32767 6500 32767 -32768

c hi 9000 32767 200 -32768 | 25000 32767 -7000 12500

Size Promotions

In Chapter 7, you learned how to use the vpunpckl[bw|wd] and vpunpckh[bw|wd] instructions to size-promote packed integers (see source code examples Ch07_05, Ch07_06, and Ch07_08). The next source code example, Ch10_03, demonstrates how to employ the vpmovzx[bw|bd] and vpmovsx[wd|wq] instructions to size-promote packed integers using either zero or sign extension. Listing 10-3 shows the source code for example Ch10_03.

//------------------------------------------------

// Ch10_03.cpp

//------------------------------------------------

#include "stdafx.h"

#include <cstdint>

#include <iostream>

#include <string>

#include "YmmVal.h"

using namespace std;

extern "C" void Avx2ZeroExtU8_U16_(YmmVal*a, YmmVal b[2]);

extern "C" void Avx2ZeroExtU8_U32_(YmmVal*a, YmmVal b[4]);

extern "C" void Avx2SignExtI16_I32_(YmmVal*a, YmmVal b[2]);

extern "C" void Avx2SignExtI16_I64_(YmmVal*a, YmmVal b[4]);

const string c_Line(80, '-');

void Avx2ZeroExtU8_U16(void)

{

alignas(32) YmmVal a;

alignas(32) YmmVal b[2];

for (int i = 0; i < 32; i++)

a.m_U8[i] = (uint8_t)(i * 8);

Avx2ZeroExtU8_U16_(&a, b);

cout << "\nResults for Avx2ZeroExtU8_U16_\n";

cout << c_Line << '\n';

cout << "a (0:15): " << a.ToStringU8(0) << '\n';

cout << "a (16:31): " << a.ToStringU8(1) << '\n';

cout << '\n';

cout << "b (0:7): " << b[0].ToStringU16(0) << '\n';

cout << "b (8:15): " << b[0].ToStringU16(1) << '\n';

cout << "b (16:23): " << b[1].ToStringU16(0) << '\n';

cout << "b (24:31): " << b[1].ToStringU16(1) << '\n';

}

void Avx2ZeroExtU8_U32(void)

{

alignas(32) YmmVal a;

alignas(32) YmmVal b[4];

for (int i = 0; i < 32; i++)

a.m_U8[i] = (uint8_t)(255 - i * 8);

Avx2ZeroExtU8_U32_(&a, b);

cout << "\nResults for Avx2ZeroExtU8_U32_\n";

cout << c_Line << '\n';

cout << "a (0:15): " << a.ToStringU8(0) << '\n';

cout << "a (16:31): " << a.ToStringU8(1) << '\n';

cout << '\n';

cout << "b (0:3): " << b[0].ToStringU32(0) << '\n';

cout << "b (4:7): " << b[0].ToStringU32(1) << '\n';

cout << "b (8:11): " << b[1].ToStringU32(0) << '\n';

cout << "b (12:15): " << b[1].ToStringU32(1) << '\n';

cout << "b (16:19): " << b[2].ToStringU32(0) << '\n';

cout << "b (20:23): " << b[2].ToStringU32(1) << '\n';

cout << "b (24:27): " << b[3].ToStringU32(0) << '\n';

cout << "b (28:31): " << b[3].ToStringU32(1) << '\n';

}

void Avx2SignExtI16_I32()

{

alignas(32) YmmVal a;

alignas(32) YmmVal b[2];

for (int i = 0; i < 16; i++)

a.m_I16[i] = (int16_t)(-32768 + i * 4000);

Avx2SignExtI16_I32_(&a, b);

cout << "\nResults for Avx2SignExtI16_I32_\n";

cout << c_Line << '\n';

cout << "a (0:7): " << a.ToStringI16(0) << '\n';

cout << "a (8:15): " << a.ToStringI16(1) << '\n';

cout << '\n';

cout << "b (0:3): " << b[0].ToStringI32(0) << '\n';

cout << "b (4:7): " << b[0].ToStringI32(1) << '\n';

cout << "b (8:11): " << b[1].ToStringI32(0) << '\n';

cout << "b (12:15): " << b[1].ToStringI32(1) << '\n';

}

void Avx2SignExtI16_I64()

{

alignas(32) YmmVal a;

alignas(32) YmmVal b[4];

for (int i = 0; i < 16; i++)

a.m_I16[i] = (int16_t)(32767 - i * 4000);

Avx2SignExtI16_I64_(&a, b);

cout << "\nResults for Avx2SignExtI16_I64_\n";

cout << c_Line << '\n';

cout << "a (0:7): " << a.ToStringI16(0) << '\n';

cout << "a (8:15): " << a.ToStringI16(1) << '\n';

cout << '\n';

cout << "b (0:1): " << b[0].ToStringI64(0) << '\n';

cout << "b (2:3): " << b[0].ToStringI64(1) << '\n';

cout << "b (4:5): " << b[1].ToStringI64(0) << '\n';

cout << "b (6:7): " << b[1].ToStringI64(1) << '\n';

cout << "b (8:9): " << b[2].ToStringI64(0) << '\n';

cout << "b (10:11): " << b[2].ToStringI64(1) << '\n';

cout << "b (12:13): " << b[3].ToStringI64(0) << '\n';

cout << "b (14:15): " << b[3].ToStringI64(1) << '\n';

}

int main()

{

Avx2ZeroExtU8_U16();

Avx2ZeroExtU8_U32();

Avx2SignExtI16_I32();

Avx2SignExtI16_I64();

return 0;

}

;-------------------------------------------------

; Ch10_03.asm

;-------------------------------------------------

; extern "C" void Avx2ZeroExtU8_U16_(YmmVal*a, YmmVal b[2]);

.code

Avx2ZeroExtU8_U16_ proc

vpmovzxbw ymm0,xmmword ptr [rcx] ;zero extend a[0] - a[15]

vpmovzxbw ymm1,xmmword ptr [rcx+16] ;zero extend a[16] - a[31]

vmovdqa ymmword ptr [rdx],ymm0 ;save results

vmovdqa ymmword ptr [rdx+32],ymm1

vzeroupper

ret

Avx2ZeroExtU8_U16_ endp

; extern "C" void Avx2ZeroExtU8_U32_(YmmVal*a, YmmVal b[4]);

Avx2ZeroExtU8_U32_ proc

vpmovzxbd ymm0,qword ptr [rcx] ;zero extend a[0] - a[7]

vpmovzxbd ymm1,qword ptr [rcx+8] ;zero extend a[8] - a[15]

vpmovzxbd ymm2,qword ptr [rcx+16] ;zero extend a[16] - a[23]

vpmovzxbd ymm3,qword ptr [rcx+24] ;zero extend a[24] - a[31]

vmovdqa ymmword ptr [rdx],ymm0 ;save results

vmovdqa ymmword ptr [rdx+32],ymm1

vmovdqa ymmword ptr [rdx+64],ymm2

vmovdqa ymmword ptr [rdx+96],ymm3

vzeroupper

ret

Avx2ZeroExtU8_U32_ endp

; extern "C" void Avx2SignExtI16_I32_(YmmVal*a, YmmVal b[2])

Avx2SignExtI16_I32_ proc

vpmovsxwd ymm0,xmmword ptr [rcx] ;sign extend a[0] - a[7]

vpmovsxwd ymm1,xmmword ptr [rcx+16] ;sign extend a[8] - a[15]

vmovdqa ymmword ptr [rdx],ymm0 ;save results

vmovdqa ymmword ptr [rdx+32],ymm1

vzeroupper

ret

Avx2SignExtI16_I32_ endp

; extern "C" void Avx2SignExtI16_I64_(YmmVal*a, YmmVal b[4])

Avx2SignExtI16_I64_ proc

vpmovsxwq ymm0,qword ptr [rcx] ;sign extend a[0] - a[3]

vpmovsxwq ymm1,qword ptr [rcx+8] ;sign extend a[4] - a[7]

vpmovsxwq ymm2,qword ptr [rcx+16] ;sign extend a[8] - a[11]

vpmovsxwq ymm3,qword ptr [rcx+24] ;sign extend a[12] - a[15]

vmovdqa ymmword ptr [rdx],ymm0 ;save results

vmovdqa ymmword ptr [rdx+32],ymm1

vmovdqa ymmword ptr [rdx+64],ymm2

vmovdqa ymmword ptr [rdx+96],ymm3

vzeroupper

ret

Avx2SignExtI16_I64_ endp

end

Listing 10-3. Example Ch10_03

The C++ code in Listing 10-3 contains four functions that initialize test cases for various packed size-promotion operations. The first function, Avx2ZeroExtU8_U16, begins by initializing the unsigned byte elements of YmmVal a. It then calls the assembly language function Avx2ZeroExtU8_U16_ to size-promote the packed unsigned bytes into packed unsigned words. The function Avx2ZeroExtU8_U32 performs a similar set of initializations to demonstrate packed unsigned byte to packed unsigned doubleword promotions. The functions Avx2SignExtI16_I32 and Avx2SignExtI16_I64 initialize test cases for packed signed word to packed signed doubleword and packed signed quadword size promotions.

The first instruction in the assembly language function Avx2ZeroExtU8_U16_, vpmovzxbw ymm0,xmmword ptr [rcx], loads and zero-extends the 16 low-order bytes of YmmVal a (pointed to by register RCX) and saves these values in register YMM0. The ensuing vpmovzxbw ymm1,xmmword ptr [rcx+16] instruction performs the same operation using the 16 high-order bytes of YmmVal a. The function Avx2ZeroExtU8_U16_ then uses two vmovdqa instructions to save the size-promoted results.

The assembly language function Avx2ZeroExtU8_U32_ performs packed byte to doubleword size promotions. The first instruction, vpmovzxbd ymm0,qword ptr [rcx], loads and zero-extends the eight low-order bytes of YmmVal a into doublewords and saves these values in register YMM0. The three ensuing vpmovzxbd instructions size-promote the remaining byte values in YmmVal a. The results are then saved using a series of vmovdqa instructions. When working with unsigned 8-bit values, it is sometimes (depending on the algorithm) more expedient to use the vpmovzxbd instruction to perform a packed byte to packed doubleword size promotion instead of a semantically equivalent series of vpunpckl[bw|wd] and vpunpckh[bw|wd] instructions. You see an example of this in Chapter 14.

The assembly language functions Avx2SignExtI16_I32_ and Avx2SignExtI16_I64_ demonstrate how to use the vpmovsxwd and vpmovsxwq instructions, respectively. These instructions size-promote and sign-extend packed word integers to doublewords and quadwords. X86-AVX also includes the packed move with sign extension instructions vpmovsx[bw|bd|bq] and vpmovsxdq. Here is the output for source code example Ch10_03:

Results for Avx2ZeroExtU8_U16_

--------------------------------------------------------------------------------

a (0:15): 0 8 16 24 32 40 48 56 | 64 72 80 88 96 104 112 120

a (16:31): 128 136 144 152 160 168 176 184 | 192 200 208 216 224 232 240 248

b (0:7): 0 8 16 24 | 32 40 48 56

b (8:15): 64 72 80 88 | 96 104 112 120

b (16:23): 128 136 144 152 | 160 168 176 184

b (24:31): 192 200 208 216 | 224 232 240 248

Results for Avx2ZeroExtU8_U32_

--------------------------------------------------------------------------------

a (0:15): 255 247 239 231 223 215 207 199 | 191 183 175 167 159 151 143 135

a (16:31): 127 119 111 103 95 87 79 71 | 63 55 47 39 31 23 15 7

b (0:3): 255 247 | 239 231

b (4:7): 223 215 | 207 199

b (8:11): 191 183 | 175 167

b (12:15): 159 151 | 143 135

b (16:19): 127 119 | 111 103

b (20:23): 95 87 | 79 71

b (24:27): 63 55 | 47 39

b (28:31): 31 23 | 15 7

Results for Avx2SignExtI16_I32_

--------------------------------------------------------------------------------

a (0:7): -32768 -28768 -24768 -20768 | -16768 -12768 -8768 -4768

a (8:15): -768 3232 7232 11232 | 15232 19232 23232 27232

b (0:3): -32768 -28768 | -24768 -20768

b (4:7): -16768 -12768 | -8768 -4768

b (8:11): -768 3232 | 7232 11232

b (12:15): 15232 19232 | 23232 27232

Results for Avx2SignExtI16_I64_

--------------------------------------------------------------------------------

a (0:7): 32767 28767 24767 20767 | 16767 12767 8767 4767

a (8:15): 767 -3233 -7233 -11233 | -15233 -19233 -23233 -27233

b (0:1): 32767 | 28767

b (2:3): 24767 | 20767

b (4:5): 16767 | 12767

b (6:7): 8767 | 4767

b (8:9): 767 | -3233

b (10:11): -7233 | -11233

b (12:13): -15233 | -19233

b (14:15): -23233 | -27233
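The distinction between zero extension and sign extension shown in these results can be stated compactly in scalar C++. The sketch below models the element-wise effect of one vpmovzxbd instruction (eight bytes to eight doublewords) and one vpmovsxwq instruction (four words to four quadwords). The function names are hypothetical, and the code is a model of the semantics rather than a replacement for the instructions.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Model of vpmovzxbd: eight unsigned bytes become eight doublewords
// whose upper 24 bits are zero.
inline std::array<uint32_t, 8> ZeroExtU8ToU32(const uint8_t* src)
{
    std::array<uint32_t, 8> r{};
    for (std::size_t i = 0; i < r.size(); i++)
        r[i] = src[i];      // unsigned source, so the widening conversion zero-extends
    return r;
}

// Model of vpmovsxwq: four signed words become four quadwords
// whose upper 48 bits replicate the sign bit.
inline std::array<int64_t, 4> SignExtI16ToI64(const int16_t* src)
{
    std::array<int64_t, 4> r{};
    for (std::size_t i = 0; i < r.size(); i++)
        r[i] = src[i];      // signed source, so the widening conversion sign-extends
    return r;
}
```

For example, the word -3233 (bit pattern 0xF35F) sign-extends to the quadword 0xFFFFFFFFFFFFF35F, which still represents -3233, whereas the byte 255 zero-extends to the doubleword 0x000000FF.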

Packed Integer Image Processing

In Chapter 7, you learned how to use the AVX instruction set to perform some common image processing operations using 128-bit wide packed unsigned integer operands. The source code examples of this section demonstrate additional image processing methods using AVX2 instructions with 256-bit wide packed unsigned integer operands. The first source code example illustrates how to clip the pixel intensity values of a grayscale image. This is followed by an example that determines the minimum and maximum pixel intensity values of an RGB image. The final source code example uses the AVX2 instruction set to perform RGB to grayscale image conversion.

Pixel Clipping

Pixel clipping is an image processing technique that bounds the intensity values of each pixel in an image between two threshold limits. This technique is often used to reduce the dynamic range of an image by eliminating its extremely dark and light pixels. Source code example Ch10_04 illustrates how to use the AVX2 instruction set to clip the pixels of an 8-bit grayscale image. Listing 10-4 shows the C++ and assembly language source code for example Ch10_04.

//------------------------------------------------

// Ch10_04.h

//------------------------------------------------

#pragma once

#include <cstdint>

// The following structure must match the structure that's declared in the .asm file

struct ClipData

{

uint8_t* m_Src; // source buffer pointer

uint8_t* m_Des; // destination buffer pointer

uint64_t m_NumPixels; // number of pixels

uint64_t m_NumClippedPixels; // number of clipped pixels

uint8_t m_ThreshLo; // low threshold

uint8_t m_ThreshHi; // high threshold

};

// Functions defined in Ch10_04.cpp

extern void Init(uint8_t* x, uint64_t n, unsigned int seed);

extern bool Avx2ClipPixelsCpp(ClipData* cd);

// Functions defined in Ch10_04_.asm

extern "C" bool Avx2ClipPixels_(ClipData* cd);

// Functions defined in Ch10_04_BM.cpp

extern void Avx2ClipPixels_BM(void);

//------------------------------------------------

// Ch10_04.cpp

//------------------------------------------------

#include "stdafx.h"

#include <iostream>

#include <random>

#include <memory.h>

#include <limits>

#include "Ch10_04.h"

#include "AlignedMem.h"

using namespace std;

void Init(uint8_t* x, uint64_t n, unsigned int seed)

{

uniform_int_distribution<> ui_dist {0, 255};

default_random_engine rng {seed};

for (size_t i = 0; i < n; i++)

x[i] = (uint8_t)ui_dist(rng);

}

bool Avx2ClipPixelsCpp(ClipData* cd)

{

uint8_t* src = cd->m_Src;

uint8_t* des = cd->m_Des;

uint64_t num_pixels = cd->m_NumPixels;

if (num_pixels == 0 || (num_pixels % 32) != 0)

return false;

if (!AlignedMem::IsAligned(src, 32) || !AlignedMem::IsAligned(des, 32))

return false;

uint64_t num_clipped_pixels = 0;

uint8_t thresh_lo = cd->m_ThreshLo;

uint8_t thresh_hi = cd->m_ThreshHi;

for (uint64_t i = 0; i < num_pixels; i++)

{

uint8_t pixel = src[i];

if (pixel < thresh_lo)

{

des[i] = thresh_lo;

num_clipped_pixels++;

}

else if (pixel > thresh_hi)

{

des[i] = thresh_hi;

num_clipped_pixels++;

}

else

des[i] = src[i];

}

cd->m_NumClippedPixels = num_clipped_pixels;

return true;

}

void Avx2ClipPixels(void)

{

const uint8_t thresh_lo = 10;

const uint8_t thresh_hi = 245;

const uint64_t num_pixels = 4 * 1024 * 1024;

AlignedArray<uint8_t> src(num_pixels, 32);

AlignedArray<uint8_t> des1(num_pixels, 32);

AlignedArray<uint8_t> des2(num_pixels, 32);

Init(src.Data(), num_pixels, 157);

ClipData cd1;

ClipData cd2;

cd1.m_Src = src.Data();

cd1.m_Des = des1.Data();

cd1.m_NumPixels = num_pixels;

cd1.m_NumClippedPixels = numeric_limits<uint64_t>::max();

cd1.m_ThreshLo = thresh_lo;

cd1.m_ThreshHi = thresh_hi;

cd2.m_Src = src.Data();

cd2.m_Des = des2.Data();

cd2.m_NumPixels = num_pixels;

cd2.m_NumClippedPixels = numeric_limits<uint64_t>::max();

cd2.m_ThreshLo = thresh_lo;

cd2.m_ThreshHi = thresh_hi;

Avx2ClipPixelsCpp(&cd1);

Avx2ClipPixels_(&cd2);

cout << " Results for Avx2ClipPixels ";

cout << " cd1.m_NumClippedPixels1: " << cd1.m_NumClippedPixels << ' ';

cout << " cd2.m_NumClippedPixels2: " << cd2.m_NumClippedPixels << ' ';

if (cd1.m_NumClippedPixels != cd2.m_NumClippedPixels)

cout << " NumClippedPixels compare error ";

if (memcmp(des1.Data(), des2.Data(), num_pixels) == 0)

cout << " Pixel buffer memory compare passed ";

else

cout << " Pixel buffer memory compare failed ";

}

int main(void)

{

Avx2ClipPixels();

Avx2ClipPixels_BM();

return 0;

}

;-------------------------------------------------

; Ch10_04.asm

;-------------------------------------------------

; The following structure must match the structure that's declared in the .h file

ClipData struct

Src qword ? ;source buffer pointer

Des qword ? ;destination buffer pointer

NumPixels qword ? ;number of pixels

NumClippedPixels qword ? ;number of clipped pixels

ThreshLo byte ? ;low threshold

ThreshHi byte ? ;high threshold

ClipData ends

; extern "C" bool Avx2ClipPixels_(ClipData* cd)

.code

Avx2ClipPixels_ proc

; Load and validate arguments

xor eax,eax ;set error return code

xor r8d,r8d ;r8 = number of clipped pixels

mov rdx,[rcx+ClipData.NumPixels] ;rdx = num_pixels

or rdx,rdx

jz Done ;jump if num_pixels is zero

test rdx,1fh

jnz Done ;jump if num_pixels % 32 != 0

mov r10,[rcx+ClipData.Src] ;r10 = Src

test r10,1fh

jnz Done ;jump if Src is misaligned

mov r11,[rcx+ClipData.Des] ;r11 = Des

test r11,1fh

jnz Done ;jump if Des is misaligned

; Create packed thresh_lo and thresh_hi data values

vpbroadcastb ymm4,[rcx+ClipData.ThreshLo] ;ymm4 = packed thresh_lo

vpbroadcastb ymm5,[rcx+ClipData.ThreshHi] ;ymm5 = packed thresh_hi

; Clip pixels to threshold values

@@: vmovdqa ymm0,ymmword ptr [r10] ;ymm0 = 32 pixels

vpmaxub ymm1,ymm0,ymm4 ;clip to thresh_lo

vpminub ymm2,ymm1,ymm5 ;clip to thresh_hi

vmovdqa ymmword ptr [r11],ymm2 ;save clipped pixels

; Count number of clipped pixels

vpcmpeqb ymm3,ymm2,ymm0 ;compare clipped pixels to original

vpmovmskb eax,ymm3 ;eax = mask of non-clipped pixels

not eax ;eax = mask of clipped pixels

popcnt eax,eax ;eax = number of clipped pixels

add r8,rax ;update clipped pixel count

; Update pointers and loop counter

add r10,32 ;update Src ptr

add r11,32 ;update Des ptr

sub rdx,32 ;update loop counter

jnz @B ;repeat if not done

mov eax,1 ;set success return code

; Save num_clipped_pixels

Done: mov [rcx+ClipData.NumClippedPixels],r8 ;save num_clipped_pixels

vzeroupper

ret

Avx2ClipPixels_ endp

end

Listing 10-4.

Example Ch10_04

The C++ code begins with the declaration of a structure named ClipData. This structure and its assembly language equivalent are used to maintain the pixel-clipping algorithm’s data. Following the function declarations in the header file Ch10_04.h is the definition of a C++ function named Init. This function initializes the elements of a uint8_t array using random values, which simulates the pixel values of a grayscale image. The function Avx2ClipPixelsCpp is a C++ implementation of the pixel clipping algorithm. This function starts by verifying that num_pixels is non-zero and evenly divisible by 32. Restricting the algorithm to images that contain an integral multiple of 32 pixels is not as inflexible as it might appear. Most digital camera images are sized using multiples of 64 pixels due to the processing requirements of the JPEG compression algorithms. Following validation of num_pixels, the source and destination pixel buffers are checked for proper alignment on a 32-byte boundary.
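The argument checks described above can be sketched in a few lines of portable C++. This is only an illustration: the source code for AlignedMem::IsAligned is not shown in this chapter, so the helper IsAlignedSketch and the function ValidateClipArgs are hypothetical names, and the sketch assumes that the alignment value is a power of two. Note how the bit mask 0x1f plays the same role as the test rdx,1fh and test r10,1fh instructions in the assembly language code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for AlignedMem::IsAligned (its source is not shown here).
// 'alignment' must be a power of two so that (alignment - 1) is a valid bit mask.
inline bool IsAlignedSketch(const void* p, std::size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Mirrors the argument checks: a non-zero pixel count that is a multiple of 32,
// plus source and destination buffers that start on 32-byte boundaries.
inline bool ValidateClipArgs(const std::uint8_t* src, const std::uint8_t* des,
                             std::uint64_t num_pixels)
{
    if (num_pixels == 0 || (num_pixels & 0x1f) != 0)   // same idea as 'test rdx,1fh'
        return false;

    return IsAlignedSketch(src, 32) && IsAlignedSketch(des, 32);
}
```

A mask test is used here instead of the modulus operator simply because it maps directly onto the assembly language version; for a power-of-two divisor the two tests are equivalent.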

The procedure used in Avx2ClipPixelsCpp to perform pixel clipping is straightforward. A simple for loop examines each pixel element in the source image buffer. If the intensity value of a source image pixel is found to be below thresh_lo or above thresh_hi, the corresponding threshold limit is saved in the destination buffer. Source image pixels whose intensity values lie between the two threshold limits are copied to the destination pixel buffer unaltered. The processing loop in Avx2ClipPixelsCpp also counts the number of clipped pixels for comparison purposes with the assembly language version of the algorithm.

Function Avx2ClipPixels exploits the C++ template class AlignedArray to allocate and manage the required image pixel buffers (see Chapter 7 for a description of this class). Following source image pixel buffer initialization, Avx2ClipPixels primes two instances of ClipData (cd1 and cd2) for use by the pixel clipping functions Avx2ClipPixelsCpp and Avx2ClipPixels_. It then invokes these functions and compares the results for any discrepancies.

Toward the top of the assembly language code is the declaration for data structure ClipData, which is semantically equivalent to its C++ counterpart. The function Avx2ClipPixels_ begins its execution by validating num_pixels for size and divisibility by 32. It then checks the source and destination pixel buffers for proper alignment. Following argument validation, Avx2ClipPixels_ employs two vpbroadcastb instructions to create packed versions of the threshold limit values thresh_lo and thresh_hi in registers YMM4 and YMM5, respectively. During each processing loop iteration, the vmovdqa ymm0,ymmword ptr [r10] instruction loads 32 pixel values from the source image pixel buffer into register YMM0. The ensuing vpmaxub ymm1,ymm0,ymm4 instruction clips the pixel values in YMM0 to thresh_lo. This is followed by a vpminub ymm2,ymm1,ymm5 instruction that clips the pixel values to thresh_hi. The vmovdqa ymmword ptr [r11],ymm2 instruction then saves the clipped pixel intensity values to the destination image pixel buffer.
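The following scalar C++ sketch models what the vpmaxub and vpminub pair does to the 32 byte elements of a YMM register. It is not the author's code; the type alias Lanes32 and the function ClipLanes are hypothetical names introduced for illustration, and the sketch assumes that thresh_lo is less than or equal to thresh_hi.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

// Models the 32 unsigned byte elements of a YMM register (hypothetical helper type)
using Lanes32 = std::array<std::uint8_t, 32>;

// Scalar model of vpmaxub followed by vpminub. Every element is clipped
// independently, and no conditional jump is needed per pixel.
inline Lanes32 ClipLanes(const Lanes32& pixels, std::uint8_t thresh_lo, std::uint8_t thresh_hi)
{
    Lanes32 result {};

    for (std::size_t i = 0; i < pixels.size(); i++)
    {
        std::uint8_t t = std::max(pixels[i], thresh_lo);   // vpmaxub: raise values below thresh_lo
        result[i] = std::min(t, thresh_hi);                // vpminub: lower values above thresh_hi
    }

    return result;
}
```

Compared with the if, else if, else chain in Avx2ClipPixelsCpp, the max-then-min formulation produces the same clipped values but contains no data-dependent branches, which is what allows it to be applied to all 32 pixels with two instructions.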

Avx2ClipPixels_ counts the number of clipped pixels using a straightforward sequence of instructions. The vpcmpeqb ymm3,ymm2,ymm0 instruction compares the original pixel values in YMM0 to the clipped pixel values in YMM2 for equality. Each byte element in YMM3 is set to 0xff if the original and clipped pixel intensity values are equal; otherwise, the YMM3 byte element is set to 0x00. The vpmovmskb eax,ymm3 instruction that follows creates a mask of the most significant bit of each byte element in YMM3 and saves this mask to register EAX. More specifically, this instruction computes eax[i] = ymm3[i*8+7] for i = 0, 1, 2, … 31, which means that each 1 bit in register EAX signifies a non-clipped pixel. The ensuing not eax instruction converts the bit pattern in EAX to a mask of clipped pixels, and the popcnt eax,eax instruction counts the number of 1 bits in EAX. This count value, which corresponds to the number of clipped pixels in YMM2, is then added to the total number of clipped pixels in register R8. The processing loop repeats until all pixels have been processed. Here are the results for source code example Ch10_04:

Results for Avx2ClipPixels

cd1.m_NumClippedPixels1: 328090

cd2.m_NumClippedPixels2: 328090

Pixel buffer memory compare passed

Running benchmark function Avx2ClipPixels_BM - please wait

Benchmark times save to file Ch10_04_Avx2ClipPixels_BM_CHROMIUM.csv
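The clipped-pixel counting sequence used by Avx2ClipPixels_ (vpcmpeqb, vpmovmskb, not, and popcnt) can be modeled with ordinary C++ as shown next. This is a sketch for study purposes only; Lanes32, EqualMask, PopCount32, and CountClipped are hypothetical names that do not appear in the example's source code.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Models the 32 unsigned byte elements of a YMM register (hypothetical helper type)
using Lanes32 = std::array<std::uint8_t, 32>;

// Scalar model of vpcmpeqb + vpmovmskb: bit i of the result is 1 when
// element i of 'clipped' equals element i of 'original' (a non-clipped pixel).
inline std::uint32_t EqualMask(const Lanes32& original, const Lanes32& clipped)
{
    std::uint32_t mask = 0;

    for (std::size_t i = 0; i < 32; i++)
    {
        if (original[i] == clipped[i])
            mask |= (std::uint32_t{1} << i);
    }

    return mask;
}

// Scalar model of popcnt: counts the 1 bits in a 32-bit value
inline unsigned PopCount32(std::uint32_t x)
{
    unsigned count = 0;

    while (x != 0)
    {
        x &= x - 1;     // clear the lowest set bit
        count++;
    }

    return count;
}

// 'not eax' converts the mask of non-clipped pixels into a mask of clipped
// pixels; 'popcnt eax,eax' then counts them.
inline unsigned CountClipped(const Lanes32& original, const Lanes32& clipped)
{
    return PopCount32(~EqualMask(original, clipped));
}
```

Because the mask contains exactly one bit per pixel, the bitwise inversion never introduces stray bits, and the count for each block of 32 pixels is always between 0 and 32.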

Table 10-1 shows the benchmark timing measurements for the pixel clipping functions Avx2ClipPixelsCpp and Avx2ClipPixels_.

Table 10-1.

Mean Execution Times (Microseconds) for Pixel Clipping Functions (Image Buffer Size = 8 MB)

CPU	Avx2ClipPixelsCpp	Avx2ClipPixels_
i7-4790S	13005	1078
i9-7900X	11617	719
i7-8700K	11252	644

RGB Pixel Min-Max Values

Listing 10-5 shows the C++ and assembly language source code for example Ch10_05, which illustrates how to calculate the minimum and maximum pixel intensity values in an RGB image. This example also explains how to exploit some of MASM’s advanced macro processing capabilities.

//------------------------------------------------

// Ch10_05.cpp

//------------------------------------------------

#include "stdafx.h"

#include <cstdint>

#include <iostream>

#include <iomanip>

#include <random>

#include "AlignedMem.h"

using namespace std;

extern "C" bool Avx2CalcRgbMinMax_(uint8_t* rgb[3], size_t num_pixels, uint8_t min_vals[3], uint8_t max_vals[3]);

void Init(uint8_t* rgb[3], size_t n, unsigned int seed)

{

uniform_int_distribution<> ui_dist {5, 250};

default_random_engine rng {seed};

for (size_t i = 0; i < n; i++)

{

rgb[0][i] = (uint8_t)ui_dist(rng);

rgb[1][i] = (uint8_t)ui_dist(rng);

rgb[2][i] = (uint8_t)ui_dist(rng);

}

// Set known min & max values for validation purposes

rgb[0][n / 4] = 4; rgb[1][n / 2] = 1; rgb[2][3 * n / 4] = 3;

rgb[0][n / 3] = 254; rgb[1][2 * n / 5] = 251; rgb[2][n - 1] = 252;

}

bool Avx2CalcRgbMinMaxCpp(uint8_t* rgb[3], size_t num_pixels, uint8_t min_vals[3], uint8_t max_vals[3])

{

// Make sure num_pixels is valid

if ((num_pixels == 0) || (num_pixels % 32 != 0))

return false;

if (!AlignedMem::IsAligned(rgb[0], 32))

return false;

if (!AlignedMem::IsAligned(rgb[1], 32))

return false;

if (!AlignedMem::IsAligned(rgb[2], 32))

return false;

// Find the min and max of each color plane

min_vals[0] = min_vals[1] = min_vals[2] = 255;

max_vals[0] = max_vals[1] = max_vals[2] = 0;

for (size_t i = 0; i < 3; i++)

{

for (size_t j = 0; j < num_pixels; j++)

{

if (rgb[i][j] < min_vals[i])

min_vals[i] = rgb[i][j];

if (rgb[i][j] > max_vals[i]) // no else: a pixel that sets a new minimum may also be the maximum

max_vals[i] = rgb[i][j];

}

}

return true;

}

int main(void)

{

const size_t n = 1024;

uint8_t* rgb[3];

uint8_t min_vals1[3], max_vals1[3];

uint8_t min_vals2[3], max_vals2[3];

AlignedArray<uint8_t> r(n, 32);

AlignedArray<uint8_t> g(n, 32);

AlignedArray<uint8_t> b(n, 32);

rgb[0] = r.Data();

rgb[1] = g.Data();

rgb[2] = b.Data();

Init(rgb, n, 219);

Avx2CalcRgbMinMaxCpp(rgb, n, min_vals1, max_vals1);

Avx2CalcRgbMinMax_(rgb, n, min_vals2, max_vals2);

cout << "Results for Avx2CalcRgbMinMax ";

cout << " R G B ";

cout << "------------------------- ";

cout << "min_vals1: ";

cout << setw(4) << (int)min_vals1[0] << ' ';

cout << setw(4) << (int)min_vals1[1] << ' ';

cout << setw(4) << (int)min_vals1[2] << ' ';

cout << "min_vals2: ";

cout << setw(4) << (int)min_vals2[0] << ' ';

cout << setw(4) << (int)min_vals2[1] << ' ';

cout << setw(4) << (int)min_vals2[2] << " ";

cout << "max_vals1: ";

cout << setw(4) << (int)max_vals1[0] << ' ';

cout << setw(4) << (int)max_vals1[1] << ' ';

cout << setw(4) << (int)max_vals1[2] << ' ';

cout << "max_vals2: ";

cout << setw(4) << (int)max_vals2[0] << ' ';

cout << setw(4) << (int)max_vals2[1] << ' ';

cout << setw(4) << (int)max_vals2[2] << " ";

return 0;

}

;-------------------------------------------------

; Ch10_05.asm

;-------------------------------------------------

include <MacrosX86-64-AVX.asmh>

; 256-bit wide constants

ConstVals segment readonly align(32) 'const'

InitialPminVal db 32 dup(0ffh)

InitialPmaxVal db 32 dup(00h)

ConstVals ends

; Macro _YmmVpextrMinub

;

; This macro generates code that extracts the smallest unsigned byte from register YmmSrc.

_YmmVpextrMinub macro GprDes,YmmSrc,YmmTmp

; Make sure YmmSrc and YmmTmp are different

.erridni <YmmSrc>, <YmmTmp>, <Invalid registers>

; Construct text strings for the corresponding XMM registers

YmmSrcSuffix SUBSTR <YmmSrc>,2

XmmSrc CATSTR <X>,YmmSrcSuffix

YmmTmpSuffix SUBSTR <YmmTmp>,2

XmmTmp CATSTR <X>,YmmTmpSuffix

; Reduce the 32 byte values in YmmSrc to the smallest value

vextracti128 XmmTmp,YmmSrc,1

vpminub XmmSrc,XmmSrc,XmmTmp ;XmmSrc = final 16 min values

vpsrldq XmmTmp,XmmSrc,8

vpminub XmmSrc,XmmSrc,XmmTmp ;XmmSrc = final 8 min values

vpsrldq XmmTmp,XmmSrc,4

vpminub XmmSrc,XmmSrc,XmmTmp ;XmmSrc = final 4 min values

vpsrldq XmmTmp,XmmSrc,2

vpminub XmmSrc,XmmSrc,XmmTmp ;XmmSrc = final 2 min values

vpsrldq XmmTmp,XmmSrc,1

vpminub XmmSrc,XmmSrc,XmmTmp ;XmmSrc = final 1 min value

vpextrb GprDes,XmmSrc,0 ;mov final min value to Gpr

endm

; Macro _YmmVpextrMaxub

;

; This macro generates code that extracts the largest unsigned byte from register YmmSrc.

_YmmVpextrMaxub macro GprDes,YmmSrc,YmmTmp

; Make sure YmmSrc and YmmTmp are different

.erridni <YmmSrc>, <YmmTmp>, <Invalid registers>

; Construct text strings for the corresponding XMM registers

YmmSrcSuffix SUBSTR <YmmSrc>,2

XmmSrc CATSTR <X>,YmmSrcSuffix

YmmTmpSuffix SUBSTR <YmmTmp>,2

XmmTmp CATSTR <X>,YmmTmpSuffix

; Reduce the 32 byte values in YmmSrc to the largest value

vextracti128 XmmTmp,YmmSrc,1

vpmaxub XmmSrc,XmmSrc,XmmTmp ;XmmSrc = final 16 max values

vpsrldq XmmTmp,XmmSrc,8

vpmaxub XmmSrc,XmmSrc,XmmTmp ;XmmSrc = final 8 max values

vpsrldq XmmTmp,XmmSrc,4

vpmaxub XmmSrc,XmmSrc,XmmTmp ;XmmSrc = final 4 max values

vpsrldq XmmTmp,XmmSrc,2

vpmaxub XmmSrc,XmmSrc,XmmTmp ;XmmSrc = final 2 max values

vpsrldq XmmTmp,XmmSrc,1

vpmaxub XmmSrc,XmmSrc,XmmTmp ;XmmSrc = final 1 max value

vpextrb GprDes,XmmSrc,0 ;mov final max value to Gpr

endm

; extern "C" bool Avx2CalcRgbMinMax_(uint8_t* rgb[3], size_t num_pixels, uint8_t min_vals[3], uint8_t max_vals[3])

.code

Avx2CalcRgbMinMax_ proc frame

_CreateFrame CalcMinMax_,0,48,r12

_SaveXmmRegs xmm6,xmm7,xmm8

_EndProlog

; Make sure num_pixels and the color plane arrays are valid

xor eax,eax ;set error code

test rdx,rdx

jz Done ;jump if num_pixels == 0

test rdx,01fh

jnz Done ;jump if num_pixels % 32 != 0

mov r10,[rcx] ;r10 = color plane R

test r10,1fh

jnz Done ;jump if color plane R is not aligned

mov r11,[rcx+8] ;r11 = color plane G

test r11,1fh

jnz Done ;jump if color plane G is not aligned

mov r12,[rcx+16] ;r12 = color plane B

test r12,1fh

jnz Done ;jump if color plane B is not aligned

; Initialize the processing loop registers

vmovdqa ymm3,ymmword ptr [InitialPminVal] ;ymm3 = R minimums

vmovdqa ymm4,ymm3 ;ymm4 = G minimums

vmovdqa ymm5,ymm3 ;ymm5 = B minimums

vmovdqa ymm6,ymmword ptr [InitialPmaxVal] ;ymm6 = R maximums

vmovdqa ymm7,ymm6 ;ymm7 = G maximums

vmovdqa ymm8,ymm6 ;ymm8 = B maximums

xor rcx,rcx ;rcx = common array offset

; Scan RGB color plane arrays for packed minimums and maximums

align 16

@@: vmovdqa ymm0,ymmword ptr [r10+rcx] ;ymm0 = R pixels

vmovdqa ymm1,ymmword ptr [r11+rcx] ;ymm1 = G pixels

vmovdqa ymm2,ymmword ptr [r12+rcx] ;ymm2 = B pixels

vpminub ymm3,ymm3,ymm0 ;update R minimums

vpminub ymm4,ymm4,ymm1 ;update G minimums

vpminub ymm5,ymm5,ymm2 ;update B minimums

vpmaxub ymm6,ymm6,ymm0 ;update R maximums

vpmaxub ymm7,ymm7,ymm1 ;update G maximums

vpmaxub ymm8,ymm8,ymm2 ;update B maximums

add rcx,32

sub rdx,32

jnz @B

; Calculate the final RGB minimum values

_YmmVpextrMinub rax,ymm3,ymm0

mov byte ptr [r8],al ;save min R

_YmmVpextrMinub rax,ymm4,ymm0

mov byte ptr [r8+1],al ;save min G

_YmmVpextrMinub rax,ymm5,ymm0

mov byte ptr [r8+2],al ;save min B

; Calculate the final RGB maximum values

_YmmVpextrMaxub rax,ymm6,ymm1

mov byte ptr [r9],al ;save max R

_YmmVpextrMaxub rax,ymm7,ymm1

mov byte ptr [r9+1],al ;save max G

_YmmVpextrMaxub rax,ymm8,ymm1

mov byte ptr [r9+2],al ;save max B

mov eax,1 ;set success return code

Done: vzeroupper

_RestoreXmmRegs xmm6,xmm7,xmm8

_DeleteFrame r12

ret

Avx2CalcRgbMinMax_ endp

end

Listing 10-5.

Example Ch10_05

The function Avx2CalcRgbMinMaxCpp that’s shown in Listing 10-5 is a C++ implementation of the RGB min-max algorithm. This function employs a set of nested for loops to determine the minimum and maximum pixel intensity values for each color plane. These values are maintained in the arrays min_vals and max_vals. The function main uses the C++ template class AlignedArray to allocate three arrays that simulate the color plane buffers of an RGB image. These buffers are loaded with random values by the function Init. Note that function Init assigns known values to several elements in each color plane buffer. These known values are used to verify correct execution of both the C++ and assembly language min-max functions.

Toward the top of the assembly language code is a custom constant segment named ConstVals that defines packed versions of the initial pixel minimum and maximum values. A custom segment is used here to ensure alignment of the 256-bit wide packed values on a 32-byte boundary, as explained in Chapter 9. The macro definitions _YmmVpextrMinub and _YmmVpextrMaxub are next. These macros contain instructions that extract the smallest and largest byte values from a YMM register. The inner workings of these macros will be explained shortly.

The function Avx2CalcRgbMinMax_ uses registers YMM3-YMM5 and YMM6-YMM8 to maintain the RGB minimum and maximum values, respectively. During each iteration of the main processing loop, a series of vpminub and vpmaxub instructions update the current RGB minimums and maximums. Upon completion of the main processing loop, the aforementioned YMM registers contain 32 minimum and maximum pixel intensity values for each color component. The _YmmVpextrMinub and _YmmVpextrMaxub macros are then used to extract the final RGB minimum and maximum pixel values. These values are then saved to the result arrays min_vals and max_vals, respectively.
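To see why each YMM register holds 32 separate candidates when the processing loop ends, consider the following scalar C++ model of the loop for a single color plane. The names Lanes32, PackedMinMax, and ScanPlane are hypothetical and exist only in this sketch. Each array element plays the role of one byte element of a YMM register.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Models the 32 unsigned byte elements of a YMM register (hypothetical helper type)
using Lanes32 = std::array<std::uint8_t, 32>;

struct PackedMinMax
{
    Lanes32 m_Min;      // models YMM3, YMM4, or YMM5
    Lanes32 m_Max;      // models YMM6, YMM7, or YMM8
};

// Scalar model of the main processing loop for one color plane. Element k of
// the result only ever sees pixels k, k + 32, k + 64, and so on.
inline PackedMinMax ScanPlane(const std::uint8_t* plane, std::size_t num_pixels)
{
    assert(num_pixels != 0 && num_pixels % 32 == 0);

    PackedMinMax pmm;
    pmm.m_Min.fill(0xff);       // same values as InitialPminVal
    pmm.m_Max.fill(0x00);       // same values as InitialPmaxVal

    for (std::size_t offset = 0; offset < num_pixels; offset += 32)
    {
        for (std::size_t k = 0; k < 32; k++)
        {
            std::uint8_t pixel = plane[offset + k];
            pmm.m_Min[k] = std::min(pmm.m_Min[k], pixel);   // vpminub
            pmm.m_Max[k] = std::max(pmm.m_Max[k], pixel);   // vpmaxub
        }
    }

    return pmm;
}
```

The true minimum and maximum of the plane are therefore somewhere among the 32 elements of m_Min and m_Max, and a final reduction step is still required to find them.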

The macro definitions _YmmVpextrMinub and _YmmVpextrMaxub are identical, except for the instructions vpminub and vpmaxub. In the text that follows, all explanatory comments made about _YmmVpextrMinub also apply to _YmmVpextrMaxub. The _YmmVpextrMinub macro requires three parameters: a destination general-purpose register (GprDes), a source YMM register (YmmSrc), and a temporary YMM register (YmmTmp). Note that macro parameters YmmSrc and YmmTmp must be different registers. If they’re the same, the .erridni directive (Error if Text Items are Identical, Case Insensitive) generates an error message during assembly. MASM also supports several other conditional error directives besides .erridni, and these are described in the Visual Studio documentation.

In order to generate the correct assembly language code, the macro _YmmVpextrMinub requires an XMM register text string (XmmSrc) that corresponds to the low-order portion of the specified YmmSrc register. For example, if YmmSrc equals YMM0, then XmmSrc must equal XMM0. The MASM directives substr (Return Substring of Text Item) and catstr (Concatenate Text Items) are used to initialize XmmSrc. The statement YmmSrcSuffix SUBSTR <YmmSrc>,2 assigns a text string value to YmmSrcSuffix that excludes the leading character of macro parameter YmmSrc. For example, if YmmSrc equals YMM0, then YmmSrcSuffix equals MM0. The next statement, XmmSrc CATSTR <X>,YmmSrcSuffix, adds a leading X to the value of YmmSrcSuffix and assigns it to XmmSrc. Continuing with the earlier example, this means that the text string XMM0 is assigned to XmmSrc. The SUBSTR and CATSTR directives are then used to assign a text string value to XmmTmp.
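The text manipulation performed by the SUBSTR and CATSTR directives can be mimicked with std::string, as the following sketch shows. The function YmmToXmmName is a hypothetical name used only for illustration. Note that the position argument of the MASM SUBSTR directive is 1-based, so position 2 corresponds to index 1 in C++.

```cpp
#include <cassert>
#include <string>

// Models 'YmmSrcSuffix SUBSTR <YmmSrc>,2' followed by 'XmmSrc CATSTR <X>,YmmSrcSuffix'.
// MASM positions are 1-based, so SUBSTR position 2 is substr(1) in C++.
inline std::string YmmToXmmName(const std::string& ymm_name)
{
    assert(ymm_name.size() >= 2);

    std::string suffix = ymm_name.substr(1);    // "YMM0" -> "MM0"
    return "X" + suffix;                        // "X" + "MM0" -> "XMM0"
}
```

For example, YmmToXmmName("YMM0") returns "XMM0". If the macro argument is written in lowercase, the result is a mixed-case name such as Xmm3, which is harmless since MASM register names are not case sensitive.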

Following initialization of the required macro text strings are the instructions that extract the smallest byte value from the specified YMM register. The vextracti128 XmmTmp,YmmSrc,1 instruction copies the high-order 16 bytes of register YmmSrc to XmmTmp. (The vextracti128 instruction also supports using an immediate operand of 0 to copy the low-order 16 bytes.) A vpminub XmmSrc,XmmSrc,XmmTmp instruction loads the final 16 minimum values into XmmSrc. The vpsrldq XmmTmp,XmmSrc,8 instruction shifts a copy of the value that’s in XmmSrc to the right by eight bytes and saves the result to XmmTmp. This facilitates the use of another vpminub instruction that reduces the number of minimum byte values from 16 to 8. Repeated sets of the vpsrldq and vpminub instructions are then employed until the final minimum value resides in the low-order byte of XmmSrc. A vpextrb GprDes,XmmSrc,0 instruction copies the final minimum value to the specified general-purpose register. Here are the results for source code example Ch10_05:

Results for Avx2CalcRgbMinMax

R G B

-------------------------

min_vals1: 4 1 3

min_vals2: 4 1 3

max_vals1: 254 251 252

max_vals2: 254 251 252
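The reduction carried out by the _YmmVpextrMinub macro can be modeled in scalar C++ as follows. This is a sketch under stated assumptions rather than a translation of the macro: the function name ReduceMin is hypothetical, and arrays stand in for the YMM and XMM registers. The inner copy loop imitates vpsrldq, which shifts zeros into the high-order bytes.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

// Scalar model of the _YmmVpextrMinub reduction (hypothetical function name)
inline std::uint8_t ReduceMin(const std::array<std::uint8_t, 32>& ymm_src)
{
    std::array<std::uint8_t, 16> xmm_src {};

    // vextracti128 + vpminub: fold the high 16 bytes onto the low 16 bytes
    for (std::size_t i = 0; i < 16; i++)
        xmm_src[i] = std::min(ymm_src[i], ymm_src[i + 16]);

    // vpsrldq by 8, 4, 2, and 1 bytes, each followed by vpminub
    for (std::size_t shift = 8; shift >= 1; shift /= 2)
    {
        std::array<std::uint8_t, 16> xmm_tmp {};    // zeros are shifted in at the top

        for (std::size_t i = 0; i + shift < 16; i++)
            xmm_tmp[i] = xmm_src[i + shift];

        for (std::size_t i = 0; i < 16; i++)
            xmm_src[i] = std::min(xmm_src[i], xmm_tmp[i]);
    }

    return xmm_src[0];      // vpextrb GprDes,XmmSrc,0
}
```

The zeros that vpsrldq shifts in do corrupt the high-order bytes of the intermediate minimums, but this is of no consequence since each step only relies on the low-order 8, 4, 2, and finally 1 byte positions, which are always compared against genuine pixel values.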

RGB to Grayscale Conversion

The final source code example of this chapter, Ch10_06, explains how to perform an RGB to grayscale image conversion. This example intermixes the packed integer capabilities of AVX2 that you have learned in this chapter with the packed floating-point techniques presented in Chapter 9. Listing 10-6 shows the source code for example Ch10_06.

//------------------------------------------------

// ImageMatrix.h

//------------------------------------------------

struct RGB32

{

uint8_t m_R;

uint8_t m_G;

uint8_t m_B;

uint8_t m_A;

};

//------------------------------------------------

// Ch10_06.cpp

//------------------------------------------------

#include "stdafx.h"

#include <iostream>

#include <stdexcept>

#include "Ch10_06.h"

#include "AlignedMem.h"

#include "ImageMatrix.h"

using namespace std;

// Image size limits

extern "C" const int c_NumPixelsMin = 32;

extern "C" const int c_NumPixelsMax = 256 * 1024 * 1024;

// RGB to grayscale conversion coefficients

const float c_Coef[4] {0.2126f, 0.7152f, 0.0722f, 0.0f};

bool CompareGsImages(const uint8_t* pb_gs1,const uint8_t* pb_gs2, int num_pixels)

{

for (int i = 0; i < num_pixels; i++)

{

if (abs((int)pb_gs1[i] - (int)pb_gs2[i]) > 1)

return false;

}

return true;

}

bool Avx2ConvertRgbToGsCpp(uint8_t* pb_gs, const RGB32* pb_rgb, int num_pixels, const float coef[4])

{

if (num_pixels < c_NumPixelsMin || num_pixels > c_NumPixelsMax)

return false;

if (num_pixels % 8 != 0)

return false;

if (!AlignedMem::IsAligned(pb_gs, 32))

return false;

if (!AlignedMem::IsAligned(pb_rgb, 32))

return false;

for (int i = 0; i < num_pixels; i++)

{

uint8_t r = pb_rgb[i].m_R;

uint8_t g = pb_rgb[i].m_G;

uint8_t b = pb_rgb[i].m_B;

float gs_temp = r * coef[0] + g * coef[1] + b * coef[2] + 0.5f;

if (gs_temp < 0.0f)

gs_temp = 0.0f;

else if (gs_temp > 255.0f)

gs_temp = 255.0f;

pb_gs[i] = (uint8_t)gs_temp;

}

return true;

}

void Avx2ConvertRgbToGs(void)

{

const wchar_t* fn_rgb = L"..\\Ch10_Data\\TestImage3.bmp";

const wchar_t* fn_gs1 = L"Ch10_06_Avx2ConvertRgbToGs_TestImage3_GS1.bmp";

const wchar_t* fn_gs2 = L"Ch10_06_Avx2ConvertRgbToGs_TestImage3_GS2.bmp";

ImageMatrix im_rgb(fn_rgb);

int im_h = im_rgb.GetHeight();

int im_w = im_rgb.GetWidth();

int num_pixels = im_h * im_w;

ImageMatrix im_gs1(im_h, im_w, PixelType::Gray8);

ImageMatrix im_gs2(im_h, im_w, PixelType::Gray8);

RGB32* pb_rgb = im_rgb.GetPixelBuffer<RGB32>();

uint8_t* pb_gs1 = im_gs1.GetPixelBuffer<uint8_t>();

uint8_t* pb_gs2 = im_gs2.GetPixelBuffer<uint8_t>();

cout << "Results for Avx2ConvertRgbToGs ";

wcout << "Converting RGB image " << fn_rgb << ' ';

cout << " im_h = " << im_h << " pixels ";

cout << " im_w = " << im_w << " pixels ";

// Exercise conversion functions

bool rc1 = Avx2ConvertRgbToGsCpp(pb_gs1, pb_rgb, num_pixels, c_Coef);

bool rc2 = Avx2ConvertRgbToGs_(pb_gs2, pb_rgb, num_pixels, c_Coef);

if (rc1 && rc2)

{

wcout << "Saving grayscale image #1 - " << fn_gs1 << ' ';

im_gs1.SaveToBitmapFile(fn_gs1);

wcout << "Saving grayscale image #2 - " << fn_gs2 << ' ';

im_gs2.SaveToBitmapFile(fn_gs2);

if (CompareGsImages(pb_gs1, pb_gs2, num_pixels))

cout << "Grayscale image compare OK ";

else

cout << "Grayscale image compare failed ";

}

else

cout << "Invalid return code ";

}

int main()

{

try

{

Avx2ConvertRgbToGs();

Avx2ConvertRgbToGs_BM();

}

catch (runtime_error& rte)

{

cout << "'runtime_error' exception has occurred - " << rte.what() << ' ';

}

catch (...)

{

cout << "Unexpected exception has occurred ";

}

return 0;

}

;-------------------------------------------------

; Ch10_06.asm

;-------------------------------------------------

include <MacrosX86-64-AVX.asmh>

.const

GsMask dword 0ffffffffh, 0, 0, 0, 0ffffffffh, 0, 0, 0

r4_0p5 real4 0.5

r4_255p0 real4 255.0

extern c_NumPixelsMin:dword

extern c_NumPixelsMax:dword

; extern "C" bool Avx2ConvertRgbToGs_(uint8_t* pb_gs, const RGB32* pb_rgb, int num_pixels, const float coef[4])

;

; Note: Memory pointed to by pb_rgb is ordered as follows:

; R(0,0), G(0,0), B(0,0), A(0,0), R(0,1), G(0,1), B(0,1), A(0,1), ...

.code

Avx2ConvertRgbToGs_ proc frame

_CreateFrame RGBGS_,0,112

_SaveXmmRegs xmm6,xmm7,xmm11,xmm12,xmm13,xmm14,xmm15

_EndProlog

; Validate argument values

xor eax,eax ;set error return code

cmp r8d,[c_NumPixelsMin]

jl Done ;jump if num_pixels < min value

cmp r8d,[c_NumPixelsMax]

jg Done ;jump if num_pixels > max value

test r8d,7

jnz Done ;jump if (num_pixels % 8) != 0

test rcx,1fh

jnz Done ;jump if pb_gs is not aligned

test rdx,1fh

jnz Done ;jump if pb_rgb is not aligned

; Perform required initializations

vbroadcastss ymm11,real4 ptr [r4_255p0] ;ymm11 = packed 255.0

vbroadcastss ymm12,real4 ptr [r4_0p5] ;ymm12 = packed 0.5

vpxor ymm13,ymm13,ymm13 ;ymm13 = packed zero

vmovups xmm0,xmmword ptr [r9]

vperm2f128 ymm14,ymm0,ymm0,00000000b ;ymm14 = packed coef

vmovups ymm15,ymmword ptr [GsMask] ;ymm15 = GsMask (SPFP)

; Load next 8 RGB32 pixel values (P0 - P7)

align 16

@@: vmovdqa ymm0,ymmword ptr [rdx] ;ymm0 = 8 rgb32 pixels (P7 - P0)

; Size-promote RGB32 color components from bytes to dwords
; (the unpack instructions operate on each 128-bit lane independently)

vpunpcklbw ymm1,ymm0,ymm13 ;ymm1 = P5, P4, P1, P0 (word)

vpunpckhbw ymm2,ymm0,ymm13 ;ymm2 = P7, P6, P3, P2 (word)

vpunpcklwd ymm3,ymm1,ymm13 ;ymm3 = P4, P0 (dword)

vpunpckhwd ymm4,ymm1,ymm13 ;ymm4 = P5, P1 (dword)

vpunpcklwd ymm5,ymm2,ymm13 ;ymm5 = P6, P2 (dword)

vpunpckhwd ymm6,ymm2,ymm13 ;ymm6 = P7, P3 (dword)

; Convert color component values to single-precision floating-point

vcvtdq2ps ymm0,ymm3 ;ymm0 = P4, P0 (SPFP)

vcvtdq2ps ymm1,ymm4 ;ymm1 = P5, P1 (SPFP)

vcvtdq2ps ymm2,ymm5 ;ymm2 = P6, P2 (SPFP)

vcvtdq2ps ymm3,ymm6 ;ymm3 = P7, P3 (SPFP)

; Multiply color component values by color conversion coefficients

vmulps ymm0,ymm0,ymm14

vmulps ymm1,ymm1,ymm14

vmulps ymm2,ymm2,ymm14

vmulps ymm3,ymm3,ymm14

; Sum weighted color components for final grayscale values

vhaddps ymm4,ymm0,ymm0

vhaddps ymm4,ymm4,ymm4 ;ymm4[159:128] = P4, ymm4[31:0] = P0

vhaddps ymm5,ymm1,ymm1

vhaddps ymm5,ymm5,ymm5 ;ymm5[159:128] = P5, ymm5[31:0] = P1

vhaddps ymm6,ymm2,ymm2

vhaddps ymm6,ymm6,ymm6 ;ymm6[159:128] = P6, ymm6[31:0] = P2

vhaddps ymm7,ymm3,ymm3

vhaddps ymm7,ymm7,ymm7 ;ymm7[159:128] = P7, ymm7[31:0] = P3

; Merge SPFP grayscale values into a single YMM register

vandps ymm4,ymm4,ymm15 ;mask out unneeded SPFP values

vandps ymm5,ymm5,ymm15

vandps ymm6,ymm6,ymm15

vandps ymm7,ymm7,ymm15

vpslldq ymm5,ymm5,4

vpslldq ymm6,ymm6,8

vpslldq ymm7,ymm7,12

vorps ymm0,ymm4,ymm5 ;merge values

vorps ymm1,ymm6,ymm7

vorps ymm2,ymm0,ymm1 ;ymm2 = 8 GS pixel values (SPFP)

; Add 0.5 rounding factor and clip to 0.0 - 255.0

vaddps ymm2,ymm2,ymm12 ;add 0.5f rounding factor

vminps ymm3,ymm2,ymm11 ;clip pixels above 255.0

vmaxps ymm4,ymm3,ymm13 ;clip pixels below 0.0

; Convert SPFP values to bytes and save

vcvtps2dq ymm3,ymm4 ;convert clipped GS SPFP to dwords

vpackusdw ymm4,ymm3,ymm13 ;convert GS dwords to words

vpackuswb ymm5,ymm4,ymm13 ;convert GS words to bytes

vperm2i128 ymm6,ymm13,ymm5,3 ;xmm5 = GS P3:P0, xmm6 = GS P7:P4

vmovd dword ptr [rcx],xmm5 ;save P3 - P0

vmovd dword ptr [rcx+4],xmm6 ;save P7 - P4

add rdx,32 ;update pb_rgb to next block

add rcx,8 ;update pb_gs to next block

sub r8d,8 ;num_pixels -= 8

jnz @B ;repeat until done

mov eax,1 ;set success return code

Done: vzeroupper

_RestoreXmmRegs xmm6,xmm7,xmm11,xmm12,xmm13,xmm14,xmm15

_DeleteFrame

ret

Avx2ConvertRgbToGs_ endp

end

Listing 10-6.

Example Ch10_06

A variety of algorithms exist to convert an RGB image into a grayscale image. One frequently used technique calculates grayscale pixel values using a weighted sum of the RGB color components. In this source code example, RGB pixels are converted to grayscale pixels using the following equation:

$GS(x,y) = R(x,y)\,W_r + G(x,y)\,W_g + B(x,y)\,W_b$

Each RGB color component weight (or coefficient) is a floating-point number between 0.0 and 1.0, and the sum of the three component coefficients normally equals 1.0. The exact values used for the color component coefficients are usually based on published standards that reflect a multitude of visual factors including properties of the target color space, display device characteristics, and perceived image quality. If you’re interested in learning more about RGB to grayscale image conversion, Appendix A contains some references that you can consult.
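As a worked example, assume a hypothetical pixel with R = 200, G = 100, and B = 50 (values chosen only for illustration) and the coefficients that Listing 10-6 defines in c_Coef, namely $W_r = 0.2126$, $W_g = 0.7152$, and $W_b = 0.0722$:

$GS = 200 \times 0.2126 + 100 \times 0.7152 + 50 \times 0.0722 = 42.52 + 71.52 + 3.61 = 117.65$

Adding the 0.5 rounding factor used by Avx2ConvertRgbToGsCpp yields 118.15, which truncates to a grayscale value of 118 when it is cast to uint8_t. Note that these three coefficients sum to 1.0, so a pixel whose color components are all equal maps to a grayscale value of essentially the same intensity.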

Source code Ch10_06 opens with the structure declaration RGB32. This structure is declared in the header file ImageMatrix.h and specifies the color component ordering scheme of each RGB pixel. The function Avx2ConvertRgbToGsCpp contains a C++ implementation of the RGB to grayscale conversion algorithm. This function uses an ordinary for loop that sweeps through the RGB32 image buffer pb_rgb and computes grayscale pixel values using the aforementioned conversion equation. Note that RGB32 element m_A is not used in any of the calculations in this example. Each calculated grayscale pixel value is adjusted by a rounding factor and clipped to [0.0, 255.0] before it is saved to the grayscale image buffer pointed to by pb_gs.

The assembly language code begins with a .const section that defines the necessary constants. Following its prolog, the function Avx2ConvertRgbToGs_ performs the customary image size and buffer alignment checks. It then loads the algorithm’s required packed constants into registers YMM11–YMM15. Note that register YMM14 contains a packed version of the color conversion coefficients, as illustrated in Figure 10-3. The assembly language processing loop begins with a vmovdqa ymm0,ymmword ptr [rdx] instruction that loads eight RGB32 pixel values into register YMM0. The color components of these pixels are then size-promoted to doublewords using a series of vpunpck[l|h]bw and vpunpck[l|h]wd instructions. The ensuing vcvtdq2ps instructions convert the pixel color components from doublewords to single-precision floating-point values. Following execution of the four vcvtdq2ps instructions, registers YMM0–YMM3 each contain two RGB32 pixels and each color component is a single-precision floating-point value. Figure 10-3 also shows the RGB32 size promotions and conversions discussed in this paragraph.

Figure 10-3. RGB32 pixel color component size promotions and conversions
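To see why unpacking against a zero operand promotes unsigned bytes to doublewords, it can help to model the unpack instructions in scalar C++. The sketch below models a single 128-bit lane only (the 256-bit forms apply the same interleave independently to each lane). The type Lane and the function names are hypothetical, and the byte values are arbitrary test data rather than anything taken from the listing.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Lane = std::array<std::uint8_t, 16>;   // one 128-bit lane, byte 0 = bits [7:0]

// Scalar model of one lane of vpunpcklbw: interleave the low 8 bytes of a and b.
Lane UnpackLowBytes(const Lane& a, const Lane& b)
{
    Lane r{};
    for (int i = 0; i < 8; ++i)
    {
        r[2 * i]     = a[i];
        r[2 * i + 1] = b[i];
    }
    return r;
}

// Scalar model of one lane of vpunpcklwd: interleave the low 4 words of a and b.
Lane UnpackLowWords(const Lane& a, const Lane& b)
{
    Lane r{};
    for (int i = 0; i < 4; ++i)
    {
        r[4 * i]     = a[2 * i];
        r[4 * i + 1] = a[2 * i + 1];
        r[4 * i + 2] = b[2 * i];
        r[4 * i + 3] = b[2 * i + 1];
    }
    return r;
}

// Self-check: the four bytes of the first pixel become four little-endian doublewords.
void CheckZeroExtend()
{
    const Lane zero{};
    Lane pixels{};
    pixels[0] = 10; pixels[1] = 20; pixels[2] = 30; pixels[3] = 40;  // first pixel
    pixels[4] = 50;                                                   // next pixel

    const Lane words  = UnpackLowBytes(pixels, zero);   // bytes -> words
    const Lane dwords = UnpackLowWords(words, zero);    // words -> doublewords

    assert(dwords[0] == 10 && dwords[4] == 20 && dwords[8] == 30 && dwords[12] == 40);
    assert(dwords[1] == 0 && dwords[2] == 0 && dwords[3] == 0);
}
```

Because the second operand is all zeros, every inserted high byte is zero, which is exactly a zero extension. The high-half forms (vpunpckhbw, vpunpckhwd) behave the same way on the upper eight bytes or four words of each lane.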

The four vmulps instructions multiply the eight RGB32 pixels by the color conversion coefficients. The ensuing vhaddps instructions sum the weighted color components of each pixel to generate the required grayscale values. Following execution of these instructions, registers YMM4–YMM7 each contain two single-precision floating-point grayscale pixel values, one in element position [31:0] and the other in [159:128], as shown in Figure 10-4. The eight grayscale values in YMM4–YMM7 are then merged into YMM2 using a series of vandps, vpslldq, and vorps instructions. Figure 10-4 also shows the final merged result. The vaddps, vminps, and vmaxps instructions that follow add in the rounding factor (0.5) and clip the grayscale pixels to [0.0, 255.0]. These values are then converted to unsigned bytes using the instructions vcvtps2dq, vpackusdw, and vpackuswb. The two vmovd instructions save the four unsigned byte pixel values held in XMM5[31:0] and the four held in XMM6[31:0] to the grayscale image buffer.

Figure 10-4. Grayscale single-precision floating-point pixel values before and after merging
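The placement of the two grayscale values at [31:0] and [159:128] follows from the way vhaddps works: it adds adjacent pairs within each 128-bit lane and never across lanes. The following scalar C++ model is a sketch of that behavior under the assumption that each lane holds the four weighted components of one pixel. The type Ymm and the function names are hypothetical.

```cpp
#include <array>
#include <cassert>

using Ymm = std::array<float, 8>;   // element 0 = bits [31:0], element 4 = bits [159:128]

// Scalar model of vhaddps dst,a,b with 256-bit operands.
// Adjacent pairs are summed independently inside each 128-bit lane.
Ymm HaddPs(const Ymm& a, const Ymm& b)
{
    Ymm r{};
    for (int lane = 0; lane < 2; ++lane)
    {
        const int k = lane * 4;
        r[k + 0] = a[k + 0] + a[k + 1];
        r[k + 1] = a[k + 2] + a[k + 3];
        r[k + 2] = b[k + 0] + b[k + 1];
        r[k + 3] = b[k + 2] + b[k + 3];
    }
    return r;
}

// Two applications reduce the four floats in each lane to a single sum,
// which ends up (among other positions) in elements 0 and 4.
Ymm SumEachLane(const Ymm& weighted)
{
    const Ymm t = HaddPs(weighted, weighted);
    return HaddPs(t, t);
}

// Self-check with exactly representable values, so the comparisons are exact.
void CheckSumEachLane()
{
    const Ymm weighted = {1.0f, 2.0f, 4.0f, 0.0f, 8.0f, 16.0f, 32.0f, 0.0f};
    const Ymm r = SumEachLane(weighted);
    assert(r[0] == 7.0f);    // pixel in the low lane
    assert(r[4] == 56.0f);   // pixel in the high lane
}
```

Horizontal additions of this kind do useful work on only a fraction of each register, which is one reason the text describes them as slower than ordinary vertical arithmetic.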

Here are the results of source code example Ch10_06:

Results for Avx2ConvertRgbToGs

Converting RGB image ..Ch10_DataTestImage3.bmp

im_h = 960 pixels

im_w = 640 pixels

Saving grayscale image #1 - Ch10_06_Avx2ConvertRgbToGs_TestImage3_GS1.bmp

Saving grayscale image #2 - Ch10_06_Avx2ConvertRgbToGs_TestImage3_GS2.bmp

Grayscale image compare OK

Running benchmark function Avx2ConvertRgbToGs_BM - please wait

Benchmark times save to file Ch10_06_Avx2ConvertRgbToGs_BM_CHROMIUM.csv

Table 10-2 shows the benchmark timing measurements for the RGB to grayscale image conversion functions Avx2ConvertRgbToGsCpp and Avx2ConvertRgbToGs_. The performance gains of this source code example are modest compared to some of the other examples in this book. The reason for this is that the RGB32 color components in the source image buffer are interleaved with each other, which necessitates the use of slower horizontal arithmetic. Rearranging the RGB32 data so that the pixels of each color component reside in separate image buffers often results in significantly faster performance. You see an example of this in Chapter 14.

Table 10-2.

Mean Execution Times (Microseconds) for RGB to Grayscale Image Conversion Using TestImage3.bmp

CPU	Avx2ConvertRgbToGsCpp	Avx2ConvertRgbToGs_
i7-4790S	1504	843
i9-7900X	1075	593
i7-8700K	1031	565
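The rearrangement mentioned in the discussion of Table 10-2 can be pictured with a short C++ sketch. It splits interleaved RGB32 pixels into three separate component buffers and then computes the weighted sum element by element, so no sum ever spans the components of a single packed pixel. The names RgbPlanes, SplitIntoPlanes, and PlanesToGs, the stand-in RGB32 layout, and the coefficient parameter are assumptions for illustration. This is not the Chapter 14 code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for RGB32; the member order is an assumption.
struct RGB32
{
    std::uint8_t m_R;
    std::uint8_t m_G;
    std::uint8_t m_B;
    std::uint8_t m_A;
};

// One buffer per color component (planar layout).
struct RgbPlanes
{
    std::vector<std::uint8_t> r;
    std::vector<std::uint8_t> g;
    std::vector<std::uint8_t> b;
};

RgbPlanes SplitIntoPlanes(const std::vector<RGB32>& pixels)
{
    RgbPlanes p;
    p.r.reserve(pixels.size());
    p.g.reserve(pixels.size());
    p.b.reserve(pixels.size());

    for (const RGB32& px : pixels)
    {
        p.r.push_back(px.m_R);
        p.g.push_back(px.m_G);
        p.b.push_back(px.m_B);
    }
    return p;
}

// coef[0], coef[1], coef[2] hold Wr, Wg, Wb.
std::vector<std::uint8_t> PlanesToGs(const RgbPlanes& p, const float* coef)
{
    assert(p.r.size() == p.g.size() && p.g.size() == p.b.size());

    std::vector<std::uint8_t> gs(p.r.size());
    for (std::size_t i = 0; i < gs.size(); ++i)
    {
        // Element i of each plane belongs to the same pixel, so this is
        // purely element-wise arithmetic followed by rounding and clipping.
        const float v = p.r[i] * coef[0] + p.g[i] * coef[1] + p.b[i] * coef[2] + 0.5f;
        gs[i] = static_cast<std::uint8_t>(std::min(std::max(v, 0.0f), 255.0f));
    }
    return gs;
}
```

With this layout, a SIMD implementation could load consecutive elements of one plane into a register and process several pixels per multiply or add. The split itself has a cost, so the layout is likely to pay off mainly when the image data can be kept in planar form.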

Summary

Here are the key learning points of Chapter 10:

AVX2 extends the packed integer capabilities of AVX. Most x86-AVX packed integer instructions can be used with either 128-bit or 256-bit wide operands. These operands should be properly aligned whenever possible.
Similar to x86-AVX floating-point, assembly language functions that perform packed integer calculations using a YMM register should use a vzeroupper instruction prior to any epilog code or the ret instruction. This avoids potential performance delays that can occur when the processor transitions from executing x86-AVX instructions to x86-SSE instructions.
The Visual C++ calling convention differs for assembly language functions that return a structure by value. A function that returns a structure by value must copy a large structure (one greater than eight bytes) to the buffer pointed to by the RCX register. The normal calling convention registers are also “right-shifted” as explained in this chapter.
Assembly language functions can use the vpunpckl[bw|wd|dq] and vpunpckh[bw|wd|dq] instructions to unpack 128-bit or 256-bit wide integer operands.
Assembly language functions can use the vpackss[dw|wb] and vpackus[dw|wb] instructions to pack 128-bit or 256-bit wide integer operands using signed or unsigned saturation.
Assembly language functions can use the vpmovzx[bw|bd|bq|wd|wq|dq] and vpmovsx[bw|bd|bq|wd|wq|dq] instructions to perform zero or sign extended packed integer size promotions.
MASM supports directives that can perform rudimentary string processing operations, which can be employed to construct text strings for macro instruction mnemonics, operands, and labels. MASM also supports conditional error directives that can be used to signal error conditions during source code assembly.
