Chapter 15

Exploiting Parallelism

Abstract

This chapter covers the fundamentals of parallelism. While common techniques for parallelism, including multi-process and multi-threading, are briefly discussed, the majority of the chapter focuses on SIMD vectorization with SSE and Intel® Advanced Vector Extensions (Intel® AVX).

Keywords

SSE

SSE2

SSE3

SSSE3

SSE4.1

SSE4.2

AVX

AVX2

AVX512

SIMD

Vectorization

Alignment

Parallelism, the act of performing two or more operations simultaneously, is a topic that has generated a great deal of interest and headlines within the last decade. Part of this fascination stems from the shift of the consumer market away from uniprocessor (UP) systems to widely available symmetric multiprocessing (SMP) systems. Another part of this interest stems from the simplification of GPGPU, that is, writing general purpose code for the GPU, which is a dedicated parallel processor. This simplification has transformed highly parallel vector processing from a niche aspect of the high performance computing (HPC) segment into something very accessible to everyone. Regardless of the modern attention, research into parallelism, and its associated challenges, traces back to the 1950s, predating the x86 architecture by more than twenty years.

Unfortunately, one hazard of the modern hype is the misconception that parallel means fast, or that every task lends itself well to parallelization. On the contrary, many tasks can’t be parallelized in a manner that improves performance, and parallel code will often run slower than serial code. It is important to keep in mind that parallelism isn’t a silver bullet for performance. Just like all of the other aspects introduced within this book, properly wielding parallelism requires measurement and careful analysis in order to determine whether it is an appropriate solution for a given problem.

The most commonly discussed levels of parallelism utilize multiple threads and multiple processes. The decision between these two techniques typically revolves around whether the address space should be shared or isolated between the two components. In Linux, both threads and processes are created with the clone(2) system call, and processes can also be created with the fork(2) system call.

The concept of pipes and filters, that is, filters being the applications and pipes being the glue that connects multiple filters in order to build complicated commands out of simple ones, is a fundamental aspect of the UNIX philosophy. Every filter, whether executed interactively in a terminal emulator or noninteractively in a shell script, is executed by first creating a new process. This is achieved via the fork(2) system call, which duplicates the calling process into a new process. To change this new duplicate process into a different command, the new child process invokes the execve(2) system call, which replaces the current process image with the image of the specified executable.
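As a concrete sketch of this workflow (the command /bin/ls and its arguments are arbitrary placeholders), the following program forks and then replaces the child’s process image via execve(2):

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
        pid_t pid = fork();     /* duplicate the calling process (COW) */

        if (pid == -1) {
                perror("fork");
                return EXIT_FAILURE;
        }

        if (pid == 0) {
                /* child: replace the duplicated image with a new command */
                char *const argv[] = { "ls", "-l", NULL };
                char *const envp[] = { NULL };

                execve("/bin/ls", argv, envp);
                perror("execve");       /* only reached if execve() fails */
                _exit(EXIT_FAILURE);
        }

        /* parent: wait for the child, as a shell would for a foreground job */
        waitpid(pid, NULL, 0);
        return EXIT_SUCCESS;
}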

Since the workflow of creating a new process and then immediately loading a new process image to begin executing a different command is incredibly common, the fork(2)/execve(2) code path within the Linux kernel is heavily optimized. That being said, process and thread creation is still very expensive. In this situation, because execve(2) immediately replaces the current process image with a new image, the copying of the parent process into the child process, performed by the fork(2) system call, is unnecessary. In order to avoid these unnecessary data copies, Linux processes are created with Copy on Write (COW). With COW, rather than performing a deep copy of the parent’s resources into the new process, the child’s process resources point to the parent’s, but are marked as read-only. If the child process calls execve(2), or doesn’t perform any modifications to the address space, no copies are performed. On the other hand, if the child process does perform a write operation, then the write operation on the read-only pages will fault, giving the kernel time to perform the actual copies required for the new process.

As an aside, COW illustrates an important point about performance measurements. Hopefully the reader can see that without understanding COW, attempting to benchmark the overhead of process creation would probably be misleading. The system call API provides a nice abstraction of hardware functionality for user space; however, in order to measure the abstraction, at least a basic understanding of the lower layer and implementation is required. A wise teacher once told the author that operating systems are nothing but “lies, tricks, and deceptions.”

There are already many excellent resources covering the use of parallel threads and processes. Some of the author’s favorites include:

1. Is Parallel Programming Hard, And, If So, What Can You Do About It? by Paul E. McKenney

2. Modern Operating Systems by Andrew Tanenbaum

Additionally, Intel develops an open source library, Intel® Threading Building Blocks (Intel® TBB), that is designed to aid in writing high performance parallel code. Intel TBB is a cross-platform C++ library which adds parallel algorithms and containers, while also providing primitives for defining parallel tasks. These parallel tasks are then automatically scheduled by the Intel TBB library in order to provide better performance. The Intel TBB library is dual-licensed under the GPLv2 with the (libstdc++) runtime exception and a commercial license, and is typically available in most Linux distributions’ package managers. More information on Intel TBB can be found at https://www.threadingbuildingblocks.org/.

15.1 SIMD

Single Instruction Multiple Data (SIMD) instructions were introduced into the x86 architecture with the MMX™ technology, and continue to be incredibly significant to performance on modern x86 with the SSE and Intel® Advanced Vector Extensions (Intel® AVX) instruction set extensions. As the name implies, a single SIMD instruction performs the same operation, not on one data element, but on many simultaneously. Their purpose is to provide the general purpose processor with some of the benefits and capabilities of a vector processor.

Whereas vector processors are typically more parallel, that is, they can operate on larger data sets, providing SIMD functionality directly in the processor architecture provides some unique advantages. For example, when writing GPGPU kernels, special care must be taken to reduce the number of data copies, since PCIe bandwidth is usually a significant bottleneck. As a result, GPGPU can be impractical for use cases that require data to bounce back and forth between the CPU and GPU. On the other hand, SSE and Intel AVX resources are located close to the processor’s other execution resources, and can fully benefit from the processor caches. As a result, parallelization with SIMD has much lower setup costs, which make it more practical for situations where other forms of parallelism, such as threading, are too expensive.

The SIMD functionality of the processor depends on which instruction set extensions are supported. While it is always a good practice to check for support for all features used, since these extensions compound upon each other, Intel® processors typically don’t support one extension without supporting all prior extensions. For instance, if a processor supports SSE3, it will typically also support SSE2, SSE, and so on.
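For example, with GCC or Clang, a quick way to perform such a check from C is the compiler’s __builtin_cpu_supports() builtin, which consults CPUID-derived feature flags at run time; a minimal sketch:

#include <stdio.h>

int main(void)
{
        /* Each check queries the processor's feature flags at run time */
        if (__builtin_cpu_supports("sse4.2"))
                puts("SSE4.2 supported");
        if (__builtin_cpu_supports("avx"))
                puts("AVX supported");
        if (__builtin_cpu_supports("avx2"))
                puts("AVX2 supported");

        return 0;
}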

Originally, the MMX extensions added instructions designed to perform integer operations; however, unlike the later extensions, the MMX registers aliased to the x87 stack. This simplified operating system support, since no new state needed to be saved on a context switch, but complicated the simultaneous use of MMX and x87 operations. In general, MMX isn’t frequently used anymore.

The Intel® Pentium® III introduced the Streaming SIMD Extensions (SSE). This extension added new larger registers, which aren’t part of the x87 stack, and instructions for operating on both integer and floating point values. Over time, the functionality of SSE has been steadily augmented through a number of extensions, which each add new instructions. These extensions include SSE2, SSE3, SSSE3, SSE4.1, and SSE4.2. The extra S in SSSE3 stands for Supplemental, that is, Supplemental Streaming SIMD Extensions 3.

The Intel® Advanced Vector Extensions (Intel® AVX) were introduced with the Second Generation Intel® Core™ processor family. Once again this increased the register width, with a new set of registers and instructions. The Intel AVX instructions focused only on floating point operations, while the second extension, Intel® AVX2, introduced with the Fourth Generation Intel® Core™ processor family, added integer operations. Intel AVX also introduced the VEX instruction encoding, which is capable of supporting non-destructive versions of instructions. With a destructive form of an instruction, one operand serves as both the source and destination for the operation, that is, one of the source operands is clobbered. On the other hand, with VEX encoded non-destructive instruction forms, neither source operand is modified and a third operand is added for the destination.

Intel has already released the documentation for the next Intel AVX extension, Intel® Advanced Vector Extensions 512 (Intel® AVX-512). This extension is not called AVX3 because that would imply an extension that simply adds new instructions to Intel AVX. Instead, Intel AVX-512 increases the register width from Intel AVX, and uses a different encoding, known as EVEX.

When attempting to determine whether a code segment is a prime candidate for vectorization, there are a few characteristics to look for. First, the code should be a self-contained loop. In other words, the loop shouldn’t call any external functions on each data element within the loop, unless that function can be recreated in SIMD instructions. Second, there should be no data dependence between each loop iteration. If this doesn’t hold true, it can be difficult to achieve a high level of parallelism. Third, the loop computation shouldn’t require any branching, although masks can often be used to avoid branching. Review Section 13.1 in Chapter 13 for more information on techniques to avoid branches.
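As a brief illustration of the second point, the two hypothetical loops below differ only in whether each iteration depends on the previous one; the first vectorizes naturally, while the second, a prefix sum, carries a dependency that prevents straightforward vectorization:

#include <stddef.h>

/* Independent iterations: each out[i] depends only on a[i] and b[i],
 * so several elements can be computed per SIMD instruction. */
void scale_add(const float *a, const float *b, float *out, size_t n)
{
        for (size_t i = 0; i < n; i++)
                out[i] = 2.0f * a[i] + b[i];
}

/* Loop-carried dependency: out[i] needs the running sum from the previous
 * iteration, forcing the work to be performed largely in order. */
void prefix_sum(const float *a, float *out, size_t n)
{
        float sum = 0.0f;

        for (size_t i = 0; i < n; i++) {
                sum += a[i];
                out[i] = sum;
        }
}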

15.1.1 SIMD Registers

In order for the instructions to operate on multiple data elements, each data element is packed into special extra-wide registers. There are sixteen SSE registers, XMM0 through XMM15, that are each 128 bits wide, although only eight of these registers are available in 32-bit mode, with the other eight requiring 64-bit mode. There are sixteen Intel AVX registers, YMM0 through YMM15, that are each 256 bits wide. Finally, there are thirty-two Intel AVX-512 registers, ZMM0 through ZMM31, which are each 512 bits wide. So for instance, a 256-bit Intel AVX register can hold thirty-two packed bytes, sixteen packed words, eight packed dwords, eight packed IEEE 754 floats, four packed qwords, four packed IEEE 754 doubles, two packed 128-bit SSE values, or one 256-bit value.

The most straightforward and most efficient method to load these packed values into a SIMD register is by loading contiguous elements from an array. This is accomplished mainly by different forms of two instructions, MOVDQA and MOVDQU, which load the appropriate number of bytes beginning at the memory operand provided. The MOVDQA and MOVDQU instructions load 16 bytes into an SSE register, while the VMOVDQA and VMOVDQU instructions load 32 bytes into an Intel AVX register. As time has passed, the expectations and abilities of SIMD have evolved, resulting in multiple other instructions that perform data loads, such as MOVAPS, MOVAPD, and LDDQU. At the time of their respective introductions, these instructions all had slightly different internal behaviors, allowing for performance tuning in different situations; however, on modern x86 they are all essentially identical.
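From C, these loads are usually reached through compiler intrinsics rather than hand-written assembly. The following sketch, with illustrative buffer names, contrasts the aligned and unaligned 16-byte loads, which map onto MOVDQA and MOVDQU respectively:

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdalign.h>
#include <stdint.h>

void load_example(const uint8_t *unaligned_src)
{
        alignas(16) static const uint8_t aligned_src[16] = { 0 };

        /* MOVDQA: the source must be 16-byte aligned */
        __m128i a = _mm_load_si128((const __m128i *)aligned_src);

        /* MOVDQU: the source may have any alignment */
        __m128i u = _mm_loadu_si128((const __m128i *)unaligned_src);

        (void)a;
        (void)u;
}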

Data alignment

The data load instructions typically come in two distinct versions, one with the suffix containing the letter “A”, and one with the suffix containing the letter “U.” Sometimes this letter is the last letter in the suffix, such as in the case of MOVDQU and MOVDQA, and sometimes it is not, such as in the case of MOVAPS and MOVUPS. This letter designates instructions that only operate on aligned data, corresponding to the letter “A” for aligned, and instructions that operate on both unaligned and aligned data, corresponding to the letter “U.”

Each register size has a corresponding alignment requirement, the natural alignment. In other words, the 128-bit SSE registers have an alignment requirement of 16 bytes, that is, 128 bits, the 256-bit Intel AVX registers have an alignment requirement of 32 bytes, that is, 256 bits, and the 512-bit Intel AVX-512 registers have an alignment requirement of 64 bytes, that is, 512 bits. Attempting to perform an aligned load, or store, on a memory operand that doesn’t meet this alignment requirement will result in a fault, causing the process to receive a SIGBUS, or sometimes a SIGSEGV, signal. The default disposition for both signals is to generate a core dump and then terminate the process.

The reason for providing instructions that only operate on aligned memory operands is that aligned loads and stores are significantly faster than their unaligned counterparts.

Originally, utilizing an unaligned load or store instruction would result in the more expensive unaligned load or store occurring, regardless of whether the memory operand was aligned or not. Starting with the Second Generation Intel® Core™ processor family, there is extra logic in the unaligned load and store instructions that checks for an aligned memory operand and, if found, performs the aligned operation instead. This optimization in hardware improves performance in the situations where the alignment isn’t explicitly known and where software is written to always perform the unaligned load or store, to avoid the possibility of generating a fault. Unfortunately, this optimization has also resulted in some confusion about the performance impact of data alignment.

The reason for this confusion stems from the fact that using an unaligned load instruction no longer guarantees an actual unaligned data load, making it very easy to create misrepresentative microbenchmarks supposedly disproving the performance impact of alignment. On the contrary, data alignment does make a significant impact on performance. This is especially true for Intel® Atom™ processors.

This confusion is further compounded by the fact that, excluding the streaming instructions, loads and stores are satisfied by the cache, as opposed to directly accessing RAM. Therefore, an unaligned load that is contained within one cache line causes no extra penalty; however, by definition, there is no guarantee that an unaligned address doesn’t straddle two cache lines. In this situation, where two cache lines are needed to satisfy the load or store, the processor needs to internally perform two aligned loads and then splice the result together, resulting in the extra penalty. This too can lead to misleading microbenchmarks.

As a result, any time data is loaded into or stored from a SIMD register, the corresponding memory operand should be explicitly aligned, whether through the C11 alignas() macro, the POSIX posix_memalign(3) function, the compiler’s variable alignment attribute, or manually. When aligning both the source and the destination is not possible, favor aligning the destination. For more information about specifying data alignment, see Section 12.4.3 in Chapter 12.
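The following sketch shows the first three options; the 32-byte alignment matches the Intel AVX register width used later in this chapter, and the function name is illustrative:

#define _POSIX_C_SOURCE 200112L /* for posix_memalign(3) */
#include <stdalign.h>
#include <stdlib.h>

/* C11 alignas(): static or automatic storage */
alignas(32) static double buf_a[1024];

/* GCC/Clang variable attribute */
static double buf_b[1024] __attribute__((aligned(32)));

/* POSIX posix_memalign(3): heap allocations */
double *alloc_doubles_aligned(size_t n)
{
        void *p;

        if (posix_memalign(&p, 32, n * sizeof(double)))
                return NULL;
        return p;
}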

On newer microarchitectures, once the data is aligned, the selection of an aligned or unaligned instruction doesn’t matter. As a result, the author tends to prefer using the unaligned instructions, since they provide performance identical to the aligned instructions, but also provide safety in the case of an unaligned operand, i.e., the program won’t crash. The reader may be wondering how an unaligned operand could occur if the data was explicitly aligned; however, this does occasionally happen, typically with compiler variable attributes. The author has experience with compilers completely ignoring a variable’s alignment attribute, resulting in a fault (Cyril, 2012).

Handling unknown alignment

In many situations, the programmer will be unable to guarantee the alignment of an address at compile time. Since the additional overhead of performing unaligned loads can be significant enough to negate any performance benefits from vectorization, it may be beneficial to use a hybrid approach consisting of both vectorized and nonvectorized implementations. In this approach, the nonvectorized implementation of the algorithm is used to “single-step” until an aligned address is found. For an alignment requirement of n, any address is either a multiple of n, and therefore meets the requirement, or is anywhere from 1 to n − 1 bytes away from the next aligned address. This technique relies on the nonvectorized code incrementing the address in a manner such that an aligned address will eventually be reached. Once the aligned address is reached, the vectorized code can be utilized.

Assuming that the desired alignment is a power of two, which is a safe assumption for Intel processors, checking for alignment can be performed quickly with a logical and instruction. This takes advantage of the property of an integer’s binary representation, where a power of two is represented by a single set bit, and a power of two minus one is represented as every bit prior to the power of two’s bit being set. For example, 16 is represented as 10000₂ and 15 is represented as 01111₂. Using this knowledge, checking whether a number is a multiple of a power of two is as simple as checking whether any of the bits below that power of two’s corresponding bit are set.
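This check is small enough to wrap in a helper macro; the name IS_ALIGNED below is purely illustrative and not part of any standard header:

#include <stdint.h>

/* Nonzero when ptr is aligned to align bytes; align must be a power of two */
#define IS_ALIGNED(ptr, align) \
        (((uintptr_t)(ptr) & ((uintptr_t)(align) - 1)) == 0)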

Consider the following example, where this hybrid approach is used to combine a byte-by-byte nonvectorized and SSE vectorized implementation.

#include <inttypes.h>
#include <stddef.h>

void foo(char *x, size_t x_sz)
{
        /* Single-step until an aligned address is reached */
        while ((uintptr_t)x & (16 - 1) && x_sz) {
                /* Nonvectorized byte-by-byte algorithm */
                x_sz--;
                x++;
        }

        /* Bulk of operations w/ aligned data */
        while (x_sz >= 16) {
                /* Vectorized SSE algorithm */
                x_sz -= 16;
                x += 16;
        }

        /* Finish up leftovers */
        while (x_sz) {
                /* Nonvectorized byte-by-byte algorithm */
                x_sz--;
                x++;
        }
}

15.1.2 SIMD Operations

All SIMD operations, besides those responsible for moving data, can be classified as either operating horizontally or vertically. Operations that are performed vertically occur between the corresponding positions of two separate registers. On the other hand, operations that are performed horizontally occur between all of the positions of one register. These two concepts are illustrated in Figures 15.1 and 15.2. The majority of SIMD operations are performed vertically, as vertical operations provide more parallelism and better performance. When possible, avoid horizontal operations by organizing data in a manner that yields itself to vertical operations.

Figure 15.1 Vertical SIMD addition.

Figure 15.2 Horizontal SIMD addition.
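As a small sketch of the difference using SSE intrinsics, _mm_add_ps performs a vertical addition between corresponding lanes of two registers, while the SSE3 _mm_hadd_ps performs a horizontal addition between neighboring lanes within its operands:

#include <pmmintrin.h>  /* SSE3, for _mm_hadd_ps */

void add_example(void)
{
        __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);  /* lanes: 1, 2, 3, 4 */
        __m128 b = _mm_set_ps(8.0f, 7.0f, 6.0f, 5.0f);  /* lanes: 5, 6, 7, 8 */

        /* Vertical: { 1+5, 2+6, 3+7, 4+8 } */
        __m128 v = _mm_add_ps(a, b);

        /* Horizontal: { 1+2, 3+4, 5+6, 7+8 } */
        __m128 h = _mm_hadd_ps(a, b);

        (void)v;
        (void)h;
}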

In order to design data structures, programmers typically organize memory by grouping all of the fields needed for describing a single logical unit. In other words, a struct is organized to contain all of the attributes needed to describe a single instance of the data structure. If multiple instances are required, an array of structures is created.

For example, consider the situation where an array of structures exists for a series of data points, each represented by struct aos. The function foo() in this example iterates over this array and for each element performs floating point addition between the two elements, aos.x and aos.y, storing the result into a different array.

#include <stddef.h>

struct aos {
        double x;
        double y;
        /* other fields */
};

void foo(struct aos *in, size_t insz, double *out)
{
        size_t i;

        for (i = 0; i < insz; i++, in++) {
                out[i] = in->x + in->y;
        }
}

Assuming that foo() shows up on a performance profile as consuming a large percentage of cycles, this function is a good candidate for vectorization. Unfortunately the current memory organization, an array of structures, is less than optimal for vertical vectorization. This organization leads to poor utilization of the caches and memory bandwidth, since more data is fetched than is required, and the extra fields increase the chance for the data to be evicted from the cache before it is needed again. Additionally, there will be overhead for swizzling the data into the correct format, which potentially could negate any benefit from parallelizing the calculation.

The most efficient method for loading this data into the SIMD registers is sequentially. By reorganizing the layout of memory, it is possible to better accommodate this data flow. Ideally, the structure would be organized as groups of multiple units, rather than each struct representing a single unit. In other words, rather than organizing memory as an array of structures it would be better to organize them as a structure of arrays. This isn’t a strict requirement, but it will benefit performance. Continuing the previous example, the function foo() is easy to vectorize once the data is arranged accordingly:

 1 #include <inttypes.h>
 2 #include <immintrin.h>
 3
 4 struct soa {
 5         double *x;
 6         double *y;
 7         size_t len;
 8         /* other fields */
 9 };
10
11 void foo_avx(struct soa *in, double *out)
12 {
13         size_t sz = in->len;
14         double *x = in->x;
15         double *y = in->y;
16
17         while ((uintptr_t)out & (32 - 1) && sz) {       /* single-step until out is aligned */
18                 *out = *x + *y;
19
20                 x++;
21                 y++;
22                 out++;
23                 sz--;
24         }
25
26         while (sz >= 4) {       /* four doubles per Intel AVX iteration */
27                 __m256d x_ymm, y_ymm, r_ymm;
28
29                 x_ymm = _mm256_loadu_pd(x);
30                 y_ymm = _mm256_loadu_pd(y);
31                 r_ymm = _mm256_add_pd(x_ymm, y_ymm);
32                 _mm256_store_pd(out, r_ymm);
33
34                 x += 4;
35                 y += 4;
36                 out += 4;
37                 sz -= 4;
38         }
39
40         if (sz >= 2) {  /* two doubles with SSE */
41                 __m128d x_xmm, y_xmm, r_xmm;
42
43                 x_xmm = _mm_loadu_pd(x);
44                 y_xmm = _mm_loadu_pd(y);
45                 r_xmm = _mm_add_pd(x_xmm, y_xmm);
46                 _mm_store_pd(out, r_xmm);
47
48                 x += 2;
49                 y += 2;
50                 out += 2;
51                 sz -= 2;
52         }
53
54         if (sz) /* final leftover element */
55                 *out = *x + *y;
56 }

First, notice on lines 4 through 9, how the layout of memory differs from the definition of struct aos. This structure is designed for describing multiple elements, as opposed to just one. Second, notice that the first while loop, starting on line 17, single-steps until the destination array is aligned. Throughout the function, unaligned loads are used for the source arrays. Ideally, both of these arrays will be aligned correctly during allocation, but the code will work correctly either way. Third, notice that both SSE and Intel AVX are used in a complementary fashion. Of course, it will also be necessary to ensure that this function is only called on processors that support Intel AVX and SSE.
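A minimal sketch of such a check, assuming GCC or Clang and reusing the struct soa and foo_avx() definitions above; the names foo_scalar() and foo_dispatch() are purely illustrative:

/* Portable fallback performing the same additions without intrinsics */
void foo_scalar(struct soa *in, double *out)
{
        size_t i;

        for (i = 0; i < in->len; i++)
                out[i] = in->x[i] + in->y[i];
}

/* Selects an implementation at run time based on CPUID-derived feature flags */
void foo_dispatch(struct soa *in, double *out)
{
        if (__builtin_cpu_supports("avx"))
                foo_avx(in, out);       /* uses both Intel AVX and SSE */
        else
                foo_scalar(in, out);
}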

Although each SIMD register can be logically decomposed into a set of packed values, this concept is not enforced. For example, it is completely valid to load an array of unsigned integers into a SIMD register, then perform an operation on that register designed for unsigned words, then an operation designed for signed bytes, and then an operation designed for strings, assuming that this sequence produces the results the programmer desires.

To control how the contents of a SIMD register are interpreted, each instruction typically provides different forms for operating on different types. For example, to perform an addition operation on packed values, there are the ADDPS, ADDPD, PADDB, PADDW, PADDD, and PADDQ instructions, which operate on packed single-precision floats, double-precision floats, bytes, words, doublewords, and quadwords, respectively.
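For reference, the corresponding SSE and SSE2 compiler intrinsics follow the same split by element type; a minimal sketch, with an illustrative function name:

#include <emmintrin.h>  /* SSE2 intrinsics */

void packed_add_forms(void)
{
        __m128 f = _mm_setzero_ps();
        __m128d d = _mm_setzero_pd();
        __m128i vi = _mm_setzero_si128();

        f = _mm_add_ps(f, f);           /* ADDPS: packed 32-bit floats  */
        d = _mm_add_pd(d, d);           /* ADDPD: packed 64-bit doubles */
        vi = _mm_add_epi8(vi, vi);      /* PADDB: packed bytes          */
        vi = _mm_add_epi16(vi, vi);     /* PADDW: packed words          */
        vi = _mm_add_epi32(vi, vi);     /* PADDD: packed doublewords    */
        vi = _mm_add_epi64(vi, vi);     /* PADDQ: packed quadwords      */

        (void)f;
        (void)d;
        (void)vi;
}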

There are far too many instructions to cover them all here. Some of the special purpose SIMD instructions are covered in Chapter 16. Instead, the rest of this section provides a key for determining what an instruction does, based on its name. The following abbreviations should provide insight into the instruction’s meaning:

H (if first or second letter of instruction) Horizontal Operation, e.g., HSUBPS (Horizontal Subtraction on Packed IEEE 32-bit floats) and PHADDW (Horizontal Addition on Packed 16-bit words)

H (if not at beginning) High Lane, e.g., PSHUFHW (Shuffle Packed High Words)

L Low Lane, e.g., PSHUFLW (Shuffle Packed Low Words)

PS Packed Single (multiple IEEE 32-bit floats), e.g., CMPPS (Compare Floats)

PD Packed Double (multiple IEEE 64-bit doubles), e.g., CMPPD (Compare Doubles)

SS Scalar Single (1 IEEE 32-bit float), e.g., CMPSS (Compare Scalar Single)

SD Scalar Double (1 IEEE 64-bit double), e.g., CMPSD (Compare Scalar Double)

References

Cyril, B., July 2012. Bug 45351 - general protection fault in raid5, load_balance. Kernel Bugzilla entry. https://bugzilla.kernel.org/show_bug.cgi?id=45351

Intel Corporation, October 2013. Intel 64 and IA-32 Architectures Software Developer’s Manual. Computer hardware manual.

Intel Corporation, October 2014. Intel 64 and IA-32 Architectures Optimization Reference Manual. Order Number: 248966-030.
