Chapter 1 introduces x86 SIMD fundamentals and essential concepts. It begins with a section that defines SIMD and introduces SIMD arithmetic using a concise source code example. The next section presents a brief historical overview of x86 SIMD instruction set extensions. The principal sections of the chapter follow; these highlight x86 SIMD concepts and programming constructs including data types, arithmetic calculations, and data manipulation operations, and describe important particulars of AVX, AVX2, and AVX-512. Understanding the material presented in this chapter is essential since it provides the foundation needed to comprehend the topics and source code discussed in subsequent chapters.
Before proceeding, a few words about terminology are warranted. In all ensuing discussions, I will use the official acronyms AVX, AVX2, and AVX-512 when explaining specific features or instructions of these x86 SIMD instruction set extensions. I will use the term x86-AVX as an umbrella expression for x86 SIMD instructions or computational resources that pertain to more than one of the aforementioned x86 SIMD extensions. The terms x86-32 and x86-64 are used to signify x86 32-bit and 64-bit processors and execution environments. This book focuses exclusively on the latter, but the former is occasionally mentioned for historical context or comparison purposes.
What Is SIMD?
SIMD (single instruction multiple data) is a parallel computing technique whereby a CPU (or processing element incorporated within a CPU) performs a single operation using multiple data items concurrently. For example, a SIMD-capable CPU can carry out a single arithmetic operation using several elements of a floating-point array simultaneously. SIMD operations are frequently employed to accelerate the performance of computationally intense algorithms and functions in machine learning, image processing, audio/video encoding and decoding, data mining, and computer graphics.
Example Ch01_01
The function CalcZ_Cpp(), shown at the beginning of Listing 1-1, is a straightforward non-SIMD C++ function that calculates z[i] = x[i] + y[i]. However, a modern C++ compiler may generate SIMD code for this function as explained later in this section.
The next function in Listing 1-1, CalcZ_Iavx(), calculates the same result as CalcZ_Cpp() but employs C++ SIMD intrinsic functions to accelerate the computations. In CalcZ_Iavx(), the first for-loop uses the C++ SIMD intrinsic function _mm256_loadu_ps() to load eight consecutive elements from array x (i.e., elements x[i:i+7]) and temporarily saves these elements in an __m256 object named x_vals. An __m256 object is a generic container that holds eight values of type float. The ensuing _mm256_loadu_ps() call performs the same operation using array y. This is followed by a call to _mm256_add_ps() that calculates z[i:i+7] = x[i:i+7] + y[i:i+7]. What makes this code different from the code in the non-SIMD function CalcZ_Cpp() is that _mm256_add_ps() performs all eight array element additions concurrently. The final C++ intrinsic function in the first for-loop, _mm256_storeu_ps(), saves the resulting array element sums to z[i:i+7].
It is important to note that since the first for-loop in CalcZ_Iavx() processes eight array elements per iteration, it must terminate if there are fewer than eight elements remaining to process. The second for-loop handles any remaining (or residual) elements and only executes if n is not an integral multiple of eight. It is also important to mention that the C++ compiler treats C++ SIMD intrinsic function calls differently than normal C++ function calls. In the current example, the C++ compiler directly translates each C++ SIMD intrinsic function call into its corresponding AVX assembly language instruction. The overhead associated with a normal C++ function call is eliminated.
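The loop structure just described can be sketched as follows. This is a hypothetical approximation of CalcZ_Iavx(), not the exact code of Listing 1-1; the `__attribute__((target("avx")))` annotation is a GCC/Clang convenience that enables AVX code generation for this one function without a global compiler flag (Visual C++ compiles AVX intrinsics without it).

```cpp
#include <immintrin.h>
#include <cstddef>

// Hypothetical sketch of a CalcZ_Iavx-style function: a SIMD loop that
// processes eight floats per iteration, followed by a scalar loop for
// any residual elements.
__attribute__((target("avx")))
void CalcZ_Iavx(float* z, const float* x, const float* y, size_t n)
{
    size_t i = 0;

    // Main loop: eight concurrent single-precision additions per iteration
    for (; i + 8 <= n; i += 8)
    {
        __m256 x_vals = _mm256_loadu_ps(&x[i]);     // load x[i:i+7]
        __m256 y_vals = _mm256_loadu_ps(&y[i]);     // load y[i:i+7]
        __m256 z_vals = _mm256_add_ps(x_vals, y_vals);
        _mm256_storeu_ps(&z[i], z_vals);            // save z[i:i+7]
    }

    // Residual loop: executes only if n is not a multiple of eight
    for (; i < n; i++)
        z[i] = x[i] + y[i];
}
```

Note that the same two-loop pattern appears in the assembly language version of this algorithm discussed below.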
The final function in Listing 1-1 is named CalcZ_Aavx(). This is an x86-64 assembly language function that performs the same array calculation as CalcZ_Cpp() and CalcZ_Iavx(). What is noteworthy about this function is that the AVX instructions vmovups and vaddps contained in the code block are the same instructions that the C++ compiler emits for the C++ SIMD intrinsic functions _mm256_loadu_ps() and _mm256_add_ps(), respectively. The remaining code in CalcZ_Aavx() implements the two for-loops that are also implemented in function CalcZ_Iavx().
Do not worry if you are somewhat perplexed by the source code in Listing 1-1. The primary purpose of this book is to teach you how to develop and code SIMD algorithms like this using either C++ SIMD intrinsic functions or x86-64 assembly language. There are two takeaway points from this section. First, the CPU executes most SIMD arithmetic operations on the specified data elements concurrently. Second, similar design patterns are often employed when coding a SIMD algorithm regardless of whether C++ or assembly language is used.
One final note regarding the code in Listing 1-1. Recent versions of mainstream C++ compilers such as Visual C++ and GNU C++ are sometimes capable of automatically generating efficient x86 SIMD code for trivial arithmetic functions like CalcZ_Cpp(). However, these compilers still have difficulty generating efficient SIMD code for more complicated functions, especially ones that employ nested for-loops or nontrivial decision logic. In these cases, functions written using C++ SIMD intrinsic functions or x86-64 assembly language code can often outperform the SIMD code generated by a modern C++ compiler. However, employing C++ SIMD intrinsic functions does not improve performance in all cases. Many programmers will often code computationally intensive algorithms using standard C++ first, benchmark the code, and then recode bottleneck functions using C++ SIMD intrinsic functions or assembly language.
Historical Overview of x86 SIMD
For aspiring x86 SIMD programmers, having a basic understanding about the history of x86 SIMD and its various extensions is extremely beneficial. This section presents a brief overview that focuses on noteworthy x86 SIMD instruction set extensions. It does not discuss x86 SIMD extensions incorporated in special-use processors (e.g., Intel Xeon Phi) or x86 SIMD extensions that were never widely used. If you are interested in a more comprehensive chronicle of x86 SIMD architectures and instruction set extensions, you can consult the references listed in Appendix B.
Intel introduced the first x86 SIMD instruction set extension, called MMX, in 1997. This extension added instructions that facilitated simple SIMD operations using 64-bit wide packed integer operands. The MMX extension did not add any new registers to the x86 platform; it simply repurposed the registers in the x87 floating-point unit for SIMD integer arithmetic and other operations. In 1998, AMD launched an x86 SIMD extension called 3DNow, which facilitated vector operations using single-precision floating-point values. It also added a few new integer SIMD instructions. Like MMX, 3DNow uses x87 FPU registers to hold instruction operands. Both MMX and 3DNow have been superseded by newer x86 SIMD technologies and should not be used to develop new code.
In 1999, Intel launched a new SIMD technology called Streaming SIMD Extensions (SSE). SSE adds 128-bit wide registers to the x86 platform and instructions that perform packed single-precision (32-bit) floating-point arithmetic. SSE also includes a few packed integer instructions. In 2000, SSE2 was launched and extends the floating-point capabilities of SSE to cover packed double-precision (64-bit) values. SSE2 also significantly expands the packed integer capabilities of SSE. Unlike x86-32 processors, all x86-64-compatible processors from both AMD and Intel support the SSE2 instruction set extension. The SIMD extensions that followed SSE2 include SSE3 (2004), SSSE3 (2006), SSE4.1 (2008), and SSE4.2 (2008). These extensions incorporated additional SIMD instructions that perform operations using either packed integer or floating-point operands, but no new registers or data types.
In 2011, Intel introduced processors that supported a new x86 SIMD technology called Advanced Vector Extensions (AVX). AVX adds packed floating-point operations (both single precision and double precision) using 256-bit wide registers. AVX also supports a new three-operand assembly language instruction syntax, which helps reduce the number of register-to-register data transfers that a software function must perform. In 2013, Intel unveiled AVX2, which extends AVX to support packed-integer operations using 256-bit wide registers. AVX2 also adds enhanced data transfer capabilities with its broadcast, gather, and permute instructions. Processors that support AVX or AVX2 may also support fused-multiply-add (FMA) operations. FMA enables software algorithms to perform sum-of-product (e.g., dot product) calculations using a single floating-point rounding operation, which can improve both performance and accuracy.
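The single-rounding property of FMA can be demonstrated in standard C++ using std::fma from &lt;cmath&gt;, which maps to the hardware FMA instruction when one is available. The following sketch (function names are illustrative, not from Listing 1-1) compares a two-rounding multiply-add against a fused one:

```cpp
#include <cmath>

// Two roundings: the product a*b is rounded first, then the sum is rounded.
double TwoStep(double a, double b, double c)
{
    return a * b + c;
}

// One rounding: a*b + c is computed exactly, then rounded once.
double Fused(double a, double b, double c)
{
    return std::fma(a, b, c);
}
```

With a = 0.1, b = 10.0, and c = -1.0, TwoStep() returns exactly 0.0 because the product 0.1 * 10.0 rounds to 1.0, while Fused() returns the small nonzero residue (approximately 5.55e-17) that arises because 0.1 is not exactly representable in binary floating-point.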
Beginning in 2017, high-end desktop and server-oriented processors marketed by Intel included a new SIMD extension called AVX-512. This architectural enhancement supports packed integer and floating-point operations using 512-bit wide registers. AVX-512 also includes SIMD extensions that facilitate instruction-level conditional data merging, floating-point rounding control, and embedded broadcast operations.
In addition to the abovementioned SIMD extensions, numerous non-SIMD instructions have been added to the x86 platform during the past 25 years. This ongoing evolution of the x86 platform presents some challenges to software developers who want to exploit the latest instruction sets and computational resources. Fortunately, there are techniques that you can use to determine which x86 SIMD and non-SIMD instruction set extensions are available during program execution. You will learn about these methods in Chapter 9. To ensure software compatibility with future processors, a software developer should never assume that a particular x86 SIMD or non-SIMD instruction set extension is available based on processor manufacturer, brand name, model number, or underlying microarchitecture.
SIMD Data Types
SIMD Data Types and Maximum Number of Elements
Numerical Type | xmmword | ymmword | zmmword |
---|---|---|---|
8-bit integer | 16 | 32 | 64 |
16-bit integer | 8 | 16 | 32 |
32-bit integer | 4 | 8 | 16 |
64-bit integer | 2 | 4 | 8 |
Single-precision floating-point | 4 | 8 | 16 |
Double-precision floating-point | 2 | 4 | 8 |
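In C++ code, these operand widths correspond to the intrinsic vector types provided by &lt;immintrin.h&gt; (assuming an x86 compiler that supplies this header). The sizes below match the single-precision columns of the table:

```cpp
#include <immintrin.h>
#include <cstddef>

// The C++ intrinsic types that correspond to the operand widths in the
// table above, with their single-precision element counts:
//   __m128 - xmmword, 128 bits, 4 floats
//   __m256 - ymmword, 256 bits, 8 floats
//   __m512 - zmmword, 512 bits, 16 floats
constexpr size_t kXmmFloats = sizeof(__m128) / sizeof(float);
constexpr size_t kYmmFloats = sizeof(__m256) / sizeof(float);
constexpr size_t kZmmFloats = sizeof(__m512) / sizeof(float);
```

The integer counterparts (__m128i, __m256i, __m512i) and double-precision counterparts (__m128d, __m256d, __m512d) have the same widths and hold the element counts shown in the other rows of the table.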
AVX-512 extends the maximum width of an x86 SIMD operand from 256 bits to 512 bits. Many AVX-512 instructions can also be used with 128- and 256-bit wide SIMD operands. However, it should be noted at this point that unlike AVX and AVX2, AVX-512 is not a cohesive x86 SIMD instruction set extension. Rather, it is a collection of interrelated but distinct instruction set extensions. An AVX-512-compliant processor must minimally support 512-bit wide operands of packed floating-point (single-precision or double-precision) and packed integer (32- or 64-bit wide) elements. The AVX-512 instructions that exercise 128- and 256-bit wide operands are a distinct x86 SIMD extension, as are the instructions that support packed 8- and 16-bit wide integers. You will learn more about this in the chapters that explain AVX-512 programming. AVX-512 also adds eight opmask registers that a function can use to perform masked moves or masked zeroing.
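Merge masking with an opmask register can be modeled in ordinary scalar C++. The sketch below is a conceptual illustration of the semantics only, not actual AVX-512 intrinsic code, and the function name is hypothetical: each bit of the opmask decides whether the corresponding destination element receives the computed result or keeps its previous value.

```cpp
#include <cstdint>
#include <cstddef>

// Conceptual scalar model of AVX-512 merge masking (not actual intrinsics):
// bit i of the opmask selects whether element i receives the computed sum
// or retains the previous destination value.
void MaskedAddModel(float* dst, const float* a, const float* b,
                    uint16_t opmask, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        if (opmask & (1u << i))
            dst[i] = a[i] + b[i];   // mask bit set: write the sum
        // mask bit clear: dst[i] is left unchanged (merge masking);
        // with zero masking, dst[i] would be set to 0.0f instead
    }
}
```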
SIMD Arithmetic
Source code example Ch01_01 introduced simple SIMD addition using single-precision floating-point elements. In this section, you will learn more about SIMD arithmetic operations that perform their calculations using either integer or floating-point elements.
SIMD Integer Arithmetic
Wraparound vs. Saturated Arithmetic
One notable feature of x86-AVX is its support for saturated integer arithmetic. When performing saturated integer arithmetic, the processor automatically clips the elements of a SIMD operand to prevent an arithmetic overflow or underflow condition from occurring. This is different from normal (or wraparound) integer arithmetic where an overflow or underflow result is retained. Saturated arithmetic is extremely useful when working with pixel values since it eliminates the need to explicitly check each pixel value for an overflow or underflow. X86-AVX includes instructions that perform packed saturated addition and subtraction using 8- or 16-bit wide integer elements, both signed and unsigned.
Range Limits for Saturated Arithmetic
Integer Type | Lower Limit | Upper Limit |
---|---|---|
8-bit signed | -128 | 127 |
8-bit unsigned | 0 | 255 |
16-bit signed | -32768 | 32767 |
16-bit unsigned | 0 | 65535 |
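The difference between wraparound and saturated addition can be modeled in scalar C++ using the 8-bit unsigned limits from the table above. This is a conceptual per-element sketch with hypothetical function names, not intrinsic code; the packed instructions perform the same clipping on every element of a SIMD operand simultaneously.

```cpp
#include <cstdint>
#include <algorithm>

// Saturated unsigned 8-bit addition: the sum is clipped to the
// [0, 255] range instead of wrapping around.
uint8_t SatAddU8(uint8_t a, uint8_t b)
{
    int sum = int(a) + int(b);
    return uint8_t(std::min(sum, 255));
}

// Wraparound unsigned 8-bit addition: the sum is reduced modulo 256.
uint8_t WrapAddU8(uint8_t a, uint8_t b)
{
    return uint8_t(a + b);
}
```

For example, adding the pixel values 200 and 100 yields 255 with saturated arithmetic but wraps around to 44 with ordinary arithmetic, which is why saturation is preferred for image processing.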
SIMD Floating-Point Arithmetic
SIMD Data Manipulation Operations
Besides arithmetic calculations, many algorithms frequently employ SIMD data manipulation operations. X86-AVX SIMD data manipulation operations include element compares, shuffles, permutations, blends, conditional moves, broadcasts, size promotions/reductions, and type conversions. You will learn more about these operations in the programming chapters of this book. A few common SIMD data manipulation operations are, however, employed frequently enough to warrant some preliminary comments in this chapter.
One indispensable SIMD data manipulation operation is a data compare. Like a SIMD arithmetic calculation, the operations performed during a SIMD compare are carried out simultaneously using all operand elements. However, the results generated by a SIMD compare are different from those produced by an ordinary scalar compare. When performing a scalar compare such as a > b, the processor conveys the result using status bits in a flags register (on x86-64 processors, this flags register is named RFLAGS). A SIMD compare is different in that it needs to report the results of multiple compare operations, which means a single set of status bits in a flags register is inadequate. To overcome this limitation, SIMD compares return a mask value that signifies the result of each SIMD element compare operation.
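The mask-producing behavior of a SIMD compare can be modeled in scalar C++. The sketch below is conceptual with a hypothetical function name, not an actual intrinsic: each element compare yields an all-ones mask (true) or an all-zeros mask (false) rather than setting status flags.

```cpp
#include <cstdint>
#include <cstddef>

// Conceptual scalar model of a SIMD greater-than compare: each element
// compare produces an all-ones mask (0xFFFFFFFF) if true or an all-zeros
// mask (0x00000000) if false, instead of setting bits in a flags register.
void CmpGtMask(uint32_t* mask, const float* a, const float* b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        mask[i] = (a[i] > b[i]) ? 0xFFFFFFFFu : 0x00000000u;
}
```

All-ones/all-zeros masks of this kind are convenient because they can be combined with bitwise operations to implement branch-free conditional moves and blends, as later chapters demonstrate.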
SIMD Programming
As mentioned in the Introduction, the primary objective of this book is to help you learn x86 SIMD programming using C++ SIMD intrinsic functions and x86-64 assembly language. The source code examples that you will see in the subsequent chapters are structured to help you achieve this goal.
Source Code File Name Suffixes
File Name Suffix | Description |
---|---|
.h | Standard C++ header file |
.cpp | Standard C++ source code file |
_fcpp.cpp | C++ algorithm code (non-SIMD and SIMD) |
_misc.cpp | Miscellaneous C++ functions |
_bm.cpp | Benchmarking code |
_fasm.asm | Assembly language algorithm code (SIMD) |
Source Code Function Name Suffixes
Function Name Suffix | Description |
---|---|
_Cpp (or no suffix) | Function that uses standard C++ statements |
_Iavx | Function that uses C++ AVX intrinsic functions |
_Iavx2 | Function that uses C++ AVX2 intrinsic functions |
_Iavx512 | Function that uses C++ AVX-512 intrinsic functions |
_Aavx | Function that uses AVX assembly language instructions |
_Aavx2 | Function that uses AVX2 assembly language instructions |
_Aavx512 | Function that uses AVX-512 assembly language instructions |
The most important code resides in files with the suffix names _fcpp.cpp and _fasm.asm. The code in files with other suffix names is somewhat ancillary but still necessary to create an executable program. Note that function names incorporating the substrings avx, avx2, and avx512 will only work on processors that support the AVX, AVX2, and AVX-512 instruction set extensions, respectively. You can use one of the free utilities listed in Appendix B to verify the processing capabilities of your computer.
Finally, it should be noted that the C++ SIMD intrinsic functions used in the source code examples were originally developed by Intel for their compilers. Most of these functions are also supported by other x86-64 C++ compilers including Visual Studio C++ and GNU C++. Appendix A contains additional information about the source code including download and build instructions for both Visual Studio and the GNU toolchain. Depending on your personal preference, you may want to download and install the source code first before proceeding to the next chapter.
Summary
SIMD is a parallel computing technique that carries out concurrent calculations using multiple data items.
AVX supports 128- and 256-bit wide packed floating-point operands. It also supports packed 128-bit wide integer operands.
AVX2 extends AVX to support 256-bit wide integer operands. It also adds additional broadcast and permutation instructions.
AVX-512 minimally supports 512-bit wide packed operands of single-precision or double-precision floating-point values. It also supports 512-bit wide operands of packed 32- and 64-bit wide integers.
The terms xmmword, ymmword, and zmmword are x86 assembly language expressions for 128-, 256-, and 512-bit wide SIMD data types and operands.
The terms byte, word, doubleword, and quadword are x86 assembly language designations for 8-, 16-, 32-, and 64-bit integers.
X86-AVX supports both wraparound and saturated arithmetic for packed 8- and 16-bit integers, both signed and unsigned.