In earlier chapters, you studied a variety of source code examples that demonstrated floating-point calculations and algorithms using AVX and AVX2 C++ SIMD intrinsic functions. In this chapter, you will examine similar source code examples that highlight the use of AVX-512 C++ SIMD intrinsic functions that perform floating-point operations. The first section contains two source code examples that illustrate simple floating-point arithmetic using 512-bit wide operands. The next two sections focus on using AVX-512 to perform computations with floating-point arrays and matrices. The final section explains how to perform discrete convolutions using AVX-512.
Floating-Point Arithmetic
In this section, you will learn how to perform elementary floating-point arithmetic using AVX-512 C++ SIMD intrinsic functions. You will also learn how to carry out merge masking and zero masking using floating-point operands.
Basic Arithmetic
Example Ch08_01
Toward the top of Listing 8-1 are the function declarations for example Ch08_01. Note that these declarations use the ZmmVal structure that you learned about in Chapter 7. The file Ch08_01.cpp contains two functions named PackedMathF32() and PackedMathF64(). These functions perform test case initialization for the SIMD calculating functions PackedMathF32_Iavx512() and PackedMathF64_Iavx512(). They also stream results to std::cout.
The file Ch08_01_fcpp.cpp begins with the definition of function PackedMathF32_Iavx512(). This function uses the C++ SIMD intrinsic function _mm512_load_ps() to initialize a_vals and b_vals. The next code block consists of C++ SIMD intrinsic function calls that perform various AVX-512 arithmetic operations using packed single-precision floating-point operands. This is followed by a series of _mm512_store_ps() calls that save the calculated results. Note that both _mm512_load_ps() and _mm512_store_ps() require their memory operands to be aligned on a 64-byte boundary.
Compare Operations
Example Ch08_02
In Listing 8-2, the files Ch08_02.h and Ch08_02.cpp contain the function declarations and test case initialization code for this source code example. The first function in file Ch08_02_fcpp.cpp, PackedCompareF32_Iavx512(), performs SIMD compare operations using packed single-precision floating-point operands. Unlike AVX and AVX2, AVX-512 SIMD floating-point compare operations return scalar integer masks that signify the results. In the current example, the C++ SIMD intrinsic function _mm512_cmp_ps_mask() returns an integer value of type __mmask16. Each bit position of this 16-bit wide mask value reports the compare result for the corresponding SIMD operand element position (1 = compare predicate true, 0 = compare predicate false). Function _mm512_cmp_ps_mask() uses the same compare predicates as _mm256_cmp_ps() (see example Ch03_02).
Floating-Point Arrays
Example Ch08_03
The first two functions in file Ch08_03_fcpp.cpp, CalcMeanF32_Cpp() and CalcStDevF32_Cpp(), calculate the mean and standard deviation using standard C++ statements. These functions are included in this example for comparison purposes. The next function, CalcMeanF32_Iavx512(), calculates the array mean using AVX-512 C++ SIMD intrinsic functions. Following argument validation, function CalcMeanF32_Iavx512() uses _mm512_setzero_ps() to initialize sums to zero. The variable sums contains 16 intermediate single-precision floating-point sum values. These values are updated during each iteration of the ensuing for-loop. Following execution of the for-loop, CalcMeanF32_Iavx512() uses the C++ SIMD intrinsic function _mm512_reduce_add_ps() to reduce the 16 single-precision floating-point values in sums to a single scalar value. Recall that the AVX code in example Ch03_04 employed a sequence of C++ SIMD intrinsic function calls to perform this same reduction. Following the reduction of sums, the second for-loop in CalcMeanF32_Iavx512() processes any residual elements using scalar arithmetic.
Like example Ch03_04, the results for source code example Ch08_03 contain slight discrepancies due to the non-associativity of floating-point arithmetic. Whether these discrepancies are of any consequence depends on the specific application.
Floating-Point Matrices
In Chapter 5, you studied several source code examples that explained how to perform common matrix operations using AVX2 C++ SIMD intrinsic functions. In this section, you will learn how to carry out some of the same matrix operations using AVX-512 C++ SIMD intrinsic functions. The first source code example highlights the use of AVX-512 to calculate a covariance matrix. This is followed by two source code examples that spotlight matrix multiplication. The final source code example of this section explicates matrix-vector multiplication. As you will soon see, it is often a straightforward programming task to adapt an algorithm originally written using AVX2 C++ SIMD intrinsic functions to one that exploits the computational resources of AVX-512.
Covariance Matrix
Mathematicians often use a statistical measure called covariance to quantify the extent to which two random variables vary together. When multiple random variables are being analyzed, it is common to calculate a matrix of all possible covariances. This matrix is called, unsurprisingly, a covariance matrix. Once calculated, a covariance matrix can be employed to perform a wide variety of advanced statistical analyses. Appendix B contains several references that you can consult if you are interested in learning more about covariance and covariance matrices.
Example Ch08_04
Near the top of Listing 8-4 is the file Ch08_04.h, which begins with the definition of structure CMD (CMD = covariance matrix data). This structure contains the data matrix, the variable means vector, and the covariance matrix. Note that CMD also includes a simple constructor that allocates space for the three container objects using the specified n_vars and n_obvs. The source code that performs argument validation, test data initialization, and result comparisons is not shown in Listing 8-4 but is included in the downloadable software package.
The core calculating functions of this source code example are in Ch08_04_fcpp.cpp, which begins with the definition of function CalcCovMatF64_Cpp(). This function uses standard C++ statements to calculate the covariance matrix and is included for comparison purposes. The code in CalcCovMatF64_Cpp() is split into two major sections. The first section calculates the mean of each variable (or row) in data matrix x. The second section calculates the covariances. Note that function CalcCovMatF64_Cpp() exploits the fact that a covariance matrix is symmetric and only carries out a complete calculation when i <= j is true. When i <= j is false, CalcCovMatF64_Cpp() simply executes cov_mat[i][j] = cov_mat[j][i].
The next function in Ch08_04_fcpp.cpp is a SIMD inline function named ReduceAddF64(). This function reduces the double-precision floating-point elements of arguments a (__m512d), b (__m256d), and c (__m128d) to a scalar double-precision value. Note that ReduceAddF64() employs several C++ SIMD intrinsic functions to size-extend argument values b and c to packed 512-bit wide SIMD values. Doing this facilitates the use of the AVX-512 C++ SIMD intrinsic function _mm512_reduce_add_pd() to perform the reduction.
The final function in Listing 8-4 is named CalcCovMatF64_Iavx512(). Like its standard C++ counterpart, function CalcCovMatF64_Iavx512() uses distinct sections of code to calculate the variable means and the covariance matrix. The mean-calculating while-loop employs __m512d, __m256d, __m128d, or scalar objects to perform its computations. Note that each if section verifies that enough elements are available in the current row before carrying out any SIMD calculations. Following the while-loop, CalcCovMatF64_Iavx512() invokes ReduceAddF64() to reduce sums_512, sums_256, and sums_128 to a single scalar value. It then calculates var_means[i].
Matrix Multiplication
Example Ch08_05
Near the top of Listing 8-5 is the source code for function MatrixMulF32_Iavx512(), which performs single-precision floating-point matrix multiplication. The primary difference between this function and the function MatrixMulF32_Iavx2() that you studied in example Ch05_02 is in the code that calculates the residual column mask for the current row. In example Ch05_02, function MatrixMulF32_Iavx2() used a SIMD integer (__m256i) mask. In this example, function MatrixMulF32_Iavx512() uses a scalar integer (__mmask16) mask since these are directly supported by AVX-512.
Matrix Multiplication (Single-Precision) Execution Times (Microseconds)
| CPU | MatrixMulF32_Cpp() | MatrixMulF32_Iavx512() |
|---|---|---|
| Intel Core i5-11600K | 11432 | 713 |
Example Ch08_06
Matrix Multiplication (Double-Precision) Execution Times (Microseconds)
| CPU | MatrixMulF64_Cpp() | MatrixMulF64_Iavx512() |
|---|---|---|
| Intel Core i5-11600K | 11972 | 1518 |
Matrix (4 × 4) Vector Multiplication
Example Ch08_07
The source code in file Ch08_07_fcpp.cpp begins with a series of arrays that contain permutation indices. The AVX-512 implementation of the matrix-vector multiplication algorithm uses these indices to reorder the elements of the source matrix and vectors. The reason for this reordering is to facilitate the calculation of four matrix-vector products during each iteration of the for-loop. The definition of function MatVecMulF32_Cpp() follows the permutation indices. This function calculates matrix-vector (4 × 4, 4 × 1) products using standard C++ statements.
Matrix-Vector (4 × 4, 4 × 1) Multiplication Execution Times (Microseconds), 1,000,000 Vectors
| CPU | MatVecMulF32_Cpp() | MatVecMulF32a_Iavx512() | MatVecMulF32b_Iavx512() |
|---|---|---|---|
| Intel Core i5-11600K | 5069 | 1111 | 708 |
Convolutions
In Chapter 6, you learned how to compute 1D and 2D discrete convolutions using C++ SIMD intrinsic functions and AVX2. In this section, you will examine two source code examples that illustrate convolutions using AVX-512. As in Chapter 6, the source code examples discussed in this section are somewhat more specialized than those covered in the previous sections. If your SIMD programming interests reside elsewhere, you can either skim this section or skip ahead to the next chapter. If you decide to continue, you may want to review the sections in Chapter 6 that explain the mathematics of a discrete convolution before examining the source code.
1D Convolutions
Example Ch08_08
The first function in Listing 8-8, Convolve1D_F32_Cpp(), implements a 1D discrete convolution using standard C++ statements. This function is identical to the one you saw in source code example Ch06_01 and is included again here for benchmarking purposes. The next function in Listing 8-8, named Convolve1D_F32_Iavx512(), uses AVX-512 C++ SIMD intrinsic functions to implement a 1D discrete convolution. This function is similar to the function Convolve1D_F32_Iavx2() that was presented in source example Ch06_01. The primary difference is that Convolve1D_F32_Iavx512() includes an extra code block near the top of the while-loop that processes signal elements y[i:i+15] using __m512 data types and the following C++ SIMD intrinsic functions: _mm512_loadu_ps(), _mm512_set1_ps(), _mm512_fmadd_ps(), and _mm512_storeu_ps(). The other code blocks in the while-loop process signal elements y[i:i+7], y[i:i+3], or y[i] using C++ SIMD intrinsic functions and data types just like function Convolve1D_F32_Iavx2() did in example Ch06_01.
Following Convolve1D_F32_Iavx512() in Listing 8-8 is the function Convolve1DKs5_F32_Iavx512(). This function implements a 1D discrete convolution using AVX-512 and is optimized for a five-element convolution kernel. Recall from the discussions in Chapter 6 that many real-world signal processing applications frequently employ size-optimized convolution functions since they are often faster than their variable-width counterparts. Note that the principal modification between the code in Convolve1DKs5_F32_Iavx512() and the Ch06_01 function Convolve1DKs5_F32_Iavx2() is that the former includes a code block near the top of the while-loop that processes signal elements y[i:i+15] using __m512 data types and AVX-512 C++ SIMD intrinsic functions.
1D Discrete Convolution (Single-Precision) Execution Times (Microseconds)
| CPU | Convolve1D_F32_Cpp() | Convolve1D_F32_Iavx512() | Convolve1DKs5_F32_Iavx512() |
|---|---|---|---|
| Intel Core i5-11600K | 2268 | 242 | 200 |
2D Convolutions
Example Ch08_09
The source code for file Ch08_09_fcpp.cpp that is shown in Listing 8-9 is somewhat lengthy but (hopefully) relatively straightforward to comprehend. It begins with the function Convolve1Dx2_F32_Cpp(), which implements a 2D discrete convolution using standard C++ statements. This function is identical to the one you studied in source code example Ch06_04 and is included again in this example for benchmarking purposes.
Also shown in Listing 8-9 is the SIMD calculating function Convolve1Dx2_F32_Iavx512(). This function, which is a modified version of function Convolve1Dx2_F32_Iavx2() (see Listing 6-4), performs a 2D discrete convolution using AVX-512 C++ SIMD intrinsic functions. In function Convolve1Dx2_F32_Iavx512(), note the inclusion of an extra if block in the x-axis section that processes image pixels using __m512 data types and the following C++ SIMD intrinsic functions: _mm512_loadu_ps(), _mm512_set1_ps(), _mm512_fmadd_ps(), and _mm512_storeu_ps(). A similar if block was also added to the y-axis section of Convolve1Dx2_F32_Iavx512().
2D Discrete Convolution (Single-Precision) Execution Times (Microseconds)
| CPU | Convolve1Dx2_F32_Cpp() | Convolve1Dx2_F32_Iavx512() |
|---|---|---|
| Intel Core i5-11600K | 14373 | 2065 |
Summary
C++ SIMD Intrinsic Function Summary for Chapter 8
| C++ SIMD Function Name | Description |
|---|---|
| _mm256_insertf64x2 | Insert double-precision elements |
| _mm512_abs_pd, _ps | Packed floating-point absolute value |
| _mm512_add_pd, _ps | Packed floating-point addition |
| _mm512_cmp_pd_mask, _ps_mask | Packed floating-point compare |
| _mm512_div_pd, _ps | Packed floating-point division |
| _mm512_extractf32x4_ps, _f32x8_ps | Extract floating-point elements |
| _mm512_fmadd_pd, _ps | Packed floating-point fused-multiply-add |
| _mm512_insertf64x2, _f64x4 | Insert double-precision elements |
| _mm512_load_epi8, _epi16, _epi32, _epi64 | Load packed integer elements |
| _mm512_load_pd, _ps | Load (aligned) floating-point elements |
| _mm512_loadu_pd, _ps | Load (unaligned) floating-point elements |
| _mm512_max_pd, _ps | Packed floating-point maximum |
| _mm512_min_pd, _ps | Packed floating-point minimum |
| _mm512_permutexvar_pd, _ps | Permute floating-point elements |
| _mm512_reduce_add_pd, _ps | Reduce (sum) floating-point elements |
| _mm512_set1_pd, _ps | Broadcast floating-point value to all elements |
| _mm512_setzero_pd, _ps | Set floating-point elements to zero |
| _mm512_sqrt_pd, _ps | Packed floating-point square root |
| _mm512_store_pd, _ps | Store (aligned) floating-point elements |
| _mm512_storeu_pd, _ps | Store (unaligned) floating-point elements |
| _mm512_stream_pd, _ps | Store (nontemporal) floating-point elements |
| _mm512_sub_pd, _ps | Packed floating-point subtraction |
| _mm_stream_pd, _ps | Store (nontemporal) floating-point elements |