© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
D. KusswurmModern Parallel Programming with C++ and Assembly Languagehttps://doi.org/10.1007/978-1-4842-7918-2_6

6. AVX2 C++ Programming: Part 3

Daniel Kusswurm1  
(1)
Geneva, IL, USA
 

The previous chapter demonstrated how to use a variety of C++ SIMD intrinsic functions to implement algorithms that performed calculations using floating-point arrays and matrices. In this chapter, you will learn how to implement some common signal processing and image processing methods using SIMD arithmetic. The chapter begins with a brief primer section that discusses signal processing fundamentals and the mathematics of convolutions. This is followed by a section that explains how to implement a 1D discrete convolution using C++ SIMD intrinsic functions. Chapter 6 concludes with an exploration of image processing algorithms that exploit 2D discrete convolutions and SIMD techniques.

Chapter 6 accentuates a specific application domain where appropriate use of AVX2 and SIMD programming techniques significantly accelerates algorithm performance. The topics and source code examples presented in this chapter are slightly more specialized and perhaps a bit more complicated than those discussed in previous chapters. If your SIMD programming interests reside elsewhere, you can either skim this chapter or skip ahead to the next one.

Convolution Primer

A convolution is a mathematical operation that blends an input signal with a response signal to produce an output signal. Convolutions are used extensively in a wide variety of scientific and engineering applications. Many signal processing and image processing methods are founded on convolution theory. In this section, you will learn the essentials of convolution mathematics. The purpose of this section is to provide just enough background math to understand this chapter’s source code examples. Numerous books have been published that explain convolution, signal processing, and image processing theory in significantly greater detail. Appendix B contains a list of references that you can consult for additional information regarding these topics.

Convolution Math: 1D

The 1D convolution of an input signal x and a response signal g is defined as follows:
$$ y(t)=underset{-infty }{overset{infty }{int }}xleft(t-	au 
ight)gleft(	au 
ight) d	au $$

where y represents the output signal. The notation x ∗ g is commonly used to denote the convolution of signals (or functions) x and g.

In computer software, an array of sampled data points is frequently employed to represent the input, response, and output signals. A 1D discrete convolution can be calculated using the following equation:
$$ yleft[i
ight]=sum limits_{k=-M}^Mxleft[i-k
ight]gleft[k
ight] $$

where i = 0, 1, ⋯, N − 1 and M = floor(Ng/2). In the preceding equations, N denotes the number of elements in the input and output signals, and Ng symbolizes the number of elements in the response array. The ensuing discussions and source code examples in this chapter assume that Ng is an odd integer greater than or equal to three. If you examine the 1D discrete convolution equation carefully, you will notice that each element of the output signal array y is computed using a straightforward sum-of-products calculation that encompasses the input signal x and the response signal g. These types of calculations are easy to implement using FMA arithmetic as you have already seen.

In digital signal processing, many applications use smoothing operators to reduce the amount of noise present in a raw data signal. For example, the top plot of Figure 6-1 shows a raw data signal that contains a fair amount of noise. The bottom plot of Figure 6-1 shows the same signal following the application of a smoothing operator. In this example, the smoothing operator convolved the original raw signal with a set of discrete coefficients that approximate a low-pass (or Gaussian) filter. These coefficients correspond to the response signal array g that is included in the discrete 1D convolution equation. The response signal array is often called a convolution kernel or convolution mask. This book uses the term convolution kernel to avoid any potential confusion with SIMD masks.
Figure 6-1

Raw data signal (top plot) and its smoothed counterpart (bottom plot)

The 1D discrete convolution equation can be implemented in source code using a couple of nested for-loops. During each outer loop iteration, the convolution kernel center point g[0] is superimposed over the current input signal array element x[i]. The inner for-loop calculates the intermediate products as shown in Figure 6-2. These intermediate products are then summed and saved to output signal array element y[i].
Figure 6-2

Calculation of 1D discrete convolution output signal element

Convolution Math: 2D

The 2D convolution of input signal x, response signal g, and output signal y is defined as follows:
$$ yleft(u,v
ight)=underset{-infty }{overset{infty }{int }}underset{-infty }{overset{infty }{int }}xleft(u-{	au}_1,v-{	au}_2
ight) gleft({	au}_1,{	au}_2
ight)d{	au}_1d{	au}_2 $$
A 2D discrete convolution with a square dimensioned response signal can be calculated using the following equation:
$$ yleft[i,j
ight]=sum limits_{k_1=-M}^Msum limits_{k_2=-M}^Mxleft[i-{k}_1,j-{k}_2
ight] gleft[{k}_1,{k}_2
ight] $$
where i = 0, 1, ⋯, Nrows − 1, j = 0, 1, ⋯, Ncols − 1, and M = floor(Ng/2). In the preceding equations, Ng symbolizes the height and width of the response signal (or convolution kernel) and is assumed to be an odd integer greater than or equal to three. A matrix of sampled data points is often utilized to represent the input, response, and output signals of a 2D discrete convolution. Figure 6-3 shows two images: the left image is the original grayscale image, while the right image was generated using a 2D discrete convolution kernel that performs low-pass filtering.
Figure 6-3

Application of a low-pass filter using a 2D discrete convolution

The 2D discrete convolution equation can be implemented in source code using four nested for-loops. During each iteration of the inner-most for-loop, the convolution kernel center point g[0][0] is superimposed over the current input signal array element x[i][j] as shown in Figure 6-4. This figure also shows the equations need to calculate y[i][j].
Figure 6-4

Calculation of a 2D discrete convolution output signal element

Direct implementation of the 2D discrete convolution equation is computationally expensive, even when using SIMD arithmetic. One reason for this is that the number of required arithmetic operations increases exponentially as the size of input signal gets larger. The computational costs of a 2D discrete convolution can be significantly reduced if the 2D convolution kernel is separable as shown in Figure 6-5. In this case, a 2D discrete convolution can be carried out using two independent 1D discrete convolutions.
Figure 6-5

Example of a separable 2D convolution kernel and its corresponding 1D convolution kernels

When using a separable 2D convolution kernel, the original 2D discrete convolution equation transforms into the following:
$$ yleft[i,j
ight]=sum limits_{k_1=-M}^Msum limits_{k_2=-M}^Mxleft[i-{k}_1,j-{k}_2
ight] {g}_1left[{k}_1
ight] {g}_2left[{k}_2
ight] $$
Rearranging these terms yields the following equation:
$$ yleft[i,j
ight]=sum limits_{k_1=-M}^M{g}_1left[{k}_1
ight]left[sum limits_{k_2=-M}^Mxleft[i-{k}_1,j-{k}_2
ight] {g}_2left[{k}_2
ight]
ight] $$

1D Convolutions

Listing 6-1 shows the source code for example Ch06_01. This example illustrates how to use C++ SIMD intrinsic functions to perform a 1D discrete convolution using single-precision floating-point values.
//------------------------------------------------
//               Ch06_01.h
//------------------------------------------------
#pragma once
#include <vector>
// Ch06_01_fcpp.cpp
extern void Convolve1D_F32_Cpp(std::vector<float>& y,
    const std::vector<float>& x, const std::vector<float>& kernel);
extern void Convolve1D_F32_Iavx2(std::vector<float>& y,
    const std::vector<float>& x, const std::vector<float>& kernel);
extern void Convolve1DKs5_F32_Iavx2(std::vector<float>& y,
    const std::vector<float>& x, const std::vector<float>& kernel);
// Ch06_01_misc.cpp
extern bool CheckArgs(std::vector<float>& y,
    const std::vector<float>& x, const std::vector<float>& kernel);
// Ch06_01_bm.cpp
extern void Convolve1D_F32_bm(void);
// Miscellaneous constants
const unsigned int c_RngSeed = 97;
//------------------------------------------------
//               Ch06_01_misc.cpp
//------------------------------------------------
#include "Ch06_01.h"
bool CheckArgs(std::vector<float>& y, const std::vector<float>& x,
    const std::vector<float>& kernel)
{
    if ((kernel.size() & 1) == 0)
        return false;
    if (y.size() != x.size())
        return false;
    if (y.size() < kernel.size())
        return false;
    return true;
}
//------------------------------------------------
//               Ch06_01.cpp
//------------------------------------------------
#include <iostream>
#include <iomanip>
#include <stdexcept>
#include "Ch06_01.h"
#include "MT_Convolve.h"
static void Convolve1D_F32(void);
int main()
{
    try
    {
        Convolve1D_F32();
        Convolve1D_F32_bm();
    }
    catch (std::exception& ex)
    {
        std::cout << "Ch06_01 exception: " << ex.what() << ' ';
    }
    return 0;
}
static void Convolve1D_F32(void)
{
    const char nl = ' ';
    const size_t num_pts = 79;
    const char* bn_results = "Ch06_01_Convolve1D_F32_Results";
    const std::vector<float> kernel { 0.0625f, 0.25f, 0.375f, 0.25f, 0.0625f };
    // Create input and output signal arrays
    std::cout << "Executing Convolve1D_F32()" << nl;
    std::vector<float> x(num_pts);
    GenSignal1D(x, c_RngSeed);
    std::vector<float> y1(num_pts);
    std::vector<float> y2(num_pts);
    std::vector<float> y3(num_pts);
    // Perform 1D convolutions
    Convolve1D_F32_Cpp(y1, x, kernel);
    Convolve1D_F32_Iavx2(y2, x, kernel);
    Convolve1DKs5_F32_Iavx2(y3, x, kernel);
    // Save results
    std::vector<std::vector<float>*> signal_vectors { &x, &y1, &y2, &y3 };
    std::vector<std::string> titles { "x", "y1", "y2", "y3"};
    std::string results_fn = SaveResults1D(bn_results, signal_vectors, titles);
    std::cout << "Results saved to file " << results_fn << nl;
}
//------------------------------------------------
//               Ch06_01_fcpp.cpp
//------------------------------------------------
#include <stdexcept>
#include <immintrin.h>
#include "Ch06_01.h"
#include "MiscTypes.h"
void Convolve1D_F32_Cpp(std::vector<float>& y, const std::vector<float>& x, const std::vector<float>& kernel)
{
    if (!CheckArgs(y, x, kernel))
        throw std::runtime_error("Convolve1D_F32_Cpp() - CheckArgs failed");
    indx_t num_pts = (indx_t)y.size();
    indx_t ks2 = kernel.size() / 2;
    for (indx_t i = ks2; i < num_pts - ks2; i++)
    {
        float y_val = 0;
        for (indx_t k = -ks2; k <= ks2; k++)
            y_val += x[i - k] * kernel[k + ks2];
        y[i] = y_val;
    }
}
void Convolve1D_F32_Iavx2(std::vector<float>& y, const std::vector<float>& x, const std::vector<float>& kernel)
{
    if (!CheckArgs(y, x, kernel))
        throw std::runtime_error("Convolve1D_F32_Iavx2() - CheckArgs failed");
    indx_t ks2 = (indx_t)kernel.size() / 2;
    indx_t num_pts = (indx_t)y.size();
    const indx_t num_simd_elements = 8;
    const indx_t num_simd_elements2 = 4;
    indx_t i = ks2;
    while (i < num_pts - ks2)
    {
        if ((i + num_simd_elements) <= num_pts - ks2)
        {
             __m256 y_vals = _mm256_setzero_ps();
            // Calculate y[i:i+7]
            for (indx_t k = -ks2; k <= ks2; k++)
            {
                __m256 x_vals = _mm256_loadu_ps(&x[i - k]);
                __m256 kernel_vals = _mm256_set1_ps(kernel[k + ks2]);
                y_vals = _mm256_fmadd_ps(x_vals, kernel_vals, y_vals);
            }
            _mm256_storeu_ps(&y[i], y_vals);
            i += num_simd_elements;
        }
        else if ((i + num_simd_elements2) <= num_pts - ks2)
        {
             __m128 y_vals = _mm_setzero_ps();
            // Calculate y[i:i+3]
            for (indx_t k = -ks2; k <= ks2; k++)
            {
                __m128 x_vals = _mm_loadu_ps(&x[i - k]);
                __m128 kernel_vals = _mm_set1_ps(kernel[k + ks2]);
                y_vals = _mm_fmadd_ps(x_vals, kernel_vals, y_vals);
            }
            _mm_storeu_ps(&y[i], y_vals);
            i += num_simd_elements2;
        }
        else
        {
             __m128 y_val = _mm_setzero_ps();
            // Calculate y[i]
            for (indx_t k = -ks2; k <= ks2; k++)
            {
                __m128 x_val = _mm_load_ss(&x[i - k]);
                __m128 k_val = _mm_load_ss(&kernel[k + ks2]);
                y_val = _mm_fmadd_ss(x_val, k_val, y_val);
            }
            _mm_store_ss(&y[i], y_val);
            i += 1;
        }
    }
}
void Convolve1DKs5_F32_Iavx2(std::vector<float>& y, const std::vector<float>& x, const std::vector<float>& kernel)
{
    if (!CheckArgs(y, x, kernel))
        throw std::runtime_error("Convolve1DKs5_F32_Iavx2() - CheckArgs failed");
    if (kernel.size() != 5)
        throw std::runtime_error("Convolve1DKs5_F32_Iavx2() - invalid kernel size");
    const indx_t ks2 = 2;
    indx_t num_pts = (indx_t)y.size();
    const indx_t num_simd_elements = 8;
    const indx_t num_simd_elements2 = 4;
    __m256 kernel256_0 = _mm256_set1_ps(kernel[0]);
    __m256 kernel256_1 = _mm256_set1_ps(kernel[1]);
    __m256 kernel256_2 = _mm256_set1_ps(kernel[2]);
    __m256 kernel256_3 = _mm256_set1_ps(kernel[3]);
    __m256 kernel256_4 = _mm256_set1_ps(kernel[4]);
    __m128 kernel128_0 = _mm_set1_ps(kernel[0]);
    __m128 kernel128_1 = _mm_set1_ps(kernel[1]);
    __m128 kernel128_2 = _mm_set1_ps(kernel[2]);
    __m128 kernel128_3 = _mm_set1_ps(kernel[3]);
    __m128 kernel128_4 = _mm_set1_ps(kernel[4]);
    indx_t i = ks2;
    while (i < num_pts - ks2)
    {
        indx_t j = i + ks2;
        if ((i + num_simd_elements) <= num_pts - ks2)
        {
            // Calculate y[i:i+7]
            __m256 x_vals = _mm256_loadu_ps(&x[j]);
            __m256 y_vals = _mm256_mul_ps(x_vals, kernel256_0);
            x_vals = _mm256_loadu_ps(&x[j - 1]);
            y_vals = _mm256_fmadd_ps(x_vals, kernel256_1, y_vals);
            x_vals = _mm256_loadu_ps(&x[j - 2]);
            y_vals = _mm256_fmadd_ps(x_vals, kernel256_2, y_vals);
            x_vals = _mm256_loadu_ps(&x[j - 3]);
            y_vals = _mm256_fmadd_ps(x_vals, kernel256_3, y_vals);
            x_vals = _mm256_loadu_ps(&x[j - 4]);
            y_vals = _mm256_fmadd_ps(x_vals, kernel256_4, y_vals);
            _mm256_storeu_ps(&y[i], y_vals);
            i += num_simd_elements;
        }
        else if ((i + num_simd_elements2) <= num_pts - ks2)
        {
            // Calculate y[i:i+3]
            __m128 x_vals = _mm_loadu_ps(&x[j]);
            __m128 y_vals = _mm_mul_ps(x_vals, kernel128_0);
            x_vals = _mm_loadu_ps(&x[j - 1]);
            y_vals = _mm_fmadd_ps(x_vals, kernel128_1, y_vals);
            x_vals = _mm_loadu_ps(&x[j - 2]);
            y_vals = _mm_fmadd_ps(x_vals, kernel128_2, y_vals);
            x_vals = _mm_loadu_ps(&x[j - 3]);
            y_vals = _mm_fmadd_ps(x_vals, kernel128_3, y_vals);
            x_vals = _mm_loadu_ps(&x[j - 4]);
            y_vals = _mm_fmadd_ps(x_vals, kernel128_4, y_vals);
            _mm_storeu_ps(&y[i], y_vals);
            i += num_simd_elements2;
        }
        else
        {
            // Calculate y[i]
            __m128 x_val = _mm_load_ss(&x[j]);
            __m128 y_val = _mm_mul_ss(x_val, kernel128_0);
            x_val = _mm_load_ss(&x[j - 1]);
            y_val = _mm_fmadd_ss(x_val, kernel128_1, y_val);
            x_val = _mm_load_ss(&x[j - 2]);
            y_val = _mm_fmadd_ss(x_val, kernel128_2, y_val);
            x_val = _mm_load_ss(&x[j - 3]);
            y_val = _mm_fmadd_ss(x_val, kernel128_3, y_val);
            x_val = _mm_load_ss(&x[j - 4]);
            y_val = _mm_fmadd_ss(x_val, kernel128_4, y_val);
            _mm_store_ss(&y[i], y_val);
            i += 1;
        }
    }
}
Listing 6-1

Example Ch06_01

Listing 6-1 begins with the file Ch06_01.h, which includes the function declarations for this example. Note that the function declarations use type std::vector<float> for the various signal arrays. The next file in Listing 6-1 is Ch06_01.cpp and includes the function Convolve1D_F32() . This function uses a function template named GenSignal1D() to generate a synthetic input signal. The source code for GenSignal1D() is not shown in Listing 6-1 but included in the download software package. The variable std::vector<float> kernel contains response signal coefficients that implement a simple low-pass filter. Function Convolve1D_F32() also invokes the 1D discrete convolution functions Convolve1D_F32_Cpp(), Convolve1D_F32_Iavx2(), and Convolve1DKs5_F32_Iavx2(). Following execution of these functions, Convolve1D_F32() saves the signal array data to the specified file.

The first function in file Ch06_01_fcpp.cpp is named Convolve1D_F32_Cpp(). This function is a direct implementation of the 1D discrete convolution equation using standard C++ statements. Note that the for-loop index variables in Convolve1D_F32_Cpp() are declared using the data type indx_t , which is defined in MiscTypes.h. Type indx_t is a signed integer whose size (32 or 64 bits) is platform dependent. Signed integer index variables are employed in this chapter’s convolution calculating functions since this corresponds to the summation indices used in the discrete convolution equations.

Function Convolve1D_F32_Iavx2() performs a 1D discrete convolution using C++ SIMD intrinsic functions. Recall that the C++ SIMD data type __m256 can hold eight single-precision floating-point values. This type facilitates a SIMD implementation of the convolution algorithm that can process eight input signal points simultaneously. Figure 6-6 contains two graphics that illustrate a five-element convolution kernel along with an arbitrary segment of an input signal. Below the graphics are the equations that calculate output signal elements y[i:i+7]. These equations are simple expansion of the 1D discrete convolution equation that you saw earlier in this chapter. Note that each column of the SIMD convolution equation set includes a single kernel value and eight consecutive elements from the input signal array. These equations can be implemented in code using simple SIMD FMA arithmetic.
Figure 6-6

SIMD 1D discrete convolution equations for a five-element convolution kernel

When performing a 1D discrete convolution using SIMD arithmetic, function Convolve1D_F32_Iavx2() must ensure that it does not attempt to access any elements beyond the boundaries of the input and output signal arrays. The first if code block in the while-loop verifies that a sufficient number of input signal elements are available to calculate output signal elements y[i:i+7]. If enough elements are available, the for-loop performs its calculations using the equations shown in Figure 6-6. The second if code block handles the case when there are enough input signal elements to calculate x[i:i+3]. The final else code block uses scalar arithmetic to process any residual signal elements. Note that all three code blocks in the while-loop perform unaligned load and store operations since it is impossible to guarantee that any set of consecutive input or output signal array elements will be aligned on a particular boundary.

The function Convolve1D_F32_Iavx2() carries out its calculations using a convolution kernel that can vary in size. Many real-world signal processing libraries often include convolution functions that are optimized for specific kernel sizes. Size-optimized convolution functions are often faster than their variable-width counterparts as you will soon see. The final function in Listing 6-1, Convolve1DKs5_F32_Iavx2(), is optimized for convolution kernels that contain five elements. The while and if code block arrangement of this function closely resembles the function Convolve1D_F32_Iavx2(). Note that in function Convolve1DKs5_F32_Iavx2(), the inner for-loops have been replaced with discrete calls to the appropriate C++ SIMD intrinsic functions. In this example, elimination of the inner for-loops yields results in faster code since there is no overhead code for the for-loop. Here are the results for source code example Ch06_01:
Executing Convolve1D_F32()
Results saved to file Ch06_01_Convolve1D_F32_Results_OXYGEN4.csv
Running benchmark function Convolve1D_F32_bm - please wait
Benchmark times saved to file Ch06_01_Convolve1D_F32_bm_OXYGEN4.csv
Table 6-1 shows some benchmark timing measurements for source code example Ch06_01. These measurements were made using a 1,000,000-element input signal array and a five-element convolution kernel.
Table 6-1

1D Discrete Convolution (Single-Precision) Execution Times (Microseconds)

CPU

Convolve1D_F32_Cpp()

Convolve1D_F32_Iavx2()

Convolve1DKs5_F32_Iavx2()

Intel Core i7-8700K

3240

411

278

Intel Core i5-11600K

2267

325

221

Listing 6-2 shows the C++ calculating code for source code example Ch06_02. This example is the double-precision counterpart of example Ch06_01.
//------------------------------------------------
//               Ch06_02_fcpp.cpp
//------------------------------------------------
#include <stdexcept>
#include <immintrin.h>
#include "Ch06_02.h"
#include "MiscTypes.h"
void Convolve1D_F64_Cpp(std::vector<double>& y, const std::vector<double>& x,
    const std::vector<double>& kernel)
{
    if (!CheckArgs(y, x, kernel))
        throw std::runtime_error("Convolve1D_F64_Cpp() - CheckArgs failed");
    indx_t num_pts = (indx_t)y.size();
    indx_t ks2 = kernel.size() / 2;
    for (indx_t i = ks2; i < num_pts - ks2; i++)
    {
        double y_val = 0;
        for (indx_t k = -ks2; k <= ks2; k++)
            y_val += x[i - k] * kernel[k + ks2];
        y[i] = y_val;
    }
}
void Convolve1D_F64_Iavx2(std::vector<double>& y, const std::vector<double>& x,     const std::vector<double>& kernel)
{
    if (!CheckArgs(y, x, kernel))
        throw std::runtime_error("Convolve1D_F64_Iavx2() - CheckArgs failed");
    indx_t ks2 = (indx_t)kernel.size() / 2;
    indx_t num_pts = (indx_t)y.size();
    const indx_t num_simd_elements = 4;
    const indx_t num_simd_elements2 = 2;
    indx_t i = ks2;
    while (i < num_pts - ks2)
    {
        if ((i + num_simd_elements) <= num_pts - ks2)
        {
             __m256d y_vals = _mm256_setzero_pd();
            // Calculate y[i:i+3]
            for (indx_t k = -ks2; k <= ks2; k++)
            {
                __m256d x_vals = _mm256_loadu_pd(&x[i - k]);
                __m256d kernel_vals = _mm256_broadcast_sd(&kernel[k + ks2]);
                y_vals = _mm256_fmadd_pd(x_vals, kernel_vals, y_vals);
            }
            _mm256_storeu_pd(&y[i], y_vals);
            i += num_simd_elements;
        }
        else if ((i + num_simd_elements2) <= num_pts - ks2)
        {
             __m128d y_vals = _mm_setzero_pd();
            // Calculate y[i:i+1]
            for (indx_t k = -ks2; k <= ks2; k++)
            {
                __m128d x_vals = _mm_loadu_pd(&x[i - k]);
                __m128d kernel_vals = _mm_set1_pd(kernel[k + ks2]);
                y_vals = _mm_fmadd_pd(x_vals, kernel_vals, y_vals);
            }
            _mm_storeu_pd(&y[i], y_vals);
            i += num_simd_elements2;
        }
        else
        {
             __m128d y_val = _mm_setzero_pd();
            // Calculate y[i]
            for (indx_t k = -ks2; k <= ks2; k++)
            {
                __m128d x_val = _mm_load_sd(&x[i - k]);
                __m128d k_val = _mm_load_sd(&kernel[k + ks2]);
                y_val = _mm_fmadd_sd(x_val, k_val, y_val);
            }
            _mm_store_sd(&y[i], y_val);
            i += 1;
        }
    }
}
void Convolve1DKs5_F64_Iavx2(std::vector<double>& y, const std::vector<double>& x,     const std::vector<double>& kernel)
{
    if (!CheckArgs(y, x, kernel))
        throw std::runtime_error("Convolve1DKs5_F64_Iavx2() - CheckArgs failed");
    if (kernel.size() != 5)
        throw std::runtime_error("Convolve1DKs5_F64_Iavx2() - invalid kernel size");
    const indx_t ks2 = 2;
    indx_t num_pts = (indx_t)y.size();
    const indx_t num_simd_elements = 4;
    const indx_t num_simd_elements2 = 2;
    __m256d kernel256_0 = _mm256_set1_pd(kernel[0]);
    __m256d kernel256_1 = _mm256_set1_pd(kernel[1]);
    __m256d kernel256_2 = _mm256_set1_pd(kernel[2]);
    __m256d kernel256_3 = _mm256_set1_pd(kernel[3]);
    __m256d kernel256_4 = _mm256_set1_pd(kernel[4]);
    __m128d kernel128_0 = _mm_set1_pd(kernel[0]);
    __m128d kernel128_1 = _mm_set1_pd(kernel[1]);
    __m128d kernel128_2 = _mm_set1_pd(kernel[2]);
    __m128d kernel128_3 = _mm_set1_pd(kernel[3]);
    __m128d kernel128_4 = _mm_set1_pd(kernel[4]);
    indx_t i = ks2;
    while (i < num_pts - ks2)
    {
        indx_t j = i + ks2;
        if ((i + num_simd_elements) <= num_pts - ks2)
        {
            // Calculate y[i:i+3]
            __m256d x_vals = _mm256_loadu_pd(&x[j]);
            __m256d y_vals = _mm256_mul_pd(x_vals, kernel256_0);
            x_vals = _mm256_loadu_pd(&x[j - 1]);
            y_vals = _mm256_fmadd_pd(x_vals, kernel256_1, y_vals);
            x_vals = _mm256_loadu_pd(&x[j - 2]);
            y_vals = _mm256_fmadd_pd(x_vals, kernel256_2, y_vals);
            x_vals = _mm256_loadu_pd(&x[j - 3]);
            y_vals = _mm256_fmadd_pd(x_vals, kernel256_3, y_vals);
            x_vals = _mm256_loadu_pd(&x[j - 4]);
            y_vals = _mm256_fmadd_pd(x_vals, kernel256_4, y_vals);
            _mm256_storeu_pd(&y[i], y_vals);
            i += num_simd_elements;
        }
        else if ((i + num_simd_elements2) <= num_pts - ks2)
        {
            // Calculate y[i:i+1]
            __m128d x_vals = _mm_loadu_pd(&x[j]);
            __m128d y_vals = _mm_mul_pd(x_vals, kernel128_0);
            x_vals = _mm_loadu_pd(&x[j - 1]);
            y_vals = _mm_fmadd_pd(x_vals, kernel128_1, y_vals);
            x_vals = _mm_loadu_pd(&x[j - 2]);
            y_vals = _mm_fmadd_pd(x_vals, kernel128_2, y_vals);
            x_vals = _mm_loadu_pd(&x[j - 3]);
            y_vals = _mm_fmadd_pd(x_vals, kernel128_3, y_vals);
            x_vals = _mm_loadu_pd(&x[j - 4]);
            y_vals = _mm_fmadd_pd(x_vals, kernel128_4, y_vals);
            _mm_storeu_pd(&y[i], y_vals);
            i += num_simd_elements2;
        }
        else
        {
            // Calculate y[i]
            __m128d x_val = _mm_load_sd(&x[j]);
            __m128d y_val = _mm_mul_sd(x_val, kernel128_0);
            x_val = _mm_load_sd(&x[j - 1]);
            y_val = _mm_fmadd_sd(x_val, kernel128_1, y_val);
            x_val = _mm_load_sd(&x[j - 2]);
            y_val = _mm_fmadd_sd(x_val, kernel128_2, y_val);
            x_val = _mm_load_sd(&x[j - 3]);
            y_val = _mm_fmadd_sd(x_val, kernel128_3, y_val);
            x_val = _mm_load_sd(&x[j - 4]);
            y_val = _mm_fmadd_sd(x_val, kernel128_4, y_val);
            _mm_store_sd(&y[i], y_val);
            i += 1;
        }
    }
}
Listing 6-2

Example Ch06_02

Like many of the source examples presented in earlier chapters, the changes required to transform the single-precision SIMD convolutions functions into double-precision variants are relatively straightforward. In example Ch06_02, the functions Convolve1D_F64_Iavx2() and Convolve1DKs5_F64_Iavx2() use double-precision versions of the C++ SIMD intrinsic functions. Also note that the values for num_simd_elements and num_simd_elements2 have been cut in half to accommodate SIMD arithmetic using double-precision floating-point elements.

The results for source code example Ch06_02 are equivalent to the results that were obtained for source code example Ch06_01. Table 6-2 shows some benchmark timing measurements for source code example Ch06_02. These measurements were made using a 1,000,000-element input signal array and a 5-element convolution kernel.
Table 6-2

1D Discrete Convolution (Double-Precision) Execution Times (Microseconds)

CPU

Convolve1D_F64_Cpp()

Convolve1D_F64_Iavx2()

Convolve1DKs5_F64_Iavx2()

Intel Core i7-8700K

3383

956

660

Intel Core i5-11600K

2280

703

503

2D Convolutions

In this section, you will study source code examples that carry out 2D discrete convolutions. The first source code example demonstrates a 2D discrete convolution. The code in this example is suitable when performing a 2D discrete convolution using a nonseparable convolution kernel. The second example illustrates a 2D discrete convolution using two 1D discrete convolutions. The code in this example highlights the performance advantages of using dual 1D convolution kernels instead of a single 2D convolution kernel when the 2D kernel is separable.

Nonseparable Kernel

Listing 6-3 shows the source code for example Ch06_03. This example demonstrates a 2D discrete convolution using a nonseparable 2D convolution kernel.
//------------------------------------------------
//               Ch06_03.h
//------------------------------------------------
#pragma once
#include <array>
#include <vector>
struct CD_2D
{
    size_t m_ImH = 0;
    size_t m_ImW = 0;
    size_t m_KernelSize = 0;
    std::vector<float> m_ImSrc;
    std::vector<float> m_ImDes;
    std::vector<float> m_Kernel2D;
};
enum class KERNEL_ID : unsigned int
{
    LowPass2D_3x3, LowPass2D_5x5, LowPass2D_7x7, LowPass2D_9x9, LowPass2D_15x15
};
// Ch06_03_fcpp.cpp
extern void Convolve2D_F32_Cpp(CD_2D& cd);
extern void Convolve2D_F32_Iavx2(CD_2D& cd);
// Ch06_03_misc.cpp
extern bool CheckArgs2D(const CD_2D& cd);
extern void Init2D(std::array<CD_2D, 2>& cd, const char* fn, KERNEL_ID id);
// Ch06_03_misc2.cpp
extern void DisplayKernel2D(float sigma, size_t ks);
extern void GetKernel2D(CD_2D& cd, KERNEL_ID id);
// Ch06_03_bm.cpp
extern void Convolve2D_F32_bm(void);
// Miscellaneous constants
const KERNEL_ID c_KernelID = KERNEL_ID::LowPass2D_15x15;
const KERNEL_ID c_KernelID_BM = KERNEL_ID::LowPass2D_9x9;
//------------------------------------------------
//               Ch06_03_misc.cpp
//------------------------------------------------
#include <iostream>
#include <iomanip>
#include <array>
#include <vector>
#include <stdexcept>
#include "Ch06_03.h"
#include "MT_Convolve.h"
#include "ImageMatrix.h"
bool CheckArgs2D(const CD_2D& cd)
{
    size_t im_src_size = cd.m_ImSrc.size();
    if (im_src_size != cd.m_ImDes.size())
        return false;
    if (im_src_size != cd.m_ImH * cd.m_ImW)
        return false;
    size_t ks = cd.m_KernelSize;
    if ((ks < 3) || ((ks & 0x1) == 0))
        return false;
    if (cd.m_Kernel2D.size() != ks * ks)
        return false;
    return true;
}
void Init2D(std::array<CD_2D, 2>& cd, const char* fn, KERNEL_ID id)
{
    GetKernel2D(cd[0], id);
    GetKernel2D(cd[1], id);
    ImageMatrix im_src0;
    if (fn != nullptr)
    {
        ImageMatrix im_tmp(fn, PixelType::Gray8);
        im_src0 = im_tmp;
    }
    else
    {
        ImageMatrix im_tmp(101, 103, PixelType::Gray8); // test image
        im_tmp.FillRandom(0, 255, 1003);
        im_src0 = im_tmp;
    }
    ImageMatrix im_src1(im_src0);
    im_src0.FillBorder<uint8_t>(cd[0].m_KernelSize / 2, (uint8_t)0);
    cd[0].m_ImH = im_src0.GetHeight();
    cd[0].m_ImW = im_src0.GetWidth();
    cd[0].m_ImSrc = im_src0.ToVector<float>();
    cd[0].m_ImDes.resize(cd[0].m_ImSrc.size());
    std::fill(cd[0].m_ImDes.begin(), cd[0].m_ImDes.end(), 0.0f);
    im_src1.FillBorder<uint8_t>(cd[1].m_KernelSize / 2, (uint8_t)0);
    cd[1].m_ImH = im_src1.GetHeight();
    cd[1].m_ImW = im_src1.GetWidth();
    cd[1].m_ImSrc = im_src1.ToVector<float>();
    cd[1].m_ImDes.resize(cd[1].m_ImSrc.size());
    std::fill(cd[1].m_ImDes.begin(), cd[1].m_ImDes.end(), 0.0f);
}
//------------------------------------------------
//               Ch06_03.cpp
//------------------------------------------------
#include <iostream>
#include <iomanip>
#include <stdexcept>
#include "Ch06_03.h"
#include "ImageMatrix.h"
static void Convolve2D_F32(void);
int main()
{
    try
    {
        Convolve2D_F32();
        Convolve2D_F32_bm();
    }
    catch (std::exception& ex)
    {
        std::cout << "Ch06_03 exception: " << ex.what() << ' ';
    }
    return 0;
}
static void Convolve2D_F32(void)
{
    const char nl = ' ';
    const char* fn_src = "../../Data/ImageE.png";
    const char* fn_des0 = "Ch06_03_ImageE_Conv2D_0.png";
    const char* fn_des1 = "Ch06_03_ImageE_Conv2D_1.png";
    ImageFileType ift_des = ImageFileType::PNG;
    // Initialize convolution data structures
    std::array<CD_2D, 2> cd;
    Init2D(cd, fn_src, c_KernelID);
    // Perform convolutions
    std::cout << "Performing convolutions ";
    Convolve2D_F32_Cpp(cd[0]);
    Convolve2D_F32_Iavx2(cd[1]);
    // Save destination image files
    std::cout << "Saving destination image files ";
    int h0 = (int)cd[0].m_ImH, w0 = (int)cd[0].m_ImW;
    int h1 = (int)cd[1].m_ImH, w1 = (int)cd[1].m_ImW;
    ImageMatrix im_des0 = ImageMatrix::ToImage(cd[0].m_ImDes, h0, w0, PixelType::Gray8);
    ImageMatrix im_des1 = ImageMatrix::ToImage(cd[1].m_ImDes, h1, w1, PixelType::Gray8);
    im_des0.SaveImage(fn_des0, ift_des);
    im_des1.SaveImage(fn_des1, ift_des);
    // Make sure images are alike.
    size_t num_diff;
    const uint8_t max_d = 1;
    bool rc = ImageMatrix::AreAlike(im_des0, im_des1, max_d, &num_diff);
    std::cout << "rc:       " << std::boolalpha << rc << nl;
    std::cout << "num_diff: " << num_diff << nl;
}
//------------------------------------------------
//               Ch06_03_fcpp.cpp
//------------------------------------------------
#include <stdexcept>
#include <immintrin.h>
#include "Ch06_03.h"
#include "MiscTypes.h"
void Convolve2D_F32_Cpp(CD_2D& cd)
{
    if (!CheckArgs2D(cd))
        throw std::runtime_error("Convolve2D_F32_Cpp() - CheckArgs failed");
    indx_t ks = (indx_t)cd.m_KernelSize;
    indx_t ks2 = ks / 2;
    indx_t im_h = (indx_t)cd.m_ImH;
    indx_t im_w = (indx_t)cd.m_ImW;
    const std::vector<float>& im_src = cd.m_ImSrc;
    std::vector<float>& im_des = cd.m_ImDes;
    std::vector<float>& im_ker = cd.m_Kernel2D;
    for (indx_t i = ks2; i < im_h - ks2; i++)
    {
        for (indx_t j = ks2; j < im_w - ks2; j++)
        {
            float im_des_val = 0;
            for (indx_t k1 = -ks2; k1 <= ks2; k1++)
            {
                for (indx_t k2 = -ks2; k2 <= ks2; k2++)
                {
                    float im_src_val = im_src[(i - k1) * im_w + j - k2];
                    float im_ker_val = im_ker[(k1 + ks2) * ks + k2 + ks2];
                    im_des_val += im_src_val * im_ker_val;
                }
            }
            im_des[i * im_w + j] = im_des_val;
        }
    }
}
void Convolve2D_F32_Iavx2(CD_2D& cd)
{
    if (!CheckArgs2D(cd))
        throw std::runtime_error("Convolve2D_F32_Iavx2() - CheckArgs failed");
    indx_t ks = (indx_t)cd.m_KernelSize;
    indx_t ks2 = ks / 2;
    indx_t im_h = (indx_t)cd.m_ImH;
    indx_t im_w = (indx_t)cd.m_ImW;
    const std::vector<float>& im_src = cd.m_ImSrc;
    std::vector<float>& im_des = cd.m_ImDes;
    std::vector<float>& im_ker = cd.m_Kernel2D;
    const indx_t num_simd_elements = 8;
    const indx_t num_simd_elements2 = 4;
    for (indx_t i = ks2; i < im_h - ks2; i++)
    {
        indx_t j = ks2;
        while (j < im_w - ks2)
        {
            if (j + num_simd_elements <= im_w - ks2)
            {
                __m256 im_des_vals = _mm256_setzero_ps();
                for (indx_t k1 = -ks2; k1 <= ks2; k1++)
                {
                    for (indx_t k2 = -ks2; k2 <= ks2; k2++)
                    {
                        indx_t i_src = (i - k1) * im_w + j - k2;
                        indx_t i_ker = (k1 + ks2) * ks + k2 + ks2;
                        __m256 im_src_vals = _mm256_loadu_ps(&im_src[i_src]);
                        __m256 im_ker_vals = _mm256_set1_ps(im_ker[i_ker]);
                        im_des_vals = _mm256_fmadd_ps(im_src_vals, im_ker_vals, im_des_vals);
                    }
                }
                _mm256_storeu_ps(&im_des[i * im_w + j], im_des_vals);
                j += num_simd_elements;
            }
            else if (j + num_simd_elements2 <= im_w - ks2)
            {
                __m128 im_des_vals = _mm_setzero_ps();
                for (indx_t k1 = -ks2; k1 <= ks2; k1++)
                {
                    for (indx_t k2 = -ks2; k2 <= ks2; k2++)
                    {
                        indx_t i_src = (i - k1) * im_w + j - k2;
                        indx_t i_ker = (k1 + ks2) * ks + k2 + ks2;
                        __m128 im_src_vals = _mm_loadu_ps(&im_src[i_src]);
                        __m128 im_ker_vals = _mm_set1_ps(im_ker[i_ker]);
                        im_des_vals = _mm_fmadd_ps(im_src_vals, im_ker_vals, im_des_vals);
                    }
                }
                _mm_storeu_ps(&im_des[i * im_w + j], im_des_vals);
                j += num_simd_elements2;
            }
            else
            {
                float im_des_val = 0;
                for (indx_t k1 = -ks2; k1 <= ks2; k1++)
                {
                    for (indx_t k2 = -ks2; k2 <= ks2; k2++)
                    {
                        indx_t i_src = (i - k1) * im_w + j - k2;
                        indx_t i_ker = (k1 + ks2) * ks + k2 + ks2;
                        float im_src_val = im_src[i_src];
                        float im_ker_val = im_ker[i_ker];
                        im_des_val += im_src_val * im_ker_val;
                    }
                }
                im_des[i * im_w + j] = im_des_val;
                j += 1;
            }
        }
    }
}
Listing 6-3

Example Ch06_03

Listing 6-3 opens with the file Ch06_03.h. Near the top of this file is the definition of a structure named CD_2D . This structure holds the input image, output image, and convolution kernel matrices, which are implemented using the C++ STL class std::vector<float>. Structure CD_2D also includes image and kernel size information. Following the definition of structure CD_2D is an enum named KERNEL_ID. The function Init2D() in Ch06_03_misc.cpp employs a KERNEL_ID argument to select a low-pass filter kernel for the convolution functions. In file Ch06_03.cpp, there is a function named Convolve2D_F32() . This function initializes the CD_2D structures, invokes the convolution calculating functions, and saves the final images.

The file Ch06_03_fcpp.cpp begins with the definition of function Convolve2D_F32_Cpp(). This non-SIMD function uses standard C++ statements to perform a 2D discrete convolution using the specified source image and convolution kernel. Note that function Convolve2D_F32_Cpp() employs four nested for-loops to implement the 2D discrete convolution equation that was defined earlier in this chapter.

The SIMD counterpart of function Convolve2D_F32_Cpp() is named Convolve2D_F32_Iavx2(). The code layout of this function is somewhat analogous to the 1D discrete convolution functions you saw earlier in this chapter. In function Convolve2D_F32_Iavx2() , the index variable i in the outer-most for-loop specifies the current image row, and the index variable j in the second-level while-loop specifies the current image column. The first if statement in the while-loop verifies that there are num_simd_elements or more columns remaining in the current row. If true, the two inner for-loops calculate im_des[i][j:j+7] using the appropriate elements from the source image im_src and the convolution kernel im_ker. Note that these C++ code variables correspond to the symbols y, x, and g in the 2D discrete convolution equation.

The explicit equations required to calculate each im_des[i][j] are unwieldy to write out, especially for large convolution kernels. As an alternative, Figure 6-7 shows a table of input signal and convolution kernel (size 3 × 3) elements that are required to calculate each output signal element when using an __m256 C++ SIMD type and SIMD arithmetic. This figure uses the variable names of the 2D discrete convolution equation. To calculate y[i][j], for example, each input signal element in its row must be multiplied by the convolution kernel element that is shown at the top of each table column. These products are then summed to generate the final y[i][j]. Note that each column in the table contains consecutive input signal elements. This allows input signal element loading using the C++ SIMD intrinsic function _mm256_loadu_ps(). The function _mm256_set1_ps() broadcasts single elements from the convolution kernel, while _mm256_fmadd_ps() performs the required FMA SIMD arithmetic.
Figure 6-7

Input signal and convolution kernel elements required to calculate an output element in a 2D discrete convolution

Returning to the code in Listing 6-3, the second else if code block in Convolve2D_F32_Iavx2() calculates im_des[i][j:j+3] whenever the number of remaining columns in the current row is less than num_simd_elements but greater than or equal to num_simd_elements2. This facilitates the processing of signal elements using 128-bit wide SIMD operands. The final else code block handles any residual columns in the current row using scalar single-precision floating-point arithmetic. It should be noted that the convolution calculating functions Convolve2D_F32_Cpp() and Convolve2D_F32_Iavx2() ignore the border elements of the input images when performing their calculations. This is the reason for the black border in the output image of Figure 6-3. In real-word image processing, the application’s functional requirements usually determine if extra processing is necessary to handle any border elements or if they can just be ignored. The image processing references listed in Appendix B contain more information regarding this topic. Here are the results for source code example Ch06_03:
Performing convolutions
Saving destination image files
rc:       true
num_diff: 0
Running benchmark function Convolve2D_F32_bm - please wait
..................................................
Benchmark times saved to file Ch06_03_Convolve2D_F32_bm_OXYGEN4.csv
Table 6-3 shows some benchmark timing measurements for source code example Ch06_03. These measurements were made using test image ImageE.png and a 9 × 9 convolution kernel that performs low-pass filtering.
Table 6-3

2D Discrete Convolution (Single-Precision) Execution Times (Microseconds)

CPU

Convolve2D_F32_Cpp()

Convolve2D_F32_Iavx2()

Intel Core i7-8700K

89569

11241

Intel Core i5-11600K

75609

9266

Separable Kernel

Earlier in this chapter, you learned that if a 2D convolution kernel is separable, a 2D discrete convolution can be carried out using two independent 1D discrete convolutions. Using two independent 1D discrete convolutions is significantly faster than a single 2D discrete convolution since the total number of arithmetic operations of the former is considerably less. The next source code example, Ch06_04, demonstrates using two independent 1D discrete convolutions to carry out a 2D discrete convolution. Listing 6-4 shows the source code for example Ch06_04.
//------------------------------------------------
//               Ch06_04.h
//------------------------------------------------
#pragma once
#include <array>
#include <vector>
struct CD_1Dx2
{
    size_t m_ImH;
    size_t m_ImW;
    size_t m_KernelSize;
    std::vector<float> m_ImSrc;
    std::vector<float> m_ImDes;
    std::vector<float> m_ImTmp;
    std::vector<float> m_Kernel1Dx;
    std::vector<float> m_Kernel1Dy;
};
enum class KERNEL_ID : unsigned int
{
    LowPass1Dx2_3x3, LowPass1Dx2_5x5, LowPass1Dx2_7x7,
    LowPass1Dx2_9x9, LowPass1Dx2_15x15
};
// Ch06_04_fcpp.cpp
extern void Convolve1Dx2_F32_Cpp(CD_1Dx2& cd);
extern void Convolve1Dx2_F32_Iavx2(CD_1Dx2& cd);
// Ch06_04_misc.cpp
extern bool CheckArgs1Dx2(const CD_1Dx2& cd);
extern void Init1Dx2(std::array<CD_1Dx2, 2>& cd, const char* fn, KERNEL_ID id);
// Ch06_04_misc2.cpp
extern void DisplayKernel1Dx2(float sigma, size_t ks);
extern void GetKernel1Dx2(CD_1Dx2& cd, KERNEL_ID id);
// Ch06_04_bm.cpp
extern void Convolve1Dx2_F32_bm(void);
// Miscellaneous constants
const KERNEL_ID c_KernelID = KERNEL_ID::LowPass1Dx2_15x15;
const KERNEL_ID c_KernelID_BM = KERNEL_ID::LowPass1Dx2_9x9;
//------------------------------------------------
//               Ch06_04_fcpp.cpp
//------------------------------------------------
#include <stdexcept>
#include <immintrin.h>
#include "Ch06_04.h"
#include "MiscTypes.h"
void Convolve1Dx2_F32_Cpp(CD_1Dx2& cd)
{
    if (!CheckArgs1Dx2(cd))
        throw std::runtime_error("Convolve1Dx2_F32_Cpp() - CheckArgs failed");
    indx_t ks = (indx_t)cd.m_KernelSize;
    indx_t ks2 = ks / 2;
    indx_t im_h = cd.m_ImH;
    indx_t im_w = cd.m_ImW;
    const std::vector<float>& im_src = cd.m_ImSrc;
    std::vector<float>& im_des = cd.m_ImDes;
    std::vector<float>& im_tmp = cd.m_ImTmp;
    const std::vector<float>& im_ker_x = cd.m_Kernel1Dx;
    const std::vector<float>& im_ker_y = cd.m_Kernel1Dy;
    // Perform 1D convolution (X)
    for (indx_t i = ks2; i < im_h - ks2; i++)
    {
        for (indx_t j = ks2; j < im_w - ks2; j++)
        {
            float im_tmp_val = 0;
            for (indx_t k = -ks2; k <= ks2; k++)
                im_tmp_val += im_src[i * im_w + j - k] * im_ker_x[k + ks2];
            im_tmp[i * im_w + j] = im_tmp_val;
        }
    }
    // Perform 1D convolution (Y)
    for (indx_t j = ks2; j < im_w - ks2; j++)
    {
        for (indx_t i = ks2; i < im_h - ks2; i++)
        {
            float im_des_val = 0;
            for (indx_t k = -ks2; k <= ks2; k++)
                im_des_val += im_tmp[(i - k) * im_w + j] * im_ker_y[k + ks2];
            im_des[i * im_w + j] = im_des_val;
        }
    }
}
void Convolve1Dx2_F32_Iavx2(CD_1Dx2& cd)
{
    if (!CheckArgs1Dx2(cd))
        throw std::runtime_error("Convolve1Dx2_F32_Iavx2() - CheckArgs failed");
    indx_t ks = (indx_t)cd.m_KernelSize;
    indx_t ks2 = ks / 2;
    indx_t im_h = cd.m_ImH;
    indx_t im_w = cd.m_ImW;
    const std::vector<float>& im_src = cd.m_ImSrc;
    std::vector<float>& im_des = cd.m_ImDes;
    std::vector<float>& im_tmp = cd.m_ImTmp;
    const std::vector<float>& im_ker_x = cd.m_Kernel1Dx;
    const std::vector<float>& im_ker_y = cd.m_Kernel1Dy;
    const indx_t num_simd_elements = 8;
    const indx_t num_simd_elements2 = 4;
    // Perform 1D convolution (X)
    for (indx_t i = ks2; i < im_h - ks2; i++)
    {
        indx_t j = ks2;
        while (j < im_w - ks2)
        {
            if (j + num_simd_elements <= im_w - ks2)
            {
                __m256 im_tmp_vals = _mm256_setzero_ps();
                for (indx_t k = -ks2; k <= ks2; k++)
                {
                    __m256 im_src_vals = _mm256_loadu_ps(&im_src[i * im_w + j - k]);
                    __m256 im_ker_vals = _mm256_set1_ps(im_ker_x[k + ks2]);
                    im_tmp_vals = _mm256_fmadd_ps(im_src_vals, im_ker_vals,
                                    im_tmp_vals);
                }
                _mm256_storeu_ps(&im_tmp[i * im_w + j], im_tmp_vals);
                j += num_simd_elements;
            }
            else if (j + num_simd_elements2 <= im_w - ks2)
            {
                __m128 im_tmp_vals = _mm_setzero_ps();
                for (indx_t k = -ks2; k <= ks2; k++)
                {
                    __m128 im_src_vals = _mm_loadu_ps(&im_src[i * im_w + j - k]);
                    __m128 im_ker_vals = _mm_set1_ps(im_ker_x[k + ks2]);
                    im_tmp_vals = _mm_fmadd_ps(im_src_vals, im_ker_vals,
                                    im_tmp_vals);
                }
                _mm_storeu_ps(&im_tmp[i * im_w + j], im_tmp_vals);
                j += num_simd_elements2;
            }
            else
            {
                __m128 im_tmp_vals = _mm_setzero_ps();
                for (indx_t k = -ks2; k <= ks2; k++)
                {
                    __m128 im_src_vals = _mm_load_ss(&im_src[i * im_w + j - k]);
                    __m128 im_ker_vals = _mm_load_ss(&im_ker_x[k + ks2]);
                    im_tmp_vals = _mm_fmadd_ss(im_src_vals, im_ker_vals,
                                    im_tmp_vals);
                }
                _mm_store_ss(&im_tmp[i * im_w + j], im_tmp_vals);
                j += 1;
            }
        }
    }
    // Perform 1D convolution (Y)
    indx_t j = ks2;
    while (j < im_w - ks2)
    {
        if (j + num_simd_elements <= im_w - ks2)
        {
            for (indx_t i = ks2; i < im_h - ks2; i++)
            {
                __m256 im_des_vals = _mm256_setzero_ps();
                for (indx_t k = -ks2; k <= ks2; k++)
                {
                    __m256 im_tmp_vals = _mm256_loadu_ps(&im_tmp[(i - k) * im_w + j]);
                    __m256 im_ker_vals = _mm256_set1_ps(im_ker_y[k + ks2]);
                    im_des_vals = _mm256_fmadd_ps(im_tmp_vals, im_ker_vals,
                                    im_des_vals);
                }
                _mm256_storeu_ps(&im_des[i * im_w + j], im_des_vals);
             }
            j += num_simd_elements;
        }
        else if (j + num_simd_elements2 <= im_w - ks2)
        {
            for (indx_t i = ks2; i < im_h - ks2; i++)
            {
                __m128 im_des_vals = _mm_setzero_ps();
                for (indx_t k = -ks2; k <= ks2; k++)
                {
                    __m128 im_tmp_vals = _mm_loadu_ps(&im_tmp[(i - k) * im_w + j]);
                    __m128 im_ker_vals = _mm_set1_ps(im_ker_y[k + ks2]);
                    im_des_vals = _mm_fmadd_ps(im_tmp_vals, im_ker_vals,
                                    im_des_vals);
                }
                _mm_storeu_ps(&im_des[i * im_w + j], im_des_vals);
             }
            j += num_simd_elements2;
        }
        else
        {
            for (indx_t i = ks2; i < im_h - ks2; i++)
            {
                __m128 im_des_vals = _mm_setzero_ps();
                for (indx_t k = -ks2; k <= ks2; k++)
                {
                    __m128 im_tmp_vals = _mm_load_ss(&im_tmp[(i - k) * im_w + j]);
                    __m128 im_ker_vals = _mm_load_ss(&im_ker_y[k + ks2]);
                    im_des_vals = _mm_fmadd_ss(im_tmp_vals, im_ker_vals,
                                    im_des_vals);
                }
                _mm_store_ss(&im_des[i * im_w + j], im_des_vals);
            }
            j += 1;
        }
    }
}
Listing 6-4

Example Ch06_04

In Listing 6-4, the header file Ch06_04.h begins with the definition of a structure named CD_1Dx2 . This structure holds the data that is used by the convolution calculating functions. There are two noteworthy differences between the structure CD_1Dx2 and the structure CD_2D that was used in the previous example. First, structure CD_1Dx2 includes a third image buffer named m_ImTmp. This temporary buffer holds the result of the first 1D discrete convolution. The second notable change is that structure CD_1Dx2 contains two 1D convolution kernels, m_Kernel1Dx and m_Kernel1Dy, instead of a single 2D discrete convolution kernel. Not shown in Listing 6-4 are files Ch06_04_misc.cpp and Ch06_04.cpp. These files contain the functions that perform data structure initialization, image loading and saving, etc., and are very similar to the ones used in example Ch06_03. They are, of course, included in the software download package.

The first function in file Ch06_04_fcpp.cpp, Convolve1Dx2_F32_Cpp() , implements the dual 1D discrete convolution using standard C++ statements. Note that Convolve1Dx2_F32_Cpp() performs its convolutions using two separate code blocks. The first code block contains three nested for-loops that execute a 1D convolution using the x-axis of source image buffer im_src. The results of this convolution are saved to the temporary image buffer im_tmp. The second code block in function Convolve1Dx2_F32_Cpp() performs a 1D convolution using the y-axis of im_tmp. The result of this 1D convolution is saved to the output image buffer im_des.

The SIMD counterpart function of Convolve1Dx2_F32_Cpp() is named Convolve1Dx2_F32_Iavx2(). This function is also partitioned into two well-defined sections that execute independent 1D discrete convolutions. The first section carries out a 1D discrete convolution using C++ SIMD intrinsic functions and the x-axis of input image im_src. The logic used in this section is similar to the previous example in that num_simd_elements and num_simd_elements2 are tested to determine if 256-bit wide, 128-bit wide, or scalar operands should be used. The second section in Convolve1Dx2_F32_Cpp() performs a 1D discrete convolution using the y-axis of im_tmp and saves results to the output image im_des. Here are the results for source code example Ch06_04:
Performing convolutions
Saving destination image files
rc:       true
num_diff: 0
Running benchmark function Convolve1Dx2_F32_bm - please wait
..................................................
Benchmark times saved to file Ch06_04_Convolve1Dx2_F32_bm_OXYGEN4.csv
Table 6-4 shows some benchmark timing measurements for source code example Ch06_04. These measurements were made using test image ImageE.png and a 9 × 9 convolution kernel that performs low-pass filtering. If you compare the execution times in Tables 6-3 and 6-4, it was clearly worth the effort to code the functions that implement a 2D discrete convolution using two independent 1D discrete convolutions. This is true for both the standard C++ and SIMD algorithms.
Table 6-4

2D Discrete Convolution (Single-Precision) Execution Times (Microseconds)

CPU

Convolve1Dx2_F32_Cpp()

Convolve1Dx2_F32_Iavx2()

Intel Core i7-8700K

21568

3623

Intel Core i5-11600K

14395

2467

Source code example Ch06_05 contains a simple function that compares the output images generated by the convolution functions in examples Ch06_03 and Ch06_04. In theory, these images should be identical. However, when comparing the images, pixel values are allowed to differ by ±1 to account for the non-associativity of floating-point arithmetic. Here are the results:
Comparing convolution result images
../Ch06_03/Ch06_03_ImageE_Conv2D_0.png
../Ch06_03/Ch06_03_ImageE_Conv2D_1.png
../Ch06_04/Ch06_04_ImageE_Conv2D_0.png
../Ch06_04/Ch06_04_ImageE_Conv2D_1.png
All images are alike

In source code file Ch06_05.cpp, changing max_d = 1 to max_d = 0 yields only a small number of pixel value discrepancies.

Summary

Most of the C++ SIMD intrinsic functions demonstrated in the chapter were also used in previous chapters. However, some new functions were introduced, and these are summarized in Table 6-5 along with commonly used size variants. Before proceeding to the next chapter, you should understand the SIMD arithmetic calculation or data manipulation operation that is performed by each function shown in Table 6-5.
Table 6-5

C++ SIMD Intrinsic Function Summary for Chapter 6

C++ SIMD Function Name

Description

_mm_fmadd_sd, _ss

Scalar fused-multiply-add

_mm_load_sd, _ss

Load scalar floating-point element

_mm_mul_pd, _ps

Packed floating-point multiplication

_mm_mul_sd, _ss

Scalar floating-point multiplication

_mm_set1_pd, _ps

Broadcast floating-point value

_mm_setzero_pd, _ps

Set packed floating-point elements to zero

_mm_store_sd, _ss

Store scalar floating-point element

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.135.219.166