26. SIMD

Jo Van Hoey¹

(1)

Hamme, Belgium

SIMD is the abbreviation for Single Instruction Stream, Multiple Data Stream, usually shortened to Single Instruction, Multiple Data. The term was proposed by Michael J. Flynn and refers to functionality that lets you execute one instruction on multiple data “streams.” SIMD can potentially improve the performance of your programs. It is a form of parallel computing; however, in some cases the execution on the different data streams can happen sequentially, depending on the hardware and on the instructions to be executed. You can find more about the Flynn taxonomy here:

https://ieeexplore.ieee.org/document/5009071/

and here:

https://en.wikipedia.org/wiki/Flynn%27s_taxonomy

The first implementation of SIMD on x86 processors was MMX, and nobody seems to know the exact meaning of MMX. It could mean Multi Media Extension, Multiple Math Extension, or Matrix Math Extension. In any case, MMX was superseded by Streaming SIMD Extensions (SSE), and SSE was later extended by Advanced Vector Extensions (AVX). This chapter introduces SSE as a starting point; a later chapter introduces AVX.

Scalar Data and Packed Data

In 64-bit mode, a processor that supports SSE functionality has 16 additional 128-bit registers (xmm0 to xmm15) and a control and status register, mxcsr. We already used the xmm registers to do floating-point calculations, but we can do more with these registers. The xmm registers can contain scalar data or packed data.

By scalar data, we mean just one value. When we put 3.141592654 in xmm0, then xmm0 contains a scalar value. We can also store multiple values in xmm0; these values are referred to as packed data. An xmm register can hold any one of the following:

Two 64-bit double-precision floating-point numbers
Four 32-bit single-precision floating-point numbers
Two 64-bit integers (quadwords)
Four 32-bit integers (double words)
Eight 16-bit short integers (words)
Sixteen 8-bit bytes or characters

Schematically, it looks like Figure 26-1.

Figure 26-1. Content of an xmm register
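As a rough illustration of the layouts listed above, the following C11 sketch overlays the six interpretations on the same 16 bytes. The names xmm_view and lanes_for_width are hypothetical and chosen only for this example; the sketch models the register contents in ordinary memory and does not touch a real xmm register.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

/* Hypothetical type: six ways to interpret the same 16 bytes (128 bits). */
typedef union {
    double  f64[2];   /* two 64-bit double-precision floating-point numbers  */
    float   f32[4];   /* four 32-bit single-precision floating-point numbers */
    int64_t i64[2];   /* two 64-bit integers (quadwords)                     */
    int32_t i32[4];   /* four 32-bit integers (double words)                 */
    int16_t i16[8];   /* eight 16-bit short integers (words)                 */
    uint8_t u8[16];   /* sixteen 8-bit bytes or characters                   */
} xmm_view;

/* Every view must cover exactly the 128 bits of an xmm register. */
static_assert(sizeof(xmm_view) == 16, "an xmm register holds 128 bits");

/* Number of packed elements of the given width (in bytes) that fit. */
int lanes_for_width(int element_bytes)
{
    return (int)sizeof(xmm_view) / element_bytes;
}
```

For example, lanes_for_width(4) gives the four 32-bit lanes of the list, and lanes_for_width(1) gives the sixteen bytes.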

There are distinct assembly instructions for scalar numbers and packed numbers. The Intel manuals list a huge number of SSE instructions. We will use just a couple of them in this and the following chapters, as an introduction to get you going.
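The difference between a scalar and a packed instruction can be sketched in C with the SSE intrinsics from xmmintrin.h, which an x86-64 compiler typically translates into the corresponding SSE instructions. This is only an illustration under that assumption; the function names add_packed and add_scalar are hypothetical, and the assembly examples in the following chapters use the instructions directly.

```c
#include <assert.h>
#include <xmmintrin.h>   /* SSE intrinsics; assumes an x86 or x86-64 compiler */

/* Packed add: all four lanes are summed (typically compiled to addps). */
void add_packed(const float a[4], const float b[4], float out[4])
{
    assert(a && b && out);
    __m128 va = _mm_loadu_ps(a);             /* unaligned load of 4 floats */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));  /* out[i] = a[i] + b[i]       */
}

/* Scalar add: only lane 0 is summed (typically compiled to addss);
   lanes 1 to 3 of the result are copied unchanged from a. */
void add_scalar(const float a[4], const float b[4], float out[4])
{
    assert(a && b && out);
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ss(va, vb));  /* out[0] = a[0] + b[0]       */
}
```

With a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, add_packed produces {11, 22, 33, 44}, while add_scalar produces {11, 2, 3, 4}.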

In later chapters, we will use AVX functionality. The AVX registers are called ymm registers and have 256 bits, double the size of the xmm registers. There is also AVX-512, whose registers have 512 bits and are called zmm registers.

Because of the potential for parallel computing, SIMD can be used to speed up computations in a wide range of applications such as image processing, audio processing, signal processing, vector and matrix manipulations, and so on. In later chapters, we will use SIMD for doing matrix manipulations, but don’t worry; we will limit the mathematics to basic matrix operations. The purpose is to learn SIMD, not linear algebra.

Unaligned and Aligned Data

Data in memory can be unaligned, or it can be aligned on addresses that are multiples of 16, 32, and so on. Aligning data in memory can drastically improve the performance of a program. Here is the reason why: aligned packed SSE instructions want to fetch memory chunks of 16 bytes at a time; see the left side of Figure 26-2. When data in memory is not aligned, the CPU has to do more than one fetch to get the needed 16 bytes of data, and that slows down the execution. We have two types of SSE instructions: aligned packed instructions and unaligned packed instructions. Unaligned packed instructions can deal with unaligned memory, but in general there is a performance disadvantage.

Figure 26-2. Data alignment
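The extra fetch for unaligned data can be made concrete with a little address arithmetic. The sketch below assumes a deliberately simplified model in which memory is fetched in 16-byte chunks that start at addresses divisible by 16; real hardware is more involved, and the name chunks_touched is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model (an assumption): memory is fetched in 16-byte chunks
   that start at addresses divisible by 16. */
enum { CHUNK = 16 };

/* How many chunks does an access of `size` bytes starting at `addr` touch? */
unsigned chunks_touched(uintptr_t addr, unsigned size)
{
    assert(size > 0);
    uintptr_t first = addr / CHUNK;               /* chunk holding the first byte */
    uintptr_t last  = (addr + size - 1) / CHUNK;  /* chunk holding the last byte  */
    return (unsigned)(last - first + 1);
}
```

In this model, a 16-byte access at address 0x1000 touches one chunk, while the same access at 0x1004 straddles two chunks and therefore needs two fetches.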

When using SSE, alignment means that data in section .data and in section .bss should be aligned on a 16-byte border. In NASM you can put the assembly directive align 16 (in section .data) or alignb 16 (in section .bss) in front of the data to be aligned. In the upcoming chapters, you will see examples of this. For AVX, data should be aligned on a 32-byte border, and for AVX-512, data needs to be aligned on a 64-byte border.
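For readers who also work in C, a comparable effect can be sketched with the C11 _Alignas specifier, which plays roughly the role that align 16 plays in NASM. The example assumes an x86-64 compiler with xmmintrin.h; the names values, is_aligned, and sum4 are hypothetical, and the instruction named in each comment is what a compiler typically emits, not a guarantee.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>   /* SSE intrinsics; assumes an x86 or x86-64 compiler */

/* Rough C11 counterpart of placing "align 16" in front of the data. */
_Alignas(16) float values[8] = {1, 2, 3, 4, 5, 6, 7, 8};

/* Nonzero when the address p is a multiple of boundary. */
int is_aligned(const void *p, uintptr_t boundary)
{
    return ((uintptr_t)p % boundary) == 0;
}

/* Sum four consecutive floats starting at p, picking the load by alignment. */
float sum4(const float *p)
{
    assert(p != 0);
    __m128 v = is_aligned(p, 16)
                   ? _mm_load_ps(p)    /* aligned load (movaps): needs a 16-byte border */
                   : _mm_loadu_ps(p);  /* unaligned load (movups): any address          */
    float lane[4];
    _mm_storeu_ps(lane, v);
    return lane[0] + lane[1] + lane[2] + lane[3];
}
```

Here sum4(values) takes the aligned path, while sum4(values + 1) starts 4 bytes past a 16-byte border and takes the unaligned path. The same pattern with _Alignas(32) or _Alignas(64) would cover the AVX and AVX-512 cases.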

Summary

In this chapter, you learned the following:

SSE provides you with 16 additional 128-bit registers.
You know the difference between scalar data and packed data.
You know the importance of data alignment.
