On Intel processors, the first implementation of SIMD was MMX. Nobody seems to know exactly what MMX stands for: it could mean Multi Media Extension, Multiple Math Extension, or Matrix Math Extension. In any case, MMX was superseded by Streaming SIMD Extensions (SSE), and SSE was later extended by Advanced Vector Extensions (AVX). Here we introduce SSE as a starting point, and a later chapter introduces AVX.
Scalar Data and Packed Data
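A processor that supports SSE functionality has 16 additional 128-bit registers (xmm0 to xmm15) and a control register, mxcsr. We have already used the xmm registers for floating-point calculations, but these registers can do more. An xmm register can contain scalar data or packed data. Scalar data is a single value; packed data is several values of the same type stored side by side in one register. An xmm register can hold any one of the following packed layouts: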
Two 64-bit double-precision floating-point numbers
Four 32-bit single-precision floating-point numbers
Two 64-bit integers (quadwords)
Four 32-bit integers (doublewords)
Eight 16-bit short integers (words)
Sixteen 8-bit bytes or characters
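The packed layouts listed above can be sketched in C as a union that views the same 128 bits in each of the six ways. This is only an illustration: the names xmm_view and xmm_lanes are hypothetical, and the sketch assumes the usual sizes of 8 bytes for double and 4 bytes for float.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type: 128 bits seen as each packed layout from the list.
   Assumes double is 8 bytes and float is 4 bytes. */
typedef union {
    double   f64[2];   /* two 64-bit double-precision floats  */
    float    f32[4];   /* four 32-bit single-precision floats */
    int64_t  i64[2];   /* two 64-bit integers (quadwords)     */
    int32_t  i32[4];   /* four 32-bit integers (doublewords)  */
    int16_t  i16[8];   /* eight 16-bit integers (words)       */
    uint8_t  u8[16];   /* sixteen 8-bit bytes                 */
} xmm_view;

/* Every view covers exactly the 16 bytes of one xmm register. */
static_assert(sizeof(xmm_view) == 16, "an xmm register holds 128 bits");

/* Number of packed elements of the given width that fit in 128 bits. */
size_t xmm_lanes(size_t element_bits)
{
    return (sizeof(xmm_view) * 8) / element_bits;
}
```

Whatever the element width, the total is always 128 bits; only the number of elements changes.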
Scalar data and packed data are handled by distinct assembly instructions. The Intel manuals list a huge number of SSE instructions. We will use just a few examples in this and the following chapters as an introduction to get you going.
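As one illustration of the scalar/packed distinction, SSE has addss (add scalar single-precision) and addps (add packed single-precision). The C sketch below uses the corresponding compiler intrinsics from xmmintrin.h, _mm_add_ss and _mm_add_ps, so the difference can be observed without writing assembly. It assumes an x86-64 compiler such as gcc or clang; the function names add_packed and add_scalar are hypothetical, and the compiler is free to choose the exact instructions it emits.

```c
#include <xmmintrin.h>   /* SSE intrinsics; assumes an x86-64 compiler */

/* Packed add: all four single-precision lanes are added at once. */
void add_packed(const float a[4], const float b[4], float out[4])
{
    __m128 va = _mm_loadu_ps(a);              /* unaligned load of 4 floats */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));   /* out[i] = a[i] + b[i] */
}

/* Scalar add: only the lowest lane is added;
   the upper three lanes are copied unchanged from a. */
void add_scalar(const float a[4], const float b[4], float out[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ss(va, vb));   /* out[0] = a[0] + b[0] */
}
```

With a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, add_packed yields {11, 22, 33, 44}, whereas add_scalar yields {11, 2, 3, 4}.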
In later chapters, we will use AVX functionality. The AVX registers are called ymm registers and have 256 bits, double the size of the xmm registers. There is also AVX-512, whose registers are called zmm registers and have 512 bits.
Because of the potential for parallel computing, SIMD can be used to speed up computations in a wide range of applications such as image processing, audio processing, signal processing, vector and matrix manipulations, and so on. In later chapters, we will use SIMD for doing matrix manipulations, but don’t worry; we will limit the mathematics to basic matrix operations. The purpose is to learn SIMD, not linear algebra.
Unaligned and Aligned Data
When using SSE, alignment means that data in section .data and in section .bss should start at an address that is a multiple of 16, that is, on a 16-byte boundary. In NASM you can place the assembly directive align 16 (in section .data) or alignb 16 (in section .bss) in front of the data to be aligned. In the upcoming chapters, you will see examples of this. For AVX, data should be aligned on a 32-byte boundary, and for AVX-512, data needs to be aligned on a 64-byte boundary.
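The same alignment requirements can be sketched in C, where the C11 alignas specifier plays a role comparable to NASM's align directive. The variable names and the helper is_aligned are hypothetical and serve only to make the boundaries checkable.

```c
#include <stdalign.h>   /* alignas (C11) */
#include <stdint.h>

/* Hypothetical buffers, one per register width. */
alignas(16) float sse_data[4];       /* 16-byte boundary for xmm (SSE)     */
alignas(32) float avx_data[8];       /* 32-byte boundary for ymm (AVX)     */
alignas(64) float avx512_data[16];   /* 64-byte boundary for zmm (AVX-512) */

/* Returns 1 if the address p is a multiple of boundary (in bytes), else 0. */
int is_aligned(const void *p, uintptr_t boundary)
{
    return ((uintptr_t)p % boundary) == 0;
}
```

For example, is_aligned(sse_data, 16) returns 1, while an address 4 bytes past sse_data is not on a 16-byte boundary.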
Summary
SSE provides you with 16 additional 128-bit registers.
You know the difference between scalar data and packed data.
You know the importance of data alignment.