It’s time to start the real SSE work! Although we have spent several chapters on SSE, we have only scratched the surface of the subject. There are hundreds of SIMD instructions (MMX, SSE, AVX), and investigating them in depth would require another book or even a series of books. In this chapter, we give a number of examples so that you know where to start and can find your way through the multitude of SIMD instructions in the Intel manuals. We begin with alignment, which we already covered briefly in Chapter 26.
Unaligned Example
Listing 28-1 shows how to add vectors using data that is unaligned in memory.
; sse_unaligned.asm
extern printf
section .data
;single precision
spvector1 dd 1.1
dd 2.2
dd 3.3
dd 4.4
spvector2 dd 1.1
dd 2.2
dd 3.3
dd 4.4
;double precision
dpvector1 dq 1.1
dq 2.2
dpvector2 dq 3.3
dq 4.4
fmt1 db "Single Precision Vector 1: %f, %f, %f, %f",10,0
fmt2 db "Single Precision Vector 2: %f, %f, %f, %f",10,0
fmt3 db "Sum of Single Precision Vector 1 and Vector 2:"
db " %f, %f, %f, %f",10,0
fmt4 db "Double Precision Vector 1: %f, %f",10,0
fmt5 db "Double Precision Vector 2: %f, %f",10,0
fmt6 db "Sum of Double Precision Vector 1 and Vector 2:"
db " %f, %f",10,0
section .bss
spvector_res resd 4
dpvector_res resq 2
section .text
global main
main:
push rbp
mov rbp,rsp
; add 2 single precision floating point vectors
mov rsi,spvector1
mov rdi,fmt1
call printspfp
mov rsi,spvector2
mov rdi,fmt2
call printspfp
movups xmm0, [spvector1]
movups xmm1, [spvector2]
addps xmm0,xmm1
movups [spvector_res], xmm0
mov rsi,spvector_res
mov rdi,fmt3
call printspfp
; add 2 double precision floating point vectors
mov rsi,dpvector1
mov rdi,fmt4
call printdpfp
mov rsi,dpvector2
mov rdi,fmt5
call printdpfp
movupd xmm0, [dpvector1]
movupd xmm1, [dpvector2]
addpd xmm0,xmm1
movupd [dpvector_res], xmm0
mov rsi,dpvector_res
mov rdi,fmt6
call printdpfp
leave
ret
printspfp:
push rbp
mov rbp,rsp
movss xmm0, [rsi]
cvtss2sd xmm0,xmm0
movss xmm1, [rsi+4]
cvtss2sd xmm1,xmm1
movss xmm2, [rsi+8]
cvtss2sd xmm2,xmm2
movss xmm3, [rsi+12]
cvtss2sd xmm3,xmm3
mov rax,4; four floats
call printf
leave
ret
printdpfp:
push rbp
mov rbp,rsp
movsd xmm0, [rsi]
movsd xmm1, [rsi+8]
mov rax,2; two doubles
call printf
leave
ret
Listing 28-1. sse_unaligned.asm
The first SSE instruction is movups (which means “move unaligned packed single precision”), which copies data from memory into xmm0 and xmm1. As a result, xmm0 and xmm1 each contain one vector with four single-precision values. Then we use addps (which means “add packed single precision”) to add the two vectors; the resultant vector goes into xmm0 and is then transferred to memory. Then we print the result with the function printspfp. In the printspfp function, we copy every value from memory into xmm registers using movss (which means “move scalar single precision”). Because printf expects double-precision floating-point arguments, we convert the single-precision floating-point numbers to double precision with the instruction cvtss2sd (which means “convert scalar single to scalar double”).
Next, we add two double-precision vectors. The process is similar to adding single-precision numbers, but we use movupd and addpd for double precision. The printdpfp function for printing double-precision values is a bit simpler. We have only a two-element vector, and because we are already using double precision, we do not have to convert the values.
Aligned Example
In the aligned version of the example, we create a dummy variable to make sure that the data following it is not 16-byte aligned by accident. Then we use the NASM assembler directive align 16 in section .data and the directive alignb 16 in section .bss. You need to add these assembler directives before each data block that needs to be aligned.
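The effect of the two directives can be checked with a small test program. The following is a minimal sketch, not the chapter's listing: the labels dummy, spvector1, and spvector_res follow the naming used in this chapter, while dummy_res, fmt_data, fmt_bss, the file name, and the build command in the comments are assumptions. The program prints the low four bits of each address; a value of 0 means the address is a multiple of 16.

```nasm
; align_check.asm (hypothetical file name)
; assumed build: nasm -f elf64 align_check.asm
;                gcc -no-pie -o align_check align_check.o
extern printf
section .data
    dummy     db 13                 ; 1 byte: pushes what follows off the boundary
align 16                            ; pad to the next multiple of 16
    spvector1 dd 1.1, 2.2, 3.3, 4.4
    fmt_data  db "spvector1    mod 16 = %ld",10,0
    fmt_bss   db "spvector_res mod 16 = %ld",10,0
section .bss
    dummy_res resb 1                ; hypothetical: misaligns the reserved space
    alignb 16                       ; same idea as align, for reserved space
    spvector_res resd 4
section .text
    global main
main:
    push rbp
    mov  rbp,rsp
    mov  rdi,fmt_data
    mov  rsi,spvector1
    and  rsi,15                     ; keep the low four address bits
    xor  eax,eax                    ; no xmm registers carry arguments
    call printf
    mov  rdi,fmt_bss
    mov  rsi,spvector_res
    and  rsi,15
    xor  eax,eax
    call printf
    xor  eax,eax                    ; return 0
    leave
    ret
```

Removing the align 16 or alignb 16 line and rebuilding should make the corresponding value nonzero, which is a quick way to see what the dummy variable does.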
The SSE instructions are slightly different from the unaligned version. We use movaps (which means “move aligned packed single precision”) to copy the first vector from memory into xmm0. Then we can immediately add the packed numbers from memory to the values in xmm0, because addps accepts a memory operand as long as that operand is 16-byte aligned. This is different from the unaligned version, where we had to put both vectors in xmm registers first. If we add the dummy variable to the unaligned example and try to use movaps instead of movups with a memory variable as the second operand, we risk a segmentation fault at runtime. Try it!
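The instruction sequence described above can be sketched as follows. This is an illustration under stated assumptions rather than the chapter's listing: the data labels reuse the names from the unaligned example, and the file name, the fmt string, and the build command are hypothetical.

```nasm
; sse_aligned_sketch.asm (hypothetical file name)
; assumed build: nasm -f elf64 sse_aligned_sketch.asm
;                gcc -no-pie -o sse_aligned_sketch sse_aligned_sketch.o
extern printf
section .data
    dummy     db 13                 ; forces the need for explicit alignment
align 16
    spvector1 dd 1.1, 2.2, 3.3, 4.4
align 16
    spvector2 dd 1.1, 2.2, 3.3, 4.4
    fmt       db "Sum: %f, %f, %f, %f",10,0
section .bss
    alignb 16
    spvector_res resd 4
section .text
    global main
main:
    push rbp
    mov  rbp,rsp
    movaps xmm0,[spvector1]         ; aligned load of four floats
    addps  xmm0,[spvector2]         ; add directly from aligned memory
    movaps [spvector_res],xmm0      ; aligned store of the result
    ; printf needs doubles, so widen each float before the call
    movss    xmm0,[spvector_res]
    cvtss2sd xmm0,xmm0
    movss    xmm1,[spvector_res+4]
    cvtss2sd xmm1,xmm1
    movss    xmm2,[spvector_res+8]
    cvtss2sd xmm2,xmm2
    movss    xmm3,[spvector_res+12]
    cvtss2sd xmm3,xmm3
    mov  rdi,fmt
    mov  eax,4                      ; four xmm registers carry arguments
    call printf
    xor  eax,eax                    ; return 0
    leave
    ret
```

Deleting the two align 16 lines while keeping dummy leaves spvector1 at an odd address, so the movaps load is then expected to fault.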
The register xmm0 contains the resulting sum vector with four single-precision values. Then we print the result with the function printspfp. In the printspfp function, we load every value from memory into an xmm register. Because printf expects double-precision floating-point arguments, we convert the single-precision floating-point numbers to double precision with the instruction cvtss2sd (“convert scalar single to scalar double”).
Next, we use double-precision values. The process is similar to using single precision, but we use movapd and addpd for double-precision values.
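For comparison, a double-precision counterpart might look like the sketch below. The data labels mirror the unaligned example; the file name, the fmt string, and the build command are assumptions, and the code is not taken from the chapter's listing.

```nasm
; sse_aligned_dp_sketch.asm (hypothetical file name)
; assumed build: nasm -f elf64 sse_aligned_dp_sketch.asm
;                gcc -no-pie -o sse_aligned_dp_sketch sse_aligned_dp_sketch.o
extern printf
section .data
    dummy     db 13
align 16
    dpvector1 dq 1.1, 2.2
align 16
    dpvector2 dq 3.3, 4.4
    fmt       db "Sum: %f, %f",10,0
section .bss
    alignb 16
    dpvector_res resq 2
section .text
    global main
main:
    push rbp
    mov  rbp,rsp
    movapd xmm0,[dpvector1]         ; aligned load of two doubles
    addpd  xmm0,[dpvector2]         ; add directly from aligned memory
    movapd [dpvector_res],xmm0      ; aligned store of the result
    movsd  xmm0,[dpvector_res]      ; already doubles: no conversion needed
    movsd  xmm1,[dpvector_res+8]
    mov  rdi,fmt
    mov  eax,2                      ; two xmm registers carry arguments
    call printf
    xor  eax,eax                    ; return 0
    leave
    ret
```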
Figure 28-2 shows the output for the aligned example.
Figure 28-3 shows the unaligned example after adding the dummy variable and using movaps with a memory variable as the second operand.
Summary
In this chapter, you learned about the following:
Scalar data and packed data
Aligned and unaligned data
How to align data
Data movement and arithmetic instructions on packed data
How to convert between single-precision and double-precision data