© Jo Van Hoey 2019
J. Van Hoey, Beginning x64 Assembly Programming, https://doi.org/10.1007/978-1-4842-5076-1_28

28. SSE Alignment

Jo Van Hoey, Hamme, Belgium

It’s time to start the real SSE work! Although we have had a number of chapters on SSE, we have only scratched the surface of the subject. There are hundreds of SIMD instructions (MMX, SSE, AVX), and investigating them in depth would require another book or even a series of books. In this chapter, we will give a number of examples so that you know where to start and can find your way in the multitude of SIMD instructions in the Intel manuals. The examples focus on alignment, which we already covered briefly in Chapter 26.

Unaligned Example

Listing 28-1 shows how to add vectors using data that is unaligned in memory.
; sse_unaligned.asm
extern printf
section .data
;single precision
      spvector1  dd   1.1
                 dd   2.2
                 dd   3.3
                 dd   4.4
      spvector2  dd   1.1
                 dd   2.2
                 dd   3.3
                 dd   4.4
;double precision
      dpvector1  dq   1.1
                 dq   2.2
      dpvector2  dq   3.3
                 dq   4.4
      fmt1 db "Single Precision Vector 1: %f, %f, %f, %f",10,0
      fmt2 db "Single Precision Vector 2: %f, %f, %f, %f",10,0
      fmt3 db "Sum of Single Precision Vector 1 and Vector 2:"
           db " %f, %f, %f, %f",10,0
      fmt4 db "Double Precision Vector 1: %f, %f",10,0
      fmt5 db "Double Precision Vector 2: %f, %f",10,0
      fmt6 db "Sum of Double Precision Vector 1 and Vector 2:"
           db " %f, %f",10,0
section .bss
      spvector_res resd 4
      dpvector_res resq 4
section .text
      global main
main:
      push  rbp
      mov   rbp,rsp
; add 2 single precision floating point vectors
      mov   rsi,spvector1
      mov   rdi,fmt1
      call  printspfp
      mov   rsi,spvector2
      mov   rdi,fmt2
      call  printspfp
      movups     xmm0, [spvector1]
      movups     xmm1, [spvector2]
      addps      xmm0,xmm1
      movups     [spvector_res], xmm0
      mov        rsi,spvector_res
      mov        rdi,fmt3
      call       printspfp
; add 2 double precision floating point vectors
      mov   rsi,dpvector1
      mov   rdi,fmt4
      call  printdpfp
      mov   rsi,dpvector2
      mov   rdi,fmt5
      call  printdpfp
      movupd     xmm0, [dpvector1]
      movupd     xmm1, [dpvector2]
      addpd      xmm0,xmm1
      movupd     [dpvector_res], xmm0
      mov        rsi,dpvector_res
      mov        rdi,fmt6
      call       printdpfp
      leave
      ret
printspfp:
      push  rbp
      mov   rbp,rsp
      movss      xmm0, [rsi]
      cvtss2sd   xmm0,xmm0
      movss      xmm1, [rsi+4]
      cvtss2sd   xmm1,xmm1
      movss      xmm2, [rsi+8]
      cvtss2sd   xmm2,xmm2
      movss      xmm3, [rsi+12]
      cvtss2sd   xmm3,xmm3
      mov        rax,4; four floats
      call       printf
      leave
      ret
printdpfp:
      push  rbp
      mov   rbp,rsp
      movsd      xmm0, [rsi]
      movsd      xmm1, [rsi+8]
      mov        rax,2; two floats
      call       printf
      leave
      ret
Listing 28-1. sse_unaligned.asm

The first SSE instruction is movups (“move unaligned packed single precision”), which copies data from memory into an xmm register. After the two loads, xmm0 and xmm1 each hold a vector of four single-precision values. We then use addps (“add packed single precision”) to add the two vectors element by element; the resulting vector goes into xmm0 and is stored to memory with another movups. We print the result with the function printspfp. In printspfp, we copy each value from memory into its own xmm register using movss (“move scalar single precision”). Because printf expects double-precision floating-point arguments, we convert each single-precision number to double precision with the instruction cvtss2sd (“convert scalar single to scalar double”). Before the call, rax is set to the number of xmm registers that carry arguments, here four.
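As a minimal sketch of the conversion step, the following stand-alone program prints one single-precision value with printf. The file name and the names value and fmt are made up for this illustration, and the build line in the comment assumes a Linux system with NASM and GCC.

```nasm
; print_float.asm (hypothetical example, not one of the book's listings)
; Assumed build: nasm -f elf64 print_float.asm
;                gcc -o print_float print_float.o -no-pie
extern printf
section .data
      value dd   2.5                  ; one single-precision number
      fmt   db   "%f",10,0
section .text
      global main
main:
      push  rbp
      mov   rbp,rsp
      movss      xmm0, [value]        ; load the 32-bit float
      cvtss2sd   xmm0,xmm0            ; widen it to double for printf
      mov        rdi,fmt
      mov        rax,1                ; one xmm register carries an argument
      call       printf
      xor        eax,eax              ; return 0 from main
      leave
      ret
```

Because 2.5 is exactly representable in single precision, the program should print 2.500000.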

Next, we add two double-precision vectors of two elements each. The process is the same as for single precision, but we use movupd and addpd. The printdpfp function is a bit simpler: each vector has only two elements, and because the values are already double precision, no conversion is needed.

Figure 28-1 shows the output.
Figure 28-1. sse_unaligned.asm output

Aligned Example

Listing 28-2 shows how to add two vectors using data that is aligned on 16-byte boundaries in memory.
; sse_aligned.asm
extern printf
section .data
      dummy   db      13
align 16
      spvector1 dd    1.1
                dd    2.2
                dd    3.3
                dd    4.4
      spvector2 dd    1.1
                dd    2.2
                dd    3.3
                dd    4.4
      dpvector1 dq    1.1
                dq    2.2
      dpvector2 dq    3.3
                dq    4.4
      fmt1 db "Single Precision Vector 1: %f, %f, %f, %f",10,0
      fmt2 db "Single Precision Vector 2: %f, %f, %f, %f",10,0
      fmt3 db "Sum of Single Precision Vector 1 and Vector 2:"
           db " %f, %f, %f, %f",10,0
      fmt4 db "Double Precision Vector 1: %f, %f",10,0
      fmt5 db "Double Precision Vector 2: %f, %f",10,0
      fmt6 db "Sum of Double Precision Vector 1 and Vector 2:"
           db " %f, %f",10,0
section .bss
alignb 16
        spvector_res resd 4
        dpvector_res resq 4
section .text
      global main
main:
      push  rbp
      mov   rbp,rsp
; add 2 single precision floating point vectors
      mov   rsi,spvector1
      mov   rdi,fmt1
      call  printspfp
      mov   rsi,spvector2
      mov   rdi,fmt2
      call  printspfp
      movaps     xmm0, [spvector1]
      addps      xmm0, [spvector2]
      movaps     [spvector_res], xmm0
      mov        rsi,spvector_res
      mov        rdi,fmt3
      call       printspfp
; add 2 double precision floating point vectors
      mov        rsi,dpvector1
      mov        rdi,fmt4
      call       printdpfp
      mov        rsi,dpvector2
      mov        rdi,fmt5
      call       printdpfp
      movapd     xmm0, [dpvector1]
      addpd      xmm0, [dpvector2]
      movapd     [dpvector_res], xmm0
      mov        rsi,dpvector_res
      mov        rdi,fmt6
      call       printdpfp
; exit
      mov   rsp,rbp
      pop   rbp        ; undo the push at the beginning
      ret
printspfp:
      push  rbp
      mov   rbp,rsp
      movss      xmm0, [rsi]
      cvtss2sd   xmm0,xmm0  ;printf expects double precision argument
      movss      xmm1, [rsi+4]
      cvtss2sd   xmm1,xmm1
      movss      xmm2, [rsi+8]
      cvtss2sd   xmm2,xmm2
      movss      xmm3, [rsi+12]
      cvtss2sd   xmm3,xmm3
      mov        rax,4; four floats
      call printf
      leave
      ret
printdpfp:
      push  rbp
      mov   rbp,rsp
      movsd      xmm0, [rsi]
      movsd      xmm1, [rsi+8]
      mov        rax,2; two floats
      call printf
      leave
      ret
Listing 28-2. sse_aligned.asm

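Here we create a one-byte dummy variable so that the data following it would not start on a 16-byte boundary by default. Then we use the NASM assembler directive align 16 in section .data and the directive alignb 16 in section .bss to force the next data item onto a 16-byte boundary. You need to add these assembler directives before each data block that needs to be aligned. In this listing, one align 16 is enough for the vectors in section .data because each vector is exactly 16 bytes long, so spvector2, dpvector1, and dpvector2 also start on 16-byte boundaries.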
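To see the effect of the dummy byte and of align 16, the following sketch checks two addresses at run time: an address is a multiple of 16 exactly when its low four bits are zero. The labels vec_u, vec_a, fmt_a, fmt_u, and report are hypothetical names chosen for this illustration, and the build line assumes NASM and GCC on Linux.

```nasm
; align_check.asm (hypothetical example, not one of the book's listings)
; Assumed build: nasm -f elf64 align_check.asm
;                gcc -o align_check align_check.o -no-pie
extern printf
section .data
      dummy db   13                       ; one byte, so the next item is off a 16-byte boundary
      vec_u dd   1.1, 2.2, 3.3, 4.4       ; no align directive before this vector
align 16
      vec_a dd   1.1, 2.2, 3.3, 4.4       ; forced onto a 16-byte boundary
      fmt_a db   "%p is 16-byte aligned",10,0
      fmt_u db   "%p is not 16-byte aligned",10,0
section .text
      global main
main:
      push  rbp
      mov   rbp,rsp
      mov   rdi,vec_u
      call  report
      mov   rdi,vec_a
      call  report
      xor   eax,eax                       ; return 0 from main
      leave
      ret
; report: rdi = address to check
report:
      push  rbp
      mov   rbp,rsp
      mov   rsi,rdi                       ; second printf argument: the address
      mov   rdi,fmt_a                     ; assume aligned until the test says otherwise
      test  rsi,15                        ; low four bits are zero only for multiples of 16
      jz    .print
      mov   rdi,fmt_u
.print:
      xor   eax,eax                       ; no xmm registers carry arguments
      call  printf
      leave
      ret
```

With NASM's default handling of align, the first address should be reported as not aligned and the second as aligned; the addresses themselves depend on the linker.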

The SSE instructions are slightly different from the unaligned version. We use movaps (“move aligned packed single precision”) to copy data from memory into xmm0. Then we can add the packed numbers in memory directly to the values in xmm0, because addps accepts a memory operand as long as that operand is 16-byte aligned. This is different from the unaligned version, where we first had to load both vectors into xmm registers with movups. If we add the dummy variable to the unaligned example and use movaps instead of movups, or use a memory variable as the second operand of addps, we risk a runtime segmentation fault. Try it!
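When a routine cannot know in advance whether its inputs are aligned, one option is to test the addresses and pick the instruction sequence at run time. The sketch below is an assumption-laden illustration, not the book's method: addvec, vec1, vec2, res, and fmt are hypothetical names, and the build line assumes NASM and GCC on Linux.

```nasm
; safe_add.asm (hypothetical example, not one of the book's listings)
; Assumed build: nasm -f elf64 safe_add.asm
;                gcc -o safe_add safe_add.o -no-pie
extern printf
section .data
      dummy db   13                          ; pushes the vectors off a 16-byte boundary
      vec1  dd   1.0, 2.0, 3.0, 4.0          ; deliberately left without align 16
      vec2  dd   10.0, 20.0, 30.0, 40.0
      fmt   db   "%f, %f, %f, %f",10,0
section .bss
alignb 16
      res   resd 4                           ; aligned result buffer
section .text
      global main
main:
      push  rbp
      mov   rbp,rsp
      mov   rsi,vec1
      mov   rdx,vec2
      call  addvec
      movss      xmm0, [res]                 ; load and widen each result for printf
      cvtss2sd   xmm0,xmm0
      movss      xmm1, [res+4]
      cvtss2sd   xmm1,xmm1
      movss      xmm2, [res+8]
      cvtss2sd   xmm2,xmm2
      movss      xmm3, [res+12]
      cvtss2sd   xmm3,xmm3
      mov        rdi,fmt
      mov        rax,4                       ; four xmm registers carry arguments
      call       printf
      xor        eax,eax                     ; return 0 from main
      leave
      ret
; addvec: rsi, rdx = addresses of two 4-float vectors; the sum goes to res
addvec:
      mov   rax,rsi
      or    rax,rdx                          ; combine the low bits of both addresses
      test  rax,15                           ; any of the low four bits set?
      jnz   .unaligned
      movaps     xmm0, [rsi]                 ; both aligned: memory operand is safe
      addps      xmm0, [rdx]
      jmp   .store
.unaligned:
      movups     xmm0, [rsi]                 ; at least one unaligned: load both first
      movups     xmm1, [rdx]
      addps      xmm0,xmm1
.store:
      movaps     [res], xmm0                 ; res is aligned by alignb 16
      ret
```

Both paths compute the same sums, so the program should print 11.000000, 22.000000, 33.000000, 44.000000. Placing align 16 before vec1 would make addvec take the aligned path instead.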

The register xmm0 contains the resulting sum vector with four single-precision values, which we store to memory with movaps. Then we print the result with the function printspfp. In the printspfp function, we load each value from memory into its own xmm register. Because printf expects double-precision floating-point arguments, we convert the single-precision floating-point numbers to double precision with the instruction cvtss2sd (“convert scalar single to scalar double”).

Next, we add the double-precision vectors. The process is the same as for single precision, but we use movapd and addpd for double-precision values.

Figure 28-2 shows the output for the aligned example.
Figure 28-2. sse_aligned.asm output

Figure 28-3 shows what happens when we add the dummy variable to the unaligned example and use movaps with a memory variable as an operand.
Figure 28-3. sse_unaligned.asm segmentation fault

Summary

In this chapter, you learned about the following:
  • Scalar data and packed data

  • Aligned and unaligned data

  • How to align data

  • Data movement and arithmetic instructions on packed data

  • How to convert between single-precision and double-precision data
