J. Van Hoey, Beginning x64 Assembly Programming, https://doi.org/10.1007/978-1-4842-5076-1_35

35. AVX

Jo Van Hoey, Hamme, Belgium

Advanced Vector Extensions (AVX) is an extension of SSE. Whereas SSE provides 16 xmm registers, each 128 bits wide, AVX offers 16 ymm registers, each 256 bits wide. Each xmm register is an alias for the lower 128 bits of the corresponding ymm register. AVX-512 is a further extension offering 32 zmm registers, each 512 bits wide.
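The aliasing can be seen in a short fragment. This is a sketch rather than code from the book: the label alias_demo and the data in vec are made up for illustration, and it assumes a CPU and operating system with AVX enabled.

```nasm
section .data
    vec dd 1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0  ; hypothetical test data
section .text
alias_demo:                     ; hypothetical label
    vmovups ymm0,[vec]          ; ymm0 = all eight floats
    vaddps  xmm1,xmm0,xmm0      ; xmm0 is the low half of ymm0: xmm1 = 2,4,6,8
                                ; the upper half of ymm1 is set to zero
    vmovups xmm0,[vec+16]       ; low half of ymm0 = 5,6,7,8
                                ; the upper half of ymm0 is set to zero
    ret
```

AVX instructions that write an xmm register clear the upper 128 bits of the matching ymm register, as the comments note. Legacy SSE instructions such as movups leave those upper bits untouched.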

In addition to these registers, AVX extends the SSE instructions and provides a whole range of additional new instructions. After you work your way through the SSE chapters in this book, you will not find it too difficult to navigate the large number of SSE and AVX instructions.

In this chapter, we will first show how to find out which AVX version the processor supports, and then we will work through an example program.

Test for AVX Support

Listing 35-1 shows a program to find out whether your CPU supports AVX.

; cpu_avx.asm
extern printf
section .data
    fmt_noavx    db "This cpu does not support AVX.",10,0
    fmt_avx      db "This cpu supports AVX.",10,0
    fmt_noavx2   db "This cpu does not support AVX2.",10,0
    fmt_avx2     db "This cpu supports AVX2.",10,0
    fmt_noavx512 db "This cpu does not support AVX-512.",10,0
    fmt_avx512   db "This cpu supports AVX-512.",10,0
section .bss
section .text
    global main
main:
    push rbp
    mov rbp,rsp
    call cpu_sse        ; prints which AVX versions the CPU supports
    leave
    ret
cpu_sse:
    push rbp
    mov rbp,rsp
    push rbx            ; cpuid overwrites rbx, which is callee-saved
    sub rsp,8           ; keep the stack 16-byte aligned for printf
;test for avx
    mov eax,1           ; request CPU feature flags
    cpuid
    mov eax,28          ; test bit 28 in ecx
    bt ecx,eax
    jnc no_avx
    xor rax,rax
    mov rdi,fmt_avx
    call printf
;test for avx2
    mov eax,7           ; request extended feature flags
    mov ecx,0           ; subleaf 0
    cpuid
    mov eax,5           ; test bit 5 in ebx
    bt ebx,eax
    jnc no_avx2
    xor rax,rax
    mov rdi,fmt_avx2
    call printf
;test for avx512 foundation
    mov eax,7           ; request extended feature flags
    mov ecx,0           ; subleaf 0
    cpuid
    mov eax,16          ; test bit 16 in ebx
    bt ebx,eax
    jnc no_avx512
    xor rax,rax
    mov rdi,fmt_avx512
    call printf
    jmp the_exit
no_avx:
    mov rdi,fmt_noavx
    xor rax,rax
    call printf         ; displays message if AVX not available
    xor rax,rax         ; returns 0, no AVX
    jmp the_exit        ; and exits
no_avx2:
    mov rdi,fmt_noavx2
    xor rax,rax
    call printf         ; displays message if AVX2 not available
    xor rax,rax         ; returns 0, no AVX2
    jmp the_exit        ; and exits
no_avx512:
    mov rdi,fmt_noavx512
    xor rax,rax
    call printf         ; displays message if AVX-512 not available
    xor rax,rax         ; returns 0, no AVX-512
    jmp the_exit        ; and exits
the_exit:
    mov rbx,[rbp-8]     ; restore rbx
    leave
    ret

Listing 35-1. cpu_avx.asm

This program is similar to the program we used to test for SSE support, but now we look for the AVX flags. With eax set to 1, cpuid reports AVX in bit 28 of ecx. With eax set to 7 and ecx set to 0, it reports AVX2 in bit 5 of ebx and AVX-512 Foundation in bit 16 of ebx. You can find more details on which registers to use and what information can be retrieved in the Intel manual, Volume 2, in the section on cpuid.
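The cpuid bits only say what the processor can do. The operating system must also save and restore the ymm registers on a context switch, and Intel's documented detection sequence checks for that with the OSXSAVE flag (bit 27 of ecx) and the xgetbv instruction. The following program is a minimal sketch of that extra check and is not part of the book's listing; the file name and message strings are made up for illustration.

```nasm
; avx_os.asm (hypothetical file name)
extern printf
section .data
    fmt_yes db "CPU and OS both support AVX.",10,0
    fmt_no  db "AVX is not usable here.",10,0
section .text
    global main
main:
    push rbp
    mov rbp,rsp
    push rbx                ; cpuid overwrites rbx, which is callee-saved
    sub rsp,8               ; keep the stack 16-byte aligned for printf
    mov eax,1               ; request CPU feature flags
    cpuid
    and ecx,0x18000000      ; keep bit 27 (OSXSAVE) and bit 28 (AVX)
    cmp ecx,0x18000000
    jne .no                 ; xgetbv may only be used when OSXSAVE is set
    xor ecx,ecx             ; select extended control register XCR0
    xgetbv                  ; edx:eax = XCR0
    and eax,6               ; bit 1 = xmm state, bit 2 = ymm state
    cmp eax,6
    jne .no                 ; the OS does not save the ymm registers
    mov rdi,fmt_yes
    jmp .print
.no:
    mov rdi,fmt_no
.print:
    xor eax,eax             ; no vector arguments for printf
    call printf
    xor eax,eax             ; return 0
    mov rbx,[rbp-8]         ; restore rbx
    leave
    ret
```

It should build the same way as the other listings in this book, for example with nasm -f elf64 followed by gcc -no-pie.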

Figure 35-1 shows the output.

Figure 35-1. cpu_avx.asm output

Example AVX Program

Listing 35-2 is adapted from the SSE unaligned example in Chapter 28.

; avx_unaligned.asm
extern printf
section .data
    spvector1 dd 1.1
              dd 2.1
              dd 3.1
              dd 4.1
              dd 5.1
              dd 6.1
              dd 7.1
              dd 8.1
    spvector2 dd 1.2
              dd 1.2
              dd 3.2
              dd 4.2
              dd 5.2
              dd 6.2
              dd 7.2
              dd 8.2
    dpvector1 dq 1.1
              dq 2.2
              dq 3.3
              dq 4.4
    dpvector2 dq 5.5
              dq 6.6
              dq 7.7
              dq 8.8
    fmt1 db "Single Precision Vector 1:",10,0
    fmt2 db 10,"Single Precision Vector 2:",10,0
    fmt3 db 10,"Sum of Single Precision Vector 1 and Vector 2:",10,0
    fmt4 db 10,"Double Precision Vector 1:",10,0
    fmt5 db 10,"Double Precision Vector 2:",10,0
    fmt6 db 10,"Sum of Double Precision Vector 1 and Vector 2:",10,0
section .bss
    spvector_res resd 8
    dpvector_res resq 4
section .text
    global main
main:
    push rbp
    mov rbp,rsp
;SINGLE PRECISION FLOATING POINT VECTORS
;load vector1 in the register ymm0
    vmovups ymm0, [spvector1]
;extract ymm0
    vextractf128 xmm2,ymm0,0    ;low 128 bits of ymm0
    vextractf128 xmm2,ymm0,1    ;high 128 bits of ymm0
;load vector2 in the register ymm1
    vmovups ymm1, [spvector2]
;extract ymm1
    vextractf128 xmm2,ymm1,0
    vextractf128 xmm2,ymm1,1
;add 2 single precision floating point vectors
    vaddps ymm2,ymm0,ymm1
    vmovups [spvector_res],ymm2
;print the vectors
    mov rdi,fmt1
    xor rax,rax                 ; no floating-point arguments
    call printf
    mov rsi,spvector1
    call printspfpv
    mov rdi,fmt2
    xor rax,rax
    call printf
    mov rsi,spvector2
    call printspfpv
    mov rdi,fmt3
    xor rax,rax
    call printf
    mov rsi,spvector_res
    call printspfpv
;DOUBLE PRECISION FLOATING POINT VECTORS
;load vector1 in the register ymm0
    vmovupd ymm0, [dpvector1]
;extract ymm0
    vextractf128 xmm2,ymm0,0    ;low 128 bits of ymm0
    vextractf128 xmm2,ymm0,1    ;high 128 bits of ymm0
;load vector2 in the register ymm1
    vmovupd ymm1, [dpvector2]
;extract ymm1
    vextractf128 xmm2,ymm1,0
    vextractf128 xmm2,ymm1,1
; add 2 double precision floating point vectors
    vaddpd ymm2,ymm0,ymm1
    vmovupd [dpvector_res],ymm2
;print the vectors
    mov rdi,fmt4
    xor rax,rax
    call printf
    mov rsi,dpvector1
    call printdpfpv
    mov rdi,fmt5
    xor rax,rax
    call printf
    mov rsi,dpvector2
    call printdpfpv
    mov rdi,fmt6
    xor rax,rax
    call printf
    mov rsi,dpvector_res
    call printdpfpv
    leave
    ret
printspfpv:
section .data
    .NL   db 10,0
    .fmt1 db "%.1f, ",0
section .text
    push rbp
    mov rbp,rsp
    push rcx
    push rbx
    mov rcx,8                   ; number of floats to print
    mov rbx,0                   ; byte offset into the vector
.loop:
    movss xmm0,[rsi+rbx]
    cvtss2sd xmm0,xmm0          ; printf expects a double
    mov rdi,.fmt1
    mov rax,1                   ; one floating-point argument
    push rsi
    push rcx
    call printf
    pop rcx
    pop rsi
    add rbx,4
    loop .loop
    xor rax,rax
    mov rdi,.NL
    call printf
    pop rbx
    pop rcx
    leave
    ret
printdpfpv:
section .data
    .NL  db 10,0
    .fmt db "%.1f, %.1f, %.1f, %.1f",0
section .text
    push rbp
    mov rbp,rsp
    movsd xmm0,[rsi]            ; load the four doubles to print
    movsd xmm1,[rsi+8]
    movsd xmm2,[rsi+16]
    movsd xmm3,[rsi+24]
    mov rdi,.fmt
    mov rax,4                   ; four floating-point arguments
    call printf
    mov rdi,.NL
    xor rax,rax
    call printf
    leave
    ret

Listing 35-2. avx_unaligned.asm

In this program, we use the 256-bit ymm registers and some new instructions. For example, we use vmovups to load unaligned data into a ymm register. We use SASM to view the registers. After the first vmovups instruction, ymm0 contains the following:

{0x4083333340466666400666663f8ccccd,0x4101999a40e3333340c3333340a33333}

Here is what it looks like converted to decimal:

{4.1 3.1 2.1 1.1 , 8.1 7.1 6.1 5.1}

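Note where the values are stored, which can be confusing: each 128-bit half is printed with its highest element on the left, so the first value in memory, 1.1, appears at the far right of the first group.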

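Purely for demonstration, we also extract data from the ymm registers: vextractf128 copies packed floating-point values from a ymm register to xmm2, 128 bits at a time. The immediate operand selects the half: 0 for the low 128 bits and 1 for the high 128 bits. Because both extractions write to xmm2, the second overwrites the first. You could use extractps to further extract individual floating-point values and store them in general-purpose registers.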
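As a rough illustration of that last step, the following sketch pulls one float out of a ymm register and prints it. It is not from the book: the file name, the label names, and the choice of element 5 are assumptions made for the example, and it uses vextractps, the AVX-encoded form of extractps.

```nasm
; extract_lane.asm (hypothetical file name)
extern printf
section .data
    vec dd 1.1,2.1,3.1,4.1,5.1,6.1,7.1,8.1
    fmt db "element 5 = %.1f",10,0
section .text
    global main
main:
    push rbp
    mov rbp,rsp
    vmovups ymm0,[vec]          ; ymm0 = all eight floats
    vextractf128 xmm2,ymm0,1    ; xmm2 = elements 4..7 (high 128 bits)
    vextractps eax,xmm2,1       ; eax = raw bits of element 5 (6.1)
    vmovd xmm0,eax              ; move the bits back to use them as a float
    vcvtss2sd xmm0,xmm0,xmm0    ; printf expects a double
    vzeroupper                  ; clear the upper ymm halves before calling C code
    mov rdi,fmt
    mov rax,1                   ; one floating-point argument
    call printf
    xor eax,eax                 ; return 0
    leave
    ret
```

Elements are numbered in memory order, so element 5 is the second float in the high 128-bit half.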

AVX also introduces instructions with three operands, as shown here:

vaddps ymm2,ymm0,ymm1

This adds ymm1 to ymm0 and stores the result in ymm2. Neither source register is modified.
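For comparison, here is a small fragment contrasting the two-operand SSE form with the three-operand AVX form. It is only a sketch with arbitrarily chosen registers, and it uses xmm registers on both sides so that the two versions compute the same thing.

```nasm
section .text
; legacy SSE: two operands, the destination is also the first source
    movaps xmm2,xmm0        ; copy xmm0 first so that it is not overwritten
    addps  xmm2,xmm1        ; xmm2 = xmm2 + xmm1
; AVX: three operands, both sources are left unchanged
    vaddps xmm3,xmm0,xmm1   ; xmm3 = xmm0 + xmm1
```

The AVX form saves the extra copy whenever both inputs are needed again later.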

The print functions simply load the values from memory into an xmm register, convert single precision to double precision where needed, and then call printf.

Figure 35-2 shows the output.

Figure 35-2. avx_unaligned.asm output

Summary

In this chapter, you learned about the following:

How to determine CPU support for AVX
That AVX uses 16 256-bit ymm registers
That the 128-bit xmm registers alias the lower halves of the ymm registers
How to extract values from ymm registers
