Instead of summing up a number of possibly interesting AVX instructions, let’s look at some matrix operations using AVX. This is a long chapter with several pages of code; a lot will be familiar, but we will introduce several new instructions here.
We will show matrix multiplication and matrix inversion. In the next chapter, we will show how to transpose a matrix.
Example Matrix Code
matrix4x4.asm
The interesting parts of this code are in the functions. The main function is for initializing the program, calling functions, and printing. The matrices we use in this example are 4×4 double-precision floating-point matrices. Note the 32-byte alignment of the matrices; in AVX we use ymm registers, with a size of 32 bytes. We will analyze the program function by function.
Matrix Print: printm4x4
We read the matrix one row at a time into four xmm registers, and then we push a number of registers onto the stack. These registers will be modified by printf, so we have to preserve them. Then we align the stack on a 16-byte boundary. Because of normal operation, rsp will be aligned on an 8-byte boundary. To align the stack on a 16-byte boundary, we cannot use the trick with the and instruction from Chapter 16. This is because with the and instruction, we do not know whether rsp will be changed or not. And we need the correct stack pointer because we pop the pushed registers after printf. If rsp was changed, we need to return it to its previous value before popping; otherwise, the wrong values will be popped from the stack. If rsp was not changed, we do not need to adjust it.
We will use the test instruction and 0xf to verify the alignment of the stack. If the last hexadecimal digit of rsp is a 0, then rsp is 16-byte aligned. If the last digit contains anything other than 0, then the last half-byte will have at least one of its bits set to 1. The test instruction is similar to an and instruction. If the last half-byte of rsp has one or more bits set to 1, the result of the comparison will be nonzero, and the zero-flag ZF will be cleared. The setnz (set-if-non-zero) instruction reads the zero flag (ZF), and if the ZF is not set, setnz will put 0000 0001 into r15b. If that happens, it means that rsp is not 16-byte aligned, and we will subtract 8 to put it on a 16-byte boundary. We left-shift r15b three times to obtain the decimal value 8 and do the subtraction. After the execution of printf, we restore the correct stack address by adding r15 back to rsp, that is, adding 8 if we had to align or adding 0 if we did not have to align. The stack is then where it was before our alignment, and we can pop the registers.
Matrix Multiplication: multi4x4
In the sample code and in the following explanation, we use the following two matrices:
This is efficient for manual computation; however, we are going to use a method that is more appropriate for a computer. We will use the ymm registers for keeping running totals and for updating the totals in subsequent loops. Here we make use of the power of AVX instructions.
First, we clear all the ymm registers with vzeroall. Then we go into a loop four times, once for every row in matrixB. A row of four double-precision values from matrixB is loaded in ymm0. Then a value from a sequentially selected column of matrixA is broadcasted into ymm1. The register rax serves as a column counter, and the column values are at offset 0, 32, 64, and 96. Broadcasting means that all four quadwords (8 bytes each) will contain that value. Then the values in ymm1 are multiplied with the values in ymm0 and added to ymm12. The multiplying and adding are done with one instruction called vfmadd231pd, which means “vector fused multiply add packed double.” The 231 indicates how the registers are used. There are multiple variants of vfmadd (132, 213, 231), and there are variants for double precision and single precision. We used 231, which means multiply the second operand with the third operand, add to the first operand, and put the result in the first operand. This is done for every column value of the matrixA column, and then the iteration continues; the next row of matrixB is loaded, and the computation restarts.
rdi | rsi | ||||||||
---|---|---|---|---|---|---|---|---|---|
32 bytes | 32 bytes | ||||||||
8 bytes | 8 bytes | 8 bytes | 8 bytes | 8 bytes | 8 bytes | 8 bytes | 8 bytes | ||
0–31 | 1 | 3 | 5 | 7 | 0–31 | 2 | 4 | 6 | 8 |
32–63 | 9 | 11 | 13 | 15 | 32–63 | 10 | 12 | 14 | 16 |
64–95 | 17 | 19 | 21 | 23 | 64–95 | 18 | 20 | 22 | 24 |
96–127 | 25 | 27 | 29 | 31 | 96–127 | 26 | 28 | 30 | 32 |
vmovapd ymm0, [rsi] | ymm0 | 2 | 4 | 6 | 8 |
vbroadcastsd ymm1,[rdi+0] | ymm1 | 1 | 1 | 1 | 1 |
vfmadd231pd ymm12,ymm1,ymm0 | ymm12 | 2 | 4 | 6 | 8 |
vbroadcastsd ymm1,[rdi+32+0] | ymm1 | 9 | 9 | 9 | 9 |
vfmadd231pd ymm13,ymm1,ymm0 | ymm13 | 18 | 36 | 54 | 72 |
vbroadcastsd ymm1,[rdi+64+0] | ymm1 | 17 | 17 | 17 | 17 |
vfmadd231pd ymm14,ymm1,ymm0 | ymm14 | 34 | 68 | 102 | 136 |
vbroadcastsd ymm1,[rdi+96+0] | ymm1 | 25 | 25 | 25 | 25 |
vfmadd231pd ymm15,ymm1,ymm0 | ymm15 | 50 | 100 | 150 | 200 |
vmovapd ymm0, [rsi+32] | ymm0 | 10 | 12 | 14 | 16 |
vbroadcastsd ymm1,[rdi+8] | ymm1 | 3 | 3 | 3 | 3 |
vfmadd231pd ymm12,ymm1,ymm0 | ymm12 | 32 | 40 | 48 | 56 |
vbroadcastsd ymm1,[rdi+32+8] | ymm1 | 11 | 11 | 11 | 11 |
vfmadd231pd ymm13,ymm1,ymm0 | ymm13 | 128 | 168 | 208 | 248 |
vbroadcastsd ymm1,[rdi+64+8] | ymm1 | 19 | 19 | 19 | 19 |
vfmadd231pd ymm14,ymm1,ymm0 | ymm14 | 224 | 296 | 368 | 440 |
vbroadcastsd ymm1,[rdi+96+8] | ymm1 | 27 | 27 | 27 | 27 |
vfmadd231pd ymm15,ymm1,ymm0 | ymm15 | 320 | 424 | 528 | 632 |
vmovapd ymm0, [rsi+32+32] | ymm0 | 18 | 20 | 22 | 24 |
vbroadcastsd ymm1,[rdi+8+8] | ymm1 | 5 | 5 | 5 | 5 |
vfmadd231pd ymm12,ymm1,ymm0 | ymm12 | 122 | 140 | 158 | 176 |
vbroadcastsd ymm1,[rdi+32+8+8] | ymm1 | 13 | 13 | 13 | 13 |
vfmadd231pd ymm13,ymm1,ymm0 | ymm13 | 362 | 428 | 494 | 560 |
vbroadcastsd ymm1,[rdi+64+8+8] | ymm1 | 21 | 21 | 21 | 21 |
vfmadd231pd ymm14,ymm1,ymm0 | ymm14 | 602 | 716 | 830 | 944 |
vbroadcastsd ymm1,[rdi+96+8+8] | ymm1 | 29 | 29 | 29 | 29 |
vfmadd231pd ymm15,ymm1,ymm0 | ymm15 | 842 | 1004 | 1166 | 1328 |
vmovapd ymm0, [rsi+32+32+32] | ymm0 | 26 | 28 | 30 | 32 |
vbroadcastsd ymm1,[rdi+8+8+8] | ymm1 | 7 | 7 | 7 | 7 |
vfmadd231pd ymm12,ymm1,ymm0 | ymm12 | 304 | 336 | 368 | 400 |
vbroadcastsd ymm1,[rdi+32+8+8+8] | ymm1 | 15 | 15 | 15 | 15 |
vfmadd231pd ymm13,ymm1,ymm0 | ymm13 | 752 | 848 | 944 | 1040 |
vbroadcastsd ymm1,[rdi+64+8+8+8] | ymm1 | 23 | 23 | 23 | 23 |
vfmadd231pd ymm14,ymm1,ymm0 | ymm14 | 1200 | 1360 | 1520 | 1680 |
vbroadcastsd ymm1,[rdi+96+8+8+8] | ymm1 | 31 | 31 | 31 | 31 |
vfmadd231pd ymm15,ymm1,ymm0 | ymm15 | 1648 | 1872 | 2096 | 2320 |
Matrix Inversion: Inverse4x4
Mathematicians have developed a range of algorithms to efficiently compute the inverse of a matrix. It is not our intent to provide you with an inversion program with all the bells and whistles; we just want to show how to use AVX.
We will use a method based on the Cayley-Hamilton theorem about characteristic polynomials. Here is an interesting site with more information on characteristic polynomials: http://www.mcs.csueastbay.edu/~malek/Class/Characteristic.pdf .
Caley-Hamilton Theorem
where An is A to the power of n. For example, A3 is AAA, the matrix A three times multiplied with itself. The p’s are coefficients to be determined, I is the identity matrix, and 0 is the zero matrix.
So, to find the inverse of matrix A, we need to do a number of matrix multiplications, and we need a method to find the p’s.
Leverrier Algorithm
To compute the p coefficients, we use the Leverrier algorithm , also covered at http://www.mcs.csueastbay.edu/~malek/Class/Characteristic.pdf . First, we find the traces of the matrices, that is, the sum of the elements on the diagonal from the upper left to the lower right. Let’s call sn the trace of the matrix An.
s1 for A
s2 for AA
s3 for AAA
s4 for AAAA
Pretty simple, right? Apart from some elaborate matrix multiplications to obtain the traces, of course.
The Code
In our function inverse4x4, we have a separate section .data, where we put our identity matrix and some variables we will use later. First, we compute the power matrices and store them in matrix2, matrix3, and matrix4. We will not use matrix1 yet. Then we call the function vtrace for every matrix to compute the traces. In the vtrace function , we first build our matrix in the ymm registers (ymm0, ymm1, ymm2, ymm3), each containing a row. Then we use the instruction vblendpd, which has four operands: two source operands, one destination operand, and a control mask. We want to extract the diagonal elements in rows 2, 3, and 4 and put them as packed values in ymm0, at locations index 1, 2, and 3. At location 0, we keep the trace element of ymm0.
In the first trace computation, after the blending, the ymm0 register contains the trace elements 2, 13, 29, 47. You can check this with SASM. Don’t be fooled by the order of the values of ymm0 as represented: 13, 2, 47, 29. We now have to sum these values. This can easily be done by extracting and simply adding, but for the sake of the demo, we will use AVX instructions. We apply the horizontal add instruction vhaddpd. ymm0 then contains 15, 15, 76, 76, which are the sum of the two lower values and the sum of the two higher values. Then we execute a permutation vpermpd with mask 00100111. Each two-bit value selects a value in the source operand; see Figure 36-2 for an explanation. Now the lower half of ymm0, which is xmm0, contains two values, so we have to add these to obtain the trace. We execute a horizontal add on xmm0 with haddpd. We store the traces in xmm8, xmm9, xmm10, and xmm11 for later use.
When we have all the traces, we can compute the p-coefficients. See how we change the sign of a value by applying a minus mask and the instruction vxorpd. We use the vfmadd213sd and vfmadd231sd to do additions and multiplications in one instruction. The instruction vfmadd213sd means multiply the first and second operands, add a third operand, and put the result in the first operand. The instruction vfmadd231sd means multiply the second and third operands, add the first operand, and put the result in the first operand. There is a list of similar instructions in the Intel manual. Study them carefully.
When we have all the coefficients, we scalar-multiply matrix, matrix2, matrix3, and matrixI with the coefficients, according to the previous formulae. The result of multiplication with matrix is put into matrix1. We do not need matrix4 anymore, so to save memory, we could have used the space for inverse as temporary memory instead of matrix4.
We have to divide by coefficient p4, so we have to check that p4 is nonzero. In this case, we could have done this simple operation after computing p4 earlier, but we wanted to show how to use the mxcsr register. We set the zero-division mask bit in mxcsr and do the division with the instruction vdivsd. If after division the third bit (index 2) in the mxcsr register is set, then we had a zero division, and the matrix is singular and cannot be inversed. In the and instruction, we used decimal 4, which is 0000 0100 in binary, so we are checking the third bit indeed. If we had a zero division, we head for the exit with 1 in rax to signal the error to the caller.
When a matrix is singular, the program will not crash because zero division is masked by default in the mxcsr register. After you finish the analysis of this code, comment out the part that checks for zero division and see what happens.
If p4 is nonzero, we add the four matrices and scalar-multiply the result with -1/p4. We do the addition and multiplication in the same loop. When everything goes fine, we have the inverse, and we return to the caller with 0 in rax.
Summary
AVX matrix operations
AVX instruction with three operands
AVX fuse operations
Use of masks for blending and permutations