You will agree that many of the AVX instructions are far from intuitive; the different mask layouts in particular make the code difficult to read and understand. Moreover, the bit masks are often written in hexadecimal notation, so you first have to convert them to binary notation to see what they do.
In this chapter, we will demonstrate that AVX instructions can dramatically improve performance and that the effort of using AVX pays off in a number of cases. You can find an interesting white paper on benchmarking code at https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf .
In our examples, we will use the measuring method presented in this white paper.
Transpose Computation Performance
transpose.asm
Before we call the transpose function, we start the timing process. Modern processors support out-of-order execution, which could result in instructions being executed at the wrong moment: before we start the timing or after we stop it. To avoid that, we need to use “serializing” instructions, which guarantee that our timing instructions measure only what we want to measure. See the white paper mentioned earlier for a more detailed explanation. One instruction that can be used for serializing is cpuid.

Before starting the timer with rdtsc , we execute cpuid. The instruction rdtsc writes the beginning timestamp counter “low cycles” into register eax and “high cycles” into edx; we store these values in memory. rdtsc uses two registers for historical reasons: on 32-bit processors, one register would be too small to hold the timer counts, so one 32-bit register holds the lower part of the timer counter value and another register holds the higher part. After recording the beginning timer counter values, we execute the code we want to measure and use the rdtscp instruction to stop the measurement. The ending “high cycles” and “low cycles” counters are again stored in memory, and cpuid is executed once more to make sure that the processor does not postpone the execution of any instructions past the measurement.
We use a 64-bit processor environment, so we shift the higher timestamp value left by 32 bits and then xor it with the lower timestamp value to obtain the complete timestamp in a 64-bit register (after the shift, the lower 32 bits are zero, so xor has the same effect as or). The difference between the beginning counter value and the ending counter value gives the number of cycles used.
The function seq_transpose uses “classic” (non-AVX) instructions, and the function AVX_transpose is the transpose_shuffle4x4 function from the previous chapter. Each function is executed a large number of times, as specified in the variable loops.
You can see that using AVX instructions spectacularly speeds up the processing.
Intel has a volume dedicated to code optimization: https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf .
This manual has a lot of interesting information on improving the performance of assembly code. Search for “handling port 5 pressure” (currently covered in Chapter 14). In that section, you will find several versions of a transpose algorithm for 8×8 matrices, as well as the performance impact of different instructions. In the previous chapter, we demonstrated two ways of transposing a matrix: one using unpack instructions and one using shuffle instructions. The Intel manuals go much deeper into the details of this subject; if performance is important to you, there are treasures to be found there.
Trace Computation Performance
The function blend_trace is an extension, from 4×4 to 8×8, of the AVX trace function we used in our matrix inversion code in Chapter 36. The function seq_trace walks sequentially through the matrix, finds the diagonal (trace) elements, and adds them. When you run this code, you will see that seq_trace is much faster than blend_trace.
If you want to know more about optimization, use the previously mentioned Intel manual. Here is another excellent source: https://www.agner.org .
Summary
In this chapter, you learned about the following:

Measuring and computing elapsed cycles
How AVX can speed up processing drastically
Why AVX is not suited for every situation