Let’s look at one last useful matrix operation: transposing. We have coded two versions, one using unpacking and one using shuffling.
Example Transposing Code
transpose4x4.asm
The Unpack Version
First, a remark about little-endian format and packed ymm values. When a row in the example holds 1, 2, 3, 4, the little-endian format is 4, 3, 2, 1. However, because ymm stores packed values in our example, a ymm register in SASM looks like this: 2, 1, 4, 3. You can verify this with your debugger. This can be confusing when debugging your program. In what follows we use the little-endian format 4, 3, 2, 1, not the 2, 1, 4, 3 format.
ymm0 | high qword2 (4) | low qword2 (3) | high qword1 (2) | low qword1 (1) |
ymm1 | high qword4 (8) | low qword4 (7) | high qword3 (6) | low qword3 (5) |
... |
ymm12 | low qword4 (7) | low qword2 (3) | low qword3 (5) | low qword1 (1) |
ymm13 | high qword4 (8) | high qword2 (4) | high qword3 (6) | high qword1 (2) |
The purpose of this method of unpacking is to change column pairs to row pairs. For example, the column pair 1 over 5 (the first values of rows one and two) becomes the row pair [1 5].
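The effect of unpacking on the first two rows can be checked with a small software model. This is a Python sketch, not the chapter's assembly: it assumes the listing uses the AVX instructions vunpcklpd and vunpckhpd, represents each register as a list whose index 0 is the lowest qword, and uses a hypothetical helper show() to print values in the little-endian format used in this chapter.

```python
def vunpcklpd(src1, src2):
    # Interleave the low qword of each 128-bit field of src1 and src2.
    return [src1[0], src2[0], src1[2], src2[2]]

def vunpckhpd(src1, src2):
    # Interleave the high qword of each 128-bit field of src1 and src2.
    return [src1[1], src2[1], src1[3], src2[3]]

def show(reg):
    # Hypothetical helper: highest qword first, as in the tables of this chapter.
    return list(reversed(reg))

ymm0 = [1, 2, 3, 4]   # row 1, lowest qword first
ymm1 = [5, 6, 7, 8]   # row 2, lowest qword first

ymm12 = vunpcklpd(ymm0, ymm1)
ymm13 = vunpckhpd(ymm0, ymm1)

print(show(ymm12))  # [7, 3, 5, 1]
print(show(ymm13))  # [8, 4, 6, 2]
```

The column pair 1 over 5 now sits side by side in the lowest two positions of ymm12.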
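After unpacking, the registers hold the following values (little-endian format):
ymm12 | 7 | 3 | 5 | 1 |
ymm13 | 8 | 4 | 6 | 2 |
ymm14 | 15 | 11 | 13 | 9 |
ymm15 | 16 | 12 | 14 | 10 |
Read with the lowest value first, the same four registers look like this:
1 | 5 | 3 | 7 |
2 | 6 | 4 | 8 |
9 | 13 | 11 | 15 |
10 | 14 | 12 | 16 |
The transpose we want to obtain, in little-endian format, is this:
13 | 9 | 5 | 1 |
14 | 10 | 6 | 2 |
15 | 11 | 7 | 3 |
16 | 12 | 8 | 4 |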
You may notice that the two lower values of ymm12 and ymm13 are in the correct place. Similarly, the two upper values of ymm14 and ymm15 are in the correct position.
We have to move the two lower values of ymm14 to the upper values of ymm12 and the two lower values of ymm15 to the upper values of ymm13.
The two upper values from ymm12 have to go to the lower values of ymm14, and we want the two upper values of ymm13 to go into the lower positions of ymm15.
01: Take the 128-bit high field from source 1 and put it at destination position 0.
00: This has a special meaning; see the following explanation.
11: Take the 128-bit high field from source 2 and put it at destination position 128.
00: This has a special meaning; see the following explanation.
Here again we use little-endian format (4, 3, 2, 1) and do not consider the order in which these values are stored in the ymm registers.
Source 1 low field = 00
Source 1 high field = 01
Source 2 low field = 10
Source 2 high field = 11
The special meaning is this: if you set bit 3 of the mask, the destination low field is zeroed, and if you set bit 7, the destination high field is zeroed.
Bits 2, 3, 6, and 7 are not used here. In most cases, you can read a mask such as 00110001 by looking only at the two selector pairs: bits 5 and 4 (11) choose the destination high field, and bits 1 and 0 (01) choose the destination low field.
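The mask rules above can be modelled in a few lines. The following Python sketch is a software model of vperm2f128, not the chapter's assembly; the list representation (index 0 is the lowest qword) and the helper show() are assumptions made for illustration.

```python
def vperm2f128(src1, src2, mask):
    # The four selectable 128-bit fields, indexed by their 2-bit code:
    # 00 = src1 low, 01 = src1 high, 10 = src2 low, 11 = src2 high.
    fields = [src1[0:2], src1[2:4], src2[0:2], src2[2:4]]
    # Bits 1-0 select the destination low field; bit 3 zeroes it instead.
    low = [0, 0] if mask & 0b00001000 else fields[mask & 0b11]
    # Bits 5-4 select the destination high field; bit 7 zeroes it instead.
    high = [0, 0] if mask & 0b10000000 else fields[(mask >> 4) & 0b11]
    return low + high

def show(reg):
    # Hypothetical helper: highest qword first, as in the tables of this chapter.
    return list(reversed(reg))

ymm12 = [1, 5, 3, 7]      # shown as 7 | 3 | 5 | 1
ymm14 = [9, 13, 11, 15]   # shown as 15 | 11 | 13 | 9

print(show(vperm2f128(ymm12, ymm14, 0b00100000)))  # [13, 9, 5, 1]
print(show(vperm2f128(ymm12, ymm14, 0b00110001)))  # [15, 11, 7, 3]
print(show(vperm2f128(ymm12, ymm14, 0b00111001)))  # [15, 11, 0, 0]
```

The last call sets bit 3, so the destination low field is zeroed while the high field is still taken from the high field of source 2.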
The lower 00 means take the ymm12 low field (5, 1) and put it in the low field of ymm0.
The higher 10 means take the ymm14 low field (13, 9) and put it in the high field of ymm0.
ymm12 | 7 | 3 | 5 | 1 |
ymm14 | 15 | 11 | 13 | 9 |
ymm0 | 13 | 9 | 5 | 1 |
The lower 00 means take the ymm13 low field (6, 2) and put it in the low field of ymm1.
The higher 10 means take the ymm15 low field (14, 10) and put it in the high field of ymm1.
ymm13 | 8 | 4 | 6 | 2 |
ymm15 | 16 | 12 | 14 | 10 |
ymm1 | 14 | 10 | 6 | 2 |
The lower 01 means take the ymm12 high field (7, 3) and put it in the low field of ymm2.
The higher 11 means take the ymm14 high field (15, 11) and put it in the high field of ymm2.
ymm12 | 7 | 3 | 5 | 1 |
ymm14 | 15 | 11 | 13 | 9 |
ymm2 | 15 | 11 | 7 | 3 |
The lower 01 means take the ymm13 high field (8, 4) and put it in the low field of ymm3.
The higher 11 means take the ymm15 high field (16, 12) and put it in the high field of ymm3.
ymm13 | 8 | 4 | 6 | 2 |
ymm15 | 16 | 12 | 14 | 10 |
ymm3 | 16 | 12 | 8 | 4 |
And we are done permuting. All that’s left is to copy the rows from the ymm registers to memory in the correct order.
The Shuffle Version
We already used a shuffle instruction called pshufd in Chapter 33. Here we use the instruction vshufpd, which also uses a mask to control the shuffle. Don’t get confused: pshufd uses all 8 bits of its mask, whereas only the lower 4 bits of the masks we use here count.
Again, we are using little-endian format (remember 4, 3, 2, 1) and do not care how the packed values are stored in the ymm registers. That is the processor’s business.
Bit 3 | Bit 2 | Bit 1 | Bit 0 |
Select from upper two values in source 2. | Select from upper two values in source 1. | Select from lower two values in source 2. | Select from lower two values in source 1. |
0 = lower value of source 2 | 0 = lower value of source 1 | 0 = lower value of source 2 | 0 = lower value of source 1 |
1 = higher value of source 2 | 1 = higher value of source 1 | 1 = higher value of source 2 | 1 = higher value of source 1 |
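The table above can be turned into a small software model of vshufpd. This is a Python sketch under stated assumptions, not the chapter's assembly: registers are lists whose index 0 is the lowest qword, show() is a hypothetical display helper, and the masks 0000 and 1111 are the ones implied by the "low" and "high" selections in the tables that follow.

```python
def vshufpd(src1, src2, mask):
    # Each mask bit picks the lower (0) or higher (1) value of one pair.
    return [
        src1[0 + (mask & 1)],          # bit 0: lower two values of source 1
        src2[0 + ((mask >> 1) & 1)],   # bit 1: lower two values of source 2
        src1[2 + ((mask >> 2) & 1)],   # bit 2: upper two values of source 1
        src2[2 + ((mask >> 3) & 1)],   # bit 3: upper two values of source 2
    ]

def show(reg):
    # Hypothetical helper: highest qword first, as in the tables of this chapter.
    return list(reversed(reg))

ymm0 = [1, 2, 3, 4]   # shown as 4 | 3 | 2 | 1
ymm1 = [5, 6, 7, 8]   # shown as 8 | 7 | 6 | 5

print(show(vshufpd(ymm0, ymm1, 0b0000)))  # [7, 3, 5, 1]
print(show(vshufpd(ymm0, ymm1, 0b1111)))  # [8, 4, 6, 2]
print(show(vshufpd(ymm0, ymm1, 0b0101)))  # [7, 4, 5, 2]
```

With all bits clear the instruction collects the lower value of every pair, and with all four bits set it collects the higher values.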
ymm0 | 4 | 3 | 2 | 1 |
ymm1 | 8 | 7 | 6 | 5 |
ymm12 | Low upper ymm1 7 | Low upper ymm0 3 | Low lower ymm1 5 | Low lower ymm0 1 |
ymm0 | 4 | 3 | 2 | 1 |
ymm1 | 8 | 7 | 6 | 5 |
ymm13 | High upper ymm1 8 | High upper ymm0 4 | High lower ymm1 6 | High lower ymm0 2 |
ymm2 | 12 | 11 | 10 | 9 |
ymm3 | 16 | 15 | 14 | 13 |
ymm14 | Low upper ymm3 15 | Low upper ymm2 11 | Low lower ymm3 13 | Low lower ymm2 9 |
ymm2 | 12 | 11 | 10 | 9 |
ymm3 | 16 | 15 | 14 | 13 |
ymm15 | High upper ymm3 16 | High upper ymm2 12 | High lower ymm3 14 | High lower ymm2 10 |
After applying the shuffle masks, we have eight pairs of values in the ymm registers. We chose the registers so that we obtain the same intermediate result as in the unpack version. Now the pairs need to be moved to the right places to form the transpose. We do that in exactly the same way as in the unpack section, by permuting fields (blocks) of 128 bits with vperm2f128.
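The two stages of the shuffle version can be put together in one sketch. The Python below models the instructions rather than reproducing the chapter's listing; the masks, the list representation (index 0 is the lowest qword), and the names rows and transposed are assumptions chosen to match the tables above.

```python
def vshufpd(src1, src2, mask):
    # Each mask bit picks the lower (0) or higher (1) value of one pair.
    return [src1[mask & 1], src2[(mask >> 1) & 1],
            src1[2 + ((mask >> 2) & 1)], src2[2 + ((mask >> 3) & 1)]]

def vperm2f128(src1, src2, mask):
    # 2-bit codes: 00 = src1 low, 01 = src1 high, 10 = src2 low, 11 = src2 high.
    fields = [src1[0:2], src1[2:4], src2[0:2], src2[2:4]]
    low = [0, 0] if mask & 0b00001000 else fields[mask & 0b11]
    high = [0, 0] if mask & 0b10000000 else fields[(mask >> 4) & 0b11]
    return low + high

rows = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]

# Stage 1: shuffle, collecting the low and the high value of every pair.
ymm12 = vshufpd(rows[0], rows[1], 0b0000)
ymm13 = vshufpd(rows[0], rows[1], 0b1111)
ymm14 = vshufpd(rows[2], rows[3], 0b0000)
ymm15 = vshufpd(rows[2], rows[3], 0b1111)

# Stage 2: permute 128-bit fields into their final places.
transposed = [
    vperm2f128(ymm12, ymm14, 0b00100000),
    vperm2f128(ymm13, ymm15, 0b00100000),
    vperm2f128(ymm12, ymm14, 0b00110001),
    vperm2f128(ymm13, ymm15, 0b00110001),
]

for row in transposed:
    print(row)
```

Each printed row lists its values lowest qword first, so the output can be compared directly with the columns of the input matrix.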
Summary
In this chapter, you learned the following:
That there are two ways to transpose a matrix
How to use shuffle, unpack, and permute instructions
That the shuffle and permute instructions use different mask formats, whereas the unpack instructions take no mask