With the unmasked string instructions, we have a few options. We can find a first or last occurrence of a character, but finding all occurrences is more challenging. We can compare strings and find a difference, but finding all differences is more complicated. Luckily, we also have string instructions that use masks, which makes them much more powerful. But before diving into mask instructions, we need to look at shuffling.
A First Look at Shuffling
Shuffling means moving packed values around. The move can take place within a single xmm register, from one xmm register to another, or from a 128-bit memory location to an xmm register.
shuffle.asm
First, we reserve space on the stack for 128-bit (16-byte) variables. We need this space for “pushing” xmm registers on the stack. We cannot use the standard push/pop instructions with xmm registers; we must use memory addressing to copy them to and from the stack. We use rbp, the base pointer, as a point of reference.
We print the numbers we will use as packed values. Then we load the numbers as double words into xmm0 with the instruction pinsrd (which means “packed insert double word”). We save (push) xmm0 as a local stack variable with the instruction movdqu [rbp-16],xmm0. (We reserved space for this local variable at the start of the program.) Every time we execute printf, xmm0 will be modified, intentionally or not. So, we have to preserve the original value of xmm0 and restore it when needed. The instruction movdqu moves unaligned packed integer values. To help visualize the results of the shuffling, we take little-endian ordering into account when printing: the highest double word is printed first, so xmm0 appears the way a debugger such as SASM displays it.
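To make the little-endian printing concrete, here is a minimal Python sketch that models the register as 16 bytes. It is not the program's code: the helper names `pinsrd` and `as_debugger_shows` and the values 1 to 4 are assumptions chosen for illustration.

```python
import struct

def pinsrd(xmm: bytes, value: int, index: int) -> bytes:
    """Model of pinsrd: overwrite double word `index` (0..3) of a 16-byte register image."""
    if len(xmm) != 16 or not 0 <= index <= 3:
        raise ValueError("need a 16-byte register image and an index in 0..3")
    dwords = list(struct.unpack("<4I", xmm))   # d0, d1, d2, d3 in little-endian order
    dwords[index] = value & 0xFFFFFFFF
    return struct.pack("<4I", *dwords)

def as_debugger_shows(xmm: bytes) -> str:
    """Format the register with the highest double word (d3) on the left."""
    d0, d1, d2, d3 = struct.unpack("<4I", xmm)
    return f"{d3} {d2} {d1} {d0}"

xmm0 = bytes(16)                               # start with an all-zero register
for index, number in enumerate((1, 2, 3, 4)):  # hypothetical input numbers
    xmm0 = pinsrd(xmm0, number, index)

print(as_debugger_shows(xmm0))  # 4 3 2 1
print(xmm0.hex(" ", 4))         # memory order: the bytes of d0 come first
```

The first number inserted ends up in d0, which sits at the lowest address but is printed last, so the numbers appear in reverse order compared with how they were listed.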
The examples cover three kinds of shuffle:
Shuffle broadcast
Shuffle reverse
Shuffle rotate
Shuffle Broadcast
In the figure, the source and target are both xmm0. The mask is an 8-bit value made up of four 2-bit fields, one for each target position, and each field selects a source double word. The least significant double word, d0, is selected by 00b. The second lowest, d1, is selected by 01b. The third, d2, is selected by 10b. The fourth, d3, is selected by 11b. The binary mask 10101010b, or aah in hexadecimal, therefore works as follows: put d2 (10b) in all four target packed double-word positions. Similarly, the mask 11111111b would place d3 (11b) in all four target packed double-word positions.
We accomplish a broadcast of the third-lowest element in xmm0. Because the function printf modifies xmm0, we need to save the content of xmm0 by storing it to memory before calling printf. In fact, we need to do more work to protect the content of xmm0 than to do the shuffling itself.
Of course, you are not limited to the four masks we presented here; you can create any 8-bit mask and mix and shuffle as you like.
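The way an 8-bit mask selects double words can be sketched in a few lines of Python. This is only a model of the pshufd behavior described above, not assembly code; the function name `pshufd` and the sample values are assumptions for illustration.

```python
def pshufd(src, mask):
    """Model of pshufd. `src` is [d0, d1, d2, d3]; returns the shuffled double words."""
    if len(src) != 4 or not 0 <= mask <= 0xFF:
        raise ValueError("need four double words and an 8-bit mask")
    # Bits 1:0 of the mask pick the source for target position 0,
    # bits 3:2 for position 1, bits 5:4 for position 2, bits 7:6 for position 3.
    return [src[(mask >> (2 * pos)) & 0b11] for pos in range(4)]

xmm0 = [1, 2, 3, 4]                 # d0..d3, hypothetical values

print(pshufd(xmm0, 0b10101010))     # broadcast d2: [3, 3, 3, 3]
print(pshufd(xmm0, 0b11111111))     # broadcast d3: [4, 4, 4, 4]
print(pshufd(xmm0, 0b11100100))     # every position keeps its own value: [1, 2, 3, 4]
```

The lists are written with d0 first, so they read in the opposite direction from the little-endian printout of xmm0.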
Shuffle Reverse
The mask 00011011b, or 1bh, is read from its lowest two bits upward:
11 (value in d3) goes into position 0
10 (value in d2) goes into position 1
01 (value in d1) goes into position 2
00 (value in d0) goes into position 3
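Building a mask from a list like this one is mechanical: each 2-bit source selector is shifted into the field for its target position. The following Python sketch shows the arithmetic; the helper name `shuffle_mask` is hypothetical.

```python
def shuffle_mask(sources):
    """Build an 8-bit shuffle mask.

    sources[pos] is the index (0..3) of the source double word
    that should land in target position pos.
    """
    if len(sources) != 4 or any(not 0 <= s <= 3 for s in sources):
        raise ValueError("need four source indexes in 0..3")
    mask = 0
    for pos, src in enumerate(sources):
        mask |= src << (2 * pos)    # position 0 uses bits 1:0, position 3 uses bits 7:6
    return mask

reverse = shuffle_mask([3, 2, 1, 0])        # d3 to position 0, ..., d0 to position 3
print(f"{reverse:08b}b = {reverse:02x}h")   # 00011011b = 1bh
```

The same helper reproduces the broadcast mask: `shuffle_mask([2, 2, 2, 2])` yields 10101010b.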
Shuffle Rotate
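As a sketch, assuming that a rotate moves every double word a fixed number of positions and wraps around at the ends, the masks can be derived in Python as follows. The names `rotate_mask` and `apply_mask` are hypothetical, and the values are for illustration only.

```python
def rotate_mask(k):
    """Mask that moves each double word k positions up (negative k: down), wrapping around."""
    mask = 0
    for pos in range(4):
        src = (pos - k) % 4         # the double word that ends up in position pos
        mask |= src << (2 * pos)
    return mask

def apply_mask(src, mask):
    """Apply an 8-bit shuffle mask to [d0, d1, d2, d3]."""
    return [src[(mask >> (2 * pos)) & 0b11] for pos in range(4)]

up = rotate_mask(1)
down = rotate_mask(-1)
print(f"{up:08b}b", apply_mask([1, 2, 3, 4], up))      # 10010011b [4, 1, 2, 3]
print(f"{down:08b}b", apply_mask([1, 2, 3, 4], down))  # 00111001b [2, 3, 4, 1]
```

Whether a given mask counts as a left or a right rotate depends on whether you read the register with d0 or d3 on the left, which is why the sketch speaks of moving up and down instead.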
Shuffle Bytes
You can shuffle double words with pshufd and words with pshufw. You can also shuffle high words and low words with pshufhw and pshuflw, respectively. You can find all the details in the Intel manuals. All these instructions use a source operand, a target operand, and a mask specified with an immediate. Providing an immediate as a mask has its limitations: it is inflexible, and you have to provide the mask at assembly time, not at runtime.
But there is a solution: shuffle bytes.
Then the magic happens: the instruction pshufb shuffles the bytes. Remember, the mask goes in the second operand; the first operand is both the source and the destination.
The nice thing here is that we do not have to provide the mask at assembly time as an immediate. The mask can be built in xmm1 as a result of a computation at runtime.
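To illustrate what a runtime mask buys us, here is a Python model of the byte shuffle. It follows the pshufb rule that each mask byte selects a source byte with its low four bits and that a mask byte with its highest bit set clears the target byte. The function name, the sample data, and the masks are assumptions for illustration.

```python
def pshufb(dst: bytes, mask: bytes) -> bytes:
    """Model of pshufb: `dst` is both source and destination, `mask` holds 16 selector bytes."""
    if len(dst) != 16 or len(mask) != 16:
        raise ValueError("both operands must be 16 bytes")
    # Bit 7 set in a mask byte zeroes the target byte;
    # otherwise the low four bits pick the source byte.
    return bytes(0 if m & 0x80 else dst[m & 0x0F] for m in mask)

data = bytes(range(0x41, 0x51))             # b"ABCDEFGHIJKLMNOP"

reverse = bytes(range(15, -1, -1))          # mask computed at runtime, not an immediate
print(pshufb(data, reverse))                # b'PONMLKJIHGFEDCBA'

n = 3                                       # rotation count known only at runtime
rotate = bytes((i + n) % 16 for i in range(16))
print(pshufb(data, rotate))                 # b'DEFGHIJKLMNOPABC'

keep_low_half = bytes(list(range(8)) + [0x80] * 8)
print(pshufb(data, keep_low_half))          # b'ABCDEFGH' followed by eight zero bytes
```

Because the mask is ordinary data, the same shuffle code can reverse, rotate, or zero bytes depending on values computed while the program runs.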
Summary
Shuffle instructions
Shuffle masks
Runtime masks
How to use the stack with xmm registers