In this chapter, we will start using the control byte to help us find a specific character in a string.
Determining the Length of a String
In the first example, we will determine the length of a string by looking for a terminating 0.
sse_string_length.asm
First, as usual, we print the strings. Then we call the custom-built search function pstrlen. Our function pstrlen scans for the first occurrence of a zero byte. The instruction pcmpistri analyzes blocks of 16 bytes at a time; we use rax as a block counter. If pcmpistri detects a zero byte in the current block, ZF will be set and used to decide whether to jump. We have to avoid that incrementing rax will impact the ZF flag just before the jump is evaluated, so we have to increment the ZF flag before pcmpistri. That is why we start with -16 in rax; now we can increase rax before using pcmpistri. Note the pxor instruction ; it is the logical or instruction for xmm registers. SIMD has its own logical instructions!
00 Packed unsigned bytes
10 Equal each
00 Positive Polarity
0 Least significant index
0 Reserved
You might expect that we use “equal any” to find any 0. But instead, we are using “equal each”! Why is that?
You have to know that pcmpistri initializes rcx to contain the value 16, which is the number of bytes in a block. If a matching byte is found, pcmpistri will copy the index of the matching byte in rcx. If there is no match found, rcx will contain 16.
Look in the Intel manuals, specifically, in Volume 2B. Section 4.1.6, “Valid/Invalid Override of Comparisons,” explains what happens when a block has “invalid” bytes, or bytes past the end of a string.
xmm0 | Memory | Equal any | Equal each |
---|---|---|---|
Invalid | Invalid | Force false | Force true |
Invalid | Valid | Force false | Force false |
We have xmm0 invalid because we initialized it to contain 0 bytes. When we have a 16-byte block containing a 0 byte, in the case of “equal any,” pcmpistri detects that one of the 16 bytes contains 0. At that moment, we have xmm0 invalid and memory invalid. However, pcmpistri is designed to “force false” in the case of “equal any.” So, pcmpistri thinks there is no match and returns 16 in rcx, so the calculated string length will not be correct.
But when we use “equal each,” xmm0 is invalid like before, and as soon as pcmpistri reads the terminating 0 byte in the block, it is designed to “force true.” The index of the 0 byte is recorded in ecx. And that value in ecx can be used to correctly calculate the end of the string.
One caveat: the program reads in blocks of 16 bytes. That is okay as long as the place where the data is found is within a memory space allocated to the program. If it tries reading beyond the allowed memory border, the program will crash. You can avoid this by keeping track of where you are in the memory page (in most cases, pages are chunks of 4K bytes), and if you come close to the page border, start reading byte per byte. That way you will never accidentally try to cross over from an allowed memory page to a memory page of another process. We did not implement this feature to complicate the explanation and the example program. But be warned that such a situation can happen.
Searching in Strings
sse_string_search.asm
At the beginning of the program, we added two templates in comments to make the character counting easier for us.
Here, string1 contains the string, and string2 contains the search argument. We will be searching for the first and last occurrences of the search argument. First, we print the strings; then we call the custom-built functions. We have separate functions for finding the first occurrence of the character and the last occurrence. The function pstrscan_f scans for the first occurrence of the search argument. The instruction pcmpistri treats blocks of 16 bytes at a time; we use rax as a block counter. We clear xmm0 with the pxor instruction. With pinsrb , we put the search argument in the low byte of xmm0 (byte 0). We use “equal any” to find the occurrences, and as soon as an occurrence is found, rcx indicates the index of the matching byte in the current 16-byte block. If no occurrence is found in the current block, the value 16 is put into rcx. With jc, we check if CF=1. If so, we find a match; rcx is added to rax, which contains the number of bytes already screened in previous blocks, and then rax is returned, corrected for the counting to start at 1 instead of 0.
If CF=0, we check with jz to see if we have reached the last block. pcmpistri sets ZF=1 when a null byte is detected, and rax is cleared, because no match was found. And the function returns with 0.
Of course, we did not do any error checking; if the string is not null terminated, you may get erroneous results. Try to delete the 0 at the end of the string and watch the result.
Summary
Using pcmpistri to scan for characters and string length
Interpreting the outcome of pcmpistri with different control bytes