Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

S. SmithProgramming with 64-Bit ARM Assembly Languagehttps://doi.org/10.1007/978-1-4842-5881-1_14

14. Optimizing Code

Stephen Smith¹

(1)

Gibsons, BC, Canada

In this chapter, we will look at ways to make our upper-case routine more efficient. We look at some design patterns for more efficient conditional statements, as well as some new ARM instructions that can simplify our code.

Optimizing code often involves thinking outside the box and going beyond finding ways to remove one or two instructions in a loop; we’ll look at a couple of novel ways to greatly improve the upper-case routine.

First of all, we’ll look at a trick to simplify the main if statement.

Optimizing the Upper-Case Routine

Our original upper-case routine implements the pseudo-code:

IF (W5 >= 'a') AND (W5 <= 'z') THEN

W5 = W5 - ('a'-'A')

END IF

with the following Assembly code:

// If W5 > 'z' then goto cont

CMP W5, #'z' // is letter > 'z'?

B.GT cont

// Else if W5 < 'a' then goto end if

CMP W5, #'a'

B.LT cont // goto to end if, if < 'a'

// if we got here then the letter is lower case, so convert it.

SUB W5, W5, #('a'-'A')

cont: // end if

This code implements the reverse logic of branching around the SUB instruction if W5 < ‘a’ or W5 > ‘z’. This was fine for a chapter teaching branch instructions, since it demonstrated two of them. However, in this chapter, we look at eliminating branches entirely, so let’s see how we can improve this code one step at a time.

Simplifying the Range Comparison

A common way to simplify range comparisons is to shift the range, so we don’t need a lower comparison. If we subtract ‘a’ from everything, then our pseudo-code becomes

W6 = W5 - 'a'

IF (W6 >= 0) AND W6 <= ('z'-'a') THEN

W5 = W5 - ('a'-'A')

END IF

If we treat W6 as an unsigned integer, then the first comparison does nothing, since all unsigned integers are greater than 0. In this case, we simplified our range from two comparisons to one comparison that W6 <= (‘z’-’a’). To understand why we use two registers here, see Exercise 1 in this chapter.

This leads us to the first improved version of our upper.s file. This new upper.s is shown in Listing 14-1.

// Assembler program to convert a string to

// all upper case.

// X1 - address of output string

// X0 - address of input string

// X4 - original output string for length calc.

// W5 - current character being processed

// W6 - minus 'a' to compare < 26.

.global toupper // Allow other files to call this routine

toupper: MOV X4, X1

// The loop is until byte pointed to by X1 is non-zero

loop: LDRB W5, [X0], #1 // load char and increment pointer

// Want to know if 'a' <= W5 <= 'z'

// First subtract 'a'

SUB W6, W5, #'a'

// Now want to know if W6 <= 25

CMP W6, #25 // chars are 0-25 after shift

B.HI cont

// if we got here then the letter is lower case, so convert it.

SUB W5, W5, #('a'-'A')

cont: // end if

STRB W5, [X1], #1 // store character to output str

CMP W5, #0 // stop on hitting a null character

B.NE loop // loop if character isn't null

SUB X0, X1, X4 // get the len by sub'ing the pointers

RET // Return to caller

Listing 14-1

Upper-case routine with simplified range comparison

All the examples in this chapter use the same main.s from Listing 6-3, except the third one, which skips needing a main.s. Listing 14-2 is a makefile for all the code in this chapter. Comment out any programs that you haven’t gotten to yet, or you will get a compile error.

UPPEROBJS = main.o upper.o

UPPER2OBJS = main.o upper2.o

UPPER3OBJS = upper3.o

UPPER4OBJS = main.o upper4.o

ifdef DEBUG

DEBUGFLGS = -g

else

DEBUGFLGS =

endif

LSTFLGS =

all: upper upper2 upper3 upper4

%.o : %.s

as $(DEBUGFLGS) $(LSTFLGS) $< -o $@

upper: $(UPPEROBJS)

ld -o upper $(UPPEROBJS)

upper2: $(UPPER2OBJS)

ld -o upper2 $(UPPER2OBJS)

upper3: $(UPPER3OBJS)

ld -o upper3 $(UPPER3OBJS)

upper4: $(UPPER4OBJS)

ld -o upper4 $(UPPER4OBJS)

Listing 14-2

Makefile for the upper-case routine version in this chapter

This is an improvement and a great optimization to use when you need range comparisons. Let's use a conditional instruction to remove another branch.

Using a Conditional Instruction

The ARM processor has a handful of instructions that help eliminate branch instructions. First of all, consider conditional select:

CSEL Xd, Xn, Xm, cond

This statement implements

IF cond is true then

Xd = Xn

else

Xd = Xm

This is like the C conditional operator, as follows:

Xd = cond ? Xn : Xm

Note

You can use either W or X registers with the CSEL instruction, but all the registers must be the same type.

There are a few variations on this instruction; a typical one is conditional select increment:

CSINC Xd, Xn, Xm, cond

which implements

IF condition is true then

Xd = Xn

else

Xd = Xm + 1

Next, we’ll use CSEL to replace another branch instruction.

Example with CSEL

Listing 14-3 is our upper-case routine modified to use a CSEL instruction, eliminating another branch instruction, which should be placed in the file upper2.s.

// Assembler program to convert a string to

// all upper case.

// X1 - address of output string

// X0 - address of input string

// X4 - original output string for length calc.

// W5 - current character being processed

// W6 - minus 'a' to compare < 26.

// W6 - char minus 0x20, potential upper-cased

.global toupper // Allow other files to call this routine

toupper:

MOV X4, X1

// The loop is until byte pointed to by R1 is non-zero

loop: LDRB W5, [X0], #1 // load char and increment pointer

// Want to know if 'a' <= W5 <= 'z'

// First subtract 'a'

SUB W6, W5, #'a'

// Now want to know if W6 <= 25

CMP W6, #25 // chars are 0-25 after shift

// perform lower case conversion to W6

SUB W6, W5, #('a'-'A')

// Use W6 if lower case, otherwise use original character in W5

CSEL W5, W6, W5, LS

STRB W5, [X1], #1 // store character to output str

CMP W5, #0 // stop on hitting a null character

B.NE loop // loop if character isn't null

SUB X0, X1, X4 // get the len by sub'ing the pointers

RET // Return to caller

Listing 14-3

Upper-case routine using a conditional CSEL instruction

In this example, we perform

SUB W6, W5, #('a'-'A')

into a different result register W6. Now, we have the original character in W5 and the converted character in W6. We perform

CSEL W5, W6, W5, LS

This places W6 into W5 if the LS condition is true—the character is an alphabetic lower-case character, else it puts W5 into W5—the original character.

This code is more structured; it isn’t a spaghetti of branch instructions. Once you are used to using these operators, following the logic is easier. This sequence is easier on the execution pipeline, since branch prediction isn’t required to keep things moving.

Restricting the Problem Domain

The best optimizations of code arise from restricting the problem domain. If we are only dealing with alphabetic characters, we can eliminate the range comparison entirely. In Appendix D, “ASCII Character Set,” the only difference between upper- and lower-case letters is that lower-case letters have the 0x20 bit set, whereas upper-case letters do not. This means we convert a lower-case letter to upper-case by performing a bit clear (BIC) operation on that bit. If we do this to special characters, it will corrupt the bits of quite a few of them.

Often in computing, we want code to be case insensitive, meaning that you can enter any combination of case. The Assembler does this, so it doesn’t care if we enter MOV or mov. Similarly, many computer languages are case insensitive, so you can enter variable names in any combination of upper- and lower-case and it means the same thing. Machine learning algorithms that process text always convert them into a standard form, usually throwing away all punctuation and converting them to all one case. Forcing this standardization saves a lot of extra processing later.

Let's look at an implementation of this for our code. Listing 14-4 goes in upper3.s.

// Assembler program to convert a string to

// all upper case. Assumes only alphabetic

// characters. Uses bit clear blindly without

// checking if character is alphabetic or not.

// X0 - address of input string

// X1 - address of output string

// X2 - original output string for length calc.

// W3 - current character being processed

.global _start // Provide program starting address

.MACRO toupper inputstr, outputstr

LDR X0, =inputstr // start of input string

LDR X1, =outputstr // address of output string

MOV X2, X1

// The loop is until byte pointed to by R1 is non-zero

loop: LDRB W3, [X0], #1 // load char and increment pointer

BIC W3, W3, #0x20 // kill bit that makes it lower case

STRB W3, [X1], #1 // store character to output str

CMP W3, #0 // stop on hitting a null character

B.NE loop // loop if character isn't null

SUB X0, X1, X2 // get the len by sub'ing the pointers

.ENDM

_start:

toupper instr, outstr

// Setup the parameters to print our hex number

// and then call Linux to do it.

MOV X2,X0 // return code is the length of the string

MOV X0, #1 // 1 = StdOut

LDR X1, =outstr // string to print

MOV X8, #64 // linux write system call

SVC 0 // Call linux to output the string

// Setup the parameters to exit the program

// and then call Linux to do it.

MOV X0, #0 // Use 0 return code

MOV X8, #93 // Service command code 96 terminates

SVC 0 // Call linux to terminate the program

.data

instr: .asciz "ThisIsRatherALargeVariableNameAaZz//[`{ "

.align 4

outstr: .fill 255, 1, 0

Listing 14-4

Upper-case routine as a macro, using BIC for alphabetic characters only

This file contains the _start entry point and print Linux calls, so no main.s is needed. Here is the output of building and running this version:

smist08@kali:~/asm64/Chapter 14$ make

as upper3.s -o upper3.o

ld -o upper3 upper3.o

smist08@kali:~/asm64/Chapter 14$ ./upper3

THISISRATHERALARGEVARIABLENAMEAAZZ[@[

smist08@kali:~/asm64/Chapter 14$

There are a few special characters at the end of the string showing how some are converted correctly and some aren’t.

Besides using this BIC instruction to eliminate all conditional processing, we implement the toupper routine as a macro to eliminate the overhead of calling a function. We change the register usage, so we only use the first four registers in the macro, so we don’t need to save any registers around the call.

This is typical of many optimizations, showing how we can save instructions if we narrow our problem domain, in this case to just working on alphabetic characters rather than all ASCII characters.

Using Parallelism with SIMD

In Chapter 13, “Neon Coprocessor,” we looked at performing operations in parallel and mentioned that this coprocessor processes characters, as well as integers and floats. Let’s see if we can use NEON instructions to process 16 characters at a time (16 characters fit in a 128-bit V register).

Let’s look at the code in upper4.s shown in Listing 14-5.

Note

This code won’t run until we make an adjustment to main.s at the end of this section in Listing 14-6.

// Assembler program to convert a string to

// all upper case.

// X0 - address of input string

// X1 - address of output string

// X2 - use as indirection to load data

// Q0 - 8 characters to be processed

// V1 - contains all a's for comparison

// V2 - result of comparison with 'a's

// Q3 - all 25's for comp

// Q8 - spaces for bic operation

.global toupper // Allow other files to call this routine

.EQU N, 4

toupper:

LDR X2, =aaas

LDR Q1, [X2] // Load Q1 with all as

LDR X2, =endch

LDR Q3, [X2] // Load Q3 with all 25's

LDR X2, =spaces

LDR Q8, [X2] // Load Q8 with all spaces

MOV W3, #N

// The loop is until byte pointed to by R1 is non-zero

loop: LDR Q0, [X0], #16 // load 16 chars and incr pointer

SUB V2.16B, V0.16B, V1.16B // Subtract 'a's

CMHI V2.16B, V2.16B, V3.16B // compare chars to 25's

NOT V2.16B, V2.16B // no CMLO so need to not

AND V2.16B, V2.16B, V8.16B // and result with spaces

BIC V0.16B, V0.16B, V2.16B // kill lower-casebit

STR Q0, [X1], #16 // store character to output str

SUBS W3, W3, #1 // dec loop counter and set flags

B.NE loop // loop if character isn't null

MOV X0, #(N*16) // get the len by sub'ing the pointers

RET // Return to caller

.data

aaas: .fill 16, 1, 'a' // 16 a's

endch: .fill 16, 1, 25 // after shift, chars are 0-25

spaces: .fill 16, 1, 0x20 // spaces for bic

Listing 14-5

Upper-case routine using the NEON Coprocessor

This routine uses 128-bit registers to process 16 characters at a time. There are more instructions than some of our previous routines, but the parallelism makes it worthwhile. We start by loading our constants into registers. You can’t use immediate constants with NEON instructions, so these must be in registers. Additionally, they need to be duplicated 16 times, so there is one for each of our 16 lanes.

We then load 16 characters to process into Q0 with an LDR instruction. We use post-indexed addressing, so the pointer is left pointing to the next block of characters for when we loop.

Figure 14-1 shows the processing through the NEON Coprocessor for the first four lanes. We use BIC, but we could have just as easily used SUB to do the conversion. We test that the character is lower-case alphabetic before doing this, so it is correct for all ASCII characters.

../images/494415_1_En_14_Chapter/494415_1_En_14_Fig1_HTML.jpg — Figure 14-1
The parallel processing steps to convert to upper-case

The CMHI is our first encounter with a NEON comparison instruction. It compares all 16 lanes at once. It places all 1s in the destination lane if the comparison is true, otherwise 0. All 1s are 0xFF hex. We really want CMLO, but there is no such instruction, so we need to do a CMHI followed by a NOT. With this, we can AND it with a register full of 0x20s. Any lanes that don’t have a lower-case alphabetic character will result in 0.

This means in lanes with 0, there are no bits for BIC to clear. Then the lanes that still have 0x20 will clear that one bit doing the conversion.

For this routine to work, we need to make a change to main.s. We need to add a “.align 4” between the two strings. This is because we can only load or store NEON data from or to word-aligned memory locations. If we don’t do this, we get a “Bus Error” when the program runs. The updated code is shown in Listing 14-6.

instr: .asciz "This is our Test String that we will convert. AaZz@[`{ "

.align 4

outstr: .fill 255, 1, 0

Listing 14-6

Changes required in main.s

I also added edge case characters to the end of the string; this ensures we don’t have any off-by-one errors in our code.

This code runs fine, but that’s partly because of the way our .data section is set up. Notice there is no test for the string NULL terminator. This routine just converts fixed length strings, and we set the fixed length at 4∗16 by making the loop perform four iterations. The NEON processor has no easy way to detect a NULL terminator. If we looped through the characters outside of the NEON processor to look for the NULL, we do nearly as much work as our last toupper routine. To do string processing in the NEON Coprocessor, here are some notes:

Don’t use NULL-terminated strings. Use a length field followed by the string. Or use fixed length strings, for example, every string is just 256 characters and contains spaces beyond the last character.
Pad all strings to use data storage in multiples of 16. This way you won’t ever have to worry about NEON processing past the end of your buffer.
Make sure all the strings are word aligned.

We’ve looked at several techniques to optimize our upper-case routine; let’s look at why we concentrate so much on eliminating branch instructions as well as provide a few other tips.

Tips for Optimizing Code

The first rule of optimizing code is to time and test everything. The designers of the ARM processor are always incorporating improvements to their hardware designs. Each year, the ARM processors get faster and more optimized. Improving performance though optimizing Assembly Language code isn’t always intuitive. The processor can be quite smart at some things and quite dumb at others. If you don’t set up tests to measure the results of your changes, you could well be making things worse.

With that said, let’s look at some general Assembly Language optimization techniques.

Avoiding Branch Instructions

The ARM CPU works on several instructions at once, and if the instructions don’t involve a branch, then everything works great. If the CPU hits a branch instruction, it must do one of three things:

1.
Throw away any work it has done on instructions after the branch instruction.
2.
Make an educated guess as to which way the branch is likely to go and proceed in that direction; then it only needs to discard the work if it guessed wrong.
3.
Start processing instructions in both directions of the branch at once; perhaps it can’t do as much work, but it accomplishes something until the direction of the conditional branch is decided.

CPUs were getting quite good at predicting branches and keeping their pipelines busy. This was until the Spectre and Meltdown security exploits figured out how to access this work and exploit it. That caused CPU vendors, including ARM, to reduce some of this functionality.

As a result, conditional branch instructions can still be expensive. They also lead to hard to maintain spaghetti code that should be avoided. So reducing conditional branches helps performance and leads to more maintainable code.

Avoiding Expensive Instructions

Instructions like multiplication and division take multiple clock cycles to execute. If you can accomplish them through additions or subtractions in an existing loop, that can help. Also, consider using bit manipulation instructions like shifting left to multiply by 2. If these instructions are necessary for your algorithm, then there isn’t much you can do.

One trick is to execute the multiplication or division on the FPU or NEON Coprocessor; this will allow other regular ARM instructions to continue in parallel.

Don’t Be Afraid of Macros

Calling a function can be costly if a lot of registers need to be saved to the stack and then restored before returning. Don’t be afraid of using macros to eliminate the function call and return instructions along with all the register saving/restoring.

Loop Unrolling

We’ll see an example of loop unrolling in Chapter 15, “Reading and Understanding Code.” This is repeating the code the number of times of the loop, saving the overhead of the instructions that do the looping. We did this in the NEON version of 3x3 matrix multiplication where we inserted calls to a macro three times rather than write a loop.

Keeping Data Small

Even though the ARM process can mostly process instructions involving the 64-bit X registers in the same time as involving the 32-bit W registers, it puts strain on the memory bus moving all that data. Remember the memory bus is moving your data, along with loading instructions to execute and doing all that for all the processing cores. Reducing the quantity of data you move to and from memory can help speed things up.

Beware of Overheating

A single ARM processor typically has four or more processing cores, each of these with an FPU and NEON Coprocessor. If you work hard, you can get all these units working at once, theoretically processing a huge amount of data in parallel. The gotcha is that the more circuitry you involve in processing, the more heat is produced.

If you do this, beware that a single board computer, like the Raspberry Pi, can overheat. Similarly, smartphones overheat when they need to sustain too much processing. Often there are guidelines as to how busy you can keep the processor before it starts to overheat.

You won’t damage the processor; it will detect the overheating and slow itself down, undoing all the great work you’ve done.

Summary

In this chapter, we performed several optimizations on our upper-case function. We looked at

1.
Simplifying range comparisons
2.
Using conditional instructions
3.
Simplifying the domain and using bit manipulations
4.
Upper-casing 16 characters at once using the NEON Coprocessor

We then provided several hints to consider when optimizing your code.

In Chapter 15, “Reading and Understanding Code,” we will examine how the C compiler generates code and talk about understanding compiled programs.

Exercises

1.
In our first optimization, consider this alternate pseudo-code:

W5 = W5 - 'a'

IF (W5 >= 0) AND W5 <= ('z'-'a') THEN

W5 = W5 + 'A'

END IF

Why is this incorrect?

2.
Think back to the loops we developed in Chapter 4, “Conditional Program Flow.” Construct a FOR loop using a CSINC statement to do the increment and test for loop end.
3.
Each generation of ARM CPU adds a few more instructions, especially to the NEON Coprocessor. List the pros and cons of utilizing newer instructions to optimize your code.
4.
Set up a way to run each of the programs in this chapter in a large loop, and time how long each one takes. Which technique is fastest and why? Consider using the Linux gettimeofday service.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for 14. Optimizing Code

Create new playlist

Sign In

Sign Up

14. Optimizing Code

Optimizing the Upper-Case Routine

Simplifying the Range Comparison

Using a Conditional Instruction

Example with CSEL

Restricting the Problem Domain

Using Parallelism with SIMD

Tips for Optimizing Code

Avoiding Branch Instructions

Avoiding Expensive Instructions

Don’t Be Afraid of Macros

Loop Unrolling

Keeping Data Small

Beware of Overheating

Summary

Exercises

Table of Contents for
14. Optimizing Code