Chapter 14. Optimization Techniques

Things should be made as simple as possible, but not any simpler.

Albert Einstein

This chapter offers some tips to optimize code to reduce resource utilization. These techniques can be roughly divided into strategies for reducing memory usage, increasing code efficiency, and lowering power requirements. The need for low-cost versions of our products drives hardware designers to provide just barely enough memory and processing power to get the job done.

Most of the optimizations performed on code involve a tradeoff between execution speed and code size. Your program can usually be made faster or smaller, but not both; an improvement in one of these areas tends to have a negative impact on the other. It is up to the programmer to decide which of these improvements is most important. Given that single piece of information, the compiler’s optimization phase can make the appropriate choice whenever a speed versus size tradeoff is encountered.

The first step in optimization is to determine which problems you have. You might have size issues, speed issues, or both. If you have only one type of issue, you can simply ask the compiler to optimize for that goal. If you have both size and speed issues, we recommend letting the compiler do what it can to reduce the size of your program. Then you can find the time-critical code or bottlenecks (where the program is spending most of its time) and manually optimize that code for speed. (In battery-powered devices, every unnecessary processor cycle results in reduced runtime; there, it usually makes sense to optimize for speed across the entire application.)

Execution speed usually matters only in a few portions of the code: those with short deadlines and those that are executed most frequently. There are many things you can do by hand to improve the efficiency of those sections. Code size, however, is a difficult thing to influence manually, and the compiler is in a much better position to make this change across all of your software modules.

Increasing Code Efficiency

By the time your program is working, you might already know, or have a pretty good idea, which functions and modules are the most critical for overall code efficiency. ISRs, high-priority tasks, calculations with real-time deadlines, and functions that are either compute-intensive or frequently called are all likely candidates.

A tool called a profiler, included with some software development suites, can be used to narrow your focus to those routines in which the program spends most (or too much) of its time. A profiler collects and reports execution statistics for a program. These execution statistics include the number of calls to and the total time spent in each routine.

Once you’ve identified the routines that require greater code efficiency, you can use the following techniques to reduce their execution time. Note that the techniques described here are very compiler-dependent. In most cases, there aren’t general rules that can be applied in all situations. The best way to determine if a technique will provide improvement is to look at the compiler’s assembly output.

Inline functions

In C99, the keyword inline can be added to any function declaration. This keyword asks the compiler to replace all calls to the indicated function with copies of the function body. This eliminates the runtime overhead associated with the function call and is most effective when the function is called frequently but contains only a few lines of code.

Inline functions provide a perfect example of how execution speed and code size are sometimes inversely linked. The repetitive addition of the inline code will increase the size of your program in direct proportion to the number of times the function is called. And, obviously, the larger the function, the more significant the size increase will be. However, you will lose the overhead of setting up the stack frame if parameters are passed into the function. The resulting program runs faster but requires more code memory.
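
For example, a small accessor function that is called frequently is a good candidate for inlining. The following is a minimal sketch; the function names and the scaling operation are hypothetical:

#include <stdint.h>

/* A hypothetical example: a tiny, frequently called accessor. */
static inline uint16_t scaleSample(uint16_t raw)
{
    return (uint16_t)((raw * 3U) / 4U);
}

uint16_t filterSample(uint16_t adcValue)
{
    /* The compiler can substitute the body of scaleSample here,
       eliminating the call, return, and stack-frame overhead. */
    return scaleSample(adcValue);
}

Note that inline is only a request; the compiler is free to ignore it, and many compilers also provide extensions to force inlining where it really matters.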

Table lookups

A switch statement is one common programming technique that should be used with care. Each test and jump that makes up the machine language implementation uses up valuable processor time simply deciding what work should be done next. To speed things up, try to put the individual cases in order by their relative frequency of occurrence. In other words, put the most likely cases first and the least likely cases last. This will reduce the average execution time, though it will not improve at all upon the worst-case time.

If there is a lot of work to be done within each case, it might be more efficient to replace the entire switch statement with a table of pointers to functions. For example, the following block of code is a candidate for this improvement:

enum NodeType {NODE_A, NODE_B, NODE_C};
     
switch (getNodeType())
{
    case NODE_A:
        .
        .
    case NODE_B:
        .
        .
    case NODE_C:
        .
        .
}

To speed things up, replace this switch statement with the following alternative. The first part of this is the setup: the creation of an array of function pointers. The second part is a one-line replacement for the switch statement that executes more efficiently.

int processNodeA(void);
int processNodeB(void);
int processNodeC(void);
     
/* Establishment of a table of pointers to functions. */
int (* nodeFunctions[])(void) = {processNodeA, processNodeB, processNodeC};
     
.
.

/* The entire switch statement is replaced by the next line. */
status = nodeFunctions[getNodeType()]();

Hand-coded assembly

Some software modules are best written in assembly language. This gives the programmer an opportunity to make them as efficient as possible. Though most C compilers produce much better machine code than the average programmer, a skilled and experienced assembly programmer might do better work than the compiler for a given function.

For example, on one of our past projects, a digital filtering algorithm was implemented in C and targeted to a TI TMS320C30 DSP. The compiler was unable to take advantage of a special instruction that performed exactly the mathematical operations needed. By manually replacing one for loop of the C program with inline assembly instructions that did the same thing, overall computation time decreased by more than a factor of 10.

Register variables

The keyword register can be used when declaring local variables. This asks the compiler to place the variable into a general-purpose register rather than on the stack. Used judiciously, this technique provides hints to the compiler about the most frequently accessed variables and will somewhat enhance the performance of the function. The more frequently the function is called, the more likely it is that such a change will improve the code’s performance. But some compilers ignore the register keyword.
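
A minimal sketch (the function and data are hypothetical):

long sumSamples(const int *samples, int count)
{
    register long total = 0;   /* hint: keep the accumulator in a register */
    register int  idx;

    for (idx = 0; idx < count; idx++)
    {
        total += samples[idx];
    }

    return total;
}

One side effect to remember: you cannot take the address of a variable declared register.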

Global variables

It is sometimes more efficient to use a global variable than to pass a parameter to a function. This eliminates the need to push the parameter onto the stack before the function call and pop it back off once the function is completed. In fact, the most efficient implementation of any subroutine would have no parameters at all. However, the decision to use a global variable can also have some negative effects on the program. The software engineering community generally discourages the use of global variables in an effort to promote the goals of modularity and reentrancy, which are also important considerations.

Polling

ISRs are often used to improve a program’s responsiveness. However, there are some rare cases in which the overhead associated with the interrupts actually causes inefficiency. These are cases in which the average time between interrupts is of the same order of magnitude as the interrupt latency. In such cases, it might be better to use polling to communicate with the hardware device. But this too can lead to a less modular software design.

Fixed-point arithmetic

Unless your target platform features a floating-point processor, you’ll pay a very large penalty for manipulating float data in your program. The compiler-supplied floating-point library contains a set of software subroutines that emulate the floating-point instructions. Many of these functions take a long time to execute relative to their integer counterparts and also might not be reentrant.

If you are using floating-point for only a few calculations, it might be better to implement the calculations themselves using fixed-point arithmetic. For example, two fractional bits representing a value of 0.00, 0.25, 0.50, or 0.75 are easily stored in any integer by merely multiplying the real value by 4 (i.e., shifting left by 2 bits). Addition and subtraction can be accomplished via the integer instruction set, as long as both values have the same implied binary point. Multiplication and division can be accomplished similarly, provided the other operand is a whole integer.
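
Here is a minimal sketch of that two-fractional-bit scheme; the macro and function names are hypothetical:

#include <stdint.h>

/* Values are stored scaled by 4, i.e., with two fractional bits. The
   floating-point literals below are evaluated at compile time only. */
#define TO_FIXED(x)    ((int32_t)((x) * 4))

int32_t fixedPointDemo(void)
{
    int32_t a = TO_FIXED(2.75);   /* stored as 11 */
    int32_t b = TO_FIXED(1.25);   /* stored as  5 */

    /* Addition works directly because both values share the same
       implied binary point: 11 + 5 = 16, which represents 4.00. */
    int32_t sum = a + b;

    /* Multiplication by a whole integer also works directly:
       11 * 3 = 33, which represents 8.25. */
    int32_t tripled = a * 3;

    return sum + tripled;
}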

It is theoretically possible to perform any floating-point calculation with fixed-point arithmetic. (After all, that’s how the floating-point software library does it, right?) Your biggest advantage is that you probably don’t need to implement the entire IEEE 754 standard just to perform one or two calculations. If you do need that kind of complete functionality, stick with the compiler’s floating-point library and look for other ways to speed up your program.

Variable size

It is typically best to use the processor’s native register width for variables whenever possible (whether it is 8, 16, or 32 bits). This allows the compiler to produce code that makes full use of the processor’s fast internal registers and native load and store instructions. Obviously, you need to ensure that the variable size accommodates the number range that the variable represents. For example, if you need a count that goes from 0 to 512, you can’t use an 8-bit variable.

A variable size tailored to the processor can also speed up processing by limiting the number of external memory accesses. If a processor has a 16-bit data bus and it needs to access a 32-bit variable in external RAM, two data fetches must be performed for the processor to get the variable.

C99 defines integer types int_fastN_t and uint_fastN_t (where N is the minimum width in bits) in stdint.h. These types are meant to be used when you need at least N bits to store your data but don’t care if the variable is wider than N; the compiler is then free to choose whatever width gives the fastest access. These “fast” integer types are thus no good for use with peripheral registers, which always have a fixed width that can be neither larger nor smaller than specified.
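
For example, a loop counter can use a “fast” type and let the compiler choose the actual width, whereas a memory-mapped register must keep its exact width. This sketch uses a hypothetical table and a placeholder register address:

#include <stdint.h>

uint32_t sumTable(const uint8_t *table)
{
    /* At least 16 bits are needed for the count; the compiler may pick
       a wider type if that is faster on the target. */
    uint_fast16_t idx;
    uint32_t      total = 0;

    for (idx = 0; idx < 1000; idx++)
    {
        total += table[idx];
    }

    return total;
}

/* A peripheral register, by contrast, needs an exact-width type.
   The address below is a placeholder, not a real device. */
#define STATUS_REG (*(volatile uint16_t *) 0x40002000)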

Loop unrolling

In some cases, repetitive loop code can be optimized by performing loop unrolling. In loop unrolling, the loop overhead (the increment, test, and branch performed on each pass through the loop) is eliminated by writing out the iterations explicitly. Here’s an example of a for loop:

for (idx = 0; idx < 5; idx++)
{
    value[idx] = incomingData[idx];
}

Here’s the unrolled version without the loop overhead:

value[0] = incomingData[0];
value[1] = incomingData[1];
value[2] = incomingData[2];
value[3] = incomingData[3];
value[4] = incomingData[4];

Some compilers offer loop unrolling as an optimization; in other cases, it might be better for the developer to code it. It is helpful to check the assembly output from the compiler to see whether efficiency has actually been improved.

The amount of unrolling that you (or the compiler) choose to do must balance the gain in speed against the increase in code size; this is another situation in which you trade code size for speed. Also, loop unrolling can be used only when the number of iterations through the loop is fixed and known in advance. One example of an optimized implementation of loop unrolling is the coding technique known as Duff’s device (http://en.wikipedia.org/wiki/Duff’s_device).

Decreasing Code Size

As stated earlier, when it comes to reducing code size, your best bet is to let the compiler do the work for you. However, if the resulting program is still too large for your available ROM, there are several programming techniques you can use to further reduce the size of your program.

Once you’ve got the automatic optimizations working, take a look at these tips for further reducing the size of your code by hand:

Avoid standard library routines

One of the best things you can do to reduce the size of your program is to avoid using large standard library routines. Many of the largest routines are costly in terms of size because they try to handle all possible cases. For example, the strupr function might be small, but a call to it might drag other functions such as strlower, strcmp, strcpy, and others into your program whether they are used or not.

It might be possible to implement a subset of the functionality yourself with significantly less code. For example, the standard C library’s sprintf routine is notoriously large. Much of this bulk is located within the floating-point manipulation routines on which it depends. But if you don’t need to format and display floating-point values (%a, %e, %f, or %g), you could write your own integer-only version of sprintf and save several kilobytes of code space. In fact, a few implementations of the standard C library (Cygnus’s newlib comes to mind) include just such a function, called siprintf.
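
For instance, if all you ever need to display are unsigned decimal values, a small hand-rolled conversion routine like the following sketch (the function name is hypothetical) can stand in for sprintf:

/* Convert an unsigned integer to a decimal string. The caller's buffer
   must be large enough for the digits plus a terminating NUL. */
void uintToString(unsigned int value, char *buffer)
{
    char tmp[12];   /* enough digits for a 32-bit value */
    int  len = 0;

    /* Generate the digits in reverse order. */
    do
    {
        tmp[len++] = (char)('0' + (value % 10));
        value /= 10;
    } while (value > 0);

    /* Copy them back out in the correct order. */
    while (len > 0)
    {
        *buffer++ = tmp[--len];
    }

    *buffer = '\0';
}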

Use goto statements

As with global variables, good software engineering practice dictates against the use of this technique. But in a pinch, goto statements can be used to remove complicated control structures or to share a block of oft-repeated code.

For example, many programmers use the goto statement to bail out of a routine in case of error. In this way, any cleanup work that must be done before exiting the routine is grouped together in one place, as shown here:

int functionWork(void)
{
    int error = 0;

    /* Do some work here, setting error if anything goes wrong. */
    ...

    /* If there was an error doing the work, exit. */
    if (error)
        goto CLEANUP;

    /* Do some more work here, setting error if anything goes wrong. */
    ...

    /* If there was an error doing the work, exit. */
    if (error)
        goto CLEANUP;

    ...

    /* Otherwise, everything succeeded. */
    return SUCCESS;

CLEANUP:
    /* Clean up code here. */

    return FAILURE;
}

In addition to these techniques for reducing code size, several of the ones described in the prior section could be helpful, specifically table lookups, hand-coded assembly, register variables, and global variables. Of these techniques, the use of hand-coded assembly usually yields the largest decrease in code size.

Problems with Optimizing Compilers

The GNU C compiler has several optimization command-line options, all of which are variants of -O. Specifying -O3 turns on the most aggressive speed optimizations, without regard for their effect on code size. The command-line option -Os instead optimizes the code for size. For a detailed explanation of the different gcc optimization levels, refer to the gcc online manual at http://gcc.gnu.org/onlinedocs.

Murphy’s Law dictates that the first time you enable the compiler’s optimization feature, your previously working program will suddenly fail. Perhaps the most notorious of the automatic optimizations is “dead code elimination.” This optimization eliminates code that the compiler believes to be either redundant or irrelevant. For example, adding zero to a variable requires no runtime calculation whatsoever. But you might still want the compiler to generate those “irrelevant” instructions if they perform some function that the compiler doesn’t know about.

For example, given the following block of code, most optimizing compilers would remove the first statement because the value of *pControl is not used before it is overwritten (on the third line):

    *pControl = DISABLE;
    *pData    = 'a';
    *pControl = ENABLE;

But what if pControl and pData are actually pointers to memory-mapped device registers? In that case, the peripheral device would not receive the DISABLE command before the byte of data was written. This could potentially wreak havoc on all future interactions between the processor and this peripheral. To protect yourself from such problems, you must declare all pointers to memory-mapped registers and global variables that are shared between tasks (or a task and an ISR) with the keyword volatile. And if you miss just one of them, Murphy’s Law will come back to haunt you in the final days of your project—guaranteed.
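
A sketch of the corresponding declarations follows; the register addresses are placeholders, not a real device map:

#include <stdint.h>

/* Declaring the pointed-to registers volatile tells the compiler that
   every access matters, so the DISABLE write cannot be removed and the
   volatile accesses cannot be reordered with respect to one another. */
volatile uint8_t * const pControl = (volatile uint8_t *) 0x40001000;
volatile uint8_t * const pData    = (volatile uint8_t *) 0x40001004;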

Warning

Never make the mistake of assuming that the optimized program will behave the same way as the unoptimized one. You must completely retest your software at each new optimization level to be sure its behavior hasn’t changed.

To make matters worse, debugging an optimized program is challenging, to say the least. With the compiler’s optimization enabled, the correlation between a line of source code and the set of processor instructions that implements that line is much weaker. Those particular instructions might have moved or been split up, or two similar code blocks might now share a common implementation. In fact, some lines of the high-level language program might have been removed from the program altogether (as they were in the previous example)! As a result, you might be unable to set a breakpoint on a particular line of the program or examine the value of a variable of interest.

Reducing Memory Usage

In some cases, RAM rather than ROM is the limiting factor for your application. In these cases, you’ll want to reduce your dependence on global data, the stack, and the heap. These are all optimizations better made by the programmer than by the compiler.

Because ROM is usually cheaper than RAM (on a per-byte basis), one acceptable strategy for reducing the amount of global data might be to move constant data into ROM. This can be done automatically by the compiler if you declare all of your constant data with the keyword const. Most C compilers place all of the constant global data they encounter into a special data segment that is recognizable to the locator as ROM-able. This technique is most valuable if there are lots of strings or table-oriented data that will not change at runtime.
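
For example, a table of fixed strings declared const can be placed by the toolchain in ROM instead of RAM (the contents here are hypothetical):

/* Because everything here is const, the linker/locator can place the
   table and the strings it points to in a ROM-able data segment. */
const char * const errorMessages[] =
{
    "No error",
    "Timeout",
    "Checksum failure",
};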

If some of the data is fixed once the program is running but not necessarily constant, the constant data segment could be placed in a hybrid memory device such as flash or EEPROM. This memory device could then be updated over a network or by a technician assigned to make the change. An example of such data is the sales tax rate for each locale in which your product will be deployed. If a tax rate changes, the memory device can be updated, but additional RAM can be saved in the meantime.

Stack size reductions can also lower your program’s RAM requirement. One way to figure out approximately how much stack you need is to fill the entire memory area reserved for the stack with a special data pattern, such as 0xAAAA. Then, after the software has been running for a while—under both normal and stressful conditions—use a debugger to examine the modified stack. The part of the stack memory area that still contains your special data pattern has never been overwritten, so you can reduce the size of the stack area by that amount. [1]
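
The check can also be done in code. The following sketch assumes a full-descending stack and linker-provided symbols marking the ends of the stack region; the symbol names are hypothetical and toolchain-specific:

#include <stddef.h>
#include <stdint.h>

#define STACK_FILL_PATTERN 0xAAAAAAAAu

/* Hypothetical linker symbols bounding the reserved stack region. */
extern uint32_t __stack_limit__[];   /* lowest address of the stack area  */
extern uint32_t __stack_top__[];     /* highest address of the stack area */

/* Count the bytes at the bottom of the stack that still hold the fill
   pattern; these bytes were never used by the program. */
size_t stackBytesUnused(void)
{
    const uint32_t *p = __stack_limit__;
    size_t          unused = 0;

    while ((p < __stack_top__) && (*p == STACK_FILL_PATTERN))
    {
        unused += sizeof(*p);
        p++;
    }

    return unused;
}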

Be especially conscious of stack space if you are using a real-time operating system. Preemptive operating systems create a separate stack for each task. These stacks are used for function calls and ISRs that occur within the context of a task. You can determine the amount of memory required for each task stack in the manner previously described. You might also try to reduce the number of tasks or switch to an operating system that has a distinct “interrupt stack” for execution of all ISRs. The latter method can significantly reduce the stack size requirement of each task.

The size of the heap is limited to the amount of RAM left over after all of the global data and stack space has been allocated. If the heap is too small, your program will not be able to allocate dynamic memory when it is needed, so always be sure to compare the result of malloc with NULL before dereferencing the memory you tried to allocate. If you’ve tried all of these suggestions and your program is still requiring too much memory, you might have no choice but to eliminate the heap altogether. This isn’t entirely bad in the case of embedded systems, which frequently allocate all memory needed by the system at initialization time.

Note that many embedded programmers avoid the use of malloc, and thus the need for a heap, altogether. But the key benefit of dynamic memory allocation is that you don’t need to spend RAM to keep variables around that are only used briefly in the program. This is a way to reduce total memory utilization.
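
If you do eliminate the heap, the usual replacement is a set of statically allocated buffers handed out at initialization time. A rough sketch of the pattern (the names and sizes are hypothetical):

#include <stddef.h>

#define NUM_PACKET_BUFFERS  8
#define PACKET_BUFFER_SIZE  128

/* A fixed pool reserved at compile time instead of malloc'd at runtime. */
static unsigned char packetPool[NUM_PACKET_BUFFERS][PACKET_BUFFER_SIZE];
static int           nextFreeBuffer = 0;

/* Hand out buffers during initialization; returns NULL when exhausted. */
unsigned char *allocPacketBuffer(void)
{
    if (nextFreeBuffer >= NUM_PACKET_BUFFERS)
    {
        return NULL;
    }

    return packetPool[nextFreeBuffer++];
}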

Power-Saving Techniques

A major concern in battery-powered embedded systems design is power consumption. In this section, we take a brief look at areas where embedded software can assist in conserving the system’s vital energy source.

Power issues, such as how long the device needs to run and whether the batteries can be recharged, need to be thought out ahead of time. In some systems, replacing a battery in a device can be a big expense. This means the system must be conscious of the amount of power it uses and take appropriate steps to conserve battery life.

There are several methods to conserve power in an embedded system, including clock control, power-sensitive processors, low-voltage ICs, and circuit shutdown. Some of these techniques must be addressed by the hardware designer when selecting the system ICs; lower-power versions of certain peripherals may be available. Other power-saving techniques are under software control.

It might seem ideal to select the fastest and most powerful processor available for a particular embedded system. However, one of the tasks of the hardware designer is to use just enough processing power to enable the device to get its job done. This helps reduce the power consumed by the device. The processor selected plays a key role in determining the amount of power an embedded system will consume. In addition, some processors can automatically shut down different execution units when they are not in use.

Processor Modes

Many embedded processors offer different operating modes as a way to conserve power. These modes allow the software to scale processor power consumption to match the moment-by-moment needs of the application. For example, the Arcom board’s PXA255 processor has four operating modes:

Turbo mode

The processing core runs at the peak frequency. Minimizing external memory accesses would be worthwhile in this mode, because the processor would have to wait for the external memory.

Run mode

The processor core runs at its normal frequency. This is the normal or default operating mode.

Idle mode

The processor core is not clocked, but the other peripheral components operate as normal.

Sleep mode

This is the lowest power state for the processor.

Understanding the details of these modes and how to get into and out of them is key. For example, the PXA255 can conserve power by entering and exiting idle mode multiple times in a second, because the processor is quickly reactivated in the prior state. However, in sleep mode, the processor state is not maintained and may require a complete system reboot when exiting this mode.

Operating the processor in different modes can save quite a bit of power. The power consumption for the PXA255 processor (running at 200 MHz) in normal run mode is typically 178 mW. While in idle mode, the PXA255 typically consumes 63 mW.

There are several issues to consider when planning the power management software design. First, you must ensure that each task is able to get enough cycles to perform its assigned work. If a system doubles battery life by entering idle modes often but is thus unable to perform its work, the product fails to meet its design goals.

You also need to determine when the system is not doing anything, what events should wake it up, and how the processor is brought back into operation. For example, in an embedded system that sends some data across a network every few minutes, it makes sense to shut down the device to conserve power until it is time to send the data. The device must still be able to wake up in case an error condition arises. Therefore, you must understand how a peripheral circuit wakes up the processor when the processor needs to operate (including how long the circuit takes to wake up and whether any reinitialization needs to be done).

The optimization techniques presented earlier in this chapter can be used to conserve power as well. By reducing the amount of execution time for the main tasks in a system, you allow the system to spend more time in its low-power state.

Even though the processor is in idle mode, various peripherals still operate and can be programmed to wake up the processor. Typically, interrupts can be used to wake up the processor to perform some task. This is why power management must be considered when designing the software. For example, some behaviors can be achieved by polling. But when the events are less frequent and power management is an issue, it makes sense to use interrupts rather than polling because this allows the processor to sleep for the maximum amount of time before waking up to handle an event. When you choose to use polling, the processor must constantly perform the polling operation, which typically happens at a set interval. This wastes power when the polling operation executes and no events have occurred.

You can also take advantage of peripherals that operate while the processor is in idle mode. For example, if you are transferring data from an external peripheral into RAM and need to process the data once a certain amount of data is received, you can use the DMA controller. This way, instead of the processor handling each byte received, it sleeps while the data is transferred. You can configure the DMA controller to interrupt the processor once this data has been received.

Clock Frequency

Another power-saving technique that can be controlled by software is to vary processor clock speeds. Some processors accept a fixed-input clock frequency but feature the ability to reduce internal clock speeds by programming clock configuration registers. Software can reduce the clock speed to save power during the execution of noncritical tasks and increase the clock speed when processing demands are high.

The PXA255 datasheet shows the power consumption while the processor core operates at different frequencies. Table 14-1 shows a comparison for three different PXA255 core clock frequencies and the associated power consumption at each frequency.

Table 14-1. PXA255 power consumption comparison

Processor core clock speed (MHz)    Power consumption (mW)
400                                 411
300                                 283
200                                 178

As the software designer, you need to understand what happens during the frequency change sequence—what to do if an interrupt occurs during the frequency change and what needs to be reconfigured (such as DRAM refresh cycles) for the new frequency.

You will need comprehensive knowledge of all software operation in the system if you decide to alter the processor frequency on the fly. For example, it can be tricky to know when to lower the clock speed when a multitasking RTOS is used.

In other cases, particular peripherals can be completely disabled when they are not in use. For example, if a particular peripheral module is not used in the PXA255 processor, the clock to that unit can be disabled using the Clock Enable Register (CKEN).
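
From the software’s point of view, such clock gating is usually just a read-modify-write of a memory-mapped clock-enable register. The following sketch uses a placeholder address and bit position rather than the real CKEN values; consult the processor reference manual for those:

#include <stdint.h>

/* Placeholder address and bit for a clock-enable register. */
#define CLOCK_ENABLE_REG      (*(volatile uint32_t *) 0x40000010)
#define UNUSED_PERIPH_CLOCK   (1u << 5)

void disableUnusedPeripheralClock(void)
{
    /* Clearing the bit stops the clock to the unused peripheral. */
    CLOCK_ENABLE_REG &= ~UNUSED_PERIPH_CLOCK;
}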

External Memory Access

There are several things that can be done to reduce external memory accesses. If a cache is available, you can enable it to avoid having the processor fetch data or instructions from external memory. A cache is very high-speed, on-chip memory that supplies the most recently used instructions and/or data to the processor with no or few wait states.

Similarly, internal processor memory can be used, if available. In some cases, the internal memory can be used for both data and code. It might not be feasible to incorporate all the system software in internal memory. If there is not enough memory available for the entire system software, you must determine what data and code should be included. It’s best to include the stack and frequently used variables or functions for limiting external memory accesses. In an embedded system where power consumption is the top priority, it might make sense to switch to a processor with more on-chip memory in order to reduce off-chip accesses.

Optimization techniques can help here as well. For power optimization, instead of focusing solely on speed or code size, you need to focus on analyzing code to determine how to reduce external bus transactions.

These are just a few considerations to keep in mind when working on power management software. Future products may include technology with self-renewing sources of energy. There are ways to harvest energy for circuits that are self-powered using sources such as vibration, light, and thermal sources. These techniques are discussed in the June 2005 Embedded Systems Design article “Energy-Harvesting Chips: The Quest for Everlasting Life,” which can be found online at http://www.embedded.com.

Limiting the Impact of C++

One of the issues we faced upon deciding to write this book was whether or not to include C++ in the discussion and examples. For almost all of the projects we have worked on throughout our respective careers, the embedded software was written in C and assembly language. In addition, there has been much debate within the embedded software community about the appropriateness of C++. It is widely believed that C++ programs produce larger executables that run more slowly than programs written entirely in C. However, C++ has many benefits for programmers.

We believe that many readers will face the choice of using C++ in their embedded programming. This section covers some of the C++ features that are useful for embedded system software and warns you about some of the more expensive features in the language.

Of course, not every feature of C++ is expensive. In fact, the earliest C++ compilers used a translator called cfront to turn C++ programs into C, which was then fed into a standard C compiler. That this is even possible demonstrates that many of the syntactical differences between the languages have little or no runtime cost. [2] For example, the definition of a class is completely benign. The list of public and private member data and functions is not much different from a struct and a list of function prototypes. However, the C++ compiler is able to use the public and private keywords to determine which method calls and data accesses are allowed and prohibited. Because this determination is made at compile time, there is no penalty paid at runtime. Thus, the use of classes alone affects neither the code size nor efficiency of your programs.

Default parameter values are also penalty-free. The compiler simply inserts code to pass the default value whenever the function is called without an argument in that position. Similarly, function name overloading involves only a compile-time code modification. Functions with the same names but different parameters are each assigned unique names during the compilation process. The compiler alters the function name each time it appears in your program, and the linker matches them up appropriately.

Operator overloading is another feature that might be used in embedded systems. Whenever the compiler sees such an operator, it simply replaces it with the appropriate function call. So in the C++ code listing that follows, the last two lines are equivalent, and the performance penalty is easily understood:

Complex  a, b, c;
     
c = operator+(a, b);           // The traditional way: Function Call
c = a + b;                     // The C++ way: Operator Overloading

Constructors and destructors have a slight penalty. These special methods are guaranteed to be called each time an object of the type is created or goes out of scope, respectively. However, this small amount of overhead is a reasonable price to pay for fewer bugs. Constructors eliminate an entire class of C programming errors having to do with uninitialized data structures. This feature has also proved useful for hiding the awkward initialization sequences associated with some classes.

Virtual functions also have a reasonable cost/benefit ratio. Without going into too much detail about what virtual functions are, let’s just say that polymorphism would be impossible without them. And without polymorphism, C++ would not be a true object-oriented language. The only significant cost of virtual functions is one additional memory lookup before a virtual function can be called. Ordinary function and method calls are not affected.

The features of C++ that are typically too expensive for embedded systems are templates, exceptions, and runtime type identification. All three of these negatively impact code size, and exceptions and runtime type identification also increase execution time. Before deciding whether to use these features, you might want to do some experiments to see how they will affect the size and speed of your own application.



[1] Of course, you probably want to leave a little extra space on the stack, in case your testing didn’t last long enough or did not accurately reflect all possible runtime scenarios. Never forget that a stack overflow is a potentially fatal event for your software and should be avoided at all costs.

[2] Moreover, we want to make clear that there is no penalty for compiling an original C program with a C++ compiler.
