Tips

The following are a few things you can do in your code relatively easily to achieve better performance.

Inlining Functions

Because function calls can be expensive operations, inlining functions (that is, replacing a function call with the body of the function itself) can make your code run faster. Making a function inline is simply a matter of adding the "inline" keyword to its definition. An example of an inline function is shown in Listing 3–30.

You should use this feature carefully though as it can result in bloated code, negating the advantages of the instruction cache. Typically, inlining works better for small functions, where the overhead of the call itself is significant.

NOTE: Alternatively, use macros.
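For example, a small accessor such as the following is a typical inlining candidate (square is a hypothetical example, not one of the chapter's listings):

```c
/* Hypothetical example: a tiny function marked "inline" so the compiler
   may replace each call with the function body, avoiding call overhead.
   "static" keeps a private copy in this translation unit. */
static inline int square(int x)
{
    return x * x;
}
```

Keep in mind that inline is only a hint: the compiler is free to ignore it, and options such as GCC's -finline-functions or the __attribute__((always_inline)) attribute influence the decision.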

Unrolling Loops

A classic way to optimize loops is to unroll them, sometimes partially. Results will vary and you should experiment in order to measure gains, if any. Make sure the body of the loop does not become too big though as this could have a negative impact on the instruction cache.

Listing 3–34 shows a trivial example of loop unrolling.

Listing 3–34. Unrolling

void add_buffers_unrolled (int* dst, const int* src, int size)
{
    int i;

    for (i = 0; i < size/4; i++) {
        *dst++ += *src++;
        *dst++ += *src++;
        *dst++ += *src++;
        *dst++ += *src++;
        // GCC not really good at that though... No LDM/STM generated
    }
  
    // leftovers (0 to 3 remaining elements)
    if (size & 0x3) {
        switch (size & 0x3) {
            case 3: *dst++ += *src++; // fall through
            case 2: *dst++ += *src++; // fall through
            case 1:
            default:  *dst += *src;
        }
    }
}
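For comparison, the straightforward (not unrolled) version of the same routine would look like this (the name add_buffers is assumed here, as the baseline listing is not reproduced):

```c
// Hypothetical baseline: one addition, one counter update, and one
// branch per element. The unrolled version amortizes that loop
// overhead over four additions per iteration.
void add_buffers (int* dst, const int* src, int size)
{
    int i;

    for (i = 0; i < size; i++) {
        *dst++ += *src++;
    }
}
```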

Preloading Memory

When you know with a certain degree of confidence that specific data will be accessed or specific instructions will be executed, you can preload (or prefetch) this data or these instructions before they are used.

Because moving data from external memory to the cache takes time, issuing the preload early enough for the transfer to complete can result in better performance: the instructions (or data) may then already be in the cache (a cache hit) when they are finally accessed.

To preload data, you can use:

  • GCC's __builtin_prefetch()
  • PLD and PLDW ARM instructions in assembly code

You can also use the PLI ARM instruction (ARMv7 and above) to preload instructions.

Some CPUs automatically preload memory, so you may not always observe any gain. However, since you have a better knowledge of how your code accesses data, preloading data can still yield great results.

Listing 3–35 shows how you can take advantage of the preloading built-in function.

Listing 3–35. Preloading Memory

void add_buffers_unrolled_prefetch (int* dst, const int* src, int size)
{
    int i;

    for (i = 0; i < size/8; i++) {
        __builtin_prefetch(dst + 8, 1, 0); // prepare to write
        __builtin_prefetch(src + 8, 0, 0); // prepare to read
    
        *dst++ += *src++;
        *dst++ += *src++;
        *dst++ += *src++;
        *dst++ += *src++;
        *dst++ += *src++;
        *dst++ += *src++;
        *dst++ += *src++;
        *dst++ += *src++;
    }
    
    // leftovers
    for (i = 0; i < (size & 0x7); i++) {
        *dst++ += *src++;
    }
}

You should be careful about preloading memory though as it may in some cases degrade the performance. Anything you decide to move into the cache will cause other things to be removed from the cache, possibly impacting performance negatively. Make sure that what you preload is very likely to be needed by your code or else you will simply pollute the cache with useless data.

NOTE: While ARM supports the PLD, PLDW, and PLI instructions, x86 supports the PREFETCHT0, PREFETCHT1, PREFETCHT2, and PREFETCHNTA instructions. Refer to the ARM and x86 documentation for more information. Change the last parameter of __builtin_prefetch() and compile for x86 to see which instructions will be used.
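As a sketch of that experiment, the last argument of __builtin_prefetch() is a temporal-locality hint from 0 (no reuse expected) to 3 (heavy reuse expected); on x86, GCC typically maps hint 0 to PREFETCHNTA and hint 3 to PREFETCHT0 (the function name below is made up for illustration):

```c
/* Hypothetical illustration: same address, different locality hints.
   On x86, GCC typically selects PREFETCHNTA for hint 0 and PREFETCHT0
   for hint 3; on ARM, a read prefetch becomes PLD. */
void prefetch_hints(const char* p)
{
    __builtin_prefetch(p, 0, 0); /* read, data not reused soon   */
    __builtin_prefetch(p, 0, 3); /* read, data reused repeatedly */
}
```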

LDM/STM Instead Of LDR/STR

Loading multiple registers with a single LDM instruction is faster than loading registers using multiple LDR instructions. Similarly, storing multiple registers with a single STM instruction is faster than using multiple STR instructions.

While the compiler is often capable of generating such instructions (even when memory accesses are somewhat scattered in your code), you should help it as much as possible by writing code it can more easily optimize. For example, the code in Listing 3–36 shows a pattern the compiler should quite easily recognize and generate LDM and STM instructions for (assuming an ARM ABI). Ideally, accesses to memory should be grouped together whenever possible so that the compiler can generate better code.

Listing 3–36. Pattern to Generate LDM And STM

    unsigned int a, b, c, d;

    // assuming src and dst are pointers to int

    // read source values
    a = *src++;
    b = *src++;
    c = *src++;
    d = *src++;

    // do something here with a, b, c and d

    // write values to dst buffer
    *dst++ = a;
    *dst++ = b;
    *dst++ = c;
    *dst++ = d;

NOTE: Unrolling loops and inlining functions can also help the compiler generate LDM or STM instructions more easily.

Unfortunately, the GCC compiler does not always do a great job at generating LDM and STM instructions. Review the generated assembly code and write the assembly code yourself if you think performance would improve significantly with the use of the LDM and STM instructions.
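If you do drop to assembly, the following sketch shows the idea. The copy4 helper is a hypothetical example: the inline assembly path is only compiled on ARM, with a plain C fallback elsewhere so the function remains portable.

```c
/* Hypothetical helper: copies four ints with a single LDM/STM pair on
   ARM. The clobber list tells GCC that r4-r7 and memory are modified. */
static void copy4(int* dst, const int* src)
{
#if defined(__arm__)
    asm volatile (
        "ldmia %1, {r4-r7}\n\t"  /* one LDM loads four registers  */
        "stmia %0, {r4-r7}"      /* one STM stores all four back  */
        : /* no outputs */
        : "r" (dst), "r" (src)
        : "r4", "r5", "r6", "r7", "memory");
#else
    dst[0] = src[0];  /* portable fallback for non-ARM builds */
    dst[1] = src[1];
    dst[2] = src[2];
    dst[3] = src[3];
#endif
}
```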
