5

Optimizing Performance

Optimizing software for performance is an important topic in Cortex-M software development. Extracting ample application performance from a particular Cortex-M device is essential to avoid selecting a processor with a higher cost, larger silicon area, or higher power consumption than needed. Performance analysis can be used to select the correct processor for a given use case and, once a processor has been selected, to tune the overall system.

In Chapter 1, Selecting the Right Hardware, we looked at the wide range of Cortex-M processors and reviewed use cases and hardware characteristics to help determine the right processor for an application. Processor selection, however, is only one variable contributing to building a system with optimal performance. There are two other key factors: compiler settings and algorithm implementation. Measuring performance and making informed changes to these three variables will lead to a solid design with excellent performance.

In this chapter, we use one example software use case (a simple dot product calculation of two vectors) and analyze how its performance is affected by changing the processor, algorithm implementation, and compiler options. In the process, we will demonstrate a method for taking performance measurements of a critical section of code.

In a nutshell, the following topics will be covered:

  • Our algorithm – the dot product
  • Measuring cycle count
  • Measuring dot product performance
  • Optimization takeaways

Our algorithm – the dot product

We will use the dot product, also called the scalar product, as a straightforward algorithm to clearly demonstrate the concepts of performance optimization. The dot product of vectors provides information about the lengths and angles of vectors and is frequently used in ML applications. The dot product is a very easy calculation for teaching purposes. It also can be done in multiple ways and can take advantage of vector processing hardware. In fact, the Arm Cortex-A processors have special instructions to increase the performance of dot product computation (the Cortex-M processors, as of yet, do not).

Of course, in a realistic setting, your software algorithms will be more complex than a simple dot product. The same underlying principles of optimization will apply, however, as there are many ways to solve any given problem in software. Understanding how to compare these implementations quickly will enable effective system optimization.

As a quick review of what the dot product does, let’s take two vectors and calculate the dot product:

V1 = [1, 3, -5] 
V2 = [4, -2, -1]

First, we multiply the vectors element by element to create a new vector:

V3 = [4, -6, 5]

Finally, we sum the values of the vector to produce a scalar value:

DP = 4 + (-6) + 5
DP = 3

The following is a simple C program using integers to calculate the dot product for vectors v1 and v2. Copy the text into a file called dot-simple.c or get it from the book’s GitHub project (https://github.com/PacktPublishing/The-Insiders-Guide-to-Arm-Cortex-M-Development/tree/main/chapter-5/dotprod-personal-computer):

#include <stdio.h>
int dot_product(int v1[], int v2[], int length)
{
    int sum = 0;
    for (int i = 0; i < length; i++)
      sum += v1[i] * v2[i];
    return sum;
}
int main(void)
{
    int len = 3;
    int v1[] = {1, 3, -5};
    int v2[] = {4, -2, -1};
    int dp = dot_product(v1, v2, len);
    printf("Dot product is %d
", dp);
    return 0;
}

On a computer with a C compiler, build and run it. The following commands demonstrate using GCC to compile and run:

$ gcc dot-simple.c -o dot-simple

$ ./dot-simple

Dot product is 3

The authors’ recommended quick and free C compiler options are listed here, by OS, for your convenience:

  • Windows: MSYS2 (https://www.msys2.org/)
  • Linux: GCC (installed by default on most distributions)
  • Mac: Clang via Xcode

With this understanding of the dot product example, let’s see how to measure the cycles required to compute the dot product on a Cortex-M microcontroller.

Measuring cycle count

Cortex-M processors provide multiple ways to compare the performance of different implementations of an algorithm. One way is to use a cycle counter register. Another way is to use a timer to measure the time of execution and convert the time into clock cycles using the processor frequency. The most recent Cortex-M microcontrollers contain a full performance monitoring unit (PMU), which enables software to get information about the count of various events occurring while the software is executing. One of the measured events can be a cycle counter, but additional events such as cache and memory accesses can also be counted.

This section will cover how to use the most common methods of counting cycles for a specific section of code on Cortex-M microcontrollers. First, we will introduce the System Tick Timer (SysTick), followed by Data Watchpoint and Trace (DWT). Both of these programming interfaces can be used to count clock cycles for a benchmark or section of software. Examples are given in context later in this chapter.

System Tick Timer

SysTick is a system timer peripheral included inside the Cortex-M processor. It is a count-down timer that can generate interrupts to the processor core and is typically used for operating system context switching or timekeeping. It can also be used to measure how long a section of code takes to execute.

SysTick has a simple programming interface that is used to set up the timer. The three primary registers used to control the timer are the following:

  • Control and status register: Used to configure, start, and stop the timer
  • Reload value register: Used to load a value into the counter
  • Current value register: Returns the current value of the counter

The counter is 24 bits wide, so it can measure any section of code that runs for fewer than 16,777,215 clock cycles. As an example, for hardware running at 100 MHz, SysTick can measure code running for up to about 168 milliseconds without overflowing the counter.
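For reference, converting a measured cycle count back into time is just a division by the core clock frequency. Here is a minimal sketch of a helper for that conversion (the 100 MHz clock is only this example’s assumption, not a property of any particular board):

#include <stdint.h>

/* Convert a cycle count into microseconds for a given core clock in Hz.
   Illustrative helper; the 100,000,000 Hz value below is an assumption. */
static inline uint32_t cycles_to_us(uint32_t cycles, uint32_t clk_hz)
{
    return (uint32_t)(((uint64_t)cycles * 1000000u) / clk_hz);
}

/* Example: cycles_to_us(16777215u, 100000000u) returns about 167,772 us (~168 ms). */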

To create a simple interface to SysTick, use the systick.h file from the book’s GitHub project: https://github.com/PacktPublishing/The-Insiders-Guide-to-Arm-Cortex-M-Development/tree/main/chapter-5/dotprod-pico/systick.h. The contents of the file are shown here for convenience:

#include <stdint.h>
void start_systick(void);
void stop_systick(void);
/* Systick variables */
#define SysTick_BASE          (0xE000E000UL +  0x0010UL)
#define SysTick_START         0xFFFFFF
#define SysTick_CSR           (*((volatile uint32_t*)(SysTick_BASE + 0x0UL)))
#define SysTick_RVR           (*((volatile uint32_t*)(SysTick_BASE + 0x4UL)))
#define SysTick_CVR           (*((volatile uint32_t*)(SysTick_BASE + 0x8UL)))
#define SysTick_Enable        0x1
#define SysTick_ClockSource   0x4

The systick.h file contains the location of the peripheral registers in the Cortex-M memory map, the constants to use for programming, as well as the function prototypes to start and stop the counter.

The control logic is found in the systick.c file, also located in the book’s GitHub project. The contents of the file are shown here for convenience:

#include <stdio.h>
#include "systick.h"
void start_systick()
{
    SysTick_RVR = SysTick_START;
    SysTick_CVR = 0;
    SysTick_CSR |=  (SysTick_Enable | SysTick_ClockSource);
}
void stop_systick()
{
    SysTick_CSR &= ~SysTick_Enable;
    uint32_t cycles = (SysTick_START - SysTick_CVR);
    printf("CCNT = %u
", cycles);
    if (SysTick_CSR & 0x10000)
        printf("WARNING: counter has overflowed, more than 16,777,215 cycles");
}

This file contains a function to start the timer and another to stop the timer and read the cycle count. The start function sets the clock source as the processor clock.

You can now use SysTick in a Cortex-M application to count clock cycles by wrapping a section of code with the start and stop functions. The cycle count prints automatically from the printf statement included in the stop_systick function. To test this, try putting the start and stop functions around printf() to measure how many cycles it takes:

#include <stdio.h>
#include "systick.h"
int main()
{
     (void) start_systick();
     printf("Count this hello world using SysTick
");
     (void) stop_systick();
}

Note that the preceding code will only run on a Cortex-M microcontroller with SysTick so don’t try to run it on your laptop! We will use this SysTick measurement method later in this chapter on Cortex-M devices.

Next, let’s look at DWT as another way to count clock cycles.

Data Watchpoint and Trace

Another way to count clock cycles on Cortex-M microcontrollers is using DWT. Not all Cortex-M microcontrollers have DWT implemented, but most do. The Cortex-M0+ used in the Raspberry Pi Pico does not include DWT for cycle count measurement.

DWT provides functionality beyond counters. It includes hardware watchpoints and triggers for debugging. Our focus will be on using the cycle counter and the additional counters that report various hardware events. If interested, refer to the technical reference manual for a particular Cortex-M processor to learn about the full DWT features and complete set of registers.

DWT includes six counters that are easy for software to access when measuring performance. Here is the list:

  • Clock cycle count
  • Number of folded instructions
  • Cycles performing loads and stores
  • Cycles sleeping
  • Count of instruction cycles beyond the first cycle (CPI cycles)
  • Cycles spent processing interrupts

Using DWT requires a special data write to the lock access register (LAR). This is the most common mistake programmers make when trying to use DWT functionality. To create a simple interface to DWT, refer to the dwt.h file in the book’s GitHub project: https://github.com/PacktPublishing/The-Insiders-Guide-to-Arm-Cortex-M-Development/tree/main/chapter-5/dotprod-nxp-lpcxpresso55s69/dwt.h. As with the SysTick files, the dwt.h file contents are displayed here for convenience:

#include <stdint.h>
void start_dwt(void);
void stop_dwt(void);
/* DWT Variables */
#define DWT_CYCCNTENA_BIT (1UL << 0)
#define TRCENA_BIT (1UL << 24)
#define DWT_CONTROL (*((volatile uint32_t*) 0xE0001000))
#define DWT_CYCCNT (*((volatile uint32_t*) 0xE0001004))
#define DWT_LAR (*((volatile uint32_t*) 0xE0001FB0))
#define DEMCR (*((volatile uint32_t*) 0xE000EDFC))
#define DWT_CPICNT (*((volatile uint32_t* )0xE0001008))
#define DWT_EXCCNT (*((volatile uint32_t*) 0xE000100C))
#define DWT_SLEEPCNT (*((volatile uint32_t*) 0xE0001010))
#define DWT_LSUCNT (*((volatile uint32_t*) 0xE0001014))
#define DWT_FOLDCNT (*((volatile uint32_t*) 0xE0001018))

The dwt.h file contains the location of the peripheral registers in the Cortex-M memory map and constants to use for programming, as well as the function prototypes to start and stop the counter.

The dwt.c file, also available in the book’s GitHub project, contains a function to unlock DWT and start counting, and another to stop counting and print the six counted values:

#include <stdio.h>
#include "dwt.h"
void start_dwt()
{
    DWT_LAR = 0xC5ACCE55;
    DEMCR |= TRCENA_BIT;
    DWT_CYCCNT = 0;
    DWT_CONTROL |= DWT_CYCCNTENA_BIT;
}
void stop_dwt()
{
    DWT_CONTROL &= ~DWT_CYCCNTENA_BIT;
    printf("CCNT = %u
", DWT_CYCCNT);
    printf("CPICNT = %u
", DWT_CPICNT);
    printf("EXCCNT  = %u
", DWT_EXCCNT);
    printf("SLEEPCNT = %u
", DWT_SLEEPCNT);
    printf("LSUCNT = %u
", DWT_LSUCNT);
    printf("FOLDCNT = %u
", DWT_FOLDCNT);
}

Here is another simple C program to measure the number of cycles a printf statement takes, this time using DWT counters:

#include <stdio.h>
#include "dwt.h"
int main()
{
     (void) start_dwt();
     printf("Count this hello world using DWT
");
     (void) stop_dwt();
}

In the preceding SysTick and DWT examples, we covered only the minimal registers and programming values needed to effectively measure the performance of software on Cortex-M devices. CMSIS-Core includes a full description of the register interfaces and simplifies software reuse for more robust projects. Visit the CMSIS documentation for more information: https://arm-software.github.io/CMSIS_5/General/html/index.html.
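As an illustration of that CMSIS-Core style, here is a minimal sketch of the same DWT cycle counting written with CMSIS-Core symbols. It assumes your device header (named device.h here purely for illustration) pulls in the appropriate core_cmXX.h definitions; the LAR unlock is kept as a raw register write because not every CMSIS-Core header defines DWT->LAR:

#include <stdint.h>
#include "device.h"  /* hypothetical device header that includes CMSIS-Core (core_cm33.h, etc.) */

/* Start the DWT cycle counter using CMSIS-Core register names */
static inline void cmsis_cycle_start(void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;   /* enable the DWT unit */
    *(volatile uint32_t *)0xE0001FB0 = 0xC5ACCE55;    /* LAR unlock, as in dwt.c above */
    DWT->CYCCNT = 0;                                  /* reset the counter */
    DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;              /* start counting */
}

/* Stop the DWT cycle counter and return the elapsed cycles */
static inline uint32_t cmsis_cycle_stop(void)
{
    DWT->CTRL &= ~DWT_CTRL_CYCCNTENA_Msk;             /* stop counting */
    return DWT->CYCCNT;                               /* cycles elapsed */
}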

If your Arm Cortex-M device has a DWT, we recommend using its cycle counter (CYCCNT) to measure the cycles spent executing code. For Cortex-M processors without the CYCCNT functionality, SysTick is a solid alternative for counting cycles.
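If you want a single measurement interface that works in both cases, one option is a thin compile-time wrapper over the start/stop functions from this chapter. This is just a sketch; the HAS_DWT_CYCCNT macro is a name invented for this illustration and would need to be defined by your build system:

/* cyclecount.h - hypothetical wrapper over this chapter's dwt.h and systick.h */
#ifndef CYCLECOUNT_H
#define CYCLECOUNT_H

#ifdef HAS_DWT_CYCCNT              /* define for parts that implement the DWT cycle counter */
  #include "dwt.h"
  #define start_cycle_count()  start_dwt()
  #define stop_cycle_count()   stop_dwt()
#else                              /* e.g., the Cortex-M0+ on the Raspberry Pi Pico */
  #include "systick.h"
  #define start_cycle_count()  start_systick()
  #define stop_cycle_count()   stop_systick()
#endif

#endif /* CYCLECOUNT_H */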

Measuring dot product performance

Now that we know how to measure the cycle count of an important section of code, let’s give it a try by measuring the dot product performance on the Raspberry Pi Pico. We will look at multiple implementations of the dot product and experiment with compiler optimizations to see how the implementation of the dot product impacts performance.

Using the Raspberry Pi Pico

Often, a project already has a Cortex-M microcontroller chosen, which cannot be changed. In this case, the best system performance can be obtained using a combination of changes to the source code algorithms and the compiler optimization levels. In some cases, the compiler itself can also be changed (though this is often predetermined for projects).

In this section, we take the dot product example and create three different implementations with different source code and then use the compiler options to check the impact on performance. As the Cortex-M0+ in the Raspberry Pi Pico does not support CYCCNT DWT, we use the SysTick code provided in the previous section to count the clock cycles.

To replicate the example in this section, you will be using the following tools and environment:

Platform: Raspberry Pi Pico
Software: Dot Product
Environment: Raspberry Pi 4
Host OS: Linux (Ubuntu)
Compiler: GCC
IDE: -

For this example, we use the same C/C++ SDK for the Pico introduced in Chapter 3, Selecting the Right Tools. It uses the GNU compiler.

First, get the code by cloning the book's full GitHub project and then navigating into the correct directory:

$ git clone  https://github.com/PacktPublishing/The-Insiders-Guide-to-Arm-Cortex-M-Development.git

$ cd The-Insiders-Guide-to-Arm-Cortex-M-Development/chapter-5/dotprod-pico

The C main function is in dotprod.c and defines two vectors each with 256 entries. There are three implementations to compute the dot product of these two vectors:

  • dot_product1(): This is what most programmers would initially do. Make a loop, multiply, and add the vectors:

    for (int i = 0; i < length; i++)
        sum += v1[i] * v2[i];

  • dot_product2(): This is a reasonable next step to optimize the calculation after discovering CMSIS-DSP and looking at the example programs. It uses a CMSIS-DSP function to multiply and another to add:

    arm_mult_f32(v1, v2, multOutput, MAX_BLOCKSIZE);
    for (i = 0; i < MAX_BLOCKSIZE; i++)
        arm_add_f32(&testOutput, &multOutput[i], &testOutput, 1);

  • dot_product3(): This leverages the single dot product function in the CMSIS-DSP library:

    (void) arm_dot_prod_f32(v1, v2, length, &result);

Look over each implementation and think about factors that impact performance. All implementations give the same numerical result but compute it using different instructions.
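To compare the three implementations under identical conditions, each one can be wrapped with the SysTick start/stop functions introduced earlier. The following is only a sketch of what such a harness could look like; the vector initialization and the exact signatures of dot_product1/2/3 shown here are assumptions, and the real dotprod.c in the repository may differ:

#include <stdio.h>
#include "arm_math.h"   /* CMSIS-DSP types (float32_t) and functions */
#include "systick.h"    /* SysTick cycle counting from earlier in this chapter */

#define MAX_BLOCKSIZE 256   /* the text states each vector has 256 entries */

static float32_t v1[MAX_BLOCKSIZE];
static float32_t v2[MAX_BLOCKSIZE];

/* Assumed prototypes for the three implementations described above */
float32_t dot_product1(const float32_t *a, const float32_t *b, uint32_t length);
float32_t dot_product2(const float32_t *a, const float32_t *b, uint32_t length);
float32_t dot_product3(const float32_t *a, const float32_t *b, uint32_t length);

int main(void)
{
    /* Fill the vectors with simple test data */
    for (uint32_t i = 0; i < MAX_BLOCKSIZE; i++) {
        v1[i] = (float32_t)i;
        v2[i] = (float32_t)(MAX_BLOCKSIZE - i);
    }

    start_systick();
    volatile float32_t r1 = dot_product1(v1, v2, MAX_BLOCKSIZE);
    stop_systick();     /* prints the cycle count for implementation 1 */

    start_systick();
    volatile float32_t r2 = dot_product2(v1, v2, MAX_BLOCKSIZE);
    stop_systick();     /* prints the cycle count for implementation 2 */

    start_systick();
    volatile float32_t r3 = dot_product3(v1, v2, MAX_BLOCKSIZE);
    stop_systick();     /* prints the cycle count for implementation 3 */

    printf("Results: %f %f %f\n", (double)r1, (double)r2, (double)r3);
    return 0;
}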

Results

Running the application on the Raspberry Pi Pico with the default compiler settings gives the following cycle counts. Please note that the exact cycle count may vary based on the compiler version:

Implementation                            Number of cycles
1: Plain C code                           41,668
2: CMSIS-DSP for multiply then add        54,539
3: CMSIS-DSP for dot product              41,808

Table 5.1 – Dot product performance, default optimization settings

Implementations 1 and 3 are very similar, and implementation 2 is slower because it has more function calls.

Let’s take a look at how to work with the compiler flags.

There are four different optimization configurations (build types) for the GCC compiler. Each one can be specified when running cmake:

$ cmake -DCMAKE_BUILD_TYPE=<type>

The build type values are listed in the following table:

CMAKE_BUILD_TYPE      GCC flags used for optimization
Release               -O3 (max optimization)
Debug                 -Og and -g (max debug)
RelWithDebInfo        -O2 and -g (good optimization with debug)
MinSizeRel            -Os (minimum code size)

Table 5.2 – Common GCC optimization flags

To see the actual compiler output and the flags used, set the VERBOSE flag for make. This will print the full compiler commands, and you can inspect the flags:

$ make VERBOSE=1

The verbose output also reveals the architecture flags -mcpu=cortex-m0plus -mthumb, which are expected for the Cortex-M0+.

During application creation and debugging, use the Debug build type. This makes it easier for your debugger to step through code properly and speeds up interactive code improvement:

$ cmake -DCMAKE_BUILD_TYPE=Debug

The Debug build uses -g and -Og to optimize for debugging. The default build type is Release, which uses -O3 and enables the majority of GCC optimizations.

There is also RelWithDebInfo for the release build with debug info, which is -O2 and -g. The final build type is MinSizeRel, which uses -Os to optimize for the smallest code size.

If you wish, you can override the values for each of the four build types. For example, to change the optimization level to -O1 for the release build type, add the following line to the CMakeLists.txt file:

set(CMAKE_C_FLAGS_RELEASE "-O1 -DNDEBUG")

Then when you run cmake with the release build type, the -O1 flag will be used instead of -O3:

$ cmake -DCMAKE_BUILD_TYPE=Release

Note that the variable is different for each build type. To change the optimization level of the Debug build type, set the CMAKE_C_FLAGS_DEBUG variable instead. Take some time and experiment with the compiler optimization levels to see different algorithm performances for each type of dot product implementation.

For a minimum-size build type, the cycle counts are in the following table. As mentioned earlier, you might notice different cycle counts based on the compiler version you use:

Implementation                            Number of cycles
1: Plain C code                           42,696
2: CMSIS-DSP for multiply then add        56,902
3: CMSIS-DSP for dot product              42,937

Table 5.3 – Dot product performance, minimum size optimization settings

In this case, the minimum size build resulted in worse performance than the Debug build for each dot product implementation.

There are other changes to make besides the compiler flags to impact performance. The CMSIS functions have alternative implementations for loop unrolling. These can be seen in the source files such as arm_dot_prod_f32.c. To review, open the source file with a text editor and look for this line:

#if defined (ARM_MATH_LOOPUNROLL) && !defined(ARM_MATH_AUTOVECTORIZE)

Edit CMakeLists.txt to set the value for loop unrolling and run again to see the performance impact:

set(CMAKE_C_FLAGS_RELEASE "-O3 -DNDEBUG -DARM_MATH_LOOPUNROLL")

The new values are listed in the following table:

Implementation                            Number of cycles
1: Plain C code                           41,668
2: CMSIS-DSP for multiply then add        60,887
3: CMSIS-DSP for dot product              41,034

Table 5.4 – Dot product performance, with loop unrolling

The plain C code implementation runs the same as the previous -O3 implementation as it is not a CMSIS function. The second implementation increased in cycles, while the third implementation decreased and thus had improved performance. This reveals a truth about optimizing performance for a multi-faceted problem such as embedded software development: it is both a science and an art.
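To make loop unrolling more concrete, here is a sketch of what an unroll-by-four version of the plain C dot product could look like. This is illustrative only; the actual ARM_MATH_LOOPUNROLL code paths inside CMSIS-DSP differ in their details:

#include <stdint.h>

/* Dot product with the inner loop unrolled by four: fewer loop-control
   instructions are executed per element. Illustrative sketch only. */
float dot_product_unrolled(const float *a, const float *b, uint32_t length)
{
    float sum = 0.0f;
    uint32_t i = 0;

    /* Process four elements per iteration */
    for (; i + 4u <= length; i += 4u) {
        sum += a[i]      * b[i];
        sum += a[i + 1u] * b[i + 1u];
        sum += a[i + 2u] * b[i + 2u];
        sum += a[i + 3u] * b[i + 3u];
    }

    /* Handle any leftover elements */
    for (; i < length; i++) {
        sum += a[i] * b[i];
    }

    return sum;
}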

Making informed decisions about what Cortex-M processor, algorithm implementation, and compiler options to select is crucial to optimizing performance. It is equally important to modify some variables and measure the results yourself for your unique situation. The interplay between compilers, compiler flags, hardware, and software is a complex domain where experience is the best teacher.

To gain a deeper understanding of system behavior, you should always look at the disassembly file to see the actual code generated. For example, the dotprod.dis file generated in the build directory shows the difference in code with and without the loop unrolling. If you have questions about why a certain implementation has a different performance than expected, it helps to be able to look at and generally understand the assembly code.

The next section will take the same three dot product implementations and try them on different hardware using a different compiler.

Using NXP LPC55S69-EVK

This section ports over the same software that ran dot products on the Pico to run on the NXP LPC55S69-EVK board. In this case, the Cortex-M33 supports DWT and will use that to measure cycles here as opposed to the SysTick counter.

To replicate the example in this section, you will be using the following tools and environment:

Platform: NXP LPC55S69-EVK
Software: Dot Product
Environment: Personal Computer
Host OS: Windows
Compiler: Arm Compiler for Embedded
IDE: Keil MDK Community

While you can create a new project in Keil µVision to build and run the dot product example, the easiest way is to modify the existing hello_world demo example. The steps to download and install the hello_world example using Pack Installer are the same as the ones you would have followed in Chapter 4, Booting to Main, while running the led_blinky example. The instructions are already documented here: https://developer.arm.com/documentation/kan322/latest/.

Once you’ve loaded the hello_world project in µVision, replace the contents of the hello_world.c source file with the file containing our dot product algorithms, dot_product.c. You can find it at our book’s GitHub link: https://github.com/PacktPublishing/The-Insiders-Guide-to-Arm-Cortex-M-Development/tree/main/chapter-5/dotprod-nxp-lpcxpresso55s69/hello_world.c. Make sure to use the correct dot_product.c file intended for the NXP board, as it has code specific to its memory map. You can also ignore the dwt.c and dwt.h files in that GitHub repository, as they are automatically included in the example hello_world project through CMSIS and are there only for reference.

Then save and build your project with the dot product algorithms in the hello_world.c file. This project uses the arm_math.h file, so you must include the CMSIS-DSP library in your build. To do so, select Manage Run-Time Environment from the top of the IDE’s GUI and check the box under CMSIS | DSP:

Figure 5.1 – Adding CMSIS-DSP library in Keil MDK

Once you are able to build your project successfully, you can choose between two different configurations for building your project: debug and release. The compiler and linker optimization settings differ based on your selection. The main difference in the default settings is that debug uses -O1 and release uses -Oz.

These are the compiler settings for the Debug build:

Figure 5.2 – Debug build compiler settings

These are the compiler settings for the Release build:

Figure 5.3 – Release build compiler settings

Similar to GCC, Arm Compiler for Embedded has specific compiler optimization settings for different situations. They can be summarized as follows:

Ideal for                    Arm Compiler for Embedded flags for optimization
Performance                  -Ofast or -O3
Debug, low performance       -O0
Debug, more optimized        -O1
Minimum code size            -Oz

Table 5.5 – Common Arm Compiler for Embedded optimization flags

With this understanding, we can now try different compiler settings and see the resulting performance.

For optimization levels above -O0, the compiler will inline the start_dwt() and stop_dwt() functions. This can lead to unexpected reordering, and the cycle count of the dot product is not actually measured because stop_dwt() ends up executing before the calculation is done.

To avoid this, add -fno-inline-functions to the Misc Controls box on the Options for Target screen, as shown in Figure 5.3.
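An alternative to disabling inlining globally is to mark only the measurement functions as non-inlinable. This is a sketch that assumes GCC-style attributes are acceptable in your code base (Arm Compiler 6 is Clang-based and accepts them):

/* In dwt.h: keep the start/stop calls in place around the measured code
   by preventing the compiler from inlining them. */
void start_dwt(void) __attribute__((noinline));
void stop_dwt(void) __attribute__((noinline));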

Results

When running with the debug option with the flag set to -O1, these are the resulting numbers:

Implementation                            Number of cycles
1: Plain C code                           2,398
2: CMSIS-DSP for multiply then add        9,765
3: CMSIS-DSP for dot product              2,422

Table 5.6 – Dot product performance, debug optimization settings

Note that when you click build in the Keil IDE, it also reports the code size for the software. Code size is another axis of optimization, separate from performance. The code size for the preceding debug option is 6,700 bytes, which can vary based on the compiler version.

Running with the release option with the flag set to -Oz, these are the resulting numbers:

Implementation                            Number of cycles
1: Plain C code                           2,100
2: CMSIS-DSP for multiply then add        8,891
3: CMSIS-DSP for dot product              1,857

Table 5.7 – Dot product performance, minimum size optimization settings

This build has a code size of 5,232 bytes, which is lower than expected.

Lastly, change the compiler optimization flag to the maximum performance -Ofast and see the result. To do this, go into Project | Options for Target ‘hello_world release’ | the C/C++(AC6) tab | Optimization. Set the Optimization flag to -Ofast. Here are the results:

Implementation                            Number of cycles
1: Plain C code                           1,752
2: CMSIS-DSP for multiply then add        11,246
3: CMSIS-DSP for dot product              1,652

Table 5.8 – Dot product performance, fast optimization settings

This build has a code size of 7,944 bytes.

As with the Pico, some performance gains are expected, and others respond in less intuitive ways. This is why trying different compiler options and algorithm implementations leads to better performance over time. The next section will cover the same topic for the Cortex-M55.

Using Arm Virtual Hardware

This section measures the same dot product code, slightly modified to run on the Arm Virtual Hardware Cortex-M55 FVP. It measures the cycle count using the SysTick timer.

To replicate the example in this section, you will be using the following tools and environment:

Platform

Arm Virtual Hardware – Corstone-300

Software

Dot Product

Environment

Amazon EC2 (AWS account required)

Host OS

Ubuntu Linux

Compiler

Arm Compiler for Embedded

IDE

-

The setup for this example is largely the same as in Chapter 4, Booting to Main. Launch a new instance of the Arm Virtual Hardware AMI on Amazon EC2, clone the example from the book’s GitHub project, then build and run it. These are the commands to run on your AMI cloud instance:

    git clone  https://github.com/PacktPublishing/The-Insiders-Guide-to-Arm-Cortex-M-Development.git

    cd  The-Insiders-Guide-to-Arm-Cortex-M-Development/chapter-5/dotprod-avh-corstone-300

    ./build.sh

    ./run.sh

Results

You will then get these results with the default Arm Compiler for Embedded flags:

Implementation                            Number of cycles
1: Plain C code                           4,294,967,295
2: CMSIS-DSP for multiply then add        11,390,997
3: CMSIS-DSP for dot product              5,201,475

Table 5.9 – Dot product performance, default optimization settings

If these results seem so high as to be incorrect, your intuition is right! Our method of measurement in the previous sections has been using DWT and SysTick, which depend on physical oscillating counters on the Pico and NXP boards to record accurate cycle counts. On this virtual Corstone-300, however, there are no physical oscillating counters to read from; the SysTick in this example is abstracted to work at the functional level, but not at a cycle-accurate level.

This functional accuracy gives Arm Virtual Hardware a large advantage when you just need software to behave as it would in the end system, without cycle accuracy. You can spin up hundreds of virtual boards to run CI/CD tests in parallel without buying as many boards (a topic covered in Chapter 9, Implementing Continuous Integration). You can easily start working with Arm hardware without buying physical hardware. These advantages are possible because the virtual hardware runs as fast as or faster than the same software running on a physical Cortex-M board. Simulating software at the cycle-accurate level requires a significant amount of complexity, which slows down execution to the point of being useless for these software development use cases. This is the trade-off of functional accuracy versus cycle accuracy.

What this means in this chapter’s context is that it is generally not helpful to measure and optimize software performance on virtual hardware. The closest you can get is a relative understanding of performance. In this example, the plain C implementation reports vastly more cycles than the CMSIS-DSP multiply-then-add version, which in turn reports roughly twice as many cycles as the CMSIS-DSP dot product function. This suggests that implementation 3 is better than 2, which is better than 1, but you should always verify on physical hardware to obtain realistic measurements.

Optimization takeaways

We have evaluated the performance of the dot product while altering the following variables:

  • Processor type
  • Software source code
  • Compiler and compiler options

To provide some helpful guidelines when optimizing a Cortex-M system, here is a summary table of the recorded cycle counts when altering the dot product algorithm and compiler flags across both the Pico and NXP boards. As described in the previous section, the Corstone-300 Arm Virtual Hardware system does not allow cycle-accurate measurements, so we will not include it in our summary table here:

Implementation                            Compiler Flags    RPi (M0+)    NXP (M33)
1: Plain C code                           Debug             41,668       2,398
                                          Release Size      42,696       2,100
                                          Release Speed     41,668       1,752
2: CMSIS-DSP for multiply then add        Debug             54,539       9,765
                                          Release Size      56,902       8,891
                                          Release Speed     60,887       11,246
3: CMSIS-DSP for dot product              Debug             41,808       2,422
                                          Release Size      42,937       1,857
                                          Release Speed     41,034       1,652

Table 5.10 – Comparing optimization techniques for dot product

Looking at the preceding table, we can make some specific observations, as well as some helpful generalizations about Cortex-M optimization.

Processor performance

It is clear from the results that the NXP board, based on the Cortex-M33, runs every dot product implementation at every compiler optimization level faster than the Pico, based on the Cortex-M0+. This should not be surprising as the Cortex-M33 is designed to be much more powerful than the Cortex-M0+ in mathematical operations, especially as it includes the FPU.

Importantly, the performance difference will be more or less pronounced with different software. Running a more realistic software stack and complex algorithms will likely differentiate the Cortex-M33 from the Cortex-M0+ even more. Ultimately, the delta between processors will depend on how your specific software leverages the capabilities of each. And faster is not always better; if you are optimizing for cost and a Cortex-M0+ can run your algorithm fast enough, that may be the wiser choice.

Compilers

In this chapter, we used GCC for the Raspberry Pi Pico and Arm Compiler for Embedded for the NXP LPC55S69-EVK. While the difference in performance is clearly attributable to the Cortex-M33 capabilities over the Cortex-M0+, in general, the Arm Compiler for Embedded has a performance edge over the open source GCC. This is because the Arm Compiler for Embedded is developed alongside the Cortex-M processors, and it is explicitly designed to extract as much performance from each processor as possible. Over time, most of these performance tricks are upstreamed into the GNU compiler, but this can take years after a processor is released. The performance difference is more prominent for newer Cortex-M processors but is generally present for all of them.

In use cases where maximizing application performance is paramount, this compiler performance difference is another variable to improve. For most cases in the Cortex-M space, however, using GCC or Arm Compiler for Embedded (or another compiler) will not make or break your application.

Compiler flags

The results of altering the flags are a bit more mixed. It is important to understand which compiler flags can be changed in different settings. This chapter gave examples of the optimization flag and the loop unrolling flag, and there are other optimizations to explore to ensure you are extracting all the performance possible from your system. When using Arm Compiler for Embedded for more realistic software applications, investigate Link Time Optimization (LTO). It is a powerful technique for pushing application performance, but it can induce unexpected software behavior and cause runtime errors when not implemented correctly.

Source code / algorithm implementation

The key takeaway is to use pre-optimized algorithms when available. Leveraging the CMSIS dot product consistently gave better performance than the other two implementations. In general, CMSIS libraries offer highly optimized implementations of common algorithms; where possible, use them to your advantage.

For this simple dot product example, using two CMSIS functions (multiply then add) was generally a worse implementation than plain C due to the overhead of passing data between function calls. As your software gets more complex, this type of overhead versus efficiency trade-off may prove worth it.

In general, that overhead versus efficiency trade-off can be quite complex in a realistic software setting. It is up to you to explore different optimization options with your given constraints on time, cost, hardware, and software. Take the lessons from this chapter to direct your focus and make educated decisions about altering your project to optimize it well.

Summary

This chapter outlined basic principles of performance optimization for Cortex-M systems. We took an example software algorithm, the dot product, and went through several boards and compiler options to explore performance implications. The skills learned in this chapter will help you optimize your Cortex-M system intelligently by altering the main factors influencing performance: processor type, software source code implementation, and compilers and compiler options.

No matter how optimized your system is for performance, there are still other important aspects to consider that ensure you have a quality Cortex-M product. The next chapter provides both an overview and practical guide for using machine learning on edge devices.
