Appendix

Performance Analysis Using NXP’s i.MX RT1050 Crossover Processor and the Zephyr™ Real-Time Operating System

Abstract

Benchmarking and analysis are key components of embedded system development. Whether analyzing the overall performance of an application, or focusing on a specific hardware or software component, embedded systems developers require not only the proper tools, but also applications and kernels to shed insight into how the system under development operates. In addition, embedded developers often want to compare and contrast the performance of multiple hardware and software solutions with the goal of achieving parity between competing solutions.

This section provides a case study of two systems comprised of modern embedded processors and embedded operating systems. It describes the background to benchmarking for embedded systems, and more specifically the use of microbenchmarks for component-level analysis of varying modern embedded platforms. By using strategically crafted microbenchmarks, the section illustrates how embedded developers can create their own application code to provide insights into system features for the broader development of system-level applications, in addition to the selection of appropriate chipset solutions and tooling for a given application and market.

A benchmark study to understand performance advantages as compared to Linux BSP on i.MX 6UL Processors

Florin Leotescu, Marius Cristian Vlad, and Michael C. Brogioli

A.1 Introduction

Software and hardware performance analysis is integral to the evaluation and design of embedded systems. Such analysis helps to understand the limitations of a system, identify performance bottlenecks, and determine how well the system is performing in comparison with other devices. Performance analysis can be done using custom software benchmarking applications that execute specific algorithms, which will deliver performance statistics about the system under test, design, and development. Examples of such benchmarks are the SPEC CPU benchmarks, designed to provide performance measurements that can be used to compare compute-intensive workloads on different computer systems.1 EEMBC is another group of benchmarks, predominantly targeted at embedded computing.2 EEMBC benchmark suites are developed by working groups of members who share an interest in developing clearly defined standards for measuring the performance and energy efficiency of embedded processor implementations, from IoT edge nodes to next-generation advanced driver-assistance systems.

In addition to the use of standardized benchmarks, like those mentioned earlier, system developers often also elect to implement microbenchmarks that focus on a very small, or critical, feature of the system. While not intended to characterize broader system-level performance, microbenchmarks can be a very useful tool when focusing on specific system components. For example, microbenchmarks can be used to analyze the time required to create threads of execution within a given system. While this does not characterize the performance of the entire system, nor the system under load of a given target application, it can be used to provide fine-grained insights into specific aspects of the system.

It should be noted, however, that the use of benchmarking and microbenchmarking can only go so far. Many embedded solutions vendors do not open up the underlying hardware design of their solution, nor very often provide access for system users to their system-level software or source code. As such, benchmarking and microbenchmarking are limited in terms of analyzing and comparing features between systems.

This section provides a real-world example of the use of microbenchmarks to analyze differing hardware and software solutions that are critical to embedded systems design. Specifically, a performance analysis is presented comparing the Zephyr™ OS running on the NXP i.MX RT1050 crossover processor, based on the Arm® Cortex®-M7 core, and the Linux BSP running on the NXP i.MX 6UL applications processor, based on the Arm Cortex-A7 core. This analysis is performed using custom microbenchmarks for various system components, including but not limited to thread creation, use of mutexes, and memory allocation, all of which are fundamental concepts in modern high-performance embedded systems design.

Noting the differences between Zephyr™ OS (a tiny open-source RTOS for IoT) and Linux (an open-source monolithic Unix-like computer operating system kernel), it is important to recognize that this comparison is not fully an “apples to apples” comparison.3 Rather, this study is intended to provide embedded designers with a set of exemplary microbenchmarks to compare hardware and software solutions when executing the same tasks. It is left to the reader or system developer to extrapolate how these system-level tasks relate to their overall target application. To evaluate the performance difference between the two solutions, synthetic microbenchmarks were developed specifically to measure the time taken by dynamic memory allocation and deallocation, mutex lock and unlock, thread creation, thread joining, and context switching.

In summary, the results of this performance analysis showed that the Zephyr™ OS (running on an i.MX RT1050 crossover processor) improved overall system responsiveness and ultimately reduced costs of the IoT and embedded systems development. The aforementioned tasks executed much faster on the Zephyr™ OS + i.MX RT1050 solution, compared with the Linux + i.MX 6UL solution. The following sections explain the methodology used to derive the results of the comparison.

A.2 Configuration Information

The first configuration analyzed in the study is the NXP i.MX RT1050. The i.MX RT1050 is a crossover processor that combines the high performance and high level of integration of an applications processor with the ease of use and real-time functionality of a microcontroller. The i.MX RT1050 runs on the Arm® Cortex®-M7 core at 600 MHz.4

This device is fully supported by NXP’s MCUXpresso Software and Tools, a comprehensive and cohesive set of free software development tools for Kinetis, LPC, and i.MX RT microcontrollers. MCUXpresso SDK also includes project files for Keil MDK and IAR Embedded Workbench for Arm.5

Configuration #1 also includes the Zephyr™ operating system. Zephyr™ is a small real-time and scalable operating system for connected, resource-constrained devices supporting multiple architectures and released under the Apache License 2.0.6,7

A.2.1 Summary of Configuration #1: i.MX RT1050 Configuration—Hardware and Software

  •  Development board: MIMXRT1050-EVK
    •  Processor: MIMXRT1052DVL6A Arm® Cortex®-M7 core
    •  Number of cores: 1
    •  Core frequency: 600 MHz
    •  Board schematic: SCH-29538 REV A1
  •  OS name: Zephyr OS 1.11.99
    •  OS type: real-time
    •  Zephyr OS web page

The second configuration analyzed in the study is a combination of the NXP i.MX 6UL hardware with the Linux operating system. The i.MX 6UltraLite is a high-performance and efficient processor family featuring an Arm Cortex-A7 core operating at speeds up to 696 MHz at the time of writing. The i.MX 6UltraLite applications processor includes an integrated power management module that reduces the complexity of the external power supply and simplifies power sequencing. Each processor in this family provides various memory interfaces, including 16-bit LPDDR2, DDR3, DDR3L, raw and managed NAND flash, NOR flash, eMMC, Quad SPI, and a wide range of other interfaces for connecting peripherals, such as WLAN, Bluetooth™, GPS, displays, and camera sensors. The software running on the i.MX 6UL is the Linux Board Support Package (BSP) release with kernel 4.9.88-imx_4.9.88_2.0.0_ga. As discussed in greater detail later, unlike the Zephyr™ OS of the first configuration, this is not a real-time variant of Linux.

A.2.2 Summary of Configuration # 2: i.MX 6UL Configuration

  •  Development board: MCIMX6G2CVM05AB
    •  Processor: i.MX6UL: i.MX 6UltraLite Processor, based on Arm Cortex-A7 core
    •  Number of cores: 1
    •  Core frequency: 528 MHz
    •  Board schematic: SCH-29163 REV A2
  •  OS name: Linux BSP - kernel 4.9.88-imx_4.9.88_2.0.0_ga
    •  OS type: non-real-time

A.3 Scope of Analysis

As mentioned in the introduction, this work is intended to evaluate and compare the performance of the i.MX RT1050 EVK with the Zephyr™ OS and the i.MX 6UL EVK with the Linux BSP. The goal is to determine any potential performance gaps between the MIMXRT1050-EVK board, equipped with an embedded Arm SoC, and a similar board equipped with an applications processor. Because the i.MX 6UL EVK offered the closest CPU speed configuration to the i.MX RT1050 EVK, we selected the i.MX 6UL development board for the comparison.

In addition, the Zephyr™ OS was selected over other real-time operating systems because it is free and very comprehensive, developed as a collaborative project, and supported by an active open-source community. While both the Zephyr™ and Linux operating systems can exhibit real-time characteristics, the Zephyr™ OS was originally designed to abide fully by traditional RTOS principles, whereas Linux has traditionally served larger workloads for the desktop and server spaces. Furthermore, at the time of writing, Linux requires additional patches to abide by traditional RTOS principles.

To obtain comparable results, despite the known operating system differences, focus was placed on using the same application programming interface (API) for the custom microbenchmarks used in this study. The microbenchmarks were developed in the C language and made use of the Pthreads API library (POSIX API library). In the case of the Zephyr™ OS, the available API version was POSIX PSE52, which, according to Zephyr™ community documentation, implements only partial support for the full POSIX specifications.

The microbenchmarks perform memory allocation and deallocation, mutex lock and unlock, thread creation, thread joining, context switching, and record the time spent on each of these actions.

To determine the time spent performing the tasks, we used the POSIX clock_gettime() function for the Linux + i.MX 6UL EVK solution. For the Zephyr™ OS, running on the i.MX RT1050 EVK, we used the TIMING_INFO_PRE_READ() function instead of clock_gettime(). Due to the nature of the OS scheduler on Zephyr™, which is beyond the scope of this study, the clock_gettime() function generated inconsistent timing values. Because the Zephyr™ source code itself also uses the TIMING_INFO_PRE_READ() function, the decision was made to continue with it.
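The Linux listings later in this section call two helpers, time_get_time() and time_get_diff(), whose bodies are not shown. The following is a minimal sketch of how such wrappers could be built on clock_gettime(). The choice of CLOCK_MONOTONIC, the nanosecond return unit, and the 64-bit return type are assumptions made for illustration, not details confirmed by the study.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical wrapper: capture a timestamp from a monotonic clock.
 * CLOCK_MONOTONIC is an assumption; the study does not name the clock id. */
void time_get_time(struct timespec *ts)
{
    int rc = clock_gettime(CLOCK_MONOTONIC, ts);
    assert(rc == 0);
    (void)rc;
}

/* Hypothetical helper: return (stop - start) in nanoseconds.
 * A negative tv_nsec difference is absorbed by the seconds term. */
int64_t time_get_diff(const struct timespec *stop, const struct timespec *start)
{
    int64_t sec  = (int64_t)stop->tv_sec - (int64_t)start->tv_sec;
    int64_t nsec = (int64_t)stop->tv_nsec - (int64_t)start->tv_nsec;
    return sec * 1000000000LL + nsec;
}
```

A monotonic clock is used in this sketch because it is not affected by wall-clock adjustments made while a measurement is in progress.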

A.3.1 Microbenchmark #1: Dynamic Memory (Heap) Allocation and Deallocation Benchmark

In the C programming language, dynamic memory allocation refers to performing manual memory management via a group of functions in the C standard library. The C programming language manages memory statically, automatically, or dynamically.

Static variables are allocated in main memory, along with the executable code, and persist for the lifetime of the program. Automatically managed variables, or local variables, are allocated on the stack; they come and go as functions are called and as functions exit. The size of the memory allocation for static and local variables is defined at compile time, except for variable-length arrays. If the required size is not known until runtime (e.g., if data of arbitrary size is being read from the user or from a disk file), then using fixed-size data objects is inadequate. In this situation, dynamic memory allocation solves the problem: memory is managed more explicitly, typically by allocating it from a large region of free space called the heap (Fig. 1).

Fig. 1 Application memory organization.

In other words, the heap is a memory region that is managed manually by the programmer (in the case of the C language). In other programming languages, for example, Java, memory is managed automatically.
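The three kinds of storage described above can be seen side by side in a few lines of C. This is an illustrative sketch only; the function name make_sequence and the counter call_count are hypothetical and do not come from the benchmark sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Static storage: exists for the whole lifetime of the program. */
static int call_count;

/* Build an array whose length is only known at run time.
 * The caller owns the returned block and must release it with free(). */
int *make_sequence(size_t n)
{
    size_t i;                                  /* automatic: lives on the stack */
    int *values = malloc(n * sizeof *values);  /* dynamic: lives on the heap */

    if (values == NULL)
        return NULL;                           /* allocation failed (or n == 0 on some systems) */

    for (i = 0; i < n; i++)
        values[i] = (int)i;

    call_count++;
    return values;
}
```

The array outlives the call that created it, which neither an automatic variable nor a fixed-size static array sized at compile time could provide for an arbitrary n.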

To manage heap memory in C under Linux, the malloc() and free() functions are used (C++ additionally provides the new and delete operators). The malloc function allocates a block of heap memory, and free deallocates it. In the case of the Zephyr™ OS these functions are named k_malloc() and k_free(), which serve the same purpose as malloc() and free(). For this analysis, the microbenchmark was developed around these functions and was named heap_bench. The purpose was to measure the time for allocating and deallocating heap memory. Behind the scenes, the benchmark allocates 4 bytes (sizeof(int)) in each of 1000 iterations of the heap allocation loop, and then deallocates the same memory in a second, heap deallocation loop. For the Linux BSP, each allocation and deallocation was timed using time_get_time(), which is a wrapper function on top of clock_gettime(). For the Zephyr™ OS, the TIMING_INFO_PRE_READ() function was used.

  • ...
  • //Linux BSP – heap allocation //
  • for (i = 0; i < ITERATIONS; i ++) {
  •     time_get_time(&start);
  •     pointer[i]= malloc(sizeof(int));
  •     *pointer[i] = 0xdeadbeef;
  •     time_get_time(&stop);
  •     diff = time_get_diff(&stop, &start);
  •     total += diff;
  • }
  • printf("Average heap malloc time: %.lf ns ", (total) / (double) ITERATIONS);
  • ...
  • ...
  • //Linux BSP – heap deallocation //
  • for (i = 0; i < ITERATIONS; i ++) {
  •     time_get_time(&start);
  •     free(pointer[i]);
  •     time_get_time(&stop);
  •     diff = time_get_diff(&stop, &start);
  •     total += diff;
  •  }
  • printf("Average heap free time: %.lf ns ", (total) / (double)ITERATIONS);
  • ...
  • ...
  • // Zephyr OS – heap allocation //
  • for (i = 0; i < ITERATIONS; i ++) {
  •     TIMING_INFO_PRE_READ();
  •     heap_malloc_start_time = TIMING_INFO_OS_GET_TIME();
  •     pointer[i]= k_malloc(sizeof(int));
  •     *pointer[i] = 0xdeadbeef;
  •     TIMING_INFO_PRE_READ();
  •     heap_malloc_end_time = TIMING_INFO_OS_GET_TIME();
  • ...
  • ...
  • //Zephyr OS – heap deallocation//
  • for (i = 0; i < ITERATIONS; i ++) {
  •     TIMING_INFO_PRE_READ();
  •     heap_free_start_time = TIMING_INFO_OS_GET_TIME();
  •     k_free(pointer[i]);
  •     TIMING_INFO_PRE_READ();
  •     heap_free_end_time = TIMING_INFO_OS_GET_TIME();
  • ...

At the end of the heap allocation and heap deallocation loops, the final average allocation and deallocation times were calculated for a given loop, as can be seen in the source code above.

A.3.2 Microbenchmark #2: Thread Creation and Joining Benchmark

In computer science, a thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler, the scheduler being part of the operating system in this context. The implementation of threads and processes differs between operating systems, but in most cases a thread is a component of a process. Multiple threads can exist within one process, executing concurrently and sharing resources, such as memory, while different processes do not share these resources. In particular, the threads of a process share its executable code and the values of its variables at any given time.8 Fig. 2 depicts two processes, each having one or more threads.

Fig. 2 Single-threaded process model vs. multithreaded process model.

In this comparison, the process is the running benchmark, named thread_bench, which spawns multiple threads using the POSIX pthread_create() function. It creates 2000 threads and measures the creation time of each one. At the end of thread creation, the total time recorded is divided by the number of threads created, giving the average time to create a thread.

  • ...
  • for (i = 0; i < ITERATIONS; i++) {
  •     if (pthread_attr_init(&attr[i]) != 0) {
  •         fprintf(stderr, "pthread_attr_init! ");
  •         exit(EXIT_FAILURE);
  •     }
  •     if (posix_memalign(&stacks[i], sysconf(_SC_PAGESIZE), MAX_STACK_SIZE) != 0) {
  •         fprintf(stderr, "Failed to allocate aligned memory ");
  •         exit(EXIT_FAILURE);
  •     }
  •     if (pthread_attr_setstack(&attr[i], stacks[i], MAX_STACK_SIZE) != 0) {
  •         fprintf(stderr, "Failed pthread_attr_setstack! ");
  •         exit(EXIT_FAILURE);
  •     }
  •     time_get_time(&start);
  •     if (pthread_create(&threads[i], &attr[i], test_function, NULL) != 0) {
  •         fprintf(stderr, "Failed to create thread! ");
  •         exit(EXIT_FAILURE);
  •     }
  •     time_get_time(&stop);
  • #ifdef DEBUG
  •     fprintf(stdout, "Created thread_id %d ", i);
  • #endif
  •     diff = time_get_diff(&stop, &start);
  •     total += diff;
  • }
  • fprintf(stdout, "Average pthread_create time: %.lf ns ", (total / (double)ITERATIONS));
  • ...

This benchmark also measures join time using the pthread_join() function, which synchronizes the parent thread by pausing its execution until the child thread terminates.

  • ...
  • for (i = 0; i < ITERATIONS; i++) {
  •     if (threads[i]) {
  •         time_get_time(&start);
  •         /* pthread_join returns 0 on success and a positive error number on failure */
  •         if (pthread_join(threads[i], NULL) != 0) {
  •             fprintf(stderr, "Failed to join thread ");
  •             exit(EXIT_FAILURE);
  •         }
  •         time_get_time(&stop);
  •         diff = time_get_diff(&stop, &start);
  •         total += diff;
  • #ifdef DEBUG
  •         fprintf(stdout, "thread %d, joined ", i);
  • #endif
  •     }
  • }
  • fprintf(stdout, "Average pthread_join time: %.lf ns ", (total / (double)ITERATIONS));
  • ...

A.3.3 Microbenchmark #3: Mutex Lock and Unlock Benchmark

In computer science, mutual exclusion is a concurrency control method intended to prevent race conditions between two or more threads. A race condition is a behavior of a system in which two independent flows of execution modify a shared resource at the same time, so that the output of the system depends on the order in which they happen to run. By analogy with the real world, consider two mechanics (two threads) who are jointly assembling a car engine. They assemble the engine in parallel; however, some of the subcomponents must be assembled in a specific order to ensure that the engine will work properly. Each mechanic has their own part of the engine to assemble. To be sure that components are mounted in the correct order, each mechanic should have exclusive ownership of the relevant portion of the engine during the critical sections of assembly. Taking this exclusive ownership corresponds to a mutex lock, where the mutex guards the shared resource (the car engine); releasing the engine corresponds to a mutex unlock.
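To make the analogy concrete, the sketch below has two threads increment one shared counter, with a mutex granting each thread exclusive ownership for the duration of the update. It is an illustration of mutual exclusion in general, not part of the mutex_bench source; the names counter_lock, shared_counter, increment_worker, and run_two_workers are hypothetical.

```c
#include <assert.h>
#include <pthread.h>

#define INCREMENTS 100000

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_counter;   /* the shared resource (the "engine") */

/* Each worker takes exclusive ownership before touching the counter. */
static void *increment_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++) {
        pthread_mutex_lock(&counter_lock);    /* enter critical section */
        shared_counter++;
        pthread_mutex_unlock(&counter_lock);  /* leave critical section */
    }
    return NULL;
}

/* Run two workers concurrently and return the final counter value. */
long run_two_workers(void)
{
    pthread_t a, b;
    int rc;

    shared_counter = 0;
    rc = pthread_create(&a, NULL, increment_worker, NULL);
    assert(rc == 0);
    rc = pthread_create(&b, NULL, increment_worker, NULL);
    assert(rc == 0);
    (void)rc;

    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return shared_counter;
}
```

Without the lock and unlock calls, the two read-modify-write sequences could interleave and some increments could be lost; with them, the result is always twice INCREMENTS.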

The mutex lock and unlock benchmark, named mutex_bench, measures the time of these two actions 1000 times. At the end, it calculates the average time for locking and unlocking a mutex variable. To execute mutex lock and unlock, we used the pthread_mutex_lock() and pthread_mutex_unlock() functions of the Pthreads API library. Below are some code samples from the benchmark, which measures lock and unlock timings.

  • ...
  • //Linux BSP//
  • for (i = 0; i < nr_iterations; i ++) {
  •     time_get_time(&start);
  •     pthread_mutex_lock(&lock);
  •     time_get_time(&stop);
  •     delta = time_get_diff(&stop, &start);
  •     total_lock += delta;
  •     time_get_time(&start);
  •     pthread_mutex_unlock(&lock);
  •     time_get_time(&stop);
  •     delta = time_get_diff(&stop, &start);
  •     total_unlock += delta;
  • }
  •     fprintf(stdout, "Average time for locking a mutex: %.8f ns ",
  •     (double) total_lock/ (double) nr_iterations);
  •     fprintf(stdout, "Average time for unlocking a mutex: %.8f ns ",
  •     (double) total_unlock/ (double) nr_iterations);
  • ...
  • ...
  • //Zephyr OS//
  • for (i = 0; i < nr_iterations; i ++) {
  •     TIMING_INFO_PRE_READ();
  •     mutex_lock_start_time = TIMING_INFO_OS_GET_TIME();
  •     pthread_mutex_lock(&lock);
  •     TIMING_INFO_PRE_READ();
  •     mutex_lock_end_time = TIMING_INFO_OS_GET_TIME();
  •     total_lock +=  (mutex_lock_end_time -mutex_lock_start_time);
  •     TIMING_INFO_PRE_READ();
  •     mutex_unlock_start_time = TIMING_INFO_OS_GET_TIME();
  •     pthread_mutex_unlock(&lock);
  •     TIMING_INFO_PRE_READ();
  •     mutex_unlock_end_time = TIMING_INFO_OS_GET_TIME();
  •     total_unlock +=  (mutex_unlock_end_time -
  •     mutex_unlock_start_time);
  • ...

A.3.4 Microbenchmark #4: Context Switching Benchmark

In computing, a context switch is the process of storing the state of a process or thread, so that it can be restored and resume execution from the same point later. This allows multiple processes to share a single CPU and is an essential feature of a multitasking operating system.

The precise meaning of the phrase “context switch” varies significantly in usage. In a multitasking context, it refers to the process of storing the system state for one task, so that task can be paused and another task resumed. A context switch can also occur as the result of an interrupt, such as when a task needs to access disk storage, freeing up CPU time for other tasks. Some operating systems also require a context switch to move between user-mode and kernel-mode tasks. The process of context switching can have a negative impact on system performance, although the size of this effect depends on the nature of the switch being performed (Fig. 3).9

Fig. 3 Thread context switch.

This benchmark measures the context switch time by creating two threads, which are continuously context switched 500,000 times. The benchmark records the total elapsed time for these switches, which is then divided by the number of context switches to give the average time for a context switch.
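The source does not list the code for this benchmark. The following is a minimal sketch of one common way to force two threads to alternate, using a mutex and a condition variable as a turn token. It is not the authors' implementation, and the names ping_pong, average_handoff_ns, turn, and handoffs_done are hypothetical. The measured interval also includes thread creation, locking, and signaling overhead, and on a multi-core host a hand-off may be a cross-core wake-up rather than a context switch on a single CPU.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <time.h>

static pthread_mutex_t turn_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t turn_changed = PTHREAD_COND_INITIALIZER;
static int turn;                /* id (0 or 1) of the thread allowed to run */
static long handoffs_done;      /* number of completed hand-offs */
static long rounds_per_thread;

/* Wait for our turn, then pass the turn to the other thread. */
static void *ping_pong(void *arg)
{
    int me = *(int *)arg;

    for (long i = 0; i < rounds_per_thread; i++) {
        pthread_mutex_lock(&turn_lock);
        while (turn != me)
            pthread_cond_wait(&turn_changed, &turn_lock);
        turn = 1 - me;
        handoffs_done++;
        pthread_cond_signal(&turn_changed);
        pthread_mutex_unlock(&turn_lock);
    }
    return NULL;
}

/* Return the average elapsed time per hand-off, in nanoseconds. */
double average_handoff_ns(long rounds)
{
    pthread_t threads[2];
    int ids[2] = {0, 1};
    struct timespec start, stop;

    turn = 0;
    handoffs_done = 0;
    rounds_per_thread = rounds;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < 2; i++) {
        int rc = pthread_create(&threads[i], NULL, ping_pong, &ids[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < 2; i++)
        pthread_join(threads[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &stop);

    double elapsed_ns = (double)(stop.tv_sec - start.tv_sec) * 1e9
                      + (double)(stop.tv_nsec - start.tv_nsec);
    return elapsed_ns / (double)handoffs_done;
}
```

Calling average_handoff_ns(250000) would perform 500,000 hand-offs in total, matching the iteration count described above, although the resulting figure is only a rough upper bound on the cost of a switch.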

A.4 Analysis Results

Figs. 4 and 5 contain the scores reported by the aforementioned benchmarks. Three iterations were performed for each benchmark. As can be seen, the results using the Zephyr™ OS are deterministic. Each time you execute the benchmark on the Zephyr™ OS with i.MX RT1050 EVK, the results will be the same (Fig. 4).

Fig. 4 Benchmark results on the i.MX RT1050 EVK with the Zephyr™ OS.
Fig. 5 Benchmark results on the i.MX 6UL EVK with the Linux BSP.

With the Linux BSP running on the i.MX 6UL, the results varied from run to run with a deviation from average values of up to 9% (Fig. 5).
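As a sketch of how a "deviation from average" figure of this kind could be computed, the function below returns the largest relative distance of any run from the mean of all runs, as a percentage. The function name max_deviation_percent is hypothetical, and the study does not state the exact formula it used, so this is an assumption about the method.

```c
#include <assert.h>

/* Largest |sample - mean| / mean over all samples, in percent.
 * Assumes count > 0 and a non-zero mean. */
double max_deviation_percent(const double *samples, int count)
{
    double sum = 0.0;
    double worst = 0.0;

    assert(count > 0);
    for (int i = 0; i < count; i++)
        sum += samples[i];

    double mean = sum / count;
    assert(mean != 0.0);

    for (int i = 0; i < count; i++) {
        double d = samples[i] - mean;
        if (d < 0.0)
            d = -d;                    /* absolute distance from the mean */
        double percent = d * 100.0 / mean;
        if (percent > worst)
            worst = percent;
    }
    return worst;
}
```

For three hypothetical runs of 100, 109, and 91 time units, the mean is 100 and the largest deviation is 9%.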

The table below highlights the average time calculated from these benchmark iterations. The average time here is calculated in cycles (lower is better).

| OS | Zephyr™ OS 1.11.99 | Linux BSP 4.9.88-imx_4.9.88_2.0.0_ga | Difference (as a multiple) |
|---|---|---|---|
| Board name | i.MX RT1050 EVK | i.MX 6UL EVK | |
| CPU cores | 1 | 1 | |
| Core frequency (MHz) | 600 | 528 | |
| Average heap malloc time (cycles) | 1001 | 11,499 | 11x |
| Average heap free time (cycles) | 1126 | 4870 | 4x |
| Average pthread_mutex_lock time (cycles) | 53 | 799 | 15x |
| Average pthread_mutex_unlock time (cycles) | 83 | 818 | 10x |
| Average pthread_create time (cycles) | 719 | 85,478 | 118x |
| Average pthread_join time (cycles) | 1702 | 89,219 | 52x |
| Average context switch time (cycles) | 47 | 1284 | 27x |

According to this data, the Zephyr™ OS running on the i.MX RT1050 required significantly fewer cycles for every measured operation than the Linux BSP running on the i.MX 6UL EVK. More specifically, the microbenchmarks detailed in this section illustrate to the system developer that the Zephyr™ OS running on the i.MX RT1050 provides key performance improvements in heap allocation, the use of mutexes, thread creation and join times, as well as context switching. As most embedded solutions developers will appreciate, these are often considered key building blocks in the overall design and implementation of solutions and system-level applications.
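Because the two cores run at different frequencies (600 MHz and 528 MHz), a ratio of cycle counts is not identical to a ratio of elapsed times. The helpers below sketch the arithmetic involved: converting a nanosecond measurement to cycles, converting cycles to microseconds, and forming the multiple reported in the table. The function names are hypothetical, and the conversion shown is an assumption about how a time measurement could be expressed in cycles rather than a description of the authors' tooling.

```c
#include <assert.h>

/* Cycles elapsed during `ns` nanoseconds on a core clocked at `core_mhz`. */
double ns_to_cycles(double ns, double core_mhz)
{
    return ns * core_mhz / 1000.0;
}

/* Elapsed microseconds for `cycles` on a core clocked at `core_mhz`. */
double cycles_to_us(double cycles, double core_mhz)
{
    return cycles / core_mhz;
}

/* How many times more cycles the slower result took. */
double cycle_ratio(double slower_cycles, double faster_cycles)
{
    assert(faster_cycles > 0.0);
    return slower_cycles / faster_cycles;
}
```

For the context switch row, cycle_ratio(1284, 47) is about 27.3, matching the 27x in the table. Converted to time, 47 cycles at 600 MHz is about 0.078 µs and 1284 cycles at 528 MHz is about 2.43 µs, so the gap measured in elapsed time is somewhat larger, roughly 31x.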

A.5 Summary and Conclusions

Key contributions of this section are detailed below:

  1. A performance analysis was completed by running custom microbenchmarks on two different hardware and software solutions.
  2. Benchmarks were developed around a common API to ensure comparable results.
  3. Different functions were used for collecting elapsed time: clock_gettime() on Linux and TIMING_INFO_PRE_READ() on the Zephyr™ OS.
  4. Compared with the Linux BSP and i.MX 6UL, the Zephyr™ OS and i.MX RT1050 EVK is:
    a. 27 times faster in context switching.
    b. Up to 11 times faster in allocating and deallocating memory.
    c. Up to 15 times faster in locking and unlocking mutexes using the Pthreads library.
    d. Faster at creating and joining threads (118 and 52 times, respectively).
    e. Better at providing additional performance at a lower cost.
  5. The Zephyr™ OS with the i.MX RT1050 EVK board presents a predictable execution time, offering the possibility for use in applications with various timing constraints.

In summary, this section introduces the use of benchmarks and microbenchmarks as yet another tool in the embedded systems developer’s tool chest. By coupling the use of strategically written microbenchmarks with other system-level monitoring and metrics collection, embedded systems developers can garner key insights into the performance of various hardware and software components within a given solution, as well as across competing solutions within a given market. With the ability to optimize and tune the development of features in the overall system, embedded systems developers can strategically focus development and optimization time when bringing products and solutions to market. In addition, by benchmarking multiple systems with identical benchmarks, systems architects and application developers can assess performance differences between the hardware and software capabilities of competing market solutions. By doing so, systems architects and developers can select the appropriate hardware and software solutions for their particular application.
