Chapter 18

Performance Tuning

Performance tuning is one of the black arts of embedded system development. You will almost certainly spend some portion of your development schedule on optimization and performance activities. Unfortunately, these activities usually seem to occur when the ship date is closing in and everyone is under the most pressure.

However, help is at hand. We have developed a useful toolbox of tricks and techniques for performance tuning, which are summarized in this chapter. These best-known methods are presented in “pattern” form. Most of the techniques described here are generic performance tuning techniques. The optimization and performance tuning patterns have been organized under the following headings:

General approaches

Code and design

Processor specifics

Networking techniques

What are Patterns?

Each performance improvement suggestion is documented in the form of a pattern (Alexander, 1979; Gamma et al., 1995). A pattern is “a solution to a problem in a context,” a literary mechanism to share experience and impart solutions to commonly occurring problems. Each pattern contains these elements:

Name—For referencing a problem/solution pairing.

Context—The circumstance in which we solve the problem that imposes constraints on the solution.

Problem—The specific problem to be solved.

Solution—The proposed solution to the problem. Many problems can have more than one solution, and the “goodness” of a solution to a problem is affected by the context in which the problem occurs. Each solution takes certain forces into account. It resolves some forces at the expense of others. It may even ignore some forces.

Forces—The often-contradictory considerations we must take into account when choosing a solution to a problem.

A pattern language is the organization and combination of a number of interrelated patterns. Where one pattern references another pattern we use the following format “(see Pattern Name).”

You may not need to read each and every pattern. You certainly do not need to apply all of the patterns to every performance optimization task. You might, however, find it useful to scan all of the context and problem statements to get an overview of what is available in this pattern language.

General Approaches

This first set of patterns proposes general approaches and tools you might use when embarking on performance tuning work. These patterns are not specific to any processor or application.

Defined Performance Requirement

Context: You are a software developer starting a performance improvement task on an application or driver.

Problem: Performance improvement work can become a never-ending task. Without a goal, the activity can drag on longer than productive or necessary.

Solution: At an early stage of the project or customer engagement, define a relevant, specific, realistic, and measurable performance requirement. Document that performance requirement as a specific detailed application and configuration with a numerical performance target.

“Make it as fast as possible” is not a specific performance requirement.

“The application must be capable of 10-gigabit per second wire-speed routing of 64-byte packets with a 600-megahertz CPU” is not a realistic performance requirement.

Forces:

A performance target can be hard to define.

Waiting to have a goal might affect your product’s competitiveness.

A performance target can be a moving target; competitors do not stand still. New competitors come along all the time.

Without a goal, the performance improvement work can drag on longer than is productive.

Performance Design

Context: You are a software developer designing a system. You have a measurable performance requirement (see Defined Performance Requirement).

Problem: The design of the system does not meet the performance requirement.

Solution: At design time, describe the main data path scenario. Walk through the data path in the design workshop and document it in the high-level design.

When you partition the system into components, allocate a portion of the clock cycles to the data path portion of each component. Have a target at design time for the clock cycle consumption of the whole data path. Ganssle (1999) gives notations and techniques for system design performance constraints.

During code inspections, hold one code inspection that walks through the most critical data path.

Code inspections are usually component based. This code inspection should be different and follow the scenario of the data path.

If you use a polling mechanism, ensure that the CPU is shared appropriately.

It can also be useful to analyze the application’s required bus bandwidth at design time to decide if the system will be CPU or memory bandwidth/latency limited. If available, you should use performance modeling environments during this phase.
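As a purely illustrative budget calculation (the numbers are assumptions, not a product target): a 600-megahertz core forwarding 64-byte packets at 100 megabits per second wire speed must handle roughly 148,800 packets per second (each frame occupies 64 + 8 + 12 = 84 byte times on the wire), which leaves about 600,000,000 / 148,800 ≈ 4,000 cycles per packet to allocate across the driver, stack, and forwarding components of the data path.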

Forces:

It can be difficult to anticipate some system bottlenecks at design time.

The design of the system can make it impossible to meet the performance requirement. If you discover this late in the project, it might be too difficult to do anything about it.

Premature Code Tuning Avoided

Context: You are implementing a system, and you are in the coding phase of the project. You do have a good system-level understanding of the performance requirements and the allocation of performance targets to different parts of the system because you have a performance design (see Performance Design).

Problem: It is difficult to know how much time or effort to spend thinking about performance or efficiency when initially writing the code.

Solution: It is important to find the right balance between performance, functionality, and maintainability.

Some studies have found that 20% of the code consumes 80% of the execution time; others have found less than 4% of the code accounts for 50% of the time (McConnell, 1993).

KISS—keep it simple and straightforward. Until you have measured and can prove that a piece of code is a system-wide bottleneck, do not optimize it. Simple design is easier to optimize. The compiler finds it easier to optimize simple code.

If you are working on a component of a system, you should have a performance budget for your part of the data path (see Performance Design).

In the unit test, you could have a performance test for your part of the data path. At integration time, the team could perform a performance test for the complete assembled data path.

The best is the enemy of the good. Working toward perfection may prevent completion. Complete it first, then perfect it. The part that needs to be perfect is usually small.

—Steve McConnell

For further information, see Chapters 28 and 29 of Code Complete (McConnell, 1993) and question 20.13 in the comp.lang.c FAQ web site (Summit, 1995).

Forces:

Efficient code is not necessarily “better” code. It might be difficult to understand and maintain.

It is almost impossible to identify performance bottlenecks before you have a working system.

If you spend too much time doing micro-optimization during initial coding, you might miss important global optimizations.

If you look at performance too late in a project, it can be too late to do anything about it.

Step-by-Step Records

Context: You are trying a number of optimizations to fix a particular bottleneck. The system contains a number of other bottlenecks.

Problem: Sometimes it is difficult when working at a fast pace to remember optimizations made only a few days earlier.

Solution: Take good notes of each experiment you have tried to identify bottlenecks and each optimization you have tried to increase performance. These notes can be invaluable later. You might find you are stuck at a performance level with an invisible bottleneck. Reviewing your optimization notes might help you identify incorrect paths taken or diversionary assumptions.

When a performance improvement effort is complete, it can be very useful to have notes on the optimization techniques that worked. You can then put together a set of best-known methods to help other engineers in your organization benefit from your experience.

Forces:

Writing notes can sometimes break the flow of work or thought.

Slam-Dunk Optimization

Context: You have made a number of improvements that have increased the efficiency of code running on the processor core.

Problem: The latest optimizations have not increased performance. You have hit some unidentified performance-limiting factor. You might have improved performance to a point where environmental factors, protocols, or test equipment is now the bottleneck.

Solution: It is useful to have a code modification identified that you know should improve performance. For example:

An algorithm on the data path that can be removed temporarily, such as the IP checksum.

Increasing the processor clock speed.

In one application, we implemented a number of optimizations that should have improved performance but did not. We then removed the IP checksum calculation and performance still did not increase. These results pointed to a hidden limiting factor, an unknown bottleneck. When we followed this line of investigation, we found a problem in the way we configured a physical layer device, and when we fixed this hidden limiting factor, performance improved immediately by approximately 25%. We retraced our steps and reapplied the earlier changes to identify the components of that performance improvement.

Forces:

Increasing the processor clock speed improves performance only for CPU-bound applications.

Best Compiler for Application

Context: You are writing an application using a compiler. You have a choice of compilers for the processor architecture you are using.

Problem: Different compilers generate code that has different performance characteristics. You need to select the right one for your application and target platform.

Solution: Experiment with different compilers and select the best performing compiler for your application and environment.

Performance can vary between compilers and between versions of the same compiler. GCC is an excellent compiler for general use, but vendor compilers often include highly tuned, processor-specific microarchitectural optimizations. The difference can be on the order of 5–10% for certain applications.

Forces:

Some compilers are more expensive than others.

Some compilers and operating systems might not match. For example, the compiler you want to use might generate the wrong object file format for your tool chain or development environment.

A particular compiler might optimize a particular benchmark better than another compiler, but that is no guarantee that it will optimize your specific application in the same way.

You might be constrained in your compiler choice because of tools support issues. If you are working in the Linux kernel, you might have to use GCC. Some parts of the kernel use GCC-specific extensions.

Compiler Optimizations

Context: You have chosen to use a C compiler (see Best Compiler for Application).

Problem: You have not enabled all of the compiler optimizations.

Solution: Your compiler supports a number of optimization switches. Using these switches can increase global application performance for a small amount of effort. Read the documentation for your compiler and understand these switches.

In general, the highest-level optimization switch is the -O switch. In GCC, the switch takes a numeric parameter. Find out the maximum parameter for your compiler and use it. Typical compilers support three levels of optimization. Try the highest. In GCC, the highest level is -O3. However, in the past -O3 code generation had more bugs than -O2, the most-used optimization level. The Linux kernel is compiled with -O2. If you have problems at -O3, you might need to revert to -O2.

Moving from -O2 to -O3 made an improvement of approximately 15% in packet processing in one application tested. In another application, -O3 was slower than -O2.

You can limit the use of compiler optimizations to individual C source files.

Introduce optimization flags, one by one, to discover the ones that give you benefit.

Other GCC optimization flags that can increase performance are

 -funroll-loops

 -fomit-frame-pointer
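For example, the following sketch (assuming GCC; the function is hypothetical) raises the optimization level of one measured hot spot while the rest of the project is built at -O2:

/* The project's global CFLAGS remain -O2; only this function,
 * identified as a hot spot by measurement, is compiled at -O3. */
__attribute__((optimize("O3")))
int sum_block(const int *data, int count)
{
    int total = 0;
    for (int i = 0; i < count; i++)
        total += data[i];
    return total;
}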

Forces:

Generally, optimizations increase generated code size.

Some optimizations might not increase performance.

Compilers support a large number of switches and options. It can be time consuming to read the lengthy documentation.

Optimized code is difficult to debug.

Some optimizations can reveal compiler bugs, as they are not as frequently used as the normal default options.

Enabling optimization can change timings in your code. It might reveal latent undiscovered problems.

Data Cache

Context: You are using a processor that contains a data cache. The core is running faster than memory or peripherals.

Problem: The processor core is spending a significant amount of time stalled waiting on external memory, which degrades performance. You have identified this problem using the performance monitoring facilities of your chosen processor to quantify the number of cycles for which the processor is stalled.

In some applications, we have observed that a significant number of cycles are lost to data-dependency stalls.

Solution:

In general, the most efficient mechanism for accessing memory is to use the data cache. Core accesses to cached memory are several times faster than accesses to DRAM. In addition, a cache hit does not need to use the internal/external bus, leaving it free for other devices such as high-speed I/O.

The cache unit can make efficient use of the memory bus. On IA-32 processors, the core fetches an entire 64-byte cache line using special memory burst cycles when accessing memory. This is far more efficient than issuing separate 32-bit data reads.

The cache supports several features that give you flexibility in tailoring the system to your design needs. These features affect all applications to some degree; however, the optimal settings are application dependent. It is critical to understand the effects of these features and how to fine-tune them for the usage model of a particular application. We cover a number of these cache features in later sections.

In one application (using an RTOS) that was not caching buffer descriptors and packet data, developers enabled caching and saw an approximate 25% improvement in packet-processing performance. Choose data structures appropriate for a data cache. For example, stacks are typically more cache efficient than linked list data structures.

In most of the applications we have seen, the instruction cache is very efficient. It is worth spending time optimizing the use of the data cache.

Forces:

On IA-32 systems, if you cache data memory that the core shares with another bus master, cache coherence with I/O is maintained in hardware; on some other architectures, however, you must manage cache flushes and invalidations explicitly.

If you use caching, it is best to ensure that two distinct data sets never share the same cache line. Inadvertent cache line sharing, commonly called false sharing, arises when two agents (CPU threads or I/O devices) access different data that happen to reside in the same cache line. A cache-coherent architecture such as Intel's must then resolve the conflict, which can cause a temporary stall in execution and degrade performance.

Be careful what you cache. Temporal locality refers to the amount of time between accesses to the data. If you access a piece of data once or access it infrequently, it has low temporal locality. If you mark this kind of data for caching, the cache replacement algorithm can cause the eviction of performance-sensitive data by this lower-priority data.

Many processors implement a simple round-robin line replacement algorithm, which limits your control over which lines are evicted.

Code and Design

This section covers some general code tuning guidelines that are applicable to most processors. In many cases, these optimizations can decrease the readability, maintainability, or portability of your code. Be sure you are optimizing code that needs optimization (see Premature Code Tuning Avoided).

Reordered Struct

Context: You have identified a bottleneck segment of code on your application data path. The code uses a large struct.

Problem: The struct spans a number of cache lines.

Solution: Reorder the fields in a struct to group the frequently accessed fields together. If all of the accessed fields fit on a cache line, the first access pulls them all into a cache, potentially avoiding data-dependency stalls when accessing the other fields. Organize all frequently written fields into the same cache line.

Some architectures are sensitive to variable alignment on particular address boundaries. Whenever possible, align variables on the appropriate boundary using compiler pragmas or attributes.
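A minimal sketch of a reordered struct (the field names are hypothetical, and the 64-byte line size and aligned attribute assume a GCC-style toolchain):

/* Before: the hot fields (next, length, flags) are separated by
 * rarely used statistics, so touching one packet spans several
 * cache lines. */
struct conn_before {
    void          *next;
    unsigned long  rx_errors;   /* rarely read */
    unsigned long  tx_errors;   /* rarely read */
    char           name[32];    /* rarely read */
    unsigned short length;
    unsigned short flags;
};

/* After: the frequently accessed fields are grouped first and the
 * struct is aligned on a cache-line boundary; cold fields follow. */
struct conn_after {
    void          *next;
    unsigned short length;
    unsigned short flags;
    /* ---- cold fields below ---- */
    unsigned long  rx_errors;
    unsigned long  tx_errors;
    char           name[32];
} __attribute__((aligned(64)));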

Forces:

Reordering structs might not be feasible. Some structs might map to a packet definition.

Multiprocessor access to the same cache line generates additional interprocessor coherence traffic. If possible, data used by different processors should be split into separate per-processor instances.

Supersonic Interrupt Service Routines

Context: Your application uses multiple interrupt service routines (ISRs) to signal the availability of data on an interface and trigger the processing of that data.

Problem: Interrupt service routines can interfere with other ISRs and real-time processing work such as packet processing code.

Solution: Keep ISRs short. Design them to be re-entrant.

For example, an ISR should just give a semaphore, set a flag, or enqueue a packet. You should dequeue and process the data outside the ISR. This way, you avoid the need for interrupt locks around data accessed in the ISR.

Interrupt locks in a frequent ISR can have hard-to-measure effects on the overall system.

For more detailed interrupt design guidelines, see Doing Hard Time (Douglass, 1999).
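The following sketch illustrates the idea; the queue, semaphore, and driver calls (pkt_queue_put, sem_give, hw_get_rx_packet, and so on) are hypothetical primitives, not a specific RTOS API:

/* ISR: do the minimum: capture the packet pointer and signal. */
void rx_isr(void *dev)
{
    struct packet *pkt = hw_get_rx_packet(dev);  /* hypothetical */
    pkt_queue_put(&rx_queue, pkt);               /* short, non-blocking */
    sem_give(&rx_sem);                           /* wake the task below */
}

/* Task context: all real processing happens outside the ISR. */
void rx_task(void)
{
    struct packet *pkt;

    for (;;) {
        sem_take(&rx_sem);
        while ((pkt = pkt_queue_get(&rx_queue)) != NULL)
            process_packet(pkt);
    }
}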

Forces:

Posting to a semaphore or queue can cause extra context switches, which reduce the overall efficiency of a system.

Bugs in ISRs usually have a catastrophic effect. Keeping them short and simple reduces the probability of bugs.

Assembly-Language-Critical Functions

Context: You have identified a C function that consumes a significant portion of the data path.

Problem: The code generated for this function might not be optimal for your processor.

Solution: Re-implement the critical function directly in assembly language.

Use the best compiler for the application (see Best Compiler for Application) to generate initial assembly code, then hand-optimize it.

Forces:

Modern compiler technology is beginning to out-perform the ability of humans to optimize assembly language for sophisticated processors.

Assembly language is more difficult to read and maintain.

Assembly language is more difficult to port to other processors.

Inline Functions

Context: You have identified a small C function that is frequently called on the data path.

Problem: The overhead associated with function entry and exit can become significant for a small function that is called frequently on the application data path.

Solution: Declare the function inline. This way, the function gets inserted directly into the code of the calling function.
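For example (a hypothetical byte-swap helper; with GCC, a static inline function in a header is the usual idiom):

/* The body is expanded at each call site, removing the call and
 * return overhead for this tiny data-path helper. */
static inline unsigned short swap16(unsigned short v)
{
    return (unsigned short)((v << 8) | (v >> 8));
}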

Forces:

Inline functions can increase the code size of your application and add stress to the instruction cache.

Some debuggers have difficulty showing the thread of execution for inline functions.

A function call itself can limit the compiler’s ability to optimize register usage in the calling function.

Cache-Optimizing Loop

Context: You have identified a critical loop that is a significant part of the data-path performance.

Problem: The structure of the loop, or the data on which it operates, could be thrashing the data cache.

Solution: You can consider a number of loop/data optimizations:

Array merging—when the loop walks two or more parallel arrays, merge them into a single array of structs so that related elements share a cache line (see the sketch after this list).

Induction variable interchange—induction variables are variables that are increased or decreased by a fixed amount on each iteration of a loop (for example, the i in a for loop).

Loop fusion—combining adjacent loops that iterate over the same range into a single loop body.
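A sketch of array merging (the names and sizes are illustrative): instead of walking two parallel arrays, which can pull in two unrelated cache lines per element, merge them so that each iteration touches related data in one line.

#define TABLE_SIZE 256

/* Before: key[i] and value[i] live in separate arrays. */
int key[TABLE_SIZE];
int value[TABLE_SIZE];

/* After: each entry's key and value share a cache line. */
struct entry {
    int key;
    int value;
} table[TABLE_SIZE];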

Forces:

Loop optimizations can make the code harder to read, understand, and maintain.

Minimizing Local Variables

Context: You have identified a function that needs optimization. It contains a large number of local variables.

Problem: A large number of local variables might incur the overhead of storing them on the stack. The compiler might generate code to set up and restore the frame pointer.

Solution: Minimize the number of local variables. The compiler may be able to store all the locals and parameters in processor registers.

Forces:

Removing local variables can decrease the readability of code or require extra calculations during the execution of the function.

Explicit Registers

Context: You have identified a function that needs optimization. A local variable or a piece of data is frequently used in the function.

Problem: Sometimes the compiler does not identify a register optimization.

Solution: It is worth applying explicit register hints to local variables that are frequently used in a function.

It can also be useful to copy a frequently used part of a packet that is also used frequently in a data path algorithm into a local variable declared register. An optimization of this kind made a performance improvement of approximately 20% in one real application.

Alternatively, you could add a local variable or register to explicitly “cache” a frequently used global variable. Some compilers will not keep a global variable in a register or local copy on their own because they cannot prove it is safe. If you know the global is not modified by an interrupt handler and the global is modified a number of times in the same function, copy it to a register local variable, make updates to the local, and then write the new value back to the global before exiting the function. This technique is especially useful when updating packet statistics in a loop handling multiple packets.
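A sketch of the global-caching technique (rx_packet_count, struct packet, and handle_packet are hypothetical and assumed to be defined elsewhere; the global must not be touched by an interrupt handler):

extern unsigned long rx_packet_count;   /* hypothetical global counter */

void process_burst(struct packet **pkts, int n)
{
    register unsigned long count = rx_packet_count; /* read the global once */
    int i;

    for (i = 0; i < n; i++) {
        handle_packet(pkts[i]);
        count++;                        /* update the local copy only */
    }

    rx_packet_count = count;            /* write back once on exit */
}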

Forces:

The register keyword is only a hint to the compiler.

Optimized Hardware Register Use

Context: The data path code does multiple reads or writes to one or more hardware registers.

Problem: Multiple read-operation-writes on hardware registers can cause the processor to stall.

Solution: First, break up read-operation-write statements to hide some of the latencies when dealing with hardware registers. For example:

Read-operation-writes on hardware registers:

*reg1ptr |= 0x0400;

*reg2ptr &= ~0x80;

Optimized read-operation-writes:

reg1 = *reg1ptr;

reg2 = *reg2ptr;

reg1 |= 0x0400;

reg2 &= ~0x80;

*reg1ptr = reg1;

*reg2ptr = reg2;

This modified code eliminates one of the read dependency stalls.

Second, search the data path code for multiple writes to the same hardware register. Combine all the separate writes into a single write to the actual register. For example, some applications disable hardware interrupts using multiple set/resets of bits in the interrupt enable register. In one such application, when we manually combined these write instructions, performance improved by approximately 4%. This is particularly important on IA-32 systems with strongly ordered memory models.
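A sketch of the combined write (irq_enable_reg and the bit masks are hypothetical):

/* Before: three separate bus writes to the same register. */
*irq_enable_reg &= ~RX_IRQ;
*irq_enable_reg &= ~TX_IRQ;
*irq_enable_reg &= ~ERR_IRQ;

/* After: one read, one combined update, one bus write. */
unsigned int mask = *irq_enable_reg;
mask &= ~(RX_IRQ | TX_IRQ | ERR_IRQ);
*irq_enable_reg = mask;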

Forces:

Manually separated read-operation-write code expands code slightly. It can also add local variables and could trigger the creation of a frame pointer.

Avoiding the OS Buffer Pool

Context: The application uses a system buffer pool.

Problem: Memory allocation or calls to buffer pool libraries can be processor intensive. In some operating systems, these functions lock interrupts and use local semaphores to protect simultaneous access to shared heaps.

Pay special attention to buffer management at design time. Are buffers being allocated on the application data path?

Solution: Avoid allocating or interacting with the RTOS packet buffer pool on the data path. Pre-allocate packet buffers outside the data path and store them in lightweight software pools/queues.

Stacks or arrays are typically faster than linked lists for packet buffer pool collections because they require fewer memory accesses to add and remove buffers. Stacks also improve data cache utilization.
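A minimal sketch of such a lightweight pool (the names are illustrative; the pool is assumed to be filled at initialization time and used from a single context or protected externally):

#define POOL_SIZE 256

static struct packet *free_stack[POOL_SIZE]; /* filled outside the data path */
static int            free_top;              /* number of buffers available */

/* O(1) allocation on the data path: pop from the stack. */
static inline struct packet *buf_alloc(void)
{
    return (free_top > 0) ? free_stack[--free_top] : NULL;
}

/* O(1) free on the data path: push back onto the stack. */
static inline void buf_free(struct packet *pkt)
{
    if (free_top < POOL_SIZE)
        free_stack[free_top++] = pkt;
}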

Forces:

OS buffer pools implement buffer collections. Writing another light collection duplicates functionality.

C Language Optimizations

Context: You have identified a function or segment of code that is consuming a significant portion of the CPU clock cycles on the data path. You might have identified this code using profiling tools (see Profiling Tools) or a performance measurement.

Problem: A function or segment of C code needs optimization.

Solution: You can try a number of C language level optimizations:

Pass large function parameters by reference, never by value. Values take time to copy.

Avoid array indexing. Use pointers.

Minimize loops by collecting multiple operations into a single loop body.

Avoid long if-then-else chains. Use a switch statement or a state machine.

Use int (natural word size of the processor) to store flags rather than char or short.

Use unsigned variants of variables and parameters where possible. Doing so might allow some compilers to make optimizations.

Avoid floating-point calculations on the data path.

Use decrementing loop variables, for example,

for (i = 10; i--; ) { /* do something */ }

or even better

do { /* something */ } while (i--);

Look at the code generated by your compiler in this case.

Adjust structure sizes to a power of two.

Place the most frequently true statement first in if-else statements.

Place frequent case labels first.

Write small functions. The compiler likes to reuse registers as much as possible and cannot do it in complex nested code. However, some compilers automatically use a number of registers on every function call. Extra function call entries and returns can cost a large number of cycles in tight loops.

Use the function return as opposed to an output parameter to return a value to a calling function. Return values on many processors are stored in a register by convention.

For critical loops, use Duff’s device, a devious, generic technique for unrolling loops (a sketch follows after this list). See question 20.35 in the comp.lang.c FAQ (Summit, 1995).

For other similar tips, see Chapter 29 in Code Complete (McConnell, 1993).
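For reference, the classic form of Duff’s device, written here as a generic memory-to-memory copy (count is assumed to be greater than zero):

void copy_words(short *to, const short *from, int count)
{
    int n = (count + 7) / 8;            /* passes through the unrolled body */

    switch (count % 8) {                /* jump into the middle of the loop */
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}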

Forces:

Good compilers make a number of these optimizations.

Disabled Counters/Statistics

Context: Many applications have a large number of statistics/counters associated with the data path in the application. You have completed integration testing of your application.

Problem: The software keeps a number of counters and statistics to facilitate integrating and debugging of both the components and the system as a whole. These counters usually incur a read and write or increment in main, or possibly cached, memory.

The access layer also contains code that checks parameters for legal values. This feature facilitates the integration and debugging of the software and the system as a whole. These checks usually test conditions that never occur once the system and customer code have been fully tested and integrated.

Solution: Check whether the code provides macros that disable the compilation of debug counters, and use them. Doing so also removes many of the internal parameter debug checks.

In one application, use of this pattern increased packet-processing throughput by up to 4%.
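One common approach (the macro names are illustrative) is to route every counter update through a macro that compiles to nothing when the debug build flag is absent:

#ifdef ENABLE_DEBUG_STATS
#define STAT_INC(counter)   ((counter)++)
#else
#define STAT_INC(counter)   ((void)0)   /* compiles away entirely */
#endif

/* Data-path code stays readable either way:
 *     STAT_INC(stats.rx_packets);
 */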

Forces:

Disabling counters and statistics removes useful debugging information.

Removing parameter checks can obscure an issue, making it harder to detect incorrect parameter use in customer code.

Processor-Specific

For IA-32 platforms, Intel has developed the Intel® 64 and IA-32 Architectures Optimization Reference Manual. In particular, Chapter 12 covers Intel Atom™-specific optimizations.

Stall Instructions

Context: You have run some tests using performance measurement tools (for example, the Intel VTune™ performance tools for your target) that indicate that a large number of core cycles are being lost to data dependency stalls.

Problem: You might find a large portion of the cycles is lost to stalls on fast processors. You need to identify the pieces of code that are causing these stalls.

Solution: One simple way to identify “hot” instructions is to use a program counter sampler. The sampler runs at a regular interval and counts how often each instruction or program-counter value is observed while the networking performance test is running.

To reduce the impact of these stalls you could use the Prefetch instruction (see Prefetch Instructions). You could also move code that won’t cause a stall before the code that does.

Forces:

Adding sampling code can affect the behavior of the system under test.

Profiling Tools

Context: You are at an early stage of performance improvement. You have not identified a specific bottleneck, but you have proven that the current bottleneck is the speed of execution of the code on the processor core.

Problem: You have a working system that is not meeting a performance requirement. You suspect that raw algorithmic processing power is the current bottleneck; you need to identify the bottleneck code.

Solution: A number of profiling tools exist to help you identify code hotspots. Typically, they identify the percent of time spent in each C function in your code base.

The Intel VTune tools are performance characterization tools that provide significant detail on application behavior.

Rational Quantify™ contains an excellent performance profiler, not to mention Purify™, the memory corruption/leak checker. You instrument your code and then execute that code on the target platform.

Gprof is available for many Linux-based systems.

Some JTAG debug tools contain profiling features.

Forces:

Profiling tools can affect the performance of the system.

Some tools might not be available for your RTOS.

Some profiling tools cost money.

Each of these tools has a learning curve but could pay back the time and money investment.

Prefetch Instructions

Context: You have identified a stall instruction (see Stall Instructions).

Problem: You want to reduce the time the processor spends stalled due to a data dependency.

Solution: The IA-32 architecture provides prefetch hint instructions (PREFETCHT0/T1/T2 for data with temporal locality and PREFETCHNTA for non-temporal data). The purpose of these instructions is to preload data into one or more levels of the cache hierarchy. The prefetch instruction is only a hint.

Data prefetching allows hiding of memory transfer latency while the processor continues to execute instructions. The judicious use of the prefetch instruction can improve throughput of the processor.

Look at the line of C code that generates the stall instruction (see Stall Instructions).

Insert an explicit assembly language prefetch instruction some time before the stall instruction (see Stall Instructions). Data prefetch can be applied not only to loops but also to any data references within a block of code.

Using prefetches requires careful experimentation. In some cases performance improves, and in others the performance degrades. Overuse of prefetches can use shared resources and degrade performance.

Spread prefetch operations over calculations so that bus traffic can flow freely and to minimize the number of prefetches needed.
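As an alternative to hand-placed assembly, GCC provides the __builtin_prefetch intrinsic; a sketch (struct desc and handle_descriptor are hypothetical):

void process_ring(struct desc *ring, int count)
{
    int i;

    for (i = 0; i < count; i++) {
        /* Hint: start fetching the next descriptor while working on
         * the current one (read access, high temporal locality). */
        if (i + 1 < count)
            __builtin_prefetch(&ring[i + 1], 0, 3);

        handle_descriptor(&ring[i]);
    }
}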

Forces:

Overuse of prefetches can use shared resources and degrade performance.

The placement of a prefetch instruction can be CPU speed specific. The latency to external memory when measured in cycles changes when you change the CPU speed.

The placement of the prefetch instruction can depend on the previous data pattern access to the data.

Separate DRAM Memory Banks

Context: You have completed a performance measurement. These data have identified a significant percentage of cycles lost due to data dependency stalls.

Problem: DRAMs are typically divided into four or eight banks. Thrashing occurs when subsequent memory accesses within the same memory bank access different pages. The memory page change adds three to four bus clock cycles to memory latency (tens of nanoseconds).

Solution: You can resolve this type of thrashing either by placing the conflicting data structures into different memory banks or by packing the data structures so that the conflicting data resides within the same memory page. Either action can reduce the latency of reading data from memory and reduce the extent of many stalls.

Allocate data buffers in their own bank. The DRAM controller can keep a page open in each of the four or eight memory banks. You could also split data buffers across two banks.

It is also important to ensure that instruction and data sections are in different memory banks, or they might continually thrash the memory page selection.

In one networking application, this technique increased packet-processing performance by approximately 10%. In another, it had no effect.

Forces:

Writing code to use different banks for code and data, or to spread data across multiple banks, complicates your BSP and configuration code.

Line-Allocation Policy

Context: You are using data cache for data or packet memory. You have enabled the cache.

Problem: The cache line-allocation policy can affect the performance of your application.

Solution: The logic a processor uses to make a decision about placing new data into the cache is based on the line-allocation policy.

If the line-allocation policy is read-allocate, all load operations that miss the cache request a 64-byte cache line from external memory and allocate it. Store operations that miss the cache do not cause a line to be allocated.

With a read/write-allocate policy, load or store operations that miss the cache request a 64-byte cache line from external memory if the cache is enabled.

In general, regular data and the stack for your application should be placed in a read/write-allocate region because most applications regularly read and write this data. If your processor allows the allocation policy to be configured, it is worth experimenting.

Write-only data—or data that is written and subsequently not used for a long time—should be placed in a read-allocate region. Under the read-allocate policy, if a cache write miss occurs, a new cache line is not allocated and hence does not evict critical data from the data cache.

In general, read-allocate seems to be the best performing policy for packet data. One application had an improvement of approximately 10% when packet memory was set up read-allocate.

Forces:

The appropriate cache line-allocation policy can be application dependent. It is worth experimenting with both types of line-allocation policies.

Not all processors allow you to configure the allocation policy.

Cache Write Policy

Context: You are using data cache for data or packet memory. You have enabled the cache.

Problem: The cache write policy can affect the performance of your application.

Solution: Cached memory also has an associated write policy. A write-through policy instructs the data cache to keep external memory coherent by performing stores to both external memory and the cache. A write-back policy only updates external memory when a line in the cache is cleaned or needs to be replaced with a new line.

Generally, write-back provides higher performance because it generates less data traffic to external memory. However, if your application is making a small number of modifications, for example, to packet data or message buffers, write-through may be more efficient.

In a multiple-bus/master environment, you might have to use a write-through policy or explicit cache flushes if data are shared across multiple masters.

Forces:

The appropriate cache write policy can be application dependent. It is worth experimenting with both types of write policies.

In systems with multiple levels of cache (L1 and L2), the L2 inclusion policy may dictate the write policy. This is usually an acceptable trade-off, because having an L2 cache improves overall processor performance.

Cache-Aligned Data Buffers

Context: Your application/driver caches packet buffers and buffer descriptors.

Problem: You need to use the cache as effectively as possible. On some systems, the descriptors might be larger than a cache line.

Solution: Allocate key data on cache line boundaries. This action maximizes the use of cache when accessing these data structures.

You must make sure the descriptors and packet storage for different packets do not share the same cache line.
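A sketch of cache-line-aligned descriptor allocation (assuming a POSIX-style environment with posix_memalign and a 64-byte line; on an RTOS, the equivalent aligned-allocation call or a statically aligned array would be used instead):

#include <stdlib.h>

#define CACHE_LINE 64

/* Padding each descriptor to a whole cache line guarantees that two
 * descriptors never share a line. */
struct rx_desc {
    void          *buffer;
    unsigned short length;
    unsigned short flags;
} __attribute__((aligned(CACHE_LINE)));

struct rx_desc *alloc_desc_ring(size_t count)
{
    void *mem = NULL;

    /* The start of the ring is cache-line aligned. */
    if (posix_memalign(&mem, CACHE_LINE, count * sizeof(struct rx_desc)) != 0)
        return NULL;
    return (struct rx_desc *)mem;
}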

Forces:

You might waste some memory if the size of these data structures in your operating system is not divisible by the cache line size. Typically, this memory wastage is worth the increase in performance.

On-Chip Memory

Context: Your data path code makes frequent reference to a specific table or piece of data.

Problem: Accesses to these data are causing stalls because the cache is heavily used and the data are being frequently evicted. Most processors employ a round-robin or least-recently-used cache replacement policy; all cache data that is not locked is eventually evicted.

Solution: Some processor architectures support cache locking. You could consider locking key data/code elements into the cache.

Locking data from external memory into the data cache is useful for lookup tables, constants, and any other data that are frequently accessed.

Forces:

Locking data into the cache reduces the amount of cache available to the processor for general processing.

Optimized Libraries

Context: You are optimizing an application for the target processor. Your application uses plain C functions for key vector-processing algorithm elements in the system.

Problem: Some compilers do not take full advantage of SIMD instructions such as Intel® SSE/AVX when compiling vector-oriented code.

Solution: Intel and other silicon vendors provide optimized libraries that take full advantage of the SIMD engines available. The Intel libraries are known as Intel Integrated Performance Primitives (Intel IPP).

Modulo/Divide Avoided

Context: You are writing code for a processor.

Problem: The processor does not directly support modulo or divide instructions. When compiled, the code generates a call to a library support function.

Solution: You can translate some modulo or divide calculations into bit masks or shifts.

For example, a modulo by a power of two can be replaced with a mask: instead of (var % 8), use (var & 7). Likewise, you can convert some divisions by constants into shift and add instructions.

Most compilers should be capable of generating this optimization, but it might be worth examining generated code for any modulo or divide on your data path.
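For example, a power-of-two ring buffer index can be advanced with a mask instead of a modulo (a sketch; RING_SIZE must be a power of two):

#define RING_SIZE 256                 /* must be a power of two */
#define RING_MASK (RING_SIZE - 1)

unsigned int next_index(unsigned int idx)
{
    return (idx + 1) & RING_MASK;     /* same result as (idx + 1) % RING_SIZE */
}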

Forces:

Bit masks/shifts are less readable code than division or modulo.

Networking Techniques

The following patterns can be applied to networking performance in general.

Bottleneck Hunting

Context: You have a running functional system. You have a performance requirement (see Defined Performance Requirement). A customer is measuring performance lower than that requirement.

Problem: You can have a number of performance bottlenecks in the designed system, but unless you identify the current limiting factor, you might optimize the wrong thing. One component of the system might be limiting the flow of network packets to the rest of the system.

Solution: Performance improvement really starts with bottleneck hunting. It is only when you find the performance-limiting bottleneck that you can work on optimizations to remove the bottleneck. A system typically has a number of bottlenecks. You first need to identify the current limiting bottleneck, then remove it. You then need to iterate through the remaining bottlenecks until the system meets its performance requirements.

First, determine if your application is CPU or I/O bound. In a CPU-bound system, the limiting factor or bottleneck is the amount of cycles needed to execute some algorithm or part of the data path. In an I/O-bound system, the bottleneck is external to the processor. The processor has enough CPU cycles to handle the traffic, but the traffic flow is not enough to make full use of the available processor cycles.

To determine if the system is CPU or I/O bound, try running the processor at a number of different clock speeds. If you see a significant change in performance, your system is probably CPU bound.

Next, look at the software components of the data path; these might include the following:

Low-level device drivers specific to a piece of hardware. These device drivers could conform to an OS-specific interface.

A network interface service mechanism running on the CPU core. This mechanism might be a number of ISRs or a global polling loop.

Encapsulation layers of the networking stack.

The switching/bridging/routing engine of the networking stack or the RTOS.

If some algorithm in a low-level device driver is limiting the flow of data into the system, you might waste your time if you start tweaking compiler flags or optimize the routing algorithm.

It is best to look at the new or unique components to a particular system first. Typically, these are the low-level device drivers or the adapter components unique to this system.

Concentrate on the unique components first, especially if these components are on the edge of the system. In one wireless application we discovered that the wireless device driver was a bottleneck that limited the flow of data into the system.

Many components of a data path can contain packet buffers. Packet counters inserted in the code can help you identify queue overflows or underflows. Typically, the code that consumes the packet buffer is the bottleneck.

This process is typically iterative. When you fix the current bottleneck, you then need to loop back and identify the next one.

Forces:

Most systems have multiple bottlenecks.

Early bottleneck hunting—before you have a complete running system—increases the risk of misidentified bottlenecks and wasted tuning effort.

Evaluating Traffic Generator and Protocols

Context: You are using a network traffic generator and protocols to measure the performance of a system.

Problem: The performance test or protocol overheads can limit the measured performance of your application.

Solution: Identifying the first bottleneck is a challenge. First, you need to eliminate your traffic generators and protocols as bottlenecks and understand the fixed overheads they impose.

Typical components in a complete test system might include the following:

Traffic sources, sinks, and measurement equipment

The device under test (DUT) for which you are tuning the performance

Physical connections and protocols between traffic sources and the DUT

Your test environment might use a number of different types of traffic sources, sinks, and measurement equipment. You need to first make sure that they are not the bottleneck in your system.

Equipment, like Smartbits™ and Adtech™ testers, is not typically a bottleneck. However, using a PC with FTP software to measure performance can be a bottleneck. You need to test the PC and FTP software without the DUT to make sure that your traffic sources can reach the performance you require.

Running this test can also flush out bottlenecks in the physical media or protocols you are using.

In addition, you need to make sure that the overhead inherent in the protocols you are using makes the performance you require feasible. For example:

You cannot expect 100 megabits per second over Ethernet with 64-byte packets, due to the interframe gap and frame preamble. You can expect to get at most 76 megabits per second (the arithmetic is shown after this list).

You cannot expect to get 8 megabits per second over an ADSL link; you can expect to get at most 5.5 megabits per second.

You cannot expect to get 100 megabits per second on FTP running over Ethernet. You must take IP protocol overhead and TCP acknowledgements into account.

You cannot expect 52 megabits per second on 802.11a/g networks due to CTS/RTS overhead and protocol overhead.
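The 64-byte Ethernet figure follows directly from the framing overhead: each frame occupies 64 bytes of frame plus 8 bytes of preamble and 12 bytes of interframe gap, or 84 byte times on the wire, so the usable fraction is 64/84 ≈ 76%, which is roughly 76 megabits per second of frame data (about 148,800 frames per second) on a 100 megabit per second link.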

Characteristics of the particular protocol or application could also be causing the bottleneck. For example, if the FTP performance is much lower (by a factor of 2) than the large-packet performance with a traffic generator (Smartbits), the problem could be that the TCP acknowledgement packets are getting dropped. This problem can sometimes be a buffer management issue.

FTP performance can also be significantly affected by the TCP window sizes on the FTP client and server machines.

Forces:

Test equipment typically outperforms the DUT.

Environmental Factors

Context: You are finding it difficult to identify the bottleneck.

Problem: Environmental factors can cause a difficult-to-diagnose bottleneck.

Solution: Check the environmental factors.

When testing a wireless application, you might encounter radio interference in the test environment. In this case, you can use a Faraday cage to radio-isolate your test equipment and DUT from the environment. Antenna configuration is also important. The antennas should not be too close (<1 meter). They should be erect, not lying down. You also need to make sure you shield the DUT to protect it from antenna interference.

Check shared resources. Is your test equipment or DUT sharing a resource, such as a network segment, with other equipment? Is that other equipment making enough use of the shared resource to affect your DUT performance?

Check all connectors and cables. If you are confident you are making improvements but the measurements are not giving the improvement you expect, try changing all the cables connecting your DUT to the test equipment. As a last resort, try a replacement DUT. We have seen a number of cases where a device on the DUT has degraded enough to affect performance.

Polled Packet Processor

Context: You are designing the fundamental mechanism that drives the servicing of network interfaces.

Problem: Some fundamental mechanisms can expose you to more overhead and wasted CPU cycles. These wasted cycles can come from interrupt preamble/dispatch and context switches.

Solution: You can categorize most applications as interrupt or polling driven or a combination of both.

When traffic overloads a system, it runs optimally if it is running in a tight loop, polling interfaces for which it knows there is traffic queued.

If the application driver is interrupt based, look to see how many packets you handle per interrupt. To get better packet processing performance, handle more packets per interrupt by possibly using a polling approach in the interrupt handler.

Some systems put the packet on a queue from the interrupt handler and then do the packet processing in another thread. In this kind of a system, you need to understand how many packets the system handles per context switch. To improve performance, increase the number of packets handled per context switch.

Other systems can drive packet processing, triggered from a timer interrupt. In this case, you need to make sure the timer frequency and number of packets handled per interrupt are not limiting the networking performance of your system. In addition, this system is not optimally efficient when the system is in overload.

Linux-based systems usually support a combination of interrupt- and polling-driven operation for network drivers (NAPI is one such mechanism).
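A sketch of a budget-based service loop (struct port, port_rx, and process_packet are hypothetical): each interface is polled in turn and at most a fixed number of packets is handled per pass, so one busy interface cannot starve the others.

#define BUDGET_PER_PORT 32   /* maximum packets handled per port per pass */

void packet_loop(struct port *ports, int nports)
{
    for (;;) {
        int p;

        for (p = 0; p < nports; p++) {
            int handled = 0;
            struct packet *pkt;

            /* Drain up to BUDGET_PER_PORT packets from this port. */
            while (handled < BUDGET_PER_PORT &&
                   (pkt = port_rx(&ports[p])) != NULL) {
                process_packet(pkt);
                handled++;
            }
        }
        /* Optionally sleep or re-enable interrupts when all ports are idle. */
    }
}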

Forces:

Reducing wasted CPU cycles can complicate the overall architecture or design of an application.

Some IP stacks or operating systems can restrict the options in how you design these fundamental mechanisms.

You might need to throttle the amount of CPU given to packet processing to allow other processing to happen even when the system is in overload.

Applying these techniques might increase the latency in handling some packets.

Edge Packet Throttle

Context: The bottleneck of your system is now the IP forwarding or transport parts of the IP stack.

Problem: You might be wasting CPU cycles processing packets only to drop them when a queue fills further along the data path.

Solution: When a system goes into overload, it is better to leave frames in the receive queue and let the edges of your system, the MAC devices, throttle reception. You can avoid wasting core cycles by checking a bottleneck indicator, such as a queue-full condition, early in the data path code.

For example, on VxWorks™, you can make the main packet-processing task (netTask) the highest-priority task. This technique is one easy way to implement a “self-throttling” system. Alternatively, you could make the buffer replenish code a low-priority task, which would ensure that receive buffers are only supplied when you have available CPU.

Forces:

Checking a bottleneck indicator might weaken the encapsulation of an internal detail of the IP stack.

Implementing an early check wastes some CPU cycles when the system is in overload.

Detecting Resource Collisions

Context: You make a change and performance drops unexpectedly.

Problem: A resource collision effect could be causing a pronounced performance bottleneck. Examples of such effects we have seen are the following:

TX traffic is being generated from RX traffic and the Ethernet interface is running in half-duplex mode. The time it takes to generate a TX frame from an RX frame corresponds to the interframe gap, so when the TX frame is sent, it collides with the next RX frame.

The Ethernet interface is running full duplex, but traffic is being generated in a loop and the frame transmissions occur at times the MAC is busy receiving frames.

Solution: These kinds of bottlenecks are difficult to find and can only be checked by looking at driver counters and the underlying PHY devices. Error counters on test equipment can also help.

Forces:

Counters might not be available or easily accessible.
