5.6 Program-Level Performance Analysis

Because embedded systems must perform functions in real time, we often need to know how fast a program runs. The techniques we use to analyze program execution time are also helpful in analyzing properties such as power consumption. In this section, we study how to analyze programs to estimate their run times. We also examine how to optimize programs to improve their execution times; of course, optimization relies on analysis.

It is important to keep in mind that CPU performance is not judged in the same way as program performance. Certainly, CPU clock rate is a very unreliable metric for program performance. But more importantly, the fact that the CPU executes part of our program quickly doesn’t mean that it will execute the entire program at the rate we desire. As illustrated in Figure 5.21, the CPU pipeline and cache act as windows into our program. In order to understand the total execution time of our program, we must look at execution paths, which in general are far longer than the pipeline and cache windows. The pipeline and cache influence execution time, but execution time is a global property of the program.

image

Figure 5.21 Execution time is a global property of a program.

While we might hope that the execution time of programs could be precisely determined, this is in fact difficult to do in practice:

The execution time of a program often varies with the input data values because those values select different execution paths in the program. For example, loops may be executed a varying number of times, and different branches may execute blocks of varying complexity.

The cache has a major effect on program performance, and once again, the cache’s behavior depends in part on the data values input to the program.

Execution times may vary even at the instruction level. Floating-point operations are the most sensitive to data values, but the normal integer execution pipeline can also introduce data-dependent variations. In general, the execution time of an instruction in a pipeline depends not only on that instruction but on the instructions around it in the pipeline.

Measuring execution speed

We can measure program performance in several ways:

Some microprocessor manufacturers supply simulators for their CPUs. The simulator runs on a workstation or PC, takes as input an executable for the microprocessor along with input data, and simulates the execution of that program. Some of these simulators go beyond functional simulation to measure the execution time of the program. Simulation is clearly slower than executing the program on the actual microprocessor, but it also provides much greater visibility during execution. Be careful—some microprocessor performance simulators are not 100% accurate, and simulation of I/O-intensive code may be difficult.

A timer connected to the microprocessor bus can be used to measure performance of executing sections of code. The code to be measured would reset and start the timer at its start and stop the timer at the end of execution. The length of the program that can be measured is limited by the accuracy of the timer.

A logic analyzer can be connected to the microprocessor bus to measure the start and stop times of a code segment. This technique relies on the code being able to produce identifiable events on the bus to identify the start and stop of execution. The length of code that can be measured is limited by the size of the logic analyzer’s buffer.

We are interested in the following three different types of performance measures on programs:

average-case execution time: This is the typical execution time we would expect for typical data. Clearly, the first challenge is defining typical inputs.

worst-case execution time: The longest time that the program can spend on any input sequence is clearly important for systems that must meet deadlines. In some cases, the input set that causes the worst-case execution time is obvious, but in many cases it is not.

best-case execution time: This measure can be important in multirate real-time systems, as seen in Chapter 6.

First, we look at the fundamentals of program performance in more detail. We then consider trace-driven performance based on executing the program and observing its behavior.

5.6.1 Elements of Program Performance

The key to evaluating execution time is breaking the performance problem into parts. Program execution time [Sha89] can be seen as

execution time = Σ (over each instruction i on the executed path) t_i

The path is the sequence of instructions executed by the program (or its equivalent in the high-level language representation of the program). The instruction timing is determined based on the sequence of instructions traced by the program path, which takes into account data dependencies, pipeline behavior, and caching. Luckily, these two problems can be solved relatively independently.

Although we can trace the execution path of a program through its high-level language specification, it is hard to get accurate estimates of total execution time from a high-level language program. This is because there is not a direct correspondence between program statements and instructions. The number of memory locations and variables must be estimated, and results may be either saved for reuse or recomputed on the fly, among other effects. These problems become more challenging as the compiler puts more and more effort into optimizing the program. However, some aspects of program performance can be estimated by looking directly at the C program. For example, if a program contains a loop with a large, fixed iteration bound or if one branch of a conditional is much longer than another, we can get at least a rough idea that these are more time-consuming segments of the program.

Of course, a precise estimate of performance also relies on the instructions to be executed, because different instructions take different amounts of time. (In addition, to make life even more difficult, the execution time of one instruction can depend on the instructions executed before and after it.)

The next example illustrates data-dependent program paths.

Example 5.8 Data-Dependent Paths in if Statements

Here is a pair of nested if statements:

if (a || b) { /* test 1 */
   if (c) /* test 2 */
      { x = r * s + t; /* assignment 1 */ }
   else { y = r + s; /* assignment 2 */ }
   z = r + s + u; /* assignment 3 */
} else {
   if (c) /* test 3 */
      { y = r - t; /* assignment 4 */ }
}

The conditional tests and assignments are labeled within each if statement to make it easier to identify paths. What execution paths may be exercised? One way to enumerate all the paths is to create a truth table–like structure. The paths are controlled by the variables in the if conditions, namely, a, b, and c. For any given combination of values of those variables, we can trace through the program to see which branch is taken at each if and which assignments are performed. For example, when a = 1, b = 0, and c = 1, then test 1 is true and test 2 is true. This means we first perform assignment 1 and then assignment 3.

Here are the results for all the controlling variable values:

a b c | Assignments performed
0 0 0 | none
0 0 1 | 4
0 1 0 | 2, 3
0 1 1 | 1, 3
1 0 0 | 2, 3
1 0 1 | 1, 3
1 1 0 | 2, 3
1 1 1 | 1, 3

Notice that there are only four distinct cases: no assignment, assignment 4, assignments 2 and 3, or assignments 1 and 3. These correspond to the possible paths through the nested ifs; the table adds value by telling us which variable values exercise each of these paths.

Enumerating the paths through a fixed-iteration for loop is seemingly simple. In the code below,

for (i = 0; i < N; i++)
   a[i] = b[i]*c[i];

the assignment in the loop is performed exactly N times. However, we can’t forget the code executed to set up the loop and to test the iteration variable.

Example 5.9 illustrates how to determine the path through a loop.

Example 5.9 Paths in a Loop

Here is the loop code for the FIR filter of Application Example 2.1:

for (i = 0, f = 0; i < N; i++)
   f = f + c[i] * x[i];

By examining the CDFG for the code we can more easily determine how many times various statements are executed. Here is the CDFG once again:

image

The CDFG makes it clear that the loop initiation block is executed once, the test is executed N + 1 times, and the body and loop variable update are each executed N times.

Instruction timing

Once we know the execution path of the program, we have to measure the execution time of the instructions executed along that path. The simplest estimate is to assume that every instruction takes the same number of clock cycles, which means we need only count the instructions and multiply by the per-instruction execution time to obtain the program’s total execution time. However, even ignoring cache effects, this technique is simplistic for the reasons summarized below.

Not all instructions take the same amount of time. RISC architectures tend to provide uniform instruction execution times in order to keep the CPU’s pipeline full. But as we saw in Chapter 3, even very simple RISC architectures like the PIC16F take different amounts of time to execute certain instructions. Floating-point instructions show especially wide variations in execution time—while basic multiply and add operations are fast, some transcendental functions can take thousands of cycles to execute.

Execution times of instructions are not independent. The execution time of one instruction depends on the instructions around it. For example, many CPUs use register bypassing to speed up instruction sequences when the result of one instruction is used in the next instruction. As a result, the execution time of an instruction may depend on whether its destination register is used as a source for the next operation (or vice versa).

The execution time of an instruction may depend on operand values. This is clearly true of floating-point instructions in which a different number of iterations may be required to calculate the result. Other specialized instructions can, for example, perform a data-dependent number of integer operations.

We can handle the first two problems more easily than the third. We can look up instruction execution time in a table; the table will be indexed by opcode and possibly by other parameter values such as the registers used. To handle interdependent execution times, we can add columns to the table to consider the effects of nearby instructions. Because these effects are generally limited by the size of the CPU pipeline, we know that we need to consider a relatively small window of instructions to handle such effects. Handling variations due to operand values is difficult to do without actually executing the program using a variety of data values, given the large number of factors that can affect value-dependent instruction timing. Luckily, these effects are often small. Even in floating-point programs, most of the operations are typically additions and multiplications whose execution times have small variances.

Caching effects

Thus far we have not considered the effect of the cache. Because the access time for main memory can be 10–100 times larger than the cache access time, caching can have huge effects on instruction execution time by changing both the instruction and data access times. Caching performance inherently depends on the program’s execution path because the cache’s contents depend on the history of accesses.

5.6.2 Measurement-Driven Performance Analysis

The most direct way to determine the execution time of a program is by measuring it. This approach is appealing but it does have some drawbacks. First, in order to cause the program to execute its worst-case execution path, we have to provide the proper inputs to it. Determining the set of inputs that will guarantee the worst-case execution path is infeasible. Furthermore, in order to measure the program’s performance on a particular type of CPU, we need the CPU or its simulator.

Despite these problems, measurement is the most commonly used way to determine the execution time of embedded software. Worst-case execution time analysis algorithms have been used successfully in some areas, such as flight control software, but many system design projects determine the execution time of their programs by measurement.

Program traces

Most methods of measuring program performance combine the determination of the execution path and the timing of that path: as the program executes, it chooses a path and we observe the execution time along that path. We refer to the record of the execution path of a program as a program trace (or more succinctly, a trace). Traces can be valuable for other purposes, such as analyzing the cache behavior of the program.

Measurement issues

Perhaps the biggest problem in measuring program performance is figuring out a useful set of inputs to give the program. This problem has two aspects. First, we have to determine the actual input values. We may be able to use benchmark data sets or data captured from a running system to help us generate typical values. For simple programs, we may be able to analyze the algorithm to determine the inputs that cause the worst-case execution time. The software testing methods of Section 5.10 can help us generate some test values and determine how thoroughly we have exercised the program.

The other problem with input data is the software scaffolding that we may need to feed data into the program and get data out. When we are designing a large system, it may be difficult to extract out part of the software and test it independently of the other parts of the system. We may need to add new testing modules to the system software to help us introduce testing values and to observe testing outputs.

We can measure program performance either directly on the hardware or by using a simulator. Each method has its advantages and disadvantages.

Profiling

Profiling is a simple method for analyzing software performance. A profiler does not measure execution time—instead, it counts the number of times that procedures or basic blocks in the program are executed. There are two major ways to profile a program: we can modify the executable program by adding instructions that increment a location every time the program passes that point in the program; or we can sample the program counter during execution and keep track of the distribution of PC values. Profiling adds relatively little overhead to the program and it gives us some useful information about where the program spends most of its time.

Physical performance measurement

Physical measurement requires some sort of hardware instrumentation. The most direct method of measuring the performance of a program would be to watch the program counter’s value: start a timer when the PC reaches the program’s start, stop the timer when it reaches the program’s end. Unfortunately, it generally isn’t possible to directly observe the program counter. However, it is possible in many cases to modify the program so that it starts a timer at the beginning of execution and stops the timer at the end. While this doesn’t give us direct information about the program trace, it does give us execution time. If we have several timers available, we can use them to measure the execution time of different parts of the program.

A logic analyzer or an oscilloscope can be used to watch for signals that mark various points in the execution of the program. However, because logic analyzers have a limited amount of memory, this approach doesn’t work well for programs with extremely long execution times.

Hardware traces

Some CPUs have hardware facilities for automatically generating trace information. For example, the Pentium family microprocessors generate a special bus cycle, a branch trace message, that shows the source and/or destination address of a branch [Col97]. If we record only these branch messages, we can reconstruct the instructions executed within the basic blocks while greatly reducing the amount of memory required to hold the trace.

Simulation-based performance measurement

The alternative to physical measurement of execution time is simulation. A CPU simulator is a program that takes as input a memory image for a CPU and performs the operations on that memory image that the actual CPU would perform, leaving the results in the modified memory image. For purposes of performance analysis, the most important type of CPU simulator is the cycle-accurate simulator, which performs a sufficiently detailed simulation of the processor's internals that it can determine the exact number of clock cycles required for execution. A cycle-accurate simulator is built with detailed knowledge of how the processor works, so that it can take into account all the possible behaviors of the microarchitecture that may affect execution time. Cycle-accurate simulators are slower than the processor itself, but a variety of techniques can be used to make them surprisingly fast, running only hundreds of times slower than the hardware itself. A simulator that functionally simulates instructions but does not provide timing information is known as an instruction-level simulator.

A cycle-accurate simulator has a complete model of the processor, including the cache. It can therefore provide valuable information about why the program runs too slowly. The next example discusses a simulator that can be used to model many different processors.

Example 5.10 Cycle-Accurate Simulation

SimpleScalar (http://www.simplescalar.com) is a framework for building cycle-accurate CPU models. Some aspects of the processor can be configured easily at run time. For more complex changes, we can use the SimpleScalar toolkit to write our own simulator.

We can use SimpleScalar to simulate the FIR filter code. SimpleScalar can model a number of different processors; we will use a standard ARM model here.

We want to include the data as part of the program so that the execution time doesn’t include file I/O. File I/O is slow and the time it takes to read or write data can change substantially from one execution to another. We get around this problem by setting up an array that holds the FIR data. And because the test program will include some initialization and other miscellaneous code, we execute the FIR filter many times in a row using a simple loop. Here is the complete test program:

#define COUNT 100
#define N 12

int x[N] = {8,17,3,122,5,93,44,2,201,11,74,75};
int c[N] = {1,2,4,7,3,4,2,2,5,8,5,1};

int main() {
   int i, k, f = 0;

   for (k = 0; k < COUNT; k++) { /* run the filter */
      for (i = 0; i < N; i++)
         f += c[i]*x[i];
   }
   return 0;
}

To start the simulation process, we compile our test program using a special compiler:

% arm-linux-gcc firtest.c

This gives us an executable program (by default, a.out) that we use to simulate our program:

% arm-outorder a.out

SimpleScalar produces a large output file with a great deal of information about the program’s execution. Because this is a simple example, the most useful piece of data is the total number of simulated clock cycles required to execute the program:

sim_cycle     25854 # total simulation time in cycles

To make sure that we can ignore the effects of program overhead, we will execute the FIR filter for several different values of COUNT and compare. The run above used COUNT = 100; when we also run COUNT = 1,000 and COUNT = 10,000, we get these results:

COUNT   Total simulation time in cycles   Simulation time for one filter execution
100     25854                             259
1,000   155759                            156
10,000  1451840                           145

Because the FIR filter is so simple and runs in so few cycles, we had to execute it a number of times to wash out the other overhead of program execution. However, the times for 1,000 and 10,000 filter executions are within 10% of each other, so those values are reasonably close to the actual execution time of the FIR filter itself.
