3.6 CPU Performance

Now that we have an understanding of the various types of instructions that CPUs can execute, we can move on to a topic particularly important in embedded computing: How fast can the CPU execute instructions? In this section, we consider two factors that can substantially influence program performance: pipelining and caching.

3.6.1 Pipelining

Modern CPUs are designed as pipelined machines in which several instructions are executed in parallel. Pipelining greatly increases the efficiency of the CPU. But like any pipeline, a CPU pipeline works best when its contents flow smoothly. Some sequences of instructions can disrupt the flow of information in the pipeline and, temporarily at least, slow down the operation of the CPU.

ARM7 pipeline

The ARM7 has a three-stage pipeline:

1. Fetch: the instruction is fetched from memory.

2. Decode: the instruction’s opcode and operands are decoded to determine what function to perform.

3. Execute: the decoded instruction is executed.

Each of these operations requires one clock cycle for typical instructions. Thus, a normal instruction requires three clock cycles to complete execution; this delay is known as the latency of instruction execution. But because the pipeline has three stages, an instruction is completed in every clock cycle. In other words, the pipeline has a throughput of one instruction per cycle. Figure 3.16 illustrates the position of instructions in the pipeline during execution using the notation introduced by Hennessy and Patterson [Hen06]. A vertical slice through the timeline shows all instructions in the pipeline at that time. By following an instruction horizontally, we can see the progress of its execution.


Figure 3.16 Pipelined execution of ARM instructions.
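To make the latency and throughput numbers concrete, the following minimal C sketch (a hypothetical helper, not from the text) computes the completion time of a straight-line instruction sequence on an ideal pipeline: the first instruction needs one cycle per stage, and each later instruction completes one cycle after its predecessor.

     #include <stdio.h>

     /* Completion time, in cycles, of n instructions on an ideal
        k-stage pipeline with no stalls: the first instruction takes
        k cycles (its latency); each subsequent instruction completes
        one cycle later (throughput of one instruction per cycle). */
     unsigned pipeline_cycles(unsigned k, unsigned n)
     {
         return (n == 0) ? 0 : k + (n - 1);
     }

     int main(void)
     {
         /* ARM7's three-stage pipeline: 10 instructions finish in 12 cycles. */
         printf("%u\n", pipeline_cycles(3, 10));
         return 0;
     }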

C55x pipeline

The C55x includes a seven-stage pipeline [Tex00B]:

1. Fetch.

2. Decode.

3. Address: computes data and branch addresses.

4. Access 1: reads data.

5. Access 2: finishes data read.

6. Read stage: puts operands onto internal busses.

7. Execute: performs operations.

RISC machines are designed to keep the pipeline busy. CISC machines may display a wide variation in instruction timing. Pipelined RISC machines typically have more regular timing characteristics—most instructions that do not have pipeline hazards display the same latency.

Pipeline stalls

The one-cycle-per-instruction completion rate does not hold in every case, however. The simplest case of extended execution is an instruction that is too complex to complete the execution phase in a single cycle. A multiple load instruction is an example of an instruction that requires several cycles in the execution phase. Figure 3.17 illustrates a data stall in the execution of a sequence of instructions starting with a load multiple (LDMIA) instruction. Because there are two registers to load, the instruction must stay in the execution phase for two cycles. During a multicycle execution, the decode stage is also occupied, because it must continue to hold the decoded instruction. As a result, the SUB instruction is fetched at the normal time but not decoded until the LDMIA is finishing. This delays the fetching of the third instruction, the CMP.


Figure 3.17 Pipelined execution of multicycle ARM instructions.

Branches also introduce control stall delays into the pipeline, commonly referred to as the branch penalty, as shown in Figure 3.18. The decision whether to take the conditional branch BNE is not made until the third clock cycle of that instruction's execution, which computes the branch target address. By that time, the succeeding instruction at PC+4 has already been fetched and has started to be decoded. When the branch is taken, the branch target address is used to fetch the branch target instruction. Because we have to wait for the execution cycle to complete before knowing the target, we must throw away two cycles of work on instructions in the path not taken. The CPU uses the two cycles between starting to fetch the branch target and starting to execute that instruction to finish housekeeping tasks related to the execution of the branch.


Figure 3.18 Pipelined execution of a branch in ARM.
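The cost of these disruptions can be folded into a simple cycle-count model. The C sketch below is an approximation under stated assumptions (a three-stage ARM7-style pipeline, a two-cycle branch penalty, and no other hazards), not a definitive timing model:

     /* Approximate cycle count for an instruction sequence on a
        three-stage ARM7-style pipeline. extra_exec_cycles counts the
        additional execute-stage cycles used by multicycle instructions
        (e.g., one extra cycle for a two-register LDMIA); taken_branches
        counts taken branches, each adding a two-cycle branch penalty. */
     unsigned cycles_with_stalls(unsigned n_instructions,
                                 unsigned extra_exec_cycles,
                                 unsigned taken_branches)
     {
         const unsigned stages = 3;         /* fetch, decode, execute */
         const unsigned branch_penalty = 2; /* discarded cycles per taken branch */
         unsigned ideal = (n_instructions == 0) ? 0
                        : stages + (n_instructions - 1);
         return ideal + extra_exec_cycles + taken_branches * branch_penalty;
     }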

One way around this problem is to introduce the delayed branch. In this style of branch instruction, a fixed number of instructions directly after the branch are always executed, whether or not the branch is taken. This allows the CPU to keep the pipeline full during execution of the branch. Any instruction placed in the delayed branch window must be valid for both execution paths; if the compiler cannot find enough such instructions to fill the window, it must fill the remaining slots with no-ops.

Let’s use this knowledge of instruction execution time to develop two examples. First, we will look at execution times on the PIC16F; we will then evaluate the execution time of some C code on the more complex ARM.

Example 3.9 Execution Time of a Loop on the PIC16F

The PIC16F is pipelined but has relatively simple instruction timing [Mic07]. An instruction is divided into four Q cycles:

Q1 decodes instructions;

Q2 reads operands;

Q3 processes data;

Q4 writes data.

The time required for an instruction is called Tcy. The CPU executes one Q cycle per clock period. Because instruction execution is pipelined, we generally measure the execution time of an instruction as the number of cycles between the start of that instruction and the start of the next. The PIC16F does not have a cache.
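Because one instruction cycle spans four Q cycles, one Tcy equals four oscillator periods. A small C sketch (hypothetical helper name) converts a cycle count into wall-clock time:

     /* Convert a PIC16F cycle count into seconds. One instruction
        cycle (Tcy) consists of four Q cycles, i.e., four oscillator
        periods, so Tcy = 4 / Fosc. */
     double sequence_seconds(unsigned long tcy_count, double fosc_hz)
     {
         double tcy = 4.0 / fosc_hz; /* instruction cycle time in seconds */
         return tcy_count * tcy;
     }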

The majority of instructions execute in one cycle. But there are exceptions:

Several flow-of-control instructions (CALL, GOTO, RETFIE, RETLW, RETURN) always require two cycles.

Skip-if instructions (DECFSZ, INCFSZ, BTFSC, BTFSS) require two cycles if the skip is taken, one cycle if the skip is not taken. (If the skip is taken, the next instruction remains in the pipeline but is not executed, causing a one-cycle pipeline bubble).

The PIC16F’s very predictable timing allows real-time behavior to be encoded into a program. For example, we can set a bit on an I/O device for a data-dependent amount of time [Mic97B]:

       movf  len, w   ; get ready for computed goto
       addwf pcl, f   ; computed goto (PCL is low byte of PC)
len3:  bsf   x,l      ; set the bit at t-3
len2:  bsf   x,l      ; set the bit at t-2
len1:  bsf   x,l      ; set the bit at t-1
       bcf   x,l      ; clear the bit at t

A computed goto is a general term for a branch to a location determined by a data value. In this case, the variable len determines the number of clock cycles for which the I/O device bit is set. If we want to set the bit for 3 cycles, we set len to 1 so that the computed goto jumps to len3. If we want to set the device bit for 2 cycles, we set len to 2; to set the device bit for 1 cycle we set len to 3. Setting the device bit multiple times does not harm the operation of the I/O device. The computed goto allows us to vary the I/O device’s operation time dynamically while still maintaining very predictable timing.
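The timing relationship can be stated compactly: jumping to len3 (len = 1) holds the bit high for 3 cycles, len2 for 2, and len1 for 1. A one-line C sketch of that mapping (hypothetical helper, valid only for len from 1 to 3):

     /* Cycles for which the I/O bit stays set in the computed-goto
        sequence above: len = 1 -> 3 cycles, len = 2 -> 2, len = 3 -> 1. */
     unsigned bit_high_cycles(unsigned len)
     {
         return 4 - len; /* assumes 1 <= len <= 3 */
     }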

Example 3.10 Execution Time of a Loop on the ARM

We will use the C code for the FIR filter of Application Example 2.1:

for (i = 0, f = 0; i < N; i++)

     f = f + c[i] * x[i];

We repeat the ARM code for this loop:

        ;loop initiation code
        MOV r0,#0          ;use r0 for i, set to 0
        MOV r8,#0          ;use a separate index for arrays
        ADR r2,N           ;get address for N
        LDR r1,[r2]        ;get value of N for loop termination test
        MOV r2,#0          ;use r2 for f, set to 0
        ADR r3,c           ;load r3 with address of base of c array
        ADR r5,x           ;load r5 with address of base of x array
        ;loop body
loop    LDR r4,[r3,r8]     ;get value of c[i]
        LDR r6,[r5,r8]     ;get value of x[i]
        MUL r4,r4,r6       ;compute c[i]*x[i]
        ADD r2,r2,r4       ;add into running sum f
        ;update loop counter and array index
        ADD r8,r8,#4       ;add one word offset to array index
        ADD r0,r0,#1       ;add 1 to i
        ;test for exit
        CMP r0,r1
        BLT loop           ;if i < N, continue loop
loopend …

Inspection of the code shows that the only instruction that may take more than one cycle is the conditional branch in the loop test. We can count the number of instructions and associated number of clock cycles in each block as follows:

Block        Variable    # Instructions   # Cycles
Initiation   t_init      7                7
Body         t_body      4                4
Update       t_update    2                2
Test         t_test      2                2 best case, 4 worst case

The BLT instruction in the test block incurs a two-cycle branch penalty whenever the branch back to the top of the loop is taken. That happens for all but the last iteration, so in those iterations the test block has an execution time of t_test,worst; in the last iteration the branch falls through and the test executes in time t_test,best. We can write a formula for the total execution time of the loop in cycles as

t_loop = t_init + N(t_body + t_update) + (N - 1) t_test,worst + t_test,best
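Using the per-block cycle counts from the table, the formula translates directly into code. This is a sketch under the assumption N >= 1 (the loop body always executes at least once):

     /* Total execution time of the FIR loop in cycles:
        t_loop = t_init + N*(t_body + t_update)
               + (N - 1)*t_test_worst + t_test_best.   Assumes N >= 1. */
     unsigned long fir_loop_cycles(unsigned long N)
     {
         const unsigned t_init = 7;       /* seven one-cycle instructions */
         const unsigned t_body = 4;       /* LDR, LDR, MUL, ADD */
         const unsigned t_update = 2;     /* two ADDs */
         const unsigned t_test_worst = 4; /* CMP + taken BLT (2-cycle penalty) */
         const unsigned t_test_best = 2;  /* CMP + untaken BLT */

         return t_init + N * (t_body + t_update)
              + (N - 1) * t_test_worst + t_test_best;
     }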

3.6.2 Cache Performance

We have already discussed caches functionally. Although caches are invisible in the programming model, they have a profound effect on performance. We introduce caches because they substantially reduce memory access time when the requested location is in the cache. However, the desired location is not always in the cache because it is considerably smaller than main memory. As a result, caches cause the time required to access memory to vary considerably. The extra time required to access a memory location not in the cache is often called the cache miss penalty. The amount of variation depends on several factors in the system architecture, but a cache miss is often several clock cycles slower than a cache hit.
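A standard way to summarize this variation (following Hennessy and Patterson's formulation, not something specific to this section) is the average memory access time, a weighted sum of hit and miss costs. A minimal sketch:

     /* Average memory access time given hit rate h, cache access time
        t_cache, and main memory access time t_main (all in cycles):
        t_av = h * t_cache + (1 - h) * t_main. */
     double avg_access_time(double h, double t_cache, double t_main)
     {
         return h * t_cache + (1.0 - h) * t_main;
     }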

The time required to access a memory location depends on whether the requested location is in the cache. However, as we have seen, a location may not be in the cache for several reasons.

In a compulsory miss, the location has not been referenced before.

In a conflict miss, two memory locations that map to the same cache line are fighting for it.

In a capacity miss, the program's working set is simply too large for the cache.

The contents of the cache can change considerably over the course of execution of a program. When we have several programs running concurrently on the CPU, we can have very dramatic changes in the cache contents. We need to examine the behavior of the programs running on the system to be able to accurately estimate performance when caches are involved. We consider this problem in more detail in Section 5.7.
