3.8. Benchmarking the Xtensa Core ISA

A processor’s ISA efficiency is best measured with the specific application code the processor will run. Often, that code doesn’t yet exist when SOC designers select a processor for the target tasks. Consequently, benchmark programs often stand in for the target application code.

The original definition of a benchmark was literally a mark on a workbench that provided some measurement standard. Eventually, early benchmarks were replaced with standard measuring tools such as yardsticks. Processor benchmarks provide yardsticks for measuring processor performance.

In one sense, the ideal processor benchmark is the actual application code that the processor will run. No other piece of code can possibly be as representative of the actual task to be performed as the actual code that executes that task. No other piece of code can possibly replicate the instruction-use distribution, register and memory use, or data-movement patterns of the actual application code. In many ways, however, the actual application code is less than ideal as a benchmark.

First and foremost, the actual application code may not exist when candidate processors are benchmarked because benchmarking and processor selection usually occur early in the project cycle. A benchmark that doesn’t exist is worthless because it cannot aid in processor selection.

Next, the actual application code serves as an overly specific benchmark. It will indeed give a very accurate prediction of processor performance for a specific task, and for no other task. In other words, the downside of a highly specific benchmark is that the benchmark will give a less-than-ideal indication of processor performance for other tasks. Because on-chip processor cores are often used for a variety of tasks, the ideal benchmark may well be a suite of application programs and not just one program.

Yet another problem with application-code benchmarks is their lack of instrumentation. The actual application code has almost always been written to execute the task, not to measure a processor core’s performance. Appropriate measurements may require modification of the application code. This modification consumes time and resources, which may not be readily available. Even with all of these issues, the target application code (if it exists) provides invaluable information on processor core performance and should be used to help make a processor core selection whenever possible.

EDN editor Markus Levy founded the non-profit, embedded-benchmarking organization called EEMBC (EDN Embedded Microprocessor Benchmark Consortium) in 1997. (EEMBC, pronounced “embassy,” later dropped the “EDN” from its name but not the corresponding “E” from its abbreviation.) EEMBC’s stated goal was to produce accurate and reliable metrics, based on real-world embedded applications, for evaluating embedded processor performance. EEMBC drew remarkably broad industry support from microprocessor and DSP (digital signal processor) vendors including Advanced Micro Devices, Analog Devices, ARC, ARM, Hitachi, IBM, IDT, Lucent Technologies, Matsushita, MIPS, Mitsubishi Electric, Motorola, National Semiconductor, NEC, Philips, QED, Siemens, STMicroelectronics, Sun Microelectronics, Texas Instruments, and Toshiba.

EEMBC spent nearly three years working on a suite of benchmarks for testing embedded microprocessors and introduced its first benchmark suite at the Embedded Processor Forum in 1999. EEMBC released its first certified scores in 2000 and, during the same year, announced that it would begin certifying benchmarks run on simulators so that processor cores, and not just physical chips, could be benchmarked. EEMBC has more than 50 corporate members.

EEMBC’s benchmark suites are based on real-world application code. As such, they provide some of the industry’s best measuring tools for comparing the performance of various processor cores. Descriptions of EEMBC’s consumer, networking, and telecom benchmark suites appear in Tables 3.10–3.12.

Table 3.10. EEMBC consumer benchmark programs
| EEMBC consumer benchmark name | Benchmark description | Example applications |
| --- | --- | --- |
| High Pass Grey-Scale Filter | 2-D array manipulation and matrix arithmetic | CCD and CMOS sensor signal processing |
| JPEG | JPEG image compression and decompression | Still-image processing |
| RGB to CMYK conversion | Color-space conversion at 8 bits/pixel | Color printing |
| RGB to YIQ conversion | Color-space conversion at 8 bits/pixel | NTSC video encoding |

Table 3.11. EEMBC networking benchmark programs
| EEMBC Networking 2.0 benchmark name | Benchmark description | Example applications |
| --- | --- | --- |
| IP Packet Check | IP header validation, checksum calculation, logical comparisons | Network router, switch |
| IP Network Address Translator (NAT) | Network-to-network address translation | Network router, switch |
| OSPF version 2 | Open Shortest Path First/Dijkstra shortest-path-first algorithm | Network routing |
| QoS | Quality-of-service network bandwidth management | Network traffic flow control |

Table 3.12. EEMBC telecom benchmark programs
| EEMBC telecom benchmark name | Benchmark description | Example applications |
| --- | --- | --- |
| Autocorrelation | Fixed-point autocorrelation of a finite-length input sequence | Speech compression and recognition, channel and sequence estimation |
| Bit allocation | Bit-allocation algorithm for DSL modems using DMT | DSL modem |
| Convolutional encoder | Generic convolutional coding algorithm | Forward error correction |
| Fast Fourier Transform (FFT) | Decimation-in-time 256-point FFT using the butterfly technique | Mobile phone |
| Viterbi decoder | IS-136 channel decoding using the Viterbi algorithm | Mobile phone |

EEMBC’s consumer benchmarking suite, shown in Table 3.10, consists of four image-processing algorithms commonly used in digital cameras and printers. The networking benchmark suite, shown in Table 3.11, consists of four networking algorithms used in networking equipment such as routers. The telecom suite, shown in Table 3.12, consists of algorithms used to develop wired and wireless telephony equipment.
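
To give a concrete flavor of these kernels, consider the RGB-to-YIQ conversion from the consumer suite (Table 3.10). The C sketch below is not EEMBC’s actual benchmark source; it merely illustrates the per-pixel multiply-accumulate work such a color-space conversion imposes on a processor core, using the standard NTSC YIQ weights scaled to 8.8 fixed point:

```c
#include <stdint.h>

/* Illustrative RGB-to-YIQ conversion (not EEMBC source code).
 * Standard NTSC weights, scaled by 256 for 8.8 fixed point:
 *   Y =  0.299 R + 0.587 G + 0.114 B
 *   I =  0.596 R - 0.274 G - 0.322 B
 *   Q =  0.211 R - 0.523 G + 0.312 B
 * An arithmetic right shift is assumed for the negative I/Q terms. */
void rgb_to_yiq(const uint8_t *r, const uint8_t *g, const uint8_t *b,
                int16_t *y, int16_t *i, int16_t *q, int npixels)
{
    for (int n = 0; n < npixels; n++) {
        int R = r[n], G = g[n], B = b[n];
        y[n] = (int16_t)(( 77 * R + 150 * G +  29 * B) >> 8);
        i[n] = (int16_t)((153 * R -  70 * G -  82 * B) >> 8);
        q[n] = (int16_t)(( 54 * R - 134 * G +  80 * B) >> 8);
    }
}
```

Each pixel costs nine multiplies and six adds, so a core’s multiply throughput and register-file bandwidth dominate its score on this kind of kernel.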

Figures 3.5–3.7 show the performance of Tensilica’s base-ISA Xtensa processor compared with processor ISAs offered by MIPS and ARM, as published on the EEMBC Web site, for the consumer, networking, and telecom benchmarks, respectively. The results in Figures 3.5–3.7 are all normalized to the ARM1026 processor core, which is the slowest of the three processors compared. The results in the figures are also normalized with respect to clock frequency, so the figures give a true impression of work performed per clock. If a processor core performs more work per clock, it can perform more work overall and can execute time-constrained tasks at lower clock rates, which in turn reduces the SOC’s power dissipation.
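
As a small illustration of this normalization (the numbers below are hypothetical, not published EEMBC scores), dividing each raw score by the core’s clock frequency yields work per clock, which is then expressed relative to the reference core:

```c
#include <stdio.h>

/* Hypothetical illustration of per-clock score normalization.
 * raw_score is benchmark iterations/second; clock_mhz is the core clock.
 * Neither value below is a published EEMBC result. */
static double per_clock(double raw_score, double clock_mhz)
{
    return raw_score / clock_mhz;   /* iterations/second per MHz */
}

int main(void)
{
    double ref = per_clock(52000.0, 266.0);  /* hypothetical reference core */
    double dut = per_clock(61000.0, 200.0);  /* hypothetical core under test */

    /* A ratio above 1.0 means more work per cycle than the reference. */
    printf("relative work per clock: %.2f\n", dut / ref);
    return 0;
}
```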

Figure 3.5. EEMBC “out-of-the-box” consumer benchmark scores.


Figure 3.6. EEMBC “out-of-the-box” networking benchmark scores.


Figure 3.7. EEMBC “out-of-the-box” telecom benchmark scores.


The Xtensa processor’s reduced instruction size, large register file, and tuned compiler all contribute to the Xtensa ISA’s superior EEMBC benchmark results compared with other commonly used microprocessor cores. These attributes are shared with the Diamond Standard processor cores, which are all built on the Xtensa processor’s base ISA.

EEMBC rules allow for two levels of benchmarking. The lower level produces the “out-of-the-box” scores shown in Figures 3.5–3.7. Out-of-the-box EEMBC benchmark tests can use any compiler (in practice, the choice of compiler has affected performance results by as much as 40%) and any selection of compiler switches, but the benchmark source code cannot be modified. EEMBC’s out-of-the-box results therefore give a fair representation of the abilities of the processor/compiler combination without adding programmer creativity as a wild card.

The higher benchmarking level is called “full-fury.” Processor vendors seeking to improve their full-fury EEMBC scores (posted as “optimized” scores on the EEMBC Web site) can use hand-tuned code, assembly-language subroutines, special libraries, special CPU instructions, coprocessors, and other hardware accelerators. As a result, full-fury scores tend to be much better than out-of-the-box scores, just as application-optimized production code generally runs much faster than code that has merely been run through a compiler.

As will be discussed in Chapter 4, the Xtensa configurable processor core delivers considerably better performance than fixed-ISA processor cores on EEMBC’s full-fury benchmarks because of its ability to incorporate specialized instructions that collapse large portions of benchmark code into individual instructions, as was demonstrated in the discussion of endian-conversion routines above. EEMBC’s full-fury benchmarking process provides SOC designers with a more realistic assessment of a configurable processor core’s capability because it precisely mirrors the way that the configurable processor core will be applied to a target application.
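
For reference, a conventional C version of the kind of endian-conversion (byte-swap) kernel mentioned above looks like the following sketch. On a fixed-ISA core, each swap compiles to a sequence of shift, mask, and OR operations, which is exactly the sort of sequence a configurable core can collapse into a single custom instruction (the mechanics are covered in Chapter 4):

```c
#include <stdint.h>
#include <stddef.h>

/* Conventional 32-bit byte swap: four shifts, four masks, three ORs
 * on a fixed-ISA core. A configurable core can replace this entire
 * expression with one custom instruction. */
static inline uint32_t swap32(uint32_t x)
{
    return ((x & 0x000000FFu) << 24) |
           ((x & 0x0000FF00u) <<  8) |
           ((x & 0x00FF0000u) >>  8) |
           ((x & 0xFF000000u) >> 24);
}

void swap_buffer(uint32_t *buf, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++)
        buf[i] = swap32(buf[i]);
}
```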
