Chapter 6

The DSP Hardware/Software Continuum

Mike Brogioli

Chapter Outline

Introduction

Hardware systems and system topologies used in digital signal processing (DSP) can vary greatly in design. Quite often, each system design and component within the system comes with its own programmability, power, and performance tradeoffs. What is appropriate for one system designer's needs may not be suitable for another's, for a variety of reasons. This chapter details various components of DSP platform designs as they pertain to system configurability, system programmability, algorithmic demands, and system complexity. At one end of the spectrum, application specific integrated circuits (ASICs) are detailed as a high performance, low configurability solution. At the other end of the spectrum, general purpose software programmable embedded microprocessors are presented as a highly configurable solution. Various design points are discussed along the way, such as reconfigurable FPGA based solutions and hardware acceleration. Subsequent sections detail the tradeoffs for each manner of system design, and aim to provide the system designer with insights into when each solution is appropriate, both for current system designs and for those that may need to scale or migrate hardware platforms in the future.

FPGA in embedded design

FPGA-based solutions for embedded systems often have the advantage of providing a more targeted solution to a given application than the more traditional one size fits most programmable DSPs and embedded processors. In the case of a system designer seeking to tailor the amount of hardware parallelism afforded to their applications, FPGAs provide an attractive alternative. By exploiting the available functional unit level parallelism that FPGAs can provide to signal processing system designers, an FPGA-based system component using modern FPGA platforms such as the Xilinx Virtex 6 can often provide 40–50× improvement in raw performance over its programmable signal processing core counterpart.

There are a number of emerging use cases for FPGAs in the embedded processing domain. Video surveillance is one such application domain, incorporating real-time video analytics in the overall system design. In-car entertainment and rich media content is another, which strives to combine the true multimedia experiences often associated with in-home entertainment with other data such as real-time traffic and weather. Many of these in-car applications exhibit large levels of parallelism that may outstrip programmable signal processors and are thus better suited to FPGAs and other highly parallel accelerator-like architectures.

In light of the growth in FPGA development tools, reusable logic components, and available third party designs, system designers are now free to incorporate FPGA-based solutions into a far broader set of designs. At the same time, given the reprogrammable nature of the FPGA, many system designers can retool or augment their designs with significantly less engineering overhead and recurring costs than their traditional discrete or custom logic counterparts.

While FPGAs do afford significantly higher levels of parallelism at the hardware block level versus their programmable microprocessor counterparts, there are a number of other factors that system designers should keep in mind. FPGAs tend to operate at significantly lower clock rates than their programmable processor counterparts, yet still tend to provide higher computational throughput per clock cycle due to the increase in parallel computational hardware. As an example, a given FPGA may operate in the hundreds of MHz range, but may be capable of performing tens of thousands of computations per clock cycle while operating in the tens-of-watts power range. A comparable microprocessor may operate in the 1–2 GHz range, but will be much more limited in the amount of parallel computation that can be performed per clock cycle. Typical SIMD style architectures in this latter category may provide only four to eight computations per clock cycle, whereas an FPGA may provide a 50× improvement in raw computational throughput over its microprocessor counterpart. There are, however, other points that must be considered in using an FPGA within the system design.

The impressive 50× performance gains noted above for FPGA-based designs are predicated on the computational workload in the overall system being suitable for an FPGA-based implementation. Typically, system designers must consider the type of computation performed, whether the algorithm requires fixed point or floating point computation, and the challenges associated with implementing FPGA-based designs versus more traditional embedded software written in C/C++-like languages.

FPGA computational throughput and power

As has been mentioned previously, there are power consumption challenges with the use of FPGAs in embedded designs. In a typical signal processing workload, however, the high levels of computational parallelism and the matrix-like nature of the computation permit the FPGA to offset its lower clock rate relative to its traditionally higher clocked programmable microprocessor counterparts. In summary, while the FPGA hardware itself may run at a significantly lower clock rate than a programmable microprocessor, suitable applications can exploit the vast increase in parallel FPGA hardware to provide significant computational throughput.

Algorithm suitability

When considering the use of an FPGA in an embedded system design, the type of algorithm to be implemented on the FPGA must be considered. FPGAs themselves tend to be suitable to what are traditionally referred to as embarrassingly parallel type problems – that is to say, those types of algorithms that perform parallel repeated tasks in a regular fashion, and may be easily decomposable into modular parts. Many algorithms are suitably parallel, such as radar applications, beam forming, certain types of image compression, and various signal processing kernels such as Fast Fourier Transforms and other matrix based computations.
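As an illustrative sketch in C (the function name is hypothetical), the following element-wise multiply-accumulate loop shows the regular, embarrassingly parallel structure such algorithms share. Every iteration is independent of the others, so an FPGA implementation could instantiate many multiply-add units and execute large groups of iterations concurrently:

```c
#include <stddef.h>

/* Element-wise multiply-accumulate: each iteration is independent, so
 * the loop decomposes trivially into parallel hardware multiply-add
 * units. Illustrative sketch only; the name vec_mac is hypothetical. */
void vec_mac(const float *a, const float *b, float *acc, size_t n)
{
    for (size_t i = 0; i < n; i++)
        acc[i] += a[i] * b[i];
}
```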

Other algorithms and applications that are less predictable or require dynamic partitioning and load balancing are likely not suitable for FPGA implementation. In addition, algorithms with significant amounts of control plane processing typically implemented in control logic statements in higher level C/C++ like programming languages may also be unsuitable for FPGA implementation.

Fixed point versus floating point

Another issue that system designers must consider when utilizing FPGAs in their designs is whether key algorithms require fixed point or floating point calculations. Traditionally, software programmable microprocessors support floating point calculations natively in hardware units within the processor pipeline, which affords a significant performance improvement over emulation on the host processor via a software runtime library. In addition, microprocessors often have parallel hardware within the processor pipeline for vector or SIMD style floating point computation, such as Freescale's AltiVec style SIMD extensions and those of other vendors.
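To make the fixed point versus floating point distinction concrete, the following C sketch (the helper name is hypothetical) models a Q15 fixed point multiply with saturation. Arithmetic of this form reduces to an integer multiply, a shift, and a clamp, which maps cheaply onto either DSP datapaths or FPGA logic, whereas an equivalent floating point multiply demands far more hardware:

```c
#include <stdint.h>

/* Q15 fixed-point multiply with saturation. Q15 represents values in
 * [-1, 1) as 16-bit integers scaled by 2^15. The only overflow case is
 * (-1) * (-1), which saturates to the largest representable value.
 * Hypothetical helper for illustration. */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t p = ((int32_t)a * (int32_t)b) >> 15;  /* rescale product */
    if (p > INT16_MAX) p = INT16_MAX;             /* saturate high */
    if (p < INT16_MIN) p = INT16_MIN;             /* saturate low  */
    return (int16_t)p;
}
```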

While FPGAs can implement floating point computation, they are not particularly well suited to this type of computation. This is due to the large amount of logic required for the FPGA design to implement the floating point calculation. This large amount of logic decreases the overall density of the FPGA design in terms of gates versus computational throughput, thereby reducing the overall computational effectiveness of the FPGA based implementation for primarily floating point based algorithms.

Implementation challenges

When deploying a given algorithm on an FPGA, a number of challenges may arise in the programming of the device. The architectural abstraction of an FPGA is fairly low compared with that of traditional high level programming languages. As such, programmer knowledge of the hardware design and device details is often required, expressed through languages such as Verilog or VHDL. Identifying the optimal mapping of application parallelism onto hardware is also not trivial, and the time consuming synthesis flow adds complexity to identifying the performance-optimal aspects of the implementation. While it is true that high level synthesis tools can raise the level of hardware abstraction, parallelism extraction may be inherently limited by the programming model, or the tools may not offer evaluation and selection of the best parallelism extraction for system performance.

As FPGA technology has matured for embedded computing and signal processing, vendors have devised multiple ways to program and configure the FPGA functionality. The most common approach to specifying the FPGA design is by the use of a hardware description language as mentioned already, commonly Verilog or VHDL. These languages are used to describe the functionality and topology of the system. Follow-on tools are also often used in the design to further optimize the power and configuration of the design for peak performance.

System designers may also want to avail themselves of the modular and open source nature of many FPGA system components. As these types of devices have grown in complexity, designers have begun to produce blocks of hardware description language (HDL) code that can be reused in various products. These modular components allow system developers to reuse circuit designs and other system components from outside designs or third party sources. These blocks can provide very simple functionality, all the way up to fully functional codecs and microcontroller solutions. Programmable IP cores are often available directly from the FPGA vendors, as well as from third party suppliers or, in some cases, even as open source HDL code. Commercial IP is often fee based, but comes with expected documentation, verification tools, and support. These packages often include embedded development software kits, development boards, hard and/or soft processor configurations for programmable cores to be used within the design, and various other software tools and system profilers as necessary.

Given the large amounts of parallelism often available in FPGA based solutions, FPGA based computation is often suitable for high performance embedded solutions based around high rate data acquisition, digital signal processing, and software defined radio solutions. It is not uncommon for various solution providers to incorporate FPGA based components in their overall solution for the heavier lifting portions of the computational stack; this may also afford configurability and programmability that is desired by OEMs building their products upon the platform. The tradeoff here is usually system designer dependent; however, offloading significant computational portions into the FPGA can yield a higher computational throughput per watt for suitable parts of the overall application stack. Furthermore, by alleviating the need for development of these portions of the application stack in software, reduction in time to market and validation may be possible.

As various designers attempt to remain within budget while increasing system complexity, FPGA devices and development tools become increasingly important considerations in an overall platform design. The FPGA can facilitate multiple configurations and performance points within a single hardware design. These are often performance points that are difficult, if not impossible, to meet with software programmable processor solutions. While their recurring cost is typically higher than for other solutions, they are often very attractive solutions for low to medium sized projects that require significant heavy lifting.

Application specific integrated circuits versus FPGA

There are multiple design decisions that must be considered when choosing between an FPGA or an application specific integrated circuit (ASIC) for the hardware design of a system. FPGAs are semiconductor devices that contain programmable logic components or logic blocks, and programmable interconnects. The logic blocks within the FPGA can be programmed with the functions of various logic gates such as logical AND/OR, as well as more complex digital functions such as decoding and multiplication. In most FPGAs, the logic block component of the system also includes memory elements for local data storage. These memory storage elements can be either simple flip flops or local RAM arrays within the device.

Modern FPGAs have a number of uses in embedded systems. At the prototyping stage, they can be used for ASIC prototyping. Due to the high cost of ASIC chips, FPGAs can be used to first verify the logic of the application by programming and uploading the HDL code into the FPGA. This permits faster testing time with reduced cost, and permits the ASIC to be manufactured only upon verification of the design. FPGAs similarly benefit applications that make use of large amounts of parallelism, as was detailed earlier in this chapter. In summary, FPGAs are very attractive for computationally intensive kernels such as FFTs, whereby the data flow graph of the computation maps more directly to the FPGA's compute resources than to those of a programmable embedded architecture. In implementing such kernels in FPGAs, the overhead of software executing on a programmable processor is eliminated, and the massively parallel hardware can be targeted more directly.
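As a sketch of why this mapping is direct, consider the radix-2 decimation-in-time butterfly at the heart of the FFT, shown here in C (the function name is hypothetical). Each butterfly is one complex multiply and two complex additions; an FPGA can instantiate many such units and wire them together following the FFT's dataflow graph, whereas a programmable processor must iterate over them in software:

```c
#include <complex.h>

/* Radix-2 DIT FFT butterfly: one complex multiply by the twiddle
 * factor w, then a complex add and subtract. An FFT of size N contains
 * (N/2)*log2(N) of these, connected in a fixed dataflow graph that an
 * FPGA can realize directly in hardware. Illustrative sketch. */
void butterfly(float complex *a, float complex *b, float complex w)
{
    float complex t = w * (*b);
    *b = *a - t;
    *a = *a + t;
}
```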

Advantages of ASICs over FPGAs

ASICs have a number of advantages over FPGAs, depending on the system designer’s goals. ASICs, for instance, permit fully custom capability for the system designer as the device is manufactured to custom design specifications. Additionally, for very high volume designs, an ASIC implementation will have a significantly lower cost per unit. It is also likely that the ASIC will have a smaller form factor since it is manufactured to custom design specifications. ASICs will also benefit from higher potential clock speeds over their FPGA counterparts.

A corresponding FPGA implementation, on the other hand, will typically have a faster time to market as there is no need for layout of masks and manufacturing steps. FPGAs will also benefit from simpler design cycles over their ASIC counterparts, due to software development tools that handle placement, routing, and timing restrictions. FPGAs also benefit from being reprogrammable, in that a new bit stream can quickly be uploaded, during system development as well as when deployed in the field. This is one large advantage over their ASIC counterparts.

While FPGA devices at one time were selected for lower volume and capacity systems, modern FPGAs have competitive clock rates for use in high capacity systems. In addition, FPGAs have benefited from increased logic density in recent years, as well as other features such as integration of embedded processors, DSP blocks, and high speed serial input/output. As such, they may offer competitive solutions in the signal processing space depending on system requirements and flexibility.

Software programmable digital signal processing

Unlike the aforementioned ASIC and FPGA based solutions, modern DSPs are software programmable by the user using standard C-like programming languages. As such, they provide a greater amount of flexibility to the system developer since their function within the system is determined by the application layer software. Unlike their general purpose processor counterparts, or even embedded general purpose microprocessors, DSPs have a number of high performance, application domain specific features that require a certain level of expertise on the part of system programmers in order to fully exploit the performance of the processor. Examples of such features are advanced addressing modes such as bit reversed and modulo addressing, non-standard bit widths such as wide accumulators, saturating arithmetic operations and perhaps limited integer computational support, memory alignment constraints, and non-uniform and non-orthogonal instruction sets.
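As one concrete example of these features, modulo addressing lets a DSP implement a circular delay line with no explicit wrap-around logic in the inner loop. The C sketch below (type and function names are hypothetical) models the behavior in software; on a DSP the wrap is performed for free by the address-generation hardware, while portable C must spend an instruction on it:

```c
#include <stddef.h>

/* Software model of modulo (circular) addressing for an 8-tap delay
 * line. A DSP's address-generation unit performs the wrap implicitly;
 * in portable C it costs an explicit masking step, cheap here because
 * the buffer length is a power of two. Hypothetical sketch. */
typedef struct {
    float buf[8];
    size_t head;
} delay_line;

void delay_push(delay_line *d, float x)
{
    d->buf[d->head] = x;
    d->head = (d->head + 1) & 7;   /* modulo-8 wrap via one AND */
}
```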

Each of these application domain specific, high performance features described previously may introduce programming hurdles for high performance DSP codes. In addition, they may contribute to barriers in software portability across vendor solutions, although increasingly vendors are providing software emulation libraries to tackle this hurdle. This may be a challenge in the traditional DSP market space as OEMs traditionally do not want to lock their software solution to a given vendor's silicon. As evidence, a number of vendors have recently begun providing software migration packages that allow native software based intrinsic functions for one vendor's architecture to execute on another vendor's architecture. Examples of this are both CEVA and Freescale, who have begun offering software solutions whereby Texas Instruments C6000 intrinsic functions can be emulated in software on the CEVA and Freescale DSPs. In doing this, vendors lower the barrier to entry on migrating legacy software solutions to their own architecture.
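Migration packages of this kind typically supply portable C implementations of another vendor's intrinsics. As a hedged sketch (not any vendor's actual implementation), a saturating 32-bit add in the spirit of the TI C6000 _sadd intrinsic could be emulated as follows; the wide intermediate type stands in for the saturation logic the original architecture provides in hardware:

```c
#include <stdint.h>

/* Portable emulation of a saturating 32-bit add, in the spirit of the
 * TI C6000 _sadd intrinsic. The 64-bit intermediate makes overflow
 * detection trivial; real migration libraries may use faster
 * bit-twiddling forms. Illustrative sketch only. */
int32_t sadd_emul(int32_t a, int32_t b)
{
    int64_t s = (int64_t)a + (int64_t)b;
    if (s > INT32_MAX) return INT32_MAX;   /* clamp positive overflow */
    if (s < INT32_MIN) return INT32_MIN;   /* clamp negative overflow */
    return (int32_t)s;
}
```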

General purpose embedded cores

While the DSP solutions described previously offer a software programmable solution that is much more flexible than their FPGA and ASIC counterparts, the application developer must be aware of the underlying architecture and the proprietary software constructs needed to extract that architecture's performance. Additionally, once a high performance software solution is achieved on a given architecture, a clear migration path for implementing that solution on differing architectures is not always apparent.

General purpose embedded architectures tend to provide a more application generic solution for embedded computing, often incorporating a limited set of features to handle the signal processing components of a given application. These embedded general purpose architectures are designed for wide ranges of applications, varying from consumer electronics to communications and automotive solutions. They usually employ standard 32-bit data paths and are often scaled back versions of existing microprocessors; examples are the ARM, MIPS, and PowerPC architectures. It is not uncommon for these devices to be incorporated into larger dedicated systems or system on chip architectures. Many of these embedded processors include functionality for lighter weight signal processing needs, such as SIMD extensions within the instruction set that are targeted to multimedia or signal processing workloads. An example is the ARM NEON general purpose SIMD extension, which targets multimedia, signal processing, and gaming workloads. Quite often these SIMD extensions are supported by the build tools, allowing for efficient vectorization of SIMD operations without the custom intrinsics and pragmas required by their DSP counterparts. These are often attractive solutions when the application requires modest amounts of signal or image processing, but also requires a general purpose embedded processor solution.
Software developers can often aim to exploit the SIMD functionality afforded on these general purpose embedded architectures only when required by key computational bottlenecks in the application, while relying on the general purpose nature of the architecture for the remaining portion of the overall software application.
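When such a bottleneck is isolated, it can often be written in a form the compiler's vectorizer handles well. The sketch below (a standard saxpy-style loop; the function name is illustrative) uses unit stride, restrict-qualified pointers, and branch-free arithmetic, the properties that allow build tools to map a plain C loop onto NEON-style vector lanes without custom intrinsics:

```c
#include <stddef.h>

/* SIMD-friendly loop: unit stride, no aliasing (restrict), no control
 * flow in the body. Written this way, an auto-vectorizing compiler can
 * map the loop onto NEON-style vector lanes with no intrinsics or
 * pragmas. Illustrative sketch. */
void saxpy(float * restrict y, const float * restrict x,
           float a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```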

Putting it all together

DSP and embedded multiprocessor system on chip architectures and their related hardware constructs are a unique area of computer architecture, driven by the requirements placed on these systems: real-time deadlines, low power consumption, multitasking, and often standardized system components. Rather than utilizing commodity processors scaled down to fit a standard chip size, unique components are often used for the embedded and signal processing portions of the system to meet each system's strict requirements. While some system designers do incorporate custom-like hardware acceleration blocks in their overall system topology, these are quite often microcode controlled accelerators rather than true non-configurable custom circuit designs. While there are still significant challenges in maintaining the intellectual property for vast libraries of hardware acceleration blocks, by utilizing a configurable hardware acceleration block, vendors can tailor the functionality as markets change, or open up the microcode driven control plane to customers as needed for custom functionality.

Architecture

The large majority of multiprocessor systems for embedded computing follow the standard design methodology of programmable CPU cores with associated memories connected via an interconnect framework. Unlike their general purpose or scientific computing counterparts, however, most of these processing solutions are highly heterogeneous in nature. Examples of these are the Texas Instruments OMAP series of processors for mobile handsets, and the Freescale MSC8156 series of processing platforms for wireless infrastructure.

General purpose architectures, such as those used in general purpose and scientific computing, are typically not amenable to embedded and signal processing applications and platforms. This is because the applications of interest to most embedded developers must meet high performance targets in conjunction with real-time deadlines and predictable behavior. Traditional general purpose and scientific computing has striven for high performance and computational throughput, but often without regard for hard limits on how long such computation may take. These real-time demands of embedded systems are further complicated by the strict power and cost requirements placed on these devices.

Embedded systems hardware and platform designers can often exploit characteristics not only of the workloads to be run, but also of the system being developed, to meet these deadlines within constraints such as strict power consumption budgets. Irregular memory systems may save power by reducing the number of RAM ports, dedicated software controlled scratch pad memories may bring real-time predictability to a system design, and irregular interconnect networks amongst processing components can be tailored to the system's actual traffic. Even varying the types of processors used in the system topology can increase performance against the strict criteria above. For example, the control processing code typically associated with user interfaces and operating systems may reside on an ARM-like core or other general purpose processing core within the system topology. Numerically intensive portions of the system software may be better suited to a programmable signal processor architecture similar to a Texas Instruments DSP, or in some cases may benefit from the further parallelism afforded by a GPU style processor. As various components in the software application stack require differing levels of instruction, data, thread, and task level parallelism, so too should the system designer strive to map those computational requirements to the appropriate computational component within the system.

Application driven design

Unlike their general purpose computing counterparts, many of the applications to which heterogeneous multiprocessing hardware is applied in the embedded space are not single algorithms but complex systems consisting of many algorithms. As such, the requirements of the computation to be performed at various points in the system can vary widely, both over physical location within the system and over time amongst components within the system. As an example, algorithmic components within the application may vary widely in terms of the types of operations or computation to be performed, memory system bandwidth requirements, memory access patterns that may affect computational efficiency depending on the processing component architecture, activity profiles, and total memory footprint.

Many multimedia codecs such as MPEG-2 and H.264 have components within them that have widely varying computational demands at runtime (Haskell, 1997). Major computational blocks within these systems can vary from having small data sets, requiring fairly regular systems of multiplication and addition operations, to other data dependent computation that operates on large data sets. Additionally, in large scale systems various components may be fairly standardized in terms of algorithmic requirements (such as a Fast Fourier Transform within the system), whereas for other algorithmic components it may be desirable to keep a proprietary implementation executing in software due to key algorithmic insights from specific vendors.

Bibliography

Haskell BG, Puri A, Netravali AN. Digital Video: An Introduction to MPEG-2. New York: Chapman & Hall; 1997.
