3.7   BENCHMARKING

One of the problems of the programmable logic area has been the lack of progress in developing credible methodologies for comparing architectures. In the absence of such methodologies, marketing departments have sprung up to fill the gap with a wide variety of benchmarks based around two main numbers: gate equivalents and utilization. This situation mirrors the somewhat checkered history of benchmarking conventional computers. A third class of benchmarks, based on the use of silicon area, has been proposed by academics, but no comparative data have been published as yet.

3.7.1   Utilization

Utilization is defined as the ratio of the resources used to implement a particular design to those provided. For example, a design implemented on a cellular array may use 50% of the available cell function units. High utilization is seen as a “good thing,” indicating that the architecture supports the design efficiently. The principal advantage of this metric is that it is easy to calculate. It has several serious shortcomings, however.

  1. Utilization figures are influenced by the size of the box one draws around the design. If the box is measured in chip units, then the design may use a small percentage of the chip's resources simply because it is a small design. For this reason, one might suggest drawing the bounding box in terms of logic cell units. While this is certainly an improvement, it suffers from the drawback that designs that use special-purpose resources (e.g., long wires) may need to be spread out over the chip to access those resources. Similarly, when there is free space, it is often advantageous to spread a design around the periphery of the array to minimize delays caused by routing between the edge of the chip and the logic. Thus even when the bounding box is drawn at the cell level, good architectures that implement designs efficiently using only a small fraction of the available resources may show lower utilization numbers than poorer architectures. It is a hard problem to determine whether a cell is unused because the design did not need it (and it is therefore available if the design is expanded) or unused because routing congestion prevented it from being reached.
  2. Utilization figures are influenced by cell granularity: larger cells will in general result in higher utilization. To understand this point, consider two cellular arrays: one whose function unit can implement any function of up to four input variables and one whose function unit can implement any function of two input variables. Given similar implementation technologies, there will be many fewer 4-input function units per chip than 2-input function units. Some of the 4-input function units will be implementing simple functions, such as 2-input gates. These function units count as fully utilized on a simple cell-counting benchmark, but when one considers the amount of control store and other resources assigned to them, their internal utilization is low (it requires 4 bits of RAM to control a function unit capable of implementing all functions of two variables and 16 bits of RAM for one capable of implementing all functions of four variables). If one wishes to make a valid comparison between the architectures based on the utilization of the underlying resource (RAM cells or silicon area), one must take into account the internal utilization of the larger cells. Furthermore, when one compares well-implemented designs on these architectures, one will normally find that a higher percentage of the larger cells are used. One reason for this is that the routing resource on the 4-input cell array will have been designed with enough capacity to ensure that almost all of the expensive function units can be used. Where function units are less expensive, it makes sense to trade off the area required by a routing system rich enough to ensure that they could all be used against the area effectively lost in function units that cannot be used because of routing constraints.
This trade-off reduces utilization as measured in terms of function units, but increases utilization as measured in terms of the percentage of RAM cells and silicon area that are being usefully employed.
  3. Limited architectures and constrained layout styles can lead to high utilization but very low efficiency. As an example, Algotronix developed a simple piece of synthesis software for its array that generates a ROM-like implementation of a truth table. This implementation does not take advantage of the full flexibility of the cell in terms of either routing or function and corresponds to a design done on a much more limited architecture. When one implements a seven-segment decoder using this tool, one requires 80 cells and obtains 100% utilization. An implementation that takes full advantage of the array, on the other hand, requires 44 cells and has 64% utilization. It is also approximately three times as fast.
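The granularity and internal-utilization points above reduce to simple arithmetic. The following sketch uses only figures quoted in the text (4 vs. 16 RAM bits per function unit; 80 vs. 44 cells for the seven-segment decoder); the helper names are my own:

```python
def utilization(used, provided):
    """Fraction of provided resources actually used."""
    return used / provided

def lut_ram_bits(k):
    """Control-store cost of a function unit implementing all
    functions of k inputs: 2**k RAM bits."""
    return 2 ** k

# A 4-input function unit spent on a 2-input gate counts as "fully used"
# by cell counting, but employs only 4 of its 16 control RAM bits.
internal = utilization(lut_ram_bits(2), lut_ram_bits(4))
print(internal)  # 0.25

# Seven-segment decoder from the text: the ROM-style tool needs 80 cells
# at "100% utilization"; a full-custom mapping needs only 44 cells.
rom_cells, hand_cells = 80, 44
print(hand_cells / rom_cells)  # 0.55 -- the "fully utilized" design costs ~1.8x the cells
```

The point of the sketch is that cell counts and utilization percentages can move in opposite directions, so neither alone says which implementation uses the underlying silicon better.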

3.7.2   Gate Equivalents

Gate equivalent figures can be derived in three main ways.

Marketing Gates Marketing gates are calculated by FPGA suppliers using advanced algorithms based on three basic parameters:

  1. The highest number of 2-input NAND gates (or some other simple primitive) required to implement any circuit configuration of the device. For example, if a configurable cell can implement a D-latch, it takes five NAND gates to implement a D-latch in a particular gate array vendor's library, and there are 4000 configurable cells, then a gate-equivalent figure of 20,000 could be quoted. This would correspond to implementing a 4000-stage shift register on the device.
  2. The user gullibility factor (UGF) is used to derate the figure calculated in step 1 by a percentage intended to represent the device use by “real” designs, since most people will not believe the figures in step 1.
  3. The gate equivalent figures quoted by their direct competitors are used to alter the UGF figure arrived at in step 2 by a sufficient margin to ensure that the current product comes out ahead of all existing competition in any gate equivalent comparisons done in the trade press.
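Tongue firmly in cheek, the three-step recipe above can be sketched as a function. The cell count and gates-per-cell figure are the D-latch example from the text; the UGF value and competitor figure are invented:

```python
def marketing_gates(cells, gates_per_cell, ugf, competitor_best):
    """Parody of the marketing-gate 'calculation' described in the text.

    cells           -- number of configurable cells on the device
    gates_per_cell  -- NAND gates needed for the densest cell configuration
    ugf             -- user gullibility factor: derating applied in step 2
    competitor_best -- best figure quoted by any direct competitor (step 3)
    """
    raw = cells * gates_per_cell              # step 1: 4000 * 5 = 20,000
    derated = raw * ugf                       # step 2: make it "believable"
    return max(derated, competitor_best + 1)  # step 3: always come out ahead

print(marketing_gates(4000, 5, 0.6, 15000))  # 15001
```

Step 3 is, of course, the step that makes the other two irrelevant.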

Benchmark Gates Benchmark gates are calculated by running a series of “typical” benchmark circuits through the automatic placement and routing systems of the target FPGA. Each benchmark circuit has previously been implemented on an industry-standard gate array technology using n gate array gates and is therefore classified as having n equivalent gates. From the results of these tests the FPGA is rated as providing a given number of equivalent gates in the gate array technology. This technique is certainly the most credible way of benchmarking FPGAs in common use, but it has several problems.

  1. The technique tests not only the underlying architecture but also the entire CAD system. This may be regarded as an advantage, but it is unclear whether benchmark results are more influenced by place-and-route software than by the underlying architecture. Place-and-route systems are generally under continuous improvement, and often have a large number of tuning parameters that can make sizable differences to the final results. Thus benchmarks that rely on place-and-route can generate a lot of heat without throwing much light on the architectural comparison.
  2. More seriously, this benchmarking technique will favor architectures that are similar to the gate array over more radical architectures that may be more efficient as programmable structures. This is because the original designs will have been done with gate array technology in mind and will use primitives that are efficiently implemented on gate arrays.
  3. The benchmark circuits chosen are often not representative of those used in the application of interest. Poor performance on benchmark circuits typical of small glue-logic functions is not indicative of performance when implementing highly structured systolic arrays. In this context it is worth pointing out that many standard benchmarks are too small and show only how efficiently particular subcircuits can be implemented. Ideally there should be whole systems composed of many subcircuits, for which serious consideration has been given to floorplanning issues, in order to minimize area and optimize routing.
  4. Benchmark circuits are known in advance and are generally small; often they are based on popular TTL parts. This implies that FPGA architectures and CAD systems can be tuned to score well on them, for example, by providing a particular resource useful in the benchmark circuit or by handcrafting a macro to implement a benchmark subsystem.
  5. One important architectural trade-off in FPGA design is whether to devote silicon area to special-purpose resources that allow the very efficient implementation of particular common subcircuits or whether to devote it to increasing the quantity or performance of the general-purpose resources provided. As an example, the area devoted to a long line could be used to tune the performance of all the routing multiplexers, and the area devoted to wide gates could be used to improve the performance or increase the number of functions implemented by the cell function units. Small benchmarks that involve only a small proportion of a device's resources favor architectures that provide special-purpose resources over those that provide general-purpose resources, in terms of both density and performance measures. This is because they do not reflect the normal situation with larger designs, where the number and distribution of special-purpose resources over the chip may prevent their use. For example, a 16:1 multiplexer may be implemented very efficiently using long lines on a given architecture to reduce delays. When this circuit is placed and routed in isolation, there are plenty of long lines to go around. In a more realistic design one might need four or five such multiplexers plus an address decoder and some register files, all of which will compete for the long-line resources. The distribution of the long lines on the chip limits where these units can be placed if they are to take advantage of them. Thus, it is unlikely that every subunit in the real design will be able to take advantage of the special-purpose resources. It is also likely that the subunits that cannot use the special-purpose resources on an architecture that provides them will be slower and require more area than their counterparts on an architecture without special resources.
The actual performance of the two architectures on a realistic design is likely, therefore, to be significantly different from that predicted by the benchmark.
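The long-line contention argument can be made concrete with a toy allocation model. All of the numbers and the helper function below are hypothetical illustrations, not measurements of any real architecture:

```python
def average_delay(units_wanting_long_lines, long_lines, fast_ns, slow_ns):
    """Average subunit delay when only `long_lines` subunits can be placed
    to reach the fast special-purpose resource; the rest fall back to
    ordinary (slower) routing."""
    fast = min(units_wanting_long_lines, long_lines)
    slow = units_wanting_long_lines - fast
    return (fast * fast_ns + slow * slow_ns) / units_wanting_long_lines

# Benchmark view: a single 16:1 multiplexer with long lines to spare.
print(average_delay(1, 8, 10, 25))   # 10.0 ns

# Realistic design: five multiplexers, a decoder, and two register files
# (eight units), but only enough well-placed long lines for three of them.
print(average_delay(8, 3, 10, 25))   # 19.375 ns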

A better way of obtaining realistic gate-equivalent numbers would be to specify a set of several applications to be implemented by different FPGAs at a very high level and allow the implementations to be done by experts on those architectures. Unfortunately, the effort involved in implementing realistic applications is likely to rule out this approach.

Real Gates These are gate-equivalent figures arrived at by users after implementing their applications on a given FPGA. Unfortunately, these numbers cannot be arrived at without spending several months and several thousand dollars on CAD tools. Normally real gate counts are 5 to 10 times smaller than marketing gates, and can be either better or worse than benchmark gates according to how well the application and chosen design style fit the architecture.

3.7.3   RAM- and Area-based Benchmarks

A third class of benchmarks, which has found favor mainly in the academic community, attempts to quantify architectures based on their use of the underlying resource: silicon area. With RAM-based FPGAs, silicon area scales very closely with RAM cell count, and counting RAM cells allows a comparison between architectures independent of technology design rules and die size. This style of benchmarking is therefore of most use to those attempting to evaluate architectures rather than those attempting to decide which existing device to buy.

There are two basic ways in which this methodology can be used. The first is to implement a set of benchmark circuits and compare the absolute RAM cell cost of each implementation. As with the gate-equivalent benchmarks, the choice of benchmark circuits is of critical importance. The second is to attempt a more mathematical, information-theoretic analysis of the architecture based on the number of distinct circuit configurations it can implement for a particular number of bits of control store. This form of analysis can rapidly become intractable, but it is likely that in the long term fundamental results obtained here will be as important for FPGA designers as those obtained in computational complexity theory are for algorithm designers.
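The information-theoretic view gives an immediate upper bound: b bits of control store can select at most 2**b configurations, so an architecture whose control bits map to duplicate or illegal configurations implements fewer distinct circuits per bit. The efficiency metric below is my own illustration of this idea, not a published measure:

```python
from math import log2

def config_efficiency(distinct_circuits, control_bits):
    """Fraction of the 2**control_bits configuration space that maps to
    distinct, legal circuits: log2(distinct_circuits) / control_bits."""
    return log2(distinct_circuits) / control_bits

# A 2-input lookup table: all 16 settings of its 4 RAM bits give
# distinct functions, so no control store is wasted.
print(config_efficiency(16, 4))  # 1.0

# A hypothetical cell with 8 control bits but only 64 distinct behaviors
# (the rest are duplicates or illegal): a quarter of the store is wasted.
print(config_efficiency(64, 8))  # 0.75
```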

3.7.4   Benchmarking and Dynamic Reconfiguration

One important area that has not been considered in any of the FPGA benchmark proposals to date is the use of dynamic reconfiguration. Dynamic reconfiguration can allow swapping of large sections of an application into and out of an FPGA. For example, a serial interface may be specified to be in either receive or transmit mode. Using dynamic reconfiguration, an FPGA just large enough to implement the larger of the receive and transmit functions could be used; with conventional technology the FPGA would have to be large enough to implement both functions at the same time. The area saving could approach 50%. Should this give the FPGA that supports dynamic reconfiguration twice as many equivalent gates? Does it have 200% utilization?
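The serial-interface example translates into simple arithmetic: a static device must hold every mode at once, while a reconfigurable one need only hold the largest. The cell counts below are hypothetical:

```python
def reconfig_saving(mode_areas):
    """Fractional area saved by time-sharing one region of the chip
    among mutually exclusive modes.

    Static FPGA:         must hold every mode at once -> sum of areas.
    Reconfigurable FPGA: holds one mode at a time     -> largest area.
    """
    static = sum(mode_areas)
    dynamic = max(mode_areas)
    return 1 - dynamic / static

# Hypothetical: transmit logic 900 cells, receive logic 850 cells.
print(reconfig_saving([900, 850]))  # ~0.486, approaching the 50% in the text
```

The saving reaches exactly 50% only when the two mode implementations are the same size, which is why the text says the saving could "approach" 50%.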

Similarly, direct accesses to the control store can be used on smaller logic blocks within a design to implement functions such as clearing or loading registers or customizing adders for a particular constant input: this technique can often halve the gate count and double the speed of a subunit. The ability to read the contents of an internal register through the control store interface might make a large bus to route the value to the chip edge redundant, saving a large amount of space. Should benchmarks take these capabilities into account or should they insist that every architecture implements the same netlist?

3.7.5   The PREP Benchmarks

In 1993, PREP, a nonprofit organization supported by most of the major companies in the programmable logic market, produced a set of nine benchmarks [Prep93]. These benchmarks cover a range of traditional logic functions from state machines to adders and counters. For each benchmark two main numbers are calculated: the number of instances of the benchmark that can be fitted on the device and the maximum clock frequency at which the benchmark can run on the device. PREP has a benchmarking methodology and publishes “certified” results, so one does not have to trust the device manufacturer's own data. While these benchmarks are far from perfect, they represent the best available source of comparative data on FPGA architectures.

3.7.6   Summary and Health Warning

In conclusion, with the present state of technology, one should treat all benchmark claims with several pinches of salt. If at all possible, look for applications similar to yours that have been successfully implemented on the architecture, either in manufacturers' application notes or in the published literature. Be aware that there is a high degree of variance in the efficiency with which an architecture may implement different applications; just because it scores well on benchmarks does not mean it is good for your application. Also be aware that taking full advantage of FPGAs requires adopting design styles that match their architectures; if you avoid doing so, performance and resource efficiency will suffer badly. Trying to map an existing TTL netlist onto the FPGA is usually a recipe for disaster. Many new users of FPGAs make purchasing decisions on the basis of a CAD system's ability to support design input in terms of TTL macros to allow easy conversion of existing designs; many quickly become disillusioned with the results, particularly for RAM-programmed architectures. Similarly, benchmarks that do not involve changing design styles will not indicate the full potential of the architectures they measure.
