7

Conclusions

Prediction is very difficult, especially if it’s about the future.

—Niels Bohr

As cycle times in high-performance digital systems shrink faster than mere process improvement allows, sequencing overhead consumes an increasing fraction of the clock period. Flip-flops and traditional domino circuits, in particular, suffer from clock skew, latch delay, and the inability to balance logic between cycles through time borrowing. The overhead of traditional domino circuits can waste 25% or more of the cycle time in aggressive systems! Fortunately, the designer can hide much of this overhead through better design techniques. Static pipelines built from transparent latches can tolerate nearly half a cycle of clock skew and help the designer balance logic with time borrowing. Pulsed latches offer similar advantages, trading some skew tolerance and time borrowing for a faster latch. Skew-tolerant domino circuits are particularly fast, completely eliminating latch delay and tolerating modest amounts of clock skew and time borrowing. Smaller amounts of skew can be budgeted in local clock domains than across a large die, reducing the burden of clock skew.

Chapter 2 explored the design of static circuits using flip-flops, transparent latches, and pulsed latches. We found that the purpose of such elements is not so much to remember information as to sequence information along a pipeline or through a state machine, preventing data in one stage from interfering with data in another. The elements must slow down fast paths to prevent interference while minimizing extra delay on paths that are already critical. Because it is impossible to slow some paths without at least slightly impeding all others, these elements inevitably introduce sequencing overhead. The sequencing overhead imposed by the hard edges of flip-flops is worst, including two latch delays and clock skew. Transparent latches are faster, hiding the clock skew. Pulsed latches can be even faster, introducing only one latch delay while still possibly hiding the clock skew. The speed of pulsed latches comes at the expense of longer hold times. Transparent latches and pulsed latches are also good because they provide a window during which data may arrive without extra delay. In addition to hiding clock skew, this window allows logic to borrow time across cycles to balance logic. Pulsed latches and flip-flops have frequently been mixed up in the literature because they both are used once per cycle; pulsed latches can be distinguished by their window of transparency.

Chapter 3 moved on to the design of domino circuits. Domino circuits offer raw gate delays 1.5 to 2 times faster than static circuits, making them very popular for high-speed designs. Unfortunately, traditional domino design techniques also impose hard edges at every half-cycle boundary, leading to enormous overhead of two latch delays and twice the clock skew in every cycle! Skew-tolerant domino circuits use multiple overlapping clock phases and eliminate latches to soften these hard edges, removing the sequencing overhead entirely.
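The overhead terms summarized above for Chapters 2 and 3 can be tabulated in a short sketch. The numeric inputs (cycle time, latch delay, clock skew) are illustrative assumptions rather than figures from the text, and the function name `sequencing_overhead` is hypothetical. The pulsed-latch entry assumes the skew is fully hidden, which the summary describes only as possible.

```python
# Sequencing overhead per cycle for each clocking scheme, following the
# qualitative summary in the text. All numeric values are assumed.

def sequencing_overhead(t_latch, t_skew):
    """Return overhead per cycle (same units as the inputs) by scheme."""
    return {
        # Hard edges: two latch delays plus clock skew.
        "flip-flop": 2 * t_latch + t_skew,
        # Two latch delays; the skew is hidden by the transparency window.
        "transparent latch": 2 * t_latch,
        # One latch delay; assumes the skew is fully hidden.
        "pulsed latch": t_latch,
        # Hard edges at every half cycle: two latch delays, twice the skew.
        "traditional domino": 2 * t_latch + 2 * t_skew,
        # Overlapping clocks and no latches: no sequencing overhead.
        "skew-tolerant domino": 0.0,
    }


if __name__ == "__main__":
    t_cycle = 1000.0  # ps, assumed
    t_latch = 50.0    # ps, assumed
    t_skew = 75.0     # ps, assumed
    for scheme, overhead in sequencing_overhead(t_latch, t_skew).items():
        share = 100.0 * overhead / t_cycle
        print(f"{scheme:22s} {overhead:6.1f} ps  ({share:.1f}% of cycle)")
```

With these assumed values, traditional domino loses 250 ps of a 1000 ps cycle, which is in line with the 25% figure quoted at the start of the chapter, while flip-flops lose 175 ps.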

Chapter 4 united skew-tolerant domino with transparent latches and pulsed latches in a systematic four-phase skew-tolerant circuit design methodology. The interface from static to domino logic inherently must budget clock skew, motivating the designer to build entire critical loops from dual-rail domino circuits to avoid this penalty. Skew-tolerant domino also integrates seamlessly with RAMs, PLAs, and other dynamic structures. With four clock phases and many different types of clocked elements, it is easy to become confused about legal connections. By tagging each signal with a timing type, it is simple to verify connectivity. The methodology also includes scanning data in and out of both static and domino pipeline stages to help testability.

None of these skew-tolerant circuit techniques would be useful if the clock generators were too complex or introduced more skew than they tolerated. Chapter 5 examined clocking, beginning with the often used but seldom defined term clock skew. The designer wishes to receive a small number of logical clocks with precisely defined phase relationships arriving at all parts of the chip simultaneously. Variations in the global and local clock generation and distribution circuits cause the designer to actually receive slightly different physical clocks at each point. These variations can be categorized as predictable or unpredictable and as DC, slowly varying, or rapidly varying; different techniques can be used to handle different components. Ultimately, it is very difficult to reduce worst-case clock skew below 200 ps across a complex chip. When such skews become a significant problem, the designer can introduce clock domains to budget smaller amounts of skew between local elements than across the entire chip. Unfortunately, clock domains do not reduce duty cycle variation, which is an increasingly important component of skew.

Design techniques are of little value unless accompanied by suitable verification tools. In particular, most static timing analyzers from the mid-1990s are unable to take advantage of reduced clock skews in local clock domains. Chapter 6 addressed timing analysis, showing that arrival times cease to have absolute meaning in systems with different skews between different elements. Instead, arrival times must be specified with reference to a particular launching clock that determines the skew relative to the receiver. Therefore, timing analysis introduces a vector of arrival times at each latch with respect to different launching clocks. Fortunately, this vector is relatively sparse because most paths do not borrow time in a real system.
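The idea of keeping a vector of arrival times can be illustrated with a minimal sketch. This is not the analysis algorithm of Chapter 6: the domain names, the skew budgets, and both helper functions are hypothetical, chosen only to show why an arrival time needs to be tied to its launching clock.

```python
# Hypothetical skew budgets in ps: small within a local clock domain,
# larger between different domains.
LOCAL_SKEW = 50.0
GLOBAL_SKEW = 200.0


def skew_between(launch_domain, receive_domain):
    """Skew budget for a given (launching, receiving) pair of domains."""
    return LOCAL_SKEW if launch_domain == receive_domain else GLOBAL_SKEW


def worst_case_arrival(arrivals, receive_domain):
    """arrivals maps launching domain -> arrival time (ps) relative to it.

    Only launchers that actually reach this latch appear in the dict,
    so the vector stays sparse. Each entry gets its own skew budget.
    """
    return max(t + skew_between(d, receive_domain)
               for d, t in arrivals.items())


def single_number_arrival(arrivals):
    """Pessimistic alternative: one arrival time, global skew for all."""
    return max(arrivals.values()) + GLOBAL_SKEW


if __name__ == "__main__":
    arrivals = {"A": 420.0, "B": 300.0}  # assumed arrival times
    print(worst_case_arrival(arrivals, "A"))
    print(single_number_arrival(arrivals))
```

In this example the receiver sits in domain A, so the late path launched from A is charged only the local skew. Collapsing the vector to a single arrival time would charge the global skew to that path as well and overstate the required time.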

In summary, conventional designs with flip-flops and traditional domino clocking are becoming inadequate for high-speed designs. Systems operating above 1 GHz will be unlikely to achieve acceptably low global skew across the entire die at reasonable cost. Instead of abandoning the synchronous paradigm entirely for an asynchronous design, designers will divide the die into local clock domains offering smaller amounts of skew within each domain and will use skew-tolerant circuit design techniques to hide this modest amount of skew. Transparent latches have a long history of successful use; pulsed latches bring larger min-delay constraints, but are even faster and have been successfully used on large microprocessors. Skew-tolerant domino can achieve zero overhead, offering the full speedup of domino gates. With such approaches, we expect clocked systems will remain viable to extremely high operating frequencies.

As systems grow to include hundreds of millions of transistors operating at many gigahertz, circuit designers will encounter even more challenges. Although skew-tolerant techniques can potentially hide up to half a cycle of clock skew, design becomes very difficult at such extremes. If global skew does not fall well below 200 ps, standard approaches to global communication will not work above 2.5 GHz, where the cycle time is 400 ps and 200 ps is already half a cycle. Communication between different clock domains may have to occur at reduced frequency or via an asynchronous interface [10, 23]. Even within a local clock domain, duty cycle variation will cut into the amount of time available for borrowing and may eventually require local correction.

Moreover, domino circuits face an impending power crunch. Chip performance will become power limited because only a finite amount of heat can be removed from a die with a reasonably priced cooling system. Although dual-rail domino gates are extremely fast, they are much more power hungry than static circuits because their activity factors are usually far greater. Improvements in clock gating will disable some inactive domino gates, but domino will continue to pay a large power premium. Will domino gates become too expensive from a power perspective, or will designers find it better to build simpler machines with fewer transistors running at extreme domino speeds than very complex machines churning along at the speed of static logic?

Domino presents other challenges as well. The aspect ratios of wires continue to grow, making coupling problems greater. Scaling device thresholds increases leakage currents and reduces noise margins. The design time of domino can also be high, possibly increasing time to market. Will static circuits become a better choice for teams with finite resources, or will advances in CAD tools improve domino productivity? Circuit design should remain an exciting field as these issues are explored in the coming decade.
