Current Status and Recent Developments in RSFQ Processor Design

M. Dorojevets

Dept. of Electrical and Computer Engineering, SUNY–Stony Brook, NY 11794-2350, U.S.A.

1.   Introduction

The need for high-performance, low-power computer systems capable of dealing with the avalanche of data in critical applications is well recognized.

A 2002 report entitled “High Performance Computing for the National Security Community”, prepared by NSA in cooperation with other federal agencies, clearly identified the need for user-friendly high-end computing systems with high-bandwidth capabilities because of the existence of critical applications for the national security community “that are neither met nor addressed by the commercial sector”.1

Some of the key findings of the 2006 report of the Joint U.S. Defense Science Board and UK Defense Scientific Advisory Council Task Force on Defense Critical Technologies2 were that “there are applications that cannot be solved with sufficient speed or with sufficient precision”, that “commercial off-the-shelf (COTS) technologies will be insufficient to meet unique military needs”, and that the DoD “should invest in critical, defense-niche technologies in order to assure competitive advantage over potential adversaries”.

The 2008 DARPA “Exascale Computing Study: Technology Challenges in Achieving Exascale Systems”3 concluded that there are four major challenges to achieving exascale systems (energy and power; memory and storage; concurrency and locality; and resiliency) where current technology trends are simply insufficient, and that significant new research was absolutely needed to bring alternatives on line.

Superconductor processors based on rapid single flux quantum (RSFQ) circuit technology can reach and exceed operating frequencies of 100 GHz, while keeping processor power consumption low. These features provide an opportunity to build different types of systems with ultra-high-speed energy-efficient RSFQ microprocessors to address the government’s critical mission needs.

An NSA report entitled “Superconducting Technology Assessment (STA)” was prepared by a group of experts in 2005 and updated in December of 2007.4 The major conclusion of the report was that the superconducting technology was “sufficiently mature for a major development investment to bring it to a state of readiness for use in high-end machines”.

2.   Past and present RSFQ processor design projects

The issues of RSFQ processor and system design have been addressed in several projects: SPELL processors for the hybrid technology multi-threaded (HTMT) project,5,6 the 8-bit FLUX processor7,8 and 8-bit Frontier datapath9 projects in the U.S., and the bit-serial CORE processor10 and LSRDP computer11 projects in Japan. These projects are summarized in Table 1.

The HTMT petaflops computer project was a collaboration of several academic, industrial, and U.S. government labs with the goal of studying the feasibility of a petaflops computer system design based on new technologies, including superconductor RSFQ technology.

The HTMT RSFQ-related design work focused on the following issues:

•   a multithreaded processor architecture that could tolerate huge disparities between the projected 50-60 GHz speed of RSFQ SPELL processors and the much slower non-superconductor memories outside the cryostat; and

•   the projected characteristics of the RSFQ superconductor petaflops subsystem consisting of ~4,000 SPELL processors with a small amount of superconductor memory and a superconductor network for inter-processor communication.

Table 1. Superconductor RSFQ microprocessor design projects.

The architecture of SPELL processors supported dual-level multithreading with 8–16 multistream units (MSUs), each capable of simultaneous execution of up to four instructions from multiple threads running within the MSU and sharing its set of functional units. However, no processor chip design was done for SPELL processors; their technical characteristics are estimates based on the best projections of RSFQ circuits available at that time (1997–1999).

The 8-bit FLUX-1 microprocessor was the first practical attempt to address architectural and design challenges for 20 GHz RSFQ processors. The FLUX-1 design and fabrication were done in the framework of the FLUX project as a collaboration between SUNY–Stony Brook, TRW (now Northrop Grumman Space Technology), and NASA’s Jet Propulsion Laboratory.

A new partitioned microarchitecture was developed for FLUX-1 with the following distinctive features:

•   ultrapipelining to achieve 20 GHz clock rate with only 2-3 Boolean operations per stage;

•   two operations per cycle (40 GOPS peak performance for 8-bit data);

•   short-distance interaction and reduced connectivity between arithmetic logic units (ALUs) and registers;

•   bit-streaming, which allows any operation that depends on the result of an operation in progress to start working with the data as soon as its first bit is ready;

•   wave pipelining in the instruction memory;

•   modular design; and

•   ~25 control, integer arithmetic, and logical operations (no load/store operations).

The final FLUX-1 chip, called FLUX-1R, was fabricated in 2002. It had 63,107 Josephson junctions (JJs) on a 10.35 × 10.65 mm² die with power consumption of ~9.5 mW at 4.5 K.

Operation of a one-bit ALU-register block (the most complex FLUX-1R component) was confirmed by testing. No operational FLUX-1R chips were demonstrated by the time the project ended in 2002.

Several bit-serial microprocessor prototypes with a very simple architecture called CORE1 have been designed, fabricated, and successfully tested at high speed in the Japanese Superconductor Network Devices project. Participants in this project included Nagoya, Yokohama, and Hokkaido Universities, the National Institute of Information and Communications Technology at Kobe, and the ISTEC Superconductor Research Lab (SRL) at Tsukuba in Japan.

The latest CORE1γ microprocessor has two 8-bit data registers, two bit-serial ALUs, and a small amount (several bytes) of on-chip shift-register-type memory for instructions and data. The instruction set consists of seven 8-bit instructions.

These CORE1 microprocessors have extremely simplified, pipelined processing and control logic, and use slow (1 GHz) system and fast (16-25 GHz) local clocks. The slow 1 GHz system clock is used to advance an instruction from one execution phase to another. The fast local clock is used for bit-serial data transfer and bit-serial data processing within each instruction execution phase. A 25 GHz CORE1γ chip has ~22K JJs and a power consumption of 6.3 mW.
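The bit-serial processing style described above can be illustrated with a minimal Python sketch. It is not the CORE1 design: the function names and the LSB-first bit ordering are assumptions made for illustration only. One bit of each operand is consumed per tick of the fast local clock, and a single stored carry bit links consecutive ticks.

```python
def to_bits_lsb_first(value, width=8):
    """Split an unsigned integer into a list of bits, least significant first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits_lsb_first(bits):
    """Reassemble an unsigned integer from an LSB-first bit list."""
    return sum(bit << i for i, bit in enumerate(bits))


def bit_serial_add(a_bits, b_bits):
    """Add two equal-length LSB-first bit streams, one bit pair per fast-clock tick.

    Returns (sum_bits, carry_out). The carry variable plays the role of the
    one-bit state held between ticks (hypothetical model, not a CORE1 netlist).
    """
    if len(a_bits) != len(b_bits):
        raise ValueError("operand streams must have the same length")
    carry = 0
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        sum_bits.append(a ^ b ^ carry)           # sum bit for this tick
        carry = (a & b) | (carry & (a ^ b))      # carry kept for the next tick
    return sum_bits, carry


def add_8bit(x, y):
    """Convenience wrapper: 8-bit addition through the bit-serial adder."""
    sum_bits, carry = bit_serial_add(to_bits_lsb_first(x), to_bits_lsb_first(y))
    return from_bits_lsb_first(sum_bits), carry
```

In this sketch an 8-bit operand needs eight fast ticks. With a fast clock of 16-25 GHz and a slow clock of 1 GHz, between 16 and 25 fast ticks fit into one slow-clock period.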

A new long-term large-scale reconfigurable data-path (LSRDP) project with the goal of developing a 10 TFLOPS RSFQ computer started in Japan in 2007. The proposed computer will have a silicon CMOS general-purpose processor and DRAM main memory at room temperature and a cryo-cooled bit-serial RSFQ LSRDP datapath sub-system. The LSRDP section consists of a large number of bit-serial floating-point units connected to each other through a bit-serial programmable operand routing network. No cryo-memory is planned for the LSRDP sub-system. The data flow in LSRDP is one-directional from input to output with no internal information feedback loops.

By 2009, several bit-serial floating point units and routing network chips with up to UK JJs operating at 22-25 GHz clock frequency have been successfully designed and demonstrated within the framework of the LSRDP project in Japan.

In the meantime, a new RSFQ processor datapath project started in the U.S. in 2009. Its goal is to design and demonstrate the first 8-bit wide 20 GHz processor datapath with on-chip cryo-memory. The datapath will use new asynchronous wave-pipelined wide datapath techniques developed at SUNY–Stony Brook.12,13 This project is a collaboration between the design teams at Stony Brook and Hypres. The fabrication of the datapath chip using the Hypres 4.5 kA/cm² technology is planned for 2010.

3.   Major challenges in RSFQ processor design

The availability of ultra-high-speed, low power superconductor circuit technology is only one of several requirements for successful high-performance processor design. Any change in circuit technology calls for a re-examination of the processor microarchitectural and design techniques that work well with it.

The major design challenges are:

•   reliable design and synchronization techniques for 50–100 GHz wide datapath RSFQ processors;

•   latency tolerance;

•   static power reduction;

•   memory; and

•   CAD tools for VLSI superconductor circuit design.

The processors shown in Table 1 have been designed with the use of the canonical RSFQ logic14 and corresponding cell libraries.15,16 The canonical RSFQ logic favors super-pipelined designs with very high clock frequencies because almost all logical functions can only be done with clocked gates, each of which logically represents a combination of a stateless Boolean logic element (e.g. an inverter) and a flip-flop. As a result, the current synchronous datapath RSFQ designs are optimized for ultra-high processing rates at the expense of significant latency in their long processing pipelines. As is well known, raising the clock rate by increasing the number of pipeline stages does not necessarily improve performance.

In order to distribute clock signals to RSFQ cells, binary signal distribution trees built of active elements called pulse splitters have to be used. This leads to unavoidable skew and jitter in the clock distribution networks of synchronous RSFQ processors. The need to synchronize the work of almost all RSFQ gates makes the issue of clock skew, and synchronization in general, extremely challenging for synchronous RSFQ designs with wide 32/64-bit datapaths. This was one of the reasons (besides the immature technology) why the FLUX-1,7 the CORE microprocessors,10 and the LSRDP RSFQ subsystem11 all used bit-serial processing in their ALUs and floating-point units. While this approach allowed the designers to decrease the impact of clock skew and reach frequencies of up to 25 GHz for low-complexity bit-serial processor datapaths, it is not scalable or applicable to future full-fledged 50 GHz 32/64-bit RSFQ processor designs.

The on-chip gate-to-gate communication delays in 50 GHz microprocessors will limit the space reachable in one cycle to ~1–2 mm. The microarchitecture and design of superconductor processors must support a communication-aware, localized computation model in order to have processor functional units fed with data from registers located in close proximity to the units. The most radical attempt to deal with this issue was made in the design of the FLUX-1 RSFQ microprocessor. It had a partitioned microarchitecture where eight ALUs were interleaved with eight registers, which made the distance between them very short. Also, instead of data traveling from registers to ALUs and back, multiple computation waves traveled along static data in registers, which could be called integer bit-streaming processing-in-registers.7
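The order of magnitude of the one-cycle reach quoted above can be checked with a short back-of-the-envelope calculation. The Python sketch below is only an estimate under stated assumptions: the signal velocity of 0.1 mm/ps (roughly one third of the speed of light) and the 10 ps of gate overhead are illustrative values, not figures taken from this paper, and the function names are hypothetical.

```python
def cycle_time_ps(clock_ghz):
    """Clock period in picoseconds for a clock frequency given in GHz."""
    return 1000.0 / clock_ghz


def reach_mm(clock_ghz, velocity_mm_per_ps, gate_overhead_ps=0.0):
    """Distance a signal can cover in the part of one cycle left after gate overhead."""
    usable_ps = cycle_time_ps(clock_ghz) - gate_overhead_ps
    return max(usable_ps, 0.0) * velocity_mm_per_ps


# Assumed values, not taken from the text: on-chip signal velocity of
# 0.1 mm/ps and 10 ps per cycle spent in sending and receiving gates.
ASSUMED_VELOCITY_MM_PER_PS = 0.1
upper_bound_mm = reach_mm(50.0, ASSUMED_VELOCITY_MM_PER_PS)
with_overhead_mm = reach_mm(50.0, ASSUMED_VELOCITY_MM_PER_PS, gate_overhead_ps=10.0)
```

Under these assumed values the estimate falls between about 1 and 2 mm for a 50 GHz clock, which is consistent with the range given above.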

Another characteristic of the canonical RSFQ logic is the use of large bias resistors and static (bias) current to keep junctions biased to ~70% of their critical current value. The static power dissipated in the bias resistors used in each cell is currently an order of magnitude higher than the dynamic power dissipated when junctions in RSFQ cells switch. Although there are known techniques that can significantly reduce static power consumption in RSFQ cells,17 the issue of static power reduction has not yet been explored in larger complex circuits.

The principal reason for this was the immaturity of earlier fabrication processes, which could produce only relatively low-complexity designs in which even unoptimized static power consumption was still very low. For instance, the most complex 25 GHz CORE1γ processor chip had 22K JJs dissipating 6.3 mW (total power) at cryo-temperatures. Taking into account the cooling costs (from 300 to 5-6 K), the total (wall plug-in) power at room temperature is approximately three orders of magnitude higher than the power required for operation at cryogenic temperatures.
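The power figures above can be turned into a rough estimate with a few lines of Python. This is a sketch only: the cooling overhead of 1000 is a round stand-in for "approximately three orders of magnitude", and applying a 10:1 static-to-dynamic ratio to a specific chip is an assumption made for illustration, not a measured breakdown. The function names are hypothetical.

```python
def wall_plug_power_w(cryo_power_w, cooling_overhead=1000.0):
    """Room-temperature power estimate: cryogenic power times an assumed cooling overhead."""
    return cryo_power_w * cooling_overhead


def split_static_dynamic(total_power_w, static_to_dynamic_ratio=10.0):
    """Split a total power into (static, dynamic) parts for an assumed ratio."""
    dynamic = total_power_w / (1.0 + static_to_dynamic_ratio)
    return total_power_w - dynamic, dynamic


# 6.3 mW is the total CORE1γ chip power quoted in the text.
core1_cryo_w = 6.3e-3
core1_wall_plug_w = wall_plug_power_w(core1_cryo_w)        # about 6.3 W
core1_static_w, core1_dynamic_w = split_static_dynamic(core1_cryo_w)
```

With these assumptions the 6.3 mW chip corresponds to roughly 6.3 W at the wall plug, and all but about a tenth of the cryogenic power would be static.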

In order to continue to have a competitive edge in energy efficiency over future CMOS designs, RSFQ designers need to switch from the use of canonical RSFQ cell libraries to new libraries designed for low static power consumption. A radical step to completely eliminate bias resistors and static power consumption was proposed recently in a new type of logic called reciprocal quantum logic (RQL).18

Cryogenic random access memory (RAM) remains a perennial challenge for the superconductor design community. The cryo-memory (more precisely, the lack of it) had the biggest impact on the architecture of the superconductor processors and systems shown in Table 1.

In 2007, to evaluate the new 10 kA/cm² fabrication process, the SRL foundry in Japan fabricated two types of SFQ RAM chips: 4 Kbit and 16 Kbit with 23,488 and 80,768 JJs, respectively. Unfortunately, the yield of the 16 Kbit RAM (the largest SFQ RAM implemented to date) was only 63.3% due to the large ac bias current required for memory operation.19 Besides off-chip RAM, superconductor processors need low-latency, high-throughput pipelined RAM to implement on-chip storage structures capable of working at the same rate as processor functional units.

Finally, the design of a new generation of high-complexity energy-efficient wide-datapath RSFQ processors with clock frequencies reaching 50 GHz creates additional requirements for CAD tools. For the cell-level design, this requires the development of new VHDL/Verilog cell libraries capable of taking into account gate delay fluctuations, and measuring both static and dynamic power consumption. For instance, the reduction of static current through the use of the LR-load biasing technique17 will make the cell propagation delays slightly dependent on the operation of neighboring cells sharing the same bias lines. This may require the context-dependent calculation of propagation delays for each cell.

4.   Asynchronous hybrid wave pipelining for 50 GHz 32/64-bit RSFQ processors

Efficient synchronization and design techniques are among the biggest challenges for the 50 GHz wide datapath RSFQ processor design.

In conventional synchronous pipelining, intermediate pipeline latches are used in addition to input and output registers. Intermediate latches ensure that data get propagated from one stage to another in a synchronous manner. There is only one set of data between pipeline stages. The clock cycle time in conventional pipelining is determined by the worst-case operation of the stage with the largest delay plus pipeline synchronization overhead, which can be defined as a sum of the latch propagation delay, latch set-up time, and clock skew between adjacent stages. In conventional synchronous RSFQ pipelines, this synchronization overhead can easily exceed latency from the logic blocks due to the small amount of work per stage, and significant clock skew. Making the situation even more difficult is the unavoidable (temperature-induced) timing uncertainty in RSFQ gate delays.
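The cycle-time relation for conventional pipelining described above can be written down directly. The Python sketch below uses hypothetical function names and illustrative delay values in picoseconds; the numbers are assumptions chosen to show a case where the synchronization overhead exceeds the logic delay, not measurements of any RSFQ design.

```python
def sync_overhead(latch_delay, setup_time, clock_skew):
    """Pipeline synchronization overhead: latch propagation delay + set-up time + clock skew."""
    return latch_delay + setup_time + clock_skew


def conventional_cycle_time(stage_logic_delays, latch_delay, setup_time, clock_skew):
    """Clock period set by the slowest stage plus the synchronization overhead."""
    return max(stage_logic_delays) + sync_overhead(latch_delay, setup_time, clock_skew)


# Illustrative (assumed) values in picoseconds.
stage_delays_ps = [4.0, 6.0, 5.0]
overhead_ps = sync_overhead(latch_delay=3.0, setup_time=2.0, clock_skew=3.0)
cycle_ps = conventional_cycle_time(stage_delays_ps, 3.0, 2.0, 3.0)
overhead_dominates = overhead_ps > max(stage_delays_ps)
```

With so little logic per stage, the overhead term (8 ps here) is larger than the worst-case logic delay (6 ps), which is the situation the paragraph above describes for synchronous RSFQ pipelines.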

Wave pipelining is an approach aimed at achieving high performance in pipelined systems by removing intermediate latches.20,21 This technique increases the clock frequency by allowing multiple data waves to exist in any stage. By removing intermediate registers, the area, power and load associated with the clock are reduced. The clock cycle time in wave pipelined systems is determined by the difference between the maximum and minimum delays through the combinational logic (plus register set-up and hold times plus double clock skew). The challenges of designing wave pipelined systems are twofold:

•   preventing collision of unrelated data waves; and

•   balancing (equalizing) delay paths in order to reduce differences between the longest and shortest delays through the combinational logic.

The differences can accumulate as the waves propagate through a pipeline, creating the potential for data overrun of unrelated data waves. Ultra-high-rate long RSFQ pipelines are especially prone to this problem.
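The wave-pipelining cycle-time bound stated above, and the benefit of balancing delay paths, can be sketched as follows. The function name and the delay values (in picoseconds) are assumptions for illustration only.

```python
def wave_pipelined_cycle_time(d_max, d_min, setup_time, hold_time, clock_skew):
    """Minimum clock period for wave pipelining.

    (d_max - d_min) is the spread between the longest and shortest paths
    through the combinational logic; set-up time, hold time and twice the
    clock skew are added to it.
    """
    if d_min > d_max:
        raise ValueError("d_min must not exceed d_max")
    return (d_max - d_min) + setup_time + hold_time + 2.0 * clock_skew


# Illustrative (assumed) values in picoseconds for the same logic block
# before and after balancing its shortest path.
unbalanced_ps = wave_pipelined_cycle_time(60.0, 40.0, 2.0, 2.0, 3.0)
balanced_ps = wave_pipelined_cycle_time(60.0, 55.0, 2.0, 2.0, 3.0)
```

In this example the longest path is unchanged at 60 ps, yet lengthening the shortest path from 40 ps to 55 ps halves the achievable clock period, because only the delay spread enters the bound.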

These problems can be avoided by using the hybrid wave-pipelining approach, where signals are held so that the next stage does not start operating until all the signals from the previous stage are available. In the hybrid wave-pipelined datapath, the stage with the largest delay difference determines the clock cycle time of the datapath.
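The difference between the two schemes can be illustrated numerically. The minimal Python sketch below compares the delay spread that accumulates across an unsynchronized multi-stage wave pipeline with the per-stage spread that limits a hybrid wave pipeline. Set-up, hold, and skew terms are deliberately left out, and the function names and stage delays are hypothetical.

```python
def _check(stages):
    """Validate a list of (d_min, d_max) stage delay pairs."""
    for d_min, d_max in stages:
        if d_min > d_max:
            raise ValueError("each stage needs d_min <= d_max")


def accumulated_spread(stages):
    """Delay spread after all stages when nothing re-aligns the waves in between."""
    _check(stages)
    return sum(d_max for _, d_max in stages) - sum(d_min for d_min, _ in stages)


def hybrid_limiting_spread(stages):
    """Largest single-stage spread; with re-alignment at every stage this sets the cycle time."""
    _check(stages)
    return max(d_max - d_min for d_min, d_max in stages)


# Illustrative (assumed) per-stage (d_min, d_max) delays in picoseconds.
example_stages = [(10, 12), (8, 11), (9, 10)]
pure_wave_spread = accumulated_spread(example_stages)
hybrid_spread = hybrid_limiting_spread(example_stages)
```

For the three example stages the accumulated spread is 6 ps while the hybrid scheme is limited by the worst single stage at 3 ps, and the gap widens as the pipeline gets longer.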

Many RSFQ designs (e.g. the FLUX-1 processor pipeline) used some form of hybrid wave pipelining known as co-flow synchronization, with the clock traveling with data across the pipeline, thus eliminating the need for complex central clock distribution. The challenge for this co-flow synchronization is the necessity to insert carefully calculated delays into the clock propagation path to honor set-up and hold time requirements for clocked RSFQ gates. The problem becomes more and more severe as the width, length, and complexity of the datapath grow, and timing uncertainty increases.

A new asynchronous hybrid wave-pipelined (AHWP) methodology has been developed at Stony Brook University9 to remove the necessity of centralized clocking as in classic synchronous pipelining and the need to propagate clock signals along the data as in traditional wave pipelining. In the AHWP pipelines, the moment when pipeline stage logic starts processing will be determined by the availability of data, not clock signals. Thus, AHWP pipelines are truly data-driven.

This approach has the evident advantage of eliminating the problem of clock distribution within the processor pipeline. The problem of how to detect the event when all data are available becomes the key design issue for control logic at each stage. This problem can be solved by appropriate coding of data, e.g. dual-rail, and the use of reset signals that “clean” the logic of each pipeline stage immediately after the propagation of data waves through the stage.
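A behavioral sketch may help to show how dual-rail coding makes data availability detectable without a clock. In the Python model below, each bit has a "true" rail and a "false" rail, a pulse on exactly one of them signals that the bit has arrived, and a reset clears the stage for the next wave. This is an illustrative model under those assumptions, not the AHWP control logic itself; the class and method names are hypothetical.

```python
class DualRailStage:
    """Behavioral model of the dual-rail input of one pipeline stage."""

    def __init__(self, width):
        self.width = width
        self.reset()

    def reset(self):
        """Clear both rails of every bit ("clean" the stage after a wave)."""
        self.true_rail = [0] * self.width
        self.false_rail = [0] * self.width

    def pulse(self, bit_index, value):
        """Record the arrival of one bit: a pulse on the rail matching its value."""
        if value:
            self.true_rail[bit_index] = 1
        else:
            self.false_rail[bit_index] = 1

    def all_data_arrived(self):
        """True once every bit has a pulse on exactly one of its two rails."""
        return all(t ^ f for t, f in zip(self.true_rail, self.false_rail))

    def read(self):
        """Return the data word; only legal once the whole wave has arrived."""
        if not self.all_data_arrived():
            raise RuntimeError("data wave is not complete yet")
        return list(self.true_rail)
```

A stage modeled this way would start processing when `all_data_arrived()` becomes true, regardless of the order in which the individual bits arrive, and would call `reset()` once the wave has passed.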

Another typical design problem that needs to be addressed is how and where the data that arrive earlier can wait for other signals to catch up. In current clocked RSFQ gates, input data signals (pulses) can “wait” inside the gates until the clock signal arrives and produces an output pulse, if any. Unfortunately, these gates need a clock signal to produce an output, so they alone are not suited for data-driven (push-through by data) asynchronous computing. It should be mentioned that there are other asynchronous techniques and gates for RSFQ such as delay-insensitive RSFQ logic, which is based on dual-rail signal coding.22 A drawback of delay-insensitive logic is that it requires a lot of junctions for its implementation and is basically not resettable (at least in its current form).

The AHWP design methodology includes the use of a tunable generic VHDL gate library and statistical modeling of the gate delay variations. The technology-tunable parameters of the VHDL gate models will include mean delays, RMS jitter, power consumption, as well as the complexity of each gate in terms of JTLs. The VHDL gate models are designed to simulate normally-distributed thermally-induced gate delay variations for a given value of delay variance.
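The statistical modeling idea can be illustrated with a small Monte Carlo experiment. The sketch below is written in Python rather than VHDL and is not the gate library described above: it simply draws each gate delay from a normal distribution with a given mean and standard deviation and looks at the resulting spread of a path delay. All names and numbers are assumptions for illustration.

```python
import random
import statistics


def sample_gate_delay(mean_delay, sigma, rng):
    """One gate delay drawn from a normal distribution (thermal jitter model).

    For sigma much smaller than the mean, negative samples are negligible,
    so no clamping is applied here.
    """
    return rng.gauss(mean_delay, sigma)


def sample_path_delay(mean_delays, sigma, rng):
    """Delay of a path of gates with independent, normally distributed delays."""
    return sum(sample_gate_delay(m, sigma, rng) for m in mean_delays)


def path_delay_statistics(mean_delays, sigma, trials=20000, seed=1):
    """Return (mean, standard deviation) of the path delay over many trials."""
    rng = random.Random(seed)
    samples = [sample_path_delay(mean_delays, sigma, rng) for _ in range(trials)]
    return statistics.mean(samples), statistics.pstdev(samples)
```

For independent gates the standard deviation of the path delay grows as the square root of the number of gates. For example, four gates with 0.5 ps of jitter each give about 1 ps for the path, which is the kind of figure a designer would compare against the timing margins of a stage.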

A preliminary study of the efficiency of hybrid wave-pipelining for the implementation of several arithmetic and data routing blocks has been carried out.9,10 The results show that hybrid wave-pipelining can tolerate a four-fold increase in RSFQ datapath width from 16 to 64 bits with an insignificant (~4–5%) decrease in processing rate. This technique can be successfully used in regular circuits where balancing of delay paths for data can be done relatively easily.

It should be noted, however, that a full-fledged 50 GHz RSFQ processor would use a combination of several asynchronous and synchronous techniques. Among the key reasons for this are the irregularity of control (e.g. instruction issue) logic and the need to control the injection rate at which data and operations can be safely “pumped” into wave-pipelined processor units.

All of this means that the 32/64-bit RSFQ processor microarchitecture needs to be developed in a way that allows both efficient and reliable coexistence of synchronous and asynchronous components in the processor datapath.

5.   Conclusions

During the last several years, designers were able to identify and demonstrate solutions to several design and fabrication problems related to bias current shielding, 100 GHz passive transmission line design, 25 GHz operational frequencies with relatively wide operational margins for designs with more than 22K JJs, etc. A new 1.0 μm 10 kA/cm² 8 Nb metal layer technology developed in Japan allowed the fabrication of chips with a complexity of up to 10⁵ JJs. Recent design and architectural studies have demonstrated that new synchronization techniques, such as asynchronous wave-pipelining, can be very efficient in dealing with jitter and gate delay fluctuations in future wide datapath RSFQ processors.

These recent developments in fabrication, design, and microarchitecture created an opportunity for an important step towards full-fledged superconductor RSFQ processors. RSFQ designers need to exploit this opportunity and demonstrate sufficient functionality, complexity, reliability, speed, and energy efficiency in practical designs in order for superconductor processors to be seriously considered real competitors to CMOS.

Acknowledgments

This work was supported by the US Department of Defense under an ARO contract.

References

1.   “High performance computing for the national security community,” NSA report (July, 2002), see http://www.nitrd.gov/pubs/nsa/sta.pdf

2.   Joint U.S. Defense Science Board and UK Defense Scientific Advisory Council Task Force on Defense Critical Technologies report (2006), see http://www.acq.osd.mil/dsb/reports/2006-03-Defense_Critical_Technologies.pdf

3.   “Exascale computing study: Technology challenges in achieving exascale systems,” DARPA report (2008), see http://users.ece.gatech.edu/~mrichard/ExascaleComputingStudyReports/exascale_final_report_100208.pdf

4.   “Superconducting technology assessment,” NSA NITRD report (August, 2005); “Superconducting technology assessment (update),” NSA NITRD report (December, 2007).

5.   M. Dorojevets, P. Bunyk, D. Zinoviev, and K. Likharev, “COOL-0: design of an RSFQ subsystem for petaflops computing,” IEEE Trans. Appl. Supercond. 9, 3606 (1999).

6.   M. Dorojevets, “COOL multithreading in HTMT SPELL-1 processors,” chapter in: Y. S. Park, S. Luryi, M. S. Shur, J. M. Xu, and A. Zaslavsky, eds., Frontiers in Electronics: From Materials to Systems, Singapore: World Scientific Publishing, 2000, pp. 247–253.

7.   M. Dorojevets, “An 8-bit FLUX-1 RSFQ microprocessor built in 1.75-μm technology,” Physica C 378-381, 1446 (2002).

8.   M. Dorojevets, D. Strukov, A. Silver, et al., “On the road towards superconductor computers: Twenty years later,” chapter in S. Luryi, J. M. Xu, and A. Zaslavsky, eds., Future Trends in Microelectronics: The Nano, the Giga, and the Ultra, New York: Wiley, 2004, pp. 214–225.

9.   US ARO RDECON ACQ CTR contract W911NF-0-C-0036 awarded to Hypres, 2009.

10.   A. Fujimaki, M. Tanaka, T. Yamada, Y. Yamanashi, H. Park, and N. Yoshikawa, “Bit-serial single flux quantum microprocessor CORE,” IEICE Trans. Electronics E91-C, 342 (2008).

11.   N. Takagi, K. Murakami, A. Fujimaki, N. Yoshikawa, and K. Inoue, “Proposal of a desk-side supercomputer with reconfigurable data-paths using rapid single-flux-quantum circuits,” IEICE Trans. Electronics E91-C, 350 (2008).

12.   M. Dorojevets, C. Ayala, and A. Kasperek, “Development and evaluation of design techniques for high-performance wave-pipelined wide datapath RSFQ processors,” Proc. Intern. Supercond. Electronics Conf. (2009), SP-P46.

13.   M. Dorojevets and C. Ayala, “Logical design and analysis of a 32/64-bit wave-pipelined RSFQ adder,” Proc. Supercond. SFQ VLSI Workshop (2009), pp. 15–16.

14.   K. Likharev and V. Semenov, “RSFQ logic/memory family: A new Josephson junction technology for sub-terahertz clock frequency digital systems,” IEEE Trans. Appl. Supercond. 1, 3 (1991).

15.   P. Bunyk, K. Likharev, and D. Zinoviev, “RSFQ technology: Physics and devices,” Intern. J. High Speed Electronic Syst. 11, 257 (2001).

16.   S. Yorozu, Y. Kameda, H. Terai, A. Fujimaki, T. Yamada, and S. Tahara, “A single flux quantum standard logic cell library,” Physica C 378-381, 1471 (2002).

17.   T. Nishigai, S. Yamada, and N. Yoshikawa, “Design and implementation of low-power SFQ circuits using LR-load biasing technique,” Physica C 445-448, 1029 (2006).

18.   Q. P. Herr and A. Y. Herr, “Reciprocal quantum logic,” Proc. ASC (2008), Chicago, U.S.A.

19.   M. Hidaka, S. Nagasawa, K. Hinode, and T. Satoh, “Improvements in fabrication process for Nb-based single flux quantum circuits in Japan,” IEICE Trans. Electronics E91-C, 318 (2008).

20.   L. W. Cotton, “Maximum-rate pipeline systems,” in Proc. Spring Joint Computer Conf. (1969), pp. 581–586.

21.   W. P. Burleson, M. Ciesielski, F. Klass, and W. Liu, “Wave pipelining: A tutorial and research survey,” IEEE Trans. VLSI Syst. 6, 464 (1998).

22.   P. Patra, S. Polonsky, and D. Fussel, “Delay insensitive logic for RSFQ superconductor computing,” Proc. 3rd Intern. Symp. Advanced Res. Async. Circ. Syst. (1997), pp. 42–51.
