15.5. Systolic Processing

Some of the earliest work in highly parallel systems was done by HT Kung and CE Leiserson at Carnegie Mellon University (CMU). By 1979, Kung had become an advocate for a parallel-processing style he dubbed “systolic processing.” Kung and Leiserson were attempting to develop a general methodology for mapping high-level algorithmic computations onto processor arrays. They envisioned the processor array as resembling an automotive assembly line, with processors in the array working on the same data at different times and the entire array working on many data blocks simultaneously. A key element of the systolic-processing concept was the idea that data, in the form of partial results, could flow among the processors in a systolic array at multiple speeds and in multiple directions.

The systolic-processing array was modular, reconfigurable, and scalable. Kung and Leiserson suggested several types of systolic-processing arrays based on two fundamental types of processors, the rectangular and hexagonal inner-product-step processors. Figure 15.8 shows the inputs and outputs of the inner-product-step processors.

Figure 15.8. Inner-product-step processors for systolic processing.


Kung and Leiserson named these processors “inner-product-step processors” because they envisioned the processors executing portions of algorithms at the complexity level of an inner-product computation for matrix arithmetic. That was the complexity level achievable with VLSI semiconductor lithography in 1979. Nanometer silicon allows much larger, more capable, more general-purpose processors to be used today.
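The behavior of a single inner-product-step processor can be sketched in a few lines of Python. The sketch assumes the usual formulation of the step, in which the cell accepts three inputs, adds the product of two of them to the third, and passes the other two through unchanged. The class and method names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class InnerProductStepCell:
    """Hypothetical model of one inner-product-step processor."""

    a: float = 0.0  # last value latched on the a input
    b: float = 0.0  # last value latched on the b input
    c: float = 0.0  # last accumulated partial result

    def step(self, a_in: float, b_in: float, c_in: float):
        """Perform one step: c_out = c_in + a_in * b_in; a and b pass through."""
        self.a, self.b = a_in, b_in
        self.c = c_in + a_in * b_in
        return self.a, self.b, self.c


cell = InnerProductStepCell()
a_out, b_out, c_out = cell.step(2.0, 3.0, 1.0)
print(a_out, b_out, c_out)  # a and b are forwarded; c carries the new partial sum
```

Because every output is available to a neighboring cell on the next step, the same small operation can be chained in any direction the array's wiring allows.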

With their expanded I/O abilities, the rectangular and hexagonal inner-product processors could be connected in different types of arrays including linear, orthogonal, and hexagonal arrays (shown in Figure 15.9). Note that the I/O configurations on these processors do not resemble the bus structures of conventional microprocessor cores. However, they closely resemble the configurations made possible by multiple TIE (Tensilica Instruction Extension) queue interfaces available on some Xtensa and Diamond processor cores.

Figure 15.9. Linear, orthogonal, and hexagonal systolic-processing arrays.
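One way to see how a linear array keeps every processor busy on a different data block is to simulate it in software. The following Python sketch computes a matrix-vector product on a simulated linear array. It is an illustrative mapping chosen for simplicity, not necessarily one of the designs Kung and Leiserson proposed: each cell is assumed to hold one element of the vector x, and the partial sum for each matrix row moves one cell to the right on every clock tick. The function name and data layout are hypothetical.

```python
def systolic_matvec(A, x):
    """Simulate y = A x on a linear array with one cell per element of x.

    Cell j holds x[j]. The partial sum for row i enters cell 0 at tick i
    and reaches cell j at tick i + j, where A[i][j] * x[j] is added to it.
    """
    m, n = len(A), len(x)
    pipe = [None] * n  # pipe[j]: (row index, partial sum) at the output of cell j
    y = [0.0] * m

    for tick in range(m + n - 1):
        new_pipe = [None] * n
        for j in range(n):
            if j == 0:
                # A fresh partial sum (zero) enters the array while rows remain.
                incoming = (tick, 0.0) if tick < m else None
            else:
                # Otherwise take what the left neighbor produced last tick.
                incoming = pipe[j - 1]
            if incoming is not None:
                i, partial = incoming
                new_pipe[j] = (i, partial + A[i][j] * x[j])  # inner-product step
        pipe = new_pipe
        if pipe[n - 1] is not None:
            i, total = pipe[n - 1]
            y[i] = total  # a finished result leaves the last cell
    return y


print(systolic_matvec([[1, 2], [3, 4]], [5, 6]))
```

In this sketch, m rows take m + n - 1 ticks on n cells, and once the pipeline fills, every cell is working on a different row during the same tick, which is the assembly-line behavior described earlier.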


Kung developed a systolic-processing architecture called Warp at CMU. The CMU Warp processor used a linear array of floating-point computational cells controlled by a host computer through an interface unit, as shown in Figure 15.10.

Figure 15.10. The CMU Warp processor developed by HT Kung operated a linear array of floating-point processing cells controlled by a host computer through an interface unit.


A 2-cell Warp prototype became operational in 1985, and contractor General Electric delivered a 10-cell pre-production unit to CMU in 1986. One of Kung’s graduate students, Monica Lam, developed a parallelizing compiler for the Warp architecture. (Lam eventually became a professor in the Computer Science department at Stanford University. Starting in 1998, she took a sabbatical from Stanford to help found Tensilica, where she worked on, and supervised work on, the HLL (high-level language) compiler and other software tools for the earliest versions of the configurable Xtensa processor.)

The CMU Warp machine was successfully used in several applications including:

  • Stereo vision

  • Magnetic resonance imaging (MRI)

  • Synthetic aperture radar (SAR)

  • Radar signal processing

  • Air pollution monitoring

  • Ground motion monitoring.

Intel engaged with CMU to develop a more advanced, integrated version of the Warp architecture dubbed iWarp. The project was cosponsored by DARPA (the US Defense Advanced Research Projects Agency) and the resulting chip, first manufactured in late 1989, implemented a 20-MHz, 32-bit-RISC/96-bit-VLIW (very long instruction word) iWarp cell processor using 650,000 transistors in 1-μm CMOS process technology. Each iWarp cell included eight 40-Mbyte/sec buses for inter-processor communications and a 64-bit, 160-Mbyte/sec interface to the processor’s local memory. These iWarp cell processors were packaged into a system using a 2D torus-mesh architecture, as shown in Figure 15.11. The first 64-cell iWarp systems (initially priced at $494,950) became operational in 1990 and the architecture was designed to support systems containing as many as 1024 iWarp cells. Intel marketed iWarp systems through its Supercomputing System Division, which was formed in 1992, but the company terminated marketing for the product by 1995. Intel eventually stopped building iWarp systems entirely.

Figure 15.11. The Intel/CMU iWarp architecture combined iWarp RISC/VLIW processors into a 2D torus mesh.
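The wrap-around connectivity of a 2D torus mesh, and the bandwidth figures quoted for each iWarp cell, can be checked with a short Python sketch. The 8 × 8 arrangement of a 64-cell system is an assumption made for illustration, as are the function and constant names. The observation that the memory-interface figure matches one 64-bit transfer per 20-MHz clock is an inference from the quoted numbers, not a documented design detail.

```python
def torus_neighbors(row, col, rows, cols):
    """Return the four neighbors of (row, col) in a rows x cols 2D torus.

    Modulo arithmetic wraps the edges, so every cell has four neighbors.
    """
    return {
        "north": ((row - 1) % rows, col),
        "south": ((row + 1) % rows, col),
        "west": (row, (col - 1) % cols),
        "east": (row, (col + 1) % cols),
    }


# Figures quoted in the text for one iWarp cell.
BUS_COUNT = 8
BUS_MBYTES_PER_SEC = 40
CLOCK_MHZ = 20
MEMORY_BUS_BITS = 64

# Aggregate inter-processor bandwidth if all eight buses run at full rate.
aggregate_link_mbytes = BUS_COUNT * BUS_MBYTES_PER_SEC

# Bandwidth of one 64-bit (8-byte) transfer per clock cycle (an inference).
memory_mbytes = (MEMORY_BUS_BITS // 8) * CLOCK_MHZ

# Assumed 8 x 8 layout for a 64-cell system: a corner cell wraps to the far edges.
print(torus_neighbors(0, 0, 8, 8))
print(aggregate_link_mbytes, memory_mbytes)
```

The corner cell's north and west neighbors fall on the opposite edges of the array, which is what distinguishes a torus from a plain mesh.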


Interest in systolic processing seems to have waned with the advent of cluster computing. However, nanometer silicon can easily support large arrays of general-purpose and application-tailored processor cores, so it may well be time to resurrect the idea and see where a new SOC-based renaissance in systolic computation can go.
