4.9. TIE Queue Interfaces

The highest-bandwidth mechanism for task-to-task communication is hardware implementation of data queues. One data queue can sustain data rates as high as one transfer every cycle or more than 10 Gbytes per second for wide operands (tens of bytes per operand at a clock rate of hundreds of MHz) because queue widths need not be tied to a processor’s bus width or general-register width. The handshake between producer and consumer is implicit in the queue interfaces.
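The bandwidth figure follows from simple arithmetic, assuming one transfer per cycle. The specific operand width and clock rate below are illustrative choices consistent with "tens of bytes" and "hundreds of MHz," not figures taken from the text.

```python
# Back-of-the-envelope queue bandwidth at one transfer per cycle.
# Operand width and clock rate are illustrative assumptions.
operand_bytes = 32                  # "tens of bytes per operand"
clock_hz = 350e6                    # "hundreds of MHz"
bandwidth_bytes_per_s = operand_bytes * clock_hz
print(bandwidth_bytes_per_s / 1e9)  # 11.2 GB/s, comfortably above 10 GB/s
```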

Figure 4.5. Processors linked by a FIFO queue.


When the data producer has created data, it pushes that data into the tail of the queue, assuming the queue is not full. If the queue is full, the producer stalls. When the data consumer is ready for new data, it pops it from the head of the queue, assuming the queue is not empty. If the queue is empty, the consumer stalls. This self-regulating data-transfer mechanism ensures maximum I/O bandwidth between data producer and consumer.
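The self-regulating behavior described above can be modeled in software with a bounded queue: a blocking put stalls the producer when the queue is full, and a blocking get stalls the consumer when it is empty. The 4-entry depth and 100-item workload below are arbitrary choices for the sketch.

```python
import queue
import threading

# Software model of the blocking handshake: put() stalls the producer on a
# full queue, get() stalls the consumer on an empty one. Depth and item
# count are arbitrary illustrative choices.
q = queue.Queue(maxsize=4)          # a shallow 4-entry FIFO
received = []

def producer():
    for i in range(100):
        q.put(i)                    # stalls here whenever the FIFO is full

def consumer():
    for _ in range(100):
        received.append(q.get())    # stalls here whenever the FIFO is empty

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(received == list(range(100))) # True: all data arrives, in order
```

Neither side needs explicit flow-control code; the blocking operations enforce the handshake, just as the hardware full/empty signals do.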

TIE allows direct implementation of queue interfaces as source and destination operands for instructions. An instruction can specify a queue as one of the destinations for result values or use an incoming queue value as one source. This form of queue interface, shown in Figure 4.5, allows a new data value to be created or used each cycle on each queue interface. Hardware-handshake lines (FIFO push, FIFO full, FIFO pop, and FIFO empty) ensure that data flows at the maximum allowable rate with no software intervention. If the producer or consumer is not ready, the chain stalls. However, the overall process of data production, data transfer, and data consumption always proceeds at the chain’s maximum rate.

TIE queues can be configured to provide non-blocking push and pop operations, where the producer can explicitly check for a full queue before attempting a push and the consumer can explicitly check for an empty queue before attempting a pop. This mechanism allows the producer or consumer task to move on to other work instead of stalling.
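The non-blocking discipline looks like this in a software sketch: test the full/empty status first, and defer the work rather than stall. The 2-entry depth and the deferred-work list are illustrative.

```python
import queue

# Model of non-blocking push/pop: the producer tests for "full" and the
# consumer tests for "empty" instead of stalling. Names are illustrative.
q = queue.Queue(maxsize=2)
deferred = []                       # work the producer postponed

for item in range(5):
    if q.full():
        deferred.append(item)       # queue full: do other work, don't stall
    else:
        q.put_nowait(item)          # guaranteed not to block after the check

drained = []
while not q.empty():
    drained.append(q.get_nowait())  # guaranteed not to block after the check

print(drained, deferred)            # [0, 1] went through; [2, 3, 4] deferred
```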

A TIE processor extension can perform multiple queue operations per cycle, perhaps combining inputs from two input queues with local data and sending result values to one or more output queues. The high aggregate bandwidth and low control overhead of queues allows application-specific processors to be used for applications with very high data rates, where processors with conventional bus or memory interfaces are not appropriate because they cannot sustain the required data rates.
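Such an instruction can be pictured as follows: each "cycle" it pops one operand from each of two input queues, combines them with local processor state, and pushes the result to an output queue. The multiply-accumulate operation is an invented example, not an operation from the text.

```python
from collections import deque

# Sketch of a TIE-style instruction that pops one operand from each of two
# input queues, combines them with local state, and pushes the result to an
# output queue each "cycle". The MAC-like operation is an assumption.
in_a = deque([1, 2, 3])
in_b = deque([10, 20, 30])
out_q = deque()
local_acc = 0                       # local state held in a processor register

while in_a and in_b:
    a, b = in_a.popleft(), in_b.popleft()   # two queue pops per instruction
    local_acc += a * b                      # combine with local data
    out_q.append(local_acc)                 # one queue push per instruction

print(list(out_q))                  # [10, 50, 140]
```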

Queues decouple the performance of one task from another. If the rates of data production and data consumption are quite uniform, the queue can be shallow. If either the production or consumption rate is highly variable or bursty, a deep queue can mask the data-rate mismatch and ensure throughput at the average rate of producer and consumer, rather than at the minimum rate of the producer or the minimum rate of the consumer. Sizing the queues is an important design optimization driven by good system-level simulation. If the queue is too shallow, the processor at one end of the communication channel may stall when the other processor slows for some reason. If the queue is too deep, the silicon cost will be excessive.
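A toy version of the sizing experiment the text recommends: drive a FIFO occupancy model with a bursty producer and a steady consumer, and record the peak occupancy, which suggests the minimum depth that avoids stalls. The burst pattern below is an invented example, not data from the text.

```python
# Toy sizing experiment: a bursty producer and a steady consumer share a
# FIFO; the peak occupancy suggests the minimum depth needed to avoid
# stalls. The burst pattern is an invented example.
occupancy = 0
peak = 0
for cycle in range(32):
    produced = 4 if cycle % 8 == 0 else 0   # burst of 4 items every 8 cycles
    consumed = 1 if occupancy > 0 else 0    # steady drain of 1 item per cycle
    occupancy += produced - consumed
    peak = max(peak, occupancy)

print(peak)   # 4: a 4-entry queue absorbs this burst without stalling
```

A real sizing study would replace this loop with traffic from system-level simulation, but the principle is the same: the required depth tracks the worst-case burst, not the average rate.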

Queue interfaces are added to an Xtensa processor through the following TIE syntax:

queue <queue-name> <width> in|out

The name of the queue, its width, and its direction are defined with the above TIE syntax. An Xtensa processor can have more than 300 queue interfaces, and each queue interface can be as wide as 1024 bits. These limits are set well beyond the routing limits of current silicon technology so that the processor core's architecture is not the limiting factor in the design of a system. The designer can set the real limit based on system requirements and EDA flow. Using queues, designers can trade off fast, narrow processor interfaces against slower, wider interfaces to achieve bandwidth and performance goals.
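The width-versus-rate trade-off is a simple product: bandwidth equals queue width times transfer rate. Both operating points below are hypothetical, chosen only to show two designs meeting the same assumed bandwidth goal.

```python
# The same bandwidth can be met with a narrow, fast interface or a wide,
# slow one; both operating points below are hypothetical.
target_Bps = 8e9                        # assumed bandwidth goal: 8 GB/s

narrow_bytes, narrow_hz = 16, 500e6     # 128-bit queue at 500 MHz
wide_bytes, wide_hz = 64, 125e6         # 512-bit queue at 125 MHz

print(narrow_bytes * narrow_hz == target_Bps)   # True
print(wide_bytes * wide_hz == target_Bps)       # True
```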

Figure 4.6 shows how TIE queue interfaces are easily connected to simple DesignWare FIFOs. TIE queue push and pop requests are gated by the FIFO empty and full status signals to comply with the DesignWare FIFO specification.
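The gating amounts to two boolean equations: a push request is honored only when the FIFO is not full, and a pop request only when it is not empty. The signal names below follow the handshake lines mentioned earlier but are otherwise illustrative.

```python
# The gating described above as boolean equations; signal names are
# illustrative, following the handshake lines mentioned earlier.
def gated_signals(push_req, fifo_full, pop_req, fifo_empty):
    push_en = push_req and not fifo_full    # never push into a full FIFO
    pop_en = pop_req and not fifo_empty     # never pop from an empty FIFO
    return push_en, pop_en

print(gated_signals(True, True, True, False))   # (False, True)
```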

Figure 4.6. DesignWare synchronous FIFO used with TIE queues.


TIE queue interfaces serve directly as input and output operands of TIE instructions, just like a register operand, a register-file entry, or a memory interface. The Xtensa processor includes 2-entry buffering for every TIE queue interface defined. The area consumed by a queue's 2-entry buffer is substantially smaller than that of a load/store unit, which can have large combinational blocks for alignment, rotation, and sign extension of data, as well as cache-line buffers, write buffers, and complicated state machines. The processor area consumed by TIE queue interface ports is relatively small and is under the designer's direct control.

The FIFO buffering incorporated into the processor for TIE queues serves three distinct purposes. First, the buffering provides a registered and synchronous interface to the external FIFO. Second, for output queues, the buffer provides two FIFO entries that buffer the processor from stalls that occur when the attached external FIFO is full. Third, the buffering is necessary to hide the processor's speculative execution from the external FIFO.
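The third purpose can be pictured as a small commit buffer: a speculative push parks its value inside the processor and reaches the external FIFO only when the instruction commits, while a squashed instruction's push is flushed before the outside world ever sees it. This is a deliberately simplified model, not the actual Xtensa microarchitecture.

```python
# Simplified model of the 2-entry output buffer hiding speculation: pushes
# land in an internal buffer and are forwarded to the external FIFO only on
# commit; a flush discards them. Not the actual Xtensa microarchitecture.
internal = []           # up to 2 speculative entries inside the processor
external_fifo = []      # what the outside world sees

def speculative_push(value):
    assert len(internal) < 2, "buffer full: processor would stall"
    internal.append(value)

def commit():
    external_fifo.extend(internal)  # speculation resolved: drain to FIFO
    internal.clear()

def flush():
    internal.clear()                # mis-speculated pushes vanish silently

speculative_push(7)
flush()                 # instruction was squashed: 7 never leaves the core
speculative_push(8)
commit()
print(external_fifo)    # [8]
```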
