4.6. Improving Application Performance Using TIE

As demonstrated above, TIE extensions can improve the execution speed of an application running on an Xtensa processor by enabling the creation of instructions that each perform the work of multiple general-purpose instructions. Several different techniques can be used to combine multiple general-purpose operations into one instruction. Three common techniques available through TIE are:

  1. Fusion

  2. SIMD/vector transformation

  3. FLIX

To illustrate these three techniques, consider a simple example where profiling indicates that most of an application’s execution time is spent computing the average of two arrays in the following loop:

unsigned short *a, *b, *c;
...
for (i=0; i<n; i++)
        c[i] = (a[i] + b[i]) >> 1;

The piece of C code above adds two short data items together and shifts the sum right by one bit in each loop iteration. Two base Xtensa instructions are required for the computation, not counting the instructions required for loading and storing the data. These two operations can be fused into a single TIE instruction:

operation AVERAGE{out AR res, in AR input0, in AR input1} {} {
     wire [16:0] tmp = input0[15:0] + input1[15:0];
     assign res = tmp[16:1];
}

This fused TIE instruction, named AVERAGE, takes two input values (input0 and input1) from entries in the general-purpose AR register file, computes the output value (res), and then saves the result in another AR register-file entry. The semantics of the instruction, an add feeding a shift, are described using the above TIE code. A C or C++ program uses the new AVERAGE instruction as follows:

#include <xtensa/tie/average.h>
unsigned short *a, *b, *c;
...
for (i=0; i<n; i++)
         c[i] = AVERAGE(a[i], b[i]);

Assembly code can also directly use the AVERAGE instruction. In fact, the entire software tool chain recognizes AVERAGE as a valid instruction for processors built using this TIE extension.
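
For simulation or testing on a host that lacks the TIE extension, the semantics of AVERAGE can be modeled in plain C. The function below is a sketch, not part of the generated tool chain; it mirrors the TIE description's 17-bit intermediate sum and fixed 1-bit shift:

```c
#include <stdint.h>

/* Plain-C model of the fused AVERAGE instruction: a 16-bit add whose
 * 17-bit sum is right-shifted by one bit (selecting bits [16:1]). */
uint16_t average_model(uint16_t input0, uint16_t input1)
{
    uint32_t tmp = (uint32_t)input0 + (uint32_t)input1; /* sum fits in 17 bits */
    return (uint16_t)(tmp >> 1);                        /* fixed 1-bit shift   */
}
```

Note that the intermediate sum must be wider than 16 bits; truncating it to 16 bits before the shift would lose the carry and give a wrong average for large operands.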

Fused instructions are not necessarily as expensive in hardware as the sum of their constituent parts. Often they are significantly cheaper because they operate on restricted data sets. In the example above, the addition operation performed inside the AVERAGE instruction is a 16-bit addition and requires only a 16-bit adder, while the base Xtensa ISA implements 32-bit additions and uses a 32-bit adder. Because the sum in the AVERAGE instruction is always right-shifted by 1 bit, the shift operation doesn’t require a general-purpose shifter—it is essentially free of hardware cost because a fixed 1-bit shift operation can be performed by simply selecting the appropriate bits from the result of the addition operation. Consequently, the AVERAGE instruction described above requires very few gates and easily executes in one cycle.

However, TIE instructions need not execute in a single cycle. The TIE schedule construct allows the creation of TIE instructions with computations that span multiple clock cycles. Such instructions can be fully pipelined (multiple instances of a multi-cycle instruction can be issued back-to-back) or they can be iterative. Fully pipelined instructions achieve higher performance because they can be issued back-to-back, but they may also require extra implementation hardware to store intermediate results in the instruction pipeline.

An iterative instruction spanning multiple clock cycles reuses the same hardware over two or more clock cycles. This design approach saves hardware, but attempts by the running software to issue iterative multi-cycle instructions back-to-back will stall the processor until the earlier instance of the instruction completes its computation.
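
The throughput difference between the two approaches can be illustrated with a simple cycle-count model. This is only a sketch; the 3-cycle latency used in the test values below is hypothetical:

```c
/* Cycle-count sketch contrasting a fully pipelined multi-cycle instruction
 * with an iterative one, for n back-to-back issues of the instruction. */
unsigned pipelined_cycles(unsigned n, unsigned latency)
{
    /* One result per cycle once the pipeline fills. */
    return n ? n + latency - 1 : 0;
}

unsigned iterative_cycles(unsigned n, unsigned latency)
{
    /* Each issue stalls until the previous instance completes. */
    return n * latency;
}
```

For 100 back-to-back issues of a 3-cycle instruction, the pipelined version finishes in 102 cycles while the iterative version needs 300, which is the performance cost paid for the hardware savings.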

Instruction fusion is not the only way to form new instructions for Xtensa processors. Two other techniques are SIMD (single instruction, multiple data) and FLIX (flexible-length instruction extensions), the Xtensa version of VLIW (very long instruction word). SIMD instructions gang together multiple parallel execution units that perform the same operation on multiple operands simultaneously. This sort of instruction is particularly useful for stream processing in applications such as audio and video. Although the multiple operations occur simultaneously, they are not independent of one another: every execution unit performs the same operation. VLIW instructions, by contrast, bundle multiple independent operations into one machine instruction.

In the fused-instruction example shown above, one TIE instruction combines an add and a shift operation, which cuts the number of instruction cycles for the overall operation in half. Other types of instruction combinations can also improve performance. The C program in the above example performs the same computation on a new data instance during each loop iteration. SIMD instructions (also called vector instructions) perform multiple loop iterations simultaneously by performing parallel computations on different data sets during the execution of one instruction.

TIE instructions can combine fusion and SIMD techniques. Consider, for example, a case where a TIE instruction computes four AVERAGE operations in one instruction:

regfile VEC 64 8 v

operation VAVERAGE{out VEC res, in VEC input0, in VEC input1} {} {
     wire [67:0] tmp = {input0[63:48] + input1[63:48],
                        input0[47:32] + input1[47:32],
                        input0[31:16] + input1[31:16],
                        input0[15:0] + input1[15:0]};
     assign res = {tmp[67:52], tmp[50:35], tmp[33:18], tmp[16:1]};
}

Computing four 16-bit averages simultaneously requires that each data vector be 64 bits wide (containing four 16-bit scalar quantities). However, the general-purpose AR register file in the Xtensa processor is only 32 bits wide. Therefore, the first line in the SIMD TIE example above creates a new register file, called VEC, with eight 64-bit register-file entries that hold 64-bit data vectors for the new SIMD instruction. This new instruction, VAVERAGE, takes two 64-bit operands (each containing four 16-bit scalar quantities) from the VEC register file, computes four simultaneous averages, and saves the 64-bit vector result in a VEC register-file entry. To use the instruction in C/C++, simply modify the original example as follows:

#include <xtensa/tie/average.h>
VEC *a, *b, *c;
...
for (i=0; i<n/4; i++)
        c[i] = VAVERAGE(a[i], b[i]);

The C/C++ compiler generated for a processor built with this TIE description automatically recognizes a new 64-bit C/C++ data type called VEC, which corresponds to the 64-bit entries in the new register file. In addition to the VAVERAGE instruction, the Xtensa Processor Generator automatically creates new load and store instructions to move 64-bit vectors between the VEC register file and memory. The XCC compiler uses these instructions to load and store the 64-bit vectors of type VEC.
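
As with AVERAGE, the semantics of VAVERAGE can be modeled in plain C for host-side testing. The sketch below packs four 16-bit lanes into a 64-bit integer, mirroring the lane slicing in the TIE description above (the real instruction operates on the VEC register file):

```c
#include <stdint.h>

/* Plain-C model of VAVERAGE: four independent 16-bit averages packed
 * into one 64-bit vector, one average per 16-bit lane. */
uint64_t vaverage_model(uint64_t input0, uint64_t input1)
{
    uint64_t res = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t a = (uint32_t)(input0 >> (16 * lane)) & 0xFFFFu;
        uint32_t b = (uint32_t)(input1 >> (16 * lane)) & 0xFFFFu;
        /* 17-bit lane sum, then select bits [16:1], as in the TIE code. */
        res |= (uint64_t)((a + b) >> 1) << (16 * lane);
    }
    return res;
}
```

The loop over lanes in this model is exactly what the hardware unrolls into four parallel 16-bit adders.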

Compared to the fused instruction AVERAGE, the SIMD vector-fused instruction VAVERAGE requires significantly more hardware (in the form of four 16-bit adders) because it performs four 16-bit additions in parallel. The four 1-bit shifts do not require any additional gates. The performance improvement gained by combining vectorization and fusion is significantly larger than the performance improvement from fusion alone.

The addition of SIMD instructions to an Xtensa processor nicely dovetails with Tensilica’s XCC C/C++ compiler, which has the ability to unroll and vectorize the inner loops of application programs. The loop acceleration achieved through vectorization is usually on the order of the number of SIMD units within the enhanced instruction. Thus a 2-operation SIMD instruction approximately doubles loop performance and an 8-operation SIMD instruction speeds up loop execution by about 8x.

FLIX instructions are multi-operation instructions like fused and SIMD instructions. They allow a processor to perform multiple, simultaneous, independent operations by encoding the multiple operations into a wide instruction word, in contrast with the dependent multiple operations of fused and SIMD instructions. Each operation in a FLIX instruction is independent of the others and the XCC C/C++ compiler for Xtensa processors can bundle these independent operations into a FLIX-format instruction as needed to accelerate code. While TIE-defined fused and SIMD instructions are 24 bits wide, FLIX instructions are either 32 or 64 bits wide, to provide enough instruction-word bits to fully describe the multiple independent operations.

Xtensa instructions of different sizes (base instructions, single-operation TIE instructions, and multi-operation FLIX instructions) can be freely intermixed. Xtensa processors have no mode settings for instruction size; the size is encoded into the instruction word itself. Xtensa processors will identify, decode, and execute any mix of 16-, 24-, and 32- or 64-bit instructions in the incoming instruction stream. A 32-bit or 64-bit FLIX instruction is divided into slots, with independent operations placed in some or all of the slots. FLIX slots need not be equally sized, which is why the feature is called flexible-length. Any combination of the operations allowed in each slot can occupy a single FLIX instruction word.

Consider again the AVERAGE example. Using base Xtensa instructions, the inner loop contains the ADD and SRAI instructions to perform the actual computation. Two L16I load instructions and one S16I store instruction move the data as needed and three ADDI instructions update the address pointers used by the loads and stores. A 64-bit FLIX instruction format with one slot for the load and store instructions, one slot for the computation instructions, and one slot for address-update instructions can greatly accelerate this code as follows:

format flix3 64 {slot0, slot1, slot2}

slot_opcodes slot0 {L16I, S16I}
slot_opcodes slot1 {ADDI}
slot_opcodes slot2 {ADD, SRAI}

The first declaration creates a 64-bit instruction format named flix3 with three opcode slots. The last three lines list the base-ISA instructions to be made available in each opcode slot of this FLIX configuration. Note that all the instructions specified are predefined core-processor instructions, so their definitions need not be provided in the TIE code; the Xtensa Processor Generator already knows about all base Xtensa instructions.

For this example, the C/C++ program need not be changed at all. The generated C/C++ compiler for a processor built using these FLIX extensions will compile the original source code while exploiting the FLIX extensions automatically. The generated assembly code for this processor implementation would look like this:

loop:
{addi a9,a9,4;     add a12,a10,a8;    l16i a8,a9,0     }
{addi a11,a11,4;   srai  a12,a12,1;   l16i a10,a11,0   }
{addi a13,a13,4;   nop;               s16i a12,a13,0   }

A computation that requires eight cycles per iteration on a base Xtensa processor now requires just three cycles per iteration, nearly a 3× performance increase. It took only four lines of relatively simple TIE code to specify a FLIX configuration with three instruction slots built from existing Xtensa instructions.
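
The cycle arithmetic behind that claim can be checked directly; the counts below are taken from the instruction breakdown given above for the scalar loop:

```c
/* Per-iteration cost of the AVERAGE loop on the base processor
 * versus the three-slot FLIX version shown in the assembly above. */
unsigned base_cycles(void)
{
    return 2   /* L16I loads        */
         + 1   /* S16I store        */
         + 3   /* ADDI address updates */
         + 1   /* ADD               */
         + 1;  /* SRAI              */
}

unsigned flix_cycles(void)
{
    return 3;  /* three 64-bit FLIX bundles per iteration */
}
```

Eight single-cycle base instructions collapse into three bundles, a speedup of 8/3 ≈ 2.67× per loop iteration.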

Instruction fusion, SIMD, and FLIX techniques can be combined to further reduce cycle count. Tensilica’s XPRES compiler uses all three techniques to optimize processors after analyzing C and C++ code for optimization opportunities.
