Boost Throughput with Multiple Operations per Cycle

14.6. Boost Throughput with Multiple Operations per Cycle

The Xtensa LX processor is not limited to performing one operation per cycle. TIE provides two ways to perform two or more operations at a time. SOC designers can create single-cycle TIE instructions that draw operands from several sources, perform multiple operations on these operands concurrently, and then output the results to several destinations. Alternatively, SOC designers can use the Xtensa LX processor’s FLIX (flexible length instruction extension) technology to create wide-word instructions that perform multiple independent operations during the same instruction cycle. The Diamond 570T CPU, Diamond 330HiFi DSP, and Diamond 545CK DSP all incorporate FLIX-format instructions, which were previously described in the appropriate chapters. In some applications, both methods (compound-operation instructions, and FLIX instructions) will work equally well. In other applications, the flexibility of FLIX instructions will make function-block development much easier.

The code below is a short but complete example of a very simple long-instruction-word processor described in TIE with FLIX technology. It relies entirely on built-in definitions of 32-bit integer operations and defines no new operations. It creates a processor with a high degree of potential parallelism, even for applications written purely in terms of standard C integer operations and data types. The first of three slots supports all the commonly used integer operations, including ALU operations, loads, stores, jumps, and branches. The second slot offers loads and stores, plus the most common ALU operations. The third slot offers a full complement of ALU operations, but no loads or stores.

1: length ml64 64 {InstBuf[3:0] == 15}
2: format format1 ml64 {base_slot, ldst_slot, alu_slot}
3: slot_opcodes base_slot {ADD.N, ADDX2, ADDX4, SUB, SUBX2, SUBX4,
    ADDI.N, AND, OR, XOR, BEQZ.N, BNEZ.N, BGEZ, BEQI, BNEI, BGEI, BNEI,
    BLTI, BEQ, BNE, BGE, BLT, BGEU, BLTU, L32I.N, L32R, L16UI, L16SI,
    L8UI, S32I.N, S16I, S8I, SLLI, SRLI, SRAI, J, JX, MOVI.N}
4: slot_opcodes ldst_slot {ADD.N, SUB, ADDI.N, L32I.N, L32R, L16UI,
    L16SI, L8UI, S32I.N, S16I, S8I, MOVI.N}
5: slot_opcodes alu_slot {ADD.N, ADDX2, ADDX4, SUB, SUBX2, SUBX4, ADDI.N,
    AND, OR, XOR, SLLI, SRLI, SRAI, MOVI.N}

The first line of this example code declares a new instruction length (64 bits) and specifies the encoding of the first 4 bits of the instruction, which determine the length. The second line declares a format for that instruction length, format1, and names its three slots: base_slot, ldst_slot, and alu_slot. The third line lists all the TIE instructions that can be packed into the first of those slots, base_slot. In this case, all the instructions happen to be pre-defined Xtensa LX instructions, but new instructions could also be included in this slot. The processor generator also creates a NOP (no operation) for each slot, so the software tools can always create complete instruction bundles, even when no other operations for that slot are available for packing into a long instruction. Lines 4 and 5 designate the subset of instructions that can go into the other two slots. In fact, the code in Figure 14.4 goes a long way towards creating a Diamond 570T CPU.

Figure 14.4. Application tailoring boosts performance and energy efficiency.
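
The slot rules described above can be mimicked in a few lines of Python. This is only an illustrative model, not TIE and not the way the Xtensa tools schedule code: the function names `is_ml64` and `pack_bundle` are hypothetical, data dependences between operations are ignored, and the opcode sets are simply copied from the listing (BNEI is printed twice there, so the set keeps one copy).

```python
from itertools import permutations

NOP = "NOP"  # the generator supplies a NOP for every slot

# Opcode sets copied from the slot_opcodes lines of the listing above.
BASE_SLOT = {
    "ADD.N", "ADDX2", "ADDX4", "SUB", "SUBX2", "SUBX4", "ADDI.N", "AND", "OR",
    "XOR", "BEQZ.N", "BNEZ.N", "BGEZ", "BEQI", "BNEI", "BGEI", "BLTI", "BEQ",
    "BNE", "BGE", "BLT", "BGEU", "BLTU", "L32I.N", "L32R", "L16UI", "L16SI",
    "L8UI", "S32I.N", "S16I", "S8I", "SLLI", "SRLI", "SRAI", "J", "JX", "MOVI.N",
}
LDST_SLOT = {"ADD.N", "SUB", "ADDI.N", "L32I.N", "L32R", "L16UI", "L16SI",
             "L8UI", "S32I.N", "S16I", "S8I", "MOVI.N"}
ALU_SLOT = {"ADD.N", "ADDX2", "ADDX4", "SUB", "SUBX2", "SUBX4", "ADDI.N",
            "AND", "OR", "XOR", "SLLI", "SRLI", "SRAI", "MOVI.N"}
SLOTS = (BASE_SLOT, LDST_SLOT, ALU_SLOT)  # order: base_slot, ldst_slot, alu_slot


def is_ml64(inst_buf: int) -> bool:
    """Mirror the length test InstBuf[3:0] == 15 on an integer instruction word."""
    return (inst_buf & 0xF) == 15


def pack_bundle(ops):
    """Return a (base, ldst, alu) tuple holding ops, padded with NOP.

    Tries every assignment of operations to distinct slots and returns the
    first one in which each slot accepts its operation.
    """
    if len(ops) > len(SLOTS):
        raise ValueError("more operations than slots")
    for chosen in permutations(range(len(SLOTS)), len(ops)):
        if all(op in SLOTS[i] for op, i in zip(ops, chosen)):
            bundle = [NOP] * len(SLOTS)
            for op, i in zip(ops, chosen):
                bundle[i] = op
            return tuple(bundle)
    raise ValueError(f"no legal slot assignment for {ops}")
```

In this model, `pack_bundle(["ADD.N", "J"])` returns `("J", "ADD.N", "NOP")`: the jump is legal only in base_slot, so the add moves to ldst_slot and alu_slot is padded with NOP. A request such as `["J", "JX"]` raises `ValueError` because both operations need base_slot.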

The following block of example code defines a long-instruction-word architecture with a mix of built-in 32-bit operations and new 128-bit operations. It defines one 64-bit instruction format with three sub-instruction slots (base_slot, ldst_slot, and alu_slot). The description takes advantage of the Xtensa processor’s pre-defined RISC instructions, but also defines a large new register file and three new ALU operations on the new register file:

1: length ml64 64 {InstBuf[3:0] == 15}
2: format format1 ml64 {base_slot, ldst_slot, alu_slot}
3: slot_opcodes base_slot {ADD.N, ADDX2, ADDX4, SUB, SUBX2, SUBX4,
     ADDI.N, AND, OR, XOR, BEQZ.N, BNEZ.N, BGEZ, BEQI, BNEI, BGEI, BNEI,
     BLTI, BEQ, BNE, BGE, BLT, BGEU, BLTU, L32I.N, L32R, L16UI, L16SI,
     L8UI, S32I.N, S16I, S8I, SLLI, SRLI, SRAI, J, JX, MOVI.N}
4: regfile x 128 x 32
5: slot_opcodes ldst_slot {loadx, storex} /* slot does 128b load/store */
6: immediate_range sim8 -128 127 1 /* 8-bit signed offset field */
7: operation loadx {in x *a, in sim8 off, out x d} {out VAddr, in
     MemDataIn128} {
8: assign VAddr = a + off; assign d = MemDataIn128;}
9: operation storex {in x *a, in sim8 off, in x s} {out VAddr, out
     MemDataOut128} {
10: assign VAddr = a + off; assign MemDataOut128 = s;}
11: slot_opcodes alu_slot {addx, andx, orx} /* three new ALU operations on
     x regs */
12: operation addx {in x a, in x b, out x c} {} {assign c = a + b;}
13: operation andx {in x a, in x b, out x c} {} {assign c = a & b;}
14: operation orx {in x a, in x b, out x c} {} {assign c = a | b;}

The first three lines of this code block are identical to those of the previous example. The fourth line declares a new register file 128 bits wide and 32 entries deep. The fifth line lists the two load and store instructions for the new wide register file, which can be found in the second slot of the long instruction word. The sixth line defines a new immediate range, an 8-bit signed value, to be used as the offset range for the new 128-bit load and store instructions.

Lines 7–10 fully define the new load and store instructions in terms of the basic interface signals VAddr (the address used to access local data memory), MemDataIn128 (the data being returned from local data memory), and MemDataOut128 (the data to be sent to the local data memory). The use of 128-bit memory data signals also guarantees that the local data memory will be at least 128 bits wide. Line 11 lists the three new ALU operations that can be put in the third slot of the long instruction word. Lines 12–14 fully define those operations on the 128-bit-wide register file: add, bit-wise AND, and bit-wise OR.
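
The behavior described for lines 4 through 14 can be approximated with a small software model. The sketch below is not TIE and makes several assumptions that the text does not state: the class name `WideRegModel` is hypothetical, the base address is passed as a plain integer, memory is little-endian, alignment is not checked, and addx wraps modulo 2^128 because its result is a 128-bit register.

```python
MASK128 = (1 << 128) - 1  # keeps results within 128 bits


class WideRegModel:
    """Toy model of the 32-entry, 128-bit register file and its five operations."""

    def __init__(self, mem_bytes=256):
        self.x = [0] * 32                # register file: 32 entries, 128 bits each
        self.mem = bytearray(mem_bytes)  # stand-in for local data memory

    def _vaddr(self, addr, off):
        # sim8 is described as an 8-bit signed immediate with a step of 1.
        if not -128 <= off <= 127:
            raise ValueError("offset outside the sim8 range")
        vaddr = addr + off               # assign VAddr = a + off
        if vaddr < 0 or vaddr + 16 > len(self.mem):
            raise IndexError("access outside modeled memory")
        return vaddr

    def loadx(self, d, addr, off):
        vaddr = self._vaddr(addr, off)
        # assign d = MemDataIn128 (byte order is an assumption of this model)
        self.x[d] = int.from_bytes(self.mem[vaddr:vaddr + 16], "little")

    def storex(self, s, addr, off):
        vaddr = self._vaddr(addr, off)
        # assign MemDataOut128 = s
        self.mem[vaddr:vaddr + 16] = self.x[s].to_bytes(16, "little")

    def addx(self, c, a, b):
        self.x[c] = (self.x[a] + self.x[b]) & MASK128

    def andx(self, c, a, b):
        self.x[c] = self.x[a] & self.x[b]

    def orx(self, c, a, b):
        self.x[c] = self.x[a] | self.x[b]
```

A store followed by a load of the same address round-trips a 128-bit value through the modeled memory, which is a quick way to check that the offset arithmetic matches the `VAddr = a + off` assignments in the listing.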

With this example, any combination of the 39 instructions (including NOP) in the first slot, three instructions in the second slot (loadx, storex, and NOP), and four instructions in the third slot (addx, andx, orx, and NOP) can be combined to form legal instructions, for a total of 39 × 3 × 4 = 468 combinations. This example shows the potential to independently specify operations to enable instruction-level parallelism. Moreover, all of the techniques for improving the performance of individual instructions, especially fusion and SIMD, are readily applied to the operations encoded in each sub-instruction.
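
The count of legal bundles is simply the product of the per-slot choices, which a short enumeration can confirm. In this sketch the first-slot entries are placeholder names (`base_op0` and so on) standing in for the 38 listed opcodes; only the counts come from the text.

```python
from itertools import product

# Per-slot choices as counted in the text; each slot includes its NOP.
base_slot = [f"base_op{i}" for i in range(38)] + ["NOP"]  # 39 choices
ldst_slot = ["loadx", "storex", "NOP"]                    # 3 choices
alu_slot = ["addx", "andx", "orx", "NOP"]                 # 4 choices

# Every (base, ldst, alu) triple is one legal 64-bit bundle in this format.
bundles = list(product(base_slot, ldst_slot, alu_slot))
total = len(bundles)
```

Here `total` is 468, matching 39 × 3 × 4, and the all-NOP bundle is one of the enumerated triples.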
