Chapter 15. SPU Assembly Language

This topic is not for the faint of heart. Thus far, the book has explained Cell programming in terms of C functions with friendly names like spu_add and mfc_get. The applications operate on variables with straightforward datatypes such as vector signed short, and if you don’t know the result of an operation, you can find out by calling printf.

None of these luxuries are available when you code with assembly. The code in a *.s file consists of instructions like ai, il, hbrz, and stqx. They don’t operate on typed variables, but instead manipulate bits, bytes, halfwords, words, doublewords, and quadwords in the SPU’s general-purpose registers. This means printf won’t suffice; if you want to see what an instruction accomplished, you have to examine the SPU’s internal state.

Coding in assembly rather than a high-level language is like driving a car with a manual transmission rather than an automatic transmission. It’s not simple or convenient. You have to manage the low-level details on your own, which means performing memory access without the benefit of pointers. You also have to monitor register usage and stack operation without assistance.

But what assembly coding lacks in convenience, it makes up for in performance. If an SPU executable has been coded (competently) in assembly, it will run faster and occupy less space than an executable coded in C. What’s more, the SPU’s strengths and weaknesses become evident as you learn its basic instructions, and if you restructure your algorithm accordingly, your application will reach the loftiest of programming goals: optimal performance.

This chapter presents assembly coding in three parts. The first part explains SPU assembly language at a high level and shows how to write, compile, and debug an assembly coded application. The second and largest part presents the SPU assembly instructions, divided into categories according to their function. The final part focuses on two important topics in SPU assembly coding: interfacing assembly with C/C++ functions and interacting with the SPU’s dual-pipelined architecture.

Why Learn SPU Assembly?

Let me say this first and foremost: I don’t recommend building applications completely in assembly. There are too many possible pitfalls. For example

  • If you don’t position your loads and stores properly, the SPU Control Unit (SCN) won’t be able to access instructions and your application will stall.

  • If you don’t provide the right branch hints, the SPU won’t be prepared when a branch occurs and you’ll lose cycles while the SPU waits for new instructions.

  • If your application needs all 128 registers, it takes significant effort to keep track of where all your variables are stored.

For these reasons, it’s a good idea to take advantage of the compiler and code your applications in a high-level language.

Still, I strongly recommend that you learn SPU assembly. This is for one reason: to improve upon the compiler. spu-gcc does a capable job, but it’s far from perfect. Despite Sony’s GPL-licensed toolchain, IBM is still marketing its proprietary XLC suite of Cell compilation tools. Why? Because IBM knows the weaknesses of spu-gcc well enough to convince professionals that developing quality applications requires proprietary tools. If you’ve mastered assembly, however, you can fine-tune your applications to the point where they run faster and more efficiently than those generated by Sony’s spu-gcc or IBM’s spuxlc.

Another reason to learn assembly is timing. You can never be certain of how much time a C/C++ function takes to execute, but each assembly instruction takes a precise number of cycles. Appendix D, “SPU Instruction Set Reference,” lists all the SPU’s instructions and their cycle counts. If you know which instructions take the most/fewest cycles, you’ll better understand why your application takes as much time as it does. You’ll also be better able to interface the Cell with external devices.

Specific Intrinsics and Assembly-Coded Applications

Assembly coding can make even the most experienced programmers nervous, so you may prefer using specific intrinsics as a bridge between C and assembly. These C functions map directly to SPU assembly instructions, and their names are similar to their low-level counterparts.

The second part of this section explains how to create applications coded in SPU assembly. It shows how assembly code is placed in an object file, and then walks through the process of building and debugging an assembly-coded executable.

Specific Intrinsics

Chapters 11 through 13 (“SIMD Programming on the SPU,” “SPU Communication, Part 1: Direct Memory Access (DMA),” “SPU Communication, Part 2: Events, Signals, and Mailboxes,” respectively) described the majority of the functions declared in spu_intrinsics.h. Most of them perform vector operations, but there’s another set of functions called specific intrinsics. Each specific intrinsic calls a single assembly instruction. The intrinsic takes its name from its corresponding instruction, but adds a prefix of si_.

For example, the assembly instruction ilh places a number in each halfword of an SPU register. That is,

ilh $20,4

inserts the value 4 into each halfword of Register 20. The corresponding specific intrinsic is si_ilh, and the C function si_ilh(4) returns a vector whose halfwords are all set to 4.

Nearly all the SPU’s assembly instructions have specific intrinsics, but there are three important categories that do not:

  • Branch instructions, such as br, brz, and biz

  • Branch hint instructions, such as hbr, hbra, and hbrz

  • Interrupt return instructions, such as iret, iretd, and irete

If you want to insert these instructions into C/C++ code, you’ll have to rely on the asm command. For example, the interrupt code in Chapter 13 called asm("iret") to return from the interrupt service routine.

Specific intrinsics provide the same low-level operation as assembly instructions, but can be safely used in C/C++ code. As you progress through this chapter, it’s a good idea to test any instructions that make you nervous by calling their intrinsics in C/C++ code.

Introducing the SPU Assembly Language

The first step in learning assembly is to understand the overall structure of an assembly file. This treatment presents a basic example of assembly coding, and then shows how applications can be assembled and debugged.

Creating Sections in Assembly

As explained in Appendix A, “Understanding ELF Files,” SPU object files are composed of sections with names such as .text, .rodata, and .symtab. You don’t have to worry about sections when you write C/C++ because the linker handles the placement of code in object files. When you program with assembly, however, this responsibility is yours alone.

Thankfully, the process is simple. Enter a section name in an assembly file and the information that follows will be placed in that section. Most assembly code only requires three types of sections:

  • .text: Executable code follows

  • .data: Initialized data (constants) follows

  • .bss: Uninitialized data (variables) follows

Structurally, an assembly file consists of section names (called directives) and the code/data to be placed in the corresponding sections. The following code shows how these directives provide the overall structure for an assembly file:

.data
<Variables to be placed in .data>

.text
<Executable code to be placed in .text>

.data
<More variables to be placed in .data>

.text
<More code to be placed in .text>

The byte alignment for a section’s content is set with .align x. When this is used, the code/data in the section is aligned on a 2^x-byte boundary (2 raised to the power x). For example, the following code makes sure that the .text section is aligned on a 16-byte boundary:

.text
.align 4

This 16-byte alignment is used throughout this chapter.
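If the exponent form of .align is unfamiliar, the arithmetic can be checked with a few lines of C. This is only a sketch: align_boundary and align_up are hypothetical helper names chosen for illustration, not part of the toolchain.

```c
#include <assert.h>
#include <stdint.h>

/* Boundary, in bytes, implied by ".align x": 2 raised to the power x. */
static uint32_t align_boundary(unsigned x)
{
    assert(x < 32);                 /* keep the shift well defined */
    return (uint32_t)1 << x;
}

/* Round addr up to the next multiple of the 2^x-byte boundary. */
static uint32_t align_up(uint32_t addr, unsigned x)
{
    uint32_t b = align_boundary(x);
    return (addr + b - 1) & ~(b - 1);
}
```

With x set to 4, align_boundary returns 16, and align_up moves an address such as 0x201 to 0x210 while leaving 0x200 unchanged.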

A Simple Assembly File

The code in Listing 15.1 adds the content of Register 20 to the content of Register 21. The sum is placed in Register 21.

Example 15.1. Register Addition in Assembly: spu_addreg.s

# Create the .data section, aligned on 16-byte boundary
   .data
   .align 4

# Create two constant vectors, addend1 and addend2
addend1:
   .int 0,1,2,3
addend2:
   .int 4,5,6,7

# Create the .text section, aligned on 16-byte boundary
   .text
   .align 4

# Create a global symbol, _start, for linker
   .global _start

# List instructions in the _start procedure
_start:
   lqa   $20,addend1    # Place addend1 in Reg 20
   lqa   $21,addend2    # Place addend2 in Reg 21
   a     $21,$20,$21    # Add Regs 20, 21 => Reg 21
   stop  0

This code consists of two section directives: .data and .text. The .data section declares two vectors, addend1 and addend2, and the vector names are written as labels. That is, they start in the leftmost column and are followed by a colon. In assembly, labels serve the same roles that function names and variable names serve in C/C++.

The .text section begins with the .global directive, which makes the _start label accessible to external processes. This is important because the linker uses _start to identify the first executable instruction of the code, called the entry point. You don’t have to call the entry point _start, but if you use another symbol, you need to add -e symbol_name to the linker arguments.

After the _start label, the first instruction, lqa, loads the data identified by addend1 into Register 20. Like the directives, the assembly instructions in Listing 15.1 are indented. The assembly comments are preceded by #, but C-style comments can also be used.

Whitespace separates the instructions from their arguments. In Listing 15.1, the arguments are either register names or labels created in the .data section. The $ denotes a register and $20 identifies Register 20 in the SPU’s file of 128 registers. The symbols addend1 and addend2 refer to the values declared in the .data section. In each instruction, the first argument is the target register, which receives the result of the operation. This is common among all the SPU assembly instructions.

Building an Assembly-Coded Application

If you look through the makefiles in the Chapter15 projects, you’ll see that building assembly-coded applications takes two steps:

  1. spu-as assembles the code into an object file (*.o) and inserts debugging information.

  2. spu-gcc links the object file into an executable.

The --gdwarf-2 flag tells the assembler to insert debugging information. DWARF 2 (Debug With Arbitrary Record Format) is a format for recording data about an application’s execution. Other formats are available, and you can see them by entering spu-as --help.

The linking step creates the executable and places the code at 0x0. This is specified with -Wl,-Ttext=0x0. The -nostartfiles option prevents the linker from including standard startup files such as crt0.o.

Debugging an Assembly-Coded Application

Running the spu_addreg executable won’t produce any meaningful output. To be sure that the code is working properly, you need a means of viewing the SPU’s state, such as the spu-gdb debugger. You can run the debugger from Eclipse as described in Chapter 4, “Debugging and Simulating Applications,” or you can debug from the command line.

A simple command-line debugging session for spu_addreg might look like the following:

% spu-gdb spu_addreg -q
(gdb) break 21

Breakpoint 1 at 0x4: file spu_addreg.s, line 21
(gdb) run

Starting program: spu_addreg

Breakpoint 1, _start () at spu_addreg.s:21
21     lqa     $21,addend2      # Place addend2 in Reg 21
Current language:  auto; currently asm
(gdb) step 2

_start () at spu_addreg.s:23
23     stop  0
(gdb) info reg 20 21

r20 {uint128 = 0x00000000000000010000000200000003,
v2_int64 = {0x1, 0x200000003},
v4_int32 = {0x0, 0x1, 0x2, 0x3},
v8_int16 = {0x0, 0x0, 0x0, 0x1, 0x0, 0x2, 0x0, 0x3}
v16_int8 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1,
            0x0, 0x0, 0x0, 0x2, 0x0, 0x0, 0x0, 0x3},
v2_double = {0x0, 0x0}, v2_float = {0x0, 0x0, 0x0, 0x0}}

r21 {uint128 = 0x0000000400000006000000080000000a,
v2_int64 = {0x400000006, 0x80000000a},
v4_int32 = {0x4, 0x6, 0x8, 0xa},
v8_int16 = {0x0, 0x4, 0x0, 0x6, 0x0, 0x8, 0x0, 0xa}
v16_int8 = {0x0, 0x0, 0x0, 0x4, 0x0, 0x0, 0x0, 0x6,
            0x0, 0x0, 0x0, 0x8, 0x0, 0x0, 0x0, 0xa},
v2_double = {0x0, 0x0}, v2_float = {0x0, 0x0, 0x0, 0x0}}
(gdb) step

Program exited normally.
(gdb) quit

The first command, break 21, creates a breakpoint at line 21. The run command executes the application until the breakpoint is reached. At that point, you can step through the rest of the application and use (i)nfo reg to display the contents of SPU registers. The register contents are presented in every vector type available.

Note

step, info, and quit have been used for illustrative purposes. In the real world, you enter s for step, i for info, and q for quit. To look at the content of Registers 20 and 21, enter i r 20 21.

SPU Load and Store Instructions

The best place to start the discussion of assembly instructions is with the load/store instructions. The SPU operates only on register data, so if you can’t transfer data to and from the registers, you can’t do anything at all. Another reason to start with these instructions is that they’re easy to understand. But before I discuss the instructions, I need to explain the three addressing modes used by the SPU.

SPU Addressing Modes

C/C++ code accesses memory with pointers and the linker handles the actual addressing. In assembly code, you need to specify exactly which memory locations you want to access. There are three ways to do this:

  1. Absolute addressing (a-form): The instruction specifies the actual local store (LS) address.

  2. Indexed addressing (d-form): The instruction specifies a value that is added to a register value to generate the address.

  3. Indexed register indirect addressing (x-form): The values of two registers are added together to generate the address.

Most of the load/store instructions have a variant for each mode. These instructions are common in assembly, so you may want to memorize the characteristics of a-form, d-form, and x-form addressing.
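To make the three modes concrete, here is a small C sketch of how each one arrives at an effective address. The function names are hypothetical. The sketch also assumes a 256 KB local store, quadword accesses (so the low 4 bits of the address are cleared), and a d-form index whose 10-bit field counts quadwords; check these assumptions against the SPU instruction set documentation before relying on them.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions: 256 KB local store, 16-byte (quadword) accesses. */
#define LS_SIZE   0x40000u
#define QUAD_MASK 0xFFFFFFF0u

/* Wrap an address into the local store and clear its low 4 bits. */
static uint32_t ls_wrap(uint32_t addr)
{
    return addr & (LS_SIZE - 1) & QUAD_MASK;
}

/* a-form: the instruction carries the LS address itself. */
static uint32_t ea_aform(uint32_t lsa)
{
    return ls_wrap(lsa);
}

/* d-form: a signed index is added to the address held in ra. */
static uint32_t ea_dform(uint32_t ra, int32_t index)
{
    /* Assumption: a 10-bit quadword count gives byte offsets -8192..8176. */
    assert(index >= -8192 && index <= 8176);
    return ls_wrap(ra + (uint32_t)index);
}

/* x-form: the addresses held in ra and rb are added together. */
static uint32_t ea_xform(uint32_t ra, uint32_t rb)
{
    return ls_wrap(ra + rb);
}
```

For instance, ea_dform(0x200, 16) yields 0x210 and ea_xform(0x200, 0x20) yields 0x220.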

Load/Store Instructions

Table 15.1 lists the different SPU assembly instructions that load data from the LS into a register and store data from a register into the LS. The arguments starting with r represent registers. The lsa and index arguments are numeric values.

Table 15.1. SPU Load/Store Instructions

Opcode  Arguments      Purpose
lqa     rt,lsa         Load quadword from lsa to register rt
lqd     rt,index(ra)   Load quadword from ra+index to register rt
lqx     rt,ra,rb       Load quadword from ra+rb to register rt
stqa    rt,lsa         Store quadword from register rt to lsa
stqd    rt,index(ra)   Store quadword from register rt to ra+index
stqx    rt,ra,rb       Store quadword from register rt to ra+rb
lqr     rt,lsa         Load quadword from lsa+PC to register rt
stqr    rt,lsa         Store quadword from register rt to lsa+PC

The first six entries are the a-, d-, and x-forms of the load (lq) and store (stq) instructions. They transfer data between the SPU’s register file and LS using the addressing modes described previously. The index value can be only 10 bits wide.

An example will clarify how these instructions work. Suppose Figure 15.1 represents the initial state of Registers 20–22 and the LS contents from 0x200–0x220. Remember that these instructions load and store 16-byte quadwords, and that the registers and LS lines have the same size: 16 bytes.

Figure 15.1. Initial state of the registers and LS

Let’s look at the following instructions:

lqa  $20, 0x200
lqd  $21, 16($20)
lqx  $22, $20, $21
stqa $20, 0x220
stqd $21, -80($22)
stqx $22, $21, $20

The first instruction loads the quadword at address 0x200 into Register 20. This value is 0x200.

The second instruction forms an address by adding the value in Register 20 (0x200) to the index (16). 0x200 + 0x10 = 0x210, and the value at address 0x210 is placed into Register 21. This value is 0x20.

The third instruction forms an address by adding the value in Register 20, 0x200, and the value in Register 21, 0x20. 0x200 + 0x20 = 0x220, and the value at address 0x220 is loaded into Register 22. This value is 0x250.

The fourth instruction stores the content of Register 20 to address 0x220. The fifth instruction takes the value in Register 22 (0x250), subtracts 80 (0x50), and stores the value of Register 21 to the resulting address, 0x200. The last instruction adds the values of Registers 21 and 20 (0x20 + 0x200) and stores the value of Register 22 to address 0x220.
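If you’d like to check the walkthrough without a debugger, the six instructions can be replayed in a toy C model. This is a deliberately simplified sketch: it tracks only the first word of each quadword, it assumes the registers start at zero, and it takes the initial LS contents (0x200 at 0x200, 0x20 at 0x210, 0x250 at 0x220) from the values the walkthrough reports. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Simplification: only the first word of each 16-byte quadword is tracked. */
enum { LS_QUADS = 0x40 };           /* models LS addresses 0x000-0x3FF */

typedef struct {
    uint32_t ls[LS_QUADS];          /* one tracked word per quadword   */
    uint32_t r20, r21, r22;         /* tracked word of each register   */
} spu_model;

/* Return the tracked word of the quadword that contains addr. */
static uint32_t *quad(spu_model *m, uint32_t addr)
{
    assert(addr / 16 < LS_QUADS);   /* stay inside the modeled window  */
    return &m->ls[addr / 16];
}

/* Initial state: registers zeroed (an assumption), LS as described above. */
static void model_init(spu_model *m)
{
    for (int i = 0; i < LS_QUADS; i++) m->ls[i] = 0;
    m->r20 = m->r21 = m->r22 = 0;
    *quad(m, 0x200) = 0x200;
    *quad(m, 0x210) = 0x20;
    *quad(m, 0x220) = 0x250;
}

/* Replay the six instructions in order. */
static void model_run(spu_model *m)
{
    m->r20 = *quad(m, 0x200);               /* lqa  $20, 0x200     */
    m->r21 = *quad(m, m->r20 + 16);         /* lqd  $21, 16($20)   */
    m->r22 = *quad(m, m->r20 + m->r21);     /* lqx  $22, $20, $21  */
    *quad(m, 0x220) = m->r20;               /* stqa $20, 0x220     */
    *quad(m, m->r22 - 80) = m->r21;         /* stqd $21, -80($22)  */
    *quad(m, m->r21 + m->r20) = m->r22;     /* stqx $22, $21, $20  */
}
```

After model_init and model_run, the model holds 0x20 at address 0x200 and 0x250 at address 0x220. Note that the final stqx overwrites the value that stqa placed at 0x220.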

lqr and stqr are different from the other instructions. They have only one argument following the target register, lsa, but this isn’t an address. Instead, this value is added to the current value of the program counter (PC), which stores the address of the instruction being executed. By loading new data into the instruction list, you can dynamically alter the operation of your application.

Load Immediate Instructions

Instead of loading data from the LS, the instructions in Table 15.2 load immediate values (constants) into registers. The position of the constant within the register depends on the instruction being used.

Table 15.2. SPU Load Immediate Instructions

Opcode  Arguments  Name                           Purpose
ilh     rt,imm     Immediate Load Halfword        Load each halfword in rt with the imm value
il      rt,imm     Immediate Load Word            Load each word in rt with the imm value
ilhu    rt,imm     Immediate Load Halfword Upper  Load the high halfword of each word in rt with imm
iohl    rt,imm     Immediate Or Halfword Lower    OR the low halfword of each word in rt with imm
ila     rt,imm     Immediate Load Address         Load imm (18 bits) into the LSBs of each word in rt

Each of the SPU’s 128-bit general-purpose registers can hold eight 16-bit halfwords or four 32-bit words. ilh loads the 16-bit imm value into all the halfword elements, and il loads the value into all the word elements. il sign-extends the leftmost bit of imm across the rest of the bits in the word.

The next two instructions load values into some, but not all, of the halfword elements. ilhu loads imm into the high halfword of each of the four words and sets the low halfword of each word to 0. iohl performs an OR operation between imm and the low halfword of each word. These instructions are commonly used together to load a 32-bit value into each word of a register.

The last instruction, ila, loads the 18-bit imm value into each word of the register. This is similar to il, but ila can accept a value of up to 18 bits and does not sign-extend the value.
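The placement rules are easier to remember once you’ve seen them as code. The following C sketch models each load immediate instruction on a register represented as four 32-bit words. The quad_t type is an assumption made for illustration, and the function names simply mirror the opcodes.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t w[4]; } quad_t;   /* one 128-bit register */

/* ilh: copy the 16-bit immediate into all eight halfwords. */
static quad_t ilh(uint16_t imm)
{
    quad_t r;
    for (int i = 0; i < 4; i++) r.w[i] = ((uint32_t)imm << 16) | imm;
    return r;
}

/* il: sign-extend the 16-bit immediate into each word. */
static quad_t il(int16_t imm)
{
    quad_t r;
    for (int i = 0; i < 4; i++) r.w[i] = (uint32_t)(int32_t)imm;
    return r;
}

/* ilhu: immediate goes in the high halfword; the low halfword is zero. */
static quad_t ilhu(uint16_t imm)
{
    quad_t r;
    for (int i = 0; i < 4; i++) r.w[i] = (uint32_t)imm << 16;
    return r;
}

/* iohl: OR the immediate into the low halfword of each word. */
static quad_t iohl(quad_t rt, uint16_t imm)
{
    for (int i = 0; i < 4; i++) rt.w[i] |= imm;
    return rt;
}

/* ila: place an 18-bit immediate in each word without sign extension. */
static quad_t ila(uint32_t imm)
{
    quad_t r;
    assert(imm < (1u << 18));               /* only 18 bits fit */
    for (int i = 0; i < 4; i++) r.w[i] = imm;
    return r;
}
```

The ilhu/iohl pairing works as you’d expect: iohl(ilhu(0x1234), 0x5678) leaves 0x12345678 in every word, whereas il(-1) fills each word with 0xFFFFFFFF because of the sign extension.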

Figure 15.2 shows how these five instructions position values in SPU registers.

Figure 15.2. Operation of the load immediate instructions

Registers can be loaded with immediate values, but there are no instructions that store immediate values to memory. The only way to modify the LS is to load a value into a register and store the register’s content to memory.

SPU Shuffle and Select Instructions

The instructions in the preceding section loaded values into words and halfwords, but none of the operations were byte oriented. Byte manipulation is one of the many uses of the SPU’s shufb instruction. It can also select bytes from two registers and place them in a single register.

The selb instruction is similar to shufb, but selects individual bits rather than bytes. This discussion explains both instructions and the instructions that generate masks for them.

Byte Shuffling and Shuffle Mask Creation

Table 15.3 lists the byte shuffling instruction, shufb, and its mask generation instructions. The rightmost column lists the SPU intrinsic functions that correspond to the instructions. These functions are declared in spu_intrinsics.h, but they are not specific intrinsics.

Table 15.3. Instructions for SPU Shuffling and Mask Creation

Opcode  Arguments      Purpose                                              Intrinsic Function
shufb   rt,ra,rb,rc    Form rt from the bytes of ra and rb according to rc  spu_shuffle
cbd     rt,index(ra)   Create mask for byte insertion                       spu_insert
cbx     rt,ra,rb       Create mask for byte insertion                       spu_insert
chd     rt,index(ra)   Create mask for halfword insertion                   spu_insert
chx     rt,ra,rb       Create mask for halfword insertion                   spu_insert
cwd     rt,index(ra)   Create mask for word insertion                       spu_insert
cwx     rt,ra,rb       Create mask for word insertion                       spu_insert
cdd     rt,index(ra)   Create mask for doubleword insertion                 spu_insert
cdx     rt,ra,rb       Create mask for doubleword insertion                 spu_insert

shufb serves the same purpose as spu_shuffle from Chapter 11. Each byte in rc (the mask or index vector) determines the value of the corresponding byte in the target register, rt. If a byte in rc is between 0 and 31, the byte in rt will be one of the 32 bytes in the concatenation of ra and rb. The results of other rc values are given as follows:

  • If the rc byte is between 128 and 191 (10xxxxxx), the corresponding byte in rt will be 0x00.

  • If the rc byte is between 192 and 223 (110xxxxx), the corresponding byte in rt will be 0xFF.

  • If the rc byte is greater than or equal to 224 (111xxxxx), the corresponding byte in rt will be 0x80.

For example, if rc equals

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16

the bytes in rt will be equal to the bytes of rb, but reversed. This is because shufb treats the bytes in ra and rb as a concatenated whole. The first byte of rc, 31, sets the first byte of rt equal to the last byte of rb.
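The selection rules can be captured in a short C model, which is handy for checking a mask before committing it to assembly. This is a sketch only: registers are modeled as 16-byte arrays with byte 0 leftmost, and the function name mirrors the opcode for convenience.

```c
#include <assert.h>
#include <stdint.h>

/* shufb model: rt, ra, rb, and rc are 16-byte arrays, byte 0 leftmost. */
static void shufb(uint8_t rt[16], const uint8_t ra[16],
                  const uint8_t rb[16], const uint8_t rc[16])
{
    uint8_t out[16];
    assert(rt && ra && rb && rc);
    for (int i = 0; i < 16; i++) {
        uint8_t sel = rc[i];
        if ((sel & 0xC0) == 0x80) {
            out[i] = 0x00;                  /* selector 128..191 */
        } else if ((sel & 0xE0) == 0xC0) {
            out[i] = 0xFF;                  /* selector 192..223 */
        } else if ((sel & 0xE0) == 0xE0) {
            out[i] = 0x80;                  /* selector 224..255 */
        } else {
            uint8_t idx = sel & 0x1F;       /* index into ra || rb */
            out[i] = (idx < 16) ? ra[idx] : rb[idx - 16];
        }
    }
    /* Copy last so that rt may be the same array as an input. */
    for (int i = 0; i < 16; i++) rt[i] = out[i];
}
```

Passing the mask 31, 30, ..., 16 as rc reproduces the byte reversal of rb described above.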

The next two instructions create a mask that, when used with shufb, inserts a byte in a specific location. cbd uses index + ra to generate the 4-bit location and cbx uses ra + rb. To see how this works, suppose that index + ra equals 0x7. The cbd instruction will produce a target register containing the following values:

16 17 18 19 20 21 22 03 24 25 26 27 28 29 30 31

When used as rc in shufb rt,ra,rb,rc, this mask sets the result equal to rb except for 1 byte. The eighth byte (0x7) in the result is set to Byte 3 of ra. Byte 3 is chosen because it’s the preferred slot for byte-sized scalars. That is, when a C/C++ variable is set to a byte value, that value is placed in Byte 3 of the corresponding SPU register. (See Chapter 10, “Introducing the Synergistic Processor Unit (SPU),” for more information about preferred slots.) The code in Listing 15.2 shows how cbd and shufb are used together.

Example 15.2. SPU Mask Creation and Shuffling: spu_mask.s

   .data
   .align 4

# Create shufb arguments, reg_a and reg_b
reg_a:
   .byte 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
reg_b:
   .byte 32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62

   .text
   .align 4
   .global _start

_start:
   il    $20,0x07          # Load 7 into each word of R20
   cbd   $21,0($20)        # Create mask (rc) for shufb

   lqa   $22,reg_a         # Create ra for shufb
   lqa   $23,reg_b         # Create rb for shufb

   shufb $24,$22,$23,$21   # Use R21 to select values
   stop  0

Once spu_mask completes, Register 24 contains all the bytes in Register 23 except for the eighth byte, which is taken from Byte 3 in Register 22. This operation is presented in Figure 15.3.

Figure 15.3. Operation of cbd and shufb

The next two instructions, chd and chx, create a mask that, when used by shufb, inserts a halfword into the result. The preferred slot of a halfword consists of Bytes 2 and 3 of a register. Because there are only eight places a halfword can be inserted, the last bit of the 4-bit location (index + ra or ra + rb) has no effect. As an example, if Register 20 contains 0x5 and Register 21 contains 0x7, the instruction

chx $22, $20, $21

creates a mask that inserts a halfword into halfword position 6, counting from 0. The location is 0x5 + 0x7 = 0xC, and dropping its last bit (0xC >> 1) gives 0x6, which corresponds to Bytes 12 and 13. The resulting mask is as follows:

16 17  18 19  20 21  22 23  24 25  26 27  02 03  30 31

The last four instructions create similar masks, but the number of available positions decreases as the size of the inserted value increases. The word inserted by cwd and cwx can only be placed in one of four places, so the last 2 bits of the specified location aren’t used. Similarly, the doubleword inserted by cdd and cdx can be placed in only one of two positions, so just the most significant bit matters.
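All eight mask generators follow one pattern, which the C sketch below tries to capture in a single hypothetical helper, insertion_mask. The size argument (1, 2, 4, or 8 bytes) stands in for the choice of opcode, and pos is the location computed from index + ra or ra + rb. Treat this as my reading of the behavior described in this section rather than a definitive model.

```c
#include <assert.h>
#include <stdint.h>

/* Build the shufb mask described for cbd/chd/cwd/cdd and their x-forms.
   pos  : the computed location (index + ra, or ra + rb)
   size : 1, 2, 4, or 8 bytes for byte, halfword, word, doubleword */
static void insertion_mask(uint8_t mask[16], uint32_t pos, unsigned size)
{
    assert(size == 1 || size == 2 || size == 4 || size == 8);

    /* Start with "copy every byte of rb": selectors 16..31. */
    for (int i = 0; i < 16; i++) mask[i] = (uint8_t)(16 + i);

    /* Keep 4 bits of pos, then drop the low bits the size makes unusable. */
    unsigned start = (pos & 0xF) & ~(size - 1);

    /* Preferred slot of ra: byte 3, bytes 2-3, bytes 0-3, or bytes 0-7. */
    unsigned first = (size < 4) ? 4 - size : 0;
    for (unsigned i = 0; i < size; i++)
        mask[start + i] = (uint8_t)(first + i);
}
```

Calling insertion_mask with pos 0x7 and size 1 reproduces the cbd mask shown earlier (selector 03 in Byte 7), and pos 0xC with size 2 reproduces the chx mask (selectors 02 and 03 in Bytes 12 and 13).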

Bit Selection and Selection Mask Creation

Table 15.4 lists the SPU bit selection instruction, selb, and the instructions that generate masks for selb.

Table 15.4. Instructions for SPU Bit Selection and Mask Generation

Opcode  Arguments    Purpose                                     Intrinsic Function
selb    rt,ra,rb,rc  Select bits from ra and rb according to rc  spu_select
fsm     rt,ra        Form select mask for words                  spu_maskw
fsmh    rt,ra        Form select mask for halfwords              spu_maskh
fsmb    rt,ra        Form select mask for bytes                  spu_maskb
fsmbi   rt,imm       Form select mask for bytes with imm

The selb instruction works just like the spu_select intrinsic described in Chapter 11. The bits in the target register, rt, are determined by the corresponding bits in the select mask, rc. If Bit X in rc equals 0, Bit X in rt will equal Bit X in ra. If Bit X in rc equals 1, Bit X in rt will equal Bit X in rb. In this manner, the target register will resemble ra if rc contains mostly 0s, and will resemble rb if rc contains mostly 1s. Figure 11.5 in Chapter 11 displays the operation of both selb and spu_select.

The next three instructions form masks for selb. In each case, bits from ra are repeated throughout the resulting register. The difference between fsm, fsmh, and fsmb is how many bits are taken from ra and how many times they’re repeated.

fsm forms a select mask by taking 4 bits from ra and repeating each of them 32 times in rt. That is, the first bit is repeated 32 times, followed by 32 repetitions of the second bit, and so forth. For example, if Register 20 contains 0xA (1010), the instruction

fsm $21,$20

produces the following result in Register 21:

FFFFFFFF 00000000 FFFFFFFF 00000000

Similarly, fsmh repeats each of 8 bits from ra 16 times and fsmb repeats each of 16 bits from ra 8 times. These three instructions function exactly like the select mask generation intrinsics described in Chapter 11. The last instruction in Table 15.4, fsmbi, is different. It performs the same operation as fsmb, but instead of requiring ra as an argument, it accepts a 16-bit constant. For example,

fsmbi $21,0xABAB

will place the following value in Register 21:

FF00FF00 FF00FFFF FF00FF00 FF00FFFF

By generating masks with these instructions, you can control how selb performs its bit selection. If used properly, this selection can replace traditional comparison-branch constructs, thereby removing the possibility of branch misses and improving the performance of your application.
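To see how the masks and selb fit together, consider the following C sketch. It models a register as four 32-bit words, with word 0 leftmost. The quad_t type and the form_select_mask helper are hypothetical; a count of 4, 8, or 16 stands in for fsm, fsmh, and fsmb/fsmbi respectively. In selb, a set mask bit takes the bit from rb, which is why a mask of mostly 1s makes the result resemble rb.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t w[4]; } quad_t;   /* word 0 is leftmost */

/* selb: where a mask bit is 1 take the bit from rb, otherwise from ra. */
static quad_t selb(quad_t ra, quad_t rb, quad_t rc)
{
    quad_t rt;
    for (int i = 0; i < 4; i++)
        rt.w[i] = (ra.w[i] & ~rc.w[i]) | (rb.w[i] & rc.w[i]);
    return rt;
}

/* Expand the low `count` bits of src; each bit fills 128/count mask bits.
   count = 4 models fsm, 8 models fsmh, 16 models fsmb and fsmbi. */
static quad_t form_select_mask(uint32_t src, unsigned count)
{
    quad_t rt = { {0, 0, 0, 0} };
    assert(count == 4 || count == 8 || count == 16);
    unsigned width = 128 / count;           /* mask bits per source bit */
    for (unsigned bit = 0; bit < 128; bit++) {
        unsigned k = bit / width;           /* source bit, leftmost first */
        if ((src >> (count - 1 - k)) & 1u)
            rt.w[bit / 32] |= 0x80000000u >> (bit % 32);
    }
    return rt;
}
```

form_select_mask(0xA, 4) gives FFFFFFFF 00000000 FFFFFFFF 00000000, matching the fsm example, and form_select_mask(0xABAB, 16) gives the fsmbi result shown above. Feeding the first mask to selb picks words 0 and 2 from rb and words 1 and 3 from ra, with no branch involved.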

SPU Arithmetic and Logic Instructions

The SPU instructions for math and logic are among the most commonly used and the simplest to understand. If the corresponding intrinsics in Chapter 11 made sense, you won’t find any surprises here. The only difference is that the integer instructions are divided into byte, word, and halfword instructions, and are further divided into those that use immediate values and those that don’t.

This section organizes arithmetic/logic instructions into three categories: addition/subtraction instructions, multiplication instructions, and logical instructions. The SPU has no assembly instructions for division. The MASSV and SIMD Math libraries provide higher-level routines for this purpose.

SPU Addition and Subtraction Instructions

There’s one simple reason why you need to have a solid grasp of the SPU’s addition/subtraction instructions: pointer arithmetic. This isn’t a concern in C, but in assembly coding, you not only need to know how to add values to addresses, you also need to increment and decrement the stack pointer. Table 15.5 lists the SPU instructions that perform these operations.

Table 15.5. Instructions for SPU Addition/Subtraction

Opcode  Arguments  Purpose                                               Intrinsic Function
ah      rt,ra,rb   Add halfwords in ra and rb                            spu_add
ahi     rt,ra,imm  Add halfwords in ra and imm value                     spu_add
a       rt,ra,rb   Add words in ra and rb                                spu_add
ai      rt,ra,imm  Add words in ra to imm value                          spu_add
addx    rt,ra,rb   Add words in ra and rb to LSB of rt                   spu_addx
cg      rt,ra,rb   Generate carry vector from ra and rb                  spu_genc
cgx     rt,ra,rb   Generate carry vector from ra, rb, and the LSB of rt  spu_gencx
sfh     rt,ra,rb   Subtract halfwords in ra from rb                      spu_sub
sfhi    rt,ra,imm  Subtract halfwords in ra from imm value               spu_sub
sf      rt,ra,rb   Subtract words in ra from rb                          spu_sub
sfi     rt,ra,imm  Subtract words in ra from imm value                   spu_sub
sfx     rt,ra,rb   Subtract words in ra from rb and LSB of rt            spu_subx
bg      rt,ra,rb   Generate borrow from ra and rb                        spu_genb
bgx     rt,ra,rb   Generate borrow from ra, rb, and the LSB of rt        spu_genbx
absdb   rt,ra,rb   Subtract bytes in ra from rb, returns absolute value  spu_absd
avgb    rt,ra,rb   Average of bytes in ra and rb                         spu_avg
sumb    rt,ra,rb   Add bytes in ra and rb, return halfword results       spu_sumb
fa/dfa  rt,ra,rb   Add floating-point values in ra and rb                spu_add
fs/dfs  rt,ra,rb   Subtract floating-point values in rb from ra          spu_sub

Most of these instructions operate on fixed-point values, and the basic addition and subtraction instructions exactly resemble their corresponding intrinsics. Even the extended operations, involving adding a carry vector or borrow vector, are provided as basic SPU instructions. However, you need to remember that, for addx and sfx, the carry/borrow bits are taken from the target register, rt, which also holds the result.

Let’s look at an example of how sfx and bg work together to subtract one large number from another. The goal is to perform the following subtraction:

  0x70000000 00000000 00000000 00000000
- 0x60000000 00000000 00000000 00000001
  0x0FFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF

The first value is stored in Register 20, and the second value is stored in Register 21. Register 22 holds the results and the borrow vectors. Listing 15.3 shows how the subtraction is coded.

Example 15.3. SPU Large Number Subtraction: spu_subtract.s

   .data
   .align 4

# Create operands for subtraction
reg_20:
   .word 0x70000000, 0x00000000, 0x00000000, 0x00000000
reg_21:
   .word 0x60000000, 0x00000000, 0x00000000, 0x00000001

   .text
   .align 4
   .global _start

_start:
   lqa     $20,reg_20
   lqa     $21,reg_21

   bg      $22,$20,$21    # Create the borrow vector
   shlqbyi $22,$22,4      # Shift the borrow vector left
   nor     $22,$22,$22    # Invert the borrow bits
   sfx     $22,$21,$20    # Subtract Reg 21 from Reg 20
                          # using borrow bits in Reg 22
   stop    0

There are a number of important differences between Listing 15.3 and Listing 12.3, which computes the sum of two large numbers. First, the generate borrow instruction (bg) needs to be executed only once, whereas the generate carry instruction (cg) must be executed for each word in the input. This is because bg doesn’t really perform subtraction; it compares the words of the inputs and returns a 1 whenever the word in ra is less than or equal to the corresponding word in rb.

The borrow vector in Register 22 is shifted left and inverted. This inversion is necessary because the subtraction instruction only subtracts borrow bits when they equal 0. That is, bg returns 1 when a borrow is needed, but sfx only factors in a borrow bit when it equals 0.
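The interplay between bg, the shift, the inversion, and sfx is easier to follow when each step can be inspected. The C sketch below replays the four instructions of Listing 15.3 on a register modeled as four 32-bit words. The quad_t type and subtract_large are hypothetical names, shlqbyi is modeled only for whole-word shifts, and the sfx formula (rb plus the complement of ra plus the low bit of rt) is my reading of the behavior described above.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t w[4]; } quad_t;   /* word 0 is most significant */

/* bg: each word is 1 when ra <= rb (unsigned), otherwise 0. */
static quad_t bg(quad_t ra, quad_t rb)
{
    quad_t rt;
    for (int i = 0; i < 4; i++) rt.w[i] = (ra.w[i] <= rb.w[i]) ? 1u : 0u;
    return rt;
}

/* shlqbyi: shift the whole quadword left by n bytes, filling with zeros. */
static quad_t shlqbyi(quad_t ra, unsigned n)
{
    quad_t rt;
    assert(n % 4 == 0 && n <= 16);          /* sketch: whole words only */
    for (unsigned i = 0; i < 4; i++)
        rt.w[i] = (i + n / 4 < 4) ? ra.w[i + n / 4] : 0u;
    return rt;
}

/* nor: bitwise NOT of (ra OR rb). */
static quad_t nor(quad_t ra, quad_t rb)
{
    quad_t rt;
    for (int i = 0; i < 4; i++) rt.w[i] = ~(ra.w[i] | rb.w[i]);
    return rt;
}

/* sfx: rb - ra, minus 1 more when the low bit of rt is 0. */
static quad_t sfx(quad_t rt, quad_t ra, quad_t rb)
{
    for (int i = 0; i < 4; i++)
        rt.w[i] = rb.w[i] + ~ra.w[i] + (rt.w[i] & 1u);
    return rt;
}

/* Replay the four instructions of Listing 15.3. */
static quad_t subtract_large(quad_t r20, quad_t r21)
{
    quad_t r22 = bg(r20, r21);      /* bg      $22,$20,$21 */
    r22 = shlqbyi(r22, 4);          /* shlqbyi $22,$22,4   */
    r22 = nor(r22, r22);            /* nor     $22,$22,$22 */
    return sfx(r22, r21, r20);      /* sfx     $22,$21,$20 */
}
```

With the operands from Listing 15.3, bg produces the words 0, 1, 1, 1, the shift turns them into 1, 1, 1, 0, and after inversion only the last word has its low bit set. The final result is 0x0FFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF.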

The last three fixed-point instructions in Table 15.5 are exactly like their intrinsic function counterparts. The first, absdb, subtracts the bytes of the first input from those of the second and returns the absolute value of each result. The second, avgb, returns the average of the bytes of the inputs. The third, sumb, doesn’t add the bytes of the first input to those of the second, but instead adds the bytes of the first input to each other and returns the sum. It also adds the bytes of the second input to each other and returns the sum.
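For reference, here is how the three byte-oriented operations look for a single lane in C. This is a sketch under assumptions: the function names are hypothetical, the bytes are treated as unsigned, and the average is rounded up, which is how I understand avgb to behave. The helper for sumb only adds the four bytes of one word to each other; where the two sums land in the target register isn’t modeled here.

```c
#include <assert.h>
#include <stdint.h>

/* absdb, one byte lane: absolute difference of two unsigned bytes. */
static uint8_t absdb_lane(uint8_t a, uint8_t b)
{
    return (a > b) ? (uint8_t)(a - b) : (uint8_t)(b - a);
}

/* avgb, one byte lane: average of two unsigned bytes, rounded up
   (the rounding direction is an assumption). */
static uint8_t avgb_lane(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}

/* sumb building block: add the four bytes of one word to each other.
   The sum can reach 1020, so it needs a halfword. */
static uint16_t sum_word_bytes(const uint8_t bytes[4])
{
    assert(bytes);
    return (uint16_t)(bytes[0] + bytes[1] + bytes[2] + bytes[3]);
}
```

The halfword result of sum_word_bytes explains why sumb returns halfwords rather than bytes: four bytes of 255 add up to 1020, which doesn’t fit in 8 bits.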

The last two instructions operate on floating-point values. The instructions are simple, but keep in mind that some instructions operate on floats (fa/fs) and others operate on doubles (dfa/dfs). This notation holds for all floating-point instructions: Instructions that operate on single-precision floating-point values start with an f, and those that operate on double-precision values start with df.

SPU Multiplication Instructions

Table 15.6 presents the SPU instructions that multiply fixed-point and floating-point values. When using them, be sure to keep track of which input values are multiplied and how the products are stored in the target register.

Table 15.6. Instructions for SPU Multiplication Operations

Opcode | Arguments | Purpose | Intrinsic Function
mpy | rt,ra,rb | Multiply low halfwords in ra and rb | spu_mulo
mpyu | rt,ra,rb | Multiply unsigned low halfwords in ra and rb | spu_mulo
mpyi | rt,ra,imm | Multiply low halfwords in ra by imm value | spu_mulo
mpyui | rt,ra,imm | Multiply unsigned low halfwords in ra by imm value | spu_mulo
mpys | rt,ra,rb | Multiply low halfwords in ra and rb and shift right | spu_mulsr
mpyh | rt,ra,rb | Multiply high halfwords of ra and low halfwords of rb | spu_mulh
mpyhh | rt,ra,rb | Multiply high halfwords of ra and rb | spu_mule
mpyhhu | rt,ra,rb | Multiply unsigned high halfwords of ra and rb | spu_mule
mpya | rt,ra,rb,rc | Multiply signed low halfwords of ra and rb, add to rc | spu_madd
mpyhha | rt,ra,rb | Multiply high halfwords of ra and rb, and add to rt | spu_mhhadd
mpyhhau | rt,ra,rb | Multiply unsigned high halfwords of ra and rb, and add to rt | spu_mhhadd
fm/dfm | rt,ra,rb | Multiply floating-point values in ra and rb | spu_mul
fma | rt,ra,rb,rc | Multiply floating-point values in ra and rb, add to values in rc | spu_madd
dfma | rt,ra,rb | Multiply floating-point values in ra and rb, add to values in rt | spu_madd
fms | rt,ra,rb,rc | Multiply floating-point values in ra and rb, subtract values in rc | spu_msub
dfms | rt,ra,rb | Multiply floating-point values in ra and rb, subtract values in rt | spu_msub
fnms | rt,ra,rb,rc | Multiply floating-point values in ra and rb, subtract values in rc, negate result | spu_nmsub
dfnma | rt,ra,rb | Multiply floating-point values in ra and rb, add values in rt, negate result | spu_nmadd
dfnms | rt,ra,rb | Multiply floating-point values in ra and rb, subtract values in rt, negate result | spu_nmsub
frest | rt,ra | Floating-point reciprocal estimate | spu_re
frsqest | rt,ra | Floating-point reciprocal absolute square root estimate | spu_rsqrte

The first-generation Cell only performs multiplication 16 bits at a time—this is why so many of the instructions operate on halfwords. Most of the fixed-point multiplication instructions are distinguished by which halfwords they multiply. Making sense of the odd, even, and high multiplication can be confusing, but Figure 11.4 in Chapter 11 shows how they work.
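The halfword selection is easier to follow with a small host-side model. The following C sketch is an illustration under stated assumptions, not SPU code: the model_ names are hypothetical, each function handles a single 32-bit word rather than a full register, and the behavior follows the Purpose column of Table 15.6. The last function shows one way to build a full 32-bit product out of 16-bit multiplies.

```c
#include <stdint.h>

/* Sign-extend the low 16 bits of h to a 32-bit signed value. */
int32_t sext16(uint32_t h) {
    h &= 0xFFFFu;
    return (int32_t)(h ^ 0x8000u) - 0x8000;
}

/* mpy: signed product of the low halfwords. */
int32_t model_mpy(uint32_t a, uint32_t b) {
    return sext16(a) * sext16(b);
}

/* mpyu: unsigned product of the low halfwords. */
uint32_t model_mpyu(uint32_t a, uint32_t b) {
    return (a & 0xFFFFu) * (b & 0xFFFFu);
}

/* mpyhh: signed product of the high halfwords. */
int32_t model_mpyhh(uint32_t a, uint32_t b) {
    return sext16(a >> 16) * sext16(b >> 16);
}

/* mpyh: high halfword of a times low halfword of b, shifted left 16 bits. */
uint32_t model_mpyh(uint32_t a, uint32_t b) {
    return ((a >> 16) * (b & 0xFFFFu)) << 16;
}

/* Low 32 bits of a * b, assembled from three 16-bit multiplies:
   a*b = (ah*bl + al*bh) * 2^16 + al*bl   (mod 2^32). */
uint32_t model_mul32(uint32_t a, uint32_t b) {
    return model_mpyh(a, b) + model_mpyh(b, a) + model_mpyu(a, b);
}
```

The ah*bh term never appears in model_mul32 because it would be shifted left by 32 bits and therefore falls outside the 32-bit result.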

The floating-point multiplication instructions present only one significant concern. fm and dfm have the same register arguments, but other instructions differ between the single-precision and double-precision versions. For example, fms and dfms both perform a multiplication and a subtraction, but fms includes a third input register whereas dfms subtracts the value in the target register. This also holds true for fma/dfma and fnms/dfnms.

The last two instructions return floating-point reciprocals and reciprocals of floating-point square roots. The algorithm used for the calculation and the accuracy of the result are described in the documentation for the SPU instruction set architecture.

SPU Logic Instructions

Table 15.7 presents the SPU instructions that perform logical operations: AND, OR, XOR, NAND, and NOR. Like previous instructions, they’re distinguished by the size of the data they operate on: bytes, halfwords, or words.

Table 15.7. Instructions for SPU Logic Operations

Opcode | Arguments | Purpose | Intrinsic Function
eqv | rt,ra,rb | Set each bit to 1 where the bits of ra and rb match, 0 otherwise | spu_eqv
and | rt,ra,rb | AND the values of ra and rb | spu_and
andbi | rt,ra,imm | AND the bytes of ra with the imm value | spu_and
andhi | rt,ra,imm | AND the halfwords of ra with the imm value | spu_and
andi | rt,ra,imm | AND the words of ra with the imm value | spu_and
andc | rt,ra,rb | AND the values of ra and the complement of rb | spu_andc
or | rt,ra,rb | OR the values of ra and rb | spu_or
orbi | rt,ra,imm | OR the bytes of ra with the imm value | spu_or
orhi | rt,ra,imm | OR the halfwords of ra with the imm value | spu_or
ori | rt,ra,imm | OR the words of ra with the imm value | spu_or
orc | rt,ra,rb | OR the values of ra and the complement of rb | spu_orc
xor | rt,ra,rb | XOR the values of ra and rb | spu_xor
xorbi | rt,ra,imm | XOR the bytes of ra with the imm value | spu_xor
xorhi | rt,ra,imm | XOR the halfwords of ra with the imm value | spu_xor
xori | rt,ra,imm | XOR the words of ra with the imm value | spu_xor
nand | rt,ra,rb | NAND the values of ra and rb | spu_nand
nor | rt,ra,rb | NOR the values of ra and rb | spu_nor
orx | rt,ra | OR the words of ra | spu_orx

The SPU has no dedicated instruction for logical inversion (that is, no NOT instruction). You could NAND a value with all 1s or NOR a value with 0, but neither instruction accepts an immediate value. The fastest way to invert the content of a register is to NOR it with itself.

For example, if each word of Register 20 holds 0xAAAAAAAA, the instruction

nor $20,$20,$20

places 0x55555555, the logical inverse of 0xAAAAAAAA, in each word of Register 20.

The orx instruction is unique because it operates across the words within a register. That is, it looks at the four words inside a register and ORs all four of them together. For example, if the four words of Register 20 are 0x00002222, 0x00004444, 0x00006666, and 0x00008888, the instruction

orx $20,$20

places the value 0x0000EEEE in the preferred slot of Register 20. This is because the values 2, 4, 6, and 8 ORed together produce 0xE.
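Both idioms can be checked with a short host-side model. This C sketch is only an illustration: the qword type and model_ names are hypothetical, and the orx input is assumed to be the four words 0x00002222, 0x00004444, 0x00006666, and 0x00008888. The model also assumes that orx clears the three words outside the preferred slot.

```c
#include <stdint.h>

/* A 128-bit register as four 32-bit words; w[0] is the preferred slot. */
typedef struct { uint32_t w[4]; } qword;

/* nor on a single word: complement of (a OR b).
   Passing the same value twice gives a plain bitwise NOT. */
uint32_t model_nor_word(uint32_t a, uint32_t b) {
    return ~(a | b);
}

/* orx rt,ra: OR the four words of ra together and place the result in
   the preferred slot. The remaining words are modeled as 0. */
qword model_orx(qword ra) {
    qword rt = {{ ra.w[0] | ra.w[1] | ra.w[2] | ra.w[3], 0u, 0u, 0u }};
    return rt;
}
```

Because orx reduces a whole register to one word, it pairs naturally with the compare instructions: a nonzero result in the preferred slot means that at least one word of the input was nonzero.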

SPU Compare, Branch, and Halt Instructions

This section describes four different types of instructions: compare, branch, hint-for-branch, and halt. It might seem strange to combine them together, but there’s a good reason. Conditional branches generally use the results of a comparison to determine which instruction path to follow.

The hint-for-branch instructions identify the target address of an upcoming branch and thereby give the SPU time to load new instructions. Used properly, they can significantly improve an application’s performance.

SPU Compare Instructions

The instructions in Table 15.8 compare register values. There are many available, but each either checks for equality or a greater-than relationship. As shown in the last column, all the fixed-point instructions correspond to either spu_cmpeq or spu_cmpgt.

Table 15.8. Instructions for SPU Vector Comparison

Opcode | Arguments | Purpose | Intrinsic Function
ceqb | rt,ra,rb | Compare equality of bytes in ra and rb | spu_cmpeq
ceqbi | rt,ra,imm | Compare equality of bytes in ra to imm value | spu_cmpeq
ceqh | rt,ra,rb | Compare equality of halfwords in ra and rb | spu_cmpeq
ceqhi | rt,ra,imm | Compare equality of halfwords in ra to imm value | spu_cmpeq
ceq | rt,ra,rb | Compare equality of words in ra and rb | spu_cmpeq
ceqi | rt,ra,imm | Compare equality of words in ra to imm value | spu_cmpeq
cgtb | rt,ra,rb | Test whether bytes in ra are greater than bytes in rb | spu_cmpgt
cgtbi | rt,ra,imm | Test whether bytes in ra are greater than imm | spu_cmpgt
cgth | rt,ra,rb | Test whether halfwords in ra are greater than halfwords in rb | spu_cmpgt
cgthi | rt,ra,imm | Test whether halfwords in ra are greater than imm | spu_cmpgt
cgt | rt,ra,rb | Test whether words in ra are greater than words in rb | spu_cmpgt
cgti | rt,ra,imm | Test whether words in ra are greater than imm | spu_cmpgt
clgtb | rt,ra,rb | Test whether bytes in ra are logically greater than bytes in rb | spu_cmpgt
clgtbi | rt,ra,imm | Test whether bytes in ra are logically greater than imm | spu_cmpgt
clgth | rt,ra,rb | Test whether halfwords in ra are logically greater than halfwords in rb | spu_cmpgt
clgthi | rt,ra,imm | Test whether halfwords in ra are logically greater than imm | spu_cmpgt
clgt | rt,ra,rb | Test whether words in ra are logically greater than words in rb | spu_cmpgt
clgti | rt,ra,imm | Test whether words in ra are logically greater than imm | spu_cmpgt
fceq | rt,ra,rb | Compare floating-point equality of ra and rb | spu_cmpeq
fcmeq | rt,ra,rb | Compare floating-point equality of ra and rb magnitudes | spu_cmpabseq
fcgt | rt,ra,rb | Test whether floating-point ra is greater than floating-point rb | spu_cmpgt
fcmgt | rt,ra,rb | Test whether floating-point magnitude of ra is greater than that of rb | spu_cmpabsgt

As you code with these instructions, remember the size of the data elements being compared. If the condition is true for a given byte, halfword, or word of the input, the corresponding byte, halfword, or word of the target register is set to all 1s. Otherwise, it is set to all 0s. This result is important for the branch instructions.

When it comes to checking for a greater-than relationship, many of the instructions take sign bits into account. The logical comparison instructions, such as clgt and clgthi, do not. The floating-point comparisons come in two types: fceq and fcgt compare floating-point values, and fcmeq and fcmgt compare their magnitudes.
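The difference between the signed and logical comparisons shows up as soon as the sign bit is set. The following C sketch models ceq, cgt, and clgt on a single word. It is a host-side illustration with hypothetical model_ names, not SPU code, and it returns the all-1s or all-0s mask described above.

```c
#include <stdint.h>

/* Turn a C truth value into the mask a compare instruction produces. */
uint32_t mask_from(int condition) {
    return condition ? 0xFFFFFFFFu : 0u;
}

/* ceq: all 1s if the words are equal. */
uint32_t model_ceq(uint32_t a, uint32_t b) {
    return mask_from(a == b);
}

/* cgt: signed greater-than. Flipping the sign bit of both operands maps
   signed order onto unsigned order, which avoids any signed casts. */
uint32_t model_cgt(uint32_t a, uint32_t b) {
    return mask_from((a ^ 0x80000000u) > (b ^ 0x80000000u));
}

/* clgt: logical (unsigned) greater-than. */
uint32_t model_clgt(uint32_t a, uint32_t b) {
    return mask_from(a > b);
}
```

For the bit pattern 0xFFFFFFFF compared against 1, model_cgt returns all 0s because the pattern is read as -1, whereas model_clgt returns all 1s because the pattern is read as the largest unsigned word.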

SPU Branch Instructions

Normally, the SPU processes instructions sequentially. As each instruction is executed, the Program Counter (PC) advances to the address of the next instruction. Branch instructions disrupt this orderly operation. Each branch instruction generates a target address, and if the branch is taken, the SPU continues processing at this address rather than at the next instruction in the sequence. All if statements, case statements, and execution loops depend on branches.

Section 10.7, “The SPU Instruction Pipeline,” discusses the SPU’s instruction pipeline, but there are three points that bear repeating:

  • The SPU always predicts that the lesser of two addresses will be taken.

  • The SPU fetches instructions at the predicted address before they’re needed.

  • If the SPU makes a wrong prediction, the penalty is 18 wasted cycles.

There are at least three ways to reduce the possibility of branch misses. The most reliable method is to reduce the number of branches in your code. If this isn’t possible, you can arrange the comparison arguments so that the SPU makes accurate predictions. If that’s not possible, the least you can do is insert the right branch instruction into your code. Table 15.9 lists each SPU branch instruction and its purpose.

Table 15.9. Instructions for SPU Branching and Halting

Opcode | Arguments | Purpose | Intrinsic
bra | imm | Branch to the imm address | -
brasl | rt,imm | Branch to imm and set link | -
br | imm | Branch to sum of imm and PC | -
brsl | rt,imm | Branch to sum of imm and PC, set link | -
bi | ra | Branch to ra | -
bisl | rt,ra | Branch to ra and set link | -
brhz | rt,imm | Branch to sum of imm and PC if rt halfword equals 0 | -
brhnz | rt,imm | Branch to sum of imm and PC if rt halfword doesn’t equal 0 | -
brz | rt,imm | Branch to sum of imm and PC if rt word equals 0 | -
brnz | rt,imm | Branch to sum of imm and PC if rt word doesn’t equal 0 | -
bihz | rt,ra | Branch to ra if rt halfword equals 0 | -
bihnz | rt,ra | Branch to ra if rt halfword doesn’t equal 0 | -
biz | rt,ra | Branch to ra if rt word equals 0 | -
binz | rt,ra | Branch to ra if rt word doesn’t equal 0 | -
bisled | rt,ra | Branch to ra and set link if an event occurs | spu_bisled

The first six instructions are unconditional branches, which means that the branch is always taken. br is the most common of these because it can accept a label as its argument. This is shown in the following code:

   br loop_start
   ai $20,$21,1

loop_start:
   orx $20,$20

In this simple example, br causes the SPU to skip the ai instruction and go to the code represented by loop_start. After br, the next instruction is orx, not ai.

Several branch instructions in Table 15.9, conditional and unconditional, set a link during operation. “Setting a link” means storing the address of the instruction that follows the branch in rt before making the branch. Links become crucial during function calls because they hold the address that the Program Counter should return to after the function completes. Because of their importance, the SPU Application Binary Interface (ABI) designates Register 0 to serve as the link register. This is discussed more fully in Section 15.10, “Assembly Language and Function Calls.”

Conditional branch instructions compare the preferred halfword or word in rt to 0. If the condition is true, the SPU starts processing at the branch target. If not, the SPU continues processing normally. brz and brnz are common because they accept labels.

Before a branch, it’s common for rt to hold the result of a previous compare instruction. As an example, let’s see how a high level if-else construct is compiled into SPU assembly. The following C code initializes x to 5, checks to see whether it’s greater than 3, and processes x differently based on the result:

int x = 5;
if (x > 3)
   x++;
else
   x--;

After compiling this with spu-gcc -S, the resulting code looks similar to the following:

   il    $4,5        # Set all words in Reg 4 to 5
   cgti  $5,$4,3     # Compare words in Reg 4 to 3
   brz   $5,.L2      # Branch if word in Reg 5 is 0
   ai    $4,$4,1     # Increment Reg 4
   br    .L3         # Skip the else path

.L2:
   ai    $4,$4,-1    # Decrement Reg 4

.L3:

As shown, x is initialized by setting all the words in Register 4 to the value of 5. Then cgti compares each of these words to 3 and places the result in Register 5. All the words are greater than 3, so cgti sets every bit in Register 5 to 1. The preferred word of Register 5 is non-0, so the brz branch won’t be taken and Register 4 (x) will be incremented.
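The same control flow can be written back in C in a way that mirrors the assembly line by line. The sketch below is a host-side illustration with a hypothetical function name. It makes the compare mask explicit and uses goto in place of the branch instructions. It also includes an unconditional jump past the else path, which a complete translation needs so that the increment and decrement paths stay separate.

```c
#include <stdint.h>

/* Mirrors the compare-and-branch sequence for: if (x > 3) x++; else x--; */
int32_t model_if_else(int32_t x) {
    /* cgti: all 1s if x > 3 (signed), all 0s otherwise */
    uint32_t mask = (x > 3) ? 0xFFFFFFFFu : 0u;

    if (mask == 0u)      /* brz: branch when the compare result is 0 */
        goto else_path;

    x = x + 1;           /* ai with +1: the "if" path */
    goto done;           /* unconditional branch past the else path */

else_path:
    x = x - 1;           /* ai with -1: the "else" path */

done:
    return x;
}
```

Calling model_if_else(5) takes the increment path and returns 6, and calling it with 2 takes the decrement path and returns 1.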

The bisled instruction performs a branch when an SPU event occurs. It behaves just like the spu_bisled intrinsic discussed in Chapter 13.

SPU Hint-for-Branch Instructions

When the SPU is notified about an upcoming branch instruction, it predicts the branch target and loads the instructions at the target’s address. If the prediction turns out to be accurate, the SPU can continue processing without losing a single cycle.

The instructions in Table 15.10 provide these SPU alerts. Each instruction tells the SPU the address of the branch and the expected branch target.

Table 15.10. Instructions for SPU Branch Hinting

Opcode | Arguments | Purpose | Intrinsic
hbr | imm,ra | Hint that the branch at PC + imm will target the address in ra | -
hbra | imm1,imm2 | Hint that the branch at PC + imm1 will target the address imm2 | -
hbrr | imm1,imm2 | Hint that the branch at PC + imm1 will target PC + imm2 | -

These instructions generate addresses differently, but the first argument identifies the address of the upcoming branch, and the second identifies the branch’s predicted target. To see how this is used, it helps to look at an example. The following assembly code is taken verbatim from /opt/cell/sdk/src/samples/julia/spu_ray_aos/ray.s:

   hbra .L111,.L17
   ilhu $21,16384
   cwd  $19,0($sp)
   ila  $18,66051
   shlqbyi    $17,$78,0
   ilhu $20,16672
   nop  $127
   nop  $127
   nop  $127
   nop  $127

.L111:
   br        .L17

The first instruction, hbra, tells the SPU that there’s a branch at the address represented by .L111, and that the branch will probably go to the address represented by .L17. After processing this instruction, the SPU will start fetching instructions at .L17.

SPU Halt Instructions

Instead of just changing the execution path, SPU instructions can also halt the SPU’s execution. Table 15.11 lists these instructions and their corresponding intrinsics.

Table 15.11. Instructions for SPU Halting

Opcode | Arguments | Purpose | Intrinsic
stop | - | Halt the SPU and send stop signal to PPU | spu_stop
stopd | - | Halt the SPU and send signal (can be used as breakpoint) | spu_stop
heq | ra,rb | Halt if ra equals rb | spu_hcmpeq
heqi | ra,imm | Halt if ra equals imm | spu_hcmpeq
hgt | ra,rb | Halt if ra is greater than rb | spu_hcmpgt
hgti | ra,imm | Halt if ra is greater than imm | spu_hcmpgt
hlgt | ra,rb | Halt if ra is logically greater than rb | spu_hcmpgt
hlgti | ra,imm | Halt if ra is logically greater than imm | spu_hcmpgt

Like the branch instructions, the halt instructions are either conditional or unconditional. The first two instructions, stop and stopd, are unconditional, and halt the SPU immediately. Both instructions send the PPU a signal that the PPU can handle as an event. Further, stopd can be used as a breakpoint. As stated in the SPU ISA documentation: “Instructions with dependencies can be replaced with stopd to create a breakpoint without affecting the instruction timings.”

The instructions following stop and stopd end the SPU’s execution depending on a condition. Unlike conditional branches, these conditions check for equality or a greater-than relationship between the inputs, not just equality with 0. Halt instructions are useful when you need to stop processing in the event of an error.

SPU Channel and Control Instructions

The instructions in Table 15.12 perform a number of tasks. Some access SPU channels and synchronize communication. Others provide access to the SPU’s special-purpose registers. The last two simply take up space in the instruction pipeline.

Table 15.12. SPU Control and Channel Operations

Opcode | Arguments | Purpose | Intrinsic
rdch | rt,imm | Read data from channel imm into rt | spu_readch
wrch | imm,rt | Write data from rt into channel imm | spu_writech
rchcnt | rt,imm | Read capacity of channel imm into rt | spu_readchcnt
sync | - | Force SPU to complete all store operations before continuing | spu_sync
syncc | - | Force SPU to complete store and channel operations | spu_sync_c
dsync | - | Ensure LS data is current before external accessing | spu_dsync
mfspr | rt,imm | Move special-purpose register imm to rt | spu_mfspr
mtspr | imm,rt | Move rt to special-purpose register imm | spu_mtspr
fscrrd | rt | Move floating-point status and control register to rt | spu_mffpscr
fscrwr | ra | Move ra to floating-point status and control register | spu_mtfpscr
nop | - | No operation (Pipeline 0) | -
lnop | - | No operation (Pipeline 1) | -

Chapter 13 explained how channels work and presented the three channel functions that make SPU interprocessor communication possible: spu_readch, spu_writech, and spu_readchcnt. These correspond to the first three instructions in Table 15.12, which read from a channel, write to a channel, and return the channel’s capacity. In each case, the immediate value represents the number of the channel being accessed.

For example, to write the value in Register 20 to the SPU’s decrementer (Channel 7), you’d use the following instruction:

wrch   7,$20

To read a value back from the decrementer (Channel 8), you’d use this instruction:

rdch   $20,8

The next three instructions, sync, syncc, and dsync, delay SPU processing until communication tasks have completed. sync forces the SPU to wait until all store operations have been completed. This ensures that all future memory reads will return the most current data available. sync also flushes the instruction buffers and the instruction pipeline, causing a significant delay in the instruction pipeline. syncc performs the same operations as sync, but also forces channel operations to finish before executing further instructions.

dsync ensures that all data in the LS is current before external processes can read it. It forces prior loads, stores, and channel accesses to complete before future loads, stores, and channel accesses can be performed. dsync does not affect the instruction buffers or pipeline.

The next four instructions in the table provide access to special registers. The first two read and write to any of the SPU’s special-purpose registers. However, many of these registers require privileged access and are unavailable for common code. The floating-point status and control register (FPSCR) is always available, and its value can be read and written to using fscrrd and fscrwr. The FPSCR is fully discussed in Chapter 11.

The last two instructions in Table 15.12, nop and lnop, may seem trivial because they don’t perform any operation. If used properly, however, they enable more efficient usage of the SPU’s two pipelines.

SPU Shift and Rotate Instructions

The instructions in Table 15.13 are among the simplest. Each of them shifts a register’s contents to the left. They differ according to which bits are shifted and how many positions can be shifted at a time.

Table 15.13. Instructions for SPU Register Shifting

Opcode | Arguments | Purpose | Intrinsic
shlh | rt,ra,rb | Shift bits in halfwords of ra left according to rb | spu_sl
shlhi | rt,ra,imm | Shift bits in halfwords of ra left according to imm | spu_sl
shl | rt,ra,rb | Shift bits in words of ra left according to rb | spu_sl
shli | rt,ra,imm | Shift bits in words of ra left according to imm | spu_sl
shlqbi | rt,ra,rb | Shift entire ra left by bits according to rb | spu_slqw
shlqbii | rt,ra,imm | Shift entire ra left by bits according to imm | spu_slqw
shlqby | rt,ra,rb | Shift entire ra left by bytes according to rb | spu_slqwbyte
shlqbyi | rt,ra,imm | Shift entire ra left by bytes according to imm | spu_slqwbyte
shlqbybi | rt,ra,rb | Shift entire ra left by bytes according to rb count | spu_slqwbytebc

SPU Shift Instructions

The first four instructions shift bits within words and halfwords. For example, if each word in Register 20 holds the value 0x6000 0000, the operation

shli $21,$20,2

places 0x8000 0000 in each word of Register 21. This is because the first high bit in 0x6000 0000 is shifted out of each word and discarded.

The instructions starting with shlq treat the register content as a single 128-bit quadword rather than a group of words and halfwords. Returning to the previous example, if each word in Register 20 holds 0x6000 0000, the operation

shlqbii $21,$20,2

will place the value 0x8000 0001 8000 0001 8000 0001 8000 0000 in Register 21. The difference between this and shli is that the shifting is performed across the entire register without regard for word boundaries: the high bits shifted out of each word move into the low bits of the word to its left, and 0s fill the low bits of the rightmost word.

The last three instructions in the table shift bytes rather than bits. shlqby uses Bits 27 through 31 of rb to determine how many bytes to shift and shlqbybi uses Bits 24 through 28. These instructions can be replaced by shufb and a suitable mask.
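The difference between word shifts and quadword shifts can be checked with a small host-side model. This C sketch is an illustration only: the qword type and model_ names are hypothetical, and the functions cover just the immediate forms shli and shlqbii.

```c
#include <stdint.h>

/* A 128-bit register as four 32-bit words; w[0] is the leftmost word. */
typedef struct { uint32_t w[4]; } qword;

/* shli rt,ra,n: shift each word left on its own.
   Counts of 32 or more clear the word. */
qword model_shli(qword ra, unsigned n) {
    qword rt;
    for (int i = 0; i < 4; i++)
        rt.w[i] = (n < 32u) ? (ra.w[i] << n) : 0u;
    return rt;
}

/* shlqbii rt,ra,n: shift the quadword left by 0 to 7 bits as one value.
   Bits leaving the top of a word enter the bottom of the word to its
   left; 0s enter the bottom of the rightmost word. */
qword model_shlqbii(qword ra, unsigned n) {
    qword rt = ra;
    n &= 7u;
    if (n == 0u)
        return rt;
    for (int i = 0; i < 4; i++) {
        uint32_t next = (i < 3) ? ra.w[i + 1] : 0u;
        rt.w[i] = (ra.w[i] << n) | (next >> (32u - n));
    }
    return rt;
}
```

With every input word set to 0x60000000 and a shift of 2, model_shli produces 0x80000000 in all four words, while model_shlqbii produces 0x80000001 in the first three words and 0x80000000 in the last, because nothing lies to the right of the last word.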

SPU Rotate Instructions

Table 15.14 lists the instructions that rotate bits and bytes inside a register. As with the shift instructions, the rotation can be limited to halfwords and words or can be performed across the entire register. One important difference is that many of these instructions (the ones that start with rotm- or rotqm-) assume that the second input will be negative. Instead of rotating values to the left, these instructions shift values to the right.

Table 15.14. Instructions for SPU Register Rotation

Opcode | Arguments | Purpose | Intrinsic
roth | rt,ra,rb | Rotate bits in halfwords of ra left according to rb | spu_rl
rothi | rt,ra,imm | Rotate bits in halfwords of ra left according to imm | spu_rl
rot | rt,ra,rb | Rotate bits in words of ra left according to rb | spu_rl
roti | rt,ra,imm | Rotate bits in words of ra left according to imm | spu_rl
rotqbi | rt,ra,rb | Rotate entire ra left by bits according to rb | spu_rlqw
rotqbii | rt,ra,imm | Rotate entire ra left by bits according to imm | spu_rlqw
rotqby | rt,ra,rb | Rotate entire ra left by bytes according to rb | spu_rlqwbyte
rotqbyi | rt,ra,imm | Rotate entire ra left by bytes according to imm | spu_rlqwbyte
rotqbybi | rt,ra,rb | Rotate entire ra left by bytes according to rb count | spu_rlqwbytebc
rothm | rt,ra,rb | Shift bits in halfwords of ra right according to -rb | spu_rlmask
rothmi | rt,ra,imm | Shift bits in halfwords of ra right according to -imm | spu_rlmask
rotm | rt,ra,rb | Shift bits in words of ra right according to -rb | spu_rlmask
rotmi | rt,ra,imm | Shift bits in words of ra right according to -imm | spu_rlmask
rotqmbi | rt,ra,rb | Shift entire ra right by bits according to -rb | spu_rlmaskqw
rotqmbii | rt,ra,imm | Shift entire ra right by bits according to -imm | spu_rlmaskqw
rotqmby | rt,ra,rb | Shift entire ra right by bytes according to -rb | spu_rlmaskqwbyte
rotqmbyi | rt,ra,imm | Shift entire ra right by bytes according to -imm | spu_rlmaskqwbyte
rotqmbybi | rt,ra,rb | Shift entire ra right by bytes according to -rb count | spu_rlmaskqwbytebc
rotmah | rt,ra,rb | Shift bits in halfwords of ra right according to -rb (algebraic) | spu_rlmaska
rotmahi | rt,ra,imm | Shift bits in halfwords of ra right according to -imm (algebraic) | spu_rlmaska
rotma | rt,ra,rb | Shift bits in words of ra right according to -rb (algebraic) | spu_rlmaska
rotmai | rt,ra,imm | Shift bits in words of ra right according to -imm (algebraic) | spu_rlmaska

The first nine instructions operate like the left-shift instructions in Table 15.13. But instead of discarding the bits/bytes that are shifted out, they reinsert them at the least significant positions in the halfword, word, or quadword.

The instructions that start with rotm- or rotqm- are more interesting. The m stands for mask, but they don’t use bit masks like those used for shufb or selb. These instructions have four important characteristics:

  1. The second input, imm or rb, must contain a negative value.

  2. The result will be shifted right as many bytes/bits as the second argument specifies.

  3. Bits/bytes shifted out will be discarded.

  4. The sign bit will be extended for algebraic instructions (rotma, rotmah, and so on).

Listing 15.4 shows how three of these functions, rotmi, rotqmbii, and rotmai, operate on a register whose words are set to the value 0xAAAA AAAA.

Example 15.4. SPU Right Shifts: spu_shright.s

   .data
   .align 4

# Immediate value to be shifted
reg_20:
   .word 0xAAAAAAAA, 0xAAAAAAAA, 0xAAAAAAAA, 0xAAAAAAAA

# Declare the text section
   .text
   .align 4
   .global _start

_start:
   lqa      $20,reg_20     # Load immediate value in $20

   rotmi    $21,$20,-7     # Shift bits in words right
   rotqmbii $22,$20,-7     # Shift entire quadword right
   rotmai   $23,$20,-7     # Shift bits in words right
                           # and repeat sign bit
   stop  0

Using spu-gdb to debug the application, the results are as follows:

rotmi:      0x0155 5555 0155 5555 0155 5555 0155 5555
rotqmbii:   0x0155 5555 5555 5555 5555 5555 5555 5555
rotmai:     0xFF55 5555 FF55 5555 FF55 5555 FF55 5555

0xAAAA >> 7 = 0x0155, so these results make sense. If you change the last argument of rotmi, rotqmbii, and rotmai from -7 to 7, the assembler will give you the following warning:

Constant expression out of range [-63, 0]
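The three results can be reproduced on a host machine. The C sketch below is a model under stated assumptions, not SPU code: the qword type and model_ names are hypothetical, and each function takes the shift distance as a positive magnitude n, which corresponds to an immediate of -n in the assembly.

```c
#include <stdint.h>

/* A 128-bit register as four 32-bit words; w[0] is the leftmost word. */
typedef struct { uint32_t w[4]; } qword;

/* rotmi rt,ra,-n with 0 <= n <= 31: logical right shift of each word. */
qword model_rotmi(qword ra, unsigned n) {
    qword rt;
    for (int i = 0; i < 4; i++)
        rt.w[i] = ra.w[i] >> n;
    return rt;
}

/* rotmai rt,ra,-n with 0 <= n <= 31: right shift of each word that
   copies the sign bit into the vacated high bits. */
qword model_rotmai(qword ra, unsigned n) {
    qword rt;
    for (int i = 0; i < 4; i++) {
        uint32_t v = ra.w[i] >> n;
        if (ra.w[i] & 0x80000000u)
            v |= ~(0xFFFFFFFFu >> n);   /* fill the top n bits with 1s */
        rt.w[i] = v;
    }
    return rt;
}

/* rotqmbii rt,ra,-n with 0 <= n <= 7: logical right shift of the whole
   quadword. Bits leaving a word enter the top of the word to its right. */
qword model_rotqmbii(qword ra, unsigned n) {
    qword rt = ra;
    n &= 7u;
    if (n == 0u)
        return rt;
    for (int i = 0; i < 4; i++) {
        uint32_t prev = (i > 0) ? ra.w[i - 1] : 0u;
        rt.w[i] = (ra.w[i] >> n) | (prev << (32u - n));
    }
    return rt;
}
```

Only the leftmost word of the quadword shift receives 0s at the top. Every other word receives the low bits of its left neighbor, which is why the rotqmbii result differs from the rotmi result in the last three words.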

SPU Counting and Conversion Instructions

The instructions in this last category are used to count bits in a register, extend signs, and convert register contents. Table 15.15 lists each instruction, its purpose, and the SPU intrinsic that performs a similar operation.

Table 15.15. Instructions for SPU Counting and Conversion

Opcode | Arguments | Purpose | Intrinsic
clz | rt,ra | Count 0s preceding the first 1 in each word of ra | spu_cntlz
cntb | rt,ra | Count number of 1s in each byte of ra | spu_cntb
gbb | rt,ra | Concatenate LSBs of each byte in ra | spu_gather
gbh | rt,ra | Concatenate LSBs of each halfword in ra | spu_gather
gb | rt,ra | Concatenate LSBs of each word in ra | spu_gather
xsbh | rt,ra | Sign extend bytes in ra to halfwords | spu_extend
xshw | rt,ra | Sign extend halfwords in ra to words | spu_extend
xswd | rt,ra | Sign extend words in ra to doublewords | spu_extend
fesd | rt,ra | Convert float in ra to double in rt | spu_extend
frds | rt,ra | Round double in ra to float in rt | spu_roundtf
fi | rt,ra,rb | Floating-point interpolate between ra and rb | -
cflts | rt,ra,imm | Convert float in ra to signed integer in rt, scaled by imm | spu_convts
cfltu | rt,ra,imm | Convert float in ra to unsigned integer in rt, scaled by imm | spu_convtu
csflt | rt,ra,imm | Convert signed integer in ra to float in rt, scaled by imm | spu_convtf
cuflt | rt,ra,imm | Convert unsigned integer in ra to float in rt, scaled by imm | spu_convtf

The first two instructions, clz and cntb, count bits in a register. clz counts the number of leading 0s (the 0s that precede the first 1) in each input word and returns these four values in the words of the target register. cntb counts the number of 1s in each byte of the input register. These instructions are particularly helpful when operating on the results of comparison instructions.

The instructions gbb, gbh, and gb create a scalar element by concatenating the least significant bits of the corresponding elements of a register. For example, gbh looks at the LSB of each of the eight input halfwords and returns a halfword whose low 8 bits contain those LSBs in order. If Register 20 contains 0x8888 9999 AAAA BBBB CCCC DDDD EEEE FFFF, the instruction

gbh $21,$20

places the value 0x0055 (binary 0101 0101 in the low byte) in the preferred slot for halfwords.
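The gathering step is simple enough to model in a few lines of C. This sketch is a host-side illustration with a hypothetical function name. It takes the eight halfwords of a register from left to right and returns the gathered bits as an integer.

```c
#include <stdint.h>

/* gbh: collect the least significant bit of each of the eight halfwords.
   h[0] is the leftmost halfword, so its bit ends up as the most
   significant of the eight gathered bits. */
uint32_t model_gbh(const uint16_t h[8]) {
    uint32_t bits = 0u;
    for (int i = 0; i < 8; i++)
        bits = (bits << 1) | (h[i] & 1u);
    return bits;
}
```

For the halfwords 0x8888 through 0xFFFF used above, the even values contribute 0 and the odd values contribute 1, which gives the alternating pattern 0101 0101.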

The next three instructions accept a register containing elements of one size and return a register whose elements are twice as large and half as numerous. For example, xswd accepts a register containing four word values and returns a register of two doublewords. The words on the right retain their position and their signs are extended across the words on the left. For example, if you place {−5, −4, −3, −2} in Register 20, the result of

xswd $21,$20

places {−4, −2} in Register 21 as doublewords.
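A host-side model makes the word positions explicit. In the C sketch below, which uses a hypothetical function name and plain arrays in place of a register, the right word of each doubleword is kept and the left word is overwritten with copies of its sign bit.

```c
#include <stdint.h>

/* xswd: treat the register as two doublewords. In each one, keep the
   right (low) word and fill the left (high) word with its sign bit.
   in[0] and out[0] are the leftmost words of the register. */
void model_xswd(const uint32_t in[4], uint32_t out[4]) {
    for (int d = 0; d < 2; d++) {
        uint32_t low = in[2 * d + 1];
        out[2 * d]     = (low & 0x80000000u) ? 0xFFFFFFFFu : 0u;
        out[2 * d + 1] = low;
    }
}
```

The words at positions 0 and 2 of the input never reach the output, which is why -5 and -3 disappear in the example above.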

The rest of the instructions in the table convert between fixed-point and floating-point values. They’re easy to understand and closely resemble their intrinsic functions. Only fi (Floating Interpolate) is different. This instruction doesn’t have a corresponding intrinsic, but it is used together with frest (Floating Reciprocal Estimate) and frsqest (Floating Reciprocal Absolute Square Root Estimate) to refine the estimates they produce.

Assembly Language and Function Calls

You’ve seen all the SPU assembly instructions and four simple assembly coded applications. But what if you want to write assembly code that can be called as a function? How do you read input parameters? Where do you store the return value? How do you manage the stack?

The answers to these questions are presented in the Linux Application Binary Interface (ABI), provided in the SDK doc, SPU_ABI-Specification.pdf. The ABI specifies what types of data should be stored in the SPU’s registers. This includes data such as the stack pointer, input arguments, and any environment information provided by the calling function. This is briefly explained in Chapter 10, and Table 15.16 restates how the SPU registers are used.

Table 15.16. SPU Registers and the Application Binary Interface

Register | ABI Usage
R0 | Link register (LR). Stores address where the function should return.
R1 | Stack pointer (SP). Points to the top of the stack in the LS.
R2 | Environment pointer. Receives environment data.
R3-R74 | Stores function arguments. Register 3 holds the first parameter and return value.
R75-R79 | General usage.
R80-R127 | Local variable storage.

This section describes how the ABI is used in two situations: writing assembly code that can be called as a C/C++ function and calling a C/C++ function in assembly.

Writing an Assembly-Coded Function

Let’s say you have assembly code that you want to execute with a C/C++ function call. The coding process is similar to that for ordinary applications, but two additional tasks are needed:

  • Declaring the function in assembly

  • Managing the stack

This section describes both tasks and concludes with a full example of an assembly-coded callable function.

Declaring the Function in Assembly

The first step in creating the assembly coded function is to make sure that the function name is visible to any code that needs to access it. In SPU assembly, this is accomplished with the .global directive. This directive can also be used for any global variables or objects required by the application.

To show that the global symbol refers to a function, you need to use the .type directive and the @function specifier. For example, to declare a function called func, use lines similar to the following:

.global func
.type func, @function

This declaration is required for execution, so it must be placed in the .text section of the assembly file. When the symbol is created, it can be used as a global label in the same way that _start has been used in earlier applications.

Managing the Stack

Register 1 points to the top of the SPU stack, which is accessed in assembly with the stack pointer, $sp. As explained in Chapter 11, the stack is a section of memory that stores local variables during a function’s execution. When the function completes, the stack returns to its earlier size, effectively deallocating the function’s local variables.

The stack grows downward. To allocate memory for the stack, decrement the stack pointer by subtracting from $sp. Then use the memory between the current stack pointer and the old stack pointer to store local variables.

Suppose your assembly function has three local variables stored in Registers 20, 21, and 22. Each variable gets its own 16-byte line in the LS, so you might use code similar to the following:

# Store stack pointer at new address
stqd   $sp, -64($sp)

# Make space for three local variables
# by decrementing the stack pointer
ai     $sp, $sp, -64

# Store the three local variables on the stack
stqd   $20, 48($sp)
stqd   $21, 32($sp)
stqd   $22, 16($sp)

# After the function completes,
# return the stack pointer to its earlier position
ai     $sp, $sp, 64

This code doesn’t accomplish anything worthwhile. It makes space on the stack for three LS lines by decrementing the stack pointer by four LS lines (4 × 16 bytes = 64); the extra line, at offset 0 from the new stack pointer, holds the saved value of the old stack pointer. Then it stores the three variables on the stack at their appropriate lines (1 × 16, 2 × 16, 3 × 16). When the function returns, the stack pointer is incremented by the same amount by which it was originally decremented (64). Returning the stack pointer to its original position frees the space allocated for the local variables so that subsequent functions can use it.

This process can be generalized into a four-step process for managing the SPU stack:

  1. Store the value of the old stack pointer ($sp) at the address of the new stack pointer ($sp - stack_size).

  2. Decrement the stack pointer by 16 × (n + 1), where n is the number of local variables you need to store.

  3. Load and store local variables at line addresses (multiples of 16 + $sp) between the old stack pointer and the new stack pointer.

  4. Increment the stack pointer so that it returns to its original position.
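The arithmetic behind these four steps can be checked on the host with a few lines of C. This is only a sketch: the byte array standing in for the LS and the names frame_size, local_offset, push_frame, and pop_frame are hypothetical and not part of the SDK.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LS_LINE 16u   /* one quadword line in the local store */

/* Step 2: bytes to reserve for n local variables plus the line
 * that holds the saved stack pointer. */
uint32_t frame_size(uint32_t n_locals) {
    return LS_LINE * (n_locals + 1u);
}

/* Step 3: offset from the new stack pointer of local variable i (0-based).
 * Line 0 holds the saved stack pointer, so locals start at offset 16. */
uint32_t local_offset(uint32_t i) {
    return LS_LINE * (i + 1u);
}

/* Steps 1 and 2 on a byte array standing in for the LS (hypothetical model):
 * write the old stack pointer at the new address, then return the new one. */
uint32_t push_frame(uint8_t *ls, uint32_t sp, uint32_t n_locals) {
    assert(sp >= frame_size(n_locals));     /* the stack grows downward */
    uint32_t new_sp = sp - frame_size(n_locals);
    memcpy(&ls[new_sp], &sp, sizeof sp);    /* stqd $sp, -size($sp) */
    return new_sp;                          /* ai   $sp, $sp, -size */
}

/* Step 4: release the frame by restoring the earlier stack pointer. */
uint32_t pop_frame(uint32_t sp, uint32_t n_locals) {
    return sp + frame_size(n_locals);       /* ai   $sp, $sp, size  */
}
```

For three local variables, frame_size(3) is 64 and the locals sit at offsets 16, 32, and 48 from the new stack pointer, which matches the stqd offsets used in the earlier code.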

Now that you understand how to declare functions and manage a stack in assembly, it’s time to look at an example.

Assembly-Coded Function Example

There are two code files in the func project: spu_caller.c and spu_func.s. Listing 15.5 shows the code that makes up spu_caller.c.

Example 15.5. Calling an Assembly-Coded Function: spu_caller.c

#include <stdio.h>

extern int func(int, int);

int main(int argc, char **argv) {

   int res = func(5, 2);
   printf("Result = %d\n", res);
   return 0;
}

The caller declares func with an extern statement and then calls func(5, 2). The goal of this function is to compute the sum and difference of the two arguments and multiply them together. In this case, ((5 + 2) × (5 − 2)) = 21.

The code in Listing 15.6 shows how func can be implemented in assembly. Notice that the input values are received in Registers 3 and 4. Also notice how the stack is extended to store two values.

Example 15.6. Assembly-Coded Function: spu_func.s

.text
   .align   4
   .global  func
   .type func, @function

func:
   # Decrement the stack to store
   # two local variables
   stqd  $sp,-48($sp)
   ai    $sp,$sp,-48

   # Add the input values together
   # and store the sum on the stack
   a     $2,$3,$4
   stqd  $2,32($sp)

   # Subtract the input values and
   # store the difference on the stack
   sf    $2,$4,$3
   stqd  $2,16($sp)

   # Retrieve the sum and difference
   # from the stack and multiply
   lqd   $3,32($sp)
   lqd   $2,16($sp)
   mpy   $3,$2,$3

   # Increment the stack and return
   # to the link address
   ai    $sp,$sp,48
   bi    $lr

The input parameters are received in Registers 3 and 4, and the result is placed in Register 3. Register 1 ($sp) is left alone during computation because it holds the stack pointer.

Similarly, Register 0 ($lr) can’t be used because it stores the address of the instruction to be executed once the function completes. Instead of stop, the last instruction of an assembly coded function should be a branch to $lr.

Calling C/C++ Functions from Assembly Code

The link register ($lr or Register 0) becomes important when you call C/C++ functions from assembly. Before a function can execute, the address of the instruction that follows the call must be placed in $lr. This way, the SPU knows where to resume execution once the function finishes.

This placement is accomplished using branch instructions that “set a link” in addition to branching. brsl sets a link and then branches to a label, which can also be a function name. The instruction

brsl   $lr,func_name

updates the link register and tells the SPU to continue processing at func_name.
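One way to picture this linkage is with a toy model in C. The sketch below is an illustration only: the struct and function names are hypothetical, and it tracks nothing but the program counter and the preferred slot of $lr. It assumes that the link saved by brsl is the address of the instruction after the branch, and that each SPU instruction occupies 4 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the two pieces of SPU state involved in a call.
 * The struct and function names are hypothetical. */
struct spu_ctl {
    uint32_t pc;   /* address of the instruction being executed */
    uint32_t lr;   /* preferred slot of Register 0 ($lr)        */
};

/* brsl $lr, target: save the address of the next instruction, then branch. */
void brsl(struct spu_ctl *s, uint32_t target) {
    assert((target & 3u) == 0);   /* instruction addresses are word aligned */
    s->lr = s->pc + 4;            /* each SPU instruction is 4 bytes long   */
    s->pc = target;
}

/* bi $lr: resume at the saved link address. */
void bi_lr(struct spu_ctl *s) {
    s->pc = s->lr;
}
```

If the call sits at address 0x100 and the function at 0x200, brsl leaves 0x104 in the link register, and the function's closing bi $lr returns execution to 0x104.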

The caller project is exactly the reverse of the func project. Instead of using C code to call an assembly function, an assembly coded application (spu_caller.s) calls a function coded in a C file (spu_func.c). The code in Listing 15.7 shows how this can be done.

Example 15.7. Calling a C Function from Assembly: spu_caller.s

   .text
   .align 4
   .global _start

_start:
   # Load values into Registers 3 & 4
   il    $3,5
   il    $4,2

   # Branch to func
   brsl  $lr,func

   stop  0

When the function completes, the return value can be found in Register 3. The return value is 0x15, or 21.
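The contents of spu_func.c aren't reproduced here. As a rough sketch, assuming it performs the same computation as the assembly-coded func shown earlier, the file might look something like this:

```c
/* spu_func.c: hypothetical sketch; the project's actual file is not shown. */
int func(int a, int b) {
    int sum  = a + b;    /* 5 + 2 = 7 */
    int diff = a - b;    /* 5 - 2 = 3 */
    return sum * diff;   /* 7 * 3 = 21, handed back in Register 3 */
}
```

Under the ABI, the compiler reads a and b from Registers 3 and 4, which is why the assembly caller loads those two registers before branching.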

Assembly and the SPU Dual-Pipeline Architecture

Because of its two parallel pipelines, an SPU can issue two instructions at once. This is a great improvement over single-issue processors, but this capability isn’t guaranteed. To see why dual-issue is possible for one instruction sequence but not for another, you need to understand how the two pipelines process instructions. Table 15.17 shows which instruction types can be processed by Pipeline 0 and Pipeline 1.

Table 15.17. The Two SPU Pipelines and Their Instructions

Pipeline 0 | Pipeline 1
Load (immediate) | Load (not immediate), store
Fixed-point add, subtract, multiply, generate borrow and carry | Branches, branch hints
Single-precision/double-precision operations except estimates | Floating-point estimates
Halfword/word shifts and rotates | Quadword shifts and rotates
Select bits | Shuffle bytes, create masks for select and shuffle
Logical instructions, count 1s and 0s | Gather bits
NOP | LNOP
Conversion and sign extension | (none)
Absolute difference, average, sum | (none)

Dual-issue is only possible if the first instruction can be executed by Pipeline 0 and the second can be executed by Pipeline 1. For example, if the first instruction is fa (Floating-Point Add) and the second is stqd (Store Quadword), the two instructions can issue in parallel. But if the upcoming instructions are selb (Select Bits) and il (Immediate Load Word), they must be executed serially because they both require Pipeline 0. Similarly, if a Pipeline 1 instruction precedes a Pipeline 0 instruction, the two issue separately.

The SDK compiler does a competent job in pairing instructions for dual issue, but you might find there’s room for improvement. Let’s suppose the next eight instructions are given as

P0 P1 P0 P0 P1 P1 P0 P1

where P0 designates a Pipeline 0 instruction and P1 designates a Pipeline 1 instruction. If each instruction takes a single clock cycle to execute and there are no dependency concerns, the instructions issue as shown in Figure 15.4.

Figure 15.4. Inefficient instruction ordering

This instruction pairing takes six cycles to execute eight instructions. The first pair of instructions issues in parallel, but the next four can’t because neither of the following two pairs places a P0 instruction ahead of a P1 instruction. The last two issue in parallel.

Oddly enough, the performance of this example can be improved by inserting instructions: nop and lnop. These instructions don’t do anything, but nop is P0 and lnop is P1. By inserting nops, you can make sure that a P0 instruction precedes every P1 instruction, as shown in Figure 15.5.

Figure 15.5. Efficient instruction ordering

With nop and lnop, the SPU can perform dual issue with each cycle. The eight nontrivial instructions can now execute in five cycles rather than six.
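To experiment with different orderings, you can count cycles with a small C function. The sketch below rests on a simplified model that is an assumption here, not a description of the hardware in full: instructions are examined in fixed pairs, a pair issues in one cycle only when it is P0 followed by P1, every instruction takes one cycle, and there are no dependencies. The function name issue_cycles and the string encoding ('0' for Pipeline 0, '1' for Pipeline 1) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count issue cycles for a stream written as a string of '0' (Pipeline 0)
 * and '1' (Pipeline 1) characters, under the simplified pairing model
 * described above. */
size_t issue_cycles(const char *stream) {
    size_t n = strlen(stream);
    size_t cycles = 0;
    for (size_t i = 0; i < n; i += 2) {
        char first  = stream[i];
        char second = (i + 1 < n) ? stream[i + 1] : '\0';
        assert(first == '0' || first == '1');
        assert(second == '0' || second == '1' || second == '\0');
        if (first == '0' && second == '1')
            cycles += 1;                          /* pair issues together      */
        else
            cycles += (second != '\0') ? 2 : 1;   /* one cycle per instruction */
    }
    return cycles;
}
```

For an illustrative stream such as P0 P1 P0 P0 P1 P1 P0 P1, issue_cycles("01001101") returns 6. Padding it to P0 P1 P0 lnop P0 P1 nop P1 P0 P1, or "0101010101", brings the count down to 5.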

Conclusion

There’s a lot to learn when it comes to SPU assembly: sections, directives, addressing modes, branch hints, dual-issue conditions, and hundreds of hard-to-remember opcodes. I don’t recommend building complete applications in assembly, but if you want to improve on the compiler’s output, these are the hurdles you need to leap over.

The SPU assembly instructions are similar in purpose to the intrinsics discussed in Chapter 11. If you understood the multiply high-high and rotate mask functions as intrinsics, you’ll have no problem understanding the corresponding instructions. The main difficulty is keeping track of the different types of load/store commands, branches, and immediate versus register parameters.

The Linux Application Binary Interface (ABI) specifies how information is stored in SPU registers and how the stack pointer and link register operate. The stack holds local variables, and the link register makes it possible for functions to return to their calling functions. By using the stack pointer and link register properly, you can call assembly code as a C/C++ function and call a C/C++ function from assembly code.
