All the SPU’s vital capabilities have been discussed: Single Instruction, Multiple Data (SIMD) processing, direct memory access (DMA), and channel communication. This chapter presents three topics that aren’t as commonly used, but may prove important as you code more advanced SPU applications: overlays, software caching, and SPU isolation.
Overlays allow SPU applications to dynamically incorporate code stored outside the local store (LS). They are frequently used in embedded systems, where application sizes may exceed the amount of available local memory. Applications with overlays are easy to code, but the overlay information has to be placed in the linker script.
Each SPU has a 256KB LS and a 2KB register file, but no cache. To make sure recently used data remains close at hand, the SDK provides a software cache capability with functions declared in cache-api.h. After the cache’s structure has been configured, the LS can be accessed with data-transfer functions that offer many advantages over regular DMA commands.
The Cell provides extraordinary security for SPU applications through hardware isolation. That is, if the PPU configures a context in isolation mode, the corresponding SPU’s LS cannot be accessed by external processing units. Further, SPU executables can be signed and encrypted to ensure that intruders can’t corrupt the application.
Ideally, all of your SPU applications will fit inside the 256KB LS with enough extra space left over for a stack and heap. But if an application exceeds this limit, you’ll need to transfer instructions in and out of the LS at runtime. Overlays provide this capability and function much like pages, which are swapped in and out of virtual memory. An overlay is an object in an SPU object file that can be loaded into an overlay region when called upon. For example, if object file bar.o contains an overlay function called baz(), the object containing baz() will be loaded into the LS only when baz() is called.
Loading overlays into memory takes a significant number of cycles, but it’s the simplest and most effective way to deal with SPU applications that require more than 256KB. For example, Figure 14.1 shows how overlays can be used if an SPU application depends on object files obj1.o, obj2.o, obj3.o, and obj4.o, but only obj1.o and obj2.o and one of obj3.o or obj4.o will fit in memory.
The process of using SPU overlays consists of four steps:
Compile SPU code, separating overlay object files from the build.
Create a linker script that identifies the overlay object files and the sections they contain.
Configure the build process to use the new linker script.
Invoke functions from the overlay objects during execution.
This section starts with a description of how a linker script defines overlays and then presents an example application that uses them.
The SPU linker, spu-ld, has a default script that tells it how to create SPU executables. To display the script commands, enter the following at the command line:

```sh
spu-ld --verbose
```

On my system, the default script is elf32_spu.x, located in the /usr/spu/lib/ldscripts directory.
The elf32_spu.x script defines the target architecture (spu), the entry point (_start), and the directories to be searched for libraries (/usr/spu/lib). But most of the file describes how and where sections of input files should be incorporated into the executable. Appendix A, “Understanding ELF Files,” explains what ELF sections are and how they work.
If an application uses overlays, they must be defined in the linker script. The Chapter14/overlay project directory identifies its overlays in ld.script. This custom linker script contains most of the content of elf32_spu.x, but adds a few linker statements needed for the project. The makefile in the overlay project includes this script into the build with the linker flag:
LD_FLAGS = -Wl,-T,ld.script
Two statements must be added to the linker script to enable overlay processing:
OVERLAY: Describes sections that may be loaded into the same memory region

EXCLUDE_FILE: Tells the linker not to include the overlay sections in the executable
The OVERLAY statement provides two pieces of information: the sections to be included in the executable and the object files that contain the sections. For example, suppose you want the linker to copy the .text section of overlay1.o and overlay2.o into the respective .out_section1 and .out_section2 sections of the executable. You’d add the following lines to the linker script:

```
OVERLAY :
{
  .out_section1 {overlay1.o(.text)}
  .out_section2 {overlay2.o(.text)}
}
```
The OVERLAY statement accepts further descriptors that identify the starting and loading addresses of the overlay sections, but this is the basic format. Conveniently, this is exactly the OVERLAY statement used in the overlay project’s linker script, ld.script.
Normally, the build process links the .text section of every input object file into the executable. This is specified in the linker script with

```
*(.text)
```
But overlay1.o and overlay2.o should not be loaded into the executable until they’re needed. ld.script excludes them with the following line:

```
*(EXCLUDE_FILE(overlay1.o overlay2.o) .text .stub .text.* .gnu.linkonce.t.*)
```
If you need to create an overlay application of your own, all you have to do is modify these two lines in ld.script and set the appropriate linker flag in the build process. Then the object files will be kept out of the SPU executable until they’re needed by the application.
In code, overlays are accessed as external functions. That is, if overlay1.o contains objects foo and bar, the SPU application can access them by declaring the following:

```c
extern foo();
extern bar();
```
These functions don’t occupy any memory in the LS until they are invoked. Then the overlay manager loads the corresponding external objects into the LS and the functions execute normally. The SPU code doesn’t need to know anything about the object files themselves—only the names of the overlay objects are important.
In the overlay project, the main application (spu_overlay.c) accesses functions declared in overlay1.c and overlay2.c. The first function, reverse_vector, calls spu_shuffle to rearrange the elements of a vector unsigned int. The second, reverse_again, performs the same operation and returns the vector elements to their original order.
Listing 14.1 presents spu_overlay.c, which declares and calls the overlay functions.
Example 14.1. SPU Overlay Calling Function: spu_overlay.c
```c
#include <stdio.h>
#include <spu_intrinsics.h>

/* Declare functions in overlay1.o and overlay2.o */
extern vector unsigned int reverse_vector(vector unsigned int);
extern vector unsigned int reverse_again(vector unsigned int);

int main(unsigned long long speid, unsigned long long argp,
         unsigned long long envp)
{
   vector unsigned int test_vec = (vector unsigned int){0, 1, 2, 3};

   printf("Calling the overlay functions:\n");

   /* Call the external functions */
   test_vec = reverse_vector(test_vec);
   test_vec = reverse_again(test_vec);

   return 0;
}
```
Listing 14.2 presents overlay1.c, which reverses the order of the input vector’s elements. The code in overlay2.c is almost exactly the same.
Example 14.2. SPU Overlay Function: overlay1.c
```c
#include <stdio.h>
#include <spu_intrinsics.h>

vector unsigned int reverse_vector(vector unsigned int test)
{
   int i;
   vector unsigned char indexVec = {12, 13, 14, 15, 8, 9, 10, 11,
                                    4, 5, 6, 7, 0, 1, 2, 3};

   /* Rearrange the vector */
   test = spu_shuffle(test, test, indexVec);

   /* Display the results */
   printf("reverse_vector: ");
   for (i = 0; i < 4; i++)
      printf("%u ", spu_extract(test, i));
   printf("\n");

   return test;
}
```
The printed result is as follows:

```
Calling the overlay functions:
reverse_vector: 3 2 1 0
reverse_again: 0 1 2 3
```
The overlay functions, reverse_vector and reverse_again, look and behave like regular SPU functions. Although overlay1.c and overlay2.c contain only a single function each, an overlay file can contain multiple overlay functions.
An SPU’s LS is not a cache, but it can act like one with the right software. The SPU software managed cache provides a set of read/write functions that treat the LS as though it were a cache. Using it gives you two important advantages.
The first advantage is that data transfer becomes much simpler. When you read data into the LS using cache functions, you don’t have to worry about where the data should be stored. Data placement is determined solely by the cache structure. And because the cache is formed of equally sized cache lines, you don’t have to specify how much data should be transferred. Each cache operation transfers one cache line at a time.
The second and more important advantage becomes apparent when an application requires more data than the LS can hold. When this happens, the programmer usually has to decide how memory should be replaced in the LS and the exact locations where existing data should be overwritten. But the software cache handles data replacement automatically: Its replacement algorithm determines which cache lines should be overwritten and where new data should be stored.
Accessing the software managed cache in SPU code requires three steps:
Insert #define statements with specific attributes to configure the cache.

Include the cache-api.h header file after the #define statements.

Transfer data to and from the LS using cache read and write functions.
After you’ve configured a software cache, you should only access the LS with cache functions. You shouldn’t use mfc_get or mfc_put to transfer data to or from the cache. These unregulated transfers can damage its structure.
A software cache isn’t a data structure that can be allocated or deallocated. Instead, a cache’s parameters are configured through attributes that tell the cache read/write functions how to operate. These attributes are listed in Table 14.1.
Table 14.1. Software Cache Configuration Attributes
Attribute | Purpose | Range | Default |
---|---|---|---|
CACHE_NAME | Unique identifier of the cache | Any string | None |
CACHE_TYPE | Cache type (0 - read-only, 1 - read/write) | 0, 1 | 1 (read/write) |
CACHED_TYPE | Identifies the datatype being cached | Any datatype | None |
CACHE_LOG2NWAY | Log2 of the cache associativity | 0, 2 | 2 (4-way) |
CACHE_LOG2NSETS | Log2 of the number of sets in cache | 0–12 | 6 (64 sets) |
CACHELINE_LOG2SIZE | Log2 of a cache line size | 4–12 | 7 (128 bytes) |
CACHE_SET_TAGID(set) | Tag identifier of cache DMA transfers | — | — |
CACHE_READ_X4 | Enables cache access through cache_rd_x4 | — | — |
CACHE_STATS | Enables monitoring of cache activity | — | — |
The CACHE_NAME attribute identifies the cache currently in use, and multiple caches can be created at once as long as they have different names. The next two attributes, CACHE_TYPE and CACHED_TYPE, look alike but have very different purposes. CACHE_TYPE determines whether the cache is read-only (0) or read/write (1). CACHED_TYPE identifies the datatype to be stored in the cache.
For example, the following lines configure a read-only cache called EXAMPLE_CACHE that stores c_type values:

```c
#define CACHE_NAME EXAMPLE_CACHE
#define CACHE_TYPE 0
#define CACHED_TYPE c_type
```
To enable read/write access to the cache, set CACHE_TYPE equal to one. The SDK documentation recommends setting CACHED_TYPE equal to a previously typedefed datatype to avoid errors involving operator precedence.
One of the most crucial concerns in configuring the cache is determining how the cache data should be mapped to main memory. This associativity is controlled by the CACHE_LOG2NWAY attribute, which can take one of two values:
0 - Direct-mapped: Main memory is partitioned into the same number of blocks as there are cache lines, and each cache line stores data for a specific block. Each cache line is its own set.

2 - 4-way set associative: The N-line cache is divided into N/4 sets of four lines each, and each main memory block can be mapped to any of the four lines in a set.
Figure 14.2 depicts both methods. The 4-way associative cache mapping on the right is chosen by default.
The 4-way set associative mapping reduces the chances of collisions, because a memory location can be cached in one of four different lines. But when the cache is read, all four lines in the set must be searched. Direct mapping makes searching unnecessary since each memory location can be cached in only one location. If multiple addresses mapped to a cache line need to be cached, however, only one will be cached. The others, if accessed, will cause cache misses.
After the cache’s associativity has been configured, the number of sets can be identified with the CACHE_LOG2NSETS attribute. The default value is 6, but it can be set as high as 12 for a total of 2^12 = 4,096 sets. The size of each cache line is set with CACHELINE_LOG2SIZE, whose maximum is 12 and default is 7, for a cache line size of 2^7 = 128 bytes. For example, to create a 4-way set associative cache with 128 sets and cache lines of 256 bytes, you’d use the following configuration:

```c
#define CACHE_LOG2NWAY 2
#define CACHE_LOG2NSETS 7
#define CACHELINE_LOG2SIZE 8
```
The first line can be left out because 2 is the default value. The total size of this cache = 256 bytes/line * 4 lines/set * 128 sets = 128KB, which takes up half the LS. The default attributes produce a cache of 128 bytes/line * 4 lines/set * 64 sets = 32KB.
The last three attributes in Table 14.1 specify less-crucial information. The first, CACHE_SET_TAGID(set), identifies the tag value to be used for all of the cache’s DMA transfers between the LS and main memory. CACHE_READ_X4 makes it possible to read values as vector unsigned ints, and can be used only when CACHED_TYPE is a 32-bit integral type. The last attribute, CACHE_STATS, makes it possible to access cache statistics such as hits and misses for load and store operations. This attribute is discussed shortly.
After you’ve configured the attributes in code, be sure to include the cache-api.h header. The code in this header depends on the cache attributes to operate.
The cache-api.h header declares two types of functions that interface the software cache: safe and unsafe. Safe functions allow you to read and write to the cache, but don’t allow pointer access to the LS. Unsafe functions return pointers to data in the LS, which make it possible to disturb the cache structure.
Safe functions are somewhat slower than the unsafe functions, but they reduce the possibility of error when accessing the LS. Table 14.2 lists all four of them.
Table 14.2. Safe Software Cache Access Functions
Function | Return Value | Purpose |
---|---|---|
cache_rd(name, ea) | CACHED_TYPE | Read a value from the effective address ea |
cache_wr(name, ea, value) | void | Write value to the effective address ea |
cache_flush(name) | void | Write modified cache lines to main memory |
cache_rd_x4(name, ea) | quadword vector | Read a vector of values from cache |
Each function accepts a name parameter that identifies the cache being accessed (configured by CACHE_NAME). The first two functions, cache_rd and cache_wr, operate as their names imply: The first returns a CACHED_TYPE value from the effective address ea, and the second writes a CACHED_TYPE value to ea. If the data at ea hasn’t been cached, these functions block until the data is transferred from main memory into the LS.
The software cache is writeback, which means that cache writes aren’t committed to main memory until necessary. The third function in Table 14.2, cache_flush, makes this update necessary. That is, it forces all modified cache lines to be written to their respective addresses in main memory.
The last function, cache_rd_x4, is available only if the CACHE_READ_X4 attribute is defined and CACHED_TYPE is a 32-bit integral type. If the four values at ea are already in cache, the function returns immediately. Otherwise, it blocks until the vector is fetched from main memory.
Safe functions are useful when you want to test how well an application works with a given cache structure. But once you’re certain the application works properly, you can improve performance and flexibility by using the functions in Table 14.3. These are unsafe because they provide pointers to CACHED_TYPE structures rather than the CACHED_TYPE structures themselves.
Table 14.3. Unsafe Software Cache Access Functions
Function | Return Value | Purpose |
---|---|---|
cache_rw(name, ea) | CACHED_TYPE * | Provides ptr for modifying cache (blocks) |
cache_touch(name, ea) | CACHED_TYPE * | Provides ptr for modifying cache (no block) |
cache_wait(name, lsa) | void | Blocks until the data at lsa is cached |
cache_lock(name, lsa) | void | Locks the cache line associated with lsa |
cache_unlock(name, lsa) | void | Unlocks the cache line associated with lsa |
cache_rw returns a pointer to the cached data with effective address ea. This is a pointer to a CACHED_TYPE structure, and it can be used for reading or writing to the LS. This function blocks until the data is available in the cache.
To see the advantage provided by cache_rw, let’s say you want to increment a value stored in the cache at address ea_addr. With safe functions, you’d have to use the following:

```c
c_type value = cache_rd(EXAMPLE_CACHE, ea_addr);
value++;
cache_wr(EXAMPLE_CACHE, ea_addr, value);
```
This requires two accesses to cache memory. But by calling cache_rw, you can increment the value with one access:

```c
c_type *value = cache_rw(EXAMPLE_CACHE, ea_addr);
(*value)++;
```
cache_touch is like cache_rw but returns immediately. If the data at ea hasn’t been cached yet, the pointer returned by cache_touch won’t be immediately usable. This function is commonly called to cache a memory location before the actual data is needed. The next function, cache_wait, forces the SPU to halt until the memory is cached. Afterward, the pointer can be used as if it had been returned by cache_rw.
If the software cache needs to cache new data but no space is available in the LS, it uses the least recently used (LRU) replacement algorithm. That is, it discards the cache line that has been used least recently. To prevent a cached location from being discarded, call cache_lock(name, lsa). This assumes that you’ve acquired the LS pointer using cache_rw or cache_touch. When your code no longer needs the lock, cache_unlock makes the cache location available for replacement.
The SPU software cache is particularly useful when dealing with large amounts of data partitioned into small sizes. For example, suppose the SPU needs to sort a list of 32-bit values that is larger than the LS can hold. The cache makes sure that the most recently used values are always available.
The example code in the SDK’s /opt/cell/sdk/src/samples/cache/sort directory shows how the cache can be used for sorting, and contains code for the quicksort (qsort.c) and heapsort (hsort.c). This discussion focuses on the heapsort code and its use of the software cache.
The heapsort is an in-place selection sort that manipulates elements of an array as if they were nodes in a binary tree. This tree is called the heap. For example, if the array to be sorted is given by arr = [7, 3, 2, 8, 4, 6, 1, 5, 9], the corresponding heap is presented in Figure 14.3.
It’s important to see how the tree nodes relate to the array elements. The root node (containing 7) corresponds to arr[0], and the root node’s children are arr[1] and arr[2]. For all successive nodes, the children of arr[i] are arr[2*i+1] and arr[2*i+2].
The goal of the heapsort is to swap parents and children until every parent node is larger than or equal to each of its children. The swapping starts at the bottom, comparing the lowest parent nodes to their children, and continues for higher nodes in the heap. This is commonly called the heapup, buildheap, or heapify process. When the swapping is finished, the root node (arr[0]) holds the largest value in the array.
Next, the root node and the last node in the tree are switched, and the last node is removed. In the preceding example, this means arr[0] and arr[8] are swapped, and arr is treated like an eight-element array. The new root must be swapped with its descendants until it is properly ordered in the tree. This process is commonly called downheap or heapdown.
The steps used by the SDK’s heapsort example are as follows:
Heapup: Arrange the binary tree so that the parent values are larger than the child values.
Swap: Replace the root value with the last node’s value and remove the last node.
Heapdown: Reorder binary tree.
Repeat previous steps until all nodes are removed.
To perform the heapsort, the SDK code creates the software cache with the following attributes:
```c
typedef vec_float4 item_t;

#define CACHE_NAME hsort
#define CACHED_TYPE item_t
#define CACHELINE_LOG2SIZE 7
#define CACHE_LOG2NWAY 2
#define CACHE_LOG2NSETS 6
#define CACHE_TYPE 1
```
The heapdown function in hsort.c shows how this cache is accessed in code:

```c
static inline void heapdown(item_t *a, int start, int count)
{
   int root, child;

   for (root = start; child = (root*2 + 1), child < count;) {
      item_t child_val = cache_rd(CACHE_NAME, &a[child]);
      item_t chpp_val = cache_rd(CACHE_NAME, &a[child+1]);

      if (child < (count - 1) && compare_leq(child_val, chpp_val)) {
         child += 1;
         child_val = chpp_val;
      }
      if (compare_and_swap(a, root, child, child_val, 0))
         root = child;
      else
         return;
   }
}
```
The first step loads the two child values from the cache. Next, the function determines the larger of the two and compares it to the root value using compare_and_swap. This function is contained in util.h and is presented in the following code (rewritten for clarity):

```c
static inline int compare_and_swap(CACHED_TYPE *a, int i, int j,
                                   CACHED_TYPE cmp_val, int polarity)
{
   CACHED_TYPE aii, ajj;
   CACHED_TYPE ai, aj;
   CACHED_TYPE *pi, *pj;

   // Load pointers, locking the first line while the second is fetched
   pi = cache_rw(CACHE_NAME, &a[i]);
   cache_lock(CACHE_NAME, pi);
   pj = cache_rw(CACHE_NAME, &a[j]);
   cache_unlock(CACHE_NAME, pi);

   ai = *pi;
   aj = *pj;

   int ai_leq_pivot = (ai <= cmp_val);

   // Use aii, ajj as swap variables
   aii = (ai_leq_pivot) ? aj : ai;
   ajj = (ai_leq_pivot) ? ai : aj;

   // Complete the swap
   *pi = aii;
   *pj = ajj;

   return (ai_leq_pivot) ? 1 : 0;
}
```
This function calls the unsafe cache function, cache_rw, to obtain pointers to the values to be swapped. The first value is locked in the cache to prevent its replacement while the second value is read. When both values are in cache, the function compares them, switches them if necessary, and returns whether the swap was made.
When the heapsort is finished, the application calls cache_pr_stats to display data about the cache’s operation. This produces two results. First, it displays data about the cache: its name, size, type, and structure. Second, it creates a table with statistics concerning the cache’s efficiency.

For each of the heapsort’s 64 iterations, cache_pr_stats lists the hit rate and miss rate for the application’s read and write operations. The hit percentage is listed in each case, as well as the number of replacements, writebacks, and cache flushes. On my system, the hit percentage for the heapsort averaged around 90% for reads and 99% for writes. It’s clear that the SDK’s software cache provides an effective mechanism for caching data.
Most methods of data protection are effective only if the intruder remains outside the processing system. But when the system is infiltrated, security applications can be disassembled, decompiled, and rebuilt with deliberate weaknesses. These weaknesses enable trespassers to access system data or even take control.
The Cell protects software by isolating SPUs at the hardware level. Once an SPU is isolated, no amount of hacking will gain access to its LS. Further, the memory regions containing the SPU’s object code are strongly encrypted.
As a practical example, consider Sony’s protection of its proprietary operating system in the PlayStation 3. Sony has made it possible for Linux developers to access the Cell’s main memory and run applications. At the same time, however, there’s no way any hacker can disable the hypervisor and access Sony’s GameOS. This is because Sony keeps one SPU for itself and runs it in isolation mode.
Not only are hackers unable to access the isolated SPU, they can’t even disassemble its executable. The object code has been encrypted and signed, and the PPU always decrypts and verifies its authenticity before launching it on the SPU. The decryption key can’t be accessed in software because it’s a hardware key, built into the Cell.
We might not be able to break this isolation, encryption, and digital signing, but we can use it for our own applications.
IBM has released two versions of its SPU security tools. The full-featured version requires a signed confidentiality disclosure agreement (CDA) with IBM, and the second is publicly available as part of the Extras ISO in the Cell SDK. The full version provides the following features:
Full usage of SPU isolation
Application signing and verification with public key infrastructure (PKI)
Data/application encryption using secret-key cryptography
A special simulator to test secure applications
Additional libraries that assist with secure application development
The publicly available version doesn’t allow for real SPU isolation, but provides an emulation mode that allows applications to test the behavior of an isolated SPU. It doesn’t provide a secure simulator or additional libraries, but it enables full application signing and verification. The publicly available version also doesn’t encrypt executables, but XORs their bits with a static value.
This section focuses on the publicly available version. This is provided as part of the SDK’s Extras package and can be installed with the following command:
yum install cell-spu-isolation*
Before a secure SPU executable can be launched, it must be signed by its owner and verified by the SPU loader. This prevents intruders from substituting malicious code into the build process. To see how signing works, you need to understand keys and the Public Key Infrastructure (PKI).
The signing process consists of an algorithm (usually a hash function) that operates on the bits in the executable and returns a numeric result called the signature. When the signature is appended to the executable file, the file is signed.
The signing algorithm (explained further in Chapter 18, “Multiprecision Processing and Monte Carlo Methods,”) depends on a particular value that only the file’s owner can access. This is the private key. Other parties don’t have this key, so they can’t sign the executable as its owner. However, they can perform a second computation to make sure that the executable was properly signed. This is called verification and requires a second value: the owner’s public key.
The owner makes its public key available through certificates. The primary function of a certificate is to match the owner’s identity with the owner’s public key, and this binding is called the public key infrastructure (PKI). PKI is frequently used to secure Internet communication and has been incorporated into many secure e-mail and web transfer applications.
The OpenSSL package for Linux contains tools for creating new keys and certificates, and it can be installed with yum install openssl. However, IBM has provided a sample private key (user_sign_key.pem) and a sample certificate (user_signed_crt.pem), both located in /opt/cell/sdk/prototype/src/examples/isolation/keystore. The sample code in this section relies on these two files.
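If you prefer to generate your own key and certificate with OpenSSL, commands along these lines work; the filenames mirror IBM’s samples and the -subj identity is a placeholder you should replace:

```shell
# Generate a 2048-bit RSA private key
openssl genrsa -out user_sign_key.pem 2048

# Create a self-signed certificate binding an identity to the key
openssl req -new -x509 -key user_sign_key.pem \
        -out user_signed_crt.pem -days 365 -subj "/CN=spu-signer"
```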
To create a secure Cell application, you have to make a number of changes to the regular build process. First, you have to link to a new startup file, iso_crt0.o, rather than crt0.o. The link step must be performed with the secure linker, emulated-loader.bin, which must access the secure linker script, elf32_spu.xi. Then the SPU executable must be signed before it can be embedded into a CESOF file.
The code in the Chapter14/sign project is simple, but the five build commands in the makefile can be hard to follow. The first command compiles spu_sign.c into spu_sign.o:

```sh
spu-gcc -c -o spu_sign.o spu_sign.c
```
The second command creates the secure SPU executable and is the most complicated step of the build. It identifies the location of the secure linker, emulated-loader.bin, and uses the -nostartfiles flag to tell the linker not to use the regular startup file, crt0.o. The secure startup file, iso_crt0.o, is linked instead.
At the time of this writing, the SDK places emulated-loader.bin in the /usr/lib/spe directory. For the 64-bit build process to work properly, this file must be copied from /usr/lib/spe into /usr/lib64/spe.
In addition to identifying the secure linker, the makefile also names the linker script (elf32_spu.xi) to be used for the linking. The full command for the second step is given by the following:
```sh
spu-gcc -o spu_sign_pre spu_sign_pre.o \
   /opt/cell/sdk/prototype/usr/spu/lib/iso_crt0.o \
   -Wl,-N -nostartfiles -L/usr/lib64/spe \
   -Wl,-T,/usr/spu/lib/ldscripts/elf32_spu.xi
```
The third step in the build signs the SPU executable. Once the private key and certificate are available, this can be done with a single command: spu-isolated-app. This tool handles signing and encryption, and it accepts the following parameters:

infile: Name of the SPU executable to be signed

outfile: Name of the signed SPU executable

signKey: File containing the private key

signCert: File containing the digital certificate

encryptSec: The file sections to be encrypted (optional)
The first four fields are straightforward, and the optional encryptSec parameter is discussed shortly. In the sign project, the unsigned spu_sign_pre is converted into the signed spu_sign with the following command:

```sh
spu-isolated-app spu_sign_pre spu_sign \
   $(KEYSTORE)/user_sign_key.pem $(KEYSTORE)/user_signed_crt.pem
```
where $(KEYSTORE) equals /opt/cell/sdk/prototype/src/examples/isolation/keystore. The optional field is left out because the publicly available spu-isolated-app tool doesn’t allow for encryption.
The fourth build command embeds the spu_sign executable into a PPU object file as if it were a regular SPU executable. The fifth compiles and links the PPU application. The makefile commands for these last two steps are as follows:

```sh
ppu-embedspu -m64 spu_sign spu_sign spu_sign.o
ppu-gcc -o ppu_sign ppu_sign.c spu_sign.o
```
Chapter 7, “The SPE Runtime Management Library (libspe),” presented the SPE Management library (libspe) and explained how the PPU accesses SPU object code through contexts. The PPU loads code into the contexts, creates threads, and then runs each context within a thread. To configure these contexts for secure operation, only one change needs to be made. The first parameter of spe_context_create must be set to one of the following:

SPE_ISOLATE: Prevents all external access to the SPU during its operation

SPE_ISOLATE_EMULATE: Emulates isolation mode
SPE_ISOLATE prevents external operations from accessing the SPU’s LS or affecting the SPU’s computation. With SPE_ISOLATE_EMULATE, the SPU only appears isolated.

The publicly available version of the security tools enables applications to emulate SPU isolation, but doesn’t allow for actual isolation. Therefore, applications relying on these tools must use SPE_ISOLATE_EMULATE when creating contexts.
To configure a context called ctx to execute in isolation-emulation mode, call the following function:

```c
ctx = spe_context_create(SPE_ISOLATE_EMULATE, NULL);
```
If the context has been configured for security, the PPU will verify the executable’s signature during the load process. If the executable hasn’t been signed properly, spe_context_load() throws an error.
Chapter 7 explained how to handle SPU events inside PPU code. spe_stop_info_read() gathers information about why an SPU stopped:

```c
spe_stop_info_read(events[i].spe, &stop_info);
```
The first argument identifies the SPU context, and the second stores information about why the SPU stopped running. If the stop_info’s stop_reason field equals 7, the SPU halted because of an isolation error. The spe_exit_code field can take values between 1 and 4:
The application is larger than the maximum permissible 200KB.
Mismatch between publicly available tools and CDA-provided tools.
The executable couldn’t be decrypted.
The executable couldn’t be verified.
Aside from these differences, the process of loading and running SPU contexts in a secure application is the same as for a regular application.
File signatures serve an important purpose, but if an intruder removes the signature of an SPU executable, it can be run as a regular executable. Real protection requires encryption. The CDA version of spu-isolated-app lets you specify which sections of the SPU executable should be encrypted. The section name is identified in the command’s fifth parameter, encryptSec, which can take values such as CODE, DATA, or ALL.
For example, to encrypt the .data section of an SPU executable, a fifth parameter must be added to the earlier command:
spu-isolated-app spu_sign_pre spu_sign $(KEYSTORE)/user_sign_key.pem $(KEYSTORE)/user_signed_crt.pem DATA
Appendix A provides more information about sections in PPU ELF and SPU ELF files.
The Cell relies on two 2048-bit keys for encryption: the application decryption key and the application authentication key. The first key is specific to the secure loader and remains the same for all applications. The second key changes for each application, and ensures that if an intruder breaks the application decryption key, the executable will stay encrypted.
The publicly available tool doesn’t encrypt the SPU executable; it simply XORs its bits with a static value, and the fifth parameter of spu-isolated-app has no effect. Because encryption is unavailable, this description won’t discuss the key hierarchy or the encrypted file system API. Readers interested in these subjects should look through Chapters 4 and 5 of the CBE Secure SDK Guide.
If you add a printf statement to spu_sign.c, the application will produce a segmentation fault at runtime. This is because isolated SPUs can’t call on the PPU to display output. Mailbox communication is available, but regular DMA commands such as mfc_get and mfc_put won’t work because most of the LS is blocked off from external access. To work around these restrictions, you need to call the functions in the SPU Isolation library, libisolation.
The library libisolation.a is located in /opt/cell/sdk/prototype/usr/spu/lib and provides most of the functions you’d expect from libc and libgloss, including printf. Table 14.4 lists the functions from these libraries that aren’t included in libisolation.a.
For the sake of security, PPE-assisted functions are allowed to access only a particular region of an SPU’s LS, called the auxiliary buffer. By default, this buffer is 192 bytes in size and occupies the LS memory region from 0x3E000 to 0x3E0C0. Figure 14.4 shows its position in the LS.
The auxiliary buffer must be large enough to store the parameters and results of all PPE-assisted functions invoked during an application’s execution. Specifically, it must have a base size of 16 bytes, 16 bytes for each function parameter, and 32 extra bytes for each call to printf, vprintf, fprintf, and vfprintf.
If this requires more memory than the buffer’s default size of 192 bytes, the buffer can be extended with change_ppuassist_buf_len(unsigned int), where the argument identifies the requested size of the auxiliary buffer. The maximum buffer size is 8016 bytes.
For example, if your SPU code calls printf with parameters whose total size comes to 256 bytes, the auxiliary buffer can be set to 16 + 32 + 256 = 304 bytes with the following instruction:
change_ppuassist_buf_len(304);
change_ppuassist_buf_len is provided as part of libisolation and is declared in the libisolation.h header.
An isolated SPU can transfer data to and from a small region of its LS. libisolation.a provides a number of functions for this purpose, and many of them access encrypted data and files. Two routines can be used for simple, unencrypted data transfer: copyin and copyout. Both take the same parameters: an effective address in main memory, a pointer into the LS, and the number of bytes to transfer.
In Listing 14.3, the SPU receives two mailbox messages and uses them to form an effective address. It calls copyin to transfer a vector from this address into its LS and calls printf to display the elements of the vector.
Example 14.3. Secure Communication with an Isolated SPU: spu_secure_comm.c
#include <stdio.h>
#include <spu_mfcio.h>
#include <libisolation.h>

int main()
{
   /* data to be sent */
   vector unsigned int msg;

   /* address of the data */
   unsigned long long ea;
   unsigned int i, ea_low, ea_high;

   /* Create effective address from mailbox data */
   ea_low = spu_read_in_mbox();
   ea_high = spu_read_in_mbox();
   ea = mfc_hl2ea(ea_high, ea_low);

   /* Access the data from main memory and copy it to the LS */
   if (copyin(ea, &msg, sizeof(msg)) != 0) {
      return -1;
   }

   /* Display vector contents */
   printf("Result: ");
   for (i = 0; i < 4; i++)
      printf("%u ", spu_extract(msg, i));
   printf("\n");

   return 0;
}
If you compare the makefile for the sign project to that of the secure_comm project, you’ll notice two changes. First, the secure_comm project links libisolation
into the build. Second, there are additional linker flags that allow the SPU to communicate with the PPU. Without these flags, the application will produce a segmentation fault.
This chapter presented three tools for building SPU applications with advanced features: overlays, software caching, and isolation. Overlays and software caching make it simpler for SPU applications to work within the LS’s limited memory. SPU isolation ensures that SPU executables can’t be read or corrupted by an intruder.
Overlays make it possible for an application to load object files into the LS as they’re needed. Adding overlays to a project doesn’t affect the code, but the build process changes significantly. The linker script must be modified to identify the overlay objects, their object files, and where they should be placed in memory.
The SDK’s software-managed cache contains functions that access the LS as if it were a cache. The first step to using the cache is creating attributes that identify its name, type, structure, and set associativity. Then the cache can be accessed with safe functions, which return data values, and unsafe functions, which return pointers to data. When an application stops using the cache, the statistics of its operation can be analyzed with cache_pr_stats.
The chapter ended with a discussion of SPU isolation. This is one of the most important and unique aspects of Cell applications, and few other processors can guarantee software security in hardware. In addition to SPU isolation, the security tool also protects executable code with strong encryption based on a hardware key.