All the SPU’s vital capabilities have been discussed: Single Instruction, Multiple Data (SIMD) processing, direct memory access (DMA), and channel communication. This chapter presents three topics that aren’t as commonly used, but may prove important as you code more advanced SPU applications: overlays, software caching, and SPU isolation.
Overlays allow SPU applications to dynamically incorporate code stored outside the local store (LS). They are frequently used in embedded systems, where application sizes may exceed the amount of available local memory. Applications with overlays are easy to code, but the overlay information has to be placed in the linker script.
Each SPU has a 256KB LS and a 2KB register file, but no cache. To make sure recently used data remains close at hand, the SDK provides a software cache capability with functions declared in cache-api.h. After the cache’s structure has been configured, the LS can be accessed with data-transfer functions that offer many advantages over regular DMA commands.
The Cell provides extraordinary security for SPU applications through hardware isolation. That is, if the PPU configures a context in isolation mode, the corresponding SPU’s LS cannot be accessed by external processing units. Further, SPU executables can be signed and encrypted to ensure that intruders can’t corrupt the application.
Ideally, all of your SPU applications will fit inside the 256KB LS with enough extra space left over for a stack and heap. But if an application exceeds this limit, you’ll need to transfer instructions in and out of the LS at runtime. Overlays provide this capability and function much like pages, which are swapped in and out of virtual memory. An overlay is an object in an SPU object file that can be loaded into an overlay region when called upon. For example, if object file bar.o contains an overlay function called baz(), the object containing baz() will be loaded into the LS only when baz() is called.
Loading overlays into memory takes a significant number of cycles, but it’s the simplest and most effective way to deal with SPU applications that require more than 256KB. For example, Figure 14.1 shows how overlays can be used if an SPU application depends on object files obj1.o, obj2.o, obj3.o, and obj4.o, but only obj1.o and obj2.o and one of obj3.o or obj4.o will fit in memory.
The process of using SPU overlays consists of four steps:
Compile SPU code, separating overlay object files from the build.
Create a linker script that identifies the overlay object files and the sections they contain.
Configure the build process to use the new linker script.
Invoke functions from the overlay objects during execution.
This section starts with a description of how a linker script defines overlays and then presents an example application that uses them.
The SPU linker, spu-ld, has a default script that tells it how to create SPU executables. To display the script commands, enter the following at the command line:

```sh
spu-ld --verbose
```

On my system, the default script is elf32_spu.x, located in the /usr/spu/lib/ldscripts directory.
The elf32_spu.x script defines the target architecture (spu), the entry point (_start), and the directories to be searched for libraries (/usr/spu/lib). But most of the file describes how and where sections of input files should be incorporated into the executable. Appendix A, “Understanding ELF Files,” explains what ELF sections are and how they work.
If an application uses overlays, they must be defined in the linker script. The Chapter14/overlay project directory identifies its overlays in ld.script. This custom linker script contains most of the content of elf32_spu.x, but adds a few linker statements needed for the project. The makefile in the overlay project includes this script into the build with the linker flag:
LD_FLAGS = -Wl,-T,ld.script
Two statements must be added to the linker script to enable overlay processing:
OVERLAY: Describes sections that may be loaded into the same memory region

EXCLUDE_FILE: Tells the linker not to include the overlay sections in the executable
The OVERLAY statement provides two pieces of information: the sections to be included in the executable and the object files that contain the sections. For example, suppose you want the linker to copy the .text section of overlay1.o and overlay2.o into the respective .out_section1 and .out_section2 sections of the executable. You’d add the following lines to the linker script:

```
OVERLAY :
{
  .out_section1 {overlay1.o(.text)}
  .out_section2 {overlay2.o(.text)}
}
```
The OVERLAY statement accepts further descriptors that identify the starting and loading addresses of the overlay sections, but this is the basic format. Conveniently, this is exactly the OVERLAY statement used in the overlay project’s linker script, ld.script.
Normally, the build process links the .text section of every input object file into the executable. This is specified in the linker script with

```
*(.text)
```
But overlay1.o and overlay2.o should not be loaded into the executable until they’re needed. ld.script excludes them with the following line:

```
*(EXCLUDE_FILE(overlay1.o overlay2.o) .text .stub .text.* .gnu.linkonce.t.*)
```
If you need to create an overlay application of your own, all you have to do is modify these two lines in ld.script and set the appropriate linker flag in the build process. Then the object files will be kept out of the SPU executable until they’re needed by the application.
In code, overlays are accessed as external functions. That is, if overlay1.o contains objects foo and bar, the SPU application can access them by declaring the following:

```c
extern foo();
extern bar();
```
These functions don’t occupy any memory in the LS until they are invoked. Then the overlay manager loads the corresponding external objects into the LS and the functions execute normally. The SPU code doesn’t need to know anything about the object files themselves—only the names of the overlay objects are important.
In the overlay project, the main application (spu_overlay.c) accesses functions declared in overlay1.c and overlay2.c. The first function, reverse_vector, calls spu_shuffle to rearrange the elements of a vector unsigned int. The second, reverse_again, performs the same operation and returns the vector elements to their original order.
Listing 14.1 presents spu_overlay.c, which declares and calls the overlay functions.
Example 14.1. SPU Overlay Calling Function: spu_overlay.c
```c
#include <stdio.h>
#include <spu_intrinsics.h>

/* Declare functions in overlay1.o and overlay2.o */
extern vector unsigned int reverse_vector(vector unsigned int);
extern vector unsigned int reverse_again(vector unsigned int);

int main(unsigned long long speid, unsigned long long argp,
         unsigned long long envp)
{
   vector unsigned int test_vec = (vector unsigned int){0, 1, 2, 3};

   printf("Calling the overlay functions:\n");

   /* Call the external functions */
   test_vec = reverse_vector(test_vec);
   test_vec = reverse_again(test_vec);

   return 0;
}
```
Listing 14.2 presents overlay1.c, which reverses the order of the input vector’s elements. The code in overlay2.c is almost exactly the same.
Example 14.2. SPU Overlay Function: overlay1.c
```c
#include <stdio.h>
#include <spu_intrinsics.h>

vector unsigned int reverse_vector(vector unsigned int test)
{
   int i;
   vector unsigned char indexVec = {12, 13, 14, 15, 8, 9, 10, 11,
                                    4, 5, 6, 7, 0, 1, 2, 3};

   /* Rearrange the vector */
   test = spu_shuffle(test, test, indexVec);

   /* Display the results */
   printf("reverse_vector: ");
   for (i = 0; i < 4; i++)
      printf("%u ", spu_extract(test, i));
   printf("\n");

   return test;
}
```
The printed result is as follows:

```
Calling the overlay functions:
reverse_vector: 3 2 1 0
reverse_again: 0 1 2 3
```
The overlay functions, reverse_vector and reverse_again, look and behave like regular SPU functions. Although overlay1.c and overlay2.c contain only a single function each, an overlay file can contain multiple overlay functions.
An SPU’s LS is not a cache, but it can act like one with the right software. The SPU software managed cache provides a set of read/write functions that treat the LS as though it were a cache. Using it gives you two important advantages.
The first advantage is that data transfer becomes much simpler. When you read data into the LS using cache functions, you don’t have to worry about where the data should be stored. Data placement is determined solely by the cache structure. And because the cache is formed of equally sized cache lines, you don’t have to specify how much data should be transferred. Each cache operation transfers one cache line at a time.
The second and more important advantage becomes apparent when an application requires more data than the LS can hold. When this happens, the programmer usually has to decide how memory should be replaced in the LS and the exact locations where existing data should be overwritten. But the software cache handles data replacement automatically: Its replacement algorithm determines which cache lines should be overwritten and where new data should be stored.
Accessing the software managed cache in SPU code requires three steps:
Insert #define statements with specific attributes to configure the cache.

Include the cache-api.h header file after the #define statements.

Transfer data to and from the LS using cache read and write functions.
After you’ve configured a software cache, you should only access the LS with cache functions. You shouldn’t use mfc_get or mfc_put to transfer data to or from the cache. These unregulated transfers can damage its structure.
A software cache isn’t a data structure that can be allocated or deallocated. Instead, a cache’s parameters are configured through attributes that tell the cache read/write functions how to operate. These attributes are listed in Table 14.1.
Table 14.1. Software Cache Configuration Attributes
Attribute | Purpose | Range | Default |
---|---|---|---|
CACHE_NAME | Unique identifier of the cache | Any string | None |
CACHE_TYPE | Cache type (0 - read-only, 1 - read/write) | 0, 1 | 1 (read/write) |
CACHED_TYPE | Identifies the datatype being cached | Any datatype | None |
CACHE_LOG2NWAY | Log2 of the cache associativity | 0, 2 | 2 (4-way) |
CACHE_LOG2NSETS | Log2 of the number of sets in cache | 0–12 | 6 (64 sets) |
CACHELINE_LOG2SIZE | Log2 of a cache line size | 4–12 | 7 (128 bytes) |
CACHE_SET_TAGID(set) | Tag identifier of cache DMA transfers | — | — |
CACHE_READ_X4 | Enables cache access through cache_rd_x4 | — | — |
CACHE_STATS | Enables monitoring of cache activity | — | — |
The CACHE_NAME attribute identifies the cache currently in use, and multiple caches can be created at once as long as they have different names. The next two attributes, CACHE_TYPE and CACHED_TYPE, look alike but have very different purposes. CACHE_TYPE determines whether the cache is read-only (0) or read/write (1). CACHED_TYPE identifies the datatype to be stored in the cache.
For example, the following lines configure a read-only cache called EXAMPLE_CACHE that stores c_type values:

```c
#define CACHE_NAME EXAMPLE_CACHE
#define CACHE_TYPE 0
#define CACHED_TYPE c_type
```
To enable read/write access to the cache, set CACHE_TYPE equal to one. The SDK documentation recommends setting CACHED_TYPE equal to a previously typedefed datatype to avoid errors involving operator precedence.
One of the most crucial concerns in configuring the cache is determining how the cache data should be mapped to main memory. This associativity is controlled by the CACHE_LOG2NWAY attribute, which can take one of two values:
0 - Direct-mapped: Main memory is partitioned into the same number of blocks as there are cache lines, and each cache line stores data for a specific block. Each cache line is its own set.

2 - 4-way set associative: The N-line cache is divided into N/4 sets of four lines each, and each main memory block can be mapped to any of the four lines in a set.
Figure 14.2 depicts both methods. The 4-way associative cache mapping on the right is chosen by default.
The 4-way set associative mapping reduces the chances of collisions, because a memory location can be cached in one of four different lines. But when the cache is read, all four lines in the set must be searched. Direct mapping makes searching unnecessary since each memory location can be cached in only one location. If multiple addresses mapped to a cache line need to be cached, however, only one will be cached. The others, if accessed, will cause cache misses.
After the cache’s associativity has been configured, the number of sets can be identified with the CACHE_LOG2NSETS attribute. The default value is 6, but it can be set as high as 12 for a total of 2^12 = 4,096 sets. The size of each cache line is set with CACHELINE_LOG2SIZE, whose maximum is 12 and default is 7, for a cache line size of 2^7 = 128 bytes. For example, to create a 4-way set associative cache with 128 sets and cache lines of 256 bytes, you’d use the following configuration:

```c
#define CACHE_LOG2NWAY 2
#define CACHE_LOG2NSETS 7
#define CACHELINE_LOG2SIZE 8
```
The first line can be left out because 2 is the default value. The total size of this cache = 256 bytes/line * 4 lines/set * 128 sets = 128KB, which takes up half the LS. The default attributes produce a cache of 128 bytes/line * 4 lines/set * 64 sets = 32KB.
The last three attributes in Table 14.1 specify less-crucial information. The first, CACHE_SET_TAGID(set), identifies the tag value to be used for all of the cache’s DMA transfers between the LS and main memory. CACHE_READ_X4 makes it possible to read values as vector unsigned ints, and can be used only when CACHED_TYPE is a 32-bit integral type. The last attribute, CACHE_STATS, makes it possible to access cache statistics such as hits and misses for load and store operations. This attribute is discussed shortly.
After you’ve configured the attributes in code, be sure to include the cache-api.h header. The code in this header depends on the cache attributes to operate.
The cache-api.h header declares two types of functions that interface the software cache: safe and unsafe. Safe functions allow you to read and write to the cache, but don’t allow pointer access to the LS. Unsafe functions return pointers to data in the LS, which make it possible to disturb the cache structure.
Safe functions are somewhat slower than the unsafe functions, but they reduce the possibility of error when accessing the LS. Table 14.2 lists all four of them.
Table 14.2. Safe Software Cache Access Functions
Function | Return Value | Purpose |
---|---|---|
cache_rd(name, ea) | CACHED_TYPE | Read a value from the effective address ea |
cache_wr(name, ea, value) | void | Write value to the effective address ea |
cache_flush(name) | void | Write modified cache lines to main memory |
cache_rd_x4(name, ea) | quadword vector | Read a vector of values from cache |
Each function accepts a name parameter that identifies the cache being accessed (configured by CACHE_NAME). The first two functions, cache_rd and cache_wr, operate as their names imply: The first returns a CACHED_TYPE value from the effective address ea, and the second writes a CACHED_TYPE value to ea. If the data at ea hasn’t been cached, these functions block until the data is transferred from main memory into the LS.
The software cache is writeback, which means that cache writes aren’t committed to main memory until necessary. The third function in Table 14.2, cache_flush, makes this update necessary. That is, it forces all modified cache lines to be written to their respective addresses in main memory.
The last function, cache_rd_x4, is available only if the CACHE_READ_X4 attribute is defined and CACHED_TYPE is a 32-bit integral type. If the four values at ea are already in cache, the function returns immediately. Otherwise, it blocks until the vector is fetched from main memory.
Safe functions are useful when you want to test how well an application works with a given cache structure. But once you’re certain the application works properly, you can improve performance and flexibility by using the functions in Table 14.3. These are unsafe because they provide pointers to CACHED_TYPE structures rather than the CACHED_TYPE structures themselves.
Table 14.3. Unsafe Software Cache Access Functions
Function | Return Value | Purpose |
---|---|---|
cache_rw(name, ea) | CACHED_TYPE * | Provides ptr for modifying cache (blocks) |
cache_touch(name, ea) | CACHED_TYPE * | Provides ptr for modifying cache (no block) |
cache_wait(name, lsa) | void | Blocks until the data at lsa is cached |
cache_lock(name, lsa) | void | Locks the cache line associated with lsa |
cache_unlock(name, lsa) | void | Unlocks the cache line associated with lsa |
cache_rw returns a pointer to the cached data with effective address ea. This is a pointer to a CACHED_TYPE structure, and it can be used for reading or writing to the LS. This function blocks until the data is available in the cache.
To see the advantage provided by cache_rw, let’s say you want to increment a value stored in the cache at address ea_addr. With safe functions, you’d have to use the following:

```c
c_type value = cache_rd(EXAMPLE_CACHE, ea_addr);
value++;
cache_wr(EXAMPLE_CACHE, ea_addr, value);
```
This requires two accesses to cache memory. But by calling cache_rw, you can increment the value with one access:

```c
c_type *value = cache_rw(EXAMPLE_CACHE, ea_addr);
(*value)++;
```
cache_touch is like cache_rw but returns immediately. If the data at ea hasn’t been cached yet, the pointer returned by cache_touch won’t be immediately usable. This function is commonly called to cache a memory location before the actual data is needed. The next function, cache_wait, forces the SPU to halt until the memory is cached. Afterward, the pointer can be used as if it had been returned by cache_rw.
If the software cache needs to cache new data but no space is available in the LS, it uses the least recently used (LRU) replacement algorithm. That is, it discards the cache line that has been used least recently. To prevent a cached location from being discarded, call cache_lock(name, lsa). This assumes that you’ve acquired the LS pointer using cache_rw or cache_touch. When your code no longer needs the lock, cache_unlock makes the cache location available for replacement.
The SPU software cache is particularly useful when dealing with large amounts of data partitioned into small sizes. For example, suppose the SPU needs to sort a list of 32-bit values that is larger than the LS can hold. The cache makes sure that the most recently used values are always available.
The example code in the SDK’s /opt/cell/sdk/src/samples/cache/sort directory shows how the cache can be used for sorting, and contains code for the quicksort (qsort.c) and heapsort (hsort.c). This discussion focuses on the heapsort code and its use of the software cache.
The heapsort is an in-place selection sort that manipulates elements of an array as if they were nodes in a binary tree. This tree is called the heap. For example, if the array to be sorted is given by arr = [7, 3, 2, 8, 4, 6, 1, 5, 9], the corresponding heap is presented in Figure 14.3.
It’s important to see how the tree nodes relate to the array elements. The root node (containing 7) corresponds to arr[0], and the root node’s children are arr[1] and arr[2]. For all successive nodes, the children of arr[i] are arr[2*i+1] and arr[2*i+2].
The goal of the heapsort is to swap parents and children until every parent node is larger than or equal to each of its children. The swapping starts at the bottom, comparing the lowest parent nodes to their children, and continues for higher nodes in the heap. This is commonly called the heapup, buildheap, or heapify process. When the swapping is finished, the root node (arr[0]) holds the largest value in the array.
Next, the root node and the last node in the tree are switched, and the last node is removed. In the preceding example, this means arr[0] and arr[8] are swapped, and arr is treated like an eight-element array. The new root must be swapped with its descendants until it is properly ordered in the tree. This process is commonly called downheap or heapdown.
The steps used by the SDK’s heapsort example are as follows:
Heapup: Arrange the binary tree so that the parent values are larger than the child values.
Swap: Replace the root value with the last node’s value and remove the last node.
Heapdown: Reorder binary tree.
Repeat previous steps until all nodes are removed.
To perform the heapsort, the SDK code creates the software cache with the following attributes:
```c
typedef vec_float4 item_t;

#define CACHE_NAME hsort
#define CACHED_TYPE item_t
#define CACHELINE_LOG2SIZE 7
#define CACHE_LOG2NWAY 2
#define CACHE_LOG2NSETS 6
#define CACHE_TYPE 1
```
The heapdown function in hsort.c shows how this cache is accessed in code:

```c
static inline void heapdown(item_t *a, int start, int count)
{
   int root, child;

   for (root = start; child = (root*2 + 1), child < count;) {
      item_t child_val = cache_rd(CACHE_NAME, &a[child]);
      item_t chpp_val = cache_rd(CACHE_NAME, &a[child+1]);

      if (child < (count - 1) && compare_leq(child_val, chpp_val)) {
         child += 1;
         child_val = chpp_val;
      }
      if (compare_and_swap(a, root, child, child_val, 0))
         root = child;
      else
         return;
   }
}
```
The first step loads the two child values from the cache. Next, the function determines the larger of the two and compares it to the root value using compare_and_swap. This function is contained in util.h and is presented in the following code (rewritten for clarity):

```c
static inline int compare_and_swap(CACHED_TYPE *a, int i, int j,
                                   CACHED_TYPE cmp_val, int polarity)
{
   CACHED_TYPE aii, ajj;
   CACHED_TYPE ai, aj;
   CACHED_TYPE *pi, *pj;

   // Load pointers, locking the first line while the second is fetched
   pi = cache_rw(CACHE_NAME, &a[i]);
   cache_lock(CACHE_NAME, pi);
   pj = cache_rw(CACHE_NAME, &a[j]);
   cache_unlock(CACHE_NAME, pi);

   ai = *pi;
   aj = *pj;

   int ai_leq_pivot = (ai <= cmp_val);

   // Use aii, ajj as swap variables
   aii = (ai_leq_pivot) ? aj : ai;
   ajj = (ai_leq_pivot) ? ai : aj;

   // Complete the swap
   *pi = aii;
   *pj = ajj;

   return (ai_leq_pivot) ? 1 : 0;
}
```
This function calls the unsafe cache function, cache_rw, to obtain pointers to the values to be swapped. The first value is locked in the cache to prevent its replacement while the second value is read. When both values are in cache, the function compares them, switches them if necessary, and returns whether the swap was made.
When the heapsort is finished, the application calls cache_pr_stats to display data about the cache’s operation. This produces two results. First, it displays data about the cache: its name, size, type, and structure. Second, it creates a table with statistics concerning the cache’s efficiency.

For each of the heapsort’s 64 iterations, cache_pr_stats lists the hit rate and miss rate for the application’s read and write operations. The hit percentage is listed in each case, as well as the number of replacements, writebacks, and cache flushes. On my system, the hit percentage for the heapsort averaged around 90% for reads and 99% for writes. It’s clear that the SDK’s software cache provides an effective mechanism for caching data.
Most methods of data protection are effective only if the intruder remains outside the processing system. But when the system is infiltrated, security applications can be disassembled, decompiled, and rebuilt with deliberate weaknesses. These weaknesses enable trespassers to access system data or even take control.
The Cell protects software by isolating SPUs at the hardware level. Once an SPU is isolated, no amount of hacking will gain access to its LS. Further, the memory regions containing the SPU’s object code are strongly encrypted.
As a practical example, consider Sony’s protection of its proprietary operating system in the PlayStation 3. Sony has made it possible for Linux developers to access the Cell’s main memory and run applications. At the same time, however, there’s no way any hacker can disable the hypervisor and access Sony’s GameOS. This is because Sony keeps one SPU for itself and runs it in isolation mode.
Not only are hackers unable to access the isolated SPU, they can’t even disassemble its executable. The object code has been encrypted and signed, and the PPU always decrypts and verifies its authenticity before launching it on the SPU. The decryption key can’t be accessed in software because it’s a hardware key, built into the Cell.
We might not be able to break this isolation, encryption, and digital signing, but we can use it for our own applications.
IBM has released two versions of its SPU security tools. The full-featured version requires a signed confidentiality disclosure agreement (CDA) with IBM, and the second is publicly available as part of the Extras ISO in the Cell SDK. The full version provides the following features:
Full usage of SPU isolation
Application signing and verification with public key infrastructure (PKI)
Data/application encryption using secret-key cryptography
A special simulator to test secure applications
Additional libraries that assist with secure application development
The publicly available version doesn’t allow for real SPU isolation, but provides an emulation mode that allows applications to test the behavior of an isolated SPU. It doesn’t provide a secure simulator or additional libraries, but it enables full application signing and verification. The publicly available version also doesn’t encrypt executables, but XORs their bits with a static value.
This section focuses on the publicly available version. This is provided as part of the SDK’s Extras package and can be installed with the following command:
yum install cell-spu-isolation*
Before a secure SPU executable can be launched, it must be signed by its owner and verified by the SPU loader. This prevents intruders from substituting malicious code into the build process. To see how signing works, you need to understand keys and the Public Key Infrastructure (PKI).
The signing process consists of an algorithm (usually a hash function) that operates on the bits in the executable and returns a numeric result called the signature. When the signature is appended to the executable file, the file is signed.
The signing algorithm (explained further in Chapter 18, “Multiprecision Processing and Monte Carlo Methods,”) depends on a particular value that only the file’s owner can access. This is the private key. Other parties don’t have this key, so they can’t sign the executable as its owner. However, they can perform a second computation to make sure that the executable was properly signed. This is called verification and requires a second value: the owner’s public key.
The owner makes its public key available through certificates. The primary function of a certificate is to match the owner’s identity with the owner’s public key, and this binding is called the public key infrastructure (PKI). PKI is frequently used to secure Internet communication and has been incorporated into many secure e-mail and web transfer applications.
The OpenSSL package for Linux contains tools for creating new keys and certificates, and it can be installed with yum install openssl. However, IBM has provided a sample private key (user_sign_key.pem) and a sample certificate (user_signed_crt.pem), both located in /opt/cell/sdk/prototype/src/examples/isolation/keystore. The sample code in this section relies on these two files.
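If you prefer to generate your own key and certificate with OpenSSL, commands along these lines work; the filenames mirror IBM’s samples and the -subj identity is a placeholder you should replace:

```shell
# Generate a 2048-bit RSA private key
openssl genrsa -out user_sign_key.pem 2048

# Create a self-signed certificate binding an identity to the key
openssl req -new -x509 -key user_sign_key.pem \
        -out user_signed_crt.pem -days 365 -subj "/CN=spu-signer"
```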
To create a secure Cell application, you have to make a number of changes to the regular build process. First, you have to link to a new startup file, iso_crt0.o, rather than crt0.o. The link step must be performed with the secure linker, emulated-loader.bin, which must access the secure linker script, elf32_spu.xi. Then the SPU executable must be signed before it can be embedded into a CESOF file.
The code in the Chapter14/sign project is simple, but the five build commands in the makefile can be hard to follow. The first command compiles spu_sign.c into spu_sign.o:

```sh
spu-gcc -c -o spu_sign.o spu_sign.c
```
The second command creates the secure SPU executable and is the most complicated step of the build. It identifies the location of the secure linker, emulated-loader.bin, and uses the -nostartfiles flag to tell the linker not to use the regular startup file, crt0.o. The secure startup file, iso_crt0.o, is linked instead.
At the time of this writing, the SDK places emulated-loader.bin in the /usr/lib/spe directory. For the 64-bit build process to work properly, this file must be copied from /usr/lib/spe into /usr/lib64/spe.
In addition to identifying the secure linker, the makefile also names the linker script (elf32_spu.xi) to be used for the linking. The full command for the second step is given by the following:
```sh
spu-gcc -o spu_sign_pre spu_sign_pre.o \
   /opt/cell/sdk/prototype/usr/spu/lib/iso_crt0.o \
   -Wl,-N -nostartfiles -L/usr/lib64/spe \
   -Wl,-T,/usr/spu/lib/ldscripts/elf32_spu.xi
```
The third step in the build signs the SPU executable. Once the private key and certificate are available, this can be done with a single command: spu-isolated-app. This tool handles signing and encryption, and it accepts the following parameters:

infile: Name of the SPU executable to be signed

outfile: Name of the signed SPU executable

signKey: File containing the private key

signCert: File containing the digital certificate

encryptSec: The file sections to be encrypted (optional)
The first four fields are straightforward, and the optional encryptSec parameter is discussed shortly. In the sign project, the unsigned spu_sign_pre is converted into the signed spu_sign with the following command:

```sh
spu-isolated-app spu_sign_pre spu_sign \
   $(KEYSTORE)/user_sign_key.pem $(KEYSTORE)/user_signed_crt.pem
```
where $(KEYSTORE) equals /opt/cell/sdk/prototype/src/examples/isolation/keystore. The optional field is left out because the publicly available spu-isolated-app tool doesn’t allow for encryption.
The fourth build command embeds the spu_sign executable into a PPU object file as if it were a regular SPU executable. The fifth compiles and links the PPU application. The makefile commands for these last two steps are as follows:

```sh
ppu-embedspu -m64 spu_sign spu_sign spu_sign.o
ppu-gcc -o ppu_sign ppu_sign.c spu_sign.o
```
Chapter 7, “The SPE Runtime Management Library (libspe),” presented the SPE Management library (libspe) and explained how the PPU accesses SPU object code through contexts. The PPU loads code into the contexts, creates threads, and then runs each context within a thread. To configure these contexts for secure operation, only one change needs to be made. The first parameter of spe_context_create must be set to one of the following:

SPE_ISOLATE: Prevents all external access to the SPU during its operation

SPE_ISOLATE_EMULATE: Emulates isolation mode
SPE_ISOLATE prevents external operations from accessing the SPU’s LS or affecting the SPU’s computation. With SPE_ISOLATE_EMULATE, the SPU only appears isolated.

The publicly available version of the security tools enables applications to emulate SPU isolation, but doesn’t allow for actual isolation. Therefore, applications relying on these tools must use SPE_ISOLATE_EMULATE when creating contexts.
To configure a context called ctx to execute in isolation-emulation mode, call the following function:

```c
ctx = spe_context_create(SPE_ISOLATE_EMULATE, NULL);
```
If the context has been configured for security, the PPU will verify the executable’s signature during the load process. If the executable hasn’t been signed properly, spe_context_load() throws an error.
Chapter 7 explained how to handle SPU events inside PPU code. spe_stop_info_read() gathers information about why an SPU stopped:

```c
spe_stop_info_read(events[i].spe, &stop_info);
```
The first argument identifies the SPU context, and the second stores information about why the SPU stopped running. If the stop_info’s stop_reason field equals 7, the SPU halted because of an isolation error. The spe_exit_code field can take values between 1 and 4:
The application is larger than the maximum permissible 200KB.
Mismatch between publicly available tools and CDA-provided tools.
The executable couldn’t be decrypted.
The executable couldn’t be verified.
Aside from these differences, the process of loading and running SPU contexts in a secure application is the same as for a regular application.
File signatures serve an important purpose, but if an intruder removes the signature of an SPU executable, it can be run as a regular executable. Real protection requires encryption. The CDA version of spu-isolated-app lets you specify which sections of the SPU executable should be encrypted. The section name is identified in the command’s fifth parameter, encryptSec, which can take values such as CODE, DATA, or ALL.
For example, to encrypt the .data section of an SPU executable, a fifth parameter must be added to the earlier command:
spu-isolated-app spu_sign_pre spu_sign $(KEYSTORE)/user_sign_key.pem $(KEYSTORE)/user_signed_crt.pem DATA
Appendix A provides more information about sections in PPU ELF and SPU ELF files.
The Cell relies on two 2048-bit keys for encryption: the application decryption key and the application authentication key. The first key is specific to the secure loader and remains the same for all applications. The second key changes for each application, and ensures that if an intruder breaks the application decryption key, the executable will stay encrypted.
The publicly available tool doesn’t encrypt the SPU executable; it simply XORs its bits with a static value, and the fifth parameter of spu-isolated-app has no effect. Because encryption is unavailable, this description won’t discuss the key hierarchy or the encrypted file system API. Readers interested in these subjects should look through Chapters 4 and 5 of the CBE Secure SDK Guide.
If you add a printf statement to spu_sign.c, the application will produce a segmentation fault at runtime. This is because isolated SPUs can’t call on the PPU to display output. Mailbox communication is available, but regular DMA commands such as mfc_get and mfc_put won’t work because most of the LS is blocked off from external access. To work around these restrictions, you need to call the functions in the SPU Isolation library, libisolation.
The library libisolation.a is located in /opt/cell/sdk/prototype/usr/spu/lib and provides most of the functions you’d expect from libc and libgloss, including printf. Table 14.4 lists the functions from these libraries that aren’t included in libisolation.a.
For the sake of security, PPE-assisted functions are allowed to access only a particular region of an SPU’s LS, called the auxiliary buffer. By default, this buffer is 192 bytes in size and occupies the LS memory region from 0x3E000 to 0x3E0C0. Figure 14.4 shows its position in the LS.
The auxiliary buffer must be large enough to store the parameters and results of all PPE-assisted functions invoked during an application’s execution. Specifically, it must have a base size of 16 bytes, 16 bytes for each function parameter, and 32 extra bytes for each call to printf, vprintf, fprintf, and vfprintf.
If this requires more memory than the buffer’s default size of 192 bytes, the buffer can be extended with change_ppuassist_buf_len(unsigned int), where the argument identifies the requested size of the auxiliary buffer. The maximum buffer size is 8016 bytes.
For example, if your SPU code calls printf with parameters whose total size comes to 256 bytes, the auxiliary buffer can be set to 16 + 32 + 256 = 304 bytes with the following instruction:
change_ppuassist_buf_len(304);
change_ppuassist_buf_len is provided as part of libisolation and is declared in the libisolation.h header.
An isolated SPU can transfer data to and from a small region of its LS. libisolation.a provides a number of functions for this purpose, and many of them access encrypted data and files. Two routines can be used for simple, unencrypted data transfer: copyin and copyout. Both take the same parameters: an effective address in main memory, a pointer into the LS, and the number of bytes to transfer.
In Listing 14.3, the SPU receives two mailbox messages and uses them to form an effective address. It calls copyin to transfer a vector from this address into its LS and calls printf to display the elements of the vector.
Example 14.3. Secure Communication with an Isolated SPU: spu_secure_comm.c
#include <stdio.h>
#include <spu_mfcio.h>
#include <libisolation.h>

int main()
{
   /* data to be sent */
   vector unsigned int msg;

   /* address of the data */
   unsigned long long ea;
   unsigned int i, ea_low, ea_high;

   /* Create effective address from mailbox data */
   ea_low = spu_read_in_mbox();
   ea_high = spu_read_in_mbox();
   ea = mfc_hl2ea(ea_high, ea_low);

   /* Access the data from main memory and copy it to the LS */
   if (copyin(ea, &msg, sizeof(msg)) != 0) {
      return -1;
   }

   /* Display vector contents */
   printf("Result: ");
   for (i = 0; i < 4; i++)
      printf("%u ", spu_extract(msg, i));
   printf("\n");

   return 0;
}
If you compare the makefile for the sign project to that of the secure_comm project, you’ll notice two changes. First, the secure_comm project links libisolation
into the build. Second, there are additional linker flags that allow the SPU to communicate with the PPU. Without these flags, the application will produce a segmentation fault.
This chapter presented three tools for building SPU applications with advanced features: overlays, software caching, and isolation. Overlays and software caching make it simpler for SPU applications to work within the LS’s limited memory. SPU isolation ensures that SPU executables can’t be read or corrupted by an intruder.
Overlays make it possible for an application to load object files into the LS as they’re needed. Adding overlays to a project doesn’t affect the code, but the build process changes significantly. The linker script must be modified to identify the overlay objects, their object files, and where they should be placed in memory.
The SDK’s software-managed cache contains functions that access the LS as if it were a cache. The first step to using the cache is creating attributes that identify its name, type, structure, and set associativity. Then the cache can be accessed with safe functions, which return data values, and unsafe functions, which return pointers to data. When an application stops using the cache, the statistics of its operation can be analyzed with cache_pr_stats.
The chapter ended with a discussion of SPU isolation. This is one of the most important and unique aspects of Cell applications, and few other processors can guarantee software security in hardware. In addition to SPU isolation, the security tool also protects executable code with strong encryption based on a hardware key.