Chapter 9. SIMD Programming on the PPU, Part 2: Methods and Algorithms

The previous chapter discussed the AltiVec, SIMD Math, and MASSV libraries in great detail, but there’s more to SIMD programming than just knowing the functions. This chapter delves deeper into vector coding and describes methods that will improve the speed and efficiency of your applications. In particular, it covers two points:

  • Accessing unaligned memory and unrolling loops

  • Structure of arrays (SOA) versus array of structures (AOS)

This chapter also provides many real-world examples of vector computation on the PPU. This will not only give you a better understanding of AltiVec coding, but you’ll also have a set of common, high-performance vector utilities that you can use in your own applications. These examples are taken from the libfreevec library, graciously released under the GNU Lesser General Public License (LGPL) by Konstantinos Margaritis and freevec.org.

From Scalars to Vectors

When it comes to converting code to run on a vector processor, there are no fixed guidelines. A good strategy is to profile your application, locate the bottlenecks, and determine where (and if) vector routines would offer an advantage. The goal of this section is to assist with this process and help you incorporate SIMD code into your applications.

Accessing Unaligned Memory and Unrolling Loops

Let’s say the bottleneck in your application is caused by the following two lines of code:

for (i=0; i<1028; i++)
   c[i] = a[i+1] + b[i+2];

This is an ideal target for vectorization, but there’s a problem. If the a, b, and c arrays are aligned on 16-byte boundaries, the operands, a[i+1] and b[i+2], are unaligned, as shown in Figure 9.1.

Figure 9.1. Unaligned arrays in memory

The following code loads values from int arrays a[] and b[] into vectors a_vec and b_vec. Each vector contains the first four elements of the array and the last line adds a_vec and b_vec together:

vector signed int a_vec = vec_ld(0, a);
vector signed int b_vec = vec_ld(0, b);
vector signed int c_vec = vec_ld(0, c);
c_vec = vec_add(a_vec, b_vec);

vec_add adds a[0] to b[0], a[1] to b[1], a[2] to b[2], and a[3] to b[3]. This is not what we want. We want c[0] to equal a[1] + b[2], c[1] to equal a[2] + b[3], and so on.

We need a way to reposition the elements in the vectors so that a[2], b[3], and c[0] are lined up together. Then, when vec_add is performed, the elements of a and b will be added as required by the processing loop. This desired result is shown in Figure 9.2.

Figure 9.2. Unaligned arrays shifted in vectors

To get from Figure 9.1 to Figure 9.2, the elements of a[] must be loaded into vectors a_vec[] and shifted left once. The elements of b[] need to be loaded into vectors b_vec[] and shifted left twice. When this is done, a_vec[] and b_vec[] can be added together, and the sum will be properly positioned in c_vec[].

Note

The array elements are not shifted in memory. They are loaded into the PPU’s vector registers and shifted inside the registers.

The vector shifting is performed by vec_perm, which rearranges the elements of two vectors according to a third index vector. The index vector is created by vec_lvsl, which accepts an offset and an address, adds them together, and returns an index vector that will shift the vectors’ elements such that the addressed element is placed into the most significant position.

vec_perm and vec_lvsl

Before attacking the full code for vectorized addition, let’s look at a brief example of how vec_lvsl and vec_perm work together. Suppose we want to load the first eight values of a[] into vectors a_vec1 and a_vec2, and shift a_vec1 left by 1. This shifting is performed by vec_perm; but to use this function, we need a suitable index vector. This is generated by vec_lvsl:

int numShift = 1;
indexVec = vec_lvsl(numShift * sizeof(int), a);

vec_lvsl adds the input arguments and checks the last 4 bits of the resulting address. Then it returns an index vector such that, when used in vec_perm, the vector elements will be shifted so that the resulting address (a[1] in this case) will be placed in the most significant position in the vector. In this example, vec_lvsl sets indexVec equal to

{4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19}

You can also create an index vector that shifts elements right with vec_lvsr.

The following code loads a[0:3] into vec_a and a[4:7] into vec_b. Then it generates an index vector, indexVec, that can be used to shift the integers left by one. vec_perm accesses indexVec to select bytes of vec_a and vec_b and place a[1:4] into vec_a:

vector signed int vec_a, vec_b;
vector unsigned char indexVec;
vec_a = vec_ld(0, a);
vec_b = vec_ld(sizeof(vector signed int), a);
indexVec = vec_lvsl(sizeof(int), a);
vec_a = vec_perm(vec_a, vec_b, indexVec);

If you look at the indexVec values and compare them to the index in Figure 8.4, you can better appreciate how vec_perm works. It shifts the elements of vec_a to the left, and then shifts in the most significant integer of vec_b.
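Since these intrinsics only run on PowerPC hardware, a scalar model helps confirm the byte arithmetic. In the following sketch, model_lvsl, model_perm, and demo_shift are illustrative names, not AltiVec functions; they mimic how vec_lvsl builds an index vector and how vec_perm selects bytes from the 32-byte concatenation of two vectors:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of vec_lvsl: byte indices {n, n+1, ..., n+15}, where n is the
   low 4 bits of (offset + address) */
static void model_lvsl(uint8_t index[16], long offset, const void *addr)
{
   uint8_t n = (uint8_t)(((uintptr_t)addr + (uintptr_t)offset) & 0xF);
   int i;
   for (i = 0; i < 16; i++)
      index[i] = (uint8_t)(n + i);
}

/* Model of vec_perm: each output byte selects from the 32-byte
   concatenation of a and b (indices 0-15 pick from a, 16-31 from b) */
static void model_perm(uint8_t out[16], const uint8_t a[16],
                       const uint8_t b[16], const uint8_t index[16])
{
   int i;
   for (i = 0; i < 16; i++)
      out[i] = (index[i] < 16) ? a[index[i]] : b[index[i] - 16];
}

/* Shift the first four ints of an aligned array left by one, just as
   the vec_ld/vec_lvsl/vec_perm sequence does inside registers */
static int demo_shift(void)
{
   static int a[8] __attribute__ ((aligned(16))) = {0, 1, 2, 3, 4, 5, 6, 7};
   uint8_t index[16], bytes[16];
   int out[4];

   model_lvsl(index, sizeof(int), a);
   assert(index[0] == 4 && index[15] == 19);   /* {4, 5, ..., 19} */
   model_perm(bytes, (const uint8_t *)&a[0], (const uint8_t *)&a[4], index);
   memcpy(out, bytes, 16);
   return out[0]*1000 + out[1]*100 + out[2]*10 + out[3];
}
```

Because the index vector moves whole four-byte groups, the model behaves the same regardless of endianness.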

Unaligned Vectorized Addition

If you’ve understood the discussion so far, you’ll have no problem comprehending the full application. The code in Listing 9.1 loads the elements of arrays a and b into vectors vec_a and vec_b, shifts the vectors, adds them together into vec_c, and stores vec_c into the c array.

Example 9.1. Adding Unaligned Arrays: ppu_shiftadd.c

#include <stdio.h>
#include <altivec.h>

#define N 64

/* Loads the values from a[] into vec_a[],
   loads values from b[] into vec_b[],
   and computes c[i] = a[i+1] + b[i+2]; */
int main(int argc, char **argv) {

   int i;
   /* a[] and b[] are padded by one vector because the loop's
      final loads read past element N-1 */
   unsigned int a[N+4] __attribute__ ((aligned(16)));
   unsigned int b[N+4] __attribute__ ((aligned(16)));
   unsigned int c[N] __attribute__ ((aligned(16)));
   vector unsigned int vec_a[4], vec_b[4], vec_c[4];
   vector unsigned char index_a, index_b;
   vector unsigned int tempA, tempB;

   /* Fill a[] with even values, b[] with odd values */
   for(i=0; i<N+4; i++) {
      a[i] = 2*i;
      b[i] = 2*i+1;
   }

   /* Create the index vectors needed for shifting */
   index_a = vec_lvsl(1*sizeof(int), a);
   index_b = vec_lvsl(2*sizeof(int), b);

   /* Load the first four elements of a and b into temp */
   tempA = vec_ld(0, a);
   tempB = vec_ld(0, b);

   /* Execute the vectorized loop */
   int vecSize = sizeof(vector unsigned int);
   for(i=0; i<N/4; i+=4) {

      /* Load five vectors from a */
      vec_a[0] = tempA;
      vec_a[1] = vec_ld((i+1)*vecSize, a);
      vec_a[2] = vec_ld((i+2)*vecSize, a);
      vec_a[3] = vec_ld((i+3)*vecSize, a);
      tempA = vec_ld((i+4)*vecSize, a);

      /* Shift vec_a vectors left */
      vec_a[0] = vec_perm(vec_a[0], vec_a[1], index_a);
      vec_a[1] = vec_perm(vec_a[1], vec_a[2], index_a);
      vec_a[2] = vec_perm(vec_a[2], vec_a[3], index_a);
      vec_a[3] = vec_perm(vec_a[3], tempA, index_a);

      /* Load five vectors from b */
      vec_b[0] = tempB;
      vec_b[1] = vec_ld((i+1)*vecSize, b);
      vec_b[2] = vec_ld((i+2)*vecSize, b);
      vec_b[3] = vec_ld((i+3)*vecSize, b);
      tempB = vec_ld((i+4)*vecSize, b);

      /* Shift vec_b vectors left */
      vec_b[0] = vec_perm(vec_b[0], vec_b[1], index_b);
      vec_b[1] = vec_perm(vec_b[1], vec_b[2], index_b);
      vec_b[2] = vec_perm(vec_b[2], vec_b[3], index_b);
      vec_b[3] = vec_perm(vec_b[3], tempB, index_b);

      /* Add vectors */
      vec_c[0] = vec_add(vec_a[0], vec_b[0]);
      vec_c[1] = vec_add(vec_a[1], vec_b[1]);
      vec_c[2] = vec_add(vec_a[2], vec_b[2]);
      vec_c[3] = vec_add(vec_a[3], vec_b[3]);

      /* Store sum vectors */
      vec_st(vec_c[0], (i+0)*vecSize, c);
      vec_st(vec_c[1], (i+1)*vecSize, c);
      vec_st(vec_c[2], (i+2)*vecSize, c);
      vec_st(vec_c[3], (i+3)*vecSize, c);
   }

   for(i=0; i<N; i++)
      printf("%u ", c[i]);
   printf("\n");
   return 0;
}

 

For the sake of efficiency, the loads, shifts, and stores are performed in groups of four: only one extra load (into tempA or tempB) is needed for every four vectors processed. Repeating operations this way also reduces the number of iterations the for loop has to perform. This is an important technique for improving any application's performance, and it's called loop unrolling.

for loops take a significant amount of time to execute. With each iteration, the processor has to perform four steps:

  1. Execute a branch statement to check for the end condition.

  2. Load the count variable into a register.

  3. Increment/decrement the variable.

  4. Store the updated variable to memory (cache).

Reducing the number of iterations will generally improve your application’s performance. One way is to process data as vectors rather than scalars. Another is to repeat lines of code, as shown in Listing 9.1.
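As a scalar illustration of manual unrolling (sum_unrolled and demo_sum are hypothetical names, not from the chapter's sources), a four-way unrolled loop pays one branch test per four additions and cleans up any remainder afterward:

```c
#include <assert.h>

/* Four-way unrolled sum: one loop test per four additions */
static int sum_unrolled(const int *a, int n)
{
   int i, total = 0;
   for (i = 0; i + 3 < n; i += 4) {
      total += a[i];
      total += a[i+1];
      total += a[i+2];
      total += a[i+3];
   }
   for ( ; i < n; i++)   /* handle n not divisible by 4 */
      total += a[i];
   return total;
}

static int demo_sum(void)
{
   int v[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
   return sum_unrolled(v, 10);
}
```

The cleanup loop matters: without it, array lengths that aren't multiples of the unroll factor would be summed incorrectly.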

The easiest way is to have the compiler handle it for you. ppu-gcc provides two flags for this purpose: -funroll-loops and -funroll-all-loops. The first causes ppu-gcc to examine each loop in the code and determine whether its number of iterations is known before the loop begins. If so, -funroll-loops replaces the loop with repeated statements. The second flag, -funroll-all-loops, produces repeated statements regardless of whether the loop’s number of iterations is known in advance.

Note

ppu-gcc provides the flag -ftree-vectorize, which performs loop vectorization on trees. I’ve used the restrict keyword and proper memory alignment, but ppu-gcc autovectorization has never worked for me.

SOA Versus AOS

It’s common to declare an array of C structs as follows:

typedef struct {
  float x;
  float y;
  float z;
} position;
position particle_positions[N];

This is called an array of structures (AoS), and it's both common and convenient. But when it comes to vectorizing these arrays, two problems arise. The obvious concern is that the three float members can't fill the four elements of a vector, which reduces the efficiency of vector operations.

The second problem with AoS processing concerns the difference between struct operations and vector operations. When the particle_positions array is stored in memory, the x, y, and z fields are interleaved, and each field is a fixed difference away from similar values. This is fine when they’re accessed as scalars, but because elements in a vector are always processed in the same way, it’s usually preferable that they contain the same type of information. In this case, that means separating the x, y, and z values into their own vectors.

The following declaration uses the SoA approach to create particle_positions:

typedef struct {
  float x[N];
  float y[N];
  float z[N];
} position;
position particle_positions;

The new particle_positions contains the same information as the old particle_positions[N], but now similar field values are stored adjacent to one another. This way, four x values can be loaded into a vector at once.
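The difference is easy to see in each layout's stride between consecutive x values. In the sketch below, aos_position, soa_position, NPOS, and the stride helpers are hypothetical names used only for illustration:

```c
#include <assert.h>
#include <stddef.h>

#define NPOS 8   /* hypothetical particle count */

/* Array of structures: x, y, and z interleaved in memory */
typedef struct { float x, y, z; } aos_position;

/* Structure of arrays: similar fields stored adjacently */
typedef struct { float x[NPOS], y[NPOS], z[NPOS]; } soa_position;

/* Distance in bytes between consecutive x values in each layout */
static size_t aos_x_stride(void) { return sizeof(aos_position); }
static size_t soa_x_stride(void) { return sizeof(float); }
```

With the SoA layout, four consecutive x values occupy exactly one 16-byte quadword, so a single vec_ld fills a vector with homogeneous data; with AoS, each x value sits at least three floats away from the next.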

The SoA approach is better suited for SIMD processing, but AoS is so prevalent that you may not have any control over how the data is stored. In this case, you may need to convert the array of structures into a structure of arrays. This rearrangement is called swizzling, and the process is similar to finding the transpose of a matrix.

For example, suppose you want to swizzle the motion structs in Figure 9.3, Part A, into the vectors in Figure 9.3, Part B. You could use a series of vec_perm operations, but it’s simpler to work with vec_mergel and vec_mergeh. As described in the preceding chapter, these functions form a new vector by selecting alternating elements from the low or high halves of two vectors.

Figure 9.3. AOS and SOA

Listing 9.2 shows how swizzling is performed in code. After the structs are loaded into vectors, vec_mergeh merges the x and y members, and vec_mergel merges the z and t members.

Example 9.2. Converting from AoS to SoA: ppu_swizzle.c

#include <altivec.h>

typedef struct  {
   float x;
   float y;
   float z;
   float t;
} motion;

/* The array of structures */
motion p_motion[4] __attribute__ ((aligned (16)));

/* Place the content of the array of structures
   in vectors x_vec, y_vec, z_vec, and t_vec */
int main(int argc, char **argv) {

   vector float x_vec, y_vec, z_vec,
      t_vec, hold[4], tmp[4];

   /* Load structures into vectors */
   hold[0] = vec_ld(0, (float*)p_motion);
   hold[1] = vec_ld(0, (float*)&p_motion[1]);
   hold[2] = vec_ld(0, (float*)&p_motion[2]);
   hold[3] = vec_ld(0, (float*)&p_motion[3]);

   /* Perform first step of the swizzle */
   tmp[0] = vec_mergeh(hold[0], hold[2]);
   tmp[1] = vec_mergeh(hold[1], hold[3]);
   tmp[2] = vec_mergel(hold[0], hold[2]);
   tmp[3] = vec_mergel(hold[1], hold[3]);

   /* Perform second step of the swizzle */
   x_vec = vec_mergeh(tmp[0], tmp[1]);
   y_vec = vec_mergel(tmp[0], tmp[1]);
   z_vec = vec_mergeh(tmp[2], tmp[3]);
   t_vec = vec_mergel(tmp[2], tmp[3]);

   return 0;
}

The ppu_swizzle application takes advantage of the fact that the struct contains four floats that can easily be loaded into a vector. For heterogeneous structs or structs with an odd number of members, it’s necessary to use vec_perm.

When the AoS to SoA conversion is complete, you can operate on the data with normal vector functions. If subsequent functions require data in AoS format, you can de-swizzle the vectors using the same swizzle procedure.
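Since each struct holds four floats, the swizzle is exactly a 4x4 transpose. A scalar model (mergeh4, mergel4, swizzle4x4, and demo_swizzle are illustrative names, not AltiVec functions) shows why the two merge passes in Listing 9.2 produce the transpose:

```c
#include <assert.h>

/* Scalar model of vec_mergeh/vec_mergel on four-float vectors:
   mergeh interleaves the high (first) halves, mergel the low halves */
static void mergeh4(const float a[4], const float b[4], float out[4])
{
   out[0] = a[0]; out[1] = b[0]; out[2] = a[1]; out[3] = b[1];
}

static void mergel4(const float a[4], const float b[4], float out[4])
{
   out[0] = a[2]; out[1] = b[2]; out[2] = a[3]; out[3] = b[3];
}

/* Two merge passes transpose a 4x4 matrix, mirroring Listing 9.2:
   rows in[0..3] (structs) become columns out[0..3] (field vectors) */
static void swizzle4x4(float in[4][4], float out[4][4])
{
   float tmp[4][4];
   mergeh4(in[0], in[2], tmp[0]);
   mergeh4(in[1], in[3], tmp[1]);
   mergel4(in[0], in[2], tmp[2]);
   mergel4(in[1], in[3], tmp[3]);
   mergeh4(tmp[0], tmp[1], out[0]);
   mergel4(tmp[0], tmp[1], out[1]);
   mergeh4(tmp[2], tmp[3], out[2]);
   mergel4(tmp[2], tmp[3], out[3]);
}

static int demo_swizzle(void)
{
   float in[4][4], out[4][4];
   int r, c;
   for (r = 0; r < 4; r++)
      for (c = 0; c < 4; c++)
         in[r][c] = (float)(4*r + c);
   swizzle4x4(in, out);
   for (r = 0; r < 4; r++)
      for (c = 0; c < 4; c++)
         if (out[r][c] != (float)(4*c + r))
            return 0;
   return 1;
}
```

Running the same two passes on the transposed data recovers the original rows, which is why de-swizzling reuses the swizzle procedure.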

Vectorizing Data Transfer and String Manipulation

Now that you’ve seen the basics of how scalar algorithms are vectorized, it’s time to look at how these principles are used in practice. This section examines a group of C functions that can be vectorized to improve performance. The source code is available for all of them, and for representative functions, the method of vectorization is explored in detail.

The goal of this section isn’t to add new functions to your repertoire, but to show how AltiVec routines are used for practical, nonmathematical purposes. As SIMD coding becomes more prevalent, these vectorized functions may eventually find their way into the PowerPC/Cell Linux kernel.

libmotovec and libfreevec

At the time of this writing, there are only two freely available AltiVec-based libraries of vector functions outside the SDK. The first and most optimized is libmotovec, provided by Freescale Semiconductor. It contains routines for memory access, string operations, and checksum calculation. Freescale has stated that performance doubled when libmotovec routines performed the calculations associated with TCP/IP protocol stacks. The source is written in hand-coded assembly, which makes it efficient but difficult to understand.

This section focuses on libfreevec, which is less powerful than libmotovec but written clearly in C. This was created by Konstantinos Margaritis, who has kindly released the library under the GNU LGPL. Its functions manipulate data and strings, and Table 9.1 lists all of them. Each function has the same name and usage as its counterpart in the GNU C Library, but according to the graphs at www.freevec.org/functions, the libfreevec functions perform significantly better.

Table 9.1. Libfreevec Memory/String Manipulation Functions

Function Name   Description                                  Usage

bmove512        Transfer a multiple of 512 bytes to memory   void bmove512(void*, const void*, uint32_t)
bzero           Copy zeroes to string                        void bzero(void *s, size_t len)
memccpy         Copy until character                         void *memccpy(void*, const void*, int, size_t)
memchr          Find character in memory                     void *memchr(const void*, int, size_t)
memcmp          Compare memory regions                       int memcmp(const void*, const void*, size_t)
memcpy          Copy bytes in memory                         void *memcpy(void*, const void*, size_t)
memfrob         Encrypt a memory area                        void *memfrob(void*, size_t)
memmove         Move block of memory                         void *memmove(void*, const void*, size_t)
mempcpy         Copy bytes in memory                         void *mempcpy(void*, const void*, size_t)
memrchr         Find character (reverse)                     void *memrchr(const void*, int, size_t)
memset          Set bytes in memory                          void *memset(void*, int, size_t)
strcmp          Compare two strings                          int strcmp(const char*, const char*)
strcpy          Copy one string to another                   char *strcpy(char*, const char*)
strlen          Length of string                             size_t strlen(const char*)
strncmp         Compare part of two strings                  int strncmp(const char*, const char*, size_t)
strnlen         Length of fixed-size string                  size_t strnlen(const char*, size_t)
swab            Exchange adjacent bytes                      void swab(const void*, void*, ssize_t)

To obtain the libfreevec library, go to www.freevec.org and click the download link on the left. To install libfreevec on your system, unzip the file into a directory, change to the directory, and modify permissions as needed. Then enter

./configure

and once the makefile is created, enter

make

This builds the libfreevec.a and libfreevec.so libraries and places them in the directory src/.libs.

The src/macros directory holds a series of header files, each containing macros that perform vector operations. This section discusses a few of these macros as they are used in two libfreevec functions: memcpy and strcmp.

memcpy

The scalar memcpy(void* dst, const void* src, size_t len) function is faster than copying values in a for loop, and the vectorized memcpy() is faster still. The libfreevec function works differently depending on whether the size of the transfer, len, is less than 4, less than 16, less than 32, less than 64, or 64 and larger. This explanation covers how memcpy transfers blocks of 64 bytes or more.

The first concern is memory alignment. After prefetching the src and dst locations into the cache, memcpy transfers bytes from src to dst, incrementing the dst pointer until it falls on a word-aligned boundary. This is performed by the libfreevec macro MEMCPY_FWD_UNTIL_DEST_IS_WORD_ALIGNED, whose code is shown here:

#define MEMCPY_FWD_UNTIL_DEST_IS_WORD_ALIGNED(d, s, len)  \
{                                                         \
  int dstal = ((uint32_t)d) % sizeof(uint32_t);           \
  switch (dstal) {                                        \
  case 1:                                                 \
    *d++ = *s++;   /* fall through */                     \
  case 2:                                                 \
    *d++ = *s++;                                          \
  case 3:                                                 \
    *d++ = *s++;                                          \
    len -= sizeof(uint32_t) - dstal;                      \
  }                                                       \
}

This macro doesn’t store unaligned data into vectors like the code in Listing 9.1. It copies individual bytes from the source to the destination until the destination address reaches a word boundary.

Next, memcpy checks the source address: it stores the nearest word boundary below this address and the source's offset from that boundary. Then it expands and executes the macro MEMCPY_FWD_UNTIL_DEST_IS_ALTIVEC_ALIGNED. This is similar to the previous macro, but it transfers words from the source to the destination until the destination reaches a 16-byte boundary.

When the destination address is fully aligned, memcpy checks the size of the data to be copied and the alignment of the source address. If there are more than 64 bytes to copy and the source is aligned, memcpy expands the macro MEMCPY_FWD_LOOP_QUADWORD_ALTIVEC_ALIGNED. If the source address isn't aligned, it expands MEMCPY_FWD_LOOP_QUADWORD_ALTIVEC_UNALIGNED. The code for this macro is shown here:

#define MEMCPY_FWD_LOOP_QUADWORD_ALTIVEC_UNALIGNED(d, s, len)  \
{                                                              \
  vector uint8_t mask, MSQ1, LSQ1, LSQ2, LSQ3, LSQ4;           \
  uint32_t blocks = len >> LOG_ALTIVECQUAD;                    \
  len -= blocks << LOG_ALTIVECQUAD;                            \
  mask = vec_lvsl(0, s);                                       \
  while (blocks--) {                                           \
    MSQ1 = vec_ld(0, s);                                       \
    LSQ1 = vec_ld(15, s);                                      \
    LSQ2 = vec_ld(31, s);                                      \
    LSQ3 = vec_ld(47, s);                                      \
    LSQ4 = vec_ld(63, s);                                      \
    vec_st(vec_perm(MSQ1, LSQ1, mask), 0, (uint8_t *)d);       \
    vec_st(vec_perm(LSQ1, LSQ2, mask), 16, (uint8_t *)d);      \
    vec_st(vec_perm(LSQ2, LSQ3, mask), 32, (uint8_t *)d);      \
    vec_st(vec_perm(LSQ3, LSQ4, mask), 48, (uint8_t *)d);      \
    d += ALTIVECWORD_SIZE; s += QUAD_ALTIVECWORD;              \
  }                                                            \
}

Much of this code should look familiar, as it obtains an index vector (mask), loads multiple values from memory, and uses vec_perm to align the elements inside the vectors. When the vectorized data transfer is complete, memcpy expands the MEMCPY_FWD_REST_WORDS_UNALIGNED macro to copy the remaining words from the source to the destination. Finally, MEMCPY_FWD_NIBBLE transfers any remaining bytes to the destination.

This example shows how important it is to have routines for managing unaligned data. And although it’s ideal to operate on vectors and words, you also have to keep track of any remaining unaligned bytes.
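The staging can be sketched in plain C. In the following sketch, staged_memcpy and demo_copy are hypothetical names, and the aligned word-copy loop stands in for the vector middle section; the three stages mirror libfreevec's head-bytes, aligned-middle, and tail-bytes structure:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy bytes until dst is word aligned, copy aligned words,
   then copy any remaining tail bytes */
static void *staged_memcpy(void *dst, const void *src, size_t len)
{
   uint8_t *d = dst;
   const uint8_t *s = src;

   while (len && ((uintptr_t)d % sizeof(uint32_t))) {   /* head bytes */
      *d++ = *s++;
      len--;
   }
   while (len >= sizeof(uint32_t)) {                    /* aligned words */
      uint32_t w;
      memcpy(&w, s, sizeof w);   /* src may still be unaligned */
      memcpy(d, &w, sizeof w);
      d += sizeof w;
      s += sizeof w;
      len -= sizeof w;
   }
   while (len--)                                        /* tail bytes */
      *d++ = *s++;
   return dst;
}

static int demo_copy(void)
{
   char src[40], dst[40];
   int i;
   for (i = 0; i < 40; i++)
      src[i] = (char)i;
   memset(dst, 0, sizeof dst);
   staged_memcpy(dst + 1, src + 2, 37);   /* misaligned on both sides */
   return memcmp(dst + 1, src + 2, 37) == 0;
}
```

The real libfreevec code replaces the middle loop with the quadword AltiVec macros, but the head and tail handling follows this same pattern.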

strcmp

Vectorized string functions are uncommon, but the libfreevec strcmp function gives an idea of how they work and how AltiVec can be used to compare large data sizes. This function compares two char pointers, s1 and s2, and returns a negative number if the string pointed to by s1 is less than the string pointed to by s2, a zero if they’re equal, and a number greater than zero if the s1 string is greater than s2.

The algorithm begins with the macro STRCMP_UNTIL_SRC1_WORD_ALIGNED, which compares the bytes of s1 and s2 until either the bytes don’t match or s1 is aligned on a word boundary. Next, STRCMP_UNTIL_SRC1_IS_ALTIVEC_ALIGNED compares words of s1 and s2 until s1 is aligned on a vector boundary. This is similar to how memcpy moved the destination address until it fell on a 16-byte boundary. After finishing the alignment, strcmp becomes interesting. If s2 is not aligned, the macro STRCMP_LOOP_ALTIVEC_WORD_UNALIGNED creates an infinite loop. This loop loads two vectors with data from s2, shifts the elements to align them, and then compares the result to s1. This comparison is performed by STRCMP_SINGLE_ALTIVEC_WORD_UNALIGNED, shown in the code here:

#define STRCMP_SINGLE_ALTIVEC_WORD_UNALIGNED(src1, src1l, src2, src2l, src2al) \
{                                                                  \
   vector uint8_t vsrc1 =                                          \
      (vector uint8_t) vec_ld(0, (uint8_t *)src1l),                \
                              vsrc2, MSQ, LSQ, vmask;              \
                                                                   \
   vmask = vec_lvsl(0, src2);                                      \
   MSQ = vec_ld(0, src2);                                          \
   LSQ = vec_ld(15, src2);                                         \
   vsrc2 = vec_perm(MSQ, LSQ, vmask);                              \
                                                                   \
   if (!vec_all_ne(vsrc1, v0) || !vec_all_eq(vsrc1, vsrc2)) {      \
      src2l = (uint32_t *)(src2 - src2al);                         \
      STRCMP_QUADWORD_UNALIGNED(src1, src1l, src2, src2l, src2al); \
   }                                                               \
}

This macro also checks for zeroes in s1 by splatting 0 across a vector (v0) and using !vec_all_ne to find out whether any of the elements are zero. If an element of s1 is zero or differs from the corresponding element of s2, STRCMP_QUADWORD_UNALIGNED executes. This macro steps through the four words of the vector to determine which word ended the main loop, and then computes the appropriate return value.
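A scalar sketch of this strategy scans 16-byte blocks and falls back to a byte loop only when a block contains a zero or a mismatch. Here, chunked_strcmp and demo_strcmp are hypothetical names, and the inner byte scan stands in for the single vector compare:

```c
#include <assert.h>

/* Compare 16-byte blocks; resolve the result byte-by-byte only when
   a block contains a terminator or a mismatch */
static int chunked_strcmp(const char *s1, const char *s2)
{
   for (;;) {
      int i, stop = 0;
      for (i = 0; i < 16; i++) {   /* stands in for the vector compare */
         if (s1[i] != s2[i] || s1[i] == '\0') {
            stop = 1;
            break;
         }
      }
      if (stop)                    /* resolve inside this block */
         break;
      s1 += 16;
      s2 += 16;
   }
   while (*s1 && *s1 == *s2) {
      s1++;
      s2++;
   }
   return (unsigned char)*s1 - (unsigned char)*s2;
}

static int demo_strcmp(void)
{
   return chunked_strcmp("vector programming", "vector programming") == 0
       && chunked_strcmp("altivec", "altivecs") < 0
       && chunked_strcmp("sims", "simd") > 0;
}
```

The payoff is the same as in libfreevec: matching prefixes are rejected a block at a time, and the expensive per-byte logic runs only once, on the final block.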

Vectorized Insertion Sort

Of the open source AltiVec-based libraries available at the time of this writing, none contains sorting routines. This is because sorting applies different operations to individual scalars, which is inconvenient or impossible for many vector-based instruction sets.

But AltiVec provides two important advantages. The first is that vec_min and vec_max return column-wise minimum/maximum vectors without time-consuming branches. Second, vec_perm rearranges the elements of a vector in a single instruction.

This section presents a vector-based sort that relies on these three functions. The goal is not to sort data at high speed, but to show that a sort can be performed solely with vector operations. Simplicity is also a chief priority, and I’m going to assume that the input data array is aligned on a 16-byte boundary and that it contains only float or int values (four scalars per vector). The code can be modified to support other datatypes.

There are many ways to vectorize sort algorithms, but the insertion sort is particularly suitable. Before discussing the Single Instruction, Multiple Data (SIMD) implementation, it’s important to briefly explain how this algorithm works.

Insertion Sort

The insertion sort is one of the simplest sort algorithms, but also one of the least efficient, taking on the order of N² comparisons to sort N items. The process begins by comparing the first two items and switching their order if the first is greater than the second. Then each successive element is compared with the element preceding it. If the successive element is less, the two are switched and the element is compared to the next preceding element. If the successive element is greater, it is left in its place and the next successive element is chosen.

To clarify this further, the following arrays show how the insertion sort orders an array of four integers (12, 7, −5, 9). The elements under comparison are shown in bold:

12   7   -5   9
7   12   -5   9
7   -5   12   9
-5   7   12   9
-5   7   9   12

Depending on how the n elements are ordered, the insertion sort can take up to n(n−1)/2 comparisons to completely order the array. The process is shown in the following code, which performs the sort using two loops. The for loop cycles through each element of the array and the while loop compares each element with the elements preceding it, switching as necessary.

int i, j, start;
for (i=1; i<ARRAY_SIZE; i++) {
   start = num[i];
   j = i;
   while ((j > 0) && (num[j-1] > start)) {
      num[j] = num[j-1];
      j--;
   }
   num[j] = start;
}

Vectorizing the insertion sort is straightforward when you understand the two-part process:

  1. Intervector sort: Order the vectors containing the array data.

  2. Intravector sort: Order the elements inside the individual vectors.

The following subsections further explain these steps and their implementation in code.

Intervector Sort

The first stage in vectorizing the insertion sort is to find a reliable method of comparing vectors. One way is to use the vec_min and vec_max functions. As the names imply, they return vectors containing minimum and maximum elements of the two input vectors. To sort the vectors fully, however, you need to rotate the elements of one vector and perform vec_min and vec_max repeatedly until all the elements of the first vector are compared to all the elements of the second. When this is finished, all the elements of the maximum vector will be greater than or equal to all the elements of the minimum vector.

This process is shown in Figure 9.4 for two vector signed ints. After each vec_min and vec_max, the elements of the minimum vector are rotated to the left.

Figure 9.4. Ordering two vectors

Each vector contains four elements, so vec_min and vec_max are performed four times. Also, the low (minimum) vector is rotated three times to ensure that all of its elements are compared to all the elements of the high (maximum) vector.

It’s important to note that there are no branch statements in this comparison—only vec_max, vec_min, and vec_perm to rotate the vector. This method produces two clearly ordered vectors, but it doesn’t order the elements inside the vectors. A second stage is needed.

Intravector Sort

Like the intervector sort, the intravector sort requires multiple compare-and-rearrange operations, but instead of using vec_min and vec_max, it relies on vec_cmpgt. This function compares the elements in two vectors and returns a vector whose elements are all ones when the corresponding elements have a greater-than relationship, and all zeroes when they are less than or equal.

The result of vec_cmpgt determines how the vector should be rearranged for the next comparison. Figure 9.5 shows the steps involved in sorting the elements of a vector signed int whose values are 12, 7, −5, and 9.

Figure 9.5. Sorting elements inside a vector

The vec_perm function can rearrange the vectors, but it needs a suitable index vector. Constructing the index vector from the comparison result takes two steps:

  1. AND the vector result of the comparison with a suitable mask vector. Elements in the output will equal elements in the mask vector where the comparison returned true, and will equal zero where the comparison returned false.

  2. Add the AND output to a basic index vector. The vec_perm function will swap input elements depending on the values in the updated index vector.

This might seem complicated at first, but as you look through the code in Listing 9.3, you can see how the index vectors are created and then used to sort the values inside vectors.

Example 9.3. Fully Vectorized Insertion Sort: vecsort.c

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <altivec.h>

#define N 16

/* Replace the first input by the max,
   the second input by the min */
/* Replace the first input by the max,
   the second input by the min */
#define MINMAX(max, min) {     \
   temp = vec_min(max, min);   \
   max = vec_max(max, min);    \
   min = temp;                 \
}

void sort_vectors(int*);
void sort_elements(int*);

int main(int argc, char **argv) {

   int sortVec[N] __attribute__ ((aligned (16)));
   int i;

   /* Generate array elements with the pid */
   srand(getpid());
   for (i=0; i<N; i++)
      sortVec[i] = rand() % N;

   /* Display the initial values */
   for (i=0; i<N; i++)
      printf("%d ", sortVec[i]);
   printf("\n");

   /* Sort the values in the array */
   sort_vectors(sortVec);
   sort_elements(sortVec);

   /* Display the sorted values */
   for (i=0; i<N; i++)
      printf("%d ", sortVec[i]);
   printf("\n");
   return 0;
}

/* Perform inter-vector sort */
void sort_vectors(int* array) {
   int i, j, numVec = N/4;    /* Number of vectors */
   vector signed int max, min, temp, temp_max;
   vector unsigned char indexVec =
      {4,5,6,7,8,9,10,11,12,13,14,15,0,1,2,3};

   for (i=1; i<numVec; i++) {

      /* Assume the first value is largest */
      max = vec_ld(i*sizeof(max), array);
      for (j=i; j>0; j--) {
         temp_max = max;
         min = vec_ld((j-1)*sizeof(min), array);

         /* Obtain the min/max elements */
         MINMAX(max, min);
         max = vec_perm(max, min, indexVec);
         MINMAX(max, min);
         max = vec_perm(max, min, indexVec);
         MINMAX(max, min);
         max = vec_perm(max, min, indexVec);
         MINMAX(max, min);
         max = vec_perm(max, min, indexVec);

         vec_st(max, j*sizeof(max), array);
         if (vec_any_ne(max, temp_max)) {
            if (j > 1)
               max = min;
            else
               vec_st(min, 0, array);
         }
         else
            break;
      }
   }
}

/* Perform intra-vector sort */
void sort_elements(int* array) {
   int i, numVec = N/4;             /* Number of vectors */

   /* Vectors used in the first compare-arrange */
   vector unsigned char perm_reverse =
      {12,13,14,15,8,9,10,11,4,5,6,7,0,1,2,3};
   vector unsigned char add_index1 =
      {12,12,12,12,4,4,4,4,4,4,4,4,12,12,12,12};
   vector unsigned char basis_index1 =
      {0,1,2,3,4,5,6,7,4,5,6,7,0,1,2,3};

   /* Vectors used in the second compare-arrange */
   vector unsigned char perm_in_out =
      {4,5,6,7,0,1,2,3,12,13,14,15,8,9,10,11};
   vector unsigned char add_index2 =
      {4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4};
   vector unsigned char basis_index2 =
      {0,1,2,3,0,1,2,3,8,9,10,11,8,9,10,11};

   /* Vectors used in the third compare-arrange */
   vector unsigned char perm_middle =
      {0,1,2,3,8,9,10,11,4,5,6,7,12,13,14,15};
   vector unsigned char add_index3 =
      {0,0,0,0,4,4,4,4,4,4,4,4,0,0,0,0};
   vector unsigned char basis_index3 =
      {0,1,2,3,4,5,6,7,4,5,6,7,12,13,14,15};

   vector bool int mask;
   vector unsigned char pattern;

   /* The intra-vector sort */
   for(i=0; i<numVec; i++) {
      vector signed int vec =
         vec_ld(i*sizeof(vec), array);

      /* First compare and rearrange */
      mask = vec_cmpgt(vec, vec_perm(vec, vec, perm_reverse));
      pattern = vec_and((vector unsigned char)mask, add_index1);
      pattern = vec_add(pattern, basis_index1);
      vec = vec_perm(vec, vec, pattern);

      /* Second compare and rearrange */
      mask = vec_cmpgt(vec, vec_perm(vec, vec, perm_in_out));
      pattern = vec_and((vector unsigned char)mask, add_index2);
      pattern = vec_add(pattern, basis_index2);
      vec = vec_perm(vec, vec, pattern);

      /* Third compare and rearrange */
      mask = vec_cmpgt(vec, vec_perm(vec, vec, perm_middle));
      pattern = vec_and((vector unsigned char)mask, add_index3);
      pattern = vec_add(pattern, basis_index3);
      vec = vec_perm(vec, vec, pattern);

      vec_st(vec, i*sizeof(vec), array);
   }
}

 

The insertion sort algorithm stops comparing a value to preceding values when the value is greater than one of the preceding values. In the sort_vectors function, this is done by comparing max to temp_max using vec_any_ne. If the maximum vector didn’t change during the MINMAX/rotation process, the vector is already in its proper position.

The vectorized insertion sort is much more involved than the scalar version. However, the lack of branch-based comparisons means that the application will never suffer from branch miss penalties. Further, vec_min and vec_max not only compare vector elements, but also separate larger values from smaller.
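The three compare-and-rearrange passes can be verified off the PPU with a byte-level scalar model of vec_cmpgt, vec_and, vec_add, and vec_perm. The index arrays below are copied from Listing 9.3; compare_rearrange, sort_one_vector, and demo_sort are illustrative names. Because the index vectors always move whole four-byte groups, the model behaves the same on any endianness:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* One compare-and-rearrange pass, modeling vec_cmpgt, vec_and,
   vec_add, and vec_perm at the byte level */
static void compare_rearrange(int32_t v[4], const uint8_t perm[16],
                              const uint8_t add[16], const uint8_t basis[16])
{
   uint8_t in[16], idx[16], out[16];
   int32_t other[4];
   int i;

   memcpy(in, v, 16);
   for (i = 0; i < 16; i++)            /* vec_perm: permuted copy of v */
      out[i] = in[perm[i]];
   memcpy(other, out, 16);
   for (i = 0; i < 4; i++) {           /* vec_cmpgt: all-ones or all-zeroes */
      uint8_t m = (v[i] > other[i]) ? 0xFF : 0x00;
      idx[4*i] = idx[4*i+1] = idx[4*i+2] = idx[4*i+3] = m;
   }
   for (i = 0; i < 16; i++)            /* vec_and, then vec_add */
      idx[i] = (uint8_t)((idx[i] & add[i]) + basis[i]);
   for (i = 0; i < 16; i++)            /* final vec_perm */
      out[i] = in[idx[i]];
   memcpy(v, out, 16);
}

/* The three passes of Listing 9.3, using its index vectors */
static void sort_one_vector(int32_t v[4])
{
   static const uint8_t perm1[16]  = {12,13,14,15,8,9,10,11,4,5,6,7,0,1,2,3};
   static const uint8_t add1[16]   = {12,12,12,12,4,4,4,4,4,4,4,4,12,12,12,12};
   static const uint8_t basis1[16] = {0,1,2,3,4,5,6,7,4,5,6,7,0,1,2,3};
   static const uint8_t perm2[16]  = {4,5,6,7,0,1,2,3,12,13,14,15,8,9,10,11};
   static const uint8_t add2[16]   = {4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4};
   static const uint8_t basis2[16] = {0,1,2,3,0,1,2,3,8,9,10,11,8,9,10,11};
   static const uint8_t perm3[16]  = {0,1,2,3,8,9,10,11,4,5,6,7,12,13,14,15};
   static const uint8_t add3[16]   = {0,0,0,0,4,4,4,4,4,4,4,4,0,0,0,0};
   static const uint8_t basis3[16] = {0,1,2,3,4,5,6,7,4,5,6,7,12,13,14,15};

   compare_rearrange(v, perm1, add1, basis1);
   compare_rearrange(v, perm2, add2, basis2);
   compare_rearrange(v, perm3, add3, basis3);
}

static int demo_sort(void)
{
   int32_t v[4] = {12, 7, -5, 9};
   sort_one_vector(v);
   return v[0] == -5 && v[1] == 7 && v[2] == 9 && v[3] == 12;
}
```

Tracing the passes reveals a small sorting network: the first pass orders the pairs (0,3) and (1,2), the second orders (0,1) and (2,3), and the third orders (1,2), which is sufficient to sort any four elements.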

Conclusion

When it comes to languages, some people are fluent, whereas others merely know what the words mean. Programming with vectors is similar. The preceding chapter discussed the basic words of vectorization: datatypes, libraries, and functions. The goal of this chapter has been to bring you closer to fluency by showing how vectors are used in common situations.

Performance of scalar applications can usually be improved with vectorization, but the process requires thought and effort. If the data isn’t aligned in memory, a series of load/shift operations is needed to position scalars inside vectors. If the data values are stored as an array of structures (AoS), swizzling will rearrange them into the more SIMD-efficient structure of arrays (SoA).

libfreevec is an open-source library containing vectorized functions for manipulating strings and memory. The routines run on top of AltiVec and provide excellent examples of using vectors for applications beyond mathematics. Further, the header files contain a number of useful macros that can reduce the amount of repetitive code in your algorithms.

The last part of this chapter presented the theory and implementation of a vector-based insertion sort. This is not a high-performance algorithm, but the code demonstrates the breadth of vector processing on the PPU.
