The previous chapter discussed the AltiVec, SIMD Math, and MASSV libraries in great detail, but there’s more to SIMD programming than just knowing the functions. This chapter delves deeper into vector coding and describes methods that will improve the speed and efficiency of your applications. In particular, it covers two points:
Accessing unaligned memory and unrolling loops
Structure of arrays (SOA) versus array of structures (AOS)
This chapter also provides many real-world examples of vector computation on the PPU. This will not only give you a better understanding of AltiVec coding, but you'll also have a set of common, high-performance vector utilities that you can use in your own applications. These examples are taken from the libfreevec library, graciously released under the GNU Lesser General Public License (LGPL) by Konstantinos Margaritis and freevec.org.
When it comes to converting code to run on a vector processor, there are no fixed guidelines. A good strategy is to profile your application, locate the bottlenecks, and determine where (and if) vector routines would offer an advantage. The goal of this section is to assist with this process and help you incorporate SIMD code into your applications.
Let’s say the bottleneck in your application is caused by the following two lines of code:
for (i=0; i<1028; i++)
   c[i] = a[i+1] + b[i+2];
This is an ideal target for vectorization, but there's a problem. If the a, b, and c arrays are aligned on 16-byte boundaries, the operands, a[i+1] and b[i+2], are unaligned, as shown in Figure 9.1.
The following code loads values from int arrays a[] and b[] into vectors a_vec and b_vec. Each vector contains the first four elements of its array, and the last line adds a_vec and b_vec together:
vector signed int a_vec = vec_ld(0, a);
vector signed int b_vec = vec_ld(0, b);
vector signed int c_vec = vec_ld(0, c);
c_vec = vec_add(a_vec, b_vec);
vec_add adds a[0] to b[0], a[1] to b[1], a[2] to b[2], and a[3] to b[3]. This is not what we want. We want c[0] to equal a[1] + b[2], c[1] to equal a[2] + b[3], and so on.
We need a way to reposition the elements in the vectors so that a[1], b[2], and c[0] are lined up together. Then, when vec_add is performed, the elements of a and b will be added as required by the processing loop. This desired result is shown in Figure 9.2.
To get from Figure 9.1 to Figure 9.2, the elements of a[] must be loaded into the a_vec vectors and shifted left once. The elements of b[] need to be loaded into the b_vec vectors and shifted left twice. When this is done, a_vec and b_vec can be added together, and the sum will be properly positioned in c_vec.
The array elements are not shifted in memory. They are loaded into the PPU’s vector registers and shifted inside the registers.
The vector shifting is performed by vec_perm, which rearranges the elements of two vectors according to a third index vector. The index vector is created by vec_lvsl, which accepts an offset and an address, adds them together, and returns an index vector that will shift the vectors' elements such that the addressed element is placed into the most significant position.
Before attacking the full code for vectorized addition, let's look at a brief example of how vec_lvsl and vec_perm work together. Suppose we want to load the first eight values of a[] into vectors a_vec1 and a_vec2, and shift a_vec1 left by one element. This shifting is performed by vec_perm; but to use this function, we need a suitable index vector. This is generated by vec_lvsl:
int numShift = 1;
indexVec = vec_lvsl(numShift * sizeof(int), a);
vec_lvsl adds the input arguments and checks the last 4 bits of the resulting address. Then it returns an index vector such that, when used in vec_perm, the vector elements will be shifted so that the resulting address (a[1] in this case) is placed in the most significant position in the vector. In this example, vec_lvsl sets indexVec equal to

{4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19}
You can also create an index vector that shifts elements right with vec_lvsr.
The following code loads a[0:3] into vec_a and a[4:7] into vec_b. Then it generates an index vector, indexVec, that can be used to shift the integers left by one. vec_perm accesses indexVec to select bytes of vec_a and vec_b and place a[1:4] into vec_a:
vector signed int vec_a, vec_b;
vector unsigned char indexVec;
vec_a = vec_ld(0, a);
vec_b = vec_ld(sizeof(vector signed int), a);
indexVec = vec_lvsl(sizeof(int), a);
vec_a = vec_perm(vec_a, vec_b, indexVec);
If you look at the indexVec values and compare them to the index in Figure 8.4, you can better appreciate how vec_perm works. It shifts the elements of vec_a to the left and then shifts in the most significant integer of vec_b.
If you've understood the discussion so far, you'll have no problem comprehending the full application. The code in Listing 9.1 loads the elements of arrays a and b into vectors vec_a and vec_b, shifts the vectors, adds them together into vec_c, and stores vec_c into the c array.
Example 9.1. Adding Unaligned Arrays: ppu_shiftadd.c
#include <stdio.h>
#include <altivec.h>

#define N 64

/* Loads the values from a[] into vec_a[], loads values from b[]
   into vec_b[], and computes c[i] = a[i+1] + b[i+2]; */
int main(int argc, char **argv) {

   int i;
   unsigned int a[N] __attribute__ ((aligned(16)));
   unsigned int b[N] __attribute__ ((aligned(16)));
   unsigned int c[N] __attribute__ ((aligned(16)));
   vector unsigned int vec_a[4], vec_b[4], vec_c[4];
   vector unsigned char index_a, index_b;
   vector unsigned int tempA, tempB;

   /* Fill a[] with even values, b[] with odd values */
   for(i=0; i<N; i++) {
      a[i] = 2*i;
      b[i] = 2*i+1;
   }

   /* Create the index vectors needed for shifting */
   index_a = vec_lvsl(1*sizeof(int), a);
   index_b = vec_lvsl(2*sizeof(int), b);

   /* Load the first four elements of a and b into temp */
   tempA = vec_ld(0, a);
   tempB = vec_ld(0, b);

   /* Execute the vectorized loop */
   int vecSize = sizeof(vector unsigned int);
   for(i=0; i<N/4; i+=4) {

      /* Load five vectors from a */
      vec_a[0] = tempA;
      vec_a[1] = vec_ld((i+1)*vecSize, a);
      vec_a[2] = vec_ld((i+2)*vecSize, a);
      vec_a[3] = vec_ld((i+3)*vecSize, a);
      tempA = vec_ld((i+4)*vecSize, a);

      /* Shift vec_a vectors left */
      vec_a[0] = vec_perm(vec_a[0], vec_a[1], index_a);
      vec_a[1] = vec_perm(vec_a[1], vec_a[2], index_a);
      vec_a[2] = vec_perm(vec_a[2], vec_a[3], index_a);
      vec_a[3] = vec_perm(vec_a[3], tempA, index_a);

      /* Load five vectors from b */
      vec_b[0] = tempB;
      vec_b[1] = vec_ld((i+1)*vecSize, b);
      vec_b[2] = vec_ld((i+2)*vecSize, b);
      vec_b[3] = vec_ld((i+3)*vecSize, b);
      tempB = vec_ld((i+4)*vecSize, b);

      /* Shift vec_b vectors left */
      vec_b[0] = vec_perm(vec_b[0], vec_b[1], index_b);
      vec_b[1] = vec_perm(vec_b[1], vec_b[2], index_b);
      vec_b[2] = vec_perm(vec_b[2], vec_b[3], index_b);
      vec_b[3] = vec_perm(vec_b[3], tempB, index_b);

      /* Add vectors */
      vec_c[0] = vec_add(vec_a[0], vec_b[0]);
      vec_c[1] = vec_add(vec_a[1], vec_b[1]);
      vec_c[2] = vec_add(vec_a[2], vec_b[2]);
      vec_c[3] = vec_add(vec_a[3], vec_b[3]);

      /* Store sum vectors */
      vec_st(vec_c[0], (i+0)*vecSize, c);
      vec_st(vec_c[1], (i+1)*vecSize, c);
      vec_st(vec_c[2], (i+2)*vecSize, c);
      vec_st(vec_c[3], (i+3)*vecSize, c);
   }

   for(i=0; i<N; i++)
      printf("%u ", c[i]);
   printf("\n");

   return 0;
}
For the sake of efficiency, the loads, shifts, and stores are performed in groups of four: only one extra load (tempA, tempB) is required for every four loads. These repeated operations also reduce the number of iterations that the for loop has to perform. This is an important facet of improving any application's performance, and it's called loop unrolling.
for loops take a significant amount of time to execute. With each iteration, the processor has to perform four steps:
Execute a branch statement to check for the end condition.
Load the count variable into a register.
Increment/decrement the variable.
Store the updated variable to memory (cache).
Reducing the number of iterations will generally improve your application’s performance. One way is to process data as vectors rather than scalars. Another is to repeat lines of code, as shown in Listing 9.1.
The easiest way is to have the compiler handle it for you. ppu-gcc provides two flags for this purpose: -funroll-loops and -funroll-all-loops. The first causes ppu-gcc to examine each loop in the code and determine whether its number of iterations is known before the loop begins. If so, -funroll-loops replaces the loop with repeated statements. The second flag, -funroll-all-loops, produces repeated statements regardless of whether the loop's number of iterations is known in advance.
It's common to declare an array of C structs as follows:
typedef struct {
   float x;
   float y;
   float z;
} position;

position particle_positions[N];
This is called an array of structures (AoS), and it's both common and convenient. But when it comes to vectorizing these arrays, two problems arise. The obvious concern is that the three float elements can't fill the four elements of a vector, and this reduces the efficiency of vector operations.
The second problem with AoS processing concerns the difference between struct operations and vector operations. When the particle_positions array is stored in memory, the x, y, and z fields are interleaved, and each field is a fixed distance away from similar values. This is fine when they're accessed as scalars, but because elements in a vector are always processed in the same way, it's usually preferable that they contain the same type of information. In this case, that means separating the x, y, and z values into their own vectors.
The following declaration uses the SoA approach to create particle_positions:
typedef struct {
   float x[N];
   float y[N];
   float z[N];
} position;

position particle_positions;
The new particle_positions contains the same information as the old particle_positions[N], but now similar field values are stored adjacent to one another. This way, four x values can be loaded into a vector at once.
The SoA approach is better suited for SIMD processing, but AoS is so prevalent that you may not have any control over how the data is stored. In this case, you may need to convert the array of structures into a structure of arrays. This rearrangement is called swizzling, and the process is similar to finding the transpose of a matrix.
For example, suppose you want to swizzle the motion structs in Figure 9.3, Part A, into the vectors in Figure 9.3, Part B. You could use a series of vec_perm operations, but it's simpler to work with vec_mergel and vec_mergeh. As described in the preceding chapter, these functions form a new vector by selecting alternating elements from the low or high halves of two vectors.
Listing 9.2 shows how swizzling is performed in code. After the structs are loaded into vectors, vec_mergeh merges the x and y members, and vec_mergel merges the z and t members.
Example 9.2. Converting from AoS to SoA: ppu_swizzle.c
#include <altivec.h>

typedef struct {
   float x;
   float y;
   float z;
   float t;
} motion;

/* The array of structures */
motion p_motion[4] __attribute__ ((aligned (16)));

/* Place the content of the array of structures in vectors
   x_vec, y_vec, z_vec, and t_vec */
int main(int argc, char **argv) {

   vector float x_vec, y_vec, z_vec, t_vec, hold[4], tmp[4];

   /* Load structures into vectors */
   hold[0] = vec_ld(0, (float*)p_motion);
   hold[1] = vec_ld(0, (float*)&p_motion[1]);
   hold[2] = vec_ld(0, (float*)&p_motion[2]);
   hold[3] = vec_ld(0, (float*)&p_motion[3]);

   /* Perform first step of the swizzle */
   tmp[0] = vec_mergeh(hold[0], hold[2]);
   tmp[1] = vec_mergeh(hold[1], hold[3]);
   tmp[2] = vec_mergel(hold[0], hold[2]);
   tmp[3] = vec_mergel(hold[1], hold[3]);

   /* Perform second step of the swizzle */
   x_vec = vec_mergeh(tmp[0], tmp[1]);
   y_vec = vec_mergel(tmp[0], tmp[1]);
   z_vec = vec_mergeh(tmp[2], tmp[3]);
   t_vec = vec_mergel(tmp[2], tmp[3]);

   return 0;
}
The ppu_swizzle application takes advantage of the fact that the struct contains four floats that can easily be loaded into a vector. For heterogeneous structs or structs with an odd number of members, it's necessary to use vec_perm.
When the AoS to SoA conversion is complete, you can operate on the data with normal vector functions. If subsequent functions require data in AoS format, you can de-swizzle the vectors using the same swizzle procedure.
Now that you’ve seen the basics of how scalar algorithms are vectorized, it’s time to look at how these principles are used in practice. This section examines a group of C functions that can be vectorized to improve performance. The source code is available for all of them, and for representative functions, the method of vectorization is explored in detail.
The goal of this section isn’t to add new functions to your repertoire, but to show how AltiVec routines are used for practical, nonmathematical purposes. As SIMD coding becomes more prevalent, these vectorized functions may eventually find their way into the PowerPC/Cell Linux kernel.
At the time of this writing, there are only two freely available AltiVec-based libraries of vector functions outside the SDK. The first and most optimized is libmotovec, provided by Freescale Semiconductor. It contains routines for memory access, string operations, and checksum calculation. Freescale has stated that their application's efficiency doubled when libmotovec routines performed the calculations associated with TCP/IP protocol stacks. The source is written in hand-coded assembly, which makes it efficient but difficult to understand.
This section focuses on libfreevec, which is less powerful than libmotovec but written clearly in C. It was created by Konstantinos Margaritis, who has kindly released the library under the GNU LGPL. Its functions manipulate data and strings, and Table 9.1 lists all of them. Each function has the same name and usage as its counterpart in the GNU C Library, but according to the graphs at www.freevec.org/functions, the libfreevec functions perform significantly better.
Table 9.1. Libfreevec Memory/String Manipulation Functions
Function Name | Description
---|---
 | Transfer a multiple of 512 bytes to memory
bzero | Copy zeroes to string
memccpy | Copy until character
memchr | Find character in memory
memcmp | Compare memory regions
memcpy | Copy bytes in memory
memfrob | Encrypt a memory area
memmove | Move block of memory
mempcpy | Copy bytes in memory
memrchr | Find character (reverse)
memset | Set bytes in memory
strcmp | Compare two strings
strcpy | Copy one string to another
strlen | Length of string
strncmp | Compare part of two strings
strnlen | Length of fixed-size string
swab | Exchange adjacent bytes
To obtain the libfreevec library, go to www.freevec.org and click the download link on the left. To install libfreevec on your system, unzip the file into a directory, change to the directory, and modify permissions as needed. Then enter
./configure
and once the makefile is created, enter
make
This builds the libfreevec.a and libfreevec.so libraries and places them in the directory src/.libs.
The src/macros directory holds a series of header files, each containing macros that perform vector operations. This section discusses a few of these macros as they are used in two libfreevec functions: memcpy and strcmp.
The scalar memcpy(void* dst, const void* src, size_t len) function is faster than copying values in a for loop, and the vectorized memcpy() is faster still. The libfreevec function works differently depending on whether the size of the transfer, len, is less than 4, less than 16, less than 32, less than 64, or larger than 64. This explanation discusses how memcpy transfers block sizes larger than 64.
The first concern is memory alignment. After prefetching the src and dst locations into memory, memcpy transfers bytes from src to dst and increments the dst pointer until it falls on a word-aligned boundary. This is performed by the libfreevec macro MEMCPY_FWD_UNTIL_DEST_IS_WORD_ALIGNED, whose code is shown here:
#define MEMCPY_FWD_UNTIL_DEST_IS_WORD_ALIGNED(d, s, len) \
{ \
   int dstal = ((uint32_t)d) % sizeof(uint32_t); \
   switch (dstal) {   /* cases fall through deliberately */ \
      case 1: *d++ = *s++; \
      case 2: *d++ = *s++; \
      case 3: *d++ = *s++; \
         len -= sizeof(uint32_t) - dstal; \
   } \
}
This macro doesn’t store unaligned data into vectors like the code in Listing 9.1. It copies individual bytes from the source to the destination until the destination address reaches a word boundary.
Next, memcpy checks the source address: it stores the nearest word boundary below this address and the source's offset from the boundary. Then it expands and executes the macro MEMCPY_FWD_UNTIL_DEST_IS_ALTIVEC_ALIGNED. This is similar to the previous macro, but it transfers words from the source to the destination until the destination reaches a 16-byte boundary.
When the destination address is fully aligned, memcpy checks the size of the data to be copied and the alignment of the source address. If there are more than 64 bytes to copy and the source is aligned, memcpy expands the macro MEMCPY_FWD_LOOP_QUADWORD_ALTIVEC_ALIGNED. If the source address isn't aligned, it expands MEMCPY_FWD_LOOP_QUADWORD_ALTIVEC_UNALIGNED. The code for this macro is shown here:
#define MEMCPY_FWD_LOOP_QUADWORD_ALTIVEC_UNALIGNED(d, s, len) \
{ \
   vector uint8_t mask, MSQ1, LSQ1, LSQ2, LSQ3, LSQ4; \
   uint32_t blocks = len >> LOG_ALTIVECQUAD; \
   len -= blocks << LOG_ALTIVECQUAD; \
   mask = vec_lvsl(0, s); \
   while (blocks--) { \
      MSQ1 = vec_ld(0, s); \
      LSQ1 = vec_ld(15, s); \
      LSQ2 = vec_ld(31, s); \
      LSQ3 = vec_ld(47, s); \
      LSQ4 = vec_ld(63, s); \
      vec_st(vec_perm(MSQ1, LSQ1, mask), 0, (uint8_t *)d); \
      vec_st(vec_perm(LSQ1, LSQ2, mask), 16, (uint8_t *)d); \
      vec_st(vec_perm(LSQ2, LSQ3, mask), 32, (uint8_t *)d); \
      vec_st(vec_perm(LSQ3, LSQ4, mask), 48, (uint8_t *)d); \
      d += ALTIVECWORD_SIZE; \
      s += QUAD_ALTIVECWORD; \
   } \
}
Much of this code should look familiar: it obtains an index vector (mask), loads multiple values from memory, and uses vec_perm to align the elements inside the vectors. When the vectorized data transfer is complete, memcpy expands the MEMCPY_FWD_REST_WORDS_UNALIGNED macro to copy the remaining words from the source to the destination. Finally, MEMCPY_FWD_NIBBLE transfers any remaining bytes to the destination.
This example shows how important it is to have routines for managing unaligned data. And although it’s ideal to operate on vectors and words, you also have to keep track of any remaining unaligned bytes.
Vectorized string functions are uncommon, but the libfreevec strcmp function gives an idea of how they work and how AltiVec can be used to compare large data sizes. This function compares two char pointers, s1 and s2, and returns a negative number if the string pointed to by s1 is less than the string pointed to by s2, zero if they're equal, and a number greater than zero if the s1 string is greater than the s2 string.
The algorithm begins with the macro STRCMP_UNTIL_SRC1_WORD_ALIGNED, which compares the bytes of s1 and s2 until either the bytes don't match or s1 is aligned on a word boundary. Next, STRCMP_UNTIL_SRC1_IS_ALTIVEC_ALIGNED compares words of s1 and s2 until s1 is aligned on a vector boundary. This is similar to how memcpy moved the destination address until it fell on a 16-byte boundary. After finishing the alignment, strcmp becomes interesting. If s2 is not aligned, the macro STRCMP_LOOP_ALTIVEC_WORD_UNALIGNED creates an infinite loop. This loop loads two vectors with data from s2, shifts the elements to align them, and then compares the result to s1. This comparison is performed by STRCMP_SINGLE_ALTIVEC_WORD_UNALIGNED, shown in the code here:
#define STRCMP_SINGLE_ALTIVEC_WORD_UNALIGNED(src1, src1l, src2, src2l, src2al) \
{ \
   vector uint8_t vsrc1 = (vector uint8_t) vec_ld(0, (uint8_t *)src1l), \
                  vsrc2, MSQ, LSQ, vmask; \
   vmask = vec_lvsl(0, src2); \
   MSQ = vec_ld(0, src2); \
   LSQ = vec_ld(15, src2); \
   vsrc2 = vec_perm(MSQ, LSQ, vmask); \
   if (!vec_all_ne(vsrc1, v0) || !vec_all_eq(vsrc1, vsrc2)) { \
      src2l = (uint32_t *)(src2 - src2al); \
      STRCMP_QUADWORD_UNALIGNED(src1, src1l, src2, src2l, src2al); \
   } \
}
This macro also checks for zeroes in s1 by splatting 0 across a vector (v0) and using !vec_all_ne to find out whether any of the elements match. If an element of s1 is zero or differs from the corresponding element of s2, STRCMP_QUADWORD_UNALIGNED executes. This macro iterates through the four words of the vector and determines which word ended the main loop. Once it finds the word, it computes the appropriate return value.
Of the open source AltiVec-based libraries available at the time of this writing, none contain sorting routines. This is because the process of sorting applies different operations to individual scalars. This is inconvenient or impossible for many vector-based instruction sets.
But AltiVec provides two important advantages. The first is that vec_min and vec_max return column-wise minimum/maximum vectors without time-consuming branches. Second, vec_perm rearranges the elements of a vector in a single instruction.
This section presents a vector-based sort that relies on these three functions. The goal is not to sort data at high speed, but to show that a sort can be performed solely with vector operations. Simplicity is also a chief priority, so I'm going to assume that the input data array is aligned on a 16-byte boundary and that it contains only float or int values (four scalars per vector). The code can be modified to support other datatypes.
There are many ways to vectorize sort algorithms, but the insertion sort is particularly suitable. Before discussing the SIMD implementation, it's important to briefly explain how this algorithm works.
The insertion sort is one of the simplest sort algorithms, but also one of the least efficient, taking on the order of N² comparisons to sort N items. The process begins by comparing the first two items and switching their order if the first is greater than the second. Then each successive element is compared with the element preceding it. If the successive element is less, the two are switched, and the element is compared to the next preceding element. If the successive element is greater, it is left in place and the next successive element is chosen.
To clarify this further, the following rows show how the insertion sort orders an array of four integers (12, 7, −5, 9), one comparison per row:

12  7 -5  9
 7 12 -5  9
 7 -5 12  9
-5  7 12  9
-5  7  9 12
Depending on how the n elements are initially ordered, the insertion sort can take up to n(n−1)/2 comparisons to completely order the array. This is shown in the following code, which performs the sort using two loops. The for loop cycles through each element of the array, and the while loop compares each element with the elements preceding it, switching as necessary.

for (i=1; i<ARRAY_SIZE; i++) {
   start = num[i];
   j = i;
   while ((j > 0) && (num[j-1] > start)) {
      num[j] = num[j-1];
      j--;
   }
   num[j] = start;
}
Vectorizing the insertion sort is straightforward when you understand the two-part process:
Intervector sort: Order the vectors containing the array data.
Intravector sort: Order the elements inside the individual vectors.
The following subsections further explain these steps and their implementation in code.
The first stage in vectorizing the insertion sort is to find a reliable method of comparing vectors. One way is to use the vec_min and vec_max functions. As the names imply, they return vectors containing the minimum and maximum elements of the two input vectors. To sort the vectors fully, however, you need to rotate the elements of one vector and perform vec_min and vec_max repeatedly until all the elements of the first vector have been compared to all the elements of the second. When this is finished, all the elements of the maximum vector will be greater than or equal to all the elements of the minimum vector.
This process is shown in Figure 9.4 for two vector signed ints. After each vec_min and vec_max, the elements of the minimum vector are rotated to the left.
Each vector contains four elements, so vec_min and vec_max are performed four times. Also, the low (minimum) vector is rotated three times to ensure that all of its elements are compared to all the elements of the high (maximum) vector.
It's important to note that there are no branch statements in this comparison: only vec_max, vec_min, and vec_perm to rotate the vector. This method produces two clearly ordered vectors, but it doesn't order the elements inside the vectors. A second stage is needed.
Like the intervector sort, the intravector sort requires multiple compare-and-rearrange operations, but instead of using vec_min and vec_max, it relies on vec_cmpgt. This function compares the elements in two vectors and returns a vector whose elements are all ones where the corresponding elements have a greater-than relationship and all zeroes where they are less than or equal.
The result of vec_cmpgt determines how the vector should be rearranged for the next comparison. Figure 9.5 shows the steps involved in sorting the elements of a vector signed int whose values are 12, 7, −5, and 9.
The vec_perm function can rearrange the vectors, but it needs a suitable index vector. Constructing the index vector from the comparison result takes two steps:
AND the vector result of the comparison with a suitable mask vector. Elements in the output will equal elements in the mask vector where the comparison returned true, and will equal zero where the comparison returned false.
Add the AND output to a basic index vector. The vec_perm function will swap input elements depending on the values in the updated index vector.
This might seem complicated at first, but as you look through the code in Listing 9.3, you can see how the index vectors are created and then used to sort the values inside vectors.
Example 9.3. Fully Vectorized Insertion Sort: vecsort.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <altivec.h>

#define N 16

/* Replace the first input by the max, the second input by the min */
#define MINMAX(max, min) \
{ \
   temp = vec_min(max, min); \
   max = vec_max(max, min); \
   min = temp; \
}

void sort_vectors(int*);
void sort_elements(int*);

int main(int argc, char **argv) {

   int sortVec[N] __attribute__ ((aligned (16)));
   int i;

   /* Generate array elements with the pid */
   srand(getpid());
   for (i=0; i<N; i++)
      sortVec[i] = rand() % N;

   /* Display the initial values */
   for (i=0; i<N; i++)
      printf("%d ", sortVec[i]);
   printf("\n");

   /* Sort the values in the array */
   sort_vectors(sortVec);
   sort_elements(sortVec);

   /* Display the sorted values */
   for (i=0; i<N; i++)
      printf("%d ", sortVec[i]);
   printf("\n");

   return 0;
}

/* Perform inter-vector sort */
void sort_vectors(int* array) {

   int i, j, numVec = N/4;   /* Number of vectors */
   vector signed int max, min, temp, temp_max;
   vector unsigned char indexVec =
      {4,5,6,7,8,9,10,11,12,13,14,15,0,1,2,3};

   for (i=1; i<numVec; i++) {

      /* Assume the first value is largest */
      max = vec_ld(i*sizeof(max), array);

      for (j=i; j>0; j--) {
         temp_max = max;
         min = vec_ld((j-1)*sizeof(min), array);

         /* Obtain the min/max elements */
         MINMAX(max, min);
         max = vec_perm(max, min, indexVec);
         MINMAX(max, min);
         max = vec_perm(max, min, indexVec);
         MINMAX(max, min);
         max = vec_perm(max, min, indexVec);
         MINMAX(max, min);
         max = vec_perm(max, min, indexVec);

         vec_st(max, j*sizeof(max), array);

         if (vec_any_ne(max, temp_max)) {
            if (j > 1)
               max = min;
            else
               vec_st(min, 0, array);
         }
         else
            break;
      }
   }
}

/* Perform intra-vector sort */
void sort_elements(int* array) {

   int i, numVec = N/4;   /* Number of vectors */

   /* Vectors used in the first compare-arrange */
   vector unsigned char perm_reverse =
      {12,13,14,15,8,9,10,11,4,5,6,7,0,1,2,3};
   vector unsigned char add_index1 =
      {12,12,12,12,4,4,4,4,4,4,4,4,12,12,12,12};
   vector unsigned char basis_index1 =
      {0,1,2,3,4,5,6,7,4,5,6,7,0,1,2,3};

   /* Vectors used in the second compare-arrange */
   vector unsigned char perm_in_out =
      {4,5,6,7,0,1,2,3,12,13,14,15,8,9,10,11};
   vector unsigned char add_index2 =
      {4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4};
   vector unsigned char basis_index2 =
      {0,1,2,3,0,1,2,3,8,9,10,11,8,9,10,11};

   /* Vectors used in the third compare-arrange */
   vector unsigned char perm_middle =
      {0,1,2,3,8,9,10,11,4,5,6,7,12,13,14,15};
   vector unsigned char add_index3 =
      {0,0,0,0,4,4,4,4,4,4,4,4,0,0,0,0};
   vector unsigned char basis_index3 =
      {0,1,2,3,4,5,6,7,4,5,6,7,12,13,14,15};

   vector bool int mask;
   vector unsigned char pattern;

   /* The intra-vector sort */
   for(i=0; i<numVec; i++) {

      vector signed int vec = vec_ld(i*sizeof(vec), array);

      /* First compare and rearrange */
      mask = vec_cmpgt(vec, vec_perm(vec, vec, perm_reverse));
      pattern = vec_and((vector unsigned char)mask, add_index1);
      pattern = vec_add(pattern, basis_index1);
      vec = vec_perm(vec, vec, pattern);

      /* Second compare and rearrange */
      mask = vec_cmpgt(vec, vec_perm(vec, vec, perm_in_out));
      pattern = vec_and((vector unsigned char)mask, add_index2);
      pattern = vec_add(pattern, basis_index2);
      vec = vec_perm(vec, vec, pattern);

      /* Third compare and rearrange */
      mask = vec_cmpgt(vec, vec_perm(vec, vec, perm_middle));
      pattern = vec_and((vector unsigned char)mask, add_index3);
      pattern = vec_add(pattern, basis_index3);
      vec = vec_perm(vec, vec, pattern);

      vec_st(vec, i*sizeof(vec), array);
   }
}
The insertion sort algorithm stops comparing a value to preceding values when the value is greater than one of the preceding values. In the sort_vectors function, this is done by comparing max to temp_max using vec_any_ne. If the maximum vector didn't change during the MINMAX/rotation process, the vector is already in its proper position.
The vectorized insertion sort is much more involved than the scalar version. However, the lack of branch-based comparisons means that the application will never suffer from branch miss penalties. Further, vec_min and vec_max not only compare vector elements, but also separate larger values from smaller ones.
When it comes to languages, some people are fluent, whereas others merely know what the words mean. Programming with vectors is similar. The preceding chapter discussed the basic words of vectorization: datatypes, libraries, and functions. The goal of this chapter has been to bring you closer to fluency by showing how vectors are used in common situations.
Performance of scalar applications can usually be improved with vectorization, but the process requires thought and effort. If the data isn’t aligned in memory, a series of load/shift operations is needed to position scalars inside vectors. If the data values are stored as an array of structures (AoS), swizzling will rearrange them into the more SIMD-efficient structure of arrays (SoA).
libfreevec is an open source library containing vectorized functions for manipulating strings and memory. The routines run on top of AltiVec and provide excellent examples of using vectors for applications beyond mathematics. Further, the header files contain a number of useful macros that can reduce the amount of repetitive code in your algorithms.
The last part of this chapter presented the theory and implementation of a vector-based insertion sort. This is not a high-performance algorithm, but the code demonstrates the breadth of vector processing on the PPU.