Implementing compact

The compact operation is a sequence of four steps: predicate, scan, addressing, and gather. In this implementation, we will build an array of positive numbers from an array of randomly generated numbers. This initial version is limited to a single thread block because it relies on the block-sized prefix-sum operation. Even so, it shows how prefix-sum is useful for other applications, and the same approach extends to larger arrays with the extended prefix-sum operation.
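Before moving to the GPU kernels, it can help to see the four steps on the CPU. The following is a minimal sequential sketch, not the book's reference code: the function name `compact_positive_reference` is hypothetical, and it assumes an inclusive prefix sum over integer predicates.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for the four compact steps (assumption: inclusive scan).
std::vector<float> compact_positive_reference(const std::vector<float> &input)
{
    const std::size_t n = input.size();

    // 1. predicate: 1 where the element should be kept, 0 otherwise
    std::vector<int> predicates(n);
    for (std::size_t i = 0; i < n; ++i)
        predicates[i] = input[i] > 0.f ? 1 : 0;

    // 2. scan: inclusive prefix sum of the predicates
    std::vector<int> scanned(n);
    int running = 0;
    for (std::size_t i = 0; i < n; ++i) {
        running += predicates[i];
        scanned[i] = running;
    }

    // The last scanned value is the number of kept elements.
    const std::size_t kept = (n == 0) ? 0 : static_cast<std::size_t>(scanned[n - 1]);
    std::vector<float> output(kept);

    for (std::size_t i = 0; i < n; ++i) {
        if (predicates[i]) {
            // 3. addressing: zero-based output slot of this element
            const std::size_t address = static_cast<std::size_t>(scanned[i] - 1);
            assert(address < kept);
            // 4. gather: copy the kept element into its slot
            output[address] = input[i];
        }
    }
    return output;
}
```

For example, the input `{-0.5f, 0.25f, -0.1f, 0.75f}` yields `{0.25f, 0.75f}`. A function like this can also serve as the CPU side of a CPU/GPU comparison.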

To implement a compact operation, we will write a kernel function for each step and then call them in sequence:

Let's write a kernel function that builds a predicate array by checking whether each element's value is greater than zero:

__global__ void
predicate_kernel(float *d_predicates, float *d_input, int length)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;

    if (idx >= length) return;

    d_predicates[idx] = d_input[idx] > FLT_ZERO;
}

Then, we perform a prefix-sum operation on that predicate array, reusing the previous implementation. After that, we can write a kernel function that reads each kept element's output address from the scanned array and gathers the target elements into the output:

__global__ void
pack_kernel(float *d_output, float *d_input, float *d_predicates, float *d_scanned, int length)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;

    if (idx >= length) return;

    if (d_predicates[idx] != 0.f)
    {
        // addressing: convert the scanned count to a zero-based output index
        int address = static_cast<int>(d_scanned[idx]) - 1;

        // gather
        d_output[address] = d_input[idx];
    }
}
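The `- 1` in the addressing step suggests that the scan result is inclusive, that is, each scanned value already counts the element at that index. The sketch below illustrates this under that assumption. Both helper names are hypothetical. It computes output addresses from an inclusive and from an exclusive running sum and shows that they agree once the inclusive result is shifted by one.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: output address of each kept element from an INCLUSIVE scan.
// Dropped elements are marked with -1.
std::vector<int> addresses_from_inclusive_scan(const std::vector<int> &predicates)
{
    std::vector<int> addresses(predicates.size(), -1);
    int running = 0;
    for (std::size_t i = 0; i < predicates.size(); ++i) {
        assert(predicates[i] == 0 || predicates[i] == 1);
        running += predicates[i];        // inclusive: already counts element i
        if (predicates[i])
            addresses[i] = running - 1;  // subtract 1 for a zero-based index
    }
    return addresses;
}

// Hypothetical helper: the same addresses from an EXCLUSIVE scan.
std::vector<int> addresses_from_exclusive_scan(const std::vector<int> &predicates)
{
    std::vector<int> addresses(predicates.size(), -1);
    int running = 0;
    for (std::size_t i = 0; i < predicates.size(); ++i) {
        assert(predicates[i] == 0 || predicates[i] == 1);
        if (predicates[i])
            addresses[i] = running;      // exclusive: counts only elements before i
        running += predicates[i];
    }
    return addresses;
}
```

With predicates `{0, 1, 1, 0, 1}`, both helpers return `{-1, 0, 1, -1, 2}`. Note also that the kernel above keeps the scanned values in `float`, which represents integers exactly only up to 2^24, so an integer scan buffer would be the safer choice for very large arrays.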

Now, let's call them all together to make a compact operation:

// predicates
predicate_kernel<<< GRID_DIM, BLOCK_DIM >>>(d_predicates, d_input, length);
// scan
scan_v2(d_scanned, d_predicates, length);
// addressing & gather (pack)
pack_kernel<<< GRID_DIM, BLOCK_DIM >>>(d_output, d_input, d_predicates, d_scanned, length);

Now, we have an array of positive numbers that were gathered from a randomly generated array:

$ nvcc -run -m64 -std=c++11 -I/usr/local/cuda/samples/common/inc -gencode arch=compute_70,code=sm_70 -L/usr/local/cuda/lib -o pack_n_split ./pack_n_split.cu
input    :: -0.4508 -0.0210 -0.4774  0.2750 .... 0.0398  0.4869
pack[cpu]::  0.2750  0.3169  0.1248  0.4241 .... 0.3957  0.2958
pack[gpu]::  0.2750  0.3169  0.1248  0.4241 .... 0.3957  0.2958
SUCCESS!!

By using the parallel prefix-sum operation, we can easily implement the compact operation in parallel. Our implementation compacts the positive values from the given array, but we can swap in a different predicate and apply the compact operation in the same way. Now, let's cover how to distribute these compacted elements back to the original array.
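To illustrate that only the predicate step changes when the condition changes, here is a small sequential sketch with the predicate passed in as a parameter. The name `compact_if` is hypothetical and is not part of the implementation above. The scan and gather logic stays the same whatever condition is supplied.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical generic compact: keeps every element for which keep(value) is true.
template <typename Predicate>
std::vector<float> compact_if(const std::vector<float> &input, Predicate keep)
{
    const std::size_t n = input.size();

    // predicate + inclusive scan in one pass
    std::vector<int> predicates(n), scanned(n);
    int running = 0;
    for (std::size_t i = 0; i < n; ++i) {
        predicates[i] = keep(input[i]) ? 1 : 0;
        running += predicates[i];
        scanned[i] = running;
    }

    // addressing + gather; running now holds the number of kept elements
    std::vector<float> output(static_cast<std::size_t>(running));
    for (std::size_t i = 0; i < n; ++i) {
        if (predicates[i]) {
            const std::size_t address = static_cast<std::size_t>(scanned[i] - 1);
            assert(address < output.size());
            output[address] = input[i];
        }
    }
    return output;
}
```

Calling `compact_if(values, [](float v) { return v < 0.f; })` would collect the negative values instead of the positive ones. In a CUDA kernel, the same idea could be expressed by changing the comparison in `predicate_kernel`.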
