cuRAND with mixed-precision cuBLAS GEMM

Previously, we used the C++ random number generator to initialize matrices for a GEMM operation. That generator is handy for producing random numbers in general, but, as you may have noticed in the last section, it takes a long time to generate a large volume of random numbers. In this section, we will cover how the cuRAND host API can work with cuBLAS GEMM operations. The fully implemented version is in the gemm_with_curand_host.cpp file. Let's see how this is implemented:

  1. Currently, the cuRAND library does not provide a low-precision random number generator. We also need to convert half-precision numbers back to float in order to evaluate the output. For these reasons, we create type-conversion functions on the GPU as follows (a symmetric half-to-float helper is sketched after this listing):
#include <cuda_fp16.h>

#define BLOCK_DIM 256    // assumed thread-block size; defined elsewhere in the full source

namespace fp16 {
__global__ void float2half_kernel(half *out, float *in, size_t length)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < length)    // guard the last block's extra threads against out-of-bounds access
        out[idx] = __float2half(in[idx]);
}

void float2half(half *out, float *in, size_t length)
{
    float2half_kernel<<< (length + BLOCK_DIM - 1) / BLOCK_DIM, BLOCK_DIM >>>(out, in, length);
}
} // namespace fp16
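The listing above only converts float to half. Since we also need to bring half-precision results back to float for evaluation, a symmetric helper placed in the same fp16 namespace could look like the following. This is a hypothetical sketch, not part of the book's listing; __half2float() is the CUDA intrinsic for the reverse conversion:

__global__ void half2float_kernel(float *out, half *in, size_t length)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < length)
        out[idx] = __half2float(in[idx]);    // reverse conversion on the device
}

void half2float(float *out, half *in, size_t length)
{
    half2float_kernel<<< (length + BLOCK_DIM - 1) / BLOCK_DIM, BLOCK_DIM >>>(out, in, length);
}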
  2. Now, we will write a random number generation function that uses the cuRAND host API. As discussed before, we have to convert the generated random numbers from float to half when we need half-precision data. This function can be implemented as follows:
#include <type_traits>
#include <curand.h>

// FP32 path: generate uniform random numbers directly into a device buffer.
// std::enable_if selects this overload at compile time when T is float.
template <typename T>
typename std::enable_if<std::is_same<T, float>::value, float>::type
*curand(curandGenerator_t generator, size_t length)
{
    T *buffer = nullptr;
    cudaMalloc((void **)&buffer, length * sizeof(float));
    curandGenerateUniform(generator, buffer, length);
    return buffer;
}

// FP16 path: cuRAND has no half-precision generator, so we generate in FP32,
// convert on the device, and free the temporary FP32 buffer.
template <typename T>
typename std::enable_if<std::is_same<T, half>::value, half>::type
*curand(curandGenerator_t generator, size_t length)
{
    T *buffer = nullptr;
    float *buffer_fp32;

    cudaMalloc((void **)&buffer_fp32, length * sizeof(float));
    curandGenerateUniform(generator, buffer_fp32, length);

    cudaMalloc((void **)&buffer, length * sizeof(T));
    fp16::float2half(buffer, buffer_fp32, length);
    cudaFree(buffer_fp32);

    return buffer;
}
  3. Define some local variables that control the GEMM operation in the main() function:
void *d_A, *d_B, *d_C;                          // device pointers for A, B, and C
cudaDataType AType, BType, CType, computeType;  // per-matrix data types and compute type
int M = 8192, N = 8192, K = 8192;               // GEMM dimensions
float alpha = 1.f, beta = 1.f;
std::string precision = "fp32";                 // "fp32" or "fp16"
bool tensor_core = true;

In this code, we set the GEMM operation's size, data types, and operation type.

  4. Now, let's create the input buffers and set the parameters, along with the operation precision:
if (precision == "fp32") {
    auto *a = curand<float>(curand_gen, M * K);
    auto *b = curand<float>(curand_gen, K * N);
    auto *c = curand<float>(curand_gen, M * N);
    AType = BType = CType = CUDA_R_32F;
    computeType = CUDA_R_32F;
    d_A = a, d_B = b, d_C = c;
}
else if (precision == "fp16") {
    auto *a = curand<half>(curand_gen, M * K);
    auto *b = curand<half>(curand_gen, K * N);
    auto *c = curand<float>(curand_gen, M * N);  // C stays in FP32 for mixed-precision GEMM
    AType = BType = CUDA_R_16F, CType = CUDA_R_32F;
    computeType = CUDA_R_32F;
    d_A = a, d_B = b, d_C = c;
}
else {
    exit(EXIT_FAILURE);
}

  5. Create the cuRAND and cuBLAS handles. Note that the generator has to be created and seeded before the curand<T>() calls in the previous step:
cublasHandle_t cublas_handle;
curandGenerator_t curand_gen;

cublasCreate(&cublas_handle);
curandCreateGenerator(&curand_gen, CURAND_RNG_PSEUDO_DEFAULT);
curandSetPseudoRandomGeneratorSeed(curand_gen, 2019UL);
  6. Then, we should determine the operation algorithm in order to use Tensor Cores:
cublasGemmAlgo_t gemm_algo = (tensor_core) ?
    CUBLAS_GEMM_DEFAULT_TENSOR_OP : CUBLAS_GEMM_DEFAULT;
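CUBLAS_GEMM_DEFAULT_TENSOR_OP only engages Tensor Cores on GPUs of compute capability 7.0 (Volta) or later. As a hedged sketch that is not part of the book's listing, you could guard the flag at runtime before selecting the algorithm:

// Assumption: device 0 is the GPU in use. Fall back to the default
// algorithm on pre-Volta GPUs, which lack Tensor Cores.
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
if (prop.major < 7)
    tensor_core = false;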
  7. Then, we can call the cublasGemmEx() function, which supports both FP32 and FP16 operations, as follows:
cublasGemmEx(cublas_handle, CUBLAS_OP_N, CUBLAS_OP_N,
             M, N, K,
             &alpha, d_A, AType, M,    // lda = M
             d_B, BType, K,            // ldb = K
             &beta, d_C, CType, M,     // ldc = M (column-major, non-transposed inputs)
             computeType, gemm_algo);

The GEMM operation itself should show similar performance to the previous version, but you may find that the application as a whole runs faster, since parallel random number generation on the GPU is much faster than generation on the host.
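If you want to verify this yourself, a minimal timing sketch using CUDA events (standard CUDA runtime API, not code from gemm_with_curand_host.cpp) might look like this, including the final resource cleanup:

// Hedged sketch: time the GEMM call with CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
// ... the cublasGemmEx() call from the previous step goes here ...
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float elapsed_ms = 0.f;
cudaEventElapsedTime(&elapsed_ms, start, stop);
printf("GEMM time: %.3f ms\n", elapsed_ms);

// Tear down resources once the measurement is done.
cudaEventDestroy(start);
cudaEventDestroy(stop);
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
curandDestroyGenerator(curand_gen);
cublasDestroy(cublas_handle);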

The cuRAND developer guide will help you find other random number generators, options, and distributions. The document is available at https://docs.nvidia.com/pdf/CURAND_Library.pdf.
