Inline PTX assembly

We will now scratch the surface of writing PTX (Parallel Thread eXecution) assembly, a pseudo-assembly language that works across all Nvidia GPUs and is, in turn, compiled by a just-in-time (JIT) compiler into the specific GPU's actual machine code. While this obviously isn't intended for day-to-day usage, it will let us work at an even lower level than C if necessary. One particular use case is that you can easily disassemble a CUDA binary file (a host-side executable/library or a CUDA .cubin binary) and inspect its PTX code if no source code is otherwise available. This can be done with the cuobjdump -ptx cuda_binary command (cuobjdump.exe on Windows) on both Windows and Linux.

As stated previously, we will only cover some of the basic usages of PTX from within CUDA-C, whose syntax and usage are similar to those of inline host-side assembly in GCC. Let's get going with our code by doing the imports and starting on our GPU code:

from __future__ import division
import numpy as np
from pycuda.compiler import SourceModule
import pycuda.autoinit
from pycuda import gpuarray

PtxCode='''

We will do several mini-experiments here by writing the code into separate device functions. Let's start with a simple function that sets an input variable to zero. (Note that CUDA-C supports the C++ pass-by-reference operator &, which we will make use of in this device function.):

__device__ void set_to_zero(int &x)
{
asm("mov.s32 %0, 0;" : "=r"(x));
}

Let's break this down before we move on. asm, of course, indicates to the nvcc compiler that we are going to be using assembly, so we have to put that code into quotes so that it can be handled properly. The mov instruction copies a constant or other value into a register. (A register is the most fundamental type of on-chip storage unit that a GPU or CPU uses to store or manipulate values; this is how most local variables in CUDA are stored.) The .s32 part of mov.s32 indicates that we are working with a signed, 32-bit integer variable; PTX assembly doesn't have types for data in the sense of C, so we have to be careful to use the correct operation for each type.

%0 tells nvcc to use the register corresponding to the 0th operand of the string, and we separate this from the next input to mov, the constant 0, with a comma. We then end the line of assembly with a semicolon, as we would in C, and close off this string of assembly code with a quote. We then have to use a colon (not a comma!) to introduce the variables we want our code to use. The "=r" means two things: the = indicates to nvcc that the register will be written to as an output, while the r indicates that the value should be handled as a 32-bit integer datatype. We then put the variable we want the assembler to handle in parentheses, and close off the asm statement, just as we would with any C function call.

All of that exposition to set the value of a single variable to 0! Now, let's make a small device function that will add two floating-point numbers for us:

__device__ void add_floats(float &out, float in1, float in2)
{
asm("add.f32 %0, %1, %2 ;" : "=f"(out) : "f"(in1) , "f"(in2));
}

Let's stop and notice a few things. First, of course, we are using add.f32 to indicate that we want to add two 32-bit floating-point values together. We also use "=f" to indicate that we will be writing to a register, and "f" to indicate that we will only be reading from it. Also, notice how we use a colon to separate the write registers from the read-only registers for nvcc.

Let's look at one more simple example before we continue, that is, a function akin to the ++ operator in C that increments an integer by 1:

__device__ void plusplus(int &x)
{
asm("add.s32 %0, %0, 1;" : "+r"(x));
}

First, notice that we use the 0th operand as both the output and the first input. Next, notice that we are using "+r" rather than "=r": the + tells nvcc that this register will be both read from and written to in this instruction.

Now, we won't be getting any fancier than this, as even writing a simple if statement in assembly language is fairly involved. However, let's look at some more examples that will come in useful when working with CUDA Warps. Let's start with a small function that will give us the lane ID of the current thread; this is particularly useful, and actually far more straightforward than doing this with CUDA-C, since the lane ID is stored in a special register called %laneid that we can't access in pure C. (Notice how we use two % symbols in the code, which indicates to nvcc to emit a literal % in the assembly code for the %laneid reference rather than interpreting it as an operand of the asm command.):

__device__ int laneid()
{
int id;
asm("mov.u32 %0, %%laneid; " : "=r"(id));
return id;
}
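For intuition, a thread's lane ID is just its position within its 32-thread warp, so for a linearly indexed block it is equivalent to the thread index modulo the warp size. Here is a quick pure-Python sketch of that relationship (not part of PtxCode; the warp size of 32 holds on all current Nvidia GPUs):

```python
WARP_SIZE = 32  # warp size on all current Nvidia GPUs

def lane_id(thread_idx):
    # A thread's lane ID is its position within its 32-thread warp.
    return thread_idx % WARP_SIZE

def warp_id(thread_idx):
    # Which warp of the block a given linear thread index belongs to.
    return thread_idx // WARP_SIZE

# Threads 0-31 are lanes 0-31 of warp 0; thread 32 starts warp 1 at lane 0.
print([lane_id(t) for t in (0, 1, 31, 32, 33)])  # [0, 1, 31, 0, 1]
```

Since we will launch our test kernel over a single thread, the laneid() device function above should report a lane ID of 0.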

Now let's write two more functions that will be useful for dealing with CUDA Warps. Remember, you can only pass a 32-bit variable across a warp with a shuffle command. This means that to pass a 64-bit variable over a warp, we have to split it into two 32-bit variables, shuffle each of those to the other thread individually, and then recombine the 32-bit values back into the original 64-bit variable. We can use the mov.b64 instruction to split a 64-bit double into two 32-bit integers, and again to recombine them; notice how we have to use the "d" constraint to indicate a 64-bit floating-point double:

Notice our use of volatile in the following code, which ensures that these instructions are executed exactly as written and are neither removed nor reordered. A compiler will sometimes make its own optimizations to or around inline assembly code, but for particularly delicate operations such as this, we want it left alone.
__device__ void split64(double val, int & lo, int & hi)
{
asm volatile("mov.b64 {%0, %1}, %2; ":"=r"(lo),"=r"(hi):"d"(val));
}

__device__ void combine64(double &val, int lo, int hi)
{
asm volatile("mov.b64 %0, {%1, %2}; ":"=d"(val):"r"(lo),"r"(hi));
}
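To see what these two device functions do at the bit level, here is a pure-Python equivalent (not part of PtxCode) using the standard struct module: we reinterpret the 8 bytes of a double as two 32-bit words and then pack them back together, just as mov.b64 {lo, hi} places the low 32 bits into the first register and the high 32 bits into the second:

```python
import struct

def py_split64(val):
    # Reinterpret the double's 8 bytes as two 32-bit words
    # (little-endian: low word first, matching mov.b64 {lo, hi}).
    lo, hi = struct.unpack('<II', struct.pack('<d', val))
    return lo, hi

def py_combine64(lo, hi):
    # Pack the two 32-bit words back together and reinterpret as a double.
    return struct.unpack('<d', struct.pack('<II', lo, hi))[0]

lo, hi = py_split64(3.1415)
print(py_combine64(lo, hi) == 3.1415)  # True: the round trip is bit-exact
```

Since no arithmetic is performed on the value, only a reinterpretation of its bits, the round trip is always exact, which is exactly why this trick is safe for shuffling doubles across a warp.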

Now let's write a simple kernel that will test all of the PTX assembly device functions we wrote. We will then launch it over one single thread so that we can check everything:

__global__ void ptx_test_ker()
{
int x = 123;
printf("x is %d \n", x);
set_to_zero(x);
printf("x is now %d \n", x);
plusplus(x);
printf("x is now %d \n", x);
float f;
add_floats(f, 1.11, 2.22);
printf("f is now %f \n", f);
printf("lane ID: %d \n", laneid());
double orig = 3.1415;
int t1, t2;
split64(orig, t1, t2);
double recon;
combine64(recon, t1, t2);
printf("Do split64 / combine64 work? : %s \n", (orig == recon) ? "true" : "false");
}'''

ptx_mod = SourceModule(PtxCode)
ptx_test_ker = ptx_mod.get_function('ptx_test_ker')
ptx_test_ker(grid=(1,1,1), block=(1,1,1))

We will now run the preceding code. We should see x printed as 123, then 0 after set_to_zero, then 1 after plusplus; f should come out as 3.33, the lane ID of our single thread should be 0, and the split64/combine64 round trip should print true.

This example is also available as the ptx_assembly.py file under the Chapter11 directory in this book's GitHub repository.