Implementing a dense layer of artificial neurons

Now, let's implement the most important building block of an NN, the dense layer. Let's start by declaring a CUDA kernel, like so:

__global__ void dense_eval(int num_outputs, int num_inputs, int relu, int sigmoid, float * w, float * b, float * x, float *y, int batch_size, int w_t, int b_t, float delta)

Let's go over the inputs, one by one. num_outputs indicates the total number of outputs this layer has; this is exactly the number of neurons in the layer. num_inputs tells us the size of the input data. Setting a positive value for relu or sigmoid indicates that we should apply the corresponding activation function to the output of this layer; we will define these later. w and b are arrays containing the weights and biases of this layer, while x and y will act as our inputs and outputs. Oftentimes, we wish to classify more than one piece of data at a time. We can indicate this by setting batch_size to the number of data points in the batch. Finally, w_t, b_t, and delta will be used in the training process to determine the appropriate weights and biases for this layer by means of gradient descent. (We will see more on gradient descent in a later section.)

Now, let's start writing our kernel. We will parallelize the computation over the outputs, so we set an integer i to the global thread ID and use an if statement so that any extra threads launched beyond num_outputs simply do nothing:

{
 int i = blockDim.x*blockIdx.x + threadIdx.x;

 if (i < num_outputs)
 {

Now, let's iterate over each data point in the batch with the appropriate for loop:

for(int k=0; k < batch_size; k++)
 {

We will multiply and accumulate the 32-bit floats from the weights and inputs into a 64-bit double temp and then add the corresponding bias. We will then typecast this back to a 32-bit float and put the value in the output array, and then close off the loop over k:

double temp = 0.0;
for (int j = 0; j < num_inputs; j++)
{
    temp += ((double) w[num_inputs*i + j]) * ((double) x[k*num_inputs + j]);
}
temp += (double) b[i];
y[k * num_outputs + i] = (float) temp;
}
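
As a rough cross-check of the indexing used above, here is a minimal CPU sketch in plain Python. The function name dense_eval_reference and the sample values are made up for illustration; the sketch assumes the same flat, row-major layout as the kernel (weights stored as num_outputs rows of num_inputs values, inputs and outputs stored one data point after another). It does not round to 32-bit floats, so it mirrors only the loop structure, not the exact GPU arithmetic.

```python
def dense_eval_reference(w, b, x, num_outputs, num_inputs, batch_size):
    """CPU sketch of the affine part of the layer, using flat row-major lists."""
    y = [0.0] * (batch_size * num_outputs)
    for i in range(num_outputs):        # on the GPU, one thread handles each i
        for k in range(batch_size):     # loop over the data points in the batch
            temp = 0.0                  # Python floats are 64-bit doubles
            for j in range(num_inputs):
                temp += w[num_inputs * i + j] * x[k * num_inputs + j]
            y[k * num_outputs + i] = temp + b[i]
    return y


# Hypothetical example: two outputs, three inputs, a batch of two points.
w = [1.0, 2.0, 3.0,
     0.0, -1.0, 0.5]
b = [0.5, -0.5]
x = [1.0, 1.0, 1.0,
     2.0, 0.0, 4.0]
y = dense_eval_reference(w, b, x, num_outputs=2, num_inputs=3, batch_size=2)
print(y)  # [6.5, -1.0, 14.5, 1.5]
```

A reference like this can be handy later for comparing against the values the kernel writes into y.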

Multiply-and-accumulate operations are prone to a loss of numerical precision, because a small rounding error is introduced with every addition and these errors build up as more terms are summed. This can be mitigated by using a temporary variable of higher precision to store values in the course of the operation, and then typecasting this variable back to the original precision after the operation is completed.
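
The effect can be illustrated on the CPU with a small sketch. Since plain Python has no 32-bit float type, the helper to_float32 below (a name chosen for this example) emulates one by round-tripping through the struct module; the weight value 0.1, the input value 1.0, and the count of 100,000 terms are arbitrary choices made only for illustration.

```python
import struct


def to_float32(value):
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack('f', struct.pack('f', value))[0]


num_inputs = 100000
weight = to_float32(0.1)  # assume every weight is 0.1 ...
x_val = 1.0               # ... and every input is 1.0

acc32 = 0.0  # accumulator rounded to 32 bits after every step
acc64 = 0.0  # accumulator kept in 64 bits, as in the kernel
for _ in range(num_inputs):
    product = weight * x_val
    acc32 = to_float32(acc32 + to_float32(product))
    acc64 += product

result64 = to_float32(acc64)  # typecast back once, at the very end
exact = num_inputs * 0.1

err32 = abs(acc32 - exact)
err64 = abs(result64 - exact)
print("error with 32-bit accumulator:", err32)
print("error with 64-bit accumulator:", err64)
```

The 64-bit accumulator should land much closer to the exact sum, since it is rounded to 32 bits only once rather than after every addition.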

To train an NN, we will ultimately have to calculate the derivative (from calculus) of our NN's output with respect to each weight and bias within each individual layer, for a particular batch of inputs. Remember that the derivative of a mathematical function f at the value x can be estimated as (f(x + δ) - f(x)) / δ, where delta (δ) is some sufficiently small positive value. We will use the input values w_t and b_t to tell the kernel which weight or bias, if any, we want to differentiate with respect to; setting both to a negative value evaluates the layer normally, with no perturbation. We will also set delta to be an appropriately small value for the calculation of the derivative. Since the output of a neuron is linear in its weights and bias, incrementing a weight by delta simply adds delta times the corresponding input to the output, and incrementing the bias adds delta itself, so we can adjust y directly rather than modifying w or b:

if( w_t >= 0 && i == (w_t / num_inputs))
 {
 int j = w_t % num_inputs;
 for(int k=0; k < batch_size; k++)
  y[k*num_outputs + i] += delta*x[k*num_inputs+j];
}
if( b_t >= 0 && i == b_t )
 {
  for(int k=0; k < batch_size; k++)
  y[k*num_outputs + i] += delta;
 }
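
To see why adjusting the output directly is equivalent to perturbing the weight itself, consider the following plain-Python sketch for a single data point. The helper neuron_output and all of the numbers are hypothetical; the index arithmetic mirrors the kernel, with Python's // standing in for C integer division.

```python
def neuron_output(w, b, x, i, num_inputs):
    """Output of neuron i for one input point (flat row-major weights)."""
    return sum(w[num_inputs * i + j] * x[j] for j in range(num_inputs)) + b[i]


num_inputs = 3
w = [1.0, 2.0, 3.0,
     0.0, -1.0, 0.5]
b = [0.5, -0.5]
x = [2.0, 0.0, 4.0]
delta = 0.001

w_t = 5                # flat index of the weight being perturbed
i = w_t // num_inputs  # row: the neuron that owns this weight
j = w_t % num_inputs   # column: the input that this weight multiplies

base = neuron_output(w, b, x, i, num_inputs)

# Route 1: perturb the weight and re-evaluate the neuron from scratch.
w_perturbed = list(w)
w_perturbed[w_t] += delta
reevaluated = neuron_output(w_perturbed, b, x, i, num_inputs)

# Route 2: the shortcut used in the kernel, adding delta * x[j] to the output.
shortcut = base + delta * x[j]

# Forward-difference estimate of d(output i) / d(weight w_t).
derivative = (shortcut - base) / delta
print(i, j, reevaluated, shortcut, derivative)
```

Both routes agree up to rounding, and before any activation function is applied the estimated derivative is just the input x[j]. Once ReLU or sigmoid is applied to the perturbed output, the finite difference picks up the effect of the activation as well.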

Now, we will add some code for what are known as the rectified linear unit (or ReLU) and sigmoid activation functions. These are used for processing the immediate output of a dense neural layer. ReLU sets all negative values to 0 and acts as the identity for positive inputs, while sigmoid computes the value of the sigmoid function, 1 / (1 + e^(-x)), on each value. ReLU (or any other activation function) is used between hidden layers in an NN as a means to make the entire NN act as a nonlinear function; otherwise, the entire NN would constitute a trivial (and inefficiently computed) matrix operation. (While there are many other nonlinear activation functions that can be used between layers, ReLU has been found to be a particularly effective function for training.) Sigmoid is used as a final layer in an NN intended for labeling, that is, one that may assign multiple labels for a given input, as opposed to assigning an input to a single class.
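
The claim that a network without activation functions collapses into a single matrix operation can be checked with a tiny sketch. The two 2x2 weight matrices and the input below are arbitrary values picked for illustration, biases are left out for brevity, and the helper names are our own.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * c for a, c in zip(row, v)) for row in m]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]


def relu(v):
    """Apply ReLU to every entry of a vector."""
    return [max(t, 0.0) for t in v]


w1 = [[1.0, -1.0], [0.0, 1.0]]  # first layer (hypothetical weights)
w2 = [[1.0, 1.0], [1.0, -1.0]]  # second layer (hypothetical weights)
x = [1.0, 2.0]

stacked_linear = matvec(w2, matvec(w1, x))   # two layers, no activation
collapsed = matvec(matmul(w2, w1), x)        # one layer with the product matrix
with_relu = matvec(w2, relu(matvec(w1, x)))  # ReLU between the two layers

print(stacked_linear)  # [1.0, -3.0]
print(collapsed)       # [1.0, -3.0]
print(with_relu)       # [2.0, -2.0]
```

Without the activation, the two layers give exactly what the single product matrix gives; with ReLU in between, the result differs, which is what lets stacked layers represent nonlinear functions.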

Let's go up a little bit in the file, before we even begin to define this CUDA kernel, and define these operations as C macros. We will also remember to put in the CUDA-C code we've just written while we are at it:

DenseEvalCode = '''
#define _RELU(x) ( ((x) > 0.0f) ? (x) : 0.0f )
#define _SIGMOID(x) ( 1.0f / (1.0f + expf(-(x)) ))

Now, we will use the kernel inputs relu and sigmoid to indicate whether these activation functions should be applied; a positive value for either one means that it should be used. We can add this, close off our kernel, and compile it into a usable Python function:

if(relu > 0 || sigmoid > 0)
for(int k=0; k < batch_size; k++)
 { 
   float temp = y[k * num_outputs + i];
   if (relu > 0)
    temp = _RELU(temp);
   if (sigmoid > 0)
    temp = _SIGMOID(temp);
   y[k * num_outputs + i] = temp; 
  }
 }
 return;
}
'''
eval_mod = SourceModule(DenseEvalCode)
eval_ker = eval_mod.get_function('dense_eval')

Now, let's go to the beginning of the file and set up the appropriate import statements. Notice that we will include the csv module, which will be used for processing data inputs for testing and training:

from __future__ import division
import pycuda.autoinit
import pycuda.driver as drv
from pycuda import gpuarray
from pycuda.compiler import SourceModule
from pycuda.elementwise import ElementwiseKernel
import numpy as np
try:
    from queue import Queue  # Python 3
except ImportError:
    from Queue import Queue  # Python 2
import csv
import time

Now, let's continue setting up our dense layer; we will want to wrap this within a Python class for ease of use, which will make our lives much easier when we start connecting these dense layers together into a full-blown NN. We'll call the class DenseLayer and start by writing a constructor. Most of the inputs and setup here should be self-explanatory: we add an option to load weights and biases from a pre-trained network, and we also include the option to specify a default delta value as well as a default stream. (If no weights or biases are given, weights are initialized to random values, while all biases are set to 0.) We will also specify here whether to use the ReLU or sigmoid activation. Toward the end, notice how we set up the block and grid sizes:

class DenseLayer:
    def __init__(self, num_inputs=None, num_outputs=None, weights=None, b=None, stream=None, relu=False, sigmoid=False, delta=None):
        self.stream = stream
 
        if delta is None:
            self.delta = np.float32(0.001)
        else:
            self.delta = np.float32(delta)

        if weights is None:
            weights = np.random.rand(num_outputs, num_inputs) - .5
            self.num_inputs = np.int32(num_inputs)
            self.num_outputs = np.int32(num_outputs)
 
        if type(weights) != pycuda.gpuarray.GPUArray:
            self.weights = gpuarray.to_gpu_async(np.array(weights, 
            dtype=np.float32) , stream = self.stream)
        else:
            self.weights = weights
 
        if num_inputs is None or num_outputs is None:
            self.num_inputs = np.int32(self.weights.shape[1])
            self.num_outputs = np.int32(self.weights.shape[0])
 
        else:
            self.num_inputs = np.int32(num_inputs)
            self.num_outputs = np.int32(num_outputs)

        if b is None:
            b = gpuarray.zeros((self.num_outputs,),dtype=np.float32)
 
        if type(b) != pycuda.gpuarray.GPUArray:
            self.b = gpuarray.to_gpu_async(np.array(b, 
            dtype=np.float32) , stream = self.stream)
        else:
            self.b = b 
 
        self.relu = np.int32(relu)
        self.sigmoid = np.int32(sigmoid)
 
        self.block = (32,1,1)
        self.grid = (int(np.ceil(self.num_outputs / 32)), 1,1)
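
The grid size above is a ceiling division: we need enough 32-thread blocks to cover every output, and any leftover threads in the last block are the ones switched off by the if (i < num_outputs) check in the kernel. A small sketch of the arithmetic follows; the function name launch_config is our own and is not part of PyCUDA.

```python
def launch_config(num_outputs, threads_per_block=32):
    """Return (number of blocks, number of idle threads) for a 1D launch."""
    # Integer ceiling division: round up so that every output gets a thread.
    num_blocks = (num_outputs + threads_per_block - 1) // threads_per_block
    idle_threads = num_blocks * threads_per_block - num_outputs
    return num_blocks, idle_threads


print(launch_config(100))  # (4, 28)
print(launch_config(32))   # (1, 0)
print(launch_config(33))   # (2, 31)
```

For example, a layer with 100 outputs launches 4 blocks of 32 threads, and the last 28 threads do no work.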

Now, we will set up a function in this class to evaluate inputs from this layer; we will meticulously check the input (x) to determine if it is already on the GPU (transferring it over to a gpuarray if not), and we will let the user specify a preallocated gpuarray for output (y), manually allocating an output array if one is not specified. We will also check the delta and w_t/b_t values for the case of training, as well as batch_size. We will then run the kernel on the x input with outputs going into y, and finally return y as the output value:

    def eval_(self, x, y=None, batch_size=None, stream=None, delta=None, w_t=None, b_t=None):

        if stream is None:
            stream = self.stream

        # Move the input to the GPU if it is not already there.
        if type(x) != pycuda.gpuarray.GPUArray:
            x = gpuarray.to_gpu_async(np.array(x, dtype=np.float32), stream=stream)

        # Infer the batch size from the shape of the input if it is not given.
        if batch_size is None:
            if len(x.shape) == 2:
                batch_size = np.int32(x.shape[0])
            else:
                batch_size = np.int32(1)

        if delta is None:
            delta = self.delta

        delta = np.float32(delta)

        # Negative indices tell the kernel not to perturb any weight or bias.
        if w_t is None:
            w_t = np.int32(-1)

        if b_t is None:
            b_t = np.int32(-1)

        # Allocate the output array if the caller did not supply one.
        if y is None:
            if batch_size == 1:
                y = gpuarray.empty((self.num_outputs,), dtype=np.float32)
            else:
                y = gpuarray.empty((batch_size, self.num_outputs), dtype=np.float32)

        # Launch the kernel whether or not y was preallocated.
        eval_ker(self.num_outputs, self.num_inputs, self.relu, self.sigmoid, self.weights, self.b, x, y, np.int32(batch_size), w_t, b_t, delta, block=self.block, grid=self.grid, stream=stream)

        return y

There we go. We have fully implemented a dense layer!
