CUDA in Gorgonia

Gorgonia has implemented support for NVIDIA's CUDA as part of its cu package. It abstracts away almost all of the complexity, so all we have to do is specify the --tags=cuda flag at build time and ensure that the operations we are calling are actually present in the Gorgonia API.

Not every possible operation is implemented, of course. The emphasis is on operations that benefit from parallel execution and are therefore amenable to GPU acceleration. As we will cover in Chapter 5, Next Word Prediction with Recurrent Neural Networks, many of the operations involved in Convolutional Neural Networks (CNNs) meet this criterion.

So, what's available? The following list outlines the options:

  • 1D or 2D convolutions (used in CNNs)
  • 2D max pooling (also used in CNNs!)
  • Dropout (kill some neurons!)
  • ReLU (recall activation functions in Chapter 2, What is a Neural Network and How Do I Train One?)
  • Batch normalization

We will now look at the implementation of each, in turn.

Looking at gorgonia/ops/nn/api_cuda.go, we see the function for a 2D convolution as follows:

func Conv2d(im, filter *G.Node, kernelShape tensor.Shape, pad, stride, dilation []int) (retVal *G.Node, err error) {
    var op *convolution
    if op, err = makeConvolutionOp(im, filter, kernelShape, pad, stride, dilation); err != nil {
        return nil, err
    }
    return G.ApplyOp(op, im, filter)
}
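
To give a sense of how this op is called in practice, here is a minimal, hypothetical sketch. The import alias nnops, the tensor shapes, and the hyperparameter values are assumptions made for the example rather than requirements of the API:

package main

import (
    G "gorgonia.org/gorgonia"
    nnops "gorgonia.org/gorgonia/ops/nn"
    "gorgonia.org/tensor"
)

func main() {
    g := G.NewGraph()

    // A batch of 32 single-channel 28x28 images in NCHW layout (assumed shape).
    x := G.NewTensor(g, tensor.Float32, 4, G.WithShape(32, 1, 28, 28), G.WithName("x"), G.WithInit(G.GlorotN(1.0)))

    // 16 filters, each 3x3 over 1 input channel (assumed shape).
    w := G.NewTensor(g, tensor.Float32, 4, G.WithShape(16, 1, 3, 3), G.WithName("w"), G.WithInit(G.GlorotN(1.0)))

    // A 3x3 convolution with padding 1, stride 1, and no dilation.
    conv, err := nnops.Conv2d(x, w, tensor.Shape{3, 3}, []int{1, 1}, []int{1, 1}, []int{1, 1})
    if err != nil {
        panic(err)
    }
    _ = conv // this node would feed into pooling, activation, and so on
}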

The 1D convolution function simply wraps Conv2d(), which is a neat way of providing both options from a single implementation:

func Conv1d(in, filter *G.Node, kernel, pad, stride, dilation int) (*G.Node, error) {
    return Conv2d(in, filter, tensor.Shape{1, kernel}, []int{0, pad}, []int{1, stride}, []int{1, dilation})
}

Next is the MaxPool2D() function. In a CNN, the max pooling layer is part of the process of feature extraction. The dimensionality of the input is reduced, before being passed on to the subsequent convolutional layer.

Here, we create an instance of MaxPool that carries our kernel, padding, and stride parameters, and we return the result of running ApplyOp() on our input node, as shown in the following code:

func MaxPool2D(x *G.Node, kernel tensor.Shape, pad, stride []int) (retVal *G.Node, err error) {
    var op *maxpool
    if op, err = newMaxPoolOp(x, kernel, pad, stride); err != nil {
        return nil, err
    }
    return G.ApplyOp(op, x)
}
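
Continuing the hypothetical sketch from the Conv2d example, pooling that convolution's output might look like the following; the 2x2 kernel, zero padding, and stride of 2 are assumed values:

// conv is the *G.Node returned by nnops.Conv2d in the earlier sketch.
// Halve the spatial dimensions with a 2x2 window and a stride of 2.
pooled, err := nnops.MaxPool2D(conv, tensor.Shape{2, 2}, []int{0, 0}, []int{2, 2})
if err != nil {
    panic(err)
}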

Dropout() is a regularization technique that is used to prevent our networks from overfitting. We want to learn the most general representation of our input data possible, and dropout helps us do that.

The structure of Dropout() should be familiar by now. It is another operation that can be parallelized within a layer, as follows:

func Dropout(x *G.Node, prob float64) (retVal *G.Node, err error) {
    var op *dropout
    if op, err = newDropout(x, prob); err != nil {
        return nil, err
    }

    // states := &scratchOp{x.Shape().Clone(), x.Dtype(), ""}
    // m := G.NewUniqueNode(G.WithType(x.Type()), G.WithOp(states), G.In(x.Graph()), G.WithShape(states.shape...))

    retVal, err = G.ApplyOp(op, x)
    return
}

The standard ReLU function we covered in Chapter 2, What is a Neural Network and How Do I Train One?, is also available, as shown here:

func Rectify(x *G.Node) (retVal *G.Node, err error) {
    var op *activation
    if op, err = newRelu(); err != nil {
        return nil, err
    }
    retVal, err = G.ApplyOp(op, x)
    return
}
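
Chained onto the pooled node from the earlier hypothetical sketch, these two ops might be used together as follows; the 0.5 drop probability is an assumed value:

// Apply ReLU to the pooled feature map, then drop half of the
// activations during training to regularize the layer.
activated, err := nnops.Rectify(pooled)
if err != nil {
    panic(err)
}
dropped, err := nnops.Dropout(activated, 0.5)
if err != nil {
    panic(err)
}
_ = dropped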

BatchNorm() is slightly more complicated. Looking back at the original paper that described batch normalization, by Ioffe and Szegedy (2015), we see that, for a given batch, we normalize the output of the previous layer by subtracting the batch mean and dividing by the batch standard deviation. We can also observe the addition of two parameters, a scale (γ) and a shift (β), that we will train with SGD.
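
In the paper's notation, for a mini-batch B = {x₁, ..., x_m}, the transformation is as follows (γ and β correspond to the scale and bias arguments in the code that follows):

    μ_B = (1/m) Σᵢ xᵢ
    σ²_B = (1/m) Σᵢ (xᵢ − μ_B)²
    x̂ᵢ = (xᵢ − μ_B) / √(σ²_B + ε)
    yᵢ = γ·x̂ᵢ + β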

And now, we can see the CUDA-fied Gorgonia implementation, as follows. First, we have the function definition and a data type check:

func BatchNorm(x, scale, bias *G.Node, momentum, epsilon float64) (retVal, γ, β *G.Node, op *BatchNormOp, err error) {
    dt, err := dtypeOf(x.Type())
    if err != nil {
        return nil, nil, nil, nil, err
    }

Then, it needs to create some scratch variables to allow the VM to allocate spare memory:

    channels := x.Shape()[1]
    H, W := x.Shape()[2], x.Shape()[3]
    scratchShape := tensor.Shape{1, channels, H, W}

    meanScratch := &gpuScratchOp{scratchOp{x.Shape().Clone(), dt, "mean"}}
    varianceScratch := &gpuScratchOp{scratchOp{x.Shape().Clone(), dt, "variance"}}
    cacheMeanScratch := &gpuScratchOp{scratchOp{scratchShape, dt, "cacheMean"}}
    cacheVarianceScratch := &gpuScratchOp{scratchOp{scratchShape, dt, "cacheVariance"}}

We then create the equivalent variables in our computation graph:

    g := x.Graph()
    dims := len(x.Shape())

    mean := G.NewTensor(g, dt, dims, G.WithShape(scratchShape.Clone()...), G.WithName(x.Name()+"_mean"), G.WithOp(meanScratch))
    variance := G.NewTensor(g, dt, dims, G.WithShape(scratchShape.Clone()...), G.WithName(x.Name()+"_variance"), G.WithOp(varianceScratch))
    cacheMean := G.NewTensor(g, dt, dims, G.WithShape(scratchShape.Clone()...), G.WithOp(cacheMeanScratch))
    cacheVariance := G.NewTensor(g, dt, dims, G.WithShape(scratchShape.Clone()...), G.WithOp(cacheVarianceScratch))

We then create our scale and bias variables in the graph, before applying our function and returning the results:

    if scale == nil {
        scale = G.NewTensor(g, dt, dims, G.WithShape(scratchShape.Clone()...), G.WithName(x.Name()+"_γ"), G.WithInit(G.GlorotN(1.0)))
    }

    if bias == nil {
        bias = G.NewTensor(g, dt, dims, G.WithShape(scratchShape.Clone()...), G.WithName(x.Name()+"_β"), G.WithInit(G.GlorotN(1.0)))
    }

    op = newBatchNormOp(momentum, epsilon)

    retVal, err = G.ApplyOp(op, x, scale, bias, mean, variance, cacheMean, cacheVariance)
    return retVal, scale, bias, op, err
}
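
As a final, hypothetical sketch, passing nil for scale and bias lets BatchNorm() create γ and β for us; the momentum and epsilon values here are common defaults assumed for the example:

// Batch-normalize the convolution output from the earlier sketch.
bn, gamma, beta, bnOp, err := nnops.BatchNorm(conv, nil, nil, 0.9, 1e-5)
if err != nil {
    panic(err)
}
_, _, _, _ = bn, gamma, beta, bnOp

Note that the op itself is returned as well, which gives us a handle on the batch normalization operation for later use.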

Next, let's take a look at how to build a model in Gorgonia that leverages CUDA.
