Describing a CNN

Having said all that, the neural network is very easy to build. First, we define the neural network as follows:

type convnet struct {
	g                  *gorgonia.ExprGraph
	w0, w1, w2, w3, w4 *gorgonia.Node // weights. the number at the back indicates which layer it's used for
	d0, d1, d2, d3     float64        // dropout probabilities

	out    *gorgonia.Node
	outVal gorgonia.Value
}

Here, we define a neural network with four layers (plus an output decoding step that uses w4). A convnet layer is similar to a linear layer in many ways. It can, for example, be written as an equation; each layer in this network computes the following (matching the fwd function shown later in this section):

layer(x) = Dropout(MaxPool(ReLU(Conv2d(x, w))), p)

Note that in this specific example, I consider dropout and max-pool to be part of the same layer. In much of the literature, they are treated as separate layers.

I personally do not see the necessity to consider them as separate layers. After all, everything is just a mathematical equation; composing functions comes naturally.

A mathematical equation on its own, without structure, is quite meaningless. Unfortunately, we do not yet have technology usable enough to simply define the structure of a data type (the current hotness is dependently typed languages, such as Idris, but they are not yet at the level of usability or performance necessary for deep learning). Instead, we constrain our data structure by providing a constructor function for the convnet:

func newConvNet(g *gorgonia.ExprGraph) *convnet {
	w0 := gorgonia.NewTensor(g, dt, 4, gorgonia.WithShape(32, 1, 3, 3),
		gorgonia.WithName("w0"),
		gorgonia.WithInit(gorgonia.GlorotN(1.0)))
	w1 := gorgonia.NewTensor(g, dt, 4, gorgonia.WithShape(64, 32, 3, 3),
		gorgonia.WithName("w1"),
		gorgonia.WithInit(gorgonia.GlorotN(1.0)))
	w2 := gorgonia.NewTensor(g, dt, 4, gorgonia.WithShape(128, 64, 3, 3),
		gorgonia.WithName("w2"),
		gorgonia.WithInit(gorgonia.GlorotN(1.0)))
	w3 := gorgonia.NewMatrix(g, dt, gorgonia.WithShape(128*3*3, 625),
		gorgonia.WithName("w3"),
		gorgonia.WithInit(gorgonia.GlorotN(1.0)))
	w4 := gorgonia.NewMatrix(g, dt, gorgonia.WithShape(625, 10),
		gorgonia.WithName("w4"),
		gorgonia.WithInit(gorgonia.GlorotN(1.0)))
	return &convnet{
		g:  g,
		w0: w0,
		w1: w1,
		w2: w2,
		w3: w3,
		w4: w4,

		d0: 0.2,
		d1: 0.2,
		d2: 0.2,
		d3: 0.55,
	}
}

We'll start with dt. This is essentially a package-level variable denoting what data type we would like to work in. For the purposes of this project, we can use var dt = tensor.Float64 to indicate that we would like to work with float64 throughout the entire project. This allows us to immediately reuse the functions from the previous chapter without having to handle different data types. Note that if we plan to use float32 instead, the computation speed roughly doubles, since each value takes half the memory. In the repository for this chapter, you might note that the code uses float32.
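For concreteness, here is a minimal sketch of that declaration (the surrounding package plumbing is assumed, not taken from the chapter's repository):

import "gorgonia.org/tensor"

// dt is the element type used for every tensor in the project.
// Swapping in tensor.Float32 roughly doubles the computation speed.
var dt = tensor.Float64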

Next come d0 all the way to d3. These are fairly simple: for the first three layers, we want 20% of the activations to be randomly zeroed, while for the last layer, we want 55% of the activations to be randomly zeroed. In really broad strokes, this creates an information bottleneck, which forces the machine to learn only the really important features.
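To make the idea concrete, here is a sketch of what dropout does to a slice of activations. This illustrates the concept only, not Gorgonia's implementation; the conventional "inverted dropout" also scales the survivors by 1/(1-p) so their expected magnitude is unchanged:

package main

import (
	"fmt"
	"math/rand"
)

// dropoutSketch zeroes each activation with probability p and scales
// the survivors by 1/(1-p) (the "inverted dropout" convention).
func dropoutSketch(xs []float64, p float64, rng *rand.Rand) []float64 {
	out := make([]float64, len(xs))
	for i, v := range xs {
		if rng.Float64() >= p { // keep with probability 1-p
			out[i] = v / (1 - p)
		}
	}
	return out
}

func main() {
	rng := rand.New(rand.NewSource(1))
	acts := []float64{0.5, 1.2, -0.3, 0.8, 2.0}
	fmt.Println(dropoutSketch(acts, 0.2, rng)) // roughly 20% of entries zeroed
}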

Take a look at how w0 is defined. Here, we're saying that w0 is a variable called w0, and that it is a tensor with a shape of (32, 1, 3, 3). This layout follows the Batches, Channels, Height, Width (NCHW/BCHW) convention; for a weight tensor, the first axis counts filters rather than batches. In short, what we're saying is that there are 32 filters we wish to learn, each filter has a height and width of (3, 3), and each has one color channel. MNIST is, after all, black and white.

BCHW is not the only format! Some deep learning frameworks prefer the BHWC format instead. The reason for preferring one format over another is purely operational: some convolution algorithms work better with BCHW, while others work better with BHWC. The ones in Gorgonia work only in BCHW.

The choice of a 3 x 3 filter is somewhat unprincipled, but not without precedent. You could choose a 5 x 5 filter, a 2 x 1 filter, or really, a filter of any shape. However, it has to be said that a 3 x 3 filter is probably the most universal filter that works on all sorts of images. Small square filters of this sort are common in image-processing algorithms, so it is in accordance with such traditions that we chose a 3 x 3.

The weights for the higher layers start to look a bit more interesting. For example, w1 has a shape of (64, 32, 3, 3). Why? In order to understand why, we need to explore the interplay between the activation functions and the shapes. Here's the entire forward function of the convnet:

// This function is particularly verbose for educational reasons. In reality,
// you'd wrap up the layers within a layer struct type and perform per-layer activations.
func (m *convnet) fwd(x *gorgonia.Node) (err error) {
	var c0, c1, c2, fc *gorgonia.Node
	var a0, a1, a2, a3 *gorgonia.Node
	var p0, p1, p2 *gorgonia.Node
	var l0, l1, l2, l3 *gorgonia.Node

	// LAYER 0
	// here we convolve with stride = (1, 1) and padding = (1, 1),
	// which is your bog standard convolution for convnet
	if c0, err = gorgonia.Conv2d(x, m.w0, tensor.Shape{3, 3}, []int{1, 1}, []int{1, 1}, []int{1, 1}); err != nil {
		return errors.Wrap(err, "Layer 0 Convolution failed")
	}
	if a0, err = gorgonia.Rectify(c0); err != nil {
		return errors.Wrap(err, "Layer 0 activation failed")
	}
	if p0, err = gorgonia.MaxPool2D(a0, tensor.Shape{2, 2}, []int{0, 0}, []int{2, 2}); err != nil {
		return errors.Wrap(err, "Layer 0 Maxpooling failed")
	}
	if l0, err = gorgonia.Dropout(p0, m.d0); err != nil {
		return errors.Wrap(err, "Unable to apply a dropout")
	}

	// Layer 1
	if c1, err = gorgonia.Conv2d(l0, m.w1, tensor.Shape{3, 3}, []int{1, 1}, []int{1, 1}, []int{1, 1}); err != nil {
		return errors.Wrap(err, "Layer 1 Convolution failed")
	}
	if a1, err = gorgonia.Rectify(c1); err != nil {
		return errors.Wrap(err, "Layer 1 activation failed")
	}
	if p1, err = gorgonia.MaxPool2D(a1, tensor.Shape{2, 2}, []int{0, 0}, []int{2, 2}); err != nil {
		return errors.Wrap(err, "Layer 1 Maxpooling failed")
	}
	if l1, err = gorgonia.Dropout(p1, m.d1); err != nil {
		return errors.Wrap(err, "Unable to apply a dropout to layer 1")
	}

	// Layer 2
	if c2, err = gorgonia.Conv2d(l1, m.w2, tensor.Shape{3, 3}, []int{1, 1}, []int{1, 1}, []int{1, 1}); err != nil {
		return errors.Wrap(err, "Layer 2 Convolution failed")
	}
	if a2, err = gorgonia.Rectify(c2); err != nil {
		return errors.Wrap(err, "Layer 2 activation failed")
	}
	if p2, err = gorgonia.MaxPool2D(a2, tensor.Shape{2, 2}, []int{0, 0}, []int{2, 2}); err != nil {
		return errors.Wrap(err, "Layer 2 Maxpooling failed")
	}
	log.Printf("p2 shape %v", p2.Shape())

	// flatten the rank-4 tensor into a matrix so it can be multiplied with w3
	var r2 *gorgonia.Node
	b, c, h, w := p2.Shape()[0], p2.Shape()[1], p2.Shape()[2], p2.Shape()[3]
	if r2, err = gorgonia.Reshape(p2, tensor.Shape{b, c * h * w}); err != nil {
		return errors.Wrap(err, "Unable to reshape layer 2")
	}
	log.Printf("r2 shape %v", r2.Shape())
	if l2, err = gorgonia.Dropout(r2, m.d2); err != nil {
		return errors.Wrap(err, "Unable to apply a dropout on layer 2")
	}

	// Layer 3
	if fc, err = gorgonia.Mul(l2, m.w3); err != nil {
		return errors.Wrapf(err, "Unable to multiply l2 and w3")
	}
	if a3, err = gorgonia.Rectify(fc); err != nil {
		return errors.Wrapf(err, "Unable to activate fc")
	}
	if l3, err = gorgonia.Dropout(a3, m.d3); err != nil {
		return errors.Wrapf(err, "Unable to apply a dropout on layer 3")
	}

	// output decode
	var out *gorgonia.Node
	if out, err = gorgonia.Mul(l3, m.w4); err != nil {
		return errors.Wrapf(err, "Unable to multiply l3 and w4")
	}
	if m.out, err = gorgonia.SoftMax(out); err != nil {
		return errors.Wrap(err, "Unable to apply softmax to the output")
	}
	gorgonia.Read(m.out, &m.outVal)
	return
}
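As a hedged usage sketch (the names g, x, and bs here are illustrative, not from the chapter), wiring the pieces together looks something like this:

g := gorgonia.NewExprGraph()
m := newConvNet(g)

// x holds a batch of bs MNIST images in BCHW: one channel, 28 x 28 pixels.
x := gorgonia.NewTensor(g, dt, 4, gorgonia.WithShape(bs, 1, 28, 28), gorgonia.WithName("x"))
if err := m.fwd(x); err != nil {
	log.Fatal(err)
}
// m.out now refers to the graph node holding the class probabilities;
// m.outVal will contain its value once the graph is run by a VM.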

It should be noted that convolution layers do change the shape of the inputs. Given an (N, 1, 28, 28) input, the Conv2d function will return an (N, 32, 28, 28) output, precisely because there are now 32 filters (the padding of (1, 1) keeps the height and width unchanged). The MaxPool2D will return an output with a shape of (N, 32, 14, 14); recall that the purpose of max-pooling is to reduce the amount of information in the neural network. It just so happens that max-pooling with a shape of (2, 2) will nicely halve the height and width of the image (and reduce the amount of information to a quarter).
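These numbers follow from the standard output-size formula, out = (in + 2*pad - kernel)/stride + 1. Here is a small sketch to verify the shapes quoted above:

func outSize(in, kernel, pad, stride int) int {
	return (in+2*pad-kernel)/stride + 1
}

// outSize(28, 3, 1, 1) == 28 → a 3x3 convolution with padding (1, 1) preserves H and W
// outSize(28, 2, 0, 2) == 14 → a (2, 2) max-pool with stride (2, 2) halves H and W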

The output of layer 0 will have a shape of (N, 32, 14, 14). If we stick to our earlier explanation of the shapes, where the format was (N, C, H, W), we might be quite stumped. What does it mean to have 32 channels? To answer that, consider how a color image is encoded in terms of BCHW.

A color image is encoded as three separate channel planes (one each for red, green, and blue), stacked one on top of another. This is a clue as to how to think about having 32 channels. Each of the 32 channels is, of course, the result of applying one of the 32 filters; the extracted features, so to speak. The results can be stacked in the same way the color channels are stacked. This also answers the earlier question about w1: since layer 0 outputs 32 channels, each of the 64 filters in w1 must span 32 input channels, hence the shape (64, 32, 3, 3).

For the most part, however, the mere act of symbol pushing is all that is required to build a deep learning system; no real intelligence is required. This, of course, mirrors the Chinese Room thought experiment, and I have quite a bit to say on that, though this is neither the time nor the place.

The more interesting part is in the construction of Layer 3. Layers 1 and 2 are constructed very similarly to Layer 0, but Layer 3 has a slightly different construction. The reason is that the output of Layer 2 is a rank-4 tensor, while matrix multiplication requires it to be reshaped into a rank-2 tensor.
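Tracing the spatial dimensions makes the reshape concrete. Each (2, 2) max-pool computes (in - 2)/2 + 1, so a 28 x 28 image shrinks as 28 → 14 → 7 → 3 across the three layers. The output of Layer 2 is therefore (N, 128, 3, 3), which the Reshape flattens to (N, 128*3*3) = (N, 1152), exactly matching the number of rows in w3.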

Lastly, the final layer, which decodes the output, uses a softmax activation function to ensure that the result we get is a probability distribution over the ten digit classes.
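For reference, softmax is defined as follows; each output is positive, and the outputs sum to one:

softmax(z)_i = exp(z_i) / Σ_j exp(z_j)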

And really, there you have it. A CNN, written in a very neat way that does not obfuscate the mathematical definitions.
