Recurrent Neural Networks for drawing classification

The model used in this chapter was trained on the dataset used in Google's AI experiment Quick, Draw!

Quick, Draw! is a game where players are challenged to draw a given object to see whether the computer can recognize it; an extract of the data is shown as follows:

The technique was inspired by the work done on handwriting recognition (Google Translate), where, rather than looking at the image as a whole, the team worked with data features describing how the characters were drawn. This is illustrated in the following image:

Source: https://experiments.withgoogle.com/ai/quick-draw

The hypothesis here is that there exists some consistent pattern in how people draw certain types of objects; but to discover those patterns, we would need a lot of data, which we do have. The dataset consists of over 50 million drawings across 345 categories, cleverly crowdsourced from the players of the Quick, Draw! game. Each sample is described by timestamped vectors and associated metadata, including the country the player was based in and the category the player was asked to draw. You can learn more about the dataset from the official repository: https://github.com/googlecreativelab/quickdraw-dataset.

To keep the dataset and training manageable, our model was trained on only 172 of the 345 categories, but the accompanying notebook used to create and train the model is available for those wanting to delve into the details. To get a better understanding of the data, let's take a peek at a single sample, shown here:

{
    "key_id": "5891796615823360",
    "word": "nose",
    "countrycode": "AE",
    "timestamp": "2017-03-01 20:41:36.70725 UTC",
    "recognized": true,
    "drawing": [[[129,128,129,129,130,130,131,132,132,133,133,133,133,...]]]
}
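To make the structure concrete, here is a minimal sketch of loading one such record with Python's `json` module. The field values below are abbreviated from the sample shown previously, and the shortened `drawing` coordinates are made up for illustration:

```python
import json

# One record from the raw dataset (coordinates abbreviated for illustration).
record = json.loads('''
{
  "key_id": "5891796615823360",
  "word": "nose",
  "countrycode": "AE",
  "recognized": true,
  "drawing": [[[129, 128, 130], [42, 45, 47], [0, 16, 33]]]
}
''')

print(record["word"])          # the category label
print(len(record["drawing"]))  # the number of strokes in the sketch
```

The `recognized` flag tells us whether the game's classifier recognized the drawing, which can be useful for filtering samples during training.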

The details of the sketch are broken down into an array of strokes, each described by three parallel arrays containing the x positions, y positions, and timestamps that make up the path of the stroke:

[
    [ // First stroke
        [x0, x1, x2, x3, ...],
        [y0, y1, y2, y3, ...],
        [t0, t1, t2, t3, ...]
    ],
    [ // Second stroke
        [x0, x1, x2, x3, ...],
        [y0, y1, y2, y3, ...],
        [t0, t1, t2, t3, ...]
    ],
    ... // Additional strokes
]
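A quick sketch of walking this structure in Python, flattening the parallel arrays into a single ordered sequence of points (the coordinates and timestamps below are made up for illustration):

```python
# A hypothetical two-stroke drawing in the raw format: each stroke is
# three parallel arrays of x positions, y positions, and timestamps.
drawing = [
    [[10, 20, 30], [5, 5, 5], [0, 16, 33]],   # first stroke
    [[30, 30], [5, 25], [250, 266]],          # second stroke
]

# Flatten into a single ordered sequence of (x, y, t, stroke index) points.
points = []
for stroke_index, (xs, ys, ts) in enumerate(drawing):
    for x, y, t in zip(xs, ys, ts):
        points.append((x, y, t, stroke_index))

print(len(points))  # 5 points across 2 strokes
```

Tagging each point with its stroke index preserves the pen-up/pen-down information that would otherwise be lost by flattening.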

The preceding example comes from the raw dataset; the team behind Quick, Draw! has released many variants of the data, from raw samples to preprocessed and compressed versions. We are mostly interested in the raw and simplified versions: the former because it is the closest representation of the data we will obtain from the user, and the latter because it was used to train the model.

Spoiler: Most of this chapter deals with preprocessing the user input.

Both the raw and simplified versions store each category in an individual file in the NDJSON file format.

The NDJSON file format, short for newline-delimited JSON, is a convenient format for storing and streaming structured data that can be processed one record at a time. As the name suggests, it stores multiple JSON-formatted objects, one per line. In our case, this means each sample is stored as a separate object delimited by a newline; you can learn more about the format at http://ndjson.org.
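Because each line is a complete JSON object, parsing an NDJSON file is simply a matter of reading it line by line. The following sketch simulates a category file with `io.StringIO`; in practice, you would open one of the dataset's `.ndjson` files instead, and the abbreviated records here are made up:

```python
import io
import json

# Simulate an NDJSON file: one JSON object per line. In practice you would
# use open("nose.ndjson") on a downloaded category file.
ndjson = io.StringIO(
    '{"word": "nose", "recognized": true}\n'
    '{"word": "nose", "recognized": false}\n'
)

# Parse one record per line, skipping any blank lines.
samples = [json.loads(line) for line in ndjson if line.strip()]

# Example of record-at-a-time filtering: keep only recognized drawings.
recognized = [s for s in samples if s["recognized"]]
print(len(samples), len(recognized))  # prints: 2 1
```

Streaming line by line like this means we never need to hold an entire category file (which can be hundreds of megabytes) in memory at once.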

You may be wondering what the difference is between the raw and simplified versions. We will go into the details when we build the preprocessing functionality required for this application, but, as the name implies, the simplified version reduces the complexity of each stroke by removing any unnecessary points, along with applying some level of standardization, a typical requirement when dealing with any data to make the samples more comparable.
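A standard technique for this kind of point removal is the Ramer-Douglas-Peucker (RDP) algorithm, which drops any point that lies within some tolerance (epsilon) of the line joining its neighbors. Below is a minimal pure-Python sketch; the sample stroke coordinates and epsilon values are made up for illustration:

```python
import math

def point_line_distance(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    if a == b:
        return math.dist(p, a)
    (x, y), (x1, y1), (x2, y2) = p, a, b
    num = abs((x2 - x1) * (y1 - y) - (x1 - x) * (y2 - y1))
    return num / math.dist(a, b)

def rdp(points, epsilon):
    """Ramer-Douglas-Peucker: drop points closer than epsilon to the chord."""
    if len(points) < 3:
        return points
    # Find the point farthest from the line joining the two endpoints.
    index, max_dist = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_line_distance(points[i], points[0], points[-1])
        if d > max_dist:
            index, max_dist = i, d
    if max_dist > epsilon:
        # Keep the farthest point and recurse on both halves.
        return rdp(points[:index + 1], epsilon)[:-1] + rdp(points[index:], epsilon)
    # Everything in between is close enough to the chord to discard.
    return [points[0], points[-1]]

# With a generous tolerance, this stroke collapses to its two endpoints.
stroke = [(0, 0), (1, 0.1), (2, -0.1), (3, 5), (4, 6), (5, 7)]
print(rdp(stroke, epsilon=2.0))   # [(0, 0), (5, 7)]
print(rdp(stroke, epsilon=0.5))   # a smaller epsilon keeps more detail
```

The smaller the epsilon, the more faithful (and heavier) the simplified stroke; we will return to the exact preprocessing steps when we build them for this application.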

Now that we have a better understanding of the data we are dealing with, let's turn our attention to building up some level of intuition of how we can learn from these sequences, by briefly discussing the details of the model used in this chapter. 

In previous chapters, we saw many examples of how CNNs can learn useful patterns from local 2D patches, which themselves can be built upon to further abstract from raw pixels into something with more descriptive power. This is fairly intuitive given that we understand an image not as a set of independent pixels but as a collection of pixels related to their neighbors, which in turn describe parts of an object. In Chapter 1, Introduction to Machine Learning, we introduced the Recurrent Neural Network (RNN), a major component of the Sequence to Sequence (Seq2Seq) model used for language translation, and we saw how its ability to remember makes it well suited to data made up of sequences where order matters. As highlighted previously, our samples are made up of sequences of strokes, making the RNN a likely candidate for learning to classify sketches.

As a quick recap, RNNs implement a type of selective memory using a feedback loop, which itself is adjusted during training; diagrammatically this is shown as follows:

  

On the left is the actual network, and on the right is the same network unrolled across four time steps. As the points of the sketch's strokes are fed in, they are multiplied by the layer's weights along with the current state before being fed back in and/or output. During training, this feedback allows the network to learn patterns in an ordered sequence. We can stack these recurrent layers on top of each other to learn more complex and abstract patterns, as we did with CNNs.
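The feedback loop described above can be sketched in a few lines of pure Python. This toy example uses a single hidden unit with made-up, fixed weights (a real network has many units and learns its weights during training), but it shows how each step mixes the new input with the carried-over state:

```python
import math

# Toy single-unit RNN: scalar input, scalar hidden state.
# These weights are made-up constants; a real network learns them.
w_input, w_recurrent, bias = 0.5, 0.8, 0.0

def rnn_step(x, h_prev):
    """One recurrent step: the new state mixes the input with the old state."""
    return math.tanh(w_input * x + w_recurrent * h_prev + bias)

sequence = [1.0, 0.5, -1.0, 0.0]
h = 0.0
states = []
for x in sequence:        # this loop is the "unrolling" in the diagram
    h = rnn_step(x, h)
    states.append(h)

print(states)  # the final state depends on the whole ordered sequence
```

Feeding the same values in a different order produces a different final state, which is exactly the order sensitivity that makes RNNs suitable for stroke sequences.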

But recurrent layers are not the only way to learn patterns from sequential data. If you generalize the concept of CNNs to something able to learn local patterns across any number of dimensions (as opposed to just two), then you can see how we could use 1D convolutional layers to achieve a similar effect to our recurrent layers. Here, similar to 2D convolutional layers, we learn 1D kernels across sequences (treating time as a spatial dimension) to find local patterns that represent our data. Using a convolutional layer has the advantage of being considerably cheaper computationally than its recurrent counterpart, making it ideal for processor- and power-constrained devices, such as mobile phones. It also has the advantage of learning patterns independent of their position in the sequence, similar to how 2D kernels are invariant to position. In this figure, we illustrate how the 1D convolutional layer operates on input data:
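Mechanically, a 1D convolution just slides a small kernel along the sequence and takes a weighted sum at each position. Here is a minimal pure-Python sketch; the signal and the hand-picked "edge detector" kernel are made up for illustration (a real layer learns many kernels during training):

```python
# A 1D convolution: slide a kernel along a sequence and take weighted sums.
def conv1d(sequence, kernel):
    k = len(kernel)
    return [
        sum(kernel[j] * sequence[i + j] for j in range(k))
        for i in range(len(sequence) - k + 1)
    ]

# A hand-picked "difference" kernel fires wherever the sequence jumps,
# regardless of where in the sequence the jump occurs.
signal = [0, 0, 0, 1, 1, 1, 0, 0]
kernel = [-1, 1]
print(conv1d(signal, kernel))  # non-zero only at the two jumps
```

Note that the same kernel detects the jump wherever it happens in the sequence, which is the position invariance referred to above.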

 

In this context, strokes (local to the window size) will be learned, independent of where they are in the sequence, and a compact representation will be outputted, which we can then feed into an RNN to learn ordered sequences from these strokes (rather than from raw points). Intuitively you can think of our model as initially learning strokes such as vertical and horizontal strokes (independent of time), and then learning (in our subsequent layers made up of RNNs) higher-order patterns such as shapes from the ordered sequence of these strokes. The following figure illustrates this concept:

On the left, we have the raw points inputted into the model. The middle part shows how a 1D convolutional layer can learn local patterns from these points in a form of strokes. And finally, at the far right, we have the subsequent RNNs learning order-sensitive patterns from the sequence of these strokes.

There is one more concept to introduce before presenting the model, but before doing so, I want you to quickly think of how you draw a square. Do you draw it in a clockwise or anti-clockwise direction?

The last concept I want to briefly introduce in this section is the bidirectional layer; bidirectional layers attempt to make our network invariant to the answer to the previous question. We discussed earlier how RNNs are sensitive to order, which is precisely why they are useful here, but, as the square exercise hopefully highlighted, a sketch may be drawn in either direction. To account for this, we can use a bidirectional layer, which, as the name implies, processes the input sequence in two directions (chronologically and anti-chronologically) and then merges the resulting representations. By processing a sequence in both directions, our model can become somewhat invariant to the direction in which we draw.
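The idea can be sketched by reusing the toy recurrence from before: run it once forward, once backward, and merge the two results. All weights and inputs below are made up for illustration:

```python
import math

def run(sequence, w_in=0.5, w_rec=0.8):
    """Run a toy single-unit recurrence over a sequence; return final state."""
    h = 0.0
    for x in sequence:
        h = math.tanh(w_in * x + w_rec * h)
    return h

def bidirectional(sequence):
    forward = run(sequence)                    # chronological pass
    backward = run(list(reversed(sequence)))   # anti-chronological pass
    return (forward, backward)                 # merged representation

clockwise = [1.0, 0.5, -1.0, 0.0]
anticlockwise = list(reversed(clockwise))

# Reversing the drawing direction simply swaps the two halves of the
# merged representation, so a downstream layer sees the same information.
print(bidirectional(clockwise), bidirectional(anticlockwise))
```

A real bidirectional layer typically concatenates (or sums) the two per-timestep state sequences rather than just the final states, but the principle is the same.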

We have now introduced all the building blocks used for this model; the following figure shows the model in its entirety:

 
As a reminder, this book is focused on the application of machine learning with Core ML. Therefore, we won't go into the details of this (or any) model, but will cover just enough to give you an intuitive understanding of how the model works so that you can use and explore it further.

As shown previously, our model comprises a stack of 1D convolutional layers feeding into a stack of Long Short-Term Memory (LSTM) layers, an implementation of an RNN, before being fed into a fully connected layer where our prediction is made. The model was trained on 172 categories, each with 10,000 training samples and 1,000 validation samples. After 16 epochs, the model achieved approximately 78% accuracy on both the training and validation data, as shown here:

We now have our model but have skimmed across what we are actually feeding into our model. In the next section, we will discuss what our model was trained with (and therefore expecting) and implement the required functionality to prepare it. 
