Chapter 12. Magic Wand: Training a Model

In Chapter 11, we used a 20 KB pretrained model to interpret raw accelerometer data, using it to identify which of a set of gestures was performed. In this chapter, we show you how this model was trained, and then we talk about how it actually works.

Our wake-word and person detection models both required large amounts of data to train. This is mostly due to the complexity of the problems they were trying to solve. There are a huge number of different ways in which a person can say “yes” or “no”—think of all the variations of accent, intonation, and pitch that make someone’s voice unique. Similarly, a person can appear in an image in an infinite variety of ways; you might see their face, their whole body, or a single hand, and they could be standing in any possible pose.

So that it can accurately classify such a diversity of valid inputs, a model needs to be trained on an equally diverse set of training data. This is why our datasets for wake-word and person detection training were so large, and why training takes so long.

Our magic wand gesture recognition problem is a lot simpler. In this case, rather than trying to classify a huge range of natural voices or human appearances and poses, we’re attempting to understand the differences between three specific and deliberately selected gestures. Although there’ll be some variation in the way different people perform each gesture, we’re hoping that our users will strive to perform the gestures as correctly and uniformly as possible.

This means that there’ll be a lot less variation in our expected valid inputs, which makes it a lot easier to train an accurate model without needing vast amounts of data. In fact, the dataset we’ll be using to train the model contains only around 150 examples for each gesture and is only 1.5 MB in size. It’s exciting to think about how a useful model can be trained on such a small dataset, because obtaining sufficient data is often the most difficult part of a machine learning project.

In the first part of this chapter, you’ll learn how to train the original model used in the magic wand application. In the second part, we’ll talk about how this model actually works. And finally, you’ll see how you can capture your own data and train a new model that recognizes different gestures.

Training a Model

To train our model, we use training scripts located in the TensorFlow repository. You can find them in magic_wand/train.

The scripts perform the following tasks:

  • Prepare raw data for training.

  • Generate synthetic data.1

  • Split the data for training, validation, and testing.

  • Perform data augmentation.

  • Define the model architecture.

  • Run the training process.

  • Convert the model into the TensorFlow Lite format.

To make life easy, the scripts are accompanied by a Jupyter notebook which demonstrates how to use them. You can run the notebook in Colaboratory (Colab) on a GPU runtime. With our tiny dataset, training will take only a few minutes.

To begin, let’s walk through the training process in Colab.

Training in Colab

Open the Jupyter notebook at magic_wand/train/train_magic_wand_model.ipynb and click the “Run in Google Colab” button, as shown in Figure 12-1.

The 'Run in Google Colab' button
Figure 12-1. The “Run in Google Colab” button
Note

As of this writing, there’s a bug in GitHub that results in intermittent error messages when displaying Jupyter notebooks. If you see the message “Sorry, something went wrong. Reload?” when trying to access the notebook, follow the instructions in “Building Our Model”.

This notebook walks through the process of training the model. It includes the following steps:

  • Installing dependencies

  • Downloading and preparing the data

  • Loading TensorBoard to visualize the training process

  • Training the model

  • Generating a C source file

Enable GPU Training

Training this model should be very quick, but it will be even faster if we use a GPU runtime. To enable this option, go to Colab’s Runtime menu and choose “Change runtime type,” as illustrated in Figure 12-2.

This opens the “Notebook settings” dialog box shown in Figure 12-3.

From the “Hardware accelerator” drop-down list, select GPU, as depicted in Figure 12-4, and then click SAVE.

You’re now ready to run the notebook.

The 'Change runtime type' option in Colab
Figure 12-2. The “Change runtime type” option in Colab
The 'Notebook settings' box
Figure 12-3. The “Notebook settings” dialog box
The 'Hardware accelerator' dropdown
Figure 12-4. The “Hardware accelerator” drop-down list

Install dependencies

The first step is to install the required dependencies. In the “Install dependencies” section, run the cells to install the correct versions of TensorFlow and grab a copy of the training scripts.

Prepare the data

Next, in the “Prepare the data” section, run the cells to download the dataset and split it into training, validation, and test sets.

The first cell downloads and extracts the dataset into the training scripts’ directory. The dataset consists of four directories: one for each gesture (“wing,” “ring,” and “slope”), plus a “negative” directory holding data that doesn’t represent any distinct gesture. Each directory contains files of raw accelerometer data captured for that category:

data/
├── slope
│   ├── output_slope_dengyl.txt
│   ├── output_slope_hyw.txt
│   └── ...
├── ring
│   ├── output_ring_dengyl.txt
│   ├── output_ring_hyw.txt
│   └── ...
├── negative
│   ├── output_negative_1.txt
│   └── ...
└── wing
    ├── output_wing_dengyl.txt
    ├── output_wing_hyw.txt
    └── ...

There are 10 files for each gesture, which we’ll walk through later on. Each file contains a gesture being demonstrated by a named individual, with the last part of the filename corresponding to their user ID. For example, the file output_slope_dengyl.txt contains data for the “slope” gesture being demonstrated by a user whose ID is dengyl.

There are approximately 15 individual performances of a given gesture in each file, one accelerometer reading per row, with each performance being prefixed by the row -,-,-:

 -,-,-
-766.0,132.0,709.0
-751.0,249.0,659.0
-714.0,314.0,630.0
-709.0,244.0,623.0
-707.0,230.0,659.0

Each performance consists of a log of up to a few seconds’ worth of data, with 25 rows per second. The gesture itself occurs at some point within that window, with the device being held still for the remainder of the time.

Due to the way the measurements were captured, the files also contain some garbage characters. Our first training script, data_prepare.py, which is run in our second training cell, will clean up this dirty data:

# Prepare the data
!python data_prepare.py

This script is designed to read the raw data files from their folders, ignore any garbage characters, and write them in a sanitized form to another location within the training scripts’ directory (data/complete_data). Cleaning up messy data sources is a common task when training machine learning models given that it’s very common for errors, corruption, and other issues to creep into large datasets.
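To give a rough flavor of what this cleanup involves, the sketch below keeps only well-formed reading rows and the -,-,- delimiters and discards everything else. It’s a simplified stand-in for the logic in data_prepare.py, not the script’s actual implementation:

import re

# Rows look like "-766.0,132.0,709.0"; "-,-,-" marks the start of a performance
VALID_ROW = re.compile(r"^(-?\d+\.\d+,){2}-?\d+\.\d+$|^-,-,-$")

def clean_file(in_path, out_path):
    """Copy only valid rows to the output, dropping any garbage characters."""
    with open(in_path) as src, open(out_path, "a") as dst:
        for line in src:
            line = line.strip()
            if VALID_ROW.match(line):
                dst.write(line + "\n")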

In addition to cleaning the data, the script generates some synthetic data. This is a term for data that is generated algorithmically, rather than being captured from the real world. In this case, the generate_negative_data() function in data_prepare.py creates synthetic data that is equivalent to movement of the accelerometer that doesn’t correspond to any particular gesture. This data is used to train our “unknown” category.

Because creating synthetic data is much faster than capturing real-world data, it’s useful to help augment our training process. However, real-world variation is unpredictable, so it’s not often possible to create an entire dataset from synthetic data. In our case, it’s helpful for making our “unknown” category more robust, but it wouldn’t be helpful for classifying the known gestures.
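To illustrate what synthetic “no gesture” data can look like, here is a minimal sketch that produces random, slowly drifting motion plus sensor noise. The real generate_negative_data() function in data_prepare.py is more involved; this only captures the idea, and the amplitude and noise values are arbitrary assumptions:

import numpy as np

def synthesize_negative(seq_length=128, rng=np.random.default_rng(42)):
    """Return a (seq_length, 3) array of fake accelerometer readings in milli-Gs."""
    t = np.linspace(0, 1, seq_length)[:, None]
    # A slow, random drift on each axis, as if the device were idly waving around
    drift = rng.uniform(-300, 300, size=(1, 3)) * np.sin(2 * np.pi * rng.uniform(0.2, 1.0) * t)
    # Plus some high-frequency sensor noise
    noise = rng.normal(0, 30, size=(seq_length, 3))
    return drift + noise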

The next script to run in the second cell is data_split_person.py:

# Split the data by person
!python data_split_person.py

This script splits the data into training, validation, and test sets. Because our data is labeled with the person who created it, we’re able to use one set of people’s data for training, another set for validation, and a final set for test. The data is split as follows:

train_names = [
    "hyw", "shiyun", "tangsy", "dengyl", "jiangyh", "xunkai", "negative3",
    "negative4", "negative5", "negative6"
]
valid_names = ["lsj", "pengxl", "negative2", "negative7"]
test_names = ["liucx", "zhangxy", "negative1", "negative8"]

We use six people’s data for training, two for validation, and two for testing. In addition, we mix in our negative data, which isn’t associated with a particular user. Our total data is split between the three sets at a ratio of roughly 60%/20%/20%, which is pretty standard for machine learning.

In splitting by person, we’re trying to ensure that our model will be able to generalize to new data. Because the model will be validated and tested on data from individuals who were not included in the training dataset, the model will need to be robust against individual variations in how each person performs each gesture.
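To make the per-person split concrete, the following sketch routes each file to a set based on the user ID at the end of its filename. This is a simplified illustration of the approach taken by data_split_person.py, not its actual code:

import glob
import os

train_names = ["hyw", "shiyun", "tangsy", "dengyl", "jiangyh", "xunkai"]
valid_names = ["lsj", "pengxl"]
test_names = ["liucx", "zhangxy"]

def assign_split(filename):
    """Route a file such as output_ring_hyw.txt to a set based on its user ID."""
    person = os.path.splitext(filename)[0].split("_")[-1]
    if person in train_names:
        return "train"
    if person in valid_names:
        return "valid"
    if person in test_names:
        return "test"
    return None  # negative data is mixed into the sets separately

for path in glob.glob("data/*/output_*.txt"):
    print(path, "->", assign_split(os.path.basename(path)))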

It’s also possible to split the data randomly, instead of by person. In this case, the training, validation, and testing datasets would each contain some samples of each gesture from every single individual. The resulting model will have been trained on data from every single person rather than just six, so it will have had more exposure to people’s varying gesturing styles.

However, because the validation and test sets also contain data from every individual, we’d have no way of testing whether the model is able to generalize to new gesturing styles that it has not seen before. A model developed in this way might report higher accuracy during validation and testing, but it would not be guaranteed to work as well with new data.

Make sure you’ve run both cells in the “Prepare the data” section before continuing.

Load TensorBoard

After the data has been prepared, we can run the next cell to load TensorBoard, which will help us monitor the training process:

# Load TensorBoard
%load_ext tensorboard
%tensorboard --logdir logs/scalars

Training logs will be written to the logs/scalars subdirectory of the training scripts’ directory, so we pass this in to TensorBoard.

Begin training

After TensorBoard has loaded, it’s time to begin training. Run the following cell:

!python train.py --model CNN --person true

The script train.py sets up the model architecture, loads the data using data_load.py, and begins the training process.

As the data is loaded, load_data.py also performs data augmentation using code defined in data_augmentation.py. The function augment_data() takes data representing a gesture and creates a number of new versions of it, each modified slightly from the original. The modifications include shifting and warping the datapoints in time, adding random noise, and increasing the amount of acceleration. This augmented data is used alongside the original data to train the model, helping make the most of our small dataset.
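The sketch below shows the spirit of these transformations: time shifting, amplitude scaling, and added noise. It isn’t the actual augment_data() implementation, and the specific ranges are arbitrary assumptions:

import numpy as np

def augment_gesture(data, copies=3, rng=np.random.default_rng()):
    """data is a (128, 3) array of x/y/z readings; returns `copies` modified versions."""
    augmented = []
    for _ in range(copies):
        sample = data.copy()
        # Shift the gesture slightly earlier or later in the window
        sample = np.roll(sample, shift=int(rng.integers(-10, 11)), axis=0)
        # Scale the overall acceleration up or down a little
        sample = sample * rng.uniform(0.9, 1.1)
        # Add random noise to every reading
        sample = sample + rng.normal(0, 20, size=sample.shape)
        augmented.append(sample)
    return augmented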

As training ramps up, you’ll see some output appearing below the cell you just ran. There’s a lot there, so let’s pick out the most noteworthy parts. First, Keras generates a nice table that shows the architecture of our model:

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d (Conv2D)              (None, 128, 3, 8)         104
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 42, 1, 8)          0
_________________________________________________________________
dropout (Dropout)            (None, 42, 1, 8)          0
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 42, 1, 16)         528
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 14, 1, 16)         0
_________________________________________________________________
dropout_1 (Dropout)          (None, 14, 1, 16)         0
_________________________________________________________________
flatten (Flatten)            (None, 224)               0
_________________________________________________________________
dense (Dense)                (None, 16)                3600
_________________________________________________________________
dropout_2 (Dropout)          (None, 16)                0
_________________________________________________________________
dense_1 (Dense)              (None, 4)                 68
=================================================================

It tells us all the layers that are used, along with their shapes and their numbers of parameters—which is another term for weights and biases. You can see that our model uses Conv2D layers, as it’s a convolutional model. Not shown in this table is the fact that our model’s input shape is (None, 128, 3). We’ll look more closely at the model’s architecture later.

The output will also show us an estimate of the model’s size:

Model size: 16.796875 KB

This represents the amount of memory that will be taken up by the model’s trainable parameters. It doesn’t include the extra space required to store the model’s execution graph, so our actual model file will be slightly larger, but it gives us an idea of the correct order of magnitude. This will definitely qualify as a tiny model!
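We can sanity-check that figure against the parameter counts in the summary above: 4,300 parameters stored as 4-byte floating-point values works out to exactly the reported size.

# Parameter counts from the Keras summary above
params = 104 + 528 + 3600 + 68      # 4,300 trainable parameters
print(params * 4 / 1024)            # 4 bytes each -> 16.796875 KB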

You’ll eventually see the training process itself begin:

1000/1000 [==============================] - 12s 12ms/step - loss: 7.6510 - accuracy: 0.5207 - val_loss: 4.5836 - val_accuracy: 0.7206

At this point, you can take a look at TensorBoard to see the training process moving along.

Evaluate the results

When training is complete, we can look at the cell’s output for some useful information. First, we can see that the validation accuracy in our final epoch looks very promising at 0.9743, and the loss is nice and low, too:

Epoch 50/50
1000/1000 [==============================] - 7s 7ms/step - loss: 0.0568 -

accuracy: 0.9835 - val_loss: 0.1185 - val_accuracy: 0.9743

This is great, especially as we’re using a per-person data split, meaning our validation data is from a completely different set of individuals. However, we can’t just rely on our validation accuracy to evaluate our model. Because the model’s hyperparameters and architecture were hand-tuned on the validation dataset, we might have overfit it.

To get a better understanding of our model’s final performance, we can evaluate it against our test dataset by calling Keras’s model.evaluate() function. The next line of output shows the results of this:

6/6 [==============================] - 0s 6ms/step - loss: 0.2888 - accuracy: 0.9323

Although not as amazing as the validation numbers, the model shows a good-enough accuracy of 0.9323, with a loss that is still low. The model will predict the correct class 93% of the time, which should be fine for our purposes.

The next few lines show the confusion matrix for the results, calculated by the tf.math.confusion_matrix() function:

tf.Tensor(
[[ 75   3   0   4]
 [  0  69   0  15]
 [  0   0  85   3]
 [  0   0   1 129]], shape=(4, 4), dtype=int32)

A confusion matrix is a helpful tool for evaluating the performance of classification models. It shows how well the predicted class of each input in the test dataset agrees with its actual value.

Each column of the confusion matrix corresponds to a predicted label, in order (“wing,” “ring,” “slope,” then “unknown”). Each row, from the top down, corresponds to the actual label. From our confusion matrix, we can see that the vast majority of predictions agree with the actual labels. We can also see the specific places where confusion is occurring: most significantly, a fair number of inputs were misclassified as “unknown,” especially those belonging to the “ring” category.

The confusion matrix gives us an idea of where our model’s weak points are. In this case, it informs us that it might be beneficial to obtain more training data for the “ring” gesture in order to help the model better learn the differences between “ring” and “unknown.”
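Since each row holds an actual label and each column a predicted label, a few lines of NumPy are enough to turn the matrix into per-class accuracy figures, which makes the “ring” weakness easy to quantify:

import numpy as np

labels = ["wing", "ring", "slope", "unknown"]
cm = np.array([[ 75,   3,  0,   4],
               [  0,  69,  0,  15],
               [  0,   0, 85,   3],
               [  0,   0,  1, 129]])

# Rows are actual labels, columns are predicted labels
for i, name in enumerate(labels):
    per_class = cm[i, i] / cm[i].sum()
    print(f"{name}: {per_class:.1%} of actual examples classified correctly")

print(f"overall: {np.trace(cm) / cm.sum():.4f}")  # 0.9323, matching the test accuracy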

The final thing that train.py does is convert the model to TensorFlow Lite format, in both floating-point and quantized variations. The following output reveals the sizes of each variant:

Basic model is 19544 bytes
Quantized model is 8824 bytes
Difference is 10720 bytes

Our 20 KB model shrinks down to 8.8 KB after quantization. This is a very tiny model, and a great result.
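The conversion is handled for us by train.py, but for reference, here’s roughly what a floating-point and a post-training quantized conversion look like with the TFLiteConverter API. The model and calibration_samples names are assumptions standing in for the trained Keras model and a batch of representative input windows:

import tensorflow as tf

# Floating-point conversion
converter = tf.lite.TFLiteConverter.from_keras_model(model)
float_tflite = converter.convert()

# Quantized conversion: representative samples let the converter calibrate
# 8-bit ranges for the weights and activations
def representative_dataset():
    for sample in calibration_samples:  # e.g. a few hundred (128, 3, 1) windows
        yield [sample.reshape(1, 128, 3, 1).astype("float32")]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
quantized_tflite = converter.convert()

print(len(float_tflite), len(quantized_tflite))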

Create a C array

The next cell, in the “Create a C source file” section, transforms this into a C source file. Run this cell to see the output:

# Install xxd if it is not available
!apt-get -qq install xxd
# Save the file as a C source file
!xxd -i model_quantized.tflite > /content/model_quantized.cc
# Print the source file
!cat /content/model_quantized.cc

We can copy and paste the contents of this file into our project so that we can use the newly trained model in our application. Later, you’ll learn how to collect new data and teach the application to understand new gestures. For now, let’s keep moving.

Other Ways to Run the Scripts

If you’d prefer not to use Colab, or you’re making changes to the model training scripts and would like to test them out locally, you can easily run the scripts from your own development machine. You can find the instructions in README.md.

Next up, we walk through how the model itself works.

How the Model Works

So far, we’ve established that our model is a convolutional neural network (CNN) and that it transforms a sequence of 128 three-axis accelerometer readings, representing around five seconds of time, into an array of four probabilities: one for each gesture, and one for “unknown.”

CNNs are used when the relationships between adjacent values contain important information. In the first part of our explanation, we’ll take a look at our data and learn why a CNN is well suited to making sense of it.

Visualizing the Input

In our time-series accelerometer data, adjacent accelerometer readings give us clues about the device’s motion. For example, if acceleration on one axis changes rapidly from zero to positive, then back to zero, the device might have begun motion in that direction. Figure 12-5 shows a hypothetical example of this.

A graph of accelerometer values for a single axis of a device being moved
Figure 12-5. Accelerometer values for a single axis of a device being moved

Any given gesture is composed of a series of motions, one after the other. For example, consider our “wing” gesture, shown in Figure 12-6.

Diagram showing the 'wing' gesture
Figure 12-6. The “wing” gesture

The device is first moved down and to the right, then up and to the right, then down and to the right, then up and to the right again. Figure 12-7 shows a sample of real data captured during the “wing” gesture, measured in milli-Gs.

A graph of accelerometer values during the 'wing' gesture
Figure 12-7. Accelerometer values during the “wing” gesture

By looking at this graph and breaking it down into its component parts, we can understand which gesture is being made. From the z-axis acceleration, it’s very clear that the device is being moved up and down in the way we would expect given the “wing” gesture’s shape. More subtly, we can see how the acceleration on the x-axis correlates with the z-axis changes in a way that indicates the device’s motion across the width of the gesture. Meanwhile, we can observe that the y-axis remains mostly stable.

Similarly, a CNN with multiple layers is able to learn how to discern each gesture through its telltale component parts. For example, a network might learn to recognize an up-and-down motion, and that two of them in a row, combined with the appropriate x- and y-axis movements, indicate a “wing” gesture.

To do this, a CNN learns a series of filters, arranged in layers. Each filter learns to spot a particular type of feature in the data. When it notices this feature, it passes this high-level information to the next layer of the network. For example, one filter in the first layer of the network might learn to spot something simple, like a period of upward acceleration. When it identifies such a structure, it passes this information to the next layer of the network.

Subsequent layers of filters learn how the outputs of earlier, simpler filters are composed together to form larger structures. For example, a series of four alternating upward and downward accelerations might fit together to represent the “W” shape in our “wing” gesture.

In this process, the noisy input data is progressively transformed into a high-level, symbolic representation. Subsequent layers of our network can analyze this symbolic representation to guess which gesture was performed.

In the next section, we walk through the actual model architecture and see how it maps onto this process.

Understanding the Model Architecture

The architecture of our model is defined in train.py, in the build_cnn() function. This function uses the Keras API to define a model, layer by layer:

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D( # input_shape=(batch, 128, 3)
        8, (4, 3),
        padding="same",
        activation="relu",
        input_shape=(seq_length, 3, 1)),  # output_shape=(batch, 128, 3, 8)
    tf.keras.layers.MaxPool2D((3, 3)),  # (batch, 42, 1, 8)
    tf.keras.layers.Dropout(0.1),  # (batch, 42, 1, 8)
    tf.keras.layers.Conv2D(16, (4, 1), padding="same",
                            activation="relu"),  # (batch, 42, 1, 16)
    tf.keras.layers.MaxPool2D((3, 1), padding="same"),  # (batch, 14, 1, 16)
    tf.keras.layers.Dropout(0.1),  # (batch, 14, 1, 16)
    tf.keras.layers.Flatten(),  # (batch, 224)
    tf.keras.layers.Dense(16, activation="relu"),  # (batch, 16)
    tf.keras.layers.Dropout(0.1),  # (batch, 16)
    tf.keras.layers.Dense(4, activation="softmax")  # (batch, 4)
])

This is a sequential model, meaning the output of each layer is passed directly into the next one. Let’s walk through the layers one by one and explore what’s going on. The first layer is a Conv2D:

tf.keras.layers.Conv2D(
    8, (4, 3),
    padding="same",
    activation="relu",
    input_shape=(seq_length, 3, 1)),  # output_shape=(batch, 128, 3, 8)

This is a convolutional layer; it directly receives our network’s input, which is a sequence of raw accelerometer data. The input’s shape is provided in the input_shape argument. It’s set to (seq_length, 3, 1), where seq_length is the total number of accelerometer measurements that are passed in (128 by default). Each measurement is composed of three values, representing the x-, y-, and z-axes. The input is visualized in Figure 12-8.

A diagram of the model's input
Figure 12-8. The model’s input

The job of our convolutional layer is to take this raw data and extract some basic features that can be interpreted by subsequent layers. The arguments to the Conv2D() function determine how many features will be extracted. The arguments are described in the tf.keras.layers.Conv2D() documentation.

The first argument determines how many filters the layer will have. During training, each filter learns to identify a particular feature in the raw data—for example, one filter might learn to identify the telltale signs of an upward motion. For each filter, the layer outputs a feature map that shows where the feature it has learned occurs within the input.

The layer defined in our code has eight filters, meaning that it will learn to recognize and output eight different types of high-level features from the input data. You can see this reflected in the output shape, (batch_size, 128, 3, 8), which has eight feature channels in its final dimension, one for each feature. The value in each channel indicates the degree to which a feature was present in that location of the input.

As we learned in Chapter 8, convolutional layers slide a window across the data and decide whether a given feature is present in that window. The second argument to Conv2D() is where we provide the dimensions of this window. In our case, it’s (4, 3). This means that the features for which our filters are hunting span four consecutive accelerometer measurements and all three axes. Because the window spans four measurements, each filter analyzes a small snapshot of time, meaning it can generate features that represent a change in acceleration over time. You can see how this works in Figure 12-9.

A diagram of a convolution window overlaid on the data
Figure 12-9. A convolution window overlaid on the data

The padding argument determines how the window will be moved across the data. When padding is set to "same", the layer’s output will have the same length (128) and width (3) as the input. Because every movement of the filter window results in a single output value, the "same" argument means the window must be moved three times across the data, and 128 times down it.

Because the window has a width of 3, this means it must start by overhanging the lefthand side of the data. The empty spaces, where the filter window doesn’t cover an actual value, are padded with zeros. To move a total of 128 times down the length of the data, the filter must also overhang the top of the data. You can see how this works in Figures 12-10 and 12-11.
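A quick way to convince yourself of this behavior is to run a dummy input through the layer and check the shape; with "same" padding, the 128 × 3 dimensions are preserved:

import tensorflow as tf

layer = tf.keras.layers.Conv2D(8, (4, 3), padding="same", activation="relu")
dummy_input = tf.zeros([1, 128, 3, 1])   # (batch, seq_length, axes, channels)
print(layer(dummy_input).shape)          # (1, 128, 3, 8)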

As soon as the convolution window has moved across all the data, using each filter to create eight different feature maps, the output will be passed to our next layer, MaxPool2D:

tf.keras.layers.MaxPool2D((3, 3)),  # (batch, 42, 1, 8)
A diagram of a convolution window moving across the data
Figure 12-10. The convolution window in its first position, necessitating padding on the top and left sides
A diagram of a convolution window moving across the data
Figure 12-11. The same convolution window having moved to its second position, requiring padding only on the top

This MaxPool2D layer takes the output of the previous layer, a (128, 3, 8) tensor, and shrinks it down to a (42, 1, 8) tensor, roughly a third of the original size in each dimension. It does this by looking at a window of input data, selecting the largest value in the window, and propagating only that value to the output. The process is then repeated with the next window of data. The argument provided to the MaxPool2D() function, (3, 3), specifies that a 3 × 3 window should be used. By default, the window is always moved so that it contains entirely new data. Figure 12-12 shows how this process works.

A diagram of max pooling at work
Figure 12-12. Max pooling at work

Note that although the diagram shows a single value for each element, our data actually has eight feature channels per element.

But why do we need to shrink our input like this? When used for classification, the goal of a CNN is to transform a big, complex input tensor into a small, simple output. The MaxPool2D layer helps make this happen. It boils down the output of our first convolutional layer into a concentrated, high-level representation of the relevant information that it contains.

By concentrating the information, we begin to strip out things that aren’t relevant to the task of identifying which gesture was contained within the input. Only the most significant features, which were maximally represented in the first convolutional layer’s output, are preserved. It’s interesting to note that even though our original input had three accelerometer axes for each measurement, a combination of Conv2D and MaxPool2D has now merged these together into a single value.
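Here’s a small demonstration of both points: max pooling keeps only the largest value in each window, and applying a 3 × 3 pool to our feature maps collapses (128, 3, 8) down to (42, 1, 8):

import tensorflow as tf

# A single 3x3 window of values; pooling keeps only the maximum
window = tf.constant([[1., 5., 2.],
                      [0., 3., 7.],
                      [4., 6., 2.]])
pooled = tf.keras.layers.MaxPool2D((3, 3))(tf.reshape(window, [1, 3, 3, 1]))
print(pooled.numpy().squeeze())  # 7.0

# Applied to the first convolutional layer's output shape
print(tf.keras.layers.MaxPool2D((3, 3))(tf.zeros([1, 128, 3, 8])).shape)  # (1, 42, 1, 8)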

After we’ve shrunk our data down, it goes through a Dropout layer:

tf.keras.layers.Dropout(0.1),  # (batch, 42, 1, 8)

The Dropout layer randomly sets some of a tensor’s values to zero during training. In this case, by calling Dropout(0.1), we set 10% of the values to zero, entirely obliterating that data. This might seem like a strange thing to do, so let’s explain.

Dropout is a regularization technique. As mentioned earlier in the book, regularization is the process of improving machine learning models so that they are less likely to overfit their training data. Dropout is a simple but effective way to limit overfitting. By randomly removing some data between one layer and the next, we force the neural network to learn how to cope with unexpected noise and variation. Adding dropout between layers is a common and effective practice.

The dropout layer is only active during training. During inference, it has no effect; all of the data is allowed through.
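You can see this behavior directly by calling a Dropout layer with and without the training flag; note that during training, Keras also scales the surviving values up to compensate for the ones it dropped:

import tensorflow as tf

dropout = tf.keras.layers.Dropout(0.5)
x = tf.ones([1, 10])

print(dropout(x, training=True))   # roughly half the values zeroed, the rest scaled to 2.0
print(dropout(x, training=False))  # all ones: dropout is a no-op at inference time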

After the Dropout layer, we feed the data through a second Conv2D layer:

tf.keras.layers.Conv2D(16, (4, 1), padding="same",
                        activation="relu"),  # (batch, 42, 1, 16)

This layer has 16 filters and a window size of (4, 1). These numbers are part of the model’s hyperparameters, and they were chosen in an iterative process while the model was being developed. Designing an effective architecture is a process of trial and error, and these magic numbers are what was arrived at after a lot of experimentation. It’s unlikely that you’ll ever select the exact right values the first time around.

Like the first convolutional layer, this one also learns to spot patterns in adjacent values that contain meaningful information. Its output is an even higher-level representation of the content of a given input. The features it recognizes are compositions of the features identified by our first convolutional layer.

After this convolutional layer, we do another MaxPool2D and Dropout:

tf.keras.layers.MaxPool2D((3, 1), padding="same"),  # (batch, 14, 1, 16)
tf.keras.layers.Dropout(0.1),  # (batch, 14, 1, 16)

This continues the process of distilling the original input down to a smaller, more manageable representation. The output, with a shape of (14, 1, 16), is a multidimensional tensor that symbolically represents only the most significant structures contained within the input data.

If we wanted to, we could continue with the process of convolution and pooling. The number of layers in a CNN is just another hyperparameter that we can tune during model development. However, during the development of this model, we found that two convolutional layers were sufficient.

Up until this point, we’ve been running our data through convolutional layers, which care only about the relationships between adjacent values—we haven’t really been considering the bigger picture. However, because we now have high-level representations of the major features contained within our input, we can “zoom out” and study them in aggregate. To do so, we flatten our data and feed it into a Dense layer (also known as a fully connected layer):

tf.keras.layers.Flatten(),  # (batch, 224)
tf.keras.layers.Dense(16, activation="relu"),  # (batch, 16)

The Flatten layer is used to transform a multidimensional tensor into one with a single dimension. In this case, our (14, 1, 16) tensor is squished down into a single dimension with shape (224).

It’s then fed into a Dense layer with 16 neurons. This is one of the most basic tools in the deep learning toolbox: a layer where every input is connected to every neuron. By considering all of our data, all at once, this layer can learn the meanings of various combinations of inputs. The output of this Dense layer will be a set of 16 values representing the content of the original input in a highly compressed form.

Our final task is to shrink these 16 values down into 4 classes. To do this, we first add some more dropout and then a final Dense layer:

tf.keras.layers.Dropout(0.1),  # (batch, 16)
tf.keras.layers.Dense(4, activation="softmax")  # (batch, 4)

This layer has four neurons, one for each output class: the three gestures plus “unknown.” Each of them is connected to all 16 of the outputs from the previous layer. During training, each neuron will learn the combination of previous-layer activations that corresponds to the class it represents.

The layer is configured with a "softmax" activation function, which results in the layer’s output being a set of probabilities that sum to 1. This output is what we see in the model’s output tensor.
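For instance, feeding a set of hypothetical raw scores through a softmax produces values between 0 and 1 that always sum to 1, which is what lets us read the output tensor as per-class probabilities:

import tensorflow as tf

scores = tf.constant([2.0, 0.1, -1.0, 0.5])   # hypothetical raw outputs for 4 classes
probabilities = tf.nn.softmax(scores)
print(probabilities.numpy())                   # approximately [0.70, 0.11, 0.04, 0.16]
print(float(tf.reduce_sum(probabilities)))     # 1.0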

This type of model architecture—a combination of convolutional and fully connected layers—is very useful in classifying time-series sensor data like the measurements we obtain from our accelerometer. The model learns to identify the high-level features that represent the “fingerprint” of a particular class of input. It’s small, runs fast, and doesn’t take long to train. This architecture will be a valuable tool in your belt as an embedded machine learning engineer.

Training with Your Own Data

In this section, we’ll show you how to train your own, custom model that recognizes new gestures. We’ll walk through how to capture accelerometer data, modify the training scripts to incorporate it, train a new model, and integrate it into the embedded application.

Capturing Data

To obtain training data, we can use a simple program to log accelerometer data to the serial port while gestures are being performed.

SparkFun Edge

The fastest way to get started is by modifying one of the examples in the SparkFun Edge Board Support Package (BSP). First, follow SparkFun’s “Using SparkFun Edge Board with Ambiq Apollo3 SDK” guide to set up the Ambiq SDK and SparkFun Edge BSP.

After you’ve downloaded the SDK and BSP, you’ll need to tweak the example code so it does what we want.

First, open the file AmbiqSuite-Rel2.2.0/boards/SparkFun_Edge_BSP/examples/example1_edge_test/src/tf_adc/tf_adc.c in your text editor of choice. Find the call to am_hal_adc_samples_read(), on line 61 of the file:

if (AM_HAL_STATUS_SUCCESS != am_hal_adc_samples_read(g_ADCHandle,
                                                     NULL,
                                                     &ui32NumSamples,
                                                     &Sample))

Change its second parameter to true so that the entire function call looks like this:

if (AM_HAL_STATUS_SUCCESS != am_hal_adc_samples_read(g_ADCHandle,
                                                     true,
                                                     &ui32NumSamples,
                                                     &Sample))

Next, you’ll need to modify the file AmbiqSuite-Rel2.2.0/boards/SparkFun_Edge_BSP/examples/example1_edge_test/src/main.c. Find the while loop on line 51:

/*
* Read samples in polling mode (no int)
*/
while(1)
{
    // Use Button 14 to break the loop and shut down
    uint32_t pin14Val = 1;
    am_hal_gpio_state_read( AM_BSP_GPIO_14, AM_HAL_GPIO_INPUT_READ, &pin14Val);

Change the code to add the following extra line:

/*
* Read samples in polling mode (no int)
*/
while(1)
{
    am_util_stdio_printf("-,-,-
");
    // Use Button 14 to break the loop and shut down
    uint32_t pin14Val = 1;
    am_hal_gpio_state_read( AM_BSP_GPIO_14, AM_HAL_GPIO_INPUT_READ, &pin14Val);

Now find this line a little further along in the while loop:

am_util_stdio_printf("Acc [mg] %04.2f x, %04.2f y, %04.2f z,
                     Temp [deg C] %04.2f, MIC0 [counts / 2^14] %drn",
        acceleration_mg[0], acceleration_mg[1], acceleration_mg[2],
        temperature_degC, (audioSample) );

Delete the original line and replace it with the following:

am_util_stdio_printf("%04.2f,%04.2f,%04.2f
", acceleration_mg[0],
                     acceleration_mg[1], acceleration_mg[2]);

The program will now output data in the format expected by the training scripts.

Next, follow the instructions in SparkFun’s guide to build the example1_edge_test example application and flash it to the device.

Logging data

After you’ve built and flashed the example code, follow these instructions to capture some data.

First, open a new terminal window. Then run the following command to begin logging all of the terminal’s output to a file named output.txt:

script output.txt

Next, in the same window, use screen to connect to the device:

screen ${DEVICENAME} 115200

Measurements from the accelerometer will be shown on the screen and saved to output.txt in the same comma-delimited format expected by the training scripts.

You should aim to capture multiple performances of the same gesture in a single file. To start capturing a single performance of a gesture, press the button marked RST. The characters -,-,- will be written to the serial port; this output is used by the training scripts to identify the start of a gesture performance. After you’ve performed the gesture, press the button marked 14 to stop logging data.

When you’ve logged the same gesture a number of times, exit screen by pressing Ctrl-A, immediately followed by the K key, and then the Y key. After you’ve exited screen, enter the following command to stop logging data to output.txt:

exit

You now have a file, output.txt, which contains data for one person performing a single gesture. To train an entirely new model, you should aim to collect a similar amount of data as in the original dataset, which contains around 15 performances of each gesture by 10 people.

If you don’t care about your model working for people other than yourself, you can probably get away with capturing only your own performances. That said, the more variation in performances you can collect, the better.

For compatibility with the training scripts, you should rename your captured data files in the following format:

output_<gesture_name>_<person_name>.txt

For example, data for a hypothetical “triangle” gesture made by “Daniel” would have the following name:

output_triangle_Daniel.txt

The training scripts will expect the data to be organized in directories for each gesture name; for example:

data/
├── triangle
│   ├── output_triangle_Daniel.txt
│   └── ...
├── square
│   ├── output_square_Daniel.txt
│   └── ...
└── star
    ├── output_star_Daniel.txt
    └── ...

You’ll also need to provide data for the “unknown” category, in a directory named negative. In this case, you can just reuse the data files from the original dataset.

Note that because the model architecture is designed to output probabilities for four classes (three gestures plus “unknown”), you should provide three gestures of your own. If you want to train on more or fewer gestures, you’ll need to change the training scripts and adjust the model architecture.

Modifying the Training Scripts

To train a model with your new gestures, you need to make some changes to the training scripts.

First, replace all of the gesture names within the following files:

Next, replace all of the person names within the following files:

Note that if you have a different number of person names (the original dataset has 10) and you want to split the data by person during training, you’ll need to decide on a new split. If you have data from only a few people, it won’t be possible to split by person during training, so don’t worry about data_split_person.py.

Training

To train a new model, copy your data directories into the training scripts’ directory and follow the process we walked through earlier in this chapter.

If you have data from only a few people, you should split the data randomly rather than per person. To do this, run data_split.py instead of data_split_person.py when preparing for training.

Because you’re training on new gestures, it’s worth playing with the model’s hyperparameters to obtain the best accuracy. For example, you can see whether you get better results by training for more or fewer epochs, or with a different arrangement of layers or number of neurons, or with different convolutional hyperparameters. You can use TensorBoard to monitor your progress.

Once you have a model with acceptable accuracy, you’ll need to make a few changes to the project to make sure it works.

Using the New Model

First, you’ll need to copy the new model’s data, as formatted by xxd -i, into magic_wand_model_data.cc. Make sure you also update the value of g_magic_wand_model_data_len to match the number output by xxd.

Next, you’ll need to update the values in the should_continuous_count array in accelerometer_handler.cc, which specify the number of consecutive positive predictions required before each gesture is reported. Each value corresponds to how long its gesture takes to perform. Given that the original “wing” gesture requires a continuous count of 15, estimate how long your new gestures will take relative to that and update the values in the array accordingly. You can tune these values iteratively until you get the most reliable performance.

Finally, update the code in output_handler.cc to print the correct names for your new gestures. When this is done, you can build your code and flash your device.

Wrapping Up

In this chapter, we’ve taken our deepest dive yet into the architecture of a typical embedded machine learning model. This type of convolutional model is a powerful tool for classifying time-series data, and you’ll come across it often.

By now, you hopefully have an understanding of what embedded machine learning applications look like, and how their application code works together with models to understand the world around them. As you build your own projects, you’ll begin to put together a toolbox of familiar models that you can use to solve different problems.

Learning Machine Learning

This book is intended to provide a gentle introduction to the possibilities of embedded machine learning, but it’s not a complete reference on machine learning itself. If you’d like to dig deeper into building your own models, there are some amazing and highly accessible resources that are suitable for students of all backgrounds and will give you a running start.

Here are some of our favorites, which will build on what you’ve learned here:

What’s Next

The remaining chapters of this book take a deeper dive into the tools and workflows of embedded machine learning. You’ll learn how to think about designing your own TinyML applications, how to optimize models and application code to run well on low-powered devices, how to port existing machine learning models to embedded devices, and how to debug embedded machine learning code. We’ll also address some high-level concerns, like deployment, privacy, and security.

But first, let’s learn a bit more about TensorFlow Lite, the framework that powers all of the examples in this book.

1 This is a new term, which we’ll talk about later.
