Chapter 10. Person Detection: Training a Model

In Chapter 9, we showed how you can deploy a pretrained model for recognizing people in images, but we didn’t explain where that model came from. If your product has different requirements, you’ll want to be able to train your own version, and this chapter explains how to do that.

Picking a Machine

Training this image model takes a lot more compute power than our previous examples, so if you want your training to complete in a reasonable amount of time, you’ll need to use a machine with a high-end graphics processing unit (GPU). Unless you expect to be running a lot of training jobs, we recommend starting off by renting a cloud instance rather than buying a special machine. Unfortunately the free Colaboratory service from Google that we’ve used in previous chapters for smaller models won’t work, and you will need to pay for access to a machine. There are many great providers available, but our instructions will assume you’re using Google Cloud Platform because that’s the service we’re most familiar with. If you are already using Amazon Web Services (AWS) or Microsoft Azure, they also have TensorFlow support and the training instructions should be the same, but you’ll need to follow their tutorials for setting up a machine.

Setting Up a Google Cloud Platform Instance

You can rent a virtual machine with TensorFlow and NVIDIA drivers preinstalled from Google Cloud Platform, and with support for a Jupyter Notebook web interface, which can be very convenient. The route to setting this up can be a bit involved, though. As of September 2019, here are the steps you need to take to create a machine:

  1. Sign in to console.cloud.google.com. You’ll need to create a Google account if you don’t already have one, and you’ll have to set up billing to pay for the instance you create. If you don’t already have a project, you’ll need to create one.

  2. In the upper-left corner of the screen, open the hamburger menu (the main menu with three horizontal lines as an icon, as illustrated in Figure 10-1) and scroll down until you find the Artificial Intelligence section.

  3. In this section, select AI Platform→Notebooks, as shown in Figure 10-1.

    Figure 10-1. The AI Platform menu
  4. You might see a prompt asking you to enable the Compute Engine API to proceed, as depicted in Figure 10-2; go ahead and approve it. This can take several minutes to go through.

    Figure 10-2. The Compute Engine API screen
  5. A “Notebook instances” screen will open. In the menu bar at the top, select NEW INSTANCE. On the submenu that opens, choose “Customize instance,” as shown in Figure 10-3.

    Figure 10-3. The instance creation menu
  6. On the “New notebook instance” page, in the “instance name” box, give your machine a name, as illustrated in Figure 10-4, and then scroll down to set up the environment.

    Figure 10-4. The naming interface
  7. As of September 2019, the correct TensorFlow version to choose is TensorFlow 1.14. The recommended version will likely have increased to 2.0 or beyond by the time you’re reading this, but there might be some incompatibilities, so if it’s still possible, start by selecting 1.14 or another version in the 1.x branch.

  8. In the “Machine configuration” section, choose at least 4 CPUs and 15 GB of RAM, as shown in Figure 10-5.

    Figure 10-5. The CPU and version interface
  9. Picking the right GPU will make the biggest difference in your training speed. It can be tricky because not all zones offer the same kind of hardware. In our case, we’re using “us-west1 (Oregon)” as the region and “us-west1-b” as the zone because we know that they currently offer high-end GPUs. You can get the detailed pricing information using Google Cloud Platform’s pricing calculator, but for this example we’re choosing one NVIDIA Tesla V100 GPU, as illustrated in Figure 10-6. This costs $1,300 a month to run but allows us to train the person-detector model in around a day, so the model training cost works out to about $45.

    Figure 10-6. The GPU selection interface
    Tip

    These high-end machines are expensive to run, so make sure you stop your instance when you’re not actively using it for training. Otherwise, you’ll be paying for an idle machine. (You can stop it from the Cloud Console, or from the command line, as shown in the example after these setup steps.)

  10. It makes life easier to have the GPU drivers installed automatically, so make sure you select that option, as demonstrated in Figure 10-7.

    Figure 10-7. The GPU driver interface
  11. Because you’ll be downloading a dataset to this machine, we recommend making the boot disk a bit larger than the default 100 GB; maybe as big as 500 GB, as shown in Figure 10-8.

    Figure 10-8. Increasing the boot disk size
  12. When you’ve set all those options, at the bottom of the page, click the CREATE button, which should return you to the “Notebook instances” screen. There should be a new instance in the list with the name you gave to your machine. There will be spinners next to it for a few minutes while the instance is being set up. When that’s complete, click the OPEN JUPYTERLAB link, as depicted in Figure 10-9.

    Figure 10-9. The instances screen
  13. In the screen that opens, choose to create a Python 3 notebook (see Figure 10-10).

    Figure 10-10. The notebook selection screen

    This gives you a Jupyter notebook connected to your instance. If you’re not familiar with Jupyter, it gives you a nice web interface to a Python interpreter running on a machine, and stores the commands and results in a notebook you can share. To start using it, in the panel on the right, type print("Hello World!") and then press Shift+Return. You should see “Hello World!” printed just below, as shown in Figure 10-11. If so, you’ve successfully set up your machine instance. We use this notebook as the place in which we enter commands for the rest of this tutorial.

Figure 10-11. The “hello world” example

Many of the commands that follow assume that you’re running from a Jupyter notebook, so they begin with a !, which indicates they should be run as shell commands rather than Python statements. If you’re running directly from a terminal (for example, after opening a Secure Shell connection to communicate with an instance) you can remove the initial !.
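As one concrete example, when you’ve finished a training session you can stop the instance from the earlier tip without going back to the web console. This is only a sketch: it assumes you have the gcloud command-line tool installed and authenticated, and that your instance is called my-training-instance in the us-west1-b zone, so substitute your own name and zone. (Running it from the notebook will shut down the very machine the notebook is running on, which is fine once you’re done.)

! gcloud compute instances stop my-training-instance --zone=us-west1-b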

Training Framework Choice

Keras is the recommended interface for building models in TensorFlow, but when the person detection model was being created it didn’t yet support all the features we needed. For that reason, we show you how to train a model using tf.slim, an older interface. It is still widely used but deprecated, so future versions of TensorFlow might not support this approach. We hope to publish Keras instructions online in the future; check tinymlbook.com/persondetector for updates.

The model definitions for Slim are part of the TensorFlow models repository, so to get started, you’ll need to download it from GitHub:

! cd ~ && git clone https://github.com/tensorflow/models.git
Note

The following guide assumes that you’ve done this from your home directory, so the model repository code is at ~/models, and that all commands are run from the home directory unless otherwise noted. You can place the repository somewhere else, but you’ll need to update all references to it.

To use Slim, you need to make sure that Python can find its modules and install one dependency. Here’s how to do this in an iPython notebook:

! pip install contextlib2
import os
new_python_path = (os.environ.get("PYTHONPATH") or '') + ":models/research/slim"
%env PYTHONPATH=$new_python_path

Updating PYTHONPATH this way lasts only for the current Jupyter session, so if you’re using bash directly, you should make the change persistent by adding it to a startup script, running something like this:

echo 'export PYTHONPATH=$PYTHONPATH:models/research/slim' >> ~/.bashrc
source ~/.bashrc

If you see import errors running the Slim scripts, make sure the PYTHONPATH is set up correctly and that contextlib2 has been installed. You can find more general information on tf.slim in the repository’s README.
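As a quick sanity check (our own addition, assuming the repository sits at ~/models and you’re running from your home directory as described in the note above), you can try importing one of Slim’s model definitions in a subprocess, which picks up the PYTHONPATH you just set:

! python -c "import contextlib2; from nets import mobilenet_v1; print(mobilenet_v1.__file__)"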

Building the Dataset

To train our person detection model, we need a large collection of images that are labeled depending on whether they have people in them. The ImageNet 1,000-class dataset that’s widely used for training image classifiers doesn’t include labels for people, but luckily the COCO dataset does.

The dataset is designed to be used for training models for localization, so the images aren’t labeled with the “person,” “not person” categories for which we want to train. Instead, each image comes with a list of bounding boxes for all of the objects it contains. “Person” is one of these object categories, so to get to the classification labels we want, we need to look for images with bounding boxes for people. To make sure that they aren’t too tiny to be recognizable we also need to exclude very small bounding boxes. Slim contains a convenient script to both download the data and convert bounding boxes into labels:

! python models/research/slim/download_and_convert_data.py \
  --dataset_name=visualwakewords \
  --dataset_dir=data/visualwakewords

This is a large download, about 40 GB, so it will take a while and you’ll need to make sure you have at least 100 GB free on your drive to allow space for unpacking and further processing. Don’t be surprised if the process takes around 20 minutes to complete. When it’s done, you’ll have a set of TFRecords in data/visualwakewords holding the labeled image information. This dataset was created by Aakanksha Chowdhery and is known as the Visual Wake Words dataset. It’s designed to be useful for benchmarking and testing embedded computer vision because it represents a very common task that we need to accomplish with tight resource constraints. We’re hoping to see it drive even better models for this and similar tasks.
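If you’re curious what the conversion actually does, the rule it applies boils down to something like the following sketch. This is our own illustration rather than the script’s real code; the annotation list and image area stand in for what the converter reads from the COCO annotation files, and 0.005 matches the small-object threshold you’ll see passed to the build script later in this chapter:

PERSON_CATEGORY_ID = 1  # "person" in COCO's category list

def visual_wake_word_label(annotations, image_area,
                           small_object_area_threshold=0.005):
  """Label an image 1 if it contains a large-enough person box, else 0."""
  for annotation in annotations:
    if annotation['category_id'] != PERSON_CATEGORY_ID:
      continue
    # Ignore people whose bounding boxes cover only a tiny part of the image.
    if annotation['area'] / float(image_area) > small_object_area_threshold:
      return 1
  return 0

# A toy example: one tiny person box and one large one in a 640x480 image.
print(visual_wake_word_label(
    [{'category_id': 1, 'area': 500.0},
     {'category_id': 1, 'area': 40000.0}], 640 * 480))  # prints 1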

Training the Model

One of the nice things about using tf.slim to handle the training is that the parameters we commonly need to modify are available as command-line arguments, so we can just call the standard train_image_classifier.py script to train our model. You can use this command to build the model we use in the example:

! python models/research/slim/train_image_classifier.py \
    --train_dir=vww_96_grayscale \
    --dataset_name=visualwakewords \
    --dataset_split_name=train \
    --dataset_dir=data/visualwakewords \
    --model_name=mobilenet_v1_025 \
    --preprocessing_name=mobilenet_v1 \
    --train_image_size=96 \
    --use_grayscale=True \
    --save_summaries_secs=300 \
    --learning_rate=0.045 \
    --label_smoothing=0.1 \
    --learning_rate_decay_factor=0.98 \
    --num_epochs_per_decay=2.5 \
    --moving_average_decay=0.9999 \
    --batch_size=96 \
    --max_number_of_steps=1000000

It will take a couple of days on a single-GPU V100 instance to complete all one million steps, but you should be able to get a fairly accurate model after a few hours if you want to experiment early. Following are some additional considerations:

  • The checkpoints and summaries will be saved in the folder given in the --train_dir argument. This is where you’ll need to look for the results.

  • The --dataset_dir parameter should match the one where you saved the TFRecords from the Visual Wake Words build script.

  • The architecture we use is defined by the --model_name argument. The mobilenet_v1 prefix instructs the script to use the first version of MobileNet. We did experiment with later versions, but these used more RAM for their intermediate activation buffers, so for now we’re sticking with the original. The 025 is the depth multiplier to use, which mostly affects the number of weight parameters; this low setting ensures the model fits within 250 KB of flash memory.

  • --preprocessing_name controls how input images are modified before they’re fed into the model. The mobilenet_v1 version shrinks the width and height of the images to the size given in --train_image_size (in our case 96 pixels, because we want to reduce the compute requirements). It also scales the pixel values from integers in the range 0 to 255 to floating-point numbers in the range −1.0 to +1.0 (though we’ll be quantizing those after training).

  • The HM01B0 camera we’re using on the SparkFun Edge board is monochrome, so to get the best results, we need to train our model on black-and-white images. We pass in the --use_grayscale flag to enable that preprocessing.

  • The --learning_rate, --label_smoothing, --learning_rate_decay_factor, --num_epochs_per_decay, --moving_average_decay, and --batch_size parameters all control how weights are updated during the training process. Training deep networks is still a bit of a dark art, so we found these exact values through experimentation for this particular model. You can try tweaking them to speed up training or gain a small boost in accuracy, but we can’t give much guidance for how to make those changes, and it’s easy to get combinations where the training accuracy never converges. (The sketch after this list shows roughly how the learning-rate flags combine into a decay schedule.)

  • --max_number_of_steps defines how long the training should continue. There’s no good way to establish this threshold in advance; you need to experiment to determine when the accuracy of the model is no longer improving to know when to cut it off. In our case, we default to a million steps because with this particular model we know that’s a good point to stop.
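To give a rough feel for how the learning-rate flags interact, the training script applies an exponential decay: the rate starts at --learning_rate and is multiplied by --learning_rate_decay_factor every --num_epochs_per_decay epochs. The snippet below is our own simplified sketch of that schedule; the number of training examples is an assumed placeholder, not the exact size of the Visual Wake Words training split:

def effective_learning_rate(step, num_train_samples=100000,  # assumed dataset size
                            batch_size=96, initial_lr=0.045,
                            decay_factor=0.98, epochs_per_decay=2.5):
  """Roughly what the script's exponential decay produces at a given step."""
  steps_per_decay = int(num_train_samples / batch_size * epochs_per_decay)
  return initial_lr * decay_factor ** (step // steps_per_decay)

print(effective_learning_rate(0))          # 0.045 at the start of training
print(effective_learning_rate(1000000))    # much smaller by the final step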

After you start the script, you should see output that looks something like this:

INFO:tensorflow:global step 4670: loss = 0.7112 (0.251 sec/step)
  I0928 00:16:21.774756 140518023943616 learning.py:507] global step 4670: loss
  = 0.7112 (0.251 sec/step)
INFO:tensorflow:global step 4680: loss = 0.6596 (0.227 sec/step)
  I0928 00:16:24.365901 140518023943616 learning.py:507] global step 4680: loss
  = 0.6596 (0.227 sec/step)

Don’t worry about the line duplication: this is just a side effect of the way TensorFlow log printing interacts with Python. Each line has two key bits of information about the training process. The global step is a count of how far through the training we are. Because we’ve set the limit as a million steps, in this case we’re nearly 5% complete. Together with the steps-per-second estimate, this is useful because you can use it to estimate a rough duration for the entire training process. In this case, we’re completing about 4 steps per second, so a million steps will take about 70 hours, or 3 days. The other crucial piece of information is the loss. This is a measure of how close the partially trained model’s predictions are to the correct values, and lower values are better. This will show a lot of variation but should on average decrease during training if the model is learning. Because it’s so noisy the amounts will bounce around a lot over short time periods, but if things are working well you should see a noticeable drop if you wait an hour or so and check back. This kind of variation is a lot easier to see in a graph, which is one of the main reasons to try TensorBoard.
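If you want to turn those log lines into a schedule estimate of your own, the arithmetic is simple enough to script. This is just a back-of-the-envelope sketch using the sec/step figure printed in the log above:

sec_per_step = 0.25       # taken from the "(0.251 sec/step)" figures in the log
total_steps = 1000000     # matches --max_number_of_steps
hours = sec_per_step * total_steps / 3600.0
print("about %.0f hours, or %.1f days" % (hours, hours / 24.0))  # ~69 hours, ~2.9 days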

TensorBoard

TensorBoard is a web application that lets you view data visualizations from TensorFlow training sessions, and it’s included by default in most cloud instances. If you’re using Google Cloud AI Platform, you can start up a new TensorBoard session by opening the command palette from the left tabs in the notebook interface and then scrolling down to select “Create a new tensorboard.” You’re then prompted for the location of the summary logs. Enter the path you used for --train_dir in the training script—in the previous example, the folder name is vww_96_grayscale. One common error to watch out for is adding a slash to the end of the path, which will cause TensorBoard to fail to find the directory.

If you’re starting TensorBoard from the command line in a different environment, you’ll need to pass in this path as the --logdir argument to the TensorBoard command-line tool, and point your browser to http://localhost:6006 (or the address of the machine you’re running it on).
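For example, from a terminal on the machine that holds the training directory, starting it up looks something like this (adjust the path if you chose a different --train_dir):

tensorboard --logdir=vww_96_grayscale --port=6006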

After navigating to the TensorBoard address or opening the session through Google Cloud, you should see a page that looks something like Figure 10-12. It might take a little while for the graphs to have anything useful in them given that the script only saves summaries every five minutes. Figure 10-12 shows the results after training for more than a day. The most important graph is called “clone_loss”; it shows the progression of the same loss value that’s displayed in the logging output. As you can see in this example it fluctuates a lot, but the overall trend is downward over time. If you don’t see this sort of progression after a few hours of training, it’s a good sign that your model isn’t converging to a good solution, and you might need to debug what’s going wrong either with your dataset or the training parameters.

TensorBoard defaults to the SCALARS tab when it opens, but the other section that can be useful during training is IMAGES (Figure 10-13). This shows a random selection of the pictures the model is currently being trained on, including any distortions and other preprocessing. In the figure, you can see that the image has been flipped and that it’s been converted to grayscale before being fed to the model. This information isn’t as essential as the loss graphs, but it can be useful to ensure that the dataset is what you expect, and it is interesting to see the examples updating as training progresses.

Figure 10-12. Graphs in TensorBoard
Figure 10-13. Images in TensorBoard

Evaluating the Model

The loss function correlates with how well your model is training, but it isn’t a direct, understandable metric. What we really care about is how many people our model detects correctly, but to get it to calculate this we need to run a separate script. You don’t need to wait until the model is fully trained; you can check the accuracy of any checkpoints in the --train_dir folder. To do this, run the following command:

! python models/research/slim/eval_image_classifier.py \
    --alsologtostderr \
    --checkpoint_path=vww_96_grayscale/model.ckpt-698580 \
    --dataset_dir=data/visualwakewords \
    --dataset_name=visualwakewords \
    --dataset_split_name=val \
    --model_name=mobilenet_v1_025 \
    --preprocessing_name=mobilenet_v1 \
    --use_grayscale=True \
    --train_image_size=96

You’ll need to make sure that --checkpoint_path is pointing to a valid set of checkpoint data. Checkpoints are stored in three separate files, so the value should be their common prefix. For example, if you have a checkpoint file called model.ckpt-5179.data-00000-of-00001, the prefix would be model.ckpt-5179. The script should produce output that looks something like this:

INFO:tensorflow:Evaluation [406/406]
I0929 22:52:59.936022 140225887045056 evaluation.py:167] Evaluation [406/406]
eval/Accuracy[0.717438412]eval/Recall_5[1]

The important number here is the accuracy. It shows the proportion of the images that were classified correctly, which works out to about 72% in this case. If you follow the example script, you should expect a fully trained model to achieve an accuracy of around 84% after one million steps and show a loss of around 0.4.
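If you’d rather not pick a checkpoint number by hand as we did with --checkpoint_path above, TensorFlow can report the newest one in the training directory for you; its return value is exactly the kind of common prefix the evaluation script expects:

import tensorflow as tf

# Prints something like vww_96_grayscale/model.ckpt-698580, which you can pass
# to --checkpoint_path.
print(tf.train.latest_checkpoint('vww_96_grayscale'))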

Exporting the Model to TensorFlow Lite

When the model has trained to an accuracy you’re happy with, you’ll need to convert the results from the TensorFlow training environment into a form you can run on an embedded device. As we’ve seen in previous chapters, this can be a complex process, and tf.slim adds a few of its own wrinkles, too.

Exporting to a GraphDef Protobuf File

Slim generates the architecture from the model_name every time one of its scripts is run, so for a model to be used outside of Slim, it needs to be saved in a common format. We’re going to use the GraphDef protobuf serialization format because that’s understood by both Slim and the rest of TensorFlow:

! python models/research/slim/export_inference_graph.py \
    --alsologtostderr \
    --dataset_name=visualwakewords \
    --model_name=mobilenet_v1_025 \
    --image_size=96 \
    --use_grayscale=True \
    --output_file=vww_96_grayscale_graph.pb

If this succeeds, you should have a new vww_96_grayscale_graph.pb file in your home directory. This contains the layout of the operations in the model, but it doesn’t yet have any of the weight data.

Freezing the Weights

The process of storing the trained weights together with the operation graph is known as freezing. This converts all of the variables in the graph to constants, after loading their values from a checkpoint file. The command that follows uses a checkpoint from the millionth training step, but you can supply any valid checkpoint path. The graph-freezing script is stored in the main TensorFlow repository, so you’ll need to download this from GitHub before running this command:

! git clone https://github.com/tensorflow/tensorflow
! python tensorflow/tensorflow/python/tools/freeze_graph.py \
    --input_graph=vww_96_grayscale_graph.pb \
    --input_checkpoint=vww_96_grayscale/model.ckpt-1000000 \
    --input_binary=true --output_graph=vww_96_grayscale_frozen.pb \
    --output_node_names=MobilenetV1/Predictions/Reshape_1

After this, you should see a file called vww_96_grayscale_frozen.pb.
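As an optional check (our own, not part of the standard workflow), you can load the frozen graph and confirm that the variables really have been converted to constants and that the output node you named is present:

import tensorflow as tf

graph_def = tf.GraphDef()
with tf.gfile.GFile('vww_96_grayscale_frozen.pb', 'rb') as f:
  graph_def.ParseFromString(f.read())

op_types = [node.op for node in graph_def.node]
print('Const ops:', op_types.count('Const'))          # hold the frozen weights
print('Variable ops:', op_types.count('VariableV2'))  # should be 0 after freezing
print('Output node present:',
      any(node.name == 'MobilenetV1/Predictions/Reshape_1'
          for node in graph_def.node))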

Quantizing and Converting to TensorFlow Lite

Quantization is a tricky and involved process, and it’s still very much an active area of research, so taking the float graph that we’ve trained so far and converting it down to an 8-bit entity takes quite a bit of code. You can find more of an explanation of what quantization is and how it works in Chapter 15, but here we’ll show you how to use it with the model we’ve trained. The majority of the code is preparing example images to feed into the trained network so that the ranges of the activation layers in typical use can be measured. We rely on the TFLiteConverter class to handle the quantization and conversion into the TensorFlow Lite FlatBuffer file that we need for the inference engine:

import tensorflow as tf
import io
import PIL
import numpy as np

def representative_dataset_gen():

  record_iterator = tf.python_io.tf_record_iterator(
      path='data/visualwakewords/val.record-00000-of-00010')

  count = 0
  for string_record in record_iterator:
    example = tf.train.Example()
    example.ParseFromString(string_record)
    image_stream = io.BytesIO(
        example.features.feature['image/encoded'].bytes_list.value[0])
    image = PIL.Image.open(image_stream)
    image = image.resize((96, 96))
    image = image.convert('L')
    array = np.array(image)
    array = np.expand_dims(array, axis=2)
    array = np.expand_dims(array, axis=0)
    array = ((array / 127.5) - 1.0).astype(np.float32)
    yield([array])
    count += 1
    if count > 300:
        break

converter = tf.lite.TFLiteConverter.from_frozen_graph(
    'vww_96_grayscale_frozen.pb', ['input'],
    ['MobilenetV1/Predictions/Reshape_1'])
converter.inference_input_type = tf.lite.constants.INT8
converter.inference_output_type = tf.lite.constants.INT8
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset_gen

tflite_quant_model = converter.convert()
open("vww_96_grayscale_quantized.tflite", "wb").write(tflite_quant_model)
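It’s worth a quick check that the conversion produced what we expect before moving on. This snippet is optional and simply loads the file back with the TensorFlow Lite interpreter to inspect the input tensor:

import tensorflow as tf

interpreter = tf.lite.Interpreter(
    model_path='vww_96_grayscale_quantized.tflite')
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()[0]
print(input_details['shape'])   # expect a 1 x 96 x 96 x 1 grayscale input
print(input_details['dtype'])   # expect a quantized integer type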

Converting to a C Source File

The converter writes out a file, but most embedded devices don’t have a filesystem. To access the serialized data from our program, we must compile it into the executable and store it in flash. The easiest way to do that is to convert the file to a C data array, as we’ve done in previous chapters:

# Install xxd if it is not available
! apt-get -qq install xxd
# Save the file as a C source file
! xxd -i vww_96_grayscale_quantized.tflite > person_detect_model_data.cc

You can now replace the existing person_detect_model_data.cc file with the version you’ve trained and will be able to run your own model on embedded devices.

Training for Other Categories

There are more than 60 different object types in the COCO dataset, so an easy way to customize your model would be to choose one of those instead of person when you build the training dataset. Here’s an example that looks for cars:

! python models/research/slim/datasets/build_visualwakewords_data.py \
   --logtostderr \
   --train_image_dir=coco/raw-data/train2014 \
   --val_image_dir=coco/raw-data/val2014 \
   --train_annotations_file=coco/raw-data/annotations/instances_train2014.json \
   --val_annotations_file=coco/raw-data/annotations/instances_val2014.json \
   --output_dir=coco/processed_cars \
   --small_object_area_threshold=0.005 \
   --foreground_class_of_interest='car'

You should be able to follow the same steps as you did for the person detector, substituting in the new coco/processed_cars path wherever data/visualwakewords used to be.

If the kind of object you’re interested in isn’t present in COCO, you might be able to use transfer learning to help you train on a custom dataset you’ve gathered, even if it’s much smaller. Although we don’t have an example of this to share yet, you can check tinymlbook.com for updates on this approach.

Understanding the Architecture

MobileNets are a family of architectures designed to provide good accuracy for as few weight parameters and arithmetic operations as possible. There are now multiple versions, but in our case we’re using the original v1 because it requires the smallest amount of RAM at runtime. The core concept behind the architecture is depthwise separable convolution. This is a variant of classic 2D convolutions that works in a much more efficient way, without sacrificing very much accuracy. Regular convolution calculates an output value based on applying a filter of a particular size across all channels of the input. This means that the number of calculations involved in each output is the width of the filter multiplied by the height, multiplied by the number of input channels. Depthwise convolution breaks this large calculation into separate parts. First, each input channel is filtered by one or more rectangular filters to produce intermediate values. These values are then combined using pointwise convolutions. This dramatically reduces the number of calculations needed, and in practice produces similar results to regular convolution.
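To put rough numbers on that saving, here’s the standard multiply count for a single layer under both approaches; the layer dimensions are made-up illustrative values rather than ones taken from the model:

height, width = 48, 48                  # size of the activation map
kernel_size = 3                         # filter width and height
input_channels, output_channels = 8, 16

# Regular convolution: every output value touches every input channel.
regular = height * width * kernel_size * kernel_size * input_channels * output_channels

# Depthwise separable: filter each channel on its own, then combine with 1x1 convolutions.
depthwise = height * width * kernel_size * kernel_size * input_channels
pointwise = height * width * input_channels * output_channels
separable = depthwise + pointwise

print(regular)               # 2654208 multiplies
print(separable)             # 460800 multiplies
print(regular / separable)   # roughly a 5.8x reduction for this layer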

MobileNet v1 is a stack of 14 of these depthwise separable convolution layers with an average pool and then a fully connected layer followed by a softmax at the end. We have specified a width multiplier of 0.25, which has the effect of reducing the number of computations down to around 60 million per inference, by shrinking the number of channels in each activation layer by 75% compared to the standard model. In essence it’s very similar to a normal convolutional neural network in operation, with each layer learning patterns in the input. Earlier layers act more like edge recognition filters, spotting low-level structure in the image, and later layers synthesize that information into more abstract patterns that help with the final object classification.

Wrapping Up

Image recognition using machine learning requires large amounts of data and a lot of processing power. In this chapter you learned how to train a model from scratch, given nothing but a dataset, and how to convert that model into a form that is optimized for embedded devices.

This experience should give you a good foundation for tackling the machine vision problems that you need to solve for your product. There’s still something a bit magical about computers being able to see and understand the world around them, so we can’t wait to see what you come up with!
