Chapter 7. Wake-Word Detection: Building an Application

TinyML might be a new phenomenon, but its most widespread application is perhaps already at work in your home, in your car, or even in your pocket. Can you guess what it is?

The past few years have seen the rise of digital assistants. These products provide a voice user interface (UI) designed to give instant access to information without the need for a screen or keyboard. Between Google Assistant, Apple’s Siri, and Amazon Alexa, these digital assistants are nearly ubiquitous. Some variant is built into almost every mobile phone, from flagship models to voice-first devices designed for emerging markets. They’re also in smart speakers, computers, and vehicles.

In most cases, the heavy lifting of speech recognition, natural language processing, and generating responses to users’ queries is done in the cloud, on powerful servers running large ML models. When a user asks a question, it’s sent to the server as a stream of audio. The server figures out what it means, looks up any required information, and sends the appropriate response back.

But part of an assistant’s appeal is that it’s always on, ready to help you out. By saying “Hey Google” or “Alexa,” you can wake up your assistant and tell it what you need without ever having to press a button. This means it must be listening for your voice 24/7, whether you’re sitting in your living room, driving down the freeway, or in the great outdoors with a phone in your hand.

Although it’s easy to do speech recognition on a server, it’s just not feasible to send a constant stream of audio from a device to a data center. From a privacy perspective, sending every second of audio captured to a remote server would be an absolute disaster. Even if that were somehow okay, it would require vast amounts of bandwidth and chew through mobile data plans in hours. In addition, network communication uses energy, and sending a constant stream of data would quickly drain the device’s battery. What’s more, with every request going to a server and back, the assistant would feel laggy and slow to respond.

The only audio the assistant really needs is what immediately follows the wake word (e.g., “Hey Google”). What if we could detect that word without sending data, but start streaming when we heard it? We’d protect user privacy, save battery life and bandwidth, and wake up the assistant without waiting for the network.

And this is where TinyML comes in. We can train a tiny model that listens for a wake word, and run it on a low-powered chip. If we embed this in a phone, it can listen for wake words all the time. When it hears the magic word, it informs the phone’s operating system (OS), which can begin to capture audio and send it to the server.

Wake-word detection is the perfect application for TinyML. It’s ideally suited to delivering privacy, efficiency, speed, and offline inference. This approach, in which a tiny, efficient model “wakes up” a larger, more resource-hungry model, is called cascading.

In this chapter, we examine how we can use a pretrained speech detection model to provide always-on wake-word detection using a tiny microcontroller. In Chapter 8, we’ll explore how the model is trained, and how to create our own.

What We’re Building

We’re going to build an embedded application that uses an 18 KB model, trained on a dataset of speech commands, to classify spoken audio. The model is trained to recognize the words “yes” and “no,” and is also capable of distinguishing between unknown words and silence or background noise.

Our application will listen to its surroundings with a microphone and indicate when it has detected a word by lighting an LED or displaying data on a screen, depending on the capabilities of the device. Understanding this code will give you the ability to control any electronics project with voice commands.

Note

As with Chapter 5, the source code for this application is available in the TensorFlow GitHub repository.

We’ll follow a similar pattern to Chapter 5, walking through the tests, then the application code, followed by the logic that makes the sample work on various devices.

We provide instructions for deploying the application to the following devices:

  • Arduino Nano 33 BLE Sense

  • SparkFun Edge

  • ST Microelectronics STM32F746G Discovery kit

Note

TensorFlow Lite regularly adds support for new devices, so if the device you’d like to use isn’t listed here, check the example’s README.md. You can also check there for updated deployment instructions if you run into trouble following these steps.

This is a significantly more complex application than the “hello world” example, so let’s begin by walking through its structure.

Application Architecture

Over the previous few chapters, you’ve learned that a machine learning application does the following sequence of things:

  1. Obtains an input

  2. Preprocesses the input to extract features suitable to feed into a model

  3. Runs inference on the processed input

  4. Postprocesses the model’s output to make sense of it

  5. Uses the resulting information to make things happen

The “hello world” example followed these steps in a very straightforward manner. It took a single floating-point number as input, generated by a simple counter. Its output was another floating-point number that we used directly to control visual output.

Our wake-word application will be more complicated for the following reasons:

  • It takes audio data as an input. As you’ll see, this requires heavy processing before it can be fed into a model.

  • Its model is a classifier, outputting class probabilities. We’ll need to parse and make sense of this output.

  • It’s designed to perform inference continually, on live data. We’ll need to write code to make sense of a stream of inferences.

  • The model is larger and more complex. We’ll be pushing our hardware to the limits of its capabilities.

Because much of this complexity results from the model we’ll be using, let’s learn a little more about it.

Introducing Our Model

As we mentioned earlier, the model we use in this chapter is trained to recognize the words “yes” and “no,” and is also capable of distinguishing between unknown words and silence or background noise.

The model was trained on a dataset called the Speech Commands dataset. This consists of 65,000 one-second-long utterances of 30 short words, crowdsourced online.

Although the dataset contains 30 different words, the model was trained to distinguish between only four categories: the words “yes” and “no,” “unknown” words (meaning the other 28 words in the dataset), and silence.

The model takes in one second’s worth of data at a time. It outputs four probability scores, one for each of these four classes, predicting how likely it is that the data represented one of them.

However, the model doesn’t take in raw audio sample data. Instead, it works with spectrograms, which are two-dimensional arrays that are made up of slices of frequency information, each taken from a different time window.

Figure 7-1 is a visual representation of a spectrogram generated from a one-second audio clip of someone saying “yes.” Figure 7-2 shows the same thing for the word “no.”

Spectrogram for 'yes'
Figure 7-1. Spectrogram for “yes”
Spectrogram for 'no'
Figure 7-2. Spectrogram for “no”

By isolating the frequency information during preprocessing, we make the model’s life easier. During training, it doesn’t need to learn how to interpret raw audio data; instead, it gets to work with a higher-level abstraction that distills the most useful information.

We’ll look at how the spectrogram is generated later in this chapter. For now, we just need to know that the model takes a spectrogram as input. Because a spectrogram is a two-dimensional array, we feed it into the model as a 2D tensor.

There’s a type of neural network architecture that is specifically designed to work well with multidimensional tensors in which information is contained in the relationships between groups of adjacent values. It’s called a convolutional neural network (CNN).

The most common example of this type of data is images, for which a group of adjacent pixels might represent a shape, pattern, or texture. During training, a CNN is able to identify these features and learn what they represent.

It can learn how simple image features (like lines or edges) fit together into more complex features (like an eye or an ear), and in turn how those features might be combined to form an input image, such as a photo of a human face. This means that a CNN can learn to distinguish between different classes of input image, such as between a photo of a person and a photo of a dog.

Although they’re often applied to images, which are 2D grids of pixels, CNNs can be used with any multidimensional tensor input. It turns out they’re very well suited to working with spectrogram data.

In Chapter 8, we’ll look at how this model was trained. Until then, let’s get back to discussing the architecture of our application.

All the Moving Parts

As mentioned earlier, our wake-word application is more complicated than the “hello world” example. Figure 7-3 shows the components that comprise it.

Diagram of the components of our wake word application
Figure 7-3. The components of our wake-word application

Let’s investigate what each of these pieces do:

Main loop

Like the “hello world” example, our application runs in a continuous loop. All of the subsequent processes are contained within it, and they execute continually, as fast as the microcontroller can run them, which is multiple times per second.

Audio provider

The audio provider captures raw audio data from the microphone. Because the methods for capturing audio vary from device to device, this component can be overridden and customized.

Feature provider

The feature provider converts raw audio data into the spectrogram format that our model requires. It does so on a rolling basis as part of the main loop, providing the interpreter with a sequence of overlapping one-second windows.

TF Lite interpreter

The interpreter runs the TensorFlow Lite model, transforming the input spectrogram into a set of probabilities.

Model

The model is included as a data array and run by the interpreter. The array is located in tiny_conv_micro_features_model_data.cc.

Command recognizer

Because inference is run multiple times per second, the RecognizeCommands class aggregates the results and determines whether, on average, a known word was heard.

Command responder

If a command was heard, the command responder uses the device’s output capabilities to let the user know. Depending on the device, this could mean flashing an LED or showing data on an LCD display. It can be overridden for different device types.

The example’s files on GitHub contain tests for each of these components. We’ll walk through them next to learn how they work.

Walking Through the Tests

As in Chapter 5, we can use tests to learn how the application works. We’ve already covered a lot of C++ and TensorFlow Lite basics, so we won’t need to explain every single line. Instead, let’s focus on the most important parts of each test and explain what’s going on.

We’ll explore the following tests, which you can find in the GitHub repository:

micro_speech_test.cc

Shows how to run inference on spectrogram data and interpret the results

audio_provider_test.cc

Shows how to use the audio provider

feature_provider_mock_test.cc

Shows how to use the feature provider, using a mock (fake) implementation of the audio provider to pass in fake data

recognize_commands_test.cc

Shows how to interpret the model’s output to decide whether a command was found

command_responder_test.cc

Shows how to call the command responder to trigger an output

There are many more tests in the example, but exploring these few will give us an understanding of the key moving parts.

The Basic Flow

The test micro_speech_test.cc follows the same basic flow we’re familiar with from the “hello world” example: we load the model, set up the interpreter, and allocate tensors.

However, there’s a notable difference. In the “hello world” example, we used the AllOpsResolver to pull in all of the deep learning operations that might be necessary to run the model. This is a reliable approach, but it’s wasteful because a given model probably doesn’t use all of the dozens of available operations. When deployed to a device, these unnecessary operations will take up valuable memory, so it’s best if we include only those we need.

To do this, we first define the ops that our model will need, at the top of the test file:

namespace tflite {
namespace ops {
namespace micro {
TfLiteRegistration* Register_DEPTHWISE_CONV_2D();
TfLiteRegistration* Register_FULLY_CONNECTED();
TfLiteRegistration* Register_SOFTMAX();
}  // namespace micro
}  // namespace ops
}  // namespace tflite

Next, we set up logging and load our model, as normal:

// Set up logging.
tflite::MicroErrorReporter micro_error_reporter;
tflite::ErrorReporter* error_reporter = &micro_error_reporter;
// Map the model into a usable data structure. This doesn't involve any
// copying or parsing, it's a very lightweight operation.
const tflite::Model* model =
    ::tflite::GetModel(g_tiny_conv_micro_features_model_data);
if (model->version() != TFLITE_SCHEMA_VERSION) {
  error_reporter->Report(
      "Model provided is schema version %d not equal "
      "to supported version %d.
",
      model->version(), TFLITE_SCHEMA_VERSION);
}

After our model is loaded, we declare a MicroMutableOpResolver and use its method AddBuiltin() to add the ops we listed earlier:

tflite::MicroMutableOpResolver micro_mutable_op_resolver;
micro_mutable_op_resolver.AddBuiltin(
    tflite::BuiltinOperator_DEPTHWISE_CONV_2D,
    tflite::ops::micro::Register_DEPTHWISE_CONV_2D());
micro_mutable_op_resolver.AddBuiltin(
    tflite::BuiltinOperator_FULLY_CONNECTED,
    tflite::ops::micro::Register_FULLY_CONNECTED());
micro_mutable_op_resolver.AddBuiltin(tflite::BuiltinOperator_SOFTMAX,
                                      tflite::ops::micro::Register_SOFTMAX());

You’re probably wondering how we know which ops to include for a given model. One way is to try running the model using a MicroMutableOpResolver, but without calling AddBuiltin() at all. Inference will fail, and the accompanying error messages will inform us which ops are missing and need to be added.

Note

The MicroMutableOpResolver is defined in tensorflow/lite/micro/micro_mutable_op_resolver.h, which you’ll need to add to your include statements.

After the MicroMutableOpResolver is set up, we just carry on as usual, setting up our interpreter and its working memory:

// Create an area of memory to use for input, output, and intermediate arrays.
const int tensor_arena_size = 10 * 1024;
uint8_t tensor_arena[tensor_arena_size];
// Build an interpreter to run the model with.
tflite::MicroInterpreter interpreter(model, micro_mutable_op_resolver, tensor_arena,
                                     tensor_arena_size, error_reporter);
interpreter.AllocateTensors();

In our “hello world” application we allocated only 2 * 1,024 bytes for the tensor_arena, given that the model was so small. Our speech model is a lot bigger, and it deals with more complex input and output, so it needs more space (10 * 1,024). This was determined by trial and error.

Next, we check the input tensor size. However, it’s a little different this time around:

// Get information about the memory area to use for the model's input.
TfLiteTensor* input = interpreter.input(0);
// Make sure the input has the properties we expect.
TF_LITE_MICRO_EXPECT_NE(nullptr, input);
TF_LITE_MICRO_EXPECT_EQ(4, input->dims->size);
TF_LITE_MICRO_EXPECT_EQ(1, input->dims->data[0]);
TF_LITE_MICRO_EXPECT_EQ(49, input->dims->data[1]);
TF_LITE_MICRO_EXPECT_EQ(40, input->dims->data[2]);
TF_LITE_MICRO_EXPECT_EQ(1, input->dims->data[3]);
TF_LITE_MICRO_EXPECT_EQ(kTfLiteUInt8, input->type);

Because we’re dealing with a spectrogram as our input, the input tensor has more dimensions—four, in total. The first dimension is just a wrapper containing a single element. The second and third represent the “rows” and “columns” of our spectrogram, which happens to have 49 rows and 40 columns. The fourth, innermost dimension of the input tensor, which has size 1, holds each individual “pixel” of the spectrogram. We’ll look more at the spectrogram’s structure later on.
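
If you want to read or write an individual value of the spectrogram once it’s in the input tensor, you can treat the buffer as a flattened, row-major array. The following is a small illustrative sketch (not part of the example) that assumes the 49 × 40 layout described above and the kFeatureSliceSize constant from micro_features/micro_model_settings.h:

// Illustrative helper (not in the example): reads one "pixel" of the
// spectrogram from the flattened [1, 49, 40, 1] input tensor. The outer and
// inner dimensions of size 1 don't affect the index, so element (row, col)
// lives at row * kFeatureSliceSize + col.
uint8_t GetSpectrogramValue(const TfLiteTensor* input, int row, int col) {
  return input->data.uint8[(row * kFeatureSliceSize) + col];
}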

Next, we grab a sample spectrogram for a “yes,” stored in the constant g_yes_micro_f2e59fea_nohash_1_data. The constant is defined in the file micro_features/yes_micro_features_data.cc, which was included by this test. The spectrogram exists as a 1D array, and we just iterate through it to copy it into the input tensor:

// Copy a spectrogram created from a .wav audio file of someone saying "Yes"
// into the memory area used for the input.
const uint8_t* yes_features_data = g_yes_micro_f2e59fea_nohash_1_data;
for (int i = 0; i < input->bytes; ++i) {
  input->data.uint8[i] = yes_features_data[i];
}

After the input has been assigned, we run inference and inspect the output tensor’s size and shape:

// Run the model on this input and make sure it succeeds.
TfLiteStatus invoke_status = interpreter.Invoke();
if (invoke_status != kTfLiteOk) {
  error_reporter->Report("Invoke failed
");
}
TF_LITE_MICRO_EXPECT_EQ(kTfLiteOk, invoke_status);

// Get the output from the model, and make sure it's the expected size and
// type.
TfLiteTensor* output = interpreter.output(0);
TF_LITE_MICRO_EXPECT_EQ(2, output->dims->size);
TF_LITE_MICRO_EXPECT_EQ(1, output->dims->data[0]);
TF_LITE_MICRO_EXPECT_EQ(4, output->dims->data[1]);
TF_LITE_MICRO_EXPECT_EQ(kTfLiteUInt8, output->type);

Our output has two dimensions. The first is just a wrapper. The second has four elements, each holding the probability that the input matched one of our four classes (silence, unknown, “yes,” and “no”).

The next chunk of code checks whether the probabilities were as expected. A given element of the output tensor always represents a certain class, so we know which index to check for each one. The order is defined during training:

// There are four possible classes in the output, each with a score.
const int kSilenceIndex = 0;
const int kUnknownIndex = 1;
const int kYesIndex = 2;
const int kNoIndex = 3;

// Make sure that the expected "Yes" score is higher than the other classes.
uint8_t silence_score = output->data.uint8[kSilenceIndex];
uint8_t unknown_score = output->data.uint8[kUnknownIndex];
uint8_t yes_score = output->data.uint8[kYesIndex];
uint8_t no_score = output->data.uint8[kNoIndex];
TF_LITE_MICRO_EXPECT_GT(yes_score, silence_score);
TF_LITE_MICRO_EXPECT_GT(yes_score, unknown_score);
TF_LITE_MICRO_EXPECT_GT(yes_score, no_score);

We passed in a “yes” spectrogram, so we expect that the variable yes_score contains a higher probability than silence_score, unknown_score, and no_score.

When we’re satisfied with “yes,” we do the same thing with a “no” spectrogram. First, we copy in an input and run inference:

// Now test with a different input, from a recording of "No".
const uint8_t* no_features_data = g_no_micro_f9643d42_nohash_4_data;
for (int i = 0; i < input->bytes; ++i) {
  input->data.uint8[i] = no_features_data[i];
}
// Run the model on this "No" input.
invoke_status = interpreter.Invoke();
if (invoke_status != kTfLiteOk) {
  error_reporter->Report("Invoke failed
");
}
TF_LITE_MICRO_EXPECT_EQ(kTfLiteOk, invoke_status);

After inference is done, we confirm that “no” achieved the highest score:

// Make sure that the expected "No" score is higher than the other classes.
silence_score = output->data.uint8[kSilenceIndex];
unknown_score = output->data.uint8[kUnknownIndex];
yes_score = output->data.uint8[kYesIndex];
no_score = output->data.uint8[kNoIndex];
TF_LITE_MICRO_EXPECT_GT(no_score, silence_score);
TF_LITE_MICRO_EXPECT_GT(no_score, unknown_score);
TF_LITE_MICRO_EXPECT_GT(no_score, yes_score);

And we’re done!

To run this test, issue the following command from the root of the TensorFlow repository:

make -f tensorflow/lite/micro/tools/make/Makefile \
  test_micro_speech_test

Next up, let’s look at the source of all our audio data: the audio provider.

The Audio Provider

The audio provider is what connects a device’s microphone hardware to our code. Every device has a different mechanism for capturing audio. As a result, audio_provider.h defines an interface for requesting audio data, and developers can write their own implementations for any platforms that they want to support.

Tip

The example includes audio provider implementations for Arduino, STM32F746G, SparkFun Edge, and macOS. If you’d like this example to support a new device, you can read the existing implementations to learn how to do it.

The core part of the audio provider is a function named GetAudioSamples(), defined in audio_provider.h. It looks like this:

TfLiteStatus GetAudioSamples(tflite::ErrorReporter* error_reporter,
                             int start_ms, int duration_ms,
                             int* audio_samples_size, int16_t** audio_samples);

As described in audio_provider.h, the function is expected to return an array of 16-bit pulse code modulated (PCM) audio data. This is a very common format for digital audio.

The function is called with an ErrorReporter instance, a start time (start_ms), a duration (duration_ms), and two pointers.

These pointers are a mechanism for GetAudioSamples() to provide data. The caller declares variables of the appropriate type and then passes pointers to them when it calls the function. Inside the function’s implementation, the pointers are dereferenced and the variables’ values are set.

The first pointer, audio_samples_size, will receive the total number of 16-bit samples in the audio data. The second pointer, audio_samples, will receive an array containing the audio data itself.
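
If this pointer-to-pointer style is new to you, here’s a tiny standalone illustration of the pattern. The names here are made up for the sake of the example and aren’t part of the micro_speech code:

#include <cstdint>

// Hypothetical demonstration of the output-parameter pattern used by
// GetAudioSamples(). The function writes its results through the pointers
// that the caller passes in.
void ProvideSamples(int* sample_count, int16_t** samples) {
  static int16_t buffer[3] = {100, -42, 7};
  *sample_count = 3;  // set the caller's count variable
  *samples = buffer;  // point the caller's pointer at our storage
}

void Caller() {
  int count = 0;
  int16_t* data = nullptr;
  ProvideSamples(&count, &data);  // pass addresses so they can be filled in
  // count is now 3, and data points at the buffer above.
}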

By looking at the tests, we can see this in action. There are two tests in audio_provider_test.cc, but we need to look only at the first to learn how to use the audio provider:

TF_LITE_MICRO_TEST(TestAudioProvider) {
  tflite::MicroErrorReporter micro_error_reporter;
  tflite::ErrorReporter* error_reporter = &micro_error_reporter;

  int audio_samples_size = 0;
  int16_t* audio_samples = nullptr;
  TfLiteStatus get_status =
      GetAudioSamples(error_reporter, 0, kFeatureSliceDurationMs,
                      &audio_samples_size, &audio_samples);
  TF_LITE_MICRO_EXPECT_EQ(kTfLiteOk, get_status);
  TF_LITE_MICRO_EXPECT_LE(audio_samples_size, kMaxAudioSampleSize);
  TF_LITE_MICRO_EXPECT_NE(audio_samples, nullptr);

  // Make sure we can read all of the returned memory locations.
  int total = 0;
  for (int i = 0; i < audio_samples_size; ++i) {
    total += audio_samples[i];
  }
}

The test shows how GetAudioSamples() is called with some values and some pointers. The test confirms that the pointers are assigned correctly after the function is called.

Note

You’ll notice the use of some constants, kFeatureSliceDurationMs and kMaxAudioSampleSize. These are values that were chosen when the model was trained, and you can find them in micro_features/micro_model_settings.h.

The default implementation of audio_provider.cc just returns an empty array. To prove that it’s the right size, the test simply loops through it for the expected number of samples.

In addition to GetAudioSamples(), the audio provider contains a function called LatestAudioTimestamp(). This is intended to return the time that audio data was last captured, in milliseconds. This information is needed by the feature provider to determine what audio data to fetch.

To run the audio provider tests, use the following command:

make -f tensorflow/lite/micro/tools/make/Makefile \
  test_audio_provider_test

The audio provider is used by the feature provider as a source of fresh audio samples, so let’s take a look at that next.

The Feature Provider

The feature provider converts raw audio, obtained from the audio provider, into spectrograms that can be fed into our model. It is called during the main loop.

Its interface is defined in feature_provider.h, and looks like this:

class FeatureProvider {
 public:
  // Create the provider, and bind it to an area of memory. This memory should
  // remain accessible for the lifetime of the provider object, since subsequent
  // calls will fill it with feature data. The provider does no memory
  // management of this data.
  FeatureProvider(int feature_size, uint8_t* feature_data);
  ~FeatureProvider();

  // Fills the feature data with information from audio inputs, and returns how
  // many feature slices were updated.
  TfLiteStatus PopulateFeatureData(tflite::ErrorReporter* error_reporter,
                                   int32_t last_time_in_ms, int32_t time_in_ms,
                                   int* how_many_new_slices);

 private:
  int feature_size_;
  uint8_t* feature_data_;
  // Make sure we don't try to use cached information if this is the first call
  // into the provider.
  bool is_first_run_;
};

To see how it’s used, we can take a look at the tests in feature_provider_mock_test.cc.

To give the feature provider audio data to work with, these tests use a special fake version of the audio provider, known as a mock, which is rigged up to supply known data. It is defined in audio_provider_mock.cc.

Note

The mock audio provider is substituted for the real thing in the build instructions for the test, which you can find in Makefile.inc under FEATURE_PROVIDER_MOCK_TEST_SRCS.

The file feature_provider_mock_test.cc contains two tests. Here’s the first one:

TF_LITE_MICRO_TEST(TestFeatureProviderMockYes) {
  tflite::MicroErrorReporter micro_error_reporter;
  tflite::ErrorReporter* error_reporter = &micro_error_reporter;

  uint8_t feature_data[kFeatureElementCount];
  FeatureProvider feature_provider(kFeatureElementCount, feature_data);

  int how_many_new_slices = 0;
  TfLiteStatus populate_status = feature_provider.PopulateFeatureData(
      error_reporter, /* last_time_in_ms= */ 0, /* time_in_ms= */ 970,
      &how_many_new_slices);
  TF_LITE_MICRO_EXPECT_EQ(kTfLiteOk, populate_status);
  TF_LITE_MICRO_EXPECT_EQ(kFeatureSliceCount, how_many_new_slices);

  for (int i = 0; i < kFeatureElementCount; ++i) {
    TF_LITE_MICRO_EXPECT_EQ(g_yes_micro_f2e59fea_nohash_1_data[i],
                            feature_data[i]);
  }
}

To create a FeatureProvider, we call its constructor, passing in feature_size and feature_data arguments:

FeatureProvider feature_provider(kFeatureElementCount, feature_data);

The first argument indicates how many total data elements should be in the spectrogram. The second argument is a pointer to an array that we want to be populated with the spectrogram data.

The number of elements in the spectrogram was decided when the model was trained and is defined as kFeatureElementCount in micro_features/micro_model_settings.h.

To obtain features for the past second of audio, feature_provider.PopulateFeatureData() is called:

TfLiteStatus populate_status = feature_provider.PopulateFeatureData(
      error_reporter, /* last_time_in_ms= */ 0, /* time_in_ms= */ 970,
      &how_many_new_slices);

We supply an ErrorReporter instance, an integer representing the last time this method was called (last_time_in_ms), the current time (time_in_ms), and a pointer to an integer that will be updated with how many new feature slices we receive (how_many_new_slices). A slice is just one row of the spectrogram, representing a chunk of time.

Because we always want the last second of audio, the feature provider will compare when it was last called (last_time_in_ms) with the current time (time_in_ms), create spectrogram data from the audio captured during that time, and then update the feature_data array to add any additional slices and drop any that are older than one second.

When PopulateFeatureData() runs, it will request audio from the mock audio provider. The mock will give it audio representing a “yes,” and the feature provider will process it and provide the result.

After calling PopulateFeatureData(), we check whether its result is what we expect. We compare the data it generated to a known spectrogram that is correct for the “yes” input given by the mock audio provider:

TF_LITE_MICRO_EXPECT_EQ(kTfLiteOk, populate_status);
TF_LITE_MICRO_EXPECT_EQ(kFeatureSliceCount, how_many_new_slices);
for (int i = 0; i < kFeatureElementCount; ++i) {
  TF_LITE_MICRO_EXPECT_EQ(g_yes_micro_f2e59fea_nohash_1_data[i],
                          feature_data[i]);
}

The mock audio provider can provide audio for a “yes” or a “no” depending on which start and end times are passed into it. The second test in feature_provider_mock_test.cc does exactly the same thing as the first, but for audio representing “no.”

To run the tests, use the following command:

make -f tensorflow/lite/micro/tools/make/Makefile \
  test_feature_provider_mock_test

How the feature provider converts audio to a spectrogram

The feature provider is implemented in feature_provider.cc. Let’s talk through how it works.

As we’ve discussed, its job is to populate an array that represents a spectrogram of one second of audio. It’s designed to be called in a loop, so to avoid unnecessary work, it generates new features only for the time between now and when it was last called. If it was last called less than a second ago, it keeps some of its previous output and generates only the missing parts.

In our code, each spectrogram is represented as a 2D array, with 40 columns and 49 rows, where each row represents a 30-millisecond (ms) sample of audio split into 43 frequency buckets.

To create each row, we run a 30-ms slice of audio input through a fast Fourier transform (FFT) algorithm. This technique analyzes the frequency distribution of audio in the sample and creates an array of 256 frequency buckets, each with a value from 0 to 255. These are averaged together into groups of six, leaving us with 43 buckets.

The code that does this is in the file micro_features/micro_features_generator.cc, and is called by the feature provider.
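
To make the averaging step concrete, here’s a simplified sketch of the idea. It is not the production code, which is more involved; it just illustrates how 256 bins can be collapsed into 43 buckets by averaging groups of six, with the final, smaller group holding the leftover bins:

#include <cstdint>

// Simplified sketch of the bucket-averaging step described above; the real
// feature generator performs additional processing.
constexpr int kFftBins = 256;
constexpr int kGroupSize = 6;
constexpr int kBuckets = (kFftBins + kGroupSize - 1) / kGroupSize;  // 43

void AverageIntoBuckets(const uint8_t fft_bins[kFftBins],
                        uint8_t buckets[kBuckets]) {
  for (int bucket = 0; bucket < kBuckets; ++bucket) {
    int start = bucket * kGroupSize;
    int end = start + kGroupSize;
    if (end > kFftBins) end = kFftBins;  // the last group holds only 4 bins
    int sum = 0;
    for (int i = start; i < end; ++i) sum += fft_bins[i];
    buckets[bucket] = static_cast<uint8_t>(sum / (end - start));
  }
}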

To build the entire 2D array, we combine the results of running the FFT on 49 consecutive 30-ms slices of audio, with each slice overlapping the last by 10 ms. Figure 7-4 shows how this happens.

You can see how the 30-ms sample window is moved forward by 20 ms each time until it has covered the full one-second sample. The resulting spectrogram is ready to pass into our model.
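
As a quick sanity check on these numbers, here’s a small standalone sketch. The constant values are assumptions taken from the description above (in the example they live in micro_features/micro_model_settings.h):

// Sanity check of the sliding-window arithmetic described above. The values
// here are assumed from the text: 30 ms slices, a 20 ms stride, 49 slices.
constexpr int kWindowMs = 30;
constexpr int kStrideMs = 20;
constexpr int kSliceCount = 49;
// The final slice starts at 48 * 20 = 960 ms and ends at 990 ms, so 49
// overlapping slices cover roughly one second of audio.
static_assert((kSliceCount - 1) * kStrideMs + kWindowMs == 990,
              "49 overlapping 30 ms slices span about one second");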

We can understand how this process happens in feature_provider.cc. First, it determines which slices it actually needs to generate based on the time PopulateFeatureData() was last called:

// Quantize the time into steps as long as each window stride, so we can
// figure out which audio data we need to fetch.
const int last_step = (last_time_in_ms / kFeatureSliceStrideMs);
const int current_step = (time_in_ms / kFeatureSliceStrideMs);

int slices_needed = current_step - last_step;

Diagram of audio samples being processed
Figure 7-4. Diagram of audio samples being processed

If it hasn’t run before, or it ran more than one second ago, it will generate the maximum number of slices:

if (is_first_run_) {
  TfLiteStatus init_status = InitializeMicroFeatures(error_reporter);
  if (init_status != kTfLiteOk) {
    return init_status;
  }
  is_first_run_ = false;
  slices_needed = kFeatureSliceCount;
}
if (slices_needed > kFeatureSliceCount) {
  slices_needed = kFeatureSliceCount;
}
*how_many_new_slices = slices_needed;

The resulting number is written to how_many_new_slices.

Next, it calculates how many of any existing slices it should keep, and shifts data in the array around to make room for any new ones:

const int slices_to_keep = kFeatureSliceCount - slices_needed;
const int slices_to_drop = kFeatureSliceCount - slices_to_keep;
// If we can avoid recalculating some slices, just move the existing data
// up in the spectrogram, to perform something like this:
// last time = 80ms          current time = 120ms
// +-----------+             +-----------+
// | data@20ms |         --> | data@60ms |
// +-----------+       --    +-----------+
// | data@40ms |     --  --> | data@80ms |
// +-----------+   --  --    +-----------+
// | data@60ms | --  --      |  <empty>  |
// +-----------+   --        +-----------+
// | data@80ms | --          |  <empty>  |
// +-----------+             +-----------+
if (slices_to_keep > 0) {
  for (int dest_slice = 0; dest_slice < slices_to_keep; ++dest_slice) {
    uint8_t* dest_slice_data =
        feature_data_ + (dest_slice * kFeatureSliceSize);
    const int src_slice = dest_slice + slices_to_drop;
    const uint8_t* src_slice_data =
        feature_data_ + (src_slice * kFeatureSliceSize);
    for (int i = 0; i < kFeatureSliceSize; ++i) {
      dest_slice_data[i] = src_slice_data[i];
    }
  }
}

Note

If you’re a seasoned C++ developer, you might wonder why we don’t use standard libraries to do things like copying data around. The reason is that we’re trying to avoid unnecessary dependencies, in an effort to keep our binary size small. Because embedded platforms have very little memory, a smaller application binary means that we have space for a larger and more accurate deep learning model.

After moving data around, it begins a loop that iterates once for each new slice that it needs. In this loop, it first requests audio for that slice from the audio provider using GetAudioSamples():

for (int new_slice = slices_to_keep; new_slice < kFeatureSliceCount;
     ++new_slice) {
  const int new_step = (current_step - kFeatureSliceCount + 1) + new_slice;
  const int32_t slice_start_ms = (new_step * kFeatureSliceStrideMs);
  int16_t* audio_samples = nullptr;
  int audio_samples_size = 0;
  GetAudioSamples(error_reporter, slice_start_ms, kFeatureSliceDurationMs,
                  &audio_samples_size, &audio_samples);
  if (audio_samples_size < kMaxAudioSampleSize) {
    error_reporter->Report("Audio data size %d too small, want %d",
                           audio_samples_size, kMaxAudioSampleSize);
    return kTfLiteError;
  }

To complete the loop iteration, it passes that data into GenerateMicroFeatures(), defined in micro_features/micro_features_generator.h. This is the function that performs the FFT and returns the audio frequency information.

It also passes a pointer, new_slice_data, which points at the memory location where the new data should be written:

  uint8_t* new_slice_data = feature_data_ + (new_slice * kFeatureSliceSize);
  size_t num_samples_read;
  TfLiteStatus generate_status = GenerateMicroFeatures(
      error_reporter, audio_samples, audio_samples_size, kFeatureSliceSize,
      new_slice_data, &num_samples_read);
  if (generate_status != kTfLiteOk) {
    return generate_status;
  }
}

After this process has happened for each slice, we have an entire second’s worth of up-to-date spectrogram.

Tip

The function that generates the FFT is GenerateMicroFeatures(). If you’re interested, you can read its definition in micro_features/micro_features_generator.cc.

If you’re building your own application that uses spectrograms, you can reuse this code as is. You’ll need to use the same code to preprocess data into spectrograms when training your model.

Once we have a spectrogram, we can run inference on it using the model. After this happens, we need to interpret the results. That task belongs to the class we explore next, RecognizeCommands.

The Command Recognizer

After our model outputs a set of probabilities that a known word was spoken in the last second of audio, it’s the job of the RecognizeCommands class to determine whether this indicates a successful detection.

It seems like this would be simple: if the probability in a given category is more than a certain threshold, the word was spoken. However, in the real world, things become a bit more complicated.

As we established earlier, we’re running multiple inferences per second, each on a one-second window of data. This means that we’ll run inference on any given word multiple times, in multiple windows.

In Figure 7-5, you can see a waveform of the word “noted” being spoken, surrounded by a box representing a one-second window being captured.

The word 'noted' being captured in our window
Figure 7-5. The word “noted” being captured in our window

Our model is trained to detect the word “no,” and it understands that the word “noted” is not the same thing. If we run inference on this one-second window, it will (hopefully) output a low probability for the word “no.” However, what if the window came slightly earlier in the audio stream, as in Figure 7-6?

Part of the word 'noted' being captured in our window
Figure 7-6. Part of the word “noted” being captured in our window

In this case, the only part of the word “noted” that appears within the window is its first syllable. Because the first syllable of “noted” sounds like “no,” it’s likely that the model will interpret this as having a high probability of being a “no.”

This problem, along with others, means that we can’t rely on a single inference to tell us whether a word was spoken. This is where RecognizeCommands comes in!

The recognizer calculates the average score for each word over the past few inferences and decides whether it’s high enough to count as a detection. To do this, we feed it the results of each inference as they roll in.

You can see its interface in recognize_commands.h, partially reproduced here:

class RecognizeCommands {
 public:
  explicit RecognizeCommands(tflite::ErrorReporter* error_reporter,
                             int32_t average_window_duration_ms = 1000,
                             uint8_t detection_threshold = 200,
                             int32_t suppression_ms = 1500,
                             int32_t minimum_count = 3);

  // Call this with the results of running a model on sample data.
  TfLiteStatus ProcessLatestResults(const TfLiteTensor* latest_results,
                                    const int32_t current_time_ms,
                                    const char** found_command, uint8_t* score,
                                    bool* is_new_command);

The class RecognizeCommands is declared, along with a constructor that sets default values for a few parameters:

  • The length of the averaging window (average_window_duration_ms)

  • The minimum average score that counts as a detection (detection_threshold)

  • The amount of time we’ll wait after hearing a command before recognizing a second one (suppression_ms)

  • The minimum number of inferences required in the window for a result to count (minimum_count)

The class has one method, ProcessLatestResults(). It accepts a pointer to a TfLiteTensor containing the model’s output (latest_results), and it must be called with the current time (current_time_ms).

In addition, it takes three pointers that it uses for output. First, it gives us the name of any word that was detected (found_command). It also provides the average score of the command (score) and whether the command is new or has been heard in previous inferences within a certain timespan (is_new_command).

Averaging the results of multiple inferences is a useful and common technique when dealing with time-series data. In the next few pages, we’ll walk through the code in recognize_commands.cc and learn a bit about how it works. You don’t need to understand every line, but it’s helpful to get some insight into what might be a helpful tool in your own projects.
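
As a point of comparison before we dive in, here’s a stripped-down, hypothetical sketch of the general averaging idea. It is not the RecognizeCommands implementation: the real class avoids the C++ standard library (see the earlier note about binary size) and also tracks timestamps, thresholds, and suppression:

#include <array>
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical sketch: average the last few score vectors from a classifier.
constexpr int kNumCategories = 4;

class ScoreAverager {
 public:
  explicit ScoreAverager(std::size_t window_size) : window_size_(window_size) {}

  // Add the latest scores and return the per-category average over the window.
  std::array<int32_t, kNumCategories> Add(
      const std::array<uint8_t, kNumCategories>& scores) {
    history_.push_back(scores);
    if (history_.size() > window_size_) {
      history_.pop_front();
    }
    std::array<int32_t, kNumCategories> averages{};
    for (const auto& entry : history_) {
      for (int i = 0; i < kNumCategories; ++i) {
        averages[i] += entry[i];
      }
    }
    for (int i = 0; i < kNumCategories; ++i) {
      averages[i] /= static_cast<int32_t>(history_.size());
    }
    return averages;
  }

 private:
  std::size_t window_size_;
  std::deque<std::array<uint8_t, kNumCategories>> history_;
};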

First, we make sure the input tensor is the right shape and type:

TfLiteStatus RecognizeCommands::ProcessLatestResults(
    const TfLiteTensor* latest_results, const int32_t current_time_ms,
    const char** found_command, uint8_t* score, bool* is_new_command) {
  if ((latest_results->dims->size != 2) ||
      (latest_results->dims->data[0] != 1) ||
      (latest_results->dims->data[1] != kCategoryCount)) {
    error_reporter_->Report(
        "The results for recognition should contain %d elements, but there are "
        "%d in an %d-dimensional shape",
        kCategoryCount, latest_results->dims->data[1],
        latest_results->dims->size);
    return kTfLiteError;
  }

  if (latest_results->type != kTfLiteUInt8) {
    error_reporter_->Report(
        "The results for recognition should be uint8 elements, but are %d",
        latest_results->type);
    return kTfLiteError;
  }

Next, we check current_time_ms to verify that it is after the most recent result in our averaging window:

if ((!previous_results_.empty()) &&
    (current_time_ms < previous_results_.front().time_)) {
  error_reporter_->Report(
      "Results must be fed in increasing time order, but received a "
      "timestamp of %d that was earlier than the previous one of %d",
      current_time_ms, previous_results_.front().time_);
  return kTfLiteError;
}

After that, we add the latest result to a list of results we’ll be averaging:

// Add the latest results to the head of the queue.
previous_results_.push_back({current_time_ms, latest_results->data.uint8});
// Prune any earlier results that are too old for the averaging window.
const int64_t time_limit = current_time_ms - average_window_duration_ms_;
while ((!previous_results_.empty()) &&
       previous_results_.front().time_ < time_limit) {
  previous_results_.pop_front();
}

If there are fewer results in our averaging window than the minimum number (defined by minimum_count_, which is 3 by default), we can’t provide a valid average. In this case, we set the output pointers to indicate that found_command is the most recent top command, that the score is 0, and that the command is not a new one:

// If there are too few results, assume the result will be unreliable and
// bail.
const int64_t how_many_results = previous_results_.size();
const int64_t earliest_time = previous_results_.front().time_;
const int64_t samples_duration = current_time_ms - earliest_time;
if ((how_many_results < minimum_count_) ||
    (samples_duration < (average_window_duration_ms_ / 4))) {
  *found_command = previous_top_label_;
  *score = 0;
  *is_new_command = false;
  return kTfLiteOk;
}

Otherwise, we continue by averaging all of the scores in the window:

// Calculate the average score across all the results in the window.
int32_t average_scores[kCategoryCount];
for (int offset = 0; offset < previous_results_.size(); ++offset) {
  PreviousResultsQueue::Result previous_result =
      previous_results_.from_front(offset);
  const uint8_t* scores = previous_result.scores_;
  for (int i = 0; i < kCategoryCount; ++i) {
    if (offset == 0) {
      average_scores[i] = scores[i];
    } else {
      average_scores[i] += scores[i];
    }
  }
}
for (int i = 0; i < kCategoryCount; ++i) {
  average_scores[i] /= how_many_results;
}

We now have enough information to identify which category is our winner. Establishing this is a simple process:

// Find the current highest scoring category.
int current_top_index = 0;
int32_t current_top_score = 0;
for (int i = 0; i < kCategoryCount; ++i) {
  if (average_scores[i] > current_top_score) {
    current_top_score = average_scores[i];
    current_top_index = i;
  }
}
const char* current_top_label = kCategoryLabels[current_top_index];

The final piece of logic determines whether the result was a valid detection. To do this, it ensures that its score is above the detection threshold (200 by default), and that it didn’t happen too quickly after the last valid detection, which can be an indication of a faulty result:

// If we've recently had another label trigger, assume one that occurs too
// soon afterwards is a bad result.
int64_t time_since_last_top;
if ((previous_top_label_ == kCategoryLabels[0]) ||
    (previous_top_label_time_ == std::numeric_limits<int32_t>::min())) {
  time_since_last_top = std::numeric_limits<int32_t>::max();
} else {
  time_since_last_top = current_time_ms - previous_top_label_time_;
}
if ((current_top_score > detection_threshold_) &&
    ((current_top_label != previous_top_label_) ||
     (time_since_last_top > suppression_ms_))) {
  previous_top_label_ = current_top_label;
  previous_top_label_time_ = current_time_ms;
  *is_new_command = true;
} else {
  *is_new_command = false;
}
*found_command = current_top_label;
*score = current_top_score;

If the result was valid, is_new_command is set to true. This is what the caller can use to determine whether a word was genuinely detected.

The tests (in recognize_commands_test.cc) exercise various combinations of inputs and results that are stored in the averaging window.

Let’s walk through one of the tests, RecognizeCommandsTestBasic, which demonstrates how RecognizeCommands is used. First, we just create an instance of the class:

TF_LITE_MICRO_TEST(RecognizeCommandsTestBasic) {
  tflite::MicroErrorReporter micro_error_reporter;
  tflite::ErrorReporter* error_reporter = &micro_error_reporter;

  RecognizeCommands recognize_commands(error_reporter);

Next, we create a tensor containing some fake inference results, which will be used by ProcessLatestResults() to decide whether a command was heard:

TfLiteTensor results = tflite::testing::CreateQuantizedTensor(
    {255, 0, 0, 0}, tflite::testing::IntArrayFromInitializer({2, 1, 4}),
    "input_tensor", 0.0f, 128.0f);

Then, we set up some variables that will be set with the output of ProcessLatestResults():

const char* found_command;
uint8_t score;
bool is_new_command;

Finally, we call ProcessLatestResults(), providing pointers to these variables along with the tensor containing the results. We assert that the function will return kTfLiteOk, indicating that the input was processed successfully:

TF_LITE_MICRO_EXPECT_EQ(
    kTfLiteOk, recognize_commands.ProcessLatestResults(
                   &results, 0, &found_command, &score, &is_new_command));

The other tests in the file perform some more exhaustive checks to make sure the function is performing correctly. You can read through them to learn more.

To run all of the tests, use the following command:

make -f tensorflow/lite/micro/tools/make/Makefile \
  test_recognize_commands_test

As soon as we’ve determined whether a command was detected, it’s time to share our results with the world (or at least our on-board LEDs). The command responder is what makes this happen.

The Command Responder

The final piece in our puzzle, the command responder, is what produces an output to let us know that a word was detected.

The command responder is designed to be overridden for each type of device. We explore the device-specific implementations later in this chapter.

For now, let’s look at its very simple reference implementation, which just logs detection results as text. You can find it in the file command_responder.cc:

void RespondToCommand(tflite::ErrorReporter* error_reporter,
                      int32_t current_time, const char* found_command,
                      uint8_t score, bool is_new_command) {
  if (is_new_command) {
    error_reporter->Report("Heard %s (%d) @%dms", found_command, score,
                           current_time);
  }
}

That’s it! The file implements just one function: RespondToCommand(). As parameters, it expects an error_reporter, the current time (current_time), the command that was last detected (found_command), the score it received (score), and whether the command was newly heard (is_new_command).

It’s important to note that in our program’s main loop, this function will be called every time inference is performed, even if a command was not detected. This means that we should check is_new_command to determine whether anything needs to be done.
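
For instance, a custom implementation that reacts only when a new “yes” is heard might look like the following hypothetical sketch; the Report() call stands in for whatever device-specific output you’d use in a real project:

#include "tensorflow/lite/micro/examples/micro_speech/command_responder.h"
#include <cstring>

// Hypothetical custom command responder: only acts when a new "yes" arrives.
void RespondToCommand(tflite::ErrorReporter* error_reporter,
                      int32_t current_time, const char* found_command,
                      uint8_t score, bool is_new_command) {
  // This is called after every inference, so bail out unless a new command
  // was actually detected.
  if (!is_new_command) {
    return;
  }
  if (strcmp(found_command, "yes") == 0) {
    // Placeholder action; a real application might toggle a GPIO pin here.
    error_reporter->Report("Got a yes! (%d) @%dms", score, current_time);
  }
}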

The test for this function, in command_responder_test.cc, is equally simple. It just calls the function, given that there’s no way for it to test that it generates the correct output:

TF_LITE_MICRO_TEST(TestCallability) {
  tflite::MicroErrorReporter micro_error_reporter;
  tflite::ErrorReporter* error_reporter = &micro_error_reporter;

  // This will have external side-effects (like printing to the debug console
  // or lighting an LED) that are hard to observe, so the most we can do is
  // make sure the call doesn't crash.
  RespondToCommand(error_reporter, 0, "foo", 0, true);
}

To run this test, enter this in your terminal:

make -f tensorflow/lite/micro/tools/make/Makefile \
  test_command_responder_test

And that’s it! We’ve walked through all of the components of the application. Now, let’s see how they come together in the program itself.

Listening for Wake Words

You can find the following code in main_functions.cc, which defines the setup() and loop() functions that are the core of our program. Let’s read through it together!

Because you’re now a seasoned TensorFlow Lite expert, a lot of this code will look familiar to you. So let’s try to focus on the new bits.

First, we list the ops that we want to use:

namespace tflite {
namespace ops {
namespace micro {
TfLiteRegistration* Register_DEPTHWISE_CONV_2D();
TfLiteRegistration* Register_FULLY_CONNECTED();
TfLiteRegistration* Register_SOFTMAX();
}  // namespace micro
}  // namespace ops
}  // namespace tflite

Next, we set up our global variables:

namespace {
tflite::ErrorReporter* error_reporter = nullptr;
const tflite::Model* model = nullptr;
tflite::MicroInterpreter* interpreter = nullptr;
TfLiteTensor* model_input = nullptr;
FeatureProvider* feature_provider = nullptr;
RecognizeCommands* recognizer = nullptr;
int32_t previous_time = 0;

// Create an area of memory to use for input, output, and intermediate arrays.
// The size of this will depend on the model you're using, and may need to be
// determined by experimentation.
constexpr int kTensorArenaSize = 10 * 1024;
uint8_t tensor_arena[kTensorArenaSize];
}  // namespace

Notice how we declare a FeatureProvider and a RecognizeCommands in addition to the usual TensorFlow suspects. We also declare a variable named previous_time, which keeps track of the most recent time we received new audio samples.

Next up, in the setup() function, we load the model, set up our interpreter, add ops, and allocate tensors:

void setup() {
  // Set up logging.
  static tflite::MicroErrorReporter micro_error_reporter;
  error_reporter = &micro_error_reporter;

  // Map the model into a usable data structure. This doesn't involve any
  // copying or parsing, it's a very lightweight operation.
  model = tflite::GetModel(g_tiny_conv_micro_features_model_data);
  if (model->version() != TFLITE_SCHEMA_VERSION) {
    error_reporter->Report(
        "Model provided is schema version %d not equal "
        "to supported version %d.",
        model->version(), TFLITE_SCHEMA_VERSION);
    return;
  }

  // Pull in only the operation implementations we need.
  static tflite::MicroMutableOpResolver micro_mutable_op_resolver;
  micro_mutable_op_resolver.AddBuiltin(
      tflite::BuiltinOperator_DEPTHWISE_CONV_2D,
      tflite::ops::micro::Register_DEPTHWISE_CONV_2D());
  micro_mutable_op_resolver.AddBuiltin(
      tflite::BuiltinOperator_FULLY_CONNECTED,
      tflite::ops::micro::Register_FULLY_CONNECTED());
  micro_mutable_op_resolver.AddBuiltin(tflite::BuiltinOperator_SOFTMAX,
                                       tflite::ops::micro::Register_SOFTMAX());

  // Build an interpreter to run the model with.
  static tflite::MicroInterpreter static_interpreter(
      model, micro_mutable_op_resolver, tensor_arena, kTensorArenaSize,
      error_reporter);
  interpreter = &static_interpreter;

  // Allocate memory from the tensor_arena for the model's tensors.
  TfLiteStatus allocate_status = interpreter->AllocateTensors();
  if (allocate_status != kTfLiteOk) {
    error_reporter->Report("AllocateTensors() failed");
    return;
  }

After allocating tensors, we check that the input tensor is the correct shape and type:

  // Get information about the memory area to use for the model's input.
  model_input = interpreter->input(0);
  if ((model_input->dims->size != 4) || (model_input->dims->data[0] != 1) ||
      (model_input->dims->data[1] != kFeatureSliceCount) ||
      (model_input->dims->data[2] != kFeatureSliceSize) ||
      (model_input->type != kTfLiteUInt8)) {
    error_reporter->Report("Bad input tensor parameters in model");
    return;
  }

Next comes the interesting stuff. First, we instantiate a FeatureProvider, pointing it at our input tensor:

  // Prepare to access the audio spectrograms from a microphone or other source
  // that will provide the inputs to the neural network.
  static FeatureProvider static_feature_provider(kFeatureElementCount,
                                                 model_input->data.uint8);
  feature_provider = &static_feature_provider;

We then create a RecognizeCommands instance and initialize our previous_time variable:

  static RecognizeCommands static_recognizer(error_reporter);
  recognizer = &static_recognizer;

  previous_time = 0;
}

Up next, it’s time for our loop() function. Like in the previous example, this function will be called over and over again, indefinitely. In the loop, we first use the feature provider to create a spectrogram:

void loop() {
  // Fetch the spectrogram for the current time.
  const int32_t current_time = LatestAudioTimestamp();
  int how_many_new_slices = 0;
  TfLiteStatus feature_status = feature_provider->PopulateFeatureData(
      error_reporter, previous_time, current_time, &how_many_new_slices);
  if (feature_status != kTfLiteOk) {
    error_reporter->Report("Feature generation failed");
    return;
  }
  previous_time = current_time;
  // If no new audio samples have been received since last time, don't bother
  // running the network model.
  if (how_many_new_slices == 0) {
    return;
  }

If there’s no new data since the last iteration, we don’t bother running inference.

After we have our input, we just invoke the interpreter:

  // Run the model on the spectrogram input and make sure it succeeds.
  TfLiteStatus invoke_status = interpreter->Invoke();
  if (invoke_status != kTfLiteOk) {
    error_reporter->Report("Invoke failed");
    return;
  }

The model’s output tensor is now filled with the probabilities for each category. To interpret them, we use our RecognizeCommands instance. We obtain a pointer to the output tensor, then set up a few variables to receive the ProcessLatestResults() output:

  // Obtain a pointer to the output tensor
  TfLiteTensor* output = interpreter->output(0);
  // Determine whether a command was recognized based on the output of inference
  const char* found_command = nullptr;
  uint8_t score = 0;
  bool is_new_command = false;
  TfLiteStatus process_status = recognizer->ProcessLatestResults(
      output, current_time, &found_command, &score, &is_new_command);
  if (process_status != kTfLiteOk) {
    error_reporter->Report("RecognizeCommands::ProcessLatestResults() failed");
    return;
  }

Finally, we call the command responder’s RespondToCommand() method so that it can notify users if a word was detected:

  // Do something based on the recognized command. The default implementation
  // just prints to the error console, but you should replace this with your
  // own function for a real application.
  RespondToCommand(error_reporter, current_time, found_command, score,
                   is_new_command);
}

And that’s it! The call to RespondToCommand() is the final thing in our loop. Everything from feature generation onward will repeat endlessly, checking the audio for known words and producing some output if one is confirmed.

The setup() and loop() functions are called by our main() function, defined in main.cc, which begins the loop when the application starts:

int main(int argc, char* argv[]) {
  setup();
  while (true) {
    loop();
  }
}

Running Our Application

The example contains an audio provider compatible with macOS. If you have access to a Mac, you can run the example on your development machine. First, use the following command to build it:

make -f tensorflow/lite/micro/tools/make/Makefile micro_speech

After the build completes, you can run the example with the following command:

tensorflow/lite/micro/tools/make/gen/osx_x86_64/bin/micro_speech

You might see a pop-up asking for microphone access. If so, grant it, and the program will start.

Try saying “yes” and “no.” You should see output that looks like the following:

Heard yes (201) @4056ms
Heard no (205) @6448ms
Heard unknown (201) @13696ms
Heard yes (205) @15000ms
Heard yes (205) @16856ms
Heard unknown (204) @18704ms
Heard no (206) @21000ms

The number after each detected word is its score. By default, the command recognizer component considers matches as valid only if their score is greater than 200, so all of the scores you see will be above 200.

The number after the score is the number of milliseconds since the program was started.
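
If you’d like to treat these scores as rough probabilities, the conversion below is one possibility. It assumes the model’s uint8 output maps 0–255 linearly onto 0.0–1.0, which is a common convention for quantized softmax outputs but is worth confirming against your model’s quantization parameters:

#include <cstdint>

// Hedged sketch: interpret a uint8 score as an approximate probability,
// assuming the 0-255 range maps linearly onto 0.0-1.0.
float ScoreToProbability(uint8_t score) {
  return static_cast<float>(score) / 255.0f;
}
// Under this assumption, the default detection threshold of 200 corresponds
// to roughly 0.78 average confidence across the window.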

If you don’t see any output, make sure your Mac’s internal microphone is selected in the Sound settings and that its input volume is turned up high enough.

We’ve established that the program works on a Mac. Now, let’s get it running on some embedded hardware.

Deploying to Microcontrollers

In this section, we deploy the code to three different devices:

  • Arduino Nano 33 BLE Sense

  • SparkFun Edge

  • ST Microelectronics STM32F746G Discovery kit

For each one, we’ll walk through the build and deployment process.

Because every device has its own mechanism for capturing audio, there’s a separate implementation of audio_provider.cc for each one. The same is true for output, so each has a variant of command_responder.cc, too.

The audio_provider.cc implementations are complex and device-specific, and not directly related to machine learning. Consequently, we won’t walk through them in this chapter. However, there’s a walkthrough of the Arduino variant in Appendix B. If you need to capture audio in your own project, you’re welcome to reuse these implementations in your own code.

Alongside deployment instructions, we’re also going to walk through the command_responder.cc implementation for each device. First up, it’s time for Arduino.

Arduino

As of this writing, the only Arduino board with a built-in microphone is the Arduino Nano 33 BLE Sense, so that’s what we’ll be using for this section. If you’re using a different Arduino board and attaching your own microphone, you’ll need to implement your own audio_provider.cc.

The Arduino Nano 33 BLE Sense also has a built-in LED, which is what we use to indicate that a word has been recognized.

Figure 7-7 shows a picture of the board with its LED highlighted.

Image of the Arduino Nano 33 BLE Sense board with the LED highlighted
Figure 7-7. The Arduino Nano 33 BLE Sense board with the LED highlighted

Now let’s look at how we use this LED to indicate that a word has been detected.

Responding to commands on Arduino

Every Arduino board has a built-in LED, and there’s a convenient constant called LED_BUILTIN that we can use to obtain its pin number, which varies across boards. To keep this code portable, we’ll constrain ourselves to using this single LED for output.
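
If you haven’t used LED_BUILTIN before, the following standalone sketch (entirely separate from the wake-word example, and using only standard Arduino calls) simply blinks the built-in LED once per second, just to show the constant in action:

#include "Arduino.h"

// A plain blink sketch. LED_BUILTIN resolves to the correct pin for whichever
// board you compile for, which is what makes it portable.
void setup() {
  pinMode(LED_BUILTIN, OUTPUT);
}

void loop() {
  digitalWrite(LED_BUILTIN, HIGH);  // switch the LED on
  delay(500);                       // wait half a second
  digitalWrite(LED_BUILTIN, LOW);   // switch it off again
  delay(500);
}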

Here’s what we’re going to do. To show that inference is running, we’ll flash the LED by toggling it on or off with each inference. However, when we hear the word “yes,” we’ll switch on the LED for a few seconds.

What about the word “no”? Well, because this is just a demonstration, we won’t worry about it too much. We do, however, log all of the detected commands to the serial port, so we can connect to the device and see every match.

The replacement command responder for Arduino is located in arduino/command_responder.cc. Let’s walk through its source. First, we include the command responder header file and the Arduino platform’s library header file:

#include "tensorflow/lite/micro/examples/micro_speech/command_responder.h"
#include "Arduino.h"

Next, we begin our function implementation:

// Toggles the LED every inference, and keeps it on for 3 seconds if a "yes"
// was heard
void RespondToCommand(tflite::ErrorReporter* error_reporter,
                      int32_t current_time, const char* found_command,
                      uint8_t score, bool is_new_command) {

Our next step is to place the built-in LED’s pin into output mode so that we can switch it on and off. We do this inside an if statement that runs only once, thanks to a static bool called is_initialized. Remember, static variables preserve their state between function calls:

static bool is_initialized = false;
if (!is_initialized) {
  pinMode(LED_BUILTIN, OUTPUT);
  is_initialized = true;
}

Next, we set up another couple of static variables to keep track of the last time a “yes” was detected, and the number of inferences that have been performed:

static int32_t last_yes_time = 0;
static int count = 0;

Now comes the fun stuff. If the is_new_command argument is true, we know we’ve heard something, so we log it with the ErrorReporter instance. But if it’s a “yes” we heard—which we determine by checking the first character of the found_command character array—we store the current time and switch on the LED:

if (is_new_command) {
  error_reporter->Report("Heard %s (%d) @%dms", found_command, score,
                         current_time);
  // If we heard a "yes", switch on an LED and store the time.
  if (found_command[0] == 'y') {
    last_yes_time = current_time;
    digitalWrite(LED_BUILTIN, HIGH);
  }
}

Next, we implement the behavior that switches off the LED after a few seconds—three, to be precise:

// If last_yes_time is non-zero but was >3 seconds ago, zero it
// and switch off the LED.
if (last_yes_time != 0) {
  if (last_yes_time < (current_time - 3000)) {
    last_yes_time = 0;
    digitalWrite(LED_BUILTIN, LOW);
  }
  // If it is non-zero but <3 seconds ago, do nothing.
  return;
}

When the LED is switched off, we also set last_yes_time to 0, so we won’t enter this if statement until the next time a “yes” is heard. The return statement is important: it’s what prevents any further output code from running if we recently heard a “yes,” so the LED stays solidly lit.

So far, our implementation will switch on the LED for around three seconds when a “yes” is heard. The next part will toggle the LED on and off with each inference—except for while we’re in “yes” mode, when we’re prevented from reaching this point by the aforementioned return statement.

Here’s the final chunk of code:

// Otherwise, toggle the LED every time an inference is performed.
++count;
if (count & 1) {
  digitalWrite(LED_BUILTIN, HIGH);
} else {
  digitalWrite(LED_BUILTIN, LOW);
}

By incrementing the count variable for each inference, we keep track of the total number of inferences that we’ve performed. Inside the if conditional, we use the & operator to do a binary AND operation with the count variable and the number 1.

By performing an AND on count with 1, we mask out all of count’s bits except the lowest. If the lowest bit is a 0, meaning count is an even number, the result will be 0. In a C++ if statement, this evaluates to false.

Otherwise, the result will be 1, indicating an odd number. Because a 1 evaluates to true, our LED will switch on for odd values of count and off for even values. This is what makes it toggle.
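
If the parity trick isn’t obvious, this tiny snippet (meant for your development machine, not the board) prints the result of count & 1 for the first few values, showing the alternating on/off pattern:

#include <cstdio>

int main() {
  // Odd counts give 1 (LED on); even counts give 0 (LED off).
  for (int count = 1; count <= 4; ++count) {
    std::printf("count=%d  count & 1 = %d  -> LED %s\n", count, count & 1,
                (count & 1) ? "on" : "off");
  }
  return 0;
}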

And that’s it! We’ve now implemented our command responder for Arduino. Let’s get it running so that we can see it in action.

Running the example

To deploy this example, here’s what we’ll need:

  • An Arduino Nano 33 BLE Sense board

  • A micro-USB cable

  • The Arduino IDE

Tip

There’s always a chance that the build process might have changed since this book was written, so check README.md for the latest instructions.

The projects in this book are available as example code in the TensorFlow Lite Arduino library. If you haven’t already installed the library, open the Arduino IDE and select Manage Libraries from the Tools menu. In the window that appears, search for and install the library named Arduino_TensorFlowLite. You should be able to use the latest version, but if you run into issues, the version that was tested with this book is 1.14-ALPHA.

Note

You can also install the library from a .zip file, which you can either download from the TensorFlow Lite team or generate yourself using the TensorFlow Lite for Microcontrollers Makefile. If you’d prefer to do the latter, see Appendix A.

After you’ve installed the library, the micro_speech example will show up in the File menu under Examples→Arduino_TensorFlowLite, as shown in Figure 7-8.

Click “micro_speech” to load the example. It will appear as a new window, with a tab for each of the source files. The file in the first tab, micro_speech, is equivalent to the main_functions.cc we walked through earlier.

Screenshot of the 'Examples' menu
Figure 7-8. The Examples menu
Note

“Running the Example” already explained the structure of the Arduino example, so we won’t cover it again here.

To run the example, plug in your Arduino device via USB. Make sure the correct device type is selected from the Board drop-down list in the Tools menu, as shown in Figure 7-9.

Screenshot of the 'Board' dropdown
Figure 7-9. The Board drop-down list

If your device’s name doesn’t appear in the list, you’ll need to install its support package. To do this, click Boards Manager. In the window that appears, search for your device, and then install the latest version of the corresponding support package. Next, make sure the device’s port is selected in the Port drop-down list, also in the Tools menu, as demonstrated in Figure 7-10.

Screenshot of the 'Port' dropdown
Figure 7-10. The Port drop-down list

Finally, in the Arduino window, click the upload button (highlighted in white in Figure 7-11) to compile and upload the code to your Arduino device.

Screenshot of the upload button, which has an arrow icon
Figure 7-11. The upload button, a right-facing arrow

After the upload has successfully completed, you should see the LED on your Arduino board begin to flash.

To test the program, try saying “yes.” When it detects a “yes,” the LED will remain lit solidly for around three seconds.

Tip

If you can’t get the program to recognize your “yes,” try saying it a few times in a row.

You can also see the results of inference via the Arduino Serial Monitor. To do this, open the Serial Monitor from the Tools menu. Now, try saying “yes,” “no,” and other words. You should see something like Figure 7-12.

Screenshot of the Arduino IDE's Serial Monitor
Figure 7-12. The Serial Monitor displaying some matches
Note

The model we’re using is small and imperfect, and you’ll probably notice that it’s better at detecting “yes” than “no.” This is an example of how optimizing for a tiny model size can result in issues with accuracy. We cover this topic in Chapter 8.

Making your own changes

Now that you’ve deployed the application, try playing around with the code! You can edit the source files in the Arduino IDE. When you save, you’ll be prompted to re-save the example in a new location. After you’ve made your changes, you can click the upload button in the Arduino IDE to build and deploy.

Here are a few ideas you could try:

  • Switch the example to light the LED when “no” is spoken instead of “yes” (see the sketch after this list).

  • Make the application respond to a specific sequence of “yes” and “no” commands, like a secret code phrase.

  • Use the “yes” and “no” commands to control other components, like additional LEDs or servos.
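
As a starting point, here’s a rough sketch of the first idea. It isn’t the stock example code, but it uses only names that already appear in arduino/command_responder.cc: we check for 'n' (the only label that begins with that letter) instead of 'y', and reuse the existing timer variable so the three-second switch-off logic still applies:

if (is_new_command) {
  error_reporter->Report("Heard %s (%d) @%dms", found_command, score,
                         current_time);
  // Light the LED for "no" instead of "yes".
  if (found_command[0] == 'n') {
    last_yes_time = current_time;     // reuse the existing timestamp variable
    digitalWrite(LED_BUILTIN, HIGH);  // switched off ~3 seconds later, as before
  }
}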

SparkFun Edge

The SparkFun Edge has both a microphone and a row of four colored LEDs—red, blue, green, and yellow—which will make displaying results easy. Figure 7-13 shows the SparkFun Edge with its LEDs highlighted.

Photo of the SparkFun Edge highlighting its four LEDs
Figure 7-13. The SparkFun Edge’s four LEDs

Responding to commands on SparkFun Edge

To make it clear that our program is running, let’s toggle the blue LED on and off with each inference. We’ll switch on the yellow LED when a “yes” is heard, the red LED when a “no” is heard, and the green LED when an unknown command is heard.

The command responder for SparkFun Edge is implemented in sparkfun_edge/command_responder.cc. The file begins with some includes:

#include "tensorflow/lite/micro/examples/micro_speech/command_responder.h"
#include "am_bsp.h"

The command_responder.h include is this file’s corresponding header. am_bsp.h is part of the Ambiq Apollo3 SDK that you saw in the last chapter; it gives us access to board-specific constants and functions, including those for the LEDs.

Inside the function definition, the first thing we do is set up the pins connected to the LEDs as outputs:

// This implementation will light up the LEDs on the board in response to
// different commands.
void RespondToCommand(tflite::ErrorReporter* error_reporter,
                      int32_t current_time, const char* found_command,
                      uint8_t score, bool is_new_command) {
  static bool is_initialized = false;
  if (!is_initialized) {
    am_hal_gpio_pinconfig(AM_BSP_GPIO_LED_RED, g_AM_HAL_GPIO_OUTPUT_12);
    am_hal_gpio_pinconfig(AM_BSP_GPIO_LED_BLUE, g_AM_HAL_GPIO_OUTPUT_12);
    am_hal_gpio_pinconfig(AM_BSP_GPIO_LED_GREEN, g_AM_HAL_GPIO_OUTPUT_12);
    am_hal_gpio_pinconfig(AM_BSP_GPIO_LED_YELLOW, g_AM_HAL_GPIO_OUTPUT_12);
    is_initialized = true;
  }

We call the am_hal_gpio_pinconfig() function from the Apollo3 SDK to set all four LED pins to output mode, represented by the constant g_AM_HAL_GPIO_OUTPUT_12. We use the is_initialized static variable to ensure that we do this only once!

Next comes the code that will toggle the blue LED on and off. We do this using a count variable, in the same way as in the Arduino implementation:

static int count = 0;
// Toggle the blue LED every time an inference is performed.
++count;
if (count & 1) {
  am_hal_gpio_output_set(AM_BSP_GPIO_LED_BLUE);
} else {
  am_hal_gpio_output_clear(AM_BSP_GPIO_LED_BLUE);
}

This code uses the am_hal_gpio_output_set() and am_hal_gpio_output_clear() functions to switch the blue LED’s pin either on or off.

By incrementing the count variable at each inference, we keep track of the total number of inferences we’ve performed. Inside the if conditional, we use the & operator to do a binary AND operation with the count variable and the number 1.

By performing an AND on count with 1, we mask out all of count’s bits except the lowest. If the lowest bit is a 0, meaning count is an even number, the result will be 0. In a C++ if statement, this evaluates to false.

Otherwise, the result will be 1, indicating an odd number. Because a 1 evaluates to true, the blue LED will switch on for odd values of count and off for even values. This is what makes it toggle.

Next, we light the appropriate LED depending on which word was just heard. By default, we clear all of the LEDs, so if a word was not recently heard the LEDs will all be unlit:

am_hal_gpio_output_clear(AM_BSP_GPIO_LED_RED);
am_hal_gpio_output_clear(AM_BSP_GPIO_LED_YELLOW);
am_hal_gpio_output_clear(AM_BSP_GPIO_LED_GREEN);

We then use some simple if statements to switch on the appropriate LED depending on which command was heard:

if (is_new_command) {
  error_reporter->Report("Heard %s (%d) @%dms", found_command, score,
                         current_time);
  if (found_command[0] == 'y') {
    am_hal_gpio_output_set(AM_BSP_GPIO_LED_YELLOW);
  }
  if (found_command[0] == 'n') {
    am_hal_gpio_output_set(AM_BSP_GPIO_LED_RED);
  }
  if (found_command[0] == 'u') {
    am_hal_gpio_output_set(AM_BSP_GPIO_LED_GREEN);
  }
}

As we saw earlier, is_new_command is true only if RespondToCommand() was called with a genuinely new command, so if a new command wasn’t heard the LEDs will remain off. Otherwise, we use the am_hal_gpio_output_set() function to switch on the appropriate LED.

Running the example

We’ve now walked through how our example code lights up LEDs on the SparkFun Edge. Next, let’s get the example up and running.

Tip

There’s always a chance that the build process might have changed since this book was written, so check README.md for the latest instructions.

To build and deploy our code, we’ll need the following:

  • A SparkFun Edge board

  • A USB programmer (we recommend the SparkFun Serial Basic Breakout, which is available in micro-B USB and USB-C variants)

  • A matching USB cable

  • Python 3 and some dependencies

Note

Chapter 6 shows how to confirm whether you have the correct version of Python installed. If you already did this, great. If not, it’s worth flipping back to “Running the Example” to take a look.

In your terminal, clone the TensorFlow repository and then change into its directory:

git clone https://github.com/tensorflow/tensorflow.git
cd tensorflow

Next, we’re going to build the binary and run some commands that get it ready for downloading to the device. To avoid some typing, you can copy and paste these commands from README.md.

Build the binary

The following command downloads all of the required dependencies and then compiles a binary for the SparkFun Edge:

make -f tensorflow/lite/micro/tools/make/Makefile \
  TARGET=sparkfun_edge TAGS=cmsis-nn micro_speech_bin

The binary is created as a .bin file, in the following location:

tensorflow/lite/micro/tools/make/gen/sparkfun_edge_cortex-m4/bin/micro_speech.bin

To check whether the file exists, you can use the following command:

test -f tensorflow/lite/micro/tools/make/gen/sparkfun_edge_cortex-m4/bin/micro_speech.bin \
  && echo "Binary was successfully created" || echo "Binary is missing"

If you run that command, you should see Binary was successfully created printed to the console. If you see Binary is missing, there was a problem with the build process. If so, it’s likely that there are some clues to what went wrong in the output of the make command.

Sign the binary

The binary must be signed with cryptographic keys to be deployed to the device. Let’s now run some commands that will sign the binary so it can be flashed to the SparkFun Edge. The scripts used here come from the Ambiq SDK, which is downloaded when the Makefile is run.

Enter the following command to set up some dummy cryptographic keys that you can use for development:

cp tensorflow/lite/micro/tools/make/downloads/AmbiqSuite-Rel2.0.0/tools/apollo3_scripts/keys_info0.py \
  tensorflow/lite/micro/tools/make/downloads/AmbiqSuite-Rel2.0.0/tools/apollo3_scripts/keys_info.py

Next, run the following command to create a signed binary. Substitute python3 with python if necessary:

python3 tensorflow/lite/micro/tools/make/downloads/AmbiqSuite-Rel2.0.0/tools/apollo3_scripts/create_cust_image_blob.py \
  --bin tensorflow/lite/micro/tools/make/gen/sparkfun_edge_cortex-m4/bin/micro_speech.bin \
  --load-address 0xC000 \
  --magic-num 0xCB -o main_nonsecure_ota \
  --version 0x0

This creates the file main_nonsecure_ota.bin. Now run this command to create a final version of the file that can be used to flash your device with the script you will use in the next step:

python3 tensorflow/lite/micro/tools/make/downloads/AmbiqSuite-Rel2.0.0/tools/apollo3_scripts/create_cust_wireupdate_blob.py \
  --load-address 0x20000 \
  --bin main_nonsecure_ota.bin \
  -i 6 -o main_nonsecure_wire \
  --options 0x1

You should now have a file called main_nonsecure_wire.bin in the directory where you ran the commands. This is the file you’ll be flashing to the device.

Flash the binary

The SparkFun Edge stores the program it is currently running in its 1 megabyte of flash memory. If you want the board to run a new program, you need to send it to the board, which will store it in flash memory, overwriting any program that was previously saved.

Attach the programmer to the board

To download new programs to the board, you’ll use the SparkFun USB-C Serial Basic serial programmer. This device allows your computer to communicate with the microcontroller via USB.

To attach this device to your board, perform the following steps:

  1. On the side of the SparkFun Edge, locate the six-pin header.

  2. Plug the SparkFun USB-C Serial Basic into these pins, ensuring the pins labeled BLK and GRN on each device are lined up correctly, as illustrated in Figure 7-14.

A photo showing how the SparkFun Edge and USB-C Serial Basic should be connected
Figure 7-14. Connecting the SparkFun Edge and USB-C Serial Basic (courtesy of SparkFun)

Attach the programmer to your computer

You connect the board to your computer via USB. To program the board, you need to find out the name that your computer gives the device. The best way of doing this is to list all the computer’s devices before and after attaching it, and look to see which device is new.

Warning

Some people have reported issues with their operating system’s default drivers for the programmer, so we strongly recommend installing the driver before you continue.

Before attaching the device via USB, run the following command:

# macOS:
ls /dev/cu*

# Linux:
ls /dev/tty*

This should output a list of attached devices that looks something like the following:

/dev/cu.Bluetooth-Incoming-Port
/dev/cu.MALS
/dev/cu.SOC

Now, connect the programmer to your computer’s USB port and run the command again:

# macOS:
ls /dev/cu*

# Linux:
ls /dev/tty*

You should see an extra item in the output, as shown in the example that follows. Your new item might have a different name. This new item is the name of the device:

/dev/cu.Bluetooth-Incoming-Port
/dev/cu.MALS
/dev/cu.SOC
/dev/cu.wchusbserial-1450

This name will be used to refer to the device. However, it can change depending on which USB port the programmer is attached to, so if you disconnect the board from your computer and then reattach it, you might need to look up its name again.

Tip

Some users have reported two devices appearing in the list. If you see two devices, the correct one to use begins with the letters “wch”; for example, “/dev/wchusbserial-14410.”

After you’ve identified the device name, put it in a shell variable for later use:

export DEVICENAME=<your device name here>

This is a variable that you can use when running commands that require the device name, later in the process.

Run the script to flash your board

To flash the board, you must put it into a special “bootloader” state that prepares it to receive the new binary. You’ll then run a script to send the binary to the board.

First create an environment variable to specify the baud rate, which is the speed at which data will be sent to the device:

export BAUD_RATE=921600

Now paste the command that follows into your terminal—but do not press Enter yet! The ${DEVICENAME} and ${BAUD_RATE} in the command will be replaced with the values you set in the previous sections. Remember to substitute python3 with python if necessary:

python3 tensorflow/lite/micro/tools/make/downloads/AmbiqSuite-Rel2.0.0/tools/apollo3_scripts/uart_wired_update.py \
  -b ${BAUD_RATE} ${DEVICENAME} \
  -r 1 -f main_nonsecure_wire.bin \
  -i 6

Next, you’ll reset the board into its bootloader state and flash the board. On the board, locate the buttons marked RST and 14, as shown in Figure 7-15. Perform the following steps:

  1. Ensure that your board is connected to the programmer and the entire thing is connected to your computer via USB.

  2. On the board, press and hold the button marked 14. Continue holding it.

  3. While still holding the button marked 14, press the button marked RST to reset the board.

  4. Press Enter on your computer to run the script. Continue holding button 14.

You should now see something like the following appearing on your screen:

Connecting with Corvette over serial port /dev/cu.usbserial-1440...
Sending Hello.
Received response for Hello
Received Status
length =  0x58
version =  0x3
Max Storage =  0x4ffa0
Status =  0x2
State =  0x7
AMInfo =
0x1
0xff2da3ff
0x55fff
0x1
0x49f40003
0xffffffff
[...lots more 0xffffffff...]
Sending OTA Descriptor =  0xfe000
Sending Update Command.
number of updates needed =  1
Sending block of size  0x158b0  from  0x0  to  0x158b0
Sending Data Packet of length  8180
Sending Data Packet of length  8180
[...lots more Sending Data Packet of length  8180...]
A photo showing the SparkFun Edge's buttons
Figure 7-15. The SparkFun Edge’s buttons

Keep holding button 14 until you see Sending Data Packet of length 8180. You can release the button after seeing this (but it’s okay if you keep holding it). The program will continue to print lines on the terminal. Eventually, you’ll see something like the following:

[...lots more Sending Data Packet of length  8180...]
Sending Data Packet of length  8180
Sending Data Packet of length  6440
Sending Reset Command.
Done.

This indicates a successful flashing.

Tip

If the program output ends with an error, check whether Sending Reset Command. was printed. If so, flashing was likely successful despite the error. Otherwise, flashing might have failed. Try running through these steps again (you can skip over setting the environment variables).

Testing the program

To make sure the program is running, press the RST button. You should now see the blue LED flashing.

To test the program, try saying “yes.” When it detects a “yes,” the yellow LED will flash. The model is also trained to recognize “no” and unknown words: the red LED should flash for “no,” and the green LED for unknown.

If you can’t get the program to recognize your “yes,” try saying it a few times in a row: “yes, yes, yes.”

The model we’re using is small and imperfect, and you’ll probably notice that it’s better at detecting “yes” than “no,” which it often recognizes as “unknown.” This is an example of how optimizing for a tiny model size can result in issues with accuracy. We cover this topic in Chapter 8.

Viewing debug data

The program will also log successful recognitions to the serial port. To view this data, we can monitor the board’s serial port output using a baud rate of 115200. On macOS and Linux, the following command should work:

screen ${DEVICENAME} 115200

You should initially see output that looks something like the following:

Apollo3 Burst Mode is Available

                               Apollo3 operating in Burst Mode (96MHz)

Try issuing some commands by saying “yes” or “no.” You should see the board printing debug information for each command:

Heard yes (202) @65536ms

To stop viewing the debug output with screen, press Ctrl-A immediately followed by the K key, and then press the Y key.

Making your own changes

Now that you’ve deployed the basic application, try playing around and making some changes. You can find the application’s code in the tensorflow/lite/micro/examples/micro_speech folder. Just edit and save and then repeat the preceding instructions to deploy your modified code to the device.

Here are a few things that you could try:

  • RespondToCommand()’s score argument shows the prediction score. Use the LEDs as a meter to show the strength of the match (see the sketch after this list).

  • Make the application respond to a specific sequence of “yes” and “no” commands, like a secret code phrase.

  • Use the “yes” and “no” commands to control other components, like additional LEDs or servos.
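
As a hedged starting point for the first idea, here’s a helper you could add to sparkfun_edge/command_responder.cc. It uses only the am_hal_gpio_output_set() and am_hal_gpio_output_clear() calls from the walkthrough; the thresholds are arbitrary, and ShowScoreMeter() is a hypothetical name for a function you’d call from RespondToCommand() (after the pin setup has run), passing in the score argument:

// Treat the four LEDs as a crude meter: light more of them as the score rises.
// The score is a uint8_t, so 255 is the maximum possible value.
void ShowScoreMeter(uint8_t score) {
  am_hal_gpio_output_clear(AM_BSP_GPIO_LED_RED);
  am_hal_gpio_output_clear(AM_BSP_GPIO_LED_BLUE);
  am_hal_gpio_output_clear(AM_BSP_GPIO_LED_GREEN);
  am_hal_gpio_output_clear(AM_BSP_GPIO_LED_YELLOW);
  if (score > 100) am_hal_gpio_output_set(AM_BSP_GPIO_LED_RED);
  if (score > 150) am_hal_gpio_output_set(AM_BSP_GPIO_LED_BLUE);
  if (score > 200) am_hal_gpio_output_set(AM_BSP_GPIO_LED_GREEN);
  if (score > 230) am_hal_gpio_output_set(AM_BSP_GPIO_LED_YELLOW);
}

Note that this repurposes the blue LED, so you’d probably want to drop the toggle-on-every-inference behavior if you use it.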

ST Microelectronics STM32F746G Discovery Kit

Because the STM32F746G comes with a fancy LCD display, we can use this to show off whichever wake words are detected, as depicted in Figure 7-16.

STM32F746G displaying a 'no'
Figure 7-16. STM32F746G displaying a “no”

Responding to commands on STM32F746G

The STM32F746G’s LCD driver gives us methods that we can use to write text to the display. In this exercise, we’ll use these to show one of the following messages, depending on which command was heard:

  • “Heard yes!”

  • “Heard no :(”

  • “Heard unknown”

  • “Heard silence”

We’ll also set the background color differently depending on which command was heard.

To begin, we include some header files:

#include "tensorflow/lite/micro/examples/micro_speech/command_responder.h"
#include "LCD_DISCO_F746NG.h"

The first, command_responder.h, just declares the interface for this file. The second, LCD_DISCO_F746NG.h, gives us an interface to control the device’s LCD display. You can read more about it on the Mbed site.

Next, we instantiate an LCD_DISCO_F746NG object, which holds the methods we use to control the LCD:

LCD_DISCO_F746NG lcd;

In the next few lines, the RespondToCommand() function is declared, and we check whether it has been called with a new command:

// When a command is detected, write it to the display and log it to the
// serial port.
void RespondToCommand(tflite::ErrorReporter *error_reporter,
                      int32_t current_time, const char *found_command,
                      uint8_t score, bool is_new_command) {
  if (is_new_command) {
    error_reporter->Report("Heard %s (%d) @%dms", found_command, score,
                           current_time);

When we know this is a new command, we use the error_reporter to log it to the serial port.

Next, we use a big if statement to determine what happens when each command is found. First comes “yes”:

if (*found_command == 'y') {
  lcd.Clear(0xFF0F9D58);
  lcd.DisplayStringAt(0, LINE(5), (uint8_t *)"Heard yes!", CENTER_MODE);

We use lcd.Clear() to both clear any previous content from the screen and set a new background color, like a fresh coat of paint. The color 0xFF0F9D58 is a 32-bit ARGB value: the leading 0xFF byte means fully opaque, and the remaining bytes describe a nice, rich green.

On our green background, we use lcd.DisplayStringAt() to draw some text. The first argument specifies an x coordinate, the second specifies a y. To position our text roughly in the middle of the display, we use a helper function, LINE(), to determine the y coordinate that would correspond to the fifth line of text on the screen.

The third argument is the string of text we’ll be displaying, and the fourth argument determines the alignment of the text; here, we use the constant CENTER_MODE to specify that the text is center-aligned.

We continue the if statement to cover the remaining three possibilities, “no,” “unknown,” and “silence” (which is captured by the else block):

} else if (*found_command == 'n') {
  lcd.Clear(0xFFDB4437);
  lcd.DisplayStringAt(0, LINE(5), (uint8_t *)"Heard no :(", CENTER_MODE);
} else if (*found_command == 'u') {
  lcd.Clear(0xFFF4B400);
  lcd.DisplayStringAt(0, LINE(5), (uint8_t *)"Heard unknown", CENTER_MODE);
} else {
  lcd.Clear(0xFF4285F4);
  lcd.DisplayStringAt(0, LINE(5), (uint8_t *)"Heard silence", CENTER_MODE);
}

And that’s it! Because the LCD library gives us such easy high-level control over the display, it doesn’t take much code to output our results. Let’s deploy the example to see this all in action.

Running the example

Now we can use the Mbed toolchain to deploy our application to the device.

Tip

There’s always a chance that the build process might have changed since this book was written, so check README.md for the latest instructions.

Before we begin, we’ll need the following:

  • An STM32F746G Discovery kit board

  • A mini-USB cable

  • The Arm Mbed CLI (follow the Mbed setup guide)

  • Python 3 and pip

Like the Arduino IDE, Mbed requires source files to be structured in a certain way. The TensorFlow Lite for Microcontrollers Makefile knows how to do this for us and can generate a directory suitable for Mbed.

To do so, run the following command:

make -f tensorflow/lite/micro/tools/make/Makefile \
  TARGET=mbed TAGS="cmsis-nn disco_f746ng" generate_micro_speech_mbed_project

This results in the creation of a new directory:

tensorflow/lite/micro/tools/make/gen/mbed_cortex-m4/prj/micro_speech/mbed

This directory contains all of the example’s dependencies structured in the correct way for Mbed to be able to build it.

First, change into the directory so that you can run some commands within it:

cd tensorflow/lite/micro/tools/make/gen/mbed_cortex-m4/prj/micro_speech/mbed

Next, you’ll use Mbed to download the dependencies and build the project.

To begin, use the following command to inform Mbed that the current directory is the root of an Mbed project:

mbed config root .

Next, instruct Mbed to download the dependencies and prepare to build:

mbed deploy

By default, Mbed builds the project using C++98. However, TensorFlow Lite requires C++11. Run the following Python snippet to modify the Mbed configuration files so that it uses C++11. You can just type or paste it into the command line:

python -c 'import fileinput, glob;
for filename in glob.glob("mbed-os/tools/profiles/*.json"):
  for line in fileinput.input(filename, inplace=True):
    print(line.replace("\"-std=gnu++98\"","\"-std=c++11\", \"-fpermissive\""))'

Finally, run the following command to compile:

mbed compile -m DISCO_F746NG -t GCC_ARM

This should result in a binary at the following path:

./BUILD/DISCO_F746NG/GCC_ARM/mbed.bin

One of the nice things about the STM32F746G board is that deployment is really easy. To deploy, just plug in your STM board and copy the file to it. On macOS, you can do this by using the following command:

cp ./BUILD/DISCO_F746NG/GCC_ARM/mbed.bin /Volumes/DIS_F746NG/

Alternatively, just find the DIS_F746NG volume in your file browser and drag the file over.

Copying the file initiates the flashing process.

Testing the program

When this is complete, try saying “yes.” You should see the appropriate text appear on the display and the background color change.

If you can’t get the program to recognize your “yes,” try saying it a few times in a row, like “yes, yes, yes.”

The model we’re using is small and imperfect, and you’ll probably notice that it’s better at detecting “yes” than “no,” which it often recognizes as “unknown.” This is an example of how optimizing for a tiny model size can result in issues with accuracy. We cover this topic in Chapter 8.

Viewing debug data

The program also logs successful recognitions to the serial port. To view the output, establish a serial connection to the board using a baud rate of 9600.

On macOS and Linux, the device should be listed when you issue the following command:

ls /dev/tty*

It will look something like the following:

/dev/tty.usbmodem1454203

After you’ve identified the device, use the following command to connect to it, replacing </dev/tty.devicename> with the name of your device as it appears in /dev:

screen /dev/<tty.devicename> 9600

Try issuing some commands by saying “yes” or “no.” You should see the board printing debug information for each command:

Heard yes (202) @65536ms

To stop viewing the debug output with screen, press Ctrl-A, immediately followed by the K key, and then press the Y key.

Note

If you’re not sure how to make a serial connection on your platform, you could try CoolTerm, which works on Windows, macOS, and Linux. The board should show up in CoolTerm’s Port drop-down list. Make sure you set the baud rate to 9600.

Making your own changes

Now that you’ve deployed the application, it could be fun to play around and make some changes. You can find the application’s code in the tensorflow/lite/micro/tools/make/gen/mbed_cortex-m4/prj/micro_speech/mbed folder. Just edit and save and then repeat the preceding instructions to deploy your modified code to the device.

Here are a few things you could try:

  • RespondToCommand()’s score argument shows the prediction score. Create a visual indicator of the score on the LCD display (see the sketch after this list).

  • Make the application respond to a specific sequence of “yes” and “no” commands, like a secret code phrase.

  • Use the “yes” and “no” commands to control other components, like additional LEDs or servos.
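
For the first idea, here’s a hedged sketch: a helper that draws the numeric score one line below the “Heard ...” message, using only lcd.DisplayStringAt(), LINE(), and CENTER_MODE from the walkthrough plus the standard snprintf() function. ShowScore() is a hypothetical name for something you’d call from inside the is_new_command block, passing the existing lcd object and the score argument:

#include <cstdio>

// Draw the score as text below the main message. LINE(7) places it two lines
// under the LINE(5) used for the "Heard ..." strings.
void ShowScore(LCD_DISCO_F746NG &lcd, uint8_t score) {
  char score_text[32];
  std::snprintf(score_text, sizeof(score_text), "score: %d / 255", score);
  lcd.DisplayStringAt(0, LINE(7), (uint8_t *)score_text, CENTER_MODE);
}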

Wrapping Up

The application code we’ve walked through has been mostly concerned with capturing data from the hardware and then extracting features that are suitable for inference. The part that actually feeds data into the model and runs inference is relatively small, and it’s very similar to the example covered in Chapter 6.

This is fairly typical of machine learning projects. The model is already trained, thus our job is just to keep it fed with the appropriate sort of data. As an embedded developer working with TensorFlow Lite, you’ll be spending most of your programming time on capturing sensor data, processing it into features, and responding to the output of your model. The inference part itself is quick and easy.

But the embedded application is only part of the package—the really fun part is the model. In Chapter 8, you’ll learn how to train your own speech model to listen for different words. You’ll also learn more about how it works.
