TinyML might be a new phenomenon, but its most widespread application is perhaps already at work in your home, in your car, or even in your pocket. Can you guess what it is?
The past few years have seen the rise of digital assistants. These products provide a voice user interface (UI) designed to give instant access to information without the need for a screen or keyboard. Between Google Assistant, Apple’s Siri, and Amazon Alexa, these digital assistants are nearly ubiquitous. Some variant is built into almost every mobile phone, from flagship models to voice-first devices designed for emerging markets. They’re also in smart speakers, computers, and vehicles.
In most cases, the heavy lifting of speech recognition, natural language processing, and generating responses to users’ queries is done in the cloud, on powerful servers running large ML models. When a user asks a question, it’s sent to the server as a stream of audio. The server figures out what it means, looks up any required information, and sends the appropriate response back.
But part of these assistants’ appeal is that they’re always on, ready to help you out. By saying “Hey Google,” or “Alexa,” you can wake up your assistant and tell it what you need without ever having to press a button. This means they must be listening for your voice 24/7, whether you’re sitting in your living room, driving down the freeway, or in the great outdoors with a phone in your hand.
Although it’s easy to do speech recognition on a server, it’s just not feasible to send a constant stream of audio from a device to a data center. From a privacy perspective, sending every second of audio captured to a remote server would be an absolute disaster. Even if that were somehow okay, it would require vast amounts of bandwidth and chew through mobile data plans in hours. In addition, network communication uses energy, and sending a constant stream of data would quickly drain the device’s battery. What’s more, with every request going to a server and back, the assistant would feel laggy and slow to respond.
The only audio the assistant really needs is what immediately follows the wake word (e.g., “Hey Google”). What if we could detect that word without sending data, but start streaming when we heard it? We’d protect user privacy, save battery life and bandwidth, and wake up the assistant without waiting for the network.
And this is where TinyML comes in. We can train a tiny model that listens for a wake word, and run it on a low-powered chip. If we embed this in a phone, it can listen for wake words all the time. When it hears the magic word, it informs the phone’s operating system (OS), which can begin to capture audio and send it to the server.
Wake-word detection is the perfect application for TinyML. It’s ideally suited to delivering privacy, efficiency, speed, and offline inference. This approach, in which a tiny, efficient model “wakes up” a larger, more resource-hungry model, is called cascading.
In this chapter, we examine how we can use a pretrained speech detection model to provide always-on wake-word detection using a tiny microcontroller. In Chapter 8, we’ll explore how the model is trained, and how to create our own.
We’re going to build an embedded application that uses an 18 KB model, trained on a dataset of speech commands, to classify spoken audio. The model is trained to recognize the words “yes” and “no,” and is also capable of distinguishing between unknown words and silence or background noise.
Our application will listen to its surroundings with a microphone and indicate when it has detected a word by lighting an LED or displaying data on a screen, depending on the capabilities of the device. Understanding this code will give you the ability to control any electronics project with voice commands.
As with Chapter 5, the source code for this application is available in the TensorFlow GitHub repository.
We’ll follow a similar pattern to Chapter 5, walking through the tests, then the application code, followed by the logic that makes the sample work on various devices.
We provide instructions for deploying the application to the following devices:
TensorFlow Lite regularly adds support for new devices, so if the device you’d like to use isn’t listed here, check the example’s README.md. You can also check there for updated deployment instructions if you run into trouble following these steps.
This is a significantly more complex application than the “hello world” example, so let’s begin by walking through its structure.
Over the previous few chapters, you’ve learned that a machine learning application does the following sequence of things:
Obtains an input
Preprocesses the input to extract features suitable to feed into a model
Runs inference on the processed input
Postprocesses the model’s output to make sense of it
Uses the resulting information to make things happen
The “hello world” example followed these steps in a very straightforward manner. It took a single floating-point number as input, generated by a simple counter. Its output was another floating-point number that we used directly to control visual output.
Our wake-word application will be more complicated for the following reasons:
It takes audio data as an input. As you’ll see, this requires heavy processing before it can be fed into a model.
Its model is a classifier, outputting class probabilities. We’ll need to parse and make sense of this output.
It’s designed to perform inference continually, on live data. We’ll need to write code to make sense of a stream of inferences.
The model is larger and more complex. We’ll be pushing our hardware to the limits of its capabilities.
Because much of this complexity results from the model we’ll be using, let’s learn a little more about it.
As we mentioned earlier, the model we use in this chapter is trained to recognize the words “yes” and “no,” and is also capable of distinguishing between unknown words and silence or background noise.
The model was trained on a dataset called the Speech Commands dataset. This consists of 65,000 one-second-long utterances of 30 short words, crowdsourced online.
Although the dataset contains 30 different words, the model was trained to distinguish between only four categories: the words “yes” and “no,” “unknown” words (meaning the other 28 words in the dataset), and silence.
The model takes in one second’s worth of data at a time. It outputs four probability scores, one for each of these four classes, predicting how likely it is that the data represented one of them.
However, the model doesn’t take in raw audio sample data. Instead, it works with spectrograms, which are two-dimensional arrays that are made up of slices of frequency information, each taken from a different time window.
Figure 7-1 is a visual representation of a spectrogram generated from a one-second audio clip of someone saying “yes.” Figure 7-2 shows the same thing for the word “no.”
By isolating the frequency information during preprocessing, we make the model’s life easier. During training, it doesn’t need to learn how to interpret raw audio data; instead, it gets to work with a higher-layer abstraction that distills the most useful information.
We’ll look at how the spectrogram is generated later in this chapter. For now, we just need to know that the model takes a spectrogram as input. Because a spectrogram is a two-dimensional array, we feed it into the model as a 2D tensor.
There’s a type of neural network architecture that is specifically designed to work well with multidimensional tensors in which information is contained in the relationships between groups of adjacent values. It’s called a convolutional neural network (CNN).
The most common example of this type of data is images, for which a group of adjacent pixels might represent a shape, pattern, or texture. During training, a CNN is able to identify these features and learn what they represent.
It can learn how simple image features (like lines or edges) fit together into more complex features (like an eye or an ear), and in turn how those features might be combined to form an input image, such as a photo of a human face. This means that a CNN can learn to distinguish between different classes of input image, such as between a photo of a person and a photo of a dog.
Although they’re often applied to images, which are 2D grids of pixels, CNNs can be used with any multidimensional vector input. It turns out they’re very well suited to working with spectrogram data.
In Chapter 8, we’ll look at how this model was trained. Until then, let’s get back to discussing the architecture of our application.
As mentioned earlier, our wake-word application is more complicated than the “hello world” example. Figure 7-3 shows the components that make it up.
Let’s investigate what each of these pieces does:
Like the “hello world” example, our application runs in a continuous loop. All of the subsequent processes are contained within it, and they execute continually, as fast as the microcontroller can run them, which is multiple times per second.
The audio provider captures raw audio data from the microphone. Because the methods for capturing audio vary from device to device, this component can be overridden and customized.
The feature provider converts raw audio data into the spectrogram format that our model requires. It does so on a rolling basis as part of the main loop, providing the interpreter with a sequence of overlapping one-second windows.
The interpreter runs the TensorFlow Lite model, transforming the input spectrogram into a set of probabilities.
The model is included as a data array and run by the interpreter. The array is located in tiny_conv_micro_features_model_data.cc.
Because inference is run multiple times per second, the RecognizeCommands class aggregates the results and determines whether, on average, a known word was heard.
If a command was heard, the command responder uses the device’s output capabilities to let the user know. Depending on the device, this could mean flashing an LED or showing data on an LCD display. It can be overridden for different device types.
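To make the flow between these components concrete, here is a minimal sketch of the loop. Every name in it is a hypothetical stand-in rather than the example's real code: FakeInference() replaces the audio provider, feature provider, and interpreter with canned scores, and the summing step is only a rough analogy for what RecognizeCommands does.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kClassCount = 4;  // silence, unknown, "yes", "no"

struct Scores {
  uint8_t values[kClassCount];
};

// Hypothetical stand-in for capture, feature generation, and inference.
// The first two iterations look like silence; later ones look like "yes".
Scores FakeInference(int iteration) {
  if (iteration < 2) {
    return Scores{{200, 20, 20, 15}};
  }
  return Scores{{10, 15, 220, 10}};
}

// Runs the loop `iterations` times, sums the scores for each class, and
// returns the index of the class with the highest total.
int RunLoopAndPickClass(int iterations) {
  assert(iterations > 0);
  int totals[kClassCount] = {0, 0, 0, 0};
  for (int i = 0; i < iterations; ++i) {
    const Scores scores = FakeInference(i);
    for (int c = 0; c < kClassCount; ++c) {
      totals[c] += scores.values[c];
    }
  }
  int best = 0;
  for (int c = 1; c < kClassCount; ++c) {
    if (totals[c] > totals[best]) best = c;
  }
  return best;
}
```

With only two iterations the silence class wins; after ten, the accumulated "yes" scores dominate. A real command responder would act on that result instead of returning it.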
The example’s files on GitHub contain tests for each of these components. We’ll walk through them next to learn how they work.
As in Chapter 5, we can use tests to learn how the application works. We’ve already covered a lot of C++ and TensorFlow Lite basics, so we won’t need to explain every single line. Instead, let’s focus on the most important parts of each test and explain what’s going on.
We’ll explore the following tests, which you can find in the GitHub repository:
Shows how to run inference on spectrogram data and interpret the results
Shows how to use the audio provider
Shows how to use the feature provider, using a mock (fake) implementation of the audio provider to pass in fake data
Shows how to interpret the model’s output to decide whether a command was found
Shows how to call the command responder to trigger an output
There are many more tests in the example, but exploring these few will give us an understanding of the key moving parts.
The test micro_speech_test.cc follows the same basic flow we’re familiar with from the “hello world” example: we load the model, set up the interpreter, and allocate tensors.
However, there’s a notable difference. In the “hello world” example, we used the AllOpsResolver to pull in all of the deep learning operations that might be necessary to run the model. This is a reliable approach, but it’s wasteful because a given model probably doesn’t use all of the dozens of available operations. When deployed to a device, these unnecessary operations will take up valuable memory, so it’s best if we include only those we need.
To do this, we first define the ops that our model will need, at the top of the test file:
namespace tflite {
namespace ops {
namespace micro {
TfLiteRegistration* Register_DEPTHWISE_CONV_2D();
TfLiteRegistration* Register_FULLY_CONNECTED();
TfLiteRegistration* Register_SOFTMAX();
}  // namespace micro
}  // namespace ops
}  // namespace tflite
Next, we set up logging and load our model, as normal:
// Set up logging.
tflite::MicroErrorReporter micro_error_reporter;
tflite::ErrorReporter* error_reporter = &micro_error_reporter;

// Map the model into a usable data structure. This doesn't involve any
// copying or parsing, it's a very lightweight operation.
const tflite::Model* model =
    ::tflite::GetModel(g_tiny_conv_micro_features_model_data);
if (model->version() != TFLITE_SCHEMA_VERSION) {
  error_reporter->Report(
      "Model provided is schema version %d not equal "
      "to supported version %d.\n",
      model->version(), TFLITE_SCHEMA_VERSION);
}
After our model is loaded, we declare a MicroMutableOpResolver and use its method AddBuiltin() to add the ops we listed earlier:
tflite::MicroMutableOpResolver micro_mutable_op_resolver;
micro_mutable_op_resolver.AddBuiltin(
    tflite::BuiltinOperator_DEPTHWISE_CONV_2D,
    tflite::ops::micro::Register_DEPTHWISE_CONV_2D());
micro_mutable_op_resolver.AddBuiltin(
    tflite::BuiltinOperator_FULLY_CONNECTED,
    tflite::ops::micro::Register_FULLY_CONNECTED());
micro_mutable_op_resolver.AddBuiltin(tflite::BuiltinOperator_SOFTMAX,
                                     tflite::ops::micro::Register_SOFTMAX());
You’re probably wondering how we know which ops to include for a given model. One way is to try running the model using a MicroMutableOpResolver, but without calling AddBuiltin() at all. Inference will fail, and the accompanying error messages will inform us which ops are missing and need to be added.
The MicroMutableOpResolver is defined in tensorflow/lite/micro/micro_mutable_op_resolver.h, which you’ll need to add to your include statements.
After the MicroMutableOpResolver is set up, we just carry on as usual, setting up our interpreter and its working memory:
// Create an area of memory to use for input, output, and intermediate arrays.
const int tensor_arena_size = 10 * 1024;
uint8_t tensor_arena[tensor_arena_size];

// Build an interpreter to run the model with.
tflite::MicroInterpreter interpreter(model, micro_mutable_op_resolver,
                                     tensor_arena, tensor_arena_size,
                                     error_reporter);
interpreter.AllocateTensors();
In our “hello world” application we allocated only 2 * 1,024 bytes for the tensor_arena, given that the model was so small. Our speech model is a lot bigger, and it deals with more complex input and output, so it needs more space (10 * 1,024 bytes). This was determined by trial and error.
Next, we check the input tensor size. However, it’s a little different this time around:
// Get information about the memory area to use for the model's input.
TfLiteTensor* input = interpreter.input(0);

// Make sure the input has the properties we expect.
TF_LITE_MICRO_EXPECT_NE(nullptr, input);
TF_LITE_MICRO_EXPECT_EQ(4, input->dims->size);
TF_LITE_MICRO_EXPECT_EQ(1, input->dims->data[0]);
TF_LITE_MICRO_EXPECT_EQ(49, input->dims->data[1]);
TF_LITE_MICRO_EXPECT_EQ(40, input->dims->data[2]);
TF_LITE_MICRO_EXPECT_EQ(1, input->dims->data[3]);
TF_LITE_MICRO_EXPECT_EQ(kTfLiteUInt8, input->type);
Because we’re dealing with a spectrogram as our input, the input tensor has more dimensions—four, in total. The first dimension is just a wrapper containing a single element. The second and third represent the “rows” and “columns” of our spectrogram, which happens to have 49 rows and 40 columns. The fourth, innermost dimension of the input tensor, which has size 1, holds each individual “pixel” of the spectrogram. We’ll look more at the spectrogram’s structure later on.
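As a rough illustration of what that shape means in memory, the sketch below works out how many values a 1 x 49 x 40 x 1 tensor holds and where a given spectrogram cell sits in a flat buffer. It assumes a row-major layout, and the names kRows, kColumns, ElementCount(), and FlatIndex() are made up for this illustration; they are not part of the example's code.

```cpp
#include <cassert>

constexpr int kRows = 49;     // time slices, from the shape checked in the test
constexpr int kColumns = 40;  // frequency values per slice

// Total number of elements in a 1 x 49 x 40 x 1 tensor.
constexpr int ElementCount() { return 1 * kRows * kColumns * 1; }

// Flat offset of the value at (row, column), assuming row-major storage.
int FlatIndex(int row, int column) {
  assert(row >= 0 && row < kRows);
  assert(column >= 0 && column < kColumns);
  return row * kColumns + column;
}
```

Under these assumptions the tensor holds 1,960 values, the second row starts at offset 40, and the last cell sits at offset 1,959. Because each value is a single uint8_t, that is also the number of bytes to copy.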
Next, we grab a sample spectrogram for a “yes,” stored in the constant g_yes_micro_f2e59fea_nohash_1_data. The constant is defined in the file micro_features/yes_micro_features_data.cc, which was included by this test. The spectrogram exists as a 1D array, and we just iterate through it to copy it into the input tensor:
// Copy a spectrogram created from a .wav audio file of someone saying "Yes"
// into the memory area used for the input.
const uint8_t* yes_features_data = g_yes_micro_f2e59fea_nohash_1_data;
for (int i = 0; i < input->bytes; ++i) {
  input->data.uint8[i] = yes_features_data[i];
}
After the input has been assigned, we run inference and inspect the output tensor’s size and shape:
// Run the model on this input and make sure it succeeds.
TfLiteStatus invoke_status = interpreter.Invoke();
if (invoke_status != kTfLiteOk) {
  error_reporter->Report("Invoke failed\n");
}
TF_LITE_MICRO_EXPECT_EQ(kTfLiteOk, invoke_status);

// Get the output from the model, and make sure it's the expected size and
// type.
TfLiteTensor* output = interpreter.output(0);
TF_LITE_MICRO_EXPECT_EQ(2, output->dims->size);
TF_LITE_MICRO_EXPECT_EQ(1, output->dims->data[0]);
TF_LITE_MICRO_EXPECT_EQ(4, output->dims->data[1]);
TF_LITE_MICRO_EXPECT_EQ(kTfLiteUInt8, output->type);
Our output has two dimensions. The first is just a wrapper. The second has four elements. This is the structure that holds the probabilities that each of our four classes (silence, unknown, “yes,” and “no”) were matched.
The next chunk of code checks whether the probabilities were as expected. A given element of the output tensor always represents a certain class, so we know which index to check for each one. The order is defined during training:
// There are four possible classes in the output, each with a score.
const int kSilenceIndex = 0;
const int kUnknownIndex = 1;
const int kYesIndex = 2;
const int kNoIndex = 3;

// Make sure that the expected "Yes" score is higher than the other classes.
uint8_t silence_score = output->data.uint8[kSilenceIndex];
uint8_t unknown_score = output->data.uint8[kUnknownIndex];
uint8_t yes_score = output->data.uint8[kYesIndex];
uint8_t no_score = output->data.uint8[kNoIndex];
TF_LITE_MICRO_EXPECT_GT(yes_score, silence_score);
TF_LITE_MICRO_EXPECT_GT(yes_score, unknown_score);
TF_LITE_MICRO_EXPECT_GT(yes_score, no_score);
We passed in a “yes” spectrogram, so we expect that the variable yes_score contains a higher probability than silence_score, unknown_score, and no_score.
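The test compares scores pairwise, but an application usually wants a single answer: which class scored highest? The helper below is a small sketch of that step. TopClassIndex() is a hypothetical name, not a function from the example, and any scores passed to it in a quick experiment would be made-up values rather than real model output.

```cpp
#include <cassert>
#include <cstdint>

// Returns the index of the highest score. On a tie, the lowest index wins.
int TopClassIndex(const uint8_t* scores, int count) {
  assert(scores != nullptr && count > 0);
  int best = 0;
  for (int i = 1; i < count; ++i) {
    if (scores[i] > scores[best]) best = i;
  }
  return best;
}
```

Given the class order used in the test (silence, unknown, “yes,” “no”), a result of 2 would correspond to “yes.”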
When we’re satisfied with “yes,” we do the same thing with a “no” spectrogram. First, we copy in an input and run inference:
// Now test with a different input, from a recording of "No".
const uint8_t* no_features_data = g_no_micro_f9643d42_nohash_4_data;
for (int i = 0; i < input->bytes; ++i) {
  input->data.uint8[i] = no_features_data[i];
}

// Run the model on this "No" input.
invoke_status = interpreter.Invoke();
if (invoke_status != kTfLiteOk) {
  error_reporter->Report("Invoke failed\n");
}
TF_LITE_MICRO_EXPECT_EQ(kTfLiteOk, invoke_status);
After inference is done, we confirm that “no” achieved the highest score:
// Make sure that the expected "No" score is higher than the other classes.
silence_score = output->data.uint8[kSilenceIndex];
unknown_score = output->data.uint8[kUnknownIndex];
yes_score = output->data.uint8[kYesIndex];
no_score = output->data.uint8[kNoIndex];
TF_LITE_MICRO_EXPECT_GT(no_score, silence_score);
TF_LITE_MICRO_EXPECT_GT(no_score, unknown_score);
TF_LITE_MICRO_EXPECT_GT(no_score, yes_score);
And we’re done!
To run this test, issue the following command from the root of the TensorFlow repository:
make -f tensorflow/lite/micro/tools/make/Makefile test_micro_speech_test
Next up, let’s look at the source of all our audio data: the audio provider.
The audio provider is what connects a device’s microphone hardware to our code. Every device has a different mechanism for capturing audio. As a result, audio_provider.h defines an interface for requesting audio data, and developers can write their own implementations for any platforms that they want to support.
The example includes audio provider implementations for Arduino, STM32F746G, SparkFun Edge, and macOS. If you’d like this example to support a new device, you can read the existing implementations to learn how to do it.
The core part of the audio provider is a function named GetAudioSamples(), defined in audio_provider.h. It looks like this:
TfLiteStatus GetAudioSamples(tflite::ErrorReporter* error_reporter,
                             int start_ms, int duration_ms,
                             int* audio_samples_size, int16_t** audio_samples);
As described in audio_provider.h, the function is expected to return an array of 16-bit pulse code modulated (PCM) audio data. This is a very common format for digital audio.
The function is called with an ErrorReporter instance, a start time (start_ms), a duration (duration_ms), and two pointers.
These pointers are a mechanism for GetAudioSamples() to provide data. The caller declares variables of the appropriate type and then passes pointers to them when it calls the function. Inside the function’s implementation, the pointers are dereferenced and the variables’ values are set.
The first pointer, audio_samples_size, will receive the total number of 16-bit samples in the audio data. The second pointer, audio_samples, will receive an array containing the audio data itself.
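If this output-pointer pattern is unfamiliar, the following sketch shows it in isolation. GetFakeSamples() and SumFakeSamples() are hypothetical functions written for this illustration; they mimic the shape of GetAudioSamples() but return a small hardcoded buffer instead of microphone data.

```cpp
#include <cassert>
#include <cstdint>

namespace {
constexpr int kFakeSampleCount = 4;
int16_t g_fake_samples[kFakeSampleCount] = {0, 100, -100, 50};
}  // namespace

// Hypothetical provider: writes the sample count and a pointer to the
// sample buffer through the two output pointers.
bool GetFakeSamples(int* samples_size, int16_t** samples) {
  if (samples_size == nullptr || samples == nullptr) return false;
  *samples_size = kFakeSampleCount;
  *samples = g_fake_samples;
  return true;
}

// Caller side: declare the variables, pass their addresses, then read them.
int SumFakeSamples() {
  int size = 0;
  int16_t* data = nullptr;
  const bool ok = GetFakeSamples(&size, &data);
  assert(ok);
  (void)ok;  // Avoids an unused-variable warning when asserts are disabled.
  int total = 0;
  for (int i = 0; i < size; ++i) {
    total += data[i];
  }
  return total;
}
```

Note that the provider hands back a pointer to memory it owns, so nothing is copied. The caller only reads from the buffer.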
By looking at the tests, we can see this in action. There are two tests in audio_provider_test.cc, but we need to look only at the first to learn how to use the audio provider:
TF_LITE_MICRO_TEST(TestAudioProvider) {
  tflite::MicroErrorReporter micro_error_reporter;
  tflite::ErrorReporter* error_reporter = &micro_error_reporter;

  int audio_samples_size = 0;
  int16_t* audio_samples = nullptr;
  TfLiteStatus get_status =
      GetAudioSamples(error_reporter, 0, kFeatureSliceDurationMs,
                      &audio_samples_size, &audio_samples);
  TF_LITE_MICRO_EXPECT_EQ(kTfLiteOk, get_status);
  TF_LITE_MICRO_EXPECT_LE(audio_samples_size, kMaxAudioSampleSize);
  TF_LITE_MICRO_EXPECT_NE(audio_samples, nullptr);

  // Make sure we can read all of the returned memory locations.
  int total = 0;
  for (int i = 0; i < audio_samples_size; ++i) {
    total += audio_samples[i];
  }
}
The test shows how GetAudioSamples() is called with some values and some pointers. The test confirms that the pointers are assigned correctly after the function is called.
You’ll notice the use of some constants, kFeatureSliceDurationMs and kMaxAudioSampleSize. These are values that were chosen when the model was trained, and you can find them in micro_features/micro_model_settings.h.
The default implementation of audio_provider.cc just returns an empty array. To prove that it’s the right size, the test simply loops through it for the expected number of samples.
In addition to GetAudioSamples(), the audio provider contains a function called LatestAudioTimestamp(). This is intended to return the time that audio data was last captured, in milliseconds. This information is needed by the feature provider to determine what audio data to fetch.
To run the audio provider tests, use the following command:
make -f tensorflow/lite/micro/tools/make/Makefile test_audio_provider_test
The audio provider is used by the feature provider as a source of fresh audio samples, so let’s take a look at that next.
The feature provider converts raw audio, obtained from the audio provider, into spectrograms that can be fed into our model. It is called during the main loop.
Its interface is defined in feature_provider.h, and looks like this:
class FeatureProvider {
 public:
  // Create the provider, and bind it to an area of memory. This memory should
  // remain accessible for the lifetime of the provider object, since subsequent
  // calls will fill it with feature data. The provider does no memory
  // management of this data.
  FeatureProvider(int feature_size, uint8_t* feature_data);
  ~FeatureProvider();

  // Fills the feature data with information from audio inputs, and returns how
  // many feature slices were updated.
  TfLiteStatus PopulateFeatureData(tflite::ErrorReporter* error_reporter,
                                   int32_t last_time_in_ms, int32_t time_in_ms,
                                   int* how_many_new_slices);

 private:
  int feature_size_;
  uint8_t* feature_data_;
  // Make sure we don't try to use cached information if this is the first call
  // into the provider.
  bool is_first_run_;
};
To see how it’s used, we can take a look at the tests in feature_provider_mock_test.cc.
For there to be audio data for the feature provider to work with, these tests use a special fake version of the audio provider, known as a mock, that is set up to provide audio data. It is defined in audio_provider_mock.cc.
The mock audio provider is substituted for the real thing in the build instructions for the test, which you can find in Makefile.inc under FEATURE_PROVIDER_MOCK_TEST_SRCS.
The file feature_provider_mock_test.cc contains two tests. Here’s the first one:
TF_LITE_MICRO_TEST(TestFeatureProviderMockYes) {
  tflite::MicroErrorReporter micro_error_reporter;
  tflite::ErrorReporter* error_reporter = &micro_error_reporter;

  uint8_t feature_data[kFeatureElementCount];
  FeatureProvider feature_provider(kFeatureElementCount, feature_data);

  int how_many_new_slices = 0;
  TfLiteStatus populate_status = feature_provider.PopulateFeatureData(
      error_reporter, /* last_time_in_ms= */ 0, /* time_in_ms= */ 970,
      &how_many_new_slices);
  TF_LITE_MICRO_EXPECT_EQ(kTfLiteOk, populate_status);
  TF_LITE_MICRO_EXPECT_EQ(kFeatureSliceCount, how_many_new_slices);

  for (int i = 0; i < kFeatureElementCount; ++i) {
    TF_LITE_MICRO_EXPECT_EQ(g_yes_micro_f2e59fea_nohash_1_data[i],
                            feature_data[i]);
  }
}
To create a FeatureProvider, we call its constructor, passing in feature_size and feature_data arguments:
FeatureProvider feature_provider(kFeatureElementCount, feature_data);
The first argument indicates how many total data elements should be in the spectrogram. The second argument is a pointer to an array that we want to be populated with the spectrogram data.
The number of elements in the spectrogram was decided when the model was trained and is defined as kFeatureElementCount in micro_features/micro_model_settings.h.
To obtain features for the past second of audio, feature_provider.PopulateFeatureData() is called:
TfLiteStatus populate_status = feature_provider.PopulateFeatureData(
    error_reporter, /* last_time_in_ms= */ 0, /* time_in_ms= */ 970,
    &how_many_new_slices);
We supply an ErrorReporter instance, an integer representing the last time this method was called (last_time_in_ms), the current time (time_in_ms), and a pointer to an integer that will be updated with how many new feature slices we receive (how_many_new_slices). A slice is just one row of the spectrogram, representing a chunk of time.
Because we always want the last second of audio, the feature provider will compare when it was last called (last_time_in_ms) with the current time (time_in_ms), create spectrogram data from the audio captured during that time, and then update the feature_data array to add any additional slices and drop any that are older than one second.
When PopulateFeatureData() runs, it will request audio from the mock audio provider. The mock will give it audio representing a “yes,” and the feature provider will process it and provide the result.
After calling PopulateFeatureData(), we check whether its result is what we expect. We compare the data it generated to a known spectrogram that is correct for the “yes” input given by the mock audio provider:
TF_LITE_MICRO_EXPECT_EQ(kTfLiteOk, populate_status);
TF_LITE_MICRO_EXPECT_EQ(kFeatureSliceCount, how_many_new_slices);
for (int i = 0; i < kFeatureElementCount; ++i) {
  TF_LITE_MICRO_EXPECT_EQ(g_yes_micro_f2e59fea_nohash_1_data[i],
                          feature_data[i]);
}
The mock audio provider can provide audio for a “yes” or a “no” depending on which start and end times are passed into it. The second test in feature_provider_mock_test.cc does exactly the same thing as the first, but for audio representing “no.”
To run the tests, use the following command:
make -f tensorflow/lite/micro/tools/make/Makefile test_feature_provider_mock_test
The feature provider is implemented in feature_provider.cc. Let’s talk through how it works.
As we’ve discussed, its job is to populate an array that represents a spectrogram of one second of audio. It’s designed to be called in a loop, so to avoid unnecessary work, it will generate new features only for the time between now and when it was last called. If it was last called less than a second ago, it keeps some of its previous output and generates only the missing parts.
In our code, each spectrogram is represented as a 2D array, with 40 columns and 49 rows, where each row represents a 30-millisecond (ms) sample of audio split into 40 frequency buckets, one per column.
To create each row, we run a 30-ms slice of audio input through a fast Fourier transform (FFT) algorithm. This technique analyzes the frequency distribution of audio in the sample and creates an array of 256 frequency buckets, each with a value from 0 to 255. Neighboring buckets are then averaged together into groups, leaving us with the 40 buckets that make up one row.
The code that does this is in the file micro_features/micro_features_generator.cc, and is called by the feature provider.
To build the entire 2D array, we combine the results of running the FFT on 49 consecutive 30-ms slices of audio, with each slice overlapping the last by 10 ms. Figure 7-4 shows how this happens.
You can see how the 30-ms sample window is moved forward by 20 ms each time until it has covered the full one-second sample. The resulting spectrogram is ready to pass into our model.
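The window arithmetic is easy to check by hand. The sketch below counts how many full 30-ms windows fit into a one-second clip when the window advances by 20 ms, and where each window starts. WindowCount() and WindowStartMs() are hypothetical helpers written for this illustration, not functions from the example.

```cpp
#include <cassert>

// Number of full windows of `window_ms` that fit into `clip_ms` when the
// window advances by `stride_ms` each step.
int WindowCount(int clip_ms, int window_ms, int stride_ms) {
  assert(window_ms > 0 && stride_ms > 0);
  if (clip_ms < window_ms) return 0;
  return (clip_ms - window_ms) / stride_ms + 1;
}

// Start time, in milliseconds, of the window at the given index.
int WindowStartMs(int index, int stride_ms) {
  assert(index >= 0 && stride_ms > 0);
  return index * stride_ms;
}
```

For a 1,000-ms clip this gives 49 windows, which matches the 49 rows of the spectrogram. The last window starts at 960 ms and ends at 990 ms, so the final 10 ms of the clip never fills a complete window.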
We can understand how this process happens in feature_provider.cc. First, it determines which slices it actually needs to generate based on the time PopulateFeatureData() was last called:
// Quantize the time into steps as long as each window stride, so we can
// figure out which audio data we need to fetch.
const int last_step = (last_time_in_ms / kFeatureSliceStrideMs);
const int current_step = (time_in_ms / kFeatureSliceStrideMs);

int slices_needed = current_step - last_step;
If it hasn’t run before, or it ran more than one second ago, it will generate the maximum number of slices:
if (is_first_run_) {
  TfLiteStatus init_status = InitializeMicroFeatures(error_reporter);
  if (init_status != kTfLiteOk) {
    return init_status;
  }
  is_first_run_ = false;
  slices_needed = kFeatureSliceCount;
}
if (slices_needed > kFeatureSliceCount) {
  slices_needed = kFeatureSliceCount;
}
*how_many_new_slices = slices_needed;
The resulting number is written to how_many_new_slices.
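To see how this plays out with real numbers, here is the same arithmetic wrapped in a standalone function. SlicesNeeded() and its parameters are hypothetical names chosen for this sketch. The values tried below assume a 20-ms stride and a maximum of 49 slices, which match the figures described in this chapter but are passed in explicitly rather than read from the example's settings file.

```cpp
#include <cassert>

// Number of new slices to generate, following the logic described above:
// quantize both times into stride-sized steps, take the difference, and
// fall back to the maximum on the first run or after a long gap.
int SlicesNeeded(int last_time_ms, int time_ms, int stride_ms, int max_slices,
                 bool is_first_run) {
  assert(stride_ms > 0 && max_slices > 0);
  const int last_step = last_time_ms / stride_ms;
  const int current_step = time_ms / stride_ms;
  int needed = current_step - last_step;
  if (is_first_run || needed > max_slices) {
    needed = max_slices;
  }
  return needed;
}
```

Going from 80 ms to 120 ms crosses two 20-ms steps, so two slices are needed. Going from 100 ms to 110 ms crosses none, so the existing spectrogram can be reused as is.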
Next, it calculates how many of any existing slices it should keep, and shifts data in the array around to make room for any new ones:
const int slices_to_keep = kFeatureSliceCount - slices_needed;
const int slices_to_drop = kFeatureSliceCount - slices_to_keep;
// If we can avoid recalculating some slices, just move the existing data
// up in the spectrogram, to perform something like this:
// last time = 80ms          current time = 120ms
// +-----------+             +-----------+
// | data@20ms |         --> | data@60ms |
// +-----------+       --    +-----------+
// | data@40ms |     --  --> | data@80ms |
// +-----------+   --  --    +-----------+
// | data@60ms | --  --      |  <empty>  |
// +-----------+   --        +-----------+
// | data@80ms | --          |  <empty>  |
// +-----------+             +-----------+
if (slices_to_keep > 0) {
  for (int dest_slice = 0; dest_slice < slices_to_keep; ++dest_slice) {
    uint8_t* dest_slice_data =
        feature_data_ + (dest_slice * kFeatureSliceSize);
    const int src_slice = dest_slice + slices_to_drop;
    const uint8_t* src_slice_data =
        feature_data_ + (src_slice * kFeatureSliceSize);
    for (int i = 0; i < kFeatureSliceSize; ++i) {
      dest_slice_data[i] = src_slice_data[i];
    }
  }
}
If you’re a seasoned C++ author, you might wonder why we don’t use standard libraries to do things like copying data around. The reason is that we’re trying to avoid unnecessary dependencies, in an effort to keep our binary size small. Because embedded platforms have very little memory, a smaller application binary means that we have space for a larger and more accurate deep learning model.
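The shifting step is easier to follow on a tiny buffer. The sketch below applies the same hand-written copy to a generic flat array of rows. ShiftRowsUp() is a hypothetical helper for illustration only; the real code works directly on feature_data_ with the constants from the settings file.

```cpp
#include <cassert>
#include <cstdint>

// Moves the newest `rows_to_keep` rows to the top of a flat, row-major
// buffer. Rows below them are left untouched and are expected to be
// overwritten with fresh data afterward.
void ShiftRowsUp(uint8_t* data, int row_count, int row_size,
                 int rows_to_keep) {
  assert(data != nullptr && row_size > 0);
  assert(rows_to_keep >= 0 && rows_to_keep <= row_count);
  const int rows_to_drop = row_count - rows_to_keep;
  for (int dest = 0; dest < rows_to_keep; ++dest) {
    const int src = dest + rows_to_drop;
    for (int i = 0; i < row_size; ++i) {
      data[dest * row_size + i] = data[src * row_size + i];
    }
  }
}
```

With four rows holding data from 20, 40, 60, and 80 ms and two rows to keep, the rows for 60 and 80 ms move to the top, mirroring the diagram in the code comment. Because each destination row always sits above its source row, copying from the top down never overwrites data that is still waiting to be moved.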
After moving data around, it begins a loop that iterates once for each new slice that it needs. In this loop, it first requests audio for that slice from the audio provider using GetAudioSamples():
  for (int new_slice = slices_to_keep; new_slice < kFeatureSliceCount;
       ++new_slice) {
    const int new_step = (current_step - kFeatureSliceCount + 1) + new_slice;
    const int32_t slice_start_ms = (new_step * kFeatureSliceStrideMs);
    int16_t* audio_samples = nullptr;
    int audio_samples_size = 0;
    GetAudioSamples(error_reporter, slice_start_ms, kFeatureSliceDurationMs,
                    &audio_samples_size, &audio_samples);
    if (audio_samples_size < kMaxAudioSampleSize) {
      error_reporter->Report("Audio data size %d too small, want %d",
                             audio_samples_size, kMaxAudioSampleSize);
      return kTfLiteError;
    }
To complete the loop iteration, it passes that data into GenerateMicroFeatures(), defined in micro_features/micro_features_generator.h. This is the function that performs the FFT and returns the audio frequency information.
It also passes a pointer, new_slice_data, which points at the memory location where the new data should be written:
    uint8_t* new_slice_data = feature_data_ + (new_slice * kFeatureSliceSize);
    size_t num_samples_read;
    TfLiteStatus generate_status = GenerateMicroFeatures(
        error_reporter, audio_samples, audio_samples_size, kFeatureSliceSize,
        new_slice_data, &num_samples_read);
    if (generate_status != kTfLiteOk) {
      return generate_status;
    }
  }
After this process has happened for each slice, we have an entire second’s worth of up-to-date spectrogram.
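To see how the rows of the spectrogram line up in time, the following sketch reproduces the start-time arithmetic from the loop as a standalone function. The name SliceStartMs is hypothetical, and the 20 ms stride and slice count of 49 are assumed values used only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assumed values for illustration only.
constexpr int kDemoSliceStrideMs = 20;
constexpr int kDemoSliceCount = 49;

// Start time, in milliseconds, of the audio that feeds spectrogram row
// `slice_index` when the newest row corresponds to `current_step`.
// Hypothetical helper mirroring the arithmetic in the loop.
int32_t SliceStartMs(int current_step, int slice_index) {
  assert(slice_index >= 0 && slice_index < kDemoSliceCount);
  const int step = (current_step - kDemoSliceCount + 1) + slice_index;
  return step * kDemoSliceStrideMs;
}
```

With these assumed values, the newest row always starts at current_step * 20 ms, and the oldest row starts 960 ms before it, which is how the window comes to cover roughly one second of audio.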
The function that generates the FFT is GenerateMicroFeatures(). If you’re interested, you can read its definition in micro_features/micro_features_generator.cc.
If you’re building your own application that uses spectrograms, you can reuse this code as is. You’ll need to use the same code to pre-process data into spectrograms when training your model.
Once we have a spectrogram, we can run inference on it using the model. After this happens, we need to interpret the results. That task belongs to the class we explore next, RecognizeCommands.
After our model outputs a set of probabilities that a known word was spoken in the last second of audio, it’s the job of the RecognizeCommands class to determine whether this indicates a successful detection.
It seems like this would be simple: if the probability in a given category is more than a certain threshold, the word was spoken. However, in the real world, things become a bit more complicated.
As we established earlier, we’re running multiple inferences per second, each on a one-second window of data. This means that we’ll run inference on any given word multiple times, in multiple windows.
In Figure 7-5, you can see a waveform of the word “noted” being spoken, surrounded by a box representing a one-second window being captured.
Our model is trained to detect the word “no,” and it understands that the word “noted” is not the same thing. If we run inference on this one-second window, it will (hopefully) output a low probability for the word “no.” However, what if the window came slightly earlier in the audio stream, as in Figure 7-6?
In this case, the only part of the word “noted” that appears within the window is its first syllable. Because the first syllable of “noted” sounds like “no,” it’s likely that the model will interpret this as having a high probability of being a “no.”
This problem, along with others, means that we can’t rely on a single inference to tell us whether a word was spoken. This is where RecognizeCommands comes in!
The recognizer calculates the average score for each word over the past few inferences, and decides whether it’s high enough to count as a detection. To do this, we feed it each inference result as it rolls in.
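Before looking at the real class, it can help to see the core idea in isolation. The sketch below is a simplified stand-in, not the library's implementation: AverageExceeds is a hypothetical helper that handles a single category and ignores timestamps and suppression entirely.

```cpp
#include <cassert>
#include <cstdint>

// Averages the most recent `count` scores for one category and reports
// whether the mean exceeds `threshold`. Hypothetical, simplified helper.
bool AverageExceeds(const uint8_t* scores, int count, uint8_t threshold) {
  assert(scores != nullptr && count > 0);
  int32_t total = 0;
  for (int i = 0; i < count; ++i) {
    total += scores[i];
  }
  return (total / count) > threshold;
}
```

A single spike such as {10, 250, 20, 10}, which is what a stray first syllable of “noted” might produce, averages to 72 and is rejected at a threshold of 200, whereas a sustained run such as {230, 240, 220, 235} averages to 231 and passes.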
You can see its interface in recognize_commands.h, partially reproduced here:
class RecognizeCommands {
 public:
  explicit RecognizeCommands(tflite::ErrorReporter* error_reporter,
                             int32_t average_window_duration_ms = 1000,
                             uint8_t detection_threshold = 200,
                             int32_t suppression_ms = 1500,
                             int32_t minimum_count = 3);
  // Call this with the results of running a model on sample data.
  TfLiteStatus ProcessLatestResults(const TfLiteTensor* latest_results,
                                    const int32_t current_time_ms,
                                    const char** found_command, uint8_t* score,
                                    bool* is_new_command);
The class RecognizeCommands is defined, along with a constructor that defines default values for a few things:
The length of the averaging window (average_window_duration_ms)
The minimum average score that counts as a detection (detection_threshold)
The amount of time we’ll wait after hearing a command before recognizing a second one (suppression_ms)
The minimum number of inferences required in the window for a result to count (minimum_count, which defaults to 3)
The class has one method, ProcessLatestResults(). It accepts a pointer to a TfLiteTensor containing the model’s output (latest_results), and it must be called with the current time (current_time_ms).
In addition, it takes three pointers that it uses for output. First, it gives us the name of any word that was detected (found_command). It also provides the average score of the command (score) and whether the command is new or has been heard in previous inferences within a certain timespan (is_new_command).
Averaging the results of multiple inferences is a useful and common technique when dealing with time-series data. In the next few pages, we’ll walk through the code in recognize_commands.cc and learn a bit about how it works. You don’t need to understand every line, but it’s helpful to get some insight into what might be a helpful tool in your own projects.
First, we make sure the input tensor is the right shape and type:
TfLiteStatus RecognizeCommands::ProcessLatestResults(
    const TfLiteTensor* latest_results, const int32_t current_time_ms,
    const char** found_command, uint8_t* score, bool* is_new_command) {
  if ((latest_results->dims->size != 2) ||
      (latest_results->dims->data[0] != 1) ||
      (latest_results->dims->data[1] != kCategoryCount)) {
    error_reporter_->Report(
        "The results for recognition should contain %d elements, but there are "
        "%d in an %d-dimensional shape",
        kCategoryCount, latest_results->dims->data[1],
        latest_results->dims->size);
    return kTfLiteError;
  }
  if (latest_results->type != kTfLiteUInt8) {
    error_reporter_->Report(
        "The results for recognition should be uint8 elements, but are %d",
        latest_results->type);
    return kTfLiteError;
  }
Next, we check current_time_ms to verify that it is after the most recent result in our averaging window:
  if ((!previous_results_.empty()) &&
      (current_time_ms < previous_results_.front().time_)) {
    error_reporter_->Report(
        "Results must be fed in increasing time order, but received a "
        "timestamp of %d that was earlier than the previous one of %d",
        current_time_ms, previous_results_.front().time_);
    return kTfLiteError;
  }
After that, we add the latest result to a list of results we’ll be averaging:
  // Add the latest results to the head of the queue.
  previous_results_.push_back({current_time_ms, latest_results->data.uint8});
  // Prune any earlier results that are too old for the averaging window.
  const int64_t time_limit = current_time_ms - average_window_duration_ms_;
  while ((!previous_results_.empty()) &&
         previous_results_.front().time_ < time_limit) {
    previous_results_.pop_front();
  }
If there are fewer results in our averaging window than the minimum number (defined by minimum_count_, which is 3 by default), or if those results span less than a quarter of the averaging window, we can’t provide a valid average. In this case, we set the output pointers to indicate that found_command is the most recent top command, that the score is 0, and that the command is not a new one:
  // If there are too few results, assume the result will be unreliable and
  // bail.
  const int64_t how_many_results = previous_results_.size();
  const int64_t earliest_time = previous_results_.front().time_;
  const int64_t samples_duration = current_time_ms - earliest_time;
  if ((how_many_results < minimum_count_) ||
      (samples_duration < (average_window_duration_ms_ / 4))) {
    *found_command = previous_top_label_;
    *score = 0;
    *is_new_command = false;
    return kTfLiteOk;
  }
Otherwise, we continue by averaging all of the scores in the window:
  // Calculate the average score across all the results in the window.
  int32_t average_scores[kCategoryCount];
  for (int offset = 0; offset < previous_results_.size(); ++offset) {
    PreviousResultsQueue::Result previous_result =
        previous_results_.from_front(offset);
    const uint8_t* scores = previous_result.scores_;
    for (int i = 0; i < kCategoryCount; ++i) {
      if (offset == 0) {
        average_scores[i] = scores[i];
      } else {
        average_scores[i] += scores[i];
      }
    }
  }
  for (int i = 0; i < kCategoryCount; ++i) {
    average_scores[i] /= how_many_results;
  }
We now have enough information to identify which category is our winner. Establishing this is a simple process:
  // Find the current highest scoring category.
  int current_top_index = 0;
  int32_t current_top_score = 0;
  for (int i = 0; i < kCategoryCount; ++i) {
    if (average_scores[i] > current_top_score) {
      current_top_score = average_scores[i];
      current_top_index = i;
    }
  }
  const char* current_top_label = kCategoryLabels[current_top_index];
The final piece of logic determines whether the result was a valid detection. To do this, it ensures that its score is above the detection threshold (200 by default), and that it didn’t happen too quickly after the last valid detection, which can be an indication of a faulty result:
  // If we've recently had another label trigger, assume one that occurs too
  // soon afterwards is a bad result.
  int64_t time_since_last_top;
  if ((previous_top_label_ == kCategoryLabels[0]) ||
      (previous_top_label_time_ == std::numeric_limits<int32_t>::min())) {
    time_since_last_top = std::numeric_limits<int32_t>::max();
  } else {
    time_since_last_top = current_time_ms - previous_top_label_time_;
  }
  if ((current_top_score > detection_threshold_) &&
      ((current_top_label != previous_top_label_) ||
       (time_since_last_top > suppression_ms_))) {
    previous_top_label_ = current_top_label;
    previous_top_label_time_ = current_time_ms;
    *is_new_command = true;
  } else {
    *is_new_command = false;
  }
  *found_command = current_top_label;
  *score = current_top_score;
If the result was valid, is_new_command is set to true. This is what the caller can use to determine whether a word was genuinely detected.
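The final decision can be restated as a small pure function, which makes it easier to reason about individual cases. This is a simplified sketch: IsNewCommand is a hypothetical helper, it compares labels by content rather than by pointer, and it leaves out the special handling for the first category label and for the very first detection.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// A detection counts as new when the averaged score beats the threshold and
// either the label changed or enough time has passed since the last trigger.
// Hypothetical helper; the defaults mirror the constructor's default values.
bool IsNewCommand(int32_t top_score, const char* top_label,
                  const char* previous_label, int64_t ms_since_last_trigger,
                  int32_t threshold = 200, int32_t suppression_ms = 1500) {
  assert(top_label != nullptr && previous_label != nullptr);
  const bool label_changed = std::strcmp(top_label, previous_label) != 0;
  return (top_score > threshold) &&
         (label_changed || ms_since_last_trigger > suppression_ms);
}
```

For instance, a second “yes” scoring 210 that arrives 500 ms after the first is suppressed, but the same result 2,000 ms later counts as a new command. A score of exactly 200 never triggers, because the comparison is strictly greater than.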
The tests (in recognize_commands_test.cc) exercise various different combinations of inputs and results that are stored in the averaging window.
Let’s walk through one of the tests, RecognizeCommandsTestBasic, which demonstrates how RecognizeCommands is used. First, we just create an instance of the class:
TF_LITE_MICRO_TEST(RecognizeCommandsTestBasic) {
  tflite::MicroErrorReporter micro_error_reporter;
  tflite::ErrorReporter* error_reporter = &micro_error_reporter;
  RecognizeCommands recognize_commands(error_reporter);
Next, we create a tensor containing some fake inference results, which will be used by ProcessLatestResults() to decide whether a command was heard:
  TfLiteTensor results = tflite::testing::CreateQuantizedTensor(
      {255, 0, 0, 0}, tflite::testing::IntArrayFromInitializer({2, 1, 4}),
      "input_tensor", 0.0f, 128.0f);
Then, we set up some variables that will be set with the output of ProcessLatestResults():
  const char* found_command;
  uint8_t score;
  bool is_new_command;
Finally, we call ProcessLatestResults(), providing pointers to these variables along with the tensor containing the results. We assert that the function will return kTfLiteOk, indicating that the input was processed successfully:
  TF_LITE_MICRO_EXPECT_EQ(
      kTfLiteOk, recognize_commands.ProcessLatestResults(
                     &results, 0, &found_command, &score, &is_new_command));
The other tests in the file perform some more exhaustive checks to make sure the function is performing correctly. You can read through them to learn more.
To run all of the tests, use the following command:
make -f tensorflow/lite/micro/tools/make/Makefile test_recognize_commands_test
As soon as we’ve determined whether a command was detected, it’s time to share our results with the world (or at least our on-board LEDs). The command responder is what makes this happen.
The final piece in our puzzle, the command responder, is what produces an output to let us know that a word was detected.
The command responder is designed to be overridden for each type of device. We explore the device-specific implementations later in this chapter.
For now, let’s look at its very simple reference implementation, which just logs detection results as text. You can find it in the file command_responder.cc:
void RespondToCommand(tflite::ErrorReporter* error_reporter,
                      int32_t current_time, const char* found_command,
                      uint8_t score, bool is_new_command) {
  if (is_new_command) {
    error_reporter->Report("Heard %s (%d) @%dms", found_command, score,
                           current_time);
  }
}
That’s it! The file implements just one function: RespondToCommand(). As parameters, it expects an error_reporter, the current time (current_time), the command that was last detected (found_command), the score it received (score), and whether the command was newly heard (is_new_command).
It’s important to note that in our program’s main loop, this function will be called every time inference is performed, even if a command was not detected. This means that we should check is_new_command to determine whether anything needs to be done.
The test for this function, in command_responder_test.cc, is equally simple. It just calls the function, given that there’s no way for it to test that it generates the correct output:
TF_LITE_MICRO_TEST(TestCallability) {
  tflite::MicroErrorReporter micro_error_reporter;
  tflite::ErrorReporter* error_reporter = &micro_error_reporter;
  // This will have external side-effects (like printing to the debug console
  // or lighting an LED) that are hard to observe, so the most we can do is
  // make sure the call doesn't crash.
  RespondToCommand(error_reporter, 0, "foo", 0, true);
}
To run this test, enter this in your terminal:
make -f tensorflow/lite/micro/tools/make/Makefile test_command_responder_test
And that’s it! We’ve walked through all of the components of the application. Now, let’s see how they come together in the program itself.
You can find the following code in main_functions.cc, which defines the setup() and loop() functions that are the core of our program. Let’s read through it together!
Because you’re now a seasoned TensorFlow Lite expert, a lot of this code will look familiar to you. So let’s try to focus on the new bits.
First, we list the ops that we want to use:
namespace tflite {
namespace ops {
namespace micro {
TfLiteRegistration* Register_DEPTHWISE_CONV_2D();
TfLiteRegistration* Register_FULLY_CONNECTED();
TfLiteRegistration* Register_SOFTMAX();
}  // namespace micro
}  // namespace ops
}  // namespace tflite
Next, we set up our global variables:
namespace {
tflite::ErrorReporter* error_reporter = nullptr;
const tflite::Model* model = nullptr;
tflite::MicroInterpreter* interpreter = nullptr;
TfLiteTensor* model_input = nullptr;
FeatureProvider* feature_provider = nullptr;
RecognizeCommands* recognizer = nullptr;
int32_t previous_time = 0;
// Create an area of memory to use for input, output, and intermediate arrays.
// The size of this will depend on the model you're using, and may need to be
// determined by experimentation.
constexpr int kTensorArenaSize = 10 * 1024;
uint8_t tensor_arena[kTensorArenaSize];
}  // namespace
Notice how we declare a FeatureProvider and a RecognizeCommands in addition to the usual TensorFlow suspects. We also declare a variable named previous_time, which keeps track of the most recent time we received new audio samples.
Next up, in the setup() function, we load the model, set up our interpreter, add ops, and allocate tensors:
void setup() {
  // Set up logging.
  static tflite::MicroErrorReporter micro_error_reporter;
  error_reporter = &micro_error_reporter;
  // Map the model into a usable data structure. This doesn't involve any
  // copying or parsing, it's a very lightweight operation.
  model = tflite::GetModel(g_tiny_conv_micro_features_model_data);
  if (model->version() != TFLITE_SCHEMA_VERSION) {
    error_reporter->Report(
        "Model provided is schema version %d not equal "
        "to supported version %d.",
        model->version(), TFLITE_SCHEMA_VERSION);
    return;
  }
  // Pull in only the operation implementations we need.
  static tflite::MicroMutableOpResolver micro_mutable_op_resolver;
  micro_mutable_op_resolver.AddBuiltin(
      tflite::BuiltinOperator_DEPTHWISE_CONV_2D,
      tflite::ops::micro::Register_DEPTHWISE_CONV_2D());
  micro_mutable_op_resolver.AddBuiltin(
      tflite::BuiltinOperator_FULLY_CONNECTED,
      tflite::ops::micro::Register_FULLY_CONNECTED());
  micro_mutable_op_resolver.AddBuiltin(tflite::BuiltinOperator_SOFTMAX,
                                       tflite::ops::micro::Register_SOFTMAX());
  // Build an interpreter to run the model with.
  static tflite::MicroInterpreter static_interpreter(
      model, micro_mutable_op_resolver, tensor_arena, kTensorArenaSize,
      error_reporter);
  interpreter = &static_interpreter;
  // Allocate memory from the tensor_arena for the model's tensors.
  TfLiteStatus allocate_status = interpreter->AllocateTensors();
  if (allocate_status != kTfLiteOk) {
    error_reporter->Report("AllocateTensors() failed");
    return;
  }
After allocating tensors, we check that the input tensor is the correct shape and type:
  // Get information about the memory area to use for the model's input.
  model_input = interpreter->input(0);
  if ((model_input->dims->size != 4) || (model_input->dims->data[0] != 1) ||
      (model_input->dims->data[1] != kFeatureSliceCount) ||
      (model_input->dims->data[2] != kFeatureSliceSize) ||
      (model_input->type != kTfLiteUInt8)) {
    error_reporter->Report("Bad input tensor parameters in model");
    return;
  }
Next comes the interesting stuff. First, we instantiate a FeatureProvider, pointing it at our input tensor:
  // Prepare to access the audio spectrograms from a microphone or other source
  // that will provide the inputs to the neural network.
  static FeatureProvider static_feature_provider(kFeatureElementCount,
                                                 model_input->data.uint8);
  feature_provider = &static_feature_provider;
We then create a RecognizeCommands instance and initialize our previous_time variable:
  static RecognizeCommands static_recognizer(error_reporter);
  recognizer = &static_recognizer;
  previous_time = 0;
}
Up next, it’s time for our loop() function. Like in the previous example, this function will be called over and over again, indefinitely. In the loop, we first use the feature provider to create a spectrogram:
void loop() {
  // Fetch the spectrogram for the current time.
  const int32_t current_time = LatestAudioTimestamp();
  int how_many_new_slices = 0;
  TfLiteStatus feature_status = feature_provider->PopulateFeatureData(
      error_reporter, previous_time, current_time, &how_many_new_slices);
  if (feature_status != kTfLiteOk) {
    error_reporter->Report("Feature generation failed");
    return;
  }
  previous_time = current_time;
  // If no new audio samples have been received since last time, don't bother
  // running the network model.
  if (how_many_new_slices == 0) {
    return;
  }
If there’s no new data since the last iteration, we don’t bother running inference.
After we have our input, we just invoke the interpreter:
  // Run the model on the spectrogram input and make sure it succeeds.
  TfLiteStatus invoke_status = interpreter->Invoke();
  if (invoke_status != kTfLiteOk) {
    error_reporter->Report("Invoke failed");
    return;
  }
The model’s output tensor is now filled with the probabilities for each category. To interpret them, we use our RecognizeCommands instance. We obtain a pointer to the output tensor, then set up a few variables to receive the ProcessLatestResults() output:
  // Obtain a pointer to the output tensor
  TfLiteTensor* output = interpreter->output(0);
  // Determine whether a command was recognized based on the output of inference
  const char* found_command = nullptr;
  uint8_t score = 0;
  bool is_new_command = false;
  TfLiteStatus process_status = recognizer->ProcessLatestResults(
      output, current_time, &found_command, &score, &is_new_command);
  if (process_status != kTfLiteOk) {
    error_reporter->Report("RecognizeCommands::ProcessLatestResults() failed");
    return;
  }
Finally, we call the command responder’s RespondToCommand() method so that it can notify users if a word was detected:
  // Do something based on the recognized command. The default implementation
  // just prints to the error console, but you should replace this with your
  // own function for a real application.
  RespondToCommand(error_reporter, current_time, found_command, score,
                   is_new_command);
}
And that’s it! The call to RespondToCommand() is the final thing in our loop. Everything from feature generation onward will repeat endlessly, checking the audio for known words and producing some output if one is confirmed.
The setup() and loop() functions are called by our main() function, defined in main.cc, which begins the loop when the application starts:
int main(int argc, char* argv[]) {
  setup();
  while (true) {
    loop();
  }
}
The example contains an audio provider compatible with macOS. If you have access to a Mac, you can run the example on your development machine. First, use the following command to build it:
make -f tensorflow/lite/micro/tools/make/Makefile micro_speech
After the build completes, you can run the example with the following command:
tensorflow/lite/micro/tools/make/gen/osx_x86_64/bin/micro_speech
You might see a pop-up asking for microphone access. If so, grant it, and the program will start.
Try saying “yes” and “no.” You should see output that looks like the following:
Heard yes (201) @4056ms
Heard no (205) @6448ms
Heard unknown (201) @13696ms
Heard yes (205) @15000ms
Heard yes (205) @16856ms
Heard unknown (204) @18704ms
Heard no (206) @21000ms
The number after each detected word is its score. By default, the command recognizer component considers matches as valid only if their score is more than 200, so every score you see will be greater than 200.
The number after the score is the number of milliseconds since the program was started.
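If you capture this output to a file and want to analyze it on your development machine, a line can be pulled apart with sscanf(). The sketch below is only an illustration: HeardLine and ParseHeardLine are hypothetical names, and the 15-character limit on the command is an arbitrary assumption.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

// Fields extracted from one log line. Hypothetical type for host-side use.
struct HeardLine {
  char command[16];
  int score;
  long time_ms;
};

// Parses one line of the form "Heard yes (201) @4056ms".
// Returns true only if all three fields were read.
bool ParseHeardLine(const char* line, HeardLine* out) {
  assert(line != nullptr && out != nullptr);
  return std::sscanf(line, "Heard %15s (%d) @%ldms", out->command,
                     &out->score, &out->time_ms) == 3;
}
```

Lines that don’t start with “Heard”, such as error messages, make the function return false, so they can simply be skipped.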
If you don’t see any output, make sure your Mac’s internal microphone is selected in the Mac’s Sound menu and that its input volume is turned up high enough.
We’ve established that the program works on a Mac. Now, let’s get it running on some embedded hardware.
In this section, we deploy the code to three different devices:
For each one, we’ll walk through the build and deployment process.
Because every device has its own mechanism for capturing audio, there’s a separate implementation of audio_provider.cc for each one. The same is true for output, so each has a variant of command_responder.cc, too.
The audio_provider.cc implementations are complex and device-specific, and not directly related to machine learning. Consequently, we won’t walk through them in this chapter. However, there’s a walkthrough of the Arduino variant in Appendix B. If you need to capture audio in your own project, you’re welcome to reuse these implementations in your own code.
Alongside deployment instructions, we’re also going to walk through the command_responder.cc implementation for each device. First up, it’s time for Arduino.
As of this writing, the only Arduino board with a built-in microphone is the Arduino Nano 33 BLE Sense, so that’s what we’ll be using for this section. If you’re using a different Arduino board and attaching your own microphone, you’ll need to implement your own audio_provider.cc.
The Arduino Nano 33 BLE Sense also has a built-in LED, which is what we use to indicate that a word has been recognized.
Figure 7-7 shows a picture of the board with its LED highlighted.
Now let’s look at how we use this LED to indicate that a word has been detected.
Every Arduino board has a built-in LED, and there’s a convenient constant called LED_BUILTIN that we can use to obtain its pin number, which varies across boards. To keep this code portable, we’ll constrain ourselves to using this single LED for output.
Here’s what we’re going to do. To show that inference is running, we’ll flash the LED by toggling it on or off with each inference. However, when we hear the word “yes,” we’ll switch on the LED for a few seconds.
What about the word “no”? Well, because this is just a demonstration, we won’t worry about it too much. We do, however, log all of the detected commands to the serial port, so we can connect to the device and see every match.
The replacement command responder for Arduino is located in arduino/command_responder.cc. Let’s walk through its source. First, we include the command responder header file and the Arduino platform’s library header file:
#include "tensorflow/lite/micro/examples/micro_speech/command_responder.h"
#include "Arduino.h"
Next, we begin our function implementation:
// Toggles the LED every inference, and keeps it on for 3 seconds if a "yes"
// was heard
void RespondToCommand(tflite::ErrorReporter* error_reporter,
                      int32_t current_time, const char* found_command,
                      uint8_t score, bool is_new_command) {
Our next step is to place the built-in LED’s pin into output mode so that we can switch it on and off. We do this inside an if statement that runs only once, thanks to a static bool called is_initialized. Remember, static variables preserve their state between function calls:
  static bool is_initialized = false;
  if (!is_initialized) {
    pinMode(LED_BUILTIN, OUTPUT);
    is_initialized = true;
  }
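If the run-once pattern is new to you, the following desktop-friendly sketch isolates it. The names RespondSketch and g_setup_runs are made up for illustration, and the counter stands in for the pinMode() call so that the behavior can be observed without any hardware.

```cpp
#include <cassert>

// Counts how many times the one-time setup body has executed. It is a global
// only so that the effect is visible from outside the function.
int g_setup_runs = 0;

// Hypothetical stand-in for a responder function that needs one-time setup.
void RespondSketch() {
  static bool is_initialized = false;  // Initialized once, then persists.
  if (!is_initialized) {
    ++g_setup_runs;  // Stands in for pinMode(LED_BUILTIN, OUTPUT).
    is_initialized = true;
  }
}
```

No matter how many times RespondSketch() is called, g_setup_runs ends up at 1, because is_initialized keeps its value of true after the first call.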
Next, we set up another couple of static variables to keep track of the last time a “yes” was detected, and the number of inferences that have been performed:
  static int32_t last_yes_time = 0;
  static int count = 0;
Now comes the fun stuff. If the is_new_command argument is true, we know we’ve heard something, so we log it with the ErrorReporter instance. But if it’s a “yes” we heard (which we determine by checking the first character of the found_command character array), we store the current time and switch on the LED:
  if (is_new_command) {
    error_reporter->Report("Heard %s (%d) @%dms", found_command, score,
                           current_time);
    // If we heard a "yes", switch on an LED and store the time.
    if (found_command[0] == 'y') {
      last_yes_time = current_time;
      digitalWrite(LED_BUILTIN, HIGH);
    }
  }
Next, we implement the behavior that switches off the LED after a few seconds—three, to be precise:
// If last_yes_time is non-zero but was >3 seconds ago, zero it
// and switch off the LED.
  if (last_yes_time != 0) {
    if (last_yes_time < (current_time - 3000)) {
      last_yes_time = 0;
      digitalWrite(LED_BUILTIN, LOW);
    }
    // If it is non-zero but <3 seconds ago, do nothing.
    return;
  }
When the LED is switched off, we also set last_yes_time to 0, so we won’t enter this if statement until the next time a “yes” is heard. The return statement is important: it’s what prevents any further output code from running if we recently heard a “yes,” so the LED stays solidly lit.
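Because this logic depends only on timestamps, it can be modeled without any hardware, which makes it easy to check. The following is a sketch under that assumption: YesLatch and UpdateLatch are hypothetical names, and the led_on field replaces the digitalWrite() calls.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical hardware-free model of the "hold the LED" logic.
struct YesLatch {
  int32_t last_yes_time = 0;  // 0 means no "yes" is currently being held.
  bool led_on = false;
};

// Returns true when the caller should skip the per-inference toggle because
// the latch is holding the LED, or has just released it on this call.
bool UpdateLatch(YesLatch* latch, int32_t current_time, bool heard_yes) {
  assert(latch != nullptr);
  if (heard_yes) {
    latch->last_yes_time = current_time;
    latch->led_on = true;
  }
  if (latch->last_yes_time != 0) {
    if (latch->last_yes_time < current_time - 3000) {
      latch->last_yes_time = 0;  // The hold period is over.
      latch->led_on = false;
    }
    return true;  // Mirrors the early return in the responder.
  }
  return false;
}
```

One consequence of using 0 as the “nothing held” marker is that a “yes” detected at a timestamp of exactly 0 would not be held. In practice this is unlikely to matter, because the recognizer needs several inferences before it can report anything.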
So far, our implementation will switch on the LED for around three seconds when a “yes” is heard. The next part will toggle the LED on and off with each inference, except while we’re in “yes” mode, when we’re prevented from reaching this point by the aforementioned return statement.
Here’s the final chunk of code:
// Otherwise, toggle the LED every time an inference is performed.
  ++count;
  if (count & 1) {
    digitalWrite(LED_BUILTIN, HIGH);
  } else {
    digitalWrite(LED_BUILTIN, LOW);
  }
By incrementing the count variable for each inference, we keep track of the total number of inferences that we’ve performed. Inside the if conditional, we use the & operator to do a bitwise AND operation with the count variable and the number 1.
By performing an AND on count with 1, we filter out all of count’s bits except the smallest. If the smallest bit is a 0, meaning count is an even number, the result will be a 0. In a C++ if statement, this evaluates to false.
Otherwise, the result will be a 1, indicating an odd number. Because a 1 evaluates to true, our LED will switch on with odd values and off with even values. This is what makes it toggle.
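The parity test is easy to verify on its own. In this sketch, LedOnForCount is a hypothetical helper that isolates the count & 1 expression from the hardware calls.

```cpp
#include <cassert>

// Returns true when the LED should be on for this inference count.
// Hypothetical helper wrapping the `count & 1` test.
bool LedOnForCount(int count) {
  assert(count >= 0);
  return (count & 1) != 0;  // The lowest bit is set only for odd counts.
}
```

Counts of 1 and 3 give true (LED on), and counts of 0 and 2 give false (LED off), so consecutive inferences alternate the LED state.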
And that’s it! We’ve now implemented our command responder for Arduino. Let’s get it running so that we can see it in action.
To deploy this example, here’s what we’ll need:
An Arduino Nano 33 BLE Sense board
A micro-USB cable
The Arduino IDE
There’s always a chance that the build process might have changed since this book was written, so check README.md for the latest instructions.
The projects in this book are available as example code in the TensorFlow Lite Arduino library. If you haven’t already installed the library, open the Arduino IDE and select Manage Libraries from the Tools menu. In the window that appears, search for and install the library named Arduino_TensorFlowLite. You should be able to use the latest version, but if you run into issues, the version that was tested with this book is 1.14-ALPHA.
You can also install the library from a .zip file, which you can either download from the TensorFlow Lite team or generate yourself using the TensorFlow Lite for Microcontrollers Makefile. If you’d prefer to do the latter, see Appendix A.
After you’ve installed the library, the micro_speech example will show up in the File menu under Examples→Arduino_TensorFlowLite, as shown in Figure 7-8.
Click “micro_speech” to load the example. It will appear as a new window, with a tab for each of the source files. The file in the first tab, micro_speech, is equivalent to the main_functions.cc we walked through earlier.
“Running the Example” already explained the structure of the Arduino example, so we won’t cover it again here.
To run the example, plug in your Arduino device via USB. Make sure the correct device type is selected from the Board drop-down list in the Tools menu, as shown in Figure 7-9.
If your device’s name doesn’t appear in the list, you’ll need to install its support package. To do this, click Boards Manager. In the window that appears, search for your device, and then install the latest version of the corresponding support package. Next, make sure the device’s port is selected in the Port drop-down list, also in the Tools menu, as demonstrated in Figure 7-10.
Finally, in the Arduino window, click the upload button (highlighted in white in Figure 7-11) to compile and upload the code to your Arduino device.
After the upload has successfully completed, you should see the LED on your Arduino board begin to flash.
To test the program, try saying “yes.” When it detects a “yes,” the LED will remain lit solidly for around three seconds.
If you can’t get the program to recognize your “yes,” try saying it a few times in a row.
You can also see the results of inference via the Arduino Serial Monitor. To do this, open the Serial Monitor from the Tools menu. Now, try saying “yes,” “no,” and other words. You should see something like Figure 7-12.
The model we’re using is small and imperfect, and you’ll probably notice that it’s better at detecting “yes” than “no.” This is an example of how optimizing for a tiny model size can result in issues with accuracy. We cover this topic in Chapter 8.
Now that you’ve deployed the application, try playing around with the code! You can edit the source files in the Arduino IDE. When you save, you’ll be prompted to re-save the example in a new location. After you’ve made your changes, you can click the upload button in the Arduino IDE to build and deploy.
Here are a few ideas you could try:
The SparkFun Edge has both a microphone and a row of four colored LEDs—red, blue, green, and yellow—which will make displaying results easy. Figure 7-13 shows the SparkFun Edge with its LEDs highlighted.
To make it clear that our program is running, let’s toggle the blue LED on and off with each inference. We’ll switch on the yellow LED when a “yes” is heard, the red LED when a “no” is heard, and the green LED when an unknown command is heard.
The command responder for SparkFun Edge is implemented in sparkfun_edge/command_responder.cc. The file begins with some includes:
#include "tensorflow/lite/micro/examples/micro_speech/command_responder.h"
#include "am_bsp.h"
The command_responder.h include is this file’s corresponding header. am_bsp.h is the Ambiq Apollo3 SDK, which you saw in the last chapter.
Inside the function definition, the first thing we do is set up the pins connected to the LEDs as outputs:
// This implementation will light up the LEDs on the board in response to
// different commands.
void RespondToCommand(tflite::ErrorReporter* error_reporter,
                      int32_t current_time, const char* found_command,
                      uint8_t score, bool is_new_command) {
  static bool is_initialized = false;
  if (!is_initialized) {
    am_hal_gpio_pinconfig(AM_BSP_GPIO_LED_RED, g_AM_HAL_GPIO_OUTPUT_12);
    am_hal_gpio_pinconfig(AM_BSP_GPIO_LED_BLUE, g_AM_HAL_GPIO_OUTPUT_12);
    am_hal_gpio_pinconfig(AM_BSP_GPIO_LED_GREEN, g_AM_HAL_GPIO_OUTPUT_12);
    am_hal_gpio_pinconfig(AM_BSP_GPIO_LED_YELLOW, g_AM_HAL_GPIO_OUTPUT_12);
    is_initialized = true;
  }
We call the am_hal_gpio_pinconfig() function from the Apollo3 SDK to set all four LED pins to output mode, represented by the constant g_AM_HAL_GPIO_OUTPUT_12. We use the is_initialized static variable to ensure that we do this only once!
Next comes the code that will toggle the blue LED on and off. We do this using a count variable, in the same way as in the Arduino implementation:
  static int count = 0;

  // Toggle the blue LED every time an inference is performed.
  ++count;
  if (count & 1) {
    am_hal_gpio_output_set(AM_BSP_GPIO_LED_BLUE);
  } else {
    am_hal_gpio_output_clear(AM_BSP_GPIO_LED_BLUE);
  }
This code uses the am_hal_gpio_output_set() and am_hal_gpio_output_clear() functions to switch the blue LED’s pin either on or off.
By incrementing the count variable at each inference, we keep track of the total number of inferences we’ve performed. Inside the if conditional, we use the & operator to do a binary AND operation with the count variable and the number 1.
By performing an AND on count with 1, we filter out all of count’s bits except the smallest. If the smallest bit is a 0, meaning count is an even number, the result will be a 0. In a C++ if statement, this evaluates to false.
Otherwise, the result will be a 1, indicating an odd number. Because a 1 evaluates to true, our LED will switch on with odd values and off with even values. This is what makes it toggle.
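The parity test can be tried on its own, without a board. The following is a minimal sketch, not code from the example: BlueLedShouldBeOn() is a hypothetical helper that applies the same count & 1 expression, and the compile-time checks show which counts switch the LED on.

```cpp
// Hypothetical helper (not part of the example code): applies the same
// parity test that the command responder uses to toggle the blue LED.
constexpr bool BlueLedShouldBeOn(int count) {
  // count & 1 keeps only the least significant bit of count:
  // 1 for odd counts (LED on), 0 for even counts (LED off).
  return (count & 1) != 0;
}

// Checked by the compiler: the LED state alternates on successive counts.
static_assert(BlueLedShouldBeOn(1), "count 1 is odd, LED on");
static_assert(!BlueLedShouldBeOn(2), "count 2 is even, LED off");
static_assert(BlueLedShouldBeOn(3), "count 3 is odd, LED on");
```

Because count starts at 0 and is incremented before the test, the first inference sees a count of 1, so under this logic the LED starts in the on state.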
Next, we light the appropriate LED depending on which word was just heard. By default, we clear all of the LEDs, so if a word was not recently heard the LEDs will all be unlit:
  am_hal_gpio_output_clear(AM_BSP_GPIO_LED_RED);
  am_hal_gpio_output_clear(AM_BSP_GPIO_LED_YELLOW);
  am_hal_gpio_output_clear(AM_BSP_GPIO_LED_GREEN);
We then use some simple if statements to switch on the appropriate LED depending on which command was heard:
  if (is_new_command) {
    error_reporter->Report("Heard %s (%d) @%dms", found_command, score,
                           current_time);
    if (found_command[0] == 'y') {
      am_hal_gpio_output_set(AM_BSP_GPIO_LED_YELLOW);
    }
    if (found_command[0] == 'n') {
      am_hal_gpio_output_set(AM_BSP_GPIO_LED_RED);
    }
    if (found_command[0] == 'u') {
      am_hal_gpio_output_set(AM_BSP_GPIO_LED_GREEN);
    }
  }
As we saw earlier, is_new_command is true only if RespondToCommand() was called with a genuinely new command, so if a new command wasn’t heard the LEDs will remain off. Otherwise, we use the am_hal_gpio_output_set() function to switch on the appropriate LED.
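The mapping from the first letter of found_command to an LED can be checked without any hardware. This sketch assumes a hypothetical Led enum and LedForCommand() helper; neither exists in the Apollo3 SDK or in the example, and they only mirror the mapping described above.

```cpp
// Hypothetical names for illustration only; the real code calls
// am_hal_gpio_output_set() directly instead of returning a value.
enum class Led { kNone, kYellow, kRed, kGreen };

// Chooses an LED from the first character of the command label,
// in the same way as the chain of if statements in RespondToCommand().
constexpr Led LedForCommand(const char* found_command) {
  return found_command[0] == 'y' ? Led::kYellow   // "yes"
       : found_command[0] == 'n' ? Led::kRed      // "no"
       : found_command[0] == 'u' ? Led::kGreen    // "unknown"
                                 : Led::kNone;    // anything else
}

static_assert(LedForCommand("yes") == Led::kYellow, "yes lights yellow");
static_assert(LedForCommand("no") == Led::kRed, "no lights red");
static_assert(LedForCommand("unknown") == Led::kGreen, "unknown lights green");
```

Testing only the first character is cheap, but it relies on every label starting with a different letter. A label such as "silence" matches none of the three tests, so no LED is lit for it.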
We’ve now walked through how our example code lights up LEDs on the SparkFun Edge. Next, let’s get the example up and running.
There’s always a chance that the build process might have changed since this book was written, so check README.md for the latest instructions.
To build and deploy our code, we’ll need the following:
A SparkFun Edge board
A USB programmer (we recommend the SparkFun Serial Basic Breakout, which is available in micro-B USB and USB-C variants)
A matching USB cable
Python 3 and some dependencies
Chapter 6 shows how to confirm whether you have the correct version of Python installed. If you already did this, great. If not, it’s worth flipping back to “Running the Example” to take a look.
In your terminal, clone the TensorFlow repository and then change into its directory:
git clone https://github.com/tensorflow/tensorflow.git
cd tensorflow
Next, we’re going to build the binary and run some commands that get it ready for downloading to the device. To avoid some typing, you can copy and paste these commands from README.md.
The following command downloads all of the required dependencies and then compiles a binary for the SparkFun Edge:
make -f tensorflow/lite/micro/tools/make/Makefile TARGET=sparkfun_edge TAGS=cmsis-nn micro_speech_bin
The binary is created as a .bin file, in the following location:
tensorflow/lite/micro/tools/make/gen/sparkfun_edge_cortex-m4/bin/micro_speech.bin
To check whether the file exists, you can use the following command:
test -f tensorflow/lite/micro/tools/make/gen/sparkfun_edge_cortex-m4/bin/micro_speech.bin \
  && echo "Binary was successfully created" || echo "Binary is missing"
If you run that command, you should see Binary was successfully created printed to the console. If you see Binary is missing, there was a problem with the build process. If so, it’s likely that there are some clues to what went wrong in the output of the make command.
The binary must be signed with cryptographic keys to be deployed to the device. Let’s now run some commands that will sign the binary so it can be flashed to the SparkFun Edge. The scripts used here come from the Ambiq SDK, which is downloaded when the Makefile is run.
Enter the following command to set up some dummy cryptographic keys that you can use for development:
cp tensorflow/lite/micro/tools/make/downloads/AmbiqSuite-Rel2.0.0/tools/apollo3_scripts/keys_info0.py \
  tensorflow/lite/micro/tools/make/downloads/AmbiqSuite-Rel2.0.0/tools/apollo3_scripts/keys_info.py
Next, run the following command to create a signed binary. Substitute python3 with python if necessary:
python3 tensorflow/lite/micro/tools/make/downloads/AmbiqSuite-Rel2.0.0/tools/apollo3_scripts/create_cust_image_blob.py \
  --bin tensorflow/lite/micro/tools/make/gen/sparkfun_edge_cortex-m4/bin/micro_speech.bin \
  --load-address 0xC000 \
  --magic-num 0xCB -o main_nonsecure_ota \
  --version 0x0
This creates the file main_nonsecure_ota.bin. Now run this command to create a final version of the file that can be used to flash your device with the script you will use in the next step:
python3 tensorflow/lite/micro/tools/make/downloads/AmbiqSuite-Rel2.0.0/tools/apollo3_scripts/create_cust_wireupdate_blob.py \
  --load-address 0x20000 \
  --bin main_nonsecure_ota.bin \
  -i 6 \
  -o main_nonsecure_wire \
  --options 0x1
You should now have a file called main_nonsecure_wire.bin in the directory where you ran the commands. This is the file you’ll be flashing to the device.
To download new programs to the board, you’ll use the SparkFun USB-C Serial Basic serial programmer. This device allows your computer to communicate with the microcontroller via USB.
To attach this device to your board, perform the following steps:
On the side of the SparkFun Edge, locate the six-pin header.
Plug the SparkFun USB-C Serial Basic into these pins, ensuring the pins labeled BLK and GRN on each device are lined up correctly, as illustrated in Figure 7-14.
You connect the board to your computer via USB. To program the board, you need to find out the name that your computer gives the device. The best way of doing this is to list all the computer’s devices before and after attaching it, and look to see which device is new.
Some people have reported issues with their operating system’s default drivers for the programmer, so we strongly recommend installing the driver before you continue.
Before attaching the device via USB, run the following command:
# macOS:
ls /dev/cu*

# Linux:
ls /dev/tty*
This should output a list of attached devices that looks something like the following:
/dev/cu.Bluetooth-Incoming-Port /dev/cu.MALS /dev/cu.SOC
Now, connect the programmer to your computer’s USB port and run the command again:
# macOS:
ls /dev/cu*

# Linux:
ls /dev/tty*
You should see an extra item in the output, as shown in the example that follows. Your new item might have a different name. This new item is the name of the device:
/dev/cu.Bluetooth-Incoming-Port /dev/cu.MALS /dev/cu.SOC /dev/cu.wchusbserial-1450
This name will be used to refer to the device. However, it can change depending on which USB port the programmer is attached to, so if you disconnect the board from your computer and then reattach it, you might need to look up its name again.
Some users have reported two devices appearing in the list. If you see two devices, the correct one to use begins with the letters “wch”; for example, “/dev/wchusbserial-14410.”
After you’ve identified the device name, put it in a shell variable for later use:
export DEVICENAME=<your device name here>
This is a variable that you can use when running commands that require the device name, later in the process.
To flash the board, you must put it into a special “bootloader” state that prepares it to receive the new binary. You’ll then run a script to send the binary to the board.
First create an environment variable to specify the baud rate, which is the speed at which data will be sent to the device:
export BAUD_RATE=921600
Now paste the command that follows into your terminal, but do not press Enter yet! The ${DEVICENAME} and ${BAUD_RATE} in the command will be replaced with the values you set in the previous sections. Remember to substitute python3 with python if necessary:
python3 tensorflow/lite/micro/tools/make/downloads/AmbiqSuite-Rel2.0.0/tools/apollo3_scripts/uart_wired_update.py \
  -b ${BAUD_RATE} ${DEVICENAME} \
  -r 1 \
  -f main_nonsecure_wire.bin \
  -i 6
Next, you’ll reset the board into its bootloader state and flash the board. On the board, locate the buttons marked RST and 14, as shown in Figure 7-15. Perform the following steps:
Ensure that your board is connected to the programmer and the entire thing is connected to your computer via USB.
On the board, press and hold the button marked 14. Continue holding it.
While still holding the button marked 14, press the button marked RST to reset the board.
Press Enter on your computer to run the script. Continue holding button 14.
You should now see something like the following appearing on your screen:
Connecting with Corvette over serial port /dev/cu.usbserial-1440...
Sending Hello.
Received response for Hello
Received Status
length = 0x58
version = 0x3
Max Storage = 0x4ffa0
Status = 0x2
State = 0x7
AMInfo =
0x1
0xff2da3ff
0x55fff
0x1
0x49f40003
0xffffffff
[...lots more 0xffffffff...]
Sending OTA Descriptor = 0xfe000
Sending Update Command.
number of updates needed = 1
Sending block of size 0x158b0 from 0x0 to 0x158b0
Sending Data Packet of length 8180
Sending Data Packet of length 8180
[...lots more Sending Data Packet of length 8180...]
Keep holding button 14 until you see Sending Data Packet of length 8180. You can release the button after seeing this (but it’s okay if you keep holding it). The program will continue to print lines on the terminal. Eventually, you’ll see something like the following:
[...lots more Sending Data Packet of length 8180...]
Sending Data Packet of length 8180
Sending Data Packet of length 6440
Sending Reset Command.
Done.
This indicates a successful flashing.
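The numbers in the sample output are consistent with one another, which makes for a quick sanity check. The sketch below assumes that the script splits the block into fixed-size packets with one shorter packet at the end; that is an inference from the log, not a description of the script's internals. PacketCount() and LastPacketBytes() are hypothetical helpers written for this illustration.

```cpp
#include <cstdint>

// Hypothetical helpers: how many packets are needed to send total_bytes
// when each packet carries at most packet_bytes (ceiling division).
constexpr std::uint32_t PacketCount(std::uint32_t total_bytes,
                                    std::uint32_t packet_bytes) {
  return (total_bytes + packet_bytes - 1) / packet_bytes;
}

// Size of the final packet: the remainder, or a full packet if the
// total divides evenly.
constexpr std::uint32_t LastPacketBytes(std::uint32_t total_bytes,
                                        std::uint32_t packet_bytes) {
  return total_bytes % packet_bytes == 0 ? packet_bytes
                                         : total_bytes % packet_bytes;
}

// Values taken from the sample log: a block of 0x158b0 bytes (88,240 in
// decimal) sent in packets of 8180 bytes.
static_assert(PacketCount(0x158b0, 8180) == 11, "10 full packets plus 1");
static_assert(LastPacketBytes(0x158b0, 8180) == 6440, "matches the log");
```

The computed final packet of 6440 bytes matches the last Sending Data Packet line in the output. Your own binary will probably have a different size, so expect different numbers.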
To make sure the program is running, press the RST button. You should now see the blue LED flashing.
To test the program, try saying “yes.” When it detects a “yes,” the yellow LED will flash. The model is also trained to recognize “no,” and to detect when unknown words are spoken. The red LED should flash for “no,” and the green for unknown.
If you can’t get the program to recognize your “yes,” try saying it a few times in a row: “yes, yes, yes.”
The model we’re using is small and imperfect, and you’ll probably notice that it’s better at detecting “yes” than “no,” which it often recognizes as “unknown.” This is an example of how optimizing for a tiny model size can result in issues with accuracy. We cover this topic in Chapter 8.
The program will also log successful recognitions to the serial port. To view this data, we can monitor the board’s serial port output using a baud rate of 115200. On macOS and Linux, the following command should work:
screen ${DEVICENAME} 115200
You should initially see output that looks something like the following:
Apollo3 Burst Mode is Available
Apollo3 operating in Burst Mode (96MHz)
Try issuing some commands by saying “yes” or “no.” You should see the board printing debug information for each command:
Heard yes (202) @65536ms
To stop viewing the debug output with screen, press Ctrl-A immediately followed by the K key, and then press the Y key.
Now that you’ve deployed the basic application, try playing around and making some changes. You can find the application’s code in the tensorflow/lite/micro/examples/micro_speech folder. Just edit and save and then repeat the preceding instructions to deploy your modified code to the device.
Here are a few things that you could try:
RespondToCommand()’s score argument shows the prediction score. Use the LEDs as a meter to show the strength of the match.
Make the application respond to a specific sequence of “yes” and “no” commands, like a secret code phrase.
Use the “yes” and “no” commands to control other components, like additional LEDs or servos.
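As a starting point for the first idea, here is one hedged way to turn the score into a number of LEDs. LedsForScore() is a hypothetical helper, and splitting the 0 to 255 range into four equal bands is an arbitrary choice made for this sketch; you would still need to call the SDK's GPIO functions to light the LEDs.

```cpp
#include <cstdint>

// Hypothetical helper: maps an 8-bit score to a number of LEDs to light.
// The range 0..255 is split into four equal bands of 64 values each,
// so the result is always between 1 and 4.
constexpr int LedsForScore(std::uint8_t score) {
  return score / 64 + 1;
}

static_assert(LedsForScore(0) == 1, "lowest band lights one LED");
static_assert(LedsForScore(64) == 2, "second band starts at 64");
static_assert(LedsForScore(202) == 4, "a score of 202 lights all four");
```

In practice the scores you see logged may cluster near the top of the range, in which case narrower bands over a smaller interval would give a more informative meter.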
Because the STM32F746G comes with a fancy LCD display, we can use this to show off whichever wake words are detected, as depicted in Figure 7-16.
The STM32F746G’s LCD driver gives us methods that we can use to write text to the display. In this exercise, we’ll use these to show one of the following messages, depending on which command was heard:
“Heard yes!”
“Heard no :(”
“Heard unknown”
“Heard silence”
We’ll also set the background color differently depending on which command was heard.
To begin, we include some header files:
#include "tensorflow/lite/micro/examples/micro_speech/command_responder.h"
#include "LCD_DISCO_F746NG.h"
The first, command_responder.h, just declares the interface for this file. The second, LCD_DISCO_F746NG.h, gives us an interface to control the device’s LCD display. You can read more about it on the Mbed site.
Next, we instantiate an LCD_DISCO_F746NG object, which holds the methods we use to control the LCD:
LCD_DISCO_F746NG lcd;
In the next few lines, the RespondToCommand() function is declared, and we check whether it has been called with a new command:
// When a command is detected, write it to the display and log it to the
// serial port.
void RespondToCommand(tflite::ErrorReporter *error_reporter,
                      int32_t current_time, const char *found_command,
                      uint8_t score, bool is_new_command) {
  if (is_new_command) {
    error_reporter->Report("Heard %s (%d) @%dms", found_command, score,
                           current_time);
When we know this is a new command, we use the error_reporter to log it to the serial port.
Next, we use a big if statement to determine what happens when each command is found. First comes “yes”:
    if (*found_command == 'y') {
      lcd.Clear(0xFF0F9D58);
      lcd.DisplayStringAt(0, LINE(5), (uint8_t *)"Heard yes!", CENTER_MODE);
We use lcd.Clear() to both clear any previous content from the screen and set a new background color, like a fresh coat of paint. The color 0xFF0F9D58 is a nice, rich green.
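To see why that value reads as green, it helps to split it into channels. The sketch below assumes the color is packed as ARGB8888 (alpha in the top byte, followed by red, green, and blue); the text does not state the format, so treat this as an assumption. The helper names are hypothetical and not part of the LCD library.

```cpp
#include <cstdint>

// Hypothetical helpers, assuming an ARGB8888 layout: 0xAARRGGBB.
constexpr std::uint8_t Alpha(std::uint32_t argb) {
  return static_cast<std::uint8_t>((argb >> 24) & 0xFFu);
}
constexpr std::uint8_t Red(std::uint32_t argb) {
  return static_cast<std::uint8_t>((argb >> 16) & 0xFFu);
}
constexpr std::uint8_t Green(std::uint32_t argb) {
  return static_cast<std::uint8_t>((argb >> 8) & 0xFFu);
}
constexpr std::uint8_t Blue(std::uint32_t argb) {
  return static_cast<std::uint8_t>(argb & 0xFFu);
}

// 0xFF0F9D58: fully opaque, with green as the strongest channel.
static_assert(Alpha(0xFF0F9D58) == 0xFF, "opaque");
static_assert(Red(0xFF0F9D58) == 0x0F, "very little red");
static_assert(Green(0xFF0F9D58) == 0x9D, "dominant green");
static_assert(Blue(0xFF0F9D58) == 0x58, "some blue");
```

Under that assumption, you can pick your own background colors by choosing the three channel values and keeping the top byte at 0xFF.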
On our green background, we use lcd.DisplayStringAt() to draw some text. The first argument specifies an x coordinate, the second specifies a y coordinate. To position our text roughly in the middle of the display, we use a helper function, LINE(), to determine the y coordinate that would correspond to the fifth line of text on the screen.
The third argument is the string of text we’ll be displaying, and the fourth argument determines the alignment of the text; here, we use the constant CENTER_MODE to specify that the text is center-aligned.
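The text does not show how LINE() arrives at a y coordinate. A plausible model, offered here as an assumption rather than the library's documented behavior, is that it multiplies the line index by the height of the current font. LineToY() and the 24-pixel font height are hypothetical values chosen for this sketch.

```cpp
// Hypothetical stand-in for LINE(): converts a zero-based text line
// index into a y coordinate, assuming each line is one font height tall.
constexpr int LineToY(int line, int font_height_px) {
  return line * font_height_px;
}

// With an assumed 24-pixel font, line 5 begins 120 pixels from the top.
static_assert(LineToY(5, 24) == 120, "line 5 at 120 px");
static_assert(LineToY(0, 24) == 0, "line 0 at the top edge");
```

If the model holds, changing the font size would move the text, which is one reason to position it through a helper rather than a hardcoded pixel value.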
We continue the if statement to cover the remaining three possibilities, “no,” “unknown,” and “silence” (which is captured by the else block):
    } else if (*found_command == 'n') {
      lcd.Clear(0xFFDB4437);
      lcd.DisplayStringAt(0, LINE(5), (uint8_t *)"Heard no :(", CENTER_MODE);
    } else if (*found_command == 'u') {
      lcd.Clear(0xFFF4B400);
      lcd.DisplayStringAt(0, LINE(5), (uint8_t *)"Heard unknown", CENTER_MODE);
    } else {
      lcd.Clear(0xFF4285F4);
      lcd.DisplayStringAt(0, LINE(5), (uint8_t *)"Heard silence", CENTER_MODE);
    }
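Each branch differs only in its color and its message, so the choice can also be expressed as a small lookup that is easy to test away from the board. This is an alternative sketch, not how the example is written: the Screen struct and ScreenForCommand() are hypothetical, and only the color values and strings come from the code above.

```cpp
#include <cstdint>

// Hypothetical pairing of a background color with the text to display.
struct Screen {
  std::uint32_t background;
  const char* message;
};

// Picks the screen contents from the first character of the command,
// falling back to "silence" exactly as the else block does.
constexpr Screen ScreenForCommand(const char* found_command) {
  return *found_command == 'y' ? Screen{0xFF0F9D58, "Heard yes!"}
       : *found_command == 'n' ? Screen{0xFFDB4437, "Heard no :("}
       : *found_command == 'u' ? Screen{0xFFF4B400, "Heard unknown"}
                               : Screen{0xFF4285F4, "Heard silence"};
}

static_assert(ScreenForCommand("yes").background == 0xFF0F9D58, "green");
static_assert(ScreenForCommand("silence").background == 0xFF4285F4, "blue");
```

With a helper like this, the responder would need only one lcd.Clear() call and one lcd.DisplayStringAt() call, each fed from the returned struct.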
And that’s it! Because the LCD library gives us such easy high-level control over the display, it doesn’t take much code to output our results. Let’s deploy the example to see this all in action.
Now we can use the Mbed toolchain to deploy our application to the device.
There’s always a chance that the build process might have changed since this book was written, so check README.md for the latest instructions.
Before we begin, we’ll need the following:
An STM32F746G Discovery kit board
A mini-USB cable
The Arm Mbed CLI (follow the Mbed setup guide)
Python 3 and pip
Like the Arduino IDE, Mbed requires source files to be structured in a certain way. The TensorFlow Lite for Microcontrollers Makefile knows how to do this for us and can generate a directory suitable for Mbed.
To do so, run the following command:
make -f tensorflow/lite/micro/tools/make/Makefile TARGET=mbed TAGS="cmsis-nn disco_f746ng" generate_micro_speech_mbed_project
This results in the creation of a new directory:
tensorflow/lite/micro/tools/make/gen/mbed_cortex-m4/prj/micro_speech/mbed
This directory contains all of the example’s dependencies structured in the correct way for Mbed to be able to build it.
First, change into the directory so that you can run some commands within it:
cd tensorflow/lite/micro/tools/make/gen/mbed_cortex-m4/prj/micro_speech/mbed
Next, you’ll use Mbed to download the dependencies and build the project.
To begin, use the following command to inform Mbed that the current directory is the root of an Mbed project:
mbed config root .
Next, instruct Mbed to download the dependencies and prepare to build:
mbed deploy
By default, Mbed builds the project using C++98. However, TensorFlow Lite requires C++11. Run the following Python snippet to modify the Mbed configuration files so that it uses C++11. You can just type or paste it into the command line:
python -c 'import fileinput, glob;
for filename in glob.glob("mbed-os/tools/profiles/*.json"):
  for line in fileinput.input(filename, inplace=True):
    print(line.replace("\"-std=gnu++98\"","\"-std=c++11\", \"-fpermissive\""))'
Finally, run the following command to compile:
mbed compile -m DISCO_F746NG -t GCC_ARM
This should result in a binary at the following path:
./BUILD/DISCO_F746NG/GCC_ARM/mbed.bin
One of the nice things about the STM32F746G board is that deployment is really easy. To deploy, just plug in your STM board and copy the file to it. On macOS, you can do this by using the following command:
cp ./BUILD/DISCO_F746NG/GCC_ARM/mbed.bin /Volumes/DIS_F746NG/
Alternatively, just find the DIS_F746NG volume in your file browser and drag the file over.
When this is complete, try saying “yes.” You should see the appropriate text appear on the display and the background color change.
If you can’t get the program to recognize your “yes,” try saying it a few times in a row, like “yes, yes, yes.”
The model we’re using is small and imperfect, and you’ll probably notice that it’s better at detecting “yes” than “no,” which it often recognizes as “unknown.” This is an example of how optimizing for a tiny model size can result in issues with accuracy. We cover this topic in Chapter 8.
The program also logs successful recognitions to the serial port. To view the output, establish a serial connection to the board using a baud rate of 9600.
On macOS and Linux, the device should be listed when you issue the following command:
ls /dev/tty*
It will look something like the following:
/dev/tty.usbmodem1454203
After you’ve identified the device, use the following command to connect to it, replacing <tty.devicename> with the name of your device as it appears in /dev:
screen /dev/<tty.devicename> 9600
Try issuing some commands by saying “yes” or “no.” You should see the board printing debug information for each command:
Heard yes (202) @65536ms
To stop viewing the debug output with screen, press Ctrl-A, immediately followed by the K key, and then press the Y key.
If you’re not sure how to make a serial connection on your platform, you could try CoolTerm, which works on Windows, macOS, and Linux. The board should show up in CoolTerm’s Port drop-down list. Make sure you set the baud rate to 9600.
Now that you’ve deployed the application, it could be fun to play around and make some changes. You can find the application’s code in the tensorflow/lite/micro/tools/make/gen/mbed_cortex-m4/prj/micro_speech/mbed folder. Just edit and save and then repeat the preceding instructions to deploy your modified code to the device.
Here are a few things you could try:
RespondToCommand()’s score argument shows the prediction score. Create a visual indicator of the score on the LCD display.
Make the application respond to a specific sequence of “yes” and “no” commands, like a secret code phrase.
Use the “yes” and “no” commands to control other components, like additional LEDs or servos.
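For the first idea, one possible indicator is a horizontal bar whose length tracks the score. The sketch below only computes the bar's width; BarWidthPx() is a hypothetical helper, and the 480-pixel display width is an assumption you should check against your board's documentation. Drawing the bar would require a rectangle-drawing call from the LCD library, which is not covered here.

```cpp
#include <cstdint>

// Hypothetical helper: scales an 8-bit score (0..255) linearly to a
// bar width in pixels, so that a score of 255 fills the whole display.
constexpr int BarWidthPx(std::uint8_t score, int display_width_px) {
  return static_cast<int>(score) * display_width_px / 255;
}

// Assuming a display 480 pixels wide.
static_assert(BarWidthPx(0, 480) == 0, "no bar for a zero score");
static_assert(BarWidthPx(202, 480) == 380, "202 fills most of the width");
static_assert(BarWidthPx(255, 480) == 480, "full width at the maximum");
```

Integer division rounds down, so the bar can be up to one pixel shorter than the exact proportion, which is unlikely to matter visually.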
The application code we’ve walked through has been mostly concerned with capturing data from the hardware and then extracting features that are suitable for inference. The part that actually feeds data into the model and runs inference is relatively small, and it’s very similar to the example covered in Chapter 6.
This is fairly typical of machine learning projects. The model is already trained, thus our job is just to keep it fed with the appropriate sort of data. As an embedded developer working with TensorFlow Lite, you’ll be spending most of your programming time on capturing sensor data, processing it into features, and responding to the output of your model. The inference part itself is quick and easy.
But the embedded application is only part of the package—the really fun part is the model. In Chapter 8, you’ll learn how to train your own speech model to listen for different words. You’ll also learn more about how it works.