Chapter 13. TensorFlow Lite for Microcontrollers

In this chapter we look at the software framework we’ve been using for all of the examples in the book: TensorFlow Lite for Microcontrollers. We go into a lot of detail, but you don’t need to understand everything we cover to use it in an application. If you’re not interested in what’s happening under the hood, feel free to skip this chapter; you can always return to it when you have questions. If you do want to better understand the tool you’re using to run machine learning, we cover the history and inner workings of the library here.

What Is TensorFlow Lite for Microcontrollers?

The first question you might ask is what the framework actually does. To understand that, it helps to break the (rather long) name down a bit and explain the components.

TensorFlow

You may well have heard of TensorFlow itself if you’ve looked into machine learning. TensorFlow is Google’s open source machine learning library, with the motto “An Open Source Machine Learning Framework for Everyone.” It was developed internally at Google and first released to the public in 2015. Since then a large external community has grown up around the software, with more contributors outside Google than inside. It’s aimed at Linux, Windows, and macOS desktop and server platforms and offers a lot of tools, examples, and optimizations around training and deploying models in the cloud. It’s the main machine learning library used within Google to power its products, and the core code itself is the same across the internal and published versions.

There are also a large number of examples and tutorials available from Google and other sources. These can show you how to train and use models for everything from speech recognition to data center power management or video analysis.

The biggest need when TensorFlow was launched was the ability to train models and run them in desktop environments. This influenced a lot of the design decisions, such as trading the size of the executable for lower latency and more functionality—on a cloud server where even RAM is measured in gigabytes and there are terabytes of storage space, having a binary that’s a couple of hundred megabytes in size is not a problem. Another example is that its main interface language at launch was Python, a scripting language widely used on servers.

These engineering trade-offs weren’t as appropriate for other platforms, though. On Android and iPhone devices, adding even a few megabytes to the size of an app can decrease the number of downloads and customer satisfaction dramatically. You can build TensorFlow for these phone platforms, but by default it adds 20 MB to the application size, and even with some work never shrinks below 2 MB.

TensorFlow Lite

To meet these lower size requirements for mobile platforms, in 2017 Google started a companion project to mainline TensorFlow called TensorFlow Lite. This library is aimed at running neural network models efficiently and easily on mobile devices. To reduce the size and complexity of the framework, it drops features that are less common on these platforms. For example, it doesn’t support training, just running inference on models that were previously trained on a cloud platform. It also doesn’t support the full range of data types (such as double) available in mainline TensorFlow. Additionally, some less-used operations aren’t present, like tf.depth_to_space. You can find the latest compatibility information on the TensorFlow website.

In return for these trade-offs, TensorFlow Lite can fit within just a few hundred kilobytes, making it much easier to fit into a size-constrained application. It also has highly optimized libraries for Arm Cortex-A-series CPUs, along with support for Android’s Neural Network API for accelerators, and GPUs through OpenGL. Another key advantage is that it has good support for 8-bit quantization of networks. Because a model might have millions of parameters, the 75% size reduction from 32-bit floats to 8-bit integers alone makes it worthwhile, but there are also specialized code paths that allow inference to run much faster on the smaller data type.
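To make the quantization trade-off concrete, here is a minimal sketch of the affine 8-bit scheme being referred to, in which a real value is approximated as scale * (quantized_value - zero_point). The helper names and the scale and zero-point numbers are made up for illustration and are not taken from TensorFlow Lite's code.

#include <algorithm>
#include <cmath>
#include <cstdint>

// Affine 8-bit quantization: real_value is approximately
// scale * (quantized_value - zero_point).
uint8_t QuantizeToUint8(float real_value, float scale, int zero_point) {
  const int quantized =
      static_cast<int>(std::round(real_value / scale)) + zero_point;
  return static_cast<uint8_t>(std::min(255, std::max(0, quantized)));
}

float DequantizeFromUint8(uint8_t quantized, float scale, int zero_point) {
  return scale * (static_cast<int>(quantized) - zero_point);
}

// For example, with scale = 0.05 and zero_point = 128, the float 0.25 maps to
// the byte 133, and dequantizing 133 gives 0.25 back.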

TensorFlow Lite for Microcontrollers

TensorFlow Lite has been widely adopted by mobile developers, but its engineering trade-offs didn’t meet the requirements of all platforms. The team noticed that a lot of Google and external products being built on embedded platforms could benefit from machine learning, but the existing TensorFlow Lite library wouldn’t fit on them. Again, the biggest constraint was binary size. For these environments even a few hundred kilobytes was too large; they needed something that would fit within 20 KB or less. A lot of the dependencies that mobile developers take for granted, like the C Standard Library, weren’t present either, so no code that relied on those libraries could be used. A lot of the requirements were very similar, though. Inference was the primary use case, quantized networks were important for performance, and having a code base that was simple enough for developers to explore and modify was a priority.

With those needs in mind, in 2018 a team at Google (including the authors of this book) started experimenting with a specialized version of TensorFlow Lite aimed just at these embedded platforms. The goal was to reuse as much of the code, tooling, and documentation from the mobile project as possible, while satisfying the tough requirements of embedded environments. To make sure Google was building something practical, the team focused on the real-world use case of recognizing a spoken “wake word,” similar to the “Hey Google” or “Alexa” examples from commercial voice interfaces. By aiming at an end-to-end solution to this problem, the team worked to ensure that what it designed was usable in production systems.

Requirements

The Google team knew that running in embedded environments imposed a lot of constraints on how the code could be written, so it identified some key requirements for the library:

No operating system dependencies

A machine learning model is fundamentally a mathematical black box where numbers are fed in, and numbers are returned as the results. Access to the rest of the system shouldn’t be necessary to perform these operations, so it’s possible to write a machine learning framework without calls to the underlying operating system. Some of the targeted platforms don’t have an OS at all, and avoiding any references to files or devices in the basic code made it possible to port to those chips.

No standard C or C++ library dependencies at linker time

This is a bit subtler than the OS requirement, but the team was aiming to deploy on devices that might have only a few tens of kilobytes of memory to store a program, so the binary size was very important. Even apparently simple functions like sprintf() can easily take up 20 KB by themselves, so the team aimed to avoid anything that had to be pulled in from the library archives that hold the implementations of the C and C++ standard libraries. This was tricky because there’s no well-defined boundary between header-only dependencies (like stdint.h, which holds the sizes of data types) and linker-time parts of the standard libraries (such as many string functions or sprintf()). In practice the team had to use some common sense: generally, compile-time constants and macros were fine, but anything more complex should be avoided. The one exception to this linker avoidance is the standard C math library, which is relied on for things like trigonometric functions that do need to be linked in.

No floating-point hardware expected

Many embedded platforms don’t have support for floating-point arithmetic in hardware, so the code had to avoid any performance-critical uses of floats. This meant focusing on models with 8-bit integer parameters, and using 8-bit arithmetic within operations (though for compatibility the framework also supports float ops if they’re needed).

No dynamic memory allocation

A lot of applications using microcontrollers need to run continuously for months or years. If the main loop of a program is allocating and deallocating memory using malloc()/new and free()/delete, it’s very difficult to guarantee that the heap won’t eventually end up in a fragmented state, causing an allocation failure and a crash. There’s also very little memory available on most embedded systems, so upfront planning of this limited resource is more important than on other platforms, and without an OS there might not even be a heap and allocation routines. This means that embedded applications often avoid using dynamic memory allocation entirely. Because the library was designed to be used by those applications, it needed to do the same. In practice the framework asks the calling application to pass in a small, fixed-size arena that the framework can use for temporary allocations (like activation buffers) at initialization time. If the arena is too small, the library will return an error immediately and the client will need to recompile with a larger arena. Otherwise, the calls to perform inference happen with no further memory allocations, so they can be made repeatedly with no risk of heap fragmentation or memory errors.
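Here's a rough sketch of what that looks like from the application's side. The arena size, the function and variable names, and the exact MicroInterpreter constructor arguments are placeholders based on the API as described in this book; they may differ in the version of the library you're using.

#include <cstdint>

#include "tensorflow/lite/micro/micro_interpreter.h"

// A static, fixed-size arena supplied by the application; no heap is needed.
// 10 KB is an arbitrary example value; size it to fit your own model.
constexpr int kTensorArenaSize = 10 * 1024;
static uint8_t tensor_arena[kTensorArenaSize];

TfLiteStatus SetUpInterpreter(const tflite::Model* model,
                              const tflite::OpResolver& resolver,
                              tflite::ErrorReporter* error_reporter,
                              tflite::MicroInterpreter** out_interpreter) {
  static tflite::MicroInterpreter interpreter(model, resolver, tensor_arena,
                                              kTensorArenaSize, error_reporter);
  *out_interpreter = &interpreter;
  // All temporary allocations come out of tensor_arena; if it is too small,
  // this returns an error now rather than failing later during inference.
  return interpreter.AllocateTensors();
}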

The team also decided against some other constraints that are common in the embedded community because they would make sharing code and maintaining compatibility with mobile TensorFlow Lite too difficult. Therefore:

It requires C++11

It’s common to write embedded programs in C, and some platforms don’t have toolchains that support C++ at all, or support older versions than the 2011 revision of the standard. TensorFlow Lite is mostly written in C++, with some plain C APIs, which makes calling it from other languages easier. It doesn’t rely on advanced features like complex templates; its style is in the spirit of a “better C” with classes to help modularize the code. Rewriting the framework in C would have taken a lot of work and been a step backward for users on mobile platforms, and when we surveyed the most popular platforms, we found they all had C++11 support already, so the team decided to trade support for older devices against making it easier to share code across all flavors of TensorFlow Lite.

It expects 32-bit processors

There are a massive number of different hardware platforms available in the embedded world, but the trend in recent years has been toward 32-bit processors, rather than the 16-bit or 8-bit chips that used to be common. After surveying the ecosystem, Google decided to focus its development on the newer 32-bit devices because that kept assumptions like the C int data type being 32 bits the same across mobile and embedded versions of the framework. We have had reports of successful ports to some 16-bit platforms, but these rely on modern toolchains that compensate for the limitations, and are not our main priority.

Why Is the Model Interpreted?

One question that comes up a lot is why we chose to interpret models at runtime rather than doing code generation from a model ahead of time. Explaining that decision involves teasing apart some of the benefits and problems of the different approaches involved.

Code generation involves converting a model directly into C or C++ code, with all of the parameters stored as data arrays in the code and the architecture expressed as a series of function calls that pass activations from one layer to the next. This code is often output into a single large source file with a handful of entry points. That file can then be included in an IDE or toolchain directly, and compiled like any other code. Here are a few of the key advantages of code generation:

Ease of building

Users told us the number one benefit was how easy it makes integrating into build systems. If all you have is a few C or C++ files, with no external library dependencies, you can easily drag and drop them into almost any IDE and get a project built with few chances for things to go wrong.

Modifiability

When you have a small amount of code in a single implementation file, it’s much simpler to step through and change the code if you need to, at least compared to a large library for which you first need to establish what implementations are even being used.

Inline data

The data for the model itself can be stored as part of the implementation source code, so no additional files are required. It can also be stored directly as an in-memory data structure, so no loading or parsing step is required.

Code size

If you know what model and platform you’re building for ahead of time, you can avoid including code that will never be called, so the size of the program segment can be kept minimal.
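To make the contrast with the interpreted approach described next more concrete, here is a purely hypothetical sketch of the shape that code-generated model source tends to take: the parameters are baked into the file as data arrays, and the architecture is frozen into a fixed chain of function calls. None of these names or values come from a real generator.

#include <cstdint>

// Hypothetical generated file: weights stored as arrays, and the network
// expressed as a fixed sequence of calls that pass activations along.
const int8_t g_layer0_weights[4] = {12, -3, 7, 1};
const int8_t g_layer1_weights[4] = {5, -9, 2, 4};

static void RunLayer0(const int8_t* in, int8_t* out) {
  for (int i = 0; i < 4; ++i) {
    out[i] = static_cast<int8_t>(in[i] * g_layer0_weights[i]);
  }
}

static void RunLayer1(const int8_t* in, int8_t* out) {
  for (int i = 0; i < 4; ++i) {
    out[i] = static_cast<int8_t>(in[i] * g_layer1_weights[i]);
  }
}

// The single entry point that an application links against and calls.
void RunGeneratedModel(const int8_t* input, int8_t* output) {
  int8_t activations[4];
  RunLayer0(input, activations);
  RunLayer1(activations, output);
}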

Interpreting a model is a different approach, and relies on loading a data structure that defines the model. The executed code is static; only the model data changes, and the information in the model controls which operations are executed and where parameters are drawn from. This is more like running a script in an interpreted language like Python, whereas you can see code generation as being closer to traditional compiled languages like C. Here are some of the drawbacks of code generation, compared to interpreting a model data structure:

Upgradability

What happens if you’ve locally modified the generated code but you want to upgrade to a newer version of the overall framework to get new functionality or optimizations? You’ll either need to manually cherry-pick changes into your local files or regenerate them entirely and try to patch back in your local changes.

Multiple models

It’s difficult to support more than one model at a time through code generation without a lot of source duplication.

Replacing models

Each model is expressed as a mixture of source code and data arrays within the program, so it’s difficult to change the model without recompiling the entire program.

What the team realized was that it’s possible to get a lot of the benefits of code generation, without incurring the drawbacks, using what we term project generation.

Project Generation

In TensorFlow Lite, project generation is a process that creates a copy of just the source files you need to build a particular model, without making any changes to them, and also optionally sets up any IDE-specific project files so that they can be built easily. It retains most of the benefits of code generation, but it has some key advantages:

Upgradability

All of the source files are just copies of originals from the main TensorFlow Lite code base, and they appear in the same location in the folder hierarchy, so if you make local modifications they can easily be ported back to the original source, and library upgrades can be merged simply using standard merge tools.

Multiple and replacement models

The underlying code is an interpreter, so you can have more than one model or swap out a data file easily without recompiling.

Inline data

The model parameters themselves can still be compiled into the program as a C data array if needed, and the use of the FlatBuffers serialization format means that this representation can be used directly in memory with no unpacking or parsing required.

External dependencies

All of the header and source files required to build the project are copied into the folder alongside the regular TensorFlow code, so no dependencies need to be downloaded or installed separately.

The biggest advantage that doesn’t come automatically is code size, because the interpreter structure makes it more difficult to spot code paths that will never be called. This is addressed separately in TensorFlow Lite by manually using the OpResolver mechanism to register only the kernel implementations that you expect to use in your application.
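As a hedged sketch of how the inline-data and OpResolver pieces fit together in application code: the model is compiled in as a byte array, mapped in place with GetModel(), and only the kernels that model actually uses are registered. The header paths and registration calls follow the API as it existed when this chapter was written and may have changed since, and g_model_data is a placeholder for your own converted model array.

#include "tensorflow/lite/micro/kernels/micro_ops.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"

// Placeholder: in a real project this array is produced by converting your
// .tflite file into a C source file (for example, with xxd).
extern const unsigned char g_model_data[];

const tflite::Model* SetUpModelAndOps(
    tflite::MicroMutableOpResolver* resolver) {
  // Register only the kernels this model needs, so every other op
  // implementation can be left out of the binary.
  resolver->AddBuiltin(tflite::BuiltinOperator_DEPTHWISE_CONV_2D,
                       tflite::ops::micro::Register_DEPTHWISE_CONV_2D());
  resolver->AddBuiltin(tflite::BuiltinOperator_FULLY_CONNECTED,
                       tflite::ops::micro::Register_FULLY_CONNECTED());
  resolver->AddBuiltin(tflite::BuiltinOperator_SOFTMAX,
                       tflite::ops::micro::Register_SOFTMAX());
  // The FlatBuffer is usable in place: no parsing or unpacking step needed.
  // The model and resolver would then be handed to a MicroInterpreter, as in
  // the arena sketch earlier in this chapter.
  return ::tflite::GetModel(g_model_data);
}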

Build Systems

TensorFlow Lite was originally developed in a Linux environment, so a lot of our tooling is based around traditional Unix tools like shell scripts, Make, and Python. We know that’s not a common combination for embedded developers, though, so we aim to support other platforms and compilation toolchains as first-class citizens.

The way we do that is through the aforementioned project generation. If you grab the TensorFlow source code from GitHub, you can build for a lot of platforms using a standard Makefile approach on Linux. For example, this command line should compile and test an x86 version of the library:

make -f tensorflow/lite/micro/tools/make/Makefile test

You can build a specific target, like the speech wake-word example for the SparkFun Edge platform, with a command like this:

make -f tensorflow/lite/micro/tools/make/Makefile \
  TARGET="sparkfun_edge" micro_speech_bin

What if you’re running on a Windows machine or want to use an IDE like Keil, Mbed, Arduino, or another specialized build system? That’s where the project generation comes in. You can generate a folder that’s ready to use with the Mbed IDE by running the following command line from Linux:

make -f tensorflow/lite/micro/tools/make/Makefile \
  TARGET="disco_f746ng" generate_micro_speech_mbed_project

You should now see a set of source files in tensorflow/lite/micro/tools/make/gen/disco_f746ng_x86_64/prj/micro_speech/mbed/, along with all the dependencies and project files you need to build within the Mbed environment. The same approach works for Keil and Arduino, and there’s a generic version that just outputs the folder hierarchy of source files without project metainformation (though it does include a Visual Studio Code file that defines a couple of build rules).

You might be wondering how this Linux command-line approach helps people on other platforms. We automatically run this project-generation process as part of our nightly continuous integration workflow and whenever we do a major release. Whenever it’s run, it automatically puts the resulting files up on a public web server. This means that users on all platforms should be able to find a version for their preferred IDE, and download the project as a self-contained folder instead of through GitHub.

Specializing Code

One of the benefits of code generation is that it’s easy to rewrite part of the library to work well on a particular platform, or even just optimize a function for a particular set of parameters that you know are common in your use case. We didn’t want to lose this ease of modification, but we also wanted to make it as easy as possible for more generally useful changes to be merged back into the main framework’s source code. We had the additional constraint that some build environments don’t make it easy to pass in custom #define macros during compilation, so we couldn’t rely on switching to different implementations at compile time using macro guards.

To solve this problem we’ve broken the library into small modules, each of which has a single C++ file implementing a default version of its functionality, along with a C++ header that defines the interface that other code can call to use the module. We then adopted a convention that if you want to write a specialized version of a module, you save your new version out as a C++ implementation file with the same name as the original but in a subfolder of the directory that the original is in. This subfolder should have the name of the platform or feature you’re specializing for (see Figure 13-1), and will be automatically used by the Makefile or generated projects instead of the original implementation when you’re building for that platform or feature. This probably sounds pretty complicated, so let’s walk through a couple of concrete examples.

The speech wake-word sample code needs to grab audio data from a microphone, but unfortunately there’s no cross-platform way to capture audio. Because we need to at least compile across a wide range of devices, we wrote a default implementation that just returns a buffer full of zero values, without using a microphone. Here’s what the interface to that module looks like, from audio_provider.h:

TfLiteStatus GetAudioSamples(tflite::ErrorReporter* error_reporter,
                             int start_ms, int duration_ms,
                             int* audio_samples_size, int16_t** audio_samples);
int32_t LatestAudioTimestamp();
Figure 13-1. Screenshot of a specialized audio provider file

The first function outputs a buffer filled with audio data for a given time period, returning an error if something goes wrong. The second function returns when the most recent audio data was captured, so the client can ask for the correct range of time, and know when new data has arrived.

Because the default implementation can’t rely on a microphone being present, the implementations of the two functions in audio_provider.cc are very simple:

namespace {
int16_t g_dummy_audio_data[kMaxAudioSampleSize];
int32_t g_latest_audio_timestamp = 0;
}  // namespace

TfLiteStatus GetAudioSamples(tflite::ErrorReporter* error_reporter,
                             int start_ms, int duration_ms,
                             int* audio_samples_size, int16_t** audio_samples) {
  for (int i = 0; i < kMaxAudioSampleSize; ++i) {
    g_dummy_audio_data[i] = 0;
  }
  *audio_samples_size = kMaxAudioSampleSize;
  *audio_samples = g_dummy_audio_data;
  return kTfLiteOk;
}

int32_t LatestAudioTimestamp() {
  g_latest_audio_timestamp += 100;
  return g_latest_audio_timestamp;
}

The timestamp is incremented automatically every time the function is called, so that clients will behave as if new data were coming in, but the same array of zeros is returned every time by the capture routine. The benefit of this is that it allows you to prototype and experiment with the sample code even before you have a microphone working on a system. kMaxAudioSampleSize is defined in the model header and is the largest number of samples that the function will ever be asked for.
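Here's a hedged sketch of how a caller might use the two functions together; the 30 ms window and the variable names are arbitrary choices for illustration rather than the values used by the example's feature pipeline, and the header path follows the example's folder layout.

#include <cstdint>

#include "tensorflow/lite/micro/examples/micro_speech/audio_provider.h"

namespace {
int32_t g_previous_timestamp = 0;
}  // namespace

// Poll for new audio and fetch the most recent 30 ms window when it arrives.
void CheckForNewAudio(tflite::ErrorReporter* error_reporter) {
  const int32_t latest_timestamp = LatestAudioTimestamp();
  if (latest_timestamp <= g_previous_timestamp) {
    return;  // Nothing new has been captured yet.
  }
  int audio_samples_size = 0;
  int16_t* audio_samples = nullptr;
  // Ask for the samples covering the 30 ms leading up to the latest capture.
  if (GetAudioSamples(error_reporter, latest_timestamp - 30, 30,
                      &audio_samples_size, &audio_samples) != kTfLiteOk) {
    return;
  }
  g_previous_timestamp = latest_timestamp;
  // audio_samples now points at audio_samples_size int16_t values to process.
}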

On a real device the code needs to be a lot more complex, so we need a new implementation. Earlier, we compiled this example for the STM32F746NG Discovery kit board, which has microphones built in and uses a separate Mbed library to access them. The code is in disco_f746ng/audio_provider.cc. It’s not included inline here because it’s too big, but if you look at that file, you’ll see it implements the same two public functions as the default audio_provider.cc: GetAudioSamples() and LatestAudioTimestamp(). The definitions of the functions are a lot more complex, but their behavior from a client’s perspective is the same. The complexity is hidden, and the calling code can remain the same despite the change in platform—and now, instead of receiving an array of zeros every time, captured audio will show up in the returned buffer.

If you look at the full path of this specialized implementation, tensorflow/lite/micro/examples/micro_speech/disco_f746ng/audio_provider.cc, you’ll see it’s almost identical to that of the default implementation at tensorflow/lite/micro/examples/micro_speech/audio_provider.cc, but it’s inside a disco_f746ng subfolder at the same level as the original .cc file. If you look back at the command line for building the STM32F746NG Mbed project, you’ll see we passed in TARGET=disco_f746ng to specify what platform we want. The build system always looks for .cc files in subfolders with the target name for possible specialized implementations, so in this case disco_f746ng/audio_provider.cc is used instead of the default audio_provider.cc version in the parent folder. When the source files are being assembled for the Mbed project copy, that parent-level .cc file is ignored, and the one in the subfolder is copied over; thus, the specialized version is used by the resulting project.

Capturing audio is done differently on almost every platform, so we have a lot of different specialized implementations of this module. There’s even a macOS version, osx/audio_provider.cc, which is useful if you’re debugging locally on a Mac laptop.

This mechanism isn’t just used for portability, though; it’s also flexible enough to use for optimizations. We actually use this approach in the speech wake-word example to help speed up the depthwise convolution operation. If you look in tensorflow/lite/micro/kernels you’ll see implementations of all the operations that TensorFlow Lite for Microcontrollers supports. These default implementations are written to be short, be easy to understand, and run on any platform, but meeting those goals means that they often miss opportunities to run as fast as they could. Optimization usually involves making the algorithms more complicated and more difficult to understand, so these reference implementations are expected to be comparatively slow. The idea is that we want to enable developers to get code running in the simplest possible way first and ensure that they’re getting correct results, and then be able to incrementally change the code to improve performance. This means that every small change can be tested to make sure it doesn’t break correctness, making debugging much easier.

The model used in the speech wake-word example relies heavily on the depthwise convolution operation, which has an unoptimized implementation at tensorflow/lite/micro/kernels/depthwise_conv.cc. The core algorithm is implemented in tensorflow/lite/kernels/internal/reference/depthwiseconv_uint8.h, and it’s written as a straightforward set of nested loops. Here’s the code itself:

   for (int b = 0; b < batches; ++b) {
      for (int out_y = 0; out_y < output_height; ++out_y) {
        for (int out_x = 0; out_x < output_width; ++out_x) {
          for (int ic = 0; ic < input_depth; ++ic) {
            for (int m = 0; m < depth_multiplier; m++) {
              const int oc = m + ic * depth_multiplier;
              const int in_x_origin = (out_x * stride_width) - pad_width;
              const int in_y_origin = (out_y * stride_height) - pad_height;
              int32 acc = 0;
              for (int filter_y = 0; filter_y < filter_height; ++filter_y) {
                for (int filter_x = 0; filter_x < filter_width; ++filter_x) {
                  const int in_x =
                      in_x_origin + dilation_width_factor * filter_x;
                  const int in_y =
                      in_y_origin + dilation_height_factor * filter_y;
                  // If the location is outside the bounds of the input image,
                  // use zero as a default value.
                  if ((in_x >= 0) && (in_x < input_width) && (in_y >= 0) &&
                      (in_y < input_height)) {
                    int32 input_val =
                        input_data[Offset(input_shape, b, in_y, in_x, ic)];
                    int32 filter_val = filter_data[Offset(
                        filter_shape, 0, filter_y, filter_x, oc)];
                    acc += (filter_val + filter_offset) *
                           (input_val + input_offset);
                  }
                }
              }
              if (bias_data) {
                acc += bias_data[oc];
              }
              acc = DepthwiseConvRound<output_rounding>(acc, output_multiplier,
                                                        output_shift);
              acc += output_offset;
              acc = std::max(acc, output_activation_min);
              acc = std::min(acc, output_activation_max);
              output_data[Offset(output_shape, b, out_y, out_x, oc)] =
                  static_cast<uint8>(acc);
            }
          }
        }
      }
    }

You might be able to see lots of opportunities to speed this up just from a quick look, like precalculating all the array indices that we figure out every time in the inner loop. Those changes would add to the complexity of the code, so for this reference implementation we’ve avoided them. The speech wake-word example needs to run multiple times a second on a microcontroller, though, and it turns out that this naive implementation is the main speed bottleneck preventing that on the SparkFun Edge Cortex-M4 processor. To make the example run at a usable speed, we needed to add some optimizations.
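As a standalone illustration of the index precalculation mentioned here (this is a simplified example, not the library's actual kernel code): for an NHWC layout the element offset is ((b * height + y) * width + x) * depth + c, and the per-pixel part of that calculation can be hoisted out of the channel loop.

#include <cstdint>

// Hypothetical helper mirroring the reference code's Offset() calculation for
// an NHWC tensor: ((b * height + y) * width + x) * depth + c.
inline int FlatOffset(int height, int width, int depth,
                      int b, int y, int x, int c) {
  return ((b * height + y) * width + x) * depth + c;
}

// Sums one row of an NHWC tensor, computing the per-pixel base index once
// instead of redoing the full offset arithmetic for every channel.
int32_t SumRow(const uint8_t* data, int height, int width, int depth,
               int b, int y) {
  int32_t total = 0;
  for (int x = 0; x < width; ++x) {
    const int pixel_base = FlatOffset(height, width, depth, b, y, x, 0);
    for (int c = 0; c < depth; ++c) {
      // Equivalent to data[FlatOffset(height, width, depth, b, y, x, c)].
      total += data[pixel_base + c];
    }
  }
  return total;
}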

To provide an optimized implementation, we created a new subfolder called portable_optimized inside tensorflow/lite/micro/kernels, and added a new C++ source file called depthwise_conv.cc. This is much more complex than the reference implementation, and takes advantage of particular features of the speech model to enable specialized optimizations. For example, the convolution windows are multiples of 8 wide, so we can load the values as two 32-bit words from memory, rather than as 8 individual bytes.

You’ll notice that we’ve named the subfolder portable_optimized, rather than something platform-specific as we did for the previous example. This is because none of the changes we’ve made are tied to a particular chip or library; they’re generic optimizations that are expected to help across a wide variety of processors, such as precalculating array indices or loading multiple byte values as larger words. We then specify that this implementation should be used inside the make project files, by adding portable_optimized to the ALL_TAGS list. Because this tag is present, and there’s an implementation of depthwise_conv.cc inside the subfolder with the same name, the optimized implementation is linked in rather than the default reference version.

Hopefully these examples show how you can use the subfolder mechanism to extend and optimize the library code while keeping the core implementations small and easy to understand.

Makefiles

On the topic of being easy to understand, Makefiles aren’t. The Make build system is now more than 40 years old and has a lot of features that can be confusing, such as its use of tabs as meaningful syntax or the indirect specification of build targets through declarative rules. We chose to use Make over alternatives such as Bazel or CMake because it was flexible enough to implement complex behaviors like project generation, and we hope that most users of TensorFlow Lite for Microcontrollers will use those generated projects in more modern IDEs rather than interacting with Makefiles directly.

If you’re making changes to the core library, you might need to understand more about what’s going on under the hood in the Makefiles, though, so this section covers some of the conventions and helper functions that you’ll need to be familiar with to make modifications.

Note

If you’re using a bash terminal on Linux or macOS, you should be able to see all of the available targets (names of things you can build) by typing the normal make -f tensorflow/lite/micro/tools/make/Makefile command and then pressing the Tab key. This autocomplete feature can be very useful when finding or debugging targets.

If you’re just adding a specialized version of a module or operation, you shouldn’t need to update the Makefile at all. There’s a custom function called specialize() that automatically takes the ALL_TAGS list of strings (populated with the platform name, along with any custom tags) and a list of source files, and returns the list with the correct specialized versions substituted for the originals. This does also give you the flexibility to manually specify tags on the command line if you want to. For example, this:

make -f tensorflow/lite/micro/tools/make/Makefile \
  TARGET="bluepill" TAGS="portable_optimized foo" test

would produce an ALL_TAGS list that looked like “bluepill portable_optimized foo,” and for every source file the subfolders would be searched in order to find any specialized versions to substitute.

You also don’t need to alter the Makefile if you’re just adding new C++ files to standard folders, because most of these are automatically picked up by wildcard rules, like the definition of MICROLITE_CC_BASE_SRCS.

The Makefile relies on defining lists of source and header files to build at the root level and then modifying them depending on which platform and tags are specified. These modifications happen in sub-Makefiles included from the parent build project. For example, all of the .inc files in the tensorflow/lite/micro/tools/make/targets folder are automatically included. If you look in one of these, like the apollo3evb_makefile.inc used for Ambiq and SparkFun Edge platforms, you can see that it checks whether the chips it’s targeting have been specified for this build; if they have, it defines a lot of flags and modifies the source lists. Here’s an abbreviated version including some of the most interesting bits:

ifeq ($(TARGET),$(filter $(TARGET),apollo3evb sparkfun_edge))
  export PATH := $(MAKEFILE_DIR)/downloads/gcc_embedded/bin/:$(PATH)
  TARGET_ARCH := cortex-m4
  TARGET_TOOLCHAIN_PREFIX := arm-none-eabi-
...
  $(eval $(call add_third_party_download,$(GCC_EMBEDDED_URL), \
      $(GCC_EMBEDDED_MD5),gcc_embedded,))
  $(eval $(call add_third_party_download,$(CMSIS_URL),$(CMSIS_MD5),cmsis,))
...
  PLATFORM_FLAGS = \
    -DPART_apollo3 \
    -DAM_PACKAGE_BGA \
    -DAM_PART_APOLLO3 \
    -DGEMMLOWP_ALLOW_SLOW_SCALAR_FALLBACK \
...
  LDFLAGS += \
    -mthumb -mcpu=cortex-m4 -mfpu=fpv4-sp-d16 -mfloat-abi=hard \
    -nostartfiles -static \
    -Wl,--gc-sections -Wl,--entry,Reset_Handler \
...
  MICROLITE_LIBS := \
    $(BOARD_BSP_PATH)/gcc/bin/libam_bsp.a \
    $(APOLLO3_SDK)/mcu/apollo3/hal/gcc/bin/libam_hal.a \
    $(GCC_ARM)/lib/gcc/arm-none-eabi/7.3.1/thumb/v7e-m/fpv4-sp/hard/crtbegin.o \
    -lm
  INCLUDES += \
    -isystem$(MAKEFILE_DIR)/downloads/cmsis/CMSIS/Core/Include/ \
    -isystem$(MAKEFILE_DIR)/downloads/cmsis/CMSIS/DSP/Include/ \
    -I$(MAKEFILE_DIR)/downloads/CMSIS_ext/ \
...
  MICROLITE_CC_SRCS += \
    $(APOLLO3_SDK)/boards/apollo3_evb/examples/hello_world/gcc_patched/startup_gcc.c \
    $(APOLLO3_SDK)/utils/am_util_delay.c \
    $(APOLLO3_SDK)/utils/am_util_faultisr.c \
    $(APOLLO3_SDK)/utils/am_util_id.c \
    $(APOLLO3_SDK)/utils/am_util_stdio.c

This is where all of the customizations for a particular platform happen. In this snippet, we’re indicating to the build system where to find the compiler that we want to use, and what architecture to specify. We’re specifying some extra external libraries to download, like the GCC toolchain and Arm’s CMSIS library. We’re setting up compilation flags for the build, and arguments to pass to the linker, including extra library archives to link in and include paths to look in for headers. We’re also adding some extra C files we need to build successfully on Ambiq platforms.

A similar kind of sub-Makefile inclusion is used for building the examples. The speech wake-word sample code has its own Makefile at micro_speech/Makefile.inc, and it defines its own lists of source code files to compile, along with extra external dependencies to download.

You can generate standalone projects for different IDEs by using the generate_microlite_projects() function. This takes a list of source files and flags and then copies the required files to a new folder, together with any additional project files that are needed by the build system. For some IDEs this is very simple, but Arduino, for example, requires all .cc files to be renamed to .cpp and some include paths to be altered in the source files as they are copied.

External libraries such as the C++ toolchain for embedded Arm processors are automatically downloaded as part of the Makefile build process. This happens because of the add_third_party_download rule that’s invoked for every needed library, passing in a URL to pull from and an MD5 sum to check the archive against to ensure that it’s correct. These are expected to be ZIP, GZIP, BZ2, or TAR files, and the appropriate unpacker will be called depending on the file extension. If headers or source files from any of these are needed by build targets, they should be explicitly included in the file lists in the Makefile so that they can be copied over to any generated projects, so each project’s source tree is self-contained. This is easy to forget with headers because setting up include paths is enough to get the Makefile compilation working without explicitly mentioning each included file, but the generated projects will then fail to build. You should also ensure that any license files are included in your file lists, so that the copies of the external libraries retain the proper attribution.

Writing Tests

TensorFlow aims to have unit tests for all of its code, and we’ve already covered some of these tests in detail in Chapter 5. The tests are usually arranged as _test.cc files in the same folder as the module that’s being tested, and with the same prefix as the original source file. For example, the implementation of the depthwise convolution operation is tested by tensorflow/lite/micro/kernels/depthwise_conv_test.cc. If you’re adding a new source file, you must add an accompanying unit test that exercises it if you want to submit your modifications back into the main tree. This is because we need to support a lot of different platforms and models and many people are building complex systems on top of our code, so it’s important that our core components can be checked for correctness.

If you add a file in a direct subfolder of tensorflow/lite/micro, you should be able to name it <something>_test.cc and it will be picked up automatically. If you’re testing a module inside an example, you’ll need to add an explicit call to the microlite_test Makefile helper function, like this:

# Tests the feature provider module using the mock audio provider.
$(eval $(call microlite_test,feature_provider_mock_test,\
$(FEATURE_PROVIDER_MOCK_TEST_SRCS),$(FEATURE_PROVIDER_MOCK_TEST_HDRS)))

The tests themselves need to be run on microcontrollers, so they must stick to the same constraints around dynamic memory allocation, avoiding OS and external library dependencies that the framework aims to satisfy. Unfortunately, this means that popular unit test systems like Google Test aren’t acceptable. Instead, we’ve written our own very minimal test framework, defined and implemented in the micro_test.h header.

To use it, create a .cc file that includes the header. Start with a TF_LITE_MICRO_TESTS_BEGIN statement on a new line, and then define a series of test functions, each with a TF_LITE_MICRO_TEST() macro. Inside each test, you call macros like TF_LITE_MICRO_EXPECT_EQ() to assert the expected results that you want to see from the functions being tested. At the end of all the test functions you’ll need TF_LITE_MICRO_TESTS_END. Here’s a basic example:

#include "tensorflow/lite/micro/testing/micro_test.h"

TF_LITE_MICRO_TESTS_BEGIN

TF_LITE_MICRO_TEST(SomeTest) {
  TF_LITE_MICRO_EXPECT_EQ(true, true);
}

TF_LITE_MICRO_TESTS_END

If you compile this for your platform, you’ll get a normal binary that you should be able to run. Executing it will output logging information like this to stderr (or whatever equivalent is available and written to by ErrorReporter on your platform):

----------------------------------------------------------------------------
Testing SomeTest
1/1 tests passed
~~~ALL TESTS PASSED~~~
----------------------------------------------------------------------------

This is designed to be human-readable, so you can just run tests manually, but the string ~~~ALL TESTS PASSED~~~ should appear only if all of the tests do actually pass. This makes it possible to integrate with automated test systems by scanning the output logs and looking for that magic value. This is how we’re able to run tests on microcontrollers. As long as there’s some debug logging connection back, the host can flash the binary and then monitor the output log to ensure the expected string appears to indicate whether the tests succeeded.

Supporting a New Hardware Platform

One of the main goals of the TensorFlow Lite for Microcontrollers project is to make it easy to run machine learning models across many different devices, operating systems, and architectures. The core code is designed to be as portable as possible, and the build system is written to make bringing up new environments straightforward. In this section, we present a step-by-step guide to getting TensorFlow Lite for Microcontrollers running on a new platform.

Printing to a Log

The only platform dependency that TensorFlow Lite absolutely requires is the ability to print strings to a log that can be inspected externally, typically from a desktop host machine. This is so that we can see whether tests have run successfully and generally debug what’s happening inside the programs we’re running. Because this is such a fundamental requirement, the first thing you will need to do on your platform is determine what kind of logging facilities are available, and then write a small program that prints something out to exercise them.

On Linux and most other desktop operating systems, this would be the canonical “hello world” example that begins many C training curriculums. It would typically look something like this:

#include <stdio.h>

int main(int argc, char** argv) {
  fprintf(stderr, "Hello World!\n");
}

If you compile and build this on Linux, macOS, or Windows and then run the executable from the command line, you should see “Hello World!” printed to the terminal. It might also work on a microcontroller if it’s running an advanced OS, but at the very least you’ll need to figure out where the text ends up, given that embedded systems don’t usually have displays or terminals of their own. Typically you’ll need to connect to a desktop machine over USB or another debugging connection to see any logs, even if fprintf() is supported when compiling.

There are a few tricky parts about this code from a microcontroller perspective. One of them is that the stdio.h library requires functions to be linked in, and some of them are quite large, which can increase the binary size beyond the resources available on a small device. The library also assumes that there are all the normal C standard library facilities available, like dynamic memory allocation and string functions. And there’s no natural definition for where stderr should go on an embedded system, so the API is unclear.

Instead, most platforms define their own debug logging interfaces. How these are called often depends on what kind of connection is being used between the host and microcontroller, as well as the hardware architecture and the OS (if any) being run on the embedded system. For example, Arm Cortex-M microcontrollers support semihosting, which is a standard for communicating between the host and target systems during the development process. If you’re using a connection like OpenOCD from your host machine, calling the SYS_WRITE0 system call from the microcontroller will cause the zero-terminated string argument in register 1 to be shown on the OpenOCD terminal. In this case, the code for an equivalent “hello world” program would look like this:

void DebugLog(const char* s) {
  asm("mov r0, #0x04
"  // SYS_WRITE0
      "mov r1, %[str]
"
      "bkpt #0xAB
"
      :
      : [ str ] "r"(s)
      : "r0", "r1");
}

int main(int argc, char** argv) {
  DebugLog("Hello World!
");
}

The need for assembly here shows how platform-specific this solution is, but it does avoid the need to bring in any external libraries at all (even the standard C library).

Exactly how to do this will vary widely across different platforms, but one common approach is to use a serial UART connection to the host. Here’s how you do that on Mbed:

#include <mbed.h>

// On mbed platforms, we set up a serial port and write to it for debug logging.
void DebugLog(const char* s) {
  static Serial pc(USBTX, USBRX);
  pc.printf("%s", s);
}

int main(int argc, char** argv) {
  DebugLog("Hello World!
");
}

And here’s a slightly more complex example for Arduino:

#include "Arduino.h"

// The Arduino DUE uses a different object for the default serial port shown in
// the monitor than most other models, so make sure we pick the right one. See
// https://github.com/arduino/Arduino/issues/3088#issuecomment-406655244
#if defined(__SAM3X8E__)
#define DEBUG_SERIAL_OBJECT (SerialUSB)
#else
#define DEBUG_SERIAL_OBJECT (Serial)
#endif

// On Arduino platforms, we set up a serial port and write to it for debug
// logging.
void DebugLog(const char* s) {
  static bool is_initialized = false;
  if (!is_initialized) {
    DEBUG_SERIAL_OBJECT.begin(9600);
    // Wait for serial port to connect. Only needed for some models apparently?
    while (!DEBUG_SERIAL_OBJECT) {
    }
    is_initialized = true;
  }
  DEBUG_SERIAL_OBJECT.println(s);
}

int main(int argc, char** argv) {
  DebugLog("Hello World!
");
}

Both of these examples create a serial object, and then expect that the user will hook up a serial connection to the microcontroller over USB to their host machine.

The key first step in the porting effort is to create a minimal example for your platform, running in the IDE you want to use, that gets a string printed to the host console somehow. If you can get this working, the code you use will become the basis of a specialized function that you’ll add to the TensorFlow Lite code.

Implementing DebugLog()

If you look in tensorflow/lite/micro/debug_log.cc, you’ll see that there’s an implementation of the DebugLog() function that looks very similar to the first “hello world” example we showed, using stdio.h and fprintf() to output a string to the console. If your platform supports the standard C library fully and you don’t mind the extra binary size, you can just use this default implementation and ignore the rest of this section. It’s more likely that you’ll need to use a different approach, though, unfortunately.
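For reference, that default implementation amounts to little more than the following (a close paraphrase rather than an exact copy of the file):

#include "tensorflow/lite/micro/debug_log.h"

#include <cstdio>

// Default implementation: forward the string to the standard error stream,
// which works anywhere the full C standard library is available.
extern "C" void DebugLog(const char* s) { fprintf(stderr, "%s", s); }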

As a first step, we’ll use the test that already exists for the DebugLog() function. To begin, run this command line:

make -f tensorflow/lite/micro/tools/make/Makefile \
  generate_micro_error_reporter_test_make_project

When you look inside tensorflow/lite/micro/tools/make/gen/linux_x86_64/prj/micro_error_reporter_test/make/ (replacing linux with osx or windows if you’re on a different host platform) you should see some folders like tensorflow and third_party. These folders contain C++ source code, and if you drag them into your IDE or build system and compile all the files, you should end up with an executable that tests out the error reporting functionality we need to create. It’s likely that your first attempt to build this code will fail, because it’s still using the default DebugLog() implementation in debug_log.cc, which relies on stdio.h and the C standard library. To work around that problem, change debug_log.cc to remove the #include <cstdio> statement and replace the DebugLog() implementation with one that does nothing:

#include "tensorflow/lite/micro/debug_log.h"

extern "C" void DebugLog(const char* s) {
  // Do nothing for now.
}

With that changed, try to get the set of source files successfully compiling. After you’ve done that, take the resulting binary and load it onto your embedded system. If you can, check that the program runs without crashing, even though you won’t be able to see any output yet.

When the program seems to build and run correctly, see whether you can get the debug logging working. Take the code that you used for the “hello world” program in the previous section and put it into the DebugLog() implementation inside debug_log.cc.
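For example, on an Mbed board you might drop the serial logging code from the previous section straight into debug_log.cc, along the lines of this sketch:

#include "tensorflow/lite/micro/debug_log.h"

#include <mbed.h>

// Specialized DebugLog() that routes strings over the board's USB serial
// connection, reusing the Mbed "hello world" code shown earlier.
extern "C" void DebugLog(const char* s) {
  static Serial pc(USBTX, USBRX);
  pc.printf("%s", s);
}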

The actual test code itself exists in tensorflow/lite/micro/micro_error_reporter_test.cc, and it looks like this:

int main(int argc, char** argv) {
  tflite::MicroErrorReporter micro_error_reporter;
  tflite::ErrorReporter* error_reporter = &micro_error_reporter;
  error_reporter->Report("Number: %d", 42);
  error_reporter->Report("Badly-formed format string %");
  error_reporter->Report("Another % badly-formed %% format string");
  error_reporter->Report("~~~%s~~~", "ALL TESTS PASSED");
}

It’s not calling DebugLog() directly—it goes through the ErrorReporter interface that handles things like variable numbers of arguments first—but it does rely on the code you’ve just written as its underlying implementation. You should see something like this in your debug console if everything’s working correctly:

Number: 42
Badly-formed format string
Another  badly-formed  format string
~~~ALL TESTS PASSED~~~

After you have that working, you’ll want to put your implementation of DebugLog() back into the main source tree. To do this, you’ll use the subfolder specialization technique that we discussed earlier. You’ll need to decide on a short name (with no capital letters, spaces, or other special characters) to use to identify your platform. For example, we use arduino, sparkfun_edge, and linux for some of the platforms we already support. For the purposes of this tutorial, we’ll use my_mcu. Start by creating a new subfolder in tensorflow/lite/micro/ called my_mcu in the copy of the source code you checked out from GitHub (not the one you just generated or downloaded). Copy the debug_log.cc file with your implementation into that my_mcu folder, and add it to source tracking using Git. Copy your generated project files to a backup location and then run the following commands:

make -f tensorflow/lite/micro/tools/make/Makefile TARGET=my_mcu clean
make -f tensorflow/lite/micro/tools/make/Makefile \
  TARGET=my_mcu generate_micro_error_reporter_test_make_project

If you now look in tensorflow/lite/micro/tools/make/gen/my_mcu_x86_64/prj/micro_error_reporter_test/make/tensorflow/lite/micro/ you should see that the default debug_log.cc is no longer present, but your implementation is in the my_mcu subfolder. If you drag this set of source files back into your IDE or build system, you should now see a program that successfully builds, runs, and outputs to the debug console.

Running All the Targets

If that works, congratulations: you’ve now enabled all of the TensorFlow test and executable targets! Implementing debug logging is the only required platform-specific change you need to make; everything else in the code base should be written in a portable enough way that it will build and run on any C++11-supporting toolchain, with no need for standard library linking beyond the math library. To create all of the targets so that you can try them in your IDE, you can run the following command from the terminal:

make -f tensorflow/lite/micro/tools/make/Makefile generate_projects \
  TARGET=my_mcu

This creates a large number of folders in similar locations to the generated error reporter test, each exercising different parts of the library. If you want to get the speech wake-word example running on your platform, you can look at tensorflow/lite/micro/tools/make/gen/my_mcu_x86_64/prj/micro_speech/make/.

Now that you have DebugLog() implemented, it should run on your platform, but it won’t do anything useful because the default audio_provider.cc implementation is always returning arrays full of zeros. To get it working properly, you’ll need to create a specialized audio_provider.cc module that returns captured sound, using the subfolder specialization approach described earlier. If you don’t care about a working demonstration, you can still look at things like the inference latency of neural networks on your platform using the same sample code, or some of the other tests.

In addition to adding hardware support for sensors and output devices like LEDs, you may well want to implement versions of the neural network operators that run faster by taking advantage of special features of your platform. We welcome this kind of specialized optimization and hope that the subfolder specialization technique will be a good way to integrate such optimizations back into the main source tree if they prove to be useful.

Integrating with the Makefile Build

So far we’ve talked only about using your own IDE, given that it’s often simpler and more familiar to many embedded programmers than using our Make system. If you want to be able to have your code tested by our continuous integration builds, or have it available outside of a particular IDE, you might want to integrate your changes more fully with our Makefiles. One of the essentials for this is finding a publicly downloadable toolchain for your platform, along with public downloads for any SDKs or other dependencies, so that a shell script can automatically grab everything it needs to build without having to worry about website logins or registrations. For example, we download the macOS and Linux versions of the GCC Embedded toolchain from Arm, with the URLs in tensorflow/lite/micro/tools/make/third_party_downloads.inc.

You’ll then need to determine the correct command-line flags to pass into the compiler and linker, along with any extra source files you need that aren’t found using subfolder specialization, and encode that information into a sub-Makefile in tensorflow/lite/micro/tools/make/targets. If you want extra credit, you can then figure out how to emulate your microcontroller on an x86 server using a tool like Renode so that we can run the tests during our continuous integration, not just confirm the build. You can see an example of the script we run to test the “Bluepill” binaries using Renode at tensorflow/lite/micro/testing/test_bluepill_binary.sh.

If you have all of the build settings configured correctly, you’ll be able to run something like this to generate a flashable binary (setting the target as appropriate for your platform):

make -f tensorflow/lite/micro/tools/make/Makefile \
  TARGET=bluepill micro_error_reporter_test_bin

If you have the script and environment for running tests working correctly, you can do this to run all the tests for the platform:

make -f tensorflow/lite/micro/tools/make/Makefile TARGET=bluepill test

Supporting a New IDE or Build System

TensorFlow Lite for Microcontrollers can create standalone projects for Arduino, Mbed, and Keil toolchains, but we know that a lot of other development environments are used by embedded engineers. If you need to run the framework in a new environment, the first thing we recommend is seeing whether the “raw” set of files that are generated when you generate a Make project can be imported into your IDE. This kind of project archive contains only the source files needed for a particular target, including any third-party dependencies, so in many cases you can just point your toolchain at the root folder and ask it to include everything.

Note

When you have only a few files, it can seem odd to keep them in the nested subfolders (like tensorflow/lite/micro/examples/micro_speech) of the original source tree when you export them to a generated project. Wouldn’t it make more sense to flatten out the directory hierarchy?

The reason we chose to keep the deeply nested folders is to make merging back into the main source tree as straightforward as possible, even if it is a little less convenient when working with the generated project files. If the paths always match between the original code checked out of GitHub and the copies in each project, keeping track of changes and updates is a lot easier.

This approach won’t work for all IDEs, unfortunately. For example, Arduino libraries require all C++ source code files to have the suffix .cpp rather than TensorFlow’s default of .cc, and they’re also unable to specify include paths, so we need to change the paths in the code when we copy over the original files to the Arduino destination. To support these more complex transformations we have some rules and scripts in the Makefile build, with the root function generate_microlite_projects() calling into specialized versions for each IDE, which then rely on more rules, Python scripts, and template files to create the final output. If you need to do something similar for your own IDE, you’ll need to add similar functionality using the Makefile, which won’t be straightforward to implement because the build system is quite complex to work with.

Integrating Code Changes Between Projects and Repositories

One of the biggest disadvantages of a code generation system is that you end up with multiple copies of the source scattered in different locations, which makes dealing with code updates very tricky. To minimize the cost of merging changes, we’ve adopted some conventions and recommended procedures that should help. The most common use case is that you’ve made some modifications to files within the local copy of your project, and you’d like to update to a newer version of the TensorFlow Lite framework to get extra features or bug fixes. Here’s how we suggest handling that process:

  1. Either download a prebuilt archive of the project file for your IDE and target or generate one manually from the Makefile using the version of the framework you’re interested in.

  2. Unpack this new set of files into a folder and make sure that the folder structures match between the new folder and the folder containing the project files that you’ve been modifying. For example, both should have tensorflow subfolders at the top level.

  3. Run a merge tool between the two folders. Which tool you use will depend on your OS, but Meld is a good choice that works on Linux, Windows, and macOS. The complexity of the merge process will depend on how many files you’ve changed locally, but it’s expected most of the differences will be updates on the framework side, so you should usually be able to choose the equivalent of “accept theirs.”

If you have changed only one or two files locally, it might be easier to just copy the modified code from the old version and manually merge it into the new exported project.

You could also get more advanced by checking your modified code into Git, importing the latest project files as a new branch, and then using Git’s built-in merging facilities to handle the integration. We haven’t used this approach ourselves, though, so we’re not able to offer much advice on it.

The big difference between this process and doing the same with more traditional code-generation approaches is that the code is still separated into many logical files whose paths remain constant over time. Typical code generation will concatenate all of the source into a single file, which makes merging or tracking changes very difficult because trivial changes to the order or layout make historical comparisons impossible.

Sometimes you might want to port changes in the other direction, merging from project files to the main source tree. This main source tree doesn’t need to be the official repository on GitHub; it could be a local fork that you maintain and don’t distribute. We love to get pull requests to the main repository with fixes or upgrades, but we know that’s not always possible with proprietary embedded development, so we’re also happy to help keep forks healthy. The key thing to watch is that you try to keep a single “source of truth” for your development files. Especially if you have multiple developers, it’s easy to have incompatible changes being made in different local copies of the source files inside project archives, which makes updating and debugging a nightmare. Whether it’s only internal or shared publicly, we highly recommend having a source-control system that has a single copy of each file, rather than checking in multiple versions.

To handle migrating changes back to the source of truth repository, you’ll need to keep track of which files you’ve modified. If you don’t have that information handy, you can always go back to the project files you originally downloaded or generated and run a diff to see what has changed. As soon as you know what files are modified or new, just copy them into the Git (or other source-control system) repository at the same paths they occur at in the project files.

The only exceptions to this approach are files that are part of third-party libraries, given that these don’t exist in the TensorFlow repository. Getting changes to those files submitted is beyond the scope of this book—the process will depend on the rules of each individual repository—but as a last resort, if you have changes that aren’t being accepted, you can often fork the project on GitHub and point your platform’s build system to that new URL rather than the original. Assuming that you’re changing just TensorFlow source files, you should now have a locally modified repository that contains your changes. To verify that the modifications have been successfully integrated, you’ll need to run generate_projects() using Make and then ensure that the project for your IDE and target has your updates applied as you’d expect. When that’s complete, and you’ve run tests to ensure nothing else has been broken, you can submit your changes to your fork of TensorFlow. As soon as that’s done, the final stage is to submit a pull request if you’d like to see your changes made public.

Contributing Back to Open Source

There are already more contributors to TensorFlow outside of Google than inside, and the microcontroller work has a larger reliance on collaboration than most other areas. We’re very keen to get help from the community, and one of the most important ways of helping is through pull requests (though there are plenty of other ways, like helping out on Stack Overflow or creating your own example projects). GitHub has great documentation covering the basics of pull requests, but there are some details that are helpful to know when working with TensorFlow:

  • We have a code review process run by project maintainers inside and outside Google. This is managed through GitHub’s code review system, so you should expect to see a discussion about your submission there.

  • Changes that are more than just a bug fix or optimization usually need a design document first. There’s a group called SIG Micro that’s run by external contributors to help define our priorities and roadmap, so that’s a good forum to talk about new designs. The document can be just a page or two for a smaller change; it’s helpful to understand the context and motivation behind a pull request.

  • Maintaining a public fork can be a great way of getting feedback on experimental changes before they’re submitted to the main branch, because you can make changes without any cumbersome processes to slow you down.

  • There are automated tests that run against all pull requests, both publicly and with some additional Google internal tools that check the integration against our own projects that depend on this. The results of these tests can sometimes be difficult to interpret, unfortunately, and even worse, they’re occasionally “flakey,” with tests failing for reasons unrelated to your changes. We’re constantly trying to improve this process because we know it’s a bad experience, but please do ping the maintainers in the conversation thread if you’re having trouble understanding test failures.

  • We aim for 100% test coverage, so if a change isn’t exercised by an existing test, we’ll ask you for a new one. These tests can be quite simple; we just want to make sure there’s some coverage of everything we do.

  • For readability’s sake, we use the Google style guide for C and C++ code formatting consistently across the entire TensorFlow code base, so we request any new or modified code be in this style. You can use clang-format with the google style argument to automatically format your code.

Thanks in advance for any contributions you can make to TensorFlow, and for your patience with the work involved in getting changes submitted. It’s not always easy, but you’ll be making a difference to many developers around the world!

Supporting New Hardware Accelerators

One of the goals of TensorFlow Lite for Microcontrollers is to be a reference software platform to help hardware developers make faster progress with their designs. What we’ve observed is that a lot of the work around getting a new chip doing something useful with machine learning lies in tasks like writing exporters from the training environment, especially with regard to tricky details like quantization, and implementing the “long tail” of operations that typical machine learning models need but that take up so little compute time that they aren’t good candidates for hardware optimization.

To address these problems, we hope that the first step that hardware developers will take is getting the unoptimized reference code for TensorFlow Lite for Microcontrollers running on their platform and producing the correct results. This will demonstrate that everything but the hardware optimization is working, so that can be the focus of the remaining work. One challenge might be if the chip is an accelerator that doesn’t support general-purpose C++ compilation, because it only has specialized functionality rather than a traditional CPU. For embedded use cases, we’ve found that it’s almost always necessary to have some general-purpose computation available, even if it’s slow (like a small microcontroller), because many users’ graphs have operations that can’t be compactly expressed except as arbitrary C++ implementations. We’ve also made the design decision that the TensorFlow Lite for Microcontrollers interpreter won’t support asynchronous execution of subgraphs, because that would complicate the code considerably and also seems uncommon in the embedded domain (unlike the mobile world, where Android’s Neural Network API is popular).

This means that the kinds of architectures TensorFlow Lite for Microcontrollers supports look more like synchronous coprocessors that run in lockstep with a traditional processor, with the accelerator speeding up compute-intensive functions that would otherwise take a long time but deferring the smaller ops with more flexible requirements to a CPU. The result in practice is that we recommend starting off by replacing individual operator implementations at the kernel level with calls to any specialized hardware. This does mean that the results and inputs are expected to be in normal memory addressable by the CPU (because you don’t have any guarantees about what processor subsequent ops will run on), and you will either need to wait for the accelerator to complete before continuing or use platform-specific code to switch to threads outside of the Micro framework. These restrictions should at least enable some quick prototyping, though, and hopefully offer the ability to make incremental changes while always being able to test the correctness of each small modification.
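
To make that recommendation concrete, here’s a hedged sketch of what swapping a single kernel’s inner computation onto an accelerator might look like. AcceleratorConv2D and AcceleratorWaitForCompletion are hypothetical stand-ins for whatever your platform’s SDK provides, and they’re stubbed out here only so the example is complete:

#include <cstddef>
#include <cstdint>

// Hypothetical vendor calls, stubbed out for illustration; a real port would
// use the chip SDK's own functions here.
void AcceleratorConv2D(const int8_t* input, const int8_t* filter,
                       int8_t* output, std::size_t output_size) {
  (void)input;
  (void)filter;
  (void)output;
  (void)output_size;
}
void AcceleratorWaitForCompletion() {}

// The pattern described above: hand the compute-heavy work to the
// accelerator, but keep inputs and outputs in CPU-addressable memory and
// block until the result is ready, because the interpreter runs ops
// synchronously and the next op may execute on the CPU.
void EvalConvOnAccelerator(const int8_t* input, const int8_t* filter,
                           int8_t* output, std::size_t output_size) {
  AcceleratorConv2D(input, filter, output, output_size);
  AcceleratorWaitForCompletion();
}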

Understanding the File Format

The format TensorFlow Lite uses to store its models has many virtues, but unfortunately simplicity is not one of them. Don’t be put off by the complexity, though; it’s actually fairly straightforward to work with after you understand some of the fundamentals.

As we touched on in Chapter 3, neural network models are graphs of operations with inputs and outputs. Some of the inputs to an operation might be large arrays of learned values, known as weights, and others will come from the results of earlier operations, or input value arrays fed in by the application layer. These inputs might be image pixels, audio sample data, or accelerometer time-series data. At the end of running a single pass of the model, the final operations will leave arrays of values in their outputs, typically representing things like classification predictions for different categories.

Models are usually trained on desktop machines, so we need a way of transferring them to other devices like phones or microcontrollers. In the TensorFlow world, we do this using a converter that can take a trained model from Python and export it as a TensorFlow Lite file. This exporting stage can be fraught with problems, because it’s easy to create a model in TensorFlow that relies on features of the desktop environment (like being able to execute Python code snippets or use advanced operations) that are not supported on simpler platforms. It’s also necessary to convert all the values that are variable in training (such as weights) into constants, remove operations that are needed only for gradient backpropagation, and perform optimizations like fusing neighboring ops or folding costly operations like batch normalization into less expensive forms. What makes this even trickier is that there are more than 800 operations in mainline TensorFlow, and more are being added all the time. This means that it’s fairly straightforward to write your own converter for a small set of models, but handling the broader range of networks that users can create in TensorFlow reliably is much more difficult. Just keeping up to date with new operations is a full-time job.

The TensorFlow Lite file that you get out of the conversion process doesn’t suffer from most of these issues. We try to produce a simpler and more stable representation of a trained model with clear inputs and outputs, variables that are frozen into weights, and common graph optimizations like fusing already applied. This means that even if you’re not intending to use TensorFlow Lite for Microcontrollers, we recommend using the TensorFlow Lite file format as the way you access TensorFlow models for inference instead of writing your own converter from the Python layer.

FlatBuffers

We use FlatBuffers as our serialization library. It was designed for applications for which performance is critical, so it’s a good fit for embedded systems. One of the nice features is that its runtime in-memory representation is exactly the same as its serialized form, so models can be embedded directly into flash memory and accessed immediately, with no need for any parsing or copying. This does mean that the generated code classes to read properties can be a bit difficult to follow because there are a couple of layers of indirection, but the important data (such as weights) is stored directly as little-endian blobs that can be accessed like raw C arrays. There’s also very little wasted space, so you aren’t paying a size penalty by using FlatBuffers.

FlatBuffers work using a schema that defines the data structures we want to serialize, together with a compiler that turns that schema into native C++ (or C, Python, Java, etc.) code for reading and writing the information. For TensorFlow Lite, the schema is in tensorflow/lite/schema/schema.fbs, and we cache the generated C++ accessor code at tensorflow/lite/schema/schema_generated.h. We could generate the C++ code every time we do a fresh build rather than storing it in source control, but this would require every platform we build on to include the flatc compiler as well as the rest of the toolchain, and we decided to trade the convenience of automatic generation for ease of porting.

If you want to understand the format at the byte level, we recommend looking at the internals page of the FlatBuffers C++ project or the equivalent for the C library. We’re hopeful that most needs will be met through the various high-level language interfaces, though, and you won’t need to work at that granularity. To introduce you to the concepts behind the format, we’re going to walk through the schema and the code in MicroInterpreter that reads a model; hopefully, having some concrete examples will help it all make sense.

Ironically, to get started we need to scroll to the very end of the schema. Here we see a line declaring that the root_type is Model:

root_type Model;

FlatBuffers need a single container object that acts as the root for the tree of other data structures held within the file. This statement tells us that the root of this format is going to be a Model. To find out what that means, we scroll up a few more lines to the definition of Model:

table Model {

This tells us that Model is what FlatBuffers calls a table. You can think of this like a Dict in Python or a struct in C or C++ (though it’s more flexible than that). It defines what properties an object can have, along with their names and types. There’s also a less-flexible type in FlatBuffers called struct that’s more memory-efficient for arrays of objects, but we don’t currently use this in TensorFlow Lite.
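
As a loose analogy only (not the real generated code), you can think of each table as compiling down to a thin read-only view class whose getters pull values straight out of the serialized bytes. The classes flatc actually generates use vtables and offsets rather than the fixed position shown here, so treat this purely as an illustration of the idea:

#include <cstdint>
#include <cstring>

// Loose analogy: a FlatBuffers table behaves like a read-only view over the
// serialized bytes, with no parsing or copying step. The real generated code
// resolves each field through an offset table instead of a fixed location.
class ExampleTableView {
 public:
  explicit ExampleTableView(const uint8_t* data) : data_(data) {}
  uint32_t version() const {
    uint32_t value = 0;
    std::memcpy(&value, data_, sizeof(value));  // pretend the field is here
    return value;
  }

 private:
  const uint8_t* data_;  // points into the serialized model, e.g. in flash
};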

You can see how this is used in practice by looking at the micro_speech example’s main() function:

  // Map the model into a usable data structure. This doesn't involve any
  // copying or parsing, it's a very lightweight operation.
  const tflite::Model* model =
      ::tflite::GetModel(g_tiny_conv_micro_features_model_data);

The g_tiny_conv_micro_features_model_data variable is a pointer to an area of memory containing a serialized TensorFlow Lite model, and the call to ::tflite::GetModel() is effectively just a cast to get a C++ object backed up by that underlying memory. It doesn’t require any memory allocation or walking of data structures, so it’s a very quick and efficient call. To understand how we can use it, look at the next operation we perform on the data structure:

  if (model->version() != TFLITE_SCHEMA_VERSION) {
    error_reporter->Report(
        "Model provided is schema version %d not equal "
        "to supported version %d.
",
        model->version(), TFLITE_SCHEMA_VERSION);
    return 1;
  }

If you look at the start of the Model definition in the schema, you can see the definition of the version property this code is referring to:

  // Version of the schema.
  version:uint;

This informs us that the version property is a 32-bit unsigned integer, so the C++ code generated for model->version() returns that type of value. Here we’re just doing error checking to make sure the version is one that we can understand, but the same kind of accessor function is generated for all the properties that are defined in the schema.

To understand the more complex parts of the file format, it’s worth following the flow of the MicroInterpreter class as it loads a model and prepares to execute it. The constructor is passed a pointer to a model in memory, such as the previous example’s g_tiny_conv_micro_features_model_data. The first property it accesses is buffers:

  const flatbuffers::Vector<flatbuffers::Offset<Buffer>>* buffers =
      model->buffers();

You might see the Vector name in the type definition, and be worried we’re trying to use objects similar to Standard Template Library (STL) types inside an embedded environment without dynamic memory management, which would be a bad idea. Happily, though, the FlatBuffers Vector class is just a read-only wrapper around the underlying memory, so just like with the root Model object, there’s no parsing or memory allocation required to create it.

To understand more about what this buffers array represents, it’s worth taking a look at the schema definition:

// Table of raw data buffers (used for constant tensors). Referenced by tensors
// by index. The generous alignment accommodates mmap-friendly data structures.
table Buffer {
  data:[ubyte] (force_align: 16);
}

Each buffer is defined as a raw array of unsigned 8-bit values, with the first value 16-byte-aligned in memory. This is the container type used for all of the arrays of weights (and any other constant values) held in the graph. The type and shape of the tensors are held separately; this array just holds the raw bytes that back up the data inside the arrays. Operations refer to these constant buffers by index inside this top-level vector.
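
For example, once you have the root-level buffers vector, pulling out the raw bytes for one entry looks roughly like this sketch (bounds checking is left to the caller; this isn’t code from the interpreter):

#include "tensorflow/lite/schema/schema_generated.h"

// Sketch: return the raw little-endian bytes backing buffer `buffer_index`,
// or nullptr for the empty buffer used by tensors with no constant data.
const uint8_t* GetBufferBytes(const tflite::Model* model, int buffer_index) {
  const tflite::Buffer* buffer = model->buffers()->Get(buffer_index);
  if (buffer->data() == nullptr) {
    return nullptr;
  }
  return buffer->data()->data();
}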

The next property we access is a list of subgraphs:

  auto* subgraphs = model->subgraphs();
  if (subgraphs->size() != 1) {
    error_reporter->Report("Only 1 subgraph is currently supported.
");
    initialization_status_ = kTfLiteError;
    return;
  }
  subgraph_ = (*subgraphs)[0];

A subgraph is a set of operators, the connections between them, and the buffers, inputs, and outputs that they use. There are some advanced models that might require multiple subgraphs in the future—for example, to support control flow—but all of the networks we want to support on microcontrollers at the moment have a single subgraph, so we can simplify our subsequent code by making sure the current model meets that requirement. To get more of an idea of what’s in a subgraph, we can look back at the schema:

// The root type, defining a subgraph, which typically represents an entire
// model.
table SubGraph {
  // A list of all tensors used in this subgraph.
  tensors:[Tensor];

  // Indices of the tensors that are inputs into this subgraph. Note this is
  // the list of non-static tensors that feed into the subgraph for inference.
  inputs:[int];

  // Indices of the tensors that are outputs out of this subgraph. Note this is
  // the list of output tensors that are considered the product of the
  // subgraph's inference.
  outputs:[int];

  // All operators, in execution order.
  operators:[Operator];

  // Name of this subgraph (used for debugging).
  name:string;
}

The first property every subgraph has is a list of tensors, and the MicroInterpreter code accesses it like this:

  tensors_ = subgraph_->tensors();

As we mentioned earlier, the Buffer objects just hold raw values for weights, without any metadata about their types or shapes. Tensors are the place where this extra information is stored for constant buffers. They also hold the same information for temporary arrays like inputs, outputs, or activation layers. You can see this metadata in their definition near the top of the schema file:

table Tensor {
  // The tensor shape. The meaning of each entry is operator-specific but
  // builtin ops use: [batch size, height, width, number of channels] (That's
  // Tensorflow's NHWC).
  shape:[int];
  type:TensorType;
  // An index that refers to the buffers table at the root of the model. Or,
  // if there is no data buffer associated (i.e. intermediate results), then
  // this is 0 (which refers to an always existent empty buffer).
  //
  // The data_buffer itself is an opaque container, with the assumption that the
  // target device is little-endian. In addition, all builtin operators assume
  // the memory is ordered such that if `shape` is [4, 3, 2], then index
  // [i, j, k] maps to data_buffer[i*3*2 + j*2 + k].
  buffer:uint;
  name:string;  // For debugging and importing back into tensorflow.
  quantization:QuantizationParameters;  // Optional.

  is_variable:bool = false;
}

The shape is a simple list of integers that indicates the tensor’s dimensions, whereas type is an enum mapping to the possible data types that are supported in TensorFlow Lite. The buffer property indicates which Buffer in the root-level list has the actual values backing up this tensor if it’s a constant read from a file, or is zero if the values are calculated dynamically (for example, for an activation layer). The name is there only to give a human-readable label for the tensor, which can help with debugging, and the quantization property defines how to map low-precision values into real numbers. Finally, the is_variable member exists to support future training and other advanced applications, but it doesn’t need to be used on microcontroller units (MCUs).
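
Putting the Tensor and Buffer tables together, here’s a sketch (again, not interpreter code) of how you might read a tensor’s metadata and decide whether it has constant data backing it:

#include "tensorflow/lite/schema/schema_generated.h"

// Sketch: summarize the metadata for tensor `i` in a subgraph. Any constant
// values live in the root-level buffers vector, reached via tensor->buffer().
void DescribeTensor(const tflite::SubGraph* subgraph, int i) {
  const tflite::Tensor* tensor = subgraph->tensors()->Get(i);
  const int rank = tensor->shape()->size();          // e.g. 4 for NHWC
  const tflite::TensorType type = tensor->type();    // e.g. TensorType_UINT8
  const uint32_t buffer_index = tensor->buffer();    // 0 means no constant data
  const bool is_constant = (buffer_index != 0);
  (void)rank;
  (void)type;
  (void)is_constant;
}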

Going back to the MicroInterpreter code, the second major property we pull from the subgraph is a list of operators:

operators_ = subgraph_->operators();

This list holds the graph structure of the model. To understand how this is encoded, we can go back to the schema definition of Operator:

// An operator takes tensors as inputs and outputs. The type of operation being
// performed is determined by an index into the list of valid OperatorCodes,
// while the specifics of each operations is configured using builtin_options
// or custom_options.
table Operator {
  // Index into the operator_codes array. Using an integer here avoids
  // complicate map lookups.
  opcode_index:uint;

  // Optional input and output tensors are indicated by -1.
  inputs:[int];
  outputs:[int];

  builtin_options:BuiltinOptions;
  custom_options:[ubyte];
  custom_options_format:CustomOptionsFormat;

  // A list of booleans indicating the input tensors which are being mutated by
  // this operator.(e.g. used by RNN and LSTM).
  // For example, if the "inputs" array refers to 5 tensors and the second and
  // fifth are mutable variables, then this list will contain
  // [false, true, false, false, true].
  //
  // If the list is empty, no variable is mutated in this operator.
  // The list either has the same length as `inputs`, or is empty.
  mutating_variable_inputs:[bool];
}

The opcode_index member is an index into the root-level operator_codes vector inside Model. Because a particular kind of operator, like Conv2D, might show up many times in one graph, and some ops require a string to define them, it saves serialization size to keep all of the op definitions in one top-level array and refer to them indirectly from subgraphs.

The inputs and outputs arrays define the connections between an operator and its neighbors in the graph. These are lists of integers that refer to the tensor array in the parent subgraph, and may refer to constant buffers read from the model, inputs fed into the network by the application, the results of running other operations, or output destination buffers that will be read by the application after calculations have finished.

One important thing to know about this list of operators held in the subgraph is that they are always in topological order, so that if you execute them from the beginning of the array to the end, all of the inputs for a given operation that rely on previous operations will have been calculated by the time that operation is reached. This makes writing interpreters much simpler, because the execution loop doesn’t need to do any graph operations beforehand and can just execute the operations in the order they’re listed. It does mean that running the same subgraph in different orders (for example, to use back-propagation with training) is not straightforward, but TensorFlow Lite’s focus is on inference so this is a worthwhile trade-off.
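
As a sketch of why that ordering matters in practice, an interpreter’s inner loop can simply walk the operator vector from front to back. This simplified function is not the real MicroInterpreter code, which also handles tensor allocation, kernel dispatch, and error reporting:

#include "tensorflow/lite/schema/schema_generated.h"

// Simplified sketch: execute a topologically sorted operator list in order.
void RunInOrder(const tflite::Model* model, const tflite::SubGraph* subgraph) {
  const auto* operators = subgraph->operators();
  for (flatbuffers::uoffset_t i = 0; i < operators->size(); ++i) {
    const tflite::Operator* op = operators->Get(i);
    // Resolve which kind of operation this is via the root-level
    // operator_codes vector.
    const tflite::OperatorCode* opcode =
        model->operator_codes()->Get(op->opcode_index());
    // Because the list is in execution order, every tensor listed in
    // op->inputs() that comes from another op has already been computed.
    (void)opcode;  // ...dispatch to the matching kernel here...
  }
}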

Operators also usually require parameters, like the shape and stride for the filters for a Conv2D kernel. The representation of these is unfortunately pretty complex. For historical reasons, TensorFlow Lite supports two different families of operations. Built-in operations came first, and are the most common ops that are used in mobile applications. You can see a list in the schema. As of November 2019 there are only 122, but TensorFlow supports more than 800 operations—so what can we do about the remainder? Custom operations are defined by a string name instead of a fixed enum like built-ins, so they can be added more easily without touching the schema.

For built-in ops, the parameter structures are listed in the schema. Here’s an example for Conv2D:

table Conv2DOptions {
  padding:Padding;
  stride_w:int;
  stride_h:int;
  fused_activation_function:ActivationFunctionType;
  dilation_w_factor:int = 1;
  dilation_h_factor:int = 1;
}

Hopefully most of the members listed look somewhat familiar, and they are accessed in the same way as other FlatBuffers objects: through the builtin_options union of each Operator object, with the appropriate type picked based on the operator code (though the code to do so is based on a monster switch statement).
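
For example, once the operator code tells you an op is a Conv2D, the generated union accessor gives you the typed options table. Here’s a small sketch; the framework’s own option parsing goes through that big switch rather than a helper like this:

#include "tensorflow/lite/schema/schema_generated.h"

// Sketch: read a Conv2D parameter from an operator known to be a Conv2D.
// Returns -1 if the options are missing.
int GetConvStrideWidth(const tflite::Operator* op) {
  const tflite::Conv2DOptions* options =
      op->builtin_options_as_Conv2DOptions();
  if (options == nullptr) {
    return -1;
  }
  return options->stride_w();
}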

If the operator code turns out to indicate a custom operator, we don’t know the structure of the parameter list ahead of time, so we can’t generate a typed code object for it. Instead, the argument information is packed into a FlexBuffer. This is a format that the FlatBuffers library offers for encoding arbitrary data when you don’t know the structure in advance, which means the code implementing the operator has to specify the expected type of each value as it reads it, with messier syntax than a built-in’s. Here’s an example from some object detection code:

  const flexbuffers::Map& m = flexbuffers::GetRoot(buffer_t, length).AsMap();
  op_data->max_detections = m["max_detections"].AsInt32();

The buffer pointer being referenced in this example ultimately comes from the custom_options member of the Operator table, showing how you can access parameter data from this property.

The final member of Operator is mutating_variable_inputs. This is an experimental feature to help manage Long Short-Term Memory (LSTM) and other ops that might want to treat their inputs as variables, and shouldn’t be relevant for most MCU applications.

Those are the key parts of the TensorFlow Lite serialization format. There are a few other members we haven’t covered (like metadata_buffer in Model), but these are for nonessential features that are optional and so can usually be ignored. Hopefully this overview will be enough to get you started on reading, writing, and debugging your own model files.

Porting TensorFlow Lite Mobile Ops to Micro

There are more than one hundred “built-in” operations in the mainline TensorFlow Lite version targeting mobile devices. TensorFlow Lite for Microcontrollers reuses most of the code, but because the default implementations of these ops bring in dependencies like pthreads, dynamic memory allocation, or other features unavailable on embedded systems, the op implementations (also known as kernels) require some work to make them available on Micro.

Eventually, we hope to unify the two branches of op implementations, but that effort requires some design and API changes across the framework, so it won’t be happening in the short term. Most ops should already have Micro implementations, but if you discover one that’s available on mobile TensorFlow Lite but not through the embedded version, this section walks you through the conversion process. After you’ve identified the operation you’re going to port, there are several stages.

Separate the Reference Code

All of the ops listed should already have reference code, but the functions are likely to be in reference_ops.h. This is a monolithic header file that’s almost 5,000 lines long. Because it covers so many operations, it pulls in a lot of dependencies that are not available on embedded platforms. To begin the porting process, you first need to extract the reference functions that are required for the operation you’re working on into a separate header file. You can see examples of these smaller headers in conv.h (https://oreil.ly/vH-6) and pooling.h. The reference functions themselves should have names that match the operation they implement, and there will typically be multiple implementations for different data types, sometimes using templates.
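
As a rough illustration of the kind of function that ends up in one of these smaller headers, here’s a hypothetical, heavily simplified version for an element-wise op. The real headers take shape structs rather than a flat element count and often provide several data-type variants, so treat this only as a sketch of the pattern:

#ifndef HYPOTHETICAL_REFERENCE_FLOOR_H_
#define HYPOTHETICAL_REFERENCE_FLOOR_H_

#include <cmath>

namespace tflite {
namespace reference_ops {

// Element-wise floor over a flat array. The real reference functions take
// shape structs describing the tensors; this is simplified for illustration.
inline void Floor(const float* input_data, float* output_data, int flat_size) {
  for (int i = 0; i < flat_size; ++i) {
    output_data[i] = std::floor(input_data[i]);
  }
}

}  // namespace reference_ops
}  // namespace tflite

#endif  // HYPOTHETICAL_REFERENCE_FLOOR_H_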

As soon as the file is separated from the larger header, you’ll need to include it from reference_ops.h so that all the existing users of that header still see the functions you’ve moved (though our Micro code will include only the separated headers individually). You can see how we do this for conv2d here. You’ll also need to add the header to the kernels/internal/BUILD:reference_base and kernels/internal/BUILD:legacy_reference_base build rules. After you’ve made those changes, you should be able to run the test suite and see all of the existing mobile tests passing:

bazel test tensorflow/lite/kernels:all

This is a good point to create an initial pull request for review. You haven’t ported anything to the micro branch yet, but you’ve prepared the existing code for the change, so it’s worth trying to get this work reviewed and submitted while you work on the following steps.

Create a Micro Copy of the Operator

Each micro operator implementation is a modified copy of a mobile version held in tensorflow/lite/kernels/. For example, the micro conv.cc is based on the mobile conv.cc. There are a few big differences. First, dynamic memory allocation is trickier in embedded environments, so the creation of the OpData structure, which caches values calculated for use during inference, is moved into a separate function so that it can be called during Invoke() rather than being returned from Prepare(). This involves a little more work for each Invoke() call, but the reduction in memory overhead usually makes sense for microcontrollers.

Second, most of the parameter-checking code in Prepare() is usually removed. It might be better to enclose this in #if defined(DEBUG) rather than removing it entirely, but the removal keeps the code size to a minimum. All references to external frameworks (Eigen, gemmlowp, cpu_backend_support) should be removed from the includes and the code. In the Eval() function, everything but the path that calls the function in the reference_ops:: namespace should be removed.

The resulting modified operator implementation should be saved in a file with the same name as the mobile version (usually the lowercase version of the operator name) in the tensorflow/lite/micro/kernels/ folder.
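
Here’s a hedged sketch of that structural pattern, using made-up names (MyOpData, CalculateMyOpData) rather than the real conv.cc code, which carries much more detail; the common.h include path may also differ between TensorFlow versions:

#include "tensorflow/lite/c/common.h"

namespace {

// Values derived from the node's parameters that the kernel needs at
// inference time. In the mobile kernels this would be allocated once in
// Prepare(); here it lives on the stack for every call.
struct MyOpData {
  int32_t output_multiplier;
  int output_shift;
};

void CalculateMyOpData(TfLiteContext* context, TfLiteNode* node,
                       MyOpData* data) {
  // Fill in `data` from the node's builtin parameters and tensor shapes.
  (void)context;
  (void)node;
  data->output_multiplier = 0;
  data->output_shift = 0;
}

TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
  // A little extra work on every Invoke(), in exchange for avoiding dynamic
  // memory allocation between Prepare() and Eval().
  MyOpData data;
  CalculateMyOpData(context, node, &data);
  // ...call into the reference_ops:: implementation using `data`...
  return kTfLiteOk;
}

}  // namespace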

Port the Test to the Micro Framework

We can’t run the full Google Test framework on embedded platforms, so we need to use the Micro Test library instead. This should look familiar to users of GTest, but it avoids any constructs that require dynamic memory allocation or C++ global initialization. There’s more documentation elsewhere in this book.

You’ll want to run the same tests that you run on mobile in the embedded environment, so you’ll need to use the version in tensorflow/lite/kernels/<your op name>_test.cc as a starting point. For example, look at tensorflow/lite/kernels/conv_test.cc and the ported version tensorflow/lite/micro/kernels/conv_test.cc. Here are the big differences:

  • The mobile code relies on C++ STL classes like std::map and std::vector, which require dynamic memory allocation.

  • The mobile code also uses helper classes and passes in data objects in a way that involves allocations.

  • The micro version allocates all of its data on the stack, using std::initializer_list to pass down objects that look a lot like std::vectors, but do not require dynamic memory allocation.

  • Calls to run a test are expressed as function calls rather than object allocations because this helps reuse a lot of code without hitting allocation issues.

  • Most standard error checking macros are available, but with a TF_LITE_MICRO_ prefix. For example, EXPECT_EQ becomes TF_LITE_MICRO_EXPECT_EQ.

The tests all have to live in one file, and be surrounded by a single TF_LITE_MICRO_TESTS_BEGIN/TF_LITE_MICRO_TESTS_END pair. Under the hood this actually creates a main() function so that the tests can be run as a standalone binary.

We also try to ensure that the tests rely on only the kernel code and API, not bringing in other classes like the interpreter. The tests should call into the kernel implementations directly, using the C API returned from GetRegistration(). This is because we want to ensure that the kernels can be used completely standalone, without needing the rest of the framework, so the testing code should avoid those dependencies, too.
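
To show the overall shape, here’s a minimal sketch of a test file using these macros; a real kernel test would build tensors on the stack and call the op through its registration rather than checking trivial arithmetic:

#include "tensorflow/lite/micro/testing/micro_test.h"

TF_LITE_MICRO_TESTS_BEGIN

// A trivial test just to show the framework's structure; real kernel tests
// set up input and output tensors and invoke the kernel directly.
TF_LITE_MICRO_TEST(FrameworkSmokeTest) {
  const int expected = 4;
  TF_LITE_MICRO_EXPECT_EQ(expected, 2 + 2);
}

TF_LITE_MICRO_TESTS_END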

Build a Bazel Test

Now that you have created the operator implementation and test files, you’ll want to check whether they work. You’ll need to use the Bazel open source build system to do this. Add a tflite_micro_cc_test rule to the BUILD file and then try building and running this command line (replacing conv with your operator name):

bazel test tensorflow/lite/micro/kernels:conv_test --test_output=streamed

No doubt there will be compilation errors and test failures, so expect to spend some time iterating on fixing those.

Add Your Op to AllOpsResolver

Applications can choose to pull in only certain operator implementations for binary size reasons, but there’s an op resolver that pulls in all available operators, to make getting started easy. You should add a call to register your operator implementation in the constructor of all_ops_resolver.cc, and make sure the implementation and header files are included in the BUILD rules, too.
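
The addition itself is typically a single registration call in the constructor, along the lines of this fragment; the names follow the conv example, so check all_ops_resolver.cc in your checkout for the exact namespace and signature:

// Fragment of all_ops_resolver.cc (sketch): make the new kernel's
// registration available to the interpreter.
AllOpsResolver::AllOpsResolver() {
  // ...existing AddBuiltin() calls...
  AddBuiltin(BuiltinOperator_CONV_2D, Register_CONV_2D());
}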

Build a Makefile Test

So far, everything you’ve been doing has been within the micro branch of TensorFlow Lite, but you’ve been building and testing on x86. This is the easiest way to develop, and the initial task is to create portable, unoptimized implementations of all the ops, so we recommend doing as much as you can in this domain. At this point, though, you should have a completely working and tested operator implementation running on desktop Linux, so it’s time to begin compiling and testing on embedded devices.

The standard build system for Google open source projects is Bazel, but unfortunately it’s not easy to implement cross-compilation and support for embedded toolchains using it, so we’ve had to turn to the venerable Make for deployment. The Makefile itself is very complicated internally, but hopefully your new operator should be automatically picked up based on the name and location of its implementation file and test. The only manual step should be adding the reference header you created to the MICROLITE_CC_HDRS file list.

To test your operator in this environment, cd to the folder, and run this command (with your own operator name instead of conv):

make -f tensorflow/lite/micro/tools/make/Makefile test_conv_test

Hopefully this will compile and the test will pass. If not, run through the normal debugging procedures to work out what’s going wrong.

This is still running natively on your local Intel x86 desktop machine, though it’s using the same build machinery as the embedded targets. You can try compiling and flashing your code onto a real microcontroller like the SparkFun Edge now (just passing in TARGET=sparkfun_edge on the Makefile line should be enough), but to make life easier we also have software emulation of a Cortex-M3 device available. You should be able to run your test through this by executing the following command:

make -f tensorflow/lite/micro/tools/make/Makefile TARGET=bluepill test_conv_test

This can be a little flakey because sometimes the emulator takes too long to execute and the process times out, but hopefully giving it a second try will fix it. If you’ve gotten this far, we encourage you to contribute your changes back to the open source build if you can. The full process of open-sourcing your code can be a bit involved, but the TensorFlow Community guide is a good place to start.

Wrapping Up

After finishing this chapter, you might be feeling like you’ve been trying to drink from a fire hose. We’ve given you a lot of information about how TensorFlow Lite for Microcontrollers works. Don’t worry if you don’t understand it all, or even most of it—we just wanted to give you enough background so that if you do need to delve under the hood, you know where to begin looking. The code is all open source and is the ultimate guide to how the framework operates, but we hope this commentary will help you navigate its structure and understand why some of its design decisions were made.

After seeing how to run some prebuilt examples and taking a deep dive into how the library works, you’re probably wondering how you can apply what you’ve learned to your own applications. The remainder of the book concentrates on the skills you need to be able to deploy custom machine learning in your own products, covering optimization, debugging, and porting models, along with privacy and security.
