Prototyping a TinyML application directly on a physical device is really fun because we can instantly see our ideas at work in something that looks and feels like the real thing. However, before any application comes to life, we need to ensure that the model works as expected, ideally across different devices. Testing and debugging applications directly on microcontroller boards often takes a lot of development time, mainly because we need to upload the program to the device after every code change. This is where virtual platforms come in handy to make testing faster and more straightforward.
In this chapter, we will build an image classification application with TensorFlow Lite for Microcontrollers (TFLu) for an emulated Arm Cortex-M3 microcontroller. We will start by installing the Zephyr OS, the primary framework used in this chapter to accomplish our task. Next, we will design a tiny quantized CIFAR-10 model with TensorFlow (TF). This model will be capable of running on a microcontroller with only 256 KB of program memory and 64 KB of RAM. In the end, we will deploy an image classification application on an emulated Arm Cortex-M3 microcontroller through Quick Emulator (QEMU).
The aim of this chapter is to learn how to build and run a TFLu-based application with the Zephyr OS on a virtual platform and provide practical advice on the design of an image classification model for memory-constrained microcontrollers.
In this chapter, we're going to implement the following recipes:
To complete all the practical recipes of this chapter, we will need the following:
The source code and additional material are available in the Chapter07 folder of the book's GitHub repository (https://github.com/PacktPublishing/TinyML-Cookbook/tree/main/Chapter07).
In this recipe, we will install the Zephyr project, the framework used in this chapter to build and run the TFLu application on the emulated Arm Cortex-M3 microcontroller. At the end of this recipe, we will check whether everything works as expected by running a sample application on the virtual platform considered for our project.
To get started with this first recipe, we need to know what the Zephyr project is about.
Zephyr (https://zephyrproject.org/) is an open source Apache 2.0 project that provides a small-footprint Real-Time Operating System (RTOS) for various hardware platforms based on multiple architectures, including Arm Cortex-M, Intel x86, ARC, Nios II, and RISC-V. The RTOS has been designed for memory-constrained devices with security in mind.
Zephyr does not provide just an RTOS, though. It also offers a Software Development Kit (SDK) with a collection of ready-to-use examples and tools to build Zephyr-based applications for various supported devices, including virtual platforms through QEMU.
QEMU (https://www.qemu.org/) is an open source machine emulator that allows us to test programs without using real hardware. The Zephyr SDK supports two QEMU Arm Cortex-M-based microcontrollers, which are as follows:
From the preceding two QEMU platforms, we will use the LM3S6965. Our choice fell on the Texas Instruments board because it has a larger RAM capacity than the BBC micro:bit. Although both devices have the same program memory size (256 KB), the LM3S6965 has 64 KB of RAM, whereas the BBC micro:bit has only 16 KB, which is not enough to run a CIFAR-10 model.
The Zephyr installation consists of the following steps:
Important Note
The installation guide reported in this section refers to Zephyr 2.7.0 and the Zephyr SDK 0.13.1.
Before getting started, we recommend you have the Python Virtual Environment (virtualenv) tool installed to create an isolated Python environment. If you haven't installed it yet, open your terminal and use the following pip command:
$ pip install virtualenv
To launch the Python virtual environment, create a new directory (for example, zephyr):
$ mkdir zephyr && cd zephyr
Then, create a virtual environment inside the directory just created:
$ python -m venv env
The preceding command creates the env directory with all the executables and Python packages required for the virtual environment.
To use the virtual environment, you just need to activate it with the following command:
$ source env/bin/activate
If the virtual environment is activated, the shell will be prefixed with (env):
(env)$
Tip
You can deactivate the Python virtual environment at any time by typing deactivate in the shell.
The following steps will help you prepare the Zephyr environment and run a simple application on the virtual Arm Cortex-M3-based microcontroller:
$ cd ~/zephyrproject/zephyr/samples/synchronization
Zephyr provides ready-to-use applications in the samples/ folder to demonstrate the usage of RTOS features. Since our goal is to run an application on a virtual platform, we consider the synchronization sample because it does not require interfacing with external components (for example, LEDs).
$ west build -b qemu_cortex_m3 .
The sample is compiled with the west command (https://docs.zephyrproject.org/latest/guides/west/index.html). West is a tool developed by the Zephyr project to conveniently manage multiple repositories with a few commands. However, West is more than a repository manager: it can also plug in additional functionality through extensions. Zephyr exploits this pluggable mechanism to offer the commands to compile, flash, and debug applications (https://docs.zephyrproject.org/latest/guides/west/build-flash-debug.html).
The west command used to compile the application has the following syntax:
$ west build -b <BOARD> <EXAMPLE-TO-BUILD>
Let's break down the preceding command:
Once we have built the application, we can run it on the target device.
$ west build -t run
To run the application, we just need to use the west build command followed by the build system target (-t) as a command-line argument. Since we specified the target platform when we built the application, we can simply pass the run option to upload and run the program on the device.
If Zephyr is installed correctly, the synchronization sample will run on the virtual Arm Cortex-M3 platform and print the following output:
threadA: Hello World from arm!
threadB: Hello World from arm!
threadA: Hello World from arm!
threadB: Hello World from arm!
You can now close QEMU by pressing Ctrl + A, followed by X.
The tight memory constraint on the LM3S6965 forces us to design a model with extremely low memory utilization. In fact, the target microcontroller has four times less memory capacity than the Arduino Nano.
Despite this challenging constraint, in this recipe, we will be leveraging the following tiny model for the CIFAR-10 image classification, capable of running on LM3S6965:
The preceding network will be designed with TF and the Keras API.
The following Colab file (in the Designing and training a tiny CIFAR-10 model section) contains the code referred to in this recipe:
The network tailored in this recipe takes inspiration from the success of the MobileNet V1 on the ImageNet dataset classification. Our model aims to classify the 10 classes of the CIFAR-10 dataset: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.
The CIFAR-10 dataset is available at https://www.cs.toronto.edu/~kriz/cifar.html and consists of 60,000 RGB images with 32 x 32 resolution.
To understand why the proposed model can run successfully on LM3S6965, we want to outline the architectural design choices that make this network suitable for our target device.
As shown in Figure 7.1, the model has a convolution base, which acts as a feature extractor, and a classification head, which takes the learned features to perform the classification.
Early layers have large spatial dimensions and a low number of Output Feature Maps (OFMs) to learn simple features (for example, simple lines). Deeper layers, instead, have small spatial dimensions and a high number of OFMs to learn complex features (for example, shapes).
The model uses pooling layers to halve the spatial dimensionality of the tensors and reduce the risk of overfitting when increasing the OFMs. Generally, we want many activation maps in the deep layers to combine as many complex features as possible. Therefore, the idea is to shrink the spatial dimensions so that we can afford more OFMs.
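As a quick illustration of this trade-off, the following minimal sketch (the tensor and layer sizes here are illustrative assumptions, not taken from Figure 7.1) shows how a 2 x 2 pooling layer halves the spatial dimensions, leaving room to afford more OFMs in the next convolution:

import tensorflow as tf
from tensorflow.keras import layers

# A 32 x 32 input with 16 feature maps (batch dimension first)
x = tf.zeros((1, 32, 32, 16))

# 2 x 2 max pooling halves the spatial dimensions: 32 x 32 -> 16 x 16
x = layers.MaxPooling2D((2, 2))(x)
print(x.shape)  # (1, 16, 16, 16)

# With a smaller spatial size, we can afford more OFMs in the next layer
x = layers.Conv2D(48, (3, 3), padding="same")(x)
print(x.shape)  # (1, 16, 16, 48)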
In the following subsection, we will explain the design choice in using Depthwise Separable Convolution (DWSC) layers instead of the standard convolution 2D.
DWSC is the layer that made MobileNet V1 a success on the ImageNet dataset and is the heart of our proposed convolution-based architecture. This operator is what allowed MobileNet V1 to be an accurate model that can also run on devices with limited memory and computational resources.
As seen in Chapter 5, Indoor Scene Classification with TensorFlow Lite for Microcontrollers and the Arduino Nano, and shown in the following figure, DWSC is a depthwise convolution followed by a convolution layer with a 1 x 1 kernel size (that is, pointwise convolution):
To demonstrate the efficiency of this operator, consider the first DWSC layer in the network presented in Figure 7.1. As shown in the following diagram, the input tensor has a 32 x 32 x 16 dimension while the output tensor has a 32 x 32 x 24 dimension:
If we replace the DWSC with a regular convolution 2D with a 3 x 3 filter size, we will need 3,480 trainable parameters, of which 3,456 are weights (3 x 3 x 16 x 24), and 24 are biases. The DWSC, instead, just needs 560 trainable parameters, distributed as follows:
Therefore, in this particular case, the DWSC layer yields roughly six times fewer trainable parameters than a regular convolution 2D layer.
The model size reduction is not the only benefit this layer offers. The other advantage of using the DWSC is the reduction in arithmetic operations. In fact, although both layers are made of Multiply-Accumulate (MAC) operations, the DWSC needs considerably fewer MACs than convolution 2D.
This aspect is demonstrated by the following two formulas for the calculation of the total MAC operations for convolution 2D and the DWSC:
The formula is broken down as follows:
The calculation of the total MACs for the DWSC has two parts. The first part calculates the MAC operations for the depthwise convolution, assuming that the input and output tensors have the same number of feature maps. The second part calculates the MAC operations for the pointwise convolution.
If we use the preceding two formulas for the case reported in Figure 7.3, we will discover that convolution 2D needs 3,538,944 MAC operations while the DWSC needs only 540,672. Therefore, the DWSC reduces the computational complexity by more than six times.
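As a sanity check, the following short Python sketch reproduces both numbers using the standard MAC-count formulas for stride-1 convolutions with the tensor sizes of Figure 7.3 (this is only a back-of-the-envelope verification, not code from the recipe):

# Output spatial size, kernel size, input and output feature maps (Figure 7.3)
w_out, h_out = 32, 32
k_w, k_h = 3, 3
c_in, c_out = 16, 24

# Convolution 2D: each output element needs k_w x k_h x c_in MACs
mac_conv2d = w_out * h_out * c_out * k_w * k_h * c_in

# DWSC: depthwise part (one 3 x 3 filter per input channel) + pointwise (1 x 1) part
mac_dwsc = (w_out * h_out * k_w * k_h * c_in) + (w_out * h_out * c_in * c_out)

print(mac_conv2d)  # 3538944
print(mac_dwsc)    # 540672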
Hence, the DWSC layer is doubly efficient, since it reduces both the number of trainable parameters and the number of arithmetic operations involved.
Now that we know the benefits of this layer, let's discover how to design a model that can run on our target device.
Our goal is to produce a model that fits in 256 KB of program memory and runs with 64 KB of RAM. The program memory usage can be obtained directly from the size of the generated .tflite model. Alternatively, you can check the Total params value returned by the Keras summary() method (https://keras.io/api/models/model/#summary-method) to get an indication of how big the model will be. Total params represents the number of trainable parameters, and it is affected mainly by the number of OFMs and the number of layers. In our case, the convolution base has five trainable layers with a maximum of 192 activation maps. This choice will make our model utilize just 30% of the total program memory.
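For example, once the model is trained and converted (as we will do later in this chapter), the following minimal sketch, which assumes the Keras model object (model) and the cifar10.tflite file generated in the next recipes, checks both indicators:

import os

# Number of trainable parameters reported by Keras (Total params)
print("Trainable parameters:", model.count_params())

# Program memory footprint: the size of the generated TFLite file in bytes
print("TFLite model size:", os.path.getsize("cifar10.tflite"))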
The estimation of the RAM utilization is a bit more complicated and depends upon the model architecture. All the non-constant variables, such as the network input, output, and intermediate tensors, stay in RAM. However, although the network may need several tensors, TFLu has a memory manager capable of efficiently providing portions of memory at runtime. For a sequential model such as ours, where each layer has one input and one output tensor, a ballpark figure for the RAM utilization is given by the sum of the following:
In our network, the first DWSC produces the largest intermediate tensor with 24,576 elements (32 x 32 x 24), as shown in the following figure:
As you can see from the preceding diagram, the first DWSC produces a tensor with 24 OFMs, which we found to be a good compromise between accuracy and RAM utilization. However, you may consider reducing this number further to make the model even smaller and more performant.
Create a new Colab project and follow these steps to design and train a quantized CIFAR-10 model with TFLite:
(train_imgs, train_lbls), (test_imgs, test_lbls) = datasets.cifar10.load_data()
train_imgs = train_imgs / 255.0
test_imgs = test_imgs / 255.0
This step ensures that all data is on the same scale.
def separable_conv(i, ch):
    x = layers.DepthwiseConv2D((3,3), padding="same")(i)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.Conv2D(ch, (1,1), padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.Activation("relu")(x)
The separable_conv() function accepts the following input arguments:
The batch normalization layer standardizes the input to a layer and makes the model training faster and more stable.
input = layers.Input((32,32,3))
x = layers.Conv2D(16, (3, 3), padding='same')(input)
x = layers.BatchNormalization()(x)
x = layers.Activation("relu")(x)
x = separable_conv(x, 24)
x = layers.MaxPooling2D((2, 2))(x)
x = separable_conv(x, 48)
x = layers.MaxPooling2D((2, 2))(x)
x = separable_conv(x, 96)
x = separable_conv(x, 192)
x = layers.MaxPooling2D((2, 2))(x)
We use pooling layers to reduce the spatial dimensionality of the feature maps through the network. Although we can use DWSC with non-unit strides to accomplish a similar sub-sampling task, we preferred pooling layers to keep the number of trainable parameters low.
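For reference, a strided variant of the DWSC block would look like the following sketch. This hypothetical separable_conv_strided() helper is not used in this recipe and is shown only to illustrate the sub-sampling alternative mentioned above:

def separable_conv_strided(i, ch, stride=2):
    # The depthwise convolution itself performs the sub-sampling
    x = layers.DepthwiseConv2D((3, 3), strides=(stride, stride), padding="same")(i)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.Conv2D(ch, (1, 1), padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.Activation("relu")(x)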
x = layers.Flatten()(x)
x = layers.Dropout(0.2)(x)
x = layers.Dense(10)(x)
model = Model(input, x)
model.summary()
As shown in the following screenshot, the model summary returns roughly 60,000 parameters:
In the case of 8-bit quantization, 60,000 floating-point parameters correspond to 60,000 8-bit integer values. Therefore, the weights contribute roughly 60 KB to the model size, well below the 256 KB maximum target. However, we should not take this number as the model size, since what we deploy on a microcontroller is the TFLite file, which also contains the network architecture and the quantization parameters.
A ballpark figure for the RAM utilization can be estimated from the size of each intermediate tensor in the network. This information can be obtained from the output of model.summary(). As anticipated in the previous Getting ready section, the intermediate tensors of the first DWSC layer have the largest number of elements. The following screenshot, taken from the output of model.summary(), reports the tensor shapes for these two tensors:
As you can see from the DWSC area marked in the preceding screenshot, the tensors with the largest number of elements are as follows:
Therefore, the expected memory utilization for the intermediate tensors should be in the order of 41 KB. To this number, we should add the memory for the input and output nodes to get a more precise ballpark figure of the RAM usage. The input and output tensors need 3,082 bytes, of which 3,072 bytes are for the input and 10 bytes are for the output. In total, we expect to use roughly 44 KB of RAM during the model inference, which is less than the 64 KB target.
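The following short sketch reproduces this ballpark calculation, assuming 8-bit quantization so that each tensor element takes 1 byte:

# Two largest intermediate tensors (input and output of the first DWSC block)
dwsc_in = 32 * 32 * 16    # 16,384 elements
dwsc_out = 32 * 32 * 24   # 24,576 elements

# Network input (32 x 32 RGB image) and output (10 classes)
net_in = 32 * 32 * 3
net_out = 10

# With int8 tensors, one element corresponds to one byte
ram_bytes = dwsc_in + dwsc_out + net_in + net_out
print(ram_bytes)  # 44042 bytes, roughly 44 KB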
Note
In Figure 7.6, there are three layers with the (None, 32, 32, 16) output shape: conv0_dwsc2, bn1_dwsc2, and act1_dwsc2. However, only the pointwise convolution layer (conv0_dwsc2) counts toward the memory utilization of the intermediate tensors, because the batch normalization (bn1_dwsc2) and activation (act1_dwsc2) layers will be fused into the convolution (conv0_dwsc2) by the TFLite converter.
model.compile(optimizer='adam', loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), metrics=['accuracy'])
model.fit(train_imgs, train_lbls, epochs=10, validation_data=(test_imgs, test_lbls))
After 10 epochs, the model should obtain an accuracy of 73% on the validation dataset.
model.save("cifar10")
Our CIFAR-10 model is now ready to be quantized with the TFLite converter.
The tiny model we have just trained can classify the 10 classes of CIFAR-10 with an accuracy of 73%. However, what is the accuracy of the quantized variant generated by the TFLite converter?
In this recipe, we will quantize the model with the TFLite converter and show how to perform this accuracy evaluation on the test dataset with the TFLite Python interpreter. After the accuracy evaluation, we will convert the TFLite model to a C-byte array.
The following Colab file (the Evaluating the accuracy of the quantized model section) contains the code referred to in this recipe:
In this section, we will explain why the accuracy of the TFLite model may differ from the trained one.
As we know, the trained model needs to be converted to a more compact and lightweight representation before being deployed on a resource-constrained device such as a microcontroller.
Quantization is the essential part of this step to make the model small and improve the inference performance. However, post-training quantization may change the model accuracy because of the arithmetic operations at a lower precision. Therefore, it is crucial to check whether the accuracy of the generated .tflite model is within an acceptable range before deploying it into the target device.
Unfortunately, TFLite does not provide a Python tool for the model accuracy evaluation. Hence, we will use the TFLite Python interpreter to accomplish this task. The interpreter will allow us to feed the input data to the network and read the classification result. The accuracy will be reported as the fraction of samples correctly classified from the test dataset.
Follow these steps to evaluate the accuracy of the quantized CIFAR-10 model on the test dataset:
cifar_ds = tf.data.Dataset.from_tensor_slices(train_imgs).batch(1)

def representative_data_gen():
    for i_value in cifar_ds.take(100):
        i_value_f32 = tf.dtypes.cast(i_value, tf.float32)
        yield [i_value_f32]
The TFLite converter uses the representative dataset to estimate the quantization parameters.
tflite_conv = tf.lite.TFLiteConverter.from_saved_model("cifar10")
tflite_conv.representative_dataset = tf.lite.RepresentativeDataset(representative_data_gen)
tflite_conv.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_conv.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
tflite_conv.inference_input_type = tf.int8
tflite_conv.inference_output_type = tf.int8
For quantizing the TF model to 8-bit, we import the SavedModel directory (cifar10) into the TFLite converter and enforce full integer quantization.
tfl_model = tflite_conv.convert()
open("cifar10.tflite", "wb").write(tfl_model)
print(len(tfl_model))
The expected model size is 81,304 bytes. As you can see, the model can fit in 256 KB of program memory.
tfl_inter = tf.lite.Interpreter(model_content=tfl_model)
tfl_inter.allocate_tensors()
Get the quantization parameters of the input and output nodes:
i_details = tfl_inter.get_input_details()[0]
o_details = tfl_inter.get_output_details()[0]
i_quant = i_details["quantization_parameters"]
o_quant = o_details["quantization_parameters"]
i_scale = i_quant['scales'][0]
i_zero_point = i_quant['zero_points'][0]
o_scale = o_quant['scales'][0]
o_zero_point = o_quant['zero_points'][0]
Initialize a variable to zero (num_correct_samples) to keep track of the correct classifications:
num_correct_samples = 0
num_total_samples = len(list(test_imgs))
Iterate over the test samples:
for i_value, o_value in zip(test_imgs, test_lbls):
    input_data = i_value.reshape((1, 32, 32, 3))
    i_value_f32 = tf.dtypes.cast(input_data, tf.float32)
Quantize each test sample:
    i_value_f32 = i_value_f32 / i_scale + i_zero_point
    i_value_s8 = tf.cast(i_value_f32, dtype=tf.int8)
Initialize the input node with the quantized sample and start the inference:
    tfl_inter.set_tensor(i_details["index"], i_value_s8)
    tfl_inter.invoke()
Read the classification result and dequantize the output to a floating point:
    o_pred = tfl_inter.get_tensor(o_details["index"])[0]
    o_pred_f32 = (o_pred - o_zero_point) * o_scale
Compare the classification result with the expected output class:
    if np.argmax(o_pred_f32) == o_value:
        num_correct_samples += 1
print("Accuracy:", num_correct_samples/num_total_samples)
After a few minutes, the accuracy result will be printed in the output log. The expected accuracy should still be around 73%.
!apt-get update && apt-get -qq install xxd
!xxd -i cifar10.tflite > model.h
You can download the model.h and cifar10.tflite files from Colab's left pane.
Our application will be running on a virtual platform with no access to a camera module. Therefore, we need to supply a valid test input image into our application to check whether the model works as expected.
In this recipe, we will get an image from the test dataset that returns a correct classification for the ship class. The sample will then be converted to an int8_t C array and saved in an input.h file.
The following Colab file (refer to the Converting a NumPy image to a C-byte array section) contains the code referred to in this recipe:
To get ready for this recipe, we just need to know how to prepare the C file containing the input test image. The structure of this file is quite simple and reported in the following figure:
As you can observe from the file structure, we only need an array and two variables to describe our input test sample, which are as follows:
The input image will be obtained from the test dataset. Therefore, we will need to implement a function in Python to convert an image stored in NumPy format to a C array.
Follow these steps to generate a C header file containing a ship image from the test dataset:
def array_to_str(data):
    NUM_COLS = 12
    val_string = ''
    for i, val in enumerate(data):
        val_string += str(val)
        if (i + 1) < len(data):
            val_string += ','
        if (i + 1) % NUM_COLS == 0:
            val_string += '\n'
    return val_string
In the preceding code, the NUM_COLS variable limits the number of values on a single row. In our case, NUM_COLS is set to 12 so that we can add a newline character after every 12 values.
def gen_h_file(size, data, ilabel):
    str_out = 'int8_t g_test[] = '
    str_out += "{"
    str_out += f'{data}'
    str_out += '};\n'
    str_out += f"const int g_test_len = {size};\n"
    str_out += f"const int g_test_ilabel = {ilabel};\n"
    return str_out
As you can see from the preceding code, the function expects {data} to be a single string of comma-separated integer values.
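For example, calling the two helper functions on a tiny dummy array (purely illustrative, not part of the recipe) produces a header with the same layout as the one we will generate for the ship image:

import numpy as np

dummy = np.array([1, -2, 3], dtype=np.int8)
print(gen_h_file(dummy.size, array_to_str(dummy), 8))
# int8_t g_test[] = {1,-2,3};
# const int g_test_len = 3;
# const int g_test_ilabel = 8;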
imgs = list(zip(test_imgs, test_lbls))
cols = ['Image', 'Label']
df = pd.DataFrame(imgs, columns = cols)
cond = df['Label'] == 8
ship_samples = df[cond]
In the preceding code, 8 is the index for the ship class.
c_code = ""
for index, row in ship_samples.iterrows():
    i_value = np.asarray(row['Image'].tolist())
    o_value = np.asarray(row['Label'].tolist())
    o_pred_f32 = classify(i_value, o_value)
    if np.argmax(o_pred_f32) == o_value:
        i_value_f32 = i_value / i_scale + i_zero_point
        i_value_s8 = i_value_f32.astype(dtype=np.int8)
        i_value_s8 = i_value_s8.ravel()
        # Generate a string from the NumPy array
        val_string = array_to_str(i_value_s8)
        # Generate the C header file
        c_code = gen_h_file(i_value_s8.size, val_string, 8)
        break

with open("input.h", 'w') as file:
    file.write(c_code)
You can download the input.h file containing the input test image from Colab's left pane.
Only a few steps are separating us from the completion of this project. Now that we have the input test image, we can leave Colab's environment and focus on the application with the Zephyr OS.
In this recipe, we will prepare the skeleton of the TFLu project from the pre-built TFLu hello_world sample available in the Zephyr SDK.
The following C files contain the code referred to in this recipe:
https://github.com/PacktPublishing/TinyML-Cookbook/blob/main/Chapter07/ZephyrProject/Skeleton
This section aims to provide the basis for starting a new TFLu project with the Zephyr OS from scratch.
The easiest way to create a project is to copy and edit one of the pre-built samples for TFLu. The samples are available in the ~/zephyrproject/zephyr/samples/modules/tflite-micro folder. At the time of writing, there are two ready-to-use examples:
In this recipe, we will base our application on the hello_world sample, and the following screenshot shows what you should find in its directory:
The hello_world folder contains three subfolders, but only src/ is of interest to us because it contains the source code for the application. However, not all the files in src/ are essential for our project. For example, assert.cc, constants.h, constants.c, model.cc, model.h, output_handler.cc, and output_handler.h are only required for the sine wave sample application. Therefore, the only C files needed for a new TFLu project are as follows:
Finally, the CMakeLists.txt and prj.conf files in the hello_world directory are required for building the application. We will learn more about these files in the last recipe of this chapter.
Open the terminal and follow these steps to create a new TFLu project:
$ cd ~/zephyrproject/zephyr/samples/modules/tflite-micro/
$ mkdir cifar10
$ cp -r hello_world/* cifar10
constants.h, constants.c, model.cc, model.h, output_handler.cc, output_handler.h, and assert.cc
These files can be removed because they are only required for the sine wave sample application, as explained in the Getting ready section of this recipe.
Once you have copied the files, the cifar10/src folder should contain the following files:
Before continuing, ensure you have the files listed in the previous screenshot.
Now, open your default C editor (for example, Vim) to make some code changes in the main.c and main_functions.cc files.
int main(int argc, char *argv[]) {
  setup();
  while(true) {
    loop();
  }
  return 0;
}
The preceding code exactly replicates the behavior of an Arduino sketch, where setup() is called once and loop() is repeated indefinitely.
Next, replace g_model with the name of the array in model.h. The g_model variable is used when calling tflite::GetModel().
Now that we have the project structure ready, we can finally implement our application.
The skeleton of our Zephyr project is ready, so we just need to finalize our application to classify our input test image.
In this recipe, we will see how to build the TFLu application and run the program on the emulated Arm Cortex-M3-based microcontroller.
The following C files contain the code referred to in this recipe:
https://github.com/PacktPublishing/TinyML-Cookbook/blob/main/Chapter07/ZephyrProject/CIFAR10
Most of the ingredients required for developing this recipe are related to TFLu and have already been discussed in earlier chapters, such as Chapter 3, Building a Weather Station with TensorFlow Lite for Microcontrollers, or Chapter 5, Indoor Scene Classification with TensorFlow Lite for Microcontrollers and the Arduino Nano. However, there is one small detail of TFLu that has a big impact on the program memory usage that we haven't discussed yet.
In this section, we will talk about the tflite::MicroMutableOpResolver interface.
As we know from our previous projects, the TFLu interpreter is responsible for preparing the computation for a given model. One of the things that the interpreter needs to know is the function pointer for each operator to run. So far, we have provided this information with tflite::AllOpsResolver. However, tflite::AllOpsResolver is not recommended because of its heavy program memory usage. In our case, this interface would even prevent the application from building because of the low program memory capacity on the target device. Therefore, TFLu offers tflite::MicroMutableOpResolver, an alternative and more efficient interface to load only the operators required by the model. To know which operators the model needs, you can visualize the TFLite model (.tflite) file with the Netron web application (https://netron.app/).
Let's start this recipe by visualizing the architecture of our TFLite CIFAR-10 model file (cifar10.tflite) with Netron.
The following screenshot shows a slice of our model visualized with this tool:
Inspecting the model with Netron, we can see that the model only uses five operators: Conv2D, DepthwiseConv2D, MaxPool2D, Reshape, and FullyConnected. This information will be used to initialize tflite::MicroMutableOpResolver.
Now, open your default C editor and open the main_functions.cc file.
Follow these steps to build the TFLu application:
#include "input.h"
constexpr int tensor_arena_size = 52000;
Note
The original variable name for the tensor arena is kTensorArenaSize. To keep consistency with the lower_case naming convention used in the book, we have renamed this variable to tensor_arena_size.
The TFLu tensor arena is the portion of memory allocated by the user to accommodate the network input, output, intermediate tensors, and other data structures required by TFLu. The arena size should be a multiple of 16 to have a 16-byte data alignment.
As we have seen from the design of the CIFAR-10 model, the expected RAM usage for the model inference is in the order of 44 KB. Therefore, 52,000 bytes is okay for our case because it is greater than 44 KB, a multiple of 16, and less than 64 KB, the maximum RAM capacity.
uint8_t *tensor_arena = nullptr;
The tensor arena is too big to be placed on the stack. Therefore, we should dynamically allocate this memory in the setup() function.
tflite::MicroMutableOpResolver<5> resolver;
This object is created by providing the maximum number of different operations that the model requires as a template argument.
float o_scale = 0.0f;
int32_t o_zero_point = 0;
resolver.AddConv2D();
resolver.AddDepthwiseConv2D();
resolver.AddMaxPool2D();
resolver.AddReshape();
resolver.AddFullyConnected();
static tflite::MicroInterpreter static_interpreter(model, resolver, tensor_arena, tensor_arena_size, error_reporter);
interpreter = &static_interpreter;
const auto* o_quantization = reinterpret_cast<TfLiteAffineQuantization*>(output->quantization.params);
o_scale = o_quantization->scale->data[0];
o_zero_point = o_quantization->zero_point->data[0];
for(int i = 0; i < g_test_len; i++) {
  input->data.int8[i] = g_test[i];
}
Next, run the inference:
TfLiteStatus invoke_status = interpreter->Invoke();
size_t ix_max = 0;
float pb_max = 0;
for (size_t ix = 0; ix < 10; ix++) {
  int8_t out_val = output->data.int8[ix];
  float pb = ((float)out_val - o_zero_point) * o_scale;
  if(pb > pb_max) {
    ix_max = ix;
    pb_max = pb;
  }
}
The preceding code iterates over the quantized output values and returns the class (ix_max) with the highest score.
if(ix_max == g_test_ilabel) {
If so, print CORRECT classification! and return the classification result:
  static const char *label[] = {"airplane", "automobile", "bird", "cat", "deer", "dog", "frog", "horse", "ship", "truck"};
  printf("CORRECT classification! %s\n", label[ix_max]);
  while(1);
}
Now, open the terminal, and use the following command to build the project for qemu_cortex_m3:
$ cd ~/zephyrproject/zephyr/samples/modules/tflite-micro/cifar10
$ west build -b qemu_cortex_m3 .
After a few seconds, the west tool should display the following output in the terminal, confirming that the program has been successfully compiled:
From the summary generated by the west tool, you can see that our CIFAR-10 application uses 52.57% of the program memory (FLASH) and 6.92% of the RAM (SRAM). However, we should not be misled by the RAM usage figure. In fact, the summary does not account for the memory that we allocate dynamically. Therefore, to the 4,536 bytes statically allocated in RAM, we should add the 52,000 bytes of the tensor arena, which brings us to roughly 88% of RAM utilization.
Now that the application is built, we can run it on the virtual platform with the following command:
$ west build -t run
The west tool will boot the virtual device and return the following output, confirming that the model correctly classified the image as a ship:
As you can see from the preceding screenshot, the virtual device outputs the CORRECT classification message, confirming the successful execution of our tiny CIFAR-10 model!
Do not miss out on the opportunity to take your reading experience beyond the pages. We have a dedicated channel on the Embedded System Professionals community over Discord for you to read the book with other users.
Join now to share your journey with the book, discuss queries with other users and the author, share advice with others wherever you can, and importantly build your projects in collaboration with so many other users who have already joined us on the book club.
See you on the other side!