The C++ native part of the project

We perform the main classification task in the native C++ library. The Android Studio IDE has already created the native-lib.cpp file for us, so we just have to modify it. The following code snippet shows what header files we should include in order to work with the JNI, PyTorch, and Android asset libraries:

#include <jni.h>
#include <string>
#include <iostream>

#include <torch/script.h>
#include <caffe2/serialize/read_adapter_interface.h>

#include <android/asset_manager_jni.h>
#include <android/asset_manager.h>

If you want to use the Android logging system to output some messages to the IDE's logcat, you can define the following macro, which uses the __android_log_print() function:

#include <android/log.h>

#define LOGD(...) __android_log_print(ANDROID_LOG_DEBUG, "CAMERA_TAG", __VA_ARGS__)
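
For example, the macro can then be used just like printf(), with a format string followed by its arguments (the message below is purely illustrative and assumes that width and height variables are in scope):

// Write a debug message with the CAMERA_TAG tag to logcat
LOGD("Captured bitmap size: %d x %d", width, height);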

The first native function we used in the Java code was the initClassifier() function. To implement it in the C++ code and make it visible in the Java code, we have to follow JNI rules to make the function declaration correct. The name of the function should include the full Java package name, including namespaces, and our first two required parameters should be of the JNIEnv* and jobject types. The following code shows how to define this function:

extern "C" JNIEXPORT void JNICALL
Java_com_example_camera2_MainActivity_initClassifier(
JNIEnv *env, jobject /*self*/, jobject j_asset_manager) {
AAssetManager *asset_manager = AAssetManager_fromJava(env, j_asset_manager);
if (asset_manager != nullptr) {
LOGD("initClassifier start OK");

auto model = ReadAsset(asset_manager, "model.pt");
if (!model.empty()) {
g_image_classifier.InitModel(model);
}

auto synset = ReadAsset(asset_manager, "synset.txt");
if (!synset.empty()) {
VectorStreamBuf<char> stream_buf(synset);
std::istream is(&stream_buf);
g_image_classifier.InitSynset(is);
}
LOGD("initClassifier finish OK");
}
}

The initClassifier() function initializes the g_image_classifier global object, which is of the ImageClassifier type. We use this object to perform image classification in our application. There are two main entities we have to initialize for this object to work as expected: the model itself, which we load from the snapshot, and the class descriptions, which we load from the synset file. As we saw previously, the synset and model snapshot files were attached to our application as assets, so to access them, we used a reference (or pointer) to the application's AssetManager object. We passed the Java reference to the AssetManager object as a function parameter when we called this function from the Java code. In the C/C++ code, we used the AAssetManager_fromJava() function to convert the Java reference into a C++ pointer. Then, we used the ReadAsset() function to read the assets from the application bundle as std::vector<char> objects. Our ImageClassifier class provides the InitModel() and InitSynset() methods to load the corresponding entities.

The following code shows the ReadAsset() function's implementation:

std::vector<char> ReadAsset(AAssetManager *asset_manager, const std::string &name) {
  std::vector<char> buf;
  AAsset *asset = AAssetManager_open(asset_manager, name.c_str(),
                                     AASSET_MODE_UNKNOWN);
  if (asset != nullptr) {
    LOGD("Open asset %s OK", name.c_str());
    off_t buf_size = AAsset_getLength(asset);
    buf.resize(buf_size + 1, 0);
    auto num_read = AAsset_read(asset, buf.data(), buf_size);
    LOGD("Read asset %s OK", name.c_str());

    if (num_read == 0)
      buf.clear();
    AAsset_close(asset);
    LOGD("Close asset %s OK", name.c_str());
  }
  return buf;
}

There are four Android framework functions that we used to read an asset from the application bundle. The AAssetManager_open() function opened the asset and returned a pointer to the AAsset object, which is non-null on success. This function assumes that the path to the asset is in the file path format and that the root of this path is the assets folder. After we opened the asset, we used the AAsset_getLength() function to get the file size and allocated the memory for std::vector<char> with the std::vector::resize() method. Then, we used the AAsset_read() function to read the whole file into the buf object.

This function takes the following parameters:

  • The pointer to the asset object to read from
  • The void* pointer to the memory buffer to read into
  • The number of bytes to read

So, as you can see, the assets API is pretty much the same as the standard C library API for file operations. When we'd finished working with the asset object, we used AAsset_close() to notify the system that we don't need access to this asset anymore. If your assets are in .zip archive format, you should check the number of bytes returned by the AAsset_read() function because the Android framework reads archives chunk by chunk.
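
A minimal sketch of such a read loop, assuming the asset has already been opened with AAssetManager_open() and is available as the asset pointer, might look as follows (an illustration, not part of the sample's code):

// Sketch: keep calling AAsset_read() until the whole asset has been consumed
off_t total_size = AAsset_getLength(asset);
std::vector<char> data(static_cast<size_t>(total_size));
off_t total_read = 0;
while (total_read < total_size) {
  // AAsset_read() returns the number of bytes read, 0 at the end, or a negative value on error
  int num_read = AAsset_read(asset, data.data() + total_read,
                             static_cast<size_t>(total_size - total_read));
  if (num_read <= 0)
    break;
  total_read += num_read;
}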

You may have noticed that we used the VectorStreamBuf adapter to pass data to the ImageClassifier::InitSynset() method. This method takes an object of the std::istream type. To convert std::vector<char> into the std::istream type object, we developed the following adapter class:

template <typename CharT, typename TraitsT = std::char_traits<CharT>>
struct VectorStreamBuf : public std::basic_streambuf<CharT, TraitsT> {
  explicit VectorStreamBuf(std::vector<CharT> &vec) {
    this->setg(vec.data(), vec.data(), vec.data() + vec.size());
  }
};

The following code shows the ImageClassifier class' declaration:

class ImageClassifier {
 public:
  using Classes = std::map<size_t, std::string>;

  ImageClassifier() = default;

  void InitSynset(std::istream &stream);

  void InitModel(const std::vector<char> &buf);

  std::string Classify(const at::Tensor &image);

 private:
  Classes classes_;
  torch::jit::script::Module model_;
};

We declared the global object of this class in the following way at the beginning of the native-lib.cpp file:

ImageClassifier g_image_classifier;

The following code shows the InitSynset() method's implementation:

void ImageClassifier::InitSynset(std::istream &stream) {
  LOGD("Init synset start OK");
  classes_.clear();
  if (stream) {
    std::string line;
    std::string id;
    std::string label;
    size_t idx = 1;
    while (std::getline(stream, line)) {
      auto pos = line.find_first_of(" ");
      id = line.substr(0, pos);
      label = line.substr(pos + 1);
      classes_.insert({idx, label});
      ++idx;
    }
  }
  LOGD("Init synset finish OK");
}

The lines in the synset file are in the following format:

[ID] space character [Description text] 
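
For example, a typical ImageNet synset file starts with lines similar to the following (the exact identifiers and order depend on the file bundled with the application):

n01440764 tench, Tinca tinca
n01443537 goldfish, Carassius auratus
n01484850 great white shark, white shark, man-eater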

So, we read this file line by line and split each line at the position of the first space character. The first part of each line is the class identifier, while the second one is the class description. All the classes in this file are ordered, so the line number corresponds to the class index that was used for training the model. Therefore, to match the model's evaluation result with the correct class description, we created the dictionary (map) object, where the key is the line number and the value is the class description. The InitSynset() function takes std::istream as a parameter because we're going to use the same code in other samples, where we read the synset file with the standard C++ streams API.

The following code shows the ImageClassifier::InitModel() method's implementation:

void ImageClassifier::InitModel(const std::vector<char> &buf) {
  model_ = torch::jit::load(std::make_unique<ModelReader>(buf), at::kCPU);
}

Here, we simply used a single function call to load the TorchScript model snapshot. The torch::jit::load() function did all the hard work for us; it loaded the model and initialized it with the weights, which were also saved in the snapshot file. The main difficulty is that, in our case, the model snapshot has to be read from a memory buffer. The torch::jit::load() function doesn't work with standard C++ streams and types; instead, it accepts a pointer to an object of a class derived from caffe2::serialize::ReadAdapterInterface. The following code shows how to make a concrete implementation of the caffe2::serialize::ReadAdapterInterface class, which wraps the std::vector<char> object:

class ModelReader : public caffe2::serialize::ReadAdapterInterface {
 public:
  explicit ModelReader(const std::vector<char> &buf) : buf_(&buf) {}

  ~ModelReader() override {}

  size_t size() const override {
    return buf_->size();
  }

  size_t read(uint64_t pos, void *buf, size_t n,
              const char *what) const override {
    std::copy_n(buf_->begin() + pos, n, reinterpret_cast<char *>(buf));
    return n;
  }

 private:
  const std::vector<char> *buf_;
};

The ModelReader class overrides two methods, size() and read(), from the caffe2::serialize::ReadAdapterInterface base class. Their implementations are pretty straightforward: the size() method returns the size of the underlying vector object, while the read() method copies n bytes (chars) from the vector, starting at the pos offset, to the destination buffer using the standard std::copy_n algorithm.
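
Note that this read() implementation trusts the caller never to request bytes past the end of the buffer. A slightly more defensive variant, shown here as a sketch rather than the book's code, could clamp the requested size to the bytes that are actually available:

// Sketch: clamp the read so it never runs past the end of the wrapped vector
size_t read(uint64_t pos, void *buf, size_t n, const char * /*what*/) const override {
  if (pos >= buf_->size())
    return 0;
  size_t available = buf_->size() - static_cast<size_t>(pos);
  size_t to_copy = std::min(n, available);  // requires <algorithm>
  std::copy_n(buf_->begin() + static_cast<ptrdiff_t>(pos), to_copy,
              reinterpret_cast<char *>(buf));
  return to_copy;
}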

The primary purpose of the ImageClassifier class is to perform image classification. The following code shows the implementation of the target method of this class, that is, Classify():

std::string ImageClassifier::Classify(const at::Tensor &image) {
  std::vector<torch::jit::IValue> inputs;
  inputs.emplace_back(image);
  at::Tensor output = model_.forward(inputs).toTensor();

  LOGD("Output size %d %d %d", static_cast<int>(output.ndimension()),
       static_cast<int>(output.size(0)),
       static_cast<int>(output.size(1)));

  auto max_result = output.squeeze().max(0);
  auto max_index = std::get<1>(max_result).item<int64_t>();
  auto max_value = std::get<0>(max_result).item<float>();

  max_index += 1;

  return std::to_string(max_index) + " - " + std::to_string(max_value) +
         " - " + classes_[static_cast<size_t>(max_index)];
}

This function takes the at::Tensor object, which contains the image data, as an input parameter. We used the forward() method of the torch::jit::script::Module class to evaluate the model on the input image. Notice that the forward() method takes a vector of torch::jit::IValue objects. There is an implicit cast from the at::Tensor type to the torch::jit::IValue type, which means we can use the ATen library's tensor objects transparently. The output of the model is a 1 x 1000 dimensional tensor, where each value is a class score. To determine the most probable class for the given image, we looked for the column with the maximum value using the at::Tensor::max() method. The preceding squeeze() call removed the first dimension and made the tensor one-dimensional. The at::Tensor::max() method, when called with a dimension argument, returns a pair of values: the first is the actual maximum value, while the second is its index. We incremented the class index we got because the same offset is applied in the InitSynset() function, where class numbering starts at 1. Then, we used this index to find the class description in the classes_ map, which we filled in the InitSynset() method when it was called from the initClassifier() function.

The last JNI function we need to implement is classifyBitmap(). The following code shows how we declare it:

extern "C" JNIEXPORT jstring JNICALL
Java_com_example_camera2_MainActivity_classifyBitmap(
JNIEnv *env, jobject /*self*/, jintArray pixels, jint width, jint height) {
...
}

This function takes three parameters: the pixels object and its width and height dimensions. The pixels object is a reference to the Java int[] array type, so we have to convert it into a C/C++ array to be able to process it. The following code shows how we can extract separate colors and put them into distinct buffers:

jboolean is_copy = 0;
jint *pixels_buf = env->GetIntArrayElements(pixels, &is_copy);

auto channel_size = static_cast<size_t>(width * height);
using ChannelData = std::vector<float>;
size_t channels_num = 3;  // RGB image
std::vector<ChannelData> image_data(channels_num);
for (size_t i = 0; i < channels_num; ++i) {
  image_data[i].resize(channel_size);
}

// split the original image into channels
for (int y = 0; y < height; ++y) {
  for (int x = 0; x < width; ++x) {
    auto pos = x + y * width;
    auto pixel_color = static_cast<uint32_t>(pixels_buf[pos]);
    // ARGB format
    uint32_t mask{0x000000FF};

    for (size_t i = 0; i < channels_num; ++i) {
      uint32_t shift = i * 8;
      uint32_t channel_value = (pixel_color >> shift) & mask;
      image_data[channels_num - (i + 1)][pos] =
          static_cast<float>(channel_value);
    }
  }
}

env->ReleaseIntArrayElements(pixels, pixels_buf, 0);

JNIEnv's GetIntArrayElements() method returned the pointer to the jint array's elements, where the jint type is actually the regular C/C++ int type. With the pointer to the image's pixels data at hand, we processed it. We separated each color value into components because we needed to normalize each color channel separately.

We defined the image_data object of the std::vector<ChannelData> type to hold the color channel data. Each channel object is of the ChannelData type, which is std::vector<float> underneath. The channel data was filled in by iterating over the image pixels row by row and splitting each pixel color into its components. We got each color component by shifting the color value, which is of the int type, to the right by 0, 8, and 16 bits in turn. We didn't need the alpha color component; that is why we only performed the shift three times. After shifting, we extracted the exact component value by applying the AND operator with the 0x000000FF mask value. We also cast the color values to the floating-point type because, later on, we need to scale them to the [0, 1] range and normalize them. After we'd finished working with the pixel values, we released the data pointer with the ReleaseIntArrayElements() method of the JNIEnv object.
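
As a worked example, take a hypothetical pixel value of 0xFF204080 (A = 0xFF, R = 0x20, G = 0x40, B = 0x80); the inner loop above unpacks it like this:

uint32_t pixel_color = 0xFF204080;             // hypothetical ARGB value
uint32_t blue  = (pixel_color >> 0)  & 0xFF;   // i = 0, stored in image_data[2]
uint32_t green = (pixel_color >> 8)  & 0xFF;   // i = 1, stored in image_data[1]
uint32_t red   = (pixel_color >> 16) & 0xFF;   // i = 2, stored in image_data[0]
// The alpha byte (bits 24-31) is never extracted because we only shift three times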

Now that we've extracted the color channels from the pixel data, we have to create tensor objects from them. Using tensor objects allows us to perform vectorized calculations that are more computationally efficient. The following code snippet shows how to create at::Tensor objects from floating-point vectors:

std::vector<int64_t> channel_dims = {height, width};

std::vector<at::Tensor> channel_tensor;
at::TensorOptions options(at::kFloat);
options = options.device(at::kCPU).requires_grad(false);

for (size_t i = 0; i < channels_num; ++i) {
  channel_tensor.emplace_back(
      torch::from_blob(image_data[i].data(),
                       at::IntArrayRef(channel_dims),
                       options).clone());
}

Notice that we specified the at::kFloat type in at::TensorOptions to make it compatible with our floating-point channel vectors. We also used the torch::from_blob() function to make a tensor object from the raw array data; we used this function in previous chapters. Note the clone() call: torch::from_blob() doesn't take ownership of the underlying memory, so cloning gives us tensors that own their own copies of the channel data. Simply put, we initialized the channel_tensor vector, which contains three tensors with the values for each color channel.

The ResNet model we're using requires that we normalize the input image; that is, we should subtract a distinct predefined mean value from each channel and divide it by a distinct predefined standard deviation value. The following code shows how we can normalize the color channels in the channel_tensor container:

std::vector<float> mean{0.485f, 0.456f, 0.406f};
std::vector<float> stddev{0.229f, 0.224f, 0.225f};

for (size_t i = 0; i < channels_num; ++i) {
  channel_tensor[i] = ((channel_tensor[i] / 255.0f) - mean[i]) / stddev[i];
}

After we've normalized each channel, we have to make a tensor from them to satisfy the ResNet model's requirements. The following code shows how to use the stack() function to combine channels:

auto image_tensor = at::stack(channel_tensor);
image_tensor = image_tensor.unsqueeze(0);

The stack() function concatenates the three channel tensors along a new dimension, so the resulting tensor's dimensions become 3 x height x width.

Another of the model's requirements is that it needs a batch size dimension for the input image tensor. We used the tensor's unsqueeze() method to add a new dimension to the tensor so that its dimensions became 1 x 3 x height x width.
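
If you want to verify the resulting shape at runtime, you can log the tensor's dimensions with the same LOGD macro we defined earlier (a quick sanity check, not a required part of the sample):

// We expect four dimensions: 1 x 3 x height x width
LOGD("Input tensor: %d x %d x %d x %d",
     static_cast<int>(image_tensor.size(0)),
     static_cast<int>(image_tensor.size(1)),
     static_cast<int>(image_tensor.size(2)),
     static_cast<int>(image_tensor.size(3)));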

The following code shows the final part of the classifyBitmap() function:

std::string result = g_image_classifier.Classify(image_tensor);

return env->NewStringUTF(result.c_str());

Here, we called the Classify() method of the global g_image_classifier object to evaluate the loaded model on the prepared tensor, which contains the captured image. Then, we converted the obtained classification string into a Java String object by calling the NewStringUTF() method of the JNIEnv type object. As we mentioned previously, the Java part of the application will show this string to the user in the onActivityResult() method.

In this section, we looked at the implementation of image classification applications for the Android system. We learned how to export a pre-trained model from a Python program as a PyTorch script file. Then, we delved into developing a mobile application with Android Studio IDE and the mobile version of the PyTorch C++ library.

In the next section, we will discuss and deploy an application for image classification to the Google Compute Engine platform.
