Reducing your model's weight

We have spent considerable time discussing the layers of a network; we have learned that layers are made up of weights, configured in such a way that they can transform an input into a desirable output. These weights come at a cost, though: each one is (by default) a 32-bit floating-point number, and a typical model, especially in computer vision, has millions of them, resulting in networks that are hundreds of megabytes in size. On top of that, it's plausible that your application will have multiple models (with this chapter being a good example, requiring a model for each style). 

Fortunately, our model in this chapter has a moderate number of weights and weighs in at a mere 2.2 MB; but this is probably the exception rather than the rule. So we'll use this chapter as an excuse to explore some ways we can reduce our model's size. But before doing so, let's quickly discuss why, even though it's probably obvious. The three main reasons why you should be conscious of your model's size are:

  • Download time 
  • Application footprint 
  • Demands on memory 

These could all hinder the user experience and are reasons for a user to either quickly uninstall the application or not download it in the first place. So how do you reduce your model's size to avoid deterring the user? There are three broad approaches:

  • Reduce the number of layers your network uses 
  • Reduce the number of units in each of those layers
  • Reduce the size of the weights 

The first two require that you have access to the original network, along with the tools to re-architect and retrain the model; the last is the most accessible, and it's the one we will discuss now. 

In iOS 11.2, Apple allowed your networks to use half-precision floating-point numbers (16-bit). Now, with the release of iOS 12, Apple has taken this even further and introduced quantization, which allows us to use eight or fewer bits to encode our model's weights. In the following figure, we can see how these options compare with one another:

Let's discuss each in turn, starting with reducing our weights' precision by converting them from 32-bit to 16-bit floating-point numbers. 
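To get a feel for the savings on offer, here is a quick back-of-the-envelope calculation (the weight count below is a hypothetical figure, chosen so that the full-precision size roughly matches our 2.2 MB model); the storage cost scales directly with the number of bits per weight:

num_weights = 550000  # hypothetical count; ~550k 32-bit weights is roughly 2.2 MB

for bits in (32, 16, 8, 4, 2):
    size_mb = num_weights * bits / 8.0 / 1e6
    print('{:>2}-bit weights: ~{:.2f} MB'.format(bits, size_mb))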

For both of these techniques (half-precision and quantization), we will be using the Core ML Tools Python package; so, begin by opening up your browser and heading over to https://notebooks.azure.com. Once the page has loaded, navigate to the folder Chapter6/Notebooks/ and open the Jupyter Notebook FastNeuralStyleTransfer_OptimizeCoreML.ipynb. As we did before, we'll walk through each of the Notebook's cells here, with the assumption that you will be executing each one as we cover it (if you are following along).

We begin by importing the Core ML Tools package; execute the cell with the following code:

try:
    import coremltools
except:
    !pip install coremltools
    import coremltools

For convenience, we have wrapped the import in a try/except block, so that the package is automatically installed if it isn't already present. 

At the time of writing, Core ML 2 had only recently been announced and was still in beta. If the release version of Core ML Tools is still below 2.0 when you run this, replace !pip install coremltools with !pip install coremltools>=2.0b1 to install the latest beta, which provides the modules needed for this section. 

Next, we will load the .mlmodel file that we saved previously, using the following statement:

coreml_model = coremltools.models.MLModel('output/FastStyleTransferVanGoghStarryNight.mlmodel')

Next, we perform the conversion by simply calling coremltools.utils.convert_neural_network_weights_to_fp16 and passing in our model. If successful, this method returns an equivalent model that stores its weights using half-precision (16-bit) floating-point numbers instead of 32-bit. Run the cell with the following code to do just that:

fp16_coreml_model = coremltools.utils.convert_neural_network_weights_to_fp16(coreml_model)

Finally, we save it so that we can later download it and import it into our project; run the next cell with the following code:

fp16_coreml_model.save('output/fp16_FastStyleTransferVanGoghStarryNight.mlmodel')
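If you'd like to confirm the saving for yourself, a quick sanity check (assuming both models have been written to the output directory as above) is to compare their sizes on disk:

import os

for name in ('FastStyleTransferVanGoghStarryNight.mlmodel',
             'fp16_FastStyleTransferVanGoghStarryNight.mlmodel'):
    size_mb = os.path.getsize(os.path.join('output', name)) / 1e6
    print('{}: {:.1f} MB'.format(name, size_mb))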

And with that executed (essentially three lines of code), we have managed to halve our model's size, going from 2.2 MB to 1.1 MB; so, what's the catch?

As you might suspect, there is a trade-off here; reducing the precision of your model's weights will affect its accuracy, although possibly not enough to be concerned about. The only way to know is to compare the optimized model with the original, re-evaluating it on your test data to ensure that it still satisfies your required accuracy or results. For this, Core ML Tools provides a collection of utilities that make the comparison fairly seamless, which you can learn about on the official website at https://apple.github.io/coremltools/index.html
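As a rough sketch of what such a comparison could look like (this assumes you are running coremltools on macOS, where predict is available; the input name 'image', the 320 x 320 input size, and the test image path are all hypothetical, so adjust them to match your model):

from PIL import Image

# Hypothetical test image, resized to the model's expected input dimensions.
test_image = Image.open('images/content.jpg').resize((320, 320))

# predict() only works when coremltools is running on macOS.
original_out = coreml_model.predict({'image': test_image})
fp16_out = fp16_coreml_model.predict({'image': test_image})

# How you compare the two is model-specific; for style transfer you might save
# both output images and inspect them visually, or compute a pixel-wise difference.
print(original_out.keys(), fp16_out.keys())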

Quantization is no more complicated to use via Core ML Tools than half-precision conversion, but conceptually it's a cleverer technique, so let's quickly discuss how it achieves 8-bit (and lower) compression before running through the code.

At a high level, quantization is a technique that maps a continuous range of values onto a discrete set; you can think of it as clustering your values into a discrete set of groups and then creating a lookup table that maps each value to its closest group. The storage cost then depends on the number of clusters (the index) rather than on the values themselves, which allows you to encode your weights using anything from 8 bits down to 2 bits. 
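To illustrate the idea (this is a plain NumPy sketch of the general technique, not how Core ML Tools implements it), here we quantize a toy weight array to a 16-entry palette, that is, 4 bits per weight:

import numpy as np

np.random.seed(0)
weights = np.random.randn(1000).astype(np.float32)  # a toy array of 32-bit weights

# Build a 16-entry palette by linearly spanning the range of the weights (4-bit).
palette = np.linspace(weights.min(), weights.max(), 16)

# Store, for each weight, only the index of its nearest palette entry.
indices = np.abs(weights[:, None] - palette[None, :]).argmin(axis=1).astype(np.uint8)

# To use the weights, we look the indices back up in the palette (dequantization).
dequantized = palette[indices]
print('max reconstruction error:', np.abs(weights - dequantized).max())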

To make this concept more concrete, the following figure illustrates the results of color quantization, where a 24-bit image is mapped to 16 discrete colors:

Instead of each pixel storing its full color (24 bits, or 8 bits per channel), each pixel now stores an index into the 16-color palette; that is, we go from 24 bits to 4 bits per pixel. 

Before moving on to how we optimize our models using quantization with the Core ML Tools package, you may be wondering how this palette (or discrete set of values) is derived. The short answer is that there are many ways, from linearly separating the values into groups, to using an unsupervised learning technique such as k-means, or even using a custom, domain-specific technique. Core ML Tools allows for all of these variations, and the choice will depend on your data distribution and the results achieved during testing. Let's jump into it; first, we will start by importing the module:

from coremltools.models.neural_network import quantization_utils as quant_utils

With this statement, we have imported the module and assigned it the alias quant_utils; in the next cell, we optimize our model using a variety of bit sizes and methods:

lq8_coreml_model = quant_utils.quantize_weights(coreml_model, 8, 'linear')
lq4_coreml_model = quant_utils.quantize_weights(coreml_model, 4, 'linear')
km8_coreml_model = quant_utils.quantize_weights(coreml_model, 8, 'kmeans')
km4_coreml_model = quant_utils.quantize_weights(coreml_model, 4, 'kmeans')

Once this has completed, let's save each of our optimized models to the output directory, before downloading them to our local disk and importing them into Xcode (this may take some time):

coremltools.models.MLModel(lq8_coreml_model).save(
    'output/lq8_FastStyleTransferVanGoghStarryNight.mlmodel')
coremltools.models.MLModel(lq4_coreml_model).save(
    'output/lq4_FastStyleTransferVanGoghStarryNight.mlmodel')
coremltools.models.MLModel(km8_coreml_model).save(
    'output/km8_FastStyleTransferVanGoghStarryNight.mlmodel')
coremltools.models.MLModel(km4_coreml_model).save(
    'output/km4_FastStyleTransferVanGoghStarryNight.mlmodel')

I will omit the details of downloading the models and importing them into your project, as we have already gone through these steps previously in this chapter, but I do encourage you to inspect the results from each model to get a feel for how each optimization affects the output - of course, these effects are highly dependent on the model, data, and domain. The following figure shows the results of each of the optimizations, along with each model's size:

Admittedly, it's difficult to see the differences due to the low resolution of the image (and possibly because you're reading this in black and white), but generally the difference in quality between the original and the k-means 8-bit version appears minimal.

With the release of Core ML 2, Apple offers another powerful feature for optimizing your Core ML models: flexible shapes and sizes, which effectively lets you consolidate multiple model variants into a single package. This not only reduces the size of your application but is also convenient for you, the developer, when interfacing with your model. Instead of a single fixed input and output dimension, you can declare multiple supported variants or a variable range within a limit.
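As a brief, hedged illustration of flexible shapes (a sketch only: the feature name 'image' is an assumption, and this approach only applies if your model's input is declared as an image type):

from coremltools.models.neural_network import flexible_shape_utils

spec = coreml_model.get_spec()

# Allow the (hypothetical) 'image' input to accept either of two enumerated sizes.
sizes = [flexible_shape_utils.NeuralNetworkImageSize(256, 256),
         flexible_shape_utils.NeuralNetworkImageSize(512, 512)]
flexible_shape_utils.add_enumerated_image_sizes(spec, feature_name='image', sizes=sizes)

flexible_model = coremltools.models.MLModel(spec)
flexible_model.save('output/flexible_FastStyleTransferVanGoghStarryNight.mlmodel')

You can learn more about this feature on Apple's official website at https://developer.apple.com/machine-learning; but for now, we will wrap up this chapter with a quick summary before moving on to the next chapter.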
