Building an animal image classification – using transfer learning and VGG-16 architecture

In this section, we're going to build a cat-and-dog recognizer Java application using the VGG-16 architecture and transfer learning. Let's revisit the VGG-16 architecture (explained previously in the Working with classical networks section).

The VGG-16 architecture is quite uniform. It uses only 3 x 3 same convolutions, which leave the first two dimensions untouched and increase the number of channels in the third dimension, and 2 x 2 max pooling with a stride of two, which halves the first two dimensions and leaves the third dimension untouched. The idea with many convolution architectures is to eventually shrink the first two dimensions and increase the number of channels; following the output of the convolution layers, we go from 224 x 224 x 3 to 7 x 7 x 512. In the next step, we connect all of these to two fully-connected hidden layers, each with 4,096 neurons. Finally, we use a softmax to predict 1,000 classes, as in ImageNet.
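
The shape arithmetic above can be checked with a minimal Java sketch. It assumes the standard VGG-16 layout of five blocks with 64, 128, 256, 512, and 512 channels, each ending in a 2 x 2 max pooling with stride two; the class and method names are made up for illustration and aren't part of the project.

```java
/** Sketch: traces the tensor shape through the five VGG-16 convolution blocks. */
final class Vgg16ShapeWalkthrough {

    /** Returns {height, width, channels} after all five blocks. */
    static int[] outputShape(int height, int width) {
        int[] blockChannels = {64, 128, 256, 512, 512};
        int channels = 3; // RGB input
        for (int blockOutput : blockChannels) {
            // 3 x 3 "same" convolutions keep height and width and set the channel count.
            channels = blockOutput;
            // 2 x 2 max pooling with stride 2 halves height and width.
            height /= 2;
            width /= 2;
        }
        return new int[] {height, width, channels};
    }

    public static void main(String[] args) {
        int[] shape = outputShape(224, 224);
        System.out.println(shape[0] + " x " + shape[1] + " x " + shape[2]);
    }
}
```

Five halvings take 224 down to 7, which is why the fully-connected layers receive 7 x 7 x 512 = 25,088 inputs.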

With transfer learning, we will first freeze all these layers. Their weights won't be trained any more; instead, we'll reuse the pre-trained values that were learned on, for example, the ImageNet dataset. After that, we'll make one more modification: since we don't need 1,000 classes, but just 2, we'll replace the softmax layer with one that predicts only cats and dogs. The only trainable parameters left are those of this new layer, which, as we'll see in the code, amount to just 2 multiplied by 4,096 weights plus 2 biases. Let's jump into the code and see how we can do all this using Java and Deeplearning4j.

First, we start with some familiar parameters, the number of epochs and the learning rate, defined as the EPOCH and LEARNING_RATE constants in the TransferLearningVGG16.java file.

The layer to freeze up to is also quite important, and we'll learn more about it as we proceed. The first step is to load the VGG-16 architecture and the pre-trained weights. That's quite easy: the constructor prepares the architecture but holds no weights, so a separate method needs to be called to load the pre-trained ones. Once training is done, the resulting model is loaded and used for prediction in the CatVsDogRecognition.java file, as follows:

public CatVsDogRecognition() throws IOException {
    // Load the model and log its layer summary
    this.computationGraph = loadModel();
    computationGraph.init();
    log.info(computationGraph.summary());
}

public AnimalType detectAnimalType(File file, Double threshold) throws IOException {
    INDArray image = imageFileToMatrix(file);
    INDArray output = computationGraph.outputSingle(false, image);
    // Index 0 holds the cat prediction, index 1 the dog prediction
    if (output.getDouble(0) > threshold) {
        return AnimalType.CAT;
    } else if (output.getDouble(1) > threshold) {
        return AnimalType.DOG;
    } else {
        return AnimalType.NOT_KNOWN;
    }
}

The pre-trained weights were gained by training with ImageNet. The class itself starts by declaring the path of the trained model and the computation graph, as shown in the following code snippet:

public class CatVsDogRecognition {
    public static final String TRAINED_PATH_MODEL = DataStorage.DATA_PATH + "/model.zip";
    private ComputationGraph computationGraph;

ImageNet is a huge dataset, with millions of images. With just two lines, creating the VGG16 model and calling initPretrained on it, we get the benefit of weights pre-trained on a huge dataset; a big team of developers actually trained it for a really long time and went through the painful process of choosing the best parameters. Our own training parameters can be found in the TransferLearningVGG16.java file, as follows:

public class TransferLearningVGG16 {
    private static final int SAVING_INTERVAL = 100;

    /**
     * Number of total traverses through the data.
     * Each epoch runs (training set size / MINI_BATCH_SIZE) iterations, or weight updates,
     * so 5 epochs give 5 times that many.
     */
    private static final int EPOCH = 5;

    /**
     * The layer where we need to stop back propagating
     */
    private static final String FREEZE_UNTIL_LAYER = "fc2";

    /**
     * The alpha learning rate defining the size of step towards the minimum
     */
    private static final double LEARNING_RATE = 5e-5;

    private NeuralNetworkTrainingData neuralNetworkTrainingData;

Next, we'll print the structure of the VGG-16 architecture as a table, and then download the data used to train the modified, transfer learning version of VGG-16. The helper methods that load the pre-trained weights and unzip the downloaded data are as follows:

private ComputationGraph loadVGG16PreTrainedWeights() throws IOException {
    ZooModel zooModel = new VGG16();
    log.info("Start Downloading VGG16 model...");
    return (ComputationGraph) zooModel.initPretrained(PretrainedType.IMAGENET);
}

private void unzip(File fileZip) {
    Unzip unZip = new Unzip();
    unZip.setSrc(fileZip);
    unZip.setDest(new File(DATA_PATH));
    unZip.execute();
}

We won't go into the details of the download step, but it's quite easy: we just download some data and unzip it to a folder. After we download the data, which is in a raw format, we need a way to structure it into the training dataset, the development dataset, and the test dataset.

Just to recall, the training dataset is the set used to train or optimize our weights. The development dataset is used to see how well the network generalizes to unseen data and to guide some optimizations. The test dataset, which often isn't used for small projects, gives an unbiased evaluation on data that the neural network has never seen nor been optimized against. The train method starts by loading the pre-trained network, downloading the data, and loading these datasets, as demonstrated in the following code block:

public void train() throws IOException {
    ComputationGraph preTrainedNet = loadVGG16PreTrainedWeights();
    log.info("VGG 16 Architecture");
    log.info(preTrainedNet.summary());
    log.info("Start Downloading NeuralNetworkTrainingData...");
    downloadAndUnzipDataForTheFirstTime();
    log.info("NeuralNetworkTrainingData Downloaded and unzipped");
    neuralNetworkTrainingData = new DataStorage() {
    }.loadData();

In this case, we'll sample 85% of the training data for training, and the remaining 15% will be used as the development dataset. All of the test data, that is, 100% of it, will be used for testing. Before any of that, the data is downloaded and unzipped the first time the code runs, as shown in the following code:


private void downloadAndUnzipDataForTheFirstTime() throws IOException {
    File data = new File(DATA_PATH + "/data.zip");
    if (!data.exists() || FileUtils.checksum(data, new Adler32()).getValue() != 1195241806) {
        data.delete();
        FileUtils.copyURLToFile(new URL(ONLINE_DATA_URL), data);
        log.info("File downloaded");
    }
    if (!new File(TRAIN_DIRECTORY_PATH).exists()) {
        log.info("Unzipping NeuralNetworkTrainingData...");
        unzip(data);
    }
}
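
The download check above relies on an Adler-32 checksum to decide whether the archive has to be fetched again. A minimal sketch of that decision, using only java.util.zip.Adler32 from the standard library, might look as follows; it works on an in-memory byte array rather than a file, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Adler32;

/** Sketch: decides whether cached content must be downloaded again. */
final class ChecksumSketch {

    /** Computes the Adler-32 checksum of the given bytes. */
    static long adler32(byte[] content) {
        Adler32 checksum = new Adler32();
        checksum.update(content, 0, content.length);
        return checksum.getValue();
    }

    /** A download is needed when nothing is cached or the checksum doesn't match. */
    static boolean needsDownload(byte[] cachedContentOrNull, long expectedChecksum) {
        return cachedContentOrNull == null || adler32(cachedContentOrNull) != expectedChecksum;
    }

    public static void main(String[] args) {
        byte[] cached = "Wikipedia".getBytes(StandardCharsets.US_ASCII);
        long expected = adler32(cached);
        System.out.println("Needs download: " + needsDownload(cached, expected));
    }
}
```

Comparing against a known checksum also catches a partially downloaded or corrupted file, which a plain existence check would miss.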

Then, we have the fine-tune configuration (built in the following code block, which can be found in the TransferLearningVGG16.java file in the repo), which is just a collection of parameters used to train the network more effectively:

FineTuneConfiguration fineTuneConf = new FineTuneConfiguration.Builder()
        .learningRate(LEARNING_RATE)
        .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
        .updater(Updater.NESTEROVS)
        .seed(1234)
        .build();

As we have seen before, these settings select the momentum updater and the learning rate, and stochastic gradient descent with mini-batches is used to train the model. The datasets themselves are split by the loadData method in the DataStorage.java file:

default NeuralNetworkTrainingData loadData() throws IOException {
    InputSplit[] trainAndDevData = TRAIN_DATA.sample(PATH_FILTER, TRAIN_SIZE, 100 - TRAIN_SIZE);
    return new NeuralNetworkTrainingData(getDataSetIterator(trainAndDevData[0]),
            getDataSetIterator(trainAndDevData[1]),
            getDataSetIterator(TEST_DATA.sample(PATH_FILTER, 1, 0)[0]));
}
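
The weighted split performed by loadData can be pictured with a small, library-free sketch. It only loosely mirrors what the sampling call does: it shuffles a list with a fixed seed and cuts it at a given percentage. The class name, the seed, and the use of a plain list instead of image files are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Sketch: splits items into a training part and a development part. */
final class TrainDevSplitSketch {

    /** Returns two lists: index 0 is the training part, index 1 the development part. */
    static <T> List<List<T>> split(List<T> items, int trainPercent, long seed) {
        List<T> shuffled = new ArrayList<>(items);
        // A fixed seed keeps the split reproducible between runs.
        Collections.shuffle(shuffled, new Random(seed));
        int trainCount = shuffled.size() * trainPercent / 100;
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, trainCount)));
        parts.add(new ArrayList<>(shuffled.subList(trainCount, shuffled.size())));
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> items = new ArrayList<>();
        for (int i = 0; i < 200; i++) {
            items.add(i);
        }
        List<List<Integer>> parts = split(items, 85, 1234L);
        System.out.println("train=" + parts.get(0).size() + ", dev=" + parts.get(1).size());
    }
}
```

Shuffling before cutting matters here, because files are typically listed grouped by class, and an unshuffled cut could leave the development part with only one kind of animal.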

Then we have the modified version of the VGG-16 architecture, which is where we apply transfer learning, using the transfer learning graph builder. The first method, fineTuneConfiguration, just gives a reference to the fine-tune configuration. The second method, setFeatureExtractor, is quite important: it instructs Deeplearning4j to freeze layers, and fc2 is just the second fully-connected layer. If we print the VGG-16 architecture before the modification, the output is a table in which fc2 appears near the end.

This method freezes everything up to and including fc2, so none of these layers will be trained any longer.
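
The "up to and including" behavior can be illustrated with a simplified sketch that treats the network as a linear list of layer names. This is not how Deeplearning4j implements freezing (it works on the computation graph), and the layer names other than fc2 and predictions are assumed for illustration only.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch: selects the layers that stay frozen when freezing up to a given layer. */
final class FreezeUntilSketch {

    /** Returns every layer up to and including freezeUntil, in order. */
    static List<String> frozenLayers(List<String> layersInOrder, String freezeUntil) {
        int index = layersInOrder.indexOf(freezeUntil);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown layer: " + freezeUntil);
        }
        return layersInOrder.subList(0, index + 1);
    }

    public static void main(String[] args) {
        // Hypothetical tail of the layer table; only fc2 and predictions come from the text.
        List<String> layers = Arrays.asList("block5_pool", "flatten", "fc1", "fc2", "predictions");
        System.out.println("Frozen: " + frozenLayers(layers, "fc2"));
    }
}
```

Only the layers after the freeze point, here the predictions layer, keep receiving weight updates during training.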

The next method, removeVertexKeepConnections, removes the original predictions layer, since it has 1,000 classes, and addLayer replaces it with a new prediction layer: a softmax layer with two classes as output, dog and cat, using the Xavier initialization. If we print the modified architecture, which is built in the following code block and can be found in the TransferLearningVGG16.java file in the repo, we can see that the 1,000 was reduced to 2, and the number of trainable parameters was reduced as well:

ComputationGraph vgg16Transfer = new TransferLearning.GraphBuilder(preTrainedNet)
        .fineTuneConfiguration(fineTuneConf)
        .setFeatureExtractor(FREEZE_UNTIL_LAYER)
        .removeVertexKeepConnections("predictions")
        .addLayer("predictions",
                new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
                        .nIn(4096)
                        .nOut(NUM_POSSIBLE_LABELS)
                        .weightInit(WeightInit.XAVIER)
                        .activation(Activation.SOFTMAX)
                        .build(),
                FREEZE_UNTIL_LAYER)
        .build();
vgg16Transfer.setListeners(new ScoreIterationListener(5));

Notice how the trainable parameters were roughly 138,000,000 in the original architecture, and are now reduced to 8,194. This number is the number of classes multiplied by the number of outputs of the fully-connected layer fc2, that is, 2 x 4,096 = 8,192 weights, plus 2 biases.
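
Both figures can be reproduced with a short calculation. The sketch below assumes the standard VGG-16 layout (thirteen 3 x 3 convolution layers followed by three fully-connected layers) and counts weights plus biases per layer; the class and method names are made up for illustration.

```java
/** Sketch: parameter counts for VGG-16 before and after replacing the output layer. */
final class Vgg16ParameterCount {

    /** Weights plus biases of a fully-connected layer. */
    static long dense(long inputs, long outputs) {
        return inputs * outputs + outputs;
    }

    /** Weights plus biases of a 3 x 3 convolution layer. */
    static long conv3x3(long inChannels, long outChannels) {
        return 3 * 3 * inChannels * outChannels + outChannels;
    }

    /** Total parameters of the original VGG-16 with its 1,000-class output. */
    static long originalTotal() {
        int[] convChannels = {64, 64, 128, 128, 256, 256, 256, 512, 512, 512, 512, 512, 512};
        long total = 0;
        long in = 3; // RGB input
        for (int out : convChannels) {
            total += conv3x3(in, out);
            in = out;
        }
        total += dense(7 * 7 * 512, 4096); // fc1
        total += dense(4096, 4096);        // fc2
        total += dense(4096, 1000);        // original predictions layer
        return total;
    }

    public static void main(String[] args) {
        System.out.println("Original VGG-16:  " + originalTotal());
        System.out.println("New output layer: " + dense(4096, 2));
    }
}
```

The original network comes out at 138,357,544 parameters, while the new two-class output layer has 8,194, which is all that remains trainable once everything up to fc2 is frozen.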

We reduce the number of parameters dramatically, and this will speed up our training. At the same time, we keep the weights that were trained on a huge dataset, so we benefit from the many types of features they already capture.

After modifying the architecture, we're ready to train the model. The training code is quite similar to what we've seen before. We'll train in mini-batches, and at a configured interval we'll save our progress and evaluate on the development dataset. Here, we'll save the model every 100 iterations and see how it's doing.

After every epoch, we'll evaluate on the test dataset and get the stats.
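
To get a feel for this schedule, the following sketch works out how many iterations and saves a run involves. The dataset size and mini-batch size used in main are made-up numbers, since the text doesn't state them; only the 5 epochs and the saving interval of 100 come from the constants shown earlier.

```java
/** Sketch: counts iterations and model saves for a mini-batch training run. */
final class TrainingScheduleSketch {

    /** One iteration per mini-batch; a final partial batch still counts. */
    static int iterationsPerEpoch(int trainingExamples, int miniBatchSize) {
        return (trainingExamples + miniBatchSize - 1) / miniBatchSize;
    }

    /** Number of saves when the model is saved every savingInterval iterations. */
    static int saveCount(int totalIterations, int savingInterval) {
        return totalIterations / savingInterval;
    }

    public static void main(String[] args) {
        int epochs = 5;             // EPOCH
        int savingInterval = 100;   // SAVING_INTERVAL
        int trainingExamples = 1000; // hypothetical
        int miniBatchSize = 16;      // hypothetical
        int total = epochs * iterationsPerEpoch(trainingExamples, miniBatchSize);
        System.out.println("Iterations: " + total + ", saves: " + saveCount(total, savingInterval));
    }
}
```

With these assumed numbers, each epoch runs 63 iterations, so 5 epochs give 315 weight updates and 3 saved checkpoints.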

The best model is then used by the CatVsDogRecognition implementation, which is quite straightforward.

When it gets the image file, it transforms it into an array and asks the best model for the prediction. It then returns the animal type whose prediction exceeds the threshold.

The default value of the threshold is 50%, but it is configurable from the graphical user interface. When neither prediction exceeds the threshold, we return NOT_KNOWN, meaning that we don't know what this is.
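
The threshold rule can be isolated from the model in a small sketch. It mirrors the order of checks in detectAnimalType, but takes the two predictions as plain numbers instead of reading them from the network output; the class name is hypothetical.

```java
/** Sketch: maps the two softmax outputs to an animal type using a threshold. */
final class ThresholdDecisionSketch {

    enum AnimalType { CAT, DOG, NOT_KNOWN }

    /** The cat prediction is checked first, then the dog prediction. */
    static AnimalType decide(double catPrediction, double dogPrediction, double threshold) {
        if (catPrediction > threshold) {
            return AnimalType.CAT;
        } else if (dogPrediction > threshold) {
            return AnimalType.DOG;
        }
        return AnimalType.NOT_KNOWN;
    }

    public static void main(String[] args) {
        System.out.println(decide(0.9, 0.1, 0.5));
        System.out.println(decide(0.6, 0.4, 0.7));
    }
}
```

Because the comparison is strict, a prediction that exactly equals the threshold doesn't count, so raising the threshold makes the recognizer answer NOT_KNOWN more often.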
