Chapter 10. Image Classification with Convolutional Neural Networks

Computer vision is a branch of deep learning in which computers discern information from images. Real-world uses include identifying objects in photos, removing inappropriate images from social media sites, counting the cars in line at a tollbooth, and recognizing faces in photos. Computer-vision models can even be combined with natural language processing (NLP) models to caption photos. I snapped a photo while on vacation and asked Azure’s Computer Vision service to caption it. The result is shown in Figure 10-1. It’s somewhat remarkable given that no human intervention was required.

Figure 10-1. “A body of water with a dock and a building in the background”—Azure AI

The field of computer vision has advanced rapidly in recent years, mostly due to convolutional neural networks, also known as CNNs or ConvNets. In 2012, an eight-layer CNN called AlexNet outperformed traditional machine learning models entered in the annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC) by achieving an error rate of 15.3% when identifying objects in photos. In 2015, ResNet-152, featuring a whopping 152 layers, won the challenge with an error rate of just 3.5%, which is lower than the error rate humans typically achieve when classifying the images featured in the competition.

CNNs are magical because they treat images as images rather than just arrays of pixel values. They use a decades-old technology called convolution kernels to extract “features” from images, allowing them to recognize the shape of a cat’s head or the outline of a dog’s tail. Moreover, they are easy to build with Keras and TensorFlow.

State-of-the-art CNNs such as ResNet-152 are trained at great expense with millions of images on GPUs, but there’s a lot you can do with an ordinary CPU. In this chapter, you’ll learn what CNNs are and how they work, and you’ll build and train a few CNNs of your own. You’ll also learn how to leverage advanced CNNs published for public consumption by companies such as Google and Microsoft, and how to use a technique called transfer learning to repurpose those CNNs to solve domain-specific problems.

Understanding CNNs

Figure 10-2 shows the topology of a basic CNN. It begins with one or more sets of convolution layers and pooling layers. Convolution layers extract features from images, generating transformed images that are commonly referred to as feature maps because they highlight distinguishing features such as shapes and contours. Pooling layers reduce the feature maps’ size by half so that features can be extracted at various resolutions and are less sensitive to small changes in position. Output from the final pooling layer is flattened to one dimension and input to one or more dense layers for classification. The convolution and pooling layers are called bottleneck layers since they reduce the dimensionality of images input to them. They also account for the bulk of the computation time during training.

Convolution layers extract features from images by passing convolution kernels over them—the same technique used by image editing tools to blur, sharpen, and emboss images. A kernel is simply a matrix of values. It usually measures 3 × 3, but it can be larger. To process an image, you place the kernel in the upper-left corner of the image, multiply the kernel values by the pixel values underneath, and compute a new value for the center pixel by summing the products, as shown in Figure 10-3. Then you move the kernel one pixel to the right and repeat the process, continuing row by row and column by column until the entire image has been processed.

Figure 10-2. Convolutional neural network
Figure 10-3. Processing image pixels with a 3 × 3 convolution kernel

Figure 10-4 shows what happens when you apply a 3 × 3 kernel to a hot dog image. This particular kernel is called a bottom Sobel kernel, and it’s designed to do edge detection by highlighting edges as if a light were shined from the bottom. The convolution layers of a CNN use kernels like this one to extract features that help distinguish one class from another.

Figure 10-4. Processing an image with a bottom Sobel kernel
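
If you'd like to experiment with this yourself, here's a minimal sketch that applies a bottom Sobel kernel to a grayscale image using SciPy and Pillow (the filename is a placeholder; any image will do):

import numpy as np
from PIL import Image
from scipy.ndimage import convolve

# Load an image and convert it to a grayscale array of pixel values
img = np.array(Image.open('hot_dog.jpg').convert('L'), dtype='float32')

# Bottom Sobel kernel: emphasizes edges as if lit from below
kernel = np.array([[-1, -2, -1],
                   [ 0,  0,  0],
                   [ 1,  2,  1]], dtype='float32')

# Slide the kernel over the image, summing the products at each position
edges = convolve(img, kernel)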

A convolution layer doesn’t use just one kernel to process images. It uses many—sometimes 100 or more. The kernel values aren’t determined ahead of time. They are initialized with random values and then learned (adjusted) as the CNN is trained, just as the weights connecting neurons in dense layers are learned. Each kernel also has a bias associated with it, just like a neuron in a dense layer. The images in Figure 10-5 were generated by the first convolution layer in a trained CNN. You can see how the various convolution kernels allow the network to view the same hot dog image in different ways, and how certain features such as the shape of the bun and the ribbon of mustard on top are highlighted.

Figure 10-5. Images generated by convolution kernels in a CNN

Pooling layers downsample images to reduce their size. The most common resizing technique is max pooling, which divides images into 2 × 2 blocks of pixels and selects the highest of the four values in each block. An alternative is average pooling, which averages the values in each block.
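
Here's a small NumPy sketch of 2 × 2 max pooling, assuming the feature map's height and width are even:

import numpy as np

def max_pool_2x2(feature_map):
    # Split the map into 2 x 2 blocks and keep the largest value in each block
    h, w = feature_map.shape
    return feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fm = np.array([[1, 3, 2, 4],
               [5, 6, 1, 0],
               [7, 2, 9, 8],
               [3, 1, 4, 2]])

print(max_pool_2x2(fm))  # [[6 4]
                         #  [7 9]]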

Figure 10-6 shows how an image contracts as it passes through successive pooling layers. The first row came from the first pooling layer, the second row came from the second pooling layer, and so on.

Figure 10-6. Images generated by pooling layers in a CNN

Pooling isn’t the only way to downsize an image. While less common, reduction can be accomplished without pooling layers by setting a convolution layer’s stride to 2. Stride is the number of pixels a convolution kernel moves each time it advances across an image. It defaults to 1, but setting it to 2 halves the output’s width and height because the kernel is applied at every other position in each row and column.
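
In Keras, that's simply a matter of passing a strides argument to Conv2D rather than following the layer with a pooling layer. A minimal sketch:

from tensorflow.keras.layers import Conv2D

# Roughly halves the feature maps' height and width without a pooling layer
layer = Conv2D(32, (3, 3), strides=2, activation='relu')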

The dense layers at the end of the network classify features extracted from the bottleneck layers and are referred to as the CNN’s classification layers. They are no different than the multilayer perceptrons featured in Chapter 9. For binary classification, the output layer contains one neuron and uses the sigmoid activation function. For multiclass classification, the output layer contains one neuron per class and uses the softmax activation function.

Note

There’s no law that says bottleneck layers have to be paired with classification layers. You could take the feature maps output from the bottleneck layers and classify them with a support vector machine rather than a multilayer perceptron. It’s not as far-fetched as it sounds. In Chapter 12, I’ll introduce one well-known model that does just that.
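
Purely as an illustration (not the model from Chapter 12), here's a hedged sketch of that idea using Scikit-Learn's SVC, assuming x_train and y_train hold preprocessed images and labels:

from sklearn.svm import SVC
from tensorflow.keras.applications import ResNet50V2

# Bottleneck layers only; pooling='avg' reduces each image to a feature vector
base_model = ResNet50V2(weights='imagenet', include_top=False, pooling='avg')

# Extract features with the CNN, then classify them with a support vector machine
features = base_model.predict(x_train)
svm = SVC()
svm.fit(features, y_train)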

Using Keras and TensorFlow to Build CNNs

To simplify building CNNs that classify images, Keras offers the Conv2D class, which models convolution layers, and the MaxPooling2D class, which implements max pooling layers. The following statements create a CNN with two pairs of convolution and pooling layers, a flatten layer to reshape the output into a 1D array for input to a dense layer, a dense layer to classify the features extracted from the bottleneck layers, and a softmax output layer for classification:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(MaxPooling2D(2, 2))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(2, 2))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(10, activation='softmax'))

The first parameter passed to the Conv2D function is the number of convolution kernels to include in the layer. More kernels means more fitting power, similar to the number of neurons in a dense layer. The second parameter is the dimensions of each kernel. You sometimes get greater accuracy from 5 × 5 kernels, but a kernel that size increases training time by requiring 25 multiplication operations for each pixel as opposed to nine for a 3 × 3 kernel. The input_shape parameter in the first layer specifies the size of the images input to the CNN: in this case, one-channel (grayscale) 28 × 28 images. All the images used to train a CNN must be the same size.

Note

Conv2D processes images, which are two-dimensional. Keras also offers the Conv1D class for processing 1D data and Conv3D for 3D data. The former finds use processing text and time-series data. Canonical use cases for Conv3D include analyzing video and 3D medical images.

Given a set of images with a relatively high degree of separation between classes, it’s perfectly feasible to train a CNN to classify those images on a typical laptop or PC. A great example is the MNIST dataset, which contains 60,000 training images of scanned, handwritten digits, each measuring 28 × 28 pixels, plus 10,000 test images. Figure 10-7 shows the first 50 scans in the training set.

Figure 10-7. The MNIST digits dataset

Let’s train a CNN to recognize digits in the MNIST dataset, which conveniently is one of several sample datasets built into Keras. Begin by creating a new Jupyter notebook and using the following statements to load the dataset, reshape the 28 × 28 images into 28 × 28 × 1 arrays (28 × 28 images containing a single color channel), and divide the pixel values by 255 as a simple form of normalization:

from tensorflow.keras.datasets import mnist

(train_images, y_train), (test_images, y_test) = mnist.load_data()
x_train = train_images.reshape(60000, 28, 28, 1) / 255
x_test = test_images.reshape(10000, 28, 28, 1) / 255

Next, define a CNN that accepts 28 × 28 × 1 arrays of pixel values as input, contains two pairs of convolution and pooling layers, and has a softmax output layer with 10 neurons since the dataset contains scans of 10 different digits:

from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dense, Flatten
from tensorflow.keras.models import Sequential

model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(MaxPooling2D(2, 2))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(2, 2))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary(line_length=80)

Figure 10-8 shows the output from the call to summary on the final line. The summary reveals a lot about how this CNN processes images. Each pooling layer reduces the image size by half, while each convolution layer reduces the image’s height and width by two pixels. Why is that? By default, a convolution kernel doesn’t start with its center cell over the pixel in the upper-left corner of the image; rather, its upper-left corner is aligned with the image’s upper-left corner. For a 3 × 3 kernel, that leaves a 1-pixel-wide border around the edges that doesn’t survive the convolution. (For a 5 × 5 kernel, the border that doesn’t survive is 2 pixels wide.) The remedy is padding: you can override the default behavior so that the image’s edges are padded with extra pixels, allowing the kernel’s center cell to reach the corners and the output to match the input’s size. In Keras, this is accomplished by including a padding='same' parameter in the call to Conv2D.

Figure 10-8. Output from the summary method
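
As a point of reference, here's a sketch of the same first layer with 'same' padding, which would leave the output at 28 × 28 rather than 26 × 26:

from tensorflow.keras.layers import Conv2D

# Zero-pads the edges so the feature maps match the input's height and width
layer = Conv2D(32, (3, 3), activation='relu', padding='same',
               input_shape=(28, 28, 1))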

Another takeaway is that each 28 × 28 image exits the first convolution layer as a 3D array or tensor measuring 26 × 26 × 32: one 26 × 26 feature map for each of the 32 kernels. After max pooling, the tensor is reduced to 13 × 13 × 32 and input to the second convolution layer, where 64 more kernels filter features from the thirty-two 13 × 13 feature maps and combine them to produce 64 new feature maps (a tensor measuring 11 × 11 × 64). A final pooling layer reduces that to 5 × 5 × 64. These values are flattened into a 1D tensor containing 1,600 values and fed into a dense layer for classification.

Note

The big picture here is that the CNN transforms each 28 × 28 image comprising 784 pixel values into an array of 1,600 floating-point numbers that (hopefully) distinguishes the contents of the image more clearly than ordinary pixel values do. That’s what bottleneck layers do: they transform matrices of integer pixel values into tensors of floating-point numbers that better characterize the images input to them. As you’ll see in Chapter 13, NLP networks use word embeddings to create dense vector representations of the words in a document. Dense vector representation is a term you’ll encounter a lot in deep learning. It’s nothing more than arrays of floating-point numbers that do more to characterize the input than the input data itself.

The output from summary would look exactly the same if the images input to the network were three-channel color images rather than one-channel grayscale images. Applying a convolution layer with n kernels to an image produces n feature maps regardless of image depth, just as applying a convolution layer featuring n kernels to the feature maps output by preceding layers produces n new feature maps regardless of input depth. Internally, CNNs use tensor dot products to produce 2D feature maps from 3D feature maps. Python’s NumPy library includes a function named tensordot for computing tensor dot products quickly.
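
Here's a quick sketch of the idea: multiplying a 3 × 3 × 32 kernel against a 3 × 3 × 32 patch of the incoming feature maps and summing across all three axes yields one output value for that kernel position:

import numpy as np

# One 3 x 3 x 32 patch of the input feature maps and a kernel of the same shape
patch = np.random.rand(3, 3, 32)
kernel = np.random.rand(3, 3, 32)

# Multiply element by element and sum across all three axes
value = np.tensordot(patch, kernel, axes=3)
print(value)  # A single value for this position in the output feature map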

Now train the network and plot the training and validation accuracy:

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

hist = model.fit(x_train, y_train,
                 validation_data=(x_test, y_test),
                 epochs=10, batch_size=50)
acc = hist.history['accuracy']
val_acc = hist.history['val_accuracy']
epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, '-', label='Training Accuracy')
plt.plot(epochs, val_acc, ':', label='Validation Accuracy')
plt.title('Training and Validation Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')

Once trained, this simple CNN can achieve 99% accuracy classifying handwritten digits:

One reason it can attain such accuracy is the number of training samples—roughly 6,000 per class. (As a test, I trained the network with just 100 samples of each class and got 92% accuracy.) Another factor is that a 2 looks very different from, say, an 8. If a person can rather easily distinguish between the two, then a CNN can too.
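
If you'd rather see a single number than read it off the plot, you can measure the test accuracy directly:

test_loss, test_acc = model.evaluate(x_test, y_test)
print(test_acc)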

Training a CNN to Recognize Arctic Wildlife

A basic CNN can easily achieve 99% accuracy on the MNIST dataset. But it isn’t as easy when the problem is more perceptual—for example, when the goal is to determine whether a photo contains a dog or a cat. One reason is that most 8s look a lot alike, while dogs and cats come in many varieties. Another factor is that each digit in the MNIST dataset is carefully cropped to precisely fill the frame, whereas dogs and cats can appear anywhere in the frame and can be photographed in different poses and from an infinite number of angles.

To demonstrate, let’s train a CNN to distinguish between Arctic foxes, polar bears, and walruses. For context, imagine you’ve been tasked with creating a system that uses AI to examine pictures snapped by motion-activated cameras deployed in the Arctic to document polar bear activity.

Start by downloading a ZIP file containing images for training and testing the CNN. Unpack the ZIP file and place its contents in a subdirectory named Wildlife where your Jupyter notebooks are hosted. The ZIP file contains folders named train, test, and samples. Each folder contains subfolders named arctic_fox, polar_bear, and walrus. The training folders contain 100 images each, while the test folders contain 40 images each. Figure 10-9 shows some of the polar bear training images. These are public images that were downloaded from the internet and cropped and resized to 224 × 224 pixels.

Figure 10-9. Polar bear images

Now create a Jupyter notebook and use the following code to define a pair of helper functions—one to load a batch of images from a specified location in the filesystem and assign them labels, and another to show the first eight images in a batch of images:

import os
from tensorflow.keras.preprocessing import image
import matplotlib.pyplot as plt
%matplotlib inline

def load_images_from_path(path, label):
    images, labels = [], []

    for file in os.listdir(path):
        img = image.load_img(os.path.join(path, file), target_size=(224, 224, 3))
        images.append(image.img_to_array(img))
        labels.append((label))

    return images, labels

def show_images(images):
    fig, axes = plt.subplots(1, 8, figsize=(20, 20),
                            subplot_kw={'xticks': [], 'yticks': []})

    for i, ax in enumerate(axes.flat):
        ax.imshow(images[i] / 255)

x_train, y_train, x_test, y_test = [], [], [], []

Use the following statements to load 100 Arctic fox training images and plot a subset of them:

images, labels = load_images_from_path('Wildlife/train/arctic_fox', 0)
show_images(images)

x_train += images
y_train += labels

Do the same to load and label the polar bear training images:

images, labels = load_images_from_path('Wildlife/train/polar_bear', 1)
show_images(images)

x_train += images
y_train += labels

And then the walrus training images:

images, labels = load_images_from_path('Wildlife/train/walrus', 2)
show_images(images)

x_train += images
y_train += labels

You also need to load the images used to validate the CNN. Start with 40 Arctic fox test images:

images, labels = load_images_from_path('Wildlife/test/arctic_fox', 0)
show_images(images)

x_test += images
y_test += labels

Then the polar bear test images:

images, labels = load_images_from_path('Wildlife/test/polar_bear', 1)
show_images(images)

x_test += images
y_test += labels

And finally the walrus test images:

images, labels = load_images_from_path('Wildlife/test/walrus', 2)
show_images(images)

x_test += images
y_test += labels

The next step is to normalize the training and testing images by dividing their pixel values by 255:

import numpy as np

x_train = np.array(x_train) / 255
x_test = np.array(x_test) / 255

y_train = np.array(y_train)
y_test = np.array(y_test)

Now it’s time to build a CNN. Since the images measure 224 × 224 and we want the final feature maps to compress as much information as possible into a small space, we’ll use five pairs of convolution and pooling layers to extract features from the training images at five resolutions: 224 × 224, 111 × 111, 54 × 54, 26 × 26, and 12 × 12. We’ll follow those with a dense layer and a softmax output layer containing three neurons—one for each of the three classes:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(224, 224, 3)))
model.add(MaxPooling2D(2, 2))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(2, 2))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(2, 2))
model.add(Conv2D(128, (3, 3), activation='relu'))
model.add(MaxPooling2D(2, 2))
model.add(Conv2D(128, (3, 3), activation='relu'))
model.add(MaxPooling2D(2, 2))
model.add(Flatten())
model.add(Dense(1024, activation='relu'))
model.add(Dense(3, activation='softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary(line_length=80)

Call fit to train the model:

hist = model.fit(x_train, y_train,
                 validation_data=(x_test, y_test),
                 batch_size=10, epochs=20)

If you train the model on a CPU, training will probably require from 10 to 20 seconds per epoch. (Think of all those pixel calculations taking place on all those images with all those convolution kernels.) When training is complete, use the following statements to plot the training and validation accuracy:

import seaborn as sns
sns.set()

acc = hist.history['accuracy']
val_acc = hist.history['val_accuracy']
epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, '-', label='Training Accuracy')
plt.plot(epochs, val_acc, ':', label='Validation Accuracy')
plt.title('Training and Validation Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.plot()

Here is the output:

Were the results what you expected? The validation accuracy is decent, but it’s not state of the art. It probably landed between 60% and 70%. Modern CNNs often do 95% or better classifying images such as these. You might be able to squeeze more out of this model by stacking convolution layers or increasing the number of kernels, and you might get it to generalize slightly better by introducing a dropout layer. But you won’t reach 95% with this network and this dataset.
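
For reference, here's a hedged sketch of where a dropout layer might go in this network (the 50% rate is arbitrary, and the convolution and pooling layers are elided):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense, Dropout

model = Sequential()
# ...the same five pairs of convolution and pooling layers go here...
model.add(Flatten())
model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.5))  # Randomly zero half of the activations during training
model.add(Dense(3, activation='softmax'))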

One of the reasons modern CNNs can do image classification so accurately is that they’re trained with millions of images. You don’t need millions of samples of each class, but you probably need at least an order of magnitude more—if not two orders of magnitude more—than the 300 you trained with here. You could scour the internet for more images, but more images means more training time. If the goal is to achieve an accuracy of 95% or more, you’ll quickly get to the point where the CNN takes too long to train—or find yourself shopping for an NVIDIA GPU.

That doesn’t mean CNNs aren’t practical for solving business problems. It just means that there’s more to learn. The next section is the first step in understanding how to attain high levels of accuracy without training a CNN from scratch.

Pretrained CNNs

Microsoft, Google, and other tech companies use a subset of the ImageNet dataset containing more than 1 million images to train state-of-the-art CNNs to recognize 1,000 classes of objects, including Arctic foxes and polar bears. Then they make them available for public consumption. Called pretrained CNNs, they are more sophisticated than anything you’re likely to train yourself. And if that’s not awesome enough, Keras reduces the process of loading a pretrained CNN to one line of code.

Keras provides classes that wrap more than two dozen popular pretrained CNNs. The full list is documented on the Keras website. Most of these CNNs are documented in scholarly papers such as “Deep Residual Learning for Image Recognition” and “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks”. Some have won prestigious competitions such as the ImageNet Large Scale Visual Recognition Challenge and the COCO Detection Challenge. Among the most notable are the ResNet family of networks from Microsoft and the Inception networks from Google. Also noteworthy is MobileNet, which trades size for accuracy and is ideal for mobile devices due to its small memory footprint. You can learn more about it in the Google AI blog.

The following statement instantiates Keras’s MobileNetV2 class and initializes it with the weights, biases, and kernel values arrived at when the network was trained on the ImageNet dataset:

from tensorflow.keras.applications import MobileNetV2

model = MobileNetV2(weights='imagenet')

The weights='imagenet' parameter tells Keras what parameters to load to re-create the network in its trained state. You can also pass a path to a file containing custom weights, but imagenet is the only set of predefined weights that are currently supported.

Before an image is submitted to a pretrained CNN for classification, it must be sized to the dimensions the CNN expects—typically 224 × 224—and preprocessed. Different CNNs expect images to be preprocessed in different ways, so Keras provides a preprocess_input function for each pretrained CNN. It also includes utility functions for loading and resizing images. The following statements load an image from the filesystem and preprocess it for input to the MobileNetV2 network:

import numpy as np
from tensorflow.keras.applications.mobilenet import preprocess_input
from tensorflow.keras.preprocessing import image

x = image.load_img('arctic_fox.jpg', target_size=(224, 224))
x = image.img_to_array(x)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)

In most cases, preprocess_input does all the work that’s needed, which typically means scaling pixel values to a fixed range (such as -1 to 1) or zero-centering them with channel means computed from the ImageNet dataset, sometimes converting RGB images to BGR format along the way. In some cases, however, you still need to divide the pixel values by 255. ResNet50V2 is one example:

import numpy as np
from tensorflow.keras.applications.resnet50 import preprocess_input
from tensorflow.keras.preprocessing import image

x = image.load_img('arctic_fox.jpg', target_size=(224, 224))
x = image.img_to_array(x)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x) / 255

Once an image is preprocessed, making a prediction is as simple as calling the network’s predict method:

y = model.predict(x)

To help you interpret the output, Keras also provides a network-specific decode_predictions method. Figure 10-10 shows what that method returned for a photo submitted to ResNet50V2.

Figure 10-10. Output from decode_predictions

ResNet50V2 is 89% sure the photo contains an Arctic fox—which, it so happens, it does. MobileNetV2 predicted with 92% certainty that the photo contains an Arctic fox. Both networks were trained on the same dataset, but different pretrained CNNs classify images slightly differently.

Using ResNet50V2 to Classify Images

Let’s use Keras to load a pretrained CNN and classify a pair of images. Fire up a notebook and use the following statements to load ResNet50V2:

from tensorflow.keras.applications import ResNet50V2

model = ResNet50V2(weights='imagenet')
model.summary()

Next, load an Arctic fox image and show it in the notebook:

%matplotlib inline
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing import image

x = image.load_img('Wildlife/samples/arctic_fox/arctic_fox_140.jpeg',
                   target_size=(224, 224))
plt.xticks([])
plt.yticks([])
plt.imshow(x)

Now preprocess the image (remember that for ResNet50V2, you also have to divide all the pixel values by 255 after calling Keras’s preprocess_input method) and pass it to the CNN for classification:

import numpy as np
from tensorflow.keras.applications.resnet50 import preprocess_input
from tensorflow.keras.applications.resnet50 import decode_predictions

x = image.img_to_array(x)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x) / 255

y = model.predict(x)
decode_predictions(y)

The output should look like this:

[[('n02120079', 'Arctic_fox', 0.9999944),
  ('n02114548', 'white_wolf', 4.760021e-06),
  ('n02119789', 'kit_fox', 2.3306782e-07),
  ('n02442845', 'mink', 1.2460312e-07),
  ('n02111889', 'Samoyed', 1.1914468e-07)]]

ResNet50V2 is virtually certain that the image contains an Arctic fox. But now load a walrus image:

x = image.load_img('Wildlife/samples/walrus/walrus_143.png',
                   target_size=(224, 224))
plt.xticks([])
plt.yticks([])
plt.imshow(x)

Ask ResNet50V2 to classify it:

x = image.img_to_array(x)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x) / 255

y = model.predict(x)
decode_predictions(y)

Here’s the output:

[[('n02454379', 'armadillo', 0.63758147),
  ('n01704323', 'triceratops', 0.16057032),
  ('n02113978', 'Mexican_hairless', 0.07795086),
  ('n02398521', 'hippopotamus', 0.022284042),
  ('n01817953', 'African_grey', 0.016944142)]]

ResNet50V2 thinks the image is most likely an armadillo, but it’s not even very sure about that. Can you guess why?

ResNet50V2 was trained with almost 1.3 million images. None of them, however, contained a walrus. The ImageNet 1000 Class List shows a complete list of classes it was trained to recognize. A pretrained CNN is great when you need it to classify images using the classes it was trained with, but it is powerless to handle domain-specific tasks that it wasn’t trained for.

But all is not lost. A technique called transfer learning enables pretrained CNNs to be repurposed to solve domain-specific problems. The repurposing can be done on an ordinary CPU; no GPU required. Transfer learning sometimes achieves 95% accuracy with just a few hundred training images. Once you learn about it, you’ll have a completely different perspective on the efficacy of using CNNs to solve business problems.

Transfer Learning

Earlier, you used a dataset with photos of Arctic foxes, polar bears, and walruses to train a CNN to recognize Arctic wildlife. Trained with 300 images—100 for each of the three classes—the CNN achieved an accuracy of around 60%. That’s not sufficient for most purposes.

One solution is to train the CNN with tens of thousands of photos. A better solution—one that can deliver world-class accuracy with the 300 photos you have and doesn’t require expensive hardware—is transfer learning. In the hands of software developers and engineers, transfer learning makes CNNs a practical solution for a variety of computer-vision problems. And it requires orders of magnitude less time and compute power than CNNs trained from scratch. Let’s take a moment to understand what transfer learning is and how it works—and then put it to work identifying Arctic wildlife.

Pretrained CNNs trained on the ImageNet dataset can identify Arctic foxes and polar bears, but they can’t identify walruses because they weren’t trained with walrus images. Transfer learning lets you repurpose pretrained CNNs to identify objects they weren’t originally trained to identify. It leverages the intelligence baked into pretrained CNNs, but it repurposes that intelligence to solve new problems.

Recall that a CNN has two groups of layers: bottleneck layers containing the convolution and pooling layers that extract features from images at various resolutions, and classification layers, which classify features output from the bottleneck layers as belonging to an Arctic fox, a polar bear, or something else. Convolution layers use convolution kernels to extract features, and the values in the convolution kernels are learned during training. This learning accounts for the bulk of the training time. When sophisticated CNNs are trained with millions of images, the convolution kernels become very efficient at extracting features. But that efficiency comes at a cost.

The premise behind transfer learning is shown in Figure 10-11. You load the bottleneck layers of a pretrained CNN, but you don’t load the classification layers. Instead, you provide your own, which train orders of magnitude more quickly than an entire CNN. Then you pass the training images through the bottleneck layers for feature extraction and train the classification layers on the resulting features. The pretrained CNN might have been trained to extract features from pictures of apples and oranges, but those same layers are probably pretty good at extracting features from photos of dogs and cats too. By using the pretrained bottleneck layers to extract features and then using those features to train your own classification layers, you can teach the model that a certain feature extracted from an image might be indicative of a dog rather than an apple.

Figure 10-11. Neural network architecture for transfer learning

Transfer learning is relatively simple to implement with Keras and TensorFlow. Recall that the following statement loads ResNet50V2 and initializes it with the weights (including kernel values) and biases that were arrived at when the network was trained on a subset of the ImageNet dataset:

base_model = ResNet50V2(weights='imagenet')

To load ResNet50V2 (or any other pretrained CNN that Keras supports) without the classification layers, you simply add an include_top=False attribute:

base_model = ResNet50V2(weights='imagenet', include_top=False)

From that point, there are two ways to go about transfer learning. The first involves appending classification layers to the base model’s bottleneck layers and setting each base layer’s trainable attribute to False so that the weights, biases, and convolution kernels won’t be updated when the network is trained:

for layer in base_model.layers:
    layer.trainable = False

model = Sequential()
model.add(base_model)
model.add(Flatten())
model.add(Dense(1024, activation='relu'))
model.add(Dense(3, activation='softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x, y, validation_split=0.2, epochs=10, batch_size=10)

The second technique is to run all the training images through the base model for feature extraction, and then run the features through a separate network containing your classification layers:

features = base_model.predict(x)

model = Sequential()
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(3, activation='softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(features, y, validation_split=0.2, epochs=10, batch_size=10)

Which technique is better? The second is faster because the training images go through the bottleneck layers for feature extraction just one time rather than once per epoch. It’s the technique you should use in the absence of a compelling reason to do otherwise. The first technique is slightly slower, but it lends itself to fine-tuning, in which you unfreeze one or more bottleneck layers after training is complete and train for a few more epochs with a very low learning rate. It also facilitates data augmentation, which I’ll introduce in the next section.
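
Here's a hedged sketch of what fine-tuning might look like once the first technique's model has been trained, with base_model and model defined as above (the learning rate and epoch count are illustrative):

from tensorflow.keras.optimizers import Adam

# Unfreeze the bottleneck layers, then recompile and train briefly at a
# very low learning rate
base_model.trainable = True

model.compile(optimizer=Adam(learning_rate=1e-5),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x, y, validation_split=0.2, epochs=5, batch_size=10)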

Note

Fine-tuning is frequently applied to transfer-learning models after training is complete in an effort to squeeze out an extra percentage point or two of accuracy. We will use fine-tuning in Chapter 13 to increase the accuracy of an NLP model that utilizes a pretrained neural network.

If you use the first technique to implement transfer learning, you make predictions by preprocessing the images and passing them to the model’s predict method. For the second technique, making predictions is a two-step process. After preprocessing the images, you pass them to the base model’s predict method, and then you pass the output from that method to your model’s predict method:

x = image.img_to_array(x)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x) / 255

features = base_model.predict(x)
predictions = model.predict(features)

And with that, transfer learning is complete. All that remains is to put it in practice.

Using Transfer Learning to Identify Arctic Wildlife

Let’s use transfer learning to solve the same problem that we attempted to solve earlier with a scratch-built CNN: building a model that determines whether a photo contains an Arctic fox, a polar bear, or a walrus.

Create a Jupyter notebook and use the same code you used earlier to load the training and test images and assign labels to them: 0 for Arctic foxes, 1 for polar bears, and 2 for walruses. Once that’s done, the next step is to preprocess the images. We’ll use ResNet50V2 as our pretrained CNN, so use the ResNet version of preprocess_input to preprocess the pixels. Then divide the pixel values by 255:

import numpy as np
from tensorflow.keras.applications.resnet50 import preprocess_input

x_train = preprocess_input(np.array(x_train)) / 255
x_test = preprocess_input(np.array(x_test)) / 255

y_train = np.array(y_train)
y_test = np.array(y_test)

The next step is to load ResNet50V2, being careful to load the bottleneck layers but not the classification layers, and use it to extract features from the training and test images:

from tensorflow.keras.applications import ResNet50V2

base_model = ResNet50V2(weights='imagenet', include_top=False)

x_train = base_model.predict(x_train)
x_test = base_model.predict(x_test)

Now train a neural network to classify features extracted from the training images:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense

model = Sequential()
model.add(Flatten())
model.add(Dense(1024, activation='relu'))
model.add(Dense(3, activation='softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

hist = model.fit(x_train, y_train,
                 validation_data=(x_test, y_test),
                 batch_size=10, epochs=10)

How well did the network train? Plot the training accuracy and validation accuracy for each epoch:

import seaborn as sns
sns.set()

acc = hist.history['accuracy']
val_acc = hist.history['val_accuracy']
epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, '-', label='Training Accuracy')
plt.plot(epochs, val_acc, ':', label='Validation Accuracy')
plt.title('Training and Validation Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.plot()

Your results will differ from mine, but I got about 97% accuracy. If you didn’t quite get there, try training the network again:

Finally, use a confusion matrix to visualize how well the network distinguishes between classes:

from sklearn.metrics import ConfusionMatrixDisplay as cmd

sns.reset_orig()
fig, ax = plt.subplots(figsize=(4, 4))
ax.grid(False)

y_pred = model.predict(x_test)
class_labels = ['arctic fox', 'polar bear', 'walrus']

cmd.from_predictions(y_test, y_pred.argmax(axis=1),
                     display_labels=class_labels, colorbar=False,
                     cmap='Blues', xticks_rotation='vertical', ax=ax)

Here’s how it turned out for me:

To see transfer learning at work, load one of the Arctic fox images from the samples folder. That folder contains wildlife images with which the model was neither trained nor validated:

x = image.load_img('Wildlife/samples/arctic_fox/arctic_fox_140.jpeg',
                    target_size=(224, 224))
plt.xticks([])
plt.yticks([])
plt.imshow(x)

Now preprocess the image, run it through ResNet50V2’s feature extraction layers, and run the output through the newly trained classification layers:

x = image.img_to_array(x)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x) / 255

y = base_model.predict(x)
predictions = model.predict(y)

for i, label in enumerate(class_labels):
    print(f'{label}: {predictions[0][i]}')

For me, the network predicted with almost 100% confidence that the image contains an Arctic fox:

arctic fox: 1.0
polar bear: 0.0
walrus: 0.0

Perhaps that’s not surprising, since ResNet50V2 was trained with Arctic fox images. But now let’s load a walrus image, which, you’ll recall, ResNet50V2 was unable to classify:

x = image.load_img('Wildlife/samples/walrus/walrus_143.png',
                    target_size=(224, 224))
plt.xticks([])
plt.yticks([])
plt.imshow(x)

Preprocess the image and make a prediction:

x = image.img_to_array(x)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x) / 255

y = base_model.predict(x)
predictions = model.predict(y)

for i, label in enumerate(class_labels):
    print(f'{label}: {predictions[0][i]}')

Here’s how it turned out this time:

arctic fox: 0.0
polar bear: 0.0
walrus: 1.0

ResNet50V2 wasn’t trained to recognize walruses, but your network was. That’s transfer learning in a nutshell. It’s the deep-learning equivalent of having your cake and eating it too. And it’s the secret sauce that makes CNNs a viable tool for anyone with a laptop and a few hundred training images.

That’s not to say that transfer learning will always get you 97% accuracy with 100 images per class. It won’t. If a dataset lacks the information to achieve that level of separation, neither scratch-built CNNs nor transfer learning will magically make it happen. That’s always true in machine learning and AI. You can’t get water from a rock. And you can’t build an accurate model from data that doesn’t support it.

Data Augmentation

The previous example demonstrated how to use transfer learning to build a model that, with just 300 training images, can classify photos of three different types of Arctic wildlife with 97% accuracy. One of the benefits of transfer learning is that it can do more with fewer images. This feature is also a bug, however. With just 100 or so samples of each class, there is little diversity among images. A model might be able to recognize a polar bear if the bear’s head is perfectly aligned in the center of the photo. But if the training images don’t include photos with the bear’s head aligned differently or tilted at different angles, the model might have difficulty classifying the photo.

One solution is data augmentation. Rather than scare up more training images, you can rotate, translate, and scale the images you have. It doesn’t always increase a CNN’s accuracy, but it frequently does, especially with small datasets. Keras makes it easy to randomly transform training images provided to a network. Images are transformed differently in each epoch, so if you train for 10 epochs, the network sees 10 different variations of each training image. This can increase a model’s ability to generalize with little impact on training time. Figure 10-12 shows the effect of applying random transforms to a hot dog image. You can see why presenting the same image to a model in different ways might make the model more adept at recognizing hot dogs, regardless of how the hot dog is framed.

Figure 10-12. Hot dog image with random transforms applied

Keras has built-in support for data augmentation with images. Let’s look at a couple of ways to put image augmentation to work, and then apply it to the Arctic wildlife model.

Image Augmentation with ImageDataGenerator

One way to apply image augmentation when training a model is to use Keras’s ImageDataGenerator class. ImageDataGenerator generates batches of training images on the fly, either from images you’ve loaded (for example, with Keras’s load_img function) or from a specified location in the filesystem. The latter is especially useful when training CNNs with millions of images because it loads images into memory in batches rather than all at once. Regardless of where the images come from, however, ImageDataGenerator is happy to apply transforms as it serves them up.

Here’s a simple example that you can try yourself. Use the following code to load an image from your filesystem, wrap an ImageDataGenerator around it, and generate 24 versions of the image:

import numpy as np
from tensorflow.keras.preprocessing import image
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import matplotlib.pyplot as plt
%matplotlib inline

# Load an image
x = image.load_img('Wildlife/train/polar_bear/polar_bear_010.jpeg')
x = image.img_to_array(x)
x = np.expand_dims(x, axis=0)

# Wrap an ImageDataGenerator around it
idg = ImageDataGenerator(rescale=1./255,
                         horizontal_flip=True,
                         rotation_range=30,
                         width_shift_range=0.2,
                         height_shift_range=0.2,
                         zoom_range=0.2)
idg.fit(x)

# Generate 24 versions of the image
generator = idg.flow(x, [0], batch_size=1, seed=0)
fig, axes = plt.subplots(3, 8, figsize=(16, 6),
                         subplot_kw={'xticks': [], 'yticks': []})

for i, ax in enumerate(axes.flat):
    img, label = generator.next()
    ax.imshow(img[0])

Here’s the result:

The parameters passed to ImageDataGenerator tell it how to transform each image it delivers:

rescale=1./255
Divides each pixel value by 255
horizontal_flip=True
Randomly flips the image horizontally (around the vertical axis)
rotation_range=30
Randomly rotates the image by –30 to 30 degrees
width_shift_range=0.2 and height_shift_range=0.2
Randomly translates the image by –20% to 20%
zoom_range=0.2
Randomly scales the image by –20% to 20%

There are other parameters that you can use, such as vertical_flip, shear_range, and brightness_range, but you get the picture. The flow method used in this example generates images from in-memory arrays that you pass to it. The related flow_from_directory method loads images from the filesystem and optionally labels them based on the subdirectories they’re in.
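
Here's a hedged sketch of flow_from_directory using the Wildlife/train folder from earlier, whose subdirectories double as class names; class_mode='sparse' yields integer labels compatible with sparse_categorical_crossentropy:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

idg = ImageDataGenerator(rescale=1./255, horizontal_flip=True)

# Labels are inferred from the subdirectory each image lives in
generator = idg.flow_from_directory('Wildlife/train',
                                    target_size=(224, 224),
                                    batch_size=10,
                                    class_mode='sparse')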

The generator returned by flow can be passed directly to a model’s fit method to provide randomly transformed images to the model as it is trained. Assume that x_train and y_train hold a collection of training images and labels. The following code wraps an ImageDataGenerator around them and uses them to train a model:

idg = ImageDataGenerator(rescale=1./255,
                         horizontal_flip=True,
                         rotation_range=30,
                         width_shift_range=0.2,
                         height_shift_range=0.2,
                         zoom_range=0.2)

idg.fit(x_train)
image_batch_size = 10
generator = idg.flow(x_train, y_train, batch_size=image_batch_size, seed=0)

model.fit(generator,
          steps_per_epoch=len(x_train) // image_batch_size,
          validation_data=(x_test, y_test),
          batch_size=20,
          epochs=10)

The steps_per_epoch parameter is key because an ImageDataGenerator can provide an infinite number of versions of each image. In this example, the batch_size parameter passed to flow tells the generator to create 10 images in each batch. Dividing the number of images by the image batch size to calculate steps_per_epoch ensures that in each training epoch, the model is provided with one transformed version of each image in the dataset.

Note

Versions of Keras prior to 2.1 didn’t allow a generator to be passed to the fit method. Instead, they provided a separate method named fit_generator. That method is deprecated and will be removed in a future release.

Observe that the call to fit includes a validation_data parameter identifying a separate set of images and labels for validating the network during training. You generally don’t want to augment validation images, so you should avoid using validation_split when passing a generator to fit.

Image Augmentation with Augmentation Layers

You can use ImageDataGenerator to provide transformed images to a model, but recent versions of Keras provide an alternative in the form of image preprocessing layers and image augmentation layers. Rather than transform training images separately, you can integrate the transforms directly into the model. Here’s an example:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from tensorflow.keras.layers import Rescaling, RandomFlip, RandomRotation
from tensorflow.keras.layers import RandomTranslation, RandomZoom

model = Sequential()
model.add(Rescaling(1./255))
model.add(RandomFlip(mode='horizontal'))
model.add(RandomTranslation(0.2, 0.2))
model.add(RandomRotation(0.2))
model.add(RandomZoom(0.2))
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(224, 224, 3)))
model.add(MaxPooling2D(2, 2))
model.add(Conv2D(128, (3, 3), activation='relu'))
model.add(MaxPooling2D(2, 2))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(3, activation='softmax'))

Each image used to train the CNN has its pixel values divided by 255 and is then randomly flipped, translated, rotated, and scaled. Significantly, the RandomFlip, RandomTranslation, RandomRotation, and RandomZoom layers operate only on training images. They are inactive when the network is validated or asked to make predictions. Consequently, it’s fine to use validation_split when training a model that contains image augmentation layers. The Rescaling layer is active at all times, meaning you no longer have to remember to divide pixel values by 255 before training the model or submitting an image for classification.

Applying Image Augmentation to Arctic Wildlife

Would image augmentation make transfer learning even better? There’s one way to find out.

Create a Jupyter notebook and copy the code that loads the training and test images from the transfer learning example. Then use the following statements to prepare the data. Note that there is no need to divide by 255 this time because a Rescaling layer will take care of that:

import numpy as np
from tensorflow.keras.applications.resnet50 import preprocess_input

x_train = preprocess_input(np.array(x_train))
x_test = preprocess_input(np.array(x_test))

y_train = np.array(y_train)
y_test = np.array(y_test)

Now load ResNet50V2 without the classification layers and initialize it with the ImageNet weights. A key element here is preventing the bottleneck layers from training when the network is trained by setting their trainable attributes to False, effectively freezing those layers. Rather than setting each individual layer’s trainable attribute to False, we’ll set trainable to False on the model itself and allow that setting to be “inherited” by the individual layers:

from tensorflow.keras.applications import ResNet50V2

base_model = ResNet50V2(weights='imagenet', include_top=False)
base_model.trainable = False

Define a network that incorporates rescaling and augmentation layers, ResNet50V2’s bottleneck layers, and dense layers for classification. Then train the network:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense, Rescaling, RandomFlip
from tensorflow.keras.layers import RandomRotation, RandomTranslation, RandomZoom

model = Sequential()
model.add(Rescaling(1./255))
model.add(RandomFlip(mode='horizontal'))
model.add(RandomTranslation(0.2, 0.2))
model.add(RandomRotation(0.2))
model.add(RandomZoom(0.2))
model.add(base_model)
model.add(Flatten())
model.add(Dense(1024, activation='relu'))
model.add(Dense(3, activation='softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

hist = model.fit(x_train, y_train,
                 validation_data=(x_test, y_test),
                 batch_size=10, epochs=10)

How well did the network train? Plot the training accuracy and validation accuracy for each epoch:

import seaborn as sns
sns.set()

acc = hist.history['accuracy']
val_acc = hist.history['val_accuracy']
epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, '-', label='Training Accuracy')
plt.plot(epochs, val_acc, ':', label='Validation Accuracy')
plt.title('Training and Validation Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.plot()

With a little luck, the accuracy slightly exceeded that of the model trained without data augmentation:

Note

You may find that the version of the model that uses data augmentation is less accurate than the version that doesn’t. To be sure, I trained each version 10 times and averaged the results. I found that the data augmentation version delivered, on average, about 0.5% more accuracy than the version that lacks augmentation. That’s not a lot, but data scientists frequently go to great lengths to improve accuracy by just a fraction of a percentage point.

Use a confusion matrix to visualize how well the network performed during testing:

from sklearn.metrics import ConfusionMatrixDisplay as cmd

sns.reset_orig()
fig, ax = plt.subplots(figsize=(4, 4))
ax.grid(False)

y_pred = model.predict(x_test)
class_labels = ['arctic fox', 'polar bear', 'walrus']

cmd.from_predictions(y_test, y_pred.argmax(axis=1),
                     display_labels=class_labels, colorbar=False,
                     cmap='Blues', xticks_rotation='vertical', ax=ax)

Here’s how it turned out for me:

Data scientists sometimes employ data augmentation even when they’re training a CNN from scratch rather than employing transfer learning, especially when the dataset is relatively small. It’s a useful tool to know about, and one that could make a difference when you’re trying to squeeze every last ounce of accuracy out of a deep-learning model.

Global Pooling

The purpose of including a Flatten layer in a CNN is to reshape the 3D tensors containing the final feature maps into 1D tensors suitable for input to a Dense layer. But Flatten isn’t the only way to do it. Flattening sometimes leads to overfitting by providing too much information to the classification layers.

One way to combat overfitting is to introduce a Dropout layer. Another strategy is to reduce the width of the Dense layer. A third option is to replace the Flatten layer with a GlobalMaxPooling2D layer or a GlobalAveragePooling2D layer. They, too, output 1D tensors, but they generate them in a different way. And that way is less prone to overfitting.

To demonstrate, modify the MNIST dataset example earlier in this chapter to use a GlobalMaxPooling2D layer rather than a Flatten layer:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv2D, MaxPooling2D,
                                     GlobalMaxPooling2D, Dense)

model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(MaxPooling2D(2, 2))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(2, 2))
model.add(GlobalMaxPooling2D()) # In lieu of Flatten()
model.add(Dense(128, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary(line_length=120)

The summary (Figure 10-13) shows that the output from the GlobalMaxPooling2D layer is a tensor containing 64 values—one per feature map emitted by the final MaxPooling2D layer—rather than 5 × 5 × 64, or 1,600, values, as it was for the Flatten layer. Each value is the maximum of the 25 values in each 5 × 5 feature map. Had you used GlobalAveragePooling2D instead, each value would have been the average of the 25 values in each feature map.

Figure 10-13. Output from the summary method

Global pooling sometimes increases a CNN’s ability to generalize and sometimes does not. For the MNIST dataset, it slightly diminishes accuracy. As is so often the case in machine learning, the only way to know is to try. And due to the randomness inherent in training neural networks, it’s always advisable to train the network several times in each configuration and average the results before drawing conclusions.
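
If you want to automate that process, a sketch might look like this, where build_model is a hypothetical helper that returns a freshly compiled copy of whichever configuration you're testing:

import numpy as np

val_accs = []

for run in range(5):
    model = build_model()  # Hypothetical: constructs and compiles a fresh model
    hist = model.fit(x_train, y_train,
                     validation_data=(x_test, y_test),
                     epochs=10, batch_size=50, verbose=0)
    val_accs.append(hist.history['val_accuracy'][-1])

print(np.mean(val_accs))  # Average validation accuracy across the runs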

Audio Classification with CNNs

Imagine that you’re the leader of a group of climate scientists concerned about the planet’s dwindling rainforests. The world loses up to 10 million acres of old-growth rainforests each year, much of it due to illegal logging. Your team plans to convert thousands of discarded smartphones into solar-powered listening devices and position them throughout the Amazon to transmit alerts in response to the sounds of chainsaws and truck engines. You need software that uses AI to identify such sounds in real time. And you need it fast, because climate change won’t wait.

An effective way to perform audio classification is to convert audio streams into spectrogram images, which provide visual representations of spectrums of frequencies as they vary over time, and use CNNs to classify the spectrograms. The spectrograms in Figure 10-14 were generated from WAV files containing chainsaw sounds. Let’s use transfer learning to create a model that can identify the telltale sounds of logging operations and distinguish them from ambient sounds such as wildlife and thunderstorms.

Figure 10-14. Spectrograms generated from audio files containing chainsaw sounds
Note

The tutorial in this section was inspired by the Rainforest Connection, which uses recycled Android phones to monitor rainforests for sounds of illegal activity. A TensorFlow CNN hosted in the cloud analyzes audio from the phones and may one day run on the phones themselves with an assist from TensorFlow Lite, a smaller version of TensorFlow designed for mobile, embedded, and edge devices. For more information, see “The Fight Against Illegal Deforestation with TensorFlow” in the Google AI blog. It’s just one example of how AI is making the world a better place.

Begin by downloading a ZIP file containing a dataset of rainforest sounds. (Warning: it’s a 666 MB download.) Create a subdirectory named Sounds in the directory where your notebooks are hosted, and copy the contents of the ZIP file into the subdirectory. Sounds now contains subdirectories named background, chainsaw, engine, and storm. Each subdirectory contains 100 WAV files. The WAV files in the background directory contain rainforest background noises only, while the files in the other subdirectories include the sounds of chainsaws, engines, and thunderstorms overlaid on the background noises. I generated these files by using a soundscape synthesis package named Scaper to combine sounds in the public UrbanSound8K dataset with rainforest sounds. Play a few of the WAV files on your computer to get a feel for the sounds they contain.

Now create a Jupyter notebook and paste the following code into the first cell:

import numpy as np
import librosa.display, os
import matplotlib.pyplot as plt
%matplotlib inline

def create_spectrogram(audio_file, image_file):
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    fig.subplots_adjust(left=0, right=1, bottom=0, top=1)

    y, sr = librosa.load(audio_file)
    ms = librosa.feature.melspectrogram(y=y, sr=sr)
    log_ms = librosa.power_to_db(ms, ref=np.max)
    librosa.display.specshow(log_ms, sr=sr)

    fig.savefig(image_file)
    plt.close(fig)

def create_pngs_from_wavs(input_path, output_path):
    if not os.path.exists(output_path):
        os.makedirs(output_path)

    for file in os.listdir(input_path):
        input_file = os.path.join(input_path, file)
        output_file = os.path.join(output_path, file.replace('.wav', '.png'))
        create_spectrogram(input_file, output_file)

This code defines a pair of functions to help convert WAV files into spectrogram images. create_spectrogram uses a Python package named Librosa to create a spectrogram image from a WAV file. create_pngs_from_wavs converts all the WAV files in a specified directory into spectrogram images. You will need to install Librosa if it isn’t installed already.
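
If Librosa isn't already installed, a typical way to install it is to run pip from a notebook cell (this assumes pip is available in the active environment):

%pip install librosa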

Use the following statements to create PNG files containing spectrograms from all the WAV files in the Sounds directory’s subdirectories:

create_pngs_from_wavs('Sounds/background', 'Spectrograms/background')
create_pngs_from_wavs('Sounds/chainsaw', 'Spectrograms/chainsaw')
create_pngs_from_wavs('Sounds/engine', 'Spectrograms/engine')
create_pngs_from_wavs('Sounds/storm', 'Spectrograms/storm')
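
Once the four calls above finish, an optional sanity check (a quick sketch, not part of the chapter's code) counts the PNG files in each subdirectory:

import os

for cls in ['background', 'chainsaw', 'engine', 'storm']:
    path = os.path.join('Spectrograms', cls)
    count = len([f for f in os.listdir(path) if f.endswith('.png')])
    print(f'{cls}: {count} PNG files')   # each should report 100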

After confirming that each subdirectory contains 100 PNG files, use the following code to define two new helper functions for loading and displaying spectrograms, and to declare two Python lists—one to store spectrogram images and another to store class labels:

from tensorflow.keras.preprocessing import image

def load_images_from_path(path, label):
    images, labels = [], []

    for file in os.listdir(path):
        images.append(image.img_to_array(image.load_img(os.path.join(path, file),
                      target_size=(224, 224))))
        labels.append(label)

    return images, labels

def show_images(images):
    fig, axes = plt.subplots(1, 8, figsize=(20, 20),
                             subplot_kw={'xticks': [], 'yticks': []})

    for i, ax in enumerate(axes.flat):
        ax.imshow(images[i] / 255)

x, y = [], []

Use the following statements to load the background spectrogram images, add them to the list named x, and label them with 0s:

images, labels = load_images_from_path('Spectrograms/background', 0)
show_images(images)

x += images
y += labels

Repeat this process to load chainsaw spectrograms from the Spectrograms/chainsaw directory, engine spectrograms from the Spectrograms/engine directory, and thunderstorm spectrograms from the Spectrograms/storm directory. Label chainsaw spectrograms with 1s, engine spectrograms with 2s, and thunderstorm spectrograms with 3s; a sketch of those three cells follows the table. Here are the labels for the four classes of images:

Spectrogram type    Label
Background          0
Chainsaw            1
Engine              2
Storm               3
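
Those three additional cells follow the same pattern as the background cell; here's a sketch:

images, labels = load_images_from_path('Spectrograms/chainsaw', 1)
show_images(images)

x += images
y += labels

images, labels = load_images_from_path('Spectrograms/engine', 2)
show_images(images)

x += images
y += labels

images, labels = load_images_from_path('Spectrograms/storm', 3)
show_images(images)

x += images
y += labels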

Since this model may one day run on mobile phones, we’ll use MobileNetV2 as the base network. Use the following code to preprocess the pixels and split the images and labels into two datasets—one for training and one for testing:

from sklearn.model_selection import train_test_split
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input

x = preprocess_input(np.array(x))
y = np.array(y)

x_train, x_test, y_train, y_test = train_test_split(x, y, stratify=y,
                                                    test_size=0.3,
                                                    random_state=0)

Call Keras’s MobileNetV2 function to instantiate MobileNetV2 without the classification layers. Then run the training data and test data through MobileNetV2 to extract features from the spectrogram images:

from tensorflow.keras.applications import MobileNetV2

base_model = MobileNetV2(weights='imagenet', include_top=False,
                         input_shape=(224, 224, 3))

train_features = base_model.predict(x_train)
test_features = base_model.predict(x_test)
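
If you'd like to verify the feature extraction, inspect the shapes. With 224 × 224 inputs, MobileNetV2's bottleneck output is 7 × 7 × 1,280 per image, so with 400 images split 70/30 you should see something like the following (a quick check, not part of the chapter's code):

print(train_features.shape)   # expect (280, 7, 7, 1280)
print(test_features.shape)    # expect (120, 7, 7, 1280)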

Define a neural network to classify features extracted by MobileNetV2:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten

model = Sequential()
model.add(Flatten())
model.add(Dense(512, activation='relu'))
model.add(Dense(4, activation='softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
Note

As an experiment, I replaced the Flatten layer with a GlobalAveragePooling2D layer. Validation accuracy improved slightly, but the model didn’t generalize as well when tested with audio extracted from a documentary video. This underscores an important point from Chapter 9: you can have full trust and confidence in a model only when it’s tested with data it has never seen before—preferably data that comes from a different source.

Train the network with the features:

hist = model.fit(train_features, y_train,
                 validation_data=(test_features, y_test),
                 batch_size=10, epochs=10)

Plot the training and validation accuracy:

import seaborn as sns
sns.set()

acc = hist.history['accuracy']
val_acc = hist.history['val_accuracy']
epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, '-', label='Training Accuracy')
plt.plot(epochs, val_acc, ':', label='Validation Accuracy')
plt.title('Training and Validation Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.plot()

The validation accuracy should reach 95% or higher.

Run the test images through the network and use a confusion matrix to assess the results:

from sklearn.metrics import ConfusionMatrixDisplay as cmd

sns.reset_orig()
fig, ax = plt.subplots(figsize=(4, 4))
ax.grid(False)

y_pred = model.predict(test_features)
class_labels = ['background', 'chainsaw', 'engine', 'storm']

cmd.from_predictions(y_test, y_pred.argmax(axis=1),
                     display_labels=class_labels, colorbar=False,
                     cmap='Blues', xticks_rotation='vertical', ax=ax)

The network is reasonably adept at identifying clips that don’t contain the sounds of chainsaws or engines. It sometimes confuses chainsaw sounds and engine sounds. That’s OK, because the presence of either might indicate illicit activity in a rainforest.
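
If you'd like a numeric summary to complement the confusion matrix, Scikit-learn's classification_report prints per-class precision and recall (a quick sketch, not part of the chapter's code):

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred.argmax(axis=1),
                            target_names=class_labels))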

The Sounds directory has a subdirectory named samples containing WAV files with which the CNN was neither trained nor validated. The WAV files bear no relation to the samples used for training and testing; they come from a YouTube video documenting Brazil’s efforts to curb illegal logging. Let’s use the model you just trained to analyze these files for sounds of logging activity.

Start by creating a spectrogram from the first sample WAV file, which contains audio of loggers cutting down trees in the Amazon:

create_spectrogram('Sounds/samples/sample1.wav', 'Spectrograms/sample1.png')

x = image.load_img('Spectrograms/sample1.png', target_size=(224, 224))
plt.xticks([])
plt.yticks([])
plt.imshow(x)

Preprocess the spectrogram image, pass it to MobileNetV2 for feature extraction, and classify the features:

x = image.img_to_array(x)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)

y = base_model.predict(x)
predictions = model.predict(y)

for i, label in enumerate(class_labels):
    print(f'{label}: {predictions[0][i]}')
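
If you want a single answer rather than four probabilities, you can take the class with the highest probability. Here's one way (a small sketch, not part of the chapter's code):

predicted_label = class_labels[np.argmax(predictions[0])]
print(f'Predicted class: {predicted_label}')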

Now create a spectrogram from a WAV file that features the sound of a logging truck rumbling through the rainforest:

create_spectrogram('Sounds/samples/sample2.wav', 'Spectrograms/sample2.png')

x = image.load_img('Spectrograms/sample2.png', target_size=(224, 224))
plt.xticks([])
plt.yticks([])
plt.imshow(x)

Preprocess the image, pass it to MobileNetV2 for feature extraction, and classify the features:

x = image.img_to_array(x)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)

y = base_model.predict(x)
predictions = model.predict(y)

for i, label in enumerate(class_labels):
    print(f'{label}: {predictions[0][i]}')

If the network got either of the samples wrong, try training it again. Remember that a neural network will train differently every time, in part because Keras initializes the weights with small random values. In the real world, data scientists often train a neural network several times and average the results to quantify its accuracy.
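
Here's a sketch of what that might look like for this model; it reuses the Sequential, Flatten, and Dense classes, NumPy, and the extracted features from the cells above:

scores = []

for _ in range(5):
    run_model = Sequential()
    run_model.add(Flatten())
    run_model.add(Dense(512, activation='relu'))
    run_model.add(Dense(4, activation='softmax'))
    run_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
                      metrics=['accuracy'])
    run_model.fit(train_features, y_train, batch_size=10, epochs=10, verbose=0)
    _, accuracy = run_model.evaluate(test_features, y_test, verbose=0)
    scores.append(accuracy)

print(f'Mean accuracy over 5 runs: {np.mean(scores):.3f}')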

Summary

Convolutional neural networks excel at image classification because they use convolution kernels to extract features from images at different resolutions—features intended to accentuate differences between classes. Convolution layers use convolution kernels to extract features, and pooling layers reduce the size of the feature maps output from the convolution layers. Output from these layers is input to fully connected layers for classification. Keras provides implementations of convolution and pooling layers in classes such as Conv2D and MaxPooling2D.

Training a CNN from scratch when there is a relatively high degree of separation between classes—for example, the MNIST dataset—is feasible on an ordinary laptop or PC. Training a CNN to solve a more perceptual problem requires more training images and commensurately more compute power. Transfer learning is a practical alternative to training CNNs from scratch. It uses the intelligence already present in the bottleneck layers of pretrained CNNs to extract features from images, and then uses its own classification layers to interpret the results.

Data augmentation can increase the accuracy of a CNN trained with a relatively small number of images and is especially useful with transfer learning. Augmentation involves applying random transforms such as translations and rotations to the training images. You can transform images before inputting them to the network with Keras’s ImageDataGenerator class, or you can build the transforms into the network with layers such as RandomRotation and RandomTranslation. Layers that transform images are active at training time but inactive when the network makes predictions.
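
For reference, here's a minimal sketch of the second approach (it assumes a recent version of TensorFlow in which RandomRotation and RandomTranslation are available directly in tensorflow.keras.layers):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import RandomRotation, RandomTranslation

augmentation = Sequential([
    RandomRotation(0.05),           # rotate by up to 5% of a full turn (about 18 degrees)
    RandomTranslation(0.1, 0.1)     # shift up to 10% vertically and horizontally
])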

CNNs are applicable to a wide variety of computer-vision problems and are almost single-handedly responsible for the rapid advancements made in that field in the past decade. They play an important role in modern facial recognition systems too. Want to know more? Detecting and identifying faces in photographs is the subject of the next chapter.
