In this chapter, you will learn how to perform image classification using the indoor/outdoor dataset that was labeled by Snorkel in Chapter 3.
The techniques described in this chapter can be applied to image classification on any image dataset. The chapter provides discussion and code to help you get started quickly with the Snorkel-labeled dataset from Chapter 3.
The chapter starts with a gentle introduction to different types of visual object recognition tasks, and discussions on how image features are represented. Next, we discuss how transfer learning for image classification works. In the remainder of the chapter, we will use the indoor/outdoor dataset that has been labeled by Snorkel to fine-tune an image classification model using PyTorch.
Visual Object Recognition is commonly used to identify objects in digital images. To do visual object recognition well, various computer vision tasks are used:
Image Classification - Predict the type or class of an image (e.g. does the image consist of an indoor or outdoor scene?)
Object Localization - Identify the objects present in an image with bounding boxes
Object Detection - Identify the objects present in an image with bounding boxes, and the type or class of the object corresponding to each bounding box
Image Instance Segmentation - Identify the objects present in an image, and identify the pixels that belong to each of those objects
ImageNet is a visual database that consists of millions of images and is used by many computer vision (CV) researchers for various visual object recognition tasks. Convolutional Neural Networks (CNNs) are commonly used in visual object recognition. Over the years, researchers have continuously advanced the performance of CNNs trained on ImageNet. In the early days of training AlexNet on ImageNet, it took several days to train the CNN. With innovations in algorithms and hardware, the time taken to train a CNN has decreased significantly.
For example, one type of CNN architecture is ResNet-50. ResNet-50 is a 50-layer deep convolutional network that uses residual connections. The model is trained to classify images into 1,000 object categories. Over the years, the time taken to train ResNet-50 on ImageNet from scratch has dropped from days to hours, and then to minutes (shown in Table 5-1).
April 2017 | Sept 2017 | November 2017 | July 2018 | November 2018 | March 2019 |
---|---|---|---|---|---|
1 hour | 31 minutes | 15 minutes | 6.6 minutes | 2.0 minutes | 1.2 minutes |
Facebook, Caffe2 | UC Berkeley, TACC, UC Davis, Tensorflow | Preferred Networks, ChainerMN | Tencent, Tensorflow | Sony, Neural Network Library (NNL) | Fujitsu, MXNet |
In recent years, PyTorch and Tensorflow have been innovating at a rapid pace as powerful and flexible deep learning frameworks, empowering practitioners to easily get started with training CV and NLP deep learning models.
Model zoos provide AI practitioners with a large collection of deep learning models with pre-trained weights and code. With the availability of model zoos with pre-trained deep learning models, anyone can get started with a wide range of computer vision tasks (e.g. classification, object detection, image similarity, segmentation, and many more). This is good news for practitioners looking to leverage these state-of-the-art CNNs for their own computer vision tasks.
In this chapter, we will leverage ResNet-50 for classifying the images (indoor or outdoor scenes) that were labeled by Snorkel in Chapter 3.
In order to understand how image classification works, we need to understand how image features are represented in different layers of the CNN. One of the reasons why CNNs are able to perform well for image classification is because of the ability of the different layers of the CNN to extract the different salient features of an image (edges, textures) and group them as patterns and parts.
In the article Feature Visualization, Olah, et al showed how each layer of a convolutional neural network (e.g. GoogLeNet) built up its understanding of edges, textures, patterns, parts and uses these basic constructs to build up the understanding of objects in an image (shown in Figure 5-1).
Before the emergence of deep learning approaches for image classification, researchers working in computer vision (CV) used various visual feature extractors to extract features that were then used as inputs to classifiers. For example, Histogram of Oriented Gradients (HoG) detectors were commonly used for feature extraction. Often, these custom CV approaches do not generalize to new tasks (i.e. detectors trained for one image dataset are not easily transferable to other datasets).
Today, many commercial AI applications that leverage computer vision capabilities use transfer learning. Transfer learning takes deep learning models that have been trained on large-scale image datasets (e.g. ImageNet) and reuses these pre-trained models for image classification, without having to train the models from scratch.
There are several approaches to using transfer learning for computer vision, including these two widely used approaches:
Using the CNN as a Feature Extractor - Each of the layers of a CNN encodes different features of the image. A CNN that has been trained on a large-scale image dataset will have captured these salient details in each of its layers. This enables the CNN to be used as a feature extractor: its outputs are the features that are then used as inputs to an image classifier.
Fine-tuning the CNN - With a new image dataset, you might want to consider further fine-tuning the weights of the pre-trained CNN model using backpropagation. As one moves from the early layers of a CNN (closest to the input) to the last few layers, the early layers capture generic image features (e.g. edges, textures, patterns, parts), while the later layers are tuned for a specific image dataset. By adjusting the weights of these last few layers of a CNN, the weights can be made more relevant to the new image dataset. This process is called fine-tuning the CNN.
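The difference between the two approaches comes down to which weights are allowed to change. The following plain-Python sketch illustrates the idea only; the layer names are hypothetical placeholders and do not correspond to any real library.

```python
# Hypothetical layer names, ordered from input to output (not real torchvision names).
LAYERS = ["conv_stem", "block1", "block2", "block3", "block4", "classifier"]


def trainable_layers(strategy, num_unfrozen=2):
    """Return the layers whose weights would be updated under each strategy."""
    if strategy == "feature_extractor":
        # The pre-trained CNN stays frozen; only the new classifier is trained.
        return LAYERS[-1:]
    if strategy == "fine_tune":
        # The last few layers (including the classifier) are updated.
        return LAYERS[-num_unfrozen:]
    raise ValueError("unknown strategy: {}".format(strategy))


feature_extractor = trainable_layers("feature_extractor")
fine_tune = trainable_layers("fine_tune", num_unfrozen=3)
print(feature_extractor)
print(fine_tune)
```

How many layers to unfreeze is a design choice that typically depends on how much labeled data is available and how different the new images are from the original training data.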
Now that we have a good overview of how CNNs are used for image classification, let’s get started with building an image classifier for identifying whether an image shows an indoor or outdoor scene, using PyTorch.
In this section, we will learn how to use the pre-trained ResNet-50 models, available in PyTorch, for performing image classification.
Before we get started, let us load the Python libraries that we will use in this chapter. These include common classes like DataLoader (which defines a Python iterable over a dataset), torchvision with its utilities for loading pre-trained CNN models, and the image transformations that we will use.
We also load matplotlib for visualization, along with the ImageGrid helper class from mpl_toolkits.axes_grid1, which enables us to display images from the training and testing datasets.
import torch
from torch.autograd import Variable
from torch.utils.data import DataLoader
import torchvision
from torchvision import datasets, models, transforms
import numpy as np
import os
import time
import copy
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import ImageGrid
%matplotlib inline
With the various Python libraries loaded, we are now ready to create the DataLoader objects on the indoor/outdoor images dataset. First, we specify the directory for loading the training, validation (i.e. val) and testing images. This is specified in the data directory.
# Specify image folder
image_dir = '../data/'
$ tree -d
.
├── test
│   ├── indoor
│   └── outdoor
├── train
│   ├── indoor
│   └── outdoor
└── val
    ├── indoor
    └── outdoor
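Next, we specify the mean and standard deviation for each of the three color channels of the images. These values were computed on the ImageNet dataset. Because the pre-trained ResNet-50 model was trained on images normalized this way, we use the same values here; they are also a reasonable default for most natural-image datasets.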
# Or we can use the default from ImageNet
mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])
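To make the role of these values concrete, here is a minimal plain-Python sketch of the per-channel arithmetic that normalization performs, together with its inverse (which is needed whenever a normalized image is displayed). The function names and the sample pixel are illustrative assumptions, not part of the chapter's code.

```python
# Per-channel ImageNet statistics, as given in the text.
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]


def normalize(pixel):
    """Map an RGB pixel with values in [0, 1] to normalized values."""
    return [(value - m) / s for value, m, s in zip(pixel, mean, std)]


def denormalize(pixel):
    """Invert normalize() to recover values in [0, 1]."""
    return [value * s + m for value, m, s in zip(pixel, mean, std)]


gray = [0.5, 0.5, 0.5]  # hypothetical mid-gray pixel
normalized = normalize(gray)
restored = denormalize(normalized)
print(normalized)
print(restored)
```

Note that normalized values can be negative or greater than 1, which is why a normalized image has to be denormalized before it is shown.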
Next, we will specify the transformations that will be used for the training, validation, and testing datasets.
You will notice in the code shown below that for the training dataset, we first apply RandomResizedCrop and RandomHorizontalFlip. RandomResizedCrop crops a region of random size from each training image and then resizes it to 224 x 224. RandomHorizontalFlip randomly flips the 224 x 224 images horizontally. The image is then converted to a tensor, and the tensor values are normalized using the mean and standard deviation provided.
For the validation and testing images, we resize each image so that its shorter side is 224 pixels, and then perform a CenterCrop to obtain a 224 x 224 image. The image is then converted to a tensor, and the tensor values are normalized using the mean and standard deviation provided.
# Specify the image transformation
# for training, validation and testing datasets
image_transformations = {
    'train': transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(mean, std)
    ]),
    'val': transforms.Compose([
        transforms.Resize(224),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean, std)
    ]),
    'test': transforms.Compose([
        transforms.Resize(224),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean, std)
    ])
}
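The geometry of the validation and testing pipeline (resize, then center crop) can be sketched in plain Python. This is an approximation for illustration: the helper names are hypothetical, the 640 x 480 input is an assumed example, and the exact rounding used by torchvision may differ.

```python
def resize_shorter_side(width, height, size=224):
    """Scale (width, height) so the shorter side equals `size`, keeping the aspect ratio."""
    if width <= height:
        return size, int(size * height / width)
    return int(size * width / height), size


def center_crop_box(width, height, crop=224):
    """Return (left, top, right, bottom) of a centered crop of size `crop`."""
    left = int(round((width - crop) / 2.0))
    top = int(round((height - crop) / 2.0))
    return left, top, left + crop, top + crop


resized = resize_shorter_side(640, 480)  # hypothetical 640 x 480 photo
box = center_crop_box(*resized)
print(resized)
print(box)
```

For a landscape photo, the crop keeps the full height and discards equal margins on the left and right, so content near the edges of wide images is not seen by the model.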
With the image transformations defined, we are now ready to create the Python iterable over the training, validation, and testing images using DataLoader. In the code shown below, we iterate through each folder (i.e. train, val, and test). For each folder, we specify the relative directory path and the image transformations.
Next, we define the batch_size as 8, and create a DataLoader object. We store the references to the training, validation, and testing loaders in the dataloaders variable, so we can use them later. We also store the size of each dataset in the dataset_sizes variable.
# load the training, validation, and test data
image_datasets = {}
dataset_sizes = {}
dataloaders = {}
batch_size = 8

for data_folder in ['train', 'val', 'test']:
    dataset = datasets.ImageFolder(os.path.join(image_dir, data_folder),
                                   transform=image_transformations[data_folder])
    loader = torch.utils.data.DataLoader(dataset,
                                         batch_size=batch_size,
                                         shuffle=True,
                                         num_workers=2)

    # store the dataset/loader/sizes for reference later
    image_datasets[data_folder] = dataset
    dataloaders[data_folder] = loader
    dataset_sizes[data_folder] = len(dataset)
Let us see the number of images in each of the datasets.
dataset_sizes
For this indoor and outdoor image classification exercise, we have 1,609, 247 and 188 images for training, validation and testing respectively.
{'train': 1609, 'val': 247, 'test': 188}
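With a batch size of 8, these dataset sizes determine how many batches each loader yields per epoch. The following sketch works this out in plain Python; it assumes the loader keeps the final, smaller batch, and the helper name batch_summary is hypothetical.

```python
import math

dataset_sizes = {'train': 1609, 'val': 247, 'test': 188}  # sizes reported in the text
batch_size = 8


def batch_summary(num_images, batch_size):
    """Return (number of batches, size of the last batch) for one pass over the data."""
    num_batches = math.ceil(num_images / batch_size)
    last_batch = num_images - (num_batches - 1) * batch_size
    return num_batches, last_batch


summary = {name: batch_summary(n, batch_size) for name, n in dataset_sizes.items()}
print(summary)
```

Because the last batch can be smaller than the others, any per-epoch average computed from batch-level values should weight each batch by its size.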
Let us see the class names for the datasets. This is picked up from the name of the directories stored in the data folder.
# Get the classes
class_names = image_datasets["train"].classes
class_names
You will see that we have 2 image classes: indoor and outdoor.
['indoor', 'outdoor']
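The integer labels used during training follow from this ordering: ImageFolder sorts the class directory names and assigns indices in that order. A minimal plain-Python sketch of that mapping (the helper name is hypothetical) looks like this:

```python
def class_to_index(directory_names):
    """Sort the class directory names and assign consecutive integer labels."""
    classes = sorted(directory_names)
    return classes, {name: index for index, name in enumerate(classes)}


classes, class_to_idx = class_to_index(["outdoor", "indoor"])
print(classes)
print(class_to_idx)
```

This means label 0 corresponds to indoor and label 1 to outdoor, which is useful to remember when reading numeric labels in evaluation reports.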
Next, let us define two utility functions (visualize_images, and model_predictions ) that will be used later for displaying the training and testing images, and computing the predictions for the test dataset.
visualize_images() is a function for visualizing images in an image grid. By default, it shows 16 images from the images array. The array labels is passed to the function to show the class names for each of the images displayed. If an optional array predictions is provided, both the ground truth label and the predicted label will be shown side by side.
You will notice that we multiply the value of inp by 255, and then cast it to the uint8 data type. This converts the values from the range 0 to 1 to the range 0 to 255. It also avoids the clipping warning that Matplotlib's imshow emits when floating-point values fall outside the range 0 to 1.
def visualize_images(images, labels, predictions=None, num_images=16):
    count = 0
    mean = np.array([0.485, 0.456, 0.406])
    std = np.array([0.229, 0.224, 0.225])

    fig = plt.figure(1, figsize=(16, 16))
    grid = ImageGrid(fig, 111, nrows_ncols=(4, 4), axes_pad=0.05)

    # get the predictions for data in dataloader
    for i in range(0, len(images)):
        ax = grid[count]
        inp = images[i].numpy().transpose((1, 2, 0))
        inp = std * inp + mean
        ax.imshow((inp * 255).astype(np.uint8))

        if (predictions is None):
            info = '{}'.format(class_names[labels[i]])
        else:
            info = '{}/{}'.format(class_names[labels[i]],
                                  class_names[predictions[i]])

        ax.text(10, 20, '{}'.format(info), color='w',
                backgroundcolor='black', alpha=0.8, size=15)
        count += 1
        if count == num_images:
            return
Given a DataLoader and a model, the function model_predictions() iterates through the images, and computes the predicted label using the model provided. The predictions, ground-truth labels, and images are then returned.
# Given a dataloader, get the predictions using the model provided.
# Note: this function moves each batch to the GPU with .cuda(),
# so it requires a CUDA-capable device.
def model_predictions(dataloader, model):
    predictions = []
    images = []
    label_list = []

    # get the predictions for data in dataloader
    for i, (inputs, labels) in enumerate(dataloader):
        inputs, labels = Variable(inputs.cuda()), Variable(labels.cuda())
        outputs = model(inputs)
        _, preds = torch.max(outputs.data, 1)
        predictions.append(preds.cpu())
        label_list.append(labels.cpu())

        for j in range(inputs.size()[0]):
            images.append(inputs.cpu().data[j])

    predictions_f = list(np.concatenate(predictions).flat)
    label_f = list(np.concatenate(label_list).flat)
    return predictions_f, label_f, images
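The predicted label for each image is simply the index of the largest class score in that image's row of model outputs. A plain-Python sketch of this step is shown here; the scores are made-up values for illustration, and argmax_per_row is a hypothetical helper rather than a PyTorch function.

```python
def argmax_per_row(scores):
    """Return, for each row of class scores, the index of the largest score."""
    return [max(range(len(row)), key=lambda index: row[index]) for row in scores]


# Hypothetical model outputs for three images: [indoor score, outdoor score].
outputs = [[2.1, -0.3], [0.2, 1.7], [-1.0, -0.5]]
preds = argmax_per_row(outputs)
print(preds)
```

The scores do not need to be probabilities for this to work; only their relative order within each row matters.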
It is important to have a deep understanding of the data before you start training the deep learning model. Let us use the visualize_images() function to show the first 8 images in the training dataset (shown in Figure 5-2).
images, labels = next(iter(dataloaders['train']))
visualize_images(images, labels)
Many different kinds of pre-trained deep learning model architectures can be used for image classification. PyTorch (and similarly TensorFlow) provides a rich set of model architectures that you can use.
For example, in torchvision.models, you will see that PyTorch provides model definitions for AlexNet, VGG, ResNet, SqueezeNet, DenseNet, Inception V3, GoogLeNet, ShuffleNet v2, MobileNet v2, ResNeXt, Wide ResNet, MNASNet, and more. Pre-trained models are available by setting pretrained=True when loading the models.
For this exercise, we will use ResNet-50. We will load the pre-trained ResNet-50 model, and fine-tune it for the indoor/outdoor image classification task.
Residual Networks (or ResNets) were first introduced in the paper Deep Residual Learning for Image Recognition by Kaiming He et al.
In the paper, the authors showed how residual networks can be easily optimized, and explored different network depths (up to 1,000 layers).
In 2015, ResNet-based networks won first place in the ILSVRC classification task.
First, we load the ResNet-50 model.
# Load Resnet50 model
model = models.resnet50(pretrained=True)
ResNet-50 is trained on ImageNet with 1,000 classes. We need to replace the final fully connected (FC) layer with one that outputs 2 classes.
# Specify a final layer with 2 classes - indoor and outdoor
num_classes = 2
num_features = model.fc.in_features
model.fc = torch.nn.Linear(num_features, num_classes)
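To get a sense of how small the new FC layer is compared with the original one, the following sketch counts the weights and biases of a linear layer in plain Python. It assumes 2,048 input features, which is the value ResNet-50 reports for its FC layer; linear_params is a hypothetical helper.

```python
def linear_params(in_features, out_features):
    """Number of parameters in a linear layer: one weight per input/output pair, plus biases."""
    return in_features * out_features + out_features


original_head = linear_params(2048, 1000)  # ImageNet head with 1,000 classes
new_head = linear_params(2048, 2)          # indoor/outdoor head
print(original_head, new_head)
```

The new layer starts from random weights, so it is the part of the model that has to be learned from the indoor/outdoor images.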
Now that we have modified the FC layer, let us define the criterion, optimizer, and scheduler. In the code below, you will see that we specify the loss function as Cross-Entropy Loss. The PyTorch CrossEntropyLoss() criterion combines both nn.LogSoftmax() and nn.NLLLoss() together, and is commonly used for image classification problems with N classes.
For the optimizer, we use torch.optim.SGD(). Stochastic Gradient Descent (SGD) is commonly used to train CNNs over mini-batches of data. Note that we pass only model.fc.parameters() to the optimizer, so only the weights of the new FC layer are updated during training.
For the scheduler, we use lr_scheduler.StepLR(), which multiplies the learning rate by gamma every step_size epochs. In our example, we use the default gamma value of 0.1 and specify a step_size of 8.
import torch.optim as optim
from torch.optim import lr_scheduler

# Use CrossEntropyLoss as a loss function
loss_function = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.fc.parameters(), lr=0.001, momentum=0.8)
scheduler = lr_scheduler.StepLR(optimizer, step_size=8)
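To illustrate what the cross-entropy criterion computes for a single image, here is a plain-Python sketch that applies a log-softmax to the class scores and then takes the negative log-likelihood of the target class. It is a simplified illustration, not the PyTorch implementation, and the example scores are hypothetical.

```python
import math


def log_softmax(logits):
    """Log of the softmax probabilities, computed in a numerically stable way."""
    peak = max(logits)  # subtract the maximum before exponentiating
    log_total = peak + math.log(sum(math.exp(value - peak) for value in logits))
    return [value - log_total for value in logits]


def cross_entropy(logits, target):
    """Negative log-likelihood of the target class after log-softmax."""
    return -log_softmax(logits)[target]


loss_uncertain = cross_entropy([0.0, 0.0], target=1)   # equal scores for both classes
loss_confident = cross_entropy([-2.0, 3.0], target=1)  # hypothetical confident, correct scores
print(loss_uncertain, loss_confident)
```

With two classes and equal scores, the loss is ln(2), roughly 0.693, which is a useful reference point when reading training logs: a loss near that value means the model is doing no better than guessing.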
We are ready to start fine-tuning the ResNet-50 model. Let us move the model to the GPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.to(device)
When the model is moved to the GPU, the notebook displays the structure of the network. You will notice that the structure of the network reflects what is shown in Figure 5-3. In this chapter, we show a snippet of the network, and you can see the full network when you execute the code.
In the PyTorch implementation of ResNet, you will see that ResNet-50 consists of multiple Bottleneck blocks, each containing three convolutions with kernel sizes of (1,1), (3,3), and (1,1). As noted in the TorchVision ResNet implementation, the Bottleneck blocks used in TorchVision put the stride for downsampling at the 3x3 convolution.
In the last few layers of the ResNet-50 architecture, you will see the use of 2D adaptive average pooling, followed by an FC layer that outputs 2 features (corresponding to the indoor and outdoor classes).
Now that the model has been pushed to the GPU, we are ready to fine-tune the model with the training images of indoor and outdoor scenes.
ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): Bottleneck(
      (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (downsample): Sequential(
        (0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
  )
  ... More layers ....
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
  (fc): Linear(in_features=2048, out_features=2, bias=True)
)
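The printed shapes make it possible to work out how many weights the convolutions in the first Bottleneck block hold. The sketch below does this arithmetic in plain Python, using the channel counts and kernel sizes read from the output above; it ignores the batch normalization parameters, and conv_params is a hypothetical helper.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Number of weights in a 2D convolution with a square kernel and no bias."""
    return in_channels * out_channels * kernel_size * kernel_size


# Shapes read from the printed (layer1) (0) Bottleneck block.
bottleneck = {
    "conv1": conv_params(64, 64, 1),
    "conv2": conv_params(64, 64, 3),
    "conv3": conv_params(64, 256, 1),
    "downsample": conv_params(64, 256, 1),
}
total = sum(bottleneck.values())
print(bottleneck, total)
```

The 1x1 convolutions on either side of the 3x3 convolution keep the expensive 3x3 operation at a small channel count, which is the motivation behind the bottleneck design.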
The function train() (shown later in this section) is adapted from the example provided in the PyTorch tutorial Finetuning Torchvision Models.
We first perform a deep copy of all the pre-trained model weights found in model.state_dict() to the variable best_model_weights. We initialize the best accuracy of the model to 0.0.
The code then iterates through multiple epochs. In each epoch, we load each batch of inputs and labels and move it to the device. We reset the optimizer gradients before calling model(inputs) to perform the forward pass. We then compute the cross-entropy loss. Once these steps are completed, we call loss.backward() for the backward pass, and then use optimizer.step() to update all the relevant parameters.
In the train() function, you will notice that we turn off the gradient calculation during the validation phase. Gradients are not required during validation, because we only use the validation inputs to compute the loss and accuracy.
We check whether the validation accuracy for the current epoch has improved over the previous epochs. If the validation accuracy has improved, we store the weights in best_model_weights and set best_accuracy to the best validation accuracy observed so far.
def train(model, criterion, optimizer, scheduler, num_epochs=10):
    # use to store the training and validation loss
    training_loss = []
    val_loss = []
    best_model_weights = copy.deepcopy(model.state_dict())
    best_accuracy = 0.0

    # Note the start time of the training
    start_time = time.time()

    for epoch in range(num_epochs):
        print('Epoch {}/{}, '.format(epoch + 1, num_epochs), end=' ')

        # iterate through training and validation phase
        for phase in ['train', 'val']:
            total_loss = 0.0
            total_corrects = 0

            if phase == 'train':
                model.train()
                print("[Training] ", end=' ')
            elif phase == 'val':
                model.eval()
                print("[Validation] ", end=' ')
            else:
                print("Not supported phase")

            for inputs, labels in dataloaders[phase]:
                inputs = inputs.to(device)
                labels = labels.to(device)

                # Reset the optimizer gradients
                optimizer.zero_grad()

                if phase == 'train':
                    with torch.set_grad_enabled(True):
                        outputs = model(inputs)
                        _, preds = torch.max(outputs, 1)
                        loss = criterion(outputs, labels)
                        loss.backward()
                        optimizer.step()
                else:
                    with torch.set_grad_enabled(False):
                        outputs = model(inputs)
                        _, preds = torch.max(outputs, 1)
                        loss = criterion(outputs, labels)

                # loss.item() is the mean loss of the batch,
                # so weight it by the batch size
                total_loss += loss.item() * inputs.size(0)
                total_corrects += torch.sum(preds == labels.data)

            # compute loss and accuracy
            epoch_loss = total_loss / dataset_sizes[phase]
            epoch_accuracy = (total_corrects + 0.0) / dataset_sizes[phase]

            if phase == 'train':
                scheduler.step()
                training_loss.append(epoch_loss)
            else:
                val_loss.append(epoch_loss)

            if phase == 'val' and epoch_accuracy > best_accuracy:
                best_accuracy = epoch_accuracy
                best_model_weights = copy.deepcopy(model.state_dict())

            print('Loss: {:.3f} Accuracy: {:.3f}, '.format(epoch_loss, epoch_accuracy), end=' ')
        print()

    # Elapsed time
    time_elapsed = time.time() - start_time
    print('Train/Validation Duration: %s'
          % time.strftime("%H:%M:%S", time.gmtime(time_elapsed)))
    print('Best Validation Accuracy: {:3f}'.format(best_accuracy))

    # Load the best weights to the model
    model.load_state_dict(best_model_weights)
    return model, training_loss, val_loss
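One detail of train() that is easy to miss is why each batch loss is multiplied by the batch size before being summed. The criterion returns the mean loss of a batch, and the last batch of an epoch can be smaller than the rest, so a size-weighted average gives the correct per-image loss. The following plain-Python sketch shows the difference; the batch losses are hypothetical values.

```python
def epoch_loss(batch_losses, batch_sizes):
    """Per-image loss for an epoch: batch means weighted by batch size."""
    total = sum(loss * size for loss, size in zip(batch_losses, batch_sizes))
    return total / sum(batch_sizes)


# Hypothetical mean losses for three batches; the last batch holds a single image.
losses = [0.50, 0.30, 0.90]
sizes = [8, 8, 1]

weighted = epoch_loss(losses, sizes)
unweighted = sum(losses) / len(losses)
print(weighted, unweighted)
```

In this example, the naive average over batches overstates the loss because the single-image batch counts as much as a full batch of 8.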
Let us fine-tune the ResNet-50 model (with pre-trained weights) for 25 epochs.
best_model, train_loss, val_loss = train(model,
                                         loss_function,
                                         optimizer,
                                         scheduler,
                                         num_epochs=25)
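Over these 25 epochs, the step scheduler lowers the learning rate several times. The sketch below reproduces the schedule in plain Python, assuming an initial learning rate of 0.001, a step_size of 8, and a gamma of 0.1, and using the closed-form rule of multiplying by gamma once per completed step; step_lr is a hypothetical helper, not the PyTorch class.

```python
def step_lr(initial_lr, epoch, step_size=8, gamma=0.1):
    """Learning rate used during a given (0-indexed) epoch under a step schedule."""
    return initial_lr * gamma ** (epoch // step_size)


schedule = [step_lr(0.001, epoch) for epoch in range(25)]
for epoch in (0, 7, 8, 16, 24):
    print("epoch {:2d}: lr = {:.0e}".format(epoch + 1, schedule[epoch]))
```

Under these assumptions, the final epoch runs at a learning rate a thousand times smaller than the first, so most of the learning happens in the early epochs.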
From the output, you will see the training and validation loss over multiple epochs.
Epoch 1/25, [Training] Loss: 0.425 Accuracy: 0.796, [Validation] Loss: 0.258 Accuracy: 0.895,
Epoch 2/25, [Training] Loss: 0.377 Accuracy: 0.842, [Validation] Loss: 0.310 Accuracy: 0.891,
Epoch 3/25, [Training] Loss: 0.377 Accuracy: 0.837, [Validation] Loss: 0.225 Accuracy: 0.927,
Epoch 4/25, [Training] Loss: 0.357 Accuracy: 0.850, [Validation] Loss: 0.225 Accuracy: 0.931,
Epoch 5/25, [Training] Loss: 0.331 Accuracy: 0.861, [Validation] Loss: 0.228 Accuracy: 0.927,
...
Epoch 24/25, [Training] Loss: 0.302 Accuracy: 0.871, [Validation] Loss: 0.250 Accuracy: 0.907,
Epoch 25/25, [Training] Loss: 0.280 Accuracy: 0.886, [Validation] Loss: 0.213 Accuracy: 0.927,
Train/Validation Duration: 00:20:10
Best Validation Accuracy: 0.935223
After the model is trained, you will want to make sure that the model is not overfitting.
During the training of the model, we store the training and validation loss. This is returned by the train() function, and stored in the arrays train_loss and val_loss. Let’s use this to plot the training and validation loss (shown in Figure 5-4), using the following code. From Figure 5-4, you will observe that the validation loss is consistently lower than the training loss. This is a good indication that the model has not overfitted the data.
# Visualize training and validation loss
num_epochs = 25
plt.figure(figsize=(9, 5))
plt.title("Training vs Validation Loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.plot(range(1, num_epochs + 1), train_loss, label="Training Loss", linewidth=3.5)
plt.plot(range(1, num_epochs + 1), val_loss, label="Validation Loss", linewidth=3.5)
plt.ylim((0, 1.))
plt.xticks(np.arange(1, num_epochs + 1, 1.0))
plt.legend()
plt.show()
There is plenty of room for further improvement to the model. For example, you can explore hyperparameter sweeps over the learning rate, momentum, and more.
Now that we have identified the best model, let us use it to predict the class for the images in the test dataset. To do this, we use the utility function that we have defined earlier, model_predictions(). We provide as inputs the dataloader for the test dataset, and the best model (i.e. best_model).
# Use the model for prediction using the test dataset
predictions, labels, images = model_predictions(dataloaders["test"], best_model)
Let us look at the classification report for the model, using the test dataset.
# print out the classification report
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score

print(classification_report(labels, predictions))
print('ROC_AUC: %.4f' % (roc_auc_score(labels, predictions)))
The results are shown.
              precision    recall  f1-score   support

           0       0.74      0.82      0.78        61
           1       0.91      0.86      0.88       127

    accuracy                           0.85       188
   macro avg       0.82      0.84      0.83       188
weighted avg       0.85      0.85      0.85       188

ROC_AUC: 0.8390
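To see how the numbers in this report relate to each other, the following plain-Python sketch recomputes them from a confusion matrix. The counts are an assumption: they are not given in the text, and were chosen only because they reproduce the rounded values in the report. Class 0 is taken to be indoor and class 1 outdoor, following the order of the class names.

```python
# Hypothetical confusion counts, chosen to be consistent with the rounded report.
# Rows are true classes (0 = indoor, 1 = outdoor); columns are predicted classes.
confusion = [[50, 11],
             [18, 109]]


def precision_recall_f1(confusion, cls):
    """Precision, recall, and F1 score for one class of a confusion matrix."""
    true_positives = confusion[cls][cls]
    predicted = sum(row[cls] for row in confusion)  # column total
    actual = sum(confusion[cls])                    # row total (support)
    precision = true_positives / predicted
    recall = true_positives / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


indoor = precision_recall_f1(confusion, 0)
outdoor = precision_recall_f1(confusion, 1)
accuracy = (confusion[0][0] + confusion[1][1]) / sum(map(sum, confusion))
balanced_accuracy = (indoor[1] + outdoor[1]) / 2
print(indoor, outdoor, accuracy, balanced_accuracy)
```

When the ROC AUC is computed from hard 0/1 predictions, as it is here, it equals the average of the two per-class recalls. Passing predicted probabilities or scores instead would give an AUC that reflects the ranking of the predictions across all thresholds.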
Let us visualize the ground-truth and predicted labels for each of the test images. To do this, we use the utility function visualize_images(), and pass as inputs: test images, labels, and predicted labels.
# Show the label, and prediction
visualize_images
(
images
,
labels
,
predictions
)
The output of visualize_images() is shown in Figure 5-5.
From Figure 5-5, you will see that the fine-tuned model is performing relatively well using the weak labels that have been produced by Snorkel in Chapter 3.
One of the images (shown on the second row, first image) is incorrectly classified. You will observe that this is a problem with the ground-truth label, and not due to the image classifier that we have just trained. Snorkel has incorrectly labeled it as an outdoor image. In addition, you will notice that it is hard to tell whether the image is an indoor or outdoor image. Hence, the confusion.
In this chapter, you learned how to leverage the weakly labeled dataset generated using Snorkel to train a deep convolutional neural network for image classification in PyTorch. We also used concepts from transfer learning to incorporate powerful pre-trained computer vision models into our model training approach.
While the indoor-outdoor application discussed here is relatively simple, Snorkel has been used to power a broad set of real-world applications in computer vision ranging from medical image interpretation to scene graph prediction. The same principles outlined in this chapter on using Snorkel to build a weakly supervised dataset for a new modality can be extended to other domains like volumetric imaging (e.g. computed tomography), time series, and video.
It is also common to have signals from multiple modalities at once. This cross-modal setting described by Dunnmon et al. represents a particularly powerful way to combine the material from Chapters 4 and 5. In brief, it is common to have image data that is accompanied by free text (clinical report, article, caption, etc.) describing that image. In this setting, one can write labeling functions over the text, and ultimately use the generated labels to train a neural network over the associated image, which can often be easier than writing labeling functions over the image directly.
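As a rough illustration of the cross-modal idea, the sketch below writes two keyword-based labeling functions over captions and combines their votes to label the associated images. Everything here is hypothetical: the captions, file names, and keywords are invented, and a simple majority vote stands in for the label model that Snorkel would normally use to combine labeling functions.

```python
INDOOR, OUTDOOR, ABSTAIN = 0, 1, -1


def lf_indoor_words(caption):
    """Vote INDOOR if the caption mentions a typical indoor place."""
    words = caption.lower().split()
    return INDOOR if any(w in words for w in ("kitchen", "bedroom", "office")) else ABSTAIN


def lf_outdoor_words(caption):
    """Vote OUTDOOR if the caption mentions a typical outdoor place."""
    words = caption.lower().split()
    return OUTDOOR if any(w in words for w in ("beach", "mountain", "street")) else ABSTAIN


def majority_label(caption, labeling_functions):
    """Combine the votes; abstain when there are no votes or a tie."""
    votes = [lf(caption) for lf in labeling_functions]
    indoor_votes = votes.count(INDOOR)
    outdoor_votes = votes.count(OUTDOOR)
    if indoor_votes == outdoor_votes:
        return ABSTAIN
    return INDOOR if indoor_votes > outdoor_votes else OUTDOOR


# Hypothetical image file names paired with free-text captions.
captions = {
    "img_001.jpg": "A family cooking in the kitchen",
    "img_002.jpg": "Surfers on the beach at sunset",
    "img_003.jpg": "A close-up of a red chair",
}
lfs = [lf_indoor_words, lf_outdoor_words]
image_labels = {name: majority_label(text, lfs) for name, text in captions.items()}
print(image_labels)
```

Images whose captions trigger no labeling function are left unlabeled, and the remaining labels would then be used to train an image model that never sees the text.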
There exist a wide variety of real-world situations in which this cross-modal weak supervision approach can be useful because we have multiple modalities associated with any given datapoint, and the modality we wish to operate on at test time is harder to write labeling functions over than another we may have available at train time. We encourage the reader to consider a cross-modal approach when planning how to approach building models for a particular application.
Transfer learning for both computer vision (discussed in this chapter) and Natural Language Processing (NLP) (discussed in Chapter 4) has enabled data scientists and researchers to effectively transfer the knowledge learned from large-scale datasets and adapt it to new domains. The availability of pre-trained models for both computer vision and NLP has driven rapid innovation in the machine learning and deep learning community, and we encourage the reader to consider how transfer learning can be combined with weak supervision wherever possible.
Going forward, we expect that the powerful combination of Snorkel and transfer learning will create a flywheel that drives AI innovation and success in both commercial and academic settings.
Khaled Saab, Jared Dunnmon, Roger Goldman, Hersh Sagreiya, Alexander Ratner, Christopher Ré, and Daniel Rubin. Doubly Weak Supervision of Deep Learning Models for Head CT. MICCAI, 2019.
Jason Fries, Paroma Varma, Vincent Chen, Ke Xiao, Heliodoro Tejeda, Saha Priyanka, Jared Dunnmon, Henry Chubb, Shiraz Maskatia, Madalina Fiterau, Scott Delp, Euan Ashley, Christopher Ré, and James Priest. Weakly supervised classification of aortic valve malformations using unlabeled cardiac MRI sequences. Nature Communications, 2019.
Vincent Chen, Paroma Varma, Ranjay Krishna, Michael Bernstein, Christopher Ré, and Fei-Fei Li. Scene Graph Prediction with Limited Labels. International Conference on Computer Vision (ICCV), 2019.
Frederic Sala, Paroma Varma, Jason Fries, Daniel Y. Fu, Shiori Sagawa, Saelig Khattar, Ashwini Ramamoorthy, Ke Xiao, Kayvon Fatahalian, James Priest, and Christopher Ré. Multi-Resolution Weak Supervision for Sequential Data. NeurIPS, 2019.
Jared Dunnmon, Alexander Ratner, Nishith Khandwala, Khaled Saab, Matthew Markert, Hersh Sagreiya, Roger Goldman, Christopher Lee-Messer, Matthew Lungren, Daniel Rubin, and Christopher Ré. Cross-Modal Data Programming Enables Rapid Medical Machine Learning. Patterns, 2020.
Sasank Chilamkurthy. Transfer Learning for Computer Vision Tutorial. 2017.
Sebastian Ruder. Transfer Learning - Machine Learning’s Next Frontier. 2017.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. 2015.
Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature Visualization. 2017.
Nathan Inkawhich. Finetuning Torchvision Models. 2017.