Chapter 5. Using the Snorkel-labeled Dataset for Image Classification

In this chapter, you will learn how to perform image classification, using the indoor/outdoor dataset that has been labeled by Snorkel in Chapter 3.

The techniques described in this chapter can be used for image classification on any image dataset. The chapter provides a holistic set of discussions and code to help you get started quickly with the dataset that was labeled by Snorkel in Chapter 3.

The chapter starts with a gentle introduction to different types of visual object recognition tasks, and discussions on how image features are represented. Next, we discuss how transfer learning for image classification works. In the remainder of the chapter, we will use the indoor/outdoor dataset that has been labeled by Snorkel to fine-tune an image classification model using PyTorch.

Visual Object Recognition Overview

Visual Object Recognition is commonly used to identify objects in digital images. To do visual object recognition well, various computer vision tasks are used:

  • Image Classification - Predict the type/class of an image (e.g. does the image consist of an indoor or outdoor scene?)

  • Object Localization - Identify the objects present in an image with bounding boxes

  • Object Detection - Identify the objects present in an image with bounding boxes, and the type or class of the object corresponding to each bounding box

  • Image Instance Segmentation - Identify the objects present in an image, and identify the pixels that belong to each of those objects.
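The four tasks differ mainly in what they return for an image. The sketch below makes that concrete with a few small containers. The class names are hypothetical and do not come from any library; they only illustrate the shape of each output.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical containers, for illustration only.
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class ClassificationResult:
    label: str                     # one label for the whole image


@dataclass
class LocalizationResult:
    boxes: List[Box]               # where the objects are, without class names


@dataclass
class DetectionResult:
    boxes: List[Box]
    labels: List[str]              # one class label per bounding box


@dataclass
class InstanceSegmentationResult:
    masks: List[List[List[bool]]]  # one H x W boolean mask per object


classification = ClassificationResult(label="outdoor")
detection = DetectionResult(boxes=[(10, 20, 110, 220)], labels=["tree"])
print(classification, detection)
```

This chapter only needs the first kind of output: a single indoor or outdoor label per image.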

ImageNet is a visual database that consists of millions of images and is used by many computer vision (CV) researchers for various visual object recognition tasks. Convolutional Neural Networks (CNNs) are commonly used in visual object recognition. Over the years, researchers have continuously advanced the performance of CNNs trained on ImageNet. In the early days, training AlexNet on ImageNet took several days. With innovations in algorithms and hardware, the time taken to train a CNN has decreased significantly.

For example, one type of CNN architecture is ResNet-50, a 50-layer-deep convolutional network built from residual blocks. The model is trained to classify images into 1,000 object categories. Over the years, the time taken to train ResNet-50 on ImageNet from scratch has dropped from days to hours, and then to minutes (shown in Table 5-1).

Table 5-1. Training ResNet-50 on ImageNet - How long does it take?

Date            Training time   Organization, framework
April 2017      1 hour          Facebook, Caffe2
Sept 2017       31 minutes      UC Berkeley, TACC, UC Davis, TensorFlow
November 2017   15 minutes      Preferred Networks, ChainerMN
July 2018       6.6 minutes     Tencent, TensorFlow
November 2018   2.0 minutes     Sony, Neural Network Library (NNL)
March 2019      1.2 minutes     Fujitsu, MXNet

In recent years, PyTorch and TensorFlow have been innovating at a rapid pace as powerful and flexible deep learning frameworks, empowering practitioners to easily get started with training CV and NLP deep learning models.

Model zoos provide AI practitioners with a large collection of deep learning models with pre-trained weights and code. With the availability of model zoos with pre-trained deep learning models, anyone can get started with a wide range of computer vision tasks (e.g. classification, object detection, image similarity, segmentation, and many more). This is great news for practitioners looking to leverage some of these state-of-the-art CNNs for various computer vision tasks.

In this chapter, we will leverage ResNet-50 for classifying the images (indoor or outdoor scenes) that were labeled by Snorkel in Chapter 3.

Representing Image Features

In order to understand how image classification works, we need to understand how image features are represented in different layers of the CNN. One of the reasons why CNNs are able to perform well for image classification is because of the ability of the different layers of the CNN to extract the different salient features of an image (edges, textures) and group them as patterns and parts.

In the article Feature Visualization, Olah et al. showed how each layer of a convolutional neural network (e.g. GoogLeNet) builds up its understanding of edges, textures, patterns, and parts, and uses these basic constructs to build up an understanding of the objects in an image (shown in Figure 5-1).

Figure 5-1. Feature Visualization of different layers of GoogLeNet trained on ImageNet (Credits: Olah, et al., “Feature Visualization”, Distill, 2017.).
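To give a feel for what an edge feature is, the following sketch slides a small filter over a toy grayscale image. It is a minimal illustration under stated assumptions: the Sobel filter is hand-written and merely stands in for the kind of filter a CNN's first layer tends to learn, and the 5 x 6 image is made up.

```python
# A hand-written vertical-edge filter (Sobel). Real CNN filters are learned
# from data; this one is fixed and only illustrates the idea.
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]


def convolve2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1).

    Like CNN layers, this computes a cross-correlation: the kernel is
    not flipped before it is applied.
    """
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0
            for i in range(k):
                for j in range(k):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Toy 5 x 6 grayscale image: dark on the left (0), bright on the right (1)
image = [[0, 0, 0, 1, 1, 1] for _ in range(5)]
response = convolve2d_valid(image, SOBEL_X)

for row in response:
    print(row)
```

The response is zero over the flat regions and non-zero only where the window straddles the dark-to-bright boundary. Deeper layers combine many such responses into textures, patterns, and parts.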

Transfer Learning for Computer Vision

Before the emergence of deep learning approaches for image classification, researchers working in computer vision (CV) used various hand-crafted visual feature extractors to produce the features that served as inputs to classifiers. For example, Histogram of Oriented Gradients (HoG) detectors are commonly used for feature extraction. Often, these custom CV approaches do not generalize to new tasks (i.e. detectors tuned for one image dataset are not easily transferable to other datasets).

Today, many commercial AI applications that leverage computer vision capabilities use transfer learning. Transfer learning takes deep learning models that have been trained on large-scale image datasets (e.g. ImageNet) and reuses the pre-trained models for image classification, without having to train the models from scratch.

There are several approaches to using transfer learning for computer vision, including these two widely used approaches:

  • Using the CNN as a Feature Extractor - Each of the layers of a CNN encodes different features of the image. A CNN that has been trained on a large-scale image dataset will have captured these salient details in each of its layers. This enables the CNN to be used as a feature extractor, producing features that can be used as inputs to an image classifier.

  • Fine-tuning the CNN - With a new image dataset, you might want to further fine-tune the weights of the pre-trained CNN model using backpropagation. As one moves from the first layers of a CNN to the last few layers, the early layers tend to capture generic image features (e.g. edges, textures, patterns, parts), while the later layers are tuned to the specific image dataset used for training. By adjusting the weights of these last few layers, the weights can be made more relevant to the new image dataset. This process is called fine-tuning the CNN.
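In PyTorch, the practical difference between the two approaches comes down to which parameters are allowed to receive gradient updates. The sketch below illustrates this with a tiny, randomly initialized stand-in for a pre-trained backbone (an assumption made to keep the example small; in this chapter that role is played by ResNet-50).

```python
import torch
from torch import nn

# Hypothetical stand-in for a pre-trained backbone (randomly initialized here).
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d((1, 1)),
    nn.Flatten(),
)
head = nn.Linear(8, 2)  # new classifier for two classes


def count_trainable(module):
    """Number of parameters that will receive gradient updates."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


# Approach 1: feature extractor. Freeze the backbone, train only the head.
for param in backbone.parameters():
    param.requires_grad = False
feature_extractor_params = count_trainable(backbone) + count_trainable(head)

# Approach 2: fine-tuning. Unfreeze backbone layers (all of them here; often
# only the last few) so that their weights are updated as well.
for param in backbone.parameters():
    param.requires_grad = True
fine_tuning_params = count_trainable(backbone) + count_trainable(head)

# The forward pass is identical in both cases.
logits = head(backbone(torch.randn(4, 3, 32, 32)))
print(feature_extractor_params, fine_tuning_params, tuple(logits.shape))
```

Freezing keeps the trainable parameter count small, which makes training faster and less prone to overfitting on small datasets; unfreezing gives the model more room to adapt.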

Now that we have a good overview of how CNNs are used for image classification, let’s get started with building an image classifier for identifying whether an image shows an indoor or outdoor scene, using PyTorch.

Using PyTorch for Image Classification

In this section, we will learn how to use the pre-trained ResNet-50 model, available in PyTorch, for performing image classification.

Before we get started, let us load the relevant Python libraries that we will use in this chapter. These include common classes like DataLoader (that defines a Python iterable for datasets), torchvision and common utilities to load the pre-trained CNN models, and image transformations that can be used.

We load matplotlib for visualization. In addition, we use helper classes from mpl_toolkits.axes_grid1 that will enable us to display images from the training and testing datasets.

import torch
from torch.autograd import Variable
from torch.utils.data import DataLoader

import torchvision
from torchvision import datasets, models, transforms

import numpy as np
import os
import time
import copy

import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import ImageGrid

%matplotlib inline

Loading the Indoor/Outdoor dataset

With the various Python libraries loaded, we are now ready to create the DataLoader objects on the indoor/outdoor images dataset. First, we specify the directory that holds the training, validation (i.e. val), and testing images. Each split has its own subfolder under the data directory, with one folder per class.

# Specify image folder
image_dir = '../data/'

$ tree -d
.
├── test
│   ├── indoor
│   └── outdoor
├── train
│   ├── indoor
│   └── outdoor
└── val
    ├── indoor
    └── outdoor

Next, we specify the mean and standard deviation for each of the three channels for the images. These are defaults used for the ImageNet dataset and are generally applicable for most image datasets.

# Use the default per-channel mean and standard deviation from ImageNet
mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])
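To see what these numbers do, recall that normalization subtracts the channel mean and divides by the channel standard deviation, and that the operation can be inverted for display. The following framework-free sketch works on a single RGB pixel; the helper names are made up for illustration.

```python
MEAN = [0.485, 0.456, 0.406]  # ImageNet per-channel means (R, G, B)
STD = [0.229, 0.224, 0.225]   # ImageNet per-channel standard deviations


def normalize(pixel):
    """Map an RGB pixel with values in [0, 1] to normalized values."""
    return [(value - m) / s for value, m, s in zip(pixel, MEAN, STD)]


def denormalize(pixel):
    """Invert normalize(): value = normalized * std + mean."""
    return [value * s + m for value, m, s in zip(pixel, MEAN, STD)]


pixel = [0.5, 0.5, 0.5]        # a mid-gray pixel with values in [0, 1]
normalized = normalize(pixel)
restored = denormalize(normalized)
print(normalized, restored)
```

The inverse form, std * value + mean, is the same expression that visualize_images() uses later in this chapter before displaying an image.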

Next, we will specify the transformations that will be used for the training, validation, and testing datasets.

You will notice in the code shown below that for the training dataset, we first apply RandomResizedCrop and RandomHorizontalFlip. RandomResizedCrop crops a random region of each training image (random in size and aspect ratio) and then resizes it to 224 x 224. RandomHorizontalFlip randomly flips the 224 x 224 image horizontally. The image is then converted to a tensor, and the tensor values are normalized using the mean and standard deviation provided.

For the validation and testing images, we resize each image so that its shorter side is 224 pixels (preserving the aspect ratio), and then perform a CenterCrop to obtain a 224 x 224 image. The image is then converted to a tensor, and the tensor values are normalized using the mean and standard deviation provided.

# Specify the image transformation
# for training, validation and testing datasets

image_transformations = {
    'train': transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(mean, std)
    ]),
    'val': transforms.Compose([
        transforms.Resize(224),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean, std)
    ]),
    'test': transforms.Compose([
        transforms.Resize(224),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean, std)
    ])
}
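The geometry of the validation and testing pipeline can be worked out by hand. The sketch below approximates what Resize(224) followed by CenterCrop(224) does to the image dimensions; it is plain arithmetic rather than torchvision code, the 640 x 480 input size is hypothetical, and the library's exact rounding may differ by a pixel.

```python
def resize_shorter_side(width, height, size=224):
    """Scale so the shorter side equals `size`, keeping the aspect ratio."""
    if width <= height:
        return size, int(size * height / width)
    return int(size * width / height), size


def center_crop_box(width, height, size=224):
    """Return (left, top, right, bottom) of a centered size x size crop."""
    left = int(round((width - size) / 2.0))
    top = int(round((height - size) / 2.0))
    return left, top, left + size, top + size


# Hypothetical 640 x 480 input photo
resized = resize_shorter_side(640, 480)
crop = center_crop_box(*resized)
print(resized, crop)
```

For this input, the resize yields a 298 x 224 image, and the center crop discards 37 columns on each side. Unlike the random crop used for training, this is deterministic, so validation and test results are repeatable.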

With the image transformations defined, we are now ready to create the Python iterable over the training, validation, and testing images using DataLoader. In the code shown below, we iterate through each folder (i.e. train, val, and test). For each of the folders, we specify the relative directory path and the image transformations.

Next, we define the batch_size as 8, and create a DataLoader object. We store the references to the training, validation, and testing loaders in the dataloaders variable, so we can use them later. We also store the size of each dataset in the dataset_sizes variable.

# load the training,  validation, and test data
image_datasets = {}
dataset_sizes  = {}
dataloaders    = {}
batch_size     = 8

for data_folder in ['train', 'val', 'test']:
    dataset = datasets.ImageFolder(
        os.path.join(image_dir, data_folder),
        transform=image_transformations[data_folder])

    loader = torch.utils.data.DataLoader(dataset,
                                         batch_size=batch_size,
                                         shuffle=True,
                                         num_workers=2)

    # store the dataset/loader/sizes for reference later
    image_datasets[data_folder] = dataset
    dataloaders[data_folder] = loader
    dataset_sizes[data_folder] = len(dataset)

Let us see the number of images in each of the datasets.

dataset_sizes

For this indoor and outdoor image classification exercise, we have 1,609, 247 and 188 images for training, validation and testing respectively.

{'train': 1609, 'val': 247, 'test': 188}
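With a batch size of 8, these dataset sizes determine how many batches each DataLoader yields per epoch. The short calculation below assumes the DataLoader default of keeping the final, smaller batch (that is, drop_last is left at False).

```python
import math

dataset_sizes = {'train': 1609, 'val': 247, 'test': 188}
batch_size = 8

batches = {}
for name, size in dataset_sizes.items():
    num_batches = math.ceil(size / batch_size)
    last_batch = size - (num_batches - 1) * batch_size
    batches[name] = (num_batches, last_batch)
    print(name, num_batches, last_batch)
```

Note that the final training batch contains a single image. This is one reason the training loop later weights each batch loss by the batch size when it computes the epoch loss.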

Let us see the class names for the datasets. These are picked up from the names of the directories stored in the data folder.

# Get the classes
class_names = image_datasets["train"].classes
class_names

You will see that we have 2 image classes: indoor and outdoor.

['indoor', 'outdoor']
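The order of the class names matters, because the position of each name is the integer label the model learns to predict. ImageFolder sorts the class folder names and numbers them from 0, and exposes the mapping as class_to_idx. The sketch below mimics that behavior with the standard library only; it is not ImageFolder's own code, and it uses a temporary directory with empty class folders.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Recreate the train/ layout with empty class folders
    for class_name in ['outdoor', 'indoor']:
        os.makedirs(os.path.join(root, 'train', class_name))

    train_dir = os.path.join(root, 'train')
    # Sorted subdirectory names become the classes
    classes = sorted(
        entry for entry in os.listdir(train_dir)
        if os.path.isdir(os.path.join(train_dir, entry))
    )
    class_to_idx = {name: index for index, name in enumerate(classes)}

print(classes, class_to_idx)
```

So label 0 means indoor and label 1 means outdoor, which is how the numeric rows of the classification report later in this chapter should be read.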

Utility Functions

Next, let us define two utility functions (visualize_images and model_predictions) that will be used later for displaying the training and testing images, and computing the predictions for the test dataset.

visualize_images() is a function for visualizing images in an image grid. By default, it shows 16 images from the images array. The array labels is passed to the function to show the class names for each of the images displayed. If an optional array predictions is provided, both the ground truth label and the predicted label will be shown side by side.

You will notice that we multiply the value of inp by 255, and then cast it to the uint8 data type. This converts the values from the 0-to-1 range to the 0-to-255 range. It also avoids the clipping warning that matplotlib emits when floating-point image values fall outside the 0-to-1 range, which can happen after the normalization is undone.

def visualize_images(images, labels, predictions=None, num_images=16):
    # per-channel statistics used to undo the normalization
    mean = np.array([0.485, 0.456, 0.406])
    std = np.array([0.229, 0.224, 0.225])

    fig = plt.figure(1, figsize=(16, 16))
    grid = ImageGrid(fig, 111, nrows_ncols=(4, 4), axes_pad=0.05)

    # show at most num_images images, one per grid cell
    for i in range(min(len(images), num_images)):
        ax = grid[i]

        # C x H x W tensor -> H x W x C array, then undo the normalization.
        # Clip to [0, 1] so out-of-range values do not wrap around
        # when cast to uint8.
        inp = images[i].numpy().transpose((1, 2, 0))
        inp = np.clip(std * inp + mean, 0, 1)
        ax.imshow((inp * 255).astype(np.uint8))

        if predictions is None:
            info = '{}'.format(class_names[labels[i]])
        else:
            info = '{}/{}'.format(class_names[labels[i]],
                                  class_names[predictions[i]])

        ax.text(10, 20, info, color='w',
                backgroundcolor='black',
                alpha=0.8,
                size=15)

Given a DataLoader and a model, the function model_predictions() iterates through the images, and computes the predicted label using the model provided. The predictions, ground-truth labels, and images are then returned.

# given a dataloader, get the predictions using the model provided.
def model_predictions(dataloader, model):
    predictions = []
    images = []
    label_list = []

    # run on whichever device the model already lives on
    device = next(model.parameters()).device
    model.eval()

    # gradients are not needed for inference
    with torch.no_grad():
        for inputs, labels in dataloader:
            inputs = inputs.to(device)

            outputs = model(inputs)
            # index of the largest output is the predicted class
            _, preds = torch.max(outputs, 1)

            predictions.append(preds.cpu())
            label_list.append(labels)
            images.extend(inputs.cpu())

    # flatten the per-batch tensors into plain Python lists
    predictions_f = torch.cat(predictions).tolist()
    label_f = torch.cat(label_list).tolist()

    return predictions_f, label_f, images

Visualizing the Training Data

It is important to have a deep understanding of the data before you start training the deep learning model. Let us use the visualize_images() function to show one batch of 8 images from the training dataset (shown in Figure 5-2). Because the DataLoader shuffles the data, you will see a different set of images each time.

images, labels = next(iter(dataloaders['train']))
visualize_images(images, labels)

Figure 5-2. Training images for indoor and outdoor scenes. Visualizing these images with their ground-truth labels lets AI practitioners quickly explore a sample of the image dataset, and understand the images that are used for training.

Fine-tuning the Pre-trained Model

Many different kinds of pre-trained deep learning model architectures can be used for image classification. PyTorch (and similarly TensorFlow) provides a rich set of model architectures that you can use.

For example, in TorchVision.Models, you will see that PyTorch provides model definitions for AlexNet, VGG, ResNet, SqueezeNet, DenseNet, Inception V3, GoogLeNet, ShuffleNet v2, MobileNet v2, ResNeXt, Wide ResNet, MNASNet and more. Pre-trained models are available by setting pretrained=True when loading the models (newer TorchVision releases use a weights argument for the same purpose).

For this exercise, we will use ResNet-50. We will load the pre-trained ResNet-50 model, and fine-tune it for the indoor/outdoor image classification task.

Note

Residual Networks (or ResNets) were first introduced in the paper Deep Residual Learning for Image Recognition by Kaiming He et al.

In the paper, the authors showed how residual networks can be easily optimized, and explored different network depths (including networks with more than 1,000 layers).

In 2015, ResNet-based networks won first place in the ILSVRC classification task.

Getting ready for Training

First, we load the ResNet-50 model.

# Load Resnet50 model
model = models.resnet50(pretrained=True)

ResNet-50 is trained on ImageNet with 1,000 classes. We need to replace its final fully connected (FC) layer with one that outputs 2 classes.

# Specify a final layer with 2 classes - indoor and outdoor
num_classes = 2
num_features = model.fc.in_features
model.fc = torch.nn.Linear(num_features, num_classes)
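Swapping the final layer also shrinks it considerably. As a rough check, the sketch below counts the weights and biases of a fully connected layer, using 2,048 input features (the in_features value that ResNet-50 reports for its FC layer).

```python
def linear_param_count(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


in_features = 2048  # input width of ResNet-50's final FC layer
original_head = linear_param_count(in_features, 1000)  # ImageNet classes
new_head = linear_param_count(in_features, 2)          # indoor / outdoor
print(original_head, new_head)
```

The new layer starts from random weights, so it is the part of the network that needs the most training on the indoor/outdoor data.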

Now that we have modified the FC layer, let us define the criterion, optimizer, and scheduler. In the code below, you will see that we specify the loss function as Cross-Entropy Loss. The PyTorch CrossEntropyLoss() criterion combines both nn.LogSoftmax() and nn.NLLLoss() together, and is commonly used for image classification problems with N classes.

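For the optimizer, torch.optim.SGD() is used. The Stochastic Gradient Descent (SGD) optimization approach is commonly used for training CNNs over batches of data. Note that only model.fc.parameters() is passed to the optimizer, so only the weights of the new FC layer are updated during training.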

For the scheduler, lr_scheduler.StepLR() is used. The StepLR scheduler multiplies the learning rate by gamma once every step_size epochs. In our example, we use the default gamma value of 0.1 and specify a step_size of 8, so the learning rate drops by a factor of 10 every 8 epochs.
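The resulting schedule is easy to compute by hand. The following sketch reproduces the step decay in plain Python for a 25-epoch run with an initial learning rate of 0.001; the step_lr() helper is hypothetical and is not part of PyTorch.

```python
def step_lr(initial_lr, epoch, step_size=8, gamma=0.1):
    """Learning rate in effect during a given (0-based) epoch."""
    return initial_lr * gamma ** (epoch // step_size)


schedule = [step_lr(0.001, epoch) for epoch in range(25)]

# Print the rate for a few 1-based epoch numbers
for epoch in (0, 7, 8, 16, 24):
    print(epoch + 1, schedule[epoch])
```

Epochs 1 to 8 run at 0.001, epochs 9 to 16 at roughly 0.0001, epochs 17 to 24 at roughly 0.00001, and the final epoch at roughly 0.000001, so most of the learning happens early in the run.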

import torch.optim as optim
from torch.optim import lr_scheduler

# Use CrossEntropyLoss as a loss function
loss_function = torch.nn.CrossEntropyLoss()

optimizer = torch.optim.SGD(model.fc.parameters(), lr=0.001, momentum=0.8)
scheduler = lr_scheduler.StepLR(optimizer, step_size=8)
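To make the description of the loss function concrete, the sketch below computes cross-entropy for a single example as a log-softmax followed by a negative log-likelihood. It is a framework-free illustration with made-up logits, not PyTorch's implementation; CrossEntropyLoss additionally averages the result over the batch by default.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax for one example."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_sum for x in logits]


def nll_loss(log_probs, target):
    """Negative log-likelihood of the target class."""
    return -log_probs[target]


def cross_entropy(logits, target):
    return nll_loss(log_softmax(logits), target)


# Hypothetical logits for [indoor, outdoor]
logits = [0.2, 1.7]
loss_if_outdoor = cross_entropy(logits, target=1)  # model agrees with label
loss_if_indoor = cross_entropy(logits, target=0)   # model disagrees
print(round(loss_if_outdoor, 4), round(loss_if_indoor, 4))
```

The loss is small when the larger logit belongs to the true class and grows when the model favors the wrong class, which is the signal backpropagation uses to adjust the weights.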

We are ready to start fine-tuning the ResNet-50 model. Let us move the model to the GPU.

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.to(device)

In a notebook, model.to(device) returns the model, so the cell displays the structure of the network. You will notice that the structure of the network reflects what is shown in Figure 5-3. In this chapter, we show a snippet of the network, and you can see the full network when you execute the code.

In the PyTorch implementation of ResNet, you will see that ResNet-50 consists of multiple Bottleneck blocks, each containing three convolutions with kernel sizes of (1,1), (3,3), and (1,1). As noted in the TorchVision ResNet implementation, the Bottleneck blocks used in TorchVision put the stride for downsampling at the 3x3 convolution.

In the last few layers of the ResNet-50 architecture, you will see the use of 2D adaptive average pooling, followed by an FC layer that outputs 2 features (corresponding to the indoor and outdoor classes).

With the model on the GPU, we are ready to fine-tune it with the training images of indoor and outdoor scenes.

Figure 5-3. Architecture of ResNet, with different layers (Src: Deep Residual Learning for Image Recognition, Kaiming He et al. )
ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2),
  padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1,
  affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1,
  dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): Bottleneck(
      (conv1): Conv2d(64, 64, kernel_size=(1, 1),
      stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1,
      affine=True, track_running_stats=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1),
      padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1,
      affine=True, track_running_stats=True)
      (conv3): Conv2d(64, 256, kernel_size=(1, 1),
      stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1,
      affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (downsample): Sequential(
        (0): Conv2d(64, 256, kernel_size=(1, 1),
        stride=(1, 1), bias=False)
        (1): BatchNorm2d(256, eps=1e-05, momentum=0.1,
        affine=True, track_running_stats=True)
      )
    )
  )

  ... More layers ....

  )
  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
  (fc): Linear(in_features=2048, out_features=2, bias=True)
)
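The printed structure also lets you trace how the spatial resolution of a 224 x 224 input shrinks on its way to the pooling layer. The sketch below applies the standard output-size formula for convolution and pooling layers. The conv1 and maxpool settings are taken from the printout above; the assumption that layer2, layer3, and layer4 each halve the resolution once, through a stride-2 3x3 convolution with padding 1, reflects the standard ResNet-50 design and is not visible in the truncated printout.

```python
def conv_output_size(size, kernel, stride, padding):
    """Output width/height of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


size = 224
sizes = [size]

size = conv_output_size(size, kernel=7, stride=2, padding=3)  # conv1
sizes.append(size)
size = conv_output_size(size, kernel=3, stride=2, padding=1)  # maxpool
sizes.append(size)

# layer1 keeps the resolution; layer2, layer3, and layer4 each halve it
for _ in range(3):
    size = conv_output_size(size, kernel=3, stride=2, padding=1)
    sizes.append(size)

print(sizes)
```

The last stage therefore produces 7 x 7 feature maps with 2,048 channels. The adaptive average pooling layer reduces each map to a single value, giving the 2,048 features that feed the FC layer.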

The function train() (shown later in the page) is adapted from the example provided in the PyTorch tutorial Finetuning Torchvision Models.

We first perform a deep copy of all the pre-trained model weights found in model.state_dict() to the variable best_model_weights. We initialize the best accuracy of the model to be 0.0.

The code then iterates through multiple epochs. For each epoch, we first load the data and labels that will be used for training and then push them to the GPU. We reset the optimizer gradients before calling model(inputs) to perform the forward pass. We compute the loss, using cross-entropy loss. Once these steps are completed, we call loss.backward() for the backward pass. We then use optimizer.step() to update all the relevant parameters.

In the train() function, you will notice that we turn off the gradient calculation during the validation phase. This is because during the validation phase, gradient calculation is not required, and you simply want to use the validation inputs to compute the loss and accuracy only.

We check whether the validation accuracy for the current epoch has improved over the previous epochs. If it has, we store the current weights in best_model_weights and set best_accuracy to the best validation accuracy observed so far.

def train(model, criterion, optimizer, scheduler, num_epochs=10):
    # use to store the training and validation loss
    training_loss = []
    val_loss = []

    best_model_weights = copy.deepcopy(model.state_dict())
    best_accuracy = 0.0

    # Note the start time of the training
    start_time = time.time()

    for epoch in range(num_epochs):
        print('Epoch {}/{}, '.format(epoch+1, num_epochs), end = ' ' )

        # iterate through training and validation phase
        for phase in ['train', 'val']:

            total_loss = 0.0
            total_corrects = 0

            if phase == 'train':
                model.train()
                print("[Training] ", end=' ')
            elif phase == 'val':
                model.eval()
                print("[Validation] ", end=' ')
            else:
                print("Not supported phase")

            for inputs, labels in dataloaders[phase]:
                inputs = inputs.to(device)
                labels = labels.to(device)

                # Reset the optimizer gradients
                optimizer.zero_grad()

                if phase == 'train':
                    with torch.set_grad_enabled(True):
                        outputs = model(inputs)
                        _, preds = torch.max(outputs, 1)
                        loss = criterion(outputs, labels)

                    loss.backward()
                    optimizer.step()
                else:
                    with torch.set_grad_enabled(False):
                        outputs = model(inputs)
                        _, preds = torch.max(outputs, 1)
                        loss = criterion(outputs, labels)

                total_loss += loss.item() * inputs.size(0)
                total_corrects += torch.sum(preds == labels).item()

            # compute loss and accuracy
            epoch_loss = total_loss / dataset_sizes[phase]
            epoch_accuracy = total_corrects / dataset_sizes[phase]

            if phase == 'train':
                scheduler.step()
                training_loss.append(epoch_loss)
            else:
                val_loss.append(epoch_loss)


            if phase == 'val' and epoch_accuracy > best_accuracy:
                best_accuracy = epoch_accuracy
                best_model_weights  = copy.deepcopy(model.state_dict())

            print('Loss: {:.3f} Accuracy: {:.3f}, '.format(
             epoch_loss, epoch_accuracy), end=' ')


        print()

    # Elapse time
    time_elapsed = time.time() - start_time

    print ('Train/Validation Duration: %s'
          % time.strftime("%H:%M:%S", time.gmtime(time_elapsed)))
    print('Best Validation Accuracy: {:3f}'.format(best_accuracy))

    # Load the best weights to the model
    model.load_state_dict(best_model_weights )
    return model, training_loss, val_loss
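One detail in train() deserves a closer look: each batch loss is multiplied by the batch size before it is added to total_loss. The criterion returns the mean loss over a batch, so this weighting is what makes the epoch loss a true per-image average when the last batch is smaller than the rest. The numbers below are hypothetical and only illustrate the arithmetic.

```python
# Hypothetical per-batch mean losses and batch sizes for one epoch
batch_losses = [0.40, 0.30, 0.90]
batch_sizes = [8, 8, 1]

# What train() does: weight each batch mean by its size, then divide by
# the total number of images.
total_loss = sum(loss * n for loss, n in zip(batch_losses, batch_sizes))
epoch_loss = total_loss / sum(batch_sizes)

# A plain average of the batch means would over-weight the small last batch.
naive_loss = sum(batch_losses) / len(batch_losses)

print(round(epoch_loss, 4), round(naive_loss, 4))
```

In this example the single-image batch would pull a naive average well above the per-image loss, while the weighted version counts every image once.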

Fine-tuning the ResNet-50 Model

Let us fine-tune the ResNet-50 model (with pre-trained weights) for 25 epochs.

best_model, train_loss, val_loss = train(
                    model,
                    loss_function,
                    optimizer,
                    scheduler,
                    num_epochs=25)

From the output, you will see the training and validation loss over multiple epochs.

Epoch 1/25,
[Training]  Loss: 0.425 Accuracy: 0.796,
[Validation]  Loss: 0.258 Accuracy: 0.895,
Epoch 2/25,
[Training]  Loss: 0.377 Accuracy: 0.842,
[Validation]  Loss: 0.310 Accuracy: 0.891,
Epoch 3/25,
[Training]  Loss: 0.377 Accuracy: 0.837,
[Validation]  Loss: 0.225 Accuracy: 0.927,
Epoch 4/25,
[Training]  Loss: 0.357 Accuracy: 0.850,
[Validation]  Loss: 0.225 Accuracy: 0.931,
Epoch 5/25,
[Training]  Loss: 0.331 Accuracy: 0.861,
[Validation]  Loss: 0.228 Accuracy: 0.927,
...
Epoch 24/25,
[Training]  Loss: 0.302 Accuracy: 0.871,
[Validation]  Loss: 0.250 Accuracy: 0.907,
Epoch 25/25,
[Training]  Loss: 0.280 Accuracy: 0.886,
[Validation]  Loss: 0.213 Accuracy: 0.927,
Train/Validation Duration: 00:20:10
Best Validation Accuracy: 0.935223
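Notice that the best validation accuracy (0.935) is higher than the accuracy of the final epoch (0.927). This is why train() keeps a deep copy of the best weights and restores them at the end. The sketch below replays that bookkeeping with the first five validation accuracies from the log; the weights dictionary and the per-epoch update are simple stand-ins for model.state_dict() and a real training epoch.

```python
import copy

# Validation accuracies of epochs 1 to 5 from the log above
val_accuracies = [0.895, 0.891, 0.927, 0.931, 0.927]

weights = {'fc.bias': [0.0, 0.0]}  # stand-in for model.state_dict()
best_weights = copy.deepcopy(weights)
best_accuracy = 0.0
best_epoch = None

for epoch, accuracy in enumerate(val_accuracies, start=1):
    # stand-in for one epoch of training, which changes the weights in place
    weights['fc.bias'][0] += 0.1

    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_epoch = epoch
        # deepcopy takes a snapshot that later updates cannot change
        best_weights = copy.deepcopy(weights)

print(best_epoch, best_accuracy, best_weights, weights)
```

Without the deep copy, best_weights would keep pointing at the live weights and silently follow every later update.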

After the model is trained, you will want to make sure that the model is not overfitting.

During the training of the model, we store the training and validation loss. These are returned by the train() function, and stored in the lists train_loss and val_loss. Let’s use them to plot the training and validation loss (shown in Figure 5-4), using the following code. From Figure 5-4, you will observe that the validation loss is consistently lower than the training loss, which suggests that the model has not overfitted the training data.

# Visualize training and validation loss
num_epochs = 25

plt.figure(figsize=(9, 5))
plt.title("Training vs Validation Loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")

plt.plot(range(1,num_epochs+1),train_loss,
         label="Training Loss",
         linewidth=3.5)

plt.plot(range(1,num_epochs+1),val_loss,
         label="Validation Loss",
         linewidth=3.5)

plt.ylim((0,1.))
plt.xticks(np.arange(1, num_epochs+1, 1.0))

plt.legend()
plt.show()
Figure 5-4. Training vs Validation Loss (Number of Epochs = 25)
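The same check can be done numerically. The sketch below uses only the epochs that are printed in the training log earlier in this section (the elided epochs are left out) and flags any epoch where the validation loss exceeds the training loss. Treat it as a rough heuristic rather than a definitive test for overfitting.

```python
# (training loss, validation loss) for the epochs printed in the log
logged_losses = {
    1: (0.425, 0.258),
    2: (0.377, 0.310),
    3: (0.377, 0.225),
    4: (0.357, 0.225),
    5: (0.331, 0.228),
    24: (0.302, 0.250),
    25: (0.280, 0.213),
}

# Positive gap: validation loss is higher than training loss
gaps = {epoch: round(val - train, 3)
        for epoch, (train, val) in logged_losses.items()}
overfitting_suspected = any(gap > 0 for gap in gaps.values())

print(gaps, overfitting_suspected)
```

A validation loss that starts rising while the training loss keeps falling would be the pattern to watch for; none of the logged epochs show it.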

There is plenty of room for further improvement to the model. For example, you can explore performing hyperparameter sweeps for the learning rate, momentum, and more.

Model Evaluation

Now that we have identified the best model, let us use it to predict the class for the images in the test dataset. To do this, we use the utility function that we have defined earlier, model_predictions(). We provide as inputs the dataloader for the test dataset, and the best model (i.e. best_model).

# Use the model for prediction using the test dataset
predictions, labels, images = model_predictions(dataloaders["test"],best_model)

Let us look at the classification report for the model, using the test dataset.

# print out the classification report
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score

print(classification_report(labels, predictions))
print('ROC_AUC: %.4f' % (roc_auc_score(labels, predictions)))

The results are shown below.

              precision    recall  f1-score   support

           0       0.74      0.82      0.78        61
           1       0.91      0.86      0.88       127

    accuracy                           0.85       188
   macro avg       0.82      0.84      0.83       188
weighted avg       0.85      0.85      0.85       188

ROC_AUC: 0.8390
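To see where these numbers come from, it helps to recompute them from a confusion matrix. The chapter does not list the raw counts, so the matrix below is a reconstruction: the counts were chosen to be consistent with the rounded report and the class supports (61 and 127), and should be treated as an assumption.

```python
# Reconstructed confusion counts (an assumption, see the text above).
# Rows are true classes, columns are predicted classes (0=indoor, 1=outdoor).
confusion = [[50, 11],
             [18, 109]]


def precision(matrix, cls):
    """Correct predictions of cls divided by all predictions of cls."""
    predicted = sum(row[cls] for row in matrix)
    return matrix[cls][cls] / predicted


def recall(matrix, cls):
    """Correct predictions of cls divided by all true members of cls."""
    return matrix[cls][cls] / sum(matrix[cls])


total = sum(sum(row) for row in confusion)
accuracy = (confusion[0][0] + confusion[1][1]) / total

# With hard 0/1 predictions, ROC AUC reduces to the mean of the two recalls.
roc_auc = (recall(confusion, 0) + recall(confusion, 1)) / 2

for cls in (0, 1):
    print(cls, round(precision(confusion, cls), 2),
          round(recall(confusion, cls), 2))
print(round(accuracy, 2), round(roc_auc, 4))
```

Because roc_auc_score() is given predicted labels rather than class probabilities here, the reported value is effectively the balanced accuracy. Passing predicted probabilities instead would give a threshold-independent AUC.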

Let us visualize the ground-truth and predicted labels for each of the test images. To do this, we use the utility function visualize_images(), and pass as inputs: test images, labels, and predicted labels.

# Show the label, and prediction
visualize_images(images, labels, predictions )

The output of visualize_images() is shown in Figure 5-5.

Figure 5-5. Testing images for indoor and outdoor scenes. For each image, the ground-truth labels are shown first, followed by the predicted labels.

From Figure 5-5, you will see that the fine-tuned model is performing relatively well using the weak labels that have been produced by Snorkel in Chapter 3.

One of the images (the first image in the second row) is incorrectly classified. You will observe that this is a problem with the ground-truth label, and not with the image classifier that we have just trained: Snorkel has incorrectly labeled it as an outdoor image. In addition, it is hard to tell whether the image shows an indoor or outdoor scene, which explains the confusion.

Summary

In this chapter, you learned how to leverage the weakly labeled dataset generated using Snorkel to train a deep convolutional neural network for image classification in PyTorch. We also used concepts from transfer learning to incorporate powerful pre-trained computer vision models into our model training approach.

While the indoor-outdoor application discussed here is relatively simple, Snorkel has been used to power a broad set of real-world applications in computer vision ranging from medical image interpretation to scene graph prediction. The same principles outlined in this chapter on using Snorkel to build a weakly supervised dataset for a new modality can be extended to other domains like volumetric imaging (e.g. computed tomography), time series, and video.

It is also common to have signals from multiple modalities at once. This cross-modal setting described by Dunnmon et al. represents a particularly powerful way to combine the material from Chapters 4 and 5. In brief, it is common to have image data that is accompanied by free text (clinical report, article, caption, etc.) describing that image. In this setting, one can write labeling functions over the text, and ultimately use the generated labels to train a neural network over the associated image, which can often be easier than writing labeling functions over the image directly.

There exist a wide variety of real-world situations in which this cross-modal weak supervision approach can be useful because we have multiple modalities associated with any given datapoint, and the modality we wish to operate on at test time is harder to write labeling functions over than another we may have available at train time. We encourage the reader to consider a cross-modal approach when planning how to approach building models for a particular application.

Transfer learning for both computer vision (discussed in this chapter) and Natural Language Processing (NLP) (discussed in Chapter 4) has enabled data scientists and researchers to effectively transfer the knowledge learned from large-scale datasets, and adapt it to new domains. The availability of pre-trained models for both computer vision and NLP has driven rapid innovation in the machine learning/deep learning community, and we encourage the reader to consider how transfer learning can be combined with weak supervision wherever possible.

Going forward, we expect that the powerful combination of Snorkel and transfer learning will create a flywheel that drives AI innovations and success in both commercial and academic settings.

Bibliography
