Appendix A: Vision Transformers
Before the advent of vision transformers (ViTs), image-based machine learning tasks such as image classification, object detection, visual question answering, and image captioning were handled mostly by CNNs and related neural architectures. Vision transformers introduced an alternative way of handling such image-related tasks, often with better results.
Among the papers released on vision transformers, the one released on October 22, 2020, by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, and Thomas Unterthiner is particularly noteworthy. Their approach builds on Vaswani et al.'s "Attention Is All You Need" paper, which is widely used in natural language processing and has been referred to in previous chapters. The attention layers themselves are left unchanged; the essential trick is to break an image into small patches (for example, 16×16 pixels).
Self-Attention and Vision Transformers
The big question is how to apply self-attention to images. Just as in NLP, where one word attends to other words (to find the relations between them), we need to apply a similar concept to images. The important aspect to understand here is how this mechanism is achieved. This is where vision transformers come into the picture.
To achieve self-attention, vision transformers divide the image into patches. Each patch is flattened into a linear sequence of its pixel values; in other words, the 2D representation of that part of the image is reduced to a 1D vector. Each of these vectors is then linearly projected, and a positional embedding is added to it so that positional information is maintained within the learned representation. This is similar in nature to the positional embeddings we have seen for text in previous chapters.
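To make the patch embedding concrete, the following is a minimal PyTorch sketch of these steps. The dimensions (a 224×224 RGB image, 16×16 patches, a 768-dimensional embedding) are illustrative defaults rather than values prescribed here, and the variable names are our own.

import torch
import torch.nn as nn

# Illustrative values: a 224x224 RGB image split into 16x16 patches
image_size, patch_size, channels, embed_dim = 224, 16, 3, 768
num_patches = (image_size // patch_size) ** 2              # 14 x 14 = 196 patches

image = torch.randn(1, channels, image_size, image_size)   # dummy image batch

# 1. Break the image into non-overlapping patches and flatten each one into a 1D vector
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, num_patches, -1)  # (1, 196, 768)

# 2. Linearly project every flattened patch to the embedding dimension
projection = nn.Linear(channels * patch_size * patch_size, embed_dim)
tokens = projection(patches)                                # (1, 196, 768)

# 3. Add learned positional embeddings so patch order is preserved
position_embedding = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
tokens = tokens + position_embedding                         # ready for the transformer encoder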
At a high level, the architecture of a vision transformer is shown in Figure A-1.
The transformer encoder module consists of a multi-headed self-attention (MSA) layer and a multi-layer perceptron (MLP) layer. The multi-headed self-attention layer splits its input across several heads, allowing each head to learn a distinct self-attention pattern independently. The outputs of the heads are then concatenated and processed by the multi-layer perceptron layer.
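The following is a minimal PyTorch sketch of one such encoder block. The dimensions (768-dimensional embeddings, 12 heads, 3,072 hidden units in the MLP) follow the ViT-Base configuration, but the class itself is an illustrative sketch rather than the authors' reference implementation.

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One transformer encoder block: multi-headed self-attention followed by an MLP,
    each wrapped with layer normalization and a residual connection."""
    def __init__(self, embed_dim=768, num_heads=12, mlp_dim=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        # MSA: the input is split across num_heads heads, each attending independently,
        # and the per-head outputs are concatenated back together
        self.attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        # MLP: two linear layers with a GELU nonlinearity in between
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, mlp_dim),
            nn.GELU(),
            nn.Linear(mlp_dim, embed_dim),
        )

    def forward(self, x):
        normed = self.norm1(x)
        attn_out, _ = self.attention(normed, normed, normed)
        x = x + attn_out                      # residual connection around MSA
        x = x + self.mlp(self.norm2(x))       # residual connection around MLP
        return x

block = EncoderBlock()
tokens = torch.randn(1, 196, 768)   # patch embeddings from the previous step
print(block(tokens).shape)          # torch.Size([1, 196, 768])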
This book does not go into the details of the vision transformer itself. The interested reader can find more in the original vision transformer paper.
Before we look at the code, there is an important class called FeatureExtractor in the Hugging Face Transformers library.
In most cases, the job of preparing input features for models outside the traditional NLP domain falls on the shoulders of a feature extractor. Feature extractors are responsible for a variety of tasks, including processing images and audio recordings. Most vision models are packaged with an accompanying feature extractor.
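As a brief illustration, the snippet below shows how a feature extractor is typically loaded and applied with the Transformers library; the image path dog.jpg is a placeholder, and the exact output shape depends on the checkpoint's preprocessing configuration.

from transformers import AutoFeatureExtractor
from PIL import Image

# Load the feature extractor that is published alongside the checkpoint
extractor = AutoFeatureExtractor.from_pretrained("microsoft/swin-tiny-patch4-window7-224")

# "dog.jpg" is a placeholder path; any RGB image will do
image = Image.open("dog.jpg")

# Resizes, rescales, and normalizes the image, returning pixel_values ready for the model
inputs = extractor(images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)   # e.g. torch.Size([1, 3, 224, 224])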
Without going into the different aspects of vision transformers in detail (as NLP is the main focus of this book), we illustrate one example: image classification with a vision transformer.
The following code uses a ViT for an image classification task.
As with the code samples in Chapter 5, we use Gradio as the framework here, so the example below follows the same pattern we adopted in Chapter 5.
Image Classification Using a ViT
Code
app.py
import gradio as grad

# Load the Swin Transformer image-classification checkpoint from the Hugging Face Hub
# and launch a Gradio demo around it
grad.Interface.load(
    "models/microsoft/swin-tiny-patch4-window7-224",
    theme="default",
    css=".footer{display:none !important}",
    title=None,
).launch()
Listing A-1 Gradio app for vision transformers
Here, the grad.Interface.load method loads the model from the path we provide, which in this example is models/microsoft/swin-tiny-patch4-window7-224.
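For completeness, the same checkpoint can also be queried directly, without the Gradio UI, through the Transformers image-classification pipeline; the sketch below shows equivalent usage rather than being part of the listing above, and dog.jpg is a placeholder path.

from transformers import pipeline

# The same checkpoint can be used directly through the image-classification pipeline;
# "dog.jpg" is a placeholder path to any local image file
classifier = pipeline("image-classification", model="microsoft/swin-tiny-patch4-window7-224")
predictions = classifier("dog.jpg")
print(predictions)   # list of {"label": ..., "score": ...} dictionaries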
We get output as shown in Figure A-2.
Uploading a dog image, the model again classifies it correctly, as shown in Figure A-3.
Other tasks, such as image segmentation and object detection, are also possible with vision transformers. We leave these for you to try out as an exercise.