Appendix A: Vision Transformers
Before the advent of vision transformers (ViTs), image-based machine learning tasks such as image classification, object detection, visual question answering, and image captioning were handled mostly by CNNs and related neural architectures. Vision transformers introduced an alternative way of handling such image-related tasks, often with better results.
Among the papers released on vision transformers, the one released on October 22, 2020, by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, and Thomas Unterthiner is particularly noteworthy. Their approach builds on Vaswani et al.'s "Attention Is All You Need" paper, which is widely used in natural language processing and has been referred to in previous chapters. The attention layers themselves are left unchanged; the essential trick is to break an image into small patches (for example, 16×16 pixels).
Self-Attention and Vision Transformers
The big question is how to apply self-attention to images. Just as in NLP, where one word attends to other words (to find the relations between them), we need to apply a similar concept to images. The important aspect to understand here is how this mechanism is achieved. This is where vision transformers come into the picture.
To achieve self-attention, vision transformers divide the image into patches. Each patch is flattened into a linear sequence of its pixel values; in other words, the 2D representation of that part of the image is reduced to a 1D vector. Each of these vectors is then linearly projected, and a positional embedding is added to it so that positional information is maintained within the learned representation. This is similar in nature to the positional embeddings we have seen for text in previous chapters.
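To make the patch embedding concrete, the following is a minimal PyTorch sketch of these steps. The dimensions (a 224×224 RGB image, 16×16 patches, a 768-dimensional embedding) are illustrative defaults rather than values prescribed here, and the variable names are our own.

import torch
import torch.nn as nn

# Illustrative values: a 224x224 RGB image split into 16x16 patches
image_size, patch_size, channels, embed_dim = 224, 16, 3, 768
num_patches = (image_size // patch_size) ** 2              # 14 x 14 = 196 patches

image = torch.randn(1, channels, image_size, image_size)   # dummy image batch

# 1. Break the image into non-overlapping patches and flatten each one into a 1D vector
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, num_patches, -1)  # (1, 196, 768)

# 2. Linearly project every flattened patch to the embedding dimension
projection = nn.Linear(channels * patch_size * patch_size, embed_dim)
tokens = projection(patches)                                # (1, 196, 768)

# 3. Add learned positional embeddings so patch order is preserved
position_embedding = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
tokens = tokens + position_embedding                         # ready for the transformer encoder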
At a high level, the architecture of a vision transformer is shown in Figure A-1.
The transformer encoder module consists of a multi-headed self-attention (MSA) layer and a multi-layer perceptron (MLP) layer. The multi-headed self-attention layer splits its input across several heads, allowing each head to learn a distinct self-attention pattern independently. The outputs of the heads are then concatenated and processed by the multi-layer perceptron layer.
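The following is a minimal PyTorch sketch of one such encoder block. The dimensions (768-dimensional embeddings, 12 heads, 3,072 hidden units in the MLP) follow the ViT-Base configuration, but the class itself is an illustrative sketch rather than the authors' reference implementation.

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One transformer encoder block: multi-headed self-attention followed by an MLP,
    each wrapped with layer normalization and a residual connection."""
    def __init__(self, embed_dim=768, num_heads=12, mlp_dim=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        # MSA: the input is split across num_heads heads, each attending independently,
        # and the per-head outputs are concatenated back together
        self.attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        # MLP: two linear layers with a GELU nonlinearity in between
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, mlp_dim),
            nn.GELU(),
            nn.Linear(mlp_dim, embed_dim),
        )

    def forward(self, x):
        normed = self.norm1(x)
        attn_out, _ = self.attention(normed, normed, normed)
        x = x + attn_out                      # residual connection around MSA
        x = x + self.mlp(self.norm2(x))       # residual connection around MLP
        return x

block = EncoderBlock()
tokens = torch.randn(1, 196, 768)   # patch embeddings from the previous step
print(block(tokens).shape)          # torch.Size([1, 196, 768])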
This book does not go into the details of the vision transformer itself. The interested reader can find more in the original vision transformer paper.
Before we look at the code, there is an important class called FeatureExtractor in the Hugging Face Transformers library.
In most cases, the job of preparing input features for models outside the traditional NLP domain falls on the shoulders of a feature extractor. Feature extractors are responsible for a variety of tasks, including processing images and audio recordings. Most vision models are packaged with an accompanying feature extractor.
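As a brief illustration, the snippet below shows how a feature extractor is typically loaded and applied with the Transformers library; the image path dog.jpg is a placeholder, and the exact output shape depends on the checkpoint's preprocessing configuration.

from transformers import AutoFeatureExtractor
from PIL import Image

# Load the feature extractor that is published alongside the checkpoint
extractor = AutoFeatureExtractor.from_pretrained("microsoft/swin-tiny-patch4-window7-224")

# "dog.jpg" is a placeholder path; any RGB image will do
image = Image.open("dog.jpg")

# Resizes, rescales, and normalizes the image, returning pixel_values ready for the model
inputs = extractor(images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)   # e.g. torch.Size([1, 3, 224, 224])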
Without going into the different aspects of vision transformers in detail (as NLP is the main focus of this book), we illustrate one example: image classification with a vision transformer.
The following code uses a ViT for an image classification task.
As with the code samples in Chapter 5, we use Gradio as the framework here, so the example below follows the same pattern we adopted in Chapter 5.
Image Classification Using a ViT
Code
app.py
import gradio as grad

# Load the Swin Transformer image-classification checkpoint from the Hugging Face Hub
# and launch a Gradio demo around it
grad.Interface.load(
    "models/microsoft/swin-tiny-patch4-window7-224",
    theme="default",
    css=".footer{display:none !important}",
    title=None,
).launch()
Listing A-1 Gradio app for vision transformers
Here, the grad.Interface.load method loads the model from the path we provide, which in this example is models/microsoft/swin-tiny-patch4-window7-224.
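For completeness, the same checkpoint can also be queried directly, without the Gradio UI, through the Transformers image-classification pipeline; the sketch below shows equivalent usage rather than being part of the listing above, and dog.jpg is a placeholder path.

from transformers import pipeline

# The same checkpoint can be used directly through the image-classification pipeline;
# "dog.jpg" is a placeholder path to any local image file
classifier = pipeline("image-classification", model="microsoft/swin-tiny-patch4-window7-224")
predictions = classifier("dog.jpg")
print(predictions)   # list of {"label": ..., "score": ...} dictionaries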
We get output as shown in Figure A-2.
Uploading a dog image, the model again classifies it correctly, as shown in Figure A-3.
Other tasks, such as image segmentation and object detection, are also possible with vision transformers. We leave these for you to try out as an exercise.