Since their introduction in 2017, Transformers have quickly become the dominant architecture for achieving state-of-the-art results on a variety of natural language processing tasks. If you're a data scientist or coder, this practical book shows you how to train and scale these large models using Hugging Face Transformers, a Python-based deep learning library.

Transformers have been used to write realistic news stories, improve Google Search queries, and even create chatbots that tell corny jokes. In this guide, authors Lewis Tunstall, Leandro von Werra, and Thomas Wolf use a hands-on approach to teach you how Transformers work and how to integrate them into your applications. You'll quickly learn a variety of tasks they can help you solve.

  • Build, debug, and optimize Transformer models for core NLP tasks, such as text classification, named entity recognition, and question answering
  • Learn how Transformers can be used for cross-lingual transfer learning
  • Apply Transformers in real-world scenarios where labeled data is scarce
  • Make Transformer models efficient for deployment using techniques such as distillation, pruning, and quantization
  • Train Transformers from scratch and learn how to scale to multiple GPUs and distributed environments

Table of Contents

  1. 1. Text Classification
    1. The Dataset
    2. A First Look at Hugging Face Datasets
    3. From Datasets to DataFrames
    4. Look at the Class Distribution
    5. How Long Are Our Tweets?
    6. From Text to Tokens
    7. Character Tokenization
    8. Word Tokenization
    9. Subword Tokenization
    10. Using Pretrained Tokenizers
    11. Training a Text Classifier
    12. Transformers as Feature Extractors
    13. Fine-tuning Transformers
    14. Further Improvements
    15. Conclusion
  2. 2. Question Answering
    1. Building a Review-Based QA System
    2. The Dataset
    3. Extracting Answers from Text
    4. Using Haystack to Build a QA Pipeline
    5. Improving Our QA Pipeline
    6. Evaluating the Retriever
    7. Evaluating the Reader
    8. Domain Adaptation
    9. Evaluating the Whole QA Pipeline
    10. Going Beyond Extractive QA
    11. Retrieval Augmented Generation
    12. Conclusion
  3. 3. Making Transformers Efficient in Production
    1. Intent Detection as a Case Study
    2. Creating a Performance Benchmark
    3. Benchmarking Our Baseline Model
    4. Making Models Smaller via Knowledge Distillation
    5. Knowledge Distillation for Fine-tuning
    6. Knowledge Distillation for Pretraining
    7. Creating a Knowledge Distillation Trainer
    8. Choosing a Good Student Initialization
    9. Finding Good Hyperparameters with Optuna
    10. Benchmarking Our Distilled Model
    11. Making Models Faster with Quantization
    12. Quantization Strategies
    13. Quantizing Transformers in PyTorch
    14. Benchmarking Our Quantized Model
    15. Optimizing Inference with ONNX and the ONNX Runtime
    16. Optimizing for Transformer Architectures
    17. Making Models Sparser with Weight Pruning
    18. Sparsity in Deep Neural Networks
    19. Weight Pruning Methods
    20. Creating Masked Transformers
    21. Creating a Pruning Trainer
    22. Fine-Pruning With Increasing Sparsity
    23. Counting the Number of Pruned Weights
    24. Pruning Once and For All
    25. Quantizing and Storing in Sparse Format
    26. Conclusion
  4. 4. Multilingual Named Entity Recognition
    1. The Dataset
    2. Multilingual Transformers
    3. mBERT
    4. XLM
    5. XLM-R
    6. Training a Named Entity Recognition Tagger
    7. SentencePiece Tokenization
    8. The Anatomy of the Transformers Model Class
    9. Bodies and Heads
    10. Creating Your Own XLM-R Model for Token Classification
    11. Loading a Custom Model
    12. Tokenizing and Encoding the Texts
    13. Performance Measures
    14. Fine-tuning XLM-RoBERTa
    15. Error Analysis
    16. Evaluating Cross-Lingual Transfer
    17. When Does Zero-Shot Transfer Make Sense?
    18. Fine-tuning on Multiple Languages at Once
    19. Building a Pipeline for Inference
    20. Conclusion
  5. 5. Dealing With Few to No Labels
    1. Building a GitHub Issues Tagger
    2. Getting the Data
    3. Preparing the Data
    4. Creating Training Sets
    5. Creating Training Slices
    6. Implementing a Bayesline
    7. Working With No Labeled Data
    8. Zero-Shot Classification
    9. Working With a Few Labels
    10. Data Augmentation
    11. Using Embeddings as a Lookup Table
    12. Fine-tuning a Vanilla Transformer
    13. In-context and Few-shot Learning with Prompts
    14. Leveraging Unlabeled Data
    15. Fine-tuning a Language Model
    16. Fine-tuning a Classifier
    17. Advanced Methods
    18. Conclusion
  6. 6. Text Generation
    1. The Challenge With Generating Coherent Text
    2. Greedy Search Decoding
    3. Beam Search Decoding
    4. Sampling Methods
    5. Which Decoding Method Is Best?
    6. Conclusion
  7. 7. Summarization
    1. The CNN/DailyMail Dataset
    2. Text Summarization Pipelines
    3. Summarization Baseline
    4. GPT-2
    5. T5
    6. BART
    7. PEGASUS
    8. Comparing Different Summaries
    9. Measuring the Quality of Generated Text
    10. BLEU
    11. ROUGE
    12. Evaluating PEGASUS on the CNN/DailyMail Dataset
    13. Training Your Own Summarization Model
    14. Evaluating PEGASUS on SAMSum
    15. Fine-tuning PEGASUS
    16. Generating Dialogue Summaries
    17. Conclusion