Most data scientists and engineers today rely on quality labeled data to train their machine learning models. But building training sets manually is time-consuming and expensive, leaving many companies with unfinished ML projects. There's a more practical approach. In this book, Amit Bahree, Senja Filipi, and Wee Hyong Tok from Microsoft show you how to create products using weakly supervised learning models.

You'll learn how to build natural language processing and computer vision projects using weakly labeled datasets from Snorkel, a spin-off from the Stanford AI Lab. Because so many companies pursue ML projects that never go beyond their labs, this book also provides a guide on how to ship the deep learning models you build.

  • Get a practical overview of weak supervision
  • Dive into data programming with help from Snorkel
  • Perform text classification using Snorkel's weakly labeled dataset
  • Use Snorkel's labeled indoor-outdoor dataset for computer vision tasks
  • Scale up weak supervision using scaling strategies and underlying technologies

Table of Contents

  1. Preface
    1. Who Should Read This Book
    2. Navigating This Book
    3. Conventions Used in This Book
    4. Using Code Examples
    5. O’Reilly Online Learning
    6. How to Contact Us
    7. Acknowledgments
  2. 1. Introduction to Weak Supervision
    1. What is Weak Supervision?
    2. Real-world Weak Supervision with Snorkel
    3. Approaches to Weak Supervision
    4. Incomplete supervision
    5. Inexact supervision
    6. Inaccurate supervision
    7. Data programming
    8. Getting training data
    9. How data programming is helping accelerate Software 2.0
    10. Summary
    11. Bibliography
  3. 2. Diving into Data Programming with Snorkel
    1. Snorkel, a data programming framework
    2. Getting started with Labeling Functions
    3. Applying the labels to the datasets
    4. Analyzing the labeling performance
    5. Using a validation set
    6. Reaching labeling consensus with LabelModel
    7. Strategies to improve the labeling functions
    8. Data Augmentation with Snorkel Transformers
    9. Data augmentation through word removal
    10. Snorkel Preprocessors
    11. Data augmentation through GPT-2 prediction
    12. Data Augmentation through translation
    13. Applying the transformation functions to the dataset
    14. Summary
    15. Bibliography
  4. 3. Labeling in Action
    1. Labeling a Text Dataset: Identifying Fake News
    2. Exploring the Fake news detection (FakeNewsNet) dataset
    3. Importing Snorkel, and setting up representative constants
    4. Fact-checking sites
    5. Is the speaker a “liar”?
    6. Twitter profile and Botometer score
    7. Generating agreements between weak classifiers
    8. Labeling an Images Dataset: Determining Indoor versus Outdoor Images
    9. Creating a dataset of images from Bing
    10. Defining and training weak classifiers in TensorFlow
    11. Training the various classifiers
    12. Weak classifiers out of image tags
    13. Deploying the Computer Vision Service
    14. Interacting with the Computer Vision Service
    15. Preparing the data frame
    16. Learning a label model
    17. Summary
    18. Bibliography
  5. 4. Using the Snorkel-labeled Dataset for Text Classification
    1. Getting started with Natural Language Processing (NLP)
    2. Transformers
    3. Hard vs Probabilistic Labels
    4. Using ktrain for Performing Text Classification
    5. Data Preparation
    6. Dealing with an Imbalanced Dataset
    7. Training the model
    8. Using the Text Classification model for prediction
    9. Finding a good learning rate
    10. Using Hugging Face and Transformers
    11. Loading the relevant Python packages
    12. Dataset Preparation
    13. Checking whether GPU hardware is available
    14. Performing Tokenization
    15. Model Training
    16. Testing the Fine-tuned Model
    17. Summary
    18. Bibliography
  6. 5. Using the Snorkel-labeled Dataset for Image Classification
    1. Visual Object Recognition Overview
    2. Representing Image Features
    3. Transfer Learning for Computer Vision
    4. Using PyTorch for Image classification
    5. Loading the Indoor/Outdoor dataset
    6. Utility Functions
    7. Visualizing the Training Data
    8. Fine-tuning the Pre-trained Model
    9. Summary
    10. Bibliography
  7. 6. Scalability and Distributed Training
    1. The need for scalability
    2. Distributed training
    3. Apache Spark - An Introduction
    4. Spark Application Design
    5. Using Azure Databricks to Scale
    6. Cluster Setup for Weak Supervision
    7. Fake news detection dataset on Databricks
    8. Labeling Functions for Snorkel
    9. Setting up dependencies
    10. Loading the data
    11. Fact-checking sites
    12. Transfer Learning using the LIAR dataset
    13. Weak classifiers - generating agreement
    14. Type Conversions needed for Spark runtime
    15. Summary
    16. Bibliography