Your training data has as much to do with the success of your data project as the algorithms themselves; most failures in deep learning systems relate to training data. But while training data is the foundation for successful machine learning, there are few comprehensive resources to help you ace the process. This hands-on guide explains how to work with and scale training data. You'll gain a solid understanding of the concepts, tools, and processes needed to:

  • Design, deploy, and ship training data for production-grade deep learning applications
  • Integrate with a growing ecosystem of tools
  • Recognize and correct new training data-based failure modes
  • Improve existing system performance and avoid development risks
  • Confidently use automation and acceleration approaches to more effectively create training data
  • Avoid data loss by structuring metadata around created datasets
  • Clearly explain training data concepts to subject matter experts and other stakeholders
  • Successfully maintain, operate, and improve your system

Table of Contents

  1. Training Data Introduction
    1. What is Training Data?
    2. Good Robot, Bad Robot
    3. Thinking of Training Data as Code
    4. Concepts Introduction
    5. Representations
    6. Choices
    7. Who Supervises the Data
    8. Sets of Assumptions
    9. Randomness
    10. Processes and Process Automation
    11. Supervision Automation and Tooling
    12. Dataset Construction & Maintenance
    13. Relevancy
    14. Integrated System Design
    15. What-To-Label
    16. Transfer Learning
    17. Per Sample Judgement Calls
    18. Ethical & Privacy Considerations
    19. Why Training Data Matters for Supervised Learning
    20. Control
    21. Dependencies
    22. Context Matters: Imagine a Perfect System
    23. Contexts in Training Data: Classic and Supervised
    24. Monkey See, Monkey Do
    25. Training Data Sample Creation
    26. Introduction
    27. Approach One: Binary Classification
    28. Let’s manually create our first set
    29. Approach Two: Upgraded Classification
    30. Training Data Process Introduction
    31. Getting Started
    32. Training Data Actions
    33. Levels of System Maturity of Training Data Operations
    34. Training Data in the Ecosystem
    35. Tooling
    36. Applied vs Research Sets
    37. Training Data Management
    38. Introduction
    39. Completed vs Not Completed
    40. When Completed Is More Complicated
    41. Freshness
    42. Maintaining Set Metadata
    43. Task Management
    44. Challenges Introduction
    45. Failures caused by Training Data
    46. Failing to Achieve the Desired Bias
    47. Summary
  2. Training Data Concepts
    1. Schema Deep Dive Introduction
    2. What is it? Labels & Attributes
    3. What do we care about?
    4. Label Introduction
    5. Attributes Introduction
    6. Relationship to Spatial Types
    7. Importance of What it is
    8. The Hidden Background Case
    9. Technical Specifications
    10. Where is it? - Spatial Representation
    11. Computer Vision Spatial Types
    12. Keypoint
    13. Ellipse and Circle
    14. Cuboid
    15. Lines & Curves
    16. Types with multiple uses
    17. Complex Spatial Types
    18. Trade offs with types for architecture and creation
    19. Trade offs with types for usage
    20. When is it? - Relationships, Sequences, Time Series
    21. Guides, Instructions
    22. Choosing good names
    23. Relation of Machine Learning Tasks to Training Data
    24. Tasks
    25. Chart - Relationship of Tasks to Training Data Types
    26. General Concepts
    27. Instance Concept Refresher
    28. Upgrading data over time
    29. Advanced concepts
    30. Boundary between Modeling and Training Data
    31. Raw Data Concepts
    32. Images
    33. Raw Data Constraints
    34. Video
    35. 3D
    36. 3D Point Clouds
    37. Text
    38. Raw Data Combinations
    39. Multimodal Data
    40. Transformations - What view is the data being annotated in? Where is it getting predicted on?
    41. Summary