Whether you're part of a small startup or a planet-spanning megacorp, this practical book shows data scientists, SREs, and business owners how to run ML reliably, effectively, and accountably within their organizations. You'll gain insight into everything from monitoring models in production to running a well-tuned model development team in a product organization.

By applying an SRE mindset to machine learning, authors and engineering professionals Cathy Chen, Kranti Parisa, Niall Richard Murphy, D. Sculley, Todd Underwood, and featured guests show you how to run an efficient ML system. Whether you want to increase revenue, optimize decision-making, solve problems, or understand and influence customer behavior, you'll learn how to perform day-to-day ML tasks while keeping the bigger picture in mind.

You'll examine:

  • What ML is: how it functions and what it relies on
  • Conceptual frameworks for understanding how ML "loops" work
  • Effective "productionization," and how to make ML systems easy to monitor, deploy, and operate
  • Why ML systems make production troubleshooting more difficult, and how to work around those difficulties
  • How ML, product, and production teams can communicate effectively

Table of Contents

  1. Prospective Table of Contents (Subject to Change)
  2. 1. What Production Engineers Need to Know About Models
    1. What Is a Model?
    2. A Basic Model Creation Workflow
    3. Model Architecture vs. Configured Model vs. Trained Model
    4. Where Are the Vulnerabilities?
    5. Training Data
    6. Labels
    7. Training Methods
    8. Infrastructure and Pipelines
    9. Platforms
    10. Feature Generation
    11. Upgrades and Fixes
    12. A Set of Useful Questions to Ask about Any Model
    13. An Example ML System
    14. Yarn Product Click Prediction Model
    15. Features
    16. Labels for Features
    17. Model Updating
    18. Model Serving
    19. Common Failures
    20. Beyond the Basics
  3. 2. Incident Response
    1. Incident Management Basics
    2. Life of an Incident
    3. Incident Response Roles
    4. Anatomy of an ML-centric Outage
    5. Terminology Reminder: “Model”
    6. Story Time
    7. Story 1: Searching But Not Finding
    8. Story 2: Suddenly Useless Partners
    9. Story 3: Recommend You Find New Suppliers
    10. Stages of ML Incident Response for Story 3
    11. ML Incident Management Principles
    12. Guiding Principles
    13. Model Developer or Data Scientist
    14. Software Engineer
    15. ML SRE or Production Engineer
    16. Product Manager or Business Leader
    17. Special Topics
    18. Production Engineers and ML Engineering vs. Modeling
    19. The Ethical On-Call Engineer Manifesto
    20. Conclusion