Learn how easy it is to apply sophisticated statistical and machine learning methods to real-world problems when you build on Google Cloud Platform (GCP). This hands-on guide shows data engineers and data scientists how to implement an end-to-end data pipeline using statistical and machine learning methods and tools on GCP.

Over the course of this updated second edition, you'll work through a sample business decision by employing a variety of data science approaches. Follow along by implementing these statistical and machine learning solutions in your own project on GCP, and discover how this platform provides a transformative and more collaborative way of doing data science.

You'll learn how to:

  • Employ best practices in building highly scalable data and ML pipelines on Google Cloud
  • Automate and schedule data ingest using Cloud Run
  • Create and populate a dashboard in Data Studio
  • Build a real-time analytics pipeline using Pub/Sub, Dataflow, and BigQuery
  • Conduct interactive data exploration with BigQuery
  • Create a Bayesian model with Spark on Cloud Dataproc
  • Forecast time series and perform anomaly detection with BigQuery ML
  • Aggregate within time windows with Dataflow
  • Train explainable machine learning models with Vertex AI
  • Operationalize ML with Vertex AI Pipelines
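As a taste of the probabilistic decision-making covered in Chapter 1 (see the "Probabilistic Approach" and "Cumulative Distribution Function" sections), here is a minimal, standalone sketch of a threshold-based decision rule built from an empirical distribution. The delay values, the 15-minute cutoff, and the 30% decision threshold are invented for this illustration, not taken from the book's dataset:

```python
# Hypothetical sketch: decide based on the empirical probability that a
# flight's arrival delay exceeds some number of minutes. The data and
# thresholds below are made up for illustration.
from bisect import bisect_right

def prob_delay_exceeds(delays, minutes):
    """Empirical probability that an observed delay exceeds `minutes`."""
    s = sorted(delays)
    # Count observations strictly greater than `minutes` via binary search.
    return (len(s) - bisect_right(s, minutes)) / len(s)

def should_cancel_meeting(delays, minutes=15, threshold=0.30):
    """Cancel if the chance of arriving more than `minutes` late
    exceeds `threshold` -- a probabilistic, not binary, decision."""
    return prob_delay_exceeds(delays, minutes) > threshold

# Ten hypothetical arrival delays, in minutes (negative = early).
observed = [-5, 0, 2, 8, 12, 16, 20, 35, 50, 4]
p = prob_delay_exceeds(observed, 15)  # 4 of 10 delays exceed 15 -> 0.4
print(p, should_cancel_meeting(observed))
```

In the book, this kind of rule is derived from real airline on-time data at much larger scale; the sketch only shows the shape of the reasoning, where a cumulative distribution over historical outcomes turns into a yes/no recommendation.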

Table of Contents

  1. Making Better Decisions Based on Data
    1. Many Similar Decisions
    2. The Role of Data Scientists
    3. Scrappy Environment
    4. Full Stack Cloud Data Scientists
    5. Collaboration
    6. Target Audience for the Book
    7. Best Practices
    8. Simple to Complex Solutions
    9. Cloud Computing
    10. Serverless
    11. A Probabilistic Decision
    12. Probabilistic Approach
    13. Probability Density Function
    14. Cumulative Distribution Function
    15. Data and Tools
    16. Getting Started with the Code
    17. Summary
  2. Ingesting Data into the Cloud
    1. Airline On-Time Performance Data
    2. Knowability
    3. Training–Serving Skew
    4. Downloading Data
    5. Hub and Spoke Architecture
    6. Dataset Fields
    7. Separation of Compute and Storage
    8. Scaling Up
    9. Scaling Out with Sharded Data
    10. Scaling Out with Data in Situ
    11. Ingesting Data
    12. Reverse Engineering a Web Form
    13. Dataset Download
    14. Exploration and Cleanup
    15. Uploading Data to Google Cloud Storage
    16. Loading Data into Google BigQuery
    17. Advantages of a Serverless Columnar Database
    18. Staging on Cloud Storage
    19. Access Control
    20. Ingesting CSV Files
    21. Partitioning
    22. Scheduling Monthly Downloads
    23. Ingesting in Python
    24. Cloud Run
    25. Securing Cloud Run
    26. Deploying and Invoking Cloud Run
    27. Scheduling Cloud Run
    28. Summary
    29. Code Break
  3. Creating Compelling Dashboards
    1. Explain Your Model with Dashboards
    2. Why Build a Dashboard First?
    3. Accuracy, Honesty, and Good Design
    4. Loading Data into Cloud SQL
    5. Create a Google Cloud SQL Instance
    6. Create Table of Data
    7. Interacting with the Database
    8. Querying Using BigQuery
    9. Schema Exploration
    10. Using Preview
    11. Using Table Explorer
    12. Creating a BigQuery View
    13. Building Our First Model
    14. Contingency Table
    15. Threshold Optimization
    16. Building a Dashboard
    17. Getting Started with Data Studio
    18. Creating Charts
    19. Adding End-User Controls
    20. Showing Proportions with a Pie Chart
    21. Explaining a Contingency Table
    22. Summary
  4. Streaming Data: Publication and Ingest with Pub/Sub and Dataflow
    1. Designing the Event Feed
    2. Transformations Needed
    3. Architecture
    4. Getting Airport Information
    5. Sharing Data
    6. Time Correction
    7. Apache Beam/Cloud Dataflow
    8. Parsing Airports Data
    9. Adding Time Zone Information
    10. Converting Times to UTC
    11. Correcting Dates
    12. Creating Events
    13. Reading and Writing to the Cloud
    14. Running the Pipeline in the Cloud
    15. Publishing an Event Stream to Cloud Pub/Sub
    16. Speed-up Factor
    17. Get Records to Publish
    18. Iterating Through Records
    19. Building a Batch of Events
    20. Publishing a Batch of Events
    21. Real-Time Stream Processing
    22. Streaming in Dataflow
    23. Windowing a Pipeline
    24. Streaming Aggregation
    25. Using Event Timestamps
    26. Executing the Stream Processing
    27. Analyzing Streaming Data in BigQuery
    28. Real-Time Dashboard
    29. Summary