Book Description

If you use data to make critical business decisions, this book is for you. Whether you’re a data analyst, research scientist, data engineer, ML engineer, data scientist, application developer, or systems developer, this guide helps you broaden your understanding of the modern data science stack, create your own machine learning pipelines, and deploy them to applications at production scale.

The AWS data science stack unifies data science, data engineering, and application development to help you level up your skills beyond your current role. Authors Antje Barth and Chris Fregly show you how to build your own ML pipelines from existing APIs, submit them to the cloud, and integrate results into your application in minutes instead of days.

  • Innovate quickly and save money with AWS’s on-demand, serverless, and cloud-managed services
  • Implement open source technologies such as Kubeflow, Kubernetes, TensorFlow, and Apache Spark on AWS
  • Build and deploy an end-to-end, continuous ML pipeline with the AWS data science stack
  • Perform advanced analytics on at-rest and streaming data with AWS and Spark
  • Integrate streaming data into your ML pipeline for continuous delivery of ML models using AWS and Apache Kafka
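
To give a concrete flavor of that workflow, here is a minimal sketch, not taken from the book, of launching a SageMaker training job and deploying the resulting model with the SageMaker Python SDK; the bucket name, entry-point script, and hyper-parameter values are hypothetical placeholders. Chapters 4 through 6 develop each of these steps in depth.

    import sagemaker
    from sagemaker.tensorflow import TensorFlow

    # Assumes this runs inside a SageMaker notebook or Studio environment.
    role = sagemaker.get_execution_role()

    # Configure a TensorFlow training job; train.py and the hyper-parameter
    # values are hypothetical placeholders.
    estimator = TensorFlow(
        entry_point="train.py",
        role=role,
        instance_count=1,
        instance_type="ml.p3.2xlarge",
        framework_version="2.3",
        py_version="py37",
        hyperparameters={"epochs": 3, "learning_rate": 1e-5},
    )

    # Train on data previously ingested into the S3 data lake (hypothetical bucket).
    estimator.fit({"train": "s3://my-data-lake-bucket/train/"})

    # Deploy the trained model behind a real-time HTTPS endpoint.
    predictor = estimator.deploy(
        initial_instance_count=1,
        instance_type="ml.m5.xlarge",
    )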

Table of Contents

  1. Ingesting Data into the Cloud
    1. Data Lakes
      1. Import Your Data into the S3 Data Lake
      2. Describe the Dataset
    2. Query the S3 Data Lake with Amazon Athena
      1. Access Athena from the AWS Console
      2. Register Your S3 Data as an Athena Table
      3. Create a Parquet-based Table in Athena
    3. Load Dataset into Redshift Data Warehouse
      1. Query the Data Lake and Data Warehouse with Redshift Spectrum
      2. Export Redshift Data to S3 Data Lake as Parquet
    4. Choosing Athena vs. Redshift
    5. Secure Your Dataset
      1. Authenticating and Authorizing with Identity and Access Management (IAM) Roles
      2. Storing Keys with Key Management Service
      3. Securing Buckets with S3 Access Points
    6. Increase Performance and Reduce Cost
      1. Parquet Partitions and Compression
      2. Redshift Table Design and Compression
      3. S3 Intelligent-Tiering
    7. Summary
  2. Exploring the Dataset
    1. Overview
    2. Visualize Our Data Lake with Athena
      1. Prepare SageMaker Notebook for Athena
      2. Run a Sample Athena Query in the SageMaker Notebook
      3. Dive Deep into the Dataset with Athena and SageMaker
    3. Query Our Data Warehouse
      1. Prepare SageMaker Notebook for Redshift
      2. Run a Sample Redshift Query from the SageMaker Notebook
      3. Dive Deep into the Dataset with Redshift and SageMaker
    4. Create Dashboards with QuickSight
      1. Set Up the Data Source
      2. Query and Visualize the Dataset Within QuickSight
    5. Detect Data Quality Issues with Apache Spark
      1. SageMaker Processing Jobs
      2. Analyze Our Dataset with Deequ and Apache Spark
    6. Increase Performance and Reduce Cost
      1. Approximate Counts with HyperLogLog
      2. Dynamically Scale Your Data Warehouse with Redshift AQUA
      3. Improve Dashboard Performance with QuickSight SPICE
    7. Summary
  3. Preparing the Dataset for Model Training
    1. Perform Feature Engineering
      1. Select Training Features and Labels
      2. Balance the Dataset to Improve Your Model
      3. Split the Dataset into Train, Validation, and Test
      4. Transform Raw Text into BERT Embeddings
      5. Convert Features to TFRecord File Format
    2. Scale Feature Engineering with SageMaker Processing Jobs
      1. Transform with Scikit-Learn and TensorFlow
      2. Transform with Apache Spark and TensorFlow
    3. Automate Feature Engineering with AWS Step Functions
      1. Invoke a Pipeline with S3 Triggers
    4. Share Features Through a Feature Store
    5. Summary
  4. Training Your First Model with SageMaker
    1. Understand the SageMaker Infrastructure
      1. SageMaker Container Environment Variables and S3 Locations
      2. Compute and Network Isolation
      3. Data Encryption at Rest and in Transit
    2. Develop a SageMaker Model
      1. Built-In Algorithms
      2. Bring Your Own Script or Script Mode
      3. Bring Your Own Container
    3. A Brief History of Natural Language Processing
      1. Contextual Algorithms
      2. Attention-Based Algorithms
    4. Training BERT from Scratch
      1. Masked Language Model (Masked LM)
      2. Next Sentence Prediction
    5. Use Pre-Trained BERT Models
      1. Fine-Tune the BERT Model to Create a Custom Classifier
    6. Create the Training Script
      1. Set Up the Train, Validation, and Test Datasets
      2. Set Up the Custom Classifier Model
      3. Train and Validate the Model
      4. Save the Model
    7. Launch the Script from a SageMaker Notebook
      1. Define the Metrics to Capture and Monitor
      2. Configure the Hyper-Parameters for Our Algorithm
      3. Select Instance Type and Instance Count
      4. Putting It All Together in the Notebook
    8. Evaluate Model Training
      1. Run Some Ad Hoc Predictions from the Notebook
      2. Confusion Matrix
      3. TensorBoard
      4. CloudWatch
    9. Debug Model Training with SageMaker Debugger
    10. Increase Performance and Reduce Costs
      1. Reduced 16-bit Half Precision
      2. Mixed 32-bit Full and 16-bit Half Precision
      3. Quantization
      4. Spot Instances and Checkpoints
      5. Early Stopping
    11. Summary
  5. Training and Optimizing Models at Scale
    1. Compare Training Runs with SageMaker Experiments
      1. Trace and Audit Model Lineage
      2. Reproduce a Model
      3. Manage Artifacts and Dependencies
      4. Track Our Model Lifecycle with the Experiments API
      5. Set Up the Experiment
    2. Automatically Find the Best Model Hyper-Parameters
      1. Set Up the Hyper-Parameter Ranges
      2. Run the Hyper-Parameter Tuning Job
      3. Analyze the Tuning Job Results
    3. Warm Start Additional Hyper-Parameter Tuning Jobs
      1. Run a Hyper-Parameter Tuning Job Using Warm Start
    4. Train and Tune Models at Scale
      1. Increase Cluster Instance Count
      2. Choose an Appropriate Cluster Communication Strategy
    5. Train with Distributed File Systems
      1. Cache S3 Data Using FSx for Lustre
      2. Share Data Using Elastic File System
    6. Reduce Costs and Increase Performance
      1. Shard the Data with ShardedByS3Key
      2. Stream Data On-the-Fly with PipeMode
      3. Enable Enhanced Networking
    7. Summary
  6. Deploying Models to Production with SageMaker
    1. Collaborating with Multiple Teams
    2. Choose Level of Customization
      1. Built-In Algorithms
      2. Bring Your Own Script
      3. Bring Your Own Container
    3. Choose Real-Time or Batch Predictions
    4. Real-Time Predictions with SageMaker Endpoints
      1. Deploy the Model Using the SageMaker Python SDK
      2. Endpoint Configuration
      3. Track Model Deployment in Our Experiment
      4. Analyze Model Deployment Lineage
      5. Invoke Predictions Using the SageMaker Python SDK
      6. Invoke Predictions Using HTTP POST
    5. Creating Inference Pipelines
    6. Deploying New Models
      1. Split Traffic for Canary Rollouts
      2. Shift Traffic for Blue/Green Deployments
    7. Testing and Comparing New Models
      1. Perform A/B Tests to Compare Model Variants
      2. Reinforcement Learning with Multi-Armed Bandit Testing
    8. Auto-Scale SageMaker Endpoints Using CloudWatch
      1. Define a Scaling Policy with Custom Metrics
      2. Use Pre-Defined Metrics
      3. Tune Responsiveness Using a Cooldown Period
    9. Monitor Predictions and Detect Drift
      1. Enable Data Capture
      2. Create Baseline Statistics and Constraints for Features
      3. Schedule Monitoring Jobs
      4. Interpret Results
      5. Visualize Results in Amazon SageMaker Studio
    10. Perform Batch Predictions with SageMaker Batch Transform
      1. Select an Instance Type
      2. Set Up the Input Data
      3. Tune the Batch Transformation Configuration
      4. Prepare the Batch Transformation Job
      5. Run the Batch Transformation Job
      6. Review the Batch Predictions
    11. Lambda Functions and API Gateway
    12. Reduce Costs and Increase Performance
      1. Deploy Multiple Models in One Container
      2. Attach a GPU-based Elastic Inference Accelerator
      3. Optimize a Trained Model with SageMaker Neo and TensorFlow Lite
      4. Use Inference-Optimized Hardware
    13. Summary