Book Description

Data pipelines are the foundation for success in data analytics and machine learning. Moving data from many diverse sources and processing it to provide context is the difference between having data and actually gaining value from it. This pocket reference defines data pipelines and explains how they work in the modern data stack.

You’ll learn common considerations and key decision points when implementing pipelines, such as data pipeline design patterns, data ingestion implementation, data transformation, pipeline orchestration, and build-versus-buy decision making. This book addresses the most common decisions made by data professionals and discusses foundational concepts that apply to open source frameworks, commercial products, and homegrown solutions.

You’ll learn:

  • What a data pipeline is and how it works
  • How data is moved and processed on modern data infrastructure, including cloud platforms
  • Common tools and products used by data engineers to build pipelines
  • How pipelines support machine learning and analytics needs
  • Considerations for pipeline maintenance, testing, and alerting

Table of Contents

  1. Preface
    1. Who This Book Is For
    2. Conventions Used in This Book
    3. Using Code Examples
    4. O’Reilly Online Learning
    5. How to Contact Us
  2. 1. Introduction to Data Pipelines
    1. What Are Data Pipelines?
    2. Who Builds Data Pipelines?
      1. SQL and Data Warehousing Fundamentals
      2. Python and/or Java
      3. Distributed Computing
      4. Basic System Administration
      5. A Business Goal Mentality
    3. Why Build Data Pipelines?
    4. How Are Pipelines Built?
  3. 2. A Modern Data Infrastructure
    1. Diversity of Data Sources
      1. Source System Ownership
      2. Ingestion Interface and Data Structure
      3. Data Volume
      4. Data Cleanliness and Validity
      5. Latency and Bandwidth of the Source System
    2. Cloud Data Warehouses and Data Lakes
    3. Data Ingestion Tools
    4. Data Transformation and Modeling Tools
    5. Workflow Orchestration Platforms
      1. Directed Acyclic Graphs
    6. Customizing Your Data Infrastructure
  4. 3. Common Data Pipeline Patterns
    1. ETL and ELT
    2. The Emergence of ELT over ETL
    3. EtLT Subpattern
    4. ELT for Data Analysis
    5. ELT for Data Science
    6. ELT for Data Products
  5. 4. Data Ingestion
    1. Setting Up Your Python Environment
    2. Setting Up Cloud File Storage
    3. Configuring an Amazon Redshift Warehouse as a Destination
    4. Configuring a Snowflake Warehouse as a Destination
    5. Extracting Data from a MySQL Database
      1. Full or Incremental MySQL Table Extraction
      2. Binary Log Replication of MySQL Data
    6. Extracting Data from a Postgres Database
      1. Full or Incremental Postgres Table Extraction
      2. Replicating Data Using the Write-Ahead Log
    7. Extracting Data from MongoDB
    8. Extracting Data from a REST API
    9. Streaming Data Ingestions with Kafka and Debezium
    10. Loading Data into a Redshift Data Warehouse
      1. Loading Raw Data Stored in CSV Files
      2. Incremental vs. Full Loads
      3. Loading Data Extracted from a Change Data Capture Log
    11. Loading Data into a Snowflake Data Warehouse
    12. Using Your File Storage as a Data Lake
    13. Open Source Frameworks
    14. Commercial Alternatives
  6. 5. Transforming Data
    1. Non-Contextual Transformations
      1. Deduplicating Records in a Table
      2. Parsing URLs
    2. When to Transform? During or After Ingestion?
    3. Data Modeling Foundations
      1. Key Data Modeling Terms
      2. Modeling Fully Refreshed Data
      3. Slowly Changing Dimensions for Fully Refreshed Data
      4. Modeling Incrementally Ingested Data
      5. Modeling Append-Only Data
      6. Modeling Change Capture Data