
Data Pipelines with Apache Airflow teaches you how to build and maintain effective data pipelines. You’ll explore the most common usage patterns, including aggregating multiple data sources, connecting to and from data lakes, and cloud deployment. Part reference and part tutorial, this practical guide covers every aspect of the directed acyclic graphs (DAGs) that power Airflow, and how to customize them for your pipeline’s needs.
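To give a feel for the DAG-centric pipelines the book builds toward, here is a minimal sketch of an Airflow DAG with two dependent tasks. The DAG id, task names, and schedule are illustrative placeholders, not examples taken from the book:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def _process():
    # Placeholder for the actual transformation logic.
    print("processing data")


# A DAG groups tasks and defines when they should run.
with DAG(
    dag_id="example_pipeline",      # illustrative name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",     # run once per day
) as dag:
    fetch = BashOperator(
        task_id="fetch_data",
        bash_command="echo 'fetching data'",
    )
    process = PythonOperator(
        task_id="process_data",
        python_callable=_process,
    )
    # The >> operator sets the dependency: fetch runs before process.
    fetch >> process
```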

Table of Contents

  1. inside front cover
  2. Data Pipelines with Apache Airflow
  3. Copyright
  4. brief contents
  5. contents
  6. front matter
    1. preface
    2. acknowledgments
    3. Bas Harenslak
    4. Julian de Ruiter
    5. about this book
    6. Who should read this book
    7. How this book is organized: A road map
    8. About the code
    9. LiveBook discussion forum
    10. about the authors
    11. about the cover illustration
  7. Part 1. Getting started
  8. 1 Meet Apache Airflow
    1. 1.1 Introducing data pipelines
    2. 1.1.1 Data pipelines as graphs
    3. 1.1.2 Executing a pipeline graph
    4. 1.1.3 Pipeline graphs vs. sequential scripts
    5. 1.1.4 Running pipelines using workflow managers
    6. 1.2 Introducing Airflow
    7. 1.2.1 Defining pipelines flexibly in (Python) code
    8. 1.2.2 Scheduling and executing pipelines
    9. 1.2.3 Monitoring and handling failures
    10. 1.2.4 Incremental loading and backfilling
    11. 1.3 When to use Airflow
    12. 1.3.1 Reasons to choose Airflow
    13. 1.3.2 Reasons not to choose Airflow
    14. 1.4 The rest of this book
    15. Summary
  9. 2 Anatomy of an Airflow DAG
    1. 2.1 Collecting data from numerous sources
    2. 2.1.1 Exploring the data
    3. 2.2 Writing your first Airflow DAG
    4. 2.2.1 Tasks vs. operators
    5. 2.2.2 Running arbitrary Python code
    6. 2.3 Running a DAG in Airflow
    7. 2.3.1 Running Airflow in a Python environment
    8. 2.3.2 Running Airflow in Docker containers
    9. 2.3.3 Inspecting the Airflow UI
    10. 2.4 Running at regular intervals
    11. 2.5 Handling failing tasks
    12. Summary
  10. 3 Scheduling in Airflow
    1. 3.1 An example: Processing user events
    2. 3.2 Running at regular intervals
    3. 3.2.1 Defining scheduling intervals
    4. 3.2.2 Cron-based intervals
    5. 3.2.3 Frequency-based intervals
    6. 3.3 Processing data incrementally
    7. 3.3.1 Fetching events incrementally
    8. 3.3.2 Dynamic time references using execution dates
    9. 3.3.3 Partitioning your data
    10. 3.4 Understanding Airflow’s execution dates
    11. 3.4.1 Executing work in fixed-length intervals
    12. 3.5 Using backfilling to fill in past gaps
    13. 3.5.1 Executing work back in time
    14. 3.6 Best practices for designing tasks
    15. 3.6.1 Atomicity
    16. 3.6.2 Idempotency
    17. Summary
  11. 4 Templating tasks using the Airflow context
    1. 4.1 Inspecting data for processing with Airflow
    2. 4.1.1 Determining how to load incremental data
    3. 4.2 Task context and Jinja templating
    4. 4.2.1 Templating operator arguments
    5. 4.2.2 What is available for templating?
    6. 4.2.3 Templating the PythonOperator
    7. 4.2.4 Providing variables to the PythonOperator
    8. 4.2.5 Inspecting templated arguments
    9. 4.3 Hooking up other systems
    10. Summary
  12. 5 Defining dependencies between tasks
    1. 5.1 Basic dependencies
    2. 5.1.1 Linear dependencies
    3. 5.1.2 Fan-in/-out dependencies
    4. 5.2 Branching
    5. 5.2.1 Branching within tasks
    6. 5.2.2 Branching within the DAG
    7. 5.3 Conditional tasks
    8. 5.3.1 Conditions within tasks
    9. 5.3.2 Making tasks conditional
    10. 5.3.3 Using built-in operators
    11. 5.4 More about trigger rules
    12. 5.4.1 What is a trigger rule?
    13. 5.4.2 The effect of failures
    14. 5.4.3 Other trigger rules
    15. 5.5 Sharing data between tasks
    16. 5.5.1 Sharing data using XComs
    17. 5.5.2 When (not) to use XComs
    18. 5.5.3 Using custom XCom backends
    19. 5.6 Chaining Python tasks with the Taskflow API
    20. 5.6.1 Simplifying Python tasks with the Taskflow API
    21. 5.6.2 When (not) to use the Taskflow API
    22. Summary
  13. Part 2. Beyond the basics
  14. 6 Triggering workflows
    1. 6.1 Polling conditions with sensors
    2. 6.1.1 Polling custom conditions
    3. 6.1.2 Sensors outside the happy flow
    4. 6.2 Triggering other DAGs
    5. 6.2.1 Backfilling with the TriggerDagRunOperator
    6. 6.2.2 Polling the state of other DAGs
    7. 6.3 Starting workflows with REST/CLI
    8. Summary
  15. 7 Communicating with external systems
    1. 7.1 Connecting to cloud services
    2. 7.1.1 Installing extra dependencies
    3. 7.1.2 Developing a machine learning model
    4. 7.1.3 Developing locally with external systems
    5. 7.2 Moving data between systems
    6. 7.2.1 Implementing a PostgresToS3Operator
    7. 7.2.2 Outsourcing the heavy work
    8. Summary
  16. 8 Building custom components
    1. 8.1 Starting with a PythonOperator
    2. 8.1.1 Simulating a movie rating API
    3. 8.1.2 Fetching ratings from the API
    4. 8.1.3 Building the actual DAG
    5. 8.2 Building a custom hook
    6. 8.2.1 Designing a custom hook
    7. 8.2.2 Building our DAG with the MovielensHook
    8. 8.3 Building a custom operator
    9. 8.3.1 Defining a custom operator
    10. 8.3.2 Building an operator for fetching ratings
    11. 8.4 Building custom sensors
    12. 8.5 Packaging your components
    13. 8.5.1 Bootstrapping a Python package
    14. 8.5.2 Installing your package
    15. Summary
  17. 9 Testing
    1. 9.1 Getting started with testing
    2. 9.1.1 Integrity testing all DAGs
    3. 9.1.2 Setting up a CI/CD pipeline
    4. 9.1.3 Writing unit tests
    5. 9.1.4 Pytest project structure
    6. 9.1.5 Testing with files on disk
    7. 9.2 Working with DAGs and task context in tests
    8. 9.2.1 Working with external systems
    9. 9.3 Using tests for development
    10. 9.3.1 Testing complete DAGs
    11. 9.4 Emulate production environments with Whirl
    12. 9.5 Create DTAP environments
    13. Summary
  18. 10 Running tasks in containers
    1. 10.1 Challenges of many different operators
    2. 10.1.1 Operator interfaces and implementations
    3. 10.1.2 Complex and conflicting dependencies
    4. 10.1.3 Moving toward a generic operator
    5. 10.2 Introducing containers
    6. 10.2.1 What are containers?
    7. 10.2.2 Running our first Docker container
    8. 10.2.3 Creating a Docker image
    9. 10.2.4 Persisting data using volumes
    10. 10.3 Containers and Airflow
    11. 10.3.1 Tasks in containers
    12. 10.3.2 Why use containers?
    13. 10.4 Running tasks in Docker
    14. 10.4.1 Introducing the DockerOperator
    15. 10.4.2 Creating container images for tasks
    16. 10.4.3 Building a DAG with Docker tasks
    17. 10.4.4 Docker-based workflow
    18. 10.5 Running tasks in Kubernetes
    19. 10.5.1 Introducing Kubernetes
    20. 10.5.2 Setting up Kubernetes
    21. 10.5.3 Using the KubernetesPodOperator
    22. 10.5.4 Diagnosing Kubernetes-related issues
    23. 10.5.5 Differences with Docker-based workflows
    24. Summary
  19. Part 3. Airflow in practice
  20. 11 Best practices
    1. 11.1 Writing clean DAGs
    2. 11.1.1 Use style conventions
    3. 11.1.2 Manage credentials centrally
    4. 11.1.3 Specify configuration details consistently
    5. 11.1.4 Avoid doing any computation in your DAG definition
    6. 11.1.5 Use factories to generate common patterns
    7. 11.1.6 Group related tasks using task groups
    8. 11.1.7 Create new DAGs for big changes
    9. 11.2 Designing reproducible tasks
    10. 11.2.1 Always require tasks to be idempotent
    11. 11.2.2 Task results should be deterministic
    12. 11.2.3 Design tasks using functional paradigms
    13. 11.3 Handling data efficiently
    14. 11.3.1 Limit the amount of data being processed
    15. 11.3.2 Incremental loading/processing
    16. 11.3.3 Cache intermediate data
    17. 11.3.4 Don’t store data on local file systems
    18. 11.3.5 Offload work to external/source systems
    19. 11.4 Managing your resources
    20. 11.4.1 Managing concurrency using pools
    21. 11.4.2 Detecting long-running tasks using SLAs and alerts
    22. Summary
  21. 12 Operating Airflow in production
    1. 12.1 Airflow architectures
    2. 12.1.1 Which executor is right for me?
    3. 12.1.2 Configuring a metastore for Airflow
    4. 12.1.3 A closer look at the scheduler
    5. 12.2 Installing each executor
    6. 12.2.1 Setting up the SequentialExecutor
    7. 12.2.2 Setting up the LocalExecutor
    8. 12.2.3 Setting up the CeleryExecutor
    9. 12.2.4 Setting up the KubernetesExecutor
    10. 12.3 Capturing logs of all Airflow processes
    11. 12.3.1 Capturing the webserver output
    12. 12.3.2 Capturing the scheduler output
    13. 12.3.3 Capturing task logs
    14. 12.3.4 Sending logs to remote storage
    15. 12.4 Visualizing and monitoring Airflow metrics
    16. 12.4.1 Collecting metrics from Airflow
    17. 12.4.2 Configuring Airflow to send metrics
    18. 12.4.3 Configuring Prometheus to collect metrics
    19. 12.4.4 Creating dashboards with Grafana
    20. 12.4.5 What should you monitor?
    21. 12.5 How to get notified of a failing task
    22. 12.5.1 Alerting within DAGs and operators
    23. 12.5.2 Defining service-level agreements
    24. 12.6 Scalability and performance
    25. 12.6.1 Controlling the maximum number of running tasks
    26. 12.6.2 System performance configurations
    27. 12.6.3 Running multiple schedulers
    28. Summary
  22. 13 Securing Airflow
    1. 13.1 Securing the Airflow web interface
    2. 13.1.1 Adding users to the RBAC interface
    3. 13.1.2 Configuring the RBAC interface
    4. 13.2 Encrypting data at rest
    5. 13.2.1 Creating a Fernet key
    6. 13.3 Connecting with an LDAP service
    7. 13.3.1 Understanding LDAP
    8. 13.3.2 Fetching users from an LDAP service
    9. 13.4 Encrypting traffic to the webserver
    10. 13.4.1 Understanding HTTPS
    11. 13.4.2 Configuring a certificate for HTTPS
    12. 13.5 Fetching credentials from secret management systems
    13. Summary
  23. 14 Project: Finding the fastest way to get around NYC
    1. 14.1 Understanding the data
    2. 14.1.1 Yellow Cab file share
    3. 14.1.2 Citi Bike REST API
    4. 14.1.3 Deciding on a plan of approach
    5. 14.2 Extracting the data
    6. 14.2.1 Downloading Citi Bike data
    7. 14.2.2 Downloading Yellow Cab data
    8. 14.3 Applying similar transformations to data
    9. 14.4 Structuring a data pipeline
    10. 14.5 Developing idempotent data pipelines
    11. Summary
  24. Part 4. In the clouds
  25. 15 Airflow in the clouds
    1. 15.1 Designing (cloud) deployment strategies
    2. 15.2 Cloud-specific operators and hooks
    3. 15.3 Managed services
    4. 15.3.1 Astronomer.io
    5. 15.3.2 Google Cloud Composer
    6. 15.3.3 Amazon Managed Workflows for Apache Airflow
    7. 15.4 Choosing a deployment strategy
    8. Summary
  26. 16 Airflow on AWS
    1. 16.1 Deploying Airflow in AWS
    2. 16.1.1 Picking cloud services
    3. 16.1.2 Designing the network
    4. 16.1.3 Adding DAG syncing
    5. 16.1.4 Scaling with the CeleryExecutor
    6. 16.1.5 Further steps
    7. 16.2 AWS-specific hooks and operators
    8. 16.3 Use case: Serverless movie ranking with AWS Athena
    9. 16.3.1 Overview
    10. 16.3.2 Setting up resources
    11. 16.3.3 Building the DAG
    12. 16.3.4 Cleaning up
    13. Summary
  27. 17 Airflow on Azure
    1. 17.1 Deploying Airflow in Azure
    2. 17.1.1 Picking services
    3. 17.1.2 Designing the network
    4. 17.1.3 Scaling with the CeleryExecutor
    5. 17.1.4 Further steps
    6. 17.2 Azure-specific hooks/operators
    7. 17.3 Example: Serverless movie ranking with Azure Synapse
    8. 17.3.1 Overview
    9. 17.3.2 Setting up resources
    10. 17.3.3 Building the DAG
    11. 17.3.4 Cleaning up
    12. Summary
  28. 18 Airflow in GCP
    1. 18.1 Deploying Airflow in GCP
    2. 18.1.1 Picking services
    3. 18.1.2 Deploying on GKE with Helm
    4. 18.1.3 Integrating with Google services
    5. 18.1.4 Designing the network
    6. 18.1.5 Scaling with the CeleryExecutor
    7. 18.2 GCP-specific hooks and operators
    8. 18.3 Use case: Serverless movie ranking on GCP
    9. 18.3.1 Uploading to GCS
    10. 18.3.2 Getting data into BigQuery
    11. 18.3.3 Extracting top ratings
    12. Summary
  29. appendix A. Running code samples
    1. A.1 Code structure
    2. A.2 Running the examples
    3. A.2.1 Starting the Docker environment
    4. A.2.2 Inspecting running services
    5. A.2.3 Tearing down the environment
  30. appendix B. Package structures Airflow 1 and 2
    1. B.1 Airflow 1 package structure
    2. B.2 Airflow 2 package structure
  31. appendix C. Prometheus metric mapping
  32. index
  33. inside back cover