front matter

preface

We’ve both been fortunate to be data engineers in interesting and challenging times. For better or worse, many companies and organizations are realizing that data plays a key role in managing and improving their operations. Recent developments in machine learning and AI have opened a slew of new opportunities to capitalize on. However, adopting data-centric processes is often difficult, as it generally requires coordinating jobs across many different heterogeneous systems and tying everything together in a nice, timely fashion for the next analysis or product deployment.

In 2014, engineers at Airbnb recognized the challenges of managing complex data workflows within the company. To address those challenges, they started developing Airflow: an open source solution that allowed them to write and schedule workflows and monitor workflow runs using the built-in web interface.

The success of the Airflow project quickly led to its adoption under the Apache Software Foundation, first as an incubator project in 2016 and later as a top-level project in 2019. As a result, many large companies now rely on Airflow for orchestrating numerous critical data processes.

Working as consultants at GoDataDriven, we’ve helped various clients adopt Airflow as a key component in projects involving the building of data lakes/platforms, machine learning models, and so on. In doing so, we realized that handing over these solutions can be challenging, as complex tools like Airflow can be difficult to learn overnight. For this reason, we also developed an Airflow training program at GoDataDriven, and have frequently organized and participated in meetings to share our knowledge, views, and even some open source packages. Combined, these efforts have helped us explore the intricacies of working with Airflow, which were not always easy to understand using the documentation available to us.

In this book, we aim to provide a comprehensive introduction to Airflow that covers everything from building simple workflows to developing custom components and designing/managing Airflow deployments. We intend to complement many of the excellent blogs and other online documentation by bringing several topics together in one place, using a concise and easy-to-follow format. In doing so, we hope to kickstart your adventures with Airflow by building on top of the experience we’ve gained through diverse challenges over the past years.

acknowledgments

This book would not have been possible without the support of many amazing people. Colleagues from GoDataDriven and personal friends supported us and provided valuable suggestions and critical insights. In addition, Manning Early Access Program (MEAP) readers posted useful comments in the online forum.

Reviewers from the development process also contributed helpful feedback: Al Krinker, Clifford Thurber, Daniel Lamblin, David Krief, Eric Platon, Felipe Ortega, Jason Rendel, Jeremy Chen, Jiri Pik, Jonathan Wood, Karthik Sirasanagandla, Kent R. Spillner, Lin Chen, Philip Best, Philip Patterson, Rambabu Posa, Richard Meinsen, Robert G. Gimbel, Roman Pavlov, Salvatore Campagna, Sebastián Palma Mardones, Thorsten Weber, Ursin Stauss, and Vlad Navitski.

At Manning, we owe special thanks to Brian Sawyer, our acquisitions editor, who helped us shape the initial book proposal and believed in us being able to see it through; Tricia Louvar, our development editor, who was very patient in answering all our questions and concerns, provided critical feedback on each of our draft chapters, and was an essential guide for us throughout this entire journey; and to the rest of the staff as well: Deirdre Hiam, our project editor; Michele Mitchell, our copyeditor; Keri Hales, our proofreader; and Al Krinker, our technical proofreader.

Bas Harenslak

I would like to thank my friends and family for their patience and support during this year-and-a-half adventure that developed from a side project into countless days, nights, and weekends. Stephanie, thank you for always putting up with me working at the computer. Miriam, Gerd, and Lotte, thank you for your patience and belief in me while writing this book. I would also like to thank the team at GoDataDriven for their support and dedication to always learn and improve. I could not have imagined being the author of a book when I started working five years ago.

Julian de Ruiter

First and foremost, I’d like to thank my wife, Anne Paulien, and my son, Dexter, for their endless patience during the many hours that I spent doing “just a little more work” on the book. This book would not have been possible without their unwavering support. In the same vein, I’d also like to thank our family and friends for their support and trust. Finally, I’d like to thank our colleagues at GoDataDriven for their advice and encouragement, from whom I’ve also learned an incredible amount in the past years.

about this book

Data Pipelines with Apache Airflow was written to help you implement data-oriented workflows (or pipelines) using Airflow. The book begins with the concepts and mechanics involved in programmatically building workflows for Apache Airflow using the Python programming language. Then the book switches to more in-depth topics such as extending Airflow by building your own custom components and comprehensively testing your workflows. The final part of the book focuses on designing and managing Airflow deployments, touching on topics such as security and designing architectures for several cloud platforms.

Who should read this book

Data Pipelines with Apache Airflow is written for scientists and engineers who are looking to develop basic workflows in Airflow, as well as for engineers interested in more advanced topics such as building custom components for Airflow or managing Airflow deployments. As Airflow workflows and components are built in Python, we do expect readers to have intermediate experience with programming in Python (i.e., have a good working knowledge of building Python functions and classes, understanding concepts such as *args and **kwargs, etc.). Some experience with Docker is also beneficial, as most of our code examples are run using Docker (though they can also be run locally if you wish).
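As a rough self-check of the Python level assumed here, the following minimal sketch uses a function, a class, and both *args and **kwargs. It is plain Python and does not use Airflow; the names describe_call and Task are hypothetical and chosen only for illustration.

```python
def describe_call(name, *args, **kwargs):
    """Summarize a call in one line (hypothetical helper, not an Airflow API)."""
    # *args collects any extra positional arguments into a tuple.
    positional = ", ".join(str(arg) for arg in args)
    # **kwargs collects any extra keyword arguments into a dict;
    # sorting the keys keeps the summary stable.
    keyword = ", ".join(f"{key}={value}" for key, value in sorted(kwargs.items()))
    return f"{name}: args=[{positional}] kwargs=[{keyword}]"


class Task:
    """A toy class that stores keyword options and forwards them later."""

    def __init__(self, task_id, **options):
        self.task_id = task_id
        self.options = options

    def run(self, *args):
        # Unpack the stored options back into keyword arguments.
        return describe_call(self.task_id, *args, **self.options)


task = Task("fetch", retries=3, owner="me")
summary = task.run(1, 2)
print(summary)
```

If the way Task.run unpacks its arguments reads naturally to you, you should be comfortable with the Python used in the book's listings.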

How this book is organized: A road map

The book consists of four parts that cover a total of 18 chapters.

Part 1 focuses on the basics of Airflow, explaining what Airflow is and outlining its basic concepts.

  • Chapter 1 discusses the concept of data workflows/pipelines and how these can be built using Apache Airflow. It also discusses the advantages and disadvantages of Airflow compared to other solutions, including in which situations you might not want to use Apache Airflow.

  • Chapter 2 goes into the basic structure of pipelines in Apache Airflow (also known as DAGs), explaining the different components involved and how these fit together.

  • Chapter 3 shows how you can use Airflow to schedule your pipelines to run at recurring time intervals so that you can (for example) build pipelines that incrementally load new data over time. The chapter also dives into some intricacies in Airflow’s scheduling mechanism, which is often a source of confusion.

  • Chapter 4 demonstrates how you can use templating mechanisms in Airflow to dynamically include variables in your pipeline definitions. This allows you to reference things such as schedule execution dates within your pipelines.

  • Chapter 5 demonstrates different approaches for defining relationships between tasks in your pipelines, allowing you to build more complex pipeline structures with branches, conditional tasks, and shared variables.

Part 2 dives deeper into more complex Airflow topics, including interfacing with external systems, building your own custom components, and designing tests for your pipelines.

  • Chapter 6 shows how you can trigger workflows in other ways that don’t involve fixed schedules, such as files being loaded or via an HTTP call.

  • Chapter 7 demonstrates workflows using operators that orchestrate various tasks outside Airflow, allowing you to develop a flow of events through systems that are not connected.

  • Chapter 8 explains how you can build custom components for Airflow that allow you to reuse functionality across pipelines or integrate with systems that are not supported by Airflow’s built-in functionality.

  • Chapter 9 discusses various options for testing Airflow workflows, touching on several properties of operators and how to approach these during testing.

  • Chapter 10 demonstrates how you can use container-based workflows to run pipeline tasks within Docker or Kubernetes and discusses the advantages and disadvantages of these container-based approaches.

Part 3 focuses on applying Airflow in practice and touches on subjects such as best practices, running/securing Airflow, and a final demonstrative use case.

  • Chapter 11 highlights several best practices to use when building pipelines, which will help you to design and implement efficient and maintainable solutions.

  • Chapter 12 details several topics to account for when running Airflow in a production setting, such as architectures for scaling out, monitoring, logging, and alerting.

  • Chapter 13 discusses how to secure your Airflow installation to avoid unwanted access and to minimize the impact if a breach occurs.

  • Chapter 14 demonstrates an example Airflow project in which we periodically process rides from New York City’s Yellow Cab and Citi Bikes to determine the fastest means of transportation between neighborhoods.

Part 4 explores how to run Airflow on several cloud platforms and includes topics such as designing Airflow deployments for the different clouds and how to use built-in operators to interface with different cloud services.

  • Chapter 15 provides a general introduction by outlining which Airflow components are involved in (cloud) deployments, introducing the idea behind cloud-specific components built into Airflow, and weighing the options of rolling out your own cloud deployment versus using a managed solution.

  • Chapter 16 focuses on Amazon’s AWS cloud platform, expanding on the previous chapter by designing deployment solutions for Airflow on AWS and demonstrating how specific components can be used to leverage AWS services.

  • Chapter 17 designs deployments and demonstrates cloud-specific components for Microsoft’s Azure platform.

  • Chapter 18 addresses deployments and cloud-specific components for Google’s GCP platform.

People new to Airflow should read chapters 1 and 2 to get a good idea of what Airflow is and what it can do. Chapters 3–5 provide important information about Airflow’s key functionality. The rest of the book discusses topics such as building custom components, testing, best practices, and deployments and can be read out of order, based on the reader’s particular needs.

About the code

All source code in listings or text is in a fixed-width font like this to separate it from ordinary text. Sometimes code is also in bold to highlight code that has changed from previous steps in the chapter, such as when a new feature adds to an existing line of code.

In many cases, the original source code has been reformatted; we’ve added line breaks and reworked indentation to accommodate the available page space in the book. In rare cases, even this was not enough, and listings include line-continuation markers (➥). Additionally, comments in the source code have often been removed from the listings when the code is described in the text. Code annotations accompany many of the listings, highlighting important concepts.

References to elements in the code, scripts, or specific Airflow classes/variables/values are often in italics to help distinguish them from the surrounding text.

Source code for all examples and instructions to run them using Docker and Docker Compose are available in our GitHub repository (https://github.com/BasPH/data-pipelines-with-apache-airflow) and can be downloaded via the book’s website (www.manning.com/books/data-pipelines-with-apache-airflow).

Note Appendix A provides more detailed instructions on running the code examples.

All code samples have been tested with Airflow 2.0. Most examples should also run on older versions of Airflow (1.10), with small modifications. Where possible, we have included inline pointers on how to do so. To help you account for differences in import paths between Airflow 2.0 and 1.10, appendix B provides an overview of changed import paths between the two versions.

LiveBook discussion forum

Purchase of Data Pipelines with Apache Airflow includes free access to a private web forum run by Manning Publications where you can make comments about the book, ask technical questions, and receive help from the authors and other users. To access the forum and subscribe to it, go to https://livebook.manning.com/#!/book/data-pipelines-with-apache-airflow/discussion. This page provides information on how to get on the forum once you’re registered, what kind of help is available, and its rules of conduct.

Manning’s commitment to our readers is to provide a venue where a meaningful dialogue between individual readers and between readers and the authors can take place. It is not a commitment to any specific amount of participation on the part of the authors, whose contribution to the forum remains voluntary (and unpaid). We suggest you try asking the authors some challenging questions lest their interest stray! The forum and the archives of previous discussions will be accessible from the publisher’s website as long as the book is in print.

about the authors

Bas Harenslak is a data engineer at GoDataDriven, a company developing data-driven solutions located in Amsterdam, Netherlands. With a background in software engineering and computer science, he enjoys working on software and data as if they are challenging puzzles. He favors working on open source software, is a committer on the Apache Airflow project, and is co-organizer of the Amsterdam Airflow meetup.

Julian de Ruiter is a machine learning engineer with a background in computer and life sciences and has a PhD in computational cancer biology. As an experienced software developer, he enjoys bridging the worlds of data science and engineering by using cloud and open source software to develop production-ready machine learning solutions. In his spare time, he enjoys developing his own Python packages, contributing to open source projects, and tinkering with electronics.

about the cover illustration

The figure on the cover of Data Pipelines with Apache Airflow is captioned “Femme de l’Isle de Siphanto,” or Woman from the Island of Siphanto. The illustration is taken from a collection of dress costumes from various countries by Jacques Grasset de Saint-Sauveur (1757–1810), titled Costumes de Différents Pays, published in France in 1797. Each illustration is finely drawn and colored by hand. The rich variety of Grasset de Saint-Sauveur’s collection reminds us vividly of how culturally apart the world’s towns and regions were just 200 years ago. Isolated from each other, people spoke different dialects and languages. In the streets or in the countryside, it was easy to identify where they lived and what their trade or station in life was just by their dress.

The way we dress has changed since then and the diversity by region, so rich at the time, has faded away. It is now hard to tell apart the inhabitants of different continents, let alone different towns, regions, or countries. Perhaps we have traded cultural diversity for a more varied personal life—certainly for a more varied and fast-paced technological life.

At a time when it is hard to tell one computer book from another, Manning celebrates the inventiveness and initiative of the computer business with book covers based on the rich diversity of regional life of two centuries ago, brought back to life by Grasset de Saint-Sauveur’s pictures.
