Introducing the ETL pipeline

Data pipelines are important and ubiquitous. Even organizations with a small online presence run their own jobs: thousands of research facilities, meteorological centers, observatories, hospitals, military bases, and banks all run their internal data processing.

Another name for a data pipeline is ETL, which stands for Extract, Transform, and Load: the three conceptual pieces of every pipeline. At first glance, the task may sound trivial. Most of our notebooks are, in a way, ETL jobs: we load some data, work with it, and then store it somewhere. However, building and maintaining a good pipeline requires a thorough and consistent approach. Processes should be reliable, easy to re-run, and reusable. A particular task shouldn't run more than once, nor should it run while its dependencies are unsatisfied (say, while other tasks haven't finished yet).

This is not, however, anything new: the market for ETL and the corresponding programmatic solutions is well established. There are quite a few frameworks available, both enterprise and open source. To name a few, there is Airflow, Pinball, Azkaban, Bubbles, and a dozen others. There is also a new kid on the block, Prefect. Airflow is arguably the most popular at the moment. Developed at Airbnb, Airflow runs arbitrary tasks (not necessarily written in Python) on a cluster and orchestrates them from a web dashboard, so the user does not need to have anything installed on their machine. Airflow is a great tool, but it takes significantly more effort to deploy and maintain, which makes it a hard choice for this book, or for any simple, one-off pipeline in general.

In this chapter, we'll use luigi, a relatively lightweight and flexible framework that lets you run pipelines locally, with no remote server required and with comparatively little overhead. Because of that, luigi is easy to experiment with and can be very useful for structuring even a small process, for example, training a machine learning model.

Let's start our introduction with a conceptual overview. luigi is based on a few core principles:

  • Every project should be represented as a Directed Acyclic Graph (DAG), where each node represents one logical step in the process, usually called a task. Edges define the sequence and dependencies of the tasks: some tasks depend on the completion of one or more others.
  • Tasks can be parameterized (say, run for a specific date) and can send information (including those parameters) to each other.
  • For every task, there is a simple and unambiguous way to check whether the task is complete. The scheduler will not run the same task twice.

In luigi, DAGs are not predefined: you operate on tasks, specifying for each one its outcomes (say, the file or table it will store its data in) and its dependencies, that is, which tasks must finish before this one can run. The following is a basic form of a luigi task. As you can see, it inherits from a template class and has one parameter, date, and two methods, output and run:


import datetime

import luigi


class MyTask(luigi.Task):
    date = luigi.DateParameter(default=datetime.date(2019, 6, 1))

    def output(self):
        return luigi.LocalTarget(f'./data/data/{self.date:%Y/%m-%d}.csv')

    def run(self):
        # do stuff
        # ...
        data.to_csv(self.output().path)

Luigi is meant to be used via class inheritance. On execution, luigi will do the following:

  1. Check whether MyTask is complete, using its complete method, built into luigi.Task. By default, it returns True if the output file exists.
  2. If not, luigi will check whether MyTask has any dependencies, defined in the requires method. As we didn't override the method, luigi uses the one from the template class, which returns no dependencies. If a task does have dependencies and they are not complete, the scheduler will process them first (going back to the start of this list for each). Once they have run successfully, the scheduler will eventually come back to this task.
  3. As MyTask doesn't have any dependencies, it will be executed immediately by calling its run method.
  4. The parameter (date) will be encoded in the output path.

For each task we add, the key is to define (override) the run, requires, and output methods and, if necessary, complete (for example, if the task produces more than one file and you want to judge completeness by the existence of a particular one).
