Connecting the dots

So far, we've created just one task. Even on its own, it has some value, as it formalizes the work and the output. Now, let's add a task that collects battle data for a single front. It looks very similar to the previous one; again, we create a class that inherits from luigi.Task:

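# luigi_battles.py
import json

import luigi

from misc import _parse_in_depth
# Assumption: luigi_fronts defines both the ScrapeFronts task and the
# `folder` path object used to build output paths.
from luigi_fronts import ScrapeFronts, folder


class ParseFront(luigi.Task):
    front = luigi.Parameter()

    def requires(self):
        # The scraped fronts must exist before a single front can be parsed.
        return ScrapeFronts()

    def output(self):
        path = str(folder / 'fronts' / (self.front + '.json'))
        return luigi.LocalTarget(path)

    def run(self):
        # self.input() is the output target of the required task.
        with open(self.input().path, 'r') as f:
            fronts = json.load(f)

        front = fronts[self.front]
        result = {}

        for cp_name, campaign in front.items():
            result[cp_name] = _parse_in_depth(campaign, cp_name)

        with self.output().open('w') as f:
            json.dump(result, f)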

We also introduced a few additional elements:

  • First, we use the requires method, which, as we mentioned earlier, defines the prerequisite task.
  • Next, we use the input method, which is tied to prerequisites and represents access to the corresponding data, similar to output().
  • Finally, we added a parameter—this is how the luigi task can be parameterized.
  • Note that output and input objects (targets, really) do have an open method. It is a good idea to use it—you'll see why soon.
Use Luigi parameters! They essentially give you a free command-line interface, which can be of tremendous value. There are quite a few parameter types, allowing you to pass dates, Booleans, and time periods, specify a range or a list of possible values, and so on. Luigi will even parse values according to the expected parameter type. For more information on parameters, check the documentation.

For example, we can add a Boolean flag for production mode so that everything is written to the staging path by default and to the production path only on request. With one line, we get a way to safely run tasks without affecting production. As another example, with date parameters, Luigi can run a task over a range of dates by generating each date within the given range and running the task for each of those days.

Given that, we now can collect data for a specific front:

$ python -m luigi --module luigi_battles ParseFront --front "Eastern Front" --local-scheduler

Finally, let's collect data for all of the fronts, at once:

class ParseAll(luigi.Task):
    fronts = ["African Front", "Mediterranean Front",
              "Western Front", "Atlantic Ocean", "Eastern Front",
              "Indian Ocean", "Pacific Theater", "China Front",
              "Southeast Asia Front"]

    def requires(self):
        # One ParseFront task per front.
        return [ParseFront(front=f) for f in self.fronts]

As you can see, this task overrides neither the run nor the output method, as we don't need them. At the same time, requires returns not one task but many. Luckily, Luigi accepts lists, generators, and even dictionaries as the return value of the requires and output methods.

This process can be viewed as a graph in which each box represents one task and each arrow represents a task dependency. To run all of the tasks, we always run the last task in the graph. The system then checks whether its dependencies are resolved. If they are not, it checks their dependencies, and so on, until it finds tasks that are ready to run. All other tasks then run, one by one.
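The dependency walk can be sketched in a few lines of plain Python. This is a deliberately simplified model and not Luigi's actual scheduler: tasks are plain strings, the dependency graph is a hand-written dictionary containing only two of the fronts, "done" stands in for outputs that already exist, and the graph is assumed to have no cycles.

```python
def run_order(target, requires, done):
    """Return the tasks to run, dependencies first, skipping finished ones."""
    order = []

    def visit(task):
        if task in done or task in order:
            return  # output already exists, or task is already scheduled
        for dep in requires.get(task, []):
            visit(dep)
        order.append(task)

    visit(target)
    return order


# Simplified dependency graph for two of the fronts.
requires = {
    "ParseAll": ["ParseFront(Eastern Front)", "ParseFront(Western Front)"],
    "ParseFront(Eastern Front)": ["ScrapeFronts"],
    "ParseFront(Western Front)": ["ScrapeFronts"],
}

fresh = run_order("ParseAll", requires, done=set())
partial = run_order(
    "ParseAll", requires, done={"ScrapeFronts", "ParseFront(Eastern Front)"}
)
print(fresh)
print(partial)
```

In the second call, only the Western Front task and ParseAll remain, because the other two tasks are treated as already complete.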

Now, as the ParseFront tasks are independent of each other and all of them are heavy on I/O, it may be beneficial to run them in parallel. This is very easy to do with Luigi; you just need to specify the number of workers, as follows:

$ python -m luigi --module luigi_battles ParseAll --workers 2 --local-scheduler

And voilà! We just collected all of the battle information in a robust, production-ready manner. At any step, we can change the code, delete the output file, and re-run luigi. The system will understand which tasks it needs to run and which ones are done already.

We were able to pull data from Wikipedia as a one-time job, but how would we use luigi with scheduled processes? Let's talk about that in the next section.
