Redesigning your code as a pipeline

But how can you define a pipeline? How should you split your code into steps to keep it both cost-efficient and low-maintenance? From our experience, it mainly depends on a combination of two factors:

  • How reliable is the data source?
  • How critical is the information, and will the data still be available to re-pull later?

As a rule of thumb, if the dataset is external and therefore unreliable (for example, a Wikipedia page or an open data portal), we'd recommend splitting your ingestion pipeline into distinct steps. In the first step, collect the data exactly as it is provided: for instance, store the whole HTML page. For data feeds, use JSON or CSV, formats with no strict schema. Once the raw data is stored, you can extract the clean data from it. Even if something goes wrong during extraction, you still have the original on hand.

If the source is reliable (for example, internal data with a defined schema) or the dataset can be re-pulled at any time, you can probably wrap your code into a single task. Even then, you may want to keep one task per logical step, so that each task can be reused for other purposes or as part of a different pipeline.
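For a reliable source, the pipeline can collapse into one task while still keeping each logical step as its own function for reuse. A minimal sketch, in which `pull_rows` and `total_value` are illustrative names and the in-memory rows stand in for a schema-backed internal query:

```python
def pull_rows() -> list[dict]:
    # Reliable internal source with a defined schema; cheap to re-pull,
    # so there is no need to archive a raw copy first.
    return [{"id": 1, "value": 10}, {"id": 2, "value": 20}]

def total_value(rows: list[dict]) -> int:
    # Kept as a separate function so the same step can be reused
    # in another pipeline, even though it runs inside a single task.
    return sum(row["value"] for row in rows)

def ingest_task() -> int:
    # The single task: pull and transform in one go.
    return total_value(pull_rows())

result = ingest_task()
```

The single task keeps orchestration overhead low, while the per-step functions preserve the option to split the pipeline later if the source becomes less reliable.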

Enough talk! Let's build a task of our own!
