Chapter 74. The Importance of Data Lineage

Julien Le Dem

As a data engineer, you become a sort of collector of datasets coming from various sources. The challenge is that datasets don’t just stay pinned on a board behind glass. They have to be maintained and updated in a timely manner. They have to be transformed and adapted to various use cases. They change form over time; all the layers built upon them have to be updated.

As data pipelines pile up, complexity increases dramatically, and it becomes harder to keep things updated reliably in a timely manner. Observing data lineage in all the layers of transformation—from ingestion to machine learning, business intelligence, and data processing in general—provides a critical source of visibility. With this information, the engineer on call can understand what’s happening and resolve problems quickly when a crisis happens.

Lineage describes how a dataset was derived from another one. Operational lineage goes one step further by tracing how and when that transformation happened. It captures information such as:

  • The version of the input that was consumed

  • The subset of the data that was read

  • The version of the code doing the transformation

  • The output’s definition and how each column was derived from the input

  • The time it took to complete and whether it was successful

  • The version of the output that was produced

  • The shape of the output data (schema, row count, distribution, ...)
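To make the list above concrete, here is a minimal sketch of what one operational-lineage record for a single run might look like. The class name `RunRecord`, its field names, and the sample values are all hypothetical and chosen for illustration; they are not taken from any particular lineage tool, and the value distribution of the output is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class RunRecord:
    """One execution of a transformation (hypothetical structure)."""
    input_version: str                    # version of the input that was consumed
    input_subset: str                     # subset of the data that was read
    code_version: str                     # version of the transformation code
    column_lineage: Dict[str, List[str]]  # output column -> input columns it derives from
    duration_seconds: float               # time the run took to complete
    succeeded: bool                       # whether the run was successful
    output_version: Optional[str]         # None if the run produced no output
    output_schema: Dict[str, str] = field(default_factory=dict)  # column -> type
    output_row_count: Optional[int] = None


# Illustrative values only.
run = RunRecord(
    input_version="v41",
    input_subset="event_date = '2021-03-01'",
    code_version="a1b2c3d",
    column_lineage={"session_length": ["start_ts", "end_ts"]},
    duration_seconds=312.5,
    succeeded=True,
    output_version="v17",
    output_schema={"user_id": "string", "session_length": "double"},
    output_row_count=120000,
)
print(run.column_lineage["session_length"])
```

Keeping one such record per run, rather than only the latest, is what makes it possible to compare runs later.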

Tracking how operational lineage changes over time provides the critical information needed to quickly answer many of the questions that arise when data-related problems occur:

My transformation is failing
Did anything change upstream? Did the shape of the input data change? Where did this change originate? Did someone change the logic of the transformation producing it?
My dataset is late
Where is the bottleneck upstream? How can this bottleneck be explained? Has it become slower recently? If yes, what has changed about its definition and its input?
My dataset is incorrect
What changed in the shape of the data? Where upstream did the distribution of this column start drifting? What change in transformation logic is correlated with the change of data shape?
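Many of these questions reduce to comparing the metadata of two runs of the same job. The following is a rough sketch of that comparison, assuming each run's metadata is available as a plain dictionary; the function name `changed_fields`, the keys, and the values are hypothetical.

```python
from typing import Any, Dict, Tuple


def changed_fields(previous: Dict[str, Any],
                   current: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {field: (old_value, new_value)} for every field that differs."""
    keys = previous.keys() | current.keys()
    return {
        key: (previous.get(key), current.get(key))
        for key in sorted(keys)
        if previous.get(key) != current.get(key)
    }


# Illustrative metadata for the last good run and the run under investigation.
last_good = {
    "input_version": "v41",
    "code_version": "a1b2c3d",
    "row_count": 120000,
    "schema": {"user_id": "string"},
}
suspect = {
    "input_version": "v42",
    "code_version": "a1b2c3d",
    "row_count": 95000,
    "schema": {"user_id": "string"},
}

changes = changed_fields(last_good, suspect)
for name, (old, new) in changes.items():
    print(f"{name}: {old} -> {new}")
```

In this made-up case the code version is unchanged while the input version and row count differ, which would point the investigation upstream rather than at the transformation logic.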

OpenLineage is the open source project standardizing lineage and metadata collection across the data ecosystem. With a lineage-collection platform in place, a data engineer can move fast and fix things fast. They can quickly perform impact analysis to prevent new changes from breaking things by answering such questions as: Will this schema change break downstream consumers? Who should be informed about this semantic change? Can we deprecate this dataset?
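Impact analysis of this kind can be pictured as a walk over the lineage graph. The sketch below assumes lineage has already been collected into a simple mapping from each dataset to the datasets derived directly from it; the function name `downstream_of` and the dataset names are hypothetical, and this does not use or represent the OpenLineage API.

```python
from collections import deque
from typing import Dict, List


def downstream_of(lineage: Dict[str, List[str]], start: str) -> List[str]:
    """Return every dataset derived, directly or indirectly, from `start`."""
    seen = {start}          # guards against revisiting nodes, including cycles
    order: List[str] = []
    queue = deque([start])
    while queue:
        dataset = queue.popleft()
        for consumer in lineage.get(dataset, []):
            if consumer not in seen:
                seen.add(consumer)
                order.append(consumer)
                queue.append(consumer)
    return order


# Illustrative graph: dataset -> datasets that read from it.
lineage = {
    "raw.events": ["clean.events"],
    "clean.events": ["daily_sessions", "ml.features"],
    "daily_sessions": ["bi.dashboard"],
}

affected = downstream_of(lineage, "raw.events")
print(affected)
```

An empty result for a dataset suggests that nothing recorded in the graph consumes it, which is one input to a deprecation decision.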

Lineage is also the basis of many other data-related needs:

Privacy
Where is my users’ private data consumed? Is it used according to user consent?
Discovery
What datasets exist, and how are they consumed? How is this dataset derived from others? Who owns it?
Compliance
Can I prove that my reporting is correctly derived from my input data?
Governance
Am I using the data correctly?

As the number of datasets and jobs grows within an organization, these questions quickly become impossible to answer without collecting data-lineage metadata. This knowledge is the foundation that makes data manageable at scale.
