Chapter 74. The Importance of Data Lineage

Julien Le Dem

As a data engineer, you become a sort of collector of datasets coming from various sources. The challenge is that datasets don’t just stay pinned on a board behind glass. They have to be maintained and updated in a timely manner. They have to be transformed and adapted to various use cases. They change form over time; all the layers built upon them have to be updated.

As data pipelines pile up, complexity increases dramatically, and it becomes harder to keep everything updated reliably and on time. Observing data lineage across all the layers of transformation (from ingestion to machine learning, business intelligence, and data processing in general) provides a critical source of visibility. With this information, the engineer on call can understand what’s happening and resolve problems quickly during a crisis.

Lineage describes how a dataset was derived from another one. Operational lineage goes one step further by tracing how and when that transformation happened. It captures information such as:

The version of the input that was consumed
The subset of the data that was read
The version of the code doing the transformation
The output’s definition and how each column was derived from the input
The time it took to complete and whether it was successful
The version of the output that was produced
The shape of the output data (schema, row count, distribution, ...)
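
To make the list above concrete, here is a minimal sketch of what one operational lineage record for a single job run might look like. The class names, field names, and sample values (`DatasetVersion`, `RunRecord`, `daily_revenue`, and so on) are hypothetical and chosen only for illustration; they are not taken from the chapter or from any particular lineage tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DatasetVersion:
    """One version of a dataset, together with its shape."""
    name: str
    version: str
    schema: dict      # column name -> type
    row_count: int


@dataclass(frozen=True)
class RunRecord:
    """Operational lineage for one run of one transformation (hypothetical layout)."""
    job_name: str
    code_version: str        # version of the code doing the transformation
    inputs: tuple            # DatasetVersion objects that were consumed
    input_filter: str        # subset of the data that was read
    output: DatasetVersion   # version and shape of what was produced
    column_lineage: dict     # output column -> input columns it derives from
    started_at: datetime
    duration: timedelta
    succeeded: bool


# Made-up sample values, purely for illustration.
orders = DatasetVersion(
    name="staging.orders",
    version="v41",
    schema={"order_id": "int", "amount": "float", "currency": "string"},
    row_count=10_000,
)
revenue = DatasetVersion(
    name="marts.revenue",
    version="v17",
    schema={"day": "date", "revenue": "float"},
    row_count=1,
)
run = RunRecord(
    job_name="daily_revenue",
    code_version="a1b2c3",
    inputs=(orders,),
    input_filter="order_date = '2021-03-01'",
    output=revenue,
    column_lineage={"day": ("order_date",), "revenue": ("amount", "currency")},
    started_at=datetime(2021, 3, 2, 1, 0),
    duration=timedelta(minutes=4),
    succeeded=True,
)
```

Storing one such record per run is what makes it possible to compare runs over time rather than only knowing that one dataset depends on another.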

Tracking operational lineage changes over time provides the critical information needed to quickly answer many questions arising when data-related problems occur:

My transformation is failing: Did anything change upstream? Did the shape of the input data change? Where did this change originate? Did someone change the logic of the transformation producing it?
My dataset is late: Where is the bottleneck upstream? How can this bottleneck be explained? Has it become slower recently? If yes, what has changed about its definition and its input?
My dataset is incorrect: What changed in the shape of the data? Where upstream did the distribution of this column start drifting? What change in transformation logic is correlated with the change of data shape?
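
As an illustration of the first two groups of questions, the sketch below compares the last successful run of a job with a failing one and reports what changed. It assumes each run is stored as a plain dictionary with hypothetical keys (`code_version`, `input_schema`, `duration_s`), and the "twice as slow" threshold is an arbitrary choice made for the example.

```python
def diff_runs(previous: dict, current: dict) -> list:
    """Describe what changed between two runs of the same job."""
    changes = []

    # Did someone change the logic of the transformation?
    if previous["code_version"] != current["code_version"]:
        changes.append(
            f"code version: {previous['code_version']} -> {current['code_version']}"
        )

    # Did the shape of the input data change?
    prev_schema = previous["input_schema"]
    curr_schema = current["input_schema"]
    for column in sorted(prev_schema.keys() | curr_schema.keys()):
        before, after = prev_schema.get(column), curr_schema.get(column)
        if before != after:
            changes.append(f"input column {column}: {before} -> {after}")

    # Has it become slower recently? (arbitrary 2x threshold, an assumption)
    if current["duration_s"] > 2 * previous["duration_s"]:
        changes.append(
            f"duration: {previous['duration_s']}s -> {current['duration_s']}s"
        )

    return changes


# Made-up sample runs, purely for illustration.
last_good = {
    "code_version": "a1b2c3",
    "input_schema": {"user_id": "int", "amount": "float"},
    "duration_s": 120,
}
failing = {
    "code_version": "a1b2c3",
    "input_schema": {"user_id": "string", "amount": "float", "currency": "string"},
    "duration_s": 130,
}

changes = diff_runs(last_good, failing)
for change in changes:
    print(change)
```

Here the code version is unchanged, so the report points at the input: a new `currency` column appeared and `user_id` changed type upstream.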

OpenLineage is the open source project standardizing lineage and metadata collection across the data ecosystem. With lineage and metadata collection in place, a data engineer can move fast and fix things fast. They can quickly perform impact analysis to prevent new changes from breaking things by answering such questions as: Will this schema change break downstream consumers? Who should be informed about this semantic change? Can we deprecate this dataset?
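
Impact analysis of this kind amounts to walking the lineage graph downstream from the dataset being changed. The following is a minimal sketch that assumes lineage has already been collected into a simple mapping from each dataset to the datasets derived directly from it. The dataset names and the `downstream` helper are hypothetical, and this is not the OpenLineage API.

```python
from collections import deque

# Hypothetical lineage edges: dataset -> datasets derived directly from it.
LINEAGE = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["marts.revenue", "ml.churn_features"],
    "marts.revenue": ["dashboards.finance"],
    "ml.churn_features": [],
    "dashboards.finance": [],
}


def downstream(graph: dict, start: str) -> list:
    """Return every dataset reachable from `start`, in breadth-first order."""
    seen = {start}
    queue = deque([start])
    affected = []
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected


# Which consumers could a schema change to staging.orders break?
affected = downstream(LINEAGE, "staging.orders")
print(affected)
```

A dataset for which `downstream` returns an empty list has no recorded consumers, which is one signal that it may be a candidate for deprecation.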

Lineage is also the basis of many other data-related needs:

Privacy: Where is my users’ private data consumed? Is it used according to user consent?
Discovery: What datasets exist, and how are they consumed? How is this dataset derived from others? Who owns it?
Compliance: Can I prove that my reporting is correctly derived from my input data?
Governance: Am I using the data correctly?

As the number of datasets and jobs grows within an organization, these questions quickly become impossible to answer without collecting data-lineage metadata. This knowledge is the strong foundation that allows data to be manageable at scale.
