As a data engineer, you become a sort of collector of datasets coming from various sources. The challenge is that datasets don’t just stay pinned on a board behind glass. They have to be maintained and updated in a timely manner. They have to be transformed and adapted to various use cases. They change form over time; all the layers built upon them have to be updated.
As data pipelines pile up, complexity increases dramatically, and it becomes harder to keep things updated reliably in a timely manner. Observing data lineage in all the layers of transformation—from ingestion to machine learning, business intelligence, and data processing in general—provides a critical source of visibility. With this information, the engineer on call can understand what’s happening and resolve problems quickly when a crisis happens.
Lineage provides the understanding of how a dataset was derived from another one. Operational lineage goes one step beyond by tracing how and when that transformation happened. It captures information such as:
The version of the input that was consumed
The subset of the data that was read
The version of the code doing the transformation
The output’s definition and how each column was derived from the input
The time it took to complete and whether it was successful
The version of the output that was produced
The shape of the output data (schema, row count, distribution, ...)
Tracking operational lineage changes over time provides the critical information needed to quickly answer many questions arising when data-related problems occur:
OpenLineage is the open source project standardizing lineage and metadata collection across the data ecosystem. With a data-collection platform in place, a data engineer can move fast and fix things fast. They can quickly perform impact analysis to prevent new changes from breaking things by answering such questions as: Will this schema change break downstream consumers? Who should be informed about this semantic change? Can we deprecate this dataset?
Lineage is also the basis of many other data-related needs:
As the number of datasets and jobs grows within an organization, these questions quickly become impossible to answer without collecting data-lineage metadata. This knowledge is the strong foundation that allows data to be manageable at scale.
18.221.13.173