Chapter 20. Data Quality for Data Engineers

Katharine Jarmul

If you manage and deploy data pipelines, how do you ensure they are working? Do you test that data is going through? Do you monitor uptime? Do you even have tests? If so, what exactly do you test?

Data pipelines aren’t too dissimilar to other pipelines in our world (gas, oil, water). They need to be engineered; they have a defined start and end point. They need to be tested for leaks and regularly monitored. But unlike most data pipelines, these “real-world” pipelines also test the quality of what they carry. Regularly.

When was the last time you tested your data pipeline for data quality? When was the last time you validated the schema of the incoming or transformed data, or tested for the appropriate ranges of values (i.e., “common sense” testing)? How do you ensure that low-quality data is either flagged or managed in a meaningful way?

More than ever before—given the growth and use of large-scale data pipelines—data validation, testing, and quality checks are critical to business needs. It doesn’t necessarily matter that we collect 1TB of data a day if that data is essentially useless for tasks like data science, machine learning, or business intelligence because of poor quality control.

We need data engineers to operate like other pipeline engineers—to be concerned about and focused on the quality of what’s running through their pipelines. Data engineers should coordinate with the data science team or implement standard tests themselves. These can be as simple as schema validation and null checks. Ideally, you could also test for expected value ranges and private or sensitive data exposure, or sample data over time for statistical testing (i.e., testing distributions or other properties the data at large should have).
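As a minimal sketch of what those first steps might look like (not a prescribed implementation), here is a batch-level check in Python with pandas. The column names, dtypes, and the 1,000,000 amount ceiling are hypothetical stand-ins; substitute your pipeline’s actual data contract.

```python
import pandas as pd

# Hypothetical schema contract; replace with your pipeline's real columns.
EXPECTED_SCHEMA = {
    "order_id": "int64",
    "amount": "float64",
    "country": "object",
}

def validate_batch(df: pd.DataFrame) -> list[str]:
    """Return human-readable data-quality violations for one batch."""
    errors = []

    # Schema validation: every expected column exists with the right dtype.
    for column, dtype in EXPECTED_SCHEMA.items():
        if column not in df.columns:
            errors.append(f"missing column: {column}")
        elif str(df[column].dtype) != dtype:
            errors.append(f"{column}: expected {dtype}, got {df[column].dtype}")

    # Null checks: flag unexpected missing values in the known columns.
    present = [c for c in EXPECTED_SCHEMA if c in df.columns]
    for column, count in df[present].isnull().sum().items():
        if count > 0:
            errors.append(f"{column}: {count} null values")

    # "Common sense" range check: amounts should be positive and below
    # an assumed sanity ceiling (tune both bounds to your domain).
    if "amount" in df.columns:
        bad = (~df["amount"].between(0, 1_000_000)).sum()
        if bad:
            errors.append(f"amount: {bad} values outside [0, 1,000,000]")

    return errors
```

A pipeline step can then decide what to do with the returned violations: quarantine the batch, route it to a dead-letter table, or fail loudly before bad data propagates downstream.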

And the neat thing is, you can use your data knowledge and apply it to these problems. Do you know how many extreme values, anomalies, or outliers your pipeline should expect today? Probably not. But could you? Why, yes, you could. Tracking, monitoring, and inferring the types of errors and quality issues you see in your pipelines or processing is—in and of itself—a meaningful data science task.
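As one hedged illustration of that idea, a scheduled job might compare today’s sample of a numeric column against a known-good reference sample. SciPy’s two-sample Kolmogorov–Smirnov test is one standard choice for this; the 0.01 significance threshold and the synthetic data below are assumptions for the sketch, not recommendations.

```python
import numpy as np
from scipy.stats import ks_2samp

def distribution_drifted(reference: np.ndarray,
                         current: np.ndarray,
                         alpha: float = 0.01) -> bool:
    """Flag drift when the two samples are unlikely to share a distribution."""
    _, p_value = ks_2samp(reference, current)
    return p_value < alpha

# Hypothetical usage: the reference comes from a known-good week of data,
# while today's batch has quietly shifted upward.
reference_sample = np.random.normal(loc=100.0, scale=15.0, size=5_000)
todays_sample = np.random.normal(loc=130.0, scale=15.0, size=5_000)
if distribution_drifted(reference_sample, todays_sample):
    print("amount distribution drifted; flag today's batch for review")
```

Logging the test statistic over time, rather than just the pass/fail flag, also gives you a history you can mine later—exactly the kind of meaningful data science task described above.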

So please, don’t merely see data coming in and going out and say, “Looks good to me.” Take the time to determine what quality and validation measurements make sense for your data source and destination, and set up ways to ensure that you can meet those standards. Not only will your data science and business teams thank you for the increase in data quality and utility, but you can also feel proud of your title: engineer.
