Chapter 5
Data Quality

I have a confession to make: I have a tattoo. Okay, I have five tattoos, but one of them has recently become more relevant to me. Let me take you back twenty years, to 1997. I had just started my master’s program and on the very first day of class, my professor (the incomparable Lou Milanesi, Ph.D.) wrote two things on the whiteboard:

“There is no such thing as a free lunch” and “X = T + E”

The first one was self-explanatory; the second though, for us rookie grad students, required some explanation. Simply: the observed score (X) equals the true score (T), plus error (E). In layman’s terms, the key takeaway is this: whatever you do with data, consider the possibility of error. This resonated with me deeply. This relatively simple concept is at the heart of applied statistics; it drove most of my education for those two years. But there was something else – something almost philosophical – inherent in that simple X = T + E equation. I ruminated on this idea so much that one day, on little more than a whim and a few extra dollars, I finally stopped at the tattoo shop that was nearest my favorite watering hole. With just a couple of characters’ worth of ink, I permanently committed to this simple idea: in whatever you do, consider the possibility of error.

Fast-forward a few decades and I sometimes forget the tattoo is even there. But the concept remains strong; in all things we do, we should consider the possibility of error, shouldn’t we? And as someone who is permanently attached to the philosophy, it’s high time I dust it off and see if it still has legs.

Error in data is ubiquitous. At the risk of triggering an existential crisis, it’s almost as if data isn’t data without some level of error in it, at least at the macro level. We spend a lot of time hand-wringing about data quality, yet when we find ourselves in a time crunch, our quality assurance procedures are often the first things we skimp on. When there’s time to think and work rationally, people generally choose “good” from the triad of “Good, Fast or Cheap” because they feel it’s the right choice. But in the real world, where time is short and the pressure is high, fast has become the default.

In almost every organization I have ever worked in, whether as a consultant or as an employee, someone has told me that data quality is a problem. Scroll through any social media trail about data and eventually you will trip over a data quality thread. Good data quality is imperative to almost anything we do in machine learning and artificial intelligence. Anecdotal statistics abound about the time people spend “cleaning the data.” Just ask analysts how much time they spend cleaning the data before they can use it; most will tell you it’s more than fifty percent for any given project, and it more often aligns with the 80/20 rule.

The Data Quality Imperative

Let’s just all agree we need the data to be of high quality, shall we? It might be the last thing we agree on, but we can agree on that much. Good data quality leads to better, faster and more scalable analytic solutions. If we can’t claim good data quality, then what’s the point of creating a data governance program? Yet we find ourselves in an interesting predicament. We all agree it’s important, but in terms of how to get there, I see a gap the size of the Grand Canyon.

This chapter almost didn’t make it into the book. I was finished writing when I was forced to pause and reconsider. I was documenting recommendations for data warehouse testing for a client and I wanted to include reinforcing literature from third parties. My focus was specifically on what you test in a data warehouse and how you test it. I turned to my good friend Google and sat dumbfounded. The actionable, pragmatic content on data quality in a data warehouse is paltry. It’s embarrassing that, in an industry as well established as ours, practical guidance on a topic such as data quality in a data warehouse is nearly non-existent.

Now, before you come at me on Twitter with reference links, I know there’s content out there on this product or that product and consultants (me included) have blogs on the subject of data quality. I included a number of books in the appendix as well. But what I was looking for was an actual example that a rookie data warehouse architect could use as a starting point for implementing his or her own repeatable, scalable methodology for data quality. Suddenly, it occurred to me that the reason data analysts spend so much time on data quality is that it seems like no one else has.

This has also been reflected in my personal experiences. I have encountered more than one data warehouse that had virtually no data quality processes associated with it, except for maybe row counts. It’s important to note, however, that there is a high level of academic content on data quality standards. What seems to be missing is something in between high-level academic fodder and articles on specific quality tests. This chapter will explore the data quality challenges we face in the context of data governance, and more specifically in getting the (high-quality) data out to our average end users.

In all likelihood, some of the lack of content on data quality procedures is related to the overwhelming nature of the subject itself, particularly for a modern data warehouse. The volume and velocity of data that comes at us every day is like a tsunami. With our stakeholders and end users already impatiently waiting, are we really going to tell them that the data has to pass five or six tests before they can have it? Or the tests that we have built into the transformation code start to slow down the loading processes. Scenarios abound with issues related to how you implement data quality in a data warehouse.


“The road to hell is paved with good intentions.”

Proverb


Let’s organize this thing and take all the fun out of it, shall we? First, a definition: “Data has quality if it satisfies the requirements of its intended use. It lacks quality to the extent that it does not satisfy the requirement.” (Olson, Jack. Data Quality: The Accuracy Dimension. Morgan Kaufmann, 2003.) I don’t care if you have the smartest people, the best technology, cutting-edge methods and a bottomless cup of coffee: you can’t measure intention. Intention does not have a baseline for comparison. It is a hope, a target, usually known only to the person - or in our case, the analyst.

I am a begrudging former data analyst. I can tell you from the years I spent as a mouse jockey that the intention I had when I started an analysis and the end result were often two different things. I can also tell you that if we are truly doing data exploration, the intention should be “I don’t know yet.” I can understand what Jack Olson was getting at, yet intention is not what we should measure data quality against - it’s actually context. It’s a judgment of fitness for purpose that can and should be objective. But here’s the challenge, and it’s the same quandary we find ourselves in with data governance: how do we hit a moving target? Context and fit-for-purpose change as situations change. In a standard measurement situation, I would create a baseline and measure the current state against it to get the delta. But if the baseline (our context) changes, how can I objectively assess the delta?

What we test

It is broadly known that there are six aspects of data quality. There are a lot of articles on these dimensions; I prefer one from the CDC based on the DAMA UK work. Depending on the article you read, the dimensions may have slightly different names, but they are fundamentally the same:

  • Completeness
  • Uniqueness
  • Timeliness
  • Validity
  • Accuracy
  • Consistency

You can, in all reasonableness, create standard tests for these dimensions and apply them to your data warehouse. What you will find, though, is that they do not address the gap we have explored: context and fit-for-purpose. Accuracy attempts to get close with a definition often referencing the need for the data to reflect the “real world,” but it does not address how people want to use the data.
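
To make the idea of standard tests concrete, here is a minimal sketch in Python (using pandas) of what checks for these six dimensions might look like against a single table. The column names (patient_id, admit_ts, discharge_ts, unit), the approved unit codes, and the business keys are all hypothetical; your own rules would come from your governance program and from reconciliation to your source systems.

import pandas as pd

def run_dimension_checks(df: pd.DataFrame) -> dict:
    """Illustrative checks for the six data quality dimensions.
    Column names, approved values, and keys are hypothetical; replace
    them with the rules your governance program has approved.
    """
    results = {}
    # Completeness: how much of each required field is populated?
    required = ["patient_id", "admit_ts", "unit"]
    results["completeness"] = {c: 1.0 - df[c].isna().mean() for c in required}
    # Uniqueness: count duplicate business keys.
    results["uniqueness_duplicates"] = int(
        df.duplicated(subset=["patient_id", "admit_ts"]).sum())
    # Timeliness: how stale is the newest record, in hours?
    latest = pd.to_datetime(df["admit_ts"]).max()
    results["timeliness_hours"] = (pd.Timestamp.now() - latest).total_seconds() / 3600
    # Validity: values conform to an approved domain.
    valid_units = {"ICU", "MED_SURG", "ED"}
    results["validity_violations"] = int((~df["unit"].isin(valid_units)).sum())
    # Accuracy: reconcile against a trusted reference, e.g., source-system
    # row counts (stubbed here, since the reference lives outside this table).
    results["accuracy_note"] = "reconcile row counts to the source system"
    # Consistency: the same fact agrees with itself across fields.
    results["consistency_violations"] = int(
        (pd.to_datetime(df["discharge_ts"]) < pd.to_datetime(df["admit_ts"])).sum())
    return results

None of this is exotic; the point is that each dimension reduces to a measurable, repeatable check once a baseline exists.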

The definition we can apply for “context” in our six dimensions of data quality is: “The data has a standard, approved definition with an associated algorithm.” This should reflect the business context within which the data lives. It provides us with a standard, an algorithm, an objective testing baseline, and the ability to look at it and say, “No, I’m not using that definition.” I’ve said this a few times, but there’s a big difference between how a nurse manager defines a patient and how a finance manager defines a patient, and for good reason: the purposes of these two roles are vastly different. The other benefit of adding “context” as one of the dimensions of good data quality is the ability to apply it as a standard test to the data warehouse. Traditional data governance programs often attempted exactly this, and when it is achieved it does raise the quality of the data coming out of the data warehouse.
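
Here is an equally minimal sketch of what a “context” test might look like, assuming a hypothetical encounters table with admit_ts and discharge_ts columns: the approved definition (a person in a bed at midnight) is written down as an algorithm, which gives us an objective baseline to test the warehouse’s published metric against.

import pandas as pd

# Hypothetical encoding of one approved definition of "patient":
# a person occupying a bed at midnight on the census date.
def midnight_census(encounters: pd.DataFrame, census_date: str) -> int:
    """Count patients per the approved midnight-census definition."""
    midnight = pd.Timestamp(census_date) + pd.Timedelta(days=1)  # 00:00 the next day
    admitted = pd.to_datetime(encounters["admit_ts"]) < midnight
    still_in = (encounters["discharge_ts"].isna()
                | (pd.to_datetime(encounters["discharge_ts"]) >= midnight))
    return int((admitted & still_in).sum())

# A context test then compares the warehouse's published census metric for
# the same date against the value this approved algorithm produces.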

Unfortunately, context isn’t enough to fully address the challenges we have with data quality. The trouble we find ourselves in when creating standard definitions as part of our data governance efforts is that the enforced standard definition may be in conflict with the intended use of the data. Or, an alternative way of thinking about it: fit-for-purpose (FFP).

For an analyst, there will always be reasons why the data needs to be reviewed and “cleaned,” even in cases where good data quality methods have been applied. There’s a certain je ne sais quoi, an indefinable quality, that analysts commonly look for in a dataset - a sort of “sniff test” to assess a dataset’s ability to address the question(s) they’re attempting to answer. It’s not uncommon at this point for analysts to take the dataset and begin a first round of simple analysis, thinking through all potential variables.

This is also the point where the analyst may choose not to follow the standard definition of a metric because it does not fit the purpose. The dataset assessment should identify the minimum fields to which the analyst must apply their algorithms, and sometimes the analyst doesn’t know what these are until they are deep into exploration. There is an art to analyzing data, particularly when open questions like “why is our volume so low?” are presented. Off the cuff, analysts will know roughly which data fields are required, but until they fully explore the data and the questions, they won’t know exactly what they are looking for or what they need from the data.

Earlier we used the example of the “patient,” which is defined differently by people in two different roles (the nurse manager and the finance manager). Often the definition of a patient is dependent on time; analysts consider whether there was a person in a bed at midnight on some given day. That temporal definition helps the finance manager ensure they can charge for a full day’s stay, and it helps the nurse manager plan for staffing.

The trouble begins when we go any deeper than that surface definition and start to ask questions like “Why was volume low on that unit compared to last year?” What about the scenario in which the finance manager has to use risk scores to forecast how many patients the hospital will have, and how sick those patients might be, so they can plan for how much financial risk the organization can manage? What about the type of staffing a nurse manager with many complicated patients needs? We must take into consideration not just the number of staff, but also the level of staff (RN vs. LPN). Throw in the infectious disease team; they care less about “patient” status and a whole lot more about whether the individual was on a particular unit at a particular time. You can see how quickly a standard definition falls apart. In any of these scenarios the data would have retained its completeness, validity, accuracy, consistency, timeliness and uniqueness; it just failed the fit-for-purpose test. While context is something we can add to a data warehouse testing methodology, fit-for-purpose may always be the one thing that a person (e.g., an analyst) has to assess.

How we test

Now that we have expanded our list of what to test (adding context and fit-for-purpose), we must determine how we test. I highly recommend reading the DataOps Cookbook from DataKitchen; it does a nice job of framing data quality tests in a DataOps way. The assumption is that you will be using some type of agile practice, though the process you choose doesn’t necessarily have to be DataOps.

Let us pause here to make an important distinction. Just because we know we have to test for validity, for instance, doesn’t mean we know exactly how to execute that test. To make this determination, we turn to the traditional best practices in data quality assurance. Examples of tests used in quality assurance are:

  • Unit Testing
  • Integration Testing
  • Functional Testing
  • Regression Testing

The first step is unit testing, which tests the smallest incremental pieces of the deliverable. It’s often done during development by the developer, but it is recommended that another set of eyes examine the code before you ship it. Integration testing focuses on the interactions between code packages; it looks for integrations or interactions that break one another. Functional tests feed data into our code package and assess the output, searching for unanticipated results. Finally, regression tests are specific to changes in code, attempting to isolate those changes to ensure they still produce the expected results.

Now we look to combine the types of tests we need to run and how we execute them so that we cover the dimensions of data quality. First, we have to consider the different layers of a modern data warehouse: integration, staging, the data repository, and an analytic sandbox. Not all tests are critical across all of these environmental layers, and for each layer there are different tests.
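
As an example of how these test types translate to a data warehouse, here is a minimal sketch of a unit test and a functional test for a small, made-up transformation (standardize_unit), written with pandas in a pytest style; the mapping table and test values are illustrative only.

import pandas as pd

def standardize_unit(raw: pd.Series) -> pd.Series:
    """Hypothetical transformation: map free-text unit names to approved codes."""
    mapping = {"intensive care": "ICU", "icu": "ICU", "med/surg": "MED_SURG"}
    return raw.str.strip().str.lower().map(mapping)

# Unit test: the smallest piece of the deliverable behaves as specified.
def test_standardize_unit_maps_known_values():
    raw = pd.Series([" ICU ", "Intensive Care", "Med/Surg"])
    assert standardize_unit(raw).tolist() == ["ICU", "ICU", "MED_SURG"]

# Functional test: feed data through the package and look for unanticipated
# output - here, making sure unknown values surface as nulls we can catch.
def test_standardize_unit_flags_unknown_values():
    raw = pd.Series(["telemetry"])
    assert standardize_unit(raw).isna().all()

Integration and regression tests follow the same pattern at a larger grain: run the packages together, or rerun the suite after a change, and compare against the expected results.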

The second section lists which tests you use for each aspect of data quality, based on the layer of the modern data platform and the phase you’re in: building, automating, or monitoring. These are just suggestions based on what we have reviewed in this chapter. These tables are exactly what I was looking for when I started that Google search. Modify them to reflect your own environments and the tests that you complete today. You can also find a combined table in the appendix.

There are just more “hooks”

Even if we have the most thoroughly tested data warehouse, with all six dimensions tested and with context and fit-for-purpose added as two new aspects of data quality, no analyst is off the hook from data testing. Thankfully, it is relatively easy to make sure your testing process runs as efficiently and as smoothly as possible. To do this, your first task is to create a vigorous and automated testing schedule for the data environments; automating as many of the tests as possible will alleviate some of the pressure. Next, create a solid plan for context and fit-for-purpose testing, keeping in mind that context testing of the data will change as WDs change. Presenting the quality testing in an easy-to-understand manner, such as a dashboard, will help to create a tight alliance between the analytics team and the quality assurance team.
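
What an automated schedule might look like in practice is, again, a sketch rather than a prescription: the driver below simply runs whatever checks you register (such as the dimension checks earlier in this chapter) and appends the results somewhere a dashboard can read them. The function names and the JSON-lines output are assumptions, not a required design.

import datetime as dt
import json

def run_nightly_quality_checks(checks: dict, results_path: str = "dq_results.jsonl") -> None:
    """Run each registered check and append its result for dashboarding."""
    run_at = dt.datetime.now().isoformat()
    with open(results_path, "a", encoding="utf-8") as out:
        for name, check in checks.items():
            record = {"check": name, "run_at": run_at, "result": check()}
            out.write(json.dumps(record) + "\n")

# Usage sketch: each entry wraps one automated test; context and
# fit-for-purpose reviews stay with the analysts.
# run_nightly_quality_checks({
#     "staging_row_count": lambda: 12345,  # replace with a real query
#     "warehouse_completeness": lambda: {"patient_id": 0.998},
# })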

There is no such thing as 100% data quality. It just can’t exist. There’s too much data and too many ways to use the data. What we should strive for, and something that’s much more feasible, is using the data and then talking about how we used the data so that we can all understand the data better. Better understanding leads to better outcomes, but only if we actively confer with each other.

As the saying goes, the juice has to be worth the squeeze, and sometimes (this is keenly true in healthcare today) we just assume that the work is worth it. Not all data is created equal. Just as we will not govern every variable in our data environments, it’s not feasible to manage every cell of data as those environments keep growing. When I spoke with Steve Johnson on the topic of data quality, he shared a nugget that I wrote down on a sticky note (if you saw my office, you would see how sticky notes run my life): “Data quality depends on how people want to use the data.” As we have seen, we can’t measure intention, but we can find and measure what we know definitively to be true. Understanding that there is a difference between the two, and managing for that difference, is a great approach to data quality standards.

The applicability of my tattoo, X = T + E, is compelling, since it is the primary basis for classical test theory. That said, there is not a one-to-one comparison between testing a data environment for every possible analysis and the attempt to control the variability of error in correlations, but there is some relationship (see what I did there?).

Wrapping it up

Data quality tests are the canary in the coal mine of your data governance processes. If you have good data governance processes, the data quality tests should look stable. Without any quality tests, or without the ability to communicate their results, you lose your early warning system for governance.
