53. Data Quality

Data quality often sucks. A woefully common approach to this problem is to seek better software to process the data.

It’s not unusual for the quality of database software to exceed the quality of the data it processes, yet from the end-user’s viewpoint, system quality is limited by the lesser of the two. Companies everywhere are faced with databases full of inaccuracies and out-of-date or missing information. The problem is as obvious as the nose on your face, but like your own nose, it can be difficult to see. It’s hard for companies to come directly to grips with their own data-quality problems, though nobody has trouble seeing the other guy’s. What companies tend to see instead is a problem in the aggregate of software + data. Since the software is always easier to fix than the data (there is just so awfully much data), companies set out to fix or replace the software.

As none of this makes much sense, the essential thing to discuss here is not why we shouldn’t do it, but why we do it even though we shouldn’t. Part of the reason is a special instance of news improvement (see Pattern 45): The bad news that 2.4 percent of this month’s invoices were returned as undeliverable makes its way up the hierarchy, being greeted at each level with the angry question, “Well, what the hell are you going to do about this and damn quick?”

The damn-quick part immediately precludes extensive manual fixing. The vague answer is that a serious “data cleansing” effort will be started pronto. This charming little phrase means different things as it moves up toward the CEO level. At the bottom of the hierarchy, data cleansing means getting on the phone and Internet and poring over correspondence files to research and correct each separate bad datum. At the top, it means working smarter, somehow teasing out the right data by cleverly processing the bad data. Since funding comes from the top, the funds that are allocated are typically tied to the working smarter approach rather than to a small army of clerks to do the real work.

It’s worth pointing out that data can be corrupted (for example, by incorrect computing), and in this case, there are some at least partially automated ways to undo the damage by retrieving earlier backed-up versions. Similarly, when the same data are separately recorded in multiple systems, some automated data cleansing can help to isolate the better variant. In both cases, automated data cleansing depends on an ability to exploit data redundancy. While it’s easy to imagine an example of redundancy coming to our rescue (System A has an old address, but here’s a break: System B has the new one), real instances of poor data quality that can be automated away are few and far between.
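To make the redundancy argument concrete, here is a minimal sketch of what automated reconciliation across two systems might look like. It assumes a hypothetical record format in which every field carries a last-updated timestamp, and it simply lets the fresher value win; the names (Field, reconcile) are illustrative, not drawn from the text. Note what the sketch quietly presupposes: trustworthy timestamps and genuinely redundant records, which is exactly why such opportunities are rare in practice.

```python
# A minimal sketch of redundancy-based reconciliation, assuming a
# hypothetical record format where each field knows when it was last
# updated in its source system. Where two systems hold the same
# customer, the more recently updated field wins.

from dataclasses import dataclass
from datetime import date


@dataclass
class Field:
    value: str
    updated: date  # when this field was last touched in its source system


def reconcile(a: dict[str, Field], b: dict[str, Field]) -> dict[str, Field]:
    """Merge two redundant records, preferring the fresher field."""
    merged: dict[str, Field] = {}
    for key in a.keys() | b.keys():
        fa, fb = a.get(key), b.get(key)
        if fa is None:
            merged[key] = fb
        elif fb is None:
            merged[key] = fa
        else:
            merged[key] = fa if fa.updated >= fb.updated else fb
    return merged


# System A has the old address; here's the break: System B has the new one.
system_a = {"address": Field("12 Old Road", date(2019, 3, 1))}
system_b = {"address": Field("7 New Avenue", date(2023, 6, 15))}

print(reconcile(system_a, system_b)["address"].value)  # -> 7 New Avenue
```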

The major cause of declining data quality over time is change. This spoilage of the asset we call “corporate data” can be repaired only by manual fixes. Imagining otherwise just puts off the day of reckoning.
