Chapter 19. Data Quality Analysis Demystified: Knowing When Your Data Is Good Enough

Ken Gleason and Q. Ethan McCallum

Most of the decisions we make in our personal and professional lives begin with a query. That query might be for a presentation, a research project, a business forecast, or simply finding the optimal combination of shipping time and price on tube socks. There are times when we are intuitively comfortable with our data source, or are not overly concerned about the breadth or depth of the answers we get: for instance, when we are looking at movie reviews. Other times, we might care a little more, for example, when estimating requirements for food and water for the Badwater Ultramarathon.[76] Or even for mundane things like figuring out how much of a product to make, or where the production bottlenecks are on the assembly line.

But how do we know when to care and when not to care, and about what? Should you throw away the survey data because a couple of people failed to answer certain questions? Should you blindly accept that your daily sales of widgets in Des Moines seem to quintuple on alternate Fridays? Maybe, maybe not. Much of what you (think you) know about the quality of a given set of data relies on past experience that evolves to intuition. But there are three problems with relying solely on intuition. First, intuition is good at trapping obvious outliers (errors that stick out visibly) but likely won’t do much to track more subtle issues. Second, intuition can be wrong. How do you validate what your gut tells you looks funny (or doesn’t)? Third, as discussed above, intuition relies on evolved experience. What happens when you lack direct domain experience? This is hardly an academic question; in the data business, we are often thrown into situations with brand new problems, new datasets, and new data sources. A systematic approach to data quality analysis can guide you efficiently and consistently to a higher degree of awareness of the characteristics and quality of your data before you spend excessive time making personal or business decisions.

Operating from a data quality framework allows you to:

  • Quit worrying about what you think you know or don’t know about the data.

  • Step outside conventional wisdom about what you need and don’t need, and establish fresh conceptions about your data and its issues.

  • Develop and re-use tools for data quality management across a wider variety of scenarios and applications.

This chapter outlines a conceptual framework and approach for data quality analysis that will hopefully serve as a guide for how you think about your data, given the nature of your objective. The ideas presented here are born from (often painful) experience and are likely not new to anyone who has spent any extended time looking at data; but we hope it will also be useful for those newer to the data analysis space, and anyone who is looking to create or reinforce good data habits.

Framework Introduction: The Four Cs of Data Quality Analysis

Just as there are many angles from which to view your data when searching for an answer, there are many viewing angles for assessing quality. Below we outline a conceptual framework that consists of four facets. We affectionately refer to them as The Four Cs of Data Quality Analysis[77]:

  1. Complete: Is everything here that’s supposed to be here?

  2. Coherent: Does all of the data “add up”?

  3. Correct: Are these, in fact, the right values?

  4. aCcountable: Can we trace the data?

Granted, these categories are fairly broad and they overlap in places. Sometimes a C or two won’t even apply to your situation, depending on your requirements and your place in the data processing chain. Someone interested in gathering and storing the data may have a different view of “complete” than someone who is trying to use the data to build an analysis and drive a decision. That’s fine. We don’t intend the Four Cs to be a universal, airtight methodology. Instead, consider them a starting point for discussion and a structure around which to build your own data quality policies and procedures.

Complete

The notion of a complete dataset is paradoxically difficult to nail down. It’s too easy to say, “it’s a dataset that has everything,” and then move on. But how would we define “everything” here? Is it simply the number of records? Expected field count? Perhaps we should include whether all fields have values—assuming that “lack of value” is not, in and of itself, a value. (This doesn’t yet cover whether the values are all valid; that’s a matter of Correctness, which we describe below.) This leads us down the path of learning which fields are necessary, versus nice to have.

Most of these questions stem from the same root, and this is the very nature of completeness in data:

Do I have all of the data required to answer my question(s)?

Even this can be tricky: you often first have to use the data to try to answer your question, and then verify any results, before you know whether the data was sufficient. You may very well perform several iterations of this ask-then-verify dance. The point is that you should approach your initial rounds of analysis as checks for completeness, and treat any findings as preliminary at best. This spares you the drama of actions taken on premature conclusions, from what is later determined to be incomplete data.

Let’s start with the simplest, but frequently overlooked sort of completeness: do you actually have all of the records for a given query? This is a basic yet essential question to ask if you are running any kind of analysis on data that consists of a finite and well-defined number of records, and it’s important that the totals (say, number of orders or sum of commission dollars or total students registered) match some external system. Imagine presenting a detailed performance analysis on the stock trading activity for your biggest client, only to find that you’ve missed fully half their activity by forgetting to load some data. Completeness can be that simple, and that important. Without getting into aCcountability just yet, it’s simply not enough to assume that the data you received is all you need.

Thinking more about stock trading, and modern electronic trading in particular (where computer programs make the bulk of decisions about how orders get executed), offers a wealth of examples to consider. Take a simple one: many modern methods of evaluating whether a given trade was done “well” involve measuring the average price achieved against some “benchmark,” such as the price at which the stock was trading when the order arrived (the idea being that a well-traded buy order should not push the stock up excessively, nor a well-traded sell order push it down). While it’s not necessary to delve into the gory details, it is sufficient for our purposes to observe that most modern benchmarks require one or more inputs, including the time at which an order started. If this sort of performance measurement is a requirement, the order data would not be considered complete unless the start time for each order was present.

Another example from the world of electronic trading: consider the relationships between orders in a stock trading system. A trader may place an order to buy 100,000 shares of a given stock, but the underlying system may find it optimal to split that one logical order into several smaller “child orders” (say, 100 child orders of 1000 shares each). The trading firm’s data warehouse could collect the individual child order times and amounts but omit other information, such as where the order actually traded or identifiers that tie the child orders to the original order. Such a dataset could indeed be considered complete if the question were, “what was our total child order volume (count) last month?” On the other hand, it is woefully incomplete to answer the question, “what is our breakdown of orders, based on where they were traded?” Unless the firm has recorded this information elsewhere, this question simply cannot be answered. The remedy would be to extend and amend the data warehouse to collect these additional data points, in anticipation of future such queries. There is clear overlap here with the process of initial database design, but it bears review at query analysis time, given how expensive replaying / backfilling the data could be.

Evaluating your dataset for completeness is straightforward:

Understand the question you wish to answer. This will determine which fields are necessary, and what percentage of complete records you’ll need. Granted, this is not always easy: one aspect of data analysis is asking questions that haven’t been asked before. Still, you can borrow a page from game theory’s playbook: look ahead and reason back.[78] Knowing your business, you can look ahead to the general range of questions you’d expect to answer, and then reason back to figure out what sort of data you’d need to collect, and start collecting it. It also helps to note what data you access or create but don’t currently collect, and start collecting that as well. A common refrain in technology is that disk is cheap[79] and that data is the new gold. Passing up data collection to avoid buying storage, then, is like passing up money to avoid finding a place to put it.

Confirm that you have all of the records needed to answer your question. The mechanics here are straightforward, but sometimes onerous. Straightforward, in that you can check for appropriate record count and presence of fields’ values. Onerous, in that you typically have to write your own tools to scan the data, and often you’ll go through several iterations of developing those tools before they’re truly road-worthy.[80] That we live in an age of Big Data™ adds another dimension of hassle: checking a terabyte dataset for completeness is our modern day needle-and-haystack problem. Sometimes, it is simply not possible. In these cases, we have to settle for some statistical sampling. (A thorough discussion of sampling methodology is well beyond the scope of this book.)
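To make the mechanics concrete, here is a minimal sketch in Python of such a completeness scan. The file name, the required fields, and the expected record count are all hypothetical; the point is simply to compare what you have against an external reference and to flag records with missing required values.

import csv

# Hypothetical expectations; in practice these come from the system of record
# or from the question you are trying to answer.
REQUIRED_FIELDS = ["order_id", "start_time", "original_quantity"]
EXPECTED_RECORD_COUNT = 125000

def check_completeness(path):
    """Report the record count and any records missing required field values."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    incomplete = []
    for line_number, row in enumerate(rows, start=2):  # line 1 is the header
        missing = [name for name in REQUIRED_FIELDS if not (row.get(name) or "").strip()]
        if missing:
            incomplete.append((line_number, missing))
    print("records found:   ", len(rows))
    print("records expected:", EXPECTED_RECORD_COUNT)
    print("records missing required values:", len(incomplete))
    return incomplete

if __name__ == "__main__":
    check_completeness("orders.csv")  # hypothetical export of the orders data

Checks like this are cheap to run after every load, and their output doubles as a record of what was verified and when.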

Take action. What we’ve talked about so far is almost entirely evaluation; what action(s) should you take when you have assessed completeness? In the case of record-level completeness, it’s easy; you either have all the records or you don’t, and you then find them and backfill your dataset. But what should you do if you find 10% (or whatever your meaningful threshold is) of your records having missing values that prevent some queries from being answered? The choices generally fall into one of the following categories:

  • Fix the missing data. Great, if you can, though it’s not always possible or practical.

  • Delete the offending records. This is a good choice if your set of queries needs to be internally consistent.

  • Flag the offending records and note their incompleteness with query results.

  • Do nothing. Depending on your queries, you may just be able to happily crack on with your analysis, though, in our experience, this tends to be a bad idea more often than not…

These are only four options, and there are certainly more; whichever you choose, making (and documenting) the evaluation and the subsequent actions can save considerable pain later on.

Coherent

Assuming your data is now complete, or at least as complete as you need it to be, what’s next? Are you ready to uncover the gold mine hidden in your data? Not so fast—we have a couple of things left to think about. After completeness, the next dimension is Coherence. Simply put: does your data make sense relative to itself?

In greater detail, you want to determine whether records relate to each other in ways that are consistent, and follow the internal logic of the dataset. This is a concept that may, at first glance, feel more than a little redundant in the context of data analysis; after all, relational databases are designed to enforce coherence through devices like referential integrity, right? Well, yes, and no.

We can’t always trust (or even expect) referential integrity. Such checks can cause a noticeable performance hit when bulk-loading data, so it’s common to disable them during raw data loads. The tradeoff here is the risk of orphaned or even duplicate records, which cause a particular brand of headache when referential integrity is reenabled. Also, consider data that is too “dirty” for automatic integrity checks, yet is still valid for certain applications. Last, but not least, your data may be stored in a document database or other NoSQL-style form. In this case, referential integrity at the database level is quite intentionally off the menu. (A discussion about whether or not the data is structured in an appropriate fashion may certainly be warranted in this case, but is well beyond the scope of this book.)

Referential integrity is only one sort of coherence. Another form is value integrity: are the values themselves internally consistent where they need to be? Let’s revisit our stock trading order and execution database for an example of value integrity. Consider a typical structure, in which we have two tables:

Table 19-1. Columns in sample database table 1: orders

order_id, side, size, price, original_quantity, quantity_filled, start_time, end_time

Table 19-2. Columns in sample database table 2: fills

original_order_id, fill_quantity_shares, fill_price, fill_time

An order is an order to buy or sell a stock. A fill is a record of one particular execution on that stock. For example, I may place an order to buy 1,000 shares of AAPL at a certain limit price, say $350.[81] In the simplest case, my order is completely executed (“filled”) and thus I would expect to have a corresponding record in the fills table with fill_quantity_shares = 1000. What if, however, I buy my 1,000 shares in more than one piece, or fill? I could conceivably get one fill of 400 shares, and another of 600, resulting in two records in the fills table instead of one. So the basic relationship is one order to many fills. How, then, does value integrity fit here? In our model of order and fill data, the sum of fill quantities should never be greater than the original quantity of the order. Compare:

select sum(fill_quantity_shares) from fills where fills.original_order_id = x

to:

select original_quantity from orders where orders.order_id = x

Again, it’s attractive to trust the data and assume that this sort of value-level integrity is maintained at record insert time, but in the real world of fragmented systems, multiple data sources, and (gasp!) buggy code, an ounce of value analysis is often worth a pound (or ton!) of cure.

Another, more subtle example of value integrity concerns timestamps. Similar to the sum of share quantities above, all fills on a given order should, by definition, occur no earlier than the order’s start time and no later than the order’s end time.[82]

In the case of our two-table dataset:

if fills.original_order_id = orders.order_id,

then it should be true that:

orders.start_time <= fills.fill_time <= orders.end_time
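Pulling these checks together, here is a minimal sketch, assuming the orders and fills tables above have been loaded into a local SQLite database (the database file name is hypothetical). It looks for three kinds of incoherence: orphaned fills, orders whose fills sum to more than the original quantity, and fills that fall outside their order’s time window (which assumes the timestamps are stored in a directly comparable form).

import sqlite3

def coherence_report(db_path):
    """Run basic referential- and value-integrity checks on orders and fills."""
    con = sqlite3.connect(db_path)
    cur = con.cursor()

    # Orphaned fills: fills whose parent order is missing entirely.
    orphaned_fills = cur.execute("""
        select f.original_order_id
        from fills f left join orders o on o.order_id = f.original_order_id
        where o.order_id is null
    """).fetchall()

    # Over-filled orders: the sum of fill quantities exceeds the original quantity.
    overfilled_orders = cur.execute("""
        select o.order_id, o.original_quantity, sum(f.fill_quantity_shares) as filled
        from orders o join fills f on f.original_order_id = o.order_id
        group by o.order_id, o.original_quantity
        having sum(f.fill_quantity_shares) > o.original_quantity
    """).fetchall()

    # Out-of-window fills: fill_time earlier than start_time or later than end_time.
    out_of_window_fills = cur.execute("""
        select f.original_order_id, f.fill_time
        from fills f join orders o on o.order_id = f.original_order_id
        where f.fill_time < o.start_time or f.fill_time > o.end_time
    """).fetchall()

    con.close()
    return {
        "orphaned_fills": orphaned_fills,
        "overfilled_orders": overfilled_orders,
        "out_of_window_fills": out_of_window_fills,
    }

print(coherence_report("trading.db"))  # hypothetical database file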

As with Completeness, evaluation of Coherence can take many forms.

Determine what level and form of coherence you really need. Is it validation of referential integrity? Simple value-integrity checks on one or more fields? Or does it require more complex integrity validation?

Determine how complete your validation needs to be, and what your performance and time constraints are. Do you need to validate every single record or relationship? Or can you apply statistical sampling to pick a meaningful but manageable subset of records to evaluate?

As always, it’s critical that your evaluation fits your needs and properly balances the time/quality tradeoff.

Once you have validated your data, you have to decide what to do with the problems you’ve found. The decisions to fix, omit, or flag the offending records are similar to those for Completeness and will depend, as always, on your requirements, though the balance may be different. Fixing errors of referential or value integrity tends to be a mixture of finding orphaned records, deleting duplicates, and so forth.

Correct

Having confirmed that your data is both complete and coherent, you’re still not quite ready to crunch numbers. You now have to ask yourself whether your data is correct enough for what you’re trying to do. It may seem strange to consider this a precursor to analysis, as analysis often serves to somehow validate the dataset; but keep in mind, there may also be “sub-dimensions” of correctness that bear validation before you move on to the main event. Similar to testing for coherence, correctness requires some degree of domain knowledge.

One thing to remember is that correctness itself can be relative. Imagine you’ve gathered data from a distributed system, composed of hundreds of servers, and you wish to measure latency between the component services as messages flow through the system. Can you just assume that clocks on all the machines are synchronized, or that the timestamps on your log records are in sync? Maybe. But even if you configure this system yourself, things change (and break).

One simple check would be to confirm that the timestamps are moving in the proper direction. Say that messages flow through systems s1, s2, and s3, in that order. You could check that the timestamps are related as follows:

message_timestamp(s1) <= message_timestamp(s2) <= message_timestamp(s3)

You may determine that timestamps of messages passing through s2 are consistently less than (that is, earlier than) the timestamps recorded when those same messages passed through s1. Once you rule out time travel, you reason that you have a systematic error in the data (brought about by, say, a misconfigured clock on s1 or s2). You can choose between correcting the systems’ clocks and rebuilding the dataset, adjusting the stored timestamps, or some other corrective action. It may be tempting to include such correction logic inline in your queries, but that would add complexity and runtime overhead. A more robust approach would be to uncover this problem ahead of time, before making the data available for general analysis. This keeps the query code simple, and also ensures that any corrections apply to all queries against the dataset.
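A minimal sketch of such a check, with a hypothetical message layout (per-hop epoch timestamps keyed by system name): it flags any message whose timestamps run backward between adjacent hops, and reports the median hop-to-hop delta, since a consistently negative median points to systematic clock skew rather than a handful of one-off glitches.

import statistics

HOPS = ["s1", "s2", "s3"]  # the order in which messages flow through the systems

def check_timestamp_order(messages):
    """messages: iterable of (message_id, {system_name: epoch_seconds}) pairs."""
    violations = []
    deltas = {pair: [] for pair in zip(HOPS, HOPS[1:])}
    for message_id, stamps in messages:
        for earlier, later in zip(HOPS, HOPS[1:]):
            delta = stamps[later] - stamps[earlier]
            deltas[(earlier, later)].append(delta)
            if delta < 0:
                violations.append((message_id, earlier, later, delta))
    median_deltas = {pair: statistics.median(values) for pair, values in deltas.items() if values}
    return violations, median_deltas

# Made-up data: every message appears to reach s2 before it left s1.
sample = [("m1", {"s1": 100.0, "s2": 99.7, "s3": 100.2}),
          ("m2", {"s1": 101.0, "s2": 100.6, "s3": 101.1})]
violations, median_deltas = check_timestamp_order(sample)
print(violations)     # each s1 -> s2 hop runs backward
print(median_deltas)  # a negative s1 -> s2 median suggests a clock problem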

As a second example, imagine you’re in charge of analytics for a system that tracks statistics—times, routes, distances, and so on—for thousands of runners. Runners who train or compete at different distances will typically run longer distances at a slower pace. A person whose best one-mile time is six minutes will, more than likely, run a marathon (26.2 miles) at a pace slower than six minutes per mile. Similarly, older age groups of runners will tend to have slower times (except in some cases at longer distances).

Simple, straightforward validation checks can uncover possible issues of internal correctness (distinct from coherence). How you handle them depends, as always, on your objective and domain knowledge. If you’re simply gathering site statistics, it might not matter. On the other hand, it could also be interesting to learn that certain members or groups consistently under-report race times. In this way, the border between correctness and simply “interesting” data becomes a gray area, but useful to think about nevertheless.
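As one concrete (and entirely hypothetical) example of such a check, with a made-up record layout: flag any runner whose reported marathon pace works out faster than their best reported mile, which should be physiologically implausible and is at least worth a second look.

MARATHON_MILES = 26.2

def implausible_paces(results):
    """results: list of dicts with 'runner', 'best_mile_minutes', and 'marathon_minutes'."""
    flagged = []
    for r in results:
        marathon_pace = r["marathon_minutes"] / MARATHON_MILES  # minutes per mile
        if marathon_pace < r["best_mile_minutes"]:
            flagged.append((r["runner"], round(marathon_pace, 2), r["best_mile_minutes"]))
    return flagged

# Made-up data: the second entry claims a 130-minute marathon from a six-minute miler.
data = [{"runner": "a", "best_mile_minutes": 6.0, "marathon_minutes": 200.0},
        {"runner": "b", "best_mile_minutes": 6.0, "marathon_minutes": 130.0}]
print(implausible_paces(data))  # flags runner "b"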

The general process for evaluation of Correctness should start to look familiar by now:

Itemize the elements of your data that can be easily verified out of band. Given a data dump from a point-of-sale system, are all transactions expected to have occurred in the years 2010 and 2011? Perhaps they are all from your friend’s dollar store, where a single line-item sale of $100 could reasonably be considered erroneous?

Determine which of these are important to validate, in the context of what you care about. As data-minded people, it’s common to want to categorize every dataset and grade it on every dimension. To do so would be highly inefficient: time is limited, and requirements tend to change. We suggest you review each dataset with your actual needs in mind, and attach a log of what was checked at the time.

Understand how much of your data must be correct. As with Coherence, can (must?) you check every record, or is some sort of sampling sufficient?

Decide what you will do with incorrect data. Is it possible to “fix” the records somehow, or must you work without them? Depending on the questions you’re trying to answer, you may be able to weight known-incorrect records and reflect that in the analysis. Another option would be to segregate these records and analyze them separately. It could make for an interesting find if, say, your model were known to perform equally well on correct and incorrect data. (That could indicate that the field in question has no predictive power, and may be safely removed from your feature set.)

aCcountable

Who is responsible for your data? This may seem an odd question when discussing data quality, but it does indeed matter. To explain, let’s first consider how data moves.

Data flow typically follows a pattern of: acquire, modify, use. Acquire means to get the data from some source. To modify the data is to clean it up, enrich it, or otherwise tweak it for some particular purpose. You can use the data to guide internal decisions, and also distribute it externally to clients or collaborators. The pattern is repeated as data passes from source to recipient, who in turn becomes the source for the next recipient, and so on.

This is quite similar to supply chain management for tangible goods: one can trace the flow of raw material through perhaps several intermediate firms. Some goods, such as food and drink, have a final destination in that they are consumed. Others, through recycling and repurposing, may continue through the chain in perpetuity.

One critical element of the supply chain concept is accountability: one should be able to trace a good’s origins and any intermediate states along the way. An outbreak of foodborne illness can be traced back to the source farm and even the particular group of diseased livestock; an automobile malfunction can be traced back to the assembly line of a particular component; and so on.

When assessing your data for quality, can you make similar claims of traceability? That brings us back to our original question: who is responsible for your data? You are responsible, as are your sources, and their sources, and so on.

Data’s ultimate purpose is to drive decisions, hence its ultimate risk is being incorrect. Wise leaders therefore confirm any numbers they see in their reports. (The more costly the impact of the decision, the more thoroughly they check the supporting evidence.) They check with their analysts, who check their work and then check with the data warehouse team, who in turn check their work and their sources, and so on. This continues until they either reach the ultimate source, or they at least reach the party ultimately responsible.[83]

Equally pressing are external audits, driven by potential suitors in search of a sale, or even government bodies in the aftermath of a scandal. These have the unpleasant impact of adding time pressure to confirm your sources, and they risk publicly exposing how your firm or institution is (quite unintentionally, of course) headed in the wrong direction because it has been misusing data or acting on bad information.

Say, for example, that you’ve been caught red-handed: someone’s learned that your application has been surreptitiously grabbing end users’ personal data. You’re legally covered due to a grey area in your privacy policy, but to quell the media scandal you declare that you’ll immediately stop this practice and delete the data you’ve already collected.

That’s a good start, but does it go far enough? Chances are, the data will have moved throughout your organization and made its way into reports, mixed with other data, or even left your shop as it’s sold off to someone else. Can you honestly tell your end users that you’ve deleted all traces of their data? Here’s a tip: when the angry hordes appear at your doorstep, pitchforks and torches in hand, that is not the time to say, “um, maybe?” You need an unqualified “yes” and you want to back it with evidence of your data traceability processes.

It’s unlikely that a violent mob will visit your office in response to a data-privacy scandal.[84] But there are still real-world concerns of data traceability. Ask anyone who does business with California residents. Such firms are subject to the state’s “Shine the Light” legislation, which requires firms to keep track of where customers’ personal data goes when shared for marketing purposes.[85] In other words, if a customer asks you, “with whom have you shared my personal information?”, you have to provide an answer.

As a final example, consider the highly regulated financial industry. While the rules vary by industry and region, many firms are subject to record-retention laws which require them to maintain detailed audit trails of trading activity. In some cases, firms must keep records down to the detail of the individual electronic order messages that pass between client and broker. The recording process is simple: capture and store the data as it arrives. The storage, however, is the tough part: you have to make sure those records reflect reality, and that they don’t disappear. When the financial authorities make a surprise visit, you certainly don’t want to try to explain why the data wasn’t backed up, and how a chance encounter between the disk array and a technician’s coffee destroyed the original copy. Possibly the most iconic example of the requirement for traceability and accountability is the United States’ Sarbanes-Oxley Act,[86] which requires certification of various financial statements, with criminal consequences for false certification.[87]

Humor aside, surprise audits do indeed occur in this industry, and it is best to be prepared. Establishing a documented chain of accountability should yield, as a side effect, a structure that makes it easy to locate and access the data as needed. (The people responsible will inevitably create such a structure, if only so that they do not become human bottlenecks for every sort of data request.) In turn, this means being able to quickly provide the auditors the information they need, so you can get back to work.[88]

Once again, the steps required to achieve proper data accountability depend on your situation. Consider the following ideas to evaluate (or, if need be, define) your supporting infrastructure and policies:

Keep records of your data sources. Note how you access the data: push or pull, web service call or FTP drop, and so on. Determine which people (for small companies), teams (for larger companies), and companies (for external sources) own or manage each dataset. Determine whom to contact when you have questions, and understand their responsibilities or service-level agreements (SLAs) for the data: what are the guarantees (or lack thereof) concerning the quality, accuracy, and format of the data, and the timeliness of its arrival?
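One lightweight way to keep such records is a small, machine-readable registry. The entry below is purely illustrative (the names, contact, and SLA terms are made up); the point is that the information is written down in one place and kept current, whether in code, a configuration file, or a wiki page.

# A hypothetical entry in a data-source registry.
DATA_SOURCES = {
    "orders_feed": {
        "access": "pull; nightly FTP drop",
        "owner": "trading technology team",
        "contact": "data-ops@example.com",
        "sla": {
            "delivery_by": "02:00 local time",
            "format": "pipe-delimited text with header row",
            "quality_guarantee": "schema validity only",
        },
    },
}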

Store everything. Hold on to the original source data in addition to any modified versions you derive from it. This lets you check your data conversion tools, to confirm they still operate as expected. As an added bonus, it dovetails well with the next point.

Audit yourself. Occasionally spot-check your data, to confirm that your local copies match the source’s version. Confirm that any modification or enrichment processes do not introduce any new errors. When possible, check your records against themselves to confirm the data has not been modified, either by subtle storage failure (bit rot), by accident, or by conscious choice (tampering).[89]

As an example, consider the DuraCloud Bit Integrity Checker. DuraCloud is a cloud-based storage service, and the Bit Integrity Checker lets admins upload MD5 checksums along with the data. In turn, the service periodically confirms that the stored data matches the supplied checksums.[90]
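In the same spirit, here is a minimal sketch of a local self-audit using Python’s hashlib. It assumes a manifest file in the format produced by the md5sum utility (digest, two spaces, relative path) and reports any file that is missing or whose current checksum no longer matches; the manifest location is hypothetical.

import hashlib
from pathlib import Path

def md5_of(path, chunk_size=1 << 20):
    """Compute a file's MD5 digest without reading it all into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def audit(manifest_path):
    """Compare files against a manifest of 'digest  relative/path' lines."""
    problems = []
    base = Path(manifest_path).parent
    for line in Path(manifest_path).read_text().splitlines():
        expected, _, relative_path = line.strip().partition("  ")
        target = base / relative_path
        if not target.exists():
            problems.append((relative_path, "missing"))
        elif md5_of(target) != expected:
            problems.append((relative_path, "checksum mismatch"))
    return problems

print(audit("archive/MANIFEST.md5"))  # hypothetical manifest location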

Most importantly, if local regulations subject you to audits, make sure you know the format and content the auditors require. Provide sample records to the auditors, when possible, to make sure you collect (and can access) all of the details they expect.

Watch the flow. Keep track of how the data flows through your organization, from source to modification, from enrichment to report. Follow the supply-chain concept to track any research results to their origin. If data is updated, any derivative data should follow suit.

Track access. Understand who can access the data, and how. In an ideal world, any read-write access to the data would occur through approved applications that could track usage, perform record-level auditing, and employ an entitlements system to limit what each end-user sees. (Such a scenario has the added benefit that the applications can, in turn, themselves be audited.)

If you have the unfortunate scenario that certain “power users” have direct read-write access to the raw data (say, through desktop SQL tools), at least provide everyone their own login credentials.[91]

At the end of the day, whether the audit is yours or the CEO’s, for informal or criminal proceedings, for tube socks or Tube schedules, if you are unable to describe, audit, and maintain the aCcountability chain of your data, the maintenance of proper Completeness, Coherence, and Correctness not only becomes that much more difficult, it becomes largely irrelevant.

Conclusion

The Four Cs framework presented here is only one of many possible ways of looking at data quality, and as we’ve said, your mileage may vary (as will the relevance of any given C). That said, there are three principles inherent to this discussion that bear repeating and that should be relevant to any data quality analysis framework:

Think about quality separately (and first, and iteratively) from the main task at hand. Getting in the habit of thinking about data quality before the real work begins not only saves time, but gives you a better understanding of your capabilities and limitations with respect to analyzing that set.

Separate the actual execution of data quality checks from the main task at hand whenever it is practical to do so. Validation logic tends to be separated from the main flow in most programming paradigms; why not in data handling?

Make conscious (and documented) decisions about the disposition of data that doesn’t live up to your (also documented!) Four Cs checks. If you aren’t making and documenting clear decisions and actions that result from your determination of data quality, you might as well skip the entire process.

We hope you found this chapter reasonably complete, coherent, and correct. If not, the authors are the only ones accountable.

Data quality analysis, and any related data munging, is a necessary first step in getting any meaningful insight out of our data. This is a dirty and thankless job, and is often more time-consuming than the actual analysis. You may at least take comfort in the words of Witten et al.:

Time looking at your data is always well-spent.[92]

Indeed.



[76] Badwater is a 135-mile foot race that goes through California’s Death Valley. And before you ask: no, neither of the authors has come remotely close to attempting it. We just think it’s cool.

[77] One may argue that this is more like 3.5 Cs…. Fair enough.

[78] Dixit and Nalebuff’s Thinking Strategically: The Competitive Edge in Business, Politics, and Everyday Life is a text on game theory. It covers the “look ahead and reason back” concept in greater detail.

[79] At least, it’s common to say that these days. We began our careers when disk (and CPU power, and memory) was still relatively expensive.

[80] Keep in mind that any time you write your own tools, you have to confirm that any problems they detect are in fact data problems and not code bugs. That is one advantage to working with the same dataset over a long period of time: the effort you invest in creating these tools will have a greater payoff and a lot more testing than their short-lived cousins.

[81] We plan to do this with our publication royalties from this chapter.

[82] However, if you do happen to figure out a way to get your fills to actually happen before you send your order, please tell us how you did it!

[83] Inside a company, this can mean “someone we can fire.” For external data vendors, this is perhaps “someone we can sue for having provided bad data.” Keep in mind, the party ultimately responsible could be you, if the next person in the chain has managed to absolve themselves of all responsibility (such as by supplying an appropriate disclaimer).

[84] So to all those who make their money swiping personal data, you may breathe a sigh of relief … for now.

[85] California Civil Code section 1798.83: http://www.leginfo.ca.gov/cgi-bin/displaycode?section=civ&group=01001-02000&file=1798.80-1798.84. The Electronic Privacy Information Center provides a breakdown at http://epic.org/privacy/profiling/sb27.html

[87] Think of all those times when you felt that someone’s incompetence was criminal…

[88] …or even, to go home. Some audits have been known to occur at quitting time, and you don’t want to be part of the crew stuck in the office till the wee hours of the morning hunting down and verifying data.

[89] Whether it’s shady companies removing evidence of fraud, or a rogue sect hiding a planet where they’re developing a clone army, people who remove archive data are typically up to no good.

[91] … and then, lobby to build some of the aforementioned applications. It’s quite rare that end users need that kind of raw data access, and yet all too common that they have it.

[92] Besides providing a witty quote with which to close the chapter, Witten et al.’s Data Mining: Practical Machine Learning Tools and Techniques (third edition) includes some useful information on data cleaning and getting to know your data.
