Chapter 11. Data Reliability

The opening chapters of this book discuss how we live in a world of services. We also live in a world of data. Most services create, collect, process, or present data in some way. We’re surrounded by data services! The goal of this chapter is to explore what makes SLOs for data services different from SLOs for other services.

First, we’ll define data services and consider our data users. The heart of this chapter covers measuring service objectives via 13 data and data application properties. For each property, we’ll explore its measurement and its relationship to system design. We’ll finish with a short explanation of how to ensure data quality via service level objectives, to keep users happy.

Data Services

Welcome to the world of data service reliability. We’re bombarded with data each day. Financial data. Social data. Training data for algorithms. Data that is historical or near real time, structured or unstructured. Privately guarded corporate secrets as well as publicly available government datasets. Microservices consuming tiny amounts of JSON data from queues. Monolithic banking applications creating thousands of regulatory reports. And, of course, every other abstraction through which humanity has struggled to describe and make sense of the world since Grace Hopper plucked the first actual bug out of a computer.

Data application owners need to ensure that their services are reliable—but the essentials of data reliability vary for each system. To examine data reliability, we have to consider the intersection of reliability precepts with the types of data and data systems we use. We need data-specific considerations like integrity, validity, and durability. Then we can balance them against each other and the service properties with which we are already familiar. Enter the SLO.

We can only optimize so many things about a service. Every measurable property for which we can optimize involves a potential trade-off with some other property. When architecting data applications, some of those trade-offs are irrevocable. SLOs help us evaluate these constraints for our applications, with measurable objectives.

Designing Data Applications

To deliver a good user experience, we’ll need to align our points of view with our users’ and establish clear shared expectations. In the immortal words of Obi-Wan Kenobi, another questionably qualified instructor, “You’re going to find that many of the truths we cling to depend greatly on our own point of view.”

For the typical case, reliability expectations are well understood. The most common signal of “reliability” is when systems are resilient enough to recover in ways that make degradation ephemeral to the user. Availability and latency describe a moment, and once restored, the utility of a service is as good for the user as it ever was.

Warning

Data-related properties have a different calculus of risk. Properties like durability, consistency, and integrity must be considered in a unique light, because once lost they are difficult to regain. Recovering from a true durability failure can be impossible, and the effects of these failures will persist forward indefinitely into your users’ future.

If you’re running the streaming frontend for a popular cat-sharing platform, where both you and your clients make money from ads, at the end of the day nobody cares if you lose a few bits off the millionth stream of Fluffy vs. Box XXVIII: The Boxening. But if you make your living providing a safe place for people to store their wedding photos or their million-dollar empires of cat video master files (or their financial records that protect Fluffy from a stint in Club Fed for tax fraud), you don’t get off so easy. Then your users’ happiness depends on mastering a very different problem, at the heart of which is critical data. Financial, legal, creative, or sentimental data often cannot be replaced or reconstructed.

When the value of your service is the data itself, invest data engineering energy into mechanisms to prevent failures, or at least to keep them localized and temporary rather than systemic and sustained.

Like designing any software, architecting data applications requires us to understand what problems we’re trying to solve. This is especially true because many data reliability properties exist in opposition to each other. For instance, the additional processing time needed to ensure completeness can be at odds with freshness or availability. It is important to determine the criticality of data and think about error budgets per application—or dataset—rather than just per failure type. SLOs can help us decide which errors or bottlenecks in the design are our best investment for mitigation.

Note

What do we mean by “data applications”? A data application exists to create, gather, store, transform, and/or present data. This sounds like most services. For our purposes, a data application is a service for which data quality is the primary success metric and for which data-based SLOs will directly map to user experiences.

Users of Data Services

Who are data users? Everyone: services, humans, meshes of services, consumers, producers—anyone or anything that interacts with data. In complex applications, data is often used in several different ways. What are the integration points between services in a data pipeline? How does data move through the system? (We’ll talk a little more about data lineage toward the end of the chapter.)

Note

Defining SLOs requires determining what types of service degradations are meaningful to your users. The first step is understanding the entities that provide and consume your data.

Who will require the information you produce? Different users will have different service objectives (written down or not), and new users will have new SLO needs. From the user’s point of view, the service is reliable if it performs the function that they expect.

What mechanism(s) will data consumers use to receive your data (RESTful API, event stream, pull, push)? Some users expect durability, which will be at odds with retention policies—do you need different retention for different users? Think about what you aim to deliver (and how), and how you can control, protect, and enhance the value of data throughout its life cycle.

Don’t forget that if your service has data providers, they are also users that hold expectations of your service. Metric forwarding clients are users of monitoring services, video channel owners are users of video platform services, and advertisers are users of ad platform services, just as much as the data consumers. What are their needs? Measure how you ingest data, and whether its internalized form remains true to the content they’ve entrusted to you.

Another question you’ll need to think about is what level to set an SLO at. While thinking about the data properties, you’ll make decisions about which elements are the most important for specific sets of data, for individual services, and for the organization as a whole.

Setting Measurable Data Objectives

This chapter lays out our personal ontology of data reliability (see Figure 11-1). The categories described here are just one of many possible sets of dimensions across which the reliability of data can be considered and measured. Some of them will sound familiar, because they apply to any system. What are they? Which are unique to data?

To observe data reliability with SLOs/SLIs, we need to consider measurable properties and the various kinds of data services. Many of these data properties are either tightly coupled, so it’s hard to have one without the other, or can be in contention with one another, as mentioned previously. Figure 11-1 illustrates those messy relationships. And to further complicate matters, their meaning can be subjective!

Note

In our careers and in our research, we have come across many frameworks for categorizing data concerns. There are hundreds of terms in use describing attributes of data, with little consistency or consensus as to which terms are canonical or what each term means from author to author. We’ll be narrow and opinionated in how we define these terms; we’ll use them as tags for specific kinds of things we want to discuss. We apologize in advance for any offense to academic or professional sensibilities. Our definitions may not match your own; you are welcome to fight us on Twitter.

Figure 11-1. Data properties and their relationships to each other

Data and Data Application Reliability

No dataset will outlive the heat death of the universe, despite all of the nines in the vast armada of servers its engineers construct for it, out in the black vastnesses between the stars. What, then, are realistic user expectations of data reliability, and how do we reason about them?

We can separate data reliability properties into two camps: data properties and data application properties. There are 13 properties that we’ve chosen to examine in detail for this chapter.

The first seven of these are data properties:

1. Freshness
How out of date is the data compared to the most recent source inputs?
2. Completeness
How comprehensive is the dataset? Does it include all data items that represent the entity or event?
3. Consistency
How often do all files, databases, and other state stores agree on a particular datum?
4. Accuracy
To what degree does the data correctly describe the entity, property, or event?
5. Validity
How well does the data conform to the standards, such as schemas or business rules?
6. Integrity
How trustworthy is the data (based on its governance and traceability)?
7. Durability
How likely is it that there is a known-healthy copy of the data?

The other six are data application properties:

1. Security
How well is the data protected from unauthorized access or alteration?
2. Availability
How continuously can the service respond to requests? What are its response time and success rate?
3. Scalability
How elastic does the service need to be? How fast is the data volume expected to grow?
4. Performance
What is the latency, throughput, and efficiency of the data service?
5. Resilience
How quickly does the data recover after disruption?
6. Robustness
How well does the service handle invalid inputs or stressful conditions?

The remainder of this section explores each of these properties in turn. We start with data properties, and then cover data application properties. In addition to defining each property, we get into the details of how to measure them and discuss how to use these sorts of measurements to set meaningful SLO targets. The goal is to guide design decisions around them.

Data Properties

Data properties are those properties that are inherent to the data itself, mapping to data quality characteristics. Any definition of the term data quality is subjective and/or situationally dependent to the point of being useless, so we will use the SLO framework to set quantitative objectives. For any property, an SLO for that property that’s aligned to the needs of our users and our business helps us assess the reliability of our data application.

Freshness

Freshness, sometimes called timeliness, is a measure of how out-of-date a particular piece or collection of data is. Does the data represent the actual situation, and is it published soon enough? This is not the same thing as the age of the data: week-old data can be fresh, and minute-old data can be stale. Freshness relates to the time difference between the available data and the latest update or additional processing input. That is, data is fresh until new information comes in that renders previous information less useful.

Note

Note that many data properties are strongly tied to the availability data application property, covered in the following section. Having data present in the system that is more fresh or complete or consistent doesn’t matter if it isn’t what the user sees.

Out-of-date information can cost companies time, money, or credibility. How much it costs is a function of the data’s rate of change, the impact of staleness on the use case, and the periodicity of the use that relies on that data. To set an expectation on data freshness, ideally you can measure when and how often users access the data, and calculate the freshness level that keeps the impact within your service requirements.

Think of a weekly report that goes out every Monday morning to the board. In such a case, we might determine that the data it is drawn from can be incomplete or stale during the week, as long as the data is fresh and reliable by Monday morning.

  • Example SLO: 99% of data for the previous week (Monday through Sunday) is available for reporting by 09:00 each Monday.

With this well-understood expectation, you might design the system as a weekly batch job to process the week’s data and generate the report in time. Or you might decide that the potential business cost of a late failure in that single job would justify redundant jobs, or incremental processing through the week.

Contrast the preceding with a real-time dashboarding tool, which demands continuous freshness.

  • Example SLO: 97% of data is available in the dashboard tool within 15 minutes of an event occurring.

The expectations from this SLO make it clear that the system design will require a near-real-time event-based solution, such as streaming.

Note that both of these example SLOs also include reference to data completeness, covered in more detail later in this chapter.

Data with high freshness concerns is found in:

  • High-frequency trading systems that need to respond to the market within milliseconds

  • Any service where a user is making critical real-time decisions based on the data (military, air traffic controllers, etc.)

  • Concurrent user systems, such as chat, multiplayer gaming, collaborative documents, or video streaming with real-time comments

Architectural considerations that may be guided by a freshness SLO include determining when to sync an event pipeline to disk; keeping copies up to date via streaming, microbatch processing, snapshotting, or replication; and determining when to use caching or materialized views/projections. Alternatively, query results can be calculated at the time they are requested. The more volatile the data, the more often it’ll require refreshing.

Note

The data our users see—often an output of a data pipeline—is only as fresh as the time it takes to flow through the whole system. How do you identify which component service is the bottleneck? Record data interaction times, whether through data records, logging, or traces, for each component that touches it to provide visibility into system performance.
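
To make that concrete, here is a minimal sketch (in Python, with hypothetical field names) of how a freshness SLI might be computed from such records: each record carries the timestamp of its source event and the timestamp at which it became visible to users, and the SLI is the fraction of records whose lag stays within the 15-minute target from the dashboard example above.

from datetime import datetime, timedelta

# Hypothetical records: when the source event occurred, and when the data
# became visible to users (e.g., landed in the dashboard's store).
records = [
    {"event_time": datetime(2023, 5, 1, 12, 0), "visible_time": datetime(2023, 5, 1, 12, 9)},
    {"event_time": datetime(2023, 5, 1, 12, 5), "visible_time": datetime(2023, 5, 1, 12, 31)},
    {"event_time": datetime(2023, 5, 1, 12, 10), "visible_time": datetime(2023, 5, 1, 12, 20)},
]

FRESHNESS_TARGET = timedelta(minutes=15)

def freshness_sli(records, target):
    """Fraction of records that became visible within `target` of their source event."""
    fresh = sum(1 for r in records if r["visible_time"] - r["event_time"] <= target)
    return fresh / len(records)

print(f"Freshness SLI: {freshness_sli(records, FRESHNESS_TARGET):.1%} (objective: 97%)")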

Completeness

Completeness, aka comprehensiveness, measures the degree to which a dataset includes all data items representing an entity or event. In a data pipeline, completeness is defined as the percentage of events available to a user after they are sent into the system. This property is correlated to durability and availability, since losing objects or access will result in incomplete datasets.

The importance of completeness is familiar to many of us: incomplete information might be unusable or misleading. Let’s say you’re notifying your customers of an important change. You need each customer’s email address—without it, the user contact data is incomplete. Missing records of the users’ notification preferences may result in the wrong messaging behavior.

In another case, you could be using MapReduce or another distributed worker mechanism to build a dataset, and decide that waiting for 95% of records to return is a better balance than 99.999%. You’re ultimately looking to query a statistically sufficient sample of records, not laboring under a financial and/or regulatory requirement to record every transaction in the system.

For data that can be rebuilt from a persistent source, or has a predictable size and shape (e.g., a top-50 leaderboard), completeness may be straightforward to measure.

  • Example SLO: 99.99% of data updates result in 50 data rows with all fields present.

This can be a useful measurement if this exact record count has significant impact; if leaderboard presence is a driving mechanism for user engagement, then any omissions will generate distrust. However, in many cases the expectations for consistency, accuracy, or freshness may be a more compelling representation of the user experience.

For the many provider-generated data sources that have no other source of truth, completeness can be much trickier to measure. One approach might be to audit the counts for all ingested data.

  • Example SLO: 99.9% of events counted during ingestion will be successfully processed and stored within 15 minutes of ingestion.
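
One way such an audit might be implemented, sketched below in Python with hypothetical counters, is to compare the number of events counted at the ingestion boundary against the number that were processed and stored for the same window; the ratio is the completeness SLI tracked against the 99.9% objective.

# Hypothetical counters for one 15-minute window, e.g., emitted by the
# ingestion tier and the storage tier of the pipeline.
ingested_count = 1_000_000  # events counted at the ingestion boundary
stored_count = 999_250      # events successfully processed and stored

OBJECTIVE = 0.999

completeness_sli = stored_count / ingested_count
print(f"Completeness SLI: {completeness_sli:.4%}")
if completeness_sli < OBJECTIVE:
    # In a real pipeline this would feed an SLO dashboard or alerting system
    # rather than a print statement.
    print("Completeness objective missed for this window")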

Approaches to optimize completeness will depend on the size and scale of the data, as well as the use case. Are you processing financial data where there will be monetary implications to every discrepancy? Or is your application focused on big data for marketing impressions, where accurate aggregate counts are business-critical, but a thousand missed events out of several billion are a rounding error?

Remember that we can sacrifice this or any other property when it makes sense. When would ensuring a high level of completeness be a poor use of resources? For analytical data, completeness might not be a primary concern, as long as the data is directionally correct. Sampling can often give you enough data to make inferences, as long as you make sure that sampling bias doesn’t affect the results.

  • Example SLO: Each query will process 80% of its input set.

In this case, though, the real concern may be getting the data quickly so it’s useful for analysis. If so, perhaps you could better measure the service quality with a performance or freshness SLO.

Data with high completeness concerns includes:

  • Regulatory data, used for legal and regulatory reporting

  • Financial records

  • Customer data where you need a record for each customer, with required fields: name, contact info, product subscription tier, account balance, and so on

Consistency

Consistency (an old stalwart of the CAP theorem) can be a transaction property, a data property, or a data application property. It means that all copies of files, databases, and other state stores are synchronized and in agreement on the status of the system or a particular datum. For users, it means that all parties querying that state will receive the same results. After a “write” is acknowledged, any and all “reads” ought to result in the same response.

Does the data contain contradictions? We’ll use an example from the healthcare field: if a patient’s birthday is March 1, 1979, in one system but February 13, 1983, in another, the information is inconsistent and therefore unreliable.

Note

“Consistency” is not used here to discuss or measure in any way whether the data store is consistent with any actual real-world state or inventory, which would fall under accuracy, discussed next.

We might lack consistency when we choose storage techniques that store data in multiple independent locations, but for some reason do not ensure that all locations are storing the same data before marking the logical write as complete. Often this is done for reasons of performance or better failure isolation, and consequently better availability. Many use cases tolerate eventual consistency. How much delay is acceptable?

If both scalability and consistency are critical for your data application, you can distribute a number of first-class replicas. This service will have to handle some combination of distributed locking, deadlocks, quorums, consensus, and/or two-phase commits. Achieving strong consistency is complicated and costly. It requires atomicity and isolation and, over the long term, durability.

Measure consistency by quantifying the difference between two or more representations of an object. Data can be measured against itself, another dataset, or a different database. These representations can be the source upstream signal, redundant copies of the data, the downstream storage layer, or some higher-dependability authoritative source.

  • Example SLO: 99.99999% of monthly financial data processed by the service will match the company’s general ledger system (authoritative source) by the 5th of the following month (month close).

This kind of matching to measure consistency can be expensive, in both service resources and maintainer hours, since it generally requires a dedicated processing job that can iterate over all the data. The SLO can inform decisions about how rigorous this extra processing needs to be. Are incremental checks enough? Are there any filters or sampling that would make the problem more tractable, or the results more focused on the most critical data?
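
As a sketch of what such a matching job might look like (Python, with hypothetical record contents), the check below iterates over records keyed by transaction ID and reports the fraction on which our copy agrees with the authoritative source.

# Hypothetical extracts, keyed by transaction ID, from the service under test
# and from the authoritative general ledger.
service_records = {"txn-001": 100.00, "txn-002": 250.50, "txn-003": 75.25}
ledger_records = {"txn-001": 100.00, "txn-002": 250.50, "txn-003": 75.20}

def consistency_sli(ours, authoritative):
    """Fraction of authoritative records that our copy matches exactly."""
    # A real reconciliation job would use exact decimal types and apply
    # documented rounding rules rather than comparing floats directly.
    matched = sum(1 for key, value in authoritative.items() if ours.get(key) == value)
    return matched / len(authoritative)

print(f"Consistency SLI: {consistency_sli(service_records, ledger_records):.2%}")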

Following are some examples of data with high consistency concerns:

  • When an authoritative source of data needs to match audits against derived or integrated systems. For example, a hedge fund balance sheet’s gain/loss totals need to reconcile with the transactions in the credit risk management system.

  • When user queries cannot tolerate out-of-date values. If a user performs aggregate queries where one target dataset has been updated and another hasn’t, the results can contain errors.

Accuracy

Accuracy measures the conformity to facts. Data veracity or truthfulness is the degree to which data correctly describes the real-world entity, property, or event it represents. Accuracy implies precise and exact results acquired from the data collected. For example, in the realm of banking services, does a customer really have $21 in their bank account?

Warning

Accuracy is often a crucial property. Relying on inaccurate information can have severe consequences. Continuing with the previous example, if there’s an error in a customer’s bank account balance, it could be because someone accessed it without their knowledge or because of a system failure. An amount that’s off by a single dollar might be a big problem, but a bank’s internal account system could set an accuracy threshold to be within a few thousand dollars. It’s important to measure what’s “accurate enough” for any use case.

When evaluating accuracy, we can measure and consider both granularity and precision:

Granularity
The level of detail at which data is recorded (periodicity, cardinality, sampling rate, etc.). Granularity helps us reason about whether a dataset has a degree of resolution appropriate for the purpose for which we intend to use it. This is heavily used during downsampling and bucketizing of data for time series monitoring and alerting, and so on.
Precision
The amount of context with which data is characterized, adding clarity of interpretation and preventing errors arising from accidental misuse of data. Are these actually apples and apples, or apples and oranges?

Precision is a measure of how well typed and characterized our data is—not how well it conforms to type (validity), but whether it’s defined and described clearly enough to survive reuse in another context without generating erroneous conclusions or calculations. Consider a set of temperature readings where the temperature collection method varies, or where we measure temperature, then administer a medication, then measure temperature again. Without capturing context in the definition/labeling of either case, we could get errors in some uses of the data. With it, we expose additional avenues for normalization, correlation, extrapolation, or reuse.

When measuring accuracy, we typically need to talk about the ways to audit datasets for real-world accuracy. That is, it is perhaps reasonable to set an SLO that 99.99% of our records will be accurate; if we adhere to that target our system may well be fit enough for purpose, and our users may well be happy. But the interesting part is usually how we determine that real-world percentage. Data accuracy can be assessed by comparing it to the thing it represents, or to an authoritative reference dataset.

Is this a running tally as we reference records in the course of using them (for example, every time a cable technician goes to a customer site they check the records and note any inaccuracies, then we track the rate of corrections needed over a 30-day or 90-day rolling average)? Or do we pull some authoritative third-party source and benchmark to that? Or do we have a periodic real-world collection true-up, like with census data or a quartermaster taking inventory?

  • Example SLO: 99.999% of records in the dataset are accurate.

  • Example SLO: 99% of records are within two standard deviations of expected values.

  • Example SLO: 90% of records have been verified against physical inventory within the last three months.

Accuracy is where our world of data can sometimes become the very real world surrounding us, and may require observing it independently of our normal data intake mechanisms and manipulations (whether through manual work, or RFID scanning, or national security means).
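
One relatively cheap way to estimate that real-world percentage, sketched below in Python with entirely hypothetical data, is a sampled audit: draw a random sample of records, verify each one against the reference (or the real-world check), and extrapolate the accuracy rate as the SLI.

import random

# Hypothetical dataset and an authoritative reference to spot-check against.
# In practice the "reference" might be a physical inventory count or a
# technician's field report rather than another table.
dataset = {f"rec-{i}": ("widget", i % 97) for i in range(10_000)}
reference = {f"rec-{i}": ("widget", 0 if i and i % 500 == 0 else i % 97) for i in range(10_000)}

SAMPLE_SIZE = 1_000
sample_keys = random.sample(list(dataset), SAMPLE_SIZE)

accurate = sum(1 for key in sample_keys if dataset[key] == reference[key])
print(f"Estimated accuracy: {accurate / SAMPLE_SIZE:.1%} (from a {SAMPLE_SIZE}-record sample)")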

Data with high accuracy concerns includes:

  • Personnel records, which must remain accurate for the company to operate smoothly

  • Medical data used for diagnosis

  • Bank account information, including the balance and transaction history for each customer

Validity

Also referred to as data conformance, validity concerns the question: does the data match the rules? Validity is a measure of how well data conforms to the defined system requirements, business rules, and syntax for format, type, range, and so forth. It describes the data’s shape and contents, and its adherence to schemas; it includes validation logic and nullability. A measure of validity indicates the degree to which the data follows any accepted standards.

Quantifying validity involves measuring how well data conforms to the syntax (format, type, range, nullability, cardinality) of its definition. Compare the data to metadata or documentation, such as a data catalog or written business rules. Validity applies at the data item level and the object/record level.

For example, a nonnumeric value such as U78S4g in a monetary amount field is invalid at the item level. A six-character postal code is invalid for US addresses but valid for Canadian customers, so the rest of the record needs to be considered in determining validity. Keep in mind that a zip code of 00000 can be valid, but inaccurate.

  • Example SLO: In the dataset, 99% of values updated after Jan 1, 2018, are valid.

Data with high validity concerns includes:

  • Any data that users depend on to be in a certain schema or format and that, if incorrect, will be deleted or cause errors in downstream systems

  • Messages flowing through a schema-managed data pipeline, where any nonconforming message goes to a dead letter queue

Is it better to land bad data or no data? If you validate on the fly, any record that doesn’t conform to a schema is rejected. If you want to be safe, send it to an error log and/or dead letter queue. Alternatively, land then validate. You can process and analyze a batch of data in a staging environment or isolated table(s). Maybe some of it is recoverable. This is a tricky trade-off because you don’t want to encourage producers to provide you with bad data. By accepting malformed inputs, you’re expanding your service’s robustness (see “Robustness”), but taking on additional maintenance loads.
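
Here’s a minimal sketch of the validate-on-the-fly approach in Python (the rules, field names, and queue are hypothetical): records that fail the checks are diverted to a dead letter queue instead of landing in the main dataset, and the pass rate becomes a validity SLI.

valid_records = []
dead_letter_queue = []

def is_valid(record):
    """Hypothetical business rules: amount must be numeric; US zip codes have 5 digits."""
    try:
        float(record["amount"])
    except (KeyError, TypeError, ValueError):
        return False
    if record.get("country") == "US" and len(record.get("zip", "")) != 5:
        return False
    return True

incoming = [
    {"amount": "19.99", "country": "US", "zip": "10001"},
    {"amount": "U78S4g", "country": "US", "zip": "10001"},  # nonnumeric amount: invalid
    {"amount": "5.00", "country": "CA", "zip": "K1A0B1"},   # six-character Canadian postal code: valid
]

for record in incoming:
    (valid_records if is_valid(record) else dead_letter_queue).append(record)

print(f"Validity SLI: {len(valid_records) / len(incoming):.1%}, dead-lettered: {len(dead_letter_queue)}")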

Warning

Don’t try to build services to clean up the data in the middle of a pipeline. This is an antipattern. Fix it at the source by convincing your upstream data producers to write a validity SLO.

Integrity

Data integrity involves maintaining data in a correct state by controlling or detecting the ways one can improperly modify it, either accidentally or maliciously. It includes data governance throughout the data life cycle: tracking data lineage, determining retention rules, and auditing data trustworthiness. Integrity is not a synonym for security. Data security encompasses the overall protection of data confidentiality, integrity, and availability, while data integrity itself refers only to the trustworthiness of data.

Data with low integrity concerns may be considered unimportant to precise operational functions or not necessary to vigorously check for errors. Information with high integrity concerns is considered critical and must be trustworthy.

In order to safeguard trust across that spectrum, integrity can deal just with local file data on our local system via a mechanism as simple as checking and/or updating the cryptographic hash of important files, or it can extend to secure boot mechanisms, TPMs,2 and verification of the integrity of the runtime environment itself. File integrity checks are useful, but are also a lagging indicator. A more timely metric might be transaction integrity. Rather than measuring all transactions, we would likely focus on integrity when we process a transaction class that is particularly important, like the banking and digital rights management transactions that help make sure that we can continue to keep Fluffy in the Fancy Feast manner to which he has grown accustomed.

If, in addition to the checks of the transaction itself, we check the integrity of the attesting system before and after each transaction, and all the checks are successful, then we assume the transaction can be trusted, too.3

Combining all these, we can detect and measure the following:

  • Files that don’t match integrity checks during periodic scans

  • Files that don’t match integrity checks on periodic access

  • Sensitive transactions that don’t pass remote attestation runtime integrity checks
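
The first of those checks can be as simple as the following sketch (in Python; the directory, file names, and digests are hypothetical): periodically recompute a cryptographic hash for each important file and compare it against a known-good manifest captured when the files were last in a trusted state.

import hashlib
from pathlib import Path

# Hypothetical manifest: file name -> SHA-256 digest recorded when the file
# was last in a trusted state.
MANIFEST = {
    "ledger-2023-04.csv": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
}

def failed_integrity_checks(directory, manifest):
    """Yield (file, problem) for files that are missing or whose hash has drifted."""
    for name, expected in manifest.items():
        path = Path(directory) / name
        if not path.exists():
            yield name, "missing"
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest != expected:
            yield name, "hash mismatch"

# The fraction of files passing these scans over time can feed the audit-check SLO below.
for name, problem in failed_integrity_checks("/var/data", MANIFEST):
    print(f"Integrity check failed for {name}: {problem}")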

Note

Sometimes data integrity is maintained by locking down data or source code. If a system generates an immutable financial report, we need to be able to reproduce it exactly, even if our data application changed its logic to fix bugs. Regression testing, immutability, and limiting access to source code/databases are ways to help enforce integrity, whereas integrity checks help measure and monitor it.

There is no perfect mechanism for validating integrity—even less so in a distributed system, where getting every component to agree on something as “simple” as incrementing a counter can be a challenge. But scans, checks on access, and transaction/attestation checks can work together to give a regular overall check, an as-we-go run-rate spot check, and an approaching-real-time check that the system isn’t fooling us on the other checks.

  • Example SLO: 99.9999% of data has passed audit checks for data integrity.

Data with high integrity concerns includes:

  • Billing application code, which must be unaltered in order to ensure proper application function and auditability for compliance

  • Critical system logs, which must be unaltered in order to ensure proper detection of intrusions and system changes by security

Is the data required to remain uncorrupted? Can this data be modified only by certain people or under certain conditions? Must the data come from specific, trusted sources? If data integrity is a primary concern, make sure you have a comprehensive data management strategy. This includes procedures related to backups, access controls, logging, and data lineage, along with robust verification, validation, replay, and reconciliation processes.

Durability

Durability measures how likely it is that a known-healthy copy of your data will be in the system when you attempt to access it. This is not a measure of whether, in that moment, you will actually be able to access it, but rather whether the data exists to be accessed at all. Recovering from a true durability failure—where no copies exist and the data cannot be re-created—is usually impossible. Because of this, in cases where durability is important, we need to invest extensively in both robustness and resilience.

Azure claims between 11- and 16-nines durability, depending on the service that is used. Amazon S3 and Google Cloud Platform offer 11 nines, or 99.999999999% annual durability. This durability level corresponds to an average expected loss of 0.000000001% of objects per year. That means that if you store 1 million immutable objects, you can expect to lose a single object once every 100,000 years.
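
That arithmetic is easy to check; here’s a back-of-the-envelope calculation in Python (treating the durability figure as a simple annual loss probability, which is a simplification of what providers actually promise):

durability = 0.99999999999  # 11 nines, annual
objects = 1_000_000         # 1 million immutable objects

expected_losses_per_year = objects * (1 - durability)
years_per_lost_object = 1 / expected_losses_per_year

print(f"Expected losses per year: {expected_losses_per_year:.2e}")         # ~1e-05
print(f"Average years to lose one object: {years_per_lost_object:,.0f}")   # ~100,000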

  • Example SLO: 99.999999999% annual durability

Examples of data with high durability concerns are:

  • A general ledger accounting service, which is the authoritative source of a company’s transactions

  • Any system storing important data that cannot be re-created, such as sentimental data

Keep in mind that not all data is equally valuable to the user. Durability for legacy data might not be as critical as it is for current data. Also, certain datasets may be more useful for decision making, even if they’re stored in the same tables as other data.

Of course, not all data is created equal, and not all of it is irreplaceable. A service’s durability SLO should reflect the inherent value of its data (whether intrinsic or sentimental) and the benefit derived from the preservation of the data.

Sometimes data can be reingested from the original source, or derived from related data. Where the reconstruction expense (in terms of processing, transit, and opportunity cost) is relatively low, a durability SLO can be more relaxed.

  • Example SLO: 99% monthly durability for all hourly summary reports.

Relaxed durability can also make sense in cases where the lifetime or relevance of data is brief. When the time or expense of enabling recovery would exceed the valuable life or the lifetime value of our data, we might reasonably decide that You Only Log Once and we’re fine with the reasonably low chance something will happen to this ephemeral data between creating it and the point at which we will be done making use of it.

On the other hand, there’s risk in dismissing critical data as “transient” or “metadata.” Decryption keys, user-to-block indices, edge graphs for user notification preferences, or group editing rights and history are all distinct from user-provided data, but still critical. Consider carefully the importance of even automatically generated data to how your users interact with your services and your core data, and how much pain losing it would inflict upon them.

We’ve seen operators invest in globally distributed, highly replicated architectures for storing and reading data, only to relegate large swaths of “metadata” to cold start active-passive systems that meant days of recovery were needed to restore user access to all that carefully guarded primary data in the event the hot copies were lost.

That could be a perfectly acceptable recovery strategy—unless you make money when users are able to access their data in real time, and lose money when they can’t. It doesn’t do much good to tell them their groundbreaking O’Reilly manuscript is completely safe and they have nothing to worry about in spite of the fact that your entire West Coast data center just slid into the Pacific Ocean, if you have to turn around and tell them that you should be able to restore the “unimportant” metadata that will allow you to locate the “real” data they care about approximately two weeks after their print deadline.

When it comes to the intersection of durability and availability, the world usually doesn’t care if the dog ate your data center. Regardless of the cause of data unavailability, your users only see the system as a whole, and are unlikely to be empathetic to the fine nuances of differences between “real data” and the data critical for that data to be operationally available.

  • Example SLO: 99.9999% of data will remain intact for the year.

Data Application Properties

Data application properties are system considerations about the ways data gets to and from users, or is transformed from one configuration to another. They are also the metaproperties that make the systems and data upon which we rely easier or harder to work with.

Putting these properties into SLOs encourages us to build systems with a greater reliability than the sum of their parts. They may look familiar because they apply to all services. Let’s discuss their significance in defining data quality and measuring data reliability.

Security

Security focuses on protecting data from unauthorized access, use, or alteration, and in some contexts can be referred to as privacy as well to emphasize particular aspects and duties. It also covers detection of unauthorized use, for ensuring appropriate confidentiality, integrity, and availability. Not all data needs to be secure, but for some datasets, security is essential.

Confidentiality in particular is a data security aspect worth calling out because it shares with durability and integrity (and to a lesser extent, consistency) the characteristic that once lost, it can be very difficult, if not impossible, to regain for a given dataset. Plan and instrument accordingly.

  • Example SLO: 99% of CVE/NVD4 vulnerability hotfixes are applied to 99.9% of vulnerable systems within 30 days of release.

  • Example SLO: 95% of administrator touches are performed through a secure system proxy, rather than direct access.

  • Example SLO: 99.9% of employees have been subject to a phishing simulation test within the last 3 months.

  • Example SLO: 99.9% of systems have been in scope for a penetration test within the last year.

  • Example SLO: 90% of customer notices are delivered within 4 hours of breach detection.

  • Example SLO: 99.99999% of processed data requests were from a known service with a valid SSL certificate.

Note

Security is too big of a topic to go into too much detail here. We mention it because we believe it is an important data application property.

Data with high security concerns includes:

  • Personnel data, such as employee salaries or Social Security numbers

  • Mobile location tracking

  • Personal health records that contain private information

  • Trade secrets

  • Government records (especially air-gapped classified data)

Note

How can you protect the confidentiality of data that’s used for other purposes, shared between teams, or used for service testing? Mask the data by hashing or tokenizing values.

Design options to invest in security include vulnerability scanning and patching, rate limiting, firewalls, HTTPS, SSL/TLS certificates, multifactor authentication, tokenization, hashing, authorization policies, access logs, encryption, and data erasure.

Availability

Availability is currently the most common SLO. Is the service able to respond to a request? Can the data be accessed now and over time? Service availability measures service uptime, response latency, and response success rate.

Availability hinges on timeliness and continuity of information. Information with high availability concerns is considered critical and must be accessible in a timely fashion and without fail. Low availability concerns may be for data that’s considered supplementary rather than necessary.

If having uninterrupted access to and availability of data is truly required, there needs to be engineering work to implement load balancing, system redundancy, data replication, and fractional release measures like blue/green or rolling deployments for minimizing any downtime. Recovery is a factor too. Ways to promote availability include automatic detection of and recovery from faults; quick, clear rollback paths; good backups; and the implementation of supervisor patterns.

  • Example SLO: 95% of records are available for reporting each business day.

  • Example SLO: 97% of query requests are successful each month.
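
A request-based availability SLI for the second of these SLOs might look like the following sketch (Python, hypothetical counts), including the error-budget arithmetic that tells you how many more failures the month can absorb.

# Hypothetical monthly counts from the query service's request logs.
total_requests = 2_400_000
successful_requests = 2_352_000

OBJECTIVE = 0.97

availability_sli = successful_requests / total_requests
error_budget = int(total_requests * (1 - OBJECTIVE))  # failures the SLO tolerates this month
failures_so_far = total_requests - successful_requests

print(f"Availability SLI: {availability_sli:.2%} (objective: {OBJECTIVE:.0%})")
print(f"Error budget remaining: {error_budget - failures_so_far:,} failed requests")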

Note

Know when you don’t need availability. If datasets have different availability concerns, look into tiered storage. Utilize cold storage for when low availability is acceptable, such as for backups kept only for legal reasons. Design the system based on the timing and frequency at which users need the data. If users only query data as a weekly report, don’t spend time ensuring 24/7 high availability.

Data with high availability concerns includes:

  • Customer-generated data, which might be key to a digital product, like cloud-based docs or customer relationship management database records

  • Any data that’s required to prevent downtime and user service disruption, such as in a credit card processing system

  • Billing and account data, which must remain accessible to facilitate business continuity

Scalability

Scalability is about growth—the capacity to be changed in size or scale. How much is the data volume expected to grow, and over what period of time? Scalability refers to a system’s ability to cope with increased load. Being intentional about scalability means rethinking the design whenever load increases, so that the system can manage demand without degraded performance. Options include vertical scaling (increasing the existing database’s resources) and horizontal scaling (adding more databases). The choice of elastic or manual scaling depends on the rate of change, predictability, and budget.

Anticipating changes can be difficult with some systems. Load and performance testing are a good way to profile expected scale needs at a point in time. It’s also important to understand fan-out in the pipeline. This can be accomplished by measuring load parameters such as cache hit rate or requests per second and determining which ones are the most important for the architecture of the system.

Note

As Martin Kleppmann writes in Designing Data-Intensive Applications (O’Reilly), “The architecture of systems that operate at large scale is usually highly specific to the application—there is no such thing as a generic, one-size-fits-all scalable architecture (informally known as magic scaling sauce). The problem may be the volume of reads, the volume of writes, the volume of data to store, the complexity of the data, the response time requirements, the access patterns, or (usually) some mixture of all of these plus many more issues.”

Understanding projected load and growth is important when determining the design of any service. With data applications, there are many tools to enable scaling: caching, implementing load balancers, adding more databases or database resources, database sharding, and using token authentication instead of server-based auth for stateless applications, to name a few.

  • Example SLO: Service instance utilization will exhibit an upper bound of 90% and a lower bound of 70% for 99% of minutes with system loads between 1,000 and 100,000 queries per second.

  • Example SLO: 99% of containers will deploy and join the active serving pool within 2 minutes.
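
The first of these SLOs can be evaluated from per-minute utilization samples, as in the sketch below (Python; the samples and load filter are hypothetical): only minutes whose load falls within the stated range count toward the SLI.

# Hypothetical per-minute samples: (queries per second, average instance utilization).
samples = [
    (1_500, 0.82),
    (45_000, 0.88),
    (120_000, 0.95),  # load outside 1,000-100,000 QPS: excluded from the SLI
    (8_000, 0.65),    # utilization below the 70% lower bound: a bad minute
    (30_000, 0.74),
]

in_scope = [(qps, util) for qps, util in samples if 1_000 <= qps <= 100_000]
good_minutes = sum(1 for _, util in in_scope if 0.70 <= util <= 0.90)

print(f"Scalability SLI: {good_minutes / len(in_scope):.0%} of in-scope minutes within bounds")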

Some examples of services with high scalability concerns are:

  • A service that needs to handle millions of concurrent users, such as a popular social media platform

  • Services that may experience surges or bursty traffic, such as a massive online retailer on Cyber Monday

Performance

Performance SLOs let us discover, profile, monitor, and reduce system bottlenecks. They help improve resource utilization and inform decisions as user needs evolve. There are two main categories:

Latency
How long does the service take to respond to a request? Latency is the time it takes for data packets to be stored or retrieved. It can measure how much time it takes to get a fully formed response, which may come from a cache, be queried from a precalculated result, or be built dynamically from source dependencies.
Throughput
How many requests can the service handle per minute? How many events can a data pipeline process in an hour? Throughput is a measure of how much work the data service can handle over a period of time.

  • Example SLO: 99.9% of database writes will respond within 120 ms.

  • Example SLO: 98% of query requests will complete in less than 10 ms.

  • Example SLO: 99% of 2 MB chunks will be transcoded in less than 500 ms.
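
Latency SLIs like the first example are typically computed from a distribution of timings, as in this sketch (Python, with hypothetical measurements): count the fraction of writes that complete within the 120 ms threshold, and look at a high percentile to see how much headroom remains.

import random
import statistics

# Hypothetical write latencies in milliseconds, e.g., sampled from traces.
random.seed(11)
latencies_ms = [random.lognormvariate(3.5, 0.5) for _ in range(10_000)]

THRESHOLD_MS = 120.0

within_threshold = sum(1 for ms in latencies_ms if ms <= THRESHOLD_MS)
sli = within_threshold / len(latencies_ms)

# 99.9th percentile: the last of 999 cut points.
p999 = statistics.quantiles(latencies_ms, n=1000)[-1]

print(f"Latency SLI: {sli:.3%} of writes within {THRESHOLD_MS:.0f} ms (objective: 99.9%)")
print(f"Observed p99.9 latency: {p999:.1f} ms")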

Services with high performance concerns include:

  • Financial trading systems that need to process large volumes of transactions as quickly as possible

  • Real-time systems, such as autonomous vehicle navigation and telemetry systems

  • Facial recognition and identity matching services

Resilience

Fault tolerance is the key to any distributed system. Prefer tolerating errors and faults, rather than trying to prevent them. Resilience, sometimes referred to as recoverability, is the ability of a system to return to its original state or move to a new desirable state after disruption. It refers to the capacity of a system to maintain functionality in the face of some alteration within the system’s environment. Resilient systems may endure environment changes without adapting or may need to change to handle those changes. The resilience of a data application can also impact its availability, completeness, and freshness.

Building resilient services requires rolling with the punches. Sustaining user functionality during/after a disruptive incident also improves other reliability dimensions, such as availability. When one or more parts of the system fail, the user needs to continue receiving an acceptable level of service. So how do we build services to withstand malicious attacks, human errors, unreliable dependencies, and other problems? Resilient services are designed with the understanding that failures are normal, degraded modes are available, and recovery is straightforward.

Measuring resilience requires knowing which services or transactions are critical to sustain for users, the speed of recovery from disruptions, and the proportion of requests that can be lost during a disruptive incident. To measure resilient performance, we can conduct manual architecture inspections and create metrics for automatically measuring system behavior. We’re limited by our creativity in simulating unexpected disruptive conditions, and therefore some organizations deploy chaos engineering for this purpose.

If resilience is a major concern for the users, optimize the system for quick and easy recovery from errors. Prioritize data recoverability with investment in tooling to backfill, recompute, reprocess, or replay data. Testing the recovery process needs to be a part of normal operations to detect vulnerability proactively.

  • Example SLO: 99.9% of bad canary jobs will be detected by the Canary Analysis Service and rolled back within 10 minutes of deployment.

  • Example SLO: 99.99% of failed MySQL promotions will be detected and restarted in less than 1 minute.

  • Example SLO: Privacy protection tools will detect and remove 97% of flagged records within 4 hours.

Data with high resilience concerns includes:

  • Customer-generated data, which might be key to a digital product

  • Any data that’s required to prevent downtime and user service disruption

  • Billing and account data, which must remain accessible to facilitate business continuity

Designing for resilience involves many of the same techniques we use to optimize for availability, durability, and performance: component redundancy, caching, load balancing, dynamic scaling, exponential backoff, timeouts, circuit breakers, input validation, stateless applications, and infrastructure as code.

Robustness

Complementary to resilience is robustness. The IEEE Standard Glossary of Software Engineering Terminology defines robustness as “The degree to which a system or component can function correctly in the presence of invalid inputs or stressful environmental conditions.” The lower the system’s dependency upon its environment and the broader the range of input values that the system can operate within, the more robust it is.

In a distributed system, your inputs are effectively infinite and unknowable, thanks to the vagaries of time and networks. The most robust systems need to evolve and adapt to situations that may not have existed at the time of development; chaos engineering is one way to probe for them. Input validation and sanitization, along with testing, also matter.

  • Example SLO: 99% of code (including configuration changes and flag flips) is deployed via an incremental and verified process.

  • Example SLO: 95% of changes successfully pass peer review before commit and push.

  • Example SLO: 99.99% of all code in all repositories has been scanned for common date bugs within the last 30 days.

Robustness can provide excellent benefit to our users, but ultimately its guardrails can only protect us against threats that we have already anticipated. Sooner or later we will encounter a condition we cannot tolerate and the system will fail, which is why resilience—the ability to recover quickly from a bad state and return to normal operations—can provide more critical long-term benefit, even if robustness seems a more effective approach to many people at first.

Data with high robustness concerns includes:

  • Customer-generated data, which might be key to a digital product

  • Any data that’s required to prevent downtime and user service disruption

  • Billing and account data, which must remain accessible to facilitate business continuity

System Design Concerns

Once a data application has SLOs/SLIs based on the properties described here, you can iterate on its design based on the properties’ relationships with system considerations, as shown in Table 11-1. There may be many possible designs for your application, but wherever a system design concern intersects with a property you care about, you want to be intentional about that aspect of the design.

Table 11-1. The intersection of data and service properties (left) with system design concerns (top)
Time Access Redundancy Sampling Mutability Distributed
Freshness x x x x
Completeness x x x x
Consistency x x x x
Accuracy x x x
Validity x
Integrity x x
Durability x x x
Security x x x
Availability x x x x x
Scalability x x
Performance x x x x
Resilience x x x x x
Robustness x x

Data Application Failures

What else distinguishes data reliability from other types of reliability? The many types of data errors, and their persistence in our applications. Because of the persistence of their impact, investments in considerations and practices to minimize those errors quickly become reasonable, rather than excessive. SLOs can help quantitatively determine how much investment is reasonable in order to build systems more reliable than the sum of their unreliable parts.

As data gets bigger and more complicated, the systems that support data get more complicated too. The boundaries between different “types” of data services blend. Can Kafka be considered a database? How about Route 53?5 Data services are optimized to store many types of data, either in flight or at rest: think databases, caches, search indexes, queues for stream processing, batch processing, and combination systems.

As discussed earlier, resilience and robustness characteristics will improve reliability: the failure of one component will not cause the entire system to fail. Fault tolerance is about designing systems in which faults can be deliberately triggered and tested, so that recovery paths are exercised rather than assumed.

Not all failure is created equal, nor handled the same. A fault is a condition that causes a failure—a deviation from spec. It may be correlated with other faults, but each fault is a discrete issue. An error is a failure wherein a single fault—or a collection of faults—results in a bad system state that is surfaced to the user or service calling our system. In an error, we have produced an outcome that cannot or should not be recovered to a known good state transparently.

Another way to think of this is that systems have errors, and components have faults, which can become errors if they are not remediated by the system. Errors can be handled gracefully (when we have anticipated the failure type and provided a means for the system to recover or continue) or ungracefully.

How should we handle a particular fault? Should we recover or retry transparently? Surface an error? Request client system or user action? Like everything else here, those decisions are rich with trade-offs.

Data systems are particularly prone to classes of fault that must be handled as errors, in that often bad data cannot be corrected without resorting to either user input or upstream data/processing/transformations that produced the data in the first place. Backfills, replays, or data munging to correct data takes time and effort.

Each data property has its own failure states, so we have to consider many types of data errors. How do you measure the impact of an outage? SLOs are a form of audit control for finding and fixing “bad data.” SLOs will help you understand which faults to pay attention to.

Other Qualities

Let’s briefly touch on some qualities of the system design concerns presented in Table 11-1, which we don’t have the time or room to get into in this chapter. They’re all important aspects of data reliability you should keep in mind, but for the sake of not making this chapter an entire book, we’ll skip in-depth conversations about them. We would be remiss, though, if we didn’t at least define them:

Time
Latency, throughput, bandwidth, efficiency, uptime. How fast can data be processed? How long does the service take to respond? Minimize response time via caching (local cache, distributed cache) and CDNs.
Access
Access control, authorization, authentication, permissions. Covers both policy and accessibility.
Redundancy
Backups, tiered storage, replication. Key to durability.
Sampling
P95, full coverage, fully representative versus biased.
Mutability
Write once, retention.
Distribution
Transactions (not canceling transactions based on a single failure, but handling rollbacks/graceful retries), idempotence, ordering, localization, parallelism, concurrency, load balancing, partition workloads.

Data Lineage

Services store data in various ways, for different purposes. Data flows through every layer in a technology stack. This makes it important to understand the reliability of upstream data. Imagine a web application served by a single database. The web server depends on the database to render its content. If the only source of the data is unavailable, the site is down. To handle this, we set up objectives and contingencies around interface boundaries. SLO dependencies are dictated by the service dependencies: downstream services must take the service objectives of upstream services into account.

Data can flow through an application like a river, which is probably why there are so many water-related metaphors in the space (streams, pools, data lakes). As the process goes from one step to the next, we’re moving downstream. Where in the process is our application’s data? Who are the upstream producers/publishers? Do these sources have SLOs? Who are the downstream consumers/subscribers of this data? How will they use the data?

A complex data reporting pipeline can consist of a dozen data services. Data lineage is the collected record of the data’s movement during its life cycle, as it moves downstream. Keeping track of lineage in data applications is important for determining both data uses and the system integration points. Any data handoff point between services may benefit from SLOs/SLIs.

Summary

The definition of reliable data depends on what users need from our service. But with well-chosen SLOs, we can describe and quantify those needs to guide our designs and measure how well we’re doing.

There are many sets of properties we can consider when setting data reliability SLOs. In addition to the properties common to any service, such as performance or availability, we’ve described several properties unique to data services. While a service property such as availability is usually ephemeral, the persistent nature of a data property raises the stakes for SLO misses. Any lapse in confidentiality, integrity, or durability may be an irrevocable loss.

In defining reliability SLOs, we must work with our users to establish quantifiable expectations. We can’t just trust our own hard-earned or cleverly derived knowledge and perspective. The problem here is not that we as application owners don’t know anything about our users’ experience of our service, but that we know so much that our understanding isn’t necessarily (or at least universally) true. Ask your users how they want to measure their service objectives. Agree with them on what “better” would look like across any set of data properties.

Modern organizations are often obsessed with “data quality.” They hire tons of engineers to think about it. But quality is ultimately subjective unless you can define and measure it, and it’s inextricably intertwined with the systems that collect, store, process, and produce our data. We must reframe these conversations, and use SLOs to provide a supporting framework of quantitative measurement to help define the mechanisms by which we provide users with reliable data.

1 Smaller and larger are relative terms here—we have worked on exabyte-scale datastores that relied on multiple thousands of replicated SQL shards for the metadata portion of the service.

2 Trusted Platform Modules (dedicated cryptoprocessor microcontrollers), not the folks who keep us honest about our post-incident reviews and action items, do backlog grooming, run our training programs, edit our handbooks, and generally handle the business of actually running things so that engineering managers have enough time to occasionally pretend like we’re still engineers.

3 Mostly because at that point, anyone who is working hard enough to successfully mess with consensus reality in spite of all this arguably deserves to win one, and we personally don’t want to have to contemplate the number of overtime hours necessary to detect, isolate, and unwind whatever shenanigans they’ve inflicted on our data. It may be genuinely better if we just accept that our flagship product is named Boaty McBoatface now.

4 Common Vulnerabilities and Exposures (CVE) and the National Vulnerabilities Database (NVD) are the two comprehensive databases that store information about necessary security fixes to software packages.

5 Hi, Corey!
