Chapter 1. What Is Observability?

In the software development industry, the subject of observability has garnered a lot of interest and is frequently found in lists of hot new topics. But with that surge of interest and adoption, complex topics are ripe for misunderstanding without a deeper look at the many nuances encapsulated by a simple topical label. This chapter looks at the mathematical origins of the term “observability” and examines how it has been adapted to describe characteristics of production software systems.

We also look at why the adaptation of observability for use in production software systems is necessary. Traditional practices for understanding the internal state of software applications rely on approaches that were designed for simpler legacy systems than those we typically manage today. As system architecture, infrastructure platforms, and user expectations have continued to evolve, the tools we use to reason about those components have not. Systems that only take aggregate measures into account don’t provide the type of visibility needed to isolate very granular anomalies. New methods for quickly finding needles buried in proverbial haystacks were born from necessity.

This chapter will help you understand what observability means, how to determine if a software system is observable, why observability is necessary, and how observability is used to find problems in ways that are not possible with other approaches.

The Definition of Observability

The term “observability” was coined by engineer Rudolf E. Kálmán in 1960. Since then, it has grown to mean many different things in different communities. Let’s explore the landscape before turning to our own definition of observability for modern software systems.

Observability was coined as a term describing mathematical control systems. If you are looking for a mathematical, process engineering–oriented textbook, you’ve come to the wrong place. Those books definitely exist: as any mechanical engineer or control systems engineer will inform you (usually passionately and at great length), observability has a formal meaning in traditional systems engineering terminology. In his paper “On the General Theory of Control Systems,” Kálmán introduces a characterization he calls observability.

In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.

This definition of observability would have you study observability and controllability as mathematical duals, along with sensors, linear algebra equations, and formal methods. This traditional definition of observability is the realm of mechanical engineers and those who manage physical systems with a specific end-state in mind. However, when adapted for use with squishier virtual software systems, that same concept opens up a radically different way of interacting with the code you write.
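For readers who are curious about that formal meaning, here is a minimal sketch of the classic linear-systems formulation, offered purely as background. For a linear time-invariant system

    \dot{x}(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t),

the internal state x is observable from the output y exactly when the observability matrix

    \mathcal{O} = \begin{bmatrix} C \\ CA \\ CA^{2} \\ \vdots \\ CA^{n-1} \end{bmatrix}

has full rank n, where n is the dimension of the state vector x. In other words, the external outputs carry enough information to reconstruct any internal state, which is, by analogy, the property we want from our telemetry.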

For modern software systems, our definition of observability is also a measure of how well the internal states of your application can be inferred from knowledge of its external outputs.

Let’s break that down a bit further. For a software application to have observability, the following things must be true. You must be able to:

  • Understand the inner workings of your application

  • Understand any system state your application may have gotten itself into

  • Understand the things above solely by observing your application from the outside with external tools

  • Understand that state no matter how extreme or how unusual

A good litmus test for determining if those conditions are true is to ask yourself the following questions:

  • Can you continually answer open-ended questions about the inner workings of your software to explain any anomalous values?

  • Can you understand what any particular user of your software may be experiencing?

  • Can you determine the things above even if you have never seen or debugged this particular state or failure before?

  • Can you determine the things above even if this anomaly has never happened before?

  • Can you ask arbitrary questions about your system and find answers without needing to predict what those anomalies would be in advance?

  • And can you do these things without having to ship any new code to handle or describe that state (which would have implied that you needed to understand it first in order to...understand it)?

Meeting all of the above criteria is a high bar for many software engineering organizations to clear. If you can clear that bar, you no doubt understand why observability has become such a popular topic for software engineering teams.

Put simply, our definition of observability for software systems is a measure of how well you can understand and explain any state your system can get into, no matter how novel or bizarre. You must be able to comparatively debug that bizarre or novel state across all dimensions of system state data, and combinations of dimensions, in an ad hoc manner, without being required to define or predict those debugging needs in advance. If you can understand that bizarre or novel state without shipping new code, then you have observability.

Before proceeding, there’s another definition of observability we need to address: the definition being promoted by SaaS developer tool vendors. These vendors are those who insist that “observability” has no special meaning whatsoever—that it is simply another synonym for telemetry, indistinguishable from monitoring. Proponents of this definition relegate observability to being another generic term for understanding how software operates. You will hear this contingent explain away observability as “three pillars” of things they can sell you that they already do today: metrics, logs, and traces.1

It is hard to decide which is more objectionable about this definition: its redundancy (why exactly do we need another synonym for telemetry?) or its epistemic confusion (why assemble a list of one data type, one anti-data type slash mess of strings, and one way of visualizing things in order by time?). Regardless, the logical flaw of this definition becomes clear when you realize its proponents have a vested interest in selling you the tools and mindsets built around the siloed collection and storage of data via their existing suites of metrics, logging, and tracing products. Proponents of this definition let their business models constrain how they think about future possibilities.

In fairness, we—the authors of this book—are also vendors in the observability space. However, we did not write this book to sell you our tools. We wrote it to explain how and why we adapted the original concept of observability to managing modern software systems. You can apply the concepts in this book, regardless of your tool choices, to practice building production software systems with observability. We believe that, as an industry, it is time to evolve how we manage modern software systems.

Observability for Software Systems

We believe that adapting the mathematical concept of observability for software systems is a unique approach that is worth unpacking. For modern software systems, observability is not about the data types or inputs, nor is it about mathematical equations. It is about how people interact with and try to understand their complex systems. Therefore, observability requires recognizing the interaction between both people and technology to understand how those complex systems work together.

If you accept that definition, many additional details emerge that demand answers:

  • How one gathers that data and assembles it for inspection

  • Technical requirements for processing that data

  • Team capabilities necessary to benefit from that data

We will get to those details and more throughout the course of this book. For now, let’s put some additional context behind observability as it applies to software.

The application of observability to software systems shares much with its systems engineering heritage, but it is far less mathematical and much more practical. In part, that’s because software engineering is a much younger and more rapidly evolving discipline than its more mature mechanical engineering predecessor. Production software systems are much less subject to formal proofs. That lack of rigor is, in part, a product of the scars we’ve earned by operating the software we write in production.

As engineers attempting to bridge the gap between theoretical practices encoded in clinical tests and the impact of what happens when our code runs at scale, we did not go looking for a new term, definition, or functionality to describe how we got there. It was the circumstances of managing our systems and teams that led us to evolve our practices away from concepts that simply no longer worked, like monitoring. As an industry, we need to move beyond the current gaps in tooling and terminology to get past the pain inflicted by outages and the lack of more proactive solutions.

Observability is the solution to that gap.

Our complex production software systems are a mess for a variety of both technical and social reasons. So it will take both social and technical solutions to dig us out of this hole. Observability is not the entire solution to all of our software engineering problems. But it does help you clearly see what’s happening in all the obscure corners of your software that you are stumbling around and trying to understand in the dark.

If you wake up in the morning and can’t find your glasses, you’re not going to see where you left your fork. And you certainly can’t eat if you can’t see your eggs on the table. When it comes to solving practical problems in software engineering, observability is a darn good place to start.

Why Observability Matters Now

Now that we’re on the same page about what observability means in the context of modern software systems, let’s talk about why this shift in approach matters now.

In short, our traditional approach of using metrics and monitoring to understand what our software is doing falls drastically short. It’s a fundamentally reactive approach. It may have served us well in the past, but modern systems demand a better one.

For the past two or three decades, the space between hardware and its human operators has been regulated by a set of tools and conventions we call “monitoring.” Practitioners have, by and large, inherited this set of tools and conventions and accepted it as the best approach for understanding that squishy virtual space between the physical and their code. And they have accepted this approach despite the knowledge that, in many cases, its inherent limitations have taken them hostage late into many sleepless nights of troubleshooting. Yet they still regard it with trust, and maybe even affection, because that captor is the best they have.

With monitoring, software developers can’t fully see their systems. They squint at them. They try, in vain, to size them up and to predict all the myriad ways they could possibly fail. Then they watch—they monitor—for those known failure modes. They set performance thresholds and arbitrarily pronounce them “good” or “bad.” They deploy a small robot army to check and re-check those thresholds on their behalf. They collect their findings into dashboards. They then organize themselves around those robots into teams, rotations, and escalations. When those robots tell them something is bad, they mobilize. Then, over time, they tend to those arbitrary thresholds like gardeners: pruning, tweaking, and fussing over the noisy signals they have asked the robots to stream at them.

Is This Really the Best Way?

For decades, that’s how developers and operators have done it. Monitoring has been the de facto approach for so long that they tend to think of it as the only way of understanding their systems, instead of just one way. Monitoring is such a default practice that it has become mostly invisible. They don’t question whether they should do it, only how.

The practice of monitoring is grounded in many unspoken assumptions about our systems (which we’ll detail below). But as systems continue to evolve—as they become more abstract, more complex, and as their underlying components begin to matter less and less—those assumptions become less true. As we continue to adopt modern approaches to deploying software systems (SaaS dependencies, container orchestration platforms, distributed systems, etc.), the cracks in those assumptions become more evident.

As those cracks become evident, more of us find ourselves slamming into the wall of inherent limitations and realizing that traditional monitoring practices are catastrophically ineffective at helping us understand these modern systems. The assumptions of metrics and monitoring are now failing us.

To understand why they fail, it helps to examine their history and intended context.

Why Are Metrics and Monitoring Not Enough?

In 1988, by way of SNMPv1 (Simple Network Management Protocol, version 1), the foundational substrate of monitoring was born: the metric. A metric is a single number, with tags optionally appended for grouping and searching those numbers. Metrics are, by their very nature, disposable and cheap. They have a predictable storage footprint. They’re easy to aggregate along regular time-series buckets. And, thus, the metric became the base unit for a generation or two of telemetry—the data we gather from remote endpoints for automatic transmission to monitoring systems.
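To make that concrete, here is a hedged sketch of what a metric data point can look like and how such points get rolled up into regular time-series buckets. It is not tied to any particular monitoring product, and the field and metric names are illustrative assumptions:

    from collections import defaultdict

    # A metric is just a named number, optionally decorated with a handful of
    # low-cardinality tags used for grouping and searching.
    point = {
        "name": "http.requests.count",             # hypothetical metric name
        "value": 1,
        "timestamp": 1700000000,                   # seconds since the epoch
        "tags": {"host": "web-1", "status": "200"},
    }

    def bucket(points, width_seconds=60):
        """Sum metric values into fixed time-series buckets, keyed by tag set."""
        buckets = defaultdict(float)
        for p in points:
            window = p["timestamp"] - (p["timestamp"] % width_seconds)
            key = (window, tuple(sorted(p["tags"].items())))
            buckets[key] += p["value"]
        return buckets

Everything about this shape is cheap and predictable, which is exactly why it became the base unit of telemetry; the cost, as we will see, is that the context around any individual request is discarded at write time.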

Many sophisticated apparatuses have been built atop the metric: time series databases, statistical analyses, graphing libraries, fancy dashboards, on-call rotations, ops teams, escalation policies, and a plethora of ways to digest and respond to what that small army of robots is telling you.

But there’s a practical limit to where this model serves you.

If you’ve crossed that limit, you know the change is abrupt: monitoring approaches seem to evolve continually, right up until they cease to work for you. What worked well enough last month simply does not work anymore. The inherent limitation becomes evident once you reach a tipping point of complexity.

It’s hard to quantify exactly when that tipping point is reached. Basically, what happens is that the sheer number of possible states the system could get itself into outstrips your team’s ability to pattern-match based on prior outages. Your team can no longer guess which dashboards should be created to display the innumerable failure modes that might arise. When that happens, the assumptions inherent to monitoring cease to be hidden; they very much become the bane of your team’s ability to understand what’s happening.

The hidden assumptions of metrics-based systems are that:

  • Your application is monolithic in nature

  • There is one stateful data store (“the database”)

  • Many low-level systems metrics are available and relevant (e.g., resident memory, CPU load average)

  • The application runs on VMs or bare metal, giving you full access to system metrics

  • You have a fairly static set of hosts to monitor

  • Engineers examine systems for problems only after problems occur

  • Dashboards and telemetry exist to serve the needs of operations engineers

  • Monitoring examines applications as “black boxes” whose internals are inaccessible

  • Monitoring solely serves the purposes of operations

  • The focus of monitoring is uptime and failure prevention

  • Examination of correlation occurs across a limited (or small) number of dimensions

When compared to the reality of modern systems, it becomes clear that traditional monitoring approaches fall short in several ways. The reality of modern systems is that:

  • There are many, many services to manage

  • There is polyglot persistence (i.e., many databases)

  • Infrastructure is extremely dynamic, with capacity flickering in and out of existence elastically

  • Many far-flung and loosely coupled services are managed, many of which are not directly under your control

  • Engineers actively check to see how changes to production code behave, in order to catch tiny issues early, before they create user impact

  • Automatic instrumentation is insufficient for understanding what is happening in complex systems

  • Software engineers own their own code in production and they are incentivized to proactively instrument their code and inspect the performance of new changes as they’re deployed

  • The focus of reliability has shifted to how much constant and continuous failure can be tolerated, while building resilience to user-facing failures by using constructs like error budgets, quality of service, and user experience

  • Examination of correlation occurs across a virtually unlimited number of dimensions

The last point is important, because it describes the breakdown that occurs between the limits of correlated knowledge that one human can be reasonably expected to think about and the reality of modern system architectures. There are so many possible dimensions involved in discovering the underlying correlations behind performance issues that no human brain, and in fact no schema, can possibly contain them.

With observability, the ability to compare many high-cardinality dimensions, and many combinations of high-cardinality dimensions, becomes a critical component of being able to discover otherwise hidden issues buried in complex system architectures.

Debugging with Metrics vs Observability

Beyond that tipping point of system complexity, it’s no longer possible to fit a model of the system into your mental cache. By the time you try to reason your way through its various components, your mental model is already likely to be out of date. As an engineer, you are probably used to debugging via intuition. To get to the source of a problem, it’s likely you feel your way along a hunch or use a fleeting memory of some outage long past to guide your investigation. But the skills that served you well in the past no longer apply in this world. The intuitive approach works only as long as most of the problems you encounter are variations of the same few predictable themes you’ve encountered before.

Similarly, the metrics-based approach of monitoring relies on having encountered known failure modes in the past. Monitoring helps detect when systems are over or under predictable thresholds that someone previously deemed to indicate an anomaly. But what happens when you don’t know that type of anomaly is even possible?

Historically, the majority of problems that software engineers encounter have been variants of somewhat predictable failure modes. Perhaps it wasn’t known that your software could fail quite in the manner that it did, but if you reasoned about the situation and its components, it wasn’t a logical leap to discover a novel bug or failure mode. It is a rare occasion for most software developers to encounter truly unpredictable leaps of logic because they haven’t typically had to deal with the type of complexity that makes it commonplace (until now, most of the complexity for developers has been in the app bundle).

Every application has an inherent amount of irreducible complexity. The only question is: Who will have to deal with it—the user, the application developer, or the platform developer?

Larry Tesler

Modern distributed systems architectures notoriously fail in novel ways that no one is able to predict and that no one has experienced before. This condition happens often enough that an entire set of assertions has been coined about the false assumptions that programmers new to distributed computing often make.2 Modern distributed systems are also made accessible to application developers as abstracted infrastructure platforms. As users of those platforms, application developers are now left to deal with an inherent amount of irreducible complexity that has landed squarely on their plates.

The previously submerged complexity of application code subroutines that interacted with one another inside the hidden internals of one physical machine’s random access memory has now surfaced as service requests between hosts. That newly exposed complexity then hops across multiple services, traversing an unpredictable network many times over the course of a single function. When modern architectures started to favor decomposing monoliths into microservices, software engineers lost the ability to step through their code with traditional debuggers. Meanwhile, their tools have yet to come to grips with that seismic shift.

Examples of this seismic shift can be seen in the trend toward containerization, the rise of container-orchestration platforms, the shift to microservices, the common use of polyglot persistence, the introduction of service meshes, the popularity of ephemeral autoscaling instances, serverless computing, lambda functions, and the myriad SaaS applications in a software developer’s typical toolset. Stringing these various tools together into a modern system architecture means that a request may perform 20 to 30 hops after it reaches the edge of the things you control (and likely multiply that by a factor of two if it includes database queries).

In modern cloud-native systems, the hardest thing about debugging is no longer understanding how the code runs but finding where in your system the code with the problem even lives. Good luck looking at a dashboard or a service map to see which node or service is slow, because distributed requests in these systems often loop back on themselves. Finding performance bottlenecks in these systems is incredibly challenging: when something gets slow, everything gets slow. Even more challenging, because cloud-native systems typically operate as platforms, the code may live in a part of the system that your team doesn’t even control.

In a modern world, debugging with metrics requires you to connect dozens of disconnected metrics that were recorded over the course of executing any one particular request, across any number of services or machines, to infer what might have occurred over the various hops needed for its fulfillment. The helpfulness of those dozens of clues depends entirely upon whether someone was able to predict, in advance, that a measurement being over or under a given threshold meant this action contributed to an anomalous failure mode that had never been encountered before.

By contrast, debugging with observability starts with a very different substrate: a deep context of what was happening when this action occurred. Debugging with observability is about preserving as much of the context around any given request as possible, so that you can reconstruct the environment and circumstances that triggered the bug that led to a novel failure mode.
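As a rough illustration of what preserving that context can look like, here is a hedged sketch of a single wide, structured event emitted once per request from one service. Every field name is an assumption chosen for illustration rather than a prescribed schema:

    import json
    import time

    # One wide event per request, per service: identifiers, versions, timing,
    # and outcome, all kept together instead of being flattened into metrics.
    event = {
        "timestamp": time.time(),
        "service": "checkout",                    # hypothetical service name
        "request_id": "req-7f3a9c",               # high-cardinality identifiers
        "user_id": "user-184467",
        "cart_id": "cart-99213",
        "build_id": "2024.06.11-3",
        "k8s_pod": "checkout-5d9c7-xkz2p",
        "endpoint": "/api/v1/checkout",
        "duration_ms": 1834.2,
        "db_call_count": 11,
        "error": None,
    }

    # Emit the event as structured output for an observability backend to store.
    print(json.dumps(event))

Because the fields travel together, you can later reconstruct the circumstances of any one request instead of inferring them from disconnected aggregates.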

The Role of Cardinality

In the context of databases, cardinality refers to the uniqueness of data values contained in a set. Low cardinality means that a column has a lot of duplicate values in its set. High cardinality means that the column contains a large percentage of completely unique values. A column containing a single value will always be the lowest possible cardinality. A column containing unique IDs will always be the highest possible cardinality.

For example, if you had a collection of a hundred million user records, you can assume that userID numbers will have the highest possible cardinality. First name and last name will be high cardinality, though lower than userID because some names repeat. A field like gender would be fairly low-cardinality given the non-binary, but finite, choices it could have. A field like species would be the lowest possible cardinality, presuming all of your users are humans.
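A quick way to internalize the difference is to compute it. The sketch below counts the distinct values in each field across a handful of made-up user records; the records and field names are purely illustrative:

    # Compare cardinality by counting distinct values per field.
    users = [
        {"user_id": "u-1001", "first_name": "Ana",  "gender": "female",    "species": "human"},
        {"user_id": "u-1002", "first_name": "Ben",  "gender": "male",      "species": "human"},
        {"user_id": "u-1003", "first_name": "Ana",  "gender": "female",    "species": "human"},
        {"user_id": "u-1004", "first_name": "Chen", "gender": "nonbinary", "species": "human"},
    ]

    for field in users[0]:
        distinct = {u[field] for u in users}
        print(f"{field}: {len(distinct)} distinct value(s)")

    # user_id is unique per record (highest cardinality); species collapses to
    # a single value (lowest cardinality); the other fields fall in between.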

Cardinality matters for observability, because high-cardinality information is the most useful data for debugging or understanding a system. Consider the usefulness of sorting by fields like user IDs, shopping cart IDs, request IDs, or any of the myriad other IDs for instances, containers, build numbers, spans, and so forth. Being able to query against unique IDs is the best way to pinpoint individual needles in any given haystack.

Unfortunately, metrics-based tooling can deal only with low-cardinality dimensions at any reasonable scale. Even if you have merely hundreds of hosts to compare, with metrics-based systems you can’t use hostname as an identifying tag without hitting the limits of your cardinality keyspace. These inherent limitations place unintended restrictions on the ways that data can be interrogated. When debugging with metrics, for every question you may want to ask of your data, you have to decide—in advance, before a bug occurs—what you need to inquire about so that its value can be recorded when that metric is written.

That inherent limitation has two big implications. First, if during the course of investigation, you decide that an additional question must be asked to discover the source of a potential problem, that cannot be done after the fact. You must first go set up the metrics that might answer that question and wait for the problem to happen again. Second, because it requires another set of metrics to answer that additional question, most metrics-based tooling vendors will charge you for recording that data. Your cost increases linearly with every new way you decide to interrogate your data to find hidden issues you could not have possibly predicted in advance.

Debugging with Observability

Conversely, observability tools encourage developers to gather rich telemetry for every possible event that could occur, passing along the full context of any given request, and storing it for possible use at some point down the line. Observability tools are specifically designed to query against high cardinality data. What that means for debugging is that you can interrogate your event data in any number of arbitrary ways. You can ask new questions that you did not need to predict in advance and find answers to those questions, or clues that will lead you to ask the next question. You repeat that pattern again and again, until you find the needle in the proverbial haystack that you’re looking for.

The ability to interrogate your event data in arbitrary ways means that you can ask any question about your system and inspect its corresponding internal state. That means you can investigate and eventually understand any state your system has gotten itself into—even if you have never seen that state before—without needing to predict what those states might be in advance.
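As a hedged sketch of what interrogating event data in arbitrary ways can look like, the snippet below filters and groups wide events by a dimension chosen at query time rather than predicted in advance. The field names (duration_ms, build_id, and so on) are assumptions carried over from the earlier illustration:

    from collections import Counter

    def slow_requests_by(events, dimension, threshold_ms=1000):
        """Group slow requests by any dimension chosen at query time.

        `events` is assumed to be a list of wide, structured events like the
        one sketched earlier: one dict per request, with many fields.
        """
        counts = Counter(
            e.get(dimension, "unknown")
            for e in events
            if e.get("duration_ms", 0) > threshold_ms
        )
        return counts.most_common(10)

    # The same function answers questions nobody predicted when the data was
    # written: break slow requests down by build, by pod, by endpoint, or even
    # by a single user.
    #   slow_requests_by(events, "build_id")
    #   slow_requests_by(events, "k8s_pod")
    #   slow_requests_by(events, "user_id")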

Again, observability means that you can understand and explain any state your system can get into—no matter how novel or bizarre—without shipping new code.

The reason monitoring worked so well for so long is that systems tended to be simple enough that engineers could reason about exactly where they might need to look for problems and how those problems might present themselves. For example, it’s relatively simple to connect the dots: when sockets fill up, CPU overloads, and the solution is to add more capacity by scaling application node instances, by tuning your database, and so forth. Engineers could, by and large, predict the majority of possible failure states up front and discover the rest the hard way once their applications were running in production.

However, monitoring creates a fundamentally reactive approach to system management. You can catch failure conditions that you predicted and knew to check for. If you know to expect it, you check for it. For every condition you don’t know to look for, you have to see it first, deal with the unpleasant surprise, investigate it to the best of your abilities, possibly reach a dead end that requires you to see that same condition multiple times before properly diagnosing it, and then you can develop a check for it. In that model, engineers are perversely incentivized to have a strong aversion to situations that could cause unpredictable failures. This is partially why some teams are terrified of deploying new code (more on that topic later).

Observability Is for Modern Systems

A production software system is observable to the extent that you can understand new internal system states without having to make arbitrary guesses, predict those failure modes in advance, or ship new code to understand that state. In this way, we extend the control theory concept of observability to the field of software engineering.

In its software engineering context, observability does provide benefits for more traditional architectures or monolithic systems. For example, it could certainly save teams from having to discover unpredictable failure modes in production the hard way. But the benefits of observability are absolutely critical when using modern distributed systems architectures.

In distributed systems, the ratio of somewhat predictable failure modes to novel, never-before-seen failure modes is heavily weighted toward the bizarre and unpredictable. Those unpredictable failure modes happen commonly enough, and repeat rarely enough, that they outpace the ability of most teams to set up appropriate and relevant monitoring dashboards that could easily surface that state to the engineering teams responsible for ensuring the continuous uptime, reliability, and acceptable performance of their production applications.

This book is written with these types of modern systems in mind. Any system consisting of many components that are loosely coupled, dynamic in nature, and difficult to reason about is a good fit for realizing the benefits of observability over traditional management approaches. If you manage production software systems that fit that description, this book describes what observability can mean for you, your team, your customers, and your business. We will also focus on the human factors necessary to develop a practice of observability in key areas of your engineering processes.

1 Sometimes these claims include time spans to signify “discrete occurrences” as a fourth pillar of a generic synonym for telemetry.

2 See: The Fallacies of Distributed Computing
