Chapter 3. Observability-Driven Development

Observability is a practice that fundamentally helps engineers improve how their software runs in production. That practice isn’t applicable only after software is released to production. Observability can, and should, be a part of the software development life cycle itself. This chapter focuses on the practice of observability-driven development and how to shift observability left.1

Test-driven development

Today’s gold standard for testing software prior to its release to production is test-driven development (TDD).2 Over the last two decades, TDD has arguably become one of the more successful practices to take hold across the software development industry. It has provided a very useful framework for shift-left testing that catches, and prevents, many potential problems long before they reach production. Widely adopted across the industry, TDD should be credited with raising the quality of code running production services.

TDD is a powerful practice that provides engineers with a clear way to think about software operability. Applications are defined by a deterministic set of repeatable tests that can be run hundreds of times per day. If those repeatable tests pass, the application must be running as expected. Before changes to the application are actually written, they start as a set of new tests that verify the change will work as expected. A developer then writes new code until those tests pass.

TDD is particularly powerful because tests run the same way every time. Data typically doesn’t persist between test runs; it is dropped and recreated from scratch for each run. Responses from underlying or remote systems are stubbed or mocked. With TDD, developers are tasked with creating a specification that precisely defines the expected behaviors for an application in a controlled state. The role of tests is to identify any unexpected deviations from that controlled state so they can be dealt with immediately. In doing so, TDD removes guesswork and provides consistency.
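
To make that controlled state concrete, here is a minimal sketch in Python using the standard unittest library. The checkout function and payment gateway are hypothetical, and the remote dependency is mocked so that every run behaves identically:

    import unittest
    from unittest.mock import Mock

    def checkout(cart, payment_gateway):
        """Charge the cart total and return an order confirmation."""
        total = sum(item["price"] for item in cart)
        charge = payment_gateway.charge(amount=total)
        return {"status": "confirmed", "charge_id": charge["id"], "total": total}

    class CheckoutTest(unittest.TestCase):
        def test_checkout_charges_cart_total(self):
            # The remote payment system is replaced with a mock, so the test
            # runs the same way every time: no network calls, no persisted state.
            gateway = Mock()
            gateway.charge.return_value = {"id": "ch_123"}

            order = checkout([{"price": 5}, {"price": 7}], gateway)

            gateway.charge.assert_called_once_with(amount=12)
            self.assertEqual(order["status"], "confirmed")
            self.assertEqual(order["total"], 12)

    if __name__ == "__main__":
        unittest.main()

Because the gateway is a mock, the test is deterministic: no network, no persisted data, no deviation from the specification it encodes.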

But that very consistency and isolation also limits what TDD can reveal about how your software behaves in production. Running isolated tests doesn’t tell you whether customers are having a good experience with your service. Nor does passing those tests mean that any errors or regressions that do slip through could be quickly and directly isolated and fixed before that code is released back into production.

Any reasonably experienced engineer responsible for managing software running in production can tell you that production environments are anything but consistent. Production is full of interesting deviations that your code might encounter out in the wild, but that have been excised from tests because they’re not repeatable, they don’t quite fit the specification, or they don’t go according to plan. While the consistency and isolation of TDD makes your code tractable, it does not prepare your code for the interesting anomalies that should be surfaced, watched, stressed, and tested because they ultimately shape how your software behaves when real people start interacting with it.

Observability can help you write and ship better code even before it lands in source control—because it’s part of the set of tools, processes, and culture that allow engineers to find bugs in their code quickly.

Observability in the development cycle

Catching bugs cleanly, resolving them swiftly, and preventing them from becoming a backlog of technical debt that weighs down the development process all depend on a team’s ability to find those bugs quickly. Yet software development teams often hinder that ability for a variety of reasons.

For example, consider organizations where software engineers aren’t responsible for operating their software in production. Engineers merge their code into master, cross their fingers in hopes that this change won’t be the one that breaks prod, and essentially wait to get paged if a problem occurs. Sometimes they get paged soon after deployment; the deployment is rolled back, and the triggering changes can be examined for bugs. But more likely, problems aren’t detected for hours, days, weeks, or months after the code was merged. By that time, it is extremely difficult to pick out the origin of the bug, remember the context, or decipher the original intent behind why that code was written or why it shipped.

Resolving bugs quickly depends critically on being able to examine the problem while the original intent is still fresh in the original author’s head. It will never again be as easy to debug a problem as it is right after the code was written and shipped; it only gets harder from there, so speed is key. At first glance, the links between observability and writing better software may not be clear. But it is this need to debug quickly that deeply intertwines the two.

Determining where to debug

Newcomers to observability often make the mistake of thinking that observability is a way to debug your code, similar to using highly verbose logging. While it’s possible to debug your code using observability tools, that is not their primary purpose. Observability operates on the order of systems, not on the order of functions. Emitting enough detail at the level of individual lines of code to reliably debug logic would produce far more output than most observability systems can affordably store and query at scale. Paying for a system capable of doing that would simply be impractical, because it would likely cost somewhere in the ballpark of 1x to 10x as much as the system it observes.

Observability is not for debugging your code logic. Observability is for figuring out where in your systems to find the code you need to debug. Observability tools help you swiftly narrow down where problems may be occurring. From which component did an error originate? Where is latency being introduced? Where did a piece of this data get munged? Which hop is taking up the most processing time? Is that wait time evenly distributed across all users, or is it experienced by only a subset of them? Observability helps your investigation pinpoint the likely sources of a problem.

Often, observability will also give you a good idea of what might be happening in or around an affected component, what the bug might be, or even provide hints as to where the bug is actually happening: your code, the platform’s code, or a higher-level architectural object.

Once you’ve identified where the bug lives and some qualities about how it arises, then observability’s job is done. From there, if you want to dive deeper into the code itself, the tool you want is a good old-fashioned debugger (for example, gdb). Once you suspect how to reproduce the problem, you can spin up a local instance of the code, copy over the full context from the service, and continue your investigation. While they are related, the difference between an observability tool and a debugger is an order of scale; like a telescope and a microscope, they may have some overlapping use cases, but they are primarily designed for different things.

These are examples of the different paths you might take with debugging versus observability (a rough sketch of the kind of breakdown both paths rely on follows the list):

  • You see a spike in latency. You start at the edge; break down by endpoint; calculate average, 90th-, and 99th-percentile latency; identify a cadre of slow requests; and trace one of them. The trace shows the timeouts begin at service3. You copy the context from the traced request into your local copy of the service3 binary and attempt to reproduce it in the debugger.

  • You see a spike in latency. You start at the edge; break down by endpoint; calculate average, 90th-, and 99th-percentile latency; and notice that only the write endpoints are suddenly slower. You break down by db destination host and note that the slowness is distributed across some, but not all, of your db primaries. Among those primaries, it is happening only to ones of a certain instance type or in a particular availability zone (AZ). You conclude the problem is not a code problem, but one of infrastructure.
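
Both paths above lean on the kind of breakdown an observability tool performs: slice request events by one dimension after another until the slow cohort stands out. Here is a minimal sketch of that breakdown in plain Python; the event fields (endpoint, duration_ms, db_host) are hypothetical stand-ins for whatever context your instrumentation actually captures:

    from collections import defaultdict
    from statistics import quantiles

    # Each event is one request's captured context; the field names are hypothetical.
    events = [
        {"endpoint": "/export", "duration_ms": 2300.0, "db_host": "primary-3"},
        {"endpoint": "/export", "duration_ms": 1800.0, "db_host": "primary-7"},
        {"endpoint": "/home",   "duration_ms": 42.0,   "db_host": "primary-1"},
        {"endpoint": "/home",   "duration_ms": 38.0,   "db_host": "primary-1"},
        # ... in practice, thousands or millions of events from production ...
    ]

    def latency_by(events, field):
        """Group request durations by one field and summarize their distribution."""
        groups = defaultdict(list)
        for event in events:
            groups[event[field]].append(event["duration_ms"])
        summary = {}
        for key, durations in groups.items():
            # quantiles() needs at least two points; pad tiny groups for this sketch.
            cuts = quantiles(durations, n=100) if len(durations) > 1 else [durations[0]] * 99
            summary[key] = {
                "count": len(durations),
                "avg": sum(durations) / len(durations),
                "p90": cuts[89],
                "p99": cuts[98],
            }
        return summary

    # Start at the edge: break down by endpoint, then by db destination host,
    # narrowing in on where the latency is actually being introduced.
    print(latency_by(events, "endpoint"))
    print(latency_by(events, "db_host"))

In a real observability tool, this grouping and percentile math happens over enormous volumes of events in a query engine rather than in application code, but the shape of the investigation is the same.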

Debugging in the time of microservices

When viewed through this lens, it becomes very clear why the rise of microservices is tied so strongly to the rise of observability. Software systems used to have fewer components, which meant they were easier to reason about. An engineer could think their way through all possible problem areas using only low-cardinality tagging, and to understand their code logic they could simply reach for a debugger or an IDE. But once monoliths started being decomposed into many distributed microservices, the debugger no longer worked as well, because it couldn’t hop the network.3

Once service requests started traversing networks to fulfill their functions, all kinds of additional operational, architectural, infrastructural, and other assorted categories of complexity became irrevocably intertwined with the logic bugs we unintentionally shipped and inflicted on ourselves.

In a monolithic system, if a certain function slows down immediately after a change that modified it ships, your debugger makes the cause obvious. In a distributed system, such a change could manifest in several ways. For example, you might notice:

  • That a particular service is getting slower

  • That dependent services upstream of that service are also getting slower

  • That downstream services called by the first service are also getting slower

  • All of the above

Furthermore, regardless of which of those manifestations you encounter, it might still be incredibly unclear whether that slowness is being caused by:

  • A bug in your code

  • A particular user changing their usage pattern

  • A database overflowing its capacity

  • Network connection limits

  • A misconfigured load balancer

  • Issues with service registration or service discovery

  • Some combination of the above

Without observability, all you may see is that all of your performance graphs are spiking or dipping at the same time.
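
Because the debugger can’t hop the network, distributed tracing has to carry that context across the hop instead. As a rough sketch of what that looks like, the following uses the OpenTelemetry Python API (one possible choice, not something this chapter prescribes); the service names, HTTP client, and attributes are hypothetical, and an SDK with an exporter is assumed to be configured elsewhere:

    from opentelemetry import trace
    from opentelemetry.propagate import extract, inject

    tracer = trace.get_tracer("example-instrumentation")

    def call_service3(payload: bytes):
        """Client side: attach the current trace context to the outgoing request."""
        with tracer.start_as_current_span("call-service3") as span:
            span.set_attribute("payload.bytes", len(payload))
            headers = {}
            inject(headers)  # writes trace context (e.g., a traceparent header) into the dict
            # http_client.post("https://service3.internal/", data=payload, headers=headers)

    def handle_request(headers: dict, payload: bytes):
        """Server side (service3): continue the same trace across the network hop."""
        ctx = extract(headers)
        with tracer.start_as_current_span("service3-handler", context=ctx) as span:
            span.set_attribute("db.host", "primary-3")  # hypothetical attribute
            # ... do the work; a timeout here shows up as a long span in the same trace ...

With the context propagated, a timeout inside service3 appears as a slow span in the same trace that began at the edge, which is what makes the breakdowns described earlier possible.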

How instrumentation drives observability

Observability helps you pinpoint where problems originate, spot common outlier conditions, and identify which half a dozen or more things must all be true for an error to occur. Observability is also ideal for swiftly determining whether problems are restricted to a particular build ID, set of hosts, instance type, container version, kernel patch version, database secondary, or any number of other architectural details.

For that to be true, a necessary component of observability is a focus on creating useful instrumentation. Good instrumentation drives observability. One way to think about how instrumentation is useful is to consider it in the context of pull requests: a pull request should never be submitted or accepted without first asking the question, “How will I know if this change is working as intended or not?”

A helpful goal when developing instrumentation is to create reinforcement mechanisms and shorter feedback loops. In other words, tighten the loop between shipping code and feeling the consequences of errors. This is also known as “putting the software engineers on call.” One way to achieve this is to automatically page the person who just merged the code being shipped: for a brief period, say 30 minutes to an hour, any alert triggered in production gets routed to them. When engineers experience their own code in production, their ability (and motivation) to instrument that code for faster isolation and resolution of issues naturally increases. This feedback loop is not punishment; rather, it is essential to code ownership. We cannot develop the instincts and practices needed to ship quality code if we are insulated from the feedback of our errors.

Every engineer should be expected to instrument their code such that they can answer these questions as soon as the code is deployed (a sketch of instrumentation that supports them follows the list):

  • Is your code doing what you expected it to do?

  • How does it compare to the previous version?

  • Are users actively using your code?

  • Are there any emerging abnormal conditions?
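
One way to make those questions answerable (a sketch, not a prescribed method) is to emit a single structured event per request that carries the build ID, the user, the code path taken, and the outcome. The field names and the print-to-stdout transport here are hypothetical stand-ins for your actual telemetry pipeline:

    import json
    import time
    import uuid

    BUILD_ID = "hypothetical-build-sha"  # stamped in at deploy time

    def handle_request(user_id, feature_flags, do_work):
        """Emit one structured event per request, carrying the context needed to
        compare this build against the previous one and to spot abnormal conditions."""
        event = {
            "request_id": str(uuid.uuid4()),
            "build_id": BUILD_ID,        # is the new version behaving like the old one?
            "user_id": user_id,          # is anyone actually using this code?
            "flags": feature_flags,      # which code path did the request take?
            "timestamp": time.time(),
        }
        start = time.perf_counter()
        try:
            result = do_work()
            event["status"] = "ok"
            return result
        except Exception as exc:
            event["status"] = "error"
            event["error"] = repr(exc)   # emerging abnormal conditions surface here
            raise
        finally:
            event["duration_ms"] = (time.perf_counter() - start) * 1000
            print(json.dumps(event))     # stand-in for sending to your telemetry backend

    # Example usage with a hypothetical unit of work:
    handle_request("user-42", {"new_checkout_flow": True}, lambda: "rendered page")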

A more advanced approach is to enable engineers to test their code against a small subset of production traffic. With sufficient instrumentation, the best way to understand how a proposed change will behave in production is to deploy it to production and measure it there, in a controlled way. That can be done in several ways. For example, a new feature can be deployed behind a feature flag and exposed to only a subset of users. Alternatively, a feature can be deployed directly to production, with only select requests from particular users routed to it. These approaches shorten feedback loops to mere seconds or minutes, rather than the substantially longer wait for a full release cycle.
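
As a sketch of the feature-flag variant, the following gates a hypothetical new checkout flow to a small percentage of users and records which path served each request, so that the per-request instrumentation from the previous sketch can compare old and new behavior side by side. The flag name, rollout percentage, and both flow functions are invented for illustration:

    import hashlib

    ROLLOUT_PERCENT = 5  # expose the new code path to roughly 5% of users

    def flag_enabled(flag_name, user_id, rollout_percent=ROLLOUT_PERCENT):
        """Deterministically bucket users so the same user always sees the same path."""
        digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100 < rollout_percent

    def old_checkout_flow(user_id):
        return "old checkout page"

    def new_checkout_flow(user_id):
        return "new checkout page"

    def render_checkout(user_id, event):
        # Record which path served the request so old and new behavior can be
        # compared side by side in your observability tooling.
        use_new_flow = flag_enabled("new-checkout-flow", user_id)
        event["flags.new_checkout_flow"] = use_new_flow
        return new_checkout_flow(user_id) if use_new_flow else old_checkout_flow(user_id)

    # Example: the event dict would be the per-request event from the previous sketch.
    event = {}
    print(render_checkout("user-42", event), event)

Deterministic bucketing by user ID means the same user always lands on the same side of the flag, which keeps comparisons between the two paths clean.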

If you are capturing sufficient instrumentation detail in the context of your requests, you can systematically start at the edge of any problem and work your way to the correct answer every single time, with no guessing, intuition, or prior knowledge needed. This is one revolutionary advance observability has over monitoring systems, and does a lot to move operations engineering into the realm of science, not magic.

Shifting observability left

While test-driven development ensures that developed software adheres to an isolated specification, observability-driven development ensures that software works in the messy reality of a production environment: strewn across a complex infrastructure, running at a particular point in time, experiencing fluctuating workloads, and serving particular users.

Building instrumentation into our software early in the development lifecycle allows engineers to more readily consider, and more quickly see, the impact that small changes truly have in production. By focusing only on adherence to an isolated spec, teams inadvertently create conditions where they have no visibility into the chaotic playground where that software comes into contact with messy and unpredictable people problems. As we’ve seen in previous chapters, traditional monitoring approaches reveal only an aggregate view of measures that were developed in response to known issues triggering alerts. Traditional tools provide very little ability to accurately reason about what happens in complex modern software systems.

As a result, teams often approach production as a glass castle: a beautiful monument to their collective design over time, but one they’re afraid to touch because any unforeseen movement could shatter the entire structure. By developing the engineering skills to write, deploy, and use good telemetry to understand behaviors in production, teams become empowered to reach further and further into the development lifecycle to consider the nuances of detecting unforeseen anomalies that could be roaming around their castles unseen.

Observability-driven development is what allows engineering teams to turn their glass castles into playgrounds. Production environments aren’t immutable; they’re full of action, and engineers should be empowered to confidently walk into any game and score a win. But that only happens when observability isn’t considered solely the domain of SREs, infrastructure engineers, or operations teams. Software engineers must adopt observability and work it into their development practices in order to unwind the cycle of fear they’ve developed over making any changes to production.

1 https://en.wikipedia.org/wiki/Shift-left_testing

2 https://en.wikipedia.org/wiki/Test-driven_development

3 As a historical aside, the phrase “strace for microservices” was an early attempt to describe this new style of understanding system internals before the word “observability” was adapted to fit the needs of introspecting production software systems.
