Chapter 2. How Observability Relates to DevOps and Cloud Native

So far, we’ve referenced observability in the context of modern software systems. Therefore, it’s important to unpack how observability fits into the landscape of other modern practices such as the DevOps, SRE (Site Reliability Engineering), and Cloud Native movements. This chapter examines how these movements have both influenced the need for observability and integrated it into their practices.

Observability does not exist in a vacuum; instead, it is both a consequence and an integral part of the DevOps, SRE, and Cloud Native movements. Like testability, observability is a property of these systems that improves understanding of them. Rather than being something that we add once, or having a one-size-fits-all solution, observability and testability require continuous investment. As they improve, benefits accrue for the developers and end users of our systems. By examining why the DevOps, SRE, and Cloud Native movements created a need for observability and integrated its use, we can better understand why observability has become a mainstream topic and why increasingly diverse teams are adopting this practice.

Cloud Native, DevOps, and SRE in a Nutshell

In contrast with the monolithic and waterfall development approaches employed from the 1990s to the early 2000s, modern software development and operations teams increasingly use Cloud Native and Agile methodologies. In particular, these methodologies enable teams to autonomously release features without tightly coupling their impact to other teams. That capability unlocks several key business benefits, including higher productivity, better profitability, and more.1 For example, the ability to resize individual service components on demand and to pool resources across a large number of virtual and physical servers means the business benefits from better cost controls and scalability.

However, these benefits are not free. An often overlooked aspect of introducing these capabilities is the management cost that comes with them. Abstracted systems with dynamic controls introduce new challenges of emergent complexity and nonhierarchical communication patterns. Older monolithic systems had less emergent complexity, so simpler monitoring approaches sufficed: we could easily reason about what was happening inside those systems and where unseen problems might be occurring. Today, running Cloud Native systems at scale demands more advanced sociotechnical practices like observability.

The Cloud Native Computing Foundation defines Cloud Native as “building and running scalable applications in modern, dynamic environments... [Cloud Native] techniques enable loosely coupled systems that are resilient, manageable, and observable. Combined with robust automation, they allow engineers to make high-impact changes frequently and predictably with minimal toil.”2 By minimizing toil3 (repetitive manual human work) and emphasizing observability, Cloud Native systems empower developers to be creative. This definition focuses not just on scalability, but also on development velocity and operability as goals.

It’s worth emphasizing that the shift to Cloud Native requires more than adopting a complete set of new technologies; it also requires changing how people work. That shift is inherently sociotechnical. On the surface, using the toolchain itself carries no explicit requirement to adopt new social practices. But to achieve the promised benefits of the technology, the way people work must also change. Although this should be evident from the stated definition and goals, it’s not uncommon for teams to get several steps into an adoption before realizing that their old work habits do not help them address the management costs introduced by the new technology. That is why successful adoption of Cloud Native design patterns is inextricably tied to the need for observable systems and for DevOps and SRE practices.

Similarly, DevOps and SRE both highlight a desire to shorten feedback loops and reduce operational toil in their definitions and practices. DevOps provides “Better Value; Sooner, Safer, & Happier”4 through culture and collaboration between development and operations groups. SRE joins systems engineering and software development skill sets to solve complex operational problems by building software systems rather than relying on manual toil.5 As we’ll explore in this chapter, the combination of Cloud Native technology, DevOps and SRE methodologies, and observability is stronger than the sum of its individual parts.

Observability: Debugging Then vs. Now

The goal of observability is to provide a level of introspection that helps people reason about the internal state of their systems and applications. That introspection can be achieved in a number of ways; for example, by using a combination of logs, metrics, and traces. But the goal of observability itself is agnostic to how it’s accomplished.

For monolithic systems, where we could anticipate the potential areas of failure, one person in isolation could debug the system and achieve appropriate observability using verbose application logging, or coarse system-level metrics such as CPU and disk utilization, combined with flashes of insight. However, these legacy tools and instinctual techniques no longer work for the new set of management challenges created by the opportunities of Cloud Native systems.

Among the example technologies mentioned in the Cloud Native definition are containers, service meshes, microservices, and immutable infrastructure. Compared to legacy technologies like virtual machines and monolithic architectures, containerized microservices inherently introduce new problems, such as cognitive overload from the interdependencies between components, transient state discarded after a container restart, and incompatible versioning between separately released components. Immutable infrastructure means that it’s no longer feasible to ssh into a host for debugging, as doing so may perturb the state on that host. Service meshes add an additional routing layer that provides a powerful way to collect information about how service calls are happening, but that data is of limited use without the ability to store and analyze it later.

Debugging anomalous issues requires a new set of capabilities to help engineers detect and understand problems from within their systems. Tools such as distributed tracing can help capture the state of system internals at the time specific events occurred. By adding wide context to each event, it’s possible to create a rich view of what was happening in all the parts of a system that are typically hidden and impossible to reason about. For example, if engineers can systematically drill down into which hosts should be examined using nested sets of metrics aggregated at different levels (or exemplar wide events), then it no longer matters that logs are sharded across kubelets and ephemerally retained. If we can visualize each individual step of a service request’s execution with distributed tracing, then it no longer matters that services have complex dependencies. If we can visualize the relationship between calling and receiving code paths and versions, then version skew between components is a solvable problem. Observability provides a shared context that enables teams to debug problems in a cohesive and rational way, regardless of how complex a system might be, rather than relying upon the entire state of the system to fit within one person’s mind.
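To make this concrete, here is a minimal sketch of emitting a trace span with wide context attached, using the OpenTelemetry Python API. The service name, attribute names, and handler are hypothetical illustrations, and a real deployment would export spans to a tracing backend rather than the console:

    # Requires the opentelemetry-api and opentelemetry-sdk packages.
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import (
        ConsoleSpanExporter,
        SimpleSpanProcessor,
    )

    # Print finished spans to stdout; production systems would send
    # them to a tracing backend for storage and later analysis.
    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)

    tracer = trace.get_tracer("checkout-service")

    def handle_checkout(user_id: str, cart_items: int, build_id: str) -> None:
        # One span per unit of work, annotated with every dimension we
        # might later want to slice by: who, what, and which version.
        with tracer.start_as_current_span("handle_checkout") as span:
            span.set_attribute("user.id", user_id)
            span.set_attribute("cart.item_count", cart_items)
            span.set_attribute("service.build_id", build_id)
            # ... business logic here; child spans for downstream calls
            # would carry the same trace context across services.

    handle_checkout("user-123", 3, "build-2024-01-15")

Because each span carries this context, questions such as “which build IDs are associated with slow checkouts?” can be asked after the fact, without having been predicted in advance.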

Observability Empowers DevOps and SRE Practices

It’s the job of DevOps and SRE teams to understand production systems and tame complexity, and thus it’s natural for them to care about the observability of the systems they build and run. SRE focuses on the idea of managing services according to Service Level Objectives (SLOs) and Error Budgets. DevOps focuses on managing services through cross-functional practices where developers own running their code in production. Rather than starting with a plethora of alerts that enumerate potential causes of outages, mature DevOps and SRE teams both measure whether there are visible symptoms of user pain, then drill down into understanding the outage using observability tooling.
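To make the error budget concept concrete, here is a back-of-the-envelope sketch; the 99.9% target, the 30-day window, and the request counts are illustrative values rather than recommendations:

    # A back-of-the-envelope error budget calculation for an
    # availability SLO; all numbers here are illustrative.
    slo_target = 0.999            # 99.9% of requests must succeed
    window_days = 30
    total_requests = 100_000_000  # requests served over the window
    failed_requests = 62_000      # requests that violated the SLO

    # The error budget is the share of requests allowed to fail.
    budget = (1 - slo_target) * total_requests  # 100,000 requests
    consumed = failed_requests / budget         # fraction of budget burned

    print(f"Error budget: {budget:,.0f} failed requests per {window_days} days")
    print(f"Budget consumed: {consumed:.0%}")   # prints "Budget consumed: 62%"

A team that has burned most of its budget slows down risky releases; a team with budget to spare has room to ship more aggressively.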

That shift away from cause-based monitoring and toward symptom-based monitoring means teams need the ability to explain the failures they see in practice, rather than maintain the traditional, ever-growing list of known failure modes. Rather than burning the majority of their time responding to a slew of false alarms that have no bearing on end-user-visible performance, teams can instead focus on systematically winnowing hypotheses and devising mitigations for actual system failures. (For more on this, see the chapter on SLOs and observability.)

Beyond adopting observability for break/fix use cases, forward-thinking DevOps and SRE teams use engineering techniques such as chaos engineering and continuous verification, feature flagging, progressive release patterns, and incident analysis. Observability supercharges these use cases by providing the data required to practice them effectively.

  • Chaos engineering and continuous verification require us to have observability to “detect when the system is normal and how it deviates from that steady-state as the experiment’s method is executed.”6 We cannot meaningfully perform chaos experiments without the ability to understand the system’s baseline state, to predict expected behavior under test, and to explain deviations from expected behavior. “There is no point in doing chaos engineering when you actually don’t know how your system is behaving at your current state before you inject chaos.”7

  • Feature flagging introduces novel combinations of flag states that we cannot test exhaustively. Thus, we need observability to understand the individual and collective impact of each feature flag, user by user (see the sketch following this list). The notion of monitoring behavior component by component no longer holds when an endpoint can execute in multiple different ways depending on which user is calling it and which flags are enabled.

  • Progressive release patterns such as canarying and blue/green deployment require observability to know when to stop a release and to analyze whether the system’s deviations from expected behavior are a result of the release.

  • Incident analysis and blameless postmortems require us to construct clear models of our sociotechnical systems: not just what was happening inside the technical system at fault, but also what the human operators believed was happening during the incident. Thus, robust observability tooling facilitates excellent retrospectives by providing a post-facto paper trail and details that cue retrospective writers.
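As a sketch of the feature-flagging point above, consider recording the flag state on every telemetry event so that latency or errors can later be grouped by flag combination, user by user. The flag name, evaluation logic, and emit destination here are all hypothetical stand-ins:

    # A sketch of annotating telemetry events with feature-flag state.
    # The flag, its evaluation, and emit() are hypothetical stand-ins
    # for a real flag SDK and observability backend.
    import json
    import time
    import zlib

    def evaluate_flags(user_id: str) -> dict:
        # Stand-in for a real feature-flag SDK: a deterministic
        # 50/50 split on a single hypothetical flag.
        return {"new_checkout_flow": zlib.crc32(user_id.encode()) % 2 == 0}

    def emit(event: dict) -> None:
        # Stand-in for shipping the event to an observability backend.
        print(json.dumps(event))

    def handle_request(user_id: str, endpoint: str) -> None:
        flags = evaluate_flags(user_id)
        start = time.monotonic()
        # ... endpoint logic branches on the flag state ...
        duration_ms = (time.monotonic() - start) * 1000
        # Record flag state alongside the usual request fields so the
        # impact of each flag combination is analyzable after the fact.
        emit({
            "endpoint": endpoint,
            "user_id": user_id,
            "duration_ms": round(duration_ms, 2),
            **{f"flag.{name}": value for name, value in flags.items()},
        })

    handle_request("user-123", "/checkout")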

As the practices of DevOps and SRE continue to evolve, and as platform engineering grows as an umbrella discipline, more innovative engineering practices will inevitably emerge. But all of those innovations will depend upon having observability as a core sense for understanding our modern, complex systems. The shift toward DevOps, SRE, and Cloud Native practices has created a need for a solution like observability. In turn, observability has supercharged the capabilities of the teams that have adopted the practice.
