Chapter 2. Breaking Down the Problem

I HAVE NO TOOLS BECAUSE I’VE DESTROYED MY TOOLS WITH MY TOOLS.1

James Mickens

The concept of tracing the execution of a computer program is not a new one, in any sense. Being able to understand the call stack of a program is fairly critical, one might say, to all manner of profiling, debugging, and monitoring tasks. Indeed, stack traces are likely the second-most utilized debugging tool in the world, right behind printf statements liberally scattered throughout a codebase. Our tools, processes, and technologies have improved over the past two decades, though, and they demand new methodologies and patterns of thinking from us. As we discussed in the last chapter, modern architectures such as microservices have fundamentally broken these classic methods of profiling, debugging, and monitoring. Distributed tracing stands ready to alleviate these issues, to fix the holes in our tools that we have destroyed with our tools.

There’s just one problem: distributed tracing can be hard.

Why is this the case? Three fundamental problems generally occur when you’re trying to get started with distributed tracing.

First, you need to be able to generate trace data. Support for distributed tracing as a first-class citizen in your runtime may be spotty or nonexistent. Your software might not be structured to easily accept the instrumentation code required to emit tracing data. You may use patterns that are antithetical to the request-based style of most distributed tracing platforms. Often, distributed tracing initiatives are dead on arrival due to the challenges of instrumenting an existing codebase.

The second problem is how you collect and store the trace data your software generates. Imagine hundreds or thousands of services, each emitting small chunks of trace data for each request, potentially millions of times per second. How do you capture that data and store it for analysis and retrieval? How do you decide what to keep, and how long to keep it? How do you scale your data collection in step with the volume of requests to your services?

Finally, once you’ve got all of this data, how do you actually derive value from it? How do you translate the raw trace data that you’re receiving into actionable insights? How do you use trace data to generate and inform your SLOs and SLIs? Can you turn your trace data into value for other parts of the business, beyond just engineering? These questions, and more, stymie and confuse many people who are trying to get started with distributed tracing.

The Pieces of a Distributed Tracing Deployment

To answer these questions, and to help you organize your thinking about the subject, we’ve broken down distributed tracing deployments into three main areas of focus, which is also how we’ve organized the book. These three pieces build on each other, but may be useful to different people at different times — by no means do you need to be an expert on all three! Inside each section you’ll find explanations, lessons, and examples of how to build and deliver a distributed tracing deployment at your organization, which should help you build confidence in your systems and software.

Instrumentation

Distributed tracing requires traces, and those traces are created through instrumentation. We use instrumentation to refer to any method by which your service generates trace data suitable for a trace analyzer to assemble into distributed traces. In this section, you’ll learn about spans, the building blocks of request-based distributed traces, and how your services can generate them. We’ll discuss the current state of the art in instrumentation frameworks such as OpenTelemetry, a widely supported open source project that offers an instrumentation API (and more) for easily bootstrapping distributed tracing into your software. In addition, we’ll discuss best practices for instrumenting both legacy code and greenfield development.
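
To make this concrete, here’s a minimal sketch of what span-based instrumentation can look like using the OpenTelemetry Python API. The service name, span names, and attributes are illustrative assumptions rather than anything prescribed by OpenTelemetry; later chapters walk through instrumentation in much more detail.

    from opentelemetry import trace

    # Acquire a tracer; without further SDK configuration this falls back to a
    # no-op implementation, so the code is safe to run as-is.
    tracer = trace.get_tracer("inventory-service")  # hypothetical service name

    def get_item(item_id: str) -> dict:
        # Each unit of work becomes a span; spans started within it are children.
        with tracer.start_as_current_span("get_item") as span:
            span.set_attribute("item.id", item_id)  # illustrative attribute
            with tracer.start_as_current_span("db.query"):
                # ... look the item up in a database ...
                return {"id": item_id, "in_stock": True}

Each with block produces a span with timing information, and the nesting captures the parent-child relationships that a trace analyzer later stitches into a complete trace.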

Data Collection

Once you’ve generated your traces, you need to store and analyze them. You’ll learn about the various data sources that emit trace data and how they relate to each other. We’ll guide you through the tradeoffs around overhead, and through deciding which traces to keep and which to throw away via a mechanism known as sampling. Finally, you’ll learn how trace data is coordinated and dispatched to centralized collection systems, allowing you to analyze it and gain valuable insights.
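
As a preview of the kind of decision a sampler makes, here’s a minimal sketch of head-based probabilistic sampling in plain Python. The sample rate and trace ID scheme are made up for illustration; real tracing SDKs ship configurable samplers that handle this (and more) for you.

    import random

    SAMPLE_RATE = 0.01  # keep roughly 1% of traces (assumed value)

    def should_sample() -> bool:
        # Head-based sampling: the keep/drop decision is made once, when the
        # trace begins, rather than after the fact.
        return random.random() < SAMPLE_RATE

    def start_trace() -> dict:
        # The decision travels with the trace context so every downstream
        # service records (or skips) the same trace consistently.
        return {"trace_id": random.getrandbits(128), "sampled": should_sample()}

The key design choice is where the decision gets made: deciding up front keeps overhead low, at the cost of possibly discarding the one trace you later wish you had kept.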

Delivering Value

Once you’ve got instrumented services and a data collection system, the real fun begins! How do you combine traces with your other observability tools and techniques, such as metrics and logs? How do you measure what matters — and for that matter, how do you define what matters to begin with? Distributed tracing provides the tools you’ll need to answer these questions, and we’ll help you figure it out in this section. You’ll learn how to use traces to improve your baseline performance, as well as how tracing helps you get back to that baseline when things catch on fire.

All that said, there’s still an open question here — how does distributed tracing relate to microservices, and distributed architectures more generally? We touched on this in the last chapter, but let’s digress for a moment on the relationship between these things.

Distributed Tracing, Microservices, Serverless, Oh My!

There’s a certain line of thinking about microservices, now that we’re several years past them being the “hot thing” in every analyst’s portfolio of “Top Trends for 20xx” — namely, that the battle has been won. The exploding popularity of cloud computing, Kubernetes, containerization, and other development tools that enable rapid provisioning and deployment of hardware (or hardware-like abstractions) has undoubtedly transformed the industry. These factors can make it feel like asking “Should I use microservices?” would be to out oneself as a fool or charlatan.

Let’s take a step back and look at some real-world data. First and foremost, there’s evidence that containers aren’t as popular in production as the hype might suggest, with only 25% of developers using them in production.2 Quite a few engineering organizations are still running fairly standard, traditional monoliths for much of their work. Why is that? One reason may be, ironically enough, the lack of accessible distributed tracing tools.

Martin Fowler identified three primary considerations for adopting microservices in your organization:3 the ability to rapidly provision hardware, the ability to rapidly deploy software, and a monitoring regime that can detect serious problems quickly. The things we love about microservices (independence, idempotence, and so on) are also the things that make them difficult to understand, especially when things go wrong. Serverless technologies only add to the confusion by giving you less visibility into the runtime environment of a particular function, and by often being stubbornly resistant to monitoring with your favorite tools.

How, then, does distributed tracing stack up against these considerations? First, it solves the monitoring problem raised by Fowler. It lets you gain critical insights into the performance and status of individual services as part of a chain of requests, in a way that would otherwise be difficult or time-consuming. Distributed tracing gives you the ability to understand exactly what a particular, individual service is doing as part of the whole, enabling you to ask and answer questions about the performance of your services. Traditional metrics and logging simply can’t compare to the power of distributed tracing in this area. Metrics, for example, give you an aggregate understanding of what’s happening across all instances of a given service, and even let you narrow your query to specific groups of services, but they break down when you need to query on dimensions with very high (effectively infinite) cardinality.

What’s Cardinality?

Cardinality is a mathematical term that refers to the number of elements in a set or group. A tag like an HTTP status code has low cardinality (only a handful of possible values), while a tag like a user ID can have millions. We’ll discuss this more in later chapters.

Logs, on the other hand, provide extremely fine-grained detail about a given service, but have no built-in way to place that detail in the context of a request. You may well already be using both metrics and logs to discover and address problems, but distributed tracing provides a best-of-both-worlds approach that eclipses them both.

These things all relate, ultimately. Distributed tracing is as much a requirement for microservices as distributed architectures are a requirement for distributed tracing. If you’re building a distributed application of any sort, and you want that application to be reliable and observable, distributed tracing should be one of your primary concerns.

The Benefits of Tracing

What, then, does tracing get you, specifically? We’ll talk about this throughout the rest of the text, but let’s start with the high-level quick wins. Distributed tracing can transform the way you develop and deliver software. It benefits not only your software quality, but your organizational health as well.

Distributed tracing improves developer productivity and development output. It is the best and easiest way for developers to understand the behavior of distributed systems in production. You will spend less time troubleshooting and debugging a distributed system with distributed tracing than you would without it, and you’ll discover problems and issues you didn’t even realize you had.

Distributed tracing supports modern, polyglot development. Because it is agnostic to your programming language, monitoring vendor, or runtime environment, you can propagate a single trace from an iOS native client, through a high-performance C++ proxy, through a Java or C# backend, to a web-scale database and back, all visualized in a single place using a single tool. No other set of tools gives you this freedom and flexibility.
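
What makes that cross-language propagation possible is a shared wire format for trace context; the W3C Trace Context traceparent header is a widely used example. Here’s a rough sketch of the idea in Python, not a production implementation (real SDKs handle this for you through their propagation APIs):

    import random

    def make_traceparent(trace_id: int, span_id: int, sampled: bool) -> str:
        # W3C Trace Context: version "00", a 128-bit trace ID, a 64-bit parent
        # span ID, and a flags byte whose low bit carries the sampling decision.
        flags = "01" if sampled else "00"
        return f"00-{trace_id:032x}-{span_id:016x}-{flags}"

    def parse_traceparent(header: str) -> tuple:
        version, trace_id, span_id, flags = header.split("-")
        return int(trace_id, 16), int(span_id, 16), flags == "01"

    # Outgoing hop: attach the context to the request headers.
    headers = {"traceparent": make_traceparent(random.getrandbits(128),
                                               random.getrandbits(64), True)}

    # Incoming hop, possibly in a different language entirely: read it back.
    trace_id, parent_span_id, sampled = parse_traceparent(headers["traceparent"])

Because every hop reads and rewrites the same small header, the iOS client, the C++ proxy, and the Java backend can all contribute spans to one and the same trace.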

Distributed tracing reduces the overhead of deployments and rollbacks by quickly giving you visibility into changes. This not only reduces the mean time to resolution of incidents, but also decreases the time to market for new features and the mean time to detection of performance regressions. It also improves communication and collaboration across teams, because your developers aren’t siloed into a particular monitoring stack for their slice of the pie — everyone, from frontend developers to database nerds, can look at the same data to understand how changes impact the overall system.

Setting The Table

After all that, we hope that we have your attention! Let’s recap:

  • Distributed Tracing is a tool that allows for profiling and monitoring distributed applications by way of traces, visual representations of each request as it passes through a chain of services.

  • Distributed Tracing is agnostic to your programming language, runtime, or deployment environment and can be used with almost every type of application or service.

  • Distributed Tracing improves teamwork and coordination, and reduces the time it takes to detect and resolve performance issues with your application.

However, to get these benefits, first you’ll need some trace data, then you’ll need to collect it, and finally you’ll have to analyze it. Let’s start at the beginning, then, and talk about instrumenting your code for distributed tracing.
