Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 13. Observability: Monitoring, Logging, and Tracing

You have learned that testing is a vital skill to be mastered for the effective implementation of continuous delivery, but equally import is observability. Testing enables verification and promotes understanding at build and integration time, whereas observability allows verification and enables debugging at runtime. In this chapter, you will examine what you should observe and how, and you will learn about the implementation of monitoring, logging, tracing, and exception tracking. You will also explore several best practices for each of these implementations, and learn how to combine them with visualization to not only increase your understanding of your running systems, but also identify how to close the feedback loop and continuously enhance your applications.

Observability and Continuous Delivery

Continuous delivery does not end with the application being deployed into production. In fact, you could argue that deploying your application is really the beginning and that the process of continuous delivery stops only when an application or service is retired or decommissioned. Throughout the lifetime of an application, it is vital that you are able to understand what is occurring, and what has occurred, within the system. This is what observability is all about.

Why Observe?

An application is rarely deployed only once and never modified or updated again. A more typical pattern is that the business evolves or the organization changes, which generates new requirements, and, in turn, triggers the creation and deployment of multiple new versions of the application. Often, these new requirements are generated from insight into the application itself—for example, are key performance indicators (KPIs) being met, or is the application running at close to capacity? It is also common for a deployed application to crash or otherwise misbehave, so you may have to run tests and simulations locally in order to re-create the issues, or you may even have to log on to production systems to debug the application in situ.

Monitoring and Observability

The recent popularity of the term observability has driven some in the industry to question what exactly is meant by this term and how it relates to monitoring. An excellent blog post by Cindy Sridharan titled “Monitoring and Observability” explores these topics in detail, and provides many useful references. Fundamentally, Sridharan argues that the goals of monitoring and observability are different but complementary:

“Monitoring” is best suited to report the overall health of systems. Aiming to “monitor everything” can prove to be an antipattern. Monitoring, as such, is best limited to key business and systems metrics derived from time-series based instrumentation, known failure modes, as well as blackbox tests. “Observability,” on the other hand, aims to provide highly granular insights into the behavior of systems along with rich context, perfect for debugging purposes. Since it’s still not possible to predict every single failure mode a system could potentially run into or predict every possible way in which a system could misbehave, it becomes important that we build systems that can be debugged armed with evidence and not conjecture.

Observability may be a popular new term, but the chances are that you or your teammates have been creating “observable” Java applications for many years. If you have ever coded auditing, events logging, or exception tracking, then you have been attempting to observe the system behavior.

Monitoring, logging, and tracing help with all of these situations. These practices provide insight, often referred to an observability, into what is currently occurring or going wrong, as well as a record as to what the application has done. This allows you to “close the loop” on the continuous delivery process, as shown in Figure 13-1.

Once you understand the power that feedback provides, you will undoubtedly want to observe “all the things,” but there is value in being systematic in focusing your efforts. Let’s now look at what to observe.

What to Observe: Application, Network, and Machine

In general, you will tend to monitor and observe your entire system at three levels: application, network, and machine. Application metrics are usually the most challenging to create and understand—yet they are the most important—and this is because they are very specific to your business and requirements. One perspective on monitoring is that it can be used to implement some form of testing in production; you know what a potential failure looks like, and you are asserting that everything is good. For example, you know that there will be trouble if a variety of scenarios occur:

Your virtual machine runs out of block storage (disk) space.
A network partition occurs.
Your web application is returning a 404 HTTP status code for nearly all valid page requests.

For each of these scenarios, you can write a monitoring test. The first two will most likely be checked at the OS level. With the third, you could implement a counter or meter that outputs the number of 404s being generated, and create an alert based on this.

Monitoring and logging can also be used to provide data that is required to answer questions from the business in real time or at a later date. For example, your marketing team may want to know the average shopping basket checkout value during a promotion they are running, or your subscription retention team may want to mine activity logs to see whether they can identify behavior that suggests a customer will soon terminate their commercial contract. In order to implement effective monitoring, logging, and tracing, you have to design with observability in mind.

How to Observe: Monitoring, Logging, and Tracing

There are three primary approaches to observing modern software applications—monitoring, logging, and tracing:

Monitoring: This is used to observe a system in near real-time, and typically involves the generation and capture of ephemeral metrics, values, and ranges. You generally have to know what data you want to observe in this approach. Because of the simplicity of the numbers captured, you cannot mine the data for additional insight later (other than producing aggregates or looking for trends).
Logging: This is generally used to observe the system in the future, perhaps after a particular event (or failure) has occurred. Logs tend to be semantically richer and capture more data in comparison with metrics. Therefore, you can usually mine logs in order to generate additional insight. Logs can also be analyzed to help you generate future questions.
Tracing: This captures the flow of a request as it traverses the (distributed) system, and captures metadata and timing at specific points you believe are interesting. Examples include traffic ingress to an API gateway, handling of the request by your application, and handling of a query against a database.

The outputs of these approaches will allow you to examine the behavior of your application and surrounding system and to reflect on how this can be improved. However, certain outputs demand immediate attention.

Alerting

Certain events that occur during the life of an application require human intervention; you want to be emailed, phoned, or paged when something bad is happening so that you can fix it. For this, you need to create alerts that are triggered based on specified thresholds or occurrences of data from monitoring and logging.

Many alerts can be designed and configured before an application is even deployed, although this does require some up-front planning. The known unknowns of running out of disk space or exceeding the JVM heap space are good examples that should generate alerts. You will want to be aware of impending failure, and ideally fix this before it impacts your users. In the examples provided, you will provision more disk space or reengineer the application to use less memory. Other scenarios that should generate alerts can be found only with the experience of running the system in production; these are the unknowns unknowns. This means it is necessary to continually iterate on creating and maintaining alerts.

Avoiding Alert Overload from Microservices at the Financial Times

Creating effective alerts is challenging, particularly when moving to a new infrastructure or working with a new architectural style. Daniel wrote an article based on a talk by Sarah Wells from FT.com “Observability and Avoiding Alert Overload from Microservices at the Financial Times”, that explains how her team identified and overcame a series of challenges when embracing a microservice architecture. Three key takeaways from the article are as follows:

A core goal of monitoring and alerting is to know about problems before clients do, so the practice of running synthetic requests that mimic user functionality behavior is vital.
Creating alerts should be part of the normal development workflow: code, test, alerts. To ensure that the development team knows if an alert stops working, tests should be added to validate the alert.
Alerts must continually be cultivated, and if an alert is received that doesn’t make sense or does not require human interaction, it must be corrected or removed.

Alerts for metrics can be implemented using popular tooling such as the commercial PagerDuty and open source Bosun. Basic metric alerting can even be implemented in Prometheus. Alerting based on log content can be implemented by commercial tools like Humio and Loggly and open source Graylog 2.

Designing Systems for Observability

Retrofitting monitoring, logging, and tracing into applications can be difficult, because often the required data is not easily available or is difficult to expose without impacting the application functionality. Therefore, it is important to design your system with monitoring in mind, specifically:

Design your application to be capable of monitoring and logging from day one—include metrics and logging frameworks in your build dependencies (or, ideally, in the archetype of the project template).
Ensure that any module (or microservice) boundaries that you create are capable of exposing data that an upstream system may require.
Provide context data on downstream network calls (i.e., which service is calling, and on behalf on which application account).
Ask yourself, the operations team, and your business what type of questions they are likely to ask in the future, and plan to expose the metrics and log data as the application is being designed and built. For example:
- How effectively is a single instance of your application processing an event queue?
- How do you know if the application is fundamentally unhealthy?
- How many customers are currently logged into the application?

Design and Build Applications with Monitoring from Day One

As retrofitting monitoring, logging, and tracing into existing applications is difficult, you should include appropriate frameworks to support these practices from day one. This is especially true if building distributed applications like microservices and serverless functions, because not only will the applications need to support the frameworks, but so will the platform and infrastructure (e.g., collecting and presenting metrics for the system-level view of monitoring or implementing tooling for aggregated logging).

You will now learn how to implement each of the observability approaches with Java applications, but keep in mind the benefits of designing and implementing observability up front.

Metrics

Metrics are a numeric representation of some properties that your system has over intervals of time, such as maximum number of threads being used by your application, current heap memory available, or number of application users logged in during the last hour. Numbers are easily stored, processed, and compressed, and as such, metrics enable longer retention of data, as well as easier querying, which can, in turn, be used to build dashboards to reflect historical trends. Additionally, metrics better allow for gradual reduction of data resolution over time, so that after a certain period of time, data can be aggregated into daily or weekly frequency.

In this section, you will learn about the various types of metrics and the use cases for each. You will also be introduced to several of the most popular metrics libraries for Java—Dropwizard Metrics, Spring Boot Actuator, and Micrometer—and you will see examples of the various types of metrics demonstrated using these libraries.

Type of Metrics

There are, generally speaking, five metric types:

Gauges: The simplest metric type, a gauge simply returns a value. A gauge is useful for monitoring the eviction count in a cache or the average spending amount within a shopping basket that checks out.
Counters: A simple incrementing and decrementing integer. A counter can be used to monitor the number of failed connections to the database, or the number of users logged in to the website.
Histograms: This measures the distribution of values in a stream of data. A histogram is useful for monitoring the average response time of a downstream service or the number of results returned by a search.
Meters: This measures the rate at which a set of events occur. A meter can be used to measure the rate in relation to total cache lookups as cache misses are occurring or the rate in relation to time that users are abandoning shopping baskets with a product still present.
Timers: A histogram of the duration of a type of event and a meter of the rate of its occurrence. A timer can be used to monitor the time it takes to serve a web request or load a user’s saved shopping basket.

All of these metric types can be useful for monitoring a system from an operational (application) perspective or business perspective.

Dropwizard Metrics

The popular Dropwizard Metrics library (formerly the Coda Hale Metrics library) started life as a personal project, alongside what is now called the Dropwizard Java application framework. This metrics library is extremely flexible, and principles from it have been copied in many other metric frameworks, even across other language platforms.

The Codahale Metrics library can be imported into your project via the following dependency (and Example 13-1 is shown using Maven).

Example 13-1. Importing the Dropwizard Metrics library into a Java project

<dependency>
    <groupId>com.codahale.metrics</groupId>
    <artifactId>metrics-core</artifactId>
    <version>${metrics-core.version}</version>
</dependency>

Metrics configuration and metadata

The starting point for metrics is the MetricRegistry class, which is a collection of all the metrics for your application (or a subset of your application). Generally, you need only one MetricRegistry instance per application, although you may choose to use more if you want to organize your metrics in particular reporting groups. Global named registries can also be shared through the static SharedMetricRegistries class. This allows the same registry to be used in different sections of code without explicitly passing a MetricRegistry instance around.

Each metric is associated with a MetricRegistry and has a unique name within that registry. This is a simple dotted name, like uk.co.bigpicturetech.queue.size. This flexibility allows you to encode a wide variety of context directly into a metric’s name. If you have two instances of com.example.Queue, you can make them more specific: uk.co.bigpicturetech.queue.size versus uk.co.bigpicturetech.inboundorders.queue.size, for example.

MetricRegistry has a set of static helper methods for easily creating names:

MetricRegistry.name(Queue.class, "requests", "size")

MetricRegistry.name(Queue.class, "responses", "size")

Implementing a gauge

You can create a gauge with minimal effort by using Codahale Metrics. If, for example, your application has a value that is maintained by a third-party library, you can easily expose this by registering a Gauge instance that returns the corresponding value, as shown in Example 13-2.

Example 13-2. Gauge example using Codahale Metrics libary

registry.register(name(SessionStore.class, "cache-evictions"), new Gauge<Integer>() {
    @Override
    public Integer getValue() {
        return cache.getEvictionsCount();
    }
});

This creates a new gauge named com.example.proj.auth.SessionStore.cache-evictions that will return the number of evictions from the cache.

The Codahale Metrics library provides all of the common metrics mentioned earlier in this chapter, and the best way to learn more about how to implement them is to consult the documentation.

Spring Boot Actuator

Spring Boot Actuator is a subproject of Spring Boot that provides several features to support production-readiness of your applications. After Actuator is configured in your Spring Boot application, you can interact and monitor your application by invoking different HTTP endpoints exposed, such as application health, bean details, version details, configurations, logger details, etc.

To enable Spring Boot Actuator, you need to include only the following dependency in your existing build script (Example 13-3 is using Maven).

Example 13-3. Enabling Actuator within a Spring Boot-based project

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
    <version>${actuator.version}</version>
</dependency>

Creating a counter

To generate your own metrics with Actuator, you simply inject a CounterService and/or GaugeService into your bean. CounterService exposes increment, decrement, and reset methods, and GaugeService provides a submit method. Example 13-4 provides a simple illustration.

Example 13-4. Creating a counter with Spring Boot Actuator metrics

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.actuate.metrics.CounterService;
import org.springframework.stereotype.Service;

@Service
public class MyService {

    private final CounterService counterService;

    @Autowired
    public MyService(CounterService counterService) {
        this.counterService = counterService;
    }

    public void exampleMethod() {
        this.counterService.increment("services.system.myservice.invoked");
    }

}

Micrometer

Micrometer provides a simple facade over the instrumentation clients for the most popular monitoring systems, allowing you to instrument your JVM-based application code without vendor lock-in. The tagline on the project’s website is “Think SLF4J, but for application metrics!”

Micrometer can be imported into your Java application by using the following dependency (Example 13-5 is shown in Maven).

Example 13-5. Importing Micrometer into your Java project

<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-registry-prometheus</artifactId>
  <version>${micrometer.version}</version>
</dependency>

Creating a timer

The metrics APIs exposed within the Micrometer framework are based on the fluent-DSL pattern, so creating a timer is relatively simple. The primary difficulty with initializing a timer typically revolves around how the timer is wrapped around the method to be invocated; see Example 13-6.

Example 13-6. Timers in Micrometer

Timer timer = Timer
    .builder("my.timer")
    .description("a description of what this timer does") // optional
    .tags("region", "test") // optional
    .register(registry);

timer.record(() -> dontCareAboutReturnValue());
timer.recordCallable(() -> returnValue());

Runnable r = timer.wrap(() -> dontCareAboutReturnValue()); (1)
Callable c = timer.wrap(() -> returnValue());

Best Practices with Metrics

There are many good practices in relation to generating and capturing metrics:

Always expose core JVM internal metrics, such as: nonheap and heap memory usage; how often the garbage collector (GC) runs; and thread details, including the number of threads, current status, and CPU usage. The majority of modern metrics frameworks provide this as a bundled feature, so it is simply a matter of enabling this.
Attempt to expose core application-specific technical details that will supplement the JVM internal details. For example, the queue depth of an internal processing queue, the cache statistics (size, hits, average entry age, etc.) of any internal caches, and throughput of core processing.
Report on error and exception details. For example, the number of HTTP 5xx status codes returned when users call a REST API, the number of exceptions caught when calling a third-party dependency that is critical to your flow, and the number of exceptions that propagate through to the end user (which you should always attempt to minimize).
Ensure that development and operation teams work together when designing and implementing infrastructure and platform metrics. Every layer of abstraction within a platform will need to be monitored, and developers and operators may have differing requirements. Example layers of abstraction include application framework (e.g., the Spring or Java EE framework), the runtime Java container (e.g., GlassFish or Tomcat), the JVM, the container implementation (e.g., Docker), the orchestration platform (e.g., Kubernetes), the virtualized cloud hardware (e.g., the VMs and software-defined networks [SDNs]), and physical infrastructure.
Work closely with your business team in order to know what KPIs they want to track. Other systems might be best placed to provide this data, such as an associated data store or an ETL-based batch processing system. However, often a few well-chosen metrics can provide a lot of value in regards to real-time insight into the system. For example, when working with an e-commerce startup, it is common to expose metrics that indicated the number of users currently logged in, the average conversion from adding a product to the basket for purchasing, and the average basket value.

Now that you have developed a good understanding of metrics, it is equally valuable to learn about logging.

Logging

A log is an immutable append-only record of discrete events that happened over time, such as when the application initialized, when a disk read failed, or when an application user logged out.

Forms of Logging

Generally, logs are produced in one of three forms:

Plain text: A log record might take the form of free-form text. In the Java world, this is commonly seen within old applications that use System.out.println to log what is happening within an application. Unfortunately, this means that every log statement is uniquely formatted.
Structured: Here a log entry implements a defined structure, ranging from a simple JSON format entry to an XML format with a strict schema.
Binary: This type of log is generally intended for consumption by an application, where human readability is less of a concern. Examples include the MySQL binlog used for replication, and Protobuf or Avro logs of events that are used for point-in-time recovery.

Logs are useful when you need additional insight along with extra contextual information and other alerting and metrics do not provide enough. However, lots of logging information can be overwhelming, so you should also add metadata to log entries, such as the level of the entry and the cause (user, IP address, etc.) of the action. Many logging frameworks provide level categorization, such as ERROR, WARN, INFO, DEBUG, TRACE.

As with any new technology, there is a temptation to overuse logging when you first discover it. One method to help manage this is to understand and use the log levels. When you are writing a log statement, ponder to yourself whether this information that will be generated would be useful on a day-to-day basis. If it would be, it may well be an INFO statement. If the information is useful only to you, the developer, when trying to track down a bug, then DEBUG or TRACE is probably more appropriate. Any errors should, of course, be output using the ERROR level, but it is worth agreeing with the rest of the development team on where within the stack an error will be logged.

Our recommendation is to log an error at the highest possible level (closest to the call or user-initiated action). Attempting to log a single error multiple times within a call stack often just adds noise to the logs, and makes it even more challenging to track down the issue.

Guard Against “Overlogging”

The real power of log levels is that they allow the amount of logging to be modified at deployment or runtime. For example, if an application is not performing as expected, an operator can enable a more fine-grained logging level, such as DEBUG, in order to gain more insight. However, there will be a cost in performance for generating extra debug issues, and the irony is that many issues disappear when you start looking for them. This is often due to timing and memory usage patterns changing with the additional logging. On the contrary, we have also seen an application completely fall over when logging was enabled, as the memory requirements for generating log statements in a production environment were massive (and the associated TRACE statements had always been used in only tightly controlled development environments with minimal data).

There are several choices for logging frameworks with the Java ecosystem. You will now learn about the two most popular: SLF4J (with Logback) and Log4j 2.

Don’t Invent Your Own Logger

Please don’t attempt to implement your own logging framework, or almost as bad, simply use System.out.println. The modern Java logging frameworks are highly evolved, and offer much more flexibility compared with simply echoing details to the console output (which may or may not exist when running in a containerized environment).

SLF4J

The Simple Logging Facade for Java (SLF4J) serves as a simple facade or abstraction for various logging frameworks (e.g., java.util.logging, Logback, Log4j), allowing you to plug in the desired logging framework at deployment time. You can include SLF4J (in this case, using Logback under the hood) via Maven, as shown in Example 13-7.

Example 13-7. Including SLF4J with Logback, via Maven

<dependency> 
  <groupId>org.slf4j</groupId>
  <artifactId>slf4j-jdk14</artifactId>
  <version>${slf4j.version}</version>
</dependency>
<dependency> 
  <groupId>ch.qos.logback</groupId>
  <artifactId>logback-classic</artifactId>
  <version>${logbacl.version}</version>
</dependency>

The usage of SLF4J is simple, as you can see in Example 13-8, from the SLF4J user manual.

Example 13-8. Using the SLF4J APIs

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class HelloWorld {
  public static void main(String[] args) {
    Logger logger = LoggerFactory.getLogger(HelloWorld.class);
    logger.info("Hello World");
  }
}

SLF4J also supports Mapped Diagnostic Context (MDC), which allows you to add context-specific key-value data to a logger and can provide useful information for searching and filtering data from a system that is dealing with many user requests in a distributed system. If the underlying logging framework offers MDC functionality, SLF4J will delegate to the underlying framework’s MDC. Currently, only Log4j and Logback offer MDC functionality.

Log4j 2

Apache Log4j 2 is an upgrade to the original Log4j that provides significant improvements over the first (and very popular) version. The Log4j 2 website claims that it provides many of the improvements available in Logback while fixing some inherent problems in Logback’s architecture. One of the key differences with version 2 of the logging framework is that the API for Log4j is separate from the implementation, making it clear for application developers which classes and methods they can use while ensuring forward compatibility. Applications coded to the Log4j 2 API always have the option to use any SLF4J-compliant library as their logger implementation with the Log4j-to-SLF4J adapter.

While the Log4j 2 API will provide the best performance, Log4j 2 provides support for the Log4j 1.2, SLF4J, Commons Logging, and java.util.logging (JUL) APIs. If performance is an especially important issue for you, you may be interested in the fact that Log4j 2 contains asynchronous loggers based on the LMAX Disruptor inter-thread messaging library, which can provide higher throughput and orders of magnitude lower latency than Log4j 1.x and Logback.

You can include Log4j 2 in your Maven project with the dependencies shown in Example 13-9.

Example 13-9. Including Log4j 2 in your Maven-based application

<dependencies>
  <dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-api</artifactId>
    <version>${log4j.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-core</artifactId>
    <version>${log4j.version}</version>
  </dependency>
</dependencies>

The use of the Log4j 2 API is similar to that of SL4JF API, so if you are used to this framework, then you will feel right at home; see Example 13-10.

Example 13-10. Usage of Log4j 2

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
 
public class HelloWorld {
    private static final Logger logger = LogManager.getLogger("HelloWorld");
    public static void main(String[] args) {
        logger.info("Hello, World!");
    }
}

Logging Best Practices

Lots of great articles online share logging best practices, and we’ve collated several of the recommendations here, combined with our own experience:

Don’t log every little detail. This not only can have a performance impact, but also adds a lot of noise to logs. Any future maintenance in the code will also potentially have to modify all of the logs.
Conversely, do log important details, particularly around core flows or forks of processing within the overall application processing, as it is often a good idea to start at these places when debugging strange issues.
Write meaningful logging information that will help you and others diagnose information in the future. Be sure to include relevant context—finding the phrase “Transaction failed” within a log without any other context is never helpful. Make the information machine parsable as well, which will also aid in searching for keywords.
Log at the correct level: INFO for general information, DEBUG/TRACE for finer-grained diagnostic information, and WARN/ERROR for events that should require additional follow-up.
Use a static modifier for your Logger object, as this means that the Logger will be created only once, reducing overhead.
You can customize your layout in the logs (for example, with Log4j Pattern Layouts).
Consider using a JSON layout for structured logging. This makes logs easier to parse into an external, centralized log aggregation platform.
If you are working with SLF4J and you are running into issues with getting appenders configured correctly (or receiving no logging output), you can often resolve these issues after enabling the internal debugging by setting the log4j.debug system property in the configuration file or adding -Dlog4j.debug to the application/JRE startup command.
Don’t forget to rotate logs regularly to prevent the log files from growing too large, or the loss of data. Closely related to this topic is the recommendation that all logs should be asynchronously shipped off to a centralized log store, and a maximum number of rotated log files stored locally.
Get in the habit of periodically scanning all logs, looking for unexpected WARNs, ERRORs, and exceptions. This can often be a great way to catch an issue before it becomes more significant.

Don’t Log Sensitive Data

Although it may be tempting for debugging purposes, you should never log any sensitive information, such as confidential user or business data, personally identifiable information (PII), or any data that would fall under legal regulations, such as the EU’s General Data Protection Regulation (GDPR). Not only can logging sensitive information lead to compliance violations and fines, but it is a potential security vulnerability. We have both seen logs that have recorded credit card information, passwords (and failed password attempts, which often contain passwords a user uses somewhere else), and answers to account reset questions.

One of our favorite logging articles is Brice Figureau’s “The 10 Commandments of Logging”, and we recommend reading this for a more in-depth overview of logging practices.

Logging in the (Ephemeral) Cloud

When deploying Java applications on an IaaS or PaaS cloud platform, and especially on a FaaS serverless platform, don’t forget that the underlying infrastructure will most likely be ephemeral, meaning that it could disappear at a moment’s notice. You obviously have to code your application to be resilient to this, but you must also configure your logs appropriately. Primarily, you must ship your logs to a centralized collection or aggregation service, such as an ELK stack or commercial platform such as Humio, and it can also be beneficial to think about where you are storing your logs locally. For example, storing logs on a mounted persistent volume can help prevent data loss during an instance crash, but this will also have performance implications (i.e., less performance than writing log data to a locally attached volume).

Request Tracing

The basic idea behind request tracing is relatively straightforward: specific inflection points must be identified within a system, application, network, and middleware—or indeed any point on a path of a (typically, user-initiated) request—and instrumented. These points are of particular interest, as they typically represent forks in execution flow, such as the parallelization of processing using multiple threads, a computation being made asynchronously, or an out-of-process network call being made. All of the independently generated trace data must be collected, coordinated, and collated to provide a meaningful view of a request’s flow through the system.

Traces, Spans, and Baggage

As defined by the Cloud Native Computing Foundation (CNCF) OpenTracing API project, a trace tells the story of a transaction or workflow as it propagates through a system. In OpenTracing and Dapper, a trace is a directed acyclic graph (DAG) of spans, which are also called segments within some tools, such as AWS X-Ray. Spans are named and timed operations that represent a contiguous segment of work in that trace. Additional contextual annotations (metadata, or baggage) can be added to a span by a component being instrumented—for example, an application developer may use a tracing SDK to add arbitrary key-value items to a current span. It should be noted that adding annotation data is inherently intrusive: the component making the annotations must be aware of the presence of a tracing framework.

Trace data is typically collected “out of band” by pulling locally written data files (generated via an agent or daemon) via a separate network process to a centralized store, in much the same fashion as currently occurs with log and metrics collection. Trace data is not added to the request itself, because this allows the size and semantics of the request to be left unchanged, and locally stored data can be pulled when it is convenient.

When a request is initiated, a parent span is generated, which, in turn, can have causal and temporal relationships with child spans. Figure 13-2, taken from the OpenTracing documentation, shows a common visualization of a series of spans and their relationship within a request flow.

This type of visualization adds the context of time, the hierarchy of the services involved, and the serial or parallel nature of the process/task execution. This view helps to highlight the system’s critical path, and can provide a starting point for identifying bottlenecks or areas to improve. Many distributed tracing systems also provide an API or UI to allow further drill-down into the details of each span.

Java Tracing: OpenZipkin, Spring Sleuth, and OpenCensus

The world of distributed tracing is both fast evolving and becoming increasingly (cloud) platform specific. These facts, in combination with limitations of scope, mean that no implementation guide will be provided in this book. Interested readers are pointed to the popular open source frameworks OpenZipkin, Spring Cloud Sleuth, and OpenCensus for more information, which all provide Java SDKs.

Closely related to distributed tracing, application performance management (APM) is also a useful tool for developers and operators to understand and debug a system. Historically, the commercial solutions have had much more functionality in comparison with open source tooling, but Naver’s Pinpoint is now offering much of the expected core functionality and provides distributed tracing features.

Recommended Practices for Tracing

Distributed tracing within the Java space is a relatively new practice, and therefore there are limited “best” practices. However, recommended practices include the following:

You must remember to forward the tracing headers to all downstream services, middleware, and data stores; otherwise, part of the application will not be covered by the traces.
In relation to the previous point, if you are working with a polyglot application stack, you should integrate Zipkin (or your tracing solution of choice) into the additional language frameworks. Zipkin is great for this purpose, as it is a language-agnostic tracing solution.
Do not attempt to add a large amount of “baggage” metadata. Although this is collected out-of-band of the request itself, this can still result in noisy traces.

Finally, consider whether you want to run your own trace collection service, and whether you have the skills and resources available to make this a viable solution. Many of the cloud vendors offer excellent fully managed services.

Exception Tracking

Even if you have followed all of the advice within this chapter and implemented aggregated logging and centralized monitoring, you will still encounter scenarios within production systems where something goes wrong and you won’t know about it. This is almost inevitable with the complexity of the systems being implemented today. Ideally, you always want to know about a problem before an end user sees this or (worse still) reports to you that your system is broken. Therefore, an additional tool in your issue management toolbox should be an exception-tracking system.

An exception-tracking system is typically provided by a SaaS vendor, although in-house solutions are also available (such as the open source Ruby on Rails Errbit application, which is Airbrake compatible). A client SDK is added to your Java application, typically as a Maven or Gradle dependency, which captures any exceptions that are uncaught or have propagated to the view layer and reports the details to the exception-tracking service. Many tracking services have informative dashboards that help you in diagnosing and finding the associated issue, and they also typically alert you to the issue in near real time (or integrate with other services that provide this feature).

Exposed Exceptions Can Provide Information to Hackers!

If an internal exception or error is propagated through to the end user, this is obviously a bad user experience, but the error may also leak sensitive or useful information to a hacker. Indeed, the hacker may have been trying to break your system, and even if they succeed, they should receive no information of the issue. For this reason, you should avoid including overly descriptive error messages, stack traces, or PII data with an error message that is displayed (intentional or otherwise).

In addition to utilizing an exception-tracking system, we also recommend implementing a catchall error-handling web page that is displayed by default on the event of an uncaught exception. This page can typically be configured within modern Java web frameworks, or alternatively by configuring a static error page within your web server or API gateway that is displayed when an error is indicated within the HTTP response (e.g., a 5xx HTTP status code). Any error page should apologize for the inconvenience, and suggest that the user contact the company help desk. If the error page is generated within the application server, it is acceptable to provide a UUID as a reference to the error.

Don’t Forget the Client Side

If you are working on an application that exposes a web-based interface, errors can also occur in the client-side code. These also need to be caught and tracked. Many of the commercial tooling mentioned can be integrated with frontend JavaScript to accomplish this, such as Sentry.

Airbrake

A popular cross-language exception tracker is Airbrake. To install the Airbrake client into your Java code, you can simply import the dependency via Maven, as shown in Example 13-11.

Example 13-11. Importing the Airbrake SDK into your Java project

 <dependency>
     <groupId>io.airbrake</groupId>
     <artifactId>airbrake-java</artifactId>
     <version>${airbrake.version}</version>
 </dependency>

As stated in the Airbrake Java client GitHub repository README, the easiest way to use Airbrake is by configuring a Log4j appender. Therefore, when an uncaught exception occurs, Airbrake will POST the relevant data to the Airbrake server specified in your environment. (Don’t forget that you are still responsible for preventing or translating the display of this error to the end user.) You saw an example Log4j configuration in the preceding example, and Example 13-12 is a modified version configured to report errors to the external Airbrake service (which could be a self-hosted Errbit service).

Example 13-12. Log4j properties configuration file for reporting exceptions to an external Airbrake service

log4j.rootLogger=INFO, stdout, airbrake

log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=[%d,%p] [%c{1}.%M:%L] %m%n

log4j.appender.airbrake=airbrake.AirbrakeAppender
log4j.appender.airbrake.api_key=YOUR_AIRBRAKE_API_KEY
#log4j.appender.airbrake.env=development
#log4j.appender.airbrake.env=production
log4j.appender.airbrake.env=test
log4j.appender.airbrake.enabled=true
#log4j.appender.airbrake.url=http://api.airbrake.io/notifier_api/v2/notices

If you are not using Log4j, or want to send other exceptions to your exception-tracking service, you can call the Airbrake client directly, as shown in Example 13-13.

Example 13-13. Calling the Airbrake service directly via the SDK

try {
    doSomethingThatThrowsAnException();
}
catch(Throwable t) {
    AirbrakeNotice notice = new AirbrakeNoticeBuilder(
                            YOUR_AIRBRAKE_API_KEY, t, "env").newNotice();
    AirbrakeNotifier notifier = new AirbrakeNotifier();
    notifier.notify(notice);
}

System-Monitoring Tooling

You’ve seen how important it is to generate and collect metrics and logs from your Java applications within this chapter, and the same advice applies to the OS and infrastructure that your applications run on.

collectd

collectd gathers metrics from various sources (e.g., the operating system, applications, log files, and external devices) and stores this information or makes it available over the network. Those statistics can be used to monitor systems, find performance bottlenecks, and predict future system load. collectd runs as a daemon on each machine instance, and all of the functionality is provided as a series of plugins. collectd’s configuration is kept as easy as possible—besides which modules to load, you don’t need to configure anything else, but you can customize the daemon to your liking if you want. collectd utilizes a data push model: the data is collected and sent (pushed) to a multicast group or server. Thus, there is no central instance that queries any values.

Because of space limitations (and subtle differences between Linux distros), we won’t cover how to install and set up a central collectd server. Usually, this would be done by a centralized operations team in a large organization, and for smaller teams using public cloud services, you can often transform collectd metric data into the vendor’s proprietory centralized metrics collection framework (e.g., Amazon CloudWatch has a collectd plugin). The client collectd daemon can be installed as a binary (available via the project’s download page), and the configuration is specified by modifying the /etc/collectd.conf configuration file. More information can be found on the collectd website.

rsyslog

Modern Java applications involve lots of moving parts that are often distributed across multiple machines, and tracking what is happening and diagnosing issues at the OS level can be challenging. Therefore, centralizing your log output can be useful. Syslog is a standard developed in the 1980s for recording logging messages, and used widely, especially in Unix environments. All mainstream Linux distributions install a syslog implementation as part of the base system, which is a strong reason for adopting it in preference to other, less widely deployed systems. Rsyslog builds upon the basic syslog protocol, and extends it with content-based filtering, flexible configuration options, and a bunch of useful extensions, such as the support for ISO 8601 timestamps and the ability to log directly into various database engines.

Typically, this type of centralized log management will be implemented by a centralized operations team, but it is not difficult to run your own central receiving server. For the sake of brevity (and the subtle differences based on Linux distros), we won’t cover the installation or configuration of a receiving server. For the client servers, all you need to do is tell syslog to forward all logs to the central server. This is typically achieved by adding the following to the base of the /etc/rsyslog.conf config file:

*.* syslog.mycentralserver.com

This will send all log messages sent via syslog to the central receiving server.

Sensu

Sensu is an open source and commercial infrastructure and application-monitoring and telemetry solution that provides a framework for monitoring almost everything: from infrastructure to application health, and business KPIs. Sensu is designed to solve monitoring challenges introduced by the types of modern infrastructure platforms that we have talked about in this book (e.g., a mix of static, dynamic, and ephemeral infrastructure when using public, private, and hybrid clouds). Sensu is often deployed in place of existing infrastructure-monitoring solutions such as Nagios.

Sensu exposes all of its configuration as JSON files, so it is easy to automate and manage configuration via VCSs. Sensu also integrates well with alerting tools like PagerDuty, Slack, and email.

In general, Sensu can coexist with other tooling like Prometheus, and it is common to see both being utilized at an organization. Developers tend to gravitate toward Prometheus because of its user experience (UX) and extensive query features, and operators tend to embrace Sensu because of its extensive integration with infrastructure (including the ability to reuse existing Nagios health checks).

Collection and Storage

Any metric and logging data must be reliably captured and stored for later analysis. This section will explore a popular solution for each of these requirements.

Prometheus

Prometheus is an open source systems monitoring and alerting toolkit originally built at SoundCloud. It is now a standalone open source project (hosted by the CNCF) and maintained independently, and developers from many organizations now contribute. Prometheus fundamentally stores all data as time series: streams of timestamped values belonging to the same metric and the same set of labeled dimensions. Prometheus works well for recording any purely numeric time series. It fits both machine-centric monitoring as well as monitoring of highly dynamic service-oriented architectures. In a world of microservices, its support for multidimensional data collection and querying is a particular strength.

Prometheus provides its own Java SDK that provides all of the metrics types discussed previously. However, the Prometheus API is specific to this collection platform, and instead it is often advantageous to use a platform-agnostic library and integrate this with Prometheus. All of the main metrics libraries provide Prometheus integration, including Dropwizard/Codahale Metrics, Micrometer, and Spring Boot Actuator/Metrics. Metrics stored within Prometheus can easily be visualized via Grafana.

Elastic-Logstash-Kibana

When discussing how to aggregate and store log data, you will often hear talk of the ELK stack. ELK is an acronym for three open source projects: Elasticsearch, Logstash, and Kibana. Elasticsearch is a search and analytics engine. Logstash is a server‑side data processing pipeline that ingests data from multiple sources simultaneously, transforms it, and then sends it to a “stash” like Elasticsearch. Kibana lets users visualize data with charts and graphs in Elasticsearch. Both SLF4J and Log4j 2 can be configured to format data into JSON that is ready for consumption by Logstash and Elasticsearch.

Beware of the Boiling Frog: Abraham’s Experience

There is an old fable that states that if you put a frog in a pot of boiling water, it will immediately jump out, but if you put it in warm water and then gradually increase the temperature to a boil, the frog won’t notice the increase and will boil alive. While the veracity of the story is probably questionable (and I’m most certainly not encouraging you to test it yourself), there is some knowledge that you can apply here.

A limiting factor of many metrics and visualization tools is the amount of resources that they require: storing data is expensive, and indexing it to quickly visualize it even more so. For this reason, I have frequently seen teams that will impose an age limit in their visualization tools, typically two weeks, but sometimes even less.

Most people don’t mind that because they have configured their graphs with a rather short window, usually something between the last 30 minutes and the last day; I have rarely met teams that tend to be interested in checking what happened more than a day ago. I do like to check further, though, because I like to understand the long-term patterns: how much higher is traffic usually on weekdays compared to weekends? How many new users do we have in the summer months compared to the rest of the year? Is it normal that visits plummet like this during the holiday season, or are we having a bad year? Questions like these are the ones that bug me, and the ones that most monitoring tools cannot answer.

It is true that many problems tend to be sudden, and that these can be noticed with a short-term monitoring view. For the longer-term ones, I tend to maintain my own list of metrics that I keep in a rather unsophisticated way (typically, a spreadsheet with daily totals), and where I can observe patterns. It may sound like overkill, particularly if you have spent so much effort setting up a proper ELK stack or similar, but for me it has paid off: once I was able to identify one of these slowly cooking frogs, a performance issue that kept creeping up ever so slightly over the course of two months, but that was clearly growing at an exponential rate. Hopefully, we managed to catch it at the beginning, when it still wasn’t too much of an issue, but imagine what could have happened if I hadn’t kept an eye on the longer trends.

Visualization

Designing systems with observability in mind and collecting appropriate metric and logging data is a good first step toward understanding your application and system. However, an equally important step is converting this data to something that provides insight and drives actions and improvement. How you do this depends on your target audience: business, operations, or development. The goal of this section is to provide an overview of what is possible. Because of the scope of the book, you are invited to follow up with further reading and web searches.

Visualization for Business

The primary driver when creating visualizations for business use is to focus on the most important information and to minimize noise. A popular mechanism for displaying textual and numeric insight is a dashboard. The dashing.io framework, along with a more actively maintained fork, Smashing (shown in Figure 13-3) is a simple-to-use and effective dashboard tool. Dashboards are created using ERB Ruby scripts (much like JSPs), and data can be submitted to the tool via a REST-like API.

Operational Visualization

Popular operations visualization tooling includes Graphite and the more modern Grafana, shown in Figure 13-4. These tools make it easy to create dashboards that focus on global system health and performance, as well as service- or infrastructure-specific properties. Core goals for visualization within this space include providing the ability for engineers to self-serve and create their own dashboards, and to create automated alerts on anything that should require an action to be taken.

Another popular requirement from operators is the ability to understand the flow of requests and data across a system, and for this, the output of APM tooling can be valuable. Figure 13-5 demonstrates a request/response scatter chart from the user-generated request to the associated database query using the open source Pinpoint APM solution.

Visualization for Developers

Developers are well catered to by visualization tooling like Kibana, which is often used as part of the ELK stack. Whereas Grafana is focused on metrics, Kibana, shown in Figure 13-6, is focused on logs, and enables full-text querying in addition to graphing. This functionality is invaluable for developers when debugging complex issues.

If you are utilizing distributed tracing, many of these tools provide a graphical interface that can be queried to show a single trace. As demonstrated in Figure 13-7, the benefit of this type of visualization is that it allows you to quickly identify the flow of the request/response and data across a single user-triggered action. Long spans allow you to locate a long-running process, and broken spans quickly highlight processes or services that are failing.

Although the lure of the command line can be tempting for many developers, you can also get a lot of value from the appropriate use of visualization. A core goal of visualization in this domain is to ensure that developers have self-service access to the tooling, and can create dashboards, charts, and trace queries with minimal overhead.

Summary

In this chapter, you have learned about the fundamentals of observability:

Throughout the lifetime of an application, it is vital that you are able to understand what is occurring, and what has occurred, within the system. This is what observability is all about.
In general, you will tend to monitor and observe your entire system at three levels: application, network, and machine.
There are three primary approaches to observing modern software applications: monitoring, logging, and tracing.
Monitoring is used to observe a system in near real-time, and typically involves the generation and capture of ephemeral metrics, values, and ranges.
Logging is generally used to observe the system in the future, perhaps after an event (or failure) has occurred.
Tracing captures the flow of a request as it traverses the (distributed) system, and captures metadata and timing at specific points you believe are interesting.
Certain events that occur during the life of an application require human intervention. For this, you need to create alerts that are triggered based on specified thresholds or occurrences of data from monitoring and logging.
Retrofitting monitoring, logging, and tracing to applications can be difficult. Therefore, it is important to design your system with monitoring in mind.
You always want to know about a problem before an end user sees it. Therefore, an additional tool in your issue management toolbox should be an exception-tracking system.
Using visualization tools and dashboards correctly can provide insight and reduce the amount of noise that is presented by raw metric and log data.

At this point in the book, you have learned about the technical details of implementing continuous delivery. The next chapter focuses on the challenges of migrating an existing organization or application to this way of working.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.