1 Introduction

Let’s begin with an origin story for a company called Example.com. Once upon a time(-series), Example.com had a sysadmin. She managed infrastructure that lived in data centers. Every time a new host was added to that environment she installed a monitoring agent and set up some monitoring checks. Every now and again one of those hosts would break and a check would trigger. A notification would be sent, and she would wake up and run rm -fr /var/log/*.log to fix it.

For many years this approach worked just fine. Of course, there was some drama. Occasionally something would go wrong for which there wasn’t a check, or there just wasn’t time to act on a notification, or some applications and services on top of the hosts weren’t monitored. But largely, monitoring was fine.

Then the Information Technology (IT) industry started to change. Virtualization and Cloud computing were introduced, and the number of hosts that needed to be monitored increased by one or more orders of magnitude. Some of those hosts were run by people who weren’t sysadmins, or the hosts were outsourced to third parties. Some of the hosts in her data center were moved into the Cloud, or even replaced with Software-as-a-Service applications.

Most importantly, IT became a core channel for businesses to communicate with and sell to their customers. Applications and services that had previously been seen as just technology now became critical to customer satisfaction and providing high-quality customer service. IT was no longer a cost center, but something a company’s revenue relied on.

As a result, aspects of monitoring began to break down. It became hard to keep track of hosts (there were a lot more of them), applications and infrastructure became more complex, and expectations around availability and quality became more aggressive.

It got harder and harder to check for all the possible things that could go wrong using the current system. Notifications piled up. More hosts and services meant more demand on monitoring systems—most of which were only able to scale by adding bigger, more powerful hosts, and could not be easily distributed. Under these loads, detecting and locating faults and outages grew ever slower and more challenging.

The organization began demanding more data, both to demonstrate the quality of the service being delivered to customers and to justify the increased spending on IT services. Many of these demands were for data that existing monitoring simply wasn’t measuring or couldn’t generate. Her monitoring system became a tangled mess.

For many in the industry, this is the state of monitoring right now, and it’s not a happy place. It doesn’t have to be like this—you can build a better solution that addresses the change in how IT works and scales for the future.

1.1 Welcome to the Art of Monitoring

This is a hands-on guide to building a modern, scalable monitoring framework using up-to-date tools and techniques. We’re going to build that framework from the ground up. We’ll include best practices for both sysadmins and developers. We’ll show developers how they can better enable monitoring and metrics, and we’ll show sysadmins how to take advantage of metrics to do better fault detection and gain insights into performance. We’ll address the change in IT environments as a result of the dynamic infrastructure introduced by virtualization, containerization, and the Cloud. The goal of this book is to provide a monitoring framework that helps you and your customers better manage IT.

Before we launch into the guide, it’s important to talk about what monitoring is, why it exists, and some of the challenges that exist in each monitoring domain.

We’ll then talk about what’s in the book, what you’ll learn, and how you can change the way you perceive and implement monitoring.

1.2 What is monitoring?

From a technology perspective, monitoring is the tools and processes by which you measure and manage your IT systems. But monitoring is much more than that. Monitoring provides the translation between business value and the metrics generated by your systems and applications. Your monitoring system translates those metrics into a measurable user experience. That measurable user experience provides feedback to the business to help ensure it’s delivering what customers want. The user experience also provides feedback to IT to indicate what isn’t working and what’s delivering insufficient quality of service.

Your monitoring system has two customers:

  • The business
  • Information Technology

1.2.1 The business as a customer

The first customer of your monitoring system is the business. Your monitoring exists to support the business—and to make sure it continues to do business. Monitoring provides the user experience data that allows the business to make good product and technology investments. Monitoring also helps the business measure the value technology delivers.

1.2.2 Information Technology as a customer

IT is the second customer. That’s you, your team, and the other folks who manage and maintain your technology environment. You rely on monitoring to let you know the state of your technology environment. You also use monitoring quite heavily to detect, diagnose, and help resolve faults and other issues in your technology environment. Monitoring contributes much of the data that informs your critical product and technology decisions, and measures the success of those projects. It’s a key part of your product management life cycle, your relationship with your internal customers, and it helps demonstrate that the business’s money is being well spent. Without monitoring you are not doing your job.

1.3 What does monitoring actually look like?

So, does this vision of monitoring mesh with the real-world implementation of most monitoring systems? That depends. The evolution of monitoring in organizations varies dramatically, or as William Gibson put it:

The future is already here; it’s just not evenly distributed.

To explore this we’ve created a three-level maturity model that reflects the various stages of monitoring evolution organizations tend to experience. The stages are:

  • Manual, user-initiated, or no monitoring
  • Reactive
  • Proactive

We don’t believe or claim this model is perfect. The stages identified are broad. Organizations may find they’re at any number of points on the broad spectrums inside those stages. Additionally, what makes measuring this maturity difficult is that not all organizations experience this evolution in linear or holistic ways. This can be the consequence of having employees with varying levels of skill and experience over different periods. It can be due to different segments, business units, or divisions of an organization having very different levels of maturity. Or it can be both.

Now on to the stages.

1.3.1 Manual, user-initiated, or no monitoring

Monitoring is largely manual, user initiated, or not done at all. If monitoring is performed, it’s commonly managed via checklists, simple scripts, and other non-automated processes. Often monitoring becomes cargo cult behavior, with only the components that have broken in the past being monitored. Faults in these components are remediated by repeatedly following rote steps that have also “worked in the past.”

The focus here is entirely on minimizing downtime and managing assets. Monitoring in this way provides little or no value in measuring quality of service, and provides little or no data that helps IT justify budgets, costs, or new projects.

This is typically found in small organizations with limited IT staffing, no dedicated IT staff, or where the IT function is run or managed by non-IT staff, such as a finance team.

1.3.2 Reactive

Reactive monitoring is mostly automatic with some remnants of manual or unmonitored components. Tooling of varying sophistication has been deployed to perform the monitoring. You will commonly see tools like Nagios with stock checks of basic concerns like disk, CPU, and memory. Some performance data may be collected. Most alerting will be based on simple thresholds, and sent via email or messaging services. There may be one or more centralized consoles displaying monitoring status.
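As a flavor of what this looks like in practice, here’s a sketch of a stock, threshold-based Nagios disk check of the kind commonly found at this level. The host name, template, and NRPE command are illustrative assumptions and depend entirely on your own Nagios setup:

    # A sketch of a stock Nagios disk check, typical of the Reactive level.
    # The generic-service template, host name, and check_nrpe!check_disk
    # command are assumptions; each must already be defined in your setup.
    define service {
      use                     generic-service
      host_name               web01.example.com
      service_description     Disk usage
      check_command           check_nrpe!check_disk
      contact_groups          admins
    }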

There is a broad focus on measuring availability and managing IT assets. There may be some movement towards using monitoring data to measure customer experience. Monitoring provides some data that measures quality of service and some data that helps IT justify budgets, costs, or new projects. Most of this data needs to be manipulated or transformed before it can be used. A small number of operationally focused dashboards exist.

This is typical in small to medium enterprises and common in divisional IT organizations inside larger enterprises. Typically reactive monitoring is built and deployed by an operations team. You’ll often find large backlogs of notifications, and stale check configuration and architecture. Updates to monitoring systems tend to be reactive in response to incidents and outages. New monitoring checks are usually the last step in application or infrastructure deployments.

1.3.3 Proactive

Monitoring is considered core to managing infrastructure and the business. Monitoring is automatic and generated by configuration management. You’ll see tools like Nagios, Sensu, and Graphite with widespread use of metrics. Checks will tend to be more application-centric, with many applications instrumented as part of development. Checks will also focus on measuring application performance and business outcomes rather than just stock concerns like disk and CPU. Performance data will be collected and used frequently for analysis and fault resolution. Alerting will be annotated with context and will likely include escalations and automatic responses.
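As a small illustration of what application-level instrumentation can look like, here’s a minimal sketch using the Riemann Clojure client (we introduce Riemann in Chapter 3). The host, service name, and metric value are illustrative assumptions:

    ; A minimal sketch of an application emitting its own metrics,
    ; here via the Riemann Clojure client. The host, service name,
    ; and metric value are illustrative assumptions.
    (require '[riemann.client :as r])

    (def client (r/tcp-client {:host "riemann.example.com"}))

    ; Emit a business-level measurement, such as checkout latency in
    ; seconds, alongside the usual infrastructure checks.
    (-> (r/send-event client {:service "checkout latency"
                              :metric  0.245
                              :ttl     60
                              :tags    ["app" "payments"]})
        (deref 5000 ::timeout))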

There is a focus on measuring quality of service and customer experience. Monitoring provides data that measures quality of service and data that helps IT justify budgets, costs, or new projects. Much of this data is provided directly to business units, application teams, and other relevant parties via dashboards and reports.

This is typical in web-centric organizations and many mature startups. This type of approach is also commonly espoused by organizations that have adopted a DevOps culture/methodology. Monitoring will still largely be managed by an operations team, but responsibility for ensuring new applications and services are monitored may be delegated to application developers. Products will not be considered feature complete or ready for deployment without monitoring.

1.4 Model distribution

Broadly based on some of our monitoring research, we’ve created a distribution for our monitoring maturity model.

Figure: Monitoring Maturity Model Distribution.

As you can see, the vast majority of environments fall into the Reactive level, which may not come as a surprise to most engineers. The Reactive level of maturity is relatively simple to achieve and appears to satisfy most of the basic needs for monitoring.

As stated, neither the model nor the proposed distribution is perfect. But, even given the broadness of the potential distribution, we can make some architectural predictions about the implementation of monitoring inside a Reactive level organization.

Figure: Traditional Monitoring.

This image represents the classic monitoring configuration we’ve seen repeated in a wide cross-section of Reactive-level organizations.

We have a Nagios instance that runs host and service checks, sends SMS or email notifications when something is wrong, and serves as the primary dashboard for interacting with notifications. There are numerous variants of this base setup, with both open source and commercial tools, but this remains the basic configuration you’re likely to see in Reactive-level organizations.

It’s also a basic configuration that is fundamentally flawed. We talked earlier about the two customers of monitoring: the business and technology. Our Reactive environment doesn’t serve the former at all and barely serves the latter.

1.5 Becoming Proactive

A Reactive environment generates infrastructure-centric monitoring outputs: a host is down, a service is broken. There are no business- or application-centric outputs. Without those outputs the business can’t rely on monitoring to provide inputs to business decisions. You certainly can’t use the data to justify budget for improving or updating the infrastructure, or, often more importantly, for investing in your team.

As the Reactive environment is infrastructure-centric, it also only serves a segment of our technology customer—generally only operational teams—and doesn’t provide useful, application-centric data to developers. As a result, non-operations staff are disconnected from the reality of the performance and availability of the infrastructure and applications being monitored. Developers usually receive outputs secondhand, discouraging accountability for issues and faults.

Note: It’s important to mention here that this critique of the Reactive model of monitoring does not (yet) touch on choices of tools and technology. This is not about picking on one tool or another, or about wars between toolchains. It’s purely about the ability to deliver customers what they need, and to make it easier for you to do your job.

So how do we take our typical Reactive environment and turn it into a much more palatable Proactive environment? Measurement. We’re going to update our Reactive environment to focus on events, metrics, and logs, replacing much of our existing monitoring infrastructure, such as service- and host-centric checks, with event- and metric-driven checks.
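As a taste of what a metric-driven check can look like, here’s a minimal sketch in Riemann’s Clojure-based configuration (we introduce Riemann in Chapter 3 and Clojure in Appendix A). The service name, threshold, and email addresses are illustrative assumptions:

    ; A minimal sketch of a metric-driven check in a Riemann
    ; configuration. The service name, threshold, and addresses
    ; are illustrative assumptions.
    (def email (mailer {:from "riemann@example.com"}))

    (streams
      ; Match disk usage events and notify when usage exceeds 90%.
      (where (and (service "disk /") (> metric 0.9))
        (email "ops@example.com")))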

In our monitoring framework, events, metrics, and logs are going to be at the core of our solution. The data points that make up our events, metrics, and logs will provide the source of truth for:

  • The state of our environment.
  • The performance of our environment.

So, rather than using infrastructure-centric checks, like pinging a host to confirm its availability or monitoring a process to confirm a service is running, we’re going to replace most of those fault detection checks with metrics.

If a metric is arriving then the service is available. If it stops arriving then it’s likely the service is not available.
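In a Riemann-style setup, for example, this heuristic falls out of event TTLs: every event carries a time-to-live, and when a service’s events stop arriving, its entry in the index expires and can trigger a notification. A minimal sketch, where the TTL, expiry interval, and email addresses are illustrative assumptions:

    ; A minimal sketch of detecting a down service by the absence of
    ; its metrics. Events are indexed with a TTL; when a service stops
    ; reporting, its indexed event expires and triggers a notification.
    ; The TTL, expiry interval, and addresses are assumptions.
    (def email (mailer {:from "riemann@example.com"}))

    ; Scan the index every 10 seconds for expired events.
    (periodically-expire 10)

    (streams
      ; Index incoming events, defaulting the TTL to 60 seconds.
      (default :ttl 60 (index))

      ; When an indexed event expires, send a notification.
      (expired
        (email "ops@example.com")))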

Visualization of those events, metrics, and logs will also allow for the ready expression and interpretation of complex ideas that would otherwise take thousands of words or hours of explanation.

In Chapter 2, we’ll walk through our proposed framework in detail including how we’ve chosen to design it, and why we’ve chosen certain types of tools and techniques.

To help articulate this framework in the book we’ve used a make-believe company called Example.com so you can see what a real-world build might look like. Let’s take a quick look at the world of Example.com. Example has three main sites:

  • Production A
  • Production B (DRP)
  • Mission Control

Each site is geographically separated. We’re going to focus on applications in our production site, Production A, but we’re going to show you how you can build as resiliently as possible across multiple sites. Example also has a DRP site, Production B, and a Mission Control site that contains management infrastructure including consoles and dashboards. Where relevant, we’ll demonstrate how to connect these sites into your monitoring framework as well.

Example also has test environments. In the real world we’d replicate much, if not all, of our new monitoring in these environments. This helps catch regressions and performance issues, and helps ensure monitoring is a first-class requirement when building applications and services.

Example is primarily a Linux environment, running recent versions of Red Hat Enterprise Linux and Ubuntu, and operates a number of internal and external-facing applications. Almost all of its applications are web based, with the stack including:

  • Java and JVM-based applications
  • Ruby on Rails
  • LAMP-stack applications

Its database stack is a mix of MySQL/MariaDB, PostgreSQL, and Redis.

Much of the environment is managed with configuration management tools, and each environment has a Nagios server for monitoring.

Lastly, Example is beginning to explore the use of tools like Docker, and SaaS products like GitHub, PagerDuty, and others.

This environment provides a representative sample of the technologies you’re likely to manage yourself, and one that can be adapted to a wide variety of other environments and stacks.

1.6 What’s in the book?

In this book, you’ll learn how to build a monitoring framework. We’ll describe our proposed framework in Chapter 2 and build it, piece by piece, in subsequent chapters, then finally make use of the framework to monitor infrastructure, services, and applications.

It’s really important to understand that this isn’t a monitoring bible for every technology stack. We do use a lot of example applications covering a wide range of technologies to show you how to monitor different components. We don’t, however, provide detailed lists of exactly what you should monitor for every technology stack. This is because every environment and application is developed, built, and coded differently. Every organization also has different architecture and monitoring objectives, thresholds, and concerns.

We’ll explore much of what you might need to monitor, identify critical checks, and introduce a series of patterns you can adopt or adapt. You should be able to build the framework into a solution for your organization that meets your specific needs.

Let’s look at what’s in each chapter.

  • Chapter 1: This introduction.
  • Chapter 2: Our monitoring framework: monitoring, metrics, and measurement. This chapter provides background on the decisions and architecture of our monitoring framework.
  • Chapter 3: Managing events and metrics with an event router called Riemann.
  • Chapter 4: Storing and visualizing metrics with Graphite and Grafana.
  • Chapter 5: Host monitoring with collectd.
  • Chapter 6: Using collectd events in Riemann and Graphite.
  • Chapter 7: Monitoring containers. In this chapter we look at monitoring containers, primarily Docker.
  • Chapter 8: Collecting logs for diagnosis and status with the Elasticsearch, Logstash, and Kibana (ELK) stack.
  • Chapter 9: Building monitored applications: How to add instrumentation, metrics, logging, and events to your applications.
  • Chapter 10: Notifications: Building contextual and human-friendly notifications.
  • Chapters 11 to 13: Monitoring a stack. We’ll put all our components together to monitor an example host, service, and application stack. These chapters will present a full picture of how our framework will work.
  • Appendix A: An introduction to Clojure, which Riemann uses as a configuration language. (We recommend you read this prior to Chapter 3.)

Finally, one topic we’re not covering directly in the book is the monitoring of non-host devices: networking equipment, storage devices, data center equipment. However, many of the techniques we’re exploring in the book can be replicated on these kinds of devices. Modern devices allow you to push metrics, provide metric and status endpoints, and generate appropriate events and logs.

1.7 Tool choices

In this book we look at mostly free and open source monitoring tools and solutions. There are a number of commercial tools and online services that provide monitoring services but we won’t cover them in much detail.

We recognize that there are a lot of moving pieces here. You might look at the list of tools we’re introducing and say “that’s a lot of software I have to learn and manage.” To help with this, the book is arranged so that you can implement pieces of the framework rather than the whole. Most chapters have a stand-alone component that you could use on its own, in addition to integrating it with the other components.

We’ve also chosen tools we think are best of breed in their domains. These choices are based on research, experience, and consultation with colleagues in the industry. Where possible, in each chapter we’ve listed alternative tools you could explore if you find the tools introduced don’t suit you or don’t meet your needs.

Perhaps a better way of looking at these tool choices is that they are merely ways to articulate the change in monitoring approach that is proposed in this book. They are the trees in the woods. If you find other tools that work better for you and achieve the same results then we’d love to hear from you. Write a blog post, give a talk, or share your configuration.
