Transparency

Shipboard engineers can tell when something is about to go wrong by the sound of the giant diesel engines. They’ve learned, by living with their engines, to recognize normal, nominal, and abnormal. They are constantly surrounded by the sounds and rhythms of their environment. When something is wrong, the engineers’ knowledge of the linkages within the engines can lead them to the problem with speed and accuracy—and with just one or two clues—in a way that can seem psychic.

The power plant in a ship radiates information through ambient sounds and vibration, through gauges with quantitative information, and in extreme (usually bad) cases through smell. Our systems aren’t so naturally exposed. They run in invisible, faceless, far-distant boxes. We don’t see or hear the fans spin. No giant reel-to-reel tape drives whiz back and forth. If we are to get the kind of “environmental awareness” that the shipboard engineers naturally acquire, we must facilitate that awareness by building transparency into our systems.

Transparency refers to the qualities that allow operators, developers, and business sponsors to gain understanding of the system’s historical trends, present conditions, instantaneous state, and future projections. Transparent systems communicate, and in communicating, they train their attendant humans.

In debugging the “Black Friday problem” (see Chapter 6, Case Study: Phenomenal Cosmic Powers, Itty-Bitty Living Space), we relied on component-level visibility into the system’s current behavior. That visibility was no accident. It was the product of enabling technologies implemented with transparency and feedback in mind. Without that level of visibility, we probably could’ve known that the site was slow (if a disgruntled user called us or someone in the business happened to hit the site), but we would have had no idea why. It would be like having a sick goldfish—nothing you do can help, so you just wait and see whether it lives or dies.

Debugging a transparent system is vastly easier, so transparent systems will mature faster than opaque ones.

When making technical or architectural changes, you are totally dependent on data collected from the existing infrastructure. Good data enables good decision-making. In the absence of trusted data, decisions will be made for you based on somebody’s political clout, prejudices, or whoever has the best “executive style” hair.

Finally, a system without transparency cannot survive long in production. If administrators don’t know what the system is doing, it can’t be tuned and optimized. If developers don’t know what works and doesn’t work in production, they can’t increase its reliability or resilience over time. And if the business sponsors don’t know whether they’re making money on it, they won’t fund future work. Without transparency, the system will drift into decay, functioning a bit worse with each release. Systems can mature well if, and only if, they have some degree of transparency.

This section takes our first slice at transparency. We’ll see what machine and service instances must do to create transparency. Later, in Chapter 10, Control Plane, we see how to knit instance-level information with other sources to create system-level transparency. That system-level view will provide historical analysis, present state, instantaneous behavior, and future projections. The job of an individual instance is to reveal enough data to enable those perspectives.

Designing for Transparency

Transparency arises from deliberate design and architecture. “Adding transparency” late in development is about as effective as “adding quality.” Maybe it can be done, but only with greater effort and cost than if it’d been built in from the beginning.

Visibility inside one application or server is not enough. Strictly local visibility leads to strictly local optimization. For example, a retailer ran a major project to get items appearing on the site faster. The nightly update was running until 5 or 6 a.m., but it needed to complete closer to midnight. The project optimized the string of batch jobs that fed content to the site, and it met its goal in that the batch jobs finished two hours earlier. Items still did not appear on the site, however, until a long-running parallel process finished at 5 or 6 a.m. The local optimization on the batch jobs had no global effect.

Visibility into one application at a time can also mask problems with scaling effects. For instance, observing cache flushes on one application server would not reveal that each server was knocking items out of all the other servers’ caches. Every time an item was displayed, it was accidentally being updated, which sent a cache invalidation notice to every other server. As soon as all the caches’ statistics appeared on one page, the problem was obvious. Without that visibility, we would’ve added many servers to reach the necessary capacity, and each server would’ve made the problem worse.

In designing for transparency, keep a close eye on coupling. It’s relatively easy for the monitoring framework to intrude on the internals of the system. The monitoring and reporting systems should be like an exoskeleton built around your system, not woven into it. In particular, decisions about what metrics should trigger alerts, where to set the thresholds, and how to “roll up” state variables into an overall system health status should all be left outside of the instance itself. These are policy decisions that will change at a very different rate than the application code will.

Enabling Technologies

By its nature, a process running on an instance is totally opaque. Unless you’re running a debugger on the process, it reveals practically nothing about itself. It might be working fine, it might be running on its very last thread, or it might be spinning in circles doing nothing. Like Schrödinger’s cat, it’s impossible to tell whether the process is alive or dead until you look at it.

The very first trick, then, is getting information out of the process. This section examines the most important enabling technologies that reduce the opacity of that process boundary. You can classify these as either “white-box” or “black-box” technologies.

A black-box technology sits outside the process, examining it through externally observable things. Black-box technologies can be implemented after the system is delivered, usually by operations. Even though black-box technologies are unknown to the system being observed, you can still do helpful things during development to facilitate the use of these tools. Good logging is one example. Instances should log their health and events to a plain old text file. Any log-scraper can collect these without disturbing the server process.

By contrast, white-box technology runs inside the process. This kind of technology often looks like an agent delivered in a language-specific library. These must be integrated during development. White-box technologies necessarily have tighter coupling to the language and framework than black-box technologies.

White-box technology often comes with an API that the application can call directly. This provides a great increase in transparency, because the application can emit very specific, relevant events and metrics. It comes at the cost of coupling to that provider. That coupling is a small price to pay when compared to the degree of clarity it provides.
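
As a concrete illustration, here is a minimal sketch of what calling such an API might look like, using Micrometer as one example of a white-box metrics library. The class name, the metric names, and the checkout scenario are all invented for this example; the point is only that the application itself emits domain-specific events and measurements.

    import io.micrometer.core.instrument.Counter;
    import io.micrometer.core.instrument.MeterRegistry;
    import io.micrometer.core.instrument.Timer;
    import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

    // Illustrative only: Micrometer is one example of a white-box metrics
    // library; the metric names and the checkout scenario are made up.
    public class CheckoutMetrics {
        private final Counter ordersPlaced;
        private final Timer paymentLatency;

        public CheckoutMetrics(MeterRegistry registry) {
            // The application registers metrics that mean something in its
            // own domain, not just generic CPU or memory numbers.
            this.ordersPlaced = registry.counter("checkout.orders.placed");
            this.paymentLatency = registry.timer("checkout.payment.latency");
        }

        public void recordOrder(Runnable paymentCall) {
            // Wrap the interesting operation so the library captures timing,
            // then count the business event itself.
            paymentLatency.record(paymentCall);
            ordersPlaced.increment();
        }

        public static void main(String[] args) {
            CheckoutMetrics metrics = new CheckoutMetrics(new SimpleMeterRegistry());
            metrics.recordOrder(() -> { /* call the payment provider here */ });
        }
    }

The coupling shows up in those import statements: the application now depends on the provider’s library, which is the trade-off described above.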

Logging

Despite millions of R&D dollars spent on “enterprise application management” suites and spiffy operations centers with giant plasma monitors showing color-coded network maps, good old log files are still the most reliable, versatile information vehicle. It’s worth a chuckle once in a while to realize that here we are, in the twenty-first century, and log files are still one of our most valuable tools.

Logging is certainly a white-box technology; it must be integrated pervasively into the source code. Nevertheless, logging is ubiquitous for a number of good reasons. Log files reflect activity within an application. Therefore, they reveal the instantaneous behavior of that application. They’re also persistent, so they can be examined to understand the system’s status—though that often requires some “digestion” to trace state transitions into current states.

If you want to avoid tight coupling to a particular monitoring tool or framework, then log files are the way to go. Nothing is more loosely coupled than log files; every framework or tool that exists can scrape log files. This loose coupling means log files are also valuable in development, where you are less likely to find ops tools.

Even in the face of this value, log files are badly abused. Here are some keys to successful logging.

Log Locations

Despite what all those application templates create for us, a logs directory under the application’s install directory is the wrong way to go. Log files can be large. They grow rapidly and consume lots of I/O. For physical machines, it’s a good idea to keep them on a separate drive. That lets the machine use more I/O bandwidth in parallel and reduces contention for the busy drives.

Even if your instance runs in a VM, it’s still a good idea to separate log files out from application code. The code directory needs to be locked down and have as little write permission as possible (ideally, none).

Apps running in containers usually just emit messages on standard out, since the container itself can capture or redirect that.

If you make the log file locations configurable, then administrators can just set the right property to locate the files. If you don’t make the location configurable, then they’ll probably relocate the files anyway, but you might not like how it gets done. Odds are it’ll involve a lot of symlinks.

On UNIX systems, symlinks are the most common workaround. This involves creating a symbolic link from the logs directory to the actual location of the files. There’s a small I/O penalty on each file open, but not much compared to the penalty of contention for a busy drive. I’ve also seen a separate filesystem dedicated to logs mounted directly underneath the installation directory.
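
Making the location configurable avoids that workaround entirely. Here is a minimal sketch using only the JDK’s java.util.logging; the property name app.log.dir, the environment variable APP_LOG_DIR, the default path, and the rotation settings are assumptions made for illustration.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.logging.FileHandler;
    import java.util.logging.Logger;
    import java.util.logging.SimpleFormatter;

    // Sketch only: the property/env names ("app.log.dir", "APP_LOG_DIR") and
    // the default directory are invented. The point is that operations can
    // point logs at a separate volume without symlink gymnastics.
    public class LogLocation {
        public static Logger configure() throws IOException {
            String dir = System.getProperty("app.log.dir",
                    System.getenv().getOrDefault("APP_LOG_DIR", "/var/log/myapp"));
            Path logDir = Paths.get(dir);
            Files.createDirectories(logDir);

            // Rotate across five 10 MB files; append across restarts.
            FileHandler handler = new FileHandler(
                    logDir.resolve("myapp.%g.log").toString(),
                    10 * 1024 * 1024, 5, true);
            handler.setFormatter(new SimpleFormatter());

            Logger logger = Logger.getLogger("myapp");
            logger.addHandler(handler);
            return logger;
        }
    }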

Logging Levels

As humans read (or even just scan) log files for a new system, they learn what “normal” means for that system. Some applications, particularly young ones, are very noisy; they generate a lot of errors in their logs. Some are quiet, reporting nothing during normal operation. In either case, the applications will train their humans on what’s healthy or normal.

Most developers implement logging as though they are the primary consumer of the log files. In fact, administrators and engineers in operations will spend far more time with these log files than developers will. Logging should be aimed at production operations rather than development or testing. One consequence is that anything logged at level “ERROR” or “SEVERE” should be something that requires action on the part of operations. Not every exception needs to be logged as an error. Just because a user entered a bad credit card number and the validation component threw an exception doesn’t mean anything has to be done about it. Log errors in business logic or user input as warnings (if at all). Reserve “ERROR” for a serious system problem. For example, a circuit breaker tripping to “open” is an error. It’s something that should not happen under normal circumstances, and it probably means action is required on the other end of the connection. Failure to connect to a database is an error—there’s a problem with either the network or the database server. A NullPointerException isn’t automatically an error.
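
A sketch of that level discipline, using SLF4J. The service, the exception types, and the messages are invented for the example; what matters is that bad input is a warning while an unreachable dependency is an error.

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Illustrative sketch: the class and methods are made up, but the rule
    // holds: "ERROR" means operations needs to act.
    public class PaymentService {
        private static final Logger log = LoggerFactory.getLogger(PaymentService.class);

        void charge(String cardNumber) {
            try {
                validate(cardNumber);
            } catch (IllegalArgumentException e) {
                // Bad user input: nobody in operations needs to act on this.
                log.warn("Card validation failed: {}", e.getMessage());
                return;
            }

            try {
                submitToGateway(cardNumber);
            } catch (java.net.ConnectException e) {
                // Infrastructure failure: somebody does need to act on this.
                log.error("Cannot reach payment gateway", e);
            }
        }

        private void validate(String cardNumber) {
            if (cardNumber == null || cardNumber.isBlank()) {
                throw new IllegalArgumentException("card number is empty");
            }
        }

        private void submitToGateway(String cardNumber) throws java.net.ConnectException {
            // Placeholder for the real gateway call.
        }
    }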

Human Factors

Above all else, log files are human-readable. That means they constitute a human-computer interface and should be examined in terms of human factors. This might sound trivial—even laughable—but in a stressful situation, such as a Severity 1 incident, human misinterpretation of status information can prolong or aggravate the problem. Operators for the Three Mile Island reactor misinterpreted the meaning of coolant pressure and temperature values, leading them to take exactly the wrong action at every turn. (See Inviting Disaster [Chi01], pages 49–63.) Although most of our systems will not vent radioactive steam when they break, they will expel our money and our reputation. Therefore, it behooves us to ensure that log files convey clear, accurate, and actionable information to the humans who read them.

If log files are a human interface, then they should be written so that humans can recognize and interpret them as rapidly as possible. Formats that break up columns and create a ragged left-to-right scanning pattern defeat that purpose; they are not human-readable.

Voodoo Operations

As I said before, humans are good at detecting patterns. In fact, we appear to have a natural bias toward detecting patterns, even when they aren’t there. In Why People Believe Weird Things [She97], Michael Shermer discusses the evolutionary impact of pattern detection. Early humans who failed to detect a real pattern—such as a pattern of light and shadow that turned out to be a leopard—were less likely to pass on their genes than those who detected patterns that weren’t there and ran away from a clump of bushes that happened to look like a leopard.

In other words, the cost of a false positive—“detecting” a pattern that wasn’t—was minimal, whereas the cost of a false negative—failing to detect a pattern that was there—was high. Shermer claims that this evolutionary pressure creates a tendency toward superstitions. I’ve seen it in action.

Given a system on the verge of failure, administrators in operations have to proceed through observation, analysis, hypothesis, and action very quickly. If that action appears to resolve the issue, it becomes part of the lore, possibly even part of a documented knowledge base. Who says it was the right action, though? What if it’s just a coincidence?

I once found a practice in the operations group for one of my early commerce applications that was no better than witchcraft. I happened to be in an administrator’s cubicle when her pager went off. On seeing the message, she immediately logged into the production server and started a database failover. Curious, and more than a little alarmed, I asked what was going on. She told me that this one message showed that a database server was about to fail, so they had to fail over to the other node and restart the primary database. When I looked at the actual message, I got cold shivers. It said, “Data channel lifetime limit reached. Reset required.”

Naturally, I recognized that message, having written it myself. The thing was, it had nothing at all to do with the database. It was a debug message (see Debug Logs in Production) informing me that an encrypted channel to an outside vendor had been up and running long enough that the encryption key would soon be vulnerable to discovery, just because of the amount of encrypted data that the channel served. It happened about once a week.

Part of the problem was the wording of the message. “Reset required” doesn’t say who has to do the reset. If you looked at the code, it was clear that the application itself reset the channel right after emitting that message—but the consumers of the message didn’t have the code. Also, it was a debug message that I had left enabled so I could get an idea of how often it happened at normal volumes. I just forgot to ever turn it off.

I traced the origin of this myth back about six months to a system failure that happened shortly after launch. That “Reset required” message was the last thing logged before the database went down. There was no causal connection, but there was a temporal connection. (There was no advance warning about the database crash—it required a patch from the vendor, which we had applied shortly after the outage.) That temporal connection, combined with an ambiguous, obscurely worded message, led the administrators to perform weekly database failovers during peak hours for six months.

Final Notes on Logging

Messages should include an identifier that can be used to trace the steps of a transaction. This might be a user’s ID, a session ID, a transaction ID, or even an arbitrary number assigned when the request comes in. When it’s time to read ten thousand lines of a log file (after an outage, for example), having a string to grep will save tons of time.
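
One common way to do this on the JVM is SLF4J’s mapped diagnostic context (MDC), which stamps every line logged on the current thread. This is a minimal sketch; the key name requestId and the way the identifier arrives are assumptions, and the logging pattern has to include %X{requestId} for the value to appear in the output.

    import java.util.UUID;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.slf4j.MDC;

    // Sketch: tag every log line with a request identifier that can be
    // grepped later. The key name "requestId" is an assumption.
    public class RequestLogging {
        private static final Logger log = LoggerFactory.getLogger(RequestLogging.class);

        public void handle(String incomingRequestId, Runnable work) {
            // Reuse the caller's ID if one was passed; otherwise mint one here.
            String requestId = (incomingRequestId != null)
                    ? incomingRequestId
                    : UUID.randomUUID().toString();
            MDC.put("requestId", requestId);
            try {
                log.info("request started");
                work.run();
                log.info("request finished");
            } finally {
                // Always clear the MDC so the ID doesn't leak onto another
                // request handled later by the same thread.
                MDC.remove("requestId");
            }
        }
    }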

Interesting state transitions should be logged, even if you plan to use SNMP traps or JMX notifications to inform monitoring about them. Logging the state transitions takes a few seconds of additional coding, but it leaves options open downstream. Besides, the record of state transitions will be important during postmortem investigations.

Instance Metrics

The instance itself won’t be able to tell much about overall system health, but it should emit metrics that can be collected, analyzed, and visualized centrally. This may be as simple as periodically spitting a line of stats into a log file. The stronger your log-scraping tools are, the more attractive this option will be. Within a large organization, this is probably the best choice.
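
As a sketch of that “line of stats” option, the following uses only JDK classes to print one machine-parseable line per minute. The field names and the sixty-second interval are arbitrary choices for illustration, and real instances would emit whatever measurements matter to them.

    import java.lang.management.ManagementFactory;
    import java.time.Instant;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Sketch: emit one stats line per minute that a log-scraper can split on
    // spaces and '=' signs. Field names and interval are arbitrary.
    public class StatsReporter {
        public static void start() {
            ScheduledExecutorService scheduler =
                    Executors.newSingleThreadScheduledExecutor();
            scheduler.scheduleAtFixedRate(() -> {
                Runtime rt = Runtime.getRuntime();
                long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
                // May report -1 on platforms without a load average.
                double load = ManagementFactory.getOperatingSystemMXBean()
                        .getSystemLoadAverage();
                int threads = Thread.activeCount();
                System.out.printf("%s stats heap_used_mb=%d load_avg=%.2f threads=%d%n",
                        Instant.now(), usedMb, load, threads);
            }, 0, 60, TimeUnit.SECONDS);
        }
    }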

An ever-growing number of systems have outsourced their metrics collection to companies like New Relic and Datadog. In these cases, providers supply plugins to run with different applications and runtime environments. They’ll have one for Python apps, one for Ruby apps, one for Oracle, one for Microsoft SQL Server, and so on. Small teams can get going much faster by using one of these services. That way you don’t have to devote time to the care and feeding of metrics infrastructure—which can be substantial. Some developers from Netflix have quipped that Netflix is a monitoring system that streams movies as a side effect.

Health Checks

Metrics can be hard to interpret; it takes time to learn what “normal” looks like in them. For quicker, easier summary information, we can create a health check as part of the instance itself. A health check is just a page or API call that reveals the application’s internal view of its own health. It returns data for other systems to read (although that may just be nicely formatted HTML).

A health check should be more than just “yup, it’s running.” It should report at least the following (a minimal sketch of such an endpoint follows the list):

  • The host IP address or addresses

  • The version number of the runtime or interpreter (Ruby, Python, JVM, .Net, Go, and so on)

  • The application version or commit ID

  • Whether the instance is accepting work

  • The status of connection pools, caches, and circuit breakers
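
Here is a minimal sketch of such an endpoint, built on the HTTP server that ships with the JDK. The JSON fields mirror the list above; the port, the commit ID, and the circuit breaker status are hard-coded placeholders that a real instance would fill in at runtime.

    import com.sun.net.httpserver.HttpServer;
    import java.io.OutputStream;
    import java.net.InetAddress;
    import java.net.InetSocketAddress;
    import java.nio.charset.StandardCharsets;

    // Sketch: a /health endpoint whose fields mirror the list above. The
    // commit ID and circuit breaker status are placeholders.
    public class HealthCheck {
        public static void main(String[] args) throws Exception {
            HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
            server.createContext("/health", exchange -> {
                String json = String.format(
                    "{\"host\":\"%s\",\"runtimeVersion\":\"%s\"," +
                    "\"commitId\":\"abc123\",\"acceptingWork\":true," +
                    "\"circuitBreakers\":{\"payment-gateway\":\"closed\"}}",
                    InetAddress.getLocalHost().getHostAddress(),
                    System.getProperty("java.version"));
                byte[] body = json.getBytes(StandardCharsets.UTF_8);
                exchange.getResponseHeaders().add("Content-Type", "application/json");
                // 200 means healthy; an unhealthy instance would return 503.
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            });
            server.start();
        }
    }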

The health check is an important part of traffic management, which we’ll examine further in Chapter 9, Interconnect. Clients of the instance shouldn’t look at the health check directly; they should be using a load balancer to reach the service. The load balancer can use the health check to tell if a machine has crashed, and it can also use it for the “go live” transition. When the health check on a new instance goes from failing to passing, the app is done with its startup.
