Chapter 8. Server Monitoring

With many monitoring efforts beginning in the sysadmin/ops engineer team, it’s no wonder that many of us immediately associate “monitoring” with “the thing the sysadmins do.” This is unfortunate since we’ve seen there’s so much more to monitoring than just what happens on a server.

Of course, there’s an element of truth in the misconception: a lot really does happen on the server! Even in a serverless architecture, there are still servers underneath that provide the platform and all that makes it tick. We’re going to delve down into what sort of common services you’ll encounter on servers these days, what metrics and logs are provided, and how to make sense of it all.

One note before we jump in: this chapter is going to use Linux as the assumed operating system, since that’s what I’m most familiar with. For the readers applying these lessons to Windows, nearly all of the stuff we’ll be covering is just as applicable to Windows in a general sense, though your tools are different.

Standard OS Metrics

Over the course of this book, I’ve railed against the obsession with the standard OS metrics (CPU, memory, load, network, disk) and for good reason: starting your monitoring work with them is starting with the metrics that offer the least signal of all toward your main concern (that your app is working). In order to know if things are working, you have to start at the top instead, as I covered earlier in the book.

However, that isn’t to say these metrics are without usefulness. In fact, they can be an ally when used in the proper context: diagnostics and troubleshooting. In that context, these metrics are some of the most powerful metrics you have available.

My recommendation for how to use these metrics: automatically record them for every system you have, but don’t set up alerts on them (unless you have a good reason). Pretty much every monitoring tool out there collects these by default with little to no intervention from you; so rather than discuss how to collect them, I’m going to discuss what they mean and how to use them. I’ll be using well-known Linux command-line tools in my explanations—tools that you are likely well-acquainted with.

CPU

Monitoring CPU usage is the most straightforward of all these metrics. The metrics come from /proc/stat and are available interactively with a number of utilities. We’ll use top here:

top - 21:13:27 up 98 days,  2:01,  1 user,  load average: 0.00, 0.01, 0.05
Tasks: 105 total,   1 running, 104 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:    500244 total,   465708 used,    34536 free,    85104 buffers
KiB Swap:        0 total,        0 used,        0 free.   244488 cached Mem

The third line has the CPU information we want: the percent utilization. As we can see, this particular server is 100% idle (id). To determine the utilization percentage, add user (us), system (sy), niced processes (ni), hardware interrupts (hi), and software interrupts (si). iowait (wa) and stolen time (st) aren’t included, since they represent time spent waiting rather than time spent doing work.
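
If you’re curious where those percentages come from, you can compute utilization straight from /proc/stat yourself. Here’s a minimal sketch that samples the aggregate cpu line twice and applies the same math (field order per the proc(5) man page); most monitoring agents do something very similar under the hood:

#!/bin/sh
# Sample the aggregate "cpu" line from /proc/stat twice, one second apart,
# then report the share of time spent doing work (us + ni + sy + hi + si)
# over that interval. iowait and steal count toward total time but not work.
read -r _ user1 nice1 sys1 idle1 iowait1 irq1 softirq1 steal1 _ < /proc/stat
sleep 1
read -r _ user2 nice2 sys2 idle2 iowait2 irq2 softirq2 steal2 _ < /proc/stat

busy=$(( (user2 + nice2 + sys2 + irq2 + softirq2) - (user1 + nice1 + sys1 + irq1 + softirq1) ))
total=$(( busy + (idle2 - idle1) + (iowait2 - iowait1) + (steal2 - steal1) ))

echo "CPU utilization: $(( 100 * busy / total ))%"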

Memory

Percent used versus free is the main thing here. Memory used can be further broken down by shared, cached, and buffered, which are all counted as “used.” Most tools report memory metrics based on values reported by /proc/meminfo. Here we’ll use the free command, which gets its data from /proc/meminfo as well, with the -m switch to show megabytes to make this discussion easier on us both:

             total       used       free     shared    buffers     cached
Mem:           488        426         62         31         75        222
-/+ buffers/cache:        127        360
Swap:            0          0          0

The output is often misunderstood, so let’s break it down. The first row would seem to say that this system has 488 MB of total memory, 426 MB used, and thus 62 MB free, but what’s with the remaining columns? And row two?

First, let’s talk about buffers and caches. In Linux memory management, file system metadata (such as permissions and contents of a directory) for recently accessed areas of the disk are stored in buffers. The expectation is that you’re likely to request this information again soon, so storing it in buffers will result in a quicker access time for you. Caches work similarly but for the contents of recently accessed files. Because these are transient areas of memory with oft-changing contents, the memory used by them is technically available for use by any processes that need it.

Now, about that second row: since the memory is available should it be required, the second row is more useful for determining memory usage. This row represents memory used (minus buffers and cache) and free (plus buffers and cache). When determining whether you need more memory on a system, look at the second row, not the first. Some tools record memory used/free with buffers and cache built into the calculation, while some report the straight metrics from /proc/meminfo, leaving you, the user, to do the math on what the memory usage really is.
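
Some tools hand you only the raw values from /proc/meminfo, in which case the buffers/cache math is easy to do yourself. A minimal sketch (field names as they appear in /proc/meminfo, values in kB; newer kernels also expose a MemAvailable field that does a similar calculation for you):

#!/bin/sh
# Reproduce the math behind free's "-/+ buffers/cache" row using /proc/meminfo.
total=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
free_mem=$(awk '/^MemFree:/ {print $2}' /proc/meminfo)
buffers=$(awk '/^Buffers:/ {print $2}' /proc/meminfo)
cached=$(awk '/^Cached:/ {print $2}' /proc/meminfo)

used=$(( total - free_mem - buffers - cached ))
echo "Used (minus buffers/cache): $(( used / 1024 )) MB"
echo "Free (plus buffers/cache):  $(( (free_mem + buffers + cached) / 1024 )) MB"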

The third line is self-explanatory: swap. If your systems use swap partitions/files (they’re relatively uncommon in cloud infrastructure these days), then track it. Alerting on low free memory and increasing swap utilization is a great indicator of increased memory pressure, if your app is memory-sensitive.

Another way to watch for serious memory issues is to monitor your logs for the OOMKiller. This kernel mechanism is responsible for terminating processes in an effort to free up memory when the system is under high memory pressure. Grepping for killed process in your syslog will spot this. I recommend creating an alert in your log management system for any occurrences of OOMKiller. Any time the OOMKiller is coming into the picture, you’ve got a problem somewhere, especially because OOMKiller is unpredictable in its choice of target processes to terminate.
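
If you want a quick manual check, or a stopgap cron-based check before your log management system is wired up, a simple grep does the job (the path shown is Debian/Ubuntu’s; Red Hat derivatives use /var/log/messages):

~$ grep -i "killed process" /var/log/syslog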

Network

Monitoring network performance on a server is similar to monitoring the network itself: all the same metrics apply. The information ultimately comes from /proc/net/dev on Linux, with ifconfig and ip (from the iproute2 package) being the de facto tools for interacting with it. At minimum, be sure to collect octets in/out, errors, and drops on your server interfaces. For details on what these metrics mean, reference Chapter 9.
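
If you want to see the raw counters, /proc/net/dev has one line per interface: receive fields first, then transmit fields. A quick sketch that pulls out bytes, errors, and drops (column positions per the kernel’s standard format):

~$ awk 'NR > 2 { gsub(":", "", $1);
      print $1, "rx_bytes=" $2, "rx_errs=" $4, "rx_drop=" $5,
                "tx_bytes=" $10, "tx_errs=" $12, "tx_drop=" $13 }' /proc/net/dev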

Disk

Disk performance can be viewed in a variety of ways interactively, but they all read from the same source: /proc/diskstats. We’ll use iostat (available in the sysstat package, along with many other great tools) with the -x flag to give us an extended set of metrics to look at:

~$ iostat -x
Linux 3.13.0-74-generic (ip-10-0-1-196) 12/03/2016 _x86_64_ (1 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          0.09    0.01    0.01   0.03    0.00      99.86

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz
xvda            0.00     0.21      0.06    0.40    1.53     3.64   22.41

                avgqu-sz   await r_await w_await  svctm  %util
                0.00       1.16    0.89    1.21   0.34   0.02

We can see that I only have one disk and that it’s mostly idle. iowait is an important metric here: it represents the amount of time the CPU was idle due to waiting on the disk to complete operations. High iowait is something we want to avoid.

The bottom part of the iostat output talks specifically about our disk performance. There are a bunch of metrics here, some more useful than others. In the interest of brevity, I’m only going to hit the most important ones: await and %util. These two metrics directly speak to the utilization and pressure on the disk.

await is the average time (in milliseconds) taken for issued requests to be served by the disk. This number includes both the time spent in queue and the time spent performing the request. %util is most easily thought of as the level of usage saturation of the disk. You’ll want to keep this under 100%. Do note, however, that this metric can be misleading when the volume in question is part of a RAID array due to an inability to inspect it on a per-disk basis.

Running iostat without -x gives us another very useful metric: tps. tps, or transfers per second, is also known as I/O operations per second (IOPS). IOPS is an important metric for any service that makes use of disks, such as database servers:

~$ iostat
Linux 3.13.0-74-generic (ip-10-0-1-196) 12/03/2016 _x86_64_ (1 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          0.09    0.01   0.01     0.03   0.00      99.86

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
xvda              0.46      1.53         3.64       12987412   30858420

IOPS is a useful metric for determining when you need additional transfer capability (e.g., more spindles) or for spotting general performance issues. For example, if you’re tracking IOPS over time and you notice that this metric has experienced a sudden drop, you may have a disk performance problem on your hands.

Load

Load is a measurement of how many processes are waiting to be served by the CPU.1 It’s represented by three numbers: a 1 m average, a 5 m average, and a 15 m average. The most common method to see this interactively is via the uptime command (which pulls data from /proc/loadavg):

~$ uptime
 19:41:21 up 98 days, 29 min,  1 user,  load average: 0.00, 0.01, 0.05

A system with one CPU core and a load of 1.0 means that, on average, exactly one process was running or waiting for the CPU. Generally speaking, a load of 1.0 per core is considered to be perfectly acceptable.

The problem is that the load metric doesn’t translate to system performance. It’s not uncommon to find a server with a high load metric that is performing just fine. I’ve seen web servers with a 15 m load metric of over 500, but customers still able to use the system with no impact. As we learned back in Chapter 1, if nothing is impacted, is there really a problem?

The one exception to this is that load makes for a somewhat decent proxy metric. That is, an abnormal load metric is often an indicator of other problems (though, sometimes it isn’t!).

In general, I think relying on the load metric for anything is a waste of time.

SSL Certificates

I’m sure every person reading this book has had an SSL certificate expire on them without realizing it until it was too late. It sucks, but it happens.

Monitoring SSL certificates is simple: you just want to know how long you have until they expire and for something to let you know before that happens.

There are a few options I’ve found for how to best handle this problem:

  • Many domain registrars and certificate authorities (CAs) are capable of monitoring and alerting you on SSL certificate expiration (if you bought the certificate through them). This is the easiest method to implement. The downside is that they often alert you via email, which as we know, means you’ll probably miss it. Another downside of this is that you’re only checking the certificate itself, not the location where the certificate is in use, which is especially problematic for a wildcard certificate that might be installed in a dozen locations. If the alert can only come in via email, I recommend having it sent to a ticket system so you get a ticket opened instead of sitting in someone’s inbox.

  • If the SSL certificate is in use externally, you can use external site monitoring tools (e.g., Pingdom and StatusCake) to check and alert you on the certificate expiration. Tools such as these have the flexibility we want, but the downside is that they can’t monitor anything that isn’t publicly accessible (such as an internal service).

  • If you have a lot of internally used SSL certificates, you’re left with one option: an internal monitoring tool of some sort to check and report on the certificate. I haven’t found any great tools for this, but a simple shell script that runs regularly and reports to your monitoring system or ticket system works rather well (a sketch follows this list). Many on-premise monitoring systems also have the ability to monitor certificate expiration.
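
As an illustration, here’s a minimal sketch of such a script using openssl and GNU date. The hostname and the 30-day warning threshold are placeholders; in practice you’d loop over your endpoints and report the result to your monitoring or ticket system rather than just echoing it:

#!/bin/sh
# Warn when a certificate is within DAYS_WARN days of expiring.
HOST="www.example.com"    # placeholder endpoint to check
PORT=443
DAYS_WARN=30

# Grab the certificate from the server and pull out its expiration date
expiry=$(echo | openssl s_client -servername "$HOST" -connect "$HOST:$PORT" 2>/dev/null \
  | openssl x509 -noout -enddate \
  | cut -d= -f2)

expiry_epoch=$(date -d "$expiry" +%s)   # GNU date
now_epoch=$(date +%s)
days_left=$(( (expiry_epoch - now_epoch) / 86400 ))

if [ "$days_left" -lt "$DAYS_WARN" ]; then
  echo "Certificate for $HOST expires in $days_left days!"
fi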

SNMP

Let me put this in no uncertain terms: stop using SNMP for servers.

I’ll be going into much more detail about how SNMP works and the challenges of working with it in Chapter 9, but suffice it to say: it’s not a fun protocol to work with. Though you’re (mostly) stuck with it for the purpose of monitoring network gear, that’s (thankfully) not the case when it comes to monitoring servers.

Why shouldn’t you use SNMP?

  • Adding more functionality means extending the agent, which is a pain.

  • It requires running an inherently insecure protocol on your network. Yes, there is v3, which has encryption and some semblance of a security model, but it’s nowhere near enough. Your security folks will thank you for not doing this.

  • It requires a centralized poller for gathering metrics, which can be difficult to scale and manage. This isn’t a deal-breaker, as there are certainly ways to make this no longer a problem (some modern monitoring tools use centralized pollers).

  • There are far better options available with easier configuration and more capabilities.

Rather than using SNMP, opt for a push-based tool like collectd, Telegraf, or Diamond.

Web Servers

If you’re in the enterprise world, your experience with web server performance is likely limited to low-traffic, single-node web servers. However, if you’re in the webapp world, the performance of your web servers is one of the most critical components of your app. Monitoring web servers doesn’t differ between the two use cases, though the amount of time you spend looking at the metrics will obviously be higher for those in the webapp world.

When it comes to web servers, there is one golden metric for assessing performance and traffic level: requests per second (req/sec). Fundamentally, req/sec is a measurement of throughput. Less critical to performance, but still important for overall visibility, is monitoring your HTTP response codes. As you may know, the HTTP protocol has many different possible responses to a request. The most common is 200 OK, while there are also other common ones, such as 404 Not Found, 500 Internal Server Error, and 503 Service Unavailable.

Table 8-1. HTTP response codes (abbreviated)

Response code group   Group meaning    Common response codes
1xx                   Informational    100 Continue
2xx                   Success          200 OK, 204 No Content
3xx                   Redirection      301 Moved Permanently, 302 Found
4xx                   Client errors    400 Bad Request, 401 Unauthorized, 404 Not Found
5xx                   Server errors    500 Internal Server Error, 503 Service Unavailable

In total, there are 61 official HTTP responses, though some applications and web servers implement additional ones.2

These response codes are recorded in the request log for your web server. For example, on NGINX, an entry looks like this:

10.0.1.52 - - [10/Dec/2016:19:41:17 +0000] "GET / HTTP/1.1" 200 24952 "http://practicalmonitoring.com/" "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/98 Safari/537.4" "192.168.1.50"

Between the request command (GET / HTTP/1.1) and the size of the response in bytes (24952) is the HTTP response code: an HTTP 200 in this case, showing that the request was successful.

Of course, not all requests are successful. A rising number of non-200 responses (such as 5xx or 4xx) to clients can indicate issues with your app which could be costing you real money in lost sales.
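
A quick way to get a feel for the distribution is to tally status codes straight from the access log. Assuming the default NGINX combined log format shown above (where the status code is the ninth whitespace-separated field), this prints a count per code, most frequent first:

~$ awk '{print $9}' /var/log/nginx/access.log | sort | uniq -c | sort -rn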

There’s another metric that often gets people confused: connections. The short answer here is that connections are not requests, and you should pay more attention to requests than to connections. The longer answer leads us to the topic of keepalives.

Prior to the use of keepalives, each request required its own connection. Given that websites have multiple objects that need to be requested for the page to load, this led to a whole lot of connections. The trouble is that opening a connection requires going through the full TCP handshake, setting up a connection, moving data, and tearing down the connection—dozens of times for a single page. Thus, HTTP keepalives were born: the web server holds open the connection for a client rather than tearing it down, to allow connection reuse for the client. As a result, many requests can be made over a single connection. Of course, the connection can’t be held open forever, so keepalives are governed by a timeout (15 seconds in Apache, 75 seconds in NGINX). There is also keepalive configuration on the browser side called persistent connections with their own timeout values. All modern browsers use persistent connections by default.

One final useful metric is request time. NGINX and Apache both expose request time on a per-request basis in the access logs. By default, this isn’t included, so you’ll need to update the log format to include it.
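
In NGINX, for example, that means adding the $request_time variable to a log_format definition and using it in your access_log directive. A minimal sketch (the format name timed is arbitrary; the rest mirrors the default combined format):

log_format timed '$remote_addr - $remote_user [$time_local] '
                 '"$request" $status $body_bytes_sent '
                 '"$http_referer" "$http_user_agent" $request_time';

access_log /var/log/nginx/access.log timed;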

Database Servers

The first thing to monitor is the number of connections. Of particular note here is MySQL: for reasons beyond the scope of this book, MySQL refers to its client connections as threads, spawning exactly one thread per client connection, so don’t be confused when you go looking for the connections metric and can’t find it. All the other database engines refer to them as simply connections.

While the number of connections to your database is a good indicator of overall traffic levels, it isn’t necessarily indicative of how busy the database actually is. For that, we’ll need to look at queries per second (qps).

Measuring queries per second is a much more direct measurement of how busy a server is. The qps measurement will fluctuate more in sync with actual busyness of the app and is a wonderful indicator of exactly how loaded your database servers are.
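
MySQL doesn’t report qps directly, but you can derive it by sampling the Questions counter from SHOW GLOBAL STATUS twice and dividing by the interval. A rough sketch, assuming the mysql client is installed and credentials come from somewhere like ~/.my.cnf:

#!/bin/sh
# Sample the Questions counter twice and derive queries per second.
INTERVAL=10

q1=$(mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'Questions'" | awk '{print $2}')
sleep "$INTERVAL"
q2=$(mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'Questions'" | awk '{print $2}')

echo "Queries per second: $(( (q2 - q1) / INTERVAL ))"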

Slow queries are the bane of high-performance database infrastructures. A slow query will often manifest as a slow user experience, which we certainly don’t want. There are lots of reasons for slow queries and many strategies for fixing them, but the first step to fixing slow queries is finding them. Slow queries are recorded in a dedicated log (you may need to enable it first) along with the execution time, the number of times executed, and the exact query. There are many tools out there that make parsing this information easier (usually APM tools).

If you’re running a database infrastructure of any scale, you’re probably using replicas (previously known as slaves, but many vendors have changed the terminology) so monitoring replication delay is important—you certainly don’t want to find out about an out-of-sync replica days later. A normal delay is determined by the settings in your database server’s configuration.
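
On MySQL, for instance, a replica reports its delay as Seconds_Behind_Master in the output of SHOW SLAVE STATUS (newer releases use SHOW REPLICA STATUS and Seconds_Behind_Source), so a spot check is a one-liner:

~$ mysql -e "SHOW SLAVE STATUS\G" | grep Seconds_Behind_Master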

Finally, of particular importance here is the IOPS measurement I introduced at the start of this chapter. Databases are generally IO-constrained due to being heavy on reads/writes, so make sure you keep tabs on IOPS. Nothing is more frustrating than troubleshooting slow database performance and finding out an hour into it that it’s just a failing disk that’s caused IOPS to drop.

Entire books could be written about database performance monitoring and tuning. Oh wait, they have! I highly recommend reading Baron Schwartz’s High Performance MySQL (O’Reilly, 2008) and Laine Campbell and Charity Majors’ Database Reliability Engineering (O’Reilly, 2017). If you’re at all interested in squeezing the most out of your database infrastructure and building for scalability, definitely read those.

Load Balancers

Load balancers are most often used for HTTP traffic, though they can be used for other types of traffic. We’re only going to talk about HTTP here, though.

Load balancer metrics are very similar to web server metrics in that you’re tracking the same sorts of things. Load balancers function by presenting themselves to the client as a single node. There may be any number of servers on the backend that the remote client never interacts with directly. As such, the metrics are duplicated in two groups: frontend and backend. It’s the same set of metrics, but they tell you different things about the health of the load balancer itself and the backend servers. You’ll want to pay attention to both sides.

It’s worth mentioning here again that load balancers determine the state of their backend servers through a health check. The simplest health check is a simple connection to a specific port (such as checking that port 80 responds), but many load balancers also support HTTP health checks, which makes the /health endpoint pattern from Chapter 7 quite useful in load balancer health checks.
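
As an illustration, here’s roughly what that looks like in HAProxy; the backend name and server addresses are placeholders:

backend app_servers
    option httpchk GET /health
    server web1 10.0.1.10:80 check
    server web2 10.0.1.11:80 check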

Message Queues

A message queue is made up of two “speakers”: a publisher and a subscriber (message queues are sometimes called pub-sub systems because of this). Monitoring a queue is primarily about two things: queue length and consumption rate.

Queue length refers to the number of messages on a queue waiting to be taken off by one or more subscribers. A normal queue length depends on how your app works, so you want to pay attention to queues that get backed up with more messages than normal. Consumption rate is the rate at which messages are being taken off of the queue, or consumed. This metric is usually expressed in messages per second. Just as with queue length, a normal consumption rate depends on how your app works. Watch for an abnormal rate.

Any message queue software is going to provide you with significantly more metrics than just those two, and you’ll want to determine whether they’re useful for your environment by reading the relevant documentation. Start with these two, though, and you’ll be good for a while.

Caching

A cache’s primary metrics are the number of evicted items and the hit/miss ratio (sometimes called the cache-hit ratio).

As the cache fills up, older items are removed from the cache—evicted, that is. A high eviction rate is a good signal that the cache is too small: it’s having to push items out to make room for new ones.

When your app requests something from the cache and the item is found, it’s referred to as a cache hit. Likewise, if the item is requested and not found, it’s called a cache miss. Given that the purpose of a cache is to speed up common requests, cache misses slow things back down. Therefore, watching the hit/miss ratio is a great indicator of cache performance. Ideally, you’d want this at 100% hit, but that’s usually unrealistic for modern apps. Over time you’ll start to understand what a normal ratio is for your app.

These metrics are closely related and work together—it’s a balancing act.
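
As a concrete example, memcached exposes all three of the relevant counters (get_hits, get_misses, and evictions) through its stats command. Assuming memcached is listening on its default port and nc is installed, you can eyeball them like so:

~$ printf "stats\nquit\n" | nc localhost 11211 | grep -E "get_hits|get_misses|evictions"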

DNS

Unless you’re running your own DNS servers, there’s really nothing to monitor here. In case you are running your own DNS servers, well, that’s a different story.

If you’re running your own DNS, there are a few things you’re going to care about: zone transfers and queries per second.

Without going too deep into the inner workings of DNS, slaves are kept in sync with the master via zone transfers. Depending on configuration, these can be either transfer of the full zone (AXFR) or incremental transfers (IXFR). These are recorded in the log, and you’ll want to keep tabs on them for spotting sync issues. An out-of-sync slave will serve up potentially stale information, which is going to leave you with odd troubleshooting problems.

Monitoring queries per second helps you to understand the load your servers are facing and is the primary measure of it for DNS servers. You’ll want to record it on at least a per-server basis, but per-zone and per-view granularity is even better.

If you are running BIND, check out the statistics-channels configuration option: enabling it will expose all of these metrics in one place. Many tools are out there to take advantage of this, such as collectd’s BIND plugin.
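
Turning it on is a small addition to named.conf; something like the following serves the statistics over HTTP on localhost (the port and allow list are up to you):

statistics-channels {
    inet 127.0.0.1 port 8053 allow { 127.0.0.1; };
};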

NTP

Some of the weirdest issues I’ve ever troubleshot have come down to poor time synchronization. For example, Kerberos tickets (Kerberos being the authentication system used in Linux and Microsoft’s Active Directory) are strongly dependent on accurate time synchronization between servers and clients. Some apps also make use of the system time for kicking off jobs, while accurate time is a crucial aspect of troubleshooting a distributed architecture.

The NTP system can be complex and esoteric, but if you’re only running clients and not your own stratum 1 server, then there’s only one thing you need to be concerned with: time drift between the client and server.

ntpstat, available in Ubuntu 15.10 and later and CentOS 7 and later, is useful for giving you a quick answer to whether a client is synced properly or not:

~$ ntpstat
unsynchronised
   polling server every 64 s

And when it’s in sync:

~$ ntpstat
synchronised to NTP server (96.244.96.19) at stratum 3
   time correct to within 7973 ms
   polling server every 64 s

What’s neat is that the exit code from ntpstat corresponds to whether it’s synced or not, making for an easier way to monitor it: 0 for synced and 1 for unsynced. Using a shell script or something more advanced (such as collectd) makes monitoring this straightforward.
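
That makes the check itself nearly a one-liner. For instance, a small script run from cron could feed your alerting; the logger call here is just one way to get the failure into syslog:

#!/bin/sh
# ntpstat exits 0 when synced and nonzero when not, so the check is trivial.
if ! ntpstat > /dev/null 2>&1; then
  logger -t ntp-check "NTP is not synchronized on $(hostname)"
fi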

There are plenty more metrics you can look at if you’re running NTP servers yourself, but this is becoming uncommon. If you do run NTP servers, you’ll need to pay attention to drift between peers and your servers as well (ntpdate provides this information).

Miscellaneous Corporate Infrastructure

For those of you running traditional corporate infrastructure, there are two more things you might be managing that those in web-based environments won’t be contending with: DHCP and SMTP.

DHCP

There are two things you want to pay attention to here: the DHCP server handing out leases and whether the DHCP pools have enough lease capacity.

If you’re running DHCP on Linux, chances are you’re using ISC’s DHCPd. Unfortunately, ISC’s DHCPd is a real pain to properly monitor due to how it exposes performance data. In other words, you might have to put in a little bit of work here.

Lease information is stored in /var/lib/dhcpd.leases (that path might be different depending on your distribution). This file appends new leases to the end, so it’s entirely possible (and common) that the file will have two (or more!) leases for the same device, though only one of them is valid (the most recent one). Parsing this will give you the information you’re looking for on current lease usage. In order to get the data on the size of the lease pool, you’ll need to parse the pool definitions in the main DHCPd config file (/etc/dhcp/dhcpd.conf) to get the size of the IP range.
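
For the current-usage side, a rough sketch that counts active leases while keeping only the most recent entry for each address might look like this (the lease file path is the one mentioned above; adjust it for your distribution):

#!/bin/sh
# Count leases currently marked active. Because dhcpd appends new entries,
# an address can appear multiple times; keep only the latest binding state.
LEASE_FILE=/var/lib/dhcpd.leases

awk '
  /^lease /                { ip = $2 }
  /^[ \t]*binding state /  { s = $3; sub(/;/, "", s); state[ip] = s }
  END {
    active = 0
    for (i in state) if (state[i] == "active") active++
    print active, "active leases"
  }
' "$LEASE_FILE"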

SMTP

If you are running your own email services, then monitoring email is quite important. Email services are generally stable but can quickly ruin your day when things go wrong.

There are a lot of common email server packages out there, so I’ll cover the general metrics common to them all.

The outbound email queue measures how much email is waiting to be sent out for delivery. It’s best to measure this in relation to what’s normal and alert on it when things are abnormal.

Measuring the total amount of email sent and received (both in total and per mailbox) is great for spotting patterns and abnormal behavior (such as a potentially compromised mailbox).

Likewise, the size of mailboxes is a great indicator of how much capacity you need to plan for. I like to measure this both as a total and per-mailbox to spot power users and perhaps help them cut down storage.

Monitoring Scheduled Jobs

One of the trickiest things in monitoring seems like it should be so simple: how to monitor scheduled tasks/cron jobs where an absence of data is the symptom of something awry.

We’ve all been there before: the backup didn’t run and no one noticed for a few weeks. Oops. Now to add in some monitoring so that doesn’t happen again…

Since most setups send an email or append to a log on success, it’s easy to miss when a failure happens. One way to handle alerting on an absence of data is to create data where there was none before:

run-backup.sh > backup.log 2>&1 || echo "Job failed" > backup.log

This redirects the script’s output to backup.log, with stderr sent to the same place as stdout. This requires that your script implements solid error handling. The second part is the real magic: if run-backup.sh fails entirely (that is, exits nonzero), then Job failed is written to backup.log.

Once you have this data in your log, you can send it to your log management systems and set up alerts on the data.

In some cases, this approach doesn’t work: for whatever reason, you can’t turn an absence of data into a presence of data. What you really need is something that can detect when data doesn’t appear. This situation’s solution is commonly known as a dead man’s switch: the default is to do an action, unless something tells it otherwise.

Implementing this in shell is simple enough:

#!/bin/sh

# Time limit in seconds (here, 60 minutes)
TIME_LIMIT=$((60*60))

# State file the monitored job touches on success
STATE_FILE=deadman.dat

# Last modification time of the state file (in epoch seconds)
last_touch=$(stat -c %Y "$STATE_FILE")

# Current time (in epoch seconds)
current_time=$(date +%s)

# How long it has been since the job last checked in
elapsed=$((current_time - last_touch))

if [ "$elapsed" -gt "$TIME_LIMIT" ]; then
  echo "Dead man's switch activated: job failed!"
fi

Using this is equally simple: put the code into its own cron job running every minute, then modify the backup job from earlier like this:

run-backup.sh && touch deadman.dat

The dead man’s switch will now fire automatically if the state file is older than a certain age.

I should caution you that this is a naive implementation and could use much improvement, but the core idea is sound.

As a bonus, there are hosted services that do this without the need for you to engineer it all yourself. Search Google for cron job monitoring and you’ll find many options.

Logging

Logging can be thought of as three separate problems: collection of logs, storage of logs, and analysis of logs.

Collection

I like to group logs into two categories: those in syslog and those that aren’t.

If your logs are being handled by a syslog daemon already, simply configure the daemon to do log forwarding to another server. Consult your syslog daemon’s documentation for specifics on how to do it.
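
With rsyslog, for example, forwarding everything to a central collector is a one-line drop-in; the hostname here is a placeholder, @@ forwards over TCP, and a single @ forwards over UDP:

# /etc/rsyslog.d/50-forward.conf
*.* @@logs.example.com:514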

If your logs aren’t being handled by syslog, you have two choices:

  1. Update the configuration on whatever it is that’s emitting the logs, and have it send them to syslog.

  2. Update your syslog configuration to ingest the flat file from disk into syslog. At this point, the log entries are effectively being managed by your syslog daemon and will be forwarded like your other syslog entries.

If you’re using a tool that supports the collection of non-syslog log files, then you can use whatever that tool recommends as a third option. There’s nothing wrong with that approach at all, though there is something to be said for consistency in how you ingest and send logs.

Storage

Once you’re collecting logs, you’ve got to send them somewhere. In the old days, we would forward all these logs to a central log server that was just a simple syslog receiver and then use standard *NIX tools (e.g., grep) to search through logs. This is a suboptimal solution for (at least) one reason: it’s hard to search the logs. More often than not, if you’re using this method for log storage, no one is looking at them or making use of the logs in any way.

Thankfully, we have lots of great tools available to us now for the storage and analysis of logs. We can split these tools into two categories: SaaS and on-premise.

As you know, I’m a proponent of SaaS monitoring tools. There are also many well-known and capable on-premise tools. I’m not going to make any specific recommendations (you can, however, find many great options with a quick search online). The important part is this: don’t send your logs to some syslog server, never to be seen again. Send them to a solid log management system where you can actually get value from them.

Analysis

You’re collecting all the logs you want and sending them to a log management service somewhere. Great! Now what?

Now that you’ve got the plumbing, it’s time to do something useful: log analysis.

Analyzing logs isn’t a single problem, unfortunately. On one end of the spectrum, you’ve got shell scripts that grep for certain strings; and on the other end, you have tools like Splunk doing heavy statistical analysis on contents, and everything in between.

There are a great many interesting things you’ll find in your logs, most of which will depend entirely on your infrastructure. To get you started, I recommend logging and paying attention to these:

  • HTTP responses

  • sudo usage

  • SSH logins

  • cron job results

  • MySQL/PostgreSQL slow queries

Analyzing logs is largely a matter of which tool you use, whether it’s Splunk, the ELK stack, or some SaaS tool. I strongly encourage you to use a log aggregation tool for analyzing and working with your log data.

Wrap-Up

Whew, what a chapter, eh? We hit a whole lot of topics:

  • Why the standard OS metrics aren’t as useful for alerts as you might think and how to use them more effectively

  • How to monitor the typical services you’ll be using: web servers, database servers, load balancers, and others

  • What logging looks like from the server perspective

A server is only as reliable as the network on which it depends, so let’s dive into the world of weird SNMP and network monitoring.

1 For an extended look at load averages, Brendan Gregg wrote a great article on the topic.

2 For the full list of defined HTTP responses, see RFC 7231 Section 6.
