Creating infrastructure around logs

Okay, let's do some arithmetic. Suppose that you have a rather popular, though not (yet) world-scale, website with about 50,000 visits per day. This is the number that managers brag about during their meetups; they get it from some analytics software, and it means almost nothing for your job, because what exactly is a visit? Let's say that you have an e-commerce site selling nonseasonal stuff, for example, power tools. Your average visitor will look at one or two pages, with spikes into the low tens when actually choosing and buying something. Let it be three pages per visit on average. And what is a page? For you, it is a series of HTTP responses: the main document and all the embedded objects. People notoriously underestimate the sheer size of modern web pages. It is a safe bet that your pages include on average 100 objects (HTML documents, images, scripts, style sheets, and so on) and amount to over a megabyte in size.

This works out to 100 x 3 x 50,000 = 15,000,000 requests per day, or 15,000,000 / 24 / 3600 ≈ 174 requests per second (RPS) on average. An average taken over the whole day is a rather useless number on its own unless you operate in all of the world's time zones, which is not very common for websites selling actual material goods. A good enough heuristic for estimating peaks is to multiply the average by 10.
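If you want to redo this back-of-the-envelope calculation with your own numbers, it fits in two shell commands (the figures here are just the assumptions from the paragraph above):

% echo '100 * 3 * 50000' | bc
15000000
% echo '15000000 / (24 * 3600)' | bc -l
173.61111111111111111111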

Now we have the number of lines in your daily access log (15 million) and a very rough upper limit on the logging rate that you will have to deal with (on the order of 1,700 lines a second at peak). These numbers mean that you need tools, because a human being is not able to consume all this information in time.

Configuring log rotation

The main and simplest tool for making the amount of logging data more manageable is log rotation. You probably have it set up already. There is a pretty standard log rotator, uninventively named logrotate, included in many Linux distributions. Its FreeBSD counterpart is newsyslog.

Examples of Nginx log rotation configuration in logrotate and newsyslog are shown here.

This is roughly what a logrotate configuration for Nginx looks like on a Linux box:

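What follows is a minimal sketch rather than a copy of a real server's file; the log path, retention, and PID file location are typical Linux defaults and nothing more:

/var/log/nginx/*.log {
    daily
    rotate 30
    missingok
    notifempty
    compress
    delaycompress
    sharedscripts
    postrotate
        # ask the Nginx master process to reopen its log files
        [ -f /var/run/nginx.pid ] && kill -USR1 "$(cat /var/run/nginx.pid)"
    endscript
}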

And this is roughly what a newsyslog configuration looks like on a reasonably modern FreeBSD server:

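Again, a minimal sketch with assumed paths, ownership, and schedule; the J and C flags request bzip2 compression and creation of a missing log file, and signal 30 is SIGUSR1 on FreeBSD:

# logfilename               [owner:group]  mode  count  size  when   flags  [/pid_file]           [sig_num]
/var/log/nginx/access.log   www:www        644   30     *     @T00   JC     /var/run/nginx.pid    30
/var/log/nginx/error.log    www:www        644   30     *     @T00   JC     /var/run/nginx.pid    30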

What they do is take care of huge logs by archiving older records based on the age and size of the current files. It is not exactly rocket science, but there are several pitfalls that catch people out with remarkable regularity.

First, do have free space monitoring. And also have monitoring of your free space monitoring. Running out of disk space is a surprisingly popular cause of major malfunctions. Just as the publisher warned us that a hard drive would fail while we were writing this book, because they always do, we will take the liberty of warning you that at least once in your career, disks will fill up completely with logs. This usually leads to some very unpleasant effects, but it is easily fixable.

What are the preventive measures? Set up a log store. It should be a couple of separate machines with huge, cheap (rotating) mirrored disks that store your logs indefinitely. Their goal is to relieve your workhorses, the actual web servers, from storing log archives and from running heavy greps that mess with performance. Your rotation procedure should then include copying each archive to the log store after it is created; a sketch of such a step is shown below. Your processes will get a little more complex, because the most current logs will still be spread out across your web servers while older data is already archived away on the log store, but it is totally worth it.
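The copying step itself can be as simple as one extra command at the end of rotation; the host name logstore and the destination path here are assumptions, not a recommendation of any particular layout:

% rsync -a --ignore-existing /var/log/nginx/*.gz logstore:/srv/logs/$(hostname)/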

Also, move to a better compression algorithm than the default gzip. For log files in particular, you may save up to 50% of space just by switching away from gzip. logrotate supports specifying the command it will use for compression, while newsyslog has native support for both bzip2 and xz compression. xz is usually better. The only downside of xz is its high memory requirements; keep this in mind. A separate log store, again, is very useful here. It may be configured to recompress gzipped files into xz, thus saving space without sacrificing performance on the web servers. The idea is to gzip the logs on the web servers, move them to the log store cluster, decompress them, and compress them again with xz; both approaches are sketched below.
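Both ideas are easy to sketch. In logrotate, the compression command can be overridden; on the log store, the recompression is just two commands. The paths and the -9 level are illustrative:

# in the logrotate configuration: use xz instead of the default gzip
compresscmd /usr/bin/xz
uncompresscmd /usr/bin/unxz
compressext .xz
compressoptions -9

# on the log store: recompress an archive that arrived as gzip
% gunzip access.log.0.gz && xz -9 access.log.0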

The second important part of log rotation is not losing a single record during the actual rotation. The optimal algorithm looks like this:

  1. First of all, imagine that Nginx is up and running and writing log records to some access.log files.
  2. A log rotator is fired, and the access.log is chosen for rotation because of either its size or its age.
  3. The log rotator renames the access.log according to the rotation scheme, for example, to access.log.0.
  4. The log rotator creates a new empty access.log.
  5. Now, Nginx does not lose access to the older file because it holds its descriptor, and the filename does not matter once the file has been opened by a process. So, Nginx continues to write records to the access.log.0 file.
  6. The log rotator cannot compress the old file because it is still being written to, so it signals Nginx to release the old file descriptor and reopen the log file by its name again.
  7. Nginx is happy to oblige. The new empty access.log gets opened and starts to receive new log records, whereas the old file is ready to be removed after compression.
  8. The log rotator runs the compressor, which creates a new file, access.log.0.xz, and deletes the uncompressed original.

It looks surprisingly complex for a seemingly simple procedure. The reason is steps 5, 6, and 7, which guarantee that no log file is compressed or deleted while Nginx is still writing to it.

There is nothing Nginx-specific in this algorithm; it just so happens that the Nginx authors thought about the problem and implemented a special reopen command, which is triggered by sending the USR1 signal to the master process.
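Performed by hand, the whole dance boils down to a few commands; the paths are assumptions, and the sleep is just a lazy way to let the worker processes switch descriptors before compression starts:

% mv /var/log/nginx/access.log /var/log/nginx/access.log.0
% kill -USR1 "$(cat /var/run/nginx.pid)"    # ask the master process to reopen its log files
% sleep 1                                   # let the workers switch to the new descriptors
% xz -9 /var/log/nginx/access.log.0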

If your log rotator omits this signal altogether, the rotation will not work at all: Nginx will keep writing to the old log without noticing that you renamed it, and trying to compress a file that is still being appended to is a recipe for losing lines.

If your log rotator restarts Nginx on each rotation, your logs will be okay, but you may lose some performance if you do graceful restarts (with the SIGHUP signal). You may even lose some requests if you do hard restarts (old apachectl restart style restarts are not supported by the Nginx executable itself, but could be implemented with the init scripts of your OS).

Working with a lot of log data

Once your Nginx installation starts to receive more than several thousand users a day, you or your managers will definitely want more insights from those logs. Your job will be to provide the infrastructure for that and to troubleshoot problems. You can also piggyback on that endeavor and end up with real-time search across all your logs that is far more efficient than good old grep.

The evolution of log analytics through the years is an interesting and huge topic mostly outside the scope of this book. Many of us remember the (in)famous Webalizer and AWStats packages. They are still perfectly functional, by the way, even if a bit rusty. It is not recommended to invest in these tools for modern websites though. They are not very efficient, and you will have a hard time adding the features that are expected these days.

Some of the newer solutions that are available on the market are summarized below. By all means do your own research. This is really a giant topic in itself:

  • The logstash/ElasticSearch/kibana stack is a combination of tools (mostly JVM-based; kibana itself is a JavaScript frontend), each of which deserves a whole book devoted to it. A working deployment allows you to store all your logs in a database indexed for all the needed types of queries and reports. The kibana part of the stack provides gorgeous visualizations of time-based data; logs are exactly that and fit perfectly. Maintaining an instance may quickly become a full-time job.
  • Scribe is a central logging solution developed, open sourced, and then abandoned by Facebook; it is of historical interest only. Facebook has moved on from Scribe, and if you still run a Scribe installation or have inherited one, you are in trouble. One of the easier alternatives is fluentd.
  • Fluentd is a modern centralized logging system written in Ruby. It may be compared to the logstash part of the first stack. It has pluggable inputs and outputs. Once you have it configured to consume Nginx logs, it may feed the results to an ElasticSearch instance.
  • Apache Flume is an older project from the Apache Hadoop family of technologies. It is used to collect data into HDFS (the storage layer of Hadoop) and is sometimes used for web logs too.
  • Splunk is a commercial full-stack solution for collecting, parsing, storing, and querying logs. It calls itself "Google for your logs", and we will not comment on that. Splunk is also interesting because it is widely used for real-time monitoring of incoming logs; a good example of such a task is intrusion detection.

Reading logs

In what is probably the most interesting part ahead for many readers, we will show you examples of different records from real Nginx log files and analyze, in each case, what happened and how to fix it. These will be rather simple situations, many of which will be either familiar to a seasoned web system administrator or evident from the message itself.

We still recommend following each and every example. People sometimes develop a kind of selective blindness to things they do not fully understand. It is also very natural to skip unknown parts and try to deduce their meaning from the surrounding context; this is how language learning works for both children and adults. Alas, human languages are highly redundant and therefore well suited to imperfect, lossy understanding. Logs usually are not.

Let's start with a very simple and very famous 404 error and how it looks from two perspectives: the error log and the access log.

The record from the error log:

2016/01/29 02:25:14 [error] 18876#0: *1 "/home/kappa/books/index.html" is not found (2: No such file or directory), client: 127.0.0.1, server: localhost, request: "GET /books/ HTTP/1.1", host: "kantara"

And now the record about the same event from the access log in combined format:

127.0.0.1 - - [29/Jan/2016:02:25:14 +0300] "GET /books/ HTTP/1.1" 404 151 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:35.0) Gecko/20100101 Firefox/35.0"

We will break them both down now.

Error log record

"2016/01/29 02:25:14"

This is obviously a timestamp, but note that it does not contain any time zone information. It is local time as seen by the server in its configured timezone. This minor fact means that when you transfer this log file into another timezone and do not save the timezone information somewhere, your software may become confused and apply the new local timezone, after which comparing this timestamp with timestamps from the access_log will give wrong results.

"[error]"

This is the severity level of the message. Remember that earlier in this chapter we discussed the format of the error_log directive and there was this second parameter, the threshold. Well, this is the field that gets compared to the configured threshold to determine whether a particular message is serious enough to bother this particular system administrator with. Other possible values include things from debug to emerg (short for emergency). See the error_log directive documentation at http://nginx.org/en/docs/ngx_core_module.html#error_log.
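As a quick refresher, a hypothetical directive with the threshold set to warn might look like this (the path is an assumption); everything from warn up to emerg will then be written, while notice, info, and debug messages will be dropped:

error_log /var/log/nginx/error.log warn;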

"18876#0:"

Now, this part is not understood by many people. The pair of numbers tells you which member of the Nginx ensemble of processes put this record into the log. The number before the # is the PID, the identifier of the process, and the second number is the thread identifier, or TID. The TID is usually 0 on current Nginx on Linux; on Windows, it may be some big number. Nginx does not use multithreading in its current version. There are rumors that threads will be much more prominent on all platforms in Nginx 2.0.

"*1"

This is the identifier of the connection in whose context the error happened. It is actually an integer counter, and it allows you to group errors by connection. By the way, the connection number, as well as the TID part of the previous item, is not widely known even among Nginx users. Take some of your colleagues by surprise and ask them about it sometime, just for fun.

"/home/kappa/books/index.html" is not found (2: No such file or directory)

This is the actual error message formulated by Nginx, accompanied by the OS-level errno number (2, that is, ENOENT in this case) and the corresponding strerror() message in parentheses.

"client: 127.0.0.1, server: localhost"

These are the addresses of both sides of the connection. We have Nginx running right here on the workstation, which is why we see the connection over the loopback interface. Nginx deliberately does not do reverse DNS resolution of client addresses for performance reasons, whereas the server name is known beforehand. This is why we see the same address in both IP and domain name form.

request: "GET /books/ HTTP/1.1", host: "kantara"

Now, these are the data about the actual request. First, the string representation of the request itself and then the host value taken from the Host: HTTP request header sent by the browser.

It is interesting that, apart from the very first items in the record, everything is more or less free-form and optional. The timestamp is obviously always there, as are the PID and the TID (even if the latter is a constant 0), but a connection is not always present, and of course there cannot be a current request without a connection.

Error logs are notoriously not very machine-readable. You should never rely on the existence of a certain piece of data in a record unless you have made sure that the whole record is written via a known and fixed template. For example, it is fairly easy to grep out all the ENOENT messages, but creating a summary of all types of errors will be harder.
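A couple of hedged one-liners illustrate the difference; the file name is an assumption, and note that the second command only groups records by severity level, which is always present, not by the actual error type:

% grep -c 'No such file or directory' error.log
% awk -F'[][]' '{print $2}' error.log | sort | uniq -c | sort -rn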

The access log, by contrast, is made for parsing. Let's look at the record again:

127.0.0.1 - - [29/Jan/2016:02:25:14 +0300] "GET /books/ HTTP/1.1" 404 151 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:35.0) Gecko/20100101 Firefox/35.0"

We already analyzed a combined record earlier in this chapter, so we won't do this again. Just look at two interesting parts.

We mentioned the weird historical date/timestamp format, but at least it contains the timezone and is totally unambiguous. We also see the famous 404 code in the sixth field, and that is the only sign of an error here! Otherwise, it is a perfectly good HTTP request that was served with a perfectly formed HTTP response of 151 bytes.
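Because the format is fixed, such records are easy to slice with standard tools. As a sketch (the file name is an assumption; when a combined-format line is split on whitespace, the status code lands in the ninth field and the URI in the seventh):

% awk '$9 == 404 {print $7}' access.log | sort | uniq -c | sort -rn
% awk '{print $9}' access.log | sort | uniq -c | sort -rn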

There will be two very popular 404 errors in your logs when you start a new website:

2016/02/09 19:09:11 [error] 39110#0: *1019042 open() "/site/example.com/www/robots.txt" failed (2: No such file or directory), client: 157.55.39.200, server: example.com, request: "GET /robots.txt HTTP/1.1", host: "example.com"
2010/10/17 22:44:05 [error] 44858#0: *809 open() "/site/example.com/favicon.ico" failed (2: No such file or directory), client: 95.26.237.86, server: example.com, request: "GET /favicon.ico HTTP/1.1", host: "example.com"

These are the so-called well-known files that HTTP clients request and use. You should provide at least a robots.txt and a favicon for your sites, if only for the sake of your own sanity. Refer to http://www.robotstxt.org/ and https://en.wikipedia.org/wiki/Favicon for more information on these files.

It is time to see some more errors:

2016/02/09 13:19:00 [error] 39110#0: *1014628 kevent() reported that connect() failed (61: Connection refused) while connecting to upstream, client: 204.8.105.53, server: example.com, request: "GET /admin.php HTTP/1.0", upstream: "http://127.0.0.1:3000/admin.php", host: "example.com"

You should be able to read this almost effortlessly by now. This is an example of Nginx acting as a proxy, which is certainly its most popular use. Being a proxy, this Nginx instance is trying to connect to an upstream on behalf of a client. The upstream that has problems is listed near the end, before the familiar host item. The kevent() mentioned here is an implementation detail that arguably should not have leaked into the message, but it has; it is part of the mechanism Nginx uses to work with network connections on FreeBSD, Mac OS X, and other BSD operating systems.

On a Linux box, the same record would look like this:

2014/07/29 10:18:41 [error] 14243#0: *100053182 connect() failed (111: Connection refused) while connecting to upstream, client: 37.73.249.120, server: example.com, request: "GET /example.com/404 HTTP/1.1", upstream: "http://[2c32:6d8:0:172a::318]:8080/", host: "example.com"

What is interesting in this record? First, there is no kevent(). Second, the errno code has changed! And indeed, our FreeBSD and Linux boxes have 61 and 111 for ECONNREFUSED, respectively. So no, you cannot rely on this numeric code, and even less on the Connection refused string. On Windows, the same error may come with this message: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.

And third, the upstream address is an IPv6 one, which may break some scripts if they naively search for the TCP port number after the first colon.

We want to show you another special kind of file-not-found error that is a sign of modern times:

2016/01/24 18:28:09 [error] 39111#0: *755667 open() "/site/example.com/www/wp-login.php" failed (2: No such file or directory), client: 109.198.238.60, server: example.com, request: "GET /wp-login.php HTTP/1.1", host: "www.example.com"
2016/01/24 18:27:01 [error] 39111#0: *755651 open() "/site/example.com/www/administrator/index.php" failed (2: No such file or directory), client: 82.199.126.95, server: example.com, request: "GET /administrator/index.php HTTP/1.1", host: "example.com"

These are interesting only because they come from bots that attempt to hack into your system. They are very persistent in trying URLs that look like administration or login scripts and that have never existed on your site.

It is so cheap for them that they just try every host they see on the Internet, without even keeping a database of unsuccessful attempts. They will come from thousands of different IP addresses, many of which will look totally innocent, because those are infected computers all over the world controlled centrally. They have become the norm already; most of the time, you should not even bother with countermeasures (unless, of course, you run some old installation of WordPress, in which case you are probably hacked already and are earning money for these people by serving porn ads to your users alongside your own content).

Here is an error that contains much less information:

2014/07/29 00:00:18 [info] 14238#0: *95951600 client 77.205.98.18 closed keepalive connection

Can you guess why? Because, as we said a little earlier, errors happen all the time, even when there is no request being processed. This is exactly the case here: a client closed a connection that was left open after a successful request/response pair as a way to optimize the following requests. This mechanism is named KeepAlive. Nginx is happy to serve many requests over one connection consecutively, but the client is free to close the connection at any time. Now you should understand why this record has the [info] level instead of [error]. Whether you should do anything about it is left as an exercise.

2014/07/29 00:02:11 [info] 14241#0: *95959742 client timed out (110: Connection timed out) while waiting for request, client: 62.90.94.31, server: 0.0.0.0:443

This is a similar message that carries no information about a request, because the error is precisely that Nginx was not able to get a request before the timeout expired.

2014/07/29 00:00:18 [info] 14238#0: *95951764 SSL_read() failed (SSL: error:14094412:SSL routines:SSL3_READ_BYTES:sslv3 alert bad certificate:SSL alert number 42) while waiting for request, client: 176.115.120.138, server: 0.0.0.0:443

This is quite an enigmatic error message that you probably won't be able to do anything about. The SSL code is complex, and there are a lot of weird SSL implementations out there. Something went wrong; you should take note and either try to reproduce it or wait for more of the same.

2014/07/29 00:02:24 [info] 14240#0: *95968051 client sent too long URI while reading client request line, client: 87.244.170.11, server: example.com, request: "GET /log_error?login=takoe&error=<some very-very long string>

We trimmed this one by hand because it took up almost the whole screen. There is a limit on the total size of the request headers, and it may be changed with the large_client_header_buffers directive. See http://nginx.org/en/docs/http/ngx_http_core_module.html#large_client_header_buffers for more information. This is definitely something you can fix by increasing the configured value, but we would recommend against it; speak to your application developers instead. It looks like they have chosen the wrong tool for their task here: requests of such size should use the POST method instead of GET.
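If you do decide to raise the limit, the directive takes a number of buffers and a buffer size; the values below are purely illustrative, not a recommendation:

http {
    large_client_header_buffers 4 16k;
}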

There is another error we wanted to show here as an example of what problems really big websites face sometimes:

2013/05/16 12:21:11 [crit] 21947#0: *31843937 open() "/usr/local/nginx/html/50x.html" failed (24: Too many open files), client: 88.2.3.44, server: example.com, request: "GET /file/images/background.jpg HTTP/1.1", upstream: "http://10.10.4.1:81//file/images/background.jpg", host: "example.com"

You should be able to read and understand every single character of that message now. What exactly is 24: Too many open files? There is a limit on the number of files that any single process can hold open. Usually, the limit is really big. Run this command to see the limit your shell has:

% ulimit -Sn

Once your Nginx serves more files than that simultaneously, this error will appear in the error log. Nginx has a way of raising the limit for itself via the worker_rlimit_nofile directive; see http://nginx.org/en/docs/ngx_core_module.html#worker_rlimit_nofile. Increasing the hard limits for all processes is OS-dependent. On Linux, you will need to add something like fs.file-max = 50000 to your /etc/sysctl.conf and then run the following command:

% sysctl -p
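On the Nginx side, the directive mentioned above lives in the main (top-level) context of nginx.conf; the value below is illustrative and should stay within whatever limits your OS actually allows:

# main context of nginx.conf
worker_rlimit_nofile 65535;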
