The caching layer of Nginx

If there is one universally known and acclaimed technique to speed things up, it is caching. Pragmatically speaking, caching is the process of not doing the same work many times. Ideally, each distinct computational unit should be executed only once. This, of course, never happens in the real world. Still, techniques that minimize repetition by rearranging work or reusing saved results are very popular; they form a huge discipline named "dynamic programming."

In the context of a web server, caching usually means saving a generated response in a file so that the next time the same request is received, it can be served by reading this file instead of computing the response again. Now, recall the steps outlined in the first section of this chapter. For many real-world websites, the actual computing of responses is not the bottleneck; transferring those responses to slow clients is. That's why the most efficient caching happens right in the browser, or, as developers prefer to say, on the client side.

Emitting caching headers

All browsers (and even many non-browser HTTP clients) support client-side caching. It is a part of the HTTP standards, albeit one of the most complex parts to understand. Web servers obviously do not control client-side caching to the full extent, but they may issue recommendations about what to cache and how, in the form of special HTTP response headers. This topic is thoroughly discussed in many great articles and guides, so we will cover it only briefly, with a focus on the problems you may face and how to troubleshoot them.

Although browsers have supported client-side caching for at least 20 years, configuring cache headers has always been a little confusing, mostly because there are two sets of headers designed for the same purpose but with different scopes and totally different formats.

There is the Expires: header, which was designed as a quick and dirty solution, and there is the relatively newer, almost omnipotent Cache-Control: header, which tries to support all the different ways an HTTP cache could work.

This is an example of a modern HTTP request-response pair containing the caching headers. These are the request headers sent from the browser (here Firefox 41, but it does not matter):

User-Agent:"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:41.0) Gecko/20100101 Firefox/41.0"
Accept:"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
Accept-Encoding:"gzip, deflate"
Connection:"keep-alive"
Cache-Control:"max-age=0"

Then, the response headers are:

Cache-Control:"max-age=1800"
Content-Encoding:"gzip"
Content-Type:"text/html; charset=UTF-8"
Date:"Sun, 10 Oct 2015 13:42:34 GMT"
Expires:"Sun, 10 Oct 2015 14:12:34 GMT"
Server:"nginx"
X-Cache:"EXPIRED"

The relevant parts here are the Cache-Control: headers in both directions and the Expires: header in the response. Note that some directives may be sent by both sides of the conversation. The browser sent the Cache-Control: max-age=0 header because the user pressed the F5 key; this indicates that the user wants a fresh response. Normally, the request will not contain this header, which allows any intermediate cache to respond with a cached copy that has not yet expired.

In this case, the server we talked to responded with a gzipped HTML page encoded in UTF-8 and indicated that the response is okay to use for half an hour. It used both mechanisms available, the modern Cache-Control:max-age=1800 header and the very old Expires:Sun, 10 Oct 2015 14:12:34 GMT header.

The X-Cache: "EXPIRED" header is not a standard HTTP header, but was also probably (there is no way to know for sure from the outside) emitted by Nginx. It may be an indication that there are, indeed, intermediate caching proxies between the client and the server, and one of them added this header for debugging purposes. The header may also show that the backend software uses some internal caching.

Another possible source of this header is a debugging technique used to find problems in the Nginx cache configuration. The idea is to use the cache hit-or-miss status, which is available in one of the handy internal Nginx variables, as the value of an extra header; you can then monitor that status from the client side. This is the code that will add such a header:

add_header X-Cache $upstream_cache_status;
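
For context, here is a minimal sketch of where such a line could live; the upstream address and the cache zone name ("backend" and "app_cache") are made-up placeholders, and the cache directives themselves are explained later in this chapter:

location / {
    proxy_pass http://backend;     # hypothetical upstream
    proxy_cache app_cache;         # hypothetical cache zone
    # $upstream_cache_status expands to values such as MISS, HIT, EXPIRED, or BYPASS.
    add_header X-Cache $upstream_cache_status;
}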

Nginx has a special directive that transparently sets up both standard cache control headers, and it is named expires. This is a piece of the nginx.conf file using the expires directive:

location ~* \.(?:css|js)$ {
    expires 1y;
    add_header Cache-Control "public";
}

The pattern uses so-called noncapturing parentheses, a feature that first appeared in Perl regular expressions. The effect of this regexp is the same as that of the simpler \.(css|js)$ pattern, but the regular expression engine is explicitly instructed not to create a variable containing the actual string matched inside the parentheses. This is a simple optimization.

Then, the expires directive declares that the contents of the css and js files will expire a year after being served. The actual headers, as received by the client, will look like this:

Server: nginx/1.9.8 (Ubuntu)
Date: Fri, 11 Mar 2016 22:01:04 GMT
Content-Type: text/css
Last-Modified: Thu, 10 Mar 2016 05:45:39 GMT
Expires: Sat, 11 Mar 2017 22:01:04 GMT
Cache-Control: max-age=31536000

The last two lines contain the same information in wildly different forms. The Expires: header is set to exactly one year after the date in the Date: header, whereas Cache-Control: specifies the age in seconds so that the client can do the date arithmetic itself.

The last directive in the provided configuration extract adds another Cache-Control: header with a value of public. This means that the content of the HTTP resource is not access-controlled and may therefore be cached not only for one particular user but anywhere else as well. A simple and effective strategy once used in offices to minimize consumed bandwidth was an office-wide caching proxy server. When one user requested a resource from a website on the Internet and that resource carried a Cache-Control: public designation, the company cache server would store it and serve it to other users on the office network.

This may not be as popular today due to cheap bandwidth, but because history has a tendency to repeat itself, you need to know how and why Cache-Control: public works.

The Nginx expires directive is surprisingly expressive. It may take a number of different values, which are described below; a short example sketch follows the list:

off

This value turns off the Nginx cache headers logic. Nothing will be added, and more importantly, the existing headers received from upstreams will not be modified.

epoch

This is an artificial value used to purge a stored resource from all caches by setting the Expires header to "1 January, 1970 00:00:01 GMT".

max

This is the opposite of the "epoch" value. The Expires header will be equal to "31 December 2037 23:59:59 GMT", and the Cache-Control max-age set to 10 years. This basically means that the HTTP responses are guaranteed to never change, so clients are free to never request the same thing twice and may use their own stored values.

Specific duration

An actual specific duration value means an expiry deadline from the time of the respective request. For example, expires 10w. A negative value for this directive will emit a special header Cache-Control: no-cache.

"modified" specific time

If you add the keyword "modified" before the time value, then the expiration moment will be computed relatively to the modification time of the file that is served.

"@" specific time

A time with an @ prefix specifies an absolute time-of-day expiry, which should be less than 24 hours away. For example, expires @17h;.
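
To illustrate, here is a hedged sketch that uses several of these forms side by side; the location paths are invented purely for the example:

location /never-changes/ {
    expires max;              # Expires far in the future and a 10-year max-age
}

location /hourly-reports/ {
    expires modified +1h;     # one hour after the served file's modification time
}

location /daily-digest/ {
    expires @17h;             # an absolute time of day: the next 17:00
}

location /private-area/ {
    expires -1;               # a negative value emits Cache-Control: no-cache
}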

Many web applications choose to emit the caching headers themselves, and this is a good thing. They have more information about which resources change often and which never change. Tampering with the headers that you receive from the upstream may or may not be a thing you want to do. Sometimes, adding headers to a response while proxying it may produce a conflicting set of headers and therefore create unpredictable behavior.

The static files that you serve with Nginx should have the expires directive in place. However, the general advice about upstreams is to always examine the caching headers you get and refrain from overoptimizing by setting up a more aggressive caching policy.

The corporate caching proxy configuration that we described earlier in this chapter, together with an erroneous public caching policy on nonpublic resources, may result in a situation where some users see pages that were generated for other users behind the same caching proxy. This can happen surprisingly easily. Imagine that your client is a book shop. Their web application serves both public pages with book details, cover images, and so on, and private resources such as recommendation pages and the shopping cart. The private pages will probably have the same URL for all users, and once they are, by mistake, declared public with an expiration date in the distant future, they may freely be cached by intermediate proxies. More intelligent proxies will automatically notice cookies and either add them to the cache key or refrain from caching. But less sophisticated proxies do exist, and there are a number of reports of them showing pages that belong to other people.

There are even techniques such as adding a random number to all URLs to defeat such caching configurations by making all URLs unique.

We would also like to describe a combination of unique URLs and long expiration dates, which is widely used today. Modern websites are very dynamic, both in the sense of what happens to the document after it is loaded and in how often the client-side code changes. It is not unusual to have not only daily but even hourly releases. This is a luxury of the web as an application delivery mechanism, and people seize the opportunity. How do you combine rapid releases with caching? The first idea was to encode the version into the URLs, and it works surprisingly well. After each release, all the URLs change; the old ones slowly expire in the cache stores at different levels, whereas the new ones are requested directly from the origin server.

One clever trick was developed upon this scheme, and it uses a hash of the content of the resource instead of the version number as a unique element of the URL. This reduces extra cache misses when a new release does not change all the files.

This trick is implemented on the application side. The Nginx administrator is only responsible for setting up a long expiration date by using, for example, the expires max directive.
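
On the Nginx side, such a scheme needs little more than a long expiration time for the fingerprinted files. Here is a hedged sketch; the /assets/ prefix and the on-disk root are assumptions about how a particular application lays out its files:

# Assumed layout: the application emits URLs such as /assets/app.3f9ab2c.css,
# where the hash part changes whenever the file content changes.
location ^~ /assets/ {
    root /var/www/site;
    expires max;
    add_header Cache-Control "public";
}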

The one obvious thing that limits the effect of the client-side caching is that many different users may issue the same or similar requests, and those will all reach the web server. The next step to never doing the same work many times is caching on the server.

Caching in Nginx upstream modules

The caching infrastructure is implemented as a part of the upstream interface, if you will excuse our use of object-oriented programming terminology. Each of those upstream modules has a group of very similar directives that allow you to configure local caching of responses from that particular upstream.

The basic scheme is very simple: once a request is determined to be material for an upstream, it is rerouted to the relevant module. If caching is configured for that upstream, the cache is first searched for an existing response to this request. Only when a cached response is not found is the actual proxying performed. After this, the newly generated response is saved into the cache while being sent to the client.

It is interesting that although caching on a reverse proxy had been known for a while, Nginx gained its fame as a magical accelerator without implementing it. The reason should be evident from the first section: the radical reduction in RAM consumption alone brought a lot of performance gains. Until the introduction of version 0.7.44, Nginx did not have any caching facilities built in. At that time, web administrators used either the famous Squid HTTP proxy for caching or the mod_accel module for Apache. By the way, the mod_accel module was created by Nginx's author, Igor Sysoev, and turned out to be the testbed for all the ideas about proper reverse proxying that were later implemented in Nginx.

Let us examine the caching directives of the most popular upstream module, ngx_proxy. As a reminder, this module hands the request over to another HTTP server; this is exactly how Nginx is run as a reverse proxy in front of Apache, for example. The full description is available in the great Nginx documentation at http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_cache. We won't repeat the documentation, but we will provide additional facts and ideas instead.

Here are the directives, with additional notes on each:

proxy_cache_path

This directive is clearly the main one of the whole caching family. It specifies the storage parameters of the cache store, starting with the path on the filesystem. You should definitely familiarize yourself with all the options. The most important are the inactive and max_size parameters, which control how the Nginx cache manager removes unused data from the cache store. One required parameter of this directive is keys_zone, which links the cache store to a named "zone"; zones are explained later in this section.

proxy_cache

This is the main switch directive. It is required if you want any caching. It has a single somewhat cryptic parameter named "zone," which will be explained in detail further on. The value "off" will switch the caching off. It may be needed in cases when there is a proxy_cache directive further up the scope stack.

proxy_cache_bypass

This directive allows you to easily specify conditions on which some responses will never be cached.

proxy_cache_key

This directive creates the key that is used to identify objects in the cache. By default, the URL is used, but people quite commonly add things to it. Different responses should never have equal keys; anything that may change the content of the page should be in the key. Besides obvious cookie values, you may want to add the client IP address if your pages depend on it (for example, if you use some form of geotargeting via the GeoIP database).

proxy_cache_lock

This is a binary on/off switch defaulting to off. If you turn it on, then simultaneous requests for the same ("same" here means "having the same cache key") resource will not be run in parallel. Only the first request will be executed while the rest are blocked waiting.

The proxy_cache_lock_* family of directives might be interesting when you have some very expensive responses to generate.

proxy_cache_lock_age

proxy_cache_lock_timeout

These two specify additional lock parameters. Refer to the documentation for details.

proxy_cache_methods

This is a list of HTTP methods that are cacheable. Besides the obvious "GET" and "HEAD" methods, you might want to sometimes cache less popular methods such as "OPTIONS" or "PROPFIND" from WebDAV. There might be cases when you want to cache responses even to "POST", "PUT," and "DELETE" although that would be a very serious bending of the rules and you should really know what you are doing.

proxy_cache_min_uses

This numeric parameter with a default value of "1" may be useful to optimize huge cache stores by not caching responses to rare requests. Remember that the effective cache is not the one that stores more but the one that stores useful things that get requested again and again.

proxy_cache_purge

This directive specifies the additional conditions on which objects are deleted from the cache store before expiration. It may be used as a way to forcefully invalidate a cache entry. A good cache key design should not require invalidation, but we all know how often good designs of anything happen in real life.

proxy_cache_revalidate

This is also a Boolean directive. HTTP conditional requests with headers "If-None-Match" or "If-Modified-Since" may update the validity of objects in the cache even if they do not return any new content to the requesting client. For this, specify "on".

proxy_cache_use_stale

This is an interesting directive that sometimes allows responding with an expired response from the cache. The main case for doing this is an upstream failure. Sometimes, responding with stale content is better than rejecting the request with the famous "Internal server error"; from the user's point of view, this is very often the case. See the sketch after this list for an example.

proxy_cache_valid

This is a very rough cache expiration specification. Usually, you should control the validity of the cached data via response headers. However, if you need something quick or something broad, this directive will help you.
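
Here is a minimal sketch combining a few of the directives above; the zone name "app_cache" and the upstream name "backend" are placeholders, and the zone itself is assumed to be declared with proxy_cache_path elsewhere:

location / {
    proxy_pass http://backend;
    proxy_cache app_cache;
    # Serve an expired copy when the upstream fails or while a fresh copy is being fetched.
    proxy_cache_use_stale error timeout updating http_500 http_502 http_503 http_504;
    # Collapse simultaneous misses for the same key into a single upstream request.
    proxy_cache_lock on;
    proxy_cache_lock_timeout 5s;
}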

One very important concept that is used in caching subsystems throughout all the upstream modules is that of the cache zone. A zone is a named memory region, which is accessible by its name from all Nginx processes. Readers familiar with the concept of System V shared memory or IPC via mmap-ed regions will instantly see the similarity. Zones were chosen as an abstraction for the cache state storage, which should be shared between all the worker processes. You may configure many caches inside your Nginx instance, but you will always specify a zone for each cache. You may link different caches to the same zone, and the information about the cached objects will be shared. Zones also act as objects encapsulating the actual cache storage configuration, such as where on the filesystem the cached objects will persist, how the storage hierarchy will be organized, when to purge the expired objects, and how to load the objects from disk into memory on restart.

To summarize, an administrator first sets up at least one zone with all the relevant storage parameters with the directive *_cache_path and then plugs subtrees of the whole URL space into those zones with the directive *_cache.

Zones are set up globally, usually in the http scope while individual caches are linked to zones with the simple *_cache directive in the relevant contexts, for example, locations down the path tree or the whole server blocks.

We should remind you that the described family of caching directives exists for all the upstream modules of Nginx. Substitute the moniker of another upstream module for proxy_, and you end up with a whole other family of directives that do exactly the same thing, maybe with some slight variations for responses generated by upstreams of that type. For example, see the information on how to cache FastCGI responses at http://nginx.org/en/docs/http/ngx_http_fastcgi_module.html#fastcgi_cache.
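
For instance, a hedged sketch of FastCGI response caching for a PHP-FPM backend might look like the following; the socket path, the zone name fcgi1, and the filesystem paths are assumptions, and note that fastcgi_cache_path belongs in the http scope while the location belongs in a server block:

fastcgi_cache_path /var/db/cache/fastcgi levels=1:2 keys_zone=fcgi1:10m max_size=500m inactive=60m;

location ~ \.php$ {
    include fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
    fastcgi_pass unix:/run/php/php-fpm.sock;   # assumed PHP-FPM socket
    fastcgi_cache fcgi1;
    fastcgi_cache_key $scheme$request_method$host$request_uri;
    fastcgi_cache_valid 200 302 10m;
    fastcgi_cache_valid 404 1m;
}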

Let us provide some real-world caching configuration examples that will help you grasp the idea better:

http {
    proxy_cache_path /var/db/cache/nginx levels=1:2 keys_zone=cache1:1m max_size=1000m inactive=600m;
    proxy_temp_path /var/db/cache/tmp;

    server {
        listen 80;
        server_name example.com;

        location / {
            proxy_pass http://localhost:8080/;
            proxy_cache cache1;
            proxy_cache_valid 200 302 24h;
            proxy_cache_valid 404 5m;
        }
    }
}

This is a canonically simple cache configuration with one zone named cache1 and one cache configured under location / in one server. Several important details are worth mentioning. It is highly recommended that the temporary files directory configured with the proxy_temp_path directive be on the same filesystem as the main cache storage; otherwise, Nginx will not be able to quickly move files between the temporary and permanent storage and will instead perform an expensive file copy operation.

The keys_zone size specifies the amount of memory dedicated to the zone. This memory is used to store the keys and meta-information about the objects in the cache, not the actual cached responses (objects). The limit on the object storage is specified with the max_size parameter. Nginx spawns a separate process named the cache manager, which constantly scans all the cache zones and removes the least used objects when max_size is exceeded.

The proxy_cache_valid directive combination specifies a much shorter validity period for the negative 404 results. The idea behind it is that a 404 might actually get fixed; at least some of them appear because of misconfiguration. It makes sense to retry such requests more frequently. You should also consider the load on the upstream when deciding on validity periods. Many computationally heavy search algorithms require much more resources to give a negative answer: making sure that an entity is absent may require checking everywhere, instead of stopping after the first found instance. This is a very simplified description of a search algorithm, but it is short enough to remember: always check the request processing time in the logs for negative responses, and their relative share, before shortening the cache validity interval.

Two important parameters of the cache are left out of the above configuration, which means that you will run with the default values. The proxy_cache_methods directive defaults to caching only GET and HEAD requests, which may not be optimal for your web application. And proxy_cache_key defaults to $scheme$proxy_host$request_uri, which may be dangerous if your web application serves similar requests differently for different users. Read about these directives and either add uniqueness to the key or fall back to uncached behavior via proxy_cache_bypass.
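
For example, here is a hedged sketch of both approaches, assuming the application tracks logged-in users with a cookie named sessionid (the cookie name is an assumption for illustration):

# Option 1: make cache entries user-specific by adding the session cookie to the key.
proxy_cache_key "$scheme$proxy_host$request_uri$cookie_sessionid";

# Option 2: skip the cache entirely whenever the session cookie is present.
proxy_cache_bypass $cookie_sessionid;   # do not answer from the cache
proxy_no_cache     $cookie_sessionid;   # do not store the response in the cache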

Another example that we would like to present is much more complex. Let us devote a separate section to it.

Caching static files

When scaling a website horizontally, you will inevitably find yourself in the situation of having many identical Nginx-powered servers behind a low-level balancer. All of them will proxy requests to the same upstream server farm, and there will be no problems with synchronizing the active, dynamic content served by your website. But if you follow the advice about having all the static content present locally, so that Nginx can serve it in the most native and efficient way possible, you will end up with the task of keeping many identical copies of the same files everywhere.

A variation on the same task is a setup where a farm of Nginx instances is used to serve a huge library of static files, for example, video or music. Having a copy of that library on each Nginx node is out of the question because it is too big.

As usual, there are many possible solutions for these two cases. One choice is having a secondary smaller farm of Nginx servers serving the files to the main farm, which will employ caching inside the ngx_proxy upstream.
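
A hedged sketch of a front node in such a setup could look like this; the internal hostname static-origin.internal, the zone name static_cache, and all the sizes are invented for the example:

proxy_cache_path /var/db/cache/static levels=1:2 keys_zone=static_cache:50m max_size=20g inactive=7d;

server {
    listen 80;

    location / {
        # The smaller secondary farm that actually holds the files.
        proxy_pass http://static-origin.internal;
        proxy_cache static_cache;
        proxy_cache_valid 200 7d;
        proxy_cache_use_stale error timeout;
    }
}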

Another interesting solution uses a network filesystem mounted on the nodes. The traditional Unix NFS has a bad reputation, but in reality, on current Linux kernels, it is stable enough to be used in production. Two of the alternatives are AFS and SMBFS. The files under the mount point will look local to Nginx, but they will still be downloaded over the network, which is much slower than reading from a good local SSD. Luckily, modern Linux kernels have the ability to locally cache files from NFS and AFS. The facility is named FS-Cache and uses a separate userland daemon, cachefilesd, to store local copies of files from a network filesystem. You may read about FS-Cache at https://people.redhat.com/dhowells/fscache/FS-Cache.pdf.

FS-Cache configuration is rather straightforward, and we will not focus on it. There is another way to do it, which follows the philosophy of Nginx much more closely. SlowFS is a third-party, upstream-like module for Nginx, which provides a simple interface to a filesystem subtree. The interface includes caching capabilities, which are standard to all other Nginx upstreams.

SlowFS is open source under a very permissive license and is available either from the author's website or directly from GitHub as a repository. Refer to http://labs.frickle.com/nginx_ngx_slowfs_cache.

Here is an example SlowFS configuration:

http {
    slowfs_cache_path /var/db/cache/nginx levels=1:2 keys_zone=cache2:20m;
    slowfs_temp_path /var/db/cache/tmp 1 2;

    server {
        listen 80;

        location / {
            root /var/www/nfs;
            slowfs_cache cache2;
            slowfs_cache_key $uri;
            slowfs_cache_valid 5d;
        }
    }
}

This configuration installs a transparent caching layer over the files available locally in /var/www/nfs. It does not matter how these files are actually stored; they will still be cached according to the parameters specified with the slowfs_* family of directives. But obviously, you will only notice any speed-up if /var/db/cache is much faster than /var/www/nfs.
