Caching, one of the key techniques used to scale web applications, is a critical factor for increasing both performance and scalability at relatively low cost. Caching is fairly simple and can usually be added to an existing application without expensive rearchitecture. In many ways, caching is winning the battle without the fight. It allows you to serve requests without computing responses, enabling you to scale much more easily. Its ease makes it a very popular technique, as evidenced by its use in numerous technologies, including central processing unit (CPU) memory caches, hard drive caches, Linux operating system file caches, database query caches, Domain Name System (DNS) client caches, Hypertext Transfer Protocol (HTTP) browser caches, HTTP proxies and reverse proxies, and different types of application object caches. In each case, caching is introduced to reduce the time and resources needed to generate a result. Instead of fetching data from its source or generating a response each time it is requested, caching builds the result once and stores it in cache. Subsequent requests are satisfied by returning the cached result until it expires or is explicitly deleted. Since all cached objects can be rebuilt from the source, they can be purged or lost at any point in time without any consequences. If a cached object is not found, it is simply rebuilt.
Cache hit ratio is the single most important metric when it comes to caching. At its core, cache effectiveness depends on how many times you can reuse the same cached response, which is measured as cache hit ratio. If you can serve the same cached result to satisfy ten requests on average, your cache hit ratio is 90 percent, because you need to generate each object once instead of ten times. Three main factors affect your cache hit ratio: the size of the cache key space, the number of objects the cache can hold, and the longevity of cached objects. Let’s take a closer look at each one.
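To make this arithmetic concrete, here is a minimal sketch; the function name and counters are illustrative, not part of any particular cache implementation:

```python
# Cache hit ratio = requests served from cache / all requests.
def cache_hit_ratio(hits, misses):
    total = hits + misses
    return hits / total if total else 0.0

# One object reused to satisfy ten requests on average:
# one miss (the object is built and stored) and nine hits.
print(cache_hit_ratio(hits=9, misses=1))  # 0.9, i.e., 90 percent
```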
The first force acting on cache hit ratio is the size of your cache key space. Each object in the cache is identified by its cache key, and the only way to locate an object is by performing an exact match on the cache key. For example, if you wanted to cache online store product information for each item, you could use a product ID as the cache key. In other words, the cache key space is the number of all possible cache keys that your application could generate. Statistically, the more unique cache keys your application generates, the less chance you have to reuse any one of them. For example, if you wanted to cache weather forecast data based on a client’s Internet Protocol (IP) address, you would have up to roughly 4 billion possible cache keys (the number of all possible IPv4 addresses). If you decided to cache the same weather forecast data based on the country of origin of that client, you would end up with just a few hundred possible cache keys (the number of countries in the world). Always consider ways to reduce the number of possible cache keys. The fewer possible cache keys, the better your cache efficiency.
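The IP-versus-country trade-off can be sketched as follows; the lookup table and `geolocate_country` helper are hypothetical stand-ins for a real GeoIP library:

```python
# Toy lookup table standing in for a real GeoIP database (hypothetical data).
COUNTRY_BY_IP_PREFIX = {"203.0.": "AU", "198.51.": "US"}

def geolocate_country(ip):
    # Hypothetical stand-in for a real GeoIP lookup.
    for prefix, country in COUNTRY_BY_IP_PREFIX.items():
        if ip.startswith(prefix):
            return country
    return "ZZ"  # unknown

def key_by_ip(ip):
    return "forecast:" + ip                      # up to ~4 billion possible keys

def key_by_country(ip):
    return "forecast:" + geolocate_country(ip)   # a few hundred possible keys

# Two clients in the same country share a cache entry only in the second scheme.
print(key_by_ip("203.0.113.1") == key_by_ip("203.0.113.2"))            # False
print(key_by_country("203.0.113.1") == key_by_country("203.0.113.2"))  # True
```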
The second factor affecting cache hit ratio is the number of items that you can store in your cache before running out of space. This depends directly on the average size of your objects and the size of your cache. Because caches are usually stored in memory, the space available for cached objects is strictly limited and relatively expensive. If you try to cache more objects than can fit in your cache, you will need to remove older objects before you can add new ones. Replacing (evicting) objects reduces your cache hit ratio, because objects are removed even when they might be able to satisfy future requests. The more objects you can physically fit into your cache, the better your cache hit ratio.
The third factor affecting cache hit ratio is how long, on average, each object can be stored in cache before expiring or being invalidated. In some scenarios, you can cache objects for a predefined amount of time called Time to Live (TTL). For example, caching weather forecast data for 15 minutes should not be a problem. In such a case, you would cache objects with a predefined TTL of 15 minutes. In other use cases, however, you may not be able to risk serving stale data. For example, in an e-commerce system, shop administrators can change product prices at any time and these prices may need to be accurately displayed throughout the site. In such a case, you would need to invalidate cached objects each time the product price changes. Simply put, the longer you can cache your objects for, the higher the chance of reusing each cached object.
Understanding these three basic forces is the key to applying caching efficiently and identifying use cases where caching might be a good idea. Use cases with a high ratio of reads to writes are usually good candidates for caching, as cached objects can be created once and stored for longer periods of time before expiring or becoming invalidated. Use cases where data updates very often, on the other hand, may render a cache useless, as objects in the cache may become invalid before they are ever reused.
In the following sections, I will discuss two main types of caches relevant to web applications: HTTP-based caches and custom object caches. I will then introduce some technologies applicable in each area and some general rules to help you leverage cache more effectively.
The HTTP protocol has been around for a long time, and throughout the years a few extensions have been added to the HTTP specification, allowing different parts of the web infrastructure to cache HTTP responses. This makes understanding HTTP caching a bit more difficult, as you can find many different HTTP headers related to caching, and even some Hypertext Markup Language (HTML) metatags. I will describe the key HTTP caching headers later in this section, but before we dive into these details, it is important to note that all of the caching technologies working in the HTTP layer work as read-through caches.
Read-through cache is a caching component that can return cached resources or fetch the data for the client, if the request cannot be satisfied from cache (for example, when cache does not contain the object being requested). That means that the client connects to the read-through cache rather than to the origin server that generates the actual response.
Figure 6-1 shows the interactions between the client, the read-through cache, and the origin server. The cache is always meant to be an intermediary (also known as a proxy), transparently adding caching functionality to HTTP connections. In Figure 6-1, Client 1 connects to the cache and requests a particular web resource (a page or a Cascading Style Sheet [CSS] file). The cache then has a chance to “intercept” that request and respond to it using a cached object. Only if the cache does not have a valid cached response will it connect to the origin server itself and forward the client’s request. Since the interface of the service and the read-through cache are the same, clients can connect to the service directly (as Client 2 did), bypassing the cache, as a read-through cache only works when you connect through it.
Figure 6-1 Client interacting with the read-through cache
Read-through caches are especially attractive because they are transparent to the client. Clients are not able to distinguish whether they received a cached object or not. If a client was able to connect to the origin server directly, it would bypass the cache and get an equally valid response. This pluggable architecture gives a lot of flexibility, allowing you to add layers of caching to the HTTP stack without needing to modify any of the clients. In fact, it is common to see multiple HTTP read-through caches chained to one another in a single web request without the client ever being aware of it.
Figure 6-2 shows how chaining HTTP read-through caches might look. I will discuss these caches in more detail later in this chapter, but for now just note that the connection from the client can be intercepted by multiple read-through caches without the client even realizing it. In each step, a proxy can respond to a request using a cached response, or it can forward the request to its source on a cache miss. Let’s now have a closer look at how caching can be controlled using HTTP protocol headers.
Figure 6-2 Chain of HTTP read-through caches
HTTP is a text-based protocol. When your browser requests a page or any other HTTP resource (image, CSS file, or AJAX call), it connects to the HTTP server and sends an HTTP command, like GET, POST, PUT, DELETE, or HEAD, with a set of additional HTTP headers. Most of the HTTP headers are optional, and they can be used to negotiate expected behaviors. For example, a browser can announce that it supports gzip compressed responses, which lets the server decide whether it sends a response in compressed encoding or not. Listing 6-1 shows an example of a simple GET request that a web browser might issue when requesting the uniform resource locator (URL) http://www.example.org/. I have removed unnecessary headers like cookies and user-agent to simplify the example.
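For illustration, a stripped-down request of this kind might look like the following (exact headers vary by browser; this is a sketch, not a reproduction of the listing):

```
GET / HTTP/1.1
Host: www.example.org
Accept-Encoding: gzip
Connection: keep-alive
```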
In response to that request, the web server could reply with something similar to Listing 6-2. Note that the server decided to return a response in compressed encoding using the gzip algorithm, as the client suggested, but it rejected the request to keep the network connection open (the Connection: keep-alive request header), responding with a Connection: close response header and closing the connection immediately.
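An illustrative response along those lines might look like this (status line and headers only; body omitted):

```
HTTP/1.1 200 OK
Content-Type: text/html
Content-Encoding: gzip
Connection: close
```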
There are dozens of different request headers and corresponding response headers that clients and servers may use, but I will focus only on headers relevant to caching. Caching headers are actually quite difficult to get right, because many different options can be set. To add further complication, some older headers like “Pragma: no-cache” can be interpreted differently by different implementations.
You can use the same HTTP headers to control caching of your web pages, static resources like images, and web service responses. The ability to cache web service responses is, in fact, one of the key scalability benefits of REST-ful services. You can always put an HTTP cache between your web service and your clients and leverage the same caching mechanisms that you use for your web pages.
The first header you need to become familiar with is Cache-Control. Cache-Control was added to the HTTP 1.1 specification and is now supported by most browsers and caching packages. Cache-Control allows you to specify multiple options, as you can see in Listing 6-3.
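As an illustration of combining several options in one header, a response might include something like the following (the particular combination shown is just an example):

```
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
```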
The most important Cache-Control options that web servers may include in their responses include
private Indicates the result is specific to the user who requested it and the response cannot be served to any other user. In practice, this means that only browsers will be able to cache this response because intermediate caches would not have the knowledge of what identifies a user.
no-store Indicates the response should not be stored on disks by any of the intermediate caches. In other words, the response can be cached in memory, but it will not be persisted to disk. You should include this option any time your response contains sensitive user information so that neither the browser nor other intermediate caches store this data on disk.
no-cache Indicates the response should not be cached. To be accurate, it states that the cache needs to ask the server whether this response is still valid every time users request the same resource.
max-age Indicates how many seconds this response can be served from the cache before becoming stale (it defines the TTL of the response). This information can be expressed in a few ways, causing potential inconsistency. I recommend not using max-age (it is less backward compatible) and depending on the Expires HTTP header instead.
no-transform Indicates the response should be served without any modifications. For example, without this option, a content delivery network (CDN) provider might transcode images to reduce their size, lowering the quality or changing the compression algorithm.
must-revalidate Indicates that once the response becomes stale, it cannot be returned to clients without revalidation. Although it may seem odd, caches may return stale objects under certain conditions, for example, if the client explicitly allows it or if the cache loses connection to the origin server. By using must-revalidate, you tell caches to stop serving stale responses no matter what. Any time a client asks for a stale object, the cache will then be forced to request it from the origin server.
Note that a cached object is considered fresh as long as its expiration time has not passed. Once the expiration time passes, the object becomes stale, but it can still be returned to clients if they explicitly allow stale responses. If you want to forbid stale objects from ever being returned to clients, include the must-revalidate option in the Cache-Control response header. Clients can also include the Cache-Control header in their requests. The header is rarely used by clients, and it has slightly different semantics when included in a request. For example, the max-age option included in a request tells caches that the client cannot accept objects that are older than max-age seconds, even if those objects are still considered fresh by the cache.
Another HTTP header that is relevant to caching is the Expires header, which allows you to specify an absolute point in time when the object becomes stale. Listing 6-4 shows an example of how it can be used.
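For example, an Expires header of this form marks the response stale after the given absolute time (the date shown is arbitrary):

```
Expires: Sat, 25 Jul 2015 13:04:28 GMT
```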
Unfortunately, as you can already see, some of the functionality controlled by the Cache-Control header overlaps that of other HTTP headers. Expiration time of the web response can be defined either by Cache-Control: max-age=600 or by setting an absolute expiration time using the Expires header. Including both of these headers in the response is redundant and leads to confusion and potentially inconsistent behavior. For that reason, I recommend deciding which headers you want to use and sticking to them, rather than including all possible headers in your responses.
Another important header is Vary. The purpose of that header is to tell caches that you may need to generate multiple variations of the response based on some HTTP request headers. Listing 6-5 shows the most common Vary header indicating that you may return responses encoded in different ways depending on the Accept-Encoding header that the client sends to your web server. Some clients who accept gzip encoding will get a compressed response, whereas others who cannot support gzip will get an uncompressed response.
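The most common form of the header looks like this:

```
Vary: Accept-Encoding
```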
There are a few more HTTP headers related to caching that allow for conditional download of resources and revalidation, but they are beyond the scope of this book. Those headers include Age, Last-Modified, If-Modified-Since, and ETag, and they may be studied separately. Let’s now turn to a few examples of common caching scenarios.
Even though you could cache static files forever, you should not set the Expires header more than one year into the future (the HTTP specification advises servers not to send Expires dates further out than that). Listing 6-6 shows an example of HTTP headers for a static file, allowing it to be cached for up to one year (counting from July 23, 2015). This example also allows caches to reuse the same cached object between different users, and it makes sure that compressed and uncompressed objects are cached independently, preventing any encoding errors.
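Headers to that effect might look like the following sketch. The public option, not covered in the list above, explicitly allows shared caches to store the response; the date assumes the response was generated on July 23, 2015:

```
Cache-Control: public, no-transform
Expires: Sat, 23 Jul 2016 13:04:28 GMT
Vary: Accept-Encoding
```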
The second most common scenario is the worst case—when you want to make sure that the HTTP response is never stored, cached, or reused for any users. To do this, you can use response headers as shown in Listing 6-7. Note that I used another HTTP header here (Pragma: no-cache) to make sure that older clients can understand my intent of not caching the response.
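A belt-and-braces set of headers to that effect might look like this; an Expires date in the past (the exact date does not matter) reinforces the intent for older clients:

```
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Pragma: no-cache
```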
The last use case covers situations where you want the same user to reuse a piece of content, but at the same time you do not want other users to share the cached response. For example, if your website allows users to log in, you may want to display the user’s profile name in the top-right corner of the page together with a link to his or her profile page. In such a scenario, the body of the page contains user-specific data, so you cannot reuse the same response for different users. You can still use the full page cache, but it will have to be a private cache to ensure that users see their own personalized pages. Listing 6-8 shows the HTTP headers allowing web browsers to cache the response for a limited amount of time (for example, ten minutes from July 23, 2015, 13:04:28 GMT).
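Headers along these lines would let the browser keep a private copy for ten minutes, using the timestamp given in the example:

```
Cache-Control: private, must-revalidate
Expires: Thu, 23 Jul 2015 13:14:28 GMT
```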
The last thing worth noting here is that in addition to HTTP caching headers, you can find some HTML metatags that seem to control web page caching. Listing 6-9 shows some of these metatags.
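Metatags of this sort typically take the following form (shown only to illustrate what to avoid; exact attribute values vary between examples found in the wild):

```html
<meta http-equiv="cache-control" content="no-cache">
<meta http-equiv="pragma" content="no-cache">
<meta http-equiv="expires" content="0">
```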
It is best to avoid these metatags altogether, as they do not work for intermediate caches and they may be a source of confusion, especially for less experienced engineers. Control caching using HTTP headers alone, and do so with minimal redundancy. Now that we have discussed how HTTP caching can be implemented, let’s have a look at different types of caches that can be used to increase the performance and scalability of your websites.
The HTTP protocol gives a lot of flexibility in deploying caches between the web client and the web server. There are many ways in which you can leverage HTTP-based caches, and usually it is fairly easy to plug them into existing applications. There are four main types of HTTP caches: browser cache, caching proxies, reverse proxies, and CDNs. Most of them do not have to be scaled by you, as they are controlled by the user’s devices or third-party networks. I will discuss scalability of HTTP-based caches later in this chapter, but first let’s discuss each of these HTTP cache types.
The first and most common type of cache is the caching layer built into all modern web browsers called browser cache. Browsers have built-in caching capabilities to reduce the number of requests sent out. These usually use a combination of memory and local files. Whenever an HTTP request is about to be sent, the browser can check the cache for a valid version of the resource. If the item is present in cache and is still fresh, the browser can reuse it without ever sending an HTTP request.
Figure 6-3 shows a developer’s toolbar shipped as part of the Google Chrome web browser. In the sequence of web resources being downloaded on the first page load, you can see that the time needed to load the HTML was 582 ms, after which a CSS file was downloaded along with a few images, each taking approximately 300 ms to download.
Figure 6-3 Sequence of resources downloaded on first visit
If the HTTP headers returned by the web server allow the web browser to cache these responses, the browser can significantly speed up the page load and save your servers a lot of work rendering and sending these files. Figure 6-4 shows the same page load sequence, but this time with most of the resources being served directly from the browser cache. Even though the page itself still takes a long time to be verified with the server, all the images and CSS files are served from cache without any network delay.
Figure 6-4 Sequence of resources downloaded on consecutive visit
The second type of caching HTTP technology is a caching proxy. A caching proxy is a server, usually installed in a local corporate network or by the Internet service provider (ISP). It is a read-through cache used to reduce the amount of traffic generated by the users of the network by reusing responses between users of the network. The larger the network, the larger the potential saving—that is why it was quite common among ISPs to install transparent caching proxies and route all of the HTTP traffic through them to cache as many web requests as possible. Figure 6-5 shows how a transparent caching proxy can be installed within a local network.
Figure 6-5 HTTP proxy server in local network
In recent years, the practice of installing local proxy servers has become less popular as bandwidth has become cheaper and as it becomes more popular for websites to serve their resources solely over the Secure Sockets Layer (SSL) protocol. SSL encrypts the communication between the client and the server, which is why caching proxies are not able to intercept such requests, as they do not have the necessary certificates to decrypt and encrypt messages being exchanged.
A reverse proxy works in exactly the same way as a regular caching proxy, but the intent is different: you place a reverse proxy in your own data center to reduce the load put on your own web servers. Figure 6-6 shows a reverse proxy deployed in the data center, together with web servers, caching responses from your web servers.
Figure 6-6 Reverse proxy
For some applications, reverse proxies are an excellent way to scale. If you can use full page caching, you can significantly reduce the number of requests coming to your web servers. Using reverse proxies can also give you more flexibility because you can override HTTP headers and better control which requests are being cached and for how long. Finally, reverse proxies are an excellent way to speed up your web services layer. You can often put a layer of reverse proxy servers between your front-end web application servers and your web service machines. Figure 6-7 shows how you can scale a cluster of REST-ful web services by simply reducing the number of requests that need to be served. Even if you were not able to cache all of your endpoints, caching web service responses can be a useful technique.
Figure 6-7 Reverse proxy in front of web services cluster
Figure 6-8 CDN configured for static files
You can also configure some CDN providers to serve both static and dynamic content of your website so that clients never connect to your data center directly; they always go through the cache servers belonging to the CDN provider. This technique has some benefits. For example, the provider can mitigate distributed denial of service attacks (as CloudFlare does). It can also lead to further reduction of web requests sent to your origin servers, as dynamic content (even private content) can now be cached by the CDN. Figure 6-9 shows how you can configure Amazon CloudFront to deliver both static and dynamic content for you.
Figure 6-9 CDN configured for both static files and dynamic content
Now that we have gone through the different types of caches, let’s see how we can scale each type as our website traffic grows.
One reason HTTP-based caching is so attractive is that you can usually push the load off your own servers onto machines managed by third parties that are closer to your users. Any request served from browser cache, a third-party caching proxy, or a CDN is a request that never got to your web servers, ultimately reducing the stress on your infrastructure. At the same time, requests served from HTTP caches are satisfied faster than your web servers could ever do it, making HTTP-based caching even more valuable.
As mentioned before, do not worry about the scalability of browser caches or third-party proxy servers; they are out of your control. When it comes to CDN providers, you do not have to worry about scalability either, as CDN providers scale transparently, charging you flat fees per million requests or per GB of data transferred. Usually, the prices per unit decrease as you scale out, making them even more cost effective. This leaves you to manage only reverse proxy servers. If you use these, you need to manage and scale them yourself.
There are many open-source reverse proxy solutions on the market, including Nginx, Varnish, Squid, Apache mod_proxy, and Apache Traffic Server. If you are hosted in a private data center, you may also be able to use a built-in reverse proxy functionality provided by some of the hardware load balancers.
For most young startups, a single reverse proxy should be able to handle the incoming traffic, as both hardware reverse proxies and leading open-source ones (Nginx or Varnish) can handle more than 10,000 requests per second from a single machine. As such, it is usually more important to decide what to cache and for how long rather than how to scale reverse proxies themselves. To be able to scale the reverse proxy layer efficiently, you need to focus on your cache hit ratio first. It is affected by the same three factors mentioned at the beginning of the chapter, and in the context of reverse proxies, they translate to the following:
Cache key space Describes how many distinct URLs your reverse proxies will observe in a period of time (let’s say in an hour). The more distinct URLs are served, the more memory or storage you need on each reverse proxy to be able to serve a significant portion of traffic from cache. Avoid caching responses that depend on the user (for example, that contain the user ID in the URL). These types of responses can easily pollute your cache with objects that cannot be reused.
Average response TTL Describes how long each response can be cached. The longer you cache objects, the more chance you have to reuse them. Always try to cache objects permanently. If you cannot cache objects forever, try to negotiate the longest acceptable cache TTL with your business stakeholders.
It is worth pointing out that you do not have to worry much about cache servers becoming full by setting a long TTL, as in-memory caches use algorithms designed to evict rarely accessed objects and reclaim space. The most commonly used algorithm is Least Recently Used (LRU), which allows the cache server to eventually remove rarely accessed objects and keep “hot” items in memory to maximize cache hit ratio.
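The LRU policy itself is simple enough to sketch in a few lines. This toy version (capacity measured in object count, no TTLs, no locking) only illustrates the eviction order, not a production cache server:

```python
from collections import OrderedDict

class LRUCache:
    """Toy LRU cache: evicts the least recently used entry when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # insertion order doubles as recency order

    def get(self, key):
        if key not in self.items:
            return None                      # cache miss
        self.items.move_to_end(key)          # mark as most recently used
        return self.items[key]

    def set(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict least recently used

cache = LRUCache(capacity=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")         # touch "a" so "b" becomes the LRU entry
cache.set("c", 3)      # exceeds capacity, evicts "b"
print(cache.get("b"))  # None (evicted)
print(cache.get("a"))  # 1 ("hot" item survived)
```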
Once you verify that you are caching objects for as long as possible and that you only cache things that can be efficiently reused, you can start thinking of scaling out your reverse proxy layer. You are most likely going to reach either the concurrency limit or throughput limit. Both of these problems can be mitigated easily by deploying multiple reverse proxies in parallel and distributing traffic among them.
Figure 6-10 shows a deployment scenario where two layers of reverse proxies are used. The first layer is deployed directly behind a load balancer, which distributes the traffic among the reverse proxies. The second layer is positioned between the front-end web servers and web service machines. In this case, front-end web servers are configured to pick a random reverse proxy on each request. Once your stack grows even larger, it makes sense to deploy a load balancer in front of the second reverse proxy layer as well to make configuration changes more transparent and isolated. Luckily, it is unlikely that you would need such a complicated deployment; usually, reverse proxies in front of the front-end web application are unnecessary, and it is more convenient to push that responsibility onto the CDN.
Figure 6-10 Multiple reverse proxy servers
If you are using HTTP caching correctly, adding more reverse proxies and running them in parallel should not cause problems. The HTTP protocol does not require synchronization between HTTP caches, and it does not guarantee that all of the client’s requests are routed through the same physical networks. Each HTTP request can be sent in a separate TCP/IP connection and can be routed through a different set of intermediate caches. Clients have to work under these constraints and accept inconsistent responses or use cache revalidation.
No matter what reverse proxy technology you choose, you can use the same deployment pattern of multiple reverse proxies running in parallel because the underlying behavior is exactly the same. Each proxy is an independent clone, sharing nothing with its siblings, which is why choice of reverse proxy technology is not that important when you think of scalability. For general use cases, I recommend using Nginx or a hardware reverse proxy, as they have superior performance and feature sets. A few Nginx features that are especially worth mentioning are
Nginx uses solely asynchronous processing, which allows it to proxy tens of thousands of concurrent connections with a very low per-connection overhead.
Nginx is also a FastCGI server, which means that you can run your web application on the same web server stack as your reverse proxies.
Nginx can act as a load balancer; it supports multiple forwarding algorithms and many advanced features, such as SPDY, WebSockets, and throttling. Nginx can also be configured to override headers, which can be used to apply HTTP caching to web applications that do not implement caching headers correctly or to override their caching policies.
Nginx is well established, with an active community. As of 2013, it was reported to serve over 15 percent of websites on the Internet (source: Netcraft).
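As a sketch of the header-override capability mentioned above, an Nginx reverse-proxy configuration might look roughly like this; the backend address, cache path, and ten-minute TTL are assumptions for illustration, not recommendations:

```nginx
# Define a cache area: a 10 MB metadata zone and up to 10 GB of objects on disk.
proxy_cache_path /var/cache/nginx keys_zone=app_cache:10m max_size=10g;

server {
    listen 80;

    location / {
        proxy_pass http://127.0.0.1:8080;    # hypothetical backend server
        proxy_cache app_cache;
        # Ignore the backend's caching headers and impose our own policy:
        proxy_ignore_headers Cache-Control Expires;
        proxy_cache_valid 200 10m;           # cache successful responses for 10 minutes
    }
}
```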
If you are hosting your data center yourself and have a hardware load balancer, I recommend using it as a reverse proxy as well to reduce the number of components in your stack. In all other cases, I recommend investigating available open-source reverse proxy technologies like Nginx, Varnish, or Apache Traffic Server; selecting one; and scaling it out by adding more clones.
Finally, you can also scale reverse proxies vertically by giving them more memory or switching their persistent storage to solid-state drives (SSDs). This technique is especially useful when the pool of fresh cached objects becomes much larger than the working memory of your cache servers. To increase your hit ratio, you can extend the size of your cache storage to hundreds of gigabytes by switching to file system storage rather than depending solely on shared memory. By using SSD drives, you will be able to serve these responses at least ten times faster than if you used regular (spinning disk) hard drives, due to the much faster random access times of SSDs. At the same time, since cache data is meant to be disposable, you do not have to worry much about limited SSD lifetimes or sudden power failure–related SSD corruption.
After HTTP-based caches, the second most important caching component in a web application stack is usually a custom object cache. Object caches are used in a different way than HTTP caches because they are cache-aside rather than read-through caches. In the case of cache-aside caches, the application needs to be aware of the existence of the object cache, and it actively uses it to store and retrieve objects rather than the cache being transparently positioned between the application and its data sources (which happens with read-through cache).
A cache-aside cache is seen by the application as an independent key-value data store. Application code would usually ask the object cache if the needed object is available and, if so, it would retrieve and use the cached object. If the required object is not present or has expired, the application would then do whatever is necessary to build the object from scratch. It would usually contact its primary data sources to assemble the object and then save it back in the object cache for future use. Figure 6-11 shows how a cache-aside cache lives “on the side” and how the application communicates directly with its data sources rather than communicating through the cache.
Figure 6-11 Cache-aside cache
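A minimal cache-aside sketch might look like the following; the in-memory dict stands in for a real object cache client (for example, Memcached or Redis), and `load_product_from_database` is a hypothetical data-access call:

```python
cache = {}  # stand-in for an external object cache

def load_product_from_database(product_id):
    # Placeholder for the real (expensive) primary data-source call.
    return {"id": product_id, "name": "Product %d" % product_id}

def get_product(product_id):
    key = "product:%d" % product_id
    product = cache.get(key)       # 1. ask the cache first
    if product is None:            # 2. on a miss, build the object from scratch...
        product = load_product_from_database(product_id)
        cache[key] = product       # 3. ...and store it for future requests
    return product
```

Note that the application talks to both the cache and the database itself; nothing about this is transparent, which is the essential difference from a read-through cache.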
Similar to other types of caches, the main motivation for using object cache is to save the time and resources needed to create application objects from scratch. All of the object cache types discussed in this section can be imagined as key-value stores with support of object expiration, and they usually support a simplistic programming interface, allowing you to get, set, and delete objects based on the unique key string. Let’s now have a closer look at different types of object caches and their benefits and drawbacks.
As was the case for the HTTP caches we discussed earlier in this chapter, there are many different ways application object caches can be deployed. The actual technologies may depend on your stack, but the concepts remain similar.
Listing 6-10 shows how easy it is to store an object in web storage. Web storage works as a key-value store. To store an object, you provide a unique identifier, called the key, and the string of bytes that you want to be persisted (called the value).
To put this into practice, you could save each search result in web storage, together with the user’s coordinates and a timestamp; then at load time, you could simply show the last search results by loading them from web storage. At the same time, you could compare the user’s location and current time to the coordinates of their previous search. If the user’s location has changed significantly—let’s say they moved by more than 200 meters or they have not opened the application for more than a day—you could update the user interface to indicate an update is in progress and then issue a new asynchronous request to your server to load new data. This way, your users see something immediately, making the application feel more responsive; at the same time, you reduce the number of unnecessary requests sent to your servers when users open the app a few times on their way to a restaurant.
Another important type of object cache is one located directly on your web servers. Whether you develop a front-end web application or a web service, you can usually benefit from local cache. Local cache is usually implemented in one of the following ways:
Objects are cached directly in the application’s memory. The application creates a pool for cached objects and never releases memory allocated for them. In this case, there is virtually no overhead when accessing cached objects, as they are stored directly in the memory of the application process in the native format of the executing code. There is no need to copy, transfer, or encode these objects; they can be accessed directly. This method applies to all programming languages.
Objects are stored in shared memory segments so that multiple processes running on the same machine could access them. In this approach, there is still very little overhead, as shared memory access is almost as fast as local process memory. The implementation may add some overhead, but it can be still considered insignificant. For example, in PHP, storing objects in shared memory forces object serialization, which adds a slight overhead but allows all processes running on the same server to share the cached objects pool. This method is less common, as it is not applicable in multithreaded environments like Java, where all execution threads run within a single process.
A caching server is deployed on each web server as a separate application. In this scenario, each web server has an instance of a caching server running locally, but it must still use the caching server’s interface to interact with the cache rather than accessing shared memory directly. This method is more common in tiny web applications where you start off with a single web server and you deploy your cache (like Memcached or Redis) on the same machine as your web application, mainly to save on hosting costs and network latencies. The benefit of this approach is that your application is ready to talk to an external caching server—it just happens to run on the same machine as your web application, making it trivial to move your cache to a dedicated cluster without the need to modify the application code.
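As a sketch of the first approach, caching objects directly in the application process's memory, a simple memoization decorator with a TTL could look like this. The names and the TTL value are illustrative; a production-grade local cache would also need to bound its memory use.

```python
import functools
import time

def memoize_with_ttl(ttl_seconds):
    """Cache a function's results directly in the application process's memory."""
    def decorator(fn):
        store = {}  # lives for the lifetime of the process; never released
        @functools.wraps(fn)
        def wrapper(*args):
            hit = store.get(args)
            if hit is not None and time.time() - hit[1] < ttl_seconds:
                return hit[0]              # served straight from local memory
            result = fn(*args)             # miss: compute and remember
            store[args] = (result, time.time())
            return result
        return wrapper
    return decorator

@memoize_with_ttl(ttl_seconds=60)
def load_product(product_id):
    # stand-in for an expensive database query
    return {"id": product_id, "name": f"Product {product_id}"}
```

Because the cached object is returned directly from process memory, there is no serialization, copying, or network overhead involved in a cache hit.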
Each of these approaches boils down to the same concept of having an object cache locally on the server where your application code executes. The main benefit of caching objects directly on your application servers is the speed at which they can be persisted and accessed. Since objects are stored in memory on the same server, they can be accessed orders of magnitude faster than if you had to fetch them from a remote server. Table 6-1 shows the orders of magnitude of latencies introduced by accessing local memory, disk, and remote network calls.
Table 6-1 Approximate Latencies when Accessing Different Resources
An additional benefit of local application cache is the simplicity of development and deployment. Rather than coordinating between servers, deploying additional components, and then managing them during deployments, local cache is usually nothing more than a bit of extra memory allocated by the application process. Local caches are not synchronized or replicated between servers, which also makes things faster and simpler, as you do not have to worry about locking and network latencies. By having identical and independent local caches on each server, you also make your web cluster easier to scale by adding clones (as described in Chapters 2 and 3) because your web servers are interchangeable yet independent from one another.
Unfortunately, there are some drawbacks to local application caches. Most importantly, each application server will likely end up caching the same objects, causing a lot of duplication between your servers. That is caused by the fact that caches located on your application servers do not share any information, nor are they synchronized in any way. If you dedicate 1GB of memory for object cache on each of your web servers, you realistically end up with roughly 1GB of unique cached data across your cluster, no matter how many servers you have, as each server is bound to that limit and duplicates content that may already be stored in the other caches. Depending on your use case, this can be a serious limitation, as you cannot easily scale the size of your cache.
Another very important limitation of local server caches is that they cannot be kept consistent and you cannot remove objects from such a cache efficiently. For example, if you were building an e-commerce website and you were to cache product information, you might need to remove these cached objects any time the product price changes. Unfortunately, if you cache objects on multiple machines without any synchronization or coordination, you will not be able to remove objects from these caches without building overly complex solutions (like publishing messages to your web servers to remove certain objects from cache).
The last common type of cache relevant to web applications is a distributed object cache. The main difference between this type and local server cache is that interacting with a distributed object cache usually requires a network round trip to the cache server. On the plus side, distributed object caches offer much better scalability than local application caches. Distributed object caches usually work as simple key-value stores, allowing clients to store data in the cache for a limited amount of time, after which the object is automatically removed by the cache server (object expires). There are many open-source products available, with Redis and Memcached being the most popular ones in the web space. There are commercial alternatives worth considering as well, like Terracotta Server Array or Oracle Coherence, but I would recommend a simple open-source solution for most startup use cases.
Interacting with distributed cache servers is simple, and most caching servers have client libraries for all common programming languages. Listing 6-12 shows the simplicity of caching interfaces. All you need to specify is the server you want to connect to, the key of the value you want to store, and TTL (in seconds) after which the object should be removed from the cache.
Storing objects in remote cache servers (like Redis or Memcached) has a few advantages. Most importantly, you can scale these solutions much better. We will look at this in more detail in the next section, but for now, let’s say that you can scale simply by adding more servers to the cache cluster. By adding servers, you can scale both the throughput and overall memory pool of your cache. By using a distributed cache, you can also efficiently remove objects from the cache, allowing for cache invalidation on source data changes. As I explained earlier, in some cases, you need to remove objects from cache as soon as the data changes. Having a distributed cache makes such cache invalidation (cache object removal) easier, as all you need to do is connect to your cache and request object removal.
Using dedicated cache servers is also a good way to push responsibility out of your applications, as cache servers are nothing other than data stores and they often support a variety of features. For example, Redis allows for data persistence, replication, and efficient implementation of distributed counters, lists, and object sets. Caches are also heavily optimized when it comes to memory management, and they take care of things like object expiration and evictions.
Cache servers usually use the least recently used (LRU) algorithm to decide which objects should be removed from cache once they reach a memory limit. Any time you want to store a new object in the cache, the cache server checks if there is enough memory to add it. If there is no space left, it removes the objects that were least recently used to make enough space for your new object. By using an LRU cache, you never have to worry about deleting items from cache—they just expire or get removed once more “popular” objects arrive.
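The eviction policy just described can be sketched with Python's ordered dictionary. This is a simplified illustration of the LRU idea, not how any particular cache server implements it.

```python
from collections import OrderedDict

class LRUCache:
    """Bounded cache that evicts the least recently used entry when full."""
    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)        # mark as most recently used
        return self._items[key]

    def set(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry
```

Reading an entry moves it to the "most recently used" end, so frequently requested objects survive while cold ones are pushed out first.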
Distributed caches are usually deployed on separate clusters of servers, giving them more operating memory than other machines would need. Figure 6-12 shows how cache servers are usually deployed—in a separate cluster of machines accessible from both front-end and web service machines.
Figure 6-12 Common distributed cache deployment
Even though distributed caches are powerful scalability tools and are relatively simple in structure, adding them to your system adds a certain amount of complexity and management overhead. If you use cloud-hosted Redis or Memcached, you may not need to worry about deployments and server management, but you still need to understand and monitor them to be able to use them efficiently. Whenever deploying new caches, start as small as possible. Redis is a very efficient cache server, and a single machine can support tens of thousands of operations per second, allowing you to handle reasonable traffic without the need to scale at all. As long as throughput is not a problem, scale vertically by adding more memory rather than trying to implement a more complex deployment with replication or data partitioning. When your system grows larger and becomes more popular, you may need to scale beyond a single node. Let’s now have a closer look at how you can scale your object caches.
When it comes to scaling your object caches, the techniques depend on the location and type of your cache. For example, client-side caches like web browser storage cannot be scaled, as there is no way to affect the amount of memory that browsers allow you to use. The value of web storage comes with the fact that users have their own cache. You can keep adding users, and you do not have to scale the client-side caches to store their user-specific data.
The web server local caches are usually scaled by falling back to the file system, as there is no other way to distribute or grow a cache that, by definition, lives on a single server. In some scenarios, you may have a very large data pool where each object can be cached for a long period of time but objects are accessed relatively rarely. In such a scenario, it may be a good idea to use the local file system of your web servers to store cached objects as serialized files rather than storing them in the memory of the shared cache cluster. Accessing cached objects stored on the file system is slower, but it does not require remote connections, so the web server becomes more independent and insulated from the other subsystems’ failures. File-based caches can also be cheaper because disk storage is much cheaper than operating memory and you do not have to create a separate cluster just for the shared object cache. Given the rising popularity of SSD drives, file system–based caches may be a cheap and fast alternative to random access memory (RAM).
When it comes to distributed object caches, you may scale in different ways depending on the technology used, but usually data partitioning (explained in Chapters 2 and 5) is the best way to go, as it allows you to scale the throughput and the overall memory pool of your cluster. Some technologies, like Oracle Coherence, support data partitioning out of the box, but most open-source solutions (like Memcached and Redis) are simpler than that and rely on client-side data partitioning.
If you decide to use Memcached as your object cache, the situation is quite simple. You can use the libMemcached client library’s built-in features to partition the data among multiple servers. Rather than having to implement it in your code, you can simply tell the client library that you have multiple Memcached servers. Listing 6-13 shows how easy it is to declare multiple servers as a single Memcached cluster using a native PHP client that uses libMemcached under the hood to talk to Memcached servers.
By declaring a Memcached cluster, your data will be automatically distributed among the cache servers using a consistent hashing algorithm. Any time you issue a GET or SET command, the Memcached client library will hash the cache key that you want to access and then map it to one of the servers. Once the client finds out which server is responsible for that particular cache key, it will send the request to that server only, so that other servers in the cluster do not have to participate in the operation. This is an example of the shared-nothing approach, as each cache object is assigned to a single server without any redundancy or coordination between cache servers.
Figure 6-13 illustrates how consistent hashing is implemented. First, all possible cache keys are represented as a range of numbers, with the beginning and end joined to create a circle. Then you place all of your servers on the circle, an equal distance from one another. Then you declare that each server is responsible for the cache keys sitting between it and the next server (moving clockwise along the circle). This way, by knowing the cache key and how many servers you have in the cluster, you can always find out which server is responsible for the data you are looking for.
Figure 6-13 Cache partitioning using consistent hashing
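A bare-bones sketch of the ring lookup follows. The class is hypothetical; real client libraries such as libMemcached also place many "virtual" points per server on the ring to even out the key distribution.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map cache keys to servers by walking clockwise around a hash ring."""
    def __init__(self, servers):
        # place each server at a ring position derived from its name
        self._ring = sorted((self._position(s), s) for s in servers)
        self._points = [pos for pos, _ in self._ring]

    @staticmethod
    def _position(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def server_for(self, cache_key):
        # find the first server clockwise from the key's position on the ring
        index = bisect.bisect(self._points, self._position(cache_key))
        return self._ring[index % len(self._ring)][1]
```

Because a key's ring position never changes, every client that knows the server list independently arrives at the same key-to-server mapping without any coordination between cache servers.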
To scale your cache cluster horizontally, you need to be able to add servers to the cluster, and this is where consistent hashing really shines. Since each server is responsible for a part of the key space on the ring, adding a new server to the cluster causes each server to move slightly on the ring. This way, only a small subset of the cache keys get reassigned between servers, causing a relatively small cache-miss wave. Figure 6-14 shows how server positions change when you scale from a four-server cluster to a five-server cluster.
Figure 6-14 Scaling cache cluster using consistent hashing
If you used a naïve approach like using a modulo function to map a cache key to the server numbers, each time you added or removed a server from the cluster, most of your cache keys would be reassigned, effectively purging your entire cache. The Memcached client for PHP is not the only client library supporting consistent hashing. In fact, there are many open-source libraries that you can use in your application layer if your cache driver does not support consistent hashing out of the box.
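The scale of the problem is easy to demonstrate. The sketch below (the hash choice and key names are arbitrary) maps keys to servers with a modulo function and counts how many keys change owners when a fifth server is added; for a uniform hash, roughly 80 percent of keys move, compared to roughly 20 percent with consistent hashing.

```python
import hashlib

def naive_server_for(cache_key, server_count):
    # naive mapping: hash the key and take it modulo the number of servers
    digest = int(hashlib.md5(cache_key.encode()).hexdigest(), 16)
    return digest % server_count

keys = [f"user:{i}" for i in range(10_000)]
moved = sum(1 for k in keys if naive_server_for(k, 4) != naive_server_for(k, 5))
print(f"{moved / len(keys):.0%} of cache keys reassigned")
```

Every reassigned key is a guaranteed cache miss, which is why a modulo-based scheme effectively purges most of your cache on every resize.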
To understand caching even better, it is good to think of cache as a large hash map. The reason caches can locate items so fast is that they use hashing functions to determine the “bucket” in which a cached object should live. This way, no matter how large the cache is, getting and setting values can be performed in constant time.
Another alternative approach to scaling object caches is to use data replication or a combination of data partitioning and data replication. Some object caches, like Redis, allow for master-slave replication deployments, which can be helpful in some scenarios. For example, if one of your cache keys became so “hot” that all of the application servers needed to fetch it concurrently, you could benefit from read replicas. Rather than all clients needing the cached object connecting to a single server, you could scale the cluster by adding read-only replicas of each node in the cluster (see Chapter 2). Figure 6-15 shows how you could deploy read-only replicas of each cache server to scale the read throughput and allow a higher level of concurrency.
Figure 6-15 Scaling cache cluster using data partitioning and replication
It is worth mentioning that if you were hosting your web application on Amazon, you could either deploy your own caching servers on EC2 instances or use Amazon ElastiCache. Unfortunately, ElastiCache is not as smart as you might expect, as it is basically a hosted cache cluster, and the only real benefit it offers is that you do not have to manage the servers or worry about failure-recovery scenarios. When you create an ElastiCache cluster, you can choose whether you want to use Memcached or Redis, and you can also pick how many servers you want and how much capacity you need per server. It is important to remember that you will still need to distribute the load across the cache servers in your client code because ElastiCache does not add transparent partitioning or automatic scalability. In a similar way, you can create cache clusters using other cloud-hosting providers. For example, Azure lets you deploy a managed Redis instance with replication and automatic failover in a matter of a few clicks.
Object caches are in general easier to scale than data stores, and usually simple data partitioning and/or replication is enough to scale your clusters horizontally. When you consider that all of the data stored in object caches is, by definition, disposable, the consistency and persistence constraints can be relaxed, allowing for simpler scalability. Now that we have discussed different types of caches and their scalability techniques, let’s move on to some general rules of thumb that may be helpful when designing scalable web applications.
How difficult caching is to apply depends on your application’s needs and how you use the cache. It’s important to know the most common types of caches and how to scale them. In this section, we will discuss where to focus and prioritize your caching efforts to get the biggest bang for the buck. We will also discuss some techniques that can help you reuse cached objects and some pitfalls to watch out for. Let’s get to it.
One of the most important things to remember about caching is that the higher up the call stack you can cache, the more resources you can save. To illustrate this a bit better, let’s consider Figure 6-16. It shows how the call stack of an average web request might look and roughly how much you can save by caching at each layer. Treat the percentages of resources saved in Figure 6-16 as a simplified rule of thumb. In reality, every system will have a different distribution of resources consumed by each layer.
Figure 6-16 Caching in different layers of the stack
First, your client requests a page or a resource. If that resource is available in one of the HTTP caches (browser, local proxy) or can be satisfied from local storage, then your servers will not even see the request, saving you 100 percent of the resources. If that fails, your second best choice is to serve the HTTP request directly from reverse proxy or CDN, as in such a case you incur just a couple percentage points of the cost needed to generate the full response.
When a request makes it to your web server, you may still have a chance to use a custom object cache and serve the entire response without ever calling your web services. In case you need to call the web services, you may also be able to get the response from a reverse proxy living between your web application and your web services. Only when that fails as well will your web services get involved in serving the request. Here again, you may be able to use object caches and satisfy the request without the need to involve the data stores. Only when all of this fails will you need to query the data stores or search engines to retrieve the data needed by the user.
The same principle applies within your application code. If you can cache an entire page fragment, you will save more time and resources than caching just the database query that was used to render this page fragment. As you can see, avoiding the web requests reaching your servers is the ultimate goal, but even when it is not possible, you should still try to cache as high up the call stack as you can.
Another important thing to remember when working with caching is to always try to reuse the same cached object for as many requests/users as you can. Caching objects that are never requested again is simply a waste of time and resources.
To illustrate it better, let’s consider an example. Imagine you are building a mobile application that allows users to find restaurants near their current location. The main use case would be for the user to see a list of restaurants within walking distance so they could pick the restaurant they like and quickly have something to eat. A simple implementation of that application could check the GPS coordinates, build a query string containing the user’s current location, and request the list of nearby restaurants from the application server. The request to the web application programming interface (API) could resemble Listing 6-14.
The problem with this approach is that request parameters will be different for almost every single request. Even walking just a few steps will change the GPS location, making the URL different and rendering your cache completely useless.
A better approach to this problem would be to round the GPS location to three decimal places so that each person within the same street block could reuse the same search result. Instead of having billions of possible locations within the city limits, you reduce the number of possible locations and increase your chances of serving responses from cache. Since the URL does not contain user-specific data and is not personalized, there is no reason why you should not reuse the entire HTTP response by adding public HTTP caching headers.
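As an illustration (the key format and precision are arbitrary choices), the rounding could look like this; three decimal places of latitude correspond to roughly 100 meters:

```python
def search_cache_key(latitude, longitude):
    """Round coordinates so all users within roughly one street block share a key."""
    return f"restaurants:{round(latitude, 3)}:{round(longitude, 3)}"

# two users standing a few steps apart produce the same cache key
print(search_cache_key(-33.8670522, 151.1957362))
print(search_cache_key(-33.8671890, 151.1958021))
```

Every collision in this function is a potential cache hit: two nearby users now generate identical URLs, so the second request can be satisfied from cache.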
If you were serving restaurants in Sydney and you decided to round the latitude and longitude to three decimal places, you would reduce the number of possible user locations to less than one million. Having just one million possible responses would let you cache them efficiently in a reverse proxy layer (or even a dynamic content CDN). Because restaurant details are unlikely to change rapidly, you should be able to cache service responses for hours without causing any business impact, increasing your cache hit ratio even further. Listing 6-15 shows how the structure of the URL remains the same and just the request arguments change, reducing the number of possible URLs being requested.
This principle of reusing the same data for many users applies to many more scenarios. You have to look for ways that would let you return the same result multiple times rather than generating it from scratch. If it is not possible to cache entire pages, maybe it is possible to cache page fragments or use some other trick to reduce the number of possible cache keys (as in my restaurant finder example). The point is, you need to maximize the cache hit ratio, and you can only do it by increasing your cache pool, extending the TTL of your objects, and decreasing the number of potential cache keys.
If you ever find yourself supporting an existing web application that does not have enough caching, you have to ask yourself, “Where do I start? What are the most important queries to be cached? What pages are worth caching? What services need to be cached the most?” As with any type of optimization, to be successful, you need to prioritize based on a strict and simple metric rather than depending on your gut feeling. To prioritize what needs to be cached, use a simple metric of aggregated time spent generating a particular type of response. You can calculate the aggregated time spent in the following way:
aggregated time spent = time spent per request * number of requests
This allows you to find out which pages (or resources) are the most valuable when it comes to caching. For example, in one of my previous jobs I worked on a website with fairly high levels of traffic. We wanted to scale and improve performance at the same time, so we began looking for opportunities to cache more aggressively. To decide where to start, I used a Google Analytics report and correlated traffic stats for the top 20 pages with the average time needed to render each of these pages. Then I created a ranking based on the overall value, similar to Table 6-2.
Table 6-2 Page Ranks Based on Potential Gain from Caching
If you look closely at Table 6-2, you can see that improving performance of the home page by 5 ms yields a greater overall saving than improving performance of the second most valuable page by 10 ms. If I had gone with my gut feeling, I would most likely have started optimizing and caching in all the wrong places, wasting a lot of valuable time. By having a simple metric and a ranking of pages to tackle, I managed to focus my attention on the most important pages, resulting in a significant capacity increase.
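A back-of-the-envelope version of such a ranking can be computed in a few lines. The page names and numbers below are made up for illustration; real figures would come from your analytics and monitoring tools.

```python
# (page, average generation time in ms, requests per day) -- hypothetical numbers
pages = [
    ("home page",       50, 1_000_000),
    ("search results", 200,   100_000),
    ("product page",    80,   150_000),
]

# rank by aggregated time spent = time spent per request * number of requests
ranked = sorted(pages, key=lambda p: p[1] * p[2], reverse=True)
for name, avg_ms, requests in ranked:
    print(f"{name}: {avg_ms * requests / 1000:,.0f} s of generation time per day")
```

Note how the home page dominates despite being the cheapest page per request; a high request count can outweigh a much higher per-request cost.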
“There are only two hard things in computer science: cache invalidation and naming things.” –Phil Karlton
The last rule of thumb is that cache invalidation becomes difficult very quickly. When you initially develop a simple site, it may seem easy. Cache invalidation is simply removing objects from cache once the source data changes to avoid using stale objects. You add an object to cache, and any time the data changes, you go back to the cache and remove the stale object. Simple, right? Well, unfortunately, it is often much more complicated than that. Cache invalidation is difficult because cached objects are usually a result of computation that takes multiple data sources as its input. That, in turn, means that whenever any of these data sources changes, you should invalidate all of the cached objects that have used it as input. To make it even more difficult, each piece of content may have multiple representations, in which case all of them would have to be removed from cache.
To better illustrate this problem, let’s consider an example of an e-commerce website. If you used object caches aggressively, you could cache all of the search queries that you send to the data store. You would cache query results for paginated product lists, keyword searches, category pages, and product pages. If you wanted to keep all of the data in your cache consistent, any time a product’s details changed, you would have to invalidate all of the cached objects that contain that product. In other words, you would need to invalidate the query results for all of the queries, including not just the product page, but also all of the other lists and search results that included this product. But how will you find all the search results that might have contained a product without running all of these queries? How will you construct the cache keys for all the category listings and find the right page offset on all paginated lists to invalidate just the right objects? Well, that is exactly the problem—there is no easy way to do that.
The best alternative to cache invalidation is to set a short TTL on your cached objects so that data will not be stale for too long. It works most of the time, but it is not always sufficient. In cases where your business does not allow data inconsistency, you may also consider caching partial results and going to the data source for the missing “critical” information. For example, if your business required you to always display the exact price and stock availability, you could still cache most of the product information and complex query results. The only extra work you would need to do is fetch the exact stock and price for each item from the main data store before rendering the results. Although such a “hybrid” solution is not perfect, it reduces the number of complex queries that your data store needs to process and trades them for a set of much simpler “WHERE product_id IN (….)” queries.
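A sketch of that hybrid approach follows. All names and data are hypothetical; the dictionary-backed cache and in-memory "database" stand in for real systems. The expensive listing query is cached, while price and stock are merged in fresh on every request.

```python
import time

class DictCache:
    """Tiny in-memory stand-in for a real object cache such as Redis."""
    def __init__(self):
        self._data = {}

    def get(self, key):
        value, expires_at = self._data.get(key, (None, 0.0))
        return value if time.time() < expires_at else None

    def set(self, key, value, ttl):
        self._data[key] = (value, time.time() + ttl)

# Hypothetical data; in a real system these would be database tables.
PRODUCTS = {1: {"id": 1, "name": "Mug"}, 2: {"id": 2, "name": "Lamp"}}
LIVE = {1: {"price": 9.99, "stock": 3}, 2: {"price": 24.50, "stock": 0}}

def expensive_category_query(category_id):
    return [dict(p) for p in PRODUCTS.values()]   # pretend this is a slow query

def fetch_price_and_stock(product_ids):
    # cheap "SELECT ... WHERE product_id IN (...)" against the primary store
    return {i: LIVE[i] for i in product_ids}

def render_product_list(cache, category_id):
    key = f"category-listing:{category_id}"
    products = cache.get(key)
    if products is None:
        products = expensive_category_query(category_id)
        cache.set(key, products, ttl=3600)        # heavy result cached for an hour
    fresh = fetch_price_and_stock([p["id"] for p in products])
    # merge the always-fresh fields into the cached listing objects
    return [{**p, **fresh[p["id"]]} for p in products]
```

Even when the listing itself is served from cache, users always see the current price and stock, because those fields are fetched from the primary data store on every request.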
Advanced cache invalidation techniques are beyond the scope of this book, but if you are interested in learning more about them, I recommend reading two white papers published in recent years. The first one (reference w6) explains a clever algorithm for query subspace invalidation, where you create “groups” of items to be invalidated. The second one (reference w62) describes how Facebook invalidates cache entries by adding cache keys to their MySQL replication logs. This allows them to replicate cache invalidation commands across data centers and ensures cache invalidation after a data store update.
Due to their temporary nature, caching issues are usually difficult to reproduce and debug. Although cache invalidation algorithms are interesting to learn, I do not recommend implementing them unless absolutely necessary. I recommend avoiding cache invalidation altogether for as long as possible and using TTL-based expiration instead. In most cases, short TTL or a hybrid solution, where you load critical data on the fly, is enough to satisfy the business needs.
Caching is one of the most important scalability techniques, as it allows you to increase your capacity at relatively low cost, and you can usually add it to your system at a later stage without the need to significantly rearchitect your system. If you can reuse the same result for multiple users or, even better, satisfy the response without the request ever reaching your servers, that is when you see caching at its best.
I strongly recommend getting very familiar with the caching techniques and technologies available on the market, as caching is heavily used by most large-scale websites. This includes general HTTP caching knowledge (reference 42) and caching in the context of RESTful web services (reference 46), in addition to learning how versatile Redis can be (reference 50).
Caching is one of the oldest scalability techniques with plenty of use cases. Let’s now move on to a much more novel concept that has been gaining popularity in recent years: asynchronous processing.