CHAPTER 3

Content Caching: Keeping the Load Light

Caching is one of the key ingredients for an awesomely fast and responsive web site. Yet it’s often neglected. It’s not that caching is particularly difficult to understand or hard to set up, but it’s a very broad topic, so it’s often difficult to know where to start. Caching requires a fair bit of background knowledge because as soon as you start using it, you’re inviting all sorts of potential problems that will be a real nightmare to track down if you’re not up to speed. Actually, even if you know exactly what you’re doing, trying to diagnose issues caused by various different layers of caching and how they may (or may not) interact can drive anyone mad.

Technical books naturally keep pretty close to their core themes. If you buy a book on PHP programming, for example, you would likely be surprised to find a chapter dedicated to Perl or Ruby on Rails. However, when an author wants to introduce caching, they have to compromise: they need to cover enough content to make it useful but it’s quite easy to get mired in technical detail that seems important (and under certain circumstances is essential) yet detracts from the main ideas. This is one reason why many books don’t touch on the topic at all or gloss over it when they do. Thus many developers end up without a solid understanding of what caching is and when and how to use it.

We have seen this with countless web design consultants. They create beautiful web sites and the web applications work beautifully. They are a joy to use and lovely to behold. Unfortunately, they are vulnerable to Cross Site Scripting attacks and SQL injection, amongst other things. How can these consultants design such great sites yet make these fundamental mistakes? Simple. When they learned PHP or whatever language, the books didn’t cover these topics, and they haven’t come across them in their travels. It’s not that they’re stupid, unprofessional, or don’t care; it’s just that they didn’t know these things existed.

And so it is with caching. Caching tends to fall between software development and system deployment, and it has a nasty habit of disappearing into the gap between the two. This chapter, although one of the easiest in the book, is also potentially one of the most important. It explains a technology that will give your web site a big kick in the pants. And it will do all this with very little in the way of work from you.

In the first section we’re going to look at what caching actually is, the different technologies, and how they fit together. We won’t go into much detail; we’ll just lay the groundwork so you can see how all the bits fit together and what their place is in the grand scheme of things. This will pay off in spades in the following sections when we go through a couple of the topics in more detail and start getting into the applications themselves.

Before you get to play with the cool toys, you need to learn some theory. Fortunately, you can get away with just a few key concepts. Then we’ll look at why what appears at first to be a simple problem is actually rather challenging. We will look at how people have tried to solve this problem over the years and the techniques used to make it work. We’ll do our best to keep it light, but you need to have a rough idea of what’s going on in order to appreciate some of the benefits (and challenges) that you’ll come across in the rest of the chapter.

For the most part, the in-depth sections can be read independently. Where there is synergy between different layers, we will point you to the relevant section for additional information.

Lastly, due to space constraints, we have had to scale back how much caching goodness we can expose you to. That said, rest assured that the content we cover will get you up and running quickly at least to the point where you can benefit from the technology. Where there are any major gotchas or limitations, we’ll make sure to highlight them and provide further information.

What Is a Cache?

A cache is a collection of items that are stored away for future use. Squirrels go to great lengths to find and store nuts (cache comes from the French verb cacher meaning “to hide”). They hide them where they think no one will ever find them. These hidden stores are caches and are very important to the squirrel; when a squirrel discovers his nuts are missing, he gets very upset and storms about looking possessed.

Fortunately, caches aren’t necessary for our survival, but they are necessary for super fast web sites. Like a squirrel foraging for nuts, it is much faster to go to a location close by and help yourself from your store than it is to go into the freezing weather and hunt around in the (quite possibly vain) hope of finding some. One is clearly easier and faster than the other, and this is what makes caching so beneficial.

Note: How do you pronounce cache? The problem with words like cache is that although there are two generally accepted pronunciations for the word, the chances are you’ll only ever hear one of them. Then one day you have a meeting with someone who pronounces it the other way, and you have absolutely no idea what they’re talking about. At least when pronouncing Linux, the two versions are reasonably similar. However, cache can be pronounced either as “cash” or as “kay-shh.” If you’re not expecting it, this can cause you to adopt the expression of a stunned mullet. Fortunately, now that you know about it, your brain won’t crash when you hear cache pronounced differently!

Whistle Stop Tour

When people talk about caching, it’s usually within a very specific context. They might be talking about a web browser or web proxy. Maybe they’re even talking about edge-based caching systems. Or perhaps they’re talking about something else altogether. The problem with discussing caching is that so many related technologies all come under the caching banner. Even if we limit ourselves to just the technologies that play a major role in web application performance, we’ve still got more systems than you can count on one hand. Now “more than five” might not sound all that impressive, but when you consider that you’ll have to deal with these potentially all at the same time while simultaneously trying to avoid other possible pitfalls, it’s not exactly insignificant, either.

Browser-based Caching

Most people today are in the enviable position of having broadband. Not so long ago, broadband was spoken about in reverent tones. Those lucky souls who had a 2MB ADSL connection suddenly found a fan club of people hoping to share in their good fortune. It was always somewhat depressing to get home and sit through the thrashing warbling noise that our 56K dial-up modems would make while trying to get online.

Sadly, we can distinctly recall visiting Google and watching as the Google logo would download and slowly update the screen. Few people these days believe us when we say just loading Google’s front page took four seconds, but honestly it did. Then we would do our search, get the results page, and then wait for the cursed logo to download again. Thanks to the primitive nature of the browsers we were using (which, in fairness, were state of the art at the time), we had to wait another four seconds before we could see our results. If we’d made a spelling mistake in our query, we’d have to resubmit the request—and wait another four seconds for the logo to reach us.

Fortunately, others decided that this situation was intolerable and set about fixing it. The problem was caused by having to wait for the logo to crawl along the phone line. Although the computer could easily have drawn the logo in an instant, it couldn’t draw it before it received it. The obvious solution is to simply remember that we’ve already seen this particular file before. Sounds easy enough, but how do you know that the file is going to be the same before you’ve downloaded it? The answer is “with great difficulty,” but we’ll come back to that in the section on Caching Theory.

Having the browser cache content made a massive difference in performance. By keeping the files on the machine itself and potentially even storing previously rendered pages, the network connection is effectively bypassed. (Ever wondered why when you press the “Back” arrow, the page loads instantly?) When you only need to download a small amount of text rather than bandwidth-hogging logos, the world just seems to be a nicer place!

Web Accelerators

We’re going to take a slight detour here to touch on a technology that came to be known as web acceleration. There were various applications that you could download (usually for a modest fee) that would promise to make your Internet connection a million times faster. The more legitimate applications did actually make things seem much faster. In fact, when you clicked on the next page, the whole thing would appear instantly. The interesting thing was when you looked at how fast the page appeared and you looked at the maximum theoretical speed of your modem (which no modem ever managed to reach), the numbers just didn’t add up.

This performance boost came from sleight of hand—and an ingenious one at that. When it comes to browsing the web, for the most part you have a quick burst of traffic as you load a page and then silence while you read through it. While you’re looking around, the connection is sitting idle.

The web accelerator applications would take advantage of this idle time by examining the links on the page and downloading them in the background. Because this used the connection only when it was idle, the user didn’t feel any impact. If the user then clicked one of the links the accelerator had already downloaded, the files were already at hand, so the page loaded instantly. While the user was happily reading the new page, the web accelerator was busy downloading more content in the background. In other words, the best case scenario was an instant page load. The worst case was the speed that it would have taken anyway without the accelerator. In effect, the user couldn’t lose.

It also helps to remember that modem connections were generally billed based on how long you remained connected to the server. Rather than being charged for how much you downloaded, you were charged based on how long you were actually online. So there was no cost penalty for downloading pages that were never viewed.

Today, with broadband and mostly dynamic content, these tools aren’t particularly useful any more. Many people now pay for the bandwidth they use as their connection is “always on.” So when someone talks about a web accelerator today, they are almost certainly referring to a cache that sits in front of a server. If you need to refer to the older solution, you can call it a “preemptive cache” (the name is a good fit because the cache would preemptively fetch pages in the hope that it would download the pages that you were going to visit).

PERCEIVED PERFORMANCE

Web Proxies

Despite the success of browser-based caches, they had one big limitation. Because they were built into the browsers themselves, it meant that each browser would have its own copy of a given file. For home users, this was rarely a problem, but for businesses it became a real nuisance.

Generally speaking (although it’s not as true today) businesses pay more for Internet connections. They often have service-level agreements (SLAs) to guarantee availability or uptime, which in theory provide a superior connection. In exchange, businesses pay more than a home user might pay. As these connections were usually shared across a network, several people could use the connection at the same time. As e-mail and e-commerce took off, businesses spent even more time sending data to and from the Internet.

This created two issues. First, bandwidth was expensive. Either the business paid for what was downloaded or they paid for an unlimited connection at a suitably (much) higher fee. The second issue was performance. If the staff relied on the Web for their daily tasks, and if the Internet then slowed down to a crawl, it affected the company’s performance as a whole. Although the business could simply get a faster connection, this would come at a significant cost. What was needed was a way to give better performance while saving money.

Web proxies meet this goal in two ways. First, the proxy sits between the web browsers and the Internet. Rather than having each browser go out on the Web directly, requests are relayed through the proxy. Ultimately, only the proxy ever initiates connections to the Internet itself.

The second part of the solution is an adaptation of the cache already found in browsers. When User A visits Google, she makes the request to the proxy. The proxy makes the request on her behalf and downloads all the relevant files, including the logo. The proxy stores these files and keeps track of when and where all the files came from. When User B decides to pay Google a visit, the proxy can serve the logo directly to the browser without having to make a request to Google. As the connection between the user and the proxy is extremely fast, the logo appears to download instantly.

When you have a network with a couple of thousand computers on it and all of them are browsing the Web, a good web proxy can make a massive difference to the amount of bandwidth being used and it can greatly increase performance. While there are many millions of web sites out there, the same ones keep on cropping up time and time again. Google, Facebook, YouTube, Amazon, eBay; these sites are so popular that they are constantly being viewed over and over again. In many cases different people are reading exactly the same content. For this scenario, web proxies are invaluable; they save money and boost performance!

Transparent Web Proxies

It’s not just companies and universities that are in on this web proxy business. In fact, almost all ISPs and network providers make extensive use of the technology. If you think about it, it makes a great deal of sense. After all, even a small ISP could have thousands of customers, some of which are big businesses using their own proxies to cut their costs.

So, to avoid being left out, ISPs started getting their feet wet. Remember, for most entities, sending data on their own network is either free or cheap enough not to be a concern. However, sending data through a network owned by someone else can be very expensive indeed. Just as businesses try to avoid sending too much data to their ISP, ISPs want to avoid sending any more data than necessary to other networks—the same idea writ large.

ISPs used to offer web proxies directly to customers and many also hosted mirrors of popular FTP sites and web sites. This was great for the users because the data they wanted only had to come from the ISP instead of potentially from the other side of the world. The ISPs were rubbing their hands with glee because although they had to provide resources to host these mirrors, it was basically free compared to the cost of thousands of people all dragging down the latest Linux release from outside the network.

The big problem with ISPs offering a web proxy was getting people to actually use it. When the Internet first started getting popular, the first wave of users was fairly technical. They understood the advantages of using local web proxies, and they understood that they could benefit by telling their local web proxy to get its content from the ISP’s proxy. This built a chain of proxies, each slightly slower than the last but each holding more content. Even so, even the slowest proxy in the chain was probably faster than the web site itself.

But time moved on and people stopped caring about web proxies, local or otherwise. With broadband especially, even those who did know about the technology tended not to use it. This was a big backward step for the ISPs because now they had even more users making even greater demands from the network and none of them were playing nice and using the web proxy. An ISP couldn’t force the issue (people could always change to a new ISP) and most of the new generation of users had no idea what a proxy even was.

The solution to this problem came in the form of transparent proxies. These are a combination of web proxy and some nifty networking tricks that allow a web proxy to intercept a request from a web browser and take it upon itself to fetch the page. The web proxy then returns the cached page to the browser, and the browser has no idea that it’s not talking directly to the real site.

This solution is very powerful because it’s pretty hard to circumvent. If the ISP redirects all traffic on port 80 (the standard web server port) then unless you use a completely different port, you’re going to go through their system. As almost every web site in the world runs on port 80, this is a very effective solution.

Now for the downside. You can’t see the proxy and have no way to know how many of them are between you and the web site you’re visiting. Perhaps there are none, but there could easily be several. Why does this matter? Because sooner or later someone is going to visit your site and it’s going to break because some random proxy somewhere decided to cache something it shouldn’t. If you don’t provide proper caching headers (we cover this in the “Caching Theory” section, but caching headers are simply instructions you’d like the proxy to follow), the proxy will make a “best effort” guess as to what it should cache. It won’t always get it right. Most caches are conservative, but in reality a web cache will cache whatever it is told to cache. To make matters worse, the dynamic nature of the Internet means that a given user might not always go through the same cache, which can cause what appear to be random, spurious errors. And because your own Internet connection (hopefully) doesn’t go through that particular broken cache, no matter what you do, you can’t recreate the error.

So, if your web application works beautifully from one location but not from another, perhaps it’s something between you and the site causing the problem rather than something you’ve done wrong.

WEB SITE WEIRDNESS

Edge-based Caching

Depending on who you talk to, “edge” can mean different things in a network context. Generally speaking, it refers to a server that’s as close as possible to the final user but still part of the provider’s infrastructure. For example, in Chapter 5 we look at content delivery networks, which often put your data so close to the user it’s almost in their living room. (We focus on Rackspace, which uses Akamai, but others include Amazon’s CloudFront and CacheFly.) Well, okay, maybe just their country or city (if it’s big enough), but accessing a file from up the road is significantly faster than downloading it from the other side of the world.

However, we’re going to look at something a bit different here. This sort of cache usually sits between your application servers and your users. Although for big sites these caching servers sit on dedicated hardware, you can often find the cache and the application server on the same machine. This seems somewhat counterintuitive; if a server was already underperforming, how do you improve performance by adding to the machine’s workload?

The performance benefit comes from how application servers work. Generally, they are fairly heavy-duty applications sporting lots of features and doing lots of work. They run the actual web application, so they need to be as fast as possible. The problem is, although you could have the web server facing the Internet directly, this means that your web server is handling various different tasks. It’s processing orders and updating the stock control system, and it’s also busy serving graphics, stylesheets, JavaScript, and all sorts of other miscellaneous things. This is akin to using a sledgehammer to crack a nut, with the additional concern that while dispensing justice to the nut, the sledgehammer is not doing the important (money-making) work that it should be doing.

Edge caching fixes this by sitting in front of the application server and handling all the requests from the various users. However, in a similar fashion to a web proxy (these caches are sometimes called reverse proxies), the cache tries to answer as many requests as it can without involving the web server. For example, if the front page of a web site is very rich in multimedia, this could put a serious strain on an application server. By placing a cache in front of it, the application server only needs to serve that static content once. From then on, the cache will send it back to the user.

This works extremely well because the cache is designed specifically to deal with requests in this fashion. It is created for the sole purpose of being very, very fast at getting static content back to users. This is all transparent from your application server’s point of view. It simply sees fewer requests and so has less work to do over all.

As with other caches, it’s really important that the cache doesn’t cache things it shouldn’t. Remember that the application server won’t have any way to tell that something has gone wrong. For this reason, Varnish, one of the best web accelerators (told you this would crop up), only caches content guaranteed to be safe. You can tailor it specifically to your application to eke out every last bit of performance, but in its freshly installed state it won’t do anything that might mangle your site. Varnish is one of the applications we cover later in this chapter.

Platform Caching

Platform caching is provided either by the framework you’re using to write your application (such as the PHP-based Yii Framework or the ever popular Ruby on Rails) or by a particular library that caches for you (such as the Smarty templating library for PHP). This sort of caching is considered different from application caching (covered later) because it’s not usually something you have to implement specifically for your application. In other words, you turn on the feature and it just works, rather than having to make specific calls to methods or functions provided by the framework (which is considered application caching).

Page caching, for example, doesn’t much care what the page is, how it was created, or even what language was used. Even when caching individual requests or parts of a page inside an application, often you can leave the heavy lifting to the framework or library. In short, although you might need to think before you make use of your framework’s caching support, it is usually the work of a few moments rather than hours of planning.

Because of the huge number of frameworks and libraries out there (each with its own ideas on how best to handle the problem) there simply isn’t enough space for us to give even the most popular ones more than a cursory mention. Even if we picked a particular framework to focus on, unless it happened to be your framework of choice, it wouldn’t add any value for you.

Instead, if you’re keen to give platform caching a go (and you should certainly at least find out what your platform offers), check out your framework’s manual or web site. Caching is usually a complicated enough subject that frameworks dedicate a whole section or chapter in their manuals on how to use it properly. This will give you more information on how your framework uses caching plus tailored examples and other platform specific bits and pieces that you should be aware of.

Application Caching

Application caching is similar to platform caching but rather than relying on a framework or plugin to do the caching for you, you do the heavy lifting yourself. There are numerous reasons why you might want to do this. For a start, each web application is different and while most frameworks can give you the basics of template caching and so forth, sometimes you need more.

For example, your site might have a front page that shows the latest news and stories from around the world. The problem is you have so many visitors it’s killing your site. You can’t keep dynamically creating the page because it is just too resource-intensive, but you don’t want to rely on basic static page caching because when you delete the cached page, there will be an upsurge in traffic until the new page is generated. Even then, how long do you cache for? Cache for too short a time and you will still have performance problems. Cache for too long and you defeat the whole purpose of having the site in the first place. What you need is some sort of compromise.

And this is where taking responsibility for your own caching can pay dividends. For example, with the above scenario, you could use memcached. Memcached is an extremely fast key/value store. You put data in and you take data out. That’s pretty much it. The lack of advanced features is one of its key strengths as it can implement the store in a very efficient way. Memcached is used on some of the world’s busiest web sites.

So how would this solve the previous problem? Well, the first thing we do when someone visits our site is see whether that page is in memcached. If the page is in there, we simply retrieve the page and send it back to the user. We don’t need to do a database query and we don’t need to build a new page. If we don’t find the page in the cache, we follow the standard page building procedure as before, but this time once the page is complete we insert it into memcached, which ensures it will be there for the next request.
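To make this concrete, here is a minimal sketch of that check-the-cache-first pattern using PHP’s Memcached extension. The key name front_page, the 60-second expiry, and the build_front_page() function are purely illustrative; the page-building step would be whatever your application already does.

<?php
// Minimal "check the cache first" sketch using the PHP Memcached extension.
// The key name, the expiry time, and build_front_page() are illustrative only.
$cache = new Memcached();
$cache->addServer('127.0.0.1', 11211);

$html = $cache->get('front_page');
if ($html === false && $cache->getResultCode() == Memcached::RES_NOTFOUND) {
    // Not in the cache: build the page the expensive way...
    $html = build_front_page();
    // ...then store it so the next visitor gets the cached copy.
    $cache->set('front_page', $html, 60);
}
echo $html;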

Memcached allows you to set expiry times for your documents, so if you set a document to expire after 20 seconds, attempting to request the document after that time will fail and the page generation process will start all over again. For many people this works well, but it’s still a little bit passive. Specifically, your cache control is based on how much time has passed, regardless of whether any new stories have been added.

One way to deal with this is to create a whole new cached page every minute or so. Even if your site is being hammered by a thousand people, they will all be sent the data in memcached so the server itself is not under much load. Every minute the page in the cache is replaced with a new one, and when people refresh, they simply get the new page. Generating one page a minute is far better than trying to do it thousands of times.

But you can take it further still. Instead of running off a timer, you can simply update memcached when a new story has been added. So if an hour goes by without any new stories, you will only have generated that page the once. When a story finally gets added, you can create the new page and then load it into the cache.
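Continuing the sketch above (and again using invented helper names), the event-driven version simply refreshes the cached page whenever a story is published rather than on a timer:

<?php
// Hypothetical publish path: regenerate the cached front page only when a
// new story is added, and keep it until the next story replaces it.
function publish_story($cache, $story)
{
    save_story_to_database($story);                    // illustrative helper
    $cache->set('front_page', build_front_page(), 0);  // 0 = never expire
}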

You’ve gone from the server doing all the work and handling every single connection itself to making the server responsible only for sending out a prebuilt page and generating a new version only when required. In short, caching at this level provides a huge potential for performance enhancement but it requires really knowing your application inside out. Because memcached is deliberately simple, you will need to build your own logic to take advantage of these features. If you’re up for the challenge, you certainly won’t be disappointed!

Database Caching

Database caching, like databases themselves, is often considered a black art. There is a reason why database experts can charge such large amounts for just a day’s work. A good consultant can provide massive improvements for very little effort, just because he or she knows the database inside out and can make it sing and dance.

Out of the box, most databases are pretty fast and they tend to be configured for general performance. In other words, while the performance might not exactly blow your mind, it should be pretty respectable. To turn respectable into warp speed requires a lot of tweaking of the database itself, which disks it uses and how it uses them, the type of tables your database uses, and how you write your queries. There are many other things that can (and do) impact performance, and what works well under one type of load might fail on another. For example, if you have two queries, you might be able to run 1,000 of Query A and 1,000 of Query B before your server runs out of juice. But if you run them both at the same time, you might be surprised to see that you can only run 300 of each. This is because the different queries stress different parts of the system and in different ways.

Yes, trying to optimize and get the most out of a database is really hard. As soon as you change a query or add some new ones, you might find you need to recalculate all those details all over again!

Fortunately, you can leverage application level caching to relieve much of the burden from the database. When it does less work, it can often handle the remainder without any trouble. For this reason, we won’t be opening this particular Pandora’s Box in this book.

Just the Beginning…

Although this section was only meant as an overview, we actually crammed an awful lot of content into a rather small space. If we were sitting at a cafe talking to you about these technologies, each one could easily last for at least two cappuccinos. With that in mind, don’t be too surprised if you’ve come this far and still have unanswered questions.

Caching Theory: Why Is It so Hard?

The fundamental problem with caching is quite simply that web browsers don’t have telepathy. If we send a request right now, the response might be completely different from the response we would have received if we had sent the request 10 minutes ago or if we send it again in 10 minutes’ time.

Most of the issues with caching flow from this problem. We only want to download a resource if that resource is different from the last time we downloaded it. But we can’t tell for sure if the file has changed unless we check with the server. It would be nice if we could rely on the server to inform us of any changes, but because HTTP is a request/response protocol, there is no mechanism for the server to initiate a request. Even if it could inform the browser, that would mean keeping track of many different clients, many of whom no longer care about the status of the file or resource. Because we need to make the initial request, this creates a whole host of other issues, such as how often to check for a new version. If we check too often, there’s not really much point caching the data in the first place. If we cache for too long, the user will see stale data, which can be just as bad as (if not worse than) having to wait ages for the data to arrive.

HTTP 1.0 Caching Support

HTTP/1.0 offered some very basic caching features. Although simple, they worked pretty well for the most part. Basically, a server could use the “Expires” header to tell the browser how long it could safely cache the file before requesting it again. The browser could then simply use the cached file until the expiry time was reached.

The second feature was the IMS (If-Modified-Since) conditional request. When the browser requested a file that it already had in the cache, it could set the IMS header to ask the server whether the file had changed since the time given. Remember, the time sent should be the time the file was last modified (the value the server supplied when the file was originally downloaded) rather than the time the file was due to expire. If the file hadn’t changed, the server could reply with the status code 304 - Not Modified. This response was tiny compared to the payload of the file itself, so although a request was still made to the server, it avoided using bandwidth unnecessarily. If a browser received the 304 status code, it could simply use the cached version of the file. If the file had been modified since the time sent, the server replied with the 200 - OK message and sent the file normally, just as though the IMS header had not been used.
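As an illustration (the file name and dates are invented), the exchange might look like this. The original response carries Last-Modified and Expires headers; once the Expires time has passed, the browser revalidates with If-Modified-Since and, if nothing has changed, receives only a short 304 reply.

HTTP/1.0 200 OK
Date: Sat, 28 Jan 2012 09:00:00 GMT
Last-Modified: Fri, 27 Jan 2012 18:30:00 GMT
Expires: Sun, 29 Jan 2012 09:00:00 GMT
Content-Type: image/gif
(image data)

GET /images/logo.gif HTTP/1.0
If-Modified-Since: Fri, 27 Jan 2012 18:30:00 GMT

HTTP/1.0 304 Not Modified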

Lastly, it was possible for a client to set “Pragma: no-cache” when sending a request. This indicated to any caches that handled the request that they should fetch the content from the live site and should not satisfy the request by sending data from their cache. The problem with this (and with the other features of HTTP/1.0) is that they were quite basic and didn’t provide a way to give specific instructions on how data should be cached or managed. There was an awful lot of guesswork involved; although the caching headers themselves were standardized across browsers, the actual meaning of the headers was somewhat vague. In short, the behavior of a cache was unpredictable and could break applications.

HTTP 1.1 Enhanced Caching Support

HTTP/1.1 was a massive advance on the previous version as far as caching was concerned. The biggest improvement was the creation of an extensible framework with specifications and fixed definitions. Terms that were used casually before were given strict meanings. This meant that the caching system was formalized and thus was relatively easy to implement in a standard and compatible way. It was possible to look at the options and know with a good deal of certainty how that content would be handled. This was simply not possible with the caching features found in HTTP/1.0.

According to the HTTP/1.1 standard, a cached object is “Fresh” until it reaches its expiry time. At this stage the object becomes “Stale” but that doesn’t necessarily mean it should be evicted from the cache. When an object becomes Stale it just means that the browser can no longer assume that the object is correct and so should revalidate it before using it. This can be done with the IMS conditional request that was used in HTTP/1.0. HTTP/1.1 expanded on IMS by adding other more specific conditions. For example, it is possible to request an object that hasn’t been modified after a certain time (If-Unmodified-Since, or IUS for short).

One problem with the IMS system is that it validates based on time. Time was quite useful when static pages and files were the norm and when the precision of a single second was more than sufficient for determining whether a file had changed or not. However, with modern applications, it’s quite possible that an identical URL will return a completely different file and equally possible that a completely different URL will return a file that is already in the cache.

To address this problem, HTTP/1.1 introduced entity tags, which are more commonly referred to as e-tags. These are unique keys for a given file or object. How the server creates the e-tag doesn’t matter as long as it ensures that the e-tag is unique for every different file. This is very powerful because it means you can request a resource and provide the exact file that you have and ask the server if the file has changed or not. Thus you don’t need to worry about time synchronization issues and can guarantee that you have the file you are supposed to have. You can also specify a list of e-tags that you think might satisfy a request. This means that the server can check multiple files to see if they meet your needs. If one of them has the valid e-tag, the server will generate a 304 - Not Modified response and supply the e-tag of the correct file. Otherwise it will send a normal 200 - OK response.
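For example (the e-tag value here is invented), a revalidation using e-tags looks like this:

GET /news HTTP/1.1
Host: www.example.com
If-None-Match: "a1b2c3d4"

HTTP/1.1 304 Not Modified
ETag: "a1b2c3d4"

If the content behind /news had changed, the server would instead reply with 200 - OK, the new body, and a new ETag header.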

One of the issues with the HTTP/1.0 cache control offering was that the only way you could issue an IMS-type request was by sending a specific time to the server. The problem with this approach is that every machine has its own clock, and even when attempts are made to keep that clock in sync with the rest of the world, those clocks tend to drift. It’s not uncommon for machines to be out of sync by 10 or even 20 minutes. If you have an application that needs caching accurate to the second, you’re pretty much out of luck.

In HTTP/1.1, though, you can use the max-age directive (part of the Cache-Control header) to specify a relative amount of time rather than referring to a specific point in time. This can be combined with the Age header, which is used by caches to show how long the data has been sitting in the cache. This gives the browser much more flexibility and transparency when determining whether or not it needs to send another request.

HTTP/1.1 also sports additional directives, such as being able to request that the data is not altered in transit (the no-transform directive), is not stored by shared caches such as proxies (the private directive), and is not stored anywhere at all (the no-store directive). It is important to note that although these are called directives, in reality they can’t be enforced. Good proxies will probably honor them, but there is no way to force that behavior. In short, these directives should be used where appropriate, but you should not depend on them. If you are looking for security and want to prevent caching of content at intermediary caches, you should almost certainly use SSL.
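Pulling the last few paragraphs together, a cacheable response that has passed through a shared cache might carry headers like these (the values are illustrative):

HTTP/1.1 200 OK
Date: Sun, 29 Jan 2012 12:00:00 GMT
Cache-Control: max-age=300, no-transform
Age: 120
Content-Type: text/css

Here the origin server says the stylesheet may be reused for 300 seconds and must not be altered in transit; the Age header, added by the cache, shows the copy has already been stored for 120 seconds, so the browser knows it remains fresh for another 180. A response carrying per-user data would instead send something like Cache-Control: private, no-store.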

The Solution

These are the main caching control techniques that you will see in use today. There are more exotic options available but, while they might be able to give you a boost in specific circumstances, it’s unlikely that you’ll get as much benefit from them as you will from the others. This is partly because not all proxies support all of the options and partly because there can be a lot of work involved in adding support in your application to make it take advantage of the new features. You might also find that once you add support to your application, the stuff you’ve added to use the new feature may actually add more delay than the new feature will ultimately remove.

In short, go for the lowest hanging fruit first. IMS, for example, is probably the easiest to implement from a programming or development point of view. If you use it, you can cut down on content generation as well as bandwidth used. Once you have it working, maybe look at e-tags if it seems appropriate for the way your application works. There are plenty of things that you can do with the simple stuff before you venture into the darker arts of caching!
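As a rough idea of what that looks like in application code, here is a PHP sketch of IMS handling; $lastModified is a stand-in for however your application tracks when its content last changed.

<?php
// Sketch: answer If-Modified-Since requests with a 304 where possible.
// $lastModified is illustrative; use your own "last changed" timestamp.
$lastModified = strtotime('2012-01-29 12:00:00');

header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $lastModified) . ' GMT');

if (isset($_SERVER['HTTP_IF_MODIFIED_SINCE']) &&
    strtotime($_SERVER['HTTP_IF_MODIFIED_SINCE']) >= $lastModified) {
    header($_SERVER['SERVER_PROTOCOL'] . ' 304 Not Modified', true, 304);
    exit;  // nothing else to send; the browser uses its cached copy
}

// Otherwise fall through and generate the full page as usual.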

CACHING SLOWS THE INTERNET

Caching Isn’t as Easy as It Looks

This section discussed the problems faced by the early World Wide Web. We looked at the caching support available in both the HTTP/1.0 and HTTP/1.1 protocols and gave a brief description of how they were used to solve at least some of the problems. We also touched on the issue of caching slowing down the Internet. This is still a relatively young idea but it is certainly gaining some momentum.

The key take-away from this section is that while caching sounds easy on the surface, there are a lot of things that can (and do) go wrong. By getting a feel for the tools available to solve these problems, you will have a much better sense of how caching works and the issues you might come up against.

Web Proxies

You might be surprised to find this in-depth section on web proxies, as these are predominantly used by end users or their ISPs. Even if ISPs are using transparent proxies, there’s nothing we can do about it, so why is it being covered here?

Although you won’t use a web proxy to directly enhance the performance of your web site, you can use one to test and tweak your site. For example, you can use it to see which files are being cached and can thus ensure it is doing what you expect. Transparent caches generally operate in the same way as a standard web cache (they also usually employ the same software) so you can kill two birds with one stone if you run your site through a proxy.

Whether or not this is worth doing initially is really down to the type of web site you’re testing and how much you need to tweak. Most web sites will benefit from caching in some form or another. Generally speaking, unless an application is doing something unusual, a web proxy won’t have much effect. That said, if you do happen to hit a brick wall with your application and you’re not sure why, firing up a proxy might provide some quick and easy insight into what’s going on under the covers.

The Squid Proxy Server

Squid is probably the most popular all-purpose web proxy (or web cache) available today. It was first released in 1996 and has long been the standard web proxy included by most Linux distributions. Just as most people think Apache when they need a web server, most people immediately reach for Squid when they need a web proxy. It certainly helps that Squid is packed full of features including the ability to create complex caching hierarchies (web caches using web caches), in-depth logging, high performance caching, and excellent user authentication support.

Squid also has a strong community with various plug-ins available. Although you might not use them for testing your web site, you might at some point find a use for them. SquidGuard, for example, is a plug-in that provides URL filtering to Squid. This is ideal if you want to block certain undesirable sites, such as advertising or adult material.

Tip: If you really are keen on setting up Squid in order to protect people from some of the darker areas of the Internet, a very good application to investigate is DansGuardian. It works in a similar way to a spam filter: it analyzes each page it downloads before sending it back to the browser. We have found that it is extremely effective at filtering content and is very easy to set up. Although it is not a Squid plug-in per se, it does require a proxy to connect through, and Squid is often used for this purpose. You can get more information on DansGuardian from its web site at http://dansguardian.org.

Getting Started

Before you can begin, you need to install Squid. Fortunately, Squid is a common package in almost all Linux distributions and is therefore really easy to install and get running.

For Ubuntu, the command is

sudo apt-get install squid

and for CentOS it is

yum install squid

Well, that was fairly painless! Both distributions include the standard example configuration file. For basic use, you don’t need to change much. By default Squid runs on port 3128 but only accepts requests from the machine it’s running on (localhost). Although you can connect from any other machine on the network (Squid listens on all your network interfaces), you will get an error page if you try to actually visit a web site. This prevents you from accidentally creating an open proxy, which would be a “very bad thing™”. The last thing you want is random people on the Internet accessing all manner of sites while making it appear like you’re the one doing the browsing!

Still, security aside, a web proxy that won’t proxy for you is not particularly useful, so you’re going to need to do some tweaking. You only really have to make one change because the default Squid config file is pretty decent. It is also what the vast majority of Squid caches out there are using, so chances are your Squid server will be very similar to other proxies in the wild.

So without further ado, edit the file with

vim /etc/squid/squid.conf

You can’t just add your rules to the bottom of the file because by default Squid adds a line that denies access from everywhere. If you put your rules after this line, they will have no effect. Instead, you must make sure that they appear above http_access deny all. Fortunately, you can search for “INSERT YOUR OWN RULE” to take you to the right spot. In vim, simply enter /INSERT YOUR OWN RULE and press Enter. Then press O in order to insert a new line. Remember, vim is case sensitive, so you must search for INSERT and not insert. The lines you want to add are

acl my_networks src 192.168.1.0/24 192.168.2.0/24
http_access allow my_networks

Squid uses access control lists (ACLs) to define who can access the proxy and what they are allowed to do once they do connect. ACLs allow for very flexible configuration using a relatively simple syntax. Even though the building blocks are pretty simple, you can build some very intricate rules with them. The ACL you just created (called “my_networks”) will match any request that comes from either of the two networks listed. It doesn’t specify what should happen when a match occurs; it just provides a convenient way to refer to these requests.

The next line actually puts this into action. http_access tells Squid that it should allow or deny proxy requests that match the provided ACL. In your case, you have told Squid that you want to allow web proxying from the two networks specified in my_networks. To sum it up, you’ve now configured Squid to allow anyone on your network to access the Web through the proxy.

Of course the networks listed are just an example and you will need to adapt them to match your own network. You can specify any number of networks in an ACL or you can just specify a single network, which is probably the most common setting for the majority of people.
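After the edit, the relevant part of squid.conf should read something like the following (the comments are paraphrased from the stock configuration file, and the networks are our example values):

# INSERT YOUR OWN RULE(S) HERE TO ALLOW ACCESS FROM YOUR CLIENTS
acl my_networks src 192.168.1.0/24 192.168.2.0/24
http_access allow my_networks

# And finally deny all other access to this proxy
http_access deny all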

Now that you’re all configured, it’s time to fire up Squid. On both Ubuntu and CentOS you can start Squid with

service squid start

As this is the first time you’ve run Squid, you will probably see a message about it initializing and setting up its cache directories. On modern machines this should only take a few seconds and it only needs to be done the once. The last thing you need to do is make sure that your browser can connect to Squid. The exact method for doing this depends on the operating system and the browser you are using. Most of them place the proxy options under Networking or Advanced Settings. You want to set it to use the proxy for all protocols and enter the IP address of your server and port 3128. All being well, you should now be able to get online!
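If you’d rather check the proxy from the command line before fiddling with browser settings, curl can send a request through it directly (swap in your own server’s IP address):

curl -x http://192.168.1.10:3128 -I http://www.example.com/

A response full of HTTP headers means the proxy is accepting and relaying requests; a Squid-branded error page means it’s running but your ACLs need another look.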

Note: You may get an error saying “Could not determine fully qualified hostname.” This just means that Squid can’t figure out which name to call itself (used when showing error pages and so forth). If you get this message, go back into the config file (see previous instructions), but this time search for TAG: visible_hostname (/TAG: visible_hostname<Enter>). Press O to insert a new line and then add visible_hostname myproxy (or whatever you’d like to call it). Save the file. You should now be able to start Squid.

Troubleshooting

If your browser complains that it can’t connect to the proxy server and you’re sure you’ve put in the right IP address and port, it could well be a firewall on your server. To allow access to port 3128 from anywhere on Ubuntu you can use

sudo ufw allow 3128/tcp

This assumes you’re using the ufw firewall management tool, which is simple to use but also pretty powerful. If you want to limit who can connect, you can also do

sudo ufw allow proto tcp from 192.168.1.0/24 to any port 3128

Again, you’ll need to make sure that the network matches your own and that it matches the one you configured in Squid. If you are using CentOS, you’re going to need to do some file editing if you want to use the standard iptables script. To do that, open the config file with

vim /etc/sysconfig/iptables

As before, press Shift+G to jump to the bottom of the file. You need to make sure that when you add this line, it appears above the COMMIT line and also above the last rule line, which rejects everything; if you add it after either of those two lines, it won’t work. Position the cursor on that final rejection rule and press O to open a new line above it. The line you need to add is

-A RH-Firewall-1-INPUT -m state --state NEW,ESTABLISHED,RELATED -m tcp -p tcp --dport 3128 -j ACCEPT

Once you’ve saved the file (press the Escape key and type :wq and press Enter) you will need to restart the firewall with

service iptables restart

If the proxy connects but you get an error message from Squid itself (you can tell it’s from Squid because the error will be in the form of a web page that conveniently has Squid written at the bottom of it), you can usually figure out what went wrong. If the Internet connection is otherwise working fine, it’s probably a problem with the two lines you just added. Double-check that they match your network specifics.

Transparent Proxies

We’re not going to cover setting up a transparent proxy here because from a performance point of view (at least when it comes to your web application) it won’t be very different from what you’d experience with a standard proxy setup. However, if you do want to install a transparent proxy, there are copious amounts of information on how to do it on the Internet.

There are some gotchas that you should be aware of, though. You will need to do some fiddling with the firewall to make it redirect normal web traffic through your proxy. This may involve just cutting and pasting some lines into your firewall config, but if you’re not 100% confident with what it is you’re changing, you might want to give it a miss. Transparent proxies also don’t play nice with SSL. If you attempt to redirect port 443 (https) through a transparent proxy, it will fail due to the way the protocol works. HTTPS can be proxied but only when the browser is aware that it needs to use a proxy. Lastly, a transparent proxy only affects the ports that you specifically redirect through it. If you only redirect port 80 (which is by far the most common) then web sites on any other port will not be proxied or filtered.

In short, if it’s a small network and you’re looking to build a safe sandbox, simply block all outgoing traffic at the firewall for any address you don’t want to have access and then set up Squid as a normal proxy. This will mean that visitors can’t get online unless they use your proxy, which in turn means they are subject to your filtering rules and logging mechanisms.

If your plan is to prevent children or teens accessing things they shouldn’t, you might need to be more creative in how you lock down the network. Youngsters are extremely resourceful and have a nasty habit of being able to find holes in all but the toughest armor. The sheer ingenuity that we’ve seen in bypassing what appears to be bulletproof security never fails to surprise (and impress) us. Don’t say we didn’t warn you!

What’s Going On

Although you can often tell a lot simply by using the proxy in the same way that your end user would, it’s often helpful to see what’s going on under the covers. This is especially important when you have a page that has lots of content and it’s not easy to tell which content is being cached and which isn’t.

The easiest way is to follow Squid’s access log with a tail, like so:

tail -f /var/log/squid/access.log

This will follow the file (the handy -f flag) so you will see any requests in real time. The output will look something like this:

1327843900.435    333 127.0.0.1 TCP_REFRESH_MISS/301 777 GET http://newsrss.bbc.co.uk/rss/newsonline_world_edition/front_page/rss.xml - DIRECT/63.80.138.26 text/html
1327843901.242    504 127.0.0.1 TCP_REFRESH_MISS/200 8667 GET http://feeds.bbci.co.uk/news/rss.xml? - DIRECT/63.80.138.26 text/xml

The first column is the timestamp of the event. By default Squid stores it as a Unix timestamp, which can be converted at sites such as www.unixtimestamp.com/. This is generally done for ease of parsing of log files because from a developer point of view, it’s often easier to work with this simplified format. If you’d rather have a more readable format, you can change this line in the Squid config file

access_log /var/log/squid/access.log squid

to

access_log /var/log/squid/access.log combined

You will also need to uncomment the logformat line that defines this formatting style. Search for logformat combined (/logformat combined<Enter>) and uncomment the line (remove the # symbol).

This will give you a much richer format, but it’s not ideal for automatic log parsing tools and the like, although many tools do support it. If you’re only using this for your own testing and aren’t really interested in what sites are being viewed or how often, then there’s really no reason not to use the combined format.

Carrying on with the log, the next field is the response time in milliseconds. So in your example, the first request took about a third of a second and the second took slightly over half a second. The next field is the IP address. In this case, the browser is on the same machine as Squid and so is connected via the loopback adapter. The next field is the event that triggered the request, followed by a forward slash and the response code from the server. A TCP_REFRESH_MISS is caused by Squid determining that it should check with the server to see if the file has changed. This could either be because the server has said that the document should only be cached briefly or it could be that Squid’s default policies have determined that a check should be sent. There are a huge variety of Squid event messages that can occur and you can find out all about them at www.linofee.org/~jel/proxy/Squid/accesslog.shtml. This site also describes the native Squid logging format in considerable detail. If there is anything that you are not sure about based on our explanation, the descriptions on that site might help resolve any confusion.

The first status code was 301, which means “moved permanently.” This is quite common in modern sites and is often done in tandem with URL management. The second status code, 200, just means “OK” and that the document has been returned as per the request.

The next field is the size of the resource in bytes—nothing too spectacular there. The next field shows the request method, which will usually be GET or POST. One thing to note is that POST requests are almost never cached because they contain dynamic content (usually form data) and so the response is almost always dynamic in return. It would be rare that you would want to cache such a response, so don’t be surprised if many (if not all) POST requests have the TCP_MISS action associated with them.

The next field is the URL that was requested. Although you can’t see it in your example, the whole URL (including the GET parameters) is also stored in the logs. You should never transmit sensitive data in a GET request because a system administrator could potentially see it in the logs. Another good reason for not sending data in a GET request is that you might accidentally end up caching a page you didn’t want to cache. In short, if you’re submitting a form or sending any sensitive data, don’t use a GET request; use POST instead. On the other hand, one advantage of a GET request with parameters is that it makes it easy to create bookmarks to dynamic content and also allows Squid to potentially cache that content.

The hyphen in the next field is used for storing the user’s identity. This will always be a hyphen unless you add authentication to your proxy. The next field tells you how the request was made in the cache hierarchy. If you’re using the proxy as a standalone service (and you are if you have just followed this section), then you will always see DIRECT here. If you have configured Squid to use a parent cache (for example, if your ISP offers a web proxy service), you will see PARENT here when Squid asks the other proxy for the content rather than fetching it itself. The second part of this field is the IP address of the machine where Squid got the content from. The final field is the content type (or MIME type) of the file. This unsurprisingly describes the type of content the file contains, which is used both by Squid and the browser to determine the best way of handling the content.
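If you just want a rough idea of how often Squid is answering from its cache, a quick one-liner over the log does the job (this assumes the default native squid log format shown above, where the fourth field holds the event and status code):

awk '{split($4, a, "/"); print a[1]}' /var/log/squid/access.log | sort | uniq -c | sort -rn

This prints a count for each event type; plenty of TCP_HIT and TCP_MEM_HIT entries relative to TCP_MISS means the cache is earning its keep.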

Getting a Helping Hand

As you can see, the Squid logs provide a wealth of information but sometimes it can be challenging to see the wood for the trees, especially if you have a lot of data moving through your proxy. If you fall into this category, you might be interested in some of the analysis tools available for Squid such as Webalizer (www.mrunix.net/webalizer/). Webalizer generates graphs and statistics of usage and can give an excellent high-level view of how the cache is performing. Webalizer can be installed in Ubuntu with

sudo apt-get install webalizer

and CentOS with

yum install webalizer

Webalizer’s documentation is extensive, covers a wide range of use cases, and includes examples. The documentation can be found at ftp://ftp.mrunix.net/pub/webalizer/README or as part of the package your distribution installed.

Squid, the Swiss Army Knife of Proxies

This section offered a really quick tour of the Squid proxy server and why it might be useful for you to install one for yourself. We also touched very briefly on transparent proxies, some of the issues you should be aware of, and alternative ways of creating a walled garden. We covered basic troubleshooting, which (thanks to Squid’s ubiquitous nature) is probably something you can avoid. Lastly, we covered Squid’s log files, how to tail them, and how to read and understand the somewhat copious information printed on every line. We also recommended Webalizer as a possible graphical tool for log file analysis if you are looking for a more strategic view of the data.

Squid is an excellent tool and is extremely versatile. It will likely do whatever you need out of the box and contains excellent documentation right in the config file. There is also a huge amount of content on the Web covering a myriad of interesting (and often mind-boggling) configurations. In short, a good grounding in Squid is a very useful thing to have in your toolkit because you never know when it will come in handy.

Edge-based Caching: Introducing Varnish

Varnish Cache has been around since 2006 and, like many of the technologies introduced in this chapter, was designed to solve a real world problem. Admittedly, all applications are written to solve problems but some applications are written to solve a particular pain point that the authors are experiencing. More often than not, those pain points are not suffered by them alone and the software finds a strong following in the community. Varnish certainly qualifies for this distinction!

Why are we introducing Varnish when earlier in the chapter we used Squid, which can also be used as a web accelerator? Those of you who have read ahead will know that in Chapter 8 when we discuss HTTP load balancing, we use nginx even though Varnish can also do this job (as can Squid). Well, there are a few reasons for this. First, we wanted to share a wide range of different solutions so that you can choose the best one for your particular needs. For example, if you have a requirement to use only official packages from an older Enterprise Linux release, nginx and Varnish might not be available to you. In that case, Squid is a viable alternative. Or perhaps you already have nginx installed and would rather not add an additional application into the mix.

However, unlike the other two, Varnish was designed from scratch to be an awesome web accelerator. It only does HTTP acceleration, unlike Squid, which can cache many different protocols, and nginx, which also happens to be a fully featured (and insanely quick) web server.

One of the interesting features of Varnish is that rather than managing the cache directly (that is, deciding what to keep in memory and what to write out to disk), Varnish uses virtual memory. A similar solution is used by the MongoDB database and for much the same reasons. The problem with memory management is that it’s very difficult to get right and it almost always depends on how much memory is in the machine and how the individual applications use it. Also, as the application can really only see its own memory usage, it is very difficult to manage the memory and disk access in an optimal way.

The genius of using virtual memory is that it is managed by the operating system itself. Varnish effectively sees a block of memory, which it uses to store the cached data. However, it’s up to the operating system to decide which parts of that block are actually in physical memory and which blocks are written to disk. The short version is that Varnish doesn’t need to worry about moving data from memory to disk because the operating system will do it automatically behind the scenes. As the operating system knows how memory is being used on the machine (and it should, since one of its roles is to allocate and manage it), it can manage the movement of data far more effectively than any individual application could.
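
In Varnish terms, this behavior comes from the storage backend you choose when the varnishd daemon starts. The following is only a sketch (the path and sizes are examples; the setting lives in the startup options you’ll meet in the next section):

-s file,/var/lib/varnish/varnish_storage.bin,1G    # a memory-mapped file; the OS decides what stays in RAM
-s malloc,256m                                     # the alternative: a plain in-memory cache

The file storage type is the one that leans on the operating system’s virtual memory in the way just described, while malloc simply keeps everything in Varnish’s own heap.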

NO SUPPORT FOR SSL

One thing to be aware of is that Varnish speaks plain HTTP only; it has no built-in support for SSL. If your site needs to serve traffic over HTTPS, you will have to terminate the encrypted connection in front of Varnish with another tool (nginx, pound, and stunnel are popular choices) and pass the decrypted requests on to Varnish as ordinary HTTP.

Sane Caching by Default

Varnish does sane caching by default. This means that it is very conservative in what it decides to cache. For example, any request that involves cookies will not be cached. This is because cookies are most often used to identify people and thus customize the response based on who is requesting the page. What you don’t want is for Varnish to take the first page it sees and then send it back to everyone else. At best, you will be sending stale content to your users; at worst, you might send confidential details to everyone by mistake.

Because Varnish only caches what it knows is safe, you can deploy it in front of your web application without worrying that you’ll get caught out by any of these problems. Having said that, if your application makes heavy use of cookies, you might find that very little of your application is getting cached. A classic example of a cookie-hungry web site is WordPress. Fortunately, there are plug-ins available for WordPress that make it more cache friendly (available from the usual WordPress plug-ins page in your control panel).

The good news is that Varnish is extremely configurable and allows you to customize pretty much everything about how requests are handled, what items are cached, and even what headers to add or remove. We only cover the basics in this chapter, but the Varnish manual goes into considerable depth on how you can tweak it to match your needs.
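
As a small taste of what that customization looks like, here is a minimal VCL sketch (Varnish 3 syntax, added to your VCL configuration) that strips cookies from requests for obviously static files so Varnish is willing to cache them; the extension list is only an illustration and should be adapted to your site:

sub vcl_recv {
    # Cookies are irrelevant for static assets, so drop them
    # and let Varnish treat these requests as cacheable.
    if (req.url ~ "\.(png|gif|jpg|css|js)$") {
        unset req.http.Cookie;
    }
}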

Installing Varnish

Installing Varnish is pretty straightforward. Many distributions include it in their own repositories but you’re going to use the official Varnish repositories to make sure you have the latest and greatest version. To install on Ubuntu, you first need to install the Varnish key (you’ll need to be root for this, so use sudo -i before running the following instructions). This can be quickly done with

curl http://repo.varnish-cache.org/debian/GPG-key.txt | apt-key add -

Adding the Varnish repository to apt just requires adding a single line to the source list (the line below uses the lucid codename; if you’re on a different Ubuntu release, substitute its codename):

echo "deb http://repo.varnish-cache.org/ubuntu/ lucid varnish-3.0" >> /etc/apt/sources.list

Lastly, to install, you just need to update the package list with

apt-get update

and run the install command

apt-get install varnish

Now we’re going to cover installing Varnish on CentOS. The EPEL (Extra Packages for Enterprise Linux) repository also has Varnish packages available. However, one of the main benefits of using an Enterprise Linux operating system such as CentOS is that you can install updates without worrying about things breaking. This can be guaranteed because no package is allowed to do anything that breaks backwards compatibility, so everything that worked with the last version should work with the new version. In order for EPEL to be useful, it also needs to make that same guarantee, so if at some point Varnish changes something that breaks backwards compatibility, EPEL will not provide that new version.

To ensure you get the latest version of Varnish, we are going to show you how to install Varnish directly from the Varnish YUM repository. This way you will get all the new versions of Varnish even if they break backwards compatibility. Of course, this is a trade-off. If you do require backwards compatibility, you can always install Varnish directly from the EPEL.

The first thing you need to do is add the repository to your system. Rather than set that up by hand, you can simply install the repository package from the Varnish web site, like so:

rpm -i http://repo.varnish-cache.org/redhat/varnish-3.0/el5/noarch/varnish-release-3.0-1.noarch.rpm --nosignature

This will add the Varnish repository to your list of sources. The --nosignature is needed because at present the Varnish signing key isn’t included with Yum. Now all you have to do is install Varnish itself with

yum install varnish

And you’re done! If you want Varnish to start automatically when the machine starts (and you almost certainly do), finish up with the following command:

chkconfig --levels 35 varnish on
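
If you want to confirm the change took effect, you can list the run levels Varnish is now enabled for with

chkconfig --list varnish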

Tip  On a fresh CentOS 5.7 install, we got a dependency error when trying to install Varnish. If you get an error concerning “libedit” you can fix that dependency by adding the EPEL repository and rerunning the install. You will be asked whether or not to add the EPEL key, so say yes.

Now let’s get Varnish up and running!

Getting Up and Running

Configuring Varnish is a bit trickier than Squid because Varnish needs to know which port to listen on and which servers (called backends) to connect to. By far the most common configuration is to have Varnish running on port 80. This isn’t surprising as it is also the standard port that web servers listen on and, in turn, the port to which web browsers will try to connect.

This can become a bit tricky if you’re running Varnish on the same machine as your web server, which will also be trying to use port 80. As only one application can bind to a port at any given time, the two can’t peacefully coexist. If you are not running Apache on the same machine, you can skip ahead to the next section.

Moving Apache to a different port is relatively straightforward and requires changing two things. First, you need to tell Apache not to listen on port 80 and then, if relevant, you need to update any of your virtual host entries that refer to port 80. On Ubuntu, you can see which ports Apache is using by editing /etc/apache2/ports.conf, which will contain something like this:

NameVirtualHost *:80
Listen 80

Note that CentOS runs SELinux by default and this restricts which ports Apache can bind to. If you get a permission-denied error when you use port 81, you can add that port by running

semanage port -a -t http_port_t -p tcp 81
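
You can check that the port was added by listing the HTTP ports SELinux now allows:

semanage port -l | grep http_port_t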

All you need to do is change the 80 to another port (such as 81) and save the file. The second line is the one that tells Apache what port to listen on, and the first line is used for running virtual hosts (being able to host multiple web sites with different names on a single IP address). This line needs to match the definition in each of your virtual host entries. For CentOS, there’s no preset location for virtual hosts, and many people just add them to the main Apache configuration file. However, for Ubuntu, those files can be found in /etc/apache2/sites-enabled. Every time you see

<VirtualHost *:80>

you need to replace it with

<VirtualHost *:81>

You will know right away if you’ve missed one because when you restart Apache, it will warn you if there are any virtual hosts that don’t match the one you configured in ports.conf.

Now that you have Apache under control, it’s time to tell Varnish to take its place. To configure the port Varnish listens on, you need to edit the main configuration file. On CentOS, this can be found in /etc/sysconfig/varnish and on Ubuntu it can be found in /etc/default/varnish. The config files are similar on both systems but they are structured slightly differently. On Ubuntu, you want to look for DAEMON_OPTS and replace -a :6081 with -a :80. On CentOS, it’s a bit clearer: you need to replace the value of VARNISH_LISTEN_PORT with 80.

Once you’ve set the port Varnish should listen on, it’s time to tell it the location of your backend servers. By default it points to port 80 on the local machine. That’s no good now that Varnish itself owns port 80: after the quick configuration rejig, Varnish would end up trying to fetch content from itself, which will cause problems. For both Ubuntu and CentOS, the default configuration file is /etc/varnish/default.vcl. The bit you’re interested in right now is the following chunk (though it might look slightly different in your version):

backend default {
    .host = "127.0.0.1";
    .port = "80";
}

Although Varnish can handle multiple backends and act as a load balancer, here you’re just going to focus on getting things going with a single backend. If you are running Apache on the same machine and moved it to port 81 as per the previous section, all you need to do is update the port number. If your web server is on a separate machine, you need to update the IP address to point to it. If it’s not running on the standard port, you should update that as well. When you’re done, save the file and exit your editor.
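
As a purely illustrative example, a backend on a separate machine whose Apache has been moved to port 81 might be described like this (the address is made up; use your own server’s):

backend default {
    .host = "192.0.2.25";   # hypothetical address of your web server
    .port = "81";           # or "80" if it still listens on the default port
}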

At this point you have Varnish configured. It’s listening on port 80 so that it receives all of the web requests destined for the machine. You’ve told it where to find the web site that you want it to accelerate. All you need to do now is fire it up! This you can do with

service varnish start

And you’re live! You can easily test that things are working as expected by visiting the web site. Point your browser at Varnish and, all being well, you should see your web site.

Of course, this doesn’t prove that Varnish is actually caching anything. We’ve already discussed how Varnish plays it safe when it comes to storing things in its cache. One way to see what is going on is to use the varnishncsa command. This allows you to log requests in the familiar Apache style while adding some Varnish-specific data. Here is the command to see what is being cached:

varnishncsa -F '%t %r %s %{Varnish:handling}x'

Here the only option you’re using is -F, which allows you to manually specify the format. The first element is the time, followed by the request itself, then the status code, and finally whether it was a hit or a miss. An example output looks like this:

[05/Feb/2012:17:45:43 +0000] GET http://peter.membrey.hk/varnish.jpg HTTP/1.1 200 miss
[05/Feb/2012:17:45:48 +0000] GET http://peter.membrey.hk/varnish.jpg HTTP/1.1 200 hit

You can see that the first request for the image was a miss (not too surprising as you’d never requested that file before). You then requested it almost immediately afterwards, and this time you can see that Varnish had cached it because it was recorded as a hit.
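
If you’d rather check from the client side, the response headers tell a similar story. A quick, hedged example (substitute your own hostname):

curl -I http://www.example.com/

Varnish stamps each response with an X-Varnish header; on a miss it contains a single transaction ID, while on a hit it contains two (the second belonging to the request that originally stored the object), and the Age header will typically be greater than zero for cached responses.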

Customizing Varnish

This section provided a quick overview of Varnish and some of its design features. It also showed why it stands above the competition. You got Varnish up and running with a basic configuration (and moved Apache out of the way when necessary) and pointed it to a backend server. At this stage, you have a working web accelerator, but chances are you want to take it further. You can find the excellent manual at www.varnish-cache.org/docs/3.0/tutorial/index.html, which covers the various support tools and how to use VCL to really customize Varnish. Although VCL requires a bit of thought and some care about where you put your feet, it will pay off when it comes to boosting performance.

Summary

This chapter is one of the longer chapters in the book but it covers an extensive amount of ground. Each of the different caching layers has its own complexities, and there are subtle issues that seem to interact with each other to create new and interesting (and extremely frustrating) problems to be solved. We could spend entire chapters on each of these topics and even then barely scratch the surface.

The take-away from this chapter is that while caching can be complicated and intricate, there are many simple things that can be done without too much hassle to reap great performance benefits from your application. Some will be more useful than others, depending on what you’re trying to do. This chapter exposed you to some of the core concepts of caching and thus provided you with a solid starting point for further investigation. Like many things, once you are able to grasp the thread, pulling it free is very easy!

We started by looking at caching as a whole and why it is a critical technology for good performance in today’s Web. We stepped through all the layers of caching that you’re likely to come across (as well as some that you probably won’t) to give you an overview of what’s available to boost your application’s performance.

We then dug a bit deeper into the basics of caching theory and looked at the tools that browsers provide in order to make caching a reality. We covered some of the problems and looked at how they might be solved. We also touched on a relatively new idea that caching might be slowing the Internet down instead of making it faster.

Next we introduced Squid, discussed the pros and cons of setting up your own proxy, and showed how you could use it to look at the caching characteristics of your application. We highlighted some issues with transparent proxies and talked about some tools to make your proxy more secure and child-proof. Then we introduced Varnish and covered how to get it installed and powering your site. We mentioned some of the issues with using a web accelerator and how to customize Varnish to your exact needs.

That wraps up our whirlwind tour of caching! We would have loved to spend more time on this subject and cover the applications in more depth, but we lacked the space. Hopefully we’ve been able to give you an idea of what’s out there; after all, it’s hard to find something when you don’t even know that you want to find it!

The next chapter covers DNS and how you can quickly leverage it to your benefit for some simple but effective load balancing!
