Making HTTP Better

The days when a typical website consisted of several kilobytes of static text and perhaps some minor graphic elements are long gone. As computers have become more powerful, and 300 bps modems have become easier to find in a museum than in every household, form has begun to dominate substance on the Web. Hundreds of kilobytes of images and subpages, subframes, and client-side scripts are commonly used to make sites more attractive and professional, with varying degrees of success. For many sites, multimedia content has actually become the primary type of information served, with HTML providing only a placeholder for images, video, embedded Java programs, or games. The Web in general is no longer merely a way to tell others about your private projects or interests; the driving force behind it is the ability to market and sell products and services cheaper and faster than ever. And marketing demands the eye-catching presentation of products and services.

Web browsers, web servers, and HTTP itself have had to adapt to this changing reality to make it easy to deploy new technologies and follow new trends. Conveniently enough, many of the technologies introduced in this process have interesting security implications for mere mortals and can also help us identify the client on the other end of the wire in a transparent way. As such, we must consider the optional features and extensions introduced since the day the Web was born.

Latency Reduction: A Nasty Kludge

The problem with the Web and some other current protocols is that the content presented to a user by a single multimedia site must be obtained from various sources (including wholly different domains) and then combined. Web pages have their text and formatting information separate from actual images and other sizable goodies (a practice truly to be praised by those who have limited bandwidth and just want to get to the point).

This situation makes it necessary for clients to make several requests in order to render a web page. The most naive way to achieve this is by requesting each piece, one by one, in sequence, but this is not the best practice in the real world because it leads to bottlenecks: Why wait for a page to load simply because the banner server is running slowly? Hence, to improve the speed of content retrieval, the browser issues numerous requests at once.

And herein lies the first shortcoming of HTTP: it offers no native way to serve several requests concurrently over a single connection. Within a connection, requests must be issued and answered sequentially.

The sequential (also called serial) fetch model results in a considerable performance penalty if one of the web page elements needs to be downloaded from a slow server or over a spotty link or if it takes a while for the server to prepare and deliver a particular element. If sequential fetching were the only option, any such slow request would prevent subsequent requests from being issued and served until it (the slow request) is filled.
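
A minimal sketch of this serial model in Python (all URLs here are hypothetical): each resource is requested only after the previous one has arrived in full, so a single slow element stalls everything behind it.

    import urllib.request

    # Hypothetical page elements; the banner lives on a separate, slow server.
    resources = [
        "http://www.example.com/index.html",
        "http://www.example.com/logo.png",
        "http://banners.example.net/ad.gif",
    ]

    # Serial fetch: each request waits for the previous one to finish, so one
    # slow element delays everything that follows it.
    for url in resources:
        with urllib.request.urlopen(url) as resp:
            data = resp.read()
        print(url, len(data), "bytes")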

Because newer versions of HTTP have not improved this situation, most client software implements a kludge: the web browser simply opens a number of simultaneous, separate TCP/IP sessions to a server or a set of servers and attempts to issue many requests at once. This solution is actually quite sane when the page is requesting resources from several separate machines. However, it’s not a good fix when the requested resources are on a single system, where all requests could be made in a single session and reasonably managed by the server. Here’s why:

  • The server has no chance to determine the best order in which to serve requests. (If it could, it would serve time-consuming, sizable, or simply the least relevant objects last.) It is instead forced to serve them all nearly at once, which can still cause the most important content to be needlessly delayed by the increased CPU load.

  • If several larger resources are served at once, and the operating system scheduler switches between the sessions, the result can be a considerable performance penalty, because the disk drive must repeatedly seek between possibly distant files in rapid succession.

  • Considerable overhead is usually associated with completing a new TCP/IP handshake (though this is somewhat lessened by keep-alive capabilities in newer versions of HTTP). It’s more efficient to issue all requests within a single connection.

  • Opening a new session and spawning a new process to serve the request involves overhead on the operating system level and strains devices such as stateful firewalls. Although modern web servers attempt to minimize this problem by keeping spare, persistent processes to accept requests as they arrive, the problem is seldom eliminated fully. A single session avoids unnecessary overhead and lets the server allocate only the resources absolutely needed to asynchronously serve chosen requests.

  • Last but not least, if the network, not the server, is the bottleneck, performance can actually deteriorate, with packets being dropped as the link saturates with data arriving from several sources at once.

Alas, good or bad, this architecture is with us for now, and it is still better than serial fetch. We should acknowledge its presence and learn to take advantage of it.
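
For comparison with the serial sketch above, the kludge boils down to something like the following Python sketch (again with hypothetical URLs): each request runs on its own connection in its own thread, so a slow banner server no longer blocks the rest of the page, at the cost of the overhead just listed.

    from concurrent.futures import ThreadPoolExecutor
    import urllib.request

    resources = [
        "http://www.example.com/index.html",
        "http://www.example.com/logo.png",
        "http://banners.example.net/ad.gif",   # the slow third-party server
    ]

    def fetch(url):
        # Each call opens a separate TCP connection to the server.
        with urllib.request.urlopen(url) as resp:
            return url, resp.read()

    # Issue all requests concurrently; pool.map yields the results in the
    # original order as each one becomes available.
    with ThreadPoolExecutor(max_workers=4) as pool:
        for url, body in pool.map(fetch, resources):
            print(url, len(body), "bytes")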

How can this very property help us to identify the software that the client is using? Quite simply. The significance of parallel file fetching for the purpose of browser fingerprinting should be fairly obvious: no two concurrent fetch algorithms are exactly the same, and there are good ways to measure this.

But before we turn our attention to parallel fetching, we need to take a look at two other important pieces of the security and privacy equation for the Web: caches and identity management. Although seemingly unrelated, they make a logical whole in the end. Thus, a brief intermission.

Content Caching

Keeping local caches of documents received from the server has been one of the more important features of the Web during its rapid expansion in recent years.[30] Without it, the cost of running web services would have been considerably higher.

The problem with the increasing weight and complexity of a typical website is that it requires more and more bandwidth (which for businesses remains generally quite expensive), as well as better servers to serve the data at a reasonable speed.

Even when performance is not limited by bandwidth bottlenecks, solutions such as the concurrent sessions described earlier put additional strain on service providers. The reason might be fairly surprising: if a person on a fairly slow link (such as a modem) opens four simultaneous sessions to fetch even a fairly simple page, four connections and four processes or threads need to be kept alive on the server, taking those resources away from users with faster connections.

Finally, to make things worse, heavier and more complex websites don’t always mesh with user expectations. Web page load times that were once considered fairly decent now seem annoyingly long and drive users away. In fact, research suggests that the average web user won’t wait more than 10 seconds for a page to download before moving on.[101] The result is that corporations and service providers need more resources and better links to handle the incoming traffic. Indeed, had things been left the way they were initially designed, the demand for server-side resources would likely have exceeded our capacity to fulfill it some time ago.

Of some help is the fact that much of the content served to web surfers is static or changes seldom, at least when compared with the rate at which it is retrieved by users. (This is especially true for large files, such as graphics, video, documents, executables, and so on.) By caching data closer to the end user (be it at the ISP level or in the endpoint browser itself), we can dramatically decrease the bandwidth used for subsequent visits from users who share a common caching engine and make it easier on the servers handling the traffic. The ISP benefits from lowered bandwidth consumption as well, being able to serve more customers without having to invest in new equipment and connections. What HTTP needs, however, is a mechanism to keep the cache accurate and up to date: the author of a page (whether human or machine) needs to be able to tell the cache engine when to fetch a newer version of a document.

To implement document caching, HTTP provides two built-in features:

  • A method for telling, with minimum effort, whether a portion of data has been modified since the most recent version held by the cache engine (the document recorded at the time of the last visit).

  • A method for determining which portions of data should not be cached, whether for security reasons or because the data is generated dynamically every time the resource is requested.

This functionality is in practice achieved fairly simply. The server returns every cacheable document in the course of a regular HTTP session, but with an additional protocol-level header, Last-Modified. Unsurprisingly, this header carries the server’s idea of the time the document was last modified. Documents that should not be cached are, on the other hand, marked by the server with the header Pragma: no-cache (Cache-Control: no-cache in HTTP/1.1).

The client browser (or an intermediate cache engine run by the ISP) is supposed to cache a copy of every cacheable page based on the presence of an appropriate header, along with the last modification information. It should keep the cached page for as long as possible, either until the user-configured cache limit is exceeded or the user manually purges the cache, unless specifically instructed to discard it after a specific date with an Expires header.

Later, when the site is visited again, the client notices that it has a previous copy of the page cached on disk and follows a slightly different procedure when accessing it. As long as a document lives in the cache, the client still attempts to fetch the file every time the user revisits the site, but it adds an If-Modified-Since header to every such request, using the date previously seen in the Last-Modified header. The server is expected to compare the If-Modified-Since value with its knowledge of the last modification time for the resource. If the resource has not changed since that time, the response “304 Not Modified” is returned instead of the requested data. As a result, the actual file transfer is suppressed and bandwidth is preserved (only a couple of hundred bytes are exchanged during this communication). The client (or intermediate cache engine) is expected to use its previously cached copy of the resource instead of downloading it again.
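
The exchange can be reproduced in a few lines of Python; this is only an illustrative sketch, the URL is hypothetical, and it assumes the server actually returns a Last-Modified header.

    import urllib.request
    import urllib.error

    url = "http://www.example.com/logo.png"

    # First visit: download the resource and remember its Last-Modified value.
    with urllib.request.urlopen(url) as resp:
        cached_body = resp.read()
        last_modified = resp.headers["Last-Modified"]

    # Revisit: replay the timestamp in If-Modified-Since. urllib reports a
    # "304 Not Modified" response by raising HTTPError with code 304.
    request = urllib.request.Request(url, headers={"If-Modified-Since": last_modified})
    try:
        with urllib.request.urlopen(request) as resp:
            cached_body = resp.read()      # resource changed; refresh the cache
    except urllib.error.HTTPError as err:
        if err.code != 304:
            raise
        # 304: nothing was transferred; keep using cached_body.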

Note

A more up-to-date approach, the ETag and If-None-Match headers, part of the entity-tagging functionality of HTTP/1.1, works in a similar manner but aims to resolve the ambiguities surrounding the interpretation of file modification times: the problems that stem from a file being modified several times in a short period (below the resolution of the clock used for Last-Modified data), from files being restored from a backup (with a modification time older than the last cached copy), and so on.
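
Revalidation with entity tags follows the same pattern, merely with different headers; a minimal sketch, again assuming a hypothetical URL whose server returns an ETag:

    import urllib.request
    import urllib.error

    url = "http://www.example.com/logo.png"

    with urllib.request.urlopen(url) as resp:
        etag = resp.headers["ETag"]           # opaque version identifier
        cached_body = resp.read()

    # Revalidate against the tag rather than a modification date.
    request = urllib.request.Request(url, headers={"If-None-Match": etag})
    try:
        with urllib.request.urlopen(request) as resp:
            cached_body = resp.read()         # tag changed; new version fetched
    except urllib.error.HTTPError as err:
        if err.code == 304:
            pass                              # cached copy is still current
        else:
            raise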

Managing Sessions: Cookies

Another important and seemingly unrelated requirement for HTTP was that it be able to differentiate between sessions, track them across connections, and store session settings and identity information. For example, some websites greatly benefit from the ability to adapt to one’s personal preferences and to restore the look and feel chosen by the user each time they visit the site. Naturally, a user’s identity can be established by prompting for a login and password every time a page is viewed, at which point the user’s personal settings can be loaded, but this bit of extra effort dramatically reduces the number of people willing to go through it just to access a page.

A transparent and persistent way to store and retrieve certain information from the client’s machine was needed to ensure seamless and personalized access to web forums, bulletin boards, chats, and many other features that define the browsing experience for so many people. On the other hand, the ability for web server administrators to recognize and identify returning visitors by assigning them a unique tag and retrieving it later meant the surrender of anonymity in exchange for a little convenience. Such a mechanism would give companies with second-grade ethics a great tool to track and profile users, record their shopping and browsing preferences, determine their interests, and so forth. Search engines could easily correlate requests from the same user, and content providers that serve resources such as ad banners could use this information to track people even without their permission or the knowledge of site operators.[31] Regardless of the concerns, however, there seemed to be no better, sufficiently universal alternative for this mechanism. And so web cookies were born.

Cookies, as specified in RFC 2109,[102] are small portions of text that are issued by a server when the client connects to it. The server specifies a Set-Cookie header in the response to the visitor. This portion of text is, by its additional parameters, limited in scope to a specific domain, server, or resource and has a limited lifespan. Cookies are stored by cookie-enabled client software in a special container file or folder (often referred to as a cookie jar) and are automatically sent back to the server using a Cookie header whenever a connection to a matching resource is established again.

Servers can choose to store (or push out) user settings in Set-Cookie headers and simply read them back on subsequent visits; and here is where cookie functionality would end in a perfect world. Unfortunately, the client has no way of telling what the data stored in a cookie is actually used for. A server can just as well assign a unique identifier to a client using the Set-Cookie header and then read it back to link current user activity to previous actions in the system.
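
A minimal sketch of this round trip, using Python's standard cookie handling and a hypothetical site: the CookieJar below plays the role of the browser's cookie jar, recording Set-Cookie headers and replaying them as Cookie headers on later visits.

    import http.cookiejar
    import urllib.request

    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

    # First visit: any Set-Cookie headers in the response end up in the jar.
    with opener.open("http://www.example.com/") as resp:
        resp.read()
    for cookie in jar:
        print(cookie.name, cookie.value, cookie.domain, cookie.path, cookie.expires)

    # Later visit: matching cookies are sent back automatically in a Cookie
    # header, letting the server tie this request to the earlier one.
    with opener.open("http://www.example.com/") as resp:
        resp.read()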

The mechanism is widely regarded as having serious privacy implications. Some activists downright hate cookies, but the opposition to this technology is getting less and less vocal nowadays. Browsing the Web with cookies disabled becomes increasingly difficult, with some sites even refusing traffic from clients that do not pass a cookie check. Thankfully, many browsers offer extensive cookie acceptance, restriction, and rejection settings and can even prompt for every single cookie before accepting it (although the latter is not particularly practical). This makes it possible to mount a reasonable defense of your privacy, if only by defining who the “good guys” are and whom to trust.

But is our privacy in our hands then?

When Cookies and Caches Mix

The privacy of web browsing has long been considered a hot issue, and not without reason. Many people do not want others to snoop on their preferences and interests, even if their whereabouts are not particularly questionable. Why? Sometimes, you simply do not want a shoddy advertising company to know that you are reading about a specific medical condition and then be able to link this information to an account you have on a professional bulletin board, particularly because there is no way of knowing where this information will end up.

Cookie control makes our browsing experience reasonably comfortable, while keeping bad guys at bay. But even turning cookies off does not prevent information from being stored on one’s system to be later sent back to a server. The functionality needed to store and retrieve data on a victim’s machine has long been present in all browsers, regardless of cookie policy settings. The two necessary technologies work in a similar manner and differ only in terms of their intended use: cookies and file caching.

Somewhere back in 2000, Martin Pool posted a fairly short but insightful message[103] to the Bugtraq mailing list, sharing an interesting observation and supporting it with some actual code. He concluded that, at least for systems that do not use centralized proxy caches and instead store copies of already fetched documents locally on disk (as is the case with most of us mere mortals), there is no significant difference between the Set-Cookie and Cookie pair on the one hand and Last-Modified and If-Modified-Since on the other. A malicious website administrator can store just about any message in the Last-Modified header returned for a page their victim visits (or, if this header is sanity-checked, they might simply use a unique, arbitrary date to identify this visitor). Whenever the page is revisited, the client sends back an If-Modified-Since header with an exact copy of the identifier the rogue operator stored on their computer. A “304 Not Modified” response ensures that this “cookie” is not discarded.
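
The idea can be illustrated with a small server-side sketch built on Python's standard http.server (an illustration of the concept, not Martin Pool's original code): a new visitor is handed a unique, arbitrary date in Last-Modified; a returning visitor echoes it back in If-Modified-Since and receives a 304, which keeps this pseudo-cookie alive.

    import itertools
    from http.server import BaseHTTPRequestHandler, HTTPServer

    visitor_ids = itertools.count()            # source of unique identifiers

    class TrackingHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            tag = self.headers.get("If-Modified-Since")
            if tag is not None:
                # Returning visitor: the identifier comes back verbatim.
                self.log_message("returning visitor, tag=%s", tag)
                self.send_response(304)        # keep the pseudo-cookie cached
                self.end_headers()
                return
            # New visitor: encode an identifier as a plausible-looking date.
            unique_tag = self.date_time_string(1_000_000_000 + next(visitor_ids))
            self.send_response(200)
            self.send_header("Last-Modified", unique_tag)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(b"<html><body>Hello.</body></html>")

    HTTPServer(("", 8080), TrackingHandler).serve_forever()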

Preventing the Cache Cookie Attack

Having your browser slightly tweak the Last-Modified data it echoes back might seem like a neat way to prevent this type of exposure (while introducing some cache inaccuracy), but this is not the case. Another variant of the attack relies on storing data in the cached documents themselves, as opposed to using the tags directly: a malicious operator can prepare a special page for the victim when a website is visited for the first time. The page contains a reference to a unique file name listed as an embedded resource (for example, an image). When a client revisits this page, the server notices the If-Modified-Since header and replies with a 304 response, prompting the old copy of the page to be used. The old page contains a unique file reference that is then requested from the server, making it possible to map the client’s IP to the previous session in which that file name had been handed out. Oops.
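
The first response in this variant might be generated along these lines (a sketch with hypothetical names): the page handed to a new visitor embeds a resource whose file name is unique to that visitor, so the name itself later identifies whoever requests it.

    import uuid

    def first_visit_page():
        # The unique file name is the identifier; the image behind it can be a
        # shared one-pixel GIF served under many different names.
        unique_name = uuid.uuid4().hex
        return (
            "<html><body>"
            "<p>Welcome!</p>"
            f'<img src="/tracking/{unique_name}.gif" width="1" height="1">'
            "</body></html>"
        )

    # On later visits the page itself is answered with "304 Not Modified", so
    # the cached copy, and with it the unique image reference, is reused; the
    # request for that image then maps the visitor's current IP address back
    # to the original session.
    print(first_visit_page())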

Naturally, the lifetime of cache-based “cookies” is limited by cache size and expiration settings for cached documents configured by the user. However, these values are generally quite generous, and information stored within metadata for a resource that is revisited once every couple of weeks can last for years, until the cache is manually purged. For companies that serve common components included on hundreds or thousands of sites (again, banners are a good example), this is a nonissue.

The main difference between these cache cookies and cookies proper is not the functionality they offer, but rather how easily the resulting exposure can be controlled. (Cache data must also serve other purposes and cannot easily be restricted without the major performance impact associated with partly or completely disabling caching.)

In this bizarre twist, you can see how two aspects of the Web collide, effectively nullifying security safeguards built around one of them. Practice shows that intentions are not always enough, because rogues are not always willing to play by the rules and use the technology the way we want them to. Perhaps turning your cookies off does not make that much of a difference after all?

But then it is about time to go back to the main subject of our discussion.



[30] Its importance is slowly decreasing, however: as more and more web pages are generated dynamically, and our Internet backbone becomes more mature and capable, caching is bound to lose its significance.

[31] If an advertisement banner or any other element of a website is placed on a shared server, such as http://banners.evilcompany.com, the operator of evilcompany.com can issue and retrieve cookies whenever a person visits any legitimate website that uses banners supplied by them. Needless to say, most banner providers do issue cookies and track users, albeit primarily for market research purposes.
