Chapter 1. Web technologies and HTTP

This chapter covers

  • How a web page is loaded by the browser
  • What HTTP is and how it evolved up to HTTP/1.1
  • The basics of HTTPS
  • Basic HTTP tools

This chapter gives you background on how the web works today and explains some key concepts necessary for the rest of this book to make sense; then it introduces HTTP and the history of the previous versions. I expect many of the readers of this book to be at least somewhat familiar with a lot of what is discussed in this first chapter, but I encourage you not to skip it; use this chapter as an opportunity to refresh yourself on the basics.

1.1. How the web works

The internet has become an integral part of everyday life. Shopping, banking, communication, and entertainment all depend on the internet, and with the growth of the Internet of Things (IoT), more and more devices are being put online, where they can be accessed remotely. This access is made possible by several technologies, including Hypertext Transfer Protocol (HTTP), which is a key method of requesting access to remote web applications and resources. Although most people understand how to use a web browser to surf the internet, few truly understand how this technology works, why HTTP is a core part of the web, or why the next version (HTTP/2) is causing such excitement in the web community.

1.1.1. The internet versus the World Wide Web

For many people, the internet and the World Wide Web are synonymous, but it’s important to differentiate between the two terms.

The internet is a collection of public computers linked through the shared use of the Internet Protocol (IP) to route messages. It’s made up of many services, including the World Wide Web, email, file sharing, and internet telephony. The World Wide Web (or the web), therefore, is but one part of the internet, though it’s the most visible part, and because people often look at email through web-mail front ends (such as Gmail, Hotmail, and Yahoo!), some use the term web interchangeably with the internet.

HTTP is how web browsers request web pages. It was one of the three main technologies defined by Tim Berners-Lee when he invented the web, along with unique identifiers for resources (which is where Uniform Resource Locators, or URLs, came from) and Hypertext Markup Language (HTML). Other parts of the internet have their own protocols and standards to define how they work and how their underlying messages are routed through the internet (such as email with SMTP, IMAP, and POP). When examining HTTP, you’re dealing primarily with the World Wide Web. This line is blurring, however: services built on top of HTTP, even those without a traditional web front end, make defining the web itself trickier and trickier! These services (known by acronyms such as REST or SOAP) can be used by web pages and non-web pages (such as mobile apps) alike. The IoT simply represents devices that expose services that other devices (computers, mobile apps, and even other IoT devices) can interact with, often through HTTP calls. As a result, you can use HTTP to send a message from a mobile phone app to a lamp to turn it on or off, for example.

Although the internet is made up of myriad services, a lot of them are being used proportionally less and less while use of the web continues to grow. Those of us who recall the internet in the earliest days recall acronyms such as BBS and IRC that are practically gone today, replaced by web forums, social media websites, and chat applications.

All this means that although the term World Wide Web was often incorrectly used interchangeably with the internet, the continued rise of the web (or at least of HTTP, which was created for it) may mean that this understanding soon won’t be as far from the truth as it once was.

1.1.2. What happens when you browse the web?

For now, I return to the primary and first use of HTTP: to request web pages. When you open a website in your favorite browser, whether that browser is on a desktop or laptop computer, a tablet, a mobile phone, or any of the myriad other devices that allow internet access, an awful lot is going on. To get the most out of this book, you need to understand exactly how browsing the web works.

Suppose that you fire up a browser and go to www.google.com. Within a few seconds, the following will have happened, as illustrated in figure 1.1:

  1. The browser requests the real address of www.google.com from a Domain Name System (DNS) server, which translates the human-friendly name www.google.com to a machine-friendly IP address. If you think of an IP address as a telephone number, DNS is the telephone book. This IP address is either an older-format IPv4 address (such as 216.58.192.4, which is nearly human-usable) or a new-format IPv6 address (such as 2607:f8b0:4005:801:0:0:0:2004, which is definitely getting into “machines-only” territory). Much as telephone area codes are occasionally redesignated when a city starts to run out of phone numbers, IPv6 is needed to deal with the explosion of devices connecting to the internet now and in the future. Be aware that due to the global nature of the internet, larger companies often have several servers around the globe. When you ask your DNS server for the IP address, it often provides the IP address of the nearest server to make your internet browsing faster. Someone based in America will get a different IP address for www.google.com than someone based in Europe, for example, so don’t worry if you see different IP addresses for www.google.com than the ones I’ve given here.
    Whatever happened to IPv5?

    If Internet Protocol version 4 (IPv4) was replaced with version 6 (IPv6), what happened to version 5? And why have you never heard of IPv1 through IPv3?

    The first 4 bits of an IP packet give the version, in theory limiting the protocol to 16 possible versions (0 through 15). Before the much-used IPv4, there were four experimental versions, starting at 0 and going up to 3, though none of them was formally standardized until version 4.[a] After that, version 5 was designated for Internet Stream Protocol, which was intended for real-time audio and video streaming, similar to what Voice over IP (VoIP) became later. That version never took off, however, not least because it suffered the same address limitations as version 4, and when version 6 came along, work on Internet Stream Protocol stopped, leaving version 6 as the successor to IPv4. Apparently, IPv6 was initially called version 7 under the incorrect assumption that version 6 was already assigned.[b] Versions 7, 8, and 9 have also been assigned but are similarly no longer used. If there ever is a successor to IPv6, it will likely be IPv10 or later, which no doubt will lead to questions similar to the ones that open this sidebar!

    a. See https://tools.ietf.org/html/rfc760. This protocol was later updated and replaced (https://tools.ietf.org/html/rfc791).

  2. The web browser asks your computer to open a Transmission Control Protocol (TCP) connection[1] over IP to this address on the standard web port (port 80)[2] or over the standard secure web port (port 443).

    1. Google has started experimenting with QUIC, so if you’re connecting from Chrome to a Google site, you may use that. I discuss QUIC in chapter 9.

    2. Some websites, including Google, use a technology called HSTS to automatically use a Secure HTTP connection (HTTPS), which runs on port 443, so even if you try to connect over HTTP, the connection automatically upgrades to HTTPS before the request is sent.

    IP is used to direct traffic through the internet (hence, the name Internet Protocol!), but TCP adds stability and retransmissions to make the connection reliable (“Hello, did you get that?” “No, could you repeat that last bit, please?”). As these two technologies are often used together, they’re usually abbreviated as TCP/IP, and, together, they form the backbone of much of the internet. A server can be used for several services (such as email, FTP, HTTP, and HTTPS [HTTP Secure] web servers), and the port allows different services to sit together under one IP address, much as a business may have a telephone extension for each employee.
  3. When the browser has a connection to the web server, it can start asking for the website. This step is where HTTP comes in, and I examine how it works in the next section. For now, be aware that the web browser uses HTTP to ask the Google server for the Google home page. (Steps 1–3 are sketched as a short code example at the end of this section.)
    Note

    At this point, your browser will have automatically corrected the shorthand web address (www.google.com) to the more syntactically correct URL address of http://www.google.com. The actual full URL includes the port and would be http://www.google.com:80, but if standard ports are being used (80 for HTTP and 443 for HTTPS), the browser hides the port. If nonstandard ports are being used, the port is shown. Some systems, particularly in development environments, use port 8080 for HTTP or 8443 for HTTPS, for example.

    If HTTPS is being used (I go into HTTPS in a lot more detail in section 1.4), extra steps are required to set up the encryption that secures the connection.
  4. The Google server responds with whatever URL you asked for. Typically, what gets sent back from the initial page is the text that makes up the web page in HTML format. HTML is a standardized, structured, text-based format that makes up the text content of a page. It’s usually divided into various sections defined by HTML tags and references other bits of information needed to make the media-rich web pages you’re used to seeing (Cascading Style Sheets [CSS], JavaScript code, images, fonts, and so on). Instead of an HTML page, however, the response may be an instruction to go to a different location. Google, for example, runs only on HTTPS, so if you go to http://www.google.com, the response is a special HTTP instruction (usually, a 301 or 302 response code) that redirects to a new location at https://www.google.com. This response starts some or all of the preceding steps again, depending on whether the redirect address is a different server/port combination, a different port in the same location (such as a redirect to HTTPS), or even a different page on the same server and port. Similarly, if something goes wrong, you get back an HTTP response code, the best-known of which is the 404 Not Found response code.
  5. The web browser processes the returned request. Assuming that the returned response is HTML, the browser starts to parse the HTML code and builds in memory the Document Object Model (DOM), which is an internal representation of the page. During this processing, the browser likely sees other resources that it needs to display the page properly (such as CSS, JavaScript, and images).
  6. The web browser requests any additional resources it needs. Google keeps its web page fairly lean; at this writing, only 16 other resources are needed. Each of these resources is requested in a similar manner, following steps 1–6, and yes, that includes this step, because those resources may in turn request other resources. The average website isn’t as lean as Google and needs 75 resources,[3] often from many domains, so steps 1–6 must be repeated for all of them. This situation is one of the key things that makes web browsing slow and one of the key reasons for HTTP/2, the main purpose of which is to make requesting these additional resources more efficient, as you’ll see in future chapters.

  7. When the browser has enough of the critical resources, it starts to render the page onscreen. Choosing when to start rendering the page is a challenging task and not as simple as it sounds. If the web browser waits until all resources are downloaded, it would take a long time to show web pages, and the web would be an even slower, more frustrating place. But if the web browser starts to render the page too soon, you end up with the page jumping around as more content downloads, which is irritating if you’re in the middle of reading an article when the page jumps down. A firm understanding of the technologies that make up the web—especially HTTP and HTML/CSS/JavaScript—can help website owners reduce these annoying jumps while pages are being loaded, but far too many sites don’t optimize their pages effectively to prevent these jumps.
  8. After the initial display of the page, the web browser continues, in the background, to download other resources that the page needs and update the page as it processes them. These resources include noncritical items such as images and advertising tracking scripts. As a result, you often see a web page displayed initially without images (especially on slower connections), with images being filled in as more of them are downloaded.
  9. When the page is fully loaded, the browser stops the loading icon (a spinning icon on or near the address bar for most browsers) and fires the OnLoad JavaScript event, which JavaScript code may use as a sign that the page is ready to perform certain actions.
  10. At this point, the page is fully loaded, but the browser hasn’t stopped sending out requests. We’re long past the days when a web page was a page of static information. Many web pages are now feature-rich applications that continually communicate with various servers on the internet to send or load additional content. This content may be user-initiated actions, such as when you type requests in the search bar on Google’s home page and instantly see search suggestions without having to click the Search button, or it may be application-driven actions, such as your Facebook or Twitter feed’s automatically updating without your having to click a refresh button. These actions often happen in the background and are invisible to you, especially advertising and analytics scripts that track your actions on the site to report analytics to website owners and/or advertising networks.
Figure 1.1. Typical interaction when browsing to a web page

As you can see, a lot happens when you type a URL, and it often happens in the blink of an eye. Each of these steps could form the basis for a whole book, with variations in certain circumstances. This book, however, concentrates on (and delves a little deeper into) steps 3–8 (loading the website over HTTP). Some later chapters (particularly chapter 9) also touch on step 2 (the underlying network connection that HTTP uses).
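
To make steps 1–3 concrete, here’s a minimal sketch in Python (my illustration, not part of the chapter’s toolset) that resolves a name with DNS, opens a TCP connection on the standard web port, and sends a bare HTTP request. The HTTP syntax itself is covered in section 1.2:

import socket

# Step 1: DNS translates the human-friendly name to an IP address.
ip_address = socket.gethostbyname("www.google.com")
print(ip_address)  # for example, 216.58.192.4; yours will likely differ

# Step 2: open a TCP connection to that address on the standard web port (80).
sock = socket.create_connection((ip_address, 80))

# Step 3: send an HTTP request over that connection and read the start of the response.
sock.sendall(b"GET / HTTP/1.1\r\nHost: www.google.com\r\nConnection: close\r\n\r\n")
print(sock.recv(200).decode(errors="replace"))
sock.close()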

1.2. What is HTTP?

The preceding section is deliberately light on the details of how HTTP works so you can get an idea of how HTTP fits into the wider internet. In this section, I briefly describe how HTTP works and is used.

As I mentioned earlier, HTTP stands for Hypertext Transfer Protocol. As the name suggests, HTTP was initially intended to transfer hypertext documents (documents that contain links to other documents), and the first version didn’t support anything but these documents. Quickly, developers realized that the protocol could be used to transfer other file types (such as images), so now the Hypertext part of the HTTP acronym is no longer too relevant, but given how widely used HTTP is, it’s too late to rename it.

HTTP depends on a reliable network connection, usually provided by TCP/IP, which is itself built on some type of physical connection (Ethernet, Wi-Fi, and so on). Because communication protocols are separated into layers, each layer can concentrate on what it does well. HTTP doesn’t concern itself with the lower-level details of how that network connection is established. Although HTTP applications should be mindful of how to handle network failures or disconnects, the protocol itself makes no allowances for these tasks.

The Open Systems Interconnection (OSI) model is a conceptual model often used to describe this layered approach. The model consists of seven layers, though these layers don’t map exactly to all networks and in particular to internet traffic. TCP spans at least two (and possibly three) layers, depending on how you define the layers. Figure 1.2 shows roughly how this model maps to web traffic and where HTTP fits into this model.

Figure 1.2. The transport layers of internet traffic

There’s some argument about the exact definition of each layer. In complex systems like the internet, not everything can be classified and separated as easily as developers would like. In fact, the Internet Engineering Task Force (IETF) warns against getting too hung up on layering.[4] But it can be helpful to understand at a high level where HTTP fits in this model and how it depends on lower-level protocols to work. Note that the Application layer here refers to the networking application (HTTP, in this case) rather than to JavaScript applications, many of which are built on top of HTTP.

HTTP is, at heart, a request-and-response protocol. The web browser makes a request, using HTTP syntax, to the web server, which responds with a message containing the requested resource. The key to the success of HTTP is its simplicity. As you’ll see in later chapters, however, this simplicity can be a cause of concern for HTTP/2, which sacrifices some of that simplicity for efficiency.

The basic syntax of an HTTP request, after you open a connection, is as follows:

GET /page.html↵

The ↵ symbol represents a carriage return/newline (Enter or Return key). In its basic form, HTTP is as simple as that! You provide one of the few HTTP methods (GET, in this case) followed by the resource you want (/page.html). Remember that at this point, you’ve already connected to the appropriate server, using a technology such as TCP/IP, so you’re simply requesting the resource you want from that server and don’t need to be concerned with how that connection happens or is managed.

The first version of HTTP (0.9) allowed only this simple syntax and had only the GET method. In this case, you might ask why you needed to state GET for an HTTP/0.9 request, because it’s superfluous, but future versions of HTTP introduced other methods, so kudos to the inventors of HTTP for having the foresight to see that more methods would come. In the next section, I discuss the various versions of HTTP, but this syntax is still recognizable as the format of an HTTP GET request.

Consider a real-life example. Because the web server needs only a TCP/IP connection to receive HTTP requests, you can emulate the browser by using a program such as Telnet. Telnet is a simple program that opens a TCP/IP connection to a server and allows you to type text commands and view text responses. This program is exactly what you need for HTTP, though I cover much better tools for viewing HTTP near the end of the chapter. Unfortunately, some technologies are becoming less prevalent, and Telnet is one of them; many operating systems no longer include a Telnet client by default. It may be necessary for you to install a Telnet client to try some simple HTTP commands, or you can use an equivalent like the nc command. This command is short for netcat and is installed in most Linux-like environments, including macOS, and for the simple examples I show here, it’s almost identical to Telnet.

For Windows, I recommend using the PuTTY software[5] over the default client bundled with Windows (which usually isn’t installed anyway and must be added manually), as the default client often has display issues, such as not displaying what you’re typing or overwriting what’s already on the terminal. When you install and launch PuTTY, you see the configuration window, where you can enter the host (www.google.com), port (80), and connection type (Telnet). Make sure that you click the Never option for closing the window on exit; otherwise, you won’t see the results. All these settings are shown in figure 1.3. Note also that if you have trouble entering any of the following commands and receive a message about a badly formatted request, you may want to change Connection > Telnet > Telnet Negotiation Mode to Passive.

Figure 1.3. PuTTY details for connecting to Google

If you’re using an Apple Macintosh or a Linux machine, you may be able to issue the Telnet command directly from a shell prompt if Telnet is already installed:

$ telnet www.google.com 80

Or, as I mentioned earlier, use the nc command in the same way:

$ nc www.google.com 80

When you have a Telnet session and make the connection, you see a blank screen, or, depending on your Telnet application, some instructions like the following:

Trying 216.58.193.68...
Connected to www.google.com.
Escape character is '^]'.

Whether or not this message is displayed, you should be able to type your HTTP commands, so type GET / and then press the Return key, which tells the Google server that you’re looking for the default page (/) and (because you haven’t specified an HTTP version) that you want to use the default HTTP/0.9. Note that some Telnet clients don’t echo back what you’re typing by default (especially the default Telnet client bundled with Windows, as I mentioned earlier), so it can be difficult to see exactly what you’re typing. But you should still send the commands.

Using Telnet behind company proxies

If your computer doesn’t have direct internet access, you won’t be able to connect to Google directly by using Telnet. This scenario is often the case in corporate environments that use a proxy to restrict direct access. (I cover proxies in chapter 3.) In this case, you may be able to use one of your internal web servers (such as your intranet site) as an example rather than Google. In section 1.5.3, I discuss other tools that can work with a proxy, but for now, you can read along without following the instructions.

The Google server will respond, most likely using HTTP/1.0, despite the fact that you sent a default HTTP/0.9 request (no server uses HTTP/0.9 anymore). The response is an HTTP response code of 200 (to state that the command was a success) or 302 (to state that the server wants you to redirect to another page), followed by a closing of the connection. I go into more detail on this process in the next section, so don’t get too concerned about these details now.

Following is one such response from a command-line prompt on a Linux server with the response line in bold. Note that the HTML content returned isn’t shown in full for the sake of brevity:

$ telnet www.google.com 80
Trying 172.217.3.196...
Connected to www.google.com.
Escape character is '^]'.
GET /
HTTP/1.0 200 OK
Date: Sun, 10 Sep 2017 16:20:09 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
P3P: CP="This is not a P3P policy! See
     https://www.google.com/support/accounts/answer/151657?hl=en for more info."
Server: gws
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Set-Cookie:
      NID=111=QIMb1TZHhHGXEPjUXqbHChZGCcVLFQOvmqjNcUIejUXqbHChZKtrF4Hf4x4DVjTb01R
      8DWShPlu6_aQ-AnPXgONzEoGOpapm_VOTW0Y8TWVpNap_1234567890-p2g; expires=Mon,
      12-Mar-2018 16:20:09 GMT; path=/; domain=.google.com; HttpOnly
Accept-Ranges: none
Vary: Accept-Encoding

<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage"
     lang="en"><head><meta content="Search the world's information, including
     webpages, images, videos and more. Google has many special features to help
     you find exactly what you're looking for." name="description

...etc.

</script></div></body></html>Connection closed by foreign host.

If you’re based outside the United States, you may see a redirect to a local Google server instead. If you’re based in Ireland, for example, Google sends a 302 response and advises the browser to go to Google Ireland (http://www.google.ie) instead, as shown here:

GET /
HTTP/1.0 302 Found
Location: http://www.google.ie/?gws_rd=cr&dcr=0&ei=BWe1WYrf123456qpIbwDg
Cache-Control: private
Content-Type: text/html; charset=UTF-8
P3P: CP="This is not a P3P policy! See
     https://www.google.com/support/accounts/answer/151657?hl=en for more info."
Date: Sun, 10 Sep 2017 16:23:33 GMT
Server: gws
Content-Length: 268
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Set-Cookie: NID=111=ff1KAwIMjt3X4MEg_KzqR_9eAG78CWNGEFlDG0XIf7dLZsQeLerX-
     P8uSnXYCWNGEFlDG0dsM-8V8X8ny4nbu2w96GRTZtzXWOHvWS123456dhd0LpD_123456789;
     expires=Mon, 12-Mar-2018 16:23:33 GMT; path=/; domain=.google.com; HttpOnly

<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>302 Moved</TITLE></HEAD><BODY>
<H1>302 Moved</H1>
The document has moved
<A HREF="http://www.google.ie/?gws_rd=cr&amp;dcr=0&amp;ei=BWe1WYrfIojUgAbqpIbwDg">here</A>.
</BODY></HTML>
Connection closed by foreign host.

As shown at the end of each example, the connection is closed; to send another HTTP command, you need to reopen the connection. To avoid this step, you can use HTTP/1.1 (which keeps the connection open by default, as I discuss later) by entering HTTP/1.1 after the requested resource:

GET / HTTP/1.1↵↵

Note that if you’re using HTTP/1.0 or HTTP/1.1, you must press Return twice to tell the web server that you’re finished sending the HTTP request. In the next section, I discuss why this double return/blank line is required for HTTP/1.0 and HTTP/1.1 connections.

After the server responds, you can reissue the GET command to get the page again. In reality, web browsers usually use this open connection to get other resources rather than the same resource again, but the concept is the same.

Technically, to abide by the HTTP/1.1 specification, HTTP/1.1 requests also require you to specify the host header, for reasons that I (again) discuss later. For these simple examples, however, don’t worry about this requirement too much, because Google doesn’t seem to insist on it (although if you’re using websites other than www.google.com, you may see unexpected results).

As you can see, the basic HTTP syntax is simple. It’s a text-based request-and-response format, although this format changes under HTTP/2 when it moves to a binary format.

If you’re requesting nontext data such as an image, a program like Telnet won’t be sufficient: gobbledygook will appear in the terminal session as Telnet tries, and fails, to convert the binary image format to meaningful text.

I no longer use Telnet, because much better tools are available for viewing the details of an HTTP request, but this exercise is useful for explaining the format of an HTTP message and showing how simple the initial versions of the protocol were.

As I mention earlier, the key to the success of HTTP is its simplicity, which makes it relatively easy to implement at a service level. Therefore, almost any computer with network abilities, from complex servers to light bulbs in the IoT world, can implement HTTP and immediately provide useful commands across a network. Implementing a fully HTTP-compliant web server is a much more arduous task. Similarly, web browsers are hugely complex and have myriad other protocols to contend with after a web page has been fetched over HTTP (including HTML, CSS, and JavaScript used to display the page it has fetched). But creating a simple service that listens for an HTTP GET request and responds with data isn’t difficult. The simplicity of HTTP has also led to the boom in the microservices architectural style, in which an application is broken into many independent web services, often based on lighter application servers such as Node.js (Node).
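
To illustrate how little is needed, here’s a toy sketch in Python (Node would work equally well; the port and response text are arbitrary choices for this example) of a service that listens for an HTTP GET request and responds with data. A real web server needs far more, including error handling, concurrency, and full protocol compliance:

import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("", 8080))  # listen on an arbitrary nonstandard port
server.listen(1)

while True:
    connection, _ = server.accept()
    request = connection.recv(1024).decode(errors="replace")
    if request.startswith("GET"):
        body = "Hello from a tiny HTTP service\n"
        connection.sendall((
            "HTTP/1.0 200 OK\r\n"
            "Content-Type: text/plain\r\n"
            f"Content-Length: {len(body)}\r\n"
            "\r\n"
            + body
        ).encode())
    connection.close()

You can test such a service with telnet localhost 8080 (or nc localhost 8080) and type GET /, exactly as in the earlier examples.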

1.3. The syntax and history of HTTP

HTTP was started by Tim Berners-Lee and his team at the CERN research organization in 1989. It was intended to be a way of implementing a web of interconnecting computers to provide access to research and link them so they could easily reference one another in real time; a click of a link would open an associated document. The idea for such a system had been around for a long time, and the term hypertext was coined in the 1960s. With the growth of the internet during the 1980s, it was possible to implement this idea. During 1989 and 1990, Berners-Lee published a proposal[6] to build such a system; he went on to build the first web server based on HTTP and the first web browser to request HTML documents and display them.

1.3.1. HTTP/0.9

The first published specification for HTTP was version 0.9, issued in 1991. The specification[7] is small at fewer than 700 words. It specifies that a connection is made over TCP/IP (or a similar connection-oriented service) to a server and optional port (80 to be used if no port is specified). A single line of ASCII text should be sent, consisting of GET, the document address (with no spaces), and a carriage return and line feed (the carriage return being optional). The server is to respond with a message in HTML format, which it defines as “a byte stream of ASCII characters.” It also states, “The message is terminated by the closing of the connection by the server,” which is why the connection was closed after each request in previous examples. On handling errors, the specification states: “Error responses are supplied in human-readable text in HTML syntax. There is no way to distinguish an error response from a satisfactory response except for the content of the text.” It ends with this text: “Requests are idempotent. The server need not store any information about the request after disconnection.” This specification gives us the stateless part of HTTP, which is both a blessing (in its simplicity) and a curse (due to the way that technologies such as HTTP cookies had to be tacked on to allow state tracking, which is necessary for complex applications).

Following is the only possible command in HTTP/0.9:

GET /section/page.html↵

The requested resource (/section/page.html) can change, of course, but the rest of the syntax is fixed.

There was no concept of HTTP header fields (herein known as HTTP headers) or any other media, such as images. It’s amazing to think that this simple request/response protocol, intended to provide easy access to information in a research institute, quickly spawned the media-rich World Wide Web that is so ingrained in the world today. Even from an early stage, Berners-Lee called his invention the WorldWideWeb (without the spaces we use today), again showing his foresight of the scope of the project and plans for it to be a global system.

1.3.2. HTTP/1.0

The WorldWideWeb was an almost-instant success. According to Netcraft,[8] by September 1995 there were 19,705 hostnames on the web. A month later, this figure jumped to 31,568 and has grown at a furious rate ever since. At this writing, we’re approaching 2 billion websites. By 1995, the limitations of the simple HTTP/0.9 protocol were apparent, and most web servers had already implemented extensions that went way beyond the 0.9 specification. The HTTP Working Group (HTTP WG), headed by Dave Raggett, started working on HTTP/1.0 in an attempt to document the “common usage of the protocol.” The document was published in May 1996 as RFC 1945.[9] An RFC (Request for Comments) document is published by the IETF; it can be accepted as a formal standard or be left as informal documentation.[10] The HTTP/1.0 RFC is the latter and is not a formal specification. It describes itself as a “memo” at the top, stating, “This memo provides information for the internet community. This memo does not specify an internet standard of any kind.”

10. An excellent post on reading and understanding RFCs is at https://www.mnot.net/blog/2018/07/31/read_rfc.

Regardless of the formal status of the RFC, HTTP/1.0 added some key features, including

  • More request methods: HEAD and POST were added to the previously defined GET.
  • Addition of an optional HTTP version number for all messages. HTTP/0.9 was assumed by default to aid in backward compatibility.
  • HTTP headers, which could be sent with both the request and the response to provide more information about the resource being requested and the response being sent.
  • A three-digit response code indicating (for example) whether the response was successful. This code also enabled redirect requests, conditional requests, and error status (404 - Not Found being one of the best known).

These much-needed enhancements of the protocol happened organically through use, and HTTP/1.0 was intended to document what was already happening with many web servers in the real world, rather than define new options. These additional options opened a wealth of new opportunities to the web, including the ability to add media to web pages for the first time by using response HTTP headers to define the content type of the data in the body.

HTTP/1.0 methods

The GET method stayed much the same as under HTTP/0.9, though the addition of headers allowed a conditional GET (an instruction to GET only if this resource has changed since the last time the client got it; otherwise, tell the client that the resource hasn’t changed and to carry on using that old copy). Also, as I mentioned earlier, users could GET more than hypertext documents and use HTTP to download images, videos, or any sort of media.
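
As a sketch of a conditional GET (the resource and date are illustrative), the client sends an If-Modified-Since header, and if the resource hasn’t changed, the server can reply with a short 304 response instead of the full body:

GET /page.html HTTP/1.0
If-Modified-Since: Sun, 25 Jun 2017 13:30:24 GMT

HTTP/1.0 304 Not Modified
Date: Sun, 25 Jun 2017 14:00:00 GMT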

The HEAD method allowed a client to get all the meta information (such as the HTTP headers) for a resource without downloading the resource itself. This method is useful for many reasons. A web crawler like Google’s, for example, can check whether a resource has been modified and download it only if it has, thus saving resources for both itself and the web server.
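
A HEAD exchange might look like the following (header values are illustrative). The response carries the same headers a GET would return, but no body follows them:

HEAD /page.html HTTP/1.0

HTTP/1.0 200 OK
Date: Sun, 25 Jun 2017 13:30:24 GMT
Content-Type: text/html
Content-Length: 12345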

The POST method was more interesting, allowing the client to send data to a web server. Rather than put a new HTML file directly on the server by using standard file-transfer methods, users could POST the file by using HTTP, provided that the web server was set up to receive the data and do something with it. POST isn’t limited to whole files; it can be used for much smaller parts of data. Forms on websites typically use POST, with the contents of the form being sent as field/value pairs in the body of the HTTP request. The POST method, therefore, allowed content to be sent from the client to the server as part of an HTTP request, representing the first time that an HTTP request could have a body, like HTTP responses.
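
A form submission sent with POST might look like the following sketch (the path and field names are made up for illustration). The field/value pairs travel in the body, after the blank line that ends the headers:

POST /login HTTP/1.0
Content-Type: application/x-www-form-urlencoded
Content-Length: 31

username=alice&password=letmein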

In fact, GET allows data to be sent in query parameters that are specified at the end of a URL, after the ? character. https://www.google.com/?q=search+string, for example, tells Google that you’re searching for the term search string. Query parameters were in the earliest Uniform Resource Identifier (URI) specification,[11] but they were intended to provide additional parameters to clarify the URI rather than to serve as a way of uploading data to a web server. URLs are also limited in terms of length and content (binary data can’t be sent here, for example), and some confidential data (passwords, credit card data, and so on) shouldn’t be stored in a URL, as it is easily visible on the screen and in browser history, or may be included if the URL is shared. POST, therefore, is often a better way of sending data, because the data isn’t as visible (though care should still be taken with this data when it’s sent over plain HTTP rather than secure HTTPS, as I discuss later). Another difference is that a GET request is idempotent whereas a POST request is not, meaning that multiple GET requests to the same URL should always return the same result, whereas multiple POST requests to the same URL may not. If you refresh a standard page on a website, for example, it should show the same thing. If you refresh a confirmation page from an e-commerce website, your browser may ask whether you’re sure that you want to resubmit the data, which may result in your making an additional purchase (though e-commerce sites should write their applications to ensure that this situation doesn’t happen!).

HTTP request headers

Whereas HTTP/0.9 had a single line to GET a resource, HTTP/1.0 introduced HTTP headers. These headers allowed a request to provide the server additional information, which it could use to decide how to process the request. HTTP headers are provided on separate lines after the original request line. An HTTP GET request changed from this

GET /page.html↵

to this

GET /page.html HTTP/1.0↵
Header1: Value1↵
Header2: Value2↵
↵

or (without headers) to

GET /page.html HTTP/1.0↵
↵

That is, an optional version section was added to the initial line (default was HTTP/0.9 if not specified), and an optional HTTP header section was followed by two carriage return/newline characters (henceforth called return characters for brevity) at the end instead of one. The second newline was necessary to send a blank line, which was used to indicate that the (optional) request header section was complete.

HTTP headers are specified with a header name, a colon, and then the header content. The header name (though not the content) is case-insensitive, per the specification. Headers can span multiple lines when each new line starts with a space or tab, but this practice isn’t recommended; few clients or servers use this format, and some may fail to process such headers correctly. Multiple headers of the same type may be sent; they’re semantically identical to a single header with comma-separated values. As a result

GET /page.html HTTP/1.0↵
Header1: Value1↵
Header1: Value2↵

should be treated the same way as

GET /page.html HTTP/1.0↵
Header1: Value1, Value2↵

Although HTTP/1.0 defined some standard headers, this example also demonstrates that HTTP/1.0 allows custom headers (Header1, in this example) to be provided without requiring an updated version of the protocol. The protocol was designed to be extensible. The specification, however, explicitly states that “these fields cannot be assumed to be recognizable by the recipient” and may be ignored, whereas the standard headers should be processed by an HTTP/1.0-compliant server.

A typical HTTP/1.0 GET request is

GET /page.html HTTP/1.0↵
Accept: text/html,application/xhtml+xml,image/jxr/,*/*↵
Accept-Encoding: gzip, deflate, br↵
Accept-Language: en-GB,en-US;q=0.8,en;q=0.6↵
Connection: keep-alive↵
Host: www.example.com↵
User-Agent: MyAwesomeWebBrowser 1.1↵↵

This example tells the server what formats you can accept the response in (HTML, XHTML, XML, and so on), that you can accept various encodings (such as gzip, deflate, and brotli, which are compression algorithms used to compress data sent over HTTP), what languages you prefer (GB English, followed by US English, followed by any other form of English), and what browser you’re using (MyAwesomeWebBrowser 1.1). It also tells the server to keep the connection open (a topic that I return to later). The whole request is completed with the two return characters. From here on, I exclude the return characters for readability; you can assume that the last line in each request is followed by two return characters.

HTTP response codes

A typical response from an HTTP/1.0 server is

HTTP/1.0 200 OK
Date: Sun, 25 Jun 2017 13:30:24 GMT
Content-Type: text/html
Server: Apache

<!doctype html>
<html>
<head>
...etc.

The rest of the HTML provided follows. As you see, the first line of the response is the HTTP version of the response message (HTTP/1.0), a three-digit HTTP status code (200), and a text description of that status code (OK). Status codes and descriptions were new concepts under HTTP/1.0; under HTTP/0.9, there was no such concept as a response code; errors could be given only in the returned HTML itself. Table 1.1 shows the HTTP response codes defined in the HTTP/1.0 specification.

Table 1.1. HTTP/1.0 response codes

| Category | Value | Description | Details |
| --- | --- | --- | --- |
| 1xx (informational) | N/A | N/A | HTTP/1.0 doesn’t define any 1xx status codes, but does define the category. |
| 2xx (successful) | 200 | OK | This code is the standard response code for a successful request. |
| | 201 | Created | This code should be returned for a POST request. |
| | 202 | Accepted | The request is being processed but hasn’t completed processing yet. |
| | 204 | No content | The request has been accepted and processed, but there’s no BODY response to send back. |
| 3xx (redirection) | 300 | Multiple choices | This code isn’t used directly. It explains that the 3xx category implies that the resource is available at one (or more) locations, and the exact response provides more details on where it is. |
| | 301 | Moved permanently | The Location HTTP response header should provide the new URL of the resource. |
| | 302 | Moved temporarily | The Location HTTP response header should provide the new URL of the resource. |
| | 304 | Not modified | This code is used for conditional responses in which the BODY doesn’t need to be sent again. |
| 4xx (client error) | 400 | Bad request | The request couldn’t be understood and should be changed before resending. |
| | 401 | Unauthorized | This code usually means that you’re not authenticated. |
| | 403 | Forbidden | This code usually means that you’re authenticated, but your credentials don’t have access. |
| | 404 | Not found | This code is probably the best-known HTTP status code, as it often appears on error pages. |
| 5xx (server error) | 500 | Internal server error | The request couldn’t be completed due to a server-side error. |
| | 501 | Not implemented | The server doesn’t recognize the request (such as an HTTP method that hasn’t yet been implemented). |
| | 502 | Bad gateway | The server is acting as a gateway or proxy and received an error from the downstream server. |
| | 503 | Service unavailable | The server is unable to fulfill the request, perhaps because the server is overloaded or down for maintenance. |

Astute readers may notice some codes (203, 303, 402) missing from the sequence; these appeared in earlier drafts of the HTTP/1.0 RFC but were excluded from the final published version. Some of them returned in HTTP/1.1, though often with different descriptions and meanings. The Internet Assigned Numbers Authority (IANA) maintains the full list of HTTP status codes across all versions of HTTP, but the status codes in table 1.1, first defined in HTTP/1.0,[12] represent the most typically used status codes.

It may also be apparent that some of the responses could overlap. You may wonder, for example, whether an unrecognized request is a 400 (bad request) or a 501 (not implemented). The response codes are designed to be broad categories, and it’s up to each application to use the status code that fits best. The specification also stated that response codes were extensible, so new codes could be added as needed without changing the protocol. This is another reason why the response codes are categorized. A new response code (such as 504) may not be understood by an existing HTTP/1.0 client, but it would know that the request failed for some reason on the server side and could handle it the way it handles other 5xx response codes.
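
As a minimal sketch of that fallback logic (in Python; the code and messages are my illustration, not from any real client), a client that doesn’t recognize a specific status code can still act on its category by looking at the first digit:

status = 504              # a code this hypothetical HTTP/1.0-era client doesn't know
category = status // 100  # 5, the server-error category
if category == 2:
    outcome = "success; process the response body"
elif category == 3:
    outcome = "redirection; follow the Location header"
elif category == 4:
    outcome = "client error; fix the request before resending"
elif category == 5:
    outcome = "server error; the request may be retried later"
else:
    outcome = "informational or unknown category"
print(outcome)  # prints the server-error branch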

HTTP response headers

After the first response line come zero or more HTTP response header lines; request headers and response headers follow the same format. The headers are followed by two return characters and then the body content, shown here at the end of the response:

GET /
HTTP/1.0 302 Found
Location: http://www.google.ie/?gws_rd=cr&dcr=0&ei=BWe1WYrf123456qpIbwDg
Cache-Control: private
Content-Type: text/html; charset=UTF-8
Date: Sun, 10 Sep 2017 16:23:33 GMT
Server: gws
Content-Length: 268
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>302 Moved</TITLE></HEAD><BODY>
<H1>302 Moved</H1>
The document has moved
<A HREF="http://www.google.ie/?gws_rd=cr&amp;dcr=0&amp;ei=BWe1WYrfIojUgAbqpIbwDg">here</A>.
</BODY></HTML>
Connection closed by foreign host.

With the publication of HTTP/1.0, the HTTP syntax was greatly expanded to make it capable of creating dynamic, feature-rich applications beyond the simple document repository fetching that the initial published version HTTP/0.9 allowed. HTTP was also getting more complicated, expanding from the approximately 700-word HTTP/0.9 specification to the nearly 20,000-word HTTP/1.0 RFC. Even as this specification was published, however, the HTTP Working Group saw it as a stopgap to document current use and was already working on HTTP/1.1. As I mentioned earlier, HTTP/1.0 was published mostly to bring some standards and documentation to HTTP as it was being used in the wild, rather than to define any new syntax for clients and servers to implement. In addition to the new response codes, other methods (such as PUT, DELETE, LINK, and UNLINK) and additional HTTP headers in use at the time were listed in the appendices of the RFC, some of which would be standardized in HTTP/1.1. The success of HTTP was such that the working group struggled to keep up with the implementations only five short years after it was launched to the world.

1.3.3. HTTP/1.1

As you’ve seen, HTTP was launched as version 0.9, a basic way of fetching text-based documents. It was expanded beyond text into the more fully fledged HTTP/1.0 protocol, which was further standardized and refined in HTTP/1.1. As the versioning suggests, HTTP/1.1 was more a tweak of HTTP/1.0 than a radical change to the protocol; moving from 0.9 to 1.0, with the addition of HTTP headers, was a much bigger change. HTTP/1.1 made some further improvements to allow optimal use of the HTTP protocol (such as persistent connections, a mandatory Host header, better caching options, and chunked encoding). Perhaps more important, it provided a formal standard on which to base the future of the World Wide Web. Although the basics of HTTP are simple enough to understand, there are many intricacies that could be implemented in slightly different ways, and the lack of a formal standard made it difficult to scale the web.

The first HTTP/1.1 specification was published in January 1997[13] (only nine months after the HTTP/1.0 specification). It was replaced by an updated specification in June 1999[14] and then enhanced a third time in June 2014.[15] Each version made the previous ones obsolete. The HTTP/1.1 specification grew to span 305 pages and nearly 100,000 words, which shows how much this simple protocol had grown and how important it had become to clarify the intricacies of how HTTP should be used. In fact, at this writing the specification is being updated again,[16] and this update is expected to be published early in 2019. Fundamentally, HTTP/1.1 isn’t too different from HTTP/1.0, but the explosion of the web in the first two decades of its existence gave rise to additional features and required documentation showing exactly how to use it.

Describing HTTP/1.1 would take a book in itself, but I attempt here to discuss the main points, to provide background and context for some of the HTTP/2 discussions later in this book. Many of the additional features of HTTP/1.1 were introduced through HTTP headers, which themselves were introduced in HTTP/1.0, meaning that the fundamental structure of HTTP didn’t change between the two versions, although making the host header mandatory and adding persistent connections were two notable changes in the syntax from HTTP/1.0.

Mandatory Host header

The URL provided with HTTP request lines (such as a GET command) isn’t an absolute URL (such as http://www.example.com/section/page.html) but a relative URL (such as /section/page.html). When HTTP was created, it was assumed that a web server would host only one website (though possibly many sections and pages on that site). Therefore, the host part of the URL was obvious, because a user had to be connected to that web server before making HTTP requests. Nowadays, many web servers host several sites on the same server (a situation known as virtual hosting), so it’s important to tell the server which site you want as well as which relative URL you want on that site. This feature could have been implemented by changing the URL in the HTTP requests to the full, absolute URL, but it was thought this change would have broken many existing web servers and clients. Instead, the feature was implemented by adding a Host header in the request:

GET / HTTP/1.1
Host: www.google.com

This header was optional in HTTP/1.0, but HTTP/1.1 made it mandatory. The following request is technically badly formed, as it specifies HTTP/1.1 but doesn’t provide a Host header:

GET / HTTP/1.1

According to the HTTP/1.1 specification,[17] this request should be rejected by the server (with a 400 response code), though most web servers are more forgiving than they should be and have a default host that is returned for such requests.
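
A server that does enforce the rule might reply along these lines (the exact headers and body vary by server; this is only a sketch):

HTTP/1.1 400 Bad Request
Content-Type: text/html
Content-Length: 50

<html><body><h1>400 Bad Request</h1></body></html>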

Making the Host header mandatory was an important step in HTTP/1.1, allowing servers to make more use of virtual hosting and therefore allowing the enormous growth of the web without the complexity of adding individual web servers for each site. Additionally, the relatively low limit of IPv4 addresses would have been reached much sooner without this change. On the other hand, had virtual hosting not been possible, perhaps exhausting those addresses would have forced the move to IPv6 much earlier; instead, IPv6 is still being rolled out at this writing, despite having been around for more than 20 years!

Specifying a mandatory Host header field instead of changing the relative URL to an absolute URL involved some contention.[18] HTTP proxies, introduced with HTTP/1.1, allowed connection to an HTTP server via an intermediary HTTP server. The syntax for proxies was already set to require full absolute URLs for all requests, but actual web servers (also called origin servers) were mandated to use the Host header. As I mentioned earlier, this change was necessary to avoid breaking existing servers, but making the header mandatory left no doubt that HTTP/1.1 clients and servers must use virtual-hosting-style requests to be fully compliant HTTP/1.1 implementations. It was thought that some future version of HTTP would deal with this situation better. The HTTP/1.1 specification states, “To allow for transition to the absolute-form for all requests in some future version of HTTP, a server MUST accept the absolute-form in requests, even though HTTP/1.1 clients will only send them in requests to proxies.” Nevertheless, as you’ll see later, HTTP/2 didn’t resolve this problem cleanly, instead replacing the Host header with the :authority pseudoheader field (see chapter 4).

Persistent connections (aka Keep-Alives)

Another important change, supported by many HTTP/1.0 servers even though it wasn’t included in the HTTP/1.0 specification, was the introduction of persistent connections. Initially, HTTP was a single request-and-response protocol: a client opens the connection, requests a resource, gets the response, and the connection is closed. As the web became more media-rich, this closing of the connection proved to be wasteful. Displaying a single page required several HTTP resources, so closing the connection only to reopen it caused unnecessary delays. This problem was resolved with a new Connection HTTP header that could be sent with an HTTP/1.0 request. By specifying the value Keep-Alive in this header, the client asks the server to keep the connection open to allow the sending of additional requests:

GET /page.html HTTP/1.0
Connection: Keep-Alive

The server would respond as usual, but if it supported persistent connections, it included a Connection: Keep-Alive header in the response:

HTTP/1.0 200 OK
Date: Sun, 25 Jun 2017 13:30:24 GMT
Connection: Keep-Alive
Content-Type: text/html
Content-Length: 12345
Server: Apache

<!doctype html>
<html>
<head>
...etc.

This response tells the client that it can send another request on the same connection as soon as the response is completed, so the server doesn’t have to close the connection to the client only to reopen it. It can be more complicated to know when the response is complete when you use persistent connections; the connection closing is a pretty good sign that the server has finished sending for a nonpersistent connection! Instead, the Content-Length HTTP header must be used to define the length of the response body, and when the entire body has been received, the client is free to send another request.
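
Here’s a rough sketch in Python of that logic (assuming the server supports Keep-Alive and includes a Content-Length header; real clients handle many more cases, such as missing lengths and dropped connections):

import socket

request = (
    "GET / HTTP/1.0\r\n"
    "Host: www.example.com\r\n"
    "Connection: Keep-Alive\r\n"
    "\r\n"
).encode()

sock = socket.create_connection(("www.example.com", 80))
sock.sendall(request)

# Read until the blank line that ends the header section.
data = b""
while b"\r\n\r\n" not in data:
    data += sock.recv(1024)
headers, _, body = data.partition(b"\r\n\r\n")

# Find Content-Length so we know exactly where this response body ends.
length = 0
for line in headers.split(b"\r\n")[1:]:
    name, _, value = line.partition(b":")
    if name.strip().lower() == b"content-length":
        length = int(value.strip())

# Keep reading until the whole body has arrived...
while len(body) < length:
    body += sock.recv(1024)

# ...at which point the same connection is free for the next request.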

An HTTP connection can be closed at any point by either the client or the server. Closing may occur accidentally (perhaps due to network connectivity errors) or deliberately (if, for example, a connection isn’t used for a while and a server decides to close the connection to regain some resources for other connections). Therefore, even with persistent connections, both clients and servers should monitor the connections and be able to handle unexpectedly closed connections. The situation becomes more complicated with certain requests. If you’re checking out on an e-commerce website, for example, you may not want to resend the request without checking whether the server processed the initial request.

HTTP/1.1 not only added this persistent-connection process to the documented standard, but also went one step further and changed it to the default. Any HTTP/1.1 connection could be assumed to be using persistent connections even without the presence of the Connection: Keep-Alive header in the response. If the server did want to close the connection, for whatever reason, it had to explicitly include a Connection: close HTTP header in the response:

HTTP/1.1 200 OK
Date: Sun, 25 Jun 2017 13:30:24 GMT
Connection: close
Content-Type: text/html; charset=UTF-8
Server: Apache

<!doctype html>
<html>
<head>
...etc.
Connection closed by foreign host.

I touched on this topic in the Telnet examples earlier in this chapter. Now you can use Telnet again to send the following:

  • An HTTP/1.0 request without a Connection: Keep-Alive header. You should see that the connection is automatically closed by the server after the response is sent.
  • The same HTTP/1.0 request, but with a Connection: Keep-Alive header. You should see that the connection is kept open.
  • An HTTP/1.1 request, with or without a Connection: Keep-Alive header. You should see that the connection is kept open by default.

It’s not unusual to see HTTP/1.1 clients include this Connection: Keep-Alive header for HTTP/1.1 requests, despite the fact that it’s the default and should be assumed. Similarly, servers sometimes include the header in HTTP/1.1 responses despite this being unnecessary.

On a similar topic, HTTP/1.1 added the concept of pipelining, so it should be possible to send several requests over the same persistent connection and get the responses back in order. If a web browser is processing an HTML document, for example, and sees that it needs a CSS file and a JavaScript file, it should be able to send the requests for these files together and get the responses back in order rather than waiting for the first response before sending the second request. Here’s an example:

GET /style.css HTTP/1.1
Host: www.example.com

GET /script.js HTTP/1.1
Host: www.example.com

HTTP/1.1 200 OK
Date: Sun, 25 Jun 2017 13:30:24 GMT
Content-Type: text/css; charset=UTF-8
Content-Length: 1234
Server: Apache

.style {
...etc.

HTTP/1.1 200 OK
Date: Sun, 25 Jun 2017 13:30:25 GMT
Content-Type: application/x-javascript; charset=UTF-8
Content-Length: 5678
Server: Apache

function(
...etc.

For several reasons (which I go into in chapter 2), pipelining never took off, and support for it in both clients (browsers) and servers is poor. So, although persistent connections allow the TCP connection to be reused for multiple requests, which was a good performance improvement, HTTP/1.1 is still fundamentally a request-and-response protocol for most implementations. While that one request is being handled, the HTTP connection is blocked from being used for other requests.

Other new features

HTTP/1.1 introduced many other features, including

  • New methods in addition to the GET, POST, and HEAD methods defined in HTTP/1.0. These methods are PUT, OPTIONS, and the less-used CONNECT, TRACE, and DELETE.
  • Better caching methods. These methods allowed the server to instruct the client to store the resource (such as a CSS file) in the browser’s cache so it could be reused later if required. The Cache-Control HTTP header introduced in HTTP/1.1 had more options than the Expires header from HTTP/1.0 (see the example after this list).
  • HTTP cookies, which allow state to be tracked across requests, moving HTTP on from being a purely stateless protocol.
  • The introduction of character sets (as shown in some examples in this chapter) and language in HTTP responses.
  • Proxy support.
  • Authentication.
  • New status codes.
  • Trailing headers (discussed in chapter 4, section 4.3.3).
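
As a simple illustration of the caching improvement mentioned in the list above (the header values are hypothetical), HTTP/1.0 could give only an absolute expiry time:

Expires: Sun, 25 Jun 2017 14:30:24 GMT

whereas HTTP/1.1’s Cache-Control can express a relative lifetime plus other directives:

Cache-Control: public, max-age=3600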

HTTP has continually added new HTTP headers to further expand its capabilities, many for performance or security reasons. The HTTP/1.1 specification doesn’t claim to be the definitive end of HTTP/1.1 and actively encourages new headers, even dedicating a section[19] to how headers should be defined and documented. As I mention earlier, some of these headers are added for security reasons and are used to allow the website to tell the web browser to turn on certain optional security protections, so they require no implementation on the server side (other than the ability to send the header). At one time, there was a convention to include an X- prefix in these headers to show that they weren’t formally standardized (X-Content-Type-Options, X-Frame-Options, X-XSS-Protection), but this convention has been deprecated,[20] and new experimental headers are difficult to differentiate from headers in the HTTP/1.1 specification. Often, these headers are standardized in their own RFCs (Content-Security-Policy,[21] Strict-Transport-Security,[22] and so on).
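
For example, a response that opts in to several of these browser-side protections might include headers like the following (an illustrative mix of the older X- prefixed headers and their standardized successors; the values shown are common choices, not the only options):

HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
X-Frame-Options: DENY
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Strict-Transport-Security: max-age=31536000
Content-Security-Policy: default-src 'self'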

1.4. Introduction to HTTPS

HTTP was originally a plain-text protocol. HTTP messages are sent across the internet unencrypted and therefore are readable by any party that sees a message as it’s routed to its destination. The internet, as the name suggests, is a network of computers, not a point-to-point system. The internet provides no control over how messages are routed, and you, as an internet user, have no idea how many other parties will see your messages as they’re sent across the internet from your internet service provider (ISP) to telecom companies and other parties. Because HTTP is plain text, messages can be intercepted, read, and even altered en route.

HTTPS is the secure version of HTTP that encrypts messages in transit by using the Transport Layer Security (TLS) protocol, though it’s often known by its previous incarnation as Secure Sockets Layer (SSL), as discussed in the sidebar below.

HTTPS adds three important concepts to HTTP messages:

  • Encryption —Messages can’t be read by third parties while in transit.
  • Integrity —The message hasn’t been altered in transit, because each encrypted message carries a message authentication code (MAC) that is verified on receipt.
  • Authentication —The server is the one you intended to talk to.

SSL, TLS, HTTPS, and HTTP

HTTPS uses SSL or TLS to provide encryption. SSL (Secure Sockets Layer) was invented by Netscape. SSLv1 was never released outside Netscape, so the first production version was SSLv2, released in 1995. SSLv3, released in 1996, addressed some insecurities.

As SSL was owned by Netscape, it wasn’t a formal internet standard, though it was subsequently published by the IETF as a historic document.[a] SSL was standardized as TLS (Transport Layer Security). TLSv1.0[b] was similar to SSLv3, though not compatible. TLSv1.1[c] and TLSv1.2[d] followed in 2006 and 2008, respectively, and were more secure. TLSv1.3 was approved as a standard in 2018;[e] it’s more secure and performant,[f] though it will take time to become widespread.

Despite the availability of these newer, more secure, standardized versions, SSLv3 was considered to be good enough by many people, so it was the de facto standard for a long time, even though many clients supported TLSv1.0 as well. In 2014, however, major vulnerabilities were discovered in SSLv3,[g] which must no longer be used[h] and is no longer supported by browsers. This situation started a major move toward TLS. After similar vulnerabilities were found in TLSv1.0, security guidelines insisted that TLSv1.1 or later be used.[i]

The net effect of all this history is that people use these acronyms in different ways. Many people still refer to encryption as SSL because it was the standard for so long; others use SSL/TLS or TLS. Some people try to avoid the debate by referring to it as HTTPS, even though this term isn’t strictly correct.

In general in this book, I refer to encryption as HTTPS (rather than SSL or SSL/TLS) unless I’m specifically talking about specific parts of the TLS protocol. On a similar note, I refer to the core semantics of HTTP as HTTP, whether it’s used over an unencrypted HTTP connection or an encrypted HTTPS connection.

HTTPS works by using public key encryption, which allows servers to provide public keys in the form of digital certificates when users first connect. Your browser encrypts messages by using this public key, which only the server can decrypt, as only it has the corresponding private key. This system allows you to communicate securely with a website without having to know a shared secret key in advance, which is crucial for a system like the internet, where new websites and users come and go every second of every day.
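
If you’re curious what one of these digital certificates looks like, OpenSSL can fetch and summarize the certificate a server presents (these are standard openssl subcommands; the output shown here is abbreviated and will differ for you):

openssl s_client -connect www.google.com:443 < /dev/null 2> /dev/null |
  openssl x509 -noout -subject -issuer -dates

subject=CN = www.google.com
issuer=...etc.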

The digital certificates are issued, and digitally signed, by various certificate authorities (CAs) trusted by the browser, which is why it’s possible to authenticate that the public key is for the server you’re connecting to. One big problem with HTTPS is that it indicates only that you’re connecting to that server—not that the server is trustworthy. Fake phishing sites can be set up easily with HTTPS for a different but similar domain (exmplebank.com instead of examplebank.com). HTTPS sites are usually shown with a green padlock in web browsers, which many users take to mean safe, but it merely means securely encrypted.

Some CAs do some extra vetting on websites when they issue certificates and provide an Extended Validation certificate (known as an EV certificate), which encrypts the HTTP traffic the same way as a normal certificate but also displays the company name in most web browsers, as shown in figure 1.4.

Figure 1.4. HTTPS web browser indicators

Many people dispute the benefits of EV certificates,[23] mostly because the vast majority of users don’t notice the company name and don’t act any differently on sites that use EV or standard Domain Validated (DV) certificates. Organization Validated (OV) certificates occupy a middle ground: they involve some of the extra checks but produce no extra notification in browsers, making them largely pointless at a technical level (though CAs may include extra support commitments as part of purchasing them).

The Google Chrome team is researching and experimenting with these security indicators at the time of this writing,[24] trying to remove what it sees as unnecessary information, including the scheme (http and https), any www prefix, and possibly even the padlock itself (instead assuming that HTTPS is the norm and that HTTP should be explicitly marked as not secure). The team is also considering whether to remove EV.[25]

HTTPS is built around HTTP and is almost seamless to the HTTP protocol itself. It’s hosted on a different port by default (port 443 as opposed to port 80 for standard HTTP), and it has a different URL scheme (https:// as opposed to http://), but it doesn’t fundamentally alter the way HTTP is used in terms of syntax or message format except for the encryption and decryption itself.

When the client connects to an HTTPS server, it goes through a negotiation stage (the TLS handshake). In this process, the server provides the public key, the client and server agree on the encryption methods to use, and then the two sides negotiate a shared encryption key for future use. (Public key cryptography is slow, so public keys are used only to negotiate a shared secret, which is then used to encrypt future messages with much better performance.) I discuss the TLS handshake in detail in chapter 4 (section 4.2.1).
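
You can see the outcome of this negotiation by using OpenSSL (the same s_client command you’ll see again shortly); the -brief option, available in OpenSSL 1.1.0 and later, summarizes the agreed protocol version and cipher suite, though the exact values depend on the server and your OpenSSL build:

openssl s_client -brief -connect www.google.com:443
CONNECTION ESTABLISHED
Protocol version: TLSv1.3
Ciphersuite: TLS_AES_256_GCM_SHA384
...etc.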

After the HTTPS session is set up, standard HTTP messages are exchanged. The client and server encrypt these messages before sending and decrypt upon receipt, but to the average web developer or server manager, there’s no difference between HTTPS and HTTP after it’s configured. Everything happens transparently unless you’re looking at the raw messages sent across the network. HTTPS wraps up standard HTTP requests and responses rather than replacing them with another protocol.

HTTPS is a huge topic that’s well beyond the scope of this book. I touch on it again briefly in future chapters, as HTTP/2 does bring in some changes. But for now, it’s important only to know that HTTPS exists and that it works at a lower level than HTTP (between TCP and HTTP). Unless you’re looking at the encrypted messages themselves, you won’t see any real difference between HTTP and HTTPS.

For web servers using HTTPS, you need a client that can understand HTTPS and handle the encryption and decryption, so you can no longer use Telnet to send example HTTP requests to those servers. The OpenSSL program provides an s_client command that you can use to send HTTP commands to an HTTPS server, similar to the way Telnet is used (the -crlf option converts line endings to the CRLF format that HTTP requires, and -quiet suppresses the certificate details):

openssl s_client -crlf -connect www.google.com:443 -quiet
GET / HTTP/1.1
Host: www.google.com

HTTP/1.1 200 OK
...etc.

We’re reaching the end of the usefulness of command-line tools to examine HTTP requests, however. In the next section, I take a brief look at browser tools, which provide a much better way to see HTTP requests and responses.

1.5. Tools for viewing, sending, and receiving HTTP messages

Although it was helpful to use tools like Telnet to understand the basics of HTTP, command-line tools like the ones discussed here have limitations, not least of which is dealing with the enormous size of most web pages. Several tools allow you to see and send HTTP requests in a better way than Telnet. Many of these tools can be used from the main tool you use to interact with the web: your web browser.

1.5.1. Using developer tools in web browsers

All web browsers come with so-called developer tools, which allow you to see many details behind websites, including HTTP requests and responses.

You launch developer tools by pressing a keyboard shortcut (F12 on Windows for most browsers, or Option+Command+I on Apple computers) or by right-clicking a bit of HTML and choosing Inspect from the contextual menu. Developer tools have various tabs showing the technical details behind the page, but the one you’re most interested in for the purposes of this discussion is the Network tab. If you open the developer tools and then load the page, the Network tab shows all the HTTP requests, and clicking on one of them produces more details, including the request and response headers. Figure 1.5 shows the Chrome developer tools that you get when loading https://www.google.com.

Figure 1.5. Developer tools Network tab in Chrome

The URL is entered at the top in the address bar (1) as usual. Note the padlock and https:// scheme, showing that Google is using HTTPS successfully (though, as mentioned, Chrome may be changing this). The web page is returned below the address bar, again as usual. If you loaded this page with developer tools open, however, you see a new section with various tabs. Clicking the Network tab (2) shows the HTTP requests (3), including information such as the HTTP method (GET), the response status (200), the protocol (http/1.1), and the scheme (https). You can change the columns shown by right-clicking the column headings; the Protocol, Scheme, and Domain columns aren’t shown by default, for example. For some sites (such as Twitter), you see h2 in the Protocol column for HTTP/2, or perhaps even http/2+quic (Google) for an even newer protocol that I discuss in chapter 9.

Figure 1.6 shows what happens when you click the first request (1). The right section is replaced by a tabbed view where you can see the response headers (2) and the request headers (3). I’ve discussed many but not all of these headers in this chapter.

Figure 1.6. Viewing HTTP headers in developer tools in Chrome

HTTPS is handled by the browser, so developer tools show only the HTTP request messages before they’re encrypted and the response messages after they’ve been decrypted. For the most part, HTTPS can be ignored after it’s set up, provided that you have the right tools to handle encryption and decryption for you. Additionally, most browsers’ developer tools show media correctly, so images display properly, and code (HTML, CSS, and JavaScript) can often be formatted for easier reading.

I return to developer tools throughout the book. If you’re not already comfortable with your browser’s developer tools, take some time to explore them on your own site or on popular sites you use.

1.5.2. Sending HTTP requests

Although web browsers’ developer tools are the best way to see raw HTTP requests and responses, they’re surprisingly poor at allowing you to send raw HTTP requests. Other than the address bar, which can be used only to send simple GET requests, and whatever functionality a website has built (to POST via HTML forms, for example), the tools rarely offer you the ability to send any other raw HTTP messages.

The Advanced REST client application[26] gives you a way of sending raw HTTP messages and seeing the responses. Send a GET request (1) for the URL https://www.google.com (2) and click Send (3) to get the response (4), as shown in figure 1.7. Note that the application handles HTTPS for you.

26. https://install.advancedrestclient.com (Note: this must be opened in Chrome.)

Figure 1.7. Advanced REST client application

Using this application is no different from using the browser, but Advanced REST Client also allows you to send other types of HTTP requests (such as POST and PUT) and to set the header or body data to send. Advanced REST Client started life as a Chrome browser extension[27] but has since moved to a separate application. Several browser extensions offer similar functionality, including Postman (Chrome), Rested,[28] RESTClient[29] (Firefox), and RESTMan[30] (Opera).

1.5.3. Other tools for viewing and sending HTTP requests

You can use many other tools to send or view HTTP requests outside the browser. Some of these work from the command line (such as curl,[31] wget,[32] and httpie[33]), and some work with desktop clients (such as SOAP-UI[34]).
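
As a quick sketch, curl’s -v (verbose) flag prints the raw request headers (prefixed with >) and response headers (prefixed with <) alongside the body; the output here is abbreviated and will vary by curl version and server:

curl -v https://www.example.com/
> GET / HTTP/1.1
> Host: www.example.com
> User-Agent: curl/7.58.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: text/html; charset=UTF-8
...etc.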

If you’re looking to view the traffic at a lower level, you may want to consider Chrome’s net-internals page or network sniffer programs such as Fiddler[35] and Wireshark.[36] I look at some of these tools in later chapters when I look at the details of HTTP/2, but for now the tools mentioned in this section should suffice.

Summary

  • HTTP is one of the core technologies of the web.
  • Browsers make multiple HTTP requests to load a web page.
  • The HTTP protocol started as a simple text-based protocol.
  • HTTP has grown more complex, but the basic text-based format hasn’t changed in the past 20 years.
  • HTTPS encrypts standard HTTP messages.
  • Various tools are available for viewing and sending HTTP messages.