4
HOW WEB SERVERS WORK

image

In the previous chapter, you learned how browsers communicate over the internet and render the HTML pages and other resources that make up a website. In this chapter, you’ll learn about how those same HTML pages are constructed by web servers.

By its simplest definition, a web server is a computer program that sends back HTML pages in response to HTTP requests. Modern web servers encompass a much broader range of functionality than this suggests, however. When a browser makes an HTTP request, modern web servers allow code to be executed in order to generate the web page HTML dynamically, and often incorporate content from a database. As a web developer, you’ll spend most of your time writing and testing this type of code.

This chapter covers how developers organize code and resources within a web server. I’ll also pinpoint common weaknesses in web servers that allow security vulnerabilities to occur, and talk about how to avoid these pitfalls.

Static and Dynamic Resources

Web servers serve two types of content in response to HTTP requests: static resources and dynamic resources. A static resource is an HTML file, image file, or other type of file that the web server returns unaltered in HTTP responses. A dynamic resource is code, a script, or a template that the web server executes or interprets in response to an HTTP request. Modern web servers are capable of hosting both static and dynamic resources. Which resource the server executes or returns depends on the URL in the HTTP request. Your web server will resolve URLs according to a configuration file that maps URL patterns to particular resources.

Let’s look at how web servers handle static and dynamic resources.

Static Resources

In the early days of the internet, websites consisted mostly of static resources. Developers coded HTML files by hand, and websites consisted of individual HTML files that were deployed to the web server. The “deployment” of a website required the developer to copy all the HTML files to the web server and restart the server process. When a user wished to visit the website, they would type the website’s URL in their browser. The browser would make an HTTP request to the web server hosting the website, which would interpret the incoming URL as a request for a file on disk. Finally, the web server would return the HTML file as is in the HTTP response.

An example of this is the website for the 1996 movie Space Jam. It consists entirely of static resources, and it’s still online at spacejam.com. Clicking through the site takes us back to a simpler and aesthetically less sophisticated time in web development. If you visit the website, you will notice that each of the URLs like https://www.spacejam.com/cmp/sitemap.html end with a .html suffix, indicating that each web page corresponds to an HTML file on the server.

Tim Berners-Lee’s original vision of the web looked much like the Space Jam website: a network of static files hosted on web servers that would contain all the world’s information.

URL Resolution

Modern web servers handle static resources in much the same way as their older counterparts. To access a resource in a browser, you include the resource name in the URL, and the web server returns the resource file from disk as it’s requested. To display the picture shown in Figure 4-1, the URL includes the resource name /images/hedgehog_in_spaghetti.png, and the web server returns the appropriate file from disk.

image

Figure 4-1: An example of a static resource

Modern web servers have a few additional tricks up their sleeves. A modern web server allows any URL to be mapped to a particular static resource. We would expect the hedgehog_in_spaghetti.png resource to be a file living in the /images directory on the web server, but in fact, the developer can call it anything they choose. By unlinking the URL from the filepath, web servers give developers more freedom to organize their code. This might allow each user to have a different profile image, but use the same path, for instance.

When returning a static resource, modern web servers often add data to the HTTP response or process the static resource before returning it. For example, web servers often dynamically compress large resource files by using the gzip algorithm to reduce the bandwidth used in the response, or add caching headers in HTTP responses to instruct the browser to cache and use a local copy of a static resource if a user views it again within a defined window of time. This makes the website more responsive for the user and reduces the load the server has to handle.

Because static resources are simply files of one form or another, they don’t, by themselves, exhibit much in the way of security vulnerabilities. The process of resolving a URL to a file can introduce vulnerabilities, however. If a user designates certain types of files to be private (for example, the images they upload), you will need to have access control rules defined on the web server. We’ll look at various ways hackers attempt to circumvent access control rules in Chapter 11.

Content Delivery Networks

A modern innovation designed to improve the delivery speeds of static files is the content delivery network (CDN), which will store duplicated copies of static resources in data centers around the world, and quickly deliver those resources to browsers from the nearest physical location. CDNs like Cloudflare, Akamai, or Amazon CloudFront offload the burden of serving large resource files, such as images, to a third party. As such, they allow even small companies to produce responsive websites without a massive server expenditure. Integrating a CDN into your site is usually straightforward, and the CDN service charges a monthly fee depending on the amount of resources you deploy.

Using a CDN also introduces security complications. Integrating with a CDN effectively allows a third party to serve content under your security certificate, so you need to set up your CDN integration securely. We’ll investigate how to securely integrate third-party services such as CDNs in Chapter 14.

Content Management Systems

Plenty of websites still consist of mostly static content. Rather than being coded by hand, these sites are generally built using content management systems (CMSs) that provide authoring tools requiring little to no technical knowledge to write the content. CMSs generally impose a uniform style on the pages and allow administrators to update content directly in the browser.

CMS plug-ins can also provide analytics to track visitors, add appointment management or customer support functions, and even create online stores. This plug-in approach is part of a larger trend of websites using specialized services from third-party companies to build custom features. For example, sites commonly use Google Analytics for customer tracking, Facebook Login for authentication, and Zendesk for customer support. You can add each of these features with a few lines of code and an API key, making it significantly easier to build feature-rich sites from scratch.

Using other people’s code to build your site, either by integrating a CMS or using plug-in services, theoretically makes you more secure because these third parties employ security professionals and have an incentive to secure their services. However, the ubiquity of these services and plug-ins also makes them a target for hackers. For example, many self-hosted instances of WordPress, the most popular CMS, are infrequently patched. You can easily discover WordPress vulnerabilities through a simple Google search, as shown in Figure 4-2.

When you use third-party code, you need to stay on top of security advisories and deploy security patches as soon as they become available. We’ll investigate some of risks around third-party code and services in Chapter 14.

image

Figure 4-2: Come get your unsecured WordPress instances.

Dynamic Resources

Though it’s simpler to use static resources, authoring individual HTML files by hand is time-consuming. Imagine if retail websites had to code up a new web page every time they added a new item to their inventory. It would inefficiently use up everyone’s time (though it would provide a guarantee of job security for web developers).

Most modern websites instead use dynamic resources. Often the dynamic resource’s code loads data from a database in order to populate the HTTP response. Typically, the dynamic resource outputs HTML, though other content types can be returned depending on the expectations of the browser.

Dynamic resources allow retail websites to implement a single product web page capable of displaying many types of products. Each time a user views a particular product on the site, the web page extracts the product code from a URL, loads the product price, image, and description from the database, and interpolates this data into the HTML. Adding new products to the retailer’s inventory then becomes a matter of simply entering new rows in the database.

There are many other uses for dynamic resources. If you access your banking website, it looks up your account details and incorporates them in the HTML. A search engine like Google returns matches pulled from Google’s massive search index and returns them in a dynamic page. Many sites, including social media and web-mail sites, look different to each user, because they dynamically construct the HTML after the user logs in.

As useful as dynamic resources are, they create novel security vulnerabilities. The dynamic interpolation of content into the HTML can be vulnerable to attack. We’ll look at how to protect ourselves from maliciously injected JavaScript in Chapter 7, and see how HTTP requests generated from other websites can cause harm in Chapter 8.

Templates

The first dynamic resources were simple script files, often written in the Perl language, that the web server executed when a user visited a particular URL. These script files would write out the HTML that made up a particular web page.

Code that makes up a dynamic resource in this fashion often isn’t intuitive to read. If a web page consists of static resources, you can look at a static HTML file to get a sense of how it’s organized, but it’s harder to do the same with dynamic resources that have a thousand lines of Perl code. Essentially, you have one language (Perl) writing out content in another language (HTML) that, downstream, a browser will render onscreen. Making changes to Perl code while keeping in mind what the eventual rendered output will look like is a difficult task.

To address this, web developers often use template files to build dynamic web pages. Templates are mostly HTML, but have programmatic logic interspersed within them that contains instructions to the web server. This logic is generally simple and usually does one of three things: pull data from a database or the HTTP request and interpolate it into the HTML, conditionally render sections of the HTML template, or loop over a data structure (for example, lists of items) to repeatedly render a block of HTML. All modern web frameworks use template files (with variations in syntax) because inserting code snippets into HTML typically makes code cleaner and more readable.

Databases

When a web server executes the code in a dynamic resource, it often loads data from a database. If you visit a retail website, the web server looks up the product ID in a database, and uses the product information stored in the database to construct the page. If you log into a social media site, the web server loads your timeline and notifications from an underlying database in order to write the HTML. In fact, most modern websites use databases to store user information, and the interface between the web server and a database is a frequent target for hackers.

Database technology predates the invention of the web. As computers became more widespread back in the 1960s, companies started to see the value of digitizing and centralizing their record keeping to make searching and maintenance easier. With the birth of the web, sticking a web frontend on top of a product inventory database was a natural progression for companies looking to branch out into online retail.

Databases are key for authentication too. If a website wants to identify returning users, it needs to keep a record of who has signed up to the site and verify, or authenticate, their login information against stored credentials when they return.

The two most commonly used types of databases are SQL and NoSQL. Let’s take a look at both.

SQL Databases

The most common databases used today are relational databases that implement Structured Query Language (SQL), a declarative programming language that maintains and fetches data.

NOTE

SQL can be pronounced either “ess-qew-ell” or “sequel,” although you can try pronouncing it “squeal” if you want to see your database administrator squirm uncomfortably.

SQL databases are relational, which means they store data in one or more tables that relate to each other in formally prescribed ways. You can think of a table as akin to a Microsoft Excel spreadsheet with rows and columns, with each row representing a data item, and each column representing a data point for each item. Columns in a SQL database have predefined data types, typically strings of text (often of fixed length), numbers, or dates.

Database tables in a relational database relate to each other via keys. Usually, each row in a table has a unique numeric primary key, and tables can refer to each other’s rows via foreign keys. For example, if you were storing user orders as database records, the orders table would have a foreign key column called user_id that represents the user who placed the order. Instead of storing user information directly in the orders table, this user_id column would contain foreign-key values that refer to a specific row’s primary key (the id column) in the users table. This type of relation ensures that you cannot store orders in the database without storing the user, and ensures that only a single source of truth exists for each user.

Relational databases also feature data integrity constraints that prevent data corruption and make uniform queries to the database possible. Like foreign keys, other types of data integrity constraints can be defined in SQL. For example, you could require the email_address column in a users table to contain only unique values, to force each user in the database to have a different email address. You could also require non-null values in tables so that the database must specify an email address for each user.

SQL databases also exhibit transactional and consistent behavior. A database transaction is a group of SQL statements executed in a batch. A database is said to be transactional if each transaction is “all or nothing”: that is, if any SQL statement fails to execute within the batch, the entire transaction fails and leaves the database state unchanged. SQL databases are consistent because any successful transaction brings the database from one valid state to another. Any attempt to insert invalid data in a SQL database causes the whole transaction to fail and the database to remain unaltered.

Because data stored in SQL databases is often highly sensitive, hackers target databases to sell their contents on the black market. Hackers also often take advantage of insecurely constructed SQL statements. We’ll examine how in Chapter 6.

NoSQL Databases

SQL databases are often the bottleneck of a web application’s performance. If most HTTP requests hitting a website generate a database call, the database server will experience a tremendous load and slow the performance of the website for all users.

These performance concerns have led to the increasing popularity of NoSQL databases—databases that sacrifice the strict data integrity requirements of traditional SQL databases to achieve greater scalability. NoSQL encompasses a variety of approaches to storing and accessing data, but a few trends among them have emerged.

NoSQL databases are often schemaless, allowing you to add fields to new records without having to upgrade any data structures. To achieve this flexibility, data is often stored in key-value form, or in JavaScript Object Notation (JSON).

NoSQL database technology also tends to prioritize widescale replication of data over absolute consistency. SQL databases guarantee that simultaneous queries by different client programs will see the same results; NoSQL databases often loosen this constraint and guarantee only eventual consistency.

NoSQL databases make storing unstructured or semistructured data very easy. Extracting and querying data tends to be a little more complex—some databases offer a programmatic interface, while others implement their own query languages that adapt SQL-like syntax to their data structures. NoSQL databases are vulnerable to injection attacks in much the same way as SQL databases are, though an attacker has to correctly guess the database type to successfully mount an attack.

Distributed Caches

Dynamic resources can also load data from in-memory distributed caches, another popular approach to achieving the massive scalability required by large websites. Caching refers to the process of storing a copy of data kept elsewhere in an easily retrievable form, to speed up retrieval of that data. Distributed caches like Redis or Memcached make caching data straightforward and allow software to share data structures across different servers and processes in a language-agnostic way. Distributed caches can be shared among web servers, making them ideal for storing frequently accessed data that would otherwise have to be retrieved from a database.

Large web companies typically implement their tech stacks as a range of microservices—simple, modular services that perform one action on demand—and use distributed caches to communicate between them. Services often communicate via queues stored in a distributed cache: data structures that can put tasks in a waiting state so they can be completed one at a time by numerous worker processes. Services can also use publish-subscribe channels that allow many processes to register interest in a type of event, and have them notified en masse when it occurs.

Distributed caches are vulnerable to hacks in the same way that databases are. Thankfully, Redis and Memcached were developed in an age when these kinds of threats were well-known, so best practices are generally baked into software development kits (SDKs), the code libraries you use to connect with the caches.

Web Programming Languages

Web servers will execute code in the process of evaluating dynamic resources. A huge number of programming languages can be used to write web server code, and each has different security considerations.

Let’s look at some of the more commonly used languages. We’ll use these languages in code samples in later chapters.

Ruby (on Rails)

The Ruby programming language, like Dragon Ball Z and the Tom Selleck film Mr. Baseball, was invented in Japan in the mid ’90s. Unlike either Dragon Ball Z or Tom Selleck, it didn’t become popular for another decade until the Ruby on Rails platform was released.

Ruby on Rails incorporates many best practices for building large-scale web applications and makes them easy to implement with minimal configuration. The Rails community also takes security seriously. Rails was one of the first web server stacks to incorporate protections against cross-site request forgery attacks. Nevertheless, Rail’s ubiquity makes it a common target for hackers. Several major security vulnerabilities have been discovered (and hastily patched) in recent years.

Simpler Ruby web servers often described as microframeworks (for example, Sinatra) have become popular alternatives to Rails in recent years. Microframeworks allow you to combine individual code libraries that perform one particular function, so your web server is deliberately minimal in size. This contrasts with Rails’s “everything including the kitchen sink” model of deployment. Developers who use a microframework generally find the extra capabilities they need by using the RubyGems package manager.

Python

The Python language was invented in the late 1980s. Its clean syntax, flexible programming paradigm, and wide variety of modules have made the language phenomenally popular. Newcomers to Python are often surprised that whitespace and indenting have semantic meaning, which is unusual among programming languages. Whitespace is so important in the Python community that they fight holy wars over whether indentation should be done with tabs or spaces.

Python is used for a variety of applications, and is often the go-to language for data science and scientific computing projects. Web developers have a wide choice of actively maintained web servers to choose from (such as the popular Django and Flask). The diversity of web servers also acts as a security feature because hackers are less likely to target a particular platform.

JavaScript and Node.js

JavaScript started out as a simple language for executing small scripts within the browser, but became popular for writing web server code and rapidly evolved with the Node.js runtime. Node.js runs on top of the V8 JavaScript engine, the same software component that Google Chrome uses to interpret JavaScript within the browser. JavaScript still contains many quirks, but the prospect of using the same language on the client side and server side has made Node the fastest-growing web development platform.

The largest security risks in Node are due to its rapid growth—hundreds of modules are added every day. You’ll need to take extra caution when you use third-party code in your Node application.

PHP

The PHP language was developed from a set of C binaries used to build dynamic sites on Linux. PHP later developed into a fully fledged programming language, though the unplanned evolution of the language is evident in its disorganized nature. PHP inconsistently implements many built-in functions. For example, variable names are case-sensitive, but function names are not. Despite these quirks, PHP remains popular and, at one point, it powered 10 percent of sites on the web.

If you’re writing PHP, you’re often maintaining a legacy system. Because older PHP frameworks exhibit some of the nastiest security vulnerabilities you can imagine, you should update legacy PHP systems to use modern libraries. Every type of vulnerability, whether it’s command execution, directory traversal, or a buffer overflow, has given PHP programmers sleepless nights.

Java

Java and the Java Virtual Machine (JVM) have been widely used and implemented in the enterprise space, allowing you to run Java’s compiled bytecode across multiple operating systems. It’s generally a good workhorse language when performance is a concern.

Developers have used Java for everything, whether for robotics, mobile app development, big-data applications, or embedded devices. Its popularity as a web development language has waned, but many millions of lines of Java code still power the internet. From a security perspective, Java is haunted by its past popularity; legacy applications contain a lot of Java code that run older versions of the language and frameworks. Java developers need to update to secure versions in a timely fashion lest they become easy pickings for hackers.

If you’re a more adventurous developer, you’ll find other popular languages that run on the JVM and offer compatibility with Java’s huge ecosystem of third-party libraries. Clojure is a popular Lisp dialect; Scala is a functional language with static typing; Kotlin is a newer object-oriented language designed to be backward compatible with Java, while making scripting easier.

C#

C# was designed by Microsoft as part of the .NET initiative. C# (and other .NET languages, such as VB.NET) use a virtual machine called the Common Language Runtime (CLR). C# is less abstracted from the operating system than Java, and you can happily intermingle C++ code with C#.

Microsoft has had a conversion late in life to open source evangelism, and the reference implementation of C# is now, thankfully, open source. The Mono project allows .NET applications to run on Linux and other operating systems. Nevertheless, most companies using C# deploy to Windows servers and the typical Microsoft stack. Windows has had a troubling history security-wise—being, for instance, the most common target platform for viruses—so anyone looking to adopt .NET as a platform needs to be aware of the risks.

Client-Side JavaScript

As a web developer, you have a choice of languages for writing web server code. But when your code needs to be executed in the browser, you have exactly one choice: JavaScript. As I mentioned previously, the popularity of JavaScript as a server-side language can in part be credited to web developers’ familiarity with it from writing for the client side.

JavaScript in the browser has moved a long way beyond the simple form-validation logic and animated widgets it was used for in the early days of the web. A complex site such as Facebook uses JavaScript to redraw areas of the page as the user interacts with it—for example, rendering a menu when the user clicks an icon, or opening a dialog when they click a photo. Sites often update the user interface when background events occur, too, by adding notification markers when others leave comments or write new posts.

Achieving this kind of dynamic user interface without refreshing the whole page and interrupting the user experience requires client-side JavaScript to manage a lot of state in memory. Several frameworks have been developed to organize memory state and render pages efficiently. They also allow for modular reuse of JavaScript code over various pages on the site, a key design consideration when you have millions of lines of JavaScript to manage.

One such JavaScript framework is Angular, originally released by Google under an open source license. Angular borrows from server-side paradigms and uses client-side templates to render web pages. The Angular template engine—which executes in the browser as the page loads—parses the template HTML supplied by the server, and processes any directives as they appear. Because the template engine is simply JavaScript executing in the browser, it can write directly to the DOM and short-circuit some of the browser-rendering pipeline. As the memory state changes, Angular automatically re-renders the DOM. This separation makes for cleaner code and more-maintainable web applications.

The open source React framework, which was released by the Facebook development team, takes a slightly different approach from Angular. Instead of interspersing code in HTML templates, React encourages the developer to write HTML-like tags directly into JavaScript. React developers typically create JavaScript XML (JSX) files that they run through a preprocessor and compile into JavaScript before sending them to the browser.

Writing JavaScript code like return <h1>Hello, {format(user)}</h1> for the first time can seem strange to developers used to separating JavaScript and HTML files, but by making HTML a first-class element of the JavaScript syntax, React enables useful features (for example, syntax highlighting and code completion) that would otherwise be difficult to support.

Rich, client-side JavaScript frameworks like Angular and React are great for building and maintaining complex sites. JavaScript code that manipulates the DOM directly is partial to a new type of security vulnerability, however: DOM-based cross-site scripting attacks, which we’ll look at in more detail in Chapter 7.

Note that although JavaScript is the only language a browser typically executes, that doesn’t mean you have to write all your client-side code in JavaScript. Many developers use languages like CoffeeScript or TypeScript that are transpiled into JavaScript during the build process before being sent to the browser. These languages are subject to the same security vulnerabilities as JavaScript at execution time, so in this book I’ll mostly limit our discussions to plain old JavaScript.

Summary

Web servers serve two types of content in response to HTTP requests: static resources, such as images, and dynamic resources, which execute custom code.

Static resources are resources that we can serve directly from a filesystem or a content delivery network to increase the responsiveness of the site. Website owners usually author websites that consist wholly of static resources in a content-management system, which allows nontechnical administrators to edit them directly in the browser.

Dynamic resources, on the other hand, are resources that we often define in the form of templates, HTML that’s interspersed with programmatic instructions to be interpreted by the server. They’ll typically read data from a database or a cache that informs how the page is rendered. The most common form of database is a SQL database, which stores data in tabular form, with strictly defined rules on the structure of the data. Larger websites often use a NoSQL database, a newer variety of database that relaxes some of the constraints of the traditional SQL database in order to achieve greater scalability. We write dynamic resources in a web programming language, of which there are many.

In the next chapter, you’ll look at the process of writing code itself. The key to writing secure, bug-free code is a disciplined development process; I’ll show you how you should write, test, build, and deploy your code.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
13.59.96.247