Chapter 8. SPOF Testing

On November 12, 2014, for 90 excruciating minutes, customers of Google’s DoubleClick for Publishers (DFP) service experienced an outage. It is estimated that over 50,000 websites were affected, costing millions of dollars in lost advertising revenue. In addition to the direct loss of revenue, there was a secondary effect. Some websites that depended on DFP started experiencing outage-like behavior of their own. Users were unable to access these sites because the pages effectively froze waiting for network activity with the DFP server. This scenario is known as a single point of failure (SPOF) of frontend code, in which one weak link can take the whole site down.

Brian O’Kelley, CEO of AppNexus, operator of a large real-time online ad platform and a DoubleClick rival, estimated the disruption cost publishers $1 million per hour in aggregate.

Wednesday’s outages affected more than 55,000 websites, according to Dynatrace, which monitors website and web application performance for companies including eight out of the 10 largest retailers in North America.

http://on.wsj.com/1EUfDRn

A SPOF can happen because of the way browsers handle unresponsive servers. When a server experiences an outage like the one that hit Google’s ad service, websites that depend on it are unable to communicate with it. The browser’s normal recourse is to try again. As the browser unsuccessfully attempts to reach the downed server, the original request is left hanging. When this request is made synchronously, all other page activity grinds to a halt.

A user visiting a site that is undergoing a SPOF is likely to have a very bad experience. The page appears blank or incomplete, nothing on the page responds to interactions like scrolling or clicking, and much time is wasted. Users don’t care if the site is the victim of a third-party failure; as far as they’re concerned, the site is broken. Site owners can immediately expect a loss of business from these users, simply because they’re unable to use the site. Worse yet, there are long-term effects that adversely impact these users’ sentiment. Sites that go down are seen as less reliable, untrustworthy, and undeserving of a return visit.

There are many possible sources for a SPOF. In addition to the cause in the advertising example, social media widgets are commonly cited examples of SPOF-inducing scripts. A website that includes a button to post to Twitter may suffer if Twitter goes down. Other common third-party resource types include analytics code like Google Analytics and JavaScript frameworks like jQuery. You and your site can be affected if you rely on external resources like these.

You may be thinking at this point that externally hosted third-party resources are terrible because they have the ability to take your website down. Your next thought may be to mitigate the risk of a SPOF by turning these third-party resources into “first-party” resources by hosting them yourself. If the third-party server goes down, you would be unaffected, but if your server goes down, your site would be affected no matter where your resources are hosted. This might seem like a viable alternative, but remember the benefits of third-party resource hosting. Perhaps the most obvious advantage is that you don’t have to update the resource as changes are published. For example, some jQuery users prefer to use a centrally located CDN to host the JavaScript file because updates to the library are automatically pushed to the hosted file. Another advantage is that the resource is shared among multiple websites. If the resource is configured to be cached by users, visiting one website will effectively preload the resource and make it readily available for the next website. For popular resources, users would rarely need to reload it.

The truth is that third-party resources can be trusted not to bring your site down. Techniques and best practices have been developed to ensure that one server’s outage doesn’t become everyone else’s SPOF. The most straightforward solution is to load third-party resources asynchronously. By definition, this means that the page does not depend on the resource’s immediate availability. No matter how long the resource takes to load (even if not at all), the show will go on and the page will continue to function. The feature brought by the resource will be unavailable of course, but developers could plan for that contingency accordingly with fallbacks.
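As a rough way to audit a page for this risk, we can scan its HTML for external scripts that are loaded synchronously. The sketch below is only an illustration: the function name, the first-party host argument, and the example URLs are assumptions, and it treats any script with a `src` on another host and no `async` or `defer` attribute as a SPOF candidate.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class BlockingScriptFinder(HTMLParser):
    """Collects external <script> tags that have neither async nor defer."""

    def __init__(self, first_party_host):
        super().__init__()
        self.first_party_host = first_party_host
        self.blocking = []

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        attr_map = dict(attrs)
        src = attr_map.get("src")
        if not src:
            return  # inline script: no network dependency of its own
        host = urlparse(src).netloc
        if not host or host == self.first_party_host:
            return  # relative or same-host URL: treated as first party
        if "async" in attr_map or "defer" in attr_map:
            return  # loaded without blocking the HTML parser
        self.blocking.append(src)


def find_blocking_third_party_scripts(html, first_party_host):
    """Return the src of each synchronously loaded third-party script."""
    finder = BlockingScriptFinder(first_party_host)
    finder.feed(html)
    return finder.blocking


# Hypothetical page markup used only for illustration.
page = (
    '<script src="https://ads.example.net/tag.js"></script>'
    '<script async src="https://stats.example.org/a.js"></script>'
    '<script src="/js/app.js"></script>'
)
print(find_blocking_third_party_scripts(page, "www.example.com"))
```

A static scan like this misses scripts that are injected at runtime, so it complements the tests described in the rest of this chapter and does not replace them.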

You should be happy to hear that there were many websites that did not go down as a result of that DFP outage. These websites correctly implemented defensive techniques to handle such a failure, and the worst that happened was that the site was unable to display any advertisements. To most users, this sounds like a win. And there certainly were plenty of people rejoicing at the outage for turning a significant portion of the web ad-free. But this is a success story because the problem ended there for these websites. News sites were still able to deliver the news, and ecommerce sites were still able to sell things.

If you weren’t eager to start analyzing whether your site is liable to SPOF before, by now you should be. What we’ve looked at so far with WebPageTest’s synthetic testing is good for identifying areas of a page that can be sped up. These tests are somewhat “blue sky” scenarios in which all of your third-party resources’ servers are online and properly handling traffic. This presents a blind spot in your analysis if you’re looking for potential causes of SPOF. Recall that the other kind of performance measurement, RUM, simply collects live data about how real users are impacted. If a SPOF were to occur, RUM would certainly detect this and show an anomaly in the reporting. But at this point, it’s too little too late. The outage is real and your users are suffering for it. What we need is a way to prevent SPOF, not just react to it.

In this chapter, we will look at how to use WebPageTest to diagnose SPOF problems before they happen. To do this, we will induce SPOF-like conditions by preventing responses from ever reaching the client. In this simulation, a request is made and never heard from again, just like what would happen if the server at the other end went down. Using this technique, we hope to expose failure-prone resources that could be costing you valuable business.

We will also discuss request blocking, which is a technique related to inducing SPOF with some clear differences. By preventing requests from ever being dispatched, we are able to measure performance by omission, which is the effect a resource has when removed from the page.

Black-Hole Rerouting

There are only two ways to know how your site reacts to a third-party failure: testing it ahead of time and watching it unfold as it is actually happening. We’re going to see how WebPageTest can be used to test third-party failures so that your users aren’t the first to know.

The first step of SPOF testing is to identify the third parties that can take your page down. To do this, we use the Domains tab of the results page (Figure 8-1). After running a first-view test of your page, you can access a list of the domains that were used to construct the page. These domains are grouped by the number of requests they served and the total number of bytes sent.

Figure 8-1. The Domains tab of the test results page shows the frequency and size of resources served by each domain for a given page

On this page, we can easily identify the domains that contribute the most to a given test page. Obviously, the domain of the test page itself should be prominently high on the list. What we’re looking for are third-party domains out of our control. It would be prudent to test each and every third-party domain for SPOF, but for now let’s select the one with the most requests. With this domain, we have a couple of ways to simulate what would happen if it were suddenly inaccessible.
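If we copy the numbers from the Domains tab into a small data structure, choosing the first candidate can be sketched in a few lines. The dictionary layout, the domain names, and the figures here are made up for illustration; they are not a format that WebPageTest itself produces.

```python
def rank_third_party_domains(domain_stats, first_party_domains):
    """Return third-party domains sorted by request count, then bytes, descending."""
    third_party = {
        domain: stats
        for domain, stats in domain_stats.items()
        if domain not in first_party_domains
    }
    return sorted(
        third_party,
        key=lambda d: (third_party[d]["requests"], third_party[d]["bytes"]),
        reverse=True,
    )


# Hypothetical values transcribed by hand from a Domains tab.
stats = {
    "www.example.com": {"requests": 40, "bytes": 900_000},
    "ads.example.net": {"requests": 12, "bytes": 150_000},
    "cdn.example.org": {"requests": 3, "bytes": 300_000},
}
print(rank_third_party_domains(stats, {"www.example.com"}))
```

The first entry in the returned list is the domain to send to the black hole first; the remaining entries give an order for the follow-up tests.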

setDns

Recall from “Reading a Waterfall” that DNS lookup is just the resolution of a recognizable hostname like example.com to its IP address. This process is much like looking up a phone number in a phone book. With DNS resolution, there are many phone books: some are little black books while others are exhaustive tomes. Computers have their own little black book in which to jot down a few important names and numbers. If the browser needs to resolve a name not in this book, it asks a DNS server. WebPageTest provides a way for you to jot any name and number into its little black book. You don’t even need to use the correct number for the given name. This is exactly what we’ll do to simulate our first SPOF.

To test what happens when a given domain goes offline, start by opening up the familiar Script tab. This time we’re going to use a new command, setDns:

setDns _domain_ _IP address_
navigate _test URL_

The setDns command designates an IP address to be the point of contact for all requests to a given domain. When navigating to a test page, any requests to the given domain will be routed to the given address. Now we need an IP address that points to a server that pretends to be failing. WebPageTest has you covered with the appropriately named blackhole.webpagetest.org host. By assigning this host’s IP address to a domain, we’re able to simulate its failure:

blackhole.webpagetest.org: 72.66.115.13
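When several domains need to be tested, the script text can be generated instead of typed by hand. The helper below is a minimal sketch: its name and the example domain are hypothetical, and it assumes that command parameters are separated by tabs, which you should check against the scripting documentation for your WebPageTest version.

```python
# IP address of blackhole.webpagetest.org, as given in the text.
BLACKHOLE_IP = "72.66.115.13"


def build_spof_script(test_url, domains, blackhole_ip=BLACKHOLE_IP):
    """Build a script that points each domain at the black hole, then navigates."""
    # Assumption: parameters are tab-delimited.
    lines = [f"setDns\t{domain}\t{blackhole_ip}" for domain in domains]
    lines.append(f"navigate\t{test_url}")
    return "\n".join(lines)


# ads.example.net stands in for whichever third-party domain is under test.
print(build_spof_script("http://www.example.com", ["ads.example.net"]))
```

The printed text can be pasted into the Script tab as is.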

It’s worth noting that when you run a SPOF test, nothing may actually go wrong. This is a good thing! This means that your page is adequately prepared to handle the sudden failure of a third-party resource. When this isn’t the case, the test results speak for themselves (Figure 8-2).

Figure 8-2. The failure of a resource resulted in a 20-second timeout, during which no other requests were able to be made and page loading halted

The waterfall diagram clearly shows a gap 20 seconds long between requests. This is the amount of time that the browser spent attempting to communicate with the third party. Instead of communicating with the third party, though, the browser was sending requests to the black hole without receiving any responses. During this time, the browser is not able to start any other requests. If this happens early enough, the user would be left with an incomplete, possibly unresponsive page.
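The same gap can be found numerically if we have the start time of each request in the waterfall. This is only a sketch with invented timings; the function name and the list-of-milliseconds input are assumptions for illustration.

```python
def longest_start_gap(start_times_ms):
    """Return the longest interval, in ms, between consecutive request starts."""
    ordered = sorted(start_times_ms)
    if len(ordered) < 2:
        return 0  # a gap needs at least two requests
    return max(later - earlier for earlier, later in zip(ordered, ordered[1:]))


# Invented start times: the third request hangs and nothing else starts for 20 s.
starts = [0, 120, 300, 20_300, 20_450]
print(longest_start_gap(starts))
```

A gap that is far larger than the others in a SPOF test, and absent from a normal test of the same page, points to the blocked request as the cause.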

SPOF Tab

If you’re thinking that scripting is a cumbersome way to run a SPOF test, you should be relieved to know that WebPageTest makes it even easier. With the SPOF advanced settings tab (Figure 8-3), all you need to do is enter the domain that you want to send to the black hole.

Figure 8-3. The SPOF tab allows you to test the failure of a domain without having to write a setDns script

When you run a test from the SPOF tab, you’re actually running two tests: one with the DNS override and one without. By running a normal test along with the SPOF test, you have a control to differentiate the effects of the simulation. Instead of taking you to the individual test results when the test is complete, the SPOF test results are displayed in the comparison view, as shown in Figure 8-4. This makes it extremely easy to identify the user-perceived effects of a SPOF. WebPageTest also generates a video comparison of the normal page load and the effects of SPOF, as shown in Figure 8-5.

Figure 8-4. The SPOF tab generates a comparison of tests with and without domain failures. The filmstrip view clearly shows the page-load impact of SPOF in which the SPOF page takes more than 20 seconds longer to load.

When we look at the filmstrip comparison of the two tests, the 20-second difference is immediately apparent. Other metrics like visual progress and speed index also illustrate the dramatic consequences of a request timing out. See Figure 8-6 for a visual progress chart.

Figure 8-5. WebPageTest also generates a video comparison of each test so that you can watch a page load normally and observe the effects of SPOF side-by-side

Figure 8-6. The visual progress chart is another way to quantify the effects of SPOF on a page. This chart compares the amount of time it takes for a page to display its content. Under SPOF conditions, the progress is 0% for more than 20 seconds until the failed request times out and the page load is able to complete.

Using the SPOF tab is an incredibly convenient way to demonstrate the dangers of the irresponsible use of third-party resources. Developers should never assume that third parties are 100% reliable, because anything can happen, even to the biggest Internet giants.

Blocking Requests

SPOF is such a dramatic scenario. The weakest link in a chain of requests could spell destruction for the usability of an entire page. Failure in this case is a third-party meltdown improperly handled on the client side. Sometimes failure could just be a resource that is too slow. Service-level agreements (SLAs) are used by some third parties to reassure dependents that they will serve resources at some high rate of reliability or even no slower than some guaranteed speed. Failure could also be an SLA that wasn’t met. It doesn’t take a total meltdown for users to become annoyed at the slowness of page load; resources that load more slowly than expected can and should be considered failures.

How can we measure the impact of a particular resource? As we know, the time it takes to load a resource is only part of the story. For JavaScript resources, there is also time to parse and execute the code. The simplest way to measure a resource would be to run tests of a page with and without it. The difference in speed is the residual effect of that resource.

A practical example of this is advertisements. Ads are generally regarded as a necessary evil, so we may be too quick to accept the performance hit they incur. One way to measure their impact would be to serve pages without ads to some segment of users and compare performance against a control group. Unfortunately, ads are what keep the lights on, and it may be extremely difficult to convince the powers that be to voluntarily give up the ad revenue from an entire segment of users. Instead of a RUM A/B test, we can use synthetic testing to simulate a page with and without ads.

WebPageTest exposes the functionality to block requests based on pattern matching. For example, you could instruct the test agent to prevent resource names containing the “ads” substring from being requested. This functionality is exposed in the Script tab through the block command:

block _substring_
navigate _test URL_

Any request for a URL containing the substring will be blocked from ever reaching the network (Figure 8-7). How is this any different from sending the request into the black hole? Most importantly, browsers will not time out on a request that was never made, so the page will not experience that long pause between requests. This test simply measures the omission of a resource without presuming anything about its reliability or load time.

Figure 8-7. The Block tab takes a space-delimited list of substrings to be matched against request URLs. If a request URL contains any of these substrings, the request will be prevented from materializing.

If scripting is not your thing, take comfort in the fact that WebPageTest has a tab for blocking requests, too. This one couldn’t be simpler to use; just enter one or more substrings to match against requests that should be blocked.
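Before running a blocking test, it can help to preview which requests a list of substrings would catch. The sketch below mimics the substring matching described above in plain Python; it is not WebPageTest code, and the function name and URLs are invented for illustration.

```python
def split_blocked(urls, block_list):
    """Split URLs into (blocked, allowed) using a space-delimited substring list."""
    substrings = block_list.split()
    blocked, allowed = [], []
    for url in urls:
        # A URL is blocked if it contains any one of the substrings.
        if any(substring in url for substring in substrings):
            blocked.append(url)
        else:
            allowed.append(url)
    return blocked, allowed


# Hypothetical request URLs from a test page.
urls = [
    "http://www.example.com/index.html",
    "http://ads.example.net/tag.js",
    "http://www.example.com/uploads/logo.png",
]
blocked, allowed = split_blocked(urls, "ads")
print(blocked)
```

Note that a short substring like “ads” also matches “uploads”, so the logo is blocked along with the ad script. A more specific substring such as “ads.example.net” avoids removing resources by accident.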

As with any synthetic test, several runs are essential to ensuring that the results are accurate and representative. Nine test runs with and without blocking should be enough to get a suitable median run that could be used for comparison.
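Once both sets of runs are complete, the comparison comes down to the difference between two medians. The load times below are invented, and the function name is an assumption; the sketch only shows the arithmetic.

```python
from statistics import median


def blocking_impact_ms(baseline_runs_ms, blocked_runs_ms):
    """Return how much faster the median blocked run is than the median baseline run."""
    return median(baseline_runs_ms) - median(blocked_runs_ms)


# Invented load times (ms) for nine runs without and with blocking.
baseline = [4100, 3900, 4300, 4000, 4200, 3950, 4050, 4500, 3800]
blocked = [2900, 3100, 3000, 2950, 3050, 3200, 2800, 3150, 2850]
print(blocking_impact_ms(baseline, blocked))
```

With an odd number of runs such as nine, the median is always the time of an actual run, so there is a concrete waterfall to inspect for each side of the comparison.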
