Testing in Production

As we mentioned at the beginning of this chapter, deploying an application to production can be downright frightening. There are, however, a number of techniques you can use to reduce the risk and fear of pushing a new version out to production. And even when your application is fully rolled out, you’re still not done testing: several additional kinds of tests help ensure your services remain resilient and reliable.

Canary Testing

Canary testing is a technique used to deploy a new version of your microservice to a small percentage of your user base to ensure there aren’t any bugs or bottlenecks in the new service. To do canary testing, you can use tools like NGINX’s split_clients module to split traffic based on your routing rules. Another option is Netflix’s Zuul (http://bit.ly/netflixzuul), an edge service that can be used to canary test new services based on a set of routing rules. Facebook uses Gatekeeper, a tool that gives developers the ability to deploy their changes only to a specific user group, region, or demographic segment. Similarly, at Microsoft, the Bing team has a routing rule where all employees on the corporate network get “dogfood” experimental versions of the Bing search engine before it is shipped to customers.

For example, let’s say you have a new version of a microservice to canary test. You can split traffic as shown in Figure 6.11, with 99% of initial traffic going to the current release (v1.0) and 1% going to the new release (v1.1). If the new release is performing well (see Chapter 7, “Monitoring,” for more information on what to measure), you can increase the amount of traffic incrementally until 100% of traffic is using the new release. If at any point during the rollout the new release is failing, you can easily toggle all traffic back to v1.0 without needing to do a rollback and redeployment.


FIGURE 6.11: Deploying a new microservice version using canary testing
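To make the traffic split concrete, the following Go sketch shows one way a simple routing layer could implement a weighted canary split in front of two release endpoints. It is a minimal illustration, not the implementation used by any of the tools mentioned above; the backend URLs, the 1% weight, and the handler wiring are assumptions for the example.

    package main

    import (
    	"math/rand"
    	"net/http"
    	"net/http/httputil"
    	"net/url"
    )

    // newProxy builds a reverse proxy for a backend URL. Panicking on a bad
    // URL is acceptable for a short illustrative sketch.
    func newProxy(rawURL string) *httputil.ReverseProxy {
    	target, err := url.Parse(rawURL)
    	if err != nil {
    		panic(err)
    	}
    	return httputil.NewSingleHostReverseProxy(target)
    }

    func main() {
    	// Hypothetical backend addresses for the current and canary releases.
    	stable := newProxy("http://orders-v1-0.internal:8080")
    	canary := newProxy("http://orders-v1-1.internal:8080")

    	// Send roughly 1% of requests to the canary release; increase this
    	// weight incrementally as the new version proves healthy, or set it
    	// to zero to route everything back to v1.0 instantly.
    	canaryWeight := 0.01

    	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
    		if rand.Float64() < canaryWeight {
    			canary.ServeHTTP(w, r)
    			return
    		}
    		stable.ServeHTTP(w, r)
    	})

    	http.ListenAndServe(":80", nil)
    }

Because the weight is just a number held by the router, dialing the canary up or down is a configuration change rather than a redeployment.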

Depending on how complex your deployment is, you can use canary testing to roll out across machines and geographies, as sketched after this paragraph. For example, if the rollout of a logging service for a virtual machine works on one machine, you’d slowly roll it out to all machines within a service, and then to all services within a region. Once that region is successfully up and running, the rollout continues to the next region until the service is fully deployed. Don’t confuse canary testing with A/B testing: canary testing deploys a build to a subset of your users to ensure quality, whereas A/B testing compares feature changes to see whether they improve a key performance metric.
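As a rough sketch of that staged rollout, the Go loop below walks through hypothetical rollout rings (one machine, then a service, then a region) and halts if a health check fails. The ring names and the healthy() check are assumptions standing in for a real deployment pipeline and real monitoring.

    package main

    import "fmt"

    // A rollout ring is a progressively larger slice of the fleet.
    type ring struct {
    	name  string
    	hosts []string
    }

    // healthy stands in for real monitoring of the new version (error rates,
    // latency, and so on); here it always succeeds.
    func healthy(r ring) bool { return true }

    func main() {
    	// Hypothetical rings: one machine, then the whole service, then a region.
    	rings := []ring{
    		{"single-machine", []string{"vm-001"}},
    		{"logging-service-west", []string{"vm-001", "vm-002", "vm-003"}},
    		{"region-west", []string{ /* all hosts in the region */ }},
    	}

    	for _, r := range rings {
    		fmt.Printf("deploying to ring %q (%d hosts)\n", r.name, len(r.hosts))
    		// deploy(r.hosts) would go here in a real pipeline.
    		if !healthy(r) {
    			fmt.Println("health check failed; halting rollout and routing traffic back to v1.0")
    			return
    		}
    	}
    	fmt.Println("rollout complete")
    }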

A/B Testing

A/B testing is a way to measure how two versions of a feature stack up against each other in terms of performance, discoverability, and usability. When setting up an A/B test, the test manager defines two groups of users: a control group and a treatment group, which is normally significantly smaller than the control group. As an example, in Outlook.com we created an experiment driven by the hypothesis that users simply were not aware of the built-in “Unsubscribe” feature in the product. The treatment moved the Unsubscribe bar from the bottom of the email message body to the top, a more discoverable location in the UI, and we rolled it out to ten percent of worldwide Outlook.com users. Over a fixed period of time, we analyzed the data, which showed higher usage of the Unsubscribe feature obtained simply by moving it to a more discoverable location.
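A common way to assign users to the control or treatment group is to hash a stable user identifier and bucket on the result, so each user always sees the same variant for the lifetime of the experiment. The Go sketch below illustrates that idea; the 10% treatment size mirrors the Outlook.com example, while the hashing scheme and function names are assumptions for illustration rather than how that experiment was actually implemented.

    package main

    import (
    	"fmt"
    	"hash/fnv"
    )

    // variant deterministically returns "treatment" for roughly the given
    // fraction of users, so a user keeps the same experience across visits.
    func variant(userID string, treatmentFraction float64) string {
    	h := fnv.New32a()
    	h.Write([]byte(userID))
    	bucket := float64(h.Sum32()%1000) / 1000.0
    	if bucket < treatmentFraction {
    		return "treatment" // e.g., Unsubscribe bar at the top of the message
    	}
    	return "control" // existing experience
    }

    func main() {
    	// Roughly 10% of users get the treatment, as in the Outlook.com example.
    	for _, id := range []string{"user-123", "user-456", "user-789"} {
    		fmt.Println(id, "->", variant(id, 0.10))
    	}
    }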

Fault Tolerance and Resiliency Testing

Fault-tolerance testing, often referred to as fault injection testing, assumes that the set of microservices that make up your app has unavoidable and potentially undetectable issues that can be triggered at some point in time. The developer assumes that under certain circumstances the system will end up in a failing state. Fault-tolerance testing covers scenarios where one or more of the critical microservices fail and are no longer available. The resolution comes down to two choices: exception processing or fault treatment. Exception processing means that the underlying “engine” realizes something is amiss and is able to continue working to the point where it can notify the user about an error, even though it is itself in an error-recovery state, and potentially treat the issue on its own.

Fault treatment, on the other hand, enables the system to prevent the error altogether by taking action before bad things happen.
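As a small illustration of exception processing, the Go sketch below calls a downstream microservice with a timeout and falls back to a degraded response instead of failing the whole request. The service URL, endpoint, and fallback values are hypothetical; the point is only that the caller detects the failure and keeps working.

    package main

    import (
    	"fmt"
    	"net/http"
    	"time"
    )

    // fetchRecommendations asks a (hypothetical) downstream service for data.
    // If the call fails or times out, we practice exception processing: we
    // notice that something is amiss, return a safe fallback, and let the
    // caller keep working instead of failing the whole request.
    func fetchRecommendations(client *http.Client) []string {
    	resp, err := client.Get("http://recommendations.internal/api/top")
    	if err != nil {
    		// Degraded but functional: a static fallback list.
    		return []string{"bestsellers"}
    	}
    	defer resp.Body.Close()

    	if resp.StatusCode != http.StatusOK {
    		return []string{"bestsellers"}
    	}
    	// A real implementation would decode the response body here.
    	return []string{"personalized-item-1", "personalized-item-2"}
    }

    func main() {
    	client := &http.Client{Timeout: 500 * time.Millisecond}
    	fmt.Println(fetchRecommendations(client))
    }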

As with any other aspect of service building, resiliency does not necessarily follow a linear implementation scale. Just because you need to ensure that your services work in a fault-tolerant environment does not mean that you have to embrace every tool and pattern to do so. Shopify, an ecommerce company, for example, uses a resiliency maturity pyramid that describes the levels of investment appropriate for the infrastructure in place, as shown in Figure 6.12.


FIGURE 6.12: Resiliency maturity pyramid, presented by Shopify at Dockercon 2015

Moving up the pyramid to a broader infrastructure, you need to test and harden that infrastructure to ensure high availability. Services are required to handle scenarios where components are either temporarily and automatically decommissioned or swapped out in a timely manner. Although we discussed testing with mocks earlier, two tools worth calling out in particular are Toxiproxy and Chaos Monkey:

Toxiproxy is a network proxy tool designed to simulate poor network conditions and validate your system against single points of failure caused by the network. It is available at https://github.com/Shopify/toxiproxy.

Chaos Monkey is part of the Netflix Simian Army resiliency tool set. It identifies groups of systems (Auto Scaling groups, or ASGs), randomly picks a virtual machine within a group, and terminates it. The service runs on a predefined schedule, so that engineers can quickly react to any critical service downtime and test whether the system can self-heal after catastrophic failures. The tool is available at http://bit.ly/netflixchaosmonkey. A minimal sketch of the idea follows.
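The core idea behind Chaos Monkey can be sketched in a few lines of Go: on a schedule, pick a random instance from a group and terminate it. The sketch below only prints which (hypothetical) instance it would terminate; it is not Netflix’s implementation and does not call any cloud API.

    package main

    import (
    	"fmt"
    	"math/rand"
    	"time"
    )

    func main() {
    	// Hypothetical instances in one Auto Scaling group.
    	group := []string{"i-0a1b2c", "i-1d2e3f", "i-4a5b6c"}

    	// On a predefined schedule (shortened here for illustration), pick a
    	// random instance and "terminate" it so the team can verify that the
    	// system self-heals.
    	ticker := time.NewTicker(2 * time.Second)
    	defer ticker.Stop()

    	for i := 0; i < 3; i++ {
    		<-ticker.C
    		victim := group[rand.Intn(len(group))]
    		fmt.Printf("chaos: terminating instance %s\n", victim)
    		// A real tool would call the cloud provider's terminate API here.
    	}
    }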
