Chapter 5. Canary Deployments

Managing change and risk in environments with dynamic infrastructure and continuously evolving workloads is a challenge we’re sure you all face. Employing the canary deployment pattern helps you reduce this implicit, system-wide risk. Canary deployments allow you to release new versions of your services into production to either a percentage of requests or a subset of users, while retaining the ability to roll back to the original state if you observe undesirable behavior or errors.

Canary deployments, or canary releases, are not a pattern that originated with the service mesh. The pattern was described on Martin Fowler’s blog back in 2014.1 However, given capabilities such as dynamic load balancing, configurable routing, and observability, a service mesh can facilitate canary deployments far more easily than traditional infrastructure can.

Problem

Let’s investigate the problem that canary deployments solve. Any change to a system introduces risk, and this risk manifests itself in several areas:

  • Changes to application code (code deployments)

  • Changes to infrastructure (change from 16GB to 32GB memory in a VM)

  • Changes to infrastructure software (upgrade Kubernetes from v1.15 to v1.16)

While you should always write software-defined tests for all of these areas of change, it is not always possible to simulate your production system’s exact conditions. Doing so can be cost-prohibitive and bears the burden of maintaining yet another environment alongside the production system. Often tests are executed on systems that differ significantly in terms of:

  • Scale (number of physical machines, number of services)

  • Load (number of requests, variance in the types of requests)

  • Random events (test systems generally control inputs based on known conditions)

Let’s look at a simple example involving changes to application code. Mary has just completed the latest updates to her API; the unit and integration tests pass. When she deploys it to production, there are immediate reports of elevated errors in the system: failed functional “tests” that she could not replicate in any environment other than production. The error cascades through the system and results in a total system outage.

After much debugging and testing, she determines that a downstream system sends malformed input to her service, causing the errors. The team that owns the downstream service accidentally introduced a bug in a previous release.

Why did automation testing not catch this issue?

In this case, it was due to an assumption about the inputs to the test. Errors caused by incorrect assumptions about input conditions are not an uncommon problem. Tests generally validate what you know about the system; they are vulnerable to what you do not know. In Mary’s case, she did not know that her application was not defensively handling a downstream bug, and the result was a total system outage. The remedy was simple: a two-line fix to the request validation. Unfortunately, while simple to fix, this type of bug caused the business to lose substantial revenue. The question you should be asking is:

How could this situation have been avoided in the first place?

If Mary had gradually introduced the new version of her service into production, she would have been able to detect the bug without causing a full system outage. The bug would still have existed, but end users’ exposure to it would have been dramatically reduced.

Solution

The gradual and measured introduction of a new version of an application is commonly called a canary deployment. With a canary deployment, you deploy a new version of your application to the production environment; however, it initially receives no user requests. The previous version continues to handle 100% of the requests.

Traffic is gradually shifted to the new version of the service while you monitor it for errors or anomalies. Incremental increases to the new service’s traffic continue until it handles 100% of all traffic, and you remove the previous version. If you detect unsatisfactory levels of errors at any point, traffic is reverted to the old version. Because the service mesh’s routing layer handles traffic direction, changing the traffic flow is incredibly quick. There is also a dramatic reduction in risk, because you will hopefully catch any errors before the problem is exposed to all users.

Figure 5-1. Traffic distribution
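How you express this split depends on your service mesh. As a purely illustrative sketch, assuming an Istio-style mesh with a hypothetical payments service whose versions are exposed as the subsets v1 and v2, a weighted routing rule might look something like the following:

# Illustrative sketch only: an Istio-style VirtualService splitting traffic
# between two subsets of a hypothetical payments service.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payments
spec:
  hosts:
    - payments
  http:
    - route:
        - destination:
            host: payments
            subset: v1      # current version keeps most of the traffic
          weight: 90
        - destination:
            host: payments
            subset: v2      # canary receives a small share
          weight: 10

Other meshes express the same idea differently, for example as an SMI TrafficSplit or a Consul service splitter, but the principle of weighted routes is the same.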

There is a further subdivision of this pattern that restricts traffic to a subset of the users. You can expose users to the canary using an HTTP cookie, an HTTP header, a JWT claim, or gRPC metadata. Regardless of whether you choose a controlled group or all of your users, the process of canary deployments remains the same.
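As a hedged sketch of this user-subset variant, again assuming an Istio-style mesh and a hypothetical x-canary header that identifies the controlled group, a rule like the following routes only matching requests to the canary:

# Illustrative sketch only: requests carrying the hypothetical x-canary header
# are routed to the v2 subset; all other requests stay on v1.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payments
spec:
  hosts:
    - payments
  http:
    - match:
        - headers:
            x-canary:
              exact: "true"
      route:
        - destination:
            host: payments
            subset: v2
    - route:
        - destination:
            host: payments
            subset: v1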

Canary Deployment Steps

Fundamentally, a canary deployment consists of the following five steps:

  • Deploy

  • Modify Traffic Split

  • Observe

  • Promote

  • Rollback

Figure 5-2. Canary Deployment Steps

Deploy

Before deploying the new version of your service, you must determine a measure of success. These criteria typically relate to several observable outputs of the system, not just the number of errors. For example, you may decide on the following requirements:

  1. The system must not have an increased level of errors.

  2. The performance of the system must be within ±10% of the previous version.

  3. The system must process requests according to its design.

When you deploy a new version of the application (the canary) into the environment, it should not immediately start to receive traffic. This step allows for initial health checking or production testing before the rollout starts.
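One way to achieve this, sketched here under the same Istio-style assumptions as the earlier examples, is to deploy the canary alongside the current version but keep its route weight at zero until you are ready to begin the rollout:

# Illustrative sketch only: the canary (v2) is deployed and registered with the
# mesh but receives no user traffic until its weight is raised above zero.
http:
  - route:
      - destination:
          host: payments
          subset: v1
        weight: 100
      - destination:
          host: payments
          subset: v2
        weight: 0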

Increase Traffic Split

After deployment, the next step is to modify the split of traffic between the production release and the canary. The percentage of traffic introduced should be enough to produce a meaningful measurement of performance while minimizing risk.

Because this step is iterative, you start with very few requests and slowly increase the share over time until the canary’s traffic reaches 100%; at that point, you move to the promotion phase.
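There is no universal schedule for these increments. The following is a purely hypothetical example of how a progression might be planned, with step sizes and hold times chosen to suit your traffic volume and success criteria; it is not the configuration format of any particular tool:

# Hypothetical rollout schedule: each step raises the canary's share of traffic
# and holds it there while the metrics are observed.
rollout_schedule:
  - canary_weight: 5       # smoke test with a small slice of real traffic
    hold: 15m
  - canary_weight: 25
    hold: 30m
  - canary_weight: 50
    hold: 30m
  - canary_weight: 100     # promote once all success conditions still hold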

Observe

A service mesh provides you with many useful metrics, such as L7 (HTTP or gRPC) status codes, that can be used to determine whether your canary is performing as required. However, application-level metrics should always supplement these built-in metrics.

For example, the metrics that flow through the service mesh will tell you that 5% of the time the service returned an HTTP status 500 with a response time of 100ms; they cannot tell you whether the service is effective at the task it is trying to perform. For example, Mike works on the payments team and is deploying a new version of the payments service as a canary.

While observing the external status of the service, Mike sees that all requests result in an HTTP status code 200 and the response time is within the 100ms tolerance, so he progresses to increase the canary’s traffic.

However, the application is not functioning correctly: the canary has been misconfigured and is using the development credentials for the payment gateway. While the application code is functioning and customer orders have been processed, all payment requests have been sent to a sandbox instead of the production gateway. The result is that any order processed by the canary would have shipped correctly, but the customer’s payment would not have been taken. If Mike had correlated the HTTP status codes from the service mesh with the number of transactions reported by the payment gateway, he would have spotted this misconfiguration and stopped the rollout.

The metrics you need to observe differ from application to application. In many cases, it will be enough to observe error levels and request durations, but only knowledge of the service’s internal function will allow you to make that decision.

In addition to selecting the right metrics to measure, bear in mind that specific errors and performance problems only manifest themselves at a particular load; for example, lock contention on a datastore is only significant when you have enough traffic attempting to obtain the lock. When evaluating the success of a canary, be cautious when increasing the traffic to it, and where possible, take many small steps instead of a few large ones.
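To make this concrete, a success definition should pair the mesh’s traffic metrics with at least one application-level signal. The following sketch shows what such criteria might look like for the payments example; the metric names are invented for illustration and do not correspond to a specific mesh or monitoring product:

# Hypothetical success criteria for the payments canary; the metric names are
# illustrative, not from a specific tool.
success_conditions:
  - http_5xx_rate: "< 1%"                   # mesh-provided L7 status codes
  - request_duration_p50: "< 100ms"         # mesh-provided latency
  - payment_gateway_success_ratio: "> 99%"  # application metric: production
                                            # gateway transactions per order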

Promote

Once the new version of your application is handling 100% of the traffic, you can safely remove the old version, freeing capacity on your cluster.

Rollback

A canary deployment’s core feature is the ability to roll back to the previous state when the new application does not perform as desired. To roll back a canary, you change the percentage of traffic flowing to it to 0; however, you should not remove the canary until you have diagnosed the root cause of the rollback. A failed canary deployment provides a wealth of information in the form of metrics and logs, and the ability to debug or test in situ can help expedite the discovery of why it was not successful.

Why this pattern?

We believe that canary deployments are one of the go-to patterns in modern software release engineering. It is also an established distributed systems pattern that is particularly well suited to a service mesh, as many of the required capabilities, such as dynamic routing and observability, are core features of service meshes.

Technical implementation

When you deploy your application and register it with the service mesh, you tag the new instances with some form of metadata. The data plane handles all upstream calls and load balances across all available endpoints. To perform a canary deployment, the upstream load balancer inside the data plane applies a weighting to the different endpoints based on this metadata.

For example, given a total weight of 1, if you would like to direct 90% of traffic to version 1 and 10% to version 2, you assign a weighting of 0.9 to your version 1 endpoints and 0.1 to your version 2 endpoints. This makes the selection of a version 1 endpoint nine times more likely than a version 2 endpoint. The load balancing algorithm does not consider the number of endpoints in each group, only the percentage of traffic distributed to each group.
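For example, in a mesh whose data plane is Envoy, this weighting surfaces as weighted clusters in the proxy’s route configuration. The following is a simplified, hedged sketch of such a route (the cluster names are illustrative, and in practice the control plane generates this configuration for you):

# Simplified sketch of an Envoy-style weighted route.
routes:
  - match:
      prefix: "/"
    route:
      weighted_clusters:
        clusters:
          - name: payments_v1
            weight: 90    # roughly nine times more likely to be selected
          - name: payments_v2
            weight: 10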

Generally, each data plane implements its own load balancing strategy. The control plane configures the weightings but plays no part in endpoint selection, because distributed load balancing produces approximately the same result as if the decision had been made centrally.

This distributed approach increases the service mesh’s performance, as no expensive network hops or contended central resources are required. Reliability also increases, because the data plane can continue to distribute traffic while experiencing a temporary interruption in service from the control plane.

Reference Implementation

Let’s see the pattern in action; you can use Meshery to deploy an example application to your service mesh. The following Meshery deployment configuration will add the example application for the canary pattern. The deployment creates three applications:

  • API, the ingress application, which has an upstream application, Payments.

  • Payments v1, configured to handle 50% of all traffic sent by API.

  • Payments v2, configured to handle 50% of all traffic sent by API.

deployment:
  canary:
    upstreams:
      - name: payments_v1
        weight: 50%
        instances: 1
      - name: payments_v2
        weight: 50%
        instances: 1

The folder canary in the GitHub repository contains the Meshery application and performance test files. Run the following command to set up the demo application:

mesheryctl pattern apply -f ./canary_deployment.yaml
  
## Deploying application
     Application deployed, you can access the application using the URL:
     http://localhost:8200

With the demo application deployed, you can now test it. First, let’s run a manual test to see the service mesh balancing the traffic between version 1 and version 2 of the payments application. Run the following command in your terminal:

curl http://localhost:8200

You should see output similar to the following:

➜ curl localhost:8200
{
  "name": "API",
  "uri": "/",
  "type": "HTTP",
  "ip_addresses": [
    "10.42.0.16"
  ],
  "start_time": "2020-09-13T10:35:44.202865",
  "end_time": "2020-09-13T10:35:44.245175",
  "duration": "42.3107ms",
  "body": "Hello World",
  "upstream_calls": [
    {
      "name": "PAYMENTS V1",
      "uri": "http://localhost:9091",
      "type": "HTTP",
      "ip_addresses": [
        "10.42.0.17"
      ],
      "start_time": "2020-09-13T10:35:44.233257",
      "end_time": "2020-09-13T10:35:44.243477",
      "duration": "10.2194ms",
      "headers": {
        "Content-Length": "259",
        "Content-Type": "text/plain; charset=utf-8",
        "Date": "Sun, 13 Sep 2020 10:35:43 GMT",
        "Server": "envoy",
        "X-Envoy-Upstream-Service-Time": "13"
      },
      "body": "Hello World",
      "code": 200
    }
  ],
  "code": 200
}

The demonstration application outputs the call details, including timings and the response from the upstream service. Because the traffic split is set to 50/50, you will see the name of the upstream service returned as either PAYMENTS V1 or PAYMENTS V2.

Executing this curl command multiple times will show you a 50/50 split between version 1 and version 2. You can use Meshery’s performance management feature to characterize the performance of the deployed application. The file ./canary_test.yaml in the source repository contains the following test definition. We will run a relatively simple 30-second test that validates that the two versions receive the correct traffic split.

performance_profile:
  duration: 30s
  qps: 100
  threads: 4
  success_conditions:
    - request_duration_p50: 100ms
    - request_duration_p50["name=payments_v1"]:
        value: < 100ms
    - request_duration_p50["name=payments_v2"]:
        value: < 100ms
    - request_count["name=payments_v1"]:
        value: 50%
        tolerance: 0.1%
    - request_count["name=payments_v2"]:
        value: 50%
        tolerance: 0.1%

You can run this test using the following command:

mesheryctl pattern apply -f canary_test.yaml

Once the test completes, you will see the summary report. The tests defined in the performance test document will all have passed, and you will see a distribution between the two versions of approximately 50% each.

## Executing performance tests
Summary:
  Total:        30.0313 secs
  Slowest:       0.0521 secs
  Fastest:       0.0140 secs
  Average:       0.0200 secs
  Requests/sec: 99.8959
  
  Total data:   2534390 bytes
  Size/request: 844 bytes
 Results:
  request_duration_p50                      20ms  PASS
  request_duration_p50["name=payments_v1"]  22ms  PASS
  request_duration_p50["name=payments_v2"]  18ms  PASS
  request_count["name=payments_v1"]         1496  PASS
  request_count["name=payments_v2"]         1504  PASS

Before continuing, why not experiment with some different splits for the traffic to see how this affects the results?
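For example, you could edit the deployment definition to send only 10% of the traffic to version 2, re-apply it, and then re-run the performance test, remembering to adjust the request_count conditions to match the new split:

deployment:
  canary:
    upstreams:
      - name: payments_v1
        weight: 90%
        instances: 1
      - name: payments_v2
        weight: 10%
        instances: 1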

Discussion

So far, you have learned the basics of the Canary pattern; however, several patterns complement a canary deployment, and there are caveats and considerations you need to know to operate a canary deployment successfully. Let’s take a look at these.

Caveats and Considerations

Now that you understand the canary deployment pattern’s fundamentals, let’s look at some essential caveats and considerations.

Parallel Change2

When introducing a new version of an application into production, you are often required to:

  • Change or modify the application’s behavior.

  • Change the interface.

  • Change a model in an external data store.

The ability to successfully employ canary deployments requires that these actions remain compatible with the previous version of the service. For example, should the new version of a service necessitate a change to the datastore, such as changing a string value to an integer, it is unlikely that the datastore will remain compatible with the current service version.

It’s important to consider all these factors when deciding whether it is possible to use a canary deployment pattern.

Sample Size

When observing the results from a newly deployed service, you need to consider the sample size, or the number of requests the new service has processed. For example, suppose you deploy a new version of the service into production and it shows a 100% success rate. What if that version had only received one request, and by chance that request was successful? All subsequent requests may have failed. In statistics, this is expressed as a confidence level: you state that, given a repeat of the experiment, you are n% confident of the same outcome.

You can use a simple formula to determine the sample size needed to be confident that your canary behaves the way you expect. To calculate the sample size, you can use Yamane’s formula:3

n = N / (1 + Ne²)

N is the population

e is the margin of error

n is the sample size

The population is the number of requests that make up your comparison group. For example, if in a 24-hour period the existing service receives 10,000 requests with an error rate of 10%, you might determine that your canary is successful if its error rate is less than or equal to 10%. Your sample needs to be based on a comparable population, so you can use 10,000 requests as the population.

For the margin of error, you can use the statistical norm of ±5%. Putting all of this together, you get the following equation and result.

n = 10000 / (1 + 10000 × 0.0025) ≈ 385

That means you can be 95% confident that a repeat of the experiment would yield the same results given a sample of roughly 385 requests. Therefore, after approximately 385 requests, you can be 95% sure that the canary’s error rate can be meaningfully compared to that of the original version.

If you are unsure of how to calculate an appropriate sample, the safest approach is to ensure your sample equals the population.

Like for Like Comparison

One of the dangers you should be aware of when determining whether a canary is successful is failing to compare like for like. For example, version 1 of your service has an average response time of 100ms, and version 2 has an average response time of 20ms. This is a vast improvement, and you might convince yourself that version 2 is a complete success.

However, suppose you did your deployment at noon, when the traffic to the application is predominantly read-based, and the service does the bulk of its work at midnight. Had you run the canary at midnight, the new service might have shown a performance degradation of 300ms, which would have resulted in a rollback.

When determining the success criteria of a canary, always be careful about the measurement period, ensuring you select a period that represents the full spectrum of the application’s work.

Outliers

One of the main reasons that unit, functional, and even manual testing does not fully protect you from errors when you deploy an application to production is that applications are always dependent on their environment. For this reason, you should consider the size of the deployment group when evaluating a canary. A single instance deployed to a noisy node could lead to a false negative, as it is not the application that is at fault but the node to which it has been deployed. The larger the deployment set, the easier it is to identify outliers like this. If possible, you should attempt to deploy a canary that mimics the size of the original deployment.

Automated vs. Operator led Canary Deployment

Canary deployments are an iterative process: you configure the traffic split, analyze the results, and repeat. So far, we have predominantly addressed the technical operation of canary deployments; we have not yet discussed who should manage this process.

Your authors believe that, correctly implemented, a manual approach to canary deployments offers little reduction in operational risk over an automated one. An operator’s actions in analyzing the metrics or changing the traffic split are no different from the logic codified into an automated system. Depending on the duration of a rollout, it may in fact be less risky to let automation control the operation: a canary deployment’s performance needs to be measured continually to make the necessary decisions to promote or roll back, and a machine is less likely to drift off into a Reddit rabbit hole and neglect its duties.

Thankfully, there is no need to codify this process yourself. At the time of writing, the two most popular tools for automating canary deployments are the CNCF projects Flagger and Argo. To successfully implement the canary pattern, you need a fundamental understanding of how it works; still, we recommend that you seek a tool that allows the pattern to be applied as an automated process when you employ it in production.
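To give a flavor of what such automation looks like, the following is a rough sketch of a Flagger Canary resource; the values are illustrative, and you should check the Flagger documentation for the exact schema and the metrics available in your mesh:

# Rough sketch of a Flagger Canary resource; values are illustrative.
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: payments
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments
  service:
    port: 9091
  analysis:
    interval: 1m       # how often the analysis runs
    threshold: 5       # failed checks before an automatic rollback
    stepWeight: 10     # traffic increment per iteration
    maxWeight: 50      # weight at which the canary is promoted
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500
        interval: 1m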

Conclusion and Further Reading

In this chapter, you have learned how the service mesh provides you with the capability to deploy application code with minimized risk. The correct implementation of a canary deployment relies on an appropriate rollout duration and sample size to ensure the test’s accuracy, and you have learned how to calculate an appropriate sample. You have also learned that patterns need to be layered: a canary is often combined with a retry to protect the end user when the canary returns errors, along with many more nuances that will help you use the pattern successfully.

To learn more about canary deployments, we recommend reading the following articles, which are cited in this chapter.

1 Sato, D., 2021. bliki: CanaryRelease. [online] martinfowler.com. Available at: <https://martinfowler.com/bliki/CanaryRelease.html> [Accessed 2 February 2021].

2 Medium. 2021. Expand Contract Pattern and Continuous Delivery of Databases. [online] Available at: <https://medium.com/continuousdelivery/expand-contract-pattern-and-continuous-delivery-of-databases-4cfa00c23d2e> [Accessed 2 February 2021].

3 Israel, G., 2021. Determining Sample Size. [ebook] University of Florida. Available at: <https://www.tarleton.edu/academicassessment/documents/samplesize.pdf> [Accessed 2 February 2021].
