Retries are a foundational distributed systems pattern; they help your systems deal with problems such as unreliable connectivity, dynamically changing endpoints, and software bugs. Engineers who design distributed applications are encouraged to design for failure and to anticipate that such failures are inevitable; for this reason, retries can be considered part of the first line of defense when building resilience into a distributed application.
Let’s look at the problem statement in greater detail.
Services communicating across a network are subject to transient failures caused by several factors:
Network instability causing disruptions or disconnections.
Network latency causing the upstream service to fail to respond in a timely fashion.
High load or rate-limiting in the upstream service causing it to be slow to respond to or accept the request.
The endpoint no longer existing because the data plane’s endpoint catalog is out of sync with the deployed application instances.
The upstream service suffering a temporary failure, caused either by an internal bug or by a failing upstream dependency.
Each of these possible causes for an upstream service being unreachable or unresponsive presents its own challenges to overcome. Because these failures are not mutually exclusive, a single application can be affected by several of them simultaneously.
Let’s look at a scenario. Jerry works for a social media company, and his team owns the profile service for the application. The profile service allows users to update their name, biography, and interests, and to upload a photograph or avatar for their profile. The service writes profile data to the database; to upload images, however, it calls the upstream image service, which is responsible for uploading and resizing images.
The profile and image services are deployed resiliently, with multiple instances running on the service mesh. Still, Jerry occasionally sees errors in his metrics reporting that his service fails when a user uploads an image.
Digging deeper, Jerry sees that the errors are due to the upstream image service disconnecting partway through the image upload. When he chats with the image service team, he learns that their metrics report no errors at the times when Jerry sees them in his application. However, the team spots something that is probably the root cause: whenever Jerry sees errors, the image service was deploying a new version.
Why did redeploying the application cause errors?
When the image team redeployed their application, the scheduler added the new version of the application before removing the old one. However, if an application terminates without waiting for existing work to complete, all in-progress requests result in an error. The upstream image service behaved in precisely this way: it responded immediately to the kill signal and exited before completing its work and returning a response. Since Jerry’s service never received the response it was expecting, his application returned an error message.
Jerry’s issue, however, was twofold: in addition to the premature termination of all in-progress requests, the list of endpoints that the profile service used to call the upstream image service was only eventually consistent. This caused a situation where the profile service attempted to contact an endpoint that no longer existed.
How could this situation have been avoided in the first place?
The first of Jerry’s problems could be solved by ensuring that the image service responds gracefully to termination signals. Good practice upon receiving a termination signal is to stop accepting new requests and to exit only once all requests currently being handled by the service have completed. Jerry could also have worked around the problem by retrying the request.
The second problem is that when the endpoints in a system change, it takes time for this information to propagate across the service mesh. The standard approach to a failed connection should therefore be to retry against another endpoint in the catalog.
Thankfully, there is a pattern that can help with both of these problems; let’s take a look.
Each failure scenario can have a different solution; however, solving any of them can start with retrying the failed request. For resilience, you configure distributed systems with multiple instances of an application; a simple solution to transient failure is therefore to retry the request, sending the second attempt to a different upstream instance.
A retry is a configurable element of upstream communication in the service mesh; the proxy is responsible for establishing all upstream connections and can be configured to reattempt a request under certain conditions.
Automatic retries are one of the simplest yet most powerful and valuable mechanisms a service mesh has for gracefully handling partial or transient request failures; let’s see how you can implement them.
While examining the technical implementation, you must consider that a retry for an upstream service is not a single mechanism but layered retries at Layer 4 and Layer 7.
The reason for this is that there are two predominant ways that a request can fail:
It is impossible to establish a TCP connection to the upstream, or the connection is closed.
The upstream request fails due to a communicated error or request timeout.
Layer 4 retries are concerned with a service’s ability to establish a connection to an upstream service. Figure 15.1 shows a Layer 4 retry in action: the Payment service attempts to establish a connection to the upstream Currency service at 10.2.1.5. The connection does not succeed because this particular instance of the Currency service no longer exists.
On connection failure, the Payment service’s proxy retries another endpoint from the list. The selection of the next endpoint generally depends on the load-balancing algorithm the proxy is using; still, in most cases, the most appropriate thing to do is not to retry the connection against the same endpoint.
Retrying a Layer 4 connection when establishing the connection fails is generally safe as no request has been sent to the upstream at this point. However, since the proxy is not aware of the connection’s context, it is not safe to retry when a connection is closed as the reason behind the closure is unknown. For example, the connection could be closed due to a network fault, but it may also have been closed deliberately by the upstream service.
To further protect against transient failure, the proxy needs to understand the application protocol. For example, the HTTP protocol states that you return a response code as part of a response to a request. The specification classifies these response codes into different categories, and 5xx codes denote that a request has failed due to a server error. HTTP-aware retries can recover from situations where individual requests fail or where a connection is closed before a response is received.
The following diagram shows this operation; the initial request to the service instance at 10.2.1.5 returned an HTTP status 500. Since the data plane is aware of the HTTP protocol and has been configured to retry when receiving any 5xx status code, the upstream request is attempted again using the next endpoint in the load-balanced list.
While retries are most powerful when they understand the application protocol (like HTTP), this is not required. It is common to configure your service mesh with layered retries; a service will commonly implement both connection and protocol-based retries, as shown in the diagram below.
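How you express layered retries depends on the mesh you run. If your mesh is Istio, for instance, a retry policy on a VirtualService can cover both layers at once; in this sketch (the `currency` host name and the timeout values are illustrative), `retryOn: connect-failure,5xx` retries failed connections (Layer 4) as well as 5xx responses (Layer 7):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: currency
spec:
  hosts:
    - currency
  http:
    - route:
        - destination:
            host: currency
      retries:
        attempts: 3              # total reattempts before reporting failure
        perTryTimeout: 2s        # fail fast on each individual attempt
        retryOn: connect-failure,5xx
```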
To see how this works in action, let’s deploy an example and run some tests.
Let’s look at an example of retries in action, using the pattern outlined in Figure x.2. You can use Meshery to deploy an example application consisting of a Payment service that calls upstream Currency services. There are two instances of the Currency service; the first is configured to operate correctly, while the second returns an error for half of all requests. Given round-robin load balancing of requests from the Payment service to the Currency service, you would expect approximately 25% of all requests to fail without a retry. To handle the upstream failure, the service mesh retries the upstream Currency service up to 3 times before reporting a failure.
---
deployment:
  retry:
    upstream_services: 2
    upstream_retries: 3
    upstream_error_rate:
      - 0  # version 1 0% error rate
      - 50 # version 2 50% error rate
You can apply this configuration and launch the application using the following command:
mesheryctl perf deploy --file ./retry_test.yaml
Let’s now test that the system is exhibiting the correct behavior by using the following Meshery performance test.
---
performance_test:
  duration: 200s
  rps: 100
  threads: 10
  success_conditions:
    - request_count[status_code=200, name="payments"]:
        value: 100%
    - request_count[status_code!=200, name="payments"]:
        value: 0%
    - request_count[status_code=200, name="currency_v1"]:
        value: 100%
    - request_count[status_code=500, name="currency_v1"]:
        value: 0%
    - request_count[status_code=200, name="currency_v2"]:
        value: 50%
        tolerance: 0.1%
    - request_count[status_code=500, name="currency_v2"]:
        value: 50%
        tolerance: 0.1%
You can start the test using the following command:
mesheryctl perf run --file retry_test.yaml
Once the test completes, you will see that the number of errors returned from the currency_v2 service, which is configured to return an error for 50% of all requests, is approximately 25% of all traffic. However, the number of errors passed to the caller of the Payment service is 0. The reason is that the service mesh has been retrying these requests internally; you can see the effect of this in currency_v1 actually serving approximately 75% of all requests.
## Executing performance tests
Summary:
  Total:        180.0521 secs
  Slowest:      0.093 secs
  Fastest:      0.0518 secs
  Average:      0.0528 secs
  Requests/sec: 189.2397

Results:
  request_count[status_code=200, name="payments"]      34020 PASS
  request_count[status_code!=200, name="payments"]         0 PASS
  request_count[status_code=200, name="currency_v1"]   25559 PASS
  request_count[status_code=500, name="currency_v1"]       0 PASS
  request_count[status_code=200, name="currency_v2"]    8423 PASS
  request_count[status_code=500, name="currency_v2"]    8677 PASS
Now that you have seen the pattern in action, let’s look at some related patterns that complement a retry.
Retries are rarely used on their own; patterns like circuit breaking ensure you do not repeatedly retry a failing endpoint, and timeouts ensure you fail fast. Let’s look at these patterns in more depth.
Retries are an essential pattern for ensuring resilience in your application, but you must understand a certain amount of complexity to use them correctly. You also must have good observability into your system; to configure a retry correctly, you need to understand the system’s behavior. By routing your application’s traffic through the service mesh, you gain the capability to observe the network traffic and, importantly, to configure retry behavior centrally, without needing to redeploy or recompile your application.
In the next chapter, you will learn an essential companion to the Retry pattern, the Circuit Breaker. Since the invention of distributed systems, correctly layering patterns like Retries, Circuit Breaking, and Timeouts has been paramount to an application’s uptime. With the service mesh, this problem is merely configuration, and you do not need to change a single line of code.