Chapter 10: Dealing with Service Failures


One of the vulnerabilities of building a large microservice-based application is dealing with service failures. The more services you have, the greater the likelihood that one of them will fail, and the larger the number of other services that depend on the failed service. How can you deal with these service failures without adding instability to your application? In this chapter, we will discuss some techniques for dealing with service failures.

Cascading Service Failures

Consider a service that you own. It has several dependencies, and several services depend on it. Figure 10-1 illustrates the service “Our Service” with multiple dependencies (Service A, Service B, and Service C) and several services that depend on it (Consumer 1 and Consumer 2). Our service depends on three services, and two services depend on it.

Figure 10-1. Our Service and its dependencies and consumers

What happens if one of our dependencies fails? Figure 10-2 shows Service A failing.

Unless you are careful, Service A failing can cause “Our Service” to also fail, since it has a dependency on Service A.

Figure 10-2. Our Service with a failed dependency

Now if “Our Service” fails, this failure can cause Consumer 1 and Consumer 2 to fail. The error can cascade, causing many more services to fail, as shown in Figure 10-3.

A single service in your system can, if unchecked, cause serious problems to your entire application.

Figure 10-3. Cascading failure

What can you do to prevent cascading failures from occurring? There are times when you can do nothing—a service error in a dependency will cause you (and other dependent services) to fail, because of the high level of dependency required. Sometimes your service can’t do its job if a dependency has failed. But that isn’t always the case. In fact, often there is plenty you can do to salvage your service’s actions for the case in which a dependent service fails.

Responding to a Service Failure

When a service you depend on fails, how should you respond? As a service developer, your response to a dependency failure must be:

Predictable

Understandable

Reasonable for the situation

Let’s look at each of these.

Predictable Response

Providing a predictable response is essential if other services are to be able to depend on yours. You must provide a predictable response given a specific set of circumstances and requests. This is critical to keep the previously described cascading service failures from affecting every aspect of your application. Even a small failure in such an environment can cascade and grow into a large problem if you are not careful.

As such, if one of your downstream dependencies fails, you still have a responsibility to generate a predictable response. That predictable response might be an error message. An error message is an acceptable response, as long as your API includes an appropriate error mechanism that allows you to generate such a response.

[note]An error response is not the same as an unpredictable response. An unpredictable response is a response that is not expected by the services you are serving. An error response is a valid response stating that you were not able to perform the specified request. They are two different things.

If your service is asked to perform the operation “3 + 5,” it is expected to return a number, specifically the number “8.” This is a predictable response. If your service is asked to perform the operation “5 / 0,” a predictable response would be “Not a Number,” or “Error, invalid request.” That is a predictable response. An unpredictable response would be if you returned “50000000000” once and “38393384384337” another time (sometimes described as garbage in, garbage out).

A garbage in, garbage out response is not a predictable response. A predictable response to garbage in would be “invalid request.”
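As a minimal sketch of this idea (the function name and response fields are illustrative assumptions, not from a specific service), an “add” operation might map every input, including garbage, to either a valid result or a planned error response:

# Minimal sketch of a predictable response: every input maps either to a
# valid result or to a planned, structured error -- never to garbage output.
def add(a, b):
    # Garbage in does not become garbage out; it becomes "invalid request".
    if not isinstance(a, (int, float)) or not isinstance(b, (int, float)):
        return {"status": "error", "message": "invalid request"}
    return {"status": "ok", "result": a + b}

print(add(3, 5))        # {'status': 'ok', 'result': 8}
print(add("red", None)) # {'status': 'error', 'message': 'invalid request'}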

The services that depend on you expect you to provide a predictable response. Don’t output garbage if you’ve been given garbage as input. If you provide an unpredictable response to an unpredictable reaction from a downstream service, you just propagate the unpredictability up the value chain. Sooner or later, that unpredictable reaction will be visible to your customers, which will affect your business. Or, worse, the unpredictable response injects invalid data into your business processes, which makes those processes inconsistent and invalid. This can affect your business analytics as well as create a negative customer experience.

As much as possible, even if your dependencies fail or act unpredictably, it is important that you do not propagate that unpredictability upward to those who depend on you.

[note]A predictable response really means a planned response. Don’t think, “Well, if a dependency fails, I can’t do anything, so I might just as well fail, too.” Instead, proactively figure out what a reasonable response to the situation would be. Then detect the situation and perform that planned response.

Understandable Response

Understandable means that you and your upstream consumers have agreed on the format and structure of your responses. This constitutes a contract between you and your upstream services. Your response must fit within the bounds of that contract, even if you have misbehaving dependencies. It is never acceptable to violate your API contract with your consumers just because a dependency violated its API contract with you. Instead, make sure your contracted interface provides enough support to cover all contingencies of action on your part, including the case of failed dependencies.
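One way to picture such a contract, as a rough sketch only (the envelope fields shown here are hypothetical, not a prescribed format), is a single response envelope that every reply, success or failure, must fit into:

# Sketch of an agreed-upon response envelope (field names are illustrative).
# Every response -- success or failure -- fits the same contract, so consumers
# can always parse it, even when one of our dependencies has misbehaved.
def make_response(result=None, error=None):
    if error is not None:
        return {"status": "error", "error": str(error), "result": None}
    return {"status": "ok", "error": None, "result": result}

# A failed dependency still produces a contract-conforming response:
print(make_response(error="dependency unavailable, please retry later"))
print(make_response(result={"product_id": 42, "name": "T-shirt"}))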

Reasonable Response

Your response should be indicative of what is actually happening with your service. When asked “What is 3 + 5?” your service should return an acceptable answer even if its dependencies are failing. It might be acceptable to return “Sorry, I couldn’t calculate that result” or “Please try again later,” but it should not return “red” as the answer.

This sounds obvious, but you’d be surprised by the number of times an unreasonable response can cause problems. Imagine, for instance, if a service wants to get a list of all accounts that are expired and ready to be deleted. As illustrated in Figure 10-4, you might call an “expired account” service (which will return a list of accounts to be deleted), and then go out and delete all the accounts in the list.

Figure 10-4. Unreasonable API response

If the “expired account” service runs into a problem and cannot calculate a valid response, it should return “None,” or “I’m sorry, I can’t do that right now.” Imagine the problems it would cause if, instead of returning a reasonable response, it returned a list of all accounts in the system. In that case, the “manager service” would go ahead and try to delete every account in the system, which is almost certainly the wrong thing to do, and the results would be devastating if all the accounts in your application were suddenly deleted.
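To guard against exactly this kind of problem, the consumer of such a response can also sanity-check what it receives before acting on it. The following sketch is illustrative only; the client object, the response fields, and the 10,000-account limit are assumptions for the example, not part of any real API:

# Defensive sketch for the "manager service" side (names are illustrative).
# Before acting on a list of expired accounts, sanity-check that the response
# is plausible; a wildly oversized list is treated as a failure, not obeyed.
MAX_PLAUSIBLE_EXPIRED = 10_000  # assumed business limit for one batch

def delete_expired_accounts(expired_account_client, account_store):
    response = expired_account_client.get_expired_accounts()
    if response.get("status") != "ok":
        return  # dependency failed; do nothing rather than guess
    accounts = response.get("result") or []
    if len(accounts) > MAX_PLAUSIBLE_EXPIRED:
        # An unreasonable answer: refuse to act and raise the alarm instead.
        raise RuntimeError("expired-account list implausibly large; aborting deletes")
    for account_id in accounts:
        account_store.delete(account_id)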

Determining Failures

Now that we know how to respond to failures, let’s discuss how to determine when a dependency is failing in the first place. How do you determine when a dependency is failing? It depends on the failure mode. Here are some example failure modes that are important to consider, ordered from easiest to detect to hardest to detect:

Garbled response

The response was not understandable. It was “garbage” data in an unrecognizable format. The response might be in the wrong format, or the format might contain syntax errors.

Response indicated a fatal error occurred

The response was understandable. It indicated that a serious error occurred processing the request. This is usually not a failure of the network communications layer, but of the service itself. It could also be caused by the request sent to the service not being understandable.

Response was understandable but returned results were not what was expected

The response was understandable. It indicated that the operation was performed successfully without serious errors, but the data returned did not match what was expected to be returned.

Result out of expected bounds

The response was understandable. It indicated that the operation was performed successfully without serious errors. The data returned was of a reasonable and expected format, but the data itself was not within expected bounds. For example, consider a service call that requests the number of days since the first of the year. What happens if it returns a number such as 843? That result is understandable and parsable, and the call did not fail, but the value is clearly not within the expected bounds.

Response did not arrive

The request was sent, but no response ever arrived from the service. This could happen as a result of a network communications problem, a service problem, or a service outage.

Response was slow in arriving

The request was sent, and the response was received. The response was valuable and useful, and within expected bounds. However, the response came much later than expected. This is often an indication that the service or network is overloaded, or that some other resource allocation issue exists.

When you receive a response that is garbled, you instantly know the response is not usable and can take appropriate action. An understandable response that did not match the needed results can be a bit more challenging to detect, and the appropriate action to take can be tougher to determine, but detection is still feasible.

A response that never arrives is difficult to detect in a way that allows you to perform an appropriate action with the result. If all you are going to do is generate an error response to your consumer, a simple timeout on your dependency may suffice in catching the missing response.
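As a minimal sketch, that simple timeout might look like the following (the URL, the 500 ms timeout value, and the response fields are assumptions for illustration; the sketch uses the third-party requests library):

# Minimal sketch: a simple timeout on the dependency call, translated into a
# predictable error response for our own consumer. The URL is illustrative.
import requests

def call_dependency(payload):
    try:
        resp = requests.post("https://dependency.example.com/api",
                             json=payload, timeout=0.5)
        resp.raise_for_status()
        return {"status": "ok", "result": resp.json()}
    except requests.exceptions.Timeout:
        return {"status": "error", "error": "dependency timed out"}
    except requests.exceptions.RequestException as exc:
        return {"status": "error", "error": f"dependency failed: {exc}"}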

A Better Approach to Catching Responses That Never Arrive

[note]This doesn’t always work, however. For instance, what do you do if a service usually takes 50 ms to respond, but variation can cause the response to come as quickly as 10 ms or take as long as 500 ms? What do you set your timeout to? An obvious answer is something greater than 500 ms. But what if your contracted response time to the consumer of your service is <150 ms? Obviously, a simple timeout of 500 ms isn’t reasonable, as that is effectively the same as simply passing your dependency’s error on to your consumer. This violates both the predictable and the understandable tests.

How can you resolve this issue? One potential answer is to use a circuit breaker pattern. This coding pattern involves your service keeping track of calls to your dependency and how many of them succeed versus how many fail (or timeout). If a certain threshold of failures is reached, the circuit breaker “breaks” and causes your service to assume your dependency is down and stop sending requests or expecting responses from the service. This allows your service to immediately detect the failure and take appropriate action, which can save your upstream latency SLAs.

You can then periodically check your dependency by letting an occasional test request through. If requests begin to succeed again (above a predefined threshold), the circuit breaker is “reset” and your service can resume using the dependency.
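A bare-bones sketch of such a circuit breaker might look like the following. The thresholds, the cool-down period, and the class interface are illustrative assumptions, and this simplified version opens after a run of consecutive failures rather than tracking a full success/failure ratio:

# Sketch of a simple circuit breaker (thresholds and names are illustrative).
# After too many consecutive failures the breaker "opens" and calls fail fast;
# after a cool-down period one probe call is allowed through to test recovery.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failure_count = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                # Fail fast instead of waiting on a dependency we assume is down.
                raise RuntimeError("circuit open: dependency assumed unavailable")
            # Cool-down elapsed: let one probe call through to test recovery.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        # Success: reset the breaker.
        self.failure_count = 0
        self.opened_at = None
        return result

You would wrap each dependency call, for example breaker.call(call_dependency, payload), so that once the breaker opens, callers fail immediately instead of waiting for timeouts.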

A response that comes in slowly from a service (versus never arriving) is perhaps the most difficult to detect. The problem becomes: how slow is too slow? This can be a tough question, and simply using basic timeouts (with or without circuit breakers) is usually insufficient to handle this situation reasonably, because a slow response can sometimes be fast enough, generating seemingly erratic results. Remember, predictability of response is an important characteristic of your service, and a dependency that fails unpredictably (because of slow responses and bad timeouts) will hurt your ability to provide a predictable response to your consumers.

Greater Sophistication in Detecting Slow Dependencies

[note]A more sophisticated timeout mechanism, along with the circuit breaker and similar patterns, can help with this situation. For instance, you can create “buckets” for capturing the recent performance of calls to a given dependency. Each time you call the dependency, you record the call in a bucket based on how long the response took to arrive. You keep results in the buckets for a specific period of time only. Then you use these bucket counts to create rules for triggering the circuit breaker. For instance, you could create these rules:

If you receive 500 requests in one minute that take longer than 150 ms, trigger the circuit breaker.

If you receive 50 requests in one minute that take longer than 500 ms, trigger the circuit breaker.

If you receive 5 requests in one minute that take longer than 1,000 ms, trigger the circuit breaker.

This type of layered technique can catch more serious slowdowns earlier while not ignoring less serious slowdowns.
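A rough sketch of those layered rules might look like the following. The thresholds mirror the rules above, but the class and method names are illustrative assumptions, and rather than discrete buckets this sketch keeps a one-minute sliding window of call durations and counts against each rule:

# Sketch of layered latency rules for tripping the circuit breaker.
import time
from collections import deque

# (latency threshold in seconds, count per minute that trips the breaker)
RULES = [(0.150, 500), (0.500, 50), (1.000, 5)]
WINDOW_S = 60.0

class LatencyTracker:
    def __init__(self):
        self.samples = deque()  # (timestamp, duration_seconds)

    def record(self, duration_s):
        now = time.monotonic()
        self.samples.append((now, duration_s))
        # Drop samples older than the one-minute window.
        while self.samples and now - self.samples[0][0] > WINDOW_S:
            self.samples.popleft()

    def should_trip_breaker(self):
        for threshold_s, trip_count in RULES:
            slow = sum(1 for _, d in self.samples if d > threshold_s)
            if slow >= trip_count:
                return True
        return False

Each dependency call records its duration with record(), and the breaker is tripped as soon as should_trip_breaker() returns True.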

Appropriate Action

What do you do if an error occurs? That depends on the error. The following are some useful patterns that you can employ for handling errors of various types.

Graceful Degradation

If a service dependency fails, can your service live without the response? Can it continue performing the work it needs to do, just without the response from the failed service? If your service can perform at least a limited portion of what it was expected to do without the response from the failed service, this is an example of graceful degradation.

Graceful degradation means that when a service lacks needed results from a failed dependency, it reduces the amount of work it accomplishes as little as possible.

Example 10-1. Reduced functionality

Imagine that you have a web application that generates an ecommerce website that sells T-shirts. Let’s also assume that there is an “image service” that provides URLs for images to be displayed on this website. If the application makes a call to this image service and the service fails, what should the application do? One option would be for the application to continue displaying the requested product to the customer, but without the images of the product (or show a “no image available” message). The web application can continue to operate as an ecommerce store, just with the reduced capability of not being able to display product images.

This is far superior to the ecommerce website failing outright and returning an error to the user simply because the images are not available.

Example 10-1 is an example of reduced functionality. It is important for a service (or application) to provide as much value as it can, even if not all the data it normally needs is available due to a dependency failure.
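A small sketch of that behavior might look like the following; the image_service object and the page fields are assumptions for illustration:

# Sketch of graceful degradation for the T-shirt example (service names are
# illustrative). If the image service fails, the product page is still built,
# just without images, instead of failing the whole request.
def build_product_page(product, image_service):
    try:
        image_urls = image_service.get_image_urls(product["id"])
    except Exception:
        # Degrade: keep serving the product, just without its images.
        image_urls = []
    return {
        "title": product["name"],
        "price": product["price"],
        "images": image_urls,  # may be empty
        "image_note": None if image_urls else "No image available",
    }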

Graceful Backoff

There comes a point at which there just aren’t enough results available to be useful. The request must simply fail. Instead of generating an error message, can you perform some other action that will provide value to the consumer of your service?

Example 10-2. Graceful backoff

Continuing with the situation described in Example 10-1, suppose that the service that provides all the details for a given product has failed. This means that the website doesn’t have any information it can display about the requested product. It doesn’t make any sense to simply show an empty page, as that is not useful to your customers. It is also not a good idea to generate an error (“I’m sorry, an error occurred”).

Instead, you could display a page that apologizes for the problem, but provides links to the most popular products available on the site. Although this is not what the customer really wanted, it is potentially of value to the customer, and it prevents a simple “page failed” error from occurring.

Changing what you need to do in a way that provides some value to the consumer, even if you cannot really complete the request, is an example of graceful backoff.
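As a rough sketch (the product_service and catalog_service objects and the page structure are illustrative assumptions), graceful backoff might look like this:

# Sketch of graceful backoff (service names are illustrative). When the product
# details cannot be fetched at all, fall back to a page of popular products
# rather than an error page.
def render_product_page(product_id, product_service, catalog_service):
    try:
        product = product_service.get_product(product_id)
        return {"page": "product", "product": product}
    except Exception:
        # Back off: apologize and offer something still useful to the customer.
        popular = catalog_service.get_popular_products(limit=10)
        return {
            "page": "fallback",
            "message": "Sorry, we couldn't load that product right now.",
            "suggestions": popular,
        }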

Fail as Early as Possible

What if it is not possible for your service to continue to operate without the response from the failed service? What if there are no reduced functionality or graceful backoff options that make sense? Without the response from the failed service, you can’t do anything reasonable. In this case, you might just need to fail the request.

If you have determined that there is nothing you can do to save a request from failing, it is important that you fail the request as soon as possible. Do not continue performing other actions or tasks that are part of the original request once you know the request will fail.

A corollary to this rule is to perform as many checks on an inbound request as possible and as early as possible in order to ensure that, when you move forward, there is a good chance that the request will succeed.

Example 10-3. Divide by zero

Consider the service that takes two integers and divides them. You know that it is invalid to divide a number by zero. If you get a request such as “3 / 0,” you could try to calculate the result. Sooner or later in the calculation process, you’ll notice that the result can’t be generated, and you will issue an error.

In Example 10-3, because you know that division by zero will always fail, simply check the data that is passed in with the request. If the divisor is zero, return an error immediately. There is no reason to attempt the calculation.
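A minimal sketch of that early check might look like the following; the request format and field names are assumptions for illustration:

# Sketch of failing as early as possible: validate the request before doing
# any work or calling any dependencies.
def handle_divide_request(request):
    dividend = request.get("dividend")
    divisor = request.get("divisor")
    # Check everything we can up front; no work is wasted on a doomed request.
    if not isinstance(dividend, (int, float)) or not isinstance(divisor, (int, float)):
        return {"status": "error", "error": "invalid request: operands must be numbers"}
    if divisor == 0:
        return {"status": "error", "error": "invalid request: division by zero"}
    return {"status": "ok", "result": dividend / divisor}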

Why should you fail as early as possible? There are a few reasons:

Resource conservation

If a request will fail, all work you do before you determine that the request will fail is wasted work. If that work involves making many calls to dependent services, you could waste significant resources only to get an error.

Responsiveness

The sooner you determine a request will fail, the sooner you can give that result to the requester. This lets the requester move on and make other decisions more quickly.

Error complexity

Sometimes, if you let a failing request move forward, it fails in a more complex way that is harder to diagnose. For instance, consider the “3 / 0” example. You can determine immediately that the calculation will fail and return that result. If you instead go ahead and perform the calculation, the error will still occur, but perhaps in a more complicated manner. For example, depending on the algorithm you use to do the division, you could get caught in an infinite loop that ends only when a timeout occurs.

Thus, instead of getting a “divide by zero” error, you would wait a very long time and then get an “operation timeout” error. Which error would be more useful in diagnosing the problem?

Customer-Caused Problems

It is especially important to fail as early as possible for cases that involve invalid input coming from the consumer of your service. If you know that there are limits to what your service can do reasonably, check for those limits as early as possible.

Example 10-4. A real-world example of wasted resources

At a company I once worked with, there was an account service that was having performance problems. The service slowed down more and more until it was mostly unusable.

After digging into the problem, we discovered that someone had sent the account service a bad request. Someone had asked the account service to get a list of 100,000 customer accounts, with all the account details.

Now, there is no legitimate business use case for this to have happened (in this context), so the request itself was obviously invalid. The value 100,000 was far outside the range of reasonable values to provide as input to this request.

However, the account service dutifully attempted to process the request...and processed...and processed...and processed...

The service eventually failed because it did not have enough resources to complete such a large request. It stopped after processing a few thousand accounts and returned a simple error message.

The calling service, the one that generated the invalid request, saw the failure message and decided that it should just retry the request. And retry. And retry. And retry.

The account service repeatedly processed thousands of accounts only to have those results thrown away in a failure message. But it did this over and over and over again.

The repeated failed requests consumed large quantities of available resources. They consumed so many resources that legitimate requests to the service began to back up and eventually fail.

In Example 10-4, a simple check early in the account service’s processing of the request (such as a check to ensure that the requested number of accounts was of a reasonable size) could have avoided the excessive and ultimately wasted consumption of resources. Additionally, if the error message returned had indicated that the error was permanent and caused by an invalid argument, the calling service could have seen the “permanent error” indicator and not attempted retries that it knew would fail.

Provide service limits

A corollary to this story is to always provide service limits. If you know your service can’t handle retrieving more than, say, 5,000 accounts at a time, state that limit in your service contract, and check for and reject any request that exceeds it.
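As a sketch of such a limit check (the 5,000-account limit comes from the example above, while the field names and the “permanent” flag are illustrative assumptions):

# Sketch of enforcing a documented service limit. Requests over the limit fail
# immediately with a permanent-error indicator so callers know retrying will
# not help.
MAX_ACCOUNTS_PER_REQUEST = 5_000  # the limit stated in the service contract

def list_accounts(request, account_store):
    count = request.get("count", 0)
    if count > MAX_ACCOUNTS_PER_REQUEST:
        return {
            "status": "error",
            "permanent": True,  # invalid argument: do not retry
            "error": f"count exceeds service limit of {MAX_ACCOUNTS_PER_REQUEST}",
        }
    return {"status": "ok", "result": account_store.fetch_accounts(limit=count)}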

Summary

Garbage in, garbage out is a problematic way of dealing with errors, as it passes responsibility for recognizing a bad result on to other services that may not be able to make effective decisions. Bad data should be detected as early as possible and handled appropriately. Additionally, services should always act in a dependable and understandable manner, even in failure conditions. They should never generate garbage or non-understandable results.
