We have made a long journey through all previous chapters of this book and completed the part of the book dedicated to microservice development basics. So far, you have learned how to bootstrap microservices, write tests, set up service discovery, use synchronous and asynchronous communication between your microservices, and serialize the data between them using different formats, as well as how to deploy the services and verify that their APIs work.
This chapter begins the third part of the book, dedicated to more advanced concepts of microservice development, including reliability, observability, maintainability, and scalability. In this chapter, we will cover some practical aspects of microservice development that are important for ensuring your services can operate well under many conditions, including failure scenarios, changes in network traffic, and unexpected service shutdowns.
In this chapter, we will cover various techniques and processes that can help you increase the reliability of your services. We will cover the following topics:
Let’s proceed to the first section of the chapter, which will help you to understand service reliability concepts better.
To complete this chapter, you need Go 1.11 or above.
You can find the GitHub code for this chapter here:
https://github.com/PacktPublishing/microservices-with-go/tree/main/Chapter10
While implementing new applications, services, or features, engineers often focus first on meeting various system requirements, such as implementing specific application features. The initial result of such work is usually some working code that correctly performs its job, such as handling some data processing task or serving network requests as an API endpoint. We can say that such code initially performs well in isolation—the implemented code produces expected outputs for the inputs we provide.
Things usually get more complex when we add more components to the system. Let’s take our movie service from Chapter 2 and assume that its API gets used by some external service that has millions of users. Our service can be implemented perfectly fine and produce the right results for various test inputs. Still, once we get requests from an external service, we may notice various issues. One of them is called denial of service (DoS)—an external service can overload our service by asking to process too many requests, to the extent that our service stops serving new requests. The outcome of such an issue can vary from minor system performance degradation to service crashes due to reaching CPU, file, or memory limits.
DoS is just one example of what can go wrong in a microservice environment. Assume that you performed a fix that limits the number of incoming requests to your service, but the fix broke the services calling your API because they did not expect a sudden rejection of their requests. An alternative scenario is a backward-incompatible change in a service API, that is, a change that is incompatible with one or more previously released versions of the callers of your service API. As a result, services calling your API could experience various negative effects, up to the point that they would be unable to process any requests.
Let’s define reliability as the quality of a service that stays resilient in the face of unexpected failures: it operates as expected and has explicitly defined limitations. The last clause in our definition makes a big difference to its meaning: it’s not enough to perform a certain function well. It is equally important to be explicit about the service’s limitations and what happens when these limitations are breached.
In our movie service example, we would need to be explicit about multiple things, such as the following:
For example, if our service can’t process more than 100 simultaneous requests per service instance, we could explicitly state this in the documentation to our API and reject all extra incoming requests by returning a special error code, such as HTTP 429 Too Many Requests. Such indication of system limits and explicit communication of congestion issues would be a great step toward improving overall system reliability by making its behavior more deterministic and, hence, reliable.
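To make the idea of an explicit limit concrete, here is a minimal sketch that uses only the Go standard library. It caps the number of requests handled at the same time and answers with HTTP 429 Too Many Requests once the cap is reached. The concurrencyLimiter type and its methods are hypothetical names introduced for this illustration and are not part of the movie service code; the limit of 1 in main is chosen only to keep the demonstration short.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// concurrencyLimiter caps the number of requests handled at the same time.
// It is a hypothetical helper for illustration only.
type concurrencyLimiter struct {
	slots chan struct{}
}

func newConcurrencyLimiter(limit int) *concurrencyLimiter {
	return &concurrencyLimiter{slots: make(chan struct{}, limit)}
}

// tryAcquire reserves a slot without blocking and reports whether it succeeded.
func (l *concurrencyLimiter) tryAcquire() bool {
	select {
	case l.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees a slot previously reserved by tryAcquire.
func (l *concurrencyLimiter) release() {
	<-l.slots
}

// wrap rejects requests above the limit with HTTP 429 Too Many Requests.
func (l *concurrencyLimiter) wrap(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !l.tryAcquire() {
			http.Error(w, "too many simultaneous requests", http.StatusTooManyRequests)
			return
		}
		defer l.release()
		next.ServeHTTP(w, r)
	})
}

func main() {
	limiter := newConcurrencyLimiter(1)
	handler := limiter.wrap(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	}))

	// Occupy the only slot to simulate a request that is still in flight.
	limiter.tryAcquire()
	busy := httptest.NewRecorder()
	handler.ServeHTTP(busy, httptest.NewRequest(http.MethodGet, "/", nil))
	fmt.Println("while busy:", busy.Code)

	// Free the slot and try again.
	limiter.release()
	idle := httptest.NewRecorder()
	handler.ServeHTTP(idle, httptest.NewRequest(http.MethodGet, "/", nil))
	fmt.Println("when idle:", idle.Code)
}
```

In a real service the limit would be set to the documented value (100 in the example above), and the same limit would be stated in the API documentation so that callers know what to expect.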
In general, achieving a high degree of reliability is a continuous process and requires constant improvements in the following three categories:
Prevention, detection, and mitigation improvements can be made by performing two types of actions:
We will divide the rest of the chapter into two sections, describing these two types of actions. Let’s proceed to the first section, covering the automation-related reliability work.
In this section, we will talk about various automation techniques that can help you improve the reliability of your services.
First, let’s get back to communication error handling, which we briefly covered earlier in Chapter 5. Having the right communication error-handling logic in place is the first step toward achieving higher reliability of your services, so we will focus on multiple aspects of error handling that are equally important in microservice development.
As we discussed in Chapter 5 of this book, when two components—such as a client and a server—communicate with each other, there are three possible resulting scenarios:
From the perspective of a client, there are two different classes of errors:
Differentiating between retriable and non-retriable errors is the responsibility of the client. However, it is a good practice for the server to indicate this explicitly whenever possible. For example, a server can return specific codes indicating the types of errors (such as HTTP 404 Not Found) so that a client can recognize non-retriable errors and avoid retrying them. Differentiation between client and server errors also helps to ensure that requests are not retried for non-retriable errors. This is important from the server’s perspective because handling duplicate, invalid requests increases its load.
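As a rough illustration of how a client could act on such codes, the following sketch classifies HTTP status codes into retriable and non-retriable ones. The isRetriable function and the exact set of codes it treats as retriable are assumptions made for this example; a real client should follow the contract documented by the server it calls.

```go
package main

import (
	"fmt"
	"net/http"
)

// isRetriable reports whether a request that received the given HTTP status
// may be worth retrying. The policy below is an assumption for this sketch.
func isRetriable(statusCode int) bool {
	switch statusCode {
	case http.StatusTooManyRequests, // 429: the server asks the client to slow down
		http.StatusBadGateway,         // 502
		http.StatusServiceUnavailable, // 503
		http.StatusGatewayTimeout:     // 504
		return true
	default:
		// Client errors such as 400 Bad Request or 404 Not Found would fail
		// again with the same input, so they are not retried.
		return false
	}
}

func main() {
	codes := []int{http.StatusBadRequest, http.StatusNotFound, http.StatusServiceUnavailable}
	for _, code := range codes {
		fmt.Printf("%d retriable: %v\n", code, isRetriable(code))
	}
}
```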
Let’s illustrate how to handle retriable communication errors by implementing client request retries. Setting up automated responses to potential issues, such as communication errors, helps to make the system more resilient to transient failures, resulting in a better experience for all components in the system.
Let’s illustrate how to implement request retries in microservice code. For this, let’s review the metadata gRPC gateway code we implemented earlier in Chapter 5. The Get function includes the actual call to the metadata service:
resp, err := client.GetMetadata(ctx, &gen.GetMetadataRequest{MovieId: id})
if err != nil {
	return nil, err
}
Let’s now look at the implementation of the GetMetadata endpoint in the metadata service gRPC handler. The GetMetadata function includes the following code:
func (h *Handler) GetMetadata(ctx context.Context, req *gen.GetMetadataRequest) (*gen.GetMetadataResponse, error) {
	if req == nil || req.MovieId == "" {
		return nil, status.Errorf(codes.InvalidArgument, "nil req or empty id")
	}
	m, err := h.ctrl.Get(ctx, req.MovieId)
	if err != nil && errors.Is(err, metadata.ErrNotFound) {
		// status.Error is used because the message is not a format string.
		return nil, status.Error(codes.NotFound, err.Error())
	} else if err != nil {
		return nil, status.Error(codes.Internal, err.Error())
	}
	return &gen.GetMetadataResponse{Metadata: model.MetadataToProto(m)}, nil
}
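As we can see, the implementation of the GetMetadata endpoint includes three error cases, each having its own gRPC error code: InvalidArgument for a nil request or an empty movie ID, NotFound for a record that does not exist, and Internal for any other error returned by the controller.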
The InvalidArgument and NotFound errors are non-retriable: there is no point in retrying requests that fail validation or that try to retrieve records that are not found. Internal errors may indicate a wide range of issues, such as bugs in the service code, so we can’t say with certainty that you should retry them.
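There are, however, some other gRPC error codes that indicate potentially retriable errors. Let’s list some of them: DeadlineExceeded (the request did not complete within its deadline), ResourceExhausted (a resource, such as a request quota, has been exhausted), and Unavailable (the service is currently unable to handle the request).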
Let’s first implement some simple retry logic inside the metadata gRPC gateway by replacing the Get function with the following code:
// Get returns movie metadata by a movie id.
func (g *Gateway) Get(ctx context.Context, id string) (*model.Metadata, error) {
	conn, err := grpcutil.ServiceConnection(ctx, "metadata", g.registry)
	if err != nil {
		return nil, err
	}
	defer conn.Close()
	client := gen.NewMetadataServiceClient(conn)
	// resp holds the raw gRPC response, which is converted to the model below.
	var resp *gen.GetMetadataResponse
	const maxRetries = 5
	for i := 0; i < maxRetries; i++ {
		resp, err = client.GetMetadata(ctx, &gen.GetMetadataRequest{MovieId: id})
		if err != nil {
			if shouldRetry(err) {
				continue
			}
			return nil, err
		}
		return model.MetadataFromProto(resp.Metadata), nil
	}
	// All attempts failed with retriable errors; return the last one.
	return nil, err
}
Add a function that should help us to check whether a communication error is retriable:
func shouldRetry(err error) bool {
	e, ok := status.FromError(err)
	if !ok {
		return false
	}
	return e.Code() == codes.DeadlineExceeded || e.Code() == codes.ResourceExhausted || e.Code() == codes.Unavailable
}
Note that we also need to import two extra packages for checking for specific gRPC error codes—google.golang.org/grpc/codes for accessing a list of error codes and google.golang.org/grpc/status for checking whether the communication error is a valid gRPC error.
Now, our metadata gRPC gateway makes up to five attempts to call the metadata service: the initial request and up to four retries. The retry logic that we just added should help us minimize the impact of occasional errors, such as temporary server unavailability (for example, during an unexpected outage or temporary network issues). However, it introduces some additional challenges:
The latter scenario may be especially challenging to the server due to uneven load distribution. Imagine that you are doing some work and getting some phone calls with extra tasks. If you responded to such calls and said that you were busy, you wouldn’t want to get called again immediately and asked to perform the same tasks again—instead, you would want the caller to call back after some time. Similarly, immediate retries would be suboptimal to servers experiencing congestion issues, so we would need to perform additional modifications to our retry logic to introduce extra delays between the retries so that our server does not get overloaded with immediate retries.
The technique of adding extra delays between client request retries is called backoff. Different types of backoff are implemented by using different delay intervals between the retry requests:
An example of exponential backoff would be a sequence of calls where the first retry would be done after a 100 ms delay, the second one would take a 200 ms wait, and the third retry delay would be 400 ms. Exponential backoff is usually a better solution than constant backoff because each retry waits longer than the previous one, giving the server time to recover in case of overloading. A popular Go library at https://github.com/cenkalti/backoff provides an implementation of exponential and other types of backoff algorithms.
Backoff delay can also be modified by introducing small random changes to its duration. For example, the retry delay value on each step could be increased or decreased by up to 10% to better spread the load on the server. This optimization is called jittering. To illustrate the usefulness of jittering, assume multiple clients start calling the server simultaneously. If retries are performed with the same delays for each client, they will keep calling the server simultaneously, generating bursts of server requests. Adding pseudo-random offsets to retry delay intervals helps to distribute the load on a server more evenly, preventing possible traffic bursts from request retries.
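The two ideas can be sketched with the standard library alone. In the example below, backoffDelay doubles the delay on every retry up to a cap, and withJitter shifts a delay by up to a given fraction of its value. Both function names, the 100 ms base delay, the 2 s cap, and the 10% jitter are assumptions picked for illustration; this is not the implementation used by the cenkalti/backoff library.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// backoffDelay returns the delay before retry number attempt (0-based).
// The delay doubles with each attempt and never exceeds maxDelay.
func backoffDelay(attempt int, base, maxDelay time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt && d < maxDelay; i++ {
		d *= 2
	}
	if d > maxDelay {
		return maxDelay
	}
	return d
}

// withJitter shifts d by up to plus or minus fraction of its value.
// r must be in [0, 1); passing it in keeps the function deterministic.
func withJitter(d time.Duration, fraction, r float64) time.Duration {
	offset := (2*r - 1) * fraction * float64(d)
	return d + time.Duration(offset)
}

func main() {
	const (
		base     = 100 * time.Millisecond
		maxDelay = 2 * time.Second
	)
	for attempt := 0; attempt < 6; attempt++ {
		d := backoffDelay(attempt, base, maxDelay)
		jittered := withJitter(d, 0.1, rand.Float64())
		fmt.Printf("retry %d: delay %v, with jitter %v\n", attempt+1, d, jittered)
	}
}
```

In a retry loop, the client would sleep for the jittered delay before each new attempt, ideally while also watching the request context so that a canceled request stops waiting.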
Let’s now talk about another class of communication issues related to time. When a client performs a request to a server, multiple possible failures may result in either a client or a server not receiving enough data to consider the request successful. Possible failure scenarios include the following:
These failures can result in longer waiting times for a client. Imagine you are sending a letter to your relative and not getting a response back. Without additional information, you would continue waiting without knowing whether the letter got lost at any step or the relative simply hasn’t responded.
For synchronous requests, there is a way to improve the client experience by setting a request timeout: an interval after which the request is considered failed if no successful response has been received. Setting request timeouts is a good practice for multiple reasons:
In Go, timeouts are usually propagated via the context.Context object. As we mentioned in Chapter 1, each I/O operation, such as a network call, accepts the context object as an argument, and we can set a timeout by calling the context.WithTimeout function, as shown in the following code snippet:
func TimeoutExample(ctx context.Context, args Args) {
	const timeout = 10 * time.Second
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	resp, err := SomeOperation(ctx, args)
}
In the preceding example, we set the timeout for the SomeOperation function to 10 seconds, so it should not take more than 10 seconds to complete the operation.
Setting a timeout is not the only way to limit request processing time. An alternative is setting a deadline: the latest point in time by which the request must be processed in order not to be considered failed. Unlike a timeout, which is set using the time.Duration type (for example, having the value of 10 seconds), a deadline indicates an exact instant in time (for example, January 1, 2074, 00:00:00). Here’s an example of using a deadline for the same operation as in the previous code example:
// January 1, 2074, 00:00:00 UTC.
deadline := time.Date(2074, time.January, 1, 0, 0, 0, 0, time.UTC)
ctx, cancel := context.WithDeadline(ctx, deadline)
defer cancel()
resp, err := SomeOperation(ctx, args)
Technically, both a timeout and a deadline help us achieve the same goal—set a time limit for a target operation. You are free to use either format, depending on your preferences.
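For a timeout or a deadline to have any effect, the operation itself has to watch the context. The following self-contained sketch shows one way this could look: slowOperation is a hypothetical stand-in for an I/O call, and callWithTimeout and callWithDeadline are helper names invented for this example.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// slowOperation simulates an I/O call that needs the given time to finish.
// It returns early with the context error if the context is done first.
func slowOperation(ctx context.Context, work time.Duration) error {
	select {
	case <-time.After(work):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// callWithTimeout runs slowOperation with a relative time limit.
func callWithTimeout(timeout, work time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return slowOperation(ctx, work)
}

// callWithDeadline runs slowOperation with an absolute time limit.
func callWithDeadline(deadline time.Time, work time.Duration) error {
	ctx, cancel := context.WithDeadline(context.Background(), deadline)
	defer cancel()
	return slowOperation(ctx, work)
}

func main() {
	// The operation needs 1 s but is only given 50 ms.
	fmt.Println(callWithTimeout(50*time.Millisecond, time.Second))
	// The operation needs 10 ms and is given 1 s.
	fmt.Println(callWithTimeout(time.Second, 10*time.Millisecond))
	// The same limit as in the first call, expressed as a deadline.
	fmt.Println(callWithDeadline(time.Now().Add(50*time.Millisecond), time.Second))
}
```

The first and third calls fail with context.DeadlineExceeded, while the second one completes in time and returns nil.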
Let’s now talk about another client-server communication failure scenario—when a client tries to operate and doesn’t get a successful response even after a set of retries. In such a case, there are three possible options for the client:
The last option is called a fallback—an alternative logic that can get executed if some operation can’t be performed as expected.
Let’s take our rating service as an example. In our service, we implemented the GetAggregatedRating endpoint by reading all ratings for a provided record from the rating repository. Now, let’s consider a failure scenario when we can’t retrieve the ratings due to some problem, such as MySQL database unavailability. Without a fallback logic, we would not be able to process an incoming request and would need to return an error to our caller.
An example of a fallback would be to use a cache—we could store the previously retrieved ratings in the memory of a service (for example, inside a map structure) and return them on database-read errors. The following code snippet provides an example of such a fallback logic:
ratings, err := c.repo.Get(ctx, recordID, recordType)
if err != nil && err == repository.ErrNotFound {
	return 0, ErrNotFound
} else if err != nil {
	log.Printf("Failed to get ratings for %v %v: %v", recordID, recordType, err)
	log.Printf("Fallback: returning locally cached ratings for %v %v", recordID, recordType)
	return c.getCachedRatings(recordID, recordType)
}
Using fallbacks is an example of graceful degradation, a practice of handling application failures in a way that an application still performs its operations in a limited mode. In our example, the rating service would continue serving aggregated ratings from previously cached data even if the database is unavailable, providing a limited but working functionality to its users.
When designing new services or features, ask yourself which operations could be replaced with fallbacks in case of failures. Additionally, check which features and operations are absolutely necessary and which ones can be turned off in case of any failure, such as system overload or losing a part of a system due to an outage. Also, a good practice is to emit additional useful information related to failures, such as logs and metrics, and make it explicit in the code that the fallback is intentional, as in the preceding example.
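The cache-based fallback described above can be sketched end to end with a small in-memory example. All names below (ratingRepo, cachingReader, flakyRepo) are hypothetical and simplified compared to the rating service: records are identified by a single string, and any repository error triggers the fallback, whereas the service code treats a not-found error separately.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ratingRepo is a hypothetical stand-in for the rating repository.
type ratingRepo interface {
	Get(recordID string) ([]int, error)
}

// cachingReader remembers the last successful read per record and serves it
// when the repository fails.
type cachingReader struct {
	repo  ratingRepo
	mu    sync.Mutex
	cache map[string][]int
}

func newCachingReader(repo ratingRepo) *cachingReader {
	return &cachingReader{repo: repo, cache: map[string][]int{}}
}

// Get returns fresh ratings when possible and cached ones as a fallback.
// The second return value reports whether the result came from the cache.
func (c *cachingReader) Get(recordID string) ([]int, bool, error) {
	ratings, err := c.repo.Get(recordID)
	c.mu.Lock()
	defer c.mu.Unlock()
	if err == nil {
		c.cache[recordID] = ratings
		return ratings, false, nil
	}
	if cached, ok := c.cache[recordID]; ok {
		return cached, true, nil
	}
	return nil, false, err
}

// flakyRepo is a test double whose failures can be switched on and off.
type flakyRepo struct {
	fail bool
	data map[string][]int
}

func (r *flakyRepo) Get(recordID string) ([]int, error) {
	if r.fail {
		return nil, errors.New("repository unavailable")
	}
	return r.data[recordID], nil
}

func main() {
	repo := &flakyRepo{data: map[string][]int{"movie-1": {4, 5}}}
	reader := newCachingReader(repo)
	fmt.Println(reader.Get("movie-1")) // fresh read, also fills the cache
	repo.fail = true
	fmt.Println(reader.Get("movie-1")) // served from the cache
	fmt.Println(reader.Get("movie-2")) // nothing cached, the error is returned
}
```

Returning a flag that marks cached results is one way to keep the fallback explicit: callers can log it, emit a metric, or show a notice that the data may be stale.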
As we discussed at the beginning of this chapter, there may be a situation when a microservice is overloaded and can’t handle incoming requests anymore. How can we prevent or mitigate such issues?
A popular way of preventing such issues is setting a hard limit on the number of requests to be processed in parallel. Such a technique is called rate limiting and can be applied on multiple levels:
When a client or a server exceeds the configured number of requests, the result of a request would be an error that should include a special code or message, indicating that a request has been rate limited.
An example of a rate-limiting indication in the HTTP protocol is a built-in status code, 429 Too Many Requests. When a client receives a response with such a code, it should take this into account by either reducing the call rate or waiting some time until the server can process requests again.
Client- and server-level rate limiting are often done by each service instance separately: each instance keeps track of the current number of outgoing or incoming requests. The downside of these models is the inability to configure the limits on a global-service level. If you configure each service client instance to send no more than 100 requests per second, you may still receive 100,000 simultaneous requests if there are 1,000 client instances. Such a high number of simultaneous requests could easily overload your service.
Network-level rate limiting can potentially solve this problem: if rate limiting is performed in a centralized way (for example, by a load balancer that handles requests between the services), the component performing rate limiting can keep track of the total number of requests across all service instances.
While network-level rate limiters provide more flexibility to configure the settings, they often require additional centralized components (such as load balancers). Because of this, we are going to demonstrate how to use a simpler approach, based on the client level.
There is a popular package implementing rate limiting in Go, called golang.org/x/time/rate. The package implements the token bucket algorithm—a limiting algorithm that initializes a bucket of some configured maximal size b, decrements its value by 1 on each request, and refills it at a configured rate of r elements per second. For example, for b = 100 and r = 50, the token bucket algorithm creates a bucket of size 100 and refills it at a rate of 50 per second. At any moment in time, it doesn’t allow more than 100 simultaneous requests (the maximal number is controlled by the current bucket size).
Here is an example of using a token bucket-based rate limiter in Go:
package main

import (
	"fmt"

	"golang.org/x/time/rate"
)

func main() {
	limit := 3
	burst := 3
	limiter := rate.NewLimiter(rate.Limit(limit), burst)
	for i := 0; i < 100; i++ {
		if limiter.Allow() {
			fmt.Println("allowed")
		} else {
			fmt.Println("not allowed")
		}
	}
}
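This code prints allowed 3 times and then keeps printing not allowed for the remaining 97 iterations, unless the loop runs long enough for the bucket to be refilled (at a rate of 3 tokens per second, a new token becomes available roughly every 333 ms).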
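To see what happens inside a token bucket, it can help to look at a stripped-down version of the algorithm. The sketch below is a simplified illustration and not the implementation of golang.org/x/time/rate: the tokenBucket type is a hypothetical name, it is not safe for concurrent use, and time is passed in as seconds so that the refill arithmetic is easy to follow.

```go
package main

import "fmt"

// tokenBucket is a simplified, single-goroutine token bucket.
type tokenBucket struct {
	capacity float64 // maximal bucket size b
	rate     float64 // refill rate r, in tokens per second
	tokens   float64 // tokens currently available
	last     float64 // time of the previous call, in seconds
}

// newTokenBucket creates a bucket that starts full.
func newTokenBucket(capacity, rate float64) *tokenBucket {
	return &tokenBucket{capacity: capacity, rate: rate, tokens: capacity}
}

// allow refills the bucket for the time elapsed since the previous call and
// takes one token if at least one is available.
func (b *tokenBucket) allow(now float64) bool {
	b.tokens += (now - b.last) * b.rate
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
	b.last = now
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

func main() {
	b := newTokenBucket(3, 3)
	// Four requests at t=0, two at t=0.5 s, and one at t=2 s.
	for _, now := range []float64{0, 0, 0, 0, 0.5, 0.5, 2} {
		fmt.Printf("t=%.1fs allowed=%v\n", now, b.allow(now))
	}
}
```

With b = 3 and r = 3, the first three requests at t=0 drain the bucket and the fourth is rejected. Half a second later the bucket holds 1.5 tokens, which is enough for one more request but not for two.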
Let’s illustrate how to use such a rate limiter in combination with a gRPC API handler, which we implemented in Chapter 5. The gRPC protocol allows us to define interceptors—operations that are performed on each request and can modify the gRPC server’s response to it. To add a gRPC rate limiter to the movie service gRPC handler, perform the following steps:
"github.com/grpc-ecosystem/go-grpc-middleware/ratelimit"
const limit = 100
const burst = 100
l := newLimiter(limit, burst)
srv := grpc.NewServer(grpc.UnaryInterceptor(ratelimit.UnaryServerInterceptor(l)))
type limiter struct {
	l *rate.Limiter
}

func newLimiter(limit int, burst int) *limiter {
	return &limiter{rate.NewLimiter(rate.Limit(limit), burst)}
}

func (l *limiter) Limit() bool {
	return l.l.Allow()
}
Our rate limiter is using a rate-limiting gRPC server interceptor from the github.com/grpc-ecosystem/go-grpc-middleware/ratelimit package. Its interface is slightly different from our limiter from golang.org/x/time/rate, so we added a structure that links them together. Now, our gRPC server allows up to 100 requests per second and returns an error with a codes.ResourceExhausted special code in case the limit is exceeded. This allows us to make sure the service does not get overloaded with a sudden spike of a large number of requests—if somebody requests 1 million movie details at once from it, we are not going to make 1 million calls to our metadata service and overload its database.
Keep in mind that rate limiting is a powerful technique that needs to be used with caution: setting the limit too low would make your system unnecessarily restrictive by rejecting requests that it could have handled. To calculate fair rate-limiting settings for your services, you need to benchmark them periodically to understand the maximum throughput of their logic.
Let’s move to the next topic of automation-based reliability techniques, describing how to gracefully terminate the execution of your services.
In this section, we are going to talk about the graceful handling of service shutdown events. Service shutdowns can be triggered by multiple events:
Generally, sudden termination of the execution of a service may result in the following negative consequences:
To prevent these issues, you need to ensure that your service shuts down gracefully by performing a set of operations that minimize any negative consequences for the service and its components. Performing a graceful shutdown, the service would run some extra logic before the termination, such as the following:
Graceful shutdown logic for Go services is usually implemented in the following way:
Here is a code example that you can add to the main function of any Go service, such as the ones that we implemented in Chapter 2:
sigChan := make(chan os.Signal, 1)
signal.Notify(sigChan, os.Interrupt, syscall.SIGTERM)
var wg sync.WaitGroup
wg.Add(1)
go func() {
	defer wg.Done()
	s := <-sigChan
	log.Printf("Received signal %v, attempting graceful shutdown", s)
	// Graceful shutdown logic.
}()
wg.Wait()
There is also a way to gracefully handle panics in Go code by using the built-in recover function. The following code snippet demonstrates how to handle a panic inside the main function and execute any custom logic, such as closing any open connections:
func main() {
	defer func() {
		if err := recover(); err != nil {
			log.Printf("Panic occurred, attempting graceful shutdown")
			// Graceful shutdown logic.
		}
	}()
	panic("panic example")
}
In our code, we check whether there is a service panic by calling the recover function and checking whether it returns a non-nil value (recover returns the value that was passed to panic, which is not necessarily an error). In case of a panic, we can perform any additional operations, such as saving any unsaved data or terminating any open connections.
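The same pattern can be wrapped into a small helper that turns a panic into a regular error, which is often easier to handle and to test. The safeRun name is hypothetical and is used only for this sketch.

```go
package main

import (
	"fmt"
	"log"
)

// safeRun calls fn and converts a panic inside it into an error.
func safeRun(fn func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			// r holds whatever value was passed to panic.
			err = fmt.Errorf("recovered from panic: %v", r)
		}
	}()
	fn()
	return nil
}

func main() {
	err := safeRun(func() { panic("panic example") })
	if err != nil {
		log.Printf("%v, attempting graceful shutdown", err)
		// Graceful shutdown logic.
	}
}
```

Keep in mind that recover only catches panics raised in the goroutine where the deferred function runs, so each goroutine that needs this protection has to set it up on its own.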
To gracefully terminate the execution of a Go gRPC server, you need to call the GracefulStop function instead of Stop. Unlike the Stop function, GracefulStop would wait until all requests are processed, helping to reduce the negative impact of the shutdown on the clients.
If you have some long-running components, such as Kafka consumers or any background goroutines executing long-running tasks, you can communicate the service termination signal using the context.Context type. It provides a feature called context cancellation: an ability to notify different components about the cancellation of an execution by closing the channel returned by the context’s Done method.
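As a minimal sketch of this mechanism, the following example runs a background worker that does a unit of work on every tick and stops as soon as its context is canceled. The runWorker and stopsOnCancel functions are hypothetical names, and the tick and sleep durations are arbitrary values chosen to keep the example fast.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runWorker simulates a long-running component. It does a unit of work on
// every tick and returns the number of completed units once ctx is canceled.
func runWorker(ctx context.Context, tick time.Duration) int {
	ticker := time.NewTicker(tick)
	defer ticker.Stop()
	processed := 0
	for {
		select {
		case <-ctx.Done():
			return processed
		case <-ticker.C:
			processed++
		}
	}
}

// stopsOnCancel starts a worker, cancels its context, and reports whether
// the worker exited within one second.
func stopsOnCancel() bool {
	ctx, cancel := context.WithCancel(context.Background())
	done := make(chan int, 1)
	go func() { done <- runWorker(ctx, 10*time.Millisecond) }()

	time.Sleep(50 * time.Millisecond) // let the worker run for a while
	cancel()                          // signal termination

	select {
	case <-done:
		return true
	case <-time.After(time.Second):
		return false
	}
}

func main() {
	fmt.Println("worker stopped after cancel:", stopsOnCancel())
}
```

In a service, the cancel function would be called from the signal-handling goroutine, and every long-running component would receive the same context or one derived from it.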
Let’s update our rating service code to illustrate how to implement context cancellation and a graceful shutdown of a gRPC server:
ctx, cancel := context.WithCancel(context.Background())
Our code creates an instance of a context and the cancel function, which we will be calling on service shutdown to notify our components, such as the service registry, about upcoming termination.
sigChan := make(chan os.Signal, 1)
signal.Notify(sigChan, os.Interrupt, syscall.SIGTERM)
var wg sync.WaitGroup
wg.Add(1)
go func() {
	defer wg.Done()
	s := <-sigChan
	cancel()
	log.Printf("Received signal %v, attempting graceful shutdown", s)
	srv.GracefulStop()
	log.Println("Gracefully stopped the gRPC server")
}()
In our code, we let the rating service listen for process interruption and termination signals and start the background goroutine, which keeps listening for the relevant notifications. Once it receives either signal, it calls the cancel function that we obtained in the previous step. The result of calling this function would be a notification that would be sent to the components initialized with our context, such as the service registry.
wg.Wait()
Let’s now test the code that we just implemented. Run the rating service and then terminate it by pressing Ctrl + C in the terminal. You should see the following messages:
2022/10/13 08:55:05 Received signal interrupt, attempting graceful shutdown
2022/10/13 08:55:05 Gracefully stopped the gRPC server
Communication of termination and interruption events is a common practice in Go microservice development and is an elegant way of implementing graceful shutdown logic. When designing and implementing your services, think in advance about possible resources that need to be closed or de-initialized upon the service termination, such as any network clients and connections. A graceful shutdown logic can prevent the negative effects of sudden service termination. It can also reduce the number of possible errors in your services and improve your operating experience.
At this point, we have reviewed some automation techniques to improve the reliability of our services and reduce the symptoms of various failure scenarios. Now, we can proceed to the next section of the chapter, covering another aspect of reliability work related to development processes and culture. Improvements to your development processes are essential to achieving high reliability in the long term, and the section should be useful to you by providing some valuable tips and ideas that you can utilize in microservice development.
In this section, we are going to describe some techniques for achieving higher service reliability based on changes in the development processes and culture. You will learn how to establish the processes for improving and reviewing your service reliability, how to learn from any service-related issues and incidents efficiently, and how to measure your service reliability. We will cover the processes and practices that are widely used across the industry, outlining the most important ideas from each one. The section is going to be more theoretical than the previous one; however, it should be equally useful.
First, we are going to provide an overview of the on-call process essential for setting up a mechanism for monitoring issues with your services.
When your services start handling production traffic or start serving user requests, one of your first reliability goals should be to detect any issues, or incidents, as early as possible. Efficient detection should be automatic—a program will always be much more efficient than a human in detecting most issues. Each automatic detection should notify one or more engineers about the incident so that the engineers can perform work in order to mitigate an incident.
The process for establishing such a mechanism for notifying engineers about service incidents is called on-call. This process helps ensure that at any moment in time, service incidents are acknowledged and addressed by the engineers responsible for the service.
The main ideas behind the on-call process are the following:
Having an on-call process is common to most technology companies and teams, and the on-call process is pretty similar in most companies across the industry. Some popular solutions provide mechanisms for triggering various types of notifications, such as SMS, emails, and even phone calls. You can also configure on-call rotations and assign them to different services. One of the most popular solutions to on-call management is PagerDuty—a platform providing a set of tools for automating on-call operations, as well as integrations with hundreds of services, including Slack, Zoom, and many more. PagerDuty provides all the features we listed earlier, allowing engineers to configure on-call rotations for their services and notifying them about incidents in different ways. Additionally, it provides an API that can be used for both accessing the incident data and triggering new incidents from the code.
We are not going to dive into the details of PagerDuty's features and integrations in this chapter—I suggest you check the official PagerDuty documentation on their website, https://developer.pagerduty.com/docs. I also suggest you read Chapter 12 before establishing an on-call process for your services. It will help you to learn more about possible incident detection mechanisms and tools you can utilize in your projects.
Let’s discuss the common challenges of establishing an on-call process in a microservice environment:
Some companies may have thousands of microservices, so centralized incident response teams become crucial. For example, Uber has a dedicated team of engineers called Ring0 that is able to address any widespread incidents and coordinate the mitigation of issues that span multiple teams. Having such a team helps to dramatically reduce incident mitigation time.
To better understand what happens next after incidents are detected and acknowledged by the engineers, we are going to move now to the next topic: incident management.
Once incidents are detected and acknowledged by the engineers, there are two other types of work necessary for improving the service or system reliability—mitigation and prevention. Mitigation is required for resolving an open issue unless it gets resolved by itself or due to some external changes (for example, an external API getting fixed by the owning team). Prevention work is useful for ensuring the issue does not happen again. Without a proper prevention response to the incident, you may keep fixing the same issue over and over again, spending your time and affecting the experience of your system’s users.
To make the incident mitigation process quick and efficient, especially in a large team where engineers may have different levels of understanding of the system, there should be enough documentation describing which actions to perform in case of an incident. Such documentation is called a runbook and should be prepared for as many types of detectable incidents as possible. Whenever an on-call engineer gets an incident notification, it should be clear from the runbook which steps to take to mitigate it.
A good runbook should be short and concise and provide clear actionable steps that are easy to understand for any engineer. Let’s take this example:
rating_service_fd_limit_reached:
mitigation: Restart the service
If the incident mitigation requires further investigation, include any useful links, such as links to the relevant application logs and dashboards. You should aim for the lowest possible incident mitigation time—also called time to repair (TTR)—to increase the availability of your service and improve its overall health.
Once the incident is mitigated, focus on prevention work to ensure you take all actions to eliminate its causes, as well as to improve detection and mitigation mechanisms, if needed. Multiple companies across the industry use the process of writing documents called incident postmortems to organize the learnings around incidents and make sure each incident involves enough work related to its future prevention. An incident postmortem generally consists of the following data:
A great example of a postmortem document is provided in the famous Google Site Reliability Engineering (SRE) book, and you can get familiar with it at the following link: https://sre.google/sre-book/example-postmortem/.
To get to the root cause of the incident, you can use a technique called Five whys. The idea of the technique is to keep asking what caused the previously mentioned problem until the root cause is found. Let’s take the following root cause analysis (RCA) as an example to understand the technique:
Incident: Rating service returns internal errors to its API callers
Root cause analysis:
In this example, we kept finding the underlying cause of each previous issue by using the Five whys technique, until we got to the root cause of the incident in just three steps. The technique is very powerful and easy to use, and it can help you get to the root cause of even complex issues quite quickly.
Make sure you include and track action items for your incidents. Capturing the incident details and identifying the causes isn’t enough for making sure incidents are prevented. Prioritizing the action items helps ensure that the most critical ones get addressed as early as possible.
Now, let’s move to the next reliability process, which is based on periodically testing possible service failure scenarios.
As many system administrators know, it is not enough to have backups of your data to guarantee its durability. You also need to ensure that you can restore the data from the backups in case of any failure. The same principle applies to any part of your service infrastructure—to know that your services are resilient to particular failures, you need to perform periodic exercises, called drills.
You can perform many types of drills. As in the example with the database backups, if you have any persistent data stored in a database, you can periodically test the ability to back up and restore the data, verifying that your services are tolerant of database availability issues. Another example would be network drills. You can simulate network issues, such as connectivity loss, by updating the service routing configuration or other network settings to check how your services behave when the network is unavailable.
There are multiple benefits of performing reliability drills:
Drills are often performed as planned incidents—incidents that get announced in advance and follow the regular incident management process, including the work on the postmortem document. The drill postmortem document should include the same items as a regular incident, with a focus on improving the mitigation and prevention experience. Additionally, engineers should focus on reviewing and updating the service runbooks, making sure that the incident mitigation instructions are accurate and up to date.
At this point, we have discussed the most important service reliability techniques. There are many more interesting topics to cover that are related to service reliability—some of them, related to incident detection, we are going to cover in Chapter 12 of the book. If you are interested in the topic, I strongly encourage you to read the Google Site Reliability Engineering (SRE) book, which provides a comprehensive guide to various reliability-related techniques. You can find the online version of the book by going to the following link: https://sre.google/sre-book/table-of-contents. The practices that are described in the book are applicable to any microservice, so you can always use it as a reference while working on building any type of system.
In this chapter, we covered the topic of reliability, describing a set of techniques and practices that can help you to make your microservices more resilient to various types of failures. You have learned some useful techniques for automating error responses of your services and reducing the negative impact of various types of issues, such as service overloading and unexpected service shutdowns.
In the final part of the chapter, we discussed various reliability techniques based on changes in engineering processes and culture, such as introducing the on-call and incident management processes, as well as performing periodic reliability drills. The knowledge that you gained from reading this chapter should help you to establish a solid foundation for writing reliable microservices.
In the next chapter, we are going to continue our journey into the reliability topic and focus on collecting service telemetry data, such as logs, metrics, and traces. Service telemetry data is the primary instrument for setting up service incident detection, and we will illustrate how to work with each type of telemetry data in your microservice code.
If you’d like to learn more, refer to the following resources: