Reliability is the most critical feature of a service or a system and should be aligned with business objectives. This alignment should be tracked constantly, meaning that the alignment needs measurement. Site reliability engineering (SRE) prescribes specific technical tools or practices that will help in measuring characteristics that define and track reliability. These tools are service-level agreements (SLAs), service-level objectives (SLOs), service-level indicators (SLIs), and error budgets.
SLAs represent an external agreement with customers about the reliability of a service. SLAs should have consequences if violated (that is, if the service doesn't meet the reliability expectations), and the consequences are often monetary in nature. To ensure SLAs are never violated, it is important to set thresholds. Setting these thresholds ensures that an incident is caught and potentially addressed before repeated occurrences of the same or similar events breach the SLA. These thresholds are referred to as SLOs.
SLOs are specific numerical targets to define reliability of a system, and SLOs are measured using SLIs. SLIs are a quantitative measure of the level of service provided over a period. Error budgets are calculated based on SLOs (that are based on SLIs) and essentially are the inverse of availability, representing a quantifiable target as to how much a service can be unreliable. All these tools or technical practices need to work in tandem, and each one is dependent on the other. SRE uses these technical practices to maintain the balance between innovation and system reliability and thus achieve the eventual goal—build reliable software faster.
Chapter 1, DevOps, SRE, and Google Cloud Services for CI/CD, introduced SRE technical practices—SLAs, SLOs, SLIs, and error budgets. This chapter will deep dive into these technical practices. In this chapter, we're going to cover the following main topics:
An SLA is a promise made to a user of a service to indicate that the availability and reliability of the service should meet a certain level of expectation. An SLA details a certain level of performance or expectation from the service.
There are certain components that go into defining which agreements can be considered an SLA. These components are referred to with specific jargon and are elaborated in the following sections.
The party that represents the service provider and service consumer can differ based on the context and nature of the service. For a consumer-facing service such as video streaming or web browsing, a service consumer refers to the end user consuming the service and a service provider refers to the organization providing the service. On the other hand, for an enterprise-grade service such as a human resource (HR) planning system, a service consumer refers to the organization consuming the service and a service provider refers to the organization providing the service.
An organization or end user consuming a service will have certain expectations in terms of service behavior, such as availability (or uptime), responsiveness, durability, and throughput.
An agreement or contract can be either implicit or explicit in nature. An example of an implicit contract is a non-commercial service such as Google Search. Google has a goal to provide a fluid search experience to all its users but hasn't signed an explicit agreement with the end user. If Google misses its goal, then users will not have a good experience. A repeat of such incidents will impact Google's reputation, as users might prefer to use an alternate search engine.
An example of an explicit contract is a commercial service such as Netflix or a paid enterprise-grade service such as Workday. In such scenarios, legal agreements are written that include consequences in case the service expectations are not met. Common consequences include financial implications or service credits.
This concludes an introduction to key jargon with respect to SLAs. The next subsection elaborates on the blueprint for a well-defined SLA.
Having a well-defined SLA is critical to its success. Here are some factors that could be used as a blueprint for a well-defined SLA:
SLAs should focus on the minimum level of objectives a service should meet to keep customers happy. However, SLAs are strictly external targets and should not be used as internal targets by the implementation teams.
To ensure that SLAs are not violated, implementation teams should have target objectives that reflect users' expectations from the service. The target objectives from implementation teams are used as internal targets, and these are generally stricter than the external targets that were potentially set by product teams. The internal targets are referred to as SLOs and are used as a prioritization signal to balance release velocity and system reliability. These internal targets need to be specifically measured and quantified at a given point in time. The measurement should be done using specific indicators that reflect users' expectations, and such indicators are referred to as SLIs.
To summarize, for a service to perform reliably, the following criteria need to be met:
Let's look at a hypothetical example. Consider a requirement that a user's request is served within a certain maximum time period. A latency metric can be used to represent the user's expectation. A sample SLA in this scenario can state that every customer will get a response within 1,000 milliseconds (ms). In this case, the SLO for this SLA must be stricter and can be set at 800 ms.
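As a minimal illustration of this example (the function and values below are hypothetical, matching the 800 ms and 1,000 ms thresholds above), a measured response time can be checked against the internal SLO and the external SLA:

# Hypothetical thresholds from the example above: SLA = 1,000 ms, SLO = 800 ms.
SLA_THRESHOLD_MS = 1000  # external promise to the customer
SLO_THRESHOLD_MS = 800   # stricter internal target

def classify_response_time(response_time_ms: float) -> str:
    """Classify a single response against the internal SLO and external SLA."""
    if response_time_ms <= SLO_THRESHOLD_MS:
        return "within SLO"           # users are happy, SLA is safe
    if response_time_ms <= SLA_THRESHOLD_MS:
        return "SLO missed, SLA met"  # internal warning: investigate before the SLA is at risk
    return "SLA violated"             # external consequences may apply

print(classify_response_time(750))   # within SLO
print(classify_response_time(900))   # SLO missed, SLA met
print(classify_response_time(1200))  # SLA violated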
This completes the section on SLAs. We looked at the key constructs of an SLA, factors that could impact a well-defined SLA, and its impact on setting internal target objectives or SLOs, using specific indicators or SLIs that impact customer satisfaction. The next section transitions from an SLA to an SLO and its respective details.
Service consumers (users) need a service to be reliable, and the reliability of the service can be captured by multiple characteristics such as availability, latency, freshness, throughput, coverage, and so on. From a user's perspective, a service is reliable if it meets their expectations. A critical goal of SRE is to measure everything in a quantitative manner. So, to measure, there is a need to represent user expectations quantitatively.
SRE recommends a specific technical practice called an SLO to specify a numerical target level that represents these expectations. Each service consumer can have a different expectation. These expectations should be measurable, and for that they should be quantifiable over a period. SLOs help to define a consistent level of user expectations, where the measured user expectation should be either within the target level or within a range of values. In addition, SLOs are referred to as internal agreements and are often stricter than SLAs promised to the end users. This ensures that any potential issues are resolved before their repeated occurrence results in violating the SLA.
SLOs are key to driving business decisions by providing a quantifiable way to balance release cadence of service features versus service reliability. This emphasis will be covered in the upcoming subsection.
The need for revenue growth puts businesses under constant pressure to add new features and attract new users to their service. So, product managers usually dictate the requirement of these new features to development teams. Development teams build these requirements and hand them over to the operations team to stabilize. Development teams continue their focus on adding new features to a service rather than stabilizing existing ones. Operations teams tend to get overloaded since they are constantly firefighting to maintain the reliability of the existing service, in addition to rolling out new features. So, the most important question is: If reliability is a feature of a system, then how can you balance reliability along with the release of other features?
SLOs are the answer to how to maintain a balance between reliability and release velocity. SLOs allow us to define target levels for a reliable service. These target levels should be decided by all the stakeholders across an organization, including engineering teams (development and operations) and the product team. The agreed-upon target levels should reflect users' experiences while using the service. This allows monitoring systems to identify existing problems before users register complaints. SLOs should be treated more as a prioritization signal rather than an operational concern.
SLOs should be used as a primary driver for decision making. SLOs represent a common language for all reliability conversations that is based on actual metrics. This will allow a business to decide when to release new features versus when to continue their focus on the reliability of an existing service. It will also allow operations teams to have a streamlined set of goals, preventing ad-hoc actions to run the service, and eventually avoiding operational overload.
Operational load refers to the ongoing maintenance tasks that keep systems and services running at optimal performance. If a team is constantly interrupted by this operational load and cannot make progress toward its key priorities, then the team is in a state of operational overload.
The main reason for a team to be in a state of operational overload is a lack of consensus on the level of reliability a service should support. This lack of consensus is apparent from development teams' focus on adding new features to a service rather than stabilizing existing ones.
SLOs must have strong backing from the executive team. In the case of missed SLO targets, there should be well-documented consequences that prioritize engineering efforts toward stabilizing the reliability of a service rather than working on or releasing new features. SLOs are key to removing organizational silos and creating a sense of shared responsibility and ownership. SLOs drive incentives that organically invoke a thought process whereby developers start to care about service reliability and operators start to care about pushing new features out as quickly as possible. The recommended guidelines to set SLOs will be detailed in the upcoming subsection.
The journey or process to identify the right SLOs for a service is very complex. There are multiple aspects or guidelines that need to be considered. Each of these aspects is critical to set or define an SLO for a service.
SLO targets are always driven by quantifiable and measurable user expectations called SLIs. The happiness test is a good starting point to set SLO targets for a service. As per the test, the service should have target SLOs that barely meet the availability and reliability expectations of the users, as the following applies:
A target SLO for an average response time is defined as a range between 600 and 800 ms. If the average response time is less than 800 ms, then the service meets the target SLO, and users are happy. If the average response time is greater than 800 ms (even though it is less than stipulated in the SLA), then the service misses the target SLO and the users are sad. The following diagram illustrates an example where an SLA with respect to average response time for a request is set to 1,000 ms:
Reliability is the most important feature of a service and reflects user happiness. However, setting 100% as the SLO or reliability target is not a realistic and reasonable goal for the following reasons:
As 100% is the wrong reliability target, it is important to find the optimal reliability target where the service is reliable enough for the user and there is an opportunity to update or add new features to the service.
Another perspective with which to look at reliability targets for a service is the amount of unreliability the service is willing to tolerate. Unreliability of the service is also referred to as the downtime.
Let's consider some reliability targets, as follows:
The following table summarizes the possibility of detecting an issue and the possibility to self-heal based on a reliability target over a 30-day period:
To summarize, a reliability target should be set to a realistic level at which an issue can be detected and addressed. An automated self-healing process is recommended over human involvement, such as redirecting traffic to a new availability zone (AZ) when an existing AZ fails.
Setting a reliability target too low means that issues could occur frequently, leading to long periods of downtime, and customers will be impacted regularly. Setting a reliability target too high, at 99.999% or even 100%, means that the system can practically never fail, which makes it difficult to add new features to the service or application.
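To make this trade-off concrete, here is a small illustrative Python sketch (not taken from the text) that converts a few common reliability targets into the downtime they allow over a 30-day period:

# Allowed downtime over a 30-day period for a few common reliability targets.
PERIOD_MINUTES = 30 * 24 * 60  # 43,200 minutes in 30 days

for target in (0.99, 0.999, 0.9999, 0.99999):
    allowed_downtime = (1 - target) * PERIOD_MINUTES
    print(f"{target:.5%} target -> {allowed_downtime:.1f} minutes of downtime per 30 days")

# 99.00000% target -> 432.0 minutes (~7.2 hours: humans have time to detect and react)
# 99.90000% target -> 43.2 minutes  (automation is usually required)
# 99.99000% target -> 4.3 minutes   (only automated self-healing is fast enough)
# 99.99900% target -> 0.4 minutes   (little to no room to detect, let alone react)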
Reliability is the most important feature of a service, and setting SLOs allows monitoring systems to capture how the service is performing. When setting SLOs for the first time, it's possible to set SLOs based on past performance, on the assumption that users are happy to start with. SLIs for these SLOs are based on existing monitoring systems and are considered an initial baseline that must be met. Such SLOs are known as achievable SLOs, and any misses below the initial baseline should result in directing engineering efforts toward getting reliability back to the initial baseline.
How to get started with setting achievable SLOs
Metrics to set achievable SLOs can either be taken from the load balancer or backfilled from the logs. Both approaches give an insight into historical performance.
If SLOs need to be set in the absence of historical data or if historical data does not accurately reflect users' expectations, it is recommended to set an achievable target and then refine the target to closely match users' expectations and business needs. Such SLOs are known as aspirational SLOs. Monitoring systems will then use these metrics to track these SLOs.
Once either achievable or aspirational SLOs are set, it's possible that new features are introduced to the service, and the probability that the service becomes unreliable increases as a result. This can result in customers being unhappy even when SLOs are met, which is an indication that the monitoring metrics need to be revisited. SLOs need to be set iteratively and re-evaluated periodically. These metrics might have worked when originally set, but might not anymore.
Here are a few possible scenarios that call for SLOs to be re-evaluated:
How frequently should SLOs be revisited or re-evaluated?
It's recommended that SLOs be revisited or re-evaluated every 6 to 12 months to ensure that defined SLOs continue to match business changes and users' expectations.
In addition to periodically revisiting SLOs, there are scenarios where a different SLO—more precisely, a tighter SLO—can be used when a spike in traffic is anticipated. For example, during holiday shopping, many businesses expect a significant spike in traffic, and in such scenarios, businesses can come up with a temporary strategy of tightening the SLO from 99.9% to 99.99%. This means system reliability is prioritized over a need or urge to release new features. The SLO targets are set back to their original value (in this example, back to 99.9%) when normal traffic resumes.
This completes the section on SLOs, with a deep insight into the need for reliability, setting reliability targets, and the way SLOs drive business decisions using SLIs. The next subsection introduces why SLOs need SLIs and is also a precursor before exploring SLIs in detail.
SLOs are specific numerical targets to define the reliability of a system. SLOs are also used as a prioritization signal to determine the balance between innovation and reliability. SLOs also help to differentiate happy and unhappy users. But the striking question is: How do we measure SLOs?
SLOs are measured using SLIs. These are defined as a quantifiable measure of service reliability and specifically give an indication of how well a service is performing at a given moment in time. Service consumers have certain expectations from a service, and SLIs are tied directly to those expectations. Examples of quantifiable SLIs are latency, throughput, freshness, and correctness. SLIs are expressed as the percentage of good events out of valid events. SLOs are SLI targets aggregated over a period.
We'll get into more details about SLIs in the next section, Exploring SLIs. This includes categorizing SLIs by types of user journeys and elaborating on the ways to measure SLIs.
An SLI is a quantitative measure of the level of service provided with respect to some aspect of service reliability. Aspects of a service are directly dependent on potential user journeys, and each user journey can have a different set of SLIs. Once the SLIs are identified per user journey, the next critical step is to determine how to measure the SLI.
This section describes the details around how to identify the right measuring indicators or SLIs by categorizing user journeys, the equation to measure SLIs, and ways to measure SLIs.
The reliability of a service is based upon the user's perspective. If a service offers multiple features, each feature will involve a set of user interactions or a sequence of tasks. This sequence of tasks that is critical to the user's experience offered by the service is defined as a user journey.
Here are some examples of user journeys when using a video streaming service:
Each user journey can have a different expectation. These expectations can vary, from the speed at which the service responds to a user's request to the speed at which data is processed, to the freshness of the data displayed or to the durability at which data can be stored.
There could be myriad user journeys across multiple services. For simplicity, user journeys can be classified into two popular categories, as follows:
Each category defines specific characteristics. Each specific characteristic can represent an SLI type that defines the reliability of the service. These are specified in the following sections.
Availability, latency, and quality are the specific aspects or characteristics of SLIs that need to be evaluated as part of a request/response user journey.
Availability is defined as the proportion of valid requests served successfully. It's critical for a service to be available in order to meet users' expectations.
To convert an availability SLI definition into an implementation, a key choice that needs to be made is: How to categorize requests served as successful?
To categorize requests served as successful, error codes can be used to reflect users' experiences of the service—for example, searching a video title that doesn't exist should not result in a 500 series error code. However, being unable to execute a search to check if a video title is present or not, should result in a 500 series error code.
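As a sketch of how this choice might translate into code (the request records and field names are hypothetical, not from the text), an availability SLI can be computed by excluding client-side 4xx errors from the valid requests and treating 5xx responses as failures:

def availability_sli(requests):
    """Compute availability as the proportion of valid requests served successfully.

    `requests` is an iterable of dicts with an HTTP `status` field (hypothetical schema).
    4xx responses are treated as invalid (client errors); 5xx responses count as failures.
    """
    valid = [r for r in requests if not 400 <= r["status"] < 500]
    if not valid:
        return None  # no valid traffic in this window
    good = [r for r in valid if r["status"] < 500]
    return 100.0 * len(good) / len(valid)

sample = [{"status": 200}, {"status": 200}, {"status": 404}, {"status": 500}]
print(availability_sli(sample))  # 2 good out of 3 valid requests -> 66.66...%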
Latency is defined as the proportion of valid requests served faster than a threshold. It's an important indication of reliability when serving user-interactive requests. The system needs to respond in a timely fashion to be considered interactive.
Latency for a given request is calculated as the time difference between when the timer starts and when the timer stops. To convert a latency SLI definition into an implementation, a key choice that needs to be made is: How to determine a threshold to classify responses as fast enough?
To determine a threshold that classifies responses as fast enough, it's important to first identify the different categories of user interactions and set thresholds accordingly per category. There are three ways to bucketize user interactions, outlined as follows:
Quality is defined as the proportion of valid requests served without degrading the service. It's an important indication of how well a service can fail gracefully when its dependencies are unavailable.
To convert a quality SLI into an implementation, a key choice that needs to be made is: How to categorize if responses are served with degraded quality? To categorize responses served with degraded quality, consider a distributed system with multiple backend servers. If an incoming request is served by all backend servers, then the request is processed without service degradation. However, if an incoming request is processed by all backend servers except one, then the response is served with degraded quality.
If a request is processed with service degradation, the response should be marked as degraded, or a counter should be used to increment the count of degraded responses. As a result, a quality SLI can be expressed as a ratio of bad events to total events instead of a ratio of good events to total events.
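The following hypothetical sketch illustrates the counter-based approach just described: each response served with degraded quality increments a counter, and the quality SLI is derived from the ratio of degraded responses to total responses:

class QualityTracker:
    """Track responses served with degraded quality (hypothetical helper)."""

    def __init__(self):
        self.total_responses = 0
        self.degraded_responses = 0

    def record(self, backends_available: int, backends_total: int):
        self.total_responses += 1
        if backends_available < backends_total:
            # At least one backend did not contribute, so the response is degraded.
            self.degraded_responses += 1

    def quality_sli(self) -> float:
        """Proportion of responses served without degradation, as a percentage."""
        if self.total_responses == 0:
            return 100.0
        return 100.0 * (1 - self.degraded_responses / self.total_responses)

tracker = QualityTracker()
tracker.record(backends_available=10, backends_total=10)  # full quality
tracker.record(backends_available=9, backends_total=10)   # degraded
print(tracker.quality_sli())  # 50.0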
How to categorize a request as valid
To categorize a request as valid, different methodologies can be used. One such method is to use HyperText Transfer Protocol (HTTP) response codes. For example, 400 errors are client-side errors and should be discarded while measuring the reliability of the service. 500 errors are server-side errors and should be considered as failures from a service-reliability perspective.
Freshness, correctness, coverage, and throughput are the specific aspects or characteristics of SLIs that need to be evaluated as part of a data processing/pipeline-based user journey. This is also applicable for batch-based jobs.
Freshness is defined as the proportion of valid data updated more recently than a threshold. Freshness is an important indicator of reliability while processing a batch of data, as it is possible that the output might become less relevant over a period of time. This is primarily because new input data is generated, and if the data is not processed regularly or rebuilt to continuously process in small increments, then the system output will not effectively reflect the new input.
To convert a freshness SLI into an implementation, a key choice that needs to be made is: When to start and stop the timer to measure the freshness of data? To categorize the data processed as valid for SLI calculation, the correct source of input data or the right data processing pipeline job must be considered. For example, to calculate the freshness of weather-streaming content, data from a sports-streaming pipeline cannot be considered. This level of decision making can be achieved by implementing code and a rule-processing system to map the appropriate input source.
To determine when to start and stop the timer to measure the freshness of data, it is important to include timestamps while generating and processing data. In the case of a batch processing system, the output is considered fresh until the next set of data is processed and generated. In other words, freshness is the time elapsed since the last time the batch processing system completed.
In the case of an incremental streaming system, freshness refers to the age of the most recent record that has been fully processed. Serving stale data is a common way in which response quality is degraded, and measuring stale data as degraded response quality is a useful strategy: if no user accesses the stale data, no expectations around the freshness of the data have been missed. For this to be feasible, one option is to include a timestamp when generating data. This allows the serving infrastructure to check the timestamp and accurately determine the freshness of the data.
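A minimal sketch of this timestamp-based approach (the record schema and the 30-minute threshold are assumptions, not from the text) might look as follows:

import time

FRESHNESS_THRESHOLD_SECONDS = 30 * 60  # assumption: data older than 30 minutes counts as stale

def freshness_sli(served_records, now=None):
    """Proportion of served records whose generation timestamp is newer than the threshold.

    Each record is a dict with a `generated_at` epoch timestamp (hypothetical schema).
    Only records actually served to users are counted, matching the idea that
    un-accessed stale data does not miss any user expectation.
    """
    now = now if now is not None else time.time()
    if not served_records:
        return 100.0
    fresh = sum(1 for r in served_records
                if now - r["generated_at"] <= FRESHNESS_THRESHOLD_SECONDS)
    return 100.0 * fresh / len(served_records)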
Correctness is defined as the proportion of valid data producing a correct output. It's an important indication of reliability whereby processing a batch of data results in the correct output. To convert a correctness SLI into an implementation, a key choice that needs to be made is: How to determine if the output records are correct?
To determine if the output records produced are correct, a common strategy is to use golden input data, that is, a set of input data that consistently produces the same output. This way, the produced output can be compared against the expected output from the golden input data.
Proactive testing practices—both manual and automated—are strongly recommended to determine correctness.
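A hedged sketch of the golden-data strategy (the pipeline function and fixture data below are placeholders, not the author's implementation) is a simple check that compares the pipeline output for the golden input against the expected output:

# Golden input and its expected output are fixed, hypothetical fixtures.
GOLDEN_INPUT = [1, 2, 3, 4]
EXPECTED_OUTPUT = [2, 4, 6, 8]

def pipeline(records):
    """Placeholder for the real data processing pipeline under test."""
    return [r * 2 for r in records]

def correctness_check() -> bool:
    """The pipeline is considered correct if golden input reproduces the expected output."""
    return pipeline(GOLDEN_INPUT) == EXPECTED_OUTPUT

assert correctness_check(), "Pipeline output no longer matches the golden data"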
Coverage is defined as the proportion of valid data processed successfully. It's an important indication of reliability, whereby the user expects that data will be processed and outputs will subsequently also be available.
To convert a coverage SLI into an implementation, the choice that needs to be made is: How to determine that a specific piece of data was processed successfully? The logic to determine if a specific piece of data was processed successfully should be built into the service, and the service should also track the counts of success and failure.
The challenge comes when a certain set of records that were supposed to be processed are skipped. The proportion of records that were not skipped can only be determined by identifying the total number of records that should have been processed.
Throughput is defined as the proportion of time where the data processing rate is faster than a threshold. It's an important indicator of reliability for data processing systems that operate continuously on streams or small batches of data, where it accurately represents user happiness.
To convert a throughput SLI into an implementation, a key choice that needs to be made is: What is the unit of measurement for data processing? The most common unit of measurement for data processing is bytes per second (B/s).
It is not necessary that all sets of inputs have the same throughput rate. Some inputs need to be processed faster and hence require higher throughput, while some inputs are typically queued and can be processed later.
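As an illustration (the window size and threshold below are assumptions), a throughput SLI can be expressed as the proportion of measurement windows in which the processing rate stayed above a threshold:

# Hypothetical per-minute measurements of bytes processed.
BYTES_PER_SECOND_THRESHOLD = 5_000_000  # assume 5 MB/s is "fast enough"

def throughput_sli(bytes_per_window, window_seconds=60):
    """Proportion of time windows where the processing rate exceeded the threshold."""
    if not bytes_per_window:
        return 100.0
    good_windows = sum(1 for b in bytes_per_window
                       if b / window_seconds >= BYTES_PER_SECOND_THRESHOLD)
    return 100.0 * good_windows / len(bytes_per_window)

measurements = [400_000_000, 250_000_000, 360_000_000]  # bytes processed per minute
print(throughput_sli(measurements))  # 66.66...% of windows met the 5 MB/s threshold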
SLIs recommended for a data storage-based user journey
Systems processing data can also be further classified into systems responsible only for storing data. So, a data storage-based user journey is another possible classification of a user journey, where availability, durability, and end-to-end latency are additional recommended SLIs. Availability refers to whether data can be accessed on demand from a storage system. Durability refers to the proportion of records written that can be successfully read back from the storage system when required. End-to-end latency refers to the time taken to process a data request, from ingestion to completion.
The following table summarizes specific characteristics to represent an SLI type, grouped by the type of user journey:
Given that there is a wide choice of SLIs to select from, Google recommends the following specific SLIs based on the type of systems:
Given that we have looked at various factors that go into determining SLIs specific to a user journey, the upcoming subsection will focus on the methodology and sources used to measure SLIs.
An SLI equation is defined as the proportion of valid events that were good, as illustrated here:
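SLI = (good events / valid events) × 100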
This equation has the following properties:
This completes our summary of the SLI equation and its associated properties. The next subsection details various popular sources used to measure SLIs.
Identifying potential user journeys for a service is the first important step to identify SLIs. Once SLIs to measure are identified, the next key step is to measure the SLIs so that corresponding alerts can be put in place. The key question in this process is: How to measure and where to measure?
There are five popular sources or ways to measure SLIs, outlined as follows:
Here are some details on how information from server-side logs can be used to measure SLIs:
Here are details of the limitations of using server-side logs to measure SLIs:
Here are details on how information from the application server can be used to measure SLIs:
Here are details of the limitations of using the application server to measure SLIs:
What is a complex multi-request user journey?
A complex multi-request user journey will include a sequence of requests, which is a core part of a user consuming a service such as searching a product, adding a product to a shopping cart, and completing a purchase. Application metrics cannot capture metrics for the user journey but can capture metrics related to individual steps.
Here are details of how information from frontend infrastructure can be used to measure SLIs:
Here are details of the limitations of using a frontend infrastructure to measure SLIs:
Here are details of how information from synthetic clients can be used to measure SLIs:
Here are details of the limitations of using synthetic clients to measure SLIs:
Telemetry refers to remote monitoring from multiple data sources and is not restricted to capturing metrics related to application health; it can be extended to capture security analytics such as suspicious user activity, unusual database activity, and so on. Here are details of how information from telemetry can be used to measure SLIs:
What is OpenTelemetry?
OpenTelemetry is a unified standard for service instrumentation. It provides a set of application programming interfaces (APIs)/libraries that are vendor-agnostic and standardizes how to collect and send data to compatible backends. OpenTelemetry is an open source project that is part of the Cloud Native Computing Foundation (CNCF).
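As a brief illustration of what instrumentation with the OpenTelemetry Python API could look like (the meter name, metric name, and attributes below are made up for the example, and the OpenTelemetry SDK is assumed to be installed and configured), a counter labeled by outcome can back an SLI calculation:

from opentelemetry import metrics

# Obtain a meter from the globally configured meter provider.
meter = metrics.get_meter("checkout-service")  # hypothetical service name

request_counter = meter.create_counter(
    name="http_requests",
    unit="1",
    description="Count of requests, labeled by outcome for SLI calculation",
)

def record_request(status_code: int):
    outcome = "good" if status_code < 500 else "bad"
    request_counter.add(1, {"outcome": outcome})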
Here are details of the limitations of using telemetry to measure SLIs:
This completes an elaboration of five different sources to measure SLIs. Given that each source has its own limitations, there is no single best source to measure SLIs; in most cases, a combination of sources is preferred. For example, if an organization is getting started with its SRE practice, using server-side logs to backfill SLIs and frontend infrastructure to readily use the metrics from the load balancer might be a good way to start. It can later be extended to capturing metrics from the application server, but given that this doesn't support complex multi-request user journeys, an organization can later shift to the use of telemetry or synthetic clients based on its needs. The next subsection summarizes a few SLI best practices as recommended by Google.
Identifying, defining, and measuring SLIs is a tedious task for an organization starting its SRE journey, and it is a key aspect of that journey. Here is a list of Google-recommended best practices:
This completes a comprehensive deep dive into SLIs, with a focus on categorizing user journeys, identifying specific aspects that impact a user journey, various sources to measure SLIs, and recommended best practices to define them. To summarize, there are four critical steps for choosing SLIs, listed as follows:
The upcoming section focuses on error budgets, which are used to achieve reliability by maintaining a balance with release velocity.
Once SLOs are set based on SLIs specific to user journeys that define system availability and reliability by quantifying users' expectations, it is important to understand how unreliable the service is allowed to be. This acceptable level of unreliability or unavailability is called an error budget.
The unavailability or unreliability of a service can be caused by several factors, such as planned maintenance, hardware failures, network failures, bad fixes, and issues introduced while releasing new features.
Error budgets put a quantifiable, trackable target on the amount of unreliability a service can tolerate. They create a common incentive between development and operations teams. This target is used to balance the urge to push new features (thereby adding innovation to the service) against ensuring service reliability.
An error budget is basically the inverse of availability, and it tells us how unreliable your service is allowed to be. If your SLO says that 99.9% of requests should be successful in a given quarter, your error budget allows 0.1% of requests to fail. This unavailability can be generated because of bad pushes by the product teams, planned maintenance, hardware failures, and other issues:
Here's an example. If SLO says that 99.9% of requests should be successful in a given quarter, then 0.1% is the error budget.
Let's calculate the error budget, as follows:
If SLO = 99.9%, then error budget = 0.1% = 0.001
Allowed downtime per month = 0.001 * 30 days/month * 24 hours/day * 60 minutes/hour = 43.2 minutes/month
This introduces the concept of an error budget. The next subsection introduces the concept of an error budget policy and details the need for executive buy-in with respect to complying with the error budget policy.
If reliability is the most important feature of a system, an error budget policy represents how a business balances reliability against other features. Such a policy helps a business to take appropriate actions when the reliability of the service is at stake. The key to defining an error budget policy is to actually decide the SLO for the service. If the service is missing SLO targets, which means the error budget policy is violated, then there should be consequences. These consequences should be enforced by generating executive buy-in. Operations teams should be able to influence development teams' practices by halting the release of new features if the service is getting very close to exhausting the error budget or has already exceeded it.
Error budgets can be thought of as funds that are meant to be spent across a given time period, for example, on releasing new features, rolling out software updates, or managing incidents. The next subsection lists out the characteristics of an effective error budget policy.
An error budget policy should have the following characteristics:
The preceding list of characteristics clearly calls out the fact that well-documented SLOs are essential to defining an effective error budget policy. This will be discussed in the next subsection.
The key to defining an error budget is actually to decide the SLOs for the service. SLOs clearly differentiate between reliable services and unreliable services, which extends to identifying happy versus unhappy users. SLOs should be clearly defined without any ambiguity and should be agreed upon by product owners, developers, SREs, and executives.
In addition to implementing an SLO and configuring a monitoring system to alert on the SLO, the following characteristics are recommended for a well-documented SLO in terms of metadata:
The next subsection discusses multiple options to set error budgets.
Error budgets can be thought of as funds that are meant to be spent across a given time period. These funds can be spent on releasing new features, rolling out software updates, or managing incidents. But this raises several questions, such as the following:
Different strategies can be used to determine the right time to spend error budgets within a time period. Let's assume the time period is 28 days. There could be three potential options, listed as follows:
Any of the preceding options, or a combination of the three, can be used to define a dynamic release process, and it all depends on what the development and operations teams agree upon based on current business needs and past performance. The dynamic release process can be implemented by setting alerts based on error budget exhaustion rates.
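One possible way to implement such an alert (the window and threshold values below are assumptions) is to compute the error budget burn rate, that is, the ratio between the observed error rate and the error rate the SLO allows, and alert when it is too high:

def burn_rate(bad_events: int, total_events: int, slo: float) -> float:
    """Ratio of the observed error rate to the error rate allowed by the SLO.

    A burn rate of 1.0 means the budget would be exactly exhausted by the end of
    the SLO period; higher values mean the budget is being spent faster.
    """
    allowed_error_rate = 1 - slo           # e.g., 0.001 for a 99.9% SLO
    observed_error_rate = bad_events / total_events if total_events else 0.0
    return observed_error_rate / allowed_error_rate

# Hypothetical one-hour window: 120 failures out of 20,000 requests against a 99.9% SLO.
rate = burn_rate(bad_events=120, total_events=20_000, slo=0.999)
if rate > 2.0:  # assumed alerting threshold
    print(f"Page: error budget burning {rate:.1f}x faster than sustainable")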
If the error budget of a service is exhausted but the development team needs to push a new feature as an exception scenario, SRE provisions this exception using silver bullets.
Envision silver bullets as tokens that can be given to the operations team to grant an exception to release new features when the error budget has been exceeded. These tokens reside with a senior stakeholder, and the development team needs to pitch the need to use a silver bullet to that stakeholder. A fixed number of such tokens are given to the stakeholder, and they are not carried over to the next time period. In addition to the use of silver bullets, SRE also recommends the use of rainy-day funds, whereby a certain amount of extra error budget is provided to handle unexpected events.
Error budgets cannot be carried over to the next time period. So, in all practicality, the goal is to spend the error budget by the end of the time period. Constantly exhausting error budgets and repeated use of silver bullets should call for a review, where engineering efforts should be invested in making the service more reliable by improving the service code and by adding integration tests.
The use of dynamic release cadence, error budget exhaustion rates, silver bullets, and rainy-day funds are advanced techniques prescribed by SRE to manage error budgets. This completes the subsection on defining characteristics for an effective error budget policy, listing out characteristics for well-documented SLOs and discussing options to set error budgets. The next subsection details factors that are critical in ensuring that a service stays reliable and does not exhaust the error budget.
When a service exhausts its error budget or repeatedly comes close to exhausting the same, engineering teams should focus on making a service reliable. This raises the next obvious question: How can the engineering teams make a service more reliable to meet users' expectations?
To get deeper insights into this, it's critical to consider the following key factors essential to determine the potential impact on the service:
Reliability can be improved by implementing the following options:
Let's discuss each option in detail, next.
TTD can be reduced by the following approaches:
TTR can be reduced by the following approaches:
Impact % can be reduced by the following approaches:
Frequency can be reduced by the following approaches:
Here are some options from an operational standpoint to make a service reliable:
This completes a deep dive into the factors that need to be considered and the feasible options that can be implemented to make a service reliable, and thus avoid exhausting the error budget. The next subsection summarizes the section on error budgets.
Error budgets can be summarized by the following key pointers:
This completes the section on error budgets, with a deep dive into multiple aspects that include how to define an effective error budget policy, how to set error budgets, the impact of having an executive buy-in that helps to make a service reliable, and how to effectively balance the release velocity of new features.
Toil was introduced in Chapter 1, DevOps, SRE, and Google Cloud Services for CI/CD, and is defined as work tied to a production service that is manual, repetitive, automatable, and tactical, lacks enduring value, and grows linearly with the service. Toil is often confused with overhead, but overhead refers to administrative work such as email, commuting, filing expense reports, and attending meetings. Toil can be both good and bad; it really depends on the amount of toil.
Here are some of the positive sides of performing toil, but in very short and limited amounts:
However, excessive toil can lead to the following problems or issues:
All the aforementioned problems or issues can potentially lead to attrition, as SRE engineers might not be happy with their everyday work and might look elsewhere for better work and challenges. SRE recommends that toil should be bounded and that an SRE engineer should not work more than 50% of their time on toil. Anything more than 50% blurs the line between an SRE engineer and a system administrator. SRE engineers are recommended to spend the remaining 50% on supporting engineering teams in achieving reliability goals for the service.
Eliminating toil allows SRE engineers to add service features that improve the reliability and performance of the service. In addition, the focus can remain on removing toil as it is identified, thus clearing out the backlog of manual, repetitive work. SRE encourages the use of engineering concepts to remove manual work. This also allows SRE engineers to scale up and manage services better than a development or an operations team.
SRE recommends removing toil through automation. Automation provides consistency and eliminates oversights and mistakes. Automation helps to perform a task much faster than humans can and can also be scheduled. Automation also helps to prevent a problem from recurring. Automation is usually done through code, and this also gives SRE engineers a chance to use engineering concepts to implement the required logic.
This concludes the section on toil: its characteristics, the good and bad aspects, and the advantages of using automation to eliminate toil. The next section illustrates how SLAs, SLOs, and error budgets are impacted based on SLI performance.
In this section, we will go through two hands-on scenarios to illustrate how SLO targets are met or missed based on SLI performance over time. SLO performance has a direct impact on SLAs and error budgets. Changes in the error budget will specifically dictate the priority between the release of new features versus service reliability. For ease of explanation, a 7-day period is taken as the measure of time (ideally, a 28-day period is preferred).
Here are the expectations for this scenario:
Given that the anticipated SLO for the service is 98%, here is how the allowed downtime or error budget is calculated (you can use this downtime calculator for reference: https://availability.sre.xyz):
So, if total downtime across 7 days is less than 201.6 minutes, then the service is within SLO compliance of 98%, else the service is out of SLO compliance.
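The following small sketch (the daily downtime values are illustrative, not the exact figures used in this scenario) shows how the 201.6-minute error budget burns down day by day:

ERROR_BUDGET_MINUTES = (1 - 0.98) * 7 * 24 * 60  # 201.6 minutes for a 98% SLO over 7 days

# Illustrative downtime observed each day (in minutes); not the exact scenario figures.
daily_downtime = [10, 0, 25, 5, 0, 15, 8]

remaining = ERROR_BUDGET_MINUTES
for day, downtime in enumerate(daily_downtime, start=1):
    remaining -= downtime
    status = "within SLO" if remaining >= 0 else "SLO breached"
    print(f"Day {day}: downtime={downtime} min, budget left={remaining:.1f} min ({status})")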
Now, let's illustrate how the SLO is impacted based on SLI measurements. Assume that new features are introduced for the service (across the 7-day period) and the features are stable, with minimal issues.
The following table represents the SLI measurements of availability, respective downtime based on SLI performance, and the reduction in error budget on a per-day basis:
The following screenshot represents the SLI performance for service uptime (top) and the error budget burndown rate (bottom) based on the values from the preceding table:
Here are some critical observations:
This completes a detailed illustration of a scenario where the SLO is met based on the SLI performance over a 7-day period. The next scenario illustrates the opposite, where the SLO is missed based on SLI performance.
Here are the expectations for this scenario:
As calculated in Scenario 1, the allowed downtime for a 98% SLO is 201.6 minutes. So, the SLO is out of compliance if downtime is greater than 201.6 minutes over a 7-day period.
Now, let's illustrate how the SLO is impacted based on SLI measurements. Assume that new features are introduced for the service (across the 7-day period) but the introduced features are not stable, causing major issues resulting in longer downtimes.
The following table represents the SLI measurements of availability, respective downtime based on SLI performance, and the reduction in error budget on a per-day basis:
The following screenshot represents the SLI performance for service uptime (left-hand side) and the error budget burndown rate (right-hand side) based on the values from the preceding table:
Here are some critical observations:
This brings an end to a detailed rundown of Scenario 2. This completes our illustration of how SLAs, SLOs, and error budgets are impacted based on SLI performance. This also means we have reached the end of this chapter.
In this chapter, we discussed in detail the key SRE technical practices: SLAs, SLOs, SLIs, error budgets, and eliminating toil. This included several critical concepts such as factors that can be used for a well-defined SLA, providing guidelines to set SLOs, categorizing user journeys, detailing sources to measure SLIs along with their limitations, elaborating on error budgets, detailing out factors that can make a service reliable, understanding toil's consequences, and elaborating on how automation is beneficial to eliminate toil. These concepts allow us to achieve SRE's core principle, which is to maintain the balance between innovation and system reliability and thus achieve the eventual goal: build reliable software faster.
In the next chapter, we will focus on concepts required to track SRE technical practices: monitoring, alerting, and time series. These concepts will include monitoring as a feedback loop, monitoring sources, monitoring strategies, monitoring types, alerting strategies, desirable characteristics of an alerting system, time-series structures, time-series cardinality, and metric types of time-series data.
Here are some important points to remember:
For more information on Google Cloud Platform's (GCP's) approach toward DevOps, read the following articles:
Answer the following questions:
a) Toil
b) Manual
c) Overhead
d) Automation
a) SLI
b) SLO
c) SLA
d) Error budget
a) Availability, latency, and durability
b) Latency, coverage, throughput, and availability
c) Coverage, correctness, and quality
d) Availability, latency, and quality
a) 25%-55%
b) 45%-60%
c) 50%-75%
d) 30%-50%
a) 99.9%
b) 99.99%
c) 99.999%
d) 100%
a) 2 to 3
b) No specific limit
c) 3 to 5
d) 5 to 7
a) Database—Availability.
b) Database—Durability.
c) Database—Freshness.
d) Web application—Availability.
e) Web application—Durability.
f) Both the database and web application should be available. Production apps should have full availability.
a) SLA
b) SLO
c) SLI
d) Error budget
a) SLA
b) SLO
c) SLI
d) Error budget
a) Application server
b) Frontend infrastructure
c) Synthetic clients
d) None of the above