Chapter 2. Monitoring in a Reliability Engineering World

“Our measure should focus on outcomes, not outputs.” -- Dr. Nicole Forsgren, Accelerate

Monitoring systems is an extensive topic that has been heavily shaped in the past few years by the seminal work in Google’s SRE book and its follow-up, “The Site Reliability Workbook”. Since these two books came out, “SRE” has become one of the most popular trends in open job listings, and some companies have gone as far as retitling existing staff to some flavor of “reliability engineering”.

But is reliability engineering a fad? Or has it brought a significant change to how engineering teams define reliability, and therefore how they use monitoring to guide team decisions such as “Are we providing an acceptable customer experience?” and “Should we focus on reliability and resilience work, or can we work on new features?”

This chapter is framed by previous writings on what Site Reliability Engineering means and expects the reader to have an understanding of those principles. If you have not read either of the mentioned books, we recommend these chapters as a crash course:

  • Chapter 2 of the Site Reliability Workbook which covers how to implement SLOs

  • Chapter 5 of the Site Reliability Workbook which covers alerting on SLOs

This SRE framing also affects what the role of a database engineer is in an organization applying SRE principles. It is a fundamental shift compared to the stereotypical DBA role. In our opinion, the shift is more than a recruiting fad or a retitling; it is a core change to the role’s primary responsibilities.

The impact of reliability engineering on DBA teams

For many years, monitoring database performance relied on deep dives into single server performance. That still has a lot of value, but tends to be more about reactive measurements - like profiling a server that is performing poorly. This was the standard operating procedure in the days of the gatekeeping DBA teams where no one else was allowed to know how the database operated.

Enter Google’s introduction of Reliability Engineering. The role of a Database Administrator (DBA) became more complex and turned into more of a Site Reliability Engineer (SRE) or Database Reliability Engineer (DBRE). Teams had to optimize for their time. Service levels help you define when customers are unhappy, and allow you to better balance your time between addressing things like performance issues and scaling challenges against working on internal tooling. With this, let’s discuss the different ways you need to monitor MySQL to ensure a successful customer experience.

Note

We recommend reading Chapter 1 of The Site Reliability Workbook (O’Reilly, 2018) to get a deeper understanding of the philosophy behind moving toward service level performance management in production, as it informs a lot of the advice in this chapter. The crux of that chapter is getting you to answer a simple question: “How do you know your customers are happy?”

Defining Service Level Goals

Before going into how to measure whether customers are happy with the performance of your database clusters, we must first know what our goals are, and align on a common language to describe these goals. Here are some questions that can serve as conversation starters in your organization to define these goals.

  • What are the metrics appropriate for measuring success?

  • What values for these metrics are acceptable to the customers and our business needs?

  • At what point are we considered in a degraded state?

  • When are we in an altogether failed state and need to remediate as soon as possible?

There are scenarios with obvious answers to these questions (e.g., the source database is down; we are not taking any writes, and therefore business is halted). Some are less obvious, such as a periodic task occasionally hogging all of the database’s disk I/O so that everything else suddenly runs slower. Having a shared understanding across the organization of what we are measuring and why helps guide prioritization conversations: whether you can spend engineering effort on new features, or whether there needs to be more investment in performance improvement or stability.

In SRE practices, these discussions about customer satisfaction will align the team on what is healthy for the business in terms of SLIs, SLOs, and SLAs. Let’s start with defining what these acronyms mean.

Service Level Indicator (SLI)

In very simple terms, an SLI answers the question, “How do I measure whether my customers are happy?” It is the set of metrics that represents a healthy system from the users’ perspective. SLIs can be business-level indicators such as “response time for a customer-facing API” or something more fundamental like “the service is up.” You may find you need different indicators depending on the data’s context and how it relates to the product.

Service Level Objective (SLO)

An SLO answers the question, “What is the minimum my SLI can be while my customers are still happy?” An SLO is the objective range we want a given SLI to stay within for the service to be considered healthy. If uptime is the SLI, then the number of nines you want to be up over a given time span is the SLO. SLOs have to be defined as a value over a given timeframe to ensure that everyone is aligned on what the SLO means. An SLI plus an SLO form the basis of knowing whether your customers are happy.

Service Level Agreement (SLA)

SLAs answer the question, “What SLO am I willing to agree to that has consequences?” An SLA is an SLO that has been included in an agreement with one or more customers of the business (paying customers, not internal stakeholders), with financial or other penalties if that SLA is not met. It is important to note that SLAs are optional.

We will not cover SLAs much in this chapter, as SLAs tend to be more of a business discussion than an engineering one. Such a decision hinges mostly on what sales the business expects to gain by promising an SLA in contracts, and whether that gain is worth the risk to revenue if the SLA were to be broken. Hopefully, the decision is informed by what we do cover here about choosing SLIs and matching SLOs.

Defining these SLIs, SLOs, and SLAs guides not only the answer to business health but also planning within engineering teams. If a team is not hitting its agreed-upon SLOs, it should not be proceeding with new feature work. The same applies to database engineering teams: if one of the potential SLOs we discuss in this chapter is not being met, it spurs the conversation of why it is not being met. When you come armed with the data to explain why customer experience is suboptimal, you can have more meaningful conversations around team priorities.

What does it take to make customers happy?

After choosing a set of metrics as your SLIs, it may be tempting to set the goals to 100%. You must fight that urge, though. Remember that the point of picking indicators and objectives is to be able to evaluate at any time, with an objective metric, whether your team can innovate with new features or whether stability is at risk of dropping below acceptable levels for customers and therefore needs more attention and resources. The goal is to define the absolute minimum you need to do to make customers happy. If a customer is happy with your pages loading in 2 seconds, there is no need to set a target for pages to load in 750 ms; that would create an unreasonable burden on engineering teams.

Take uptime as an indicator with objective values attached to it: one can declare that “we will not have any downtime,” but what does that mean when implementing and tracking whether we are meeting the goal? Reaching three nines of availability is no small feat. Three nines over a whole year allows just over 8 hours of downtime, which translates to only about 10 minutes in a given week. The more nines you promise, the harder this gets, and the more expensive engineering hours the team will have to spend to deliver on such a promise. Table 2-1, based on a chart from Amazon Web Services, shows the challenge in raw numbers.

Table 2-1. Availability time by nines

Availability | Downtime / Year          | Downtime / Month        | Downtime / Week        | Downtime / Day
99.999%      | 5 mins, 15.36 secs       | 26.28 secs              | 6.06 secs              | 0.86 secs
99.995%      | 26 mins, 16.8 secs       | 2 mins, 11.4 secs       | 30.3 secs              | 4.32 secs
99.990%      | 52 mins, 33.6 secs       | 4 mins, 22.8 secs       | 1 min, 0.66 secs       | 8.64 secs
99.950%      | 4 hrs, 22 mins, 48 secs  | 21 mins, 54 secs        | 5 mins, 3 secs         | 43 secs
99.900%      | 8 hrs, 45 mins, 36 secs  | 43 mins, 48 secs        | 10 mins, 6 secs        | 1 min, 26 secs
99.500%      | 43 hrs, 48 mins, 36 secs | 3 hrs, 39 mins          | 50 mins, 32.17 secs    | 7 mins, 12 secs
99.250%      | 65 hrs, 42 mins          | 5 hrs, 28 mins, 30 secs | 1 hr, 15 mins, 48 secs | 10 mins, 48 secs
99.000%      | 3 days, 15 hrs, 36 mins  | 7 hrs, 18 mins          | 1 hr, 41 mins, 5 secs  | 14 mins, 24 secs

Because engineering time is a finite resource, you must be careful not to strive for perfection when choosing SLOs. Not all the features in your product require all these nines to keep customers satisfied, and so you will find that as your product feature set grows, you will have varying SLIs and SLOs depending on the specific feature impact, or the revenue being driven by it. That is to be expected and is a sign of a thoughtful process. The balance you need to strike there is to discover early and often the spots where conflicting needs are starting to bottleneck on the same data resource and find ways to separate those concerns to maintain separate SLIs and SLOs.

These indicators and objectives are also an effective way to establish a uniform language between product and engineering that guides the decision making between “spending engineering time on new features” and “spending time on resilience and fixing issues”. They are also a way to decide, from the list of things we would like to accomplish, which is the most important based on customer experience. You can use SLIs and SLOs to guide work prioritization conversations that are otherwise hard to align on.

What to measure

Let’s imagine a company whose product is an online store for homemade items. The company is seeing a lot more traffic due to increased online shopping, and there is demand on the infrastructure group to ensure the database layer can handle the increased load. Throughout the coming section we will talk about what to measure as if we were that fictional infrastructure team.

Now that we have defined terms such as SLI/SLO/SLA, we will discuss, in terms of our online store, which specific MySQL metrics are useful SLIs and how to choose SLOs for them. We will also discuss the limits of database SLIs/SLOs on driving customer experience and when we need more overall metrics for measuring the customer experience.

Defining SLIs and SLOs

Defining a good SLI and a matching SLO is almost an art, one centered on explaining succinctly how to provide a delightful user experience for your customers. We will not spend a ton of time explaining in the abstract how to create meaningful SLIs and SLOs. In the context of MySQL, they need to be a representation that covers three major themes: availability, latency, and lack of critical errors.

For our online shopping store this means pages that load quickly: faster than a few hundred milliseconds at least 99.5% of the time, measured across a month. It also means a reliable checkout process, where intermittent failures are allowed no more than 1% of the time in a given calendar month. Note how these indicators and objectives are defined. They are not “100%”, because we know we operate in a world where failure is inevitable, and they are defined over a time span so the team can accurately budget its work between new features and resilience based on where it is in this error budget.

“I expect 99.5% of my database requests to be served in less than 2 ms with no errors” is a sufficient SLI with a clear SLO, yet it is not simple: you cannot confirm all of it with a single metric. It is a one-sentence representation of how you expect the database layer to behave in order to provide an acceptable customer experience.

So what are good examples, in our online store, of metrics that can build this customer experience picture? Start with synthetic tests, such as sampled page loads against production, which provide a consistent signal that “things are OK.” That is just the beginning. Let’s discuss the different facets of signals you can track to build the picture. As we move through these examples, we will tie them back to our online store to help you visualize how these different metrics combine into a view of a good customer experience. First, we talk about tracking query response time.

Query analysis

Query analysis and monitoring query latency in the context of SLIs and SLOs needs to focus on customer experience. This means relying on tooling that can alert you as soon as possible when query response times are growing longer than an agreed upon threshold. Let’s discuss a few paths you can take to achieve that level of monitoring.

The open source route

If using a paid vendor is not an option for you, do not despair. A well-established open source option is Percona Monitoring and Management, known as PMM. It runs by installing a client on your database instances and a server that collects both MySQL metrics and more basic system metrics and packages them into dashboards for operators. One of the major benefits of PMM is that the organization of its dashboards is guided by the Percona community’s long experience monitoring MySQL performance, which makes it an excellent resource for getting engineers who are new to MySQL familiar with how to monitor its performance.

Another route you can take is shipping your database slow logs and MySQL Performance Schema outputs to a centralized location where you can use well-known tools such as pt-query-digest to analyze them and gain more insight into what your database instances are spending their time on.
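For the slow log route to be useful, the slow query log has to be enabled with a threshold that matches your latency goals. The following is a minimal sketch of what that could look like; the 0.5-second threshold is only an illustrative assumption, and you should pick a value that reflects your own SLOs:

-- Enable the slow query log dynamically; the 0.5-second threshold is an example.
SET GLOBAL slow_query_log = ON;
SET GLOBAL long_query_time = 0.5;
-- log_slow_extra (MySQL 8.0.14+) adds more diagnostic fields to each entry.
SET GLOBAL log_slow_extra = ON;
-- Confirm where the log file is written so it can be shipped to your central location.
SHOW GLOBAL VARIABLES LIKE 'slow_query_log_file';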

However, this process can have a slow, customer-impacting feedback loop if it is not used properly. You ideally want to discover issues before customers notice them, and reactively checking logs after the fact is not how to do that. You run the risk of eroding customer trust because of how long it takes to discover performance regressions when the process involves digging through all sorts of after-the-fact artifacts to determine what happened.

Your focus should be on stopping customer impact first, and only later using the logs and metrics collected during that impact to understand how you got there.

Profiling MySQL performance is a very useful tool. You can use it to find bottlenecks so your instances can do more with the same specification, to save on infrastructure cost, or to answer the question “why is this taking so long?” It is not, on its own, a tool for determining whether you are meeting your service reliability promises, as it sits far too deep in the internals of MySQL. For service level performance evaluation, we need a new way of thinking about performance.

Note
A note on “testing in production”: we often hear the drumbeat of “test in production,” and it makes a lot of folks cringe. But a mindset of treating production as also a place where we test our changes has value, and here is why. Deploying a change to production is when you discover how that change interacts with the rest of the system, at scale, with real customer traffic. By deploying the proposed change to production, you can see its impact on adjacent systems. By using the basic “are customers happy?” question, you can see:
  • When the feedback loop from production is quick and tied strongly to a change, it becomes much faster to roll back the change and reinspect the specific change that was being deployed
  • This method fosters stronger collaboration between feature teams and the database engineers. When all parties involved are aligned on the specific metrics to watch and what values they should be at, the task of measuring performance becomes a team effort
  • In the case of a regression, the effort spent outside production to look into “what happened” is far more specific than trying to recreate a benchmark suite that emulates a larger footprint of code paths. Engineering time spent to debug becomes far better targeted

Now, let’s dive into additional metrics that help you further understand the experience of your online store customers. You should think about the metrics you can get from MySQL in the frame of outcomes, not outputs. We will also cover examples of things you cannot measure through MySQL metrics alone.

Choosing a metric for availability

Here are some questions that can help your organization define what availability means for your product:
  • When dealing with inevitable catastrophic failures, what features are non-negotiable and what features are ‘nice to have’? (e.g.: can customers continue with existing shopping carts and check those out but maybe not add new items during this failure?)

  • What types of failures do we define as ‘catastrophic’? (e.g.: failure of listing search might not be catastrophic, but failure of checkout operations would be)

  • What does ‘degraded functionality’ look like? (e.g.: can we load generic recommendations instead of customized ones based on past purchase history when needed?)

  • What is the shortest possible Mean Time to Recovery (MTTR) we can promise for our core features given a set of probable failure scenarios? (e.g.: if the database powering shopping cart checkout is failing writes, how fast can we safely pivot to a new source node?)

When choosing a metric to show availability, make sure the communicated target is framed in customer experience and that it makes sense for the specific feature it represents. Rely on your customer support organization as the voice of the customer experience in this choice so you can avoid implied expectations that are not met later when incidents happen.

One MySQL metric that can be used as a leading indicator for availability issues is Threads_running. It tracks how many queries are currently in flight on a given database host. When Threads_running is growing at a fast clip and not showing any signs of decline, it is a sign of queries not finishing fast enough and therefore stacking up and consuming resources. Allowing this metric to grow usually results in a database host either hitting a full CPU lockup or coming under such intense memory load that the entire MySQL process gets shut down by the operating system. This is obviously a major outage if it happens on a source node and something you should strive to have leading indicators for. One relatively straightforward way to monitor this is to check how many CPU cores the host has; if Threads_running is exceeding that, it is a sign that your server is approaching that precarious state.
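A minimal spot check of that signal looks like the following; comparing the value against the host’s CPU core count is left to your monitoring system, since MySQL itself does not know how many cores it has:

-- How many queries are executing right now? Sustained values above the host's
-- CPU core count suggest queries are stacking up.
SHOW GLOBAL STATUS LIKE 'Threads_running';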

There are a number of MySQL server settings that help guide how much system resources a connection uses and since 1 connection = 1 thread in MySQL, this math gets straightforward once you know which variables are session based. We will not cover those settings here but [BPCONFIG] will cover these in detail. Once you do that math, leave some headroom to allow for operator response time if the alert threshold is exceeded.

Monitoring Query Latency

In addition to tracking query latency on the database side, you should measure this from the application perspective and add more insight into the outliers using tracing tools such as Honeycomb or LightStep.
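On the database side, performance_schema can give you per-statement latency percentiles to complement that application-level view. Here is a sketch that assumes MySQL 8.0, where the quantile columns exist; timer values are reported in picoseconds, hence the conversion:

-- Top statement digests by 95th percentile latency (picoseconds to milliseconds).
SELECT schema_name,
       digest_text,
       count_star        AS executions,
       quantile_95 / 1e9 AS p95_ms,
       quantile_99 / 1e9 AS p99_ms
FROM performance_schema.events_statements_summary_by_digest
ORDER BY quantile_95 DESC
LIMIT 10;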

Tracking Errors

Do you need to track and alert on every error that ever happens? Yes and no.

The sheer existence of errors for a MySQL client in a running service is not an indication that something is definitely broken. In a world of distributed systems, there are myriad scenarios where clients encounter errors that are intermittent and, in many cases, resolved with a simple retry of the failed query. The rate of errors across the fleet of services that issue database queries, though, can be a crucial indicator of brewing trouble. Here are some examples of client-side errors that might be just noise normally but are a sign of trouble if their rate accelerates.

Lock wait timeout

Your clients reporting a sharp increase in this error can be a sign of escalating row lock contention on your source node, with transactions retrying and still failing. It can be a precursor to write downtime.

Aborted connections

Clients reporting a sudden surge of aborted connections can be an indicator of issues in any access layer you have between the clients and the database instances. Not tracking that down can lead to a lot of client-side retries, which consume resources.

One thing the MySQL server tracks that can help you is the set of status counters named Connection_errors_xxx, where ‘xxx’ is one of several kinds of connection errors. A sudden increase in any of these counters can be a strong lead toward telling you that something new and unusual is currently broken.
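A simple way to snapshot those counters, which a metrics collector could poll on an interval; the interesting signal is the rate of change, not the absolute value:

-- These counters only ever increase while the server is up, so track and alert
-- on their rate of growth rather than their raw values.
SHOW GLOBAL STATUS LIKE 'Connection_errors_%';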

Are there errors where a single occurrence means there is trouble that needs to be handled? Yes. Here are some examples of server-side errors that you want to know about when they happen.

For example, getting errors that the MySQL instance is running in read-only mode is a sign of trouble even if this error does not happen very often. It can mean that you just had a replica promoted to source that is still running in read-only mode (you run replicas in read-only mode, don’t you?), which is write downtime for your cluster. Or it can mean there is some issue in your access layer sending write traffic to a replica. In either of those cases, it is not a sign of an intermittent issue solved with a retry.
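A quick check you can bake into promotion tooling or monitoring; on MySQL 5.7 and later both flags exist, and a node that is meant to accept writes should report 0 for both:

-- A writable source should show read_only = 0 and super_read_only = 0.
SELECT @@global.read_only AS read_only,
       @@global.super_read_only AS super_read_only;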

Another server-side error that is a sign of a major problem is either “Too many connections” or an OS-level “Cannot create new thread”. These are a sign that your application layer has created and left open more connections than your DB server is configured to allow, either via the server max_connections variable or the number of threads the MySQL process is allowed to open. These errors translate immediately into 5xx errors for your application and, depending on your application design, can also be impactful to your customers.

As you can see, measuring performance and choosing which metrics to frame your SLIs around is as much a communication and social problem as it is a technical one, so you should be prepared for that.

Steady state monitoring

Back to our online store and how we envision monitoring our customers’ experience. Imagine you aren’t experiencing any major failures of any components, but you notice a rising tide of customer support tickets reporting ‘slowness’ or occasional errors that seem to disappear on their own. How do you track down behavior like this? It can be a very difficult task if you do not already have a good idea of the baseline performance of a number of signals. The dashboards and scripts that you use to trigger on-call alerts can be referred to as steady state monitoring. These let you know something unexpected is happening with a given system, whether or not there was a change. They are an important tool for giving you leading indicators before your customers experience failure.

The balance you need to strike with monitoring is that it always needs to be actionable while also being a true leading indicator. Alerting on disk space for a database at 100% full is too late, as the service is already down, but alerting on 80% might be premature and not actionable if the growth rate is not that fast.

Let’s talk about useful signals you can monitor that are not directly tied to actual customer impact.

Disk growth

Tracking disk growth is the sort of metric that you might not think about until it becomes a problem, and when it does become a problem, solving it can be time consuming and business impacting. It is definitely better to understand how to track it, to have a plan to mitigate it, and to be clear on what alerting thresholds are appropriate.

There are a number of strategies you can use to monitor disk growth. Let’s break them down from most ideal to bare minimum.

If your monitoring tooling allows it, tracking the rate of growth of disk space usage can be remarkably useful. There are always scenarios where available disk space can burn down relatively quickly, putting your availability at risk. Operations such as long-running transactions with large undo logs or ALTER TABLE statements are examples of how you can approach a full disk faster than expected. There are many incident stories out there where excessive logging or a change in insert pattern for a given dataset went undetected until ‘the database’ ran out of disk space. Only then did all sorts of alerts fire.

If tracking the rate of growth is not feasible (not all monitoring tools provide this ability), you can set multiple thresholds: a lower warning that only fires during business hours and a higher, more critical value that alerts the off-hours on-call. This allows the team to get a heads-up from the business hours warning before things get dire enough to wake someone up.

If you can neither monitor the rate of growth nor define multiple thresholds for the same metric, then you must at least have a single threshold for disk space used at which you page your on-call engineers. This threshold needs to be low enough to allow the team to take some action and free disk space while assessing why the alert fired and considering a longer-term mitigation. Consider evaluating the maximum throughput your disk can write (MB/s) and using that to calculate how long it would take, at maximum traffic throughput, to fill the disk. You need at least that much lead time to avoid an event.
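As a back-of-the-envelope illustration of that lead-time calculation (the numbers here are made up, not a recommendation):

\[
\text{time to full} \approx \frac{\text{free space}}{\text{max write throughput}} = \frac{200\,\text{GB}}{500\,\text{MB/s}} = 400\,\text{s} \approx 6.7\,\text{minutes}
\]

In that scenario, an alert that fires with only 200 GB free would give the on-call engineer less than seven minutes at worst-case write throughput, which argues for setting the threshold much earlier.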

We discuss in Chapter 4 the operating system and hardware configurations that relate to how MySQL uses disk space and what tradeoffs to consider in those decisions in relation to disk space growth. You should expect that at some point, hopefully, your business grows to where you cannot store all of your data in one cluster of servers. Even if you run in a cloud environment that can expand volumes for you, there is still planning to do around this, so you always want a free disk space threshold that gives you the time to plan and do the needed expansion without a panic.

The takeaway here is to make sure you have some monitor for disk space growth, even if you think it is early days and too soon to need one. This is one of the growth axes that catches almost everyone unprepared.

Connection growth

As your business grows, a common layer that grows linearly is your application layer. You will need more instances to support login, shopping carts, processing requests, or whatever the context of the product may be. All these added instances start opening more and more connections to your database hosts. You may mitigate that growth for some time by adding replicas, using replication as a scale out measure, or even use middleware layers such as ProxySQL to make the growth of your frontend decoupled from connection load directly on the database.

While your traffic is growing, there is a finite pool of connections the database server can support, configured with the server setting max_connections. Once the total number of connections to the server reaches that maximum, your database will not allow any new ones. This is a common contributing cause of incidents where no new connections can be opened to the databases, leading to increased errors for your users.

Monitoring connection growth is about making sure your resources are not exhausted to the point of putting your database availability at risk. This risk can come in two different ways:

1. The application layer is opening lots of connections it’s not using, creating the risk of maxing out connections for no good reason. A clear sign of this is a high connection count (Threads_connected) while Threads_running remains low.

2. The application layer is actively using lots of connections and risking overloading the database. You can distinguish this state by seeing that both Threads_connected and Threads_running are at high values (hundreds? thousands?) and increasing.

A useful thing to consider when setting up monitoring for connection count is to rely on percentages rather than absolute numbers. The percentage of Threads_connected relative to max_connections shows how close the growth of your application node count is taking you to the maximum connection pool the database can allow. This helps you monitor for the first state of connection growth trouble.
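One way to compute that percentage directly from the database, as a sketch assuming MySQL 5.7 or newer (where these performance_schema tables exist); a monitoring agent would run something like this on an interval:

-- Percentage of the configured connection pool currently in use.
SELECT
    (SELECT VARIABLE_VALUE
       FROM performance_schema.global_status
      WHERE VARIABLE_NAME = 'Threads_connected')
    /
    (SELECT VARIABLE_VALUE
       FROM performance_schema.global_variables
      WHERE VARIABLE_NAME = 'max_connections')
    * 100 AS pct_connections_used;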

Separately, you should be tracking and alerting on how busy a DB host is, which, as we just explained, is seen in the value of Threads_running. Typically, if this value grows north of 100 threads, you start to see elevated CPU usage and increased memory use, which is a general sign of high load on the DB host. This is an immediate concern for your database availability, as it can escalate to the MySQL process getting killed by the operating system. A common quick solution is to tactically use the KILL command, or a tool that automates it such as pt-kill, to relieve load, then look into why the database got into this state using the query analysis we described earlier.
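When you do have to relieve load by hand, the first step is seeing which statements have been running the longest; here is a sketch using information_schema.processlist (still available in MySQL 8.0, though performance_schema.processlist is the newer equivalent). Which connection IDs are actually safe to KILL remains a judgment call:

-- Longest-running active statements; an operator or a tool like pt-kill can
-- then decide which connection IDs to KILL.
SELECT id, user, db, time AS seconds_running, LEFT(info, 80) AS query_snippet
FROM information_schema.processlist
WHERE command NOT IN ('Sleep', 'Binlog Dump')
ORDER BY time DESC
LIMIT 20;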

Warning

Connection storms are situations in production systems where the application layer perceives increases in query latency and responds by opening more connections to the database layer. This can add significant load to the database as it handles the large influx of new connections, which takes resources away from fulfilling query requests. Connection storms can cause a sudden decrease in the connections still available under max_connections and increase the risk to your database availability.

Replication lag

MySQL has a native replication feature that sends data from one server - the source - to one or more additional servers, referred to as replicas. The delay between data being written on the source and being available on the replicas is referred to as replication lag. If your application reads data from the replicas, lag can make it seem as if your data has inconsistencies as you send reads to replicas not yet caught up on all the changes. In a social media example, a user may comment on something someone else has posted. This data is written to the source and then replicated out to the replicas. When the user attempts to view their reply, if the application sends the request to a server which is lagged, the replica may not have the data yet. This can create confusion for the user, thinking their comment was not saved. We cover strategies to fight replication lag in more detail in chapter REPLICATION.
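The simplest lag check as seen from a replica looks like the following sketch; SHOW REPLICA STATUS is the MySQL 8.0.22+ syntax (older versions use SHOW SLAVE STATUS), and Seconds_Behind_Source is a coarse measure that many teams supplement with a heartbeat table written on the source:

-- Run on a replica; Seconds_Behind_Source in the output is the lag estimate.
SHOW REPLICA STATUS\G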

Lag is one of those metrics that can be an acute SLI that can trigger incidents. It is also a longer-term trend indicating the need for more architectural change. In the longer-term context, even if you never hit replication lag that is impacting the customer experience, it is still a sign that, at least intermittently, the volume of writes from source nodes is surpassing what replicas can write at current configuration. It can be a canary in the coal mine for your write capacity. If listened to, it can prevent future full-blown incidents.

Warning

Be wary of alerting someone to replication lag. Immediate, actionable remediation may not always be possible. Likewise, if you don’t read from replicas, consider how aggressively your monitoring system alerts someone to this condition. Alerts that someone receives, especially off hours, should always be actionable.

Replication lag is one of those metrics that affects immediate, tactical decisions, but keeping an eye on its long-term trends can also save you the hassle of larger business impact and keep you ahead of the growth curve.

IO utilization

One of the never-ending endeavors of a database engineer is “do as much of the work as possible in memory because it is faster”. While that is certainly accurate, we also know that we cannot possibly accomplish that 100% of the time because it would mean our data entirely fits in memory, in which case ‘scale’ is not yet a thing we need to spend energy on.

As your database infrastructure scales and your data does not fit in memory anymore, you come to realize that the next best thing is to not read so much data from disk that queries are stuck waiting their turn for those precious IO cycles. This remains true even in this era of almost everything running on solid state drives. As your data size grows and your queries need to scan more of it to fulfill requests, you will find that IO wait can become a bottleneck for your traffic growth.

Monitoring your disk I/O activity helps you get ahead of performance degradation before it becomes customer facing, and there are a few things you can watch to achieve this goal. Tools such as iostat can help you monitor for I/O wait. You want to monitor and alert if your database server has a lot of threads sitting in I/O wait, an indication that they are queued, waiting for disk resources to become available. You can find this by tracking I/O utilization, reported as a percentage of the overall system’s disk access capacity, as a running graph over a meaningful period of time, say a day or two or even a week. Utilization close to 100% for sustained periods on a host that is not running backups can be an indication of full table scans and inefficient queries, and it forewarns you of disk access becoming a future bottleneck for your database performance.
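iostat gives you the operating system’s view; if the sys schema is installed (it ships with MySQL 5.7 and later), you can also see which I/O waits dominate from inside MySQL, as in this sketch:

-- Which I/O wait classes are consuming the most time inside MySQL.
SELECT event_name, total, total_latency
FROM sys.io_global_by_wait_by_latency
LIMIT 10;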

Auto increment space

One of the lesser-known landmines in using MySQL is that auto increment primary keys are by default created as signed integers and can run out of key space. This happens when you have done enough inserts that the auto increment key has reached the maximum possible value for its datatype. When planning what metrics to monitor on a long-term basis, tracking the remaining integer space for any tables that use auto increments as the primary key is a simple action that will almost certainly save you some major incident pain in the future, because you can predict the need for a larger key space in advance.

How do you monitor this keyspace? You have a few options:

  • If you already use Percona Monitoring and Management and its Prometheus exporter, this comes baked in; all you need to do is turn on the flag --collect.auto_increment.columns

  • If your team does not use Prometheus, you can use the query below, which can be adapted to serve either as a metrics producer or as an alert to tell you when any of your tables are approaching the maximum key space possible. This query relies on information_schema, which has all the metadata about the tables in your database instance

-- For every table whose primary key uses AUTO_INCREMENT, report how much of
-- that column type's key space has already been consumed (as a percentage).
SELECT
    t.TABLE_SCHEMA AS `schema`,
    t.TABLE_NAME AS `table`,
    t.AUTO_INCREMENT AS `auto_increment`,
    c.DATA_TYPE AS `pk_type`,
    (
        t.AUTO_INCREMENT /
        (CASE DATA_TYPE
            WHEN 'tinyint'
                THEN IF(COLUMN_TYPE LIKE '%unsigned',
                    255,
                    127
                )
            WHEN 'smallint'
                THEN IF(COLUMN_TYPE LIKE '%unsigned',
                    65535,
                    32767
                )
            WHEN 'mediumint'
                THEN IF(COLUMN_TYPE LIKE '%unsigned',
                    16777215,
                    8388607
                )
            WHEN 'int'
                THEN IF(COLUMN_TYPE LIKE '%unsigned',
                    4294967295,
                    2147483647
                )
            WHEN 'bigint'
                THEN IF(COLUMN_TYPE LIKE '%unsigned',
                    18446744073709551615,
                    9223372036854775807
                )
        END / 100)
    ) AS `percent_of_keyspace_used`
    FROM information_schema.TABLES t
    INNER JOIN information_schema.COLUMNS c
        ON t.TABLE_SCHEMA = c.TABLE_SCHEMA
        AND t.TABLE_NAME = c.TABLE_NAME
    WHERE
        t.AUTO_INCREMENT IS NOT NULL
        AND c.COLUMN_KEY = 'PRI'
        AND c.DATA_TYPE LIKE '%int'
;

There is a lot of nuance and context that you have to think about when picking a primary key in general and for managing auto increments specifically and we will cover that in Chapter 3.

Backup creation/restore time

Long term planning is not only about growth while the business is running as usual but also about recovery in an acceptable time frame. We will discuss in more depth in Chapter 13 how to think about disaster recovery, but we bring it up here to note that a good disaster recovery plan only works when you revisit it and adjust its goals.

Note
In this chapter and other sections throughout this book, you will see us mention sharding or partitioning as different ways to split your data on separate instances in order to scale. We want to define what we mean by these and how they differ to avoid confusion as you read the rest of this book.

Functional sharding means splitting specific tables that serve a specific business function into a dedicated cluster in order to separately manage this dataset uptime, performance or even access controls.

Horizontal sharding is when you have a dataset that has grown past the size you can reliably serve out of a single cluster and you split it into multiple clusters and serve the data from several nodes, relying on some lookup mechanism to locate the subset you need.

If your databases are reaching a size where restoring from a backup will take longer than what is acceptable for restoring critical functionality of the business, then even if everything else is running fine, you need to either adjust that mean time to recovery target, change the definition of ‘critical functionality’, or find a way to make backup restore time shorter. Here are some things to think about when planning for disaster recovery:

  • Be very specific about what functionality falls into this recovery target and, if needed, look into whether the data that powers that functionality needs to live in a separate cluster to make that expectation realistic

  • If functionally partitioning that data into multiple, smaller instances is not feasible, the entire dataset is now under that target for recovery via backups. The dataset that takes the longest to restore from backups is what drives the recovery completion time.

  • Make sure to have automated methods for testing (we will cover some examples in Chapter 11) how long it takes to restore a backup from a file to a running database that has also caught up on replicating all changes since the backup was created, and store that metric somewhere with enough retention to see long-term (read: at least a year) trends. This is one of those indicators that can slip and become surprisingly long if monitoring it is not automated.

You will see that in many of the example long-term metrics we describe, we almost always point out the need for either functional sharding or horizontal sharding of your data. The goal here is to point out explicitly that if you only consider sharding after incidents where capacity issues are a major contributing cause, then you have likely considered it too late. The work of breaking your data down into manageable pieces doesn’t start when your data is too large for one cluster, but rather well before that, when you are still determining what goals provide a successful customer experience.

Understanding how long it takes you to recover your data can help set expectations of what to do in a real disaster. It can also make you aware of when it might take longer than the business wants it to. This is a precursor to needing to shard.

Measuring Long Term Performance

Choosing SLIs and SLOs for day-to-day operations is only the beginning. You need to make sure you are not missing the forest for the trees by focusing on specific host metrics instead of inspecting overall system performance and customer experience outcomes. In this section, we cover strategies you can use to think about the overall long-term health of the system.

Learning your business cadence

It is important to be aware of the traffic cadence of your business, as that will always be the time when all your SLOs are both the most tested and under the most scrutiny from your most important customers. The business cadence can mean peak traffic that is orders of magnitude larger than ‘average’, and that has plenty of consequences if your database infrastructure is not prepared for it. In the context of the database infrastructure, this can translate to orders of magnitude more requests per second to fulfill, a lot more connection load from your application servers, or larger revenue impact if you were to have an intermittent failure of write operations. Here are some examples of business cadence that should help you understand what business cycle your company operates within:

  • E-commerce site: Black Friday to Christmas are the busiest times of the year, and we can see orders of magnitude more sales. This means a lot more shopping carts, a lot more concurrent sales, and a lot more revenue impact for the same failures any other time of the year.

  • HR software: In the US, November is typically the time of the year when a lot of employees are making benefits elections at a time known as “open enrollment” which will lead to a lot more traffic.

  • An online fresh flowers vendor: Valentine’s Day will be the busiest time of the year, with a lot more folks ordering deliveries of bouquets.

As you can see, these business cycles can vary wildly depending on the customer need the business is filling. It is crucial for you to be aware of this cycle and what implications it has on the business revenue, business reputation, and therefore how much preparation you should have to meet the task without impacting the stability of the systems you are tasked to run.

When it comes to measuring the performance of the database infrastructure underpinning the business, it is important not to measure performance in a bubble, separate from the rest of the important metrics your engineering organization is tracking. Database performance should be part of the larger conversation about tech stack performance and not handled as a special case. The way to do that is to start by using the same tools as the rest of your engineering organization as much as possible. You want the metrics and dashboards you rely on to determine how the DB layer is performing to be as accessible as the application layer metrics, or even in the same dashboards. This mindset, regardless of what tech or vendor you use, will go a long way in creating an environment where everyone is invested in the performance of the full stack, and it reduces the proverbial wall engineers can feel between the features they write and the databases that support them.

Tracking your metrics effectively

There are a number of things to be concerned with when it comes to long term planning for a business. These things include but are not limited to:

  • Planning for future capacity

  • Trying to foresee when major improvements are needed and when incremental changes are enough.

  • Planning for the growth of the cost of running your infrastructure

You need to be able to not just measure the health of the datastore infrastructure as a point in time picture but to also trend performance improvement or degradation on a long term basis. This means not just identifying SLIs and SLOs but also finding which SLIs and SLOs remain valuable, high signal metrics for long term trends as well. You will likely find that not all metrics that can be used in short term on-call decision making can also be appropriate for long term business planning.

Before we dive into what metrics are important for long term planning, let’s talk about some tools that empower that long term trend monitoring.

Using monitoring tools to inspect the performance

Measuring performance is important in both the immediate “are we currently in an incident” sense and the long-term tracking and trending sense. The tool that holds the metrics you care about is as important a decision as what the metrics themselves are. What is the use of choosing a good SLI if you then cannot properly see its trend over time in a manner that is relatable to the rest of the organization metrics?

The field of monitoring tools is rapidly growing, and there are lots of strong opinions on how it should be done. The goal here is increased transparency and a focus on tracking outcomes rather than outputs. And in the field of making an infrastructure stack successful, tracking success is a team sport.

Instead of talking about specific tools here, we will instead list some important features/aspects to think about when considering if a tool is good for this kind of long-term trending.

Say no to averages

Whether you are self-managing your metrics solution as an engineering org or using a SaaS (software as a service) offering, be careful about how your metrics solution normalizes data for long-term storage. A lot of solutions aggregate longer-term data into averages by default (Graphite is one of the first tools to do that), and that is a big problem. If you need to look at the trend of a metric over a period longer than a few weeks, averaging will smooth out peaks, which means that if you are trying to see whether your disk I/O utilization can double over the next year, a graph of averaged data points will very likely give you a false sense of security. Always look at peaks when trending months of data so you keep the fidelity of occasional spikes in your view.

Percentiles are your friend

Percentiles rely on ordering the data points in a given timespan and removing the highest-value ones depending on the target percentile (i.e., if you are looking for the 95th percentile, remove the top 5%). They are an excellent tool for making the data you are looking at map more closely to how we discuss SLIs and SLOs. If you can easily make the graph of your query response time show the 95th percentile, you can far more easily match it to the SLO you want to achieve for application request completion, and you make the database metrics make sense to folks like your customer support org and your engineers, not just your database engineering team.

Long retention period and performance

This may seem obvious, but the performance of a monitoring tool when displaying long time spans is important. If you are evaluating solutions for business metric trending, make sure to test how the user experience changes when asking for longer and longer time spans of data. A metrics solution is only as good as it is at making that data available, not just at the speed of ingestion or how long it retains the data.

Now that we have covered what a long term monitoring tool should look like, let’s discuss how all we’ve covered so far in choosing SLIs and SLOs can guide your data architecture.

Using SLOs to guide your overall architecture

Keeping a consistently good customer experience while your business is growing is no small feat. As the size of the business grows, keeping even the same SLOs, much less setting more ambitious ones, becomes harder and harder. Take something like availability, for example: everyone wants as many nines as possible of uptime for both reads and writes for all their data. But the more stringent the SLOs you want to achieve, the more expensive the work becomes as your database’s peak transactions per second or its size grows by orders of magnitude.

Using the SLIs and SLOs we already discussed, you can also find the points in growth where it makes sense to start splitting your data into either functional shards or data partitions. We will discuss scaling MySQL using sharding in further detail in Chapter 10, but the important point here is that the same SLIs and SLOs that tell you how the system is performing now can also tell you when it is time to invest in scaling MySQL so that individual clusters remain manageable within the SLO boundaries that preserve your customers’ experience.

Having a metrics solution that can handle both short- and long-term metrics and can trend changes for you in a useful manner is a very important part of tracking both tactical performance metrics and the longer-term, business-impacting trends of how your database infrastructure is doing.

Summary

It is important during your journey of applying reliability engineering concepts to monitoring your database infrastructure that you constantly revisit and improve your indicators and objectives. They are not meant to be set in stone after the first time you define some SLIs and SLOs. As the business grows, you will gain a deeper understanding of the customers’ experience, and that should drive improvements to your SLIs and SLOs.

We hope this chapter helps frame for you how to apply reliability engineering to monitoring MySQL successfully as your company scales.
