Chapter 3. Operating System and Hardware Optimization

Your MySQL server can perform only as well as its weakest link, and the operating system and the hardware on which it runs are often limiting factors. The disk size, the available memory and CPU resources, the network, and the components that link them all limit the system’s ultimate capacity. Thus, you need to choose your hardware carefully and configure the hardware and operating system appropriately. For example, if your workload is I/O-bound, one approach is to design your application to minimize MySQL’s I/O workload. However, it’s often smarter to upgrade the I/O subsystem, install more memory, or reconfigure existing disks. If you’re running in a cloud-hosted environment, the information in this chapter is still very useful, especially for understanding filesystem limitations and Linux I/O schedulers.

What Limits MySQL’s Performance?

Many different hardware components can affect MySQL’s performance, but the two most frequent bottlenecks we see are CPU and I/O saturation. CPU saturation happens when MySQL works with data that either fits in memory or can be read from disk as fast as needed. A lot of datasets fit completely in memory with the large amounts of RAM available these days.

I/O saturation, on the other hand, generally happens when you need to work with much more data than you can fit in memory.

How to Select CPUs for MySQL

You should consider whether your workload is CPU-bound when upgrading current hardware or purchasing new hardware.

You can identify a CPU-bound workload by checking the CPU utilization, but instead of looking only at how heavily your CPUs are loaded overall, look at the balance of CPU usage and I/O for your most important queries, and notice whether the CPUs are loaded evenly.

Broadly speaking, you likely have two goals for your server:

Low latency (fast response time)
To achieve this you need fast CPUs because each query will use only a single CPU.
High throughput
If you can run many queries at the same time, you might benefit from multiple CPUs to service the queries.

If your workload doesn’t utilize all of your CPUs, MySQL can still use the extra CPUs for background tasks such as purging InnoDB buffers, network operations, and so on. However, these jobs are usually minor compared to executing queries.

Balancing Memory and Disk Resources

The biggest reason to have a lot of memory isn’t so you can hold a lot of data in memory: it’s ultimately so you can avoid disk I/O, which is orders of magnitude slower than accessing data in memory. The trick is to balance the memory and disk size, speed, cost, and other qualities so you get good performance for your workload.

Caching, Reads, and Writes

If you have enough memory, you can insulate the disk from read requests completely. If all your data fits in memory, every read will be a cache hit once the server’s caches are warmed up. There will still be logical reads from memory, but no physical reads from disk. Writes are a different matter, though. A write can be performed in memory just as a read can, but sooner or later it has to be written to the disk so it’s permanent. In other words, a cache can delay writes, but caching cannot eliminate writes as it can for reads.

In fact, in addition to allowing writes to be delayed, caching can permit them to be grouped together in two important ways:

Many writes, one flush
A single piece of data can be changed many times in memory without all of the new values being written to disk. When the data is eventually flushed to disk, all the modifications that happened since the last physical write are permanent. For example, many statements could update an in-memory counter. If the counter is incremented 100 times and then written to disk, 100 modifications have been grouped into one write.
I/O merging
Many different pieces of data can be modified in memory and the modifications can be collected together, so the physical writes can be performed as a single disk operation.

This is why many transactional systems use a write-ahead logging strategy. Write-ahead logging lets them make changes to the pages in memory without flushing the changes to disk, which usually involves random I/O and is very slow. Instead, they write a record of the changes to a sequential log file, which is much faster. A background thread can flush the modified pages to disk later; when it does, it can optimize the writes.

Writes benefit greatly from buffering, because it converts random I/O into more sequential I/O. Asynchronous (buffered) writes are typically handled by the operating system and are batched so they can be flushed to disk more optimally. Synchronous (unbuffered) writes have to be written to disk before they finish. That’s why they benefit from buffering in a RAID controller’s battery-backed write-back cache (we discuss RAID a bit later).

What’s Your Working Set?

Every application has a “working set” of data--that is, the data that it really needs to do its work. A lot of databases also have plenty of data that is not in the working set. You can imagine the database as a desk with filing drawers. The working set consists of the papers you need to have on the desktop to get your work done. The desktop represents main memory in this analogy, while the filing drawers are the hard disks.

Just as you don’t need to have every piece of paper on the desktop to get your work done, you don’t need the whole database to fit in memory for optimal performance - just the working set.

When dealing with hard disk drives (HDDs), it was good practice to try to find an effective memory-to-disk ratio. This was largely due to the slower latency and low IOPS of HDDs. With solid-state drives (SSDs), it becomes far less important.

Solid-State Storage

Solid-state (flash) storage is the standard for most database systems, especially OLTP. Only on very large data warehouses or very legacy systems would you typically find HDDs. This shift came as the price of SSDs dropped significantly around 2015.

Solid-state storage devices use nonvolatile flash memory chips composed of cells, instead of magnetic platters. They’re also called NVRAM, or nonvolatile random access memory. They have no moving parts, which makes them behave very differently from hard drives.
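
On GNU/Linux, a quick way to check whether the kernel sees a device as flash or as a spinning disk is the rotational flag in sysfs. This is just a convenience check; the device name here is an assumption:

$ cat /sys/block/sda/queue/rotational
0

A value of 0 means non-rotational storage (an SSD or NVMe device); 1 means a rotating hard drive.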

Here’s a quick summary of flash performance. High-quality flash devices have:

  • Much better random read and write performance compared to hard drives. Flash devices are usually slightly better at reads than writes.
  • Better sequential read and write performance than hard drives. However, it’s not as dramatic an improvement as that of random I/O, because hard drives are much slower at random I/O than they are at sequential I/O.
  • Much better support for concurrency than hard drives. Flash devices can support many more concurrent operations, and in fact, they don’t really achieve their top throughput until you have lots of concurrency.

The most important things are improvements in random I/O and concurrency. Flash memory gives you very good random I/O performance at high concurrency.

An Overview of Flash Memory

Hard drives with spinning platters and oscillating heads had inherent limitations and characteristics that are consequences of the physics involved. The same is true of solid-state storage, which is built on top of flash memory. Don’t get the idea that solid-state storage is simple. It’s actually more complex than a hard drive in some ways. The limitations of flash memory are actually pretty severe and hard to overcome, so the typical solid-state device has an intricate architecture with lots of abstractions, caching, and proprietary “magic.”

The most important characteristic of flash memory is that it can be read many times rapidly, and in small units, but writes are much more challenging. You can’t rewrite a cell without a special erase operation, and you can erase only in large blocks--for example, 512 KB. The erase cycle is slow and eventually wears out the block. The number of erase cycles a block can tolerate depends on the underlying technology it uses; more about this later.

The limitations on writes are the reason for the complexity of solid-state storage. This is why some devices provide stable, consistent performance, and others don’t. The magic is all in the proprietary firmware, drivers, and other bits and pieces that make a solid-state device run. To make write operations perform well and avoid wearing out the blocks of flash memory prematurely, the device must be able to relocate pages and perform garbage collection and so-called wear leveling. The term write amplification is used to describe the additional writes caused by moving data from place to place, writing data and metadata multiple times due to partial block writes.

Garbage Collection

Garbage collection is important to understand. In order to keep some blocks fresh and ready for new writes, the device reclaims blocks. This requires some free space on the device. Either the device will have some reserved space internally that you can’t see, or you will need to reserve space yourself by not filling it up all the way - this varies from device to device. Either way, as the device fills up, the garbage collector has to work harder to keep some blocks clean, so the write amplification factor increases.

As a result, many devices get slower as they fill up. How much slower is different for every vendor and model and depends on the device’s architecture. Some devices are designed for high performance even when they are pretty full, but in general, a 100 GB file will perform differently on a 160 GB SSD than on a 320 GB SSD. The slowdown is caused by having to wait for erases to complete when there are no free blocks. A write to a free block takes a couple of hundred microseconds, but an erase is much slower--typically a few milliseconds.
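
One thing the operating system can do to help the garbage collector is to tell the device which blocks the filesystem no longer needs, either by mounting with the discard option or by running fstrim periodically. This is only a sketch under assumptions: the path must be a mount point, and the device and filesystem must support TRIM:

fstrim -v /var/lib/mysql    # discard unused blocks on the filesystem mounted here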

RAID Performance Optimization

Storage engines often keep their data and/or indexes in single large files, which means RAID (Redundant Array of Inexpensive Disks) is usually the most feasible option for storing a lot of data. RAID can help with redundancy, storage size, caching, and speed. But as with the other optimizations we’ve been looking at, there are many variations on RAID configurations, and it’s important to choose one that’s appropriate for your needs.

We won’t cover every RAID level here, or go into the specifics of exactly how the different RAID levels store data. Instead, we focus on how RAID configurations satisfy a database server’s needs. The most important RAID levels are:

RAID 0
RAID 0 is the cheapest and highest-performance RAID configuration, at least when you measure cost and performance simplistically (if you include data recovery, for example, it starts to look more expensive). Because it offers no redundancy, we recommend RAID 0 only for servers you don’t care about, such as replicas or servers that are “disposable” for some reason.
Again, note that RAID 0 does not provide any redundancy, even though “redundant” is the R in the RAID acronym. In fact, the probability of a RAID 0 array failing is actually higher than the probability of any single disk failing, not lower!
RAID 1
RAID 1 offers good read performance for many scenarios, and it duplicates your data across disks, so there’s good redundancy. RAID 1 is a little bit faster than RAID 0 for reads. It’s good for servers that handle logging and similar workloads because sequential writes rarely need many underlying disks to perform well (as opposed to random writes, which can benefit from parallelization). It is also a typical choice for low-end servers that need redundancy but have only two hard drives.
RAID 0 and RAID 1 are very simple, and they can often be implemented well in software. Most operating systems will let you create software RAID 0 and RAID 1 volumes easily (see the mdadm sketch after this list).
RAID 5
RAID 5 is a little scary, but it’s the inevitable choice for some applications because of price constraints and/or constraints on the number of disks that can physically fit in the server. It spreads the data across many disks, with distributed parity blocks so that if any one disk fails the data can be rebuilt from the parity blocks. If two disks fail, the entire volume fails unrecoverably. In terms of cost per unit of storage, it’s the most economical redundant configuration, because you lose only one disk’s worth of storage space across the entire array.
Random writes are expensive in RAID 5 because each write to the volume requires two reads and two writes to the underlying disks to compute and store the parity bits. Writes can perform a little better if they are sequential, or if there are many physical disks. On the other hand, both random and sequential reads perform decently. RAID 5 is an acceptable choice for data volumes, or data and logs, for many read-mostly workloads, where the cost of the extra I/O operations for writes isn’t a big deal.
The biggest performance cost with RAID 5 occurs if a disk fails because the data has to be reconstructed by reading all the other disks. This affects performance severely, and it’s even worse if you have lots of disks. If you’re trying to keep the server online during the rebuild, don’t expect either the rebuild or the array’s performance to be good. If you use RAID 5, it’s best to have some mechanism to fail over and take a machine out of service when there’s a problem. Either way, it’s a good idea to benchmark your system with a failed drive and during recovery, so you know what to expect. The disk performance might degrade by a factor of two or more with a failed drive and by a factor of five or more when rebuilding is in progress, and a server with storage that’s two to five times slower might be disproportionately affected overall.
Other performance costs include limited scalability because of the parity blocks--RAID 5 doesn’t scale well past 10 disks or so--and caching issues. Good RAID 5 performance depends heavily on the RAID controller’s cache, which can conflict with the database server’s needs. We discuss caching a bit later.
One of the mitigating factors for RAID 5 is that it’s so popular. As a result, RAID controllers are often highly optimized for RAID 5, and despite the theoretical limits, smart controllers that use caches well can sometimes perform nearly as well as RAID 10 controllers for some workloads. This might actually reflect that the RAID 10 controllers are less highly optimized, but regardless of the reason, this is what we’ve seen.
RAID 6
The largest issue with RAID 5 is that the loss of two disks is catastrophic. The more disks you have in your array, the higher the probability of a disk failure. RAID 6 curbs that risk by adding a second set of parity blocks, which allows the array to sustain two disk failures and still rebuild. The downside is that calculating the additional parity makes writes slower than RAID 5.
RAID 10
RAID 10 is a very good choice for data storage. It consists of mirrored pairs that are striped, so it scales both reads and writes well. It is fast and easy to rebuild, in comparison to RAID 5. It can also be implemented in software fairly well.
The performance loss when one hard drive goes out can still be significant because that stripe can become a bottleneck. Performance can degrade by up to 50%, depending on the workload. One thing to watch out for is RAID controllers that use a “concatenated mirror” implementation for RAID 10. This is suboptimal because of the absence of striping: your most frequently accessed data might be placed on only one pair of disks, instead of being spread across many, so you’ll get poor performance.
RAID 50
RAID 50 consists of RAID 5 arrays that are striped, and it can be a good compromise between the economy of RAID 5 and the performance of RAID 10 if you have many disks. This is mainly useful for very large datasets, such as data warehouses or extremely large OLTP systems.
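
Because RAID 0 and RAID 1 are easy to implement in software, here is a minimal sketch of building a two-disk software RAID 1 array with Linux md. The device names are assumptions, and the command destroys any existing data on those disks:

mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
cat /proc/mdstat    # watch the initial synchronization progress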

The following table summarizes various RAID configurations.

Table 3-1. Comparison of RAID levels
Level     Synopsis                               Redundancy   Disks required   Faster reads   Faster writes
RAID 0    Cheap, fast, dangerous                 No           N                Yes            Yes
RAID 1    Fast reads, simple, safe               Yes          2 (usually)      Yes            No
RAID 5    A safety, speed, and cost compromise   Yes          N + 1            Yes            Depends
RAID 6    Like RAID 5 but more resilient         Yes          N + 2            Yes            Depends
RAID 10   Expensive, fast, safe                  Yes          2N               Yes            Yes
RAID 50   For very large data stores             Yes          2(N + 1)         Yes            Yes

RAID Failure, Recovery, and Monitoring

RAID configurations (with the exception of RAID 0) offer redundancy. This is important, but it’s easy to underestimate the likelihood of concurrent disk failures. You shouldn’t think of RAID as a strong guarantee of data safety.

RAID doesn’t eliminate - or even reduce - the need for backups. When there is a problem, the recovery time will depend on your controller, the RAID level, the array size, the disk speed, and whether you need to keep the server online while you rebuild the array.

There is a chance of disks failing at exactly the same time. For example, a power spike or overheating can easily kill two or more disks. What’s more common, however, is two disk failures happening close together. Many such issues can go unnoticed. A common cause is corruption on the physical media holding data that is seldom accessed. This might go undetected for months, until either you try to read the data or another drive fails and the RAID controller tries to use the corrupted data to rebuild the array. The larger the hard drive, the more likely this is to happen.

That’s why it’s important to monitor your RAID arrays. Most controllers offer some software to report on the array’s status, and you need to keep track of this because you might otherwise be totally ignorant of a drive failure. You might miss your opportunity to recover the data, discovering the problem only when a second drive fails and it’s too late. You should configure a monitoring system to alert you when a drive or volume changes to a degraded or failed status.

You can mitigate the risk of latent corruption by actively checking your arrays for consistency at regular intervals. Background Patrol Read, a feature of some controllers that checks for damaged media and fixes it while all the drives are online, can also help avert such problems. As with recovery, extremely large arrays can be slow to check, so make sure you plan accordingly when you create large arrays.
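
For Linux software RAID (md), both status monitoring and consistency checking are easy to script; hardware controllers have their own vendor utilities, which vary too much to show generically. A minimal sketch, assuming an array named /dev/md0:

cat /proc/mdstat                              # quick health overview of all md arrays
mdadm --detail /dev/md0                       # per-array state, including failed members
echo check > /sys/block/md0/md/sync_action    # start a background consistency check
cat /sys/block/md0/md/mismatch_cnt            # mismatches found by the last check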

You can also add a hot spare drive, which is unused and configured as a standby for the controller to automatically use for recovery. This is a good idea if you depend on every server. It’s expensive with servers that have only a few hard drives because the cost of having an idle disk is proportionately higher, but if you have many disks, it’s almost foolish not to have a hot spare. Remember that the probability of a drive failure increases rapidly with more disks.

In addition to monitoring your drives for failures, you should monitor the RAID controller’s battery backup unit and write cache policy. If the battery fails, by default most controllers will disable write caching by changing the cache policy to WriteThrough instead of WriteBack. This can cause a severe drop in performance. Many controllers will also periodically cycle the battery through a learning process, during which time the cache is also disabled. Your RAID controller’s management utility should let you view and configure when the learning cycle is scheduled so that it doesn’t catch you off guard. Newer RAID controllers avoid this by using NVRAM to store uncommitted writes instead of a battery-backed cache. This avoids the entire pain of the learning cycle.

You might also want to benchmark your system with the cache policy set to WriteThrough so you’ll know what to expect. The preferred approach is to schedule your battery learning cycles at low-traffic periods, typically at night or on the weekend. If performance suffers badly enough with WriteThrough, you could also fail over to another server before your learning cycle begins. As a very last resort, you could reconfigure your servers by changing the innodb_flush_log_at_trx_commit and sync_binlog variables to lower durability settings. This will reduce disk utilization during WriteThrough and may offer acceptable performance, but reducing durability has a big impact on how much data you may lose during a database crash and on your ability to recover it, so treat it strictly as a last resort.

RAID Configuration and Caching

You can usually configure the RAID controller itself by entering its setup utility during the machine’s boot sequence, or by running it from the command prompt. Although most controllers offer a lot of options, the two we focus on are the chunk size for striped arrays, and the on-controller cache (also known as the RAID cache; we use the terms interchangeably).

The RAID stripe chunk size

The optimal stripe chunk size is workload- and hardware-specific. In theory, it’s good to have a large chunk size for random I/O, because it means more reads can be satisfied from a single drive.

To see why this is so, consider the size of a typical random I/O operation for your workload. If the chunk size is at least that large, and the data doesn’t span the border between chunks, only a single drive needs to participate in the read. But if the chunk size is smaller than the amount of data to be read, there’s no way to avoid involving more than one drive in the read.

So much for theory. In practice, many RAID controllers don’t work well with large chunks. For example, the controller might use the chunk size as the cache unit in its cache, which could be wasteful. The controller might also match the chunk size, cache size, and read-unit size (the amount of data it reads in a single operation). If the read unit is too large, its cache might be less effective, and it might end up reading a lot more data than it really needs, even for tiny requests.

Also, in practice, it’s hard to know whether any given piece of data will span multiple drives. Even if the chunk size is 16 KB, which matches InnoDB’s page size, you can’t be certain all of the reads will be aligned on 16 KB boundaries. The filesystem might fragment the file, and it will typically align the fragments on the filesystem block size, which is often 4 KB. Some filesystems might be smarter, but you shouldn’t count on it.

You can configure the system so that blocks are aligned all the way from the application down to the underlying storage: InnoDB’s blocks, the filesystem’s blocks, LVM, the partition offset, the RAID stripe, and disk sectors. Our benchmarks showed that when everything is aligned, there can be a performance improvement on the order of 15% to 23% for random reads and random writes, respectively. The exact techniques for aligning everything are too specific to cover here, but there’s a lot of good information on it elsewhere, including http://www.mysqlperformanceblog.com.
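
As a quick sanity check (not the full alignment procedure), you can ask the kernel and parted whether anything is obviously misaligned. The device and partition number here are assumptions:

cat /sys/block/sda/alignment_offset       # nonzero means the device reports an offset
parted /dev/sda align-check optimal 1     # check whether partition 1 is optimally aligned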

The RAID cache

The RAID cache is a (relatively) small amount of memory that is physically installed on a hardware RAID controller. It can be used to buffer data as it travels between the disks and the host system. Here are some of the reasons a RAID card might use the cache:

Caching reads
After the controller reads some data from the disks and sends it to the host system, it can store the data; this will enable it to satisfy future requests for the same data without having to go to disk again.
This is usually a very poor use of the RAID cache. Why? Because the operating system and the database server have their own, much larger, caches. If there’s a cache hit in one of these caches, the data in the RAID cache won’t be used. Conversely, if there’s a miss in one of the higher-level caches, the chance that there’ll be a hit in the RAID cache is vanishingly small. Because the RAID cache is so much smaller, it will almost certainly have been flushed and filled with other data, too. Either way you look at it, it’s a waste of memory to cache reads in the RAID cache.
Caching read-ahead data
If the RAID controller notices sequential requests for data, it might decide to do a read-ahead read - that is, to prefetch data it predicts will be needed soon. It has to have somewhere to put the data until it’s requested, though. It can use the RAID cache for this. The performance impact of this can vary widely, and you should check to ensure it’s actually helping. Read-ahead operations might not help if the database server is doing its own smart read-ahead (as InnoDB does), and it might interfere with the all-important buffering of synchronous writes.
Caching writes
The RAID controller can buffer writes in its cache and schedule them for a later time. The advantage to doing this is twofold: first, it can return “success” to the host system much more quickly than it would be able to if it had to actually perform the writes on the physical disks, and second, it can accumulate writes and do them more efficiently.
Internal operations
Some RAID operations are very complex - especially RAID 5 writes, which have to calculate parity bits that can be used to rebuild data in the event of a failure. The controller needs to use some memory for this type of internal operation.
This is one reason why RAID 5 can perform poorly on some controllers: it needs to read a lot of data into the cache for good performance. Some controllers can’t balance caching writes with caching for the RAID 5 parity operations.

In general, the RAID controller’s memory is a scarce resource that you should try to use wisely. Using it for reads is usually a waste, but using it for writes is an important way to speed up your I/O performance. Many controllers let you choose how to allocate the memory. For example, you can choose how much of it to use for caching writes and how much for reads. For RAID 0, RAID 1, and RAID 10, you should probably allocate 100% of the controller’s memory for caching writes. For RAID 5, you should reserve some of the controller’s memory for its internal operations. This is generally good advice, but it doesn’t always apply - different RAID cards require different configurations.

When you’re using the RAID cache for write caching, many controllers let you configure how long it’s acceptable to delay the writes (one second, five seconds, and so on). A longer delay means more writes can be grouped together and flushed to the disks optimally. The downside is that your writes will be more “bursty.” That’s not a bad thing, unless your application happens to make a bunch of write requests just as the controller’s cache fills up, when it’s about to be flushed to disk. If there’s not enough room for your application’s write requests, it’ll have to wait. Keeping the delay shorter means you’ll have more write operations and they’ll be less efficient, but it smoothes out the spikiness and helps keep more of the cache free to handle bursts from the application. (We’re simplifying here - controllers often have complex, vendor-specific balancing algorithms, so we’re just trying to cover the basic principles.)

The write cache is very helpful for synchronous writes, such as issuing fsync() calls on the transaction logs and creating binary logs with sync_binlog enabled, but you shouldn’t enable it unless your controller has a battery backup unit (BBU) or other non-volatile storage. Caching writes without a BBU is likely to corrupt your database, and even your transactional filesystem, in the event of power loss. If you have a BBU, however, enabling the write cache can increase performance by a factor of 20 or more for workloads that do a lot of log flushes, such as flushing the transaction log when a transaction commits.

A final consideration is that many hard drives have write caches of their own, which can “fake” fsync() operations by lying to the controller that the data has been written to physical media. Hard drives that are attached directly (as opposed to being attached to a RAID controller) can sometimes let their caches be managed by the operating system, but this doesn’t always work either. These caches are typically flushed for an fsync() and bypassed for synchronous I/O, but again, the hard drive can lie. You should either ensure that these caches are flushed on fsync() or disable them, because they are not battery-backed. Hard drives that aren’t managed properly by the operating system or RAID firmware have caused many instances of data loss.
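
For directly attached SATA drives, hdparm can usually show and change the drive’s own write cache; drives behind a RAID controller are normally managed through the controller’s utility instead. A sketch, with the device name as an assumption:

hdparm -W /dev/sda     # query the drive's write-cache setting
hdparm -W0 /dev/sda    # disable the on-drive write cache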

For this and other reasons, it’s always a good idea to do genuine crash testing (literally pulling the power plug out of the wall) when you install new hardware. This is often the only way to find subtle misconfigurations or sneaky hard drive behaviors. A handy script for this can be found at http://brad.livejournal.com/2116715.html.

To test whether you can really rely on your RAID controller’s BBU, make sure you leave the power cord unplugged for a realistic amount of time. Some units don’t last as long without power as they’re supposed to. Here again, one bad link can render your whole chain of storage components useless.

Network Configuration

Just as latency and throughput are limiting factors for a hard drive, latency and bandwidth are limiting factors for a network connection. The biggest problem for most applications is latency; a typical application does a lot of small network transfers, and the slight delay for each transfer adds up.

A network that’s not operating correctly is a major performance bottleneck, too. Packet loss is a common problem. Even 1% loss is enough to cause significant performance degradation because various layers in the protocol stack will try to fix the problems with strategies such as waiting a while and then resending packets, which adds extra time. Another common problem is broken or slow Domain Name System (DNS) resolution.

DNS is enough of an Achilles heel that enabling skip_name_resolve is a good idea for production servers. Broken or slow DNS resolution is a problem for lots of applications, but it’s particularly severe for MySQL. When MySQL receives a connection request, it does both a forward and a reverse DNS lookup. There are lots of reasons that this could go wrong. When it does, it will cause connections to be denied, slow down the process of connecting to the server, and generally wreak havoc, up to and including denial-of-service attacks. If you enable the skip_name_resolve option, MySQL won’t do any DNS lookups at all. However, this also means that your user accounts must have only IP addresses, “localhost,” or IP address wildcards in the host column. Any user account that has a hostname in the host column will not be able to log in.
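
If you decide to enable skip_name_resolve, it can be worth checking first for accounts whose host column contains a hostname, since those will stop working. The REGEXP filter below is only a rough heuristic (it can misfire on IPv6 addresses, for example), so treat this as a sketch:

# List accounts whose host looks like a hostname rather than an IP or wildcard.
mysql -e "SELECT user, host FROM mysql.user WHERE host <> 'localhost' AND host REGEXP '[a-z]'"

# Then add skip_name_resolve to the [mysqld] section of your configuration file.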

It’s usually more important, though, to adjust your settings to deal efficiently with a lot of connections and small queries. One of the more common tweaks is to change your local port range. Linux systems have a range of local ports available for outgoing connections; each connection consumes one, so if you have many simultaneous connections, you can run out of local ports.

Here’s a system that is configured to default values:

[root@server ~]# cat /proc/sys/net/ipv4/ip_local_port_range
32768 61000

Sometimes you might need to change these values to a larger range. For example:

[root@server ~]# echo 1024 65535 > /proc/sys/net/ipv4/ip_local_port_range

The TCP protocol allows a system to queue up incoming connections, like a bucket. If the bucket fills up, clients won’t be able to connect. You can allow more connections to queue up as follows:

[root@server ~]# echo 4096 > /proc/sys/net/ipv4/tcp_max_syn_backlog

For database servers that are used only locally, you can shorten the timeout that comes after closing a socket in the event that the peer is broken and doesn’t close its side of the connection. The default is one minute on most systems, which is rather long:

[root@server ~]# echo <value> > /proc/sys/net/ipv4/tcp_fin_timeout

Most of the time these settings can be left at their defaults. You’ll typically need to change them only when something unusual is happening, such as extremely poor network performance or very large numbers of connections. An Internet search for “TCP variables” will turn up lots of good reading about these and many more variables.
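
Also note that values written directly under /proc/sys do not survive a reboot. If you do change them, here is a sketch of making the changes persistent (the values simply mirror the examples above and are not recommendations):

sysctl -w net.ipv4.ip_local_port_range="1024 65535"    # apply immediately
cat >> /etc/sysctl.conf <<'EOF'
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_max_syn_backlog = 4096
EOF
sysctl -p    # reload settings from /etc/sysctl.conf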

Choosing a Filesystem

Your filesystem choices are pretty dependent on your operating system. In many systems, such as Windows, you really have only one or two choices, and only one (NTFS) is really viable. GNU/Linux, on the other hand, supports many filesystems.

Many people want to know which filesystems will give the best performance for MySQL on GNU/Linux, or, even more specifically, which of the choices is best for InnoDB. The benchmarks actually show that most of them are very close in most respects, but looking to the filesystem for performance is really a distraction. The filesystem’s performance is very workload-specific, and no filesystem is a magic bullet. Most of the time, a given filesystem won’t perform significantly better or worse than any other filesystem. The exception is if you run into some filesystem limit, such as how it deals with concurrency, working with many files, fragmentation, and so on.

It’s more important to consider crash recovery time and whether you’ll run into specific limits, such as slow performance on directories with many files (a notorious problem with ext2 and older versions of ext3, but solved in modern versions of ext3 and ext4 with the dir_index option). The filesystem you choose is very important in ensuring your data’s safety, so we strongly recommend you don’t experiment on production systems.

When possible, it’s best to use a journaling filesystem, such as ext3, ext4, XFS, ZFS, or JFS. If you don’t, a filesystem check after a crash can take a long time. If the system is not very important, non-journaling filesystems might perform better than transactional ones. For example, ext2 might perform better than ext3, or you can use tunefs to disable the journaling feature on ext3. Mount time is also a factor for some filesystems. ReiserFS, for instance, can take a long time to mount and perform journal recovery on large partitions.

If you use ext3 or its successor ext4, you have three options for how the data is journaled, which you can place in the /etc/fstab mount options:

data=writeback
This option means only metadata writes are journaled. Writes to the metadata are not synchronized with the data writes. This is the fastest configuration, and it’s usually safe to use with InnoDB because it has its own transaction log. The exception is that a crash at just the right time could cause corruption in a .frm file.
Here’s an example of how this configuration could cause problems. Say a program decides to extend a file to make it larger. The metadata (the file’s size) will be logged and written before the data is actually written to the (now larger) file. The result is that the file’s tail--the newly extended area - contains garbage.
data=ordered
This option also journals only the metadata, but it provides some consistency by writing the data before the metadata so that they stay consistent. It’s only slightly slower than the writeback option, and it’s much safer when there’s a crash.
In this configuration, if we suppose again that a program wants to extend a file, the file’s metadata won’t reflect the file’s new size until the data that resides in the newly extended area has been written.
data=journal
This option provides atomic journaled behavior, writing the data to the journal before it’s written to the final location. It is usually unnecessary and has much higher overhead than the other two options. However, in some cases, it can improve performance because the journaling lets the filesystem delay the writes to the data’s final location.
Regardless of the filesystem, there are some specific options that it’s best to disable because they don’t provide any benefit and can add quite a bit of overhead. The most famous is recording access time, which requires a write even when you’re reading a file or directory. To disable this option, add the noatime,nodiratime mount options to your /etc/fstab; this can sometimes boost performance by as much as 5–10%, depending on the workload and the filesystem (although it might not make much difference in other cases). Here’s a sample /etc/fstab line for the ext3 options we mentioned:

/dev/sda2 /usr/lib/mysql ext3 noatime,nodiratime,data=writeback 0 1
You can also tune the filesystem’s read-ahead behavior because it might be redundant. For example, InnoDB does its own read-ahead prediction. Disabling or limiting read-ahead is especially beneficial on Solaris’s UFS. Using innodb_flush_method=O_DIRECT automatically disables read-ahead.
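
On GNU/Linux, you can inspect and lower the block device read-ahead with blockdev. The value is in 512-byte sectors, and the device name and setting below are assumptions, not recommendations:

blockdev --getra /dev/sda        # current read-ahead, in 512-byte sectors
blockdev --setra 256 /dev/sda    # set read-ahead to 128 KB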

Some filesystems don’t support features you might need. For example, support for direct I/O might be important if you’re using the O_DIRECT flush method for InnoDB. Also, some filesystems handle a large number of underlying drives better than others; XFS is often much better at this than ext3, for instance. Finally, if you plan to use LVM snapshots for initializing replicas or taking backups, you should verify that your chosen filesystem and LVM version work well together.

The next table summarizes the characteristics of some common filesystems.

Table 3-2. Common filesystem characteristics
Filesystem      Operating system    Journaling   Large directories
ext2            GNU/Linux           No           No
ext3            GNU/Linux           Optional     Optional/partial
ext4            GNU/Linux           Yes          Yes
HFS Plus        Mac OS              Optional     Yes
JFS             GNU/Linux           Yes          No
NTFS            Windows             Yes          Yes
ReiserFS        GNU/Linux           Yes          Yes
UFS (Solaris)   Solaris             Yes          Tunable
UFS (FreeBSD)   FreeBSD             No           Optional/partial
UFS2            FreeBSD             No           Optional/partial
XFS             GNU/Linux           Yes          Yes
ZFS             Solaris, FreeBSD    Yes          Yes

We usually recommend using the XFS filesystem. The ext3 filesystem just has too many serious limitations, such as its single mutex per inode, and bad behavior such as flushing all dirty blocks in the whole filesystem on fsync() instead of just one file’s dirty blocks. The ext4 filesystem is an acceptable choice, although there have been performance bottlenecks in specific kernel versions that you should investigate before committing to it.

When considering any filesystem for a database, it’s good to consider how long it has been available, how mature it is and how proven it has been in production environments. The filesystem bits are the very lowest level of data integrity you have in a database.

Choosing a Disk Queue Scheduler

On GNU/Linux, the queue scheduler determines the order in which requests to a block device are actually sent to the underlying device. The default is Completely Fair Queueing, or cfq. It’s okay for casual use on laptops and desktops, where it helps prevent I/O starvation, but it’s terrible for servers. It causes very poor response times under the types of workload that MySQL generates, because it stalls some requests in the queue needlessly.

You can see which schedulers are available, and which one is active, with the following command:

$ cat /sys/block/sda/queue/scheduler
noop deadline [cfq]

You should replace sda with the device name of the disk you’re interested in. In our example, the square brackets indicate which scheduler is in use for this device. The other two choices are suitable for server-class hardware, and in most cases they work about equally well. The noop scheduler is appropriate for devices that do their own scheduling behind the scenes, such as hardware RAID controllers and SANs, and deadline is fine both for RAID controllers and disks that are directly attached. Our benchmarks show very little difference between these two. The main thing is to use anything but cfq, which can cause severe performance problems.
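
Switching the scheduler at runtime is just a matter of echoing a name into the same file. Newer kernels that use blk-mq expose mq-deadline, none, kyber, and bfq instead of the names shown here, and making the choice persistent (via a udev rule or a kernel boot parameter) is distribution-specific:

echo deadline > /sys/block/sda/queue/scheduler
cat /sys/block/sda/queue/scheduler    # verify: noop [deadline] cfq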

Memory and Swapping

MySQL performs best with a large amount of memory allocated to it. As we learned in the first chapter, InnoDB uses memory as a cache to avoid disk access. This means that the performance of the memory system can have a direct impact on how fast queries are served. Even today, one of the best ways to improve memory performance is to replace the built-in memory allocator (glibc) with an external one such as tcmalloc or jemalloc. Numerous benchmarks1 have shown that both of these offer improved performance and reduced memory fragmentation when compared with glibc.
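
There are several ways to load an alternative allocator into mysqld; one common approach is to preload it with systemd. This is only a sketch: the library path, service name, and drop-in mechanism are assumptions that vary by distribution and MySQL packaging.

# Add an LD_PRELOAD via a systemd drop-in, then restart the service:
systemctl edit mysqld
#   [Service]
#   Environment="LD_PRELOAD=/usr/lib64/libjemalloc.so.2"
systemctl restart mysqld

# Verify the allocator is actually mapped into the running process:
grep -i jemalloc /proc/$(pidof mysqld)/maps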

Swapping occurs when the operating system writes some virtual memory to disk because it doesn’t have enough physical memory to hold it. Swapping is transparent to processes running on the operating system. Only the operating system knows whether a particular virtual memory address is in physical memory or on disk.

When dealing with legacy HDDs, swapping was very bad for MySQL’s performance. It defeats the purpose of caching in memory, and it results in lower efficiency than using too little memory for the caches. MySQL and its storage engines have many algorithms that treat in-memory data differently from data on disk, because they assume that in-memory data is cheap to access. Because swapping is invisible to user processes, MySQL (or the storage engine) won’t know when data it thinks is in memory is actually moved onto the disk.

The result can be very poor performance. For example, if the storage engine thinks the data is still in memory, it might decide it’s OK to lock a global mutex (such as the InnoDB buffer pool mutex) for a “short” memory operation. If this operation actually causes disk I/O, it can stall everything until the I/O completes. This means swapping is much worse than simply doing I/O as needed.

When using SSDs, the performance penalty isn’t nearly as sharp, but you should still actively avoid swapping, even if just to avoid unnecessary writes that may shorten the drive’s lifespan. You may also consider running with no swap at all, which forgoes the possibility entirely but puts you in a situation where running out of memory may lead to process termination.

On GNU/Linux, you can monitor swapping with vmstat (we show some examples in the next section). You need to look at the swap I/O activity, reported in the si and so columns, rather than the swap usage, which is reported in the swpd column. The swpd column can show processes that have been loaded but aren’t being used, which are not really problematic. We like the si and so column values to be 0, and they should definitely be less than 10 blocks per second.

In extreme cases, too much memory allocation can cause the operating system to run out of swap space. If this happens, the resulting lack of virtual memory can crash MySQL. But even if it doesn’t run out of swap space, very active swapping can cause the entire operating system to become unresponsive, to the point that you can’t even log in and kill the MySQL process. Sometimes the Linux kernel can even hang completely when it runs out of swap space. We recommend you run your databases without using swap space at all. Disk is still an order of magnitude slower than RAM and this avoids all of the headaches mentioned here.

Another thing that frequently happens under extreme virtual memory pressure is that the out-of-memory (OOM) killer process will kick in and kill something. This is frequently MySQL, but it can also be another process such as SSH, which can leave you with a system that’s not accessible from the network. You can prevent this by setting the SSH process’s oom_adj or oom_score_adj value. When working with dedicated database servers, we highly recommend that you identify any key processes like MySQL and SSH and proactively adjust the OOM killer score to prevent those from being selected first for termination.
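
Here is a sketch of adjusting OOM scores for key processes (the values and process names are assumptions, and changes written to /proc last only for the life of the process):

# Make sshd and mysqld far less attractive targets for the OOM killer.
echo -800 > /proc/$(pidof -s sshd)/oom_score_adj
echo -800 > /proc/$(pidof -s mysqld)/oom_score_adj

# With systemd, OOMScoreAdjust=-800 in the service unit makes this persistent.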

You can solve most swapping problems by configuring your MySQL buffers correctly, but sometimes the operating system’s virtual memory system decides to swap MySQL anyway, sometimes related to how NUMA works2 in Linux. This usually happens when the operating system sees a lot of I/O from MySQL, so it tries to increase the file cache to hold more data. If there’s not enough memory, something must be swapped out, and that something might be MySQL itself. Some older Linux kernel versions also have counterproductive priorities that swap things when they shouldn’t, but this has been alleviated a bit in more recent kernels.

Operating systems usually allow some control over virtual memory and I/O. We mention a few ways to control them on GNU/Linux. The most basic is to change the value of /proc/sys/vm/swappiness to a low value, such as 0 or 1. This tells the kernel not to swap unless the need for virtual memory is extreme. For example, here’s how to check the current value:

$ cat /proc/sys/vm/swappiness
60

The value shown, 60, is the default swappiness setting (the range is from 0 to 100). This is a very bad default for servers. It’s only appropriate for laptops. Servers should be set to 0:

$ echo 0 > /proc/sys/vm/swappiness

Another option is to change how the storage engines read and write data. For example, using innodb_flush_method=O_DIRECT relieves I/O pressure. Direct I/O is not cached, so the operating system doesn’t see it as a reason to increase the size of the file cache. This parameter works only for InnoDB.

Another option is to use MySQL’s memlock configuration option, which locks MySQL in memory. This will avoid swapping, but it can be dangerous: if there’s not enough lockable memory left, MySQL can crash when it tries to allocate more memory. Problems can also be caused if too much memory is locked and there’s not enough left for the operating system.

Many of the tricks are specific to a kernel version, so be careful, especially when you upgrade. In some workloads, it’s hard to make the operating system behave sensibly, and your only recourse might be to lower the buffer sizes to suboptimal values.

Operating System Status

Your operating system provides tools to help you find out what the operating system and hardware are doing. In this section we’ll show you examples of how to use two widely available tools, iostat and vmstat. If your system doesn’t provide either of these tools, chances are it will provide something similar. Thus, our goal isn’t to make you an expert at using iostat or vmstat, but simply to show you what to look for when you’re trying to diagnose problems with tools such as these.

In addition to these tools, your operating system might provide others, such as mpstat or sar. If you’re interested in other parts of your system, such as the network, you might want to instead use tools such as ifconfig (which shows how many network errors have occurred, among other things) or netstat.

By default, vmstat and iostat produce just one report showing the average values of various counters since the server was started, which is not very useful. However, you can give both tools an interval argument. This makes them generate incremental reports showing what the server is doing right now, which is much more relevant. (The first line shows the statistics since the system was started; you can just ignore this line.)

How to Read vmstat Output

Let’s look at an example of vmstat first. To make it print out a new report every five seconds, use the following command:

$ vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 2632 25728 23176 740244 0 0 527 521 11 3 10 1 86 3
0 0 2632 27808 23180 738248 0 0 2 430 222 66 2 0 97 0

You can stop vmstat with Ctrl-C. The output you see depends on your operating system, so you might need to read the manual page to figure it out.

As stated earlier, even though we asked for incremental output, the first line of values shows the averages since the server was booted. The second line shows what’s happening right now, and subsequent lines will show what’s happening at five-second intervals. The columns are grouped by one of the following headers:

procs
The r column shows how many processes are waiting for CPU time. The b column shows how many are in uninterruptible sleep, which generally means they’re waiting for I/O (disk, network, user input, and so on).
memory
The swpd column shows how many blocks are swapped out to disk (paged). The remaining three columns show how many blocks are free (unused), how many are being used for buffers, and how many are being used for the operating system’s cache.
swap
These columns show swap activity: how many blocks per second the operating system is swapping in (from disk) and out (to disk). They are much more important to monitor than the swpd column.
We like to see si and so at 0 most of the time, and we definitely don’t like to see more than 10 blocks per second. Bursts are also bad.
io
These columns show how many blocks per second are read in from (bi) and written out to (bo) block devices. This usually reflects disk I/O.
system
These columns show the number of interrupts per second (in) and the number of context switches per second (cs).
cpu
These columns show the percentages of total CPU time spent running user (non-kernel) code, running system (kernel) code, idle, and waiting for I/O. A possible fifth column (st) shows the percent “stolen” from a virtual machine if you’re using virtualization. This refers to the time during which something was runnable on the virtual machine, but the hypervisor chose to run something else instead. If the virtual machine doesn’t want to run anything and the hypervisor runs something else, that doesn’t count as stolen time.

The vmstat output is system-dependent, so you should read your system’s vmstat(8) manpage if yours looks different from the sample we’ve shown. One important note: the memory, swap, and I/O statistics are in blocks, not in bytes. In GNU/Linux, blocks are usually 1,024 bytes.

How to Read iostat Output

Now let’s move on to iostat. By default, it shows some of the same CPU usage information as vmstat. We’re usually interested in just the I/O statistics, though, so we use the following command to show only extended device statistics:

$ iostat -dx 5
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 1.6 2.8 2.5 1.8 138.8 36.9 40.7 0.1 23.2 6.0 2.6

As with vmstat, the first report shows averages since the server was booted (we generally omit it to save space), and the subsequent reports show incremental averages. There’s one line per device.

There are various options that show or hide columns. The official documentation is a bit confusing, and we had to dig into the source code to figure out what was really being shown. Here’s what each column is showing:

rrqm/s and wrqm/s
The number of merged read and write requests queued per second. “Merged” means the operating system took multiple logical requests from the queue and grouped them into a single request to the actual device.
r/s and w/s
The number of read and write requests sent to the device per second.
rsec/s and wsec/s
The number of sectors read and written per second. Some systems also output rkB/s and wkB/s, the number of kilobytes read and written per second. We omit those for brevity.
avgrq-sz
The request size in sectors.
avgqu-sz
The number of requests waiting in the device’s queue.
await
The number of milliseconds spent in the disk queue. Unfortunately, iostat doesn’t show separate statistics for read and write requests, which are so different that they really shouldn’t be averaged together. This is often very important when you’re trying to diagnose a performance issue.
svctm
The number of milliseconds spent servicing requests, excluding queue time.
%util
The percentage of time during which at least one request was active.3 This is very confusingly named. It is not the device’s utilization, if you’re familiar with the standard definition of utilization in queueing theory. A device with more than one hard drive (such as a RAID controller) should be able to support a higher concurrency than 1, but %util will never exceed 100% unless there’s a rounding error in the math used to compute it. As a result, it is not a good indication of device saturation, contrary to what the documentation says, except in the special case where you’re looking at a single physical hard drive.

You can use the output to deduce some facts about a machine’s I/O subsystem. One important metric is the number of requests served concurrently. Because the reads and writes are per second and the service time’s unit is thousandths of a second, you can use Little’s Law to derive the following formula for the number of concurrent requests the device is serving:

concurrency = (r/s + w/s) * (svctm/1000)
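
If you want to compute this on the fly, a rough awk one-liner works, assuming the column layout matches the iostat output shown here (column positions differ between iostat versions, so adjust as needed):

$ iostat -dx 5 | awk '$1 == "sda" { print ($4 + $5) * $11 / 1000 }'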

Here’s a sample of iostat output:

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 105 311 298 820 3236 9052 10 127 113 9 96

Plugging the numbers into the concurrency formula gives a concurrency of about 9.6. This means that on average, the device was serving 9.6 requests at a time during the sampling interval. The sample is from a 10-disk RAID 10 volume, so the operating system is parallelizing requests to this device quite well. On the other hand, here’s a device that appears to be serializing requests instead:

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sdc 81 0 280 0 3164 0 11 2 7 3 99

Other Helpful Tools

We’ve shown vmstat and iostat because they’re widely available, and vmstat is usually installed by default on many Unix-like operating systems. However, each of these tools has its limitations, such as confusing units of measurement, sampling at intervals that don’t correspond to when the operating system updates the statistics, and the inability to see all of the metrics at once. If these tools don’t meet your needs, you might be interested in dstat (http://dag.wieers.com/home-made/dstat/) or collectl (http://collectl.sourceforge.net).

We also like to use mpstat to watch CPU statistics; it provides a much better idea of how the CPUs are behaving individually, instead of grouping them all together. Sometimes this is very important when you’re diagnosing a problem. You might find blktrace to be helpful when you’re examining disk I/O usage, too.

Percona wrote their own replacement for iostat, called pt-diskstats. It’s part of Percona Toolkit. It addresses some of the complaints about iostat, such as the way that it presents reads and writes in aggregate, and the lack of visibility into concurrency. It is also interactive and keystroke-driven, so you can zoom in and out, change the aggregation, filter out devices, and show and hide columns. It is a great way to slice and dice a sample of disk statistics, which you can gather with a simple shell script even if you don’t have the tool installed. You can capture samples of disk activity and email or save them for later analysis.
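
If you want to collect raw samples for later analysis, a minimal sketch looks like the following; the filename and interval are arbitrary, and you should check the sample format pt-diskstats expects before relying on it:

while true; do
    date +%s >> diskstats-samples.txt
    cat /proc/diskstats >> diskstats-samples.txt
    sleep 1
done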

Summary

Choosing and configuring hardware for MySQL, and configuring MySQL for the hardware, is not a mystical art. In general, you need the same skills and knowledge that you need for most other purposes. However, there are some MySQL-specific things you should know.

What we commonly suggest for most people is to find a good balance between performance and cost. First, we like to use commodity servers, for many reasons. For example, if you’re having trouble with a server and you need to take it out of service while you try to diagnose it, or if you simply want to try swapping it with another server as a form of diagnosis, this is a lot easier to do with a $5,000 server than one that costs $50,000 or more. MySQL is also typically a better fit--both in terms of the software itself and in terms of the typical workloads it runs--for commodity hardware.

The four fundamental resources MySQL needs are CPU, memory, disk, and network resources. The network doesn’t tend to show up as a serious bottleneck very often, but CPUs, memory, and disks certainly do. The right balance of speed and quantity depends on the workload: strive for both fast and many, as far as your budget allows. The more concurrency you expect, the more you should lean on additional CPUs to accommodate your workload.

The relationship between CPUs, memory, and disks is intricate, with problems in one area often showing up elsewhere. Before you throw resources at a problem, ask yourself whether you should be throwing resources at a different problem instead. If you’re I/O bound, do you need more I/O capacity, or just more memory? The answer hinges on the working set size, which is the set of data that’s needed most frequently over a given duration.

Solid-state devices are great for improving server performance overall, and should generally be the standard for databases now, especially OLTP workloads. The only reason to continue using HDDs is in extremely budget constrained systems or ones where you need a staggeringly high amount of disk space - on the order of petabytes in a data warehousing situation.

In terms of the operating system, there are just a few Big Things that you need to get right, mostly related to storage, networking, and virtual memory management. If you use GNU/Linux, as most MySQL users do, we suggest using the XFS filesystem and setting the swappiness and disk queue scheduler to values that are appropriate for a server. There are some network parameters that you might need to change, and you might wish to tweak a number of other things (such as disabling SELinux), but those changes are a matter of preference.

1 See https://www.percona.com/blog/2012/07/05/impact-of-memory-allocators-on-mysql-performance/ and https://blog.herecura.eu/blog/2020-04-23-mysql-memory-usage/ for comparisons.

2 https://blog.jcole.us/2010/09/28/mysql-swap-insanity-and-the-numa-architecture/

3 Software RAID, such as md, may not show utilization for the RAID array itself.
