Chapter 24. Deploying MongoDB

This chapter gives recommendations for setting up a server to go into production. In particular, it covers:

  • Choosing what hardware to buy and how to set it up

  • Using virtualized environments

  • Important kernel and disk I/O settings

  • Network setup: who needs to connect to whom

Designing the System

You generally want to optimize for data safety and the quickest access you can afford. This section discusses the best way to accomplish these goals when choosing disks, RAID configuration, CPUs, and other hardware and low-level software components.

Choosing a Storage Medium

In order of preference, we would like to store and retrieve data from:

  1. RAM

  2. SSD

  3. Spinning disk

Unfortunately, most people have limited budgets or enough data that storing everything in RAM is impractical and SSDs are too expensive. Thus, the typical deployment is a small amount of RAM (relative to total data size) and a lot of space on a spinning disk. If you are in this camp, the important thing is that your working set is smaller than RAM, and you should be ready to scale out if the working set gets bigger.

If you are able to spend what you like on hardware, buy a lot of RAM and/or SSDs.

Reading data from RAM takes a few nanoseconds (say, 100). Conversely, reading from disk takes a few milliseconds (say, 10). It can be hard to picture the difference between these two numbers, so let’s scale them up to more relatable numbers: if accessing RAM took 1 second, accessing the disk would take over a day!

100 nanoseconds × 10,000,000 = 1 second

10 milliseconds × 10,000,000 = 1.16 days

These are very back-of-the-envelope calculations (your disk might be a bit faster or your RAM a bit slower), but the magnitude of this difference doesn’t change much. Thus, we want to access the disk as seldom as possible.

Recommended RAID Configurations

RAID is hardware or software that lets you treat multiple disks as though they were a single disk. It can be used for reliability, performance, or both. A set of disks using RAID is referred to as a RAID array (somewhat redundantly, as RAID stands for redundant array of inexpensive disks).

There are a number of ways to configure RAID, depending on the features you’re looking for—generally some combination of speed and fault tolerance. These are the most common varieties:

RAID0

Striping disks for improved performance. Each disk holds part of the data, similar to MongoDB’s sharding. Because there are multiple underlying disks, lots of data can be written to disk at the same time. This improves throughput on writes. However, if a disk fails and data is lost, there are no copies of it. It also can cause slow reads, as some data volumes may be slower than others.

RAID1

Mirroring for improved reliability. An identical copy of the data is written to each member of the array. This has lower performance than RAID0, as a single member with a slow disk can slow down all writes. However, if a disk fails, you will still have a copy of the data on another member of the array.

RAID5

Striping disks, plus keeping an extra piece of data about the other data that’s been stored to prevent data loss on server failure. Basically, RAID5 can handle one disk going down and hide that failure from the user. However, it is slower than any of the other varieties listed here because it needs to calculate this extra piece of information whenever data is written. This is particularly expensive with MongoDB, as a typical workload does many small writes.

RAID10

A combination of RAID0 and RAID1: data is striped for speed and mirrored for reliability.

We recommend using RAID10: it is safer than RAID0 and can smooth out performance issues that can occur with RAID1. However, some people feel that RAID1 on top of replica sets is overkill and opt for RAID0. It is a matter of personal preference: how much risk are you willing to trade for performance?

Do not use RAID5: it is very, very slow.
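If you are building the array yourself on Linux, mdadm is the usual tool. Here is a minimal sketch of creating a four-disk RAID10 array and putting a filesystem on it (the device names are placeholders; many deployments use hardware RAID or cloud-provider volumes instead):

sudo mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
sudo mkfs.xfs /dev/md0

The filesystem choice itself is discussed later in this chapter.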

CPU

MongoDB historically was very light on CPU, but with the use of the WiredTiger storage engine this is no longer the case. The WiredTiger storage engine is multithreaded and can take advantage of additional CPU cores. You should therefore balance your investment between memory and CPU.

When choosing between speed and number of cores, go with speed. MongoDB is better at taking advantage of more cycles on a single processor than increased parallelization.

Operating System

64-bit Linux is the operating system MongoDB runs best on. If possible, use some flavor of that. CentOS and Red Hat Enterprise Linux are probably the most popular choices, but any flavor should work (Ubuntu and Amazon Linux are also common). Be sure to use the most recent stable version of the operating system, because old, buggy packages or kernels can sometimes cause issues.

64-bit Windows is also well supported.

Other flavors of Unix are not as well supported: proceed with caution if you’re using Solaris or one of the BSD variants. Builds for these systems have, at least historically, had a lot of issues. MongoDB explicitly stopped supporting Solaris in August 2017, noting a lack of adoption among users.

One important note on cross-compatibility: MongoDB uses the same wire protocol and lays out data files identically on all systems, so you can deploy on a combination of operating systems. For example, you could have a mongos process running on Windows and the mongods that are its shards running on Linux. You can also copy data files from Windows to Linux or vice versa with no compatibility issues.

Since version 3.4, MongoDB no longer supports 32-bit x86 platforms. Do not run any type of MongoDB server on a 32-bit machine.

MongoDB works with little-endian architectures and one big-endian architecture: IBM’s zSeries. Most drivers support both little- and big-endian systems, so you can run clients on either. However, the server will typically be run on a little-endian machine.

Swap Space

You should allocate a small amount of swap so that, if memory limits are reached, the kernel does not kill MongoDB. MongoDB doesn’t usually use any swap space, but in extreme circumstances the WiredTiger storage engine might use some. If this occurs, you should consider increasing your machine’s memory capacity or reviewing your workload, since swapping is bad for both performance and stability.

The majority of memory MongoDB uses is “slippery”: it’ll be flushed to disk and replaced with other memory as soon as the system requests the space for something else. Therefore, database data should never be written to swap space: it’ll be flushed back to disk first.

However, occasionally MongoDB will use swap for operations that require ordering data: either building indexes or sorting. It attempts not to use too much memory for these types of operations, but by performing many of them at the same time you may be able to force swapping.

If your application is managing to make MongoDB use swap space, you should look into redesigning the application or reducing load on the swapping server.
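If your machine has no swap at all, a small swap file is easy to add. A minimal sketch (the 4 GB size and the file path are arbitrary choices, not MongoDB requirements):

sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

To keep the swap file across reboots, add a line such as /swapfile none swap sw 0 0 to /etc/fstab.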

Filesystem

For Linux, only the XFS filesystem is recommended for your data volumes with the WiredTiger storage engine. It is possible to use the ext4 filesystem with WiredTiger, but be aware there are known performance issues (specifically, that it may stall on WiredTiger checkpoints).
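For a new data volume, creating and mounting an XFS filesystem looks something like the following (the device name and mount point are assumptions; point mongod’s dbPath at the mount point):

sudo mkfs.xfs /dev/sdb1
sudo mkdir -p /data/db
sudo mount /dev/sdb1 /data/db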

On Windows, use NTFS; the FAT family of filesystems is not recommended for MongoDB data files.

Warning

Do not use Network File System (NFS) mounts for MongoDB storage. Some NFS client versions lie about flushing, randomly remount and flush the page cache, and do not support exclusive file locking. Using NFS can cause journal corruption and should be avoided at all costs.

Virtualization

Virtualization is a great way to get cheap hardware and be able to expand fast. However, there are some downsides—particularly unpredictable network and disk I/O. This section covers virtualization-specific issues.

Memory Overcommitting

The memory overcommit Linux kernel setting controls what happens when processes request too much memory from the operating system. Depending on how it’s set, the kernel may give memory to processes even if that memory is not actually available (in the hopes that it’ll become available by the time the process needs it). That’s called overcommitting: the kernel promises memory that isn’t actually there. This operating system kernel setting does not work well with MongoDB.

The possible values for vm.overcommit_memory are 0 (the kernel guesses about how much to overcommit); 1 (memory allocation always succeeds); or 2 (don’t commit more virtual address space than swap space plus a fraction of physical RAM, set by vm.overcommit_ratio). The value 2 is complicated, but it’s the best option available. To set this, run:

$ echo 2 > /proc/sys/vm/overcommit_memory

You do not need to restart MongoDB after changing this operating system setting.
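Settings written under /proc do not survive a reboot. To make the change persistent, one approach is to drop it into a sysctl configuration file (the filename here is just a convention):

echo "vm.overcommit_memory = 2" | sudo tee /etc/sysctl.d/99-mongodb.conf
sudo sysctl --system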

Mystery Memory

Sometimes the virtualization layer does not handle memory provisioning correctly. Thus, you may have a virtual machine that claims to have 100 GB of RAM available but only ever allows you to access 60 GB of it. Conversely, we’ve seen people that were supposed to have 20 GB of memory end up being able to fit an entire 100 GB dataset into RAM!

Assuming you don’t end up on the lucky side, there isn’t much you can do. If your operating system readahead is set appropriately and your virtual machine just won’t use all the memory it should, you may just have to switch virtual machines.

Handling Network Disk I/O Issues

One of the biggest problems with using virtualized hardware is that you are generally sharing a disk with other tenants, which exacerbates the disk slowness mentioned previously because everyone is competing for disk I/O. Thus, virtualized disks can have very unpredictable performance: they can work fine while your neighbors aren’t busy and suddenly slow down to a crawl if someone else starts hammering the disks.

The other issue is that this storage is often not physically attached to the machine MongoDB is running on, so even when you have a disk all to yourself I/O will be slower than it would be with a local disk. There is also the unlikely-but-possible scenario of your MongoDB server losing its network connection to your data.

Amazon has what is probably the most widely used networked block store, called Elastic Block Store (EBS). EBS volumes can be connected to Elastic Compute Cloud (EC2) instances, allowing you to give a machine almost any amount of disk immediately. If you are using EC2, you should also enable AWS Enhanced Networking if it’s available for your instance type, and disable dynamic voltage and frequency scaling (DVFS), CPU power-saving modes, and hyperthreading. On the plus side, EBS makes backups very easy (take a snapshot from a secondary, mount the EBS drive on another instance, and start up mongod). On the downside, you may encounter variable performance.

If you require more predictable performance, there are a couple of options. One is to host MongoDB on your own servers—that way, you know no one else is slowing things down. However, that’s not an option for a lot of people, so the next best thing is to get an instance in the cloud that guarantees a certain number of I/O Operations Per Second (IOPS). See http://docs.mongodb.org for up-to-date recommendations on hosted offerings.

If you can’t pursue either of these options and you need more disk I/O than an overloaded EBS volume can sustain, there is a way to hack around it. Basically, what you can do is keep monitoring the volume MongoDB is using. If and when that volume slows down, immediately kill that instance and bring up a new one with a different data volume.

There are a couple of statistics to watch for:

  • Spiking I/O utilization (“IO wait” on Cloud Manager/Atlas), for obvious reasons.

  • Page fault rates spiking. Note that changes in application behavior could also cause working set changes: you should disable this assassination script before deploying new versions of your application.

  • The number of lost TCP packets going up (Amazon is particularly bad about this: when performance starts to fall, it drops TCP packets all over the place).

  • MongoDB’s read and write queues spiking (this can be seen in Cloud Manager/Atlas or in mongostat’s qr/qw column).
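On the instance itself, a quick way to watch I/O utilization and request latency is iostat from the sysstat package (exact column names vary between versions; %util and await are the ones to watch):

iostat -x 1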

If your load varies over the day or week, make sure your script takes that into account: you don’t want a rogue cron job killing off all of your instances because of an unusually heavy Monday morning rush.

This hack relies on you having recent backups or relatively quick-to-sync datasets. If you have each instance holding terabytes of data, you might want to pursue an alternative approach. Also, this is only likely to work: if your new volume is also being hammered, it will be just as slow as the old one.

Using Non-Networked Disks

Note

This section uses Amazon-specific vocabulary. However, it may apply to other providers.

Ephemeral drives are the actual disks attached to the physical machine your VM is running on. They don’t have a lot of the problems networked storage does. Local disks can still be overloaded by other users on the same box, but with a large box you can be reasonably sure you’re not sharing disks with too many others. Even with a smaller instance, often an ephemeral drive will give better performance than a networked drive so long as the other tenants aren’t doing tons of IOPS.

The downside is in the name: these disks are ephemeral. If your EC2 instance goes down, there’s no guarantee you’ll end up on the same box when you restart the instance, and then your data will be gone.

Thus, ephemeral drives should be used with care. You should make sure that you do not store any important or unreplicated data on these disks. In particular, do not put the journal on an ephemeral drive while the database itself lives on network storage: if the instance is lost, the journal goes with it and cannot be used to recover the data. In general, think of ephemeral drives as a slow cache rather than a fast disk and use them accordingly.

Configuring System Settings

There are several system settings that can help MongoDB run more smoothly, which are mostly related to disk and memory access. This section covers each of these options and how you should tweak them.

Turning Off NUMA

When machines had a single CPU, all RAM was basically the same in terms of access time. As machines started to have more processors, engineers realized that having all memory be equally far from each CPU (as shown in Figure 24-1) was less efficient than having each CPU have some memory that is especially close to it and fast for that particular CPU to access (Figure 24-2). This architecture, where each CPU has its own “local” memory, is called nonuniform memory architecture (NUMA).

Figure 24-1. Uniform memory architecture: all memory has the same access cost for each CPU
Figure 24-2. Nonuniform memory architecture: certain memory is attached to a CPU, giving the CPU faster access to that memory; CPUs can still access other CPUs’ memory, but it is more expensive than accessing their own

For lots of applications, NUMA works well: the processors often need different data because they’re running different programs. However, this works terribly for databases in general and MongoDB in particular because databases have such different memory access patterns than other types of applications. MongoDB uses a massive amount of memory and needs to be able to access memory that is “local” to other CPUs. However, the default NUMA settings on many systems make this difficult.

CPUs favor using the memory that is attached to them, and processes tend to favor one CPU over the others. This means that memory often fills up unevenly, potentially leaving you with one processor using 100% of its local memory and the other processors using only a fraction of their memory, as shown in Figure 24-3.

Figure 24-3. Sample memory usage in a NUMA system

In the scenario in Figure 24-3, suppose CPU1 needs some data that isn’t in memory yet. It must use its local memory for data that doesn’t have a “home” yet, but its local memory is full. Thus, it has to evict some of the data in its local memory to make room for the new data, even though there’s plenty of space left in the memory attached to CPU2! This process tends to cause MongoDB to run much slower than expected, as it only has a fraction of the memory available that it should have. MongoDB vastly prefers semiefficient access to more data over extremely efficient access to less data.

When running MongoDB servers and clients on NUMA hardware, you should configure a memory interleave policy so that the host behaves in a non-NUMA fashion. MongoDB checks NUMA settings on startup when deployed on Linux and Windows machines. If the NUMA configuration may degrade performance, MongoDB prints a warning.

On Windows, memory interleaving must be enabled through the machine’s BIOS. Consult your system documentation for details.

When running MongoDB on Linux, you should disable zone reclaim in the sysctl settings using one of the following commands:

echo 0 | sudo tee /proc/sys/vm/zone_reclaim_mode

sudo sysctl -w vm.zone_reclaim_mode=0

Then, you should use numactl to start your mongod instances, including the config servers, mongos instances, and any clients. If you do not have the numactl command, refer to the documentation for your operating system to install the numactl package.

The following command demonstrates how to start a MongoDB instance using numactl:

numactl --interleave=all <path> <options>

The <path> is the path to the program you are starting and the <options> are any optional arguments to pass to the program.
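For example, assuming mongod is installed at /usr/bin/mongod and reads its configuration from /etc/mongod.conf (both paths are assumptions), you would start it with:

numactl --interleave=all /usr/bin/mongod --config /etc/mongod.conf

You can inspect the machine’s NUMA layout with numactl --hardware.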

To fully disable NUMA behavior, you must perform both operations. For more information, see the documentation.

Setting Readahead

Readahead is an optimization where the operating system reads more data from disk than was actually requested. This is useful because most workloads that computers handle are sequential: if you load the first 20 MB of a video, you are probably going to want the next couple of megabytes of it. Thus, the system will read more from disk than you actually request and store it in memory, just in case you need it soon.

For the WiredTiger storage engine, you should set readahead to between 8 and 32 (measured in 512-byte sectors, i.e., 4 KB to 16 KB) regardless of the storage media type (spinning disk, SSD, etc.). Setting it higher benefits sequential I/O operations, but since MongoDB’s disk access patterns are typically random, a higher readahead value provides limited benefit and may even degrade performance. For most workloads, a readahead of between 8 and 32 provides optimal MongoDB performance.

In general, you should set the readahead within this range unless testing shows that a higher value is measurably, repeatably, and reliably beneficial. MongoDB Professional Support can provide advice and guidance on nonzero readahead configurations.
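Readahead is set per block device with blockdev, in 512-byte sectors. A sketch, assuming /dev/sda holds your data files (the setting does not survive a reboot, so you will also want it in a udev rule or startup script):

sudo blockdev --getra /dev/sda
sudo blockdev --setra 32 /dev/sda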

Disabling Transparent Huge Pages (THP)

THP causes similar issues to high readahead. Do not use this feature unless:

  • All of your data fits into memory.

  • You have no plans for it to ever grow beyond memory.

MongoDB needs to page in lots of tiny pieces of memory, so using THP can result in more disk I/O.

Systems move data from disk to memory and back by the page. Pages are generally a couple of kilobytes (x86 defaults to 4,096-byte pages). If a machine has many gigabytes of memory, keeping track of each of these (relatively tiny) pages can be slower than just tracking a few larger-granularity pages. THP is a solution that allows you to have pages that are up to 256 MB (on IA-64 architectures). However, using it means that you are keeping megabytes of data from one section of disk in memory. If your data does not fit in RAM, then swapping in larger pieces from disk will just fill up your memory quickly with data that will need to be swapped out again. Also, flushing any changes to disk will be slower, as the disk must write megabytes of “dirty” data, instead of a few kilobytes.

THP was actually developed to benefit databases, so this may be surprising to experienced database admins. However, MongoDB tends to do a lot less sequential disk access than relational databases do.
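On Linux, THP can be disabled at runtime through sysfs. A minimal sketch (this does not persist across reboots; the MongoDB documentation describes an init script or systemd unit for that):

echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag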

Note

On Windows these are called Large Pages, not Huge Pages. Some versions of Windows have this feature enabled by default and some do not, so check and make sure it is turned off.

Choosing a Disk Scheduling Algorithm

The disk controller receives requests from the operating system and processes them in an order determined by a scheduling algorithm. Sometimes changing this algorithm can improve disk performance. For other hardware and workloads, it may not make a difference. The best way to decide which algorithm to use is to test them out yourself on your workload. Deadline and completely fair queueing (CFQ) both tend to be good choices.

There are a couple of situations where the noop scheduler (a contraction of “no-op”) is the best choice. If you’re in a virtualized environment, use the noop scheduler. This scheduler basically passes the operations through to the underlying disk controller as quickly as possible. It is fastest to do this and let the real disk controller handle any reordering that needs to happen.

Similarly, on SSDs, the noop scheduler is generally the best choice. SSDs don’t have the same locality issues that spinning disks do.

Finally, if you’re using a RAID controller with caching, use noop. The cache behaves like an SSD and will take care of propagating the writes to the disk efficiently.

If you are on a physical server that is not virtualized, the operating system should use the deadline scheduler. The deadline scheduler caps maximum latency per request and maintains a reasonable disk throughput that is best for disk-intensive database applications.

You can change the scheduling algorithm by setting the elevator option (e.g., elevator=deadline) in your kernel’s boot configuration.
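You can also check or change the scheduler for a single device at runtime through sysfs (the device name is an assumption; newer multi-queue kernels list none and mq-deadline instead of noop and deadline):

cat /sys/block/sda/queue/scheduler
echo noop | sudo tee /sys/block/sda/queue/scheduler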

Note

The option is called "elevator" because the scheduler behaves like an elevator, picking up people (I/O requests) from different floors (processes/times) and dropping them off where they want to go in an arguably optimal way.

Often all of the algorithms work pretty well; you may not see much of a difference between them.

Disabling Access Time Tracking

By default, the system tracks when files were last accessed. As the data files used by MongoDB are very high-traffic, you can get a performance boost by disabling this tracking. You can do this on Linux by changing atime to noatime in /etc/fstab:

/dev/sda7 /data xfs rw,noatime 1 2

You must remount the device for the changes to take effect.
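Using the mount point from the example line above, that would be:

sudo mount -o remount /data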

atime is more of an issue on older kernels and filesystems (e.g., ext3); newer ones default to relatime, which is updated less aggressively. Also, be aware that setting noatime can affect other programs using the partition, such as mutt or backup tools.

Similarly, on Windows you should set the disablelastaccess option. To turn off last access time recording, run:

C:> fsutil behavior set disablelastaccess 1

You must reboot for this setting to take effect. Setting this may affect the remote storage service, but you probably shouldn’t be using a service that automatically moves your data to other disks anyway.

Modifying Limits

There are two limits that MongoDB tends to blow by: the number of threads a process is allowed to spawn and the number of file descriptors a process is allowed to open. Both of these should generally be set to unlimited.

Whenever a MongoDB server accepts a connection, it spawns a thread to handle all activity on that connection. Therefore, if you have 3,000 connections to the database, the database will have 3,000 threads running (plus a few other threads for non-client-related tasks). Depending on your application server configuration, your client may spawn anywhere from a dozen to thousands of connections to MongoDB.

If your client will dynamically spawn more child processes as traffic increases (most application servers will do this), it is important to make sure that these child processes are not so numerous that they can max out MongoDB’s limits. For example, if you have 20 application servers, each one of which is allowed to spawn 100 child processes, and each child process can spawn 10 threads that all connect to MongoDB, that could result in the spawning of 20 × 100 × 10 = 20,000 connections at peak traffic. MongoDB is probably not going to be very happy about spawning tens of thousands of threads and, if you run out of threads per process, will simply start refusing new connections.

The other limit to modify is the number of file descriptors MongoDB is allowed to open. Every incoming and outgoing connection uses a file descriptor, so the client connection storm just mentioned would create 20,000 open filehandles.

mongos in particular tends to create connections to many shards. When a client connects to a mongos and makes a request, the mongos opens connections to any and all shards necessary to fulfill that request. Thus, if a cluster has 100 shards and a client connects to a mongos and tries to query for all of its data, the mongos must open 100 connections: one connection to each shard. This can quickly lead to an explosion in the number of connections, as you can imagine from the previous example. Suppose a liberally configured app server made a hundred connections to a mongos process. This could get translated to 100 inbound connections × 100 shards = 10,000 connections to shards! (This assumes a nontargeted query on each connection, which would be a bad design, so this is a somewhat extreme example.)

Thus, there are a few adjustments to make. Many people purposefully configure mongos processes to only allow a certain number of incoming connections by using the maxConns option. This is a good way to enforce that your client is behaving well.
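In the YAML configuration file, the equivalent setting is net.maxIncomingConnections. A minimal sketch (the limit shown is arbitrary):

net:
  maxIncomingConnections: 20000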

You should also increase the limit on the number of file descriptors, as the default (generally 1,024) is simply too low. Set the max number of file descriptors to unlimited or, if you’re nervous about that, 20,000. Each system has a different way of changing these limits, but in general, make sure that you change both the hard and soft limits. A hard limit is enforced by the kernel and can only be changed by an administrator, whereas a soft limit is user-configurable.
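On Linux, one common place to raise both limits is /etc/security/limits.conf (the user name and the 64000 values are assumptions; pick numbers comfortably above your expected connection count):

mongod soft nofile 64000
mongod hard nofile 64000
mongod soft nproc 64000
mongod hard nproc 64000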

If the maximum number of connections is left at 1,024, Cloud Manager will warn you by displaying the host in yellow in the host list. If low limits are the issue that triggered the warning, the Last Ping tab should display a message similar to that shown in Figure 24-4.

Figure 24-4. Cloud Manager low ulimit (file descriptors) setting warning

Even if you have a nonsharded setup and an application that only uses a small number of connections, it’s a good idea to increase the hard and soft limits to at least 4,096. That will stop MongoDB from warning you about them and give you some breathing room, just in case.

Configuring Your Network

This section covers which servers should have connectivity to which other servers. Often, for reasons of network security (and sensibility), you may want to limit the connectivity of MongoDB servers. Note that multiserver MongoDB deployments should tolerate a network that becomes partitioned or goes down, but running with deliberately restricted connectivity between components isn’t recommended as a general deployment strategy.

For a standalone server, clients must be able to make connections to the mongod.

Members of a replica set must be able to make connections to every other member. Clients must be able to connect to all nonhidden, nonarbiter members. Depending on network configuration, members may also attempt to connect to themselves, so you should allow mongods to create connections to themselves.

Sharding is a bit more complicated. There are four components: mongos servers, shards, config servers, and clients. Connectivity can be summarized in the following three points:

  • A client must be able to connect to a mongos.

  • A mongos must be able to connect to the shards and config servers.

  • A shard must be able to connect to the other shards and the config servers.

The full connectivity chart is described in Table 24-1.

Table 24-1. Sharding connectivity
To server type       From mongos     From shard      From config server   From client
mongos               Not required    Not required    Not required         Required
Shard                Required        Required        Not required         Not recommended
Config server        Required        Required        Not required         Not recommended
Client               Not required    Not required    Not required         Not MongoDB-related

There are three possible values in the table. “Required” means that connectivity between these two components is required for sharding to work as designed. MongoDB will attempt to degrade gracefully if it loses these connections due to network issues, but you shouldn’t purposely configure it that way.

“Not required” means that these two elements never talk in the direction specified, so no connectivity is needed.

“Not recommended” means that these two elements should never talk, but due to user error they could. For example, it is recommended that clients only make connections to the mongos, not the shards, so that clients do not inadvertently make requests directly to shards. Similarly, clients should not be able to directly access config servers so that they cannot accidentally modify config data.
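One way to reduce the chance of such accidental connections is to bind each server only to the interfaces it needs and allow only the appropriate hosts through the firewall. A minimal sketch (the addresses and subnet are assumptions, not recommendations):

net:
  port: 27017
  bindIp: 127.0.0.1,10.0.0.5

A matching firewall rule with Ubuntu’s ufw might look like:

sudo ufw allow from 10.0.0.0/24 to any port 27017 proto tcp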

Note that mongos processes and shards talk to config servers, but config servers don’t make connections to anyone, even one another.

Shards must communicate during chunk migrations: shards connect to one another directly to transfer data.

As mentioned earlier, replica set members that compose shards should be able to connect to themselves.

System Housekeeping

This section covers some common issues you should be aware of before deploying.

Synchronizing Clocks

In general, it’s safest to have your systems’ clocks within a second of each other. Replica sets should be able to handle nearly any clock skew. Sharding can handle some skew (if it gets beyond a few minutes, you’ll start seeing warnings in the logs), but it’s best to minimize it. Having in-sync clocks also makes figuring out what’s happening from logs easier.

You can keep clocks synchronized using the w32tm tool on Windows and the ntp daemon on Linux.
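On a systemd-based Linux distribution, for example, you might check and enable time synchronization like this (the service is chronyd on some distributions and ntpd on others):

timedatectl status
sudo systemctl enable --now chronyd

On Windows, w32tm /resync forces an immediate synchronization.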

The OOM Killer

Very occasionally, MongoDB will allocate enough memory that it will be targeted by the out-of-memory (OOM) killer. This particularly tends to happen during index builds, as that is one of the only times when MongoDB’s resident memory should put any strain on the system.

If your MongoDB process suddenly dies with no errors or exit messages in the logs, check /var/log/messages (or wherever your kernel logs such things) to see if it has any messages about terminating mongod.
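For example (log locations vary by distribution; journalctl exists only on systemd-based systems):

grep -i "killed process" /var/log/messages
journalctl -k | grep -i -E "out of memory|killed process"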

If the kernel has killed MongoDB for memory overuse, you should see something like this in the kernel log:

kernel: Killed process 2771 (mongod)
kernel: init invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0

If you were running with journaling, you can simply restart mongod at this point. If you were not, restore from a backup or resync the data from a replica.

The OOM killer gets particularly nervous if you have no swap space and start running low on memory, so a good way to prevent it from going on a spree is to configure a modest amount of swap. As mentioned earlier, MongoDB should never use it, but it makes the OOM killer happy.

If the OOM killer kills a mongos, you can simply restart it.

Turn Off Periodic Tasks

Check that there aren’t any cron jobs, antivirus scanners, or daemons that might periodically pop to life and steal resources. One culprit we’ve seen is package managers’ automatic update. These programs will come to life, consume a ton of RAM and CPU, and then disappear. This is not something you want running on your production server.
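A few quick ways to look for such periodic tasks on Linux (the exact locations depend on your distribution):

crontab -l
ls /etc/cron.d /etc/cron.daily /etc/cron.weekly
systemctl list-timers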
