Balancing data – how to track and keep our data balanced

One of the advantages of sharding in MongoDB is that it is mostly transparent to the application and requires minimal administration and operational effort.

One of the core tasks that MongoDB needs to perform continuously is balancing data between shards. No matter whether we implement range- or hash-based sharding, MongoDB will need to calculate bounds for the hashed field to be able to figure out on which shard to direct every new document insert or update. As our data grows, these bounds may need to get readjusted to avoid having a hot shard that ends up with a majority of our data.

For the sake of this example, let's assume that there is a data type named extra_tiny_int with integer values from [-12, 12). If we enable sharding on this extra_tiny_int field, the initial bounds of our data will be the whole range of values denoted by $minKey: -12 and $maxKey: 11.

After we insert some initial data, MongoDB will generate chunks and recalculate the bounds for each chunk to try and balance our data.

By default, the initial number of chunks created by MongoDB is 2*number of shards.

In our case of two shards and four initial chunks, the initial bounds will be calculated as:

Chunk1: [-12..-6)

Chunk2: [-6..0)

Chunk3: [0..6)

Chunk4: [6,12) where [ is inclusive and ) is not inclusive.

After we insert some data, our chunks will look like:

ShardA

Chunk1 Chunk2

-12,-8,-7 -6

ShardB

Chunk3 Chunk4

0, 2 7,8,9,10,11,11,11,11

In this case, we observe that Chunk4 has more items than any other chunk.

MongoDB will first split Chunk4 into two new chunks, attempting to keep the size of each chunk under a certain threshold (64 MB by default).

Now, instead of Chunk4, we have:

Chunk4A

7,8,9,10

Chunk4B

11,11,11,11

With the new bounds being:

Chunk4A: [6,11)

Chunk4B: [11,12)

Notice that Chunk4B can only hold one value. This is now an indivisible chunk—a chunk that can not be broken down into smaller ones anymore—and will grow in size unbounded, causing potential performance issues down the line.

This clarifies why we need to use a high-cardinality field as our shard key and why something like Boolean that only has true/false values is a bad selection for a shard key.

In our case, we now have two chunks in ShardA and three chunks in ShardB. According to the following table:

Number of chunks	Migration threshold
<20	2
20-79	4
>=80	8

We have not reached our migration threshold yet, since 3-2 = 1.

The migration threshold is calculated as the number of chunks in the shard with the highest count of chunks and the number of chunks in the shard with the lowest count of chunks:

Shard1 -> 85 chunks

Shard2 -> 86 chunks

Shard3 -> 92 chunks

In the preceding example, balancing will not occur until Shard3 (or Shard2) reaches 93 chunks because the migration threshold is 8 for >=80 chunks and the difference between Shard1 and Shard3 is still 7 chunks (92-85).

If we continue adding data in Chunk4A, it will eventually be split into Chunk4A1 and Chunk4A2.

Now we have four chunks in ShardB (Chunk3, Chunk4A1, Chunk4A2, and Chunk4B) and two chunks in ShardA (Chunk1 and Chunk2).

The MongoDB balancer will now migrate one chunk from ShardB to ShardA as 4-2 = 2, reaching the migration threshold for <20 chunks. The balancer will adjust the boundaries between the two shards to be able to query more effectively (targeted queries).

As seen in the preceding diagram, MongoDB will try to split >64 MB chunks into half in terms of size. Bounds between the two resulting chunks may be completely uneven if our data distribution is uneven to begin with. MongoDB can split chunks into smaller ones but cannot merge them automatically. We need to manually merge chunks, a delicate and operationally expensive procedure.

Table of Contents for Balancing data – how to track and keep our data balanced

Create new playlist

Sign In

Sign Up

Table of Contents for
Balancing data – how to track and keep our data balanced