One of the advantages of sharding in MongoDB is that it is mostly transparent to the application and requires minimal administration and operational effort.
One of the core tasks that MongoDB needs to perform continuously is balancing data between shards. No matter whether we implement range- or hash-based sharding, MongoDB will need to calculate bounds for the hashed field to be able to figure out on which shard to direct every new document insert or update. As our data grows, these bounds may need to get readjusted to avoid having a hot shard that ends up with a majority of our data.
For the sake of this example, let's assume that there is a data type named extra_tiny_int with integer values from [-12, 12). If we enable sharding on this extra_tiny_int field, the initial bounds of our data will be the whole range of values denoted by $minKey: -12 and $maxKey: 11.
After we insert some initial data, MongoDB will generate chunks and recalculate the bounds for each chunk to try and balance our data.
In our case of two shards and four initial chunks, the initial bounds will be calculated as:
Chunk1: [-12..-6)
Chunk2: [-6..0)
Chunk3: [0..6)
Chunk4: [6,12) where [ is inclusive and ) is not inclusive.
After we insert some data, our chunks will look like:
ShardA
Chunk1 Chunk2
-12,-8,-7 -6
ShardB
Chunk3 Chunk4
0, 2 7,8,9,10,11,11,11,11
In this case, we observe that Chunk4 has more items than any other chunk.
MongoDB will first split Chunk4 into two new chunks, attempting to keep the size of each chunk under a certain threshold (64 MB by default).
Now, instead of Chunk4, we have:
Chunk4A
7,8,9,10
Chunk4B
11,11,11,11
With the new bounds being:
Chunk4A: [6,11)
Chunk4B: [11,12)
Notice that Chunk4B can only hold one value. This is now an indivisible chunk—a chunk that can not be broken down into smaller ones anymore—and will grow in size unbounded, causing potential performance issues down the line.
This clarifies why we need to use a high-cardinality field as our shard key and why something like Boolean that only has true/false values is a bad selection for a shard key.
In our case, we now have two chunks in ShardA and three chunks in ShardB. According to the following table:
Number of chunks |
Migration threshold |
<20 |
2 |
20-79 |
4 |
>=80 |
8 |
We have not reached our migration threshold yet, since 3-2 = 1.
The migration threshold is calculated as the number of chunks in the shard with the highest count of chunks and the number of chunks in the shard with the lowest count of chunks:
Shard1 -> 85 chunks
Shard2 -> 86 chunks
Shard3 -> 92 chunks
In the preceding example, balancing will not occur until Shard3 (or Shard2) reaches 93 chunks because the migration threshold is 8 for >=80 chunks and the difference between Shard1 and Shard3 is still 7 chunks (92-85).
If we continue adding data in Chunk4A, it will eventually be split into Chunk4A1 and Chunk4A2.
Now we have four chunks in ShardB (Chunk3, Chunk4A1, Chunk4A2, and Chunk4B) and two chunks in ShardA (Chunk1 and Chunk2).
The MongoDB balancer will now migrate one chunk from ShardB to ShardA as 4-2 = 2, reaching the migration threshold for <20 chunks. The balancer will adjust the boundaries between the two shards to be able to query more effectively (targeted queries).