Choosing a Shard Key

The first step in sharding a large collection is to decide on a shard key that will be used to determine which documents should be stored in which shard. The shard key is an indexed field or an indexed compound field that must be included in every document in the collection. MongoDB uses the value of the shard key to split the collection between the shards in the cluster.

Selecting a good shard key can be critical to achieving the performance you need from MongoDB. A bad key can seriously impact the performance of the system, whereas a good key can improve performance and ensure future scalability. If a good key does not exist in your documents, you might want to consider adding a field specifically to be a sharding key.

When selecting a shard key, you should keep in mind the following considerations:

Image Easily divisible: The shard key needs to be easily divisible into chunks.

Image Randomness: When using range-based sharding, random keys can ensure that documents are more evenly distributed so that no one server is overloaded.

Image Compound keys: It is best to shard using a single field when possible. However, if a good single field key doesn’t exist, you can still get better performance from a good compound field than from a bad single field key.

Image Cardinality: Cardinality defines the uniqueness of the values of a field. A field has high cardinality if it is very unique (for example, a Social Security number for a million people). A field has low cardinality if it is generally not very unique (for example, eye color for a million people). Typically fields that have high cardinality provide much better options for sharding.

Image Query targeting: You should take a minute to look at the queries necessary in your applications. Queries will perform better if the data can be collected from a single shard in a cluster. If you can arrange for the shard key to match the most common query parameters, you will get better performance as long as all queries are not going to the same field value (for example, arranging documents based on the zip code of the user when all your queries are based on looking up users by zip code, since all the users for a given zip code will exist on the same shard server). If your queries are fairly distributed across zip codes, then a zip code key would be a good idea. However, if most of your queries are on a few zip codes, then a zip code key would be a bad idea.

To illustrate shard keys, consider the following keys:

Image { "zipcode": 1}: This shard key distributes documents based on the value of the zipcode field. This means that all lookups based on a specific zipcode will go to a single shard server.

Image { "zipcode": 1, "city": 1 }: This shard key first distributes documents by the value of the zipcode field. If a number of documents have the same value for zipcode, then they can be split off to other shards, based on the city field value. This means you are no longer guaranteed that a query on a single zipcode value will hit only one shard. However, queries based on zipcode and city will go to the same shard.

Image { "_id": "hashed" }: This shard key distributes documents based on a hash of the value of the _id field. This ensures a more even distribution across all shards in the cluster. However, it makes it impossible to target queries so that they will hit only a single shard server.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.146.107.55