Chapter 14. Introduction to Sharding

This chapter covers how to scale with MongoDB. We’ll look at:

  • What sharding is and the components of a cluster

  • How to configure sharding

  • The basics of how sharding interacts with your application

What Is Sharding?

Sharding refers to the process of splitting data up across machines; the term partitioning is also sometimes used to describe this concept. By putting a subset of data on each machine, it becomes possible to store more data and handle more load without requiring larger or more powerful machines—just a larger quantity of less-powerful machines. Sharding may be used for other purposes as well, including placing more frequently accessed data on more performant hardware or splitting a dataset based on geography to locate a subset of documents in a collection (e.g., for users based in a particular locale) close to the application servers from which they are most commonly accessed.

Manual sharding can be done with almost any database software. With this approach, an application maintains connections to several different database servers, each of which are completely independent. The application manages storing different data on different servers and querying against the appropriate server to get data back. This setup can work well but becomes difficult to maintain when adding or removing nodes from the cluster or in the face of changing data distributions or load patterns.
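
To make the contrast with MongoDB's autosharding concrete, here is a minimal sketch (in mongo shell JavaScript) of what manual sharding pushes onto the application. The hostnames and the hash-based routing rule are assumptions made up for illustration; nothing here is provided by MongoDB:

// Hypothetical manual sharding: the application owns all of the routing logic.
// The hostnames and the hash-based routing rule below are illustrative only.
var servers = [
    new Mongo("db1.example.com:27017"),
    new Mongo("db2.example.com:27017")
];

function shardFor(username) {
    // Crude routing rule: derive a server index from the username's characters.
    var hash = 0;
    for (var i = 0; i < username.length; i++) {
        hash = (hash * 31 + username.charCodeAt(i)) % servers.length;
    }
    return servers[hash].getDB("accounts");
}

// The application must remember to apply the same rule for every read and write...
shardFor("user123").users.insert({"username" : "user123"});
// ...and gets no help when servers are added or data needs to be rebalanced.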

MongoDB supports autosharding, which tries to both abstract the architecture away from the application and simplify the administration of such a system. MongoDB allows your application to ignore the fact that it isn’t talking to a standalone MongoDB server, to some extent. On the operations side, MongoDB automates balancing data across shards and makes it easier to add and remove capacity.

Sharding is the most complex way of configuring MongoDB, both from a development and an operational point of view. There are many components to configure and monitor, and data moves around the cluster automatically. You should be comfortable with standalone servers and replica sets before attempting to deploy or use a sharded cluster. Also, as with replica sets, the recommended means of configuring and deploying sharded clusters is through either MongoDB Ops Manager or MongoDB Atlas. Ops Manager is recommended if you need to maintain control of your computing infrastructure. MongoDB Atlas is recommended if you can leave the infrastructure management to MongoDB (you have the option of running in Amazon AWS, Microsoft Azure, or Google Cloud Platform).

Understanding the Components of a Cluster

MongoDB’s sharding allows you to create a cluster of many machines (shards) and break up a collection across them, putting a subset of data on each shard. This allows your application to grow beyond the resource limits of a standalone server or replica set.

Note

Many people are confused about the difference between replication and sharding. Remember that replication creates an exact copy of your data on multiple servers, so every server is a mirror image of every other server. Conversely, every shard contains a different subset of data.

One of the goals of sharding is to make a cluster of 2, 3, 10, or even hundreds of shards look like a single machine to your application. To hide these details from the application, we run one or more routing processes, called mongos, in front of the shards. A mongos keeps a “table of contents” that tells it which shard contains which data. Applications can connect to this router and issue requests normally, as shown in Figure 14-1. The router, knowing what data is on which shard, is able to forward the requests to the appropriate shard(s). If there are responses to a request, the router collects them, merges them if necessary, and sends them back to the application. As far as the application knows, it’s connected to a standalone mongod, as illustrated in Figure 14-2.

Figure 14-1. Sharded client connection
Figure 14-2. Nonsharded client connection

Sharding on a Single-Machine Cluster

We’ll start by setting up a quick cluster on a single machine. First, start a mongo shell with the --nodb and --norc options:

$ mongo --nodb --norc

To create a cluster, use the ShardingTest class. Run the following in the mongo shell you just launched:

st = ShardingTest({
  name:"one-min-shards",
  chunkSize:1,
  shards:2,
  rs:{
    nodes:3,
    oplogSize:10
  },
  other:{
    enableBalancer:true
  }
});

The chunkSize option is covered in Chapter 17. For now, simply set it to 1. As for the other options passed to ShardingTest here, name simply provides a label for our sharded cluster, shards specifies that our cluster will be composed of two shards (we do this to keep the resource requirements low for this example), and rs defines each shard as a three-node replica set with an oplogSize of 10 MiB (again, to keep resource utilization low). Though it is possible to run one standalone mongod for each shard, it paints a clearer picture of the typical architecture of a sharded cluster if we create each shard as a replica set. In the last option specified, we are instructing ShardingTest to enable the balancer once the cluster is spun up. This will ensure that data is evenly distributed across both shards.

ShardingTest is a class designed for internal use by MongoDB Engineering and is therefore undocumented externally. However, because it ships with the MongoDB server, it provides the most straightforward means of experimenting with a sharded cluster. ShardingTest was originally designed to support server test suites and is still used for this purpose. By default it provides a number of conveniences that help in keeping resource utilization as small as possible and in setting up the relatively complex architecture of a sharded cluster. It assumes that a /data/db directory exists on your machine; if ShardingTest fails to run, create this directory and rerun the command.

When you run this command, ShardingTest will do a lot for you automatically. It will create a new cluster with two shards, each of which is a replica set. It will configure the replica sets and launch each node with the necessary options to establish replication protocols. It will launch a mongos to manage requests across the shards so that clients can interact with the cluster as if communicating with a standalone mongod, to some extent. Finally, it will launch an additional replica set for the config servers that maintain the routing table information necessary to ensure queries are directed to the correct shard. Remember that the primary use cases for sharding are to split a dataset to address hardware and cost constraints or to provide better performance to applications (e.g., geographical partitioning). MongoDB sharding provides these capabilities in a way that is seamless to the application in many respects.

Once ShardingTest has finished setting up your cluster, you will have 10 processes up and running to which you can connect: two replica sets of three nodes each, one config server replica set of three nodes, and one mongos. By default, these processes should begin at port 20000. The mongos should be running at port 20009. Other processes you have running on your local machine and previous calls to ShardingTest can have an effect on which ports ShardingTest uses, but you should not have too much difficulty determining the ports on which your cluster processes are running.
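
If you want to double-check the ports, the shell in which you launched ShardingTest still has the st object available. ShardingTest is internal and undocumented, so treat the property below as an assumption based on how the server test suites use it rather than a stable API:

> st.s.host    // the shell's connection to the mongos; prints its address, e.g. localhost:20009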

Next, you’ll connect to the mongos to play around with the cluster. Your entire cluster will be dumping its logs to your current shell, so open up a second terminal window and launch another mongo shell:

$ mongo --nodb

Use this shell to connect to your cluster’s mongos. Again, your mongos should be running on port 20009:

> db = (new Mongo("localhost:20009")).getDB("accounts")

Note that the prompt in your mongo shell should change to reflect that you are connected to a mongos. Now you are in the situation shown earlier, in Figure 14-1: the shell is the client and is connected to a mongos. You can start passing requests to the mongos and it’ll route them to the shards. You don’t really have to know anything about the shards, like how many there are or what their addresses are. So long as there are some shards out there, you can pass the requests to the mongos and allow it to forward them appropriately.

Start by inserting some data:

> for (var i=0; i<100000; i++) {
...     db.users.insert({"username" : "user"+i, "created_at" : new Date()});
... }
> db.users.count()
100000

As you can see, interacting with mongos works the same way as interacting with a standalone server does.
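
For example, other everyday operations pass through the mongos unchanged. The update below is purely illustrative (the vip field is not part of the documents we inserted), but the result documents have the same shape you would see against a standalone server:

> db.users.updateOne({"username" : "user101"}, {"$set" : {"vip" : true}})
{ "acknowledged" : true, "matchedCount" : 1, "modifiedCount" : 1 }
> db.users.find({"vip" : true}).count()
1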

You can get an overall view of your cluster by running sh.status(). It will give you a summary of your shards, databases, and collections:

> sh.status()
--- Sharding Status --- 
sharding version: {
  "_id" : 1,
  "minCompatibleVersion" : 5,
  "currentVersion" : 6,
  "clusterId" : ObjectId("5a4f93d6bcde690005986071")
}
shards:
  {  "_id" : "one-min-shards-rs0",
     "host" :
       "one-min-shards-rs0/MBP:20000,MBP:20001,MBP:20002",
     "state" : 1 }
  {  "_id" : "one-min-shards-rs1",
     "host" :
       "one-min-shards-rs1/MBP:20003,MBP:20004,MBP:20005",
     "state" : 1 }
active mongoses:
  "3.6.1" : 1
autosplit:
  Currently enabled: no
balancer:
  Currently enabled:  no
  Currently running:  no
  Failed balancer rounds in last 5 attempts:  0
  Migration Results for the last 24 hours:
    No recent migrations
databases:
  {  "_id" : "accounts",  "primary" : "one-min-shards-rs1",
     "partitioned" : false }
  {  "_id" : "config",  "primary" : "config",
     "partitioned" : true }
    config.system.sessions
      shard key: { "_id" : 1 }
      unique: false
      balancing: true
      chunks:
        one-min-shards-rs0	1
        { "_id" : { "$minKey" : 1 } } -->>
          { "_id" : { "$maxKey" : 1 } } on : one-min-shards-rs0 Timestamp(1, 0)

Note

sh is similar to rs, but for sharding: it is a global variable that defines a number of sharding helper functions, which you can see by running sh.help(). As you can see from the sh.status() output, you have two shards and two databases (config is created automatically).
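
For example, two helpers from that list are handy while experimenting (output omitted here, since it depends on your cluster's current state):

> sh.getBalancerState()   // returns true if the balancer is currently enabled
> sh.help()               // prints the full list of sh helpers with short descriptions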

Your accounts database may have a different primary shard than the one shown here. A primary shard is a “home base” shard that is randomly chosen for each database. All of your data will be on this primary shard. MongoDB cannot automatically distribute your data yet because it doesn’t know how (or if) you want it to be distributed. You have to tell it, per collection, how you want it to distribute data.

Note

A primary shard is different from a replica set primary. A primary shard refers to the entire replica set composing a shard. A primary in a replica set is the single server in the set that can take writes.
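
If you would rather check a database's primary shard programmatically than read it out of sh.status(), you can query the config database through the mongos. This is an illustrative check; the shard name in the result will match whatever sh.status() reported for your cluster:

> db.getSiblingDB("config").databases.find({"_id" : "accounts"})
{ "_id" : "accounts", "primary" : "one-min-shards-rs1", "partitioned" : false }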

To shard a particular collection, first enable sharding on the collection’s database. To do so, run the enableSharding command:

> sh.enableSharding("accounts")

Now sharding is enabled on the accounts database, which allows you to shard collections within the database.

When you shard a collection, you choose a shard key. This is a field (or combination of fields) that MongoDB uses to break up the data. For example, if you chose to shard on "username", MongoDB would break up the data into ranges of usernames: "a1-steak-sauce" through "defcon", "defcon1" through "howie1998", and so on. Choosing a shard key can be thought of as choosing an ordering for the data in the collection. This is a similar concept to indexing, and for good reason: the shard key becomes the most important index on your collection as it gets bigger. To create a shard key, the field(s) must be indexed.

So, before sharding the collection, you have to create an index on the key you want to shard by:

> db.users.createIndex({"username" : 1})

Now you can shard the collection by "username":

> sh.shardCollection("accounts.users", {"username" : 1})

Although we are choosing a shard key without much thought here, it is an important decision that should be carefully considered in a real system. See Chapter 16 for more advice on choosing a shard key.

If you wait a few minutes and run sh.status() again, you’ll see that there’s a lot more information displayed than there was before:

> sh.status()
--- Sharding Status --- 
sharding version: {
  "_id" : 1,
  "minCompatibleVersion" : 5,
  "currentVersion" : 6,
  "clusterId" : ObjectId("5a4f93d6bcde690005986071")
}
shards:
  {  "_id" : "one-min-shards-rs0", 
     "host" : 
       "one-min-shards-rs0/MBP:20000,MBP:20001,MBP:20002",  
     "state" : 1 }
  {  "_id" : "one-min-shards-rs1",  
     "host" : 
       "one-min-shards-rs1/MBP:20003,MBP:20004,MBP:20005",
     "state" : 1 }
active mongoses:
  "3.6.1" : 1
autosplit:
  Currently enabled: no
balancer:
  Currently enabled:  yes
  Currently running:  no
  Failed balancer rounds in last 5 attempts:  0
  Migration Results for the last 24 hours: 
    6 : Success
databases:
  {  "_id" : "accounts",  "primary" : "one-min-shards-rs1",  
     "partitioned" : true }
    accounts.users
      shard key: { "username" : 1 }
      unique: false
      balancing: true
      chunks:
        one-min-shards-rs0	6
        one-min-shards-rs1	7
        { "username" : { "$minKey" : 1 } } -->>
          { "username" : "user17256" } on : one-min-shards-rs0 Timestamp(2, 0)
        { "username" : "user17256" } -->>
          { "username" : "user24515" } on : one-min-shards-rs0 Timestamp(3, 0)
        { "username" : "user24515" } -->>
          { "username" : "user31775" } on : one-min-shards-rs0 Timestamp(4, 0)
        { "username" : "user31775" } -->>
          { "username" : "user39034" } on : one-min-shards-rs0 Timestamp(5, 0)
        { "username" : "user39034" } -->>
          { "username" : "user46294" } on : one-min-shards-rs0 Timestamp(6, 0)
        { "username" : "user46294" } -->>
          { "username" : "user53553" } on : one-min-shards-rs0 Timestamp(7, 0)
        { "username" : "user53553" } -->>
          { "username" : "user60812" } on : one-min-shards-rs1 Timestamp(7, 1)
        { "username" : "user60812" } -->>
          { "username" : "user68072" } on : one-min-shards-rs1 Timestamp(1, 7)
        { "username" : "user68072" } -->>
          { "username" : "user75331" } on : one-min-shards-rs1 Timestamp(1, 8)
        { "username" : "user75331" } -->>
          { "username" : "user82591" } on : one-min-shards-rs1 Timestamp(1, 9)
        { "username" : "user82591" } -->>
          { "username" : "user89851" } on : one-min-shards-rs1 Timestamp(1, 10)
        { "username" : "user89851" } -->>
          { "username" : "user9711" } on : one-min-shards-rs1 Timestamp(1, 11)
        { "username" : "user9711" } -->>
          { "username" : { "$maxKey" : 1 } } on : one-min-shards-rs1 Timestamp(1, 12)
  {  "_id" : "config",  "primary" : "config",  "partitioned" : true }
    config.system.sessions
      shard key: { "_id" : 1 }
      unique: false
      balancing: true
      chunks:
        one-min-shards-rs0	1
        { "_id" : { "$minKey" : 1 } } -->>
          { "_id" : { "$maxKey" : 1 } } on : one-min-shards-rs0 Timestamp(1, 0)

The collection has been split up into 13 chunks, where each chunk is a subset of your data. These are listed by shard key range (the {"username" : minValue} -->> {"username" : maxValue} denotes the range of each chunk). Looking at the "on" : shard part of the output, you can see that these chunks have been evenly distributed between the shards.

This process of a collection being split into chunks is shown graphically in Figures 14-3 through 14-5. Before sharding, the collection is essentially a single chunk. Sharding splits it into smaller chunks based on the shard key, as shown in Figure 14-4. These chunks can then be distributed across the cluster, as Figure 14-5 shows.

Figure 14-3. Before a collection is sharded, it can be thought of as a single chunk from the smallest value of the shard key to the largest
Figure 14-4. Sharding splits the collection into many chunks based on shard key ranges
Figure 14-5. Chunks are evenly distributed across the available shards

Notice the keys at the beginning and end of the chunk list: $minKey and $maxKey. $minKey can be thought of as “negative infinity.” It is smaller than any other value in MongoDB. Similarly, $maxKey is like “positive infinity.” It is greater than any other value. Thus, you’ll always see these as the “caps” on your chunk ranges. The values for your shard key will always be between $minKey and $maxKey. These values are actually BSON types and should not be used in your application; they are mainly for internal use. If you wish to refer to them in the shell, use the MinKey and MaxKey constants.
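
You can inspect these chunk boundaries yourself through the config database that the mongos exposes. The query below is illustrative and assumes the 3.6-era config.chunks schema, which keys chunks by namespace in an "ns" field (later server versions key them by collection UUID instead); it uses the MinKey constant to fetch the first chunk, which matches the first range in the sh.status() output above:

> db.getSiblingDB("config").chunks.find(
...     {"ns" : "accounts.users", "min.username" : MinKey},
...     {"_id" : 0, "min" : 1, "max" : 1, "shard" : 1})
{
    "min" : { "username" : { "$minKey" : 1 } },
    "max" : { "username" : "user17256" },
    "shard" : "one-min-shards-rs0"
}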

Now that the data is distributed across multiple shards, let’s try doing some queries. First, try a query on a specific username:

> db.users.find({username: "user12345"})
{ 
    "_id" : ObjectId("5a4fb11dbb9ce6070f377880"), 
    "username" : "user12345", 
    "created_at" : ISODate("2018-01-05T17:08:45.657Z") 
}

As you can see, querying works normally. However, let’s run an explain to see what MongoDB is doing under the covers:

> db.users.find({username: "user12345"}).explain()
{
  "queryPlanner" : {
    "mongosPlannerVersion" : 1,
    "winningPlan" : {
      "stage" : "SINGLE_SHARD",
      "shards" : [{
        "shardName" : "one-min-shards-rs0",
        "connectionString" :
          "one-min-shards-rs0/MBP:20000,MBP:20001,MBP:20002",
        "serverInfo" : {
          "host" : "MBP",
          "port" : 20000,
          "version" : "3.6.1",
          "gitVersion" : "025d4f4fe61efd1fb6f0005be20cb45a004093d1"
        },
        "plannerVersion" : 1,
        "namespace" : "accounts.users",
        "indexFilterSet" : false,
        "parsedQuery" : {
          "username" : {
            "$eq" : "user12345"
          }
        },
        "winningPlan" : {
          "stage" : "FETCH",
          "inputStage" : {
            "stage" : "SHARDING_FILTER",
            "inputStage" : {
              "stage" : "IXSCAN",
              "keyPattern" : {
                "username" : 1
              },
              "indexName" : "username_1",
              "isMultiKey" : false,
              "multiKeyPaths" : {
                "username" : [ ]
              },
              "isUnique" : false,
              "isSparse" : false,
              "isPartial" : false,
              "indexVersion" : 2,
              "direction" : "forward",
              "indexBounds" : {
                "username" : [
                  "[\"user12345\", \"user12345\"]"
                ]
              }
            }
          }
        },
        "rejectedPlans" : [ ]
      }]
    }
  },
  "ok" : 1,
  "$clusterTime" : {
    "clusterTime" : Timestamp(1515174248, 1),
    "signature" : {
      "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
      "keyId" : NumberLong(0)
    }
  },
  "operationTime" : Timestamp(1515173700, 201)
}

From the "winningPlan" field in the explain output, we can see that our cluster satisfied this query using a single shard, one-min-shards-rs0. Based on the output of sh.status() shown earlier, we can see that user12345 does fall within the key range for the first chunk listed for that shard in our cluster.

Because "username" is the shard key, mongos was able to route the query directly to the correct shard. Contrast that with the results for querying for all of the users:

> db.users.find().explain()
{
  "queryPlanner" : {
    "mongosPlannerVersion" : 1,
    "winningPlan" : {
      "stage" : "SHARD_MERGE",
      "shards" : [
        {
          "shardName" : "one-min-shards-rs0",
          "connectionString" :
            "one-min-shards-rs0/MBP:20000,MBP:20001,MBP:20002",
          "serverInfo" : {
            "host" : "MBP.fios-router.home",
            "port" : 20000,
            "version" : "3.6.1",
            "gitVersion" : "025d4f4fe61efd1fb6f0005be20cb45a004093d1"
          },
          "plannerVersion" : 1,
          "namespace" : "accounts.users",
          "indexFilterSet" : false,
          "parsedQuery" : { },
          "winningPlan" : {
            "stage" : "SHARDING_FILTER",
            "inputStage" : {
              "stage" : "COLLSCAN",
              "direction" : "forward"
            }
          },
          "rejectedPlans" : [ ]
        },
        {
          "shardName" : "one-min-shards-rs1",
          "connectionString" :
            "one-min-shards-rs1/MBP:20003,MBP:20004,MBP:20005",
          "serverInfo" : {
            "host" : "MBP.fios-router.home",
            "port" : 20003,
            "version" : "3.6.1",
            "gitVersion" : "025d4f4fe61efd1fb6f0005be20cb45a004093d1"
          },
          "plannerVersion" : 1,
          "namespace" : "accounts.users",
          "indexFilterSet" : false,
          "parsedQuery" : { },
          "winningPlan" : {
            "stage" : "SHARDING_FILTER",
            "inputStage" : {
              "stage" : "COLLSCAN",
              "direction" : "forward"
            }
          },
          "rejectedPlans" : [ ]
        }
      ]
    }
  },
  "ok" : 1,
  "$clusterTime" : {
    "clusterTime" : Timestamp(1515174893, 1),
    "signature" : {
      "hash" : BinData(0, "AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
      "keyId" : NumberLong(0)
    }
  },
  "operationTime" : Timestamp(1515173709, 514)
}

As you can see from this explain, this query has to visit both shards to find all the data. In general, if we are not using the shard key in the query, mongos will have to send the query to every shard.

Queries that contain the shard key and can be sent to a single shard or a subset of shards are called targeted queries. Queries that must be sent to all shards are called scatter-gather (broadcast) queries: mongos scatters the query to all the shards and then gathers up the results.
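
A quick way to tell which kind of query you issued is to read the top-level stage that mongos reports in the explain output, as in the two examples above:

> db.users.find({"username" : "user12345"}).explain().queryPlanner.winningPlan.stage
SINGLE_SHARD
> db.users.find().explain().queryPlanner.winningPlan.stage
SHARD_MERGE

A targeted query that spans several (but not all) shards may also report SHARD_MERGE, so the shards array in the explain output is the authoritative list of which shards were actually contacted.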

Once you are finished experimenting, shut down the cluster. Switch back to your original shell and press Enter a few times to get back to the command line, then run st.stop() to cleanly shut down all of the servers:

> st.stop()

If you are ever unsure of what an operation will do, it can be helpful to use ShardingTest to spin up a quick local cluster and try it out.
