Correctly configuring the cluster

Understanding how shards are distributed is essential, but understanding how documents are distributed is just as important. Elasticsearch tries to spread documents evenly across the shards. This is sensible behavior, because having a single shard hold most of the data would be unwise.
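The even spread comes from how a shard is chosen for each document: a routing value (by default the document ID) is hashed, and the result is taken modulo the number of primary shards. The following is a minimal sketch of that idea only. The `pick_shard` helper is hypothetical, and SHA-256 is a stand-in chosen for the example; it is not the hash function Elasticsearch uses internally.

```python
import hashlib
from collections import Counter

def pick_shard(routing_value: str, number_of_shards: int) -> int:
    """Map a routing value to a shard number: hash, then modulo."""
    # SHA-256 is only a stand-in hash for illustration (assumption);
    # Elasticsearch uses its own hash function internally.
    digest = hashlib.sha256(routing_value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % number_of_shards

# Route 10,000 made-up document IDs across two shards and count them.
counts = Counter(pick_shard(f"doc-{i}", 2) for i in range(10_000))
for shard, count in sorted(counts.items()):
    print(f"shard {shard}: {count} documents")
```

Because the hash output is effectively uniform, each shard should receive roughly half of the documents, and the same ID always maps to the same shard as long as the number of shards does not change.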

Let's start two Elasticsearch nodes and create an index by running the following command:

curl -XPUT localhost:9200/my_index -d '{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 0
  }
}'
{"acknowledged":true}

We've created an index that consists of two shards and has no replicas. Now let's add a document to the index:

curl -XPOST localhost:9200/my_index/document -d '{
  "title": "The first document"
}'
{"_index":"my_index","_type":"document","_id":"AU_iaqgDlNVjy8IaI4FM","_version":1,"created":true}

Let's get the current shard-level stats of my_index by using the following command:
curl -XGET 'localhost:9200/my_index/_stats?level=shards&pretty'
{
...
"shards": {
            "0": [
               {
                  "routing": {
                     "state": "STARTED",
                     "primary": true,
                     "node": "8EDJVceZRa2SZeEVTSjtsg",
                     "relocating_node": null
                  },
                  "docs": {
                     "count": 0,
                     "deleted": 0
                  },
                  ...
               }
            ],
            "1": [
               {
                  "routing": {
                     "state": "STARTED",
                     "primary": true,
                     "node": "gVKmlQefTqigLiJ7kVRczw",
                     "relocating_node": null
                  },
                  "docs": {
                     "count": 1,
                     "deleted": 0
                  },
                  ...
               }
            ]
         }
...
}

As you can see, there is one document in the second shard. Now let's add another document to my_index:

curl -XPOST localhost:9200/my_index/document -d '{
  "title": "The second document"
}'
{"_index":"my_index","_type":"document","_id":"AU_ijSHrlNVjy8IaI4Wu","_version":1,"created":true}

Now let's get the shard-level stats again:
curl -XGET 'localhost:9200/my_index/_stats?level=shards&pretty'
{
...
"shards": {
            "0": [
               {
                  "routing": {
                     "state": "STARTED",
                     "primary": true,
                     "node": "8EDJVceZRa2SZeEVTSjtsg",
                     "relocating_node": null
                  },
                  "docs": {
                     "count": 1,
                     "deleted": 0
                  },
                  ...
               }
            ],
            "1": [
               {
                  "routing": {
                     "state": "STARTED",
                     "primary": true,
                     "node": "gVKmlQefTqigLiJ7kVRczw",
                     "relocating_node": null
                  },
                  "docs": {
                     "count": 1,
                     "deleted": 0
                  },
                  ...
               }
            ]
         }
...
}
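
Reading the counts by eye works for two shards, but the same information can be pulled out of the response programmatically. The sketch below is one way to do that, under stated assumptions: `stats_response` is a trimmed, hand-written copy that keeps only the `shards` object shown above (the rest of the real response is elided in the output), and `docs_per_primary_shard` is a hypothetical helper.

```python
import json

# Trimmed copy of the "shards" object from the stats output above.
# Fields elided in the original output are left out here as well.
stats_response = json.loads("""
{
  "shards": {
    "0": [{"routing": {"state": "STARTED", "primary": true},
           "docs": {"count": 1, "deleted": 0}}],
    "1": [{"routing": {"state": "STARTED", "primary": true},
           "docs": {"count": 1, "deleted": 0}}]
  }
}
""")

def docs_per_primary_shard(shards: dict) -> dict:
    """Return {shard number: document count}, using primary copies only."""
    counts = {}
    for shard_number, copies in shards.items():
        for copy in copies:
            if copy["routing"]["primary"]:
                counts[int(shard_number)] = copy["docs"]["count"]
    return counts

print(docs_per_primary_shard(stats_response["shards"]))  # {0: 1, 1: 1}
```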

As the stats show, there is now one document in each shard, so Elasticsearch has spread the documents evenly across the shards. Now let's kill one node and count the documents in my_index:

curl -XGET 'localhost:9200/my_index/_count?pretty'
{
  "count" : 1,
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  }
}

Oops, we have a problem: a document is missing. If we check the cluster health at this point, we see that the status is red:

curl -XGET 'localhost:9200/_cluster/health?pretty'
{
  "cluster_name" : "elasticsearch",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 1,
  "active_shards" : 1,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 1,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0
}

When the status of a cluster is red, it means that at least one primary shard is not active. The documents in that shard are unavailable, and if the lost node cannot be brought back, they are gone for good. Replica shards solve this problem. To take advantage of replicas, we need at least two nodes, because a replica is never allocated on the same node as its primary.
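
The meaning of the status colors can be summarized in a few lines. The sketch below is a simplified model, not Elasticsearch code: `cluster_status` is a hypothetical helper that takes one `(is_primary, is_active)` pair per shard copy and applies the usual rule (red when a primary is inactive, yellow when only replicas are inactive, green otherwise).

```python
def cluster_status(shard_copies: list) -> str:
    """Derive a health color from (is_primary, is_active) pairs."""
    if any(is_primary and not is_active for is_primary, is_active in shard_copies):
        return "red"      # at least one primary shard is not active
    if any(not is_active for _, is_active in shard_copies):
        return "yellow"   # all primaries are active, some replica is not
    return "green"        # every shard copy is active

# The situation above: two primaries, no replicas, one node lost.
print(cluster_status([(True, True), (True, False)]))  # red
```

With one replica per shard and the same node failure, the surviving node would promote its replica to primary, leaving a yellow cluster with no missing documents.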

At this point, we might ask: how do we configure the cluster correctly? The question can be answered in two ways. The first answer is that the default configuration (five shards and one replica) is sufficient for basic needs and standard use cases. The second answer is that no single configuration fits every situation. Several factors determine the correct configuration for our cluster: for example, we must know how many nodes we will work with, how large the data is, and what system resources we have.

First of all, using replicas is recommended to avoid data loss. As mentioned before, using replicas requires at least two nodes. So now another question arises: how many shards and replicas should we use?
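
Before turning to that question, the two-node requirement can be made concrete with a little arithmetic. The sketch below is a deliberately simplified model: it assumes the only allocation rule is that two copies of the same shard never share a node, and it ignores every other allocation setting. The `unassigned_replicas` helper is hypothetical.

```python
def unassigned_replicas(number_of_shards: int,
                        number_of_replicas: int,
                        number_of_nodes: int) -> int:
    """Count replica copies that cannot be placed in the simplified model."""
    if number_of_nodes < 1:
        raise ValueError("the model needs at least one node")
    copies_wanted = 1 + number_of_replicas              # primary plus replicas
    copies_placeable = min(copies_wanted, number_of_nodes)
    return number_of_shards * (copies_wanted - copies_placeable)

print(unassigned_replicas(2, 1, 1))  # 2: one node cannot hold any replica
print(unassigned_replicas(2, 1, 2))  # 0: two nodes hold every copy
```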
