Load balancing

A balanced Cassandra cluster is one where each node owns an equal share of the token range and, hence, roughly an equal number of keys. This means that when you query nodetool status, a balanced cluster will show the same percentage for every node under the Owns or Effective Ownership column. Note that equal ownership does not guarantee equal data: if the data is not uniformly distributed across the keys, some nodes will still hold more data than others. We use RandomPartitioner or Murmur3Partitioner to avoid this sort of lopsided cluster.

Note

Note that this section is valid only for a setup that does not use vnodes. If you are using Cassandra Version 1.2 or later with the default settings, you can skip this section.

This section is specifically for a cluster that uses one token per Cassandra instance.

Anytime a new node is added or an existing node is decommissioned, the token distribution gets skewed. Normally, you want Cassandra to stay fairly load balanced to avoid hotspots. Fortunately, load balancing is very easy. The two-step process is as follows:

  1. Calculate the initial tokens based on the partitioner that you are using. They can be generated manually by dividing the partitioner's token range equally among the nodes.

    If you are using RandomPartitioner, you can use tools/bin/token-generator to generate tokens for you. For example, the following command generates the tokens for two data centers; each has three nodes:

    $ tools/bin/token-generator 3 3
    DC #1:
      Node #1:  0
      Node #2:  56713727820156410577229101238628035242
      Node #3:  113427455640312821154458202477256070484
    
    DC #2:
      Node #1:  169417178424467235000914166253263322299
      Node #2:  55989722784154413846455963776007251813
      Node #3:  112703450604310824423685065014635287055
    

    For Murmur3Partitioner (the default partitioner since Cassandra Version 1.2), you can use this Python script:

    python -c 'print([str((2**64 // nodes_count) * i - 2**63) for i in range(nodes_count)])'
    

    For example, if you have six nodes, run the following command:

    python -c 'print([str((2**64 // 6) * i - 2**63) for i in range(6)])'
    
    ['-9223372036854775808', '-6148914691236517206', '-3074457345618258604', '-2', '3074457345618258600', '6148914691236517202']
    

    If you have nodes distributed across two data centers, you may want to assign alternate tokens to each data center:

    DC #1:
      -9223372036854775808 -3074457345618258604 3074457345618258600
    DC #2:
      -6148914691236517206 -2 6148914691236517202
  2. Now that we have tokens, we need to call the following command:
    bin/nodetool -h <node_to_move> move <token_number>
    

The trick here is to assign each new token to the node whose current token is closest to it. This allows faster balancing, as there is less data to move. A live example of how load balancing is done is covered in the Adding nodes to a cluster section, where we add a node to the cluster, which makes the cluster lopsided. We finally balance it by moving tokens around.
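
The closest-token trick can be sketched as a small greedy helper (a hypothetical `plan_moves` function for illustration, not the cassandra-balancer algorithm): given each node's current token and the balanced target tokens, pair every node with the nearest unclaimed target so that each move shifts as little data as possible:

```python
def plan_moves(current_tokens, target_tokens):
    """Greedily assign each node the nearest remaining target token.

    current_tokens: dict mapping node -> its current token
    target_tokens:  list of balanced tokens to distribute
    Returns a dict of node -> new token, only for nodes that must move.
    """
    remaining = list(target_tokens)
    moves = {}
    # Walk the nodes in ring order so neighbours compete for nearby targets.
    for node, token in sorted(current_tokens.items(), key=lambda kv: kv[1]):
        nearest = min(remaining, key=lambda t: abs(t - token))
        remaining.remove(nearest)
        if nearest != token:
            moves[node] = nearest
    return moves
```

Each entry in the result corresponds to one `bin/nodetool -h <node> move <token>` invocation; nodes already sitting on a target token are left alone.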

It is actually very easy to write a shell or Python script that reads the ring and then balances it automatically. If you are using RandomPartitioner, there is a GitHub project, Cassandra-Balancer (https://github.com/tivv/cassandra-balancer), which calculates the tokens for a node and moves the data. So, instead of writing your own, you can just use this Groovy script. Execute it on each node, one by one, and you are done.
