CRUSH map internals

To see what is inside a CRUSH map, and to edit it easily, we need to extract it from the cluster and decompile it into a human-readable form. The following diagram illustrates this process:

[Diagram: extracting the CRUSH map, decompiling and editing it, then recompiling and injecting it back into the cluster]

Changes made through the CRUSH map are dynamic: once the new CRUSH map is injected into the Ceph cluster, all the changes take effect immediately, on the fly.

How to do it…

We will now take a look at the CRUSH map of our Ceph cluster:

  1. Extract the CRUSH map from any of the monitor nodes:
    # ceph osd getcrushmap -o crushmap_compiled_file
    
  2. Once you have the CRUSH map, decompile it to convert it into a human-readable/editable form:
    # crushtool -d crushmap_compiled_file -o crushmap_decompiled_file
    

    At this point, the output file, crushmap_decompiled_file, can be viewed/edited in your favorite editor. In the next recipe, we will learn how to perform changes to the CRUSH map.

  3. Once the changes are done, compile the modified CRUSH map:
    # crushtool -c crushmap_decompiled_file -o newcrushmap
    
  4. Finally, inject the newly compiled CRUSH map into the Ceph cluster:
    # ceph osd setcrushmap -i newcrushmap
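
    A compiled map can also be sanity-checked with crushtool before it is injected. The rule number and replica count here are illustrative; adjust them to match your pools:
    # crushtool -i newcrushmap --test --show-statistics --rule 0 --num-rep 3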
    

How it works…

Now that we know how to edit the Ceph CRUSH map, let's understand what's inside the CRUSH map. A CRUSH map file contains four main sections; they are as follows:

  • Devices: This section of the CRUSH map lists all the OSD devices in your cluster. An OSD is a physical disk corresponding to a ceph-osd daemon. To map PGs to OSD devices, CRUSH requires this list of devices, which appears at the beginning of the CRUSH map to declare each device. The following is a sample device list:
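    (The device IDs below are illustrative; a decompiled map lists one such line per OSD in your cluster.)
    # devices
    device 0 osd.0
    device 1 osd.1
    device 2 osd.2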
  • Bucket types: This section defines the types of buckets used in your CRUSH hierarchy. Buckets are a hierarchical aggregation of physical locations (for example, rows, racks, chassis, hosts, and so on) with assigned weights. They form a hierarchy of nodes and leaves, where a node bucket represents a physical location and can aggregate other node or leaf buckets beneath it, and a leaf bucket represents a ceph-osd daemon and its underlying physical device. The following table lists the default bucket types:

    Number   Bucket        Description
    0        OSD           An OSD daemon (for example, osd.1, osd.2, and so on).
    1        Host          A host name containing one or more OSDs.
    2        Rack          A computer rack containing one or more hosts.
    3        Row           A row in a series of racks.
    4        Room          A room containing racks and rows of hosts.
    5        Data Center   A physical data center containing rooms.
    6        Root          The beginning of the bucket hierarchy.

    CRUSH also supports custom bucket types: the default bucket types can be deleted and new types can be introduced as per your needs.
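
    In a decompiled CRUSH map, these bucket types appear in a types section similar to the following; the exact names and numbering vary between Ceph releases:
    # types
    type 0 osd
    type 1 host
    type 2 rack
    type 3 row
    type 4 room
    type 5 datacenter
    type 6 root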

  • Bucket instances: Once you define bucket types, you must declare bucket instances for your hosts. A bucket instance requires a bucket type, a unique name (string), a unique ID expressed as a negative integer, a weight relative to the total capacity of its items, a bucket algorithm (straw, by default), and a hash (0, by default, reflecting the CRUSH hash rjenkins1). A bucket may have one or more items, and these items may consist of other buckets or OSDs; each item has a weight that reflects its relative capacity. The general syntax of a bucket instance looks like the following:
    [bucket-type] [bucket-name] {
      id [a unique negative numeric ID]
      weight [the relative capacity of the items]
      alg [the bucket algorithm: uniform | list | tree | straw | straw2]
      hash [the hash type: 0 by default]
      item [item-name] weight [weight]
    }

    We will now briefly cover the parameters used by the CRUSH bucket instance:

    • bucket-type: The type of the bucket; it specifies where the bucket, and ultimately its OSDs, sits in the CRUSH hierarchy.
    • bucket-name: A unique bucket name.
    • id: The unique ID, expressed as a negative integer.
    • weight: Ceph writes data evenly across the cluster disks, which helps performance and data distribution. This requires all the disks to participate in the cluster and ensures that they are utilized proportionally, irrespective of their capacity. To do so, Ceph uses a weighting mechanism: CRUSH allocates a weight to each OSD, and the higher the weight of an OSD, the more physical storage capacity it represents. A weight expresses the relative difference between device capacities. We recommend using 1.00 as the relative weight for a 1 TB storage device; similarly, a weight of 0.50 would represent approximately 500 GB, and a weight of 3.00 would represent approximately 3 TB.
    • alg: Ceph supports multiple algorithm bucket types for your selection. These algorithms differ from each other on the basis of performance and reorganizational efficiency. Let's briefly cover these bucket types:
      • Uniform: The uniform bucket can be used if the storage devices have exactly the same weight. For non-uniform weights, this bucket type should not be used. The addition or removal of devices in this bucket type requires the complete reshuffling of data, which makes this bucket type less efficient.
      • List: List buckets aggregate their contents as linked lists and can contain storage devices with arbitrary weights. In the case of cluster expansion, new storage devices can be added to the head of a linked list with minimum data migration. However, storage device removal requires a significant amount of data movement. So, this bucket type is suitable for scenarios under which the addition of new devices to the cluster is extremely rare or non-existent. In addition, list buckets are efficient for small sets of items, but they may not be appropriate for large sets.
      • Tree: Tree buckets store their items in a binary tree. They are more efficient than list buckets when a bucket contains a larger set of items. Tree buckets are structured as a weighted binary search tree with items at the leaves. Each interior node knows the total weight of its left and right subtrees and is labeled according to a fixed strategy. Tree buckets are an all-around boon, providing excellent performance and decent reorganization efficiency.
      • Straw: To select an item using List and Tree buckets, a limited number of hash values need to be calculated and compared by weight. They use a divide and conquer strategy, which gives precedence to certain items (for example, those at the beginning of a list). This improves the performance of the replica placement process but introduces moderate reorganization when bucket contents change due to addition, removal, or re-weighting.

        The straw bucket type allows all items to compete fairly against each other for replica placement. In a scenario where removal is expected and reorganization efficiency is critical, straw buckets provide optimal migration behavior between subtrees. This bucket type allows all items to fairly "compete" against each other for replica placement through a process analogous to a draw of straws.

      • Straw2: This is an improved straw bucket that correctly avoids any data movement between items A and B, when neither A's nor B's weights are changed. In other words, if we adjust the weight of item C by adding a new device to it, or by removing it completely, the data movement will take place to or from C, never between other items in the bucket. Thus, the straw2 bucket algorithm reduces the amount of data migration required when changes are made to the cluster.
    • hash: Each bucket uses a hash algorithm. Currently, Ceph supports rjenkins1. Enter 0 as your hash setting to select rjenkins1.
    • item: A bucket may have one or more items. These items may consist of node buckets or leaves. Items may have a weight that reflects the relative weight of the item.
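
    Following the 1.00-per-TB convention described above, an OSD backed by a 2 TB device would carry a weight of about 2.00. Besides editing the corresponding item line in the decompiled map, an OSD's CRUSH weight can also be adjusted at runtime; the OSD name and weight here are illustrative:
    # ceph osd crush reweight osd.0 2.0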

    The following snippet illustrates CRUSH bucket instances. Here, we have three host bucket instances; each host bucket contains OSD buckets as items:

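    This is a minimal sketch of what such bucket instances might look like in a decompiled map; the host names, OSD numbers, and weights are illustrative:
    host ceph-node1 {
      id -2
      alg straw
      hash 0
      item osd.0 weight 1.000
      item osd.1 weight 1.000
    }
    host ceph-node2 {
      id -3
      alg straw
      hash 0
      item osd.2 weight 1.000
      item osd.3 weight 1.000
    }
    host ceph-node3 {
      id -4
      alg straw
      hash 0
      item osd.4 weight 1.000
      item osd.5 weight 1.000
    }
    root default {
      id -1
      alg straw
      hash 0
      item ceph-node1 weight 2.000
      item ceph-node2 weight 2.000
      item ceph-node3 weight 2.000
    }
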
  • Rules: The CRUSH map contains CRUSH rules that determine data placement for pools. As the name suggests, these rules define, for each pool, the way data gets stored: the replication and placement policy that allows CRUSH to store objects in a Ceph cluster. The default CRUSH map contains a rule for the default pool, that is, rbd. The general syntax of a CRUSH rule looks like this:
    rule <rulename> {
        ruleset <ruleset>
        type [replicated | erasure]
        min_size <min-size>
        max_size <max-size>
        step take <bucket-name>
        step [choose|chooseleaf] firstn <num> type <bucket-type>
        step emit
    }

    We will now briefly cover these parameters used by the CRUSH rule:

    • ruleset: An integer value; it classifies a rule as belonging to a set of rules.
    • type: A string value; it's the type of pool that is either replicated or erasure coded.
    • min_size: An integer value; if a pool makes fewer replicas than this number, CRUSH will not select this rule.
    • max_size: An integer value; if a pool makes more replicas than this number, CRUSH will not select this rule.
    • step take: This takes a bucket name and begins iterating down the tree.
    • step choose firstn {num} type {bucket-type}: This selects the number (N) of buckets of a given type, where the number (N) is usually the number of replicas in the pool (that is, pool size):
      • If num == 0, select N buckets
      • If num > 0 && < N, select num buckets
      • If num < 0, select N - num buckets

      Example: step choose firstn 1 type row

      In this example, num=1, and let's suppose the pool size is 3, then CRUSH will evaluate this condition as 1 > 0 && < 3. Hence, it will select 1 row type bucket.

    • step chooseleaf firstn {num} type {bucket-type}: This first selects a set of buckets of a bucket type, and then chooses the leaf node from the subtree of each bucket in the set of buckets. The number of buckets in the set (N) is usually the number of replicas in the pool:
      • If num == 0, select N buckets
      • If num > 0 && < N, select num buckets
      • If num < 0, select N - num buckets

      Example: step chooseleaf firstn 0 type row

      In this example, num=0, and let's suppose the pool size is 3, then CRUSH will evaluate this condition as 0 == 0, and then select a row type bucket set, such that the set contains 3 buckets. Then it will choose the leaf node from the subtree of each bucket. In this way, CRUSH will select 3 leaf nodes.

    • step emit: This outputs the current value and empties the stack. It is typically used at the end of a rule, but may also be used to form different trees in the same rule.
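
    For reference, the rule for the default rbd pool on a freshly deployed cluster typically looks like the following; the rule and bucket names may differ on your cluster:
    rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
    }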