To see what is inside a CRUSH map, and to edit it easily, we need to extract it and decompile it into a human-readable form. The following diagram illustrates this process:
Changes made via the CRUSH map are dynamic; that is, once a new CRUSH map is injected into the Ceph cluster, all the changes come into effect immediately, on the fly.
We will now take a look at the CRUSH map of our Ceph cluster:
# ceph osd getcrushmap -o crushmap_compiled_file
# crushtool -d crushmap_compiled_file -o crushmap_decompiled_file
At this point, the output file, crushmap_decompiled_file, can be viewed and edited in your favorite editor. In the next recipe, we will learn how to perform changes to the CRUSH map. Once you have made your changes, compile the modified file and inject it back into the Ceph cluster:
# crushtool -c crushmap_decompiled_file -o newcrushmap
# ceph osd setcrushmap -i newcrushmap
Now that we know how to edit the Ceph CRUSH map, let's understand what's inside the CRUSH map. A CRUSH map file contains four main sections; they are as follows:
Devices: each OSD in the cluster is backed by a ceph-osd daemon. To map PGs to OSD devices, CRUSH requires a list of OSD devices; this list appears at the beginning of the CRUSH map to declare the devices. The following is a sample device list:
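A hedged sketch of such a devices section, assuming a small cluster with three OSDs (the device IDs and OSD names are illustrative):

```
# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
```

Each line binds a numeric device ID to the name of the corresponding ceph-osd daemon.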
Bucket types: these define the CRUSH hierarchy, which consists of nodes and leaves, where a node bucket represents a physical location and can aggregate other node and leaf buckets under it in the hierarchy. A leaf bucket represents a ceph-osd daemon and its underlying physical device. The following table lists the default bucket types:
Number | Bucket | Description
---|---|---
0 | OSD | An OSD daemon (for example, osd.0).
1 | Host | A host name containing one or more OSDs.
2 | Rack | A computer rack containing one or more hosts.
3 | Row | A row in a series of racks.
4 | Room | A room containing racks and rows of hosts.
5 | Data Center | A physical data center containing rooms.
6 | Root | The beginning of the bucket hierarchy.
CRUSH also supports the creation of custom bucket types; the default bucket types can be deleted and new types introduced as per your needs.
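In the decompiled CRUSH map, the bucket types appear as a types section. A sketch mirroring the default types in the table above, plus one hypothetical custom type (type 7 region is an illustrative addition, not a default):

```
# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 root
type 7 region    # hypothetical custom bucket type
```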
[bucket-type] [bucket-name] {
    id [a unique negative numeric ID]
    weight [the relative capacity of the item]
    alg [the bucket algorithm: uniform | list | tree | straw | straw2]
    hash [the hash type: 0 by default]
    item [item-name] weight [weight]
}
We will now briefly cover the parameters used by the CRUSH bucket instance:
bucket-type: The type of the bucket, where we must specify the OSD's location in the CRUSH hierarchy.

bucket-name: A unique bucket name.

id: A unique ID, expressed as a negative integer.

weight: Ceph writes data evenly across the cluster's disks, which helps performance and data distribution. This forces all the disks to participate in the cluster and makes sure that all cluster disks are equally utilized, irrespective of their capacity. To do so, Ceph uses a weighting mechanism: CRUSH allocates a weight to each OSD, and the higher the weight of an OSD, the more physical storage capacity it represents. A weight is the relative difference between device capacities. We recommend using 1.00 as the relative weight for a 1 TB storage device. Similarly, a weight of 0.50 would represent approximately 500 GB, and a weight of 3.00 would represent approximately 3 TB.

alg: Ceph supports multiple bucket algorithms, which differ from each other in performance and reorganization efficiency. The straw bucket type allows all items to "compete" fairly against each other for replica placement, through a process analogous to a draw of straws. In scenarios where item removal is expected and reorganization efficiency is critical, straw buckets provide optimal migration behavior between subtrees.

hash: Each bucket uses a hash algorithm; currently, Ceph supports rjenkins1. Enter 0 as your hash setting to select rjenkins1.

item: A bucket may have one or more items, and these items may consist of node buckets or leaves. Items may have a weight that reflects the relative weight of the item.

The following screenshot illustrates CRUSH bucket instances. Here, we have three host bucket instances, each consisting of OSD buckets:
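As a hedged sketch of such bucket instances, assuming three hosts with one 1 TB OSD each (host names, IDs, and weights are illustrative):

```
host ceph-node1 {
    id -2
    alg straw
    hash 0    # rjenkins1
    item osd.0 weight 1.000
}
host ceph-node2 {
    id -3
    alg straw
    hash 0
    item osd.1 weight 1.000
}
host ceph-node3 {
    id -4
    alg straw
    hash 0
    item osd.2 weight 1.000
}
root default {
    id -1
    alg straw
    hash 0
    # each host's weight is the sum of its item weights
    item ceph-node1 weight 1.000
    item ceph-node2 weight 1.000
    item ceph-node3 weight 1.000
}
```

Note how the root bucket aggregates the host buckets, and each host's weight reflects the total capacity of the OSDs beneath it.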
rbd. The general syntax of a CRUSH rule looks like this:

rule <rulename> {
    ruleset <ruleset>
    type [replicated | erasure]
    min_size <min-size>
    max_size <max-size>
    step take <bucket-name>
    step [choose|chooseleaf] [firstn] <num> <bucket-type>
    step emit
}
We will now briefly cover these parameters used by the CRUSH rule:
ruleset: An integer value; it classifies a rule as belonging to a set of rules.

type: A string value; the type of pool, either replicated or erasure coded.

min_size: An integer value; if a pool makes fewer replicas than this number, CRUSH will not select this rule.

max_size: An integer value; if a pool makes more replicas than this number, CRUSH will not select this rule.

step take: This takes a bucket name and begins iterating down the tree.

step choose firstn {num} type {bucket-type}: This selects a number (N) of buckets of the given type, where N is usually the number of replicas in the pool (that is, the pool size):

If num == 0, select N buckets
If num > 0 && num < N, select num buckets
If num < 0, select N - num buckets
Example: step choose firstn 1 type row

In this example, num = 1; suppose the pool size is 3. CRUSH evaluates the condition num > 0 && num < 3, which is true, so it selects 1 row-type bucket.
step chooseleaf firstn {num} type {bucket-type}: This first selects a set of buckets of the given bucket type, and then chooses a leaf node from the subtree of each bucket in that set. The number of buckets in the set (N) is usually the number of replicas in the pool:

If num == 0, select N buckets
If num > 0 && num < N, select num buckets
If num < 0, select N - num buckets
Example: step chooseleaf firstn 0 type row

In this example, num = 0; suppose the pool size is 3. CRUSH evaluates the condition num == 0, so it selects a set of 3 row-type buckets and then chooses a leaf node from the subtree of each bucket in the set. In this way, CRUSH selects 3 leaf nodes.
step emit: This outputs the current value and empties the stack. It is typically used at the end of a rule, but may also be used to form different trees in the same rule.
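Putting these parameters together, a hedged sketch of a replicated rule that places each replica on a distinct host might look like this (the rule name, ruleset number, and size limits are illustrative):

```
rule replicated_ruleset {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default                      # begin at the root bucket named "default"
    step chooseleaf firstn 0 type host     # num == 0: select pool-size host buckets,
                                           # then one leaf (OSD) from each host's subtree
    step emit                              # output the selected OSDs
}
```

Because the step uses chooseleaf with type host, no two replicas of a placement group end up on the same host.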