In this recipe, we will learn some performance tuning parameters for the Ceph cluster. These cluster-wide configuration parameters are defined in the Ceph configuration file so that each time any Ceph daemon starts, it respects the defined settings. By default, the configuration file is named ceph.conf and is located in the /etc/ceph directory. This configuration file has a global section as well as several sections for each service type. Whenever a Ceph service type starts, it applies the configuration defined under the [global] section as well as its daemon-specific section. A Ceph configuration file has multiple sections, as shown in the following screenshot:
We will now discuss the role of each section of the configuration file.
[global]: This section of the configuration file is defined with the [global] keyword. All the settings defined under this section apply to all the daemons of the Ceph cluster. The following is an example of a parameter defined under the [global] section:

public network = 192.168.0.0/24
[mon]: The settings defined under the [mon] section of the config file are applied to all the Ceph monitor daemons in the cluster. The parameters defined under this section override the parameters defined under the [global] section. The following is an example of a parameter usually defined under the [mon] section:

mon initial members = ceph-mon1
[osd]: The settings defined under the [osd] section are applied to all the Ceph OSD daemons in the cluster. The configuration defined under this section overrides the same setting defined under the [global] section. The following is an example of the settings in this section:

osd mkfs type = xfs
[mds]: The settings defined under the [mds] section are applied to all the Ceph MDS daemons in the cluster. The configuration defined under this section overrides the same setting defined under the [global] section. The following is an example of the settings in this section:

mds cache size = 250000
[client]: The settings defined under the [client] section are applied to all the Ceph clients. The configuration defined under this section overrides the same setting defined under the [global] section. The following is an example of the settings in this section:

rbd cache size = 67108864
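Putting the sections together, a minimal ceph.conf might look like the following sketch; it simply collects the example values from this recipe, so treat them as placeholders rather than recommendations:

```ini
[global]
# Applied to every daemon unless overridden in a daemon-specific section.
public network = 192.168.0.0/24

[mon]
# Applies only to monitor daemons, overriding [global] where they overlap.
mon initial members = ceph-mon1

[osd]
osd mkfs type = xfs

[mds]
mds cache size = 250000

[client]
rbd cache size = 67108864
```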
In the next recipe, we will learn some tips for performance tuning of the Ceph cluster. Performance tuning is a vast topic that requires understanding of Ceph, as well as other components of your storage stack. There is no silver bullet for performance tuning. It depends a lot on the underlying infrastructure and your environment.
The global parameters should be defined under the [global]
section of your Ceph cluster configuration file:
network: It's recommended that you use two physically separated networks for your Ceph cluster, which are referred to as the public and cluster networks respectively. Earlier in this chapter, we covered the need for two different networks. Let's now understand how we can define them in a Ceph configuration:

public network = {public network / netmask}

For example:

public network = 192.168.100.0/24

cluster network = {cluster network / netmask}

For example:

cluster network = 192.168.1.0/24
max open files: Setting this parameter ensures that when the Ceph cluster starts, the maximum number of open file descriptors is set at the OS level, which keeps the OSD daemons from running out of file descriptors. The default value of this parameter is 0; you can set it to up to a 64-bit integer:

max open files = 131072
osd pool default min size: This is the replication level in a degraded state. It sets the minimum number of replicas for the objects in a pool in order to acknowledge a write operation from clients. The default value is 0:

osd pool default min size = 1
osd pool default pg num and osd pool default pgp num: Make sure that the cluster has a realistic number of placement groups. The recommended value of placement groups per OSD is 100. Use this formula to calculate the PG count:

(Total number of OSDs * 100) / number of replicas

For 10 OSDs and a replica size of 3, the PG count should be under (10 * 100) / 3 = 333:

osd pool default pg num = 128
osd pool default pgp num = 128
As explained earlier, the PG and PGP numbers should be kept the same. The PG and PGP values vary a lot depending on the cluster size. The previously mentioned configurations should not harm your cluster, but you may want to rethink before applying these values. You should know that these parameters do not change the PG and PGP numbers for existing pools; they are applied only when you create a new pool without specifying the PG and PGP values.
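The formula above can be sanity checked with plain shell arithmetic; the 10 OSDs and replica size of 3 are the example figures from the text, not recommendations:

```shell
# PG count = (total OSDs * 100) / number of replicas
osds=10
replicas=3
pg_count=$(( (osds * 100) / replicas ))
echo "$pg_count"   # prints 333 (keep the pool's PG count under this)
```

In practice, the result is usually rounded down to a nearby power of two, which is where values such as 128 or 256 come from.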
osd pool default min size: This is the replication level in a degraded state, which should be set lower than the osd pool default size value. It sets the minimum number of replicas for an object in a pool so that Ceph can acknowledge the write operation even if the cluster is degraded. If the number of available replicas falls below min size, Ceph will not acknowledge the write to the client:

osd pool default min size = 1
osd pool default crush rule: The default CRUSH ruleset to use when creating a pool:

osd pool default crush rule = 0
debug <subsystem> = <log-level>/<memory-level>: Ceph debug settings take two values, the file log level and the in-memory log level. The default logging levels are good enough for your cluster, unless you see that the in-memory logs are impacting your performance or memory consumption. In such a case, you can try to disable the in-memory logging. To disable the default in-memory logs, set both levels to 0 by adding the following parameters:
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcacher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0
The monitor tuning parameters should be defined under the [mon]
section of your Ceph cluster configuration file:
mon osd down out interval: This is the number of seconds Ceph waits before marking a Ceph OSD daemon down and out if it doesn't respond. This option comes in handy when your OSD nodes crash and reboot by themselves, or after some short glitch in the network. You don't want your cluster to start rebalancing as soon as the problem occurs; rather, you want it to wait a few minutes and see whether the problem gets fixed:

mon_osd_down_out_interval = 600
mon allow pool delete: To avoid the accidental deletion of a Ceph pool, set this parameter to false. This can be useful if you have many administrators managing the Ceph cluster and you do not want to take any risk with client data:

mon_allow_pool_delete = false
mon osd min down reporters: A Ceph OSD daemon can report to the monitors about its peer OSDs if they are down; by default, this value is 1. With this option, you can change the minimum number of Ceph OSD daemons required to report a down Ceph OSD to the Ceph monitors. In a large cluster, it's recommended that you set this value higher than the default; 3 should be a good number:

mon_osd_min_down_reporters = 3
In this recipe, we will understand the general OSD tuning parameters that should be defined under the [osd]
section of your Ceph cluster configuration file.
The following settings allow the Ceph OSD daemon to determine the filesystem type, mount options, as well as some other useful settings:
osd mkfs options xfs: At the time of OSD creation, Ceph will use these xfs options to create the OSD filesystem:

osd_mkfs_options_xfs = "-f -i size=2048"

osd mount options xfs: This supplies the xfs filesystem mount options to the OSD. When Ceph is mounting an OSD, it will use the following options for the OSD filesystem mount:

osd_mount_options_xfs = "rw,noatime,inode64,logbufs=8,logbsize=256k,delaylog,allocsize=4M"
osd max write size: The maximum size, in megabytes, an OSD can write at a time:

osd_max_write_size = 256

osd client message size cap: The largest client data message, in bytes, that is allowed in memory:

osd_client_message_size_cap = 1073741824

osd map dedup: This removes duplicate entries in the OSD map:

osd_map_dedup = true
osd op threads: The number of threads used to service Ceph OSD daemon operations. Set it to 0 to disable threading. Increasing the number may increase the request-processing rate:

osd_op_threads = 16

osd disk threads: The number of disk threads used to perform background disk-intensive OSD operations such as scrubbing and snap trimming:

osd_disk_threads = 1
osd disk thread ioprio class: This is used in conjunction with osd_disk_thread_ioprio_priority. This tunable changes the I/O scheduling class of the disk thread, and it only works with the Linux kernel CFQ scheduler. The possible values are idle, be, or rt:

idle: The disk thread will have a lower priority than any other thread in the OSD. This is useful when you want to slow down scrubbing on an OSD that is busy handling client requests.

be: The disk threads have the same priority as the other threads in the OSD.

rt: The disk thread will have a higher priority than all the other threads. This is useful when scrubbing is much needed and can be prioritized at the expense of client operations.

osd_disk_thread_ioprio_class = idle
osd disk thread ioprio priority: This is used in conjunction with osd_disk_thread_ioprio_class. This tunable changes the I/O scheduling priority of the disk thread, ranging from 0 (highest) to 7 (lowest). If all the OSDs on a given host are in the idle class and are competing for I/O while not doing many operations, this parameter can be used to lower the disk thread priority of one OSD to 7 so that another OSD with priority 0 can potentially scrub faster. Like osd_disk_thread_ioprio_class, this only works with the Linux kernel CFQ scheduler:

osd_disk_thread_ioprio_priority = 0
The Ceph OSD daemons support the following journal configurations:
osd journal size: Ceph's default osd journal size value is 0; you should use the osd_journal_size parameter to set the journal size. The journal size should be at least twice the product of the expected drive speed and filestore max sync interval. If you are using SSD journals, it's usually good to create journals larger than 10 GB and to increase the filestore min/max sync intervals:

osd_journal_size = 20480
journal max write bytes: The maximum number of bytes the journal can write at once:

journal_max_write_bytes = 1073741824

journal max write entries: The maximum number of entries the journal can write at once:

journal_max_write_entries = 10000

journal queue max ops: The maximum number of operations allowed in the journal queue at a given time:

journal_queue_max_ops = 50000

journal queue max bytes: The maximum number of bytes allowed in the journal queue at a given time:

journal_queue_max_bytes = 10485760000
journal dio: This enables direct I/O to the journal. It requires journal block align to be set to true:

journal_dio = true

journal aio: This enables using libaio for asynchronous writes to the journal. It requires journal dio to be set to true:

journal_aio = true
journal block align: This block-aligns write operations. It's required for dio and aio.

These are a few filestore settings that can be configured for the Ceph OSD daemons:
filestore merge threshold: The minimum number of files in a subdirectory before it is merged into its parent directory; a negative value disables subdirectory merging:

filestore_merge_threshold = 40
filestore split multiple: The maximum number of files in a subdirectory before splitting it into child directories:

filestore_split_multiple = 8

filestore op threads: The number of filesystem operation threads that execute in parallel:

filestore_op_threads = 32

filestore xattr use omap: This uses the object map for XATTRs (extended attributes). It needs to be set to true for ext4 filesystems:

filestore_xattr_use_omap = true
filestore sync interval: In order to create a consistent commit point, the filestore needs to quiesce write operations and do a syncfs() operation, which syncs data from the journal to the data partition and thus frees the journal. More frequent sync operations reduce the amount of data stored in the journal, at the cost of leaving the journal underutilized. Configuring less frequent syncs allows the filesystem to coalesce small writes better, and you might get improved performance. The following parameters define the minimum and maximum time periods between two syncs:

filestore_min_sync_interval = 10
filestore_max_sync_interval = 15
filestore queue max ops: The maximum number of operations that the filestore can accept before blocking new operations from joining the queue:

filestore_queue_max_ops = 2500

filestore queue max bytes: The maximum number of bytes in an operation:

filestore_queue_max_bytes = 10485760

filestore queue committing max ops: The maximum number of operations the filestore can commit:

filestore_queue_committing_max_ops = 5000

filestore queue committing max bytes: The maximum number of bytes the filestore can commit:

filestore_queue_committing_max_bytes = 10485760000
These settings should be used when you want performance over recovery or vice versa. If your Ceph cluster is unhealthy and is under recovery, you might not get its usual performance, as OSDs will be busy with recovery. If you still prefer performance over recovery, you can reduce the recovery priority to keep OSDs less occupied with recovery. You can also set these values if you want a quick recovery for your cluster, helping OSDs to perform recovery faster.
osd recovery max active: The number of active recovery requests per OSD at a given moment:

osd_recovery_max_active = 1

osd recovery max single start: This is used in conjunction with osd_recovery_max_active. To understand this, let's assume that osd_recovery_max_single_start is equal to 1 and osd_recovery_max_active is equal to 3. In this case, the OSD will start a maximum of one recovery operation at a time, out of a total of three operations active at that time:

osd_recovery_max_single_start = 1

osd recovery op priority: This is the priority set for recovery operations, relative to the priority of client operations. The higher the number, the higher the recovery priority:

osd_recovery_op_priority = 50

osd recovery max chunk: The maximum size of a recovered chunk of data, in bytes:

osd_recovery_max_chunk = 1048576

osd recovery threads: The number of threads needed for recovering data:

osd_recovery_threads = 1
OSD backfilling settings allow Ceph to set backfilling operations at a lower priority than read and write requests:
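The text does not enumerate the backfill parameters themselves; the commonly tuned ones are sketched below (these are real Ceph options, but the values shown are illustrative defaults, not recommendations from this recipe):

```ini
[osd]
osd_max_backfills = 1        # concurrent backfill operations allowed to or from one OSD
osd_backfill_scan_min = 64   # minimum number of objects per backfill scan
osd_backfill_scan_max = 512  # maximum number of objects per backfill scan
```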
OSD Scrubbing is important for maintaining data integrity, but it can reduce performance. You can adjust the following settings to increase or decrease scrubbing operations:
osd max scrubs: The maximum number of simultaneous scrub operations for a Ceph OSD daemon:

osd_max_scrubs = 1

osd scrub sleep: The time, in seconds, that scrubbing sleeps between two consecutive scrubs:

osd_scrub_sleep = .1

osd scrub chunk min: The minimum number of data chunks an OSD should perform scrubbing on:

osd_scrub_chunk_min = 1

osd scrub chunk max: The maximum number of data chunks an OSD should perform scrubbing on:

osd_scrub_chunk_max = 5
osd deep scrub stride: The read size, in bytes, to use while doing a deep scrub:

osd_deep_scrub_stride = 1048576

osd scrub begin hour: The earliest hour that scrubbing can begin. This is used in conjunction with osd_scrub_end_hour to define a scrubbing time window:

osd_scrub_begin_hour = 19

osd scrub end hour: This is the upper bound on when scrubbing can be performed. This works in conjunction with osd_scrub_begin_hour to define a scrubbing time window:

osd_scrub_end_hour = 7
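Note that a window defined as 19 to 7 wraps around midnight, so with the example values scrubbing is confined to roughly the overnight hours:

```ini
[osd]
osd_scrub_begin_hour = 19  # scrubs may start from 7 PM onward...
osd_scrub_end_hour = 7     # ...but not after 7 AM the next morning
```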
The client tuning parameters should be defined under the [client]
section of your Ceph configuration file. Usually, this [client]
section should also be present in the Ceph configuration file hosted on the client node:
rbd cache: This enables caching for the RADOS Block Device (RBD):

rbd_cache = true

rbd cache writethrough until flush: This starts the cache out in writethrough mode and switches to writeback after the first flush request is received:

rbd_cache_writethrough_until_flush = true

rbd concurrent management ops: The number of concurrent management operations that can be performed on an rbd image:

rbd_concurrent_management_ops = 10
rbd cache size: The rbd cache size, in bytes:

rbd_cache_size = 67108864 # 64 MB

rbd cache max dirty: The limit, in bytes, at which the cache should trigger a writeback. It should be less than rbd_cache_size:

rbd_cache_max_dirty = 50331648 # 48 MB

rbd cache target dirty: The dirty target before the cache begins writing data to the backing store:

rbd_cache_target_dirty = 33554432 # 32 MB

rbd cache max dirty age: The number of seconds that dirty data can be in the cache before writeback starts:

rbd_cache_max_dirty_age = 2

rbd default format: This uses the second rbd format, which has been supported by librbd and the Linux kernel since version 3.11. It adds support for cloning and is more easily extensible, allowing more features in the future:

rbd_default_format = 2
In the last recipe, we covered the tuning parameters for Ceph MON, OSD, and clients. In this recipe, we will cover a few general tuning parameters that can be applied at the operating system level.
kernel pid max: This is a Linux kernel parameter that governs the maximum number of threads and process IDs. By default, the Linux kernel has a relatively small kernel.pid_max value. You should configure this parameter with a higher value on Ceph nodes hosting several OSDs, typically more than 20 OSDs. This setting helps spawn multiple threads for faster recovery and rebalancing. To set this parameter, execute the following command as the root user:

# echo 4194303 > /proc/sys/kernel/pid_max
file max: This is the maximum number of open files on a Linux system. It's generally a good idea to have a larger value for this parameter:

# echo 26234859 > /proc/sys/fs/file-max
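Values echoed into /proc are lost at reboot; to make them persistent, the standard approach is to add the equivalent keys to /etc/sysctl.conf (or a file under /etc/sysctl.d/) and reload with sysctl -p:

```ini
# /etc/sysctl.conf
kernel.pid_max = 4194303
fs.file-max = 26234859
```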
Jumbo frames should be enabled on the host as well as on the network switch side; otherwise, a mismatch in the MTU size will result in packet loss. To enable jumbo frames on the interface eth0, execute the following command:
# ifconfig eth0 mtu 9000
Similarly, you should do this for other interfaces that are participating in the Ceph networks. To make this change permanent, you should add this configuration in the interface configuration file.
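As a sketch, on a Red Hat-style system using network-scripts (an assumption; the exact file and syntax differ by distribution and network tooling), the permanent MTU setting is a single line in the interface configuration file:

```ini
# /etc/sysconfig/network-scripts/ifcfg-eth0
MTU=9000
```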
Disk read_ahead: The read_ahead parameter speeds up disk read operations by prefetching data and loading it into random access memory. Setting a relatively high value for read_ahead will benefit clients performing sequential read operations. Let's assume that disk vda is an RBD that is mounted on a client node. Use the following command to check its read_ahead value, which is the default in most cases:
# cat /sys/block/vda/queue/read_ahead_kb
To set read_ahead to a higher value, that is, 8 MB for the vda RBD, execute the following command:
# echo "8192" > /sys/block/vda/queue/read_ahead_kb
The read_ahead settings are used on Ceph clients that mount RBDs. To get a read performance boost, you can set read_ahead to several MB on all of your RBD devices, depending on your hardware.
swappiness: A low swappiness value is recommended for high-I/O workloads. Set vm.swappiness to zero in /etc/sysctl.conf to prevent swapping:

# echo "vm.swappiness=0" >> /etc/sysctl.conf
min_free_kbytes: This sets the minimum number of kilobytes to keep free across the system. You can keep 1 to 3% of the total system memory free with min_free_kbytes by running the following command:

# echo 262144 > /proc/sys/vm/min_free_kbytes
I/O Scheduler: Linux gives us the option to select the I/O scheduler, and this can be changed without rebooting. It provides three options for I/O schedulers: noop, deadline, and cfq. With the deadline scheduler, queued I/O requests are sorted into a read or write batch and then scheduled for execution in increasing LBA order. The read batches take precedence over the write batches by default, as applications are more likely to block on read I/O and read operations occur more often than write operations. For Ceph OSD workloads, the deadline I/O scheduler looks promising.

Execute the following command to check the default I/O scheduler for the disk device sda. The default scheduler should appear inside []:
# cat /sys/block/sda/queue/scheduler
Change the default I/O scheduler for disk sda to deadline:
# echo deadline > /sys/block/sda/queue/scheduler
Change the default I/O scheduler for disk sda to noop:
# echo noop > /sys/block/sda/queue/scheduler
I/O Scheduler queue: The default I/O scheduler queue size is 128. The scheduler queue sorts and merges incoming requests in an attempt to optimize for sequential I/O and reduce seek time. Changing the depth of the scheduler queue to 1024 can increase the proportion of sequential I/O that disks perform and improve the overall throughput.

To check the scheduler depth for the block device sda, use the following command:
# cat /sys/block/sda/queue/nr_requests
To increase the scheduler depth to 1024, use the following command:
# echo 1024 > /sys/block/sda/queue/nr_requests
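The scheduler and queue depth settings also reset at boot; a udev rule (file name hypothetical) can reapply both whenever a disk appears:

```ini
# /etc/udev/rules.d/81-io-scheduler.rules (hypothetical file name)
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="deadline", ATTR{queue/nr_requests}="1024"
```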