Ceph recommendation and performance tuning

In this recipe, we will learn some performance tuning parameters for the Ceph cluster. These cluster-wide configuration parameters are defined in the Ceph configuration file so that each time any Ceph daemon starts, it respects the defined settings. By default, the configuration file is named ceph.conf and is located in the /etc/ceph directory. This configuration file has a global section as well as a section for each service type. Whenever a Ceph service starts, it applies the configuration defined under the [global] section as well as under its daemon-specific section. A Ceph configuration file has multiple sections, as shown in the following screenshot:

[Screenshot: a Ceph configuration file showing its [global], [mon], [osd], [mds], and [client] sections]

We will now discuss the role of each section of the configuration file; a consolidated example follows the list.

  • Global section: The global section of the cluster configuration file begins with the [global] keyword. All the settings defined under this section apply to all the daemons of the Ceph cluster. The following is an example of a parameter defined under the [global] section:
    public network = 192.168.0.0/24
    
  • Monitor section: The settings defined under the [mon] section of the config file are applied to all the Ceph monitor daemons in the cluster. The parameter defined under this section overrides the parameters defined under the [global] section. The following is an example of a parameter usually defined under the [mon] section:
    mon initial members = ceph-mon1
    
  • OSD section: The settings defined under the [osd] section are applied to all the Ceph OSD daemons in the Ceph cluster. The configuration defined under this section overrides the same setting defined under the [global] section. The following is an example of the settings in this section:
    osd mkfs type = xfs
    
  • MDS section: The settings defined under the [mds] section are applied to all the Ceph MDS daemons in the Ceph cluster. The configuration defined under this section overrides the same setting defined under the [global] section. The following is an example of the settings in this section:
    mds cache size = 250000
    
  • Client section: The settings defined under the [client] section are applied to all the Ceph clients. The configuration defined under this section overrides the same setting defined under the [global] section. The following is the example of the settings in this section:
    rbd cache size = 67108864
    
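Putting these sections together, a minimal ceph.conf might look like the following sketch; it simply reuses the example values above, and the fsid is an illustrative placeholder:

    [global]
    fsid = <cluster-fsid>                 # illustrative placeholder
    public network = 192.168.0.0/24

    [mon]
    mon initial members = ceph-mon1

    [osd]
    osd mkfs type = xfs

    [mds]
    mds cache size = 250000

    [client]
    rbd cache size = 67108864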

In the next recipe, we will learn some tips for performance tuning of the Ceph cluster. Performance tuning is a vast topic that requires understanding of Ceph, as well as other components of your storage stack. There is no silver bullet for performance tuning. It depends a lot on the underlying infrastructure and your environment.

Global cluster tuning

The global parameters should be defined under the [global] section of your Ceph cluster configuration file:

  • network: It's recommended that you use two physically separated networks for your Ceph cluster, which are referred to as the public and cluster networks respectively. Earlier in this chapter, we covered the need for two different networks. Let's now understand how we can define them in a Ceph configuration.
    • Public Network: Use this syntax to define the public network: public network = {public network / netmask}:
      public network = 192.168.100.0/24
      
    • Cluster Network: Use this syntax to define the cluster network: cluster network = {cluster network / netmask}:
      cluster network = 192.168.1.0/24
      
  • max open files: If this parameter is in place when the Ceph cluster starts, it sets the maximum number of open file descriptors at the OS level. This keeps OSD daemons from running out of file descriptors. The default value of this parameter is 0; you can set it to any value up to a 64-bit integer:
    max open files = 131072
    
  • osd pool default size: This sets the default number of replicas for objects in a pool, that is, the replication level used when a pool is created without an explicit size. The default value is 3:
    osd pool default size = 3
    
  • osd pool default pg num and osd pool default pgp num: Make sure that the cluster has a realistic number of placement groups. The recommended value is about 100 placement groups per OSD. Use this formula to calculate the PG count: (Total number of OSDs * 100) / number of replicas

    For 10 OSDs and a replica size of 3, the PG count should be under (10 * 100) / 3 = 333:

    osd pool default pg num = 128
    osd pool default pgp num = 128
    

    As explained earlier, the PG and PGP number should be kept the same. The PG and PGP values vary a lot depending on the cluster size. The previously mentioned configurations should not harm your cluster, but you may want to rethink before applying these values. You should know that these parameters do not change the PG and PGP numbers for existing pools; they are applied when you create a new pool without specifying the PG and PGP values.
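
    These defaults only affect newly created pools. For an existing pool, you can check and, if necessary, raise the PG and PGP counts with the ceph CLI; the pool name rbd used here is just an example, and note that in the Ceph releases this recipe targets, pg_num can only be increased, never decreased:

    # ceph osd pool get rbd pg_num
    # ceph osd pool get rbd pgp_num
    # ceph osd pool set rbd pg_num 256
    # ceph osd pool set rbd pgp_num 256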

  • osd pool default min size: This is the replication level in a degraded state, and it should be set lower than the osd pool default size value. It sets the minimum number of replicas that must be available for an object in a pool so that Ceph can acknowledge a write operation even when the cluster is degraded. If fewer replicas than min size are available, Ceph will not acknowledge the write to the client:
    osd pool default min size = 1
    
  • osd pool default crush rule: The default CRUSH ruleset to use when creating a pool:
    osd pool default crush rule = 0
    
  • Disable In-Memory Logs: Each Ceph subsystem has a logging level for its output logs as well as for the logs it keeps in memory. We can set different values for each of these subsystems by setting a log file level and a memory level for debug logging, on a scale of 1 to 20, where 1 is terse and 20 is verbose. The first setting is the log level and the second is the memory level; you must separate them with a forward slash (/): debug <subsystem> = <log-level>/<memory-level>

    The default logging level is good enough for your cluster, unless you see that the memory level logs are impacting your performance or memory consumption. In such a case, you can try to disable the in-memory logging. To disable the default values of in-memory logs, add the following parameters:

      debug_lockdep = 0/0
      debug_context = 0/0
      debug_crush = 0/0
      debug_buffer = 0/0
      debug_timer = 0/0
      debug_filer = 0/0
      debug_objecter = 0/0
      debug_rados = 0/0
      debug_rbd = 0/0
      debug_journaler = 0/0
      debug_objectcacher = 0/0
      debug_client = 0/0
      debug_osd = 0/0
      debug_optracker = 0/0
      debug_objclass = 0/0
      debug_filestore = 0/0
      debug_journal = 0/0
      debug_ms = 0/0
      debug_monc = 0/0
      debug_tp = 0/0
      debug_auth = 0/0
      debug_finisher = 0/0
      debug_heartbeatmap = 0/0
      debug_perfcounter = 0/0
      debug_asok = 0/0
      debug_throttle = 0/0
      debug_mon = 0/0
      debug_paxos = 0/0
      debug_rgw = 0/0
    
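You can also change these debug levels on a running cluster, without restarting the daemons, to confirm whether in-memory logging is really the culprit. The following is a minimal sketch using the admin socket and injectargs; osd.0 is just an example daemon, and the ceph daemon command must be run on the node hosting that daemon:

    # ceph daemon osd.0 config get debug_osd
    # ceph tell osd.* injectargs '--debug_osd 0/0 --debug_ms 0/0'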

Monitor tuning

The monitor tuning parameters should be defined under the [mon] section of your Ceph cluster configuration file:

  • mon osd down out interval: This is the number of seconds Ceph waits before marking a down Ceph OSD daemon out if it does not come back. This option comes in handy when your OSD nodes crash and reboot by themselves or after some short glitch in the network. You don't want your cluster to start rebalancing as soon as the problem occurs; instead, it should wait for a few minutes and see whether the problem gets fixed:
    mon_osd_down_out_interval = 600
    
  • mon allow pool delete: To avoid the accidental deletion of a Ceph pool, set this parameter to false. This can be useful if you have many administrators managing the Ceph cluster and you do not want to take any risk with client data:
    mon_allow_pool_delete = false
    
  • mon osd min down reporters: A Ceph OSD daemon can report to the monitors that one of its peer OSDs is down; by default, this value is 1. With this option, you can change the minimum number of Ceph OSD daemons required to report a down Ceph OSD to the Ceph monitors. In a large cluster, it's recommended that you set this value higher than the default; 3 should be a good number:
    mon_osd_min_down_reporters = 3
    
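You can verify what a running monitor is actually using, and adjust it temporarily without a restart, through the admin socket and injectargs; mon.ceph-mon1 is just an example daemon name, and the ceph daemon command must be run on the node hosting that monitor:

    # ceph daemon mon.ceph-mon1 config get mon_osd_down_out_interval
    # ceph tell mon.* injectargs '--mon_osd_down_out_interval 600'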

OSD tuning

In this recipe, we will understand the general OSD tuning parameters that should be defined under the [osd] section of your Ceph cluster configuration file.

OSD General Settings

The following settings allow you to configure the filesystem type and the mount options for the Ceph OSD daemons, along with some other useful settings:

  • osd mkfs options xfs: At the time of OSD creation, Ceph will use these xfs options to create the OSD filesystem:
    osd_mkfs_options_xfs = "-f -i size=2048"
    
  • osd mount options xfs: This supplies the xfs mount options for the OSD data partition. When Ceph mounts an OSD, it will use the following options for the OSD filesystem:
    osd_mount_options_xfs = "rw,noatime,inode64,logbufs=8,logbsize=256k,delaylog,allocsize=4M"
    
  • osd max write size: The maximum size in megabytes an OSD can write at a time:
    osd_max_write_size = 256
    
  • osd client message size cap: The largest client data message in bytes that is allowed in memory:
    osd_client_message_size_cap = 1073741824
    
  • osd map dedup: Remove duplicate entries in the OSD map:
    osd_map_dedup = true
    
  • osd op threads: The number of threads to service the Ceph OSD Daemon operations. Set it to 0 to disable it. Increasing the number may increase the request-processing rate:
    osd_op_threads = 16
    
  • osd disk threads: The number of disk threads that are used to perform background disk-intensive OSD operations such as scrubbing and snap trimming:
    osd_disk_threads = 1
    
  • osd disk thread ioprio class: It is used in conjunction with osd_disk_thread_ioprio_priority. This tunable can change the I/O scheduling class of the disk thread, and it only works with the Linux kernel CFQ scheduler. The possible values are idle, be, or rt:
    • idle: The disk thread will have lower priority than any other thread in the OSD. It is useful when you want to slow down the scrubbing on an OSD that is busy handling client requests.
    • be: The disk threads have the same priority as other threads in the OSD.
    • rt: The disk thread will have more priority than all the other threads. This is useful when scrubbing is much needed, and it can be prioritized at the expense of client operations.
      osd_disk_thread_ioprio_class = idle
      
  • osd disk thread ioprio priority: It's used in conjunction with osd_disk_thread_ioprio_class. This tunable can change the I/O scheduling priority of the disk thread, ranging from 0 (highest) to 7 (lowest). If all OSDs on a given host are in the idle class and are competing for I/O, this parameter can be used to lower the disk thread priority of one OSD to 7 so that another OSD with priority 0 can potentially scrub faster. Like osd_disk_thread_ioprio_class, this also works only with the Linux kernel CFQ scheduler:
    osd_disk_thread_ioprio_priority = 0
    
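On a typical filestore deployment, the OSD data partitions are mounted under /var/lib/ceph/osd/, so you can confirm that the mkfs and mount options actually took effect with something like the following sketch; ceph-0 is an example OSD directory, and the device names will differ on your nodes:

    # mount | grep /var/lib/ceph/osd
    # xfs_info /var/lib/ceph/osd/ceph-0     # shows the inode size (isize) chosen at mkfs time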

OSD Journal settings

The Ceph OSD daemons support the following journal configurations:

  • osd journal size: Ceph's default osd journal size value is 0; you should use the osd_journal_size parameter to set the journal size. The journal size should be at least twice the product of the expected drive speed and filestore max sync interval (see the worked sizing example after this list). If you are using SSD journals, it's usually good to create journals larger than 10 GB and to increase the filestore min/max sync intervals:
    osd_journal_size = 20480
    
  • journal max write bytes: The maximum number of bytes the journal can write at once:
    journal_max_write_bytes = 1073741824
    
  • journal max write entries: The maximum number of entries the journal can write at once:
    journal_max_write_entries = 10000
    
  • journal queue max ops: The maximum number of operations allowed in the journal queue at a given time:
    journal_queue_max_ops = 50000
    
  • journal queue max bytes: The maximum number of bytes allowed in the journal queue at a given time:
    journal_queue_max_bytes = 10485760000
    
  • journal dio: This enables direct I/O to the journal. It requires journal block align to be set to true:
    journal_dio = true
    
  • journal aio: This enables using libaio for asynchronous writes to the journal. It requires journal dio to be set to true:
    journal_aio = true
    
  • journal block align: This block aligns write operations. It's required for dio and aio.
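
As a worked example of the sizing rule from the osd journal size item above, assume an expected drive throughput of about 100 MB/s (an assumption, not a measured value) and filestore max sync interval = 15 seconds:

    minimum journal size = 2 * (expected throughput * filestore max sync interval)
                         = 2 * (100 MB/s * 15 s)
                         = 3000 MB (~3 GB)
    osd_journal_size = 20480    # 20 GB leaves comfortable headroom for an SSD journal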

OSD Filestore settings

These are a few filestore settings that can be configured for the Ceph OSD daemons:

  • filestore merge threshold: The minimum number of files in a subdirectory before it is merged into the parent directory; a negative value disables subdirectory merging:
    filestore_merge_threshold = 40
    
  • filestore split multiple: The maximum number of files in a subdirectory before splitting into child directories:
    filestore_split_multiple = 8
    
  • filestore op threads: The number of filesystem operation threads that execute in parallel:
    filestore_op_threads = 32
    
  • filestore xattr use omap: Uses the object map for XATTRS (extended attributes). Needs to be set to true for the ext4 file systems:
    filestore_xattr_use_omap = true
    
  • filestore sync interval: In order to create a consistent commit point, the filestore needs to quiesce write operations and perform a syncfs() operation, which syncs data from the journal to the data partition and thus frees the journal. The more frequently the sync operation is performed, the less data is stored in the journal, and the journal ends up underutilized. Configuring less frequent syncs allows the filesystem to coalesce small writes better, and we might get improved performance. The following parameters define the minimum and maximum time period between two syncs:
    filestore_min_sync_interval = 10
    filestore_max_sync_interval = 15
    
  • filestore queue max ops: The maximum number of operations that a filestore can accept before blocking new operations from joining the queue:
    filestore_queue_max_ops = 2500
    
  • filestore queue max bytes: The maximum number of bytes in an operation:
    filestore_queue_max_bytes = 10485760
    
  • filestore queue committing max ops: The maximum number of operations the filestore can commit:
    filestore_queue_committing_max_ops = 5000
    
  • filestore queue committing max bytes: The maximum number of bytes the filestore can commit:
    filestore_queue_committing_max_bytes = 10485760000
    

OSD Recovery settings

These settings should be used when you want performance over recovery or vice versa. If your Ceph cluster is unhealthy and is under recovery, you might not get its usual performance, as OSDs will be busy with recovery. If you still prefer performance over recovery, you can reduce the recovery priority to keep OSDs less occupied with recovery. You can also set these values if you want a quick recovery for your cluster, helping OSDs to perform recovery faster.

  • osd recovery max active: The number of active recovery requests per OSD at a given moment:
    osd_recovery_max_active = 1
    
  • osd recovery max single start: This is used in conjunction with osd_recovery_max_active. To understand this, let's assume osd_recovery_max_single_start is equal to 1, and osd_recovery_max_active is equal to 3. In this case, it means that the OSD will start a maximum of one recovery operation at a time, out of a total of three operations active at that time:
    osd_recovery_max_single_start = 1
    
  • osd recovery op priority: This is the priority set for recovery operations, relative to osd client op priority. The higher the number, the higher the recovery priority:
    osd_recovery_op_priority = 50
    
  • osd recovery max chunk: The maximum size of a recovered chunk of data in bytes:
    osd_recovery_max_chunk = 1048576
    
  • osd recovery threads: The number of threads needed for recovering data:
    osd_recovery_threads = 1
    
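If recovery is hurting client I/O on a live cluster, these values can also be tightened on the fly with injectargs instead of editing ceph.conf and restarting the OSDs; this is a sketch that simply reuses the conservative values shown above:

    # ceph tell osd.* injectargs '--osd_recovery_max_active 1 --osd_recovery_max_single_start 1'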

OSD Backfilling settings

OSD backfilling settings allow Ceph to set backfilling operations at a lower priority than requests to read and write:

  • osd max backfills: The maximum number of backfills allowed to or from a single OSD:
    osd_max_backfills = 2
    
  • osd backfill scan min: The minimum number of objects per backfill scan:
    osd_backfill_scan_min = 8
    
  • osd backfill scan max: The maximum number of objects per backfill scan:
    osd_backfill_scan_max = 64
    

OSD scrubbing settings

OSD Scrubbing is important for maintaining data integrity, but it can reduce performance. You can adjust the following settings to increase or decrease scrubbing operations:

  • osd max scrubs: The maximum number of simultaneous scrub operations for a Ceph OSD daemon:
    osd_max_scrubs = 1
    
  • osd scrub sleep: The time in seconds that the OSD sleeps between scrubbing two consecutive groups of chunks:
    osd_scrub_sleep = .1
    
  • osd scrub chunk min: The minimum number of data chunks an OSD should perform scrubbing on:
    osd_scrub_chunk_min = 1
    
  • osd scrub chunk max: The maximum number of data chunks an OSD should perform scrubbing on:
    osd_scrub_chunk_max = 5
    
  • osd deep scrub stride: The read size in bytes while doing a deep scrub:
    osd_deep_scrub_stride = 1048576
    
  • osd scrub begin hour: The earliest hour that scrubbing can begin. This is used in conjunction with osd_scrub_end_hour to define a scrubbing time window:
    osd_scrub_begin_hour = 19
    
  • osd scrub end hour: This is the upper bound of the time window in which scrubbing can be performed. This works in conjunction with osd_scrub_begin_hour to define a scrubbing time window:
    osd_scrub_end_hour = 7
    
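Even with a restricted scrub window, you can still trigger scrubs manually when it suits you; the OSD ID 0 below is just an example:

    # ceph osd scrub 0          # light scrub of the PGs for which osd.0 is primary
    # ceph osd deep-scrub 0     # deep scrub of the PGs for which osd.0 is primary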

Client tuning

The client tuning parameters should be defined under the [client] section of your Ceph configuration file. Usually, this [client] section should also be present in the Ceph configuration file hosted on the client node:

  • rbd cache: Enable caching for the RADOS Block Device (RBD):
    rbd_cache = true
    
  • rbd cache writethrough until flush: Start out in the write-through mode, and switch to writeback after the first flush request is received:
    rbd_cache_writethrough_until_flush = true
    
  • rbd concurrent management ops: The number of concurrent management operations that can be performed on rbd:
    rbd_concurrent_management_ops = 10
    
  • rbd cache size: The rbd cache size in bytes:
    rbd_cache_size = 67108864  #64M
    
  • rbd cache max dirty: The limit in bytes at which the cache should trigger a writeback. It should be less than rbd_cache_size:
    rbd_cache_max_dirty = 50331648  #48M
    
  • rbd cache target dirty: The dirty target before the cache begins writing data to the backing store:
    rbd_cache_target_dirty = 33554432  #32M
    
  • rbd cache max dirty age: The number of seconds that the dirty data is in the cache before writeback starts:
    rbd_cache_max_dirty_age = 2
    
  • rbd default format: This uses the second rbd format, which is supported by librbd and the Linux kernel since version 3.11. This adds support for cloning and is more easily extensible, allowing more features in the future:
    rbd_default_format = 2
    
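Put together, the client-side settings from this recipe would sit in the [client] section of the ceph.conf file on the client node, roughly as follows:

    [client]
    rbd cache = true
    rbd cache writethrough until flush = true
    rbd concurrent management ops = 10
    rbd cache size = 67108864
    rbd cache max dirty = 50331648
    rbd cache target dirty = 33554432
    rbd cache max dirty age = 2
    rbd default format = 2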

Operating System tuning

In the last recipe, we covered the tuning parameters for Ceph MON, OSD, and Clients. In this recipe, we will cover a few general tuning parameters, which can be applied to the Operating System.

  • kernel pid max: This is a Linux kernel parameter that is responsible for the maximum number of threads and process IDs. By default, the Linux kernel has a relatively small kernel.pid_max value. You should configure this parameter with a higher value on Ceph nodes hosting several OSDs, typically more than 20 OSDs. This setting helps spawn multiple threads for faster recovery and rebalancing. To use this parameter, execute the following command as the root user:
    # echo 4194303 > /proc/sys/kernel/pid_max
    
  • file max: This is the maximum number of open files on a Linux system. It's generally a good idea to have a larger value for this parameter:
    # echo 26234859 > /proc/sys/fs/file-max
    
  • Jumbo frames: Ethernet frames with a payload larger than the standard 1,500-byte MTU are known as jumbo frames. Enabling jumbo frames on all the network interfaces that Ceph uses for both the cluster and client networks should improve the network throughput and overall network performance.

    Jumbo frames should be enabled on the host as well as on the network switch side, otherwise, a mismatch in the MTU size would result in packet loss. To enable jumbo frames on the interface eth0, execute the following command:

    # ifconfig eth0 mtu 9000
    

Similarly, you should do this for other interfaces that are participating in the Ceph networks. To make this change permanent, you should add this configuration in the interface configuration file.
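
For example, on RHEL/CentOS-style systems, the MTU can be persisted in the interface configuration file; the interface name eth0 is illustrative:

    # echo 'MTU="9000"' >> /etc/sysconfig/network-scripts/ifcfg-eth0
    # ifdown eth0 && ifup eth0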

  • Disk read_ahead: The read_ahead parameter speeds up disk read operations by prefetching data and loading it into random access memory. Setting a relatively high value for read_ahead will benefit clients performing sequential read operations.

    Let's assume that disk vda is an RBD that is mounted on a client node. Use the following command to check its read_ahead value, which is the default in most cases:

    # cat /sys/block/vda/queue/read_ahead_kb
    

    To set read_ahead to a higher value, that is, 8 MB for vda RBD, execute the following command:

    # echo "8192" > /sys/block/vda/queue/read_ahead_kb
    

    The read_ahead setting is used on Ceph clients that have RBD images mounted. To get a read performance boost, you can set it to several MB on all your RBD devices, depending on your hardware.

  • Virtual memory: Due to Ceph's heavily I/O-focused profile, swap usage can result in the entire server becoming unresponsive; a low swappiness value is recommended for high-I/O workloads. Set vm.swappiness to zero in /etc/sysctl.conf to prevent this:
    echo "vm.swappiness=0" >> /etc/sysctl.conf
    
  • min_free_kbytes: This gives the minimum number of kilobytes to keep free across the system. You can keep 1 to 3% of the total system memory free with min_free_kbytes by running the following command:
    # echo 262144 > /proc/sys/vm/min_free_kbytes
    
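    The echo commands shown above take effect immediately but do not survive a reboot. A minimal sketch for persisting the same kernel and virtual memory values (the numbers are simply carried over from this recipe) is to add the following lines to /etc/sysctl.conf and then run sysctl -p:

    kernel.pid_max = 4194303
    fs.file-max = 26234859
    vm.swappiness = 0
    vm.min_free_kbytes = 262144
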
  • I/O Scheduler: Linux gives us the option to select the I/O scheduler, and it can be changed without rebooting. It provides three I/O schedulers, described as follows:
    • Deadline: The deadline I/O scheduler replaces CFQ as the default I/O scheduler in Red Hat Enterprise Linux 7 and its derivatives, as well as in Ubuntu Trusty. The deadline scheduler favors reads over writes via the use of separate I/O queues for each. This scheduler is suitable for most use cases, but particularly for those in which read operations occur more often than write operations. Queued I/O requests are sorted into a read or write batch and then scheduled for execution in increasing LBA order. The read batches take precedence over the write batches by default, as applications are more likely to block on read I/O. For Ceph OSD workloads, the deadline I/O scheduler looks promising.
    • CFQ: The Completely Fair Queuing (CFQ) scheduler was the default scheduler in Red Hat Enterprise Linux 4, 5, and 6 and their derivatives; in Red Hat Enterprise Linux 7, it remains the default only for devices identified as SATA disks. The CFQ scheduler divides processes into three separate classes: real time, best effort, and idle. Processes in the real time class are always performed before processes in the best effort class, which are always performed before processes in the idle class. This means that processes in the real time class can starve both the best effort and idle processes of processor time. Processes are assigned to the best effort class by default.
    • Noop: The Noop I/O scheduler implements a simple FIFO (first-in first-out) scheduling algorithm. Requests are merged at the generic block layer through a simple last-hit cache. This can be the best scheduler for CPU-bound systems using fast storage. For an SSD, the NOOP I/O scheduler can reduce I/O latency and increase throughput as well as eliminate the CPU time spent re-ordering I/O requests. This scheduler typically works well with SSDs, Virtual Machines, and even with NVMe cards. Thus, the Noop IO scheduler should be a good choice for SSD disks used for Ceph journals.

      Execute the following command to check the default I/O scheduler for the disk device sda. The default scheduler should appear inside []:

      # cat /sys/block/sda/queue/scheduler
      

      Change the default I/O scheduler for disk sda to deadline:

      # echo deadline > /sys/block/sda/queue/scheduler
      

      Change the default I/O scheduler for disk sda to noop:

      # echo noop > /sys/block/sda/queue/scheduler
      

      Note

      You must repeat these commands for all the disks, changing the default scheduler to either deadline or noop based on your requirements. Also, to make this change permanent, you need to update the GRUB boot loader with the required elevator option, as shown in the sketch below.
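
      One common way to do this on RHEL/CentOS 7-style systems (a sketch; adjust the steps to your distribution and boot loader) is to append the elevator option to the existing GRUB_CMDLINE_LINUX line in /etc/default/grub and regenerate the GRUB configuration; the rhgb quiet arguments shown here simply stand in for whatever was already on that line:

      GRUB_CMDLINE_LINUX="rhgb quiet elevator=deadline"

      # grub2-mkconfig -o /boot/grub2/grub.cfg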

  • I/O Scheduler queue: The default I/O scheduler queue size is 128. The scheduler queue sorts and merges write requests in an attempt to optimize for sequential I/O and reduce the seek time. Changing the depth of the scheduler queue to 1024 can increase the proportion of sequential I/O that disks perform and improve the overall throughput.

    To check the scheduler depth for block device sda, use the following command:

    # cat /sys/block/sda/queue/nr_requests
    

    To increase the scheduler depth to 1024, use the following command:

    # echo 1024 > /sys/block/sda/queue/nr_requests
    