Ceph performance considerations – hardware level

When it comes to performance, the underlying hardware plays a major role. Traditional storage systems run only on their vendor's own hardware, leaving users no flexibility to select hardware based on their needs and unique workload requirements. It is very difficult for organizations that invest in such vendor-locked systems to overcome problems caused by inadequate hardware.

Ceph, on the other hand, is completely vendor-neutral; organizations are no longer tied to a hardware manufacturer and are free to use any hardware that fits their choice, budget, and performance requirements. They retain full control over their hardware and the underlying infrastructure.

The other advantage of Ceph is that it supports heterogeneous hardware; that is, Ceph can run on cluster hardware from multiple vendors. Customers can mix hardware brands while building their Ceph infrastructure. For example, while purchasing hardware for Ceph, customers can mix hardware from different manufacturers such as HP, Dell, IBM, Fujitsu, Super Micro, and even off-the-shelf components. In this way, customers can achieve significant cost savings while getting the hardware they want, along with full control and decision-making power.

Hardware selection plays a vital role in overall Ceph storage performance. Since customers are free to choose the hardware for Ceph, the selection should be made with extra care and with a proper estimate of current and future workloads.

Keep in mind that hardware selection for Ceph depends entirely on the workload you will put on your cluster, the environment, and the features you will use. In this section, we will cover some general practices for selecting hardware for your Ceph cluster.

Processor

Some Ceph components are not processor hungry. Ceph monitor daemons, for example, are lightweight; they maintain copies of the cluster maps and do not serve any data to clients. Thus, in most cases, a single-core processor will do the job for a monitor. You can also consider running monitor daemons on any other server in your environment that has free resources, as long as adequate system resources such as memory, network, and disk space are available for the monitor daemons.

Ceph OSD daemons might require a fair amount of CPU, as they serve data to clients and therefore do some data processing. A dual-core processor for OSD nodes is a good starting point. From a performance point of view, it is important to know how you will use your OSDs: in a replicated fashion or with erasure coding. If you use erasure-coded pools, you should consider a quad-core processor, as erasure-coding operations require a lot of computation. During cluster recovery, the processor consumption of OSD daemons increases significantly.

Ceph MDS daemons are more processor hungry than MON and OSD daemons. They need to dynamically redistribute their load, which is CPU intensive, so you should consider a quad-core processor for Ceph MDS.
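
The rules of thumb above can be collected into a small sizing helper. The following Python sketch is purely illustrative; the core counts are the rough guidelines from this section, not official Ceph requirements, and you should always validate them against your own workload.

# Rough CPU sizing sketch based on the guidelines in this section.
# The numbers are illustrative rules of thumb, not official requirements.

def recommended_cores(daemon, erasure_coded=False):
    """Suggested minimum CPU cores for a node running one type of Ceph daemon."""
    if daemon == "mon":
        return 1                          # monitors only maintain cluster maps
    if daemon == "osd":
        return 4 if erasure_coded else 2  # erasure coding is computation heavy
    if daemon == "mds":
        return 4                          # dynamic load redistribution is CPU intensive
    raise ValueError("unknown daemon type: %s" % daemon)

# Example: an OSD node that serves an erasure-coded pool
print(recommended_cores("osd", erasure_coded=True))   # 4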

Memory

Monitor and metadata daemons need to serve their data rapidly, so they should have enough memory for fast processing. From a performance point of view, 2 GB or more per daemon instance should be available for metadata and monitor daemons. OSDs are generally not memory intensive. For an average workload, 1 GB of memory per OSD daemon instance should suffice; however, from a performance point of view, 2 GB per OSD daemon is a better choice. This recommendation assumes that you are using one OSD daemon per physical disk. If you use more than one physical disk per OSD, your memory requirement will grow as well. Generally, more physical memory is better, since memory consumption increases significantly during cluster recovery.
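
These guidelines translate into a quick per-node estimate, sketched below. It simply applies the per-daemon figures above (assuming one OSD daemon per physical disk); treat the result as a rule-of-thumb minimum, since real memory usage climbs during recovery.

# Rough per-node memory estimate: ~2 GB per monitor or MDS daemon and
# 1-2 GB per OSD daemon, assuming one OSD daemon per physical disk.
# Recovery can push real usage well above these figures.

def node_memory_gb(osds=0, mons=0, mds=0, gb_per_osd=2):
    """Suggested minimum RAM in GB for a node hosting the given daemons."""
    return osds * gb_per_osd + mons * 2 + mds * 2

# Example: a node with 12 OSD disks and one co-located monitor daemon
print(node_memory_gb(osds=12, mons=1))   # 26 (GB), before OS overhead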

Network

All cluster nodes should have dual network interfaces for two different networks: a cluster network and a client network. For a medium-sized cluster of several hundred terabytes, a 1 Gbps network link should work well. However, if your cluster is large and serves many clients, you should consider a 10 Gbps or faster network. The network plays a vital role during recovery: with a 10 Gbps or faster connection, your cluster will recover quickly; otherwise, recovery can take considerably longer. So, from a performance point of view, dual 10 Gbps or faster networks are a good option. A well-designed Ceph cluster uses two physically separated networks, one for the cluster network (internal network) and another for the client network (external network); both networks should be physically separated at the switch level, and, from an availability point of view, each network should use redundant links, as shown in the following diagram:

(Diagram: redundant dual-network setup for a Ceph cluster)
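
To see why recovery makes bandwidth matter so much, consider a back-of-the-envelope estimate of how long it takes to move the contents of a single failed OSD over the cluster network. The 4 TB disk size and the assumption that recovery can use the whole link are illustrative only; real recovery traffic is throttled and spread across many OSDs, but the ratio between the two link speeds holds.

# Back-of-the-envelope recovery estimate: time to move the contents of
# a failed 4 TB OSD over the cluster network. The 4 TB figure and the
# assumption of a fully utilized link are illustrative only.

def recovery_hours(data_tb, link_gbps):
    data_bits = data_tb * 8 * 10**12            # TB -> bits (decimal units)
    seconds = data_bits / (link_gbps * 10**9)   # ideal transfer time
    return seconds / 3600.0

print("1 Gbps : %.1f hours" % recovery_hours(4, 1))    # ~8.9 hours
print("10 Gbps: %.1f hours" % recovery_hours(4, 10))   # ~0.9 hours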

Disk

Disk drive selection for a Ceph storage cluster has a major impact on overall performance and total cluster cost. Before making your final decision on disk drives, you should understand your workload and its likely performance requirements. A Ceph OSD consists of two different parts: the OSD journal part and the OSD data part. Every write operation is therefore a two-step process.

When an OSD receives a client request to store an object, it first writes the object to the journal part and then, from the journal, writes the same object to the data part before sending an acknowledgement to the client. In this way, cluster write performance revolves around the OSD journal and data partitions. From a performance point of view, it is recommended to use SSDs for journals; SSDs deliver a significant throughput improvement by reducing access time and latency. In most environments, where extreme performance is not a concern, you can configure the journal and data partitions on the same hard disk drive. However, if you are looking for a significant performance improvement from your Ceph cluster, it is worth investing in SSDs for the journals.
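
To make the two-step write path concrete, here is a highly simplified conceptual sketch of the ordering described above. It is not Ceph's actual implementation (which is written in C++); it only illustrates that the journal write happens first and the acknowledgement comes last.

# Conceptual sketch of the OSD write path described above: commit to the
# journal first, then write to the data partition, then acknowledge.
# This only illustrates the ordering; it is not Ceph's real implementation.

def handle_write(obj_name, data, journal, store):
    journal.append((obj_name, data))   # 1. sequential write to the journal (fast on SSD)
    store[obj_name] = data             # 2. write the same object to the data partition
    return "ack"                       # 3. only now acknowledge the client

journal, store = [], {}
print(handle_write("object-1", b"payload", journal, store))   # ack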

To use SSDs for journals, we create logical partitions on each physical SSD, and each partition is used as a journal, such that every SSD journal partition maps to one OSD data partition. In this type of setup, keep in mind not to overload an SSD by storing more journals than it can handle, as doing so will hurt overall performance. To get good performance out of your SSDs, you should store no more than four OSD journals on each SSD.
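
The arithmetic behind that guideline is simple; the sketch below shows how many journal SSDs a node would need for a given number of OSDs if each SSD is capped at four journals. The resulting mapping is purely illustrative.

# Planning sketch for SSD journal placement, capping each SSD at four
# OSD journals as suggested above. The mapping is illustrative only.

import math

def plan_journal_ssds(num_osds, journals_per_ssd=4):
    """Return the number of SSDs needed and an OSD -> SSD index mapping."""
    ssds_needed = math.ceil(num_osds / journals_per_ssd)
    mapping = {osd: osd // journals_per_ssd for osd in range(num_osds)}
    return ssds_needed, mapping

ssds, mapping = plan_journal_ssds(10)
print(ssds)      # 3 SSDs for 10 OSDs
print(mapping)   # {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, ...}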

The dark side of using a single SSD for multiple journals is that if you lose an SSD hosting multiple journals, all the OSDs associated with it will fail, and you might lose your data. You can overcome this by using RAID 1 for the journals, but that increases your storage cost. Also, the cost per gigabyte of an SSD is nearly 10 times that of an HDD, so building a cluster with SSDs increases the cost per gigabyte of your Ceph cluster.
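
The cost impact is easy to estimate. In the sketch below, the absolute prices are hypothetical placeholders; the only figure carried over from this section is the roughly 10x price-per-gigabyte gap between SSD and HDD. Journal SSDs add cost without adding usable capacity, so the cost per usable gigabyte rises.

# Rough cost-per-usable-gigabyte comparison for one OSD node, with and
# without dedicated journal SSDs. The absolute prices are hypothetical;
# only the ~10x SSD/HDD price ratio comes from the text.

HDD_COST_PER_GB = 0.05                  # hypothetical HDD price per GB
SSD_COST_PER_GB = 10 * HDD_COST_PER_GB  # ~10x the HDD price per GB

def cost_per_usable_gb(hdd_gb_total, ssd_gb_total=0):
    """Dollars per usable GB; journal SSDs add cost but no usable capacity."""
    total_cost = hdd_gb_total * HDD_COST_PER_GB + ssd_gb_total * SSD_COST_PER_GB
    return total_cost / hdd_gb_total

# Example: 12 x 4 TB data HDDs, optionally with 3 x 200 GB journal SSDs
print(round(cost_per_usable_gb(12 * 4000), 3))            # 0.05
print(round(cost_per_usable_gb(12 * 4000, 3 * 200), 3))   # 0.056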

Filesystem selection is also an aspect of cluster performance. Btrfs is an advanced filesystem that can write an object in a single operation, whereas XFS and EXT4 require two steps. Btrfs is a copy-on-write filesystem: while writing an object to the journal, it can simultaneously write the same object to the data partition, providing a significant performance improvement. However, Btrfs was not production ready at the time of writing, and you might face data-inconsistency problems with it.
