Chapter 5. Deploying Ceph – the Way You Should Know

By now, you should have learned quite a bit about Ceph, including some hands-on practice. In this chapter, we will cover the following topics:

  • Ceph cluster hardware planning
  • Preparing your Ceph installation
  • Ceph cluster manual deployment
  • Scaling up your cluster
  • Ceph cluster deployment using the ceph-deploy tool
  • Upgrading your Ceph cluster

Hardware planning for a Ceph cluster

Ceph is a software-based storage system designed to run on generally available commodity hardware. This makes it an economical, scalable, and vendor-independent storage solution.

Cluster hardware configuration requires planning based on your storage needs. The type of hardware as well as the cluster design should be considered during the initial phase of the project. Meticulous planning at an early stage goes a long way toward avoiding performance bottlenecks and improves cluster reliability. Hardware selection depends on many factors, such as budget, whether the focus is performance, capacity, or both, the required fault tolerance level, and the final use case. In this chapter, we will discuss general considerations with respect to hardware and cluster design.

Note

For more information on hardware recommendations, you can refer to Ceph's official documentation at http://ceph.com/docs/master/start/hardware-recommendations/.

Monitor requirements

Ceph monitors look after the health of the entire cluster by maintaining the cluster maps. They do not participate in storing cluster data, so they are not CPU or memory intensive and have fairly low system resource requirements. In most cases, a low-cost, entry-level server with a single-core CPU and a few gigabytes of memory is good enough for a monitor node.

If you have an existing server with a fair amount of spare system resources, you can assign it the additional responsibility of running a Ceph monitor. In that case, make sure the other services running on that node leave sufficient resources for the Ceph monitor daemon. In a nonproduction environment with budget or hardware constraints, you can consider running Ceph monitors on virtual machines hosted on separate physical servers. For production, however, the usual practice is to run each Ceph monitor on a low-cost, low-configuration physical machine.

If you have configured your cluster to store logs locally on the monitor node, make sure the node has enough local disk space to hold them. For a healthy cluster, logs can grow to a few gigabytes, but for an unhealthy cluster running at a higher debugging level, they can easily reach several gigabytes. Make sure the monitor node does not run out of log space if the cluster stays unhealthy for an extended period. For a production environment, allocate a large enough partition for logs and put a log rotation policy in place to keep free space under control.
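
On most Linux distributions, the Ceph packages usually ship their own log rotation policy under /etc/logrotate.d/. The following is only a minimal sketch of such a policy in case you need to tune it; the retention and frequency values are illustrative assumptions, not recommendations:

    # rotate Ceph daemon logs weekly, keeping four compressed archives
    /var/log/ceph/*.log {
        weekly
        rotate 4
        compress
        missingok
        notifempty
    }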

A monitor node does not need high network bandwidth, since monitors do not participate in cluster data recovery; a redundant NIC at 1 Gbps is sufficient. Redundancy at the network level is important because monitors form a quorum with each other, and the failure of more than 50 percent of the monitor nodes makes the cluster unreachable for clients. For a nonproduction environment, you can manage with a single-NIC monitor node, but for a production setup, redundancy at the network level is a big factor.
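
For example, running an odd number of monitors, typically three for a small cluster, keeps the quorum intact even if one monitor node fails. The following ceph.conf fragment is only a sketch; the hostnames and addresses are placeholders for your own monitor nodes:

    [global]
    # three monitors keep quorum available if a single node fails
    mon initial members = ceph-mon1, ceph-mon2, ceph-mon3
    mon host = 192.168.1.11, 192.168.1.12, 192.168.1.13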

OSD requirements

A typical Ceph cluster deployment creates one OSD for each physical disk in a cluster node, which is the recommended practice. However, Ceph supports flexible deployment with either one OSD per disk or one OSD per RAID volume. The majority of Ceph cluster deployments in a JBOD environment use one OSD per physical disk. A Ceph OSD needs the following (a sample configuration sketch follows this list):

  • CPU and memory
  • An OSD journal (block device or a file)
  • An underlying filesystem (XFS, ext4, and Btrfs)
  • A separate, redundant OSD (cluster) network (recommended)
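
As an illustration, the journal size and the underlying filesystem for newly created OSDs can be set in the [osd] section of ceph.conf. The values below are only a sketch and should be tuned to your hardware:

    [osd]
    # journal size in MB; a 10 GB journal is a common starting point
    osd journal size = 10240
    # filesystem and mount options used when new OSDs are created
    osd mkfs type = xfs
    osd mount options xfs = noatime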

The recommended CPU and memory configuration is 1 GHz of CPU and 2 GB of RAM per OSD. This suits most cluster environments; however, keep in mind that in the event of a disaster, the recovery process requires more system resources than usual. For example, a node hosting 12 OSDs would need roughly 12 GHz of aggregate CPU (for instance, a 6-core 2 GHz processor) and about 24 GB of RAM, plus some headroom for recovery.

Tip

It is usually cost-effective to overallocate CPU and memory at an early stage of cluster planning, because you can then add more physical disks in a JBOD style to the same host whenever needed, as long as it has enough system resources, rather than purchasing an entirely new node, which is relatively costly.

The OSD is the main storage unit in Ceph, so you should have as many hard disks as your storage capacity requires. For disk drives, higher capacities usually mean a lower price per gigabyte. You should take advantage of this and use fairly large OSDs. However, keep in mind that the larger the disk, the more memory the OSD needs to operate.

From a performance point of view, you should consider separate journal disks for your OSDs. Performance improvements have been observed when OSD journals are created on SSD partitions and OSD data resides on separate spinning disks. Using SSDs as journals improves cluster performance by handling write workloads quickly and efficiently. The downside of SSDs is that they increase the storage cost per gigabyte of your cluster. An efficient way to invest in SSDs is to designate a single SSD as the journal device for more than one OSD. The trade-off is that if you lose an SSD journal disk shared by multiple OSDs, you lose the data on all of those OSDs, so avoid overloading an SSD with journals. A reasonable count is two to four journals per SSD.
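
As an example, the ceph-deploy syntax HOST:DISK[:JOURNAL], covered later in this chapter, lets you point several OSD data disks at different partitions of the same SSD. This is only a sketch; the node, disk, and partition names are placeholders:

    # three OSDs on spinning disks sdb, sdc, and sdd,
    # each journaling to its own partition on the SSD sdf
    ceph-deploy osd prepare ceph-node1:sdb:/dev/sdf1
    ceph-deploy osd prepare ceph-node1:sdc:/dev/sdf2
    ceph-deploy osd prepare ceph-node1:sdd:/dev/sdf3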

Network requirements

With respect to networking, it is recommended that your cluster has at least two separate networks: a front-side data network, or public network, and a back-side data network, or cluster network. Two physically separate networks are recommended so that client traffic and Ceph cluster traffic are kept apart. Most of the time, cluster traffic exceeds client traffic, because Ceph uses the cluster network to replicate every object as well as to perform recovery in case of failure. If both run on the same physical network, you may encounter performance problems. You can always run a Ceph cluster on a single physical network, but two separate networks remain the recommendation. These networks should provide a minimum bandwidth of 1 Gbps; however, 10 Gbps is preferable, depending on your workload and performance needs. If you are designing a Ceph cluster that will scale up in the near future and carry a significant workload, starting with physically separate 10 Gbps networks for both data and cluster traffic is a wise choice as far as ROI and performance are concerned.
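
The public and cluster networks are defined in ceph.conf. The following fragment is only a sketch; the subnets are placeholders for your own network layout:

    [global]
    # client traffic
    public network = 192.168.100.0/24
    # replication, rebalancing, and recovery traffic
    cluster network = 192.168.200.0/24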

It is also advisable to have redundancy at each layer of your network configuration, such as network controllers, ports, switches, and routers. The front-side data network, or public network, provides the interconnection between clients and the Ceph cluster; every I/O operation between clients and the cluster traverses this interconnect, so make sure you have enough bandwidth per client. The second interconnect, the back-end cluster network, is used internally by the Ceph OSD nodes. Because of Ceph's distributed architecture, every client write is replicated N times and stored across the cluster, so for a single write operation, the cluster has to write N times the amount of data, and all of this replicated data travels to peer nodes over the cluster network. For example, with a replication size of 3, every 1 GB written by a client generates roughly 2 GB of additional replica traffic on the cluster network. In addition to write replication, the cluster network is also used for data rebalancing and recovery, so it plays a vital role in overall cluster performance.

Considering your business needs, workload, and overall performance targets, you should periodically revisit your network configuration; you can even opt for separate 40 Gbps interconnects for both the public and cluster networks. This can bring significant improvements for extra-large Ceph clusters that are hundreds of terabytes in size. Most cluster deployments rely on Ethernet networks; however, InfiniBand is also gaining popularity for high-speed Ceph front-end and back-end networks. Optionally, based on your budget, you can also consider a separate management network as well as an emergency network to provide an additional layer of network redundancy for your production cluster.

MDS requirements

Compared to the Ceph monitor (MON) and OSD, the Ceph MDS is a bit more resource hungry. It requires significantly more CPU processing power, with a quad-core processor or better. The Ceph MDS depends heavily on data caching; since it needs to serve data quickly, it requires plenty of RAM. The more RAM the Ceph MDS has, the better CephFS will perform. If you have a heavy CephFS workload, keep the Ceph MDS on dedicated physical machines with relatively more physical CPU cores and memory. A redundant network interface at 1 Gbps or more will work for most Ceph MDS cases.
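
For instance, the size of the MDS inode cache, and therefore much of its memory footprint, can be tuned in the [mds] section of ceph.conf. The value below is only an illustrative assumption, not a recommendation:

    [mds]
    # number of inodes to keep in the MDS cache; larger values
    # improve CephFS metadata performance but consume more RAM
    mds cache size = 250000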

Tip

It is recommended that you use separate disks, configured under RAID, for the operating system on MON, OSD, and MDS nodes. You should not use the operating system disk/partition for cluster data.
