Chapter 3. Clustering, Sharding, and Replication

As your data grows, it becomes increasingly difficult to keep the entire database in a single physical location, and it is often more efficient to spread the data across more than one machine. RethinkDB is a distributed database. This means that it consists of multiple connected machines, each of which stores part of the data, although to users it appears as a single, centralized database.

This chapter is all about scaling RethinkDB and setting up and managing database clusters, that is, groups of servers that serve the same databases and their associated tables. We will look at how to set up a database cluster, add machines to it, and scale RethinkDB.

In this chapter, you will also learn the following topics:

  • Managing a RethinkDB cluster
  • What replication is, and how to replicate tables
  • What sharding is, and how to implement it within RethinkDB

Before we start working on the database, we will give a brief definition of scaling and explain why it is necessary and how it can be achieved.

An introduction to scaling

Scaling is an overloaded term. Finding a simple definition is tricky.

First of all, scaling doesn't refer to a specific technique or technology; scalability is an attribute of a specific architecture. As a general definition, we can say that scalability is the ability of a software application to handle an increasing workload, such as larger datasets, higher request rates, and so on.

When talking about scaling software, we usually differentiate between the following:

  • Vertical scalability or scaling up can be defined as the ability to grow using stronger hardware and resources
  • Horizontal scalability or scaling out refers to the ability to grow by adding more machines

It's important to note the differences between vertical and horizontal scaling. Vertical scaling basically means adding more capacity to a single node in a system. Virtually all existing databases can be vertically scaled by adding memory, a faster CPU, or larger hard drives; however, there is a limit to the amount of resources you can add to a single machine, which makes scaling up insufficient for huge datasets.

However, when someone uses the word scalability, they are often referring to horizontal scalability. With a horizontally scalable system, you can add capacity to the database by adding more nodes to the cluster.

Scaling a database horizontally is usually achieved by partitioning the data among multiple machines, and this provides a huge advantage: database administrators can increase capacity and improve resiliency and redundancy on the fly just by adding another machine to the cluster.

What kind of system is it?

The most important question that you have to ask yourself when considering the scalability of your database is what kind of system am I working on? Are you working on a system where the majority of queries read data from the database? Or is it a write-intensive database? How much are you expecting the dataset to grow? Knowing what kind of queries your database is going to receive will help you select the right technologies when you tackle scaling. When scaling a database, it is important to understand exactly what you're going to scale. We can identify three general properties that you can scale in a database system:

  • Read queries
  • Write queries
  • Data

Scaling reads

A read query retrieves some data from the database and presents the results to the client application. Each query takes processing time and requires a free socket (or file descriptor) on the server, so a single server can process only a limited number of concurrent requests. The point here is that scaling reads essentially means reducing the number of requests that each database server has to handle.

If your system is primarily read-heavy, vertical scaling can often be an effective solution. This strategy can also be coupled with a robust caching system, such as memcached, that caches query results and so limits the number of requests that reach the database. In this way, frequently used items can often be held in memory and returned to the client without hitting the database. If, however, your application generates more requests than a cache system can handle, you need to set up a second server that the client can read from. This is called a read replica and is achieved by replicating the database.
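The caching idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ReadThroughCache` class, the `fetch_from_db` callback, and the `fake_table` dictionary are hypothetical names made up for this example, and a plain in-process dictionary stands in for a real cache such as memcached.

```python
class ReadThroughCache:
    """Tiny in-process stand-in for a cache such as memcached (illustrative only)."""

    def __init__(self, fetch_from_db):
        # fetch_from_db is a hypothetical function that queries the database.
        self._fetch_from_db = fetch_from_db
        self._store = {}
        self.db_hits = 0  # counts how many reads actually reached the database

    def get(self, key):
        # Serve the value from memory when it has been read before.
        if key in self._store:
            return self._store[key]
        # Otherwise go to the database once and remember the result.
        self.db_hits += 1
        value = self._fetch_from_db(key)
        self._store[key] = value
        return value


# Hypothetical backend: a dict standing in for a database table.
fake_table = {"user:1": {"name": "Alice"}, "user:2": {"name": "Bob"}}
cache = ReadThroughCache(fake_table.get)

# Five reads of the same key reach the "database" only once.
for _ in range(5):
    cache.get("user:1")
```

A production cache would also need an eviction policy and a way to invalidate entries when the underlying data changes; both are left out here to keep the sketch short.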

Replication allows you to create a copy of a database instance that can be used for scaling out read-heavy workloads, and therefore, improve the overall performance by using the replicas to distribute read traffic among multiple servers. We will discuss replication in more detail later on in the chapter.
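To make the idea of distributing read traffic concrete, here is a minimal sketch of a round-robin selector. The `ReplicaPool` class and the replica names are assumptions invented for this example; this is not how RethinkDB routes queries internally, only one simple way a client could spread reads over several copies of the data.

```python
import itertools


class ReplicaPool:
    """Hands out replicas in round-robin order (illustrative only)."""

    def __init__(self, replicas):
        if not replicas:
            raise ValueError("at least one replica is required")
        # cycle() repeats the list of replicas endlessly, in order.
        self._cycle = itertools.cycle(list(replicas))

    def next_replica(self):
        # Each call returns the next replica, wrapping around at the end.
        return next(self._cycle)


# Hypothetical replica names; in practice these would be server addresses.
pool = ReplicaPool(["replica-a", "replica-b", "replica-c"])

# Six consecutive reads are spread evenly over the three replicas.
targets = [pool.next_replica() for _ in range(6)]
```

Round-robin is the simplest policy; a real client might instead prefer the nearest or least loaded replica.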

Scaling writes

If your system is primarily write-intensive, adding a caching layer will not help as much as it does in a read-heavy environment. In this case, horizontal scaling is probably the most effective solution, as this allows you to split writes among multiple instances. This kind of scaling requires you to partition your data so that capacity can be added simply by adding more nodes to the database cluster, and this can be achieved through sharding.

Sharding is the process of splitting data records across multiple machines or shards. In RethinkDB, each shard is an independent part of a table and, collectively, shards make up a single logical table. Therefore, sharding reduces the number of operations each shard handles. As a result, a cluster can increase its capacity and write throughput horizontally.
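The following sketch illustrates how several shards can together behave like a single logical table. It assumes a simple range-based scheme in which keys are assigned to shards according to a list of split points; the `RangeShardedTable` class and its methods are hypothetical names chosen for this example and are not part of RethinkDB, whose actual sharding is handled by the server.

```python
import bisect


class RangeShardedTable:
    """One logical table split into shards by key range (illustrative only)."""

    def __init__(self, split_points):
        # N split points divide the key space into N + 1 shards.
        self._split_points = sorted(split_points)
        self._shards = [{} for _ in range(len(self._split_points) + 1)]

    def shard_for(self, key):
        # Keys below the first split point go to shard 0, and so on.
        return bisect.bisect_right(self._split_points, key)

    def insert(self, key, document):
        # Each write touches exactly one shard.
        self._shards[self.shard_for(key)][key] = document

    def get(self, key):
        # Reads are routed to the same shard that received the write.
        return self._shards[self.shard_for(key)].get(key)

    def shard_sizes(self):
        return [len(shard) for shard in self._shards]


# Two split points give three shards: keys < "h", "h" <= keys < "p", keys >= "p".
table = RangeShardedTable(split_points=["h", "p"])
for name in ["alice", "bob", "harry", "mallory", "peggy", "trent"]:
    table.insert(name, {"name": name})
```

With these sample keys, each shard ends up with two documents, so each one handles a third of the writes while callers still see one table.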

Scaling data

As the amount of data stored in a database grows, you will get closer to the server's maximum capacity. Today's hard drives are cheap and provide enough capacity for most use cases, but a single server can hold only so much data.

The solution is to distribute different tables and databases between multiple machines. Every machine will become a node, and all nodes together now form a cluster that holds all your data. In the next section, we will look at how to create a RethinkDB cluster:

Problem                        Solution
Scaling the dataset            Clustering
Scaling read queries           Vertical scaling, caching layer, and replication
Scaling write queries          Data partitioning and sharding
Providing high availability    Data replication

The previous table attempts to provide a general idea of how to solve common scaling problems in databases. However, as with most things in computing, good solutions are not usually as simple as they seem. In this introduction to scaling, I've attempted to simplify the ideas in order to write about the concepts rather than any specific tactics. Scaling is a hard problem that requires pragmatic thought at every step of the process!

Now that we know a little more about scaling a database, we are ready to start working on a RethinkDB cluster. It's useful to note that although this section looked at the scaling of reads, writes, and data separately, these rarely occur in isolation. The decision to scale one of these properties will almost certainly affect the others.
