Clustering RethinkDB

Clustering in RethinkDB's context refers to the ability of several servers or instances to support a single database. An instance can be a database process running on the same machine as other RethinkDB instances, or it can run on a completely separate server.

Clustering offers three major advantages, especially for databases with large datasets:

  • Data scalability
  • Fault tolerance
  • Load balancing

As we have seen in the previous sections, clustering addresses the problem of growing data volume: a single machine has limited capacity, so a cluster lets us spread large quantities of data across multiple instances.

Clustering also provides additional fault tolerance; that is, if a software component fails, a backup component or procedure can immediately take its place. In a clustered environment, there is more than one instance for the client to connect to, so an alternative endpoint remains available if an individual server fails.

Finally, a database cluster can provide load balancing, as incoming queries can be routed to different instances within the cluster, reducing the load on any single machine.
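To make the failover idea concrete, here is a minimal Python sketch of a client that walks a list of cluster endpoints and settles on the first one that answers. The helper names connect_with_failover and fake_connect are hypothetical, and 28015 is assumed to be the client driver port (RethinkDB's default). The fake connector stands in for a real driver call, so nothing here talks to an actual server.

```python
from typing import Callable, Iterable, Tuple, TypeVar

Conn = TypeVar("Conn")
Endpoint = Tuple[str, int]


def connect_with_failover(
    endpoints: Iterable[Endpoint],
    connect: Callable[[str, int], Conn],
) -> Tuple[Endpoint, Conn]:
    """Try each endpoint in order; return the first one that accepts a connection."""
    errors = []
    for host, port in endpoints:
        try:
            return (host, port), connect(host, port)
        except OSError as exc:
            # This instance is unreachable: remember why and try the next one.
            errors.append(f"{host}:{port} -> {exc}")
    raise ConnectionError("no instance reachable: " + "; ".join(errors))


def fake_connect(host: str, port: int) -> str:
    """Hypothetical stand-in for a driver call; pretends 10.0.0.1 is down."""
    if host == "10.0.0.1":
        raise ConnectionRefusedError("connection refused")
    return f"connected to {host}:{port}"


endpoint, conn = connect_with_failover(
    [("10.0.0.1", 28015), ("10.0.0.2", 28015)], fake_connect
)
print(endpoint, conn)
```

With a real driver, the connect callable would wrap the driver's own connection function, and the except clause would need to catch whatever driver-specific error it raises instead of OSError.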

Creating a cluster

If you've made it this far, I'm assuming that you more or less understand the basics of scaling, know how to access the RethinkDB administration interface, and can run simple queries. In this section, we'll cover how to create a RethinkDB cluster and configure an instance to be part of the cluster. Then, we'll go over how to add new machines to an existing cluster. Finally, we'll look at how the admin interface displays the status of the cluster and signals any problems.

We've already seen that a cluster is a group of two or more independent instances operating as a single system and that it can help us achieve high availability and improve performance for the database service.

Note

I must point out that it is not strictly necessary to run a RethinkDB cluster. In some use cases, it can make more sense to use a single machine. If you're working on a really small dataset and can get away with a single server, it's much simpler. If, however, you want to store large volumes of data or access it at a higher rate than a single server can handle, you'll need to set up a cluster. RethinkDB is specifically engineered to be used in a cluster of many machines that can split the load in very high-volume situations.

Let's start by creating a RethinkDB cluster. For this example, you'll need two servers; I'll assume that their IP addresses are 10.0.0.1 and 10.0.0.2. When you start a RethinkDB instance, it reads the configuration file to determine whether the node should connect to an existing cluster. In this case, we don't have an existing cluster, so our node will be the first instance in the cluster. This is called a seed node. Because our first instance acts as a seed node for the second instance, when the second instance comes online, it will use the seed node as a reference point to enter the cluster.

Note

A seed node is used as a reference for other nodes that connect to the cluster; however, clusters in RethinkDB have no special nodes, such as a master node. A RethinkDB cluster is a pure peer-to-peer network.

The first step is to install RethinkDB on both servers. If you don't remember the installation procedure, you can check it in Chapter 1, Introducing RethinkDB. The next step is to edit the configuration file for the first instance. For the purpose of this example, let's say this instance is called rethink1.

We can open the configuration file by running the following command:

sudo nano /etc/rethinkdb/instances.d/default.conf

The instance's name is set in this file, so let's set it to rethink1 by finding this line:

# server-name=server1

After finding the previous line, we change it to:

server-name=rethink1

The next thing we want to do is make the instance accessible from all network interfaces. We can do this by editing the bind setting. Look for this line:

# bind=127.0.0.1

And change it to:

bind=all

This is necessary because other servers communicate with this machine through the intracluster port; however, this configuration also leaves the database accessible from the internet, so make sure to secure your server.
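If you would rather script these two edits than make them by hand, the following Python sketch shows one way to do it. It works on the text of a configuration file rather than on the file itself, and the function name set_option is hypothetical; the sample text only contains the three commented lines quoted in this chapter, not a full default.conf.

```python
import re


def set_option(conf_text: str, key: str, value: str) -> str:
    """Uncomment (if needed) and set key=value in RethinkDB-style config text."""
    # Matches "key=..." or "# key=..." at the start of a line.
    pattern = re.compile(rf"^#?[ \t]*{re.escape(key)}[ \t]*=.*$", re.MULTILINE)
    new_line = f"{key}={value}"
    if pattern.search(conf_text):
        # Replace only the first occurrence and leave the rest untouched.
        return pattern.sub(lambda _match: new_line, conf_text, count=1)
    # The key is not present at all, so append it.
    return conf_text.rstrip("\n") + "\n" + new_line + "\n"


# Sample text: only the lines discussed in this chapter (an assumption,
# the real file contains many more options and comments).
sample = """\
# server-name=server1
# bind=127.0.0.1
# join=example.com:29015
"""

updated = set_option(sample, "server-name", "rethink1")
updated = set_option(updated, "bind", "all")
print(updated)
```

The join line stays commented out here because rethink1 is the first instance in the cluster and has nothing to join yet.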

Save the configuration for the instance and close the editor by pressing Ctrl + X, followed by Y, and then Enter. We can now restart the instance so that RethinkDB can apply the changes:

sudo /etc/init.d/rethinkdb restart

We will now have a running RethinkDB instance called rethink1. Let's verify that everything is working correctly by accessing the administration panel. In this example, I can access it by going to http://10.0.0.1:8080.

You will get a screen similar to this one:

[Screenshot: the administration panel showing a cluster with a single server, rethink1]

As you can see from the previous image, we have successfully created a cluster that contains one machine called rethink1. The next step will be to add another machine to the existing cluster.

Adding a server to the cluster

Now that we've got our RethinkDB cluster up and running, let's add another node to it. Following on from the previous example, suppose we want to connect the server rethink2, with IP address 10.0.0.2, to the cluster. The first thing we need to do is edit the configuration file and set the server name.

First, we open the configuration file for editing:

sudo nano /etc/rethinkdb/instances.d/default.conf

Then, we find the server-name property:

# server-name=server1

We change it to:

server-name=rethink2

The next step is to make the server accessible from other servers. Just as we did before, we need to edit the bind property and set it to all:

# bind=127.0.0.1

This gives us the following:

bind=all

The last change that we need to make to the settings file is the join property:

# join=example.com:29015

We need to replace the value with the IP address of the first machine:

join=10.0.0.1:29015

This tells RethinkDB to join the existing cluster that can be reached at the specified address.
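The difference between the two configuration files comes down to the join line. As a rough illustration, the Python sketch below reads the uncommented options from a configuration text and reports whether the instance would start alone or join an existing cluster. The function names and the two sample texts are hypothetical; they contain only the settings edited in this chapter.

```python
from typing import Dict, List


def active_options(conf_text: str) -> Dict[str, List[str]]:
    """Collect uncommented key=value lines; a key may appear more than once."""
    options: Dict[str, List[str]] = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        options.setdefault(key.strip(), []).append(value.strip())
    return options


def describe_role(conf_text: str) -> str:
    """Summarize whether this instance starts alone or joins a cluster."""
    opts = active_options(conf_text)
    name = opts.get("server-name", ["<unnamed>"])[0]
    joins = opts.get("join", [])
    if not joins:
        return f"{name}: no join address, starts as the first (seed) node"
    return f"{name}: joins the cluster via {', '.join(joins)}"


# Sample texts mirroring the edits made in this chapter (assumed, not full files).
RETHINK1_CONF = "server-name=rethink1\nbind=all\n# join=example.com:29015\n"
RETHINK2_CONF = "server-name=rethink2\nbind=all\njoin=10.0.0.1:29015\n"

print(describe_role(RETHINK1_CONF))
print(describe_role(RETHINK2_CONF))
```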

Save the configuration for this server and close the editor. We can now restart the instance so that RethinkDB can apply the changes:

sudo /etc/init.d/rethinkdb restart

If the instance starts without errors, we now have a running RethinkDB instance called rethink2 connected to the existing cluster. Just as we did before, let's verify that everything is working correctly by accessing the administration panel. You can access the admin page from any machine connected to the cluster. In this example, I'll access it by going to the following URL:

http://10.0.0.1:8080

You will get a screen similar to this one:

[Screenshot: the administration panel showing two servers connected to the cluster]

As you can see from this screenshot, the administration interface confirms that there are two servers connected to the cluster. Congratulations! You've just created a full-fledged, two-node RethinkDB cluster!

Running queries on the cluster

Now that we have two machines connected to our cluster, it's interesting to see how RethinkDB automatically uses both machines when executing queries.

For example, let's create two new tables in our database. If you recall from the previous chapter, we used the tableCreate command to instruct RethinkDB to create a new table for us. In a single-machine environment, all tables and data would be created on the same machine. Now that we're working on a cluster, let's see what happens.

First, we can create a table called clusteringTest1 by running the following query in the Data Explorer:

r.db('test').tableCreate('clusteringTest1')

If the query succeeds, you will receive an output similar to this:

[Screenshot: JSON result of the tableCreate query for clusteringTest1]

Can you see anything strange? If we take a look at the resulting JSON, the primary_replica key inside the shards object tells us that the table has been created on one of the nodes within the cluster.
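Since the result is just JSON, it is easy to pull the primary replica out of it programmatically. The Python sketch below does this for a hand-written sample result. The field layout (config_changes, new_val, shards) is my assumption of what RethinkDB 2.x returns, and the server name in the sample is made up for illustration; your own output tells you which node was actually chosen.

```python
from typing import Any, Dict, List


def primary_replicas(result: Dict[str, Any]) -> List[str]:
    """Return the primary replica of every shard in a tableCreate result."""
    primaries: List[str] = []
    for change in result.get("config_changes", []):
        new_config = change.get("new_val") or {}
        for shard in new_config.get("shards", []):
            primaries.append(shard["primary_replica"])
    return primaries


# Hand-written sample (assumed shape, hypothetical server assignment).
sample_result = {
    "config_changes": [
        {
            "new_val": {
                "db": "test",
                "name": "clusteringTest1",
                "primary_key": "id",
                "shards": [
                    {"primary_replica": "rethink1", "replicas": ["rethink1"]}
                ],
            },
            "old_val": None,
        }
    ],
    "tables_created": 1,
}

print(primary_replicas(sample_result))
```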

Let's run this query one more time. This time, we will create a new table called clusteringTest2 by running the following query:

r.db('test').tableCreate('clusteringTest2')

The output will be similar to this:

[Screenshot: JSON result of the tableCreate query for clusteringTest2]

Once again, if we look at the resulting JSON, we can see that the table has been created on one of the nodes within the cluster.

RethinkDB has automatically started using both the nodes that make up the cluster and, without any specific instruction from us, the database stores data on both servers. We can verify this by browsing the Servers page of the web interface:

[Screenshot: the Servers page listing rethink2 and rethink1, each holding one primary]

As you can see, the table displays both the servers, rethink2 and rethink1, and each one contains exactly one primary. This confirms that RethinkDB indeed uses all the machines in the cluster and balances tables across both nodes.
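The same check can be expressed in code by counting how many primaries each server holds. The Python sketch below tallies them from a list of table configurations. The sample rows are hand-written and the assignment of tables to servers is hypothetical; I am assuming the rows look like those of RethinkDB's table_config system table, where each shard names its primary_replica.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def primaries_per_server(table_configs: Iterable[Dict[str, Any]]) -> Counter:
    """Count how many shard primaries each server holds."""
    counts: Counter = Counter()
    for config in table_configs:
        for shard in config.get("shards", []):
            counts[shard["primary_replica"]] += 1
    return counts


# Hand-written sample rows (assumed shape, hypothetical assignment).
sample_configs = [
    {"name": "clusteringTest1",
     "shards": [{"primary_replica": "rethink1", "replicas": ["rethink1"]}]},
    {"name": "clusteringTest2",
     "shards": [{"primary_replica": "rethink2", "replicas": ["rethink2"]}]},
]

counts = primaries_per_server(sample_configs)
for server, total in sorted(counts.items()):
    print(f"{server}: {total} primary")
```

An even spread, as in this sample, is what the Servers page shows for our two-node cluster.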

You may think that this behavior is expected; however, the database distributes the data between both nodes without any specific query from the user. Everything is done automatically. This is a big feature of RethinkDB; complex operations can be achieved in just a few clicks. In the following section, we'll cover replication.
