Chapter 4. Performance Tuning and Advanced Queries

One of the most important aspects of working with a database is optimization. In this chapter, we will look at how to tune RethinkDB to improve performance. A number of configuration settings help us do this, and we'll also take a look at how to optimize queries.

This chapter also introduces RethinkDB's indexing, which allows us to boost query performance when working with large datasets.

In this chapter, you will also learn the following:

  • What indexing is and why you would want to use it
  • How to batch import data into RethinkDB
  • How to profile queries

At the end of the chapter, we'll also take a look at some more advanced queries, including joins and map/reduce queries.

Performance tuning

When we talk about performance tuning, we must distinguish between system and query optimizations. The latter is all about writing queries differently and accessing data in different ways to reduce the time the database takes to run a query. Later in the chapter, we will look at some specific query optimizations; for now, we will focus on how to tune the RethinkDB database server to improve overall performance.

There are several settings in RethinkDB's configuration file that help us do this. Although the defaults are often appropriate and guarantee the best level of safety, there are circumstances in which you may need to change them. In the following sections, we will look at several of these settings.

It's important to note that RethinkDB is a distributed database, so, as a general rule, performance can be boosted simply by adding nodes to the database cluster: as you learned in the previous chapter, sharding data allows us to spread the database load across multiple nodes, while replication can help boost performance in read-intensive use cases.
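As a quick refresher from the previous chapter, resharding and replication are controlled with the reconfigure command. The following sketch spreads a table across two shards with two replicas each (the database and table names are the ones used in this chapter's examples):

// Reshard the table across two shards, keeping two replicas of each
r.db('test').table('test').reconfigure({shards: 2, replicas: 2})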

Increasing the cache size

RethinkDB uses a page cache to keep recently used data in memory, as this helps minimize disk access. Increasing the size of the page cache means that a bigger portion of your dataset can fit in your system's RAM, automatically boosting the performance of all queries.

The cache size can be set by editing the database configuration file. Suppose we want to set the page cache size to 2 GB. First, we open the configuration file in a text editor. If you're running Linux, you can run the following:

sudo nano /etc/rethinkdb/instances.d/default.conf

Then, we edit the line starting with cache-size:

cache-size=2048

Here, we set it to 2048 MB (2 GB). Be sure to uncomment the line in the configuration file by removing the leading #. Save the instance's configuration and close the editor by pressing Ctrl + X, followed by Y, and then Enter.

To apply the new settings, restart RethinkDB by running the following command:

sudo /etc/init.d/rethinkdb restart

We can verify the size of the page cache by looking at the instance's log file from the database administration panel:

[Screenshot: the instance's log file in the administration panel]

When you scroll through the log file, you'll encounter a line reporting the size of the configured cache:

[Screenshot: the log line reporting the cache size]

By default, the page cache size is automatically set to half of the system's available RAM. In most use cases, the default setting will be sufficient; however, if you're working with huge datasets, you may want to increase the value. To avoid out-of-memory conditions, be sure to account for other processes' memory consumption when you change the size of the page cache.
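If you start the rethinkdb binary manually rather than as a service, the same setting can also be passed as a startup flag instead of editing the configuration file; as in the configuration file, the value is expressed in MB:

# Start the server with a 2 GB page cache (value in MB)
rethinkdb --cache-size 2048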

A big page cache is critical for the performance of read queries. In the following sections, we'll look at how to tune the performance of writes.

Increasing concurrency

Data safety and query performance are competing concerns, as every database system must strike a compromise between the two. RethinkDB defaults to a safe configuration: when you run a query that inserts or modifies data, RethinkDB will make sure that the data is written to disk before acknowledging the query to the client. This is known as hard durability.

When inserting data into RethinkDB, the bottleneck is usually that the client must wait for the disk acknowledgment of one write before issuing the next.

One possible solution to this problem is to increase concurrency by using multiple threads and connections. Parallelized insert queries can result in massive boosts in write performance, as shown in the sketch below.
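Here's a minimal sketch of this approach using the official JavaScript driver, assuming a local server and the test.test table used throughout this chapter. Each document is inserted over its own connection, so the writes run concurrently instead of queuing behind one another's disk acknowledgments:

var r = require('rethinkdb');

var docs = [{name: 'Alex'}, {name: 'Louise'}, {name: 'Matt'}];

// One connection per document, so the inserts run concurrently instead
// of each write waiting for the previous disk acknowledgment.
Promise.all(docs.map(function (doc) {
  return r.connect({host: 'localhost', port: 28015}).then(function (conn) {
    return r.db('test').table('test').insert(doc).run(conn)
      .then(function (result) { conn.close(); return result; });
  });
})).then(function (results) {
  console.log('Finished', results.length, 'parallel inserts');
});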

Another possible solution is to perform batch inserts by inserting multiple pieces of data in one query.

Suppose we want to insert the following three JSON documents into RethinkDB:

{"name": "Alex"}
{"name": "Louise"}
{"name": "Matt"}

Normally, you would write three separate queries as follows:

r.db('test').table('test').insert({"name": "Alex"})
r.db('test').table('test').insert({"name": "Louise"})
r.db('test').table('test').insert({"name": "Matt"})

Instead, you can combine the inserts into a single query as follows:

r.db('test').table('test').insert([{"name": "Alex"}, {"name": "Louise"}, {"name": "Matt"}])

In our tests on an average server, combining the inserts into a single query resulted in a roughly 60% faster query!

Using soft durability mode

In the previous section, you learned how RethinkDB treats write queries in the default configuration (also called hard durability); however, there is also a less safe mode called soft durability.

In soft durability mode, RethinkDB acknowledges write queries to the client before writing the data to disk. This results in faster inserts; however, data may be lost in the case of a power failure. Depending on your specific use case, potentially losing a few seconds' worth of data could be a good trade-off between data safety and performance.

As an example, let's insert a JSON document into the table called test using the soft durability mode:

r.db('test').table('test').insert({"name": "Matt"}, {durability: 'soft'})

Although this query is not as safe as queries run in hard durability mode, the database will not become corrupted in the event of a power failure.
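Durability can also be configured per table when the table is created, so that every write to it defaults to soft durability; individual queries can still opt back into hard durability. Here's a brief sketch using a hypothetical events table (the table name is illustrative):

// Writes to this table default to soft durability
r.db('test').tableCreate('events', {durability: 'soft'})

// A single query can still request hard durability explicitly
r.db('test').table('events').insert({name: 'Matt'}, {durability: 'hard'})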

Now that you've learned how to tweak RethinkDB's settings to improve overall performance, we'll look at how to optimize queries. Before doing this, we'll go over how to import large quantities of data into the database.
