Chapter 11. Performance Tuning

In this chapter, we look at how to tune Cassandra to improve performance. A variety of settings in the configuration file help us do this, and we present a few pointers on hardware selection and configuration. There are several isolated settings that you can update in Cassandra’s configuration file; although the defaults are often appropriate, there might be circumstances in which you need to change them. In this chapter, we look at several of those settings.

As a general rule, it’s important to note that simply adding nodes to a cluster will not improve performance on its own. You need to replicate the data appropriately, then send traffic to all the nodes from your clients. If you aren’t distributing client requests, the new nodes could just stand by somewhat idle.

We also see how to use the Python stress test tool that ships with Cassandra to run a reasonable load against Cassandra and quickly see how it behaves under stress test circumstances. We can then tune Cassandra appropriately and feel confident that we’re ready to launch in a staging environment.

Data Storage

There are two sets of files that Cassandra writes to as part of handling update operations: the commit log and the datafile. Their different purposes need to be considered in order to understand how to treat them during configuration.

The commit log can be thought of as short-term storage. As Cassandra receives updates, every write value is written immediately to the commit log in the form of raw sequential file appends. If you shut down the database or it crashes unexpectedly, the commit log can ensure that data is not lost. That’s because the next time you start the node, the commit log gets replayed. In fact, that’s the only time the commit log is read; clients never read from it. But the normal write operation to the commit log blocks, so it would damage performance to require clients to wait for the write to finish.

The datafile represents the Sorted String Tables (SSTables). Unlike the commit log, data is written to this file asynchronously. The SSTables are periodically merged during major compactions to free up space. To do this, Cassandra will merge keys, combine columns, and delete tombstones.

Read operations can refer to the in-memory cache and in this case don’t need to go directly to the datafiles on disk. If you can allow Cassandra a few gigabytes of memory, you can improve performance dramatically when the row cache and the key cache are hit.

The commit logs are periodically removed, following a successful flush of all their appended data to the dedicated datafiles. For this reason, the commit logs will not grow to anywhere near the size of the datafiles, so the disks don’t need to be as large; this is something to consider during hardware selection. For example, if Cassandra runs a flush, you’ll see something in the server logs like this:

INFO 18:26:11,497 Enqueuing flush of Memtable-LocationInfo@26830618(52 bytes, 2 
operations)
INFO 18:26:11,497 Writing Memtable-LocationInfo@26830618(52 bytes, 2 operations)
INFO 18:26:11,732 Completed flushing /var/lib/cassandra/data/system/
LocationInfo-2-Data.db
INFO 18:26:11,732 Discarding obsolete commit log:
  CommitLogSegment(/var/lib/cassandra/commitlog/CommitLog-1278894011530.log)

Then, if you check the commit log directory, that file has been deleted.

By default, the commit log and the datafile are stored in the following locations:

<CommitLogDirectory>/var/lib/cassandra/commitlog</CommitLogDirectory>

<DataFileDirectories>
  <DataFileDirectory>/var/lib/cassandra/data</DataFileDirectory>
</DataFileDirectories>

You can change these values to store the datafiles or commit log in different locations. You can specify multiple datafile directories if you wish.

Note

You don’t need to update these values for Windows, even if you leave them in the default location, because Windows will automatically adjust the path separator and place them under C:. Of course, in a real environment, it’s a good idea to specify them separately, as indicated.

For testing, you might not see a need to change these locations. However, it’s recommended that you store the datafiles and the commit logs on separate hard disks for maximum performance. Cassandra, like many databases, is particularly dependent on the speed of the hard disk and the speed of the CPUs (it’s best to have four or eight cores, to take advantage of Cassandra’s highly concurrent construction). So make sure for QA and production environments to get the fastest disks you can, and get at least two separate ones so that the commit logfiles and the datafiles are not competing for I/O time. It’s more important to have several processing cores than one or two very fast ones.

Reply Timeout

The reply timeout is a setting that indicates how long Cassandra will wait for other nodes to respond before deciding that the request is a failure. This is a common setting in relational databases and messaging systems. This value is set by the RpcTimeoutInMillis element (rpc_timeout_in_ms in YAML). By default, this is 5,000, or five seconds.

Commit Logs

You can set the value for how large the commit log is allowed to grow before it stops appending new writes to a file and creates a new one. This is similar to setting log rotation on Log4J.

This value is set with the CommitLogRotationThresholdInMB element (commitlog_rotation_threshold_in_mb in YAML). By default, the value is 128MB.

Another setting related to commit logs is the sync operation, represented by the commitlog_sync element. There are two possible settings for this: periodic and batch. periodic is the default, and it means that the server will make writes durable only at specified intervals. When the server is set to make writes durable periodically, you can potentially lose the data that has not yet been synced to disk from the write-behind cache.

In order to guarantee durability for your Cassandra cluster, you may want to examine this setting.

If your commit log is set to batch, it will block until the write is synced to disk (Cassandra will not acknowledge write operations until the commit log has been completely synced to disk). This clearly will have a negative impact on performance.

You can change the value of the configuration attribute from periodic to batch to specify that Cassandra must flush to disk before it acknowledges a write. Changing this value will require taking some performance metrics, as there is a necessary trade-off here: forcing Cassandra to write more immediately constrains its freedom to manage its own resources. If you do set commitlog_sync to batch, you need to provide a suitable value for CommitLogSyncBatchWindowInMS, where MS is the number of milliseconds between each sync effort. Moreover, this is not generally needed in a multinode cluster when using write replication, because replication by definition means that the write isn’t acknowledged until another node has it.

If you decide to use batch mode, you will probably want to split the commit log onto a separate device to mitigate the performance impact. It’s a good idea to split it out onto a separate disk from the SSTables (data) anyway, even if you don’t do this.

Memtables

Each column family has a single memtable associated with it. There are a few settings around the treatment of memtables. The size that the memtable can grow to before it is flushed to disk as an SSTable is specified with the MemtableSizeInMB element (binary_memtable_throughput_in_mb in YAML). Note that this value is based on the size of the memtable itself in memory, and not heap usage, which will be larger because of the overhead associated with column indexing.

You’ll want to balance this setting with MemtableObjectCountInMillions, which sets a threshold for the number of column values that will be stored in a memtable before it is flushed.

A related configurable setting is memtable_throughput_in_mb. This refers to the maximum number of columns that will be stored in a single memtable before the memtable is flushed to disk as an SSTable. The default value is 0.3, which is approximately 333,000 columns.

You can also configure how long to keep memtables in memory after they’ve been flushed to disk. This value can be set with the memtable_flush_after_mins element. When the flush is performed, it will write to a flush buffer, and you can configure the size of that buffer with flush_data_buffer_size_in_mb.

Another element related to tuning the memtables is memtable_flush_writers. This setting, which is 1 by default, indicates the number of threads used to write out the memtables when it becomes necessary. If you have a very large heap, it can improve performance to set this count higher, as these threads are blocked during disk I/O.

Concurrency

Cassandra differs from many data stores in that it offers much faster write performance than read performance. There are two settings related to how many threads can perform read and write operations: concurrent_reads and concurrent_writes. In general, the defaults provided by Cassandra out of the box are very good. But you might want to update the concurrent_reads setting immediately before you start your server. That’s because the concurrent_reads setting is optimal at two threads per processor core. By default, this setting is 8, assuming a four-core box. If that’s what you have, you’re in business. If you have an eight-core box, tune it up to 16.

The concurrent_writes setting behaves somewhat differently. This should match the number of clients that will write concurrently to the server. If Cassandra is backing a web application server, you can tune this setting from its default of 32 to match the number of threads the application server has available to connect to Cassandra. It is common in Java application servers such as WebLogic to prefer database connection pools no larger than 20 or 30, but if you’re using several application servers in a cluster, you’ll need to factor that in as well.

Caching

There are several settings related to caching, both within Cassandra and at the operating system level. Caches can use considerable memory, and it’s a good idea to tune them carefully once you understand your usage patterns.

There are two primary caches built into Cassandra: a row cache and a key cache. The row cache caches complete rows (all of their columns), so it is a superset of the key cache. If you are using a row cache for a given column family, you will not need to use a key cache on it as well.

Your caching strategy should therefore be tuned in accordance with a few factors:

  • Consider your queries, and use the cache type that best fits your queries.

  • Consider the ratio of your heap size to your cache size, and do not allow the cache to overwhelm your heap.

  • Consider the size of your rows against the size of your keys. Typically keys will be much smaller than entire rows.

The keys_cached setting indicates the number of key locations—not key values—that will be saved in memory. This can be specified as a fractional value (a number between 0 and 1) or as an integer. If you use a fraction, you’re indicating a percentage of keys to cache, and an integer value indicates an absolute number of keys whose locations will be cached.

Note

The keys_cached setting is a per-column family setting, so different column families can have different numbers of key locations cached if some are used more frequently than others.

This setting will consume considerable memory, but can be a good trade-off if your locations are not hot already.

The purpose of disk_access_mode is to enable memory mapped files so that the operating system can cache reads, thus reducing the load on Cassandra’s internal caches. This sounds great, but in practice, disk_access_mode is one of the less-useful settings, and at this point doesn’t work exactly as was originally envisioned. This may be improved in the future, but it is just as likely that the setting will be removed. Certainly feel free to play around with it, but you might not see much difference.

You can also populate the row cache when the server starts up. To do this, use the preload_row_cache element. The default setting for this is false, but you will want to set it to true to improve performance. The cost is that bootstrapping can take longer if there is considerable data in the column family to preload.

The rows_cached setting specifies the number of rows that will be cached. By default, this value is set to 0, meaning that no rows will be cached, so it’s a good idea to turn this on. If you use a fraction, you’re indicating a percentage of everything to cache, and an integer value indicates an absolute number of rows whose locations will be cached. You’ll want to use this setting carefully, however, as this can easily get out of hand. If your column family gets far more reads than writes, then setting this number very high will needlessly consume considerable server resources. If your column family has a lower ratio of reads to writes, but has rows with lots of data in them (hundreds of columns), then you’ll need to do some math before setting this number very high. And unless you have certain rows that get hit a lot and others that get hit very little, you’re not going to see much of a boost here.

Buffer Sizes

The buffer sizes represent the memory allocation when performing certain operations. The following is a quick overview of these settings:

flush_data_buffer_size_in_mb

By default, this is set to 32 megabytes and indicates the size of the buffer to use when memtables get flushed to disk.

flush_index_buffer_size_in_mb

By default, this is set to 8 megabytes. If each key defines only a few columns, then it’s a good idea to increase the index buffer size. Alternatively, if your rows have many columns, then you’ll want to decrease the size of the buffer.

sliced_buffer_size_in_kb

Depending on how variable your queries are, this setting is unlikely to be very useful. It allows you to specify the size, in kilobytes, of the buffer to use when executing slices of adjacent columns. If there is a certain slice query that you perform far more than others, or if your data is laid out with a relatively consistent number of columns per family, then this setting could be moderately helpful on read operations. But note that this setting is defined globally.

Using the Python Stress Test

Cassandra ships with a popular utility called py_stress that you can use to run a stress test on your Cassandra cluster. To run py_stress, navigate to the <cassandra-home>/contrib directory. You might want to check out the README.txt file, as it will have the list of dependencies required to run the tool.

There are a few steps we need to go through before we can run the tool. First, make sure that you still have the default keyspace (Keyspace1) in your configuration and have it loaded, as the tool will run against the default column family definitions.

Then, you need to build the Python Thrift interface, which may require a few steps.

Generating the Python Thrift Interfaces

Before we can run the stress tool, we need to make sure that we have the Thrift interfaces for Python available to it (because it’s a Python script). You’ll know that you haven’t done this step yet, or have done it incorrectly, if you see an error like this after running the command:

No module named thrift.transport

Make sure that you have Python installed on your system. To do this, open a terminal and type $python; you should see output similar to this:

eben@morpheus$ python
Python 2.6.5 (r265:79063, Apr 16 2010, 13:09:56) 
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information

Make sure that you have Python 2.6 or better installed so that you can take advantage of the newest multithreading capabilities it offers, which come in handy with the stress test.

Getting Thrift

To get Thrift, download it from http://incubator.apache.org/thrift. This will give you a file like thrift-0.2.0-incubating.tar.gz, which you can decompress. Then, in the directory where you decompressed it, execute the following command on Linux to establish some dependencies:

$ sudo apt-get install -y libboost-dev libevent-dev python-dev automake pkg-config 
libtool flex bison

Thrift might not run on your system if you don’t have the C++ Boost libraries properly installed. Get Boost from http://www.boost.org. Then, in the Boost home directory, run these commands:

$ ./bootstrap.sh
$ ./bjam

This should compile and install Boost. Now, to build Thrift, run a series of commands as root from the Thrift home directory:

$ ./bootstrap.sh
$ ./configure
$ ./make
$ ./make install
$ cd lib/py
$ ./make

Now run the command $which thrift. This will tell you where Thrift installed. On my system, this is /usr/local/bin/thrift.

Note

On Ubuntu Linux, you might see an error such as “autoscan: not found” when trying to install Thrift. Autoscan is also not available via yum or apt-get. In order to get autoscan, which the Thrift bootstrap needs, you’ll have to run this command: $ sudo apt-get install automake. You can tell whether you have autoscan installed by executing the command $which autoscan. If it returns nothing, you don’t have it.

If Thrift is installed properly, you should be able to run the Thrift program and see help output:

$ thrift -version
Thrift version 0.2.0-exported

Put your site-packages directory on the Python path:

$ export PYTHONPATH=/usr/lib/python2.6/site-packages/

Now you can navigate to the <cassandra-home> directory. Run the following command to generate the Thrift interfaces for Python:

$ ant gen-thrift-py

You should see a successful build message. Now you’ll have the Cassandra Thrift interfaces for Python generated in the directory <cassandra-home>/interfaces/thrift/gen-py. Copy this directory to <cassandra-home>/contrib/py_stress. Now just make sure that your Cassandra server is started, and you’re ready to run the Python stress test.

Running the Python Stress Test

Now that we’re all set up, we can run the stress test. Navigate to the <cassandra-home>/contrib/py_stress directory. In a terminal, type stress.py to run the test against localhost.

If you see a message indicating that the stress test cannot connect to localhost:9160, you have a couple of options. First, make sure that your Cassandra server is actually started and listening on that address and port. If you configured Cassandra to listen on your IP or hostname instead of localhost, you can change the script to point where your server is, change your config to listen on localhost, or, probably best, just pass the script the -d option to indicate which node you want to connect to. Here we go:

$ stress.py -d 192.168.1.5

Note

You can execute stress.py -h to get usage on the stress test script.

The test will run until it inserts one million values, and then stop. I ran this test on a single regular workstation with an Intel I7 processor (which is similar to eight cores) with 4GB of RAM available and a lot of other processes running. Here was my output:

eben@morpheus$ ./stress.py -d 192.168.1.5 -o insert
total,interval_op_rate,avg_latency,elapsed_time
196499,19649,0.0024959407711,10
370589,17409,0.00282591440216,20
510076,13948,0.00295883878841,30
640813,13073,0.00438663874102,40
798070,15725,0.00312562838215,50
950489,15241,0.0029109908417,60
1000000,4951,0.00444872583334,70

Let’s unpack this a bit. What we’ve done is generated and inserted one million values into a completely untuned single node Cassandra server in about 70 seconds. You can see that in the first 10 seconds we inserted 196,499 randomly generated values. The average latency per operation is 0.0025 seconds, or 2.5 milliseconds. But this is using the defaults, and is on a Cassandra server that already has 1GB of data in the database to manage before running the test. Let’s give the test more threads to see if we can squeeze a little better performance out of it:

eben@morpheus$ ./stress.py -d 192.168.1.5 -o insert -t 10
total,interval_op_rate,avg_latency,elapsed_time
219217,21921,0.000410911544945,10
427199,20798,0.000430060066223,20
629062,20186,0.000443717396772,30
832964,20390,0.000437958271074,40
1000000,16703,0.000463042383339,50

What we’ve done here is used the -t flag to use 10 threads at once. This means that in 50 seconds, we wrote 1,000,000 records—about 2 milliseconds latency per write, with a totally untuned database that is already managing 1.5GB of data.

You should run the test several times to get an idea of the right number of threads given your hardware setup. Depending on the number of cores on your system, you’re going to see worse—not better—performance if you set the number of threads arbitrarily high, because the processor will devote more time to managing the threads than doing your work. You want this to be a rough match between the number of threads and the number of cores available to get a reasonable test.

Now that we have all of this data in the database, let’s use the test to read some values too:

$ ./stress.py -d 192.168.1.5 -o read
total,interval_op_rate,avg_latency,elapsed_time
103960,10396,0.00478858081549,10
225999,12203,0.00406984714627,20
355129,12913,0.00384438665076,30
485728,13059,0.00379976526221,40
617036,13130,0.00378045491559,50
749154,13211,0.00375620621777,60
880605,13145,0.00377542658007,70
1000000,11939,0.00374060139004,80

As you can see, Cassandra doesn’t read nearly as fast as it writes; it takes about 80 seconds to read one million values. Remember, though, that this is out of the box, untuned, single-threaded, on a regular workstation running other programs, and the database is 2GB in size. Regardless, this is a great tool to help you do performance tuning for your environment and to get a set of numbers that indicates what to expect in your cluster.

Startup and JVM Settings

Cassandra allows you to configure a variety of options for how the server should start up, how much Java memory should be allocated, and so forth. In this section we look at how to tune the startup.

If you’re using Windows, the startup script is called cassandra.bat, and on Linux it’s cassandra.sh. You can start the server by simply executing this file, which sets several defaults. But there’s another file in the bin directory that allows you to configure a variety of settings related to how Cassandra starts. This file is called cassandra.in.sh and it separates certain options, such as the JVM settings, into a different file to make it easier to update.

Tuning the JVM

Try tuning the options passed to the JVM at startup for better performance. The key JVM options included in cassandra.in.sh and guidelines for how you might tweak them are illustrated in Table 11-1. If you want to change any of these values, simply open the cassandra.in.sh file in a text editor, change the values, and restart.

Table 11-1. Java performance tuning options
Java optionSetting guidelines
Heap Min and MaxBy default, these are set to 256MB and 1GB, respectively. To tune these, set them higher (see following note) and to the same value.
AssertionsBy default, the JVM is passed the -ea option to enable assertions. Changing this option from -ea to -da (disable assertions) can have a positive effect on performance.
Survivor RatioThe Java heap is broadly divided into two object spaces: young and old. The young space is subdivided into one for new object allocation (called “eden space”) and another for new objects that are still in use. Older objects still have some reference, and have therefore survived a few garbage collections, so the Survivor Ratio is the ratio of eden space to survivor space in the young object part of the heap. Increasing the ratio makes sense for applications with lots of new object creation and low object preservation; decreasing it makes sense for applications with longer-living objects. Cassandra sets this value to 8 by default, meaning that the ratio of eden to survivor space is 1:8 (each survivor space will be 1/8 the size of eden). This is fairly low, because the objects are living longer in the memtables. Tune this setting along with MaxTenuringThreshold.
MaxTenuringThresholdEvery Java object has an age field in its header, indicating how many times it has been copied within the young generation space. They’re copied (into a new space) when they survive a young generation garbage collection, and this copying has a cost. Because long-living objects may be copied many times, tuning this value can improve performance. By default, Cassandra has this value set at 1. Set it to 0 to immediately move an object that survives a young generation collection to the tenured generation.
UseConcMarkSweepGCThis instructs the JVM on what garbage collection (GC) strategy to use; specifically, it enables the ConcurrentMarkSweep algorithm. This setting uses more RAM, and uses more CPU power to do frequent garbage collections while the application is running in order to keep the GC pause time to a minimum. When using this strategy, it’s important to set the heap min and max values to the same value, in order to prevent the JVM from having to spend a lot of time growing the heap initially. It is possible to tune this to -XX:+UseParallelGC, which also takes advantage of multiprocessor machines; this will give you peak application performance, but with occasional pauses. Do not use the Serial GC with Cassandra.

The majority of options in the include configuration file surround the Java settings. For example, the default setting for the maximum size of the Java heap memory usage is 1GB. If you’re on a machine capable of using more, you may want to tune this setting. Try setting the -Xmx and -Xms options to the same value to keep Java from having to manage heap growth.

Note

The maximum theoretical heap size for a 32-bit JVM is 4GB. However, do not simply set your JVM to use as much memory as you have available up to 4GB. There are many factors involved here, such as the amount of swap space and memory fragmentation. Simply increasing the size of the heap using -Xmx will not help if you don’t have any swap available. Typically it is possible to get approximately 1.6GB of heap on a 32-bit JVM on Windows and closer to 2GB on Solaris. Using a 64-bit JVM on a 64-bit system will allow more space. See http://java.sun.com/docs/hotspot/HotSpotFAQ.html for more information.

Tuning some of these options will make stress tests perform better. For example, I saw a 15% performance improvement using the following settings over the defaults:

JVM_OPTS=" 
        -da 
        -Xms1024M 
        -Xmx1024M 
        -XX:+UseParallelGC 
        -XX:+CMSParallelRemarkEnabled 
        -XX:SurvivorRatio=4 
        -XX:MaxTenuringThreshold=0 

When performance tuning, it’s a good idea to set only the heap min and max options, and nothing else at first. Only after real-world usage in your environment and some performance benchmarking with the aid of heap analysis tools and observation of your specific application’s behavior should you dive into tuning the more advanced JVM settings. If you tune your JVM options and see some success using a load-testing tool or something like the Python stress test in contrib, don’t get too excited. You need to test under real-world conditions; don’t simply copy these settings.

Note

For more information on Java 6 performance tuning (Java 6 operates differently than previous versions), see http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html.

In general, you’ll probably want to make sure that you’ve instructed the heap to dump its state if it hits an out of memory error. This is just good practice if you’ve been getting out of memory errors. You can also instruct the heap to print garbage-collection details. Also, if you have a lot of data in Cassandra and you’re noticing that garbage collection is causing long pauses, you can attempt to cause garbage collection to run when the heap has filled up less memory than it otherwise would take as a threshold to initialize a garbage collection. All of these parameters are shown here:

       -XX:CMSInitiatingOccupancyFraction=88 
       -XX:+HeapDumpOnOutOfMemoryError 
       -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -verbose:gc 

Summary

In this chapter we looked at the settings available in Cassandra to aid in performance tuning, including caching settings, memory settings, and hardware concerns. We also set up and used the Python stress test tool to write and then read one million rows of data.

If you’re somewhat new to Linux systems and you want to run Cassandra on Linux (which is recommended), you may want to check out Jonathan Ellis’s blog entry on using a variety of Linux performance monitoring tools to help you understand the performance of your underlying platform so that you can troubleshoot in the right place. You can find that entry at http://spyced.blogspot.com/2010/01/linux-performance-basics.html.

Ultimately, you want to have lots of memory available, and as many processors and cores as you can afford.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
13.59.218.147