Chapter 12. Performance Tuning

In this chapter, we look at how to tune Cassandra to improve performance. There are a variety of settings in the configuration file and on individual tables. Although the default settings are appropriate for many use cases, there might be circumstances in which you need to change them. In this chapter, we look at how and why to make these changes.

We also see how to use the cassandra-stress test tool that ships with Cassandra to generate load against Cassandra and quickly see how it behaves under stress test circumstances. We can then tune Cassandra appropriately and feel confident that we’re ready to deploy to a production environment.

Managing Performance

To be effective at achieving and maintaining a high level of performance in your cluster, it’s helpful to think of managing performance as a process that begins with the architecture of your application and continues through development and operations.

Setting Performance Goals

Before beginning any performance tuning effort, it is important to have clear goals in mind, whether you are just getting started on deploying an application on Cassandra, or maintaining an existing application.

When planning a cluster, it’s important to understand how the cluster will be used: the number of clients, intended usage patterns, expected peak hours, and so on. This will be useful in planning the initial cluster capacity and for planning cluster growth, as we’ll discuss further in Chapter 14.

An important part of this planning effort is to identify clear performance goals in terms of both throughput (the number of queries serviced per unit time) and latency (the time to complete a given query).

For example, let’s say that we’re building an ecommerce website that uses the hotel data model we designed in Chapter 5. We might set a performance goal such as the following for shopping operations on our cluster:

The cluster must support 30,000 read operations per second from the available_rooms_by_hotel_date table with a 95th percentile read latency of 3 ms.

This is a statement that includes both throughput and latency goals. We’ll learn how to measure this using nodetool tablestats.

Regardless of your specific performance goals, it’s important to remember that performance tuning is all about trade-offs. Having well-defined performance goals will help you articulate what trade-offs are acceptable for your application. For example:

  • Enabling SSTable compression in order to conserve disk space, at the cost of additional CPU processing.
  • Throttling network usage and threads, which can be used to keep network and CPU utilization under control, at the cost of reduced throughput and increased latency.
  • Increasing or decreasing the number of threads allocated to specific tasks such as reads, writes, or compaction in order to affect the priority relative to other tasks or to support additional clients.
  • Increasing heap size in order to decrease query times.

These are just a few of the trade-offs that you will find yourself considering in performance tuning. We’ll highlight others throughout the rest of this chapter.

Monitoring Performance

As the size of your cluster grows, the number of clients increases, and more keyspaces and tables are added, the demands on your cluster will begin to pull in different directions. Taking frequent baselines to measure the performance of your cluster against its goals will become increasingly important.

We learned in Chapter 10 about the various metrics that are exposed via JMX, including performance-related metrics for Cassandra’s StorageProxy and individual tables.  In that chapter, we also examined nodetool commands that publish performance-related statistics such as nodetool tpstats and nodetool tablestats and discussed how these can help identify loading and latency issues.

Now we’ll look at two additional nodetool commands that present performance statistics formatted as histograms: proxyhistograms and tablehistograms. First, let’s examine the output of the nodetool proxyhistograms command:

$ nodetool proxyhistograms

proxy histograms
Percentile      Read Latency     Write Latency     Range Latency
                    (micros)          (micros)          (micros)
50%                   654.95              0.00           1629.72
75%                   943.13              0.00           5839.59
95%                  4055.27              0.00          62479.63
98%                 62479.63              0.00         107964.79
99%                107964.79              0.00         129557.75
Min                   263.21              0.00            545.79
Max                107964.79              0.00         155469.30

The output shows the latency of reads, writes, and range requests for which the requested node has served as the coordinator. The output is expressed in terms of percentile rank as well as minimum and maximum values in microseconds. Running this command on multiple nodes can help identify the presence of slow nodes in the cluster. A large range latency (in the hundreds of milliseconds or more) can be an indicator of clients using inefficient range queries, such as those requiring the ALLOW FILTERING clause or index lookups. 

While the view provided by proxyhistograms is useful for identifying general performance issues, we’ll frequently need to focus on performance of specific tables. This is what nodetool tablehistograms allows us to do. Let’s look at the output of this command against the available_rooms_by_hotel_date table:

$ nodetool tablehistograms hotel available_rooms_by_hotel_date

hotel/available_rooms_by_hotel_date histograms
Percentile  SSTables  Write Latency  Read Latency  Partition Size  Cell Count
                      (micros)       (micros)      (bytes)                  
50%         0.00      0.00           0.00          2759            179
75%         0.00      0.00           0.00          2759            179
95%         0.00      0.00           0.00          2759            179
98%         0.00      0.00           0.00          2759            179
99%         0.00      0.00           0.00          2759            179
Min         0.00      0.00           0.00          2300            150
Max         0.00      0.00           0.00          2759            179

The output of this command is similar, although it omits the range latency statistics and instead reports the number of SSTables read per query. The partition size and cell count statistics provide another way of identifying large partitions.

Resetting Metrics

Note that in Cassandra releases through the 3.X series, the metrics reported are lifetime metrics since the node was started. To reset the metrics on a node, you have to restart it. The JIRA issue CASSANDRA-8433 requests the ability to reset the metrics via JMX and nodetool.

Once you’ve gained familiarity with these metrics and what they tell you about your cluster, you should identify key metrics to monitor and even implement automated alerts that indicate your performance goals are not being met. You can accomplish this via DataStax OpsCenter or any JMX-based metrics framework.

Analyzing Performance Issues

It’s not unusual for the performance of a cluster that has been working well to begin to degrade over time. When you’ve detected a performance issue, you’ll want to begin analyzing it quickly to ensure the performance doesn’t continue to deteriorate. Your goal in these circumstances should be to determine the root cause and address it.

In this chapter, we’ll be looking at many configuration settings that can be used to tune the performance of each node in a cluster as a whole, across all keyspaces and tables. It’s also important to try to narrow performance issues down to specific tables and even queries.

In fact, the quality of the data model is usually the most influential factor in the performance of a Cassandra cluster. For example, a table design that results in partitions with a growing number of rows can begin to degrade the performance of the cluster and manifest in failed repairs, or streaming failures on addition of new nodes. Conversely, partitions with partition keys that are too restrictive can result in rows that are too narrow, requiring many partitions to be read to satisfy a simple query.

Beware the Large Partition

In addition to the nodetool tablehistograms discussed earlier, you can detect large partitions by searching logs for WARN messages that reference “Writing large partition” or “Compacting large partition”. The threshold for warning on compaction of large partitions is set by the compaction_large_partition_warning_threshold_mb property in the cassandra.yaml file.
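As a point of reference, the corresponding cassandra.yaml entry might look like the following sketch; 100 MB is the usual default, but verify it against your own file:

# Log a WARN message when compacting a partition larger than this size
compaction_large_partition_warning_threshold_mb: 100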

You’ll also want to learn to recognize instances of the anti-patterns discussed in Chapter 5 such as queues, or other design approaches that generate a large amount of tombstones.

Tracing

If you can narrow your search down to a specific table and query of concern, you can use tracing to gain detailed insight. Tracing is an invaluable tool for understanding the communications between clients and nodes involved in each query and the time spent in each step. This helps us see the performance implications of design decisions we make in our data models and choice of replication factors and consistency levels.

There are several ways to access trace data. We’ll start by looking at how tracing works in cqlsh. First we’ll enable tracing, and then execute a simple command:

cqlsh:hotel> TRACING ON
Now Tracing is enabled
cqlsh:hotel> SELECT * from hotels where id='AZ123';

 id    | address | name                            | phone          | pois
-------+---------+---------------------------------+----------------+------
 AZ123 |    null | Super Hotel Suites at WestWorld | 1-888-999-9999 | null

(1 rows)

Tracing session: 6669f210-de99-11e5-bdb9-59bbf54c4f73

 activity           | timestamp                  | source    | source_elapsed
--------------------+----------------------------+-----------+----------------
Execute CQL3 query  | 2016-02-28 21:03:33.503000 | 127.0.0.1 |              0
Parsing SELECT *... | 2016-02-28 21:03:33.505000 | 127.0.0.1 |          41491
...            

We’ve truncated the output quite a bit for brevity, but if you run a command like this, you’ll see activities such as preparing statements, read repair, key cache searches, data lookups in memtables and SSTables, interactions between nodes, and the time associated with each step in microseconds.

You’ll want to be on the lookout for queries that require a lot of inter-node interaction, as these may indicate a problem with your schema design. For example, a query based on a secondary index will likely involve interactions with most or all of the nodes in the cluster.

Once you are done using tracing in cqlsh, you can turn it off using the TRACING OFF command.

Tracing information is also available to clients using the DataStax drivers.  Let’s modify one of our examples from Chapter 8 to see how to interact with traces via the DataStax Java Driver. The highlighted code enables tracing on a query and prints out the trace results:

SimpleStatement hotelSelect = session.newSimpleStatement(
                "SELECT * FROM hotels WHERE id='AZ123'");
hotelSelect.enableTracing();

ResultSet hotelSelectResult = session.execute(hotelSelect);

QueryTrace queryTrace = hotelSelectResult.getExecutionInfo().getQueryTrace();

System.out.printf("Trace id: %s

", queryTrace.getTraceId());
System.out.printf("%-42s | %-12s | %-10s 
", "activity",
                  "timestamp", "source");

System.out.println("-------------------------------------------" 
 + "--------------+------------");                 

SimpleDateFormat dateFormat = new SimpleDateFormat("HH:mm:ss.SSS");

for (QueryTrace.Event event : queryTrace.getEvents()) {
  System.out.printf("%42s | %12s | %10s
",    
    event.getDescription(),
    dateFormat.format((event.getTimestamp())),
    event.getSource());
}

Tracing is individually enabled on each Statement or PreparedStatement using the enableTracing() operation. To obtain the trace of a query, we look at the query results. You may have noticed in our previous examples that the Session.execute() operation always returns a ResultSet object, even for queries other than SELECT queries. This enables us to obtain metadata about the request via an ExecutionInfo object, even when there are no results to be returned. This metadata includes the consistency level that was achieved, the coordinator node and other nodes involved in the query, and information about tracing and paging.

Executing this code produces output that is quite similar to what we saw in cqlsh:

Trace id: 58b22960-90cb-11e5-897a-a9fac1d00bce

activity                                   | timestamp    | source    
-------------------------------------------+--------------+------------
   Parsing SELECT * FROM hotels WHERE id=? | 20:44:34.550 | 127.0.0.1
                       Preparing statement | 20:44:34.550 | 127.0.0.1
                      Read-repair DC_LOCAL | 20:44:34.551 | 127.0.0.1
Executing single-partition query on hotels | 20:44:34.551 | 127.0.0.1
              Acquiring sstable references | 20:44:34.551 | 127.0.0.1
                 Merging memtable contents | 20:44:34.551 | 127.0.0.1
               Merging data from sstable 3 | 20:44:34.552 | 127.0.0.1
    Bloom filter allows skipping sstable 3 | 20:44:34.552 | 127.0.0.1
               Merging data from sstable 2 | 20:44:34.552 | 127.0.0.1
    Bloom filter allows skipping sstable 2 | 20:44:34.552 | 127.0.0.1
         Read 1 live and 0 tombstone cells | 20:44:34.552 | 127.0.0.1

Tracing is also supported in DataStax DevCenter, where it is enabled by default. You can immediately view the trace of any request you make in DevCenter by selecting the “Query Trace” tab in the lower middle of the screen, as shown in Figure 12-1. (The “Connection”, “CQL Scripts”, “Schema”, and “Outline” panels have been minimized to enable readability.)

Figure 12-1. Viewing query traces in DataStax DevCenter

You can configure individual nodes to trace some or all of their queries via the nodetool settraceprobability command.  This command  takes a number between 0.0 (the default) and 1.0, where 0.0 disables tracing and 1.0 traces every query. This does not affect tracing of individual queries as requested by clients. Exercise care in changing the trace probability, as a typical trace session involves 10 or more writes. Setting a trace level of 1.0 could easily overload a busy cluster, so a value such as 0.01 or 0.001 is typically appropriate.
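For example, to sample roughly one query in a thousand on a node and then confirm the setting, you might run:

$ nodetool settraceprobability 0.001
$ nodetool gettraceprobability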

Traces Aren’t Forever

Cassandra stores query trace results in the system_traces keyspace. Since the 2.2 release, Cassandra also uses tracing to store the results of repair operations. Cassandra limits the TTL on these tables to prevent them from filling up your disk over time. You can configure these TTL values by editing the tracetype_query_ttl and tracetype_repair_ttl properties in the cassandra.yaml file.
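The corresponding cassandra.yaml entries look something like the following sketch; the values shown are the commonly shipped defaults of one day for query traces and one week for repair traces, but verify them against your release:

# TTLs (in seconds) applied to the tables in the system_traces keyspace
tracetype_query_ttl: 86400
tracetype_repair_ttl: 604800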

Tuning Methodology

Once you’ve identified the root cause of performance issues related to one of your established goals, it’s time to begin tuning performance. The suggested methodology for tuning Cassandra performance is to change one configuration parameter at a time and test the results.

It is important to limit the amount of configuration changes you make when tuning so that you can clearly identify the impact of each change. You may need to repeat the change process multiple times until you’ve reached the desired performance goal.

In some cases, it may be that you can get the performance back in line simply by adding more resources such as memory or extra nodes, but make sure that you aren’t simply masking underlying design or configuration issues. In other cases, you may find that you can’t reach your desired goals through tuning alone, and that design changes are needed.

With this methodology in mind, let’s look at some of the various options that you can configure to tune your Cassandra clusters. We’ll highlight node configuration properties in the cassandra.yaml and cassandra-env.sh files as well as options that are configured on individual tables using CQL.

Caching

Caches are used to improve responsiveness to read operations. Additional memory is used to hold data, to minimize the number of disk reads that must be performed. As the cache size increases, so does the number of “hits” that can be served from memory.

There are three caches built into Cassandra: the key cache, row cache, and counter cache. The row cache caches a configurable number of rows per partition. If you are using a row cache for a given table, you will not need to use a key cache on it as well.

Your caching strategy should therefore be tuned in accordance with a few factors:

  • Consider your queries, and use the cache type that best fits your queries.
  • Consider the ratio of your heap size to your cache size, and do not allow the cache to overwhelm your heap.
  • Consider the size of your rows against the size of your keys. Typically keys will be much smaller than entire rows.

Let’s consider some specific tuning and configuration options for each cache.

Key Cache

Cassandra’s key cache stores a map of partition keys to row index entries, facilitating faster read access into SSTables stored on disk. We can configure usage of the key cache on a per-table basis. For example, let’s use cqlsh to examine the caching settings on our hotels table:

cqlsh:hotel> DESCRIBE TABLE hotels;

CREATE TABLE hotel.hotels (
    id text PRIMARY KEY,
    address frozen<address>,
    name text,
    phone text,
    pois set<text>
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
  ...

Because the key cache greatly increases reads without consuming a lot of additional memory, it is enabled by default; that is, key caching is enabled if you don’t specify a value when creating the table. We can disable key caching for this table with the ALTER TABLE command:

cqlsh:hotel> ALTER TABLE hotels 
   WITH caching = { 'keys' : 'NONE', 'rows_per_partition' : 'NONE' };

The legal options for the keys attribute are ALL or NONE.

The key_cache_size_in_mb setting indicates the maximum amount of memory that will be devoted to the key cache, which is shared across all tables. The default value is either 5% of the total JVM heap, or 100 MB, whichever is less.
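In cassandra.yaml this looks roughly like the following; leaving the value empty selects the default described above, while an explicit number (the 200 here is purely illustrative) sets a fixed cap:

# Empty means min(5% of heap, 100 MB); an explicit value overrides this
key_cache_size_in_mb: 200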

Row Cache

The row cache caches entire rows and can speed up read access for frequently accessed rows, at the cost of more memory usage.

You’ll want to use the row cache size setting carefully, as this can easily lead to more performance issues than it solves. In many cases, a row cache can yield impressive performance results for small data sets when all the rows are in memory, only to degrade on larger data sets when the data must be read from disk.

If your table gets far more reads than writes, then configuring an overly large row cache will needlessly consume considerable server resources. If your table has a lower ratio of reads to writes, but has rows with lots of data in them (hundreds of columns), then you’ll need to do some math before setting the row cache size. And unless you have certain rows that get hit a lot and others that get hit very little, you’re not going to see much of a boost here.

For these reasons, row caching tends to yield fewer benefits than key caching. You may want to explore file caching features supported by your operating system as an alternative to row caching.

As with key caching, we can configure usage of the row cache on a per-table basis. The rows_per_partition setting specifies the number of rows that will be cached per partition. By default, this value is set to NONE, meaning that no rows will be cached. Other available options are positive integers or ALL. The following CQL statement sets the row cache to 200 rows:

cqlsh:hotel> ALTER TABLE hotels
   WITH caching = { 'keys' : 'NONE', 'rows_per_partition' : '200' };

The implementation of the row cache is pluggable via the row_cache_class_name property. This defaults to the off-heap cache provider implemented by the org.apache.cassandra.cache.OHCProvider class. The previous implementation was the SerializingCacheProvider.

Counter Cache

The counter cache improves counter performance by reducing lock contention for the most frequently accessed counters. There is no per-table option for configuration of the counter cache.

The counter_cache_size_in_mb setting determines the maximum amount of memory that will be devoted to the counter cache, which is shared across all tables. The default value is either 2.5% of the total JVM heap, or 50 MB, whichever is less.

Saved Cache Settings

Cassandra provides the ability to periodically save caches to disk, so that they can be read on startup as a way to quickly warm the cache. The saved cache settings are similar across all three cache types:

  • Cache files are saved under the directory specified by the saved_caches_directory property. The files are written at the interval in seconds specified by the key_cache_save_period, row_cache_save_period, and counter_cache_save_period properties, which default to 14400 (4 hours), 0 (disabled), and 7200 (2 hours), respectively (see the sketch following this list).
  • Caches are indexed by key values. The number of keys to save in the file is indicated by the key_cache_keys_to_save, row_cache_keys_to_save, and counter_cache_keys_to_save properties.
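Here is a minimal cassandra.yaml sketch of the saved cache settings; the directory path assumes a packaged install, and the commented line is illustrative rather than a recommendation:

saved_caches_directory: /var/lib/cassandra/saved_caches
key_cache_save_period: 14400          # seconds (4 hours)
row_cache_save_period: 0              # disabled
counter_cache_save_period: 7200       # seconds (2 hours)
# key_cache_keys_to_save: 100         # uncomment to limit the keys saved; all keys by default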

Managing Caches via nodetool

Cassandra also provides capabilities for managing caching via nodetool:

  • You can clear caches using the invalidatekeycache, invalidaterowcache, and invalidatecountercache commands.
  • Use the setcachecapacity command to override the configured settings for key, row, and counter cache capacity, as shown in the example below.
  • Use the setcachekeystosave command to override the configured settings for how many key and row cache elements to save to a file.

Remember that these settings will revert to the values set in the cassandra.yaml file on a  node restart. 
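For example, you might temporarily set the key, row, and counter cache capacities (in MB, and in that order) on a node and then clear the key cache; the capacity values here are arbitrary:

$ nodetool setcachecapacity 200 0 50
$ nodetool invalidatekeycache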

Memtables

Cassandra uses memtables to speed up writes, holding a memtable corresponding to each table it stores.  Cassandra flushes memtables to disk as SSTables when either the commit log threshold or memtable threshold has been reached.

Cassandra stores memtables either on the Java heap, off-heap (native) memory, or both. The limits on heap and off-heap memory can be set via the properties memtable_heap_space_in_mb and memtable_offheap_space_in_mb, respectively. By default, Cassandra sets each of these values to 1/4 of the total heap size set in the cassandra-env.sh file. Allocating memory for memtables reduces the memory available for caching and other internal Cassandra structures, so tune carefully and in small increments.
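A cassandra.yaml sketch of these properties follows; the 2048 MB figures are purely illustrative and assume an 8 GB heap, following the one-quarter-of-heap default described above:

memtable_heap_space_in_mb: 2048
memtable_offheap_space_in_mb: 2048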

You can influence how Cassandra allocates and manages memory via the memtable_allocation_type property. This property configures another of Cassandra’s pluggable interfaces, selecting which implementation of the abstract class org.apache.cassandra.utils.memory.MemtablePool is used to control the memory used by each memtable. The default value heap_buffers causes Cassandra to allocate memtables on the heap using the Java New I/O (NIO) API, while offheap_buffers uses Java NIO to allocate a portion of each memtable both on and off the heap. The offheap_objects option uses native memory directly, making Cassandra entirely responsible for memory management and garbage collection of memtable memory. This is a less well-documented feature, so it’s best to stick with the default here until you gain more experience.

Another element related to tuning the memtables is memtable_flush_writers. This setting, which is 2 by default, indicates the number of threads used to write out the memtables when it becomes necessary. If your data directories are backed by SSD, you should increase this to the number of cores, without exceeding the maximum value of 8. If you have a very large heap, it can improve performance to set this count higher, as these threads are blocked during disk I/O.

You can also enable metered flushing on each table via the CQL CREATE TABLE or ALTER TABLE command. The memtable_flush_period_in_ms option sets the interval at which the memtable will be flushed to disk.

Setting this property results in more predictable write I/O, but will also result in more SSTables and more frequent compactions, possibly impacting read performance. The default value of 0 means that periodic flushing is disabled, and flushes will only occur based on the commit log threshold or memtable threshold being reached.
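For example, the following CQL sketch enables hourly flushing on the hotels table; the one-hour period is illustrative, and setting the option back to 0 restores the default behavior:

cqlsh:hotel> ALTER TABLE hotels
   WITH memtable_flush_period_in_ms = 3600000;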

Commit Logs

There are two sets of files that Cassandra writes to as part of handling update operations: the commit log and the SSTable files. Their different purposes need to be considered in order to understand how to treat them during configuration.

Remember that the commit log can be thought of as short-term storage that helps ensure that data is not lost if a node crashes or is shut down before memtables can be flushed to disk. That’s because when a node is restarted, the commit log gets replayed. In fact, that’s the only time the commit log is read; clients never read from it. But the normal write operation to the commit log blocks, so it would damage performance to require clients to wait for the write to finish.

You can set the value for how large the commit log is allowed to grow before it stops appending new writes to a file and creates a new one. This value is set with the commitlog_segment_size_in_mb property. By default, the value is 32 MB. This is similar to setting log rotation on a logging engine such as Log4J or Logback.

The total space allocated to the commit log is specified by the commitlog_total_space_in_mb property. Setting this to a larger value means that Cassandra will need to flush tables to disk less frequently.

The commit logs are periodically removed, following a successful flush of all their appended data to the dedicated SSTable files. For this reason, the commit logs will not grow to anywhere near the size of the SSTable files, so the disks don’t need to be as large; this is something to consider during hardware selection.

To increase the amount of writes that the commit log can hold, you’ll want to enable log compression via the commitlog_compression property. The supported compression algorithms are LZ4, Snappy, and Deflate. Using compression comes at the cost of additional CPU time to perform the compression.

Additional settings relate to the commit log sync operation, represented by the commitlog_sync element.  There are two possible settings for this: periodic and batch. The default is periodic, meaning that the server will make writes durable only at specified intervals. The interval is specified by the commitlog_sync_period_in_ms property, which defaults to 10000 (10 seconds).

In order to guarantee durability for your Cassandra cluster, you may want to examine this setting. When the server is set to make writes durable periodically, you can potentially lose the data that has not yet been synced to disk from the write-behind cache.

If your commit log is set to batch, it will block until the write is synced to disk (Cassandra will not acknowledge write operations until the commit log has been completely synced to disk). Changing this value will require taking some performance metrics, as there is a necessary trade-off here: forcing Cassandra to write more immediately constrains its freedom to manage its own resources. If you do set commitlog_sync to batch, you need to provide a suitable value for commitlog_sync_batch_window_in_ms, which specifies the number of milliseconds between each sync effort.
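Pulling these settings together, a cassandra.yaml sketch for the commit log might look like the following. The segment size and sync settings shown are the usual defaults described above; the total space and compression entries are illustrative assumptions to verify against your own file:

commitlog_segment_size_in_mb: 32
commitlog_total_space_in_mb: 8192
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
# Illustrative: compress commit log segments with LZ4
commitlog_compression:
  - class_name: LZ4Compressor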

SSTables

Unlike the commit log, Cassandra writes SSTable files to disk asynchronously. If you’re using hard disk drives, it’s recommended that you store the datafiles and the commit logs on separate disks for maximum performance. If you’re deploying on solid state drives (SSDs), it is fine to use the same disk.

Cassandra, like many databases, is particularly dependent on the speed of the hard disk and the speed of the CPUs. It’s more important to have several processing cores than one or two very fast ones, to take advantage of Cassandra’s highly concurrent construction. So make sure for QA and production environments to get the fastest disks you can—SSDs if you can afford them. If you’re using hard disks, make sure there are at least two so that you can store commit log files and the datafiles on separate disks and avoid competition for I/O time.

When reading SSTable files from disk, Cassandra uses a buffer cache (also known as a buffer pool), to help reduce database file I/O. This cache uses off-heap memory, but its size defaults to either 512 MB, or 1/4 of the Java heap, whichever is smaller. You can set the cache size using the file_cache_size_in_mb property in cassandra.yaml. You can also allow Cassandra to use the Java heap for buffers once the off-heap cache is full by setting buffer_pool_use_heap_if_exhausted to true.
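For example, a cassandra.yaml sketch that pins the buffer cache at its usual 512 MB default and allows overflow onto the heap, as just described, might look like this:

file_cache_size_in_mb: 512
buffer_pool_use_heap_if_exhausted: true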

As we discussed in Chapter 9, Cassandra maintains SSTable index summaries in memory in order to speed access into SSTable files. By default, Cassandra allocates 5% of the Java heap to these indexes, but you can override this via the index_summary_capacity_in_mb property in cassandra.yaml. In order to stay within the allocated limit, Cassandra will shrink indexes associated with tables that are read less frequently. Because read rates may change over time, Cassandra also resamples indexes from files stored on disk at the frequency specified by the index_​summary_​resize_interval_in_minutes property, which defaults to 60.

Cassandra also provides a mechanism to influence the relative amount of space allocated to indexes for different tables. This is accomplished via the min_index_interval and max_index_interval properties, which can be set per table via the CQL CREATE TABLE or ALTER TABLE commands. These values specify the minimum and maximum number of index entries to store per SSTable.
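As a sketch, the per-table form via CQL looks like the following; 128 and 2048 are commonly cited defaults, used here purely as illustrative values:

cqlsh:hotel> ALTER TABLE hotels
   WITH min_index_interval = 128 AND max_index_interval = 2048;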

Hinted Handoff

Hinted handoff is one of several mechanisms that Cassandra provides to keep data in sync across the cluster. As we’ve learned previously, a coordinator node can keep a copy of data on behalf of a node that is down for some amount of time. We can tune hinted handoff in terms of the amount of disk space we’re willing to use to store hints, and how quickly hints will be delivered when a node comes back online.

We can control the bandwidth utilization of hint delivery using the property hinted_handoff_throttle_in_kb, or at runtime via nodetool sethintedhandoffthrottlekb.

This throttling limit has a default value of 1024, or 1 MB/second, and is used to set an upper threshold on the bandwidth that would be required of a node receiving hints. For example, in a cluster of three nodes, each of the two nodes delivering hints to a third node would throttle its hint delivery to half of this value, or 512 KB/second.

Note that configuring this throttle is a bit different from configuring other Cassandra features, as the behavior that will be observed by a node receiving hints is entirely based on property settings on other nodes. You’ll definitely want to keep these settings in sync across the cluster to avoid confusion.

In releases prior to 3.0, Cassandra stored hints in a table that was not replicated to other nodes. Starting with the 3.0 release, Cassandra stores hints in a directory specified by the hints_directory property, which defaults to the data/hints directory under the Cassandra installation. You can cap the size of an individual hints file via the max_hints_file_size_in_mb property.

You can clear out any hints awaiting delivery to one or more nodes using the nodetool truncatehints command with a list of IP addresses or hostnames. Hints eventually expire after the value expressed by the max_hint_window_in_ms property.

It’s also possible to enable or disable hinted handoff entirely, as we learned in Chapter 10. While some would use this as a way to conserve disk and bandwidth, in general the hinted handoff mechanism does not consume a lot of resources in comparison to the extra layer of consistency it helps to provide, especially compared to the cost of repairing a node.
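Taken together, a cassandra.yaml sketch of the hint-related settings discussed in this section might look like the following; the window and file size values are commonly cited defaults, but treat them as assumptions to verify against your release:

hinted_handoff_enabled: true
max_hint_window_in_ms: 10800000        # 3 hours
hinted_handoff_throttle_in_kb: 1024
max_hints_file_size_in_mb: 128
# hints_directory defaults to data/hints under the installation if not set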

Compaction

Cassandra provides configuration options for compaction including the resources used by compaction on a node, and the compaction strategy to be used for each table.

Choosing the right compaction strategy for a table can certainly be a factor in improving performance. Let’s review the available strategies and discuss when they should be used.

SizeTieredCompactionStrategy

The SizeTieredCompactionStrategy (STCS) is the default compaction strategy, and it should be used in most cases. This strategy groups SSTables into tiers organized by size. When there are a sufficient number of SSTables in a tier (4 or more by default), a compaction is run to combine them into a larger SSTable. As the amount of data grows, more and more tiers are created. STCS performs especially well for write-intensive tables, but less so for read-intensive tables, as the data for a particular row may be spread across an average of 10 or so SSTables.

LeveledCompactionStrategy

The LeveledCompactionStrategy (LCS) creates SSTables of a fixed size (5 MB by default) and groups them into levels, with each level holding 10 times as many SSTables as the previous level. LCS guarantees that a given row appears in at most one SSTable per level. LCS spends additional effort on I/O to minimize the number of SSTables a row appears in; the average number of SSTables for a given row is 1.11. This strategy should be used if there is a high ratio of reads to writes or predictable latency is required. LCS will tend to not perform as well if a cluster is already I/O bound. If writes dominate reads, Cassandra may struggle to keep up.

DateTieredCompactionStrategy

The DateTieredCompactionStrategy (DTCS) was introduced in the 2.0.11 and 2.1.1 releases. It is intended to improve read performance for time series data, specifically for access patterns that involve accessing the most recently written data. It works by grouping SSTables in windows organized by the write time of the data. Compaction is only performed within these windows. Because DTCS is relatively new and targeted at a very specific use case, make sure you research carefully before making use of it.

Each strategy has its own specific parameters that can be tuned. Check the documentation for your release for further details.
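Each strategy is selected per table via the compaction map in CQL. For example, here is a sketch of switching the hotels table to LCS; any strategy-specific options would be added to the same map:

cqlsh:hotel> ALTER TABLE hotels
   WITH compaction = { 'class' : 'LeveledCompactionStrategy' };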

Another per-table setting is the compaction threshold. The compaction threshold refers to the number of SSTables that are in the queue to be compacted before a minor compaction is actually kicked off. By default, the minimum number is 4 and the maximum is 32. You don’t want this number to be too small, or Cassandra will spend time fighting with clients for resources to perform many frequent, unnecessary compactions. You also don’t want this number to be too large, or Cassandra could spend a lot of resources performing many compactions at once, and therefore will have fewer resources available for clients.

The compaction threshold is set per table using the CQL CREATE TABLE or ALTER TABLE commands.  However, you can inspect or override this setting for a particular node using the nodetool getcompactionthreshold or setcompactionthreshold commands:

$ nodetool getcompactionthreshold hotel hotels
Current compaction thresholds for hotel/hotels:
 min = 4,  max = 32
$ nodetool setcompactionthreshold hotel hotels 8 32

Compaction can be intensive in terms of I/O and CPU, so Cassandra provides the ability to monitor the compaction process and influence when compaction occurs.

You can monitor the status of compaction on a node using the nodetool compactionstats command, which lists the completed and total bytes for each active compaction (we’ve omitted the ID column for brevity):

$ nodetool compactionstats

pending tasks: 1
   id  compaction type  keyspace  table   completed  total      unit   progress
  ...  Compaction       hotel     hotels  57957241   127536780  bytes  45.44%
Active compaction remaining time :   0h00m12s

If you see that the pending compactions are starting to stack up, you can use the nodetool commands getcompactionthroughput and setcompactionthroughput to check and set the throttle that Cassandra applies to compactions across the cluster. This corresponds to the property compaction_throughput_mb_per_sec in the cassandra.yaml file. Setting this value to 0 disables throttling entirely, but the default value of 16 MB/s is sufficient for most non-write-intensive cases.
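For example, you might check the current throttle on a node and raise it to an illustrative 32 MB/s:

$ nodetool getcompactionthroughput
$ nodetool setcompactionthroughput 32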

If this does not fix the issue, you can increase the number of threads dedicated to compaction by setting the concurrent_compactors property in cassandra.yaml file, or at runtime via the CompactionManagerMBean. This property defaults to the minimum of the number of disks and number of cores, with a minimum of 2 and a maximum of 8.

Although it is not very common, a large compaction could negatively impact the performance of the cluster. You can use the nodetool stop command to stop all active compactions on a node. You can also identify a specific compaction to stop by ID, where the ID is obtained from the compactionstats output. Cassandra will reschedule any stopped compactions to run later.

You can force a major compaction via the nodetool compact command. Before kicking off a major compaction manually, remember that this is an expensive operation. The behavior of nodetool compact during compaction varies depending on the compaction strategy in use. We’ll discuss the implications of each compaction strategy on disk usage in Chapter 14.

The nodetool compactionhistory command prints statistics about completed compactions, including the size of data before and after compaction and the number of rows merged. The output is pretty verbose, so we’ve omitted it here.

Concurrency and Threading

Cassandra differs from many data stores in that it offers much faster write performance than read performance. There are two settings related to how many threads  can perform read and write operations: concurrent_reads and concurrent_writes. In general, the defaults provided by Cassandra out of the box are very good.

The concurrent_reads setting determines how many simultaneous read requests the node can service. This defaults to 32, but should be set to the number of drives used for data storage × 16. This is because when your data sets are larger than available memory, the read operation is I/O bound.

The concurrent_writes setting behaves somewhat differently. This should correlate to the number of clients that will write concurrently to the server. If Cassandra is backing a web application server, you can tune this setting from its default of 32 to match the number of threads the application server has available to connect to Cassandra. It is common in Java application servers to prefer database connection pools no larger than 20 or 30, but if you’re using several application servers in a cluster, you’ll need to factor that in as well.

Two additional settings—concurrent_counter_writes and concurrent_materialized_view_writes—are available for tuning special forms of writes. Because counter and materialized view writes both involve a read before write, it is best to set this to the lower of concurrent_reads and concurrent_writes.
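A cassandra.yaml sketch of these four settings using the defaults discussed above; per the guidance, the counter and materialized view values simply mirror the lower of the first two:

concurrent_reads: 32
concurrent_writes: 32
concurrent_counter_writes: 32
concurrent_materialized_view_writes: 32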

There are several other properties in the cassandra.yaml file which control the number of threads allocated to the thread pools which implement various stages in Cassandra’s SEDA approach. We’ve looked at some of these already, but here is a summary:

max_hints_delivery_threads
Maximum number of threads devoted to hint delivery
memtable_flush_writers
Number of threads devoted to flushing memtables to disk
concurrent_compactors
Number of threads devoted to running compaction
native_transport_max_threads
Maximum number of threads devoted to processing incoming CQL requests (you may also notice the similar properties rpc_min_threads and rpc_max_threads, which are for the deprecated Thrift interface)

Note that some of these properties allow Cassandra to dynamically allocate and deallocate threads up to a maximum value, while others specify a static number of threads. Tuning these properties up or down affects how Cassandra uses its CPU time and how much I/O capacity it can devote to various activities.

Networking and Timeouts

As Cassandra is a distributed system, it provides mechanisms for dealing with network and node issues including retries, timeouts, and throttling. We’ve already discussed a couple of the ways Cassandra implements retry logic, such as the RetryPolicy in the DataStax client drivers, and speculative read execution in drivers and nodes.

Now let’s take a look at the timeout mechanisms that Cassandra provides to help it avoid hanging indefinitely waiting for other nodes to respond. The timeout properties listed in Table 12-1 are set in the cassandra.yaml file.

Table 12-1. Cassandra node timeouts
Property Name Default Value Description
read_request_timeout_in_ms 5000 (5 seconds) How long the coordinator waits for read operations to complete
range_request_timeout_in_ms 10000 (10 seconds) How long the coordinator should wait for range reads to complete
write_request_timeout_in_ms 2000 (2 seconds) How long the coordinator should wait for writes to complete
counter_write_request_timeout_in_ms 5000 (5 seconds) How long the coordinator should wait for counter writes to complete
cas_contention_timeout_in_ms 1000 (1 second) How long a coordinator should continue to retry a lightweight transaction
truncate_request_timeout_in_ms 60000 (1 minute) How long the coordinator should wait for truncates to complete (including snapshot)
streaming_socket_timeout_in_ms 3600000 (1 hour) How long a node waits for streaming to complete
request_timeout_in_ms 10000 (10 seconds) The default timeout for other, miscellaneous operations


The values for these timeouts are generally acceptable, but you may need to adjust them for your network environment.

Another property related to timeouts is cross_node_timeout, which defaults to false. If you have NTP configured in your environment, consider enabling this so that nodes can more accurately estimate when the coordinator has timed out on long-running requests and release resources more quickly.

Cassandra also provides several properties that allow you to throttle the amount of network bandwidth it will use for various operations. Tuning these allows you to prevent Cassandra from swamping your network, at the cost of longer time to complete these tasks. For example, the stream_throughput_outbound_megabits_per_sec and inter_dc_stream_throughput_outbound_megabits_per_sec properties specify a per-thread throttle on streaming file transfers to other nodes in the local and remote data centers, respectively.

The throttles for hinted handoff and batchlog replay work slightly differently. The values specified by hinted_handoff_throttle_in_kb and batchlog_replay_throttle_in_kb are considered maximum throughput values for the cluster and are therefore spread proportionally across nodes according to the formula:

    Tx = Tt / (Nn - 1)

That is, the throughput of a node x (Tx) is equal to the total throughput (Tt) divided by one less than the number of nodes in the cluster (Nn).

Finally, there are several properties that you can use to limit traffic to the native CQL port on each node. These may be useful in situations where you don’t have direct control over the client applications of your cluster. The default maximum frame size specified by the native_transport_max_frame_size_in_mb property is 256. Frame requests larger than this will be rejected by the node.

The node can also limit the maximum number of simultaneous client connections, via the native_transport_max_concurrent_connections property, but the default is -1 (unlimited). If you configure this value, you’ll want to make sure it makes sense in light of the concurrent_reads and concurrent_writes properties.

To limit the number of simultaneous connections from a single source IP address, configure the native_transport_max_concurrent_connections_per_ip property, which defaults to -1 (unlimited).
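As a sketch, limiting client traffic in cassandra.yaml might look like the following; the frame size is the default noted above, while the connection limits are arbitrary illustrative values rather than recommendations:

native_transport_max_frame_size_in_mb: 256
native_transport_max_concurrent_connections: 1000
native_transport_max_concurrent_connections_per_ip: 50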

JVM Settings

Cassandra allows you to configure a variety of options for how the server JVM should start up, how much Java memory should be allocated, and so forth. In this section, we look at how to tune the startup.

If you’re using Windows, the startup script is called cassandra.bat, and on Linux it’s cassandra.sh. You can start the server by simply executing this file, which sets several defaults. But there’s another file in the conf directory that allows you to configure a variety of settings related to how Cassandra starts. This file is called cassandra-env.sh (cassandra-env.ps1 on Windows) and it separates certain options, such as the JVM settings, into a different file to make it easier to update.

Try tuning the options passed to the JVM at startup for better performance. The key JVM options included in cassandra-env.sh and guidelines for how you might tweak them are discussed momentarily. If you want to change any of these values, simply open the cassandra-env.sh file in a text editor, change the values, and restart.

The jvm.options File

The Cassandra 3.0 release introduced another settings file in the conf directory called jvm.options. The purpose of this file is to move the JVM settings related to heap size and garbage collection, the attributes that are tuned most frequently, out of the cassandra-env.sh file into a separate file. The jvm.options file is read by cassandra-env.sh at startup.

Memory

By default, Cassandra uses the following algorithm to set the JVM heap size: if the machine has less than 1 GB of RAM, the heap is set to 50% of RAM. If the machine has more than 4 GB of RAM, the heap is set to 25% of RAM, with a cap of 8 GB. To tune the minimum and maximum heap size, use the -Xms and -Xmx flags. These should be set to the same value to allow the entire heap to be locked in memory and not swapped out by the OS. It is not recommended to set the heap larger than 8 GB if you are using the Concurrent Mark Sweep (CMS) garbage collector, as heap sizes larger than this value tend to lead to longer GC pauses.
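For example, fixing both values to an illustrative 8 GB in the jvm.options file looks like the following sketch (in older releases the equivalent is setting MAX_HEAP_SIZE in cassandra-env.sh):

-Xms8G
-Xmx8G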

When performance tuning, it’s a good idea to set only the heap min and max options, and nothing else at first. Only after real-world usage in your environment and some performance benchmarking with the aid of heap analysis tools and observation of your specific application’s behavior should you dive into tuning the more advanced JVM settings. If you tune your JVM options and see some success, don’t get too excited. You need to test under real-world conditions.

In general, you’ll probably want to make sure that you’ve instructed the heap to dump its state if it hits an out-of-memory error, which is the default in cassandra-env.sh, set by the -XX:+HeapDumpOnOutOfMemoryError option. This is just good practice if you start experiencing out-of-memory errors.

Garbage Collection

The Java heap is broadly divided into two object spaces: young and old. The young space is subdivided into one for new object allocation (called “eden space”) and another for new objects that are still in use (the “survivor space”).

Cassandra uses the parallel copying collector in the young generation; this is set via the -XX:+UseParNewGC option. The survivor ratio is the ratio of eden space to survivor space within the young generation; objects that are still referenced survive one or more young generation collections and move through the survivor spaces. Increasing the ratio makes sense for applications with lots of new object creation and low object preservation; decreasing it makes sense for applications with longer-living objects. Cassandra sets this value to 8 via the -XX:SurvivorRatio option, meaning that the ratio of eden to survivor space is 8:1 (each survivor space will be 1/8 the size of eden). This is fairly low, because the objects are living longer in the memtables.

Every Java object has an age field in its header, indicating how many times it has been copied within the young generation space. Objects are copied into a new space when they survive a young generation garbage collection, and this copying has a cost. Because long-living objects may be copied many times, tuning this value can improve performance. By default, Cassandra has this value set at 1 via the -XX:MaxTenuringThreshold option. Set it to 0 to immediately move an object that survives a young generation collection to the old generation. Tune the survivor ratio together along with the tenuring threshold.

By default, modern Cassandra releases use the Concurrent Mark Sweep (CMS) garbage collector for the old generation; this is set via the -XX:+UseConcMarkSweepGC option. This setting uses more RAM and CPU power to do frequent garbage collections while the application is running, in order to keep the GC pause time to a minimum. When using this strategy, it’s important to set the heap min and max values to the same value, in order to prevent the JVM from having to spend a lot of time growing the heap initially.
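For reference, the young generation and CMS options named in this section appear in cassandra-env.sh or jvm.options along these lines; this is a sketch of just those flags, not a complete GC configuration:

-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=1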

To get more insight on garbage collection activity in your Cassandra nodes, there are a couple of options. The gc_warn_threshold_in_ms setting in the cassandra.yaml file determines the pause length that will cause Cassandra  to generate log messages at the WARN logging level. This value defaults to 1000 ms (1 second). You can also instruct the JVM to print garbage-collection details by setting options in the cassandra-env.sh or jvm.options file.

Using cassandra-stress

Cassandra ships with a utility called cassandra-stress that you can use to run a stress test on your Cassandra cluster. To run cassandra-stress, navigate to the <cassandra-home>/tools/bin directory and run the command:

$ cassandra-stress write n=1000000
Connected to cluster: test-cluster
Datatacenter: datacenter1; Host: localhost/127.0.0.1; Rack: rack1
Datatacenter: datacenter1; Host: /127.0.0.2; Rack: rack1
Datatacenter: datacenter1; Host: /127.0.0.3; Rack: rack1
Created keyspaces. Sleeping 1s for propagation.
Sleeping 2s...
Warming up WRITE with 50000 iterations...
Running WRITE with 200 threads for 1000000 iteration
...

The output lists the nodes to which the tool is connected (in this case, a cluster created using ccm) and creates a sample keyspace and table to which it can write data. The test is warmed up by doing 50,000 writes, and then the tool begins to output metrics as it continues to write, which we’ve omitted for brevity. The tool creates a pool of threads (defaulting on my system to 200) that perform one write after another, until it inserts one million rows. Finally, it prints a summary of results:

Results:
op rate                   : 7620 [WRITE:7620]
partition rate            : 7620 [WRITE:7620]
row rate                  : 7620 [WRITE:7620]
latency mean              : 26.2 [WRITE:26.2]
latency median            : 2.6 [WRITE:2.6]
latency 95th percentile   : 138.4 [WRITE:138.4]
latency 99th percentile   : 278.8 [WRITE:278.8]
latency 99.9th percentile : 393.3 [WRITE:393.3]
latency max               : 820.9 [WRITE:820.9]
Total partitions          : 1000000 [WRITE:1000000]
Total errors              : 0 [WRITE:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:02:11

Let’s unpack this a bit. What we’ve done is generated and inserted one million values into a completely untuned three-node cluster in a little over two minutes, which represents a rate of 7,620 writes per second. The median latency per operation is 2.6 milliseconds, although a small number of writes took longer.

Now that we have all of this data in the database, let’s use the test to read some values too:

$ cassandra-stress read n=200000 
...
Running with 4 threadCount
Running READ with 4 threads for 200000 iteration

If you examine the output, you will see that it starts out with a small number of threads and ramps up. On one test run, it peaked at over 600 threads, as shown here:

Results:
op rate                   : 13828 [READ:13828]
partition rate            : 13828 [READ:13828]
row rate                  : 13828 [READ:13828]
latency mean              : 67.1 [READ:67.1]
latency median            : 9.9 [READ:9.9]
latency 95th percentile   : 333.2 [READ:333.2]
latency 99th percentile   : 471.1 [READ:471.1]
latency 99.9th percentile : 627.9 [READ:627.9]
latency max               : 1060.5 [READ:1060.5]
Total partitions          : 200000 [READ:200000]
Total errors              : 0 [READ:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:00:14
Improvement over 609 threadCount: 7%

The tool periodically prints out statistics about the last several writes. The preceding output came from the last block of statistics. As you can see, Cassandra doesn’t read nearly as fast as it writes; the median read latency was around 10 ms. Remember, though, that this is out of the box, untuned, single-threaded, on a regular workstation running other programs, and the database is approximately 2 GB in size. Regardless, this is a great tool to help you do performance tuning for your environment and to get a set of numbers that indicates what to expect in your cluster.

We can also run cassandra-stress on our own tables by creating a specification in a yaml file. For example, we could create a cqlstress-hotel.yaml file to describe queries that read from tables in our hotel keyspace. This file defines queries that we would use to stress the available_rooms_by_hotel_date table:

keyspace: hotel
table: available_rooms_by_hotel_date

columnspec:
  - name: date
    cluster: uniform(20..40)

insert:
  partitions: uniform(1..50)
  batchtype: LOGGED
  select: uniform(1..10)/10

queries:
   simple1:
      cql: select * from available_rooms_by_hotel_date 
        where hotel_id = ? and date = ?
      fields: samerow             
   range1:
      cql: select * from available_rooms_by_hotel_date 
        where hotel_id = ? and date >= ? and date <= ?
      fields: multirow   

We can then execute these queries in a run of cassandra-stress. For example, we might run a mixed load of writes, single item queries, and range queries, as follows:

$ cassandra-stress user profile=~/cqlstress-hotel.yaml 
  ops(simple1=2,range1=1,insert=1) no-warmup

The numbers associated with each query indicate the desired ratio of usage. This command performs three reads for every write.

Additional Help on cassandra-stress

There are a number of options. You can execute cassandra-stress help to get the list of supported commands and options, and cassandra-stress help <command> to get  more information on a specific command.

You may also want to investigate cstar_perf, an open source performance testing platform for Cassandra provided by DataStax. This tool supports automation of stress testing, including the ability to spin up clusters and run performance tests across multiple Cassandra versions, or separate clusters with different configuration settings for the purpose of comparison. It provides a web-based user interface for creating and running test scripts and viewing test results. You can download cstar_perf and read the  documentation at http://datastax.github.io/cstar_perf/setup_cstar_perf_tool.html

Summary

In this chapter, we looked at managing Cassandra performance and the settings available in Cassandra to aid in performance tuning, including caching settings, memory settings, and hardware concerns. We also set up and used the cassandra-stress test tool to write and then read one million rows of data.

If you’re somewhat new to Linux systems and you want to run Cassandra on Linux (which is recommended), you may want to check out Al Tobey’s blog post on tuning Cassandra. This walks through several Linux performance monitoring tools to help you understand the performance of your underlying platform so that you can troubleshoot in the right place. Although the post references Cassandra 2.1, the advice it contains is broadly applicable.
