Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 13. Performance Tuning

In this chapter, you’ll learn how and why to tune Cassandra to improve performance, and a methodology for setting performance goals, monitoring your cluster’s performance, simulating stress loads, and troubleshooting performance goals. You’ll also learn about specific settings in Cassandra’s configuration files and options on individual tables and how they affect the performance and resource utilization of your cluster.

Managing Performance

To be effective at achieving and maintaining a high level of performance in your cluster, it’s helpful to think of managing performance as a process that begins with the architecture of your application and continues through development and operations.

Setting Performance Goals

Before beginning any performance tuning effort, it is important to have clear goals in mind, whether you are just getting started on deploying an application, or maintaining an existing one.

When planning a cluster, it’s important to understand how the cluster will be used: the number of clients, intended usage patterns, expected peak periods, and so on. This is useful in planning the initial cluster capacity and for planning cluster growth, as discussed in Chapter 10.

An important part of this planning effort is to identify clear performance goals in terms of both throughput (the number of queries serviced per unit time) and latency (the time to complete a given query). Usually, you will be trying to increase throughput while reducing latency. A good place to start is with the use cases that you anticipate will put the greatest load on your cluster.

For example, let’s say that you’re building an ecommerce website for hotel reservations that uses the data model designed in Chapter 5. If you’re following a process of setting performance goals for various operations on the website, you might anticipate that most of the traffic will come from customers browsing the site, shopping for available hotel rooms. As a result, you set a goal for site to respond to each search for available rooms in under a second. Through the process of allocating that performance budget to various services and layers, you might then arrive at the following goal for shopping queries on your Cassandra cluster:

The cluster must support 30,000 read operations per second from the available_rooms_by_hotel_date table with a 99th percentile read latency of 5 ms.

This is a statement that includes both throughput and latency goals. Throughput goals are typically expressed in terms of number of operations per second that can be supported by the cluster as a whole. Latency goals are expressed in terms of histograms: for example, the goal above that 99 percent of all queries complete in under 5 seconds.

In this chapter, you’ll learn how to measure performance goals similar to the one above using nodetool tablestats. It’s useful to have similar goals for each of the access patterns and resulting Cassandra queries that your application must support. You can use the techniques identified in Chapter 11 to track performance against these goals and identify when performance is trending in the wrong direction.

The USE method

Brendan Gregg has created a methodology known for analyzing system performance based on the utilization, saturation, and errors (USE) of each system resource, including CPU, memory, and disk and network I/O. By tracking performance metrics and thresholds across multiple resources, you can have a better awareness of the state of your system holistically, identifying and addressing root causes instead of making naive attempts to optimize individual resources in isolation.

For example, tracking CPU and disk I/O in parallel, you might identify a period of increased CPU activity corresponding to saturation of disk I/O corresponding to increased query latency. Further investigation of this might point to a high compaction workload, especially if you can correlate the high activity period with warning messages in Cassandra’s logs. This might lead you to identifying issues with your data model or application, such as usage that results in a very wide partition or a large number of tombstones. You can read more about the USE method at Brendan Gregg’s website.

Regardless of your specific performance goals, it’s important to remember that performance tuning is all about trade-offs. Having well-defined performance goals will help you articulate what trade-offs are acceptable for your application. For example:

Enabling SSTable compression in order to conserve disk space, at the cost of additional CPU processing.
Throttling network usage and threads, which can be used to keep network and CPU utilization under control, at the cost of reduced throughput and increased latency.
Increasing or decreasing number of threads allocated to specific tasks such as reads, writes, or compaction in order to affect the priority relative to other tasks or to support additional clients.
Increasing heap size in order to decrease query times.
Reducing the read or write consistency level required by your application in order to increase throughput.

These are just a few of the trade-offs that you will find yourself considering in performance tuning. We’ll highlight others throughout the rest of this chapter.

Benchmarking and Stress Testing

Once you have set performance targets, it’s helpful to create some load on your cluster to get an idea of the performance you can expect in your production deployment. There are two main approaches to consider: benchmarking and stress testing.

Benchmarking is the process of measuring the performance of software under a defined load. The purpose of this might be to compare one database against another, or to compare different configurations of the same system. There are standard database benchmarks available, such as the Yahoo! Cloud Serving Benchmark (YCSB), which has proven popular for benchmarking distributed NoSQL databases. However, it can be difficult to get an apples-to-apples comparison between different databases without a significant amount of tuning.

Stress testing is similar to benchmarking in that you are generating a load on the system. However in this case the goal is slightly different. Instead of measuring performance for comparison against a baseline, in a stress test you increase the load on the system in order to discover errors and performance degradations such as bottlenecks.

Our recommendation is to focus on creating benchmarks and stress tests that exercise your Cassandra data models on test clusters resembling your desired production topology and configuration options, with loads resembling your expected nominal and peak operating conditions. Its especially important to have tests available that can verify there is no performance degradation when you make significant changes to your data models, cluster configuration and topology, or application.

Using cassandra-stress

The first tool to examine is one that ships with Cassandra. You can use cassandra-stress to run a stress test on your Cassandra cluster. To run cassandra-stress, navigate to the <cassandra-home>/tools/bin directory and run the command:

$ cassandra-stress write n=1000000

First the stress tool will print out a detailed description of your chosen configuration including how the tool will connect to the cluster, the mix of CQL commands that will be run, and the schema to be used for testing, including how random values will be generated. Then you’ll see a few lines indicating the test is starting:

Connected to cluster: reservation_cluster, max pending requests per connection 128, max connections per host 8
Datacenter: datacenter1; Host: localhost/127.0.0.1:9042; Rack: rack1
Datacenter: datacenter1; Host: /127.0.0.2:9042; Rack: rack1
Datacenter: datacenter1; Host: /127.0.0.3:9042; Rack: rack1
Created keyspaces. Sleeping 1s for propagation.
Sleeping 2s...
Warming up WRITE with 50000 iterations...
Running WRITE with 200 threads for 1000000 iteration
...

If you’re using the ccm tool introduced in Chapter 10 to run a local cluster, you could run ccm node1 stress write n=1000000.

The output lists the nodes to which the tool is connected (in this case, a cluster created using ccm) and creates a sample keyspace and table to which it can write data. The test warms up by doing 50,000 writes, and then the tool begins to output metrics as it continues to write, which we’ve omitted for brevity. The tool creates a pool of threads (defaulting to 200) that perform one write after another, until it inserts one million rows. Finally, it prints a summary of results:

Results:
Op rate                   :   22,187 op/s  [WRITE: 22,187 op/s]
Partition rate            :   22,187 pk/s  [WRITE: 22,187 pk/s]
Row rate                  :   22,187 row/s [WRITE: 22,187 row/s]
Latency mean              :    9.0 ms [WRITE: 9.0 ms]
Latency median            :    1.0 ms [WRITE: 1.0 ms]
Latency 95th percentile   :   50.9 ms [WRITE: 50.9 ms]
Latency 99th percentile   :  131.9 ms [WRITE: 131.9 ms]
Latency 99.9th percentile :  267.9 ms [WRITE: 267.9 ms]
Latency max               :  628.6 ms [WRITE: 628.6 ms]
Total partitions          :  1,000,000 [WRITE: 1,000,000]
Total errors              :          0 [WRITE: 0]
Total GC count            : 0
Total GC memory           : 0.000 KiB
Total GC time             :    0.0 seconds
Avg GC time               :    NaN ms
StdDev GC time            :    0.0 ms
Total operation time      : 00:00:45

Let’s unpack these results. They summarize the insertion of one million values into a completely untuned three-node cluster running on a single machine using ccm. The insertions completed in about 45 seconds, which represents a rate over 20,000 writes per second. The median latency per write operation was 1 millisecond, although a small number of writes took longer. Your results will of course vary depending on the configuration of your cluster including topology and choice of hardware.

Now that you have all of this data in the database, use the test to read some values too:

$ cassandra-stress read n=200000
...
Warming up READ with 50000 iterations...
Thread count was not specified

Running with 4 threadCount
Running READ with 4 threads for 200000 iteration
...

If you examine the output, you will see that it first performs a run using a small number of threads (4) and ramps up the number of threads used on each subsequent run, printing the results of each run and comparing the results with the previous run. Here’s an example of a run that used 121 client threads:

Running with 121 threadCount
Running READ with 121 threads for 200000 iteration
...

Results:
Op rate                   :   23,493 op/s  [READ: 23,493 op/s]
Partition rate            :   23,493 pk/s  [READ: 23,493 pk/s]
Row rate                  :   23,493 row/s [READ: 23,493 row/s]
Latency mean              :    5.1 ms [READ: 5.1 ms]
Latency median            :    0.9 ms [READ: 0.9 ms]
Latency 95th percentile   :   22.0 ms [READ: 22.0 ms]
Latency 99th percentile   :   88.8 ms [READ: 88.8 ms]
Latency 99.9th percentile :  146.3 ms [READ: 146.3 ms]
Latency max               :  305.1 ms [READ: 305.1 ms]
Total partitions          :    200,000 [READ: 200,000]
Total errors              :          0 [READ: 0]
Total GC count            : 0
Total GC memory           : 0.000 KiB
Total GC time             :    0.0 seconds
Avg GC time               :    NaN ms
StdDev GC time            :    0.0 ms
Total operation time      : 00:00:08

Improvement over 81 threadCount: 6%

The tool periodically prints out statistics about the last several writes. As you can see, Cassandra doesn’t read quite as fast as it writes; the mean read latency was around 5 ms. Remember, though, that these results were generated with default configuration options on a regular workstation running other programs. Regardless, this is a great tool to help you do performance tuning for your environment and to get a set of numbers that indicates what to expect in your cluster.

You can also run cassandra-stress on your own tables by creating a specification in a yaml file. For example, you could create a cqlstress-hotel.yaml file to describe read and write queries on tables in the hotel keyspace. This file defines queries that you could use to stress the available_rooms_by_hotel_date table:

keyspace: hotel
table: available_rooms_by_hotel_date

columnspec:
  - name: date
    cluster: uniform(20..40)

insert:
  partitions: uniform(1..50)
  batchtype: LOGGED
  select: uniform(1..10)/10

queries:
   simple1:
      cql: select * from available_rooms_by_hotel_date
        where hotel_id = ? and date = ?
      fields: samerow
   range1:
      cql: select * from available_rooms_by_hotel_date
        where hotel_id = ? and date >= ? and date <= ?
      fields: multirow

You can then execute these queries in a run of cassandra-stress. For example, you might run a mixed load of writes, single item queries, and range queries, as follows:

$ cassandra-stress user profile=~/cqlstress-hotel.yaml
  ops(simple1=2,range1=1,insert=1) no-warmup

The numbers associated with each query indicate the desired ratio of usage. This command performs three reads for every write.

Additional Help on cassandra-stress

You can execute cassandra-stress help to get the list of supported commands and options, and cassandra-stress help <command> to get more information on a specific command.

Additional Load Testing Tools

There are a few additional tools for load and stress testing that you may find useful:

TLP Stress: Jon Haddad and others at The Last Pickle have released TLP Stress, which is available on GitHub. TLP-stress is a command-line tool similar to cassandra-stress, but with simpler syntax. It comes with built in workloads for common data model patterns such as time-series and key-value data models. These workloads offer parameters that allow you to tailor the behavior such as the number of requests and the read/write mix. It also includes workloads that demonstrate the performance impact of features that can make Cassandra work harder, such as materialized views, lightweight transactions, and queries using ALLOW FILTERING. Planned improvements include the ability to add your own custom workloads.
Cstar_perf: cstar_perf is an open source performance testing platform provided by DataStax for testing Cassandra and DataStax Enterprise. This tool supports automation of stress testing, including the ability to spin up clusters and run performance tests across multiple Cassandra versions, or to spin up separate clusters with different configuration settings for the purpose of comparison. It provides a web-based user interface for creating and running test scripts and viewing test results. You can download cstar_perf and read the documentation at http://datastax.github.io/cstar_perf/.
Apache JMeter: JMeter is an open source Java framework designed for functional and load testing. Originally designed for testing web applications, it has been extended to allow testing of applications via a variety of applications and protocols and works with many continuous integration frameworks. You can see an example of stress testing a Cassandra cluster using Groovy scripts and the DataStax Java Driver in Alain Rastoul’s blog post.

Monitoring Performance

As the size of your cluster grows, the number of clients increases, and more keyspaces and tables are added, the demands on your cluster will begin to pull in different directions. Taking frequent baselines to measure the performance of your cluster against its goals will become increasingly important.

You learned in Chapter 11 about the various metrics that are exposed via JMX, including performance-related metrics for Cassandra’s StorageProxy and individual tables. In that chapter, you also examined nodetool commands that publish performance-related statistics such as nodetool tpstats and nodetool tablestats and discussed how these can help identify loading and latency issues.

Now let’s look at two additional nodetool commands that present performance statistics formatted as histograms: proxyhistograms and tablehistograms. First, examine the output of the nodetool proxyhistograms command:

$ nodetool proxyhistograms

proxy histograms
Percentile      Read Latency     Write Latency      Range Latency  ...
                    (micros)          (micros)           (micros)  ...
50%                   654.95              0.00            1629.72  ...
75%                   943.13              0.00            5839.59  ...
95%                  4055.27              0.00           62479.63  ...
98%                 62479.63              0.00          107964.79  ...
99%                107964.79              0.00          129557.75  ...
Min                   263.21              0.00             545.79  ...
Max                107964.79              0.00          155469.30  ...

The output shows the latency of reads, writes, and range requests for which the requested node has served as the coordinator. We’ve shortened the output to omit the additional columns CAS Read Latency, CAS Write Latency, and View Write Latency. These columns track read and write latency associated with Cassandra’s lightweight transactions (CAS is an abbreviation of Check and Set), and latency for writing to materialized view tables. These columns are only applicable if you are using these features.

The output is expressed in terms of percentile rank as well as minimum and maximum values in microseconds. Running this command on multiple nodes can help identify the presence of slow nodes in the cluster. A large range latency (in the hundreds of milliseconds or more) can be an indicator of clients using inefficient range queries, such as those requiring the ALLOW FILTERING clause or index lookups.

While the view provided by proxyhistograms is useful for identifying general performance issues, you’ll frequently need to focus on performance of specific tables. This is what nodetool tablehistograms allows you to do. Let’s look at the output of this command against the available_rooms_by_hotel_date table:

nodetool tablehistograms hotel available_rooms_by_hotel_date

hotel/available_rooms_by_hotel_date histograms
Percentile  Read Latency  Write Latency  SSTables  Partition Size  Cell Count
                (micros)       (micros)                   (bytes)
50%                 0.00          0.00       0.00            2759         179
75%                 0.00          0.00       0.00            2759         179
95%                 0.00          0.00       0.00            2759         179
98%                 0.00          0.00       0.00            2759         179
99%                 0.00          0.00       0.00            2759         179
Min                 0.00          0.00       0.00            2300         150
Max                 0.00          0.00       0.00            2759         179

The output of this command is similar. It omits the range latency statistics and instead provides counts of SSTables read per query. The partition size and cell count are provided, where cells are values stored in a partition. These metrics provide another way of identifying large partitions.

Once you’ve gained familiarity with these metrics and what they tell you about your cluster, you should identify key metrics to monitor and even implement automated alerts that indicate your performance goals are not being met. You can accomplish this via frameworks discussed in Chapter 11.

Analyzing Performance Issues

It’s not unusual for the performance of a cluster that has been working well to begin to degrade over time. When you detect a performance issue, you’ll want to begin analyzing it quickly to ensure the performance doesn’t continue to deteriorate. Your goal in these circumstances should be to determine the root cause and address it.

In this chapter, you’ll learn many configuration settings that you can use to tune the performance of each node in a cluster as a whole, across all keyspaces and tables. It’s also important to narrow performance issues down to specific tables and even queries.

In fact, the quality of the data model is usually the most influential factor in the performance of a Cassandra cluster. For example, a table design that results in partitions with a growing number of rows can begin to degrade the performance of the cluster and manifest in failed repairs, or streaming failures on addition of new nodes. Conversely, partitions with partition keys that are too restrictive can result in rows that are too narrow, requiring many partitions to be read to satisfy a simple query.

Beware the Large Partition

In addition to the nodetool tablehistograms discussed earlier, you can detect large partitions by searching logs for WARN messages that reference “Writing large partition” or “Compacting large partition”. The threshold for warning on compaction of large partitions is set by the compaction_large_partition_warning_threshold_mb property in the cassandra.yaml file.

You’ll also want to learn to recognize instances of the anti-patterns discussed in Chapter 5 such as queues, or other design approaches that generate a large amount of tombstones.

Tracing

In Chapter 11, we described tracing as one of the key elements of an overall observability strategy for your applications. Now you’re ready to explore how tracing fits into the strategy. The idea is to use metrics and logging to narrow your search down to a specific table and query of concern, and then to use tracing to gain detailed insight. Tracing is an invaluable tool for understanding the communications between clients and nodes involved in each query and the time spent in each step. This helps you see the performance implications of design decisions you make in your data models and choice of replication factors and consistency levels.

There are several ways to access trace data. Let’s start by looking at how tracing works in cqlsh. First, enable tracing, and then execute a simple command:

cqlsh:hotel> TRACING ON
Now Tracing is enabled
cqlsh:hotel> SELECT * from hotels where id='AZ123';

 id    | address | name                            | phone          | pois
-------+---------+---------------------------------+----------------+------
 AZ123 |    null | Super Hotel Suites at WestWorld | 1-888-999-9999 | null

(1 rows)

Tracing session: 6669f210-de99-11e5-bdb9-59bbf54c4f73

 activity           | timestamp                  | source    | source_elapsed
--------------------+----------------------------+-----------+----------------
Execute CQL3 query  | 2019-12-23 21:03:33.503000 | 127.0.0.1 |              0
Parsing SELECT *... | 2019-12-23 21:03:33.505000 | 127.0.0.1 |          16996
...

We’ve truncated the output quite a bit for brevity, but if you run a command like this, you’ll see activities such as preparing statements, read repair, key cache searches, data lookups in memtables and SSTables, interactions between nodes, and the time associated with each step in microseconds.

You’ll want to be on the lookout for queries that require a lot of inter-node interaction, as these may indicate a problem with your schema design. For example, a query based on a secondary index will likely involve interactions with most or all of the nodes in the cluster.

Once you are done using tracing in cqlsh, you can turn it off using the TRACING OFF command. Tracing visualization is also supported in tools including DataStax DevCenter and DataStax Studio, as well as your application code using DataStax drivers. Taking the DataStax Java Driver as an example, tracing is individually enabled or disabled on a Statement using the setTracing() operation.

To obtain the trace of an query, take a look at the ResultSet object. You may have noticed in previous examples that the CqlSession.execute() operation always returns a ResultSet, even for queries other than SELECT queries. This enables you to obtain metadata about the request via the getExecutionInfo() operation, even when there are no results to be returned. The resulting ExecutionInfo includes the consistency level that was achieved, the coordinator node and other nodes involved in the query, and information about tracing via the getQueryTrace() operation.

The available query trace information includes the trace ID, coordinator node, and a list of events representing the same information available in cqlsh. Because this operation triggers a read from the tables in the system_traces keyspace, there is also an asynchronous variant: getQueryTraceAsync(). Additional configuration options are available under the datastax-java-driver.advanced.request.trace namespace.

Traces Aren’t Forever

Cassandra stores query trace results in the system_traces keyspace. Since the 2.2 release, Cassandra also uses tracing to store the results of repair operations. Cassandra limits the TTL on these tables to prevent them from filling up your disk over time. You can configure these TTL values by editing the tracetype_query_ttl and tracetype_repair_ttl properties in the cassandra.yaml file.

On the server side, you can configure individual nodes to trace some or all of their queries via the nodetool settraceprobability command. This command takes a number between 0.0 (the default) and 1.0, where 0.0 disables tracing and 1.0 traces every query. This does not affect tracing of individual queries as requested by clients. Exercise care in changing the trace probability, as a typical trace session involves 10 or more writes. Setting a trace level of 1.0 could easily overload a busy cluster, so a value such as 0.01 or 0.001 is typically appropriate.

Distributed Tracing Frameworks

As microservice architectures have gained in popularity, developers have recognized the challenge of analyzing and debugging the interactions across multiple services and infrastructure components that are typically required to satisfy a client request. Google’s Dapper paper from 2010 introduced the concept of distributed tracing as an approach for tracking these interactions in order to identify sources of error and poor performance. Distributed tracing solutions use a correlation identifier generated at the entry point to the system that is passed as metadata between the various networked components. The correlation identifier is included in log entries at each component. Log aggregation tools can then pull the logging statements with the same correlation ID to assemble a distributed trace of what happened. This can be useful to dig into the overall context of your application to identify any cases where Cassandra may be impacting performance.

The Cassandra 3.4 release introduced the improvement documented in CASSANDRA-10392 for pluggable query trace logging. You can override the default tracing implementation found in the org.apache.cassandra.tracing package by setting the system property cassandra.custom_tracing_class with the the name of a class that extends the abstract Tracing class and adding the implementation to Cassandra’s classpath.

You can read about an implementation Mick Semb Weaver created using Zipkin query tracing on The Last Pickle blog. The implementation is available on linkhttps://github.com/thelastpickle/cassandra-zipkin-tracing[GitHub]. The Zipkin tracing implementation leverages the ability provided by client drivers to add custom metadata to CQL queries including a client-provided correlation identifier, and sends tracing data to a Zipkin service instead of writing it to Cassandra, reducing the impact of write amplification.

Infracloud has provided an alternate implementation using Jaeger, and more options are likely to become available as the OpenTracing project has emerged to provide standard APIs for distributed tracing frameworks.

Tuning Methodology

Once you’ve identified the root cause of performance issues related to one of your established goals, it’s time to begin tuning performance. The suggested methodology for tuning Cassandra performance is to change one configuration parameter at a time and test the results.

It is important to limit the amount of configuration changes you make when tuning so that you can clearly identify the impact of each change. You may need to repeat the change process multiple times until you’ve reached the desired performance goal.

In some cases, it may be that you can get the performance back in line simply by adding more resources such as memory or extra nodes, but make sure that you aren’t simply masking underlying design or configuration issues. In other cases, you may find that you can’t reach your desired goals through tuning alone, and that design changes are needed.

With this methodology in mind, let’s look at some of the various options that you can configure to tune your Cassandra clusters, including node configuration properties in the cassandra.yaml and cassandra-env.sh files, as well as options that are configured on individual tables using CQL.

Caching

Caches are used to improve responsiveness to read operations. Additional memory is used to hold data, to minimize the number of disk reads that must be performed. As the cache size increases, so does the number of “hits” that can be served from memory.

There are three caches built into Cassandra: the key cache, row cache, and counter cache. The row cache caches a configurable number of rows per partition. If you are using a row cache for a given table, you will not need to use a key cache on it as well.

Your caching strategy should therefore be tuned in accordance with a few factors:

Consider your queries, and use the cache type that best fits your queries.
Consider the ratio of your heap size to your cache size, and do not allow the cache to overwhelm your heap.
Consider the size of your rows against the size of your keys. Typically keys will be much smaller than entire rows.

Let’s consider some specific tuning and configuration options for each cache.

Key Cache

Cassandra’s key cache stores a map of partition keys to row index entries, facilitating faster read access into SSTables stored on disk. You configure usage of the key cache on a per-table basis. For example, use cqlsh to examine the caching settings on the hotels table:

cqlsh:hotel> DESCRIBE TABLE hotels;

CREATE TABLE hotel.hotels (
    id text PRIMARY KEY,
    address frozen<address>,
    name text,
    phone text,
    pois set<text>
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
  ...

Because the key cache greatly increases reads without consuming a lot of additional memory, it is enabled by default; that is, key caching is enabled if you don’t specify a value when creating the table. You can disable key caching for this table with the ALTER TABLE command:

cqlsh:hotel> ALTER TABLE hotels
   WITH caching = { 'keys' : 'NONE', 'rows_per_partition' : 'NONE' };

The legal options for the keys attribute are ALL or NONE.

Some key cache behaviors can be configured via settings in the cassandra.yaml file. The key_cache_size_in_mb setting indicates the maximum amount of memory that will be devoted to the key cache, which is shared across all tables. The default value is either 5% of the total JVM heap, or 100 MB, whichever is less.

Row Cache

With the row cache, you can cache entire rows and can speed up read access for frequently accessed rows, at the cost of more memory usage.

You’ll want to configure the row cache size setting carefully, as the wrong setting can easily lead to more performance issues than it solves. In many cases, a row cache can yield impressive performance results for small data sets when all the rows are in memory, only to degrade on larger data sets when the data must be read from disk.

If your table gets far more reads than writes, then configuring an overly large row cache will needlessly consume considerable server resources. If your table has a lower ratio of reads to writes, but has rows with lots of data in them (hundreds of columns), then you’ll need to do some math before setting the row cache size. And unless you have certain rows that get hit a lot and others that get hit very little, you’re not going to see much of a boost here.

For these reasons, row caching tends to yield fewer benefits than key caching. You may want to explore file caching features supported by your operating system as an alternative to row caching.

As with key caching, you can configure usage of the row cache on a per-table basis. The rows_per_partition setting specifies the number of rows that will be cached per partition. By default, this value is set to NONE, meaning that no rows will be cached. Other available options are positive integers or ALL. The following CQL statement sets the row cache to 200 rows:

cqlsh:hotel> ALTER TABLE hotels
   WITH caching = { 'keys' : 'NONE', 'rows_per_partition' : '200' };

The implementation of the row cache is pluggable via the row_cache_class_name property. This defaults to the off-heap cache provider implemented by the org.apache.cassandra.OHCProvider class. The previous implementation was the SerializingCacheProvider. Third party solutions are available, such as the CAPI Row Cache (CAPI stands for Coherent Accelerator Processor Interface, a Linux extension for Flash memory access).

Counter Cache

The counter cache improves counter performance by reducing lock contention for the most frequently accessed counters. There is no per-table option for configuration of the counter cache.

The counter_cache_size_in_mb setting determines the maximum amount of memory that will be devoted to the counter cache, which is shared across all tables. The default value is either 2.5% of the total JVM heap, or 50 MB, whichever is less.

Saved Cache Settings

Cassandra provides the ability to periodically save caches to disk, so that they can be read on startup as a way to quickly warm the cache. The saved cache settings are similar across all three cache types:

Cache files are saved under the directory specified by the saved_caches property. The files are written at the interval in seconds specified by the key_ cache_ save_period, row_cache_save_period, and counter_cache_ save_period, which default to 14000 (4 hours), 0 (disabled), and 7200 (2 hours), respectively.
Caches are indexed by key values. The number of keys to save in the file are indicated by the key_cache_keys_to_save, row_cache_keys_to_save, and counter_cache_keys_to_save properties.

Managing Caches via nodetool

You’ll want to monitor your caches to make sure they are providing the value you expect. The output of the nodetool info command includes information about each of Cassandra’s caches, including the recent hit rate metrics. These metrics are also available via JMX. If you’ve enabled tracing, you’ll be able to see caches in use, for example when a row is loaded from cache instead of disk.

Cassandra also provides capabilities for managing caches via nodetool:

You can clear caches using the invalidatekeycache, invalidaterowcache, and invalidatecountercache commands.
Use the setcachecapacity command to override the configured settings for key, row, and counter cache capacity.
Use the setcachekeystosave command to override the configured settings for how many key, row, and counter cache elements to save to a file.

Remember that these settings will revert to the values set in the cassandra.yaml file on a node restart.

Memtables

Cassandra uses memtables to speed up writes, holding a memtable corresponding to each table it stores. Cassandra flushes memtables to disk as SSTables when either the commit log threshold or memtable threshold has been reached.

Cassandra stores memtables either on the Java heap, off-heap (native) memory, or both. The limits on heap and off-heap memory can be set via the properties memtable_heap_space_in_mb and memtable_offheap_space_in_mb, respectively. By default, Cassandra sets each of these values to 1/4 of the total heap size set in the cassandra-env.sh file. Changing the ratio of memory used for memtables reduces the memory available for caching and other internal Cassandra structures, so tune carefully and in small increments. We’ll discuss the overall heap size settings in “Memory”.

You can influence how Cassandra allocates and manages memory via the memtable_allocation_type property. This property configures another of Cassandra’s pluggable interfaces, selecting which implementation of the abstract class org.apache.cassandra.utils.memory.MemtablePool is used to control the memory used by each memtable. The default value heap_buffers causes Cassandra to allocate memtables on the heap using the Java New I/O (NIO) API, while offheap_buffers uses Java NIO to allocate a portion of each memtable both on and off the heap. The offheap_objects uses native memory directly, making Cassandra entirely responsible for memory management and garbage collection of memtable memory. This is a less documented feature, so it’s best to stick with the default here until you can gain more experience.

Another element related to tuning the memtables is memtable_flush_writers. This setting, which is 2 by default, indicates the number of threads used to write out the memtables when it becomes necessary. If your data directories are backed by SSD, you should increase this to the number of cores, without exceeding the maximum value of 8. If you have a very large heap, it can improve performance to set this count higher, as these threads are blocked during disk I/O.

You can also enable metered flushing on each table via the CQL CREATE TABLE or ALTER TABLE command. The memtable_flush_period_in_ms option sets the interval at which the memtable will be flushed to disk.

Setting this property results in more predictable write I/O, but will also result in more SSTables and more frequent compactions, possibly impacting read performance. The default value of 0 means that periodic flushing is disabled, and flushes will only occur based on the commit log threshold or memtable threshold being reached.

Commit Logs

There are two sets of files that Cassandra writes to as part of handling update operations: the commit log and the SSTable files. Their different purposes need to be considered in order to understand how to treat them during configuration. You would typically look at tuning at commit log settings for write heavy workloads.

Remember that the commit log can be thought of as short-term storage that helps ensure that data is not lost if a node crashes or is shut down before memtables can be flushed to disk. That’s because when a node is restarted, the commit log gets replayed. In fact, that’s the only time the commit log is read; clients never read from it. But the normal write operation to the commit log blocks, so it would damage performance to require clients to wait for the write to finish.

You can set the value for how large the commit log is allowed to grow before it stops appending new writes to a file and creates a new one. This value is set with the commitlog_segment_size_in_mb property. By default, the value is 32 MB. This is similar to setting log rotation on a logging engine such as Log4J or Logback.

The total space allocated to the commit log is specified by the commitlog_total_space_in_mb property. Setting this to a larger value means that Cassandra will need to flush tables to disk less frequently.

The commit logs are periodically removed, following a successful flush of all their appended data to the dedicated SSTable files. For this reason, the commit logs will not grow to anywhere near the size of the SSTable files, so the disks don’t need to be as large; this is something to consider during hardware selection.

To increase the amount of writes that the commit log can hold, you’ll want to enable log compression via the commitlog_compression property. The supported compression algorithms are LZ4, Snappy, and Deflate. Using compression comes at the cost of additional CPU time to perform the compression.

Additional settings relate to the commit log sync operation, represented by the commitlog_sync element. There are two possible settings for this: periodic and batch. The default is periodic, meaning that the server will make writes durable only at specified intervals. The interval is specified by the commitlog_sync_period_in_ms property, which defaults to 10000 (10 seconds).

In order to guarantee durability for your Cassandra cluster, you may want to examine this setting. When the server is set to make writes durable periodically, you can potentially lose the data that has not yet been synced to disk from the write-behind cache.

If your commit log is set to batch, it will block until the write is synced to disk (Cassandra will not acknowledge write operations until the commit log has been completely synced to disk). Changing this value will require taking some performance metrics, as there is a necessary trade-off here: forcing Cassandra to write more immediately constrains its freedom to manage its own resources. If you do set commitlog_sync to batch, you need to provide a suitable value for commit_log_sync_batch_window_in_ms, where ms is the number of milliseconds between each sync effort.

SSTables

Unlike the commit log, Cassandra writes SSTable files to disk asynchronously. If you’re using hard disk drives, it’s recommended that you store the datafiles and the commit logs on separate disks for maximum performance. If you’re deploying on solid state drives (SSDs), it is fine to use the same disk.

Cassandra, like many databases, is particularly dependent on the speed of the hard disk and the speed of the CPUs. It’s more important to have several processing cores than one or two very fast ones, to take advantage of Cassandra’s highly concurrent construction. So make sure for QA and production environments to get the fastest disks you can—preferably SSDs. If you’re using hard disks, make sure there are at least two so that you can store commit log files and the datafiles on separate disks and avoid competition for I/O time.

When reading SSTable files from disk, Cassandra sets aside a portion of 32 MB of off-heap memory known a buffer cache (also known as a buffer pool), to help reduce database file I/O. This buffer cache is part of a larger file cache which is also used for caching uncompressed chunks of SSTables, which you’ll see momentarily. The file cache size can be set by the file_cache_size_in_mb property in cassandra.yaml, but defaults to either 512 MB, or 1/4 of the Java heap, whichever is smaller. You can also allow Cassandra to use the Java heap for file buffers once the off-heap cache is full by setting buffer_pool_use_heap_if_exhausted to true.

By default, Cassandra compresses data contained in SSTables using the LZ4 compression algorithm. The data is compressed in chunks according to the value of the chunk_length_in_kb property set on each table as part of the compression options:

CREATE TABLE reservation.reservations_by_confirmation (...)
  WITH ... compression = {'chunk_length_in_kb': '16',
  'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
  ...

The remaining portion of the file cache not used for the buffer cache is used for caching uncompressed chunks to speed up reads. The default chunk length is 16kb in Cassandra 4.0, which provides a good tradeoff between the lesser amount of disk I/O required to read and write compressed data vs the CPU time expended to encrypt and decrypt each chunk. You can alter the compression options on a table, but they will only take effect on SSTables written after the options are set. To update existing SSTable files to use the new compression options, you’ll need to use the nodetool scrub or nodetool upgradesstables commands.

As discussed in Chapter 9, Cassandra uses SSTable index summaries and bloom filters to improve performance of the read path. Cassandra maintains a copy of Bloom filters in memory, although you may recall that the Bloom filters are stored in files alongside the SSTable data files so that they don’t have to be recalculated if the node is restarted.

The Bloom filter does not guarantee that the SSTable contains the partition, only that it might contain it. You can set the bloom_filter_fp_chance property on each table to control the percentage of false positives that the Bloom filter reports. This increased accuracy comes at the cost of additional memory use.

Index summaries are kept in memory in order to speed access into SSTable files. By default, Cassandra allocates 5% of the Java heap to these indexes, but you can override this via the index_summary_capacity_in_mb property in cassandra.yaml. In order to stay within the allocated limit, Cassandra will shrink indexes associated with tables that are read less frequently. Because read rates may change over time, Cassandra also resamples indexes from files stored on disk at the frequency specified by the index_summary_resize_interval_in_minutes property, which defaults to 60.

Cassandra also provides a mechanism to influence the relative amount of space allocated to indexes for different tables. This is accomplished via the min_index_interval and max_index_interval properties, which can be set per via the CQL CREATE TABLE or ALTER TABLE commands. These values specify the minimum and maximum number of index entries to store per SSTable.

Hinted Handoff

Hinted handoff is one of several mechanisms that Cassandra provides to keep data in sync across the cluster. As you learned in Chapter 6, a coordinator node can keep a copy of data on behalf of a node that is down for some amount of time. You can tune hinted handoff in terms of the amount of disk space you’re willing to use to store hints, and how quickly hints will be delivered when a node comes back online.

You can control the bandwidth utilization of hint delivery using the property hinted_handoff_throttle_in_kb, or at runtime via nodetool sethintedhandoff throttlekb.

This throttling limit has a default value of 1024, or 1 MB per second, and is used to set an upper threshold on the bandwidth that would be required of a node receiving hints. For example, in a cluster of three nodes, each of the two nodes delivering hints to a third node would throttle its hint delivery to half of this value, or 512 KB persecond.

Note that configuring this throttle is a bit different than configuring other Cassandra features, as the behavior that will be observed by a node receiving hints is entirely based on property settings on other nodes. You’ll definitely want to use the same values for these settings in sync across the cluster to avoid confusion.

In releases prior to 3.0, Cassandra stores hints in a table which is not replicated to other nodes, but starting with the 3.0 release, Cassandra stores hints in a directory specified by the hints_directory property, which defaults to the data/hints directory under the Cassandra installation. You can set a cap on the amount of disk space devoted to hints via the max_hints_file_size_in_mb property.

You can clear out any hints awaiting delivery to one or more nodes using the nodetool truncatehints command with a list of IP addresses or hostnames. Hints eventually expire after the value expressed by the max_hint_window_in_ms property.

It’s also possible to enable or disable hinted handoff entirely, as you learned in Chapter 11. While some would use this as a way to conserve disk and bandwidth, in general the hinted handoff mechanism does not consume a lot of resources in comparison to the extra layer of consistency it helps to provide, especially compared to the cost of repairing a node.

Compaction

Cassandra provides configuration options for compaction including the resources used by compaction on a node, and the compaction strategy to be used for each table.

Choosing the right compaction strategy for a table can certainly be a factor in improving performance. Let’s review the available strategies and discuss when they should be used.

SizeTieredCompactionStrategy: The SizeTieredCompactionStrategy (STCS) is the default compaction strategy, and it should be used in most cases. This strategy groups SSTables into tiers organized by size. When there are a sufficient number of SSTables in a tier (4 or more by default), a compaction is run to combine them into a larger SSTable. As the amount of data grows, more and more tiers are created. STCS performs especially well for write-intensive tables, but less so for read-intensive tables, as the data for a particular row may be spread across an average of 10 or so SSTables.
LeveledCompactionStrategy: The LeveledCompactionStrategy (LCS) creates SSTables of a fixed size (5 MB by default) and groups them into levels, with each level holding 10 times as many SSTables as the previous level. LCS guarantees that a given row appears in at most one SSTable per level. LCS spends additional effort on I/O to minimize the number of SSTables a row appears in; the average number of SSTables for a given row is 1.11. This strategy should be used if there is a high ratio of reads to writes or predictable latency is required. LCS will tend to not perform as well if a cluster is already I/O bound. If writes dominate reads, Cassandra may struggle to keep up.
TimeWindowCompactionStrategy: The TimeWindowCompactionStrategy (TWCS) introduced in the 3.8 release is designed to improve read performance for time series data. It works by grouping SSTables in windows organized by the write time of the data. Compaction is performed within these windows using STCS, and TWCS inherits most of its configuration parameters from STCS. Although processes such as hints and repair could cause the generation of multiple SSTables within a time bucket, this causes no issues because SSTable from one time bucket are never compacted with an SSTable from another bucket. Because this compaction strategy is specifically designed for time series data, you are strongly recommended to set time-to-live (TTL) on all inserted rows, avoid updating existing rows and to prefer allowing Cassandra to age data out via TTL rather than deleting it explicitly.

DateTieredCompactionStrategy Deprecated

TWCS replaces the DateTieredCompactionStrategy (DTCS) introduced in the 2.0.11 and 2.1.1 releases, which had similar goals but also some rough edges that made it difficult to use and maintain. DTCS is now considered deprecated as of the 3.8 release. New tables should use TWCS.

Each strategy has its own specific parameters that can be tuned. Check the documentation for your release for further details.

Testing Compaction Strategies with Write Survey Mode

If you’d like to test out using a different compaction strategy for a table, you don’t have to change it on the whole cluster in order to see how it works. Instead, you can create a test node running in write survey mode to see how the new compaction strategy will work.

To do this, add the following options to the cassandra-env.sh file for the test node:

JVM_OPTS="$JVM_OPTS -Dcassandra.write_survey=true"
JVM_OPTS="$JVM_OPTS -Djoin_ring=false"

Once the node is up, you can then access the org.apache.cassandra.db.ColumnFamilyStoreMBean for the table under org.apache.cassandra.db > Tables in order to configure the compaction strategy. Set the CompactionParameters attribute, which is a map of parameter names and values similar to what you might see in an CREATE or ALTER TABLE command:

CREATE TABLE reservation.reservations_by_confirmation (...)
  WITH ...
    AND compaction = {'class': 'org.apache.cassandra.db.compaction
      .SizeTieredCompactionStrategy', 'max_threshold': '32',
      'min_threshold': '4'}
...

After this configuration change, add the node to the ring via nodetool join so that it can start receiving writes. Writes to the test node place a minimal additional load on the cluster and do not count toward consistency levels. You can now monitor the performance of writes to the node using nodetool tablestats and tablehistograms.

You can test the impact of new compaction strategy on reads by stopping the node, bringing it up as a standalone machine, and then testing read performance on the node.

Write survey mode is also useful for testing out other configuration changes or even a new version of Cassandra.

Another per-table setting is the compaction threshold. The compaction threshold refers to the number of SSTables that are in the queue to be compacted before a minor compaction is actually kicked off. By default, the minimum number is 4 and the maximum is 32. You don’t want this number to be too small, or Cassandra will spend time fighting with clients for resources to perform many frequent, unnecessary compactions. You also don’t want this number to be too large, or Cassandra could spend a lot of resources performing many compactions at once, and therefore will have fewer resources available for clients.

The compaction threshold is set per table using the CQL CREATE TABLE or ALTER TABLE commands. However, you can inspect or override this setting for a particular node using the nodetool getcompactionthreshold or setcompactionthreshold commands:

$ nodetool getcompactionthreshold reservation reservations_by_confirmation
Current compaction thresholds for reservation/reservations_by_confirmation:
 min = 4,  max = 32
$ nodetool setcompactionthreshold reservation reservations_by_confirmation 8 32

Compaction can be intensive in terms of I/O and CPU, so Cassandra provides the ability to monitor the compaction process and influence when compaction occurs.

You can monitor the status of compaction on a node using the nodetool compactionstats command, which lists the completed and total bytes for each active compaction (we’ve omitted the ID column for brevity):

$ nodetool compactionstats

pending tasks: 1
   id  compaction type  keyspace  table   completed  total      unit   progress
  ...  Compaction       hotel     hotels  57957241   127536780  bytes  45.44%
Active compaction remaining time :   0h00m12s

If you see that the pending compactions are starting to stack up, you can use the nodetool commands getcompactionthroughput and setcompactionthroughput to check and set the throttle that Cassandra applies to compactions across the cluster. This corresponds to the property compaction_throughput_mb_per_sec in the cassandra.yaml file. Setting this value to 0 disables throttling entirely, but the default value of 16 MB/s is sufficient for most non-write-intensive cases.

If this does not fix the issue, you can increase the number of threads dedicated to compaction by setting the concurrent_compactors property in cassandra.yaml file, or at runtime via the CompactionManagerMBean. This property defaults to the minimum of the number of disks and number of cores, with a minimum of 2 and a maximum of 8.

Although it is not very common, a large compaction could negatively impact the performance of the cluster. You can use the nodetool stop command to stop all active compactions on a node. You can also identify a specific compaction to stop by ID, where the ID is obtained from the compactionstats output. Cassandra will reschedule any stopped compactions to run later. By default, compaction is run on all keyspaces and tables, but you can use the nodetool disableautocompaction and enableautocompaction commands for more selective control.

You can force a major compaction via the nodetool compact command. Before kicking off a major compaction manually, remember that this is an expensive operation. The behavior of nodetool compact during compaction varies depending on the compaction strategy in use. If your concern is specifically related to cleanup of deleted data, you may use nodetool garbagecollect as an alternative to a major compaction.

The nodetool compactionhistory command prints statistics about completed compactions, including the size of data before and after compaction and the number of rows merged. The output is pretty verbose, so we’ve omitted it here.

Concurrency and Threading

Cassandra differs from many data stores in that it offers much faster write performance than read performance. There are two settings related to how many threads can perform read and write operations: concurrent_reads and concurrent_writes. In general, the defaults provided by Cassandra out of the box are very good.

The concurrent_reads setting determines how many simultaneous read requests the node can service. This defaults to 32, but should be set to the number of drives used for data storage × 16. This is because when your data sets are larger than available memory, the read operation is I/O bound.

The concurrent_writes setting behaves somewhat differently. This should correlate to the number of clients that will write concurrently to the server. If Cassandra is backing a web application server, you can tune this setting from its default of 32 to match the number of threads the application server has available to connect to Cassandra. It is common in Java application servers to prefer database connection pools no larger than 20 or 30, but if you’re using several application servers in a cluster, you’ll need to factor that in as well.

Two additional settings—concurrent_counter_writes and concurrent_materialized_view_writes—are available for tuning special forms of writes. Because counter and materialized view writes both involve a read before write, it is best to set this to the lower of concurrent_reads and concurrent_writes.

There are several other properties in the cassandra.yaml file which control the number of threads allocated to the thread pools Cassandra’s allocates for performing various tasks. You’ve seen some of these already, but here is a summary:

max_hints_delivery_threads: Maximum number of threads devoted to hint delivery
memtable_flush_writers: Number of threads devoted to flushing memtables to disk
concurrent_compactors: Number of threads devoted to running compaction
native_transport_max_threads: Maximum number of threads devoted to processing incoming CQL requests

Note that some of these properties allow Cassandra to dynamically allocate and deal-locate threads up to a maximum value, while others specify a static number of threads. Tuning these properties up or down affects how Cassandra uses its CPU time and how much I/O capacity it can devote to various activities.

Networking and Timeouts

As Cassandra is a distributed system, it provides mechanisms for dealing with network and node issues including retries, timeouts, throttling, and message coalescing. We’ve already discussed a couple of the ways Cassandra implements retry logic, such as the RetryPolicy in the DataStax client drivers, and speculative read execution in drivers and nodes.

Now let’s take a look at the timeout mechanisms that Cassandra provides to help it avoid hanging indefinitely waiting for other nodes to respond. The timeout properties listed in Table 13-1 are set in the cassandra.yaml file.

Table 13-1. Cassandra node timeouts
Property Name	Default Value	Description
`read_request_timeout_in_ms`	5000 (5 seconds)	How long the coordinator waits for read operations to complete
`range_request_timeout_in_ms`	10000 (10 seconds)	How long the coordinator should wait for range reads to complete
`write_request_timeout_in_ms`	2000 (2 seconds)	How long the coordinator should wait for writes to complete
`counter_write_request_timeout_in_ms`	5000 (5 seconds)	How long the coordinator should wait for counter writes to complete
`cas_contention_timeout_in_ms`	1000 (1 second)	How long a coordinator should continue to retry a lightweight transaction
`truncate_request_timeout_in_ms`	60000 (1 minute)	How long the coordinator should wait for truncates to complete (including snapshot)
`streaming_socket_timeout_in_ms`	3600000 (1 hour)	How long a node waits for streaming to complete
`request_timeout_in_ms`	10000 (10 seconds)	The default timeout for other, miscellaneous operations

The values for these timeouts are generally acceptable, but you may need to adjust them for your network environment.

Another property related to timeouts is cross_node_timeout, which defaults to false. If you have NTP configured in your environment, consider enabling this so that nodes can more accurately estimate when the coordinator has timed out on long-running requests and release resources more quickly.

Cassandra also provides several properties that allow you to throttle the amount of network bandwidth it will use for various operations. Tuning these allows you to prevent Cassandra from swamping your network, at the cost of longer time to complete these tasks. For example, the stream_throughput_outbound_megabits_per_sec and inter_dc_stream_throughput_outbound_megabits_per_sec properties specify a per-thread throttle on streaming file transfers to other nodes in the local and remote data centers, respectively.

The throttles for hinted handoff and batchlog replay work slightly differently. The values specified by hinted_handoff_throttle_in_kb and batchlog_replay_throttle_in_kb are considered maximum throughput values for the cluster and are therefore spread proportionally across nodes according to the formula:

$T_{x} = \frac{T_{t}}{N_{n} - 1}$ $T_{x} = \frac{T_{t}}{N_{n} - 1}$

That is, the throughput of a node x (T_x) is equal to the total throughput (T_t) divided by one less than the number of nodes in the cluster (N_n).

There are several properties that you can use to limit traffic to the native CQL port on each node. These may be useful in situations where you don’t have direct control over the client applications of your cluster. The default maximum frame size specified by the native_transport_max_frame_size_in_mb property is 256. Frame requests larger than this will be rejected by the node.

The node can also limit the maximum number of simultaneous client connections, via the native_transport_max_concurrent_connections property, but the default is −1 (unlimited). If you configure this value, you’ll want to make sure it makes sense in light of the concurrent_readers and concurrent_writers properties.

To limit the number of simultaneous connections from a single source IP address, configure the native_transport_max_concurrent_connections_per_ip property, which defaults to −1 (unlimited).

JVM Settings

Cassandra allows you to configure a variety of options for how the server JVM should start up, how much Java memory should be allocated, and so forth. In this section, you’ll learn how to tune the startup.

If you’re using Windows, the startup script is called cassandra.bat, and on Unix systems it’s cassandra.sh. You can start the server by simply executing this file, which detects the selected JVM and configures the Java classpath and paths for loading native libraries. The startup script sources (includes) additional files in the conf directory that allow you to configure a variety of JVM settings.

The cassandra-env.sh (cassandra-env.ps1 on Windows) is primarily concerned with configuring JMX and Java heap options.

The Cassandra 3.0 release began a practice of moving JVM settings related to heap size and garbage collection to a dedicated settings file in the conf directory called jvm.options, as these are the settings that are tuned most frequently. In the 3.0 release, the jvm.options file is included (sourced) by the cassandra-env.sh file.

Since the Cassandra 4.0 release can be run against either JDK 8 or JDK 11, multiple options files are provided:

The conf/jvm-server.options file contains settings that apply to running Cassandra regardless of your selected JVM
The conf/jvm8-server.options and conf/jvm11-server.options contain settings that apply to running Cassandra using JDK 8 or JDK 11, respectively
The conf/jvm-client.options, conf/jvm8-client.options, and conf/jvm11-client.options files contain JVM settings for running clients such as nodetool and cqlsh, following the same conventions as the server options.

Memory

By default, Cassandra uses the following algorithm to set the JVM heap size: if the machine has less than 1 GB of RAM, the heap is set to 50% of RAM. If the machine has more than 4 GB of RAM, the heap is set to 25% of RAM, with a cap of 8 GB. To tune the minimum and maximum heap size yourself, use the -Xms and -Xmx flags. These should be set to the same value to allow the entire heap to be locked in memory and not swapped out by the OS. It is not recommended to set the heap larger than 12GB if you are using the Concurrent Mark Sweep (CMS) garbage collector, as heap sizes larger than this value tend to lead to longer GC pauses.

When performance tuning, it’s a good idea to set only the heap min and max options, and nothing else at first. Only after real-world usage in your environment and some performance benchmarking with the aid of heap analysis tools and observation of your specific application’s behavior should you dive into tuning the more advanced JVM settings. If you tune your JVM options and see some success, don’t get too excited. You need to test under real-world conditions.

In general, you’ll probably want to make sure that you’ve instructed the heap to dump its state if it hits an out-of-memory error, which is the default in cassandra-env.sh, set by the -XX:+HeapDumpOnOutOfMemory option. This is just good practice if you start experiencing out-of-memory errors.

Garbage Collection

Garbage collection (GC) is the process of reclaiming heap memory in the JVM that has been allocated for but is no longer used. Garbage collection has traditionally been a focus of performance tuning efforts for Java applications since it is an administrative process that consumes processing resources, and GC pauses can negatively affect latencies of remote calls in networked applications like Cassandra. The good news is that there is a lot of innovation in the area of Java garbage collection. Let’s look at the available options based on the JVM you’re using.

Default configuration (JDK 8 or 11)

The default configuration for the 3.x and 4.0 releases uses two different garbage collection algorithms that work on different parts of the heap. In this approach, the Java heap is broadly divided into two object spaces: young and old. The young space is subdivided into one for new object allocation (called “eden space”) and another for new objects that are still in use (the “survivor space”).

Cassandra uses the parallel copying collector in the young generation; this is set via the -XX:+UseParNewGC option. Older objects still have some reference, and have therefore survived a few garbage collections, so the Survivor Ratio is the ratio of eden space to survivor space in the young object part of the heap. Increasing the ratio makes sense for applications with lots of new object creation and low object preservation; decreasing it makes sense for applications with longer-living objects. Cassandra sets this value to 8 via the -XX:SurvivorRatio option, meaning that the ratio of eden to survivor space is 1:8 (each survivor space will be 1/8 the size of eden). This is fairly low, because the objects are living longer in the memtables.

Every Java object has an age field in its header, indicating how many times it has been copied within the young generation space. Objects are copied into a new space when they survive a young generation garbage collection, and this copying has a cost. Because long-living objects may be copied many times, tuning this value can improve performance. By default, Cassandra has this value set at 1 via the -XX:MaxTenuringThreshold option. Set it to 0 to immediately move an object that survives a young generation collection to the old generation. Tune the survivor ratio together along with the tenuring threshold.

Modern Cassandra releases use the Concurrent Mark Sweep (CMS) garbage collector for the old generation; this is set via the -XX:+UseConcMarkSweepGC option. This setting uses more RAM and CPU power to do frequent garbage collections while the application is running, in order to keep the GC pause time to a minimum. When using this strategy, it’s recommended to set the heap min and max values to the same value, in order to prevent the JVM from having to spend a lot of time growing the heap initially.

Garbage First Garbage Collector (JDK 8 or JDK 11)

The Garbage First garbage collector (also known as G1GC) was introduced in Java 7. It was intended to become the long term replacement for CMS GC, especially on multiprocessor machines with more memory.

G1GC divides the heap into multiple equal size regions and allocates these to eden, survivor, and old generations dynamically, so that each generation is a logical collection of regions that need not be consecutive regions in memory. This approach enables garbage collection to run continually and require fewer of the “stop the world” pauses that characterize traditional garbage collectors.

G1GC generally requires fewer tuning decisions; the intended usage is that you need only define the min and max heap size and a pause time goal. A lower pause time will cause GC to occur more frequently.

There has been considerable discussion in the Cassandra community about switching to G1GC as the default. For example, G1GC was originally the default for Cassandra 3.0 release, but was backed out due to the fact that it did not perform as well as the CMS for heap sizes smaller than 8 GB. The emerging consensus is that the G1GC performs well without tuning, but the default configuration of ParNew/CMS can result in shorter pauses when properly tuned.

If you’d like to experiment with using G1GC and a larger heap size on a Cassandra 2.2 or later release, the settings are readily available in the cassandra-env.sh file (or jvm.options) file.

Z Garbage Collector (JDK 11 and later)

The Z Garbage Collector (ZGC) is a garbage collector developed at Oracle and introduced in JDK 11. Its primrary goals are to limit pause times to 10ms or less and to scale even to heaps in the multiple terabyte range. ZGC represents a new approach that does not rely on young and old generations. Instead, it divides the heap into regions and copies data into spare regions in order to perform collections in parallel while your application continues to execute. For this reason ZGC is referred to as a concurrent compactor. ZGC does require that you maintain some headroom in your Java heap in order to support the copied data.

ZGC was originally only supported on Linux platforms but Windows and MacOS support is coming in JDK 14. To give ZGC a try on Cassandra 3 or 4, we recommend using the settings on the ZGC wiki page.

Shenandoah Garbage Collector (JDK12)

Shenandoah (SGC) is a new garbage collector developed by RedHat and included for the first time in JDK 12 (although backported versions are available for JDK 8 and JDK 11). SGC also uses a region-based approach similar to ZGC, but works better than ZGC on smaller heap sizes, especially when there is less headroom on the heap. While it has similar overall latency to ZGC with most pauses under 10ms, SGC can have worse tail latencies.

SGC is still considered experimental, but if you’d like to experiment you can read more about how to configure it on the Shenandoah Wiki.

To get more insight on garbage collection activity in your Cassandra nodes, there are a couple of options. The gc_warn_threshold_in_ms setting in the cassandra.yaml file determines the pause length that will cause Cassandra to generate log messages at the WARN logging level. This value defaults to 1000 ms (1 second). You can also instruct the JVM to print garbage-collection details by setting options in the cassandra-env.sh or jvm.options files.

Operating System Tuning

If you’re somewhat new to Linux systems and you want to run Cassandra on Linux (which is recommended), you may want to check out Amy Tobey’s blog post on tuning Cassandra. This walks through several Linux performance monitoring tools to help you understand the performance of your underlying platform so that you can troubleshoot in the right place. Although the post references Cassandra 2.1, the advice it contains is broadly applicable. The Cassandra documentation also contains advice on using lower level JVM and OS tools to tune performance.

Tuning Cassandra Clients

While the majority of this chapter has focused on tuning performance from the server side, there are typically options available to you on the client side as well. Let’s consider performance related features of the DataStax Java Driver that you should be aware of.

Prepared Statements: Make sure that you’re using PreparedStatements for queries that are repeated multiple times in order to take advantage of the bandwidth savings and security benefits of only transmitting parameter values rather than full CQL statements.
Compression: Another way to speed communications is to compress messages. By default, the Java driver does not use compression for communications with Cassandra nodes, but you can select to to use either the LZ4 or Snappy algorithms by setting configuration options in the advanced.protocol.compression namespace. The tradeoff you are making by transmitting fewer bytes over the network is the additional CPU effort for compression and decompression on each end. You might consider enabling compression if your queries involve a non-trivial amount of data.
Request tracking: The Java driver provides a RequestTracker interface. You can specify an implementation of your own or use the provided RequestLogger implementation by configuring the properties in the datastax-java-driver.advanced.request-tracker namespace. The RequestLogger tracks every query your application executes and has options to enable logging for successful, failed, and slow queries. Use the slow query logger to identify queries that are not within your defined performance SLAs.
Request tracing: As we discussed in “Tracing”, you can enable tracing on individual queries in order to obtain detailed information on the interactions between nodes. This is useful for debugging your slow queries to see what is happening.
Request throttling: If you’re concerned about a client flooding the cluster with a large number of requests, you can use the Java driver’s request throttling feature to limit the rate of queries to a value you define using configuration options in the advanced.throttler namespace. Queries in excess of the rate are queued until the utilization is back within range. This behavior is mostly transparent from the client perspective, but it is possible to receive a RequestThrottlingException on executing a statement; this indicates that the CqlSession is overloaded and unable to queue the request.

For more details on these and other performance-related tips, check out the DataStax Java Driver Performance page. Many of these features are available from other drivers as well; check the documentation of the driver you’re using for more information.

Summary

In this chapter, you learned how to managing Cassandra performance and the settings available in Cassandra to aid in performance tuning, including caching settings, memory settings, and hardware concerns. You also learned a methodology for planning, monitoring, and debugging performance and how to use stress tools to effectively simulate operational conditions on your clusters. Next, let’s have a look at how to secure your Cassandra clusters.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for 13. Performance Tuning

Create new playlist

Sign In

Sign Up