Performance tuning

With stress tests completed, you might have identified the key areas for improvement. Performance tuning can broadly be categorized into read performance and write performance. There may also be concerns such as I/O contention on the servers, which is addressed by compaction tuning. Apart from these, several external factors may matter, for example, slow disks, shared resources (such as a shared CPU), and connectivity issues. We are not going to discuss external factors here; the assumption is that you have sufficient resources allocated to the Cassandra servers. This section discusses various tweaks to get Cassandra performing the best that it can within the given resources.

Write performance

Cassandra writes are sequential: all a write needs to do is append to the commit log and update the MemTable in memory. There is not much that can be done in Cassandra's internal settings to tweak writes. However, if disk writes are fast, and the I/O contention caused by the multiple disk activities in the Cassandra life cycle (flushing MemTables to disk, compaction, and writes to the commit log) can somehow be lowered, write performance gets a boost.

So, having fast disks, and keeping the commit logs and data files on separate dedicated hard disks directly attached to the system, will improve write throughput.
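
As a quick sketch, the relevant settings live in cassandra.yaml; the mount points below are assumptions for illustration, so substitute your own disks:

```yaml
# Keep commit logs and data files on separate dedicated disks.
# The paths below are illustrative assumptions.
commitlog_directory: /mnt/disk1/cassandra/commitlog
data_file_directories:
    - /mnt/disk2/cassandra/data
```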

Read performance

Reading in Cassandra is rather complicated. A read may be served from memory or from the hard drive; it may need to aggregate multiple fragments of data from different SSTables; and it may need to gather data across multiple nodes, take care of tombstones, and validate and digest the results before returning the data to the client. But the common pattern for increasing read performance in Cassandra is the same as in any other data system—caching: keep the most frequently used data in memory, minimize disk access, and keep the search path/hops on disk small. A fast network, fewer network accesses, and a low read consistency level also help.

Choosing the right compaction strategy

With each flush of a MemTable, an immutable SSTable gets created. So, with time, there will be numerous SSTables if their number is not limited by an external process. The main problem with this is slow reads: a search may need to hop through multiple SSTables to fetch the requested data. The compaction process executes repeatedly to merge these SSTables into one larger SSTable that holds a cleaned-up version of the data that was scattered in fragments across different smaller SSTables littered with tombstones. This also means that compaction is pretty disk I/O intensive; the longer and the more frequently it runs, the more contention it produces for other Cassandra processes that need to read from or write to the disk.

Cassandra provides two compaction strategies as of Version 1.1.11. A compaction strategy is a column family level setting, so you can set an appropriate compaction strategy for a column family based on its behavior.

Size tiered compaction strategy

This is the default strategy. The way it works is that as soon as the min_threshold (default value 4) number of similar-sized SSTables are available, they get compacted into one big SSTable. As the compacted SSTables get bigger and bigger, it becomes rare for the large SSTables to get compacted further. This leaves some very large SSTables and many smaller SSTables. It also means that the updates to a row will be scattered across multiple SSTables, and more time will be required to process all of those SSTables to gather the fragments of the row.

With the occasional burst of I/O load during compaction, SizeTieredCompactionStrategy is a good fit where rows are inserted and never mutated. This will ensure that all the bits of rows are in one SSTable. This is a write and I/O friendly strategy.

Leveled compaction

LeveledCompactionStrategy is a relatively new introduction to Cassandra, and the concepts are taken from Google's LevelDB project. Unlike size tiered compaction, it uses many small, fixed-size SSTables grouped into levels. Within a level, the SSTables do not overlap. Leveled compaction works in such a way that, in most cases, a read needs to access just one SSTable. This is a big advantage over the size tiered version, and it makes LeveledCompactionStrategy a better choice for read-heavy column families. Update-heavy workloads are also served well, because the fragments of an updated row get merged down through the levels, keeping the row in as few SSTables as possible.

The downside of leveled compaction is high I/O. There is no apparent benefit if the data is of the write-once type; in that case, even with the size tiered strategy, the data is going to stay in one SSTable.
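
To get a feel for why leveled compaction keeps reads cheap, here is a small sketch of how level capacities grow. The 10x growth factor and the 5 MB fixed SSTable size are the strategy's usual defaults, but treat them as assumptions for illustration:

```python
# Sketch: capacity of each level under leveled compaction.
# Assumptions: 10x growth per level, 5 MB fixed-size SSTables.
def level_capacity_mb(level, sstable_size_mb=5, growth=10):
    """Maximum amount of data (in MB) a given level can hold."""
    return sstable_size_mb * growth ** level

# SSTables within a level do not overlap, so a read touches at most
# one SSTable per level; even 50 GB of data needs only a few levels.
total_mb = 0
levels = 0
while total_mb < 50 * 1024:  # 50 GB of data
    levels += 1
    total_mb += level_capacity_mb(levels)
print(levels)  # 4
```

With only a handful of levels, the worst-case number of SSTables consulted for a row stays small, which is exactly the read advantage described above.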

Anyone coming from the traditional database world who has ever worked on scaling up read requests and speeding up data retrieval knows that caching is the most effective way to speed up reads. It prevents database hits for data that was already fetched in the recent past, at the price of the extra RAM that the caching mechanism uses to store the data temporarily. Usually, a third-party caching system such as Memcached manages this for you. The nasty side of third-party caching mechanisms is the whole managing-the-distributed-cache part; it may intrude into the application code.

Cassandra provides an inbuilt caching mechanism that can be really helpful if your application requires heavy read capability. There are two types of caches in Cassandra, row cache and key cache. Refer to the following figure:

Figure 5.3: How caching works

Row cache

Row cache is true caching in the sense that it caches a complete row of data and returns immediately without touching the hard drive. It is fast and complete. Row cache stores data off heap (if JNA is available), which means it will be unaffected by the garbage collector.

Cassandra is capable of storing really large rows of data, with up to about 2 billion columns in them. This means that a fully cached row takes up just as much space in memory, which may not be what you wanted. So, while row cache is generally good to boost read speed, it is best suited for not-so-large rows. You can cache the users column family in a row cache, but it would be a bad idea to put the users_browsing_history or users_click_pattern column family in one.

Key cache

Key cache stores the row keys in memory. Key caching is enabled by default for a column family. It does not take much space, but it boosts performance to a large extent (though less than the row cache). As of Cassandra Version 1.1.11, the default key cache size is the smaller of 100 MB and 5 percent of the JVM heap memory.

Key caches hold information about the location of the row in SSTables, so retrieving the data takes just one disk seek. This short-circuits the usual process of looking through a sampled index and then scanning the index file for the key range. Since a key cache does not take up as much space as a row cache, you can have a large number of keys cached in a relatively small amount of memory.
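
The default sizing rule mentioned above can be written down in one line; this is just a sketch of the min(5 percent of heap, 100 MB) behavior:

```python
# Sketch of the default key cache size: the smaller of 100 MB
# and 5 percent of the JVM heap (Cassandra 1.1.x behavior).
def default_key_cache_mb(heap_mb):
    return min(100, 0.05 * heap_mb)

print(default_key_cache_mb(1024))  # 1 GB heap -> 51.2 MB
print(default_key_cache_mb(8192))  # 8 GB heap -> capped at 100 MB
```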

Cache settings

The general rule of thumb is that, for all normal purposes, a key cache is good enough. You may tweak the key cache to stretch its limits; a small increase in the key cache settings often buys a noticeable gain. Row caching, on the other hand, needs some thinking. Data that fits a row cache well is the same as data that fits a third-party caching mechanism well: it should be mostly read and only occasionally mutated. Rows with a smaller number of columns are better suited. Before you go ahead with cache tweaking, you may want to check the current cache usage. You can use jConsole, as discussed in Chapter 7, Monitoring, to see the cache hits. In jConsole, the cache statistics can be obtained by expanding the org.apache.cassandra.db menu. It shows the cache hit rate, the number of hits, and the cache size for a particular node and column family.

Cache settings are mostly global as of Version 1.1.11, and can be altered in cassandra.yaml. At the column family level, the only choices you have are which cache type to use, or whether to use any cache at all. The options are: all, keys_only, rows_only, and none. The all option uses both caches; the none option uses neither. Here is an example:

CREATE COLUMN FAMILY rowcachedCF WITH COMPARATOR = UTF8Type AND CACHING = rows_only;

Here are the caching specific settings in cassandra.yaml:

  • key_cache_size_in_mb: By default, this is the smaller of 100 MB and 5 percent of the heap size. To disable key caching globally, set it to zero.
  • key_cache_save_period: The time after which the cache is saved to disk to avoid a cold start. A cold start is when a node starts afresh and is hit by a lot of requests; with no cache at start time, it takes a while for the cache to get loaded with the most requested keys, and responses may be sluggish during this period. The caches are saved under the directory given by the saved_caches_directory setting in the .yaml file, which we configured during the cluster deployment in Chapter 4, Deploying a Cluster. The default value for this setting is 14,400 seconds (4 hours).

  • key_cache_keys_to_save: The number of keys to save. It is commented out by default, which means all keys are saved. In general, it is okay to leave it commented.
  • row_cache_size_in_mb: Row caching is disabled by default by setting this attribute to zero. To enable it, set it to an appropriate positive integer. It may be worth looking at nodetool -h <hostname> cfstats and taking the row mean size and the number of keys into account.
  • row_cache_save_period: Similar to key_cache_save_period, this saves the cache to the saved caches directory after the prescribed time. Unlike key caches, row caches are bigger, and saving to disk is an I/O expensive task to do. Compared to the fuss of saving a row cache, it does not give proportional benefit. It is okay to leave it as zero, that is, disabled.
  • row_cache_keys_to_save: Same as key_cache_keys_to_save.
  • row_cache_provider: Out of the box, Cassandra provides two mechanisms to enable a row cache:
    • SerializingCacheProvider: A modern cache provider that stores data off-heap. It is faster and has a smaller memory footprint than the other implementation. It is the default cache provider, and it is preferred for read-heavy, mutate-light environments, which is where caching shines anyway. Being off-heap means a smaller JVM heap, which means faster garbage collection and hence shorter GC pauses.
    • ConcurrentLinkedHashCacheProvider: Caches in the JVM heap; its performance is lower than that of the other cache mechanism. It performs better in update-heavy cases, because rows in an on-heap cache can be updated in place instead of being invalidated on each update.
  • You can plug in your own cache provider by putting a .jar or .class file in the lib directory of Cassandra's installation and mentioning the fully-qualified class name as row_cache_provider. The class must implement org.apache.cassandra.cache.IRowCacheProvider.

Enabling compression

Column family compression is a very effective mechanism to improve read and write throughput. As you might expect, compression leads to a compact representation of data at the cost of some CPU cycles. Enabling compression on a column family makes the disk representation of the data (SSTables) terse. That means efficient disk utilization, less I/O, and a little extra burden on the CPU. In the case of Cassandra, the tradeoff between I/O and CPU due to compression almost always yields favorable results: the cost of the CPU performing compression and decompression is less than the cost of reading more data from a disk.

Compression is set per column family; if you do not mention any compression mechanism, SnappyCompressor is applied to the column family by default. Here is how you assign a compression type:

# Create column family with compression settings
CREATE COLUMN FAMILY compressedCF
WITH compression_options =
{
  'sstable_compression': 'SnappyCompressor',
  'chunk_length_kb': 64
};

Let's see what compression options we have:

  • sstable_compression: This parameter specifies which compressor is used to compress a disk representation of an SSTable when a MemTable is flushed. Compression takes place at the time of flush. As of Cassandra Version 1.1.11, it provides two compressors out of the box: SnappyCompressor and DeflateCompressor.

Note

Cassandra Version 1.2.2+ has another compressor named LZ4Compressor. LZ4 is about 50 percent faster than Snappy compression. If you think it may be worth reading more about it, see this JIRA ticket: https://issues.apache.org/jira/browse/CASSANDRA-5038.

  • SnappyCompressor compresses and decompresses faster, but is less effective in terms of compression ratio when compared to DeflateCompressor. This means that SnappyCompressor will take up a little extra disk space, but it will give higher read speed.
  • Like everything else in Cassandra, compressors are pluggable. You can write your own compressor by implementing org.apache.cassandra.io.compress.ICompressor, compiling the compressor, and putting the .class or .jar files in the lib directory. Provide the fully-qualified class name of the compression as sstable_compression.
  • chunk_length_kb: The chunk length is the smallest slice of the row that gets decompressed during reads. Depending on the query pattern and the median size of the rows, this parameter can be tweaked so that it is big enough that a typical read does not have to decompress multiple chunks, but small enough that reads do not decompress excessive unnecessary data. In practice, this is hard to guesstimate; the most common suggestion is to keep it at 64 KB if you do not have a better idea.
  • Compression can be added, removed, or altered anytime during the lifetime of a column family. In general, compression boosts performance and is a great way to maximize the utilization of disk space; it typically gives a two- to four-fold reduction in data size compared to the uncompressed version. So, you should usually enable compression to start with; it can be disabled pretty easily:
    # Disable Compression
    UPDATE COLUMN FAMILY compressedCF
    WITH compression_options = null;

    It must be noted that enabling compression may not immediately halve the space used by SSTables. The compression is applied to the SSTables that get created after the compression is enabled. With time, as compaction merges SSTables, older SSTables get compressed.
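
Since chunk_length_kb is the smallest unit decompressed on a read, a back-of-the-envelope way to reason about it is to count how many chunks a read has to inflate. This is only an illustrative sketch:

```python
import math

# Sketch: number of compressed chunks a read must decompress for a
# given amount of row data, at a given chunk length.
def chunks_to_decompress(read_size_kb, chunk_length_kb=64):
    return max(1, math.ceil(read_size_kb / chunk_length_kb))

# A 4 KB point read still inflates one full 64 KB chunk, while a
# 300 KB slice of a wide row inflates five of them.
print(chunks_to_decompress(4))    # 1
print(chunks_to_decompress(300))  # 5
```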

Tuning the bloom filter

Accessing a disk is the most expensive operation in the read path, and Cassandra thinks twice before reading from one. The bloom filter helps to identify which SSTables may contain the row that the client has requested. But the bloom filter, being a probabilistic data structure, yields false positives (refer to Chapter 1, Quick Start). The more false positives, the more SSTables need to be read before realizing whether the row actually exists in an SSTable or not.

The false positive ratio is basically the probability of getting a true value from the bloom filter of an SSTable for a key that does not exist in it. In simpler words, if the false positive ratio is 0.5, chances are that 50 percent of the time you end up looking into an index file for a key that is not there. So, why not set the false positive ratio to zero and never touch the disk without being 100 percent sure? Well, it comes with a cost: memory consumption. If you remember from Chapter 1, Quick Start, the smaller the bloom filter, the smaller the memory consumption, but the higher the likelihood of hash collisions, and hence the higher the false positive ratio. So, as you decrease the false positive ratio, your memory consumption shoots up. We need a balance here.
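
The memory-versus-accuracy tradeoff follows the standard bloom filter formula: an optimally sized filter needs about -ln(p)/(ln 2)^2 bits per key for a target false positive probability p. A quick sketch:

```python
import math

# Bits of bloom filter memory needed per key for a target false
# positive probability p (standard optimal bloom filter formula).
def bits_per_key(p):
    return -math.log(p) / (math.log(2) ** 2)

# Tightening the false positive ratio makes memory climb steadily:
for p in (0.1, 0.01, 0.000744):
    print(p, round(bits_per_key(p), 1))  # ~4.8, ~9.6, ~15.0 bits
```

Pushing the ratio toward zero blows the per-key cost up quickly, which is why a balance is needed.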

The default value of the bloom filter false positive ratio is 0.000744. To disable the bloom filter, that is, to let all queries through to the SSTables (all false positives), this ratio needs to be set to 1.0. You may want to bypass the bloom filter this way if you have to scan all SSTables for data mining or other analytical applications.

Here is how you can create a column family with false positive chance as 0.01:

# Create column family with false positive chance = 0.01
CREATE COLUMN FAMILY myCFwithHighFP
WITH bloom_filter_fp_chance = 0.01;

You may alter the false positive chance on an up-and-running cluster without any need to reboot. But the new false positive chance is applied only to newer SSTables, created by means of a flush or compaction.

You can always see the bloom filter false positive chance by running the describe command in cassandra-cli or by requesting cfstats via nodetool. nodetool displays the current ratio too.

# In cassandra-cli
describe myCFwithHighFP;

ColumnFamily: myCFwithHighFP
[-- snip --]
Bloom Filter FP chance: 0.01
[-- snip --]

In case you want to force the new false positive value on all existing SSTables across the nodes, you need to execute upgradesstables on the column family via nodetool.

# Enforcing new settings on all the SSTables of a CF
$ /opt/cassandra/bin/nodetool -h 10.147.171.159 upgradesstables myKeyspace myCFwithHighFP

More tuning via cassandra.yaml

cassandra.yaml is the hub of almost all the global settings for the node or the cluster. It is well documented, and you can learn a lot just by reading through it. Listed here are some of those properties from Cassandra Version 1.1.11 with short descriptions. It is suggested that you refer to the cassandra.yaml of your version of Cassandra and read the details there.

index_interval

Each SSTable is accompanied by a primary index that tells which row is where (its offset) in the SSTable. It is inefficient to read the primary index from disk for every seek into an SSTable; moreover, if the bloom filter returned a false positive, doing so is a waste of computational power. So Cassandra keeps a sampled index in memory, which is a subset of the primary index. The sampled index is created by choosing one entry out of every index_interval entries of the primary index. This means the smaller the index_interval, the larger the sampled index, the larger the memory usage, the fewer the disk lookups, and the better the read performance.

The default is 128, and values between 128 and 512 are suggested. To avoid storing too many samples (which overshoots memory) or too few (so that the samples do not provide much leverage over reading the primary index directly), you may combine a decent default index interval with a larger key cache size; this often performs much better than tuning index_interval alone.
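
A rough sketch of what the sampled index costs in memory; the 32-byte per-entry overhead is an assumption for illustration:

```python
import math

# Sketch: number of in-memory index samples for one SSTable, and a
# rough memory estimate. The 32 bytes per entry is an assumption.
def sampled_index_size(total_keys, index_interval=128,
                       bytes_per_entry=32):
    entries = math.ceil(total_keys / index_interval)
    return entries, entries * bytes_per_entry

print(sampled_index_size(10_000_000, 128))  # (78125, 2500000)
print(sampled_index_size(10_000_000, 512))  # 4x fewer samples
```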

commitlog_sync

As we know from Chapter 1, Quick Start, Cassandra provides durable writes by virtue of appending new writes to the commit log. This is not entirely true out of the box. To guarantee that a write survives a hard reboot or crash, it must be fsync'd to the disk. Flushing the commit log after each write is detrimental to write performance due to slow disk seeks. Instead, Cassandra periodically (by default, commitlog_sync: periodic) flushes the data to disk after an interval given by commitlog_sync_period_in_ms in milliseconds. Cassandra does not wait for the commit log to be synchronized; it acknowledges the write immediately. This means that if a heavy write load is going on and a machine crashes, you will lose at most the data written within the commitlog_sync_period_in_ms window. But you should not really worry: the replication factor and consistency level help recover this loss, unless you are unlucky enough that all the replicas die in the same instant.

Note

fsync() transfers (flushes) all modified in-core data of (that is, modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even after the system crashed or was rebooted. Read more on http://linux.die.net/man/2/fsync.

The periodic commitlog_sync method gives high performance at some risk. For someone who is paranoid about data loss, Cassandra provides a guaranteed-write option: set commitlog_sync to batch. In batch mode, Cassandra accrues all the writes headed for the commit log and fsyncs them after commitlog_sync_batch_window_in_ms, which is usually set to a small value such as 50 milliseconds. This avoids flushing to disk after every single write, but the durability guarantee forces the acknowledgement to wait until the flush is done or the batch window is over, whichever is sooner. This means batch mode will always be slower than periodic mode.

For most practical cases, the default value periodic and default fsync() period of 10 seconds will do just fine.
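
The two modes, side by side, as cassandra.yaml fragments (only one should be active at a time):

```yaml
# Default: acknowledge writes immediately, fsync every 10 seconds.
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000

# Paranoid alternative: acknowledge only after fsync, batching
# writes for up to 50 ms to amortize the flush.
# commitlog_sync: batch
# commitlog_sync_batch_window_in_ms: 50
```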

column_index_size_in_kb

This property tells Cassandra to add a column index to a row if the size of the row (its serialized size) grows beyond the number of kilobytes given by this property. In other words, if the row size is 314 KB and column_index_size_in_kb is 64 KB, there will be a column index with at least five entries, each containing the start and finish column names of its chunk along with the chunk's offset and width.

So, if the row contains many columns (wide rows) or you have columns with really large values, you may want to raise the default. This has a downside: with a large column_index_size_in_kb, Cassandra needs to read at least that much data, even to serve a single small column from a small row. On the other hand, with too small a value, a large amount of index data needs to be read on each access. The default is okay for most cases.
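
The 314 KB example above can be checked with a couple of lines: the number of column index entries is the row's serialized size divided by column_index_size_in_kb, rounded up.

```python
import math

# Number of column index entries for a row of a given serialized
# size, under a given column_index_size_in_kb setting.
def column_index_entries(row_size_kb, column_index_size_in_kb=64):
    return math.ceil(row_size_kb / column_index_size_in_kb)

print(column_index_entries(314))  # 5, as in the example above
```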

commitlog_total_space_in_mb

The commit log file is memory mapped, which means the file takes up address space. When the total space taken by the commit logs exceeds this property, Cassandra flushes to disk any unflushed MemTable whose data lives in the oldest mmap'd commit log segment, so that the segment can be recycled. From the I/O point of view, it does not make sense to keep this property small, because a small total space fills up quickly, forcing frequent MemTable flushes and higher disk I/O. On the other hand, we do not want the commit logs to hog all the memory. Note that data that was not flushed to disk before a shutdown is replayed from the commit log; so, the larger the commit log, the longer the replay will take upon restart.

The default for 32-bit JVM is 32 MB and for 64-bit JVM is 1024 MB. You may tune it based on the memory availability on the node.

Tweaking JVM

Cassandra runs in the JVM (Java Virtual Machine): all the threading, memory management, processes, and interaction with the underlying OS are done by Java. Investing time to optimize JVM settings to speed up Cassandra operations pays off. We will also see that general assumptions, such as setting the Java heap so high that it eats up most of the system's memory, may not be a good idea.

Java heap

If you look into conf/cassandra-env.sh, you will see nicely written logic that does the following: max(min(1/2 RAM, 1024 MB), min(1/4 RAM, 8 GB)). This means the max heap depends on the system memory, as Cassandra chooses a decent default, which is:

  • Max heap = 50 percent for a system with less than 2 GB of RAM
  • Max heap = 1 GB for 2 GB to 4 GB RAM
  • Max heap = 25 percent for a system with 4 GB to 32 GB of RAM
  • Max heap = 8 GB for 32 GB onwards RAM
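
The bullet points above are just the cassandra-env.sh formula evaluated at different RAM sizes; paraphrased in code:

```python
# Paraphrase of the default max heap calculation in
# conf/cassandra-env.sh: max(min(RAM/2, 1024 MB), min(RAM/4, 8192 MB)).
def max_heap_mb(ram_mb):
    return max(min(ram_mb // 2, 1024), min(ram_mb // 4, 8192))

print(max_heap_mb(1024))   # 1 GB RAM  -> 512 MB (50 percent)
print(max_heap_mb(4096))   # 4 GB RAM  -> 1024 MB (the 1 GB plateau)
print(max_heap_mb(16384))  # 16 GB RAM -> 4096 MB (25 percent)
print(max_heap_mb(65536))  # 64 GB RAM -> 8192 MB (the 8 GB cap)
```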

The reason not to go with a very large heap is garbage collection, which does not do well on heaps larger than about 8 GB. A high heap may also starve the underlying operating system's page cache. In general, the default serves well enough. In case you choose to alter the heap size, you need to edit cassandra-env.sh and set MAX_HEAP_SIZE to the appropriate value.

Garbage collection

Further down cassandra-env.sh, you may find the garbage collection setting:

# GC tuning options
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"

Cassandra, by default, uses the concurrent mark and sweep garbage collector (CMS GC). It performs garbage collection concurrently with the execution of Cassandra and pauses for a very short while. This is a good candidate for high performance applications such as Cassandra. With a concurrent collector, a parallel version of the young generation copying collector is used. And thus, we have UseParNewGC, which is a parallel copy collector that copies surviving objects in young generation from Eden to Survivor spaces and from there to old generation—it is written to work with concurrent collectors such as CMS GC.

Furthermore, CMSParallelRemarkEnabled reduces the pauses during the remark phase.

The other garbage collection settings do not impact garbage collection significantly. However, a low value for CMSInitiatingOccupancyFraction may lead to frequent garbage collection, because a concurrent collection cycle starts when the occupancy of the tenured generation grows above this threshold. CMSInitiatingOccupancyFraction sets that threshold as a percentage of the current tenured generation size.

If you decide to debug garbage collection, it is a good idea to use tools such as JConsole to look into how frequently garbage collection takes place, CPU usage, and so on. You may also want to uncomment the GC logging options in cassandra-env.sh to see what's going on beneath.

Note

If you decide to increase the Java heap size beyond 6 GB, it may be worth switching the garbage collection settings too. The Garbage First garbage collector (G1 GC) ships with Oracle Java 7 update 4+. It is claimed to work reliably on larger heap sizes, and it is suggested that applications currently running on CMS GC will benefit from G1 GC in more ways than one. (Read more about G1 GC at http://www.oracle.com/technetwork/java/javase/tech/g1-intro-jsp-135488.html.)

Other JVM options

Compressed Ordinary Object Pointers (OOPs): In a 64-bit JVM, OOPs normally have the same size as the machine pointer, that is, 64 bits. This causes a larger heap size requirement on a 64-bit machine for the same application when compared to the heap size requirement on a 32-bit machine. The compressed OOPs option helps to keep the heap size smaller on 64-bit machines. (Read more about compressed OOPs at http://docs.oracle.com/javase/7/docs/technotes/guides/vm/performance-enhancements-7.html#compressedOop.)

It is generally suggested to run Cassandra on Oracle Java 6. Compressed OOPs are supported and enabled by default from Java SE Version 6u23 onwards. For earlier releases, you need to activate them explicitly by passing the -XX:+UseCompressedOops flag.

Enable Java Native Access (JNA): Please refer to Chapter 3, Design Patterns for specific information on how to install the JNA library for your operating system. JNA gives Cassandra access to native OS libraries (shared libraries, DLLs). Cassandra uses JNA for off-heap row caching that does not get swapped, and in general it gives favorable performance results on reads and writes. If JNA is absent, Cassandra falls back to on-heap row caching, which has a negative impact on performance. JNA also helps while taking backups: snapshots are created with the help of JNA, which would take much longer with fork and Java.

Scaling horizontally and vertically

We will see scaling in more detail in Chapter 6, Managing a Cluster – Scaling, Node Repair, and Backup, but let's discuss scaling in the context of performance tuning. As we know, Cassandra is linearly horizontally scalable: adding more nodes to a Cassandra cluster results in a proportional gain in performance. This is called horizontal scaling.

The other thing we have observed is that Cassandra reads are bound by memory and disk speed. So, having more memory, allocating more memory to caches, and using dedicated, fast-spinning hard disks (or solid state drives) will boost performance. High processing power with multi-core processors helps compression, decompression, and garbage collection run more smoothly. So, a beefy server will improve overall performance. This is called vertical scaling. That said, in today's cloud age it is often more economical to use lots of cheap, low-end machines than a couple of expensive machines with high I/O, high CPU, and high memory.

Network

For Cassandra, like any other distributed system, the network is one of the important aspects that can affect performance. Cassandra needs to make calls across the network for both reads and writes, so a slow network can become a bottleneck. It is suggested to use a gigabit network and redundant network interfaces. In a system with more than one network interface, it is recommended to bind rpc_address (client/Thrift traffic) to one network interface card and listen_address (inter-node traffic) to another.
