Install the Oracle JRE from oracle.com, where you can find the latest version. Download the Cassandra binaries from http://cassandra.apache.org/download/; you can choose between an Ubuntu package and a tar package. DataStax also distributes an installation package at http://www.datastax.com/download.
Follow the instructions and get it installed; once installed, you will find the following files in the respective locations:
/etc/cassandra/conf
or {CASSANDRA_HOME}/conf
/etc/cassandra/conf
<install_location>/conf
/etc/dse/cassandra
It is advisable to edit cassandra-env.sh and modify the following settings:
#MAX_HEAP_SIZE="4G"
#HEAP_NEWSIZE="800M"
It takes a lot of testing to get the GC configuration right, but as a rule of thumb, the default configuration works well for most use cases. We have seen that the following configuration is much more conservative if you have enough memory in the system:
#MAX_HEAP_SIZE="8G"
#HEAP_NEWSIZE="1G"
Enable GC logging and look for concurrent mode failures. Visual GC provides visualizations that help you understand the GC cycles. The idea is to make sure that most of the garbage is collected in the initial stages, where it is less expensive to collect. Tuning GC is a long and complicated topic by itself and is outside the scope of this book, but testing constantly with various settings at your peak traffic will help mitigate any surprises.
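As a minimal sketch of "looking for concurrent mode failures", the following Python function counts the lines of a GC log that contain that message. It assumes the log is plain text and that the CMS collector is in use; the function name and the two sample lines are made up for illustration and are not real JVM output.

```python
def count_concurrent_mode_failures(lines):
    """Count GC log lines that report a CMS concurrent mode failure."""
    return sum(1 for line in lines if "concurrent mode failure" in line)


# Hypothetical sample lines, for illustration only.
sample_log = [
    "[GC [ParNew: 700000K->80000K(720000K), 0.05 secs]",
    "[Full GC (concurrent mode failure): 3900000K->2100000K(4000000K), 6.2 secs]",
]
print(count_concurrent_mode_failures(sample_log))
```

To use it on a real log, pass an open file object instead of the sample list; a count that keeps growing under load suggests the heap settings need another look.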
The cassandra.yaml file is well documented and mostly self-explanatory, but we will talk about a few items in the YAML that will help users configure Cassandra.
cluster_name is the name of the cluster. Cassandra checks the cluster name before it starts any gossip communication with its peer nodes.
The seed node has no purpose other than bootstrapping the gossip process for new nodes joining the cluster; seed nodes are not used beyond that. Once a node is bootstrapped, cluster information is stored on disk and reused. When bootstrapping a node into a live cluster, make sure that its seed list doesn't contain the node itself. The following shows a sample YAML configuration.
- class_name: org.apache.cassandra.locator.SimpleSeedProvider
  parameters:
      - seeds: "<IP ADD OF SEED>,<IP ADD OF SEED>"
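The rule that a bootstrapping node must not list itself as a seed can be checked before the node is started. The sketch below is only an illustration: the function name and the IP addresses are hypothetical, and it simply builds the comma-separated value used for the seeds parameter.

```python
def seeds_for_bootstrap(seed_ips, own_ip):
    """Return the comma-separated seeds value, refusing a list that contains this node."""
    if own_ip in seed_ips:
        raise ValueError(own_ip + " is in its own seed list; remove it before bootstrapping")
    return ",".join(seed_ips)


# Hypothetical addresses, for illustration only.
print(seeds_for_bootstrap(["10.0.0.1", "10.0.0.2"], own_ip="10.0.0.7"))
```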
A partitioner distributes rows across nodes in the cluster. Any IPartitioner may be used as long as it is on the classpath. Cassandra supports the following partitioners:
Enabling this configuration makes the new Cassandra nodes automatically stream the right data to themselves provided that the cluster is already up and running. Cassandra will ignore this setting when the node is configured as a seed.
You can also re-bootstrap an existing node by deleting all the data directories and starting the node with -Dcassandra.replace_token=<Token's> (you can pass the -D option to Cassandra through the startup script). Essentially, the node goes into bootstrap mode and streams the required data into itself, and once it is online, the hints stored for that node are replayed. This setting is very helpful in environments such as EC2, where node recovery is difficult after the node dies (since the data is stored on ephemeral drives). Again, since the advent of vnodes, it is easier to decommission the old node and bootstrap a new one.
If your Cassandra cluster is deployed in an environment where each node has two IPs (one for communication within a DC and the other for communication between DCs), you might want to configure this. The best example is AWS, where every instance gets a private and a public IP. This configuration needs to point to the public IP for inter-DC communication.
Commit log writes are usually sequential. It is recommended to put the commit log on a separate disk if one is available, but this is not always required.
Make sure to change the name of the data directories. You can specify a list of directories to be used, and these directories can be on different physical drives. You can also configure a column family to be in a specific directory. The data directories are configured in the YAML. Please note that Cassandra will load balance the data files between the different directories, but if you want more control over the way blocks are distributed between the drives, you might want to use hardware RAID, with some overhead.
This configuration makes the most sense when a server has both spinning drives and SSDs, since the directories can then be placed on the appropriate drives.
The default value of this setting is to stop gossip and Thrift when an I/O error or disk failure is detected in order to avoid exposing corrupt or bad data to other nodes. You can change it to best_effort or set it to ignore depending on how you want to deal with errors. It is better to detect errors and stop the node than to discover issues later.
initial_token assigns the node its token position in the ring, and with it a range of data, when the node first starts up. With vnodes (Version 1.2 and above), this configuration is no longer used, since vnodes provide automatic token management. Instead, you set the num_tokens parameter to control the number of partitions a node will have. Increasing this number results in many smaller partitions, with the caveat that range queries on keys become more expensive.
While using vnodes, there will be more partitions than hosts, and these partitions are distributed across the hosts to create a many-to-one mapping, providing a faster bootstrap and automatic load balancing. In a traditional distributed-hash-table-style Cassandra installation, nodes and tokens have a one-to-one relationship. Hence, while bootstrapping new nodes into an existing cluster, we need to rebalance the cluster, which is an expensive operation. Vnode partitions are randomly distributed across all the nodes.
Losing a node / decommissioning / adding a node to the cluster will evenly redistribute the load to (essentially) all the nodes in the cluster. This drastically reduces the time taken for recovery from failures.
Vnodes work well for almost all cases; however, if you need to disable them for any reason, disable num_tokens and fall back to manual token management. There are many tools available online to calculate the tokens and form equally spaced rings.
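Calculating equally spaced tokens by hand is simple arithmetic. The sketch below assumes the token ranges of two standard partitioners, Murmur3Partitioner (-2**63 to 2**63 - 1) and RandomPartitioner (0 to 2**127 - 1); the function names are hypothetical, and the results are meant as initial_token values, one per node.

```python
def murmur3_tokens(node_count):
    """Evenly spaced tokens across the Murmur3Partitioner range."""
    step = 2**64 // node_count
    return [i * step - 2**63 for i in range(node_count)]


def random_partitioner_tokens(node_count):
    """Evenly spaced tokens across the RandomPartitioner range."""
    step = 2**127 // node_count
    return [i * step for i in range(node_count)]


# One token per node for a four-node ring.
for token in murmur3_tokens(4):
    print(token)
```

Adding a node to such a ring changes the ideal spacing, which is why a manually managed ring has to be rebalanced after growth.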
The IP address of the node is used for communication. If it is left empty, Cassandra tries to find the local hostname/IP to listen on.
This sets the snitch that Cassandra uses to place data for safer operation and to avoid data losses during rack or DC failures. The default is a simple snitch that considers the nodes next to each other in different racks and has no notion of DCs. The YAML file has been updated and has more detailed information on what to do. You might need to choose the right one.
Writes to Cassandra are written to the commit log and then to memory, after which the write to a particular node is considered complete. When writes are performed to a disk partition, the OS caches and batches them before it actually performs the disk I/O. The application can call fsync to force a flush to disk, but this is an expensive operation; hence, Cassandra provides two methods of operation:
commitlog_sync_period_in_ms: In periodic mode, this controls when the commit log is synced to disk; by default, the sync happens every 10 seconds
commitlog_sync_batch_window_in_ms: In batch mode, this controls how long Cassandra waits for other writes before actually doing fsync; writes are not acknowledged until they are fsynced to disk
This sets the size of the individual commit log segment files. Increase it if you want to reduce the number of commit log segments. A downside of increasing the number of commit log segments is that it can increase the cost during startup. On the other hand, if you want an incremental backup of every individual transaction on the system, you might want to keep it big enough to allow it to be copied to an external source. The commitlog_archiving.properties file allows you to copy the incremental commit log into a separate partition/filesystem for later use. The advantage of commit log backup over incremental Cassandra backup is that it provides all the operations performed on the cluster in the order of execution. The properties file is self-explanatory for most of the properties it supports.
This controls the total space used for commit log segment files. If the used space goes above this value, Cassandra rotates to the next segment. When the space is full and there are no commit log segments available for recycling, memtables are flushed to disk so that the commit log segments that are no longer needed can be reused.
It is worth noting that commit log segments can be recycled once the memtables are flushed to disk, because the data in those commit logs is then already synced to disk. The commit log therefore holds redundant information and can safely be thrown away.
Cassandra allows you to save the row and key caches to disk, which will enable a restart of the service to bring the caches back to the same state before the service went down.
The row_cache_size_in_mb and key_cache_size_in_mb settings control the amount of memory used by the row and key caches.
The key cache has an entry for every SSTable. When compacting, Cassandra creates new SSTables and invalidates the cached entries of the old ones. With this setting, Cassandra preheats the key cache for the new SSTables, provided the key cache is enabled. This setting is advantageous when a user wants the key cache to be always up to date (rather than filled on cache misses). Since compaction can invalidate the keys, it might be worthwhile in some cases to preheat them.
This configuration is not available in 2.0 and Cassandra uses SerializingCacheProvider. The two types of cache providers are as follows:
org.apache.cassandra.cache.IRowCacheProvider. It's worth noting that row_cache_size_in_mb only controls the amount of off-heap storage and doesn't limit the heap storage of the cache keys.
SerializingCacheProvider performs badly in an update-heavy workload. Since the off-heap cache allocates native memory, which might fragment the memory available for future allocations, Cassandra relies on the default allocator (GCC malloc) to do this; tests show that jemalloc has better savings than GCC malloc. You might want to try the memory_allocator configuration in cassandra.yaml.
Inherently, the JVM is not good at managing caches (LRU kind of caches) where data is frequently freed and new data is added. This causes JVM memory to fragment, causing more GC pressure to compact it. There is no way to hint the JVM about the type of objects.
The column index is used to speed up range queries (queries within a row) on a wide row by speeding up the lookup of the start of a particular column. Increase this setting if your column keys and values (K/V) are larger. The column index is helpful for pagination too. When a query needs to start scanning from the middle of a wide row, Cassandra uses the column index to seek to the column rather than scanning the whole row to find its location. As mentioned before, the columns inside a row are always kept sorted (the order is controlled by the column comparator setting in the column family). Hence, the index can speed up reads by seeking directly to the start of the requested data.
This is the most important setting, one that almost all Cassandra users modify to get better I/O performance from Cassandra. This setting throttles compaction disk utilization across all the compactions within the node. The faster the data is inserted, the faster it has to be compacted to keep the data and SSTable size low enough for the reads to be less expensive. Compaction is mostly sequential, with the exception that it has to read from four (the default setting) or more SSTs to write a new one. Disabling the throttle (by setting it to 0) will take valuable I/O bandwidth away from the read operations. Some experimentation around the default setting is needed to get the right value. On SSDs, you might want to set it to a higher value; with SSDs, the limiting factor will be the JVM memory.
Cassandra uses a variation of external merge sort to compact the rows in multiple SSTables into one. This setting controls how much memory Cassandra will use for the in-memory part of that sort. The worst case for this sort is a single row that cannot fit in memory. To compact such a wide row, Cassandra switches to a two-phase compaction, which is expensive. When this happens, a log message is printed; it is suggested to look for these messages and tune the in-memory size to be reasonably big enough to accommodate such rows. As a rule of thumb, however, the upper bound of this setting should not surpass 10 to 15 percent of the new generation (a GC setting in cassandra-env.sh) in order to avoid GC pressure, which would affect read and write operations.
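The 10 to 15 percent rule of thumb is easy to turn into numbers. The following sketch assumes the new generation size is the HEAP_NEWSIZE value from cassandra-env.sh expressed in MB; the function name is hypothetical.

```python
def in_memory_compaction_upper_bound_mb(heap_newsize_mb):
    """Return the (10 percent, 15 percent) bounds of the new generation, in MB."""
    return (heap_newsize_mb * 10 // 100, heap_newsize_mb * 15 // 100)


# HEAP_NEWSIZE="800M" from the first example in this section.
print(in_memory_compaction_upper_bound_mb(800))
```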
This setting enables Cassandra to run concurrent compactions for multiple column families in the cluster. By default, it is set based on the number of cores in the system. This might be fine in most cases, but there are cases where you need to allow only one compaction in the system to avoid more random reads (reading multiple files and causing more disk seeks). However, try tuning compaction_throughput_mb_per_sec before trying to optimize the concurrent compactors. All these settings allow Cassandra to be much smarter and safer; most users won't be tuning many of these parameters, and the defaults work perfectly well.
When compacting data from old to new SSTs, the filesystem cache is not populated automatically; it takes several cycles before the OS cache is back to normal. This is usually good because the cache is populated organically rather than Cassandra making that decision. But there are use cases where the whole dataset can fit in memory, and we might want to speed up the process of filling the cache. Enabling this setting will force Cassandra to populate the file cache during compaction and flush, hence reducing the spikes in latency (if any due to compaction and filesystem cache). You can confirm this scenario by closely monitoring the I/O wait time of the filesystem.
To fetch data from disk, Cassandra employs reader (worker) threads that read the bloom filter (BF), the SST indexes, and finally the data from the SSTs. This setting tells Cassandra how many of them can execute concurrently. It enables concurrency for read operations, but setting the number very high will not help read latencies, as most of the threads will end up waiting.
As a rule of thumb, set this to 16 multiplied by the number of drives available for reads or writes. With SSDs, or with a data size that fits in memory, Cassandra readers will be mostly CPU-bound rather than disk-bound. For these use cases, set it as a factor of the CPUs available (for example, 8 * NUM_CPU).
Writes in Cassandra are not I/O-bound (unless there are disk issues that cause the flush writers to back up). Hence, the number of writer threads should be a factor of the number of CPUs available, and it is recommended to use eight multiplied by the number of CPUs available.
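The two rules of thumb for reader and writer threads can be written down directly. This is a sketch only: the function names are hypothetical, and it assumes the results are used for the concurrent_reads and concurrent_writes settings in cassandra.yaml.

```python
def suggested_concurrent_reads(data_drives, cpus, cpu_bound=False):
    """16 per drive for disk-bound reads; 8 per CPU when reads are CPU-bound."""
    if cpu_bound:
        return 8 * cpus
    return 16 * data_drives


def suggested_concurrent_writes(cpus):
    """8 writer threads per available CPU."""
    return 8 * cpus


# A node with two data drives and eight cores.
print(suggested_concurrent_reads(data_drives=2, cpus=8))
print(suggested_concurrent_writes(cpus=8))
```

These are starting points rather than final values; the text above notes that very high reader counts mostly leave threads waiting.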
When Cassandra detects that the JVM heap is filling up faster than the JVM GC is able to free old objects, Cassandra tries to free up its references to memory by flushing the largest memtable to disk. This parameter is an emergency measure to prevent out-of-memory errors. When this happens, Cassandra also logs the incident. When you see this happening, you might want to tune the JVM to avoid it. You might want to look at the memtable_total_space_in_mb setting before trying to increase or decrease this setting. This setting doesn't need tuning in most cases.
index_interval controls the interval at which the index summary is calculated. As mentioned before, an index summary is an in-memory structure for an individual SSTable that allows faster scanning of the SSTable index. Instead of starting the scan from the beginning of the index, Cassandra uses the in-memory structure to jump quickly to the most relevant location (skipping the others) and start there. This setting controls how many index entries lie between the samples that constitute the index summary. Increasing this setting reduces the memory overhead of the index samples but also increases the amount of sequential I/O required to get to the row. Generally, the best tradeoff between memory usage and performance is a value between 128 and 512. If you are dealing with wide rows and the number of rows in the cluster is small, consider reducing the interval.
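The memory versus I/O tradeoff of the sampling interval can be illustrated with a toy model. The sketch below is not Cassandra's implementation: it treats the SSTable index as a sorted Python list of integer keys, and the function names are hypothetical.

```python
from bisect import bisect_right


def build_summary(sorted_keys, index_interval):
    """Sample every index_interval-th key together with its position in the index."""
    return [(sorted_keys[i], i) for i in range(0, len(sorted_keys), index_interval)]


def lookup(sorted_keys, summary, key):
    """Jump to the nearest sample at or before key, then scan the index sequentially.

    Returns (position or None, number of index entries scanned).
    """
    sample_keys = [k for k, _ in summary]
    slot = bisect_right(sample_keys, key) - 1
    if slot < 0:
        return None, 0
    scanned = 0
    for pos in range(summary[slot][1], len(sorted_keys)):
        scanned += 1
        if sorted_keys[pos] == key:
            return pos, scanned
        if sorted_keys[pos] > key:
            break
    return None, scanned


index = list(range(0, 2000, 2))  # 1000 sorted keys
for interval in (128, 512):
    summary = build_summary(index, interval)
    position, scanned = lookup(index, summary, 1000)
    print(interval, len(summary), position, scanned)
```

A larger interval keeps fewer samples in memory but scans more index entries per lookup, which is the tradeoff described above.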
This parameter controls the total memory used by the memtables of all the column families on a given node. This setting needs to be in sync with the GC heap settings in use. While using size-tiered compaction, this setting is a key parameter because you might want to flush SSTables less often. This reduces the number of SSTables to be compacted and provides a little more efficiency with less I/O.
This setting controls the number of flush writers that can be queued before backing up the writes. A flush to disk is sequential and you may not end up tuning this setting unless there are some disk issues that cause the flush to slow down.
This sets the number of memtable flush writer threads. The threads are blocked by disk I/O, and each one holds a memtable in memory while blocked.
This throttles the streaming file transfer from this node to any other node to the specified MB/s. Streaming happens when a node is bootstrapped or when it is repaired (after identifying the parts of the SSTables that need to be streamed because of inconsistency). Cassandra does sequential I/O to stream data between nodes. Streaming tasks are normally performed during bootstrap, move, rebalance, and repair. With sequential I/O on the disk, the network becomes the bottleneck, and Cassandra will probably saturate the network cards; throttling the streams avoids affecting the reads and writes on the cluster.
This method consists of two components:
throttle_limit: This throttles the active requests after the configured number of active requests per client, making sure we are serving all the clients equally. Requests beyond this limit are queued up until running requests have been completed.
default_weight: This denotes the number of requests that will be executed when a client is chosen during the round-robin phase.
This needs to be set when connection pooling is done on the client side or when the client leaves a connection open for a long time. It is not an absolute necessity if the clients are short-lived, so in that case it doesn't have to be enabled.
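To make the round-robin phase governed by default_weight more concrete, here is a simplified sketch. It is an illustration of weighted round-robin in general, not Cassandra's scheduler code; the function name, client names, and request labels are hypothetical, and throttle_limit is not modelled.

```python
from collections import deque


def round_robin_order(requests_by_client, default_weight=1):
    """Return the execution order when each client gets default_weight requests per turn."""
    pending = {client: deque(reqs) for client, reqs in requests_by_client.items()}
    order = []
    while any(pending.values()):
        for queue in pending.values():
            for _ in range(default_weight):
                if queue:
                    order.append(queue.popleft())
    return order


# Client "a" is busy, client "b" sends a single request.
requests = {"a": ["a1", "a2", "a3"], "b": ["b1"]}
print(round_robin_order(requests, default_weight=1))
print(round_robin_order(requests, default_weight=2))
```

With a weight of 1 the single request from "b" runs right after the first request from "a", so one busy client cannot starve the others.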
Cassandra supports two types of RPC servers apart from the native transport. We highly recommend connection pooling while using a sync RPC server:
Since Version 1.2, Cassandra provides an alternative Netty-based native transport implementation in place of Thrift. Setting start_native_transport to true enables the native transport.
Frame size (maximum field length) for thrift is used to create buffers for incoming and outgoing data.
Both the HSHA and sync RPC servers use a thread pool to execute any operation received from the client. This setting sets the maximum number of threads that will be used to execute the operations. Native transport has a similar setting, native_transport_max_threads, for its thread pool. Currently, there is an effort to remove these thread pools in the Cassandra 2.1 release and reduce the overhead.
This sets the minimum thread pool size for remote procedure calls. For native transport, use native_transport_min_threads.
This is the time in milliseconds for which a coordinating node will wait for a reply from other nodes before the command times out. This setting is used heavily by Cassandra internally. Reducing it to a small number might cause thrashing of the connections and resources. It needs to be high enough for operations to complete and low enough that clients can retry in time:
read_request_timeout_in_ms: This controls how long the coordinator should wait for read operations to complete
range_request_timeout_in_ms: This controls how long the coordinator should wait for the range scan request to complete
write_request_timeout_in_ms: This controls how long the coordinator should wait for writes to complete
truncate_request_timeout_in_ms: This controls how long the coordinator should wait for truncates to complete
request_timeout_in_ms: This is the default timeout for other miscellaneous operations
streaming_socket_timeout_in_ms: This defaults to 0 (disabled) and denotes the socket timeout for file streaming. When a timeout occurs during streaming, it is retried. When there are network issues, the socket can hang forever. To avoid this situation, set this setting to time out the connection if the other node doesn't respond in time. Avoid setting this value too low, as it can result in a significant amount of data restreaming, which might in turn lead to too many small compactions on the other node.
The community has spent a lot of time tuning these parameters for better performance. Most of the time you don't need to tune them; however, it is nice to know about them:
dynamic_snitch_badness_threshold: This parameter controls the threshold for dynamically routing requests away from a poorly performing node. The percentage tolerance for the latency is set via this parameter; for example, 0.1 means Cassandra will try to avoid nodes whose overall latencies on operations are 10 percent worse than those of other nodes in the cluster. Until the threshold is reached, incoming client requests are statically routed to the closest replica chosen by the snitch. This setting is the key to pinning all the requests to a particular host. Setting it lower will cause the read operations to be routed indeterminately. Changing it to a higher value will cause a latency increase when the node to which the request is routed is overloaded. During compaction and repair, the severity of those operations is added to the badness score, which causes the coordinator of the request to prefer a different node than the one in the list. The idea is to avoid nodes that have a lower chance of completing the request in time.
dynamic_snitch_reset_interval_in_ms: This describes the interval in milliseconds at which the heuristics of the nodes are reset, allowing the bad nodes to receive traffic again.
dynamic_snitch_update_interval_in_ms: The time interval in milliseconds for calculating the threshold.
phi_convict_threshold: This property controls the sensitivity of the failure detector on an exponential scale. Lower values increase the likelihood of an unresponsive node being marked as down, but care should be taken in environments that have more transient network and node failures. Any hiccup there can cause all the nodes to be marked down, resulting in failures on the client side.
hinted_handoff_enabled: This setting enables or disables hinted handoff. As mentioned before, hinted handoff is one of the mechanisms that keeps the data consistent when a write operation is not performed on a specific node. A hint indicates that the write needs to be replayed to a node that was either down or did not respond to a write request. Hints for a particular node are spread across the whole cluster. When the node comes back online, these hints are replayed by the coordinator to the specific node.
max_hint_window_in_ms: This parameter controls the window for which hints are stored; after this, we start to drop the hints. This setting avoids storing a lot of hints for nodes that won't come back. For a node that is down for a longer period, it might well be cheaper to bootstrap the node than to allow it to replay all the hints.
max_hints_delivery_threads: This tells Cassandra how many threads have to be used to replay the hints. In a multi-DC Cassandra installation, the hints replayed are naturally throttled by network latencies. There are cases where the throttle causes the nodes to replay hints very slowly; hence, having more threads provides better concurrency.
hinted_handoff_throttle_delay_in_ms: When the coordinating node/proxy detects that a data node for which it is holding hints has recovered, it begins sending the hints to that data node. This setting specifies the sleep interval in milliseconds after delivering each hint to the destination. This is a time throttle, as every node in the cluster is trying to replay and hence a network throttle is irrelevant. For larger deployments, increasing this number will help; for smaller deployments, increasing this number will increase the time for the node to become consistent. It is worth noting that hinted handoff is a best-effort mechanism for making the data consistent; if you are worried about consistency, repair the cluster or always bootstrap a downed node. Cassandra also logs messages when it starts to drop hints, and when you see this message in system.log, it is time to run the repair once all the nodes are up.
inter_dc_tcp_nodelay: TCP tries to batch the transfer of data from one node to another to avoid small packet transmission. Most applications disable this feature because it adds latency to every request. But for multi-DC cases where the clusters are geographically distant from each other, this setting helps.
Out of the box, Cassandra provides incremental, snapshot, and commit log backup configurations. Backup in Cassandra is unique because SSTs are immutable and can safely be backed up. Incremental and snapshot backups operate on SSTables directly by creating hard links to the files in a separate directory (which is a low-overhead operation). External scripts can watch the directory and copy the files out of the node for archival. A snapshot of the nodes can be taken via the nodetool command (most users run cron jobs).
Incremental backup, if enabled, will create hard links to the SSTs once they are flushed to disk. The idea is that a snapshot backup is a full backup of the node, and the flushed SSTs hold the new data written to Cassandra since the snapshot was taken.
Compacted SSTs are not backed up because they are just a copy of the flushed SSTs. To restore data into the cluster, the snapshot has to be restored and incremental SSTs can be replayed on top before starting the nodes.
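The hard-link approach is cheap because no data is copied; the backup entry simply points at the same immutable file. The following sketch shows the idea with plain Python on a temporary directory. It is not what Cassandra runs internally, and the function name, the backups directory name, and the file name are used here only for illustration.

```python
import os
import tempfile


def hard_link_backup(data_dir, backup_dir):
    """Hard-link every regular file in data_dir into backup_dir; return the new names."""
    os.makedirs(backup_dir, exist_ok=True)
    linked = []
    for name in sorted(os.listdir(data_dir)):
        source = os.path.join(data_dir, name)
        target = os.path.join(backup_dir, name)
        # Skip directories and files that were already linked on an earlier run.
        if os.path.isfile(source) and not os.path.exists(target):
            os.link(source, target)
            linked.append(name)
    return linked


# Demonstration on a throwaway directory with one made-up SSTable file.
with tempfile.TemporaryDirectory() as data_dir:
    with open(os.path.join(data_dir, "demo-Data.db"), "w") as sstable:
        sstable.write("immutable contents")
    print(hard_link_backup(data_dir, os.path.join(data_dir, "backups")))
```

An external script could then copy the contents of the backup directory off the node, as described above, without touching the live data files.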