Downloading/choosing binaries to install

Install the Oracle JRE from oracle.com, where you can find the latest version. Download the Cassandra binaries from http://cassandra.apache.org/download/; you have the choice of installing an Ubuntu package or a tar archive from the site. DataStax also provides an installation package, which is distributed at http://www.datastax.com/download.

Follow the instructions to install it; once installed, you will find the configuration files in the following locations:

  • Configuration directory: /etc/cassandra/conf or {CASSANDRA_HOME}/conf
  • Cassandra packaged installs: /etc/cassandra/conf
  • Cassandra binary installs: <install_location>/conf
  • DataStax Enterprise packaged installs: /etc/dse/cassandra

Configuring cassandra-env.sh

It is advisable to edit cassandra-env.sh and adjust the following settings (uncomment the lines to override the automatically calculated values):

  • #MAX_HEAP_SIZE="4G"
  • #HEAP_NEWSIZE="800M"

It takes a lot of testing to get the GC configuration right, but as a rule of thumb, the default configuration works well for most use cases. We have seen that the following configuration is a more conservative choice if you have enough memory in the system:

  • #MAX_HEAP_SIZE="8G"
  • #HEAP_NEWSIZE="1G"

Enable GC logging and look for concurrent mode failures. VisualGC provides visualizations that help in understanding the GC cycles. The idea is to make sure that most of the garbage is collected in the early generations, where collection is less expensive. Tuning GC is a complicated and lengthy topic by itself and is out of the scope of this book, but constantly testing various settings at your peak traffic will help mitigate any surprises.

Configuring cassandra.yaml

The cassandra.yaml file is well documented and largely self-explanatory, but we will talk about a few items in the YAML that will help users configure Cassandra.

cluster_name

cluster_name is the name of the cluster. Cassandra checks the cluster name before it starts any gossip communication with its peer nodes, which prevents nodes from different clusters from accidentally joining each other.

seed_provider

The seed node has no purpose other than bootstrapping the gossip process for new nodes joining the cluster; seed nodes are not used beyond that. Once a node has bootstrapped, the cluster information is stored on disk and reused on subsequent restarts. While bootstrapping a node into a live cluster, make sure that its own address is not in the seed list (a node that lists itself as a seed will not bootstrap). The following shows a sample YAML configuration:

- class_name: org.apache.cassandra.locator.SimpleSeedProvider
  parameters:
      - seeds: "<IP ADD OF SEED>,<IP ADD OF SEED>"

Partitioner

A partitioner distributes rows across nodes in the cluster. Any IPartitioner may be used as long as it is on the classpath. Cassandra supports the following partitioners:

  • Murmur3Partitioner: This is the default random partitioner; the row key is hashed (using the Murmur3 hash) to find its position in the cluster.
  • ByteOrderedPartitioner: This orders rows lexically by key bytes (highly discouraged, since this partitioner can cause hot spots where some nodes in the cluster hold a lot more data than others).
  • RandomPartitioner: This uses the MD5 algorithm to hash the keys; Murmur3Partitioner is faster than MD5. RandomPartitioner is kept for backward compatibility and is deprecated.
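
A minimal cassandra.yaml entry selecting the default partitioner looks like the following (this is the fully qualified class name shipped with Cassandra 1.2 and later):

partitioner: org.apache.cassandra.dht.Murmur3Partitioner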

auto_bootstrap

Enabling this configuration makes new Cassandra nodes automatically stream the right data to themselves, provided that the cluster is already up and running. Cassandra will ignore this setting when the node is configured as a seed.

You can also re-bootstrap an existing node by deleting all the data directories and starting the node with -Dcassandra.replace_token=<tokens> (you can pass the -D option to the startup script). Essentially, the replacement node goes into bootstrap mode and streams the required data into itself; once it is online, the hints stored for that node are replayed. This setting is very helpful in environments such as EC2, where node recovery is difficult after the node dies (since the data is stored on ephemeral drives). Again, since the advent of vnodes, it is easier to decommission the old node and bootstrap a new one.

broadcast_address

If your Cassandra cluster is deployed in an environment where each node has two IPs (one for communication within the DC and the other for communication across DCs), you might want to configure this. The best example is AWS, where every instance gets a private and a public IP. This configuration needs to point to the public IP for inter-DC communication.
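
For example, in an AWS-style deployment the private address goes into listen_address and the public one into broadcast_address. The addresses below are purely illustrative:

listen_address: 10.0.0.5          # private IP, used for intra-DC traffic
broadcast_address: 54.210.12.34   # public IP, advertised to nodes in other DCs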

commitlog_directory

Writes to the commit log are sequential appends to the commit log partition. It is recommended to put the commit log on a separate disk if one is available, but this is not always required.

data_file_directories

Make sure to change the names of the data directories as needed. You can specify a list of directories to be used, and these directories can be on different physical drives. You can also configure a column family to live in a specific directory. Please note that Cassandra will balance the data files between the configured directories, but if you want more control over the way blocks are distributed between the drives, you might want to use hardware RAID, which comes with some overhead.

There will be cases where it makes more sense to use this configuration, for example when you have a server with both spinning drives and SSDs and want to control where the data lives.
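
A sketch of these settings in cassandra.yaml might look like the following; the paths are examples only:

commitlog_directory: /var/lib/cassandra/commitlog
data_file_directories:
    - /mnt/disk1/cassandra/data
    - /mnt/disk2/cassandra/data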

disk_failure_policy

The default value (stop) shuts down gossip and Thrift when an I/O error or disk failure is detected, in order to avoid exposing corrupt or bad data to other nodes. You can change it to best_effort or ignore depending on how you want to deal with errors. It is better to detect errors and stop the node than to discover issues later.

initial_token

initial_token assigns the node its token position in the ring, and hence a range of data, when it first starts up. Since the introduction of vnodes (Version 1.2 and above), this configuration is typically not used; vnodes provide automatic token management. Instead, you enable the num_tokens parameter to control the number of token ranges a node will own. Increasing this number results in lots of smaller ranges, with the caveat that range queries on keys will be more expensive.

While using vnodes, there are more partitions than hosts, and these partitions are distributed across the hosts in a many-to-one mapping, providing faster bootstrap and automatic load balancing. In a traditional distributed-hash-table-style Cassandra installation, nodes and tokens have a one-to-one relationship; hence, while bootstrapping new nodes into an existing cluster, we need to rebalance the cluster, which is an expensive operation. Vnode partitions are randomly distributed across all the nodes.

Losing a node / decommissioning / adding a node to the cluster will evenly redistribute the load to (essentially) all the nodes in the cluster. This drastically reduces the time taken for recovery from failures.

Vnodes work great for almost all cases, but if you want to disable them for any reason, you can leave num_tokens unset and fall back to manual token management with initial_token. There are a lot of tools available online to calculate the tokens and form equally spaced rings.
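
A minimal sketch of both styles in cassandra.yaml (256 is the commonly shipped default; treat the values as examples):

# vnodes enabled
num_tokens: 256

# manual token management instead: leave num_tokens unset and assign a token
# initial_token: <token>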

listen_address/rpc_address

These are the IP addresses the node uses for communication: listen_address for internal node-to-node traffic and rpc_address for client connections. If left empty, Cassandra tries to resolve the local hostname/IP to listen on.

Ports

The following are the ports to be configured and their functionalities:

  • rpc_port: This is the port for Thrift RPC clients
  • native_transport_port: This is the port for the native transport protocol (CQL)
  • storage_port: This is used for internal node-to-node communication
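
The usual defaults for these ports are shown below; adjust them if they clash with other services on the host:

storage_port: 7000
rpc_port: 9160
native_transport_port: 9042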

endpoint_snitch

This sets the snitch that Cassandra uses to place data for safer operation and to avoid data loss during rack or DC failures. The default, SimpleSnitch, has no notion of racks or DCs and is only suitable for simple single-DC deployments. The YAML file has detailed information on the available snitches; you need to choose the right one for your topology.
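
For example, a rack- and DC-aware deployment might use something like the following (GossipingPropertyFileSnitch is one of the snitches shipped with Cassandra; pick whichever matches your environment):

endpoint_snitch: GossipingPropertyFileSnitch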

commitlog_sync

Writes to Cassandra are appended to the commit log and then written to memory (the memtable), after which the write on that node is considered complete. When writes are performed to a disk partition, the OS caches/batches them before it actually performs the disk I/O. The application can call fsync to force a flush to disk, but this is an expensive operation; hence, Cassandra provides two modes of operation:

  • periodic: The commitlog_sync_period_in_ms setting controls how often the commit log is synced to disk; by default, the sync happens every 10 seconds
  • batch: This is used with commitlog_sync_batch_window_in_ms to control how long Cassandra waits for other writes before actually doing the fsync; writes are not acknowledged until they are fsynced to the disk
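
The periodic mode is the shipped default; a batch-mode configuration is sketched in the commented lines (the 50 ms window is only an illustration):

commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
# commitlog_sync: batch
# commitlog_sync_batch_window_in_ms: 50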

commitlog_segment_size_in_mb

This sets the size of the individual commit log segment files. Increase it if you want to reduce the number of commit log segments; the downside of accumulating many commit log segments is that it can increase the cost during startup. On the other hand, if you want an incremental backup of every individual transaction on the system, you might want to keep the segments big enough to allow them to be copied to an external source. The commitlog_archiving.properties file allows you to copy each commit log segment into a separate partition/filesystem for later use. The advantage of commit log backup over incremental Cassandra backup is that it captures all the operations performed on the cluster in order of execution. The properties file is self-explanatory for most of the properties it supports.

commitlog_total_space_in_mb

This controls the total space used for commit log segment files. If the used space goes above this value, Cassandra rotates to the next segment. When the space is full and there are no commit log segments available for recycling, memtables are flushed to disk so that the commit log segments are no longer needed and can be reused.

It is worth noting that commit log segments can be recycled once the corresponding memtables are flushed to disk: the data in those commit logs has already been persisted in SSTables, so the commit log holds redundant information and can safely be thrown away.
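
Commonly used values for these two settings look like the following (32 MB segments are the shipped default; size the total space for your disk layout):

commitlog_segment_size_in_mb: 32
commitlog_total_space_in_mb: 4096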

Key cache and row cache saved to disk

Cassandra allows you to save the row and key caches to disk, which enables a restart of the service to bring the caches back to the same state they were in before the service went down.

The row_cache_size_in_mb and key_cache_size_in_mb settings control the amount of memory used by the row and key caches.
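
The cache sizes and save periods are all set in cassandra.yaml; the values below are examples rather than recommendations:

key_cache_size_in_mb: 100
key_cache_save_period: 14400     # seconds between key cache saves to disk
row_cache_size_in_mb: 0          # 0 disables the row cache
row_cache_save_period: 0         # 0 disables saving the row cache to disk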

compaction_preheat_key_cache

The key cache has an entry per SSTable. When compacting, Cassandra creates new SSTables and invalidates the cached entries that pointed to the old ones. Since we now have new SSTables, this setting makes Cassandra preheat the key cache entries for the newly written SSTables, provided the key cache is enabled. This is advantageous when a user wants the key cache to always be up to date (rather than fault-filled on later reads). Since compaction can invalidate the keys, it might be worth preheating them in some cases.

row_cache_provider

This configuration is not available in 2.0, where Cassandra always uses SerializingCacheProvider. In earlier versions, the two types of cache providers are as follows:

  • SerializingCacheProvider: This serializes the contents of a row and stores it in native (off-heap) memory. Serialized rows take significantly less memory than live rows in the JVM. The main idea behind the off-heap cache is that the GC overhead does not grow as the cache size grows, reducing the stop-the-world GC pauses. Even though the row is stored off-heap, the cache key is stored in Java memory; hence, there is still some heap overhead in using the off-heap cache. Cache providers are pluggable via org.apache.cassandra.cache.IRowCacheProvider. It's worth noting that row_cache_size_in_mb only controls the amount of off-heap storage and doesn't limit the heap storage of the cache keys.
  • ConcurrentLinkedHashCacheProvider: This is a purely in-heap cache; you might find it faster in some use cases where the row sizes are small enough to fit in JVM heap memory, whereas SerializingCacheProvider performs badly in update-heavy workloads. Since the off-heap cache uses native memory allocation and might fragment the memory available for future allocations, and Cassandra relies on the default allocator (glibc malloc) for this, it is worth knowing that tests show jemalloc gives better savings than glibc malloc; you might want to try the memory_allocator configuration in cassandra.yaml.

Inherently, the JVM is not good at managing LRU-style caches where data is frequently freed and new data is added. This fragments JVM memory, causing more GC pressure to compact it, and there is no way to hint the JVM about the type of objects.

column_index_size_in_kb

The column index is used to speed up range queries (queries within a row) on a wide row, that is, to speed up the lookup of the starting column. Increase this setting if your column keys/values are large. The column index is helpful while doing pagination too: when a query needs to start scanning from the middle of a wide row, Cassandra uses the column index to seek to the column faster than scanning the whole row to find its location. As mentioned before, the columns inside a row are always kept sorted (the order is controlled by the column comparator setting on the column family); hence, the index can speed up reads by seeking directly to the start of the requested data.

compaction_throughput_mb_per_sec

This is the most important setting, which almost all Cassandra users modify to get better I/O performance out of Cassandra. It throttles compaction disk utilization across all the compactions running on the node. The faster data is inserted, the faster it should be compacted to keep the data and SSTable count low enough for reads to stay cheap. Compaction is mostly sequential I/O, with the exception that it has to read from four (the default setting) or more SSTables to write a new one. Disabling throttling (by setting it to 0) will take valuable I/O bandwidth away from read operations. Some experimentation around the default setting is needed to find the right value. On SSDs, you might want to set it to a higher value; with SSDs, the limiting factor will be the JVM memory.
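
The shipped default is a modest throttle; 0 disables throttling entirely:

compaction_throughput_mb_per_sec: 16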

in_memory_compaction_limit_in_mb

Cassandra uses a variation of an external merge sort to compact the rows from multiple SSTables into one. This setting controls how much memory Cassandra will use to compact a row entirely in memory. The worst case for this sort is a row that cannot fit in memory; to compact such a wide row, Cassandra switches to a two-phase compaction, which is expensive. When this happens, a log message is printed; it is suggested to look for these messages and tune the in-memory limit to be big enough to accommodate those rows. As a rule of thumb, however, the upper bound of this setting should not surpass 10 to 15 percent of the new generation (a GC setting in cassandra-env.sh) to avoid GC pressure, which would affect read and write operations.

concurrent_compactors

This setting allows Cassandra to run concurrent compactions for multiple column families on a node. By default, it is set based on the number of cores in the system. This is fine in most cases, but there are cases where you need to allow only one compaction at a time to avoid additional random reads (reading multiple files and causing more disk seeks). Try tuning compaction_throughput_mb_per_sec before trying to optimize the concurrent compactors. All these settings allow Cassandra to be much smarter and safer; most users won't tune many of these parameters, and the defaults work perfectly well.

populate_io_cache_on_flush

When compacting data from old to new SSTables, the filesystem cache is not populated automatically; it takes several cycles before the OS cache is back to normal. This is usually good, because the cache is populated organically rather than Cassandra making that decision. But there are use cases where the whole dataset fits in memory and we might want to speed up the process of filling the cache. Enabling this setting forces Cassandra to populate the file cache during compaction and flush, hence reducing any latency spikes caused by compaction evicting the filesystem cache. You can confirm this scenario by closely monitoring the I/O wait time on the filesystem.

concurrent_reads

To fetch data from disk, Cassandra employs reader (worker) threads that read the data (checking bloom filters and SSTable indexes, and finally reading the data from the SSTables); this setting tells Cassandra how many of them can execute concurrently. It enables concurrency for read operations, but increasing this number very high will not help read latencies, as most of the threads will simply end up waiting.

As a rule of thumb, set this to 16 multiplied by the number of drives available for reads and writes. For SSDs, or when the data size fits in memory, Cassandra reads will be mostly CPU-bound rather than disk-bound; for these use cases, set it as a factor of the number of CPUs available (for example, 8 * NUM_CPU).

concurrent_writes

Writes in Cassandra are not I/O bound (unless there are disk issues that cause the flush writers to back up). Hence, the number of writer threads should be a factor of the number of CPUs available; it is recommended to use eight multiplied by the number of CPUs.
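
Applying those rules of thumb to a hypothetical host with two data drives and eight cores would give something like:

concurrent_reads: 32    # 16 * number of drives
concurrent_writes: 64   # 8 * number of CPU cores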

flush_largest_memtables_at

When Cassandra detects that JVM heap usage remains above this threshold despite what the JVM GC is able to free, it tries to release memory by flushing the largest memtable to disk. This parameter is an emergency measure to prevent out-of-memory errors, and Cassandra logs the incident when it happens. When you see this happening, you might want to tune the JVM to avoid it, and you might want to look at the memtable_total_space_in_mb setting before trying to increase or decrease this one. This setting doesn't need tuning in most cases.

index_interval

index_interval controls the interval at which the SSTable index is sampled into the index summary. As mentioned before, an index summary is an in-memory structure per SSTable that allows faster scanning of the SSTable index: instead of starting the scan from the beginning of the index, Cassandra uses the in-memory structure to quickly jump to the most relevant location (skipping the rest). This setting controls the sampling interval (one entry is kept in the summary for every index_interval keys). Increasing it reduces the memory overhead of the index sample but also increases the amount of sequential I/O required to get to a row. Generally, the best tradeoff between memory usage and performance is a value between 128 and 512. If you are dealing with wide rows and the number of rows in the cluster is small, consider reducing the interval.

memtable_total_space_in_mb

This parameter controls the total memory used by all the column family memtables on a given node. It needs to be in sync with the JVM heap settings in use. While using size-tiered compaction, this is a key parameter, because flushing SSTables less often reduces the number of SSTables to be compacted and provides a little more efficiency with less I/O.

memtable_flush_queue_size

This setting controls the number of memtable flushes that can be queued up before writes start backing up. A flush to disk is sequential, and you will probably not end up tuning this setting unless there are disk issues that cause flushes to slow down.

memtable_flush_writers

This sets the number of memtable flush writer threads. The threads are blocked by disk I/O, and each one holds a memtable in memory while blocked.

stream_throughput_outbound_megabits_per_sec

This throttles streaming file transfers from this node to any other node to the specified number of megabits per second. Streaming happens when a node is bootstrapped or repaired (after identifying the parts of the SSTables that need to be streamed because of inconsistency); streaming tasks are also performed during move and rebalance operations. Cassandra does sequential I/O to stream data between nodes, so the network becomes the bottleneck and Cassandra can saturate the network cards; throttling the streams avoids affecting the reads and writes on the cluster.

request_scheduler

The request scheduler can be set to one of the following two implementations:

  • org.apache.cassandra.scheduler.NoScheduler: Cassandra doesn't schedule the requests and serves them in the same order as they are received
  • org.apache.cassandra.scheduler.RoundRobinScheduler: This uses the round-robin algorithm when scheduling the requests with additional options to throttle and add weights on the requests

request_scheduler_options

This setting takes the following two options:

  • throttle_limit: This throttles the active requests after the configured number of active requests per client, making sure we are serving all the clients equally. Requests beyond this limit are queued up until running requests have been completed.
  • default_weight: This denotes the number of requests that will be executed when a client is chosen during the round-robin phase.
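
A sketch of a round-robin configuration, scheduled per keyspace (the throttle and weight values are illustrative only):

request_scheduler: org.apache.cassandra.scheduler.RoundRobinScheduler
request_scheduler_id: keyspace
request_scheduler_options:
    throttle_limit: 80
    default_weight: 5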

rpc_keepalive

This enables TCP keepalive on RPC connections. It needs to be set when clients pool connections or leave a connection open for a long time. It is not an absolute necessity if the clients are short-lived, but it doesn't hurt to keep it enabled.

rpc_server_type

Cassandra supports two types of RPC servers apart from the native transport. We highly recommend connection pooling while using the sync RPC server:

  • sync: This stands for one thread per RPC connection. For a very large number of clients, memory is the limiting factor.
  • hsha: This stands for a half synchronous, half asynchronous RPC server type. A set of threads (based on the number of CPUs) is used to manage the connections; the HSHA server collects the data from the connections (which is usually CPU-bound) and executes the requests in an RPC thread pool. The advantage is that you don't need n threads for n connections as in the sync RPC server. In the latest versions of Cassandra, this setting might provide a better response time (99th percentile).

Since Version 1.2, Cassandra provides an alternative Netty-based native transport implementation as an alternative to Thrift. Setting start_native_transport to true enables the native transport.
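
For example, to run the HSHA Thrift server alongside the native transport:

rpc_server_type: hsha        # or sync
start_native_transport: true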

thrift_framed_transport_size_in_mb

This is the frame size (maximum field length) for Thrift; it is used to size the buffers for incoming and outgoing data.

rpc_max_threads

Both HSHA and sync RPC servers use a thread pool to execute any operation received from the client. This setting sets the maximum number of threads that will be used to execute the operations. Native transport has a similar setting, native_transport_max_threads, for its thread pool. Currently, there is an effort to remove these thread pools in the Cassandra 2.1 release and reduce the overhead.

rpc_min_threads

This sets the minimum thread pool size for remote procedure calls. For native transport, use native_transport_min_threads.

Timeouts

This is the time in milliseconds for which a coordinating node will wait for replies from other nodes before the operation is timed out. These settings are used intensively by Cassandra internally. Reducing them too much might cause thrashing of connections and resources; they need to be high enough for operations to complete, yet small enough that clients can retry in a reasonable time:

  • read_request_timeout_in_ms: This controls how long the coordinator should wait for read operations to complete
  • range_request_timeout_in_ms: This controls how long the coordinator should wait for the range scan request to complete
  • write_request_timeout_in_ms: This controls how long the coordinator should wait for writes to complete
  • truncate_request_timeout_in_ms: This controls how long the coordinator should wait for truncates to complete
  • request_timeout_in_ms: This is the default timeout for other miscellaneous operations
  • streaming_socket_timeout_in_ms: This defaults to 0 (disabled) and denotes the socket timeout for file streaming. When a timeout occurs during streaming, the transfer is retried. When there are network issues, the socket can hang forever; to avoid this situation, set this value so the connection times out if the other node doesn't respond in time. Avoid setting this value too low, as it can result in a significant amount of data being restreamed, which might in turn lead to too many small compactions on the other node.
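
The shipped defaults for these timeouts are in this ballpark (they vary a little between releases, so check your version's cassandra.yaml):

read_request_timeout_in_ms: 5000
range_request_timeout_in_ms: 10000
write_request_timeout_in_ms: 2000
truncate_request_timeout_in_ms: 60000
request_timeout_in_ms: 10000
streaming_socket_timeout_in_ms: 0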

Dynamic snitch

The community has spent a lot of time tuning these parameters for better performance. Most of the time you don't need to tune them; however, it is nice to know about them:

  • dynamic_snitch_badness_threshold: This parameter controls the threshold for dynamically routing requests away from a poorly performing node. It sets the percentage tolerance for latency; for example, 0.1 means Cassandra will try to avoid nodes whose overall latencies are 10 percent worse than those of other nodes in the cluster. Until the threshold is reached, incoming client requests are statically routed to the closest replica chosen by the snitch, so this setting is the key to pinning requests to a particular host. Setting it lower will cause read operations to be routed less deterministically; setting it higher will increase latency when the node the requests are pinned to becomes overloaded. During compaction and repair, the severity of those operations is added to the badness score, which causes the coordinator of the request to prefer a different node than the usual one. The idea is to avoid nodes that have a lower chance of completing the request in time.
  • dynamic_snitch_reset_interval_in_ms: This describes the interval in milliseconds at which the heuristics of the nodes are reset, allowing traffic to the bad nodes to receive traffic again.
  • dynamic_snitch_update_interval_in_ms: The time interval in milliseconds for calculating the threshold.
  • phi_convict_threshold: This property controls the sensitivity of the failure detector on an exponential scale. Lower values increase the likelihood of an unresponsive node being marked as down, but care should be taken in environments with more transient network and node failures: any hiccup there can cause all the nodes to be marked down, resulting in failures on the client side.
  • hinted_handoff_enabled: This setting enables or disables hinted handoff. As mentioned before, the hinted handoff is one of the mechanisms that keeps the data consistent when the write operation is not performed on a specific node. A hint indicates that the write needs to be replayed to a node that was either down or did not respond to a write request. Hints are spread across the whole cluster for a particular node. When the node comes back online, these hints are replayed by the coordinator to the specific node.
  • max_hint_window_in_ms: This parameter controls the window for which hints are stored for a dead node; after this, we start to drop the hints. This avoids storing a lot of hints for nodes that won't come back. For a node that is down for a longer period, it might well be cheaper to re-bootstrap the node than to let it replay all the hints.
  • max_hints_delivery_threads: This will tell Cassandra how many threads have to be used to replay the hints. In a multi-DC Cassandra installation, the hints replayed are naturally throttled by network latencies. There are cases where the throttle causes the nodes to replay hints very slowly; hence, having more threads provides better concurrency.
  • hinted_handoff_throttle_delay_in_ms: When the coordinating node/proxy detects that a data node for which it is holding hints has recovered, it begins sending the hints to that data node. This setting specifies the sleep interval in milliseconds after delivering each hint to the destination. This is a time-based throttle, since every node in the cluster may be replaying hints at once and a pure network throttle would be irrelevant. For larger deployments, increasing this number helps; for smaller deployments, increasing it will increase the time it takes for the node to become consistent. It is worth noting that hinted handoff is only a best effort at making the data consistent; repair the cluster, or always re-bootstrap a downed node, if you are worried about consistency. Cassandra also logs messages when it starts to drop hints; when you see this message in system.log, it is time to run a repair once all the nodes are up.
  • inter_dc_tcp_nodelay: TCP tries to batch the transfer of data from one node to another to avoid small packet transmission. Most of the applications disable this feature because it causes additional latency to every request. But for multi-DC cases where the clusters are geographically distant from each other, this setting helps.
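
The following values are close to the shipped defaults for these settings and are shown only as a reference point; check the cassandra.yaml of your version before relying on them:

dynamic_snitch_badness_threshold: 0.1
dynamic_snitch_update_interval_in_ms: 100
dynamic_snitch_reset_interval_in_ms: 600000
phi_convict_threshold: 8
hinted_handoff_enabled: true
max_hint_window_in_ms: 10800000   # 3 hours
max_hints_delivery_threads: 2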

Backup configurations

Out of the box, Cassandra provides incremental, snapshot, and commit log backup configurations. Backups in Cassandra are unique because SSTables are immutable and can safely be backed up. Incremental and snapshot backups operate on SSTables directly by creating hard links of the files in a separate directory (a low-overhead operation). External scripts can watch the directory and copy the files off the node for archival. A snapshot of a node can be taken via the nodetool command (most users run it from cron jobs).

incremental_backups

Incremental backup, if enabled, creates hard links to the SSTables once they are flushed to disk. The idea is that a snapshot backup is a full backup of the node, and each flushed SSTable is the new data written to Cassandra since the snapshot was taken.

Compacted SSTables are not backed up because they only contain data already present in the flushed SSTables. To restore data into the cluster, the snapshot has to be restored and the incremental SSTables replayed on top of it before starting the nodes.

auto_snapshot

This setting controls whether Cassandra takes a snapshot before a keyspace or column family is truncated or dropped. Keeping it enabled is strongly advised and has saved many users who accidentally dropped CFs via CQL/CLI.
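
Both backup-related flags live in cassandra.yaml; enabling them might look like this:

incremental_backups: true
auto_snapshot: true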
