The term observability is often used to describe a desirable attribute of distributed systems. Observability means having visibility into the various components of a system in order to detect, predict, and perhaps even prevent the complex failures that can occur in distributed systems. Failures in individual components can affect other components in turn, and multiple failures can interact in unforeseen ways, leading to system-wide outages. Common elements of an observability strategy for a system include metrics, logging, and tracing.
In this chapter, you’ll learn how Cassandra supports these elements of observability and how to use available tools to monitor and understand important events in the life cycle of your Cassandra cluster. We’ll look at some simple ways to see what’s going on, such as changing the logging levels and understanding the output.
To begin, let’s discuss how Cassandra uses the Java Management Extensions (JMX) to expose information about its internal operations and allow the dynamic configuration of some of its behavior. That will give you a basis to learn how to monitor Cassandra with various tools.
Cassandra makes use of Java Management Extensions (JMX) to enable remote management of your nodes. JMX was standardized through the Java Community Process (the core API as JSR 3, with remote access defined by JSR 160) and has been a core part of Java since version 5.0. You can read more about the JMX implementation in Java by examining the java.lang.management package.
JMX is a Java API that provides management of applications in two key ways. First, it allows you to understand your application’s health and overall performance in terms of memory, threads, and CPU usage—things that apply to any Java application. Second, it allows you to work with specific aspects of your application that you have instrumented.
Instrumentation refers to putting a wrapper around application code that provides hooks from the application to the JVM in order to allow the JVM to gather data that external tools can use. Such tools include monitoring agents, data analysis tools, profilers, and more. JMX allows you not only to view such data but also, if the application enables it, to manage your application at runtime by updating values.
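To make the idea of instrumentation concrete, here is a minimal sketch of a custom MBean registered with the JVM's platform MBean server. The names are hypothetical and unrelated to Cassandra's own MBeans (which we'll look at shortly); the point is simply that an interface plus a registration call is all it takes to expose an attribute and an operation to JMX clients.

import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// The management interface: exposes the "RequestCount" attribute and a "reset" operation
interface RequestStatsMBean {
    long getRequestCount();
    void reset();
}

// The managed resource itself; the class name must match the interface name minus "MBean"
class RequestStats implements RequestStatsMBean {
    private long requestCount;

    public synchronized void recordRequest() { requestCount++; }
    public synchronized long getRequestCount() { return requestCount; }
    public synchronized void reset() { requestCount = 0; }
}

public class InstrumentationSketch {
    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        RequestStats stats = new RequestStats();
        // Register under a hypothetical domain; JMX clients can now read the
        // RequestCount attribute and invoke the reset operation remotely
        server.registerMBean(stats, new ObjectName("com.example:type=RequestStats"));
        stats.recordRequest();
        System.out.println("Requests: " + stats.getRequestCount());
    }
}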
Many popular Java applications are instrumented using JMX, including the JVM itself, the Tomcat application server, and Cassandra. Figure 11-1 shows the JMX architecture as used by Cassandra.
The JMX architecture is simple. The JVM collects information from the underlying operating system. The JVM itself is instrumented, so many of its features are exposed including memory management and garbage collection, threading and deadlock detection, classloading, and logging.
An instrumented Java application (such as Cassandra) runs on top of this, also exposing some of its features as manageable objects. The Java Development Kit (JDK) includes an MBean server that makes the instrumented features available over a remote protocol to a JMX management application. The JVM also offers management capabilities via the Simple Network Management Protocol (SNMP), which may be useful if you are using SNMP-based monitoring tools such as Nagios or Zenoss.
By default, Cassandra runs with JMX enabled for local access only. To enable remote access, edit the file <cassandra-home>/conf/cassandra-env.sh (or cassandra-env.ps1 on Windows). Search for "JMX" to find the section of the file with options to control the JMX port and other local/remote connection settings.
Within a given application, you can manage only what the application developers have made available for you to manage. Luckily, the Cassandra developers have instrumented large parts of the database engine, making management via JMX fairly straightforward.
A managed bean, or MBean, is a special type of Java bean that represents a single manageable resource inside the JVM. MBeans interact with an MBean server to make their functions remotely available. Many classes in Cassandra are exposed as MBeans, which in practical terms means they implement a custom interface describing the attributes they expose and the operations that need to be implemented, for which the JMX agent provides hooks.
For example, let's look at Cassandra's CompactionManager from the org.apache.cassandra.db.compaction package and how it uses MBeans. Here's a portion of the definition of the CompactionManagerMBean interface, with comments omitted for brevity:
public interface CompactionManagerMBean {
    public List<Map<String, String>> getCompactions();
    public List<String> getCompactionSummary();
    public TabularData getCompactionHistory();
    public void forceUserDefinedCompaction(String dataFiles);
    public void stopCompaction(String type);
    public void stopCompactionById(String compactionId);
    public int getCoreCompactorThreads();
    public void setCoreCompactorThreads(int number);
    ...
}
Some simple values in the application are exposed as attributes. An example of this is the coreCompactorThreads attribute, for which getter and setter operations are provided. Other attributes, such as the current compactions in progress, the compactionSummary, and the compactionHistory, are read-only. You can refresh to see the most recent values, but that's pretty much all you can do with them. Because these values are maintained internally in the JVM, it doesn't make sense to set them externally (they're derived from actual events, and are not configurable).
MBeans can also make operations available to the JMX agent that let you execute some useful action. The forceUserDefinedCompaction() and stopCompaction() methods are operations that you can use to force a compaction to occur or stop a running compaction from a JMX client.
As you can see from this MBean interface definition, there's no magic going on. This is just a regular interface defining the set of operations. The CompactionManager class implements this interface and does the work of registering itself with the MBean server for the JMX attributes and operations that it maintains locally:
public static final String MBEAN_OBJECT_NAME =
    "org.apache.cassandra.db:type=CompactionManager";

static
{
    instance = new CompactionManager();
    MBeanWrapper.instance.registerMBean(instance, MBEAN_OBJECT_NAME);
}
Note that the MBean is registered in the domain org.apache.cassandra.db with a type of CompactionManager. The attributes and operations exposed by this MBean appear under org.apache.cassandra.db > CompactionManager in JMX clients.
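If you'd rather experiment programmatically than through a graphical JMX client, a small Java program can read these attributes directly. The following is a minimal sketch, assuming JMX is reachable on the local node at the default port 7199 with authentication disabled; the object name is the one registered above, and the attribute names follow from the getters in the MBean interface.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class CompactionManagerClient {
    public static void main(String[] args) throws Exception {
        // Assumes the default local JMX port 7199 and no authentication
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url, null)) {
            MBeanServerConnection connection = connector.getMBeanServerConnection();
            ObjectName name = new ObjectName(
                "org.apache.cassandra.db:type=CompactionManager");
            // Read two of the attributes exposed by CompactionManagerMBean
            Object threads = connection.getAttribute(name, "CoreCompactorThreads");
            Object summary = connection.getAttribute(name, "CompactionSummary");
            System.out.println("Core compactor threads: " + threads);
            System.out.println("Compaction summary: " + summary);
        }
    }
}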
In the following sections, you'll learn about some of the key MBeans that Cassandra exposes to allow monitoring and management via JMX. Many of these MBeans correspond to the services and managers introduced in Chapter 6. In most cases, the operations and attributes exposed by the MBeans are accessible via nodetool commands discussed throughout this book.
These are the Cassandra classes related to the core database itself that are exposed to clients in the org.apache.cassandra.db domain. There are many MBeans in this domain, but we'll focus on a few key ones related to the data the node is responsible for storing, including caching, the commit log, and metadata about specific tables.
Because Cassandra is a database, it's essentially a very sophisticated storage program; therefore, Cassandra's storage engine as implemented in org.apache.cassandra.service.StorageService is an essential focus of monitoring and management. The corresponding MBean for this service is the StorageServiceMBean, which provides many useful attributes and operations.
The MBean exposes identifying information about the node, including its host ID, the cluster name, and the partitioner in use. It also allows you to inspect the node's OperationMode, which reports normal if everything is going smoothly (other possible states are leaving, joining, decommissioned, and client). These attributes are used by nodetool commands such as describecluster and info. You can also view the current set of live nodes, as well as the set of unreachable nodes in the cluster. If any nodes are unreachable, Cassandra will tell you their IP addresses in the UnreachableNodes attribute.
To get an understanding of how much data is stored on each node, you can use the getLoadMapWithPort() method, which will return a Java Map with keys of IP addresses and values of their corresponding storage loads. You can also use the effectiveOwnershipWithPort(String keyspace) operation to access the percentage of the data in a keyspace owned by each node. This information is used in the nodetool ring command.
If you're looking for which nodes own a certain partition key, you can use the getNaturalEndpointsWithPort() operation. Pass it the keyspace name, table name, and the partition key for which you want to find the endpoint value, and it will return a list of IP addresses (with port number) that are responsible for storing this key.
You can also use the describeRingWithPortJMX() operation to get a list of token ranges in the cluster, including their ownership by nodes in the cluster. This is used by the nodetool describering operation.
There are many standard maintenance operations that the StorageServiceMBean affords you, including resumeBootstrap(), joinRing(), flush(), truncate(), repairAsync(), cleanup(), scrub(), drain(), removeNode(), decommission(), as well as operations to start and stop gossip and the native transport. We'll dig into the nodetool commands that correspond to these operations in Chapter 12.
If you want to change Cassandra's log level at runtime without interrupting service (as you'll see in "Logging"), you can invoke the setLoggingLevel(String classQualifier, String level) method.
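As an illustration of invoking an MBean operation (rather than just reading attributes), here is a hedged sketch that calls setLoggingLevel() on the StorageService MBean over JMX. It assumes the default local JMX endpoint at 127.0.0.1:7199 with no authentication; the logger name and level shown are examples only.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class SetLoggingLevelExample {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url, null)) {
            MBeanServerConnection connection = connector.getMBeanServerConnection();
            ObjectName storageService = new ObjectName(
                "org.apache.cassandra.db:type=StorageService");
            // Raise the compaction package to DEBUG on this node only;
            // the change lasts until the node is restarted
            connection.invoke(storageService, "setLoggingLevel",
                new Object[] { "org.apache.cassandra.db.compaction", "DEBUG" },
                new String[] { "java.lang.String", "java.lang.String" });
        }
    }
}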
As you learned in Chapter 6, the org.apache.cassandra.service.StorageProxy provides a layer on top of the StorageService to handle client requests and internode communications. The StorageProxyMBean provides the ability to check and set timeout values for various operations, including read and write. As with many other attributes exposed by Cassandra's MBeans, these timeout values are originally specified in the cassandra.yaml file. Setting these attributes takes effect only until the next time the node is restarted, whereupon they'll be reinitialized to the values in the configuration file.
This MBean also provides access to hinted handoff settings, such as the maximum time window for storing hints. Hinted handoff statistics include getTotalHints() and getHintsInProgress(). You can disable hints for nodes in a particular data center with the disableHintsForDC() operation.
You can also turn this node's participation in hinted handoff on or off via setHintedHandoffEnabled(), or check the current status via getHintedHandoffEnabled(). These are used by nodetool's enablehandoff, disablehandoff, and statushandoff commands, respectively.
In addition to the hinted handoff operations on the StorageServiceMBean, Cassandra provides more hinted handoff controls via the org.apache.cassandra.hints.HintsServiceMBean.
This MBean exposes the ability to pause and resume hint delivery to all nodes with pauseDispatch() and resumeDispatch(). You can delete stored hints for all nodes with the deleteAllHints() operation, or for a specific node with deleteAllHintsForEndpoint(). These are used by nodetool's pausehandoff, resumehandoff, and truncatehints commands.
Cassandra registers an instance of org.apache.cassandra.db.ColumnFamilyStoreMBean for each table stored in the node under org.apache.cassandra.db > Tables (this is a legacy name: tables were known as column families in early versions of Cassandra).
The ColumnFamilyStoreMBean provides access to the compaction and compression settings for each table. This allows you to temporarily override these settings on a specific node. The values will be reset to those configured in the table schema when the node is restarted.
The MBean also exposes a lot of information about the node's storage of data for this table on disk. The getSSTableCountPerLevel() operation provides a list of how many SSTables are in each tier. The estimateKeys() operation provides an estimate of the number of partitions stored on this node. Taken together, this information can give you some insight as to whether invoking the forceMajorCompaction() operation for this table might help free space on this node and increase read performance.
There is also a trueSnapshotsSize() operation that allows you to determine the size of SSTable snapshots that are no longer active. A large value here indicates that you should consider deleting these snapshots, possibly after making an archive copy.
Because Cassandra stores indexes as tables, there is also a ColumnFamilyStoreMBean instance for each indexed column, available under org.apache.cassandra.db > IndexTables (previously IndexColumnFamilies), with the same attributes and operations.
The org.apache.cassandra.db.commitlog.CommitLogMBean exposes attributes and operations that allow you to learn about the current state of commit logs. The CommitLogMBean also exposes the recover() operation, which can be used to restore database state from archived commit log files.
The default settings that control commit log recovery are specified in the conf/commitlog_archiving.properties file, but can be overridden via the MBean. You’ll learn more about data recovery in Chapter 12.
We've already taken a peek inside the source of the org.apache.cassandra.db.compaction.CompactionManagerMBean to see how it interacts with JMX, but we didn't really talk about its purpose. This MBean allows us to get statistics about compactions performed in the past, and to force compaction of specific SSTable files we identify by calling the forceUserDefinedCompaction() method of the CompactionManager class. This MBean is leveraged by nodetool commands including compact, compactionhistory, and compactionstats.
The org.apache.cassandra.service.CacheServiceMBean provides access to Cassandra's key cache, row cache, and counter cache under the domain org.apache.cassandra.db > Caches. The information available for each cache includes the maximum size and time duration to cache items, and the ability to invalidate each cache.
The final MBeans we’ll consider describe internal operations of the Cassandra node, including threading, garbage collection, security, and exposing metrics.
Cassandra's thread pools are implemented via the JMXEnabledThreadPoolExecutor and JMXConfigurableThreadPoolExecutor classes in the org.apache.cassandra.concurrent package. The MBeans for each stage implement the JMXEnabledThreadPoolExecutorMBean and JMXConfigurableThreadPoolExecutorMBean interfaces, respectively, which allow you to view and configure the number of core threads in each thread pool as well as the maximum number of threads. The MBeans for each type of thread pool appear under the org.apache.cassandra.internal domain to JMX clients.
The JVM's garbage collection processing can impact tail latencies if not tuned properly, so it's important to monitor its performance, as we'll see in Chapter 13. The GCInspectorMXBean appears in the org.apache.cassandra.service domain. It exposes the getAndResetStats() operation, which retrieves and then resets the garbage collection metrics that Cassandra collects on its JVM; this is used by the nodetool gcstats command. It also exposes attributes that control the thresholds at which INFO and WARN logging messages are generated for long GC pauses.
The org.apache.cassandra.auth domain defines the AuthCacheMBean, which exposes operations used to configure how Cassandra caches client authentication records. We'll discuss this MBean in Chapter 14.
The ability to access metrics related to application performance, health, and key activities has become an essential tool for maintaining web-scale applications. Fortunately, Cassandra collects a wide range of metrics on its own activities to help us understand its behavior. These metrics come in several different styles, including counters, gauges, meters, histograms, and timers.
To make its metrics accessible via JMX, Cassandra uses the Dropwizard Metrics open source Java library. Cassandra uses the org.apache.cassandra.metrics.CassandraMetricsRegistry to register its metrics with the Dropwizard Metrics library, which in turn exposes them as MBeans in the org.apache.cassandra.metrics domain. You'll see in "Metrics" below a summary of the specific metrics that Cassandra reports and learn how these can be exposed to metrics aggregation frameworks.
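To make the registration mechanism concrete, here is a minimal sketch of how an application can register a metric with a Dropwizard MetricRegistry and expose it over JMX. This is not Cassandra's actual registration code (CassandraMetricsRegistry wraps this machinery internally); the metric name, domain, and reporter setup are illustrative, and the sketch assumes the Dropwizard Metrics 3.x API (in Metrics 4.x the JMX reporter lives in a separate metrics-jmx module).

import com.codahale.metrics.Counter;
import com.codahale.metrics.JmxReporter;
import com.codahale.metrics.MetricRegistry;

public class MetricsSketch {
    public static void main(String[] args) throws InterruptedException {
        MetricRegistry registry = new MetricRegistry();

        // Register a simple counter; counters, gauges, meters, histograms,
        // and timers are all registered in a similar fashion
        Counter requests = registry.counter("ClientRequest.Read.Count");
        requests.inc();

        // Expose everything in the registry as MBeans under a chosen domain
        JmxReporter reporter = JmxReporter.forRegistry(registry)
            .inDomain("org.apache.cassandra.metrics")
            .build();
        reporter.start();

        // Keep the JVM alive long enough to inspect the MBean in a JMX client
        Thread.sleep(60_000);
        reporter.stop();
    }
}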
We've already explored a few of the commands offered by nodetool in previous chapters, but let's take this opportunity to get properly acquainted.
nodetool ships with Cassandra and can be found in <cassandra-home>/bin. This is a command-line program that offers a rich array of ways to look at your cluster, understand its activity, and modify it. nodetool lets you get statistics about the cluster, see the ranges each node maintains, move data from one node to another, decommission a node, and even repair a node that's having trouble.
Behind the scenes, nodetool uses JMX to access the MBeans described above using a helper class called org.apache.cassandra.tools.NodeProbe. The NodeProbe class connects to the JMX agent at a specified node by its IP address and JMX port, locates MBeans, retrieves their data, and invokes their operations. The NodeToolCmd class in the same package is an abstract class that is extended by each nodetool command to provide access to administrative functionality in an interactive command-line interface.
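If you want to drive the same MBeans from your own Java code, you can reuse the NodeProbe helper rather than writing raw JMX calls. The following is a hedged sketch, assuming a node with JMX listening on 127.0.0.1:7199 and no JMX authentication; the exact NodeProbe constructors and accessors vary by Cassandra version, so treat this as illustrative rather than a stable API.

import org.apache.cassandra.tools.NodeProbe;

public class NodeProbeSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the JMX agent on the local node (default port 7199)
        try (NodeProbe probe = new NodeProbe("127.0.0.1", 7199)) {
            // Report the node's operational mode, as nodetool info does
            System.out.println("Operation mode: " + probe.getOperationMode());
            // Report the node's load (on-disk data size), as nodetool status does
            System.out.println("Load: " + probe.getLoadString());
        }
    }
}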
nodetool uses the same environment settings as the Cassandra daemon: bin/cassandra.in.sh and conf/cassandra-env.sh on Unix (or bin/cassandra.in.bat and conf/cassandra-env.ps1 on Windows). The logging settings are found in the conf/logback-tools.xml file; these work the same way as the Cassandra daemon logging settings found in conf/logback.xml.
Starting nodetool is a breeze. Just open a terminal, navigate to <cassandra-home>, and enter the following command:
$ bin/nodetool help
This causes the program to print a list of available commands, several of which we will cover momentarily. Running nodetool with no arguments is equivalent to the help command. You can also execute help with the name of a specific command to get additional details.
With the exception of the help command, nodetool must connect to a Cassandra node in order to access information about that node or the cluster as a whole. You can use the -h option to identify the IP address of the node to connect to with nodetool. If no IP address is specified, the tool attempts to connect to the default port on the local machine.
If you have a ccm cluster available as discussed in Chapter 10, you can run nodetool commands against specific nodes, for example:
ccm node1 nodetool help
To get more interesting statistics from a cluster as you try out the commands in this chapter yourself, you might want to run your own instance of the Reservation Service introduced in Chapter 7 and Chapter 8.
You can get a variety of information about the cluster and its nodes, which we look at in this section. You can get basic information on an individual node or on all the nodes participating in a ring.
The describecluster command prints out basic information about the cluster, including the name, snitch, and partitioner. For example, here's a portion of the output when run against the cluster created for the Reservation Service using ccm:
$ ccm node1 nodetool describecluster
Cluster Information:
        Name: reservation_service
        Snitch: org.apache.cassandra.locator.SimpleSnitch
        DynamicEndPointSnitch: enabled
        Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
        Schema versions:
                2b88dbfd-6e40-3ef1-af11-d88b6dff2c3b: [127.0.0.4, 127.0.0.3, 127.0.0.2, 127.0.0.1]
...
We've shortened the output a bit for brevity. The "Schema versions" portion of the output is especially important for identifying any disagreements in table definitions, or "schema," between nodes. While Cassandra propagates schema changes through a cluster, any differences are typically resolved quickly, so any lingering schema differences usually correspond to a node that is down or unreachable and needs to be restarted, which you should be able to confirm via the summary statistics on nodes that are also printed out.
A more direct way to identify the nodes in your cluster and what state they're in is to use the status command:
$ ccm node1 nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load        Tokens  Owns (effective)  Host ID                               Rack
UN  127.0.0.1  251.77 KiB  256     48.7%             d23716cb-a6e2-423a-a387-f26f5b623ae7  rack1
UN  127.0.0.2  250.28 KiB  256     50.0%             635f2ab7-e81a-423b-a566-674d8010c819  rack1
UN  127.0.0.3  250.47 KiB  256     53.6%             a1cd5663-02a3-4038-8c34-0ce6de397da7  rack1
UN  127.0.0.4  403.46 KiB  256     47.7%             b493769e-3dce-47d7-bae6-1bed956cf560  rack1
The status is organized by data center and rack. Each node's status is identified by a two-character code, in which the first character indicates whether the node is up (currently available and ready for queries) or down, and the second character indicates the state or operational mode of the node. The Load column represents the byte count of the data each node is holding, and the Owns column indicates the effective percentage of the token range owned by the node, taking replication into account.
The info command tells nodetool to connect with a single node and get basic data about its current state. Just pass it the address of the node you want info for:
$ ccm node2 nodetool info
ID                     : 635f2ab7-e81a-423b-a566-674d8010c819
Gossip active          : true
Native Transport active: true
Load                   : 250.28 KiB
Generation No          : 1574894395
Uptime (seconds)       : 146423
Heap Memory (MB)       : 191.69 / 495.00
Off Heap Memory (MB)   : 0.00
Data Center            : datacenter1
Rack                   : rack1
Exceptions             : 0
Key Cache              : entries 10, size 896 bytes, capacity 24 MiB, 32 hits, 44 requests, 0.727 recent hit rate, 14400 save period in seconds
Row Cache              : entries 0, size 0 bytes, capacity 0 bytes, 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
Counter Cache          : entries 0, size 0 bytes, capacity 12 MiB, 0 hits, 0 requests, NaN recent hit rate, 7200 save period in seconds
Chunk Cache            : entries 16, size 256 KiB, capacity 91 MiB, 772 misses, 841 requests, 0.082 recent hit rate, NaN microseconds miss latency
Percent Repaired       : 100.0%
Token                  : (invoke with -T/--tokens to see all 256 tokens)
The information reported includes the memory and disk usage ("Load") of the node and the status of various services offered by Cassandra. You can also check the status of individual services with the nodetool commands statusgossip, statusbinary, and statushandoff (note that handoff status is not part of info).
To determine what nodes are in your ring and what state they're in, use the ring command on nodetool, like this:
$ ccm node1 nodetool ring

Datacenter: datacenter1
==========
Address    Rack   Status  State   Load        Owns    Token
                                                      9218490134647118760
127.0.0.1  rack1  Up      Normal  251.77 KiB  48.73%  -9166983985142207552
127.0.0.4  rack1  Up      Normal  403.46 KiB  47.68%  -9159867377343852899
127.0.0.2  rack1  Up      Normal  250.28 KiB  49.99%  -9159653278489176223
127.0.0.1  rack1  Up      Normal  251.77 KiB  48.73%  -9159520114055706114
...
This output is organized in terms of vnodes. Here we see the IP addresses of all the nodes in the ring. In this case, we have four nodes, all of which are up (currently available and ready for queries). The Load column represents the byte count of the data each node is holding. The output of the describering command is similar but is organized around token ranges.
Other useful status commands provided by nodetool include:
The getlogginglevels and setlogginglevel commands allow dynamic configuration of logging levels, using the Logback JMXConfiguratorMBean we discussed previously.
The gossipinfo command prints the information this node disseminates about itself and has obtained from other nodes via gossip, while failuredetector provides the Phi failure detection value calculated for other nodes.
The version command prints the version of Cassandra this node is running.
nodetool also lets you gather statistics about the state of your server at the aggregate level as well as down to the level of specific keyspaces and tables. Two of the most frequently used commands are tpstats and tablestats, both of which we examine now.
The tpstats tool gives us information on the thread pools that Cassandra maintains. Cassandra is highly concurrent, and optimized for multiprocessor/multicore machines, so understanding the behavior and health of the thread pools is important to good Cassandra maintenance.
To find statistics on the thread pools, execute nodetool with the tpstats command:
$ ccm node1 nodetool tpstats
Pool Name           Active  Pending  Completed  Blocked  All time blocked
ReadStage                0        0        399        0                 0
MiscStage                0        0          0        0                 0
CompactionExecutor       0        0      95541        0                 0
MutationStage            0        0          0        0                 0
...

Message type      Dropped  Latency waiting in queue (micros)
                                50%      95%      99%      Max
READ_RSP                0      0.00     0.00     0.00     0.00
RANGE_REQ               0      0.00     0.00     0.00     0.00
PING_REQ                0      0.00     0.00     0.00     0.00
_SAMPLE                 0      0.00     0.00     0.00     0.00
The top portion of the output presents data on tasks in each of Cassandra's thread pools. You can see directly how many operations are in what stage, and whether they are active, pending, or completed. For example, by reviewing the number of active tasks in the MutationStage, you can learn how many writes are in progress.
The bottom portion of the output indicates the number of dropped messages for the node. Dropped messages are an indicator of Cassandra's load shedding implementation, which each node uses to defend itself when it receives more requests than it can handle. For example, internode messages that are received by a node but not processed within the rpc_timeout are dropped rather than processed, as the coordinator node will no longer be waiting for a response.
Seeing lots of zeros in the output for blocked tasks and dropped messages means either that you have very little activity on the server or that Cassandra is doing an exceptional job of keeping up with the load. Lots of nonzero values are indicative of situations where Cassandra is having a hard time keeping up, and may indicate a need for some of the techniques described in Chapter 13.
To see overview statistics for keyspaces and tables, you can use the tablestats command. You may also recognize this command by its previous name, cfstats. Here is sample output for the reservations_by_confirmation table:
$ ccm node1 nodetool tablestats reservation.reservations_by_confirmation
Total number of tables: 43
----------------
Keyspace : reservation
        Read Count: 0
        Read Latency: NaN ms
        Write Count: 0
        Write Latency: NaN ms
        Pending Flushes: 0
                Table: reservations_by_confirmation
                SSTable count: 0
                Old SSTable count: 0
                Space used (live): 0
                Space used (total): 0
                Space used by snapshots (total): 0
                Off heap memory used (total): 0
                SSTable Compression Ratio: -1.0
                Number of partitions (estimate): 0
                Memtable cell count: 0
                Memtable data size: 0
                Memtable off heap memory used: 0
                Memtable switch count: 0
                Local read count: 0
                Local read latency: NaN ms
                Local write count: 0
                Local write latency: NaN ms
                Pending flushes: 0
                Percent repaired: 100.0
                Bloom filter false positives: 0
                Bloom filter false ratio: 0.00000
                Bloom filter space used: 0
                Bloom filter off heap memory used: 0
                Index summary off heap memory used: 0
                Compression metadata off heap memory used: 0
                Compacted partition minimum bytes: 0
                Compacted partition maximum bytes: 0
                Compacted partition mean bytes: 0
                Average live cells per slice (last five minutes): NaN
                Maximum live cells per slice (last five minutes): 0
                Average tombstones per slice (last five minutes): NaN
                Maximum tombstones per slice (last five minutes): 0
                Dropped Mutations: 0
We can see the read and write latency and total number of reads and writes. We can also see detailed information about Cassandra’s internal structures for the table, including memtables, Bloom filters and SSTables. You can get statistics for all the tables in a keyspace by specifying just the keyspace name, or specify no arguments to get statistics for all tables in the cluster.
In the 4.0 release, Cassandra added a virtual tables feature. Virtual tables are so named because they are not actual tables that are stored using Cassandra’s typical write path with data written to memtables and SSTables. Instead, these virtual tables are views that provide metadata about nodes and tables via standard CQL.
This metadata is available via two keyspaces, which you may have noticed in earlier chapters when we used the DESCRIBE KEYSPACES command, called system_views and system_virtual_schema:
cqlsh> DESCRIBE KEYSPACES;

reservation    system_traces  system_auth  system_distributed  system_views
system_schema  system         system_virtual_schema
These two keyspaces contain virtual tables that provide different types of metadata. Before we look into them, here are a couple of important things you should know about virtual tables:
You may not define your own virtual tables.
The scope of virtual tables is the local node.
When interacting with virtual tables through cqlsh, results will come from the node that cqlsh connected to, as you'll see next.
Virtual tables are not persisted, so any statistics will be reset when the node restarts.
Let's look first at the tables in the system_virtual_schema keyspace:
cqlsh:system> USE system_virtual_schema;
cqlsh:system_virtual_schema> DESCRIBE TABLES;

keyspaces  columns  tables
If you examine the schema and contents of the keyspaces table, you'll see that the schema of this table is extremely simple: it's just a list of keyspace names.
cqlsh:system_virtual_schema> SELECT * FROM KEYSPACES;

 keyspace_name
-----------------------
           reservation
          system_views
 system_virtual_schema

(3 rows)
The design of the tables table is quite similar, consisting of keyspace_name, table_name, and comment columns, in which the primary key is (keyspace_name, table_name).
The columns table is more interesting. We’ll focus on a subset of the available columns:
cqlsh:system_virtual_schema> SELECT column_name, clustering_order, kind, position, type
    FROM columns
    WHERE keyspace_name = 'reservation'
    AND table_name = 'reservations_by_hotel_date';

 column_name         | clustering_order | kind          | position | type
---------------------+------------------+---------------+----------+----------
 confirmation_number |             none |       regular |       -1 |     text
 end_date            |             none |       regular |       -1 |     date
 guest_id            |             none |       regular |       -1 |     uuid
 hotel_id            |             none | partition_key |        0 |     text
 room_number         |              asc |    clustering |        0 | smallint
 start_date          |             none | partition_key |        1 |     date

(6 rows)
As you can see, this query provides enough data to describe the schema of a table, including the columns and primary key definition. Although it does not include the table options, the response is otherwise quite similar in content to the output of the cqlsh DESCRIBE operations.
Interestingly, cqlsh traditionally scanned tables in the system keyspace to implement these operations, but was updated in the 4.0 release to use virtual tables.
The second keyspace containing virtual tables is the system_views keyspace. Let's get a list of the available views:
cqlsh:system_virtual_schema> SELECT * from tables where keyspace_name = 'system_views';

 keyspace_name | table_name                 | comment
---------------+----------------------------+------------------------------
  system_views | caches                     | system caches
  system_views | clients                    | currently connected clients
  system_views | coordinator_read_latency   |
  system_views | coordinator_scan_latency   |
  system_views | coordinator_write_latency  |
  system_views | disk_usage                 |
  system_views | internode_inbound          |
  system_views | internode_outbound         |
  system_views | local_read_latency         |
  system_views | local_scan_latency         |
  system_views | local_write_latency        |
  system_views | max_partition_size         |
  system_views | rows_per_read              |
  system_views | settings                   | current settings
  system_views | sstable_tasks              | current sstable tasks
  system_views | thread_pools               |
  system_views | tombstones_per_read        |

(17 rows)
As you can see, there is a mix of tables that provide latency histograms for reads and writes to local storage and when this node is acting as a coordinator.
The max_partition_size and tombstones_per_read tables are particularly useful in helping to identify some of the situations that lead to poor performance in Cassandra clusters, which we'll address in Chapter 12.
The disk_usage view provides the storage used by each table, expressed in mebibytes (1,048,576 bytes). Again, remember this is the storage used for each table on that individual node. Related to this is the max_partition_size, which could be useful in determining whether a node is affected by a wide partition. We'll learn more about how to detect and address these in Chapter 13.
Let's look a bit more closely at a couple of these tables. First, let's have a look at the clients table:
cqlsh:system_virtual_schema> USE system_views;
cqlsh:system_views> SELECT * FROM clients;

 address   | port  | connection_stage | driver_name | driver_version | hostname  | protocol_version | request_count | ssl_cipher_suite | ssl_enabled | ssl_protocol | username
-----------+-------+------------------+-------------+----------------+-----------+------------------+---------------+------------------+-------------+--------------+-----------
 127.0.0.1 | 50631 |            ready |        null |           null | localhost |                4 |           261 |             null |       False |         null | anonymous
 127.0.0.1 | 50632 |            ready |        null |           null | localhost |                4 |           281 |             null |       False |         null | anonymous
As you can see, this table provides information about each client with an active connection to the node including its location and number of requests. This information is useful to make sure the list of clients and their level of usage is in line with what you expect for your application.
Another useful table is settings. It allows you to see the values of configurable parameters for the node, whether set via the cassandra.yaml file or subsequently modified via JMX:
cqlsh:system_views> SELECT * FROM settings LIMIT 10;

 name                                          | value
-----------------------------------------------+---------------------------------------------------------------------
 allocate_tokens_for_keyspace                  | null
 allocate_tokens_for_local_replication_factor  | null
 audit_logging_options_audit_logs_dir          | /Users/jeffreycarpenter/.ccm/reservation_service/node1/logs/audit/
 audit_logging_options_enabled                 | false
 audit_logging_options_excluded_categories     |
 audit_logging_options_excluded_keyspaces      | system,system_schema,system_virtual_schema
 audit_logging_options_excluded_users          |
 audit_logging_options_included_categories     |
 audit_logging_options_included_keyspaces      |
 audit_logging_options_included_users          |

(10 rows)
You'll notice that much of this same information could be accessed via various nodetool commands. However, the value of virtual tables is that they may be accessed from any client using the CQL native protocol, including applications you write using the DataStax Java driver. Of course, there may be values here you don't wish to expose to your clients; we'll discuss how to secure access to keyspaces and tables in Chapter 14.
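For instance, here is a hedged sketch of reading the settings virtual table from application code using the DataStax Java driver 4.x API. The contact point, data center name, and LIMIT are illustrative; remember that the results describe only the node the driver routes this query to.

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.ResultSet;
import com.datastax.oss.driver.api.core.cql.Row;
import java.net.InetSocketAddress;

public class VirtualTableClient {
    public static void main(String[] args) {
        // Assumes a local node listening on the default native transport port
        try (CqlSession session = CqlSession.builder()
                .addContactPoint(new InetSocketAddress("127.0.0.1", 9042))
                .withLocalDatacenter("datacenter1")
                .build()) {
            ResultSet rs = session.execute(
                "SELECT name, value FROM system_views.settings LIMIT 10");
            for (Row row : rs) {
                System.out.printf("%s = %s%n",
                    row.getString("name"), row.getString("value"));
            }
        }
    }
}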
While the 4.0 release provides a very useful set of virtual tables, the Cassandra community has come up with many other virtual table requests, which you can find in the Cassandra Jira:
Set configuration values via the settings table (CASSANDRA-15254)
Hints metadata (CASSANDRA-14795)
Additional table metrics (CASSANDRA-14572)
Access to individual partition sizes (CASSANDRA-12367)
Get a listing of current running queries (CASSANDRA-15241)
Repair status (CASSANDRA-15399)
We expect that the data available via virtual tables will eventually catch up with JMX, and even surpass it in some areas.
If you're interested in the implementation of virtual tables, you can find the code in the org.apache.cassandra.db.virtual package. For more detail on using virtual tables, see Alexander Dejanovski's blog post Virtual tables are coming in Cassandra 4.0.
As we mentioned at the start of this chapter, metrics are a vital component of the observability of the system. It’s important to have access to metrics at the OS, JVM, and application level. The metrics Cassandra reports at the application level include:
Buffer pool metrics describing Cassandra’s use of memory.
CQL metrics including the number of prepared and regular statement executions.
Cache metrics for key, row, and counter caches such as the number of entries versus capacity, as well as hit and miss rates.
Client metrics including the number of connected clients, and information about client requests such as latency, failures, and timeouts.
Commit log metrics including the commit log size and statistics on pending and completed tasks.
Compaction metrics including the total bytes compacted and statistics on pending and completed compactions.
Connection metrics to each node in the cluster including gossip.
Dropped message metrics which are used as part of nodetool tpstats
.
Read repair metrics describing the number of background versus blocking read repairs performed over time.
Storage metrics, including counts of hints in progress and total hints.
Streaming metrics, including the total incoming and outgoing bytes of data streamed to other nodes in the cluster.
Thread pool metrics, including active, completed, and blocked tasks for each thread pool.
Table metrics, including caches, memtables, SSTables, and Bloom filter usage and the latency of various read and write operations, reported at one, five, and fifteen minute intervals.
Keyspace metrics that aggregate the metrics for the tables in each keyspace.
Many of these metrics are used by nodetool commands such as tpstats, tablehistograms, and proxyhistograms. For example, tpstats is simply a presentation of the thread pool and dropped message metrics.
Note that in Cassandra releases through 4.0, the metrics reported are lifetime metrics since the node was started. To reset the metrics on a node, you have to restart it. The JIRA issue CASSANDRA-8433 requests the ability to reset the metrics via JMX and nodetool.
While you can learn a lot about the overall health of your cluster from metrics, logging provides a way to get more specific detail about what's happening in your database so that you can investigate and troubleshoot specific issues. Cassandra uses the Simple Logging Facade for Java (SLF4J) API for logging, with Logback as the implementation. SLF4J provides a facade over various logging frameworks such as Logback, Log4J, and Java's built-in logger (java.util.logging). You can learn more about Logback at http://logback.qos.ch/. Cassandra's default logging configuration is found in the file <cassandra-home>/conf/logback.xml.
Logback is built around the concepts of loggers and appenders. Each class in a Java application has a dedicated logger, plus there are loggers for each level of the package hierarchy as well as a root logger. This allows fine-grained control over logging; you can configure the log level for a particular class, for any level in the package hierarchy, or even at the root level. Logback uses a progression of log levels: ALL < TRACE < DEBUG < INFO < WARN < ERROR < OFF. When you configure a log level for a logger, messages at that log level and greater will be output via appenders (which we'll introduce below). You can see how the logging level for Cassandra's classes is set in the logback.xml file:
<logger name="org.apache.cassandra" level="DEBUG"/>
Note that the root logger defaults to the INFO logging level, so that is the level at which all other classes that use SLF4J will report.
An appender in Logback is responsible for taking generated log messages that match a provided filter and outputting them to some location. According to the default configuration found in logback.xml, Cassandra provides appenders for logging into three different files:
The system.log contains logging messages at the INFO logging level and greater. You've already seen some of the contents of the system.log in Chapter 10 as you were starting and stopping a node, so you know that this log will contain information about nodes joining and leaving a cluster. It also contains information about schema changes.
The debug.log contains more detailed messages useful for debugging, incorporating the DEBUG log level and above. This log can be pretty noisy but provides a lot of useful information about internal activity within a node, including memtable flushing and compaction.
The gc.log contains messages related to the JVM's garbage collection. This is a standard Java GC log file and is particularly useful for identifying long GC pauses. We'll discuss GC tuning in Chapter 13.
The default configuration also describes an appender for the console log, which you can access in the terminal window where you start Cassandra by setting the -f flag (to keep output visible in the foreground of the terminal window).
By default, Cassandra's log files are stored in the logs directory underneath the Cassandra installation directory. If you want to change the location of the logs directory, you can override this value using the CASSANDRA_LOG_DIR environment variable when starting Cassandra, or you can edit the logback.xml file directly.
The default configuration does not pick up changes to the logging settings on a live node. You can ask Logback to rescan the configuration file once a minute by setting properties in the logback.xml file:
<configuration scan="true" scanPeriod="60 seconds">
You may also view the log levels on a running node through the nodetool getlogginglevels command and override the log level for the logger at any level of the Java package and class hierarchy using nodetool setlogginglevel.
Other settings in the logback.xml file support rolling log files. By default, the logs are configured to use the SizeAndTimeBasedRollingPolicy. Each log file will be rolled to an archive once it reaches a size of 50 MB or at midnight, whichever comes first, with a maximum of 5 GB across all system logs. For example, look at the configuration of the rolling policy for the system.log:
<rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
  <fileNamePattern>${cassandra.logdir}/system.log.%d{yyyy-MM-dd}.%i.zip</fileNamePattern>
  <maxFileSize>50MB</maxFileSize>
  <maxHistory>7</maxHistory>
  <totalSizeCap>5GB</totalSizeCap>
</rollingPolicy>
Each log file archive is compressed in zip format and named according to the pattern described above, which will lead to files named system.log.2020-05-30.0.zip, system.log.2020-05-30.1.zip, and so on. These archived log files are kept for 7 days by default. The default settings may well be suitable for development and production environments; just make sure you account for the storage that will be required in your planning.
You can examine log files in order to determine things that are happening with your nodes. One of the most important tasks in monitoring Cassandra is to regularly check log files for statements at the WARN and ERROR log levels. Several of the conditions under which Cassandra generates WARN log messages are configurable via the cassandra.yaml file:
The tombstone_warn_threshold property sets the maximum number of tombstones that Cassandra will encounter on a read before generating a warning. This value defaults to 1000.
The batch_size_warn_threshold_in_kb property sets the maximum size of the total data in a batch command, which is useful for detecting clients that might be trying to insert a quantity of data in a batch that will negatively impact the coordinator node's performance. The default value is 5 KB.
The gc_warn_threshold_in_ms property sets the maximum garbage collection pause that will cause a warning log. This defaults to 1000 ms (1 second), and the corresponding setting for INFO log messages, gc_log_threshold_in_ms, is set at 200 ms.
Here’s an example of a message you might find in the logs for a query that encounters a large number of tombstones:
WARN [main] 2020-04-08 14:30:45,111 ReadCommand.java:598 - Read 0 live rows and 3291 tombstone cells for query SELECT * FROM reservation.reservations_by_hotel_date (see tombstone_warn_threshold)
As with the use of metrics aggregation, it’s also frequently helpful to aggregate logs across multiple microservices and infrastructure components in order to analyze threads of related activity correlated by time. There are many commercial log aggregation solutions available, and the ELK stack consisting of Elasticsearch, Logstash and Kibana is a popular combination of open source projects used for log aggregation and analysis.
An additional step beyond basic aggregation is the ability to perform distributed traces of calls throughout a system. This involves incorporating a correlation ID into metadata passed on remote calls or messages between services. The correlation ID is a unique identifier, typically assigned by a service at the entry point into the system. The correlation ID can be used as a search criteria through aggregated logs to identify work performed across a system associated with a particular request. You’ll learn more about tracing with Cassandra in Chapter 13.
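To make the correlation ID idea concrete, here is a hedged sketch using SLF4J's Mapped Diagnostic Context (MDC), which Logback supports; the class names, MDC key, and logging statements are illustrative assumptions rather than part of any particular framework.

import java.util.UUID;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class CorrelationIdExample {
    private static final Logger logger =
        LoggerFactory.getLogger(CorrelationIdExample.class);

    public void handleRequest(String incomingCorrelationId) {
        // Reuse the caller's ID if one was passed in; otherwise this service
        // is the entry point into the system and assigns a new one
        String correlationId = (incomingCorrelationId != null)
            ? incomingCorrelationId
            : UUID.randomUUID().toString();
        MDC.put("correlationId", correlationId);
        try {
            logger.info("Processing reservation request");
            // ... call downstream services, passing correlationId along in
            // request metadata so their logs carry the same identifier ...
        } finally {
            MDC.remove("correlationId");
        }
    }
}

With a Logback pattern that includes %X{correlationId}, every line logged while the MDC entry is set carries the identifier, which makes it straightforward to search aggregated logs for all the work associated with a single request.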
You can also observe the regular operation of the cluster through the log files. For example, you could connect to a node in a cluster started using ccm as described in Chapter 10 and write a simple value to the database using cqlsh:
$ ccm node2 cqlsh
cqlsh> INSERT INTO reservation.reservations_by_confirmation (confirmation_number,
    hotel_id, start_date, end_date, room_number, guest_id)
    VALUES ('RS2G0Z', 'NY456', '2020-06-08', '2020-06-10', 111,
    1b4d86f4-ccff-4256-a63d-45c905df2677);
If you execute this command, cqlsh will use node2 as the coordinator (a behavior of cqlsh enabled by the Python driver), so we can check the logs for node2:
$ tail ~/.ccm/reservation_service/node2/logs/system.log
INFO  [Messaging-EventLoop-3-5] 2019-12-07 16:00:45,542 OutboundConnection.java:1135 - 127.0.0.2:7000(127.0.0.3:7000)->127.0.0.3:7000(127.0.0.3:62706)-SMALL_MESSAGES-627a8d80 successfully connected, version = 12, framing = CRC, encryption = disabled
INFO  [Messaging-EventLoop-3-10] 2019-12-07 16:00:45,545 OutboundConnection.java:1135 - 127.0.0.2:7000(127.0.0.4:7000)->127.0.0.4:7000(127.0.0.4:62707)-SMALL_MESSAGES-5bc34c55 successfully connected, version = 12, framing = CRC, encryption = disabled
INFO  [Messaging-EventLoop-3-8] 2019-12-07 16:00:45,593 InboundConnectionInitiator.java:450 - 127.0.0.1:7000(127.0.0.2:62710)->127.0.0.2:7000-SMALL_MESSAGES-9e9a00e9 connection established, version = 12, framing = CRC, encryption = disabled
INFO  [Messaging-EventLoop-3-7] 2019-12-07 16:00:45,593 InboundConnectionInitiator.java:450 - 127.0.0.3:7000(127.0.0.2:62709)->127.0.0.2:7000-SMALL_MESSAGES-e037c87e connection established, version = 12, framing = CRC, encryption = disabled
This output shows connections initiated from node2 to the other nodes in the cluster to write replicas, and the corresponding responses. If you examine the debug.log, you'll see similar information, but not the details of the specific query that was executed.
If you want more detail on exact CQL query strings that are used by client applications, use the full query logging feature introduced in Cassandra 4.0. The full query log is a binary log which is designed to be extremely fast and add the minimum possible overhead to your queries. Full query logging is also useful for live traffic capture and replay.
To enable full query logging on a node, create a directory to hold the logs and then set the full_query_logging_options in the cassandra.yaml file to point to the directory:
full_query_logging_options:
    log_dir: /var/tmp/fql_logs
Other configuration options allow you to control how often the log is rolled over to a new file (hourly by default), specify a command used to archive the log files, and set a size limit for the full query logs. The full query log will not be enabled until you run the nodetool enablefullquerylog command.
Cassandra provides the fqltool utility in the tools/bin directory to read these logs. Here's an example of what the output looks like after running some simple queries:
$ tools/bin/fqltool dump /var/tmp/fql_logs
Type: single-query
Query start time: 1575842591188
Protocol version: 4
Generated timestamp:-9223372036854775808
Generated nowInSeconds:1575842591
Query: INSERT INTO reservation.reservations_by_confirmation (confirmation_number,
hotel_id, start_date, end_date, room_number, guest_id) VALUES ('RS2G0Z', 'NY456',
'2020-06-08', '2020-06-10', 111, 1b4d86f4-ccff-4256-a63d-45c905df2677);
Values:

Type: single-query
Query start time: 1575842597849
Protocol version: 4
Generated timestamp:-9223372036854775808
Generated nowInSeconds:1575842597
Query: SELECT * FROM reservation.reservations_by_confirmation ;
Values:
Once you're done collecting full query logs, run the nodetool disablefullquerylog command.
In this chapter, we looked at ways you can monitor and manage your Cassandra cluster. In particular, we went into some detail on JMX and learned the rich variety of operations Cassandra makes available to the MBean server. We saw how to use nodetool, virtual tables, metrics, and logs to view what's happening in your Cassandra cluster. You are now ready to learn how to perform routine maintenance tasks to help keep your Cassandra cluster healthy.