Chapter 3. Installing Cassandra

For those among us who like instant gratification, we’ll start by installing Cassandra. Because Cassandra introduces a lot of new vocabulary, there might be some unfamiliar terms as we walk through this. That’s OK; the idea here is to get set up quickly in a simple configuration to make sure everything is running properly. This will serve as an orientation. Then, we’ll take a step back and understand Cassandra in its larger context.

Installing the Apache Distribution

While there are a number of options available for installing Cassandra on various operating systems, let’s start our journey by downloading the Apache distribution from http://cassandra.apache.org so we can get a good look at what’s inside. We’ll explore other installation options in “Other Cassandra Distributions”.

Click the link on the Cassandra home page to download a version as a gzipped archive. Typically multiple versions of Cassandra are provided. The latest version is the current recommended version for use in production. There are other supported releases which are still viable for production usage and receive bug fixes. The project goal is to limit the number of supported releases, but reasonable accommodations are made. For example, the 2.2 and 2.1 releases were maintained up until the release of 4.0. For all releases, the prebuilt binary is named apache-cassandra-x.x.x-bin.tar.gz, where x.x.x represents the version number. The download for Cassandra 4.0 is around 40MB.

Extracting the Download

You can unpack the compressed file using any utility that handles gzipped tar archives. On Unix-based systems such as Linux or macOS, the tar and gzip utilities should be pre-installed; on Windows, you’ll need to get a program such as WinZip, which is commercial, or something like 7-Zip, which is freeware.

Open your extracting program. You might have to decompress the GZip file and extract the TAR file in separate steps. Once you have a folder on your filesystem called apache-cassandra-x.x.x, you’re ready to run Cassandra.
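On a Unix-based system, both steps can be done at once from the command line. The following is a minimal sketch; the function name extract_cassandra and the example paths are illustrations of our own, not part of the distribution.

```shell
# Unpack a gzipped tarball into a target directory.
# Usage: extract_cassandra <archive.tar.gz> <destination-dir>
extract_cassandra() {
    archive="$1"
    dest="$2"
    mkdir -p "$dest" || return 1
    # -x extract, -z decompress with gzip, -f read from the named file,
    # -C change to the destination directory before extracting
    tar -xzf "$archive" -C "$dest"
}

# Example (hypothetical paths; substitute your version for x.x.x):
#   extract_cassandra ~/Downloads/apache-cassandra-x.x.x-bin.tar.gz ~/opt
```

After extraction, the destination directory should contain the apache-cassandra-x.x.x folder.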

What’s In There?

Once you decompress the tarball, you’ll see that the Cassandra binary distribution includes several files and directories.

The files include the NEWS.txt file, which includes the release notes describing features included in the current and prior releases, and the CHANGES.txt, which is similar but focuses on bug fixes. You’ll want to make sure to review these files whenever you are upgrading to a new version so you know what changes to expect. The LICENSE.txt and NOTICE.txt files contain the Apache 2.0 license used by Cassandra, and copyright notices for Cassandra and included software, respectively.

Let’s take a moment to look around in the directories and see what we have.

bin

This directory contains the executables to run Cassandra as well as clients, including the query language shell (cqlsh). It also has scripts to run the nodetool, which is a utility for inspecting a cluster to determine whether it is properly configured, and to perform a variety of maintenance operations. We look at nodetool in depth later. The directory also contains several utilities for performing operations on SSTables, the files in which Cassandra stores its data on disk. We’ll discuss these utilities in Chapter 12.

conf

This directory contains the files for configuring your Cassandra instance. The configuration files you may use most frequently include the cassandra.yaml file, which is the primary configuration for running Cassandra, and the logback.xml file, which lets you change the logging settings to suit your needs. Additional files can be used to configure Java Virtual Machine (JVM) settings, the network topology, metrics reporting, archival and restore commands, and triggers. We see how to use these configuration files when we discuss configuration in Chapter 10.

doc

Traditionally, documentation has been one of the weaker areas of the project. However, a concerted effort for the 4.0 release, including sponsorship from the Google Season of Docs project, yielded significant improvements to the documentation included in the Cassandra distribution as well as the documentation on the Cassandra website at http://cassandra.apache.org/doc/latest/. The documentation includes a getting started guide, an architectural overview, and instructions for configuring and operating Cassandra.

javadoc

This directory contains a documentation website generated using Java’s JavaDoc tool. Note that JavaDoc reflects only the comments that are stored directly in the Java code, and as such does not represent comprehensive documentation. It’s helpful if you want to see how the code is laid out. Cassandra is a wonderful project, but the code contains relatively few comments, so you might find the JavaDoc’s usefulness limited. It may be more fruitful to simply read the source files directly if you’re familiar with Java. Nonetheless, to read the JavaDoc, open the javadoc/index.html file in a browser.

lib

This directory contains all of the external libraries that Cassandra needs to run. For example, it uses two different JSON serialization libraries, the Google collections project, and several Apache Commons libraries.

pylib

This directory contains Python libraries that are used by cqlsh.

tools

This directory contains tools that are used to maintain your Cassandra nodes. We’ll look at these tools in Chapter 12.

Additional Directories

If you’ve already run Cassandra using the default configuration, you will notice two additional directories under the main Cassandra directory: data and logs. We’ll discuss the contents of these directories momentarily.
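To confirm that an unpacked distribution matches the layout described in this section, a quick check along these lines may help. This is only a sketch: the function name check_cassandra_layout is our own, and the directory list simply mirrors the names discussed above, so a given release may differ.

```shell
# Report which of the directories described in this section are present.
# Usage: check_cassandra_layout <cassandra-directory>
# Returns 0 if all are present, 1 otherwise.
check_cassandra_layout() {
    root="$1"
    missing=0
    for d in bin conf doc javadoc lib pylib tools; do
        if [ -d "$root/$d" ]; then
            echo "found:   $d"
        else
            echo "missing: $d"
            missing=1
        fi
    done
    return "$missing"
}

# Example (hypothetical path):
#   check_cassandra_layout ~/opt/apache-cassandra-x.x.x
```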

Building from Source

Cassandra uses Apache Ant for its build scripting language and Maven for dependency management.

Downloading Ant

You can download Ant from http://ant.apache.org. You don’t need to download Maven separately just to build Cassandra.

Building from source requires a complete Java 8 JDK (or later version), not just the JRE. If you see a message about how Ant is missing tools.jar, either you don’t have the full JDK or you’re pointing to the wrong path in your environment variables. Maven downloads files from the Internet, so if your connection is unavailable or Maven cannot determine the proxy, the build will fail.

Downloading Development Builds

If you want to download the latest Cassandra builds or view test results, you can find these in Jenkins, which the Cassandra project uses as its Continuous Integration tool. See https://builds.apache.org/label/cassandra/ for the latest builds and test coverage information.

If you’re interested in having a look at the Cassandra source, you can get the trunk version of the Cassandra source using this command:

$ git clone https://github.com/apache/cassandra.git

Because Maven takes care of all the dependencies, it’s easy to build Cassandra once you have the source. Just make sure you’re in the root directory of your source download and execute the ant program, which will look for a file called build.xml in the current directory and execute the default build target. Ant and Maven take care of the rest. To execute the Ant program and start compiling the source, just type:

$ ant

That’s it. Maven will retrieve all of the necessary dependencies, and Ant will build the hundreds of source files and execute the tests. If all went well, you should see a BUILD SUCCESSFUL message. If all did not go well, make sure that your path settings are all correct, that you have the most recent versions of the required programs, and that you downloaded a stable Cassandra build. You can check the Jenkins report to make sure that the source you downloaded actually can compile.

More Build Output

If you want to see detailed information on what is happening during the build, you can pass Ant the -v option to cause it to output verbose details regarding each operation it performs.

Additional Build Targets

To compile the server, you can simply execute ant as shown previously. This command executes the default target, jar. This target will perform a complete build including unit tests and output a file into the build directory called apache-cassandra-x.x.x.jar.

If you want to see a list of all of the targets supported by the build file, simply pass Ant the -p option to get a description of each target. Here are a few others you might be interested in:

test

Users will probably find this the most helpful, as it executes the battery of unit tests. You can also check out the unit test sources themselves for some useful examples of how to interact with Cassandra.

stress-build

This target builds the Cassandra stress tool, which we will try out in Chapter 13.

clean

This target removes locally created artifacts such as generated source files and classes and unit test results. The related target realclean performs a clean and additionally removes the Cassandra distribution JAR files and JAR files downloaded by Maven.
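The build commands described above can be wrapped in a small helper that first checks the two most common causes of a failed start: running outside the source root and not having Ant on the path. This is a sketch under our own assumptions; run_ant is a hypothetical name, and the example directory is a placeholder.

```shell
# Run Ant from the root of a Cassandra source checkout.
# Usage: run_ant <source-dir> [ant arguments...]
run_ant() {
    src="$1"
    shift
    if [ ! -f "$src/build.xml" ]; then
        echo "no build.xml in $src; is this the source root?" >&2
        return 1
    fi
    if ! command -v ant >/dev/null 2>&1; then
        echo "ant not found on PATH" >&2
        return 1
    fi
    # Run in a subshell so the caller's working directory is unchanged.
    (cd "$src" && ant "$@")
}

# Examples (hypothetical checkout location):
#   run_ant ~/cassandra            # default target
#   run_ant ~/cassandra -p         # list the available targets
#   run_ant ~/cassandra -v         # default target with verbose output
#   run_ant ~/cassandra test       # run the unit tests
#   run_ant ~/cassandra realclean  # remove build artifacts and downloaded JARs
```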

Running Cassandra

The Cassandra developers have done a terrific job of making it very easy for new users to start using Cassandra immediately, as you can start a single node without making any changes to the default configuration. We’ll note some of the available configuration options as we go.

Required Java Version

Cassandra versions from 3.0 onward require a Java 8 JVM or later, preferably the latest stable version. It has been tested on both the OpenJDK and Oracle’s JDK. Cassandra 4.0 has been compiled and tested against both Java 8 and Java 11. You can check your installed Java version by opening a command prompt and executing java -version. If you need a JDK, you can get one at http://www.oracle.com/technetwork/java/javase/downloads/index.html or https://jdk.java.net.
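Java version strings come in two styles: the legacy 1.8.0_25 form and the newer 11.0.2 form. If you want to script the version check, the following sketch extracts the major version from either style. The function name java_major_version is our own, and the parsing assumes the usual java -version output in which the version appears in double quotes.

```shell
# Print the major version for a Java version string.
# Handles both the legacy "1.8.0_25" style and the newer "11.0.2" style.
java_major_version() {
    v="$1"
    case "$v" in
        1.*) v="${v#1.}" ;;    # legacy scheme: 1.8.0_25 becomes 8.0_25
    esac
    echo "${v%%[!0-9]*}"       # keep only the leading digits
}

# If java is installed, report the major version it announces.
if command -v java >/dev/null 2>&1; then
    # java -version writes to standard error, hence the redirection.
    raw=$(java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
    echo "Detected Java major version: $(java_major_version "$raw")"
fi
```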

Setting the environment

Once you have the binary (or the source downloaded and compiled), you’re ready to start the database server.

Setting the JAVA_HOME environment variable is recommended. To do this on a Windows system, click the Start button and then right-click on Computer. Click Advanced System Settings, and then click the Environment Variables… button. Click New… to create a new system variable. In the Variable Name field, type JAVA_HOME. In the Variable Value field, type the path to your Java installation. This is probably something like C:\Program Files\Java\jre1.8.0_25.
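On Unix-based systems, the variable is typically exported from the shell instead. The sketch below shows one way to set it and to sanity-check the result; the installation path in the comments is hypothetical, and is_java_home is a helper name of our own.

```shell
# Return 0 if the given directory looks like a Java installation,
# that is, it contains an executable bin/java.
is_java_home() {
    [ -x "$1/bin/java" ]
}

# Example for the current shell session (the path is hypothetical):
#   export JAVA_HOME=/usr/lib/jvm/java-11-openjdk
#   export PATH="$JAVA_HOME/bin:$PATH"

# If JAVA_HOME is already set, report whether it looks usable.
if [ -n "${JAVA_HOME:-}" ]; then
    if is_java_home "$JAVA_HOME"; then
        echo "JAVA_HOME looks valid: $JAVA_HOME"
    else
        echo "JAVA_HOME is set but has no bin/java: $JAVA_HOME" >&2
    fi
fi
```

To make the setting permanent, the export lines would normally go in your shell’s startup file.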

Once you’ve started the server for the first time, Cassandra will add directories to your system to store its data files. The default configuration creates directories under the CASSANDRA_HOME directory.

data

This directory is where Cassandra stores its data. By default, there are three sub-directories under the data directory, corresponding to the various data files Cassandra uses: commitlog, data, and saved_caches. We’ll explore the significance of each of these data files in Chapter 6. If you’ve been trying different versions of the database and aren’t worried about losing data, you can delete these directories and restart the server as a last resort.

logs

This directory is where Cassandra stores its logs in a file called system.log. If you encounter any difficulties, consult the log to see what might have happened.

Data File Locations

The data file locations are configurable in the cassandra.yaml file, located in the conf directory. The properties are called data_file_directories, commitlog_directory, and saved_caches_directory. We’ll discuss the recommended configuration of these directories in Chapter 10.

Many users on Unix-based systems prefer to use the /var/lib directory for data storage. If you are changing this configuration, you will need to edit the conf/cassandra.yaml file and create the referenced directories for Cassandra to store its data, making sure to configure write permissions for the user that will be running Cassandra:

$ sudo mkdir -p /var/lib/cassandra
$ sudo chown -R username /var/lib/cassandra

Instead of username, substitute your own username, of course.
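If you relocate the storage, you may also want to create the sub-directories up front. The following sketch mirrors the default layout described earlier (data, commitlog, and saved_caches) under a base directory of your choosing; make_cassandra_dirs is a hypothetical helper name, and each location setting in cassandra.yaml would then need to point at the matching directory.

```shell
# Create the three storage directories under a base directory.
# Usage: make_cassandra_dirs <base-dir>
make_cassandra_dirs() {
    base="$1"
    mkdir -p "$base/data" "$base/commitlog" "$base/saved_caches"
}

# Example (requires sufficient privileges; "username" is a placeholder):
#   sudo mkdir -p /var/lib/cassandra
#   sudo chown -R username /var/lib/cassandra
#   make_cassandra_dirs /var/lib/cassandra
```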

Starting the Server

To start the Cassandra server on any OS, open a command prompt or terminal window, navigate to the <cassandra-directory>/bin where you unpacked Cassandra, and run the command cassandra -f to start your server.

Starting Cassandra in the Foreground

Using the -f switch tells Cassandra to stay in the foreground instead of running as a background process, so that all of the server logs will print to standard out (stdout in Unix systems) and you can see them in your terminal window, which is useful for testing. In either case, the logs will append to the system.log file.

In a clean installation, you should see quite a few log statements as the server gets running. The exact syntax of logging statements will vary depending on the release you’re using, but there are a few highlights we can look for. If you search for cassandra.yaml, you’ll quickly run into the following:

INFO  [main] 2019-08-25 17:42:11,712 YamlConfigurationLoader.java:89 -
  Configuration location: file:/Users/jeffreycarpenter/cassandra/conf/cassandra.yaml
INFO  [main] 2019-08-25 17:42:11,855 Config.java:598 - Node configuration:[
  allocate_tokens_for_keyspace=null;
  ...

These log statements indicate the location of the cassandra.yaml file containing the configured settings. The Node configuration statement lists out the settings read from the config file.

Now search for JVM and you’ll find something like this:

INFO  [main] 2019-08-25 17:42:12,308 CassandraDaemon.java:487 -
  JVM vendor/version: OpenJDK 64-Bit Server VM/12.0.1
INFO  [main] 2019-08-25 17:42:12,309 CassandraDaemon.java:488 -
  Heap size: 3.900GiB/3.900GiB

These log statements provide information describing the JVM being used, including memory settings.

Next, search for the versions in use—Cassandra version, CQL version, Native protocol supported versions:

INFO  [main] 2019-08-25 17:42:17,847 StorageService.java:610 -
  Cassandra version: 4.0-alpha3
INFO  [main] 2019-08-25 17:42:17,848 StorageService.java:611 -
  CQL version: 3.4.5
INFO  [main] 2019-08-25 17:42:17,848 StorageService.java:612 -
  Native protocol supported versions: 3/v3, 4/v4, 5/v5-beta (default: 4/v4)

We can also find statements where Cassandra is initializing internal data structures such as caches:

INFO [main] 2015-12-08 06:02:43,633 CacheService.java:115 -
  Initializing key cache with capacity of 24 MBs.
INFO [main] 2015-12-08 06:02:43,679 CacheService.java:137 -
  Initializing row cache with capacity of 0 MBs
INFO [main] 2015-12-08 06:02:43,686 CacheService.java:166 -
  Initializing counter cache with capacity of 12 MBs

If we search for terms like JMX, gossip, and listening, we can find statements like the following:

WARN  [main] 2019-08-25 17:42:12,363 StartupChecks.java:168 -
  JMX is not enabled to receive remote connections.
  Please see cassandra-env.sh for more info.
INFO  [main] 2019-08-25 17:42:18,354 StorageService.java:814 -
  Starting up server gossip
INFO  [main] 2019-08-25 17:42:18,070 InboundConnectionInitiator.java:130 -
  Listening on address: (127.0.0.1:7000), nic: lo0, encryption: enabled (openssl)

These log statements indicate the server is beginning to initiate communications with other servers in the cluster and expose publicly available interfaces. By default, the management interface via the Java Management Extensions (JMX) is disabled for remote access. We’ll explore the management interface in Chapter 11.

Finally, search for state jump and you’ll see the following:

INFO  [main] 2019-08-25 17:42:18,581 StorageService.java:1507 -
  JOINING: Finish joining ring
INFO  [main] 2019-08-25 17:42:18,591 StorageService.java:2508 -
  Node 127.0.0.1:7000 state jump to NORMAL

Congratulations! Now your Cassandra server should be up and running with a new single node cluster called Test Cluster ready to interact with other nodes and clients. If you continue to monitor the output, you’ll begin to see periodic output such as memtable flushing and compaction, which we’ll learn about soon.
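Because the same statements are appended to system.log, the searches above can also be run against the log file. The following is a rough sketch; the function names are our own, the search strings are taken from the log excerpts shown in this section, and the exact wording may vary between releases.

```shell
# Return 0 if the log shows the node reaching the NORMAL state.
# Usage: node_reached_normal <path-to-system.log>
node_reached_normal() {
    grep -q "state jump to NORMAL" "$1"
}

# Print the startup highlights discussed above, if present.
# Usage: startup_highlights <path-to-system.log>
startup_highlights() {
    grep -E "Configuration location|JVM vendor/version|Cassandra version|state jump" "$1"
}

# Example, run from the Cassandra home directory:
#   startup_highlights logs/system.log
#   node_reached_normal logs/system.log && echo "node is up"
```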

Starting Over

The committers work hard to ensure that data is readable from one minor dot release to the next and from one major version to the next. The commit log, however, needs to be completely cleared out from version to version (even minor versions).

If you have any previous versions of Cassandra installed, you may want to clear out the data directories for now, just to get up and running. If you’ve messed up your Cassandra installation and want to get started cleanly again, you can delete the data folders.

Stopping Cassandra

Now that we’ve successfully started a Cassandra server, you may be wondering how to stop it. You may have noticed the stop-server command in the bin directory. Let’s try running that command. Here’s what you’ll see on Unix systems:

$ ./stop-server
please read the stop-server script before use

So you see that our server has not been stopped, but instead we are directed to read the script. Taking a look inside with your favorite code editor, you’ll learn that the way to stop Cassandra is to kill the JVM process that is running Cassandra. The file suggests a couple of different techniques by which you can identify the JVM process and kill it.

The first technique is to start Cassandra using the -p option, which provides Cassandra with the name of a file to which it should write the process identifier (PID) upon starting up. This is arguably the most straightforward approach to making sure we kill the right process.

However, because we did not start Cassandra with the -p option, we’ll need to find the process ourselves and kill it. The script suggests using pgrep to locate processes for the current user containing the term “cassandra”:

user=`whoami`
pgrep -u $user -f cassandra | xargs kill -9
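The PID file technique mentioned earlier can be sketched as follows. The helper name stop_by_pidfile and the file path are assumptions of ours; the -p option is the one described in the stop-server script.

```shell
# Stop a process whose PID was written to a file at startup.
# Usage: stop_by_pidfile <pid-file>
stop_by_pidfile() {
    pidfile="$1"
    [ -r "$pidfile" ] || { echo "cannot read $pidfile" >&2; return 1; }
    pid=$(cat "$pidfile")
    # A plain kill sends SIGTERM, which lets the JVM run its shutdown hooks.
    kill "$pid" && rm -f "$pidfile"
}

# Example (the PID file path is hypothetical):
#   bin/cassandra -p /tmp/cassandra.pid
#   stop_by_pidfile /tmp/cassandra.pid
```

This sketch sends the default signal rather than -9 so that the process has a chance to shut down cleanly; kill -9 remains available as a fallback if the process does not exit.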

Stopping Cassandra on Windows

On Windows installations, you can find the JVM process and kill it using the Task Manager.

Other Cassandra Distributions

The instructions we just reviewed showed us how to install the Apache distribution of Cassandra. In addition to the Apache distribution, there are a couple of other ways to get Cassandra:

DataStax Distribution of Apache Cassandra (DDAC)

DataStax provides a production-certified distribution of Cassandra, along with support and services. DDAC runs on common Unix-based platforms and is available as a tarball or Docker image. DDAC also includes developer tools such as a bulk loading tool. DDAC releases generally track the Apache releases, with availability soon after each Apache release.

DataStax Enterprise Edition

DataStax also provides a fully supported version certified for production use. The product line provides an integrated database platform with support for complementary data technologies such as Apache Solr, Apache Spark, and Apache Kafka. We’ll explore some of these integrations in Chapter 15.

Virtual machine images

A frequent model for deployment of Cassandra is to package one of the preceding distributions in a virtual machine image. For example, multiple such images are available in the Amazon Web Services (AWS) Marketplace.

Containers

It has become increasingly popular to run Cassandra in Docker containers, especially in development environments. We’ll provide some simple instructions for running the Apache distribution in Docker in “Running Cassandra in Docker”.

Managed services

There are a few providers of Cassandra as a managed service, in which the provider hosts and manages Cassandra clusters on your behalf. These include Instaclustr and Aiven. DataStax provides an Apache Cassandra as a Service called Apollo as part of its Constellation platform.

We’ll take a deeper look at several options for deploying Cassandra in production environments, including Kubernetes and cloud computing environments, in Chapter 10.

Selecting the right distribution will depend on your deployment environment; your needs for scale, stability, and support; and your development and maintenance budgets. Having both open source and commercial deployment options provides the flexibility to make the right choice for your organization.

Running the CQL Shell

Now that you have a Cassandra installation up and running, let’s give it a quick try to make sure everything is set up properly. We’ll use the CQL shell (cqlsh) to connect to our server and have a look around.

Deprecation of the CLI

If you’ve used Cassandra in releases prior to 3.0, you may also be familiar with the command-line client interface known as cassandra-cli. The CLI was removed in the 3.0 release because it depends on the legacy Thrift API, which was deprecated in 3.0 and removed entirely in 4.0.

To run the shell, create a new terminal window, change to the Cassandra home directory, and type the following command (you should see output similar to that shown here):

$ bin/cqlsh
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 4.0-alpha3 | CQL spec 3.4.5 | Native protocol v4]
Use HELP for help.

Because we did not specify a node to which we wanted to connect, the shell helpfully checks for a node running on the local host, and finds the node we started earlier. The shell also indicates that you’re connected to a Cassandra server cluster called “Test Cluster”. That’s because this cluster of one node at localhost is set up for you by default.

Renaming the Default Cluster

In a production environment, be sure to change the cluster name to something more suitable to your application.

To connect to a specific node, specify the hostname and port on the command line. For example, the following will connect to our local node:

$ bin/cqlsh localhost 9042

Another alternative for configuring the cqlsh connection is to set the environment variables $CQLSH_HOST and $CQLSH_PORT. This approach is useful if you will be frequently connecting to a specific node on another host. The environment variables will be overridden if you specify the host and port on the command line.
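The precedence just described can be illustrated with a few lines of shell. This is not cqlsh’s own code, only a sketch of the rule: command-line arguments win, then the environment variables, then the local defaults seen in the earlier output. The function name cqlsh_target and the example hostname are hypothetical.

```shell
# Print the host and port that would be used, following the precedence:
# command-line arguments, then environment variables, then defaults.
# Usage: cqlsh_target [host] [port]
cqlsh_target() {
    host="${1:-${CQLSH_HOST:-127.0.0.1}}"
    port="${2:-${CQLSH_PORT:-9042}}"
    echo "$host:$port"
}

# Example (hypothetical remote node):
#   export CQLSH_HOST=cassandra1.example.com
#   export CQLSH_PORT=9042
#   bin/cqlsh                   # connects to cassandra1.example.com
#   bin/cqlsh localhost 9042    # command-line arguments take precedence
```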

Connection Errors

Have you run into an error like this while trying to connect to a server?

Exception connecting to localhost/9042. Reason:
  Connection refused.

If so, make sure that a Cassandra instance is started at that host and port, and that you can ping the host you’re trying to reach. There may be firewall rules preventing you from connecting.

To see a complete list of the command-line options supported by cqlsh, type the command cqlsh --help.

Basic cqlsh Commands

Let’s take a quick tour of cqlsh to learn what kinds of commands you can send to the server. We’ll see how to use the basic environment commands and how to do a round-trip of inserting and retrieving some data.

Case in cqlsh

The cqlsh commands are all case insensitive. For our examples, we’ll adopt the convention of uppercase to be consistent with the way the shell describes its own commands in help topics and output.

cqlsh Help

To get help for cqlsh, type HELP or ? to see the list of available commands:

cqlsh> help

Documented shell commands:
===========================
CAPTURE  CLS          COPY  DESCRIBE  EXPAND  LOGIN   SERIAL  SOURCE   UNICODE
CLEAR    CONSISTENCY  DESC  EXIT      HELP    PAGING  SHOW    TRACING

CQL help topics:
================
AGGREGATES               CREATE_KEYSPACE           DROP_TRIGGER      TEXT
ALTER_KEYSPACE           CREATE_MATERIALIZED_VIEW  DROP_TYPE         TIME
ALTER_MATERIALIZED_VIEW  CREATE_ROLE               DROP_USER         TIMESTAMP
ALTER_TABLE              CREATE_TABLE              FUNCTIONS         TRUNCATE
ALTER_TYPE               CREATE_TRIGGER            GRANT             TYPES
ALTER_USER               CREATE_TYPE               INSERT            UPDATE
APPLY                    CREATE_USER               INSERT_JSON       USE
ASCII                    DATE                      INT               UUID
BATCH                    DELETE                    JSON
BEGIN                    DROP_AGGREGATE            KEYWORDS
BLOB                     DROP_COLUMNFAMILY         LIST_PERMISSIONS
BOOLEAN                  DROP_FUNCTION             LIST_ROLES
COUNTER                  DROP_INDEX                LIST_USERS
CREATE_AGGREGATE         DROP_KEYSPACE             PERMISSIONS
CREATE_COLUMNFAMILY      DROP_MATERIALIZED_VIEW    REVOKE
CREATE_FUNCTION          DROP_ROLE                 SELECT
CREATE_INDEX             DROP_TABLE                SELECT_JSON

cqlsh Help Topics

You’ll notice that the help topics listed differ slightly from the actual command syntax. For example, the CREATE_TABLE help topic describes the syntax of the CREATE TABLE … command.

To get additional documentation about a particular command, type HELP <command>. Many cqlsh commands may be used with no parameters, in which case they print out the current setting. Examples include CONSISTENCY, EXPAND, and PAGING.

Describing the Environment in cqlsh

Now that you have connected to your Cassandra instance Test Cluster, to learn more about the cluster you’re working in, type:

cqlsh> DESCRIBE CLUSTER;
Cluster: Test Cluster
Partitioner: Murmur3Partitioner

To see which keyspaces are available in the cluster, issue this command:

cqlsh> DESCRIBE KEYSPACES;
system_traces  system_auth  system_distributed     system_views
system_schema  system       system_virtual_schema

Initially this list will consist of several system keyspaces. Once you have created your own keyspaces, they will be shown as well. The system keyspaces are managed internally by Cassandra, and aren’t for us to put data into. In this way, these keyspaces are similar to the master and tempdb databases in Microsoft SQL Server. Cassandra uses these keyspaces to store the schema, tracing, and security information. We’ll learn more about these keyspaces in Chapter 6.

You can use the following command to learn the client, server, and protocol versions in use:

cqlsh> SHOW VERSION;
[cqlsh 5.0.1 | Cassandra 4.0-alpha3 | CQL spec 3.4.5 | Native protocol v4]

You may have noticed that this version info is printed out when cqlsh starts. There are a variety of other commands with which you can experiment. For now, let’s add some data to the database and get it back out again.

Creating a Keyspace and Table in cqlsh

A Cassandra keyspace is sort of like a relational database. It defines one or more tables. When you start cqlsh without specifying a keyspace, the prompt will look like this: cqlsh>, with no keyspace specified.

Let’s create our own keyspace so we have something to write data to. In creating our keyspace, there are some required options. To walk through these options, we could use the command HELP CREATE_KEYSPACE, but instead we’ll use the helpful command-completion features of cqlsh. Type the following and then hit the Tab key:

cqlsh> CREATE KEYSPACE my_keyspace WITH

When you hit the Tab key, cqlsh begins completing the syntax of our command:

cqlsh> CREATE KEYSPACE my_keyspace WITH replication = {'class': '

This is informing us that in order to specify a keyspace, we also need to specify a replication strategy. Let’s Tab again to see what options we have:

cqlsh> CREATE KEYSPACE my_keyspace WITH replication = {'class': '
NetworkTopologyStrategy    OldNetworkTopologyStrategy SimpleStrategy

Now cqlsh is giving us three strategies to choose from. We’ll learn more about these strategies in Chapter 6. For now, we will choose the SimpleStrategy by typing the name. We’ll indicate we’re done with a closing quote and Tab again:

cqlsh> CREATE KEYSPACE my_keyspace WITH replication = {'class':
  'SimpleStrategy', 'replication_factor':

The next option we’re presented with is a replication factor. For the simple strategy, this indicates how many nodes the data in this keyspace will be written to. For a production deployment, we’d want copies of our data stored on multiple nodes, but because we’re just running a single node at the moment, we’ll ask for a single copy. Let’s specify a value of “1” and a space and Tab again:

cqlsh> CREATE KEYSPACE my_keyspace WITH replication = {'class':
  'SimpleStrategy', 'replication_factor': 1};

We see that cqlsh has now added a closing brace, indicating we’ve completed all of the required options. Let’s complete our command with a semicolon and return, and our keyspace will be created.

Keyspace Creation Options

For a production keyspace, we would probably never want to use a value of 1 for the replication factor. There are additional options on creating a keyspace depending on the replication strategy that is chosen. The command completion feature will walk through the different options.

Let’s have a look at our keyspace using the DESCRIBE KEYSPACE command:

cqlsh> DESCRIBE KEYSPACE my_keyspace
CREATE KEYSPACE my_keyspace WITH replication = {'class':
  'SimpleStrategy', 'replication_factor': '1'} AND
  durable_writes = true;

We see that the keyspace has been created with the SimpleStrategy, a replication_factor of one, and durable writes. Notice that our keyspace is described in much the same syntax that we used to create it, with one additional option that we did not specify: durable_writes = true. Don’t worry about this option now; we’ll return to it in Chapter 6.

After you have created your own keyspace, you can switch to it in the shell by typing:

cqlsh> USE my_keyspace;
cqlsh:my_keyspace>

Notice that the prompt has changed to indicate that we’re using the keyspace.

Now that we have a keyspace, we can create a table in our keyspace. To do this in cqlsh, use the following command:

cqlsh:my_keyspace> CREATE TABLE user ( first_name text ,
  last_name text, title text, PRIMARY KEY (last_name, first_name)) ;

This creates a new table called “user” in our current keyspace with three columns to store first and last names and a title, all of type text. The text and varchar types are synonymous and are used to store strings. We’ve specified a primary key for this table consisting of the first_name and last_name and taken the defaults for other table options. We’ll learn more about primary keys and the significance of our choice of primary key in Chapter 4, but for now let’s think of that combination of names as identifying unique rows in our table. The title column is the only one in our table that is not part of the primary key.

Using Keyspace Names in cqlsh

We could have also created this table without switching to our keyspace by using the syntax CREATE TABLE my_keyspace.user.

We can use cqlsh to get a description of the table we just created using the DESCRIBE TABLE command:

cqlsh:my_keyspace> DESCRIBE TABLE user;
CREATE TABLE my_keyspace.user (
    first_name text,
    last_name text,
    title text,
    PRIMARY KEY (last_name, first_name)
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.
      SizeTieredCompactionStrategy', 'max_threshold': '32',
      'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '16', 'class':
      'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.0
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99p';

You’ll notice that cqlsh prints a nicely formatted version of the CREATE TABLE command that we just typed in but also includes default values for all of the available table options that we did not specify. We’ll worry about these settings later. For now, we have enough to get started.

Writing and Reading Data in cqlsh

Now that we have a keyspace and a table, we’ll write some data to the database and read it back out again. It’s OK at this point not to know quite what’s going on. We’ll come to understand Cassandra’s data model in depth later. For now, you have a keyspace (database), which has a table, which holds columns, the atomic unit of data storage.

To write rows, we use the INSERT command:

cqlsh:my_keyspace> INSERT INTO user (first_name, last_name, title)
  VALUES ('Bill', 'Nguyen', 'Mr.');

Here we have created a new row with three columns, identified by the primary key values Nguyen and Bill, to store a set of related values. The column names are first_name, last_name, and title.

Now that we have written some data, let’s read it, using the SELECT command:

cqlsh:my_keyspace> SELECT * FROM user WHERE first_name='Bill' and last_name='Nguyen';

 last_name | first_name | title
-----------+------------+-------
    Nguyen |       Bill |   Mr.

(1 rows)

In this command, we asked Cassandra to return all columns of the rows matching the primary key. For this query, we specified both of the columns that make up the primary key. What happens when we only specify one of the values? Let’s find out.

cqlsh:my_keyspace> SELECT * FROM user where last_name = 'Nguyen';

 last_name | first_name | title
-----------+------------+-------
    Nguyen |       Bill |   Mr.

(1 rows)
cqlsh:my_keyspace> SELECT * FROM user where first_name = 'Bill';
InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot execute this
query as it might involve data filtering and thus may have unpredictable performance. If
you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"

This behavior might not seem intuitive at first, but it has to do with the composition of the primary key we used for this table. This is our first clue that there might be something a bit different about accessing data in Cassandra as compared to what you might be used to in SQL. We’ll examine the significance of our primary key selection and the ALLOW FILTERING option in Chapter 4 and other chapters.
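For reference, the form that the error message points to looks like the sketch below. It is shown only to illustrate the syntax, on the assumption that the table still holds the single row we inserted. On a table with real data volumes, the same query may have to read many partitions, which is exactly what the error is warning about:

```cql
-- Explicitly accept the cost of filtering on a column that is not
-- the leading part of the primary key.
SELECT * FROM user WHERE first_name = 'Bill' ALLOW FILTERING;
```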

Counting data and full table scans

Many new Cassandra users, especially those who are coming from a relational background, will be inclined to use the SELECT COUNT command as a way to ensure data has been written. For example, we could use the following command to verify our write to the user table:

cqlsh:my_keyspace> SELECT COUNT (*) FROM user;
 count
-------
     1

(1 rows)

Warnings :
Aggregation query used without partition key

Note that when we execute this command, cqlsh gives us the correct count of rows, but also gives us a warning. This is because we’ve asked Cassandra to perform a full table scan. In a multi-node cluster with potentially large amounts of data, this COUNT could be a very expensive operation. Throughout the rest of the book, we’ll encounter various ways in which Cassandra tries to warn us or constrain our ability to perform operations that will perform poorly at scale in a distributed architecture.
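One way to keep such a check cheap is to restrict the count to a single partition. The sketch below assumes that last_name is the partition key, as the DESCRIBE TABLE output above suggests (it is listed first in the PRIMARY KEY clause). With that restriction in place, the query should not trigger the aggregation warning:

```cql
-- Count only the rows stored under one partition key value
-- instead of scanning the whole table.
SELECT COUNT(*) FROM user WHERE last_name = 'Nguyen';
```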

You can delete a column using the DELETE command. Here we will delete the title column from the row inserted above:

cqlsh:my_keyspace> DELETE title FROM user WHERE
  first_name='Bill' AND last_name='Nguyen';

We can perform this delete because the title column is not part of the primary key. To make sure that the value has been removed, we can query again:

cqlsh:my_keyspace> SELECT * FROM user WHERE first_name='Bill'
  AND last_name='Nguyen';

 last_name | first_name | title
-----------+------------+-------
    Nguyen |       Bill |  null

(1 rows)

Now we’ll clean up after ourselves by deleting the entire row. It’s the same command, but we don’t specify a column name:

cqlsh:my_keyspace> DELETE FROM USER WHERE first_name='Bill'
  AND last_name='Nguyen';

To make sure that it’s removed, we can query again:

cqlsh:my_keyspace> SELECT * FROM user WHERE first_name='Bill'
  AND last_name='Nguyen';

 last_name | first_name | title
-----------+------------+-------

(0 rows)

If we really want to clean up after ourselves, we can remove all data from the table using the TRUNCATE command, or even delete the table schema using the DROP TABLE command:

cqlsh:my_keyspace> TRUNCATE user;
cqlsh:my_keyspace> DROP TABLE user;

cqlsh Command History

Now that you’ve been using cqlsh for a while, you may have noticed that you can navigate through commands you’ve executed previously with the up and down arrow keys. This history is stored in a file called cqlsh_history, which is located in a hidden directory called .cassandra within your home directory. This acts like your bash shell history, listing the commands in a plain-text file in the order you entered them. Nice!

Running Cassandra in Docker

Over the past few years, containers have become a very popular alternative to full machine virtualization for deployment of applications and supporting infrastructure such as databases.

Given the high popularity of Docker and its image format, the Apache project has begun supporting official Docker images of Cassandra.

If you have a Docker environment installed on your machine, it’s extremely simple to start a Cassandra node. After making sure you’ve stopped any Cassandra node started above, start a new node in Docker using the following two commands:

docker pull cassandra
docker run --name my-cassandra cassandra

The first command pulls the Docker image marked with the tag latest from the Docker Hub at https://hub.docker.com/_/cassandra/ and prints output similar to the following:

Using default tag: latest
latest: Pulling from library/cassandra
9fc222b64b0a: Pull complete
33b9abeacd73: Pull complete
d28230b01bc3: Pull complete
6e755ec31928: Pull complete
b881e4d8c78e: Pull complete
d8b058ab9240: Pull complete
3ddfff7126ed: Pull complete
94de8e3674c4: Pull complete
61d4f90c97c4: Pull complete
a3d009e31ea4: Pull complete
Digest: sha256:0f188d784235e1bedf191361096e6eeab330f9579eac7d2e68e14a5c29f75ad6
Status: Downloaded newer image for cassandra:latest
docker.io/library/cassandra:latest

The second command starts an instance of Cassandra with default options. Note that we could have used the -d option to start the container in the background without printing out the logs.

We used the --name option to specify a name for the container, which allows us to reference the container by name when using other Docker commands. For example, we can stop the container by using the command:

docker stop my-cassandra

If we don’t provide a name for the container, the Docker runtime will assign a randomly selected name such as “breezy_ensign”. Docker also creates a unique identifier for each container which is returned from the initial run command. Either the name or ID may be used to reference a specific container in Docker commands.

If you’d like to start an instance of cqlsh, the simplest way is to use the copy bundled in the container by executing it there:

docker exec -it my-cassandra cqlsh

This will give you a cqlsh prompt, with which you can execute the same commands we’ve practiced in this chapter, or any other commands you’d like.

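Up to this point, we’ve only created a single Cassandra node in Docker, which is not accessible from outside Docker’s internal network. In order to access this node from outside Docker for CQL queries, you’ll want to make sure the standard CQL port is published when the container is created. Ports can’t be added to an existing container, so the -p option is passed to docker run rather than docker start. If the my-cassandra container from the previous steps still exists, remove it first with docker rm my-cassandra, then run:

docker run --name my-cassandra -p 9042:9042 cassandra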

There are several other configuration options available for running Cassandra in Docker, which are documented on the Docker Hub page referenced above. One exercise you may find interesting is to launch multiple nodes in Docker to create a small cluster.

Summary

Now you should have a Cassandra installation up and running. You’ve worked with the cqlsh client to insert and retrieve some data, and you’re ready to take a step back and get the big picture on Cassandra before really diving into the details.
