With the JVM ready, installing Cassandra is as easy as downloading the appropriate tarball from the Apache Cassandra download page, http://cassandra.apache.org/download, and untarring it. On Debian or Ubuntu, you may choose to install either from the tarball or from an Apache Software Foundation repository.
This guide assumes that Cassandra is installed in the /opt directory, the datafiles in the /cassandra-data directory, and the system logs in /var/log/cassandra. These are just conventions that I chose. You may choose locations that suit you best:
# Download. Please select the appropriate version and
# URL from the http://cassandra.apache.org/download page
$ wget http://mirror.sdunix.com/apache/cassandra/1.1.11/apache-cassandra-1.1.11-bin.tar.gz
[-- snip --]
Saving to: 'apache-cassandra-1.1.11-bin.tar.gz'

# Extract
$ tar xzf apache-cassandra-1.1.11-bin.tar.gz

# (Optional) Symbolic link to easily switch versions in
# the future without having to change dependent scripts
$ ln -s apache-cassandra-1.1.11 cassandra
The Apache Software Foundation provides Debian packages for different versions of Cassandra, so that it can be installed directly from the repository. To add the repository to your package sources, edit the APT sources list:
# Edit sources
$ sudo vi /etc/apt/sources.list
Then append the following three lines:
# Cassandra repo
deb http://www.apache.org/dist/cassandra/debian 11x main
deb-src http://www.apache.org/dist/cassandra/debian 11x main
Next, execute sudo apt-get update, as shown in the following code:
$ sudo apt-get update
Ign http://security.ubuntu.com natty-security InRelease
[-- snip --]
GPG error: http://www.apache.org 11x InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 4BD736A82B5C1B00
If you get this error, add the public keys as shown:
$ gpg --keyserver pgp.mit.edu --recv-keys F758CE318D77295D
$ gpg --export --armor F758CE318D77295D | sudo apt-key add -
$ gpg --keyserver pgp.mit.edu --recv-keys 2B5C1B00
$ gpg --export --armor 2B5C1B00 | sudo apt-key add -
Now, you can install Cassandra using the following commands:
$ sudo apt-get update
$ sudo apt-get install cassandra
This installation does most of the system-wide configuration for you. It makes all the executables available on the system $PATH, copies the configuration files to /etc/cassandra, and adds an init script that sets up the proper JVM options and ulimits. It also configures the run levels so that Cassandra starts at boot as the "cassandra" user.
There are a couple of programs and files that one must know about to work effectively with Cassandra. They come in handy during investigation, maintenance, configuration, and optimization.
Depending on how the installation is done, the files may be located in different places. For a tarball installation, everything is neatly packaged under the directory where Cassandra is installed: binaries under the bin directory and configuration files under the conf directory. For repository-based installations, binaries are available in the /usr/bin and /usr/sbin directories, and configuration files under /etc/cassandra and /etc/default/cassandra.
The binary directories contain executables for various tasks. Let's take a quick glance at them:
cassandra: It starts the Cassandra daemon using the default configuration. To start Cassandra in the foreground, use the -f option; the logs are then printed to the console, and you can use Ctrl + C to stop Cassandra. One may also use -p <pid_file> to record the process ID, and later stop a Cassandra process running in the background by using kill `cat <pid_file>`. If Cassandra is installed from the repository, the installation creates a service for it. So, one should use sudo service cassandra start, sudo service cassandra stop, and sudo service cassandra status to start, stop, and query the status of Cassandra.
cassandra-cli: Cassandra's command-line interface (CLI) gives very basic access for executing simple commands that read and modify keyspaces and column families. More discussion on Cassandra's CLI can be found at http://wiki.apache.org/cassandra/CassandraCli. A typical cassandra-cli invocation looks like this:
cassandra-cli -h <hostname> -p <port> -k <keyspace>
A file of statements can be passed to the CLI using the -f option.
cqlsh: This is a command-line interface for executing CQL queries. The default version is CQL 2 as of Cassandra Version 1.1.*. It may change in Version 1.2.0+. One may switch to CQL 3 using the --cql3 switch. Typically, the cqlsh connect command looks like this:
cqlsh <hostname> <port> -k <keyspace>
json2sstable and sstable2json: As the names suggest, they are the yin and yang of serializing and deserializing the data in an SSTable. They are loosely similar to the mysqldump --xml <database> command, except that they work in the JSON format. sstable2json dumps an SSTable as JSON, and json2sstable takes JSON and materializes a functional SSTable. sstable2json accepts the following three options:
-k: the keys to be dumped
-x: the keys to be excluded
-e: makes sstable2json dump just the keys, with no column family data
One can use the -k or -x switches up to 500 times. A general sstable2json invocation looks like this:
sstable2json -k <key1> -k <key2> <sstable_path>
sstable_path must be the full path to the SSTable, such as /cassandra-data/data/mykeyspace/mykeyspace-hc-1.data. Also, each key must be a hex string.
sstablekeys: This is essentially sstable2json with the -e switch.
sstableloader: This is used to bulk load data into Cassandra. One can simply copy SSTable datafiles and load them into another Cassandra setup without much hassle. Essentially, sstableloader reads the datafiles and streams them to the Cassandra setup specified by Cassandra's YAML file. We will see this tool in more detail in the section Using Cassandra bulk loader to restore the data in Chapter 6, Managing a Cluster – Scaling, Node Repair, and Backup.
Cassandra has a central configuration file named cassandra.yaml. It contains cluster settings, node-to-node communication specifications, performance-related settings, authentication, security, and backup settings.
Apart from cassandra.yaml, there are the log4j-server.properties and cassandra-topology.properties files. The log4j-server.properties file is used to tweak Cassandra's logging settings. The only line that one is likely to change in this file is the following one, which sets the location of the log file:
log4j.appender.R.File=/var/log/cassandra/system.log
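If you prefer to script this change rather than edit the file by hand, something along the following lines may work. It is only a sketch: set_log_file is a hypothetical helper name, and the paths in the usage comment are illustrative. It rewrites just the log4j.appender.R.File line and leaves everything else in the file untouched.

```shell
#!/bin/sh
# Sketch: point the log4j appender at a different log file.
# set_log_file is a hypothetical helper, not something shipped with Cassandra.
set_log_file() {
    props_file=$1   # path to log4j-server.properties
    new_log=$2      # desired log file location
    tmp_file=$(mktemp) || return 1
    # Replace only the log4j.appender.R.File line; copy all other lines as-is.
    # awk is used so that the slashes in the path need no escaping.
    awk -v path="$new_log" '
        /^log4j\.appender\.R\.File=/ { print "log4j.appender.R.File=" path; next }
        { print }
    ' "$props_file" > "$tmp_file" && cat "$tmp_file" > "$props_file"
    status=$?
    rm -f "$tmp_file"
    return $status
}

# Example (illustrative paths; sudo may be needed for files under /etc):
# set_log_file /etc/cassandra/log4j-server.properties /var/log/cassandra/system.log
```

Writing the result back with cat, instead of moving the temporary file into place, keeps the original file's ownership and permissions.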
The cassandra-topology.properties file is to be filled with cluster-specific values if you use PropertyFileSnitch. We'll discuss this in more detail later in this chapter.
cassandra.yaml and other files can be accessed from the conf directory under the installation directory for a tarball installation. For a repository installation, the cassandra.yaml file and others can be found under /etc/cassandra.
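Since the location depends on how Cassandra was installed, a small lookup like the one below can tell you which cassandra.yaml is present on a machine. This is a minimal sketch: find_cassandra_yaml is a hypothetical helper, and /opt/cassandra/conf in the usage comment assumes the tarball was extracted under /opt with the symbolic link suggested earlier.

```shell
#!/bin/sh
# Sketch: print the first cassandra.yaml found in the given conf directories.
# find_cassandra_yaml is a hypothetical helper name.
find_cassandra_yaml() {
    for conf_dir in "$@"; do
        if [ -f "$conf_dir/cassandra.yaml" ]; then
            printf '%s\n' "$conf_dir/cassandra.yaml"
            return 0
        fi
    done
    # No candidate directory contained a cassandra.yaml
    return 1
}

# Example: check the repository location first, then an assumed tarball
# location under /opt.
# find_cassandra_yaml /etc/cassandra /opt/cassandra/conf
```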
As discussed earlier, one should place the data directory and the commit log directory on separate disk drives to improve performance. The cassandra.yaml file holds these settings and more.
AWS EC2 users: Although it is emphasized to have data and commit logs on two drives, for EC2 instance store instances, it is suggested to set up the RAID0 configuration and use it for both the data directories and the commit log. It performs better than having one of those on the root device and the other on ephemeral.
EBS-backed instances are a bad choice for a Cassandra installation due to slow I/O performance, and the same goes for any NAS setup.
To update data directories, edit the following lines in cassandra.yaml:
# directories where Cassandra should store data on disk.
data_file_directories:
    - /var/lib/cassandra/data
Change /var/lib/cassandra/data to a directory that is suitable for your setup. You may also add more directories spanning different hard disks. Then change the commit log directory as shown in the following code:
# commit log
commitlog_directory: /var/lib/cassandra/commitlog
Edit this to set a desired location.
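To make the resulting layout concrete, the sketch below prints both settings for a setup with several data directories and a separate commit log location. render_dir_settings is a hypothetical helper and the mount points in the usage comment are made up; the output is meant to be compared with, or pasted into, cassandra.yaml rather than applied automatically.

```shell
#!/bin/sh
# Sketch: print the directory settings in the form cassandra.yaml expects.
# render_dir_settings is a hypothetical helper. The first argument is the
# commit log directory; the remaining arguments are data directories.
render_dir_settings() {
    commitlog_dir=$1
    shift
    printf 'data_file_directories:\n'
    for data_dir in "$@"; do
        # One YAML list entry per data directory
        printf '    - %s\n' "$data_dir"
    done
    printf 'commitlog_directory: %s\n' "$commitlog_dir"
}

# Example with illustrative mount points on separate disks:
# render_dir_settings /mnt/disk1/commitlog /mnt/disk2/data /mnt/disk3/data
```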
These directories (data and commit log) must be writable. If it is not a fresh install, one may want to migrate the data from the old data and commit log directories to the new ones.
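A quick way to confirm that a directory is usable is to try creating a file in it. The following is a sketch under stated assumptions: check_writable is a hypothetical helper, and the directories in the usage comment follow the /cassandra-data convention used in this guide. Run it as the user that will run Cassandra, since the result depends on who executes it.

```shell
#!/bin/sh
# Sketch: confirm that a directory exists and accepts writes from the
# current user. check_writable is a hypothetical helper name.
check_writable() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "missing: $dir" >&2
        return 1
    fi
    # Actually create a probe file instead of trusting permission bits alone.
    probe="$dir/.write-test.$$"
    if ( : > "$probe" ) 2>/dev/null; then
        rm -f "$probe"
        return 0
    fi
    echo "not writable: $dir" >&2
    return 1
}

# Example (illustrative paths):
# check_writable /cassandra-data/data && check_writable /cassandra-data/commitlog
```

When migrating an existing installation, the same check can be run against the new directories before the old contents are copied across with Cassandra stopped.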