Cassandra provides a simple backup tool called nodetool snapshot to take snapshots and back up data. The snapshot command flushes MemTables to disk and creates a backup by creating hard links to the SSTables (SSTables are immutable). These hard links stay under the data directory, at <data_directory>/<keyspace>/<column_family>/snapshots.
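As a sketch (the snapshot tag and the keyspace and column family names are illustrative), taking a snapshot from the command line looks like this:

```shell
# Snapshot all keyspaces on this node, tagged so it is easy to find later
# (without -t, Cassandra names the snapshot with a timestamp).
nodetool snapshot -t mybackup

# The hard links then appear under the data directory, for example:
#   <data_directory>/mykeyspace/myCF/snapshots/mybackup/
```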
The general plan to back up a cluster is to take a snapshot on each node and copy the snapshot files to a safe location. The snapshot command provides an option to specify whether to back up the entire keyspace or just selected column families. The nodetool clearsnapshot command cleans all the snapshots on a node. It is important to understand that creating a snapshot creates hard links to the data files, so data files do not get deleted when they become obsolete, because the snapshot still holds links to them. This unnecessary disk space usage can be avoided by running clearsnapshot after the snapshots are copied to a different location.
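For example (the keyspace and column family names are hypothetical, and the -cf option may not exist in very old nodetool versions), scoping a snapshot and cleaning up afterwards might look like:

```shell
# Snapshot just one keyspace...
nodetool snapshot mykeyspace

# ...or just a selected column family within it.
nodetool snapshot -cf myCF mykeyspace

# Once the snapshot files are copied off the node, reclaim the disk space.
nodetool clearsnapshot
```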
For really large datasets, it may be hard to back up the entire keyspace on a daily basis, and it is expensive to transfer that much data over a network to move the snapshots to a safe location. Instead, you can take a snapshot once and copy it to a safe location; after that, all you need to move is the incremental data. This is called incremental backup. To enable incremental backups on a node, edit cassandra.yaml and set incremental_backups: true. This results in the creation of hard links in the backups directory under the data directory.
Therefore, with snapshots plus incremental backups, you have a backup of all the SSTables created after the snapshot was taken. Incremental backups have the same problem as snapshots: because they are hard links, obsolete data files that they point to do not get deleted. It is recommended to run clearsnapshot before a new snapshot is created, and to make sure that the backups directory contains no old incremental backups.
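Enabling incremental backups is a one-line change in cassandra.yaml, and one way to make it is shown below (the configuration file path is an assumption; the node must be restarted for the change to take effect):

```shell
# Flip the incremental_backups setting in place.
sed -i 's/^incremental_backups: false/incremental_backups: true/' \
    /etc/cassandra/cassandra.yaml

# After a restart, new SSTables are hard-linked into:
#   <data_directory>/<keyspace>/<column_family>/backups/
```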
Backing up the data is just half the story. Backups are meaningful only when they can be restored, either to replace a node or to launch a whole new cluster from the backup data of another cluster. There is more than one way to restore a node; we will see two approaches here. The first approach is simply to copy the appropriate files to the appropriate locations. Perform the following steps:

1. Delete the .db files for the column family from the data directory; they are located under <data_directory>/<keyspace>/<column_family>. Do not delete anything other than the .db files. Also, delete the commit logs from the commitlog directory.
2. Find the snapshot directory whose contents you want to restore, and copy everything in that snapshot directory to the data directory mentioned in the previous step.
3. If you have enabled incremental backups, you need to take them into account too: copy all the incremental backup files (situated under <data_directory>/<keyspace>/<column_family>/backups) to the data directory, the same as we did with the snapshots in the previous step.
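The steps above can be sketched as a small helper function; all paths and names are placeholders, and a real restore should only be run on a stopped node:

```shell
#!/bin/sh
# restore_cf DATA_DIR COMMITLOG_DIR KEYSPACE COLUMN_FAMILY SNAPSHOT_TAG
restore_cf() {
    data_dir=$1; commitlog_dir=$2; ks=$3; cf=$4; tag=$5
    cf_dir="$data_dir/$ks/$cf"

    # Step 1: delete only the .db files, plus the commit logs.
    find "$cf_dir" -maxdepth 1 -name '*.db' -exec rm -f {} +
    rm -f "$commitlog_dir"/*

    # Step 2: copy the snapshot contents back into the data directory.
    cp "$cf_dir/snapshots/$tag/"* "$cf_dir/"

    # Step 3: layer any incremental backups on top of the snapshot.
    if [ -d "$cf_dir/backups" ]; then
        cp "$cf_dir/backups/"* "$cf_dir/" 2>/dev/null || true
    fi
}
```

For example, restore_cf /var/lib/cassandra/data /var/lib/cassandra/commitlog mykeyspace myCF snap1 would rebuild myCF from the snapshot tagged snap1.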
Here are a couple of things to note:

- Once the files are in place and the node is restarted, run nodetool repair on it.
- You can view the schema with desc schema in the cqlsh console. You may need to create the schema to be able to get the restoration working.

An alternative technique to load the data into Cassandra is the sstableloader utility. It can be found under the bin directory of the Cassandra installation. This tool is especially useful when the number of nodes or the replication strategy has changed, because unlike the copy method, it streams the appropriate data to the appropriate nodes, based on the configuration.
Assuming that you have the -Index.db and -Data.db files with you, here are the steps to use sstableloader:

1. The data (the .db files) being restored should be kept in a directory with the same name as the column family, inside a directory named after the keyspace. For example, if you are restoring a myCF column family in the mykeyspace keyspace, all the mykeyspace-myCF-hf-x-Data.db and mykeyspace-myCF-hf-x-Index.db files (where x is an integer) should be placed within the directory structure mykeyspace/myCF/.
2. Execute bin/sstableloader with this directory as its argument, for example, bin/sstableloader mykeyspace/myCF.

Cassandra's bulk loader simplifies the task to the extent that one can just store the backup in the exact same directory structure as required by sstableloader; whenever a restoration is required, just download the backup directory and execute sstableloader.
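Under those assumptions (the backup location, contact node address, and file names are illustrative), preparing the directory and streaming it might look like:

```shell
# Arrange the backup files in the keyspace/column-family layout.
mkdir -p mykeyspace/myCF
cp /backup/mykeyspace-myCF-hf-*-Data.db  mykeyspace/myCF/
cp /backup/mykeyspace-myCF-hf-*-Index.db mykeyspace/myCF/

# Stream the SSTables to the live cluster. Older versions read the target
# cluster from cassandra.yaml; newer ones need a contact node via -d.
bin/sstableloader -d 10.0.0.1 mykeyspace/myCF
```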
It can be observed that the backup procedure is very mechanical and can easily be automated to perform a daily backup using a cron job and a shell script. It may be a good idea to clear old snapshots once in a while and take a fresh snapshot thereafter.
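A minimal sketch of such an automated job, assuming a single keyspace called mykeyspace, the default data directory, and a /backup mount for the archives (all hypothetical):

```shell
#!/bin/sh
# Nightly backup: clear old snapshots, take a fresh one, archive it.
TAG="$(date +%F)"
nodetool clearsnapshot
nodetool snapshot -t "$TAG" mykeyspace
tar -czf "/backup/mykeyspace-$TAG.tar.gz" \
    /var/lib/cassandra/data/mykeyspace/*/snapshots/"$TAG"

# crontab entry running the script at 2 a.m. every day:
# 0 2 * * * /usr/local/bin/cassandra-backup.sh
```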
Backup
Coming from a traditional database background, one assumes that data backup is an essential part of data management: data must be backed up daily, stored on a hard disk, and kept in a safe place. This is a good idea, but it gets harder and less efficient as the data size grows to terabytes. With Cassandra, you can set up a configuration that makes it really hard to lose data. For example, with three data centers in Virginia (US East), California (US West), and Tokyo (Japan), where data is replicated across all three data centers, you will seldom need to worry about losing data. If you are nervous, you may have a cron job backing up the data from one of the data centers, at whatever interval matches the amount of risk you are willing to take. With this setup, in the rare event of both US data centers going down, you can still serve users without any repercussions. Things will catch up as soon as the data centers come back up.