Backup and restoration

Cassandra provides a simple backup tool, nodetool snapshot, for taking snapshots of the data. The snapshot command flushes MemTables to disk and creates the backup by making hard links to the SSTable files (which is cheap and safe, because SSTables are immutable).

Note

A hard link is a directory entry associated with file data on a filesystem. It can roughly be thought of as an alias for a file, pointing to the location where the data is stored. This is unlike a soft link, which aliases only the filename, not the underlying data.

These hard links are created under the data directory, in <keyspace>/<column_family>/snapshots.

The general plan to back up a cluster roughly follows these steps:

  1. Take a snapshot of each node, one by one. The snapshot command provides an option to specify whether to back up an entire keyspace or just selected column families.
  2. Taking the snapshot is only half of the story. To be able to restore the database at a later point, you need to move the snapshots to a location that cannot be affected by the node's hardware failure or unavailability. One of the easiest options is to move the data to network-attached storage. For AWS users, it is fairly common to copy the snapshots to an S3 bucket.
  3. Once you are done backing up the snapshots, you need to clean them up. The nodetool clearsnapshot command removes all the snapshots on a node.
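These three steps can be sketched as a shell script. The keyspace name, bucket, and paths here are hypothetical, and the run wrapper only prints each command; drop the echo to execute for real:

```shell
#!/bin/sh
# Sketch of the per-node backup cycle: snapshot, ship off-node, clean up.
run() { echo "+ $*"; }    # print only; remove the echo to run for real

SNAP_TAG="daily_$(date +%Y%m%d)"

# 1. Snapshot one keyspace (omit the keyspace name to snapshot everything).
run nodetool snapshot -t "$SNAP_TAG" my_keyspace

# 2. Copy the snapshot to a safe location, for example an S3 bucket
#    (assumes the AWS CLI is installed and configured).
run aws s3 sync /var/lib/cassandra/data/my_keyspace "s3://my-backups/$(hostname)/$SNAP_TAG/"

# 3. Reclaim the disk space held by the snapshot's hard links.
run nodetool clearsnapshot -t "$SNAP_TAG"
```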

It is important to understand that a snapshot is just a set of hard links to the data files. Once a data file becomes obsolete (for example, after compaction), the snapshot's hard link still pins its data to the disk. This unnecessary disk space usage can be avoided by running clearsnapshot after the snapshots have been copied to a different location.

For really large datasets, it may be impractical to back up an entire keyspace on a daily basis, and it is expensive to transfer that much data over the network to a safe location. Instead, you can take one snapshot, copy it to a safe location, and from then on move only the data written after the snapshot. This is called incremental backup. To enable incremental backup on a node, edit cassandra.yaml and set incremental_backups: true. This causes a hard link to be created in the backups directory (under the data directory) for every newly flushed SSTable.
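The change itself is a single line in cassandra.yaml (the file's location varies by installation):

```yaml
# cassandra.yaml (excerpt)
# Hard-link each newly flushed SSTable into the column family's backups directory.
incremental_backups: true
```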

With incremental backup enabled, you therefore have the snapshots plus a backup of every SSTable created after the snapshot was taken. Incremental backups suffer from the same problem as snapshots: because they are hard links, obsolete data files do not get deleted and keep occupying disk space. It is recommended that you run clearsnapshot before a new snapshot is created, and make sure that the backups directory contains no old incremental backups (after they have been copied to a safe location).

Backing up the data is just half the story. Backups become meaningful when they are restored, either to replace a failed node or to launch a whole new cluster from the backup data of another cluster. There is more than one way to restore a node; we will look at two approaches here. The first is to simply copy the appropriate files to the appropriate locations. Perform the following steps:

  1. Shut down the node that is going to be restored. Delete the .db files for the column family from the data directory; they are located under <data_directory>/<keyspace>/<column_family>. Do not delete anything other than the .db files. Also, delete the commit logs from the commitlog directory.
  2. From the backup, take the snapshot directory that you want to restore. Copy everything in that snapshot directory into the data directory mentioned in the previous step.

    If you have enabled incremental backup, you may want to take it into account too: copy all the incremental backups (situated under <data_directory>/<keyspace>/<column_family>/backups) into the data directory, just as you did with the snapshots in the previous step.

  3. Restart the node.
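The restore steps can be sketched as follows. The data directory, backup location, and service name are all hypothetical, and the run wrapper only prints each command; drop the echo to execute for real:

```shell
#!/bin/sh
# Sketch of the copy-based restore for a single column family.
run() { echo "+ $*"; }    # print only; remove the echo to run for real

DATA_DIR=/var/lib/cassandra/data
KS=my_keyspace
CF=my_cf
SNAP=/mnt/backup/snapshots/my_snapshot     # where the snapshot was copied

run service cassandra stop                          # 1. shut the node down
run rm "$DATA_DIR/$KS/$CF"/*.db                     #    clear only the .db files
run rm /var/lib/cassandra/commitlog/*.log           #    and the commit logs
run cp "$SNAP"/* "$DATA_DIR/$KS/$CF/"               # 2. restore the snapshot files
run cp /mnt/backup/incrementals/* "$DATA_DIR/$KS/$CF/"  #    plus incrementals, if any
run service cassandra start                         # 3. restart the node
```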

Here are a couple of things to note:

  • If you are restoring the complete cluster, shut down the cluster while restoring and restore the nodes one by one.
  • Once the restoration process is done, execute nodetool repair.
  • If you are trying to restore on a completely new machine that knows nothing about the keyspace being restored, it is worth checking the schema for the keyspace and the column family that you want to restore. The schema can be queried by executing desc schema in the cqlsh console. You may need to recreate the schema before the restoration will work.

Using the Cassandra bulk loader to restore the data

An alternative technique for loading the data into Cassandra is the sstableloader utility, found under the bin directory of the Cassandra installation. This tool is especially useful when the number of nodes or the replication strategy has changed because, unlike the copy method, it streams the appropriate data to the appropriate nodes based on the cluster's configuration.

Assuming that you have the -Index.db and -Data.db files with you, here are the steps to use sstableloader:

  1. Check the node's schema. If it does not have the keyspaces and the column families that are being restored, create the appropriate keyspaces and the column families.
  2. Create a directory with the same name as the keyspace that is being loaded. Inside this directory, the data (the .db files) for each column family being restored should be kept in a subdirectory named after that column family. For example, if you are restoring a myCF column family in a keyspace named mykeyspace, all the mykeyspace-myCF-hf-x-Data.db and mykeyspace-myCF-hf-x-Index.db files (where x is an integer) should be placed within the directory structure mykeyspace/myCF/.
  3. Finally, execute bin/sstableloader mykeyspace/myCF.
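As a sketch, the staging layout from step 2 can be built like this. The file names are dummies standing in for your real backup files, and the final loader command is only printed:

```shell
#!/bin/sh
# Build the directory layout sstableloader expects: <keyspace>/<column_family>/
STAGE=$(mktemp -d)
mkdir -p "$STAGE/mykeyspace/myCF"

# In practice, copy the real .db files from your backup; dummies shown here.
touch "$STAGE/mykeyspace/myCF/mykeyspace-myCF-hf-1-Data.db"
touch "$STAGE/mykeyspace/myCF/mykeyspace-myCF-hf-1-Index.db"

# Point the loader at the column family directory to stream the SSTables.
echo "would run: bin/sstableloader $STAGE/mykeyspace/myCF"
```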

Cassandra's bulk loader simplifies the task to the extent that you can store the backup in the exact directory structure that sstableloader requires; whenever a restoration is needed, just download the backup directory and run sstableloader against it.

As you can see, the backup process is very mechanical and can easily be automated as a daily cron job driving a shell script. It is a good idea to clear the old snapshots once in a while and take a fresh snapshot thereafter.
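A minimal sketch of such a cron-driven script, assuming nodetool is on the PATH; the run wrapper only prints each command, so drop the echo to execute for real:

```shell
#!/bin/sh
# Daily snapshot cycle, meant to be run from cron, for example:
#   0 2 * * * /usr/local/bin/daily_snapshot.sh    (hypothetical path)
run() { echo "+ $*"; }    # print only; remove the echo to run for real

# Clear the old snapshots, then take a fresh one tagged with today's date.
run nodetool clearsnapshot
run nodetool snapshot -t "daily_$(date +%Y%m%d)"
```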

Note

Backup

Coming from a traditional database background, one tends to think of backup as an essential part of data management: data must be backed up daily, stored on a hard disk, and kept in a safe place. That is a good practice, but it becomes harder and less efficient as the data grows to terabytes. With Cassandra, you can set up a configuration that makes it really hard to lose data. For example, with three data centers in Virginia (US East), California (US West), and Tokyo (Japan), and data replicated across all three, you will seldom need to worry about losing data. If you are nervous, you can have a cron job back up the data from one of the data centers at whatever interval matches the risk you are willing to take. With this setup, in the rare event of both US data centers going down, you can keep serving users without any repercussions; things will catch up as soon as the data centers come back up.
