This chapter covers various scripts and other options that we can consider and write to master HBase operations and functions. We will also look at the remaining value additions, techniques, and some more exceptions and errors that you might face in HBase.
In this chapter, we will discuss the following topics:
Now, we will look at the HBase backup and restore methods, as the ability to back up data and restore it to avoid data loss is vital for any technology.
We will now discuss these methods in detail. Broadly, there are two kinds of HBase backup methodologies. The following are the HBase backup methods that we can choose from according to our requirements and setup:
hadoop distcp
command

Let's get started with offline backup.
This method takes a full-shutdown backup of HBase on the file system using the distcp
command, which runs a MapReduce job that copies data in parallel from one location to another. The target can be a backup location on the same cluster or on a separate backup cluster. This method is not recommended for a live cluster or a cluster that needs zero downtime; if we can afford a downtime, we can go with this method.
Suppose we have an HDFS location as hdfs://namenode:9000/hbase
, where the complete HBase data is located. Here, we can copy the whole location to either the same cluster or another cluster using distcp
.
The following is the syntax of distcp
:
hadoop [ Generic Options ] distcp [-p [rbugp] ] [-i ] [-log ] [-m ] [-overwrite ] [-update ] [-f <URI list> ] [-filelimit <n> ] [-sizelimit <n> ] [-delete ] <source> <destination>
More on this command can be found at http://hadoop.apache.org/docs/r1.2.1/distcp2.html.
If you prefer the former way, use the following command to create the backup in the same cluster:
hadoop distcp hdfs://Infinity1:9000/hbase hdfs://Infinity1:9000/hbaseBackup/backup1
The preceding command will copy the hbase
directory as it is in /hbaseBackup/backup1
on the HDFS location of the same cluster.
If you prefer the latter way, you can create a backup in different clusters using the following command:
hadoop distcp hdfs://Infinity1:9000/hbase hdfs://Infinity2:9000/hbaseBackup/backup1
The preceding command will copy the hbase
directory as it is in /hbaseBackup/backup1
on the HDFS location on another cluster.
If something goes wrong, we can restore by copying the backup directory back to the HBase directory. This method copies the full file system directory, so the backup will take the same amount of space on HDFS. It is better to have a separate backup cluster, or to copy this data offline to tapes.
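A restore with distcp simply reverses the copy. The following sketch assembles the restore command for the same-cluster example above (the cluster name Infinity1 and the paths come from that example; HBase must be fully shut down before actually running the printed command):

```shell
# Sketch: reverse the backup copy to restore. The command is only printed
# here; run it on the real cluster, with HBase fully shut down.
SRC="hdfs://Infinity1:9000/hbaseBackup/backup1"
DST="hdfs://Infinity1:9000/hbase"
RESTORE_CMD="hadoop distcp $SRC $DST"
echo "$RESTORE_CMD"
```

Because distcp overwrites the target file by file, make sure no HBase daemons are writing to the target directory while the copy runs.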
The online backup process is preferred as there is no need to shut down the cluster; it does not hamper the operation of the cluster, so we don't need a downtime. This category of backup has the following methods to take and restore a backup. Let's discuss these.
The HBase snapshot enables us to take a snapshot of a table without fiddling with RegionServers. Snapshot in HBase is a set of metadata information that allows the administrator to get back to the previous working state of the table. A snapshot is not a copy of the table data, but just a layout of the data present on the HBase cluster. We can think of it as a set of operations to keep track of metadata (table info and regions) and data (HFiles, MemStore, and WALs). No data is copied during the process of taking a snapshot of a table.
There are two types of snapshots, online and offline.
Online snapshots can be taken when a table is enabled and active for I/O operations. In this case, the master receives the snapshot request and asks each RegionServer to take a snapshot of the regions for which it is responsible.
Offline snapshots can be taken when the table is disabled and not available for I/O operations. The master performs this operation, and the time required is determined mainly by the time taken by the HDFS NameNode to provide the list of files.
Now, we will see how to take a snapshot and back up data using the offline method:
Add the following property to hbase-site.xml
to enable this feature:

<property>
  <name>hbase.snapshot.enabled</name>
  <value>true</value>
</property>
Restart the cluster once this change is made.
hbase > snapshot 'emptable', 'baksnapshot01082014'
hbase > list_snapshots
hbase > delete_snapshot 'baksnapshot01082014'
hbase > clone_snapshot 'baksnapshot01082014', 'newSnapTable'
hbase > disable 'table'
hbase > restore_snapshot 'baksnapshot01082014'
The org.apache.hadoop.hbase.snapshot.ExportSnapshot
tool copies all the data related to a snapshot (HFiles, logs, and snapshot metadata) to another cluster:
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot snapshot01082014 -copy-to hdfs://infinity:9000/hbase -mappers 8 -bandwidth 100
Here, eight map tasks will run to export the snapshot to another cluster, with bandwidth limited to 100 MB/s.
The HBase replication method provides a mechanism to copy or replicate data from one HBase cluster setup to another. It can serve as a disaster recovery solution and contributes to higher availability at the HBase layer. In this method, data is pushed from one cluster to another; the pusher is the master and the receiver is the slave setup. This replication, or pushing, happens asynchronously, and it also provides high availability for an HBase cluster. The replication is based on the HLog from each RegionServer. There are three types of replication, as follows:
The following are the prerequisites to set up cluster replication:
org.apache.hadoop.hbase.replication
The following are the deployment steps to set up cluster replication:
Edit the hbase-site.xml
file and put the following lines of code in it:

<property>
  <name>hbase.replication</name>
  <value>true</value>
</property>
These changes need a full cluster restart for the configuration to be loaded.
add_peer 'ID', 'CLUSTER_KEY'
Here, ID
should be a short integer and for CLUSTER_KEY
, follow this template:
hbase.zookeeper.quorum:hbase.zookeeper.property.clientPort:zookeeper.znode.parent
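As a concrete, hypothetical example of this template: if the slave cluster's ZooKeeper quorum is zk1.example.com on client port 2181 with the default znode parent /hbase, the peer would be added from the master cluster's HBase shell as in the comment below (the sketch only assembles and prints the key; the host name is illustrative):

```shell
# Hypothetical values for the CLUSTER_KEY template:
#   hbase.zookeeper.quorum                : zk1.example.com
#   hbase.zookeeper.property.clientPort   : 2181
#   zookeeper.znode.parent                : /hbase
# In the HBase shell on the master cluster you would then run:
#   add_peer '1', 'zk1.example.com:2181:/hbase'
CLUSTER_KEY="zk1.example.com:2181:/hbase"
echo "add_peer '1', '$CLUSTER_KEY'"
```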
disable 'table'
alter 'table', {NAME => 'colFam', REPLICATION_SCOPE => '1'}
enable 'table'
Here, putting 0
means it will not replicate, and putting 1
means it has to replicate.
hbase > list_peers
For replication-related commands, refer to the previous chapter.
Considering 1 rs, with ratio 0.1
Getting 1 rs from peer cluster # 0
Choosing peer <ipaddress_regionserver>:<regionServerPort>
stop_replication
command at the source HBase shell.

In this section, we will have a look at the Export and Import commands in detail.
The Export utility is provided by the HBase JAR file; it writes the content of an HBase table to a sequence file on HDFS (on the same cluster or another cluster). It runs as a MapReduce job and, using the HBase API, reads rows one by one and writes them to the HDFS location. Using this, we can take full or incremental backups of a live cluster, as it takes start and end timestamps as parameters.
The syntax for this is as follows:
hbase org.apache.hadoop.hbase.mapreduce.Export <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]
Here, tablename is the table to be exported; outputdir is the target where the output sequence file will be written, which can be an HDFS location on the same cluster or on a different cluster; versions is the number of versions to be exported; and starttime and endtime define the time range: only data whose timestamps lie between them will be exported.
The Import utility reads the exported sequence file (the output of the Export
command) and restores it into an HBase table, which can be a new table or an existing one. If data already exists, it will be overwritten.
The syntax for this is as follows:
hbase org.apache.hadoop.hbase.mapreduce.Import <tablename> <inputdir>
Here, tablename is the name of the table where the data is to be imported/restored, and inputdir is the path where the data exported using the Export command resides.
The Export
and Import
commands work well with a live cluster and can also be run as Hadoop MapReduce using the following commands:
hadoop jar <full path of> hbase-*.jar export <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]] hadoop jar <full path of> hbase-*.jar import <tablename> <inputdir>
We can pass runtime parameters to these commands, such as -D mapred.output.compress=true
; other required parameters can be given in the same way for the export operation.
If the export operation is done using HBase v0.94, and the same data has to be imported from a newer version, we can specify the runtime parameter as follows:
hbase org.apache.hadoop.hbase.mapreduce.Import -Dhbase.import.version=0.94 <tablename> <inputdir>
ImportTsv is a utility that loads data in tab-separated values (TSV) format into an HBase table. The syntax is as follows:
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c <tablename> <hdfs-inputdir>
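To make the column mapping concrete, here is a hypothetical sketch: a small TSV file loaded into a table emptable with a column family cf. The special HBASE_ROW_KEY name tells ImportTsv which field is the row key; the file path, table name, and column names below are illustrative only:

```shell
# Create a sample tab-separated file (hypothetical data: row key, name, age).
printf 'row1\tAlice\t30\nrow2\tBob\t25\n' > /tmp/employees.tsv

# The first field becomes the row key; the rest map to cf:name and cf:age:
#   hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
#     -Dimporttsv.columns=HBASE_ROW_KEY,cf:name,cf:age emptable /tmp/employees.tsv
cat /tmp/employees.tsv
```

The column family (cf here) must already exist in the target table; ImportTsv creates cells, not families.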
The CopyTable tool can copy part of a table, or the whole table, to the same cluster or to another cluster. The target table must already exist with the same schema.
The following is the syntax to use this tool:
hbase org.apache.hadoop.hbase.mapreduce.CopyTable [general options] [--starttime=X] [--endtime=Y] [--new.name=NEW] [--peer.adr=ADR] <tablename>
The following are the options of the preceding syntax:
rs.class: This is the hbase.regionserver.class of the peer cluster. It is specified if different from the current cluster.
rs.impl: This is the hbase.regionserver.impl of the peer cluster.
startrow: This is the start row.
stoprow: This is the stop row.
starttime: This is the beginning of the time range. If no endtime is specified, it means from starttime to forever.
endtime: This is the end of the time range. This will be ignored if no starttime is specified.
versions: This is the number of cell versions to copy.
new.name: This is the name of the new table.
peer.adr: This is the address of the peer cluster, given in the hbase.zookeeper.quorum:hbase.zookeeper.property.clientPort:zookeeper.znode.parent format.
families: This is a comma-separated list of families to copy. To copy from cf1 to cf2, give sourceCfName:destCfName. To keep the same name, just give cfName.
all.cells: This copies delete markers and deleted cells.

The only argument is tablename, which is the name of the table to copy.
Have a look at the following example:
hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=Copynew table
Here, new.name
is the name of the copied table, and table
is the one to be copied.
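Combining the options above, a sketch of copying a table to a peer cluster might look like the following; the ZooKeeper address, family, and table name are hypothetical, and the command is only printed here:

```shell
# Hypothetical peer cluster key, following the peer.adr format described above.
PEER="zk1.example.com:2181:/hbase"

# Copy only the cf1 family of table 'emptable' to the peer cluster.
echo "hbase org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=$PEER --families=cf1 emptable"
```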
We can always write our own custom application that utilizes the public API (http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html) and queries the table directly. We can do this through MapReduce jobs in order to utilize the framework's distributed-batch-processing advantages, or through any other means of our design. However, this approach requires a deep understanding of Hadoop development and all the APIs and performance implications of using them in your production cluster.
An in-depth explanation of backup using a Mozilla tool is beyond the scope of this book, so visit the following links for more information on this:
The following table shows the comparison between different backup processes:
| Backup process | Effect on performance | Data size | Downtime requirement | Incremental backup possible? | Time for recovery |
|---|---|---|---|---|---|
| Snapshots | Minimal | Very small | On restore | No | Seconds |
| Replication | Minimal | Huge | None | Yes | Seconds |
| Export | High | Huge | None | Yes | High |
| CopyTable | High | Huge | None | Yes | High |
| API | Medium | Huge | None | Yes | High |
| Distcp | Downtime | Huge | Yes | No | Long |
We have another backup option: increasing the replication factor of the HBase data, which is maintained at the Hadoop level and provides more availability and robustness. However, this needs more space, and if space is not a constraint, we can use this hassle-free backup method.
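As a sketch, assuming the default HBase root directory /hbase on HDFS, the replication factor could be raised with the standard hdfs dfs -setrep command. The target factor of 4 is illustrative, and the command is printed here rather than run:

```shell
# Raise the HDFS replication factor of the HBase data (recursively via the
# directory argument); -w waits for the replication to complete. Needs more
# space, but tolerates one more node failure than the usual factor of 3.
SETREP_CMD="hdfs dfs -setrep -w 4 /hbase"
echo "$SETREP_CMD"
```

Note that new files written by HBase afterwards still use the value of dfs.replication, so a permanent change belongs in the HDFS configuration as well.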