Chapter 7. Scripting in HBase

This chapter discusses the various scripts and options that we can write and use to master HBase operations. We will also cover some remaining value additions and techniques, as well as more exceptions and errors that you might face in HBase.

In this chapter, we will discuss the following topics:

  • HBase backup and restore methodologies
  • HBase on Windows
  • Scripting in HBase
  • Value addition
  • More on exceptions and errors in HBase

HBase backup and restore techniques

Now, we will look at the HBase backup and restore methods. For any technology, it is very important to be able to back up and restore data in order to avoid data loss.

We will now discuss these methods in detail. Broadly, there are two kinds of HBase backup methodologies. The following are the backup methods that we can choose from according to our requirements and setup:

  • Offline backup / full-shutdown backup
    • Use the hadoop distcp command
  • Online backup
    • Snapshots
    • Replication
    • Export
    • CopyTable
    • HTable API
    • Offline backup of HDFS data
    • Backup using a Mozilla tool
    • HDFS replication

Let's get started with offline backup.

Offline backup / full-shutdown backup

This method involves a full-shutdown backup of HBase on the filesystem, using the distcp command, which runs as a MapReduce job. It copies data in parallel from one location to another, which can be a backup location on the same cluster or on another backup cluster. This method is not recommended for a live cluster or a cluster that needs zero downtime; however, if users can invest in a downtime, we can go with it.

Backup

Suppose we have an HDFS location as hdfs://namenode:9000/hbase, where the complete HBase data is located. Here, we can copy the whole location to either the same cluster or another cluster using distcp.

The following is the syntax of distcp:

hadoop [generic options] distcp
    [-p [rbugp]] [-i] [-log <logdir>] [-m <num_maps>] [-overwrite]
    [-update] [-f <URI list>] [-filelimit <n>] [-sizelimit <n>]
    [-delete] <source> <destination>

Note

More on this command can be found at http://hadoop.apache.org/docs/r1.2.1/distcp2.html.

If you prefer the former option, use the following command to create the backup on the same cluster:

hadoop distcp hdfs://Infinity1:9000/hbase hdfs://Infinity1:9000/hbaseBackup/backup1

The preceding command will copy the hbase directory as it is to /hbaseBackup/backup1 on the HDFS of the same cluster.

If you prefer the latter option, you can create the backup on a different cluster using the following command:

hadoop distcp hdfs://Infinity1:9000/hbase hdfs://Infinity2:9000/hbaseBackup/backup1

The preceding command will copy the hbase directory as it is to /hbaseBackup/backup1 on the HDFS of another cluster.

Note

Note that for the distcp command to work, we need to have JobTracker and TaskTracker running as they are needed for MapReduce tasks.

While copying, distcp also provides parameters such as -overwrite and -update; we can use these in combination too, as shown in the sketch that follows.
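
As a minimal sketch, reusing the Infinity1 example cluster from the preceding commands, an incremental refresh of an existing backup directory could look like this:

hadoop distcp -update hdfs://Infinity1:9000/hbase hdfs://Infinity1:9000/hbaseBackup/backup1

The -update flag copies only the files whose size differs between source and target, whereas -overwrite unconditionally rewrites the files at the destination.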

If something goes wrong, we can copy the backup directory back to the HBase directory. This method copies the full filesystem directory, so it will take the same amount of space on HDFS. It is better to have a separate backup cluster, or to copy this data offline to tape.

Restore

Restoring can be done in the same way: using distcp, copy the data back from the target to the source cluster from where it was originally copied.
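
For instance, reversing the cross-cluster backup shown earlier (with HBase shut down on the destination cluster first), a minimal restore sketch would be:

hadoop distcp hdfs://Infinity2:9000/hbaseBackup/backup1 hdfs://Infinity1:9000/hbase

Once the copy completes, we can start HBase again and verify the tables from the HBase shell.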

Tip

This method is not always preferable as it requires a full shutdown of the cluster, and the resulting downtime will impact business requirements and SLAs. So, this is the least preferable way of taking a backup.

Online backup

The online backup process is preferred as there is no need to shut down the cluster; it does not hamper the operation of the cluster, and thus we don't need a downtime. This category of backup includes the following methods to take and restore a backup. Let's discuss them.

The HBase snapshot

The HBase snapshot enables us to take a snapshot of a table with minimal impact on RegionServers. A snapshot in HBase is a set of metadata information that allows an administrator to get back to a previous working state of the table. A snapshot is not a copy of the table data, but just a layout of the data present on the HBase cluster. We can think of it as a set of operations that keeps track of metadata (table info and regions) and data (HFiles, MemStore, and WALs). No data is copied while a snapshot of a table is being taken.

There are two types of snapshots, online and offline.

Online

Online snapshots can be taken when a table is enabled and active for I/O operations. In this case, the master receives the snapshot request and asks each RegionServer to take a snapshot of the regions for which it is responsible.

Offline

Offline snapshots can be taken when the table is disabled and closed for I/O operations. The master performs this operation, and the time required is determined mainly by the time taken by the HDFS NameNode to provide the list of files.

Now, we will see how to take a snapshot and back up data using the offline method:

  1. Set the required configuration. We need to add the following lines in hbase-site.xml to enable this feature:
    <property>
        <name>hbase.snapshot.enabled</name>
        <value>true</value>
     </property>

    Restart the cluster once this change is made.

  2. Go to HBase shell and execute the following command to take a snapshot:
    hbase > snapshot 'emptable', 'baksnapshot01082014'
    
  3. Use the following command to list snapshots:
    hbase > list_snapshots
    
  4. Delete a snapshot using the following command:
    hbase > delete_snapshot 'baksnapshot01082014'
    
  5. Clone a table from a snapshot, as follows:
    hbase >  clone_snapshot 'baksnapshot01082014', 'newSnapTable'
    
  6. Restore a snapshot as follows:
    hbase > disable 'emptable'
    hbase > restore_snapshot 'baksnapshot01082014'
    
  7. We can also use MapReduce to export a snapshot to another HDFS cluster.

    The org.apache.hadoop.hbase.snapshot.ExportSnapshot tool copies all the data related to a snapshot (HFiles, logs, and snapshot metadata) to another cluster:

    hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot baksnapshot01082014 -copy-to hdfs://infinity:9000/hbase -mappers 8 -bandwidth 100
    

    So, here, eight map tasks will run to export the snapshot to another cluster, with the bandwidth limited to 100 MB/s.
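
Once the export completes, we can verify and materialize the snapshot on the target cluster. The following is a minimal sketch, run from the target cluster's HBase shell (newTable being a hypothetical table name):

hbase > list_snapshots
hbase > clone_snapshot 'baksnapshot01082014', 'newTable'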

The HBase replication method

The HBase replication method provides a mechanism to copy or replicate data from one HBase cluster to another. It can serve as a disaster recovery solution and contributes to higher availability at the HBase layer. In this method, data is pushed from one cluster to another, where the pusher is the master setup and the receiver is the slave setup. The replication, or pushing, happens asynchronously, which also provides high availability of the HBase cluster. The replication is based on the HLog of each RegionServer. There are three types of replication, as follows:

  • Master-slave: In this type of replication, data is pushed from one cluster to the target cluster in a single direction. The source cluster can have its own tables, and the target cluster might have its own too; in addition, the target cluster holds the data replicated from the source.
  • Master-master: In this method, data (which might be in the same or a different table) is sent in both directions between the two clusters. So, both clusters act as master and slave at the same time, pushing and receiving data.
  • Cyclic: In this setup, more than two HBase clusters take part in the replication, with various possible combinations of master-slave and master-master setups between any two of them.

Setting up cluster replication

The following are the prerequisites to set up cluster replication:

  • All machines in both clusters should be able to communicate with every machine in the other cluster
  • Both clusters must run the same Hadoop/HBase version
  • Each table to be replicated should contain the same column families; in other words, the same schemas and tables must exist on every cluster with the exact same names
  • For multiple slaves, master-master, or cyclic replications, HBase Version 0.92 or higher is needed
  • The package that is responsible for cluster replication is org.apache.hadoop.hbase.replication

The following are the deployment steps to set up cluster replication:

  1. Open the hbase-site.xml file and put the following lines of code in it:
    <property>
      <name>hbase.replication</name>
      <value>true</value>
    </property>

    These changes need a full cluster restart for the configuration to be loaded.

  2. Run the following command from the master-cluster-HBase shell:
    add_peer 'ID', 'CLUSTER_KEY'
    

    Here, ID should be a short integer, and for CLUSTER_KEY, follow this template (a concrete example appears after these steps):

    hbase.zookeeper.quorum:hbase.zookeeper.property.clientPort:zookeeper.znode.parent
    
  3. Once we have the peer added, we need to enable replication on the column families. One way is to alter the table and change the replication scope of the column family using the following commands:
    disable 'table'
    alter 'table', {NAME => 'colFam', REPLICATION_SCOPE => '1'}
    enable 'table'
    

    Here, a REPLICATION_SCOPE of 0 means the column family will not be replicated, and 1 means it will be.

  4. To list the peers, execute the following command:
    hbase > list_peers
    

    For replication-related commands, refer to the previous chapter.

  5. We can verify the setup by looking into any RegionServer log, where we should find lines such as the following:
    Considering 1 rs, with ratio 0.1
    Getting 1 rs from peer cluster # 0
    Choosing peer <ipaddress_regionserver>:<regionServerPort>
  6. If the preceding lines are present in the RegionServer logs, the setup is replicating; we can also verify this from the target cluster.
  7. To stop replication at any point in time, we can use the stop_replication command at the source-HBase shell.
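
Putting steps 2 and 3 together, the following is a minimal sketch of a shell session on the master cluster, assuming a hypothetical slave whose ZooKeeper quorum is zk1.example.com on port 2181 with /hbase as the znode parent:

hbase > add_peer '1', 'zk1.example.com:2181:/hbase'
hbase > disable 'emptable'
hbase > alter 'emptable', {NAME => 'colFam', REPLICATION_SCOPE => '1'}
hbase > enable 'emptable'
hbase > list_peers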

Backup and restore using Export and Import commands

In this section, we will have a look at the Export and Import commands in detail.

Export

The Export utility is provided by the HBase JAR file; it writes the content of an HBase table to a sequence file on HDFS (on the same cluster or another cluster). It runs as a MapReduce job and, using the HBase API, reads each row one by one and writes it to the HDFS location. Using this utility, we can take full as well as incremental backups of a live cluster, as it takes start and end timestamps as parameters.

The syntax for this is as follows:

hbase org.apache.hadoop.hbase.mapreduce.Export <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]

Here, tablename is the table that is to be exported; outputdir is the target where the output sequence file will be written, which can be an HDFS location on the same or a different cluster; versions is the number of versions to be exported; and starttime and endtime define the timestamp range within which a cell's timestamp must lie for it to be exported.
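
For example, a minimal sketch of an incremental export, assuming a hypothetical table emptable, three versions, and epoch-millisecond timestamps marking the backup window, might be:

hbase org.apache.hadoop.hbase.mapreduce.Export emptable /backup/emptable_inc 3 1406851200000 1406937600000

Only cells whose timestamps fall within this range are written to the sequence file, which is what makes periodic incremental backups possible.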

Import

The Import utility is used to read the exported sequence file (the output of the Export command) and restore it into an HBase table, either a new table or an existing one. If data already exists, it will be overwritten.

The syntax for this is as follows:

hbase org.apache.hadoop.hbase.mapreduce.Import <tablename> <inputdir>

Here, tablename is the name of the table into which the data is to be imported/restored, and inputdir is the path of the sequence file written by the Export command.
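
Continuing the hypothetical export sketched previously, the matching restore would be:

hbase org.apache.hadoop.hbase.mapreduce.Import emptable /backup/emptable_inc

Note that the target table must exist before running the import; when restoring into a new table, create it first with the same column families.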

The Export and Import commands work well with a live cluster; they can also be run as Hadoop MapReduce jobs using the following commands:

hadoop jar <full path of hbase-*.jar> export <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]
hadoop jar <full path of hbase-*.jar> import <tablename> <inputdir>

We can pass runtime parameters to these commands, such as -D mapred.output.compress=true; this and other required parameters can be given after the export keyword, as sketched below.
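
As a sketch, a compressed export of the hypothetical emptable table could also be run through the driver class directly:

hbase org.apache.hadoop.hbase.mapreduce.Export -D mapred.output.compress=true emptable /backup/emptable_gz

Here, the -D option is passed before the positional arguments so that the MapReduce job picks it up.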

If the export operation was done using HBase v0.94 and the same data has to be imported using a newer version, we can specify the runtime parameter as follows:

hbase -Dhbase.import.version=0.94 org.apache.hadoop.hbase.mapreduce.Import <tablename> <inputdir>

Miscellaneous utilities

ImportTsv is a utility that loads data in Tab-Separated Values (TSV) format into an HBase table. The syntax for it is as follows:

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c <tablename> <hdfs-inputdir>
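
Note that one of the columns must be HBASE_ROW_KEY, which tells ImportTsv which field of each line becomes the row key. The following is a sketch with hypothetical table, family, and path names:

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,cf:name,cf:city emptable /input/employees.tsv

Here, the first tab-separated field of each line becomes the row key, and the next two fields go to the name and city qualifiers of the cf column family.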

CopyTable

The CopyTable tool can copy a part of the table, or the whole table, to the same cluster, or to another cluster. The target table should already be present with the same schema.

The following is the syntax to use this tool:

hbase org.apache.hadoop.hbase.mapreduce.CopyTable [general options] [--starttime=X] [--endtime=Y] [--new.name=NEW] [--peer.adr=ADR] <tablename>

The following are the options of the preceding syntax:

  • rs.class: This is the hbase.regionserver.class of the peer cluster. It's specified if different from the current cluster.
  • rs.impl: This is the hbase.regionserver.impl of the peer cluster.
  • startrow: This is the start row.
  • stoprow: This is the stop row.
  • starttime: This is the beginning of the time range. If no endtime is specified, it means from starttime to forever.
  • endtime: This is the end of the time range. It will be ignored if no starttime is specified.
  • versions: This is the number of cell versions to copy.
  • new.name: This is the name of a new table.
  • peer.adr: This is the address of the peer cluster given in the hbase.zookeeper.quorum:hbase.zookeeper.client.port:zookeeper.znode.parent format.
  • families: This is a comma-separated list of families to copy. To copy from cf1 to cf2, give sourceCfName:destCfName. To keep the same name, just give cfName.
  • all.cells: This copies deleted markers and deleted cells.

The only argument is tablename, which is the name of the table to copy.

Have a look at the following example:

hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=Copynew table

Here, new.name (Copynew) is the name of the new, copied table, and table is the name of the table to be copied.
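
The following is a sketch of a cross-cluster copy, with a hypothetical peer address and assuming that a table of the same name and schema already exists on the peer cluster:

hbase org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=zk1.example.com:2181:/hbase --families=cf emptable

This copies the cf column family of emptable to the identically named table on the peer cluster.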

HTable API

We can always write our own custom application that utilizes the public HTable API (http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html) and queries the table directly. We can do this through MapReduce jobs, to utilize the framework's distributed batch-processing advantages, or through any other means of our own design. However, this approach requires a deep understanding of Hadoop development, of all the APIs involved, and of the performance implications of using them in a production cluster.

Backup using a Mozilla tool

An in-depth explanation of backup using the Mozilla tool is out of the scope of this book; refer to Mozilla's blog posts on their HBase backup tooling for more information.

The following table shows the comparison between different backup processes:

Backup process   Effect on performance   Data size    Downtime requirement   Incremental backup possible   Time for recovery
Snapshots        Minimal                 Very small   On restore             No                            Seconds
Replication      Minimal                 Huge         None                   Yes                           Seconds
Export           High                    Huge         None                   Yes                           High
CopyTable        High                    Huge         None                   Yes                           High
API              Medium                  Huge         None                   Yes                           High
Distcp           Downtime                Huge         Yes                    No                            Long

Note

There is one more backup option: increasing the replication factor of the HBase data, which is maintained at the Hadoop (HDFS) level and provides more availability and robustness. However, this needs more space; if space is not a constraint, we can use this hassle-free backup method, as sketched below.
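
As a sketch, assuming /hbase is the HBase root directory on HDFS, the replication factor of the existing data can be raised with the standard Hadoop filesystem shell (newly written files follow the dfs.replication value set in hdfs-site.xml):

hadoop fs -setrep -R 3 /hbase

Here, 3 is the desired number of replicas, and -R applies the change recursively to everything under /hbase.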
