MapReduce and HBase

HBase supports writing MapReduce jobs that process data in HBase tables through the org.apache.hadoop.hbase.mapreduce package. This package provides HBase-specific MapReduce input and output formats, a table-indexing MapReduce job, and many other MapReduce utilities, and it runs on top of the Hadoop MapReduce framework.

The following is a list of MapReduce classes provided by HBase (a short example that uses some of these classes to read from a table appears after the list):

  • Import: This utility imports a sequence file from HDFS that was previously written by the Export utility.
  • ImportTsv: This utility imports tab-separated values (TSV) files using a MapReduce job.
  • CellCounter: This counts the number of cells in an HBase table using a MapReduce job.
  • CopyTable: This is used to copy a table from one HBase cluster to another; the destination can also be a different table on the same cluster.
  • Driver: This is the Driver class for MapReduce jobs in HBase.
  • Export: This exports or writes data from an HBase table to a sequence file on an HDFS location for backup, using a MapReduce job.
  • GroupingTableMapper: This is used to extract grouping columns from the input record.
  • HFileOutputFormat2: This is used to write HFiles.
  • HLogInputFormat: This provides an input format for HLog files.
  • HRegionPartitioner<key, value>: This partitions the output keys into groups according to the regions of the table, so that each reducer fills a single region.
  • IdentityTableMapper: This passes the specified key and record to the Reduce phase.
  • IdentityTableReducer: This is a convenience class that simply writes all values passed to the configured HBase table.
  • KeyValueSortReducer: This emits sorted KeyValues.
  • LoadIncrementalHFiles: This loads the output of HFileOutputFormat into an existing HBase table.
  • MultiTableInputFormat: This converts HBase tabular data into a format that can be consumed by MapReduce.
  • MultiTableInputFormatBase: This is a base class for MultiTableInputFormats.
  • MultiTableOutputFormat: This is the Hadoop output format that writes into one or more HBase tables.
  • PutCombiner<K>: This groups Puts.
  • PutSortReducer: This emits a sorted list of Puts.
  • RowCounter: This runs a MapReduce job to count rows in a specified HBase table.
  • SimpleTotalOrderPartitioner<value>: This takes start and end keys and uses them to determine which reduce partition a given key belongs to.
  • TableInputFormat: This converts the HBase tabular data into a format that can be consumed by MapReduce.
  • TableInputFormatBase: This is the base class for TableInputFormats.
  • TableMapper<keyout, valueout>: This extends the base Mapper class to add the required input key and value classes.
  • TableOutputFormat<KEY>: This converts MapReduce output and writes it to an HBase table.
  • TableRecordReader: This iterates over HBase table data, returning row key and result pairs.
  • TableReducer<keyin, valuein, keyout>: This extends the basic Reducer class to add the required key and value I/O classes.
  • TableSnapshotInputFormat: This is used to run a MapReduce over a table snapshot.
  • TableSplit: This represents a split of a table (a range of rows) that is used as a MapReduce input split.
  • TextSortReducer: This emits a sorted key-value pair.
  • TsvImporterMapper: This writes content of an HBase table to files on HDFS.
  • TsvImporterTextMapper: This writes table content to map output files.
  • WALPlayer: This is used to replay WAL files using the MapReduce job.

Note

The updated and most recent MapReduce utilities can be found at http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html.
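
Before looking at the full word count program later in this section, the following is a minimal sketch of the read path: a map-only job that scans an HBase table through TableInputFormat (wired up with TableMapReduceUtil) and counts cells with a TableMapper, similar in spirit to the CellCounter utility. The class name HBaseCellCountSketch, the table name myTable, and the counter names are placeholders, and the cluster settings are assumed to come from the hbase-site.xml file on the classpath:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class HBaseCellCountSketch {
  // Mapper that receives one row (Result) per call and adds its cell count to a job counter
  public static class CellCountMapper extends TableMapper<NullWritable, NullWritable> {
    public void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      context.getCounter("sketch", "CELLS").increment(value.size());
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "Cell count sketch");
    job.setJarByClass(HBaseCellCountSketch.class);
    Scan scan = new Scan();       // full-table scan; narrow it with start/stop rows if needed
    scan.setCaching(500);         // fetch rows in batches for better scan throughput
    scan.setCacheBlocks(false);   // do not fill the block cache from a MapReduce scan
    // Wires TableInputFormat, the table name, the scan, and the mapper into the job
    TableMapReduceUtil.initTableMapperJob("myTable", scan, CellCountMapper.class,
        NullWritable.class, NullWritable.class, job);
    job.setNumReduceTasks(0);     // map-only job; the counter holds the result
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}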

We can run these MapReduce utilities from the command line using a command of the following form:

hadoop jar ${HBASE_HOME}/hbase-0.90.0.jar <utility name from the jar file> <list of parameters>

The utility names can be as follows (example invocations appear after this list):

  • completebulkload: This is used for bulk loading data
  • copytable: This is used to copy a table from one cluster to a peer cluster
  • export: This exports a table to HDFS
  • import: This imports exported data
  • importtsv: This imports data that is in TSV format
  • rowcounter: This counts the number of rows in an HBase table
  • verifyrep: This is used to compare the data from tables in two different clusters
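
For example, assuming a table named myTable with a column family cf (the table name, column mapping, and input path below are placeholders), rowcounter and importtsv can be invoked as follows:

hadoop jar ${HBASE_HOME}/hbase-0.90.0.jar rowcounter myTable
hadoop jar ${HBASE_HOME}/hbase-0.90.0.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,cf:col1 myTable /user/hduser/input.tsv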

Note

More information about the Hadoop MapReduce framework can be found at http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html.

Let's now look at a MapReduce code example for HBase. The following word count example reads a text file from HDFS, counts the words in it, and writes the counts to an HBase table:

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;


public class HBaseMapRedExampleClassWordCount {
  // Mapper: tokenizes each input line and emits (word, 1) pairs
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable count = new IntWritable(1);
    private Text textToEmit = new Text();
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer strTokenizerObj = new StringTokenizer(value.toString());
      while (strTokenizerObj.hasMoreTokens()) {
        textToEmit.set(strTokenizerObj.nextToken());
        context.write(textToEmit, count);
      }
    }
  }
  // Reducer: sums the counts for each word and writes a Put to the HBase table,
  // using the word as the row key and colFam:count as the column
  public static class Reduce extends TableReducer<Text, IntWritable, NullWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int total = 0;
      Iterator<IntWritable> iterator = values.iterator();
      while (iterator.hasNext()) {
        total += iterator.next().get();
      }
      Put put = new Put(Bytes.toBytes(key.toString()));
      put.add(Bytes.toBytes("colFam"), Bytes.toBytes("count"), Bytes.toBytes(String.valueOf(total)));
      context.write(NullWritable.get(), put);
    }
  }
  // Creates the target HBase table with a single column family, dropping it first if it already exists
  public static void createHBaseTable(String hbaseMapRedTestTableObj) throws IOException {
    HTableDescriptor tableDescriptorObj = new HTableDescriptor(hbaseMapRedTestTableObj);
    HColumnDescriptor column = new HColumnDescriptor("colFam");
    tableDescriptorObj.addFamily(column);
    Configuration configObj = HBaseConfiguration.create();
    configObj.set("hbase.zookeeper.quorum", "infinity");
    configObj.set("hbase.zookeeper.property.clientPort", "2222");
    HBaseAdmin hAdmin = new HBaseAdmin(configObj);
    if (hAdmin.tableExists(hbaseMapRedTestTableObj)) {
      System.out.println("Table exists; dropping it first");
      hAdmin.disableTable(hbaseMapRedTestTableObj);
      hAdmin.deleteTable(hbaseMapRedTestTableObj);
    }
    System.out.println("Creating table " + hbaseMapRedTestTableObj);
    hAdmin.createTable(tableDescriptorObj);
  }
  public static void main(String[] args) throws Exception {
    String hbaseMapRedTestTableObj = "hbaseMapReduceTest";
    HBaseMapRedExampleClassWordCount.createHBaseTable(hbaseMapRedTestTableObj);
    // Point the job at the cluster and name the HBase table that receives the output
    Configuration configObj = new Configuration();
    configObj.set("mapred.job.tracker", "infinity:9001");
    configObj.set("hbase.zookeeper.quorum", "infinity");
    configObj.set("hbase.zookeeper.property.clientPort", "2222");
    configObj.set(TableOutputFormat.OUTPUT_TABLE, hbaseMapRedTestTableObj);
    Job job = new Job(configObj, "HBase WordCount MapReduce");
    job.setJarByClass(HBaseMapRedExampleClassWordCount.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    // Read plain text from HDFS and write the results to the HBase table
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TableOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path("<input text file path on HDFS>"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
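
To run this example, the HBase classes must be on the job's classpath. One common way (a sketch, not the only option) is to let the hbase script supply the classpath; the jar name hbase-wordcount.jar below is a placeholder for the jar you build from this class:

HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` hadoop jar hbase-wordcount.jar HBaseMapRedExampleClassWordCount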

We can write MapReduce code over HBase data for many different scenarios, depending entirely on the requirements. Because HBase stores data as key-value pairs, it maps naturally onto the MapReduce programming model.

