Impala and HBase

HBase is a popular nonrelational database on Hadoop that stores data in a column-oriented model. Like Hive, HBase uses HDFS as its data storage layer, and its tables can serve as input or output for MapReduce jobs. The key difference between Hive and HBase is that HBase is a complete nonrelational database running on Hadoop, while Hive is a data warehouse layer that processes data through SQL-like statements. HBase organizes data into tables, rows, and column families, but it has no SQL interface of its own; data is read and written through its shell and client API.

Impala can also query data in HBase tables. Regular Impala tables are backed by datafiles stored on HDFS, which suits bulk loads and full-table-scan queries; HBase, by contrast, is efficient at individual row lookups and range lookups. Impala treats HBase as a key-value store in which the row key is mapped to one column of the Impala table and the value fields are mapped to the other columns.
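To make the key-value view concrete, the following Python sketch models it outside of Impala. It is only an illustration under assumed names: map_hbase_row, the column names, and the sample cell values are hypothetical, and Impala performs this mapping internally from the table definition rather than through any such function.

```python
def map_hbase_row(row_key, cells, key_column, value_columns):
    """Flatten one HBase row into a dict shaped like an Impala table row.

    row_key       -- the HBase row key
    cells         -- dict of "family:qualifier" -> value for this row
    key_column    -- name of the table column that receives the row key
    value_columns -- dict of table column name -> "family:qualifier"
    """
    row = {key_column: row_key}
    for column, cell_name in value_columns.items():
        # A cell that is absent from the HBase row is modeled as None (NULL).
        row[column] = cells.get(cell_name)
    return row


# Hypothetical sample data, for illustration only.
sample = map_hbase_row(
    row_key="10",
    cells={"strings:UserName": "alice", "ints:UserAge": 30},
    key_column="UserId",
    value_columns={
        "UserName": "strings:UserName",
        "UserAge": "ints:UserAge",
        "UserDob": "strings:UserDob",
    },
)
print(sample)
```

The row key fills exactly one column, and every other column is looked up by its column family and qualifier.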

Tip

HBase internals are outside the scope of this book. If you are working with HBase tables in Impala, I suggest reading the latest HBase documentation on the Apache HBase website, http://hbase.apache.org/.

Here are the steps to work with HBase and Impala together:

  1. In the Hive shell, create an external table with CREATE EXTERNAL TABLE, using the HBase-specific keywords that map the Hive table to the HBase table. We use the Hive shell only because certain keywords in these statements are not supported in Impala.
  2. Define the column that corresponds to the HBase row key as a string, either with the #string keyword in the column mapping or by mapping it to a STRING column.
  3. Once the preceding steps are done, the Hive metastore is updated with the required information and Impala can run queries on these tables.
  4. Make sure Impala users have read/write access to the HBase tables. You can grant this access with the grant command in the HBase shell.
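A mistake in the column mapping string is easy to make, so a quick check before running the statement in Hive can help. The Python sketch below is a hypothetical helper; check_columns_mapping is not part of Hive, HBase, or Impala. It assumes the rules described in the Hive HBase integration documentation: one mapping entry per Hive column, exactly one :key entry, and no whitespace between entries.

```python
def check_columns_mapping(mapping, column_names):
    """Return a list of problems found in an hbase.columns.mapping value.

    Hypothetical helper for illustration; it is not part of Hive or Impala.
    """
    problems = []
    raw_entries = mapping.split(",")
    if any(entry != entry.strip() for entry in raw_entries):
        problems.append("whitespace around a mapping entry")
    entries = [entry.strip() for entry in raw_entries]

    # An entry may carry a storage-type suffix such as ":key#string".
    key_count = sum(1 for entry in entries if entry.split("#")[0] == ":key")
    if key_count != 1:
        problems.append("expected exactly one :key entry, found %d" % key_count)

    if len(entries) != len(column_names):
        problems.append(
            "%d mapping entries for %d columns" % (len(entries), len(column_names))
        )
    return problems


# Hypothetical column names and mapping strings, for illustration only.
columns = ["UserId", "UserName", "UserAge", "UserDob"]
good = ":key,strings:UserName,ints:UserAge,strings:UserDob"
bad = ":key, strings:UserID, strings:UserName, ints:UserAge, strings:UserDob"
print(check_columns_mapping(good, columns))
print(check_columns_mapping(bad, columns))
```

An empty list means none of the three assumed rules was violated; it does not guarantee that the column families exist in HBase.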

Using Impala to query HBase tables

When querying HBase tables, Impala uses the HBase client API to read the data stored in HBase. You can create the external table in Hive with or without a string row key. The following example creates a table in HBase, maps it in Hive, and finally queries it in Impala:

  1. Create the HBase table in the HBase shell as follows:
    create 'hbasetable', 'ints', 'strings'
    enable 'hbasetable'
  2. Create an external table in the Hive shell with a string row key as follows:
    CREATE EXTERNAL TABLE hivetableforhbase_userid (
          -- Row key is mapped as a string
          UserId string,
          UserName string,  UserAge int,
          UserDob timestamp)
      STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
      WITH SERDEPROPERTIES (
          "hbase.columns.mapping" = ":key,strings:UserName,ints:UserAge,strings:UserDob")
    TBLPROPERTIES("hbase.table.name" = "hbasetable");
  3. You can also create another table without a string row key for learning purposes as follows:
    CREATE EXTERNAL TABLE hivetableforhbase (
          -- Row key is not mapped as a string
          UserId int,
          UserName string, UserAge int,
          UserDob timestamp)
      STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
      WITH SERDEPROPERTIES (
          "hbase.columns.mapping" = ":key,strings:UserName,ints:UserAge,strings:UserDob")
    TBLPROPERTIES("hbase.table.name" = "hbasetable");
  4. Now we can issue the following queries in the Impala shell:
    -- Row key is mapped as a string column, so range predicates are applied in the scan
    SELECT * FROM hivetableforhbase_userid WHERE UserId = '10';
    -- Row key is not mapped as a string, so the predicate is not turned into scan parameters
    SELECT * FROM hivetableforhbase WHERE UserId = 10;
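The difference between the two queries can be pictured with a small Python model. HBase keeps rows sorted by row key in byte order, so a predicate on a string row key can be turned into a start and stop position in that order, while other predicates have to be checked row by row. This is only a rough sketch of the idea, not the actual Impala or HBase code; range_scan, full_scan, and the sample keys are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Row keys in lexicographic order (hypothetical sample). For ASCII strings,
# Python's string ordering matches the byte ordering HBase uses for row keys.
row_keys = sorted(["1", "10", "100", "2", "20", "3"])


def range_scan(keys, start, stop):
    """Return the keys in [start, stop] by jumping straight to the bounds."""
    lo = bisect_left(keys, start)
    hi = bisect_right(keys, stop)
    return keys[lo:hi]


def full_scan(keys, predicate):
    """Visit every key and keep those that satisfy the predicate."""
    return [key for key in keys if predicate(key)]


# String row key: the equality predicate becomes a narrow key range.
print(range_scan(row_keys, "10", "10"))
# Non-string row key: every row is read and the predicate is applied afterwards.
print(full_scan(row_keys, lambda key: int(key) == 10))
```

Both calls find the same row, but the first touches only the matching part of the sorted keys while the second reads all of them, which is why the string mapping matters for lookup-style queries.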