Using Hive to map an external table over weblog data in HDFS

You will often want to create tables over existing data that does not live within the managed Hive warehouse in HDFS. Creating a Hive external table is one of the easiest ways to handle this scenario. Queries from the Hive client will execute as they normally do over internally managed tables.

Getting ready

Make sure you have access to a pseudo-distributed or fully-distributed Hadoop cluster, with Apache Hive 0.7.1 installed on your client machine and available on the environment path for the active user account. This recipe depends on having the weblog_entries dataset loaded into an HDFS directory at the absolute path /input/weblog/weblog_records.txt.

How to do it...

Carry out the following steps to map an external table in HDFS:

  1. Open a text editor of your choice, ideally one with SQL syntax highlighting. I have used the TextMate text editor for this recipe.
  2. Add the CREATE TABLE syntax, as follows:
    DROP TABLE IF EXISTS weblog_entries;
    CREATE EXTERNAL TABLE weblog_entries (
          md5 STRING,
          url STRING,
          request_date STRING,
          request_time STRING,
          ip STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
    LOCATION '/input/weblog/';
  3. Save the script as weblog_create_external_table.hql in the working directory.
  4. Run the script from the operating system shell by supplying the -f option to the Hive client, as follows:
    hive -f weblog_create_external_table.hql
  5. You should see two successful commands issued to the Hive client.
    OK
    Time taken: 2.654 seconds
    OK
    Time taken: 0.473 seconds
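Once the script completes, you can sanity-check the mapping with a quick query from the Hive shell. This is a hypothetical verification step, not part of the original script; it assumes the table was created exactly as shown above:

    -- Pull a handful of rows to confirm the external mapping is working;
    -- the columns should line up with the tab-delimited fields in the file:
    SELECT md5, url, ip FROM weblog_entries LIMIT 5;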

How it works...

The existing definition of the table weblog_entries is deleted if it already exists. Following this, the script issues a CREATE command with the EXTERNAL keyword, which tells the Hive Metastore that the data is not managed by the Hive warehouse in HDFS.

The table is defined as having five fields per entry: the MD5 hash of the URL, the URL itself, the date of the request, the exact time of the request, and the IP address that the request was associated with.

ROW FORMAT DELIMITED uses the native Hive SerDe, which is Hive's extensible and internal serialization/deserialization mechanism for reading and writing raw data. We explicitly tell the SerDe that a tab character separates each field and a newline character separates each record. The LOCATION keyword is required by Hive when creating an external table. It points to the HDFS directory that contains the table data using an absolute path.
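If you want to see how the Metastore recorded this definition, Hive can print the full table metadata. A minimal sketch, assuming the weblog_entries table from this recipe exists:

    -- Show the stored table metadata, including the table type
    -- (EXTERNAL_TABLE) and the LOCATION path in HDFS:
    DESCRIBE FORMATTED weblog_entries;

Look for the Table Type and Location fields in the output to confirm that Hive is treating the table as external.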

There's more...

There are a few handy tips that you need to know when working with external tables.

LOCATION must point to a directory, not a file

As of Hive release 0.7.1, the LOCATION keyword requires an absolute path to a directory in HDFS.
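In other words, point LOCATION at the directory containing the data file, not at the file itself. A sketch of the wrong and right forms, using the paths from this recipe:

    -- Not valid: LOCATION names a file, not a directory
    --   LOCATION '/input/weblog/weblog_records.txt';
    -- Correct: LOCATION names the containing HDFS directory
    --   LOCATION '/input/weblog/';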

Dropping an external table does not delete the data stored in the table

Unlike a managed table in Hive, the DROP command only deletes the table entry from the Metastore and not the physical data in HDFS. Other applications that depend on data stored in the supplied HDFS directory will continue to operate normally.
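This makes external tables cheap to drop and recreate. A minimal sketch, assuming the table and data from this recipe:

    -- Removes only the Metastore entry; the file
    -- /input/weblog/weblog_records.txt remains untouched in HDFS:
    DROP TABLE weblog_entries;
    -- Re-running the CREATE EXTERNAL TABLE statement from this recipe
    -- restores full query access to the same data, with no reload step.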

You can add data to the path specified by LOCATION

If new data is inserted into a directory specified in an external table's LOCATION attribute, the data will be visible to any subsequent queries performed on the table.
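For example, copying another file into the directory is enough to grow the table; no ALTER TABLE or reload is needed. A hedged sketch from the operating system shell, where new_records.txt is a hypothetical file in the same tab-delimited format:

    # Add a second data file to the directory backing the external table:
    hadoop fs -put new_records.txt /input/weblog/
    # Subsequent queries see the new rows immediately:
    hive -e "SELECT COUNT(*) FROM weblog_entries;"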
