Using Hive to dynamically create tables from the results of a weblog query

This recipe will outline a shorthand technique for inline table creation when the query is executed. Having to create every table definition up front is impractical and does not scale for large ETL. Being able to dynamically define intermediate tables is tremendously useful for complex analytics with multiple staging points.

In this recipe, we will create a new table that contains three fields from the weblog entry dataset, namely request_date, request_time, and url. In addition to this, we will define a new field called url_length.

Getting ready

Make sure you have access to a pseudo-distributed or fully-distributed Hadoop cluster, with Apache Hive 0.7.1 installed on your client machine and on the environment path for the active user account.

This recipe depends on having the weblog_entries dataset loaded into Hive table named weblog_entries with the following fields mapped to the respective datatypes.

Issue the following command to the Hive client:

describe weblog_entries

You should see the following response:

OK
md5 string
url string
request_date string
request_time string
ip string

How to do it...

Carry out the following steps to create an inline table definition using an alias:

  1. Open a text editor of your choice, ideally one with SQL syntax highlighting.
  2. Add the following inline creation syntax:
    CREATE TABLE weblog_entries_with_url_length AS
    SELECT url, request_date, request_time, length(url) as url_length FROM weblog_entries;
  3. Save the script as weblog_entries_create_table_as.hql in the active directory.
  4. Run the script from the operating system shell by supplying the –f option to the Hive client, as follows:
    hive –f weblog_create_table_as.hql
  5. To verify that the table was created successfully, issue the following command to the Hive client directly, using the –e option:
    hive –e "describe weblog_entries_with_url_length"
  6. You should see a table with three string fields and a fourth int field holding the URL length:
    OK
    url string
    request_date string
    request_time string
    url_length int

How it works...

The following statement initially defines a new table by the name weblog_entries_with_url_length:

CREATE TABLE weblog_entries_with_url_length AS

We then define the body of this table as an alias to the result set of a nested SELECT statement. In this case, our SELECT statement simply grabs the url, request_date, and request_time fields from each entry in the weblog_entries table. The field names are copied as field names to our new table weblog_entires_with_url_length. We also defined an additional field aliased as url_length to be calculated for each selected record. It stores an int value that represents the number of characters in the record's url field.

SELECT url, request_date, request_time, length(url) as url_length FROM weblog_entries;

In one simple statement, we created a table with a subset of fields from our starting table, as well as a new derived field.

There's more...

The following are a few reminders for when using external tables:

CREATE TABLE AS cannot be used to create external tables

As of Apache Hive 0.7.1, you cannot create external tables using aliases with SELECT statements.

DROP temporary tables

The ease of the CREATE TABLE AS syntax lets Hive users create new tables very quickly, but don't forget to DROP any temporary tables. If you are scripting the CREATE ALIAS for repeated use, the next execution, especially, will fail if there are table name conflicts. Moreover, such intermediate tables will create a warehouse namespace that will quickly become unmanageable.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.17.164.34