Time for action – creating a table from an existing file

So far we have loaded data into Hive directly from files over which Hive effectively takes control. It is also possible, however, to create tables that model data held in files external to Hive. This can be useful when we want the ability to perform Hive processing over data written and managed by external applications or otherwise required to be held in directories outside the Hive warehouse directory. Such files are not moved into the Hive warehouse directory or deleted when the table is dropped.

  1. Save the following to a file called states.hql:
    CREATE EXTERNAL TABLE states(abbreviation string, full_name string)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    LOCATION '/tmp/states';
  2. Copy the data file onto HDFS and confirm its presence afterwards:
    $ hadoop fs -put states.txt /tmp/states/states.txt
    $ hadoop fs -ls /tmp/states
    

    You will receive the following response:

    Found 1 items
    -rw-r--r--   3 hadoop supergroup        654 2012-03-03 16:54 /tmp/states/states.txt
    
  3. Execute the HiveQL script:
    $ hive -f states.hql
    

    You will receive the following response:

    Logging initialized using configuration in jar:file:/opt/hive-0.8.1/lib/hive-common-0.8.1.jar!/hive-log4j.properties
    Hive history file=/tmp/hadoop/hive_job_log_hadoop_201203031655_1132553792.txt
    OK
    Time taken: 3.954 seconds
    OK
    Time taken: 0.594 seconds
    
  4. Check the source data file:
    $ hadoop fs -ls /tmp/states
    

    You will receive the following response:

    Found 1 items
    -rw-r--r--   3 hadoop supergroup        654 … /tmp/states/states.txt
    
  5. Execute a sample query against the table:
    $ hive -e "select full_name from states where abbreviation like 'CA'"
    

    You will receive the following response:

    Logging initialized using configuration in jar:file:/opt/hive-0.8.1/lib/hive-common-0.8.1.jar!/hive-log4j.properties
    Hive history file=/tmp/hadoop/hive_job_log_hadoop_201203031655_410945775.txt
    Total MapReduce jobs = 1
    ...
    OK
    California
    Time taken: 15.75 seconds
    
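  6. Optionally, ask Hive to describe the table; a quick check (the exact output is not reproduced here, but the table metadata should include a tableType attribute):
    $ hive -e "DESCRIBE EXTENDED states"
    

    The detailed metadata at the end of the output should report tableType:EXTERNAL_TABLE rather than MANAGED_TABLE, confirming that Hive does not consider itself the owner of the underlying data.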

What just happened?

The HiveQL statement to create an external table differs only slightly from the forms of CREATE TABLE we used previously. The EXTERNAL keyword specifies that the table's data exists in resources Hive does not control, and the LOCATION clause specifies where the source file or directory is to be found.
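
To see this non-destructive behavior for ourselves, we could drop the table and list the directory again; a minimal sketch, assuming the table and file created in the preceding steps (re-running states.hql afterwards restores the table definition):

    $ hive -e "DROP TABLE states"
    $ hadoop fs -ls /tmp/states

For a managed table, dropping it would also delete its data from the warehouse directory; here the listing should still show states.txt untouched.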

After creating the HiveQL script, we copied the source file onto HDFS. For this table, we used the data file from Chapter 4, Developing MapReduce Programs, which maps U.S. states to their common two-letter abbreviations.

After confirming the file was in the expected location on HDFS, we executed the query to create the table and checked the source file again. Unlike previous table creations that moved the source file into the /user/hive/warehouse directory, the states.txt file is still in the HDFS location into which it was copied.

Finally, we executed a query against the table, and the expected result confirmed that it had been populated with the source data. This highlights an additional difference in this form of CREATE TABLE: for our previous non-external tables, the creation statement did not ingest any data; a subsequent LOAD DATA or (as we'll see later) INSERT statement performs the actual population. With table definitions that include a LOCATION clause, we can create the table and make existing data available through it in a single statement.
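
For comparison, the equivalent managed (non-external) flow needs two statements; a hedged sketch, where the table name managed_states is hypothetical:

    CREATE TABLE managed_states(abbreviation string, full_name string)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t';

    LOAD DATA INPATH '/tmp/states/states.txt' INTO TABLE managed_states;

Note that LOAD DATA INPATH moves the HDFS file into the warehouse directory, which is precisely the relocation the EXTERNAL form avoids.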

We now have two tables in Hive: the larger table with UFO sighting data and a smaller one mapping U.S. state abbreviations to their full names. Wouldn't it be useful to combine them, using the second table to enrich the location column in the first?
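
We will build exactly this sort of query in due course; as a taste, here is a sketch of roughly how it could look, assuming (purely for illustration) that the UFO table is named ufodata, that it has a location column, and that each location string ends in a two-letter state abbreviation:

    -- Hypothetical join: count sightings per full state name.
    -- substr(..., length(...) - 1, 2) extracts the final two characters.
    SELECT s.full_name, COUNT(*) AS sightings
    FROM ufodata u JOIN states s
    ON (s.abbreviation = upper(substr(u.location, length(u.location) - 1, 2)))
    GROUP BY s.full_name;

If the assumptions hold, this would report sightings per full state name instead of per raw abbreviation.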
