Time for action – redefining the table with the correct column separator

Let's fix our table specification as follows:

  1. Create the following file as commands.hql:
    DROP TABLE ufodata ;
    CREATE TABLE ufodata(sighted string, reported string, sighting_location string,
    shape string, duration string, description string)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t' ;
    LOAD DATA INPATH '/tmp/ufo.tsv' OVERWRITE INTO TABLE ufodata ;
  2. Copy the data file onto HDFS:
    $ hadoop fs -put ufo.tsv /tmp/ufo.tsv
    
  3. Execute the HiveQL script:
    $ hive -f commands.hql 
    

    You will receive the following response:

    OK
    Time taken: 5.821 seconds
    OK
    Time taken: 0.248 seconds
    Loading data to table default.ufodata
    Deleted hdfs://head:9000/user/hive/warehouse/ufodata
    OK
    Time taken: 0.285 seconds
    
  4. Validate the number of rows in the table:
    $ hive -e "select count(*) from ufodata;"
    

    You will receive the following response:

    OK
    61393
    Time taken: 28.077 seconds
    
  5. Validate the contents of the reported column:
    $ hive -e "select reported from ufodata limit 5"
    

    You will receive the following response:

    OK
    19951009
    19951011
    19950103
    19950510
    19950614
    Time taken: 14.852 seconds
    

What just happened?

We introduced a third way to invoke HiveQL commands in this example. In addition to using the interactive shell or passing query strings to the Hive tool, we can have Hive read and execute the contents of a file containing a series of HiveQL statements.
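For reference, the three invocation styles look like this (a quick sketch using the table from this example):

    # Interactive shell: statements are typed at the hive> prompt
    $ hive
    hive> SELECT COUNT(*) FROM ufodata;

    # A single query string passed with -e
    $ hive -e "SELECT COUNT(*) FROM ufodata;"

    # A file of HiveQL statements executed with -f
    $ hive -f commands.hql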

We first created such a file that deletes the old table, creates a new one, and loads the data file into it.

The main differences with the table specification are the ROW FORMAT and FIELDS TERMINATED BY clauses. We need both, as the former tells Hive that the row contains multiple delimited fields, while the latter specifies the actual separator. As can be seen here, we can use both explicit ASCII codes and common tokens such as '\t' for tab.
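To illustrate, both of the following clauses declare a tab separator. The octal form is a sketch of the explicit ASCII-code style; Hive accepts octal escapes here, as with its default '\001' field delimiter, and tab is octal 011:

    -- The common escape token for tab
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'

    -- The same separator given as an explicit octal ASCII code
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\011'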

Note

Be careful with the separator specification; it needs to be precise and is case sensitive. Do not waste a few hours by accidentally writing \T instead of \t, as I did recently.
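One way to avoid such mistakes is to check what separator the data file actually uses before writing the table definition; a minimal sketch that prints the first line with tabs made visible as \t:

    $ head -1 ufo.tsv | od -c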

Before running the script, we copy the data file onto HDFS again (the previous copy was consumed when the earlier LOAD DATA statement moved it into Hive's warehouse) and then use hive -f to execute the HiveQL file.

As before, we then execute two simple SELECT statements to first count the rows in the table and then extract the specific values from a named column for a small number of rows.

The overall row count is, as should be expected, the same as before, but the second statement now produces what looks like correct data, showing that the rows are now correctly being split into their constituent fields.

Hive tables – real or not?

If you look closely at the time taken by the various commands in the preceding example, you'll see a pattern that may at first seem strange. Loading data into a table takes about as long as creating the table specification, yet even the simple count-of-all-rows statement takes significantly longer. The output also shows that table creation and the loading of data do not actually cause MapReduce jobs to be executed, which explains their very short execution times.
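You can see this distinction for yourself by asking Hive for a query plan. Assuming the ufodata table from this example, EXPLAIN shows that even a simple count is compiled into a plan containing a MapReduce stage, whereas the DDL statements involve no such stage:

    $ hive -e "EXPLAIN SELECT COUNT(*) FROM ufodata;"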

When loading data into a Hive table, the process is different from what may be expected with a traditional relational database. Although Hive copies the data file into its working directory, it does not actually process the input data into rows at that point. Instead, it creates metadata describing the data, which is then used by subsequent HiveQL queries.

Neither the CREATE TABLE nor the LOAD DATA statement, therefore, truly creates concrete table data; both instead produce metadata that Hive uses when generating the MapReduce jobs that access the data conceptually stored in the table.
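A simple way to confirm this is to look at what LOAD DATA actually did. Assuming the default warehouse location of /user/hive/warehouse (configurable via the hive.metastore.warehouse.dir property), the statement simply moved our file there verbatim:

    # The file was moved, not parsed; its contents are unchanged
    $ hadoop fs -ls /user/hive/warehouse/ufodata
    $ hadoop fs -cat /user/hive/warehouse/ufodata/ufo.tsv | head -1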
