Time for action – inserting the UFO data

Now that we have created a table, let us load the UFO data into it.

  1. Copy the UFO data file onto HDFS:
    $ hadoop fs -put ufo.tsv /tmp/ufo.tsv
    
  2. Confirm that the file was copied:
    $ hadoop fs -ls /tmp
    

    You will receive the following response:

    Found 2 items
    drwxrwxr-x   - hadoop supergroup          0 … 14:52 /tmp/hive-hadoop
    -rw-r--r--   3 hadoop supergroup   75342464 2012-03-03 16:01 /tmp/ufo.tsv
    
  3. Enter the Hive interactive shell:
    $ hive
    
  4. Load the data from the previously copied file into the ufodata table:
    hive> LOAD DATA INPATH '/tmp/ufo.tsv' OVERWRITE INTO TABLE ufodata;
    

    You will receive the following response:

    Loading data to table default.ufodata
    Deleted hdfs://head:9000/user/hive/warehouse/ufodata
    OK
    Time taken: 5.494 seconds
    
  5. Exit the Hive shell:
    hive> quit;
    
  6. Check the location from which we copied the data file:
    $ hadoop fs -ls /tmp
    

    You will receive the following response:

    Found 1 items
    drwxrwxr-x   - hadoop supergroup          0 … 16:10 /tmp/hive-hadoop
    

What just happened?

We first copied onto HDFS the tab-separated file of UFO sightings used previously in Chapter 4, Developing MapReduce Programs. After validating the file's presence on HDFS, we started the Hive interactive shell and used the LOAD DATA command to load the file into the ufodata table.

Because we are using a file already on HDFS, the path was specified by INPATH alone. We could have loaded directly from a file on the local filesystem (obviating the need for the prior explicit HDFS copy) by using LOCAL INPATH.
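For example, had the file still been on the local filesystem, a statement along the following lines would have loaded it directly (the local path here is illustrative):

    hive> LOAD DATA LOCAL INPATH '/home/hadoop/ufo.tsv' OVERWRITE INTO TABLE ufodata;

Note that with LOCAL, Hive copies the file into the table's directory rather than moving it, so the original on the local filesystem is left untouched.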

We specified the OVERWRITE clause, which deletes any existing data in the table before loading the new data. As can be seen from the output of the command, the directory holding the table data is removed, so OVERWRITE should obviously be used with care.
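Had we omitted OVERWRITE, the new data would instead have been added alongside any rows already in the table, as in:

    hive> LOAD DATA INPATH '/tmp/ufo.tsv' INTO TABLE ufodata;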

Note that the command took only a little over five seconds to execute, significantly faster than it would have taken to copy the UFO data file onto HDFS; as we will see, this is because the file is moved rather than copied.

Note

Though we specified an explicit file in this example, it is possible to load multiple files with a single statement by specifying a directory as the INPATH location; in such a case, all files within the directory will be loaded into the table.
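For instance, if several data files had been placed under a single HDFS directory (the directory name here is illustrative), a statement of the following form would load them all in one go:

    hive> LOAD DATA INPATH '/tmp/ufo-files' OVERWRITE INTO TABLE ufodata;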

After exiting the Hive shell, we look again at the directory into which we copied the data file and find it is no longer there. If a LOAD statement is given a path to data on HDFS, it will not simply copy the file into /user/hive/warehouse; it will move it there instead. If you want to analyze data on HDFS that is also used by other applications, then either create a copy or use the EXTERNAL mechanism that will be described later.
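As a brief preview of that mechanism, an external table is declared with a LOCATION clause so that Hive reads the data in place and never moves or deletes it; the table and column names below are purely illustrative, and the column list would need to match the actual data:

    hive> CREATE EXTERNAL TABLE ufodata_ext (sighted STRING, reported STRING)
        > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
        > LOCATION '/tmp/ufo-files';

Dropping an external table removes only its metadata; the underlying files remain on HDFS for other applications.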

Validating the data

Now that we have loaded the data into our table, it is good practice to do some quick validating queries to confirm all is as expected. Sometimes our initial table definition turns out to be incorrect.
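A couple of simple queries of the following kind make a quick sanity check, confirming that the table contains roughly the expected number of rows and that the columns were parsed sensibly:

    hive> SELECT COUNT(*) FROM ufodata;
    hive> SELECT * FROM ufodata LIMIT 5;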
