Let's fix our table specification as follows:
commands.hql
DROP TABLE ufodata;
CREATE TABLE ufodata(sighted string, reported string, sighting_location string,
shape string, duration string, description string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
LOAD DATA INPATH '/tmp/ufo.tsv' OVERWRITE INTO TABLE ufodata;
$ hadoop fs -put ufo.tsv /tmp/ufo.tsv
$ hive -f commands.hql
You will receive the following response:
OK
Time taken: 5.821 seconds
OK
Time taken: 0.248 seconds
Loading data to table default.ufodata
Deleted hdfs://head:9000/user/hive/warehouse/ufodata
OK
Time taken: 0.285 seconds
$ hive -e "select count(*) from ufodata;"
You will receive the following response:
OK
61393
Time taken: 28.077 seconds
$ hive -e "select reported from ufodata limit 5;"
You will receive the following response:
OK
19951009
19951011
19950103
19950510
19950614
Time taken: 14.852 seconds
We introduced a third way to invoke HiveQL commands in this example. In addition to using the interactive shell or passing query strings to the Hive tool, we can have Hive read and execute the contents of a file containing a series of HiveQL statements.
We first created such a file that deletes the old table, creates a new one, and loads the data file into it.
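For reference, the three invocation styles we have now seen are as follows:

$ hive                                     # interactive shell; type statements at the hive> prompt
$ hive -e "SELECT count(*) FROM ufodata;"  # pass a query string on the command line
$ hive -f commands.hql                     # execute the statements in a file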
The main differences with the table specification are the ROW FORMAT and FIELDS TERMINATED BY clauses. We need both, as the first tells Hive that the row contains multiple delimited fields, while the second specifies the actual separator. As can be seen here, we can use explicit ASCII codes as well as common tokens such as '\t' for tab.
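For example, a tab is ASCII character 9 (octal 011), so the following two clauses should be interchangeable in the table definition:

FIELDS TERMINATED BY '\t'
FIELDS TERMINATED BY '\011'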
Before running the script, we copy the data file onto HDFS again (the previous copy was moved into Hive's warehouse by the earlier LOAD DATA statement) and then use hive -f to execute the HiveQL file.
As before, we then execute two simple SELECT statements, first counting the rows in the table and then extracting values from a named column for a small number of rows. The overall row count is, as expected, the same as before, but the second statement now produces what looks like correct data, showing that the rows are now being split correctly into their constituent fields.
If you look closely at the time taken by the various commands in the preceding example, you'll see a pattern that may at first seem strange. Loading data into the table takes about as long as creating the table specification, yet even the simple count of all rows takes significantly longer. The output also shows that table creation and the loading of data do not actually cause MapReduce jobs to be executed, which explains their very short execution times.
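One way to confirm which statements compile into MapReduce jobs is Hive's EXPLAIN command, which prints a statement's execution plan without running it. The plan for the count query should include a MapReduce stage, which the table creation and load steps never needed:

$ hive -e "EXPLAIN SELECT count(*) FROM ufodata;"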
When loading data into a Hive table, the process differs from what may be expected with a traditional relational database. Although Hive moves the data file into its warehouse directory, it does not actually process the input data into rows at that point. Instead, it creates metadata around the data, which is then used by subsequent HiveQL queries.
Both the CREATE TABLE and LOAD DATA statements, therefore, do not truly create concrete table data as such; instead, they produce the metadata that Hive will use when generating MapReduce jobs to access the data conceptually stored in the table.
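You can see this for yourself by listing the table's directory on HDFS. Assuming the default warehouse location shown in the earlier output (the exact path may differ on your cluster), the loaded file should appear there under its original name, unmodified:

$ hadoop fs -ls /user/hive/warehouse/ufodata
$ hadoop fs -cat /user/hive/warehouse/ufodata/ufo.tsv | head -5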