Time for action – redefining the table with the correct column separator

Let's fix our table specification as follows:

  1. Create the following file as commands.hql:
    DROP TABLE ufodata ;
    CREATE TABLE ufodata(sighted string, reported string, sighting_location string,
    shape string, duration string, description string)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t' ;
    LOAD DATA INPATH '/tmp/ufo.tsv' OVERWRITE INTO TABLE ufodata ;
  2. Copy the data file onto HDFS:
    $ hadoop fs -put ufo.tsv /tmp/ufo.tsv
    
  3. Execute the HiveQL script:
    $ hive -f commands.hql 
    

    You will receive the following response:

    OK
    Time taken: 5.821 seconds
    OK
    Time taken: 0.248 seconds
    Loading data to table default.ufodata
    Deleted hdfs://head:9000/user/hive/warehouse/ufodata
    OK
    Time taken: 0.285 seconds
    
  4. Validate the number of rows in the table:
    $ hive -e "select count(*) from ufodata;"
    

    You will receive the following response:

    OK
    61393
    Time taken: 28.077 seconds
    
  5. Validate the contents of the reported column:
    $ hive -e "select reported from ufodata limit 5"
    

    You will receive the following response:

    OK
    19951009
    19951011
    19950103
    19950510
    19950614
    Time taken: 14.852 seconds
    

What just happened?

We introduced a third way to invoke HiveQL commands in this example. In addition to using the interactive shell or passing query strings to the Hive tool, we can have Hive read and execute the contents of a file containing a series of HiveQL statements.
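For reference, the three invocation styles look like this (a quick sketch using the table from this example):

    # Interactive shell: statements are typed at the hive> prompt
    $ hive
    hive> SELECT COUNT(*) FROM ufodata;

    # A single query string passed with -e
    $ hive -e "SELECT COUNT(*) FROM ufodata;"

    # A file of HiveQL statements executed with -f
    $ hive -f commands.hql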

We first created such a file that deletes the old table, creates a new one, and loads the data file into it.

The main differences with the table specification are the ROW FORMAT and FIELDS TERMINATED BY clauses. We need both, as the former tells Hive that the row contains multiple delimited fields, while the latter specifies the actual separator. As can be seen here, we can use both explicit ASCII codes and common tokens such as '\t' for tab.
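To illustrate, both of the following clauses declare a tab separator. The octal form is a sketch of the explicit ASCII-code style; Hive accepts octal escapes here, as with its default '\001' field delimiter, and tab is octal 011:

    -- The common escape token for tab
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'

    -- The same separator given as an explicit octal ASCII code
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\011'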

Note

Be careful with the separator specification; it needs to be precise and is case sensitive. Do not waste a few hours by accidentally writing \T instead of \t, as I did recently.
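One way to avoid such mistakes is to check what separator the data file actually uses before writing the table definition; a minimal sketch that prints the first line with tabs made visible as \t:

    $ head -1 ufo.tsv | od -c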

Before running the script, we copy the data file onto HDFS again (the previous copy was consumed when the earlier LOAD DATA statement moved it into Hive's warehouse) and then use hive -f to execute the HiveQL file.

As before, we then execute two simple SELECT statements to first count the rows in the table and then extract the specific values from a named column for a small number of rows.

The overall row count is, as should be expected, the same as before, but the second statement now produces what looks like correct data, showing that the rows are now correctly being split into their constituent fields.

Hive tables – real or not?

If you look closely at the time taken by the various commands in the preceding example, you'll see a pattern that may at first seem strange. Loading data into a table takes about as long as creating the table specification, yet even the simple count-of-all-rows statement takes significantly longer. The output also shows that table creation and the loading of data do not actually cause MapReduce jobs to be executed, which explains their very short execution times.
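You can see this distinction for yourself by asking Hive for a query plan. Assuming the ufodata table from this example, EXPLAIN shows that even a simple count is compiled into a plan containing a MapReduce stage, whereas the DDL statements involve no such stage:

    $ hive -e "EXPLAIN SELECT COUNT(*) FROM ufodata;"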

When loading data into a Hive table, the process is different from what may be expected with a traditional relational database. Although Hive copies the data file into its working directory, it does not actually process the input data into rows at that point. Instead, it creates metadata describing the data, which is then used by subsequent HiveQL queries.

Neither the CREATE TABLE nor the LOAD DATA statement, therefore, truly creates concrete table data; both instead produce metadata that Hive uses when generating the MapReduce jobs that access the data conceptually stored in the table.
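A simple way to confirm this is to look at what LOAD DATA actually did. Assuming the default warehouse location of /user/hive/warehouse (configurable via the hive.metastore.warehouse.dir property), the statement simply moved our file there verbatim:

    # The file was moved, not parsed; its contents are unchanged
    $ hadoop fs -ls /user/hive/warehouse/ufodata
    $ hadoop fs -cat /user/hive/warehouse/ufodata/ufo.tsv | head -1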
