Using the Hive string UDFs to concatenate fields in weblog data

String concatenation is a very common operation in any development task. It frequently comes up when using Hive for report generation and even simple ETL tasks. This recipe will show a very basic and useful example using one of the Hive string concatenation UDFs.

In this recipe, we will take the separate request_date and request_time fields from the weblog_entries table and print a single concatenated column to the console for every record, containing both the request_date and request_time values separated by an underscore (_).

Getting ready

Make sure you have access to a pseudo-distributed or fully-distributed Hadoop cluster, with Apache Hive 0.7.1 installed on your client machine and on the environment path for the active user account.

This recipe depends on having the weblog_entries dataset loaded into a Hive table named weblog_entries with the following fields mapped to the respective datatypes.

Issue the following command to the Hive client:

describe weblog_entries;

You should see the following response:

OK
md5 string
url string
request_date string
request_time string
ip string

How to do it...

Carry out the following steps to perform string concatenation in HiveQL:

  1. Open a text editor of your choice, ideally one with SQL syntax highlighting.
  2. Add the following query:
    SELECT concat_ws('_', request_date, request_time) FROM weblog_entries;
  3. Save the script as weblog_concat_date_time.hql in the active directory.
  4. Run the script from the operating system shell by supplying the -f option to the Hive client, for example hive -f weblog_concat_date_time.hql. You should see the results of the SELECT statement printed to the console. The following snippet is an example that contains only two sample rows; the full printout will contain all 3000 rows.
    2012-05-10_21:33:26
    2012-05-10_21:13:10

How it works...

The script relies on the Hive built-in UDF to concatenate two strings together with a supplied separator token. For each row, it supplies the function with the respective request_date and request_time values that correspond to that row. The output of the function is a single string containing both the fields separated by an underscore (_). Since the SELECT statement consists of only that function, and the function outputs just a single string, we see a single column for all 3000 rows, one printed per line.
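The same output could also be produced with the plain concat() UDF by passing the separator as an explicit argument; a minimal equivalent sketch:

```sql
-- Equivalent to concat_ws('_', request_date, request_time):
-- concat() joins all of its arguments with no separator,
-- so the underscore must be supplied as a literal argument.
SELECT concat(request_date, '_', request_time) FROM weblog_entries;
```

concat_ws() is usually preferable when joining several fields, since the separator is written once rather than repeated between every pair of arguments.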

There's more...

The following are a few additional notes to help with the concat_ws() function:

The UDF concat_ws() function will not automatically cast parameters to String

If you pass non-string datatypes as parameters to concat_ws(), you will be greeted with a very descriptive error message:

FAILED: Error in semantic analysis: Line 1:21 Argument type mismatch field1: Argument 2 of function CONCAT_WS must be "string", but "int" was found.

If you want your parameters automatically cast to strings, use the regular concat() function instead; note, however, that concat() does not take a separator argument.
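Alternatively, you can keep using concat_ws() by casting non-string parameters explicitly. The following sketch assumes a hypothetical int column named status_code, which is not part of the weblog_entries table, purely for illustration:

```sql
-- status_code is a hypothetical int column used only for illustration;
-- cast() converts it to a string so concat_ws() will accept it.
SELECT concat_ws('_', request_date, cast(status_code AS STRING))
FROM weblog_entries;
```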

Alias your concatenated field

Like most Hive UDFs, you can alias the output of concat_ws(). This comes in handy if you are persisting the results of the concatenation and want a very descriptive column header.
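For example, the recipe's query could label the combined column with a descriptive alias (the alias name request_timestamp here is illustrative):

```sql
-- AS gives the concatenated output column a readable header
SELECT concat_ws('_', request_date, request_time) AS request_timestamp
FROM weblog_entries;
```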

The concat_ws() function supports variable length parameter arguments

When using concat_ws(), you must supply at least the separator string followed by one input string parameter. Beyond that, there is no limit on the number of input string parameters you can supply to be concatenated and separated.

The following usage is valid:

concat_ws('_','test')

The following output will be printed to the console:

test

The following usage of the concat_ws() function is also valid:

concat_ws('_','hi','there','my','name','is')

The following output will be printed to the console:

hi_there_my_name_is

See also

  • The following recipes in Chapter 6, Big Data Analysis:
    • Using Hive date UDFs to transform and sort event dates from geographic event data
    • Using Hive to build a per-month report of fatalities over geographic event data