Using Hive date UDFs to transform and sort event dates from geographic event data

This recipe will illustrate the efficient use of the Hive date UDFs to list the 20 most recent events and the number of days between the event date and the current system date.

Getting ready

Make sure you have access to a pseudo-distributed or fully-distributed Hadoop cluster with Apache Hive 0.7.1 installed on your client machine and on the environment path for the active user account.

This recipe depends on having the Nigera_ACLED_cleaned.tsv dataset loaded into a Hive table named acled_nigeria_cleaned with the fields mapped to the respective datatypes.

Issue the following command to the Hive client to see the mentioned fields:

describe acled_nigeria_cleaned

You should see the following response:

OK
Loc  string
event_date  string
event_type  string
actor  string
latitude  double
longitude  double
source  string
fatalities  int

How to do it...

Perform the following steps to utilize Hive UDFs for sorting and transformation:

  1. Open a text editor of your choice, ideally one with SQL syntax highlighting.
  2. Add the inline creation and transform syntax:
    SELECT event_type,event_date,days_since FROM (
        SELECT event_type,event_date,
                datediff(to_date(from_unixtime(unix_timestamp())),
                          to_date(from_unixtime( 
                                  unix_timestamp(event_date,
                                 'yyyy-MM-dd')))) AS days_since
         FROM acled_nigeria_cleaned) date_differences
         ORDER BY event_date DESC LIMIT 20;
  3. Save the file as top_20_recent_events.sql in the active folder.
  4. Run the script from the operating system shell by supplying the –f option to the Hive client. You should see the following five rows appear first in the output console:
    OK
    Battle-No change of territory  2011-12-31    190
    Violence against civilians  2011-12-27    194
    Violence against civilians  2011-12-25    196
    Violence against civilians  2011-12-25    196
    Violence against civilians  2011-12-25    196

How it works...

Let's start with the nested SELECT subqueries. We select three fields from our Hive table acled_nigeria_cleaned: event_type, event_date, and the result of calling the UDF datediff(), which takes as arguments an end date and a start date. Both are expected in the form yyyy-MM-dd. The first argument to datediff() is the end date, with which we want to represent the current system date. Calling unix_timestamp() with no arguments will return the current system time in milliseconds. We send that return value to from_unixtimestamp() to get a formatted timestamp representing the current system date in the default Java 1.6 format (yyyy-MM-dd HH:mm:ss). We only care about the date portion, so calling to_date() with the output of this function strips the HH:mm:ss. The result is the current date in the yyyy-MM-dd form.

The second argument to datediff() is the start date, which for our query is the event_date. The series of function calls operate in almost the exact same manner as our previous argument, except that when we call unix_timestamp(), we must tell the function that our argument is in the SimpleDateFormat format that is yyyy-MM-dd. Now we have both start_date and end_date arguments in the yyyy-MM-dd format and can perform the datediff() operation for the given row. We alias the output column of datediff() as days_since for each row.

The outer SELECT statement takes these three columns per row and sorts the entire output by event_date in descending order to get reverse chronological ordering. We arbitrarily limit the output to only the first 20.

The net result is the 20 most recent events with the number of days that have passed since that event occurred.

There's more...

The date UDFs can help tremendously in performing string date comparisons. Here are some additional pointers:

Date format strings follow Java SimpleDateFormat guidelines

Check out the Javadocs for SimpleDateFormat to learn how your custom date strings can be used with the date transform UDFs.

Default date and time formats

  • Many of the UDFs operate under a default format assumption.
  • For UDFs requiring only date, your column values must be in the form yyyy-MM-dd.
  • For UDFs that require date and time, your column values must be in the form yyyy-MM-dd HH:mm:ss.

See also

  • Using Hive to build a per-month report of fatalities over geographic event data
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.12.76.164