Using Hive to build a per-month report of fatalities over geographic event data

This recipe will show a very simple analytic that uses Hive to count fatalities for every month appearing in the dataset and print the results to the console.

Getting ready

Make sure you have access to a pseudo-distributed or fully-distributed Hadoop cluster with Apache Hive 0.7.1 installed on your client machine and on the environment path for the active user account.

This recipe depends on having the Nigera_ACLED_cleaned.tsv dataset loaded into a Hive table named acled_nigeria_cleaned with the following fields mapped to the respective datatypes.

Issue the following command to the Hive client:

describe acled_nigeria_cleaned

You should see the following response:

OK
loc  string
event_date  string
event_type  string
actor  string
latitude  double
longitude  double
source  string
fatalities  int

How to do it...

Follow the steps to use Hive for report generation:

  1. Open a text editor of your choice, ideally one with SQL syntax highlighting.
  2. Add the inline creation and transformation syntax:
    SELECT from_unixtime(unix_timestamp(event_date,  'yyyy-MM-dd'), 'yyyy-MMM'),
           COALESCE(CAST(sum(fatalities) AS STRING), 'Unknown')
           FROM acled_nigeria_cleaned
           GROUP BY from_unixtime(unix_timestamp(event_date, 'yyyy-MM-dd'),'yyyy-MMM'),
  3. Save the file as monthly_violence_totals.sql in the active folder.
  4. Run the script from the operating system shell by supplying the –f option to the Hive client. You should see the following three rows appear first in the output console. Note that the output is sorted lexicographically, and not on the order of dates.
    OK
    1997-Apr  115
    1997-Aug  4
    1997-Dec  26

How it works...

The SELECT statement uses unix_timestamp() and from_unixtime() to reformat the event_date for each row as just a year-month concatenated field. This is also in the GROUP BY expression for totaling fatalities using sum().

The coalesce() method returns the first non-null argument passed to it. We pass as the first argument, the value of fatalities summed for that given year-month, cast as a string. If that value is NULL for any reason, return the constant Unknown. Otherwise return the string representing the total fatalities counted for that year-month combination. Print everything to the console over stdout.

There's more...

The following are some additional helpful tips related to the code in this recipe:

The coalesce() method can take variable length arguments.

As mentioned in the Hive documentation, coalesce() supports one or more arguments. The first non-null argument will be returned. This can be useful for evaluating several different expressions for a given column before deciding the right one to choose.

The coalesce() will return NULL if no argument is non-null. It's not uncommon to provide a type literal to return if all other arguments are NULL.

Date reformatting code template

Having to reformat dates stored in your raw data is very common. Proper use of from_unixtime() and unix_timestamp() can make your life much easier.

Remember this general code template for concise date format transformation in Hive:

from_unixtime(unix_timestamp(<col>,<in-format>),<out-format>);

See also

  • Using Hive date UDFs to transform and sort event dates from geographic event data
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.191.165.62