Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Using Hive to build a per-month report of fatalities over geographic event data

This recipe will show a very simple analytic that uses Hive to count fatalities for every month appearing in the dataset and print the results to the console.

Getting ready

Make sure you have access to a pseudo-distributed or fully-distributed Hadoop cluster with Apache Hive 0.7.1 installed on your client machine and on the environment path for the active user account.

This recipe depends on having the Nigera_ACLED_cleaned.tsv dataset loaded into a Hive table named acled_nigeria_cleaned with the following fields mapped to the respective datatypes.

Issue the following command to the Hive client:

describe acled_nigeria_cleaned

You should see the following response:

OK
loc  string
event_date  string
event_type  string
actor  string
latitude  double
longitude  double
source  string
fatalities  int

How to do it...

Follow the steps to use Hive for report generation:

Open a text editor of your choice, ideally one with SQL syntax highlighting.

Add the inline creation and transformation syntax:

SELECT from_unixtime(unix_timestamp(event_date,  'yyyy-MM-dd'), 'yyyy-MMM'),
       COALESCE(CAST(sum(fatalities) AS STRING), 'Unknown')
       FROM acled_nigeria_cleaned
       GROUP BY from_unixtime(unix_timestamp(event_date, 'yyyy-MM-dd'),'yyyy-MMM'),

Save the file as monthly_violence_totals.sql in the active folder.
Run the script from the operating system shell by supplying the –f option to the Hive client. You should see the following three rows appear first in the output console. Note that the output is sorted lexicographically, and not on the order of dates.
```
OK
1997-Apr  115
1997-Aug  4
1997-Dec  26
```

How it works...

The SELECT statement uses unix_timestamp() and from_unixtime() to reformat the event_date for each row as just a year-month concatenated field. This is also in the GROUP BY expression for totaling fatalities using sum().

The coalesce() method returns the first non-null argument passed to it. We pass as the first argument, the value of fatalities summed for that given year-month, cast as a string. If that value is NULL for any reason, return the constant Unknown. Otherwise return the string representing the total fatalities counted for that year-month combination. Print everything to the console over stdout.

There's more...

The following are some additional helpful tips related to the code in this recipe:

The coalesce() method can take variable length arguments.

As mentioned in the Hive documentation, coalesce() supports one or more arguments. The first non-null argument will be returned. This can be useful for evaluating several different expressions for a given column before deciding the right one to choose.

The coalesce() will return NULL if no argument is non-null. It's not uncommon to provide a type literal to return if all other arguments are NULL.

Date reformatting code template

Having to reformat dates stored in your raw data is very common. Proper use of from_unixtime() and unix_timestamp() can make your life much easier.

Remember this general code template for concise date format transformation in Hive:

from_unixtime(unix_timestamp(<col>,<in-format>),<out-format>);

Table of Contents for
Using Hive to build a per-month report of fatalities over geographic event data

Using Hive to build a per-month report of fatalities over geographic event data

Getting ready

How to do it...

How it works...

There's more...

The coalesce() method can take variable length arguments.

Date reformatting code template

See also

Table of Contents for Using Hive to build a per-month report of fatalities over geographic event data

Create new playlist

Sign In

Sign Up

Using Hive to build a per-month report of fatalities over geographic event data

Getting ready

How to do it...

How it works...

There's more...

The coalesce() method can take variable length arguments.

Date reformatting code template

See also

Table of Contents for
Using Hive to build a per-month report of fatalities over geographic event data