Using a map-side join in Apache Hive to analyze geographical events

When joining two tables in Apache Hive, one table might be significantly smaller than the other. In such cases, Hive can push a hash table representing the smaller table over the distributed cache and join the tables entirely map-side, which can lead to better parallelism and job throughput. In this recipe, we will use a map-side join to attach any significant holiday information that may have occurred on a particular geographic event.

Getting ready

Ensure that Apache Hive 0.7.1 is installed on your client machine and on the environment path for the active user account.

This recipe depends on having the Nigera_ACLED_cleaned.tsv dataset loaded into a Hive table with the name acled_nigeria_cleaned and with the following fields mapped to the respective datatypes. The Nigera_ACLED_cleaned.tsv dataset can be downloaded from http://www.packtpub.com/support.

Issue the following command to the Hive client:

describe acled_nigeria_cleaned

You should see the following response:

OK
loc  string
event_date  string
event_type  string
actor  string
latitude  double
longitude  double
source  string
fatalities  int

This recipe also requires having nigeria-holidays.tsv loaded into a Hive table with the name nigeria_holidays and the following fields mapped to the respective datatypes.

Issue the following command to the Hive client:

describe nigeria_holidays

You should see the following response:

OK
yearly_date  string
description  string

How to do it...

Carry out the following steps to perform a map-side join in Apache Hive:

  1. Open a text editor and create a file named map-join-acled-holidays.sql.
  2. Add the inline creation and transformation syntax:
    SELECT /*+ MAPJOIN(nh)*/ acled.event_date, acled.event_type, nh.description
        FROM acled_nigeria_cleaned acled
        JOIN nigeria_holidays nh
            ON (substr(acled.event_date, 6) = nh.yearly_date);
  3. Run the map-join-acled-holidays.sql script from the operating system shell by supplying the –f option to the Hive client. You will know the map-side join is working if you see this message in the output trace:
    Mapred Local Task Succeeded. Convert the Join into MapJoin

    The generated MapReduce job should not have any reduce tasks.

  4. You should see the following five rows appear first in the output console:
    2002-01-01  Riots/Protests  New Years Day
    2001-06-12  Battle-No change of territory  Lagos State only; in memory of failed 1993 election
    2002-05-29  Violence against civilians   Democracy Day
    2010-10-01  Riots/Protests  Independence Day
    2010-10-01  Violence against civilians   Independence Day

How it works...

The script inner joins the month-day portion of each record in the event_date column in acled_nigeria_cleaned to the yearly_date column in nigeria_holidays. substr(event_date, 6) will omit the year portion from each record in the event_date column by starting from the position of the sixth character. The inline hint to /*+ MAPJOIN(nh) */ lets you manually define which table alias to load as the small table to each mapper. The nigeria_holidays table is very small and made the most sense to load as a hash table. Each map process in the join can operate over rows from acled_nigeria_cleaned with its own copy of the nigeria_holidays hash table. The MAPJOIN operation handles creating the hash table and distributing it to each map task.

We would like to see the values of the event_date and event_type columns, and a description of the holiday, if any, the event occurred on.

There's more...

Map-side joins can be tricky to configure and use properly. Here are a few pointers.

Auto-convert to map-side join whenever possible

Set the property hive.auto.convert.join to true in your Hive config and Hive will automatically try to convert the join to a map-side join, as long as the table fits below a certain size threshold. You can configure the maximum size with the property hive.smalltable.filesize. This will tell Hive what file size (or below) constitutes a small table. It's written in bytes expressed as a long (for example, 25000000L = 25M).

Also consider setting hive.hashtable.max.memory.usage, which tells the map task to terminate if it requires more than the configured memory percentage.

Map-join behavior

If you omit /*+ MAPJOIN() */ and rely on auto-convert, it can be difficult to follow what Hive is doing to optimize the join. Following are some tips:

  • TableFoo LEFT OUTER JOIN TableBar: Try to convert TableBar to a hash table
  • TableFoo RIGHT OUTER JOIN TABLE B: Try to convert TableFoo to a hash table
  • TableFoo FULL OUTER JOIN TableBar: Framework cannot map join full outer joins

See also

  • Using optimized full outer joins in Apache Hive to analyze geographical events
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.15.235.188