Using Pig to load a table and perform a SELECT operation with GROUP BY

This recipe will use Pig to group the IP addresses contained in the ip_to_country dataset and count the number of IP addresses listed for each country.

Getting ready

Make sure you have access to a pseudo-distributed or fully-distributed Hadoop cluster with Apache Pig 0.9.2 installed on your client machine and on the environment path for the active user account. This recipe depends on having the ip-to-country named dataset included in the book loaded into HDFS at the absolute path /input/weblog_ip/ip_to_country.txt.

How to do it...

Carry out the following steps to perform a SELECT and GROUP BY operation in Pig:

Open a text editor of your choice, ideally one with SQL syntax highlighting.

Add the following Pig Latin statements:

ip_countries = LOAD '/input/weblog_ip/ip_to_country.txt' AS (ip: chararray, country:chararray);
country_grpd = GROUP ip_countries BY country;
country_counts = FOREACH country_grpd GENERATE FLATTEN(group), COUNT(ip_countries) as counts;
STORE country_counts INTO '/output/geo_weblog_entries';

Save the file as group_by_country.pig.
In the directory containing the script, run it with the Pig client using the -f option (pig -f group_by_country.pig).

How it works...

The first line creates a Pig relation named ip_countries from the tab-delimited records stored in HDFS. The relation declares two attributes, ip and country, both character arrays. The second line creates the country_grpd relation, which contains one record for each distinct country in ip_countries. The third line tells Pig to iterate over country_grpd and, for each country, count the ip_countries records grouped under it. The results are assigned to a new relation named country_counts, whose tuples contain exactly two attributes, group and counts. The fourth line stores the tuples in this relation to the output directory /output/geo_weblog_entries.
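One way to check this description against what Pig actually infers is to inspect the relations with DESCRIBE before storing anything. The following is a minimal sketch and not part of the recipe's script: it assumes the same input path, writes out the default tab delimiter with PigStorage('\t'), and renames the group field to country, which is an illustrative choice the original script does not make.

```pig
-- Sketch only: same input path as the recipe. PigStorage('\t') is Pig's
-- default loader and delimiter, spelled out here for clarity.
ip_countries = LOAD '/input/weblog_ip/ip_to_country.txt'
    USING PigStorage('\t')
    AS (ip:chararray, country:chararray);

-- One tuple per distinct country: the key (named group) plus a bag
-- holding the matching ip_countries tuples.
country_grpd = GROUP ip_countries BY country;
DESCRIBE country_grpd;

-- Renaming group to country is an assumption made here for readability.
-- COUNT skips tuples whose first field (ip) is null; COUNT_STAR counts all.
country_counts = FOREACH country_grpd
    GENERATE group AS country, COUNT(ip_countries) AS counts;
DESCRIBE country_counts;
```

DESCRIBE only prints schemas, so this variant does not write any output to HDFS.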

The output is not sorted by country in either ascending or descending order.
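If a sorted result is needed, an ORDER step can be added before the STORE. The sketch below repeats the recipe's statements so that it runs on its own; the relation name country_sorted and the output path /output/geo_weblog_entries_sorted are hypothetical names chosen for illustration, not part of the original recipe.

```pig
-- Same load, group, and count steps as the recipe.
ip_countries = LOAD '/input/weblog_ip/ip_to_country.txt' AS (ip:chararray, country:chararray);
country_grpd = GROUP ip_countries BY country;
country_counts = FOREACH country_grpd GENERATE FLATTEN(group), COUNT(ip_countries) AS counts;

-- Sort by count, largest first. To sort alphabetically by country instead,
-- order by the first field: ORDER country_counts BY $0 ASC;
country_sorted = ORDER country_counts BY counts DESC;

-- Hypothetical output path, kept separate from the recipe's output directory.
STORE country_sorted INTO '/output/geo_weblog_entries_sorted';
```

Sorting adds an extra MapReduce stage, so it is worth including only when the consumer of the output actually relies on the order.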

In HDFS, under /output/geo_weblog_entries, you should see one or more part files containing tab-delimited country listings and their IP address counts.
