Using Pig to load a table and perform a SELECT operation with GROUP BY

This recipe will use Pig to group the IP addresses contained in the ip_to_country dataset and count the number of IP addresses listed for each country.

Getting ready

Make sure you have access to a pseudo-distributed or fully distributed Hadoop cluster with Apache Pig 0.9.2 installed on your client machine and on the environment path for the active user account. This recipe depends on the ip_to_country dataset included with the book being loaded into HDFS at the absolute path /input/weblog_ip/ip_to_country.txt.

How to do it...

Carry out the following steps to perform a SELECT and GROUP BY operation in Pig:

  1. Open a text editor of your choice, ideally one with SQL syntax highlighting.
  2. Add the following Pig Latin statements:
    ip_countries = LOAD '/input/weblog_ip/ip_to_country.txt' AS (ip:chararray, country:chararray);
    country_grpd = GROUP ip_countries BY country;
    country_counts = FOREACH country_grpd GENERATE FLATTEN(group), COUNT(ip_countries) AS counts;
    STORE country_counts INTO '/output/geo_weblog_entries';
  3. Save the file as group_by_country.pig.
  4. In the directory containing the script, run it using the Pig client with the -f option: pig -f group_by_country.pig

How it works...

The first line creates a Pig relation named ip_countries from the tab-delimited records stored in HDFS. Because no load function is given, Pig uses its default loader, PigStorage, which splits fields on tabs. The relation specifies two attributes, namely ip and country, both character arrays. The second line creates the country_grpd relation, which contains one record for each distinct country in the ip_countries relation. The third line tells Pig to iterate over the country_grpd relation and, for each group, count the number of records from ip_countries that map to the current country. The result of this iteration is assigned to a new relation named country_counts, which consists of tuples containing exactly two attributes, namely group and counts. The fourth line stores the tuples contained in this relation to the output directory /output/geo_weblog_entries.
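
To see the intermediate structure described above, the relations can be inspected with DESCRIBE. The following is a minimal sketch that reuses the relation names from this recipe. It writes `group AS country` in place of FLATTEN(group), which is an illustrative variation that gives the key field an explicit name and is not part of the original script.

```pig
-- Load the tab-delimited dataset with an explicit schema.
ip_countries = LOAD '/input/weblog_ip/ip_to_country.txt' AS (ip:chararray, country:chararray);

-- Each output record holds the grouping key (named "group") and a bag
-- (named "ip_countries") of all input tuples that share that key.
country_grpd = GROUP ip_countries BY country;
DESCRIBE country_grpd;

-- Illustrative variation: name the key field explicitly instead of using FLATTEN(group).
country_counts = FOREACH country_grpd GENERATE group AS country, COUNT(ip_countries) AS counts;
DESCRIBE country_counts;
```

DESCRIBE prints only the schema of a relation; it does not start a MapReduce job on its own.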

The output is not sorted by country in either ascending or descending order.
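
If sorted output is needed, an ORDER statement can be added before the STORE. The sketch below sorts by count in descending order. The relation name sorted_counts and the output path /output/geo_weblog_entries_sorted are hypothetical names chosen for this illustration, and `group AS country` is used here so that the key field has an explicit name.

```pig
ip_countries = LOAD '/input/weblog_ip/ip_to_country.txt' AS (ip:chararray, country:chararray);
country_grpd = GROUP ip_countries BY country;
country_counts = FOREACH country_grpd GENERATE group AS country, COUNT(ip_countries) AS counts;

-- Sort by the number of IP addresses, largest first.
-- Use "BY country ASC" instead to sort alphabetically by country.
sorted_counts = ORDER country_counts BY counts DESC;

-- Hypothetical output path; Pig fails if the directory already exists.
STORE sorted_counts INTO '/output/geo_weblog_entries_sorted';
```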

You should see in HDFS, under /output/geo_weblog_entries, one or more part files containing tab-delimited country listings and their IP address counts.
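
One way to check the result without leaving Pig is to load the output directory back into a relation and print it. This is a sketch that assumes the script has already run successfully; the relation name results is hypothetical, and the counts field is declared as long because COUNT returns a long.

```pig
-- Loading a directory reads all of the part files it contains.
results = LOAD '/output/geo_weblog_entries' AS (country:chararray, counts:long);

-- Print every (country, counts) tuple to the console.
DUMP results;
```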

See also

  • The following recipes in Chapter 3, Extracting and Transforming Data
    • Using Apache Pig to filter bot traffic from web server logs
    • Using Apache Pig to sort web server logs data by timestamp
  • The Calculate cosine similarity of Artists in the Audioscrobbler dataset using Pig recipe in Chapter 6, Big Data Analysis