Using illustrate to debug Pig jobs

Generating good test data for a complex distributed job that joins, filters, and aggregates gigabytes or even terabytes of data can be one of the hardest parts of the development process, or at least one of the most tedious. Apache Pig provides an incredibly powerful tool, illustrate, that will seek out cases from the provided full input data that exercise different dataflow paths. The following recipe shows an example of the illustrate command in use.

Getting ready

Apache Pig 0.10 or a more recent version must be installed. You can download it from http://pig.apache.org/releases.html.

How to do it...

The following Pig code will show an example of a record with a malformed IP address:

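-- Load the weblog entries, including malformed records (default tab-delimited storage)
weblogs = load '/data/weblogs/weblog_entries_bad_records.txt'
   as (md5:chararray, url:chararray, date:chararray, time:chararray, ip:chararray);
-- Keep only the IP address column
ip_addresses = foreach weblogs generate ip;
-- Keep IPs that are not valid dotted quads; backslashes are doubled inside Pig string literals
bad = filter ip_addresses by not
   (ip matches '^([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.([01]?\\d\\d?|2[0-4]\\d|25[0-5])$');
-- Show example records for every relation leading up to bad
illustrate bad;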

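The output is a small table of example records for each relation in the script (weblogs, ip_addresses, and bad), showing how the records change from one step to the next.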
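Before running the job, it can help to sanity-check the IP pattern on a few strings outside Pig. The following is a minimal Python sketch, not part of the recipe: the helper name is_valid_ip and the sample values are hypothetical, and fullmatch is used on the assumption that Pig's matches operator must match the entire string.

```python
import re

# Same dotted-quad pattern as the Pig filter, built from one octet sub-pattern.
OCTET = r"([01]?\d\d?|2[0-4]\d|25[0-5])"
IP_PATTERN = re.compile(r"\.".join([OCTET] * 4))


def is_valid_ip(ip: str) -> bool:
    """Return True when the whole string is a valid dotted-quad address."""
    # fullmatch mirrors a whole-string match (assumption about Pig's `matches`).
    return IP_PATTERN.fullmatch(ip) is not None


# Hypothetical sample values, not taken from the recipe's data file.
samples = ["10.0.0.1", "255.255.255.255", "256.1.1.1", "192.168.1", "abc"]

# Mirrors the Pig filter: keep only the strings that are NOT valid addresses.
bad = [ip for ip in samples if not is_valid_ip(ip)]
print(bad)
```

Strings that end up in bad here are the kind of record the Pig filter is meant to keep.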

How it works...

In the preceding example, the filter keeps only records with an invalid IP address. Such records make up a small percentage of the total, so if a traditional sampling approach were used to create test data, the sample would probably not contain any of them.
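To put a rough number on that risk: if a fraction p of the records is bad and n records are sampled independently, the chance that the sample contains no bad record is (1 - p)^n. The short Python sketch below evaluates this. The function name and the figures (1 bad record per 100,000 and a 1,000-record sample) are hypothetical and are not measurements from the recipe's data.

```python
def prob_no_bad_records(bad_fraction: float, sample_size: int) -> float:
    """Chance that a uniform random sample contains no bad record,
    assuming each record is drawn independently."""
    return (1.0 - bad_fraction) ** sample_size


# Hypothetical figures: 1 bad record per 100,000, and a 1,000-record sample.
p_miss = prob_no_bad_records(1e-5, 1000)
print(f"Chance the sample misses every bad record: {p_miss:.1%}")
```

With figures like these, a plain random sample would almost always miss the bad records, which is the gap illustrate is designed to close.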

The illustrate algorithm makes four complete passes over a Pig script to generate its data. The first pass takes a sample of data from each input and sends it through the script. The second pass finds and removes records that followed the same path through the script. The third pass determines if any possible paths were not taken by the sampled data from the first pass. If there are paths that are not represented by the sampled data, the illustrate algorithm will create fake data that exercises the remaining paths. The fourth pass is similar to the second pass; it removes any redundant data created by the third pass.
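The four passes can be sketched for the simplest possible script, a single filter with two paths (record kept or record dropped). This is a toy Python illustration under stated assumptions, not Pig's actual implementation: the names illustrate_sketch, prune, is_bad_ip, and the synthesize table of record generators are all hypothetical, and is_bad_ip is a simplified stand-in for the recipe's regular expression.

```python
import random
from typing import Callable, Dict, List


def is_bad_ip(ip: str) -> bool:
    # Simplified stand-in for the recipe's regex filter (assumption).
    parts = ip.split(".")
    ok = len(parts) == 4 and all(p.isdecimal() and int(p) <= 255 for p in parts)
    return not ok


def path_of(record: str, predicate: Callable[[str], bool]) -> str:
    # A single filter has two dataflow paths: kept or dropped.
    return "kept" if predicate(record) else "dropped"


def prune(records: List[str], predicate: Callable[[str], bool]) -> Dict[str, str]:
    # Keep the first record seen on each path; the rest are redundant.
    by_path: Dict[str, str] = {}
    for record in records:
        by_path.setdefault(path_of(record, predicate), record)
    return by_path


def illustrate_sketch(data, predicate, synthesize, sample_size=3, seed=0):
    rng = random.Random(seed)
    # Pass 1: sample the input and send the sample through the "script".
    sample = rng.sample(data, min(sample_size, len(data)))
    # Pass 2: drop records that followed a path already represented.
    examples = prune(sample, predicate)
    # Pass 3: synthesize a record for every path the sample did not cover.
    made = [synthesize[p]() for p in ("kept", "dropped") if p not in examples]
    # Pass 4: prune the synthesized records and merge them in.
    for path, record in prune(made, predicate).items():
        examples.setdefault(path, record)
    return examples


# Hypothetical input in which every record is valid, so sampling alone
# can never produce an example for the "kept" path.
data = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
synthesize = {"kept": lambda: "999.0.0.1", "dropped": lambda: "10.0.0.9"}
result = illustrate_sketch(data, is_bad_ip, synthesize)
print(result)
```

Because the sampled records are all valid, the third pass is what supplies the example for the kept path. A real script has many more operators and paths, but the sample, prune, synthesize, prune structure is the same as the one described above.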
