Using Apache Pig to sort web server log data by timestamp

Sorting is a common data transformation. This recipe shows how to write a simple Pig script that sorts a dataset using the distributed processing power of a Hadoop cluster.

Getting ready

You will need to download/compile/install the following:

How to do it...

Perform the following steps to sort data using Apache Pig:

  1. First load the web server log data into a Pig relation:
    nobots_weblogs = LOAD '/user/hadoop/apache_nobots_tsv.txt' AS (ip: chararray, timestamp:long, page:chararray, http_status:int, payload_size:int, useragent:chararray);
  2. Next, order the web server log records by the timestamp field in ascending order:
    ordered_weblogs = ORDER nobots_weblogs BY timestamp;
  3. Finally, store the sorted results in HDFS:
    STORE ordered_weblogs INTO '/user/hadoop/ordered_weblogs';
  4. Run the Pig job (this assumes the preceding statements were saved in a file named ordered_weblogs.pig):
    $ pig -f ordered_weblogs.pig

How it works...

Sorting data in a distributed, shared-nothing environment is non-trivial. The Pig relational operator ORDER BY provides a total ordering of a dataset. This means that every record in the output file part-00000 has a timestamp less than or equal to that of every record in the output file part-00001 (since our data was sorted by timestamp).
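
To make the idea of total ordering across output files more concrete, the following Python sketch simulates it locally. It is not Pig's actual implementation: it assumes a simplified scheme in which boundary keys are chosen from a random sample, each record is routed to a partition by key range, and each partition is then sorted on its own. The function names (choose_split_points, total_order_sort) and the generated (ip, timestamp) records are hypothetical and exist only for illustration.

```python
import bisect
import random


def choose_split_points(keys, num_partitions, sample_size=100, seed=0):
    """Pick num_partitions - 1 boundary keys from a random sample of the keys."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    if not sample:
        return []
    # Evenly spaced quantiles of the sample become the partition boundaries.
    return [sample[len(sample) * i // num_partitions]
            for i in range(1, num_partitions)]


def total_order_sort(records, key, num_partitions):
    """Range-partition records by key, then sort each partition independently."""
    splits = choose_split_points([key(r) for r in records], num_partitions)
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        # Every key in partition i is <= every key in partition i + 1.
        partitions[bisect.bisect_right(splits, key(record))].append(record)
    # Each partition stands in for one output file (part-00000, part-00001, ...).
    return [sorted(p, key=key) for p in partitions]


# Hypothetical (ip, timestamp) records standing in for the web server log.
rng = random.Random(42)
weblogs = [("10.0.0.%d" % rng.randint(1, 9),
            rng.randint(1_300_000_000, 1_400_000_000))
           for _ in range(1000)]

parts = total_order_sort(weblogs, key=lambda r: r[1], num_partitions=4)

# Concatenating the partitions in order yields a globally sorted sequence.
flattened = [r for part in parts for r in part]
timestamps = [r[1] for r in flattened]
assert timestamps == sorted(timestamps)
```

Because the boundaries come from a sample, the partitions are only roughly equal in size, but the ordering guarantee holds regardless: reading the partitions one after another gives the records in timestamp order.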

There's more...

The Pig ORDER BY relational operator can sort data by multiple fields, and also supports sorting in descending order. For example, to sort the nobots_weblogs relation by the ip and timestamp fields, we would use the following expression:

ordered_weblogs = ORDER nobots_weblogs BY ip, timestamp;

To sort the nobots_weblogs relation by timestamp in descending order, use the desc option:

ordered_weblogs = ORDER nobots_weblogs BY timestamp desc;

See also

The following recipes will use Apache Pig:

  • Using Apache Pig to sessionize web server log data
  • Using Python to extend Apache Pig functionality
  • Using MapReduce and secondary sort to calculate page views