Trim Outliers from the Audioscrobbler dataset using Pig and datafu

Datafu is a Pig UDF library open sourced by the SNA team at LinkedIn. It contains many useful functions. This recipe will use play counts from the Audioscrobbler dataset and the Quantile UDF from datafu to identify and remove outliers.

Getting ready

How to do it...

  1. Register the datafu JAR file and construct the Quantile UDF:
    register /path/to/datafu-0.0.4.jar;
    define Quantile datafu.pig.stats.Quantile('.90');
  2. Load the user_artist_data.txt file:
    plays = load '/data/audioscrobbler.txt' using PigStorage(' ') as (user_id:long, artist_id:long, playcount:long);
  3. Group all of the data:
    plays_grp = group plays ALL;
  4. Generate the ninetieth percentile value to be used as the maximum allowed play count:
    out_max = foreach plays_grp {
        ord = order plays by playcount;
        generate Quantile(ord.playcount) as ninetieth;
    };
  5. Trim outliers to the ninetieth percentile value:
    trim_outliers = foreach plays generate user_id, artist_id, (playcount>out_max.ninetieth ? out_max.ninetieth : playcount);
  6. Store the user_artist_data.txt file with outliers trimmed:
    store trim_outliers into '/data/audioscrobble/outliers_trimmed.bcp';

How it works...

This recipe takes advantage of the datafu library open sourced by LinkedIn. Once a JAR file is registered, all of its UDFs are available to the Pig script. The define command calls the constructor of the datafu.pig.stats.Quantile UDF passing it a value of .90. The constructor of the Quantile UDF will then create an instance that will produce the ninetieth percentile of the input vector it is passed. The define also aliases Quantile as shorthand for referencing this UDF.
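
As a rough illustration of how the constructor arguments shape the UDF, the sketch below defines a second alias that asks for several quantiles at once. The alias name Quartiles is hypothetical, and passing more than one quantile to the constructor is an assumption based on later datafu documentation; it may not hold for every release.

```pig
register /path/to/datafu-0.0.4.jar;

-- Alias used by this recipe: a single quantile, the ninetieth percentile.
define Quantile datafu.pig.stats.Quantile('0.90');

-- Hypothetical second alias. Assumes the constructor accepts several
-- quantiles and returns one value per requested quantile.
define Quartiles datafu.pig.stats.Quantile('0.25', '0.50', '0.75');
```

Each define statement creates its own alias, so both can be used in the same script.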

The user artist data is loaded into a relation named plays. This data is then grouped by ALL. The ALL group is a special kind of group that creates a single bag containing all of the input.
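
A minimal sketch of what the ALL group produces, using the load statement from step 2. The relation name play_total and the field name num_records are hypothetical and only serve to show that the whole input ends up in one bag.

```pig
plays = load '/data/audioscrobbler.txt' using PigStorage(' ')
        as (user_id:long, artist_id:long, playcount:long);
plays_grp = group plays all;

-- Expect two fields: a group key and a bag named plays.
describe plays_grp;

-- There is only one group, so this yields a single row
-- holding the total number of input records.
play_total = foreach plays_grp generate COUNT(plays) as num_records;
dump play_total;
```

Because everything lands in a single group, the work on that bag is not spread across reducers, which is worth keeping in mind for large inputs.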

The Quantile UDF requires that the data passed to it be sorted first. The data is sorted by play count inside the nested foreach block, and the sorted vector of play counts is passed to the Quantile UDF. Because the input is already sorted, the UDF only has to pick the value at the ninetieth percentile position and return it.
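
Later datafu documentation describes Quantile as returning a tuple with one field per requested quantile. If that also applies to the version in use (an assumption, not something this recipe states), flattening the result gives a plain numeric field that is easier to compare against. A hedged variant of step 4 might look like this:

```pig
-- Assumes steps 1 to 3 have already run, so the Quantile alias
-- and the plays_grp relation exist.
out_max = foreach plays_grp {
    -- Quantile expects its input bag to be sorted already.
    ord = order plays by playcount;
    -- flatten unwraps the single-field tuple (assumed return type).
    generate flatten(Quantile(ord.playcount)) as ninetieth;
};

-- Inspect the schema and the single resulting row.
describe out_max;
dump out_max;
```

Running describe before relying on out_max is a cheap way to confirm which shape the installed version actually returns.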

This value is then compared against each play count in the user artist file. If the play count is greater, it is trimmed down to the value returned by the Quantile UDF; otherwise, it is left unchanged.
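
The comparison relies on Pig treating a single-row relation as a scalar. The sketch below spells out the same trimming step with explicit casts and a named output field. It assumes that out_max has exactly one row and that its ninetieth field is a plain numeric value rather than a tuple.

```pig
-- Assumes plays (step 2) and out_max (step 4) already exist.
-- Pig raises a runtime error if out_max has more than one row.
trim_outliers = foreach plays generate
    user_id,
    artist_id,
    -- The cast truncates any fractional part of the percentile value.
    (playcount > (long)out_max.ninetieth
        ? (long)out_max.ninetieth
        : playcount) as playcount;
```

Casting both branches to long keeps the playcount column an integer type; dropping the casts would leave the type to Pig's implicit conversion rules.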

The updated user artist file with outliers trimmed is then stored back in HDFS to be used for further processing.
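
One way to sanity-check the stored result is to reload it and look at the largest remaining play count, which should not exceed the ninetieth percentile value. This is only a sketch: the relation names trimmed, trimmed_grp, and max_play and the field name max_playcount are hypothetical.

```pig
-- store used the default PigStorage (tab-delimited), so load does too.
-- playcount is read as double in case the trimmed values were
-- written with a decimal point.
trimmed = load '/data/audioscrobble/outliers_trimmed.bcp'
          as (user_id:long, artist_id:long, playcount:double);
trimmed_grp = group trimmed all;
max_play = foreach trimmed_grp generate MAX(trimmed.playcount) as max_playcount;
dump max_play;
```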

There's more...

The datafu library also includes a StreamingQuantile UDF. This UDF is similar to the Quantile UDF except that it does not require the data to be sorted before it is used. Skipping the sort can greatly improve the performance of this operation. However, it does come at a cost: the StreamingQuantile UDF only provides an estimate of the quantile values.

    define Quantile datafu.pig.stats.StreamingQuantile('.90');
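
With that definition in place, step 4 no longer needs the nested order. A minimal sketch, assuming the same JAR path and input as the recipe, and assuming StreamingQuantile is called the same way as Quantile:

```pig
register /path/to/datafu-0.0.4.jar;
define Quantile datafu.pig.stats.StreamingQuantile('0.90');

plays = load '/data/audioscrobbler.txt' using PigStorage(' ')
        as (user_id:long, artist_id:long, playcount:long);
plays_grp = group plays all;

-- No nested order block: the unsorted bag is passed directly,
-- and the result is an estimate of the ninetieth percentile.
out_max = foreach plays_grp generate Quantile(plays.playcount) as ninetieth;
```

Whether the estimate is close enough depends on the data; for trimming extreme play counts a small error in the cutoff is often acceptable.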