Using Python to extend Apache Pig functionality

In this recipe, we will use Python to create a simple Apache Pig user-defined function (UDF) to count the number of records in a Pig DataBag.

Getting ready

This recipe requires the Jython standalone JAR file. To build it, download the Jython Java installer, run the installer, and select Standalone from the installation menu:

$ java -jar jython_installer-2.5.2.jar

Add the Jython standalone JAR file to Apache Pig's classpath:

$ export PIG_CLASSPATH=$PIG_CLASSPATH:/path/to/jython2.5.2/jython.jar

How to do it...

The following are the steps to create an Apache Pig UDF using Python:

  1. Start by creating a simple Python function to count the number of records in a Pig DataBag:
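    #!/usr/bin/python
    
    # The decorator tells Pig the name and type of the returned field
    @outputSchema("hits:long")
    def calculate(inputBag):
      # The number of tuples in the bag is the hit count for the group
      hits = len(inputBag)
      return hits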
  2. Next, create a Pig script to group all of the web server log records by IP and page. Then send the grouped web server log records to the Python function:
    register 'count.py' using jython as count;
    
    nobots_weblogs = LOAD '/user/hadoop/apache_nobots_tsv.txt' AS (ip: chararray, timestamp:long, page:chararray, http_status:int, payload_size:int, useragent:chararray);
    
    ip_page_groups = GROUP nobots_weblogs BY (ip, page);
    
    ip_page_hits = FOREACH ip_page_groups GENERATE FLATTEN(group), count.calculate(nobots_weblogs);
    
    STORE ip_page_hits INTO '/user/hadoop/ip_page_hits';
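
To make the data flow of the script above easier to follow, here is a minimal plain-Python sketch of what the GROUP and FOREACH statements compute. It runs outside Pig and is not how Pig executes the script; the function name group_and_count and the sample rows are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def group_and_count(records):
    """Emulate GROUP ... BY (ip, page) followed by a count of each bag."""
    groups = defaultdict(list)
    for record in records:
        # Fields 0 and 2 correspond to ip and page in the LOAD schema
        ip, page = record[0], record[2]
        groups[(ip, page)].append(record)
    # FLATTEN(group) yields ip and page as separate fields, followed by hits
    return sorted((ip, page, len(bag)) for (ip, page), bag in groups.items())


# Hypothetical sample rows in the same field order as the LOAD statement
sample = [
    ("10.0.0.1", 1336258942, "/index.html", 200, 512, "Mozilla/5.0"),
    ("10.0.0.1", 1336258999, "/index.html", 200, 512, "Mozilla/5.0"),
    ("10.0.0.1", 1336259010, "/about.html", 200, 256, "Mozilla/5.0"),
    ("10.0.0.2", 1336259100, "/index.html", 404, 0, "Mozilla/5.0"),
]

for row in group_and_count(sample):
    print(row)
```

Each output row has the same shape as a record in ip_page_hits: the IP address, the page, and the number of log records in that group.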

How it works...

First, we created a simple Python function to calculate the length of a Pig DataBag. In addition, the Python script contained the Python decorator, @outputSchema("hits:long"), that instructs Pig on how to interpret the data returned by the Python function. In this case, we want Pig to store the data returned by this function as a Java Long in a field named hits.
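
Because Pig supplies the outputSchema decorator when it registers the script, running count.py directly with a Python interpreter would fail with a NameError. One way to try the function locally is to define a stand-in decorator first. The sketch below is an assumption for local testing only, not Pig's implementation, and the sample bag is hypothetical.

```python
# Stand-in for the decorator that Pig provides to Jython UDFs.
# Assumption: for local testing it is enough to record the schema string.
def outputSchema(schema):
    def wrap(func):
        func.outputSchema = schema  # keep the schema string on the function
        return func
    return wrap


@outputSchema("hits:long")
def calculate(inputBag):
    # The number of tuples in the bag is the hit count
    hits = len(inputBag)
    return hits


# Hypothetical bag: a list of tuples standing in for a Pig DataBag
sample_bag = [
    ("10.0.0.1", 1336258942, "/index.html", 200, 512, "Mozilla/5.0"),
    ("10.0.0.1", 1336258999, "/index.html", 200, 512, "Mozilla/5.0"),
]

print(calculate(sample_bag))
print(calculate.outputSchema)
```

The stand-in should not be added to the count.py file that Pig registers, since it would shadow the decorator Pig provides.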

Next, we wrote a Pig script that registers the Python UDF using the statement:

register 'count.py' using jython as count;

Finally, we called the calculate() function through the alias count, passing it the nobots_weblogs DataBag for each group:

count.calculate(nobots_weblogs);