Streaming

Hive can also leverage Hadoop's streaming feature to transform data in an alternative way. The streaming API opens an I/O pipe to an external process, such as a script; the process reads data from standard input and writes results back through standard output. In HQL, we can use the TRANSFORM clause directly to embed mapper and reducer scripts written as commands, shell scripts, Java, or other programming languages. Although streaming adds the overhead of serializing and deserializing data between processes, it offers a simple coding model for non-Java developers. The syntax of the TRANSFORM clause is as follows:

FROM (
    FROM src
    SELECT TRANSFORM '(' expression (',' expression)* ')'
    (inRowFormat)?
    USING 'map_user_script'
    (AS colName (',' colName)*)?
    (outRowFormat)? (outRecordReader)?
    (CLUSTER BY? | DISTRIBUTE BY? SORT BY?) src_alias
)
SELECT TRANSFORM '(' expression (',' expression)* ')'
(inRowFormat)?
USING 'reduce_user_script'
(AS colName (',' colName)*)?
(outRowFormat)? (outRecordReader)?

By default, the INPUT values for the user script are as follows:

  • Columns transformed to STRING values
  • Delimited by a tab
  • NULL values converted to the literal string \N (differentiates NULL values from empty strings)
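
For example, the row for Steven in the example later in this section, whose work_place value is NULL, reaches the script as this single tab-delimited line on stdin:

Steven	\N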

By default, the OUTPUT values of the user script are as follows:

  • Treated as tab-separated STRING columns
  • \N will be reinterpreted as NULL
  • The resulting STRING column will be cast to the data type specified in the table declaration
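
Taken together, a well-behaved user script reads tab-delimited fields with \N standing for NULL from stdin, and emits the same encoding on stdout. A minimal sketch of such a script (illustrative only, separate from the upper.py example below):

#!/usr/bin/env python
# Sketch of a TRANSFORM-friendly script honoring Hive's defaults:
# tab-delimited STRING columns with NULL encoded as \N.
import sys

NULL = '\\N'  # Hive's literal for NULL in streaming rows

for line in sys.stdin:
    fields = line.rstrip('\n').split('\t')
    # Decode \N to None, transform non-null fields, then re-encode for Hive.
    values = [None if f == NULL else f for f in fields]
    out = [NULL if v is None else v.upper() for v in values]
    print('\t'.join(out))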

These defaults can be overridden with the ROW FORMAT clause in the TRANSFORM query.
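A minimal sketch of such an override (the table name src, the script name my_script.py, and the comma-delimited output are assumptions for illustration):

SELECT TRANSFORM (key, value)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
USING 'python my_script.py'
AS (new_key, new_value)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
FROM src;

The first ROW FORMAT describes the rows fed to the script; the second describes the rows the script emits. An example of streaming using the Python script upper.py is as follows: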

$ cat upper.py
#!/usr/bin/env python
'''
This is a script to uppercase the input
'''
import sys

def main():
    try:
        for line in sys.stdin:
            n = line.strip()
            print(n.upper())
    except:
        return None

if __name__ == "__main__":
    main()

Test the script by running it in the normal way, as follows:

$ echo "Will" | python upper.py
WILL
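
Because the script uppercases each whole line, it also handles the tab-delimited multi-column rows that TRANSFORM sends; the tab and the \N NULL marker pass through unchanged:

$ printf 'Steven\t\\N\n' | python upper.py
STEVEN	\N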

Call the script with HQL:

> ADD FILE /tmp/upper.py;
> SELECT
> TRANSFORM (name, work_place[0])
> USING 'python upper.py' AS (CAP_NAME, CAP_PLACE)
> FROM employee;
+-----------+------------+
| cap_name  | cap_place  |
+-----------+------------+
| MICHAEL   | MONTREAL   |
| WILL      | MONTREAL   |
| SHELLEY   | NEW YORK   |
| LUCY      | VANCOUVER  |
| STEVEN    | NULL       |
+-----------+------------+
5 rows selected (30.101 seconds)
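
The query above uses a single script. The full syntax shown earlier chains a mapper and a reducer; a minimal sketch of that shape (mapper.py, reducer.py, and a src table with key and value columns are assumptions for illustration):

FROM (
    FROM src
    SELECT TRANSFORM (key, value)
    USING 'python mapper.py'
    AS (k, v)
    CLUSTER BY k
) map_output
SELECT TRANSFORM (map_output.k, map_output.v)
USING 'python reducer.py'
AS (result_key, result_value);

Here, CLUSTER BY k distributes and sorts the mapper output on k, so each instance of the reducer script sees all the rows for a given key together.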

As of Hive v0.13.0, the TRANSFORM command is not allowed when SQL standard-based authorization is configured.
