Using HDFS in a Greenplum external table

Greenplum is a parallel database that distributes data and queries to one or more PostgreSQL instances. It complements Hadoop by providing real-time or near real-time access to large amounts of data. It supports using HDFS files as external tables. External tables are a good solution for working with data that lives outside of the Greenplum cluster. Since data in external tables must first travel over the network, they should be infrequently used in queries with other data that lives inside of the Greenplum cluster. This recipe will cover creating read-only and read/write external tables.

Getting ready

This recipe assumes that you are using the CDH3 distribution of Hadoop.

Run an instance of Greenplum that must be able to reach the Hadoop cluster found at http://www.greenplum.com/products/greenplum-database.

Configure Greenplum with the following:

  • gp_hadoop_target_version set to cdh3u2
  • gp_hadoop_home set to the full path of $HADOOP_HOME

Java 1.6 or above must be installed on each node in the Greenplum cluster.

How to do it...

Create an external table from the weblogs file in HDFS:

CREATE EXTERNAL TABLE weblogs(
    md5             text,
    url             text,
    request_date    date,
    request_time    time,
    ip              inet
)
LOCATION ('gphdfs://<NAMENODE_HOST>:<NAMENODE_PORT>/data/weblogs/weblog_entries.txt')
FORMAT 'TEXT' (DELIMITER '	'),

How it works...

Greenplum has native support for loading data from HDFS in parallel. When a query is run against the weblog_entries.txt table, the weblog_entries.txt file is loaded into a temporary Greenplum table. The query then executes against this table. After the query finishes the table is discarded.

There's more...

Greenplum external tables also support writing of data. This requires the WRITABLE keyword while creating the table:

CREATE WRITABLE EXTERNAL TABLE weblogs(
    md5             text,
    url             text,
    request_date    date,
    request_time    time,
    ip              inet
)
LOCATION ('gphdfs://<NAMENODE_HOST>:<NAMENODE_PORT>/data/weblogs/weblog_entries.txt')
FORMAT 'TEXT' (DELIMITER '	'),

More information can be found in the Greenplum administrator's handbook at http://media.gpadmin.me/wp-content/uploads/2011/05/GP-4100-AdminGuide.pdf

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.218.228.99