Using Flume to load data into HDFS

Apache Flume is a Hadoop community project designed to efficiently and reliably load streaming data from many different sources into HDFS. A common use case for Flume is loading weblog data from several sources into HDFS. This recipe covers loading weblog entries into HDFS using Flume.

Getting ready

This recipe assumes that you have Flume installed and configured.

Flume can be downloaded from its Apache page at http://incubator.apache.org/flume/.

If you are using CDH3, Flume version 0.9.4+25.43 is installed by default.

How to do it...

Complete the following steps to load the weblogs data into HDFS:

  1. Use the dump command to test that Flume is configured properly:
    flume dump 'text("/path/to/weblog_entries.txt")'
  2. Use the Flume shell to execute a configuration:
    flume shell -c <MASTER_HOST>:<MASTER_PORT> -e 'exec config text("/path/to/weblog_entries.txt") | collectorSink("hdfs://<NAMENODE_HOST>:<NAMENODE_PORT>/data/weblogs/flume")'
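
Once the command completes, you can verify the load with the standard HDFS shell. The exact file names that Flume generates under the target directory vary by version, so treat this as a sketch:
    hadoop fs -ls /data/weblogs/flume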

How it works...

Flume is built around Source and Sink abstractions, linked together by a pipe-like data flow. In the example, text is a source that takes the path to a file as an argument and sends the contents of that file to the configured sink. The dump command uses the console as its sink. With this configuration, the weblog_entries.txt file is read by text and written to the console.

In step 2, the Flume shell is used to configure and execute a job. The -c argument tells Flume where to connect to the Flume Master node, and Flume executes the command that follows the -e argument. As mentioned previously, text is a source that reads the entire contents of the file it is passed. collectorSink is a sink that can be passed either a local filesystem path or an HDFS path. In the preceding example, an HDFS path is given, so this command loads weblog_entries.txt into HDFS.
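
Because collectorSink accepts a local filesystem path as well as an HDFS path, the same flow can be tested without a running cluster. The following is a minimal sketch, assuming a hypothetical local directory /tmp/weblogs/flume and the same placeholder master address as in step 2:
    flume shell -c <MASTER_HOST>:<MASTER_PORT> -e 'exec config text("/path/to/weblog_entries.txt") | collectorSink("file:///tmp/weblogs/flume")'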

There's more...

Flume comes with several predefined Sources and Sinks. A few of the many basic Sources include:

  • null: This opens, closes, and returns null
  • stdin: This reads from stdin
  • rpcSource: This receives events over either Thrift or Avro RPC
  • text: This reads the contents of a file
  • tail: This reads a file and stays open, delivering data that is appended to the file (see the sketch after this list)
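
The tail source pairs well with the dump command for quick experiments. The following sketch assumes a hypothetical, actively written log file at /var/log/httpd/access_log; new entries are printed to the console as they are appended:
    flume dump 'tail("/var/log/httpd/access_log")'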

A few of the many basic Sinks include:

  • null: This drops the events
  • collectorSink: This writes to the local filesystem or HDFS (see the combined example after this list)
  • console: This writes to the console
  • formatDfs: This writes to HDFS in a specified format (SequenceFile, Avro, Thrift, and so on)
  • rpcSink: This sends events over either Thrift or Avro RPC
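
These Sources and Sinks can be combined using the same pipe syntax shown in step 2. For example, substituting tail for text turns the one-shot load into a continuous one. The following sketch assumes the same hypothetical log file and the placeholder addresses used earlier:
    flume shell -c <MASTER_HOST>:<MASTER_PORT> -e 'exec config tail("/var/log/httpd/access_log") | collectorSink("hdfs://<NAMENODE_HOST>:<NAMENODE_PORT>/data/weblogs/flume")'
Unlike text, tail keeps the flow open, so entries appended to the file after the command is issued are also delivered to HDFS.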