Using Flume to load data into HDFS

Apache Flume is a Hadoop community project designed to efficiently and reliably load streaming data from many different sources into HDFS. A common use case for Flume is loading weblog data from several sources into HDFS. This recipe covers loading weblog entries into HDFS using Flume.

Getting ready

This recipe assumes that you have Flume installed and configured.

Flume can be downloaded from its Apache page at http://incubator.apache.org/flume/.

If you are using CDH3, Flume version 0.9.4+25.43 is installed by default.

How to do it...

Complete the following steps to load the weblogs data into HDFS:

  1. Use the dump command to test that Flume is configured properly:
    flume dump 'text("/path/to/weblog_entries.txt")'
  2. Use the Flume shell to execute a configuration:
    flume shell -c <MASTER_HOST>:<MASTER_PORT> -e 'exec config text("/path/to/weblog_entries.txt") | collectorSink("hdfs://<NAMENODE_HOST>:<NAMENODE_PORT>/data/weblogs/flume")'
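
Once the command completes, you can verify the load with the standard HDFS shell. The exact file names that Flume generates under the target directory vary by version, so treat this as a sketch:
    hadoop fs -ls /data/weblogs/flume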

How it works...

Flume is built around Source and Sink abstractions, linked together by a pipe-like data flow. In the example, text is a source that takes the path to a file as an argument and sends the contents of that file to the configured sink. The dump command uses the console as its sink. With this configuration, the weblog_entries.txt file is read by text and written to the console.

In step 2, the Flume shell is used to configure and execute a job. The -c argument tells Flume where to connect to the Flume Master node, and Flume executes the command that follows the -e argument. As mentioned previously, text is a source that reads the entire contents of the file it is passed. collectorSink is a sink that can be passed either a local filesystem path or an HDFS path. In the preceding example, an HDFS path is given, so this command loads weblog_entries.txt into HDFS.
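
Because collectorSink accepts a local filesystem path as well as an HDFS path, the same flow can be tested without a running cluster. The following is a minimal sketch, assuming a hypothetical local directory /tmp/weblogs/flume and the same placeholder master address as in step 2:
    flume shell -c <MASTER_HOST>:<MASTER_PORT> -e 'exec config text("/path/to/weblog_entries.txt") | collectorSink("file:///tmp/weblogs/flume")'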

There's more...

Flume comes with several predefined Sources and Sinks. A few of the many basic Sources include:

  • null: This opens, closes, and returns null
  • stdin: This reads from stdin
  • rpcSource: This receives events over either Thrift or Avro RPC
  • text: This reads the contents of a file
  • tail: This reads a file and stays open, delivering data that is appended to the file (see the sketch after this list)
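
The tail source pairs well with the dump command for quick experiments. The following sketch assumes a hypothetical, actively written log file at /var/log/httpd/access_log; new entries are printed to the console as they are appended:
    flume dump 'tail("/var/log/httpd/access_log")'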

A few of the many basic Sinks include:

  • null: This drops the events
  • collectorSink: This writes to the local filesystem or HDFS (see the combined example after this list)
  • console: This writes to the console
  • formatDfs: This writes to HDFS in a specified format (SequenceFile, Avro, Thrift, and so on)
  • rpcSink: This sends events over either Thrift or Avro RPC
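
These Sources and Sinks can be combined using the same pipe syntax shown in step 2. For example, substituting tail for text turns the one-shot load into a continuous one. The following sketch assumes the same hypothetical log file and the placeholder addresses used earlier:
    flume shell -c <MASTER_HOST>:<MASTER_PORT> -e 'exec config tail("/var/log/httpd/access_log") | collectorSink("hdfs://<NAMENODE_HOST>:<NAMENODE_PORT>/data/weblogs/flume")'
Unlike text, tail keeps the flow open, so entries appended to the file after the command is issued are also delivered to HDFS.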