Introducing Apache Flume

Flume, found at http://flume.apache.org, is another Apache project with tight Hadoop integration, and we will explore it for the remainder of this chapter.

Before we explain what Flume can do, let's make it clear what it is not. Flume is described as a system for the retrieval and distribution of logs, meaning line-oriented textual data. It is not a generic data-distribution platform; in particular, don't look to use it for the retrieval or movement of binary data.

However, since the vast majority of the data processed in Hadoop is line-oriented text of exactly this kind, it is likely that Flume will meet many of your data retrieval needs.

Note

Flume is also not a generic data serialization framework like Avro, which we used in Chapter 5, Advanced MapReduce Techniques, or similar technologies such as Thrift and Protocol Buffers. As we'll see, Flume makes assumptions about the data format and provides no way to serialize data that falls outside those assumptions.

Flume provides mechanisms for retrieving data from multiple sources, passing it to remote locations (potentially multiple locations in either a fan-out or pipeline model), and then delivering it to a variety of destinations. Though it does have a programmatic API that allows the development of custom sources and destinations, the base product has built-in support for many of the most common scenarios. Let's install it and take a look.
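Before installing anything, it may help to picture the fan-out and pipeline models mentioned above. The following is a toy sketch in Python, not Flume's API: the names `fan_out`, `pipeline`, and `run_demo`, the `web01` tag, and the sample log lines are all hypothetical and exist only to illustrate how one log line can pass through a hop and then be copied to several destinations.

```python
from typing import Callable, Dict, List

# Hypothetical names for illustration only; none of this is Flume's API.
Sink = Callable[[str], None]


def fan_out(sinks: List[Sink]) -> Sink:
    """Return a sink that copies each log line to every downstream sink."""
    def deliver(line: str) -> None:
        for sink in sinks:
            sink(line)
    return deliver


def pipeline(stage: Callable[[str], str], downstream: Sink) -> Sink:
    """Return a sink that transforms a line, then hands it to the next hop."""
    def deliver(line: str) -> None:
        downstream(stage(line))
    return deliver


def run_demo() -> Dict[str, List[str]]:
    archive: List[str] = []  # stands in for one destination, such as HDFS
    alerts: List[str] = []   # stands in for a second destination
    # One pipeline hop tags each line, then a fan-out copies it to both lists.
    entry = pipeline(lambda line: "web01 " + line,
                     fan_out([archive.append, alerts.append]))
    for line in ["GET /index.html 200", "GET /missing 404"]:
        entry(line)
    return {"archive": archive, "alerts": alerts}
```

Calling `run_demo()` leaves the same tagged lines in both lists, which is the essence of fan-out; chaining several `pipeline` calls would model data passing through multiple hops before it is delivered.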

A note on versioning

Flume has gone through some major changes in recent times. The original Flume (now renamed Flume OG for Original Generation) is being superseded by Flume NG (Next Generation). Though the general principles and capabilities are very similar, the implementation is quite different.

Because Flume NG is the future, we will cover it in this book. For some time, though, it will lack several of the features of the more mature Flume OG, so if you find a specific requirement that Flume NG doesn't meet, it may be worth looking at Flume OG.
