Chapter 10. Data Collection with Flume

In the previous two chapters, we've seen how Hive gives Hadoop a relational-style, SQL-like interface and how Sqoop allows it to exchange data with "real" relational databases. Although this is a very common use case, there are, of course, many other types of data sources that we may want to get into Hadoop.

In this chapter, we will cover:

  • An overview of data commonly processed in Hadoop
  • Simple approaches to pull this data into Hadoop
  • How Apache Flume can make this task a lot easier
  • Common Flume setup patterns, from simple through sophisticated (a minimal setup is sketched after this list)
  • Common issues, such as the data lifecycle, that need to be considered regardless of technology
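
To preview what such a setup looks like before we dig into the details, the following is a minimal sketch of a single-agent configuration. The agent name (agent1), the component names, and the netcat-to-logger pipeline listening on port 44444 are illustrative assumptions for this sketch rather than anything the chapter prescribes.

    # A single Flume agent named 'agent1' (hypothetical name) with one
    # source, one channel, and one sink
    agent1.sources = src1
    agent1.channels = ch1
    agent1.sinks = sink1

    # Netcat source: listens for lines of text on a local TCP port
    agent1.sources.src1.type = netcat
    agent1.sources.src1.bind = localhost
    agent1.sources.src1.port = 44444
    agent1.sources.src1.channels = ch1

    # In-memory channel buffering events between the source and the sink
    agent1.channels.ch1.type = memory
    agent1.channels.ch1.capacity = 1000

    # Logger sink: writes received events to Flume's log output
    agent1.sinks.sink1.type = logger
    agent1.sinks.sink1.channel = ch1

An agent of this shape would typically be launched with something like flume-ng agent --conf-file agent1.conf --name agent1; the setups later in this chapter follow the same source-channel-sink pattern, just with more interesting components.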

A note about AWS

This chapter will discuss AWS less than any other in the book; in fact, we won't even mention it after this section. There is no Amazon service akin to Flume, so there is no AWS-specific product for us to explore. On the other hand, Flume works exactly the same way whether it runs on a local host or on an EC2 virtual instance. The rest of this chapter therefore assumes nothing about the environment in which the examples are executed; they will perform identically in either.
