Chapter 1. Hadoop Distributed File System – Importing and Exporting Data

In this chapter we will cover:

  • Importing and exporting data into HDFS using the Hadoop shell commands
  • Moving data efficiently between clusters using Distributed Copy
  • Importing data from MySQL into HDFS using Sqoop
  • Exporting data from HDFS into MySQL using Sqoop
  • Configuring Sqoop for Microsoft SQL Server
  • Exporting data from HDFS into MongoDB
  • Importing data from MongoDB into HDFS
  • Exporting data from HDFS into MongoDB using Pig
  • Using HDFS in a Greenplum external table
  • Using Flume to load data into HDFS

Introduction

In a typical installation, Hadoop is the heart of a complex flow of data. Data is collected from many disparate systems and imported into the Hadoop Distributed File System (HDFS). Next, some form of processing takes place using MapReduce or one of the several languages and frameworks built on top of MapReduce (Hive, Pig, Cascading, and so on). Finally, the filtered, transformed, and aggregated results are exported to one or more external systems.

For a more concrete example, a large website may want to produce basic analytical data about its hits. Weblog data from several servers is collected and pushed into HDFS. A MapReduce job then runs with the weblogs as its input. The weblog data is parsed, summarized, and combined with IP address geolocation data. The output shows the URL, page views, and location data for each cookie. This report is exported into a relational database, where ad hoc queries can be run against it. Analysts can quickly produce reports of total unique cookies, pages with the most views, breakdowns of visitors by region, or any other rollup of this data.
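
The parse, summarize, and combine steps in this example can be sketched in a few lines of plain Python. This is only an in-memory illustration of the map and reduce logic, not a Hadoop job. The tab-separated log format, the GEO lookup table, and the function names are all assumptions made for the sketch; real weblogs and geolocation data will look different.

```python
from collections import defaultdict

# Hypothetical log format (an assumption): "ip<TAB>cookie<TAB>url"
SAMPLE_LOGS = [
    "10.0.0.1\tc1\t/home",
    "10.0.0.1\tc1\t/home",
    "10.0.0.2\tc2\t/about",
    "10.0.0.1\tc1\t/about",
]

# Hypothetical IP-to-region table standing in for a geolocation data set
GEO = {"10.0.0.1": "US", "10.0.0.2": "DE"}


def map_record(line, geo):
    """Parse one log line and emit ((cookie, url, region), 1)."""
    ip, cookie, url = line.split("\t")
    region = geo.get(ip, "unknown")  # combine with geolocation data
    return (cookie, url, region), 1


def reduce_counts(pairs):
    """Sum the counts for each key, as a reducer would."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def summarize(lines, geo):
    """Run the map step over every line, then reduce to page views per key."""
    return reduce_counts(map_record(line, geo) for line in lines)


report = summarize(SAMPLE_LOGS, GEO)
for (cookie, url, region), views in sorted(report.items()):
    print(cookie, url, region, views)
```

Each row of the resulting report holds a cookie, a URL, a region, and a page-view count, which is the shape of data that would then be exported to a relational database for ad hoc queries.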

The recipes in this chapter will focus on the process of importing and exporting data to and from HDFS. The sources and destinations include the local filesystem, relational databases, NoSQL databases, distributed databases, and other Hadoop clusters.
