Chapter 4. Performing Common Tasks Using Hive, Pig, and MapReduce

In this chapter we will cover:

  • Using Hive to map an external table over weblog data in HDFS
  • Using Hive to dynamically create tables from the results of a weblog query
  • Using the Hive string UDFs to concatenate fields in weblog data
  • Using Hive to intersect weblog IPs and determine the country
  • Generating n-grams over news archives using MapReduce
  • Using the distributed cache in MapReduce to find lines that contain matching keywords over news archives
  • Using Pig to load a table and perform a SELECT operation with GROUP BY

Introduction

When working with Apache Hive, Pig, and MapReduce, you will find yourself performing certain tasks again and again. The recipes in this chapter provide solutions for several of these common routines.

These tools let you solve the same problem in many different ways, and deciding on the right implementation can be difficult. The recipes presented here were designed for coding efficiency and clarity.

Hive and Pig provide a clean abstraction layer between the queries and data flows you write and the complex MapReduce workflows they compile to. You can leverage the power of MapReduce for scalable queries without having to think about the underlying MapReduce semantics. Both tools handle the decomposition of your expressions into the proper sequence of MapReduce jobs. Hive lets you build analytics and manage data using a declarative, SQL-like dialect known as HiveQL. Pig operations are written in Pig Latin and take a more imperative form.
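To make the abstraction concrete, consider a declarative query such as SELECT ip, COUNT(*) FROM weblog GROUP BY ip (the table and column names here are hypothetical). The following is a minimal conceptual sketch in plain Python, not Hive or Hadoop code, of the map, shuffle, and reduce phases that such a grouping query implies. The sample records and function names are assumptions made for illustration, and the actual execution plan Hive or Pig generates may differ.

```python
from collections import defaultdict

# Hypothetical weblog records as (ip, url) pairs; not taken from the book's dataset.
weblog = [
    ("10.0.0.1", "/index.html"),
    ("10.0.0.2", "/about.html"),
    ("10.0.0.1", "/contact.html"),
]


def map_phase(records):
    """Emit one (ip, 1) pair per record, as a mapper would."""
    for ip, _url in records:
        yield ip, 1


def shuffle_phase(pairs):
    """Group values by key, standing in for the framework's sort and shuffle."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key, as a reducer would."""
    return {key: sum(values) for key, values in grouped.items()}


hits_per_ip = reduce_phase(shuffle_phase(map_phase(weblog)))
print(hits_per_ip)
```

With Hive or Pig, you write only the one-line grouping expression; the three phases sketched here are the kind of work the tool plans and runs on your behalf.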
