Chapter 3. Extracting and Transforming Data

In this chapter, we will cover:

  • Transforming Apache logs into TSV format using MapReduce
  • Using Apache Pig to filter bot traffic from web server logs
  • Using Apache Pig to sort web server log data by timestamp
  • Using Apache Pig to sessionize web server log data
  • Using Python to extend Apache Pig functionality
  • Using MapReduce and secondary sort to calculate page views
  • Using Hive and Python to clean and transform geographical event data
  • Using Python and Hadoop Streaming to perform a time series analytic
  • Using MultipleOutputs in MapReduce to name output files
  • Creating custom Hadoop Writable and InputFormat to read geographical event data

Introduction

Parsing and formatting large amounts of data to meet business requirements is a challenging task. The software and the architecture must satisfy strict scalability, reliability, and run-time constraints. Hadoop is well suited to this work: it provides a scalable, reliable, and distributed processing environment for extracting and transforming large-scale data. This chapter demonstrates methods to extract and transform data using MapReduce, Apache Pig, Apache Hive, and Python.
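As a flavor of the kind of transformation covered in the recipes that follow, here is a minimal sketch of the first one: converting Apache log entries into TSV. This version is written as a plain Python function usable from a Hadoop Streaming mapper; the regular expression assumes the Apache Common Log Format, and the function name `log_to_tsv` is illustrative, not taken from the recipes themselves.

```python
import re
import sys

# Apache Common Log Format:
#   host ident authuser [date] "request" status bytes
LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)'
)

def log_to_tsv(line):
    """Parse one Common Log Format line into a tab-separated record.

    Returns None for lines that do not match; a streaming mapper
    would typically just skip such malformed entries.
    """
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return "\t".join(match.groups())

def run_mapper(stdin=sys.stdin, stdout=sys.stdout):
    """Hadoop Streaming mapper loop: log lines in, TSV records out."""
    for line in stdin:
        record = log_to_tsv(line.rstrip("\n"))
        if record is not None:
            stdout.write(record + "\n")
```

When wired into Hadoop Streaming, `run_mapper` would be invoked by a small driver script passed via the `-mapper` option; the later recipes achieve the same result with a full MapReduce job and with Pig.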
