MapReduce

Consider the following problem: you have one billion objects in S3 and you want to list all of their keys to build an inventory. If you can list 1,000 keys per second and run the task sequentially from a single program, it takes about 1,000,000 seconds, which is more than 11 days. Now run the same task with 100 programs working in parallel and the time drops to under three hours. This is the core idea behind the MapReduce algorithm.
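Before turning to MapReduce itself, the same speedup can be sketched with plain parallelism. The following is a minimal sketch (not code from this chapter) that splits the listing work across worker threads by key prefix; the bucket name and the hexadecimal prefix partitioning are assumptions made purely for illustration:

# A minimal sketch of parallelizing an S3 key listing by prefix.
# The bucket name and prefix scheme below are hypothetical.
from concurrent.futures import ThreadPoolExecutor

import boto3

BUCKET = "my-inventory-bucket"              # hypothetical bucket name
PREFIXES = [format(i, "x") for i in range(16)]  # assumes keys spread across hex prefixes 0-f

s3 = boto3.client("s3")

def list_keys(prefix):
    """List every key under one prefix using the list_objects_v2 paginator."""
    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys

# Each worker lists one prefix, so the wall-clock time shrinks roughly by
# the number of workers - the same idea MapReduce applies at a larger scale.
with ThreadPoolExecutor(max_workers=16) as pool:
    all_keys = [key for keys in pool.map(list_keys, PREFIXES) for key in keys]

print(f"Listed {len(all_keys)} keys")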

MapReduce is a programming framework that lets developers use distributed computing to split a job across hundreds or even thousands of machines and execute it in parallel. It can run on commodity hardware on top of the Hadoop Distributed File System (HDFS), with a resilient design.

  • Map: The mappers process the input data (in the preceding problem, the object keys); this is the phase where filtering and sorting take place.
  • Reduce: The reducers aggregate and summarize the data (in our case, they count the keys for each common key prefix):

[Figure: life cycle of a MapReduce job counting key prefixes in an S3 bucket]

The preceding diagram shows the full life cycle of a MapReduce job that counts and summarizes the number of objects under each distinct prefix in an S3 bucket holding huge amounts of data.
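To make the map and reduce phases concrete, here is a minimal single-machine sketch (not the book's Hadoop job) of counting keys per top-level prefix; the sample key list is made up for illustration:

# A minimal, single-machine sketch of the map and reduce phases
# for counting keys per top-level prefix.
from collections import defaultdict

# Hypothetical sample of S3 object keys standing in for the bucket listing.
keys = [
    "logs/2021/01/app.log",
    "logs/2021/02/app.log",
    "images/cat.png",
    "images/dog.png",
    "backups/db.dump",
]

def mapper(key):
    # Map phase: emit a (prefix, 1) pair for each object key.
    prefix = key.split("/", 1)[0]
    return prefix, 1

def reducer(pairs):
    # Reduce phase: sum the counts for each prefix after grouping.
    totals = defaultdict(int)
    for prefix, count in pairs:
        totals[prefix] += count
    return dict(totals)

print(reducer(mapper(k) for k in keys))
# {'logs': 2, 'images': 2, 'backups': 1}

In a real MapReduce job, the framework runs many mapper and reducer instances on different machines and handles the shuffle step that groups each prefix's counts together before the reducers run.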
