Tuning MapReduce job parameters

The Hadoop framework is very flexible and can be tuned using a number of configuration parameters. In this recipe, we will discuss the function and purpose of different configuration parameters you can set for a MapReduce job.

Getting ready

Ensure that you have a MapReduce job whose job class extends the Hadoop Configured class and implements the Hadoop Tool interface, such as any of the MapReduce applications we have written so far in this book.
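Such a job class typically follows the `extends Configured implements Tool` pattern. The following is a minimal sketch of a driver in that shape; the class name MyMapReduceJob and the job name are placeholders, and a real job would also set its mapper, reducer, and key/value classes:

```java
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Extending Configured gives the class a Configuration object that
// ToolRunner has already populated with any generic options
// (-D, -conf, -fs, -jt) supplied on the command line.
public class MyMapReduceJob extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() already contains any -D overrides from the command line
        Job job = new Job(getConf(), "my-mapreduce-job");
        job.setJarByClass(MyMapReduceJob.class);
        // set the mapper, reducer, and key/value classes here for a real job
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new MyMapReduceJob(), args);
        System.exit(exitCode);
    }
}
```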

How to do it...

Follow these steps to customize MapReduce job parameters:

  1. Ensure you have a MapReduce job class that extends the Hadoop Configured class and implements the Hadoop Tool interface.
  2. Use the ToolRunner.run() static method to run your MapReduce job, as shown in the following example:
    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new MyMapReduceJob(), args);
        System.exit(exitCode);
    }
  3. Examine the following table of Hadoop job properties and values:

     Property name: mapred.reduce.tasks
     Possible values: Integer (0 - N)
     Description: Sets the number of reducers to launch.

     Property name: mapred.child.java.opts
     Possible values: JVM options
     Description: These options are passed as arguments to every task JVM. For example, to set the maximum heap size for all tasks to 1 GB, you would set this property to '-Xmx1g'.

     Property name: mapred.map.child.java.opts
     Possible values: JVM options
     Description: These options are passed as arguments to every map task JVM, overriding mapred.child.java.opts for map tasks.

     Property name: mapred.reduce.child.java.opts
     Possible values: JVM options
     Description: These options are passed as arguments to every reduce task JVM, overriding mapred.child.java.opts for reduce tasks.

     Property name: mapred.map.tasks.speculative.execution
     Possible values: Boolean (true/false)
     Description: Tells the Hadoop framework to speculatively launch a duplicate of a map task on a different node in the cluster if the task is performing poorly compared to the other tasks in the job. This property was discussed in Chapter 1, Hadoop Distributed File System – Importing and Exporting Data.

     Property name: mapred.reduce.tasks.speculative.execution
     Possible values: Boolean (true/false)
     Description: Tells the Hadoop framework to speculatively launch a duplicate of a reduce task on a different node in the cluster if the task is performing poorly compared to the other tasks in the job.

     Property name: mapred.job.reuse.jvm.num.tasks
     Possible values: Integer (-1, 1 - N)
     Description: The number of tasks a task JVM may run before being replaced. A value of 1 indicates that one JVM will be started per task; a value of -1 indicates that a single JVM can run an unlimited number of tasks. Setting this parameter can improve the performance of small jobs because JVMs are re-used for multiple tasks (as opposed to starting a JVM for each and every task).

     Property names: mapred.compress.map.output and mapred.map.output.compression.codec
     Possible values: Boolean (true/false); String (name of a compression codec class)
     Description: These two parameters are used to compress the intermediate output of map tasks: the first enables compression and the second selects the codec.

     Property names: mapred.output.compress, mapred.output.compression.type, and mapred.output.compression.codec
     Possible values: Boolean (true/false); String (NONE, RECORD, or BLOCK); String (name of a compression codec class)
     Description: These three parameters are used to compress the final output of a MapReduce job. The compression type applies when the output is written as SequenceFiles.

  4. Execute a MapReduce job with a custom Hadoop property. For example, we will launch a job using five reducers:
    $ cd /path/to/hadoop
    $ bin/hadoop jar MyJar.jar com.packt.MyJobClass -Dmapred.reduce.tasks=5
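The properties in the preceding table can also be set programmatically inside run() rather than with -D flags. The following sketch (the class name TunedJob is a placeholder) sets a few of them; note that because run() executes after ToolRunner has parsed the command line, values set this way overwrite any -D flags that target the same keys:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Sketch: setting job-tuning properties in code instead of on the command line.
public class TunedJob extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        conf.setInt("mapred.reduce.tasks", 5);               // same effect as -Dmapred.reduce.tasks=5
        conf.set("mapred.child.java.opts", "-Xmx1g");        // 1 GB heap for every task JVM
        conf.setBoolean("mapred.compress.map.output", true); // compress intermediate map output
        conf.set("mapred.map.output.compression.codec",
                 "org.apache.hadoop.io.compress.GzipCodec");
        // ...build and submit the Job using this Configuration as usual...
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new TunedJob(), args));
    }
}
```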

How it works...

When a job class extends the Hadoop Configured class and implements the Hadoop Tool interface, the ToolRunner class will automatically handle the following generic Hadoop arguments:

Argument/Flag: -conf
Purpose: Takes a path to a parameter configuration file.

Argument/Flag: -D
Purpose: Specifies Hadoop key/value properties, which are added to the job configuration.

Argument/Flag: -fs
Purpose: Specifies the host:port of the NameNode.

Argument/Flag: -jt
Purpose: Specifies the host:port of the JobTracker.

In the case of this recipe, the ToolRunner class will automatically place all of the parameters specified with the -D flag into the job configuration.
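The -conf flag, in turn, points at a standard Hadoop configuration XML file. A minimal file might look like the following sketch (the property values here are illustrative); every property it contains is loaded into the job configuration before run() executes:

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>5</value>
  </property>
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
</configuration>
```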
