Developing and testing MapReduce jobs running in local mode

Developing with MRUnit and testing in local mode are complementary. MRUnit provides an elegant way to test the map and reduce phases of a MapReduce job, and initial development and testing of jobs should be done using that framework. However, several key components of a MapReduce job are not exercised by MRUnit tests; two notable class types are InputFormats and OutputFormats. Running jobs in local mode exercises a larger portion of a job. When testing in local mode, it is also much easier to use a significant amount of real-world data.

This recipe will show an example of configuring Hadoop to use local mode and then debugging that job using the Eclipse debugger.

Getting ready

You will need to download the weblog_entries_bad_records.txt dataset from the Packt website, http://www.packtpub.com/support. This example will use the CounterExample.java class provided with the Using Counters in a MapReduce job to track bad records recipe.

How to do it...

  1. Open the $HADOOP_HOME/conf/mapred-site.xml file in a text editor.
  2. Set the mapred.job.tracker property value to local:
    <property>
       <name>mapred.job.tracker</name>
       <value>local</value>
    </property>
  3. Open the $HADOOP_HOME/conf/core-site.xml file in a text editor.
  4. Set the fs.default.name property value to file:///:
    <property>
       <name>fs.default.name</name>
       <value>file:///</value>
    </property>
  5. Open the $HADOOP_HOME/conf/hadoop-env.sh file and add the following line:
    export HADOOP_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=7272"
  6. Run the CountersExample.jar file by passing the local path to the weblog_entries_bad_records.txt file, and give a local path to an output file:
    $HADOOP_HOME/bin/hadoop jar ./CountersExample.jar com.packt.hadoop.solutions.CounterExample /local/path/to/weblog_entries_bad_records.txt /local/path/to/weblog_entries_clean.txt
    

    You'll get the following output:

    Listening for transport dt_socket at address: 7272
    
  7. Open the Counters project in Eclipse, and set up a new remote debug configuration.
  8. Create a new breakpoint and debug.
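
One practical note on step 5: with suspend=y, every JVM launched by the hadoop script from that shell will block on startup until a debugger attaches, which also stalls unrelated commands such as hadoop fs. The following variant (an adjustment to the recipe's setting, not part of the original text) uses suspend=n so processes start immediately while still listening on port 7272:

```shell
# Variant of the step 5 export: suspend=n lets the JVM start running
# right away instead of pausing until a debugger attaches on port 7272.
export HADOOP_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=7272"
```

With suspend=n you can attach the Eclipse debugger after the job has started, though very short jobs may finish before your breakpoints are hit; suspend=y remains the safer choice when you need to debug job setup.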

How it works...

A MapReduce job that is configured to execute in local mode runs entirely in a single JVM instance. Unlike pseudo-distributed mode, this makes it possible to hook up a remote debugger to a job. Setting the mapred.job.tracker property to local informs the Hadoop framework that jobs should now run in local mode. The LocalJobRunner class, which is used when running in local mode, is responsible for implementing the MapReduce framework locally in a single process. This has the benefit of keeping jobs that run in local mode as close as possible to jobs that run distributed on a cluster. One downside of using LocalJobRunner is that it carries the baggage of setting up an instance of Hadoop, which means even the smallest jobs will require at least several seconds to run.

Setting the fs.default.name property value to file:/// configures the job to look for input and output files on the local filesystem. Adding export HADOOP_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=7272" to the hadoop-env.sh file configures the JVM to suspend processing on startup and listen for a remote debugger on port 7272.
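
This recipe uses the Hadoop 1.x property names. On Hadoop 2.x and later, the old names are deprecated (though still honored); the current equivalents are mapreduce.framework.name and fs.defaultFS. A sketch of the same local-mode configuration under the newer names, for readers on a 2.x cluster:

```xml
<!-- mapred-site.xml: Hadoop 2.x equivalent of mapred.job.tracker=local -->
<property>
   <name>mapreduce.framework.name</name>
   <value>local</value>
</property>

<!-- core-site.xml: Hadoop 2.x equivalent of fs.default.name=file:/// -->
<property>
   <name>fs.defaultFS</name>
   <value>file:///</value>
</property>
```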

There's more...

Apache Pig also provides a local mode for development and testing. It uses the same LocalJobRunner class as a local mode MapReduce job. It can be accessed by starting Pig with the following command:

pig -x local

See also

  • Developing and testing MapReduce jobs with MRUnit