When working with the amounts of data that Hadoop was designed to process, it is only a matter of time before even the most robust job runs into unexpected or malformed data. If not handled properly, bad data can easily cause a job to fail. By default, Hadoop will not skip bad data. For some applications, it may be acceptable to skip a small percentage of the input data. Hadoop provides a way to do just that. Even if skipping data is not acceptable for a given use case, Hadoop's skipping mechanism can be used to pinpoint the bad data and log it for review.
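To enable skipping of up to 100 records in a map job, add the following to the run() method where the job configuration is set up:
SkipBadRecords.setMapperMaxSkipRecords(conf, 100);
To enable skipping of up to 100 record groups in a reduce job, add the following to the run() method where the job configuration is set up:
SkipBadRecords.setReducerMaxSkipGroups(conf, 100);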
Record skipping is triggered once two conditions are met: skipping has been enabled by calling the static methods on the SkipBadRecords class, and a task has failed twice. Hadoop then performs a binary search through the input data to identify the bad records. Keep in mind that this is an expensive process that could require multiple attempts. A job that enables skipping will probably want to increase the number of map and reduce attempts. This can be done by using the JobConf.setMaxMapAttempts() and JobConf.setMaxReduceAttempts() methods.
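The binary search described above can be illustrated with a small, self-contained simulation. This is a sketch of the general idea only, not Hadoop's implementation: the class BadRecordSearch and its methods are hypothetical names, and each call to attempt() stands in for one task attempt over a range of records.

```java
import java.util.List;
import java.util.function.Consumer;

// Hypothetical illustration; not part of the Hadoop API.
class BadRecordSearch {
    /** Number of simulated task attempts used by the most recent search. */
    static int attempts = 0;

    /** Simulates one task attempt: true if records[from, to) are processed without an exception. */
    static <T> boolean attempt(List<T> records, int from, int to, Consumer<T> processor) {
        attempts++;
        try {
            for (int i = from; i < to; i++) {
                processor.accept(records.get(i));
            }
            return true;
        } catch (RuntimeException e) {
            return false;
        }
    }

    /** Returns the index of the first record that makes the processor fail, or -1 if none does. */
    static <T> int findBadRecord(List<T> records, Consumer<T> processor) {
        attempts = 0;
        int lo = 0;
        int hi = records.size();
        if (attempt(records, lo, hi, processor)) {
            return -1; // the whole input is clean
        }
        // Invariant: the first bad record lies in [lo, hi).
        while (hi - lo > 1) {
            int mid = lo + (hi - lo) / 2;
            if (!attempt(records, lo, mid, processor)) {
                hi = mid; // the failure is in the first half
            } else {
                lo = mid; // the first half is clean, so the failure is in the second half
            }
        }
        return lo;
    }
}
```

For example, calling BadRecordSearch.findBadRecord() on the eight records "1", "2", "x", "4", "5", "6", "7", "8" with the processor s -> Integer.parseInt(s) takes four simulated attempts to isolate the record at index 2. Every narrowing step costs a fresh attempt, which is why a job that enables skipping usually needs a higher attempt limit.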
By default, the process to skip bad records will be triggered after two failed attempts. This default can be changed using the setAttemptsToStartSkipping() method on the SkipBadRecords class. The output folder of the skipped records can be controlled using the setSkipOutputPath() method on the SkipBadRecords class. By default, skipped records are logged to the _logs/skip/ folder under the job output directory. These files are formatted as Hadoop sequence files. To get them into human-readable format, use the following command:
hadoop fs -text _logs/skip/<filename>
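The settings discussed so far can be collected into a single helper, sketched below. It assumes the old org.apache.hadoop.mapred API with the Hadoop libraries on the classpath. The class name SkippingConfigSketch, the output path /tmp/skipped-records, and the attempt limit of 8 are illustrative assumptions rather than recommended values.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SkipBadRecords;

// Hypothetical helper; call it from run() once the JobConf has been created.
public class SkippingConfigSketch {

    /** Applies record-skipping settings to an existing job configuration. */
    public static void enableSkipping(JobConf conf) {
        // Enable skipping: up to 100 records in map tasks
        // and up to 100 record groups in reduce tasks.
        SkipBadRecords.setMapperMaxSkipRecords(conf, 100);
        SkipBadRecords.setReducerMaxSkipGroups(conf, 100);

        // Start skipping after two failed attempts (this is already the default).
        SkipBadRecords.setAttemptsToStartSkipping(conf, 2);

        // Write skipped records to a custom location (path is an assumption).
        SkipBadRecords.setSkipOutputPath(conf, new Path("/tmp/skipped-records"));

        // Leave room for the extra attempts that the search for bad records consumes
        // (8 is an arbitrary illustrative value).
        conf.setMaxMapAttempts(8);
        conf.setMaxReduceAttempts(8);
    }
}
```

Keeping these calls in one place makes it easier to turn skipping on for a diagnostic run and off again for normal production runs.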
Record-skipping can also be controlled using MapReduce job properties. The following table is a relevant snippet from the table provided at http://hadoop.apache.org/common/docs/r0.20.2/mapred-default.html.