Moving data efficiently between clusters using Distributed Copy

Hadoop Distributed Copy (distcp) is a tool for efficiently copying large amounts of data within or between clusters. It uses the MapReduce framework to do the copying, which brings the benefits of parallelism, error handling, recovery, logging, and reporting. distcp is especially useful when moving data between development, research, and production cluster environments.

Getting ready

The source and destination clusters must be able to reach each other.

The source cluster should have speculative execution turned off for map tasks. In the mapred-site.xml configuration file, set mapred.map.tasks.speculative.execution to false. Otherwise, if a speculative duplicate of a copy task fails or is killed, the files it was writing on the destination can be left in an undefined state.
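For reference, the setting looks like this in mapred-site.xml (a minimal sketch using the Hadoop 1.x/MRv1 property name given above; later releases rename it mapreduce.map.speculative):

<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>false</value>
</property>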

The source and destination clusters must use the same RPC protocol. Typically, this means that both clusters should have the same version of Hadoop installed.

How to do it...

Complete the following steps to copy a folder from one cluster to another; a quick verification sketch follows the steps:

  1. Copy the weblogs folder from cluster A to cluster B:
    hadoop distcp hdfs://namenodeA/data/weblogs hdfs://namenodeB/data/weblogs
  2. Copy the weblogs folder from cluster A to cluster B, overwriting any existing files:
    hadoop distcp -overwrite hdfs://namenodeA/data/weblogs hdfs://namenodeB/data/weblogs
  3. Synchronize the weblogs folder between cluster A and cluster B:
    hadoop distcp -update hdfs://namenodeA/data/weblogs hdfs://namenodeB/data/weblogs
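
To verify the result, you can compare the space consumed by the folder on each cluster. This is a minimal sanity check (the exact output format of hadoop fs -du varies between Hadoop versions):

hadoop fs -du hdfs://namenodeA/data/weblogs
hadoop fs -du hdfs://namenodeB/data/weblogs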

How it works...

On the source cluster, the contents of the folder being copied are treated as one large, temporary file. A map-only MapReduce job is created to do the copying between clusters. By default, each mapper is given a 256-MB block of this temporary file to copy. For example, if the weblogs folder were 10 GB in size, 40 mappers would each copy roughly 256 MB. distcp also has an option to specify the number of mappers:

hadoop distcp -m 10 hdfs://namenodeA/data/weblogs hdfs://namenodeB/data/weblogs

In the preceding example, 10 mappers would be used. If the weblogs folder were 10 GB in size, each mapper would be given roughly 1 GB to copy.
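
The -m flag can be combined with the other options shown earlier. For example, the following sketch (reusing the paths from the previous examples) caps an incremental synchronization at 10 map tasks:

hadoop distcp -update -m 10 hdfs://namenodeA/data/weblogs hdfs://namenodeB/data/weblogs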

There's more...

When copying between two clusters that are running different versions of Hadoop, it is generally recommended to use HftpFileSystem as the source. HftpFileSystem is a read-only filesystem, so the distcp command must be run from the destination cluster:

hadoop distcp hftp://namenodeA:port/data/weblogs hdfs://namenodeB/data/weblogs

In the preceding command, port is defined by the dfs.http.address property in the hdfs-site.xml configuration file.
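
For reference, the relevant entry in the source cluster's hdfs-site.xml would look something like the following sketch; the host and port shown are placeholders rather than values from this recipe (50070 is the conventional NameNode HTTP port in Hadoop 1.x, but check your own configuration):

<property>
  <name>dfs.http.address</name>
  <value>namenodeA:50070</value>
</property>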
