Storage optimization

Data that is used or scanned frequently can be identified as hot data. Usually, query performance on hot data is critical for overall performance. Increasing the data replication factor in HDFS (see the following example) for hot data increases the chance that a job reads the data from a local node, which can improve overall performance. However, this is a trade-off against storage space:

$ hdfs dfs -setrep -R -w 4 /user/hive/warehouse/employee
Replication 4 set: /user/hive/warehouse/employee/000000_0
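To confirm that the new replication factor has taken effect, we can check a file with the -stat command; the %r format specifier prints the replication factor (the path here is the same example file used above):

$ hdfs dfs -stat %r /user/hive/warehouse/employee/000000_0
4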

On the other hand, too many files or too much redundancy can exhaust the namenode's memory, especially when there are lots of small files whose sizes are less than the HDFS block size. Hadoop itself already provides some solutions to deal with the small-file issue in the following ways:

  • Hadoop Archive/HAR: This is a toolkit, introduced earlier, for packing small files into HDFS archive files (see the example command after this list).
  • SEQUENCEFILE format: This is a file format that can pack many small files into bigger files, optionally with compression.
  • CombineFileInputFormat: A type of InputFormat that combines small files into larger input splits before map and reduce processing. It is the default InputFormat for Hive (see https://issues.apache.org/jira/browse/HIVE-2245).
  • HDFS Federation: This supports multiple namenodes, each managing a portion of the namespace, so that the cluster can manage more files.
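For example, the small files under a directory can be packed into a single HAR file with the hadoop archive command; the paths below are only illustrative, and -p specifies the parent directory of the source files:

$ hadoop archive -archiveName employee.har -p /user/hive/warehouse employee /user/hive/warehouse_har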

We can also leverage other tools in the Hadoop ecosystem if we have them installed, such as the following:

  • HBase has a smaller block size and better file format to deal with smaller file storage and access issues
  • Flume NG can be used as a pipe to merge small files into big ones
  • A file merge program can be developed and scheduled to merge small files in HDFS, or to merge the files before loading them into HDFS (a minimal sketch of such a merge step follows this list)
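As a minimal sketch of such a merge step, assuming the files are plain, uncompressed text and that the paths shown are only examples, the files in a directory can be concatenated locally with -getmerge and written back to HDFS as a single file:

$ hdfs dfs -getmerge /user/hive/warehouse/employee /tmp/employee_merged
$ hdfs dfs -mkdir -p /user/hive/warehouse/employee_merged
$ hdfs dfs -put /tmp/employee_merged /user/hive/warehouse/employee_merged/000000_0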

For Hive, we can use the following configurations to merge files of query results and avoid recreating small files:

  • hive.merge.mapfiles: This merges small files at the end of a map-only job. By default, it is true.
  • hive.merge.mapredfiles: This merges small files at the end of a MapReduce job. Set it to true, as the default is false.
  • hive.merge.size.per.task: This defines the size of merged files at the end of the job. The default value is 256,000,000.
  • hive.merge.smallfiles.avgsize: This is the threshold for triggering file merge. The default value is 16,000,000.

When the average output file size of a job is less than the value specified by the hive.merge.smallfiles.avgsize property and both hive.merge.mapfiles (for map-only jobs) and hive.merge.mapredfiles (for MapReduce jobs) are set to true, Hive will start an additional MapReduce job to merge the output files into big files.
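As a sketch, these properties can be enabled for a session (or set in hive-site.xml) as follows; the two size values shown are simply the defaults, expressed in bytes:

-- Merge small output files for both map-only and MapReduce jobs
SET hive.merge.mapfiles = true;
SET hive.merge.mapredfiles = true;
-- Target size of the merged files and the average-size threshold that triggers the merge job
SET hive.merge.size.per.task = 256000000;
SET hive.merge.smallfiles.avgsize = 16000000;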
