Skew join

When working with data that has a highly uneven distribution, data skew can occur in such a way that a small number of compute nodes must handle the bulk of the computation. The following settings tell Hive to optimize the join properly when data skew occurs:

> SET hive.optimize.skewjoin=true; --If there is data skew in join, set it to true. Default is false.

> SET hive.skewjoin.key=100000;
 --This is the default value. If the number of rows with the same join key
 --is bigger than this, the key is treated as a skew key and its extra rows
 --are sent to the other unused reducers.
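
As a minimal sketch of how these settings might be used in a session, assume two hypothetical tables, orders and customers, joined on a customer_id column in which a few customers account for most of the rows. The table and column names are illustrative only and do not come from the earlier examples:

> SET hive.optimize.skewjoin=true;
> SET hive.skewjoin.key=100000;
> SELECT o.order_id, c.customer_name
 FROM orders o --hypothetical fact table, skewed on customer_id
 JOIN customers c --hypothetical dimension table
 ON o.customer_id = c.customer_id;

With the settings on, any customer_id that appears in more than 100000 rows during the join would be treated as a skew key rather than being processed entirely by a single reducer.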

Skewed data can occur with GROUP BY too. To optimize it, we need to set hive.groupby.skewindata=true, which enables skew data optimization for the GROUP BY result. Once configured, Hive first triggers an additional MapReduce job whose map output is distributed randomly to the reducers, so that rows sharing a hot key are spread out and partially aggregated there. A second job then aggregates the partial results by the GROUP BY key.
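
The following sketch shows the setting applied to an aggregation, assuming a hypothetical orders table in which a few customer_id values dominate. The table and column names are assumptions made for illustration:

> SET hive.groupby.skewindata=true;
> SELECT customer_id, count(*) AS order_cnt
 FROM orders --hypothetical table, skewed on customer_id
 GROUP BY customer_id;

Prefixing the query with EXPLAIN is one way to check whether the plan contains the additional MapReduce stage.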

For more information about join optimization, please refer to the Hive Wiki at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization and https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization.
