166 | Big Data Simplied
Sort-Merge-Bucket Map JOIN: Sort-Merge-Bucket Map join is also a technique of Hive join optimiza-
tion. These joins are very efcient as they require a simple merge of previously sorted tables. The
corresponding buckets are joined with each other at the mapper stage.
There are several scenarios where we can use Hive Sort Merge Bucket Join.
• When all tables are large.
• When all tables are bucketed using the join columns.
• When using the join columns, Sorted.
• When the number of buckets is same as the number of all tables.
Properties need to change in Hive session to perform SMB join is as follows.
hive>set hive.enforce.sortmergebucketmapjoin=false;
hive>set hive.auto.convert.sortmerge.join=true;
hive>set hive.optimize.bucketmapjoin = true;
hive>set hive.optimize.bucketmapjoin.sortedmerge = true;
hive>set hive.auto.convert.join=false;
Following are the limitations of Hive Sort Merge Bucket Join:
• SQL joins tables need to be bucketed. Hence, for other types of SQL, it cannot be used.
• It is possible that too much partition tables might slow down here.
For SMB join created two bucketed and sorted tables named buck_emp and buck_dept.
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
create table buck_emp(
emp_id int,
emp_name string,
dept_id int)
M07 Big Data Simplified XXXX 01.indd 166 5/17/2019 2:50:06 PM