Other BigData Tools andTechnologies | 167
CLUSTERED BY (dept_id)
SORTED BY (dept_id)
INTO 4 BUCKETS;
INSERT OVERWRITE TABLE buck_emp SELECT * FROM emp;
create table buck_dept(
id int,
dept_name string)
CLUSTERED BY (id)
SORTED BY (id)
INTO 4 BUCKETS;
INSERT OVERWRITE TABLE buck_dept SELECT * FROM dept;
set hive.enforce.sortmergebucketmapjoin=false;
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.auto.convert.join=false;
SELECT buck_emp.emp_id,buck_emp.emp_name,buck_dept.dept_name FROM
buck_dept INNER JOIN buck_emp ON (buck_dept.id = buck_emp.dept_id);
7.2.5 Partitioning and Bucketing in Hive
Partitioning: Partitioning is a way of dividing a Hive table into related internal parts based on the
values of particular main columns, like date, city, location, etc. Each table in the hive can have
one or more partition keys to identify a particular partition.
M07 Big Data Simplified XXXX 01.indd 167 5/17/2019 2:50:06 PM
168 | Big Data Simplied
It is used to slice the data horizontally over the entire range or on a smaller range of values
using one or more column. The partition concept is well known in RDBMS as well.
At first create ‘emp’ external table with ‘emp.txt’ data and then create a partition table.
hive>
>create external table emp_part(accno string,
> dt string,
>ctrycd int,
>groupid int,
> name string)
>PARTITIONED BY (ctry string)
> row format delimited
> fields terminated by ‘,’ ;
OK
Time taken: 0.223 seconds
hive>insert overwrite table emp_part partition(ctry) select * from emp
where ctry
=
”USA”;
hive>show partitions emp_part;
OK
ctry=USA
Time taken: 0.163 seconds, Fetched: 1 row(s)
hive>select * from emp_part;
OK
M07 Big Data Simplified XXXX 01.indd 168 5/17/2019 2:50:07 PM
Other BigData Tools andTechnologies | 169
4564546454 280813 640 890 Chiranjib USA
4564546489 280813 640 840 Astik USA
4564546454 280813 640 890 Arijit USA
4564546489 280813 640 840 Soumen USA
Time taken: 0.186 seconds, Fetched: 4 row(s)
Bucketing: Now, let us assume a condition that there is a huge dataset. At times, even after par-
titioning on a particular eld or elds, the partitioned le size doesn’t match with the actual
expectation and remains huge and we want to manage the partition results into different parts.
To overcome this problem of partitioning, Hive provides Bucketing concept, which allows user
to divide table data sets into more manageable parts.
At first create ‘emp’ external table with ‘emp.txt’ data and then create a bucketed table as
shown below.
hive>
>create table emp1_bucketed(accno string,
> dt string,
>ctrycd int,
>groupid int,
> name string,
>ctry string)
> CLUSTERED BY (ctry) into 3 buckets
> row format delimited
> fields terminated by ‘,’;
OK
Time taken: 0.278 seconds
Now, load the data in bucketed table from ‘emp’ table as shown below.
hive>from emp insert into table emp1_bucketed select *;
hduser@sayan:~/$ hadoop fs -ls /user/hive/warehouse
Found 1 items
drwxrwxr-x - hduser supergroup 0 2016-02-10 16:33 /user/hive/
warehouse/emp1_bucketed
hduser@sayan:~/$hadoop fs -ls /user/hive/warehouse/emp1_bucketed
Found 3 items
-rwxrwxr-x 1 hduser supergroup 71 2016-02-10 16:33 /user/hive/
warehouse/emp1_bucketed/000000_0
-rwxrwxr-x 1 hduser supergroup 70 2016-02-10 16:33 /user/hive/
warehouse/emp1_bucketed/000001_0
-rwxrwxr-x 1 hduser supergroup 230 2016-02-10 16:33 /user/hive/
warehouse/emp1_bucketed/000002_0
M07 Big Data Simplified XXXX 01.indd 169 5/17/2019 2:50:07 PM
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.129.194.123