Data exchange with INSERT

To move data between tables/partitions, we can use the INSERT keyword. Like other relational databases, Hive supports inserting data into a table by selecting data from another table. This is a very common ETL (a term in data warehousing for Extract, Transform, and Load) pattern used to populate an existing or new table from another table or dataset. The HQL INSERT statement has the same syntax as a relational database's INSERT. However, HQL has extended its INSERT statement by supporting overwriting data, multi-insert, dynamic partition insert, as well as inserting data into files. The following are a few examples of INSERT statements in HQL:

  1. The following is a regular INSERT from the SELECT statement:
      -- Check the target table, which is empty.
> SELECT name, work_place FROM employee;
+-------------+-------------------+
|employee.name|employee.work_place|
+-------------+-------------------+
+-------------+-------------------+
No rows selected (0.115 seconds)


-- Populate data from query while "INTO" will append data
> INSERT INTO TABLE employee SELECT * FROM ctas_employee;
No rows affected (31.701 seconds)


-- Verify the data loaded
> SELECT name, work_place FROM employee;
+---------------+------------------------+
| employee.name |  employee.work_place   |
+---------------+------------------------+
| Michael       | ["Montreal","Toronto"] |
| Will          | ["Montreal"]           |
| Shelley       | ["New York"]           |
| Lucy          | ["Vancouver"]          |
+---------------+------------------------+
4 rows selected (0.12 seconds)
  2. Insert into a table with specified columns. For columns not specified, NULL is populated. However, there are two limitations for now. First, it only works for INSERT INTO rather than INSERT OVERWRITE. Second, unspecified columns must be of a primitive data type (complex types, such as array, are not supported). The same limitations also apply to INSERT INTO ... VALUES statements:
      > CREATE TABLE emp_simple( -- Create a test table with only primitive types
> name string,
> work_place string
> );
No rows affected (1.479 seconds)


> INSERT INTO TABLE emp_simple (name) -- Specify which columns to insert
> SELECT name FROM employee WHERE name = 'Will';
No rows affected (30.701 seconds)


> INSERT INTO TABLE emp_simple VALUES -- Insert constant values
> ('Michael', 'Toronto'),('Lucy', 'Montreal');
No rows affected (18.045 seconds)


> SELECT * FROM emp_simple; -- Verify the data loaded
+---------+------------+
| name    | work_place |
+---------+------------+
| Will    | NULL       | -- NULL when column is not specified
| Michael | Toronto    |
| Lucy    | Montreal   |
+---------+------------+
3 rows selected (0.263 seconds)
  3. Insert data from a CTE statement:
      > WITH a as (
> SELECT * FROM ctas_employee
> )
> FROM a
> INSERT OVERWRITE TABLE employee
> SELECT *;
No rows affected (30.1 seconds)
  4. Run a multi-insert, which scans the source table only once, for better performance:
      > FROM ctas_employee
> INSERT OVERWRITE TABLE employee
> SELECT *
> INSERT OVERWRITE TABLE employee_internal
> SELECT *
> INSERT OVERWRITE TABLE employee_partitioned
> PARTITION (year=2018, month=9) -- Insert to static partition
> SELECT *
> ;
No rows affected (27.919 seconds)
The INSERT OVERWRITE statement will replace the data in the target table/partition, while INSERT INTO will append data.
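The performance gain of a multi-insert comes from reading the source a single time and fanning each row out to all targets. A minimal Python sketch of that single-scan fan-out (the rows and target names are illustrative, not Hive internals):

```python
# Sketch: one pass over the source feeds several targets, the way a Hive
# multi-insert scans ctas_employee once for all three INSERT clauses.
source = [("Michael", 2018), ("Will", 2018), ("Lucy", 2017)]

employee, employee_internal, partitioned = [], [], {}

for row in source:                       # single scan of the source
    employee.append(row)                 # INSERT OVERWRITE TABLE employee
    employee_internal.append(row)        # INSERT OVERWRITE TABLE employee_internal
    # static partition: every row lands in the fixed (year=2018, month=9) bucket
    partitioned.setdefault((2018, 9), []).append(row)

print(len(employee), len(employee_internal), len(partitioned[(2018, 9)]))
# prints: 3 3 3
```

Each target receives every source row, but the source is read only once instead of once per INSERT clause.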

When inserting data into partitions, we need to specify the partition columns. Instead of specifying static partition values, Hive also supports deriving the partition values dynamically. Dynamic partitions are useful when it is necessary to populate partitions from data values. Dynamic partitions are disabled by default, because a careless dynamic partition insert could create many partitions unexpectedly. We have to set the following property to enable dynamic partitions:

> SET hive.exec.dynamic.partition=true;
No rows affected (0.002 seconds)

By default, the user must specify at least one static partition column. This is to avoid accidentally overwriting partitions. To disable this restriction, we can set the partition mode to nonstrict from the default strict mode before inserting into dynamic partitions as follows:

> SET hive.exec.dynamic.partition.mode=nonstrict;
No rows affected (0.002 seconds)

-- Partition year, month are determined from data
> INSERT INTO TABLE employee_partitioned
> PARTITION(year, month)
> SELECT name, array('Toronto') as work_place,
> named_struct("gender","Male","age",30) as gender_age,
> map("Python",90) as skills_score,
> map("R&D",array('Developer')) as depart_title,
> year(start_date) as year, month(start_date) as month
> FROM employee_hr eh
> WHERE eh.employee_id = 102;
No rows affected (29.024 seconds)
Complex type constructors are used in the preceding example to create a constant value of a complex data type.
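Conceptually, the trailing columns of the SELECT list determine which partition each row lands in. A minimal Python sketch of that routing, with illustrative rows standing in for employee_hr (not Hive internals):

```python
from collections import defaultdict
from datetime import date

# Illustrative employee_hr rows: (employee_id, name, start_date)
employee_hr = [
    (100, "Michael", date(2014, 1, 29)),
    (102, "Steven", date(2012, 11, 3)),
]

# (year, month) -> rows, mirroring dynamic partitions derived from data
partitions = defaultdict(list)
for emp_id, name, start in employee_hr:
    if emp_id == 102:  # WHERE eh.employee_id = 102
        # the last SELECT columns (year, month) pick the target partition
        partitions[(start.year, start.month)].append(name)

print(dict(partitions))
# prints: {(2012, 11): ['Steven']}
```

Rows with different (year, month) values would land in different partitions without any of them being named in the statement.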

INSERT also supports writing data to files, which is the opposite operation to LOAD. It is usually used to extract data from SELECT statements to files in the local/HDFS directory. However, it only supports the OVERWRITE keyword, which means we can only overwrite rather than append data to the data files. By default, the columns are separated by Ctrl+A and the rows are separated by newlines in the exported file. The column, row, and collection separators can also be overridden, as in the table creation statement. The following are a few examples of exporting data to files using the INSERT OVERWRITE ... DIRECTORY statement:

  1. Insert into local files with the default row separators:
      > INSERT OVERWRITE LOCAL DIRECTORY '/tmp/output1'
> SELECT * FROM employee;
No rows affected (30.859 seconds)
Many partial files could be created by reducers when doing an insert into a directory. To merge them into one file, we can use the HDFS merge command: hdfs dfs -getmerge <exported_hdfs_folder> <local_folder>.
  2. Insert into local files with specified row separators:
      > INSERT OVERWRITE LOCAL DIRECTORY '/tmp/output2'
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
> SELECT * FROM employee;
No rows affected (31.937 seconds)


-- Verify the separator
$vi /tmp/output2/000000_0
Michael,Montreal^BToronto,Male^B30,DB^C80,Product^CDeveloper^DLead
Will,Montreal,Male^B35,Perl^C85,Product^CLead^BTest^CLead
Shelley,New York,Female^B27,Python^C80,Test^CLead^BCOE^CArchitect
Lucy,Vancouver,Female^B57,Sales^C89^BHR^C94,Sales^CLead
  3. Use multi-insert statements to export data from the same table:
      > FROM employee
> INSERT OVERWRITE DIRECTORY '/user/dayongd/output3'
> SELECT *
> INSERT OVERWRITE DIRECTORY '/user/dayongd/output4'
> SELECT name;
No rows affected (25.4 seconds)
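Files exported with the default separators can be split back into nested values with a few lines of code. A minimal Python sketch, using an illustrative line in the shape of the employee export above (\x01 is Ctrl+A, \x02 is Ctrl+B, \x03 is Ctrl+C):

```python
# Illustrative exported row: name, work_place (array), gender_age (struct),
# skills_score (map) with default Hive separators.
line = "Will\x01Montreal\x01Male\x0235\x01Perl\x0385"

fields = line.split("\x01")                    # Ctrl+A separates columns
name = fields[0]
work_place = fields[1].split("\x02")           # Ctrl+B separates array items
gender, age = fields[2].split("\x02")          # Ctrl+B separates struct fields
skills_score = dict(                           # Ctrl+B between map entries,
    entry.split("\x03")                        # Ctrl+C between key and value
    for entry in fields[3].split("\x02")
)

print(name, work_place, gender, int(age), skills_score)
# prints: Will ['Montreal'] Male 35 {'Perl': '85'}
```

The same split order applies when other tools, such as Pig or MapReduce jobs, read Hive's default text output.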
Combining HQL and HDFS shell commands, we can extract data to local or remote files with both append and overwrite supported. The hive -e quoted_hql_string or hive -f <hql_filename> commands can execute an HQL query or query file. Linux's redirect operators and piping can be used with these commands to redirect result sets. The following are a few examples:
  • Append to local files: $hive -e 'select * from employee' >> test
  • Overwrite local files: $hive -e 'select * from employee' > test
  • Append to HDFS files: $hive -e 'select * from employee'|hdfs dfs -appendToFile - /tmp/test1 
  • Overwrite HDFS files: $hive -e 'select * from employee'|hdfs dfs -put -f - /tmp/test2