Table creation

The following are major DDLs for the internal and external table creation:

  1. Show the data file content of employee.txt:
      $ vi /home/hadoop/employee.txt
Michael|Montreal,Toronto|Male,30|DB:80|Product:Developer^DLead
Will|Montreal|Male,35|Perl:85|Product:Lead,Test:Lead
Shelley|New York|Female,27|Python:80|Test:Lead,COE:Architect
Lucy|Vancouver|Female,57|Sales:89,HR:94|Sales:Lead
  1. Create an internal table and load the data:
      > CREATE TABLE IF NOT EXISTS employee_internal (
> name STRING COMMENT 'this is optinal column comments',
> work_place ARRAY<STRING>,-- table column names are NOT case
sensitive
> gender_age STRUCT<gender:STRING,age:INT>,
> skills_score MAP<STRING,INT>, -- columns names are lower case
> depart_title MAP<STRING,ARRAY<STRING>>-- No "," for the last
column
> )

> COMMENT 'This is an internal table'-- This is optional table
comments
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '|' -- Symbol to seperate columns
> COLLECTION ITEMS TERMINATED BY ','-- Seperate collection elements
> MAP KEYS TERMINATED BY ':' -- Symbol to seperate keys and values
> STORED as TEXTFILE; -- Table file format
No rows affected (0.149 seconds)


> LOAD DATA INPATH '/tmp/hivedemo/data/employee.txt'
> OVERWRITE INTO TABLE employee_internal;
If the folder path does not exist in the LOCATION property, Hive will create that folder. If there is another folder inside it, Hive will NOT report errors when creating the table but querying the table.
  1. Create an external table and load the data:
      > CREATE EXTERNAL TABLE employee_external ( -- Use EXTERNAL keywords
> name string,
> work_place ARRAY<string>,
> gender_age STRUCT<gender:string,age:int>,
> skills_score MAP<string,int>,
> depart_title MAP<STRING,ARRAY<STRING>>
> )
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '|'
> COLLECTION ITEMS TERMINATED BY ','
> MAP KEYS TERMINATED BY ':'
> STORED as TEXTFILE
> LOCATION '/user/dayongd/employee'; -- Specify data folder location
No rows affected (1.332 seconds)


> LOAD DATA INPATH '/tmp/hivedemo/data/employee.txt'
> OVERWRITE INTO TABLE employee_external;
Since v2.1.0, Hive supports primary and foreign key constraints. However, these constraints are not validated, so the upstream system needs to ensure data integrity before it's loaded into Hive. The Hive constraints may benefit some SQL tools to generate more efficient queries with them, but they are not used very often. 

Hive also supports creating temporary tables. A temporary table is only visible to the current user session. It's automatically deleted at the end of the session. The data of the temporary table is stored in the user's scratch directory, such as /tmp/hive-<username>. Therefore, make sure the folder is properly configured or secured when you have sensitive data in temporary tables. Whenever a temporary table has the same name as a permanent table, the temporary table will be chosen rather than the permanent table. A temporary table does not support partitions and indexes. The following are three ways to create temporary tables:

> CREATE TEMPORARY TABLE IF NOT EXISTS tmp_emp1 (
> name string,
> work_place ARRAY<string>,
> gender_age STRUCT<gender:string,age:int>,
> skills_score MAP<string,int>,
> depart_title MAP<STRING,ARRAY<STRING>>
> );
No rows affected (0.122 seconds)


> CREATE TEMPORARY TABLE tmp_emp2 as SELECT * FROM tmp_emp1;

> CREATE TEMPORARY TABLE tmp_emp3 like tmp_emp1;

Tables can also be created and populated by the results of a query in one statement, called Create-Table-As-Select (CTAS). The table created by CTAS is not visible by other users until all the query results are populated. CTAS has the following restrictions:

  • The table created cannot be a partitioned table
  • The table created cannot be an external table
  • The table created cannot be a list-bucketing table

CTAS statement always triggers a yarn job to populate the data, although the SELECT * statement itself does not trigger any yarn job.

CTAS can also be used with CTE, which stands for Common Table Expression. CTE is a temporary result set derived from a simple select query specified in a WITH clause, followed by the SELECT or INSERT statement to build the result set. The CTE is defined only within the execution scope of a single statement. One or more CTEs can be used in a nested or chained way with keywords, such as the SELECTINSERTCREATE TABLE AS SELECT, or CREATE VIEW AS SELECT statements. Using CTE of HQL makes the query more concise and clear than writing complex nested queries.

The following are examples using CTAS and CTE for table creation:

  1. Create a table with CTAS:
      > CREATE TABLE ctas_employee as SELECT * FROM employee_external;
No rows affected (1.562 seconds)
  1. Create a table with both CTAS and CTE:
      > CREATE TABLE cte_employee as
> WITH r1 as (
> SELECT name FROM r2 WHERE name = 'Michael'
> ),
> r2 as (
> SELECT name FROM employee WHERE gender_age.gender= 'Male'
> ),
> r3 as (
> SELECT name FROM employee WHERE gender_age.gender= 'Female'
> )
> SELECT * FROM r1
> UNION ALL
> SELECT * FROM r3;
No rows affected (61.852 seconds)


> SELECT * FROM cte_employee;
+----------------------------+
| cte_employee.name |
+----------------------------+
| Michael |
| Shelley |
| Lucy |
+----------------------------+
3 rows selected (0.091 seconds)
  1. Use CTAS to create an empty table by copying the schema from another table. It is empty because the where condition is false:
      > CREATE TABLE empty_ctas_employee as
> SELECT * FROM employee_internal WHERE 1=2;
No rows affected (213.356 seconds)
  1. In another way, we can also use CREATE TABLE LIKE to create an empty table. This is a faster way to copy the table schema since it does not trigger any jobs but only copies metadata:
      > CREATE TABLE empty_like_employee LIKE employee_internal;
No rows affected (0.115 seconds)
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.117.230.81