Hive data types are categorized into two types: primitive and complex. String and Int are the most useful primitive types, which are supported by most HQL functions. The details of primitive types are as follows:
ay contain a set of any type of fields. Complex types allow the nesting of types. The details of complex types a
Primitive type | Description | Example |
TINYINT | It has 1 byte, from -128 to 127. The postfix is Y. It is used as a small range of numbers. | 10Y |
SMALLINT | It has 2 bytes, from -32,768 to 32,767. The postfix is S. It is used as a regular descriptive number. | 10S |
INT | It has 4 bytes, from -2,147,483,648 to 2,147,483,647. | 10 |
BIGINT | It has 8 bytes, from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. The postfix is L. | 100L |
FLOAT | This is a 4 byte single-precision floating-point number, from 1.40129846432481707e-45 to 3.40282346638528860e+38 (positive or negative). Scientific notation is not yet supported. It stores very close approximations of numeric values. | 1.2345679 |
DOUBLE | This is an 8 byte double-precision floating-point number, from 4.94065645841246544e-324d to 1.79769313486231570e+308d (positive or negative). Scientific notation is not yet supported. It stores very close approximations of numeric values. | 1.2345678901234567 |
BINARY | This was introduced in Hive 0.8.0 and only supports CAST to STRING and vice versa. | 1011 |
BOOLEAN | This is a TRUE or FALSE value. | TRUE |
STRING | This includes characters expressed with either single quotes (') or double quotes ("). Hive uses C-style escaping within the strings. The max size is around 2 G. | 'Books' or "Books" |
CHAR | This is available starting with Hive 0.13.0. Most UDF will work for this type after Hive 0.14.0. The maximum length is fixed at 255. | 'US' or "US" |
VARCHAR | This is available starting with Hive 0.12.0. Most UDF will work for this type after Hive 0.14.0. The maximum length is fixed at 65,355. If a string value being converted/assigned to a varchar value exceeds the length specified, the string is silently truncated. | 'Books' or "Books" |
DATE | This describes a specific year, month, and day in the format of YYYY-MM-DD. It is available starting with Hive 0.12.0. The range of dates is from 0000-01-01 to 9999-12-31. | 2013-01-01 |
TIMESTAMP | This describes a specific year, month, day, hour, minute, second, and millisecond in the format of YYYY-MM-DD HH:MM:SS[.fff...]. It is available starting with Hive 0.8.0. | 2013-01-01 12:00:01.345 |
Hive has three main complex types: ARRAY, MAP, and STRUCT. These data types are built on top of the primitive data types. ARRAY and MAP are similar to that in Java. STRUCT is a record type, which may contain a set of any type of fields. Complex types allow the nesting of types. The details of complex types are as follows:
Complex type | Description | Example |
ARRAY | This is a list of items of the same type, such as [val1, val2, and so on]. You can access the value using array_name[index], for example, fruit[0]="apple". Index starts from 0. | ["apple","orange","mango"] |
MAP | This is a set of key-value pairs, such as {key1, val1, key2, val2, and so on}. You can access the value using map_name[key] for example, fruit[1]="apple". | {1: "apple",2: "orange"} |
STRUCT | This is a user-defined structure of any type of field, such as {val1, val2, val3, and so on}. By default, STRUCT field names will be col1, col2, and so on. You can access the value using structs_name.column_name, for example, fruit.col1=1. | {1, "apple"} |
NAMED STRUCT | This is a user-defined structure of any number of typed fields, such as {name1, val1, name2, val2, and so on}. You can access the value using structs_name.column_name, for example, fruit.apple="gala". | {"apple":"gala","weight kg":1} |
UNION | This is a structure that has exactly any one of the specified data types. It is available starting with Hive 0.7.0. It is not commonly used. | {2:["apple","orange"]} |
For MAP, the type of keys and values are unified. However, STRUCT is more flexible. STRUCT is more like a table, whereas MAP is more like an ARRAY with a customized index.
The following is a short exercise for all the commonly-used data types. The details of the CREATE, LOAD, and SELECT statements will be introduced in later chapters. Let's take a look at the exercise:
- Prepare the data as follows:
$vi employee.txt
Michael|Montreal,Toronto|Male,30|DB:80|Product:Developer^DLead
Will|Montreal|Male,35|Perl:85|Product:Lead,Test:Lead
Shelley|New York|Female,27|Python:80|Test:Lead,COE:Architect
Lucy|Vancouver|Female,57|Sales:89,HR:94|Sales:Lead
- Log in to beeline with the JDBC URL:
$beeline -u "jdbc:hive2://localhost:10000/default"
- Create a table using various data types (> indicates the beeline interactive mode):
> CREATE TABLE employee (
> name STRING,
> work_place ARRAY<STRING>,
> gender_age STRUCT<gender:STRING,age:INT>,
> skills_score MAP<STRING,INT>,
> depart_title MAP<STRING,ARRAY<STRING>>
> )
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '|'
> COLLECTION ITEMS TERMINATED BY ','
> MAP KEYS TERMINATED BY ':'
> STORED AS TEXTFILE;
No rows affected (0.149 seconds)
- Verify that the table has been created:
> !table employee
+---------+------------+------------+--------------+---------+
|TABLE_CAT|TABLE_SCHEMA| TABLE_NAME | TABLE_TYPE | REMARKS |
+---------+------------+------------+--------------+---------+
| |default | employee | MANAGED_TABLE| |
+---------+------------+------------+--------------+---------+
> !column employee
-------------+-------------+-------------+-------------------+
| TABLE_SCHEM | TABLE_NAME | COLUMN_NAME | TYPE_NAME |
+-------------+-------------+-------------+------------------+
| default | employee | name | STRING |
| default | employee | work_place | array<string> |
| default | employee | gender_age |
struct<gender:string,age:int>|
| default | employee | skills_score| map<string,int> |
| default | employee | depart_title|
map<string,array<string>> |
+-------------+-------------+-------------+------------------+
- Load data into the table:
> LOAD DATA INPATH '/tmp/hivedemo/data/employee.txt'
> OVERWRITE INTO TABLE employee;
No rows affected (1.023 seconds)
- Query the whole array and each array element in the table:
> SELECT work_place FROM employee;
+----------------------+
| work_place |
+----------------------+
| [Montreal, Toronto] |
| [Montreal] |
| [New York] |
| [Vancouver] |
+----------------------+
4 rows selected (27.231 seconds)
> SELECT
> work_place[0] as col_1, work_place[1] as col_2,
> work_place[2] as col_3
> FROM employee;
+------------+----------+--------+
| col_1 | col_2 | col_3 |
+------------+----------+--------+
| Montreal | Toronto | |
| Montreal | | |
| New York | | |
| Vancouver | | |
------------+----------+---------+
4 rows selected (24.689 seconds)
- Query the whole struct and each struct attribute in the table:
> SELECT gender_age FROM employee;
+------------------+
| gender_age |
+------------------+
| [Male, 30] |
| [Male, 35] |
| [Female, 27] |
| [Female, 57] |
+------------------+
4 rows selected (28.91 seconds)
> SELECT gender_age.gender, gender_age.age FROM employee;
+------------+------+
| gender | age |
+------------+------+
| Male | 30 |
| Male | 35 |
| Female | 27 |
| Female | 57 |
+------------+------+
4 rows selected (26.663 seconds)
- Query the whole map and each map element in the table:
> SELECT skills_score FROM employee;
+--------------------+
| skills_score |
+--------------------+
| {DB=80} |
| {Perl=85} |
| {Python=80} |
| {Sales=89, HR=94} |
+--------------------+
4 rows selected (32.659 seconds)
> SELECT
> name, skills_score['DB'] as DB, skills_score['Perl'] as Perl,
> skills_score['Python'] as Python,
> skills_score['Sales'] as Sales,
> skills_score['HR'] as HR
> FROM employee;
+----------+-----+-------+---------+--------+-----+
| name | db | perl | python | sales | hr |
+----------+-----+-------+---------+--------+-----+
| Michael | 80 | | | | |
| Will | | 85 | | | |
| Shelley | | | 80 | | |
| Lucy | | | | 89 | 94 |
+----------+-----+-------+---------+--------+-----+
4 rows selected (24.669 seconds)
- Query the composite type in the table:
> SELECT depart_title FROM employee;
+---------------------------------+
| depart_title |
+---------------------------------+
| {Product=[Developer, Lead]} |
| {Test=[Lead], Product=[Lead]} |
| {Test=[Lead], COE=[Architect]} |
| {Sales=[Lead]} |
+---------------------------------+
4 rows selected (30.583 seconds)
> SELECT
> name, depart_title['Product'] as Product, depart_title['Test']
as Test,
> depart_title['COE'] as COE, depart_title['Sales'] as Sales
> FROM employee;
+--------+--------------------+---------+-------------+------+
| name | product | test | coe |sales |
+--------+--------------------+---------+-------------+------+
| Michael| [Developer, Lead] | | | |
| Will | [Lead] | [Lead] | | |
| Shelley| | [Lead] | [Architect] | |
| Lucy | | | |[Lead]|
+--------+--------------------+---------+-------------+------+
4 rows selected (26.641 seconds)
> SELECT
> name, depart_title['Product'][0] as product_col0,
> depart_title['Test'][0] as test_col0
> FROM employee;
+----------+---------------+------------+
| name | product_col0 | test_col0 |
+----------+---------------+------------+
| Michael | Developer | |
| Will | Lead | Lead |
| Shelley | | Lead |
| Lucy | | |
+----------+---------------+------------+
4 rows selected (26.659 seconds)
- Row Delimiter: This can be used with Ctrl + A or ^A (use