Project data with SELECT

The most common use case for Hive is to query data in Hadoop. To achieve this, we need to write and execute a SELECT statement. The typical work done by the SELECT statement is to project the whole row (with SELECT *) or specified columns (with SELECT column1, column2, ...) from a table, with or without conditions.Most simple SELECT statements will not trigger a Yarn job. Instead, a dump task is created just for dumping the data, such as the hdfs dfs -cat command. The SELECT statement is quite often used with the FROM and DISTINCT keywords. A FROM keyword followed by a table is where SELECT projects data. The DISTINCT keyword used after SELECT ensures only unique rows or combination of columns are returned from the table. In addition, SELECT also supports columns combined with user-defined functions, IF(), or a CASE WHEN THEN ELSE END statement, and regular expressions. The following are examples of projecting data with a SELECT statement:

  1. Query the whole row or specific columns in the table:
      > SELECT * FROM employee; -- Project the whole row
> SELECT name FROM employee; -- Project specified columns
+----------+
| name |
+----------+
| Michael |
| Will |
| Shelley |
| Lucy |
+----------+
4 rows selected (0.452 seconds)


-- List all columns match java regular expression
> SET hive.support.quoted.identifiers = none;
-- Enable this
> SELECT `^work.*` FROM employee; -- All columns start with work
+------------------------+
| employee.work_place |
+------------------------+
| ["Montreal","Toronto"] |
| ["Montreal"] |
| ["New York"] |
| ["Vancouver"] |
+------------------------+
4 rows selected (0.141 seconds)
  1. Select distinct columns listed from a table:
      > SELECT DISTINCT name, work_place FROM employee;
+---------+------------------------+
| name | work_place |
+---------+------------------------+
| Lucy | ["Vancouver"] |
| Michael | ["Montreal","Toronto"] |
| Shelley | ["New York"] |
| Will | ["Montreal"] |
+---------+------------------------+
4 rows selected (35.962 seconds)
  1. Select columns with IF or CASE WHEN functions:
      > SELECT 
> CASE WHEN gender_age.gender = 'Female' THEN 'Ms.'
> ELSE 'Mr.' END as title,
> name,
> IF(array_contains(work_place, 'New York'), 'US', 'CA') as
country

> FROM employee;
+-------+---------+---------+
| title | name | country |
+-------+---------+---------+
| Mr. | Michael | CA |
| Mr. | Will | CA |
| Ms. | Shelley | US |
| Ms. | Lucy | CA |
+-------+---------+---------+
4 rows selected (0.585 seconds)

Multiple SELECT statements can work together to build a complex query using nested queries or CTE. A nested query, which is also called a subquery, is a query projecting data from the result of another query. Nested queries can be rewritten using CTE (mentioned in Chapter 3Data Definition and Description) with the WITH and AS keywords. When using nested queries, an alias should be given for the inner query (see t1 in the following example), or else Hive will report exceptions. The following are a few examples of using nested queries in HQL:

  1. A nested query example with the mandatory alias:
      > SELECT 
> name, gender_age.gender as gender
> FROM (
> SELECT * FROM employee WHERE gender_age.gender = 'Male'
> ) t1; -- t1 here is mandatory
+----------+----------+
| name | gender |
+----------+----------+
| Michael | Male |
| Will | Male |
+----------+----------+
2 rows selected (48.198 seconds)
  1. A nested query can be rewritten with CTE as follows. This is the recommended way of writing a complex single HQL query:
      > WITH t1 as (
> SELECT * FROM employee WHERE gender_age.gender = 'Male'
> )
> SELECT name, gender_age.gender as gender
> FROM t1;
+----------+----------+
| name | gender |
+----------+----------+
| Michael | Male |
| Will | Male |
+----------+----------+
2 rows selected (38.706 seconds)

In addition, a special SELECT followed by a constant expression can work without the FROM table clause. It returns the result of the expression. This is equivalent to querying a dummy table with one dummy record:

> SELECT concat('1','+','3','=',cast((1 + 3) as string)) as res;
+-------+
| res |
+-------+
| 1+3=4 |
+-------+
1 row selected (0.109 seconds)
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.221.163.13