INNER JOIN

INNER JOIN or JOIN returns rows meeting the join conditions from both sides of joined tables. The JOIN keyword can also be omitted by comma-separated table names; this is called an implicit join. Here are examples of the HQL JOIN operation:

  1. First, prepare a table to join with and load data to it:
      > CREATE TABLE IF NOT EXISTS employee_hr (
> name string,
> employee_id int,
> sin_number string,
> start_date date
> )
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '|';
No rows affected (1.732 seconds)


> LOAD DATA INPATH '/tmp/hivedemo/data/employee_hr.txt'
> OVERWRITE INTO TABLE employee_hr;
No rows affected (0.635 seconds)
  1. Perform an INNER JOIN between two tables with equal and unequal join conditions, along with complex expressions as well as a post join WHERE condition. Usually, we need to add a table name or table alias before columns in the join condition, although Hive always tries to resolve them:
      > SELECT 
> emp.name, emph.sin_number
> FROM employee emp
> JOIN employee_hr emph ON emp.name = emph.name; -- Equal Join
+-----------+------------------+
| emp.name | emph.sin_number |
+-----------+------------------+
| Michael | 547-968-091 |
| Will | 527-948-090 |
| Lucy | 577-928-094 |
+-----------+------------------+
3 rows selected (71.083 seconds)


> SELECT
> emp.name, emph.sin_number
> FROM employee emp -- Unequal join supported since v2.2.0
returns more rows

> JOIN employee_hr emph ON emp.name != emph.name;
+----------+-----------------+
| emp.name | emph.sin_number |
+----------+-----------------+
| Michael | 527-948-090 |
| Michael | 647-968-598 |
| Michael | 577-928-094 |
| Will | 547-968-091 |
| Will | 647-968-598 |
| Will | 577-928-094 |
| Shelley | 547-968-091 |
| Shelley | 527-948-090 |
| Shelley | 647-968-598 |
| Shelley | 577-928-094 |
| Lucy | 547-968-091 |
| Lucy | 527-948-090 |
| Lucy | 647-968-598 |
+----------+-----------------+
13 rows selected (24.341 seconds)

-- Join with complex expression in join condition
-- This is also the way to implement conditional join
-- Below, conditional ignore row with name = 'Will'
> SELECT

> emp.name, emph.sin_number
> FROM employee emp
> JOIN employee_hr emph ON
> IF(emp.name = 'Will', '1', emp.name) =
> CASE WHEN emph.name = 'Will' THEN '0' ELSE emph.name END;
+----------+-----------------+

| emp.name | emph.sin_number |
+----------+-----------------+
| Michael | 547-968-091 |
| Lucy | 577-928-094 |
+----------+-----------------+
2 rows selected (27.191 seconds)


-- Use where/limit to limit the output of join
> SELECT
> emp.name, emph.sin_number
> FROM employee emp
> JOIN employee_hr emph ON emp.name = emph.name
> WHERE
> emp.name = 'Will';
+----------+-----------------+
| emp.name | emph.sin_number |
+----------+-----------------+
| Will | 527-948-090 |
+----------+-----------------+
1 row selected (26.811 seconds)
  1. The JOIN operation can be performed on more tables (such as table A, B, and C) with sequence joins. The tables can either join from A to B and B to C, or join from A to B and A to C:
      > SELECT 
> emp.name, empi.employee_id, emph.sin_number
> FROM employee emp
> JOIN employee_hr emph ON emp.name = emph.name
> JOIN employee_id empi ON emp.name = empi.name;
+-----------+-------------------+------------------+
| emp.name | empi.employee_id | emph.sin_number |
+-----------+-------------------+------------------+
| Michael | 100 | 547-968-091 |
| Will | 101 | 527-948-090 |
| Lucy | 103 | 577-928-094 |
+-----------+-------------------+------------------+
3 rows selected (67.933 seconds)
  1. Self-join is where one table joins itself. When doing such joins, a different alias should be given to distinguish the same table:
      > SELECT 
> emp.name -- Use alias before column name
> FROM employee emp
> JOIN employee emp_b -- Here, use a different alias
> ON emp.name = emp_b.name;
+-----------+
| emp.name |
+-----------+
| Michael |
| Will |
| Shelley |
| Lucy |
+-----------+
4 rows selected (59.891 seconds)
  1. Perform an implicit join without using the JOIN keyword. This is only applicable to the INNER JOIN:
      > SELECT 
> emp.name, emph.sin_number
> FROM
> employee emp, employee_hr emph -- Only applies for inner join
> WHERE
> emp.name = emph.name;
+-----------+------------------+
| emp.name | emph.sin_number |
+-----------+------------------+
| Michael | 547-968-091 |
| Will | 527-948-090 |
| Lucy | 577-928-094 |
+-----------+------------------+
3 rows selected (47.241 seconds)
  1. The join condition uses different columns, which will create an additional job:
      > SELECT 
> emp.name, empi.employee_id, emph.sin_number
> FROM employee emp
> JOIN employee_hr emph ON emp.name = emph.name
> JOIN employee_id empi ON emph.employee_id = empi.employee_id;
+-----------+-------------------+------------------+
| emp.name | empi.employee_id | emph.sin_number |
+-----------+-------------------+------------------+
| Michael | 100 | 547-968-091 |
| Will | 101 | 527-948-090 |
| Lucy | 103 | 577-928-094 |
+-----------+-------------------+------------------+
3 rows selected (49.785 seconds)
If JOIN uses different columns in its conditions, it will request an additional job to complete the join. If the JOIN operation uses the same column in the join conditions, it will join on this condition using one job.

When JOIN is performed between multiple tables, Yarn/MapReduce jobs are created to process the data in the HDFS. Each of the jobs is called a stage. Usually, it is suggested to put the big table right at the end of the JOIN statement for better performance and to avoid Out Of Memory (OOM) exceptions. This is because the last table in the JOIN sequence is usually streamed through reducers where as the others are buffered in the reducer by default. Also, a hint, /*+STREAMTABLE (table_name)*/, can be specified to advise which table should be streamed over the default decision, as in the following example:

> SELECT /*+ STREAMTABLE(employee_hr) */
> emp.name, empi.employee_id, emph.sin_number
> FROM employee emp
> JOIN employee_hr emph ON emp.name = emph.name
> JOIN employee_id empi ON emph.employee_id = empi.employee_id;
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.138.34.226