Data sorting

Another aspect of manipulating data is properly sorting it in order to clearly identify important facts, such as top the N values, maximum, minimum, and so on. HQL supports the following keywords for data sorting:

ORDER BY [ASC|DESC]: It is similar to the SQL ORDER BY statement. When using ORDER BY, a sorted order is maintained across all of the output from every reducer. It performs a global sort using only one reducer, so it takes longer to return the result. The direction specifier after ORDER BY can be either ASC for ascending (low to high) or DESC for descending (high to low). If you do not provide a direction specifier, the default of ascending is used. Since v2.1.0, the ORDER BY statement supports specifying the sorting direction for the NULL value, such as NULL FIRST or NULL LAST. By default, NULL stays at the first place in the ASC direction and the last place in the DESC direction:

      > SELECT name FROM employee ORDER BY name DESC; -- By columns
      +----------+
      |   name   |
      +----------+
      | Will     |
      | Shelley  |
      | Michael  |
      | Lucy     |
      +----------+
      4 rows selected (24.057 seconds)
      

      > SELECT name 
      > FROM employee  -- Order by expression
      > ORDER BY CASE WHEN name = 'Will' THEN 0 ELSE 1 END DESC;
      +----------+
      |   name   |
      +----------+
      | Lucy     |
      | Shelley  |
      | Michael  |
      | Will     |
      +----------+
      4 rows selected (25.057 seconds)
      

      > SELECT * FROM emp_simple ORDER BY work_place NULL LAST;
      +---------+------------+
      | name    | work_place |
      +---------+------------+
      | Lucy    | Montreal   |
      | Michael | Toronto    |
      | Will    | NULL       | -- NULL stays at the last
      +---------+------------+
      3 rows selected (0.263 seconds)

Using LIMIT with ORDER BY is strongly recommended. When the hive.mapred.mode = strict property is set (the default value for hive.mapred.mode is nonstrict in Hive v1.* and strict in Hive v2.* ), it throws exceptions when using ORDER BY without LIMIT.

SORT BY [ASC|DESC]: It specifies which columns to use to sort reducer input records. This means the sorting is completed before sending data to the reducer. The SORT BY statement does not perform a global sort (but ORDER BY does) and only ensures data is locally sorted in each reducer. If SORT BY sorts with only one reducer (set mapred.reduce.tasks=1), it is equal to ORDER BY, as the following example shows. Most of the time, SORT BY itself is useless but is used with DISTRIBUTE BY, which is introduced next:

      > SET mapred.reduce.tasks = 2; -- Sort by with more than 1 reducer
      No rows affected (0.001 seconds)
      

      > SELECT name FROM employee SORT BY name DESC;
      +---------+
      |   name  |
      +---------+
      | Shelley | -- Once result is collected to client, it is 
      | Michael |   order-less
      | Lucy    |
      | Will    |
      +---------+
      4 rows selected (54.386 seconds)
      
      > SET mapred.reduce.tasks = 1; -- Sort by one reducer
      No rows affected (0.002 seconds)
      

      > SELECT name FROM employee SORT BY name DESC;
      +----------+
      |   name   |
      +----------+
      | Will     | -- Same result to ORDER BY
      | Shelley  |
      | Michael  |
      | Lucy     |
      +----------+
      4 rows selected (46.03 seconds)

DISTRIBUTE BY: It is very similar to GROUP BY (introduced in Chapter 6, Data Aggregation and Sampling) when the mapper decides to which reducer it can deliver the output. Compared to GROUP BY, DISTRIBUTE BY will not work on data aggregations, such as count(*), but only directs where data goes. In this case, DISTRIBUTE BY is quite often used to reorganize data in files by specified columns. For example, we may need to use DISTRIBUTE BY after a UNION result set to reorganize data in higher granularity. When used with SORT BY to sort data within specified groups, DISTRIBUTE BY can be used before SORT BY in one query. In addition, the columns after DISTRIBUTE BY must appear in the select column list as follows:

      -- Error when not specify distributed column employee_id in 
      select
      > SELECT name FROM employee_hr DISTRIBUTE BY employee_id; 
      Error: Error while compiling statement: FAILED: SemanticException 
      [Error 10004]: Line 1:44 Invalid table alias or column reference 
      'employee_id': (possible column names are: name) 
      

      > SELECT name, employee_id FROM employee_hr DISTRIBUTE BY 
      employee_id; 
      +----------+--------------+
      |   name   | employee_id  |
      +----------+--------------+
      | Lucy     | 103          |
      | Steven   | 102          |
      | Will     | 101          |
      | Michael  | 100          |
      +----------+--------------+
      4 rows selected (38.92 seconds)
      

      -- Used with SORT BY to order name started on the same day
      > SELECT name, start_date
      > FROM employee_hr
      > DISTRIBUTE BY start_date SORT BY name;
      +----------+--------------+
      |   name   |  start_date  |
      +----------+--------------+
      | Lucy     | 2010-01-03   |
      | Michael  | 2014-01-29   |
      | Steven   | 2012-11-03   |
      | Will     | 2013-10-02   |
      +----------+--------------+
      4 rows selected (38.01 seconds)

CLUSTER BY: It is a shortcut operator you can use to perform DISTRIBUTE BY and SORT BY operations on the same group of columns. The CLUSTER BY statement does not allow you to specify ASC or DESC yet. Compared to ORDER BY, which is globally sorted, the CLUSTER BY statement sorts data in each distributed group:

      > SELECT name, employee_id FROM employee_hr CLUSTER BY name;
      +----------+--------------+
      |   name   | employee_id  |
      +----------+--------------+
      | Lucy     | 103          |
      | Michael  | 100          |
      | Steven   | 102          |
      | Will     | 101          |
      +----------+--------------+
      4 rows selected (39.791 seconds)

When we have to do a global sort, we can do CLUSTER BY first and then ORDER BY. In this way, we can fully utilize all the available reducers ahead of ORDER BY and have better performance, for example: SELECT * FROM (SELECT * FROM employee CLUSTER BY name) base ORDER BY name;.

In summary, the difference between these sorting keywords is shown in the following diagram:

HQL sorting keywords difference

Table of Contents for Data sorting

Create new playlist

Sign In

Sign Up

Table of Contents for
Data sorting