ANALYZE statement

Hive statistics are a collection of data that describes more details, such as the number of rows, number of files, and raw data size of the objects in the database. Statistics are the metadata of data, collected and stored in the metastore database. Hive supports statistics at the table, partition, and column level. These statistics serve as an input to the Hive Cost-Based Optimizer (CBO), which is an optimizer used to pick the query plan with the lowest cost in terms of system resources required to complete the query. The statistics are partially gathered automatically in Hive v3.2.0 through to JIRA HIVE-11160 (https://issues.apache.org/jira/browse/HIVE-11160) or manually through the ANALYZE statement on tables, partitions, and columns, as in the following examples:

  1. Collect statistics on the existing table. When the NOSCAN option is specified, the command runs faster by ignoring file scanning but only collecting the number of files and their size:
      > ANALYZE TABLE employee COMPUTE STATISTICS;
      No rows affected (27.979 seconds)

> ANALYZE TABLE employee COMPUTE STATISTICS NOSCAN; No rows affected (25.979 seconds)
  1. Collect statistics on specific or all existing partitions:
      -- Applies for specific partition
> ANALYZE TABLE employee_partitioned > PARTITION(year=2018, month=12) COMPUTE STATISTICS; No rows affected (45.054 seconds)


-- Applies for all partitions
> ANALYZE TABLE employee_partitioned
> PARTITION(year, month) COMPUTE STATISTICS; No rows affected (45.054 seconds)
  1. Collect statistics on columns for existing tables:
      > ANALYZE TABLE employee_id COMPUTE STATISTICS FOR COLUMNS 
employee_id;
No rows affected (41.074 seconds)
We can enable automatic gathering of statistics by specifying SET hive.stats.autogather=true. For new tables or partitions that are populated through the INSERT OVERWRITE/INTO statement (rather than the LOAD statement), statistics are automatically collected in the metastore

Once the statistics are built and collected, we can check the statistics with the DESCRIBE EXTENDED/FORMATTED statement. From the table/partition output, we can find the statistical information inside the parameters, such as parameters:{numFiles=1, COLUMN_STATS_ACCURATE=true, transient_lastDdlTime=1417726247, numRows=4, totalSize=227, rawDataSize=223}). The following is an example of checking statistics in a table:

-- Check statistics in a table
> DESCRIBE EXTENDED employee_partitioned PARTITION(year=2018, month=12);

-- Check statistics in a partition
> DESCRIBE EXTENDED employee; ... parameters:{numFiles=1, COLUMN_STATS_ACCURATE=true, transient_lastDdlTime=1417726247, numRows=4, totalSize=227, rawDataSize=223}).
-- Check statistics in a column
> DESCRIBE FORMATTED employee.name; +--------+---------+---+---+---------+--------------+ |col_name|data_type|min|max|num_nulls|distinct_count| ... +--------+---------+---+---+---------+--------------+ | name | string | | | 0 | 5 | ... +--------+---------+---+---+---------+--------------+
+-----------+-----------+ |avg_col_len|max_col_len| ... +-----------+-----------+ | 5.6 | 7 | ... +-----------+-----------+
3 rows selected (0.116 seconds)
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.189.188.121