Hive

We now move on from file storage formats to data processing and retrieval components. Apache Hive is a project in the Hadoop ecosystem that fits within the analysis interaction area. It allows you to write queries in a SQL-like language, called Hive Query Language (HiveQL), which it translates into processing jobs executed over data in HDFS. HiveQL is very similar to SQL and will be instantly familiar to any SQL developer.

Hive architecture consists of a User Interface (UI), a metastore database, a driver, a compiler, and an execution engine. These components work together to translate a user query into a Directed Acyclic Graph (DAG) of actions that orchestrate execution and return results utilizing Hadoop MapReduce (usually) and HDFS.

Hive architecture. Source: Apache Hive project documentation

The UI allows you to submit queries. It also supports other actions, such as commands to create Hive tables. You will most probably not use it directly. As an IoT analyst, you are more likely to interact with Hive through a web-based interface, such as Hue, or through another application that uses a Hive ODBC driver, such as a programming IDE or visualization software like Tableau.

The metastore database holds information about Hive tables. Hive tables are very similar to relational database tables, where the structure is defined ahead of time. Hive tables can be managed by Hive in HDFS, or they can be defined to point to data files not managed by Hive (called external tables). External tables do not have to reside in HDFS. Amazon S3 is a common example: structured files stored there remain queryable through Hive as long as a Hive table definition has been created for them.
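As a sketch of an external table definition, the following points Hive at Parquet files kept outside of Hive's control; the table name and the S3 bucket path are hypothetical, and the URI scheme (s3a:// here) depends on how your cluster is configured:

CREATE EXTERNAL TABLE iotsensor_raw (
    sensorid BIGINT,
    temperature DOUBLE,
    timeutc STRING
)
STORED AS PARQUET
LOCATION 's3a://my-iot-bucket/sensor-data/';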

The metastore database can be housed inside the cluster or remotely. If you are creating transient Hadoop clusters for temporary use, such as through Microsoft Azure HDInsight, a remote database can be useful because the metadata persists beyond the life of the cluster. This allows it to be reused by future incarnations of the cluster. This is more common when Hadoop is being used for batch processing jobs or for a temporary use case, such as experimental analytics jobs. Hive metastores can also be used by other Hadoop ecosystem projects, such as Spark and Impala.

The driver serves as a coordinator. It creates a unique session ID for the query, sends it to the compiler for a query plan, and then sends it to the execution engine for processing. Finally, it coordinates the fetching of result records.

The compiler communicates with the metastore to get information on the tables in the query. It checks type and develops an execution plan. The plan takes the form of a DAG of steps, which can either be a map or reduce job, a metadata operation, or an HDFS operation. The execution plan is passed back to the driver.
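If you want to see the plan the compiler produces for a given query, HiveQL provides the EXPLAIN keyword. The following is a minimal sketch against the iotsensor example table created later in this section; the exact plan output depends on your Hive version and execution engine:

EXPLAIN
SELECT sensorid, AVG(temperature) AS avgtemp
FROM iotsensor
GROUP BY sensorid;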

The execution engine submits actions to the appropriate Hadoop components based on the execution plan. During the execution steps, data is stored in temporary HDFS files to be used by later steps in the plan. The final results are retrieved by the execution engine when the driver sends a fetch request.

Here is an example table creation statement:

CREATE TABLE iotsensor (
    sensorid BIGINT,
    temperature DOUBLE,
    timeutc STRING
)
STORED AS PARQUET;

Hive tables can support partitions and other performance-enhancing options. Query statements are very similar to SQL and, for most simple queries, indistinguishable from it. This is an example query statement:

SELECT stddev_samp(temperature) FROM iotsensor WHERE TO_DATE(timeutc) > '2016-01-01';
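As a sketch of the partitioning option mentioned above, the following variation of the earlier table is partitioned by a date column, so queries that filter on it read only the matching partitions; the table name and the readdate partition column are hypothetical:

CREATE TABLE iotsensor_partitioned (
    sensorid BIGINT,
    temperature DOUBLE,
    timeutc STRING
)
PARTITIONED BY (readdate STRING)
STORED AS PARQUET;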

Here are some tips for using Hive for IoT analytics:

  1. Avoid too many table joins: Keep fewer than five tables joined in a single query. If this is not possible, break the query up into multiple successive queries that build intermediate pieces into temp tables (see the sketch after this list).
  2. Experiment with Hive using execution engines other than Hadoop MapReduce: Hive on Tez and Hive on Spark are two newer implementation projects that promise faster, more efficient query processing by minimizing disk I/O and leveraging more in-memory processing.
  3. Store data in Parquet files: The columnar storage better supports the types of query commonly run for analytics. File compression can reduce disk space requirements for IoT data.
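The following is a minimal sketch of the first tip: an aggregation is staged into an intermediate table, and only the much smaller result is then joined to another table. The tmp_dailytemps and sensormeta table names and their columns are hypothetical:

-- stage the heavy aggregation into an intermediate table first
CREATE TABLE tmp_dailytemps AS
SELECT sensorid, TO_DATE(timeutc) AS readdate, AVG(temperature) AS avgtemp
FROM iotsensor
GROUP BY sensorid, TO_DATE(timeutc);

-- then join the smaller intermediate result to the remaining tables
SELECT t.readdate, s.location, t.avgtemp
FROM tmp_dailytemps t
JOIN sensormeta s ON t.sensorid = s.sensorid;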