One of the design considerations for any new framework is always compatibility with the old ones. For better or worse, most data analysts still work with SQL. The roots of SQL go back to an influential relational modeling paper (Codd, Edgar F. (June 1970). "A Relational Model of Data for Large Shared Data Banks". Communications of the ACM 13 (6): 377-387). All modern databases implement one or another version of SQL.
While the relational model was influential and important for bringing database performance, particularly for Online Transaction Processing (OLTP), to competitive levels, the significance of normalization is less critical for analytic workloads, where one needs to perform aggregations, and for situations where the relations themselves change and are subject to analysis. This section covers the extensions of the standard SQL language in the analysis engines traditionally used for big data analytics: Hive and Impala. Both of them are currently Apache-licensed projects. The following table summarizes the complex types:

Type        Syntax                                    Description
ARRAY       ARRAY<data_type>                          An ordered collection of elements of the same type
MAP         MAP<primitive_type, data_type>            A collection of key/value pairs
STRUCT      STRUCT<name : data_type, ...>             A record with named, typed fields
UNIONTYPE   UNIONTYPE<data_type, data_type, ...>      A value holding exactly one of the listed types at a time
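These types have natural Scala counterparts, a correspondence we will return to at the end of this section: ARRAY maps to Seq, MAP to Map, STRUCT to a case class, and a two-way UNIONTYPE to Either. As a sketch (the class and field names here are illustrative, not taken from Hive), the LabeledPoint data used later in this section could be modeled as:

// A sketch of the Scala counterparts of the Hive complex types above
case class LabeledPoint(
  label: Int,                                    // INT
  vector: Either[Seq[Double], Map[Int, Double]]  // UNIONTYPE<ARRAY<DOUBLE>, MAP<INT,DOUBLE>>
)

val sparse = LabeledPoint(0, Right(Map(1 -> 1.0)))      // the map alternative
val dense  = LabeledPoint(1, Left(Seq(0.0, 2.0, 0.0)))  // the array alternative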
While Hive/Impala tables can be created on top of many underlying file formats (Text, Sequence, ORC, Avro, Parquet, and even custom formats) and multiple serializations, in most practical instances Hive is used to read lines of text in ASCII files. The underlying serialization/deserialization (SerDe) format is LazySimpleSerDe. The format defines several levels of separators, as follows:
row_format
  : DELIMITED
      [FIELDS TERMINATED BY char [ESCAPED BY char]]
      [COLLECTION ITEMS TERMINATED BY char]
      [MAP KEYS TERMINATED BY char]
      [LINES TERMINATED BY char]
      [NULL DEFINED AS char]
The defaults for the separators are '\001' or ^A, '\002' or ^B, and '\003' or ^C, with deeper levels of nesting continuing through '\004' (^D) and so on. In other words, the format uses a new separator at each level of the hierarchy, as opposed to the definition/repetition indicators of the Dremel encoding. For example, to encode the LabeledPoint table that we used before, we need to create a file, as follows:
$ cat data
0^A1^B1^D1.0
2^A1^B1^D3.0
1^A0^B0.0^C2.0^C0.0
3^A0^B0.0^C4.0^C0.0
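To make the separator levels concrete, the following minimal Scala sketch (not part of the original workflow; the object and file names are arbitrary) writes exactly this file:

import java.io.PrintWriter

// The union tag (0 = ARRAY<DOUBLE>, 1 = MAP<INT,DOUBLE>) is separated from
// its value by ^B; array elements at that depth are separated by ^C; a map
// key and its value, one level deeper still, are separated by ^D.
object WriteLabeledPoints extends App {
  val (a, b, c, d) = ('\u0001', '\u0002', '\u0003', '\u0004')

  val rows = Seq(
    s"0${a}1${b}1${d}1.0",          // label 0, union tag 1: map {1 -> 1.0}
    s"2${a}1${b}1${d}3.0",          // label 2, union tag 1: map {1 -> 3.0}
    s"1${a}0${b}0.0${c}2.0${c}0.0", // label 1, union tag 0: array [0.0, 2.0, 0.0]
    s"3${a}0${b}0.0${c}4.0${c}0.0"  // label 3, union tag 0: array [0.0, 4.0, 0.0]
  )

  val out = new PrintWriter("data")
  try rows.foreach(out.println) finally out.close()
}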
Download Hive from http://archive.cloudera.com/cdh5/cdh/5/hive-1.1.0-cdh5.5.0.tar.gz and perform the following:
$ tar xf hive-1.1.0-cdh5.5.0.tar.gz
$ cd hive-1.1.0-cdh5.5.0
$ bin/hive
…
hive> CREATE TABLE LABELED_POINT (
    >   LABEL INT,
    >   VECTOR UNIONTYPE<ARRAY<DOUBLE>, MAP<INT,DOUBLE>>
    > ) STORED AS TEXTFILE;
OK
Time taken: 0.453 seconds
hive> LOAD DATA LOCAL INPATH './data' OVERWRITE INTO TABLE LABELED_POINT;
Loading data to table alexdb.labeled_point
Table labeled_point stats: [numFiles=1, numRows=0, totalSize=52, rawDataSize=0]
OK
Time taken: 0.808 seconds
hive> select * from labeled_point;
OK
0    {1:{1:1.0}}
2    {1:{1:3.0}}
1    {0:[0.0,2.0,0.0]}
3    {0:[0.0,4.0,0.0]}
Time taken: 0.569 seconds, Fetched: 4 row(s)
hive>
In Spark, selecting from a relational table is supported via the sqlContext.sql method; unfortunately, Hive union types are not directly supported as of Spark 1.6.1, although maps and arrays are. The lack of support for complex objects in other BI and data analysis tools remains the biggest obstacle to their adoption. Supporting everything as a rich data structure in Scala is one of the options for converging on a nested data representation, as the sketch below illustrates.
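The following is a minimal sketch of that approach against the Spark 1.6 API. Since UNIONTYPE columns cannot be read directly, it assumes a hypothetical labeled_point_flat table in which the union has been flattened into separate ARRAY<DOUBLE> and MAP<INT,DOUBLE> columns, at most one of them non-NULL per row, and rebuilds the union as an Either on the Scala side:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object ReadLabeledPoints extends App {
  val sc = new SparkContext(new SparkConf().setAppName("labeled-points"))
  val sqlContext = new HiveContext(sc)

  // Hive ARRAY<DOUBLE> arrives as Seq[Double] and MAP<INT,DOUBLE> as a
  // Scala Map[Int,Double]; the hypothetical table flattens the union into
  // two nullable columns.
  val df = sqlContext.sql(
    "SELECT label, dense_vector, sparse_vector FROM labeled_point_flat")

  val points = df.collect().map { row =>
    val vector: Either[Seq[Double], Map[Int, Double]] =
      if (!row.isNullAt(1)) Left(row.getSeq[Double](1))
      else Right(row.getMap[Int, Double](2).toMap)
    (row.getInt(0), vector)
  }

  points.foreach(println)
}

Once in this form, the Either-based LabeledPoint representation from the beginning of this section can be used uniformly, regardless of which engine produced the data.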