Spark SQL

Spark SQL is a component on top of Spark Core that provides support for structured and semi-structured data. It introduced a data abstraction originally called SchemaRDD, which was renamed DataFrame in Spark 1.3. Spark SQL provides functions for manipulating large sets of distributed, structured data using the subset of SQL that Spark supports, as well as HiveQL. Through the DataFrame and Dataset APIs, and the execution improvements made as part of the Tungsten initiative, it handles structured data more simply and with better performance than hand-written RDD code. Spark SQL also supports reading and writing data to and from various structured formats and data sources, including plain files, Parquet, ORC, relational databases, Hive, HDFS, and S3. A query optimization framework called Catalyst optimizes operations before they run, so Spark SQL can be several times faster than equivalent RDD code. Spark SQL also includes a Thrift server, which external systems can use to query data through Spark SQL over the classic JDBC and ODBC protocols.
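As a rough illustration of the features described above, the following minimal sketch builds a DataFrame, queries it with SQL, prints the plans Catalyst produces, and round-trips the result through Parquet. It assumes Spark 2.0 or later running in local mode; the Person case class, the people view name, the sample values, and the temporary output path are all made up for this example.

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlSketch {
  // Hypothetical record type, used only for this illustration.
  case class Person(name: String, age: Int)

  def main(args: Array[String]): Unit = {
    // Local session for experimentation (assumption: Spark 2.0+ on the classpath).
    val spark = SparkSession.builder()
      .appName("spark-sql-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Build a small DataFrame from made-up in-memory data.
    val people = Seq(Person("Ann", 34), Person("Bob", 19), Person("Cho", 52)).toDF()

    // Register a temporary view so the data can be queried with SQL.
    people.createOrReplaceTempView("people")
    val adults = spark.sql("SELECT name, age FROM people WHERE age >= 21 ORDER BY age")

    // Print the logical and physical plans that Catalyst produced for the query.
    adults.explain(true)
    adults.show()

    // Write the result as Parquet to a temporary directory and read it back.
    val path = java.nio.file.Files
      .createTempDirectory("spark-sql-sketch")
      .resolve("adults.parquet")
      .toString
    adults.write.mode("overwrite").parquet(path)
    spark.read.parquet(path).show()

    spark.stop()
  }
}
```

The same query could also be written with DataFrame methods such as filter and orderBy. Either form goes through Catalyst, so the choice between SQL strings and the DataFrame API is mostly a matter of style.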

We cover Spark SQL in detail in Chapter 8, Introduce a Little Structure - Spark SQL.
