Summary

Spark offers three APIs for working with distributed collections of data: the RDD API, the DataFrame API, and the newer Dataset API. The RDD API provides type safety and powerful lambda functions, but Spark does not optimize its operations. The DataFrame and Dataset APIs provide a domain-specific language that is easier to work with and deliver better performance than RDDs. The Dataset API combines the strengths of both RDDs and DataFrames. Users can choose RDDs, DataFrames, or Datasets depending on their needs, but in general DataFrames or Datasets are preferred over RDDs for better performance. Under the hood, Spark SQL uses the Catalyst optimizer to optimize queries.
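To make the contrast between the RDD and DataFrame APIs concrete, here is a minimal PySpark sketch that computes the same per-department total both ways. It assumes Spark 2.x, where `SparkSession` is the entry point. The function names, the `dept` and `salary` columns, and the sample rows are illustrative assumptions, not taken from this chapter.

```python
def totals_with_rdd(sc, rows):
    """Sum salaries per department with the RDD API.

    The lambda is opaque to Spark, so it is executed as written.
    """
    pairs = sc.parallelize(rows)  # RDD of (dept, salary) tuples
    totals = pairs.reduceByKey(lambda a, b: a + b)
    return dict(totals.collect())


def totals_with_dataframe(spark, rows):
    """Sum salaries per department with the DataFrame API.

    The query is declarative, so Catalyst can plan and optimize it.
    """
    df = spark.createDataFrame(rows, ["dept", "salary"])
    result = df.groupBy("dept").sum("salary")  # result column is "sum(salary)"
    return {row["dept"]: row["sum(salary)"] for row in result.collect()}


def main():
    # Imported here so the functions above can be read without PySpark installed.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")
             .appName("rdd-vs-dataframe-sketch")  # hypothetical app name
             .getOrCreate())
    rows = [("eng", 100), ("eng", 200), ("ops", 50)]  # made-up sample data
    print(totals_with_rdd(spark.sparkContext, rows))
    print(totals_with_dataframe(spark, rows))
    spark.stop()
```

Calling `main()` on a machine with PySpark installed runs both versions against a local session. Both functions should return the same totals; the difference lies in how much of the work Spark can see and optimize.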

The Dataset and DataFrame APIs provide optimization, speed, automatic schema discovery, support for multiple sources and multiple languages, and predicate pushdown; moreover, they are interoperable with RDDs. The Dataset API was introduced in Spark 1.6 and is available in Scala and Java. Spark 2.0 unified the Dataset and DataFrame APIs into a single abstraction, so in Spark 2.0 a DataFrame is equivalent to Dataset[Row].
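Predicate pushdown and RDD interoperability can be seen together in a short PySpark sketch. It assumes a Parquet file with `name` and `age` columns (hypothetical, as is the function name). The plan printed by `explain()` varies by Spark version and source, so treat the comment about pushed filters as what one would typically expect rather than a guarantee.

```python
def adults_as_pairs(spark, parquet_path, min_age=18):
    """Filter a Parquet source, print the plan, and hand the rows to the RDD API."""
    people = spark.read.parquet(parquet_path)

    # A declarative filter gives Spark the chance to push the predicate
    # down to the Parquet reader instead of filtering after a full scan.
    adults = people.filter(f"age >= {int(min_age)}").select("name", "age")

    # Prints the physical plan; for Parquet scans it typically lists the
    # pushed-down filters alongside the file scan.
    adults.explain()

    # Interoperability: every DataFrame exposes its rows as an RDD.
    return adults.rdd.map(lambda row: (row["name"], row["age"]))
```

The `spark` argument is a `SparkSession`, for example one obtained from `SparkSession.builder.getOrCreate()`. The returned value is an ordinary RDD of `(name, age)` tuples, so any RDD transformation can follow.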

The Data Sources API provides an easy way to read from and write to a variety of sources, including built-in sources such as Parquet, ORC, JSON, CSV, and JDBC, and external sources such as Avro, XML, HBase, Cassandra, and so on.
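As a rough illustration of the Data Sources API, the following PySpark sketch reads one built-in format and saves it as another, then loads a table over JDBC. The function names, paths, and connection details are placeholders chosen for this example, and it assumes a Spark 2.x `SparkSession` is passed in.

```python
def json_to_parquet(spark, json_path, parquet_path):
    """Read a JSON source and save it as Parquet through the Data Sources API."""
    df = spark.read.format("json").load(json_path)  # schema is inferred from the data
    df.write.format("parquet").mode("overwrite").save(parquet_path)
    # Read the result back to confirm it is usable as a source again.
    return spark.read.format("parquet").load(parquet_path)


def read_jdbc_table(spark, url, table, user, password):
    """Load one table from a JDBC source.

    The matching JDBC driver JAR must already be on Spark's classpath.
    """
    return (spark.read.format("jdbc")
            .option("url", url)
            .option("dbtable", table)
            .option("user", user)
            .option("password", password)
            .load())
```

The same `format(...)`, `load(...)`, and `save(...)` pattern applies to external sources: once the corresponding package is available to Spark, only the format name and its options change.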

Spark SQL can be used as a distributed SQL query engine through the Thrift server, which provides JDBC access to Spark SQL. Once the Thrift server is started, it can be accessed from any SQL client, such as Beeline, or from any business intelligence tool, such as Tableau or QlikView.
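The shape of a client connection can be sketched with a small Python helper that builds the JDBC URL and the matching Beeline command line. It assumes the Thrift server was started with its default settings (for example via `sbin/start-thriftserver.sh`), so that it listens on `localhost:10000`. The helper names and the `default` database are assumptions made for this example.

```python
import shlex


def thrift_jdbc_url(host="localhost", port=10000, database="default"):
    """Build the HiveServer2-style JDBC URL used to reach the Spark Thrift server."""
    return f"jdbc:hive2://{host}:{port}/{database}"


def beeline_command(query, host="localhost", port=10000, database="default"):
    """Return the argument list that runs a single query through Beeline."""
    return ["beeline", "-u", thrift_jdbc_url(host, port, database), "-e", query]


# Print a shell-ready command line; pass the list to subprocess.run to execute it.
print(shlex.join(beeline_command("SHOW TABLES")))
```

A BI tool is pointed at the same JDBC URL through its Hive or Spark SQL connector, so only the host, port, and database need to match the running server.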

The Hive-on-Spark project was created so that existing Hive users can run their Hive queries on Spark. As a result, existing HiveQL scripts can run on the Spark execution engine instead of the MapReduce engine.

The next chapter introduces real-time analytics with Spark Streaming.
