Introducing SQL, Data Sources, DataFrame, and Dataset APIs

Let's understand the four components of Spark SQL: SQL, the Data Sources API, the DataFrame API, and the Dataset API.

Spark SQL can read and write data to and from Hive tables using the SQL language. SQL can be used from the Java, Scala, Python, and R languages, over JDBC/ODBC, or through the command-line interface. When SQL is used within a programming language, the results are returned as DataFrames.
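For example, here is a minimal sketch of running SQL from a Scala program, assuming Spark 2.x's SparkSession entry point and a hypothetical Hive table named employees with name and salary columns:

```scala
import org.apache.spark.sql.SparkSession

// Create a SparkSession with Hive support so SQL can read and write Hive tables.
val spark = SparkSession.builder()
  .appName("SparkSQLExample")
  .enableHiveSupport()
  .getOrCreate()

// SQL issued from a program comes back as a DataFrame.
val employeesDF = spark.sql("SELECT name, salary FROM employees WHERE salary > 50000")
employeesDF.show()
```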

Advantages of SQL are:

  • Can work with Hive tables easily
  • Can connect BI tools to a distributed SQL engine through the Thrift Server and submit SQL or HiveQL queries over the JDBC or ODBC interfaces

The Data Sources API provides a single interface for reading and writing data using Spark SQL. In addition to the built-in sources that come prepackaged with the Apache Spark distribution, the Data Sources API lets external developers plug in custom data sources. All external data sources and other packages can be viewed at http://spark-packages.org/.
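A minimal sketch of loading and saving data through the Data Sources API looks like this (the file paths are hypothetical; spark is the SparkSession created earlier):

```scala
// Load JSON data through the Data Sources API.
val peopleDF = spark.read
  .format("json")
  .load("/data/people.json")

// Save the same data as Parquet through the same unified interface.
peopleDF.write
  .format("parquet")
  .mode("overwrite")
  .save("/data/people.parquet")
```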

Advantages of the Data Sources API are:

  • Easy loading and saving of DataFrames
  • Efficient data access through predicate pushdown to the data source, which reads less data from the source
  • Support for building libraries for any new data source
  • New data sources can be added without changing Spark's own code
  • Easy sharing of new data sources as Spark packages

The DataFrame API is designed to make big data analytics easier for a variety of users. This API is inspired by DataFrames in R and Python (pandas), but designed for the distributed processing of massive datasets to support modern big data analytics. DataFrames can be seen as an extension of the existing RDD API and are an abstraction over RDDs.
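For instance, here is a minimal sketch of the DataFrame DSL; the file path and the region and amount columns are hypothetical, and spark is the SparkSession created earlier:

```scala
import org.apache.spark.sql.functions._

// Load a CSV file; the schema is discovered automatically.
val salesDF = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/data/sales.csv")

// Select, filter, and aggregate using the DSL instead of SQL strings.
salesDF.select("region", "amount")
  .filter(col("amount") > 1000)
  .groupBy("region")
  .agg(sum("amount").alias("total_amount"))
  .show()
```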

Advantages of the DataFrame API are:

  • Easy to develop applications with a Domain Specific Language (DSL)
  • Higher performance than traditional RDDs, with similar performance across Scala, Java, Python, and R
  • Automatic schema and partition discovery for data sources
  • Supports a wide array of data sources
  • Optimization and code generation through the Catalyst optimizer
  • Interoperability with RDDs, Datasets, pandas, and external sources such as RDBMS databases, HBase, Cassandra, and so on

The Dataset API, introduced in Spark 1.6, combines the best of RDDs and DataFrames. Datasets use encoders to convert JVM objects into a tabular representation, which is stored using Spark's Tungsten binary format.
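A minimal sketch of the Dataset API is shown below; the Person case class is hypothetical, and spark is the SparkSession created earlier:

```scala
case class Person(name: String, age: Int)

// Import the implicit encoders for primitive types and case classes.
import spark.implicits._

// Build a Dataset of typed JVM objects; an encoder maps Person to Tungsten's binary format.
val peopleDS = Seq(Person("Alice", 29), Person("Bob", 35)).toDS()

// Transformations are typed and checked at compile time.
val adultNames = peopleDS.filter(p => p.age >= 18).map(p => p.name)
adultNames.show()
```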

Advantages of the Dataset API are:

  • Just like RDDs, Datasets are type-safe
  • Just like DataFrames, Datasets are faster than RDDs
  • Easy interoperability with DataFrames and RDDs
  • Cached Datasets take less space than cached RDDs
  • Serialization with encoders is faster than with the Java or Kryo serializers

The following table shows the differences between SQL, DataFrames, and Datasets in terms of compile-time and runtime safety:

|                 | SQL     | DataFrames   | Datasets     |
|-----------------|---------|--------------|--------------|
| Syntax errors   | Runtime | Compile time | Compile time |
| Analysis errors | Runtime | Runtime      | Compile time |
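To illustrate, here is a hedged sketch of where each kind of error surfaces, reusing the peopleDS Dataset from the earlier snippet; the misspelled names are deliberate:

```scala
// SQL: a malformed query or a misspelled column inside the SQL string is only
// detected when the query is executed.
// spark.sql("SELECT nam FROM people")        // fails at runtime

// DataFrames: a misspelled method such as peopleDF.selct("name") is a compile-time
// error, but a misspelled column name is only caught at runtime during analysis,
// because columns are referenced as untyped strings.
// peopleDF.select("nam")                     // compiles, fails at runtime

// Datasets: fields are typed JVM members, so peopleDS.map(p => p.nam) does not
// compile at all; both syntax and analysis errors surface at compile time.
val names = peopleDS.map(p => p.name)         // compiles and runs
names.show()
```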

Tip

The Dataset API and its subset, the DataFrame API, will become the mainstream APIs instead of the RDD API. Whenever possible, use Datasets or DataFrames instead of RDDs, because they provide much higher performance thanks to the optimizations performed by the Catalyst optimizer.
