Introducing SQL, Data Sources, DataFrame, and Dataset APIs

Let's understand the four components of Spark SQL: SQL, the Data Sources API, the DataFrame API, and the Dataset API.

Spark SQL can read and write data to and from Hive tables using the SQL language. SQL can be used from the Java, Scala, Python, and R languages, over JDBC/ODBC, or through the command-line interface. When SQL is used within a programming language, the results are returned as DataFrames.
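For example, here is a minimal sketch of running SQL from a Scala program, assuming Spark 2.x's SparkSession entry point and a hypothetical Hive table named employees with name and salary columns:

```scala
import org.apache.spark.sql.SparkSession

// Create a SparkSession with Hive support so SQL can read and write Hive tables.
val spark = SparkSession.builder()
  .appName("SparkSQLExample")
  .enableHiveSupport()
  .getOrCreate()

// SQL issued from a program comes back as a DataFrame.
val employeesDF = spark.sql("SELECT name, salary FROM employees WHERE salary > 50000")
employeesDF.show()
```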

Advantages of SQL are:

  • Can work with Hive tables easily
  • Can connect BI tools to a distributed SQL engine through the Thrift Server and submit SQL or HiveQL queries over the JDBC or ODBC interfaces

The Data Sources API provides a single interface for reading and writing data using Spark SQL. In addition to the built-in sources that come prepackaged with the Apache Spark distribution, the Data Sources API lets external developers plug in custom data sources. All external data sources and other packages can be viewed at http://spark-packages.org/.
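A minimal sketch of loading and saving data through the Data Sources API looks like this (the file paths are hypothetical; spark is the SparkSession created earlier):

```scala
// Load JSON data through the Data Sources API.
val peopleDF = spark.read
  .format("json")
  .load("/data/people.json")

// Save the same data as Parquet through the same unified interface.
peopleDF.write
  .format("parquet")
  .mode("overwrite")
  .save("/data/people.parquet")
```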

Advantages of the Data Sources API are:

  • Easy loading and saving of DataFrames
  • Efficient data access through predicate pushdown to the data source, which reads less data from the source
  • Support for building libraries for any new data source
  • New data sources can be added without changing Spark's own code
  • Easy sharing of new data sources as Spark packages

The DataFrame API is designed to make big data analytics easier for a variety of users. This API is inspired by DataFrames in R and Python (pandas), but designed for the distributed processing of massive datasets to support modern big data analytics. DataFrames can be seen as an extension of the existing RDD API and are an abstraction over RDDs.
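For instance, here is a minimal sketch of the DataFrame DSL; the file path and the region and amount columns are hypothetical, and spark is the SparkSession created earlier:

```scala
import org.apache.spark.sql.functions._

// Load a CSV file; the schema is discovered automatically.
val salesDF = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/data/sales.csv")

// Select, filter, and aggregate using the DSL instead of SQL strings.
salesDF.select("region", "amount")
  .filter(col("amount") > 1000)
  .groupBy("region")
  .agg(sum("amount").alias("total_amount"))
  .show()
```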

Advantages of the DataFrame API are:

  • Easy to develop applications with a Domain Specific Language (DSL)
  • Higher performance than traditional RDDs, with similar performance across Scala, Java, Python, and R
  • Automatic schema and partition discovery for data sources
  • Supports a wide array of data sources
  • Optimization and code generation through the Catalyst optimizer
  • Interoperability with RDDs, Datasets, pandas, and external sources such as RDBMS databases, HBase, Cassandra, and so on

The Dataset API, introduced in Spark 1.6, combines the best of RDDs and DataFrames. Datasets use encoders to convert JVM objects into a tabular representation, which is stored using Spark's Tungsten binary format.
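A minimal sketch of the Dataset API is shown below; the Person case class is hypothetical, and spark is the SparkSession created earlier:

```scala
case class Person(name: String, age: Int)

// Import the implicit encoders for primitive types and case classes.
import spark.implicits._

// Build a Dataset of typed JVM objects; an encoder maps Person to Tungsten's binary format.
val peopleDS = Seq(Person("Alice", 29), Person("Bob", 35)).toDS()

// Transformations are typed and checked at compile time.
val adultNames = peopleDS.filter(p => p.age >= 18).map(p => p.name)
adultNames.show()
```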

Advantages of the Dataset API are:

  • Just like RDDs, Datasets are type-safe
  • Just like DataFrames, Datasets are faster than RDDs
  • Easy interoperability with DataFrames and RDDs
  • Cached Datasets take less space than cached RDDs
  • Serialization with encoders is faster than with the Java or Kryo serializers

The following table shows the differences between SQL, DataFrames, and Datasets in terms of compile-time and runtime safety:

|                 | SQL     | DataFrames   | Datasets     |
|-----------------|---------|--------------|--------------|
| Syntax errors   | Runtime | Compile time | Compile time |
| Analysis errors | Runtime | Runtime      | Compile time |
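To illustrate, here is a hedged sketch of where each kind of error surfaces, reusing the peopleDS Dataset from the earlier snippet; the misspelled names are deliberate:

```scala
// SQL: a malformed query or a misspelled column inside the SQL string is only
// detected when the query is executed.
// spark.sql("SELECT nam FROM people")        // fails at runtime

// DataFrames: a misspelled method such as peopleDF.selct("name") is a compile-time
// error, but a misspelled column name is only caught at runtime during analysis,
// because columns are referenced as untyped strings.
// peopleDF.select("nam")                     // compiles, fails at runtime

// Datasets: fields are typed JVM members, so peopleDS.map(p => p.nam) does not
// compile at all; both syntax and analysis errors surface at compile time.
val names = peopleDS.map(p => p.name)         // compiles and runs
names.show()
```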

Tip

The Dataset API and its subset, the DataFrame API, will become the mainstream APIs instead of the RDD API. Whenever possible, use Datasets or DataFrames instead of RDDs, because they provide much higher performance thanks to the optimizations performed by the Catalyst optimizer.
