SQLContext and HiveContext

Prior to Spark 2.0, SparkContext was the entry point for Spark applications, and SQLContext and HiveContext were the entry points for running Spark SQL. HiveContext is a superset of SQLContext. An SQLContext needs to be created to run Spark SQL on an RDD.

The SQLContext provides connectivity to various data sources. Data can be read from these sources and transformed with Spark SQL as required. An SQLContext can be created from a SparkContext as follows:

JavaSparkContext javaSparkContext = new JavaSparkContext(conf); 
SQLContext sqlContext = new SQLContext(javaSparkContext); 

The SQLContext wraps the SparkContext and provides SQL functionality along with functions to work with structured data. It supports only a basic set of SQL functions.
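
For example, the following sketch (reusing the sqlContext created above and assuming a hypothetical JSON file named people.json with name and age fields) registers the data as a temporary table and queries it with Spark SQL:

// Load a JSON file into a DataFrame (the path is illustrative)
DataFrame people = sqlContext.read().json("people.json");
// Register it as a temporary table so it can be queried with SQL
people.registerTempTable("people");
DataFrame adults = sqlContext.sql("SELECT name, age FROM people WHERE age > 21");
adults.show();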

The HiveContext, being a superset of SQLContext, provides many more functions. The HiveContext lets you write queries using the HiveQL parser, which means all of the Hive functions can be accessed using HiveContext. For example, windowing functions can be executed using HiveContext. It also provides access to Hive UDFs. Like SQLContext, HiveContext can be created from a SparkContext as follows:

JavaSparkContext javaSparkContext = new JavaSparkContext(conf); 
HiveContext hiveContext = new HiveContext(javaSparkContext); 

The HiveContext does not require a running Hive installation. However, if needed, Hive tables can be queried using HiveContext; a hive-site.xml file is required to access them. All the data sources that are accessible using SQLContext can also be accessed using HiveContext.
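
For example, the following sketch (reusing the hiveContext created above and assuming a purely illustrative table named employee with name, dept, and salary columns) runs a ranking window function, which is one of the features that HiveQL parsing makes available:

// Rank employees by salary within each department using a window function
DataFrame ranked = hiveContext.sql(
    "SELECT name, dept, salary, "
        + "rank() OVER (PARTITION BY dept ORDER BY salary DESC) AS salary_rank "
        + "FROM employee");
ranked.show();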

HiveContext requires all Hive dependencies to be included in the application. So, if only basic SQL functions are required, SQLContext will suffice. However, it is recommended to use HiveContext as it provides a wider range of functions.

In Spark 2.x both of these APIs are available. However, as per the documentation, they are deprecated. Spark 2.0 introduced a new entry point for all Spark applications called SparkSession.

SparkSession

In Chapter 5, Working with Data and Storage, we used SparkSession to connect to Cassandra and to read data in various formats. In this section, we will discuss SparkSession in detail.

The SparkSession was introduced in Spark 2.0 as the single entry point for all Spark applications. Prior to Spark 2.0, SparkContext was the entry point for Spark applications: SQLContext and HiveContext were used for Spark SQL, StreamingContext was used for Spark Streaming applications, and so on.

We will explain Spark Streaming in the next chapter.

In Spark 2.x, SparkSession can be used as the entry point for every Spark application. It combines all the functions of SQLContext, HiveContext, and StreamingContext. The SparkSession internally requires a SparkContext for computation.
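
As a minimal sketch (the application name and master shown here are illustrative), a SparkSession is created using its builder; calling enableHiveSupport() brings in the Hive functionality that HiveContext used to provide:

SparkSession sparkSession = SparkSession.builder()
    .appName("SparkSessionExample")   // illustrative application name
    .master("local[*]")               // illustrative master; often set via spark-submit instead
    .enableHiveSupport()              // optional: requires the Hive dependencies on the classpath
    .getOrCreate();

// The same object is used to read data and run SQL
Dataset<Row> df = sparkSession.sql("SELECT 1 AS test");
df.show();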

Another key difference with SparkSession is that there can be multiple SparkSession instances in a single Spark application, which was not possible with SparkContext.
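
For example, newSession() returns an additional session that shares the underlying SparkContext but keeps its own SQL configuration and temporary views (a minimal sketch, building on the sparkSession created above):

// Create a second session: it shares the same SparkContext,
// but has isolated SQL configuration and temporary views
SparkSession anotherSession = sparkSession.newSession();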
