Chapter 4. Apache Spark SQL

In this chapter, I would like to examine Apache Spark SQL, the use of Apache Hive with Spark, and DataFrames. DataFrames were introduced in Spark 1.3, and are columnar data storage structures, roughly equivalent to relational database tables. The chapters in this book have not been developed in sequence, so the earlier chapters might use older versions of Spark than the later ones. I also want to examine user-defined functions for Spark SQL. A good place to find information about the Spark class API is: spark.apache.org/docs/<version>/api/scala/index.html.

I prefer to use Scala, but the API information is also available in Java and Python formats. The <version> value refers to the release of Spark that you will be using, which is 1.3.1 in this chapter. This chapter will cover the following topics:

  • SQL context
  • Importing and saving data
  • DataFrames
  • Using SQL
  • User-defined functions
  • Using Hive

Before moving straight into SQL and DataFrames, I will give an overview of the SQL context.

The SQL context

The SQL context is the starting point for working with columnar data in Apache Spark. It is created from the Spark context, and provides the means for loading and saving data files of different types, using DataFrames, and manipulating columnar data with SQL, among other things. It can be used for the following:

  • Executing SQL via the SQL method
  • Registering user-defined functions via the UDF method
  • Caching
  • Configuration
  • DataFrames
  • Data source access
  • DDL operations

I am sure that there are other areas, but you get the idea. The examples in this chapter are written in Scala, just because I prefer the language, but you can develop in Python and Java as well. As shown previously, the SQL context is created from the Spark context. Importing the SQL context implicits allows you to implicitly convert RDDs into DataFrames:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
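
To tie a few of the items from the earlier list together, here is a minimal sketch that registers a user-defined function, executes a SQL statement, and caches a table. It assumes a hypothetical temporary table called records has already been registered; UDFs and SQL are covered in more detail later in this chapter:

// A sketch only: the "records" table and the toUpper function are hypothetical examples
sqlContext.udf.register("toUpper", (s: String) => s.toUpperCase)      // register a user-defined function
val upperNames = sqlContext.sql("SELECT toUpper(name) FROM records")  // execute SQL via the sql method
sqlContext.cacheTable("records")                                      // cache the registered table in memory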

For instance, the implicits import shown above allows you to load a CSV file, split each line on its separator character, and then convert the RDD that contains the data into a DataFrame using the toDF method.
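
The following sketch shows this pattern; the file path, the comma separator, and the Person case class are hypothetical examples rather than data used elsewhere in this book:

// Hypothetical CSV layout: name,age
case class Person(name: String, age: Int)

val csvRDD = sc.textFile("/tmp/people.csv")            // load the raw text file as an RDD of lines
val peopleDF = csvRDD
  .map(line => line.split(","))                        // split each line on the separator character
  .map(cols => Person(cols(0), cols(1).trim.toInt))    // map the columns onto the case class
  .toDF()                                              // convert the RDD into a DataFrame via the implicits

peopleDF.registerTempTable("people")                   // register the DataFrame for SQL access
peopleDF.show()                                        // print the first rows of the DataFrame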

It is also possible to define a Hive context for the access and manipulation of Apache Hive database table data (Hive is the Apache data warehouse that is part of the Hadoop ecosystem, and it uses HDFS for storage). The Hive context allows a superset of SQL functionality when compared to the SQL context. The use of Hive with Spark will be covered in a later section of this chapter.
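
As a brief sketch, a Hive context is created from the Spark context in much the same way as the SQL context, provided that your Spark build includes Hive support; the table name queried here is a hypothetical example:

// Requires a Spark build with Hive support; the "employees" table is a hypothetical example
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

hiveContext.sql("SHOW TABLES").show()                            // list the tables known to the Hive metastore
val rowCount = hiveContext.sql("SELECT count(*) FROM employees") // run HiveQL against a Hive table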

Next, I will examine some of the supported file formats available for importing and saving data.
