In this chapter, I would like to examine Apache Spark SQL, the use of Apache Hive with Spark, and DataFrames. DataFrames were introduced in Spark 1.3, and are columnar data storage structures, roughly equivalent to relational database tables. The chapters in this book have not been developed in sequence, so the earlier chapters might use older versions of Spark than the later ones. I also want to examine user-defined functions for Spark SQL. A good place to find information about the Spark class API is: spark.apache.org/docs/<version>/api/scala/index.html.
I prefer to use Scala, but the API information is also available in Java and Python formats. The <version> value refers to the release of Spark that you will be using, which is 1.3.1 in this chapter. This chapter will cover the SQL context, importing and saving data, DataFrames, Spark SQL, user-defined functions, and the use of Hive with Spark.
Before moving straight into SQL and DataFrames, I will give an overview of the SQL context.
The SQL context is the starting point for working with columnar data in Apache Spark. It is created from the Spark context, and provides the means for loading and saving data files of different types, using DataFrames, and manipulating columnar data with SQL, among other things. For instance, it can be used to execute SQL against registered tables, to define user-defined functions, and to convert RDDs into DataFrames. I am sure that there are other areas, but you get the idea. The examples in this chapter are written in Scala, just because I prefer the language, but you can develop in Python and Java as well. As the following snippet shows, the SQL context is created from the Spark context. Importing the SQL context implicits allows you to implicitly convert RDDs into DataFrames:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
For instance, the previous implicits import allows you to load a CSV file, split each line by its separator character, and then convert the RDD that contains the data into a DataFrame using the toDF method.
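To make this concrete, the following sketch shows the pattern end to end, assuming a hypothetical comma-separated file called adult.txt; the file name, case class fields, and column values are illustrative, not from a real data set:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Create the contexts; in the spark-shell, sc and sqlContext already exist
val sc = new SparkContext(new SparkConf().setAppName("sql-example"))
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

// A case class gives the DataFrame its column names and types
case class Person(name: String, age: Int)

// Load the CSV file, split each line on the separator character,
// map to the case class, and convert the RDD to a DataFrame with toDF
val people = sc.textFile("adult.txt")
  .map(_.split(","))
  .map(cols => Person(cols(0).trim, cols(1).trim.toInt))
  .toDF()

// Register the DataFrame as a temporary table and query it with SQL
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 30").show()
```

In Spark 1.3, registerTempTable is the way to expose a DataFrame to SQL; the table exists only for the lifetime of the SQL context that registered it.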
It is also possible to define a Hive context for the access and manipulation of Apache Hive database table data (Hive is the Apache data warehouse that is part of the Hadoop ecosystem, and it uses HDFS for storage). The Hive context offers a superset of the SQL functionality available from the plain SQL context. The use of Hive with Spark will be covered in a later section in this chapter.
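As a brief sketch, the Hive context is created from the Spark context in the same way as the SQL context; this assumes a working Hive installation whose configuration is visible to Spark, and the table name queried here is hypothetical:

```scala
// The Hive context supports HiveQL and access to the Hive metastore
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

// List the tables known to the Hive metastore
hiveContext.sql("SHOW TABLES").collect().foreach(println)

// Query an existing Hive table (here, a hypothetical table called adult)
hiveContext.sql("SELECT COUNT(*) FROM adult").show()
```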
Next, I will examine some of the supported file formats available for importing and saving data.