Summary

This chapter started by explaining the Spark SQL context and file I/O methods. It then showed that Spark and HDFS-based data can be manipulated both as DataFrames, using SQL-like methods, and with Spark SQL, by registering temporary tables. Next, user-defined functions were introduced to show that the functionality of Spark SQL can be extended by creating new functions to suit your needs, registering them as UDFs, and then calling them in SQL to process data.
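As a minimal sketch of that UDF workflow, the following assumes a Spark 1.3-style `SQLContext` built from an existing `SparkContext` (`sc`); the function name `ageBracket`, the `people` table, and its data are illustrative placeholders, not examples from the chapter:

```scala
// Sketch only: assumes a SparkContext named sc is already available,
// as in the chapter's examples.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Wrap a plain Scala function as a UDF and register it by name
sqlContext.udf.register("ageBracket", (age: Int) =>
  if (age < 18) "minor" else if (age < 65) "adult" else "senior"
)

// Register a DataFrame as a temporary table, then call the UDF in SQL
val people = sc.parallelize(Seq(("Alice", 30), ("Bob", 70)))
  .toDF("name", "age")
people.registerTempTable("people")

sqlContext.sql(
  "SELECT name, ageBracket(age) AS bracket FROM people"
).show()
```

Once registered, the UDF can be used anywhere in a SQL statement, exactly like a built-in function.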

Finally, the Hive context was introduced for use in Apache Spark. Remember that the Hive context in Spark offers a superset of the functionality of the SQL context. I understand that, over time, the SQL context will be extended to match the Hive context functionality. Hive QL data processing in Spark using a Hive context was shown using both a local Hive and a Hive-based metastore server. I believe that the latter configuration is better, as the tables created and the data changes persist in your Hive instance.
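For reference, pointing Spark's Hive context at a metastore server comes down to making a `hive-site.xml` visible on the Spark classpath that names the metastore's Thrift endpoint; the host name and port below are placeholders for your own server, not values from the chapter:

```xml
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-host:9083</value>
  </property>
</configuration>
```

With this in place, tables created through the Hive context are recorded in the shared metastore, which is what makes them persist beyond the Spark session.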

In my case, I used Cloudera CDH 5.3, which used Hive 0.13, PostgreSQL, ZooKeeper, and Hue, along with Apache Spark version 1.3.1. The setup that I have shown you applies to this configuration only. If you wanted to use MySQL, for instance, you would need to research the necessary changes. A good place to start would be the mailing list.

Finally, I would say that the Apache Spark Hive context configuration, with Hive-based storage, is very useful. It allows you to use Hive as a big data-scale data warehouse, with Apache Spark providing fast in-memory processing. It offers you the ability to manipulate your data not only with the Spark-based modules (MLlib, SQL, GraphX, and Streaming), but also with other Hadoop-based tools, making it easier to create ETL chains.

The next chapter will examine the Spark graph processing module, GraphX. It will also investigate the Neo4J graph database and the MazeRunner application.
