Chapter 8. Spark Databricks

Creating a big data analytics cluster, importing data, and building ETL streams to cleanse and process that data are difficult and expensive tasks. The aim of Databricks is to reduce this complexity and make cluster creation and data processing easier. They have created a cloud-based platform, built on Apache Spark, that automates cluster creation and simplifies data import, processing, and visualization. Currently, the storage is based upon AWS but, in the future, they plan to expand to other cloud providers.

The same people who designed Apache Spark are involved in the Databricks system. At the time of writing this book, the service was only accessible via registration, and I was offered a 30-day trial period. Over the next two chapters, I will examine the service and its components, and offer some sample code to show how it works. This chapter will cover the following topics:

  • Installing Databricks
  • AWS configuration
  • Account management
  • The menu system
  • Notebooks and folders
  • Importing jobs via libraries
  • Development environments
  • Databricks tables
  • The Databricks DbUtils package

Given that this book is provided in a static format, it will be difficult to fully examine functionality such as streaming.

Overview

The Databricks service, available at the https://databricks.com/ website, is based upon the idea of a cluster. This is similar to a Spark cluster, which has already been examined and used in previous chapters: it contains a master, workers, and executors. However, the configuration and the size of the cluster are automated, depending upon the amount of memory that you specify. Features such as security, isolation, process monitoring, and resource management are all managed for you. If you have an immediate requirement for a Spark-based cluster using 200 GB of memory for a short period of time, this service can be used to dynamically create it and process your data. You can then terminate the cluster to reduce your costs when the processing is finished.
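To give a sense of how little cluster plumbing is needed once Databricks has created the cluster for you, the following minimal Scala sketch shows a Notebook cell using the SparkContext that Databricks attaches automatically. The file path used here is a hypothetical example, not one of the datasets used later in this chapter.

  // In a Databricks Notebook cell, the cluster is already attached, so a
  // SparkContext (sc) and SQLContext (sqlContext) are provided for you;
  // there is no need to build a SparkConf or SparkContext by hand.
  // The file path below is a hypothetical example.

  val raw = sc.textFile("/tmp/sample_data.csv")   // read a text file from cluster storage
  val lineCount = raw.count()                     // a simple action that runs on the cluster
  println(s"Lines read: $lineCount")

Because the cluster already exists and is attached to the Notebook, this cell can be run as soon as it is typed; there is no job submission step as there would be with spark-submit.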

Within a cluster, the idea of a Notebook is introduced as the place where you create scripts and run programs. Notebooks can be created within folders, and can be based upon Scala, Python, or SQL. Jobs can be created to execute functionality, and can draw on Notebook code or imported libraries. Notebooks can also call the functionality in other Notebooks, as shown in the sketch that follows. Finally, jobs can be scheduled to run at a given time or on a recurring basis.
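As a taste of one Notebook calling another, the following Scala sketch invokes a second Notebook by name and passes it a parameter. The Notebook name "CleanseData", its "inputPath" parameter, and the path passed in are hypothetical examples; dbutils.notebook.run is the mechanism offered by current Databricks runtimes, and the exact API available at the time of writing may differ. The DbUtils package itself is covered later in this chapter.

  // A minimal sketch of one Notebook calling another via the DbUtils package.
  // The Notebook name and parameters here are hypothetical examples.
  val result = dbutils.notebook.run(
    "CleanseData",                        // path of the Notebook to call
    600,                                  // timeout in seconds
    Map("inputPath" -> "/tmp/raw_data")   // parameters passed to the called Notebook
  )
  println(s"Called Notebook returned: $result")

Alternatively, a %run cell at the top of a Notebook can be used to include the definitions from another Notebook in the current one.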

This should give you a feel for what the Databricks service provides. The following sections will explain each major item that has been introduced. Please keep in mind that what is presented here is new and evolving. Also, I used the AWS US East (North Virginia) region for this demonstration, as the Asia Pacific (Sydney) region currently has limitations that caused the Databricks install to fail.
