2. Working with Apache Spark

Hien Luu¹

(1)

SAN JOSE, CA, USA

When it comes to working with Spark or building Spark applications, there are many options. This chapter describes three common ones: using the Spark shell, submitting a Spark application from the command line, and using a hosted cloud platform called Databricks. The last part of this chapter is geared toward software engineers who want to set up the Apache Spark source code on a local machine to study it and learn how certain features were implemented.

Downloading and Installation

To learn or experiment with Spark, it is convenient to have it installed locally on your computer. This way, you can easily try out certain features or test your data processing logic with small datasets. Having Spark locally installed on your laptop lets you learn it from anywhere, including your comfortable living room, the beach, or at a bar in Mexico.

Spark is written in Scala. It is packaged so that it can run on both Windows and UNIX-like systems (e.g., Linux, macOS). To run Spark locally, all that is needed is Java installed on your computer.

Setting up a multitenant Spark production cluster requires a lot more information and resources, which are beyond the scope of this book.

Downloading Spark

The Download section of the Apache Spark website (http://spark.apache.org/downloads.html) has detailed instructions for downloading the pre-packaged Spark binary file. At the time of writing this book, the latest version is 3.1.1. In terms of package type, choose the one with the latest version of Hadoop. Figure 2-1 shows the various options for downloading Spark. The easiest way is to download the pre-packaged binary file because it contains the necessary JAR files to run Spark on your computer. Clicking the link on line item 3 triggers the binary file download. There is a way to manually build the Spark binary from source code. The instructions on how to do that are covered later in the chapter.

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig1_HTML.jpg — Figure 2-1
Apache Spark download options

Installing Spark

Once the file is successfully downloaded onto your computer, the next step is to uncompress it. The spark-3.1.1-bin-hadoop2.7.tgz file is a GZIP-compressed tar archive, so you need to use the right tool to uncompress it.

On Linux or macOS computers, the tar command should already exist, so run the following command to uncompress the downloaded file.

tar xvf spark-3.1.1-bin-hadoop2.7.tgz

For Windows computers, you can use either the WinZip or 7-zip tool to unzip the downloaded file.

Once the uncompression is successfully finished, there should be a directory called spark-3.1.1-bin-hadoop2.7. From here on, this directory is referred to as the Spark directory.

Note

If a different version of Spark is downloaded, the directory name is slightly different.

There are about a dozen directories under the spark-3.1.1-bin-hadoop2.7 directory. Table 2-1 describes the ones that are good to know.

Table 2-1

The Subdirectories in spark-3.1.1-bin-hadoop2.7

Name	Description
bin	Contains the various executable files to bring up Spark shell in Scala or Python, submit Spark applications, run Spark examples
data	Contains small sample data files for various Spark examples
examples	Contains both the source code and binary file for all Spark examples
jars	Contains the necessary binaries that are needed to run Spark
sbin	Contains the executable files to manage Spark cluster

The next step is to test out the installation by bringing up the Spark shell.

Spark shell is like a Unix shell. It provides an interactive environment to easily learn Spark and analyze data. Most Spark applications are developed in either Python or Scala, and the Spark shell is available for both languages. If you are a data scientist and Python is your cup of tea, you will not feel left out. The following sections show how to bring up the Spark Scala shell and the Spark Python shell.

Note

Scala is a JVM-based language, and thus, it is easy to leverage existing Java libraries in Scala applications.
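To make the note above concrete, here is a minimal sketch of calling Java standard-library classes from Scala. The date and list values are arbitrary illustrations, not taken from the book; the snippet can be pasted into the Spark Scala shell once it is running.

```scala
import java.time.LocalDate
import java.util.{ArrayList => JArrayList}

// Use a Java standard-library class directly from Scala.
val start = LocalDate.of(2021, 1, 1)
val later = start.plusDays(30)
println(later) // 2021-01-31

// Java collections work as well; here a java.util.ArrayList of boxed integers.
val javaList = new JArrayList[Integer]()
javaList.add(20)
javaList.add(50)
println(javaList.size()) // 2
```

No wrapper or binding layer is involved; Scala compiles to JVM bytecode, so Java classes are imported and instantiated like Scala ones.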

Spark Scala Shell

To start up the Spark Scala shell, enter the ./bin/spark-shell command in the Spark directory. After a few seconds, you should see something similar to Figure 2-2.

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig2_HTML.jpg — Figure 2-2
Scala Spark shell output

To exit the Scala Spark shell, type :quit or :q.

Note

Spark 3.1.1 runs on Java 8 or Java 11; Java 11 is preferred to run the Spark Scala shell.

Spark Python Shell

To start up the Spark Python shell, enter the ./bin/pyspark command in the Spark directory. After a few seconds, you should see something similar to Figure 2-3.

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig3_HTML.jpg — Figure 2-3
Output of Python Spark shell

To exit the Python Spark shell, enter ctrl-d.

Note

Spark Python shell requires Python 3.7.x or higher.

The Spark Scala shell and the Spark Python shell are extensions of Scala REPL and Python REPL, respectively. REPL is an acronym for read-eval-print loop. It is an interactive computer programming environment that takes user inputs, evaluates them, and returns the result to the user. Once a line of code is entered, the REPL immediately provides feedback on whether there is a syntax error. If there aren’t any syntax errors, it evaluates the code. If there is any output, it is displayed in the shell. The interactive and immediate feedback environment allows developers to be very productive by bypassing the code compilation step in the normal software development process.

To learn Spark, Spark shell is a very convenient tool to use on your local computer anytime and anywhere. It doesn’t have any external dependencies; the only requirement is that the data files you process reside on your computer. If you have an Internet connection, it is also possible to access remote data files, but doing so will be slow.
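The read-eval-print cycle described above can be illustrated with a toy sketch. This is not how the Scala REPL is implemented; it is a hypothetical miniature that "reads" lines from an iterator, evaluates a made-up one-operator language (two integers joined by +), and prints either the result or an error, which mirrors the immediate-feedback behavior.

```scala
import scala.util.Try

// Evaluate a tiny made-up "language": two integers joined by + (for example "2 + 3").
def eval(line: String): Either[String, Int] =
  line.split("\\+").map(_.trim) match {
    case Array(a, b) =>
      Try(a.toInt + b.toInt).toEither.left.map(_ => s"error: not integers: $line")
    case _ =>
      Left(s"error: expected <int> + <int>, got: $line")
  }

// The loop: read each input, evaluate it, print the outcome, repeat.
def repl(inputs: Iterator[String]): Unit =
  inputs.foreach { line =>
    eval(line) match {
      case Right(value)  => println(s"res: $value")
      case Left(message) => println(message)
    }
  }

repl(Iterator("2 + 3", "40 + 2", "oops"))
```

The third input is rejected with an error message instead of a result, much as a real REPL reports a problem right after the line is entered.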

The remaining chapters of this book use the Spark Scala shell.

Having Fun with the Spark Scala Shell

This section provides information about the Spark Scala shell and a set of useful commands that help you use it effectively for exploratory data analysis or for building Spark applications interactively.

The ./bin/spark-shell command effectively starts a Spark application and provides an environment where you can interactively call Spark Scala APIs to easily perform exploratory data processing. Since the Spark Scala shell is an extension of Scala REPL, it is a great way to learn Scala and Spark at the same time.

Useful Spark Scala Shell Commands and Tips

Once a Spark Scala shell is started, it puts you in an interactive environment to enter shell commands and Scala code. This section covers various useful commands and a few tips on working in the shell.

Once inside the Spark shell, type the following to bring up a complete list of available commands.

scala> :help

The output of this command is shown in Figure 2-4.

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig4_HTML.jpg — Figure 2-4
List of available shell commands

Some commands are used more often than others because of their usefulness. Table 2-2 describes the commonly used commands.

Table 2-2

Useful Spark Shell Commands

Name	Description
:history	This command displays what was entered during the previous Spark shell session as well as the current session. It is useful for copying purposes.
:load	Load and execute the code in the provided file. This is particularly useful when the data processing logic is long. It is a bit easier to keep track of the logic in a file.
:reset	After experimenting with the various Scala or Spark APIs for a while, you may lose track of the value of various variables. This command resets the shell to a clean state, which makes it easier to reason about.
:silent	This is for an advanced user who is a bit tired of looking at the output of each Scala or Spark API call entered in the shell. To re-enable the output, simply type :silent again.
:quit	This is a self-explanatory command but useful to know. Often, people try to quit the shell by entering :exit, which doesn’t work.
:type	Display the type of a variable. :type <variable name>
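As an illustration of the :load command in Table 2-2, the following is a sketch of what a small file of Scala code might contain. The file name scores.scala and its data are hypothetical; assuming the file sits in the directory where the shell was started, typing :load scores.scala would execute it line by line.

```scala
// Contents of a hypothetical file named scores.scala.
val scores = Array(70, 85, 90, 95)

// Average score, computed as a Double to avoid integer division.
val averageScore = scores.sum.toDouble / scores.length
println(s"Average score: $averageScore") // Average score: 85.0
```

After the file is loaded, scores and averageScore remain defined in the shell, so :type averageScore can be used to confirm that it is a Double.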

In addition to these commands, a helpful feature for improving developer productivity is code completion. As in popular integrated development environments (IDEs) such as Eclipse or IntelliJ, code completion helps developers explore the possible options and reduces typing errors.

Inside the shell, type spa and then hit the Tab key. The environment adds characters to transform “spa” to “spark”. In addition, it shows possible matches for Spark (see Figure 2-5).

scala> spa <tab>

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig5_HTML.jpg — Figure 2-5
Tab completion output of spa

In addition to completing the name of a partially entered word, the tab completion can show an object’s available member variables and functions.

In the shell, type spark, and then hit the Tab key. This displays a list of available member variables and functions of the Scala object represented by the spark variable (see Figure 2-6).

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig6_HTML.jpg — Figure 2-6
List of available member variables and functions of object called “spark”

The :history command displays the previously entered commands or lines of code. This suggests that the Spark shell maintains a record of what was entered. One way to quickly display or recall what was entered recently is by pressing the up arrow key. Once you scroll up to the line you want to execute, simply hit Enter to execute it.

Basic Interactions with Scala and Spark

The preceding section covered the basics of navigating the Spark shell; this section introduces a few fundamental ways of working with Scala and Spark in Spark shell. This fundamental knowledge will be really helpful in future chapters as you dive much deeper into topics like Spark DataFrame and Spark SQL.

Basic Interactions with Scala

Let’s start with Scala in the Spark Scala shell, which provides a full-blown environment for learning Scala. Think of Spark Scala shell as a Scala application with an empty body, and this is where you come in. You fill this empty body with Scala functions and logic for your application. This section intends to demonstrate a few simple Scala examples in Spark shell. Scala is a fascinating programming language that is powerful, concise, and elegant. Please refer to Scala-related books to learn more about this programming language.

The canonical example for learning any programming language is the “Hello World” example, which entails printing out a message. So let’s do that. Enter the following line in the Spark Scala shell; the output should look like Figure 2-7.

scala> println("Hello from Spark Scala shell")

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig7_HTML.jpg — Figure 2-7
Output of the Hello World example command

The next example defines an array of ages and prints those element values out in the Spark shell. In addition, this example illustrates the code completion feature mentioned in the previous section.

To define an array of ages and assign it to an immutable variable, enter the following into the Spark shell. Figure 2-8 shows the evaluation output.

scala> val ages = Array(20, 50, 35, 41)

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig8_HTML.jpg — Figure 2-8
Output of defining an array of ages
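The word "immutable" deserves a short illustration. In this sketch (the variable names sample and count are made up so that the ages array from the text stays untouched), a val cannot be reassigned, yet the elements of the Array it refers to can still be updated.

```scala
val sample = Array(1, 2, 3)

// sample = Array(4, 5, 6)   // does not compile: reassignment to val

// The val fixes which array `sample` refers to, but Array elements can still be updated.
sample(0) = 10
println(sample.mkString(", ")) // 10, 2, 3

// A var, by contrast, can be reassigned.
var count = 0
count += 1
println(count) // 1
```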

Now you can refer to the ages variable in the following line of code. Let’s pretend that you can’t exactly remember a function name in the Array class to iterate through the elements in the array, but you know it starts with “fo”. You can enter the following and hit the Tab key to see how Spark shell can help.

scala> ages.fo

After you press the Tab key, Spark shell displays what’s shown in Figure 2-9.

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig9_HTML.jpg — Figure 2-9
Output of code completion

Aha! You need the foreach function to iterate through the elements in the array. Let’s use it to print the ages.

scala> ages.foreach(println)

Figure 2-10 shows the expected output.

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig10_HTML.jpg — Figure 2-10
Output from printing out the ages

The previous code statement may look a bit cryptic for those new to Scala; however, you can intuitively guess what it does. As the foreach function iterates through each element in the “ages” array, it passes that element to the println function to print the value out to the console. This style is used quite a bit in the coming chapters.
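For readers who find the short form cryptic, the following sketch shows a few equivalent ways to write the same iteration. All four statements print the same four ages; which one to use is a matter of style.

```scala
val ages = Array(20, 50, 35, 41)

// Shortest form: pass the println function itself.
ages.foreach(println)

// The same behavior with an explicit anonymous function.
ages.foreach(age => println(age))

// The same behavior with the underscore placeholder.
ages.foreach(println(_))

// The same behavior written as a for loop.
for (age <- ages) println(age)
```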

The last example in this section defines a Scala function to determine whether the age is an odd number or even number; it is then used to find the odd number ages in the array.

scala> def isOddAge(age:Int) : Boolean = {
  (age % 2) == 1
}

If you come from a Java programming background, this function signature may look strange, but it is not too difficult to decipher what it does. Notice the function doesn’t use the return keyword to return the value of the expression in its body. In Scala, it is not necessary to add the return keyword. The output of the last statement in a function body is returned to the caller (if that function was defined to return a value). Figure 2-11 shows the output from the Spark shell.

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig11_HTML.jpg — Figure 2-11
If there are no syntax errors, Spark shell returns the function signature
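The point about the return keyword can be seen by comparing a few variants side by side. In this sketch, isEvenAge and isOddAgeExplicit are names introduced only for illustration.

```scala
// The value of the last expression is the result; no return keyword is needed.
def isOddAge(age: Int): Boolean = {
  (age % 2) == 1
}

// A single-expression body does not need braces.
def isEvenAge(age: Int): Boolean = age % 2 == 0

// An explicit return is allowed, but it is not idiomatic Scala.
def isOddAgeExplicit(age: Int): Boolean = {
  return (age % 2) == 1
}

println(isOddAge(35))         // true
println(isEvenAge(35))        // false
println(isOddAgeExplicit(20)) // false
```

One caveat: in Scala, as in Java, the % operator yields a negative remainder for a negative left operand, so (age % 2) == 1 is false for negative odd numbers. That does not matter for ages, but age % 2 != 0 is the more general test.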

To figure out the odd number ages in the ages array, let’s leverage the filter function in the Array class.

scala> ages.filter(age => isOddAge(age)).foreach(println)

This line of code does the filtering and then iterates through the result to print out the odd ages. It is a common practice in Scala to use function chaining to make the code concise. Figure 2-12 shows the output from Spark shell.

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig12_HTML.jpg — Figure 2-12
The output from filtering and printing out only ages that are odd numbers
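Function chaining can be extended beyond two steps. The following sketch repeats the definitions from the text so that it runs on its own, then adds a hypothetical extra step (adding one year to each odd age and totaling the result) purely to show a longer chain.

```scala
val ages = Array(20, 50, 35, 41)
def isOddAge(age: Int): Boolean = (age % 2) == 1

// Passing the method name directly is equivalent to age => isOddAge(age).
val oddAges = ages.filter(isOddAge)
println(oddAges.mkString(", ")) // 35, 41

// Chains can grow: keep the odd ages, add one year to each, then total them.
val total = ages.filter(isOddAge).map(_ + 1).sum
println(total) // 78
```

Each call in the chain returns a new collection (or a value, in the case of sum), and the original ages array is left unchanged.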

Now let’s try out the :type shell command on a Scala variable and function defined earlier. This command comes in handy once you have used Spark shell for a while and lost track of the data type of a certain variable or the return type of a function. Figure 2-13 shows examples of the :type command.

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig13_HTML.jpg — Figure 2-13
Output of :type command

To learn Spark, it is not necessary to master the Scala programming language. However, one must be comfortable with knowing and working with the basics of Scala. A good resource for learning just enough Scala to learn Spark is at https://github.com/deanwampler/JustEnoughScalaForSpark. This resource was presented at various Spark-related conferences.

Spark UI and Basic Interactions with Spark

In the previous section, I mentioned Spark shell is a Scala application. That is only partially true. The Spark shell is a Spark application written in Scala. When the Spark shell is started, a few things are initialized and set up for you to use, including Spark UI and a few important variables.

Spark UI

If you go back and carefully examine the Spark shell output in either Figure 2-2 or Figure 2-3, you see a line that looks something like the following. (The URL may be a bit different for your Spark shell.)

The SparkContext Web UI is available at http://<ip>:4040.

If you point your browser to the URL shown in your Spark shell, it displays something like what’s shown in Figure 2-14.

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig14_HTML.jpg — Figure 2-14
Spark UI

The Spark UI is a web application designed to help with monitoring and debugging Spark applications. It contains detailed runtime information about a Spark application and its resource consumption. The runtime information includes various metrics that are tremendously helpful in diagnosing performance issues in your Spark applications. One thing to note is that the Spark UI is only available while a Spark application is running.
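If the startup banner has scrolled out of view, the UI address can also be retrieved in code. This is a minimal sketch: inside the Spark shell a session already exists, so getOrCreate() returns it; outside the shell it assumes Spark is on the classpath and starts a local session. The name session and the application name are arbitrary.

```scala
import org.apache.spark.sql.SparkSession

// Inside spark-shell a session already exists, so getOrCreate() returns it.
// Outside the shell this starts a local session (Spark must be on the classpath).
val session = SparkSession.builder().master("local[*]").appName("ui-url-sketch").getOrCreate()

// uiWebUrl is an Option because the UI can be disabled (spark.ui.enabled=false).
session.sparkContext.uiWebUrl match {
  case Some(url) => println(s"Spark UI: $url")
  case None      => println("Spark UI is not enabled for this application")
}
```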

The navigation bar at the top of the Spark UI contains links to the various tabs, including Jobs, Stages, Storage, Environment, Executors, and SQL. I briefly cover the Environment and Executors tabs and describe the remaining tabs in later chapters.

The Environment tab contains static information about the environment that a Spark application is running in. This includes runtime information, Spark properties, resource profiles, Hadoop properties, system properties, and classpath entries. Table 2-3 describes each of those areas.

Table 2-3

Sections in the Environment Tab

Name	Description
Runtime Information	Contains the locations and versions of the various components that Spark depends on, including Java and Scala.
Spark Properties	This area contains the basic and advanced properties that are configured in a Spark application. The basic properties include the basic information about an application like application id, name, and so on. The advanced properties are meant to turn on or off certain Spark features or tweak them in certain ways that are best for a particular application. See the resource at https://spark.apache.org/docs/latest/configuration.html for a comprehensive list of configurable properties.
Resource Profiles	Information about the number of CPUs and the amount of memory in the Spark cluster.
Hadoop Properties	The various Hadoop and Hadoop File System properties.
System Properties	These properties are mainly at the OS and Java level, not Spark specific.
Classpath Entries	Contains a list of classpaths and jar files that are used in a Spark application.

The Executors tab contains the summary and breakdown information for each of the executors supporting a Spark application. This information includes the capacity of certain resources and how much is being used in each executor. The resources include memory, disk, and CPU. The Summary section provides a bird’s-eye view of the resource consumption across all the executors in a Spark application. Figure 2-15 shows more of the details.

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig15_HTML.png — Figure 2-15
Executor tab of a Spark application that uses only a single executor

The Spark UI is revisited in a later chapter.

Basic Interactions with Spark

Once a Spark shell is successfully started, an important variable called spark is initialized and ready to be used. The spark variable represents an instance of a SparkSession class. Let’s use the :type command to verify this.

scala> :type spark

The Spark shell displays its type, as shown in Figure 2-16.

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig16_HTML.jpg — Figure 2-16
Showing the type of “spark” variable

The SparkSession class was introduced in Spark 2.0 to provide a single entry point to interact with underlying Spark functionalities. This class has APIs for reading unstructured and structured data in text and binary formats, such as JSON, CSV, Parquet, ORC, and so on. In addition, the SparkSession component provides a facility to retrieve and set Spark configurations.
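As a small taste of the reading APIs mentioned above, the following sketch writes two made-up JSON records to a temporary file and reads them back through the session. The file, the records, and the name session are all illustrative assumptions; in the Spark shell, getOrCreate() simply returns the existing spark session.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// In spark-shell this returns the existing session; elsewhere it starts a local one.
val session = SparkSession.builder().master("local[*]").appName("read-sketch").getOrCreate()

// Write two JSON records, one per line, to a temporary file. The data is made up.
val jsonFile = Files.createTempFile("people-sketch", ".json")
val records = Seq("""{"name":"Ann","age":35}""", """{"name":"Bob","age":41}""")
Files.write(jsonFile, records.mkString("\n").getBytes(StandardCharsets.UTF_8))

// Read the file; Spark infers the column names and types from the records.
val people = session.read.json(jsonFile.toUri.toString)
people.printSchema()
people.show()
```

The same reader exposes csv, parquet, orc, and text methods for the other formats; they follow the same pattern of passing a path and getting structured data back.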

Let’s start interacting with the spark variable in Spark shell to print out a few useful pieces of information, such as the version and existing configurations. From the Spark shell, type the following code to print the Spark version. Figure 2-17 shows the output.

scala> spark.version

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig17_HTML.jpg — Figure 2-17
Spark version output

To be a little more formal, you can use the println function covered in the previous section to print out the Spark version. The output is shown in Figure 2-18.

scala> println("Spark version: " + spark.version)

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig18_HTML.jpg — Figure 2-18
Display Spark version using println function

To see the default configurations in the Spark shell, you access the conf variable of spark. Here is the code to display the default configurations, and the output is shown in Figure 2-19.

scala> spark.conf.getAll.foreach(println)

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig19_HTML.jpg — Figure 2-19
Default configurations in Spark shell application
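Individual configurations can be read and set as well. The sketch below uses spark.sql.shuffle.partitions, a real runtime setting, with an arbitrary value of 8; the key spark.some.hypothetical.key is deliberately made up to show how an unset key behaves. Note that running this in the Spark shell changes the setting for the rest of that session.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell this returns the existing session; elsewhere it starts a local one.
val session = SparkSession.builder().master("local[*]").appName("conf-sketch").getOrCreate()

// Set a runtime SQL configuration and read it back.
session.conf.set("spark.sql.shuffle.partitions", "8")
val partitions = session.conf.get("spark.sql.shuffle.partitions")
println(s"spark.sql.shuffle.partitions = $partitions")

// getOption returns None instead of throwing when a key has not been set.
val missing = session.conf.getOption("spark.some.hypothetical.key")
println(s"hypothetical key set? ${missing.isDefined}")
```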

To see the complete set of available objects you can access from the spark variable, you can leverage the Spark shell code completion feature.

scala> spark.<tab>

Figure 2-20 shows the result of this command.

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig20_HTML.jpg — Figure 2-20
A complete list of variables that can be accessed from the spark variable
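Two of the members in that list, range and sql, are easy to try right away. This is only a sketch with arbitrary values, assuming either the Spark shell (where getOrCreate() returns the existing session) or Spark on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell this returns the existing session; elsewhere it starts a local one.
val session = SparkSession.builder().master("local[*]").appName("members-sketch").getOrCreate()

// range builds a dataset of numbers; the end value is exclusive, so this holds 1 through 5.
val numbers = session.range(1, 6)
println(numbers.count()) // 5

// sql runs a SQL statement and returns the result as rows.
val two = session.sql("SELECT 1 + 1 AS two").first().getInt(0)
println(two) // 2
```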

Upcoming chapters have more examples of using spark to interact with underlying Spark functionalities.

Introduction to Collaborative Notebooks

Collaborative Notebooks is a commercial product offered by Databricks, the company founded by the original creators of the open source Apache Spark project. According to the product documentation, Collaborative Notebooks is designed for data engineers, data scientists, and data analysts to perform data analysis and build machine learning models, and it supports multiple languages, built-in data visualization, and automatic data versioning. It also provides on-demand Spark compute infrastructure and can execute jobs for production data pipelines on a specific schedule. It is built around Apache Spark and provides four main value propositions to customers around the world.

Fully managed Spark clusters
An interactive workspace for exploration and visualization
A production pipeline scheduler
A platform for powering your favorite Spark-based applications

The Collaborative Notebooks product has two versions: the full platform and the community edition. The full platform is a paid commercial product that provides advanced features such as creating multiple clusters, user management, and job scheduling. The community edition is free and ideal for developers, data scientists, data engineers, and anyone who wants to learn Apache Spark or try Databricks.

The following section covers the basic features of the Collaborative Notebooks community edition. It provides an easy and intuitive environment to learn Spark, perform data analysis, or build Spark applications. This section is not intended to be a comprehensive guide. For that, you can refer to the Databricks user guide (https://docs.databricks.com/user-guide/index.html).

To use Collaborative Notebooks, you need to sign up for a free account on the community edition at https://databricks.com/try-databricks. This signup process is simple and quick; an account can be created in a matter of minutes. Once the necessary information is provided and submitted in the sign-up form, you shortly receive an email from Databricks to confirm your email, which looks something like Figure 2-21.

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig21_HTML.jpg — Figure 2-21
Databricks email to confirm your email address

Clicking the URL link shown in Figure 2-21 takes you to the Databricks sign-in form, as shown in Figure 2-22.

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig22_HTML.jpg — Figure 2-22
Databricks sign-in page

After a successful sign-in using the email and password, you see the Databricks welcome page, as shown in Figure 2-23.

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig23_HTML.jpg — Figure 2-23
Databricks welcome page

Over time, the welcome page may evolve, so it may not look exactly like Figure 2-23. Feel free to explore the tutorial or the documentation.

The goal of this section is to create a notebook in Databricks so that you can practice the commands covered in the previous section. The following are the main steps.

1. Create a cluster.
2. Create a folder.
3. Create a notebook.

Create a Cluster

One of the coolest features of the community edition (CE) is that it provides a single node Spark cluster with 15 GB of memory for free. At the time of writing this book, this single node cluster is hosted on the AWS cloud. Since the CE account is free, it does not provide the capability to create multiple clusters simultaneously. A cluster continues to stay up as long as it is being used. If it is idle for two hours, it automatically shuts down. This means you don’t have to proactively shut down the cluster.

To create a cluster, click the Clusters icon in the vertical navigation bar on the left side of the page. The Cluster page looks like Figure 2-24.

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig24_HTML.jpg — Figure 2-24
Databricks Cluster page with no active clusters

Now click the Create Cluster button to bring up the New Cluster form that looks like Figure 2-25.

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig25_HTML.jpg — Figure 2-25
Create Cluster form

The only required field on this form is the cluster name. Table 2-4 describes each field.

Table 2-4

Databricks New Cluster Form Fields

Name	Description
Cluster Name	A unique name to identify your cluster. The name can have space between each word; for example, “my spark cluster”.
Databricks Runtime Version	Databricks supports many versions of Spark. For learning purposes, select the latest version, which is automatically filled for you. Each version is tied to a specific AWS image.
Instance	For the CE, there isn’t any other choice.
AWS – Availability Zone	This allows you to decide which AWS Availability Zone your single node cluster runs in. The options may look different based on your location.
Spark – Spark Config	This allows you to specify any application-specific configurations that should be used to launch the Spark cluster. Examples include JVM configurations to turn on certain Spark features.

Once you enter a cluster name, click the Create Cluster button. It can take up to 10 minutes to create a single node Spark cluster. If needed, try switching to a different availability zone if the default one takes a long time. Once a Spark cluster is successfully created, a green dot appears next to the cluster name, as shown in Figure 2-26.

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig26_HTML.jpg — Figure 2-26
After a cluster is created successfully

Feel free to explore by clicking the name of your cluster or the various links on this page. If you try to create another Spark cluster by following the same steps, it won’t allow you to do so.

To terminate an active Spark cluster, click the square block under the Actions column.

For more information on creating and managing a Spark cluster in Databricks, go to https://docs.databricks.com/user-guide/clusters/index.html.

Let’s move on to the next step, creating a folder.

Create a Folder

Before going into how to create a folder, it is worth taking a moment to describe the workspace concept in Databricks. The easiest way to think about a workspace is to treat it as the file system on your computer, which means you can leverage its hierarchical structure to organize your notebooks.

To create a folder, click the Workspace icon in the vertical navigation bar on the left side of the page. The Workspace column slides out, as shown in Figure 2-27.

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig27_HTML.jpg — Figure 2-27
Workspace column

Now click the downward arrow in the upper right of the Workspace column, and the popup menu shows up (see Figure 2-28).

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig28_HTML.jpg — Figure 2-28
Menu item for creating a folder

Selecting the Create ➤ Folder menu item brings up the New Folder Name dialog box (see Figure 2-29).

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig29_HTML.jpg — Figure 2-29
New Folder Name dialog box

Now you can enter a folder name (e.g., Chapter 2), and click the Create Folder button to complete the process. The Chapter 2 folder should now appear in the Workspace column, as shown in Figure 2-30.

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig30_HTML.jpg — Figure 2-30
Chapter 2 folder appears in the Workspace column

Before creating a notebook, it is worth mentioning that there is an alternative way to create a folder. Place your mouse pointer anywhere in the Workspace column and right-click; the same menu options appear.

For more information on workspaces and creating folders, please go to https://docs.databricks.com/user-guide/workspace.html.

Create a Notebook

To create a Scala notebook in the Chapter 2 folder, first select the Chapter 2 folder in the Workspace column. The Chapter 2 column slides out after the Workspace column, as shown in Figure 2-31.

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig31_HTML.jpg — Figure 2-31
Chapter 2 column appears to the right of Workspace column

Now you can either click the downward arrow in the upper right of the Chapter 2 column or right-click anywhere in the Chapter 2 column to bring up the menu, as shown in Figure 2-32.

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig32_HTML.jpg — Figure 2-32
Create notebook menu item

Selecting the Notebook menu item brings up the Create Notebook dialog box. Give your notebook a name, and make sure to select the Scala option for the Language field. The value for the cluster should be automatically filled in because the CE can only have one cluster at a time. The dialog box should look something like Figure 2-33.

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig33_HTML.jpg — Figure 2-33
Create Notebook dialog box with Scala language option selected

Once the Create button is clicked, a brand-new notebook is created, as shown in Figure 2-34.

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig34_HTML.jpg — Figure 2-34
New Scala notebook

If you have never worked with an IPython notebook, the notebook concept may seem a bit strange at first. However, once you get used to it, you will find it intuitive and fun.

A notebook is essentially an interactive computational environment (similar to Spark shell, but way better). You can execute Spark code, document your code with rich text using Markdown or HTML markup, and visualize the results of your data analysis with various types of charts and graphs.

The following section covers only a few essential parts to help you be productive in using the Spark Notebook. For a comprehensive list of instructions on using and interacting with a Databricks notebook, please go to https://docs.databricks.com/user-guide/notebooks/index.html.

The Spark Notebook contains a collection of cells, and each one contains either a block of code to execute or markups for documentation purposes.

Note

A good practice of using the Spark Notebook is to break your data processing logic into multiple logical groups so each one resides in one or more cells. This is similar to the practice of developing maintainable software applications.

Let’s divide the notebook into two sections. The first section contains the code snippets you entered in the “Basic Interactions with Scala” section. The second section contains the code snippets you entered in the “Basic Interactions with Spark” section.

Let’s start with adding a Markdown statement to document the first section of your notebook by entering the following into the first cell (see Figure 2-35).

%md #### Basic Interactions with Scala

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig35_HTML.jpg — Figure 2-35
Cell contains section header markup statement

To execute that markup statement, make sure the mouse cursor is in cell 1, hold down the Shift key, and hit the Enter key. That is the shortcut for running code or markup statements in a cell. The result should look like Figure 2-36.

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig36_HTML.jpg — Figure 2-36
The output of executing the markup statement

Notice the Shift+Enter key combination executes the statements in that cell and creates a new cell below it. Now let’s enter the “Hello World” example into the second cell and execute that cell. The output should look like Figure 2-37.

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig37_HTML.jpg — Figure 2-37
The output of executing the println statement

Next, copy the remaining three code statements from the “Basic Interactions with Scala” section into the notebook (see Figure 2-38).

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig38_HTML.jpg — Figure 2-38
The remaining code statements from the “Basic Interactions with Scala” section

Like the Spark Scala shell, a Scala notebook is a full-blown interactive Scala environment where you can execute Scala code.

Now let’s enter the second markup statement to mark the beginning of the second part of your notebook, followed by the code snippets from the “Basic Interactions with Spark” section. Figure 2-39 shows the output.

%md #### Basic Interactions with Spark

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig39_HTML.jpg — Figure 2-39
Output of the code snippets from the “Basic Interactions with Spark” section

There are a few important things to know when working with a Spark Notebook. It provides a very convenient auto-saving feature: the content of a notebook is automatically saved as you enter markup statements or code snippets. In fact, the File menu has no option for saving a notebook.

Sometimes there is a need to create a new cell between two existing cells. One way to do this is to move the mouse cursor to the space between them, then click the plus icon that appears to create a new cell. Figure 2-40 shows what the plus icon looks like.

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig40_HTML.jpg — Figure 2-40
Using plus icon to create a new cell between two existing cells

Sometimes you need to share your notebook with a co-worker in a remote office or with other collaborators, either to show off your awesome Spark knowledge or to get their feedback on your data analysis. To do so, click the File menu item at the top of your Spark notebook and select the Publish submenu item. Figure 2-41 shows what it looks like.

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig41_HTML.jpg — Figure 2-41
Notebook publishing menu item

Clicking the Publish submenu item brings up a confirmation dialog box (see Figure 2-42). If you follow through with it, the Notebook Published dialog box (see Figure 2-43) provides a URL that you can send to anyone in the world. With that URL, your co-worker or collaborators can view your notebook, or they can import it into their Databricks workspace.

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig42_HTML.jpg — Figure 2-42
Publishing confirmation dialog box

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig43_HTML.jpg — Figure 2-43
Notebook published URL

This section covers only the essential parts of using Databricks. Many other advanced features make it enticing to use Databricks as the platform for performing interactive data analysis or building advanced data solutions like machine learning models.

The Community Edition (CE) provides a free account with a single-node Spark cluster, which makes learning Spark much easier than before. I highly recommend giving Databricks a try in your journey of learning Spark.

Setting up Spark Source Code

This section is geared toward software developers or anyone interested in learning how Spark works at the code level. Since Apache Spark is an open source project, its source code is publicly available on GitHub, so you can download it, examine it, and study how certain features were implemented. The Spark code is written in Scala by some of the smartest Scala programmers on the planet, so studying it is a great way to improve your Scala programming skills and knowledge.

There are two ways to download the Apache Spark source code to your computer. The first is to download it from the Spark download page at http://spark.apache.org/downloads.html, the same page used earlier to download the Spark binary file. This time, choose the Source Code package type, as shown in Figure 2-44.

../images/419951_2_En_2_Chapter/419951_2_En_2_Fig44_HTML.jpg — Figure 2-44
Apache Spark source download option

To complete the source code download process, click the link on line #3 to download the compressed source code file. The final step is to uncompress the file into the directory of your choice.
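The uncompress step can be sketched as a small shell helper. This is only an illustration for UNIX-like systems with tar available: the function name unpack_source, the archive name spark-3.1.1.tgz, and the directories in the usage comment are assumptions, not names taken from the download page, so substitute whatever file you actually downloaded.

```shell
# Hypothetical helper: extract a gzip-compressed tar archive into a target directory.
unpack_source() {
  archive="$1"   # path to the downloaded .tgz file
  dest="$2"      # directory to extract into (created if it does not exist)
  mkdir -p "$dest"
  tar -xzf "$archive" -C "$dest"
}

# Example usage (file name and paths are assumed, adjust to your download):
#   unpack_source "$HOME/Downloads/spark-3.1.1.tgz" "$HOME/src"
```

After extraction, the source tree sits in a subdirectory of the destination directory, named after whatever top-level folder the archive contains.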

The second way is to use the git clone command to download the Apache Spark source code from its GitHub repository. This requires Git to be installed on your computer. Git is available for download at https://git-scm.com/downloads, and the installation instructions are available at https://git-scm.com/book/en/v2/Getting-Started-Installing-Git. Once Git is properly installed on your computer, issue the following command to clone the Apache Spark repository on GitHub (https://github.com/apache/spark).

git clone https://github.com/apache/spark.git
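A full clone brings down the entire project history, which is large. If you only want to read the code for one release, a shallow clone of a single release tag may be enough. The following is a minimal sketch under stated assumptions: the tag v3.1.1 is assumed to correspond to the version discussed in this chapter, and the function name clone_spark and the default target directory are hypothetical choices for illustration.

```shell
# Assumed values for illustration; change them to suit your setup.
SPARK_REPO_URL="https://github.com/apache/spark.git"
SPARK_TAG="v3.1.1"                                  # release tag (assumption)
SPARK_SRC_DIR="${SPARK_SRC_DIR:-$HOME/src/spark}"   # hypothetical target directory

# Hypothetical helper: fetch only the commit behind the chosen tag.
clone_spark() {
  # --depth 1 skips older history; --branch also accepts a tag name
  git clone --depth 1 --branch "$SPARK_TAG" "$SPARK_REPO_URL" "$SPARK_SRC_DIR"
}

# Report whether git is available; the clone runs only when clone_spark is called.
if command -v git >/dev/null 2>&1; then
  echo "git found: call clone_spark to fetch $SPARK_TAG into $SPARK_SRC_DIR"
else
  echo "git is not installed; see https://git-scm.com/downloads" >&2
fi
```

A shallow clone is fine for reading the code, but it omits the commit history, so use the plain git clone command if you want to study how a feature evolved over time.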

Once the Apache Spark source code is downloaded to your computer, go to http://spark.apache.org/developer-tools.html for information on how to import it into your favorite IDE.

Summary

When it comes to learning Spark, there are a few options. You can either use the locally installed Spark or use the Collaborative Notebook Community Edition. These tools make it easy and convenient for anyone to learn Spark.
The Spark shell is a powerful and interactive environment to learn Spark APIs and to analyze data interactively. There are two types of Spark shell: the Spark Scala shell and the Spark Python shell.
The Spark shell provides a set of commands to help its users be productive.
Collaborative Notebooks is a fully managed platform designed to simplify building and deploying data exploration, data pipelines, and machine learning solutions. The interactive workspace provides an intuitive way to organize and manage notebooks. Each notebook contains a combination of markup statements and code snippets. Sharing a notebook with others only requires a few mouse clicks.
For software developers interested in learning about the internals of Spark, downloading and examining the Apache Spark source code is a great way to satisfy that curiosity.
