Installing Spark

For learning purposes, let's now install Spark on the local computer (even though in production it more often runs on a cluster of servers). Full instructions can be found at https://spark.apache.org/downloads.html.

There are many stable versions; we take version 2.3.2 (released September 24, 2018) as an example. As illustrated in the following screenshot, after selecting 2.3.2 in step 1, we choose Pre-built for Apache Hadoop 2.7 and later in step 2. Then, click the link in step 3 to download the spark-2.3.2-bin-hadoop2.7.tgz file. Extract the archive, and the resulting folder contains a complete Spark package.
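Alternatively, on macOS or Linux, the download and extraction can be done directly in Terminal. The URL below follows the Apache release archive layout; adjust it if you download from a different mirror:

wget https://archive.apache.org/dist/spark/spark-2.3.2/spark-2.3.2-bin-hadoop2.7.tgz
tar -xzf spark-2.3.2-bin-hadoop2.7.tgz
cd spark-2.3.2-bin-hadoop2.7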

Before running any Spark program, we need to make sure the following dependencies are installed:

  • Java 8+, with its location included in the system environment variables (for example, on the PATH)
  • Scala version 2.11
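Both can be verified quickly from Terminal; the exact version strings will vary with your installation:

java -version     # should report version 1.8 or later
scala -version    # should report Scala 2.11.x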

To check whether Spark is installed properly, we run the following tests:

  1. First, we approximate the value of π using Spark by typing the following command in Terminal (note that bin is a folder inside spark-2.3.2-bin-hadoop2.7):
./bin/run-example SparkPi 10
  2. It should print out something similar to the following (the exact value may differ):
Pi is roughly 3.141851141851142

This test is effectively the same as submitting the Python version of the example with spark-submit:

./bin/spark-submit examples/src/main/python/pi.py 10
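The pi.py script estimates π with a Monte Carlo method: it scatters random points over the unit square and counts the fraction that fall inside the quarter circle, which approximates π/4. The following is a minimal sketch of that idea (not the script shipped with Spark itself), assuming it is run with spark-submit so that pyspark is importable:

from random import random
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PiSketch").getOrCreate()

partitions = 10           # matches the "10" argument used above
n = 100000 * partitions   # total number of random points to draw

def inside(_):
    # Draw a point uniformly in the unit square; it lands inside the
    # quarter circle when x^2 + y^2 <= 1.
    x, y = random(), random()
    return 1 if x * x + y * y <= 1 else 0

# The fraction of points inside the quarter circle approximates pi/4.
count = spark.sparkContext.parallelize(range(n), partitions).map(inside).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))
spark.stop()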
  3. Next, we test the interactive shell with the following command:
./bin/pyspark --master local[2]

This should open a Python interpreter, as shown in the following screenshot:
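Once the shell is up, a SparkContext is available as sc. As a quick smoke test, we can run a small computation right in the interpreter (the prompt and log output may look slightly different):

>>> sc.parallelize(range(100)).sum()
4950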

By now, Spark should be installed properly. We will talk about these commands (pyspark and spark-submit) in the following sections.
