Big data is the topic on everyone's mind, so I thought it would be good to see what can be done with big data in Jupyter. An up-and-coming tool for dealing with large datasets is Spark, an open source big data processing framework. Spark can run over Hadoop, in the cloud, or standalone, and we can write Spark code in Jupyter much as we have with the other languages we have seen.
In this chapter, we will cover installing and configuring Spark, and accessing it from a Jupyter notebook, on both Mac and Windows.
One of the tools we will be using is Apache Spark. Spark is an open source toolset for cluster computing. While we will not be using a cluster here, typical Spark usage involves a larger set of machines, or cluster, operating in parallel to analyze a big data set. An installation guide is available at https://www.dataquest.io/blog/pyspark-installation-guide. In particular, you will need to add two settings to your bash profile: SPARK_HOME and PYSPARK_SUBMIT_ARGS. SPARK_HOME is the directory where the software is installed, while PYSPARK_SUBMIT_ARGS sets the number of cores to use in the local cluster.
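For example, the two entries might look like the following in your bash profile. This is only a sketch: the path and the core count (local[2], meaning two cores) are illustrative values rather than requirements, and note the trailing pyspark-shell token, which Spark expects in this variable:
export SPARK_HOME="/Applications/spark-2.0.0-bin-hadoop2.7"
export PYSPARK_SUBMIT_ARGS="--master local[2] pyspark-shell"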
To install Spark, we download the latest TGZ file from the Spark download page at https://spark.apache.org/downloads.html, unpack it, and move the unpacked directory to our Applications folder.
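On a Mac, the unpack-and-move step might look like this in a terminal; the file name will vary with the version you downloaded:
tar xzf spark-2.0.0-bin-hadoop2.7.tgz
mv spark-2.0.0-bin-hadoop2.7 /Applications/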
Spark requires Scala to be available. We installed Scala in Chapter 7, Sharing and Converting Jupyter Notebooks.
Open a command-line window in the Spark directory and run this command:
brew install sbt
This may take a while.
Now set the configuration for Spark (for Mac) in your .bash_profile file:
# location of spark code
export SPARK_HOME="/Applications/spark-2.0.0-bin-hadoop2.7"
# machine to run on
export SPARK_MASTER_IP=127.0.0.1
export SPARK_LOCAL_IP=127.0.0.1
# python location
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.1-src.zip:$PYTHONPATH
Note that the paths used here must correspond to your own installation.
You should now be able to run this command (from inside your Spark directory) to open a Spark command-line session:
bin/pyspark
It looks something like this (depending on the version):
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
      /_/

Using Python version 2.7.12 (default, Jul 2 2016 17:43:17)
SparkSession available as 'spark'.
>>>
You can enter quit() to exit.
Now, when we run a notebook using a Python kernel, we can access Spark.
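As a quick test, a first notebook cell might create a context and run a small job. This is a minimal sketch: the application name is arbitrary, and if your notebook was instead launched through the pyspark command, a context already exists as sc and you should not create a second one:
# create a local SparkContext; "JupyterTest" is an arbitrary app name
from pyspark import SparkContext
sc = SparkContext("local", "JupyterTest")
rdd = sc.parallelize(range(1, 11))   # distribute a small dataset
squares = rdd.map(lambda x: x * x)   # transform each element in parallel
print(squares.collect())             # [1, 4, 9, 16, ..., 100]
print(squares.sum())                 # 385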
We have already installed Python as part of the Jupyter installation much earlier in this book. We need to download and install the latest Spark version from http://spark.apache.org/downloads.html, unpack the TGZ file, and move the resulting directory to C:\spark.
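Assuming the TGZ file has already been unpacked (for example, with a tool such as 7-Zip), the move might be done from a command prompt as follows; the version number here is illustrative:
move spark-2.0.0-bin-hadoop2.7 C:\spark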
You will need to have winutils.exe available as well (this seems to be a gap in the Hadoop installation on Windows, but it may get fixed at some point). Download the file from http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe and install it at C:\winutils\bin.
We now need to set up environment variables for all of these:
HADOOP_HOME=C:\winutils
SPARK_HOME=C:\spark
PYSPARK_DRIVER_PYTHON=ipython
PYSPARK_DRIVER_PYTHON_OPTS=notebook
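One way to set these persistently is with the setx command from a command prompt (they can also be set through the System Properties dialog); a sketch:
setx HADOOP_HOME C:\winutils
setx SPARK_HOME C:\spark
setx PYSPARK_DRIVER_PYTHON ipython
setx PYSPARK_DRIVER_PYTHON_OPTS notebook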
You can start Jupyter using the pyspark command. You should not notice anything different about your notebook.
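One way to convince yourself that Spark is wired in is to ask it for its version and run a trivial job in a cell; sc is predefined here because the notebook was launched through pyspark:
# sc is the SparkContext that pyspark created for us
print(sc.version)                             # for example, 2.0.0
print(sc.parallelize([1, 2, 3, 4]).count())   # 4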