Chapter 10. Jupyter and Big Data

Big data is the topic on everyone's mind, so I thought it would be good to see what can be done with big data in Jupyter. An up-and-coming tool for dealing with large datasets is Spark, an open source big data processing framework. Spark can run over Hadoop, in the cloud, or standalone, and we can use Spark in Jupyter much like the other languages we have seen.

In this chapter, we will cover the following topics:

  • Installing Spark for use in Jupyter
  • Using Spark's features

Apache Spark

One of the tools we will be using is Apache Spark, an open source toolset for cluster computing. While we will not be using a cluster here, Spark is typically run on a set of machines operating in parallel to analyze a big dataset. An installation guide is available at https://www.dataquest.io/blog/pyspark-installation-guide. In particular, you will need to add two settings to your bash profile: SPARK_HOME, the directory where the software is installed, and PYSPARK_SUBMIT_ARGS, which sets the number of cores to use in the local cluster.

Mac installation

To install, we download the latest TGZ file from the Spark download page at https://spark.apache.org/downloads.html, unpack it, and move the unpacked directory to our Applications folder.

Spark relies on Scala being available. We installed Scala in Chapter 7, Sharing and Converting Jupyter Notebooks.

Open a command-line window to the Spark directory and run this command:

brew install sbt

This may take a while.

Now set the configuration for Spark (for Mac) in your .bash_profile file:

# location of spark code
export SPARK_HOME="/Applications/spark-2.0.0-bin-hadoop2.7"
# machine to run on
export SPARK_MASTER_IP=127.0.0.1
export SPARK_LOCAL_IP=127.0.0.1
# python location
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.1-src.zip:$PYTHONPATH
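
The PYSPARK_SUBMIT_ARGS setting mentioned earlier can go in the same file if you want to control how many local cores PySpark uses. The two-core value below is only an illustration; adjust it for your machine:

# how pyspark should be launched: two local cores in this example
export PYSPARK_SUBMIT_ARGS="--master local[2] pyspark-shell"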

Note that the paths used should correspond to your installation.

You should now be able to run this command (from inside your Spark directory) to open a Spark command-line session:

bin/pyspark

It looks something like this (depending on the version):

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
      /_/
Using Python version 2.7.12 (default, Jul  2 2016 17:43:17)
SparkSession available as 'spark'.
>>>

You can enter quit() to exit.
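
Before moving on to notebooks, you can try a quick sanity check from this prompt. The pyspark shell creates a SparkContext for you as sc (alongside the spark session shown in the banner), so a small job such as the following should run:

>>> sc.parallelize(range(100)).count()
100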

Now, when we run a notebook using a Python kernel, we can access Spark.
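
For example, with the environment variables above in place, a notebook cell along these lines should work. The file name, master setting, and application name here are only assumptions for illustration:

from pyspark import SparkContext

# create a local context on two cores; the application name is arbitrary
sc = SparkContext("local[2]", "jupyter-test")

# a small word count over a hypothetical text file
lines = sc.textFile("sample.txt")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
print(counts.take(5))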

Windows installation

We have already installed Python as part of the Jupyter installation much earlier in this book. We need to download and install the latest Spark version from http://spark.apache.org/downloads.html. Unpack the TGZ file and move the resulting directory to C:\spark.

You will need to have winutils.exe available as well (this seems to be a gap in the Hadoop distribution for Windows, which may be fixed at some point). Download the file from http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe and place it in C:\winutils\bin.

Now we need to set up environment variables for all of these:

HADOOP_HOME=C:\winutils
SPARK_HOME=C:\spark
PYSPARK_DRIVER_PYTHON=ipython
PYSPARK_DRIVER_PYTHON_OPTS=notebook

You can start Jupyter using the pyspark command. You should not notice anything different about your notebook.

Note

We are using the Python script to invoke Spark functionality, so the language format needs to conform to Python.
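
For instance, in a notebook started with the pyspark command, the driver should already provide a SparkContext as sc, and ordinary Python code can drive it. The small RDD below is just an illustration:

# sc is created for us by the pyspark driver; square a few numbers
data = sc.parallelize([1, 2, 3, 4, 5])
print(data.map(lambda x: x * x).collect())  # [1, 4, 9, 16, 25]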
