PySpark exposes the Spark programming model to Python. Spark is a fast, general-purpose engine for large-scale data processing. Since Jupyter runs Python, we can therefore use Spark from within a Jupyter Notebook.
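As a taste of what that looks like, here is a minimal sketch of a notebook cell that runs a small Spark job locally; the application name is just an illustration, and it assumes the pyspark package is importable (the setup steps below make that possible):

# Minimal Spark job from a Jupyter cell; runs on the local machine.
from pyspark import SparkContext

sc = SparkContext("local[*]", "JupyterExample")  # hypothetical app name
rdd = sc.parallelize(range(1, 101))   # distribute the numbers 1..100
print(rdd.sum())                      # 5050, computed by Spark
sc.stop()                             # release the context when done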
Installing Spark requires the following components to be present on your machine:
- Java JDK.
- Scala from http://www.scala-lang.org/download/.
- Python: I recommend downloading Anaconda, which bundles Python (from http://continuum.io).
- Spark from https://spark.apache.org/downloads.html.
- winutils: This is a command-line utility that provides the POSIX-style file operations Hadoop (and therefore Spark) expects when running on Windows. There are 32-bit and 64-bit versions available at:
  - 32-bit winutils.exe at https://code.google.com/p/rrd-hadoop-win32/source/checkout
  - 64-bit winutils.exe at https://github.com/steveloughran/winutils/tree/master/hadoop-2.6.0/bin
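Before going further, it can help to confirm that each tool actually launches from the command line. The following is just a quick sketch, not part of any installer; it shells out to each tool's standard version flag:

import subprocess

# Launch each tool with its version flag; an OSError means the
# executable is not on the PATH. On Windows, tools installed as
# .bat scripts (such as scala) may need shell=True instead.
for tool, flag in [("java", "-version"), ("scala", "-version"), ("python", "--version")]:
    try:
        subprocess.call([tool, flag])
    except OSError:
        print("Not found on PATH:", tool)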
Then set environment variables that point to the locations of the preceding components:
- JAVA_HOME: The directory where you installed the JDK (the JDK root, not its bin subdirectory)
- PYTHONPATH: The directory where Python was installed
- HADOOP_HOME: The directory containing a bin subdirectory in which winutils.exe resides
- SPARK_HOME: The directory where Spark is installed
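If you would rather not change system-wide settings, these variables can also be set from inside Python itself, for example at the top of a notebook. The sketch below shows the idea; every path is a placeholder for your own install location, and the py4j version embedded in the zip file name varies by Spark release:

import os, sys

# Placeholder paths -- substitute wherever you actually installed each component.
os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk1.8.0"
os.environ["HADOOP_HOME"] = r"C:\hadoop"   # must contain bin\winutils.exe (Windows only)
os.environ["SPARK_HOME"] = r"C:\spark"

# Make the pyspark package importable from plain Python or Jupyter.
spark_home = os.environ["SPARK_HOME"]
sys.path.insert(0, os.path.join(spark_home, "python"))
# The py4j version in this file name differs between Spark releases.
sys.path.insert(0, os.path.join(spark_home, "python", "lib", "py4j-0.9-src.zip"))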
These components are readily available on the internet for a variety of operating systems. I have successfully installed the preceding components in both a Windows environment and a Mac environment.
Once you have these installed, you should be able to run the pyspark command from a command-line window, giving you a Python session with access to Spark; the same access then works from a Jupyter Notebook. In my installation, I used the command:
pyspark
since I had installed Spark in a spark directory at the root of my drive. Note that pyspark is a tool that ships with Spark itself, so no separate installation is required.
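As a quick smoke test once the shell is up: the pyspark shell pre-creates a SparkContext named sc, so a one-liner such as the following (output shown for illustration) confirms the whole stack works:

>>> sc.parallelize([1, 2, 3, 4]).map(lambda x: x * x).collect()
[1, 4, 9, 16]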