Data Mining and SQL Queries

PySpark exposes the Spark programming model to Python. Spark is a fast, general-purpose engine for large-scale data processing. Since Jupyter runs Python, we can use Spark from within a Jupyter Notebook.
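To make this concrete, here is a minimal sketch of using Spark from a Python (Jupyter) session, assuming PySpark is already installed and importable; the application name is an arbitrary choice:

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession, the entry point to Spark's DataFrame API.
spark = SparkSession.builder \
    .appName("JupyterExample") \
    .getOrCreate()

# Quick sanity check: build a tiny DataFrame and display it.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()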

Installing Spark requires the following components to be installed on your machine:

  • JDK (Java)
  • Python
  • winutils (the Hadoop binaries for Windows)
  • Spark

Then set environment variables that point to the locations of these components:

  • JAVA_HOME: The directory where the JDK is installed
  • PYTHONPATH: The directory where Python is installed
  • HADOOP_HOME: The directory where winutils resides (Windows only)
  • SPARK_HOME: The directory where Spark is installed

These components are readily available on the internet for a variety of operating systems. I have successfully installed them in both Windows and Mac environments.
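If you prefer, these variables can also be set from within Python at the top of a notebook, before Spark starts; a minimal sketch follows, where the paths are hypothetical placeholders for your own install locations:

import os

# Hypothetical install locations -- replace with your actual paths.
os.environ["JAVA_HOME"] = r"C:\Java\jdk1.8.0"
os.environ["HADOOP_HOME"] = r"C:\hadoop"   # the directory containing winutils
os.environ["SPARK_HOME"] = r"C:\spark"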

Once you have these installed, you should be able to run the pyspark command from a command-line window, and a Jupyter Notebook with Python (with access to Spark) can then be used. In my installation, I used the command:

pyspark  

since I had installed Spark in a spark directory at the root of my drive. Yes, pyspark is a built-in tool that ships with Spark.
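As a quick check (assuming a standard installation), the pyspark shell pre-creates a SparkSession named spark and a SparkContext named sc, so you can confirm everything is wired up with a couple of commands:

# Inside the pyspark shell, `spark` and `sc` already exist.
print(spark.version)     # prints the installed Spark version

# Run a trivial job end to end to confirm the engine works.
spark.range(5).show()    # displays the numbers 0 through 4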
