Setting up PySpark on Python IDEs

We can also configure and run PySpark from Python IDEs such as PyCharm. In this section, we will show how to do it. If you're a student, you can get a free licensed copy of PyCharm by registering with your university/college/institute email address at https://www.jetbrains.com/student/. Moreover, there is also a community (that is, free) edition of PyCharm, so you don't need to be a student in order to use it.

Recently, PySpark was published on PyPI with the Spark 2.2.0 release (see https://pypi.python.org/pypi/pyspark/). This has been a long time coming: previous releases included pip-installable artifacts that, for a variety of reasons, couldn't be published to PyPI. So if you (or your friends) want to work with PySpark locally on your laptop, you now have an easier path to getting started; just execute one of the following commands:

$ sudo pip install pyspark # for python 2.7 
$ sudo pip3 install pyspark # for python 3.3+
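
To quickly confirm that the installation worked, you can import the package and print its version from a short script (it should report 2.2.0 if you picked up the release mentioned above):

# Quick sanity check after pip install
import pyspark
print(pyspark.__version__)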

However, if you are using Windows 7, 8, or 10, you should install PySpark manually. For example, using PyCharm, you can do it as follows:

Figure 2: Installing PySpark on the PyCharm IDE on Windows 10

First, create a Python script with the project interpreter set to Python 2.7+. Then you can import pyspark along with the other required modules as follows:

import os
import sys
import pyspark
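
If you installed Spark manually (by unpacking a Spark distribution) rather than via pip, the pyspark package will not be on the interpreter's path by default. The following is only a sketch of how you might wire it up at the top of the script, assuming Spark is unpacked at the path used later in this section (adjust it to your own location):

import glob
import os
import sys

# Path to the unpacked Spark distribution (adjust to your own location)
SPARK_HOME = r"C:\Users\admin-karim\Downloads\spark-2.1.0-bin-hadoop2.7"
os.environ["SPARK_HOME"] = SPARK_HOME

# Make the bundled PySpark and Py4J sources importable
sys.path.insert(0, os.path.join(SPARK_HOME, "python"))
sys.path.extend(glob.glob(os.path.join(SPARK_HOME, "python", "lib", "py4j-*.zip")))

import pyspark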

Note that if you're a Windows user, Python also needs the Hadoop runtime; you should put the winutils.exe file in the SPARK_HOME/bin folder. Then create an environment variable as follows:

Select your Python file | Run | Edit Configurations | create an environment variable whose key is HADOOP_HOME and whose value is the Spark installation path; for example, in my case it's C:\Users\admin-karim\Downloads\spark-2.1.0-bin-hadoop2.7. Finally, press OK and you're done:

Figure 3: Setting the Hadoop runtime environment variable on the PyCharm IDE on Windows 10
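
Alternatively, if you prefer not to touch the run configuration, you can set the same variable from inside the script before any Spark code runs; this is just a sketch of the same idea, reusing the path from my setup:

import os

# Equivalent to the HADOOP_HOME entry in the run configuration;
# winutils.exe is expected under %HADOOP_HOME%\bin
os.environ["HADOOP_HOME"] = r"C:\Users\admin-karim\Downloads\spark-2.1.0-bin-hadoop2.7"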

That's all you need. Now, when you start writing Spark code, you should first place the imports in a try block, as follows (just for example):

try:
    from pyspark.ml.feature import PCA
    from pyspark.ml.linalg import Vectors
    from pyspark.sql import SparkSession
    print("Successfully imported Spark Modules")

And the except block can be placed as follows:

except ImportError as e:
    print("Cannot import Spark Modules", e)
    sys.exit(1)

Refer to the following figure that shows importing and placing Spark packages in the PySpark shell:

Figure 4: Importing and placing Spark packages in PySpark shell

If these blocks execute successfully, you should observe the following message on the console:

Figure 5: PySpark package has been imported successfully
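
With the imports working, you can go one step further and verify the whole setup end to end by creating a local SparkSession and running a trivial job; the following is only an illustrative sketch (the application name and sample data are arbitrary):

# Start a local SparkSession and run a trivial job to confirm the setup
spark = SparkSession.builder \
    .appName("PySparkOnPyCharm") \
    .master("local[*]") \
    .getOrCreate()

# A tiny DataFrame using the Vectors class imported above
df = spark.createDataFrame([(Vectors.dense([1.0, 0.0, 3.0]),)], ["features"])
df.show()

spark.stop()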