PySpark uses a Python-based SparkContext and runs Python scripts as tasks, then uses sockets and pipes to the spawned Python processes to communicate between the Java-based Spark cluster and those scripts. PySpark also bundles Py4J, a popular library that lets the Python driver dynamically interface with Java objects in the JVM, such as the Java-based RDDs.
Python must be installed on all worker nodes running the Spark executors.
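The driver-to-worker mechanism described above can be sketched in plain Python. The following is a hypothetical, simplified illustration (not PySpark's actual implementation): the "driver" serializes a task and a data partition, pipes the bytes to a child Python process, and reads the serialized result back. Real PySpark ships actual Python closures (using a pickle variant that can serialize functions) and the JVM executor launches the worker and communicates over sockets, but the serialize, pipe, execute, and return loop is the same idea. Here the task is shipped as a simple expression string to keep the sketch self-contained:

```python
import pickle
import subprocess
import sys

# Code run inside the child "worker" process: read a pickled (task, data)
# pair from stdin, evaluate the task over the data, and write the pickled
# result back to stdout.
WORKER_SCRIPT = r"""
import pickle, sys
expr, data = pickle.load(sys.stdin.buffer)           # receive the shipped task and partition
result = [eval(expr, {"x": x}) for x in data]        # run the Python task element by element
pickle.dump(result, sys.stdout.buffer)               # send the result back over the pipe
"""

def run_on_worker(expr, data):
    """Ship a task (here: an expression over ``x``) and a data partition
    to a separate Python process over pipes, mimicking how a Spark
    executor hands work to a Python worker."""
    proc = subprocess.run(
        [sys.executable, "-c", WORKER_SCRIPT],
        input=pickle.dumps((expr, data)),
        capture_output=True,
        check=True,
    )
    return pickle.loads(proc.stdout)

if __name__ == "__main__":
    print(run_on_worker("x * x", [1, 2, 3]))
```

This also makes clear why Python must be present on every worker node: each executor has to be able to start a local Python interpreter to run the shipped task.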
The following diagram shows how PySpark works by communicating between Java processes and Python scripts:
*(Figure: communication between the Java-based Spark cluster and Python scripts in PySpark)*