Introduction to PySpark

Python is one of the most popular general-purpose programming languages, with many features well suited to data processing and machine learning tasks. To use Spark from Python, PySpark was developed as a lightweight Python frontend to Apache Spark, driving Spark's distributed computation engine. In this chapter, we discuss a few technical aspects of using Spark from a Python IDE such as PyCharm.

Many data scientists use Python because it has a rich variety of numerical libraries with a statistical, machine learning, or optimization focus. However, processing large-scale datasets in pure Python is usually tedious because the standard runtime is effectively single-threaded, which limits processing to data that fits in the main memory of a single machine. To overcome this limitation and bring the full power of Spark to Python, PySpark exposes Spark's distributed computation engine through a Python API. In this way, Spark provides APIs in non-JVM languages such as Python.

The purpose of this PySpark section is to present basic distributed algorithms using PySpark. Note that the PySpark shell is an interactive tool intended for basic testing and debugging; it is not meant for production environments.
