Spark architecture

A Spark cluster is a set of processes distributed across different machines. The Driver Program is the process, such as a Scala or Python interpreter, that the user employs to submit the tasks to be executed.

Using a dedicated API, the user can build task graphs (similar to Dask) and submit them to the Cluster Manager, which is responsible for assigning these tasks to Executors, the processes that actually execute them. In a multi-user system, the Cluster Manager is also responsible for allocating resources on a per-user basis.

The user interacts with the Cluster Manager through the Driver Program. The class responsible for communication between the user and the Spark cluster is called SparkContext. This class connects to the cluster and configures the Executors according to the resources available to the user.
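As a minimal sketch, a SparkContext can be created from the Driver Program as follows (the local[2] master URL and the application name are placeholders; on a real cluster, the master URL would point at the Cluster Manager):

    from pyspark import SparkConf, SparkContext

    # Placeholder configuration: run locally with two worker threads.
    conf = (SparkConf()
            .setMaster("local[2]")
            .setAppName("spark-architecture-demo"))
    sc = SparkContext(conf=conf)

    print(sc.version)  # confirms that the Driver Program reached the cluster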

For its most common use cases, Spark manages its data through a data structure called the Resilient Distributed Dataset (RDD), which represents a collection of items. RDDs are capable of handling massive datasets by splitting their elements into partitions and operating on the partitions in parallel (note that this mechanism is largely hidden from the user). RDDs can also be stored in memory (optionally, and when appropriate) for fast access and to cache expensive intermediate results.
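Continuing from the SparkContext created above, the following sketch shows how an RDD splits a collection into partitions and how it can be kept in memory (the partition count is arbitrary):

    # Distribute a local collection over 4 partitions (an arbitrary choice).
    rdd = sc.parallelize(range(1000), numSlices=4)
    print(rdd.getNumPartitions())  # -> 4

    # Optionally keep the RDD in memory to reuse an expensive intermediate result.
    rdd.cache()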

Using RDDs, it is possible to define tasks and transformations (similar to how we automatically generated computation graphs in Dask) and, when a result is requested, the Cluster Manager will automatically dispatch and execute the tasks on the available Executors.
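The following sketch (reusing the rdd defined above) illustrates this: transformations such as map and filter only extend the task graph, while an action such as reduce triggers execution on the Executors:

    squares = rdd.map(lambda x: x * x)                   # transformation: lazy
    even_squares = squares.filter(lambda x: x % 2 == 0)  # transformation: lazy

    # Action: the task graph is shipped to the Executors and evaluated.
    total = even_squares.reduce(lambda a, b: a + b)
    print(total)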

The Executors receive the tasks from the Cluster Manager, execute them, and keep the results around if needed. Note that an Executor can have multiple cores and each node in the cluster may have multiple Executors. Generally speaking, Spark is fault tolerant to Executor failures.
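As a hedged sketch, Executor resources can be requested from Python through SparkConf properties; the values below are placeholders rather than tuning recommendations:

    from pyspark import SparkConf

    conf = (SparkConf()
            .setAppName("executor-config-demo")
            .set("spark.executor.instances", "8")  # number of Executors
            .set("spark.executor.cores", "4")      # cores per Executor
            .set("spark.executor.memory", "4g"))   # memory per Executor
    # The conf object would then be passed to SparkContext(conf=conf), as shown earlier.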

In the following diagram, we show how the aforementioned components interact in a Spark cluster. The Driver Program interacts with the Cluster Manager, which manages the Executor instances on different nodes (each Executor instance can also have multiple threads). Note that, even though the Driver Program doesn't directly control the Executors, the results, which are stored in the Executor instances, are transferred directly between the Executors and the Driver Program. For this reason, it's important that the Driver Program is network-reachable from the Executor processes:

A natural question to ask is: how is Spark, which is written in Scala, able to execute Python code? The integration is done through the Py4J library, which maintains a Python process under the hood and communicates with it through sockets (a form of interprocess communication). In order to run the tasks, Executors maintain a series of Python processes so that they are able to process Python code in parallel.
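As a small, hedged illustration of this bridge, the SparkContext exposes the underlying JVM through a Py4J gateway; the attribute used below is internal to PySpark and not part of the public API:

    # sc._jvm is an internal attribute, shown only to make the Python-to-JVM
    # communication visible; every call travels over a Py4J socket.
    jvm_millis = sc._jvm.java.lang.System.currentTimeMillis()
    print(jvm_millis)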

RDDs and variables defined in a Python process in the Driver Program are serialized, and the communication between the Cluster Manager and the Executors (including shuffling) is handled by Spark's Scala code. The extra serialization steps necessary for the Python-Scala interchange all add to the communication overhead; therefore, when using PySpark, extra care must be taken to ensure that the data structures in use are serialized efficiently and that the data partitions are large enough that the cost of communication is negligible compared to the cost of execution.
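As a hedged sketch, partition granularity can be adjusted explicitly so that each task performs enough work to amortize the serialization overhead (the partition counts below are placeholders):

    # Too many tiny partitions: serialization and communication dominate.
    fine_grained = sc.parallelize(range(1_000_000), numSlices=1000)

    # Merge into fewer, larger partitions (coalesce avoids a full shuffle).
    coarser = fine_grained.coalesce(16)
    print(coarser.getNumPartitions())  # -> 16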

The following diagram illustrates the additional Python processes needed for PySpark execution. These additional Python processes come with associated memory costs and add an extra layer of indirection that complicates error reporting:

Despite these drawbacks, PySpark is still a widely used tool because it bridges the vibrant Python ecosystem with the industrial strength of the Hadoop infrastructure.
