The architecture of Spark

Spark consists of three primary architectural components:

  • The SparkSession / SparkContext
  • The Cluster Manager
  • The Worker Nodes (which host the executor processes)

The SparkSession/SparkContext, or more generally the Spark Driver, is the entry point for all Spark applications, as discussed earlier. The SparkContext is used to create RDDs and perform operations against them. The Spark Driver schedules the resulting tasks and dispatches them to the executors on the worker nodes.
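A minimal PySpark sketch of this entry point is shown below; the application name and the sample data are illustrative only:

    # Minimal sketch of the Spark entry point (names and data are illustrative).
    from pyspark.sql import SparkSession

    # The SparkSession wraps the SparkContext; both live in the driver process.
    spark = SparkSession.builder.appName("driver-example").getOrCreate()
    sc = spark.sparkContext

    # The driver defines RDDs and the transformations/actions on them; the
    # resulting tasks are scheduled onto executors on the worker nodes.
    rdd = sc.parallelize(range(1, 11))
    squares = rdd.map(lambda x: x * x)   # transformation (lazy)
    print(squares.sum())                 # action (triggers execution)

    spark.stop()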

The Cluster Manager is conceptually similar to the Resource Manager in Hadoop, and indeed one of the supported options is YARN. Another supported cluster manager is Mesos. Spark can also operate in standalone mode, in which case neither YARN nor Mesos is required. The Cluster Manager coordinates communication between the worker nodes, manages the nodes (starting, stopping, and so on), and performs other administrative tasks.
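The choice of cluster manager is commonly expressed through the master URL when the session is built. A hedged sketch follows; the hostnames and ports are placeholders, not real endpoints:

    # Selecting a cluster manager via the master URL (hostnames are placeholders).
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("cluster-manager-example")
             # Standalone master; alternatives include "yarn",
             # "mesos://mesos-host:5050", or "local[*]" for no cluster manager.
             .master("spark://master-host:7077")
             .getOrCreate())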

Worker nodes are the servers on which Spark applications are hosted. Each application gets its own dedicated executor processes, which perform the actual transformation and action tasks. Because executors are dedicated to a single application, Spark ensures that an issue in one application does not impact other applications. A worker node hosts the executor processes (each running in a JVM), along with any Python/R/other application processes required by the Spark application. Note that in the case of Hadoop, the worker nodes and the data nodes are one and the same.
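Executor resources for an application can be requested when the session is built, as sketched below; the values are illustrative rather than recommendations:

    # Requesting dedicated executors for this application (values are illustrative).
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("executor-config-example")
             .config("spark.executor.instances", "4")   # number of executor processes
             .config("spark.executor.cores", "2")       # cores per executor
             .config("spark.executor.memory", "2g")     # JVM heap per executor
             .getOrCreate())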
