Spark components

Before moving any further, let's first understand the common terminology associated with Spark:

  • Driver: This is the main program that oversees the end-to-end execution of a Spark job or program. It negotiates resources with the cluster's resource manager, and delegates and orchestrates the program into the smallest possible data-local parallel units of work.
  • Executors: In any Spark job there can be one or more executors, that is, processes that execute the smaller tasks delegated by the driver. The executors process the data, preferably local to the node, and store the results in memory, on disk, or both.
  • Master: Apache Spark is implemented on a master-slave architecture, and hence master refers to the cluster node executing the driver program.
  • Slave: In a distributed cluster mode, slave refers to the nodes on which the executors run, and hence there can be (and usually is) more than one slave in the cluster.
  • Job: This is a collection of operations performed on a set of data. A typical word count job involves reading a text file from an arbitrary source, splitting the lines into words, and then aggregating the words into counts (see the sketch after this list).
  • DAG: Any Spark job in the Spark engine is represented by a DAG (directed acyclic graph) of operations. The DAG represents the logical execution of Spark operations in sequential order. Re-computation of an RDD in case of a failure is possible because its lineage can be derived from the DAG.
  • Tasks: A job can be split into smaller units that are operated upon in isolation, called tasks. Each task is executed by an executor on a partition of the data.
  • Stages: Spark jobs can be divided logically into stages, where each stage represents a set of tasks with the same shuffle dependencies, that is, tasks bounded by the points where data shuffling occurs. In a shuffle map stage, the tasks' results are input for the next stage, whereas in a result stage the tasks compute the action that started the evaluation of the Spark job, such as take(), foreach(), and collect().
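
To make these terms concrete, the following is a minimal Scala sketch of the word count job mentioned in the Job bullet above. It is illustrative only: the application name, the local[*] master URL, and the input path input.txt are assumptions, not values from this chapter. The flatMap and map calls are narrow transformations that stay within one stage, reduceByKey introduces a shuffle boundary that starts a new stage, and collect() is the action that makes the driver submit the job.

    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal word count sketch; names and paths are illustrative assumptions.
    object WordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
        val sc = new SparkContext(conf)                // the driver's handle to the cluster

        val lines  = sc.textFile("input.txt")          // one partition per input split
        val words  = lines.flatMap(_.split("\\s+"))    // narrow transformation
        val pairs  = words.map(word => (word, 1))      // narrow transformation
        val counts = pairs.reduceByKey(_ + _)          // wide transformation: shuffle boundary

        // collect() is the action that triggers the job; everything above is lazy
        counts.collect().foreach { case (word, count) => println(s"$word: $count") }

        sc.stop()
      }
    }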

The following diagram shows a logical representation of how the different components of a Spark application interact:

How a Spark job gets executed:

  • A Spark job can comprise a series of operations performed upon a set of data. However big or small a Spark job may be, it requires a SparkContext to execute. In the previous examples of working with the REPL, one would have noticed the use of a predefined variable called sc, which is how the SparkContext is accessible in a REPL environment.
  • SparkContext creates an operator graph of the different transformations of the job, but once an action gets called on one of those transformations, the graph gets submitted to the DAGScheduler. Depending on whether the RDDs are produced by narrow transformations or wide transformations (those that require a shuffle operation), the DAGScheduler produces stages (see the REPL sketch after this list).
  • The DAGScheduler splits the DAG in such a way that each stage comprises tasks with the same shuffle dependency, bounded by common shuffle boundaries. Also, a stage can either be a shuffle map stage, in which case its tasks' results are input for another stage, or a result stage, in which case its tasks directly compute the action that initiated the job, for example, count().
  • Stages are then submitted to the TaskScheduler as TaskSets by the DAGScheduler. The TaskScheduler schedules the TaskSets via the cluster manager (YARN, Mesos, or Spark standalone) and monitors their execution. If any task fails, it is rerun, and finally the results are sent to the DAGScheduler. If the result output files are lost, the DAGScheduler resubmits such stages to the TaskScheduler to be rerun.
  • Tasks are then scheduled on the designated executors (JVMs running on slave nodes), meeting the resource and data locality constraints. Each executor can also have more than one task assigned.
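
The stage boundaries described above can be observed directly from the REPL. The following sketch assumes the spark-shell, where sc is already defined, and uses an illustrative input path. toDebugString prints the RDD's lineage (the DAG), with indentation marking the shuffle boundaries that the DAGScheduler turns into separate stages, while the take(5) action is what actually causes stages and tasks to be scheduled.

    // Assumes the spark-shell REPL, where sc is predefined; the input path is illustrative.
    val counts = sc.textFile("input.txt")
      .flatMap(_.split("\\s+"))      // narrow: pipelined within the same stage
      .map(word => (word, 1))        // narrow: pipelined within the same stage
      .reduceByKey(_ + _)            // wide: shuffle boundary, new stage

    println(counts.toDebugString)    // inspect the lineage (DAG) before any action runs

    counts.take(5)                   // action: DAGScheduler builds stages, TaskScheduler runs tasks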

The following diagram provides a logical representation of the different phases of Spark job execution:

In this section, we became familiar with the different components of a Spark job. In the next section, we will learn about the capabilities of the Spark driver's UI.
