Dig deeper into Apache Spark

Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs that allow data workers to efficiently execute streaming, machine learning, or SQL workloads requiring fast, interactive access to datasets. Apache Spark consists of Spark core and a set of libraries. The core is the distributed execution engine, and the Java, Scala, and Python APIs offer a platform for distributed application development.

Additional libraries built on top of the core support workloads for streaming, SQL, graph processing, and machine learning. Spark ML, for instance, is designed for data science, and its higher-level abstractions make data science work easier.
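
To make that concrete, here is a minimal Scala sketch (the application name, column names, and data are purely illustrative) that mixes the core RDD API, Spark SQL, and Spark ML in a single Spark 2.x application; it uses the SparkSession entry point discussed later in this section:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.Tokenizer

// One application can mix the core, SQL, and ML libraries.
val spark = SparkSession.builder()
  .appName("libraries-sketch")
  .master("local[*]")                       // local mode, purely for illustration
  .getOrCreate()
import spark.implicits._

// Core: an RDD created from a local collection
val rdd = spark.sparkContext.parallelize(Seq("spark", "core", "api"))

// Spark SQL: a DataFrame and a SQL query
val df = Seq((1, "streaming"), (2, "sql"), (3, "machine learning")).toDF("id", "workload")
df.createOrReplaceTempView("workloads")
spark.sql("SELECT workload FROM workloads WHERE id > 1").show()

// Spark ML: a simple feature transformer from the ML library
new Tokenizer().setInputCol("workload").setOutputCol("words").transform(df).show()
```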

In order to plan and carry out distributed computations, Spark uses the concept of a job, which is executed across the worker nodes as stages and tasks. Spark consists of a Driver, which orchestrates the execution across a cluster of worker nodes. The Driver is also responsible for tracking all the worker nodes and the work currently being performed by each of them.

Let's look into the various components a little more. The key components are the Driver and the Executors, all of which are JVM processes (Java processes):

  • Driver: The Driver program contains the application's main program. If you are using the Spark shell, the shell becomes the Driver program. The Driver launches the executors across the cluster and also controls the task execution.
  • Executor: Executors are processes running on the worker nodes in your cluster. Inside an executor, the individual tasks or computations are run. Each worker node can host one or more executors and, similarly, each executor can run multiple tasks. When the Driver connects to the cluster manager, the cluster manager assigns the resources to run the executors.
The cluster manager could be a standalone cluster manager, YARN, or Mesos.

The Cluster Manager is responsible for the scheduling and allocation of resources across the compute nodes that form the cluster. Typically, this is done by a manager process that knows about and manages the cluster's resources and allocates them to a requesting process such as Spark. We will look at the three different cluster managers (standalone, YARN, and Mesos) in the next sections.
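
As a rough sketch of how an application selects one of these cluster managers, the master URL passed to the Spark configuration determines which one is used; the host names and ports below are placeholders, not real endpoints:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .setAppName("cluster-manager-sketch")
  // Choose exactly one master URL; the others are shown for comparison.
  .setMaster("spark://master-host:7077")     // standalone cluster manager
  // .setMaster("yarn")                      // YARN; cluster details come from HADOOP_CONF_DIR
  // .setMaster("mesos://mesos-host:5050")   // Mesos
  // .setMaster("local[*]")                  // local mode, no cluster manager at all

val spark = SparkSession.builder().config(conf).getOrCreate()
```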

The following is how Spark works at a high level:

The main entry point to a Spark program is called the SparkContext. The SparkContext lives inside the Driver component and represents the connection to the cluster, along with the scheduler and the task distribution and orchestration logic.

In Spark 2.x, a new entry point called SparkSession was introduced. SparkContext, SQLContext, and HiveContext are now accessible as member variables of the SparkSession.
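
A minimal Spark 2.x sketch of this entry point (the application name and master URL are placeholders); the older contexts are reachable from the session:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-session-sketch")
  .master("local[*]")            // placeholder; use your cluster's master URL
  .enableHiveSupport()           // optional: exposes Hive functionality (the HiveContext equivalent)
  .getOrCreate()

val sc  = spark.sparkContext     // the underlying SparkContext
val sql = spark.sqlContext       // the SQLContext member variable
```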

When you start the Driver program, commands are issued to the cluster using the SparkContext, and the executors then execute the instructions. Once the execution is completed, the Driver program completes the job. You can, at this point, issue more commands and execute more jobs.

The ability to maintain and reuse the SparkContext is a key advantage of the Apache Spark architecture. In contrast, in the Hadoop framework, every MapReduce job, Hive query, or Pig script starts its processing from scratch for each task we want to execute, and it does so using expensive disk I/O instead of memory.

The SparkContext can be used to create RDDs, accumulators, and broadcast variables on the cluster. Only one SparkContext may be active per JVM/Java process. You must stop() the active SparkContext before creating a new one.
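
The following Scala sketch creates each of these from a single SparkContext (assuming a SparkSession `spark` built as in the previous snippet; the accumulator name is illustrative):

```scala
val sc = spark.sparkContext                          // assumes an existing SparkSession `spark`

val rdd     = sc.parallelize(1 to 100)               // an RDD from a local collection
val counter = sc.longAccumulator("multiplesOfTen")   // an accumulator
val lookup  = sc.broadcast(Map("a" -> 1, "b" -> 2))  // a broadcast variable

rdd.foreach(n => if (n % 10 == 0) counter.add(1))    // executors update the accumulator
println(counter.value)                               // the Driver reads the total: 10
println(lookup.value("a"))                           // tasks (and the Driver) read broadcast data

sc.stop()                                            // only one active SparkContext per JVM
```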

The Driver parses the code and serializes it, as byte-level code, across to the executors to be executed. When we perform any computation, the computation is actually done locally on each node, using in-memory processing.

The process of parsing the code and planning the execution is the key aspect implemented by the Driver process.

The following is how the Spark Driver coordinates the computations across the cluster:

The Directed Acyclic Graph (DAG) is the secret sauce of the Spark framework. The Driver process creates a DAG of tasks for the piece of code you run on the distributed processing framework. The task scheduler then executes the DAG as stages and tasks, communicating with the Cluster Manager for the resources to run the executors. A DAG represents a job; a job is split into subsets, also called stages, and each stage is executed as tasks, using one core per task.

A simple job, and how its DAG is split into stages and tasks, is shown in the following two diagrams; the first shows the job itself, and the second shows the stages and the tasks within the job:

The following diagram now breaks down the job/DAG into stages and tasks:

The number of stages and what the stages consist of is determined by the kind of operations. Usually, a narrow transformation goes into the same stage as the one before it, but any operation that requires a shuffle, such as reduceByKey, always starts a new stage of execution. Tasks are part of a stage and map directly to the cores executing the operations on the executors.
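
As a small illustration of these stage boundaries (assuming an existing SparkContext `sc`), the word-count style job below runs as two stages because reduceByKey requires a shuffle; toDebugString prints the lineage from which the DAG is built:

```scala
val counts = sc.parallelize(Seq("a b", "b c", "a c"))
  .flatMap(_.split(" "))       // narrow transformation: stays in the same stage
  .map(word => (word, 1))      // narrow transformation: stays in the same stage
  .reduceByKey(_ + _)          // shuffle: starts a new stage

println(counts.toDebugString)  // shows the lineage with the shuffle boundary
counts.collect()               // the action that actually submits the job
```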

If you use YARN or Mesos as the cluster manager, you can enable dynamic allocation of executors to increase the number of executors when more work needs to be done, as well as to release idle executors.
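
A sketch of the configuration involved (the property names are standard Spark configuration keys; the values are example settings, and the external shuffle service must be available on the cluster):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dynamic-allocation-sketch")
  .master("yarn")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "2")
  .config("spark.dynamicAllocation.maxExecutors", "20")
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")  // idle executors are released
  .config("spark.shuffle.service.enabled", "true")               // keeps shuffle data available after executor removal
  .getOrCreate()
```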

The Driver, hence, manages the fault tolerance of the entire execution process. Once the job is completed, the output can be written to a file, a database, or simply to the console.

Remember that the code shipped from the Driver program, including all the variables and objects it references, has to be completely serializable.
An often seen error is the Task not serializable exception, which results from referencing non-serializable variables or objects from outside the closure.
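
A sketch of this pitfall (the class and field names are purely illustrative): referencing a field of an outer, non-serializable object inside a closure drags the whole object into the task and triggers the exception:

```scala
import org.apache.spark.rdd.RDD

class Connection                              // not Serializable

class Pipeline {
  val conn = new Connection()

  def bad(rdd: RDD[Int]): Unit =
    // `conn` is really `this.conn`, so the whole Pipeline is captured:
    // org.apache.spark.SparkException: Task not serializable
    rdd.foreach(n => println(s"$conn -> $n"))

  def good(rdd: RDD[Int]): Unit =
    rdd.foreach { n =>
      val localConn = new Connection()        // created inside the closure instead
      println(s"$localConn -> $n")
    }
}
```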

Hence, the Driver process takes care of the entire execution, monitoring and managing the resources used, such as executors, stages, and tasks, making sure everything is working as planned and recovering from failures such as failed tasks on executor nodes or the loss of entire executors.
