Spark architecture

Apache Spark is designed to simplify the laborious, and sometimes error-prone, task of highly parallelized, distributed computing. To understand how it does this, let's explore its history and identify what Spark brings to the table.

History of Spark

Apache Spark implements a type of data parallelism that seeks to improve upon the MapReduce paradigm popularized by Apache Hadoop. It extends MapReduce in four key areas:

  • Improved programming model: Spark provides a higher level of abstraction through its APIs than Hadoop, creating a programming model that significantly reduces the amount of code that must be written. By introducing a fluent, side-effect-free, function-oriented API, Spark makes it possible to reason about an analytic in terms of its transformations and actions, rather than just sequences of mappers and reducers. This makes analytics easier to understand and debug.
  • Introduced workflow: Rather than chaining jobs together (by persisting results to disk and using a third-party workflow scheduler, as with traditional MapReduce), Spark allows analytics to be decomposed into tasks and expressed as Directed Acyclic Graphs (DAGs). This has the immediate effect of removing the need to materialize intermediate data, and it also gives Spark much more control over how analytics are run, enabling efficiencies such as cost-based query optimization (seen in the Catalyst query optimizer).
  • Better Memory Utilization: Spark exploits the memory on each node for in-memory caching of datasets, allowing cached data to be reused between operations and improving performance over basic MapReduce. This is particularly effective for iterative workloads, such as stochastic gradient descent (SGD), where a significant improvement in performance can usually be observed.
  • Integrated Approach: With support for streaming, SQL execution, graph processing, machine learning, database integration, and much more, it offers one tool to rule them all! Before Spark, specialist tools were needed, for example, Storm, Pig, Giraph, Mahout, and so on. Although there are situations where the specialist tools can provide better results, Spark's on-going commitment to integration is impressive.

In addition to these general improvements, Spark offers many other features. Let's take a look inside the box.

Moving parts

At a conceptual level, there are a number of key components inside Apache Spark, many of which you may know already, but let's review them within the context of the scalability principles we've outlined:


Driver

The Driver is the main entry point for Spark. It's the program that you start, it runs in a single JVM, and it initiates and controls all of the operations in your job.

In terms of performance, it's likely that you'll want to avoid bringing large datasets back to the driver, as the operations that do so (such as rdd.collect) can often cause an OutOfMemoryError. This happens when the size of the data being returned exceeds the driver's JVM heap size, as specified by --driver-memory.
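
As a minimal illustration of this, the sketch below (assuming a spark-shell session where spark is already defined, and a hypothetical hdfs://data/events path) contrasts collect, which pulls every record into the driver JVM, with bounded alternatives that keep the heavy lifting on the executors.

```scala
// Hypothetical input path; replace with real data.
val events = spark.sparkContext.textFile("hdfs://data/events")

// Risky: returns every record to the driver and may exceed --driver-memory.
// val everything = events.collect()

// Safer: bring back a bounded sample, or aggregate on the executors first.
val preview   = events.take(10)   // at most 10 records return to the driver
val lineCount = events.count()    // only a single Long returns to the driver

preview.foreach(println)
println(s"total lines: $lineCount")
```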

SparkSession

As the driver starts, the SparkSession class is initialized. SparkSession provides access to all of Spark's services via the relevant context, such as the SQLContext, SparkContext, and StreamingContext classes.

It's also the place to tune Spark's runtime performance-related properties.
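
Here is a minimal sketch of initializing a SparkSession and reaching the underlying contexts; the application name and the spark.sql.shuffle.partitions value are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("session-sketch")                        // illustrative name
  .config("spark.sql.shuffle.partitions", "200")    // runtime performance property
  .getOrCreate()

val sc  = spark.sparkContext   // the underlying SparkContext
val sql = spark.sqlContext     // SQLContext, retained for backwards compatibility

println(spark.conf.get("spark.sql.shuffle.partitions"))
```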

Resilient distributed datasets (RDDs)

A Resilient Distributed Dataset (RDD) is the underlying abstraction representing a distributed set of homogeneous records.

Although data may be physically stored over many machines in the cluster, analytics are intentionally unaware of their actual location: they deal only with RDDs. Under the covers, RDDs consist of partitions, or contiguous blocks of data, like slices of cake. Each partition has one or more replicas, or copies, and Spark is able to determine the physical location of these replicas in order to decide where to run transformation tasks to ensure data locality.

Note

For an example of how the physical location of replicas is determined, see getPreferredLocations in: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala.

RDDs are also responsible for ensuring that data is cached appropriately from the underlying block storage, for example, HDFS.
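
The sketch below (assuming a spark-shell session and a hypothetical hdfs://data/readings path) shows an RDD being created from block storage, its partition count inspected, and its partitions cached for reuse between operations.

```scala
// Hypothetical HDFS path; each block of the file becomes one partition.
val readings = spark.sparkContext.textFile("hdfs://data/readings")

println(s"partitions: ${readings.getNumPartitions}")

readings.cache()                 // keep partitions in executor memory after first use
val total  = readings.count()    // first action materializes and caches the data
val errors = readings.filter(_.contains("ERROR")).count()   // served from the cache
```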

Executor

Executors are processes that run on the worker nodes of your cluster. When launched, each executor connects back to the driver and waits for instructions to run operations over data.

You decide on how many executors your analytic needs and this becomes your maximum level of parallelism.

Note

Unless you are using dynamic allocation, in which case the maximum level of parallelism is effectively unbounded until capped using spark.dynamicAllocation.maxExecutors. See the Spark configuration documentation for details.
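
As a sketch of the difference, the configuration below (all values are illustrative, and these properties are more commonly passed at submit time) fixes the executor count up front, while the commented lines show dynamic allocation capped explicitly.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("executor-sizing-sketch")
  .config("spark.executor.instances", "8")    // fixed number of executors
  .config("spark.executor.cores", "4")        // cores available to each executor
  .config("spark.executor.memory", "8g")
  // Alternatively, with dynamic allocation, cap the growth explicitly:
  // .config("spark.dynamicAllocation.enabled", "true")
  // .config("spark.dynamicAllocation.maxExecutors", "40")
  .getOrCreate()
```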

Shuffle operation

The shuffle is the name given to the transfer of data between executors that occurs as part of an operation whenever data must be physically moved in order to complete a computation. It typically occurs when data is grouped so that all records with the same key are together on a single machine, but it can also be used strategically to repartition data for greater levels of parallelism.

However, as it involves both (i) the movement of data over the network and (ii) its persistence to disk, it is generally considered a slow operation. Hence, the shuffle is an area of great significance to scalability; more on this later.
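
The sketch below (assuming a spark-shell session; the pairs RDD is illustrative) shows the two usual triggers: grouping records by key, and repartitioning to raise the level of parallelism.

```scala
val pairs = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)

// Grouping moves every record with the same key onto one executor: a shuffle.
val grouped = pairs.groupByKey()

// Repartitioning also shuffles, but raises the parallelism of later operations.
val widened = grouped.repartition(8)

println(widened.toDebugString)   // ShuffledRDD entries mark the shuffle boundaries
```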

Cluster Manager

The Cluster Manager sits outside of Spark, acting as a resource negotiator for the cluster. It controls the initial allocation of physical resources, so that Spark is able to start its executors on machines with the requisite number of cores and memory.

Although each cluster manager works in a different way, your choice is unlikely to have any measurable impact on algorithmic performance.

Task

A Task represents an instruction to run a set of operations over a single partition of data. Each task is serialized and sent to an executor by the driver and is, in effect, what is meant by the expression "moving the processing to the data".
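
Here is a minimal sketch (spark-shell assumed; the data is illustrative) of the one-task-per-partition relationship: the function below runs once for each partition, on whichever executor holds that partition.

```scala
val data = spark.sparkContext.parallelize(1 to 100, 4)   // 4 partitions => 4 tasks

val perPartitionCounts = data
  .mapPartitionsWithIndex { (partitionId, records) =>
    Iterator((partitionId, records.size))   // one result per task
  }
  .collect()

perPartitionCounts.foreach { case (id, n) => println(s"partition $id -> $n records") }
```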

DAG

A DAG represents the logical execution plan of all transformations involved in the execution of an action. Its optimization is fundamental to the performance of the analytic. In the case of Spark SQL and Datasets, optimization is performed on your behalf by the Catalyst optimizer.
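
As a small sketch (spark-shell assumed; the data is generated), you can inspect the plans Catalyst produces for a Dataset query using explain.

```scala
import org.apache.spark.sql.functions.col

val df = spark.range(1000000).withColumn("bucket", col("id") % 10)

df.filter(col("bucket") === 0)
  .groupBy(col("bucket"))
  .count()
  .explain(true)   // prints the parsed, analyzed, optimized and physical plans
```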

DAG scheduler

The DAG scheduler creates a physical plan by dividing the DAG into stages and, for each stage, creating a corresponding set of tasks (one for each partition).


Transformations

Transformations are a type of operation. They typically apply a user-defined function to each record in an RDD. There are two kinds of transformation: narrow and wide.

Narrow transformations are operations that are applied locally to partitions and as such do not require data to be moved in order to compute correctly. They include: filter, map, mapValues, flatMap, flatMapValues, glom, pipe, zipWithIndex, cartesian, union, mapPartitionsWithInputSplit, mapPartitions, mapPartitionsWithIndex, mapPartitionsWithContext, sample, randomSplit.

In contrast, wide transformations are operations that require data to be moved in order to compute correctly. In other words, they require a shuffle. They include: sortByKey, reduceByKey, groupByKey, join, cartesian, combineByKey, partitionBy, repartition, repartitionAndSortWithinPartitions, coalesce, subtractByKey, cogroup.

Note

The coalesce, subtractByKey and cogroup transformations could be narrow depending on where data is physically situated.

In order to write scalable analytics, it's important to be aware of which type of transformation you are using.
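
The sketch below (spark-shell assumed; the sales and rates RDDs are illustrative) contrasts a narrow transformation, which works on each partition where it already lives, with a wide one that must co-locate matching keys and therefore shuffles.

```scala
val sales = spark.sparkContext.parallelize(Seq(("uk", 100.0), ("fr", 80.0), ("uk", 25.0)))
val rates = spark.sparkContext.parallelize(Seq(("uk", 1.2), ("fr", 1.1)))

// Narrow: mapValues transforms records in place; no data movement is needed.
val discounted = sales.mapValues(_ * 0.9)

// Wide: join must bring matching keys from both RDDs together, forcing a shuffle.
val withRates = sales.join(rates)

withRates.collect().foreach(println)
```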

Stages

A stage represents a group of operations that can be physically mapped to a task (one per partition). There are a couple of things to note about stages:

  • Any sequence of narrow transformations appearing consecutively in a DAG is pipelined together into a single stage. In other words, they execute in order on the same executor, and hence against the same partition, and do not need a shuffle.
  • Whenever a wide transformation is encountered in a DAG, a stage boundary is introduced. Two stages (or more in the case of join, and so on) now exist, and the second cannot begin until the first has finished (see the ShuffledRDD class for more details). This boundary is visible in the lineage sketch that follows this list.
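
Here is a minimal sketch (spark-shell assumed) of how this appears in an RDD's lineage: the narrow flatMap/map pipeline shares one stage, and reduceByKey introduces a second, visible as a ShuffledRDD and an indentation change in toDebugString.

```scala
val lines = spark.sparkContext.parallelize(Seq("a b", "b c", "a c"))

val counts = lines
  .flatMap(_.split(" "))    // narrow
  .map(word => (word, 1))   // narrow: pipelined with flatMap in the same stage
  .reduceByKey(_ + _)       // wide: a stage boundary is introduced here

println(counts.toDebugString)   // indentation marks the shuffle/stage boundary
```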

Actions

Actions are another type of operation within Spark. They're typically used to perform a parallel write or to transfer data back to the driver. While transformations are lazily evaluated, it is the action that triggers the execution of a DAG.

Upon invoking an action, its parent RDD is submitted to the SparkSession or SparkContext class within the driver, and the DAG scheduler generates a DAG for execution.
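
A short sketch (spark-shell assumed; the numbers RDD is illustrative) of that laziness: the transformations only build up the DAG, and nothing runs until the count action is invoked.

```scala
val numbers = spark.sparkContext.parallelize(1 to 1000000)

val evens   = numbers.filter(_ % 2 == 0)   // nothing executes yet
val doubled = evens.map(_ * 2)             // still nothing: the DAG just grows

val howMany = doubled.count()              // action: the DAG is submitted and run
println(s"count = $howMany")
```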

Task scheduler

The task scheduler receives the set of tasks determined by the DAG scheduler (one task per partition) and schedules each one to run on an appropriate executor, taking data locality into account.
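
As a sketch of the knobs involved (the values shown are illustrative assumptions), the locality wait settings below control how long the task scheduler holds out for a data-local executor before falling back to a less local one.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("locality-sketch")
  .config("spark.locality.wait", "3s")        // default wait before relaxing locality
  .config("spark.locality.wait.node", "3s")   // node-local wait, overriding the default
  .getOrCreate()
```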
