Apache Spark is designed to simplify the laborious and sometimes error-prone task of highly parallelized, distributed computing. To understand how it does this, let's explore its history and identify what Spark brings to the table.
Apache Spark implements a type of data parallelism that seeks to improve upon the MapReduce paradigm popularized by Apache Hadoop. It extends MapReduce in four key areas:
In addition to these general improvements, Spark offers many other features. Let's take a look inside the box.
At a conceptual level, there are a number of key components inside Apache Spark, many of which you may know already, but let's review them within the context of the scalability principles we've outlined:
The Driver is the main entry point for Spark. It's the program that you start, it runs in a single JVM, and it initiates and controls all of the operations in your job.
In terms of performance, you'll likely want to avoid bringing large datasets back to the driver, as operations that do so (such as rdd.collect) can often cause an OutOfMemoryError. This happens when the size of the data being returned exceeds the JVM heap size of the driver, as specified by --driver-memory.
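To make the contrast concrete, here is a minimal pure-Python sketch (not the Spark API; the helper names are illustrative) of the difference between a collect-style action, which materializes every partition in driver memory at once, and an iterator-style action such as Spark's rdd.toLocalIterator, which holds only one partition at a time:

```python
def collect(partitions):
    """Gather all records into a single in-memory list (like rdd.collect)."""
    result = []
    for part in partitions:
        result.extend(part)        # every record resident in memory at once
    return result

def to_local_iterator(partitions):
    """Yield records one partition at a time (like rdd.toLocalIterator)."""
    for part in partitions:
        yield from part            # at most one partition resident at a time

partitions = [[1, 2], [3, 4], [5, 6]]
print(collect(partitions))                 # [1, 2, 3, 4, 5, 6]
print(sum(to_local_iterator(partitions)))  # 21
```

When the full dataset is too large for the driver's heap, the iterator form (or simply writing results out in parallel from the executors) avoids the OutOfMemoryError described above.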
As the driver is starting, the SparkSession class is initialized. The SparkSession class provides access to all of Spark's services via the relevant context classes, such as SQLContext, SparkContext, and StreamingContext. It's also the place to tune Spark's runtime performance-related properties.
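For instance, performance-related properties can be supplied when the session is built or via spark-defaults.conf. A small illustrative fragment (the property keys are standard Spark configuration names; the values are examples only, not recommendations):

```properties
# spark-defaults.conf -- illustrative values only
spark.driver.memory           4g
spark.executor.memory         8g
spark.sql.shuffle.partitions  200
```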
A Resilient Distributed Dataset (RDD) is the underlying abstraction representing a distributed set of homogeneous records.
Although data may be physically stored over many machines in the cluster, analytics are intentionally unaware of their actual location: they deal only with RDDs. Under the covers, RDDs consist of partitions, or contiguous blocks of data, like slices of cake. Each partition has one or more replicas, or copies, and Spark is able to determine the physical location of these replicas in order to decide where to run transformation tasks to ensure data locality.
For an example of how the physical location of replicas is determined, see getPreferredLocations in: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala.
RDDs are also responsible for ensuring that data is cached appropriately from the underlying block storage, for example, HDFS.
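The partition/replica structure can be sketched with a toy model (plain Python, not Spark's implementation; all names are illustrative). The point is that a scheduler can ask each partition where its replicas live, so tasks can run where the data already is:

```python
class ToyRDD:
    """Toy model of an RDD: a list of partitions plus replica locations."""

    def __init__(self, partitions, replica_hosts):
        self.partitions = partitions        # partition index -> records
        self.replica_hosts = replica_hosts  # partition index -> hosts with a replica

    def get_preferred_locations(self, partition_index):
        # Mirrors the role of RDD.getPreferredLocations: report the hosts
        # where this partition's replicas physically reside, so the
        # scheduler can place the task for data locality.
        return self.replica_hosts[partition_index]

rdd = ToyRDD(
    partitions=[[1, 2], [3, 4]],
    replica_hosts=[["node-a", "node-b"], ["node-c"]],
)
print(rdd.get_preferred_locations(0))  # ['node-a', 'node-b']
```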
Executors are processes that run on the worker nodes of your cluster. When launched, each executor connects back to the driver and waits for instructions to run operations over data.
You decide on how many executors your analytic needs and this becomes your maximum level of parallelism.
The shuffle is the name given to the transfer of data between executors that occurs whenever data must be physically moved in order to complete a calculation. It typically occurs when data is grouped so that all records with the same key are together on a single machine, but it can also be used strategically to repartition data for a greater level of parallelism.
However, as it involves both (i) the movement of data over the network and (ii) its persistence to disk, the shuffle is generally considered a slow operation. Hence, it is an area of great significance to scalability; more on this later.
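The data movement a shuffle performs can be sketched in plain Python (illustrative, not Spark's implementation): each map-side partition hash-partitions its (key, value) records by key, and the reduce side gathers all records for the same key onto a single partition.

```python
def shuffle(map_partitions, num_reducers):
    """Redistribute (key, value) records so same-key records co-locate."""
    reduce_partitions = [[] for _ in range(num_reducers)]
    for part in map_partitions:              # data leaves its original partition...
        for key, value in part:
            target = hash(key) % num_reducers
            reduce_partitions[target].append((key, value))  # ...and crosses the "network"
    return reduce_partitions

map_side = [[("a", 1), ("b", 1)], [("a", 2), ("c", 1)]]
out = shuffle(map_side, num_reducers=2)
# After the shuffle, every record with the same key lives in one partition.
```

In real Spark, this movement also involves serializing the data, writing shuffle files to disk, and fetching them over the network, which is why it dominates the cost of many jobs.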
The Cluster Manager sits outside of Spark, acting as a resource negotiator for the cluster. It controls the initial allocation of physical resources, so that Spark is able to start its executors on machines with the requisite number of cores and memory.
Although each cluster manager works in a different way, your choice is unlikely to have any measurable impact on algorithmic performance.
A Task represents an instruction to run a set of operations over a single partition of data. Each task is serialized over to an executor by the driver and is, in effect, what is meant by the expression moving the processing to the data.
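A minimal sketch of that idea (plain Python; the function names are illustrative): the driver serializes a small function and ships it to the executor holding the partition, rather than shipping the partition to the function.

```python
import pickle

def double(x):
    """The operation the driver wants to run over each record."""
    return x * 2

# What travels over the wire: the (small) function, not the (large) data.
serialized_task = pickle.dumps(double)

def executor_run(serialized, partition):
    fn = pickle.loads(serialized)                # executor rebuilds the function...
    return [fn(record) for record in partition]  # ...and applies it to local data

print(executor_run(serialized_task, [1, 2, 3]))  # [2, 4, 6]
```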
A DAG represents the logical execution plan of all transformations involved in the execution of an action. Its optimization is fundamental to the performance of the analytic. In the case of Spark SQL and Datasets, optimization is performed on your behalf by the Catalyst optimizer.
The DAG scheduler creates a physical plan, by dividing the DAG into stages and, for each stage, creating a corresponding set of tasks (one for each partition).
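A simplified sketch of that stage-cutting step (plain Python; real plans are graphs rather than linear chains, and the set of wide operations here is only a sample): a new stage begins at every wide (shuffle) transformation.

```python
# Illustrative subset of transformations that require a shuffle.
WIDE = {"reduceByKey", "groupByKey", "join", "repartition", "sortByKey"}

def plan_stages(transformations):
    """Cut a linear chain of transformations into stages at shuffle boundaries."""
    stages, current = [], []
    for op in transformations:
        if op in WIDE and current:
            stages.append(current)   # close the stage at the shuffle boundary
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages

print(plan_stages(["map", "filter", "reduceByKey", "map"]))
# [['map', 'filter'], ['reduceByKey', 'map']]
```

Each resulting stage then yields one task per partition, which is exactly the physical plan the DAG scheduler hands to the executors.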
Transformations are a type of operation. They typically apply a user-defined function to each record in an RDD. There are two kinds of transformation, narrow and wide.
Narrow transformations are operations that are applied locally to partitions and, as such, do not require data to be moved in order to compute correctly. They include: filter, map, mapValues, flatMap, flatMapValues, glom, pipe, zipWithIndex, cartesian, union, mapPartitionsWithInputSplit, mapPartitions, mapPartitionsWithIndex, mapPartitionsWithContext, sample, randomSplit.
In contrast, wide transformations are operations that require data to be moved in order to compute correctly. In other words, they require a shuffle. They include: sortByKey, reduceByKey, groupByKey, join, cartesian, combineByKey, partitionBy, repartition, repartitionAndSortWithinPartitions, coalesce, subtractByKey, cogroup.
In order to write scalable analytics, it's important to be aware of which type of transformation you are using.
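The distinction can be sketched in plain Python on partitioned data (illustrative names, not the Spark API): a map-style narrow transformation touches each partition independently, while a groupByKey-style wide transformation must pull a key's records out of every partition, i.e. it needs a shuffle.

```python
def map_narrow(partitions, fn):
    # Narrow: each output partition depends on exactly one input partition,
    # so no data crosses partition boundaries.
    return [[fn(record) for record in part] for part in partitions]

def group_by_key_wide(partitions):
    # Wide: every output group may depend on records from *all* partitions,
    # so records must move between partitions (a shuffle).
    groups = {}
    for part in partitions:
        for key, value in part:
            groups.setdefault(key, []).append(value)
    return groups

parts = [[("a", 1), ("b", 2)], [("a", 3)]]
print(map_narrow(parts, lambda kv: (kv[0], kv[1] * 10)))
# [[('a', 10), ('b', 20)], [('a', 30)]]
print(group_by_key_wide(parts))
# {'a': [1, 3], 'b': [2]}
```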
A stage represents a group of operations that can be physically mapped to a task (one per partition). Stages are divided at shuffle boundaries, so the operations within a stage can be pipelined together over each partition without any data movement (see the ShuffledRDD class for more details).
Actions are another type of operation within Spark. They're typically used to perform a parallel write or to transfer data back to the driver. While transformations are lazily evaluated, it is the action that triggers the execution of a DAG.
Upon invoking an action, its parent RDD is submitted to the SparkSession or SparkContext class within the driver, and the DAG scheduler generates a DAG for execution.