Shuffling

Whatever partitioner is used, many operations cause data to be repartitioned across the partitions of an RDD. New partitions can be created, or several partitions can be collapsed (coalesced). All the data movement needed for this repartitioning is called shuffling, and it is an important concept to understand when writing a Spark job. Shuffling can slow a job down considerably, because the computation no longer happens in memory on a single executor; instead, the executors exchange data over the wire.

A good example is groupByKey(), which we saw earlier in the Aggregations section. A lot of data had to flow between executors to make sure that all the values for a key were collected onto the same executor to perform the groupBy operation.
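To make this data movement concrete, here is a minimal sketch in plain Python that mimics what a groupByKey() shuffle has to do. It is a simulation rather than Spark code: the helper names partition_for and group_by_key are hypothetical, and the byte-sum hash is an assumption chosen only to keep the result deterministic.

```python
from collections import defaultdict


def partition_for(key, num_partitions):
    # Hypothetical stand-in for a hash partitioner: a deterministic
    # byte-sum hash, reduced modulo the number of partitions.
    return sum(key.encode("utf-8")) % num_partitions


def group_by_key(partitions):
    """Simulate a groupByKey() shuffle over lists of (key, value) pairs.

    Returns the regrouped partitions and the number of records that had
    to leave the partition they started in.
    """
    num_partitions = len(partitions)
    shuffled = [defaultdict(list) for _ in range(num_partitions)]
    moved = 0
    for source, records in enumerate(partitions):
        for key, value in records:
            target = partition_for(key, num_partitions)
            if target != source:
                moved += 1  # this record would cross the wire
            shuffled[target][key].append(value)
    return [dict(p) for p in shuffled], moved


# Three input partitions, each holding values for more than one key.
before = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5), ("c", 6)],
]
after, moved = group_by_key(before)
print(after)  # every key now lives in exactly one partition
print(moved)  # number of records that changed partition
```

With this input, four of the six records end up in a different partition than the one they started in, which is the kind of traffic that makes a shuffle expensive.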

Shuffling also determines the Spark job execution process and influences how the job is split into stages. As we have seen in this chapter and the previous chapter, Spark holds a DAG of RDDs that represents their lineage. Spark uses this lineage both to plan the execution of the job and to recover from the loss of executors. When an RDD undergoes a transformation, Spark attempts to perform the operations on the same node as the data. However, we often use join, reduce, group, or aggregate operations, among others, which cause repartitioning, whether intentionally or unintentionally. This shuffling in turn determines where a particular stage in the processing ends and a new stage begins.

The following diagram illustrates how a Spark job is split into stages. The example shows a pairRDD being filtered and then transformed using map() before groupByKey() is invoked, followed by one last transformation using map():

The more shuffling we have, the more stages the job execution has, which affects performance. The Spark driver determines the stages by classifying the dependencies between RDDs into two types: narrow dependencies and wide dependencies.
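The way shuffle boundaries cut a job into stages can be sketched as a small plain-Python model. This is an illustrative assumption, not the Spark scheduler: the NARROW and WIDE sets and the split_into_stages helper are hypothetical, and placing the wide operation at the start of the new stage is a modeling choice.

```python
# Hypothetical classification of a few common transformations.
NARROW = {"map", "filter", "flatMap", "mapValues"}
WIDE = {"groupByKey", "reduceByKey", "join", "repartition"}


def split_into_stages(operations):
    """Split a linear pipeline of operation names into stages.

    Narrow operations are pipelined into the current stage; a wide
    operation needs shuffled data, so it starts a new stage.
    """
    stages = [[]]
    for op in operations:
        if op in WIDE:
            stages.append([op])    # shuffle boundary
        elif op in NARROW:
            stages[-1].append(op)  # no shuffle, same stage
        else:
            raise ValueError(f"unclassified operation: {op}")
    # Drop the empty first stage if the pipeline began with a wide op.
    return [stage for stage in stages if stage]


# The pipeline described above: filter, map, groupByKey, then map.
stages = split_into_stages(["filter", "map", "groupByKey", "map"])
for number, stage in enumerate(stages, start=1):
    print(f"Stage {number}: {' -> '.join(stage)}")
```

In this model, the single groupByKey in the pipeline yields two stages. Each additional wide operation would add one more.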
