coalesce

The coalesce transformation in Spark reduces the number of partitions of an RDD. If the partition count is too high, a user may need to reduce the number of partitions before running join operations, because the number of tasks scheduled grows with the partition counts of both RDDs (a cartesian product of two RDDs with n and m partitions, for instance, yields n*m partitions and therefore n*m tasks). So, decreasing the partition count before joining RDDs can be a useful decision if the partitions are small (few elements per partition).

The repartition transformation can also be used to reduce the number of partitions of an RDD. However, repartition always requires a full shuffle of the data. coalesce, on the other hand, avoids a full shuffle by default.

For example, if an RDD has x partitions and the user calls coalesce to decrease the number of partitions to (x-n), then only the data of n partitions (x-(x-n)) has to move: those n partitions are merged into the remaining (x-n) partitions, whose data stays where it is. This minimizes data movement among nodes.
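The arithmetic above can be illustrated with a small toy model in plain Python. This is only a sketch and not Spark code: the helper names coalesce_groups and moved_partitions are hypothetical, and the model assumes that parent partitions are merged in contiguous groups and that the first parent in each group stays in place. Real Spark may group partitions differently, for instance by taking data locality into account.

```python
from typing import List


def coalesce_groups(num_partitions: int, target: int) -> List[List[int]]:
    """Assign parent partition indices to `target` merged partitions.

    Toy model: parents are grouped contiguously. Without a shuffle the
    partition count can only shrink, so a larger target changes nothing.
    """
    if num_partitions <= 0 or target <= 0:
        raise ValueError("partition counts must be positive")
    if target >= num_partitions:
        return [[i] for i in range(num_partitions)]
    groups = []
    for i in range(target):
        start = i * num_partitions // target
        end = (i + 1) * num_partitions // target
        groups.append(list(range(start, end)))
    return groups


def moved_partitions(groups: List[List[int]]) -> int:
    """Count parents that must move, assuming the first parent of each
    group stays where it is and the others are merged into it."""
    return sum(len(group) - 1 for group in groups)


# x = 6 partitions coalesced to x - n = 4, so n = 2 partitions move.
groups = coalesce_groups(6, 4)
print(groups)                    # [[0], [1, 2], [3], [4, 5]]
print(moved_partitions(groups))  # 2
```

In this model the number of moved partitions always equals x-(x-n) = n, whereas a full shuffle would redistribute the elements of every partition.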

The coalesce transformation can be performed as follows:

pairRDD.coalesce(2)