groupByKey

The groupByKey transformation works on a pair RDD, that is, an RDD consisting of (key, value) pairs. It groups together all the values associated with each key, transforming a pair RDD of <key, value> pairs into a pair RDD of <key, Iterable<value>> pairs.

In the following example, we will execute the groupByKey operation on the pair RDD generated in the mapToPair() transformation section.

The groupByKey operation can be executed as follows:

pairRDD.groupByKey()
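To make the <key, value> to <key, Iterable<value>> transformation concrete, here is a minimal plain-Java sketch of the result groupByKey produces. This is not Spark code; the `groupByKey` helper below is a hypothetical local stand-in that mimics the grouping semantics on an in-memory list of pairs:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupByKeySketch {

    // Hypothetical local equivalent of groupByKey: collects all values
    // that share a key into one list per key.
    static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = List.of(
                Map.entry("a", 1), Map.entry("b", 2), Map.entry("a", 3));
        // All values for key "a" end up together, as in <key, Iterable<value>>
        System.out.println(groupByKey(pairs)); // {a=[1, 3], b=[2]}
    }
}
```

In Spark, the grouping additionally involves a shuffle across the cluster, but the shape of the output pairs is the same.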

There is an overload of this transformation that lets the user provide a Partitioner:

groupByKey(Partitioner partitioner)

As an RDD is partitioned across multiple nodes, a key can be present in multiple partitions. In an operation such as groupByKey, data is shuffled across partitions, and the Partitioner decides which executor each key lands on after the shuffle. By default, Spark uses a HashPartitioner, which uses the hashCode of the key to decide which partition it should be shuffled to. Users can also provide a custom partitioner based on their requirements.
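The hashCode-based assignment described above can be sketched in plain Java. This is an illustrative stand-in, not Spark's actual HashPartitioner class; the non-negative modulus step is an assumption made so that keys with negative hashCodes still map to a valid partition:

```java
public class HashPartitionSketch {

    // Sketch of hash partitioning: map a key to one of numPartitions
    // buckets using its hashCode, normalized to be non-negative.
    static int partitionFor(Object key, int numPartitions) {
        int mod = key.hashCode() % numPartitions;
        return mod < 0 ? mod + numPartitions : mod;
    }

    public static void main(String[] args) {
        // All occurrences of the same key map to the same partition,
        // which is what lets the shuffle bring a key's values together.
        System.out.println(partitionFor("apple", 4));
        System.out.println(partitionFor("apple", 4));
        System.out.println(partitionFor("banana", 4));
    }
}
```

Because the same key always hashes to the same partition, every value for that key ends up on one executor after the shuffle.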

RDD Partitioning concepts will be explained in detail in Chapter 7, Spark Programming Model-Advance.

groupByKey() is a very costly operation, as it requires a lot of shuffling: all the data corresponding to a key is moved across the network to a single executor, which can cause serious issues with large datasets.
