aggregateByKey

aggregateByKey is quite similar to reduceByKey, except that aggregateByKey allows more flexibility and customization of how to aggregate within partitions and between partitions to allow much more sophisticated use cases such as generating a list of all <Year, Population> pairs as well as total population for each State in one function call.

aggregateByKey works by aggregating the values of each key, using given combine functions and a neutral initial/zero value.
This function can return a different result type, U, than the type of the values in this RDD V, which is the biggest difference. Thus, we need one operation for merging a V into a U and one operation for merging two U's. The former operation is used for merging values within a partition, and the latter is used for merging values between partitions. To avoid memory allocation, both of these functions are allowed to modify and return their first argument instead of creating a new U:

def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U,
combOp: (U, U) => U): RDD[(K, U)]

def aggregateByKey[U: ClassTag](zeroValue: U, numPartitions: Int)(seqOp: (U, V) => U,
combOp: (U, U) => U): RDD[(K, U)]

def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U,
combOp: (U, U) => U): RDD[(K, U)]

aggregateByKey works by performing an aggregation within the partition operating on all elements of each partition and then applies another aggregation logic when combining the partitions themselves. Ultimately, all pairs of (key - value) for the same Key are collected in the same partition; however, the aggregation as to how it is done and the output generated is not fixed as in groupByKey and reduceByKey, but is more flexible and customizable when using aggregateByKey.

The following diagram is an illustration of what happens when aggregateByKey is called. Instead of adding up the counts as in groupByKey and reduceByKey, here we are generating lists of values for each Key:

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.224.54.13