sortByKey

sortByKey() belongs to OrderedRDDFunctions. It is used to sort a pair RDD by its key in ascending or descending order.

To execute sortByKey() on a pair RDD, the key should be of an ordered type. In the Java world, the key type should implement Comparable; otherwise, an exception is thrown. So, for a custom key type, the user should implement the Comparable interface.
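As a minimal sketch of what such a custom key could look like, the following defines a hypothetical EmployeeKey class (not part of Spark or of the original example) that implements Comparable. It also implements Serializable, on the assumption that keys are moved between partitions using Java serialization. The ordering is demonstrated with a plain local sort, so no Spark dependency is needed to run it.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical custom key: a department name plus a number.
// Comparable gives it a natural ordering; Serializable is an assumption
// made so that instances can be shipped between partitions.
final class EmployeeKey implements Comparable<EmployeeKey>, Serializable {
    private static final long serialVersionUID = 1L;

    final String department;
    final int number;

    EmployeeKey(String department, int number) {
        this.department = department;
        this.number = number;
    }

    // Order by department first, then by number within a department.
    @Override
    public int compareTo(EmployeeKey other) {
        int byDepartment = department.compareTo(other.department);
        return byDepartment != 0 ? byDepartment : Integer.compare(number, other.number);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof EmployeeKey)) {
            return false;
        }
        EmployeeKey that = (EmployeeKey) o;
        return department.equals(that.department) && number == that.number;
    }

    @Override
    public int hashCode() {
        return 31 * department.hashCode() + number;
    }

    @Override
    public String toString() {
        return department + "-" + number;
    }
}

class EmployeeKeyDemo {
    // Returns a sorted copy using the natural ordering defined by compareTo.
    static List<EmployeeKey> sorted(List<EmployeeKey> keys) {
        List<EmployeeKey> copy = new ArrayList<>(keys);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        List<EmployeeKey> keys = Arrays.asList(
            new EmployeeKey("ops", 2), new EmployeeKey("dev", 7), new EmployeeKey("dev", 3));
        System.out.println(sorted(keys)); // prints [dev-3, dev-7, ops-2]
    }
}
```

A pair RDD keyed by a type like this should be sortable with the no-argument sortByKey(), because the natural ordering comes from compareTo.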

The following is an example of the sortByKey() transformation:

Let's first create a pair RDD of <String, Integer> type using JavaSparkContext:

JavaPairRDD<String, Integer> unsortedPairRDD = javaSparkContext.parallelizePairs(
    Arrays.asList(new Tuple2<String, Integer>("B", 2), new Tuple2<String, Integer>("C", 5),
        new Tuple2<String, Integer>("D", 7), new Tuple2<String, Integer>("A", 8)));

Now, let's execute the sortByKey transformation on it. With no arguments, it sorts the RDD by key in ascending order:

unsortedPairRDD.sortByKey();

There is an overload available for the sortByKey() function that takes a Boolean argument:

sortByKey(boolean ascending)

This function can be used to sort an RDD in descending order by keys as follows:

unsortedPairRDD.sortByKey(false);

There are other overloads available for this function that allow users to provide a Comparator for the key type. So, if the key type has not implemented Comparable, this function can be used:

sortByKey(Comparator<K> comp, boolean ascending)
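The following is a rough sketch of a comparator that could be passed to this overload. The Version key class and VersionComparator are hypothetical names introduced only for illustration. The comparator is a named class that also implements Serializable, since Spark needs to ship it to the executors. The ordering itself is checked here with a local sort rather than on a cluster.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical key type that deliberately does NOT implement Comparable.
final class Version implements Serializable {
    private static final long serialVersionUID = 1L;

    final int major;
    final int minor;

    Version(int major, int minor) {
        this.major = major;
        this.minor = minor;
    }

    @Override
    public String toString() {
        return major + "." + minor;
    }
}

// Hypothetical comparator for Version keys. It is Serializable so that
// it can be sent along with the sort task instead of failing at runtime.
final class VersionComparator implements Comparator<Version>, Serializable {
    private static final long serialVersionUID = 1L;

    // Compare numerically by major version, then by minor version.
    @Override
    public int compare(Version a, Version b) {
        int byMajor = Integer.compare(a.major, b.major);
        return byMajor != 0 ? byMajor : Integer.compare(a.minor, b.minor);
    }
}

class VersionComparatorDemo {
    // Returns a copy of the list sorted in ascending order by the comparator.
    static List<Version> sorted(List<Version> versions) {
        List<Version> copy = new ArrayList<>(versions);
        copy.sort(new VersionComparator());
        return copy;
    }

    public static void main(String[] args) {
        List<Version> versions = Arrays.asList(
            new Version(1, 10), new Version(1, 2), new Version(0, 9));
        System.out.println(sorted(versions)); // prints [0.9, 1.2, 1.10]

        // On a hypothetical JavaPairRDD<Version, Integer> named versionPairRDD,
        // the same comparator would be used for a descending sort as:
        // versionPairRDD.sortByKey(new VersionComparator(), false);
    }
}
```

Keeping the comparator as a small named class, rather than an inline lambda, makes it easy to declare it Serializable and to reuse it across jobs.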

This transformation also requires shuffling of data between partitions. With the introduction of DataFrames and Datasets, Spark SQL operations can also be utilized to sort the data (these operations will be discussed in detail in Chapter 8, Working with Spark SQL).
