How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure the necessary JAR files are included.
  2. Set up the package location where the program will reside:
package spark.ml.cookbook.chapter3 
  3. Import the necessary packages:
import breeze.numerics.pow 
import org.apache.spark.sql.SparkSession 
import Array._
  4. Import the packages needed to set the log4j logging level. This step is optional, but we highly recommend it (change the level appropriately as you move through the development cycle):
import org.apache.log4j.Logger 
import org.apache.log4j.Level 
  5. Set the logging level to ERROR to cut down on output. See the previous step for package requirements.
Logger.getLogger("org").setLevel(Level.ERROR) 
Logger.getLogger("akka").setLevel(Level.ERROR) 
  6. Create the SparkSession with the application parameters so Spark can run:
val spark = SparkSession 
  .builder 
  .master("local[*]") 
  .appName("myRDD") 
  .config("Spark.sql.warehouse.dir", ".") 
  .getOrCreate() 
  7. Set up the RDD for the example. We create an RDD from a range and explicitly split it into three partitions: it simply creates the numbers 1 through 12 and distributes them across 3 partitions (a short sketch to verify the split follows the code).
    val rangeRDD = spark.sparkContext.parallelize(1 to 12, 3)
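To confirm the split, a minimal sketch (using the rangeRDD value just created) prints the partition count and the contents of each partition:
// Number of partitions (should print 3) 
println(rangeRDD.getNumPartitions) 
// glom() gathers each partition into an array so we can see how 1 to 12 was distributed 
rangeRDD.glom().collect().foreach(partition => println(partition.mkString(", ")))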
  8. We apply the groupBy() function to the RDD to demonstrate the transformation. In this example, we take the partitioned range RDD and label each number as odd or even using the modulo operator.
val groupByRDD = rangeRDD.groupBy(i => if (i % 2 == 1) "Odd" else "Even").collect() 
groupByRDD.foreach(println)

On running the previous code, you will get the following output:
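It should look similar to the following (the order of the groups and the printed Iterable type, typically CompactBuffer, may vary):
(Even,CompactBuffer(2, 4, 6, 8, 10, 12))
(Odd,CompactBuffer(1, 3, 5, 7, 9, 11))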

  9. Now that we have seen how to code groupBy(), we switch gears and demonstrate reduceByKey().
  10. To see the difference in coding, while producing the same output more efficiently, we set up an array with only two letters (that is, a and b) so we can demonstrate aggregation by summing their counts:
val alphabets = Array("a", "b", "a", "a", "a", "b") // only two letters, to keep it simple 
  11. In this step, we use the Spark context to produce a parallelized RDD of (letter, 1) pairs:
val alphabetsPairsRDD = spark.sparkContext.parallelize(alphabets).map(alphabets => (alphabets, 1)) 
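As a quick sanity check (a minimal sketch using the alphabetsPairsRDD value just created), you can print the (letter, 1) pairs before aggregating them:
// Each element of the original array becomes a (letter, 1) pair 
alphabetsPairsRDD.collect().foreach(println) // prints (a,1), (b,1), (a,1), (a,1), (a,1), (b,1)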
  12. We first apply the groupByKey() function, then map over each group and sum its values, aggregating by the letter (that is, the key):
val countsUsingGroup = alphabetsPairsRDD.groupByKey() 
  .map(c => (c._1, c._2.sum)) 
  .collect() 
  13. We then apply the reduceByKey() function, using the usual Scala shorthand (_ + _) to sum the values while aggregating by the letter (that is, the key):
val countsUsingReduce = alphabetsPairsRDD 
  .reduceByKey(_ + _) 
  .collect()
  14. We output the results:
println("Output for  groupBy") 
countsUsingGroup.foreach(println(_)) 
println("Output for  reduceByKey") 
countsUsingReduce.foreach(println(_)) 

On running the previous code, you will get the following output:

Output for groupBy
(b,2)
(a,4)
Output for reduceByKey
(b,2)
(a,4)  
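Both approaches produce the same counts, but reduceByKey() combines the values within each partition before the shuffle, so it moves far less data across the network than groupByKey(). As a related minimal sketch (assuming the same alphabets array from earlier), the counts can also be obtained directly with countByValue(), which returns a local Map to the driver:
// countByValue() counts occurrences of each element and returns a Map[String, Long] 
val countsUsingCountByValue = spark.sparkContext.parallelize(alphabets).countByValue() 
countsUsingCountByValue.foreach(println) // prints (b,2) and (a,4), in either order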