How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure the necessary JAR files are included.
  2. Set up the package location where the program will reside:
package spark.ml.cookbook.chapter3 
  3. Import the necessary packages:
import breeze.numerics.pow 
import org.apache.spark.sql.SparkSession 
import Array._
  4. Import the packages needed to set the log4j logging level. This step is optional, but we highly recommend it (change the level appropriately as you move through the development cycle):
import org.apache.log4j.Logger 
import org.apache.log4j.Level 
  5. Set the logging level to ERROR to cut down on output. See the previous step for package requirements.
Logger.getLogger("org").setLevel(Level.ERROR) 
Logger.getLogger("akka").setLevel(Level.ERROR) 
  6. Create the SparkSession with the application parameters so Spark can run:
val spark = SparkSession 
  .builder 
  .master("local[*]") 
  .appName("myRDD") 
  .config("Spark.sql.warehouse.dir", ".") 
  .getOrCreate() 
  7. Set up the RDD for the example. We create an RDD from a range and explicitly split it into three partitions: it simply creates the numbers 1 through 12 and distributes them across 3 partitions (a short sketch to verify the split follows the code).
    val rangeRDD = spark.sparkContext.parallelize(1 to 12, 3)
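To confirm the split, a minimal sketch (using the rangeRDD value just created) prints the partition count and the contents of each partition:
// Number of partitions (should print 3) 
println(rangeRDD.getNumPartitions) 
// glom() gathers each partition into an array so we can see how 1 to 12 was distributed 
rangeRDD.glom().collect().foreach(partition => println(partition.mkString(", ")))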
  8. We apply the groupBy() function to the RDD to demonstrate the transformation. In this example, we take the partitioned range RDD and label each number as odd or even using the modulo operator.
val groupByRDD = rangeRDD.groupBy(i => if (i % 2 == 1) "Odd" else "Even").collect() 
groupByRDD.foreach(println)

On running the previous code, you will get the following output:
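It should look similar to the following (the order of the groups and the printed Iterable type, typically CompactBuffer, may vary):
(Even,CompactBuffer(2, 4, 6, 8, 10, 12))
(Odd,CompactBuffer(1, 3, 5, 7, 9, 11))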

  9. Now that we have seen how to code groupBy(), we switch gears and demonstrate reduceByKey().
  10. To see the difference in coding, while producing the same output more efficiently, we set up an array with only two letters (that is, a and b) so we can demonstrate aggregation by summing their counts:
val alphabets = Array("a", "b", "a", "a", "a", "b") // only two letters, to keep it simple 
  11. In this step, we use the Spark context to produce a parallelized RDD of (letter, 1) pairs:
val alphabetsPairsRDD = spark.sparkContext.parallelize(alphabets).map(alphabets => (alphabets, 1)) 
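As a quick sanity check (a minimal sketch using the alphabetsPairsRDD value just created), you can print the (letter, 1) pairs before aggregating them:
// Each element of the original array becomes a (letter, 1) pair 
alphabetsPairsRDD.collect().foreach(println) // prints (a,1), (b,1), (a,1), (a,1), (a,1), (b,1)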
  12. We first apply the groupByKey() function, then map over each group and sum its values, aggregating by the letter (that is, the key):
val countsUsingGroup = alphabetsPairsRDD.groupByKey() 
  .map(c => (c._1, c._2.sum)) 
  .collect() 
  13. We then apply the reduceByKey() function, using the usual Scala shorthand (_ + _) to sum the values while aggregating by the letter (that is, the key):
val countsUsingReduce = alphabetsPairsRDD 
  .reduceByKey(_ + _) 
  .collect()
  14. We output the results:
println("Output for  groupBy") 
countsUsingGroup.foreach(println(_)) 
println("Output for  reduceByKey") 
countsUsingReduce.foreach(println(_)) 

On running the previous code, you will get the following output:

Output for groupBy
(b,2)
(a,4)
Output for reduceByKey
(b,2)
(a,4)  
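Both approaches produce the same counts, but reduceByKey() combines the values within each partition before the shuffle, so it moves far less data across the network than groupByKey(). As a related minimal sketch (assuming the same alphabets array from earlier), the counts can also be obtained directly with countByValue(), which returns a local Map to the driver:
// countByValue() counts occurrences of each element and returns a Map[String, Long] 
val countsUsingCountByValue = spark.sparkContext.parallelize(alphabets).countByValue() 
countsUsingCountByValue.foreach(println) // prints (b,2) and (a,4), in either order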