Creating broadcast variables

Creating a broadcast variable can be done using the Spark Context's broadcast() function on any data of any data type provided that the data/variable is serializable.

Let's look at how we can broadcast an Integer variable and then use the broadcast variable inside a transformation operation executed on the executors:

scala> val rdd_one = sc.parallelize(Seq(1,2,3))
rdd_one: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[101] at parallelize at <console>:25

scala> val i = 5
i: Int = 5

scala> val bi = sc.broadcast(i)
bi: org.apache.spark.broadcast.Broadcast[Int] = Broadcast(147)

scala> bi.value
res166: Int = 5

scala> rdd_one.take(5)
res164: Array[Int] = Array(1, 2, 3)

scala> rdd_one.map(j => j + bi.value).take(5)
res165: Array[Int] = Array(6, 7, 8)

Broadcast variables can also be created on more than just primitive data types as shown in the next example where we will broadcast a HashMap from the Driver.

The following is a simple transformation of an integer RDD by multiplying each element with another integer by looking up the HashMap. The RDD of 1,2,3 is transformed to 1 X 2 , 2 X 3, 3 X 4 = 2,6,12 :

scala> val rdd_one = sc.parallelize(Seq(1,2,3))
rdd_one: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[109] at parallelize at <console>:25

scala> val m = scala.collection.mutable.HashMap(1 -> 2, 2 -> 3, 3 -> 4)
m: scala.collection.mutable.HashMap[Int,Int] = Map(2 -> 3, 1 -> 2, 3 -> 4)

scala> val bm = sc.broadcast(m)
bm: org.apache.spark.broadcast.Broadcast[scala.collection.mutable.HashMap[Int,Int]] = Broadcast(178)

scala> rdd_one.map(j => j * bm.value(j)).take(5)
res191: Array[Int] = Array(2, 6, 12)
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.119.120.159