S. No. RDD Transformations and Meaning
9. distinct([numTasks])
Returns a new dataset that contains the distinct elements of the source dataset.
Example:
val rdd1 = spark.sparkContext.
parallelize(Seq((1,"jan",2016),(3,"nov",2014),(16,"feb",2014),(3,"nov",2014)))
val result = rdd1.distinct()
println(result.collect().mkString(", "))
Transformations on Pair RDD:
S. No Pair RDD Transformations and Meaning
1. groupByKey([numTasks])
When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
Note: If you are grouping in order to perform an aggregation (such as a sum or
average) over each key, using reduceByKey or aggregateByKey will yield much better
performance.
Example:
val data = spark.sparkContext.
parallelize(Array(('k',5),('s',3),('s',4),('p',7),('p',5),('t',8),('k',6)),3)
val group = data.groupByKey().collect()
group.foreach(println)
2. reduceByKey(func, [numTasks])
When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the
values for each key are aggregated using the given reduce function func, which must
be of type (V, V) => V. Like in groupByKey, the number of reduce tasks is configurable
through an optional second argument.
Example:
val words = Array("one","two","two","four","five","six","six","eight","nine","ten")
val data = spark.sparkContext.parallelize(words).map(w => (w,1)).reduceByKey(_+_)
data.foreach(println)
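The optional second argument mentioned above sets the number of reduce tasks (output partitions). A minimal sketch, reusing the words array from the example:
// Request 4 partitions for the shuffled result instead of the default.
val data4 = spark.sparkContext.parallelize(words).map(w => (w,1)).reduceByKey(_+_, 4)
println(data4.getNumPartitions)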
3. aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
When called on a dataset of (K, V) pairs, it returns a dataset of (K, U) pairs where the
values for each key are aggregated using the given combine functions and a neutral
'zero' value. This allows an aggregated value type that is different from the input value
type, while avoiding unnecessary allocations. Like in groupByKey, the number of reduce
tasks is configurable through an optional second argument.
Example:
val keysWithValuesList = Array("foo=A", "foo=A", "foo=A", "foo=A", "foo=B", "bar=C",
"bar=D", "bar=D")
val data = sc.parallelize(keysWithValuesList)
// Create key-value pairs
val kv = data.map(_.split("=")).map(v => (v(0), v(1))).cache()
val initialCount = 0
val addToCounts = (n: Int, v: String) => n + 1
val sumPartitionCounts = (p1: Int, p2: Int) => p1 + p2
val countByKey = kv.aggregateByKey(initialCount)(addToCounts, sumPartitionCounts)
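To inspect the result, the aggregated pairs can be collected and printed; for the list above this should yield (foo,5) and (bar,3):
// Collect the per-key counts to the driver and print them.
countByKey.collect().foreach(println)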
4. sortByKey([ascending], [numTasks])
When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset
of (K, V) pairs sorted by keys in ascending or descending order, as specified in the
Boolean ascending argument.
Example:
val data = spark.sparkContext.parallelize(Seq(("maths",52), ("english",75), ("science",82),
("computer",65), ("maths",85)))
val sorted = data.sortByKey()
sorted.foreach(println)
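For descending order, the Boolean ascending argument can be set to false; a minimal sketch reusing data from above:
// Sort by key in descending order.
val sortedDesc = data.sortByKey(false)
sortedDesc.foreach(println)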
5. join(otherDataset, [numTasks])
When called on datasets of type (K, V) and (K, W), it returns a dataset of (K, (V,W))
pairs with all pairs of elements for each key. Outer joins are supported through
leftOuterJoin, rightOuterJoin and fullOuterJoin.
Example:
val data = spark.sparkContext.parallelize(Array(('A',1),('b',2),('c',3)))
val data2 = spark.sparkContext.parallelize(Array(('A',4),('A',6),('b',7),('c',3),('c',8)))
val result = data.join(data2)
println(result.collect().mkString(","))
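The outer-join variants mentioned above keep keys that appear in only one of the two datasets, wrapping the possibly missing side in an Option. A minimal sketch reusing data and data2 from the example:
// leftOuterJoin keeps every key of data; missing matches become None.
val leftJoined = data.leftOuterJoin(data2) // RDD[(Char, (Int, Option[Int]))]
println(leftJoined.collect().mkString(","))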
Difference between groupByKey() and reduceByKey()
groupByKey(): On applying groupByKey() to a dataset of (K, V) pairs, the data is shuffled
according to the key value K into another RDD (Ref. Figure 6.8). In this transformation, a
huge amount of data is unnecessarily transferred over the network.
Spark spills data to disk when more data is shuffled onto a single executor machine
than can fit in memory.
val words = Array("one", "two", "two", "three", "three", "three")
val wordPairsRDD = sc.parallelize(words).map(word => (word, 1))
val wordCountsWithGroup = wordPairsRDD.groupByKey().map(t =>
(t._1, t._2.sum)).collect()
FIGURE 6.8 groupByKey() operation: every individual (key, 1) pair is shuffled across the network and grouped by key before any aggregation, so six (a, 1) pairs and five (b, 1) pairs travel to produce (a, 6) and (b, 5).
FIGURE 6.9 reduceByKey() operation: (key, 1) pairs are first combined within each partition, so only the partial sums (e.g., (a, 1), (a, 2), (a, 3)) are shuffled and merged into the final counts (a, 6) and (b, 5).
reduceByKey(): While both reduceByKey() and groupByKey() produce the same answer, the
reduceByKey() version performs much better on a large dataset (Ref. Figure 6.9), because
Spark knows it can combine the output with a common key on each partition before shuffling
the data.
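For comparison, a minimal sketch of the same word count written with reduceByKey(), reusing wordPairsRDD from the groupByKey() example above:
// Pairs with a common key are combined within each partition first,
// so far less data crosses the network during the shuffle.
val wordCountsWithReduce = wordPairsRDD.reduceByKey(_ + _).collect()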