S. No. RDD Transformations and Meaning
9. distinct([numTasks])
Returns a new dataset that contains the distinct elements of the source dataset.
Example:
val rdd1 = spark.sparkContext.
parallelize(Seq((1,"jan",2016),(3,"nov",2014),(16,"feb",2014),(3,"nov",2014)))
val result = rdd1.distinct()
println(result.collect().mkString(", "))
Transformations on Pair RDD:
S. No Pair RDD Transformations and Meaning
1. groupByKey([numTasks])
When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
Note: If you are grouping in order to perform an aggregation (such as a sum or
average) over each key, using reduceByKey or aggregateByKey will yield much better
performance.
Example:
val data = spark.sparkContext.
parallelize(Array(('k',5),('s',3),('s',4),('p',7),('p',5),('t',8),('k',6)),3)
val group = data.groupByKey().collect()
group.foreach(println)
2. reduceByKey(func, [numTasks])
When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the
values for each key are aggregated using the given reduce function func, which must
be of type (V, V) => V. Like in groupByKey, the number of reduce tasks is configurable
through an optional second argument.
Example:
val words = Array("one","two","two","four","five","six","six","eight","nine","ten")
val data = spark.sparkContext.parallelize(words).map(w => (w,1)).reduceByKey(_+_)
data.foreach(println)
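The optional second argument mentioned above sets the number of reduce tasks (output partitions). A minimal sketch, reusing the words array from the example:
// Request 4 partitions for the shuffled result instead of the default.
val data4 = spark.sparkContext.parallelize(words).map(w => (w,1)).reduceByKey(_+_, 4)
println(data4.getNumPartitions)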
3. aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
When called on a dataset of (K, V) pairs, it returns a dataset of (K, U) pairs where the
values for each key are aggregated using the given combine functions and a neutral
'zero' value. This allows an aggregated value type that is different from the input value
type, while avoiding unnecessary allocations. Like in groupByKey, the number of reduce
tasks is configurable through an optional second argument.
Example:
val keysWithValuesList = Array("foo=A", "foo=A", "foo=A", "foo=A", "foo=B", "bar=C",
"bar=D", "bar=D")
val data = sc.parallelize(keysWithValuesList)
// Create key-value pairs
val kv = data.map(_.split("=")).map(v => (v(0), v(1))).cache()
val initialCount = 0
val addToCounts = (n: Int, v: String) => n + 1
val sumPartitionCounts = (p1: Int, p2: Int) => p1 + p2
val countByKey = kv.aggregateByKey(initialCount)(addToCounts, sumPartitionCounts)
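To inspect the result, the aggregated pairs can be collected and printed; for the list above this should yield (foo,5) and (bar,3):
// Collect the per-key counts to the driver and print them.
countByKey.collect().foreach(println)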
4. sortByKey([ascending], [numTasks])
When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset
of (K, V) pairs sorted by keys in ascending or descending order, as specified in the
Boolean ascending argument.
Example:
val data = spark.sparkContext.parallelize(Seq(("maths",52), ("english",75), ("science",82),
("computer",65), ("maths",85)))
val sorted = data.sortByKey()
sorted.foreach(println)
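For descending order, the Boolean ascending argument can be set to false; a minimal sketch reusing data from above:
// Sort by key in descending order.
val sortedDesc = data.sortByKey(false)
sortedDesc.foreach(println)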
5. join(otherDataset, [numTasks])
When called on datasets of type (K, V) and (K, W), it returns a dataset of (K, (V,W))
pairs with all pairs of elements for each key. Outer joins are supported through
leftOuterJoin, rightOuterJoin and fullOuterJoin.
Example:
val data = spark.sparkContext.parallelize(Array(('A',1),('b',2),('c',3)))
val data2 = spark.sparkContext.parallelize(Array(('A',4),('A',6),('b',7),('c',3),('c',8)))
val result = data.join(data2)
println(result.collect().mkString(","))
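The outer-join variants mentioned above keep keys that appear in only one of the two datasets, wrapping the possibly missing side in an Option. A minimal sketch reusing data and data2 from the example:
// leftOuterJoin keeps every key of data; missing matches become None.
val leftJoined = data.leftOuterJoin(data2) // RDD[(Char, (Int, Option[Int]))]
println(leftJoined.collect().mkString(","))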
Difference between groupByKey() and reduceByKey()
groupByKey(): On applying groupByKey() to a dataset of (K, V) pairs, the data is shuffled
according to the key value K into another RDD (Ref. Figure 6.8). In this transformation, a
huge amount of data is unnecessarily transferred over the network.
Spark spills data to disk when more data is shuffled onto a single executor machine
than can fit in memory.
val words = Array("one", "two", "two", "three", "three", "three")
val wordPairsRDD = sc.parallelize(words).map(word => (word, 1))
val wordCountsWithGroup = wordPairsRDD.groupByKey().map(t =>
(t._1, t._2.sum)).collect()
FIGURE 6.8 groupByKey() operation: every individual (key, 1) pair is shuffled across the network and grouped by key before any aggregation, so six (a, 1) pairs and five (b, 1) pairs travel to produce (a, 6) and (b, 5).
FIGURE 6.9 reduceByKey() operation: (key, 1) pairs are first combined within each partition, so only the partial sums (e.g., (a, 1), (a, 2), (a, 3)) are shuffled and merged into the final counts (a, 6) and (b, 5).
reduceByKey(): While both reduceByKey() and groupByKey() produce the same answer, the
reduceByKey() version performs much better on a large dataset (Ref. Figure 6.9), because
Spark knows it can combine the output with a common key on each partition before shuffling
the data.
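For comparison, a minimal sketch of the same word count written with reduceByKey(), reusing wordPairsRDD from the groupByKey() example above:
// Pairs with a common key are combined within each partition first,
// so far less data crosses the network during the shuffle.
val wordCountsWithReduce = wordPairsRDD.reduceByKey(_ + _).collect()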