In the above code, the RDD 'wordPair' is created from the existing RDD 'word' using the map() transformation; each element of 'wordPair' holds a word together with its starting character.
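For reference, a minimal sketch of such a pairing (the sample words below are illustrative; only the RDD names 'word' and 'wordPair' come from the text):
// pair each word with its starting character
val word = sc.parallelize(List("spark", "scala", "kafka"))
val wordPair = word.map(w => (w.charAt(0), w))
wordPair.collect()    // e.g. Array((s,spark), (s,scala), (k,kafka))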
Features of Spark RDD
In-memory Computation: The data inside an RDD is kept in memory for as long as you want to keep it there. Keeping the data in memory improves performance by an order of magnitude.
Lazy Evaluation: The data inside RDDs is not evaluated on the go. The transformations are computed only after an action is triggered, which limits how much work Spark has to do.
Fault Tolerance: Upon the failure of a worker node, the lineage of operations lets us recompute the lost partitions of an RDD from the original data. Thus, we can easily recover lost data.
Immutability: RDDs are immutable, meaning that once we create an RDD we cannot modify it. Any transformation we perform creates a new RDD. We achieve consistency through this immutability.
Partitioning: An RDD partitions its records logically and distributes the data across the nodes in the cluster. The divisions are logical and exist only for processing; internally the data has no division. This partitioning is what provides parallelism.
Parallel: RDD processes the data in parallel over the cluster.
Location-Stickiness: RDDs can define placement preferences for computing their partitions. A placement preference is information about where an RDD's data resides. The DAG scheduler places tasks as close to the data as possible, which speeds up computation.
Coarse-grained Operation: We apply coarse-grained transformations to RDDs. Coarse-grained means the operation applies to the whole dataset, not to individual elements of the RDD.
Typed: We can have RDDs of various types, for example RDD[Int], RDD[Long], RDD[String].
No Limitation: We can have any number of RDDs; there is no limit as such. The practical limit depends on the available disk and memory.
Hence, using RDDs we can overcome the shortcomings of Hadoop MapReduce and handle large volumes of data, which reduces processing time. The features of Spark RDDs described above are what make them suitable for fast computation and improve the performance of the system.
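As an illustrative sketch of immutability, partitioning and lazy evaluation (the data and names below are hypothetical):
val numbers = sc.parallelize(1 to 10, 4)   // distributed across 4 logical partitions
numbers.getNumPartitions                   // 4
val doubled = numbers.map(_ * 2)           // creates a new RDD; 'numbers' itself is never modified
doubled.count()                            // only this action triggers the actual computation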
RDD Persistence and Caching Mechanism: Spark RDD persistence is an optimization technique that saves the result of an RDD evaluation. Using it, we save an intermediate result so that we can reuse it if required, which reduces the computation overhead.
We can persist an RDD through the cache() and persist() methods. When we use the cache() method, the RDD is stored in memory. We can persist the RDD in memory and reuse it efficiently across parallel operations, as shown in Figure 6.3.
The difference between cache() and persist() is that cache() uses the default storage level MEMORY_ONLY, while persist() lets us choose among various storage levels. Persistence is a key tool for iterative and interactive workloads, because when we persist an RDD, each node stores in memory any partitions of it that it computes and reuses them in later operations on that dataset. This can speed up subsequent computation by a factor of ten or more.
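For illustration, a minimal sketch of both methods (the file path and the 'ERROR' filter are hypothetical):
val logs = sc.textFile("/usr/local/sparkRDDTest.txt")
val errors = logs.filter(line => line.contains("ERROR"))
errors.cache()                               // equivalent to persist(StorageLevel.MEMORY_ONLY)
errors.count()                               // the first action computes and stores the partitions
errors.filter(_.contains("Spark")).count()   // later actions reuse the cached partitions
// alternatively, pick a storage level explicitly (requires org.apache.spark.storage.StorageLevel):
// errors.persist(StorageLevel.MEMORY_AND_DISK)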
FIGURE 6.3 Persistence and caching in Apache Spark (persisted partitions are kept in RAM or on disk and reused across operations)
When the RDD is computed for the first time, it is kept in memory on the node. Spark's cache is fault tolerant: whenever any partition of an RDD is lost, it can be recomputed using the transformations that originally created it.
The following are some advantages of the RDD caching and persistence mechanism in Spark.
Time efficient
Cost efficient
Reduced execution time
Apache Spark Paired RDD: Spark Paired RDDs are RDDs containing key-value pairs. A key-value pair (KVP) consists of two linked data items: the key is the identifier, and the value is the data corresponding to that key.
Moreover, Spark operations work on RDDs containing any type of object. However, key-value pair RDDs gain a few special operations, such as distributed 'shuffle' operations and grouping or aggregating the elements by key.
In addition, on Spark Paired RDDs containing Tuple2 objects in Scala, these operations are available automatically. The key-value pair operations live in the PairRDDFunctions class, which wraps an RDD of tuples (Ref. Figure 6.4).
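A minimal sketch of a Paired RDD and two key-based operations (the employee IDs and names below are illustrative, echoing Figure 6.4):
val employeePairs = sc.parallelize(List((101, "Employee 1"), (102, "Employee 2"), (101, "Employee 3")))
// groupByKey() and reduceByKey() become available because each element is a Tuple2
employeePairs.groupByKey().collect()   // e.g. Array((101,CompactBuffer(Employee 1, Employee 3)), (102,CompactBuffer(Employee 2)))
val counts = employeePairs.mapValues(_ => 1).reduceByKey(_ + _)
counts.collect()                       // e.g. Array((101,2), (102,1))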
Spark Operations on RDD: As depicted in Figure 6.5, the operations over an RDD are categorized into two parts: Transformations and Actions.
Transformations on RDD: Any function that returns an RDD is a transformation. Elaborating further, a transformation is a function that creates a new dataset from an existing one by passing each element through a function, and it returns a new RDD representing the results.
All transformations in Spark are lazy. They do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (for example, a file). The transformations are only computed when an action requires a result to be returned to the driver program.
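A small sketch of this lazy behaviour, reusing the sample file from the filter() example later in this section (its content is hypothetical):
val lines = sc.textFile("/usr/local/sparkRDDTest.txt")        // nothing is read yet
val sparkLines = lines.filter(line => line.contains("Spark")) // still no computation
sparkLines.count()                                            // the action triggers reading and filtering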
There are two types of transformations and they are explained in detail below.
Narrow Transformation: In a narrow transformation, as shown in Figure 6.6, all the elements that are required to compute the records in a single partition live in a single partition of the parent RDD. Only a limited subset of partitions is used to calculate the result. Narrow transformations are the result of operations such as map() and filter().
FIGURE 6.5 Parts of RDD operations (Spark operations split into Transformations and Actions)
FIGURE 6.6 Narrow transformation
FIGURE 6.4 Paired RDD: an RDD made up of key/value pairs (e.g. ID/Employee pairs)
Wide Transformation: In a wide transformation, as shown in Figure 6.7, the elements that are required to compute the records in a single partition may live in many partitions of the parent RDD. Wide transformations are the result of operations such as groupByKey() and reduceByKey().
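For instance, a small sketch contrasting the two kinds of transformation (the data is illustrative):
val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))
val incremented = pairs.mapValues(_ + 1)   // narrow: each output partition depends on a single parent partition
val summed = pairs.reduceByKey(_ + _)      // wide: values for a key are shuffled across partitions
summed.collect()                           // e.g. Array((a,4), (b,2))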
Transformation on RDD:
S. No. RDD Transformations and Meaning
1. map(func)
Returns a new distributed dataset, formed by passing each element of the source through a function func. The map() transformation iterates over every element of the RDD and produces a new RDD. Using map(), we take any function and apply it to every element of the RDD.
Example:
val dataRDD = sc.parallelize(List(3, 4, 5, 6))
val mapRDD = dataRDD.map(x => x + 2).collect
2. filter(func)
Returns a new dataset formed by selecting those elements of the source on which func returns true. Create a file 'sparkRDDTest.txt' inside '/usr/local' and put some lines about Spark in it.
Example:
val testdataRDD = sc.textFile("/usr/local/sparkRDDTest.txt")
val sparkRDD = testdataRDD.filter(lines => lines.contains("Spark")).collect
3. flatMap(func)
Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item). With flatMap(), each input element can produce many elements in the output RDD. The simplest use of flatMap() is to split each input string into words. See the difference between map() and flatMap() after executing the code below.
Example 1:
val stringRDD = sc.parallelize(List("apache hadoop", "apache spark"))
val flatMapRDD = stringRDD.flatMap(x => x.split(" ")).collect
val mapRDD = stringRDD.map(x => x.split(" ")).collect
FIGURE 6.7 Wide transformation
Example 2:
val x = sc.parallelize(Array(1, 2, 3))
val y = x.flatMap(n => Array(n, n*100, 42))
println(x.collect().mkString(", "))
println(y.collect().mkString(", "))
Output:
x: [1, 2, 3]
y: [1, 100, 42, 2, 200, 42, 3, 300, 42]
4. mapPartitions(func)
Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> ⇒ Iterator<U> when running on an RDD of type T.
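Example (an illustrative sketch; the data and partition count are hypothetical):
val dataRDD = sc.parallelize(List(1, 2, 3, 4, 5, 6), 2)
// func receives an iterator over a whole partition and returns another iterator
val partialSums = dataRDD.mapPartitions(iter => Iterator(iter.sum))
partialSums.collect()    // one partial sum per partition, e.g. Array(6, 15)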
5. groupBy(func)
Group the data in the original RDD. Create pairs where the key is the output of a user function, and the value is all items for which the function yields this key.
Example:
val x = sc.parallelize(Array("Amit", "Sourabh", "Sayan", "Jhon"))
val y = x.groupBy(w => w.charAt(0))
println(y.collect().mkString(", "))
6. sample(withReplacement, fraction, seed)
Sample a fraction of the data, with or without replacement, using a given random
number generator seed.
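Example (an illustrative sketch; the exact output varies with the seed):
val dataRDD = sc.parallelize(1 to 100)
// keep roughly 10% of the elements, without replacement, using seed 42
val sampled = dataRDD.sample(withReplacement = false, fraction = 0.1, seed = 42L)
sampled.collect()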
7. union(otherDataset)
Returns a new dataset that contains the union of the elements in the source dataset and
the argument.
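Example (an illustrative sketch):
val rdd1 = sc.parallelize(List(1, 2, 3))
val rdd2 = sc.parallelize(List(3, 4, 5))
rdd1.union(rdd2).collect()    // union keeps duplicates: Array(1, 2, 3, 3, 4, 5)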
8. intersection(otherDataset)
Returns a new RDD that contains the intersection of elements in the source dataset and the argument.
Example:
val rdd1 = spark.sparkContext.parallelize(Seq((1,"jan",2016), (3,"nov",2014), (16,"feb",2014)))
val rdd2 = spark.sparkContext.parallelize(Seq((5,"dec",2014), (1,"jan",2016)))
val common = rdd1.intersection(rdd2)
common.foreach(println)