In the above code, the RDD 'wordPair' is created from the existing RDD 'word' using the map() transformation; each element of 'wordPair' holds a word together with its starting character.
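For reference, a minimal sketch of such a pairing (the sample words below are illustrative; only the RDD names 'word' and 'wordPair' come from the text):
// pair each word with its starting character
val word = sc.parallelize(List("spark", "scala", "kafka"))
val wordPair = word.map(w => (w.charAt(0), w))
wordPair.collect()    // e.g. Array((s,spark), (s,scala), (k,kafka))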
Features of Spark RDD
In-memory Computation: The data inside an RDD is kept in memory for as long as you want to keep it there. Keeping the data in memory improves performance by an order of magnitude.
Lazy Evaluation: The data inside RDDs is not evaluated on the go. The transformations are computed only after an action is triggered, which limits how much work Spark has to do.
Fault Tolerance: Upon the failure of a worker node, the lineage of operations lets us recompute the lost partitions of an RDD from the original data. Thus, we can easily recover lost data.
Immutability: RDDs are immutable, meaning that once we create an RDD we cannot modify it. Any transformation we perform creates a new RDD. We achieve consistency through this immutability.
Partitioning: An RDD partitions its records logically and distributes the data across the nodes in the cluster. The divisions are logical and exist only for processing; internally the data has no division. This partitioning is what provides parallelism.
Parallel: RDD processes the data in parallel over the cluster.
Location-Stickiness: RDDs can define placement preferences for computing their partitions. A placement preference is information about where an RDD's data resides. The DAG scheduler places tasks as close to the data as possible, which speeds up computation.
Coarse-grained Operation: We apply coarse-grained transformations to RDDs. Coarse-grained means the operation applies to the whole dataset, not to individual elements of the RDD.
Typed: We can have RDDs of various types, for example RDD[Int], RDD[Long], RDD[String].
No Limitation: We can have any number of RDDs; there is no limit as such. The practical limit depends on the available disk and memory.
Hence, using RDDs we can overcome the shortcomings of Hadoop MapReduce and handle large volumes of data, which reduces processing time. The features of Spark RDDs described above are what make them suitable for fast computation and improve the performance of the system.
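As an illustrative sketch of immutability, partitioning and lazy evaluation (the data and names below are hypothetical):
val numbers = sc.parallelize(1 to 10, 4)   // distributed across 4 logical partitions
numbers.getNumPartitions                   // 4
val doubled = numbers.map(_ * 2)           // creates a new RDD; 'numbers' itself is never modified
doubled.count()                            // only this action triggers the actual computation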
RDD Persistence and Caching Mechanism: Spark RDD persistence is an optimization technique that saves the result of an RDD evaluation. Using it, we save an intermediate result so that we can reuse it if required, which reduces the computation overhead.
We can persist an RDD through the cache() and persist() methods. When we use the cache() method, the RDD is stored in memory. We can persist the RDD in memory and reuse it efficiently across parallel operations, as shown in Figure 6.3.
The difference between cache() and persist() is that cache() uses the default storage level MEMORY_ONLY, while persist() lets us choose among various storage levels. Persistence is a key tool for iterative and interactive workloads, because when we persist an RDD, each node stores in memory any partitions of it that it computes and reuses them in later operations on that dataset. This can speed up subsequent computation by a factor of ten or more.
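For illustration, a minimal sketch of both methods (the file path and the 'ERROR' filter are hypothetical):
val logs = sc.textFile("/usr/local/sparkRDDTest.txt")
val errors = logs.filter(line => line.contains("ERROR"))
errors.cache()                               // equivalent to persist(StorageLevel.MEMORY_ONLY)
errors.count()                               // the first action computes and stores the partitions
errors.filter(_.contains("Spark")).count()   // later actions reuse the cached partitions
// alternatively, pick a storage level explicitly (requires org.apache.spark.storage.StorageLevel):
// errors.persist(StorageLevel.MEMORY_AND_DISK)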
FIGURE 6.3 Persistence and caching in Apache Spark (persisted partitions are kept in RAM or on disk and reused across operations)
When the RDD is computed for the first time, it is kept in memory on the node. Spark's cache is fault tolerant: whenever any partition of an RDD is lost, it can be recomputed using the transformations that originally created it.
The following are some advantages of the RDD caching and persistence mechanism in Spark.
Time efficient
Cost efficient
Reduced execution time
Apache Spark Paired RDD: Spark Paired RDDs are RDDs containing key-value pairs. A key-value pair (KVP) consists of two linked data items: the key is the identifier, and the value is the data corresponding to that key.
Moreover, Spark operations work on RDDs containing any type of object. However, key-value pair RDDs gain a few special operations, such as distributed 'shuffle' operations and grouping or aggregating the elements by key.
In addition, on Spark Paired RDDs containing Tuple2 objects in Scala, these operations are available automatically. The key-value pair operations live in the PairRDDFunctions class, which wraps an RDD of tuples (Ref. Figure 6.4).
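A minimal sketch of a Paired RDD and two key-based operations (the employee IDs and names below are illustrative, echoing Figure 6.4):
val employeePairs = sc.parallelize(List((101, "Employee 1"), (102, "Employee 2"), (101, "Employee 3")))
// groupByKey() and reduceByKey() become available because each element is a Tuple2
employeePairs.groupByKey().collect()   // e.g. Array((101,CompactBuffer(Employee 1, Employee 3)), (102,CompactBuffer(Employee 2)))
val counts = employeePairs.mapValues(_ => 1).reduceByKey(_ + _)
counts.collect()                       // e.g. Array((101,2), (102,1))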
Spark Operations on RDD: As depicted in Figure 6.5, the operations over an RDD are categorized into two parts: Transformations and Actions.
Transformations on RDD: Any function that returns an RDD is a transformation. Elaborating further, a transformation is a function that creates a new dataset from an existing one by passing each element through a function, and it returns a new RDD representing the results.
All transformations in Spark are lazy. They do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (for example, a file). The transformations are only computed when an action requires a result to be returned to the driver program.
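A small sketch of this lazy behaviour, reusing the sample file from the filter() example later in this section (its content is hypothetical):
val lines = sc.textFile("/usr/local/sparkRDDTest.txt")        // nothing is read yet
val sparkLines = lines.filter(line => line.contains("Spark")) // still no computation
sparkLines.count()                                            // the action triggers reading and filtering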
There are two types of transformations and they are explained in detail below.
Narrow Transformation: In a narrow transformation, as shown in Figure 6.6, all the elements that are required to compute the records in a single partition live in a single partition of the parent RDD. Only a limited subset of partitions is used to calculate the result. Narrow transformations are the result of operations such as map() and filter().
FIGURE 6.5 Parts of RDD operations (Spark operations split into Transformations and Actions)
FIGURE 6.6 Narrow transformation
FIGURE 6.4 Paired RDD: an RDD made up of key/value pairs (e.g. ID/Employee pairs)
Wide Transformation: In a wide transformation, as shown in Figure 6.7, the elements that are required to compute the records in a single partition may live in many partitions of the parent RDD. Wide transformations are the result of operations such as groupByKey() and reduceByKey().
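For instance, a small sketch contrasting the two kinds of transformation (the data is illustrative):
val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))
val incremented = pairs.mapValues(_ + 1)   // narrow: each output partition depends on a single parent partition
val summed = pairs.reduceByKey(_ + _)      // wide: values for a key are shuffled across partitions
summed.collect()                           // e.g. Array((a,4), (b,2))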
Transformation on RDD:
S. No. RDD Transformations and Meaning
1. map(func)
Returns a new distributed dataset, formed by passing each element of the source through a function func. The map() transformation iterates over every element of the RDD and produces a new RDD. Using map(), we take any function and apply it to every element of the RDD.
Example:
val dataRDD = sc.parallelize(List(3, 4, 5, 6))
val mapRDD = dataRDD.map(x => x + 2).collect
2. filter(func)
Returns a new dataset formed by selecting those elements of the source on which func returns true. Create a file 'sparkRDDTest.txt' inside '/usr/local' and put some lines about Spark in it.
Example:
val testdataRDD = sc.textFile("/usr/local/sparkRDDTest.txt")
val sparkRDD = testdataRDD.filter(lines => lines.contains("Spark")).collect
3. flatMap(func)
Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item). With flatMap(), each input element can produce many elements in the output RDD. The simplest use of flatMap() is to split each input string into words. See the difference between map() and flatMap() after executing the code below.
Example 1:
val stringRDD = sc.parallelize(List("apache hadoop", "apache spark"))
val flatMapRDD = stringRDD.flatMap(x => x.split(" ")).collect
val mapRDD = stringRDD.map(x => x.split(" ")).collect
FIGURE 6.7 Wide transformation
Example 2:
val x = sc.parallelize(Array(1, 2, 3))
val y = x.flatMap(n => Array(n, n*100, 42))
println(x.collect().mkString(", "))
println(y.collect().mkString(", "))
Output:
x: [1, 2, 3]
y: [1, 100, 42, 2, 200, 42, 3, 300, 42]
4. mapPartitions(func)
Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> ⇒ Iterator<U> when running on an RDD of type T.
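Example (an illustrative sketch; the data and partition count are hypothetical):
val dataRDD = sc.parallelize(List(1, 2, 3, 4, 5, 6), 2)
// func receives an iterator over a whole partition and returns another iterator
val partialSums = dataRDD.mapPartitions(iter => Iterator(iter.sum))
partialSums.collect()    // one partial sum per partition, e.g. Array(6, 15)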
5. groupBy(func)
Group the data in the original RDD. Create pairs where the key is the output of a user function, and the value is all items for which the function yields this key.
Example:
val x = sc.parallelize(Array("Amit", "Sourabh", "Sayan", "Jhon"))
val y = x.groupBy(w => w.charAt(0))
println(y.collect().mkString(", "))
6. sample(withReplacement, fraction, seed)
Sample a fraction of the data, with or without replacement, using a given random
number generator seed.
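Example (an illustrative sketch; the exact output varies with the seed):
val dataRDD = sc.parallelize(1 to 100)
// keep roughly 10% of the elements, without replacement, using seed 42
val sampled = dataRDD.sample(withReplacement = false, fraction = 0.1, seed = 42L)
sampled.collect()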
7. union(otherDataset)
Returns a new dataset that contains the union of the elements in the source dataset and
the argument.
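Example (an illustrative sketch):
val rdd1 = sc.parallelize(List(1, 2, 3))
val rdd2 = sc.parallelize(List(3, 4, 5))
rdd1.union(rdd2).collect()    // union keeps duplicates: Array(1, 2, 3, 3, 4, 5)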
8. intersection(otherDataset)
Returns a new RDD that contains the intersection of elements in the source dataset and the argument.
Example:
val rdd1 = spark.sparkContext.parallelize(Seq((1,"jan",2016), (3,"nov",2014), (16,"feb",2014)))
val rdd2 = spark.sparkContext.parallelize(Seq((5,"dec",2014), (1,"jan",2016)))
val common = rdd1.intersection(rdd2)
common.foreach(println)