Introducing Spark and Kafka | 125
RDD Persistence and Caching Mechanism: Spark RDD persistence is an optimization technique in which the result of an RDD evaluation is saved. By storing this intermediate result, we can reuse it later if required, which reduces computation overhead.
We can persist an RDD through the cache() and persist() methods. The cache() method stores the RDD in memory, where it can be reused efficiently across parallel operations, as shown in Figure 6.3.
The difference between cache() and persist() is that cache() always uses the default storage level, MEMORY_ONLY, while persist() accepts any of the available storage levels (described below). Persistence is a key tool for iterative and interactive algorithms: when we persist an RDD, each node stores the partitions it computes in memory and reuses them in later operations, which often speeds up subsequent computation by more than ten times.
FIGURE 6.3 Persistence and caching in Apache Spark (operations reading persisted partitions from RAM and disk)
When the RDD is computed for the first time, its partitions are kept in memory on the nodes that computed them. Spark's cache is fault tolerant: if any partition of a cached RDD is lost, it can be recomputed using the transformations that originally created it.
The RDD caching and persistence mechanism in Spark offers the following advantages:
• Time efficiency
• Cost efficiency
• Reduced execution time
Apache Spark Paired RDD: Spark paired RDDs are RDDs whose elements are key-value pairs. A key-value pair (KVP) consists of two linked data items: the key is the identifier, and the value is the data associated with that key.
Spark operations work on RDDs containing any type of object. However, key-value pair RDDs support a few special operations, such as distributed 'shuffle' operations and grouping or aggregating the elements by key.
In Scala, these operations are automatically available on paired RDDs containing Tuple2 objects. The key-value pair operations are provided by the PairRDDFunctions class, which wraps around a Spark RDD of tuples (Ref. Figure 6.4).
Spark Operations on RDD: As depicted in Figure 6.5, operations on an RDD fall into two categories: transformations and actions.
Transformations on RDD: Any function that returns an RDD is a transformation. More precisely, a transformation is a function that creates a new data set from an existing one by passing each element of the data set through a function and returning a new RDD that represents the results.