Using RDD in an immutable way

Now that we know how to create a chain of execution using RDD inheritance, let's learn how to use RDD in an immutable way.

In this section, we will go through the following topics:

  • Understanding DAG immutability
  • Creating two leaves from the one root RDD
  • Examining results from both leaves

Let's first understand directed acyclic graph (DAG) immutability and what it gives us. We will then create two leaves from one root RDD and check that the leaves behave independently of each other when we apply a transformation to one of them. Finally, we will examine the results from both leaves and confirm that a transformation on either leaf does not change the root RDD. This property is essential: if a transformation changed the root RDD, that is, if the RDD were mutable, we could not safely derive yet another leaf from it, because the root would no longer hold the original data. To avoid this, the Spark designers made RDDs immutable.
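The idea of one root with two independent leaves can also be inspected directly through the lineage that Spark records. The following is a minimal sketch, assuming Spark is on the classpath; the object name LineageSketch, the check method, and the appName string are hypothetical and only serve the illustration. It uses the public RDD members id and dependencies to confirm that each map call returns a new RDD whose parent is the untouched root.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper object, not part of the original test suite.
object LineageSketch {

  // Returns three flags: leaf1's parent is the root, leaf2's parent is the
  // root, and all three RDDs have distinct ids (so they are distinct nodes).
  def check(): (Boolean, Boolean, Boolean) = {
    val session = SparkSession
      .builder()
      .master("local[2]")
      .appName("lineage-sketch") // illustrative name
      .getOrCreate()
    val sc = session.sparkContext
    try {
      val root = sc.makeRDD(0 to 5)
      val leaf1 = root.map(_ * 2) // new RDD; root is not modified
      val leaf2 = root.map(_ * 4) // another new RDD from the same root

      // Each leaf records its parent as a dependency in the DAG.
      val leaf1FromRoot = leaf1.dependencies.head.rdd eq root
      val leaf2FromRoot = leaf2.dependencies.head.rdd eq root
      val distinctIds = Set(root.id, leaf1.id, leaf2.id).size == 3

      (leaf1FromRoot, leaf2FromRoot, distinctIds)
    } finally {
      session.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(check())
}
```

Because no action such as collect is called here, the sketch only looks at the graph structure; nothing is computed on the data itself.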

There is a simple test to show that the RDD should be immutable. First, we will create an RDD from the range 0 to 5. The expression 0 to 5 produces a Scala sequence (a Range): to is a method that takes an Int argument, and it becomes available on Int through an implicit conversion from the Scala standard library, as shown in the following example:

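// Assumed imports: Spark plus ScalaTest 3.0.x (FunSuite and should-matchers)
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.scalatest.FunSuite
import org.scalatest.Matchers._

class ImmutableRDD extends FunSuite {
  val spark: SparkContext = SparkSession
    .builder().master("local[2]").getOrCreate().sparkContext

  test("RDD should be immutable") {
    //given
    val data = spark.makeRDD(0 to 5)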
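As a side note on the 0 to 5 expression used above, the following is a small sketch in plain Scala, with no Spark required, of what the compiler does with it. The object name RangeSketch is hypothetical; intWrapper is the implicit conversion from Int to RichInt that the standard library exposes through Predef.

```scala
// Hypothetical object used only to illustrate how `0 to 5` is resolved.
object RangeSketch {

  // Infix form, as used in the test.
  val sugared: Range.Inclusive = 0 to 5

  // The same call with the implicit conversion written out by hand:
  // Int is wrapped in RichInt, and `to` is then called with the Int 5.
  val explicit: Range.Inclusive = Predef.intWrapper(0).to(5)

  def main(args: Array[String]): Unit = {
    println(sugared.toList)      // the six elements 0 through 5
    println(sugared == explicit) // both forms build the same range
  }
}
```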

Once we have our RDD data, we can create the first leaf. The first leaf is the result (res) of mapping every element to its value multiplied by 2. Let's also create a second leaf, leaf2, but this time every element is multiplied by 4, as shown in the following example:

    //when
    val res = data.map(_ * 2)
    val leaf2 = data.map(_ * 4)

So, we have our root RDD and two leaves. First, we will collect the first leaf and see that the elements in it are 0, 2, 4, 6, 8, 10, so everything here is multiplied by 2, as shown in the following example:

    //then
    res.collect().toList should contain theSameElementsAs List(
      0, 2, 4, 6, 8, 10
    )

However, even though we applied that transformation to produce res, data is still exactly the same as it was in the beginning, which is 0, 1, 2, 3, 4, 5, as shown in the following example:

    data.collect().toList should contain theSameElementsAs List(
      0, 1, 2, 3, 4, 5
    )
  }
}

So, everything is immutable, and executing the * 2 transformation didn't change our data. To test leaf2 in the same way, we collect it and call toList. Placed inside the same test block, the assertion checks that it contains the elements 0, 4, 8, 12, 16, 20, as shown in the following example:

    leaf2.collect().toList should contain theSameElementsAs List(
      0, 4, 8, 12, 16, 20
    )

When we run the test, we will see that every path in our execution (the root, that is, data, the first leaf, and the second leaf) behaves independently of the others, as shown in the following code output:

"C:\Program Files\Java\jdk-12\bin\java.exe" "-javaagent:C:\Program Files\JetBrains\IntelliJ IDEA 2018.3.5\lib\idea_rt.jar=51704:C:\Program Files\JetBrains\IntelliJ IDEA 2018.3.5\bin" -Dfile.encoding=UTF-8 -classpath C:\Users\Sneha\IdeaProjects\Chapter07\out\production\Chapter07 com.company.Main

Process finished with exit code 0

Each transformation produces its own, separate result; we can see that the test passed, which shows us that our RDD is immutable.
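The same independence can be sketched outside of ScalaTest. The following is a minimal standalone example, assuming Spark is on the classpath; the object name LeafIndependenceSketch, the run method, the appName string, and the extra filter step on the first leaf are all illustrative additions rather than part of the original test. It derives a further RDD from one leaf and then collects every node to confirm that the root and the other leaf still hold their own values.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical standalone variant of the immutability check.
object LeafIndependenceSketch {

  // Returns the collected contents of root, leaf1, leaf2 and a child of leaf1.
  def run(): (List[Int], List[Int], List[Int], List[Int]) = {
    val session = SparkSession
      .builder()
      .master("local[2]")
      .appName("leaf-independence-sketch") // illustrative name
      .getOrCreate()
    val sc = session.sparkContext
    try {
      val root = sc.makeRDD(0 to 5)
      val leaf1 = root.map(_ * 2)
      val leaf2 = root.map(_ * 4)

      // A further transformation on leaf1 yields yet another new RDD;
      // leaf1, leaf2 and root keep their own definitions.
      val leaf1Child = leaf1.filter(_ > 4)

      (
        root.collect().toList,
        leaf1.collect().toList,
        leaf2.collect().toList,
        leaf1Child.collect().toList
      )
    } finally {
      session.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (root, leaf1, leaf2, leaf1Child) = run()
    println(s"root=$root leaf1=$leaf1 leaf2=$leaf2 leaf1Child=$leaf1Child")
  }
}
```

Collecting the root last, after every other node has been evaluated, would be an equally valid ordering: because no transformation writes back to its parent, the order in which the actions run does not affect any of the results.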
