Using DataFrame operations to transform

A DataFrame created from this API is backed by an RDD underneath, so the DataFrame cannot be mutable. With DataFrames, immutability is even more convenient, because we can add and drop columns dynamically without changing the source dataset.

In this section, we will cover the following topics:

  • Understanding DataFrame immutability
  • Creating two leaves from one root DataFrame
  • Adding a new column by issuing a transformation

We will start by using DataFrame operations to transform our data. First, we need to understand DataFrame immutability, and then we will create two leaves, this time from one root DataFrame. We will then issue a transformation that is slightly different from the RDD one: it will add a new column to the resulting DataFrame, because that is how we manipulate data in a DataFrame. If we want to map data, we take the data from the first column, transform it, and save it into another column, so we end up with two columns. If we are no longer interested in the first column, we can drop it, but the result will be yet another DataFrame.

So, we'll have the first DataFrame with one column, a second one with both the result and the source columns, and a third one with only the result column. Let's look at the code for this section.
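The three-DataFrame lineage described above can be sketched as follows. This is a minimal, self-contained sketch: the local SparkSession setup and the column names (`source`, `result`) are illustrative assumptions, not from the original code.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{concat, lit}

object ImmutableLeavesSketch extends App {
  // Illustrative local session; the book's tests use their own spark value.
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("df-immutability")
    .getOrCreate()
  import spark.implicits._

  // First DataFrame: one source column.
  val root = List("1", "2", "200").toDF("source")

  // Second DataFrame: result plus source -- root itself is untouched.
  val withResult = root.withColumn("result", concat($"source", lit("_t")))

  // Third DataFrame: dropping source leaves only the result column.
  val resultOnly = withResult.drop("source")

  assert(root.columns.sameElements(Array("source")))
  assert(withResult.columns.sameElements(Array("source", "result")))
  assert(resultOnly.columns.sameElements(Array("result")))

  spark.stop()
}
```

Each transformation returns a new DataFrame leaf, which is why `root` still has a single column at the end.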

Since we will be creating a DataFrame, we need to call the toDF() method. We create the UserData with "a" as "1", "b" as "2", and "d" as "200". UserData has two fields, userId and data, both of type String, as shown in the following example:

test("Should use immutable DF API") {
  import spark.sqlContext.implicits._
  //given
  val userData =
    spark.sparkContext.makeRDD(List(
      UserData("a", "1"),
      UserData("b", "2"),
      UserData("d", "200")
    )).toDF()

It is important to create the RDD from a case class in tests because, when we call toDF() on it, this step will infer the schema and name the columns accordingly. The following code is an example of this, where we filter the userData down to the rows whose userId column is in "a":

  //when
  val res = userData.filter(userData("userId").isin("a"))

Our result should have only one record, so we are dropping two rows; however, the userData source that we created still has 3 rows. So, filtering created yet another DataFrame, which we call res, without modifying the input userData, as shown in the following example:

  assert(res.count() == 1)
  assert(userData.count() == 3)

}
}
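The test above assumes a UserData case class defined elsewhere in the test suite. A minimal definition might look like the following sketch; the point is that the field names drive the column names that toDF() infers:

```scala
// Assumed definition: toDF() derives the column names (userId, data)
// from these field names via schema inference.
case class UserData(userId: String, data: String)

// Plain construction works like any Scala case class.
val sample = UserData("a", "1")
assert(sample.userId == "a")
assert(sample.data == "1")
```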

So, let's run this test and see how the immutable DataFrame API behaves, as shown in the following console output:

"C:\Program Files\Java\jdk-12\bin\java.exe" "-javaagent:C:\Program Files\JetBrains\IntelliJ IDEA 2018.3.5\lib\idea_rt.jar=51713:C:\Program Files\JetBrains\IntelliJ IDEA 2018.3.5\bin" -Dfile.encoding=UTF-8 -classpath C:\Users\Sneha\IdeaProjects\Chapter07\out\production\Chapter07 com.company.Main

Process finished with exit code 0

As we can see, our test passes and, from the result (res), we know that our parent was not modified. So, for example, if we want to map something on res, we can map over the userId column, as shown in the following example:

res.map(a => a.getAs[String]("userId") + "can")

Another leaf will have an additional column without changing the userData source, and that is the immutability of a DataFrame.
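The map step above can be sketched end to end as follows. This is a hedged, self-contained sketch: the local session, the tuple-based input, and the "acan" value are illustrative assumptions that mirror, but do not reproduce, the book's test setup.

```scala
import org.apache.spark.sql.SparkSession

object MapLeafSketch extends App {
  // Illustrative local session, not the book's test fixture.
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("map-leaf")
    .getOrCreate()
  import spark.implicits._

  val userData = List(("a", "1")).toDF("userId", "data")

  // Mapping produces a brand-new Dataset[String] leaf;
  // the userData source keeps both of its columns.
  val mapped = userData.map(row => row.getAs[String]("userId") + "can")

  assert(mapped.collect().sameElements(Array("acan")))
  assert(userData.columns.sameElements(Array("userId", "data")))

  spark.stop()
}
```

Note that map on a DataFrame requires the implicits import so that Spark can find an Encoder for the resulting String values.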
