Testing in different versions of Spark

In this section, we will cover the following topics:

  • Changing the component to work with Spark pre-2.x
  • Mock testing pre-2.x
  • RDD mock testing

Let's start with the mocking of data sources that we covered in the third section of this chapter, Mocking data sources using partial functions.

There, we were testing UserDataLogic.loadAndGetAmount; notice that everything operated on DataFrames, so we had both a SparkSession and a DataFrame available.
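As a reminder, the key idea of that test was dependency injection: the logic takes a provider function, so a test can substitute fake data for the real Hive query. The following is a minimal plain-Scala analogue of that pattern (the names and simplified signatures here are illustrative, not the book's actual Spark code, so that the shape of the technique is visible without a Spark cluster):

```scala
// Minimal analogue of the provider-injection pattern used in the
// DataFrame-based test; Spark itself is not required to see the idea.
case class UserTransaction(userId: String, amount: Int)

object UserDataLogicSketch {
  // The production provider would query Hive; a test passes a stub instead.
  def loadAndGetAmount(provider: () => Seq[UserTransaction]): Seq[Int] =
    provider().map(_.amount)
}

// In a test, we inject fake data instead of touching Hive:
val res = UserDataLogicSketch.loadAndGetAmount(
  () => Seq(UserTransaction("a", 100), UserTransaction("b", 200))
)
println(res) // List(100, 200)
```

Because the loader is just a function parameter, swapping the real Hive-backed provider for a stub requires no mocking framework at all.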

Now, let's compare this to Spark pre-2.x, where DataFrames are not available. Let's assume that the following test expresses the same logic for those earlier versions of Spark:

test("mock loading data from hive") {
  //given
  import spark.sqlContext.implicits._
  val rdd = spark.sparkContext
    .makeRDD(List(UserTransaction("a", 100), UserTransaction("b", 200)))
    .toDF()
    .rdd

  //when
  val res = UserDataLogicPre2.loadAndGetAmount(spark, _ => rdd)

  //then
  println(res.collect().toList)
}

In the previous section, loadAndGetAmount took spark and a DataFrame; in this example, however, the provider returns an RDD rather than a DataFrame, so we pass an rdd:

 val res = UserDataLogicPre2.loadAndGetAmount(spark, _ => rdd)

However, we need to create a different UserDataLogicPre2 for Spark pre-2.x, one that takes a SparkSession and a provider returning an RDD[Row], and maps that RDD to an RDD[Int], as shown in the following example:

object UserDataLogicPre2 {
  def loadAndGetAmount(sparkSession: SparkSession, provider: SparkSession => RDD[Row]): RDD[Int] = {
    provider(sparkSession).map(_.getAs[Int]("amount"))
  }
}

object HiveDataLoaderPre2 {
  def loadUserTransactions(sparkSession: SparkSession): RDD[Row] = {
    sparkSession.sql("select * from transactions").rdd
  }
}

In the preceding code, loadAndGetAmount executes the provider and then maps every element of the resulting RDD, extracting the amount field as an Int. Row is a generic record type that can hold a variable number of fields of different types.
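The behavior of getAs[T] can be illustrated with a tiny stand-in for Row (a sketch only; Spark's real org.apache.spark.sql.Row is richer than this): values of mixed types are stored together, and getAs casts a named field to the requested type.

```scala
// Toy stand-in for Spark's Row, illustrating how getAs[T] retrieves a
// field by name and casts it to the expected type.
final case class FakeRow(fields: Map[String, Any]) {
  def getAs[T](fieldName: String): T = fields(fieldName).asInstanceOf[T]
}

val row = FakeRow(Map("userId" -> "a", "amount" -> 100))
val amount: Int = row.getAs[Int]("amount")     // 100
val user: String = row.getAs[String]("userId") // "a"
```

The cast is unchecked, so asking for the wrong type (for example, getAs[String]("amount")) fails only at runtime, which is also how Spark's Row behaves.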

In Spark pre-2.x, we do not have a SparkSession, so we need to use SparkContext (and SQLContext or HiveContext for SQL queries) and change our logic accordingly.
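One way to limit the churn between versions is to keep the session type out of the logic entirely and let only the provider know about it. The following is a sketch of that idea in plain Scala, with a dummy session standing in for SQLContext or SparkSession (all names here are illustrative, not from the book):

```scala
// Sketch: the logic is parameterized over the session type S, so the same
// code compiles against SQLContext (pre-2.x) or SparkSession (2.x); only
// the injected provider depends on the concrete session API.
object VersionAgnosticLogic {
  def loadAndGetAmount[S](session: S, provider: S => Seq[Map[String, Any]]): Seq[Int] =
    provider(session).map(_("amount").asInstanceOf[Int])
}

// A dummy "session" stands in for SQLContext/SparkSession in a test:
case class DummySession(name: String)

val res = VersionAgnosticLogic.loadAndGetAmount(
  DummySession("pre-2.x"),
  (_: DummySession) => Seq(Map("amount" -> 100), Map("amount" -> 200))
)
```

Because the session only flows through to the provider, a test never has to construct a real SQLContext or SparkSession at all.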