Mocking data sources using partial functions

In this section, we will cover the following topics:

  • Creating a Spark component that reads data from Hive
  • Mocking the component
  • Testing the mock component

Let's assume that the following code is our production code:

 ignore("loading data on prod from hive") {
UserDataLogic.loadAndGetAmount(spark, HiveDataLoader.loadUserTransactions)
}

Here, we are using the UserDataLogic.loadAndGetAmount function, which loads our user transaction data and extracts the amount of each transaction. This method takes two arguments: the first is a SparkSession, and the second is a provider function that takes a SparkSession and returns a DataFrame, as shown in the following example:

import org.apache.spark.sql.{DataFrame, SparkSession}

object UserDataLogic {
  def loadAndGetAmount(sparkSession: SparkSession, provider: SparkSession => DataFrame): DataFrame = {
    val df = provider(sparkSession)
    df.select(df("amount"))
  }
}

In production, we load the user transactions through the HiveDataLoader component, which has only one method that executes sparkSession.sql("select * from transactions"), as shown in the following code block:

import org.apache.spark.sql.{DataFrame, SparkSession}

object HiveDataLoader {
  def loadUserTransactions(sparkSession: SparkSession): DataFrame = {
    sparkSession.sql("select * from transactions")
  }
}

This means that the function goes to Hive to retrieve our data and returns a DataFrame. Our logic then executes the provider, which returns a DataFrame, and selects only the amount column from it.
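Note that for this production loader to actually query a Hive table, the SparkSession itself typically has to be created with Hive support enabled. The following is a minimal sketch of such a session; the application name is a placeholder, since the session setup is not shown in this section:

import org.apache.spark.sql.SparkSession

// Hive-enabled session; the application name is a placeholder.
val spark: SparkSession = SparkSession.builder()
  .appName("user-transactions")
  .enableHiveSupport()
  .getOrCreate()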

This logic is not easy to test, because in production our provider interacts with an external system, as in the following call:

UserDataLogic.loadAndGetAmount(spark, HiveDataLoader.loadUserTransactions)

Let's see how to test such a component. First, we will create a DataFrame of user transactions, which is our mock data, as shown in the following example:

import spark.implicits._

val df = spark.sparkContext
  .makeRDD(List(UserTransaction("a", 100), UserTransaction("b", 200)))
  .toDF()
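The UserTransaction case class is not defined in this section; a minimal, hypothetical definition that makes the example compile could look like the following (the real class may carry more fields, but it must expose an amount field for the select to work):

// Hypothetical definition; the real UserTransaction may differ,
// but it needs an amount field.
case class UserTransaction(userId: String, amount: Int)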

If we tested against the real HiveDataLoader, we would need to save this data to Hive, embed Hive in the test, and then start it, which is exactly what we want to avoid.

Since we are using partial functions, we can pass a partial function as the second argument, as shown in the following example:

val res = UserDataLogic.loadAndGetAmount(spark, _ => df)

The first argument is spark, but it is not used in our method this time. The second argument is a function that takes a SparkSession and returns a DataFrame.

However, our logic does not care whether this SparkSession is actually used or whether an external call is made; it only needs the provider to return a DataFrame. We can therefore use _ for the function's argument, because it is ignored, and simply return the DataFrame.

So, loadAndGetAmount receives a mock DataFrame, which is the DataFrame that we created.

The logic shown is transparent with respect to the source: it doesn't matter whether the DataFrame comes from Hive, SQL, Cassandra, or anywhere else, as shown in the following example:

val df = provider(sparkSession)
df.select(df("amount"))

In our example, df comes from the in-memory data that we created for the purposes of the test. Our logic then continues and selects the amount column.

Then, we display the result with res.show(), and the output should end up with a single amount column. Let's run this test, as shown in the following example:
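What follows is a minimal sketch of what the whole test could look like, assuming ScalaTest's FunSuite (AnyFunSuite in newer ScalaTest versions), a local SparkSession, and the hypothetical UserTransaction case class shown earlier; the class and test names are placeholders:

import org.apache.spark.sql.SparkSession
import org.scalatest.FunSuite

class MockingDataSourcesSpec extends FunSuite {

  val spark: SparkSession = SparkSession.builder()
    .master("local[2]")
    .appName("mocking-data-sources")
    .getOrCreate()

  test("should extract amounts from a mocked provider") {
    import spark.implicits._

    // In-memory mock data instead of a real Hive table.
    val df = spark.sparkContext
      .makeRDD(List(UserTransaction("a", 100), UserTransaction("b", 200)))
      .toDF()

    // The provider ignores the SparkSession and simply returns the mock DataFrame.
    val res = UserDataLogic.loadAndGetAmount(spark, _ => df)

    res.show()
    // Expect a single amount column with the values 100 and 200.
    assert(res.columns.toList == List("amount"))
    assert(res.as[Int].collect().toSet == Set(100, 200))
  }
}

Running this test prints the two rows and passes both assertions without any Hive setup.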

We can see from the preceding example that our resulting DataFrame has a single amount column with the values 100 and 200. This means it worked as expected, without the need to start an embedded Hive. The key here is to use a provider and not to embed the select statement within our logic.

In the next section, we'll be using ScalaCheck for property-based tests.
