Integration testing using SparkSession

Let's now learn about integration testing using SparkSession.

In this section, we will cover the following topics:

Leveraging SparkSession for integration testing
Using a unit tested component

Here, we are creating the Spark engine. The following line is crucial for the integration test:

 val spark: SparkContext = SparkSession.builder().master("local[2]").getOrCreate().sparkContext

This line does more than create a lightweight object. SparkSession is a heavy object, and constructing it from scratch is expensive in both resources and time. Tests that create a SparkSession will therefore take more time than the unit tests from the previous section.
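Because construction is the expensive part, one common way to limit the cost is to build the session once and reuse it across tests. The following is a minimal sketch of that idea, not something the section itself prescribes: the SharedSpark object and its members are hypothetical names, and it assumes Spark is on the test classpath.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

// Hypothetical helper (not from the original text): holds one session for all tests.
object SharedSpark {
  // lazy val: the session is built on first access and then reused.
  lazy val session: SparkSession =
    SparkSession.builder().master("local[2]").getOrCreate()

  // Same SparkContext the section obtains through .sparkContext.
  def context: SparkContext = session.sparkContext
}
```

Every test that reads SharedSpark.context after the first one skips the startup cost, since getOrCreate returns the existing session instead of building a new one.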

For the same reason, we should use unit tests to cover all edge cases and reserve integration tests for a smaller part of the logic, such as the most important edge case.

The following example shows the array we are creating:

 val keysWithValuesList =
 Array(
 UserTransaction("A", 100),
 UserTransaction("B", 4),
 UserTransaction("A", 100001),
 UserTransaction("B", 10),
 UserTransaction("C", 10)
 )

The following example shows the RDD we are creating:

 val data = spark.parallelize(keysWithValuesList)

This is the first point at which Spark is involved in our integration test. Creating an RDD is also a time-consuming operation: compared to just creating an array, it is slow because an RDD is also a heavy object.

We will now pass the qualifyForBonus function to data.filter, as shown in the following example:

 val aggregatedTransactionsForUserId = data.filter(BonusVerifier.qualifyForBonus)

This function was already unit tested, so we don't need to consider all edge cases, different IDs, different amounts, and so on. We are just creating a couple of IDs with some amounts to test whether or not our whole chain of logic is working as expected.
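The section uses UserTransaction and BonusVerifier.qualifyForBonus without showing their definitions. The sketch below fills them in purely for illustration: the field names and the bonus rule (ID starting with "A" and an amount above 100000) are assumptions, chosen only so that the sample data yields the result shown in this section. It also shows the kind of edge-case checks that belong at the unit level, with no Spark involved.

```scala
// Hypothetical definitions: the names come from the section, the bodies are assumed.
case class UserTransaction(userId: String, amount: Int)

object BonusVerifier {
  // Assumed rule: the ID starts with "A" and the amount exceeds 100000.
  def qualifyForBonus(userTransaction: UserTransaction): Boolean =
    userTransaction.userId.startsWith("A") && userTransaction.amount > 100000
}

object BonusVerifierChecks {
  def main(args: Array[String]): Unit = {
    // Edge cases are cheap to check here because no Spark object is created.
    assert(BonusVerifier.qualifyForBonus(UserTransaction("A", 100001)))
    assert(!BonusVerifier.qualifyForBonus(UserTransaction("B", 100001)))
    assert(!BonusVerifier.qualifyForBonus(UserTransaction("A", 100000)))
  }
}
```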

After we have applied this logic, our output should be similar to the following:

 UserTransaction("A", 100001)
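Put together, the steps above might look like the following sketch. The session, array, RDD, and filter lines are taken from this section. The UserTransaction and BonusVerifier definitions, the BonusIntegrationCheck wrapper, and the use of a plain assert instead of a test framework are assumptions made to keep the example self-contained.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

// Hypothetical definitions: the section uses these names but does not show them.
case class UserTransaction(userId: String, amount: Int)

object BonusVerifier {
  // Assumed rule, picked so that only UserTransaction("A", 100001) passes.
  def qualifyForBonus(userTransaction: UserTransaction): Boolean =
    userTransaction.userId.startsWith("A") && userTransaction.amount > 100000
}

// Hypothetical wrapper around the chain described in the text.
object BonusIntegrationCheck {
  def run(): List[UserTransaction] = {
    val session = SparkSession.builder().master("local[2]").getOrCreate()
    val spark: SparkContext = session.sparkContext
    try {
      val keysWithValuesList =
        Array(
          UserTransaction("A", 100),
          UserTransaction("B", 4),
          UserTransaction("A", 100001),
          UserTransaction("B", 10),
          UserTransaction("C", 10)
        )
      // Distribute the array as an RDD, then apply the unit-tested predicate.
      val data = spark.parallelize(keysWithValuesList)
      val aggregatedTransactionsForUserId = data.filter(BonusVerifier.qualifyForBonus)
      // Bring the filtered rows back to the driver for the assertion.
      aggregatedTransactionsForUserId.collect().toList
    } finally {
      session.stop() // release the local Spark resources
    }
  }

  def main(args: Array[String]): Unit =
    assert(run() == List(UserTransaction("A", 100001)))
}
```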

Let's run this test and check how long a single integration test takes to execute.

It took around 646 ms to execute this simple test.
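The figure above comes from the test runner. If you want a comparable number without relying on a runner, a small wall-clock helper is enough for a rough measurement. The Timing object and the timed function below are hypothetical names used only for this sketch, and a single run like this is an indication rather than a benchmark.

```scala
object Timing {
  // Runs the block once and returns its result with the elapsed wall-clock time in ms.
  def timed[A](block: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = block
    val elapsedMs = (System.nanoTime() - start) / 1000000L
    (result, elapsedMs)
  }
}

object TimingExample {
  def main(args: Array[String]): Unit = {
    // Any expression can go in the block, for example the body of a test.
    val (sum, elapsedMs) = Timing.timed((1 to 1000).sum)
    println(s"result=$sum, elapsed=$elapsedMs ms")
  }
}
```

Wrapping the integration chain and the plain predicate checks in the same helper gives a like-for-like comparison of the two kinds of test.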

If we wanted to cover every edge case with integration tests, that time would be multiplied by a factor of hundreds compared to the unit tests from the previous section. Let's run the unit test with three edge cases and compare.

We can see that the unit test took only 18 ms, which makes it roughly 35 times faster (646 ms versus 18 ms), even though it covered three edge cases while the integration test covered only one.

Since a lot of logic comes with hundreds of edge cases, we can conclude that it is wise to have unit tests at the lowest possible level.

In the next section, we will be mocking data sources using partial functions.
