Method 3: Making life easier with Spark testing base

Spark testing base helps you test most of your Spark code with ease. So, what are the pros of this method? There are many, in fact: the resulting test code is succinct rather than verbose; the API is richer than that of plain ScalaTest or JUnit; it supports multiple languages, for example, Scala, Java, and Python; it has built-in RDD comparators; it can be used for testing streaming applications; and, finally and most importantly, it supports both local and cluster mode testing, which is essential when testing in a distributed environment.

The GitHub repo is located at https://github.com/holdenk/spark-testing-base.

Before starting unit testing with Spark testing base, you should include the following dependency in the pom.xml file in your project tree (for a Maven build targeting Spark 2.x):

<dependency>
  <groupId>com.holdenkarau</groupId>
  <artifactId>spark-testing-base_2.10</artifactId>
  <version>2.0.0_0.6.0</version>
</dependency>

For SBT, you can add the following dependency:

"com.holdenkarau" %% "spark-testing-base" % "2.0.0_0.6.0"

Note that, beyond the dependency itself, there are other practical considerations, such as memory requirements, OOMs, and disabling parallel execution. The default Java options in SBT testing are too small to support running multiple tests. Testing Spark code is hard enough when the job is submitted in local mode, so you can imagine how much harder it gets in a real cluster mode, that is, YARN or Mesos. Also note that it is recommended to add the preceding dependency in the test scope, by specifying <scope>test</scope> in Maven and appending % "test" in SBT.
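As a minimal sketch, the test-scoped declarations would look like this (same coordinates as before):

<dependency>
  <groupId>com.holdenkarau</groupId>
  <artifactId>spark-testing-base_2.10</artifactId>
  <version>2.0.0_0.6.0</version>
  <scope>test</scope>
</dependency>

And for SBT:

libraryDependencies += "com.holdenkarau" %% "spark-testing-base" % "2.0.0_0.6.0" % "test"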

To work around the memory problem, you can increase the amount of memory available to the test JVM in the build.sbt file in your project tree by adding the following options:

javaOptions ++= Seq("-Xms512M", "-Xmx2048M", "-XX:MaxPermSize=2048M", "-XX:+CMSClassUnloadingEnabled")
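One caveat, offered as an assumption about your build rather than something shown above: SBT applies javaOptions only when tests run in a forked JVM, so a sketch of the relevant build.sbt fragment would be:

// build.sbt sketch: javaOptions only take effect in a forked test JVM
fork in Test := true
javaOptions ++= Seq("-Xms512M", "-Xmx2048M", "-XX:MaxPermSize=2048M", "-XX:+CMSClassUnloadingEnabled")

Also note that -XX:MaxPermSize only applies to Java 7 and earlier; on Java 8 and later, the permanent generation was replaced by Metaspace and the flag is ignored.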

If you are using Maven with the Surefire plugin instead, you can add the following:

<argLine>-Xmx2048m -XX:MaxPermSize=2048m</argLine>
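For context, <argLine> is not a top-level pom.xml element; it belongs inside the Surefire plugin's <configuration> block. The following is a minimal sketch (the plugin version shown is an assumption; use whatever your build standardizes on), which also sets forkCount and reuseForks as recommended later in this section:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <!-- version is an assumption -->
  <version>2.19.1</version>
  <configuration>
    <argLine>-Xmx2048m -XX:MaxPermSize=2048m</argLine>
    <forkCount>1</forkCount>
    <reuseForks>true</reuseForks>
  </configuration>
</plugin>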

In a Maven-based build, you can also set these values through the MAVEN_OPTS environment variable. For more on this issue, refer to https://maven.apache.org/configure.html.

These values are just an example, sized for running Spark testing base's own tests; you might need to set larger ones. Finally, make sure that you have disabled parallel execution in SBT by adding the following line:

parallelExecution in Test := false
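For reference, on newer sbt versions (1.x), the same setting is written with the slash syntax:

Test / parallelExecution := false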

On the other hand, if you're using Surefire, make sure that forkCount and reuseForks are set to 1 and true, respectively, as in the Surefire sketch shown earlier.

Let's see an example of using Spark testing base. The following source code has three test cases. The first test case is a dummy that checks whether 1 is equal to 1, which will obviously pass. The second test case counts the words in the sentence Hello world My name is Reza and checks whether it contains six words. The final test case compares the result of an RDD transformation against an expected array:

package com.chapter16.SparkTesting

import org.scalatest.Assertions._
import org.apache.spark.rdd.RDD
import com.holdenkarau.spark.testing.SharedSparkContext
import org.scalatest.FunSuite

class TransformationTestWithSparkTestingBase extends FunSuite with SharedSparkContext {
  // Splits each line of the input RDD into words and collects the result
  // on the driver as an Array[Array[String]].
  def tokenize(line: RDD[String]) = {
    line.map(x => x.split(' ')).collect()
  }

  test("works, obviously!") {
    assert(1 == 1)
  }

  test("Words counting") {
    // Split the sentence on non-word characters and check the word count.
    assert(sc.parallelize("Hello world My name is Reza".split("\\W")).count == 6)
  }

  test("Testing RDD transformations using a shared Spark Context") {
    val input = List("Testing", "RDD transformations", "using a shared", "Spark Context")
    val expected = Array(Array("Testing"), Array("RDD", "transformations"), Array("using", "a", "shared"), Array("Spark", "Context"))
    val transformed = tokenize(sc.parallelize(input))
    assert(transformed === expected)
  }
}
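Assuming the SBT setup described earlier, a typical way to run just this suite from the command line would be:

sbt "testOnly com.chapter16.SparkTesting.TransformationTestWithSparkTestingBase"

With Maven, a plain mvn test runs it along with everything else in the test scope.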

The preceding source code shows how straightforward it is to run multiple test cases with Spark testing base. Upon successful execution, you should observe the following output (Figure 13):

Figure 13: A successful execution and passed test using Spark testing base