Chapter 3. Building and Running a Spark Application

Using Spark interactively through the Spark shell is convenient for exploration, but the work has limited permanence and the shell is not available for Java. Building Spark jobs is a bit trickier than building a normal application because all of the dependencies have to be available on every machine in your cluster. This chapter covers building Java and Scala Spark jobs with Maven or sbt, as well as building Spark jobs with a non-Maven-aware build system.

Building your Spark project with sbt

The sbt tool is a popular build tool for Scala that supports building both Scala and Java code. Building Spark projects with sbt is one of the easiest options because Spark itself is built with sbt. sbt makes it easy to bring in dependencies (which is especially useful for Spark) as well as to package everything into a single deployable JAR file. The current common practice for sbt-based projects is to ship a shell script that bootstraps the specific version of sbt the project uses, which makes installation simpler.

As a first step, take a Spark job that already works and go through the process of creating a build file for it. In the spark directory, begin by copying the GroupByTest example into a new directory as follows:

mkdir -p example-scala-build/src/main/scala/spark/examples/
cp -af sbt example-scala-build/
cp examples/src/main/scala/spark/examples/GroupByTest.scala example-scala-build/src/main/scala/spark/examples/
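Once the build files described in the rest of this section are in place, the resulting layout should look roughly as follows (build.sbt and project/plugins.sbt are created in the next steps, and the exact contents of the sbt/ directory depend on your Spark distribution):

example-scala-build/
  sbt/                                              (sbt launcher copied from Spark)
  build.sbt                                         (created later in this section)
  project/plugins.sbt                               (created later in this section)
  src/main/scala/spark/examples/GroupByTest.scala   (the example being built)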

Since you are going to ship your JAR file to the other machines, you will want to ensure that all of its dependencies are included in it. You can either add each dependency JAR file yourself or use a handy sbt plugin called sbt-assembly to bundle everything into a single JAR file. If you don't have many transitive dependencies, you may decide that using the assembly plugin isn't useful for your project. Instead of using sbt-assembly, you can run sbt/sbt assembly in the Spark project and add the resulting JAR file, core/target/spark-core-assembly-0.7.0.jar, to your classpath. The sbt-assembly plugin is a great tool for avoiding the manual management of a large number of JAR files. To add the assembly plugin to your build, add the following code to project/plugins.sbt:

resolvers += Resolver.url("artifactory", url("http://scalasbt.artifactoryonline.com/scalasbt/sbt-plugin-releases"))(Resolver.ivyStylePatterns)

resolvers += "Typesafe Repository" at "http://repo.typesafe.com/typesafe/releases/"

resolvers += "Spray Repository" at "http://repo.spray.cc/"
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.8.7")

Resolvers are used by sbt so that it can find where a package is; you can think of them as similar to specifying an additional APT PPA (Personal Package Archive) source, except that they apply only to the project you are trying to build. If you load the resolver URLs in your browser, most of them have directory listing turned on, so you can see which packages each resolver provides. These resolvers point to web URLs, but there are also resolvers for local paths, which can be useful during development. The addSbtPlugin directive is deceptively simple; it says to include the sbt-assembly package from com.eed3si9n at version 0.8.7, and it implicitly adds the Scala version and sbt version. Make sure to run sbt reload clean update so that the new plugin is downloaded and installed.
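For example, a local resolver pointing at your local Maven repository can be declared as shown next; this is only an illustration and is not required for this build (Path.userHome is provided by sbt):

resolvers += "Local Maven Repository" at "file://" + Path.userHome.absolutePath + "/.m2/repository"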

The following is the build file for the GroupByTest.scala example as if it were being built on its own; insert this code in ./build.sbt:

//Next two lines only needed if you decide to use the assembly plugin
import AssemblyKeys._
assemblySettings

scalaVersion := "2.9.2"

name := "groupbytest"

libraryDependencies ++= Seq(
  "org.spark-project" % "spark-core_2.9.2" % "0.7.0"
)

resolvers ++= Seq(
  "JBoss Repository" at "http://repository.jboss.org/nexus/content/repositories/releases/",
  "Spray Repository" at "http://repo.spray.cc/",
  "Cloudera Repository" at "https://repository.cloudera.com/artifactory/cloudera-repos/",
  "Akka Repository" at "http://repo.akka.io/releases/",
  "Twitter4J Repository" at "http://twitter4j.org/maven2/"
)

//Only include if using assembly
mergeStrategy in assembly <<= (mergeStrategy in assembly) {
  (old) =>
  {
    case PathList("javax", "servlet", xs @ _*) => MergeStrategy.first
    case PathList("org", "apache", xs @ _*) => MergeStrategy.first
    case "about.html"  => MergeStrategy.rename
    case x => old(x)
  }
}

As you can see, the build file is similar in format to plugins.sbt. There are a few things about this build file that are worth mentioning. Just as with the plugin file, you need to add a number of resolvers so that sbt can find all the dependencies. Note that we are including the dependency as "org.spark-project" % "spark-core_2.9.2" % "0.7.0" rather than using "org.spark-project" %% "spark-core" % "0.7.0". If possible, you should try to use the %% format, which automatically appends the Scala version to the artifact name. Another notable part of this build file is the use of mergeStrategy. Since multiple dependencies can define the same files, you need to tell the plugin how to handle those files when you merge everything into a single JAR file. Other than the merge strategy and manually specifying the Scala version in the Spark artifact name, it is a fairly simple build file.
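For reference, the %% form mentioned above would be written as follows; it works only if a cross-built artifact matching your scalaVersion has been published:

libraryDependencies ++= Seq(
  // sbt appends the Scala version from scalaVersion (here, _2.9.2) to the artifact name
  "org.spark-project" %% "spark-core" % "0.7.0"
)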

Note

If the JDK on the machine where you build is different from the JRE on your workers, you may want to switch the target JVM version by adding the following code to your build file:

javacOptions ++= Seq("-target", "1.6")

Now that your build file is defined, build your GroupByTest Spark job:

sbt/sbt clean compile package

This will produce target/scala-2.9.2/groupbytest_2.9.2-0.1-SNAPSHOT.jar.

Run sbt/sbt assembly in the spark directory to make sure the Spark assembly is available for your classpath. The example requires a pointer to where Spark is installed, via SPARK_HOME, and to the example JAR, via SPARK_EXAMPLES_JAR. We also need to pass the classpath of the JARs we just built to scala with -cp. We can then run the example as follows:

SPARK_HOME="../"  SPARK_EXAMPLES_JAR="./target/scala-2.9.2/groupbytest-assembly-0.1-SNAPSHOT.jar"  scala -cp/users/sparkuser/spark-0.7.0/example-scala-build/target/scala-2.9.2/groupbytest_2.9.2-0.1-SNAPSHOT.jar:/users/sparkuser/spark-0.7.0/core/target/spark-core-assembly-0.7.0.jar spark.examples.GroupByTest local[1]

If you have decided to build all of your dependencies into a single JAR file with the assembly plugin, you need to build that assembly instead:

sbt/sbt assembly

This will produce an assembly snapshot at target/scala-2.9.2/groupbytest-assembly-0.1-SNAPSHOT.jar, which you can then run in a very similar manner, just without spark-core-assembly-0.7.0.jar on the classpath:

SPARK_HOME="../"  SPARK_EXAMPLES_JAR="./target/scala-2.9.2/groupbytest-assembly-0.1-SNAPSHOT.jar" 
 scala -cp /users/sparkuser/spark-0.7.0/example-scala-build/target/scala-2.9.2/groupbytest-assembly-0.1-SNAPSHOT.jar spark.examples.GroupByTest local[1]

Note

You may run into merge issues with sbt assembly if your dependencies have changed; a quick web search will probably provide better, more current guidance than any guesses written here about future merge problems. In general, MergeStrategy.first should work.

Your success with the preceding code may have given you a false sense of security. Because sbt resolves dependencies from the local cache, a dependency that was fetched by another project could mean that your code builds on one machine but not on others. To make sure your build is self-contained, delete your local Ivy cache and run sbt clean. If some files fail to download, try looking at Spark's list of resolvers and adding any missing ones to your build.sbt.
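For example, assuming the default Ivy cache location of ~/.ivy2/cache, a clean rebuild looks roughly like this:

rm -rf ~/.ivy2/cache
sbt/sbt clean compile package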
