Building your Spark job with Maven

Maven is an open source Apache build tool that can be used to build Spark jobs in Java or Scala. As with sbt, you can pull in the Spark dependency through Maven, simplifying the build process. Also as with sbt, you can either bundle Spark and all of your other dependencies into a single JAR file using a plugin, or build Spark as a monolithic JAR using sbt/sbt assembly and include it on the classpath.

To illustrate the build process for Spark jobs with Maven, this section will use Java as an example, since Maven is more commonly used to build Java projects. As a first step, let's take a Spark job that already works and go through the process of creating a build file for it. We can start by copying the JavaWordCount example into a new directory and generating the Maven template as follows:

mkdir example-java-build/; cd example-java-build
mvn archetype:generate \
   -DarchetypeGroupId=org.apache.maven.archetypes \
   -DgroupId=spark.examples \
   -DartifactId=JavaWordCount \
   -Dfilter=org.apache.maven.archetypes:maven-archetype-quickstart
cp ../examples/src/main/java/spark/examples/JavaWordCount.java JavaWordCount/src/main/java/spark/examples/JavaWordCount.java
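
For reference, the core of the JavaWordCount example we just copied looks roughly like the following. This is a sketch based on the Spark 0.7-era Java API (the spark.api.java packages), so the file in your Spark checkout may differ in the details:

package spark.examples;

import java.util.Arrays;
import java.util.List;

import scala.Tuple2;
import spark.api.java.JavaPairRDD;
import spark.api.java.JavaRDD;
import spark.api.java.JavaSparkContext;
import spark.api.java.function.FlatMapFunction;
import spark.api.java.function.Function2;
import spark.api.java.function.PairFunction;

public class JavaWordCount {
  public static void main(String[] args) throws Exception {
    // args[0] is the master (for example, local[1]); args[1] is the input file.
    // SPARK_HOME and SPARK_EXAMPLES_JAR tell Spark where to find itself and our job JAR.
    JavaSparkContext ctx = new JavaSparkContext(args[0], "JavaWordCount",
        System.getenv("SPARK_HOME"), System.getenv("SPARK_EXAMPLES_JAR"));
    JavaRDD<String> lines = ctx.textFile(args[1], 1);

    // Split each line into words
    JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
      public Iterable<String> call(String s) {
        return Arrays.asList(s.split(" "));
      }
    });

    // Pair each word with a count of one
    JavaPairRDD<String, Integer> ones = words.map(
        new PairFunction<String, String, Integer>() {
      public Tuple2<String, Integer> call(String s) {
        return new Tuple2<String, Integer>(s, 1);
      }
    });

    // Sum the counts for each word
    JavaPairRDD<String, Integer> counts = ones.reduceByKey(
        new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer i1, Integer i2) {
        return i1 + i2;
      }
    });

    // Collect the results back to the driver and print them
    List<Tuple2<String, Integer>> output = counts.collect();
    for (Tuple2<String, Integer> tuple : output) {
      System.out.println(tuple._1() + ": " + tuple._2());
    }
    System.exit(0);
  }
}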

Next, update your Maven pom.xml to include information about the version of Spark we are using. Also, since the example file we are working with requires JDK 1.5, we will need to update the Java version that Maven is configured to use; at the time of writing, it defaults to 1.3. In between the <project> tags, we will need to add the following code:

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.spark-project</groupId>
      <artifactId>spark-core_2.9.2</artifactId>
      <version>0.7.0</version>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration>
          <source>1.5</source>
          <target>1.5</target>
        </configuration>
      </plugin>
    </plugins>
  </build>
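
At this point, it is worth checking that the project still builds. The following are standard Maven commands (not specific to Spark): mvn clean compile compiles the example against the new settings, and mvn dependency:tree prints the resolved dependency graph so that you can confirm spark-core was pulled in:

mvn clean compile
mvn dependency:tree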

We can now build our JAR with the mvn package command. The resulting JAR can be run using the following command:

SPARK_HOME="../" SPARK_EXAMPLES_JAR="./target/JavaWordCount-1.0-SNAPSHOT.jar" \
  java -cp ./target/JavaWordCount-1.0-SNAPSHOT.jar:../../core/target/spark-core-assembly-0.7.0.jar \
  spark.examples.JavaWordCount local[1] ../../README

As with sbt, we can use a plugin to include all the dependencies in our JAR file. In between the <plugins> tags, add the following code:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>1.7</version>
  <configuration>
    <!-- These transformers are needed so that merging of the Akka configuration files (reference.conf) works -->
    <transformers>
      <transformer implementation="org.apache.maven.plugins.shade.resource.ApacheLicenseResourceTransformer">
      </transformer>
      <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
        <resource>reference.conf</resource>
      </transformer>
    </transformers>
  </configuration>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
    </execution>
  </executions>
</plugin>

Then run mvn package, and the resulting JAR file can be run in the same way as the preceding command, but with the Spark assembly JAR left out of the classpath.
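
For example, assuming the same directory layout as in the earlier run, the shaded JAR can be run on its own as follows (a sketch; only the classpath has changed):

mvn package
SPARK_HOME="../" SPARK_EXAMPLES_JAR="./target/JavaWordCount-1.0-SNAPSHOT.jar" \
  java -cp ./target/JavaWordCount-1.0-SNAPSHOT.jar \
  spark.examples.JavaWordCount local[1] ../../README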
