From past examples, you know that we favor sbt as a build tool for developing Scala source examples.
We have created a development environment on the Linux server called hc2r1m2 using the Hadoop development account. The development directory is called h2o_spark_1_2:
[hadoop@hc2r1m2 h2o_spark_1_2]$ pwd
/home/hadoop/spark/h2o_spark_1_2
Our sbt build configuration file, h2o.sbt, is located here; it contains the following:
[hadoop@hc2r1m2 h2o_spark_1_2]$ more h2o.sbt
name := "H 2 O"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.3.0"
libraryDependencies += "org.apache.spark" % "spark-core" % "1.2.0" from "file:///usr/hdp/2.6.0.3-8/spark/lib/spark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar"
libraryDependencies += "org.apache.spark" % "mllib" % "1.2.0" from "file:///usr/hdp/2.6.0.3-8/spark/lib/spark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar"
libraryDependencies += "org.apache.spark" % "sql" % "1.2.0" from "file:///usr/hdp/2.6.0.3-8/spark/lib/spark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar"
libraryDependencies += "org.apache.spark" % "h2o" % "0.2.12-95" from "file:///usr/local/h2o/assembly/build/libs/sparkling-water-assembly-0.2.12-95-all.jar"
libraryDependencies += "hex.deeplearning" % "DeepLearningModel" % "0.2.12-95" from "file:///usr/local/h2o/assembly/build/libs/sparkling-water-assembly-0.2.12-95-all.jar"
libraryDependencies += "hex" % "ModelMetricsBinomial" % "0.2.12-95" from "file:///usr/local/h2o/assembly/build/libs/sparkling-water-assembly-0.2.12-95-all.jar"
libraryDependencies += "water" % "Key" % "0.2.12-95" from "file:///usr/local/h2o/assembly/build/libs/sparkling-water-assembly-0.2.12-95-all.jar"
libraryDependencies += "water" % "fvec" % "0.2.12-95" from "file:///usr/local/h2o/assembly/build/libs/sparkling-water-assembly-0.2.12-95-all.jar"
We provided sbt configuration examples in the previous chapters, so we won't go into line-by-line detail here. We have used file-based URLs to define the library dependencies and sourced the Hadoop JAR files from the Hadoop home.
The Sparkling Water JAR path is defined under /usr/local/h2o/, the directory that was just created.
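As a quick check on this configuration, the following minimal shell sketch estimates the name of the JAR file that sbt package would write under the target directory. It assumes that sbt normalizes the project name by lowercasing it and replacing each run of non-word characters with a hyphen, and that the Scala binary version is the first two components of scalaVersion. The function expected_jar_name is a hypothetical helper for illustration; it is not part of sbt.

```shell
#!/bin/bash
# Sketch: predict the JAR file name produced by `sbt package`.
# Assumption: the name is lowercased and each run of non-word characters
# becomes a hyphen; the Scala binary version and the project version are
# then appended. expected_jar_name is a hypothetical helper.

expected_jar_name() {
  local name="$1" scala_version="$2" version="$3"
  local normalized scala_binary

  # "H 2 O" -> "h 2 o" -> "h-2-o"
  normalized=$(printf '%s' "$name" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/[^a-z0-9_]+/-/g')

  # Drop the last version component: 2.10.4 -> 2.10
  scala_binary="${scala_version%.*}"

  printf '%s_%s-%s.jar\n' "$normalized" "$scala_binary" "$version"
}

# Values taken from h2o.sbt
expected_jar_name "H 2 O" "2.10.4" "1.0"
```

For the values in h2o.sbt, the sketch prints h-2-o_2.10-1.0.jar, which is the file name that the run_h2o.bash script passes to spark-submit.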
We use a Bash script called run_h2o.bash within this development directory to execute our H2O-based example code. It takes the application class name as a parameter and is shown as follows:
[hadoop@hc2r1m2 h2o_spark_1_2]$ more run_h2o.bash
#!/bin/bash
SPARK_HOME=/usr/hdp/current/spark-client
SPARK_LIB=$SPARK_HOME/lib
SPARK_BIN=$SPARK_HOME/bin
SPARK_SBIN=$SPARK_HOME/sbin
SPARK_JAR=$SPARK_LIB/spark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar
H2O_PATH=/usr/local/h2o/assembly/build/libs
H2O_JAR=$H2O_PATH/sparkling-water-assembly-0.2.12-95-all.jar
PATH=$SPARK_BIN:$PATH
PATH=$SPARK_SBIN:$PATH
export PATH
cd $SPARK_BIN
./spark-submit \
  --class $1 \
  --master spark://hc2nn.semtech-solutions.co.nz:7077 \
  --executor-memory 512m \
  --total-executor-cores 50 \
  --jars $H2O_JAR \
  /home/hadoop/spark/h2o_spark_1_2/target/scala-2.10/h-2-o_2.10-1.0.jar
This example of Spark application submission has already been covered, so again, we won't go into detail. Setting the executor memory to an appropriate value is critical to avoiding out-of-memory issues and performance problems. This is examined in the Performance tuning section.
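Because the script takes the application class name as its only parameter, a small pre-flight check can catch a missing argument or a missing JAR file before spark-submit is invoked. The following is a minimal sketch under that assumption; check_submit_inputs is a hypothetical helper and is not part of the original run_h2o.bash script.

```shell
#!/bin/bash
# Sketch of a pre-flight check for a spark-submit wrapper script.
# check_submit_inputs is a hypothetical helper, not part of run_h2o.bash.
# Arguments: the application class name, followed by the JAR paths to verify.

check_submit_inputs() {
  local class_name="$1"
  shift

  # The class name is mandatory
  if [ -z "$class_name" ]; then
    echo "usage: run_h2o.bash <application class name>" >&2
    return 1
  fi

  # Every JAR passed to spark-submit must exist as a regular file
  local jar
  for jar in "$@"; do
    if [ ! -f "$jar" ]; then
      echo "missing JAR: $jar" >&2
      return 1
    fi
  done

  return 0
}
```

Within a wrapper script, a call such as check_submit_inputs "$1" "$H2O_JAR" || exit 1, placed before the spark-submit command, would stop the run early with a readable message.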
As in the previous examples, the application Scala code is located in the src/main/scala subdirectory under the development directory. The next section will examine the Apache Spark and H2O architectures.