From past examples, you will know that I favor SBT as a build tool for developing Scala source examples. I have created a development environment on the Linux CentOS 6.5 server called hc2r1m2, using the hadoop development account. The development directory is called h2o_spark_1_2:
[hadoop@hc2r1m2 h2o_spark_1_2]$ pwd
/home/hadoop/spark/h2o_spark_1_2
My SBT build configuration file, named h2o.sbt, is located here; it contains the following:
[hadoop@hc2r1m2 h2o_spark_1_2]$ more h2o.sbt

name := "H 2 O"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.3.0"

libraryDependencies += "org.apache.spark" % "spark-core" % "1.2.0" from "file:///opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/jars/spark-assembly-1.2.0-cdh5.3.3-hadoop2.5.0-cdh5.3.3.jar"

libraryDependencies += "org.apache.spark" % "mllib" % "1.2.0" from "file:///opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/jars/spark-assembly-1.2.0-cdh5.3.3-hadoop2.5.0-cdh5.3.3.jar"

libraryDependencies += "org.apache.spark" % "sql" % "1.2.0" from "file:///opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/jars/spark-assembly-1.2.0-cdh5.3.3-hadoop2.5.0-cdh5.3.3.jar"

libraryDependencies += "org.apache.spark" % "h2o" % "0.2.12-95" from "file:///usr/local/h2o/assembly/build/libs/sparkling-water-assembly-0.2.12-95-all.jar"

libraryDependencies += "hex.deeplearning" % "DeepLearningModel" % "0.2.12-95" from "file:///usr/local/h2o/assembly/build/libs/sparkling-water-assembly-0.2.12-95-all.jar"

libraryDependencies += "hex" % "ModelMetricsBinomial" % "0.2.12-95" from "file:///usr/local/h2o/assembly/build/libs/sparkling-water-assembly-0.2.12-95-all.jar"

libraryDependencies += "water" % "Key" % "0.2.12-95" from "file:///usr/local/h2o/assembly/build/libs/sparkling-water-assembly-0.2.12-95-all.jar"

libraryDependencies += "water" % "fvec" % "0.2.12-95" from "file:///usr/local/h2o/assembly/build/libs/sparkling-water-assembly-0.2.12-95-all.jar"
I have provided SBT configuration examples in the previous chapters, so I won't go into line-by-line detail here. I have used file-based URLs to define the library dependencies, sourcing the Hadoop and Spark JAR files from the Cloudera parcel path of the CDH install. The Sparkling Water JAR path is defined under the /usr/local/h2o/ directory that was just created.
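As a general illustration of the mechanism used above, SBT's `from` clause overrides repository resolution and downloads an artifact directly from an explicit URL; the coordinates and path in this sketch are placeholders, not values from the build above:

```scala
// build.sbt sketch: "from" tells SBT to fetch the JAR from the given URL
// instead of resolving the module from a repository. The group, artifact,
// version, and file path below are hypothetical placeholders.
libraryDependencies += "org.example" % "example-lib" % "1.0" from
  "file:///path/to/local/example-lib-1.0.jar"
```

This is why a single assembly JAR (such as the Spark assembly or the Sparkling Water assembly) can back several logical dependency entries: each entry simply points at the same file.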
I use a Bash script called run_h2o.bash
within this development directory to execute my H2O-based example code. It takes the application class name as a parameter, and is shown below:
[hadoop@hc2r1m2 h2o_spark_1_2]$ more run_h2o.bash

#!/bin/bash

SPARK_HOME=/opt/cloudera/parcels/CDH
SPARK_LIB=$SPARK_HOME/lib
SPARK_BIN=$SPARK_HOME/bin
SPARK_SBIN=$SPARK_HOME/sbin
SPARK_JAR=$SPARK_LIB/spark-assembly-1.2.0-cdh5.3.3-hadoop2.5.0-cdh5.3.3.jar

H2O_PATH=/usr/local/h2o/assembly/build/libs
H2O_JAR=$H2O_PATH/sparkling-water-assembly-0.2.12-95-all.jar

PATH=$SPARK_BIN:$PATH
PATH=$SPARK_SBIN:$PATH
export PATH

cd $SPARK_BIN

./spark-submit \
  --class $1 \
  --master spark://hc2nn.semtech-solutions.co.nz:7077 \
  --executor-memory 85m \
  --total-executor-cores 50 \
  --jars $H2O_JAR \
  /home/hadoop/spark/h2o_spark_1_2/target/scala-2.10/h-2-o_2.10-1.0.jar
This example of Spark application submission has already been covered, so again, I won't go into detail. Setting the executor memory to a correct value was critical to avoiding out-of-memory issues and performance problems; this will be examined in the Performance Tuning section.
As in the previous examples, the application Scala code is located in the src/main/scala
subdirectory, under the development
directory level. The next section will examine the Apache Spark and H2O architectures.
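To make the connection between the build, the script, and the source concrete, here is a sketch of the minimal shape an application class under src/main/scala could take. The object name H2OExample is a hypothetical placeholder, not one of the book's examples; SBT would compile it into the h-2-o_2.10-1.0.jar named in the script, and its fully qualified name is what the script receives as $1.

```scala
// Hypothetical skeleton of an application class submitted by run_h2o.bash.
// The object name (H2OExample) is an illustrative assumption; a real
// example would also create a SparkContext and an H2OContext in main().
object H2OExample {

  // Kept as a value so the entry point's behavior is easy to inspect.
  val greeting: String = "H2OExample running"

  // Entry point invoked by spark-submit for the class given via --class.
  def main(args: Array[String]): Unit = {
    println(greeting)
  }
}
```

It would then be run with a command like `./run_h2o.bash H2OExample`, the class-name argument replacing $1 in the spark-submit call.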