The development environment

The Scala language will be used for the coding samples in this book because, as a concise functional language, it requires less code than Java to express the same logic. It can also be used interactively from the Spark shell as well as compiled into Apache Spark applications. We will be using the sbt tool to compile the Scala code, which we have installed into the Hortonworks HDP 2.6 Sandbox as follows:
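As a small illustration of that conciseness (a standalone sketch, not part of the book's nbayes project), the following plain-Scala snippet filters and transforms a list in a single expression, where Java would typically need an explicit loop or stream pipeline:

```scala
// Hypothetical example: square the even numbers in a list.
// In Scala this is one chained expression over immutable collections.
object Conciseness {
  def squaresOfEvens(xs: List[Int]): List[Int] =
    xs.filter(_ % 2 == 0).map(n => n * n)

  def main(args: Array[String]): Unit = {
    // Keeps 2 and 4, then squares them.
    println(squaresOfEvens(List(1, 2, 3, 4, 5))) // List(4, 16)
  }
}
```

The same style carries over directly to Spark, where RDD and DataFrame operations are expressed as chained transformations like these.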

[hadoop@hc2nn ~]# sudo su -
[root@hc2nn ~]# cd /tmp
[root@hc2nn ~]# wget http://repo.scala-sbt.org/scalasbt/sbt-native-packages/org/scala-sbt/sbt/0.13.1/sbt.rpm
[root@hc2nn ~]# rpm -ivh sbt.rpm

The following URL provides instructions for installing sbt on other operating systems, including Windows, Linux, and macOS: http://www.scala-sbt.org/0.13/docs/Setup.html.

We used a generic Linux account called hadoop. As the previous commands show, sbt must be installed as the root user, which we accessed via sudo su - (switch user). We then downloaded the sbt.rpm file to the /tmp directory from the web server repo.scala-sbt.org using wget. Finally, we installed the rpm file using the rpm command with the options i for install, v for verbose output, and h to print hash marks showing installation progress.

We developed all of the Scala code for Apache Spark in this chapter on the Linux server, using the Linux hadoop account. We placed each set of code within a subdirectory under /home/hadoop/spark. For instance, the following sbt structure diagram shows that the MLlib Naive Bayes code is stored in a subdirectory called nbayes under the spark directory. It also shows that the Scala code is developed within a subdirectory structure named src/main/scala under the nbayes directory. The files bayes1.scala and convert.scala contain the Naive Bayes code that will be used in the next section:
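For reference, the layout just described looks roughly like this (a sketch reconstructed from the description above; your listing may differ slightly):

```
/home/hadoop/spark/nbayes
├── bayes.sbt
└── src
    └── main
        └── scala
            ├── bayes1.scala
            └── convert.scala
```

This src/main/scala convention is what sbt expects by default, so no extra configuration is needed to tell it where the sources live.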

The bayes.sbt file is a configuration file used by the sbt tool; it describes how to compile the Scala files within the scala directory. (Note that if you were developing in Java, you would use a path of the form nbayes/src/main/java.) The contents of the bayes.sbt file are shown next; the pwd and cat Linux commands confirm the file's location and display its contents.

The name, version, and scalaVersion options set the details of the project and the version of Scala to be used. The libraryDependencies options define the Hadoop and Spark libraries on which the project depends (the %% operator tells sbt to append the Scala binary version to the artifact name when resolving the dependency).

[hadoop@hc2nn nbayes]$ pwd
/home/hadoop/spark/nbayes
[hadoop@hc2nn nbayes]$ cat bayes.sbt
name := "Naive Bayes"
version := "1.0"
scalaVersion := "2.11.2"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.8.1"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.1"
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.1.1"

The Scala nbayes project code can be compiled from the nbayes subdirectory using this command:

[hadoop@hc2nn nbayes]$ sbt compile

The sbt compile command is used to compile the code into classes. The classes are then placed in the nbayes/target/scala-2.11/classes directory. The compiled classes can be packaged into a JAR file with this command:

[hadoop@hc2nn nbayes]$ sbt package

The sbt package command will create a JAR file under the nbayes/target/scala-2.11 directory. As the example in the sbt structure diagram shows, the JAR file named naive-bayes_2.11-1.0.jar is created after a successful compile and package. This JAR file, and the classes that it contains, can then be used in a spark-submit command. This will be described later as the functionality in the Apache Spark MLlib module is explored.
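As a hedged preview of that step (the class name Bayes1 and the local master URL are assumptions for illustration; the actual invocation is covered later), a spark-submit command for this JAR would look roughly like the following, with the path matching the Scala 2.11 version set in bayes.sbt:

```
# Sketch only: substitute your application's main class and master URL.
spark-submit \
  --class Bayes1 \
  --master local[2] \
  /home/hadoop/spark/nbayes/target/scala-2.11/naive-bayes_2.11-1.0.jar
```

The --class option names the application's entry point within the JAR, and --master selects where the job runs (a local process here, or a cluster manager URL in production).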
