A quick introduction to Scala

Scala stands for "scalable language." Scalable could be interpreted as the scalability of a software application. Java developers have been programming with Java's famously verbose syntax for many years now, and compared with programming languages such as PHP, Ruby, and Python, it has often been considered to have a very restrictive syntax. While, on the one hand, this bunch of languages provides a very feature-rich syntax that makes code both more readable and concise, they are just not up to the power of the JVM. On the other hand, the Java language itself didn't keep up with the speed of evolution of modern programming languages. One could debate whether that is a good thing or a bad thing, but one thing is very clear: developers need a language that is concise, expressive, modern, and fast! Scala emerged as a language that fills this sweet spot between the power of the JVM and the expressiveness of modern programming languages.

Back again to the "scalable" part of the definition: scalable in this context means that Scala allows a developer to scale the "way code is written." A Scala program may begin as a quick script, then grow more features, and then be broken up into packages and classes. Because Scala supports both the object-oriented paradigm and the functional paradigm, a developer can write really compact code. This lets a developer focus more on the core logic than on boilerplate code. Scala is a wonderful language to program with because it has:

  • Operator overloading
  • Generics
  • Functional operators: map, filter, fold, and reduce
  • Immutable collections and data structures
  • Intuitive pattern matching
  • Support for writing domain-specific languages (DSLs) via parser combinators

In a nutshell, Scala is a double-edged sword: on one hand you can write very elegant and concise code, and on the other hand you can write code so cryptic that no other developer can easily understand it. That is the balance one has to maintain. Let's take a look at our first Scala application:

[Screenshot: the HelloWorld application]
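
A minimal HelloWorld program consistent with the description that follows would look something like this (the exact code in the original screenshot may differ slightly):

object HelloWorld {
  def main(args: Array[String]) {
    println("Hello World!")
  }
}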

Here, HelloWorld is defined as an object, which means that in a JVM there will be only one copy of HelloWorld; we can also consider it a singleton object. HelloWorld contains a single method named main with the parameter args. Note that the parameter type is written after the parameter name. Essentially, this translates to Java's public static void main method. Just like in a Java application, the main method is the entry point, which is also the case for a Scala object that defines a main method, as shown in the preceding code. This program will output:

Hello World!

Next up is a Scala application named CoreConcepts.

[Screenshot: the CoreConcepts application]
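
The following is a minimal sketch of a CoreConcepts program consistent with the points listed next (the exact code in the original screenshot may differ):

package chapter01

import scala.annotation.tailrec

object CoreConcepts {

  // x has an explicit type; the types of y and s are inferred
  val x: Int = 10
  val y = 20.5
  val s = "Scala"

  // no return type written; the compiler infers Double
  def sum(a: Double, b: Double) = a + b

  // tail-recursive factorial; @tailrec asks the compiler to turn
  // the recursion into a loop
  @tailrec
  def factorial(n: Int, acc: Long = 1L): Long =
    if (n <= 1) acc else factorial(n - 1, acc * n)

  def main(args: Array[String]) {
    println(s"$s: the sum of $x and $y is ${sum(x, y)}")         // s interpolator
    println(f"factorial of 5 is ${factorial(5)}%d, y is $y%.2f") // f interpolator
  }
}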

Things to notice in this code are:

  • The first line defines a package.
  • We have defined x, y, and s as val, which stands for value. We haven't specified the data types for y and s; although the types are not written down, the Scala compiler inferred them using its type-inference system, so we save time by typing less.
  • Unlike Java, there are no semicolons. If you prefer, you can still use semicolons, but they are optional.
  • def defines a method, so apart from main, we have two other methods defined: sum and factorial.
  • We haven't specified any return type for the sum method, but you can check in an IDE that it returns Double. Again, Scala's type inference saves us from specifying that.
  • There is no return statement. We don't need it: the value of the last expression is returned by default. Notice that both the sum and factorial methods return values this way.
  • Notice @tailrec (the annotation for tail recursion), which tells the Scala compiler to optimize the tail-recursive factorial method into a loop. You do not need to worry about recursion or tail recursion for now; this is just an illustration of some Scala features.
  • Also, look at the strings in the println statements and take notice of the f and s prefixes. Prefixing a string with s or f enables string interpolation; the f prefix additionally supports printf-style format specifiers, as shown in the short example after this list.
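
For instance, a quick illustration of the two interpolators (the values here are made up for the example):

val name = "Scala"
val pi = 3.14159
println(s"Hello, $name!")          // s: simple value substitution
println(f"Pi is roughly $pi%.2f")  // f: substitution plus a format specifier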

There is a lot going on in this small CoreConcepts example. Take some time to go through each statement and refer to additional material if you need to.

Case classes

In the next example, we see how a case class can drastically shorten our code. If we were to store the name and age of an employee in Java, we might create a Java bean with two members, one each for name and age, and then add a getter and a setter for each of them. However, all this and more can be achieved with just a single line of Scala code, as in the next example. Look for the case class Employee statement:

package chapter01

import scala.util.Random

object ExampleCaseClasses {

  case class Employee(name: String, age: Int)

  def main(args: Array[String]) {
    val NUM_EMPLOYEES = 5
    val firstNames = List("Bruce", "Great", "The", "Jackie")
    val lastNames = List("Lee", "Khali", "Rock", "Chan")
    val employees = (0 until NUM_EMPLOYEES) map { i =>
      val first = Random.shuffle(firstNames).head
      val last = Random.shuffle(lastNames).head
      val fullName = s"$last, $first"
      val age = 20 + Random.nextInt(40)
      Employee(fullName, age)
    }

    employees foreach println

    val hasLee = """(Lee).*""".r
    for (employee <- employees) {
      employee match {
        case Employee(hasLee(x), age) => println("Found a Lee!")
        case _ => // Do nothing
      }
    }
  }
}

Other things to note in this example:

  • Use of a range (0 until NUM_EMPLOYEES) coupled with the map functional operator, instead of a loop, to construct a collection of employees.
  • We use foreach to print all the employees. Technically it's not a loop, but, for simplicity, we can think of its behavior as a simple loop.
  • We also use regular expression pattern matching on the full names of all the employees. This is done using a for comprehension; take a closer look at its structure. The left arrow is generator syntax, quite similar to Java's enhanced for (foreach) loop, except that here we don't have to specify any types.
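
To see a little of what that single-line Employee definition buys us, here is a short sketch, assuming the Employee case class above is in scope (the values are made up for illustration):

val e1 = Employee("Lee, Bruce", 32)
val e2 = Employee("Lee, Bruce", 32)

println(e1)                    // a readable toString comes for free
println(e1 == e2)              // true: structural equality, no equals/hashCode to write
val older = e1.copy(age = 40)  // copy with a modified field, no setters needed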

Tuples

Other than classes and case classes, Scala also provides another basic kind of data container called a tuple. Tuples are typed, fixed-size, immutable lists of values. By typed we mean that each member has a type, and the list has a fixed size. Also, once we create a tuple, we can't change its member values, that is, tuples are immutable (think of Python tuples). Each member value is retrieved using an underscore followed by a number (the first member is numbered ._1). There cannot be a zero-member tuple.

Here is an example:

package chapter01

object ExampleTuples {
  def main(args: Array[String]) {
    val tuple1 = Tuple1(1)
    val tuple2 = ('a', 1) // can also be defined: ('a' -> 1)
    val tuple3 = ('a', 1, "name")

    // Access tuple members by underscore followed by 
    // member index starting with 1
    val a = tuple1._1 // a: Int = 1
    val b = tuple2._2 // b: Int = 1
    val c = tuple3._1 // c: Char = a
    val d = tuple3._3 // d: String = name
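
    // tuples can also be destructured in one step via pattern matching
    val (ch, num) = tuple2 // ch: Char = 'a', num: Int = 1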
    
  }
}

Scala REPL

Scala also has a shell, known as the REPL (read-eval-print loop); think of Python's interactive shell or Ruby's IRB. By just typing scala, we can access the REPL:

$ scala
Welcome to Scala version 2.11.2 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_76).
Type in expressions to have them evaluated.
Type :help for more information.

scala> println("Hello World!")
Hello World!

scala> 3 + 4
res1: Int = 7

scala>

The Scala REPL is very powerful and convenient to use. We just set up the proper classpath and then we can import all the Java classes we need to play with, right within the Scala REPL. Notice how we use Java's HashMap in the Scala REPL:

scala> val m = new java.util.HashMap[String, String]
m: java.util.HashMap[String,String] = {}

scala> m.put("Language", "Scala")
res0: String = null

scala> m.put("Creator", "Martin Odersky")
res1: String = null

scala> m
res2: java.util.HashMap[String,String] = {Creator=Martin Odersky, Language=Scala}

scala> import scala.collection.JavaConversions._
import scala.collection.JavaConversions._

scala> m.toMap
res3: scala.collection.immutable.Map[String,String] = Map(Creator -> Martin Odersky, Language -> Scala)

That is complete Java interoperability. To give you an idea of how Scala looks, we have shown you only very basic Scala code; additional reading is required to become comfortable with the language. Pick up any decent book that covers Scala in more detail and you are good to go. Next, we will see how to create a sample SBT project.

SBT – Scala Build Tool

SBT is arguably the most common build tool for building and managing a Scala project and its dependencies. An SBT project consists of a build.sbt file or a Build.scala file at the project root, and optionally a project/ folder with additional configuration files. However, for the most part, it suffices to have only build.sbt.

Here is what a sample build.sbt file looks like:

$ cat build.sbt
name := "BuildingScalaRecommendationEngine"

scalaVersion := "2.10.2"

version := "1.0"

libraryDependencies += "org.apache.spark" % "spark-mllib_2.10" % "1.3.0"

Well yes, that's pretty much it. One thing to be careful about is to keep a single blank line between the statements. We have defined a project name, project version, Scala version, and a library dependency, and we have a complete Scala project ready for development. You also need to create the source code folders inside the project's root folder; the default locations for Scala and Java code are src/main/scala and src/main/java, respectively, as sketched below.
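
A typical layout for this project would therefore look something like the following (the chapter01 subfolder is just an example of a package folder):

BuildingScalaRecommendationEngine/
  build.sbt
  project/          (optional extra sbt configuration)
  src/
    main/
      scala/        (Scala sources, for example chapter01/)
      java/         (Java sources)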

SBT has its own shell. Just type sbt and you will see:

$ sbt
[info] Loading global plugins from /home/tuxdna/.sbt/0.13/plugins
[info] Set current project to BuildingScalaRecommendationEngine (in build file:/tmp/BuildingScalaRecommendationEngine/code/)
>

There you can type help to see more information on the different commands. For our purposes, we only need these commands:

  • clean: Cleans the project
  • compile: Compiles the Scala as well as Java code
  • package: Creates a project archive file
  • console: Opens a Scala console (REPL) with project dependencies already set up in the classpath

Apache Spark

Given that you have already set up Apache Spark, run Spark in local mode like so:

$ bin/spark-shell

This will give you what looks like the Scala REPL, and in fact it is, except that the interpreter has already done some Spark-specific initialization for us. Let's see a simple Spark program:

scala> sc.textFile("README.md").flatMap( l => l.split(" ").map( _ -> 1)).reduceByKey(_ + _).sortBy(_._2, false).take(5)

res0: Array[(String, Int)] = Array(("",158), (the,28), (to,19), (Spark,18), (and,17))

The preceding code is a word count program that outputs the top five most occurring words in the file README.md.

Some things to note in the preceding example are:

  • sc is the SparkContext that is created by the Spark shell by default
  • flatMap, map, and so on are the functional programming constructs offered by Scala
  • Everything here is Scala code, which means you do all the processing and database-like querying using only one language, that is, Scala
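
For readability, here is the same word count spread over multiple lines, with comments; it is just the preceding one-liner reformatted (the val name topWords is arbitrary):

val topWords = sc.textFile("README.md")
  .flatMap(line => line.split(" "))  // split each line into words
  .map(word => (word, 1))            // pair every word with a count of 1
  .reduceByKey(_ + _)                // sum the counts for each word
  .sortBy(_._2, false)               // sort by count, descending
  .take(5)                           // keep the top five (word, count) pairs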

Setting up a standalone Apache Spark cluster

In the previous Spark example, we ran Spark in local mode. However, for a full-blown application we would run our code on multiple machines, that is, a Spark cluster. We can also set up a single-machine cluster to achieve the same effect. To do so, ensure that you have Maven version 3 installed, and then set this environment variable:

$ export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"

Go to the folder where we extracted the spark-1.3.0 archive earlier. Now build Apache Spark:

$ cd path/to/spark-1.3.0
$ mvn -DskipTests clean package
$ sbin/start-master.sh

The preceding command points you to a log file with the extension .out. This file contains a line like this:

Starting Spark master at spark://matrix02:7077

Here, the spark://hostname:port part is the Spark master URL you would connect to from your Spark programs. In this case, it is spark://matrix02:7077.
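
For example, to point the Spark shell at this master instead of running in local mode (the hostname matrix02 comes from the preceding output and will be different on your machine):

$ bin/spark-shell --master spark://matrix02:7077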

If you are using a Windows machine, you can check for instructions in this Stack Overflow question:

https://stackoverflow.com/questions/25721781/running-apache-spark-on-windows-7

Or, create a GNU/Linux virtual machine, and follow the steps mentioned earlier.

Apache Spark – MLlib

Spark comes packaged with a library of machine learning routines called MLlib; here we use the version bundled with Spark 1.3.0. The following is an example of KMeans clustering using MLlib on the Iris dataset:

package chapter01

import scala.io.Source

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object IrisKMeans {
  def main(args: Array[String]) {

    val appName = "IrisKMeans"
    val master = "local"
    val conf = new SparkConf().setAppName(appName).setMaster(master)
    val sc = new SparkContext(conf)

    println("Loading iris data from URL...")
    val url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
    val src = Source.fromURL(url).getLines.filter(_.size > 0).toList
    val textData = sc.parallelize(src)
    val parsedData = textData
      .map(_.split(",").dropRight(1).map(_.toDouble))
      .map(Vectors.dense(_)).cache()

    val numClusters = 3
    val numIterations = 20
    val clusters = KMeans.train(parsedData, numClusters, numIterations)

    // Evaluate clustering by computing Within Set Sum of Squared Errors
    val WSSSE = clusters.computeCost(parsedData)
    println("Within Set Sum of Squared Errors = " + WSSSE)
  }
}

Tip

Downloading the example code

You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Run the preceding program from your IDE or use the following command from the project root folder:

$ sbt 'run-main chapter01.IrisKMeans'  2>/dev/null

Output:

Loading iris data from URL ...
Within Set Sum of Squared Errors = 78.94084142614648

In this example, we basically downloaded the Iris dataset, loaded it into Spark, and performed KMeans clustering. Finally, we displayed the WSSSE metric, which indicates how compact and well-separated the resulting K clusters are (for details, read https://en.wikipedia.org/wiki/K-means_clustering#Description). Don't worry if you don't understand what that metric means, because this example is only meant to get your Spark environment up and running. Next, we discuss some machine learning and recommendation engine jargon.
