Chapter 1. Installing Spark and Setting Up Your Cluster

This chapter details some common methods for setting up Spark. Running Spark on a single machine is excellent for testing, but you will also learn to use Spark's built-in deployment scripts to deploy to a dedicated cluster via SSH (Secure Shell). This chapter also covers using Mesos, YARN, Puppet, or Chef to deploy Spark. For cloud deployments of Spark, it looks at EC2 (both traditional and EC2MR). Feel free to skip this chapter if you already have your local Spark instance installed and want to get straight to programming.

Regardless of how you are going to deploy Spark, you will want to get the latest version of Spark (Version 0.7 as of this writing). For coders who live dangerously, try cloning the code directly from the repository git://. Both the source code and pre-built binaries are available.

To interact with Hadoop Distributed File System (HDFS), you need to use a Spark version that is built against the same version of Hadoop as your cluster. For Version 0.7 of Spark, the pre-built package is built against Hadoop 1.0.4. If you are up for the challenge, building from source is recommended, since it gives you the flexibility of choosing which HDFS version you want to support as well as applying patches.

You will need the appropriate version of Scala installed and the matching JDK. For Version 0.7.1 of Spark, you require Scala 2.9.2 or a later 2.9 series release (2.9.3 works well). At the time of this writing, Ubuntu's LTS release (Precise) has Scala Version 2.9.1. Additionally, the current stable version has 2.9.2 and Fedora 18 has 2.9.2. Up-to-date package information is available from each distribution's package listings, and the latest version of Scala is available from the Scala website. It is important to choose the version of Scala that matches the version requested by Spark, as Scala is a fast-evolving language.
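The version requirement above (Scala 2.9.2 or a later 2.9 series release) can be checked mechanically before starting a long build. The following is a minimal sketch, not part of Spark; the function name scala_version_ok is hypothetical, and it expects you to pass in the version string your installation reports.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only for Scala 2.9.x with x >= 2,
# the range the text gives for Spark 0.7.1.
scala_version_ok() {
    case "$1" in
        2.9.*) patch=${1#2.9.} ;;   # keep what follows "2.9."
        *) return 1 ;;              # 2.8.x, 2.10.x, and so on are rejected
    esac
    # Drop any suffix such as "-RC1" so only the leading digits remain.
    patch=${patch%%[!0-9]*}
    [ -n "$patch" ] && [ "$patch" -ge 2 ]
}

# Example use:
# scala_version_ok 2.9.3 && echo "Scala version is suitable"
```

Rejecting 2.10.x outright is deliberate in this sketch, since the requirement is stated in terms of the 2.9 series only.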

The tarball file contains a bin directory that needs to be added to your path, and SCALA_HOME should be set to the path where the tarball file is extracted. Scala can be installed from the tarball by running:

wget && tar -xvf scala-2.9.3.tgz && cd scala-2.9.3 && export PATH=`pwd`/bin:$PATH && export SCALA_HOME=`pwd`

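You will probably want to add these settings to your .bashrc file or equivalent. Because pwd is evaluated every time the file is read, replace it there with the absolute path where you extracted Scala:

export SCALA_HOME=[pathyouextractedscalato]
export PATH=$SCALA_HOME/bin:$PATH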

Spark is built with sbt (simple build tool, which is no longer very simple), and build times can be quite long when compiling Scala source code. Don't worry if you don't have sbt installed; the build script will download the correct version for you.

On an admittedly under-powered Core 2 laptop with an SSD, building a fresh copy of Spark took about seven minutes. If you decide to build Version 0.7 from source, you would run:

wget && tar -xvf download-spark-0.7.0-sources-tgz && cd spark-0.7.0 && sbt/sbt package

If you are going to use a version of HDFS that doesn't match the default version for your Spark instance, you will need to edit project/SparkBuild.scala, set HADOOP_VERSION to the corresponding version, and recompile with:

sbt/sbt clean compile
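If you change the Hadoop version often, the edit to project/SparkBuild.scala can be scripted. The following is a sketch only: the function name set_hadoop_version is hypothetical, and it assumes the setting sits on a single line of the form val HADOOP_VERSION = "1.0.4", which you should confirm against your copy of the file before relying on it.

```shell
#!/bin/sh
# Sketch: rewrite the HADOOP_VERSION value in a Spark build definition.
# Assumes a single line of the form:  val HADOOP_VERSION = "1.0.4"
set_hadoop_version() {
    build_file=$1
    new_version=$2
    tmp_file="$build_file.tmp"
    # Replace only the quoted value; every other line passes through unchanged.
    # Writing to a temporary file avoids relying on a non-portable "sed -i".
    sed "s/\(val HADOOP_VERSION = \)\"[^\"]*\"/\1\"$new_version\"/" \
        "$build_file" > "$tmp_file" &&
        mv "$tmp_file" "$build_file"
}

# Example (the version string is only an illustration):
# set_hadoop_version project/SparkBuild.scala 0.20.205.0 && sbt/sbt clean compile
```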


The sbt tool has made great progress with dependency resolution, but it is still strongly recommended that developers do a clean build rather than an incremental build, because incremental builds still don't get it quite right all the time.

Once you have started the build, it's probably a good time for a break, such as getting a cup of coffee. If you find it stuck on a single line that says "Resolving [XYZ]...." for a long time (say five minutes), stop it and rerun sbt/sbt package.

If you can live with the restrictions (such as the fixed HDFS version), using the pre-built binary will get you up and running far quicker. To download and unpack the pre-built version, use the following command:

wget && tar -xvf download-spark-0.7.0-prebuilt-tgz && cd spark-0.7.0


Spark has recently become a part of the Apache Incubator. For an application developer who uses Spark, the most visible change will likely be the eventual renaming of the package to fall under the org.apache namespace.


Running Spark on a single machine

A single machine is the simplest use case for Spark. It is also a great way to sanity check your build. In the Spark directory, there is a shell script called run that can be used to launch a Spark job. The run script takes the name of a Spark class and some arguments. There is a collection of sample Spark jobs in ./examples/src/main/scala/spark/examples/.

All the sample programs take the parameter master, which can be the URL of a distributed cluster or local[N], where N is the number of threads. To run GroupByTest locally with four threads, try the following command:

./run spark.examples.GroupByTest local[4]
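One shell detail is worth noting when passing local[N]: the square brackets form a filename pattern, so an unquoted local[4] could be expanded by the shell if a file named local4 happened to exist in the current directory. A small sketch of a safer invocation follows; the helper name local_master is hypothetical and not part of Spark.

```shell
#!/bin/sh
# Hypothetical helper: build a local master URL for the given thread count.
local_master() {
    printf 'local[%s]\n' "$1"
}

# Quote the argument so the brackets reach the run script literally
# instead of being treated as a filename pattern by the shell.
# ./run spark.examples.GroupByTest "$(local_master 4)"
```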

If you get an error saying that SCALA_HOME is not set, make sure your SCALA_HOME is set correctly. In bash, you can do this using export SCALA_HOME=[pathyouextractedscalato].
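To catch this before launching a job, a pre-flight check along the following lines may help. It is a sketch, not something Spark provides: the function name check_scala_home is hypothetical, and it only verifies that SCALA_HOME points at a directory containing an executable bin/scala.

```shell
#!/bin/sh
# Hypothetical pre-flight check: succeeds if SCALA_HOME is set and
# contains an executable bin/scala launcher.
check_scala_home() {
    if [ -z "${SCALA_HOME:-}" ]; then
        echo "SCALA_HOME is not set" >&2
        return 1
    fi
    if [ ! -x "$SCALA_HOME/bin/scala" ]; then
        echo "no executable found at $SCALA_HOME/bin/scala" >&2
        return 1
    fi
    return 0
}

# Example use:
# check_scala_home && ./run spark.examples.GroupByTest "local[4]"
```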

If you get the following error, it is likely you are using Scala 2.10, which is not supported by Spark 0.7:

Exception in thread "main" java.lang.NoClassDefFoundError: scala/reflect/ClassManifest

The Scala developers decided to rearrange some classes between the 2.9 and 2.10 versions. You can either downgrade your version of Scala or see if the development build of Spark is ready to be built with Scala 2.10.
