Chapter 10. Distributed Batch Processing with Spark

In Chapter 4, Parallel Collections and Futures, we discovered how to use parallel collections for "embarrassingly" parallel problems: problems that can be broken down into a series of tasks that require no (or very little) communication between the tasks.

Apache Spark provides behavior similar to Scala's parallel collections (and much more), but instead of distributing tasks across the CPU cores of a single computer, it distributes them across a cluster of computers. This gives us horizontal scalability: to process more data, we can simply add more computers to the cluster.
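
To make the analogy concrete, here is a minimal sketch (assuming the sc SparkContext that the Spark shell, introduced below, creates for you automatically):

// Parallel collection: work is spread across the CPU cores of one machine.
val localSquares = (1 to 1000).par.map { x => x * x }

// Spark RDD: the same transformation, but the data and the computation
// can be spread across the machines in a cluster.
val distributedSquares = sc.parallelize(1 to 1000).map { x => x * x }
distributedSquares.take(5) // Array(1, 4, 9, 16, 25)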

In this chapter, we will learn the basics of Apache Spark and use it to explore a set of emails, extracting features with the view of building a spam filter. We will explore several ways of actually building a spam filter in Chapter 12, Distributed Machine Learning with MLlib.

Installing Spark

In previous chapters, we included dependencies by specifying them in a build.sbt file and relying on SBT to fetch them from the Maven Central repository. For Apache Spark, it is more common to download the source code or pre-built binaries explicitly, since Spark ships with many command-line scripts that greatly facilitate launching jobs and interacting with a cluster.

Head over to http://spark.apache.org/downloads.html and download Spark 1.5.2, choosing the "pre-built for Hadoop 2.6 or later" package. You can also build Spark from source if you need customizations, but we will stick to the pre-built version since it requires no configuration.

Clicking Download will download a tarball, which you can unpack with the following command:

$ tar xzf spark-1.5.2-bin-hadoop2.6.tgz

This will create a spark-1.5.2-bin-hadoop2.6 directory. To verify that Spark works correctly, navigate to spark-1.5.2-bin-hadoop2.6/bin and launch the Spark shell using ./spark-shell. This is just a Scala shell with the Spark libraries loaded.
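
As a quick sanity check, the shell creates a SparkContext for you and binds it to the variable sc, so a trivial distributed computation should work straight away (the res number in your session may differ):

scala> sc.parallelize(1 to 100).reduce { _ + _ }
res0: Int = 5050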

You may want to add the bin/ directory to your system path. This lets you call the scripts in that directory from anywhere on your system without having to reference the full path. On Linux or Mac OS, you can add the directory to the system path by entering the following line in your shell configuration file (.bash_profile on Mac OS, and .bashrc or .bash_profile on Linux):

export PATH=/path/to/spark/bin:$PATH

The changes will take effect in new shell sessions. On Windows (if you use PowerShell), you can achieve the same effect by adding the following line to the profile.ps1 file in the WindowsPowerShell folder in your Documents folder, adjusting the path to point to your Spark installation's bin directory:

$env:Path += ";C:\path\to\spark\bin"

If this worked correctly, you should be able to open a Spark shell in any directory on your system by just typing spark-shell in a terminal.
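
To confirm that the script is being picked up from the new location on Linux or Mac OS (a small check; the exact path will depend on where you unpacked Spark), you can ask the shell where it finds it:

$ which spark-shell
/path/to/spark/bin/spark-shell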
