Now that we are armed with the knowledge of where and how statistics and machine learning fit in the global data-driven enterprise architecture, let's turn to the specific implementations in Spark and MLlib, a machine learning library on top of Spark. Spark is a relatively new member of the big data ecosystem that is optimized for memory usage as opposed to disk. Data can still be spilled to disk as necessary, but Spark spills only if instructed to do so explicitly or if the active dataset does not fit into memory. Spark stores lineage information so that it can recompute the active dataset if a node goes down or the data is evicted from memory for some other reason. This is in contrast to the traditional MapReduce approach, where data is persisted to disk after each map or reduce task.
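Both behaviors are easy to observe from the Spark shell. The following is a minimal sketch, assuming an already running spark-shell session (installation is covered later in this chapter); persisting with StorageLevel.MEMORY_AND_DISK allows Spark to spill cached partitions to disk when they do not fit in memory, and toDebugString prints the lineage Spark would use to recompute them:

scala> import org.apache.spark.storage.StorageLevel
scala> // a small RDD with a two-step lineage: parallelize, then map
scala> val squares = sc.parallelize(1 to 1000000).map(x => x * x)
scala> // cache in memory, spilling to disk only if the partitions do not fit
scala> squares.persist(StorageLevel.MEMORY_AND_DISK)
scala> // print the lineage used to recompute lost or evicted partitions
scala> println(squares.toDebugString)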
Spark is particularly suited to iterative and statistical machine learning algorithms over a distributed set of nodes and can scale beyond the available memory (out of core). The only limitations are the total memory and disk space available across all Spark nodes and the network speed. I will cover the basics of the Spark architecture and implementation in this chapter.
One can direct Spark to execute data pipelines either on a single node or across a set of nodes with a simple change in the configuration parameters. Of course, this flexibility comes at the cost of a slightly heavier framework and longer setup times, but the framework is highly parallelizable, and as most modern laptops are already multithreaded and sufficiently powerful, this usually does not present a big issue.
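For example, the choice between a single machine and a cluster usually comes down to the --master option when launching the shell or submitting a job. This is a minimal sketch; the standalone master URL is a placeholder for your own cluster:

$ # run locally, using all cores of the current machine
$ bin/spark-shell --master local[*]
$ # run on a YARN cluster (HADOOP_CONF_DIR must point at the cluster configuration)
$ bin/spark-shell --master yarn
$ # run against a standalone Spark master (hypothetical host and port)
$ bin/spark-shell --master spark://master-host:7077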
In this chapter, we will cover the following topics:
If you haven't done so yet, you can download the pre-built Spark package from http://spark.apache.org/downloads.html. The latest release at the time of writing is 1.6.1.
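For example, the package pre-built for Hadoop 2.6 can be fetched and unpacked from the Apache archive; this is a sketch only, and the exact file name depends on the package type you pick on the downloads page:

$ wget http://archive.apache.org/dist/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz
$ tar xvfz spark-1.6.1-bin-hadoop2.6.tgz
$ cd spark-1.6.1-bin-hadoop2.6
$ # prints the Spark version as a quick sanity check
$ bin/spark-submit --version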
Alternatively, you can build Spark yourself by downloading the full source distribution from https://github.com/apache/spark:
$ git clone https://github.com/apache/spark.git
Cloning into 'spark'...
remote: Counting objects: 301864, done.
...
$ cd spark
$ sh ./dev/change-scala-version.sh 2.11
...
$ ./make-distribution.sh --name alex-build-2.6-yarn --skip-java-test --tgz -Pyarn -Phive -Phive-thriftserver -Pscala-2.11 -Phadoop-2.6
...
The command will download the necessary dependencies and create the spark-2.0.0-SNAPSHOT-bin-alex-spark-build-2.6-yarn.tgz file in the Spark directory; the version is 2.0.0, as it is the next release version at the time of writing. In general, you do not want to build from trunk unless you are interested in the latest features. If you want a released version, you can check out the corresponding tag. The full list of available versions is available via the git branch -r command. The spark*.tgz file is all you need to run Spark on any machine that has a Java JRE installed.
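If you do want a released version rather than trunk, a minimal sketch of checking out the corresponding tag (Spark release tags follow the vX.Y.Z convention) looks as follows:

$ # list the release tags available in the cloned repository
$ git tag -l
...
$ # switch the working tree to a released version, for example 1.6.1
$ git checkout v1.6.1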
The distribution comes with the docs/building-spark.md document, which describes other options for building Spark, including the incremental Scala compiler, zinc. Full Scala 2.11 support is in the works for the next Spark 2.0.0 release.
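As one example of those options, a plain Maven build can be driven through the build/mvn wrapper shipped with the source tree, which downloads its own Maven and zinc for incremental compilation; the following sketch assumes the same YARN and Hadoop 2.6 profiles as above:

$ # full build without tests; build/mvn starts the zinc incremental compiler automatically
$ ./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean package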