Chapter 16, Parallelism with Scala and Akka, presented several options for making computation-intensive applications scalable. These solutions are generic and do not address the specific needs of the data scientist. Optimization algorithms, such as the minimization of a loss function, and dynamic programming methods, such as the Viterbi algorithm, require support for caching data and broadcasting model parameters. The Apache Spark framework addresses these shortcomings.
This chapter describes the key concepts behind the Apache Spark framework and its application to large-scale machine learning problems. The reader is invited to dig into the wealth of books and papers on this topic. This last chapter describes four key characteristics of the Apache Spark framework:
Apache Spark is a fast and general-purpose cluster computing system, initially developed at AMPLab / UC Berkeley as part of the Berkeley Data Analytics Stack (BDAS), http://en.wikipedia.org/wiki/UC_Berkeley. It provides high-level APIs for the following programming languages that make large, concurrent jobs easy to write and deploy [17:01]:
The core element of Spark is the resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of a cluster, the CPU cores of servers, or both. An RDD can be created from a local data structure such as a list, array, or hash table, or from a file on the local filesystem or the Hadoop Distributed File System (HDFS) [17:02].
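The two ways of creating an RDD mentioned above can be sketched with the `SparkContext` API. This is a minimal sketch, assuming a Spark 1.x or later `spark-core` dependency on the classpath and a local master; the object `RddCreation` and its helper names are hypothetical and exist only for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical helpers, not part of the Spark API
object RddCreation {
  // Assumption: a local context backed by two worker threads,
  // convenient for experiments on a single machine
  def localContext(appName: String): SparkContext =
    new SparkContext(new SparkConf().setAppName(appName).setMaster("local[2]"))

  // Creates an RDD from an in-memory Scala collection,
  // split into the requested number of partitions
  def fromCollection(
      sc: SparkContext,
      data: Seq[Double],
      numPartitions: Int): RDD[Double] =
    sc.parallelize(data, numPartitions)

  // Creates an RDD of text lines from a file; the path may be
  // a local path or a URI such as hdfs://...
  def fromTextFile(sc: SparkContext, path: String): RDD[String] =
    sc.textFile(path)
}
```

On a real cluster, the master URL would typically be supplied at submission time instead of being hardcoded as it is in `localContext`.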
The operations on an RDD in Spark are very similar to the Scala higher-order methods. These operations are performed concurrently over each partition. Operations on RDDs can be classified as follows:
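The Spark API distinguishes transformations, which lazily describe a new RDD, from actions, which trigger the computation and return a value to the driver. The similarity with Scala higher-order methods can be illustrated with a short sketch, assuming a `spark-core` dependency and an existing `SparkContext`; the object and method names are hypothetical.

```scala
import org.apache.spark.SparkContext

// Hypothetical example, not taken from the Spark distribution
object RddOperations {
  // Mean of the squares of the strictly positive values.
  // The equivalent on a local Scala collection would be
  //   data.filter(_ > 0.0).map(x => x * x).sum / count
  def meanSquareOfPositives(sc: SparkContext, data: Seq[Double]): Double = {
    val squares = sc.parallelize(data)
      .filter(_ > 0.0)   // transformation: evaluated lazily
      .map(x => x * x)   // transformation: evaluated lazily

    // The RDD is consumed by two actions, so it is kept in memory
    // to avoid recomputing filter and map for the second one
    squares.cache()

    val n = squares.count()                 // action: triggers the computation
    if (n == 0L) 0.0
    else squares.reduce(_ + _) / n          // action: aggregates across partitions
  }
}
```

The explicit check on `n` is there because `reduce` throws an exception when applied to an empty RDD.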
An RDD can be persisted, serialized, and cached for future computation. Spark is written in Scala and built on top of Akka libraries. Spark relies on the following mechanisms to distribute and partition RDDs:
The Spark ecosystem can be represented as stacks of technology and frameworks, as seen in the following diagram:
The Spark ecosystem has grown to include MLlib, which provides a set of machine learning algorithms out of the box; SparkSQL, a SQL-like interface to manipulate datasets with relational operators; GraphX, a library for distributed graphs; and a streaming library [17:12].