Spark – in-memory distributed computing

One of the issues with Hadoop is that after a MapReduce operation, the resulting files are written to the hard disk. Therefore, a large data processing job involves many read and write operations on the disk, which makes processing in Hadoop very slow. Network latency, the time required to shuffle data between different nodes, adds to this problem. Another disadvantage is that one cannot make real-time queries on the files stored in HDFS. For machine learning problems, intermediate results are not persisted in memory across the iterations of a training algorithm, so each iteration pays the full disk I/O cost again. All this makes Hadoop a less than ideal platform for machine learning.

A solution to this problem was invented at the AMPLab at the University of California, Berkeley, in 2009. This came out of the PhD work of Matei Zaharia, a Romanian-born computer scientist. His paper Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (reference 4 in the References section of this chapter) gave rise to the Spark project that eventually became a fully open source project under Apache. Spark is an in-memory distributed computing framework that solves many of the problems of Hadoop mentioned earlier. Moreover, it supports more types of operations than just MapReduce. Spark can be used for processing iterative algorithms, interactive data mining, and streaming applications. It is based on an abstraction called Resilient Distributed Datasets (RDD). Similar to HDFS, it is also fault-tolerant.
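To make the RDD idea concrete, the following is a minimal, illustrative sketch in plain Python, not Spark's actual implementation: an RDD is an immutable dataset whose transformations are recorded lazily as a lineage, so results stay in memory and a lost partition can be recomputed from its lineage instead of being checkpointed to disk. The class and method names here are hypothetical, chosen only to mirror Spark's `map`/`filter`/`collect` vocabulary.

```python
# Toy model of the RDD abstraction (illustrative only, not Spark code).
class ToyRDD:
    def __init__(self, source, lineage=None):
        self.source = list(source)      # the original input data
        self.lineage = lineage or []    # chain of transformations to replay

    def map(self, fn):
        # Transformations are lazy: nothing is computed yet, we only
        # record the operation in the lineage of a new, immutable RDD.
        return ToyRDD(self.source, self.lineage + [("map", fn)])

    def filter(self, pred):
        return ToyRDD(self.source, self.lineage + [("filter", pred)])

    def collect(self):
        # An action triggers computation: the lineage is replayed in
        # memory. If a node were lost, the same replay would rebuild
        # its share of the data -- this is the fault-tolerance idea.
        data = self.source
        for op, fn in self.lineage:
            if op == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

rdd = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # [0, 4, 16, 36, 64]
```

Because transformations only extend the lineage, chaining them is cheap, and iterative algorithms can reuse the in-memory data on every pass instead of rereading it from disk, which is the key difference from Hadoop's MapReduce.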

Spark is written in a language called Scala. It has interfaces for Java and Python, and from version 1.4.0 onward it also supports R. This R interface is called SparkR, which we will describe in the next section. The four classes of libraries available in Spark are SQL and DataFrames, Spark Streaming, MLlib (machine learning), and GraphX (graph algorithms). Currently, SparkR supports only SQL and DataFrames; support for the others is on the roadmap. Spark can be downloaded from the Apache project page at http://spark.apache.org/downloads.html. Starting from version 1.4.0, SparkR is included in Spark and no separate download is required.
