Here comes Apache Spark

Apache Spark is a unified distributed computing engine that works across different workloads and platforms. Spark can connect to a variety of platforms and process diverse data workloads using paradigms such as Spark Streaming, Spark ML, Spark SQL, and Spark GraphX.

Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs that allow data workers to efficiently execute streaming, machine learning, or SQL workloads requiring fast, interactive access to datasets. Apache Spark consists of Spark Core and a set of libraries. The core is the distributed execution engine, and the Java, Scala, and Python APIs offer a platform for distributed application development. Additional libraries built on top of the core support workloads for streaming, SQL, graph processing, and machine learning. Spark ML, for instance, is designed for data science, and its abstractions make data science easier.
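The core's programming model can be illustrated with a toy, single-machine sketch in plain Python. This is not Spark code: the `ToyRDD` class and its internals are invented for illustration. What it mimics is the pattern Spark's RDD API exposes, where transformations are chained lazily and nothing executes until an action such as `collect()` requests a result.

```python
# Toy sketch of Spark's deferred-execution style on a single machine.
# NOT actual Spark code; ToyRDD is a hypothetical name for illustration.

class ToyRDD:
    def __init__(self, data):
        self._data = data   # the source "partition" (a plain sequence here)
        self._ops = []      # pending transformations, recorded but not run

    def map(self, fn):
        new = ToyRDD(self._data)
        new._ops = self._ops + [("map", fn)]
        return new

    def filter(self, pred):
        new = ToyRDD(self._data)
        new._ops = self._ops + [("filter", pred)]
        return new

    def collect(self):
        # Nothing runs until an action such as collect() is called.
        out = list(self._data)
        for kind, fn in self._ops:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

rdd = ToyRDD(range(1, 6)).map(lambda x: x * x).filter(lambda x: x % 2 == 1)
print(rdd.collect())  # [1, 9, 25]
```

In real Spark the recorded transformations form a lineage graph that the engine can replay on another node if a partition is lost, which is also how Spark resiliently handles failures.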

Spark provides real-time streaming, interactive queries, machine learning, and graph processing. Before Apache Spark, different technologies were needed for different types of workloads: one for batch analytics, one for interactive queries, one for real-time stream processing, and another for machine learning algorithms. Apache Spark can handle all of these in a single framework, rather than requiring multiple technologies that are not always well integrated.


Using Apache Spark, all types of workloads can be processed, and Spark supports Scala, Java, R, and Python as a means of writing client programs.

Apache Spark is an open-source distributed computing engine which has key advantages over the MapReduce paradigm:

  • Uses in-memory processing as much as possible
  • General-purpose engine that can be used for both batch and real-time workloads
  • Compatible with both YARN and Mesos
  • Integrates well with HBase, Cassandra, MongoDB, HDFS, Amazon S3, and other file systems and data sources

Spark was created at UC Berkeley back in 2009 and grew out of the project to build Mesos, a cluster management framework intended to support different kinds of cluster computing systems. Take a look at the following table:

Version  Release date  Milestones
0.5      2012-10-07    First available version, for non-production usage
0.6      2013-02-07    Point release with various changes
0.7      2013-07-16    Point release with various changes
0.8      2013-12-19    Point release with various changes
0.9      2014-07-23    Point release with various changes
1.0      2014-08-05    First production-ready, backward-compatible release. Spark Batch, Streaming, Shark, MLlib, GraphX
1.1      2014-11-26    Point release with various changes
1.2      2015-04-17    Structured data, SchemaRDD (subsequently evolved into DataFrames)
1.3      2015-04-17    API to provide a unified API to read from structured and semi-structured sources
1.4      2015-07-15    SparkR, DataFrame API, Tungsten improvements
1.5      2015-11-09    Point release with various changes
1.6      2016-11-07    Dataset DSL introduced
2.0      2016-11-14    DataFrames and Datasets API as the fundamental layer for ML, Structured Streaming, SparkR improvements
2.1      2017-05-02    Event-time watermarks, ML and GraphX improvements
2.2      2017-07-11    Several improvements; Structured Streaming is now GA

Spark is a platform for distributed computing that has several features:

  • Transparently processes data on multiple nodes via a simple API
  • Resiliently handles failures
  • Spills data to disk as necessary, though it predominantly uses memory
  • Supports Java, Scala, Python, R, and SQL APIs
  • The same Spark code can run standalone, on Hadoop YARN, on Mesos, and in the cloud
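The last point, running the same code under different cluster managers, comes down to the `--master` option passed at submission time. A sketch of the command forms, where the application class, jar name, and host names are placeholders rather than real endpoints:

```shell
# Same application, different cluster managers: only --master changes.
# com.example.MyApp, myapp.jar, and host names are hypothetical.

# Local mode (no cluster manager; [*] means one worker thread per core)
spark-submit --master local[*] --class com.example.MyApp myapp.jar

# Standalone Spark cluster
spark-submit --master spark://master-host:7077 --class com.example.MyApp myapp.jar

# Hadoop YARN
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp myapp.jar

# Mesos
spark-submit --master mesos://mesos-host:5050 --class com.example.MyApp myapp.jar
```

Because the cluster manager is chosen at submission time, the application code itself stays unchanged across deployment targets.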

Scala features such as implicits, higher-order functions, structured types, and so on allow us to easily build DSLs and integrate them with the language.

Apache Spark does not provide a storage layer; it relies on external storage such as HDFS or Amazon S3. Hence, even where Apache Hadoop technologies are replaced with Apache Spark, HDFS is still needed to provide a reliable storage layer.

Apache Kudu provides an alternative to HDFS, and there is already integration between Apache Spark and the Kudu storage layer, further decoupling Apache Spark from the Hadoop ecosystem.

Hadoop and Apache Spark are both popular big data frameworks, but they do not really serve the same purpose. Hadoop provides distributed storage and the MapReduce distributed computing framework, whereas Spark is a data processing framework that operates on distributed data storage provided by other technologies.

Spark is generally much faster than MapReduce because of the way it processes data. MapReduce operates on splits using disk operations, writing intermediate results back to disk between stages; Spark operates on the dataset far more efficiently, with the main reason for the performance improvement being efficient off-heap, in-memory processing rather than sole reliance on disk-based computations.
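The benefit of keeping a working set in memory across iterations can be illustrated with a toy, single-machine sketch in plain Python (again, not Spark code; the function names are invented). An iterative job that goes back to storage and re-parses its input on every pass does redundant work that a cached in-memory copy avoids, which is the pattern behind Spark's `cache()`/`persist()`.

```python
import csv
import io

# A small "dataset on disk", simulated here as CSV text in memory.
RAW = "id,value\n" + "\n".join(f"{i},{i % 10}" for i in range(1000))

def parse():
    """Simulates an expensive read-and-parse from storage."""
    rows = csv.DictReader(io.StringIO(RAW))
    return [int(r["value"]) for r in rows]

def iterative_uncached(iterations):
    # MapReduce-style: every iteration returns to storage and re-parses.
    total = 0
    for _ in range(iterations):
        total += sum(parse())  # redundant parse on every pass
    return total

def iterative_cached(iterations):
    # Spark-style: parse once, keep the working set in memory, iterate on it.
    cached = parse()           # analogous to rdd.cache() / persist()
    return sum(sum(cached) for _ in range(iterations))

assert iterative_uncached(5) == iterative_cached(5)  # same answer, less I/O
```

The two versions compute the same result; the cached one simply pays the parse cost once, which is why iterative algorithms such as machine learning training loops gain the most from Spark's in-memory model.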

MapReduce's processing style can be sufficient if your data operations and reporting requirements are mostly static and batch processing is acceptable for your purposes, but if you need to run analytics on streaming data, or your processing requires multi-stage logic, you will probably want to go with Spark.

There are three layers in the Spark stack. The bottom layer is the cluster manager, which can be standalone, YARN, or Mesos.


When running in local mode, you don't need a cluster manager to process data.

In the middle, above the cluster manager, is the layer of Spark Core, which provides all the underlying APIs for task scheduling and for interacting with storage.

At the top are modules that run on top of Spark Core, such as Spark SQL for interactive queries, Spark Streaming for real-time analytics, Spark ML for machine learning, and Spark GraphX for graph processing.

The three layers are as follows:

As seen in the preceding diagram, the various libraries, such as Spark SQL, Spark Streaming, Spark ML, and GraphX, all sit on top of Spark Core, which is the middle layer. The bottom layer shows the various cluster manager options.

Let's now look at each of these components briefly:
