Spark for Big Data Analytics

As Hadoop and the related technologies in its ecosystem gained prominence, a few salient deficiencies of the Hadoop operational model became apparent. In particular, the ingrained reliance on the MapReduce paradigm, and other facets related to MapReduce, meant that only major firms deeply invested in these technologies could make fully effective use of the Hadoop ecosystem.

At the UC Berkeley Electrical Engineering and Computer Sciences (EECS) Annual Research Symposium of 2011, a vision for a new research group at the university was announced during a presentation by Prof. Ion Stoica (https://amplab.cs.berkeley.edu/about/). It laid the foundation for what was to become a pivotal unit that would profoundly change the landscape of Big Data. The AMPLab, launched in February 2011, aimed to deliver a scalable and unified solution that integrated Algorithms, Machines, and People and could cater to future needs without requiring any major re-engineering effort.

The best-known and most widely used project to emerge from the AMPLab initiative was Spark, arguably a superior alternative to, or more precisely an extension of, the Hadoop ecosystem.

In this chapter, we will review some of the salient characteristics of Spark and end with a real-world tutorial on how to use it. The topics we will cover are:

The advent of Spark
Theoretical concepts in Spark
Core components of Spark
The Spark architecture
Spark solutions
Spark tutorial
