6.1 INTRODUCING SPARK
Spark is a technology that is very popular nowadays because it makes working on big data
sets really simple. You can transform huge data sets and easily extract insights from them. It is
essentially a distributed computing engine. It offers interactive shells in both Scala and Python,
which allow you to quickly explore and process large data sets. It is fast and intuitive, and it
comes with a whole set of built-in libraries for machine learning, stream processing and graph processing.
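To give a flavour of that interactivity, here is a minimal sketch of a session in the Scala spark-shell. The file name logs.txt and the ERROR filter are hypothetical, used purely for illustration.

// Inside the Scala spark-shell, a SparkSession is already bound to `spark`.
// logs.txt is a hypothetical input file, used only for illustration.
val lines = spark.read.textFile("logs.txt")

// Keep only the error lines and count them. The pipeline is distributed
// across the cluster, yet it reads like ordinary Scala.
val errors = lines.filter(_.contains("ERROR"))
println(s"error lines: ${errors.count()}")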
We need a framework for processing the ever-growing data that has become big data, the data
that might not even logically fit onto one machine, or even if it does, the time to process it slows
down as the size of the data increases. Now, you may say that this is the problem MapReduce
was built to conquer: parallelizing the processing across a cluster of machines and solving the
big data problem by scaling out processing power. This sped up big data processing by orders
of magnitude in comparison to the previously mentioned single-machine algorithms. However,
it has a multitude of difficulties, such as algorithmic complexity, disk bottlenecks and an
inability to perform anything beyond batch processing, to name a few.
Here comes Spark to the rescue. It uses a number of optimizations that allow us to perform
those same Hadoop computations against a fraction of the resources, while still running another
order of magnitude faster. In fact, at the end of 2014, Spark officially beat the existing on-disk
sorting benchmark, completing it three times faster while using about ten times fewer machines.
Taking it one step further, Spark ran the same sort against a petabyte of data and, keeping a
constant rate, comfortably outperformed any previous sort of that magnitude.
In addition to reducing processing time and resource demands, Spark’s generalized abstractions
result in far smaller code. Testing becomes about as trivial as for any other code you might write,
because you can write the code as if it were not distributed, worrying about the distributed
nature of the implementation only at deployment time. The separation of code from deployment also
means that you can now directly interact with your data from your local machine, figuring out
your algorithms on the fly in near real time. This allows you to start with a prototype, build
it into production code, and scale up and out as your processing requires. It even means that
you can debug most problems on a local sample set, growing the code quickly and seamlessly.
Moreover, any big data processing framework should have fault tolerance built right into the
model, as Spark has done from its inception. Last, but definitely not least, Spark unifies different
types of big data needs, whether you are processing batch data or streaming data, or pushing it
through graph or machine learning algorithms.
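The separation of code from deployment is easiest to see in a small example. The sketch below is a hypothetical word-count job: the path input.txt is a stand-in, and the choice of master (a laptop’s local[*] or a cluster manager such as YARN) is left to spark-submit, so the same logic runs unchanged on a local sample or on the full cluster.

import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // No master is hard-coded here; spark-submit supplies it
    // (--master local[*] for prototyping, --master yarn for the cluster).
    val spark = SparkSession.builder().appName("WordCount").getOrCreate()
    import spark.implicits._

    // input.txt is a hypothetical path; point it at a local sample file
    // while prototyping, or at an HDFS/S3 URI at deployment time.
    val counts = spark.read.textFile("input.txt")
      .flatMap(_.split("\\s+"))
      .groupByKey(identity)
      .count()

    counts.show()
    spark.stop()
  }
}

Nothing in the job body changes between the two environments; only the submission command does.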
6.1.1 Hadoop and Spark
We have discussed how Spark shrinks processing time, machine resources and code when
compared to Hadoop. But there is one other problem that it solves, which is the MapReduce
explosion. The explanation comes from Ion Stoica, CEO of Databricks, the company commer-
cializing Spark. He talked about how the first cell phones were clunky and led to a wide array
of specialized complementary gadgetry, but then the smartphone came along and united all of
that chaos into one easy-to-use device. In the same vein, the original big data processing
problems were alleviated by the advent of Hadoop. In the early years, it was a big leap
in technology. However, MapReduce can be very restrictive in its design, as it has a very narrow