Introducing Spark andKafka | 119
focus, specically on batch processing. This over-specication led to an explosion of specialized
libraries, each attempting to solve a different problem. So, if you want to process streaming
data at scale, then you would have to use another complimentary library called Storm. Apache
Storm is a free and open source, scalable, fault-tolerant, distributed real-time computation sys-
tem. Storm makes it easy to reliably process unbounded streams of data, doing for real-time
processing what Hadoop does for batch processing. Again, you may nd it easier to query your
data using something like Hive.
So, along came Spark’s generalized abstractions for big data computing, bringing the big data pipeline into one cohesive unit, just like the smartphone in our analogy. Spark’s aim is to be a unified platform for big data, an SDK for many different kinds of processing, all originating from the core library. This means that every extension library automatically gains from improvements to the core, such as performance boosts. Because the core is so generalized, extending it is fairly simple and straightforward. If you want to query your data, use Spark’s SQL library. If you want to stream, there is a streaming library to help you out. Even machine learning is made more straightforward with MLlib, and GraphX does the same for graph computation.
On top of enabling shared knowledge across libraries, this common base also means that the libraries do not need to be built from the ground up, which results in a minimal code footprint for each library. That lessens the potential for bugs and the other liabilities that extra code typically brings. Even the core library, on top of which all the other libraries are built, is fairly small.
6.1.2 Spark Programming Languages
There is more than one language you can use for writing a Spark application. Spark itself is written in Scala. The other obvious choice is Java, since any Scala library, such as Spark, is also compatible with Java, albeit through a more verbose syntax. Even that verbosity is toned down by the Spark developers’ effort to keep the APIs clean, resulting in an uncluttered Java API that also supports Java 8. Spark also supports Python.
6.1.3 Understanding Spark Architecture
The Spark documentation denes Resilient Distributed Dataset or RDD as the collection of
elements partitioned across the nodes of the cluster that can be operated on, in parallel. From a
user perspective, an RDD can be thought of as a collection, similar to a list or an array.
RDDs are collection of records which are immutable, partitioned, fault tolerant, may be
created by coarse grained operations, lazily evaluated and can be persisted. We will examine
these characteristics in more detail shortly.
Behind the scenes, the work is distributed across a cluster of machines so that computations against the RDD can run in parallel, reducing the processing time by orders of magnitude. This distribution of work and data also means that even if one node fails, the rest of the system continues processing, while the failed work can be restarted immediately elsewhere.
Failures across large clusters are inevitable, but the RDD was designed with resiliency in mind. What makes fault tolerance easy is the fact that most functions in Spark are lazy. Instead of immediately executing a function’s instructions, the instructions are stored for later use in what is referred to as a DAG, or Directed Acyclic Graph.
This graph of instructions continues to grow through a series of calls to Transformations. A Transformation on an RDD creates a new RDD by performing computations on the source RDD. Let us look at some typical examples of Transformations on RDDs with their corresponding outputs.
Consider the RDDs rdd = {1, 2, 3, 3} and rdd1 = {3, 4, 5}, and the following Transformations.
Transformation   Syntax                                   Output
Filter           val filrdd = rdd.filter(x => x != 1)     {2, 3, 3}
Distinct         val disrdd = rdd.distinct()              {1, 2, 3}
Union            val unrdd = rdd.union(rdd1)              {1, 2, 3, 3, 3, 4, 5}
Map              val maprdd = rdd.map(x => x + 1)         {2, 3, 4, 4}
Intersection     val inrdd = rdd.intersection(rdd1)       {3}
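The same lineage can be reproduced interactively. The following is a minimal sketch, assuming a spark-shell session where the built-in SparkContext is available as sc; the variable names mirror the table above.

val rdd = sc.parallelize(Seq(1, 2, 3, 3))
val rdd1 = sc.parallelize(Seq(3, 4, 5))
val filrdd = rdd.filter(x => x != 1)       // {2, 3, 3}
val disrdd = rdd.distinct()                // {1, 2, 3}
val unrdd = rdd.union(rdd1)                // {1, 2, 3, 3, 3, 4, 5}
val maprdd = rdd.map(x => x + 1)           // {2, 3, 4, 4}
val inrdd = rdd.intersection(rdd1)         // {3}
// Nothing has been computed yet; each call only extends the DAG.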
Thus, the DAG ends up being a build-up of the functional lineage that will be sent to the workers, which actually use the instructions to compute the final output of your Spark application.
This laziness is great, but work has to be done at some point. The set of methods that trigger computations are called Actions. An Action forces the actual evaluation of the Transformations. Actions trigger execution of the DAG and act on the data, whether by returning results to the Driver program or by saving them to some persistent storage system, as the case may be.
Let us look at a few examples of Actions:
reduce(func): Aggregates the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
collect(): Returns all the elements of the dataset as an array to the Driver program. This is usually useful after a Filter or another Transformation that returns a sufficiently small subset of the data.
count(): Returns the number of elements in the dataset.
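As a quick, hedged illustration of these Actions, again assuming the spark-shell SparkContext sc:

val nums = sc.parallelize(Seq(1, 2, 3, 3))
val total = nums.reduce((a, b) => a + b)   // 9; addition is commutative and associative
val all = nums.collect()                   // Array(1, 2, 3, 3), returned to the Driver
val n = nums.count()                       // 4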
Refer to Figure 6.1. Here, an RDD is created from a log file and a Transformation is defined to filter the data down to the entries containing the word ‘error’, creating a lineage of RDDs. The count() action then returns the number of elements in the dataset containing the word ‘error’.
Now, it should be noted that while an RDD is immutable, meaning that once you create it you can no longer mutate it, the same does not apply to the data driving it. Each Action triggers a fresh execution of the DAG, so if the underlying data changes, so will the final results. By default, each transformed RDD is recomputed each time you run an Action on it. However, you may also persist an RDD in memory, in which case Spark keeps its elements around on the cluster for much faster access. If different queries are run repeatedly on the same set of data, that dataset can be kept in memory for better execution times. There is also support for persisting RDDs on disk or replicating them across multiple nodes in the cluster.
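A minimal sketch of persisting an RDD that is queried repeatedly, assuming the spark-shell sc and the illustrative log path from Figure 6.1:

import org.apache.spark.storage.StorageLevel
val errors = sc.textFile("/path/to/input_log_file.txt").filter(_.contains("error"))
errors.persist(StorageLevel.MEMORY_ONLY)   // cache() is shorthand for this storage level
val total = errors.count()                 // first Action materializes and caches the RDD
val fatal = errors.filter(_.contains("fatal")).count()   // reuses the cached elements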
Refer to Figure 6.2, and let us now examine how a Spark job gets executed.
When a client submits Spark application code, the Driver converts the code containing Transformations and Actions into a DAG. It then converts the logical DAG into a physical execution plan with a set of stages and, under each stage, creates small physical execution units referred to as Tasks. The Tasks are bundled to be sent to the Spark cluster. The Driver program then talks to the Cluster Manager and negotiates for resources, and the Cluster Manager launches Executors on the Worker nodes. At this point, the Driver sends Tasks to the Executors based on data distribution. Before they begin execution, the Executors register themselves with the Driver program so that the Driver has a holistic view of all the Executors. The Executors then start executing the various Tasks assigned by the Driver program.
FIGURE 6.1 Transformations and action. The figure shows a lineage built from a text file: Transformations produce new RDDs, and the Action returns a value to the Driver.
textFile = sc.textFile("/path/to/input_log_file.txt")
filterError = textFile.filter(f => f.contains("error"))
filterError.count()
FIGURE 6.2 Execution of a Spark job. The figure shows a Driver program (holding the SparkContext) communicating with a Cluster Manager, which launches Executors on Worker nodes; each Executor holds a cache and runs Tasks.
Atany point of time when the Spark application is running, the Driver program monitors the set
of Executors that are running the Tasks. The Driver program also schedules future Tasks based on
data placement by tracking the location of cached data. When the main() method of the Driver
program exits or if it calls the stop () method of the Spark Context, all the Executors are termi-
nated and resources are released from the Cluster Manager.
Once all the nodes have completed their Tasks, the next stage of the DAG can be triggered, and this repeats until the entire graph has been completed. If a chunk of data is lost before the application can complete, the DAG scheduler can find a new node and restart the Transformations from an appropriate point, bringing it back into synchronization with the rest of the nodes.
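To make the Driver’s life cycle concrete, here is a minimal, self-contained sketch of a Driver program; the object name, application name and input path are illustrative, not from the text:

import org.apache.spark.{SparkConf, SparkContext}

object ErrorCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ErrorCount")
    val sc = new SparkContext(conf)

    // Build the lineage (Transformations) and trigger it with an Action.
    val errors = sc.textFile("/path/to/input_log_file.txt")
      .filter(line => line.contains("error"))
    println(s"error lines: ${errors.count()}")

    sc.stop()   // terminates the Executors and releases resources from the Cluster Manager
  }
}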
RDD Creation in Spark: The Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. RDDs are immutable and fault tolerant in nature, and they are distributed collections of objects. A dataset is divided into logical partitions, which are computed on different nodes across the cluster. Thus, an RDD is simply a way of representing a dataset distributed across multiple machines that can be operated on in parallel. RDDs are called resilient because a lost partition can always be recomputed.
There are three ways to create an RDD in Spark:
1. Parallelizing an existing collection in the driver program.
2. Referencing a dataset in an external storage system (for example, HDFS, HBase or a shared file system).
3. Creating an RDD from already existing RDDs.
Parallelized Collection (Parallelizing): When first learning Spark, RDDs are generally created from a parallelized collection, i.e., by taking an existing collection in the program and passing it to SparkContext’s parallelize() method. This approach is convenient in the early stages of learning Spark because it quickly creates RDDs in the Spark shell for you to operate on. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.
Launch Spark in YARN mode:
spark-shell --master yarn --deploy-mode client
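The original page shows the next step as a shell screenshot that did not survive extraction. A hedged reconstruction follows, keeping the variable name studata and the sortByKey() call described below; the actual (subject, marks) pairs are assumptions:

val studata = sc.parallelize(Seq(("Physics", 85), ("Chemistry", 72), ("Biology", 90), ("Maths", 95)))
val sorted = studata.sortByKey()           // sorts by the subject name, i.e., the Key
sorted.collect().foreach(println)          // (Biology,90), (Chemistry,72), (Maths,95), (Physics,85)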
Here, in the above example, ‘studata’ is simply a variable, and sc is the SparkContext object provided by the Spark shell itself. The function sortByKey() is called over the collection RDD ‘studata’, and the list is sorted by the subject name, i.e., the Key here.
RDD from External Datasets (Referencing a Dataset): Using a text file:
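The book shows this as a screenshot that is missing here; a minimal sketch, assuming the spark-shell sc and an illustrative HDFS path:

val lines = sc.textFile("hdfs:///path/to/data.txt")
lines.take(5).foreach(println)             // inspect the first few lines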
Using a JSON file:
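The corresponding screenshot is also missing. One common approach, assuming Spark SQL is available and the shell’s SparkSession is named spark, reads the JSON into a DataFrame and takes its underlying RDD; the path is illustrative:

val jsonRdd = spark.read.json("/path/to/data.json").rdd   // an RDD[Row]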
RDD from Existing RDD: A Transformation produces one RDD from another and is thus the way to create an RDD from an already existing RDD. This is a key difference between Apache Spark and Hadoop MapReduce. A Transformation acts as a function that takes in an RDD and produces a new one. The input RDD does not get changed, because RDDs are immutable in nature; instead, one or more new RDDs are produced by applying the operations.
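As a small sketch, again assuming sc from the Spark shell:

val words = sc.parallelize(Seq("spark", "kafka", "hadoop"))
val lengths = words.map(_.length)          // a new RDD; 'words' itself is unchanged
lengths.collect()                          // Array(5, 5, 6)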