Let us take a look at the competition. On one side, you have Matlab and R, which have the
benefit of being fairly easy to use but are less scalable. On the other side, there are Mahout
and GraphLab, which are more scalable but at the cost of ease of use.
ML pipelines were officially introduced into the Spark package as an attempt to simplify
machine learning, embracing the typical machine learning flow of loading data, extracting features,
training a model and testing that trained model. Throughout the pipeline, a standard interface allows
tuning, testing and early failure detection.
ML algorithms help with spam filtering, fraud detection and even recommendation analysis.
An abundance of such use cases lies at the heart of machine learning. In addition, MLlib is
attempting to bring this complex subject matter to a larger audience than has previously
been possible.
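To make the loading, feature extraction, training and testing flow concrete, the following is a minimal sketch of a Spark ML pipeline in PySpark. It assumes a Spark version whose shell provides a SparkSession named spark, and the tiny spam-like dataset is made up purely for illustration.
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

# Load (here: create) a small labelled training set with 'text' and 'label' columns
training = spark.createDataFrame(
    [("win money now", 1.0), ("meeting at noon", 0.0),
     ("cheap pills", 1.0), ("project update", 0.0)],
    ["text", "label"])

# Feature extraction and training stages chained into one pipeline
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

model = pipeline.fit(training)          # train the whole flow in one call

# Test the trained model on unseen text
test = spark.createDataFrame([("free money",), ("lunch meeting",)], ["text"])
model.transform(test).select("text", "prediction").show()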
6.1.7 Spark Libraries: GraphX
We now look at our final Spark library, GraphX. It brings Spark’s table-like structure into
a graph-structured world similar to that of social networking. GraphX works by using RDDs
behind the scenes, storing the data in a graph-optimized structure aptly named Graph.
Let us check how GraphX compares with other systems. GraphX’s PageRank implementation has been
measured to run at double the speed of Giraph’s, while GraphLab measures about 33% slower than
GraphX.
What are some of the key applications of GraphX? Well, remember, the web itself is a giant
graph. There is PageRank, the algorithm for website ranking, and then there are social networks.
There is plenty of analysis that can be performed on the vast social graphs available today, as well as
applications in science such as genetic analysis.
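GraphX itself exposes a Scala API, so as an illustration in this book’s Python setting, the following is a toy RDD-based PageRank sketch: it is not GraphX, but it shows the kind of iterative graph computation (rank propagation with a damping factor) that GraphX optimizes. The three-page link graph and the ten iterations are made up for illustration.
# Toy link graph: each page maps to the list of pages it links to
links = sc.parallelize([("a", ["b", "c"]), ("b", ["c"]), ("c", ["a"])]).cache()
ranks = links.mapValues(lambda _: 1.0)      # start every page with rank 1.0

for _ in range(10):                         # a fixed number of iterations, chosen arbitrarily
    # Each page sends an equal share of its current rank to every page it links to
    contribs = links.join(ranks).flatMap(
        lambda page: [(dest, page[1][1] / len(page[1][0])) for dest in page[1][0]])
    # Sum the contributions per page and apply the standard 0.15/0.85 damping factor
    ranks = contribs.reduceByKey(lambda a, b: a + b).mapValues(lambda r: 0.15 + 0.85 * r)

print(ranks.collect())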
Some use cases where Spark outperforms Hadoop in processing are as follows.
Stream Processing: For processing logs and detecting fraud in live streams in order to raise alerts,
Apache Spark is the best solution (a short sketch follows this list).
Sensor Data Processing: Apache Spark’s in-memory computing works best here, as data has to be
retrieved and combined from different sources.
Near Real-Time Querying: Spark is preferred over Hadoop for near real-time querying of data.
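As a rough illustration of the stream-processing case, the following is a minimal Spark Streaming sketch; the socket source, the port and the ‘FRAUD’ marker are assumptions made purely for the example.
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 1)                        # process the stream in 1-second micro-batches
lines = ssc.socketTextStream("localhost", 9999)      # a live text source (assumed for the sketch)

alerts = lines.filter(lambda line: "FRAUD" in line)  # naive filter standing in for fraud detection
alerts.pprint()                                      # print a few matching records per batch

ssc.start()
ssc.awaitTermination()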
6.1.8 PySpark: Spark with Python
What is PySpark: PySpark is the Python API for Spark, released by the Apache Spark community to
support Python with Spark. Using PySpark, we can easily integrate and work with RDDs in the Python
programming language too. There are numerous features that make PySpark such an amazing
framework when it comes to working with huge datasets. Be it performing computations on large
datasets or just analysing them, data engineers are turning to this tool.
Key Features of PySpark
Real Time Computations: Due to the in-memory processing in the PySpark framework, it shows
low latency.
Polyglot: The PySpark framework is compatible with various languages, such as Scala, Java, Python
and R, which makes it one of the most preferable frameworks for processing huge datasets.
Caching and Disk Persistence: The PySpark framework provides powerful caching and very
good disk persistence (a short sketch follows this list).
Fast Processing: The PySpark framework is way faster than other traditional frameworks for big
data processing.
Works Well with RDD: The Python programming language is dynamically typed, which helps
when working with RDDs.
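As an illustration of the caching and disk persistence feature mentioned above, here is a minimal sketch; the input path reuses the example file from later in this chapter, and the filter condition is made up.
from pyspark import StorageLevel

logs = sc.textFile("/data/sparkInput.txt")

# Keep the partitions in memory and spill them to disk when memory runs short;
# logs.cache() would be the shorthand for the default memory-only level.
logs.persist(StorageLevel.MEMORY_AND_DISK)

print(logs.count())                                        # first action materialises and persists the RDD
print(logs.filter(lambda line: "error" in line).count())   # this action reuses the persisted partitions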
Need of PySpark: Python is one of the most widely used programming languages among data
scientists. Owing to its simple interactive interface, its ease of learning and its nature as a
general-purpose language, it is trusted by many data scientists to perform data analysis, machine
learning and many other tasks on big data. On the other hand, Apache Spark is one of the most
amazing tools that help in handling big data.
However, there can be confusion around which language is preferable for Spark: Scala or Python.
The following table addresses this.
Parameter: Performance and speed of execution of an application, and throughput
Python with Spark: Python is comparatively slower than Scala when used with Spark, but programmers can do much more with Python than with Scala due to the easy interface that it provides.
Scala with Spark: Spark is written in Scala, so it integrates well with Scala. It is faster than Python.

Parameter: Data science libraries for machine learning and deep learning
Python with Spark: In the Python API, you do not have to worry about visualizations or data science libraries. You can easily port the core parts of R-based work to Python as well.
Scala with Spark: Scala lacks proper data science libraries and tools, and it does not have proper local tools and visualizations.

Parameter: Readability of code and application maintenance
Python with Spark: Readability, maintenance and familiarity of code are better in the Python API.
Scala with Spark: In the Scala API, it is easier to make internal changes, since Spark itself is written in Scala.

Parameter: Complexity (in terms of learning and use in development)
Python with Spark: The Python API has an easy, simple and comprehensive interface.
Scala with Spark: Scala’s syntax, and the fact that it tends to produce verbose code, is why it is considered the more complex language.
Therefore, by comparing the parameters as shown above, the conclusion is to use Scala or
Python based on the project requirements and the environment of the cluster. We need to keep
in mind the data volume, the cluster memory and so on when we choose the suitable language
for Spark.
Launch PySpark: Start the PySpark shell using the command ‘pyspark’.
As previously seen in Spark SQL, we now read ‘emp.json’ from HDFS using PySpark.
>>>RDDRead = sc.textFile("/sparkData/emp.json")
>>>RDDRead.collect()
>>>RDDRead.first()
>>>RDDRead.count()
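Note that sc.textFile only reads the JSON file as plain lines of text. If the Spark SQL behaviour from the earlier section is wanted, the file can instead be parsed into a DataFrame; this sketch assumes a Spark version whose shell provides a SparkSession named spark.
>>>empDF = spark.read.json("/sparkData/emp.json")
>>>empDF.printSchema()
>>>empDF.show()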
MapReduce Wordcount Problem Using PySpark: Initially, create a file named ‘sparkInput.txt’ locally and
put it into HDFS inside the ‘/data’ directory using the command below.
hadoop fs -put /<local-path>/sparkInput.txt /data/sparkInput.txt
Start the PySpark shell and develop the script as below:
>>>from pyspark import SparkContext, SparkConf
>>>conf = SparkConf().setAppName("PysparkWordcount")
>>>fileRDD = sc.textFile("/data/sparkInput.txt")
>>>nonempty_lines = fileRDD.filter(lambda x: len(x) > 0)
>>>words = nonempty_lines.flatMap(lambda x: x.split(' '))
>>># Perform the map-reduce: pair each word with 1, sum the counts, then sort by count in descending order
>>>wordcount = words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y).map(lambda x: (x[1], x[0])).sortByKey(False)
>>>for word in wordcount.collect():
...     print(word)    # keep the four-space indentation inside the loop
Save the output in HDFS inside the ‘/data/pySparkOutput’ directory as follows.
>>>wordcount.saveAsTextFile("hdfs://localhost:54310/data/pySparkOutput")
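The saved result can then be inspected directly from HDFS; saveAsTextFile writes one part file per partition, so a glob such as the following (paths as in the example above) lists the counted words.
hadoop fs -cat /data/pySparkOutput/part-*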