Apache Spark

Apache Spark is a cluster-computing framework that originated at the University of California, Berkeley's AMPLab. Spark is not a substitute for the complete Hadoop ecosystem; rather, it mainly replaces the MapReduce component of a Hadoop cluster. Whereas Hadoop MapReduce processes data with on-disk batch operations, Spark uses both in-memory and on-disk operations. As expected, it is fastest with datasets that fit in memory, which is why it suits real-time streaming applications, but it also handles datasets that don't fit in memory with ease.
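To illustrate this hybrid behaviour, here is a minimal sketch using Spark's Scala API (not from the original text; the application name and dataset are illustrative). The MEMORY_AND_DISK storage level keeps partitions in memory and spills to disk only those that do not fit:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CachingSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; a real cluster would use another master.
    val spark = SparkSession.builder()
      .appName("caching-sketch")
      .master("local[*]")
      .getOrCreate()

    val numbers = spark.sparkContext.parallelize(1 to 1000000)

    // Keep the dataset in memory, spilling partitions to disk only if they
    // do not fit -- Spark's hybrid in-memory/on-disk processing.
    val doubled = numbers.map(_ * 2).persist(StorageLevel.MEMORY_AND_DISK)

    // The first action computes and caches; later actions reuse the cache.
    println(doubled.sum())
    println(doubled.count())

    spark.stop()
  }
}
```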

Apache Spark can run on top of HDFS using YARN, or in a standalone mode.

This means that in some cases (such as the use case we present below) we can drop Hadoop entirely in favor of Spark, provided our problem is well defined and constrained within Spark's capabilities; a sketch of this deployment flexibility follows.
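The snippet below sketches how little the application code depends on that deployment choice: the cluster manager is selected purely through the master URL (the URL forms shown are Spark's standard ones; the host name is a placeholder and the rest is illustrative):

```scala
import org.apache.spark.sql.SparkSession

object MasterSelectionSketch {
  def main(args: Array[String]): Unit = {
    // The same application code can target different cluster managers by
    // changing only the master URL:
    //   "local[*]"                 -- a single machine, no Hadoop at all
    //   "spark://master-host:7077" -- Spark's own standalone cluster manager
    //   "yarn"                     -- on top of a Hadoop/YARN cluster
    val spark = SparkSession.builder()
      .appName("master-selection-sketch")
      .master("local[*]") // swap for "yarn" or "spark://..." as needed
      .getOrCreate()

    println(s"Running with master: ${spark.sparkContext.master}")
    spark.stop()
  }
}
```

In practice the master is usually supplied externally through spark-submit's --master flag rather than hard-coded, so the same compiled application can move between standalone and YARN clusters unchanged.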

Spark can be up to 100 times faster than Hadoop MapReduce for in-memory operations. It offers user-friendly APIs for Scala (its native language), Java, and Python, as well as Spark SQL (a variant of the SQL92 specification). Both Spark and MapReduce are resilient to failure; Spark achieves this through resilient distributed datasets (RDDs), which are distributed across the whole cluster.
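That resilience comes from lineage: each RDD records the chain of transformations that produced it, so a lost partition can be recomputed from its ancestors rather than restored from replicas. A minimal sketch, with illustrative data:

```scala
import org.apache.spark.sql.SparkSession

object RddLineageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-lineage-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // An RDD is partitioned across the cluster; each transformation below
    // records a step in the RDD's lineage rather than executing immediately.
    val words  = sc.parallelize(Seq("spark", "is", "fast", "spark", "scales"))
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

    // If a partition is lost, Spark recomputes just that partition from the
    // lineage -- fault tolerance without full data replication.
    println(counts.toDebugString) // prints the lineage graph
    counts.collect().foreach(println)

    spark.stop()
  }
}
```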

As the overall Spark architecture shows, several different Spark modules can work together to serve different needs, from SQL querying to streaming and machine learning libraries.
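For example, the Spark SQL module runs on the same core engine as the RDD API. The sketch below (the table name, columns, and data are made up for illustration) registers a small DataFrame as a temporary view and queries it with plain SQL:

```scala
import org.apache.spark.sql.SparkSession

object ModulesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("modules-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Spark SQL sits on top of the same core engine as the RDD API:
    // a small in-memory dataset queried with standard SQL.
    val sales = Seq(("books", 12.0), ("music", 7.5), ("books", 3.25))
      .toDF("category", "amount")
    sales.createOrReplaceTempView("sales")

    spark.sql(
      "SELECT category, SUM(amount) AS total FROM sales GROUP BY category"
    ).show()

    spark.stop()
  }
}
```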
