Scala and the Spark ecosystem

To provide further enhancements and additional big data processing capabilities, Spark can be configured to run on top of existing Hadoop-based clusters. The core of Spark is written in Scala, while its APIs are available in Scala, Java, Python, and R. Compared to MapReduce, Spark offers a more general and powerful programming model, and it also provides several libraries as part of the Spark ecosystem that add capabilities for general-purpose data processing and analytics, graph processing, large-scale structured SQL, and Machine Learning (ML).

The Spark ecosystem consists of the following components (for more details, please refer to Chapter 16, Spark Tuning):

  • Apache Spark core: This is the underlying engine of the Spark platform on which all the other functionality is built. It is also the component that provides in-memory processing.
  • Spark SQL: As mentioned, Spark core is the underlying engine and all the other components and features are built on top of it. Spark SQL is the Spark component that provides support for structured and semi-structured data (a minimal usage sketch follows this list).
  • Spark Streaming: This component receives streaming data and converts it into mini-batches that can then be analyzed with the same engine used for batch processing.
  • MLlib (Machine Learning Library): MLlib is a machine learning framework that supports a large number of ML algorithms in a distributed fashion.
  • GraphX: A distributed graph processing framework built on top of Spark that expresses user-defined graph computations in a parallel fashion.
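To make the roles of Spark core and Spark SQL more concrete, here is a minimal sketch, assuming a Spark 2.x build and a local SparkSession; the file name people.json and the column names are hypothetical placeholders rather than part of any example in this book.

import org.apache.spark.sql.SparkSession

// Start a local Spark session (Spark core provides the engine underneath)
val spark = SparkSession.builder()
  .appName("EcosystemSketch")
  .master("local[*]")
  .getOrCreate()

// Spark SQL: read semi-structured JSON and query it declaratively
val people = spark.read.json("people.json")   // hypothetical input file
people.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 21").show()

spark.stop()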

As mentioned earlier, most functional programming languages allow users to write nice, modular, and extensible code. Functional programming also encourages safe ways of programming by writing functions that look like mathematical functions. Now, how did Spark manage to make all of these APIs work as a single unit? It was possible because of advancements in hardware and, of course, the functional programming concepts. Note that adding syntactic sugar for lambda expressions is not enough to make a language functional; it is only the start.
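To make the lambda-expression point concrete, the following is a small, Spark-independent Scala sketch of anonymous functions being passed to a higher-order function; the values used are arbitrary examples.

// An anonymous function (lambda) bound to a name
val square: Int => Int = x => x * x

// map is a higher-order function: it takes another function as an argument
val squares = List(1, 2, 3, 4).map(square)   // List(1, 4, 9, 16)

// The same idea written inline using placeholder syntax (syntactic sugar)
val doubled = List(1, 2, 3, 4).map(_ * 2)    // List(2, 4, 6, 8)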

Although the RDD concept in Spark works quite well, there are many use cases where it is a bit complicated to use because of its immutability. Consider the classic example of computing an average: we want the source code to be robust and readable and, to reduce the overall cost, we do not want to first compute the totals and then the counts in separate passes, even if the data is cached in main memory.

// Assumes a case class along the lines of: case class People(name: String, age: Int)
val data: RDD[People] = ...
data.map(person => (person.name, (person.age, 1)))
  .reduceByKey { case ((sumA, cntA), (sumB, cntB)) =>
    // element-wise tuple addition; the semigroup-style _ |+| _ would need a library such as Scalaz or Cats
    (sumA + sumB, cntA + cntB)
  }
  .mapValues { case (total, count) => total.toDouble / count }   // average age per name
  .collect()

The DataFrame API (which will be discussed in detail in later chapters) produces equally terse and readable code (a sketch follows the list below), yet the functional API fits well for most use cases and minimizes the number of MapReduce stages; too many shuffles can become dramatically costly. The key reasons for this are as follows:

  • Large code bases require static typing to instantly eliminate trivial mistakes, such as typing aeg instead of age
  • Complex code requires transparent APIs to communicate design clearly
  • The 2x speed-ups that the DataFrame API gains via under-the-hood mutation can be equally achieved by encapsulating state via OOP and using mapPartitions and combineByKey
  • Flexibility and Scala features are required to build functionality quickly
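To illustrate how terse the DataFrame version of the earlier average-age computation can be, here is a hedged sketch; it assumes a SparkSession named spark is in scope (as in spark-shell) and a hypothetical case class People(name: String, age: Int) like the one used in the RDD example above.

import org.apache.spark.sql.functions.avg
import spark.implicits._   // assumes a SparkSession value named spark

// Hypothetical sample data standing in for the earlier RDD[People]
val people = Seq(People("alice", 29), People("bob", 35), People("alice", 31)).toDS()

// Average age per name, expressed declaratively; Catalyst plans the aggregation,
// so there is no manual (total, count) bookkeeping
people.groupBy("name")
  .agg(avg("age").as("avgAge"))
  .collect()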

The combination of OOP and FP with Spark can make a pretty hard problem much easier to solve. For example, at Barclays, an application called the Insights Engine was recently developed to execute an arbitrary number N of near-arbitrary SQL-like queries, and it can execute them in a way that scales with increasing N.

Now let's talk about pure functions, higher-order functions, and anonymous functions, which are three important concepts in functional programming with Scala.
