Why use Java for Spark?

With the rise of multi-core CPUs, Java's design could not keep up in exploiting the extra power at its disposal because of the complexity surrounding concurrency and immutability. We will discuss this in detail later. First, let's understand the importance and usability of Java in the Hadoop ecosystem. As MapReduce was gaining popularity, Google introduced a framework called FlumeJava that helped in pipelining multiple MapReduce jobs. FlumeJava consists of immutable parallel collections capable of performing lazily evaluated, optimized chained operations. That might sound eerily similar to what Apache Spark does, but even before Apache Spark and FlumeJava there was Cascading, which built an abstraction over MapReduce to simplify the way MapReduce tasks are developed, tested, and run. All these frameworks were primarily Java implementations meant to simplify MapReduce pipelines, among other things.

These abstractions were simple; in fact, Apache Crunch, the Apache implementation of FlumeJava, was so concise that a word count problem could be written in four to five lines of Java code. That is huge!
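
To make that claim concrete, here is a minimal sketch of what a Crunch word count can look like. The class name and the input and output paths are illustrative, not taken from the original text; the point is that the pipeline itself, read, tokenize, count, and write, fits in a handful of lines:

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class WordCount {
    public static void main(String[] args) {
        Pipeline pipeline = new MRPipeline(WordCount.class);
        PCollection<String> lines = pipeline.readTextFile("input.txt"); // hypothetical input path
        // Tokenize each line into words, then count the occurrences of each word
        PTable<String, Long> counts = lines.parallelDo(new DoFn<String, String>() {
            @Override
            public void process(String line, Emitter<String> emitter) {
                for (String word : line.split("\\s+")) {
                    emitter.emit(word);
                }
            }
        }, Writables.strings()).count();
        pipeline.writeTextFile(counts, "counts"); // hypothetical output directory
        pipeline.done();
    }
}
```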

Nevertheless, these Java implementations could never do away with MapReduce's heavy I/O-bound operations. It was at this point that Apache Spark was born. Written primarily in Scala, it addressed not only the limitations of MapReduce, but also the verbosity of Java when developing a MapReduce job without any abstraction framework. In their paper, Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, the authors state that:

"We chose Scala due to its combination of conciseness (which is convenient for interactive use) and efficiency (due to static typing). However, nothing about the RDD abstraction requires a functional language."

Eventually, the Java community realized that, in time, it would have to support functional programming to harness the capacity of cluster computing, as vertical scaling of enterprise hardware was not the solution for the applications of the future. Java started releasing features around concurrency. Java 7 introduced the fork/join framework to harness the capabilities of modern multi-core hardware; it also introduced a new bytecode instruction, invokedynamic, which defers the linkage of a method call until runtime. This single instruction-set change enabled dynamically typed languages on the JVM. Java 8 beautifully utilized the late binding offered by invokedynamic and built its lambda expressions around it. It also released the Stream API, which can be thought of as a collection framework for functional programming in Java that does not actually store the elements. Other notable changes in Java 8 are:

  • Functional interfaces
  • Default and static methods in interfaces
  • New date and time API
  • Scripting support
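
Lambda expressions and the Stream API are the features most relevant to Spark development. As a minimal, self-contained sketch (the input string and class name are made up for illustration), a word count in plain Java 8 can be written as:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class StreamWordCount {
    public static void main(String[] args) {
        String text = "to be or not to be"; // hypothetical input
        // Lambdas plus the Stream API: split into words, group by word, count lazily
        Map<String, Long> counts = Arrays.stream(text.split("\\s+"))
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        counts.forEach((word, n) -> System.out.println(word + " -> " + n));
    }
}
```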

Apache Spark has added support for Java 8, and hence the difference between writing code in Scala and in Java is blurring with each release. The learning curve has also flattened, as one is left to learn only the Apache Spark API and not Scala as well. Another highlight of Apache Spark is Spark SQL, which can come in handy if interactive and dynamic coding is required.
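
As an illustration of how close Java 8 code can now come to Scala, here is a sketch of the same word count written against the Spark 2.x Java API. The application name, master URL, and file path are assumptions for a local run, not part of the original text:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkWordCount").setMaster("local[*]"); // local mode, for illustration
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> lines = sc.textFile("input.txt"); // hypothetical input path
        // Java 8 lambdas stand in for Scala closures in each transformation
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);
        counts.collect().forEach(t -> System.out.println(t._1() + " -> " + t._2()));
        sc.stop();
    }
}
```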
