Code generation

Apache Spark V1.5 introduced code generation for expression evaluation. To understand what this means, let's start with an example. Have a look at the following expression:

val i = 23
val j = 5
val z = i*x+j*y

Imagine x and y are values coming from a row in a table. Now consider that this expression is applied to every row of a table with, let's say, one billion rows. Without code generation, the expression is evaluated by generic interpreting code that walks through it operation by operation for each row, so its overhead (virtual function calls, boxing of primitive values) is paid one billion times. What Tungsten does instead is transform the expression into byte-code tailored to it and have that byte-code executed by the executor threads.
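The difference between the two evaluation styles can be sketched in a few lines of Scala. This is only an illustration under stated assumptions: the `Expr`, `Lit`, `Col`, `Add`, and `Mul` types are hypothetical stand-ins, not Spark's actual expression classes, and the `generated` function is written by hand to show roughly what generated code amounts to, not what Tungsten literally emits.

```scala
// Hypothetical mini expression tree for illustration; not Spark's real classes.
sealed trait Expr { def eval(row: Array[Int]): Int }
final case class Lit(value: Int) extends Expr {
  def eval(row: Array[Int]): Int = value
}
final case class Col(index: Int) extends Expr {
  def eval(row: Array[Int]): Int = row(index)
}
final case class Add(left: Expr, right: Expr) extends Expr {
  def eval(row: Array[Int]): Int = left.eval(row) + right.eval(row)
}
final case class Mul(left: Expr, right: Expr) extends Expr {
  def eval(row: Array[Int]): Int = left.eval(row) * right.eval(row)
}

object ExprDemo {
  val i = 23
  val j = 5

  // Interpreted form of i*x + j*y: a tree of seven nodes,
  // each costing one virtual eval call per row.
  val tree: Expr = Add(Mul(Lit(i), Col(0)), Mul(Lit(j), Col(1)))

  // Hand-written equivalent of specialized code: one flat function
  // with the constants inlined and no tree to walk.
  def generated(row: Array[Int]): Int = 23 * row(0) + 5 * row(1)

  def main(args: Array[String]): Unit = {
    // Each row holds (x, y).
    val rows = Array(Array(1, 2), Array(3, 4))
    rows.foreach { row =>
      println(s"interpreted=${tree.eval(row)} generated=${generated(row)}")
    }
    // prints:
    // interpreted=33 generated=33
    // interpreted=89 generated=89
  }
}
```

Both forms compute the same result; the point is that the tree version pays for several indirect calls per row, while the flat version is a single arithmetic expression that the JVM can optimize easily.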

As you might know, every class executed on the JVM is represented as byte-code, an intermediate abstraction layer between the source code and the actual machine code specific to each microprocessor architecture. This portability was one of the major selling points of Java decades ago. So the basic workflow is:

Java source code gets compiled into Java byte-code.
Java byte-code gets interpreted by the JVM.
The JVM translates this byte-code and issues platform-specific machine code instructions to the target CPU.

These days application developers rarely think of creating byte-code on the fly, but this is exactly what happens in code generation. Apache Spark Tungsten analyzes the task to be executed and, instead of relying on chaining pre-compiled components, generates specific, high-performing byte-code, comparable to code written by hand for that task, to be executed by the JVM.

Another thing Tungsten does is accelerate the serialization and deserialization of objects, because the native serialization framework that the JVM provides tends to be very slow. A major bottleneck in distributed data processing systems is the shuffle phase (used for sorting and grouping similar data together), in which data gets sent over the network. In Apache Spark, object serialization and deserialization, and not I/O bandwidth, are the main contributors to this bottleneck, and they also add to the CPU bottleneck. Increasing performance here therefore relieves both.
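To get a feeling for why generic JVM serialization is costly, the following sketch compares it with a fixed-width binary encoding of the same record. The `Point` type and the 8-byte layout are assumptions made up for this illustration; this is not Tungsten's actual binary format, only a minimal example of the general idea of replacing object serialization with a compact, known layout.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import java.nio.ByteBuffer

// Hypothetical record type used only for this illustration.
final case class Point(x: Int, y: Int)

object SerDemo {
  // Generic JVM serialization: writes a stream header and class
  // metadata in addition to the two field values.
  def javaSerialize(p: Point): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(p) finally out.close()
    bytes.toByteArray
  }

  // Assumed fixed-width layout: two 4-byte ints and nothing else.
  def compactSerialize(p: Point): Array[Byte] =
    ByteBuffer.allocate(8).putInt(p.x).putInt(p.y).array()

  // Reads the two ints back in the order they were written.
  def compactDeserialize(data: Array[Byte]): Point = {
    val buf = ByteBuffer.wrap(data)
    Point(buf.getInt(), buf.getInt())
  }

  def main(args: Array[String]): Unit = {
    val p = Point(23, 5)
    println(s"Java serialization: ${javaSerialize(p).length} bytes")
    println(s"Compact encoding:   ${compactSerialize(p).length} bytes")
    println(s"Round trip intact:  ${compactDeserialize(compactSerialize(p)) == p}")
  }
}
```

The compact encoding is always 8 bytes here, while the Java-serialized form is considerably larger because it carries type information with every object graph it writes. Multiplied over the records moved during a shuffle, that difference costs both CPU time and network volume.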

Everything introduced so far is known as the Tungsten Phase 1 improvements. With Apache Spark V2.0, the Tungsten Phase 2 improvements went live, which are the following:

Columnar storage
Whole-stage code generation
