A practical example on whole stage code generation performance

So let's actually run a little test. We join one billion integers. Once with whole stage code generation turned off and once with it turned on. So let's first have a look at the corresponding explained outputs:

An asterisk symbol, in the explained output, indicates that these operations are executed as a single thread with whole stage generated code. Note that the BroadcastExchange operation, which belongs to BroadcastHashJoin, is not part of this, since it might run on another machine.

Operator fusing: Other data processing systems call this technique operator fusing. In Apache Spark whole stage code generation is actually the only means for doing so.

Now, let's actually turn off this feature:

scala> spark.conf.set("spark.sql.codegen.wholeStage",false)

If we now explain the preceding statement again, we get the following output:

The only difference is the missing asterisk symbol in front of some operators, and this actually means that each operator is executed as a single thread or at least as different code fragments, passing data from one to another.

So let's actually materialize this query by calling the count method, once with and once without whole stage code generation, and have a look at the performance differences:

As we can see, without whole stage code generation, the operations (composed of job 0 and job 1) take nearly 32 seconds, whereas with whole stage code generation enabled, it takes only slightly more than half a second (job id 2 and 3).

If we now click on the job's description, we can obtain a DAG visualization, where DAG stands for directed acyclic graph - another way of expressing a data processing workflow:

DAG visualization is a very important tool when it comes to manual performance optimization. This is a very convenient way to find out how Apache Spark translates your queries into a physical execution plan. And therefore a convenient way of discovering performance bottlenecks.

By the way, RDBMS administrators use a similar visualization mechanism in relational database tooling which is called visual explain. As we can see in Apache Spark V2.0, with the whole stage code generation enabled, multiple operators get fused together.

Table of Contents for A practical example on whole stage code generation performance

Create new playlist

Sign In

Sign Up

Table of Contents for
A practical example on whole stage code generation performance