Understanding the UnsafeRow object

Since Apache Spark V1.4, Tungsten has been built around org.apache.spark.sql.catalyst.expressions.UnsafeRow, a binary representation of a row object. An UnsafeRow object consists of three regions, as illustrated in the following figure:

Note that all regions, and the fields contained within them, are 8-byte aligned, so each individual chunk of data fits exactly into a 64-bit CPU register. Comparing two such chunks can therefore be done in a single machine instruction. In addition, memory access patterns with an 8-byte stride are very cache-friendly; more on this will be explained later.
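To make the 8-byte alignment concrete, the following is a minimal sketch of the offset arithmetic that such a layout implies. It assumes the three regions are, in order, a null-tracking bit set, one 8-byte slot per field, and a variable-length region, which is the layout described in the comments of Spark's UnsafeRow source. The class and method names here are hypothetical and only illustrate the arithmetic; this is not Spark's actual implementation, and offsets are taken relative to the start of the row.

```java
// Illustrative sketch only: hypothetical helper, not part of Apache Spark.
final class UnsafeRowLayoutSketch {

    // Every region and every slot is a multiple of one 64-bit word.
    static final int WORD_BYTES = 8;

    // Region 1 (assumed): one null bit per field, rounded up to whole 8-byte words.
    static int bitSetWidthInBytes(int numFields) {
        return ((numFields + 63) / 64) * WORD_BYTES;
    }

    // Region 2 (assumed): field `ordinal` owns one 8-byte slot right after the bit set.
    static long fieldOffset(int numFields, int ordinal) {
        return bitSetWidthInBytes(numFields) + (long) ordinal * WORD_BYTES;
    }

    // Region 3 (assumed): variable-length data starts after the last fixed slot.
    static long variableRegionOffset(int numFields) {
        return bitSetWidthInBytes(numFields) + (long) numFields * WORD_BYTES;
    }

    // Pads a payload length up to the next multiple of 8 so alignment is preserved.
    static int roundUpToWord(int numBytes) {
        return (numBytes + 7) & ~7;
    }

    public static void main(String[] args) {
        int numFields = 3;
        System.out.println("bit set width:   " + bitSetWidthInBytes(numFields));
        System.out.println("offset, field 2: " + fieldOffset(numFields, 2));
        System.out.println("variable region: " + variableRegionOffset(numFields));
        System.out.println("5 bytes padded:  " + roundUpToWord(5));
    }
}
```

Because every offset produced this way is a multiple of 8, any fixed-length value can be read with a single aligned 64-bit load, which is the property the paragraph above relies on.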

Please note that although the technology is called unsafe, it is perfectly safe for you as a user, since Tungsten takes care of the memory management. The name simply refers to a technology that allows people other than the JVM creators to implement memory management, which is another success story for the openness of the JVM. In the following sections, we'll explain the purpose of each memory region of the UnsafeRow object.
