A practical example of memory usage and performance

So let's have a look at how Tungsten actually saves a lot of memory when using DataFrames or Datasets in contrast to RDDs. Consider the following code:

//create the integers 0 to 10^6 and store them in intList
val intList = 0 to math.pow(10, 6).toInt
//create an RDD out of intList
val rdd = sc.parallelize(intList)
//..and count it
rdd.cache.count

Here we create a Scala collection containing 1,000,000 integers, create an RDD out of it, cache it, and finally count the number of entries. We can do the same using a DataFrame:

//create a DataFrame out of intList
val df = intList.toDF
//..and count it
df.cache.count

Again, we take the list of integers, but this time we create a DataFrame out of it. And again, we cache and count.
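Note that the toDF conversion relies on Spark's implicit encoders. In the interactive spark-shell these are imported automatically; in a standalone application you would have to import them from your SparkSession yourself. The following is a minimal sketch under that assumption (the application name and the variable name spark are illustrative, not part of the original example):

//assuming a standalone application rather than the spark-shell
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("TungstenExample").getOrCreate()
//bring toDF and the implicit encoders into scope
import spark.implicits._

val df = (0 to math.pow(10, 6).toInt).toDF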

Luckily, Apache Spark provides very detailed insights into its internals. We can access the Apache Spark management UI on port 4040. Besides job- and task-related information, the Storage tab provides information on each RDD currently persisted on the cluster. Let's have a look at the memory usage:
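If you prefer to check this from code rather than from the UI, the SparkContext also exposes storage information programmatically. The following is a rough sketch of one way to do it; the RDD name "intRDD" is just an illustrative label:

//give the RDD a readable name so it is easy to spot in the Storage tab
rdd.setName("intRDD")
rdd.cache.count

//print the name and in-memory size of every persisted RDD known to the driver
sc.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: ${info.memSize} bytes in memory")
}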

We can see that the RDD takes nearly 4 megabytes of memory, whereas the DataFrame (which is backed by an RDD[Row] object) takes only 1 megabyte. Unfortunately, while wrapping the values into Row objects has advantages for memory consumption, it has disadvantages for execution performance. Therefore, we give it another try using a Dataset:

//define a case class describing the Dataset's content
case class Ints(value: Int)
//create a Dataset out of the DataFrame df
val ds = df.as[Ints]
//..and count it
ds.cache.count

First, we define a case class for the Dataset's content, since Datasets are strongly (and statically) typed. Then we use the existing DataFrame object df to create a Dataset ds out of it, and again we cache and count. Now that we have run three Apache Spark jobs on the three different data types, we can take a look at their execution times:

As we can see, the first job on the RDD took only 500 ms, whereas the second job on the DataFrame took two seconds. This means that saving memory actually caused an increase in execution time.

Fortunately, when using Datasets, this performance loss is mitigated, and we can take advantage of both fast execution times and efficient memory usage.
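The execution times above are taken from the Spark UI, but you can also time the three counts yourself with a simple helper. The following is only a rough sketch; the helper name time is an arbitrary choice, and wall-clock measurements in the shell are indicative at best:

//naive wall-clock timing helper, only meant for rough comparisons
def time[T](label: String)(block: => T): T = {
  val start = System.nanoTime
  val result = block
  println(s"$label took ${(System.nanoTime - start) / 1000000} ms")
  result
}

time("RDD count")(rdd.count)
time("DataFrame count")(df.count)
time("Dataset count")(ds.count)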

So why use DataFrames and Datasets at all if RDDs are faster? We could simply add more memory to the cluster. Note that although this particular operation runs faster on an RDD, Apache Spark jobs are very rarely composed of only a single operation on an RDD. In theory, you could write very efficient Apache Spark jobs using RDDs only, but in practice, rewriting your Apache Spark application for performance tuning takes a lot of time. So the best approach is to use DataFrames and Datasets and let the Catalyst optimizer generate efficient calls to the underlying RDD operations for you.
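To see what the Catalyst optimizer is actually doing for you, you can inspect the plans of a query with explain. Here is a small sketch; the filter is just an arbitrary example query, not part of the original example:

//an arbitrary query on the Dataset
val evens = ds.filter(_.value % 2 == 0)

//print the parsed, analyzed, optimized, and physical plans generated by Catalyst
evens.explain(true)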