Garbage collection tuning

Although garbage collection is not a major concern for Java or Scala programs that read an RDD just once and then run numerous operations on it, Java Virtual Machine (JVM) GC can become problematic and complex when a large number of data objects associated with RDDs accumulate in memory. When the JVM needs to evict obsolete and unused objects to make room for newer ones, it must first identify them and then remove them from memory, and this is a costly operation in terms of both processing time and storage. Note that the cost of GC is roughly proportional to the number of Java objects held in main memory. Therefore, we strongly suggest that you tune your data structures and keep the number of objects stored in memory to a minimum.
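
To make this concrete, here is a minimal, hypothetical Scala sketch (not from the original text) contrasting a collection of boxed objects with a primitive array; the latter gives the garbage collector far fewer objects to track:

// A boxed collection allocates one wrapper object per element,
// so the GC must trace a million java.lang.Integer instances:
val boxed: List[java.lang.Integer] = (1 to 1000000).map(Int.box).toList

// A primitive array is a single contiguous allocation, giving the
// GC just one object to track for the same logical data:
val primitive: Array[Int] = Array.tabulate(1000000)(identity)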

The first step in GC tuning is to collect statistics on how frequently the JVM performs garbage collection on your machine, along with the amount of time the JVM spends on GC on your machine or computing nodes. This can be achieved by adding -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps to the Java options in your IDE, such as Eclipse, in the JVM startup arguments, and specifying a name and location for your GC log file, as follows:

Figure 27: Setting GC verbose on Eclipse
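
For reference, the full set of JVM arguments would look something like the following on a Java 8-style HotSpot JVM; the log file path is only an illustrative choice:

-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/tmp/gc.log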

Alternatively, you can specify -verbose:gc while submitting your Spark jobs using the spark-submit script, as follows:

--conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"

In short, when specifying GC options for Spark, you must decide whether you want them applied to the executors or to the driver. When you submit your job, specify --driver-java-options "-XX:+PrintFlagsFinal -verbose:gc" and so on for the driver. For the executors, specify --conf "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -verbose:gc" and so on.
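
Putting this together, a complete spark-submit invocation might look like the following; the application class and JAR names are placeholders:

spark-submit \
  --class com.example.MyApp \
  --driver-java-options "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  myapp.jar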

Now, when your Spark job is executed, you will be able to see the logs and messages printed on the worker nodes at /var/log/logs each time a GC occurs. The downside of this approach is that these logs end up not on your driver machine but on your cluster's worker nodes.

It is to be noted that -verbose:gc only prints a message or log entry after each GC event, along with details about memory. However, if you are looking for more critical issues, such as a memory leak, -verbose:gc may not be enough. In that case, you can use visualization tools, such as jhat and VisualVM. A better way of GC tuning in your Spark application is described at https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html.
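
For instance, a heap dump can be captured and inspected roughly as follows; the process ID and file name are placeholders, and note that jhat was removed in JDK 9, so on newer JDKs you would use VisualVM instead:

jmap -dump:format=b,file=heap.hprof <pid>   # capture a heap dump of the running JVM
jhat heap.hprof                             # browse the dump via a local web UI (port 7000 by default)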