To measure the impact of garbage collection, you can ask the JVM to print details about each collection. You can do this by adding -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps to the SPARK_JAVA_OPTS environment variable in conf/spark-env.sh. The details will then be printed to standard output when you run your job, and you can find that output as described in the Where to find logs? section of this chapter.
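As a sketch, the relevant lines in conf/spark-env.sh might look like the following (the exact set of GC flags is up to you; any flags already in SPARK_JAVA_OPTS are preserved here):

```shell
# conf/spark-env.sh
# Ask the JVM to log details and a timestamp for every GC event.
SPARK_JAVA_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps $SPARK_JAVA_OPTS"
export SPARK_JAVA_OPTS
```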
If you find that your Spark cluster is spending too much time on garbage collection, you can reduce the amount of space used for RDD caching by lowering spark.storage.memoryFraction, which is set to 0.66 by default. If you are planning to run Spark for a long time on a cluster, you may also wish to enable spark.cleaner.ttl. By default, Spark does not clean up any metadata; set spark.cleaner.ttl to a nonzero value (in seconds) to clean up metadata older than that age.
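As a sketch, both properties can be set as Java system properties before the SparkContext is created (the master URL, application name, and chosen values below are placeholders, and the SparkContext package name varies between Spark versions):

```scala
import spark.SparkContext // package name varies by Spark version

// Sketch: these properties must be set before the SparkContext is created.
System.setProperty("spark.storage.memoryFraction", "0.5") // shrink the RDD cache
System.setProperty("spark.cleaner.ttl", "3600")           // drop metadata older than one hour
val sc = new SparkContext("local", "GcTuningExample")
```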
You can also control the RDD storage level if you find that you are using too much memory. If your RDDs don't fit in memory and you still wish to cache them, you can try a different storage level, such as MEMORY_AND_DISK, which spills partitions that don't fit in memory to disk, or DISK_ONLY, which stores the RDD only on disk.
These options are set when you call the persist function on your RDD. By default, RDDs are stored in a deserialized form, which avoids the cost of parsing them on each access. You can save space by choosing a storage level with the _SER suffix; in this case, Spark serializes the data before storing it, which normally saves some space at the cost of extra CPU time.
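For example, persisting an RDD in serialized form might look like the following sketch (the package names vary between Spark versions, and the master URL, application name, and sample RDD here are placeholders):

```scala
import spark.SparkContext
import spark.storage.StorageLevel

val sc = new SparkContext("local", "StorageLevelExample")
val rdd = sc.parallelize(1 to 1000)
// Store the RDD serialized in memory: saves space, costs some CPU.
rdd.persist(StorageLevel.MEMORY_ONLY_SER)
rdd.count() // materializes (and therefore caches) the RDD
```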