Memory usage and management

Memory usage by your Spark application and the underlying compute nodes can be categorized as execution and storage. Execution memory is used during computation in merges, shuffles, joins, sorts, and aggregations, whereas storage memory is used for caching and for propagating internal data across the cluster. In short, storage memory matters because of the large amount of I/O that would otherwise cross the network.

Technically, Spark caches such data locally. When working with Spark iteratively or interactively, caching and persistence are the optimization techniques to reach for. Both save interim partial results so that they can be reused in subsequent stages. These interim results (as RDDs) can be kept in memory (the default) or in more durable storage such as disk, and/or replicated. An RDD can be cached using the cache operation or persisted using the persist operation; the difference between the two is purely syntactic. cache is a synonym of persist(MEMORY_ONLY), that is, cache is simply persist with the default storage level MEMORY_ONLY, as the sketch below illustrates.
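The following minimal Scala sketch shows the two calls side by side. The session settings, the input path data/sample.txt, and the word-splitting step are assumptions made for illustration only:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    // Minimal sketch: the session settings and the input path are illustrative only
    val spark = SparkSession.builder
      .appName("CacheVsPersist")
      .master("local[*]")
      .getOrCreate()

    val lines = spark.sparkContext.textFile("data/sample.txt") // hypothetical input file
    val words = lines.flatMap(_.split("\\s+"))

    lines.cache()                                // shorthand for persist(StorageLevel.MEMORY_ONLY)
    words.persist(StorageLevel.MEMORY_AND_DISK)  // explicitly chosen storage level

    println(lines.count())   // the first action materializes the cached partitions
    println(words.count())   // later actions reuse the persisted partitions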

If you go to the Storage tab in the Spark web UI, you can observe the memory/storage used by each RDD, DataFrame, or Dataset object, as shown in Figure 10. Although there are two configurations relevant to tuning memory in Spark, most users do not need to adjust them, because the default values are suitable for most requirements and workloads.

spark.memory.fraction is the size of the unified region (execution plus storage) as a fraction of (JVM heap space - 300 MB); the default is 0.6. The remaining 40% is reserved for user data structures, internal metadata in Spark, and safeguarding against OOM errors in the case of sparse and unusually large records. spark.memory.storageFraction, on the other hand, expresses the size of the storage region R as a fraction of the unified region; the default is 0.5, meaning that half of the unified region is protected for cached blocks. Note that the 300 MB above is memory reserved by Spark itself and is not part of either fraction. Both parameters can be set at application startup, as sketched below.
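A hedged Scala sketch of setting these values follows; the values shown simply restate the defaults for illustration (they are not a tuning recommendation), and the application name and master are assumptions:

    import org.apache.spark.sql.SparkSession

    // Sketch only: the values below restate the defaults, purely for illustration
    val spark = SparkSession.builder
      .appName("MemoryTuning")
      .master("local[*]")
      .config("spark.memory.fraction", "0.6")          // unified region as a fraction of (heap - 300 MB)
      .config("spark.memory.storageFraction", "0.5")   // share of the unified region protected for storage
      .getOrCreate()

The same settings can also be passed at submit time with --conf, since they must be in place before the executors start.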

A more detailed discussion on memory usage and storage is given in Chapter 15, Text Analytics Using Spark ML.

Now, one question might arise in your mind: which storage level should you choose? The Spark storage levels provide different trade-offs between memory usage and CPU efficiency. If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.

If your RDDs do not fit in main memory, that is, if MEMORY_ONLY does not work out, try MEMORY_ONLY_SER instead. It is strongly recommended not to spill your RDDs to disk unless the UDFs (the user-defined functions you wrote to process your dataset) are expensive, or they filter a large amount of the data during the execution stages; otherwise, recomputing a partition may be as fast as reading it from disk. Finally, if you want fast fault recovery, use the replicated storage levels. The sketch after this paragraph walks through these choices.
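The following Scala sketch illustrates that progression. The session setup and the sample data are assumptions, and because only one storage level can be assigned to an RDD at a time, the alternative choices are shown commented out:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    // Illustrative sketch: the session and the sample data are assumptions
    val spark = SparkSession.builder.appName("StorageLevelChoice").master("local[*]").getOrCreate()
    val records = spark.sparkContext.parallelize(1 to 1000000)

    // 1. Fits comfortably in memory: keep the default, which is what cache() gives you
    records.persist(StorageLevel.MEMORY_ONLY)

    // 2. Too large as deserialized objects: use the serialized in-memory level instead
    // records.persist(StorageLevel.MEMORY_ONLY_SER)

    // 3. Fast fault recovery required: use a replicated level (2 copies across nodes)
    // records.persist(StorageLevel.MEMORY_ONLY_2)

    println(records.count())  // materializes the persisted partitions
    records.unpersist()       // release the blocks when they are no longer needed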

In summary, the following StorageLevels are available and supported in Spark 2.x (the suffix _2 in a name denotes 2 replicas):

  • DISK_ONLY: Stores the RDD partitions only on disk
  • DISK_ONLY_2: Same as DISK_ONLY, but with 2 replicas
  • MEMORY_ONLY: The default level used by the cache operation; stores the RDD as deserialized objects in memory
  • MEMORY_ONLY_2: Same as MEMORY_ONLY, but with 2 replicas
  • MEMORY_ONLY_SER: Stores the RDD as serialized objects in memory; helpful when the RDD does not fit in main memory with MEMORY_ONLY
  • MEMORY_ONLY_SER_2: Same as MEMORY_ONLY_SER, but with 2 replicas
  • MEMORY_AND_DISK: Keeps the RDD in memory and spills the partitions that do not fit to disk
  • MEMORY_AND_DISK_2: Same as MEMORY_AND_DISK, but with 2 replicas
  • MEMORY_AND_DISK_SER: Same as MEMORY_AND_DISK, but keeps the in-memory partitions in serialized form
  • MEMORY_AND_DISK_SER_2: Same as MEMORY_AND_DISK_SER, but with 2 replicas
  • OFF_HEAP: Stores the serialized RDD in off-heap memory rather than in the Java heap
Note that cache is a synonym of persist(MEMORY_ONLY); that is, cache is simply persist with the default storage level MEMORY_ONLY. Detailed information can be found at https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-StorageLevel.html.
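As a final hedged sketch, OFF_HEAP only takes effect when off-heap memory is explicitly enabled and sized. The application name, master, data, and the 1 GB size below are assumptions for illustration:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    // Sketch: off-heap storage requires off-heap memory to be enabled and sized
    val spark = SparkSession.builder
      .appName("OffHeapStorage")
      .master("local[*]")
      .config("spark.memory.offHeap.enabled", "true")
      .config("spark.memory.offHeap.size", "1073741824")  // illustrative size: 1 GB, in bytes
      .getOrCreate()

    val rdd = spark.sparkContext.parallelize(1 to 100000)
    rdd.persist(StorageLevel.OFF_HEAP)
    println(rdd.count())          // materializes the blocks in off-heap memory
    println(rdd.getStorageLevel)  // confirms the storage level assigned to the RDD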