Memory tuning

In this section, we will discuss some advanced strategies for making efficient use of memory when executing your Spark jobs. More specifically, we will show how to calculate the memory usage of your objects, and we will suggest ways to improve it by optimizing your data structures or by storing your data objects in a serialized format using the Kryo or Java serializer. Finally, we will look at how to tune Spark's Java heap size, cache size, and the Java garbage collector.

There are three considerations in tuning memory usage:

  • The amount of memory used by your objects: you may even want your entire dataset to fit in memory (a quick way to estimate this is sketched after the list)
  • The cost of accessing those objects
  • The overhead of garbage collection: this becomes significant if you have a high turnover of objects
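To measure the first of these, Spark ships a small utility, org.apache.spark.util.SizeEstimator, whose estimate() method walks an object graph and returns its approximate footprint on the JVM heap. The following sketch assumes Spark's core jar is on the classpath; the Purchase record and the object name are illustrative stand-ins for your own types:

    import org.apache.spark.util.SizeEstimator

    // Hypothetical record type, standing in for whatever your dataset holds.
    case class Purchase(id: Long, item: String, price: Double)

    object EstimateFootprint {
      def main(args: Array[String]): Unit = {
        val record = Purchase(42L, "espresso machine", 199.99)

        // Approximate deep size of one record on the JVM heap, in bytes.
        val bytesPerRecord = SizeEstimator.estimate(record)
        println(s"~$bytesPerRecord bytes per record")

        // Rough lower bound for caching 10 million such records, before
        // any overhead from the collection that holds them.
        val gib = bytesPerRecord * 10000000L / math.pow(1024, 3)
        println(f"~$gib%.2f GiB for 10 million records")
      }
    }

Alternatively, you can create an RDD, cache it, and read its in-memory size off the Storage tab of the Spark web UI.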

Although Java objects are fast to access, they can easily consume two to five times more space than the raw data in their fields. For example, each distinct Java object carries an object header of about 16 bytes, and a Java String adds almost 40 bytes of overhead on top of the raw characters. Furthermore, Java collection classes such as ArrayList, Vector, LinkedList, PriorityQueue, HashSet, LinkedHashSet, and TreeSet are commonly used to hold data, and the linked data structures among them occupy a great deal of extra space because there is a wrapper object for each entry in the structure. Finally, collections of primitive types frequently store them in memory as boxed objects, such as java.lang.Double and java.lang.Integer.
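These overheads are easy to see with the same SizeEstimator utility. The sketch below (the object name is again illustrative) compares one million int values stored in a primitive Array[Int] against the same values in a java.util.LinkedList of boxed Integers; on a typical 64-bit JVM the linked, boxed variant comes out several times larger, because every entry pays for a list node, its references, and a java.lang.Integer wrapper:

    import java.util.LinkedList

    import org.apache.spark.util.SizeEstimator

    object BoxingOverhead {
      def main(args: Array[String]): Unit = {
        val n = 1000000

        // One million ints as a primitive array: about 4 bytes per value
        // plus a single array header.
        val primitive: Array[Int] = Array.tabulate(n)(identity)

        // The same values in a linked list of boxed Integers: every entry
        // carries a list node, its references, and an Integer wrapper.
        val boxed = new LinkedList[Integer]()
        (0 until n).foreach(i => boxed.add(Int.box(i)))

        println(s"Array[Int]:          ${SizeEstimator.estimate(primitive)} bytes")
        println(s"LinkedList[Integer]: ${SizeEstimator.estimate(boxed)} bytes")
      }
    }

This is why, when a dataset has to be cached, it usually pays to prefer arrays of primitives and plain fields over linked or boxed collection types.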
