Memory tuning

In this section, we will discuss some advanced strategies for making efficient use of memory when executing your Spark jobs. More specifically, we will show how to calculate the memory usage of your objects, and we will suggest ways to improve it by optimizing your data structures or by storing your data objects in a serialized format using the Kryo or Java serializer. Finally, we will look at how to tune Spark's Java heap size, cache size, and the Java garbage collector.

There are three considerations in tuning memory usage:

  • The amount of memory used by your objects: you may even want your entire dataset to fit in memory (a quick way to estimate this is sketched after the list)
  • The cost of accessing those objects
  • The overhead of garbage collection: this becomes significant if you have a high turnover of objects
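To measure the first of these, Spark ships a small utility, org.apache.spark.util.SizeEstimator, whose estimate() method walks an object graph and returns its approximate footprint on the JVM heap. The following sketch assumes Spark's core jar is on the classpath; the Purchase record and the object name are illustrative stand-ins for your own types:

    import org.apache.spark.util.SizeEstimator

    // Hypothetical record type, standing in for whatever your dataset holds.
    case class Purchase(id: Long, item: String, price: Double)

    object EstimateFootprint {
      def main(args: Array[String]): Unit = {
        val record = Purchase(42L, "espresso machine", 199.99)

        // Approximate deep size of one record on the JVM heap, in bytes.
        val bytesPerRecord = SizeEstimator.estimate(record)
        println(s"~$bytesPerRecord bytes per record")

        // Rough lower bound for caching 10 million such records, before
        // any overhead from the collection that holds them.
        val gib = bytesPerRecord * 10000000L / math.pow(1024, 3)
        println(f"~$gib%.2f GiB for 10 million records")
      }
    }

Alternatively, you can create an RDD, cache it, and read its in-memory size off the Storage tab of the Spark web UI.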

Although Java objects are fast to access, they can easily consume two to five times more space than the raw data in their fields. For example, each distinct Java object carries an object header of about 16 bytes, and a Java String adds almost 40 bytes of overhead on top of the raw characters. Furthermore, Java collection classes such as ArrayList, Vector, LinkedList, PriorityQueue, HashSet, LinkedHashSet, and TreeSet are commonly used to hold data, and the linked data structures among them occupy a great deal of extra space because there is a wrapper object for each entry in the structure. Finally, collections of primitive types frequently store them in memory as boxed objects, such as java.lang.Double and java.lang.Integer.
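These overheads are easy to see with the same SizeEstimator utility. The sketch below (the object name is again illustrative) compares one million int values stored in a primitive Array[Int] against the same values in a java.util.LinkedList of boxed Integers; on a typical 64-bit JVM the linked, boxed variant comes out several times larger, because every entry pays for a list node, its references, and a java.lang.Integer wrapper:

    import java.util.LinkedList

    import org.apache.spark.util.SizeEstimator

    object BoxingOverhead {
      def main(args: Array[String]): Unit = {
        val n = 1000000

        // One million ints as a primitive array: about 4 bytes per value
        // plus a single array header.
        val primitive: Array[Int] = Array.tabulate(n)(identity)

        // The same values in a linked list of boxed Integers: every entry
        // carries a list node, its references, and an Integer wrapper.
        val boxed = new LinkedList[Integer]()
        (0 until n).foreach(i => boxed.add(Int.box(i)))

        println(s"Array[Int]:          ${SizeEstimator.estimate(primitive)} bytes")
        println(s"LinkedList[Integer]: ${SizeEstimator.estimate(boxed)} bytes")
      }
    }

This is why, when a dataset has to be cached, it usually pays to prefer arrays of primitives and plain fields over linked or boxed collection types.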
