Serialized RDD storage

As discussed earlier, when your objects are still too large to fit efficiently in main memory or on disk despite other memory-tuning efforts, a simpler and more effective way to reduce memory usage is to store them in serialized form.

This can be done using the serialized storage levels of the RDD persistence API, such as MEMORY_ONLY_SER. For more information, refer to the previous section on memory management and explore the available options.

If you specify MEMORY_ONLY_SER, Spark stores each RDD partition as a single large byte array. The downside of this approach is that it slows down data access; this is unavoidable, since each object must be deserialized on the fly every time it is reused.
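As a minimal sketch of serialized persistence (assuming a local SparkSession and a synthetic RDD of integers, both chosen here purely for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object SerializedStorageExample {
  def main(args: Array[String]): Unit = {
    // Local session for demonstration only
    val spark = SparkSession.builder
      .appName("SerializedStorageExample")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical dataset; substitute your own RDD
    val numbers = spark.sparkContext.parallelize(1 to 1000000)

    // Persist each partition as one serialized byte array.
    // More space-efficient than MEMORY_ONLY, at the cost of
    // deserializing objects each time they are accessed.
    numbers.persist(StorageLevel.MEMORY_ONLY_SER)

    println(numbers.count())
    spark.stop()
  }
}
```

Compared with the default MEMORY_ONLY level, this trades CPU time (deserialization on every access) for a smaller memory footprint.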

As discussed previously, we highly recommend using Kryo serialization instead of Java serialization to make data access faster.
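A sketch of enabling Kryo when building the Spark configuration follows; the Sensor case class is a hypothetical stand-in for your own application types:

```scala
import org.apache.spark.SparkConf

// Hypothetical domain class; register your own types instead
case class Sensor(id: String, value: Double)

val conf = new SparkConf()
  .setAppName("KryoExample")
  // Replace the default Java serializer with Kryo
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering classes lets Kryo write compact numeric IDs
  // instead of fully qualified class names
  .registerKryoClasses(Array(classOf[Sensor]))
```

Registering the classes you serialize most often is optional but recommended, as it further shrinks the serialized representation.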