Lifecycle of a broadcast variable

Broadcast variables are created using SparkContext, and their lifecycle is managed by BroadcastManager and ContextCleaner. When a broadcast variable is created, its value is shipped to each executor's memory so that tasks can read it locally.
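
As an illustration, here is a minimal sketch of creating and reading a broadcast variable through the Java API (the application name, master URL, and lookup map are only example values chosen for this sketch):

import java.util.HashMap;
import java.util.Map;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

SparkConf conf = new SparkConf().setAppName("BroadcastLifecycle").setMaster("local[*]");
JavaSparkContext jsc = new JavaSparkContext(conf);

Map<String, String> countryCodes = new HashMap<>();
countryCodes.put("IN", "India");
countryCodes.put("US", "United States");

// broadcast() creates the broadcast variable; value() reads it back
Broadcast<Map<String, String>> broadcastVar = jsc.broadcast(countryCodes);
System.out.println(broadcastVar.value().get("IN")); // prints India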

BroadcastManager is a service that is initialized as soon as SparkContext is created. The BroadcastManager service initializes a TorrentBroadcastFactory (https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-TorrentBroadcastFactory.html) object, which is used to create broadcast variables. BroadcastManager keeps track of all the broadcast variables created using SparkContext.

Once a broadcast variable is no longer needed, it should be removed from the executors' memory. There are two ways to do this:

  • unpersist(): The user can unpersist a broadcast variable when it is no longer needed. It can be done as follows:
broadcastVar.unpersist();

The unpersist() method removes the copies of the broadcast variable from the executors' memory asynchronously. However, this method also provides an overload with which the user can run unpersist synchronously:

public void unpersist(boolean blocking)

Therefore, unpersist can be done synchronously as follows:

broadcastVar.unpersist(true);

The unpersist() method removes the broadcast variable only from the executors' memory; it does not remove it from the driver's memory. If the broadcast variable is used again in the program after unpersist() is called, it is re-sent to the executors, as the sketch below illustrates.
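
Continuing with the jsc and broadcastVar from the earlier sketch, the following snippet unpersists the variable and then reuses it in a task, which causes its value to be shipped to the executors again:

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;

JavaRDD<String> codes = jsc.parallelize(Arrays.asList("IN", "US"));

// Remove the copies from the executors' memory; the driver still holds the value
broadcastVar.unpersist();

// Reusing the broadcast variable is still valid; its value is re-sent to the
// executors when the tasks below read it
System.out.println(codes.map(code -> broadcastVar.value().get(code)).collect());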

  • destroy(): The destroy() method asynchronously removes all the data and metadata related to the broadcast variable from both the executors and the driver. Once the broadcast variable has been destroyed, it cannot be used again; any attempt to do so throws an exception, as demonstrated in the sketch after this list. It can be done as follows:
broadcastVar.destroy();

Similar to the unpersist() method, the destroy() method also provides an overload, so it can be called synchronously as follows:

broadcastVar.destroy(true);

It is recommended to unpersist a broadcast variable before destroying it, so that its copies are first removed from the executors' memory asynchronously, without blocking the execution of the Spark program.
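
A short sketch (again reusing broadcastVar from the earlier examples) of unpersisting, destroying, and then attempting to reuse a broadcast variable; the exact exception type and message may vary between Spark versions:

broadcastVar.unpersist();   // remove the executor-side copies first (asynchronous)
broadcastVar.destroy();     // then release everything, including the driver-side copy

try {
    broadcastVar.value();   // any further use of the destroyed variable is invalid
} catch (Exception e) {
    System.out.println("Broadcast variable can no longer be used: " + e.getMessage());
}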

Even if destroy() is never called explicitly, once a broadcast variable goes out of scope and is garbage collected, all the data and metadata related to it are cleaned up by ContextCleaner.

ContextCleaner is a Spark service that is responsible for cleaning up RDDs, broadcast variables, accumulators, and so on. It is also created along with SparkContext. It maintains weak references to RDDs, broadcast variables, accumulators, and so on, and when such an object is garbage collected, it cleans up the associated data and metadata using a daemon thread called the Spark Context Cleaner thread.
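
ContextCleaner's behaviour can be tuned through Spark configuration properties; the following sketch shows a few of them (the values are only examples, and the defaults may differ between Spark versions):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf()
        .setAppName("CleanerTuning")
        .setMaster("local[*]")
        // enable or disable ContextCleaner's reference tracking
        .set("spark.cleaner.referenceTracking", "true")
        // whether cleanup calls block the cleaner thread
        .set("spark.cleaner.referenceTracking.blocking", "true")
        // how often the driver triggers a GC to detect unreachable objects
        .set("spark.cleaner.periodicGC.interval", "30min");
JavaSparkContext cleanerTunedContext = new JavaSparkContext(conf);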
