Data checkpointing

Data checkpointing saves the actual RDDs to HDFS so that, if the Streaming application fails, it can recover the checkpointed RDDs and continue from where it left off. Recovery of the streaming application is one good use case for data checkpointing. Checkpointing also improves performance when RDDs are lost because of cache cleanup or the loss of an executor: the lost RDDs can be restored from the checkpoint without waiting for all the parent RDDs in the lineage (DAG) to be recomputed.

Checkpointing must be enabled for applications with any of the following requirements:

  • Usage of stateful transformations: If either updateStateByKey or reduceByKeyAndWindow (with an inverse function) is used in the application, a checkpoint directory must be provided to allow periodic RDD checkpointing.
  • Recovery from failures of the driver running the application: Metadata checkpoints are used to recover the application's progress information.
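
The two requirements above can be sketched together in PySpark. This is a minimal illustration, not a definitive implementation: the checkpoint path, application name, socket source, and 1-second batch interval are all assumptions made for the example. The state update function is kept as a plain Python function so it can be read and tried on its own.

```python
CHECKPOINT_DIR = "hdfs:///tmp/streaming-checkpoint"  # hypothetical path


def update_count(new_values, running_count):
    """State update for updateStateByKey: add this batch's values to the total.

    running_count is None the first time a key is seen.
    """
    return sum(new_values) + (running_count or 0)


def create_context(checkpoint_dir=CHECKPOINT_DIR):
    """Build a fresh StreamingContext; only called when no checkpoint exists."""
    # Imported here so update_count can be used without a Spark installation.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="CheckpointedWordCount")  # hypothetical name
    ssc = StreamingContext(sc, 1)  # 1-second batch interval (assumed)

    # Stateful transformations need a checkpoint directory.
    ssc.checkpoint(checkpoint_dir)

    lines = ssc.socketTextStream("localhost", 9999)  # hypothetical source
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .updateStateByKey(update_count))
    counts.pprint()
    return ssc


def main():
    from pyspark.streaming import StreamingContext

    # Rebuild the context from checkpoint metadata after a driver failure,
    # or create a new one on the first run.
    ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
    ssc.start()
    ssc.awaitTermination()
```

Calling main() from the driver program starts the application. On a restart after a driver failure, getOrCreate reads the checkpoint directory instead of calling create_context again.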

If your streaming application does not use stateful transformations, it can be run without enabling checkpointing. In that case, recovery from driver failures is only partial: data that has been received but not yet processed may be lost.

Note that checkpointing of RDDs incurs the cost of saving each RDD to storage. This may increase the processing time of the batches in which RDDs get checkpointed. Hence, the checkpoint interval needs to be set carefully so as not to cause performance issues. With small batch intervals (say, 1 second), checkpointing every batch may significantly reduce operation throughput. Conversely, checkpointing too infrequently causes the lineage and task sizes to grow, which may cause processing delays because the amount of data to be persisted is large.

For stateful transformations that require RDD checkpointing, the default interval is a multiple of the batch interval that is at least 10 seconds.
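
The default described above can be worked out with a few lines of arithmetic. The sketch below assumes that the smallest multiple of the batch interval that reaches 10 seconds is the one chosen; the function name is hypothetical, and intervals are expressed in whole milliseconds to avoid floating-point rounding.

```python
MIN_CHECKPOINT_MS = 10_000  # the 10-second floor stated above


def default_checkpoint_interval_ms(batch_interval_ms):
    """Smallest multiple of the batch interval that is at least 10 seconds."""
    if batch_interval_ms <= 0:
        raise ValueError("batch interval must be positive")
    # Ceiling division: how many batch intervals are needed to reach 10 s.
    multiples = -(-MIN_CHECKPOINT_MS // batch_interval_ms)
    return multiples * batch_interval_ms


for batch_ms in (500, 1_000, 3_000, 15_000):
    print(batch_ms, "->", default_checkpoint_interval_ms(batch_ms))
# 500 -> 10000
# 1000 -> 10000
# 3000 -> 12000
# 15000 -> 15000
```

A 3-second batch interval, for example, gives 12 seconds rather than 10, because 10 is not a multiple of 3.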

A checkpoint interval of 5 to 10 times the sliding interval of a DStream is a good setting to start with.
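
As a rough aid for picking a starting value, the suggested range can be computed from the sliding interval. The helper name below is hypothetical, and intervals are in seconds.

```python
def suggested_checkpoint_range(slide_interval_s):
    """Return (low, high) checkpoint intervals: 5x and 10x the sliding interval."""
    if slide_interval_s <= 0:
        raise ValueError("sliding interval must be positive")
    return 5 * slide_interval_s, 10 * slide_interval_s


low, high = suggested_checkpoint_range(2)
print(f"start with a checkpoint interval between {low}s and {high}s")
# start with a checkpoint interval between 10s and 20s
```

In PySpark, the chosen value would then be passed to the checkpoint(interval) method of the DStream, which takes the interval in seconds.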