Output stage

By default, output operations guarantee at-least-once processing of streaming messages. For operations that write their output to disk, this is sufficient: even if the data is written more than once, it only overwrites the data already on disk, which gives the impression that the data was processed exactly once. Care should be taken with operations that iterate over each element of a DStream, such as foreachRDD, because after a failure the same data can be traversed again and there is no automatic mechanism to deal with the repetition. There are a couple of guidelines that one can follow to achieve exactly-once semantics:

  • Idempotent operations: The simplest way to ensure exactly-once semantics in an output operation is to make the operation idempotent. This is the case with the saveAs***Files operations: the same data may be written to files multiple times, with each write overwriting the output of previous failed attempts.
  • Transactional updates: If a unique identifier can be derived for each micro-batch, the subsequent operation can be made transactional by checking whether the identifier has already been seen and, if it has, deciding how to handle the repeated data. The choice of identifier is left to the design of the streaming code; however, one popular approach is to combine the batch time available in foreachRDD with the partition index of the RDD.
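The transactional approach can be illustrated with a small sketch that runs without a Spark cluster. It is only a simulation under stated assumptions: TransactionalSink, commit, and write_batch are hypothetical names invented for this example, the sink is an in-memory dictionary standing in for an external store that supports atomic commits, and write_batch merely mimics what a foreachRDD body would do for each partition of a micro-batch.

```python
from typing import Dict, List, Tuple

# Identifier for one unit of output: (batch time in ms, partition index).
BatchId = Tuple[int, int]


class TransactionalSink:
    """Hypothetical in-memory stand-in for an external transactional store."""

    def __init__(self) -> None:
        self.committed: Dict[BatchId, List[str]] = {}

    def commit(self, batch_time_ms: int, partition_index: int,
               records: List[str]) -> bool:
        """Store the records unless this identifier was already committed."""
        batch_id = (batch_time_ms, partition_index)
        if batch_id in self.committed:
            # The identifier repeats, so this is a replay after a failure: skip it.
            return False
        # A real store would write the identifier and the records atomically.
        self.committed[batch_id] = list(records)
        return True

    def all_records(self) -> List[str]:
        """Return every stored record, ordered by identifier."""
        return [r for bid in sorted(self.committed) for r in self.committed[bid]]


def write_batch(sink: TransactionalSink, batch_time_ms: int,
                partitions: List[List[str]]) -> List[bool]:
    """Mimic a foreachRDD body: one commit attempt per partition."""
    return [sink.commit(batch_time_ms, index, records)
            for index, records in enumerate(partitions)]


sink = TransactionalSink()
batch = [["a", "b"], ["c"]]               # one micro-batch with two partitions
first = write_batch(sink, 1000, batch)    # initial attempt
replay = write_batch(sink, 1000, batch)   # same batch replayed after a failure
```

In this sketch the replayed batch is detected by its identifier and skipped, so the sink holds each record once even though the batch was delivered twice. In a real job the identifier check and the data write would need to happen in a single transaction against the target store.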