Fault tolerance semantics

Delivering end-to-end exactly-once semantics was one of the key goals behind the design of Structured Streaming. To achieve it, the sources, the output sinks, and the execution engine are all designed to track the exact progress of the processing reliably, so that any kind of failure can be handled by restarting and/or reprocessing. Every streaming source is assumed to have offsets (similar to Kafka offsets) that track the read position in the stream. The engine uses checkpointing and write-ahead logs to record the offset range of the data being processed in each trigger. The streaming sinks are designed to be idempotent, so that reprocessing the same data does not produce duplicate output. Together, replayable sources and idempotent sinks allow Structured Streaming to ensure end-to-end exactly-once semantics under any failure.
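
The interplay between these three pieces can be illustrated with a minimal sketch in plain Scala. It does not use the Spark API: ReplayableSource, IdempotentSink, Checkpoint, and Engine are hypothetical names invented for this illustration, and the in-memory maps only stand in for a durable checkpoint directory. The sketch assumes that the engine records the offset range of a batch before processing it, and that the sink overwrites output by batch ID.

```scala
import scala.collection.mutable

object ExactlyOnceSketch {
  // Replayable source: any offset range can be read again after a failure.
  final class ReplayableSource(records: Vector[String]) {
    def latestOffset: Int = records.length
    def read(from: Int, until: Int): Vector[String] = records.slice(from, until)
  }

  // Idempotent sink: output is keyed by batch ID, so rewriting a batch
  // replaces the earlier attempt instead of appending duplicates.
  final class IdempotentSink {
    private val batches = mutable.SortedMap.empty[Long, Vector[String]]
    def write(batchId: Long, rows: Vector[String]): Unit = batches(batchId) = rows
    def contents: Vector[String] = batches.values.flatten.toVector
  }

  // Stand-in for durable checkpoint storage (assumption: it survives restarts).
  final class Checkpoint {
    val offsetLog = mutable.SortedMap.empty[Long, (Int, Int)] // batchId -> [from, until)
    val commitLog = mutable.Set.empty[Long]                   // batches fully written
  }

  final class Engine(source: ReplayableSource, sink: IdempotentSink, cp: Checkpoint) {
    def runTrigger(batchSize: Int, crashAfterWrite: Boolean = false): Unit = {
      // After a restart, a logged but uncommitted batch is replayed first.
      val pending = cp.offsetLog.keys.find(id => !cp.commitLog.contains(id))
      val batchId = pending.getOrElse(cp.offsetLog.lastOption.map(_._1 + 1).getOrElse(0L))
      if (pending.isEmpty) {
        val from = cp.offsetLog.lastOption.map(_._2._2).getOrElse(0)
        val until = math.min(from + batchSize, source.latestOffset)
        cp.offsetLog(batchId) = (from, until) // write-ahead: log the range before processing
      }
      val (from, until) = cp.offsetLog(batchId)
      sink.write(batchId, source.read(from, until))
      if (crashAfterWrite) throw new RuntimeException("simulated crash before commit")
      cp.commitLog += batchId
    }
  }

  def demo(): Vector[String] = {
    val source = new ReplayableSource(Vector("a", "b", "c", "d", "e"))
    val sink = new IdempotentSink
    val cp = new Checkpoint
    new Engine(source, sink, cp).runTrigger(2) // batch 0 is written and committed
    try new Engine(source, sink, cp).runTrigger(2, crashAfterWrite = true) // batch 1 written, not committed
    catch { case _: RuntimeException => () }
    val restarted = new Engine(source, sink, cp) // a new engine over the same checkpoint
    restarted.runTrigger(2) // replays batch 1 with the same offset range
    restarted.runTrigger(2) // batch 2 picks up the remaining record
    sink.contents
  }

  def main(args: Array[String]): Unit = println(demo().mkString(","))
}
```

Although batch 1 is written twice, once before the simulated crash and once after the restart, the sink ends up holding each record exactly once, because the replay reads the same logged offset range and overwrites the same batch ID.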

Remember that achieving exactly-once semantics is more complicated in traditional streaming, where you have to maintain the offsets yourself in some external database or store.

Structured Streaming is still evolving and has several challenges to overcome before it can be widely used. Some of them are as follows:

  • Multiple streaming aggregations are not yet supported on streaming datasets
  • Limiting and taking first N rows is not supported on streaming datasets
  • Distinct operations on streaming datasets are not supported
  • Sorting operations on streaming datasets are supported only after an aggregation step, and only in complete output mode
  • Join operations of any kind between two streaming datasets are not yet supported
  • Only a few types of sinks, such as the file sink and the foreach sink, are supported