How transparent fault tolerance and exactly-once delivery guarantee is achieved

Apache Spark structured streaming supports full crash fault tolerance and exactly-once delivery guarantee without the user taking care of any specific error handling routines. Isn't this amazing? So how is this achieved?

Full crash fault tolerance and exactly-once delivery guarantee are terms of systems theory. Full crash fault tolerance means that you can basically pull the power plug of the whole data center at any point in time, and no data is lost or left in an inconsistent state. Exactly-once delivery guarantee means, even if the same power plug is pulled, it is guaranteed that each tuple- end-to-end from the data source to the data sink - is delivered - only, and exactly, once. Not zero times and also not more than one time. Of course those concepts must also hold in case a single node fails or misbehaves (for example- starts throttling).

First of all, states between individual batches and offset ranges (position in a source stream) are kept in-memory but are backed by a Write Ahead Log (WAL) in a fault-tolerant filesystem such as HDFS. A WAL is basically a log file reflecting the overall stream processing state in a pro-active fashion. This means, before data is transformed though an operator, it is first persistently stored in the WAL in a way it can be recovered after a crash. So, in other words, during the processing of an individual mini batch, the regions of the worker memory, as well as the position offset of the streaming source, are persisted to disk. In case the system fails and has to recover, it can re-request chunks of data from the source. Of course, this is only possible if the source supports these semantics.

Table of Contents for How transparent fault tolerance and exactly-once delivery guarantee is achieved

Create new playlist

Sign In

Sign Up

Table of Contents for
How transparent fault tolerance and exactly-once delivery guarantee is achieved