Apache Spark structured streaming supports full crash fault tolerance and exactly-once delivery guarantee without the user taking care of any specific error handling routines. Isn't this amazing? So how is this achieved?
First of all, states between individual batches and offset ranges (position in a source stream) are kept in-memory but are backed by a Write Ahead Log (WAL) in a fault-tolerant filesystem such as HDFS. A WAL is basically a log file reflecting the overall stream processing state in a pro-active fashion. This means, before data is transformed though an operator, it is first persistently stored in the WAL in a way it can be recovered after a crash. So, in other words, during the processing of an individual mini batch, the regions of the worker memory, as well as the position offset of the streaming source, are persisted to disk. In case the system fails and has to recover, it can re-request chunks of data from the source. Of course, this is only possible if the source supports these semantics.