Fault tolerance and reliability

Streaming jobs are designed to run continuously, and a failure in the job can result in loss of data, loss of state, or both. Making a streaming job fault tolerant is therefore one of the essential goals of writing it in the first place. Every streaming system offers a delivery guarantee, either by design or through specific configuration, that dictates how many times a message can be processed by the system:

  • At most once guarantee: Each record is processed once or not at all. Systems with this guarantee are the least reliable as far as a streaming solution is concerned, because records can be lost.
  • At least once guarantee: The system processes each record at least once, so by design no messages are lost, but a message may be processed multiple times, introducing duplicates. This is better than the previous case, and there are use cases where duplicate data causes no problem or can easily be deduplicated (see the sketch after this list).
  • Exactly once guarantee: Each record is processed exactly once, with an assurance that data is neither lost nor processed multiple times. This is the ideal scenario that every streaming job strives for, either by design or through configuration changes.
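
For the at least once case, the usual remedy is to deduplicate downstream on a unique record identifier. As a minimal sketch, Spark's Structured Streaming API can do this with dropDuplicates; the Kafka source, broker address, topic name, and the eventId/eventTime fields below are assumptions for illustration (the Kafka source also requires the spark-sql-kafka connector on the classpath):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("DedupSketch").getOrCreate()

    // Hypothetical Kafka source; broker address and topic are placeholders
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(key AS STRING) AS eventId", "timestamp AS eventTime")

    // Keep deduplication state only for recent events, then drop
    // redelivered duplicates by their unique identifier
    val deduped = events
      .withWatermark("eventTime", "10 minutes")
      .dropDuplicates("eventId", "eventTime")

Including the watermarked column in dropDuplicates lets Spark expire old deduplication state instead of keeping it forever.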

Now, with these three possible guarantees in mind, let's understand the failure scenarios in a Spark Streaming job and, where possible, how each one can be recovered. At a high level, any Spark Streaming job comprises three stages, and a failure can occur at any of them. For a streaming job to guarantee end to end exactly once processing, each stage of the job must provide the same guarantee too. The stages of a Spark Streaming job are as follows:

  • Receiving the data
  • Transforming the data
  • Pushing out the data
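
Recovery across these stages in the classic DStream API hinges on checkpointing: metadata and state are saved to a reliable store so that a restarted driver can pick up where it left off. A minimal sketch, assuming a hypothetical HDFS checkpoint directory and a socket source purely for illustration:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Placeholder checkpoint location on a fault tolerant file system
    val checkpointDir = "hdfs:///checkpoints/streaming-app"

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("FaultTolerantStream")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)                        // persist metadata and state
      val lines = ssc.socketTextStream("localhost", 9999)  // illustrative source
      lines.count().print()
      ssc
    }

    // On restart, rebuild the context from the checkpoint if one exists;
    // otherwise create it fresh with the function above
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()

On the receiving side, enabling the write ahead log (spark.streaming.receiver.writeAheadLog.enable) additionally persists received data to the checkpoint directory before it is acknowledged, so a receiver failure does not lose buffered records.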
