More on stream life cycle management

Streaming is typically used to build continuous applications. This means that the process runs in the background and, in contrast to batch processing, has no predefined stop time. Therefore, DataFrames and Datasets backed by a streaming source support various methods for stream life cycle management, which are explained as follows:

  • start: This starts the continuous application. The call returns immediately, which means it doesn't block. If this is not what you want, use awaitTermination.
  • stop: This terminates the continuous application.
  • awaitTermination: As mentioned earlier, starting a stream using the start method returns immediately, which means that the call is not blocking. Sometimes you want to wait until the stream is terminated, either by someone else calling stop on it or by an error; this method blocks until that happens.
  • exception: If a stream stopped because of an error, the cause can be read using this method.
  • sourceStatus: This obtains real-time meta information on the streaming source.
  • sinkStatus: This obtains real-time meta information on the streaming sink.
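The life cycle methods above can be sketched as follows. This is a minimal, hypothetical example assuming a SparkSession named spark and a socket source on localhost:9999; note that in recent Spark versions, status information is exposed through query.status and query.lastProgress rather than separate sourceStatus and sinkStatus methods:

```scala
import org.apache.spark.sql.streaming.StreamingQuery

// define a streaming DataFrame backed by a (hypothetical) socket source
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// start returns immediately with a StreamingQuery handle; it does not block
val query: StreamingQuery = lines.writeStream
  .format("console")
  .start()

// block the current thread until someone calls stop or an error occurs
// query.awaitTermination()

// terminate the continuous application
query.stop()

// if the stream stopped because of an error, read the cause
query.exception.foreach(e => println(e.getMessage))
```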

Sinks in Apache Spark streaming are smart in the sense that they support fault tolerance and an end-to-end exactly-once delivery guarantee, as mentioned before. In addition, Apache Spark requires them to support different output modes. Currently, there are three output modes, append, update, and complete, which significantly change the underlying semantics. The following paragraphs contain more details about the different output modes.

Different output modes on sinks: Sinks can be specified to handle output in different ways. This is known as outputMode. The naive choice is an incremental approach, since we are processing incremental data with streaming anyway; this mode is referred to as append. However, there are requirements where data already processed by the sink has to be changed. One example is the late arrival problem, where data belonging to a certain time window arrives after that window has already been processed, which can change the result once the computation for that particular time window is redone. A mode that rewrites the whole result table to accommodate such changes is called complete.

Since Version 2.1 of Apache Spark, an update mode has been available that behaves similarly to the complete mode but only changes rows that have been altered, thereby saving processing resources and improving speed. Not all output modes are supported by all query types. As this is constantly changing, it is best to refer to the latest documentation at http://spark.apache.org/docs/latest/streaming-programming-guide.html.
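The output mode is set on the sink via writeStream. The following sketch, assuming a SparkSession named spark and a hypothetical socket source, shows a word-count aggregation where complete mode rewrites the full result table on every trigger:

```scala
import spark.implicits._

// streaming lines from a (hypothetical) socket source
val words = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
  .as[String]
  .flatMap(_.split(" "))

// a running aggregation over the stream
val counts = words.groupBy("value").count()

// complete: the sink receives the entire updated result table each trigger;
// update would emit only the changed rows, while append (the default) is not
// supported for this aggregation without a watermark
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
```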