Streaming transformations

As seen in the previous word count example, words from each line were counted once per batch, and the counter was reset for the next set of records. But what if we want to add the count from the previous state to the new set of words arriving in the following batch? Can we do that, and how? In Spark Streaming there are two kinds of transformations, stateful and stateless. If we want to preserve the previous state, we have to opt for a stateful transformation rather than the stateless transformation used in the previous example.
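As a first taste of a stateful transformation, the word count can be extended with updateStateByKey so that counts accumulate across batches instead of resetting. The following is a minimal sketch; the socket host/port, batch interval, and checkpoint path are illustrative assumptions, not fixed by the text:

import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.Optional;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class StatefulWordCount {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("StatefulWordCount");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));
        // Checkpointing is mandatory for stateful transformations such as updateStateByKey.
        jssc.checkpoint("/tmp/spark-checkpoint"); // assumed path, for illustration only

        // Update function: add this batch's counts for a word to the running
        // count carried in state from all previous batches.
        Function2<List<Integer>, Optional<Integer>, Optional<Integer>> updateFunction =
            (newValues, state) -> {
                int sum = state.orElse(0);
                for (Integer value : newValues) {
                    sum += value;
                }
                return Optional.of(sum);
            };

        JavaPairDStream<String, Integer> runningCounts = jssc
            .socketTextStream("localhost", 9999)
            .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
            .mapToPair(word -> new Tuple2<>(word, 1))
            .updateStateByKey(updateFunction);

        runningCounts.print();
        jssc.start();
        jssc.awaitTermination();
    }
}

With this, a word seen in an earlier batch keeps its total, and every new occurrence is added on top of it.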

Stream processing can be stateless or stateful, depending on the requirement: some stream processing problems require maintaining state over a period of time, while others do not.

Consider an airline company that wants to process data consisting of the temperature readings of all active flights in real time. If the airline just wants to print or store each reading as it is received from a flight, then no state needs to be maintained, and this is considered stateless stream processing. In another scenario, every 10 minutes the airline wants to find the average of the temperature readings of the last one hour for every active flight. In this case, a window of one hour can be defined with a sliding interval of 10 minutes. Since state needs to be maintained within that window time frame, this is considered stateful stream processing.
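A minimal sketch of such a windowed average is shown below; it assumes flightTemps is a JavaPairDStream<String, Double> of (flightId, temperature) pairs already parsed from the incoming stream:

// Pair each reading with a count of 1, then sum readings and counts
// over a one-hour window that slides every 10 minutes.
JavaPairDStream<String, Tuple2<Double, Integer>> sumAndCount = flightTemps
    .mapValues(temp -> new Tuple2<>(temp, 1))
    .reduceByKeyAndWindow(
        (a, b) -> new Tuple2<>(a._1() + b._1(), a._2() + b._2()),
        Durations.minutes(60),   // window length: the last one hour of readings
        Durations.minutes(10));  // sliding interval: recompute every 10 minutes

// Average temperature per flight = sum of readings / number of readings.
JavaPairDStream<String, Double> avgTempPerFlight =
    sumAndCount.mapValues(sumCount -> sumCount._1() / sumCount._2());

Here Spark keeps the windowed state for us: each output batch reflects exactly the readings that fall inside the one-hour window at that point in time.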

Now suppose the airline wants to find the average temperature reading over the entire journey of every flight, that is, from the time the flight takes off to the time it lands. An external state then needs to be maintained, since the duration of the journey differs from flight to flight and flights can be delayed as well; no particular window duration can be defined. Such a streaming use case is also considered stateful stream processing, with the state maintained throughout the execution of the Spark Streaming job.
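One way to sketch this window-free state is with the mapWithState transformation, which keeps an arbitrary state object per key for as long as the job runs. The snippet below assumes the same hypothetical flightTemps pair stream as above and keeps a running (sum, count) per flightId, emitting the journey average so far:

import org.apache.spark.api.java.Optional;
import org.apache.spark.api.java.function.Function3;
import org.apache.spark.streaming.State;
import org.apache.spark.streaming.StateSpec;
import org.apache.spark.streaming.api.java.JavaMapWithStateDStream;
import scala.Tuple2;

// Fold each new reading into the running (sum, count) for that flight and
// emit the current average over the whole journey so far.
// Checkpointing must be enabled, as with any stateful transformation.
Function3<String, Optional<Double>, State<Tuple2<Double, Long>>, Tuple2<String, Double>> mappingFunc =
    (flightId, temp, state) -> {
        Tuple2<Double, Long> prev = state.exists() ? state.get() : new Tuple2<>(0.0, 0L);
        Tuple2<Double, Long> next = new Tuple2<>(prev._1() + temp.orElse(0.0), prev._2() + 1);
        state.update(next);
        return new Tuple2<>(flightId, next._1() / next._2());
    };

JavaMapWithStateDStream<String, Double, Tuple2<Double, Long>, Tuple2<String, Double>> journeyAverages =
    flightTemps.mapWithState(StateSpec.function(mappingFunc));

Since no window bounds this state, it lives until it is explicitly dropped; when a record arrives with landed set to true, state.remove() can be called inside the mapping function to discard that flight's state.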

Let's develop these use cases using Spark Streaming.

Consider that, after taking off, flights send data in the following JSON format:

 
{"flightId":"tz302","timestamp":1494423926816,"temperature":21.12,"landed":false} 
  
where flightId = unique id of the flight 
timestamp = time at which event is generated 
temperature = temperature reading at the time of event generation 
landed =  boolean flag which defines whether flight has been landed or still on the way 
 
The following POJO will be used to map messages to Java objects for further processing:
 
import java.io.Serializable;

public class FlightDetails implements Serializable {
    private String flightId;    // unique ID of the flight
    private long timestamp;     // time at which the event was generated
    private double temperature; // temperature reading at the time of event generation
    private boolean landed;     // whether the flight has landed

    // getters and setters omitted for brevity
}
In this example, we will be reading messages from a socket. In real-world scenarios, messages should be pushed to durable message brokers/queues such as Kafka, Kinesis, and so on.
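A minimal sketch of that wiring is shown below; Gson is an assumed choice for JSON parsing, and the host and port of the socket source are illustrative:

import com.google.gson.Gson;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class FlightStreamApp {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("FlightStream");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // Each line arriving on the socket is one JSON message; map it to the
        // FlightDetails POJO. Gson is created inside the lambda because it is
        // not serializable and must not be captured by the closure.
        JavaDStream<FlightDetails> flightDetails = jssc
            .socketTextStream("localhost", 9999)
            .map(line -> new Gson().fromJson(line, FlightDetails.class));

        flightDetails.print();
        jssc.start();
        jssc.awaitTermination();
    }
}

From flightDetails, the (flightId, temperature) pair stream assumed in the earlier sketches can be obtained with mapToPair.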