How Apache Spark improves windowing

Apache Spark Structured Streaming offers a significantly more flexible window-processing model. Because streams are treated as continuously appended tables, and every row in such a table carries a timestamp, window operations can be specified in the query itself, and each query can define different windows. In addition, if static data contains a timestamp, window operations can be defined on it as well, which makes for a very flexible stream-processing model.
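To make this concrete, here is a minimal Scala sketch of a window declared directly in the query. The built-in rate source and the one-minute window duration are illustrative stand-ins for a real stream:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder.appName("WindowedCounts").getOrCreate()
import spark.implicits._

// The built-in rate source produces rows with `timestamp` and `value`
// columns and stands in for a real stream here.
val events = spark.readStream.format("rate").option("rowsPerSecond", "5").load()

// The window is declared in the query itself: a one-minute tumbling
// window over the timestamp column. Another query on the same stream
// could use a completely different window.
val counts = events.groupBy(window($"timestamp", "1 minute")).count()

counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()
```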

In other words, windowing in Apache Spark is just a special type of grouping on the timestamp column. This also makes late-arriving data easy to handle, because Apache Spark can assign a late record to the appropriate window and rerun the computation on that window when the record finally arrives. This behavior is highly configurable.
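Because the window is just a grouping expression, the same query shape also works on static data. A small sketch, using hypothetical log rows invented for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder.appName("StaticWindow").getOrCreate()
import spark.implicits._

// A static DataFrame with a timestamp column; window() is just another
// grouping expression, so the same query shape works on data at rest.
val logs = Seq(
  ("2021-01-01 10:00:05", 200),
  ("2021-01-01 10:00:40", 404),
  ("2021-01-01 10:01:10", 200)
).toDF("ts", "status")
  .selectExpr("CAST(ts AS TIMESTAMP) AS timestamp", "status")

// One row per one-minute window, exactly as in the streaming case.
logs.groupBy(window($"timestamp", "1 minute")).count().show(truncate = false)
```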

Event time versus processing time: In time series analysis, and especially in stream computing, each record is assigned a particular timestamp. One way of creating such a timestamp is to use the arrival time at the stream-processing engine; often, this is not what you want. Usually, you want to assign each record an event time: the point in time when the record was created, for example, when a measurement on an IoT device took place. This makes it possible to cope with latency between the creation and the processing of an event, for example, when an IoT sensor was offline for some time, or when network congestion delayed data delivery.
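The following sketch illustrates the distinction with hypothetical IoT readings; the column names and sample values are assumptions made for this example:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.current_timestamp

val spark = SparkSession.builder.appName("EventTime").getOrCreate()
import spark.implicits._

// Hypothetical IoT readings: event_time was recorded on the device when
// the measurement took place, possibly long before the row reaches Spark.
val readings = Seq(
  ("sensor-1", "2021-01-01 10:00:00", 21.5),
  ("sensor-2", "2021-01-01 09:15:00", 19.8) // delivered late: device was offline
).toDF("sensor", "event_time", "temperature")
  .withColumn("event_time", $"event_time".cast("timestamp"))
  // Processing time is merely the moment the engine sees the row.
  .withColumn("processing_time", current_timestamp())

readings.show(truncate = false)
```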

The concept of late data arises when event time, rather than processing time, is used to assign a timestamp to each tuple. Event time is the timestamp at which a particular measurement took place, for example. Apache Spark Structured Streaming can transparently cope with data that arrives at a later point in time.

Late data: When a record arrives at a streaming engine, it is processed immediately; here, Apache Spark streaming doesn't differ from other engines. However, Apache Spark can determine at any time which windows a given tuple belongs to. If, for whatever reason, a tuple arrives late, all affected windows are updated and all aggregate operations based on these windows are rerun. This means that results are allowed to change over time as late data arrives, and this is supported out of the box without the programmer having to worry about it. Finally, since Apache Spark 2.1, the amount of lateness the system accepts can be specified using the withWatermark method.
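A minimal sketch of withWatermark, again using the rate source as a stand-in and a ten-minute lateness bound chosen for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder.appName("Watermark").getOrCreate()
import spark.implicits._

val events = spark.readStream.format("rate").load()

// Accept records arriving up to ten minutes late; anything older may be
// dropped, which also lets Spark discard window state it no longer needs.
val counts = events
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "1 minute"))
  .count()

counts.writeStream
  .outputMode("update") // emit only the windows changed by new (possibly late) rows
  .format("console")
  .start()
  .awaitTermination()
```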

The watermark is essentially the threshold that defines how late a data point may arrive and still be included in its respective window. Consider again the HTTP server log file processed over one-minute windows. If, for whatever reason, a tuple arrives that is more than four hours old, it may not make sense to include it in the windows, for example, if the application builds an hourly time-series forecast used to provision or de-provision additional HTTP servers in a cluster. Even if the four-hour-old data point could have changed the outcome, the provisioning decision it would have influenced has already been made.
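Applied to that scenario, the query might look like the following sketch; the rate source stands in for a parsed access-log stream, and the one-hour watermark is an assumed threshold chosen to stay well below the four-hour mark:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder.appName("HttpLogWindows").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for a parsed HTTP access-log stream exposing a
// `timestamp` column; a real source would replace the rate source.
val hits = spark.readStream.format("rate").load()

// One-minute windows, but tuples more than one hour late are ignored:
// a four-hour-old record could no longer influence a provisioning
// decision that was made hours ago.
val hitsPerMinute = hits
  .withWatermark("timestamp", "1 hour")
  .groupBy(window($"timestamp", "1 minute"))
  .count()
```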
