How streaming engines use windowing

There exist five different properties in two dimensions, which is how windows can be defined, where each window definition needs to use one property of each dimension.

The first property is the mode in which subsequent windows of a continuous stream of tuples can be created: sliding and tumbling.

The second is that the number of tuples that fall into a window has to be specified: either count-based, time-based or session-based.

Let's take a look at what they mean:

  • Sliding windows: A sliding window removes a tuple from it whenever a new tuple is eligible to be included.
  • Tumbling windows: A tumbling window removes all tuples from it whenever there are enough tuples arriving to create a new window.
  • Count-based windows: Such windows always have the n newest elements in it. Note that this can be achieved either by a sliding or tumbling tuple update policy.
  • Time-based windows: This window takes the timestamp of a tuple into account in order to determine whether it belongs to a certain window or not. Such a window can contain the latest n seconds worth of data, for example. Such a window can be sliding and tumbling as well.
    Time-based windows especially are eligible for late arriving data, which is a very interesting concept that Apache Spark Structured Streaming makes possible.
  • Session-based windows: This window takes a session ID of a tuple into account in order to determine whether it belongs to a certain window or not. Such a window eventually contains all data from a user interaction with an online shop, for example. Such a window can be sliding and tumbling as well, although this notion doesn't make really sense here because you want eventually act on/react to all data belonging to a specific session.

Time-based and Session-based windows especially are eligible for late arriving data, which is a very interesting concept that Apache Spark Structured Streaming makes possible.

Let's take a look at the following figure, which illustrates tumbling windows:

As can be observed, every tuple (or message respectively) ends up in one single window. Now let's have a look at sliding windows:

Sliding windows are meant to share tuples among their neighbors. This means that for every tuple arriving, a new window is issued. One example of such a paradigm is the calculation of a moving average once a tuple arrives for the last 100 data points. Now let's consider time based windows:

Finally, this last illustration shows time-based windows. It's important to notice that the number of tuples per window can be different as it only depends on how many messages in a certain time frame have arrived; only those will be included in the respective window. So for example consider an HTTP server log file to be streamed into Apache Spark Structured Streaming (using Flume as a possible solution). We are grouping tuples (which are the individual lines of the log file) together on a minute basis. Since the number of concurrent users are different at each point in time during the day the size of the minute windows will also vary depending on it.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.144.47.218