Notion of time in stream processing

An important feature of any stream processing system is the notion of time: how it is modelled and how it is integrated into the processing topology. Operations such as windowing depend heavily on time boundaries.

Common notions of time in streams are:

  • Event time: A specific point in time when a specific event actually occurred. The occurrence of the event is usually captured at the source that is bringing in the event. For example, if the event is a change in the geolocation of a car, then the associated event-time would be the time when the GPS sensor captured the location change.
  • Processing time: The point in time when the event is actually consumed by the underlying processing system. Processing time, by definition, occurs AFTER the event has occurred. As an example, imagine an analytics application that reads and processes the geolocation data reported from car sensors to present it to a fleet-management dashboard. Here, processing-time in the analytics application might be milliseconds or seconds (such as for realtime pipelines based on Apache Kafka and Kafka Streams) or hours (such as for batch pipelines based on Apache Hadoop or Apache Spark) after event-time.
  • Ingestion time: The point in time when an event or data record is stored in a topic partition by a Kafka broker.
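For a single record, the three notions of time above always occur in order (barring clock skew). The following self-contained sketch makes this ordering concrete using the car-geolocation example; the specific timestamps are made up for illustration:

```java
import java.time.Duration;
import java.time.Instant;

// Illustrates the three notions of time for a single (hypothetical) record.
public class TimeNotionsDemo {
    public static void main(String[] args) {
        // Event time: when the GPS sensor captured the location change.
        Instant eventTime = Instant.parse("2023-05-01T10:00:00Z");
        // Ingestion time: when the Kafka broker stored the record in a topic partition.
        Instant ingestionTime = Instant.parse("2023-05-01T10:00:02Z");
        // Processing time: when the analytics application consumed the record.
        Instant processingTime = Instant.parse("2023-05-01T10:00:05Z");

        // By definition: eventTime <= ingestionTime <= processingTime.
        Duration endToEndLag = Duration.between(eventTime, processingTime);
        System.out.println("End-to-end lag: " + endToEndLag.toMillis() + " ms");
    }
}
```

In a realtime Kafka Streams pipeline this lag is typically milliseconds to seconds; in a batch pipeline it can be hours.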

The choice between event-time and ingestion-time is actually made through the configuration of Kafka itself (not Kafka Streams). From Kafka 0.10.x onward, timestamps are automatically embedded into Kafka messages. Depending on Kafka's configuration, these timestamps represent either event-time or ingestion-time. The respective Kafka configuration setting can be specified at the broker level or per topic. The default timestamp extractor in Kafka Streams retrieves these embedded timestamps as-is. Hence, the effective time semantics of your application depend on the effective Kafka configuration for these embedded timestamps.
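Concretely, the relevant setting is `message.timestamp.type` at the topic level (or `log.message.timestamp.type` at the broker level). The topic name below is a hypothetical example:

```shell
# CreateTime    -> timestamp set by the producer (event-time semantics, the default)
# LogAppendTime -> timestamp set by the broker on append (ingestion-time semantics)
kafka-topics.sh --create --topic car-locations \
  --bootstrap-server localhost:9092 \
  --config message.timestamp.type=LogAppendTime
```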

Kafka Streams assigns a timestamp to every data record via the TimestampExtractor interface. These per-record timestamps describe the progress of a stream with respect to time and are leveraged by time-dependent operations, such as window operations.
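A custom extractor would implement `org.apache.kafka.streams.processor.TimestampExtractor`, whose `extract(ConsumerRecord, long partitionTime)` method returns the timestamp to use and receives the partition's highest observed timestamp as a fallback. To keep the example runnable without Kafka on the classpath, the sketch below isolates just that typical fallback logic in a plain static method; the class and method names are assumptions for illustration:

```java
// Sketch of the fallback logic a custom TimestampExtractor typically applies.
// In a real application this logic would live inside
// TimestampExtractor#extract(ConsumerRecord, long partitionTime) and the
// extractor would be registered via the default.timestamp.extractor config.
public class TimestampFallbackDemo {

    // Returns the embedded event-time if it is present and valid; otherwise
    // falls back to the highest timestamp seen so far in the partition.
    static long extractTimestamp(Long embeddedEventTime, long partitionTime) {
        if (embeddedEventTime != null && embeddedEventTime > 0) {
            return embeddedEventTime;
        }
        return partitionTime; // record carries no usable timestamp
    }

    public static void main(String[] args) {
        // Record with a valid embedded event-time: use it.
        System.out.println(extractTimestamp(1_600_000_000_000L, 1_700_000_000_000L));
        // Record without a timestamp: fall back to partition time.
        System.out.println(extractTimestamp(null, 1_700_000_000_000L));
    }
}
```

Falling back to `partitionTime` keeps stream time monotonically advancing even when individual records lack timestamps.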

We have now covered the streaming layer (Kafka as an event bus) and shed some light on the processing layer (Kafka Streams). To conclude the discussion of stream processing, let's also look at Samza's stream processing API.
