Increased performance with good old friends

Since Apache Spark Structured Streaming is part of Apache SparkSQL, the planner (Catalyst) creates incremental execution plans for mini-batches, just as it creates plans for batch processing. This means that the whole streaming model is based on batches, which is what makes a unified API for stream and batch processing possible. The price we pay is that Apache Spark Structured Streaming sometimes has drawbacks when it comes to very low latency requirements (sub-second, in the range of tens of milliseconds). As the name Structured Streaming and the use of DataFrames and Datasets imply, we also benefit from the performance improvements of project Tungsten, which was introduced in a previous chapter. To the Tungsten engine itself, a mini-batch doesn't look considerably different from an ordinary batch; only Catalyst is aware of the incremental nature of streams. Therefore, as of Apache Spark V2.2, the following operations are not (yet) supported, although they are on the roadmap to be supported eventually:

  • Chains of aggregations
  • Taking the first n rows
  • Distinct
  • Sorting before aggregations
  • Outer joins between streaming and static data (only limited support)

As this is constantly changing, it is best to refer to the latest documentation: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html.
