Understand the unified architecture for batch and stream

More and more big data applications rely on streaming data. There are many reasons for this, notably the increasing need for real-time insights, where a system must output analytics on the fly as new data comes in. We will not spend a lot of time discussing the difference between batch and streaming data, but intuitively, batch data is at rest in a database or a file, whereas streaming data is, well, streaming from a source to a sink.

There is a specific architecture that Google mentions a lot, which combines batch and stream processing into a single pipeline. It is worth understanding this architecture, which is as follows:

In the GCP world, the most common batch data source is GCS (that is, buckets), and the reliable messaging layer is Pub/Sub. Pub/Sub virtually always feeds into Dataflow, which is based on the Apache Beam APIs and combines batch and streaming logic into pipelines.
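To make the idea of combining batch and streaming logic concrete, here is a minimal sketch in plain Python. It is not the Apache Beam or Dataflow API; it only illustrates the principle that the same transformation can run over a bounded source and an unbounded one. The `symbol,price` record format, the function names, and the in-memory stand-ins for GCS and Pub/Sub are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[str, float]


def parse_event(line: str) -> Event:
    """Parse a hypothetical 'symbol,price' record into a (symbol, price) pair."""
    symbol, price = line.strip().split(",")
    return symbol, float(price)


def pipeline(source: Iterable[str]) -> Iterator[Event]:
    """Shared logic: parse every record and keep only positive prices.

    The function does not care whether the source is bounded or unbounded;
    it processes one record at a time as records become available.
    """
    for line in source:
        symbol, price = parse_event(line)
        if price > 0:
            yield symbol, price


def batch_source() -> List[str]:
    # Stand-in for lines read from a file in a GCS bucket (bounded).
    return ["AAA,10.5", "BBB,0", "AAA,11.0"]


def stream_source() -> Iterator[str]:
    # Stand-in for messages pulled from Pub/Sub (unbounded in practice;
    # finite here so that the example terminates).
    for message in ["BBB,20.0", "AAA,0", "BBB,21.5"]:
        yield message


# The same pipeline function handles both kinds of input.
batch_result = list(pipeline(batch_source()))
stream_result = list(pipeline(stream_source()))
```

In a real pipeline, only the reading and writing steps would differ between the batch and streaming cases, while the transformation in the middle would be shared, which is the point of the unified model.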

The classic home for summary analytics is BigQuery, and the best place to store granular, tick-by-tick processed data is Bigtable (why Bigtable? Because it supports fast writes and very large datasets, on the order of petabytes). So, penciling in all of those, we get the GCP version of the architecture.

Now, this is an important architectural set piece, and you really should commit it to memory. However, this does not mean that you should actually adopt it, at least not as of the time of this writing, in early 2018. The weak link right now is Dataflow. In theory, Dataflow is an awesome technology. It unifies the batch and streaming layers, de-dupes and orders the streaming data coming in from Pub/Sub, and can perform complex event-time and processing-time operations, such as windowing and watermarking. The downside? You have to write code to do this; there is no UI currently available. What's more, the code, which can be written in either Python or Java, is not all that easy to write. In time, and probably very soon, Dataflow will be an attractive proposition, though, so we should not be quick to dismiss it.
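Windowing and watermarking are easier to picture with a small example. The following is a simplified sketch in plain Python, not Dataflow's actual implementation or the Beam API. It assumes fixed 60-second event-time windows and a heuristic watermark equal to the largest event time seen so far minus an allowed lateness of 30 seconds. The constants, the function names, and the sample events are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

WINDOW_SECONDS = 60      # assumed fixed window size, in event-time seconds
ALLOWED_LATENESS = 30    # assumed gap between max event time and watermark

Event = Tuple[int, str]  # (event_time_in_seconds, payload)


def window_start(event_time: int) -> int:
    """Return the start of the fixed window that contains event_time."""
    return event_time - (event_time % WINDOW_SECONDS)


def windowed_counts(events: Iterable[Event]) -> Tuple[Dict[int, int], List[Event]]:
    """Count events per event-time window, in arrival (processing) order.

    An event is treated as late, and dropped, if the watermark has already
    passed the end of the window the event belongs to.
    """
    counts: Dict[int, int] = defaultdict(int)
    late: List[Event] = []
    max_event_time = None
    for event_time, payload in events:
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        watermark = max_event_time - ALLOWED_LATENESS
        start = window_start(event_time)
        if watermark >= start + WINDOW_SECONDS:
            late.append((event_time, payload))  # window already closed
        else:
            counts[start] += 1
    return dict(counts), late


# Events listed in the order they arrive, which differs from event-time order.
arrivals = [(5, "a"), (50, "b"), (70, "c"), (40, "d"), (130, "e"), (20, "f")]
counts, late = windowed_counts(arrivals)
```

Here the event with time 40 arrives out of order but is still counted, because the watermark has not yet passed the end of its window. The event with time 20 arrives after the watermark has moved past that window, so it is set aside as late.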
