Computing real time views

This brings us to the core of what a speed layer is supposed to do: produce views, at very regular intervals, that conform to the expected access patterns and can therefore be queried efficiently.

The biggest difference between views generated by the speed layer and those generated by the batch layer is that the speed layer works only on very recent data, and the data in its views must be updated as soon as new data arrives. The phrase "as soon as new data arrives" is subjective, but for a speed layer you should account for data arriving anywhere from milliseconds to a few seconds apart. Clearly, if data arrives at such short intervals, computing the view must also be efficient, to avoid a backlog building up at the tail end of the system. To understand the implications of such rapidly arriving data, let's look at a conceptual representation of how a view is generated by the speed layer.

For simplicity, I will rule out up front the speed layer adopting the same algorithm as the batch layer. Even though it would be a much simpler approach, it has major drawbacks, namely latency and excessive consumption of computational resources.

I borrow the following example directly from Nathan Marz's "Big Data" book, with some changes in the overall approach to calculating the views, to explain why computing real-time views is easier said than done:

"Suppose the data system you are building receives 32GB of new data per day and that new data gets into the serving layer at most 6 hours after being received. The speed layer would be responsible for at most 6 hours of data—about 8GB. While not a huge amount, 8GB is substantial when attempting to achieve sub-second latencies. Additionally, running a function on 8GB of data each time you receive a new piece of data will be extremely resource intensive. If the average size of a data unit is 100 bytes, the 8GB of recent data equates to approximately 86 million data units. Keeping the real-time views up to date would thus require an unreasonable 86 million × 8GB worth of processing every 6 hours. You could reduce the resource usage by batching the updates, but this greatly increases the update latency."

You may have been bogged down by the details, but stay with me for another few minutes. One of the things you have just learned is how to approach capacity planning for a processing system. To recap, the two important things to understand in capacity planning for a processing system are the following:

  • Average size of data unit arriving
  • Total size of data arriving in a given unit of time
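These two inputs are enough to reproduce the numbers in the quoted example. The following sketch uses the figures from that example (32 GiB per day, 100-byte units, a 6-hour window between batch runs):

```python
# Capacity planning from two inputs: average data-unit size and arrival volume.
GIB = 2**30

daily_volume_bytes = 32 * GIB   # total data arriving per day (from the example)
window_hours = 6                # maximum staleness of the batch views
avg_unit_size_bytes = 100       # average size of one data unit

# Data the speed layer must cover: everything newer than the latest batch view.
window_bytes = daily_volume_bytes * window_hours / 24
units_in_window = window_bytes / avg_unit_size_bytes

print(f"window size: {window_bytes / GIB:.0f} GiB")             # 8 GiB
print(f"units in window: {units_in_window / 1e6:.0f} million")  # 86 million
```

Rerunning this with your own volumes and unit sizes gives a quick first estimate of how much data the speed layer has to keep pace with.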

Now that we understand capacity planning, let's move on to the speed layer's view-calculation problem. The speed layer needs to be very efficient when it comes to processing and creating views. As you may have already inferred, the batch layer replaces the entire view it had created previously; this is where the speed layer differs from the batch layer.

Instead of replacing the entire view, the speed layer only updates the existing view, thereby greatly reducing its demands on memory as well as computation. This also poses a different challenge for the choice of the underlying data store:
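The contrast can be sketched in a few lines. The pageview-count view and the event shape below are illustrative, not from the text:

```python
from collections import Counter

# Batch style: recompute the whole view from the full dataset on every run.
def recompute_view(all_events):
    return Counter(event["url"] for event in all_events)

# Speed-layer style: fold each new event into the existing view in O(1),
# instead of touching the entire dataset again.
def update_view(view, event):
    view[event["url"]] += 1
    return view

view = Counter()
for event in [{"url": "/home"}, {"url": "/about"}, {"url": "/home"}]:
    update_view(view, event)
print(view["/home"])  # 2
```

The per-event update touches only the affected key, which is what keeps the speed layer's latency and resource usage bounded as data streams in.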

Since the real-time views need to be updated frequently rather than completely wiped out, a read-only data store is not a good choice for holding them. We require a data store that supports both random reads and random writes. Many data stores satisfy this requirement: Apache Cassandra, MongoDB, and traditional relational databases all fall into this category.
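As a concrete sketch of why random reads and writes matter, here is an in-place view update against SQLite (standing in for any store in that category); the table and column names are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # any random-read/random-write store would do
conn.execute("CREATE TABLE pageviews (url TEXT PRIMARY KEY, count INTEGER)")

def record_pageview(conn, url):
    # Random read: fetch the current value for this one key only.
    row = conn.execute("SELECT count FROM pageviews WHERE url = ?", (url,)).fetchone()
    if row is None:
        # Random write: insert a single new row.
        conn.execute("INSERT INTO pageviews VALUES (?, 1)", (url,))
    else:
        # Random write: update a single existing row, leaving the rest intact.
        conn.execute("UPDATE pageviews SET count = ? WHERE url = ?", (row[0] + 1, url))

for url in ["/home", "/home", "/about"]:
    record_pageview(conn, url)

print(conn.execute("SELECT count FROM pageviews WHERE url = '/home'").fetchone()[0])  # 2
```

A read-only store could serve the batch views, but it could not absorb this steady trickle of single-row updates.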

Even though the real-time views are stored in a completely different storage solution, note that the real-time and batch views may still be structurally and semantically similar. This is an important thing to keep in mind when defining the data structures of the different views you plan to create.
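One way to preserve that structural similarity is to share a single record definition between the two layers; the field names here are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class PageviewView:
    # One record of the view, structured identically for batch and real-time
    # output, so the serving side can combine the two at query time.
    url: str
    count: int

batch_record = PageviewView(url="/home", count=9_500)  # full history up to the batch run
speed_record = PageviewView(url="/home", count=42)     # events since the batch run

# Because the structures match, combining them is trivial:
total = batch_record.count + speed_record.count
print(total)  # 9542
```

If the two views diverged structurally, every query would need layer-specific translation logic before results could be merged.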

We have already gone into a lot of detail without discussing the high-level architecture of the streaming system. I usually avoid doing this, but in this situation it was worth going a little deeper to give readers an appreciation of the complexity that may be built into a streaming system.

Now we will discuss the typical high-level reference architecture of a streaming system.
