Lambda architecture and batch processing

The Lambda architecture belongs in any discussion of batch processing: batch processing is one of its layers, and we are building a batch processing layer for a data-intensive application.

The following diagram gives a quick review of the Lambda architecture:

We will not go into the details of the Lambda architecture again but will instead focus on point 2 in the preceding diagram, labelled the batch layer.

The batch layer in Lambda mainly performs two functions:

  • Manages the master dataset. The recommended practice is to define this master dataset as an immutable, append-only set of raw data.
  • Precomputes the batch view that feeds into the serving layer being used by the underlying query system.

It is the second function that we will discuss in this chapter. The first function, managing the master dataset, does not usually require a lot of development time or effort and is mainly the responsibility of the operations team.
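The two functions can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: the `PageView` record, the in-memory `MasterDataset` class, and the page-view counting logic are all hypothetical names chosen for the example. It shows records that cannot be modified once stored, a store that only supports appending, and a batch view recomputed from the whole dataset.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)  # frozen: a stored record can never be modified
class PageView:
    url: str
    user_id: str
    timestamp: int


class MasterDataset:
    """Hypothetical append-only store: records are added, never changed or removed."""

    def __init__(self) -> None:
        self._records: List[PageView] = []

    def append(self, record: PageView) -> None:
        self._records.append(record)

    def scan(self) -> Tuple[PageView, ...]:
        # Return an immutable snapshot so callers cannot alter stored data.
        return tuple(self._records)


def compute_batch_view(master: MasterDataset) -> Dict[str, int]:
    """Recompute the view from scratch over the whole dataset: page views per URL."""
    return dict(Counter(record.url for record in master.scan()))


master = MasterDataset()
master.append(PageView("/home", "u1", 1))
master.append(PageView("/home", "u2", 2))
master.append(PageView("/about", "u1", 3))
batch_view = compute_batch_view(master)
```

Because the view is a function of the raw data alone, it can be discarded and rebuilt at any time, which is what makes the immutable, append-only definition of the master dataset attractive.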

Our previous discussions of the Lambda architecture covered how it came into existence and the factors that drove it. Here, we will recap that discussion with a focus on the batch layer only.

The batch layer in the Lambda architecture gets its input from various feeder sources that collect data from different places and put it into a huge data store as is. This data store is labelled the master dataset in the preceding diagram.

Another piece missing from the preceding diagram is the batch processing application itself, which reads the data from this master dataset, transforms it, and creates batch views out of it.

This application will have the same high-level design as depicted in the first diagram of this chapter.

To give you a glimpse into the internal-component architecture of the batch layer, look at the following diagram:

Don't read too much into the picture: it is meant only to show the various components of the batch layer, not the execution architecture, in which the Reader, Processor, and Writer are co-located with the data itself.

Still, it is easy to see that the main purpose of a batch application is to iterate through the entire relevant dataset and create views out of it. With that use case in mind, it is not feasible to bring the entire dataset to the processing unit or system, as the cost would be prohibitive at both the network layer and the OS layer.
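The Reader, Processor, and Writer roles, and the idea of processing data where it lives, can be sketched as follows. This is only an illustration under assumptions: the `PARTITIONS` dictionary stands in for data spread across nodes, and the node names and function names are hypothetical. Each partition is reduced to a small partial aggregate next to its data, and only those partial results are merged into the view.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List

# Hypothetical partitions of the master dataset, each imagined to live on a different node.
PARTITIONS: Dict[str, List[str]] = {
    "node-a": ["/home", "/about", "/home"],
    "node-b": ["/home", "/pricing"],
}


def read_partition(node: str) -> Iterator[str]:
    """Reader: iterate over the records stored on one node."""
    yield from PARTITIONS[node]


def process(records: Iterable[str]) -> Counter:
    """Processor: reduce one partition to a small partial aggregate."""
    return Counter(records)


def write_view(partials: Iterable[Counter]) -> Dict[str, int]:
    """Writer: merge the partial aggregates into the final batch view."""
    merged: Counter = Counter()
    for partial in partials:
        merged.update(partial)
    return dict(merged)


# Only the per-partition counts would cross the network, not the raw records.
partials = [process(read_partition(node)) for node in PARTITIONS]
view = write_view(partials)
```

In a real cluster the Reader and Processor would run on the nodes holding the data; here everything runs in one process purely to show the shape of the flow.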

We will soon look at a more detailed flow diagram and description showing how the batch layer should be implemented and what its design considerations and characteristics should be. For now, just keep in mind that the batch layer exists so that end users do not have to run their queries on the entire master dataset, which would be both prohibitively expensive and time-consuming. Thus, batch systems, or the batch layers in your architecture, precompute over the master dataset and derive multiple batch views so that end user queries can be resolved with low latency.
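To make the latency argument concrete, the sketch below derives two batch views from the same raw events and compares answering a query by full scan with answering it from a view. The event data, the view definitions, and all function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical raw events in the master dataset: (url, user_id).
EVENTS: List[Tuple[str, str]] = [
    ("/home", "u1"),
    ("/home", "u2"),
    ("/about", "u1"),
    ("/home", "u1"),
]


def views_per_url(events: List[Tuple[str, str]]) -> Dict[str, int]:
    """Batch view 1: total page views for each URL."""
    counts: Dict[str, int] = defaultdict(int)
    for url, _user in events:
        counts[url] += 1
    return dict(counts)


def unique_visitors_per_url(events: List[Tuple[str, str]]) -> Dict[str, int]:
    """Batch view 2: number of distinct users for each URL."""
    visitors = defaultdict(set)
    for url, user in events:
        visitors[url].add(user)
    return {url: len(users) for url, users in visitors.items()}


def query_by_full_scan(events: List[Tuple[str, str]], url: str) -> int:
    """What a query would cost without a batch view: touch every record."""
    return sum(1 for event_url, _user in events if event_url == url)


# Precomputed once by the batch layer.
hits_view = views_per_url(EVENTS)
visitors_view = unique_visitors_per_url(EVENTS)

# At query time, a single key lookup replaces the scan over all events.
home_hits = hits_view["/home"]
```

Both paths return the same answer, but the scan grows with the size of the master dataset, while the lookup depends only on the size of the view.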

Before we delve deeper into the technical architecture of a batch layer, let's discuss certain common strategies, building blocks, and patterns that act as the starting point for designing the batch layer.
