Processing strategy

The foundation of any batch system is its processing strategy. It is important to understand what you want from your batch system and then adopt a strategy that achieves that goal. Not every batch system needs to be complex or distributed in nature, and not every data store needs to be a Hadoop cluster. If your underlying store is a relational database and you need to perform batch operations, you might consider Spring Batch as the technology of choice rather than, say, Spark.

There are multiple factors that affect the selection of a processing strategy. They include (but are not limited to):

  • Estimated batch data volume
  • Concurrent execution alongside other batch or online/realtime applications
  • Available batch windows (and with more enterprises wanting to be up and running 24/7, this may not really be an option)

For a batch process, your options may include the following:

  • Perform processing within a previously defined batch window
  • Concurrent batch execution alongside realtime/online processing
  • Processing different batch runs or jobs at the same time, that is, in parallel
  • Partitioning (that is, processing many instances of the same job at the same time on a subset of data)
  • A combination of these

The order of the preceding list reflects increasing implementation complexity: processing within a batch window is the easiest to implement, and partitioning the data to achieve distributed processing is the most complex.

Since this book is focused on data-intensive applications, we will not go deep into simple scenarios where batch processing is done by defining a separate application and fetching the data to the processor itself. It is still important to understand the design of simple batch processing, though, so that you have a broader understanding of which approach to take and when.

The following section discusses the preceding processing options. It is important to note, especially for processing that mutates data to change its state in a transactional store, that the commit-and-locking strategy a batch process adopts depends on the type of processing being performed. Therefore, the batch architecture cannot simply be an afterthought when designing the overall architecture; it should be among the first concerns in any architecture for data-intensive applications.

The locking strategy can be based on normal database-level locks, or a custom locking strategy can be implemented as part of the overall architecture. Retry logic comes in handy alongside the locking strategy, to avoid aborting a batch job when a lock cannot be acquired:

  • Perform processing within a batch window: For simple batch processes that each run in their own specified batch window, where the data being updated is not in use by online users or other batch processes, concurrency is not an issue and a single commit can be done at the end of the batch run.

In cases where the data volume is so large that it is difficult to define a decent batch window, and/or where the overall processing logic is too complex to fit in the defined window, we need a more robust approach to handling batch use cases. Understanding this is paramount to avoiding pitfalls in the overall batch system. Imagine there is no locking strategy in place and the system still relies on a single commit point: modifying any of the many batch programs becomes painful. Therefore, even with the simplest batch systems, consider commit logic for restart and recovery, as well as the information concerning the more complex cases described here.

  • Concurrent batch execution alongside realtime/online processing: The bone of contention with any batch system is the total time for which it can hold a lock on the data it is processing. If we have proper batch windows, this can be a liberal value, but when we do not have the leverage of an exclusive window, batch applications that process data that can simultaneously be updated by online users must not lock any shared data for longer than the agreed SLA (usually a few seconds).

To achieve this, any transactions spun up by the batch system should be committed to the database at short intervals. This minimizes both the portion of data that is unavailable to other processes and the elapsed time for which it is unavailable. Usually, if your data resides in a data store (whether relational or non-relational), some level of locking support is provided by the underlying system.
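As a minimal sketch of committing at short intervals (the JDBC connection string, the orders table, and its processed column are all hypothetical), the following loop processes one small chunk of rows per transaction and commits after each chunk rather than once at the end:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class IntervalCommitJob {
    private static final int COMMIT_INTERVAL = 500; // rows per transaction

    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/batchdb")) {
            conn.setAutoCommit(false); // we manage transaction boundaries ourselves
            while (true) {
                // Fetch one small chunk of pending work
                List<Long> ids = new ArrayList<>();
                try (PreparedStatement select = conn.prepareStatement(
                        "SELECT id FROM orders WHERE processed = false LIMIT ?")) {
                    select.setInt(1, COMMIT_INTERVAL);
                    try (ResultSet rs = select.executeQuery()) {
                        while (rs.next()) {
                            ids.add(rs.getLong("id"));
                        }
                    }
                }
                if (ids.isEmpty()) {
                    break; // nothing left to process
                }
                try (PreparedStatement update = conn.prepareStatement(
                        "UPDATE orders SET processed = true WHERE id = ?")) {
                    for (long id : ids) {
                        update.setLong(1, id);
                        update.executeUpdate();
                    }
                }
                // Committing per chunk releases row locks quickly, keeping the
                // data available to online users between chunks
                conn.commit();
            }
        }
    }
}

Committing per chunk also provides natural restart points: on failure, only the current chunk is rolled back, and a rerun can resume from the remaining unprocessed rows.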

Let's discuss two locking strategies that can be employed to make sure data doesn't get corrupted:

  • Optimistic locking: A strategy where the system keeps track of a version number across the read-update cycle. It starts with a user asking for a specific piece of data from the underlying data store. The data store returns the data along with metadata in the form of a version number, which the underlying data store keeps track of. When the user wants to update the value it read, it sends back the version number it received from the data store. The data store checks the version of the incoming data against the latest version it has. If the versions match, the update succeeds; if they do not, an exception is thrown to the end user. This flow is typically depicted as a sequence diagram; a code sketch of both strategies follows this list.
  • Pessimistic locking: A strategy that assumes record contention is not an exception but the norm, and thus there is a high chance of it occurring. To remediate this, a physical or logical lock is usually acquired whenever the data is retrieved. There are multiple pessimistic locking strategies that can be applied, such as a dedicated lock column, where each table in the database defines a column for handling locks on its rows. When the application wants to perform a command operation (such as an update or delete) on a given row, it first sets the flag in the lock column to true, provided the flag has not already been set by another transaction. With the flag in place, any other transaction that tries to retrieve the same row for a command operation is not allowed to proceed; the application should handle this gracefully, for example by retrying the operation after a short wait. Once the transaction that set the flag has finished updating the row, it clears the flag within the context of the same transaction. If the update to the row succeeds but the update to the flag does not, the entire transaction should either be rolled back or be handled gracefully, for example by retrying the transaction. Finally, once the flag has been reset to indicate no lock, the row can be taken up by the next transaction. As implied, maintaining the integrity of the data is the prime requirement of a pessimistic locking strategy. This also applies to the initial fetching of the row and the corresponding setting of the pessimistic flag, which can be achieved using database-level locks, for example SELECT FOR UPDATE. This kind of locking mechanism has the same downsides as physical locking, but the saving grace of software-controlled pessimistic locking is that it is easier to manage, for example by implementing a timeout for transactions that keep the lock acquired for longer than a configured amount of time.
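As a minimal sketch of both strategies in plain JDBC (the account table, its balance and version columns, and the method names are hypothetical), the optimistic variant relies on a version-checked UPDATE, while the pessimistic variant relies on a database-level SELECT FOR UPDATE lock:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class LockingSketches {

    // Optimistic: the UPDATE carries the version we read earlier; zero rows
    // updated means another transaction changed the row first.
    static void optimisticUpdate(Connection conn, long id, double newBalance,
                                 int versionRead) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "UPDATE account SET balance = ?, version = version + 1 "
                        + "WHERE id = ? AND version = ?")) {
            ps.setDouble(1, newBalance);
            ps.setLong(2, id);
            ps.setInt(3, versionRead);
            if (ps.executeUpdate() == 0) {
                throw new IllegalStateException("Stale version: row " + id
                        + " was modified by another transaction");
            }
        }
    }

    // Pessimistic: SELECT ... FOR UPDATE takes a database-level row lock,
    // so the initial read and the later write happen atomically.
    static void pessimisticUpdate(Connection conn, long id, double delta)
            throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement select = conn.prepareStatement(
                "SELECT balance FROM account WHERE id = ? FOR UPDATE");
             PreparedStatement update = conn.prepareStatement(
                "UPDATE account SET balance = ? WHERE id = ?")) {
            select.setLong(1, id);
            try (ResultSet rs = select.executeQuery()) {
                if (rs.next()) {
                    update.setDouble(1, rs.getDouble("balance") + delta);
                    update.setLong(2, id);
                    update.executeUpdate();
                }
            }
            conn.commit(); // committing releases the row lock
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
}

Note that the optimistic variant never blocks: a concurrent writer simply causes the version check to fail, and it is up to the caller to retry with freshly read data.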

It is always good to know and understand how both pessimistic and optimistic locking work, but at the same time, I would like to highlight that in batch processing, locking only comes into the picture when we have concurrent batches working on the same dataset and the data cannot be moved to the processor.

As a rule of thumb, optimistic locking is more suitable for online applications, while pessimistic locking is more appropriate for batch applications. Whenever logical locking is used, the same scheme must be used by all applications accessing the data entities protected by the logical locks.

Another thing to note is that both the optimistic and pessimistic locking solutions described here only address locking a single record. More often than not in a batch processing system, we need to obtain a lock on a group or set of records. With physical locks, these must be managed very carefully to avoid potential deadlocks. With logical locks, it is usually best to build a logical lock manager that understands the logical record groups you want to protect and can ensure that locks are coherent and non-deadlocking. This logical lock manager usually uses its own tables for lock management, contention reporting, and a time-out mechanism:

  • Parallel processing: As the name suggests, this adds the capability to execute multiple batch jobs in parallel, which helps keep the overall batch processing time to a minimum and utilizes resources efficiently. A lot of people confuse parallel processing with concurrent processing; likewise, multi-threading is different from multi-processing. Be mindful that these are separate concepts that nonetheless work together to provide a good user experience.
  • Multi-processing: This means spawning multiple processes (not threads) that execute simultaneously. A single process can spawn multiple threads. Creating a process is a heavy operation because it involves (on UNIX at least) the allocation of a dedicated address space in which the process resides during its lifetime. In contrast, creating a thread inside a given process is less expensive because the thread uses part of the address space of its process.

As you can imagine, the time it takes to switch between two threads of the same process (commonly referred to as context switching) is much less than the time it takes to switch between two processes. The reason is that switching between two threads does not require the OS to swap out the address space, since the two threads share the same address space; in the case of two processes, the OS needs to perform extra work to switch between the address spaces of the different processes.

In addition, communication between different threads of the same process is very simple because all threads in the same process share the same address space. This brings flexibility, and it opens up the possibility of implementing certain use cases (including batch processing) using either a multi-processing or a multi-threading approach.
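As a minimal multi-threaded sketch (the chunk names are placeholders for real units of work such as key ranges or file splits), a fixed thread pool processes the chunks of a batch job in parallel within a single process, with no inter-process communication required:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class MultiThreadedBatch {
    public static void main(String[] args) throws InterruptedException {
        // Hypothetical chunks of a batch job; in practice these would be
        // key ranges, file splits, or partitions of the input data
        List<String> chunks = List.of("chunk-1", "chunk-2", "chunk-3", "chunk-4");

        // All threads share the process's address space, so handing each
        // worker a chunk requires no inter-process communication
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (String chunk : chunks) {
            pool.submit(() -> System.out.println(
                    Thread.currentThread().getName() + " processing " + chunk));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}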

Now imagine you are executing a multithreaded process on a single processor. In such a case, the processor switches execution resources between the threads of the process, resulting in concurrent execution of those threads.

Now imagine the same multi-threaded process executing on a shared-memory multiprocessor. In this case, each thread in the process can execute on a separate processor at the same instant in time, resulting in parallel execution. Usually, when a given process consists of as many threads as there are processors, or fewer, the thread support system, in conjunction with the operating environment, ensures that each thread runs on a different processor.

There is much more to parallel/concurrent execution than described here. But a detailed discussion on it is outside the scope of this chapter. Here is a good place to read more about it: https://docs.oracle.com/cd/E19455-01/806-5257/6je9h032e/index.html.

Coming back to parallel processing in the context of batch processing: given a batch job, we can execute (sub)processing of the job in parallel as long as the parallel executions do not share the same set of data, be it a common file, the same database table, or the same index space. It is therefore very important that when you define a batch job, you also define how you will partition the data space that the batch job will be reading.

For example, if your batch job reads data from a Kafka topic, you can define a partitioning strategy that partitions the incoming data based on certain criteria, such that the data lands in multiple partitions that are then picked up by the different workers executing the batch job.
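As a minimal sketch of such a strategy (assuming the kafka-clients library and that every record carries a non-null key; the class name is hypothetical), a custom producer-side Partitioner routes all records with the same key to the same partition, and hence to the same worker:

import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

// Routes each record to a partition based on its key, so that all records
// for the same key end up with the same batch worker.
public class KeyHashPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionCountForTopic(topic);
        // murmur2 is the hash Kafka's default partitioner uses for keyed
        // records; keyBytes is assumed non-null here
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    @Override
    public void configure(Map<String, ?> configs) { }

    @Override
    public void close() { }
}

The producer would enable this via the partitioner.class configuration property; on the consumer side, running the workers in a single consumer group spreads the partitions across them.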

If your data resides in a database table, a good partitioning approach could be to define the range of primary keys that each instance (thread) of the batch job works on. Yet another approach, if you captured an immutable date when creating each record, is to partition the data based on that immutable date.
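As a minimal sketch of primary-key range partitioning (the table and column names in the comment are hypothetical), the key space is split into one contiguous, non-overlapping range per worker, so that no two workers ever touch the same rows:

import java.util.ArrayList;
import java.util.List;

public class KeyRangePartitioning {

    // A half-open key range [low, high) handled by one worker
    record KeyRange(long low, long high) { }

    // Split [minId, maxId] into one contiguous range per worker; each
    // worker then queries only its own slice, e.g.:
    //   SELECT * FROM orders WHERE id >= ? AND id < ?
    static List<KeyRange> split(long minId, long maxId, int workers) {
        List<KeyRange> ranges = new ArrayList<>();
        long span = (maxId - minId + 1 + workers - 1) / workers; // ceiling division
        for (int i = 0; i < workers; i++) {
            long low = minId + i * span;
            long high = Math.min(low + span, maxId + 1);
            if (low < high) {
                ranges.add(new KeyRange(low, high));
            }
        }
        return ranges;
    }

    public static void main(String[] args) {
        // e.g. ids 1..1000 split across 4 workers
        split(1, 1000, 4).forEach(System.out::println);
    }
}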

There are other architectural approaches that you could explore, for example using a control table that is shared across the executions of different instances of the job to make decisions about shared resources. Typically, at the start of the execution of an instance of the batch job, information is retrieved from this control table to decide whether the given execution should go ahead, wait, or abort.
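As a minimal sketch of such a check (the batch_control table, its columns, and the status values are all hypothetical), each instance consults the control table at startup and acts on the result:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class ControlTableGate {

    enum Decision { GO_AHEAD, WAIT, ABORT }

    // Consult the shared control table at startup: another instance already
    // RUNNING means we wait, a SUSPENDED job means we abort, and anything
    // else (including no entry at all) means we may proceed.
    static Decision checkControlTable(Connection conn, String jobName)
            throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT status FROM batch_control WHERE job_name = ?")) {
            ps.setString(1, jobName);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) {
                    return Decision.GO_AHEAD; // no entry: nothing blocks us
                }
                return switch (rs.getString("status")) {
                    case "RUNNING" -> Decision.WAIT;
                    case "SUSPENDED" -> Decision.ABORT;
                    default -> Decision.GO_AHEAD;
                };
            }
        }
    }
}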

The next section explores how data can be partitioned to be used for parallel execution.
