Data partitioning

Data partitioning divides the dataset and distributes the data over multiple servers, or shards. Each shard is an independent database, and collectively, the shards make up a single logical database. In batch processing specifically, using partitioning allows multiple versions of large batch applications to run concurrently. The main purpose is to reduce the overall elapsed time required to process long batch jobs. Processes that can be successfully partitioned are those where the input file can be split and/or the main database tables partitioned to allow the application to run against different sets of data.

Data partitioning addresses issues of scale, such as:

High query rates exhausting the CPU capacity of the server
Larger datasets exceeding the storage capacity of a single machine
Working set sizes larger than the system's RAM, thus stressing the I/O capacity of disk drives

The batch system architecture should be flexible enough to allow dynamic configuration of the overall number of partitions in a system. Both automatic and user-controlled configurations should be considered. Automatic configurations may be based on parameters such as the input file size and/or the number of input records.

There are various partitioning patterns that one need to keep in mind when defining the partition strategy for their application. We'll discuss few of them here.

Table of Contents for Data partitioning

Create new playlist

Sign In

Sign Up

Table of Contents for
Data partitioning