Chapter 1. Batch and Spring

When I graduated from Northern Illinois University back in 2001 after spending most of the previous two years working on COBOL, mainframe Assembler, and Job Control Language (JCL), I took a job as a consultant to learn Java. I specifically took that position because of the opportunity to learn Java when it was the hot new thing. Never in my wildest dreams did I think I'd be back writing about batch processing. I'm sure most Java developers don't think about batch, either. They think about the latest web framework or JVM language. They think about service-oriented architectures and things like REST versus SOAP or whatever alphabet soup is hot at the time.

But the fact is, the business world runs on batch. Your bank and 401k statements are all generated via batch processes. The e-mails you receive from your favorite stores with coupons in them? Probably sent via batch processes. Even the order in which the repair guy comes to your house to fix your laundry machine is determined by batch processing. In a time when we get our news from Twitter, Google thinks that waiting for a page refresh takes too long to provide search results, and YouTube can make someone a household name overnight, why do we need batch processing at all?

There are a few good reasons:

  • You don't always have all the required information immediately. Batch processing allows you to collect information required for a given process before starting the required processing. Take your monthly bank statement as an example. Does it make sense to generate the file format for your printed statement after every transaction? It makes more sense to wait until the end of the month and look back at a vetted list of transactions from which to build the statement.

  • Sometimes it makes good business sense. Although most people would love to have what they buy online put on a delivery truck the second they click Buy, that may not be the best course of action for the retailer. If a customer changes their mind and wants to cancel an order, it's much cheaper to cancel if it hasn't shipped yet. Giving the customer a few extra hours and batching the shipping together can save the retailer large amounts of money.

  • It can be a better use of resources. Having a lot of processing power sitting idle is expensive. It's more cost effective to have a collection of scheduled processes that run one after the other using the machine's full potential at a constant, predictable rate.

This book is about batch processing with the Spring Batch framework. This chapter looks at the history of batch processing, calls out the challenges in developing batch jobs, makes a case for developing batch processes using Java and Spring Batch, and finally provides a high-level overview of the framework and its features.

A History of Batch Processing

To look at the history of batch processing, you really need to look at the history of computing itself.

The time was 1951. The UNIVAC became the first commercially produced computer. Prior to this point, computers were each unique, custom-built machines designed for a specific function (for example, in 1946 the military commissioned a computer to calculate the trajectories of artillery shells). The UNIVAC consisted of 5,200 vacuum tubes, weighed in at over 14 tons, had a blazing speed of 2.25MHz (compared to the iPhone 4, which has a 1GHz processor) and ran programs that were loaded from tape drives. Pretty fast for its day, the UNIVAC was considered the first commercially available batch processor.

Before going any further into history, I should define what, exactly, batch processing is. Most of the applications you develop have an aspect of user interaction, whether it's a user clicking a link in a web app, typing information into a form on a thick client, or tapping around on phone and tablet apps. Batch processing is the exact opposite of those types of applications. Batch processing, for this book's purposes, is defined as the processing of data without interaction or interruption. Once started, a batch process runs to some form of completion without any intervention.

Four years passed in the evolution of computers and data processing before the next big change: high-level languages. They were first introduced with Lisp and Fortran on the IBM 704, but it was the Common Business Oriented Language (COBOL) that has since become the 800-pound gorilla in the batch-processing world. Developed in 1959 and revised in 1968, 1974, and 1985, COBOL still runs batch processing in modern business. A Gartner study[1] estimated that 60% of all global code and 85% of global business data is housed in the language. To put this in perspective, if you printed out all that code and stacked the printout, you'd have a stack 227 miles high. But that's where the innovation stalled.

COBOL hasn't seen a significant revision in a quarter of a century.[2] The number of schools that teach COBOL and its related technologies has declined significantly in favor of newer technologies like Java and .NET. The hardware is expensive, and resources are becoming scarce.

Mainframe computers aren't the only places that batch processing occurs. Those e-mails I mentioned previously are sent via batch processes that probably aren't run on mainframes. And the download of data from the point-of-sale terminal at your favorite fast food chain is batch, too. But there is a significant difference between the batch processes you find on a mainframe and those typically written for other environments (C++ and UNIX, for example). Each of those batch processes is custom developed, and they have very little in common. Since the takeover by COBOL, there has been very little in the way of new tools or techniques. Yes, cron jobs have kicked off custom-developed processes on UNIX servers and scheduled tasks on Microsoft Windows servers, but there have been no new industry-accepted tools for doing batch processes.

Until now. In 2007, Accenture announced that it was partnering with Interface21 (the original authors of the Spring framework, and now SpringSource) to develop an open source framework that would be used to create enterprise batch processes. In its first formal foray into the open source world, Accenture chose to combine its expertise in batch processing with Spring's popularity and feature set to create a robust, easy-to-use framework. At the end of March 2008, the Spring Batch 1.0.0 release was made available to the public; it represented the first standards-based approach to batch processing in the Java world. Slightly more than a year later, in April 2009, Spring Batch 2.0.0 was released. It replaced support for JDK 1.4 with JDK 1.5+ and added chunk-based processing, improved configuration options, and significant additions to the scalability options within the framework.

Batch Challenges

You're undoubtedly familiar with the challenges of GUI-based programming (thick clients and web apps alike). Security issues. Data validation. User-friendly error handling. Unpredictable usage patterns causing spikes in resource utilization (have a blog post of yours show up on the front page of Slashdot to see what I mean here). All of these are byproducts of the same thing: the ability for users to interact with your software.

However, batch is different. I said earlier that a batch process is a process that can run without additional interaction to some form of completion. Because of that, most of the issues with GUI applications are no longer valid. Yes, there are security concerns, and data validation is required, but spikes in usage and friendly error handling either are predictable or may not even apply to your batch processes. You can predict the load during a process and design accordingly. You can fail quickly and loudly with only solid logging and notifications as feedback, because technical resources address any issues.

So everything in the batch world is a piece of cake and there are no challenges, right? Sorry to burst your bubble, but batch processing presents its own unique twist on many common software development challenges. Software architecture commonly includes a number of ilities. Usability. Maintainability. Extensibility. These and other ilities are all relevant to batch processes, just in different ways.

The first three ilities—usability, maintainability, and extensibility—are related. With batch, you don't have a user interface to worry about, so usability isn't about pretty GUIs and cool animations. No, in a batch process, usability is about the code: both its error handling and its maintainability. Can you extend common components easily to add new features? Is it covered well in unit tests so that when you change an existing component, you know the effects across the system? When the job fails, do you know when, where, and why without having to spend a long time debugging? These are all aspects of usability that have an impact on batch processes.

Next is scalability. Time for a reality check: when was the last time you worked on a web site that truly had a million visitors a day? How about 100,000? Let's be honest: most web sites developed in large corporations aren't viewed nearly that many times. However, it's not a stretch to have a batch process that needs to process 100,000 to 500,000 transactions in a night. Let's consider 4 seconds to load a web page to be a solid average. If it takes that long to process a transaction via batch, then processing 100,000 transactions will take more than four days (and a month and a half for 1 million). That isn't practical for any system in today's corporate environment. The bottom line is that the scale that batch processes need to be able to handle is often one or more orders of magnitude larger than that of the web or thick-client applications you've developed in the past.
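
If you want to check that arithmetic yourself, a few lines of Java will do it. This is only a back-of-the-envelope sketch: the class and method names are mine, and it assumes transactions are processed strictly one at a time at a flat 4 seconds each.

```java
import java.time.Duration;

/** Back-of-the-envelope estimate for strictly sequential processing. */
final class BatchWindowEstimate {

    /** Total wall-clock time when every transaction takes the same time and none overlap. */
    static Duration sequentialRunTime(long transactions, Duration perTransaction) {
        return perTransaction.multipliedBy(transactions);
    }

    /** Converts a duration to fractional days for easier reading. */
    static double toDays(Duration duration) {
        return duration.getSeconds() / 86_400.0;
    }

    public static void main(String[] args) {
        Duration perTransaction = Duration.ofSeconds(4);
        for (long count : new long[] {100_000L, 1_000_000L}) {
            Duration total = sequentialRunTime(count, perTransaction);
            System.out.printf("%d transactions: %.1f days%n", count, toDays(total));
        }
    }
}
```

At 4 seconds apiece, 100,000 transactions come to 400,000 seconds (about 4.6 days) and 1 million come to roughly 46 days, which is why per-transaction cost and parallelism matter so much in batch.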

Third is availability. Again, this is different from the web or thick-client applications you may be used to. Batch processes typically aren't 24/7. In fact, they typically have an appointment. Most enterprises schedule a job to run at a given time when they know the required resources (hardware, data, and so on) are available. For example, take the need to build statements for retirement accounts. Although you can run the job at any point in the day, it's probably best to run it some time after the market has closed so you can use the closing fund prices to calculate balances. Can you run when you need to? Can you get the job done in the time allotted so you don't impact other systems? These and other questions affect the availability of your batch system.

Finally you must consider security. Typically, in the batch world, security doesn't revolve around people hacking into the system and breaking things. The role a batch process plays in security is in keeping data secure. Are sensitive database fields encrypted? Are you logging personal information by accident? How about access to external systems—do they need credentials, and are you securing those in the appropriate manner? Data validation is also part of security. Generally, the data being processed has already been vetted, but you still should be sure that rules are followed.

As you can see, plenty of technological challenges are involved in developing batch processes. From the large scale of most systems to security, batch has it all. That's part of the fun of developing batch processes: you get to focus more on solving technical issues than on moving form fields three pixels to the right on a web application. The question is, with existing infrastructures on mainframes and all the risks of adopting a new platform, why do batch in Java?

Why Do Batch Processing in Java?

With all the challenges just listed, why choose Java and an open source tool like Spring Batch to develop batch processes? I can think of six reasons to use Java and open source for your batch processes: maintainability, flexibility, scalability, development resources, support, and cost.

Maintainability is first. When you think about batch processing, you have to consider maintenance. This code typically has a much longer life than your other applications. There's a reason for that: no one sees batch code. Unlike a web or client application that has to stay up with the current trends and styles, a batch process exists to crunch numbers and build static output. As long as it does its job, most people just get to enjoy the output of their work. Because of this, you need to build the code in such a way that it can be easily modified without incurring large risks.

Enter the Spring framework. Spring was designed for a couple of things you can take advantage of: testability and abstraction. The decoupling of objects that the Spring framework encourages with dependency injection and the extra testing tools Spring provides allow you to build a robust test suite to minimize the risk of maintenance down the line. And without yet digging into the way Spring and Spring Batch work, Spring provides facilities to do things like file and database I/O declaratively. You don't have to write JDBC code or manage the nightmare that is the file I/O API in Java. Things like transactions and commit counts are all handled by the framework, so you don't have to manage where you are in the process and what to do when something fails. These are just some of the maintainability advantages that Spring Batch and Java provide for you.

The flexibility of Java and Spring Batch is another reason to use them. In the mainframe world, you have one option: run COBOL on a mainframe. That's it. Another common platform for batch processing is C++ on UNIX. This ends up being a very custom solution because there are no industry-accepted batch-processing frameworks. Neither the mainframe nor the C++/UNIX approach provides the flexibility of the JVM for deployments and the feature set of Spring Batch. Want to run your batch process on a server, desktop, or mainframe with *nix or Windows? It doesn't matter. Need to scale your process to multiple servers? With most Java running on inexpensive commodity hardware anyway, adding a server to a rack isn't the capital expenditure that buying a new mainframe is. In fact, why own servers at all? The cloud is a great place to run batch processes. You can scale out as much as you want and only pay for the CPU cycles you use. I can't think of a better use of cloud resources than batch processing.

However, the "write once, run anywhere" nature of Java isn't the only flexibility that comes with the Spring Batch approach. Another aspect of flexibility is the ability to share code from system to system. You can use the same services that already are tested and debugged in your web applications right in your batch processes. In fact, the ability to access business logic that was once locked up on some other platform is one of the greatest wins of moving to this platform. By using POJOs to implement your business logic, you can use them in your web applications, in your batch processes—literally anywhere you use Java for development.

Spring Batch's flexibility also goes toward the ability to scale a batch process written in Java. Let's look at the options for scaling batch processes:

  • Mainframe: The mainframe has limited additional capacity for scalability. The only true way to accomplish things in parallel is to run full programs in parallel on the single piece of hardware. This approach is limited by the fact that you need to write and maintain code to manage the parallel processing and the difficulties associated with it, such as error handling and state management across programs. In addition, you're limited by the resources of a single machine.

  • Custom processing: Starting from scratch, even in Java, is a daunting task. Getting scalability and reliability correct for large amounts of data is very difficult. Once again, you have the same issue of coding for load balancing. You also have large infrastructure complexities when you begin to distribute across physical devices or virtual machines. You must be concerned with how communication works between pieces. And you have issues of data reliability. What happens when one of your custom-written workers goes down? The list goes on. I'm not saying it can't be done; I'm saying that your time is probably better spent writing business logic instead of reinventing the wheel.

  • Java and Spring Batch: Although Java by itself has the facilities to handle most of the elements in the previous item, putting the pieces together in a maintainable way is very difficult. Spring Batch has taken care of that for you. Want to run the batch process in a single JVM on a single server? No problem. Your business is growing and now needs to divide the work of bill calculation across five different servers to get it all done overnight? You're covered. Data reliability? With little more than some configuration and keeping some key principles in mind, you can have transaction rollback and commit counts completely handled.

As you'll see as you dig into the Spring Batch framework, the issues that plague the previous options for batch processing can be mitigated with well-designed and tested solutions. Up to now, this chapter has talked about technical reasons for choosing Java and open source for your batch processing. However, technical issues aren't the only reasons for a decision like this. The ability to find qualified development resources to code and maintain a system is important. As mentioned earlier, the code in batch processes tends to have a significantly longer lifespan than the web apps you may be developing right now. Because of this, finding people who understand the technologies involved is just as important as the abilities of the technologies themselves. Spring Batch is based on the extremely popular Spring framework. It follows Spring's conventions and uses Spring's tools just like any other Spring-based application. So, any developer who has Spring experience will be able to pick up Spring Batch with a minimal learning curve. But will you be able to find Java and, specifically, Spring resources?

One of the arguments for doing many things in Java is the community support available. The Spring family of frameworks enjoys a large and very active community online through its forums. The Spring Batch project in that family has had one of the fastest-growing forums of any Spring project to date. Couple that with the strong advantages associated with having access to the source code and the ability to purchase support if required, and all support bases are covered with this option.

Finally you come to cost. Many costs are associated with any software project: hardware, software licenses, salaries, consulting fees, support contracts, and more. However, not only is a Spring Batch solution the most bang for your buck, but it's also the cheapest overall. Using commodity hardware and open source operating systems and frameworks (Linux, Java, Spring Batch, and so on), the only recurring costs are for development salaries, support contracts, and infrastructure—much less than the recurring licensing costs and hardware support contracts related to other options.

I think the evidence is clear. Not only is using Spring Batch the most sound route technically, but it's also the most cost-effective approach. Enough with the sales pitch: let's start to understand exactly what Spring Batch is.

Other Uses for Spring Batch

I bet by now you're wondering if replacing the mainframe is all Spring Batch is good for. When you think about the projects you face on an ongoing basis, it isn't every day that you're ripping out COBOL code. If that was all this framework was good for, it wouldn't be a very helpful framework. However, this framework can help you with many other use cases.

The most common use case is data migration. As you rewrite systems, you typically end up migrating data from one form to another. The risk is that you may write one-off solutions that are poorly tested and don't have the data-integrity controls that your regular development has. However, when you think about the features of Spring Batch, it seems like a natural fit. You don't have to do a lot of coding to get a simple batch job up and running, yet Spring Batch provides things like commit counts and rollback functionality that most data migrations should include but rarely do.
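
To make "commit counts and rollback" a little more concrete, here is a toy model in plain Java. It is not the Spring Batch API; the class name, the migrate method, and the in-memory lists standing in for source and target tables are all assumptions made for illustration. The idea it shows is simply that items are written in fixed-size chunks, and a chunk either lands completely or not at all.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

/** Toy model of commit-interval processing; hypothetical names, not the Spring Batch API. */
final class ChunkedMigrationSketch {

    /**
     * Writes source items in chunks of commitInterval items. A chunk reaches the
     * committed list only if the writer accepts all of it. The first failing
     * chunk is discarded as a whole and processing stops.
     *
     * @return the number of items committed by this call
     */
    static <T> int migrate(List<T> source, int commitInterval,
                           Consumer<List<T>> chunkWriter, List<T> committed) {
        if (commitInterval < 1) {
            throw new IllegalArgumentException("commitInterval must be at least 1");
        }
        int committedCount = 0;
        for (int start = 0; start < source.size(); start += commitInterval) {
            int end = Math.min(start + commitInterval, source.size());
            List<T> chunk = new ArrayList<>(source.subList(start, end));
            try {
                chunkWriter.accept(chunk);      // may throw to signal a bad record
            } catch (RuntimeException e) {
                return committedCount;          // "rollback": nothing from this chunk is kept
            }
            committed.addAll(chunk);            // "commit": the whole chunk becomes visible
            committedCount += chunk.size();
        }
        return committedCount;
    }

    public static void main(String[] args) {
        List<Integer> source = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        List<Integer> target = new ArrayList<>();
        Consumer<List<Integer>> writer = chunk -> {
            if (chunk.contains(8)) {
                throw new IllegalStateException("record 8 is invalid");
            }
        };
        int count = migrate(source, 3, writer, target);
        System.out.println(count + " items committed: " + target);
    }
}
```

With a commit interval of 3 and a bad record at item 8, the first two chunks are committed and the third is dropped in full, so the target ends up holding items 1 through 6 and never a half-written chunk.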

A second common use case for Spring Batch is any process that requires parallelized processing. As chipmakers approach the limits of Moore's Law, developers realize that the only way to continue to increase the performance of apps is not to process single transactions faster, but to process more transactions in parallel. Many frameworks have recently been released that assist in parallel processing. Apache Hadoop's MapReduce implementation, GridGain, and others have come out in recent years to attempt to take advantage of both multicore processors and the numerous servers available via the cloud. However, frameworks like Hadoop require you to alter your code and data to fit their algorithms or data structures. Spring Batch provides the ability to scale your process across multiple cores or servers (as shown in Figure 1-1 with master/slave step configurations) and still be able to use the same objects and datasources that your web applications use.

Figure 1.1. Simplifying parallel processing

Finally you come to constant or 24/7 processing. In many use cases, systems receive a constant or near-constant feed of data. Although accepting this data at the rate it comes in is necessary for preventing backlogs, when you look at the processing of that data, it may be more performant to batch the data into chunks to be processed at once (as shown in Figure 1-2). Spring Batch provides tools that let you do this type of processing in a reliable, scalable way. Using the framework's features, you can do things like read messages from a queue, batch them into chunks, and process them together in a never-ending loop. Thus you can increase throughput in high-volume situations without having to understand the complex nuances of developing such a solution from scratch.

Figure 1.2. Batching JMS processing to increase throughput
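
The read-batch-process loop described above can be sketched in plain Java. In this sketch an in-memory BlockingQueue stands in for a JMS queue, and the QueueBatcher class and nextChunk method are hypothetical names of my own; it is meant to show the shape of the idea, not how Spring Batch implements it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Toy model of batching queued messages into chunks; an in-memory queue stands in for JMS. */
final class QueueBatcher {

    /**
     * Waits up to timeoutMillis for a first message, then takes whatever else is
     * already queued, up to chunkSize messages in total. Returns an empty list
     * if no message arrives before the timeout.
     */
    static <T> List<T> nextChunk(BlockingQueue<T> queue, int chunkSize, long timeoutMillis)
            throws InterruptedException {
        if (chunkSize < 1) {
            throw new IllegalArgumentException("chunkSize must be at least 1");
        }
        List<T> chunk = new ArrayList<>(chunkSize);
        T first = queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        if (first == null) {
            return chunk;
        }
        chunk.add(first);
        queue.drainTo(chunk, chunkSize - 1);   // non-blocking: only takes what is already there
        return chunk;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        for (int i = 1; i <= 7; i++) {
            queue.add("message-" + i);
        }
        List<String> chunk;
        // A real consumer would loop forever; this demo stops once the queue stays empty.
        while (!(chunk = nextChunk(queue, 3, 100)).isEmpty()) {
            System.out.println("processing " + chunk.size() + " messages together: " + chunk);
        }
    }
}
```

Seven queued messages with a chunk size of 3 are handled as chunks of 3, 3, and 1. Blocking only for the first message keeps latency low when traffic is light, while draining the rest lets throughput rise when the queue is busy.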

As you can see, Spring Batch is a framework that, although designed for mainframe-like processing, can be used to simplify a variety of development problems. With everything in mind about what batch is and why you should use Spring Batch, let's finally begin looking at the framework itself.

The Spring Batch Framework

The Spring Batch framework (Spring Batch) was developed as a collaboration between Accenture and SpringSource as a standards-based way to implement common batch patterns and paradigms.

Features implemented by Spring Batch include data validation, formatting of output, the ability to implement complex business rules in a reusable way, and the ability to handle large data sets. You'll find as you dig through the examples in this book that if you're familiar at all with Spring, Spring Batch just makes sense.

Let's start at the 30,000-foot view of the framework, as shown in Figure 1-3.

Figure 1.3. The Spring Batch architecture

Spring Batch consists of three tiers assembled in a layered configuration. At the top is the application layer, which consists of all the custom code and configuration used to build out your batch processes. Your business logic, services, and so on, as well as the configuration of how you structure your jobs, are all considered the application. Notice that the application layer doesn't sit on top of but instead wraps the other two layers, core and infrastructure. The reason is that although most of what you develop consists of the application layer working with the core layer, sometimes you write custom infrastructure pieces such as custom readers and writers.

The application layer spends most of its time interacting with the next layer, the core. The core layer contains all the pieces that define the batch domain. Elements of the core component include the Job and Step interfaces as well as the types used to execute a Job: the JobLauncher interface and the JobParameters class.

Below all this is the infrastructure layer. In order to do any processing, you need to read and write from files, databases, and so on. You must be able to handle what to do when a job is retried after a failure. These pieces are considered common infrastructure and live in the infrastructure component of the framework.


A common misconception is that Spring Batch is or has a scheduler. It doesn't. There is no way within the framework to schedule a job to run at a given time or based on a given event. There are a number of ways to launch a job, from a simple cron script to Quartz or even an enterprise scheduler like UC4, but none within the framework itself. Chapter 6 covers launching a job.

Let's walk through some features of Spring Batch.

Defining Jobs with Spring

Batch processes have a number of different domain-specific concepts. A job is a process that consists of a number of steps. There may be input and output related to each step. When a step fails, it may or may not be repeatable. The flow of a job may be conditional (for example, execute the bonus calculation step only if the revenue calculation step returns revenue over $1,000,000). Spring Batch provides classes, interfaces, and XML schemas that define these concepts using POJOs and XML to divide concerns appropriately and wire them together in a way familiar to those who have used Spring. Listing 1-1, for example, shows a basic Spring Batch job configured in XML. The result is a framework for batch processing that you can pick up very quickly with only a basic understanding of Spring as a prerequisite.

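Listing 1-1. Sample Spring Batch Job Definition

<!-- Placeholder class name; the original class attribute is not shown. -->
<bean id="accountTasklet" class="com.example.batch.AccountTasklet"/>

<job id="accountJob">
  <step id="accountStep">
    <tasklet ref="accountTasklet"/>
  </step>
</job>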
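
The conditional flow mentioned earlier (run the bonus calculation only when revenue exceeds $1,000,000) can also be pictured in plain Java. This is a toy model with hypothetical names; it uses no Spring Batch types and only illustrates the idea of one step's result deciding whether the next step runs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

/** Toy model of a job with a conditional step; hypothetical names, not Spring Batch types. */
final class ConditionalJobSketch {

    static final long BONUS_THRESHOLD = 1_000_000L;

    /**
     * Runs the revenue step, then runs the bonus step only when the reported
     * revenue is over the threshold.
     *
     * @return the names of the steps that were executed, in order
     */
    static List<String> run(LongSupplier revenueStep, Runnable bonusStep) {
        List<String> executed = new ArrayList<>();
        long revenue = revenueStep.getAsLong();
        executed.add("revenueCalculation");
        if (revenue > BONUS_THRESHOLD) {
            bonusStep.run();
            executed.add("bonusCalculation");
        }
        return executed;
    }

    public static void main(String[] args) {
        System.out.println(run(() -> 1_250_000L, () -> { }));
        System.out.println(run(() -> 900_000L, () -> { }));
    }
}
```

In the first call both steps run; in the second, only the revenue step does. In Spring Batch this kind of decision is expressed in the job's configuration rather than in hand-written control flow like this.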

Managing Jobs

It's one thing to be able to write a Java program that processes some data once and never runs again. But mission-critical processes require a more robust approach. Keeping the state of a job for reexecution, maintaining data integrity through transaction management when a job fails, and saving performance metrics of past job executions for trending are all features that you expect in an enterprise batch system. These features are included in Spring Batch, and most of them are turned on by default; they require only minimal tweaking for performance and requirements as you develop your process.
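
To see why saved job state matters, consider this toy model of a restartable job. An in-memory map stands in for a persistent store of job state, and every name here is hypothetical. It is a sketch of the concept under those assumptions, not of how Spring Batch stores or restores state.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Toy model of restartable job state; an in-memory map stands in for a persistent store. */
final class RestartableJobSketch {

    private final Map<String, Integer> checkpoints = new HashMap<>();

    /**
     * Processes items starting from the last saved checkpoint for jobName.
     * The checkpoint advances only after an item is processed successfully,
     * so a rerun after a failure resumes at the item that failed.
     *
     * @return true if the job reached the end of the input
     */
    <T> boolean run(String jobName, List<T> items, Consumer<T> processor) {
        int position = checkpoints.getOrDefault(jobName, 0);
        try {
            while (position < items.size()) {
                processor.accept(items.get(position));
                position++;
                checkpoints.put(jobName, position);
            }
            return true;
        } catch (RuntimeException e) {
            return false;   // the saved checkpoint still points at the failed item
        }
    }

    /** Returns the index of the next item to process for jobName. */
    int checkpoint(String jobName) {
        return checkpoints.getOrDefault(jobName, 0);
    }
}
```

If the third of four items fails on the first run, the checkpoint stays at index 2. Rerunning the same job name after the problem is fixed processes only the third and fourth items, so nothing is processed twice.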

Local and Remote Parallelization

As discussed earlier, the ability to scale batch jobs is vital to any enterprise batch solution. Spring Batch lets you approach this in a number of different ways: a simple thread-based implementation, where each commit interval is processed in its own thread of a thread pool; running full steps in parallel; or configuring a grid of workers that are fed units of work from a remote master via partitioning. In short, the options include parallel chunk/step processing, remote chunk processing, and partitioning.
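
The simplest of these options, handing each chunk to a thread from a pool, looks roughly like the following in plain Java. This is a sketch using java.util.concurrent directly, with hypothetical class and method names and a trivial summing task standing in for real work; Spring Batch's own configuration for this is covered later in the book.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Toy model of local parallel chunk processing: each chunk is summed on a pool thread. */
final class ParallelChunksSketch {

    /** Splits items into chunks of chunkSize, sums each chunk in parallel, and combines the results. */
    static long sumInParallel(List<Integer> items, int chunkSize, int threads)
            throws InterruptedException, ExecutionException {
        if (chunkSize < 1) {
            throw new IllegalArgumentException("chunkSize must be at least 1");
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (int start = 0; start < items.size(); start += chunkSize) {
                int end = Math.min(start + chunkSize, items.size());
                List<Integer> chunk = items.subList(start, end);
                results.add(pool.submit(() -> {
                    long sum = 0;
                    for (int value : chunk) {
                        sum += value;
                    }
                    return sum;
                }));
            }
            long total = 0;
            for (Future<Long> result : results) {
                total += result.get();   // waits for that chunk to finish
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Integer> items = new ArrayList<>();
        for (int i = 1; i <= 100; i++) {
            items.add(i);
        }
        System.out.println("total = " + sumInParallel(items, 7, 4));
    }
}
```

Even in this small form you can see what a framework has to manage for you: splitting the work, collecting results, and deciding what happens when one chunk fails while others succeed.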

Standardizing I/O

Reading in from flat files with complex formats, XML files (XML is streamed, never loaded as a whole), or even a database, or writing to files or XML, can be done with only XML configuration. The ability to abstract things like file and database input and output from your code is an attribute of the maintainability of jobs written in Spring Batch.
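
The maintainability benefit of abstracting I/O is easier to see with a small model. In the sketch below, the Reader and Writer interfaces and the copy method are hypothetical names of my own, not Spring Batch types; they only illustrate how business logic can stay the same while the source and destination change.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

/** Toy model of I/O hidden behind small interfaces; hypothetical names, not Spring Batch types. */
final class IoAbstractionSketch {

    /** Supplies one item per call and null when the input is exhausted. */
    interface Reader<T> {
        T read();
    }

    /** Receives processed items; where they end up is the implementation's concern. */
    interface Writer<T> {
        void write(T item);
    }

    /** The processing loop sees only Reader and Writer, never files or JDBC. */
    static <I, O> int copy(Reader<I> reader, Function<I, O> processor, Writer<O> writer) {
        int count = 0;
        for (I item = reader.read(); item != null; item = reader.read()) {
            writer.write(processor.apply(item));
            count++;
        }
        return count;
    }

    /** An in-memory reader; a file- or database-backed reader could be swapped in. */
    static <T> Reader<T> fromList(List<T> items) {
        Iterator<T> iterator = items.iterator();
        return () -> iterator.hasNext() ? iterator.next() : null;
    }

    public static void main(String[] args) {
        List<String> output = new ArrayList<>();
        Reader<String> reader = fromList(Arrays.asList("alpha", "beta"));
        Function<String, String> processor = value -> value.toUpperCase();
        Writer<String> writer = item -> output.add(item);
        int count = copy(reader, processor, writer);
        System.out.println(count + " items written: " + output);
    }
}
```

Because copy depends only on the two interfaces, swapping the in-memory list for another source would not touch the processing code, which is the property the declarative configuration is meant to give you.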

The Spring Batch Admin Project

Writing your own batch-processing framework doesn't just mean having to redevelop the performance, scalability, and reliability features you get out of the box with Spring Batch. You also need to develop some form of administration toolset to do things like start and stop processes and view the statistics of previous job runs. However, if you use Spring Batch, it includes all that functionality as well as a newer addition: the Spring Batch Admin project. The Spring Batch Admin project offers a web-based control center with controls for your batch process (like launching a job, as shown in Figure 1-4) as well as the ability to monitor the performance of your process over time.

Figure 1.4. The Spring Batch Admin project user interface

And All the Features of Spring

Even with the impressive list of features that Spring Batch includes, the greatest thing is that it's built on Spring. With the exhaustive list of features that Spring provides for any Java application, including dependency injection, aspect-oriented programming (AOP), transaction management, and templates/helpers for most common tasks (JDBC, JMS, e-mail, and so on), building an enterprise batch process on a Spring framework offers virtually everything a developer needs.

As you can see, Spring Batch brings a lot to the table for developers. The proven development model of the Spring framework, scalability, and reliability features as well as an administration application are all available for you to get a batch process running quickly with Spring Batch.

How This Book Works

After going over the what and why of batch processing and Spring Batch, I'm sure you're chomping at the bit to dig into some code and learn what building batch processes with this framework is all about. Chapter 2 goes over the domain of a batch job, defines some of the terms I've already begun to use (job, step, and so on), and walks you through setting up your first Spring Batch project. You honor the gods by writing a "Hello, World!" batch job and see what happens when you run it.

One of my main goals for this book is to not only provide an in-depth look at how the Spring Batch framework works, but also show you how to apply those tools in a realistic example. Chapter 3 provides the requirements and technical architecture for a project that you implement in Chapter 10.


This chapter walked through a history of batch processing. It covered some of the challenges a developer of a batch process faces as well as justified the use of Java and open source technologies to conquer those challenges. Finally, you began an overview of the Spring Batch framework by examining its high-level components and features. By now, you should have a good view of what you're up against and understand that the tools to meet the challenges exist in Spring Batch. Now, all you need to do is learn how. Let's get started.


[2] There have been revisions in COBOL 2002 and Object Oriented COBOL, but their adoption has been significantly less than for previous versions.

