12.2. Bulk and batch operations

You use object/relational mapping to move data into the application tier so that you can use an object-oriented programming language to process that data. This is a good strategy if you're implementing a multiuser online transaction processing application, with small to medium-size data sets involved in each unit of work.

On the other hand, operations that require massive amounts of data are best not executed in the application tier. You should move the operation closer to the location of the data, rather than the other way round. In an SQL system, the DML statements UPDATE and DELETE execute directly in the database and are often sufficient if you have to implement an operation that involves thousands of rows. More complex operations may require more complex procedures to run inside the database; hence, you should consider stored procedures as one possible strategy.
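
In plain SQL, such a bulk operation is a single statement that never moves rows into the application tier. A sketch (the ITEM table and its column names are hypothetical):

```sql
-- Runs entirely inside the database; no rows travel to the application tier
UPDATE ITEM
   SET IS_ACTIVE = 0
 WHERE END_DATE < CURRENT_DATE
```

Compare this with loading thousands of rows, modifying each object in memory, and writing every change back with individual statements.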

You can fall back to JDBC and SQL at all times in Hibernate or Java Persistence applications. In this section, we'll show you how to avoid this and how to execute bulk and batch operations with Hibernate and JPA.

12.2.1. Bulk statements with HQL and JPA QL

The Hibernate Query Language (HQL) is similar to SQL. The main difference between the two is that HQL uses class names instead of table names, and property names instead of column names. It also understands inheritance—that is, whether you're querying with a superclass or an interface.

The JPA query language, as defined by JPA and EJB 3.0, is a subset of HQL. Hence, all queries and statements that are valid JPA QL are also valid HQL. The statements we'll show you now, for bulk operations that execute directly in the database, are available in JPA QL and HQL. (Hibernate adopted the standardized bulk operations from JPA.)

The available statements support updating and deleting objects directly in the database without the need to retrieve the objects into memory. A statement that can select data and insert it as new entity objects is also provided.

Updating objects directly in the database

In the previous chapters, we've repeated that you should think about state management of objects, not how SQL statements are managed. This strategy assumes that the objects you're referring to are available in memory. If you execute an SQL statement that operates directly on the rows in the database, any change you make doesn't affect the in-memory objects (in whatever state they may be). In other words, any direct DML statement bypasses the Hibernate persistence context (and all caches).

A pragmatic solution that avoids this issue is a simple convention: Execute any direct DML operations first in a fresh persistence context. Then, use the Hibernate Session or EntityManager to load and store objects. This convention guarantees that the persistence context is unaffected by any statements executed earlier. Alternatively, you can selectively use the refresh() operation to reload the state of a persistent object from the database, if you know it's been modified behind the back of the persistence context.

Hibernate and JPA offer DML operations that are a little more powerful than plain SQL. Let's look at the first operation in HQL and JPA QL, an UPDATE:

Query q =
   session.createQuery("update Item i set i.isActive = :isActive");
q.setBoolean("isActive", true);
int updatedItems = q.executeUpdate();

This HQL statement (or JPA QL statement, if executed with the EntityManager) looks like an SQL statement. However, it uses an entity name (class name) and a property name. It's also integrated into Hibernate's parameter binding API. The number of updated entity objects is returned—not the number of updated rows. Another benefit is that the HQL (JPA QL) UPDATE statement works for inheritance hierarchies:

Query q = session.createQuery(
    "update CreditCard set stolenOn = :now where type = 'Visa'"
);
q.setTimestamp("now", new Date());
int updatedCreditCards = q.executeUpdate();

The persistence engine knows how to execute this update, even if several SQL statements have to be generated; it updates several base tables (because CreditCard is mapped to several superclass and subclass tables). This example also doesn't contain an alias for the entity class—it's optional. However, if you use an alias, all properties must be prefixed with that alias. Also note that HQL (and JPA QL) UPDATE statements can reference only a single entity class; you can't write a single statement to update Item and CreditCard objects simultaneously, for example. Subqueries are allowed in the WHERE clause; any joins are allowed only in these subqueries.
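
For example, a subquery can restrict the update to credit cards whose owner matches some condition. A hedged sketch (the 'Invalid' type value is hypothetical; the user association and email property are taken from the mapped classes used elsewhere in this section):

```sql
update CreditCard c
   set c.type = 'Invalid'
 where c.user in (
           select u from User u where u.email is null
       )
```

No join appears in the outer statement; the association is reached only through the subquery, as required by the rules above.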

Direct DML operations, by default, don't affect any version or timestamp values of the affected entities (this is standardized in Java Persistence). With HQL, however, you can increment the version number of directly modified entity instances:

Query q =
   session.createQuery(
        "update versioned Item i set i.isActive = :isActive"
   );
q.setBoolean("isActive", true);
int updatedItems = q.executeUpdate();

(The versioned keyword is not allowed if your version or timestamp property relies on a custom org.hibernate.usertype.UserVersionType.)

The second HQL (JPA QL) bulk operation we introduce is the DELETE:

Query q = session.createQuery(
    "delete CreditCard c where c.stolenOn is not null"
);
int updatedCreditCards = q.executeUpdate();

The same rules as for UPDATE statements apply: no joins, single entity class only, optional aliases, subqueries allowed in the WHERE clause.

Just like SQL bulk operations, HQL (and JPA QL) bulk operations don't affect the persistence context; they bypass any cache. Credit cards or items in memory aren't updated if you execute one of these examples.

The last HQL bulk operation can create objects directly in the database.

Creating new objects directly in the database

Let's assume that all your customers' Visa cards have been stolen. You write two bulk operations to mark the day they were stolen (well, the day you discovered the theft) and to remove the compromised credit-card data from your records. Because you work for a responsible company, you have to report the stolen credit cards to the authorities and affected customers. So, before you delete the records, you extract everything that was stolen and create a few hundred (or thousand) StolenCreditCard objects. This is a new class you write just for that purpose:

public class StolenCreditCard {

    private Long id;
    private String type;
    private String number;
    private String expMonth;
    private String expYear;
    private String ownerFirstname;
    private String ownerLastname;
    private String ownerLogin;
    private String ownerEmailAddress;
    private Address ownerHomeAddress;

    ... // Constructors, getter and setter methods
}
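
One possible XML mapping for this class might look as follows. This is a sketch: the table and column names, the choice of generator, and the Address component's properties (street, zipcode, city) are assumptions.

```xml
<hibernate-mapping>
    <class name="StolenCreditCard" table="STOLEN_CREDIT_CARD">
        <!-- The generator must operate inside the database (sequence,
             identity) if you want INSERT ... SELECT to work, see below -->
        <id name="id" column="STOLEN_CARD_ID">
            <generator class="native"/>
        </id>
        <property name="type"/>
        <property name="number"/>
        <property name="expMonth"/>
        <property name="expYear"/>
        <property name="ownerFirstname"/>
        <property name="ownerLastname"/>
        <property name="ownerLogin"/>
        <property name="ownerEmailAddress"/>
        <component name="ownerHomeAddress" class="Address">
            <property name="street"/>
            <property name="zipcode"/>
            <property name="city"/>
        </component>
    </class>
</hibernate-mapping>
```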

You now map this class to its own STOLEN_CREDIT_CARD table, either with an XML file or JPA annotations (you shouldn't have any problem doing this on your own). Next, you need a statement that executes directly in the database, retrieves all compromised credit cards, and creates new StolenCreditCard objects:

Query q = session.createQuery(
    "insert into StolenCreditCard" +
    "   (type, number, expMonth, expYear," +
    "    ownerFirstname, ownerLastname, ownerLogin," +
    "    ownerEmailAddress, ownerHomeAddress)" +
    " select c.type, c.number, c.expMonth, c.expYear," +
    "        u.firstname, u.lastname, u.username," +
    "        u.email, u.homeAddress" +
    "   from CreditCard c join c.user u" +
    "  where c.stolenOn is not null"
);
int createdObjects = q.executeUpdate();

This operation does two things: First, the details of CreditCard records and the respective owner (a User) are selected. The result is then directly inserted into the table to which the StolenCreditCard class is mapped.

Note the following:

  • The properties that are the target of an INSERT ... SELECT (in this case, the StolenCreditCard properties you list) have to be for a particular subclass, not an (abstract) superclass. Because StolenCreditCard isn't part of an inheritance hierarchy, this isn't an issue.

  • The types returned by the SELECT must match the types required for the INSERT—in this case, lots of string types and a component (the same type of component for selection and insertion).

  • The database identifier for each StolenCreditCard object will be generated automatically by the identifier generator you map it with. Alternatively, you can add the identifier property to the list of inserted properties and supply a value through selection. Note that automatic generation of identifier values works only for identifier generators that operate directly inside the database, such as sequences or identity fields.

  • If the generated objects are of a versioned class (with a version or timestamp property), a fresh version (zero, or timestamp of today) will also be generated. Alternatively, you can select a version (or timestamp) value and add the version (or timestamp) property to the list of inserted properties.

Finally, note that INSERT ... SELECT is available only with HQL; JPA QL doesn't standardize this kind of statement—hence, your statement may not be portable.

HQL and JPA QL bulk operations cover many situations in which you'd usually resort to plain SQL. On the other hand, sometimes you can't exclude the application tier in a mass data operation.

12.2.2. Processing with batches

Imagine that you have to manipulate all Item objects, and that the changes you have to make aren't as trivial as setting a flag (which you've done with a single statement previously). Let's also assume that you can't create an SQL stored procedure, for whatever reason (maybe because your application has to work on database-management systems that don't support stored procedures). Your only choice is to write the procedure in Java and to retrieve a massive amount of data into memory to run it through the procedure.

You should execute this procedure by batching the work. That means you create many smaller datasets instead of a single dataset that wouldn't even fit into memory.
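
The batching rhythm itself is plain Java. A minimal sketch of splitting a work list into fixed-size chunks (the BatchProcessor class and its partition helper are hypothetical, not part of the Hibernate API):

```java
import java.util.ArrayList;
import java.util.List;

public class BatchProcessor {

    // Split a large work list into fixed-size chunks, so each chunk can be
    // processed and then flushed and cleared from the persistence context.
    public static <T> List<List<T>> partition(List<T> items, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            batches.add(items.subList(i, Math.min(i + batchSize, items.size())));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 250; i++) ids.add(i);

        for (List<Integer> batch : partition(ids, 100)) {
            // process the batch, then flush() and clear() the context
            System.out.println("batch of " + batch.size());
        }
    }
}
```

Running this prints three batches: two of 100 items and a final one of 50. The Hibernate code that follows achieves the same rhythm with a database cursor instead of an in-memory list.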

Writing a procedure with batch updates

The following code loads 100 Item objects at a time for processing:

Session session = sessionFactory.openSession();
Transaction tx = session.beginTransaction();

ScrollableResults itemCursor =
    session.createQuery("from Item").scroll();

int count=0;
while ( itemCursor.next() ) {
    Item item = (Item) itemCursor.get(0);
    modifyItem(item);
    if ( ++count % 100 == 0 ) {
        session.flush();
        session.clear();
    }
}
tx.commit();
session.close();

You use an HQL query (a simple one) to load all Item objects from the database. But instead of retrieving the result of the query completely into memory, you open an online cursor. A cursor is a pointer to a result set that stays in the database. You can control the cursor with the ScrollableResults object and move it along the result. The get(int i) call retrieves a single object into memory, the object the cursor is currently pointing to. Each call to next() forwards the cursor to the next object. To avoid memory exhaustion, you flush() and clear() the persistence context before loading the next 100 objects into it.

A flush of the persistence context writes the changes you made to the last 100 Item objects to the database. For best performance, you should set the size of the Hibernate (and JDBC) configuration property hibernate.jdbc.batch_size to the same size as your procedure batch: 100. All UPDATE statements that are executed during flushing are then also batched at the JDBC level.
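
In hibernate.cfg.xml, the setting might look like this (a sketch; the surrounding configuration is elided):

```xml
<hibernate-configuration>
    <session-factory>
        <!-- Match the JDBC batch size to the procedure batch size -->
        <property name="hibernate.jdbc.batch_size">100</property>
    </session-factory>
</hibernate-configuration>
```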

(Note that you should disable the second-level cache for any batch operations; otherwise, each modification of an object during the batch procedure must be propagated to the second-level cache for that persistent class. This is an unnecessary overhead. You'll learn how to control the second-level cache in the next chapter.)

The Java Persistence API unfortunately doesn't support cursor-based query results. You have to fall back to the org.hibernate.Session and org.hibernate.Query APIs to access this feature.

The same technique can be used to create and persist a large number of objects.

Inserting many objects in batches

If you have to create a few hundred or thousand objects in a unit of work, you may run into memory exhaustion. Every object that is passed to save() or persist() is added to the persistence context cache.

A straightforward solution is to flush and clear the persistence context after a certain number of objects. You effectively batch the inserts:

Session session = sessionFactory.openSession();
Transaction tx = session.beginTransaction();

for ( int i=0; i<100000; i++ ) {
    Item item = new Item(...);
    session.save(item);
    if ( i % 100 == 0 ) {
        session.flush();
        session.clear();
    }
}
tx.commit();
session.close();

Here you create and persist 100,000 objects, 100 at a time. Again, remember to set the hibernate.jdbc.batch_size configuration property to an equivalent value and disable the second-level cache for the persistent class. Caveat: Hibernate silently disables JDBC batch inserts if your entity is mapped with an identity identifier generator; many JDBC drivers don't support batching in that case.

Another option that completely avoids memory consumption of the persistence context (by effectively disabling it) is the StatelessSession interface.

12.2.3. Using a stateless Session

The persistence context is an essential feature of the Hibernate and Java Persistence engine. Without a persistence context, you wouldn't be able to manipulate object state and have Hibernate detect your changes automatically. Many other things also wouldn't be possible.

However, Hibernate offers you an alternative interface, if you prefer to work with your database by executing statements. This statement-oriented interface, org.hibernate.StatelessSession, feels and works like plain JDBC, except that you get the benefit from mapped persistent classes and Hibernate's database portability.

Imagine that you want to execute the same "update all item objects" procedure you wrote in an earlier example with this interface:

StatelessSession session = sessionFactory.openStatelessSession();
Transaction tx = session.beginTransaction();

ScrollableResults itemCursor =
    session.createQuery("from Item").scroll();

while ( itemCursor.next() ) {
    Item item = (Item) itemCursor.get(0);
    modifyItem(item);
    session.update(item);
}
tx.commit();
session.close();

The batching is gone in this example—you open a StatelessSession. You no longer work with objects in persistent state; everything that is returned from the database is in detached state. Hence, after modifying an Item object, you need to call update() to make your changes permanent. Note that this call no longer reattaches the detached and modified item. It executes an immediate SQL UPDATE; the item is again in detached state after the command.

Disabling the persistence context and working with the StatelessSession interface has some other serious consequences and conceptual limitations (at least, if you compare it to a regular Session):

  • A StatelessSession doesn't have a persistence context cache and doesn't interact with any other second-level or query cache. Everything you do results in immediate SQL operations.

  • Modifications to objects aren't automatically detected (no dirty checking), and SQL operations aren't executed as late as possible (no write-behind).

  • No modification of an object and no operation you call are cascaded to any associated instance. You're working with instances of a single entity class.

  • Any modifications to a collection that is mapped as an entity association (one-to-many, many-to-many) are ignored. Only collections of value types are considered. You therefore shouldn't map entity associations with collections, but only the noninverse side with foreign keys to many-to-one; handle the relationship through one side only. Write a query to obtain data you'd otherwise retrieve by iterating through a mapped collection.

  • The StatelessSession bypasses any enabled org.hibernate.Interceptor and can't be intercepted through the event system (both features are discussed later in this chapter).

  • You have no guaranteed scope of object identity. Executing the same query twice produces two different in-memory detached instances. This can lead to data-aliasing effects if you don't carefully implement the equals() method of your persistent classes.
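
One way to guard against such aliasing is an equals() based on a business key rather than the database identifier. A plain-Java sketch (this is not Hibernate API; it assumes the card number is a stable, unique business key):

```java
import java.util.Objects;

public class CreditCard {

    private Long id;          // database identifier, may be null before saving
    private String number;    // business key: unique and immutable

    public CreditCard(String number) {
        this.number = number;
    }

    // Compare by business key, so two detached instances loaded from the
    // same row are equal even though they are different objects in memory.
    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof CreditCard)) return false;
        return Objects.equals(number, ((CreditCard) other).number);
    }

    @Override
    public int hashCode() {
        return Objects.hashCode(number);
    }

    public static void main(String[] args) {
        CreditCard first = new CreditCard("4111111111111111");
        CreditCard second = new CreditCard("4111111111111111");
        // Two detached instances of the same row compare as equal
        System.out.println(first.equals(second));  // prints "true"
    }
}
```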

Good use cases for a StatelessSession are rare; you may prefer it if manual batching with a regular Session becomes cumbersome. Remember that the insert(), update(), and delete() operations have naturally different semantics than the equivalent save(), update(), and delete() operations on a regular Session. (They probably should have different names, too; the StatelessSession API was added to Hibernate ad hoc, without much planning. The Hibernate developer team discussed renaming this interface in a future version of Hibernate; you may find it under a different name in the Hibernate version you're using.)

So far in this chapter, we've shown how you can store and manipulate many objects with the most efficient strategy through cascading, bulk, and batch operations. We'll now consider interception and data filtering, and how you can hook into Hibernate's processing in a transparent fashion.
