Chapter 20. Scaling Hibernate

In this chapter

  • Performing bulk and batch data operations
  • Improving scalability with the shared cache

You use object/relational mapping to move data into the application tier in order to use an object-oriented programming language to process that data. This is a good strategy when implementing a multiuser online transaction-processing application with small to medium size data sets involved in each unit of work.

On the other hand, operations that require massive amounts of data aren’t well suited to the application tier. You should move the operation closer to where the data lives, rather than the other way around. In an SQL system, the DML statements UPDATE and DELETE execute directly in the database and are often sufficient if you have to implement an operation that involves thousands of rows. More complex operations may require additional procedures to run inside the database; therefore, you should consider stored procedures as one possible strategy. You can fall back to JDBC and SQL at any time in Hibernate applications. We discussed some of these options earlier, in chapter 17. In this chapter, we show you how to avoid falling back to JDBC and how to execute bulk and batch operations with Hibernate and JPA.

A major justification for our claim that applications using an object/relational persistence layer outperform applications built using direct JDBC is caching. Although we argue passionately that most applications should be designed so that it’s possible to achieve acceptable performance without the use of a cache, there’s no doubt that for some kinds of applications, especially read-mostly applications or applications that keep significant metadata in the database, caching can have an enormous impact on performance. Furthermore, scaling a highly concurrent application to thousands of online transactions per second usually requires some caching to reduce the load on the database server(s). After discussing bulk and batch operations, we explore Hibernate’s caching system.

Major new features in JPA 2

  • Bulk update and delete operations, which translate directly into SQL UPDATE and DELETE statements, are now standardized and available in the JPQL, criteria, and SQL execution interfaces.
  • The configuration settings and annotations to enable a shared entity data cache are now standardized.

20.1. Bulk and batch processing

First we look at standardized bulk statements in JPQL, such as UPDATE and DELETE, and their equivalent criteria versions. After that, we repeat some of these operations with SQL native statements. Then, you learn how to insert and update a large number of entity instances in batches. Finally, we introduce the special org.hibernate.StatelessSession API.

20.1.1. Bulk statements in JPQL and criteria

The Java Persistence Query Language is similar to SQL. The main difference between the two is that JPQL uses class names instead of table names and property names instead of column names. JPQL also understands inheritance—that is, whether you’re querying with a superclass or an interface. The JPA criteria query facility supports the same query constructs as JPQL but in addition offers type-safe and easy programmatic statement creation.

The next statements we show you support updating and deleting data directly in the database, without the need to retrieve it into memory. We also show a statement that selects data and inserts it as new entity instances, directly in the database.

Updating and deleting entity instances

JPA offers DML operations that are a little more powerful than plain SQL. Let’s look at the first operation in JPQL: an UPDATE.

Listing 20.1. Executing a JPQL UPDATE statement
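A minimal form of this statement, assuming an Item entity with an active flag and a seller association (property names as used elsewhere in this chapter), might look like this:

```java
int updatedEntities =
    em.createQuery("update Item i set i.active = true" +
                   " where i.seller = :seller")
        .setParameter("seller", johndoe)
        .executeUpdate();
```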

This JPQL statement looks like an SQL statement, but it uses an entity name (class name) and property names. The aliases are optional, so you can also write update Item set active = true. You use the standard query API to bind named and positional parameters. The executeUpdate call returns the number of updated entity instances, which may be different from the number of updated database rows, depending on the mapping strategy.

This UPDATE statement only affects the database; Hibernate doesn’t update any Item instance you’ve already retrieved into the (current) persistence context. In the previous chapters, we’ve repeated that you should think about state management of entity instances, not how SQL statements are managed. This strategy assumes that the entity instances you’re referring to are available in memory. If you update or delete data directly in the database, what you’ve already loaded into application memory, into the persistence context, isn’t updated or deleted.

A pragmatic solution that avoids this issue is a simple convention: execute any direct DML operations first in a fresh persistence context. Then, use the Entity-Manager to load and store entity instances. This convention guarantees that the persistence context is unaffected by any statements executed earlier. Alternatively, you can selectively use the refresh() operation to reload the state of an entity instance in the persistence context from the database, if you know it’s been modified outside of the persistence context.

Bulk JPQL/criteria statements and the second-level cache

Executing a DML operation directly on the database automatically clears the optional Hibernate second-level cache. Hibernate parses your JPQL and criteria bulk operations and detects which cache regions are affected. Hibernate then clears the regions in the second-level cache. Note that this is a coarse-grained invalidation: although you may only update or delete a few rows in the ITEM table, Hibernate clears and invalidates all cache regions where it holds Item data.

This is the same operation with the criteria API:

CriteriaUpdate<Item> update =
    criteriaBuilder.createCriteriaUpdate(Item.class);
Root<Item> i = update.from(Item.class);
update.set(i.get(Item_.active), true);
update.where(
    criteriaBuilder.equal(i.get(Item_.seller), johndoe)
);

int updatedEntities = em.createQuery(update).executeUpdate();

Another benefit is that a JPQL UPDATE statement or a CriteriaUpdate works with inheritance hierarchies. The following statement marks all credit cards as stolen if the owner’s name starts with “J”:
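A sketch of such a statement, assuming the CreditCard entity has a stolenOn property for recording the date of the theft:

```java
int updatedCreditCards =
    em.createQuery("update CreditCard c set c.stolenOn = :now" +
                   " where c.owner like 'J%'")
        .setParameter("now", new Date())
        .executeUpdate();
```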

Hibernate knows how to execute this update, even if several SQL statements have to be generated or some data needs to be copied into a temporary table; it updates rows in several base tables (because CreditCard is mapped to several superclass and subclass tables).

JPQL UPDATE statements can reference only a single entity class, and criteria bulk operations may have only one root entity; you can’t write a single statement to update Item and CreditCard data simultaneously, for example. Subqueries are allowed in the WHERE clause, and any joins are allowed only in these subqueries.

You can update values of an embedded type: for example, update User u set u.homeAddress.street = .... You can’t update values of an embeddable type in a collection. This isn’t allowed: update Item i set i.images.title = ....

Hibernate Feature

Direct DML operations, by default, don’t affect any version or timestamp values in the affected entities (as standardized by JPA). But a Hibernate extension lets you increment the version number of directly modified entity instances:

int updatedEntities =
   em.createQuery("update versioned Item i set i.active = true")
      .executeUpdate();

The version of each updated Item entity instance will now be directly incremented in the database, indicating to any other transaction relying on optimistic concurrency control that you modified the data. (Hibernate doesn’t allow use of the versioned keyword if your version or timestamp property relies on a custom org.hibernate.usertype.UserVersionType.)

With the JPA criteria API, you have to increment the version yourself:

CriteriaUpdate<Item> update =
    criteriaBuilder.createCriteriaUpdate(Item.class);

Root<Item> i = update.from(Item.class);

update.set(i.get(Item_.active), true);

update.set(
    i.get(Item_.version),
    criteriaBuilder.sum(i.get(Item_.version), 1)
);

int updatedEntities = em.createQuery(update).executeUpdate();

The second bulk operation we introduce is DELETE:

em.createQuery("delete CreditCard c where c.owner like 'J%'")
   .executeUpdate();

CriteriaDelete<CreditCard> delete =
    criteriaBuilder.createCriteriaDelete(CreditCard.class);

Root<CreditCard> c = delete.from(CreditCard.class);

delete.where(
    criteriaBuilder.like(
        c.get(CreditCard_.owner),
        "J%"
    )
);

em.createQuery(delete).executeUpdate();

The same rules that apply to UPDATE statements and CriteriaUpdate apply to DELETE and CriteriaDelete: a single entity class only, optional aliases, no joins, and subqueries allowed only in the WHERE clause.

Another special JPQL bulk operation lets you create entity instances directly in the database.

Creating new entity instances
Hibernate Feature

Let’s assume that some of your customers’ credit cards have been stolen. You write two bulk operations to mark the day they were stolen (well, the day you discovered the theft) and to remove the compromised credit-card data from your records. Because you work for a responsible company, you have to report the stolen credit cards to the authorities and affected customers. Therefore, before you delete the records, you extract everything stolen and create a few hundred (or thousand) StolenCreditCard records. You write a new mapped entity class just for this purpose:
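A sketch of such a class, with the property names taken from the INSERT ... SELECT statement that follows (the property types are assumptions):

```java
@Entity
@Table(name = "STOLENCREDITCARD")
public class StolenCreditCard {

    @Id
    protected Long id;

    protected String owner;
    protected String cardNumber;
    protected String expMonth;
    protected String expYear;
    protected Long userId;
    protected String username;

    // Constructors, getters, and setters omitted
}
```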

Hibernate maps this class to the STOLENCREDITCARD table. Next, you need a statement that executes directly in the database, retrieves all compromised credit cards, and creates new StolenCreditCard records. This is possible with the Hibernate-only INSERT ... SELECT statement:

int createdRecords =
    em.createQuery(
        "insert into" +
        " StolenCreditCard(id, owner, cardNumber, expMonth, expYear," +
        " userId, username)" +
        " select c.id, c.owner, c.cardNumber, c.expMonth, c.expYear," +
        " u.id, u.username" +
        " from CreditCard c join c.user u where c.owner like 'J%'"
    ).executeUpdate();

This operation does two things. First, it selects the details of CreditCard records and the respective owner (a User). Second, it inserts the result directly into the table mapped by the StolenCreditCard class.

Note the following:

  • The properties that are the target of an INSERT ... SELECT (in this case, the StolenCreditCard properties you list) have to be for a particular subclass, not an (abstract) superclass. Because StolenCreditCard isn’t part of an inheritance hierarchy, this isn’t an issue.
  • The types returned by the projection in the SELECT must match the types required for the arguments of the INSERT.
  • In the example, the identifier property of StolenCreditCard is in the list of inserted properties and supplied through selection; it’s the same as the original CreditCard identifier value. Alternatively, you can map an identifier generator for StolenCreditCard; but this works only for identifier generators that operate directly inside the database, such as sequences or identity fields.
  • If the generated records are of a versioned class (with a version or timestamp property), a fresh version (zero, or the current timestamp) is also generated. Alternatively, you can select a version (or timestamp) value and add the version (or timestamp) property to the list of inserted properties.

The INSERT ... SELECT statement was, at the time of writing, not supported by the JPA or Hibernate criteria APIs.

JPQL and criteria bulk operations cover many situations in which you’d usually resort to plain SQL. In some cases, you may want to execute SQL bulk operations without falling back to JDBC.

20.1.2. Bulk statements in SQL

In the previous section, you saw JPQL UPDATE and DELETE statements. The primary advantage of these statements is that they work with class and property names and that Hibernate knows how to handle inheritance hierarchies and versioning when generating SQL. Because Hibernate parses JPQL, it also knows how to efficiently dirty-check and flush the persistence context before the query and how to invalidate second-level cache regions.

If JPQL doesn’t have the features you need, you can execute native SQL bulk statements:
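For example, a native SQL equivalent of the earlier JPQL update (the table and column names are assumptions based on the mappings used in this chapter):

```java
em.createNativeQuery("update ITEM set ACTIVE = true")
    .executeUpdate();
```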

With JPA native bulk statements, you must be aware of one important issue: Hibernate will not parse your SQL statement to detect the affected tables. This means Hibernate doesn’t know whether a flush of the persistence context is required before the query executes. In the previous example, Hibernate doesn’t know you’re updating rows in the ITEM table. Hibernate has to dirty-check and flush any entity instances in the persistence context when you execute the query; it can’t only dirty-check and flush Item instances in the persistence context.

You must consider another issue if you enable the second-level cache (if you don’t, don’t worry): Hibernate has to keep your second-level cache synchronized to avoid returning stale data, so it will invalidate and clear all second-level cache regions when you execute a native SQL UPDATE or DELETE statement. This means your second-level cache will be empty after this query!

Hibernate Feature

You can get more fine-grained control over dirty checking, flushing, and second-level cache invalidation with the Hibernate API for SQL queries:
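A sketch with the native Hibernate API, unwrapped from the EntityManager:

```java
Session session = em.unwrap(Session.class);

session.createNativeQuery("update ITEM set ACTIVE = true")
    .addSynchronizedEntityClass(Item.class)
    .executeUpdate();
```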

With the addSynchronizedEntityClass() method, you can let Hibernate know which tables are affected by your SQL statement, and Hibernate will clear only the relevant cache regions. Hibernate also knows that it has to flush only modified Item entity instances in the persistence context before the query.

Sometimes you can’t exclude the application tier in a mass data operation. You have to load data into application memory and work with the EntityManager to perform your updates and deletions, which brings us to batch processing.

20.1.3. Processing in batches

If you have to create or update a few hundred or thousand entity instances in one transaction and unit of work, you may run out of memory. Furthermore, you have to consider the time it takes for the transaction to complete. Most transaction managers have a low transaction timeout, in the range of seconds or minutes. The Bitronix transaction manager used for the examples in this book has a default transaction timeout of 60 seconds. If your unit of work takes longer to complete, you should first override this timeout for a particular transaction:
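For example, with the UserTransaction API (here, obtaining the UserTransaction through a hypothetical TM helper, and an illustrative timeout of five minutes):

```java
UserTransaction tx = TM.getUserTransaction();

// Must be called before begin(); affects only transactions
// subsequently started on this thread
tx.setTransactionTimeout(300);

tx.begin();
// ... long-running unit of work ...
tx.commit();
```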

This is the UserTransaction API. Only future transactions started on this thread will have the new timeout. You must set the timeout before you begin() the transaction.

Next, let’s insert a few thousand Item instances into the database in a batch.

Inserting entity instances in batches

Every transient entity instance you pass to EntityManager#persist() is added to the persistence context cache, as explained in section 10.2.8. To prevent memory exhaustion, you flush() and clear() the persistence context after a certain number of insertions, effectively batching the inserts.

Listing 20.2. Inserting a large number of entity instances
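A sketch of the procedure described by the annotations that follow (the Item constructor and resource-local transaction handling are assumptions):

```java
EntityManager em = emf.createEntityManager();
em.getTransaction().begin();

for (int i = 0; i < 100_000; i++) {
    Item item = new Item("Item #" + i);
    em.persist(item);
    if (i % 100 == 0) {
        em.flush();  // Execute the INSERT statements for this batch
        em.clear();  // Detach all Item instances, freeing memory
    }
}

em.getTransaction().commit();
em.close();
```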

  1. Create and persist 100,000 Item instances.
  2. After 100 operations, flush and clear the persistence context. This executes the SQL INSERT statements for 100 Item instances, and because they’re now in detached state and no longer referenced, the JVM garbage collection can reclaim that memory.

You should set the hibernate.jdbc.batch_size property in the persistence unit to the same size as your batch, here 100. With this setting, Hibernate will batch the INSERT statements at the JDBC level, with PreparedStatement#addBatch().
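In a persistence.xml, the setting looks like this:

```xml
<property name="hibernate.jdbc.batch_size" value="100"/>
```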

Batching interleaved SQL statements

A batch procedure persisting several different entity instances in an interleaved fashion, let’s say an Item, then a User, then another Item, another User, and so on, isn’t efficiently batched at the JDBC level. When flushing, Hibernate generates an insert into ITEM SQL statement, then an insert into USERS statement, then another insert into ITEM statement, and so on. Hibernate can’t execute a larger batch at once, given that each statement is different from the last. If you enable the property hibernate.order_inserts in the persistence unit configuration, Hibernate sorts the operations before trying to build a batch of statements. Hibernate then executes all INSERT statements for the ITEM table and all INSERT statements for the USERS table. Then, Hibernate can batch the statements at the JDBC level.

If you enable the shared second-level cache for the Item entity, you should then bypass the cache for your batch (insertion) procedure; see section 20.2.5.

A serious problem with mass insertions is contention on the identifier generator: every call of EntityManager#persist() must obtain a new identifier value. Typically, the generator is a database sequence, called once for every persisted entity instance. You have to reduce the number of database round trips for an efficient batch procedure.

Hibernate Feature

In section 4.2.5, we recommended the Hibernate-specific enhanced-sequence generator, not least because it supports certain optimizations ideal for batch operations. First, define the generator in the package-info.java metadata:

@org.hibernate.annotations.GenericGenerator(
  name = "ID_GENERATOR_POOLED",
  strategy = "enhanced-sequence",
  parameters = {
     @org.hibernate.annotations.Parameter(
        name = "sequence_name",
        value = "JPWH_SEQUENCE"
     ),
@org.hibernate.annotations.Parameter(
        name = "increment_size",
        value = "100"
     ),
     @org.hibernate.annotations.Parameter(
        name = "optimizer",
        value = "pooled-lo"
     )
})

Now use the generator with @GeneratedValue in your mapped entity classes.

With increment_size set to 100, the sequence produces the “next” values 100, 200, 300, 400, and so on. The pooled-lo optimizer in Hibernate generates intermediate values each time you call persist(), without another round trip to the database. Therefore, if the next value obtained from the sequence is 100, Hibernate will generate the identifier values 101, 102, 103, and so on in the application tier. Once the optimizer’s pool of 100 identifier values is exhausted, Hibernate obtains the next sequence value from the database, and the procedure repeats. This means you make only one round trip to the database for identifier values per batch of 100 insertions. Other identifier generator optimizers are available, but the pooled-lo optimizer covers virtually all use cases and is the easiest to understand and configure.

Be aware that an increment size of 100 will leave large gaps between numeric identifiers if another application uses the same sequence but doesn’t apply the same algorithm as Hibernate’s optimizer. This shouldn’t be too much of a concern; instead of being able to generate a new identifier value each millisecond for 300 million years, you might exhaust the number space in 3 million years.

You can use the same batching technique to update a large number of entity instances.

Updating entity instances in batches
Hibernate Feature

Imagine that you have to manipulate many Item entity instances and that the changes you need to make aren’t as trivial as setting a flag (which you’ve done with a single UPDATE JPQL statement previously). Let’s also assume that you can’t create a database stored procedure, for whatever reason (maybe because your application has to work on database-management systems that don’t support stored procedures). Your only choice is to write the procedure in Java and to retrieve a massive amount of data into memory to run it through the procedure.

This requires working in batches and scrolling through a query result with a database cursor, which is a Hibernate-only query feature. Please review our explanation of scrolling with cursors in section 14.3.3 and make sure database cursors are properly supported by your DBMS and JDBC driver. The following code loads 100 Item entity instances at a time for processing.

Listing 20.3. Updating a large number of entity instances
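A sketch of the procedure, using the Hibernate Session and ScrollableResults APIs (modifyItem() stands in for your actual processing logic):

```java
Session session = em.unwrap(Session.class);

ScrollableResults cursor =
    session.createQuery("select i from Item i")
        .scroll(ScrollMode.FORWARD_ONLY);

int count = 0;
while (cursor.next()) {
    Item item = (Item) cursor.get(0); // The record under the cursor
    modifyItem(item);                 // Your processing logic
    if (++count % 100 == 0) {
        session.flush();              // Execute batched UPDATE statements
        session.clear();              // Detach the processed instances
    }
}
cursor.close();
```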

  1. You use a JPQL query to load all Item instances from the database. Instead of retrieving the result of the query completely into application memory, you open an online database cursor.
  2. You control the cursor with the ScrollableResults API and move it along the result. Each call to next() forwards the cursor to the next record.
  3. The get(int i) call retrieves a single entity instance into memory: the record the cursor is currently pointing to.
  4. To avoid memory exhaustion, you flush and clear the persistence context before loading the next 100 records into it.

For the best performance, you should set the size of the property hibernate.jdbc.batch_size in the persistence unit configuration to the same value as your procedure batch: 100. Hibernate batches at the JDBC level all UPDATE statements executed while flushing. By default, Hibernate won’t batch at the JDBC level if you’ve enabled versioning for an entity class—some JDBC drivers have trouble returning the correct updated row count for batch UPDATE statements (Oracle is known to have this issue). If you’re sure your JDBC driver supports this properly, and your Item entity class has an @Version annotation, enable JDBC batching by setting the property hibernate.jdbc.batch_versioned_data to true. If you enable the shared second-level cache for the Item entity, you should then bypass the cache for your batch (update) procedure; see section 20.2.5.

Another option that avoids memory consumption in the persistence context (by effectively disabling it) is the org.hibernate.StatelessSession interface.

20.1.4. The Hibernate StatelessSession interface

Hibernate Feature

The persistence context is an essential feature of the Hibernate engine. Without a persistence context, you can’t manipulate entity state and have Hibernate detect your changes automatically. Many other things also wouldn’t be possible.

Hibernate offers an alternative interface, however, if you prefer to work with your database by executing statements. The statement-oriented interface, org.hibernate.StatelessSession, feels and works like plain JDBC, except that you get the benefit of mapped persistent classes and Hibernate’s database portability. The most interesting methods in this interface are insert(), update(), and delete(), which all map to the equivalent immediately executed JDBC/SQL operation.

Let’s write the same “update all item entity data” procedure from the earlier example with this interface.

Listing 20.4. Updating data with a StatelessSession
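A sketch of the same procedure with a StatelessSession (modifyItem() again stands in for your processing logic):

```java
StatelessSession statelessSession =
    emf.unwrap(SessionFactory.class).openStatelessSession();
statelessSession.getTransaction().begin();

ScrollableResults cursor =
    statelessSession.createQuery("select i from Item i")
        .scroll(ScrollMode.FORWARD_ONLY);

while (cursor.next()) {
    Item item = (Item) cursor.get(0); // Detached: no persistence context
    modifyItem(item);
    statelessSession.update(item);    // Executes an SQL UPDATE immediately
}
cursor.close();

statelessSession.getTransaction().commit();
statelessSession.close();
```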

  1. Open a StatelessSession on the Hibernate SessionFactory, which you can unwrap from an EntityManagerFactory.
  2. Use a JPQL query to load all Item instances from the database. Instead of retrieving the result of the query completely into application memory, open an online database cursor.
  3. Scroll through the result with the cursor, and retrieve an Item entity instance. This instance is in detached state; there is no persistence context!
  4. Because Hibernate doesn’t detect changes automatically without a persistence context, you have to execute SQL UPDATE statements manually.

Disabling the persistence context and working with the StatelessSession interface has some other serious consequences and conceptual limitations (at least, if you compare it to a regular EntityManager and org.hibernate.Session):

  • The StatelessSession doesn’t have a persistence context cache and doesn’t interact with any other second-level or query cache. There is no automatic dirty checking or SQL execution when a transaction commits. Everything you do results in immediate SQL operations.
  • No modification of an entity instance and no operation you call are cascaded to any associated instance. Hibernate ignores any cascading rules in your mappings. You’re working with instances of a single entity class.
  • You have no guaranteed scope of object identity. The same query executed twice in the same StatelessSession produces two different in-memory detached instances. This can lead to data-aliasing effects if you don’t carefully implement the equals() and hashCode() methods in your persistent classes.
  • Hibernate ignores any modifications to a collection that you mapped as an entity association (one-to-many, many-to-many). Only collections of basic or embeddable types are considered. Therefore, you shouldn’t map entity associations with collections—but only many-to-one or one-to-one—and handle the relationship through that side only. Write a query to obtain data you’d otherwise retrieve by iterating through a mapped collection.
  • Hibernate doesn’t invoke JPA event listeners and event callback methods for operations executed with StatelessSession. StatelessSession bypasses any enabled org.hibernate.Interceptor, and you can’t intercept it through the Hibernate core event system.

Good use cases for a StatelessSession are rare; you may prefer it if manual batching with a regular EntityManager becomes cumbersome.

In the next section, we introduce the Hibernate shared caching system. Caching data on the application tier is a complementary optimization that you can utilize in any sophisticated multiuser application.

20.2. Caching data

In this section, we show you how to enable, tune, and manage the shared data caches in Hibernate. The shared data cache is not the persistence context cache, which Hibernate never shares between application threads; for reasons explained in section 10.1.2, this isn’t optional. We call the persistence context a first-level cache. The shared data cache—the second-level cache—is optional, and although JPA standardizes some configuration settings and mapping metadata for shared caching, every vendor has a different solution for optimization. Let’s start with some background information and explore the architecture of Hibernate’s shared cache.

20.2.1. The Hibernate shared cache architecture

Hibernate Feature

A cache keeps a representation of current database state close to the application, either in memory or on disk of the application server machine. A cache is a local copy of the data and sits between your application and the database. Simplified, to Hibernate a cache looks like a map of key/value pairs. Hibernate can store data in the cache by providing a key and a value, and it can look up a value in the cache with a key.

Hibernate has several types of shared caches available. You may use a cache to avoid a database hit whenever the following take place:

  • The application performs an entity instance lookup by identifier (primary key); this may get a hit in the entity data cache. Initializing an entity proxy on demand is the same operation and, internally, may hit the entity data cache instead of the database. The cache key is the identifier value of the entity instance, and the cache value is the data of the entity instance (its property values). The actual data is stored in a special disassembled format, and Hibernate assembles an entity instance again when it reads from the entity data cache.
  • The persistence engine initializes a collection lazily; a collection cache may hold the elements of the collection. The cache key is the collection role: for example, “Item[1234]#bids” would be the bids collection of an Item instance with identifier 1234. The cache value in this case would be a set of Bid identifier values, the elements of the collection. (Note that this collection cache does not hold the Bid entity data, only the data’s identifier values!)
  • The application performs an entity instance lookup by a unique key attribute. This is a special natural identifier cache for entity classes with unique properties: for example, User#username. The cache key is the unique property, such as the username, and the cached value is the User’s entity instance identifier.
  • You execute a JPQL, criteria, or SQL query, and the result of the actual SQL query is already stored in the query result cache. The cache key is the rendered SQL statement including all its parameter values, and the cache value is some representation of the SQL result set, which may include entity identifier values.

It’s critically important to understand that the entity data cache is the only type of cache that holds actual entity data values. The other three cache types only hold entity identifier information. Therefore, it doesn’t make sense to enable the natural identifier cache, for example, without also enabling the entity data cache. A lookup in the natural identifier cache will, when a match is found, always involve a lookup in the entity data cache. We’ll further analyze this behavior below with some code examples.

As we hinted earlier, Hibernate has a two-level cache architecture.

Enabling reference storage for immutable data

Hibernate holds data in the second-level entity cache as a copy in a disassembled format and reassembles it when read from the cache. Copying data is an expensive operation, so as an optimization, Hibernate allows you to specify that immutable data may be stored as is rather than copied into the second-level cache. This is useful for reference data. Let’s say you have a City entity class with the properties zipcode and name, annotated @Immutable. If you enable the configuration property hibernate.cache.use_reference_entries in your persistence unit, Hibernate will try (though in some special cases it can’t) to store a reference to City directly in the second-level data cache. One caveat is that if you accidentally modify an instance of City in your application, the change will effectively write through to all concurrent users of the (local) cache region, because they all get the same reference.
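A sketch of such a reference-data class (the caching annotations shown are assumptions about your setup; the shared cache itself is covered in the following sections):

```java
@Entity
@org.hibernate.annotations.Immutable
@org.hibernate.annotations.Cache(
    usage = org.hibernate.annotations.CacheConcurrencyStrategy.READ_ONLY
)
public class City {

    @Id
    protected String zipcode;

    protected String name;

    // ...
}
```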

The second-level cache

You can see the various elements of Hibernate’s caching system in figure 20.1. The first-level cache is the persistence context cache, which we discussed in section 10.1.2. Hibernate does not share this cache between threads; each application thread has its own copy of the data in this cache. Hence, there are no issues with transaction isolation and concurrency when accessing this cache.

Figure 20.1. Hibernate’s two-level cache architecture

The second-level cache system in Hibernate may be process-scoped in the JVM or may be a cache system that can work in a cluster of JVMs. Multiple application threads may access the shared second-level caches concurrently. The cache concurrency strategy defines the transaction isolation details for entity data, collection elements, and natural identifier caches. Whenever an entry is stored or loaded in these caches, Hibernate will coordinate access with the configured strategy. Picking the right cache concurrency strategy for entity classes and their collections can be challenging, and we’ll guide you through the process with several examples later on.

The query result cache also has its own, internal strategy for handling concurrent access and keeping the cached results fresh and coordinated with the database. We show you how the query cache works and for which queries it makes sense to enable result caching.

The cache provider implements the physical caches as a pluggable system. For now, Hibernate forces you to choose a single cache provider for the entire persistence unit. The cache provider is responsible for handling physical cache regions—the buckets where the data is held on the application tier (in memory, in indexed files, or even replicated in a cluster). The cache provider controls expiration policies, such as when to remove data from a region by timeout, or keeping only the most-recently used data when the cache is full. The cache provider implementation may be able to communicate with other instances in a cluster of JVMs, to synchronize data in each instance’s buckets. Hibernate itself doesn’t handle any clustering of caches; this is fully delegated to the cache provider engine.

In this section, you set up caching on a single JVM with the Ehcache provider, a simple but very powerful caching engine (originally developed for Hibernate specifically as the easy Hibernate cache). We only cover some of Ehcache’s basic settings; consult its manual for more information.

Frequently, the first question many developers have about the Hibernate caching system is, “Will the cache know when data is modified in the database?” Let’s try to answer this question before you get hands-on with cache configuration and usage.

Caching and concurrency

If an application does not have exclusive access to the database, shared caching should only be used for data that changes rarely and for which a small window of inconsistency is acceptable after an update. When another application updates the database, your cache may contain stale data until it expires. The other application may be a database-triggered stored procedure or even an ON DELETE or ON UPDATE foreign key option. There is no way for Hibernate’s cache system to know when another application or trigger updates the data in the database; the database can’t send your application a message. (You could implement such a messaging system with database triggers and JMS, but doing so isn’t exactly trivial.) Therefore, using caching depends on the type of data and the freshness of the data required by your business case.

Let’s assume for a moment that your application has exclusive access to the database. Even then, you must ask similar questions, because a shared cache makes data retrieved from the database in one transaction visible to another transaction. What transaction isolation guarantees should the shared cache provide? The shared cache affects the isolation level of your transactions: whether you read only committed data, and whether reads are repeatable. For some data, it may be acceptable that updates by one application thread aren’t immediately visible to other application threads, allowing a window of inconsistency. This permits a much more efficient and aggressive caching strategy.

Start this design process with a diagram of your domain model, and look at the entity classes. Good candidates for caching are classes that represent

  • Data that changes rarely
  • Noncritical data (for example, content-management data)
  • Data that’s local to the application and not modified by other applications

Bad candidates include

  • Data that is updated often
  • Financial data, where decisions must be based on the latest update
  • Data that is shared with and/or written by other applications

These aren’t the only rules we usually apply. Many applications have a number of classes with the following properties:

  • A small number of instances (thousands, not millions) that all fit into memory
  • Each instance referenced by many instances of another class or classes
  • Instances that are rarely (or never) updated

We sometimes call this kind of data reference data. Examples of reference data are Zip codes, locations, static text messages, and so on. Reference data is an excellent candidate for shared caching, and any application that uses reference data heavily will benefit greatly from caching that data. You allow the data to be refreshed when the cache timeout period expires, and some small window of inconsistency is acceptable after an update. In fact, some reference data (such as country codes) may have an extremely large window of inconsistency or may be cached eternally if the data is read-only.

You must exercise careful judgment for each class and collection for which you want to enable caching. You have to decide which concurrency strategy to use.

Selecting a cache concurrency strategy

A cache concurrency strategy is a mediator: it’s responsible for storing items of data in the cache and retrieving them from the cache. This important role defines the transaction isolation semantics for that particular item. You’ll have to decide, for each persistent class and collection, which cache concurrency strategy to use if you want to enable the shared cache.

The four built-in Hibernate concurrency strategies represent decreasing levels of strictness in terms of transaction isolation:

  • TRANSACTIONAL—Available only in environments with a system transaction manager, this strategy guarantees full transactional isolation up to repeatable read, if supported by the cache provider. With this strategy, Hibernate assumes that the cache provider is aware of and participating in system transactions. Hibernate doesn’t perform any kind of locking or version checking; it relies solely on the cache provider’s ability to isolate data in concurrent transactions. Use this strategy for read-mostly data where it’s critical to prevent stale data in concurrent transactions, in the rare case of an update. This strategy also works in a cluster if the cache provider engine supports synchronous distributed caching.
  • READ_WRITE—Maintains read committed isolation where Hibernate can use a time-stamping mechanism; hence, this strategy only works in a non-clustered environment. Hibernate may also use a proprietary locking API offered by the cache provider. Enable this strategy for read-mostly data where it’s critical to prevent stale data in concurrent transactions, in the rare case of an update. You shouldn’t enable this strategy if data is concurrently modified (by other applications) in the database.
  • NONSTRICT_READ_WRITE—Makes no guarantee of consistency between the cache and the database. A transaction may read stale data from the cache. Use this strategy if data hardly ever changes (say, not every 10 seconds) and a window of inconsistency isn’t of critical concern. You configure the duration of the inconsistency window with the expiration policies of your cache provider. This strategy is usable in a cluster, even with asynchronous distributed caching. It may be appropriate if other applications change data in the same database.
  • READ_ONLY—Suitable for data that never changes. You get an exception if you trigger an update. Use it for reference data only.

With decreasing strictness come increasing performance and scalability. A clustered asynchronous cache with NONSTRICT_READ_WRITE can handle many more transactions than a synchronous cluster with TRANSACTIONAL. You have to evaluate carefully the performance of a clustered cache with full transaction isolation before using it in production. In many cases, you may be better off not enabling the shared cache for a particular class, if stale data isn’t an option!

You should benchmark your application with the shared cache disabled. Enable it for good candidate classes, one at a time, while continuously testing the scalability of your system and evaluating concurrency strategies. You must have automated tests available to judge the impact of changes to your cache setup. We recommend that you write these tests first, for the performance and scalability hotspots of your application, before you enable the shared cache.

With all this theory under your belt, it’s time to see how caching works in practice. First, you configure the shared cache.

20.2.2. Configuring the shared cache

You configure the shared cache in your persistence.xml configuration file.

Listing 20.5. Shared cache configuration in persistence.xml

Path: /model/src/main/resources/META-INF/persistence.xml

  1. The shared cache mode controls how entity classes of this persistence unit become cacheable. Usually you prefer to enable caching selectively for only some entity classes. Options: DISABLE_SELECTIVE, ALL, and NONE.
  2. Hibernate’s second-level cache system has to be enabled explicitly; it isn’t enabled by default. You can separately enable the query result cache; it’s disabled by default as well.
  3. Pick a provider for the second-level cache system. For Ehcache, add the org.hibernate:hibernate-ehcache Maven artifact dependency to your classpath. Then, choose how Hibernate uses Ehcache with this region factory setting; here you tell Hibernate to manage a single Ehcache instance internally as the second-level cache provider.
  4. Hibernate passes this property to Ehcache when the provider is started, setting the location of the Ehcache configuration file. All physical cache settings for cache regions are in this file.
  5. This controls how Hibernate disassembles and assembles entity state when data is stored and loaded from the second-level cache. The structured cache entry format is less efficient but necessary in a clustered environment. For a nonclustered second-level cache like the singleton Ehcache on this JVM, you can disable this setting and use a more efficient format.
  6. When you experiment with the second-level cache, you usually want to see what’s happening behind the scenes. Hibernate has a statistics collector and an API to access these statistics. For performance reasons, it’s disabled by default (and should be disabled in production).

The second-level cache system is now ready, and Hibernate will start Ehcache when you build an EntityManagerFactory for this persistence unit. Hibernate won’t cache anything by default, though; you have to enable caching selectively for entity classes and their collections.

20.2.3. Enabling entity and collection caching

We now look at entity classes and collections of the CaveatEmptor domain model and enable caching with the right concurrency strategy. In parallel, you’ll configure the necessary physical cache regions in the Ehcache configuration file.

First the User entity: this data rarely changes, but, of course, a user may change their user name or address from time to time. This isn’t critical data in a financial sense; few people make buying decisions based on a user’s name or address. A small window of inconsistency is acceptable when a user changes name or address information. Let’s say there is no problem if, for a maximum of one minute, the old information is still visible in some transactions. This means you can enable caching with the NONSTRICT_READ_WRITE strategy:

Path: /model/src/main/java/org/jpwh/model/cache/User.java

Hibernate Feature

The @Cacheable annotation enables the shared cache for this entity class, but a Hibernate annotation is necessary to pick the concurrency strategy. Hibernate stores and loads User entity data in the second-level cache, in a cache region named your.package.name.User. You can override the name with the region attribute of the @Cache annotation. (Alternatively, you can set a global region name prefix with the hibernate.cache.region_prefix property in the persistence unit.)

You also enable the natural identifier cache for the User entity with @org.hibernate.annotations.NaturalIdCache. The natural identifier properties are marked with @org.hibernate.annotations.NaturalId, and you have to tell Hibernate whether the property is mutable. This enables you to look up User instances by username without hitting the database.
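Putting these annotations together, the User mapping likely looks similar to the following sketch (the field names and the mutable natural identifier are assumptions):

```java
@Entity
@Cacheable
@org.hibernate.annotations.Cache(
    usage = org.hibernate.annotations.CacheConcurrencyStrategy.NONSTRICT_READ_WRITE
)
@org.hibernate.annotations.NaturalIdCache
public class User {

    @Id
    @GeneratedValue
    protected Long id;

    // The username is the natural identifier; it's unique but may change
    @org.hibernate.annotations.NaturalId(mutable = true)
    protected String username;

    // ...
}
```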

Next, configure the cache regions for both the entity data and the natural identifier caches in Ehcache:

Path: /model/src/main/resources/cache/ehcache.xml

<ehcache xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:noNamespaceSchemaLocation="http://ehcache.org/ehcache.xsd">

    <cache name="org.jpwh.model.cache.User"
           maxElementsInMemory="500"
           eternal="false"
           timeToIdleSeconds="30"
           timeToLiveSeconds="60"/>

    <cache name="org.jpwh.model.cache.User##NaturalId"
           maxElementsInMemory="500"
           eternal="false"
           timeToIdleSeconds="30"
           timeToLiveSeconds="60"/>

</ehcache>

You can store a maximum of 500 entries in each cache, and Ehcache won’t keep them eternally. Ehcache removes an element if it hasn’t been accessed for 30 seconds and removes even actively accessed entries after 1 minute. This guarantees that your window of inconsistency for cache reads is never more than 1 minute. In other words, each cache region holds up to the 500 most-recently used entries, none older than 1 minute, and shrinks automatically.

Let’s move on to the Item entity class. This data changes frequently, although you still have many more reads than writes. If the name or description of an item is changed, concurrent transactions should see this update immediately. Users make financial decisions, whether to buy an item, based on the description of an item. Therefore, READ_WRITE is an appropriate strategy:

Path: /model/src/main/java/org/jpwh/model/cache/Item.java

@Entity
@Cacheable
@org.hibernate.annotations.Cache(
    usage = org.hibernate.annotations.CacheConcurrencyStrategy.READ_WRITE
)
public class Item {

    // ...
}

Hibernate will coordinate reads and writes when Item changes are made, ensuring that you can always read committed data from the shared cache. If another application is modifying Item data directly in the database, all bets are off! You configure the cache region in Ehcache to expire cached Item data after 10 minutes without access, or after one hour at the latest, to avoid filling up the cache bucket with stale data:

Path: /model/src/main/resources/cache/ehcache.xml

<cache name="org.jpwh.model.cache.Item"
       maxElementsInMemory="5000"
       eternal="false"
       timeToIdleSeconds="600"
       timeToLiveSeconds="3600"/>

Consider the bids collection of the Item entity class: A particular Bid in the Item#bids collection is immutable, but the collection itself is mutable, and concurrent units of work need to see any addition or removal of a collection element immediately:

Path: /model/src/main/java/org/jpwh/model/cache/Item.java

public class Item {

    @OneToMany(mappedBy = "item")
    @org.hibernate.annotations.Cache(
        usage = org.hibernate.annotations.CacheConcurrencyStrategy.READ_WRITE
    )
    protected Set<Bid> bids = new HashSet<>();

    // ...
}

You configure the cache region with the same settings as for the entity class owning the collection, because each Item has one bids collection:

Path: /model/src/main/resources/cache/ehcache.xml

<cache name="org.jpwh.model.cache.Item.bids"
       maxElementsInMemory="5000"
       eternal="false"
       timeToIdleSeconds="600"
       timeToLiveSeconds="3600"/>

It’s critical to remember that the collection cache will not contain the actual Bid data. The collection cache only holds a set of Bid identifier values. Therefore, you must enable caching for the Bid entity as well. Otherwise, Hibernate may hit the cache when you start iterating through Item#bids, but then, due to cache misses, load each Bid separately from the database. This is a case where enabling the cache will result in more load on your database server!

We’ve said that Bids are immutable, so you can cache this entity data as READ_ONLY:

Path: /model/src/main/java/org/jpwh/model/cache/Bid.java

@Entity
@org.hibernate.annotations.Immutable
@Cacheable
@org.hibernate.annotations.Cache(
    usage = CacheConcurrencyStrategy.READ_ONLY
)
public class Bid {

    // ...
}

Even though Bids are immutable, you should configure an expiration policy for the cache region, to prevent old bid data from clogging up the cache eternally:

Path: /model/src/main/resources/cache/ehcache.xml

<cache name="org.jpwh.model.cache.Bid"
       maxElementsInMemory="100000"
       eternal="false"
       timeToIdleSeconds="600"
       timeToLiveSeconds="3600"/>

You’re now ready to test the cache and explore Hibernate’s caching behavior.

20.2.4. Testing the shared cache

Hibernate’s transparent caching behavior can be difficult to analyze. The API for loading and storing data is still the EntityManager, with Hibernate automatically writing and reading data in the cache. Of course, you can see actual database access by logging Hibernate’s SQL statements, but you should familiarize yourself with the org.hibernate.stat.Statistics API to obtain more information about a unit of work and see what’s going on behind the scenes. Let’s run through some examples to see how this works.

You enabled the statistics collector earlier in the persistence unit configuration, in section 20.2.2. You access the statistics of the persistence unit on the org.hibernate.SessionFactory:

Path: /examples/src/test/java/org/jpwh/test/cache/SecondLevel.java

Statistics stats =
    JPA.getEntityManagerFactory()
        .unwrap(SessionFactory.class)
        .getStatistics();

SecondLevelCacheStatistics itemCacheStats =
    stats.getSecondLevelCacheStatistics(Item.class.getName());
assertEquals(itemCacheStats.getElementCountInMemory(), 3);
assertEquals(itemCacheStats.getHitCount(), 0);

Here, you also get statistics for the data cache region for Item entities, and you can see that there are several entries already in the cache. This is a warm cache; Hibernate stored data in the cache when the application saved Item entity instances. However, the entities haven’t been read from the cache, and the hit count is zero.

When you now look up an Item instance by identifier, Hibernate attempts to read the data from the cache and avoids executing an SQL SELECT statement:

Path: /examples/src/test/java/org/jpwh/test/cache/SecondLevel.java

Item item = em.find(Item.class, ITEM_ID);
assertEquals(itemCacheStats.getHitCount(), 1);

You also have some User entity data in the cache, so initializing the Item#seller proxy hits the cache, too:

Path: /examples/src/test/java/org/jpwh/test/cache/SecondLevel.java
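A sketch of what this test likely does, assuming a getSeller() accessor on Item (an assumption for illustration):

```java
Item item = em.find(Item.class, ITEM_ID);

// Item#seller is a lazy proxy; calling a method on it triggers
// initialization. Hibernate resolves the User data from the
// second-level cache, so no SELECT hits the database.
String sellerName = item.getSeller().getUsername();
```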

When you iterate through the Item#bids collection, Hibernate uses the cache:

Path: /examples/src/test/java/org/jpwh/test/cache/SecondLevel.java

  1. The statistics tell you that there are three Item#bids collections in the cache (one for each Item). No successful cache lookups have occurred so far.
  2. The entity cache of Bid has five records, and you haven’t accessed it either.
  3. Initializing the collection reads the data from both caches.
  4. The cache found one collection as well as the data for its three Bid elements.

The special natural identifier cache for Users is not completely transparent. You need to call a method on the org.hibernate.Session to perform a lookup by natural identifier:

Path: /examples/src/test/java/org/jpwh/test/cache/SecondLevel.java

  1. The natural identifier cache region for Users has one element.
  2. The org.hibernate.Session API performs natural identifier lookup; this is the only API for accessing the natural identifier cache.
  3. You had a cache hit for the natural identifier lookup. The cache returned the identifier value “johndoe”.
  4. You also had a cache hit for the entity data of that User.
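The lookup by natural identifier likely looks like this sketch; byNaturalId() is the Hibernate-specific Session API mentioned above:

```java
User user = em.unwrap(org.hibernate.Session.class)
    .byNaturalId(User.class)
    .using("username", "johndoe")  // natural identifier property and value
    .load();                       // hits the natural ID and entity caches
```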

The statistics API offers much more information than we’ve shown in these simple examples; we encourage you to explore this API further. Hibernate collects information about all its operations, and these statistics are useful for finding hotspots such as the queries taking the longest time and the entities and collections most accessed.

Accessing statistics with JMX

You can analyze Hibernate statistics at runtime through the standard Java Management Extension (JMX) system. Bind the Hibernate Statistics object as an MBean; this is only a few lines of code with a dynamic proxy. We’ve included an example in org.jpwh.test.cache.SecondLevel.
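One possible way to expose the statistics via JMX, using javax.management.StandardMBean rather than a hand-written dynamic proxy (the object name here is an assumption):

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.StandardMBean;
import org.hibernate.SessionFactory;
import org.hibernate.stat.Statistics;

// ...
Statistics statistics = JPA.getEntityManagerFactory()
    .unwrap(SessionFactory.class)
    .getStatistics();

// Wrap the Statistics interface as a standard MBean and register it
MBeanServer mbeanServer = ManagementFactory.getPlatformMBeanServer();
mbeanServer.registerMBean(
    new StandardMBean(statistics, Statistics.class),
    new ObjectName("org.hibernate:type=Statistics")
);
```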

As mentioned at the beginning of this section, Hibernate transparently writes and reads the cached data. For some procedures, you need more control over cache usage, and you may want to bypass the caches explicitly. This is where cache modes come into play.

20.2.5. Setting cache modes

JPA standardizes control of the shared cache with several cache modes. The following EntityManager#find() operation, for example, doesn’t attempt a cache lookup and hits the database directly:

Path: /examples/src/test/java/org/jpwh/test/cache/SecondLevel.java

The default CacheRetrieveMode is USE; here, you override it for one operation with BYPASS.
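Such an operation likely passes the standard hint to find(); the property name is standardized by JPA:

```java
Map<String, Object> hints = new HashMap<>();
hints.put("javax.persistence.cache.retrieveMode", CacheRetrieveMode.BYPASS);

// Ignores the second-level cache and reads from the database
Item item = em.find(Item.class, ITEM_ID, hints);
```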

A more common usage of cache modes is the CacheStoreMode. By default, Hibernate puts entity data in the cache when you call EntityManager#persist(). It also puts data in the cache when you load an entity instance from the database. But if you store or load a large number of entity instances, you may not want to fill up the available cache. This is especially important for batch procedures, as we showed earlier in this chapter.

You can disable storage of data in the shared entity cache for the entire unit of work by setting a CacheStoreMode on the EntityManager:

Path: /examples/src/test/java/org/jpwh/test/cache/SecondLevel.java
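A sketch of this setting; the property applies to all subsequent operations of this EntityManager:

```java
// Don't write any loaded or stored entity data to the shared cache
em.setProperty("javax.persistence.cache.storeMode", CacheStoreMode.BYPASS);
```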

Let’s look at the special cache mode CacheStoreMode.REFRESH. When you load an entity instance from the database with the default CacheStoreMode.USE, Hibernate first asks the cache whether it already has the data of the loaded entity instance. If the cache already contains the data, Hibernate doesn’t put the loaded data into the cache again. This avoids a cache write, assuming that cache reads are cheaper. With the REFRESH mode, Hibernate always puts loaded data into the cache, without first querying the cache.

In a cluster with synchronous distributed caching, writing to all cache nodes is usually a very expensive operation. In fact, with a distributed cache, you should set the configuration property hibernate.cache.use_minimal_puts to true. This optimizes second-level cache operation to minimize writes, at the cost of more frequent reads. If, however, there is no difference for your cache provider and architecture between reads and writes, you may want to disable the additional read with CacheStoreMode.REFRESH. (Note that some cache providers in Hibernate may set use_minimal_puts themselves: for example, with Ehcache this setting is enabled by default.)

Cache modes, as you’ve seen, can be set on the find() operation and for the entire EntityManager. You can also set cache modes on the refresh() operation and on individual Query instances as hints, as discussed in section 14.5. The per-operation and per-query settings override the cache mode of the EntityManager.

The cache mode only influences how Hibernate works with the caches internally. Sometimes you want to control the cache system programmatically: for example, to remove data from the cache.

20.2.6. Controlling the shared cache

The standard JPA interface for controlling the caches is the Cache API:

Path: /examples/src/test/java/org/jpwh/test/cache/SecondLevel.java

EntityManagerFactory emf = JPA.getEntityManagerFactory();
Cache cache = emf.getCache();

assertTrue(cache.contains(Item.class, ITEM_ID));
cache.evict(Item.class, ITEM_ID);
cache.evict(Item.class);
cache.evictAll();

This is a simple API, and it only allows you to access cache regions of entity data. You need the org.hibernate.Cache API to access the other cache regions, such as the collection and natural identifier cache regions:

Path: /examples/src/test/java/org/jpwh/test/cache/SecondLevel.java

org.hibernate.Cache hibernateCache =
    cache.unwrap(org.hibernate.Cache.class);

assertFalse(hibernateCache.containsEntity(Item.class, ITEM_ID));
hibernateCache.evictEntityRegions();
hibernateCache.evictCollectionRegions();
hibernateCache.evictNaturalIdRegions();
hibernateCache.evictQueryRegions();

You’ll rarely need these control mechanisms. Also, note that eviction of the second-level cache is nontransactional: that is, Hibernate doesn’t lock the cache regions during eviction.

Let’s move on to the last part of the Hibernate caching system: the query result cache.

20.2.7. The query result cache

The query result cache is by default disabled, and every JPA, criteria, or native SQL query you write always hits the database first. In this section, we show you why Hibernate disables the query cache by default and then how to enable it for particular queries when needed.

The following procedure executes a JPQL query and stores the result in a special cache region for query results:

Path: /examples/src/test/java/org/jpwh/test/cache/SecondLevel.java

  1. You have to enable caching for a particular query. Without the org.hibernate.cacheable hint, the result won’t be stored in the query result cache.
  2. Hibernate executes the SQL query and retrieves the result set into memory.
  3. Using the statistics API, you can find out more details. This is the first time you execute this query, so you get a cache miss, not a hit. Hibernate puts the query and its result into the cache. If you run the same query again, the result will be from the cache.
  4. The entity instance data retrieved in the result set is stored in the entity cache region, not in the query result cache.
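The query from the notes above is likely set up like this sketch; only the hint name is Hibernate-specific:

```java
TypedQuery<Item> query = em.createQuery("select i from Item i", Item.class);

// Without this hint, the result is never stored in the query result cache
query.setHint("org.hibernate.cacheable", true);

List<Item> items = query.getResultList();
```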

The org.hibernate.cacheable hint is set on the Query API, so it also works for criteria and native SQL queries. Internally, the cache key is the SQL string Hibernate uses to access the database, with argument values rendered into the string where you had parameter markers.

The query result cache doesn’t contain the entire result set of the SQL query. In the last example, the SQL result set contained rows from the ITEM table. Hibernate ignores most of the information in this result set; only the ID value of each ITEM record is stored in the query result cache. The property values of each Item are stored in the entity cache region.

Now, when you execute the same query again, with the same argument values for its parameters, Hibernate first accesses the query result cache. It retrieves the identifier values of the ITEM records from the cache region for query results. Then, Hibernate looks up and assembles each Item entity instance by identifier from the entity cache region. If you query for entities and decide to enable caching, make sure you also enable regular data caching for these entities. If you don’t, you may end up with more database hits after enabling the query result cache!

If you cache the result of a query that doesn’t return entity instances but returns only scalar or embeddable values (for example, select i.name from Item i or select u.homeAddress from User), the values are held in the query result cache region directly.

The query result cache uses two physical cache regions:

Path: /model/src/main/resources/cache/ehcache.xml

<cache name="org.hibernate.cache.internal.StandardQueryCache"
       maxElementsInMemory="500"
       eternal="false"
       timeToIdleSeconds="600"
       timeToLiveSeconds="3600"/>

<cache name="org.hibernate.cache.spi.UpdateTimestampsCache"
       maxElementsInMemory="50"
       eternal="true"/>

The first cache region is where the query results are stored. You should let the cache provider expire cached result sets over time, so the cache uses the available space for the most recently executed queries.

The second region, org.hibernate.cache.spi.UpdateTimestampsCache, is special: Hibernate uses this region to decide whether a cached query result set is stale. When you re-execute a query with caching enabled, Hibernate looks in the timestamp cache region for the timestamp of the most recent insert, update, or delete made to the queried table(s). If the timestamp found is later than the timestamp of the cached query results, Hibernate discards the cached results and issues a new database query. This effectively guarantees that Hibernate won’t use the cached query result if any table that may be involved in the query contains updated data; hence, the cached result may be stale. You should disable expiration of the update timestamp cache so that the cache provider never removes an element from this cache. The maximum number of elements in this cache region depends on the number of tables in your mapped model.

The majority of queries don’t benefit from result caching. This may come as a surprise. After all, it sounds like avoiding a database hit is always a good thing. There are two good reasons this doesn’t always work for arbitrary queries, compared to entity retrieval by identifier or collection initialization.

First, you must ask how often you’re going to execute the same query repeatedly, with the same arguments. Granted, your application may execute a few queries repeatedly with exactly the same arguments bound to parameters and the same automatically generated SQL statement. We consider this a rare case, but when you’re certain you’re executing a query repeatedly, it becomes a good candidate for result set caching.

Second, for applications that perform many queries and few inserts, deletes, or updates, caching query results can improve performance and scalability. On the other hand, if the application performs many writes, Hibernate won’t use the query result cache efficiently. Hibernate expires a cached query result set when there is any insert, update, or delete of any row of a table that appears in the cached query result. This means cached results may have a short lifetime, and even if you execute a query repeatedly, Hibernate won’t use cached results due to concurrent modifications of rows in the tables referenced by the query.

For many queries, the benefit of the query result cache is nonexistent or, at least, doesn’t have the impact you’d expect. But if your query restriction is on a unique natural identifier, such as select u from User u where u.username = ?, you should consider natural identifier caching and lookup as shown earlier in this chapter.

20.3. Summary

  • You saw options you need to scale up your application and handle many concurrent users and larger data sets.
  • With bulk UPDATE and DELETE operations, you can modify data directly in the database and still benefit from JPQL and criteria APIs without falling back to SQL.
  • You learned about batch operations that let you work with large numbers of records in the application tier.
  • We discussed the Hibernate caching system in detail: how you can selectively enable and optimize shared caching of entity, collection, and query result data.
  • You configured Ehcache as a cache provider and learned how to peek under the hood with the Hibernate statistics API.