Chapter 5. Querying the Data Grid

So far, you have learned how to access data in the Coherence cache using identity-based operations, such as get and put. In many cases this will be exactly what you need, but there will also be many other situations where you either won't know an object's identity or you will simply need to look up one or more objects based on attributes other than the identity.

Coherence allows you to do that by specifying a filter for the set-based operations defined by the QueryMap interface, which we mentioned briefly in Chapter 3:

public interface QueryMap extends Map {
    Set keySet(Filter filter);
    Set entrySet(Filter filter);
    Set entrySet(Filter filter, Comparator comparator);
    ...
}

As you can see from the previous interface definition, all three methods accept a filter as the first argument, which is an instance of a class implementing a very simple com.tangosol.util.Filter interface:

public interface Filter {
    boolean evaluate(Object o);
}

Basically, the Filter interface defines a single method, evaluate, which takes an object to evaluate as an argument and returns true if the specified object satisfies the criteria defined by the filter, or false if it doesn't.

This mechanism is very flexible, as it allows you to filter your cached objects any way you want. For example, it would be quite simple to implement a filter that can be used to retrieve all the account transactions in a specific period:

// filters are sent to other cluster members, so they must be serializable
public class TransactionFilter implements Filter, Serializable {
    private Long m_accountId;
    private Date m_from;
    private Date m_to;

    public TransactionFilter(Long accountId, Date from, Date to) {
        m_accountId = accountId;
        m_from = from;
        m_to = to;
    }

    public boolean evaluate(Object o) {
        Transaction tx = (Transaction) o;
        return tx.getId().getAccountId().equals(m_accountId)
               && tx.getTime().compareTo(m_from) >= 0
               && tx.getTime().compareTo(m_to) <= 0;
    }
}

While the previous filter implementation is perfectly valid and will return correct results when executed against the transactions cache, it would be very cumbersome if you had to implement a custom filter class for every single query criterion in the application.

Fortunately, Coherence provides a number of built-in filters that make custom filter implementation unnecessary in the vast majority of cases.

Built-in filters

Most queries can be expressed in terms of object attributes and standard logical and relational operators, such as AND, OR, equals, less than, greater than, and so on. For example, if we wanted to find all the transactions for an account, it would be much easier if we could simply execute a query analogous to the SQL statement select * from Transactions where account_id = 123 than to write a custom filter that checks whether the accountId attribute is equal to 123.

The good news is that Coherence has a number of built-in filters that allow us to do exactly that. The following table lists all the filters from the com.tangosol.util.filter package that you can use to construct custom queries:

Filter Class          Description

EqualsFilter          Compares two values for equality.
NotEqualsFilter       Compares two values for inequality.
IsNullFilter          Checks if the value is null.
IsNotNullFilter       Checks if the value is not null.
LessFilter            Checks if the first value is smaller than the second value.
LessEqualsFilter      Checks if the first value is smaller than or equal to the second value.
GreaterFilter         Checks if the first value is greater than the second value.
GreaterEqualsFilter   Checks if the first value is greater than or equal to the second value.
BetweenFilter         Checks if the value is within a certain range.
LikeFilter            Checks if the value matches a specified pattern, similar to the SQL LIKE operator.
InFilter              Checks if the value is within a specified collection.
InKeySetFilter        Checks if the entry key is within a specified collection.
ContainsFilter        Checks if the collection contains a specified element.
ContainsAllFilter     Checks if the collection contains all of the elements from another collection.
ContainsAnyFilter     Checks if the collection contains any of the elements from another collection.
AndFilter             Returns the logical AND of two other filters.
OrFilter              Returns the logical OR of two other filters.
XorFilter             Returns the logical exclusive OR of two other filters.
NotFilter             Negates the result of another filter.
LimitFilter           Allows paging through the results of another filter.
AlwaysFilter          Always returns true.
NeverFilter           Always returns false.
AllFilter             Returns the logical AND of all the filters in a filter array.
AnyFilter             Returns the logical OR of all the filters in a filter array.

As you can see, pretty much all of the standard Java logical operators and SQL predicates are covered. This will allow us to construct query expressions as complex as the ones we can define in Java code or the SQL where clause.

The bad news is that there is no query language in Coherence that allows you to specify a query as a string. Instead, you need to create the expression tree for the query programmatically, which can make things a bit tedious.

For example, the where clause of the SQL statement we specified earlier, select * from Transactions where account_id = 123, can be represented by the following Coherence filter definition:

Filter filter = new EqualsFilter("getId.getAccountId", 123);

In this case it is not too bad: we simply create an instance of an EqualsFilter that will extract the value of an accountId attribute from a Transaction.Id instance and compare it with 123. However, if we modify the query to filter transactions by date as well, the filter expression that we need to create becomes slightly more complex:

Filter filter = new AndFilter(
        new EqualsFilter("getId.getAccountId", accountId),
        new BetweenFilter("getTime", from, to));
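For example, adding just one more criterion forces another level of nesting (a sketch; the getStatus accessor is hypothetical and used only to illustrate the shape of the expression):

Filter filter = new AndFilter(
        new AndFilter(
                new EqualsFilter("getId.getAccountId", accountId),
                new BetweenFilter("getTime", from, to)),
        new EqualsFilter("getStatus", "POSTED"));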

If you need to combine several logical expressions, this can quickly get out of hand, so we will look for a way to simplify filter creation shortly. But first, let's talk about something we used in the previous examples without paying much attention to it: value extractors.

Value extractors

As you can see from the previous examples, a query is typically expressed in terms of object attributes, such as accountId or time, while the evaluate method defined by the Filter interface accepts a whole object that the attributes belong to, such as a Transaction instance.

That implies that we need a generic way to extract attribute values from an object instance; otherwise, there would be no way to define reusable filters, such as the ones listed in the earlier table that ship with Coherence, and we would be forced to implement a custom filter for each query we need to execute. In order to solve this problem and enable extraction of attribute values from an object, Coherence introduces value extractors.

A value extractor is an object that implements a com.tangosol.util.ValueExtractor interface:

public interface ValueExtractor {
    Object extract(Object target);
}

The sole purpose of a value extractor is to extract a derived value from the target object that is passed as an argument to the extract method. The result could be a single attribute value, a combination of multiple attributes (concatenation of first and last name, for example), or in general, a result of some transformation of a target object.
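For example, a minimal sketch of such a transformation (assuming a Person class with getFirstName and getLastName accessors) might look like this:

public class FullNameExtractor implements ValueExtractor, Serializable {
    public Object extract(Object target) {
        if (target == null) {
            return null;
        }
        // combine two attributes into a single derived value
        Person p = (Person) target;
        return p.getFirstName() + " " + p.getLastName();
    }
}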

Reflection extractor

In the vast majority of cases, you will want to extract a value of the single attribute of a target object, in which case you can use the built-in ReflectionExtractor class. The ReflectionExtractor accepts a method name as a constructor argument, invokes the specified method on a target object via reflection, and returns the result of that method invocation.

As a matter of fact, the ReflectionExtractor is used so often that you can simply specify a method name as a string in most places where a value extractor is expected and an instance of a ReflectionExtractor will be created automatically for you, which is what we took advantage of in the previous filter definitions. For example, the filter definition:

Filter filter = new BetweenFilter("getTime", from, to);

is really just a shorter form of:

Filter filter = new BetweenFilter(
        new ReflectionExtractor("getTime"),
        from, to);

I have to admit that, as useful as the ReflectionExtractor is, I have never liked it much. The main reason is that it forces you to spell out a full method name for an attribute, when a property name, as defined by the JavaBeans specification, should've been enough and would've made the code more readable. This is especially bothersome when accessing a Coherence cluster from a .NET client, in which case the 'get' prefix in front of the property name truly feels unnatural.

Fortunately, it is easy to fix the problem by implementing a similar value extractor that uses introspection to obtain an attribute value:

public class PropertyExtractor
        implements ValueExtractor, Serializable {
    private final String m_propertyName;
    private transient volatile Method m_readMethod;

    public PropertyExtractor(String propertyName) {
        m_propertyName = propertyName;
    }

    public Object extract(Object o) {
        if (o == null) {
            return null;
        }
        Class targetClass = o.getClass();
        try {
            // introspect (and cache) the read method for the property
            if (m_readMethod == null
                    || m_readMethod.getDeclaringClass() != targetClass) {
                PropertyDescriptor pd =
                        new PropertyDescriptor(m_propertyName, targetClass);
                m_readMethod = pd.getReadMethod();
            }
            return m_readMethod.invoke(o);
        }
        catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}

Now we can use PropertyExtractor instead of ReflectionExtractor in our filter definitions:

Filter filter = new BetweenFilter(
        new PropertyExtractor("time"), from, to);

In this example the difference is not significant, and it could even be argued that the PropertyExtractor makes the code harder to read, as we have to specify it explicitly instead of using a filter constructor that takes a string as an argument and creates a ReflectionExtractor for us. However, in the next section we will implement a helper class that makes filter creation much simpler, and the PropertyExtractor will allow us to make things as simple as they can be.

Note

Expression languages and value extractors

If you are familiar with any of the popular expression languages, such as MVEL, OGNL, or SpEL, you will notice that I could've easily implemented the previous value extractor using one of them. Not only would that allow me to do a simple property extraction, but I would be able to use much more sophisticated expressions for extraction.

Considering that I created SpEL (Spring Expression Language) while working on Spring.NET a few years back, you can imagine that I am a big proponent of their usage. To prove that, I have implemented value extractors for MVEL, OGNL, SpEL, Groovy, and even Java 6 Scripting in Coherence Tools, so you can easily use your favorite EL with Coherence.

Other built-in value extractors

While the ReflectionExtractor is definitely the one that is used most often, there are several other value extractors that ship with Coherence.

IdentityExtractor

The simplest extractor is the IdentityExtractor, which doesn't really extract anything from the target object, but returns the target object itself. This extractor can come in handy when you actually want filters to operate on the cache value itself instead of on one of its attributes, which is typically the case only if the value is of a simple type, such as one of the intrinsic numeric types or a string.
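For example, if a cache mapped product codes to plain Integer quantities, a query could compare the cache values directly (a minimal sketch; the quantities cache is hypothetical):

NamedCache quantities = CacheFactory.getCache("quantities");
// the filter is evaluated against the cache value itself
Filter filter = new GreaterFilter(IdentityExtractor.INSTANCE, 100);
Set results = quantities.entrySet(filter);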

ChainedExtractor and MultiExtractor

There are also two composite value extractors, ChainedExtractor and MultiExtractor. Both of them accept an array of value extractors as a constructor argument, but they use them differently.

The ChainedExtractor executes extractors one by one, using the result of the previous extractor as the target object to evaluate the next extractor against. For example, you can use the ChainedExtractor to extract the accountId attribute from a Transaction instance:

ValueExtractor ex =
        new ChainedExtractor(new ValueExtractor[] {
                new ReflectionExtractor("getId"),
                new ReflectionExtractor("getAccountId") });

This is necessary because the Transaction class does not expose the accountId attribute directly: we need to extract the id attribute from a transaction first, and then extract accountId from the Transaction.Id instance.

Note

Avoiding the need for chaining

Of course, we could've easily avoided the need for chaining if we simply exposed the accountId attribute directly on the Transaction class. Doing that is trivial and makes perfect sense in this case.

However, if I had done that, I'd have to come up with an example of ChainedExtractor usage that is outside of our domain.

When creating a ChainedExtractor you can also use a convenience constructor that will parse a dot-separated string and create an array of ReflectionExtractor instances automatically:

ValueExtractor ex =
        new ChainedExtractor("getId.getAccountId");

As a matter of fact, you don't even need to go that far: all built-in filters will automatically create a ChainedExtractor containing an array of ReflectionExtractors from a dot-separated string, which is the feature we relied on earlier when we defined a query that returns transactions for a specific account.

On the other hand, MultiExtractor will execute all extractors against the same target object and return the list of extracted values. While you will rarely use this extractor when querying the cache, it can be very convenient when you want to extract only a subset of an object's attributes during aggregation (which we'll discuss shortly), in order to minimize the amount of data that needs to be transferred across the wire.
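For example, a MultiExtractor that extracts only the code and the name of a Country might be defined like this (the accessor names are assumptions based on the Country class from Chapter 2):

ValueExtractor ex = new MultiExtractor(new ValueExtractor[] {
        new ReflectionExtractor("getCode"),
        new ReflectionExtractor("getName") });
// extraction returns a List containing the two attribute values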

PofExtractor

One of the features introduced in Coherence 3.5 is PofExtractor, an extractor that can be used to extract values from POF-serialized binaries without deserialization. This can provide a huge performance boost and a reduced memory footprint for queries that would otherwise have to deserialize every single object in the cache in order to evaluate the filter.

However, you will only see those benefits when working with caches containing large objects. For small objects, the overhead of initializing the structure that keeps track of the locations of serialized attributes within a binary POF value will likely be higher, in terms of both memory and performance, than full deserialization of the object.
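As a rough sketch, and assuming the Transaction serializer writes the time attribute at POF index 3, a query against serialized binaries might look like this (the index value is purely illustrative, and constructor signatures vary between Coherence releases):

// evaluated against the binary form, without deserializing the values
Filter filter = new BetweenFilter(new PofExtractor(3), from, to);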

Implementing a custom value extractor

While the built-in value extractors should be sufficient for most usage scenarios, there might be some situations where implementing a custom one makes sense. We have already implemented one custom value extractor, PropertyExtractor, in order to improve on the built-in ReflectionExtractor and allow ourselves to specify JavaBean property names instead of the full method names, but there are other scenarios when this might be appropriate.

One reason why you might want to implement a custom value extractor is to enable transformation of cache values from their native type to some other type. For example, most applications use UI controls such as drop-downs or list boxes to present a list of possible choices to the user. Let's assume that we need to display a list of countries in a drop-down on the registration screen for new customers.

We already have a cache containing all the countries, so we could easily get all the values from it and send them to the client, which would use them to populate the drop-down. However, the Country class we defined in Chapter 2 has a number of attributes we don't need in order to populate the drop-down list, such as capital, currency symbol, and currency name. The only attributes we do need are the country code and country name, so by sending any other information to the client we would only be wasting network bandwidth.

As a matter of fact, for most, if not all, drop-downs and list boxes in an application we will need only an identifier that will be returned as a selected value, and a description that should be used for display purposes. That means that we can define a class containing only those two attributes:

public class LookupValue implements Serializable {
    private Object m_id;
    private String m_description;

    public LookupValue(Object id, String description) {
        m_id = id;
        m_description = description;
    }

    public Object getId() {
        return m_id;
    }

    public String getDescription() {
        return m_description;
    }
}

Now that we have a holder class that can be used to represent any lookup value, the remaining question is how we can transform instances of the Country class into instances of the LookupValue class. The answer is simple: we can write a custom value extractor that will do it for us:

public class LookupValueExtractor
        extends AbstractExtractor
        implements PortableObject, Serializable {
    private ValueExtractor m_idExtractor;
    private ValueExtractor m_descriptionExtractor;

    public LookupValueExtractor(ValueExtractor idExtractor,
                                ValueExtractor descriptionExtractor) {
        m_idExtractor = idExtractor;
        m_descriptionExtractor = descriptionExtractor;
    }

    public Object extractFromEntry(Map.Entry entry) {
        Object id = InvocableMapHelper.extractFromEntry(
                m_idExtractor, entry);
        String description = (String) InvocableMapHelper.extractFromEntry(
                m_descriptionExtractor, entry);
        return new LookupValue(id, description);
    }

    // equals, hashCode, and serialization omitted for brevity
}

The implementation is actually very simple: we allow users to specify two value extractors, an idExtractor and a descriptionExtractor, that we use to extract the values that are used to create a LookupValue instance. However, one thing deserves clarification.

Instead of simply implementing the ValueExtractor interface, we are extending the AbstractExtractor class and implementing the extractFromEntry method. The reason for this is that we want to be able to extract id and description not only from the entry value, but from the entry key as well.

In order to achieve that, we rely on the InvocableMapHelper class, which provides a utility method that can be used to extract a value from any object that implements Map.Entry interface.

Of course, the LookupValueExtractor is only part of the story: we still need a way to execute this extractor against all the objects in the countries cache and get the collection of extracted lookup values back. We will see what the best way to do that is shortly, but for now let's return to Coherence filters and see how we can make complex queries easier to create.

Simplifying Coherence queries

As you have probably realized by now, Coherence queries can become quite cumbersome to create as the number of attributes used within the query grows, especially if non-default value extractors need to be used.

One, and possibly the best, approach would be to implement a real query language. We could define a grammar for Coherence queries that would be used to parse a SQL-like query string into a parse tree representing a Coherence filter. This would actually be fairly straightforward, as grammar elements would map pretty much directly to the built-in filters provided by Coherence.

However, this would distract us from the main topic of the book and lead us into the discussion of topics such as language grammars and parsers, so implementation of a full-blown Coherence query language is out of the scope of this book.

What we will do instead is implement a FilterBuilder class that will allow us to define the queries in a simpler way. While this approach won't allow us to express all possible queries, it will cover a large number of the most common use cases.

Filter builder

The idea behind the FilterBuilder implementation is that many queries are based on simple attribute comparisons, where multiple attribute comparisons are combined using the logical AND or, less often, the OR operator.

If you review the table of the built-in filter types at the beginning of this chapter, you will see that Coherence already provides all the core facilities we need to implement this: we have all the common comparison operators, as well as some of the less common ones, and there are AllFilter and AnyFilter, which allow us to create logical AND and OR filters for an array of filters respectively. What we don't have is an easy way to create an array of filters, and that's exactly what the FilterBuilder will help us do.

The goal is to be able to create a filter using code similar to this:

Filter filter = new FilterBuilder()
        .equals("id.accountId", 123)
        .between("time", from, to)
        .build();

This will allow us to define complex queries in a much shorter and significantly more readable way. In order to support the previous syntax, we can implement the FilterBuilder class as follows:

public class FilterBuilder {
    private Class defaultExtractorType;
    private List<Filter> filters = new ArrayList<Filter>();

    // constructors
    public FilterBuilder() {
        this(PropertyExtractor.class);
    }

    public FilterBuilder(Class defaultExtractorType) {
        this.defaultExtractorType = defaultExtractorType;
    }

    // public members
    public FilterBuilder equals(String propertyName, Object value) {
        return equals(createExtractor(propertyName), value);
    }

    public FilterBuilder equals(ValueExtractor extractor, Object value) {
        filters.add(new EqualsFilter(extractor, value));
        return this;
    }

    public FilterBuilder notEquals(String propertyName, Object value) {
        return notEquals(createExtractor(propertyName), value);
    }

    public FilterBuilder notEquals(ValueExtractor extractor, Object value) {
        filters.add(new NotEqualsFilter(extractor, value));
        return this;
    }

    public FilterBuilder greater(String propertyName, Comparable value) {
        return greater(createExtractor(propertyName), value);
    }

    public FilterBuilder greater(ValueExtractor extractor, Comparable value) {
        filters.add(new GreaterFilter(extractor, value));
        return this;
    }

    // and so on...
}

Basically, we are implementing two overloaded methods for each built-in filter: one that accepts a value extractor as the first argument, and one that accepts a string and creates a value extractor for it.

However, unlike the built-in filters, we do not create an instance of a ReflectionExtractor automatically, but delegate the actual creation of an extractor to the createExtractor factory method:

protected ValueExtractor createExtractor(String propertyName) {
    if (propertyName.indexOf('.') >= 0) {
        // String.split expects a regex, so the dot must be escaped
        return new ChainedExtractor(
                createExtractorArray(propertyName.split("\\.")));
    }
    if (propertyName.indexOf(',') >= 0) {
        return new MultiExtractor(
                createExtractorArray(propertyName.split(",")));
    }
    return createDefaultExtractor(propertyName);
}

As you can see, if the specified property name is a dot-separated string, we will create an instance of a ChainedExtractor. Similarly, we will create an instance of a MultiExtractor for a comma-separated list of property names.

For all other properties, we will delegate extractor creation to the createDefaultExtractor method:

protected ValueExtractor createDefaultExtractor(String propertyName) {
    try {
        // reflectively invoke the extractor's single-String constructor
        Constructor ctor = defaultExtractorType.getConstructor(String.class);
        return (ValueExtractor) ctor.newInstance(propertyName);
    }
    catch (Exception e) {
        throw new RuntimeException(e);
    }
}

This allows us to control on a case-by-case basis which value extractor should be used within our filter. In most cases, the default PropertyExtractor should work just fine, but you can easily change the behavior by specifying a different extractor class as a constructor argument:

Filter filter = new FilterBuilder(ReflectionExtractor.class)
        .equals("getCustomerId", 123)
        .greater("getTotal", 1000.0)
        .build();

You can even specify your own custom extractor class; the only requirement is that it implements a constructor that accepts a single string argument.

Obtaining query results

The easiest way to obtain query results is to invoke one of the QueryMap.entrySet methods:

Filter filter = ...;
Set<Map.Entry> results = cache.entrySet(filter);

This will return a set of Map.Entry instances representing both the key and the value of a cache entry, which is likely not what you want. More often than not you need only values, so you will need to iterate over the results and extract the value from each Map.Entry instance:

List values = new ArrayList(results.size());
for (Map.Entry entry : results) {
    values.add(entry.getValue());
}

After doing this a couple of times, you will probably want to create a utility method for the task. Because all queries should be encapsulated within the various repository implementations, we can simply add the following utility methods to our AbstractCoherenceRepository class:

public abstract class AbstractCoherenceRepository<K, V extends Entity<K>> {
    ...
    protected Collection<V> queryForValues(Filter filter) {
        Set<Map.Entry<K, V>> entries = getCache().entrySet(filter);
        return extractValues(entries);
    }

    protected Collection<V> queryForValues(Filter filter,
                                           Comparator comparator) {
        Set<Map.Entry<K, V>> entries =
                getCache().entrySet(filter, comparator);
        return extractValues(entries);
    }

    private Collection<V> extractValues(Set<Map.Entry<K, V>> entries) {
        List<V> values = new ArrayList<V>(entries.size());
        for (Map.Entry<K, V> entry : entries) {
            values.add(entry.getValue());
        }
        return values;
    }
}

Note

What happened to the QueryMap.values() method?

Obviously, things would be a bit simpler if the QueryMap interface also had an overloaded version of the values method that accepts a filter and, optionally, a comparator as arguments.

I'm not sure why this functionality is missing from the API, but I hope it will be added in one of the future releases. In the meantime, a simple utility method is all it takes to provide the missing functionality, so I am not going to complain too much.

Controlling query scope using data affinity

As we discussed in the previous chapter, data affinity can provide a significant performance boost because it allows Coherence to optimize queries for related objects. Instead of executing the query in parallel across all the nodes and aggregating the results, Coherence can simply execute it on a single node, because data affinity guarantees that all the results will be on that particular node. This effectively reduces the number of objects searched to approximately C/N, where C is the total number of objects in the cache the query is executed against, and N is the number of partitions in the cluster.

However, this optimization is not automatic: you have to target the partition to search explicitly, using KeyAssociatedFilter:

Filter query = ...;
Filter filter = new KeyAssociatedFilter(query, key);

In the previous example, we create a KeyAssociatedFilter that wraps the query we want to execute. The second argument to its constructor is the cache key that determines the partition to search.

To make all of this more concrete, let's look at the final implementation of the code for our sample application that returns account transactions for a specific period. First, we need to add the getTransactions method to our Account class:

public Collection<Transaction> getTransactions(Date from, Date to) {
    return getTransactionRepository().findTransactions(m_id, from, to);
}

Finally, we need to implement the findTransactions method within the CoherenceTransactionRepository:

public Collection<Transaction> findTransactions(
        Long accountId, Date from, Date to) {
    Filter filter = new FilterBuilder()
            .equals("id.accountId", accountId)
            .between("time", from, to)
            .build();
    return queryForValues(
            new KeyAssociatedFilter(filter, accountId),
            new DefaultTransactionComparator());
}

As you can see, we target the query using the account identifier and ensure that the results are sorted by transaction number by passing DefaultTransactionComparator to the queryForValues helper method we implemented earlier. This ensures that Coherence looks for transactions only within the partition that the account with the specified id belongs to.

Querying near cache

One situation where a direct query using the entrySet method might not be appropriate is when you need to query a near cache.

Because there is no way for Coherence to determine if all the results are already in the front cache, it will always execute the query against the back cache and return all the results over the network, even if some or all of them are already present in the front cache. Obviously, this is a waste of network bandwidth.

What you can do in order to optimize the query is to obtain the keys first and then retrieve the entries by calling the CacheMap.getAll method:

Filter filter = ...;
Set keys = cache.keySet(filter);
Map results = cache.getAll(keys);

The getAll method will try to satisfy as many results as possible from the front cache and delegate to the back cache to retrieve only the missing ones. This will ensure that we move the bare minimum of data across the wire when executing queries, which will improve the throughput.

However, keep in mind that this approach might increase latency, as you are making two network roundtrips instead of one, unless all results are already in the front cache. In general, if the expected result set is relatively small, it might make more sense to move all the results over the network using a single entrySet call.

Another potential problem with the idiom used for near cache queries is that it could return invalid results. There is a possibility that some of the entries might change between the calls to keySet and getAll. If that happens, getAll might return entries that do not satisfy the filter anymore, so you should only use this approach if you know that this cannot happen (for example, if objects in the cache you are querying, or at least the attributes that the query is based on, are immutable).
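When the immutability condition does hold, the idiom is easy to package as a reusable helper; a minimal sketch:

// Near cache query helper: queries the back cache for keys only, then
// satisfies as many values as possible from the front cache.
protected Map queryNearCache(NamedCache cache, Filter filter) {
    Set keys = cache.keySet(filter);
    return cache.getAll(keys);
}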

Sorting the results

We have already seen that the entrySet method allows you to pass a Comparator as a second argument, which will be used to sort the results. If your objects implement the Comparable interface you can also specify null as a second argument and the results will be sorted based on their natural ordering. For example, if we defined the natural sort order for transactions by implementing Comparable within our Transaction class, we could've simply passed null instead of a DefaultTransactionComparator instance within the findTransactions implementation shown earlier.

On the other hand, if you use near cache query idiom, you will have to sort the results yourself. This is again an opportunity to add utility methods that allow you to query near cache and to optionally sort the results to our base repository class. However, there is a lot more to cover in this chapter, so I will leave this as an exercise for the reader.

Paging over query results

The LimitFilter is somewhat special and deserves a separate discussion. Unlike other filters, which are used to compose query criteria, the LimitFilter is used to control how many result items are returned at a time. Basically, it allows you to page through query results n items at a time.

This also implies that unlike other filters, which are constructed, executed, and discarded, an instance of a LimitFilter is something you might need to hold on to for an extended period of time, as it is a mutable object that keeps track of the current page number, top and bottom anchor objects, and other state that is necessary to support paging.

Let's look at a simple example to better demonstrate the proper usage of a LimitFilter:

NamedCache countries = CacheFactory.getCache("countries");
LimitFilter filter = new LimitFilter(
        new LikeFilter("getName", "B%"), 5);

Set<Map.Entry> entries = countries.entrySet(filter, null);
// contains countries 1-5 whose name starts with the letter 'B'

filter.nextPage();
entries = countries.entrySet(filter, null);
// contains countries 6-10 whose name starts with the letter 'B'

filter.setPage(4);
entries = countries.entrySet(filter, null);
// contains countries 21-25 whose name starts with the letter 'B'

As you can see, you can page through the result by executing the same query over and over again and modifying the current page of the LimitFilter between the query executions by calling the nextPage, previousPage, or setPage method.

The LimitFilter is extremely powerful as it allows you to execute the main query only once and then obtain the results in chunks of the size you specify. It maps very nicely to a common requirement for results paging within a web application, allowing you to bring the web server only the data it needs to generate the current page, thus reducing network traffic and improving application performance and scalability. You can safely store an instance of a LimitFilter within a user's HTTP session and reuse it later when the user navigates to another page of the results.

One thing to note in the preceding example is that we are using the entrySet method to retrieve the results, contrary to what we discussed in the previous section. The reason is that we want to return countries sorted by name (natural order), and as I mentioned earlier, if we need to support paging over sorted results we have no option but to sort them within the cluster, using an overload of the entrySet method that accepts a comparator.

However, this is really not an issue, as the amount of data sent over the wire will be naturally limited by the LimitFilter itself and will typically be very small, so we don't need to optimize the query for the near caching scenario.

Using indexes to improve query performance

Just as you can use indexes to improve query performance against a relational database, you can use them to improve the performance of a Coherence query. That is not to say that Coherence indexes are the same as database indexes; in fact, they are very different, and we'll discuss how indexes are implemented in Coherence shortly.

However, they are similar in the way they work, as they allow the query processor to optimize queries by:

  1. Limiting the number of entries that have to be evaluated by the filter

  2. Avoiding the need for object deserialization by providing the necessary information within the index itself

Both of these features are very important and can have a significant impact on query performance. For that reason, it is recommended that you always create indexes for the attributes that you query on.

Anatomy of an index

A Coherence index is an instance of a class that implements the com.tangosol.util.MapIndex interface:

public interface MapIndex {
    ValueExtractor getValueExtractor();
    boolean isOrdered();
    Map getIndexContents();
    Object get(Object key);
}

The getValueExtractor method returns the value extractor used to extract the attribute that should be indexed from an object, while the isOrdered method returns whether the index is sorted or not.

The get method allows us to obtain the value of the indexed attribute for the specified key directly from the index, which avoids object deserialization and repeated value extraction.

Finally, the getIndexContents method returns the actual index contents. This is a map that uses the value extracted from the indexed attribute as a key, while the value for each index entry is a set of cache keys corresponding to that attribute value.

Looking at an example should make the previous paragraph much easier to understand.

Let's assume that we have the following entries in the cache:

Key   Value

1     Person(firstName = 'Aleksandar', lastName = 'Seovic')
2     Person(firstName = 'Marija', lastName = 'Seovic')
3     Person(firstName = 'Ana Maria', lastName = 'Seovic')
4     Person(firstName = 'Novak', lastName = 'Seovic')
5     Person(firstName = 'Aleksandar', lastName = 'Jevic')

If we create an index on the lastName attribute, our index contents will look like this:

Key      Value

Jevic    { 5 }
Seovic   { 1, 2, 3, 4 }

On the other hand, an index on the firstName attribute will look like this:

Key          Value

Aleksandar   { 1, 5 }
Ana Maria    { 3 }
Marija       { 2 }
Novak        { 4 }

Note

Index internals

Keep in mind that while I'm showing the actual values in the previous examples, the Coherence index actually stores both keys and values in their internal binary format.

For the most part you shouldn't care about this fact, but it is good to know if you end up accessing index contents directly.

The previous example should also make it obvious why indexes have such a profound effect on query performance.

If we wanted to obtain a list of keys for all the people in the cache with the last name 'Seovic', without an index Coherence would have to deserialize each cache entry, extract the lastName attribute, and perform a comparison; for each entry that matches, it would add the cache key to the resulting list of keys.

With an index, Coherence doesn't need to do any of this: it will simply look up the index entry with the key Seovic and return the set of keys from that index entry.

Creating indexes

Now that we know how Coherence indexes are structured and why we should use them, let's look at how we can create them.

At the beginning of this chapter, we showed an incomplete definition of a QueryMap interface. What we omitted from it are the two methods that allow us to create and remove cache indexes:

public interface QueryMap extends Map {
    Set keySet(Filter filter);
    Set entrySet(Filter filter);
    Set entrySet(Filter filter, Comparator comparator);

    void addIndex(ValueExtractor extractor,
                  boolean isOrdered,
                  Comparator comparator);
    void removeIndex(ValueExtractor extractor);
}

As you can see, in order to create an index you need to specify three things:

  • The value extractor that should be used to retrieve the attribute value to use as the index key

  • The flag specifying whether the index should be ordered or not

  • Finally, the comparator to use for ordering

The first argument is by far the most important, as it determines index contents. It is also used as the index identifier, which is why you need to ensure that all value extractors you create implement the equals and hashCode methods properly.
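For example, the PropertyExtractor we implemented earlier is fully identified by its property name, so its equals and hashCode methods (omitted from the earlier listing) could be as simple as this:

public boolean equals(Object o) {
    return o instanceof PropertyExtractor
           && m_propertyName.equals(((PropertyExtractor) o).m_propertyName);
}

public int hashCode() {
    return m_propertyName.hashCode();
}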

If you decide to create an ordered index, index entries will be stored within a SortedMap instance, which introduces some overhead on index updates. Because of that, you should only create ordered indexes for attributes that are likely to be used for sorting query results or in range queries, such as greater, less, between, and so on.

The last argument allows you to specify a comparator to use for index ordering, but you can specify null if the attribute you are indexing implements the Comparable interface and you want the index to use natural ordering. Of course, if an index is not ordered, you should always specify null for this argument.

Now that we know all of this, let's see what the code to define indexes on firstName and lastName attributes from the previous example should look like:

NamedCache people = CacheFactory.getCache("people");
people.addIndex(new PropertyExtractor("firstName"), false, null);
people.addIndex(new PropertyExtractor("lastName"), true, null);

As you can see, adding indexes to a cache is very simple. In this case we have created an unordered index on the firstName attribute and an ordered index using natural string ordering on the lastName attribute.

The last thing you should know is that the call to the addIndex method is treated by Coherence as a hint that an index should be created. What this means in practice is that you can safely create the same set of indexes on each Coherence node, even if another node has already created those same indexes. If an index for the specified extractor already exists, Coherence will simply ignore all subsequent requests for its creation.

Coherence query limitations

You have probably noticed by now that all the filters we have used as examples are evaluated against a single cache. This is one of the limitations of the Coherence query mechanism: it is not possible to perform the equivalent of a table join and execute a query against the joined data.

However, while the ability to execute queries across multiple caches would come in handy occasionally, this is not too big a problem in practice. In most cases, you can perform any necessary table joins before loading the data into the cache, so you end up with all the information you need to query on in a single cache. Remember, the purpose of Coherence is to bring data closer to the application in a format that is easily consumable by the application, so transforming data from multiple tables into instances of a single aggregate is definitely something that you will be doing often.

That said, there will still be cases when you don't have all the data you need in a single cache, and you really, really need that join. In those cases the solution is to execute the query directly against the backend database and obtain a list of identifiers that you can use to retrieve objects from a cache.

Another important limitation you should be aware of is that Coherence queries only take into account objects that are already in the cache; they will not load any data from the database into the cache automatically. Because partial results are typically not what you want, this implies that you need to preload all the data into the cache before you start executing queries against it.

Alternatively, you can choose not to use Coherence queries at all and adopt the same approach as in the previous case, by querying the database directly in order to obtain identifiers for the objects in the result, and using those identifiers to look up objects in the cache. Of course, this assumes that your cache is configured to automatically load the missing objects from the database on gets, which is something you will learn how to do in Chapter 8, Implementing the Persistence Layer.
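A minimal sketch of this database-driven approach might look like the following (the SQL statement, the DataSource, and the helper name are purely illustrative):

protected Map findViaDatabase(NamedCache cache, DataSource ds, String sql)
        throws SQLException {
    List<Object> ids = new ArrayList<Object>();
    Connection con = ds.getConnection();
    try {
        ResultSet rs = con.createStatement().executeQuery(sql);
        while (rs.next()) {
            ids.add(rs.getObject(1)); // identifiers double as cache keys
        }
    }
    finally {
        con.close();
    }
    // assumes read-through is configured, or that the cache is preloaded
    return cache.getAll(ids);
}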

Aggregators

Coherence filters are great when you need to retrieve a subset of objects from the cache based on certain criteria, but there are also cases where you want to process those objects in order to return a single result.

For example, you might want to retrieve the total amount of all orders for a particular customer. One possible solution is to retrieve all the orders for the customer using a filter and to iterate over them on the client in order to calculate the total. While this will work, you need to consider the implications:

  1. You might end up moving a lot of data across the network in order to calculate a result that is only a few bytes long

  2. You will be calculating the result in a single-threaded fashion, which might introduce a performance bottleneck into your application

The better approach would be to calculate partial results on each cache node for the data it manages, and to aggregate those partial results into a single answer before returning it to the client. Fortunately, we can use Coherence aggregators to achieve exactly that.

By using an aggregator, we limit the amount of data that needs to be moved across the wire to the aggregator instance itself, the partial results returned by each Coherence node the aggregator is evaluated on, and the final result. This reduces the network traffic significantly and ensures that we use the network as efficiently as possible. It also allows us to perform the aggregation in parallel, using full processing power of the Coherence cluster.

At its most basic, an aggregator is an instance of a class that implements the com.tangosol.util.InvocableMap.EntryAggregator interface:

interface EntryAggregator extends Serializable {
    Object aggregate(Set set);
}

However, you will rarely need to implement this interface directly. Instead, you should extend the com.tangosol.util.aggregator.AbstractAggregator class, which also implements the com.tangosol.util.InvocableMap.ParallelAwareAggregator interface required to ensure that the aggregation is performed in parallel across the cluster.

The AbstractAggregator class has a constructor that accepts a value extractor to use and defines the three abstract methods you need to override:

public abstract class AbstractAggregator
        implements InvocableMap.ParallelAwareAggregator {
    public AbstractAggregator(ValueExtractor valueExtractor) {
        ...
    }

    protected abstract void init(boolean isFinal);
    protected abstract void process(Object value, boolean isFinal);
    protected abstract Object finalizeResult(boolean isFinal);
}

The init method is used to initialize the result of aggregation, the process method is used to process a single aggregation value and include it in the result, and the finalizeResult method is used to create the final result of the aggregation.

Because aggregators can be executed in parallel, the init and finalizeResult methods accept a flag specifying whether the result to initialize or finalize is the final result that should be returned by the aggregator or a partial result, returned by one of the parallel aggregators.

The process method also accepts an isFinal flag, but in its case the semantics are somewhat different: if the isFinal flag is true, the object to process is the result of a single parallel aggregator execution that needs to be incorporated into the final result. Otherwise, it is the value extracted from a target object using the value extractor that was specified as a constructor argument.

This will all be much clearer when we look at an example. Let's write a simple aggregator that returns an average value of a numeric attribute:

public class AverageAggregator
        extends AbstractAggregator {
    private transient double sum;
    private transient int count;

    public AverageAggregator() {
        // deserialization constructor
    }

    public AverageAggregator(ValueExtractor valueExtractor) {
        super(valueExtractor);
    }

    public AverageAggregator(String propertyName) {
        super(propertyName);
    }

    protected void init(boolean isFinal) {
        sum = 0;
        count = 0;
    }

    protected void process(Object value, boolean isFinal) {
        if (value != null) {
            if (isFinal) {
                // value is a partial result from a parallel aggregator
                PartialResult pr = (PartialResult) value;
                sum += pr.getSum();
                count += pr.getCount();
            }
            else {
                // value is a single extracted attribute value
                sum += ((Number) value).doubleValue();
                count++;
            }
        }
    }

    protected Object finalizeResult(boolean isFinal) {
        if (isFinal) {
            return count == 0 ? null : sum / count;
        }
        else {
            return new PartialResult(sum, count);
        }
    }

    static class PartialResult implements Serializable {
        private double sum;
        private int count;

        PartialResult(double sum, int count) {
            this.sum = sum;
            this.count = count;
        }

        public double getSum() {
            return sum;
        }

        public int getCount() {
            return count;
        }
    }
}

As you can see, the init method simply sets both the sum and the count fields to zero, completely ignoring the value of the isFinal flag. This is fine, as we want those values to start from zero whether we are initializing the main aggregator or one of the parallel aggregators.

The finalizeResult method, on the other hand, depends on the isFinal flag to decide which value to return. If it is true, it divides the sum by the count in order to calculate the average and returns it. The only exception is if the count is zero, in which case the result is undefined and the null value is returned.

However, if the isFinal flag is false, the finalizeResult simply returns an instance of a PartialResult inner class, which is nothing more than a holder for the partial sum and related count on a single node.

Finally, the process method also uses the isFinal flag to determine the correct behavior. If the flag is true, the value to be processed is a PartialResult instance, so the method reads the partial sum and count from it and adds them to the main aggregator's sum and count fields. Otherwise, it simply adds the value to the sum field and increments the count field by one.

We have implemented AverageAggregator in order to demonstrate with a simple example how the isFinal flag should be used to control the aggregation, as well as to show that the partial and the final result do not have to be of the same type. However, this particular aggregator is pretty much a throw-away piece of code, as we'll see in the next section.

Built-in aggregators

Just as with filters, Coherence ships with a number of useful built-in aggregators, and an equivalent of the AverageAggregator is one of them. Actually, there are two built-in average aggregators, as you can see in the following table:

Aggregator Class      Description

BigDecimalAverage     Calculates the average of a set of numeric values extracted from the cache entries and returns the result as a BigDecimal.
BigDecimalMax         Returns the maximum of a set of numeric values extracted from the cache entries, as a BigDecimal.
BigDecimalMin         Returns the minimum of a set of numeric values extracted from the cache entries, as a BigDecimal.
BigDecimalSum         Calculates the sum of a set of numeric values extracted from the cache entries and returns the result as a BigDecimal.
DoubleAverage         Calculates the average of a set of numeric values extracted from the cache entries and returns the result as a Double.
DoubleMax             Returns the maximum of a set of numeric values extracted from the cache entries, as a Double.
DoubleMin             Returns the minimum of a set of numeric values extracted from the cache entries, as a Double.
DoubleSum             Calculates the sum of a set of numeric values extracted from the cache entries and returns the result as a Double.
LongMax               Returns the maximum of a set of numeric values extracted from the cache entries, as a Long.
LongMin               Returns the minimum of a set of numeric values extracted from the cache entries, as a Long.
LongSum               Calculates the sum of a set of numeric values extracted from the cache entries and returns the result as a Long.
ComparableMax         Returns the maximum of a set of Comparable values extracted from the cache entries.
ComparableMin         Returns the minimum of a set of Comparable values extracted from the cache entries.
Count                 Returns the number of values in an entry set; equivalent to SQL's "select count(*)".
DistinctValues        Returns a set of unique values extracted from the cache entries; equivalent to SQL's "select distinct".
CompositeAggregator   Executes a collection of aggregators against the same entry set and returns a list of results, one for each aggregator in the collection.
GroupAggregator       A wrapper aggregator that splits the entries in a set into subsets based on some criteria and aggregates each subset separately and independently.

The important thing to note about the various average, max, min, and sum aggregators is that they differ from each other in how they treat the numeric values they are aggregating, as well as by the type of the return value.

For example, while you can use the DoubleAverage aggregator to calculate the average for any set of java.lang.Number-derived values, you should be aware that each individual value will be converted to Double first using the Number.doubleValue method, which might lead to rounding errors. What you will typically want to do is use the most appropriate aggregator based on the actual type of the values you are aggregating, and convert the final result to the desired type if necessary.

Using aggregators

So far we have learned how to implement an aggregator and which aggregators are shipped with Coherence, but we haven't learned how to use them yet.

In order to execute an aggregator, you need to use one of the methods defined by the com.tangosol.util.InvocableMap interface:

public interface InvocableMap extends Map {
    Object aggregate(Collection keys,
                     InvocableMap.EntryAggregator aggregator);
    Object aggregate(Filter filter,
                     InvocableMap.EntryAggregator aggregator);
}

There are a few more methods in the InvocableMap interface that we will cover in more detail in the next chapter, but these two are all we need to execute aggregators against cache entries.

The first overload of the aggregate method accepts an explicit collection of keys for a set of entries to aggregate, while the second one uses a filter to determine the set of entries aggregation should be performed on. Both methods accept an aggregator instance as a second argument, which can be either one of the built-in aggregators or a custom aggregator you have implemented.
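For example, the order total we used as motivation at the beginning of this section could be calculated like this (the orders cache and the getTotal accessor are assumptions for illustration):

NamedCache orders = CacheFactory.getCache("orders");
Filter filter = new EqualsFilter("getCustomerId", customerId);
// DoubleSum extracts getTotal from each matching entry and sums the values
Double total = (Double) orders.aggregate(filter, new DoubleSum("getTotal"));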

Implementing LookupValuesAggregator

Earlier in the chapter, we started the implementation of a generic solution that will allow us to extract lookup values that are suitable for data binding to UI controls such as drop-downs and list boxes. So far, we have implemented a LookupValueExtractor, which allows us to extract a LookupValue instance from any object, in any cache.

In this section we will complete the exercise by implementing a LookupValuesAggregator: a simple aggregator that collects the extracted lookup values into a list.

public class LookupValuesAggregator
        extends AbstractAggregator {
    private transient List<LookupValue> results;

    public LookupValuesAggregator() {
        // deserialization constructor
    }

    public LookupValuesAggregator(ValueExtractor idExtractor,
                                  ValueExtractor descriptionExtractor) {
        super(new LookupValueExtractor(idExtractor,
                                       descriptionExtractor));
    }

    protected void init(boolean isFinal) {
        results = new ArrayList<LookupValue>();
    }

    protected void process(Object value, boolean isFinal) {
        if (isFinal) {
            // value is a list of LookupValues from a parallel aggregator
            results.addAll((Collection<LookupValue>) value);
        }
        else {
            results.add((LookupValue) value);
        }
    }

    protected Object finalizeResult(boolean isFinal) {
        return results;
    }
}

As you can see, both the init and finalizeResult methods are trivial: the first one simply initializes the results list, while the second one returns it. This works for both the main and the parallel aggregators, so we can ignore the isFinal flag.

The process method, however, uses the isFinal flag to determine if it should add a single LookupValue instance or the list of LookupValues returned by the parallel aggregator to the results collection.
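Putting it all together, populating the countries drop-down described earlier might look like this (the code and name property names are assumptions based on the Country class from Chapter 2):

NamedCache countries = CacheFactory.getCache("countries");
List<LookupValue> values = (List<LookupValue>)
        countries.aggregate(AlwaysFilter.INSTANCE,
                new LookupValuesAggregator(
                        new PropertyExtractor("code"),
                        new PropertyExtractor("name")));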

Summary

In this chapter, you have learned how to query Coherence caches and how to perform aggregations on the data within the cache. We have covered built-in filters and aggregators and learned how to build custom ones.

We also talked about indexes and the impact they have on query performance. I cannot stress enough how important indexes are: make sure that you always use them and you will avoid a lot of potential performance problems.

In the next chapter, we will look at the remaining methods defined by the InvocableMap and learn about one of the most important Coherence features from the scalability and performance perspective: a feature that allows us to process data in place and in parallel across the cluster.
