Pagination with a standard query works well if the matching documents do not change too often; otherwise, paginating over live data returns unpredictable results. To work around this problem, Elasticsearch provides an extra parameter in the query: scroll.
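The problem described above can be illustrated with a small in-memory model. This is only a sketch and does not call Elasticsearch: the class LivePaginationSketch and its methods are hypothetical names, the "index" is a plain list of IDs, and the frozen copy merely stands in for the point-in-time view that a scroll keeps.

```java
import java.util.ArrayList;
import java.util.List;

// In-memory model only: it illustrates the idea and does not use the Elasticsearch API.
class LivePaginationSketch {

    // Offset pagination over a list, comparable to from/size on a query.
    static List<Integer> page(List<Integer> docs, int from, int size) {
        int start = Math.min(from, docs.size());
        int end = Math.min(from + size, docs.size());
        return new ArrayList<>(docs.subList(start, end));
    }

    // Pages through IDs 1..docs while a new document (ID 0) is inserted at the
    // head of the live list right after the first page has been read.
    // With useSnapshot = true, pages are read from a copy frozen at the start.
    static List<Integer> collect(int docs, int size, boolean useSnapshot) {
        List<Integer> live = new ArrayList<>();
        for (int id = 1; id <= docs; id++) {
            live.add(id);
        }
        List<Integer> source = useSnapshot ? new ArrayList<>(live) : live;
        List<Integer> seen = new ArrayList<>();
        boolean inserted = false;
        for (int from = 0; ; from += size) {
            List<Integer> current = page(source, from, size);
            if (current.isEmpty()) {
                break;
            }
            seen.addAll(current);
            if (!inserted) {
                live.add(0, 0); // a concurrent write shifts every later offset by one
                inserted = true;
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println("live offsets: " + collect(6, 3, false));
        System.out.println("snapshot:     " + collect(6, 3, true));
    }
}
```

In the live run, the document with ID 3 is returned twice because the insertion shifts the offsets between two page requests; the snapshot run returns each ID exactly once.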
You need an up-and-running Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe in Chapter 2, Downloading and Setup.
You also need Maven, or an IDE that natively supports it for Java programming (such as Eclipse or IntelliJ IDEA), installed.
The code for this recipe is in the chapter_14/nativeclient directory and the referred class is ScrollQueryExample.
The search is done as in the Execute a standard search recipe. The main difference is a setScroll timeout, which keeps the resulting IDs in memory for the query for a defined time. The steps are the same as those for a standard search, apart from the following:
Import the TimeValue object to define time in a more human way:
    import org.elasticsearch.common.unit.TimeValue;
Execute the search with a setScroll value. We can change the code of the Execute a standard search recipe to use scroll in this way:
    SearchResponse response = client.prepareSearch(index)
            .setTypes(type)
            .setSize(30)
            .setQuery(query)
            .setScroll(TimeValue.timeValueMinutes(2))
            .execute().actionGet();
Iterate over the results in a loop until no more hits are returned:
    do {
        for (SearchHit hit : response.getHits().getHits()) {
            System.out.println("hit: " + hit.getIndex() + ":" + hit.getType() + ":" + hit.getId());
        }
        // Fetch the next batch and renew the scroll timeout
        response = client.prepareSearchScroll(response.getScrollId())
                .setScroll(TimeValue.timeValueMinutes(2))
                .execute().actionGet();
    } while (response.getHits().getHits().length != 0);
The output will be similar to the following:
    hit: mytest:mytype:499
    hit: mytest:mytype:531
    hit: mytest:mytype:533
    hit: mytest:mytype:535
    hit: mytest:mytype:555
    hit: mytest:mytype:559
    hit: mytest:mytype:571
    hit: mytest:mytype:575
    ...truncated...
To use scrolling results, it's enough to add setScroll with a timeout to the method call.
When using scrolling, some behaviors must be considered:
A standard search is turned into a scroll search in this way:
    SearchResponse response = client.prepareSearch(index)
            .setTypes(type)
            .setSize(30)
            .setQuery(query)
            .setScroll(TimeValue.timeValueMinutes(2))
            .execute().actionGet();
The response contains the results, as in a standard search, plus a scroll ID, which is required to fetch the next results.
To execute the scroll, you need to call the prepareSearchScroll client method with a scroll ID and a new timeout. In the example, we process all the result documents:
    do {
        for (SearchHit hit : response.getHits().getHits()) {
            // process your hit
        }
        // Fetch the next batch and renew the scroll timeout
        response = client.prepareSearchScroll(response.getScrollId())
                .setScroll(TimeValue.timeValueMinutes(2))
                .execute().actionGet();
    } while (response.getHits().getHits().length != 0);
To detect that we are at the end of the scroll, we can check that no results are returned.
There are a lot of scenarios in which scroll is very important; but when working on big data solutions, where the number of results is very large, it's easy to hit the timeout. In these scenarios, it is important to have a good architecture in which you fetch the results as fast as possible and don't process them iteratively in the loop, but defer their manipulation to a distributed process.
In this case, the best solution is to use the search_after functionality of Elasticsearch, sorting by _uid, as described in the Using search_after functionality recipe in Chapter 5, Search.
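The advice about fetching results quickly and deferring their processing can be sketched as a producer/consumer pipeline. This is a minimal sketch under stated assumptions, not the recipe's code: ScrollPipelineSketch, fetchNextPage, and the end marker are hypothetical names, fetchNextPage only simulates a scroll round trip, and local worker threads stand in for what would be a distributed queue and workers in a real deployment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class ScrollPipelineSketch {
    // Marker telling a worker that no more hits will arrive (hypothetical value).
    private static final String END = "__END__";

    // Hypothetical stand-in for one scroll round trip; an empty page means the end.
    static List<String> fetchNextPage(int offset, int pageSize, int total) {
        List<String> page = new ArrayList<>();
        for (int i = offset; i < Math.min(offset + pageSize, total); i++) {
            page.add("doc-" + i);
        }
        return page;
    }

    // Fetches every page as fast as possible and returns how many hits the workers processed.
    static int run(int total, int pageSize, int workers) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        AtomicInteger processed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int w = 0; w < workers; w++) {
            pool.submit(() -> {
                try {
                    while (true) {
                        String id = queue.take();
                        if (id.equals(END)) {
                            break;
                        }
                        processed.incrementAndGet(); // slow per-hit work would go here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        // The fetch loop only enqueues hits, so each round trip stays well inside the timeout.
        int offset = 0;
        List<String> page;
        while (!(page = fetchNextPage(offset, pageSize, total)).isEmpty()) {
            queue.addAll(page);
            offset += page.size();
        }
        for (int w = 0; w < workers; w++) {
            queue.add(END); // one end marker per worker
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed.get();
    }

    public static void main(String[] args) {
        System.out.println("processed: " + run(95, 30, 4));
    }
}
```

The queue here is unbounded for brevity; with a very large result set, a bounded queue or an external message broker would keep memory use under control, at the cost of the fetch loop occasionally waiting for the workers.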