Executing a scroll search

Pagination with a standard query works very well if the matched documents do not change too often; otherwise, paginating over live data returns unpredictable results. To bypass this problem, Elasticsearch provides an extra parameter in the query: scroll.

Getting ready

You need an up-and-running Elasticsearch installation as we described in the Downloading and installing Elasticsearch recipe in Chapter 2, Downloading and Setup.

You also need Maven, or an IDE that natively supports it for Java programming, such as Eclipse or IntelliJ IDEA, installed.

The code for this recipe is in the chapter_14/nativeclient directory and the referred class is ScrollQueryExample.

How to do it...

The search is done as in the Execute a standard search recipe. The main difference is the setScroll timeout, which tells the server to keep the resulting IDs of the query in memory for the given time. The steps are the same as for a standard search, apart from the following:

  1. We import the TimeValue object to define time in a more human-readable way:
            import org.elasticsearch.common.unit.TimeValue; 
    
  2. We execute the search by setting the setScroll value. We can change the code of the Execute a standard search recipe to use scroll in this way:
            SearchResponse response =  
            client.prepareSearch(index).setTypes(type).setSize(30) 
            .setQuery(query).setScroll(TimeValue.timeValueMinutes(2)) 
            .execute().actionGet(); 
    
  3. To manage the scrolling, we need to loop for as long as results are returned:
            do { 
                // Process the hits of the current page
                for (SearchHit hit : response.getHits().getHits()) { 
                    System.out.println("hit: " + hit.getIndex() + ":" +   
                    hit.getType() + ":" + hit.getId()); 
                } 
                // Fetch the next page and renew the scroll timeout
                response = client.prepareSearchScroll(response.getScrollId())
                        .setScroll(TimeValue.timeValueMinutes(2))
                        .execute().actionGet(); 
            } while (response.getHits().getHits().length != 0);  
    
  4. The loop iterates over the results for as long as records are available. The output will be similar to the following:
            hit: mytest:mytype:499 
            hit: mytest:mytype:531 
            hit: mytest:mytype:533 
            hit: mytest:mytype:535 
            hit: mytest:mytype:555 
            hit: mytest:mytype:559 
            hit: mytest:mytype:571 
            hit: mytest:mytype:575 
            ...truncated... 
    

How it works...

To get scrolling results, it's enough to add setScroll with a timeout to the search call.

When using scrolling, some behaviors must be considered:

  • The timeout defines how long the Elasticsearch server keeps the results. If you ask for the next results after the timeout has expired, the server returns an error, so be careful with short timeouts.
  • The scroll consumes memory until it ends or the timeout expires. Setting too large a timeout without consuming the data results in a big memory overhead. A large number of open scrolls consumes a lot of memory, proportional to the number of IDs and their related data (score, order, and so on) in the results.
  • With scrolling it's not possible to jump to an arbitrary page, as there is no start offset. Scrolling is designed to fetch consecutive results.
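
The behaviors listed above can be illustrated with a toy in-memory model. This is only a sketch and not Elasticsearch code: ToyScrollServer and all of its methods are hypothetical names, and the current time is passed in explicitly (in milliseconds) to keep the example deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of server-side scroll contexts (hypothetical, not the Elasticsearch API). */
final class ToyScrollServer {
    private static final class Context {
        int position;    // index of the next result to return
        long expiresAt;  // absolute expiry time in milliseconds
    }

    private final List<String> ids;  // result IDs kept in memory for open scrolls
    private final Map<String, Context> contexts = new HashMap<>();
    private int nextId = 0;

    ToyScrollServer(List<String> ids) {
        this.ids = new ArrayList<>(ids);
    }

    /** Opens a scroll that stays alive for keepAliveMs, counted from 'now'. */
    String open(long now, long keepAliveMs) {
        Context c = new Context();
        c.expiresAt = now + keepAliveMs;
        String scrollId = "scroll-" + (nextId++);
        contexts.put(scrollId, c);
        return scrollId;
    }

    /** Returns the next page and renews the timeout; fails if the scroll has expired. */
    List<String> next(String scrollId, long now, long keepAliveMs, int size) {
        Context c = contexts.get(scrollId);
        if (c == null || now > c.expiresAt) {
            contexts.remove(scrollId);  // an expired context is released
            throw new IllegalStateException("No search context for " + scrollId);
        }
        int end = Math.min(c.position + size, ids.size());
        List<String> page = new ArrayList<>(ids.subList(c.position, end));
        c.position = end;                 // forward only: there is no start offset
        c.expiresAt = now + keepAliveMs;  // every request sets a new timeout
        return page;
    }

    /** Number of scrolls currently holding memory. */
    int openContexts() {
        return contexts.size();
    }
}
```

In this model, a client that calls next before the timeout keeps the context alive, an empty page signals the end of the results, and a call made after the timeout fails because the context is gone.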

A standard search is changed into a scroll in this way:

SearchResponse response = client.prepareSearch(index).setTypes(type).setSize(30) 
       .setQuery(query).setScroll(TimeValue.timeValueMinutes(2)) 
       .execute().actionGet(); 

The response contains the results, as in a standard search, plus a scroll ID, which is required to fetch the next results.

To execute the scroll, you need to call the prepareSearchScroll client method with a scroll ID and a new timeout. In the example, we process all the result documents:

do { 
    for (SearchHit hit : response.getHits().getHits()) { 
        //process your hit 
    } 
    // Fetch the next page and renew the scroll timeout
    response = client.prepareSearchScroll(response.getScrollId())
            .setScroll(TimeValue.timeValueMinutes(2))
            .execute().actionGet(); 
} while (response.getHits().getHits().length != 0); 

To detect the end of the scroll, we check that no results are returned.

There are a lot of scenarios in which scroll is very important; but when working on big data solutions, where the number of results is very large, it's easy to hit the timeout. In these scenarios, it is important to have a good architecture in which you fetch the results as fast as possible and don't process them one by one inside the loop, but defer their processing and distribute it.
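
One possible shape for such an architecture, shown here as a minimal single-JVM sketch, is to let the fetch loop only hand each page to a worker pool and collect the outcomes afterwards. The class DeferredScrollProcessing, the method fetchThenProcess, and the page iterator are hypothetical names used for illustration; in a real client the iterator would wrap the prepareSearchScroll loop, and the workers could just as well be remote.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Sketch: keep the scroll loop fast by deferring page processing to a pool. */
final class DeferredScrollProcessing {

    /** Submits every non-empty page to the pool, then waits for all results in page order. */
    static <H, R> List<R> fetchThenProcess(Iterator<List<H>> pages,
                                           Function<List<H>, R> worker,
                                           int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> pending = new ArrayList<>();
            while (pages.hasNext()) {
                List<H> page = pages.next();
                if (page.isEmpty()) {
                    break;  // an empty page marks the end of the scroll
                }
                // The loop does no processing itself, so the next fetch is not delayed
                Callable<R> task = () -> worker.apply(page);
                pending.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : pending) {
                results.add(future.get());  // blocks until that page has been processed
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the loop only submits work, the time between two scroll requests stays short even when processing a page is slow, which makes it less likely that the timeout expires. Note that the pending pages are held in memory until they are processed.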

For very large result sets, the best solution is to use the search_after functionality of Elasticsearch, sorting by _uid, as described in the Using search_after functionality recipe in Chapter 5, Search.
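
The idea behind this approach can be sketched without any Elasticsearch code: each request carries the last sort value already seen, so the server keeps no state between requests and nothing can time out. In the following toy example, SearchAfterSketch, pageAfter, and fetchAll are hypothetical names, and a sorted set of strings stands in for documents sorted by a unique key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;

/** Toy illustration of cursor-style paging by last sort value (not the Elasticsearch API). */
final class SearchAfterSketch {

    /** Returns up to 'size' keys strictly greater than 'after' (null means from the start). */
    static List<String> pageAfter(NavigableSet<String> sortedKeys, String after, int size) {
        NavigableSet<String> tail =
                (after == null) ? sortedKeys : sortedKeys.tailSet(after, false);
        List<String> page = new ArrayList<>();
        for (String key : tail) {
            if (page.size() == size) {
                break;
            }
            page.add(key);
        }
        return page;
    }

    /** Walks the whole set page by page, remembering only the last key of each page. */
    static List<String> fetchAll(NavigableSet<String> sortedKeys, int size) {
        List<String> all = new ArrayList<>();
        String after = null;
        while (true) {
            List<String> page = pageAfter(sortedKeys, after, size);
            if (page.isEmpty()) {
                break;  // no more results
            }
            all.addAll(page);
            after = page.get(page.size() - 1);  // cursor for the next request
        }
        return all;
    }
}
```

The sort key must be unique, otherwise documents sharing the last value could be skipped; this is why the text suggests sorting by _uid.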

See also

  • Refer to the Executing a scroll Query recipe in Chapter 5, Search, which describes scroll queries in depth, and the Using search_after functionality recipe in Chapter 5, Search for scrolling on very large datasets