Executing a scrolling query

Every time a query is executed, the results are computed and returned to the user. Elasticsearch does not guarantee a deterministic order for records: paginating over a large block of values can produce inconsistent results, due to documents being added or deleted between requests and to documents sharing the same score. The scrolling query addresses this problem by providing a special cursor that allows the user to iterate over all the documents exactly once.

Getting ready

You will need an up-and-running Elasticsearch installation as used in the Downloading and installing Elasticsearch recipe in Chapter 2, Downloading and Setup.

To execute curl via a command line, you need to install curl for your operating system.

To correctly execute the following commands, you will need an index populated with the chapter_05/populate_query.sh script available in the online code.

How to do it...

In order to execute a scrolling query, we will perform the following steps:

  1. From the command line, we can execute a scrolling query, as follows:
        curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?pretty&scroll=10m&size=1' -d '{
          "query": {
            "match_all": {}
          }
        }'
    
  2. If everything works, the command will return the following result:
        {
          "_scroll_id" : "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAAAGFkxOZU5IaDh3U19TSEs3QWZCempKMEEAAAAAAAAABxZMTmVOSGg4d1NfU0hLN0FmQnpqSjBBAAAAAAAAAAgWTE5lTkhoOHdTX1NISzdBZkJ6akowQQAAAAAAAAAJFkxOZU5IaDh3U19TSEs3QWZCempKMEEAAAAAAAAAChZMTmVOSGg4d1NfU0hLN0FmQnpqSjBB",
          "took" : 10,
          "timed_out" : false,
          "_shards" : {
            "total" : 5,
            "successful" : 5,
            "failed" : 0
          },
          "hits" : {
            "total" : 3,
            "max_score" : 1.0,
            "hits" : [
              {
                "_index" : "test-index",
                "_type" : "test-type",
                "_id" : "2",
                "_score" : 1.0,
                "_source" : {...}
              }
            ]
          }
        }
    
  3. The result is composed of the following:
    • _scroll_id: The value to be used in the next scroll call to fetch more records
    • took: The time, in milliseconds, required to execute the query
    • timed_out: Whether the query timed out
    • _shards: The status of the shards that executed the query
    • hits: An object that contains the total count and the result hits
  4. With a scroll_id, you can use scroll to get the results, as follows:
        curl -XGET 'localhost:9200/_search/scroll?scroll=10m' -d 'DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAAAGFkxOZU5IaDh3U19TSEs3QWZCempKMEEAAAAAAAAABxZMTmVOSGg4d1NfU0hLN0FmQnpqSjBBAAAAAAAAAAgWTE5lTkhoOHdTX1NISzdBZkJ6akowQQAAAAAAAAAJFkxOZU5IaDh3U19TSEs3QWZCempKMEEAAAAAAAAAChZMTmVOSGg4d1NfU0hLN0FmQnpqSjBB'
  5. The result should be something similar to the following:
        {
          "_scroll_id" : "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAAAGFkxOZU5IaDh3U19TSEs3QWZCempKMEEAAAAAAAAABxZMTmVOSGg4d1NfU0hLN0FmQnpqSjBBAAAAAAAAAAgWTE5lTkhoOHdTX1NISzdBZkJ6akowQQAAAAAAAAAJFkxOZU5IaDh3U19TSEs3QWZCempKMEEAAAAAAAAAChZMTmVOSGg4d1NfU0hLN0FmQnpqSjBB",
          "took" : 20,
          "timed_out" : false,
          "_shards" : {
            "total" : 5,
            "successful" : 5,
            "failed" : 0
          },
          "hits" : {
            "total" : 3,
            "max_score" : 0.0,
            ...
          }
        }
    

How it works...

The scrolling query is executed in the same way as a standard search. Because this kind of search is designed for iterating over a large set of results, the score and the order do not need to be computed.

During the query phase, every shard stores its state (the IDs of the matched documents) in memory until the scroll timeout expires. Processing a scrolling query is done in two steps, as follows:

  1. The first step executes the query and returns a scroll_id that is used to fetch the results.
  2. The second step executes the document scrolling. This step is repeated: each call returns a batch of documents plus a new scroll_id, which is used to fetch the next batch, until no more hits are returned.
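The two steps above amount to a simple loop: run the initial search, then keep calling the scroll endpoint with the most recent scroll_id until a page comes back empty. The following is a minimal Python sketch of that loop; FakeSearchClient and its search/scroll methods are stand-ins for the two curl calls of this recipe (they are not a real client API), so the logic can be shown without a live cluster:

```python
# Sketch of the two-step scroll loop. FakeSearchClient stands in for a live
# Elasticsearch cluster: search() mirrors the initial query call and scroll()
# mirrors the follow-up call to /_search/scroll with the last scroll_id.

class FakeSearchClient:
    """Stub that pages through a fixed document list, like the scroll API."""

    def __init__(self, docs):
        self._docs = docs
        self._state = {}     # scroll_id -> (next offset, page size)
        self._counter = 0

    def search(self, size, scroll="10m"):
        # Step 1: execute the query, return the first page and a scroll_id.
        return self._page(0, size)

    def scroll(self, scroll_id, scroll="10m"):
        # Step 2: fetch the next page for a previously returned scroll_id.
        offset, size = self._state.pop(scroll_id)
        return self._page(offset, size)

    def _page(self, offset, size):
        scroll_id = f"scroll-{self._counter}"
        self._counter += 1
        self._state[scroll_id] = (offset + size, size)
        return {"_scroll_id": scroll_id,
                "hits": {"hits": self._docs[offset:offset + size]}}


def iterate_all(client, size):
    """Yield every document, following scroll_id until an empty page arrives."""
    response = client.search(size=size)
    while response["hits"]["hits"]:
        yield from response["hits"]["hits"]
        response = client.scroll(response["_scroll_id"])


docs = [{"_id": str(i)} for i in range(5)]
collected = list(iterate_all(FakeSearchClient(docs), size=2))
```

Against a real cluster, the two method bodies would be replaced by the curl calls shown above; the loop structure stays the same.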

Tip

If you need to iterate over a large set of records, you must use a scrolling query; otherwise, you could receive duplicated results.

A scrolling query is executed like any standard query, except that a special parameter must be passed in the query string.

The scroll=(your timeout) parameter allows the user to define how long the hits should be kept alive. The time can be expressed in seconds using the s postfix (for example, 5s, 10s, or 15s) or in minutes using the m postfix (for example, 5m or 10m). If you use a long timeout, you must make sure that your nodes have plenty of RAM to keep the resulting IDs alive. This parameter is mandatory and must always be provided.

There's more...

Scrolling is very useful for executing re-indexing actions or iterating over very large result sets. The most efficient approach for this kind of action is to sort by the special field _doc, which returns all the matched documents without spending time on scoring or ordering.

So, if you need to iterate on a large bucket of documents for re-indexing, you should execute a similar query, as follows:

    curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?pretty&scroll=10m&size=1' -d '{
      "query": {
        "match_all": {}
      },
      "sort": [
        "_doc"
      ]
    }'

The scroll result values are kept in memory until the scroll timeout expires. It is best practice to free this memory as soon as you no longer need the scroller. To delete a scroll from the Elasticsearch memory, the commands are as follows:

  • If you know your scroll ID/IDs, you can provide them to the delete call as follows:
        curl -XDELETE localhost:9200/_search/scroll -d ' 
        { 
           "scroll_id" : ["DnF1ZXJ5VGhlbkZldGNoBQAA..."] 
        }' 
  • If you want to clean all the scrolls, you can use the special _all keyword, as follows:
        curl -XDELETE localhost:9200/_search/scroll/_all 
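Rather than relying on the timeout, the clean-up can be made deterministic by pairing the scroll with a finally block, so the server-side memory is released even if iteration fails partway through. Below is a minimal Python sketch; StubClient and its methods are hypothetical stand-ins for the curl calls of this recipe, not a real client API:

```python
# Sketch: always release the scroll context when iteration is finished.
# StubClient stands in for a live cluster; clear_scroll mirrors the
# DELETE /_search/scroll call shown above.

class StubClient:
    def __init__(self):
        self.cleared = []

    def search(self, size, scroll="10m"):
        # Would be: GET /test-index/test-type/_search?scroll=10m&size=...
        return {"_scroll_id": "scroll-0", "hits": {"hits": []}}

    def clear_scroll(self, scroll_ids):
        # Would be: DELETE /_search/scroll with body {"scroll_id": [...]}
        self.cleared.extend(scroll_ids)


client = StubClient()
response = client.search(size=100)
try:
    # ... iterate the hits, issuing scroll calls as needed ...
    pass
finally:
    # Free the server-side scroll memory instead of waiting for the timeout.
    client.clear_scroll([response["_scroll_id"]])
```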
