Executing a scan query

Every time a query is executed, the results are calculated and returned to the user. Because ElasticSearch has no standard order for records, paginating over a large block of values can produce inconsistent results when documents are added or deleted in the meantime. The scan query tries to resolve this kind of problem by providing a special cursor that lets you iterate over all the documents exactly once. It's often used to back up documents or to reindex them.

Getting ready

You need a working ElasticSearch cluster and an index populated with the script available in the online code.

How to do it...

To execute a scan query, we need to perform the following steps:

  1. From the command line, we can execute a search of type scan as follows:
    curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?search_type=scan&scroll=10m&size=50' -d '{"query":{"match_all":{}}}'
    
  2. If everything is all right, the command will return the following result:
    {
      "_scroll_id" : "c2Nhbjs1OzQ1Mzp4d1Ftcng0NlNCYUpVOXh4c0ZiYll3OzQ1Njp4d1Ftcng0NlNCYUpVOXh4c0ZiYll3OzQ1Nzp4d1Ftcng0NlNCYUpVOXh4c0ZiYll3OzQ1NDp4d1Ftcng0NlNCYUpVOXh4c0ZiYll3OzQ1NTp4d1Ftcng0NlNCYUpVOXh4c0ZiYll3OzE7dG90YWxfaGl0czozOw==",
      "took" : 1,
      "timed_out" : false,
      "_shards" : {
        "total" : 5,
        "successful" : 5,
        "failed" : 0
      },
      "hits" : {
        "total" : 3,
        "max_score" : 0.0,
        "hits" : [ ]
      }
    }

    The result is composed of:

    • _scroll_id: This is the value to be used for scrolling records
    • took: This is the time required to execute the query
    • timed_out: This indicates whether the query timed out
    • _shards: This gives information about the status of the shards during the query
    • hits: This gives the total number of matching documents; the hits list is empty in this first response because the documents are returned by the subsequent scroll calls
  3. Using the _scroll_id value, you can call scroll to get the results:
    curl -XGET 'localhost:9200/_search/scroll?scroll=10m' -d 'c2Nhbjs1OzQ2Mzp4d1Ftcng0NlNCYUpVOXh4c0ZiYll3OzQ2Njp4d1Ftcng0NlNCYUpVOXh4c0ZiYll3OzQ2Nzp4d1Ftcng0NlNCYUpVOXh4c0ZiYll3OzQ2NDp4d1Ftcng0NlNCYUpVOXh4c0ZiYll3OzQ2NTp4d1Ftcng0NlNCYUpVOXh4c0ZiYll3OzE7dG90YWxfaGl0czozOw=='
  4. The result should be similar to the following:
    {
      "_scroll_id" : "c2NhbjswOzE7dG90YWxfaGl0czozOw==",
      "took" : 20,
      "timed_out" : false,
      "_shards" : {
        "total" : 5,
        "successful" : 0,
        "failed" : 5
      },
      "hits" : {
        "total" : 3,
        "max_score" : 0.0,
    …}

How it works...

The query is interpreted in the same way as for a standard search. This kind of search is designed to iterate over a large set of results, so the score and the order are not computed.

During the query phase, every shard keeps the state of the matching IDs in memory until the timeout expires.

Processing a scan query is done in the following two steps:

  • The first step executes the query and returns a _scroll_id value, which is used to fetch the results.
  • The second step scrolls through the documents. You repeat this step, each time taking the new _scroll_id value from the response and using it to fetch the next documents.
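
The two steps above can be sketched as a small Python loop. This is only a minimal sketch, not an official client: the names scan_documents, http_send, and the send parameter are hypothetical helpers introduced for illustration, and the code assumes that the cluster accepts a GET request with a body (as in the curl calls of this recipe) and that an empty hits list marks the end of the scroll.

```python
import json
from urllib.request import Request, urlopen


def http_send(method, url, body):
    """Send a request with a raw string body and decode the JSON response."""
    request = Request(url, data=body.encode("utf-8"), method=method)
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def scan_documents(base_url, path, query, scroll="10m", size=50, send=http_send):
    """Yield every hit of a scan query by following the scroll cursor.

    `send` is a hypothetical hook (method, url, body) -> dict, so the loop
    can be exercised without a running cluster.
    """
    # Step 1: open the scan; the response carries a _scroll_id but no hits.
    first = send(
        "GET",
        "%s/%s/_search?search_type=scan&scroll=%s&size=%d"
        % (base_url, path, scroll, size),
        json.dumps({"query": query}),
    )
    scroll_id = first["_scroll_id"]
    while True:
        # Step 2: the raw scroll ID is the request body, as in the curl call.
        page = send("GET", "%s/_search/scroll?scroll=%s" % (base_url, scroll), scroll_id)
        hits = page["hits"]["hits"]
        if not hits:
            break  # assumption: an empty page means the cursor is exhausted
        for hit in hits:
            yield hit
        # Always continue with the most recent _scroll_id.
        scroll_id = page["_scroll_id"]
```

Against the cluster used in this recipe, it could be called as scan_documents("http://127.0.0.1:9200", "test-index/test-type", {"match_all": {}}). Passing the transport as a parameter is a design choice of this sketch, so that the loop can be checked with a stub instead of a live node.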

Tip

If you need to iterate over a large set of records, the scan query must be used; otherwise, you may get duplicated results.

The scan query is a standard query, but there are two special parameters that are passed in the query string, which are as follows:

  • search_type=scan: This parameter informs ElasticSearch to execute the scan query.
  • scroll=(your timeout): This parameter defines how long the hits should be kept alive. The time can be expressed in seconds using the s postfix (for example, 5s, 10s, or 15s) or in minutes using the m postfix (for example, 5m or 10m). If you are using a long timeout, you must be sure that your nodes have enough RAM to keep the hits alive. This parameter is mandatory and must always be provided.
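
To make the two parameters concrete, the following sketch builds the query string of a scan request and converts the timeout into seconds. The function names scroll_seconds and scan_query_string are hypothetical, and the parser deliberately handles only the s and m postfixes described above; it is an illustration, not a description of how ElasticSearch parses time values.

```python
import re
from urllib.parse import urlencode

# Only the two postfixes described in this recipe are covered here.
_UNIT_SECONDS = {"s": 1, "m": 60}


def scroll_seconds(timeout):
    """Convert a timeout such as '10s' or '5m' into seconds."""
    match = re.fullmatch(r"(\d+)([sm])", timeout or "")
    if match is None:
        raise ValueError("expected a timeout such as '10s' or '5m'")
    amount, unit = match.groups()
    return int(amount) * _UNIT_SECONDS[unit]


def scan_query_string(scroll, size=None):
    """Build the query string of a scan request; scroll is mandatory."""
    scroll_seconds(scroll)  # raises ValueError if the timeout is missing or malformed
    params = [("search_type", "scan"), ("scroll", scroll)]
    if size is not None:
        params.append(("size", size))
    return urlencode(params)


print(scroll_seconds("10m"))          # 600
print(scan_query_string("10m", 50))   # search_type=scan&scroll=10m&size=50
```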

Tip

The size parameter is also a bit special, as it is applied per shard: if you have size=10 and 5 shards, each scroll call returns up to 50 documents.
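
The arithmetic behind this tip can be written down as a one-line helper. The function name max_hits_per_scroll is hypothetical, and the value is an upper bound, since a shard with fewer remaining documents returns fewer.

```python
def max_hits_per_scroll(size, shards):
    """Upper bound on the documents returned by one scroll call in a scan."""
    # size is applied to every shard, so the per-call maximum is the product.
    return size * shards


print(max_hits_per_scroll(10, 5))  # 50
```

With the values used in this recipe (size=50 on an index with 5 shards), a single scroll call could therefore return up to 250 documents.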

See also

  • The Executing a search recipe in this chapter