Every time a query is executed, the results are calculated and returned to the user. In ElasticSearch there is no standard order for records, so paginating over a large set of values can produce inconsistent results as documents are added and deleted. The scan query tries to resolve these kinds of problems by providing a special cursor that allows you to iterate uniquely over all the documents. It is often used to back up documents or reindex them.
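To illustrate the inconsistency described above, here is a minimal Python sketch that simulates offset-based pagination over an in-memory list. It does not talk to ElasticSearch; the page helper and the document IDs are hypothetical and exist only for this illustration. Deleting a document between two page requests causes another document to be skipped.

```python
def page(docs, start, size):
    """Return one page of results, as offset-based pagination would."""
    return docs[start:start + size]

# Hypothetical index content: six document IDs in some fixed order.
index = ["d1", "d2", "d3", "d4", "d5", "d6"]

seen = []
seen += page(index, 0, 2)   # first page: d1, d2

# A document from an already-read page is deleted before the next request.
index.remove("d1")

seen += page(index, 2, 2)   # second page: every document has shifted by one
seen += page(index, 4, 2)   # third page

# d3 slid into the range of the first page and was never returned.
missed = [d for d in ["d2", "d3", "d4", "d5", "d6"] if d not in seen]
print(seen)    # ['d1', 'd2', 'd4', 'd5', 'd6']
print(missed)  # ['d3']
```

A cursor that is fixed when the query starts avoids this kind of shift, which is the problem the scan query addresses.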
You need a working ElasticSearch cluster and an index populated with the script available in the online code.
For executing a scan query, we need to perform the following steps:
Execute a search of type scan as follows:
curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?search_type=scan&scroll=10m&size=50' -d '{"query":{"match_all":{}}}'
{ "_scroll_id" : "c2Nhbjs1OzQ1Mzp4d1Ftcng0NlNCYUpVOXh4c0ZiYll3OzQ1Njp4d1Ftcng0NlNCYUpVOXh4c0ZiYll3OzQ1Nzp4d1Ftcng0NlNCYUpVOXh4c0ZiYll3OzQ1NDp4d1Ftcng0NlNCYUpVOXh4c0ZiYll3OzQ1NTp4d1Ftcng0NlNCYUpVOXh4c0ZiYll3OzE7dG90YWxfaGl0czozOw==", "took" : 1, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 3, "max_score" : 0.0, "hits" : [ ] } }
The result is composed of:
_scroll_id: This is the value to be used for scrolling records
took: This is the time required to execute the query (in milliseconds)
timed_out: This indicates whether the query timed out
_shards: This gives information about the status of the shards during the query
hits: This gives the total number of matching documents; the documents themselves become available after scrolling
Using the scroll_id parameter, you can use scroll to get the results:
curl -XGET 'localhost:9200/_search/scroll?scroll=10m' -d 'c2Nhbjs1OzQ2Mzp4d1Ftcng0NlNCYUpVOXh4c0ZiYll3OzQ2Njp4d1Ftcng0NlNCYUpVOXh4c0ZiYll3OzQ2Nzp4d1Ftcng0NlNCYUpVOXh4c0ZiYll3OzQ2NDp4d1Ftcng0NlNCYUpVOXh4c0ZiYll3OzQ2NTp4d1Ftcng0NlNCYUpVOXh4c0ZiYll3OzE7dG90YWxfaGl0czozOw=='
{ "_scroll_id" : "c2NhbjswOzE7dG90YWxfaGl0czozOw==", "took" : 20, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 0, "failed" : 5 }, "hits" : { "total" : 3, "max_score" : 0.0, …}
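The two requests above can be combined in a loop. The following Python sketch is only an illustration under stated assumptions: http_get is a hypothetical callable (not part of any library) that sends a GET request with a body and returns the decoded JSON response as a dictionary, and the loop assumes that an empty hits list marks the end of the results. A stand-in transport with canned responses replaces a real cluster so that the sketch runs on its own.

```python
import json

def scan_documents(http_get, base_url, index, doc_type, query,
                   scroll="10m", size=50):
    """Yield every hit of a scan query, one scroll request at a time."""
    # First request: start the scan. The response carries a scroll ID
    # but no documents.
    url = "%s/%s/%s/_search?search_type=scan&scroll=%s&size=%d" % (
        base_url, index, doc_type, scroll, size)
    response = http_get(url, json.dumps(query))
    scroll_id = response["_scroll_id"]

    # Following requests: scroll, always sending the most recent scroll ID.
    scroll_url = "%s/_search/scroll?scroll=%s" % (base_url, scroll)
    while True:
        response = http_get(scroll_url, scroll_id)
        hits = response["hits"]["hits"]
        if not hits:  # assumption: an empty page means nothing is left
            break
        for hit in hits:
            yield hit
        scroll_id = response["_scroll_id"]

def make_fake_http_get(responses):
    """Hypothetical stand-in transport that replays canned responses."""
    pending = list(responses)
    def http_get(url, body):
        http_get.calls.append((url, body))
        return pending.pop(0)
    http_get.calls = []
    return http_get

fake = make_fake_http_get([
    {"_scroll_id": "id-0", "hits": {"total": 3, "hits": []}},
    {"_scroll_id": "id-1", "hits": {"total": 3,
                                    "hits": [{"_id": "1"}, {"_id": "2"}]}},
    {"_scroll_id": "id-2", "hits": {"total": 3, "hits": [{"_id": "3"}]}},
    {"_scroll_id": "id-3", "hits": {"total": 3, "hits": []}},
])
docs = list(scan_documents(fake, "http://127.0.0.1:9200", "test-index",
                           "test-type", {"query": {"match_all": {}}}))
print([d["_id"] for d in docs])  # ['1', '2', '3']
```

Passing the transport in as an argument keeps the loop independent of any particular HTTP client; in real use http_get would wrap whichever client is available.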
The query is interpreted in the same way as a standard search. This kind of search is designed to iterate over a large set of results, so the score and the order are not computed.
During the query phase, every shard keeps the state of the IDs in memory until the timeout expires.
Processing a scan query is done in the following two steps:
The first step executes the query and returns a scroll_id parameter, which is used to fetch results.
The second step scrolls through the documents: it is repeated, each time taking the new scroll_id value and fetching other documents.
The scan query is a standard query, but two special parameters are passed in the query string:
search_type=scan: This parameter informs ElasticSearch to execute the scan query.
scroll=(your timeout): This parameter defines how long the hits should live. The time can be expressed in seconds using the s postfix (for example, 5s, 10s, or 15s) or in minutes using the m postfix (for example, 5m or 10m). If you are using a long timeout, you must be sure that your nodes have a lot of RAM to keep them alive. This parameter is mandatory and must always be provided.
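As a small illustration of these two parameters, the following Python sketch builds the query string of a scan request and converts a scroll timeout to seconds. The helper names scroll_seconds and scan_query_string are hypothetical, and only the s and m postfixes mentioned here are handled.

```python
import re
from urllib.parse import urlencode

def scroll_seconds(timeout):
    """Convert a scroll timeout such as '5s' or '10m' to seconds."""
    match = re.fullmatch(r"(\d+)([sm])", timeout)
    if match is None:
        raise ValueError("unsupported scroll timeout: %r" % timeout)
    value, unit = int(match.group(1)), match.group(2)
    return value * 60 if unit == "m" else value

def scan_query_string(scroll, size=None):
    """Build the query string of a scan request; scroll is required."""
    scroll_seconds(scroll)  # reject a malformed timeout before building
    params = [("search_type", "scan"), ("scroll", scroll)]
    if size is not None:
        params.append(("size", size))
    return urlencode(params)

print(scroll_seconds("10m"))              # 600
print(scan_query_string("10m", size=50))  # search_type=scan&scroll=10m&size=50
```

Validating the timeout before sending the request is a design choice of this sketch, made so that a typo is reported locally.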