Every time a query is executed, the results are calculated and returned to the user. In Elasticsearch, records have no deterministic order: paginating over a large block of values can produce inconsistencies between results, due to documents being added and deleted, and also due to documents sharing the same score. The scrolling query resolves this kind of problem by providing a special cursor that allows the user to iterate over all the documents exactly once.
You will need an up-and-running Elasticsearch installation as used in the Downloading and installing Elasticsearch recipe in Chapter 2, Downloading and Setup.
To execute curl via the command line, you need to install curl for your operating system.
To correctly execute the following commands, you will need an index populated with the chapter_05/populate_query.sh script available in the online code.
In order to execute a scrolling query, we will perform the following steps:
curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?pretty&scroll=10m&size=1' -d '{"query": {"match_all": {}}}'
{
  "_scroll_id" : "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAAAGFkxOZU5IaDh3U19TSEs3QWZCempKMEEAAAAAAAAABxZMTmVOSGg4d1NfU0hLN0FmQnpqSjBBAAAAAAAAAAgWTE5lTkhoOHdTX1NISzdBZkJ6akowQQAAAAAAAAAJFkxOZU5IaDh3U19TSEs3QWZCempKMEEAAAAAAAAAChZMTmVOSGg4d1NfU0hLN0FmQnpqSjBB",
  "took" : 10,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "test-index",
      "_type" : "test-type",
      "_id" : "2",
      "_score" : 1.0,
      "_source" : {...}
    } ]
  }
}
The result is composed of the following fields:

- _scroll_id: The value to be used for scrolling records
- took: The time required to execute the query
- timed_out: Whether the query was timed out
- _shards: The query status; this is information about the status of the shards during the query
- hits: An object that contains the total count and the result hits

Using the scroll_id, you can use scroll to get the results, as follows:

curl -XGET 'localhost:9200/_search/scroll?scroll=10m' -d 'DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAAAGFkxOZU5IaDh3U19TSEs3QWZCempKMEEAAAAAAAAABxZMTmVOSGg4d1NfU0hLN0FmQnpqSjBBAAAAAAAAAAgWTE5lTkhoOHdTX1NISzdBZkJ6akowQQAAAAAAAAAJFkxOZU5IaDh3U19TSEs3QWZCempKMEEAAAAAAAAAChZMTmVOSGg4d1NfU0hLN0FmQnpqSjBB'
{
  "_scroll_id" : "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAAAGFkxOZU5IaDh3U19TSEs3QWZCempKMEEAAAAAAAAABxZMTmVOSGg4d1NfU0hLN0FmQnpqSjBBAAAAAAAAAAgWTE5lTkhoOHdTX1NISzdBZkJ6akowQQAAAAAAAAAJFkxOZU5IaDh3U19TSEs3QWZCempKMEEAAAAAAAAAChZMTmVOSGg4d1NfU0hLN0FmQnpqSjBB",
  "took" : 20,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 0,
    "failed" : 5
  },
  "hits" : {
    "total" : 3,
    "max_score" : 0.0,
    ...
  }
}
The scrolling query is interpreted as a standard search. This kind of search is designed to iterate on a large set of results, so the score and the order are not computed.
During the query phase, every shard stores the state of the matched IDs in memory until the timeout. Processing a scrolling query is done in two steps, as follows:

1. The first query is executed and returns a scroll_id, which is used to fetch the results.
2. Subsequent calls pass the scroll_id and fetch the next batch of documents.

The scrolling query is similar to every standard query, but there is a special parameter that must be passed in the query string.
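The two steps above can be sketched as a small Python loop. This is a minimal illustration, not the official client: the search and scroll callables are assumed stand-ins for the HTTP calls to _search and _search/scroll shown in the curl examples, and they return parsed JSON dicts shaped like the responses above.

```python
def scroll_all(search, scroll):
    """Iterate over every hit using the two-step scroll protocol.

    `search()` performs the initial query (with a `scroll` timeout) and
    `scroll(scroll_id)` fetches the next batch; both are hypothetical
    stand-ins returning dicts shaped like Elasticsearch responses.
    """
    response = search()                      # step 1: initial query
    scroll_id = response["_scroll_id"]
    hits = response["hits"]["hits"]
    while hits:                              # step 2: repeat until an empty page
        for hit in hits:
            yield hit
        response = scroll(scroll_id)         # pass the cursor back
        scroll_id = response["_scroll_id"]   # the id may change between calls
        hits = response["hits"]["hits"]
```

Note that the loop terminates when a scroll call returns an empty hits list, which is how Elasticsearch signals that the cursor is exhausted.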
The scroll=(your timeout) parameter allows the user to define how long the hits should live. The time can be expressed in seconds using the s postfix (that is to say, 5s, 10s, 15s, and so on) or in minutes using the m postfix (that is to say, 5m, 10m, and so on). If you are using a long timeout, you must be sure that your nodes have a lot of RAM to keep the result IDs alive. This parameter is mandatory and must always be provided.
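To make the timeout syntax concrete, the sketch below converts a scroll duration string into seconds. parse_scroll_timeout is a hypothetical helper for illustration only; Elasticsearch parses these values server-side.

```python
def parse_scroll_timeout(value):
    """Convert a scroll timeout such as '10s' or '5m' into seconds.

    Hypothetical helper covering only the `s` (seconds) and `m`
    (minutes) postfixes discussed above.
    """
    units = {"s": 1, "m": 60}
    suffix = value[-1]
    if suffix not in units:
        raise ValueError("expected an 's' or 'm' postfix, got: " + value)
    return int(value[:-1]) * units[suffix]
```

For example, the scroll=10m used in the recipe keeps the search context alive for 600 seconds.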
Scrolling is very useful for executing re-indexing actions or iterating over very large result sets. The most efficient approach for this kind of action is to sort by the special _doc field, which returns all the matched documents without computing a score or an order.
So, if you need to iterate on a large bucket of documents for re-indexing, you should execute a similar query, as follows:
curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?pretty&scroll=10m&size=1' -d '{"query": {"match_all": {}}, "sort": ["_doc"]}'
The scroll result values are kept in memory until the scroll timeout expires. It's best practice to clean this memory if you no longer use the scroller; to delete a scroll from Elasticsearch memory, the command is as follows:
curl -XDELETE localhost:9200/_search/scroll -d ' { "scroll_id" : ["DnF1ZXJ5VGhlbkZldGNoBQAA..."] }'
If you need to remove all the active scrolls at once, you can use the special _all keyword, as follows:

curl -XDELETE localhost:9200/_search/scroll/_all
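The cleanup best practice can be sketched in code as well: always clear the scroll when iteration ends, even if a fetch fails midway. As before, search, scroll, and clear are hypothetical stand-ins for the HTTP calls (clear stands in for DELETE /_search/scroll), not a real client API.

```python
def scan_then_clear(search, scroll, clear):
    """Collect all hits, then free the server-side scroll context.

    The try/finally guarantees `clear(scroll_id)` runs even if a
    scroll call raises, so the node's memory is released early
    instead of waiting for the timeout.
    """
    response = search()
    scroll_id = response["_scroll_id"]
    docs = []
    try:
        hits = response["hits"]["hits"]
        while hits:
            docs.extend(hits)
            response = scroll(scroll_id)
            scroll_id = response["_scroll_id"]
            hits = response["hits"]["hits"]
    finally:
        clear(scroll_id)   # best practice: do not wait for the timeout
    return docs
```

Wrapping the loop this way is the programmatic equivalent of issuing the DELETE command above as soon as you are done scrolling.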