Practical considerations for bulk processing

Minimizing the number of requests with the search types and bulk APIs covered in this chapter is valuable, but when a large amount of processing has to be done by Elasticsearch, you also need to watch resource utilization and control the size of your requests accordingly. The following points will help you while working with what you have learned in this chapter.

The most important factor to take care of is the size of your documents. Fetching or indexing 1,000 documents of 1 KB each in a single request is far easier than 1,000 documents of 100 KB each:

  • Multisearch: While querying with multi-search requests, be careful about how many queries you put into a single request. You cannot simply combine 1,000 queries and execute them in one go, and the number of queries per request should be reduced further as query complexity increases. Instead, break your query set into batches of, for example, 100 queries per multi-search request, execute them, and combine the results after all the batches have finished. The same rule applies when fetching documents with the mget API.
  • Scan-scroll: A search with scan and scroll is highly beneficial for deep pagination, but remember that the number of documents returned in a single request is usually size * number_of_shards. We have also seen that a timeout is passed via the scroll parameter, which tells Elasticsearch how long a search context must be kept open on the server to serve a scroll request. Scroll timeouts do not need to be long enough to process all the data; they only need to be long enough to process the previous batch of results, because each scroll request (with the scroll parameter) sets a new expiry time. Set the scroll timeout wisely so that there are not too many open search contexts on your Elasticsearch server at the same time, as they heavily affect the background merge process of Lucene indexes. At the same time, the scroll time should not be so small that your requests time out.
  • Bulk indexing and bulk updates: Sending too much data in a single request can harm your Elasticsearch node if you do not have the optimal resources available. Remember that while indexing or updating, Lucene segments are merged in the background, and merging and flushing large amounts of data to disk requires a very high amount of CPU and disk I/O. So, choose your batch sizes wisely by benchmarking your requests.
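The multi-search batching described above can be sketched as a small helper that splits a large query set into NDJSON payloads for the Multi Search API. This is a minimal sketch: the index name "books", the helper name msearch_batches, and the sample queries are all hypothetical; the batch size of 100 matches the example in the text.

```python
import json

def msearch_batches(index, queries, batch_size=100):
    """Split a large list of query bodies into _msearch request payloads.

    Each payload is NDJSON: a header line naming the index, followed by
    the query body, one pair per query, capped at batch_size queries.
    """
    payloads = []
    for start in range(0, len(queries), batch_size):
        lines = []
        for query in queries[start:start + batch_size]:
            lines.append(json.dumps({"index": index}))  # header line
            lines.append(json.dumps(query))             # query body line
        # The Multi Search API requires a trailing newline.
        payloads.append("\n".join(lines) + "\n")
    return payloads

# Hypothetical query set: 250 match queries split into batches of 100.
queries = [{"query": {"match": {"title": str(i)}}} for i in range(250)]
batches = msearch_batches("books", queries)
```

Each payload can then be sent as a separate request to the _msearch endpoint, and the per-batch responses combined once all batches have executed.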
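For bulk indexing, the request-size control discussed above amounts to starting a new _bulk payload once the current one would exceed a byte budget. A minimal sketch, assuming index-only actions; the helper name bulk_chunks, the index name "books", and the size limit are hypothetical and should be tuned by benchmarking, as the text advises.

```python
import json

def bulk_chunks(index, docs, max_bytes=5 * 1024 * 1024):
    """Build _bulk request bodies, starting a new chunk once the
    current one would exceed max_bytes."""
    chunk, chunk_size, out = [], 0, []
    for doc in docs:
        action = json.dumps({"index": {"_index": index}})
        source = json.dumps(doc)
        pair_size = len(action) + len(source) + 2  # two newlines
        if chunk and chunk_size + pair_size > max_bytes:
            out.append("\n".join(chunk) + "\n")  # _bulk needs final newline
            chunk, chunk_size = [], 0
        chunk.extend([action, source])
        chunk_size += pair_size
    if chunk:
        out.append("\n".join(chunk) + "\n")
    return out

# Fifty padded documents chunked under a deliberately small 1 KB budget.
docs = [{"n": i, "pad": "x" * 100} for i in range(50)]
bodies = bulk_chunks("books", docs, max_bytes=1024)
```

Chunking by bytes rather than by document count keeps request sizes predictable even when document sizes vary, which is exactly the concern raised at the start of this section.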