Optimizing search runtime

Search runtime speed is also a primary concern, and it can be optimized at various levels. When Solr fetches results for the queries passed by the user, you can limit the number of results fetched by specifying the rows parameter in your search. The following query returns 10 results starting at offset 10 (that is, results 11 to 20):

q=Scaling Big Data&rows=10&start=10

A related setting, queryResultWindowSize in solrconfig.xml, controls how many results are cached for each query, so that subsequent page requests within that window can be served from the query result cache.
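A minimal sketch of this setting inside the <query> section of solrconfig.xml (the value 20 is only an illustrative choice):

<!-- solrconfig.xml, inside the <query> section -->
<queryResultWindowSize>20</queryResultWindowSize>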

Let's look at various other optimizations possible in the search runtime.

Optimizing through search query

Whenever a query request is forwarded to a search instance, Solr can respond in various formats, such as XML or JSON. A typical Solr response contains not only the matched results but also facets, highlighted text, and many other elements used by the client (by default, the Velocity template-based client shipped with Solr). Such a response can be heavy, and it can be optimized by compressing the result. Compression, however, costs additional CPU time, which may affect response time and query performance; in return, it significantly reduces the size of the response that travels over the network.

Filter queries

A normal query on Solr performs the search and then applies a complex scoring mechanism to determine the relevance of each document in the results. A filter query performs the search and applies the filter without any scoring. A query can easily be converted into a filter query:

Normal query: q=name:Scaling Hadoop AND type:books
Filter query: q=name:Scaling Hadoop&fq=type:books

Because no scoring is performed, a filter query is faster than a normal query. In addition, since filter results do not depend on scoring, if the same filter is passed again and again, the results are returned directly from the filter cache.

Optimizing the Solr cache

Solr provides caching at various levels as a part of its optimization. For these levels, multiple cache implementations are available in Solr by default: LRUCache (least recently used, based on a synchronized LinkedHashMap), FastLRUCache (a concurrent least recently used cache), and LFUCache (least frequently used, based on ConcurrentHashMap). Among these, FastLRUCache is expected to be the fastest. These caches are associated with index searchers.

Note

Cache autowarming is a feature by which a new cache can pre-populate itself with objects from the cache of the previous (old) searcher.

These cache objects do not carry an expiry; they live as long as the index searcher they belong to is alive. The configuration for the different caches is specified in solrconfig.xml.

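A minimal sketch of such a configuration, for example for the filter cache and the query result cache, placed inside the <query> section of solrconfig.xml (the class choices and sizes are illustrative, not recommendations):

<query>
  <!-- caches the document sets matched by fq clauses -->
  <filterCache class="solr.FastLRUCache"
               size="512"
               initialSize="512"
               autowarmCount="128"/>
  <!-- caches the ordered top-N document IDs per query -->
  <queryResultCache class="solr.LRUCache"
                    size="512"
                    initialSize="512"
                    autowarmCount="32"/>
</query>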

The following parameters are common to these caches:

class: The cache implementation you wish to use, that is, LRUCache, FastLRUCache, or LFUCache.

size: The maximum number of entries the cache can hold.

initialSize: The initial size of the cache when it is created.

autowarmCount: The number of entries to seed from the old cache when a new searcher is opened.

minSize: Applicable to FastLRUCache. After the cache reaches its maximum size, it tries to reduce it to minSize. The default value is 90 percent of size.

acceptableSize: If FastLRUCache cannot reduce the cache to minSize when it reaches its peak, it will try to reach at least acceptableSize.

All caches are initialized when a new index searcher instance is opened. Let's look at the different caches in Solr and how you can use them to speed up your search.

The filter cache

This cache stores the documents matched by the filter queries passed to Solr. Each filter is cached separately; when a query uses several filters, the cache returns the matching set for each one and the system then intersects them according to the filtering criteria. If you use faceting, the filter cache can also improve its performance. The cache stores document IDs as an unordered set.
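For instance, a query that combines several filters caches each of them independently (the field names here are only illustrative):

q=name:Scaling Hadoop&fq=type:books&fq=inStock:true

Each fq clause is looked up in (and stored to) the filter cache on its own, so a later query that reuses type:books benefits from the cache even if its other filters differ.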

The query result cache

This cache stores the top N results for each query passed by the user, as an ordered set of document IDs. It is very effective for queries that are repeated. You can specify the maximum number of documents cached per entry in solrconfig.xml:

<queryResultMaxDocsCached>200</queryResultMaxDocsCached>

The document cache

This cache primarily stores the documents that are fetched from disk. Once a document is loaded into the cache, the search does not need to fetch it from disk again, reducing your overall disk I/O. This cache works on Solr's internal document IDs, so the autowarming feature has no real benefit here: the internal document IDs change whenever the index changes.

Note

The size of the document cache should be at least the maximum number of results per query multiplied by the maximum number of concurrent queries; this ensures that Solr does not have to refetch a document while a query is still being processed.
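A minimal sketch of such a declaration in solrconfig.xml, assuming at most 200 results per query and around 50 concurrent queries (both figures are illustrative):

<!-- size chosen as max results per query x max concurrent queries (200 x 50) -->
<documentCache class="solr.LRUCache"
               size="10000"
               initialSize="2048"
               autowarmCount="0"/>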

The field value cache

This cache is used mainly for faceting. If you use faceting regularly, it makes sense to enable caching at the field level. The cache can also be used for sorting, and it supports multivalued fields. You can monitor the cache status in the Solr administration console, which provides information such as the current load, hit ratio, and hits.

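The field value cache is created implicitly if it is not declared; a sketch of an explicit declaration in solrconfig.xml (the sizes are illustrative) would be:

<!-- showItems lists the top entries on the cache statistics page -->
<fieldValueCache class="solr.FastLRUCache"
                 size="512"
                 autowarmCount="128"
                 showItems="32"/>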

The lazy field loading

By default, Solr reads all stored fields of a document and then filters out the ones that are not needed. This becomes a performance overhead when there are a large number of fields. When this flag is set, only the fields that are requested are loaded immediately; the remaining fields are loaded lazily, which can significantly improve search speed. This is done by setting the following flag in solrconfig.xml:

<enableLazyFieldLoading>true</enableLazyFieldLoading>
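Lazy loading pays off when a query asks for only a subset of the stored fields, for example via the fl parameter (the field names below are illustrative):

q=Scaling Big Data&fl=id,name

Only id and name are materialized up front; other stored fields are read from the index only if something later requests them.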

In addition to these options, you can also define your own cache implementation.
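A generic, user-defined cache can be declared in the <query> section of solrconfig.xml; the sketch below uses a hypothetical cache name and regenerator class:

<cache name="myUserCache"
       class="solr.LRUCache"
       size="4096"
       initialSize="1024"
       autowarmCount="1024"
       regenerator="com.mycompany.MyCacheRegenerator"/>

Custom components can then look this cache up by name on the index searcher.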

Optimizing Hadoop

When running Solr with Hadoop for indexing (the Solr patches) or for search (Katta), optimizing Hadoop adds performance benefits to a big data search instance. Optimization can be done at the storage level, that is HDFS, as well as at the level of the MapReduce programs. Hadoop should preferably run on 64-bit machines, which allows administrators to go beyond the Java heap size limit of roughly 3 GB imposed by 32-bit JVMs. You should also give Hadoop user jobs a high priority in the scheduler.

When storing the indexes in a distributed environment such as Hadoop, keeping them in a compressed format improves both storage space and the memory footprint. This reduces disk I/O and the number of bytes transferred over the network, at the cost of CPU overhead for decompressing the data as and when it is needed. You can enable this by setting mapred.compress.map.output=true. Another interesting parameter is the HDFS block size. Since all indexes are stored in HDFS files, choosing an appropriate block size (dfs.block.size) is a great help. The number of MapReduce tasks can also be optimized based on the input size (the batch size of Solr documents for indexing/sharding). In the case of SOLR-1301, the output of the reduce tasks is passed to SolrOutputFormat, which calls SolrRecordWriter to write the data. After the reduce task completes, SolrRecordWriter calls commit() and optimize() to perform index merging.
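A minimal sketch of the two settings mentioned above, placed in mapred-site.xml and hdfs-site.xml respectively (the 128 MB block size is only an illustrative choice):

<!-- mapred-site.xml: compress intermediate map output -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>

<!-- hdfs-site.xml: HDFS block size in bytes (128 MB here) -->
<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
</property>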

There are additional parameters in mapred-site.xml that can add further value to these optimizations:

mapred.map.tasks.speculative.execution / mapred.reduce.tasks.speculative.execution: Hadoop jobs can become slow for various reasons, such as other processes consuming memory or misconfiguration, and the slowness is hard to detect. When such jobs take more time than expected, Hadoop launches a duplicate task as a backup; this is speculative execution of tasks. It is enabled by default and can be set to false for tasks that are expected to take a long time, such as indexing tasks.

mapred.tasktracker.map.tasks.maximum / mapred.tasktracker.reduce.tasks.maximum: These define the maximum number of map and reduce tasks a task tracker can run. Having more mappers/reducers than physical CPU cores causes CPU context switching, which may slow down overall job completion, whereas a balanced per-CPU configuration results in faster completion. Typically, the values should be driven by the number of cores and the available memory.

mapred.child.java.opts: This value can carry the heap size as a parameter, for example -Xmx64M. It should be driven by the amount of memory available and the maximum number of tasks per task tracker.

mapred.job.map.memory.mb / mapred.job.reduce.memory.mb: These set the virtual memory size for map and reduce tasks. Setting them to -1 uses the maximum amount of memory available.

mapred.jobtracker.maxtasks.per.job: Defines the maximum number of tasks for a single job. This can be set to -1 to allow the maximum number of tasks.

mapred.reduce.parallel.copies: This defines the number of threads used for parallel copying in the reduce task's shuffle phase. A very large number demands more memory and can exceed the heap size; a lower number helps balance the network traffic but slows down the overall transfers. The value is driven by network capacity; for gigabit Ethernet, it can be set between 10 and 15.

mapreduce.reduce.input.limit: This value sets a limit on the input size of the reducer. It can be set to -1, that is, no limit.

mapred.min.split.size: During execution, a map task is created for each slice/split. This parameter lets you control the size of each split. Setting it to 0 lets Hadoop determine the size.
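As a sketch, a couple of these settings could look as follows in mapred-site.xml (the values are only illustrative and should be tuned for the cluster):

<!-- mapred-site.xml -->
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>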

You can perform additional enhancements in core-site.xml:

io.sort.factor: When heavy output is expected from map jobs (particularly for large jobs), this value should be set higher than the default of 10. It defines the number of input files that get merged in a batch during the map/reduce task.

io.sort.mb: This defines the buffer size in megabytes for sorting. From experience, this value can be approximately 20 to 30 percent of the child heap size defined using mapred.child.java.opts. The default is 100 MB.
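A sketch of these two properties as the text places them in core-site.xml (the values are illustrative; io.sort.mb here assumes a child heap of roughly 1 GB):

<!-- core-site.xml -->
<property>
  <name>io.sort.factor</name>
  <value>25</value>
</property>
<property>
  <name>io.sort.mb</name>
  <value>200</value>
</property>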
