Optimizing search runtime

Search runtime speed is also a primary concern, and it can be optimized at various levels. When Solr fetches results for the queries passed by the user, you can limit the number of results fetched by specifying the rows parameter in your search. The following query returns 10 results starting at offset 10 (that is, results 11 to 20):

q=Scaling Big Data&rows=10&start=10

A related setting, queryResultWindowSize in solrconfig.xml, controls how many results are cached for each query, so that subsequent page requests within that window can be served from the query result cache.
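A minimal sketch of this setting inside the <query> section of solrconfig.xml (the value 20 is only an illustrative choice):

<!-- solrconfig.xml, inside the <query> section -->
<queryResultWindowSize>20</queryResultWindowSize>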

Let's look at various other optimizations possible in the search runtime.

Optimizing through search query

Whenever a query request is forwarded to a search instance, Solr can respond in various formats, such as XML or JSON. A typical Solr response contains not only the matched results but also facets, highlighted text, and many other elements used by the client (by default, the Velocity template-based client shipped with Solr). Such a response can be heavy, and it can be optimized by compressing the result. Compression, however, costs additional CPU time, which may affect response time and query performance; in return, it significantly reduces the size of the response that travels over the network.

Filter queries

A normal query on Solr performs the search and then applies a complex scoring mechanism to determine the relevance of each document in the results. A filter query performs the search and applies the filter without any scoring. A query can easily be converted into a filter query:

Normal query: q=name:Scaling Hadoop AND type:books
Filter query: q=name:Scaling Hadoop&fq=type:books

Because no scoring is performed, a filter query is faster than a normal query. In addition, since filter results do not depend on scoring, if the same filter is passed again and again, the results are returned directly from the filter cache.

Optimizing the Solr cache

Solr provides caching at various levels as a part of its optimization. For these levels, multiple cache implementations are available in Solr by default: LRUCache (least recently used, based on a synchronized LinkedHashMap), FastLRUCache (a concurrent least recently used cache), and LFUCache (least frequently used, based on ConcurrentHashMap). Among these, FastLRUCache is expected to be the fastest. These caches are associated with index searchers.

Note

Cache autowarming is a feature by which a new cache can pre-populate itself with objects from the cache of the previous (old) searcher.

These cache objects do not carry an expiry; they live as long as the index searcher they belong to is alive. The configuration for the different caches is specified in solrconfig.xml.

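A minimal sketch of such a configuration, for example for the filter cache and the query result cache, placed inside the <query> section of solrconfig.xml (the class choices and sizes are illustrative, not recommendations):

<query>
  <!-- caches the document sets matched by fq clauses -->
  <filterCache class="solr.FastLRUCache"
               size="512"
               initialSize="512"
               autowarmCount="128"/>
  <!-- caches the ordered top-N document IDs per query -->
  <queryResultCache class="solr.LRUCache"
                    size="512"
                    initialSize="512"
                    autowarmCount="32"/>
</query>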

The following parameters are common to these caches:

class: The cache implementation you wish to use, that is, LRUCache, FastLRUCache, or LFUCache.

size: The maximum number of entries the cache can hold.

initialSize: The initial size of the cache when it is created.

autowarmCount: The number of entries to seed from the old cache when a new searcher is opened.

minSize: Applicable to FastLRUCache. After the cache reaches its maximum size, it tries to reduce it to minSize. The default value is 90 percent of size.

acceptableSize: If FastLRUCache cannot reduce the cache to minSize when it reaches its peak, it will try to reach at least acceptableSize.

All caches are initialized when a new index searcher instance is opened. Let's look at the different caches in Solr and how you can use them to speed up your search.

The filter cache

This cache stores the documents matched by the filter queries passed to Solr. Each filter is cached separately; when a query uses several filters, the cache returns the matching set for each one and the system then intersects them according to the filtering criteria. If you use faceting, the filter cache can also improve its performance. The cache stores document IDs as an unordered set.
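For instance, a query that combines several filters caches each of them independently (the field names here are only illustrative):

q=name:Scaling Hadoop&fq=type:books&fq=inStock:true

Each fq clause is looked up in (and stored to) the filter cache on its own, so a later query that reuses type:books benefits from the cache even if its other filters differ.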

The query result cache

This cache stores the top N results for each query passed by the user, as an ordered set of document IDs. It is very effective for queries that are repeated. You can specify the maximum number of documents cached per entry in solrconfig.xml:

<queryResultMaxDocsCached>200</queryResultMaxDocsCached>

The document cache

This cache primarily stores the documents that are fetched from disk. Once a document is loaded into the cache, the search does not need to fetch it from disk again, reducing your overall disk I/O. This cache works on Solr's internal document IDs, so the autowarming feature has no real benefit here: the internal document IDs change whenever the index changes.

Note

The size of the document cache should be at least the maximum number of results per query multiplied by the maximum number of concurrent queries; this ensures that Solr does not have to refetch a document while a query is still being processed.
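A minimal sketch of such a declaration in solrconfig.xml, assuming at most 200 results per query and around 50 concurrent queries (both figures are illustrative):

<!-- size chosen as max results per query x max concurrent queries (200 x 50) -->
<documentCache class="solr.LRUCache"
               size="10000"
               initialSize="2048"
               autowarmCount="0"/>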

The field value cache

This cache is used mainly for faceting. If you use faceting regularly, it makes sense to enable caching at the field level. The cache can also be used for sorting, and it supports multivalued fields. You can monitor the cache status in the Solr administration console, which provides information such as the current load, hit ratio, and hits.

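The field value cache is created implicitly if it is not declared; a sketch of an explicit declaration in solrconfig.xml (the sizes are illustrative) would be:

<!-- showItems lists the top entries on the cache statistics page -->
<fieldValueCache class="solr.FastLRUCache"
                 size="512"
                 autowarmCount="128"
                 showItems="32"/>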

The lazy field loading

By default, Solr reads all stored fields of a document and then filters out the ones that are not needed. This becomes a performance overhead when there are a large number of fields. When this flag is set, only the fields that are requested are loaded immediately; the remaining fields are loaded lazily, which can significantly improve search speed. This is done by setting the following flag in solrconfig.xml:

<enableLazyFieldLoading>true</enableLazyFieldLoading>
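Lazy loading pays off when a query asks for only a subset of the stored fields, for example via the fl parameter (the field names below are illustrative):

q=Scaling Big Data&fl=id,name

Only id and name are materialized up front; other stored fields are read from the index only if something later requests them.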

In addition to these options, you can also define your own cache implementation.
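A generic, user-defined cache can be declared in the <query> section of solrconfig.xml; the sketch below uses a hypothetical cache name and regenerator class:

<cache name="myUserCache"
       class="solr.LRUCache"
       size="4096"
       initialSize="1024"
       autowarmCount="1024"
       regenerator="com.mycompany.MyCacheRegenerator"/>

Custom components can then look this cache up by name on the index searcher.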

Optimizing Hadoop

When running Solr with Hadoop for indexing (the Solr patches) or for search (Katta), optimizing Hadoop adds performance benefits to a big data search instance. Optimization can be done at the storage level, that is HDFS, as well as at the level of the MapReduce programs. Hadoop should preferably run on 64-bit machines, which allows administrators to go beyond the Java heap size limit of roughly 3 GB imposed by 32-bit JVMs. You should also give Hadoop user jobs a high priority in the scheduler.

When storing the indexes in a distributed environment such as Hadoop, keeping them in a compressed format improves both storage space and the memory footprint. This reduces disk I/O and the number of bytes transferred over the network, at the cost of CPU overhead for decompressing the data as and when it is needed. You can enable this by setting mapred.compress.map.output=true. Another interesting parameter is the HDFS block size. Since all indexes are stored in HDFS files, choosing an appropriate block size (dfs.block.size) is a great help. The number of MapReduce tasks can also be optimized based on the input size (the batch size of Solr documents for indexing/sharding). In the case of SOLR-1301, the output of the reduce tasks is passed to SolrOutputFormat, which calls SolrRecordWriter to write the data. After the reduce task completes, SolrRecordWriter calls commit() and optimize() to perform index merging.
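A minimal sketch of the two settings mentioned above, placed in mapred-site.xml and hdfs-site.xml respectively (the 128 MB block size is only an illustrative choice):

<!-- mapred-site.xml: compress intermediate map output -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>

<!-- hdfs-site.xml: HDFS block size in bytes (128 MB here) -->
<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
</property>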

There are additional parameters in mapred-site.xml that can add further value to these optimizations:

mapred.map.tasks.speculative.execution / mapred.reduce.tasks.speculative.execution: Hadoop jobs can become slow for various reasons, such as other processes consuming memory or misconfiguration, and the slowness is hard to detect. When such jobs take more time than expected, Hadoop launches a duplicate task as a backup; this is speculative execution of tasks. It is enabled by default and can be set to false for tasks that are expected to take a long time, such as indexing tasks.

mapred.tasktracker.map.tasks.maximum / mapred.tasktracker.reduce.tasks.maximum: These define the maximum number of map and reduce tasks a task tracker can run. Having more mappers/reducers than physical CPU cores causes CPU context switching, which may slow down overall job completion, whereas a balanced per-CPU configuration results in faster completion. Typically, the values should be driven by the number of cores and the available memory.

mapred.child.java.opts: This value can carry the heap size as a parameter, for example -Xmx64M. It should be driven by the amount of memory available and the maximum number of tasks per task tracker.

mapred.job.map.memory.mb / mapred.job.reduce.memory.mb: These set the virtual memory size for map and reduce tasks. Setting them to -1 uses the maximum amount of memory available.

mapred.jobtracker.maxtasks.per.job: Defines the maximum number of tasks for a single job. This can be set to -1 to allow the maximum number of tasks.

mapred.reduce.parallel.copies: This defines the number of threads used for parallel copying in the reduce task's shuffle phase. A very large number demands more memory and can exceed the heap size; a lower number helps balance the network traffic but slows down the overall transfers. The value is driven by network capacity; for gigabit Ethernet, it can be set between 10 and 15.

mapreduce.reduce.input.limit: This value sets a limit on the input size of the reducer. It can be set to -1, that is, no limit.

mapred.min.split.size: During execution, a map task is created for each slice/split. This parameter lets you control the size of each split. Setting it to 0 lets Hadoop determine the size.
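As a sketch, a couple of these settings could look as follows in mapred-site.xml (the values are only illustrative and should be tuned for the cluster):

<!-- mapred-site.xml -->
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>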

You can perform additional enhancements in core-site.xml:

io.sort.factor: When heavy output is expected from map jobs (particularly for large jobs), this value should be set higher than the default of 10. It defines the number of input files that get merged in a batch during the map/reduce task.

io.sort.mb: This defines the buffer size in megabytes for sorting. From experience, this value can be approximately 20 to 30 percent of the child heap size defined using mapred.child.java.opts. The default is 100 MB.
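A sketch of these two properties as the text places them in core-site.xml (the values are illustrative; io.sort.mb here assumes a child heap of roughly 1 GB):

<!-- core-site.xml -->
<property>
  <name>io.sort.factor</name>
  <value>25</value>
</property>
<property>
  <name>io.sort.mb</name>
  <value>200</value>
</property>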
