Search runtime speed is also a primary concern, so it should be optimized as well. You can perform optimization at various levels at runtime. When Solr fetches the results for a user's query, you can limit the number of results fetched by specifying the rows attribute in your search. The following query returns 10 results starting at offset 10 (that is, results 11 to 20):
q=Scaling Big Data&rows=10&start=10
A result window can also be configured in solrconfig.xml as queryResultWindowSize, which sets the number of results collected and cached for each query so that requests for nearby pages are served from the cache.
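As a sketch, this setting looks as follows in solrconfig.xml (the value 50 is only an illustrative choice, not a recommendation):

```xml
<!-- solrconfig.xml: cache a window of 50 results per query so that
     nearby pages (start=10, start=20, ...) are served from the cache -->
<queryResultWindowSize>50</queryResultWindowSize>
```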
Let's look at various other optimizations possible in the search runtime.
Whenever a query request is forwarded to a search instance, Solr can respond in various formats, such as XML or JSON. A typical Solr response contains not only information about the matched results, but also your facets, highlighted text, and many other things used by the client (by default, the Velocity template-based client provided by Solr). The response is therefore heavy, and it can be optimized by compressing the result. Compressing the result, however, costs additional CPU time, and this may impact response time and query performance; in exchange, it significantly reduces the response size that passes over the network.
A normal query on Solr will perform the search and then apply a complex scoring mechanism to determine the relevance of each document in the search results. A filter query on Solr will perform the search and apply the filter without any scoring mechanism. A query can easily be converted into a filter query:
Normal query: q=name:Scaling Hadoop AND type:books
Filter query: q=name:Scaling Hadoop&fq=type:books
Since the processing required for scoring is not needed, a filter query is faster than a normal query. Moreover, because scoring no longer applies, if the same filter is passed again and again, the results are returned directly from the filter cache.
Solr provides caching at various levels as part of its optimization. For caching at these levels, multiple implementations are available in Solr by default: LRUCache, a least recently used cache (based on a synchronized LinkedHashMap); FastLRUCache (based on a ConcurrentHashMap); and LFUCache, a least frequently used cache. Among these, FastLRUCache is expected to be the fastest. These caches are associated with index searchers.
These cache objects do not carry an expiry; they live as long as their index searcher is alive. The configuration for the different caches can be specified in solrconfig.xml as shown in the following example:
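A typical set of cache declarations looks like this (the sizes and counts shown are illustrative, not recommendations):

```xml
<!-- solrconfig.xml: example cache declarations; sizes are illustrative -->
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="32"/>
<documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
```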
The caches share a set of common parameters:

| Parameter | Description |
|---|---|
| class | The type of cache you wish to attach, that is, solr.LRUCache, solr.FastLRUCache, or solr.LFUCache. |
| size | The maximum size a cache can reach. |
| initialSize | The initial size of the cache when it is initialized. |
| autowarmCount | The number of entries to seed from an old cache. |
| minSize | Applicable to FastLRUCache; when the cache reaches its maximum size, it is reduced to this size. |
| acceptableSize | If a FastLRUCache cannot be reduced to minSize, it is reduced to at least this size. |
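To illustrate the FastLRUCache-specific parameters, a sketch with purely illustrative values:

```xml
<!-- solrconfig.xml: FastLRUCache is cleaned down to minSize when full;
     if that is not possible, it shrinks to at least acceptableSize -->
<filterCache class="solr.FastLRUCache"
             size="1024" initialSize="512"
             minSize="768" acceptableSize="896"
             autowarmCount="128"/>
```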
All caches are initialized when a new index searcher instance is opened. Let's look at the different caches in Solr and how you can utilize them to speed up your search.
This cache is responsible for storing the document sets for filter queries passed to Solr. Each filter is cached separately; when a query is filtered, this cache returns the matching sets, and the system then performs an intersection of them based on the filtering criteria. If you use faceting, the filter cache can improve performance. This cache stores document IDs in an unordered state.
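For example, a faceted search that filters on a field can reuse the cached filter on repeated requests (the field name type here follows the earlier example and is hypothetical):

```
q=Scaling Big Data&fq=type:books&facet=true&facet.field=type
```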
This cache stores the top N query results for each query passed by the user, as an ordered set of document IDs. For queries that are repeated, this cache is very effective. You can specify the maximum number of documents that can be cached per entry in solrconfig.xml:
<queryResultMaxDocsCached>200</queryResultMaxDocsCached>
This cache primarily stores the documents that are fetched from the disk. Once a document is loaded into the cache, the search does not need to fetch it from the disk again, reducing your overall disk I/O. This cache works on internal document IDs, so the autowarming feature has no real impact: the document IDs keep changing as and when the index changes.
This cache is used mainly for faceting. If you regularly use faceting, it makes sense to enable caching at the field level. This cache can also be used for sorting, and it supports multivalued fields. You can monitor the cache status in the Solr administration interface, which provides information such as current load, hit ratios, and hits.
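A sketch of an explicit field value cache declaration in solrconfig.xml (if omitted, Solr creates one internally; the size here is illustrative):

```xml
<!-- solrconfig.xml: cache backing faceting and sorting on multivalued fields -->
<fieldValueCache class="solr.FastLRUCache" size="512" autowarmCount="0"/>
```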
By default, Solr reads all stored fields and then filters out the ones that are not needed. This becomes a performance overhead with a large number of fields. When this flag is set, only the fields that are requested are loaded immediately; the rest are loaded lazily. This offers a significant improvement in search speed. It can be enabled with the following flag in solrconfig.xml:
<enableLazyFieldLoading>true</enableLazyFieldLoading>
In addition to these options, you can also define your own cache implementation.
When running Solr with Hadoop for indexing (Solr patches) or for search (Katta), optimizing Hadoop adds performance benefits to a big data search instance. The optimization can be done at the storage level, that is, HDFS, as well as at the level of the MapReduce programs. Hadoop should preferably run on 64-bit machines, allowing administrators to go beyond a Java heap size of 3 GB (the practical limit on 32-bit machines). You also need to set a high priority for Hadoop user jobs and the scheduler.
While storing the indexes in a distributed environment such as Hadoop, a compressed format improves the storage space as well as the memory footprint. This in turn reduces your disk I/O and the bytes transferred over the wire, at the cost of an overhead for decompressing the data as and when it is needed. You can enable this by setting mapred.compress.map.output=true. Another interesting parameter is the HDFS file block size. This needs to be defined well; considering the fact that all indexes are stored in HDFS files, defining an appropriate block size (dfs.block.size) will be a great help. The number of MapReduce tasks can also be optimized based on the input size (the batch size of Solr documents for indexing/sharding). In the case of SOLR-1301, the output of the reduce tasks is passed to SolrOutputFormat, which calls SolrRecordWriter for writing the data. After completing the reduce task, SolrRecordWriter calls commit() and optimize() to perform index merging. There are additional parameters that can definitely add value towards optimization in mapred-site.xml:
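As a sketch, the map-output compression discussed above can be configured in mapred-site.xml as follows (the codec choice is an assumption; any installed CompressionCodec works):

```xml
<!-- mapred-site.xml: compress intermediate map output -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <!-- assumed codec; GzipCodec and others are also available -->
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec</value>
</property>
```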
You can perform additional enhancements in core-site.xml:
| Parameter | Description |
|---|---|
| io.sort.factor | When heavy output is expected from map jobs (particularly for large jobs), this value should be set higher (the default is 10). It defines the number of input streams that get merged in a batch during the sort phase of a map/reduce task. |
| io.sort.mb | This defines the buffer size in megabytes used for sorting map output. From experience, this value can be approximately 20% to 30% of the child heap size defined using mapred.child.java.opts. |