Index optimization

The indexes used in Apache Solr are inverted indexes. With the inverted indexing technique, text is parsed and individual words (terms) are extracted from it. These terms are then stored as index entries, along with the locations of their appearances. For example, consider the following statements:

  1. "Mike enjoys playing on a beach"
  2. "Playing on the ground is a good exercise"
  3. "Mike loves to exercise daily"

The index with location information for all these sentences will look like the following (the numbers in brackets denote (sentence number, word number); note that the stopword "the" in the second sentence is dropped during analysis):

Mike     (1,1), (3,1)
enjoys   (1,2)
playing  (1,3), (2,1)
on       (1,4), (2,2)
a        (1,5), (2,5)
beach    (1,6)
ground   (2,3)
is       (2,4)
good     (2,6)
loves    (3,2)
to       (3,3)
exercise (2,7), (3,4)
daily    (3,5)

When you perform a delete on an inverted index, the document is not physically removed; it is only marked as deleted. Deleted documents are cleaned up only when the segment they belong to is merged. This is because index segments are write-once: after a segment is created, it should not be modified in place.
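For illustration, a delete can be issued through Solr's XML update format; the following is a sketch (the document ID and the field/value in the query are hypothetical):

```xml
<!-- Posted to /solr/update; marks matching documents as deleted.
     They are physically removed only on a later segment merge. -->
<delete>
  <id>doc-101</id>                  <!-- delete by unique key (hypothetical ID) -->
  <query>category:obsolete</query>  <!-- or delete everything matching a query -->
</delete>
```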

Limiting indexing buffer size

As the index size grows, the Solr instance starts using more CPU time and memory to perform operations such as faceted search. When the indexes are first created, the overall operation runs in batch mode. All the documents are kept in memory until their size exceeds the RAM buffer size specified in solrconfig.xml:

<ramBufferSizeMB>100</ramBufferSizeMB>

Once the size is exceeded, Solr creates a new segment or merges the index with the current segment. The default value of the RAM buffer size is 100 MB (Solr 1.4 onwards). Similarly, there is another parameter that controls the maximum number of documents that Solr buffers while indexing:

<maxBufferedDocs>1000</maxBufferedDocs>

When the indexed documents cross either of the limits defined (the RAM buffer size or the maximum number of buffered documents), Solr flushes the changes. You can also control the maximum number of threads used for indexing by tuning maxIndexingThreads; the default value is 8. Setting this parameter appropriately for your usage can speed up the indexing process: clients can then connect concurrently to the search server and upload data using multiple threads. SolrJ provides the ConcurrentUpdateSolrServer class for this purpose.
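Taken together, these settings live in the indexConfig section of solrconfig.xml; a sketch follows, using the values already mentioned in the text as illustrations:

```xml
<!-- solrconfig.xml: indexing buffer settings (illustrative values) -->
<indexConfig>
  <ramBufferSizeMB>100</ramBufferSizeMB>     <!-- flush when buffered docs exceed 100 MB... -->
  <maxBufferedDocs>1000</maxBufferedDocs>    <!-- ...or when 1,000 docs are buffered -->
  <maxIndexingThreads>8</maxIndexingThreads> <!-- threads allowed to write to the index -->
</indexConfig>
```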

The commit frequency has to be chosen carefully. Frequent commit operations eat more CPU/IO time, whereas infrequent commits demand more memory on your instance, because more uncommitted changes are buffered.

When to commit changes?

Commit is the operation that ensures that all the updates/uploads to Solr are stored on the disk. With Solr, you can perform commit in the following different ways:

  • Automatic (hard) commit
  • Soft commit

When autocommit is enabled, documents uploaded to Apache Solr get written to storage automatically once a configured condition is met. In a cluster environment, a hard commit also replicates the indexes across all the nodes. The condition is either a maximum time (maxTime) or a maximum number of documents (maxDocs) after which a commit should take place. Choosing relatively low values for these works well in an environment with continuous index updates, but it incurs a significant performance penalty for batch updates in a distributed environment. At the same time, a high value for maxTime or maxDocs poses a higher risk of losing indexed documents in case of failures.
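A sketch of a hard autocommit definition in solrconfig.xml follows; the maxTime and maxDocs values are illustrative assumptions, not recommendations:

```xml
<!-- solrconfig.xml: hard autocommit inside the update handler
     (illustrative values) -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>15000</maxTime>  <!-- commit at most every 15 seconds... -->
    <maxDocs>10000</maxDocs>  <!-- ...or after 10,000 uncommitted documents -->
    <openSearcher>false</openSearcher> <!-- do not open a new searcher on each hard commit -->
  </autoCommit>
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str> <!-- transaction log directory, used for recovery -->
  </updateLog>
</updateHandler>
```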

There is also an option called openSearcher in the update handler definition of solrconfig.xml. When this value is set to true, a new searcher is initialized after the changes are committed to storage, which lets users see the newly committed changes in their search results immediately. The update handler also has updateLog, a transaction log that enables the recovery of updates in case of failures and thereby supports durability.

Note

To achieve maximum durability on a Solr instance, it is recommended to set hard commit limits that keep the size of the update log bounded.

Similar to the hard commit is the soft commit. A soft commit is a faster alternative that, unlike a hard commit, only makes the index changes visible to searches. It does not perform any sync of indexes across nodes, so in case of a power failure of the machine, the changes made using soft commit are lost. With soft commits, Solr can achieve near real-time search capabilities. You should set the soft commit maxTime to be lower than the hard commit maxTime.

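A sketch of such a soft commit setting in solrconfig.xml follows (the one-second value is an illustrative assumption):

```xml
<!-- solrconfig.xml: soft autocommit for near real-time search.
     Keep this maxTime lower than the hard commit maxTime. -->
<autoSoftCommit>
  <maxTime>1000</maxTime> <!-- make changes searchable within about a second -->
</autoSoftCommit>
```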

Solr also allows you to pass the commit request in your update request itself.
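For example, the XML update format accepts a commitWithin attribute on an add request; the following is a sketch (the document fields are hypothetical):

```xml
<!-- Posted to /solr/update; asks Solr to commit within 5 seconds of this add -->
<add commitWithin="5000">
  <doc>
    <field name="id">doc-101</field>         <!-- hypothetical unique key -->
    <field name="title">Sample title</field> <!-- hypothetical field -->
  </doc>
</add>
```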

Optimizing index merge

While creating index segments, Solr functions as follows:


Solr keeps the newly updated index in the most recent segment; if the segment is filled up, it will create a new segment. Solr performs a merge of segments as and when the number of lowest-level segments is equal to mergeFactor, specified in the Solr configuration file. In such a case, it merges all the segments into one. Consider the following case:

<mergeFactor>20</mergeFactor>

The segments are merged when the number of lowest-level segments equals 20, and this process continues. mergeFactor directly impacts both your search query time and your indexing time. If you have a high mergeFactor, your index creation process will be faster, as it rarely needs to merge indexes; however, for a search, Solr has to look into multiple files in the file store. If you have a low mergeFactor, it will slow down your indexing process due to the need to perform merges over huge indexes, but the search will be relatively faster as it has to look at fewer files.
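In solrconfig.xml, mergeFactor sits in the indexConfig section; a sketch follows (the values are illustrative assumptions, not recommendations):

```xml
<!-- solrconfig.xml: segment merge settings (illustrative values) -->
<indexConfig>
  <mergeFactor>10</mergeFactor>           <!-- merge when 10 lowest-level segments accumulate -->
  <maxMergeDocs>2147483647</maxMergeDocs> <!-- optional cap on documents per merged segment -->
</indexConfig>
```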

Optimize option for index merging

When this option is called, Solr runs the index merge operation and forces all the index segments to be merged into a single segment. This is an expensive operation that reads and rewrites all the indexes of Solr. It impacts the functioning of the search instance, so it is recommended to run this operation when there is little or no load on the instance. It provides additional attributes such as waitFlush (blocks the instance until the index changes are flushed to disk), waitSearcher (blocks until a new searcher with all the changes visible is made available), and maxSegments (optimizes the index down to at most the given number of segments). Solr also allows you to call optimize through a URL call itself:

curl 'http://localhost:8983/solr/update?optimize=true&maxSegments=2&waitFlush=false'
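The same operation can also be expressed as an XML update message; the following is a sketch (the attribute values are illustrative):

```xml
<!-- Posted to /solr/update; forces a merge down to at most two segments
     without blocking on a new searcher -->
<optimize maxSegments="2" waitSearcher="false"/>
```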

While running in a SolrCloud environment, you should be careful about running optimize (a forced merge) yourself; instead, you can rely on Solr to perform the optimization through the partial merges it carries out in the background.

Optimizing the container

Most big data implementations, including Solr and Hadoop, run inside a J2EE container on some JDK. While scaling your instance for more data and more indexes, it becomes important to optimize your containers as well, to ensure that you get optimal high-speed performance out of the system. Choosing the right JVM is therefore a very important factor. There are many JVMs available in the market today that can be considered, such as Oracle Java HotSpot, BEA JRockit, and open source JVMs. You can look at comparisons between different JVMs at http://en.wikipedia.org/wiki/Comparison_of_Java_virtual_machines. Apache Solr allows you to run multiple Solr instances, each with its own JVM. The Zing JVM from Azul Systems is considered a high-performance JVM for Solr/Lucene implementations.

Optimizing concurrent clients

You can control the number of concurrent connections that can be made in your container. This in turn reduces traffic on your instance, which may be running in a standalone/distributed environment.

In the Tomcat server, you can change the number of concurrent connections by modifying the Connector element in server.xml.

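A sketch of such a Connector entry follows (the attribute values are illustrative assumptions, not recommendations):

```xml
<!-- Tomcat server.xml: limit concurrent connections (illustrative values).
     maxThreads is the maximum number of worker threads serving requests;
     acceptCount is the queue length used once all threads are busy. -->
<Connector port="8080" protocol="HTTP/1.1"
           maxThreads="200"
           acceptCount="100"
           connectionTimeout="20000"/>
```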

Similarly, in Jetty, you can control the number of connections held by modifying the thread pool settings in jetty.xml.

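A sketch of the corresponding jetty.xml fragment follows (Jetty 8-style syntax, as shipped with Solr 4; the thread counts are illustrative assumptions):

```xml
<!-- jetty.xml: bound the server thread pool (illustrative values) -->
<Configure id="Server" class="org.eclipse.jetty.server.Server">
  <Set name="ThreadPool">
    <New class="org.eclipse.jetty.util.thread.QueuedThreadPool">
      <Set name="minThreads">10</Set>
      <Set name="maxThreads">200</Set>
    </New>
  </Set>
</Configure>
```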

Optimizing Java virtual memory

One of the key optimization factors is controlling the virtual memory size of your big data Solr instance. This is applicable to instances running in the distributed environment, as well as the instances running as a standalone search instance. As your big data search instance scales with the data size, it requires more and more memory and it therefore becomes important to optimize accordingly. Apache Solr has an in-built cache, which should be one of the factors considered for optimization. Since both Hadoop and Solr run on JVMs, one has to look at the optimization of Java Virtual Machine (JVM).

All Solr instances run inside the J2EE container as applications, and all the common optimizations for applications are applicable to it. It starts with choosing the right heap size for your JVM. The heap size for JVM can be controlled by the following parameters.

Parameter   Description
-Xms        Minimum heap size required with which the container is initialized
-Xmx        Maximum heap size up to which the container is allowed to grow

When you choose a low minimum heap size, the initialization of the application itself may take more time. Conversely, a higher minimum heap size may unnecessarily reserve a huge memory segment that could be useful for your other processes; however, it reduces the number of calls to resize the heap when the heap fills up, since the heap holds more memory from the start. Similarly, a low maximum heap size may cause your application to fail mid-run with OutOfMemory errors for large indexes/objects of your search. When providing the memory size for the JVM, you need to ensure that you keep sufficient memory for your operating system and other processes, so as to avoid pushing them into thrashing. In a production environment, it is better to keep the minimum and maximum heap sizes the same, to avoid the overhead of heap resizing.
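For a Tomcat-hosted Solr, these flags are commonly supplied through the JAVA_OPTS environment variable; the following is a sketch (the 2 GB figure and the setenv.sh location are assumptions, not recommendations):

```shell
# setenv.sh (Tomcat, hypothetical): fix the minimum and maximum heap to the
# same size, avoiding heap-resize overhead in production (illustrative value)
export JAVA_OPTS="-Xms2048m -Xmx2048m"
```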

Note

When you are running optimized Solr instances in the container, it is recommended not to install any other applications on the same container, so as to minimize the CPU time and the memory getting distributed among Solr and these applications.

When the heap is full, the JVM tries to grab more memory, up to the limit set by the -Xmx parameter. Before doing that, it performs garbage collection. Garbage collection in the JVM is the process by which the JVM reclaims memory consumed by objects that are unused, expired, or no longer referenced by any of your application processes running in memory. Today's JVMs trigger the garbage collection process automatically as and when needed. The process can also be triggered explicitly from application code through a System.gc() call, which cleans up the garbage. Such explicit calls to garbage collection should be avoided for the following reasons:

  • You have no control over when the garbage collection process runs relative to your search/indexing operations.
  • When the garbage collection process runs, it takes up CPU time and memory, which impacts the overall functioning of the search.
  • The heap size influences the running time of the garbage collection process. A larger heap makes the garbage collector take more time to identify and clean up unreferenced objects. Newer releases of Java (1.7 onwards) include some optimizations to garbage collection.
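Rather than calling System.gc(), garbage collection behavior is normally tuned with JVM flags at container startup; as a sketch (the CMS collector flags below are assumptions appropriate to Java 6/7-era JVMs, not recommendations):

```shell
# setenv.sh (Tomcat, hypothetical): request the concurrent CMS collector to
# shorten stop-the-world pauses during search/indexing (illustrative flags)
export JAVA_OPTS="$JAVA_OPTS -XX:+UseConcMarkSweepGC -XX:+UseParNewGC"
```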

If you are using Solr faceting, or features such as sorting, you will require more memory. The operating system performs memory swapping based on the needs of its processes, which can create huge latencies in searches over large indexes. Many operating systems allow users to control the swapping behavior of programs.
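On Linux, for instance, the kernel's tendency to swap can be reduced through the vm.swappiness setting in /etc/sysctl.conf (a sketch; the value 1 is an assumption commonly used to discourage swapping without disabling it):

```
# /etc/sysctl.conf: discourage the kernel from swapping out Solr's memory
vm.swappiness = 1
```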
