Speeding up indexing with Solr segment merge tuning

During indexing, Solr (actually Lucene) creates a series of new index files called segments. Each segment is written once and read many times, which means that once it is written, it cannot be changed (although some data can still be updated, such as deleted-document markers or numeric doc values). After some time, Solr will try to merge multiple small segments into bigger ones, because the more segments the index is built of, the slower the queries will be. Of course, we can force a segment merge (by running the force merge command), but such an operation is resource intensive, because Lucene has to rewrite the index segments. Because of that, Solr allows us to tune the segment merge process to match our needs, and this recipe will show you how to do that.

How to do it...

The merge policy is what controls how merges are done in Apache Lucene and thus in Solr. By default, the merge policy is not explicitly defined in Solr and reasonable defaults are used. This means that by default, Solr will use org.apache.lucene.index.TieredMergePolicy:

  1. The first and only step is to add the merge policy configuration to our solrconfig.xml file. The following section should be added to the indexConfig section of that file:
    <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
     <int name="maxMergeAtOnce">30</int>
     <int name="segmentsPerTier">30</int>
     <int name="maxMergedSegmentMB">20000</int>
    </mergePolicy>

Now, after restarting Solr, segment merges should happen less often than before. The price is that the index will usually consist of more segments than we were seeing previously.

How it works...

Let's start by saying what a segment merge is. As we know, a Lucene index is built of pieces called segments. Each segment is a write-once, read-many structure, which means that once written, it can't be altered. Each segment is also a miniature Lucene index by itself. The segment merging process builds a new, larger segment from two or more smaller ones. The new segment contains the merged information from the old segments. Remember that when a document is deleted, Lucene only marks it as deleted and doesn't remove it from the segment itself. It is during the segment merge that deleted documents are physically removed, so the newly created segment contains no deleted documents.
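The merge behavior described above can be illustrated with a small Python sketch. This is a toy model, not Lucene code: the Doc class and the merge_segments function are hypothetical names invented for the illustration, and a segment is represented as a plain immutable tuple of documents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for an indexed document."""
    doc_id: int
    deleted: bool = False  # a delete only sets this marker


def merge_segments(segments):
    """Build one new segment from several old ones, skipping deleted docs.

    The input segments are left untouched (write once, read many).
    """
    return tuple(doc for seg in segments for doc in seg if not doc.deleted)


seg_a = (Doc(1), Doc(2, deleted=True))  # doc 2 is only marked as deleted
seg_b = (Doc(3), Doc(4))

merged = merge_segments([seg_a, seg_b])
print([d.doc_id for d in merged])  # → [1, 3, 4]
```

Note that seg_a still holds the deleted document after the merge. In this model, as in the description above, the document only disappears from the newly built segment.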

Now let's get back to our Solr configuration. First of all, we changed the solrconfig.xml file to include an explicitly defined merge policy. As I already mentioned, the merge policy is the logic that is responsible for telling Lucene when to start segment merging. We decided to use the default org.apache.lucene.index.TieredMergePolicy merge policy; you can read more about it at http://lucene.apache.org/core/4_10_0/core/org/apache/lucene/index/TieredMergePolicy.html. In general, Lucene will divide the segments into tiers of similar size and will try to merge the segments within a tier.

We decided to set the maxMergeAtOnce property to 30. This tells Lucene how many segments should be merged at once during the normal merge process (by normal we mean not forced by calling Solr's optimize command). We also set the segmentsPerTier property to 30. This property tells Lucene how many segments are allowed per tier. The default value is 10. Smaller values mean more merges but fewer segments, while higher values mean less frequent merges at the cost of more segments being present in the index. The value should be equal to or higher than the maxMergeAtOnce property; otherwise, Lucene is forced to merge too often. Finally, we set the maxMergedSegmentMB property to 20000 (which translates to about 20 GB). This property specifies the maximum segment size Lucene is allowed to produce during the normal merge process. If a merge would result in a segment larger than this value, the merge policy will merge fewer segments to stay within the limit.
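To get a feel for the trade-off between the number of merges and the number of segments, here is a deliberately simplified simulation in Python. It is only a sketch under strong assumptions and not the actual TieredMergePolicy algorithm: every flush creates one segment on the lowest tier, a tier that exceeds segments_per_tier merges up to max_merge_at_once segments into a single segment on the next tier, and segment sizes (and thus maxMergedSegmentMB) are ignored. The simulate function and its parameter names are hypothetical.

```python
def simulate(flushes, segments_per_tier, max_merge_at_once):
    """Toy tier model; returns (peak_segment_count, merge_count)."""
    tiers = [0]  # tiers[i] = number of segments currently on tier i
    merges = 0
    peak = 0
    for _ in range(flushes):
        tiers[0] += 1  # every flush adds one small segment to the lowest tier
        level = 0
        while tiers[level] > segments_per_tier:
            # Too many segments on this tier: merge a batch into one
            # bigger segment that lands on the next tier.
            batch = min(max_merge_at_once, tiers[level])
            tiers[level] -= batch
            if level + 1 == len(tiers):
                tiers.append(0)
            tiers[level + 1] += 1
            merges += 1
            level += 1
        peak = max(peak, sum(tiers))
    return peak, merges


high_values = simulate(1000, segments_per_tier=30, max_merge_at_once=30)
low_values = simulate(1000, segments_per_tier=3, max_merge_at_once=3)
print("30/30 -> (peak segments, merges):", high_values)
print("3/3   -> (peak segments, merges):", low_values)
```

In this model, the higher values lead to far fewer merges but a higher peak segment count, which matches the behavior described above. Real indexes will behave differently in detail, because Lucene chooses merges based on segment sizes.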

There's more...

There are two more things I would like to mention.

Increasing the RAM buffer size to improve the indexing throughput

In addition to tuning segment merging, we can also modify the ramBufferSizeMB property and increase it from the default 100 MB to 512 MB. The value of this property controls the amount of memory Lucene can use to store documents before they are flushed to disk (or rather to the Directory implementation). If your documents are large, the default ramBufferSizeMB value may result in many small segments being created, because the buffer fills up quickly and has to be flushed often.

However, remember that you need to have enough memory for Solr to be able to work with such a buffer size. To change the ramBufferSizeMB value, you need to add the following section to the solrconfig.xml file (in the indexConfig section):

<ramBufferSizeMB>512</ramBufferSizeMB>
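As a rough back-of-the-envelope check of why a larger buffer produces fewer initial segments, the following Python sketch estimates the number of flushes. It assumes that the buffer is filled completely before every flush and that nothing else (such as a commit) triggers a flush earlier. The estimated_flushes function and the figure of 10,000 MB of buffered data are hypothetical and used only for illustration.

```python
import math


def estimated_flushes(total_buffered_mb, ram_buffer_mb):
    """Estimate how many segments get flushed while indexing the given data."""
    return math.ceil(total_buffered_mb / ram_buffer_mb)


print(estimated_flushes(10_000, 100))  # → 100
print(estimated_flushes(10_000, 512))  # → 20
```

Fewer flushed segments mean less work for the merge policy later on, which is where the indexing throughput gain comes from.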

Speeding up querying with merge policy tuning

Of course, allowing more segments in the index to speed up indexing is not the only option. We can also lower the number of segments the index is built of and gain slightly higher query performance at the cost of more frequent and more intensive merging. To do this, we can try the following merge policy configuration, in which the additional maxMergeAtOnceExplicit property sets how many segments can be merged at once during a forced merge:

<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
 <int name="maxMergeAtOnce">3</int>
 <int name="segmentsPerTier">3</int>
 <int name="maxMergeAtOnceExplicit">30</int>
</mergePolicy>
