Optimizing a single Solr server – scale up

Solr gives you a large number of options for enhancing the performance of a specific Solr instance, and for most of them, whether to change anything depends on the specific performance result you are trying to tune for. This section is ordered from the most generally useful optimizations to the more specialized ones.

Configuring JVM settings to improve memory usage

Solr runs inside a Java Virtual Machine (JVM), an environment that abstracts your Java-based application from the underlying operating system. JVM performance improves with every release, so use the latest version. There are many parameters you can tune the JVM with. However, most of them are "black magic", and changing them from the defaults can quickly cause problems if you don't know what you're doing. Additionally, the folks who write JVMs spend a lot of time coming up with sophisticated algorithms, which means the JVM will usually tune itself better than you can. However, there is a fairly simple configuration change that most Java server applications benefit from (not just Solr): set the initial and maximum heap memory allocated to the JVM to the same value and specify that you are running a server application, so the JVM can tune its optimization strategy for a long-running process:

java -Xms2G -Xmx2G -server -jar start.jar

Of course, the question now is how much memory should be allocated to the Java heap. If you specify too little, then you run the risk of getting an OutOfMemoryException. If you specify the largest practical value, that is, the actual memory you have less some for the operating system and other processes, that is a suboptimal configuration too. Operating systems use available memory as a cache for disk access, and Solr searches benefit substantially from this, while indexing does not, especially since Solr 4.0 was released. I recommend measuring how much heap you need by picking some high value, running a full battery of representative queries against Solr so that all its caches get filled, and then using JConsole to perform a full garbage collection. At that point, you can see how much memory Solr is using, and you can add roughly 20 percent of breathing room to that figure.

Tip

With 8 GB of RAM available, I typically set Solr to use 4 GB. I don't go above 6 or 8 GB without a very good reason because the impact of garbage collection starts to become very noticeable. Add -XX:+PrintGCApplicationStoppedTime to your startup options to see GC pauses.
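
For example, a startup line combining the earlier heap settings with that GC logging flag might look like the following; the 4 GB figure is just the illustrative value from this tip:

java -Xms4G -Xmx4G -server -XX:+PrintGCApplicationStoppedTime -jar start.jar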

The ultimate figure is of course highly dependent on the size of your Solr caches and other aspects of the Solr configuration; therefore, tuning the heap size should be one of the later steps.

Note

Jump forward to Chapter 11, Deployment, for a discussion about enabling JMX to work with JConsole.

Using MMapDirectoryFactory to leverage additional virtual memory

If you have plenty of virtual memory relative to your index size, then using memory-mapped I/O via MMapDirectoryFactory should be faster than StandardDirectoryFactory for interacting with the filesystem on 64-bit JVMs. This is set via the <directoryFactory /> tag in solrconfig.xml, and is chosen by default on 64-bit Solaris and Linux JVMs. The memory used is outside of the Java heap, so you do not need to modify any JVM startup options. This is one of the reasons that even if you have 32 or 64 GB of memory, your Solr instance may only be set to use 8 GB.
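
If you want to pin the choice explicitly rather than rely on the default selection, the tag in solrconfig.xml looks roughly like this sketch:

<directoryFactory name="DirectoryFactory"
                  class="solr.MMapDirectoryFactory"/>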

Enabling downstream HTTP caching to reduce load

Solr has great support for using HTTP caching headers to enable downstream HTTP software to cache results. Frequently, the same search is issued over and over, even though the results are always the same. Placing an intermediate caching server, such as Squid, in front of Solr should reduce the load on Solr and potentially reduce Solr's internal "query cache" requirements, thus freeing up more RAM. When a request uses certain caching headers, Solr can indicate whether the content has changed by sending back either an HTTP 200 status code if it has, or a 304 Not Modified code when the content hasn't changed since the last time the request asked for it.

In order to specify that you want Solr to do HTTP caching, you need to configure the <httpCaching/> stanza in solrconfig.xml. By default, Solr is configured to never return a 304 code, instead it always returns a 200 response (a normal non-cached response) with the full body of the results. In ./examples/configsets/mbtype/solrconfig.xml, uncomment the "production" httpCaching stanza and restart Solr:

<httpCaching lastModifiedFrom="openTime" etagSeed="SolrMusicBrainz" never304="false">
  <cacheControl>max-age=43200, must-revalidate</cacheControl>
</httpCaching>

We have specified that sending back 304 messages is okay. We have also specified in cacheControl that the maximum time to store responses is 43,200 seconds, which is half a day. Through must-revalidate, we've specified that any shared cache, such as a Squid proxy, needs to check back with Solr once max-age has expired to see whether anything has changed, which acts as an extra check.

Tip

During development, leave never304="true" to ensure that you are always looking at the results of fresh queries and aren't misled by cached results, unless you are using ETags and the browser properly honors them.

By running curl with the mbartists core, we can see additional cache-related information in the header. For your typing convenience, these curl commands are available in ./examples/10/http_cache_commands.txt:

>>curl -v "http://localhost:8983/solr/mbartists/mb_artists?q=Smashing+Pumpkins"
< HTTP/1.1 200 OK
< Cache-Control: max-age=43200, must-revalidate
< Expires: Tue, 08 Oct 2013 05:42:20 GMT
< Last-Modified: Mon, 07 Oct 2013 17:42:14 GMT
< ETag: "NDgwMDAwMDAwMDAwMDAwMFNvbHJNdXNpY0JyYWlueg=="

So, let's look at what we get back if we pass a last modified header specifying that we downloaded the content after the previously returned last modified time:

>>curl -v -z "Mon, 07 Oct 2013 17:42:15 GMT" http://localhost:8983/solr/mbartists/mb_artists?q=Smashing+Pumpkins
* About to connect() to localhost port 8983 (#0)
*   Trying ::1... connected
* Connected to localhost (::1) port 8983 (#0)
> GET /solr/mbartists/select/?q=Smashing+Pumpkins HTTP/1.1
> Host: localhost:8983
> Accept: */*
> If-Modified-Since: Mon, 07 Oct 2013 17:42:14 GMT
>
< HTTP/1.1 304 Not Modified
< Cache-Control: max-age=43200
< Expires: Tue, 08 Oct 2013 05:45:23 GMT
< Last-Modified: Mon, 07 Oct 2013 17:42:14 GMT
< ETag: "NDgwMDAwMDAwMDAwMDAwMFNvbHJNdXNpY0JyYWlueg=="

Specifying an If-Modified-Since time just one second after the Last-Modified time means that Solr gives us back a 304 Not Modified code and doesn't have to execute the search nor send a large response to the client, leading to a faster response time and less load on the server.

Entity tags are a newer mechanism that is more robust and flexible than the Last-Modified date. An ETag is a string that identifies a specific version of a component. In the case of Solr, ETags are generated by combining the current version of the index with the etagSeed value. Every time the index is modified, the current ETag value will change. If we add the fake artist "The Eric Band" to the mbartists index and then run our previous query, we'll see that the ETag has changed because the version of the Solr index has changed:

>>curl 'http://localhost:8983/solr/mbartists/update?commit=true' -H "Content-Type: text/xml" --data-binary '<add><doc><field name="a_name">The Eric Band</field><field name="id">Fake:99999</field><field name="type">Artist</field></doc></add>'

>>curl -v -z "Tue, 03 May 2011 09:36:36 GMT" http://localhost:8983/solr/mbartists/select/?q=Smashing+Pumpkins
>
< HTTP/1.1 304 Not Modified
< Cache-Control: max-age=43200
< Expires: Sat, 07 May 2011 02:17:02 GMT
< Last-Modified: Fri, 06 May 2011 14:16:55 GMT
< ETag: "NTMyMzQwMzhmNDgwMDAwMFNvbHJNdXNpY0JyYWlueg=="
< Server: Jetty(6.1.3)

To take advantage of the HTTP protocol level caching supplied by Solr, you need to make sure your client respects the caching directives returned by Solr. Two very popular caches that understand ETags are Varnish (http://www.varnish-cache.org) and Squid (http://www.squid-cache.org).

Tip

Remember, the fastest query possible from Solr's perspective is the query that it doesn't have to make!

Solr caching

Caching is a key part of what makes Solr fast and scalable, and the proper configuration of caches is a common topic on the solr-user mailing list! Solr uses multiple in-memory caches. The caches are associated with individual index searchers, which represent a snapshot view of the data. Following a commit, new index searchers are opened and then auto-warmed. Auto-warming is when the cached queries of the former searcher are rerun to populate the new searcher. Following auto-warming, predefined searches are run as configured in solrconfig.xml. Put some representative queries in the newSearcher and firstSearcher listeners, particularly queries that sort on fields. Once complete, the new searcher will begin servicing new incoming requests.

Tip

Each auto-warming query and predefined search increases the commit time, so make sure those searches are actually increasing the cache hit ratio, and don't overdo it!

There are a number of different caches configured in solrconfig.xml, which are as follows:

  • filterCache: This stores unordered lists of documents that match a query. It is primarily used for storing filter queries (the fq parameter) for reuse, but it's also used in faceting under certain circumstances. It is arguably the most important cache. The filter cache can optionally be used for queries (the q parameter) that are not score-sorted if useFilterForSortedQuery is enabled in solrconfig.xml. However, unless testing reveals performance gains, it is best left disabled—the default setting.
  • queryResultCache: This stores ordered lists of document IDs from searches. The order is defined by any sorting parameters passed. This cache should be large enough to store the results of the most common searches, which you can identify by looking at your server logs. This cache doesn't use much memory, as only the IDs of the documents are stored in it. The queryResultWindowSize setting allows you to preload document IDs into the cache if you expect users to request documents that fall within the ordered list. So, if a user asks for products 20 through 29, then there is a good chance they will next look for 30 through 39. If queryResultWindowSize is 50, then the initial request will cache the first 50 document IDs. When the user asks for 30 through 39, Solr will retrieve the cached data and won't have to access the Lucene indexes.
  • documentCache: This caches field values that have been defined in schema.xml as being stored, so that Solr doesn't have to go back to the filesystem to retrieve the stored values. Fields are stored by default.

    Tip

    The documented wisdom on sizing this cache is to be larger than the max results * max concurrent queries being executed by Solr to prevent documents from being re-fetched during a query. As this cache contains the fields being stored, it can grow large very quickly.

These caches are all configured the same way, which is explained as follows:

  • class: This specifies the cache implementation's Java class name. Solr comes with LRUCache, FastLRUCache, and LFUCache. The current wisdom is that for caches that don't have a high hit ratio, and therefore have more churn, you should use LRUCache, because the cache is evicting content frequently. If you have a high hit ratio, then the benefits of FastLRUCache kick in because it doesn't require a separate thread for managing the removal of unused items. You want a high hit ratio to get the most out of FastLRUCache because storing data in it is slower, as the calling thread is responsible for making sure that the cache hasn't grown too large.
  • size: This defines the maximum items that the cache can support and is mostly dependent on how much RAM is available to the JVM.
  • autowarmCount: This specifies how many items should be copied from an old searcher to a new one during the auto-warming process. Set the number too high and you slow down commits; set it too low and the new searches following those commits won't gain the benefits of the previously cached data. Look at the warmupTime statistic on Solr's admin screen to see how long the warm-up takes. There are some other options, such as initialSize, acceptableSize, minSize, showItems, and cleanupThread, specific to FastLRUCache, but specifying these is uncommon. There is a wealth of specific information available on the wiki at http://wiki.apache.org/solr/SolrCaching that covers this topic. A sample cache entry combining these attributes is shown after this list.
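
As a sketch of how these attributes come together, here is a filterCache entry for solrconfig.xml; the sizes shown are illustrative starting points, not recommendations:

<filterCache class="solr.FastLRUCache"
             size="512"
             initialSize="512"
             autowarmCount="128"/>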

Tuning caches

By monitoring the Plugins / Stats admin page for the caches, you can get a sense of how large you need to make them. If the hit ratio for your caches is low, then it may be that they aren't caching enough to be useful. However, if you find that the caches have a significant number of evictions, then it implies that they are filling up too quickly and need to be made larger. Caches can be increased in size as long as Solr has sufficient RAM to operate in.

Tip

If your hit ratio for a cache is very low, then you should consider shrinking its size, and perhaps turning it off altogether by commenting out the cache configuration sections in solrconfig.xml. This will reduce the memory footprint and may help improve performance by removing the overhead of checking the caches and auto-warming the caches during commits.

Indexing performance

There are several aspects of Solr tuning that increase indexing performance. We'll start with optimizing the schema, then look at sending data to Solr in bulk, and then finish with Lucene's merge factor and optimization.

Designing the schema

Good schema design is probably one of the most important things you can do to enhance the scalability of Solr. You should refer to Chapter 2, Schema Design, for a refresher on many of the design questions that are inextricably tied to scalability. The easiest schema issue to examine for maximizing scalability is, "Are you storing the minimum information you need to meet the needs of your users?" There are a number of attributes in your schema's field definitions that determine what gets indexed and stored; they are discussed here:

  • indexed: Some fields are purely there to be returned in search results—not to be searched (used in a query like q or fq). For such fields, set indexed="false". If the field is not needed for search but is needed for faceting or a variety of other features, then see the last bullet, docValues, instead.
  • stored: Storing field values in the index simplifies and speeds up search results because results need not be cross-referenced and retrieved from original sources. It is also required for features such as highlighting. However, storing field values will obviously increase the index size and indexing time. A quick way to see what fields are stored is to do a simple search with fl=* as a parameter; the fields in the result are the stored fields. You should only store fields that you actually display in your search results or need for debugging purposes. It is likely that your index has some data repeated but indexed differently for specialized indexing purposes such as faceting or sorting—only one of those, if any, needs to be stored.
  • docValues: If a field is sorted, faceted, or used in a function query, grouping, collapsing, or stats, then you should usually set docValues="true". Those features require either indexed or docValues, but docValues is ideal. So-called "DocValues" data reduces Java heap memory requirements by memory-mapping field data for certain features, and reduces commit latency, making it almost a necessity for real-time search. A brief example combining these attributes follows the tip below.

    Tip

    If you need faster indexing, see if you can reduce the text analysis you perform in schema.xml to just what you need, for example, if you are certain the input is plain ASCII text, then don't bother mapping accented characters to ASCII equivalents.
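
As a hypothetical sketch (the field names here are made up for illustration), a facet-only field and a display-only field might be declared in schema.xml like this:

<!-- used only for faceting and sorting; never searched or displayed -->
<field name="a_genre" type="string" indexed="false" stored="false" docValues="true"/>
<!-- returned in search results only; never searched -->
<field name="a_image_url" type="string" indexed="false" stored="true"/>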

Sending data to Solr in bulk

Indexing documents into Solr is often a major bottleneck due to the volume of data that needs to be indexed initially compared to the pace of ongoing updates. The best way to speed up indexing is to index documents in batches. Solr supports sending multiple documents in a single add operation, and this will lead to a drastic speedup in performance.

However, as the size of your individual documents increases, performance may start to decrease. A reasonable rule of thumb is to do document add operations in batches of 10 for large documents and in batches of 100 for small documents.
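
For instance, using the same XML update format as the earlier curl example, a batched add simply places multiple <doc> elements inside one <add>; the documents shown here are fake ones for illustration:

>>curl 'http://localhost:8983/solr/mbartists/update?commit=true' -H "Content-Type: text/xml" --data-binary '<add><doc><field name="id">Fake:100001</field><field name="a_name">Band One</field><field name="type">Artist</field></doc><doc><field name="id">Fake:100002</field><field name="a_name">Band Two</field><field name="type">Artist</field></doc></add>'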

To see the impact of batching, I indexed some data using the script examples/10/batch/simple_test.rb and documented the time it took. Take a look at the following table:

Scenario                                             Time
Single process adding documents one at a time        24m13.391s
Single process adding documents in batches of 100    5m43.492s
Single process adding documents in batches of 500    5m51.322s

You can see that moving from sending one document at a time to batches of 100 resulted in an almost five-fold reduction in time. However, beyond a certain point, increasing the batch size doesn't decrease the overall time; instead, it may increase it.

Tip

SolrJ can load data the fastest

The fastest client approach for loading data into Solr is SolrJ's ConcurrentUpdateSolrServer class. It places documents to be added into a queue that is consumed by multiple threads, each with a separate HTTP connection to Solr. Furthermore, it uses the compact javabin format. Because of its asynchronous nature, you must override the ConcurrentUpdateSolrServer.handleError() method to implement a callback that responds to errors. Also, with this client, you needn't add documents in batches, as doing so will only waste memory.
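
A minimal SolrJ sketch along those lines might look like the following; the core URL, queue size, thread count, and document values are illustrative only:

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkLoader {
  public static void main(String[] args) throws Exception {
    // Queue up to 1000 documents and drain the queue with 4 threads,
    // each holding its own HTTP connection to Solr.
    ConcurrentUpdateSolrServer server =
        new ConcurrentUpdateSolrServer("http://localhost:8983/solr/mbartists", 1000, 4) {
          @Override
          public void handleError(Throwable ex) {
            // Called asynchronously when an update fails; log or retry here.
            ex.printStackTrace();
          }
        };

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "Fake:100003");        // fake document for illustration
    doc.addField("a_name", "Another Fake Band");
    doc.addField("type", "Artist");
    server.add(doc);             // returns quickly; no need to batch manually

    server.blockUntilFinished(); // wait for the queue to drain
    server.commit();             // make the documents visible
    server.shutdown();           // release threads and connections
  }
}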

Disabling unique key checking

By default, if you specify a uniqueKey for your schema, when indexing content, Solr checks the uniqueness of the document being indexed so that you don't end up with multiple documents sharing the same primary key. If you know you have unique keys and don't have those documents in the index when doing a bulk load of data, then you can disable this check. For an update request in any format supported by Solr, add overwrite=false as a request parameter in the URL.
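
For example, with the XML update format via curl, the parameter is simply appended to the update URL; the document shown is a fake one for illustration:

>>curl 'http://localhost:8983/solr/mbartists/update?overwrite=false&commit=true' -H "Content-Type: text/xml" --data-binary '<add><doc><field name="id">Fake:100004</field><field name="a_name">Yet Another Fake Band</field><field name="type">Artist</field></doc></add>'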

Index optimization and mergeFactor settings

When you add documents to Lucene, they first get added to an in-memory write buffer that has a limited size; see ramBufferSizeMB in solrconfig.xml. When the buffer gets full, or if a commit happens (including when Solr shuts down gracefully), the buffer is flushed into a new Lucene segment on disk. A segment comprises about 11 files or so, and it's read-only. Deleted documents get flagged as such but aren't actually purged right away. As the number of segments increases, Lucene periodically merges them together into larger segments, which purges deleted documents as a side effect. A key setting controlling this is mergeFactor in solrconfig.xml, which is basically how many segments get merged into one at a time.

Note

Check out this great blog post (with illustrated video) at http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html by Mike McCandless, the author of Lucene in Action, that visualizes what happens during segment merging. This really helped me understand the behavior of Solr during commits.

The rule of thumb is that the more static your content is (that is, the less frequently you need to commit data), the lower the merge factor you want. If your content is changing frequently, or if you have a lot of content to index, then a higher merge factor allows for faster indexing throughput at the expense of search performance. So, if you have infrequent index updates, then a low merge factor of 2 will result in fewer segments, which leads to faster searching. However, if you expect to have large indexes, significantly above 10 GB, then a higher merge factor like 20 will help with the indexing time; just dial it back once you are done with bulk indexing.
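
Both of these settings live in the <indexConfig> section of solrconfig.xml; the values below are roughly the stock defaults and are shown only as a sketch:

<indexConfig>
  <ramBufferSizeMB>100</ramBufferSizeMB>
  <mergeFactor>10</mergeFactor>
</indexConfig>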

After indexing a lot of documents (or perhaps at off-peak hours), it's sometimes beneficial to issue an optimize command. Optimize forces the merging of Lucene's segments down to one (or whatever the optional maxSegments parameter specifies), which increases search performance and also purges wasted space left by deleted documents.
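
One way to issue it, sketched here against the mbartists core, is through the update handler; the maxSegments parameter is optional:

>>curl 'http://localhost:8983/solr/mbartists/update?optimize=true&maxSegments=2'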

Note

You can see the number of segments on the Overview screen's Statistics section. You can also issue an optimize from this screen.

Optimizing your index is no longer quite as important as it used to be, and indeed the optimize command might eventually be renamed to forceMerge to make it less attractive to invoke by unsuspecting users. Optimization consumes significant CPU, temporary disk space, disk I/O, and will mean index replication must replicate larger segments, so it's not something to invoke often. If you do optimize, consider setting maxSegments as a trade-off.

Tip

Consider having two strategies for indexing your content: the first used during bulk loads, which minimizes commits and merging to allow for the highest indexing throughput possible, and the second used during day-to-day routine operations, which indexes documents and commits as needed to make them visible, and does some merging to keep the segment count in check (via either a low mergeFactor or optimize).

Enhancing faceting performance

There are a few items to look at when ensuring that faceting performs well. First of all, faceting and filtering (the fq parameter) go hand in hand, so monitor the filter cache to ensure that it is adequately sized; the filter cache is used for faceting itself as well. In particular, any facet.query or facet.range based facets will store an entry for each facet count returned. You should ensure that the resulting facets are as reusable as possible from query to query. For example, it's probably not a good idea to have direct user input involved in either a facet.query or in fq because of the variability. As for dates, try to use fixed intervals that don't change often, or round off NOW relative dates to a chunkier interval (for example, NOW/DAY instead of just NOW). For text faceting (for example, facet.field), the filterCache is not used unless you explicitly set facet.method to enum, which you should do when the total number of distinct values in the field is somewhat small, say less than 50. Finally, you should add representative faceting queries to firstSearcher in solrconfig.xml so that when Solr executes its first user query, the relevant caches are already warmed up.
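
A sketch of such a warming listener in solrconfig.xml follows; the query, filter, and facet field shown are hypothetical stand-ins for whatever your application actually facets and filters on:

<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="fq">type:Artist</str>
      <str name="facet">true</str>
      <str name="facet.field">a_type</str>
    </lst>
  </arr>
</listener>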

Using term vectors

A term vector is a list of terms resulting from the text analysis of a field's value. It optionally contains the term frequency, document frequency, and numerical offset into the text. Without them, the same information can be derived at runtime but that's slower. While disabled by default, enabling term vectors for a field in schema.xml enhances:

  • MoreLikeThis queries, assuming that the field is referenced in mlt.fl and the input document is a reference to an existing document (that is, not externally passed in)
  • Highlighting search results with the standard or FastVector highlighter

Enabling term vectors for a field increases the index size and indexing time, and isn't strictly required to perform MoreLikeThis queries or highlight search results; however, if you are using these features, then the performance gained is typically worth the longer indexing time and greater index size.
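
Term vectors are enabled per field in schema.xml; a sketch for a hypothetical track title field (the field and type names are illustrative) looks like this:

<field name="t_name" type="title" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>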

Improving phrase search performance

For indexes reaching a million documents or more, phrase searches can be slow. If you are using the automatic phrase boosting features of the DisMax query parser (excellent for relevancy), then more phrase queries are occurring than you may be aware of. What slows down phrase searches is the presence of terms in the phrase that show up in a lot of documents. To ameliorate this problem, particularly common and uninteresting words such as "the" can be filtered out through a stop filter. But this thwarts searches for a phrase such as "to be or not to be" and prevents disambiguation in other cases where these words, despite being common, are significant. Besides, as the size of the index grows, this is just a band-aid for performance, as there are plenty of other words that are common yet shouldn't be filtered out.

Shingling (sometimes called word-grams) is a clever solution to this problem, which combines pairs of consecutive terms into one so-called shingle. The original terms still get indexed, but only the shingles are used in phrase queries. Shingles naturally have a very low frequency relative to single words. Consider the text "The quick brown fox jumped over the lazy dog". The use of shingling in a typical configuration would yield the indexed terms (shingles) "the quick", "quick brown", "brown fox", "fox jumped", "jumped over", "over the", "the lazy", and "lazy dog" in addition to all of the original nine terms. Since so many more terms are indexed, naturally there is a commensurate increase in indexing time and resulting index size. Common-grams is a more selective variation of shingling that only shingles when one of the consecutive words is in a configured list. Given the preceding sentence using an English stop word list, the indexed terms would be "the quick", "over the", "the lazy", and the original nine terms.

Note

As a side benefit, these techniques also improve search relevancy since the TF and IDF factors are using coarser units (the shingles) than the individual terms.

In our MusicBrainz dataset, there are nearly seven million tracks, and that is a lot! These track names are ripe for either shingling or common-grams. Despite the high document count, the documents are small, and so the actual index is only a couple of gigabytes. Both approaches are quite plausibly appropriate, given different trade-offs. Here is a variation of the MusicBrainz title field called title_commonGrams:

<fieldType name="title_commonGrams" class="solr.TextField" positionIncrementGap="100"">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <filter class="solr.CommonGramsFilterFactory" 
            words="commongrams.txt" ignoreCase="true"/>"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <filter class="solr.CommonGramsQueryFilterFactory" 
            words="commongrams.txt" ignoreCase="true""/>
  </analyzer>
</fieldType>

Note

Notice that the filter's class name varies from index to query time, which is very unusual.

To come up with a list of common words for common-grams, use your stop words and add some of the Top Terms listed in Solr's schema browser for the field in question as a guide. You could try a more sophisticated methodology, but this is a start. Shingle filters go in the same position, but they are configured a little differently:

<!-- index time …-->
<filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
<!-- query time -->
<filter class="solr.ShingleFilterFactory" 
  maxShingleSize="2" outputUnigrams="false" 
  outputUnigramsIfNoShingles="true"/>

You might choose to save additional index space and search performance by adding a stop filter after shingling or common-grams for typical stop-words so long as they don't need to be searchable by themselves. This wasn't done here for MusicBrainz song titles since we didn't feel it was worth it on name data.

Evaluating the search performance improvement of shingling proved to be tricky in the limited time I gave it. Some rough (nonscientific) testing showed that a search for "Hand in my Pocket" against the shingled field versus the nonshingled field was two to three times faster. I've seen very compelling search performance numbers for common-grams from others online, but I didn't evaluate it myself.

Shingling and common-grams increase phrase search performance at the expense of indexing speed and disk use. In the following table, I present the relative cost of these two techniques on the track name field in MusicBrainz compared to the cost of doing typical text analysis. These percentages may look high and might scare you away, but remember that these are based purely on one aspect (the index portion) of one field. The overall index time and disk use didn't change dramatically, not even with shingling. You should try it on your data to see the effects.

 

                Indexing time increase %    Disk space increase %
Common-grams    220%                        25%
Shingling       450%                        110%

Tip

The use of either is recommended for most applications

Given that shingling takes over five times as long and uses twice as much disk space, it makes more sense on small-to-medium scale indexes, where phrase boosting is used to improve relevancy. Common-grams is a relative bargain for all other applications.
