Commit, optimize, and rollback the transaction log

Data sent to Solr is not immediately searchable, nor do deletions take immediate effect. Like a database, changes must be committed. There are two types of commits:

  • Hard commit: This is expensive because it flushes the changes to the filesystem (making them persistent) and has a significant performance impact. It is triggered by the <autoCommit> option in solrconfig.xml or by adding the commit=true request parameter to a Solr update URL.
  • Soft commit: This is less expensive because it makes changes visible to searches without flushing them to disk, but it is not persistent. It is triggered by the <autoSoftCommit> option in solrconfig.xml, by passing softCommit=true along with the commit parameter, or by using the commitWithin parameter.
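The two commit modes are typically configured together in solrconfig.xml. The following is a sketch of what such a configuration might look like; the interval values are illustrative assumptions, not recommendations:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit: flush to disk at most every 60 seconds (persistent, expensive).
       With openSearcher=false, this only persists data without opening a new searcher. -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- Soft commit: make changes searchable every second (cheap, not persistent) -->
  <autoSoftCommit>
    <maxTime>1000</maxTime>
  </autoSoftCommit>
</updateHandler>
```

In a setup like this, hard commits handle durability while the cheaper soft commits handle search visibility.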

A commit can be issued in the same request that contains the data to be indexed, or in a separate, empty request; either works. For example, you can visit this URL to issue a commit on our mbreleases core: http://localhost:8983/solr/mbreleases/update?commit=true. You can also commit changes using the XML syntax by simply sending this to Solr:

<commit />

There are three important things to know about commits that are unique to Solr:

  • Commits are slow. Depending on the index size and disk hardware, Solr's auto-warming configuration, and Solr's cache state prior to committing, a commit can take a considerable amount of time. With a lot of warming configured, it can take a number of minutes in extreme cases. To learn how to decrease this time, read about real-time search in Chapter 10, Scaling Solr.
  • There is no transaction isolation. This means that if more than one Solr client were to submit modifications and commit them at overlapping times, it is possible for part of one client's set of changes to be committed before that client told Solr to commit. This applies to rollback as well. If this is a problem for your application, then consider using one client process that is responsible for updating Solr.
  • Simultaneous commits should be avoided, particularly more than two. The problem actually pertains to simultaneous query warming, which is the final and lengthy part of a commit. Solr will use a lot of resources, and it might even yield an error indicating that there is too much simultaneous warming, though the commit will eventually still have its effect.

When you are bulk-loading data, these concerns are not an issue, since you're going to issue a final commit at the end. But if Solr is asynchronously updated by independent clients in response to changed data, commits could come too quickly and might overlap. To address this, Solr has two similar features: autoCommit and commitWithin. The first refers to a snippet of XML configuration that is commented out in solrconfig.xml, with which Solr will automatically commit at a document-count threshold or a time-lapse threshold (measured from the oldest uncommitted document). In this case, Solr itself handles committing, and so your application needn't send commits. commitWithin is a similar time-lapse option that is set by the client, either on the <add commitWithin="…"> element or the <commit commitWithin="…"/> element of an XML-formatted update message, or as a request parameter by the same name. It ensures that a commit occurs within the specified number of milliseconds. Here's an example of a 30-second commit window:

<commit commitWithin="30000"/>
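commitWithin can also be set on the add message itself, in which case it applies to the documents in that request. A sketch follows; the field names and values are placeholders, not part of the mbreleases schema:

```xml
<add commitWithin="30000">
  <doc>
    <field name="id">12345</field>
    <field name="name">Some Release</field>
  </doc>
</add>
```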

Since Solr 4.0, commitWithin performs a soft commit, which prevents slaves from replicating the changes in a master/slave configuration (replication is triggered by hard commits). However, this default behavior can be overridden in solrconfig.xml by enabling the forceHardCommit option, which makes commitWithin perform hard commits.

Don't overlap commits

During indexing, you may find that you are starting to see this error message:

<h2>HTTP ERROR: 503</h2><pre>Error opening new searcher. exceeded limit of maxWarmingSearchers=2, try again later.</pre>

Every time a commit happens, a new searcher is created, which invokes the searcher warm-up process that populates the caches, and that can take a while. While you could bump up the maxWarmingSearchers parameter in solrconfig.xml, you shouldn't: you could still hit the new limit, and, worse, memory requirements can soar and the system will slow down while multiple searchers warm. So, you need to ensure that commits aren't happening concurrently, or, if you must, that no more than two overlap. If you see this problem, use autoCommit or the commitWithin parameter instead of issuing explicit commits. In both cases, choose a time window that is long enough for a commit to finish.
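For reference, the limit mentioned in the error message is set in solrconfig.xml, typically in the <query> section. Raising it is possible, but as explained above, the default is usually best left alone:

```xml
<!-- Maximum number of searchers that may be warming concurrently.
     Requests that would exceed this limit fail with a 503 error. -->
<maxWarmingSearchers>2</maxWarmingSearchers>
```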

Tip

commitWithin is preferable to autoCommit

The commitWithin feature is preferable to the autoCommit feature in solrconfig.xml because the latter is global and can't be disabled.

Index optimization

Lucene's index is internally composed of one or more segments. When a buffer of indexed documents gets flushed to the disk, it creates a new segment. Deletes get recorded in another file, but they go to disk too. Sometimes, after a new segment is written, Lucene will merge some of them together. When Lucene has just one segment, it is in an optimized state. The more segments there are, the more query performance will degrade. Of course, optimizing an index comes at a cost; the larger your index is, the longer it will take to optimize. Finally, an optimize command implies commit semantics. You may specify an optimize command in all the places you specify a commit. So, to use it in a URL, try this: http://localhost:8983/solr/mbreleases/update?optimize=true. For the XML format, simply send this:

<optimize />

We recommend explicitly optimizing the index at an opportune time, such as after a bulk load of data and/or at a daily interval during off-peak hours, if there are low-volume, sporadic updates to the index. Chapter 10, Scaling Solr has a tip on optimizing to more than one segment if the optimizes are taking too long.
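If optimizing all the way down to a single segment takes too long, the optimize command accepts a maxSegments parameter to merge down to a target segment count instead. For example, to merge to at most 16 segments (the value here is only an illustration):

```xml
<optimize maxSegments="16"/>
```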

Both commit and optimize commands take two additional Boolean options that default to true:

<optimize waitFlush="true" waitSearcher="true"/>

If you were to set these to false, then the commit and optimize commands would return immediately, even though the operation hasn't actually finished yet. So, if you wrote a script that commits with these at their false values and then executes a query against Solr, you might find that the search does not yet reflect the changes. By waiting for the data to flush to the disk (waitFlush) and waiting for a new searcher to be ready to respond to changes (waitSearcher), this circumstance is avoided. Setting these options to false is useful when executing an optimize command from a script that simply wants to optimize the index and otherwise doesn't care when newly added data becomes searchable.
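A fire-and-forget optimize, with both options disabled as just described, looks like this:

```xml
<optimize waitFlush="false" waitSearcher="false"/>
```

The same attributes apply to the <commit/> element.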

Note

No matter how long a commit or optimize command takes, Solr still executes searches concurrently—there is no read lock. However, query latency may be impacted.

Rolling back an uncommitted change

There is one final indexing command to discuss—rollback. All uncommitted changes can be canceled by sending Solr the rollback command either via a URL parameter such as http://localhost:8983/solr/mbreleases/update?rollback=true or with the following XML code:

<rollback />

The transaction log

When the transaction log (tlog) is enabled via the updateLog feature in solrconfig.xml, Solr writes the raw documents into tlog files for recovery purposes. Transaction logs are used for near real-time (NRT) get, durability, and SolrCloud replication recovery.

To enable tlogs, simply add the following code to your updateHandler configuration:

<updateLog>
  <str name="dir">${solr.ulog.dir:}</str>
</updateLog>

Here, dir represents the target directory for transaction logs. This defaults to the Solr data directory.

Tip

If you don't need the NRT get feature and you are not using SolrCloud, you can safely comment out the updateLog section in solrconfig.xml. For more details about NRT get, see https://cwiki.apache.org/confluence/display/solr/RealTime+Get.
