Solr indexing configuration

Once the schema has been defined, it's time to configure and tune the indexing process by means of another file that resides in the same directory as the schema: solrconfig.xml.

The file contains many sections, but fortunately most of them are optional and come with default values that work well in most scenarios. We will highlight the ones that matter most for this chapter.

As a general note, it's possible to use system properties and default values within this file. This lets us write dynamic expressions like this:

<dataDir>${my.data.dir:/var/data/defaultDataDir}</dataDir>

The value of the dataDir element will be replaced at runtime with the value of the my.data.dir system property, or with the default value of /var/data/defaultDataDir if that property doesn't exist.

General settings

The first part of the solrconfig.xml file contains general settings that are not strictly related to the indexing phase.

The first is the Lucene match version:

<luceneMatchVersion>LUCENE_47</luceneMatchVersion>

This tells Solr which Lucene version's behavior its internal components (for example, analyzers) should adhere to. This is useful when migrating to newer versions of Solr, because it allows backward compatibility with indexes built with previous versions.

A second piece of information you can set here is the data directory, that is, the directory where Solr will create and manage the index. It defaults to a directory called data under $SOLR_HOME.

<dataDir>/var/data/defaultDataDir</dataDir>

Index configuration

The <indexConfig> section contains many settings that you can configure in order to fine-tune the Solr indexing phase.

A curious thing you can see in this section, in the solrconfig.xml file of the example core, is that most settings are commented out. This is important, because it means that Solr provides good default values for them.

The following list summarizes the settings you will find within the <indexConfig> section:

  • writeLockTimeout: The maximum allowed time to wait for a write lock on an IndexWriter.
  • maxIndexingThreads: The maximum allowed number of threads that index documents in parallel. Once this threshold has been reached, incoming requests will wait until there's an available slot.
  • useCompoundFile: If this is set to true, Solr will use a single compound file to represent the index. The default value is false.
  • ramBufferSizeMB: When accumulated document updates exceed this memory threshold (in megabytes), all pending updates are flushed.
  • maxBufferedDocs: This has the same behavior as the previous setting, but the threshold is defined as a count of document updates.
  • mergePolicy: The name of the class, along with its settings, that defines and implements the merge strategy.
  • mergeFactor: A threshold indicating how many segments an index is allowed to have before they are merged into one segment. Each time an update is made, it is added to the most recent index segment. When that segment fills up (that is, when either the maxBufferedDocs or the ramBufferSizeMB threshold is reached), a new segment is created and subsequent updates are inserted there. Once the number of segments reaches this threshold, Solr will merge all of them into one segment.
  • mergeScheduler: The class that is responsible for controlling how merges are executed.
  • lockType: The lock type used by Solr to indicate that a given index is already owned by an IndexWriter.
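
To make these settings more concrete, the following is a minimal sketch of an <indexConfig> section, assuming Solr 4.x syntax. The values are illustrative assumptions rather than recommendations, and TieredMergePolicy and ConcurrentMergeScheduler are just one possible choice of classes. Any element can be left out to keep Solr's default.

```xml
<indexConfig>
  <!-- Illustrative values only; omit an element to keep Solr's default -->
  <writeLockTimeout>1000</writeLockTimeout>
  <maxIndexingThreads>8</maxIndexingThreads>
  <useCompoundFile>false</useCompoundFile>

  <!-- Pending updates are flushed when either threshold is reached -->
  <ramBufferSizeMB>100</ramBufferSizeMB>
  <maxBufferedDocs>1000</maxBufferedDocs>

  <!-- Merge strategy: the class plus its own settings -->
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <int name="maxMergeAtOnce">10</int>
    <int name="segmentsPerTier">10</int>
  </mergePolicy>

  <!-- Controls how merges are executed -->
  <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>

  <lockType>native</lockType>
</indexConfig>
```

The sketch leaves out mergeFactor on purpose: the merge policy settings shown here already express the same kind of threshold, so declaring both would be redundant.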

Update handler and autocommit feature

The <updateHandler> section configures the component that is responsible for handling requests to update the index.

This is where you can tell Solr to periodically run commits on its own, so that clients don't need to issue them explicitly while indexing. Two different thresholds can trigger auto-commits:

  • maxDocs: The maximum number of documents added since the last commit
  • maxTime: The maximum amount of time (in milliseconds) that may pass after a document is added before a commit is triggered

They are not mutually exclusive, so it's perfectly legal to have settings such as these:

<autoCommit>
  <maxDocs>5000</maxDocs>
  <maxTime>300000</maxTime>
</autoCommit>

Starting from Solr 4.0, there are two kinds of commit. A hard commit flushes the uncommitted documents to the index, therefore creating and changing segments and data files on the disk. The other type is called soft commit, which doesn't actually write uncommitted changes but just reopens the internal Solr searcher in order to make uncommitted data in the memory available for searches.

Hard commits are expensive, but after their execution, data is permanently part of the index. Soft commits are fast but transient, so in case of a system crash, changes are lost.

Hard and soft commits can coexist in a Solr configuration. The following is an example that shows this:

<autoCommit>
  <maxTime>900000</maxTime>
</autoCommit>
<autoSoftCommit>
  <maxTime>1000</maxTime>
</autoSoftCommit>

Here, a soft commit will be triggered every second (1000 milliseconds), and a hard commit will run every 15 minutes (900000 milliseconds).
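
In solrconfig.xml, both blocks live inside the <updateHandler> section. The sketch below shows one possible arrangement for Solr 4.x. The <updateLog> element and the <openSearcher> flag are not covered in the text above; they are included here as assumptions about a typical setup, in which the transaction log lets Solr replay updates that were not yet hard-committed and openSearcher=false leaves the job of making changes visible to the soft commits.

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Assumption: transaction log enabled so that updates which were
       not yet hard-committed can be replayed after a crash -->
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>

  <!-- Hard commit every 15 minutes, without opening a new searcher -->
  <autoCommit>
    <maxTime>900000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>

  <!-- Soft commit every second, making recent changes searchable -->
  <autoSoftCommit>
    <maxTime>1000</maxTime>
  </autoSoftCommit>
</updateHandler>
```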

RequestHandler

A RequestHandler instance is a pluggable component that handles incoming requests. It is configured in solrconfig.xml as a specific endpoint by means of its name attribute.

Requests sent to Solr can belong to several categories: search, update, administration, and stats. In this context, we are interested in those handlers that are in charge of handling index update requests. Although not mandatory, those handlers are usually associated with a name starting with the /update prefix, for example, the default handler you will find in the configuration:

<requestHandler name="/update" class="solr.UpdateRequestHandler"/>

Prior to Solr 4, each kind of input format (for example, JSON or XML) required a dedicated handler to be configured. Now the general-purpose update handler, /update, uses the content type of the incoming request to detect the format of the input data. The following list shows the built-in content types:

  • application/xml, text/xml: XML messages
  • application/json, text/json: JSON messages
  • application/csv, text/csv: Comma-separated values
  • application/javabin: Documents in Solr's binary javabin format (Java clients only)

Each format has its own way of encoding the kind of update operation (for example, add, delete, and commit) and the input documents. This is a sample add command in XML:

<add>
  <doc>
    <field name="id">12020</field>
    <field name="title">Round around midnight</field>
  </doc>
  …
</add>
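
For comparison, the other update operations could be expressed in XML as follows. This is only a sketch: each top-level element is meant to be sent as a separate request body, and the document ID and the query are hypothetical values chosen to match the add example.

```xml
<!-- Delete a single document by its unique key -->
<delete>
  <id>12020</id>
</delete>

<!-- Delete every document that matches a query -->
<delete>
  <query>title:midnight</query>
</delete>

<!-- Commit the pending changes -->
<commit/>
```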

Later, we will index some data using different techniques and different formats.

UpdateRequestProcessor

Solr developers designed the write path of the index process with modularity and extensibility in mind. Specifically, the index process is structured as a chain of responsibility, where each component adds its own contribution to the whole process.

The UpdateRequestProcessor chain is an important configurable aspect of the index process. If you want to declare your custom chain, you need to add a corresponding section within the configuration. This is an example of a custom chain:

<updateRequestProcessorChain name="my-index-chain">
  <processor class="…"/>
  <processor class="…">
    <str name="aParameterName">aParameterValue</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
</updateRequestProcessorChain>

Defining a new chain requires a name and a set of UpdateRequestProcessorFactory components that are in charge of creating processor instances for that chain.

Note

Defining the chain is not enough, though. It must also be enabled, that is, associated with a RequestHandler, in the following way (chain.name stands for the name you gave to your chain):

<requestHandler name="/myReqHandler"
  class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">chain.name</str>
  </lst>
</requestHandler>

Solr ships with many ready-made UpdateRequestProcessor components that you can use in your chain, and it is also straightforward to create your own processor and customize the index chain.
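
As an illustration, here is a sketch of a chain assembled from processors that ship with Solr 4.x, together with a handler that enables it. The chain name, the /update-clean endpoint, and the choice of processors are assumptions made for this example only.

```xml
<updateRequestProcessorChain name="cleanup-chain">
  <!-- Trim leading and trailing whitespace from string field values -->
  <processor class="solr.TrimFieldUpdateProcessorFactory"/>
  <!-- Remove field values that are empty strings -->
  <processor class="solr.RemoveBlankFieldUpdateProcessorFactory"/>
  <!-- Log the update command -->
  <processor class="solr.LogUpdateProcessorFactory"/>
  <!-- Execute the update against the index -->
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<!-- Hypothetical endpoint that routes its updates through the chain above -->
<requestHandler name="/update-clean" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">cleanup-chain</str>
  </lst>
</requestHandler>
```

With this configuration, documents posted to /update-clean would be trimmed and stripped of blank values before being indexed, while the default /update handler stays unaffected.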

Tip

The example project that accompanies this chapter contains several examples of UpdateRequestProcessor within the org.gazzax.labs.solr.ase.ch2.urp package.
