Update request processors

No matter how you choose to import data, there is a final configuration point within Solr that allows manipulation of the imported data before it gets indexed. The Solr request handlers that update data put documents on an update request processor chain. If you search solrconfig.xml for updateRequestProcessorChain, then you'll see an example.
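As a minimal sketch of what such a definition looks like, the following fragment declares a chain in solrconfig.xml. The chain name mychain is hypothetical and chosen only for illustration; the processor classes are the ones that ship with Solr.

```xml
<!-- solrconfig.xml: a hypothetical chain named "mychain" -->
<updateRequestProcessorChain name="mychain">
  <!-- Trim leading and trailing whitespace from all string values -->
  <processor class="solr.TrimFieldUpdateProcessorFactory"/>
  <!-- Write the usual log message for each update -->
  <processor class="solr.LogUpdateProcessorFactory"/>
  <!-- Must come last: this is the step that actually indexes the document -->
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

Assuming this name, an update request would select the chain by adding update.chain=mychain to its parameters.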

You can specify which chain to use on an update request with the update.chain parameter. This can be useful, but you'll probably always use one chain. If no chain is specified, you get a default chain of LogUpdateProcessorFactory and RunUpdateProcessorFactory. The following are the update request processors that you can choose from. Most of their names end in UpdateProcessorFactory.

  • SignatureUpdateProcessorFactory: This generates a hash ID value based on the field values you specify. If you want to deduplicate your data (that is, you don't want to add the same data twice accidentally), then this will do that for you. For further information, see http://wiki.apache.org/solr/Deduplication.
  • UIMAUpdateProcessorFactory: This hands the document off to the Unstructured Information Management Architecture (UIMA) through a Solr contrib module, which enhances the document using natural language processing (NLP) techniques. For further information, see http://wiki.apache.org/solr/SolrUIMA.

    Tip

    Although it's nice to see an NLP integration option in Solr, beware that NLP processing tends to be computationally expensive. Instead of using UIMA in this way, consider performing this processing outside Solr and caching the results to avoid recomputation as you adjust your indexing process.

  • LogUpdateProcessorFactory: This is the one responsible for writing the log messages you see when an update occurs.
  • RunUpdateProcessorFactory: This is the one that actually indexes the document; don't forget it or the document will vanish! To decompose this last step further, it hands the document to Lucene, which will then process each field according to the analysis configuration in the schema.
  • FieldMutatingUpdateProcessorFactory: This allows you to manipulate field values when adding documents to the index. You can configure which fields the processor acts on by name, type, name regex, or type class. The following are the useful extensions of the FieldMutatingUpdateProcessorFactory implementation:
    • TrimFieldUpdateProcessorFactory: This trims leading and trailing whitespace from any CharSequence values found in fields matching the specified conditions and returns the resulting string. By default, this processor matches all fields.
    • RemoveBlankFieldUpdateProcessorFactory: This removes any CharSequence values with a length of 0 (that is, empty strings). By default, this processor applies itself to all fields.
    • FieldLengthUpdateProcessorFactory: This replaces any CharSequence values found in fields matching the specified conditions with the lengths of those CharSequences (as an integer). By default, this processor matches no fields.
    • ConcatFieldUpdateProcessorFactory: This concatenates multiple values for fields matching the specified conditions using a configurable delimiter that defaults to ", ". By default, this processor concatenates the values for any field that, according to the schema, is multiValued="false" and uses TextField or StrField.
    • FirstFieldValueUpdateProcessorFactory: This keeps only the first value of fields matching the specified conditions. By default, this processor matches no fields.
    • LastFieldValueUpdateProcessorFactory: This keeps only the last value of fields matching the specified conditions. By default, this processor matches no fields.
    • MinFieldValueUpdateProcessorFactory: This keeps only the minimum value from any selected fields where multiple values are found. By default, this processor matches no fields.
    • MaxFieldValueUpdateProcessorFactory: This keeps only the maximum value from any selected fields where multiple values are found. By default, this processor matches no fields.
    • TruncateFieldUpdateProcessorFactory: This truncates any CharSequence values found in fields matching the specified conditions to a maximum character length. By default, this processor matches no fields.
    • IgnoreFieldUpdateProcessorFactory: This ignores and removes fields matching the specified conditions from any document being added to the index. By default, this processor ignores any field name that does not exist according to the schema.
    • CountFieldValuesUpdateProcessorFactory: This replaces any list of values for a field matching the specified conditions with the count of the number of values for that field. By default, this processor doesn't match any fields. The typical use case for this processor would be in combination with the CloneFieldUpdateProcessorFactory so that it's possible to query by the quantity of values in the source field.
    • HTMLStripFieldUpdateProcessorFactory: This strips all HTML markup in any CharSequence values found in fields matching the specified conditions. By default, this processor matches no fields.
    • RegexReplaceProcessorFactory: This applies a configured regex to any CharSequence values found in the selected fields, and replaces any matches with the configured replacement string. By default, this processor applies itself to no fields.
    • PreAnalyzedUpdateProcessorFactory: This parses configured fields of any document being added using PreAnalyzedField with the configured format parser. Fields are specified using the same patterns as in FieldMutatingUpdateProcessorFactory. They are then checked to see whether they follow a pre-analyzed format defined by the parser. The valid fields are then parsed. The original SchemaField is used for initial creation of IndexableField, which is then modified to add the results from parsing (token stream value and/or string value) and then it will be directly added to the final Lucene Document to be indexed. Fields that are declared in the patterns list but are not present in the current schema will be removed from the input document.
  • CloneFieldUpdateProcessorFactory: This is used to clone the values found in any matching source field into the configured dest field. If the dest field already exists in the document, the values from the source fields will be added to it. The boost value associated with the dest will not be changed, and any boost specified on the source fields will be ignored.
  • StatelessScriptUpdateProcessorFactory: This enables custom update processing code to be written in several scripting languages (such as JavaScript, Ruby, Groovy, or Python). In the script, you have access to several Solr objects, allowing you, for instance, to modify a document before it's indexed by Solr, or to add custom info to the Solr log. It can be a very useful feature to centralize logic when using multiple clients, or to modify requests when you have no control over all clients.

    Tip

    The ScriptUpdateProcessor is powerful!

    You should certainly use the other update processors where appropriate, but there is almost nothing you can't do with this one. A sample script file (update-script.js) can be found in the conf directory. For more details, see http://wiki.apache.org/solr/ScriptUpdateProcessor.

  • DocExpirationUpdateProcessorFactory: Introduced in Solr 4.8, this is used to automatically delete expired documents from the index, which is done by a background thread. It offers two options: periodically deleting documents from the index based on an expiration field, and computing expiration field values for documents from a time to live (TTL). The expirationFieldName value is the name of the expiration field, and autoDeletePeriodSeconds specifies how often the timer thread should trigger a deleteByQuery to remove the expired documents. This factory can also be configured to look for a _ttl_ request parameter, as well as a _ttl_ field in each document that is indexed. Refer to the Solr wiki or the API docs for more information.
  • DocBasedVersionConstraintsProcessorFactory: Introduced in Solr 4.6, this is used to enforce version constraints based on per-document version numbers, using the field named by the versionField setting. It should be configured on the default update processor chain before the DistributedUpdateProcessorFactory. With this in place, if a document with the same unique key already exists in the index and its versionField value is not less than the value in the new document, then the new document will be rejected with a 409 version conflict error.
  • RegexpBoostProcessorFactory: This update processor reads the inputField, matches its content against the regular expressions listed in boostFilename, and, if a pattern matches, writes the corresponding boost value from the file into the boostField as a double. If more than one pattern matches, then the boost values are multiplied.
  • TikaLanguageIdentifierUpdateProcessorFactory and LangDetectLanguageIdentifierUpdateProcessorFactory: These identify the language of a document before indexing so that appropriate decisions can be made about analysis, and so on. For further information about language detection, see http://wiki.apache.org/solr/LanguageDetection.

    Tip

    There are many other processors available and you can also write your own. It's a recognized extensibility point in Solr that consequently doesn't require modifying Solr itself. For further information, see http://wiki.apache.org/solr/UpdateRequestProcessor.
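As one more sketch, the deduplication and expiration processors described earlier could be configured along the following lines. The chain name dedupe-expire and the field names signature, name, features, cat, and expire_at are assumptions for illustration and would have to exist in your schema; the option names follow the Solr documentation for these two factories.

```xml
<!-- solrconfig.xml: hypothetical chain and field names, for illustration only -->
<updateRequestProcessorChain name="dedupe-expire">
  <!-- Hash the listed fields into the "signature" field -->
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <!-- true: a new document replaces existing ones with the same signature -->
    <bool name="overwriteDupes">true</bool>
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>

  <!-- Compute "expire_at" from a per-document _ttl_ field, and run a
       background deleteByQuery for expired documents every 300 seconds -->
  <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
    <int name="autoDeletePeriodSeconds">300</int>
    <str name="ttlFieldName">_ttl_</str>
    <str name="expirationFieldName">expire_at</str>
  </processor>

  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

With a configuration like this, a document indexed with a _ttl_ value such as +1DAY would receive an expire_at value one day in the future and would be removed by a later run of the periodic delete.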
