Language Identifier

The Language Identifier extension detects the language (or languages) of the fields belonging to a given document. It is a useful add-on to the previously described extraction library, because it provides additional information about the data that has been indexed.

The component is implemented as an UpdateRequestProcessor subclass that intercepts and analyzes the incoming data:

<processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
  <str name="langid.fl">text</str>
  <str name="langid.langField">language</str>
  <str name="langid.fallback">en</str>
</processor>

As you can see, this processor can be configured with several options: the fields of the incoming documents that must be analyzed (langid.fl), the name of the field that will hold the result of the language detection (langid.langField), and a default fallback language to use when no detection is possible (langid.fallback).

Tip

In the example project associated with this chapter, you will find a solrconfig.xml file where the chain is already defined but the UpdateRequestProcessor is commented out. Just remove the comment markers, reload the core using the Administration Console, and reindex the documents under the example-data folder, following the same procedure as we described in the previous section. At the end, you will see an additional "language" field in each document; that is the result of the language detection component.

Note that declaring the processor in the solrconfig.xml file is not enough on its own. The processor must be placed inside an update request processor chain, and that chain must then be associated with an UpdateRequestHandler. Only the update requests received by that handler pass through the language detection chain.
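
As a rough illustration of that wiring, the following sketch shows what the relevant part of solrconfig.xml might look like. The chain name (langid) and the handler path (/update) are assumptions chosen for this example; the configuration shipped with the example project may use different names. The chain ends with RunUpdateProcessorFactory, which is the processor that actually executes the update, so the language identifier needs to come before it.

```xml
<!-- Hypothetical chain name "langid"; adjust it to match your configuration. -->
<updateRequestProcessorChain name="langid">
  <!-- Detects the language of the "text" field and stores it in "language". -->
  <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
    <str name="langid.fl">text</str>
    <str name="langid.langField">language</str>
    <str name="langid.fallback">en</str>
  </processor>
  <!-- Logs the update and then actually performs it; these go last in the chain. -->
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<!-- Associates the chain with an update handler (the path is assumed here). -->
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">langid</str>
  </lst>
</requestHandler>
```

With a setup along these lines, documents sent to a different handler, or to a handler that does not reference the chain, would be indexed without a language field.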
