Language identification

Language identification is needed at index time so that text can be mapped to language-specific fields. Solr handles this with its langid UpdateRequestProcessor, of which it provides two implementations:

  • TikaLanguageIdentifierUpdateProcessor: Uses the language identification libraries in Apache Tika
  • LangDetectLanguageIdentifierUpdateProcessor: Uses the open source Language Detection Library for Java

Configuration of TikaLanguageIdentifierUpdateProcessor in solrconfig.xml:

<processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
  <lst name="defaults">
    <str name="langid.fl">title,subject,text,keywords</str>
    <str name="langid.langField">language_s</str>
  </lst>
</processor>

Configuration of LangDetectLanguageIdentifierUpdateProcessor in solrconfig.xml:

<processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
  <lst name="defaults">
    <str name="langid.fl">title,subject,text,keywords</str>
    <str name="langid.langField">language_s</str>
  </lst>
</processor>
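
A processor element only takes effect once it sits inside an update request processor chain in solrconfig.xml. The following is a minimal sketch of such a chain, not a definitive setup: the chain name langid is an assumption chosen for illustration, and the two trailing processors are the standard logging and indexing processors that usually close a chain.

```xml
<!-- Sketch only: the chain name "langid" is an assumed, illustrative name -->
<updateRequestProcessorChain name="langid">
  <!-- Detect the language from the listed fields and store the result in language_s -->
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <lst name="defaults">
      <str name="langid.fl">title,subject,text,keywords</str>
      <str name="langid.langField">language_s</str>
    </lst>
  </processor>
  <!-- Log the update, then hand the document to the index -->
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

Under this assumption, an update request would select the chain by passing update.chain=langid, unless the chain is configured as the default for the update handler.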

The langid UpdateRequestProcessor accepts many more langid parameters, which are not covered in detail here. Determining the language at index time is always preferable to determining it at query time. At query time, the input provided by the user may be too short, or not meaningful enough, to extract language information from. At index time, the full document is available, so language identification is easier and more reliable.
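
As a brief illustration of how detection can feed language-specific fields, the sketch below adds two further parameters, langid.map and langid.fallback, to the defaults. It assumes the schema already defines the target fields (for example title_en and text_en, or matching dynamic fields); those field names and the fallback value en are illustrative assumptions, not taken from the text above.

```xml
<processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
  <lst name="defaults">
    <!-- Fields whose content is used for detection -->
    <str name="langid.fl">title,text</str>
    <!-- Field that receives the detected language code -->
    <str name="langid.langField">language_s</str>
    <!-- Map each input field to <field>_<language>, e.g. title to title_en -->
    <bool name="langid.map">true</bool>
    <!-- Assumed fallback language code when no language can be detected -->
    <str name="langid.fallback">en</str>
  </lst>
</processor>
```

With this configuration, a document detected as English would have its title content indexed in title_en, so that field can carry an English-specific analysis chain.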
