Language identification

At index time, language identification is needed to map text to language-specific fields. Solr performs it with the langid UpdateRequestProcessor, which comes in two implementations:

TikaLanguageIdentifierUpdateProcessor: Uses the language identification libraries in Apache Tika
LangDetectLanguageIdentifierUpdateProcessor: Uses the open source Language Detection Library for Java

Configuration of TikaLanguageIdentifierUpdateProcessor in solrconfig.xml:

<processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
  <lst name="defaults">
    <str name="langid.fl">title,subject,text,keywords</str>
    <str name="langid.langField">language_s</str>
  </lst>
</processor>

Configuration of LangDetectLanguageIdentifierUpdateProcessor in solrconfig.xml:

<processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
  <lst name="defaults">
    <str name="langid.fl">title,subject,text,keywords</str>
    <str name="langid.langField">language_s</str>
  </lst>
</processor>
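
A processor definition such as the two above only takes effect once it sits inside an update request processor chain in solrconfig.xml. The following is a minimal sketch of such a chain, not a definitive setup: the chain name "langid" is a hypothetical choice, and either of the two factories can be used in the first position.

```xml
<!-- Hypothetical chain name "langid"; either langid factory can be used here. -->
<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <lst name="defaults">
      <!-- Fields whose content is inspected to detect the language. -->
      <str name="langid.fl">title,subject,text,keywords</str>
      <!-- Field that receives the detected language code. -->
      <str name="langid.langField">language_s</str>
    </lst>
  </processor>
  <!-- Standard Solr processors that log the update and then execute it. -->
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

Assuming this chain name, an update request can select the chain by passing update.chain=langid as a request parameter.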

The langid processors accept many more langid parameters than the two shown here; they are not covered in this section.

Identifying the language at index time is generally preferable to identifying it at query time. At query time, the input provided by the user is often short and sometimes not meaningful enough to reveal its language. At index time, the full document is available, so language identification is easier and more reliable.
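
Because detection runs at index time, its result can also drive the mapping of text to language-specific fields mentioned at the start of this section. The sketch below shows one way this might look, using the langid.map and langid.fallback parameters. The fallback value "en" is purely illustrative, and the example assumes the default mapping of a field to the same name with a language suffix (for example title to title_en), with matching fields or dynamic fields defined in the schema.

```xml
<processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
  <lst name="defaults">
    <str name="langid.fl">title,subject,text,keywords</str>
    <str name="langid.langField">language_s</str>
    <!-- Map the langid.fl fields to per-language names, e.g. title to title_en
         (assumes the default mapping and matching schema fields). -->
    <bool name="langid.map">true</bool>
    <!-- Language code to use when detection fails; "en" is an illustrative choice. -->
    <str name="langid.fallback">en</str>
  </lst>
</processor>
```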
