Language detection is configured in solrconfig.xml. Both the Tika and the langdetect implementations use the same parameters, as follows:
<processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
  <lst name="defaults">
    <str name="langid.fl">title,subject,text,keywords</str>
    <str name="langid.langField">language_s</str>
  </lst>
</processor>
<processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
  <lst name="defaults">
    <str name="langid.fl">title,subject,text,keywords</str>
    <str name="langid.langField">language_s</str>
  </lst>
</processor>
As you can see, both configurations use the same parameters; the only difference is the processor class. The parameters are listed here:
Parameter | Description |
langid | Enables language detection when set to true. |
langid.fl | Required. A comma-delimited or space-delimited list of the fields to be processed by langid. |
langid.langField | Required. The field in which the detected language code is stored. |
langid.langsField | The same as langid.langField, except that it specifies the field for a list of detected language codes instead of a single code. |
langid.overwrite | If enabled, the contents of the langField and langsField fields are overwritten even if they already have a value. The default value is false. |
langid.lcmap | A space-separated list of colon-delimited language code mappings to apply to the detected languages. |
langid.threshold | A threshold between 0 and 1 that the language identification score must reach before the detected language is accepted. The default value is 0.5. |
langid.whitelist | The list of allowed language identification codes. |
langid.map | Enables field name mapping. The default value is false. |
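To see how these parameters fit together, the following is a minimal sketch of an update chain in solrconfig.xml that combines several of them. The chain name langid, the field name languages_ss, the threshold of 0.8, and the cjk mapping are illustrative assumptions rather than values from the preceding examples; adjust them to your own schema.

```xml
<!-- Hypothetical chain name "langid"; reference it through update.chain -->
<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <lst name="defaults">
      <!-- Fields inspected for language detection -->
      <str name="langid.fl">title,subject,text,keywords</str>
      <!-- Field that receives the single detected language code -->
      <str name="langid.langField">language_s</str>
      <!-- Assumed multivalued field that receives the list of detected codes -->
      <str name="langid.langsField">languages_ss</str>
      <!-- Accept a detection only if its score reaches 0.8 (default is 0.5) -->
      <str name="langid.threshold">0.8</str>
      <!-- Map Japanese, Chinese, and Korean to one shared code -->
      <str name="langid.lcmap">ja:cjk zh:cjk ko:cjk</str>
      <!-- Keep any language values that the document already carries -->
      <str name="langid.overwrite">false</str>
    </lst>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<!-- Route documents sent to /update through the chain defined above -->
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">langid</str>
  </lst>
</requestHandler>
```

The language identification processor is placed before RunUpdateProcessorFactory so that the detected language is already on the document when it is indexed. To use Tika instead, swapping in the Tika processor class should be the only change needed, since both implementations read the same parameters.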