We frequently need to index content in more than one language. We might have different language versions of the same document, or multivalued fields that contain text in different languages.
SolrStarterBook/solr-app/chp09/arts_lang/solrconfig.xml
), the configuration will require the addition of the following code:
<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <lst name="defaults">
      <str name="langid.fl">fullText,abstract,title,comment</str>
      <str name="langid.langField">languages</str>
      <str name="langid.fallback">en</str>
      <float name="langid.threshold">0.85</float>
    </lst>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
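A chain defined this way only runs when an update request selects it. The excerpt does not show how the arts_lang core does this, so the following is only a sketch of one common option: setting update.chain in the defaults of the update handler. The handler name /update matches the URL used later in this section; the rest is an assumption about the core's configuration.

```xml
<!-- solrconfig.xml: sketch, assuming the chain is not otherwise marked as the default -->
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <!-- route every update through the chain named "langid" -->
    <str name="update.chain">langid</str>
  </lst>
</requestHandler>
```

An alternative is to leave the handler untouched and pass update.chain=langid as a request parameter on each update.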
schema.xml
file from the previous chapters, we will add a new core.properties
file. Then we will start the core by using /SolrStarterBook/test/chp09/arts_lang/start.sh
. We can easily test if the processor is working using the following command:
>> curl -X POST 'http://localhost:8983/solr/arts_lang/update?commit=true&wt=json' -H "Content-Type: text/xml" -d '@example.post.xml'
The example is simple and self-explanatory. We can use the langid.fl=fullText,abstract,etc
parameter to list the fields used for language detection. To avoid false identifications, it's preferable to choose fields whose text is not too short. In our case, documents written in English will be returned by our queries with a languages=en
field. This is detected by the tool by using the fullText
or abstract
field.
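To make the detection concrete, a document posted to the update handler could look like the following sketch. The actual example.post.xml is not reproduced here, so the uri field, its value, and the texts are invented for illustration; only the title and abstract field names come from the langid.fl list above.

```xml
<add>
  <doc>
    <!-- hypothetical identifier; use the uniqueKey field of your own schema -->
    <field name="uri">http://example.org/arts/sample-painting</field>
    <field name="title">A sample painting</field>
    <!-- a longer text gives the detector more evidence than a short title -->
    <field name="abstract">This oil painting was completed in the early sixteenth century and is widely regarded as one of the most famous portraits in the history of Western art.</field>
  </doc>
</add>
```

With a text of this length we would expect the detector to be confident enough to record English in the languages field, although the exact score depends on the detection profiles in use.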
The langid.langField=languages
parameter is used to create a new field named languages
, which will be a multivalued field in our case, as a document can contain text in different languages.
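For the new field to be accepted at indexing time, schema.xml needs a matching definition. A minimal sketch, assuming a string field type is already defined in the schema (as it is in the stock Solr examples):

```xml
<!-- schema.xml: holds the language codes detected at indexing time -->
<field name="languages" type="string" indexed="true" stored="true" multiValued="true" />
```

Keeping the field stored as well as indexed lets us both filter on the language and see it in the returned documents.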
This is an important aspect to handle. We can provide different language-oriented analysis on a per-field basis, and we can adopt a naming convention for managing the alternate versions of the same field. If we have an abstract field for every language, we can, for example, manage an abstract_it
field with its own configuration (in this case, with some language-specific tuning for Italian). Note the connection between adopting such a naming convention and introducing, at indexing time, a field that tracks the detected language. With both in place, it is up to us to trigger a field transformation and decide, by scripting, which field configuration should be used for indexing.
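As an alternative to scripting, the language identification processor can apply this naming convention itself through its langid.map parameters. The following is a sketch rather than the configuration used by the arts_lang core: it restricts the mapping to the abstract field and keeps the original field, both of which are choices made here for illustration.

```xml
<!-- solrconfig.xml: sketch of the langid processor with field mapping enabled -->
<processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
  <lst name="defaults">
    <str name="langid.fl">fullText,abstract,title,comment</str>
    <str name="langid.langField">languages</str>
    <!-- rename mapped fields by appending the detected language, e.g. abstract becomes abstract_it -->
    <bool name="langid.map">true</bool>
    <!-- only map the abstract field (assumption made for this example) -->
    <str name="langid.map.fl">abstract</str>
    <!-- also keep the unmapped abstract field (assumption made for this example) -->
    <bool name="langid.map.keepOrig">true</bool>
  </lst>
</processor>
```

For this to work, the schema must define each target field (or a dynamic field that matches it), for example an abstract_it field whose type carries the Italian-specific analysis.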
We can provide a langid.fallback=en
parameter to treat the data as English by default (when we are not able to detect the language, or just don't need to), and a langid.threshold=0.85
parameter to set the minimum confidence score, between 0 and 1, that a detection must reach before it is accepted.