Time for action – detecting language with Tika and LangDetect

It is common to need to index more than a single language. We might have different language versions of the same document, or multivalued fields containing text in different languages.

  1. The Tika library (used in previous chapters with the SolrCell component) can also help with multilingual content. Another good option is the LangDetect library, as described in the wiki: https://cwiki.apache.org/confluence/display/solr/Detecting+Languages+During+Indexing. For this specific task, LangDetect tends to be a bit more precise than Tika.

    Tip

    However, we can instead adopt the corresponding Tika component, TikaLanguageIdentifierUpdateProcessorFactory, which accepts the same parameters.

  2. Looking at our first example of this chapter (/SolrStarterBook/solr-app/chp09/arts_lang/solrconfig.xml), the configuration will require the addition of the following code:
    <updateRequestProcessorChain name="langid">
      <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
        <lst name="defaults">
          <str name="langid.fl">fullText,abstract,title,comment</str>
          <str name="langid.langField">languages</str>
          <str name="langid.fallback">en</str>
          <float name="langid.threshold">0.85</float>
        </lst>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>
  3. This processor chain is then attached to an update handler in the usual way, as seen before.
  4. Once we have copied a schema.xml file from the previous chapters, we will add a new core.properties file and then start the core using /SolrStarterBook/test/chp09/arts_lang/start.sh. We can easily test whether the processor is working with the following command:
    >> curl -X POST 'http://localhost:8983/solr/arts_lang/update?commit=true&wt=json' -H "Content-Type: text/xml" -d'@example.post.xml'
  5. Here, the XML Solr document file posted is an example file located in the same test directory.
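The posted file is a standard Solr XML update document. As a rough sketch (this is a hypothetical document for illustration; the actual example.post.xml in the test directory may differ, and the id and field names other than those configured in langid.fl are assumptions), it could look like this:

```xml
<add>
  <doc>
    <field name="id">doc-1</field>
    <field name="title">The Starry Night</field>
    <!-- a reasonably long abstract helps the detector avoid false identifications -->
    <field name="abstract">The Starry Night is an oil-on-canvas painting by Vincent van Gogh, created in June 1889.</field>
  </doc>
</add>
```

After the commit, querying the document should show a languages field populated by the processor chain.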

What just happened?

The example is really simple and self-explanatory. The langid.fl parameter (here fullText,abstract,title,comment) provides the list of fields used for language detection. To avoid false identifications, it's preferable to use fields whose text is not too short. In our case, since our documents are written in English, each document returned by our queries will include a languages=en field, detected by the tool from the fullText or abstract field.

The langid.langField=languages parameter is used to create a new field named languages, which in our case will be a multivalued field, as a document can contain data in different languages.
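For this to work, the languages field must be declared as multivalued in schema.xml. A minimal declaration might look like the following sketch (the string field type name is an assumption based on the standard Solr example schema):

```xml
<field name="languages" type="string" indexed="true" stored="true" multiValued="true" />
```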

Tip

This is an important aspect to handle. We can provide different language-oriented analysis on a per-field basis by adopting a naming convention for the alternate versions of the same field. If we have an abstract field for every language, we can, for example, manage an abstract_it field with its own configuration (in this case, with some language-specific tuning for Italian). Note the connection between adopting such a naming convention and introducing, at indexing time, a field that tracks the detected language: it then becomes simple to trigger field transformations by scripting and to decide which field configuration should be used for indexing.
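The language identifier processors can also perform this field mapping for us via the langid.map parameter, which renames detected fields to a field_language form matching the naming convention above. A sketch of such a configuration follows (consult the wiki page cited earlier for the full list of mapping parameters and their defaults):

```xml
<processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
  <lst name="defaults">
    <str name="langid.fl">abstract</str>
    <str name="langid.langField">languages</str>
    <!-- when enabled, abstract is mapped to abstract_it for Italian
         documents, abstract_en for English ones, and so on -->
    <bool name="langid.map">true</bool>
  </lst>
</processor>
```

Each mapped field can then be given its own analysis chain in schema.xml, for example via a dynamic field rule per language suffix.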

We can provide the langid.fallback=en parameter to treat the data as English by default (when we are unable, or simply don't need, to detect the language), and the langid.threshold=0.85 parameter to set the minimum confidence value a detection must reach in order to be accepted.
