Language Identifier

The Language Identifier extension detects the language (or languages) of fields belonging to a given document. It is a useful add-on to the previously described extraction library, because it provides additional information about the data being indexed.

The component is implemented as an UpdateRequestProcessor subclass that intercepts and analyzes the incoming data:

<processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
  <str name="langid.fl">text</str>
  <str name="langid.langField">language</str>
  <str name="langid.fallback">en</str>
</processor>

As you can see, this processor can be configured with several options. Here we declare the fields of the incoming documents that must be analyzed (langid.fl), the name of the field that will hold the result of language detection (langid.langField), and a default fallback language to use when no detection is possible (langid.fallback).

Tip

In the example project associated with this chapter, you will find a solrconfig.xml file where the chain is already defined but the UpdateRequestProcessor is commented out. Just remove the comment markers, reload the core using the Administration Console, and reindex the documents under the example-data folder, following the same procedure as we described in the previous section. At the end, you will see an additional "language" field in each document; that is the result of the language detection component.

Note that declaring the processor within the solrconfig.xml file is not enough. We need to insert it into an update request processor chain, and then associate that chain with an UpdateRequestHandler. Only the update requests received by that handler will pass through the language detection chain.
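
The wiring described above can be sketched as follows. This is a minimal illustration, not necessarily the exact configuration shipped with the example project: the chain name "langid" and the "/update" handler path are assumptions chosen for the example, and the processor settings are the ones shown earlier.

```xml
<!-- Hypothetical chain name "langid"; any unique name works -->
<updateRequestProcessorChain name="langid">
  <!-- Language detection runs first, so the detected value is stored with the document -->
  <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
    <str name="langid.fl">text</str>
    <str name="langid.langField">language</str>
    <str name="langid.fallback">en</str>
  </processor>
  <!-- Standard processors that log the update and then actually execute it -->
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<!-- Assumed handler path; update requests sent here pass through the chain above -->
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">langid</str>
  </lst>
</requestHandler>
```

The RunUpdateProcessorFactory entry should come last in the chain: it is the processor that performs the actual index update, so anything placed after it would not affect the stored document.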
