What is an analyzer?

An analyzer examines the text of fields and generates a token stream. Normally, only fields of type solr.TextField will specify an analyzer. An analyzer is defined as a child element of the <fieldType> element in the managed-schema.xml file. Here is a simple analyzer configuration:

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
<analyzer class="org.apache.lucene.analysis.core.WhitespaceAnalyzer"/>
</fieldType>

Here, we have defined a single <analyzer> element, which is the simplest way to define an analyzer. The positionIncrementGap attribute, which inserts a position gap between the values of a multivalued field, was covered in the previous chapter.

The class attribute value is a fully qualified Java class name; the input text will be analyzed by that analyzer class (WhitespaceAnalyzer). Let's add the following configuration to managed-schema.xml and verify it through the admin console:

<fieldType name="text_en" class="solr.TextField">
<analyzer class="org.apache.lucene.analysis.core.WhitespaceAnalyzer"/>
</fieldType>

Applying WhitespaceAnalyzer to the input string:

Input: Running simple Solr analyzer through admin console

Output: Running | simple | Solr | analyzer | through | admin | console

This is a very simple example. The WhitespaceAnalyzer class splits the string at whitespace and generates the token stream. The named class must derive from org.apache.lucene.analysis.Analyzer.
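
If you want to verify this outside the admin console, the same analyzer can be exercised directly through the Lucene API. Here is a minimal Java sketch, assuming the Lucene core and analyzers-common libraries (bundled with Solr) are on the classpath; the class name and field name used here are only illustrative:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class WhitespaceAnalyzerDemo {
  public static void main(String[] args) throws Exception {
    String input = "Running simple Solr analyzer through admin console";
    try (Analyzer analyzer = new WhitespaceAnalyzer();
         TokenStream stream = analyzer.tokenStream("text_en", input)) {
      CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
      stream.reset();                          // required before incrementToken()
      while (stream.incrementToken()) {
        System.out.println(term.toString());   // prints one token per line
      }
      stream.end();
    }
  }
}

Running this should print the seven whitespace-separated tokens shown above, one per line.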

The analyzer may be a single class, or it may be composed of a tokenizer followed by a series of filter classes. Configuring an analyzer with a tokenizer and filters is straightforward: define the <analyzer> element with child elements that name factory classes for the tokenizer and the filters, and always list them in the order you want them to run. Here is an example:

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Solr will execute the tokenizer and filters in the order in which they are configured, so the order should be chosen deliberately. For example, applying the lowercase filter (LowerCaseFilterFactory) before the stop filter hurts performance: stop words are going to be removed from the stream anyway, so lowercasing them first is wasted work.

The input text is passed to the first element in the chain (here, StandardTokenizerFactory), which generates the initial tokens. The output of each stage becomes the input of its immediate successor (here, StopFilterFactory), and so on down the chain. The tokens produced by the last filter (LowerCaseFilterFactory here) form the final token stream, which Solr uses to build the index.
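
To make the flow concrete, the following Java sketch builds an equivalent chain programmatically with Lucene's CustomAnalyzer and prints the resulting tokens. It is only an illustration of the execution order: it assumes the Lucene core and analyzers-common libraries are on the classpath, and it uses Lucene's built-in English stop set instead of lang/stopwords_en.txt so that it runs standalone.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.core.StopFilterFactory;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizerFactory;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerChainDemo {
  public static void main(String[] args) throws Exception {
    // Same order as the <fieldType> above: tokenizer -> stop filter -> lowercase filter.
    // No "words" parameter is given, so the stop filter falls back to the default English stop set.
    Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer(StandardTokenizerFactory.class)
        .addTokenFilter(StopFilterFactory.class, "ignoreCase", "true")
        .addTokenFilter(LowerCaseFilterFactory.class)
        .build();

    String input = "Running the simple Solr analyzer through the admin console";
    try (TokenStream stream = analyzer.tokenStream("text_en", input)) {
      CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
      stream.reset();
      while (stream.incrementToken()) {
        System.out.println(term.toString());
      }
      stream.end();
    }
    analyzer.close();
  }
}

For this input, the chain should print running, simple, solr, analyzer, through, admin, and console: the tokenizer splits the text, the stop filter drops both occurrences of "the", and the lowercase filter normalizes whatever remains.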
