Chapter 2. Understanding Analyzers, Tokenizers, and Filters

In the previous chapter, we learned how to install and run Solr on various operating systems and covered its architecture. We also talked briefly about the basic building blocks of Solr, such as the Solr configuration files.

In this chapter, we will cover the following core components of the Solr configuration:

  • Analyzers
  • Tokenizers
  • Filters

Introducing analyzers

To enable effective and efficient searching, Solr splits text into tokens during indexing as well as at search (query) time. Solr does all of this with the help of its three main components: analyzers, tokenizers, and filters. Analyzers are used during both indexing and searching. An analyzer examines the text of fields and generates a token stream with the help of tokenizers. Filters then examine the stream of tokens and do one of three jobs: keep them, discard them, or create new tokens. For example, a tokenizer might split the phrase "Solr Search" into the tokens Solr and Search, and a lowercase filter might then rewrite them as solr and search. Tokenizers and filters may be combined into pipelines, or chains, such that the output of one is the input of the next. Such a sequence of tokenizers and filters is called an analyzer, and the resulting output of the analyzer is used to match search queries or build indices. Let's see how we can use and implement these components in Solr.

Analyzers are core components that preprocess input text at index and search time. It's recommended that you use the same (or at least compatible) analyzers at query and index time so that text is preprocessed in a consistent manner. In simple terms, the role of an analyzer is to examine the input text and generate a token stream. An analyzer is specified as a child of the <fieldType> element in the schema.xml configuration file.

In normal usage, only fields of the solr.TextField type specify an analyzer. There are two ways to specify how text fields are analyzed in schema.xml:

  • One way is to specify a single <analyzer> element whose class attribute is a fully qualified Java class name. This is the simplest way of configuring an analyzer. The named class must be derived from org.apache.lucene.analysis.Analyzer. The following is an example:
    <fieldType name="nametext" class="solr.TextField">
      <analyzer class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
    </fieldType>

    In this case, a single class, WhitespaceAnalyzer, is responsible for analyzing the content of the named text field and emitting the corresponding tokens. This is adequate for simple cases, such as plain English input text, but field content generally calls for more complex analysis.

  • The second way is to specify a TokenizerFactory followed by an optional list of TokenFilterFactory classes, which are applied in the listed order. This is how to perform complex analysis on input text; for example, you can decompose your analysis into discrete, relatively simple steps. Here is an example:
    <fieldType name="nametext" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"/>
      </analyzer>
    </fieldType>

    In the preceding case, we set up an analysis chain by specifying an <analyzer> element with no class attribute and listing child elements, which are factory classes for the tokenizer and filters in the order you want them to run. No analyzer class is defined in the <analyzer> element; rather, a sequence of more specialized classes is combined to act as an analyzer for the field that is going to be analyzed (a variant of this chain with explicit filter parameters is sketched after the following note). Soon, you will discover that a Solr distribution comes with a large selection of tokenizers and filters that cover most of the scenarios you are likely to encounter.

    Note

    Note that classes in the org.apache.solr.analysis package may be referred to here using the shorthand solr. prefix.
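
Many of these factory classes also accept configuration attributes on their elements. The following is a minimal sketch, assuming a stop-word list stored in a stopwords.txt file in the core's conf directory (the field type name nametext_stop and the file name are illustrative):

<fieldType name="nametext_stop" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- ignoreCase="true" matches stop words regardless of case;
         words names the stop-word file to load -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>

Here, positionIncrementGap is a common solr.TextField attribute that puts a gap between the tokens of different values of a multivalued field, so that phrase queries do not match across value boundaries.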

We will cover tokenizers and filters in detail in upcoming topics.

Analysis phases

We read earlier that analysis happens in two contexts. At index time, when a field is being created, the token stream that results from the analysis is added to the index and defines a set of terms (along with details such as their positions and sizes) for the field. At query time, the search query is analyzed, and the resulting terms are matched against those stored in the field's index. In many cases, the same analysis is used at both index and query time, but there are cases in which you may want different analysis steps during indexing and searching. Here is an example of this:

<fieldType name="nametext" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="removewords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

In the preceding example, you can see that we have used two <analyzer> definitions, distinguished by the type attribute; based on this attribute, Solr applies the appropriate analyzer at index or query time. At indexing time, we told the analyzer to follow different steps than at query time. At index time, Solr tokenizes the text using the solr.StandardTokenizerFactory class, and then the solr.LowerCaseFilterFactory filter lowercases the tokens. Next, the solr.StopFilterFactory filter removes any tokens that match the words defined in removewords.txt. The final filter, solr.SynonymFilterFactory, maps tokens to alternate values using the synonyms.txt file. At query time, however, we asked the analyzer to apply only the lowercase filter to convert query terms to lowercase; the other filters applied at index time are not applied at query time.
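
A field type such as nametext does nothing on its own until a field is bound to it in schema.xml. The following is a minimal sketch (the field name product_name is hypothetical):

<!-- Bind a concrete field to the nametext type defined above;
     product_name is a hypothetical field name -->
<field name="product_name" type="nametext" indexed="true" stored="true"/>

Once a field is in place, the Analysis screen in the Solr admin UI is a convenient way to verify the two chains: it shows the token stream produced by each tokenizer and filter stage at both index and query time.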
