Filters

Filters, like tokenizers, produce a stream of tokens, but their input is different: a filter receives the tokens emitted by a tokenizer (or by a preceding filter) rather than raw text. It looks at each token and decides whether to keep it, change or replace it, or discard it. Filters also derive from org.apache.lucene.analysis.TokenStream.

A typical example of a filter looks something like this:

<fieldType name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Filters are configured in schema.xml with a <filter> element as a child of <analyzer>, following the <tokenizer> element. Since filters take token streams as input, the filter definition should follow the tokenizer or another filter definition, as shown in the preceding example.

The preceding example starts with a standard tokenizer; it tokenizes the input text. Then these tokens pass through Solr's standard filter, which removes dots from acronyms and performs some other common operations. All the tokens are then set to lowercase, which will help in case-insensitive matching at query time.
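Because each filter sees only what the previous stage emits, the order of the <filter> elements matters. The following is a minimal sketch of that point, not a recommended configuration: the field type name text_ordered and the stop word file stopwords.txt are assumptions made for illustration.

```xml
<fieldType name="text_ordered" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Runs first: "The" becomes "the" -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Runs second: with ignoreCase="false", a lowercase entry "the"
         in stopwords.txt matches only because the token was already
         lowercased by the filter above -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="false"/>
  </analyzer>
</fieldType>
```

If the two filters were swapped, the token "The" would reach the stop filter unchanged and would not match a lowercase stop word entry.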

As with tokenizers, the class attribute names a factory class that instantiates a filter object as needed. Filter factory classes must implement the org.apache.solr.analysis.TokenFilterFactory interface. Arguments may be passed to filter factories to change their behavior by setting attributes in the <filter> element. An example of a filter factory with arguments is as follows:

<fieldType name="hyphenDelimited" class="solr.TextField">
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="- " />
    <filter class="solr.LengthFilterFactory" min="3" max="6"/>
  </analyzer>
</fieldType>
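The type attribute on <analyzer> limits a chain to indexing (type="index") or to query parsing (type="query"). The sketch below expands the field type above with a matching index analyzer; reusing the same chain on both sides is an assumption made for illustration, and the input in the comment is made up.

```xml
<fieldType name="hyphenDelimited" class="solr.TextField">
  <!-- Sample trace through either chain:
       Input text:        "ab- hello- wonderful- solr"
       After tokenizer:   "ab", "hello", "wonderful", "solr"
       After length 3..6: "hello", "solr" -->
  <!-- Applied to field values when documents are indexed -->
  <analyzer type="index">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="- "/>
    <filter class="solr.LengthFilterFactory" min="3" max="6"/>
  </analyzer>
  <!-- Applied to the query text at search time -->
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="- "/>
    <filter class="solr.LengthFilterFactory" min="3" max="6"/>
  </analyzer>
</fieldType>
```

The length filter keeps tokens whose length lies between min and max inclusive, so "ab" (2 characters) and "wonderful" (9 characters) are discarded.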

Let's see some of the filter factories that are included in the Solr release.

Lowercase filter

The lowercase filter converts all uppercase letters in each token to lowercase; all other characters are left unchanged:

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>

Input text: "I Love Apache Solr"

Tokenizer input to filter: "I", "Love", "Apache", "Solr"

Output: "i", "love", "apache", "solr"

Synonym filter

The synonym filter is responsible for synonym mapping. Each token is matched with a list of synonyms present in the synonym file passed as an argument, and if a match is found, then the synonym is put in place of the token:

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
</analyzer>

The format of the synonyms.txt file is as follows:

i-phone, i phone, iphone => iphone
iit, IIT, I.I.T => Indian Institute of Technology
small => tiny, short, teeny

Input text: "new i-phone is small"

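Tokenizer input to filter: "new", "i", "phone", "is", "small" (the standard tokenizer splits "i-phone" at the hyphen; the "i phone" rule then matches the two-token sequence)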

Output: "new", "iphone", "is", "tiny", "short", "teeny"
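The synonym filter factory also accepts optional attributes that control matching. The following sketch shows two of them, ignoreCase and expand; the values chosen here are for illustration only and are not taken from the example above.

```xml
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- ignoreCase="true": tokens are matched against synonyms.txt
       case-insensitively, so separate entries such as iit and IIT
       are no longer needed.
       expand="true": for a comma-separated line without =>, every
       listed term is emitted; with expand="false" all terms are
       reduced to the first one in the list. -->
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
          ignoreCase="true" expand="true"/>
</analyzer>
```

Lines that use an explicit => mapping, like the three shown above, are not affected by expand.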

Porter stem filter

The Porter stem filter applies the Porter stemming algorithm for the English language. Its results are very similar to those of the Snowball Porter stem filter configured for English. With the Snowball Porter stem filter, you can provide a language as the input parameter, such as French, Spanish, and so on:

<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.PorterStemFilterFactory"/>
</analyzer>

Input text: "run runs running ran"

Tokenizer input to filter: "run", "runs", "running", "ran"

Output: "run", "run", "run", "ran" (Porter stemming only strips suffixes, so the irregular form "ran" is left unchanged)
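For comparison, here is a minimal sketch of the Snowball Porter stem filter with its language parameter. The choice of French is only an example, and placing a lowercase filter before the stemmer is a common convention assumed here rather than a requirement stated above.

```xml
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- Lowercase first so the stemmer sees normalized tokens -->
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- language selects the Snowball stemmer to apply -->
  <filter class="solr.SnowballPorterFilterFactory" language="French"/>
</analyzer>
```

Setting language="English" on this filter gives results close to those of solr.PorterStemFilterFactory.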

Other filters are:

  • Length filter
  • Keep words filter
  • ICU transform filter
  • KStem filter
  • N-gram filter
  • Pattern replace filter
  • Position filter
  • Remove duplicates token filter
  • Shingle filter
  • Reversed wildcard filter
  • Token offset payload filter
  • Trim filter
  • Type token filter
  • Word delimiter filter
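The filters in this list are configured the same way as the ones shown earlier. The sketch below declares four of them to show the attribute syntax; the attribute values and the file name keepwords.txt are assumptions, and combining these particular filters in one chain is for illustration only.

```xml
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- Keep words filter: discard every token not listed in the file -->
  <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="true"/>
  <!-- Pattern replace filter: remove all digits from each token -->
  <filter class="solr.PatternReplaceFilterFactory" pattern="[0-9]+" replacement="" replace="all"/>
  <!-- Shingle filter: emit adjacent token pairs in addition to single tokens -->
  <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
  <!-- Remove duplicates token filter: drop identical tokens at the same position -->
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
```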