Filters

Like tokenizers, filters consume input and produce a stream of tokens, but their job is different. A filter receives the tokens emitted by a tokenizer (or by a preceding filter), looks at each token, and decides whether to keep it, change or replace it, or discard it. Filters also derive from org.apache.lucene.analysis.TokenStream.

A typical example of a filter looks something like this:

<fieldType name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Filters are configured in schema.xml with a <filter> element as a child of <analyzer>, following the <tokenizer> element. Since filters take token streams as input, the filter definition should follow the tokenizer or another filter definition, as shown in the preceding example.

The preceding example starts with a standard tokenizer; it tokenizes the input text. Then these tokens pass through Solr's standard filter, which removes dots from acronyms and performs some other common operations. All the tokens are then set to lowercase, which will help in case-insensitive matching at query time.

As with tokenizers, the class attribute names a factory class that instantiates a filter object as needed. Filter factory classes must implement the org.apache.solr.analysis.TokenFilterFactory interface. Arguments may be passed to filter factories to change their behavior by setting attributes on the <filter> element. An example of a filter factory with arguments is as follows:

<fieldType name="hyphenDelimited" class="solr.TextField">
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="- " />
    <filter class="solr.LengthFilterFactory" min="3" max="6"/>
  </analyzer>
</fieldType>

Let's see some of the filter factories that are included in the Solr release.

Lowercase filter

The lowercase filter converts all uppercase letters in each token to lowercase; all other characters are left unchanged:

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>

Input text: "I Love Apache Solr"

Tokenizer to filter: "I", "Love", "Apache", "Solr"

Output: "i", "love", "apache", "solr"
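
To see why lowercasing helps with case-insensitive matching, here is a minimal sketch that applies the same lowercase filter at both index time and query time. The field type name text_ci is a hypothetical name chosen for illustration:

```xml
<fieldType name="text_ci" class="solr.TextField">
  <!-- Applied to field values when documents are indexed -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Applied to the search terms when a query is parsed -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because both analyzers lowercase their tokens, the query term "SOLR" and the indexed value "Solr" both reduce to "solr" and therefore match. A single <analyzer> element with no type attribute is used for both indexing and querying, so the two-analyzer form is only needed when the chains differ.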

Synonym filter

The synonym filter is responsible for synonym mapping. Each token is matched against the list of synonyms in the file named by the synonyms attribute; if a match is found, the synonym is put in place of the token:

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
</analyzer>

The format of the synonyms.txt file is as follows:

i-phone, i phone, iphone => iphone
iit, IIT, I.I.T => Indian Institute of Technology
small => tiny, short, teeny

Input text: "new i-phone is small"

Tokenizer to filter: "new", "i-phone", "is", "small"

Output: "new", "iphone", "is", "tiny", "short", "teeny"
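
The synonym filter factory also accepts optional attributes besides synonyms. The following sketch assumes the same synonyms.txt file and adds ignoreCase and expand; the values shown are illustrative rather than recommended settings:

```xml
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- ignoreCase="true": tokens are matched against the synonym file without regard to case -->
  <!-- expand="true": affects only comma-separated lines that have no "=>" -->
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
          ignoreCase="true" expand="true"/>
</analyzer>
```

Lines in synonyms.txt that list terms separated only by commas, with no =>, are treated as equivalent terms. With expand="true" a match is replaced by all of the listed terms; with expand="false" it is reduced to the first term in the list. Lines that use => always map the left-hand terms to the right-hand terms.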

Porter stem filter

The Porter stem filter applies the Porter stemming algorithm for English. It behaves much like the Snowball Porter stem filter set to English; the Snowball variant also accepts other languages, such as French or Spanish, as a parameter. The Porter stem filter is configured as follows:

<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.PorterStemFilterFactory"/>
</analyzer>

Input text: "run runs running ran"

Tokenizer to filter: "run", "runs", "running", "ran"

Output: "run", "run", "run", "ran"

The Porter algorithm only strips suffixes, so the irregular form "ran" is left unchanged.
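
For languages other than English, the Snowball Porter stem filter can be used instead. This is a minimal sketch that assumes French text; the choice of language is an example only:

```xml
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- language selects the Snowball stemmer to apply, here the French one -->
  <filter class="solr.SnowballPorterFilterFactory" language="French"/>
</analyzer>
```

Setting language="English" gives results close to those of the Porter stem filter shown earlier.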

Other filters are:

Length filter
Keep words filter
ICU transform filter
KStem filter
N-gram filter
Pattern replace filter
Position filter
Remove duplicates token filter
Shingle filter
Reversed wildcard filter
Token offset payload filter
Trim filter
Type token filter
Word delimiter filter
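
As a rough sketch of how a few of these filters can be chained, the following field type combines the trim and length filters with the lowercase filter. The field type name commaDelimited, the comma pattern, and the min and max values are all assumptions made for illustration:

```xml
<fieldType name="commaDelimited" class="solr.TextField">
  <analyzer>
    <!-- Split the input on commas; surrounding spaces stay attached to each token -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern=","/>
    <!-- Remove leading and trailing whitespace from each token -->
    <filter class="solr.TrimFilterFactory"/>
    <!-- Normalize case so that matching is case-insensitive -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Discard tokens shorter than 2 or longer than 20 characters -->
    <filter class="solr.LengthFilterFactory" min="2" max="20"/>
  </analyzer>
</fieldType>
```

Under these assumptions, input such as "Red , GREEN,  blue" should come out as "red", "green", "blue". The order matters: the trim filter runs before the length filter so that stray spaces do not count toward a token's length.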
