Like tokenizers, filters produce a stream of tokens, but their input differs: a tokenizer reads raw characters, whereas a filter receives the tokens emitted by a tokenizer or by a preceding filter. A filter looks at each token and decides whether to keep it, change or replace it, or discard it. Filters also derive from org.apache.lucene.analysis.TokenStream.
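The keep, change, or discard behavior can be illustrated with a minimal sketch in plain Python. This is not Solr or Lucene code: the names whitespace_tokenizer, lowercase_filter, min_length_filter, and analyze are hypothetical stand-ins, and splitting on whitespace is an assumption chosen only to keep the example short.

```python
from typing import Iterable, Iterator


def whitespace_tokenizer(text: str) -> Iterator[str]:
    """Hypothetical stand-in for a tokenizer: turn raw text into tokens."""
    yield from text.split()


def lowercase_filter(tokens: Iterable[str]) -> Iterator[str]:
    """Change: replace every token with its lowercase form."""
    for token in tokens:
        yield token.lower()


def min_length_filter(tokens: Iterable[str], min_len: int = 2) -> Iterator[str]:
    """Keep or discard: pass tokens of at least min_len characters, drop the rest."""
    for token in tokens:
        if len(token) >= min_len:
            yield token


def analyze(text: str) -> list:
    """Chain the stages: each filter consumes the stream produced before it."""
    return list(min_length_filter(lowercase_filter(whitespace_tokenizer(text))))


print(analyze("I Love Apache Solr"))
```

Each stage takes a token stream and returns a token stream, which is why filters can be chained in any number after a tokenizer.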
A typical example of a filter looks something like this:
<fieldType name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
Filters are configured in schema.xml with a <filter> element as a child of <analyzer>, following the <tokenizer> element. Since filters take token streams as input, the filter definition should follow the tokenizer or another filter definition, as shown in the preceding example.
The preceding example starts with a standard tokenizer; it tokenizes the input text. Then these tokens pass through Solr's standard filter, which removes dots from acronyms and performs some other common operations. All the tokens are then set to lowercase, which will help in case-insensitive matching at query time.
As with tokenizers, the class attribute names a factory class that instantiates a filter object as needed. Filter factory classes must implement the org.apache.solr.analysis.TokenFilterFactory interface. Arguments may be passed to filter factories to change their behavior by setting attributes on the <filter> element. An example of a filter factory that takes arguments is as follows:
<fieldType name="hyphenDelimited" class="solr.TextField">
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="- " />
    <filter class="solr.LengthFilterFactory" min="3" max="6"/>
  </analyzer>
</fieldType>
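To see what this configuration would do to a query, the following plain-Python sketch mimics the two stages. It is an illustration under stated assumptions, not Solr's implementation: pattern_tokenizer, length_filter, and analyze_query are hypothetical names, the sample input is invented, the pattern is treated as a delimiter to split on, and the min and max bounds are assumed to be inclusive.

```python
import re


def pattern_tokenizer(text, pattern):
    """Split text wherever the regex matches, dropping empty pieces."""
    return [piece for piece in re.split(pattern, text) if piece]


def length_filter(tokens, min_len, max_len):
    """Keep only tokens whose length lies within [min_len, max_len]."""
    return [token for token in tokens if min_len <= len(token) <= max_len]


def analyze_query(text):
    """Mirror the hyphenDelimited chain: split on "- ", then filter by length."""
    tokens = pattern_tokenizer(text, r"- ")
    return length_filter(tokens, min_len=3, max_len=6)


print(analyze_query("apple- hi- banana- strawberry"))
```

With this sample input, "hi" is discarded for being shorter than 3 characters and "strawberry" for being longer than 6, leaving "apple" and "banana".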
Let's see some of the filter factories that are included in the Solr release.
The lowercase filter converts all uppercase letters in each token to lowercase; all other characters are left unchanged:
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
Input text: "I Love Apache Solr"
Tokenizer input to filter: "I", "Love", "Apache", "Solr"
Output: "i", "love", "apache", "solr"
The synonym filter is responsible for synonym mapping. Each token is matched with a list of synonyms present in the synonym file passed as an argument, and if a match is found, then the synonym is put in place of the token:
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
</analyzer>
The format of the synonyms.txt file is as follows:
i-phone, i phone, iphone => iphone
iit, IIT, I.I.T => Indian Institute of Technology
small => tiny, short, teeny
Input text: "new i-phone is small"
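Tokenizer input to filter: "new", "i", "phone", "is", "small" (the standard tokenizer splits "i-phone" at the hyphen, and the resulting pair of tokens matches the "i phone" entry)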
Output: "new", "iphone", "is", "tiny", "short", "teeny"
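The replacement logic can be sketched in plain Python as follows. This is a simplified illustration, not the SynonymFilterFactory implementation: parse_rules and synonym_filter are hypothetical names, only single-token entries on the left-hand side are matched, and the input is split on whitespace so that "i-phone" arrives as one token.

```python
def parse_rules(lines):
    """Build a token -> replacements mapping from 'a, b => c, d' rules."""
    mapping = {}
    for line in lines:
        left, right = line.split("=>")
        targets = [target.strip() for target in right.split(",")]
        for source in left.split(","):
            mapping[source.strip()] = targets
    return mapping


def synonym_filter(tokens, mapping):
    """Emit the replacements for a matching token, else the token itself."""
    for token in tokens:
        yield from mapping.get(token, [token])


rules = parse_rules([
    "i-phone, i phone, iphone => iphone",
    "small => tiny, short, teeny",
])

# Whitespace splitting is an assumption made to keep the sketch short.
tokens = "new i-phone is small".split()
print(list(synonym_filter(tokens, rules)))
```

Tokens with no matching rule, such as "new" and "is", pass through unchanged, while "small" expands into three tokens.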
The Porter stem filter applies the Porter stemming algorithm for the English language. It is very similar to the Snowball Porter stem filter configured for English; the Snowball Porter stem filter additionally lets you specify the language as a parameter, such as French or Spanish:
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.PorterStemFilterFactory"/>
</analyzer>
Input text: "run runs running ran"
Tokenizer input to filter: "run", "runs", "running", "ran"
Output: "run", "run", "run", "ran" (the Porter algorithm only strips suffixes, so the irregular form "ran" is left unchanged)
Other filters are: