Stop filter

This filter removes every token that matches a word listed in the stopwords.txt file. Removing stop words reduces the size of the index and can improve performance. These are the standard English stop words provided by Solr:

a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with.

We can add words to or remove words from the file as required. We can also create a file for another language and include it by giving its path in the words argument:

Factory class: solr.StopFilterFactory

Arguments:

  • words (optional): The path of the file that contains the list of stop words, one per line. Blank lines and lines that begin with # are ignored. The path may be absolute or relative.
  • format (optional): Indicates the format of the stop word list, for example, format="snowball" for a list that has been formatted for Snowball.
  • ignoreCase (true/false, default false): If true, case is ignored when testing for stop words. In that case, the stop list should contain lowercase words.

Example:

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" />
</analyzer>
</fieldType>

Input: This is an example

Tokenizer to filter: "This", "is", "an", "example"

Output: "This", "example"

Example:

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
</analyzer>
</fieldType>

Input: This is an example

Tokenizer to filter: "This", "is", "an", "example"

Output: "example"

In the first example, ignoreCase is not specified, so it defaults to false and the capitalized token This does not match the lowercase stop word this. In the second example, ignoreCase is set to true, so This is removed as well; that is why the two outputs differ. In Solr's built-in techproducts example, the file stopwords_en.txt is located at %SOLR_HOME%/example/techproducts/solr/techproducts/conf/lang/stopwords_en.txt.
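
As a sketch of the words and format arguments used together for another language, the fragment below configures a French field type. The field type name text_fr and the file lang/stopwords_fr.txt are assumptions for illustration: the file is assumed to exist in the collection's conf directory and to be written in Snowball format, which is why format="snowball" is set.

```xml
<!-- Sketch only: text_fr and lang/stopwords_fr.txt are assumed names -->
<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- format="snowball" assumes the file uses the Snowball stop word layout -->
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_fr.txt" format="snowball"/>
  </analyzer>
</fieldType>
```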

Lowercase filter (LCF)

This converts all uppercase letters in each token to lowercase.

Factory class: solr.LowerCaseFilterFactory

Arguments: None

Example:

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>

Input: This is An example

Tokenizer to filter: "This", "is", "An", "example"

Output: "this", "is", "an", "example"

All uppercase letters in the tokens are converted to lowercase. The position of LowerCaseFilterFactory in the filter chain is significant. If we define LCF before the stop filter, Solr will unnecessarily lowercase stop words that are going to be removed in the next step. If LCF comes after the stop filter instead, the stop filter should use ignoreCase="true" so that capitalized stop words are still removed.
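
The ordering discussed above can be sketched as follows. This is one possible arrangement rather than the only valid one; it assumes ignoreCase="true" on the stop filter, so that capitalized stop words such as This are removed before the remaining tokens are lowercased.

```xml
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Stop words are removed first; ignoreCase="true" is an assumption
         that lets "This" and "An" match the lowercase stop list -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
    <!-- Only the tokens that survive the stop filter are lowercased -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With the input This is An example, this chain would leave the single token example.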

If we need to use LCF in text analysis, we must apply it in both phases of an analyzer (index and query). If we define LCF only in the indexing phase and not in the querying phase, a query for Soccer will never match the indexed term soccer.
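
One way to keep both phases consistent is to declare the index and query analyzers explicitly, each with the lowercase filter. The following is a minimal sketch that reuses the field type name text_en from the earlier examples. A single analyzer element without a type attribute, as used in those examples, also applies the same chain to both phases.

```xml
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <!-- Applied when documents are indexed: Soccer is stored as soccer -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Applied to query text: a query for Soccer is also lowercased to soccer -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```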