This removes all the words listed inside the stopwords.txt file. Removing stop words will reduce the size of the index and improve performance. These are the standard English stop words provided by Solr:
a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with.
We can add words to the file or remove words from it as required. We can also create a stop word file for another language and include it by specifying the file path in the words argument:
Factory class: solr.StopFilterFactory
Arguments:
- words (optional): The path of the file that contains a list of stop words, one per line. Blank lines and lines that begin with # are ignored. The path may be absolute or relative.
- format (optional): Indicates the format of the stopword list, for example, format="snowball" for a stopwords list that has been formatted for snowball.
- ignoreCase (true/false, default false): When true, case is ignored when testing for stop words. If it is true, the stop list should contain lowercase words.
Example:
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" />
  </analyzer>
</fieldType>
Input: This is an example
Tokenizer to filter: This, is, an, example
Output: This, example
Example:
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
  </analyzer>
</fieldType>
Input: This is an example
Tokenizer to filter: This, is, an, example
Output: example
In the first example, we have not specified the ignoreCase argument, so it defaults to false and the capitalized token This does not match the lowercase stop word this. In the second example, we have set it to true, so This is removed as well; this is why the outputs of the two examples differ. In Solr's built-in techproducts example, the stopwords_en.txt file is located at %SOLR_HOME%/example/techproducts/solr/techproducts/conf/lang/stopwords_en.txt.
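To illustrate the other-language case and the format argument together, here is a minimal sketch of a field type that applies a German stop word list in Snowball format. The field type name text_de and the path lang/stopwords_de.txt are assumptions modeled on the naming used above; substitute the names from your own configuration.

```xml
<!-- Assumed names: text_de and lang/stopwords_de.txt are illustrative only -->
<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- format="snowball" tells the filter that the list uses the Snowball layout -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" />
  </analyzer>
</fieldType>
```

A Snowball-formatted list differs from the plain format described earlier: it may contain several words per line and uses the vertical bar (|) rather than # to start a comment, so the format argument must match the file actually supplied.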
Lower case filter (LCF): Converts all uppercase letters in a token to lowercase
Factory class: solr.LowerCaseFilterFactory
Arguments: None
Example:
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
Input: This is An example
Tokenizer to filter: This, is, An, example
Output: this, is, an, example
All uppercase letters in the tokens are converted to lowercase. The position of LowerCaseFilterFactory in the filter chain is significant. If we define LCF before the stop filter, Solr will unnecessarily lowercase stop words that are going to be removed in the next step. When the stop filter comes first, it should set ignoreCase="true" so that capitalized stop words such as This are still removed.
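The ordering discussed here can be sketched as a single analyzer chain that combines both filters. This is one possible arrangement under the assumptions above, not the only valid one; the field type name text_en_ordered is hypothetical.

```xml
<!-- Hypothetical field type name; the filters and arguments are those described above -->
<fieldType name="text_en_ordered" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Stop words are removed first; ignoreCase="true" also catches "This" and "An" -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
    <!-- Only the tokens that survive the stop filter are lowercased -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With the input This is An example, the stop filter drops This, is, and An, so the lower case filter only has to process the token example.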