Using nonaggressive stemmers

Stemming algorithms, which reduce words to their stem or root forms, are a useful addition to a search application: they let users find words such as cat and cats just by typing cat. However, imagine a search engine that searches through the contents of the books in a library, where one of the requirements is to reduce plural forms to their singular forms; nothing less, nothing more. Can Solr do this? Yes, it can, and this recipe will show you how.

How to do it...

  1. First, let's start with a simple, two-field index (add the following section to your schema.xml file):
    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="description" type="text_light_stem" indexed="true" stored="true" />
  2. Now, let's define the text_light_stem field type, which should look like this (add this to your schema.xml file):
    <fieldType name="text_light_stem" class="solr.TextField">
     <analyzer>  
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.EnglishMinimalStemFilterFactory" />
      <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
    </fieldType>
  3. Then, let's check the analysis tool on the Solr administration pages; you should see that words such as ways, keys, and populations have been changed to their singular forms (way, key, and population).
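
To try the field type outside the analysis tool, a small test document can be indexed. The following is a sketch in Solr's XML update format; the id value and the description text are made up for illustration and simply reuse the example words from step 3.

```xml
<!-- Hypothetical test document; post it to the update handler of your core -->
<add>
  <doc>
    <field name="id">1</field>
    <field name="description">ways keys populations</field>
  </doc>
</add>
```

After indexing and committing this document, a query such as description:way would be expected to match it, because the same analyzer is applied at index time and at query time, so the indexed word ways and the query term are reduced to the same form.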

How it works...

First, we need to define the fields in the schema.xml file. We do this by adding the contents of the first example to the schema.xml file. It tells Solr that our index will consist of two fields: the id field, which holds the unique identifier of the document, and the description field, which holds the document description.

The description field is where the magic happens. We defined a new field type for this field and called it text_light_stem. The field type definition consists of a tokenizer and two filters. The solr.WhitespaceTokenizerFactory tokenizer splits the text on whitespace characters. The first filter is the one we are interested in: the light-stemming filter, provided by the solr.EnglishMinimalStemFilterFactory class, which performs minimal stemming. In general, aggressive stemming changes words more heavily, while minimal stemming is usually limited to removing plural forms. You can see this by using the analysis tool of the Solr administration panel. The second filter, solr.LowerCaseFilterFactory, is responsible for lowercasing terms. Note that because lowercasing runs after stemming here, the stemmer sees tokens in their original case.
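
In the recipe's filter chain, the stemmer receives tokens before they are lowercased. To our knowledge, the minimal English stemmer's rules look for a lowercase trailing s, so a fully uppercase token such as CATS would likely pass through unstemmed and end up indexed as cats. A minimal sketch of a variant that lowercases first is shown below; the type name text_light_stem_lc is hypothetical, and the result is worth confirming in the analysis tool before relying on it.

```xml
<!-- Hypothetical variant of text_light_stem: same tokenizer and filters, different order -->
<fieldType name="text_light_stem_lc" class="solr.TextField">
 <analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- Lowercase first, so the stemmer only ever sees lowercase tokens -->
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.EnglishMinimalStemFilterFactory"/>
 </analyzer>
</fieldType>
```

For text that is already lowercase or mixed case with a lowercase ending (for example, Cats), both orderings should produce the same terms; the difference only matters for tokens whose plural ending is uppercase.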

There's more...

Light stemmers are available for a number of languages. To use the light stemmer for your language, add the corresponding filter from the following table to your field type:

Language      Filter
Russian       solr.RussianLightStemFilterFactory
Portuguese    solr.PortugueseLightStemFilterFactory
French        solr.FrenchLightStemFilterFactory
German        solr.GermanLightStemFilterFactory
Italian       solr.ItalianLightStemFilterFactory
Spanish       solr.SpanishLightStemFilterFactory
Hungarian     solr.HungarianLightStemFilterFactory
Swedish       solr.SwedishLightStemFilterFactory
Finnish       solr.FinnishLightStemFilterFactory
Indonesian    solr.IndonesianStemFilterFactory (with the stemDerivational="false" attribute)
Norwegian     solr.NorwegianLightStemFilterFactory

In the case of solr.IndonesianStemFilterFactory, you need to add the stemDerivational="false" attribute to make it work as a light stemmer.
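
As an illustration, an Indonesian light-stemming field type could look like the following sketch. The type name text_light_stem_id is hypothetical, and the tokenizer and the placement of the lowercase filter before the stemmer are assumptions carried over from the English example rather than requirements of the filter.

```xml
<!-- Hypothetical field type for Indonesian light stemming -->
<fieldType name="text_light_stem_id" class="solr.TextField">
 <analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- stemDerivational="false" switches off derivational stemming -->
  <filter class="solr.IndonesianStemFilterFactory" stemDerivational="false"/>
 </analyzer>
</fieldType>
```

For the other languages in the table, the same structure should apply with the corresponding filter class and no extra attribute.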
