Time for action – adopting a stemmer

We now need to configure a stemmer for our language. We will adopt an English one to see how to proceed, and you can use the following simple configuration as a starting point for further, more advanced improvements.

  1. We can, for example, define a new field type for our arts_lang core, or simply modify the existing text_general field type we designed before, as follows:
    <fieldType name="text_general" class="solr.TextField">
      <analyzer>
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt" />
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="0" stemEnglishPossessive="1" />
        <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true" />
        <filter class="solr.KeywordRepeatFilterFactory" />
        <filter class="solr.PorterStemFilterFactory" />
        <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonims.txt" ignoreCase="true" expand="false" />
      </analyzer>
    </fieldType>

Note

Note how we are using a combination of token filters. Once we have normalized accents and split a sentence into a sequence of words on whitespace, we adopt the WordDelimiterFilter so that stemming can then be performed on each individual term. Note that a phrase such as "painting portraits" will be stemmed to "paint" for the first term and "portrait" for the second, so its collection of stems will probably be similar to that of the phrase "portrait of a painter". What I am suggesting here is that we are only playing with the form of every single term; we make no assumption about the meaning of terms that are reduced to the same stem.
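To make the first part of this concrete, here is a minimal sketch in the style of the unit tests introduced later in this chapter (assuming Lucene 4.5 on the classpath; PorterOnlyAnalyzer is a hypothetical name):

import java.io.Reader
import org.apache.lucene.analysis.Analyzer
import org.apache.lucene.analysis.Analyzer.TokenStreamComponents
import org.apache.lucene.analysis.BaseTokenStreamTestCase.assertAnalyzesTo
import org.apache.lucene.analysis.core.WhitespaceTokenizer
import org.apache.lucene.analysis.en.PorterStemFilter
import org.apache.lucene.util.Version
import org.junit.Test

// a bare-bones analyzer: whitespace tokenization followed by Porter stemming
class PorterOnlyAnalyzer extends Analyzer {
  override def createComponents(fieldName: String, reader: Reader) = {
    val tokenizer = new WhitespaceTokenizer(Version.LUCENE_45, reader)
    new TokenStreamComponents(tokenizer, new PorterStemFilter(tokenizer))
  }
}

@Test
def testStemsConflate() = {
  // both terms are reduced to their stems
  assertAnalyzesTo(new PorterOnlyAnalyzer(), "painting portraits",
    Array("paint", "portrait"))
}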

What just happened?

The goal is to index both the stemmed and unstemmed terms at the same position. We generally adopt a KeywordRepeatFilterFactory, which emits a copy of every token, so that one of the two can later be reduced to its stem and indexed as well. To generate a stem for a certain term, a component often needs a model that has been previously trained on a large set of data. In our case, we have decided to start with the simple PorterStemFilter, which does not need external model files because it already embeds the stemming rules for English. Be warned, then, that it will not work as expected if we plan to use it on a different language. When a KeywordRepeatFilterFactory is used before the actual stemmer filter factory, we generally put a RemoveDuplicatesTokenFilterFactory after the stemmer, to remove the duplicates emitted when the original term does not produce a new stem.
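A sketch of this behavior, reusing the imports of the previous snippet (the four-argument assertAnalyzesTo overload also checks position increments; KeywordRepeatAnalyzer is a hypothetical name):

import org.apache.lucene.analysis.miscellaneous.{KeywordRepeatFilter, RemoveDuplicatesTokenFilter}

class KeywordRepeatAnalyzer extends Analyzer {
  override def createComponents(fieldName: String, reader: Reader) = {
    val tokenizer = new WhitespaceTokenizer(Version.LUCENE_45, reader)
    val repeated = new KeywordRepeatFilter(tokenizer)
    val stemmed = new PorterStemFilter(repeated)
    new TokenStreamComponents(tokenizer, new RemoveDuplicatesTokenFilter(stemmed))
  }
}

@Test
def testKeywordRepeat() = {
  // "painting" is emitted twice: the original (position increment 1)
  // and its stem "paint" at the same position (increment 0);
  // "portrait" stems to itself, so the duplicate copy is removed
  assertAnalyzesTo(new KeywordRepeatAnalyzer(), "painting portrait",
    Array("painting", "paint", "portrait"),
    Array(1, 0, 1))
}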

There are a lot of alternative stemmers already provided with Solr for your use. They are either designed for a specific language, such as ItalianLightStemFilterFactory, or designed to internally switch between different models, for example: <filter class="solr.SnowballPorterFilterFactory" language="Italian"/>.

You will find a list of available stemmers at the following wiki page: http://wiki.apache.org/solr/LanguageAnalysis.
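Before settling on one of them in the schema, it can be handy to compare two stemmers on some sample text. The following is a minimal Scala sketch (assuming Lucene 4.5 on the classpath; CompareStemmers and printTokens are hypothetical names):

import java.io.{Reader, StringReader}
import org.apache.lucene.analysis.{Analyzer, TokenStream}
import org.apache.lucene.analysis.Analyzer.TokenStreamComponents
import org.apache.lucene.analysis.core.WhitespaceTokenizer
import org.apache.lucene.analysis.it.ItalianLightStemFilter
import org.apache.lucene.analysis.snowball.SnowballFilter
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
import org.apache.lucene.util.Version

object CompareStemmers extends App {

  // builds a throwaway analyzer around the given stemming filter
  def analyzerWith(stem: TokenStream => TokenStream) = new Analyzer {
    override def createComponents(fieldName: String, reader: Reader) = {
      val tokenizer = new WhitespaceTokenizer(Version.LUCENE_45, reader)
      new TokenStreamComponents(tokenizer, stem(tokenizer))
    }
  }

  // prints the tokens emitted by an analyzer for the given text
  def printTokens(analyzer: Analyzer, text: String): Unit = {
    val stream = analyzer.tokenStream("text", new StringReader(text))
    val term = stream.addAttribute(classOf[CharTermAttribute])
    stream.reset()
    while (stream.incrementToken()) print(term.toString + " ")
    stream.end()
    stream.close()
    println()
  }

  val phrase = "questa è una frase di esempio"
  printTokens(analyzerWith(t => new ItalianLightStemFilter(t)), phrase)
  printTokens(analyzerWith(t => new SnowballFilter(t, "Italian")), phrase)
}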

Testing language analysis with JUnit and Scala

Suppose that at a certain point we need to develop a component of our own, or to integrate existing components into a standard analyzer chain. If you and your team have some interest in this, a good starting point is being able to perform local unit testing, which lets you see how a component behaves before putting it into the entire workflow.

I think the Scala language is a good option here: it has powerful syntactic features, it is compatible with Java, and it is very compact and simple to read, at least when writing simple code.

For example, consider the following code snippet:

import org.apache.lucene.analysis.BaseTokenStreamTestCase.assertAnalyzesTo
[...]
class CustomItalianAnalyzer extends Analyzer {
  override def createComponents(fieldName: String, reader: Reader) = {
    val tokenizer = new WhitespaceTokenizer(Version.LUCENE_45, reader)
    val filter1 = new ItalianLightStemFilter(tokenizer)
    val filter2 = new LowerCaseFilter(Version.LUCENE_45, filter1)
    val filter3 = new CapitalizationFilter(filter2)
    new TokenStreamComponents(tokenizer, filter3)
  }
}

@Test
def testCustomAnalyzer() = {
  val customAnalyzer = new CustomItalianAnalyzer()
  assertAnalyzesTo(customAnalyzer, "questa è una frase di esempio",
    Array("Quest", "È", "Una", "Frase", "Di", "Esemp"))
}

You should be able to recognize the elements involved even if you are not familiar with the language. Note how unit testing lets us focus on small, concrete examples that serve as a reference. We can write as many test cases as we want to make sure our combination of components works correctly. The preceding Scala snippet, in which each component's output feeds the next one in the chain, is more or less equivalent to the following XML configuration:

<fieldType name="myField" class="myPackage.CustomItalianAnalyzer">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.ItalianLightStemFilterFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.CapitalizationFilter" />
  </analyzer>
</fieldType>

This should help us understand what happens at every step. You will find some examples of this kind of testing at /SolrStarterBook/projects/solr-maven/solr-plugins-scala/src/test/scala/testschema/TestAnalyzers.scala. Here, the BaseTokenStreamTestCase class, included in the solr-test-framework.jar package already available in the standard Solr distribution, plays a major role, because it provides helper methods that simplify the task.
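For instance, checkOneTerm is another helper from the same class; a quick sketch that reuses the CustomItalianAnalyzer and the expected output from the previous test:

import org.apache.lucene.analysis.BaseTokenStreamTestCase.checkOneTerm

@Test
def testSingleTerm() = {
  // runs "esempio" through the whole chain: stemmed to "esemp",
  // lowercased, and finally capitalized to "Esemp"
  checkOneTerm(new CustomItalianAnalyzer(), "esempio", "Esemp")
}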

Writing new Solr plugins

It's not difficult to write new Solr plugins, because the Solr internal workflow is well designed. Since Solr 4, we have access to many more aspects of the data: from the execution of a query, to the indexing and storing counterparts, to the final response.

The main problem is very often to quickly understand how to proceed: which interfaces and classes are needed for our purposes, how to test our code, and so on. The obvious first step is to follow the instructions on the wiki: http://wiki.apache.org/solr/SolrPlugins.

However, I suggest that you subscribe to the official mailing list (http://lucene.apache.org/solr/discussion.html) in order to follow interesting technical discussions, ask for suggestions, or get help with your problems.

Introducing Solr plugin structure and lifecycle

There are many types of Solr plugins that can be written and plugged in at different stages of the internal workflow. The previous example suggests that every plugin has to implement certain predefined methods in order to communicate with the engine. For example, a stop word filter factory needs to locate the text file containing our stop list, so it implements the ResourceLoaderAware interface to delegate the resolution and handling of the resource on disk to the core; similarly, a new search component can implement the SolrCoreAware interface to capture core-specific information.

Implementing interfaces for obtaining information

We can implement the ResourceLoaderAware interface when a component needs to directly manage resources on disk. The classes that can implement this interface are as follows:

  • TokenFilterFactory
  • TokenizerFactory
  • FieldType

If a new component needs access to the references of other components or to the configuration of its core, it should implement the SolrCoreAware interface. The classes that can implement this interface are as follows:

  • SolrRequestHandler
  • QueryResponseWriter
  • SearchComponent

As you can see here, the Aware suffix is almost self-explanatory; and, as you can imagine, a class may implement both interfaces.
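As a sketch of how the first case looks in code (against the Lucene/Solr 4.x APIs; MyStopFilterFactory and its words parameter are hypothetical names):

import java.util.{Map => JMap}
import org.apache.lucene.analysis.TokenStream
import org.apache.lucene.analysis.core.StopFilter
import org.apache.lucene.analysis.util.{CharArraySet, ResourceLoader, ResourceLoaderAware, TokenFilterFactory}
import org.apache.lucene.util.Version

// a hypothetical stop filter factory that loads its stop list from a file
class MyStopFilterFactory(args: JMap[String, String])
    extends TokenFilterFactory(args) with ResourceLoaderAware {

  // the file name comes from the arguments declared in schema.xml
  private val wordsFile = args.remove("words")
  private var stopWords: CharArraySet = _

  // the core calls this once a ResourceLoader is available, so the
  // resolution of the file on disk is delegated to Solr itself
  override def inform(loader: ResourceLoader): Unit = {
    stopWords = getWordSet(loader, wordsFile, true)
  }

  override def create(input: TokenStream): TokenStream =
    new StopFilter(Version.LUCENE_45, input, stopWords)
}

A SolrCoreAware component works in the same way: the core calls its inform(SolrCore) method during initialization, giving it access to core-specific configuration and components.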
