Introducing stemming for query expansion

We have already adopted file-based synonym query expansion. We can now add a stemmer-based approach, which gives us another level of flexibility and lets us absorb a certain amount of variation in user queries and needs.

Note

Stemming is the process by which we map words derived from different forms of the same original form to a common stem (or base, or root form). Words that share the same stem can be seen as different surface forms of the same root, and they can be treated as synonyms. This is useful for expanding a query with terms that are probably related because they have a similar form. For example, stemming "painting" can produce the stem "paint", so that we are able to expand the query with results that match the terms "painter" or "painted". Keep in mind that different languages may require quite different approaches, algorithms, and tools. If you want to read more about stemming, I suggest you start with the related section of the almost classic information retrieval book: http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html.
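To make the idea concrete, here is a minimal sketch of stem-based expansion in Python. It is a toy suffix stripper and not the Porter algorithm or anything Solr uses internally; the suffix list, the function names `naive_stem` and `expand_query`, and the sample vocabulary are all assumptions made up for illustration.

```python
# Toy suffix-stripping stemmer: an illustrative sketch only.
# This is NOT the Porter algorithm and NOT Solr's implementation.
SUFFIXES = ("ing", "ed", "er", "s")  # hypothetical, English-only rule list


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


def expand_query(query_terms, vocabulary):
    """Return every vocabulary term that shares a stem with some query term."""
    stems = {naive_stem(term) for term in query_terms}
    return sorted(term for term in vocabulary if naive_stem(term) in stems)


# Hypothetical index vocabulary, used only for this example.
vocabulary = ["painting", "painter", "painted", "paint", "sculpture"]
expanded = expand_query(["painting"], vocabulary)
print(expanded)
```

With this naive rule list, a query for "painting" is expanded to every term that reduces to "paint". Real stemmers apply far more careful rules and will not always conflate the same words, so the output of an actual analyzer may differ from this sketch.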

Stemming algorithms are not trivial, as natural languages can differ greatly from each other in their morphology and in the forms that words can assume. Consider how German words are compounded and how they use suffixes, or think about gender inflection in different languages, or the difficulties in handling a non-European language with an English-by-default approach. Handling these kinds of very specific configurations will (maybe not surprisingly) soon lead us to some more advanced topics that we are not able to cover here in detail.

However, there are a lot of resources on the Web about this. These algorithms are still evolving, following different approaches that range from statistics-based to rule-based, and many more. Moreover, in the "big data" era they have improved a lot, thanks to the huge quantity of freely available data that can be used for testing and tuning models. Most of these tools are found in libraries for Natural Language Processing (NLP), which can also provide features such as tokenization, language detection, POS tagging (Part Of Speech annotation, useful for a deeper analysis of the text that moves toward handling semantics), entity recognition, and even more advanced features.

An extensive list of interesting tools in the NLP field can be found, as usual, on Wikipedia: http://en.wikipedia.org/wiki/List_of_natural_language_processing_toolkits#Natural_language_processing_toolkits.

You may have noticed that many of these features are being integrated into the Solr workflow itself in some way, as external tools, plugins, or even internal components. So, I suggest you subscribe to the developers' mailing list to follow any updates.
