The Clustering component

The clustering component groups similar documents into clusters using statistical techniques. Each cluster is labeled with a few words drawn from its documents, chosen because they distinguish those documents from the ones in the other clusters. As with the MoreLikeThis component, which also uses statistical techniques, the quality of the results is hit or miss. This component resides in its own contrib module and provides an extension point for integrating a clustering engine.

Tip

The primary means of navigation/discovery of your data should generally be search and faceting. For so-called unstructured text use cases, there are, by definition, few attributes to facet on. Clustering search results and presenting tag clouds (a visualization of faceting on words) are generally exploratory navigation methods of last resort in the absence of more effective document metadata.

Presently, this module has adapters for two search-result clustering algorithms that are available as part of the Carrot2 open source project; other commercial options exist too. Solr ships with the needed third-party libraries (JAR files). The clustering component has an extension point to support full-index clustering (offline clustering) via the clustering.collection parameter, but no implementation has materialized yet.

To get started with exploring this feature, we'll direct you to the Solr Reference Guide at https://cwiki.apache.org/confluence/display/solr/Result+Clustering. There is a quick-start set of instructions with which you'll be clustering Solr's example documents in under five minutes. It should be easy to copy the necessary configuration to your Solr instance and modify it to refer to your documents' fields. As you dive into the technology, Carrot2's powerful GUI workbench should be of great help in tuning it to get more effective results. For a public demonstration of Carrot2's clustering, visit http://search.carrot2.org/stable/search.
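
Once the configuration is in place, a clustering request is an ordinary search with clustering switched on. The following Python sketch shows roughly what that looks like from a client's point of view. It assumes a request handler registered at /clustering (the name used in the Reference Guide's example configuration), a Solr instance at http://localhost:8983/solr, and a collection named techproducts; adjust all three for your setup. The helper names build_clustering_url and summarize_clusters are hypothetical, and the sample response is made up for illustration, shaped like the clusters section that the component adds to the search response.

```python
from urllib.parse import urlencode


def build_clustering_url(base_url, collection, query, rows=100):
    """Build a search URL with result clustering enabled.

    Assumes a request handler named /clustering exists in solrconfig.xml.
    """
    params = {
        "q": query,
        "rows": rows,           # only the returned rows are clustered
        "clustering": "true",   # switches the clustering component on
        "wt": "json",
    }
    return f"{base_url}/{collection}/clustering?{urlencode(params)}"


def summarize_clusters(response):
    """Return (label, document count) pairs from a parsed JSON response."""
    return [
        (" / ".join(cluster.get("labels", [])), len(cluster.get("docs", [])))
        for cluster in response.get("clusters", [])
    ]


if __name__ == "__main__":
    # Hypothetical host and collection name; no network call is made here.
    print(build_clustering_url("http://localhost:8983/solr", "techproducts", "memory card"))

    # Made-up response fragment for illustration only.
    sample = {
        "clusters": [
            {"labels": ["DDR"], "score": 3.9, "docs": ["doc-1", "doc-2"]},
            {"labels": ["Other Topics"], "docs": ["doc-3"]},
        ]
    }
    for label, count in summarize_clusters(sample):
        print(label, count)
```

Because only the documents returned by the search are clustered, the rows parameter effectively sets how much material the algorithm has to work with; a larger value than you would use for a normal results page is usually worth trying.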
