While we are working on our sites, it is generally useful to use faceted search over taxonomies, tag collections, or other kinds of enumerated terms; there are cases in which it could be useful to look at the most recurring and interesting words in a certain field. This kind of search can be played on all the documents in the index or, more interestingly, on a specific search, restricting the size of the documents' collection we are playing on. The steps for finding interesting topics using faceting on tokenized fields with filter query are as follows:
*_entity
fields, we only need to check the configuration for a common tokenized field, which will be used to create a list of recurring terms.<fieldType name="text_general" class="solr.TextField"> <analyzer> <charFilter class="solr.MappingCharFilterFactory"mapping="mapping-ISOLatin1Accent.txt" /> <tokenizer class="solr.WhitespaceTokenizerFactory" /> <filter class="solr.WordDelimiterFilterFactory"generateWordParts="1" generateNumberParts="1"catenateWords="0" catenateNumbers="0" catenateAll="1" splitOnCaseChange="1" preserveOriginal="0" /> <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true" /> <filter class="solr.LowerCaseFilterFactory" /> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false" /> </analyzer> </fieldType>
subject
and abstract
), and I suggest you also clean and populate the index again. This can be easily achieved with the usual commands or by using the provided clean
and post
scripts (for example, clean.sh
if you are on *NIX or clean.bat
if you use Windows).>> curl -X GET 'http://localhost:8983/solr/paintings/select?q=*:*&rows=0&facet=true&facet.field=abstract&facet.limit=-1&facet.mincount=50&json.nl=map&wt=json'
>> curl -X GET 'http://localhost:8983/solr/paintings/select?q=*:*&rows=0&facet=true&facet.field=abstract&facet.limit=10&facet.mincount=50&json.nl=map&wt=json&fq=abstract:oil'
We will see filter queries later in this chapter. As a general rule, it is preferable to use them in combination with facets when you need to "pin" a certain term in order to restrict the space used on a subdomain; for example, the collection of documents containing the term oil
.
When comparing the results from the previous two examples, it is worth noting that the hits change for every term. These differences produce a different order of the results; thus, terms proposed in the first example simply disappear in the second.
In the first example (on the left), we are asking for eight facet results (using facet.limit=8
, but we can ask for all with facet.limit=-1
), but we feel that we are only interested in terms that have a certain recurrent use in the documents (facet.mincount=50
).
In the second example (on the right side of the image), we pin the term oil, which should be present in the abstract
field, and use the same query again. This could be quite useful, because in this way we are restricting the actual documents' collection on which we play the query. We can ask for facets over the fields that are different from the one used for this kind of restriction; for example, we could facet on the title (facet.field=title
), while we apply a restriction on the abstract field (fq=abstract:oil
):
>> curl -X GET 'http://localhost:8983/solr/paintings/select?q=*:*&rows=0&facet=true&facet.field=title&facet.limit=10&facet.mincount=50&json.nl=map&wt=json&fq=abstract:oil'
You probably don't have an idea yet on how this could be important for implementing an effective user interface. Consider the fact that we don't want to loose the flexibility of narrowing searches and navigation with facets, while we still want to add some restrictions with filter queries and eventually search in the usual way with the standard query. You may want to read the following page on wiki:
http://wiki.apache.org/solr/SimpleFacetParameters#Multi-Select_Faceting_and_LocalParams
I also suggest you to play with the examples again, changing the fields we are currently using. For example, you might be interested in the result when using the title_entity
field instead of the title
field.
3.144.172.38