Time for action – finding interesting topics using faceting on tokenized fields with a filter query

While we are working on our sites, it is generally useful to use faceted search over taxonomies, tag collections, or other kinds of enumerated terms; there are cases in which it could be useful to look at the most recurring and interesting words in a certain field. This kind of search can be played on all the documents in the index or, more interestingly, on a specific search, restricting the size of the documents' collection we are playing on. The steps for finding interesting topics using faceting on tokenized fields with filter query are as follows:

  1. In this particular scenario, we need to move in the opposite direction as before. We need to use tokenization in a similar way as for the "common" text field; so, we will use a tokenized field here instead of a keyword field to show this particular exception to the normal adoptions of faceting.
  2. Since we have already defined our field for the *_entity fields, we only need to check the configuration for a common tokenized field, which will be used to create a list of recurring terms.
  3. For our needs, a good configuration can be as shown in the following code:
    <fieldType name="text_general" class="solr.TextField">
      <analyzer>
        <charFilter class="solr.MappingCharFilterFactory"mapping="mapping-ISOLatin1Accent.txt" />
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.WordDelimiterFilterFactory"generateWordParts="1" generateNumberParts="1"catenateWords="0" catenateNumbers="0" catenateAll="1" splitOnCaseChange="1" preserveOriginal="0" />
        <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false" />
      </analyzer>
    </fieldType>
  4. We then need to restart the core in order to use the new configuration (for example, we will use it for fields such as subject and abstract), and I suggest you also clean and populate the index again. This can be easily achieved with the usual commands or by using the provided clean and post scripts (for example, clean.sh if you are on *NIX or clean.bat if you use Windows).
  5. Since the abstract field contains some detailed information on a specific painting, this is the ideal candidate for this kind of experimentation. If we want to obtain a list of the most used terms in this field, we can use the following query:
    >> curl -X GET 'http://localhost:8983/solr/paintings/select?q=*:*&rows=0&facet=true&facet.field=abstract&facet.limit=-1&facet.mincount=50&json.nl=map&wt=json'
    
  6. But, we can also play with two more parameters: the standard query and the filter query. For example, if we want to obtain the same results, not on the full collection of documents but only on documents that have a mention of the term oil in their abstract description, we can change the query a little, as shown in the following code:
    >> curl -X GET 'http://localhost:8983/solr/paintings/select?q=*:*&rows=0&facet=true&facet.field=abstract&facet.limit=10&facet.mincount=50&json.nl=map&wt=json&fq=abstract:oil'
    
  7. The standard and filter queries can generally produce simple differences in the results, and even if these differences are important for the score evaluation, they will not affect our experiments. In the left side of the following screenshot, you will see the results from the first query, and on the right side, the results from the second:
    Time for action – finding interesting topics using faceting on tokenized fields with a filter query

We will see filter queries later in this chapter. As a general rule, it is preferable to use them in combination with facets when you need to "pin" a certain term in order to restrict the space used on a subdomain; for example, the collection of documents containing the term oil.

What just happened?

When comparing the results from the previous two examples, it is worth noting that the hits change for every term. These differences produce a different order of the results; thus, terms proposed in the first example simply disappear in the second.

In the first example (on the left), we are asking for eight facet results (using facet.limit=8, but we can ask for all with facet.limit=-1), but we feel that we are only interested in terms that have a certain recurrent use in the documents (facet.mincount=50).

In the second example (on the right side of the image), we pin the term oil, which should be present in the abstract field, and use the same query again. This could be quite useful, because in this way we are restricting the actual documents' collection on which we play the query. We can ask for facets over the fields that are different from the one used for this kind of restriction; for example, we could facet on the title (facet.field=title), while we apply a restriction on the abstract field (fq=abstract:oil):

>> curl -X GET 'http://localhost:8983/solr/paintings/select?q=*:*&rows=0&facet=true&facet.field=title&facet.limit=10&facet.mincount=50&json.nl=map&wt=json&fq=abstract:oil'

You probably don't have an idea yet on how this could be important for implementing an effective user interface. Consider the fact that we don't want to loose the flexibility of narrowing searches and navigation with facets, while we still want to add some restrictions with filter queries and eventually search in the usual way with the standard query. You may want to read the following page on wiki:

http://wiki.apache.org/solr/SimpleFacetParameters#Multi-Select_Faceting_and_LocalParams

I also suggest you to play with the examples again, changing the fields we are currently using. For example, you might be interested in the result when using the title_entity field instead of the title field.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.144.172.38