Time for action – debugging a query with the Dismax parser

Disjunction Max Query (Dismax) searches each field/term pair separately, and then combines the results by keeping, for every term, the maximum score among the fields that match. The fields to search are listed in the qf parameter; when it is omitted, Solr falls back to the default search field.

  1. Let's start again by using cURL for a Dismax query as follows:
    >> curl -X GET 'http://localhost:8983/solr/paintings/select?q=giraffe+dali+%22female+figure%22+vermeer&start=0&fl=title+artist+city&wt=json&indent=true&debugQuery=true&defType=dismax&qf=artist^2+abstract&mm=3&pf=abstract^3'
    
  2. When we find good parameter values in our experiments, we can, for example, configure a new request handler with these values predefined in solrconfig.xml.
  3. Looking at the following screenshot, extracted from the complete results, we can see different DisjunctionMaxQuery objects, one independent from the other:
    [Screenshot: DisjunctionMaxQuery objects in the debug output]
  4. In this example, the parameter values produce several independent searches through DisjunctionMaxQuery, whose results are then assembled into a final combined result.

What just happened?

In this kind of search, obtaining a list of good matches is somewhat more important than achieving high precision. Avoiding powerful (and sometimes ugly) operators and searching with a simple sequence of keywords corresponds to a more natural query for a common user.

We have used some new parameters here that are specific to Dismax:

  • defType:dismax: This activates the Dismax query parser. It is equivalent to checking the Dismax entry in the web interface.
  • qf:artist^2 abstract: This is the list of query fields. Here we restrict the search to some specific fields and give the artist field more importance than abstract in the results.
  • mm:3: This is the minimum match. Here, we search (if possible) for documents that match at least three of the terms.
  • pf:abstract^3: This defines a phrase field. We want to give an extra boost to documents in which the query terms match as a phrase in the abstract field.
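
Once values like these prove useful, they could be predefined in a dedicated request handler, as step 2 above suggests. The following is only a sketch: the handler name /dismax_paintings is a hypothetical choice, and the values simply repeat the ones discussed in the list above.

```xml
<!-- Sketch only: the handler name "/dismax_paintings" is hypothetical -->
<requestHandler name="/dismax_paintings" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- activate the Dismax query parser -->
    <str name="defType">dismax</str>
    <!-- query fields: artist weighs more than abstract -->
    <str name="qf">artist^2 abstract</str>
    <!-- at least three terms should match -->
    <int name="mm">3</int>
    <!-- extra boost for phrase matches in the abstract field -->
    <str name="pf">abstract^3</str>
  </lst>
</requestHandler>
```

With a handler like this in place, a request would mainly need to supply q; a parameter passed explicitly in the request still overrides the corresponding default.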

Tip

Note that if you activate the Dismax parser without specifying values for its own parameters, you will get the same results as with the standard Lucene parser.

The more readable and final parsed version of the same query is exposed by the parsedquery_toString field, as shown in the following screenshot:

[Screenshot: parsedquery_toString output]

The choice of query fields produces an explicit Boolean query for every term. This small series of independent queries is then assembled into a single query with a minimum match of 3 (which corresponds to mm=3).

Finally, the term matches and the phrase boost are taken into account separately, and only the phrase boost is optional (it has no + sign). If the phrase does not match, the output will be less precise in that respect, but there will still be some results. Generally speaking, we can use this query parser when we always want some results, even if they are less precise, as long as they are still pertinent.

Using an Edismax default handler

Extended Dismax Parser (Edismax) has been designed as an evolution of the Dismax parser, so it's probably the best choice as a default query handler in our project for common queries. It is almost identical to Dismax, but it introduces some more fine-tuning parameters; for example, the possibility to handle bigram and trigram phrase searches differently (pf2 and pf3, for sequences of two and three terms respectively), and the chance to define user fields (the uf parameter), on which it's still possible to make use of the Lucene syntax.

Most important of all, it is even more robust with respect to errors, incomplete conditions, missing operands, and so on. It has also been designed to perform automatic escaping, and it is flexible enough to let us write Boolean operators with different syntax, including lowercase: AND, and, &&, OR, or, ||. The tolerant behavior of this query parser is thus a good compromise between the power of Lucene and the behavior expected by a common user in real scenarios.
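
As a rough illustration of the extra parameters mentioned above, the defaults of an Edismax handler might include entries like the following. The boost values and the field list given to uf are assumptions chosen for this example, not recommended settings, and the default for lowercaseOperators may differ between Solr versions.

```xml
<!-- Fragment meant for the defaults list of a request handler; all values are illustrative -->
<lst name="defaults">
  <str name="defType">edismax</str>
  <str name="qf">artist^2 abstract</str>
  <!-- phrase boosts on sequences of two and three terms -->
  <str name="pf2">abstract^2</str>
  <str name="pf3">abstract^3</str>
  <!-- fields the user may address explicitly with the field:value syntax -->
  <str name="uf">title artist</str>
  <!-- treat lowercase "and" / "or" as Boolean operators -->
  <bool name="lowercaseOperators">true</bool>
</lst>
```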

You can find an explanation of parameters at http://wiki.apache.org/solr/ExtendedDisMax.

Tip

If you are interested in a good explanation of the origin of this parser, please consider reading this introduction at http://searchhub.org/2010/05/23/whats-a-dismax/.

Configuring an Edismax query handler as the default one for our core is as simple as adding the following code to our solrconfig.xml (you will find it in the /SolrStarterBook/solr-app/chp05/paintings_edismax core in the examples):

<requestHandler name="standard" class="solr.SearchHandler" default="true" >
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="echoParams">explicit</str>
    <str name="qf">artist^2 abstract</str>
    <str name="pf">abstract^3</str>
    <int name="mm">3</int>
    <str name="fq">abstract:painting*</str>
    <str name="fq">title:*</str>
    <str name="fq">artist:*</str>
    <str name="uf">paint~4</str>
    <str name="fl">title artist</str>
    <str name="wt">json</str>
  </lst>
</requestHandler>

Note that if some of the parameters' values are fixed across queries, a filter query can be used for performance reasons; for example, fq=abstract:painting* in our case. Keep in mind that a filter restricts the results: if we added fq=comment:painting, we would omit those results that do not have the term painting in the comment field, so please use this parameter carefully. Using title:* and artist:* is just a way to obtain only the documents that actually contain values for those fields. Note that we can pass as many single fq values as we want, and they will be cached internally for performance. We will see filter queries again later in Chapter 6, Using Faceted Search – from Searching to Finding.

Lastly, since in this example we don't want the Edismax search to use a handler different from the default one, we need to remove the original default handler, that is, the following line:

<requestHandler name="standard" class="solr.StandardRequestHandler" default="true" />

The choice of having the handlers working together or changing the default one will generally depend on your application.
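
If the two handlers are meant to work together, one possibility is to keep the original default handler and register the Edismax one under its own name. This is a sketch under that assumption; the name /edismax is hypothetical, and the qf value is reused from the configuration above.

```xml
<!-- the original handler stays the default one -->
<requestHandler name="standard" class="solr.StandardRequestHandler" default="true" />

<!-- hypothetical name: Edismax is available side by side with the default -->
<requestHandler name="/edismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">artist^2 abstract</str>
  </lst>
</requestHandler>
```

Queries sent to the default handler would then keep using the Lucene parser, while those addressed to the second handler would be parsed by Edismax.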

Tip

An interesting article by Nolan Lawson (http://nolanlawson.com/2012/06/02/comparing-boost-methods-in-solr/) can give us some suggestions on how to combine boosting with different types of parsers such as Edismax and the Lucene one.

However, if we don't substitute the default handler with the Edismax handler, we can of course use both Lucene and Edismax at the same time, as we will see in the next sections.
