The DisMax query parser – part 1

The lucene query parser we've been using so far for searching offers a rich syntax, but it doesn't do anything more. A notable problem with this parser is that the query must be well formed according to those syntax rules, for example by having balanced quotes and parentheses. Users might type just about anything for a query, not knowing anything about this syntax, possibly resulting in an error or unexpected results. The DisMax query parser, named after Lucene's DisjunctionMaxQuery, addresses this problem and adds many features to enhance search relevancy (good scoring). The features of this query parser that have a more direct relationship to scoring are described in the section The DisMax query parser – part 2 in the next chapter. This parser is so important that we need to introduce it here.

You'll see references here to eDisMax, where the e stands for extended. This is a forked evolution of DisMax that adds features. It hasn't yet replaced the original DisMax query parser because it supports more of Lucene's syntax, which carries the risk that a user will trigger that syntax inadvertently. So if you don't care about eDisMax's extra features and don't have users who want the more advanced syntax support, then stick with the venerable DisMax. In a future Solr version, perhaps as soon as the next release, we expect dismax to refer to the enhanced version, while the older one will likely exist under another name.

Tip

Almost always use defType=edismax or dismax

The dismax (or edismax) query parser should almost always be chosen for parsing user queries (the q parameter). Set it in the request handler definition for your app. Furthermore, we recommend the use of edismax. The only consideration against this is whether it will be a problem for users to be able to use Solr's full syntax, inadvertently or maliciously. This will be explained shortly.

Here is a summary of the features that the dismax query parser has over the lucene query parser:

  • Searches across multiple fields with different score boosts through Lucene's DisjunctionMaxQuery.
  • Limits the query syntax to an essential subset. The edismax query parser permits Solr's full syntax, assuming it parses correctly.
  • Automatic phrase boosting of the entire search query. The edismax query parser boosts contiguous portions of the query too.
  • Convenient query boosting parameters, generally for use with function queries.
  • Can specify the minimum number of words to match, depending on the number of words in a query string.
  • Can specify a default query to use when no user query is specified.

The edismax query parser was only mentioned a couple of times in this list, but it improves on the details of how some of these features work.

Tip

Use debugQuery=on or debug=query

Enable query debugging to see a normalized string representation of the parsed query tree, considering all value-add options that dismax performs. Then, look for parsedquery in the debug output. See how it changes depending on the query.

These features will subsequently be described in greater detail. But first, let's take a look at a request handler we've set up to search for artists. Solr configuration that is not related to the schema is located in solrconfig.xml. The following definition is a simplified version of the one in this book's code supplement:

<requestHandler name="/mb_artists" class="solr.SearchHandler">
    <lst name="defaults">
        <str name="defType">edismax</str>
        <str name="qf">a_name a_alias^0.8 a_member_name^0.4</str>
        <str name="q.alt">*:*</str>
        <str name="mm">100%</str>
    </lst>
</requestHandler>

In Solr's admin Query interface screen, we can refer to this by setting Request-Handler to /mb_artists. You can observe the effect in the URL when you submit the form. It wasn't necessary to set up such a request handler, because Solr is fully configurable from a URL, but it's a good practice and it's convenient for Solr's search form.
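
As a minimal sketch of what the form does, the following Python snippet builds a request URL for this handler. The host, port, and core name (localhost:8983 and mbartists) are assumptions for illustration. Only the handler path and the parameter names come from the text above.

```python
from urllib.parse import urlencode

# Assumed base URL: host, port, and core name are placeholders, not from the book.
SOLR_BASE = "http://localhost:8983/solr/mbartists"

def artist_search_url(user_query, **overrides):
    """Build a URL for the /mb_artists handler.

    defType, qf, q.alt, and mm come from the handler defaults, so only q
    (plus any per-request overrides) needs to be sent.
    """
    params = {"q": user_query, "wt": "json"}
    params.update(overrides)
    return SOLR_BASE + "/mb_artists?" + urlencode(params)

print(artist_search_url("smashing pumpkins"))
# Any default can be overridden per request, for example to turn on debugging.
print(artist_search_url("smashing pumpkins", debugQuery="on"))
```

Because the defaults live in the handler, the client only has to pass the user's text. That is the convenience referred to above.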

Searching multiple fields

You use the qf parameter to tell the dismax query parser which fields you want to search and their corresponding score boosts. As explained in the section on request handlers, the query parameters can be specified in the URL or in the request handler configuration in solrconfig.xml; you'll probably choose the latter for this one. Here is the relevant line from the dismax-based handler configuration shown earlier:

<str name="qf">a_name a_alias^0.8 a_member_name^0.4</str>

This syntax is a space-separated list of field names that can each have optional boosts applied using the same syntax that is used in the query syntax for boosting. This request handler is intended to find artists from a user's query. Such a query would ideally match the artist's name, but we'll also search for aliases as well as bands that the artist is a member of. Perhaps the user didn't recall the band name but knew the artist's name. This configuration would give them the band in the search results, most likely towards the end.
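
To make the syntax concrete, here is a small Python sketch that splits a qf value into fields and boosts. The function name parse_qf is hypothetical and this is not how Solr parses the parameter internally. It only illustrates the format, and it assumes a field with no explicit boost has a boost of 1.

```python
def parse_qf(qf):
    """Split a qf value into (field, boost) pairs; the boost defaults to 1.0."""
    fields = []
    for token in qf.split():
        name, sep, boost = token.partition("^")
        fields.append((name, float(boost) if sep else 1.0))
    return fields

print(parse_qf("a_name a_alias^0.8 a_member_name^0.4"))
# [('a_name', 1.0), ('a_alias', 0.8), ('a_member_name', 0.4)]
```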

Note

The score boosts do not strictly order the results in a cascading fashion. A document that matches the query exactly in a_alias but only partially in a_name will probably appear on top. If in your application you are matching identifiers of some sort, then you may want to give that field a very high boost, such as 1,000, to virtually assure it will be on top.

One detail involved in searching multiple fields is the effect of stop words (for example, "the", "a", and so on) in the schema definition. If qf refers to some fields that filter out stop words and others that don't, then a search involving stop words will usually return no results. The edismax query parser fixes this by making stop words optional in the query unless the query consists entirely of stop words. With dismax, you can avoid the problem by ensuring that the query analyzer chain of every queried field filters out the same set of stop words.

Limited query syntax

The edismax query parser will first try to parse the user query with the full syntax supported by the lucene query parser, with a couple of tweaks. If the query fails to parse, it will fall back to the limited syntax of the original dismax, described in the next paragraph. Someday, this should be configurable, but it is not at this time. The aforementioned "tweaks" to the full syntax are that the Boolean operators "and" and "or" can be used in lowercase form, and that pure-negative subqueries are supported.

When using dismax (or edismax, when the user query failed to parse with the lucene query parser), the parser will restrict the syntax permitted to terms, phrases, and use of + and - (but not AND, OR, &&, ||) to make a clause mandatory or prohibited. Anything else is escaped if needed to ensure that the underlying query is valid. The intention is to never trigger an error, but unless you're using edismax, you'll have to code for this possibility due to outstanding bugs (SOLR-422, SOLR-874).

The following query example uses all of the supported features of this limited syntax:

"a phrase query" plus +mandatory without -prohibited

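As a rough illustration of how such a query breaks down, the following Python sketch groups its clauses by their prefix. It is not Solr's parser: the regular expression and the classify function are assumptions made for this example, and they ignore escaping and unbalanced quotes.

```python
import re

# One clause = an optional + or - prefix, then a quoted phrase or a run of
# non-space characters. This is a simplification for illustration only.
CLAUSE = re.compile(r'([+-]?)("[^"]*"|\S+)')

def classify(query):
    """Group clauses as required (+), prohibited (-), or optional (no prefix)."""
    kind = {"+": "required", "-": "prohibited", "": "optional"}
    groups = {"required": [], "prohibited": [], "optional": []}
    for prefix, text in CLAUSE.findall(query):
        groups[kind[prefix]].append(text)
    return groups

print(classify('"a phrase query" plus +mandatory without -prohibited'))
```

For the example query, this yields three optional clauses (the phrase, plus, and without), one required clause, and one prohibited clause.
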
Min-should-match

With the lucene query parser, you have a choice of the default operator being OR, thereby requiring just one query clause to match, or choosing AND to make all clauses required. This, of course, only applies to clauses not otherwise explicitly marked required or prohibited in the query using + and -. But these are two extremes, and sometimes it is preferable to find some middle ground. The dismax parser uses a method called min-should-match, a feature which describes how many clauses should match, depending on how many there are in the query—required and prohibited clauses are not included in the numbers. This allows you to quantify the number of clauses as either a percentage or a fixed number. The configuration of this setting is entirely contained within the mm query parameter using a concise syntax specification, which I'll describe in a moment.

Tip

Always set mm. When in doubt about what to set it to, use 100 percent. If it is not set, it uses the same defaulting rules as the lucene query parser, most likely resulting in an mm value equivalent to 0 percent, which is probably not what you want.

This feature is more useful if users use many words in their queries, at least three. This in turn suggests a text field that has some substantial text in it, but that is not the case for our MusicBrainz dataset. Nevertheless, we will put this feature to good use.

Basic rules

The following are the four basic mm specification formats expressed as examples:

  • 3: This specifies that three clauses are required, the rest are optional.
  • -2: This specifies that two clauses are optional, the rest are required.
  • 66%: This specifies that 66 percent of the clauses (rounded down) are required, the rest are optional.
  • -25%: This specifies that 25 percent of the clauses (rounded down) are optional, the rest are required.

Notice that - inverts the required/optional definition. It does not make any number negative from the standpoint of any definitions herein.

Note

Note that 75% and -25% may seem the same, but because of rounding they are not. Given five queried clauses, the first requires three (3.75 rounded down), whereas the second makes one optional (1.25 rounded down) and so requires four. This shows that if you desire a round-up calculation, then you can invert the sign and subtract the percentage from 100.

Two additional points about these rules are as follows:

  • If the mm rule is a fixed number n, but there are fewer queried clauses, then n is reduced to the queried clause count so that the rule will make sense. For example, if mm is -5 and only two clauses are in the query, then all are optional. Sort of!
  • Remember that in all circumstances across Lucene (and thus, Solr), at least one clause in a query must match, even if every clause is optional. So, in the preceding example and for 0 or 0%, one clause must still match, assuming that there are no required clauses present in the query.
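
The basic rules can be sketched in a few lines of Python. This follows the rules as described in this section rather than Solr's own source code, and the function name min_should_match is hypothetical.

```python
def min_should_match(spec, clause_count):
    """Return how many optional clauses must match for a basic mm spec.

    clause_count excludes clauses explicitly marked with + or -.
    """
    negative = spec.startswith("-")
    magnitude = spec.lstrip("-")
    if magnitude.endswith("%"):
        # Percentages round down (integer division).
        amount = clause_count * int(magnitude[:-1]) // 100
    else:
        amount = int(magnitude)
    # A leading "-" means `amount` clauses are optional rather than required.
    required = clause_count - amount if negative else amount
    # Clamp to what makes sense for this query. A result of 0 still means
    # one clause must match when the query has no required clauses.
    return max(0, min(required, clause_count))

print(min_should_match("75%", 5))   # 3
print(min_should_match("-25%", 5))  # 4
print(min_should_match("-5", 2))    # 0
```

The first two calls reproduce the rounding difference between 75% and -25% for five clauses, and the last one is the -5 example with only two clauses.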

Multiple rules

Now that you understand the basic mm specification format, which is for one simple rule, I'll describe the final format, which allows for multiple rules. This format is composed of an ordered, space-separated series of number<basicmm rules. Each can be read as, "If the clause count is greater than number, then apply rule basicmm". Only the right-most rule whose threshold the clause count exceeds is evaluated. Because the rules are listed in ascending order of number, that is the rule with the highest threshold that was met. If none match because there are too few clauses, then all clauses are required, which is a basic specification of 100 percent.

An example of the mm specification is given here:

2<75% 9<-3

This reads as follows:

If there are over nine clauses, then all but three are required (three are optional, and the rest are required). If there are over two clauses, then 75 percent are required (rounded down). Otherwise (one or two clauses) all clauses are required, which is the default rule.
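
To check this reading, here is a hedged Python sketch that evaluates a multi-rule specification for several clause counts. Both function names are hypothetical, the compact basic_mm helper restates the basic rules from earlier in this section, and the logic follows the description here rather than Solr's source code.

```python
def basic_mm(spec, n):
    """Required clause count for one basic rule such as 3, -2, 66%, or -25%."""
    amount = int(spec.strip("-%"))
    if spec.endswith("%"):
        amount = n * amount // 100  # percentages round down
    required = n - amount if spec.startswith("-") else amount
    return max(0, min(required, n))

def conditional_mm(spec, n):
    """Required clause count for a spec like '2<75% 9<-3' (thresholds ascending)."""
    required = n  # default when no threshold is exceeded: all clauses required
    for rule in spec.split():
        threshold, _, basic = rule.partition("<")
        if n <= int(threshold):
            break  # later rules have higher thresholds, so stop here
        required = basic_mm(basic, n)
    return required

for n in (2, 3, 4, 9, 10, 12):
    print(n, conditional_mm("2<75% 9<-3", n))
```

With two clauses both are required. With three or four clauses, 75 percent rounded down gives two and three. With ten clauses, all but three (seven) are required.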

Tip

I find it easier to interpret these rules if they are read right to left.

What to choose

A simple configuration for min-should-match is to require all clauses:

100%

For MusicBrainz searches, I do not expect users to be using many terms, but I expect most of them to match. If a user searches for three or more terms, then I'll let one be optional. Here is the mm spec:

2<-1

Tip

You may be inclined to require all of the search terms, and that's a good common approach. However, if just one word isn't found, then there will be no search results, an occurrence that most search software tries to minimize. Even if you make some of the words optional, the matching documents that have more of the search words will be towards the top of the search results, assuming score-sorted order (you'll learn why in the next chapter). There are other ways to approach this problem, for example, by performing a secondary search if the first returns no results or too few. Solr doesn't do this for you, but it's easy for the client to do. The client could even tell the user that this was done, which would yield a better search experience.
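
A client-side secondary search might look like the following Python sketch. Everything here is an assumption for illustration: run_query stands for whatever function your client uses to send parameters to Solr and return the matching documents, and the default mm values are simply the two specifications discussed in this section.

```python
def search_with_fallback(run_query, user_query, min_hits=1,
                         strict_mm="100%", relaxed_mm="2<-1"):
    """Try a strict search first; retry with a relaxed mm if it finds too little.

    run_query(params) is a caller-supplied function (hypothetical here) that
    sends the parameters to Solr and returns the list of matching documents.
    Returns (documents, relaxed), where relaxed tells the UI a retry happened.
    """
    docs = run_query({"q": user_query, "mm": strict_mm})
    if len(docs) >= min_hits:
        return docs, False
    docs = run_query({"q": user_query, "mm": relaxed_mm})
    return docs, True

# Stand-in for a real Solr call: pretend only the relaxed query finds anything.
def fake_run_query(params):
    return [] if params["mm"] == "100%" else ["doc1", "doc2"]

print(search_with_fallback(fake_run_query, "smashing pumpkins live"))
# (['doc1', 'doc2'], True)
```

The returned flag lets the user interface say that the search was broadened.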

A default query

The dismax query parser supports a default query, which is used in the event the user query q is not specified. This parameter is q.alt, and it is not subject to the limited syntax of dismax. Here's an example of it used to match all documents from within the request handler defaults in solrconfig.xml:

<str name="q.alt">*:*</str>

This parameter is usually set to *:* to match all documents and is often specified in the request handler configuration in solrconfig.xml. You'll see with faceting in the next section that there will not necessarily even be a keyword search, and so you'll want to display facets over all of the data.

The uf parameter

The DisMax and eDisMax query parsers support fielded queries within the q parameter. This means that a user can explicitly search any valid field using this syntax: field_name:value. The uf (user fields) parameter makes it possible to restrict the set of fields the user can search against. The value of this parameter can be a space-delimited list of field names. A wildcard (*) can be used for field name globbing. Dashes can be used to negate fields. For example, to allow user queries to search in the id field and in all fields starting with a_ except a_id, the uf parameter value would be id a_* -a_id.
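
The matching rule can be illustrated with a short Python sketch. It applies the rule as described above (a field is allowed if it matches a plain pattern and no negated one) and is not Solr's implementation. The function name field_allowed is hypothetical, and Python's fnmatch accepts more glob syntax than just *.

```python
from fnmatch import fnmatchcase

def field_allowed(field, uf):
    """Return True if a uf value permits the user to query the given field."""
    patterns = uf.split()
    allowed = [p for p in patterns if not p.startswith("-")]
    denied = [p[1:] for p in patterns if p.startswith("-")]
    if any(fnmatchcase(field, p) for p in denied):
        return False
    return any(fnmatchcase(field, p) for p in allowed)

uf = "id a_* -a_id"
for field in ("id", "a_name", "a_id", "r_name"):
    print(field, field_allowed(field, uf))
```

Here id and a_name are allowed, a_id is excluded by the negation, and r_name matches no pattern at all.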
