Query syntax (the lucene query parser)

The query parser named lucene is Solr's most expressive and capable. With the benefit of hindsight, it should have been named "solr". It is based on Lucene's classic syntax with some additions that will be pointed out explicitly. In fact, you've already seen the first addition, which is local-params.

The lucene query parser has a couple of query parameters that can be set. They aren't normally specified, though, because a query can easily be written explicitly enough not to need them.

  • q.op: This is the default query operator, either AND or OR to signify if all of the search terms or just one of the search terms need to match. If this isn't present, then the default is specified in schema.xml near the bottom in the defaultOperator attribute. If that isn't specified, then the default is OR.
  • df: This is the default field that will be searched by the user query. If this isn't specified, then the default is specified in schema.xml near the bottom in the <defaultSearchField> element. If that isn't specified, then a query that does not explicitly specify a field to search will cause an error.

    Tip

    We recommend not using these parameters unless they are set through local-params, as in {! df=text q.op=AND}my query. Similarly, we recommend not setting the global defaults in the schema. One reason is that they affect every query in the same request, including ones you perhaps didn't intend, such as a facet query. Another is that a query depending on them is ambiguous unless you know how these parameters are set.

To play along with the examples in the book, go to http://localhost:8983/solr/#/mbartists/query and set the df parameter to a_name. We advise you not to use that parameter, but this is for experimentation. The default query operator remains at OR and doesn't need changing. You may find it easier to scan the results if you set fl (the field list) to a_name, score.

Note

To see a normalized string representation of the parsed query tree, enable debugQuery or set debug=query (conveniently via the Raw Query Parameters input). Then look for parsedquery in the debug output. See how it changes, depending on the query.
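If you would rather experiment from a script than from the admin UI, the following minimal sketch builds such a request URL. It scopes df and q.op to the query through local-params, as recommended in the earlier tip, and turns on debug=query. The /select handler path and the wt=json parameter are assumptions about your setup, and the build_select_url helper is a hypothetical name, not part of any Solr client.

```python
from urllib.parse import urlencode

# Assumed base URL: the mbartists core with the standard /select handler.
SOLR_SELECT_URL = "http://localhost:8983/solr/mbartists/select"


def build_select_url(user_query, default_field="a_name", operator="OR"):
    """Return a /select URL that scopes df and q.op to q via local-params."""
    # The local-params prefix applies only to this query string, so other
    # queries in the same request (for example, facet queries) are unaffected.
    q = "{!lucene df=%s q.op=%s}%s" % (default_field, operator, user_query)
    params = {
        "q": q,
        "fl": "a_name,score",  # keep the response easy to scan
        "debug": "query",      # adds parsedquery to the debug output
        "wt": "json",          # assumed response format
    }
    # urlencode takes care of URL encoding (+ becomes %2B, space becomes +).
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_select_url("+Smashing Pumpkins")
print(url)
```

Fetching that URL (with urllib.request or a browser) and looking at parsedquery in the debug section shows how Solr interpreted the query.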

Matching all the documents

Lucene doesn't natively have a query syntax to match all documents. Solr enhances Lucene's query syntax to support this with the following syntax:

*:*

Solr 4.2 also accepts a shorter form:

*

When using dismax, it's common to set q.alt to this match-everything query so that a blank query returns all results.

Mandatory, prohibited, and optional clauses

Lucene has a unique way of combining multiple clauses in a query string. It is tempting to think of this as a mundane detail that is common to Boolean operations in programming languages, but Lucene doesn't quite work that way.

A query expression is decomposed into a set of unordered clauses of three types:

  • A clause can be mandatory: +Smashing

    This matches only artists containing the word Smashing.

  • A clause can be prohibited: -Smashing

    This matches all artists except those with Smashing. You can also use an exclamation mark as in !Smashing but that's rarely used.

  • A clause can be optional: Smashing

    Note

    Spaces must not come between +, ! or - and the search word for it to work as described here, otherwise the operator itself is treated like a separate word and the word to its right will default to optional. Typically, the operator won't actually be searched for since text analysis usually removes it.

The term optional deserves further explanation. If the query expression contains at least one mandatory clause, then any optional clause is just that—optional. This notion may seem pointless, but it serves a useful function in scoring documents that match more of them higher. If the query expression does not contain any mandatory clauses, then at least one of the optional clauses must match. The next two examples illustrate optional clauses.

Here, Pumpkins is optional, and the well-known band will surely be at the top of the list, ahead of bands with names like Smashing Atoms:

+Smashing Pumpkins

In this example, there are no mandatory clauses and so documents with Smashing or Pumpkins are matched, but not Atoms. The Smashing Pumpkins is at the top because it matched both, followed by other bands containing only one of those words:

Smashing Pumpkins -Atoms

If you would like to specify that a certain number or percentage of optional clauses should match or should not match, you can instead use the DisMax query parser with the min-should-match feature, described later in the chapter.
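The rules for mandatory, prohibited, and optional clauses can be restated as a small model. This is only a sketch of the matching logic described in this section, not Lucene's implementation: it treats a document as a set of lowercased terms, it ignores scoring apart from counting optional hits, and the matches and optional_hits names are made up for illustration.

```python
def matches(doc_terms, must=(), should=(), must_not=()):
    """Simplified model of how a flat set of clauses selects a document."""
    terms = set(doc_terms)
    if any(t in terms for t in must_not):
        return False  # a prohibited clause matched
    if not all(t in terms for t in must):
        return False  # a mandatory clause is missing
    if not must and should:
        # With no mandatory clause, at least one optional clause must match.
        return any(t in terms for t in should)
    return True


def optional_hits(doc_terms, should=()):
    """Count matching optional clauses; more hits generally score higher."""
    terms = set(doc_terms)
    return sum(1 for t in should if t in terms)


pumpkins_band = ["the", "smashing", "pumpkins"]
atoms_band = ["smashing", "atoms"]

# +Smashing Pumpkins: both match, but only one also hits the optional clause.
print(matches(pumpkins_band, must=["smashing"], should=["pumpkins"]))
print(matches(atoms_band, must=["smashing"], should=["pumpkins"]))
print(optional_hits(pumpkins_band, should=["pumpkins"]))

# Smashing Pumpkins -Atoms: the prohibited clause excludes Smashing Atoms.
print(matches(atoms_band, should=["smashing", "pumpkins"], must_not=["atoms"]))
```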

Boolean operators

The Boolean operators AND, OR, and NOT can be used as an alternative syntax to arrive at the same set of mandatory, optional, and prohibited clauses that were mentioned previously. Use the debugQuery feature and observe that the parsedquery string normalizes this syntax away into the previous +/- form (clauses are optional by default, as if joined by OR).

Note

Case matters! At least this means that it is harder to accidentally specify a Boolean operator.

When the AND or && operator is used between clauses, the clauses on both the left and right sides of the operator become mandatory, unless already marked as prohibited. Consider this query:

Smashing AND Pumpkins

It is equivalent to:

+Smashing +Pumpkins

Similarly, if the OR or || operator is used between clauses, the clauses on both the left and right sides of the operator become optional, unless they are marked mandatory or prohibited. If the default operator is already OR, then this syntax is redundant. If the default operator is AND, then this is the only way to mark a clause as optional.

To match artist names that contain Smashing or Pumpkins, try:

Smashing || Pumpkins

The NOT operator is equivalent to the - syntax. So to find artists with Smashing but not Atoms in the name, you can do this:

Smashing NOT Atoms

We didn't need to specify a + on Smashing. This is because it is the only optional clause and there are no explicit mandatory clauses. Likewise, using AND or OR would have no effect in this example.

It may be tempting to try to combine AND with OR such as:

Smashing AND Pumpkins OR Green AND Day

However, this doesn't work as you might expect! Remember that AND makes the clauses on both of its sides mandatory, and thus each of the four clauses becomes mandatory. Our dataset returned no results for this query. To combine clauses in this way, you need to use subqueries.

Subqueries

You can use parentheses to compose a query of smaller queries, referred to as subqueries or nested queries. The following example satisfies the intent of the previous example:

(Smashing AND Pumpkins) OR (Green AND Day)

Using what we learned previously, this could also be written as:

(+Smashing +Pumpkins) (+Green +Day)

But this is not the same as:

+(Smashing Pumpkins) +(Green Day)

The preceding subquery is interpreted as documents that must have a name with Smashing or Pumpkins and either Green or Day in its name. So if there were a band named Green Pumpkins, then it would match.
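To see why the two groupings differ, here is a small nested model of the clause rules. It is a sketch under the same simplifications as the rest of this section's semantics (a document is a set of lowercased terms, no scoring), and the term and bool_query helpers are hypothetical names, not Lucene classes.

```python
def term(word):
    """Matcher for a single term."""
    return lambda terms: word in terms


def bool_query(*clauses):
    """Build a matcher from (occur, matcher) pairs; occur is '+', '-' or ''."""
    def match(terms):
        must = [m for occur, m in clauses if occur == "+"]
        should = [m for occur, m in clauses if occur == ""]
        must_not = [m for occur, m in clauses if occur == "-"]
        if any(m(terms) for m in must_not):
            return False
        if not all(m(terms) for m in must):
            return False
        if must:
            return True
        # No mandatory clause: at least one optional clause has to match.
        return any(m(terms) for m in should)
    return match


smashing, pumpkins = term("smashing"), term("pumpkins")
green, day = term("green"), term("day")

# (+Smashing +Pumpkins) (+Green +Day)
either_band = bool_query(
    ("", bool_query(("+", smashing), ("+", pumpkins))),
    ("", bool_query(("+", green), ("+", day))),
)

# +(Smashing Pumpkins) +(Green Day)
one_word_from_each = bool_query(
    ("+", bool_query(("", smashing), ("", pumpkins))),
    ("+", bool_query(("", green), ("", day))),
)

green_pumpkins = {"green", "pumpkins"}
print(either_band(green_pumpkins))         # False
print(one_word_from_each(green_pumpkins))  # True
```

The hypothetical band Green Pumpkins satisfies only the second form, because each mandatory subquery there needs just one of its two words.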

Solr added another syntax for subqueries to Lucene's old syntax, which allows the subquery to use a different query parser, including local-params. This is an advanced technique, so don't worry if you don't understand it at first.

As an example, suppose you have a search interface with multiple query boxes, where each box searches a different field. You could compose the query string yourself, but you would have some query-escaping issues to deal with. And if you wanted to take advantage of the dismax parser, then with what you know so far, that isn't possible. Here's an approach using this new syntax:

+{!dismax qf=a_name v=$q.a_name} +{!dismax qf=a_alias v=$q.a_alias}

This example assumes that request parameters of q.a_name and q.a_alias are supplied for the user input for these fields in the schema. Recall from the local-params definition that the parameter v can hold the query and that the $ refers to another named request parameter.

Note

With versions of Solr earlier than 4.1, the syntax is slightly different and more complicated. The syntax uses a magic field named _query_ with its value being the subquery, which practically speaking, needs to be quoted. Here's the query from the preceding example, using the old syntax:

+_query_:"{!dismax qf=a_name v=$q.a_name}" +_query_:"{!dismax qf=a_alias v=$q.a_alias}"
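On the client side, the nested dismax query (in its Solr 4.1+ form) might be assembled as in the following sketch. The point it illustrates is that the user's text travels in its own request parameters and is never spliced into q, so only URL encoding is needed. The build_params helper is a hypothetical name, and the sketch assumes the request is handled by the default lucene parser.

```python
from urllib.parse import urlencode

# The query string is fixed; $q.a_name and $q.a_alias are dereferenced
# by Solr from the request parameters of the same name.
NESTED_QUERY = (
    "+{!dismax qf=a_name v=$q.a_name} "
    "+{!dismax qf=a_alias v=$q.a_alias}"
)


def build_params(name_input, alias_input):
    """Return request parameters for a two-box search form."""
    return {
        "q": NESTED_QUERY,
        "q.a_name": name_input,    # raw text from the first query box
        "q.a_alias": alias_input,  # raw text from the second query box
    }


params = build_params("Smashing Pumpkins", "Pumpkins")
query_string = urlencode(params)
print(query_string)
```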

Limitations of prohibited clauses in subqueries

Lucene doesn't actually support a pure negative query; for example:

-Smashing -Pumpkins

Solr enhances Lucene to support this, but only at the top-level query, such as in the preceding example. Consider the following, admittedly strange, query:

Smashing (-Pumpkins)

This query attempts to ask the question: Which artist names either contain Smashing or do not contain Pumpkins? However, it doesn't work: it matches only the first clause (four documents). The second clause should match most documents, so the query as a whole should match nearly every document. The artist named Wild Pumpkins at Midnight is the only one in the index that contains Pumpkins but not Smashing, and so this query should match every document except that one.

To make this work, you have to take the subexpression containing only negative clauses, and add the all-documents query clause: *:*, as shown here:

Smashing (-Pumpkins *:*)

Interestingly, this limitation is fixed in the edismax query parser. Hopefully, a future version of Solr will fix it universally, thereby making this workaround unnecessary.
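If your application generates subqueries, it can apply the workaround automatically. The following sketch assumes each clause is already a well-formed string and that prohibited clauses are written with a leading - or ! (it does not recognize the NOT keyword); guard_pure_negative is a hypothetical helper name.

```python
def guard_pure_negative(clauses):
    """Join clause strings into a parenthesized subquery.

    If every clause is prohibited, append *:* so that the subquery has
    a set of documents to subtract from.
    """
    if not clauses:
        raise ValueError("a subquery needs at least one clause")
    parts = list(clauses)
    if all(c.startswith(("-", "!")) for c in parts):
        parts.append("*:*")
    return "(" + " ".join(parts) + ")"


query = "Smashing " + guard_pure_negative(["-Pumpkins"])
print(query)
```

Subqueries that already contain a positive clause are left unchanged apart from the parentheses.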

Querying specific fields

To have a clause explicitly search a particular field, you need to precede the relevant clause with the field's name, and then add a colon; spaces may be used in between, but that is generally not done:

a_member_name:Corgan

This matches bands containing a member with the name Corgan. To match Billy and Corgan, do the following:

+a_member_name:Billy +a_member_name:Corgan

Or use this shortcut to match multiple words:

a_member_name:(+Billy +Corgan)

The content of the parentheses is a subquery, but with the default field being overridden to be a_member_name, instead of what the default field would be otherwise. By the way, we could have used AND instead of +, of course. Moreover, in these examples, all of the searches were targeting the same field, but you can certainly match any combination of fields needed.

Phrase queries and term proximity

A clause may be a phrase query: a contiguous series of words to be matched in order. In the previous examples, we've searched for text containing multiple words such as Billy and Corgan, but let's say we wanted to match Billy Corgan (that is, the two words adjacent to each other in that order). This further constrains the query. Double quotes are used to indicate a phrase query, as shown in the following query:

"Billy Corgan"

Related to phrase queries is the notion of the term proximity, also known as the slop factor or a near query. In our previous example, if we wanted to permit these words to be separated by no more than say three words in between, we could do this:

"Billy Corgan"~3

For the MusicBrainz dataset, this is probably of little use. For larger text fields, this can be useful in improving search relevance. The dismax query parser, which is described in the next chapter, can automatically turn a user's query into a phrase query with a configured slop.

For advanced requirements such as wildcards and Booleans within a phrase query, ComplexPhraseQueryParser can be used. For more information on this parser, its options and performance considerations, visit https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-ComplexPhraseQueryParser.

Wildcard queries

A plain keyword search will look in the index for an exact match, subsequent to text analysis processing on both the query and input document text (for example, tokenization and lowercasing). But sometimes you need to express a query for a partial match expressed using wildcards.

Note

There is a highly relevant section in Chapter 3, Text Analysis, on partial/substring indexing. In particular, read about ReversedWildcardFilterFactory. N-grams is a different approach that does not work with wildcard queries.

There are a few points to understand about wildcard queries:

  • Wildcard queries are a type of multiterm query, which means that the input is expanded into multiple terms during analysis. By default, the analyzer chain applied to multiterm queries lowercases the input. Prefix, regex, and range queries are also forms of multiterm queries. For more information on multiterm query analysis, see the wiki page at http://wiki.apache.org/solr/MultitermQueryAnalysis.
  • If the field that you want to use the wildcard query on is stemmed in the analysis, then smashing* might not find the original text Smashing. The Porter stemmer will transform this word to smash, whereas EnglishMinimalStemmer (used in a_name) won't touch this word. Consequently, don't stem or use a minimal stemmer.
  • Wildcard queries are one of the slowest types you can run. Use of ReversedWildcardFilterFactory helps with this a lot. But if you have an asterisk wildcard on both ends of the word, then this is the worst-case scenario.

To find artists containing words starting with Smash, you can use:

smash*

Or perhaps to find those starting with sma and ending with ing, use:

sma*ing

The asterisk matches any number of characters (perhaps none). You can also use ? to match exactly one character, whatever it is, at that position:

sma??*

That would match words that start with sma that have at least two more characters, but potentially more.
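To get a feel for which indexed terms a pattern expands to, you can mimic the two wildcard characters with Python's fnmatch module. This is only an approximation for experimentation: fnmatch also gives special meaning to square brackets, which Lucene's wildcard syntax does not, and the term list here is invented for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical lowercased terms, as they might appear in the index.
indexed_terms = ["smash", "smashing", "smack", "small", "sma", "smag", "smoking"]


def expand(pattern, terms):
    """Return the terms that a wildcard pattern would expand to."""
    return [t for t in terms if fnmatchcase(t, pattern)]


print(expand("smash*", indexed_terms))
print(expand("sma*ing", indexed_terms))
print(expand("sma??*", indexed_terms))
```

Note that sma and smag are not matched by sma??*, because each ? needs a character of its own.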

As far as scoring is concerned, each matching term gets the same score regardless of how close it is to the query pattern. Lucene can support a variable score at the expense of performance, but you would need to do some hacking to get Solr to do that.

Fuzzy queries

Fuzzy queries are useful when your search term needn't be an exact match, but the closer the better. The fewer the number of character insertions, deletions, or exchanges relative to the search term length, the better the score. The algorithm used is known as the Levenshtein Distance algorithm, also known as the edit distance. Fuzzy queries have the same need to avoid stemming, just as wildcard queries do. For example:

smashing~

Notice the tilde character at the end. Without this notation, simply smashing matches only four documents because only that many artist names contain that word. The search term smashing~ matched 26 documents. The default edit distance is 2, but you can reduce it to 1 like so for less fuzzy matching:

smashing~1

That results in six matched documents—two more than a non-fuzzy search. Prior to Lucene 4, the edit distance was specified as a fraction of the number of characters in the word, and Lucene could search based on whatever edit distance this came to, albeit slowly. Lucene 4 is much faster but is limited to an edit distance no greater than 2, so you are now best off simply specifying 1 or 2 instead of using the fractional syntax.
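For reference, the classic Levenshtein distance can be computed as in the following sketch. It is the textbook dynamic-programming formulation, shown to make the notion of edit distance concrete; it is not how Lucene 4 evaluates fuzzy queries internally, and the candidate words are invented for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb
            ))
        previous = current
    return previous[-1]


def within(term, candidates, max_edits=2):
    """Return the candidates no more than max_edits away from term."""
    return [c for c in candidates if edit_distance(term, c) <= max_edits]


words = ["smashing", "slashing", "mashing", "smash", "stashes"]
print(within("smashing", words, max_edits=1))
```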

Regular expression queries

There may be scenarios where you need to match documents using a specific pattern that can't be expressed using wildcard or fuzzy queries. For these cases, a regular expression query might be the answer.

The Solr regular expression syntax is simple and straightforward. Here's an example that matches documents that contain a possible 5-digit zip code, somewhere in the a_address field:

a_address:/[0-9]{5}/

As you can see, the pattern is enclosed in forward slashes (delimiters). Solr implicitly matches the pattern against each complete indexed term, so there is no need to anchor it to the beginning or end of the input string.

Regular expression queries are constant scoring—the scores of any matching documents will always be 1.0.
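The implicit anchoring behaves like a full match rather than a search. The sketch below illustrates the idea with Python's re module; Lucene's regular expression dialect differs from Python's in general, but this particular pattern means the same thing in both. The sample terms are invented.

```python
import re

ZIP_PATTERN = re.compile(r"[0-9]{5}")


def term_matches(term):
    """True only if the whole term matches, as with /[0-9]{5}/ in Solr."""
    return ZIP_PATTERN.fullmatch(term) is not None


for term in ["90210", "902101", "a90210"]:
    print(term, term_matches(term))
```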

Range queries

Lucene lets you query for numeric, date, and even text ranges. The following query matches all of the bands formed in the 1990s:

a_type:2 AND a_begin_date:[1990-01-01T00:00:00.000Z TO 1999-12-31T23:59:59.999Z]

Observe that the date format is the full ISO-8601 date-time in UTC, which Solr mandates (the same format used by Solr to index dates and that which is emitted in search results). The .999 milliseconds part is optional. The [ and ] brackets signify an inclusive range, and, therefore, it includes the dates on both ends. To specify an exclusive range, use { and }. In Solr 3, both sides must be inclusive or both exclusive; Solr 4 allows them to be mixed. The workaround in Solr 3 is to introduce an extra clause to include or exclude a side of the range.
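When a range like this is generated by code, a half-open interval avoids having to spell out the last millisecond of the decade. The following sketch formats the endpoints in the UTC format Solr expects and mixes an inclusive start with an exclusive end, which requires Solr 4. The solr_date and decade_range helpers are hypothetical names.

```python
from datetime import datetime, timezone


def solr_date(dt):
    """Format an aware datetime as a UTC ISO-8601 string for Solr."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def decade_range(field, first_year):
    """Build a range clause covering ten years starting at first_year."""
    start = datetime(first_year, 1, 1, tzinfo=timezone.utc)
    end = datetime(first_year + 10, 1, 1, tzinfo=timezone.utc)
    # [ makes the start inclusive; } makes the end exclusive (Solr 4).
    return "%s:[%s TO %s}" % (field, solr_date(start), solr_date(end))


clause = decade_range("a_begin_date", 1990)
print(clause)
```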

Tip

Use the right field type

To get the fastest numerical/date range query performance, particularly when there are many indexed values, use a trie field (for example, tdate) with precisionStep. This was discussed in Chapter 2, Schema Design.

For most numbers in the MusicBrainz schema, we only have identifiers, and so it made sense to use the plain long field type, but there are some other fields. For the track duration in the tracks data, we could do a query such as the following one to find all of the tracks that are longer than 5 minutes (300 seconds, 300,000 milliseconds):

t_duration:[300000 TO *]

In this example, we can see Solr's support for open-ended range queries by using *.

Although uncommon, you can also use range queries with text fields. For this to have any use, the field should have only one term indexed. You can control this either by using the string field type, or by using the KeywordTokenizer. You may want to do some experimentation. The following example finds all documents where somefield has a term starting with B:

somefield:[B TO C}

Neither side of the range, B nor C, is processed with any text analysis that might exist in the field type definition. If there is any text analysis such as lowercasing, you will need to do the same to the query or you will get unexpected results.

Date math

Solr extended Lucene's old query parser to add date literals as well as some simple math that is especially useful in specifying date ranges. In addition, there is a way to specify the current date-time using NOW. The syntax offers addition, subtraction, and rounding at various levels of date granularity, such as years, seconds, and so on down to milliseconds. The operations can be chained together as needed, in which case they are executed from left to right. Spaces aren't allowed. For example:

r_event_date:[* TO NOW-2YEAR]

In the preceding example, we searched for documents where an album was released over two years ago. NOW has millisecond precision. Let's say what we really wanted was precision to the day. By using /, we can round down (it never rounds up):

r_event_date:[* TO NOW/DAY-2YEAR]

The units to choose from are YEAR, MONTH, DAY, DATE (synonymous with DAY), HOUR, MINUTE, SECOND, MILLISECOND, and MILLI (synonymous with MILLISECOND). Furthermore, they can be pluralized by adding an S, as in YEARS.
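The left-to-right evaluation can be illustrated with a small interpreter. This is a sketch of the semantics described above, not Solr's parser: it supports only NOW followed by YEAR, MONTH, DAY, and HOUR operations, and the way it clamps the day of the month when shifting years or months is an assumption made for the example.

```python
import calendar
import re
from datetime import datetime, timedelta

# One operation: "/" (round down) or a signed amount, then a unit.
TOKEN = re.compile(r"(/|[+-]\d+)(YEAR|MONTH|DAY|HOUR)S?")


def round_down(dt, unit):
    """Truncate dt to the start of the given unit."""
    fields = {"hour": 0, "minute": 0, "second": 0, "microsecond": 0}
    if unit == "HOUR":
        del fields["hour"]
    elif unit == "MONTH":
        fields["day"] = 1
    elif unit == "YEAR":
        fields["month"], fields["day"] = 1, 1
    return dt.replace(**fields)


def add(dt, amount, unit):
    """Shift dt by a (possibly negative) number of units."""
    if unit == "DAY":
        return dt + timedelta(days=amount)
    if unit == "HOUR":
        return dt + timedelta(hours=amount)
    # YEAR and MONTH: shift the calendar month, clamping the day if needed.
    shift = amount * 12 if unit == "YEAR" else amount
    year, month_index = divmod(dt.year * 12 + (dt.month - 1) + shift, 12)
    last_day = calendar.monthrange(year, month_index + 1)[1]
    return dt.replace(year=year, month=month_index + 1, day=min(dt.day, last_day))


def evaluate(expr, now):
    """Apply the operations in expr to now, from left to right."""
    if not expr.startswith("NOW"):
        raise ValueError("this sketch only handles expressions starting with NOW")
    rest, pos, result = expr[3:], 0, now
    while pos < len(rest):
        m = TOKEN.match(rest, pos)
        if m is None:
            raise ValueError("cannot parse: " + rest[pos:])
        op, unit = m.group(1), m.group(2)
        result = round_down(result, unit) if op == "/" else add(result, int(op), unit)
        pos = m.end()
    return result


now = datetime(2014, 6, 15, 13, 45, 30, 123000)
print(evaluate("NOW-2YEAR", now))      # keeps the time of day
print(evaluate("NOW/DAY-2YEAR", now))  # midnight, two years earlier
```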

Note

This so-called DateMath syntax is not just for querying dates; it is for supplying dates to be indexed by Solr too! An index-time common usage is to timestamp added data. Using the NOW syntax as the default attribute of a timestamp field definition makes this easy. Here's how to do that: <field name="indexedAt" type="tdate" default="NOW/SECOND" />.

Score boosting

You can easily modify the degree to which a clause in the query string contributes to the ultimate relevancy score by adding a multiplier. This is called boosting. A value between 0 and 1 reduces the score, and numbers greater than 1 increase it. You'll learn more about scoring in the next chapter. In the following example, we search for artists that either have a member named Billy, or have a name containing the word Smashing:

a_member_name:Billy^2 OR Smashing

Here, we search for an artist name containing Billy, and optionally Bob or Corgan, but we're less interested in those that are also named Corgan:

+Billy Bob Corgan^0.7

Existence and nonexistence queries

This is actually not a new syntax case, but an application of range queries. Suppose you wanted to match all of the documents that have an indexed value in a field. Here, we find all of the documents that have something in a_name:

a_name:[* TO *]

As a_name is the default field, just [* TO *] will do.

This can be negated to find documents that do not have a value for a_name, as shown in the following code:

-a_name:[* TO *]

Note

Just a_name:* is usually equivalent, and similarly, -a_name:* for negation. This was an accidental feature that users discovered. However, for some non-text fields such as numbers and dates, it is much slower, as it uses a completely different code path that was designed for wildcard text matching, not the nature of the actual field type. Consequently, we recommend avoiding this syntax. See SOLR-1982.

Like wildcard and fuzzy queries, these are expensive, slowing down as the number of distinct terms in the field increases.

Tip

Performance tip

If you need to perform these frequently, consider adding this to your schema: <field name="field_name_ss" type="string" stored="false" multiValued="true" />. Then, at index time, add the name of fields that have a value to it. There's JavaScript code for this commented in the update-script.js file invoked by an UpdateRequestProcessor. The query would then simply be field_name_ss:a_name, which is as fast as it gets.
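If you would rather populate that field in the indexing client than in an update script, the logic might look like the following sketch. It is an alternative to the server-side script mentioned above, not a copy of it; the with_field_names helper and the rule for what counts as an empty value are assumptions.

```python
def with_field_names(doc, target="field_name_ss"):
    """Return a copy of doc with the names of its non-empty fields added."""
    present = sorted(
        name for name, value in doc.items()
        if name != target and value not in (None, "", [])
    )
    enriched = dict(doc)
    enriched[target] = present
    return enriched


doc = {"id": "Artist:11650", "a_name": "The Smashing Pumpkins", "a_alias": []}
print(with_field_names(doc)["field_name_ss"])
```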

Escaping special characters

The following characters are used by the query syntax as described in this chapter:

+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /

In order to use any of these without their syntactical meaning, you need to escape them with a preceding backslash (\), as seen here:

id:Artist\:11650

This also applies to the field name part. In some cases, such as this one, where the character is part of the text that is indexed, the double-quotes phrase query will also work, even though there is only one term:

id:"Artist:11650"

Note

If you're using SolrJ to interface with Solr, the ClientUtils.escapeQueryChars() method will do the escaping for you.

A common situation in which a query needs to be generated, and thus escaped properly, is when generating a simple filter query in response to choosing a field-value facet when faceting. This syntax and suggested situation is getting ahead of us, but I'll show it anyway since it relates to escaping. The query uses the term query parser as {!term f=a_type}group. What follows } is not escaped at all; even a \ is interpreted literally, and so with this trick, you needn't worry about escaping rules.
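If you aren't using SolrJ, a small escaping helper is easy to write. The following sketch escapes the characters listed at the start of this section plus the backslash itself. It is loosely modeled on what an escaping utility needs to do and is not a port of ClientUtils.escapeQueryChars(), so treat the exact character set as an assumption.

```python
# Characters with syntactic meaning in the query syntax, plus backslash.
SPECIAL_CHARS = set('\\+-&|!(){}[]^"~*?:/')


def escape_query_chars(text):
    """Prefix every special character in text with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)


print("id:" + escape_query_chars("Artist:11650"))
print("a_name:" + escape_query_chars("AC/DC"))
```

Escape only the user-supplied value, not the whole clause; otherwise the colon after the field name would be escaped too.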
