As you know, Solr is built using the Apache Lucene library. Because of this, some of the query parsers available in Solr let us fully leverage the Lucene query language, giving us fine control over how our queries work and which documents they match. In this recipe, we will look at an example usage of the Lucene query language: a book search site that lets its users define complex Boolean queries that contain phrases.
Let's perform the following steps to achieve this:
Start with the following index structure, defined in the schema.xml file:
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="title" type="text_general" indexed="true" stored="true" />
<field name="description" type="text_general" indexed="true" stored="true" />
<field name="published" type="int" indexed="true" stored="true" />
Next, index the following example data:
<add>
  <doc>
    <field name="id">1</field>
    <field name="title">Solr 4.0 cookbook</field>
    <field name="description">The book is totally focused on the 4.0 version of Apache Solr enterprise search server. The content is divided into ten thematic chapters, just like with the previous version of the book</field>
    <field name="published">2012</field>
  </doc>
  <doc>
    <field name="id">2</field>
    <field name="title">Solr 3.1 cookbook</field>
    <field name="description">The book is focused on the 3.1 version of Solr. The content is divided into ten chapters, each of which consists of a few to several recipes.</field>
    <field name="published">2011</field>
  </doc>
</add>
Suppose a user of our site wants to find all the books that have the term solr in their title and the term book in their description. In addition, the user wants to see only the books published between 2011 and 2013, including 2011 but not 2013. However, this is not all. The user also doesn't want books that have the term 3.1 or the phrase "book is focused" in their description. The query seems a bit complicated, but Solr can easily handle it. A request that handles all the requirements looks as follows:
http://localhost:8983/solr/cookbook/select?q=title:solr AND description:book AND published:[2011+TO+2013} NOT (description:3.1 OR description:"book is focused")
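The request above is printed with raw spaces, quotes, and brackets for readability, but most HTTP clients expect those characters to be percent-encoded. Here is a minimal Python sketch of that encoding. It assumes the same local Solr URL and core name as the recipe; the variable names are illustrative only.

```python
from urllib.parse import quote_plus, unquote_plus

# Assumption: the same host, port and core as in the recipe's example URL.
BASE_URL = "http://localhost:8983/solr/cookbook/select"

# The query exactly as it appears in the recipe, written with plain spaces.
QUERY = ('title:solr AND description:book AND published:[2011 TO 2013} '
         'NOT (description:3.1 OR description:"book is focused")')

# quote_plus turns spaces into '+' and percent-encodes characters such as
# ':', '[', '}', '(', ')' and '"'.
encoded_query = quote_plus(QUERY)
request_url = BASE_URL + "?q=" + encoded_query

if __name__ == "__main__":
    print(request_url)
```

Decoding `encoded_query` with `unquote_plus` gives back the original query string, so the encoding changes only how the query travels over HTTP, not what Solr parses.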
The response from Solr looks as follows:
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">64</int>
    <lst name="params">
      <str name="q">title:solr AND description:book AND published:[2011 TO 2013} NOT (description:3.1 OR description:"book is focused")</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="id">1</str>
      <str name="title">Solr 4.0 cookbook</str>
      <str name="description">The book is totally focused on the 4.0 version of Apache Solr enterprise search server. The content is divided into ten thematic chapters, just like with the previous version of the book</str>
      <int name="published">2012</int>
      <long name="_version_">1471367306831462400</long>
    </doc>
  </result>
</response>
As we can see, Solr returned the document we were after, so now let's see how it works.
Our index is very simple; it contains four fields: the document identifier (the id field), the book title (the title field), the book description (the description field), and the publication year (the published field).
By default, Solr uses the standard query parser, which supports the full Lucene query language. This means that we can search in a particular field, use phrase and range queries, use Boolean operators, and so on.
Our particular query says that we want the documents that have solr in the title field (title:solr), book in the description field (description:book), and a publication year between 2011 and 2013, including 2011 but not 2013 (published:[2011 TO 2013}).
These three parts of the query are connected to each other with the Boolean operator AND. Both sides of the operator must match for the document to be considered a match; in our case, all three conditions must be met. The Lucene query language provides three Boolean operators: AND (requires both operands to match), OR (either operand can match), and NOT (the operand after the operator can't match). After the operands joined with the AND operator, we have the NOT operator, which means that the section in the parentheses, (description:3.1 OR description:"book is focused"), can't match for the document to be returned in the search results. This basically means that the description of the book can't contain the term 3.1 or the phrase "book is focused".
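To see why only the first book is returned, the Boolean logic of the query can be replayed by hand against the two example documents. The sketch below is plain Python, not Lucene. Its `tokens` helper is a rough, assumed stand-in for the text_general analysis (lowercasing plus simple splitting), and all function names are hypothetical.

```python
import re

# The two example documents indexed in this recipe.
DOCS = [
    {"id": "1", "title": "Solr 4.0 cookbook", "published": 2012,
     "description": "The book is totally focused on the 4.0 version of Apache "
                    "Solr enterprise search server. The content is divided into "
                    "ten thematic chapters, just like with the previous version "
                    "of the book"},
    {"id": "2", "title": "Solr 3.1 cookbook", "published": 2011,
     "description": "The book is focused on the 3.1 version of Solr. The content "
                    "is divided into ten chapters, each of which consists of a "
                    "few to several recipes."},
]


def tokens(text):
    """Assumed simplification of analysis: lowercase, keep letters, digits, inner dots."""
    raw = re.findall(r"[a-z0-9.]+", text.lower())
    return [t.strip(".") for t in raw if t.strip(".")]


def has_term(text, term):
    return term.lower() in tokens(text)


def has_phrase(text, phrase):
    """True when the phrase's words appear next to each other, in order."""
    words, toks = tokens(phrase), tokens(text)
    n = len(words)
    return any(toks[i:i + n] == words for i in range(len(toks) - n + 1))


def matches(doc):
    # The three clauses joined with AND: all of them must hold.
    required = (has_term(doc["title"], "solr")
                and has_term(doc["description"], "book")
                and 2011 <= doc["published"] < 2013)
    # The NOT (... OR ...) clause: either condition excludes the document.
    excluded = (has_term(doc["description"], "3.1")
                or has_phrase(doc["description"], "book is focused"))
    return required and not excluded


matching_ids = [doc["id"] for doc in DOCS if matches(doc)]

if __name__ == "__main__":
    print(matching_ids)
```

The second book satisfies all three AND clauses, but its description contains both the term 3.1 and the phrase "book is focused", so the NOT clause removes it. The first book's description reads "book is totally focused", which contains every word of the phrase but not as adjacent tokens, so it is not excluded.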
Of course, the logical operators are not everything that is present in the query. We have the sections specifying that we want a particular value in a field, for example, title:solr. We have the range query run against the published field, which says that we want all documents with a value from 2011 (inclusive) to 2013 (exclusive) in this field. One thing to remember when it comes to range queries is that using [ or ] makes the corresponding bound inclusive, while using { or } makes it exclusive. Finally, we have the phrase query, which we construct by surrounding the phrase with " characters.
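The bracket rules can be made concrete with a small sketch that turns a range expression into a membership test. This is an illustration of the inclusive and exclusive semantics only, written in plain Python. `parse_int_range` is a hypothetical helper, not part of Solr or Lucene, and it handles only integer bounds (no `*` wildcards).

```python
import re

# Opening bracket, lower bound, the literal TO, upper bound, closing bracket.
RANGE_RE = re.compile(r"^([\[{])\s*(-?\d+)\s+TO\s+(-?\d+)\s*([\]}])$")


def parse_int_range(expr):
    """Return a function that tells whether a value falls inside the range."""
    match = RANGE_RE.match(expr.strip())
    if match is None:
        raise ValueError("not a simple integer range: %r" % expr)
    open_bracket, low, high, close_bracket = match.groups()
    low, high = int(low), int(high)

    def contains(value):
        # '[' and ']' include the bound itself; '{' and '}' exclude it.
        above = value >= low if open_bracket == "[" else value > low
        below = value <= high if close_bracket == "]" else value < high
        return above and below

    return contains


in_recipe_range = parse_int_range("[2011 TO 2013}")

if __name__ == "__main__":
    for year in (2010, 2011, 2012, 2013):
        print(year, in_recipe_range(year))
```

With the recipe's `[2011 TO 2013}` range, 2011 and 2012 fall inside while 2010 and 2013 do not. Swapping the brackets to `{2011 TO 2013]` would instead exclude 2011 and include 2013.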
Note that the query shown in the example is suboptimal. For performance reasons, some parts of it (such as the section about the publication year) should be moved to filter queries, which Solr caches separately from the main query and which don't affect scoring. However, for the purpose of demonstrating the Lucene query language, I decided to go for the simplest example. The more optimal version of the query looks as follows:
http://localhost:8983/solr/cookbook/select?q=title:solr AND description:book NOT (description:3.1 OR description:"book is focused")&fq=published:[2011+TO+2013}
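A client can keep the main query and the filter query as separate parameters and let a library assemble the request. The following is a minimal Python sketch, assuming the same local Solr URL and core as above; the variable names are illustrative.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumption: the same host, port and core as in the recipe's example URL.
BASE_URL = "http://localhost:8983/solr/cookbook/select"

# q carries the scored part of the search; fq carries the publication-year filter.
params = [
    ("q", 'title:solr AND description:book '
          'NOT (description:3.1 OR description:"book is focused")'),
    ("fq", "published:[2011 TO 2013}"),
]

# urlencode percent-encodes each value and joins the pairs with '&'.
request_url = BASE_URL + "?" + urlencode(params)

# Decoding the query string again shows the two parameters stay separate.
decoded = parse_qs(urlsplit(request_url).query)

if __name__ == "__main__":
    print(request_url)
    print(decoded)
```

Using a list of pairs rather than a dictionary leaves room for more than one `fq` parameter in the same request, which Solr accepts.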
This recipe doesn't cover the whole Lucene query language (for example, the -, +, and ! operators). If you want to know more, look at the Javadoc of the classic Lucene query parser, available at http://lucene.apache.org/core/4_10_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html.