Understanding and using the Lucene query language

As you know, Solr is built on the Apache Lucene library. Because of this, some of the query parsers available in Solr let us fully leverage the Lucene query language, giving us great flexibility in controlling how our queries work and which documents they match. In this recipe, we will look at an example of the Lucene query language in use on a book search site that lets its users define complex Boolean queries that contain phrases.

How to do it...

Let's perform the following steps to achieve this:

  1. The first step is to prepare our index to handle data. To do this, we add the following entries to the schema.xml file:
    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="title" type="text_general" indexed="true" stored="true" />
    <field name="description" type="text_general" indexed="true" stored="true" />
    <field name="published" type="int" indexed="true" stored="true" />
  2. Now, let's index some sample book data. The data that we want to index looks as follows:
    <add>
     <doc>
      <field name="id">1</field>
      <field name="title">Solr 4.0 cookbook</field>
      <field name="description">The book is totally focused on the 4.0 version of Apache Solr enterprise search server. The content is divided into ten thematic chapters, just like with the previous version of the book</field>
      <field name="published">2012</field>
     </doc>
     <doc>
      <field name="id">2</field>
      <field name="title">Solr 3.1 cookbook</field>
      <field name="description">The book is focused on the 3.1 version of Solr. The content is divided into ten chapters, each of which consists of a few to several recipes.</field>
      <field name="published">2011</field>
     </doc>
    </add>
  3. Let's assume that our user wants to find the books that have the term solr in their title and the term book in their description. In addition, the user wants to see only the books published between 2011 and 2013 (including 2011, but excluding 2013). However, this is not all. The user also doesn't want books that have the term 3.1 or the phrase book is focused in their description. The query seems a bit complicated, but Solr can easily handle it. A request that handles all the requirements looks as follows:
    http://localhost:8983/solr/cookbook/select?q=title:solr AND description:book AND published:[2011+TO+2013} NOT (description:3.1 OR description:"book is focused")
  4. The results returned by Solr are as follows:
    <?xml version="1.0" encoding="UTF-8"?>
    <response>
     <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">64</int>
      <lst name="params">
       <str name="q">title:solr AND description:book AND published:[2011 TO 2013} NOT (description:3.1 OR description:"book is focused")</str>
      </lst>
     </lst>
     <result name="response" numFound="1" start="0">
      <doc>
       <str name="id">1</str>
       <str name="title">Solr 4.0 cookbook</str>
       <str name="description">The book is totally focused on the 4.0 version of Apache Solr enterprise search server. The content is divided into ten thematic chapters, just like with the previous version of the book</str>
       <int name="published">2012</int>
       <long name="_version_">1471367306831462400</long></doc>
     </result>
    </response>

As we can see, Solr returned the document we were after, so now let's see how it works.
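
The request in step 3 is written unencoded so that it is easy to read. When the same request is sent from code, the spaces, quotes, brackets, and braces in the query usually need to be percent-encoded. The following is a minimal sketch in Python that uses only the standard library. It assumes the host, port, and cookbook core from the recipe; the build_select_url helper is a hypothetical name used for illustration.

```python
from urllib.parse import urlencode

# Assumed from the recipe: a local Solr instance with a core named "cookbook".
SOLR_SELECT = "http://localhost:8983/solr/cookbook/select"

def build_select_url(query: str) -> str:
    """Return a select URL with the Lucene query percent-encoded."""
    # urlencode escapes spaces, quotes, brackets, and braces in the value.
    return SOLR_SELECT + "?" + urlencode({"q": query})

query = (
    'title:solr AND description:book AND published:[2011 TO 2013} '
    'NOT (description:3.1 OR description:"book is focused")'
)
url = build_select_url(query)
print(url)
```

Spaces are sent as +, which Solr decodes back to spaces, so the q parameter echoed in the response header should read the same as the original query string.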

How it works...

Our index is very simple; it contains four fields:

  • The first field is the one responsible for the unique identifier of the book (the id field)
  • The second field is the title of the book (the title field)
  • The third field is the description of the book (the description field)
  • The fourth field holds the publication year (the published field)

By default, Solr uses the standard query parser that supports the full Lucene query language. This means that we can search in a particular field, use phrase and range queries, use Boolean operators, and so on.

Our particular query says that we want the documents that have solr in the title field (title:solr), book in the description field (description:book), and a publication year between 2011 and 2013, including 2011 but excluding 2013 (published:[2011 TO 2013}). These three parts of the query are connected to each other with the Boolean operator AND. Both sides of the operator must match for the document to be considered a match; in our case, all three conditions must be met. The Lucene query language provides three Boolean operators: AND (both operands must match), OR (either operand can match), and NOT (the operand after the operator must not match). After the parts joined with the AND operator, we have the NOT operator, which means that the section in parentheses, (description:3.1 OR description:"book is focused"), must not match for the document to be returned in the search results. This basically means that the description of the book can't contain the term 3.1 or the phrase book is focused.
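
To make the Boolean logic concrete, here is a toy model of the query in Python, applied to the two sample documents (descriptions shortened). This is only a sketch of the matching rules, not how Lucene evaluates queries internally. It assumes that analysis lowercases the text and keeps numbers such as 3.1 as single terms; all the helper names are hypothetical.

```python
import re

# The two sample documents from the recipe (descriptions shortened).
docs = [
    {"id": "1", "title": "Solr 4.0 cookbook",
     "description": "The book is totally focused on the 4.0 version of Apache Solr",
     "published": 2012},
    {"id": "2", "title": "Solr 3.1 cookbook",
     "description": "The book is focused on the 3.1 version of Solr",
     "published": 2011},
]

def tokens(text):
    """Lowercase and split into terms, keeping numbers such as 3.1 whole."""
    return re.findall(r"[a-z0-9]+(?:\.[0-9]+)*", text.lower())

def has_term(text, term):
    return term in tokens(text)

def has_phrase(text, phrase):
    """True if the phrase terms occur consecutively and in order."""
    words, target = tokens(text), tokens(phrase)
    return any(words[i:i + len(target)] == target
               for i in range(len(words) - len(target) + 1))

def matches(doc):
    # The three clauses joined with AND must all hold.
    required = (has_term(doc["title"], "solr")
                and has_term(doc["description"], "book")
                and 2011 <= doc["published"] < 2013)  # [2011 TO 2013}
    # The NOT clause: neither alternative inside the parentheses may match.
    excluded = (has_term(doc["description"], "3.1")
                or has_phrase(doc["description"], "book is focused"))
    return required and not excluded

print([d["id"] for d in docs if matches(d)])  # prints ['1']
```

The second document satisfies all three required clauses but is dropped by the NOT clause, because its description contains both the term 3.1 and the phrase book is focused.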

Of course, the logical operators are not everything that is present in the query. We have the sections specifying that we want a particular value in a field, for example, title:solr. We have the range query run against the published field, which says that we want all documents with a value from 2011 (inclusive) up to 2013 (exclusive) in this field. One thing to remember when it comes to the range query is that using [ or ] means that the corresponding bound is inclusive, while using { or } means that the bound is exclusive. Finally, we have the phrase query, which we construct by surrounding the phrase with " characters.
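
The bracket rules can be illustrated with a small Python sketch that turns a range expression into a test function. It handles only integer bounds in the lower TO upper form and ignores other forms that the real parser accepts; range_predicate is a hypothetical helper, not part of Solr or Lucene.

```python
import re

# Matches integer range expressions such as "[2011 TO 2013}".
RANGE_RE = re.compile(r"^([\[{])\s*(\d+)\s+TO\s+(\d+)\s*([\]}])$")

def range_predicate(expr):
    """Turn an integer range expression into a function that tests a value."""
    m = RANGE_RE.match(expr)
    if m is None:
        raise ValueError(f"not a range expression: {expr!r}")
    open_br, close_br = m.group(1), m.group(4)
    low, high = int(m.group(2)), int(m.group(3))

    def test(value):
        # "[" and "]" include the bound; "{" and "}" exclude it.
        above = value >= low if open_br == "[" else value > low
        below = value <= high if close_br == "]" else value < high
        return above and below
    return test

in_range = range_predicate("[2011 TO 2013}")
for year in (2010, 2011, 2012, 2013):
    print(year, in_range(year))
```

For the range used in the recipe, 2011 and 2012 pass the test, while 2010 and 2013 do not.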

Note that the query shown in the example is suboptimal. Some parts of it, such as the clause about the publication year, should be moved to filter queries for performance reasons. However, for the purpose of demonstrating the Lucene query language, I decided to go for the simplest example. The more optimal version of the query looks as follows:

http://localhost:8983/solr/cookbook/select?q=title:solr AND description:book NOT (description:3.1 OR description:"book is focused")&fq=published:[2011+TO+2013}
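
A request that separates the main query from its filters can be assembled in code along the following lines. This is a sketch in Python using only the standard library; it assumes the host, port, and cookbook core from the recipe, and build_filtered_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumed from the recipe: a local Solr instance with a core named "cookbook".
SOLR_SELECT = "http://localhost:8983/solr/cookbook/select"

def build_filtered_url(query, filters):
    """Build a select URL with one q parameter and one fq per filter."""
    # A list of pairs lets the same parameter name (fq) appear several times.
    params = [("q", query)] + [("fq", f) for f in filters]
    return SOLR_SELECT + "?" + urlencode(params)

url = build_filtered_url(
    'title:solr AND description:book '
    'NOT (description:3.1 OR description:"book is focused")',
    ["published:[2011 TO 2013}"],
)
print(url)
```

Keeping each filter in its own fq parameter lets Solr cache it separately from the main query, and filter queries do not influence the relevance score.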

See also
