Time for action – debugging a query with the Lucene parser

We will now debug a basic query using the standard Lucene parser, so that we can use it later as a reference when reasoning about how other parsers behave.

Suppose we are using the default Lucene parser to search for a painting whose title we cannot remember: we recall that it has something to do with Dalí or Vermeer, and that it contains a giraffe and one or more female figures. A simple way to search for it is to go to the search box on our site and type in something similar to giraffe dali "female figure" vermeer. Note how we use double quotes to pass two terms as a single complex term: this is what is generally called a phrase query.

  1. We can simulate this search as usual, either through the web interface or with cURL, using the following command:
    >> curl -X GET 'http://localhost:8983/solr/paintings/select?q=giraffe+dali+%22female+figure%22+vermeer&start=0&rows=10&fl=title+artist+city&wt=json&indent=true&debugQuery=true'
    
  2. Note that the %22 encoding stands for a double quote, and that we have not specified a field to search in. For simplicity, we have chosen values for the start, rows, fl, and wt parameters that give a more readable result, and we have added the debugQuery=true parameter to activate the debug search component. The latter lets us see how the query is parsed internally:
    [Screenshot: response for the query, including the debug section]

Here we can recognize a standard query in the parsedquery field.

What just happened?

I strongly suggest playing with the different query handlers and defining your own examples for them. You don't have to worry about using the "wrong" parameters here, as there are only configurations that can be improved.

The same request can also be written with cURL using the following syntax:

>> curl -G 'http://localhost:8983/solr/paintings/select?start=0&rows=10&fl=title+artist+city&wt=json&indent=true&json.nl=map&debugQuery=true' --data-urlencode 'q=giraffe dali "female figure" vermeer'

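This form makes the query parameters easier to read: the -G option tells cURL to append the data to the URL as a GET query string, and --data-urlencode takes care of encoding the q value for us.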
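To check that the two cURL commands really send the same parameters, here is a minimal Python sketch that decodes both query strings with the standard library. The helper name decoded_params is hypothetical and only used for illustration; the URLs are copied from the commands above, and the q value of the second command is added by hand because cURL encodes it at run time.

```python
from urllib.parse import parse_qsl, urlsplit

# URL from the first cURL command (q is already percent-encoded).
FIRST_URL = (
    "http://localhost:8983/solr/paintings/select"
    "?q=giraffe+dali+%22female+figure%22+vermeer"
    "&start=0&rows=10&fl=title+artist+city&wt=json&indent=true&debugQuery=true"
)

# URL from the second cURL command; q is passed separately via --data-urlencode.
SECOND_URL = (
    "http://localhost:8983/solr/paintings/select"
    "?start=0&rows=10&fl=title+artist+city&wt=json&indent=true"
    "&json.nl=map&debugQuery=true"
)
SECOND_Q = 'giraffe dali "female figure" vermeer'


def decoded_params(url):
    """Return the decoded query parameters of a URL as a dict (hypothetical helper)."""
    return dict(parse_qsl(urlsplit(url).query))


first = decoded_params(FIRST_URL)
second = decoded_params(SECOND_URL)
second["q"] = SECOND_Q  # what --data-urlencode contributes, before encoding

# Parameters of the second request that differ from the first one.
extra = {name: value for name, value in second.items() if first.get(name) != value}

print(first["q"])
print(extra)
```

Under these assumptions, the decoded q value of the first URL is the readable string used in the second command, and the only difference between the two requests is the additional json.nl=map parameter.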

In our example we have found 91 matching documents, and the first is the one that most probably fits our search (in the preceding screenshot I show only that document, to improve readability). Reading patiently through all the results, we can find other documents where the term vermeer appears in the artist field, but their score is probably lower than that of the first result. This is because, when the conditions are combined with an implicit OR, documents that satisfy more of them generally score higher, so a document matching all the conditions ranks above one that matches only some. In this case, a painting by Dalí with the name Vermeer in its title is probably the best match.

Tip

I suggest you do this kind of evaluation as an exercise, using the same parameters in the web interface, then verifying the values while adding and removing documents. A very interesting point to notice is that the Lucene score is not a percentage: it is designed for an open collection, where the relative score of an item can change as documents are added or removed. We will look again at the Lucene score later.

If you prefer, we can look at the query parameters individually:

q=giraffe dali "female figure" vermeer
start=0, rows=10
fl=title artist city
wt=json, indent=true
debugQuery=true
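The parameters listed above can also be assembled programmatically. The following is a minimal sketch using Python's standard urllib.parse.urlencode; the variable names are illustrative, and the host, port, and core name are simply the ones used in the cURL commands of this section.

```python
from urllib.parse import urlencode

# Select handler of the paintings core, as used in the cURL examples.
BASE_URL = "http://localhost:8983/solr/paintings/select"

# The individual parameters, written in readable (unencoded) form.
params = {
    "q": 'giraffe dali "female figure" vermeer',
    "start": 0,
    "rows": 10,
    "fl": "title artist city",
    "wt": "json",
    "indent": "true",
    "debugQuery": "true",
}

# urlencode turns spaces into '+' and double quotes into '%22'.
query_string = urlencode(params)
url = f"{BASE_URL}?{query_string}"

print(url)
```

The resulting URL is the same one passed to the first cURL command, which shows that the encoded form is nothing more than these readable parameters joined together.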

You must have noticed the debug section, where the parsedquery and parsedquery_toString outputs give us an idea of how the query is actually parsed by the internal components. It's important to see what is happening: we obtain a series of field searches (in the default fullText field) for both the single terms and the phrase, combined into a single Boolean query using the implicit default inclusive "or" condition. This is interesting because it suggests the possibility of decomposing a single query into several independent parts that are then recombined to produce the final one.
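The idea of decomposing a query into independent clauses and recombining them can be sketched in a few lines of Python. This is only an illustration of the concept, not the way the Lucene parser is implemented: the decompose function is hypothetical, it uses shlex.split to keep quoted phrases together, and it ignores text analysis (lowercasing, stemming, and so on) that a real field type would apply.

```python
import shlex


def decompose(query, default_field="fullText"):
    """Split a query into per-field clauses, keeping quoted phrases whole.

    Hypothetical helper for illustration; not part of Solr or Lucene.
    """
    clauses = []
    for token in shlex.split(query):
        if " " in token:
            # A quoted group of words becomes a phrase clause.
            clauses.append(f'{default_field}:"{token}"')
        else:
            # A single word becomes a term clause.
            clauses.append(f"{default_field}:{token}")
    return clauses


clauses = decompose('giraffe dali "female figure" vermeer')

# Recombine the independent clauses with an inclusive OR.
combined = " OR ".join(clauses)

print(combined)
```

Each clause can be read as a small query of its own on the default field, which is the point of view that the next parser builds on.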

The Dismax query parser focuses specifically on this behavior and takes it a step further.
