Using filter queries for caching filters

A filter query (the fq parameter) can be used, as shown earlier, to dynamically include or exclude documents from the result set. It can also speed up document retrieval, because filters can be cached internally for later reuse. In fact, only the internal Lucene document identifiers are stored to perform the lookup, which keeps the memory footprint small.
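As a minimal sketch of what this looks like in practice (assuming the local Solr instance and the paintings core used throughout this chapter), the restrictive clauses simply move from the q parameter to one or more fq parameters; Solr can then cache the matching document-ID set of each fq independently and reuse it across requests:

```python
from urllib.parse import urlencode

# Hypothetical local Solr core, matching this chapter's examples.
SOLR_SELECT = "http://localhost:8983/solr/paintings/select"

def build_query(q, filters=(), rows=2):
    """Build a select URL; each fq clause is cached separately by Solr."""
    params = [("q", q)] + [("fq", f) for f in filters] + [
        ("fl", "title,score"),
        ("rows", str(rows)),
        ("wt", "json"),
    ]
    return SOLR_SELECT + "?" + urlencode(params)

url = build_query("title:vermeer", filters=["artist:dali"])
print(url)
```

Requesting this URL (with the example core running) returns the same single document as the pure Boolean query, but the artist:dali restriction is now served from the filter cache on repeated searches.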

Moreover, filter queries have a strong impact on score calculation: by introducing filters, we change the size of the collection over which the queries are run.

To see this in detail, we can use a simple Boolean query asking for documents that contain the term vermeer in the title field and the term dali in the artist field:

>> curl -X GET 'http://localhost:8983/solr/paintings/select?q=(title:vermeer%20AND%20artist:dali)&fl=title,score&start=0&rows=2&indent=true&wt=json&json.nl=map&debug=true'

The result combining these two conditions will contain only one document; we can then play a little with other parameter combinations. The idea is to see what happens to the score when we change how the common query (q) and the filter queries (fq) are combined to obtain the same result.

The following table shows score variations on different tests:

Parameters                                    Score Variations
q=(title:vermeer AND artist:dali)             score: 4.3593554
q=title:vermeer, fq=artist:dali               score: 3.369493
q=artist:dali, fq=title:vermeer               score: 2.7659535
q=*:*, fq=title:vermeer, fq=artist:dali       score: 1
q=*:*, fq=(title:vermeer AND artist:dali)     score: 1
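To reproduce the table yourself, the following sketch builds the request URL for each row (the URLs are only constructed here; running them requires the live Solr instance with the paintings core assumed in the examples above):

```python
from urllib.parse import urlencode

# The q/fq combinations from the table above; each produces the same
# single matching document but a different score.
combinations = [
    {"q": "(title:vermeer AND artist:dali)"},
    {"q": "title:vermeer", "fq": "artist:dali"},
    {"q": "artist:dali", "fq": "title:vermeer"},
    {"q": "*:*", "fq": ["title:vermeer", "artist:dali"]},
    {"q": "*:*", "fq": "(title:vermeer AND artist:dali)"},
]

for params in combinations:
    # doseq=True expands a list value into repeated fq parameters.
    query = urlencode({**params, "fl": "title,score", "wt": "json"}, doseq=True)
    print("http://localhost:8983/solr/paintings/select?" + query)
```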

Starting from the example query, it's possible to use the parameter combinations proposed in the table to obtain different scores. An important fact to note here is that scores are not absolute: they are relative to the collection of results being searched.

Tip

The concept of an absolute score does not make sense in this context because score values are not directly comparable across different searches. Scores could be made comparable if they were normalized at some point, but this should be avoided because documents can be updated, removed, or added to the index, thereby changing the scores.

Reading the table from the end, you will notice that in the last two examples we exclude all the documents that do not match the constraints using only filter queries. In this case, the score does not need to be higher than any other hypothetical document's score, as we know there is only one single document, so we obtain the default value of 1.

As we introduce less restrictive combinations, mixing filter queries with a standard query, the score increases: it now has to rank our document (which is always the same one) toward the top of a hypothetically wider collection of matching documents.

Indeed, in the first example, even though we obtain the same single document as a result, we are actually searching over all the documents (the entire index), and our document receives a greater score value.

To be clearer, if you are not satisfied with a simple numeric result and want to go a step further in understanding what is behind it, you can look at the debug details produced by the debug=true parameter; the full output is omitted here for simplicity and readability.

For example, the first combination (no filter queries) will be internally expanded as:

4.359356 = (MATCH) sum of:
  2.6043947 = (MATCH) weight(title:vermeer in 456) [DefaultSimilarity], result of:
    2.6043947 = score(doc=456,freq=1.0 = termFreq=1.0), product of:
      0.77293366 = queryWeight, product of:
        8.985314 = idf(docFreq=1, maxDocs=5875)
        0.086021885 = queryNorm
      3.369493 = fieldWeight in 456, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        8.985314 = idf(docFreq=1, maxDocs=5875)
        0.375 = fieldNorm(doc=456)
  1.754961 = (MATCH) weight(artist:dali in 456) [DefaultSimilarity], result of:
    1.754961 = score(doc=456,freq=1.0 = termFreq=1.0), product of:
      0.6344868 = queryWeight, product of:
        7.3758764 = idf(docFreq=9, maxDocs=5875)
        0.086021885 = queryNorm
      2.7659535 = fieldWeight in 456, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        7.3758764 = idf(docFreq=9, maxDocs=5875)
        0.375 = fieldNorm(doc=456)
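The nested products in that explain output can be verified by hand. The following sketch recomputes the final score from the leaf values reported above (idf, queryNorm, tf, and fieldNorm), following the DefaultSimilarity formula where each term contributes queryWeight multiplied by fieldWeight:

```python
# Leaf values taken from the explain output above (DefaultSimilarity).
QUERY_NORM = 0.086021885

def term_score(idf, tf=1.0, field_norm=0.375):
    """score = queryWeight * fieldWeight
             = (idf * queryNorm) * (tf * idf * fieldNorm)"""
    query_weight = idf * QUERY_NORM
    field_weight = tf * idf * field_norm
    return query_weight * field_weight

vermeer = term_score(idf=8.985314)   # the title:vermeer clause
dali = term_score(idf=7.3758764)     # the artist:dali clause
total = vermeer + dali

print(total)  # very close to the 4.3593554 reported in the table
```

Note how the fieldWeight values (3.369493 and 2.7659535) are exactly the scores from the second and third rows of the table: when one of the clauses becomes a filter query, it stops contributing to the score, leaving only the other term's weight.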

While the last one (a single Boolean filter query, with the common query matching all documents) will be expanded as:

1.0 = (MATCH) MatchAllDocsQuery, product of: 1.0 = queryNorm

I hope that seeing these results will convince you to pay attention when using filter queries. Lastly, don't forget that scores are really important to consider when thinking about document similarity, which we will introduce later in this chapter.
