Chapter 6. Search Relevancy

At this point, you've learned the basics of Solr. You've undoubtedly seen your results sorted by score in descending order, the default, but have no understanding as to where those scores came from. This chapter is all about search relevancy, which basically means it's about scoring; but it's also about other non-trivial methods of sorting to produce relevant results. A core Solr feature enabling these more advanced techniques, called function queries, will be introduced. The major topics covered in this chapter are as follows:

  • Factors influencing the score
  • Troubleshooting queries to include scoring
  • DisMax part 2—features that enhance relevancy
  • Function queries

    Tip

    In a hurry?

    Use the edismax query parser for user queries by setting the defType parameter. Configure the qf (query fields), as explained in the previous chapter, set pf (phrase fields) considering the call-out tip in this chapter, and set tie to 0.1. If at any point you need help troubleshooting a query (and you will), then return to read the Troubleshooting queries and scoring section of this chapter.
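
    For illustration only, a minimal sketch of how these parameters might be set as request handler defaults in solrconfig.xml follows; the field names a_name and a_alias are placeholders for your own searchable fields:

      <requestHandler name="/select" class="solr.SearchHandler">
        <lst name="defaults">
          <str name="defType">edismax</str>
          <str name="qf">a_name^2.0 a_alias</str>
          <str name="pf">a_name</str>
          <float name="tie">0.1</float>
        </lst>
      </requestHandler>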

Scoring

Scoring in Lucene is an advanced subject, but it is important to at least have a basic understanding of it. We will discuss the factors influencing Lucene's default scoring model and where to look for diagnostic scoring information. If this overview is insufficient for your interest, then you can get the full details at http://lucene.apache.org/core/4_8_1/core/org/apache/lucene/search/package-summary.html#scoring.

The important thing to understand about scores is not to attribute much meaning to a score by itself; it's almost meaningless. The relative value of an individual score to the max score is much more meaningful. A document scored as 0.25 might be a great match or not, there's no telling, while in another query a document scoring 0.80 may actually not be a great match. But if you compare a score to another from the very same search and find it to be twice as large, then it is fair to say that the query matched this document twice as well. The factors influencing the score are as follows:

  • Term frequency (tf): The more times a term is found in a document's field, the higher the score it gets. This concept is most intuitive. Obviously, it doesn't matter how many times the term may appear in some other field, it's the searched field that is relevant (whether explicit in the query or the default).
  • Inverse document frequency (idf): The rarer a term is in the entire index, the higher its score is. The document frequency is the number of documents in which the term appears for a given field. It is the inverse of the document frequency that is positively correlated with the score.
  • Co-ordination factor (coord): The greater the number of query clauses that match a document, the greater the score will be. Any mandatory clauses must match and the prohibited ones must not match, leaving the relevance of this piece of the score to situations where there are optional clauses.
  • Field length (fieldNorm): The shorter the matching field is, measured in number of indexed terms, the greater the matching document's score will be. For example, if there was a band named Smashing, and another named Smashing Pumpkins, then this factor in the scoring would be higher for the first band upon a search for just Smashing, as it has one word in the field while the other has two. Norms for a field can be marked as omitted in the schema with the omitNorms attribute, effectively neutralizing this component of the score and index-time boosts too.

    Note

    A score explain will show queryNorm. It's derived from the query itself and not the indexed data; it serves to help make scores more comparable for different queries, but not for different documents matching the same query.

These factors are the intrinsic components contributing to the score of a document in the results. If your application introduces other components to the score, then that is referred to as boosting. Usually, boosting is a simple multiplier to a field's score, either at index or query time, but it's not limited to that.
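
A simplified sketch of how Lucene's default similarity combines these factors and any query boosts follows; see the scoring documentation linked earlier for the exact definitions:

  score(q,d) = coord(q,d) * queryNorm(q) *
               sum, over each query term t matching d, of:
                 tf(t in d) * idf(t)^2 * boost(t) * norm(t,d)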

Alternative scoring models

The scoring factors that have just been described relate to Lucene's default scoring model. It's known as the Vector Space Model, often referred to simply as TF-IDF after its most prominent components. This venerable model is well known in the information retrieval community, and it was the only model Lucene supported from the beginning. Lucene/Solr 4 supports four more models, including a well-known one called BM25. BM25 has been the subject of many research papers, including those from recognized search experts at Google and Microsoft. It's often pitted against the Vector Space Model and portrayed as an improvement, provided its parameters are tuned appropriately.

In Lucene, a relevance model is implemented by a Similarity subclass, and Solr provides a SimilarityFactory for each one. Naturally, each has its own unique tuning parameters. In order to use BM25, simply add the following to your schema.xml:

   <similarity class="solr.BM25SimilarityFactory">
     <float name="k1">1.2</float>
     <float name="b">0.75</float>
   </similarity>

Don't forget to re-index. The preceding excerpt will have a global effect on relevancy for the schema. It's also possible to choose a different similarity per field. Beware that doing so is problematic if you are using TF-IDF at all: TF-IDF scores will differ depending on whether the similarity is configured at the field level or as the global default, due to how the query norm and coordination factor are applied.
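
As a sketch of per-field configuration, the global similarity is set to solr.SchemaSimilarityFactory so that a similarity declared inside a field type takes effect; text_bm25 here is a hypothetical field type:

   <similarity class="solr.SchemaSimilarityFactory"/>

   <fieldType name="text_bm25" class="solr.TextField" positionIncrementGap="100">
     <analyzer>
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
     <similarity class="solr.BM25SimilarityFactory">
       <float name="k1">1.2</float>
       <float name="b">0.75</float>
     </similarity>
   </fieldType>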

For more information on configuring scoring models (similarities) in Solr, see the wiki at http://wiki.apache.org/solr/SchemaXml#Similarity and Solr's Javadoc API for the factories. But for real guidance on the specific models, you'll have to start Googling.

Tip

The Vector Space Model is a good default

Do not choose a relevancy model and its tuning parameters based solely on a small sampling of anecdotal searches; choose them after a real evaluation, such as from A/B testing.

Query-time and index-time boosting

With index-time boosting, you have the option to boost a particular document, specified either at the document level or on a specific field. Chapter 4, Indexing Data, shows the syntax; it's very simple to use in the XML and JSON formats. A document-level boost is equivalent to boosting each field by that value. The boost is internally stored as part of the norms number, so norms must not be omitted on the relevant fields in the schema. It's uncommon to perform index-time boosting because it is not as flexible as query-time boosting. That said, index-time boosting tends to have a more predictable and controllable influence on the final score, and it's faster.
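
As a reminder of the syntax covered in Chapter 4, an index-time boost in the XML update format looks roughly like the following sketch; the boost values are arbitrary and the field name follows the artists index used in this chapter:

   <add>
     <doc boost="2.0">
       <field name="a_name" boost="1.5">Smashing Pumpkins</field>
     </doc>
   </add>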

For query-time boosting, we described in the previous chapter how to explicitly boost a particular clause of a query higher or lower, if needed, using the trailing ^ syntax. We also showed how the DisMax query parser's qf parameter not only lists the fields to search but also allows a boost for each of them. There are a few more ways DisMax can boost queries, which you'll read about shortly.
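
As a quick illustration with placeholder field names, a clause can be boosted directly with the lucene query parser, or the same effect can be achieved through edismax's qf; remember to URL-encode spaces when issuing these as actual URLs:

   q=a_name:smashing^2 a_alias:smashing

   q=smashing&defType=edismax&qf=a_name^2.0 a_alias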

Troubleshooting queries and scoring

An invaluable tool in diagnosing scoring behavior (or why a document isn't in the results, or is but shouldn't be) is enabling query debugging with the debugQuery query parameter. There is no better way to describe it than with an example. Consider the following fuzzy query on the artists' index:

a_name:Smashing~

We would intuitively expect that documents with fields containing Smashing would get the top scores, but that didn't happen. Execute the preceding query with debugQuery=on.
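
Assuming the artists index lives in a core named mbartists on a local Solr instance, the full request might look like this; fl=*,score ensures the score is returned alongside each document:

http://localhost:8983/solr/mbartists/select?q=a_name:Smashing~&fl=*,score&debugQuery=on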

Tip

Depending on the response format and how you're interacting with Solr, you might observe that this information isn't indented. If you see that, switch to another response format. Try Ruby with wt=ruby.

In the following results, the fourth document has Smashing as part of its name but the top three don't:

  <doc>
    <float name="score">3.999755</float>
    <str name="a_name">Smashin'</str>
  </doc>
  <doc>
    <float name="score">3.333129</float>
    <str name="a_name">Mashina</str>
  </doc>
  <doc>
    <float name="score">2.551927</float>
    <str name="a_name">Slashing Funkids</str>
  </doc>
  <doc>
    <float name="score">2.5257545</float>
    <str name="a_name">Smashing Atoms</str>
  </doc>

The first and third documents have words that differ from smashing by only one character, the second by two. What's going on here? Let's look at the following debug output, showing just the second and fourth docs for illustrative purposes:

<lst name="explain">
  <str name="Artist:227132">
    3.333129 = (MATCH) sum of:
      3.333129 = (MATCH) weight(a_name:mashina^0.71428573 in 166352) [DefaultSimilarity], result of:
        3.333129 = score(doc=166352,freq=1.0 = termFreq=1.0 ), product of:
          0.2524328 = queryWeight, product of:
            0.71428573 = boost
            13.204025 = idf(docFreq=1, numDocs=399182)
            0.026765013 = queryNorm
          13.204025 = (MATCH) fieldWeight(a_name:mashina in 286945),
          product of:
            1.0 = tf(termFreq(a_name:mashina)=1)
            13.204025 = idf(docFreq=1, numDocs=399182)
            1.0 = fieldNorm(field=a_name, doc=286945)
  </str>
<!--... skip ...-->
  <str name="Artist:93855">
    2.5257545 = (MATCH) sum of:
      2.5257545 = (MATCH) weight(a_name:smashing in 9796) [DefaultSimilarity], result of:
        2.5257545 = score(doc=9796,freq=1.0 = termFreq=1.0 ), product of:
          0.32888138 = queryWeight, product of:
            12.287735 = idf(docFreq=4, numDocs=399182)
            0.026765013 = queryNorm
          7.6798344 = fieldWeight in 9796, product of:
            1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
            12.287735 = idf(docFreq=4, maxDocs=399182)
            0.625 = fieldNorm(doc=9796)
  </str>

What we see here is the mathematical breakdown of the various components of the score. We see that mashina (the term actually in the index) was given a query-time boost of roughly 0.71 (under-boosted), whereas smashing wasn't. We expected this because fuzzy matching gives higher weights to stronger matches, and indeed it did here. However, other factors pulled the final score in the other direction. Notice that the fieldNorm for Mashina is 1.0, whereas Smashing Atoms has a fieldNorm of 0.625. This is because the document we wanted to score higher has a field with two indexed terms versus the one term that Mashina has. Another factor is that the idf for mashina, 13.2, is higher than that for smashing, 12.3. Upper/lower case plays no role. So, arguably, Mashina is a closer match than Smashing Atoms to the fuzzy query Smashing~.

How might we fix this? Well, it's not broken, and the number four spot in the search results isn't bad, so this result is arguably in no need of fixing. Fuzzy queries are also fairly unusual and arguably aren't a circumstance to optimize for. Still, for the fuzzy query case seen here, you could use DisMax's bq parameter (to be described very soon) and give it a non-fuzzy version of the user's query. That would have the effect of boosting exact matches more strongly. Another idea is to enable omitNorms on a_name in the schema; however, that might reduce scoring effectiveness for other queries.
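
As a sketch of the bq idea, the request might pass the user's fuzzy input as q while adding a non-fuzzy boost query on the same field; the parameter is described fully later in this chapter:

q=Smashing~&defType=edismax&qf=a_name&bq=a_name:Smashing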

Tools – Splainer and Quepid

If you find the explain output hard to wrap your head around, you might want to use an open source tool called Splainer that dresses up Solr's raw output in a way that is easier to understand. Splainer is a browser-side web application, and as such, you can try it online against your local Solr instance without having to install or configure anything. Try it at http://splainer.io. Be sure to view the tour, which will show more of what it has to offer.

If you want to take search relevancy seriously, then you're going to invest significant time into it. Solr exposes a lot of power and reasonable defaults, but each application is different and it's all too easy to make a change that has a net negative effect across the searches users make. You'll need to do things such as keep track of a set of sample queries and their results, and monitor it over time. You could do this manually with a hodge-podge of spreadsheets and scripts, but a tool such as Quepid can help a ton. The most important thing Quepid does is assist you in curating a set of important queries with their search results that have human-entered quality judgments against them. As you tweak Solr's relevancy knobs, you can see the effect. Quepid is available at https://quepid.com.
