Chapter 3. Debugging your first relevance problem

This chapter covers

  • The basics of extracting, indexing, and searching content in Elasticsearch
  • Troubleshooting searches that don’t return expected results
  • Debugging the construction of the inverted index
  • Troubleshooting relevance bugs
  • Solving your first relevance issue

The previous chapter laid out a rather ideal blueprint for Lucene-based search. In this chapter, the search engine has broken down! You’ll see what it takes to debug a real, live search engine. What tools are available to gain visibility into the behavior of search engine internals? Why do certain documents match the query, whereas other more relevant documents don’t? Why do seemingly irrelevant documents outrank relevant ones?

This chapter introduces you to a beginner’s problem. Although the solutions are straightforward, in order to solve them you’ll need to master relevance debugging. You’ll use these techniques to solve every relevance problem you face. Just as in math, showing your work can be the most important step.

You’ll begin to use our search engine, Elasticsearch, to search over a real data set. As you encounter the common beginner’s problem, your focus will be on debugging two primary internal layers key to relevance: matching and ranking. Armed with renewed insights from the debugging capabilities of the search engine, you can begin to use the search engine to rank and match based on features that you know best describe your content.

Throughout this chapter, you’ll experience a day in the life of a relevance engineer fighting fires (as shown in figure 3.1). You’ll troubleshoot why queries don’t match and rank documents as your users expect, or why odd documents seem to be considered more relevant than others.

Figure 3.1. As a relevance engineer, you have several tools available for debugging relevance problems.

Before you start exploring what’s possible, let’s introduce the major building blocks: the search engine, the data set, and the programming environment you’ll be using to work through relevance examples.

3.1. Applications to Solr and Elasticsearch: examples in Elasticsearch

The preceding chapter introduced the components of a Lucene-based search engine. Which one should we use for examples? Solr? Elasticsearch? Both?

In an effort to “go deep,” we chose to develop our examples with just one search engine: Elasticsearch. Covering both search engines in equal depth would get you lost in the weeds as we endlessly compare identical but superficially different configuration details. This is rather uninteresting (shall we say, “irrelevant”) information you can easily find with Google.

Luckily, despite superficial differences, Solr and Elasticsearch are very close in functionality. The information you’ll learn in this book applies to either. Use this book as you’d use an algorithms book that happens to use C for its examples. We happen to use Elasticsearch. You can easily implement algorithms in any programming language. You can implement the relevance strategies here in either search engine.

If you’re a Solr developer, fear not: our basic use of Elasticsearch APIs should feel familiar. We’ll give you just enough of an explanation about what’s happening with Elasticsearch that even a smidgen of familiarity with either search engine should help you feel at home. Further, in this chapter, as we lay the foundation for several basics, we sprinkle hints for the Solr reader. We also provide appendix B to help map features between the two search engines.

It’s also important to note that this book isn’t about Elasticsearch. We focus on features related to relevance, completely ignoring other features and concerns when using a search engine: analytics, ingesting your data, scaling, and performance. If you have absolutely no familiarity with Solr or Elasticsearch, there are excellent books and tutorials on both that we encourage you to read before diving into this book.

3.2. Our most prominent data set: TMDB

For much of this book, we use The Movie Database (TMDB) as our data set. TMDB is a popular online movie and TV-show database. We’re grateful to TMDB for giving us permission to use its data set, and encourage you to support the project at http://themoviedb.org. We’re excited about TMDB’s data, as its content contains several attributes that many search applications must work with. When searching movies, these attributes include:

  • Prose text (including overviews, synopsis, and user reviews)
  • Shorter text (such as director and actor names, and titles)
  • Numerical attributes (user ratings, movie revenue, number of awards)
  • Movie release dates and other attributes important in search

In this book, you’ll primarily use a prepackaged version of TMDB data. Packaged with the book’s GitHub repository (http://github.com/o19s/relevant-search-book) is a file containing a snapshot of TMDB movies used at the time of writing this book. This file, tmdb.json, is a large JSON dictionary. Each entry is a movie with various properties such as title and overview. We recommend using this data, as the results will be consistent with the book’s content. We welcome you, however, to use TMDB’s data directly. We cover the steps you can take to index an up-to-date version of TMDB’s data in appendix A. In this appendix, you’ll see in particular how movies are extracted one by one from an external API and further enriched with cast and crew information.

3.3. Examples programmed in Python

When examples call for light coding, we use Python, a highly readable, imperative language that looks and feels like pseudocode. You don’t need to know Python to follow along (just pretend we’re writing pseudocode). We’re not doing anything fancy with the language, so these examples should still be easy to follow. We also limit the dependencies (avoiding, for example, even Elasticsearch’s excellent client libraries). Instead, it should be assumed that for every piece of Python code, the following imports are included. This code imports requests (an HTTP client library) and Python’s JSON standard library:

import requests # requests HTTP library
import json # json parsing

Note also that throughout the examples, for readability, we use Elasticsearch at localhost on its default port, 9200. Change this as needed to point to your Elasticsearch instance as you work through the examples.

For detailed instructions on how to run the examples or to access the TMDB data, please refer to the book’s GitHub repository (http://github.com/o19s/relevant-search-book). This repository contains the full set of examples and data for the book, along with detailed installation instructions in the README file should you need to install Python, Elasticsearch, or any of the required libraries.

3.4. Your first search application

To get started, you’re going to index a few pieces of text about popular movies into Elasticsearch. In this chapter, we’re pretty verbose about what we’re doing, commenting carefully as we move forward. To avoid being verbose in future chapters, we wrap each component in a Python function. After you index TMDB data and issue your first search, you’ll quickly hit a snag in your relevance that will force you to debug the seemingly mystical and odd behavior of the search engine.

To index movies, first you need to read them in! To access tmdb.json with the movie dictionary, you’ll use a function called extract. In the following listing, you’ll pull back each movie by parsing the JSON file into a Python dictionary.

Listing 3.1. Extract movies from tmdb.json
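A minimal sketch of such an extract function (assuming tmdb.json sits alongside the code) parses the whole file in one call:

def extract():
    # tmdb.json is one large JSON dictionary: movie ID -> movie fields
    with open('tmdb.json') as f:
        return json.load(f)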

What does the returned dictionary look like? It’s a mapping of TMDB movie IDs to the movies pulled back from TMDB. A movie has plenty of fields you’d expect to be in a movie. Let’s look at an example. Here’s a snippet of the movie Aquamarine as a Python dictionary.

Listing 3.2. Sample TMDB movie from tmdb.json
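Abridged and with the values elided, each entry looks roughly like the following; only title and overview matter for this chapter, and the remaining field names are illustrative of the attribute types listed earlier:

{
    'id': ...,                 # TMDB's numeric movie ID
    'title': 'Aquamarine',
    'overview': '...',         # prose synopsis of the film
    'release_date': '...',
    'vote_average': ...,       # user rating
    'cast': [...],             # cast and crew enrichment (see appendix A)
}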

Now with some interesting data loaded, you’ll index these documents into Elasticsearch. Elasticsearch has several ways to index documents. You’ll predominantly use the bulk index API that allows you to efficiently index multiple documents in one HTTP request. Don’t worry too deeply about the ins and outs of the bulk index APIs; knowing the indexing APIs in any more depth than what’s presented here isn’t key for this book. What’s crucial for relevance is having an ability to re-create the index with new analysis and index settings. Once it’s re-created, you’ll need to reprocess documents against the updated settings.

That being said, let’s create a function, reindex, that you can refer to. The reindex function takes settings and the movieDict dictionary returned from extract, re-creates the Elasticsearch index, and indexes the data into Elasticsearch.

Listing 3.3. Indexing with Elasticsearch’s bulk API—reindex
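A sketch of the general shape of such a function follows, assuming the settings argument carries index settings and mappings, and that each movie dictionary includes TMDB’s id field:

def reindex(settings=None, movieDict=None):
    # Re-create the tmdb index with the passed-in settings
    # (a single shard keeps document frequencies consistent; see below)
    settings = settings or {}
    settings.setdefault('settings', {})['number_of_shards'] = 1
    requests.delete('http://localhost:9200/tmdb')
    requests.put('http://localhost:9200/tmdb', data=json.dumps(settings))

    # Build one bulk string: an action line plus a document line per movie
    bulkMovies = ""
    for id, movie in (movieDict or {}).items():
        addCmd = {"index": {"_index": "tmdb", "_type": "movie",
                            "_id": movie["id"]}}
        bulkMovies += json.dumps(addCmd) + "\n" + json.dumps(movie) + "\n"

    # POST the whole batch to Elasticsearch's bulk endpoint
    requests.post('http://localhost:9200/_bulk', data=bulkMovies)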

In reindex, you first interact with Elasticsearch by re-creating a tmdb index for your data with the passed-in settings. Creating an index is roughly analogous to creating a database in a relational database system. The index will contain your documents and other pieces of search configuration for tmdb content. You’ll work with the /tmdb Elasticsearch HTTP endpoint when working with the tmdb index as a whole.

You may notice the shards setting passed in. As you may recall from chapter 2, a term’s document frequency is an important component of results ranking. Document frequency counts the number of times a term occurs across the entire index. In distributed search engines, where the index is physically subdivided into shards, document frequency is stored per shard. This can cause results ranking to appear to be broken for smaller test document sets. For larger document sets, the impacts of sharding usually average out. For the repeatability of our testing, we’ll disable sharding.

Next, you start to use the bulk index API, building up a string of bulk index commands to Elasticsearch. The addCmd here tells Elasticsearch that you’re indexing the document. You tell Elasticsearch some metadata about each document, including where it should be stored (_index: tmdb), its type (_type: movie), and its unique ID (taken from TMDB’s id). You then append the document to be indexed, and append both the command and the document to the bulkMovies string for indexing. You repeat this process for every movie in movieDict. Finally, after building the full bulk command, you POST the large bulkMovies string to Elasticsearch’s /_bulk endpoint.

With all the pieces, you can finally index the movies. Combining extract and reindex, you can pull data into Elasticsearch in the following listing.

Listing 3.4. Pulling data from TMDB into Elasticsearch
movieDict = extract()
reindex(movieDict=movieDict)

Congratulations! You’ve built your first ETL (extract, transform, load) pipeline. Here you’ve done the following:

  • Extracted information from an external system
  • Transformed the data into a form amenable to the search engine
  • Indexed the data into Elasticsearch

Further, by telling Elasticsearch via the commands in reindex about a new index (_index: tmdb) and about a new type (_type: movie), you’ve created both an index (not an SQL database) and a type of document (not an SQL table). In the future, when you want to search or interact with the tmdb index, you’ll reference tmdb/movie/ or tmdb/ in the path of the Elasticsearch URL.

3.4.1. Your first searches of the TMDB Elasticsearch index

Now you can search! For this movie application, you need to figure out how to respond to user searches from your application’s search bar. To do this, you’ll use Elasticsearch’s Query domain-specific language (DSL), or Query DSL.

The Query DSL tells Elasticsearch how to execute a search using a JSON format. Here you specify factors such as required clauses, clauses that shouldn’t be included, boosts, field weights, scoring functions, and other factors that control matching and ranking. The Query DSL can be thought of as the search engine’s SQL, a query language focused on ranked retrieval of flat, denormalized documents.

Being a fairly new relevance engineer, you’ll start with a basic application of Elasticsearch’s multi_match query. This is Elasticsearch’s Swiss Army knife for constructing queries across multiple fields. Because most search problems involve searching multiple fields, it’s where many start with a relevance solution. A common initial pass at a search relevance solution is to attempt to construct a multi_match query that lists the fields to be searched along with a few boosts (specified with the ^ symbol). Boosting is the act of adding or multiplying to a relevance score with a constant factor, query, or function. In this case, boosting is simple; you boost the title score by the constant 10 in an effort to tell the search engine about the relative importance of the field.

Let’s implement a search function that lets you search with passed-in Query DSL queries. search is a fairly straightforward function that passes a query and prints the search results in order of relevance, as shown in the following listing.

Listing 3.5. The search function
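As a sketch (the result formatting is ours), search sends the query to the index’s _search endpoint and prints the hits in rank order:

def search(query):
    # Send the Query DSL query to the tmdb index and print ranked results
    url = 'http://localhost:9200/tmdb/movie/_search'
    resp = requests.get(url, data=json.dumps(query))
    hits = resp.json()['hits']['hits']
    print("Num\tRelevance Score\tMovie Title")
    for num, hit in enumerate(hits, start=1):
        print("%s\t%s\t%s" % (num, hit['_score'], hit['_source']['title']))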

What do the Query DSL queries that you pass to search look like? In listing 3.6, you construct a Query DSL search using multi_match. You attempt to tell Elasticsearch that the title field is 10 times more important than the overview field when ranking. Throughout this chapter, you’ll assess whether this attempt is working out.

Hint for Solr Readers

Instead of multi_match, Solr encourages you to start with the “dismax” family of query parsers. A starting query for the Solr user might be: http://localhost:8983/solr/tmdb/select?q=basketball with cartoon aliens&defType=edismax&qf=title^10 overview. Note that while this is the common starting point, this query works somewhat differently than Elasticsearch’s multi_match query parser. See chapter 6 and appendix B for more details.

Here’s your first “hello world” search using the Query DSL.

Listing 3.6. Your first search
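A sketch of such a query follows; the field list and the ^10 title boost are exactly as described above:

usersSearch = 'basketball with cartoon aliens'
query = {
    'query': {
        'multi_match': {
            'query': usersSearch,                # the user's search-bar text
            'fields': ['title^10', 'overview'],  # title weighted 10x overview
        }
    }
}
search(query)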

Output:

Num  Relevance Score     Movie Title
1    0.8424165           Aliens
2    0.5603433           The Basketball Diaries
3    0.52651036          Cowboys & Aliens
4    0.42120826          Aliens vs Predator: Requiem
5    0.42120826          Aliens in the Attic
6    0.42120826          Monsters vs Aliens
7    0.262869            Dances with Wolves
8    0.262869            Interview with the Vampire
9    0.262869            From Russia with Love
10   0.262869            Gone with the Wind
11   0.262869            Fire with Fire

Oh, no—these search results aren’t good! You can infer from the query “basketball with cartoon aliens” that the user is likely searching for Space Jam—a movie about the Looney Tunes characters facing off against space aliens in a game of basketball with the help of Michael Jordan. It seems that the user doesn’t know the name of the movie and is attempting to grope around for it with a descriptive query—a common use case. Unfortunately, most of the top movies listed seem to be about basketball or aliens, but not both. Other movies seem to be completely unrelated to basketball or aliens, and we’re completely missing the mark. Where’s Space Jam? If you request additional results from Elasticsearch, you finally see your result:

43   0.016977157            Space Jam

Why were seemingly irrelevant movies considered valuable by the search engine? How can you diagnose the problem and begin to seek solutions? Your day as a relevance engineer will be spent trying to diagnose the odd results returned by the search engine. You need to answer two main questions:

  • Why did certain documents match query terms? Why did a movie such as Fire with Fire even match your query?
  • Why did less relevant documents rank as highly as they did? Why is The Basketball Diaries ranked higher than our target Space Jam?

You’ll want to be able to understand the problem fast. Time is ticking, and users aren’t having a good search experience.

3.5. Debugging query matching

What could be happening in this failed search for “basketball with cartoon aliens”? The first, and most foundational, way to begin looking for answers is by debugging the query’s term-matching behavior. In your work, you’ll often find cases where a relevant document that should match doesn’t. Conversely, you might be surprised when low-value or spurious terms match, adding an irrelevant document to the results. Even within the documents retrieved, matching or not matching a term might influence relevance ranking—unexpectedly causing poor results to be ranked highly because of spurious matches or ranked low because of unexpected misses. You need to be able to take apart this process with Elasticsearch’s analysis and query validation debugging tools.

First, we’ll remind you of what we mean by matching. Recall from chapter 2 that declaring a term a match in the inverted index is a strict, exact binary equivalence. Search engines don’t have the intelligence to know that “Aliens” and “alien” refer to the same idea. Or that “extraterrestrial” refers to almost the same idea. English-speaking humans understand that these mentions should be counted as signifiers of the idea of alien; or, as we’ve discussed, an indicator of the feature of “alien-ness” present in the text. But to the unintelligent search engine, these two tokens exist as distinct UTF-8 binary strings. The two strings, 0x41,0x6c,0x69,0x65,0x6e,0x73 (Aliens) and 0x61,0x6c,0x69,0x65,0x6e (alien), aren’t at all the same and don’t match.

This exacting matching behavior points to two areas to take apart:

  • Query parsing —How your Query DSL query translates into a matching strategy of specific terms to fields
  • Analysis —The process of creating tokens from the query and document text

By understanding query parsing, you can see exactly how your Query DSL query uses Lucene’s data structures to satisfy searches against different fields. Through analysis, you can massage, interrogate, pry, and prod text with hope that the text’s true “alien-ness” can be boiled down to a single term. You can further identify meaningless terms, such as the that might match but represent no important feature, creating spurious matches on low-value terms.

Only after you understand how the underlying data structures are created and accessed can you hope to take control of the process. Let’s walk through your search and see whether a matching problem is inadvertently including spurious matches such as Fire with Fire.

3.5.1. Examining the underlying query strategy

The first thing you’ll do to inspect matching behavior is ask Elasticsearch to explain how the query was parsed. This will decompose your search query into an alternate description that more closely describes the underlying interaction with Lucene’s data structures. To do this, you’ll use Elasticsearch’s query validation endpoint. This endpoint, shown in the next listing, takes as an argument a Query DSL query and returns a low-level explanation of the strategy used to satisfy the query.

Hint for Solr Readers

Set the parameter debugQuery=true on your Solr query to get equivalent query parsing debug information. See your Solr response’s parsedquery for what’s equivalent to Elasticsearch’s query validation endpoint output.

Listing 3.7. Explaining the behavior of your query
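A sketch of the request; the query body is the same multi_match as before, sent to the index’s query-validation endpoint with explanations enabled:

query = {
    'query': {
        'multi_match': {
            'query': 'basketball with cartoon aliens',
            'fields': ['title^10', 'overview'],
        }
    }
}
resp = requests.get(
    'http://localhost:9200/tmdb/movie/_validate/query?explain=true',
    data=json.dumps(query))
print(json.dumps(resp.json(), indent=2))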

Response:

Here the returned explanation field lists what you’re interested in. Your query is translated into a more precise syntax that gives deeper information about how Lucene will work with your Elasticsearch query:

((title:basketball title:with title:cartoon title:aliens)^10.0) |
 (overview:basketball overview:with overview:cartoon overview:aliens)

3.5.2. Taking apart query parsing

The query validation endpoint has returned an alternative representation of your Query DSL query to help debug your matching issues. Let’s examine this alternative syntax; we introduced the basics of this syntax in chapter 2. The query validation output is reminiscent of Lucene query syntax[1]—a low-level, precise way of specifying a search. Because of the additional precision, Lucene query syntax describes the requirements of a relevant document a bit more closely to how Lucene itself will perform the search using the inverted index.

1

In reality, the representation depends on each Lucene query’s Java toString method, which attempts (but doesn’t always accurately reflect) strict Lucene query syntax.

As we discussed in chapter 2, Lucene queries are composed of the Boolean clauses MUST(+), SHOULD, and MUST_NOT(-). Each one specifies a field to search in the underlying document, and each takes the form [+/-]<fieldName>:<query>. To debug matching, the most important part of the clause is the component that specifies the match itself: <fieldName>:<query>. If you examine one of the preceding clauses, such as title:basketball, you can see that you’re asking the title field to look for the specific term basketball. Each clause is a simple term query, a single term lookup in the inverted index. Besides the term query, the most prominent queries you’ll encounter are phrase queries. We discussed these also in chapter 2. In Lucene query syntax, these are specified by using quotes, as in title:"space jam" to indicate that the terms should be adjacent.

In our example, as you move one layer out from each match, you can see Lucene’s query strategy. Although you’re currently focused on matching, this encompasses more than that. Above the innermost matches, you see four SHOULD clauses scored together (grouped with parentheses):

(title:basketball title:with title:cartoon title:aliens)

Boosted by a factor of 10 (as we’ve requested when searching), you have the following:

(title:basketball title:with title:cartoon title:aliens)^10

Compared to another query, with a maximum score taken (| symbol), you have this:

((title:basketball title:with title:cartoon title:aliens)^10.0) |
(overview:basketball overview:with overview:cartoon overview:aliens)

We present other pieces of this pseudo-Lucene query syntax as you move through the book.

It seems odd that a lot of surprising scoring mumbo-jumbo is already happening. You’ll debug scoring in greater depth later in this chapter; for now what matters is using the term query information to answer why spurious matches such as Dances with Wolves or Fire with Fire are even considered matches.

3.5.3. Debugging analysis to solve matching issues

Now that you know which terms are being searched for, the next step to debugging matching is to see how documents are decomposed into terms and placed in the index. After all, your searches will fail if the terms you’re searching for don’t exist in the index. We gave an example of this previously. Searches for the term Aliens won’t match the term alien regardless of our intuition. Further, term searches might result in spurious matches that don’t signify anything valuable. For example, matching on the in isolation is spurious for English. It signifies no important feature latent in the text to our user’s English-language-trained minds.

Despite our intuitive notion of how a document should be decomposed into terms, the mechanics of analysis often surprise us. It’s a process you’ll need to debug often. You already know how these terms are extracted: through index-time analysis. Analyzers are the entities that define the analysis process. They contain the components discussed in chapter 2: character filters, a tokenizer, and token filters. In Elasticsearch, the analyzer used can be specified at many levels, including for the index (all of TMDB), a node (a running instance of Elasticsearch), a type (all movies), a field, or even at query time for a particular query. You have yet to specify an analyzer, so the default standard analyzer is used. You can use this knowledge along with Elasticsearch’s useful analyze endpoint to view how text from your documents was transformed into the tokens that will form the inverted index.

Perhaps if you see how the analysis for the title Fire with Fire translates to the inverted index, you might see the terms that match your query. Then you might see why this seemingly random, irrelevant movie is included in the results.

Hint for Solr Readers

While there’s a similar API in Solr, Solr comes with a tremendous admin UI that includes a debugging tool for analysis. In the Admin UI, select your core and open the Analysis screen to perform similar debugging.

Listing 3.8. Debugging analysis
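A sketch of the request; field=title asks the analyzer configured for the title field (currently the default standard analyzer) to analyze the supplied text:

resp = requests.get('http://localhost:9200/tmdb/'
                    '_analyze?field=title&format=yaml',
                    data="Fire with Fire")
print(resp.text)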

The result (in prettier YAML) is as follows:
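Abridged, and with formatting approximated, the standard analyzer reports three tokens:

tokens:
- token: "fire"
  start_offset: 0
  end_offset: 4
  type: "<ALPHANUM>"
  position: 1
- token: "with"
  start_offset: 5
  end_offset: 9
  type: "<ALPHANUM>"
  position: 2
- token: "fire"
  start_offset: 10
  end_offset: 14
  type: "<ALPHANUM>"
  position: 3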

This output shows you important properties of each token extracted from the snippet Fire with Fire by the standard analyzer. This list of tokens resulting from analysis is known as the token stream. In this token stream, you extract three tokens: fire, with, and fire. Notice how the text has been tokenized by whitespace and lowercased? Further notice how more attributes than just the token text are included. Notice the offset values, indicating the exact character position of each term in the source text, and position, indicating the position of the token in the stream.

After analysis, the token stream is indexed and placed into the inverted index. For debugging and illustration purposes, you can represent the inverted index in a simple data structure known as SimpleText[2]—an index storage format created by Mike McCandless purely for educational purposes. You’ll use this layout to think through the structure of the inverted index.

2

You can read more about SimpleText in Mike McCandless’s blog post: http://blog.mikemccandless.com/2010/10/lucenes-simpletext-codec.html.

Let’s take a second to reflect on how the preceding token stream is translated to a SimpleText representation of an index, focused just on the term fire.

Listing 3.9. SimpleText index representation for the term fire
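As a rough SimpleText-style sketch (positions follow the token stream above), the fire entry looks like this:

term fire
  doc 0
    freq 2
    pos 1
    pos 3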

The search engine’s goal when indexing is to consume the token stream into the inverted index, placing documents under their appropriate terms. After counting the number of occurrences of a particular token (in this case, two instances of fire), indexing adds entries to the postings list for the term fire. Under fire, you add your document, doc 0. You further store the number of occurrences of fire in doc 0 as freq and record where it occurred through each position entry. With all the tokens taken together, this document is added to the postings for two terms, fire and with, as shown in the following listing.

Listing 3.10. View of title index with Fire with Fire terms highlighted
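Sketched the same way, with every other term and document in the title index elided, Fire with Fire now appears under two terms:

term fire
  doc 0
    freq 2
    pos 1
    pos 3
term with
  doc 0
    freq 1
    pos 2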

As we’ve discussed, data structures other than the inverted index consume this token stream. Numerous features can be enabled in Lucene. For our purposes, you should consider data structures that consume this token stream to provide other forms of global-term statistics such as the document frequency. In this case, the document frequency of fire will increase by one, reflecting the new document.

It’s important to note that you can deeply control this process. Typically, analysis is controlled on a field-by-field basis. You’ll see how to define your own analyzers for your fields, using the components discussed in chapter 2: character filters, tokenizers, and token filters. But first, armed with what you know about the query and the terms in the index, you need to examine why a spurious result like Fire with Fire would even match in the first place.

3.5.4. Comparing your query to the inverted index

You’re now prepared to compare your parsed query with the context of the inverted index. If you compare the parsed query

((title:basketball title:with title:cartoon title:aliens)^10.0) |
(overview:basketball overview:with overview:cartoon overview:aliens)

against the inverted index snippet from the token stream for Fire with Fire, you see exactly where the match occurs:
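Sketching just the relevant posting, with other terms and documents elided:

term with
  doc 0      <- Fire with Fire
    freq 1
    pos 2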

The clause title:with pulls in doc 0, Fire with Fire, from the inverted index. Recalling how term matches work, you can start to understand the mechanics here. Our document is listed under with in the index. Therefore, it’s included in the search results along with other matches for with. As we discussed in the previous chapter, this is an entirely mechanical process. Of course, to English speakers, a match on with isn’t helpful and will leave them scratching their heads about why such a noisy word was considered important in matching.

Other spurious movies seem to fall into this category. Movies like Dances with Wolves or From Russia with Love get slurped up into the search results just as easily as documents that match more important terms like basketball or aliens. Without help, the search engine can’t discriminate between meaningful, valid, and important English terms and those that are noise and low value.

3.5.5. Fixing our matching by changing analyzers

This matching problem luckily has a straightforward fix. We’ve teased Elasticsearch for not knowing much about English. In actuality, Elasticsearch has an analyzer that handles English text fairly well. It strings together character filters, a tokenizer, and token filters to normalize English to standard word forms. It can stem English terms to root forms (running -> run), and remove noise terms such as the, known as stop words. Lucky for us, with is one such stop word. Removing it from the index could solve our problem.

How do you use this analyzer? Simple: you need to assign a different analyzer to the fields. Because modifications to index-time analysis alter the structure of the inverted index, you’ll have to reindex your documents. To customize the analysis, you’ll re-create your index and rerun the previous indexing code. The main difference, shown in the following listing, is that you’ll explicitly configure the fields with the English analyzer before creating the index.

Hint for Solr Readers

Solr’s schema.xml specifies the configuration of Solr’s fields. The analyzer used by a field is controlled by the analyzer associated with a field’s fieldType. Out of the box, Solr’s schema.xml defines a number of field types, including text_en which is appropriate for English text. Changing analyzers and field settings requires reindexing in Solr just as in Elasticsearch.

Listing 3.11. Reindexing with the English analyzer
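A sketch of the idea, assuming reindex accepts a settings dictionary as sketched earlier; the movie mapping assigns the english analyzer to both searched fields before the data is reindexed:

mappingSettings = {
    'mappings': {
        'movie': {
            'properties': {
                'title':    {'type': 'string', 'analyzer': 'english'},
                'overview': {'type': 'string', 'analyzer': 'english'},
            }
        }
    }
}
movieDict = extract()
reindex(settings=mappingSettings, movieDict=movieDict)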

Great! Did it work? Let’s reanalyze Fire with Fire to see the results:

resp = requests.get('http://localhost:9200/tmdb/'
                    '_analyze?field=title&format=yaml',
                    data="Fire with Fire")

Response:
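Abridged and approximated as before, only the two fire tokens remain:

tokens:
- token: "fire"
  start_offset: 0
  end_offset: 4
  type: "<ALPHANUM>"
  position: 1
- token: "fire"
  start_offset: 10
  end_offset: 14
  type: "<ALPHANUM>"
  position: 3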

Notice the removal of with in this token stream. Particularly, notice the gap between positions 1 and 3. Elasticsearch is reflecting the removal of the token by this position gap to avoid spurious phrase matches. Rerunning the query validation also shows a removal of with from the query:
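((title:basketbal title:cartoon title:alien)^10.0) |
 (overview:basketbal overview:cartoon overview:alien)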

And indeed, the matches become much closer to what you want. At least you’re in the range of aliens. Further, because of more sophisticated analysis, stemming, and token normalization, you’re picking up other matches of alien that were missing.

Num  Relevance Score    Movie Title
1    1.0643067          Alien
2    1.0643067          Aliens
3    1.0643067          Alien3
4    1.0254613          The Basketball Diaries
5    0.66519165         Cowboys & Aliens
6    0.66519165         Aliens in the Attic
7    0.66519165         Alien: Resurrection
8    0.53215337         Aliens vs Predator: Requiem
9    0.53215337         AVP: Alien vs. Predator
10   0.53215337         Monsters vs Aliens
11   0.08334568         Space Jam

Congratulations! By turning on the English analyzer, you’ve made a significant leap forward. Your target has moved up to #11. You’ve achieved a saner mapping of text that corresponds to the text’s “alien-ness” feature through simple English-focused analysis. You’ve also eliminated text that shouldn’t be thought of as corresponding to any feature of the text: stop words.

In future chapters, you’ll explore more use cases that shape the representation of tokens even deeper than what you’ve done here. Because terms are analogues to textual features, the translation of text into tokens is often deeply customized per domain. For now, you need to switch gears to debug the next layer in the search equation: relevance ranking.

3.6. Debugging ranking

After resolving your matching issue, you’re still left wondering why movies like Alien, Aliens, and The Basketball Diaries rank above Space Jam. None of these movies features basketball-playing aliens. Our user is still left disappointed with the search. With results like these, the user is likely growing increasingly frustrated with the search application. You have to find a way to take apart the relevance ranking so that it more accurately aligns with your user’s information needs. You need to ask Elasticsearch to explain itself. What you’ll see is that debugging ranking means understanding the following:

  • The calculation of individual match scores
  • How these match scores factor into the document’s overall relevance score

You saw in chapter 2 that underlying each of these factors is the idea of a score. The score is the number assigned by the search engine to a document matching a search. It indicates how relevant the document is to the search (higher score meaning more relevant). Relevance ranking, then, is typically a sort on this number. You’ll see through debugging ranking that this score, although informed by a theoretical basis, is entirely in your hands to manipulate, to implement your notions of relevance. In fact, the majority of this book is about the best way to take mastery over this one number!

For matches, you must determine whether match scores accurately reflect your intuitive notion about the strength of the corresponding feature. We discussed that all mentions of alien or alien-related text (for example, Aliens or extraterrestrial) add weight to our notion of the “alien-ness” feature latent in the text. Do you feel that when you match on alien, the score for the term alien reflects your intuition of the true strength of the text’s “alien-ness”? We’ll decompose the math that goes into term scoring. Only then can you reflect on whether matches on alien or basketball really reflect your understanding of the true strength of a movie’s true “alien-ness” or “basketball-ness.”

You’ll also see the mechanics of how other queries compose matches into larger score calculations by boosting, summing, and choosing between component scores. If term matches represent the strength of individual features in text, then these other operations relate the features to one another. Our example specifies a multi_match query with default settings, searching title with a boost of 10 and overview with no boost. How does this translate into a scoring formula? More important, how do you know whether the formula resulting from this query was the right thing to do?

To fix our Space Jam query, you’ll need to get inside the search engine’s head. You’ll need to align the mechanical scoring process to reflect your business and user’s notion of relevance—both in terms of how terms relate to features and how these feature strengths combine into a larger relevance score.

3.6.1. Decomposing the relevance score with Lucene’s explain feature

Lucene’s explain feature lets you decompose the calculation behind the relevance score. Before diving into the explain, let’s revisit your initial pass at understanding the query for Space Jam. The query validation output helps reveal the scoring strategy that will be used:

((title:basketbal title:cartoon title:alien)^10.0) |
 (overview:basketbal overview:cartoon overview:alien)

In this query, you seek out basketbal (the stemmed form of basketball), cartoon, and alien terms in each field. The title score is boosted by 10. The search engine then chooses between the two fields, by taking the maximum of the field scores (the | symbol).

This is a starting point, but what you need to see isn’t the strategy, but the after-action report. You need to see the scoring arithmetic for specific documents.

There are a couple of ways to ask for explain information, but because we often want to see this information in line with each search result, it’s convenient to set explain: true when issuing the search query. This will return an _explanation entry in each search result returned. Let’s reissue our search with an explain set so you can reflect on the scoring.

Hint for Solr Readers

The Solr parameter &debugQuery=true outputs the same scoring debug information as setting 'explain': True in Elasticsearch. Examine the “explain” section in your Solr response.

Listing 3.12. Requesting a relevancy scoring explanation
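As a sketch, the only change from the earlier query is the added explain flag; each hit in the response then carries an _explanation entry:

query = {
    'explain': True,    # return an _explanation alongside each hit
    'query': {
        'multi_match': {
            'query': 'basketball with cartoon aliens',
            'fields': ['title^10', 'overview'],
        }
    }
}
httpResp = requests.get('http://localhost:9200/tmdb/movie/_search',
                        data=json.dumps(query))
explanations = [hit['_explanation']
                for hit in httpResp.json()['hits']['hits']]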

The full explain is lengthy, so we omit a great deal of the JSON output. It’s here only to give a taste. We show the full explain in a more concise form farther down. Without further ado, here’s a snippet of the JSON explain for Alien:

{
 "description": "max of:",
 "value": 1.0643067,
 "details": [
  {
   "description": "product of:",
   "value": 1.0643067,
   "details": [
    {
     "description": "sum of:",
     "value": 3.19292,
     "details": [
      {
       "description": "weight(title:alien in 223)
                       [PerFieldSimilarity], result of:",
       "value": 3.19292,
       "details": [
        {
         "description": "score(doc=223,freq=1.0 = termFreq=1.0
),
                         product of:",
         "value": 3.19292,
         "details": [
          {
           "description": "queryWeight, product of:",
           "value": 0.4793294,
           "details": [
            {
             "description": "idf(docFreq=9, maxDocs=2875)",
             "value": 6.661223

            }
<omitted>
}

From now on, we’ll summarize this explain format more concisely. We can simplify the preceding snippet by collapsing it for a shorter summary shown in the following listing. While this begins to take shape, it’s still overwhelming. Don’t focus on understanding this now; just scan it. We’ll soon show a way to make sense of the madness.

Listing 3.13. Simplified explain for Alien
1.0646985, max of:
  1.0646985, product of:
    3.1940954, sum of:
      3.1940954, weight(title:alien in 223) [PerFieldSimilarity], result of:
        3.1940954, score(doc=223,freq=1.0 = termFreq=1.0), product of:
          0.4793558, queryWeight, product of:
            6.6633077, idf(docFreq=9, maxDocs=2881)
            0.07193962, queryNorm
          6.6633077, fieldWeight in 223, product of:
            1.0, tf(freq=1.0), with freq of:
              1.0, termFreq=1.0
            6.6633077, idf(docFreq=9, maxDocs=2881)
            1.0, fieldNorm(doc=223)
    0.33333334, coord(1/3)
  0.053043984, product of:
    0.15913194, sum of:
      0.15913194, weight(overview:alien in 223)
                 [PerFieldSimilarity], result of:
        0.15913194, score(doc=223,freq=1.0 = termFreq=1.0), product of:
          0.033834733, queryWeight, product of:
            4.7032127, idf(docFreq=70, maxDocs=2881)
            0.0071939616, queryNorm
          4.7032127, fieldWeight in 223, product of:
            1.0, tf(freq=1.0), with freq of:
              1.0, termFreq=1.0
            4.7032127, idf(docFreq=70, maxDocs=2881)
            1.0, fieldNorm(doc=223)
    0.33333334, coord(1/3)

We’ll compare this explanation for Alien to the explain for our target result Space Jam:

0.08334568, max of:
  0.08334568, product of:
    0.12501852, sum of:
      0.08526054, weight(overview:basketbal in 1289)
                 [PerFieldSimilarity], result of:
        0.08526054, score(doc=1289,freq=1.0 = termFreq=1.0), product of:
          0.049538642, queryWeight, product of:
            6.8843665, idf(docFreq=7, maxDocs=2875)
            0.0071958173, queryNorm
          1.7210916, fieldWeight in 1289, product of:
            1.0, tf(freq=1.0), with freq of:
              1.0, termFreq=1.0
            6.8843665, idf(docFreq=7, maxDocs=2875)
            0.25, fieldNorm(doc=1289)
      0.03975798, weight(overview:alien in 1289)
                  [PerFieldSimilarity], result of:
        0.03975798, score(doc=1289,freq=1.0 = termFreq=1.0), product of:
          0.03382846, queryWeight, product of:
            4.701128, idf(docFreq=70, maxDocs=2875)
            0.0071958173, queryNorm
          1.175282, fieldWeight in 1289, product of:
            1.0, tf(freq=1.0), with freq of:
              1.0, termFreq=1.0
            4.701128, idf(docFreq=70, maxDocs=2875)
            0.25, fieldNorm(doc=1289)
    0.6666667, coord(2/3)

At first blush, these explains appear terrifying. The first thing to realize is that the explain is simply a decomposition of the arithmetic behind the relevance score. Each number on the outside is explained by the details nested within. At the outermost explain, you have the document’s relevance score. As you move deeper into the details, you can see how that score is calculated with increased granularity.

Eventually, you get to the layer listing the scores for specific matches (title:alien). Under this layer, you describe the components involved in the scoring of a specific match in a field. This match level is a bit of a dividing line in the explain. Inside, a match is scored by directly consulting Lucene’s data structures for a term in a field. Outside, scores for matches are combined into a larger formula. You may wish to compare this output to what’s presented in figure 2.10, near the end of chapter 2.

If you elide what’s inside the explains for each match (looking only at what’s “outside” matches), you have an even more concise explain for Alien:

1.0643067, max of:
  1.0643067, product of:
    3.19292, sum of:
      3.19292, weight(title:alien in 223) [PerFieldSimilarity]
    0.33333334, coord(1/3)
  0.066263296, product of:
    0.19878988, sum of:
      0.19878988, weight(overview:alien in 223) [PerFieldSimilarity]
    0.33333334, coord(1/3)

What you’re left with is a set of operations on the matches themselves. Internally, these operations reflect queries that wrap other queries. These wrapping queries are known as compound queries. Compound queries allow us to express how different features represented by the underlying term-match scores relate to each other mathematically. They reflect the query strategy you’ve already seen:

((title:basketbal title:cartoon title:alien)^10.0) |
 (overview:basketbal overview:cartoon overview:alien)

These compound queries can, in turn, be combined at arbitrary depth by other compound queries to create even more-complex query scoring and matching. A great deal of relevance engineering is learning how a Query DSL query maps to a set of compound queries.

If you pull back the veil and examine inside a match, you see a different sort of calculation happening. The scoring begins to look more cryptic, filled with deeper search-engine jargon. At this point, you’re seeing a more fundamental reflection of the information retrieval intelligence built into the search engine. At this level, you begin to see information about match statistics for a field. These matches are the basic building blocks of the scoring calculation—hopefully, accurately reflecting the strength of a particular latent feature in the text.

0.03975798, weight(overview:alien in 1289) [PerFieldSimilarity], result of:
  0.03975798, score(doc=1289,freq=1.0 = termFreq=1.0), product of:
  0.03382846, queryWeight, product of:
    4.701128, idf(docFreq=70, maxDocs=2875)
    0.0071958173, queryNorm
  1.175282, fieldWeight in 1289, product of:
    1.0, tf(freq=1.0), with freq of:
      1.0, termFreq=1.0
    4.701128, idf(docFreq=70, maxDocs=2875)
    0.25, fieldNorm(doc=1289)

Again, you might be terrified! Don’t fret. We cover a theoretical backing for these numbers next, and after that, you’ll be able to compare matches with ease. You’ll be able to determine why some field matches seem to convey more strength than others.

3.6.2. The vector-space model, the relevance explain, and you

Much of the Lucene scoring formula derives from information retrieval. But the theoretical influence needs to be tempered mightily. Although the theoretical basis gives you context for solving a problem, in practice, relevance scoring uses theory-inspired heuristics based on applied experience of what works well. In many ways, aside from foundational concepts, relevance scoring is just as much an art as a science. Understanding the science will help you ensure that the search engine correctly measures the weight of features latent in the text, represented by terms.

To information retrieval, a search for multiple terms in a field (such as our overview:basketbal overview:alien overview:cartoon against Space Jam) attempts to approximate a vector comparison between the query and matched document. Vectors? That sounds like geometry for what seems like a language arts problem. Recall that a vector represents a magnitude and a direction in space. A vector is often represented as an arrow, pointing into space from the origin—say, to the Moon from Earth. Numerically, a vector is represented as a value for each dimension. Perhaps the vector <50,20> means “North 50 miles, East 20 miles.” Space for a vector, however, need not relate to the physical world we move around in. For example, if the x-axis represents a fruit’s juiciness, and the y-axis its size, you can define a vector space that captures some of the important features of fruitiness. Figure 3.2 shows this vector space.

Figure 3.2. Fruit in a two-dimensional vector space; the x-axis is size, and the y-axis is juiciness. Every fruit measured for size/juiciness can be represented as a vector, with similar fruit clumping together.

Here you see several pieces of fruit represented as vectors in the juiciness/size vector space. Some have a great deal of strength in the juiciness dimension, others in the size direction. You can easily see how similar fruit might clump together. For example, fruits in the upper right are most likely watermelon—very large and very juicy.

You can infer something about the similarity of two pieces of fruit by computing the dot product of their two vectors. In the fruit example, this means (1) multiplying the juiciness of each fruit together, (2) multiplying the size, and (3) summing the results. It turns out that the more properties fruit share in common, the higher the dot product.

dotprod(fruit1, fruit2) = juiciness(fruit1) × juiciness(fruit2) +
                               size(fruit1) × size(fruit2)

What does this have to do with text? To information retrieval, text (queries and documents) can also be represented as vectors. Instead of examining features such as juiciness or size, the dimensions in the text vector space represent words that might appear in the text. What if, instead of fruit, you looked at a movie overview’s mention of basketball, aliens, or cartoons, as in figure 3.3? Some text is definitely about aliens (for example, the overview for Alien), but not basketball or cartoons. Other text (such as the overview of the Japanese anime film Slam Dunk) is about basketball and cartoons, but not aliens. We suspect that our target, Space Jam, should score highly in all the required dimensions.

Figure 3.3. Movie overview text in a three-dimensional vector-space examining basketball, cartoon, and alien. Some movies are very much about cartoons and basketball (Slam Dunk). Space Jam is about all three!

In the same way that fruit has a “juiciness” feature, you can think of movie text as having an “alien-ness” feature based on the occurrence of alien words in the text. To generalize this idea of representing features, you’ll define a feature space to mean a vector space where dimensions represent features, regardless of whether you’re talking about features of fruit, text, or anything else worth comparing.

Of course, movies are about far more than just basketballs, cartoons, and aliens. For text, the feature space is much larger than three dimensions. In what’s known as the bag of words model of text, our vectors have a single dimension for each possible term. It’s possible for there to be a dimension for every word in the English language! Naturally, any given document or query is unlikely to mention every word in the English language. You’ll be hard pressed to find mention of Rome or history in the overview for Space Jam. Similarly, Gladiator is unlikely to mention Michael Jordan. Therefore, most dimensions in these document vectors are empty or zero. For this reason, they’re known as sparse vectors.

After understanding that each vector dimension is a feature, the next step is measuring the strength or magnitude of that feature. In search, this value is known as a weight—a measure of how important that term is for the snippet of text. If alien is prominent, it should receive a high weight; otherwise, it should receive a low or zero weight. If you reexamine the previous explain, you can see Lucene’s own weight measurement for the “alien” dimension in Space Jam:

0.03975798, weight(overview:alien in 1289)

Before digging into how Lucene computes this weight, let’s walk through an example with a simpler definition. Let’s define the weight for a particular term in text as 1 if the term is at all present, and 0 if not. With this definition, a snippet of text from Space Jam’s overview, basketball game against alien, would be represented as this bag-of-words vector VD:

a   alien   against   ...   basketball   cartoon   ...   game   ...   movie   narnia   ...   zoo
0   1       1               1            0               1            0       0              0

This vector has a dimension for every word in the English language; we’re showing you only a handful of English words. You can compare this to a similarly constructed vector, VQ, for your query “basketball with cartoon aliens”:

a   alien   against   ...   basketball   cartoon   ...   game   ...   movie   narnia   ...   zoo
0   1       0               1            1               0            0       0              0

How many components match? How similar are the query and document? Just as with the fruit example, you can calculate a dot product to arrive at a score. Recall that a dot product of two vectors multiplies corresponding components one by one. You then sum the components. So your score for this query would be calculated as follows:

score = VD['a'] × VQ['a'] + VD['alien'] × VQ['alien'] +
    ... + VD['space'] × VQ['space'] ...
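To make that arithmetic concrete, here’s a small sketch using the simple 0/1 weights above; Python dicts stand in for the sparse vectors, and only dimensions present in both contribute:

# 0/1 bag-of-words weights (nonzero entries only)
doc_vector   = {'alien': 1, 'against': 1, 'basketball': 1, 'game': 1}
query_vector = {'alien': 1, 'basketball': 1, 'cartoon': 1}

# Dot product: multiply matching components, then sum
score = sum(weight * query_vector.get(term, 0)
            for term, weight in doc_vector.items())
print(score)   # 2 -- only 'alien' and 'basketball' occur in both vectors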

When compared to the preceding explain breakdown, each multiplication factor represents a match score. In other words, overview:alien in the explain corresponds to the factor VD['alien'] × VQ['alien']. The difference is that the explain reflects Lucene’s own function for calculating a field or query weight, which we dive into next. The summation in the preceding dot product can be found in the behavior of the Boolean query that sums up matching clauses. You can see this in sum of from the previous explain:

3.19292, sum of:
      3.19292, weight(title:alien in 223) [PerFieldSimilarity]

3.6.3. Practical caveats to the vector space model

Although the vector space model provides a general framework for discussing Lucene’s scoring, it’s far from a complete picture. Numerous fudge factors have been shown to improve scoring in practice. Perhaps most fundamentally, the way matches are combined by compound queries into a larger score isn’t always a summation.

You’ve seen through the | symbol that the “max” of two fields is often taken. There’s also often a coord factor that directly punishes compound matches missing some of their components (coord multiplies the resulting dot product by <the number of matches> / <the total query terms>). Many of the compound queries you’ll encounter will perform various operations on the underlying queries, such as taking a max, summing, or taking a product. You also have tremendous freedom to arbitrarily calculate or boost scores with your own function queries that might combine match (or other) scores with other arbitrary factors. You’ll explore many of these strategies in future chapters.

Another important note about this dot product is that it’s often normalized by dividing by the magnitudes of the two vectors:
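score = dotprod(VD, VQ) / (|VD| × |VQ|)

Here |V| denotes a vector’s magnitude (the square root of the sum of its squared components).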

For dot products, this normalization converts the score to a value between 0 and 1. This rebalances the equation to account for features that tend to have high weights, and those that tend to have smaller weights.[3] For search, given all the fudge factors in Lucene scoring and the peculiarities of field statistics, you should never attempt to compare scores between queries without a great deal of deep customization to make them comparable.

3

Astute readers will recognize this as the cosine similarity.

As stated previously, the sparse vector representation of text is known as the bag of words model. It’s considered a “bag” because it reflects a decomposition of text that ignores the context of these terms. An important part of this context is the position the term occurs in to enable phrase matching. Luckily, Lucene also stores positions of each term’s occurrence. Thus, you could view a document as a sparse vector that also includes every subphrase. This can be quite a large vector indeed! See the table below as an example. It’s even larger when you open up the library of complex span queries!

a   an   alien   ...   basketball   lump   ...   "basketball game"   "game against"   ...
0   0    1             1            0            1                   1

3.6.4. Scoring matches to measure relevance

You’re still trying to get to the bottom of why some of your movies are ranked suspiciously higher than the target Space Jam. You’ve explored the explain, and you see some match scores that concern you. You’re almost there!

You need to understand how Lucene measures the weight of a term in a piece of document or query text in its vector calculation. You can then evaluate whether these weights correspond to your intuition of how strongly those matches should be weighted. Users, of course, don’t think in terms of the math presented here, but these metrics have been shown experimentally to approximate users’ broad sense of relevance. Let’s see whether they line up with our expectations for our use case.

Let’s look at Lucene’s weight computation for alien in Space Jam to suss out why a match on alien is relatively weak:

0.03975798, weight(overview:alien in 1289) [PerFieldSimilarity], result of:
  0.03975798, score(doc=1289,freq=1.0 = termFreq=1.0), product of:
    0.03382846, queryWeight, product of:
      4.701128, idf(docFreq=70, maxDocs=2875)
      0.0071958173, queryNorm
    1.175282, fieldWeight in 1289, product of:
      1.0, tf(freq=1.0), with freq of:
        1.0, termFreq=1.0
      4.701128, idf(docFreq=70, maxDocs=2875)
      0.25, fieldNorm(doc=1289)

How does Lucene’s weight computation work? Looks like two weight components are multiplied together. The fieldWeight reflects how Lucene computes the importance of this term in the field text (in this case, overview). The queryWeight computes the weight of the term in the user’s query.

This weight information can be translated from the explains into sparse vectors for the query and the two documents being scored (VQ and VD from the previous section). For example, if you compare the weight of alien in Space Jam with the corresponding entry in Alien:

0.15913194, weight(overview:alien in 223) [PerFieldSimilarity], result of:
  0.15913194, score(doc=223,freq=1.0 = termFreq=1.0), product of:
    0.033834733, queryWeight, product of:
      4.7032127, idf(docFreq=70, maxDocs=2881)
      0.0071939616, queryNorm
    4.7032127, fieldWeight in 223, product of:
      1.0, tf(freq=1.0), with freq of:
        1.0, termFreq=1.0
      4.7032127, idf(docFreq=70, maxDocs=2881)
      1.0, fieldNorm(doc=223)

You can represent these weights in a sparse vector. Here, you see the weight for alien.

Query or field                                 ...   alien         ...
Query: basketball with cartoon aliens (VQ)           0.033834733
overview field in Space Jam (VD)                     1.175282
overview field in Alien (VD)                         4.7032127

For some reason, the weight of the term alien is much higher in the overview field for Alien than it is for Space Jam. To us, this means that the feature of "alien-ness" is graded highly in this overview text.

3.6.5. Computing weights with TF × IDF

The rules for computing a term’s weight in a field are driven by what Lucene calls a similarity. A similarity uses statistics recorded in the index for matched terms to assist a query in computing a numerical weight for the term. Lucene supports several similarity implementations, including letting you define your own.

Most similarities are based on the formula TF × IDF. This refers to the multiplication of two important term statistics extracted from the field and recorded in the inverted index by Lucene—term frequency (TF) and inverse document frequency (IDF). By default, terms have their importance weighed by multiplying these two statistics. As you may recall, these statistics were discussed at the end of chapter 2. Let’s recap.

TF (tf in the preceding scoring) reflects how frequently a term occurs in a field. You can see it in the SimpleText version of the inverted index in earlier sections as freq. TF is extremely valuable in scoring a match. If a matched term occurs frequently in a particular field in a document (if the field mentions alien a lot), we consider that field’s text much more likely to be about that term. (We consider it very likely to be about aliens.)

Conversely, IDF (idf in the preceding scoring) tells us how rare (and therefore valuable) a matched term is. Because IDF is the inverse of the document frequency, it’s computed by taking 1 / document frequency, or 1 / DF. As you may recall, DF records the number of documents the term occurs in. If the term is common, it will have a high document frequency. Rare terms are considered valuable, and common ones less so. If the term supercalifragilistic occurs in a single document, it will receive a high IDF.

Raw TF × IDF weighs a term’s importance in text by multiplying TF with IDF, or put another way, TF × (1 / DF) = TF / DF. This measures how much of the index’s overall use of that term is concentrated in this specific document.

Table 3.1 shows how TF × IDF works. In this example, when you weigh the importance of lego, there are relatively few movies about Legos. As you’d expect, the one movie that mentions Legos, The Lego Movie, receives a higher TF × IDF weight. Contrast this lego search with love. Movies that mention the term love are particularly common (everyone loves romantic comedies!). This causes occurrences of love in one particular romantic comedy, Sleepless in Seattle, to receive a lower weight than lego in The Lego Movie, even though love occurs far more often in Sleepless in Seattle’s text.

Table 3.1. Scoring love matches in Sleepless in Seattle versus lego in The Lego Movie. lego is rare and is mentioned only in The Lego Movie, thus yielding a higher score than the love match.

Movie                   Matched term   DF    TF    TF × IDF (TF / DF)
Sleepless in Seattle    love           100   10    10 / 100 = 0.1
The Lego Movie          lego           1     3     3 / 1 = 3.0
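Here’s the arithmetic behind table 3.1 in plain Python (these TF and DF values are the illustrative numbers from the table, not statistics pulled from a real index):

def raw_tf_idf(tf, df):
    # Raw, undampened TF x IDF: term frequency divided by document frequency
    return tf / df

print(raw_tf_idf(tf=10, df=100))   # love in Sleepless in Seattle -> 0.1
print(raw_tf_idf(tf=3, df=1))      # lego in The Lego Movie       -> 3.0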

The idea behind TF × IDF corresponds to most users’ instincts about which matched terms in text should be considered important. Users perceive rare terms (such as lego) to be far more specific and targeted than common terms (love). Further, the more a snippet of text mentions a term relative to other text (the higher its TF), the more likely that text is to be about the term being searched on.

Though broadly valuable, you’ll see cases where these intuitions don’t hold. Sometimes increased TF doesn’t correspond to the user’s notion of term importance. High-TF matches in short text snippets, such as title fields (Fire with Fire), often don’t correlate with our notion of increased term weight. Luckily, Elasticsearch gives us the ability to disable TF as needed.

3.6.6. Lies, damned lies, and similarity

Although you can see that TF × IDF seems to be an intuitive weighting formula, these raw statistics need additional tweaking to be optimal. Information retrieval research demonstrates that although a search term might occur 10 times more in a piece of text, that doesn’t make it 10 times as relevant. More mentions of the term do correlate with relevance, but the relationship isn’t linear. For this reason, Lucene dampens the impact of TF and IDF by using a similarity class.

This book uses Lucene’s classic TF × IDF similarity (the defaults in Solr 5.x and Elasticsearch 2.0—see callout). Lucene’s classic similarity dampens the impact of tf and idf when computing a weight:

TF Weight = sqrt(tf)
IDF Weight = log(numDocs / (df + 1)) + 1

You can see how these statistics are dampened in table 3.2.

Table 3.2. Term frequency and document frequencies dampened with default formulas (IDF calculated for 1,000 documents)

Raw TF   TF weight        Raw DF   IDF weight (for 1,000 docs)
1        1.0              1        7.215
2        1.414            2        6.809
5        2.236            10       5.510
15       3.873            50       3.976
50       7.071            1,000    0.999
1,000    31.623
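You can reproduce table 3.2 with the two dampening formulas above. Lucene’s classic similarity uses the natural logarithm here, which is what makes the numbers line up:

import math

def tf_weight(tf):
    return math.sqrt(tf)                       # dampened term frequency

def idf_weight(df, num_docs=1000):
    return math.log(num_docs / (df + 1)) + 1   # dampened inverse document frequency

for tf in (1, 2, 5, 15, 50, 1000):
    print(tf, round(tf_weight(tf), 3))         # 1.0, 1.414, 2.236, 3.873, 7.071, 31.623

for df in (1, 2, 10, 50, 1000):
    print(df, round(idf_weight(df), 3))        # 7.215, 6.809, 5.51, 3.976, 0.999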

Further, dampening TF × IDF by itself often isn’t sufficient. Term frequency often must be considered relative to the total number of terms in the matched field. For example, does a single mention of alien in a 1,000-page book carry the same weight as a single occurrence of alien in a three-sentence snippet? The short snippet with just one match is likely far more relevant to the term than the book that uses it once. For this reason, in the preceding fieldWeight calculation, TF × IDF is multiplied by fieldNorm, a weight-normalization factor based on the length of the matched field. This normalization factor is calculated as follows:

fieldNorm = 1 / sqrt(fieldLength)

This normalization regulates the impact of TF and IDF on the term’s weight by biasing the score toward matches in shorter fields. Norms are calculated at index time and take up space. Further, depending on the application and user base, they don’t always correlate with our users’ notion of term importance in a piece of text. Luckily, Lucene lets us disable norms completely.
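Here’s the norm calculation in plain Python. One detail worth knowing: Lucene compresses each norm into a single byte at index time, so nearby field lengths collapse to the same stored value; that’s why explain output shows tidy numbers such as 0.25 and 0.078125.

import math

def field_norm(field_length):
    # 1 / sqrt(number of terms in the field), before the lossy byte encoding
    return 1.0 / math.sqrt(field_length)

print(field_norm(2))     # a two-term title        -> ~0.707
print(field_norm(16))    # a short overview        -> 0.25
print(field_norm(170))   # a long, wordy overview  -> ~0.077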

Taken together, Lucene’s classic similarity measures a term’s weight in a piece of text as follows:

TF weight × IDF weight × fieldNorm

Revisiting the fieldWeight calculation, you see this formula in play:

0.4414702, fieldWeight in 31, product of:
  1.4142135, tf(freq=2.0), with freq of:
    2.0, termFreq=2.0
  3.9957323, idf(docFreq=1, maxDocs=40)
  0.078125, fieldNorm(doc=31)
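You can reproduce that fieldWeight by hand from the explain’s inputs (freq, docFreq, maxDocs, and the stored fieldNorm):

import math

tf_wt = math.sqrt(2.0)                   # termFreq = 2.0
idf_wt = math.log(40 / (1 + 1)) + 1      # docFreq = 1, maxDocs = 40 -> ~3.9957
norm = 0.078125                          # fieldNorm as stored for this document

print(tf_wt * idf_wt * norm)             # ~0.44147, the fieldWeight above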
Lucene’s next default similarity: BM25

Over the years, an alternate approach to computing a TF × IDF score has become prevalent in the information retrieval community: Okapi BM25. Because of its proven high performance on article-length text, Lucene’s BM25 similarity will be rolling out as the default similarity for Solr/Elasticsearch, even as you read this book.

What is BM25? Instead of the “fudge factors” discussed previously, BM25 bases its TF × IDF adjustments on more-robust information retrieval findings. These include forcing the impact of TF to reach a saturation point. And instead of the length effect (fieldNorm) always increasing as fields get shorter, BM25 computes the impact of length relative to the average document length (longer-than-average documents are weighted down; shorter-than-average documents are boosted). IDF is computed similarly to classic TF × IDF similarity.

Will BM25 help your relevance? It’s not that simple. As we discussed in chapter 1, information retrieval focuses heavily on broad, incremental improvements to article-length pieces of text. BM25 may not matter for your specific definition of relevance. For this reason, we intentionally eschew the additional complexity of BM25 in this book. Lucene won’t be deprecating classic TF × IDF at all; instead, it will become known as the classic similarity. Don’t be shy about experimenting with both. As for this book’s examples, you can re-create the scoring in future Elasticsearch versions by changing the similarity back to the classic similarity. Finally, every lesson you learn from this book applies, regardless of whether you choose BM25 or classic TF × IDF.
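If you want to follow along with classic TF × IDF scoring on a newer Elasticsearch where BM25 is the default, you can switch the similarity back per field when creating the index. This is a minimal sketch, assuming a local Elasticsearch that still ships the classic similarity, a hypothetical tmdb index with a movie type, and the Python requests library; the text field type and the classic similarity name apply to Elasticsearch 5.x/6.x, and details vary by version.

import json
import requests

mapping = {
    "mappings": {
        "movie": {
            "properties": {
                "title":    {"type": "text", "similarity": "classic"},
                "overview": {"type": "text", "similarity": "classic"}
            }
        }
    }
}

# Create the index with classic TF x IDF scoring on both fields
resp = requests.put("http://localhost:9200/tmdb",
                    data=json.dumps(mapping),
                    headers={"Content-Type": "application/json"})
print(resp.json())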

3.6.7. Factoring in the search term’s importance

The computation of a query’s weight (queryWeight) doesn’t follow the same formula. For nearly all search cases, more occurrences of a term in the query doesn’t mean that term matters more (users almost always list each query term once). Further, because queries are short, there’s little need for length normalization, so that’s omitted as well. What’s left from our preceding weight calculation is the IDF.

Additionally, queryWeight adds two factors:

  • Query-time boosting
  • Query normalization (queryNorm)

First, let’s get queryNorm out of the way: without boosting, queryNorm doesn’t matter. It’s constant across all matches for our search. queryNorm attempts to make scores comparable beyond this single search, but it does a poor job. You should never attempt to compare scores across different fields outside a single search; there’s so much variation in statistics like IDF and TF across fields and text that relevance scores are extremely relative. In fact, dropping this factor comes up regularly on Lucene’s discussion list.[4]

4

See “Whither Query Norm” at http://lucene.472066.n3.nabble.com/Whither-Query-Norm-td600443.html for more details.

What does matter in queryWeight is the boost factor. There’s no boost on our overview clause, but we do boost title matches. Unfortunately, at times the boost can be lost in the queryNorm calculation: the queryNorm for our title match in Alien differs from the queryNorm of the corresponding overview match by a factor of 10. You’ll see how Elasticsearch lets you boost different factors in future chapters.

3.6.8. Fixing Space Jam vs. alien ranking

Finally armed with mastery of Lucene’s scoring, you can compare the Space Jam and Alien explains. Alien has two matches: a strong title match and a much weaker match in the overview field. Space Jam has two matches in the overview field. If you zero in on what’s driving the differences in how the matches are computed, you can see that scores in the overview field are in general considerably weaker than scores in the title field.

You see this with a very high score for Alien’s title match:

3.1940954, weight(title:alien in 223)

compared with the lower relevancy scores for an alien match in the overview field:

0.03975798, weight(overview:alien in 1289)

This difference is roughly two orders of magnitude! Wait, didn’t we explicitly tell the search engine that title is only 10 times as important as overview via boosting? We did apply a boost, but we also learned that scores between fields aren’t at all comparable; they exist entirely in their own scoring universes. Comparing the title match for Alien with the overview match for alien makes this plain.

The driver of the difference lies in the two matches’ fieldWeight scores. These fields have a different character: an overview is roughly paragraph length, whereas a title is just a few terms. This tends to drive the fieldNorms to be quite different. Further, the relative distribution of terms in overview fields doesn’t mirror the term distribution in title fields.

These fields are often driven by how the authors of these fields chose to express themselves. What do movie marketers think when writing an overview? What words do they choose? How pithy or expansive is the writing? How does a movie studio choose a movie title? Do they tie it to existing brands (and thus terms) or are they always trying to be original? Often relevance with text requires both getting in the head of the author (why did the author choose certain language) and the searcher (why does the searcher use particular search terms).

The main implication for our fix is that a good overview score may be significantly smaller than a good title score. A boost of 10 doesn’t mean 10 times the importance to the search engine; it simply applies a multiplier. When you take apart term matching, you see this in stark relief, with overview scores consistently much lower than title scores. The appropriate way to use these field weights is to first get a feel for the rough timbre of each field’s scores before deciding how to apply boosts. It might be more appropriate to boost title by 0.1, and this may still give significantly more weight to a title match than an overview match, simply because of that field’s particular character.

Let’s rerun our query with a more reasonable title boost to see the impact.

Listing 3.14. Searching with an adjusted boost
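Here’s a minimal sketch of the adjusted query, assuming a local Elasticsearch, a hypothetical tmdb index with a movie type, and the Python requests library (these names are assumptions rather than the book’s exact listing); only the title boost changes, dropping from 10 to 0.1.

import json
import requests

users_search = "basketball with cartoon aliens"
search = {
    "query": {
        "multi_match": {
            "query": users_search,
            "fields": ["title^0.1", "overview"]   # title boost lowered from 10 to 0.1
        }
    },
    "size": 10
}

resp = requests.post("http://localhost:9200/tmdb/movie/_search",
                     data=json.dumps(search),
                     headers={"Content-Type": "application/json"})
for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])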

Results:

Num Relevance Score       Movie Title
1   1.0016364             Space Jam
2   0.29594672            Grown Ups
3   0.28491083            Speed Racer
4   0.28491083            The Flintstones
5   0.2536686             White Men Can't Jump
6   0.2536686             Coach Carter
7   0.21968345            Semi-Pro
8   0.20324169            The Thing
9   0.1724563             Meet Dave
10  0.16911241            Teen Wolf

Great! Now that looks much better.

3.7. Solved? Our work is never over!

You’ve made progress and pushed the ball forward, but have you really solved anything? You’re left with several points for improvement.

First, recall how your query works from the preceding validation endpoint:

((title:basketbal title:cartoon title:alien)^10.0) | (overview:basketbal
overview:cartoon overview:alien)

Remember how the | takes the maximum of the two field scores? You can see the same thing in the explain when looking only at the compound queries.

By boosting overview and title more conscientiously, you’ve made it possible that overview might sometimes beat title when the query takes the maximum of the two fields. Why are you taking a maximum? Why is this strategy used by default? Are there other strategies you could use to combine these scores so that it’s not all or nothing between strong title and strong overview matches? You may have solved the Space Jam problem, but what will this max do to other searches? When other searches create different scores, will we be back at square one?
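One alternative worth knowing about now: multi_match accepts a tie_breaker, so the losing field contributes a fraction of its score instead of being thrown away entirely. A sketch, reusing the hypothetical names from listing 3.14:

search = {
    "query": {
        "multi_match": {
            "query": "basketball with cartoon aliens",
            "fields": ["title^0.1", "overview"],
            "tie_breaker": 0.3   # final score = best field + 0.3 x each other field's score
        }
    }
}

Elasticsearch also offers a most_fields strategy that sums the field scores rather than taking the best one; which combination serves your users is a judgment call you’ll revisit.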

Finally, you must ask whether your fieldWeight calculation could be improved. Do we truly care about fieldNorms? Is it important in this use case to bias toward shorter or longer text? There’s also the ever-present struggle of the relevance engineer: do the terms themselves represent the right features latent in the text? Looking at a snippet from Space Jam’s overview, you can ponder a few questions:

Michael Jordan agrees to help the Looney Toons play a basketball game vs. alien slavers to determine their freedom.

Are all the features truly captured in this explain? We didn’t see any weight for a match on cartoon; should toons or looney toons match cartoon? What about Michael Jordan? We humans associate him with basketball; should his name’s presence amplify the weight of the basketball term?
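As a small taste of the kind of fix the next chapter explores, an analyzer with a synonym filter could teach the index that toons is a form of cartoon. This is a minimal sketch; the filter and analyzer names are made up for illustration and aren’t the book’s configuration.

analysis_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "toon_synonyms": {
                    "type": "synonym",
                    "synonyms": ["toons => cartoon"]   # index toons as cartoon
                }
            },
            "analyzer": {
                "english_with_toons": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "toon_synonyms"]
                }
            }
        }
    }
}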

Turning latent features into terms, and combining those features, is the ever-present struggle of the relevance engineer. The next chapter covers this topic in great detail! You’ve begun with great promise here. In later chapters, you’ll continue to explore these issues in greater depth, maximizing your use of the search engine’s tools to match and rank in order to satisfy user and business needs.

3.8. Summary

  • Not getting the expected search results requires debugging matching and ranking.
  • To debug matching, examine how your search engine interprets and executes your query; with Elasticsearch, this corresponds to the query validation endpoint.
  • Search engines require exacting, byte-for-byte term matching to include a document in the search results.
  • Search engines need your help to determine which matches to include/omit (such as stop words) by choosing an appropriate analyzer.
  • Search engines use TF × IDF scoring to rank results. You can see the scoring by using your search engine’s relevance explain output.
  • TF measures the term’s importance in the document’s text. IDF determines the rareness/specialness of a term across the whole corpus. Field norms (norms) bias scoring toward shorter text.
  • Debugging search engine ranking requires understanding fieldWeight (how important search terms are in the field’s text) and queryWeight (how important search terms are in the search query).
  • Relevance scores aren’t easily compared, nor are they normalized. A poor score for a title field may be 10.0; a good search score for the overview field may be 0.01.
  • Boost factors aren’t field priorities. Instead, they let you rebalance relevance scoring. When scoring is balanced, you can prioritize one field or another.