Chapter 14. Multifield Search

Queries are seldom simple one-clause match queries. We frequently need to search for the same or different query strings in one or more fields, which means that we need to be able to combine multiple query clauses and their relevance scores in a way that makes sense.

Perhaps we’re looking for a book called War and Peace by an author called Leo Tolstoy. Perhaps we’re searching the Elasticsearch documentation for “minimum should match,” which might be in the title or the body of a page. Or perhaps we’re searching for users with first name John and last name Smith.

In this chapter, we present the available tools for constructing multiclause searches and how to figure out which solution you should apply to your particular use case.

Multiple Query Strings

The simplest multifield query to deal with is the one where we can map search terms to specific fields. If we know that War and Peace is the title, and Leo Tolstoy is the author, it is easy to write each of these conditions as a match clause and to combine them with a bool query:

GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title":  "War and Peace" }},
        { "match": { "author": "Leo Tolstoy"   }}
      ]
    }
  }
}

The bool query takes a more-matches-is-better approach, so the score from each match clause will be added together to provide the final _score for each document. Documents that match both clauses will score higher than documents that match just one clause.

Of course, you’re not restricted to using just match clauses: the bool query can wrap any other query type, including other bool queries. We could add a clause to specify that we prefer to see versions of the book that have been translated by specific translators:

GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title":  "War and Peace" }},
        { "match": { "author": "Leo Tolstoy"   }},
        { "bool":  {
          "should": [
            { "match": { "translator": "Constance Garnett" }},
            { "match": { "translator": "Louise Maude"      }}
          ]
        }}
      ]
    }
  }
}

Why did we put the translator clauses inside a separate bool query? All four match queries are should clauses, so why didn’t we just put the translator clauses at the same level as the title and author clauses?

The answer lies in how the score is calculated. The bool query runs each match query, adds their scores together, then multiplies by the number of matching clauses, and divides by the total number of clauses. Each clause at the same level has the same weight. In the preceding query, the bool query containing the translator clauses counts for one-third of the total score. If we had put the translator clauses at the same level as title and author, they would have reduced the contribution of the title and author clauses to one-quarter each.
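The one-third versus one-quarter split can be checked with a little arithmetic. The following Python sketch is our own illustration, not part of Elasticsearch: the hypothetical `clause_shares` function divides the weight evenly among sibling clauses and recurses into nested lists, which stand in for nested bool queries. It models only the nominal weight of each clause, not real relevance scores.

```python
from fractions import Fraction

def clause_shares(clauses, share=Fraction(1)):
    """Split `share` evenly among sibling clauses, recursing into nested lists."""
    shares = {}
    each = share / len(clauses)
    for clause in clauses:
        if isinstance(clause, list):
            # A nested list stands in for a nested bool query.
            shares.update(clause_shares(clause, each))
        else:
            shares[clause] = each
    return shares

# Translator clauses wrapped in their own bool query.
nested = clause_shares(["title", "author", ["garnett", "maude"]])
# All four clauses at the same level.
flat = clause_shares(["title", "author", "garnett", "maude"])

print(nested)
print(flat)
```

In the nested layout, title and author each keep one-third of the weight and the two translator clauses share the remaining third; in the flat layout every clause drops to one-quarter.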

Prioritizing Clauses

It is likely that an even one-third split between clauses is not what we need for the preceding query. Probably we’re more interested in the title and author clauses than we are in the translator clauses. We need to tune the query to make the title and author clauses relatively more important.

The simplest weapon in our tuning arsenal is the boost parameter. To increase the weight of the title and author fields, give them a boost value higher than 1:

GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { 
            "title":  {
              "query": "War and Peace",
              "boost": 2
        }}},
        { "match": { 
            "author":  {
              "query": "Leo Tolstoy",
              "boost": 2
        }}},
        { "bool":  { 
            "should": [
              { "match": { "translator": "Constance Garnett" }},
              { "match": { "translator": "Louise Maude"      }}
            ]
        }}
      ]
    }
  }
}

The title and author clauses have a boost value of 2.
The nested bool clause has the default boost of 1.

The “best” value for the boost parameter is most easily determined by trial and error: set a boost value, run test queries, repeat. A reasonable range for boost lies between 1 and 10, maybe 15. Boosts higher than that have little more impact because scores are normalized.
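Trial and error is less tedious when the request body is generated rather than edited by hand. The sketch below, in Python, builds the preceding query for a few candidate boost values. The helper names `boosted_match` and `book_query` are hypothetical, and sending the request to Elasticsearch is deliberately left out.

```python
import json

def boosted_match(field, text, boost=1):
    """Return a match clause for `field`, spelling out `boost` only when it is not 1."""
    if boost == 1:
        return {"match": {field: text}}
    return {"match": {field: {"query": text, "boost": boost}}}

def book_query(title_boost, author_boost):
    """Build the War and Peace query with the given title and author boosts."""
    return {"query": {"bool": {"should": [
        boosted_match("title", "War and Peace", title_boost),
        boosted_match("author", "Leo Tolstoy", author_boost),
        {"bool": {"should": [
            boosted_match("translator", "Constance Garnett"),
            boosted_match("translator", "Louise Maude"),
        ]}},
    ]}}}

# Print one request body per candidate boost; run each and compare the results.
for boost in (1, 2, 5):
    print(json.dumps(book_query(boost, boost)))
```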

Single Query String

The bool query is the mainstay of multiclause queries. It works well for many cases, especially when you are able to map different query strings to individual fields.

The problem is that, these days, users expect to be able to type all of their search terms into a single field, and expect that the application will figure out how to give them the right results. It is ironic that the multifield search form is known as Advanced Search—it may appear advanced to the user, but it is much simpler to implement.

There is no simple one-size-fits-all approach to multiword, multifield queries. To get the best results, you have to know your data and know how to use the appropriate tools.

Know Your Data

When your only user input is a single query string, you will encounter three scenarios frequently:

Best fields

When searching for words that represent a concept, such as “brown fox,” the words mean more together than they do individually. Fields like the title and body, while related, can be considered to be in competition with each other. Documents should have as many words as possible in the same field, and the score should come from the best-matching field.

Most fields

A common technique for fine-tuning relevance is to index the same data into multiple fields, each with its own analysis chain.

The main field may contain words in their stemmed form, synonyms, and words stripped of their diacritics, or accents. It is used to match as many documents as possible.

The same text could then be indexed in other fields to provide more-precise matching. One field may contain the unstemmed version, another the original word with accents, and a third might use shingles to provide information about word proximity.

These other fields act as signals to increase the relevance score of each matching document. The more fields that match, the better.

Cross fields

For some entities, the identifying information is spread across multiple fields, each of which contains just a part of the whole:

Person: first_name and last_name
Book: title, author, and description
Address: street, city, country, and postcode

In this case, we want to find as many words as possible in any of the listed fields. We need to search across multiple fields as if they were one big field.

All of these are multiword, multifield queries, but each requires a different strategy. We will examine each strategy in turn in the rest of this chapter.

Best Fields

Imagine that we have a website that allows users to search blog posts, such as these two documents:

PUT /my_index/my_type/1
{
    "title": "Quick brown rabbits",
    "body":  "Brown rabbits are commonly seen."
}

PUT /my_index/my_type/2
{
    "title": "Keeping pets healthy",
    "body":  "My quick brown fox eats rabbits on a regular basis."
}

The user types in the words “Brown fox” and clicks Search. We don’t know ahead of time if the user’s search terms will be found in the title or the body field of the post, but it is likely that the user is searching for related words. To our eyes, document 2 appears to be the better match, as it contains both words that we are looking for.

Now we run the following bool query:

{
    "query": {
        "bool": {
            "should": [
                { "match": { "title": "Brown fox" }},
                { "match": { "body":  "Brown fox" }}
            ]
        }
    }
}

And we find that this query gives document 1 the higher score:

{
  "hits": [
     {
        "_id":      "1",
        "_score":   0.14809652,
        "_source": {
           "title": "Quick brown rabbits",
           "body":  "Brown rabbits are commonly seen."
        }
     },
     {
        "_id":      "2",
        "_score":   0.09256032,
        "_source": {
           "title": "Keeping pets healthy",
           "body":  "My quick brown fox eats rabbits on a regular basis."
        }
     }
  ]
}

To understand why, think about how the bool query calculates its score:

It runs both of the queries in the should clause.
It adds their scores together.
It multiplies the total by the number of matching clauses.
It divides the result by the total number of clauses (two).

Document 1 contains the word brown in both fields, so both match clauses are successful and have a score. Document 2 contains both brown and fox in the body field but neither word in the title field. The high score from the body query is added to the zero score from the title query, and multiplied by one-half, resulting in a lower overall score than for document 1.
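The four steps can be written out as a small Python function. The per-clause scores below are invented for illustration; only the way they are combined follows the steps described above.

```python
def bool_should_score(clause_scores):
    """Sum the matching clause scores, then scale by matching / total clauses."""
    matching = [score for score in clause_scores if score > 0]
    return sum(matching) * len(matching) / len(clause_scores)

# Hypothetical [title, body] scores for the query "Brown fox".
doc1 = bool_should_score([0.4, 0.4])  # "brown" matches weakly in both fields
doc2 = bool_should_score([0.0, 0.7])  # both words match, but only in the body

print(doc1, doc2)
```

Even though document 2 has the strongest single clause, the factor of one-half for matching only one of two clauses pulls it below document 1.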

In this example, the title and body fields are competing with each other. We want to find the single best-matching field.

What if, instead of combining the scores from each field, we used the score from the best-matching field as the overall score for the query? This would give preference to a single field that contains both of the words we are looking for, rather than the same word repeated in different fields.

dis_max Query

Instead of the bool query, we can use the dis_max or Disjunction Max Query. Disjunction means or (while conjunction means and), so the Disjunction Max Query simply means: return documents that match any of these queries, and return the score of the best-matching query:

{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "Brown fox" }},
                { "match": { "body":  "Brown fox" }}
            ]
        }
    }
}

This produces the results that we want:

{
  "hits": [
     {
        "_id":      "2",
        "_score":   0.21509302,
        "_source": {
           "title": "Keeping pets healthy",
           "body":  "My quick brown fox eats rabbits on a regular basis."
        }
     },
     {
        "_id":      "1",
        "_score":   0.12713557,
        "_source": {
           "title": "Quick brown rabbits",
           "body":  "Brown rabbits are commonly seen."
        }
     }
  ]
}

Tuning Best Fields Queries

What would happen if the user had searched instead for “quick pets”? Both documents contain the word quick, but only document 2 contains the word pets. Neither document contains both words in the same field.

A simple dis_max query like the following would choose the single best matching field, and ignore the other:

{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "Quick pets" }},
                { "match": { "body":  "Quick pets" }}
            ]
        }
    }
}

{
  "hits": [
     {
        "_id": "1",
        "_score": 0.12713557, 
        "_source": {
           "title": "Quick brown rabbits",
           "body": "Brown rabbits are commonly seen."
        }
     },
     {
        "_id": "2",
        "_score": 0.12713557, 
        "_source": {
           "title": "Keeping pets healthy",
           "body": "My quick brown fox eats rabbits on a regular basis."
        }
     }
   ]
}

Note that the scores are exactly the same.

We would probably expect documents that match on both the title field and the body field to rank higher than documents that match on just one field, but this isn’t the case. Remember: the dis_max query simply uses the _score from the single best-matching clause.

tie_breaker

It is possible, however, to also take the _score from the other matching clauses into account, by specifying the tie_breaker parameter:

{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "Quick pets" }},
                { "match": { "body":  "Quick pets" }}
            ],
            "tie_breaker": 0.3
        }
    }
}

This gives us the following results:

{
  "hits": [
     {
        "_id": "2",
        "_score": 0.14757764, 
        "_source": {
           "title": "Keeping pets healthy",
           "body": "My quick brown fox eats rabbits on a regular basis."
        }
     },
     {
        "_id": "1",
        "_score": 0.124275915, 
        "_source": {
           "title": "Quick brown rabbits",
           "body": "Brown rabbits are commonly seen."
        }
     }
   ]
}

Document 2 now has a small lead over document 1.

The tie_breaker parameter makes the dis_max query behave more like a halfway house between dis_max and bool. It changes the score calculation as follows:

Take the _score of the best-matching clause.
Multiply the score of each of the other matching clauses by the tie_breaker.
Add them all together and normalize.

With the tie_breaker, all matching clauses count, but the best-matching clause counts most.

Note

The tie_breaker can be a floating-point value between 0 and 1, where 0 uses just the best-matching clause and 1 counts all matching clauses equally. The exact value can be tuned based on your data and queries, but a reasonable value should be close to zero (for example, 0.1 to 0.4), in order not to overwhelm the best-matching nature of dis_max.
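The effect of tie_breaker is easy to see in a few lines of Python. This is a simplified sketch: the clause scores are invented, and the normalization step is left out, so the numbers do not correspond to real _score values.

```python
def dis_max_score(clause_scores, tie_breaker=0.0):
    """Best clause score plus tie_breaker times the other matching scores."""
    matching = [score for score in clause_scores if score > 0]
    if not matching:
        return 0.0
    best = max(matching)
    return best + tie_breaker * (sum(matching) - best)

# Hypothetical [title, body] scores for the query "Quick pets".
doc1 = [0.5, 0.0]  # "quick" in the title only
doc2 = [0.5, 0.3]  # "pets" in the title, "quick" in the body

for tie_breaker in (0.0, 0.3, 1.0):
    print(tie_breaker,
          dis_max_score(doc1, tie_breaker),
          dis_max_score(doc2, tie_breaker))
```

With a tie_breaker of 0 the two documents tie, because only the best clause counts. Any value above 0 lets the second matching field lift document 2, and a value of 1 reduces to a plain sum of all matching clauses.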

multi_match Query

The multi_match query provides a convenient shorthand way of running the same query against multiple fields.

Note

There are several types of multi_match query, three of which just happen to coincide with the three scenarios that we listed in “Know Your Data”: best_fields, most_fields, and cross_fields.

By default, this query runs as type best_fields, which means that it generates a match query for each field and wraps them in a dis_max query. This dis_max query

{
  "dis_max": {
    "queries":  [
      {
        "match": {
          "title": {
            "query": "Quick brown fox",
            "minimum_should_match": "30%"
          }
        }
      },
      {
        "match": {
          "body": {
            "query": "Quick brown fox",
            "minimum_should_match": "30%"
          }
        }
      }
    ],
    "tie_breaker": 0.3
  }
}

could be rewritten more concisely with multi_match as follows:

{
    "multi_match": {
        "query":                "Quick brown fox",
        "type":                 "best_fields", 
        "fields":               [ "title", "body" ],
        "tie_breaker":          0.3,
        "minimum_should_match": "30%" 
    }
}

The best_fields type is the default and can be left out.
Parameters like minimum_should_match or operator are passed through to the generated match queries.
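To make the equivalence concrete, here is a Python sketch that performs the same rewrite on plain dictionaries. The function `expand_best_fields` is hypothetical and only mirrors the expansion described here; it is not how Elasticsearch implements multi_match, and it does not handle wildcards or per-field boosts.

```python
def expand_best_fields(multi_match):
    """Rewrite a best_fields multi_match body as the equivalent dis_max body."""
    params = dict(multi_match)
    query = params.pop("query")
    fields = params.pop("fields")
    params.pop("type", None)  # best_fields is assumed
    tie_breaker = params.pop("tie_breaker", None)
    # Whatever is left (operator, minimum_should_match, ...) is passed
    # through to every generated match query.
    queries = [{"match": {field: {"query": query, **params}}} for field in fields]
    dis_max = {"queries": queries}
    if tie_breaker is not None:
        dis_max["tie_breaker"] = tie_breaker
    return {"dis_max": dis_max}

expanded = expand_best_fields({
    "query": "Quick brown fox",
    "type": "best_fields",
    "fields": ["title", "body"],
    "tie_breaker": 0.3,
    "minimum_should_match": "30%",
})
print(expanded)
```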

Using Wildcards in Field Names

Field names can be specified with wildcards: any field that matches the wildcard pattern will be included in the search. You could match on the book_title, chapter_title, and section_title fields, with the following:

{
    "multi_match": {
        "query":  "Quick brown fox",
        "fields": "*_title"
    }
}

Boosting Individual Fields

Individual fields can be boosted by using the caret (^) syntax: just add ^boost after the field name, where boost is a floating-point number:

{
    "multi_match": {
        "query":  "Quick brown fox",
        "fields": [ "*_title", "chapter_title^2" ] 
    }
}

The chapter_title field has a boost of 2, while the book_title and section_title fields have a default boost of 1.
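When field lists are assembled in application code, it can help to reason about the two notations separately. The following Python sketch is illustrative only: `matching_fields` uses shell-style matching from the standard library as a stand-in for wildcard expansion, and `parse_field_spec` splits the caret syntax into a name and a boost. Both helper names are hypothetical, and the sketch says nothing about how Elasticsearch resolves a field that is covered by more than one entry.

```python
from fnmatch import fnmatchcase

def matching_fields(pattern, mapped_fields):
    """Return the mapped field names that a wildcard pattern would select."""
    return [name for name in mapped_fields if fnmatchcase(name, pattern)]

def parse_field_spec(spec):
    """Split 'field^boost' into (field, boost); the boost defaults to 1.0."""
    name, caret, boost = spec.partition("^")
    return name, float(boost) if caret else 1.0

fields = ["book_title", "chapter_title", "section_title", "body"]
print(matching_fields("*_title", fields))
print(parse_field_spec("chapter_title^2"))
print(parse_field_spec("*_title"))
```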

Most Fields

Full-text search is a battle between recall—returning all the documents that are relevant—and precision—not returning irrelevant documents. The goal is to present the user with the most relevant documents on the first page of results.

To improve recall, we cast the net wide—we include not only documents that match the user’s search terms exactly, but also documents that we believe to be pertinent to the query. If a user searches for “quick brown fox,” a document that contains fast foxes may well be a reasonable result to return.

If the only pertinent document that we have is the one containing fast foxes, it will appear at the top of the results list. But of course, if we have 100 documents that contain the words quick brown fox, then the fast foxes document may be considered less relevant, and we would want to push it further down the list. After including many potential matches, we need to ensure that the best ones rise to the top.

A common technique for fine-tuning full-text relevance is to index the same text in multiple ways, each of which provides a different relevance signal. The main field would contain terms in their broadest-matching form to match as many documents as possible. For instance, we could do the following:

Use a stemmer to index jumps, jumping, and jumped as their root form: jump. Then it doesn’t matter if the user searches for jumped; we could still match documents containing jumping.
Include synonyms like jump, leap, and hop.
Remove diacritics, or accents: for example, ésta, está, and esta would all be indexed without accents as esta.
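As a rough illustration of the last point, the Python sketch below strips accents by decomposing each character and dropping the combining marks. It is comparable in spirit to what an accent-folding analysis step does, but it is not the analysis chain that Elasticsearch itself runs.

```python
import unicodedata

def strip_diacritics(text):
    """Decompose characters (NFD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# All three spellings collapse to the same indexed form.
folded = [strip_diacritics(word) for word in ("ésta", "está", "esta")]
print(folded)
```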

However, if we have two documents, one of which contains jumped and the other jumping, the user would probably expect the first document to rank higher, as it contains exactly what was typed in.

We can achieve this by indexing the same text in other fields to provide more-precise matching. One field may contain the unstemmed version, another the original word with diacritics, and a third might use shingles to provide information about word proximity. These other fields act as signals that increase the relevance score of each matching document. The more fields that match, the better.

A document is included in the results list if it matches the broad-matching main field. If it also matches the signal fields, it gets extra points and is pushed up the results list.

We discuss synonyms, word proximity, partial-matching and other potential signals later in the book, but we will use the simple example of stemmed and unstemmed fields to illustrate this technique.

Multifield Mapping

The first thing to do is to set up our field to be indexed twice: once in a stemmed form and once in an unstemmed form. To do this, we will use multifields, which we introduced in “String Sorting and Multifields”:

DELETE /my_index

PUT /my_index
{
    "settings": { "number_of_shards": 1 }, 
    "mappings": {
        "my_type": {
            "properties": {
                "title": { 
                    "type":     "string",
                    "analyzer": "english",
                    "fields": {
                        "std":   { 
                            "type":     "string",
                            "analyzer": "standard"
                        }
                    }
                }
            }
        }
    }
}

The number_of_shards setting is explained in “Relevance Is Broken!”.
The title field is stemmed by the english analyzer.
The title.std field uses the standard analyzer and so is not stemmed.

Next we index some documents:

PUT /my_index/my_type/1
{ "title": "My rabbit jumps" }

PUT /my_index/my_type/2
{ "title": "Jumping jack rabbits" }

Here is a simple match query on the title field for jumping rabbits:

GET /my_index/_search
{
   "query": {
        "match": {
            "title": "jumping rabbits"
        }
    }
}

This becomes a query for the two stemmed terms jump and rabbit, thanks to the english analyzer. The title field of both documents contains both of those terms, so both documents receive the same score:

{
  "hits": [
     {
        "_id": "1",
        "_score": 0.42039964,
        "_source": {
           "title": "My rabbit jumps"
        }
     },
     {
        "_id": "2",
        "_score": 0.42039964,
        "_source": {
           "title": "Jumping jack rabbits"
        }
     }
  ]
}

If we were to query just the title.std field, then only document 2 would match. However, if we were to query both fields and to combine their scores by using the bool query, then both documents would match (thanks to the title field) and document 2 would score higher (thanks to the title.std field):

GET /my_index/_search
{
   "query": {
        "multi_match": {
            "query":  "jumping rabbits",
            "type":   "most_fields", 
            "fields": [ "title", "title.std" ]
        }
    }
}

We want to combine the scores from all matching fields, so we use the most_fields type. This causes the multi_match query to wrap the two field-clauses in a bool query instead of a dis_max query.

{
  "hits": [
     {
        "_id": "2",
        "_score": 0.8226396, 
        "_source": {
           "title": "Jumping jack rabbits"
        }
     },
     {
        "_id": "1",
        "_score": 0.10741998, 
        "_source": {
           "title": "My rabbit jumps"
        }
     }
  ]
}

Document 2 now scores much higher than document 1.

We are using the broad-matching title field to include as many documents as possible—to increase recall—but we use the title.std field as a signal to push the most relevant results to the top.
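The reason document 2 wins can be reproduced with a toy model of the two analysis chains. In this Python sketch, `toy_stem` is a deliberately crude suffix stripper standing in for the english analyzer, and `standard_tokens` simply lowercases and splits on whitespace; neither matches the real analyzers beyond this example.

```python
def standard_tokens(text):
    """Lowercase and split on whitespace (toy stand-in for title.std)."""
    return text.lower().split()

def toy_stem(token):
    """Strip a few common suffixes (crude stand-in for English stemming)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def stemmed_tokens(text):
    """Toy stand-in for the stemmed title field."""
    return [toy_stem(token) for token in standard_tokens(text)]

def matched_terms(query, title, analyze):
    """Query terms that also occur in the title under the given analysis."""
    return sorted(set(analyze(query)) & set(analyze(title)))

query = "jumping rabbits"
for title in ("My rabbit jumps", "Jumping jack rabbits"):
    print(title,
          matched_terms(query, title, stemmed_tokens),
          matched_terms(query, title, standard_tokens))
```

Both titles match both stemmed terms, so the stemmed field cannot separate them. Only the second title matches in the unstemmed field, and that extra match is the signal that pushes it to the top.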

The contribution of each field to the final score can be controlled by specifying custom boost values. For instance, we could boost the title field to make it the most important field, thus reducing the effect of any other signal fields:

GET /my_index/_search
{
   "query": {
        "multi_match": {
            "query":       "jumping rabbits",
            "type":        "most_fields",
            "fields":      [ "title^10", "title.std" ] 
        }
    }
}

The boost value of 10 on the title field makes that field relatively much more important than the title.std field.

Cross-fields Entity Search

Now we come to a common pattern: cross-fields entity search. With entities like person, product, or address, the identifying information is spread across several fields. We may have a person indexed as follows:

{
    "first_name":  "Peter",
    "last_name":   "Smith"
}

Or an address like this:

{
    "street":   "5 Poland Street",
    "city":     "London",
    "country":  "United Kingdom",
    "postcode": "W1V 3DG"
}

This sounds a lot like the example we described in “Multiple Query Strings”, but there is a big difference between these two scenarios. In “Multiple Query Strings”, we used a separate query string for each field. In this scenario, we want to search across multiple fields with a single query string.

Our user might search for the person “Peter Smith” or for the address “Poland Street W1V.” Each of those words appears in a different field, so using a dis_max / best_fields query to find the single best-matching field is clearly the wrong approach.

A Naive Approach

Really, we want to query each field in turn and add up the scores of every field that matches, which sounds like a job for the bool query:

{
  "query": {
    "bool": {
      "should": [
        { "match": { "street":    "Poland Street W1V" }},
        { "match": { "city":      "Poland Street W1V" }},
        { "match": { "country":   "Poland Street W1V" }},
        { "match": { "postcode":  "Poland Street W1V" }}
      ]
    }
  }
}

Repeating the query string for every field soon becomes tedious. We can use the multi_match query instead, and set the type to most_fields to tell it to combine the scores of all matching fields:

{
  "query": {
    "multi_match": {
      "query":       "Poland Street W1V",
      "type":        "most_fields",
      "fields":      [ "street", "city", "country", "postcode" ]
    }
  }
}

Problems with the most_fields Approach

The most_fields approach to entity search has some problems that are not immediately obvious:

It is designed to find the most fields matching any words, rather than to find the most matching words across all fields.
It can’t use the operator or minimum_should_match parameters to reduce the long tail of less-relevant results.
Term frequencies are different in each field and could interfere with each other to produce badly ordered results.

Field-Centric Queries

All three of the preceding problems stem from most_fields being field-centric rather than term-centric: it looks for the most matching fields, when really what we’re interested in is the most matching terms.

Note

The best_fields type is also field-centric and suffers from similar problems.

First we’ll look at why these problems exist, and then how we can combat them.

Problem 1: Matching the Same Word in Multiple Fields

Think about how the most_fields query is executed: Elasticsearch generates a separate match query for each field and then wraps these match queries in an outer bool query.

We can see this by passing our query through the validate-query API:

GET /_validate/query?explain
{
  "query": {
    "multi_match": {
      "query":   "Poland Street W1V",
      "type":    "most_fields",
      "fields":  [ "street", "city", "country", "postcode" ]
    }
  }
}

which yields this explanation:

(street:poland   street:street   street:w1v)
(city:poland     city:street     city:w1v)
(country:poland  country:street  country:w1v)
(postcode:poland postcode:street postcode:w1v)

You can see that a document matching just the word poland in two fields could score higher than a document matching poland and street in one field.

Problem 2: Trimming the Long Tail

In “Controlling Precision”, we talked about using the and operator or the minimum_should_match parameter to trim the long tail of almost irrelevant results. Perhaps we could try this:

{
    "query": {
        "multi_match": {
            "query":       "Poland Street W1V",
            "type":        "most_fields",
            "operator":    "and", 
            "fields":      [ "street", "city", "country", "postcode" ]
        }
    }
}

The and operator requires all terms to be present.

However, with best_fields or most_fields, these parameters are passed down to the generated match queries. The explanation for this query shows the following:

(+street:poland   +street:street   +street:w1v)
(+city:poland     +city:street     +city:w1v)
(+country:poland  +country:street  +country:w1v)
(+postcode:poland +postcode:street +postcode:w1v)

In other words, using the and operator means that all words must exist in the same field, which is clearly wrong! It is unlikely that any documents would match this query.

Problem 3: Term Frequencies

In “What Is Relevance?”, we explained that the default similarity algorithm used to calculate the relevance score for each term is TF/IDF:

Term frequency: The more often a term appears in a field in a single document, the more relevant the document.
Inverse document frequency: The more often a term appears in a field in all documents in the index, the less relevant is that term.

When searching against multiple fields, TF/IDF can introduce some surprising results.

Consider our example of searching for “Peter Smith” using the first_name and last_name fields. Peter is a common first name and Smith is a common last name—both will have low IDFs. But what if we have another person in the index whose name is Smith Williams? Smith as a first name is very uncommon and so will have a high IDF!

A simple query like the following may well return Smith Williams above Peter Smith in spite of the fact that the second person is a better match than the first.

{
    "query": {
        "multi_match": {
            "query":       "Peter Smith",
            "type":        "most_fields",
            "fields":      [ "*_name" ]
        }
    }
}

The high IDF of smith in the first name field can overwhelm the two low IDFs of peter as a first name and smith as a last name.
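A quick calculation shows how lopsided the numbers can become. The Python sketch below assumes a common IDF formulation, 1 + ln(numDocs / (docFreq + 1)), and invented document frequencies for an index of one million people. It treats IDF alone as a stand-in for the score, which ignores the other factors that go into a real _score.

```python
import math

def idf(doc_freq, num_docs):
    """Inverse document frequency, using an assumed classic-style formula."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

NUM_DOCS = 1_000_000  # hypothetical index size

# Hypothetical number of documents containing each term in each field.
peter_first = idf(20_000, NUM_DOCS)  # common first name
smith_last = idf(30_000, NUM_DOCS)   # common last name
smith_first = idf(5, NUM_DOCS)       # very rare first name

print(round(peter_first, 2), round(smith_last, 2), round(smith_first, 2))
```

With these made-up counts, the single rare match on smith as a first name carries more weight than the two common matches for Peter Smith put together.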

Solution

These problems only exist because we are dealing with multiple fields. If we were to combine all of these fields into a single field, the problems would vanish. We could achieve this by adding a full_name field to our person document:

{
    "first_name":  "Peter",
    "last_name":   "Smith",
    "full_name":   "Peter Smith"
}

When querying just the full_name field:

Documents with more matching words would trump documents with the same word repeated.
The minimum_should_match and operator parameters would function as expected.
The inverse document frequencies for first and last names would be combined so it wouldn’t matter whether Smith were a first or last name anymore.

While this would work, we don’t like having to store redundant data. Instead, Elasticsearch offers us two solutions—one at index time and one at search time—which we discuss next.

Custom _all Fields

In “Metadata: _all Field”, we explained that the special _all field indexes the values from all other fields as one big string. Having all fields indexed into one field is not terribly flexible, though. It would be nice to have one custom _all field for the person’s name, and another custom _all field for the address.

Elasticsearch provides us with this functionality via the copy_to parameter in a field mapping:

PUT /my_index
{
    "mappings": {
        "person": {
            "properties": {
                "first_name": {
                    "type":     "string",
                    "copy_to":  "full_name" 
                },
                "last_name": {
                    "type":     "string",
                    "copy_to":  "full_name" 
                },
                "full_name": {
                    "type":     "string"
                }
            }
        }
    }
}

The values in the first_name and last_name fields are also copied to the full_name field.

With this mapping in place, we can query the first_name field for first names, the last_name field for last names, or the full_name field for first and last names.

Note

Mappings of the first_name and last_name fields have no bearing on how the full_name field is indexed. The full_name field copies the string values from the other two fields, then indexes them according to the mapping of the full_name field only.
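The effect of copy_to on what ends up in each field can be mimicked in a few lines of Python. This is a simplified model for illustration only: the hypothetical `apply_copy_to` function handles a single string target per field and ignores analysis entirely.

```python
def apply_copy_to(source, mapping):
    """Return the raw values that would be indexed per field, following copy_to."""
    indexed = {field: [value] for field, value in source.items()}
    for field, value in source.items():
        target = mapping.get(field, {}).get("copy_to")
        if target:
            # The original string is copied; the source document is unchanged.
            indexed.setdefault(target, []).append(value)
    return indexed

mapping = {
    "first_name": {"type": "string", "copy_to": "full_name"},
    "last_name": {"type": "string", "copy_to": "full_name"},
    "full_name": {"type": "string"},
}
source = {"first_name": "Peter", "last_name": "Smith"}
indexed = apply_copy_to(source, mapping)
print(indexed)
```

The document that the application sends contains only first_name and last_name; the full_name values exist only on the indexing side.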

cross-fields Queries

The custom _all approach is a good solution, as long as you thought about setting it up before you indexed your documents. However, Elasticsearch also provides a search-time solution to the problem: the multi_match query with type cross_fields. The cross_fields type takes a term-centric approach, quite different from the field-centric approach taken by best_fields and most_fields. It treats all of the fields as one big field, and looks for each term in any field.

To illustrate the difference between field-centric and term-centric queries, look at the explanation for this field-centric most_fields query:

GET /_validate/query?explain
{
    "query": {
        "multi_match": {
            "query":       "peter smith",
            "type":        "most_fields",
            "operator":    "and", 
            "fields":      [ "first_name", "last_name" ]
        }
    }
}

With the and operator, all terms are required.

For a document to match, both peter and smith must appear in the same field, either the first_name field or the last_name field:

(+first_name:peter +first_name:smith)
(+last_name:peter  +last_name:smith)

A term-centric approach would use this logic instead:

+(first_name:peter last_name:peter)
+(first_name:smith last_name:smith)

In other words, the term peter must appear in either field, and the term smith must appear in either field.
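The two pieces of boolean logic differ only in the order of the "all" and the "any". The Python sketch below applies both to a single hypothetical document whose fields have already been reduced to sets of terms; the function names are ours.

```python
def field_centric_and(doc, terms, fields):
    """Some single field must contain every term."""
    return any(all(term in doc.get(field, ()) for term in terms)
               for field in fields)

def term_centric_and(doc, terms, fields):
    """Every term must appear in at least one of the fields."""
    return all(any(term in doc.get(field, ()) for field in fields)
               for term in terms)

doc = {"first_name": {"peter"}, "last_name": {"smith"}}
terms = ["peter", "smith"]
fields = ["first_name", "last_name"]

print(field_centric_and(doc, terms, fields))
print(term_centric_and(doc, terms, fields))
```

The field-centric check rejects the very document we are looking for, because no single field holds both words, while the term-centric check accepts it.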

The cross_fields type first analyzes the query string to produce a list of terms, and then it searches for each term in any field. That difference alone solves two of the three problems that we listed in “Field-Centric Queries”, leaving us just with the issue of differing inverse document frequencies.

Fortunately, the cross_fields type solves this too, as can be seen from this validate-query request:

GET /_validate/query?explain
{
    "query": {
        "multi_match": {
            "query":       "peter smith",
            "type":        "cross_fields", 
            "operator":    "and",
            "fields":      [ "first_name", "last_name" ]
        }
    }
}

Setting type to cross_fields selects term-centric matching.

It solves the term-frequency problem by blending inverse document frequencies across fields:

+blended("peter", fields: [first_name, last_name])
+blended("smith", fields: [first_name, last_name])

In other words, it looks up the IDF of smith in both the first_name and the last_name fields and uses the minimum of the two as the IDF for both fields. The fact that smith is a common last name means that it will be treated as a common first name too.
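As a rough illustration of the blending described above, the following Python sketch computes a per-field IDF for smith and then takes the minimum. The document counts are invented for the example, and the formula is the classic Lucene TF/IDF one, used here as an assumption; the real blended query works on index statistics internally and may differ in detail.

```python
import math

def idf(doc_freq, num_docs):
    """Classic Lucene-style inverse document frequency (assumed for illustration)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical index statistics: "smith" is rare as a first name
# but common as a last name.
num_docs = 1000
doc_freq_smith = {"first_name": 2, "last_name": 300}

# Without blending, each field scores the term with its own IDF.
per_field_idf = {field: idf(df, num_docs) for field, df in doc_freq_smith.items()}

# With blending, the lowest IDF (the one from the field where the term
# is most common) is used for every field.
blended_idf = min(per_field_idf.values())

for field, value in per_field_idf.items():
    print(f"{field}: own idf={value:.3f}, blended idf={blended_idf:.3f}")
```

Without blending, a stray smith in the first_name field would receive a much higher IDF than smith in the last_name field, which is the problem that blending removes.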

Note

For the cross_fields query type to work optimally, all fields should have the same analyzer. Fields that share an analyzer are grouped together as blended fields.

If you include fields with a different analysis chain, they will be added to the query in the same way as for best_fields. For instance, if we added the title field to the preceding query (assuming it uses a different analyzer), the explanation would be as follows:

(+title:peter +title:smith)
(
  +blended("peter", fields: [first_name, last_name])
  +blended("smith", fields: [first_name, last_name])
)

This is particularly important when using the minimum_should_match and operator parameters.
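The grouping behavior can be pictured with a small Python sketch. The mapping of fields to analyzers below is hypothetical (we simply assume that title uses the english analyzer while the name fields use standard), and the helper function is ours, not part of any Elasticsearch API.

```python
from collections import defaultdict

def group_by_analyzer(field_analyzers):
    """Group field names by the analyzer configured for each field."""
    groups = defaultdict(list)
    for field, analyzer in field_analyzers.items():
        groups[analyzer].append(field)
    return dict(groups)

# Hypothetical mapping: which analyzer each queried field uses.
field_analyzers = {
    "first_name": "standard",
    "last_name": "standard",
    "title": "english",
}

groups = group_by_analyzer(field_analyzers)
for analyzer, fields in groups.items():
    # Each group becomes its own clause; terms are blended only within a group.
    print(analyzer, fields)
```

Because each group becomes a separate clause, operator and minimum_should_match are applied within a group rather than across all fields: with operator set to and, every term must be found within the fields of a single group.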

Per-Field Boosting

One of the advantages of using the cross_fields query over custom _all fields is that you can boost individual fields at query time.

For fields of equal value like first_name and last_name, this generally isn’t required, but if you were searching for books using the title and description fields, you might want to give more weight to the title field. This can be done as described before with the caret (^) syntax:

GET /books/_search
{
    "query": {
        "multi_match": {
            "query":       "peter smith",
            "type":        "cross_fields",
            "fields":      [ "title^2", "description" ] 
        }
    }
}

The title field has a boost of 2, while the description field has the default boost of 1.

The advantage of being able to boost individual fields should be weighed against the cost of querying multiple fields instead of querying a single custom _all field. Use whichever of the two solutions delivers the most bang for your buck.

Exact-Value Fields

The final topic that we should touch on before leaving multifield queries is that of exact-value not_analyzed fields. It is not useful to mix not_analyzed fields with analyzed fields in multi_match queries.

The reason for this can be demonstrated easily by looking at a query explanation. Imagine that we have set the title field to be not_analyzed:

GET /_validate/query?explain
{
    "query": {
        "multi_match": {
            "query":       "peter smith",
            "type":        "cross_fields",
            "fields":      [ "title", "first_name", "last_name" ]
        }
    }
}

Because the title field is not analyzed, it searches that field for a single term consisting of the whole query string!

title:peter smith
(
    blended("peter", fields: [first_name, last_name])
    blended("smith", fields: [first_name, last_name])
)

Unless some document has a title that is exactly peter smith, that term does not exist in the inverted index of the title field and can never be found. Avoid using not_analyzed fields in multi_match queries.
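A minimal Python sketch shows why the lookup fails. It assumes a simplified analysis step (lowercase, then split on whitespace) and a hypothetical set of indexed title values; a not_analyzed field is modeled as storing and searching each value verbatim as a single term.

```python
def analyze(text, analyzed=True):
    """Approximate analysis: an analyzed field is lowercased and split into
    terms; a not_analyzed field keeps the whole string as one term."""
    return text.lower().split() if analyzed else [text]

query = "peter smith"

# Terms searched for in the analyzed first_name and last_name fields.
name_terms = analyze(query, analyzed=True)

# Single term searched for in the not_analyzed title field.
title_terms = analyze(query, analyzed=False)

# Hypothetical not_analyzed title index: each value is stored as one exact term.
title_index = {"Peter Smith: A Biography", "War and Peace"}

print(name_terms)                      # ['peter', 'smith']
print(title_terms)                     # ['peter smith']
print(title_terms[0] in title_index)   # False: no title is exactly "peter smith"
```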
