Structured search is about interrogating data that has inherent structure. Dates, times, and numbers are all structured: they have a precise format that you can perform logical operations on. Common operations include comparing ranges of numbers or dates, or determining which of two values is larger.
Text can be structured too. A box of crayons has a discrete set of colors: red, green, blue. A blog post may be tagged with keywords distributed and search. Products in an ecommerce store have Universal Product Codes (UPCs) or some other identifier that requires strict and structured formatting.
With structured search, the answer to your question is always a yes or no; something either belongs in the set or it does not. Structured search does not worry about document relevance or scoring; it simply includes or excludes documents.
This should make sense logically. A number can’t be more in a range than any other number that falls in the same range. It is either in the range—or it isn’t. Similarly, for structured text, a value is either equal or it isn’t. There is no concept of more similar.
When working with exact values, you will be working with filters. Filters are important because they are very, very fast. Filters do not calculate relevance (avoiding the entire scoring phase) and are easily cached. We’ll talk about the performance benefits of filters later in “All About Caching”, but for now, just keep in mind that you should use filters as often as you can.
We are going to explore the term filter first because you will use it often. This filter is capable of handling numbers, Booleans, dates, and text.

Let’s look at an example using numbers first by indexing some products. These documents have a price and a productID:
POST /my_store/products/_bulk
{ "index": { "_id": 1 }}
{ "price" : 10, "productID" : "XHDK-A-1293-#fJ3" }
{ "index": { "_id": 2 }}
{ "price" : 20, "productID" : "KDKE-B-9947-#kL5" }
{ "index": { "_id": 3 }}
{ "price" : 30, "productID" : "JODL-X-1937-#pV7" }
{ "index": { "_id": 4 }}
{ "price" : 30, "productID" : "QQPX-R-3956-#aD8" }
Our goal is to find all products with a certain price. If you are coming from a relational database background, you may be familiar with SQL. Expressed as an SQL query, our search would look like this:
SELECT document
FROM   products
WHERE  price = 20
In the Elasticsearch query DSL, we use a term filter to accomplish the same thing. The term filter will look for the exact value that we specify. By itself, a term filter is simple. It accepts a field name and the value that we wish to find:
{
    "term" : {
        "price" : 20
    }
}
The term filter isn’t very useful on its own, though. As discussed in “Query DSL”, the search API expects a query, not a filter. To use our term filter, we need to wrap it with a filtered query:
GET /my_store/products/_search
{
    "query" : {
        "filtered" : {
            "query" : {
                "match_all" : {}
            },
            "filter" : {
                "term" : {
                    "price" : 20
                }
            }
        }
    }
}
The filtered query accepts both a query and a filter. Here, a match_all query is used to return all matching documents. This is the default behavior, so in future examples we will simply omit the query section. The filter clause contains the term filter that we saw previously; notice how it is placed inside the filter clause.
Once executed, the search results from this query are exactly what you would expect: only document 2 is returned as a hit (because only 2 had a price of 20):
"hits" : [
    {
        "_index" : "my_store",
        "_type" :  "products",
        "_id" :    "2",
        "_score" : 1.0,
        "_source" : {
            "price" :     20,
            "productID" : "KDKE-B-9947-#kL5"
        }
    }
]
As mentioned at the top of this section, the term filter can match strings just as easily as numbers. Instead of price, let’s try to find products that have a certain UPC identification code. To do this with SQL, we might use a query like this:
SELECT product
FROM   products
WHERE  productID = "XHDK-A-1293-#fJ3"
Translated into the query DSL, we can try a similar query with the term filter, like so:
GET /my_store/products/_search
{
    "query" : {
        "filtered" : {
            "filter" : {
                "term" : {
                    "productID" : "XHDK-A-1293-#fJ3"
                }
            }
        }
    }
}
Except there is a little hiccup: we don’t get any results back! Why is that? The problem isn’t with the term filter; it is with the way the data has been indexed. If we use the analyze API (“Testing Analyzers”), we can see that our UPC has been tokenized into smaller tokens:
GET /my_store/_analyze?field=productID
XHDK-A-1293-#fJ3
{
    "tokens" : [
        {
            "token" :        "xhdk",
            "start_offset" : 0,
            "end_offset" :   4,
            "type" :         "<ALPHANUM>",
            "position" :     1
        },
        {
            "token" :        "a",
            "start_offset" : 5,
            "end_offset" :   6,
            "type" :         "<ALPHANUM>",
            "position" :     2
        },
        {
            "token" :        "1293",
            "start_offset" : 7,
            "end_offset" :   11,
            "type" :         "<NUM>",
            "position" :     3
        },
        {
            "token" :        "fj3",
            "start_offset" : 13,
            "end_offset" :   16,
            "type" :         "<ALPHANUM>",
            "position" :     4
        }
    ]
}
There are a few important points here:

- We have four distinct tokens instead of a single token representing the UPC.
- All letters have been lowercased.
- We lost the hyphen and the hash (#) sign.
So when our term filter looks for the exact value XHDK-A-1293-#fJ3, it doesn’t find anything, because that token does not exist in our inverted index. Instead, there are the four tokens listed previously.

Obviously, this is not what we want to happen when dealing with identification codes, or any kind of precise enumeration.

To prevent this from happening, we need to tell Elasticsearch that this field contains an exact value by setting it to be not_analyzed. We saw this originally in “Customizing Field Mappings”. To do this, we need to first delete our old index (because it has the incorrect mapping) and create a new one with the correct mappings:
DELETE /my_store

PUT /my_store
{
    "mappings" : {
        "products" : {
            "properties" : {
                "productID" : {
                    "type" : "string",
                    "index" : "not_analyzed"
                }
            }
        }
    }
}
Deleting the index first is required, since we cannot change mappings that already exist. With the index deleted, we can re-create it with our custom mapping, in which we explicitly say that we don’t want productID to be analyzed.
Now we can go ahead and reindex our documents:
POST /my_store/products/_bulk
{ "index": { "_id": 1 }}
{ "price" : 10, "productID" : "XHDK-A-1293-#fJ3" }
{ "index": { "_id": 2 }}
{ "price" : 20, "productID" : "KDKE-B-9947-#kL5" }
{ "index": { "_id": 3 }}
{ "price" : 30, "productID" : "JODL-X-1937-#pV7" }
{ "index": { "_id": 4 }}
{ "price" : 30, "productID" : "QQPX-R-3956-#aD8" }
Only now will our term filter work as expected. Let’s try it again on the newly indexed data (notice that the query and filter have not changed at all, just how the data is mapped):
GET /my_store/products/_search
{
    "query" : {
        "filtered" : {
            "filter" : {
                "term" : {
                    "productID" : "XHDK-A-1293-#fJ3"
                }
            }
        }
    }
}
Since the productID field is not analyzed, and the term filter performs no analysis, the query finds the exact match and returns document 1 as a hit. Success!
Internally, Elasticsearch is performing several operations when executing a filter:

1. Find matching docs. The term filter looks up the term XHDK-A-1293-#fJ3 in the inverted index and retrieves the list of documents that contain that term. In this case, only document 1 has the term we are looking for.

2. Build a bitset. The filter then builds a bitset (an array of 1s and 0s) that describes which documents contain the term. Matching documents receive a 1 bit. In our example, the bitset would be [1,0,0,0].

3. Cache the bitset. Last, the bitset is stored in memory, since we can use it in the future and skip steps 1 and 2. This adds a lot of performance and makes filters very fast.
When executing a filtered query, the filter is executed before the query. The resulting bitset is given to the query, which uses it to simply skip over any documents that have already been excluded by the filter. This is one of the ways that filters can improve performance. Fewer documents evaluated by the query means faster response times.
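The term-to-bitset step can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Elasticsearch internals: `inverted_index` is a hypothetical mapping from terms to posting lists for our four products.

```python
# Toy inverted index for the four product documents, ordered by an
# internal doc ID (0-3). This structure is illustrative only.
inverted_index = {
    "price:10": [0],
    "price:20": [1],
    "price:30": [2, 3],
}

def term_filter_bitset(term, num_docs):
    """Look up a term's posting list and turn it into a bitset."""
    bitset = [0] * num_docs
    for doc_id in inverted_index.get(term, []):
        bitset[doc_id] = 1          # matching documents receive a 1 bit
    return bitset

print(term_filter_bitset("price:20", 4))  # [0, 1, 0, 0]
```

A term that appears in no document simply yields an all-zero bitset, which is exactly why the unanalyzed UPC lookup earlier returned no hits.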
The previous two examples showed a single filter in use. In practice, you will probably need to filter on multiple values or fields. For example, how would you express this SQL in Elasticsearch?
SELECT product
FROM   products
WHERE  (price = 20 OR productID = "XHDK-A-1293-#fJ3")
  AND  (price != 30)
In these situations, you will need the bool filter. This is a compound filter that accepts other filters as arguments, combining them in various Boolean combinations.

The bool filter is composed of three sections:
{
   "bool" : {
      "must" :     [],
      "should" :   [],
      "must_not" : []
   }
}
must: All of these clauses must match. The equivalent of AND.

must_not: All of these clauses must not match. The equivalent of NOT.

should: At least one of these clauses must match. The equivalent of OR.
And that’s it! When you need multiple filters, simply place them into the different sections of the bool filter.

Each section of the bool filter is optional (for example, you can have a must clause and nothing else), and each section can contain a single filter or an array of filters.
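Since every clause ultimately resolves to a bitset, the bool combination itself is just bit logic. Here is a hedged sketch of that idea; `combine_bool` is our illustrative helper, not an Elasticsearch API, and the bitsets correspond to our four product documents.

```python
def combine_bool(must=(), should=(), must_not=(), num_docs=0):
    """Combine child bitsets with AND / OR / NOT bit logic."""
    result = [1] * num_docs
    for bs in must:                      # AND: every clause must match
        result = [a & b for a, b in zip(result, bs)]
    if should:                           # OR: at least one clause matches
        any_should = [0] * num_docs
        for bs in should:
            any_should = [a | b for a, b in zip(any_should, bs)]
        result = [a & b for a, b in zip(result, any_should)]
    for bs in must_not:                  # NOT: clause must not match
        result = [a & (1 - b) for a, b in zip(result, bs)]
    return result

price_20 = [0, 1, 0, 0]   # doc 2
upc_xhdk = [1, 0, 0, 0]   # doc 1
price_30 = [0, 0, 1, 1]   # docs 3 and 4

# (price = 20 OR productID = "XHDK-...") AND price != 30
print(combine_bool(should=[price_20, upc_xhdk],
                   must_not=[price_30], num_docs=4))  # [1, 1, 0, 0]
```

This is why compound filters are cheap to re-evaluate: the expensive inverted-index lookups produce the child bitsets, and combining them is just fast bitwise arithmetic.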
To replicate the preceding SQL example, we will take the two term filters that we used previously and place them inside the should clause of a bool filter, and add another clause to deal with the NOT condition:
GET /my_store/products/_search
{
   "query" : {
      "filtered" : {
         "filter" : {
            "bool" : {
               "should" : [
                  { "term" : { "price" : 20 }},
                  { "term" : { "productID" : "XHDK-A-1293-#fJ3" }}
               ],
               "must_not" : {
                  "term" : { "price" : 30 }
               }
            }
         }
      }
   }
}
Note that we still need to use a filtered query to wrap everything. The two term filters are children of the bool filter, and since they are placed inside the should clause, at least one of them needs to match. If a product has a price of 30, it is automatically excluded because it matches the must_not clause.
Our search results return two hits, each document satisfying a different clause in the bool filter:
"hits" : [
    {
        "_id" :     "1",
        "_score" :  1.0,
        "_source" : {
            "price" :     10,
            "productID" : "XHDK-A-1293-#fJ3"
        }
    },
    {
        "_id" :     "2",
        "_score" :  1.0,
        "_source" : {
            "price" :     20,
            "productID" : "KDKE-B-9947-#kL5"
        }
    }
]
Even though bool is a compound filter and accepts child filters, it is important to understand that bool is just a filter itself. This means you can nest bool filters inside other bool filters, giving you the ability to make arbitrarily complex Boolean logic.
Given this SQL statement:

SELECT document
FROM   products
WHERE  productID = "KDKE-B-9947-#kL5"
  OR   ( productID = "JODL-X-1937-#pV7"
         AND price = 30 )
We can translate it into a pair of nested bool filters:
GET /my_store/products/_search
{
   "query" : {
      "filtered" : {
         "filter" : {
            "bool" : {
               "should" : [
                  { "term" : { "productID" : "KDKE-B-9947-#kL5" }},
                  { "bool" : {
                     "must" : [
                        { "term" : { "productID" : "JODL-X-1937-#pV7" }},
                        { "term" : { "price" : 30 }}
                     ]
                  }}
               ]
            }
         }
      }
   }
}
Because the term and the bool are sibling clauses inside the first Boolean should, at least one of these filters must match for a document to be a hit. The two term clauses in the inner bool are siblings in a must clause, so they both have to match for a document to be returned as a hit.
The results show us two documents, one matching each of the should clauses:
"hits" : [
    {
        "_id" :     "2",
        "_score" :  1.0,
        "_source" : {
            "price" :     20,
            "productID" : "KDKE-B-9947-#kL5"
        }
    },
    {
        "_id" :     "3",
        "_score" :  1.0,
        "_source" : {
            "price" :     30,
            "productID" : "JODL-X-1937-#pV7"
        }
    }
]
The first document’s productID matches the term in the outer bool; the second document’s two fields match the term filters in the nested bool.
This was a simple example, but it demonstrates how Boolean filters can be used as building blocks to construct complex logical conditions.
The term filter is useful for finding a single value, but often you’ll want to search for multiple values. What if you want to find documents that have a price of $20 or $30?

Rather than using multiple term filters, you can instead use a single terms filter (note the s at the end). The terms filter is simply the plural version of the singular term filter.

It looks nearly identical to a vanilla term too. Instead of specifying a single price, we are now specifying an array of values:
{
    "terms" : {
        "price" : [20, 30]
    }
}
And like the term filter, we will place it inside a filtered query to use it:
GET /my_store/products/_search
{
    "query" : {
        "filtered" : {
            "filter" : {
                "terms" : {
                    "price" : [20, 30]
                }
            }
        }
    }
}
The query will return the second, third, and fourth documents:
"hits" : [
    {
        "_id" :     "2",
        "_score" :  1.0,
        "_source" : {
            "price" :     20,
            "productID" : "KDKE-B-9947-#kL5"
        }
    },
    {
        "_id" :     "3",
        "_score" :  1.0,
        "_source" : {
            "price" :     30,
            "productID" : "JODL-X-1937-#pV7"
        }
    },
    {
        "_id" :     "4",
        "_score" :  1.0,
        "_source" : {
            "price" :     30,
            "productID" : "QQPX-R-3956-#aD8"
        }
    }
]
It is important to understand that term and terms are contains operations, not equals. What does that mean?

If you have a term filter for { "term" : { "tags" : "search" } }, it will match both of the following documents:
{ "tags" : ["search"] }
{ "tags" : ["search", "open_source"] }
Recall how the term filter works: it checks the inverted index for all documents that contain a term, and then constructs a bitset. In our simple example, we have the following inverted index:
Token       | DocIDs
------------|-------
open_source | 2
search      | 1, 2
When a term filter is executed for the token search, it goes straight to the corresponding entry in the inverted index and extracts the associated doc IDs. As you can see, both document 1 and document 2 contain the token in the inverted index. Therefore, they are both returned as a result.
The nature of an inverted index also means that entire field equality is rather difficult to calculate. How would you determine whether a particular document contains only your request term? You would have to find the term in the inverted index, extract the document IDs, and then scan every row in the inverted index, looking for those IDs to see whether a doc has any other terms.
As you might imagine, that would be tremendously inefficient and expensive.
For that reason, term and terms are must contain operations, not must equal exactly.
If you do want that behavior—entire field equality—the best way to accomplish it involves indexing a secondary field. In this field, you index the number of values that your field contains. Using our two previous documents, we now include a field that maintains the number of tags:
{ "tags" : ["search"], "tag_count" : 1 }
{ "tags" : ["search", "open_source"], "tag_count" : 2 }
Once you have the count information indexed, you can construct a bool filter that enforces the appropriate number of terms:
GET /my_index/my_type/_search
{
    "query" : {
        "filtered" : {
            "filter" : {
                "bool" : {
                    "must" : [
                        { "term" : { "tags" : "search" } },
                        { "term" : { "tag_count" : 1 } }
                    ]
                }
            }
        }
    }
}
This query will now match only the document that has a single tag that is search, rather than any document that contains search.
When dealing with numbers in this chapter, we have so far searched for only exact numbers. In practice, filtering on ranges is often more useful. For example, you might want to find all products with a price greater than $20 and less than $40.
In SQL terms, a range can be expressed as follows:
SELECT document
FROM   products
WHERE  price BETWEEN 20 AND 40
Elasticsearch has a range filter, which, unsurprisingly, allows you to filter ranges:
"range" : {
    "price" : {
        "gt" : 20,
        "lt" : 40
    }
}
The range filter supports both inclusive and exclusive ranges, through combinations of the following options:

- gt: > greater than
- lt: < less than
- gte: >= greater than or equal to
- lte: <= less than or equal to
GET /my_store/products/_search
{
    "query" : {
        "filtered" : {
            "filter" : {
                "range" : {
                    "price" : {
                        "gte" : 20,
                        "lt"  : 40
                    }
                }
            }
        }
    }
}
If you need an unbounded range (for example, just >20), omit one of the boundaries:
"range" : {
    "price" : {
        "gt" : 20
    }
}
The range filter can be used on date fields too:
"range" : {
    "timestamp" : {
        "gt" : "2014-01-01 00:00:00",
        "lt" : "2014-01-07 00:00:00"
    }
}
When used on date fields, the range filter supports date math operations. For example, if we want to find all documents that have a timestamp sometime in the last hour:
"range" : {
    "timestamp" : {
        "gt" : "now-1h"
    }
}
This filter will now constantly find all documents with a timestamp greater than the current time minus 1 hour, making the filter a sliding window across your documents.
Date math can also be applied to actual dates, rather than a placeholder like now. Just add a double pipe (||) after the date and follow it with a date math expression:
"range" : {
    "timestamp" : {
        "gt" : "2014-01-01 00:00:00",
        "lt" : "2014-01-01 00:00:00||+1M"
    }
}
Date math is calendar aware, so it knows the number of days in each month, days in a year, and so forth. More details about working with dates can be found in the date format reference documentation.
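Calendar-aware month arithmetic like "+1M" can be sketched with the standard library. `add_months` is our illustrative helper (not an Elasticsearch API), and the day-clamping rule is one reasonable interpretation of calendar awareness:

```python
from datetime import datetime
import calendar

def add_months(dt, months):
    """Add whole calendar months, clamping the day for shorter months."""
    month_index = dt.month - 1 + months
    year = dt.year + month_index // 12
    month = month_index % 12 + 1
    # Clamp: e.g., Jan 31 + 1 month lands on Feb 28 (or 29 in leap years).
    day = min(dt.day, calendar.monthrange(year, month)[1])
    return dt.replace(year=year, month=month, day=day)

print(add_months(datetime(2014, 1, 1), 1))   # 2014-02-01 00:00:00
print(add_months(datetime(2014, 1, 31), 1))  # 2014-02-28 00:00:00
```

The point is that "one month" is not a fixed number of milliseconds; the engine has to consult the calendar, which is exactly what the || date math syntax does for you.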
The range filter can also operate on string fields. String ranges are calculated lexicographically, or alphabetically. For example, these values are sorted in lexicographic order:
5, 50, 6, B, C, a, ab, abb, abc, b
Terms in the inverted index are sorted in lexicographical order, which is why string ranges use this order.
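The ordering above is plain byte-wise string comparison, which Python's built-in sorted() reproduces: digits sort before uppercase letters, which sort before lowercase letters, and a string sorts before any longer string it prefixes.

```python
# The values from the text, shuffled; lexicographic sort restores
# exactly the order shown above.
values = ["b", "abc", "a", "6", "C", "ab", "5", "B", "50", "abb"]
print(sorted(values))
# ['5', '50', '6', 'B', 'C', 'a', 'ab', 'abb', 'abc', 'b']
```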
If we want a range from a up to but not including b, we can use the same range filter syntax:
"range" : {
    "title" : {
        "gte" : "a",
        "lt"  : "b"
    }
}
Think back to our earlier example, where documents have a field named tags. This is a multivalue field. A document may have one tag, many tags, or potentially no tags at all. If a field has no values, how is it stored in an inverted index?
That’s a trick question, because the answer is, it isn’t stored at all. Let’s look at that inverted index from the previous section:
Token       | DocIDs
------------|-------
open_source | 2
search      | 1, 2
How would you store a field that doesn’t exist in that data structure? You can’t! An inverted index is simply a list of tokens and the documents that contain them. If a field doesn’t exist, it doesn’t hold any tokens, which means it won’t be represented in an inverted index data structure.
Ultimately, this means that a null, [] (an empty array), and [null] are all equivalent. They simply don’t exist in the inverted index!
Obviously, the world is not simple, and data is often missing fields, or contains explicit nulls or empty arrays. To deal with these situations, Elasticsearch has a few tools to work with null or missing values.
The first tool in your arsenal is the exists filter. This filter will return documents that have any value in the specified field. Let’s use the tagging example and index some example documents:
POST /my_index/posts/_bulk
{ "index": { "_id": "1" }}
{ "tags" : ["search"] }
{ "index": { "_id": "2" }}
{ "tags" : ["search", "open_source"] }
{ "index": { "_id": "3" }}
{ "other_field" : "some data" }
{ "index": { "_id": "4" }}
{ "tags" : null }
{ "index": { "_id": "5" }}
{ "tags" : ["search", null] }
In these documents, the tags field has one value (document 1), two values (document 2), is missing altogether (document 3), is set to null (document 4), and has one value plus a null (document 5).
The resulting inverted index for our tags field will look like this:
Token       | DocIDs
------------|-------
open_source | 2
search      | 1, 2, 5
Our objective is to find all documents where a tag is set. We don’t care what the tag is, so long as it exists within the document. In SQL parlance, we would use an IS NOT NULL query:
SELECT tags
FROM   posts
WHERE  tags IS NOT NULL
In Elasticsearch, we use the exists filter:
GET /my_index/posts/_search
{
    "query" : {
        "filtered" : {
            "filter" : {
                "exists" : { "field" : "tags" }
            }
        }
    }
}
Our query returns three documents:
"hits" : [
    {
        "_id" :     "1",
        "_score" :  1.0,
        "_source" : { "tags" : ["search"] }
    },
    {
        "_id" :     "5",
        "_score" :  1.0,
        "_source" : { "tags" : ["search", null] }
    },
    {
        "_id" :     "2",
        "_score" :  1.0,
        "_source" : { "tags" : ["search", "open_source"] }
    }
]
Document 5 is returned even though it contains a null value. The field exists because a real-value tag was indexed, so the null had no impact on the filter.

The results are easy to understand. Any document that has terms in the tags field was returned as a hit. The only two documents that were excluded were documents 3 and 4.
The missing filter is essentially the inverse of exists: it returns documents where there is no value for a particular field, much like this SQL:
SELECT tags
FROM   posts
WHERE  tags IS NULL
Let’s swap the exists filter for a missing filter from our previous example:
GET /my_index/posts/_search
{
    "query" : {
        "filtered" : {
            "filter" : {
                "missing" : { "field" : "tags" }
            }
        }
    }
}
And, as you would expect, we get back the two docs that have no real values in the tags field, documents 3 and 4:
"hits" : [
    {
        "_id" :     "3",
        "_score" :  1.0,
        "_source" : { "other_field" : "some data" }
    },
    {
        "_id" :     "4",
        "_score" :  1.0,
        "_source" : { "tags" : null }
    }
]
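Both filters can be derived from the same inverted index: a document "exists" for a field if at least one token in that field's index lists it, and "missing" is the complement. This is a toy sketch; `tags_index` and `all_docs` are illustrative structures, not Elasticsearch internals.

```python
# Inverted index for the tags field, matching the table in the text.
tags_index = {
    "open_source": {2},
    "search": {1, 2, 5},
}
all_docs = {1, 2, 3, 4, 5}

# exists: union of every posting list for the field
exists = set().union(*tags_index.values())
# missing: everything the field's index never mentions
missing = all_docs - exists

print(sorted(exists))   # [1, 2, 5]
print(sorted(missing))  # [3, 4]
```

Note how document 5 falls on the exists side: its null was never indexed, but its real "search" token was.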
The exists and missing filters also work on inner objects, not just core types. With the following document
{
   "name" : {
      "first" : "John",
      "last" :  "Smith"
   }
}
you can check for the existence of name.first and name.last, but also just name. However, in “Types and Mappings”, we said that an object like the preceding one is flattened internally into a simple field-value structure, much like this:
{
   "name.first" : "John",
   "name.last" :  "Smith"
}
So how can we use an exists or missing filter on the name field, which doesn’t really exist in the inverted index?
The reason that it works is that a filter like
{
    "exists" : { "field" : "name" }
}
is really executed as
{
    "bool" : {
        "should" : [
            { "exists" : { "field" : "name.first" }},
            { "exists" : { "field" : "name.last" }}
        ]
    }
}
That also means that if first and last were both empty, the name namespace would not exist.
Earlier in this chapter (“Internal Filter Operation”), we briefly discussed how filters are calculated. At their heart is a bitset representing which documents match the filter. Elasticsearch aggressively caches these bitsets for later use. Once cached, these bitsets can be reused wherever the same filter is used, without having to reevaluate the entire filter again.
These cached bitsets are “smart”: they are updated incrementally. As you index new documents, only those new documents need to be added to the existing bitsets, rather than having to recompute the entire cached filter over and over. Filters are real-time like the rest of the system; you don’t need to worry about cache expiry.
Each filter is calculated and cached independently, regardless of where it is used. If two different queries use the same filter, the same filter bitset will be reused. Likewise, if a single query uses the same filter in multiple places, only one bitset is calculated and then reused.
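The reuse described above can be sketched with a small cache keyed by the filter itself. `FilterCache` is our illustrative class, not an Elasticsearch API; the point is that the same filter key triggers only one computation.

```python
class FilterCache:
    """Toy cache: one bitset computation per distinct filter key."""
    def __init__(self):
        self.cache = {}
        self.computations = 0

    def bitset_for(self, filter_key, compute):
        if filter_key not in self.cache:
            self.computations += 1          # only on a cache miss
            self.cache[filter_key] = compute()
        return self.cache[filter_key]

cache = FilterCache()
inbox = lambda: [1, 1, 0, 0]  # stand-in for a real inverted-index scan

# The same filter used in two places (say, a must and a must_not clause)
# is computed once and reused from the cache thereafter.
a = cache.bitset_for("term:folder=inbox", inbox)
b = cache.bitset_for("term:folder=inbox", inbox)
print(cache.computations)  # 1
```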
Let’s look at this example query, which looks for emails that are either of the following:
In the inbox and have not been read
Not in the inbox but have been marked as important
"bool" : {
   "should" : [
      { "bool" : {
         "must" : [
            { "term" : { "folder" : "inbox" }},
            { "term" : { "read" : false }}
         ]
      }},
      { "bool" : {
         "must_not" : {
            "term" : { "folder" : "inbox" }
         },
         "must" : {
            "term" : { "important" : true }
         }
      }}
   ]
}
Even though one of the inbox clauses is a must clause and the other is a must_not clause, the two clauses themselves are identical. This means that the bitset is calculated once for the first clause that is executed, and then the cached bitset is used for the other clause. By the time this query is run a second time, the inbox filter is already cached, so both clauses will use the cached bitset.
This ties in nicely with the composability of the query DSL. It is easy to move filters around, or reuse the same filter in multiple places within the same query. This isn’t just convenient to the developer—it has direct performance benefits.
Most leaf filters (those dealing directly with fields, like the term filter) are cached, while compound filters, like the bool filter, are not.
Leaf filters have to consult the inverted index on disk, so it makes sense to cache them. Compound filters, on the other hand, use fast bit logic to combine the bitsets resulting from their inner clauses, so it is efficient to recalculate them every time.
Certain leaf filters, however, are not cached by default, because it doesn’t make sense to do so:
The results from script filters cannot be cached, because the meaning of the script is opaque to Elasticsearch.
The geolocation filters, which we cover in more detail in Part V, are usually used to filter results based on the geolocation of a specific user. Since each user has a unique geolocation, it is unlikely that geo-filters will be reused, so it makes no sense to cache them.
Date ranges that use the now function (for example, "now-1h") result in values accurate to the millisecond. Every time the filter is run, now returns a new time. Older filters will never be reused, so caching is disabled by default. However, when using now with rounding (for example, now/d rounds to the nearest day), caching is enabled by default.
Sometimes the default caching strategy is not correct. Perhaps you have a complicated bool expression that is reused several times in the same query. Or you have a filter on a date field that will never be reused. The default caching strategy can be overridden on almost any filter by setting the _cache flag:
{
    "range" : {
        "timestamp" : {
            "gt" : "2014-01-02 16:15:14"
        },
        "_cache" : false
    }
}
Later chapters provide examples of when it can make sense to override the default caching strategy.
The order of filters in a bool clause is important for performance. More-specific filters should be placed before less-specific filters in order to exclude as many documents as possible, as early as possible.
If Clause A could match 10 million documents, and Clause B could match only 100 documents, then Clause B should be placed before Clause A.
Cached filters are very fast, so they should be placed before filters that are not cacheable. Imagine that we have an index that contains one month’s worth of log events. However, we’re mostly interested only in log events from the previous hour:
GET /logs/2014-01/_search
{
    "query" : {
        "filtered" : {
            "filter" : {
                "range" : {
                    "timestamp" : {
                        "gt" : "now-1h"
                    }
                }
            }
        }
    }
}
This filter is not cached because it uses the now function, the value of which changes every millisecond. That means that we have to examine one month’s worth of log events every time we run this query!
We could make this much more efficient by combining it with a cached filter: we can exclude most of the month’s data by adding a filter that uses a fixed point in time, such as midnight last night:
"bool" : {
   "must" : [
      { "range" : {
         "timestamp" : {
            "gt" : "now-1h/d"
         }
      }},
      { "range" : {
         "timestamp" : {
            "gt" : "now-1h"
         }
      }}
   ]
}
The first range filter is cached because it uses now rounded to midnight; the second is not cached because it uses now without rounding.
The now-1h/d clause rounds to the previous midnight and so excludes all documents created before today. The resulting bitset is cached because now is used with rounding, which means that it is executed only once a day, when the value for midnight-last-night changes. The now-1h clause isn’t cached because now produces a time accurate to the nearest millisecond. However, thanks to the first filter, this second filter need only check documents that have been created since midnight.
The order of these clauses is important. This approach works only because the since-midnight clause comes before the last-hour clause. If they were the other way around, then the last-hour clause would need to examine all documents in the index, instead of just documents created since midnight.
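A rough cost model makes the ordering point concrete. The document counts below are purely illustrative, and the loop is a stand-in for clause evaluation, not how Elasticsearch actually executes filters:

```python
# A month of log events, identified 0..999_999; pretend the last 1,000
# were created since midnight and the last 100 in the past hour.
docs = range(1_000_000)
since_midnight = set(range(999_000, 1_000_000))  # cheap: cached bitset

examined = 0
hits = []
for doc in docs:
    if doc not in since_midnight:   # cached clause runs first, excludes most docs
        continue
    examined += 1                   # expensive uncached "last hour" check
    if doc >= 999_900:
        hits.append(doc)

print(examined)   # 1000 documents examined by the expensive clause
print(len(hits))  # 100
```

Flip the clause order and the expensive check would run a million times instead of a thousand, which is exactly the trap the text warns about.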