ElasticSearch does a good job of finding results even in large text documents, which makes it very useful for searching very large blocks of text. To improve the user experience, however, it is sometimes necessary to show an abstract: a small portion of the text that matched the query. The highlight functionality in ElasticSearch is designed for this job.
You need a working ElasticSearch cluster and an index populated with the script available in the online code.
For searching and highlighting the results, we need to perform the following steps:
Execute a search with a highlight section as follows:
curl -XGET 'http://127.0.0.1:9200/test-index/_search?from=0&size=10' -d '
{
  "query": {"query_string": {"query": "joe"}},
  "highlight": {
    "pre_tags": ["<b>"],
    "fields": {
      "parsedtext": {"order": "score"},
      "name": {"order": "score"}
    },
    "post_tags": ["</b>"]
  }
}'
The result will be similar to the following:
{
  … truncated …
  "hits" : {
    "total" : 1,
    "max_score" : 0.44194174,
    "hits" : [ {
      "_index" : "test-index",
      "_type" : "test-type",
      "_id" : "1",
      "_score" : 0.44194174,
      "_source" : {"position": 1, "parsedtext": "Joe Testere nice guy", "name": "Joe Tester", "uuid": "11111"},
      "highlight" : {
        "name" : [ "<b>Joe</b> Tester" ],
        "parsedtext" : [ "<b>Joe</b> Testere nice guy" ]
      }
    } ]
  }
}
As you can see, the standard result contains a new field, highlight, which holds the highlighted fields, each with an array of fragments.
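A client usually needs to pull these fragments out of the response. The following is a minimal Python sketch of one way to do that, assuming the response has already been decoded into a dictionary with the shape shown above. The helper name collect_fragments and the trimmed sample_response are illustrative and not part of any ElasticSearch client.

```python
def collect_fragments(response):
    """Map each hit's _id to its highlight fragments, keyed by field name."""
    fragments = {}
    for hit in response.get("hits", {}).get("hits", []):
        # A hit may lack the "highlight" key, so fall back to an empty dict.
        fragments[hit["_id"]] = hit.get("highlight", {})
    return fragments


# Trimmed copy of the response shown above (illustrative only).
sample_response = {
    "hits": {
        "total": 1,
        "hits": [{
            "_index": "test-index",
            "_id": "1",
            "_source": {"name": "Joe Tester",
                        "parsedtext": "Joe Testere nice guy"},
            "highlight": {
                "name": ["<b>Joe</b> Tester"],
                "parsedtext": ["<b>Joe</b> Testere nice guy"],
            },
        }],
    }
}

for doc_id, fields in collect_fragments(sample_response).items():
    for field, parts in sorted(fields.items()):
        # Join multiple fragments of one field into a single snippet line.
        print(doc_id, field, " ... ".join(parts))
```

Keeping the fragments grouped by field lets the caller decide which field to show as the abstract for each document.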
When the highlight parameter is passed to the search object, ElasticSearch tries to execute the highlight on the document results.
The highlighting phase, which runs after the document fetch, then tries to extract the highlighted fragments from each requested field.
Using the highlighting functionality is very easy, but there are some important areas where we need to pay attention. They are as follows:
The ElasticSearch highlighter first checks whether the field data is available as a term vector (the fastest way to execute highlighting). If the field doesn't have a term vector, it tries to load the field value from the stored fields. If the field is not stored, it finally loads the JSON source, interprets it, and extracts the data value if available. Obviously, the last approach is the slowest and most resource-intensive one.
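The lookup order described above can be pictured with a toy model. The Python sketch below is not ElasticSearch code: the function resolve_field_data and its three dictionary arguments are hypothetical stand-ins for the term vectors, stored fields, and JSON source of a single document, and it only illustrates which origin is tried first.

```python
def resolve_field_data(field, term_vectors, stored_fields, source):
    """Return (origin, data) from the first place that holds the field.

    Toy model of the fallback order: term vector, then stored field,
    then the document source.
    """
    if field in term_vectors:
        return "term_vector", term_vectors[field]
    if field in stored_fields:
        return "stored", stored_fields[field]
    if field in source:
        # Slowest path: the whole source has to be loaded and parsed.
        return "source", source[field]
    return None, None


# Hypothetical document: "name" is stored, "parsedtext" lives only in the source.
stored = {"name": "Joe Tester"}
source = {"name": "Joe Tester", "parsedtext": "Joe Testere nice guy"}

for field in ("name", "parsedtext", "missing"):
    origin, _ = resolve_field_data(field, {}, stored, source)
    print(field, "->", origin)
```

In practice this means that fields that are highlighted often are good candidates for term vectors or stored fields, at the cost of a larger index.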
There are several parameters that can be passed in the highlight object to control the highlighting process:
number_of_fragments (default 5): This parameter controls how many fragments are returned. It can be configured globally or for a field.
fragment_size (default 100): This parameter controls the number of characters that each fragment should contain. It can be configured globally or for a field.
pre_tags/post_tags: These parameters take a list of tags used to mark the highlighted text.
tags_schema="styled": This parameter selects a tag schema that marks highlighting with different tags in order of importance. It is a helper that avoids defining a lot of pre_tags/post_tags tags.
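To show how these parameters fit together, here is a small Python sketch that assembles a highlight object with global settings and per-field overrides. The helper build_highlight is hypothetical, and its rule that tags_schema and explicit tags cannot be combined is a choice made for this sketch, not a documented ElasticSearch constraint. The sketch only builds and prints the request body; it does not contact a cluster.

```python
import json


def build_highlight(fields, number_of_fragments=5, fragment_size=100,
                    pre_tags=None, post_tags=None, tags_schema=None):
    """Build the "highlight" part of a search body.

    fields maps a field name to its per-field overrides (possibly empty).
    """
    if tags_schema and (pre_tags or post_tags):
        # Sketch-level rule: pick one way of defining the tags.
        raise ValueError("use either tags_schema or explicit pre/post tags")
    highlight = {
        "number_of_fragments": number_of_fragments,  # global value
        "fragment_size": fragment_size,              # global value
        "fields": {name: dict(opts) for name, opts in fields.items()},
    }
    if tags_schema:
        highlight["tags_schema"] = tags_schema
    if pre_tags:
        highlight["pre_tags"] = list(pre_tags)
    if post_tags:
        highlight["post_tags"] = list(post_tags)
    return highlight


body = {
    "query": {"query_string": {"query": "joe"}},
    "highlight": build_highlight(
        # parsedtext overrides the global values; name inherits them.
        {"parsedtext": {"fragment_size": 50, "number_of_fragments": 2},
         "name": {}},
        pre_tags=["<b>"],
        post_tags=["</b>"],
    ),
}
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of the _search request used earlier in place of the hand-written one.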