Elasticsearch comes with several analyzers in its standard installation. In the following table, some analyzers are described:
Analyzers fulfill the following three main functions using character filters, tokenizers, and token filters:
Let's take a closer look at how these functions are realized.
In the analysis process, a tokenizer is used to break the text into tokens. Before this operation, the text is passed through any character filters. Then, the token filters begin their work.
Character filters preprocess the text before it is passed to the tokenizer during analysis. Elasticsearch ships with built-in character filters, and you can also create your own character filters to meet your needs.
This filter strips HTML markup out of the analyzed text. For example, consider the following verse by the Turkish poet and Sufi mystic Yunus Emre:
&#194;&#351;&#305;klar &#246;lmez!
As you can see, HTML decimal codes are used in place of the Turkish and Latin accented characters. The original text is Âşıklar ölmez! (Translation: lovers are immortal!) Let's see what result we get when this text is analyzed with the standard tokenizer:
curl -XGET 'localhost:9200/_analyze?tokenizer=standard&pretty' -d '&#194;&#351;&#305;klar &#246;lmez!'
{
  "tokens" : [ {
    "token" : "194",
    "start_offset" : 2,
    "end_offset" : 5,
    "type" : "<NUM>",
    "position" : 1
  }, {
    "token" : "351",
    "start_offset" : 8,
    "end_offset" : 11,
    "type" : "<NUM>",
    "position" : 2
  }, {
    "token" : "305",
    "start_offset" : 14,
    "end_offset" : 17,
    "type" : "<NUM>",
    "position" : 3
  }, {
    "token" : "klar",
    "start_offset" : 18,
    "end_offset" : 22,
    "type" : "<ALPHANUM>",
    "position" : 4
  }, {
    "token" : "246",
    "start_offset" : 25,
    "end_offset" : 28,
    "type" : "<NUM>",
    "position" : 5
  }, {
    "token" : "lmez",
    "start_offset" : 29,
    "end_offset" : 33,
    "type" : "<ALPHANUM>",
    "position" : 6
  } ]
}
As you can see, these results are neither useful nor user-friendly. Remember, if text is analyzed in this way, documents containing the word Âşıklar will not be returned when we search for the word Âşıklar. In this case, we need a filter that converts the HTML codes of the characters. The HTML Strip Char Filter performs this job, as shown:
curl -XGET 'localhost:9200/_analyze?tokenizer=standard&char_filters=html_strip&pretty' -d '&#194;&#351;&#305;klar &#246;lmez!'
{
  "tokens" : [ {
    "token" : "Âşıklar",
    "start_offset" : 0,
    "end_offset" : 22,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "ölmez",
    "start_offset" : 23,
    "end_offset" : 33,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}
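In practice, a character filter is usually wired into an analyzer at index-creation time rather than passed on the query string each time. The following is a minimal sketch (the index name my_index and the analyzer name verse_analyzer are placeholder names chosen for this example): a custom analyzer that applies html_strip before the standard tokenizer.

```shell
# Sketch: register a custom analyzer that strips HTML markup
# (html_strip char filter) before tokenizing with the standard
# tokenizer and lowercasing the resulting tokens.
# "my_index" and "verse_analyzer" are placeholder names.
curl -XPUT 'localhost:9200/my_index' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "verse_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}'
```

A field mapped with such an analyzer would then match searches for Âşıklar even when the source text contains HTML decimal codes.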
Token is one of the basic concepts in lexical analysis in computer science: a sequence of characters (that is, a string) is turned into a sequence of tokens. For example, the string hello world becomes [hello, world]. Elasticsearch has several tokenizers that are used to break a string down into a stream of terms or tokens. A simple tokenizer may split the string into terms wherever it encounters word boundaries, whitespace, or punctuation.
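The effect of the chosen tokenizer can be observed directly with the same _analyze API used earlier. A minimal sketch, assuming a node listening on localhost:9200:

```shell
# The whitespace tokenizer splits only on whitespace, so punctuation
# stays attached to the neighboring term ("world!" rather than "world").
curl -XGET 'localhost:9200/_analyze?tokenizer=whitespace&pretty' -d 'hello world!'

# The standard tokenizer splits on word boundaries and discards the
# punctuation, producing the terms "hello" and "world".
curl -XGET 'localhost:9200/_analyze?tokenizer=standard&pretty' -d 'hello world!'
```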
Elasticsearch has built-in tokenizers. You can combine them with character filters to create custom analyzers. In the following table, some tokenizers are described:
If you want more information about the Unicode Standard Annex #29, refer to http://unicode.org/reports/tr29/.
Token filters accept a stream of tokens from a tokenizer and can modify, add, or remove tokens in that stream. Elasticsearch has built-in token filters. In the following table, some token filters are described: