Tokenizers

The tokenizer in the analyzer receives the output character stream from the character filters and splits it into a token stream, which becomes the input to the token filters. Elasticsearch supports three types of tokenizer, described as follows:

  • Word-oriented tokenizer: This splits the character stream into individual words.
  • Partial word tokenizer: This breaks the character stream into small fragments of characters up to a given length, which is useful for partial word matching.
  • Structured text tokenizer: This splits the character stream into known structured tokens such as keywords, email addresses, and zip codes.

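Each of these built-in tokenizers can be tried out directly with the _analyze API, which takes a tokenizer (by name or as an inline definition) plus some sample text and returns the tokens it produces. As a minimal sketch, assuming a running cluster and the Kibana Dev Tools console (or curl against the cluster's HTTP endpoint):

  POST _analyze
  {
    "tokenizer": "standard",
    "text": "POST https://api.iextrading.com/1.0/stock/acwf/company /usr/local"
  }

The tokens array in the response contains the values shown in the Output tokens entries of the following tables.
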
We'll give an example for each built-in tokenizer and compile the results into the following tables. Let's first take a look at the Word-oriented tokenizer:

Word-oriented tokenizer

standard
  Input text: "POST https://api.iextrading.com/1.0/stock/acwf/company /usr/local"
  Description: This is grammar-based tokenization; it supports the max_token_length parameter to divide the input text into segments.
  Output tokens: ["POST", "https", "api.iextrading.com", "1.0", "stock", "acwf", "company", "usr", "local"]

letter
  Input text: "POST https://api.iextrading.com/1.0/stock/acwf/company /usr/local"
  Description: This uses non-letters as separators to split the character stream into terms.
  Output tokens: ["POST", "https", "api", "iextrading", "com", "stock", "acwf", "company", "usr", "local"]

lowercase
  Input text: "POST https://api.iextrading.com/1.0/stock/acwf/company /usr/local"
  Description: Similar to the letter tokenizer, this uses non-letters as separators to tokenize the input text. In addition, it converts the letters from uppercase to lowercase.
  Output tokens: ["post", "https", "api", "iextrading", "com", "stock", "acwf", "company", "usr", "local"]

whitespace
  Input text: "POST https://api.iextrading.com/1.0/stock/acwf/company /usr/local"
  Description: This uses whitespace characters as separators to split the character stream into terms.
  Output tokens: ["POST", "https://api.iextrading.com/1.0/stock/acwf/company", "/usr/local"]

uax_url_email
  Input text: "POST https://api.iextrading.com/1.0/stock/acwf/company /usr/local"
  Description: This is similar to the standard tokenizer, but it recognizes URLs and email addresses as single tokens.
  Output tokens: ["POST", "https://api.iextrading.com/1.0/stock/acwf/company", "usr", "local"]

classic
  Input text: "POST https://api.iextrading.com/1.0/stock/acwf/company 192.168.0.1 100-123"
  Description: This is grammar-based tokenization. It uses punctuation as a separator but retains some special formats, such as dots between non-whitespace characters, hyphens between numbers, email addresses, and internet hostnames.
  Output tokens: ["POST", "https", "api.iextrading.com", "1.0/stock", "acwf", "company", "192.168.0.1", "100-123"]

thai
  Input text: "คุณจะรัก Elasticsearch 7.0"
  Description: This is similar to the standard tokenizer, but it uses the Thai segmentation algorithm to break Thai text into words.
  Output tokens: ["คุณ", "จะ", "รัก", "Elasticsearch", "7.0"]
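The max_token_length parameter mentioned in the standard entry above can be tested by passing an inline tokenizer definition to _analyze instead of a name; the following sketch uses a value of 5 purely for illustration:

  POST _analyze
  {
    "tokenizer": {
      "type": "standard",
      "max_token_length": 5
    },
    "text": "POST https://api.iextrading.com/1.0/stock/acwf/company /usr/local"
  }

Tokens longer than five characters are split at five-character intervals, so a term such as iextrading should come back as iextr and ading.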


Let's take a look at the Partial word tokenizer, as described in the following table:

Partial word tokenizer

ngram
  Input text: "Elasticsearch 7.0"
  Description: This slides a window along the input character stream and emits tokens of the specified length, built from the specified character classes. Use min_gram (defaults to 1) and max_gram (defaults to 2) to specify the token length, and token_chars to specify the character classes to keep: letter, digit, whitespace, punctuation, and symbol.
  Custom tokenizer: {"type":"ngram", "min_gram":2, "max_gram":2, "token_chars": ["punctuation", "digit"]}
  Output tokens: ["7.", ".0"]

edge_ngram
  Input text: "Elasticsearch 7.0"
  Description: This is similar to the ngram tokenizer. The difference is that each token is anchored to the start of the candidate word.
  Custom tokenizer: {"type":"edge_ngram", "min_gram":2, "max_gram":2, "token_chars": ["punctuation", "digit"]}
  Output tokens: ["7."]
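The Custom tokenizer entries in this table are the inline definitions you can pass straight to _analyze. For example, the ngram entry above can be reproduced as follows:

  POST _analyze
  {
    "tokenizer": {
      "type": "ngram",
      "min_gram": 2,
      "max_gram": 2,
      "token_chars": ["punctuation", "digit"]
    },
    "text": "Elasticsearch 7.0"
  }

Because only digit and punctuation characters are kept, the only candidate text is 7.0, and its two-character ngrams are 7. and .0, matching the Output tokens entry.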


Let's take a look at the Structured text tokenizer, as described in the following table:

Structured text tokenizer

keyword
  Input text: "Elasticsearch 7.0"
  Description: This outputs the entire input character stream as a single term.
  Output tokens: ["Elasticsearch 7.0"]

pattern
  Input text: "Elasticsearch 7.0"
  Description: This uses a regular expression to split the input character stream into terms. The default pattern is \W+. Use pattern to specify the regular expression, flags to specify the Java regular expression flags, and group to specify which capture group to extract as the token.
  Output tokens: ["Elasticsearch", "7", "0"]

char_group
  Input text: "Elasticsearch 7.0"
  Description: You can define a set of separators to split the input character stream into terms. Use tokenize_on_chars to specify the list of separators.
  Custom tokenizer: {"type":"char_group", "tokenize_on_chars": ["whitespace", "punctuation"]}
  Output tokens: ["Elasticsearch", "7", "0"]

simple_pattern
  Input text: "Elasticsearch 7.0"
  Description: This is similar to the pattern tokenizer, but it uses Lucene regular expressions, which usually makes tokenization faster (for more information, refer to https://lucene.apache.org/core/7_0_1/core/org/apache/lucene/util/automaton/RegExp.html). The following example matches only words made up of letters.
  Custom tokenizer: {"type":"simple_pattern", "pattern": "[a-zA-Z]*"}
  Output tokens: ["Elasticsearch"]

simple_pattern_split
  Input text: "Elasticsearch 7.0"
  Description: You can define a pattern as the separator to split the input character stream into terms. Use pattern to specify the separator pattern.
  Custom tokenizer: {"type":"simple_pattern_split", "pattern": "[a-zA-Z.]*"}
  Output tokens: ["7", "0"]

path_hierarchy
  Input text: "/usr/local"
  Description: This uses the path separator to split the input character stream into terms. The following parameters can be set: delimiter (the separator), replacement (the character that replaces the delimiter in the output), buffer_size (the number of characters read into the buffer in one batch), reverse (this reverses the generated terms), and skip (the number of generated terms to skip).
  Custom tokenizer: {"type":"path_hierarchy", "replacement":"_"}
  Output tokens: ["_usr", "_usr_local"]
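To use one of these tokenizers for indexing and searching, rather than just through _analyze, you register it under analysis.tokenizer in the index settings and reference it from a custom analyzer. The following is a minimal sketch for the path_hierarchy example above; the index name test_index and the names my_path_tokenizer and my_path_analyzer are placeholders:

  PUT test_index
  {
    "settings": {
      "analysis": {
        "tokenizer": {
          "my_path_tokenizer": {
            "type": "path_hierarchy",
            "replacement": "_"
          }
        },
        "analyzer": {
          "my_path_analyzer": {
            "type": "custom",
            "tokenizer": "my_path_tokenizer"
          }
        }
      }
    }
  }

You can then run the analyzer that wraps the tokenizer against the new index:

  POST test_index/_analyze
  {
    "analyzer": "my_path_analyzer",
    "text": "/usr/local"
  }

which should return the same ["_usr", "_usr_local"] tokens as in the table.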