The tokenizer in an analyzer receives the output character stream from the character filters and splits it into a token stream, which is the input to the token filters. Elasticsearch supports three types of tokenizer, described as follows:
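This chain (character filters → tokenizer → token filters) can be illustrated with a minimal Python sketch; the function names and the specific filters are illustrative choices of ours, not Elasticsearch APIs:

```python
# A toy model of an analyzer chain: character filters feed the
# tokenizer, whose token stream feeds the token filters.

def char_filter(text):
    # e.g., expand a symbol the way a mapping character filter might
    return text.replace("&", " and ")

def tokenizer(text):
    # whitespace-style tokenization: split on runs of whitespace
    return text.split()

def token_filter(tokens):
    # a lowercase token filter
    return [token.lower() for token in tokens]

def analyze(text):
    # character stream -> token stream -> filtered token stream
    return token_filter(tokenizer(char_filter(text)))

print(analyze("Rock & Roll"))  # ['rock', 'and', 'roll']
```

In Elasticsearch itself, you would exercise the same chain with the `_analyze` API rather than local code; the sketch only shows how the three stages compose.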
- Word-oriented tokenizer: This splits full text into individual words.
- Partial word tokenizer: This splits text into small fragments of a bounded length, for partial-word matching.
- Structured text tokenizer: This splits the character stream into known structured tokens, such as keywords, email addresses, and zip codes.
We'll give an example for each built-in tokenizer and compile the results into the following tables. Let's first take a look at the Word-oriented tokenizer:
**Word-oriented tokenizer**

| Tokenizer | Field | Value |
| --- | --- | --- |
| standard | Input text | "POST https://api.iextrading.com/1.0/stock/acwf/company /usr/local" |
| | Description | Grammar-based tokenization; supports the max_token_length parameter to divide the input text into segments. |
| | Output tokens | ["POST", "https", "api.iextrading.com", "1.0", "stock", "acwf", "company", "usr", "local"] |
| letter | Input text | "POST https://api.iextrading.com/1.0/stock/acwf/company /usr/local" |
| | Description | Uses non-letters as separators to split the character stream into terms. |
| | Output tokens | ["POST", "https", "api", "iextrading", "com", "stock", "acwf", "company", "usr", "local"] |
| lowercase | Input text | "POST https://api.iextrading.com/1.0/stock/acwf/company /usr/local" |
| | Description | Similar to the letter tokenizer, it uses non-letters as separators to tokenize the input text; in addition, it converts the letters from uppercase to lowercase. |
| | Output tokens | ["post", "https", "api", "iextrading", "com", "stock", "acwf", "company", "usr", "local"] |
| whitespace | Input text | "POST https://api.iextrading.com/1.0/stock/acwf/company /usr/local" |
| | Description | Uses whitespace characters as separators to split the character stream into terms. |
| | Output tokens | ["POST", "https://api.iextrading.com/1.0/stock/acwf/company", "/usr/local"] |
| uax_url_email | Input text | "POST https://api.iextrading.com/1.0/stock/acwf/company /usr/local" |
| | Description | Like the standard tokenizer, but keeps URLs and email addresses as single terms. |
| | Output tokens | ["POST", "https://api.iextrading.com/1.0/stock/acwf/company", "usr", "local"] |
| classic | Input text | "POST https://api.iextrading.com/1.0/stock/acwf/company 192.168.0.1 100-123" |
| | Description | Grammar-based tokenization; uses punctuation as a separator but retains some special forms, such as dots between non-whitespace characters, hyphens between numbers, email addresses, and internet hostnames. |
| | Output tokens | ["POST", "https", "api.iextrading.com", "1.0/stock", "acwf", "company", "192.168.0.1", "100-123"] |
| thai | Input text | "คุณจะรัก Elasticsearch 7.0" |
| | Description | Similar to the standard tokenizer, but segments Thai text into words. |
| | Output tokens | ["คุณ", "จะ", "รัก", "Elasticsearch", "7.0"] |
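The whitespace and letter behaviors above can be approximated in a few lines of Python. This is a sketch for intuition only: Elasticsearch's tokenizers are Lucene implementations, and the real letter tokenizer recognizes all Unicode letters, whereas this regex only handles ASCII:

```python
import re

def whitespace_tokenize(text):
    # split on runs of whitespace, like the whitespace tokenizer
    return text.split()

def letter_tokenize(text):
    # treat every non-letter as a separator, like the letter tokenizer
    # (ASCII-only approximation of Lucene's Unicode letter classes)
    return [t for t in re.split(r"[^A-Za-z]+", text) if t]

url = "POST https://api.iextrading.com/1.0/stock/acwf/company /usr/local"
print(whitespace_tokenize(url))
# ['POST', 'https://api.iextrading.com/1.0/stock/acwf/company', '/usr/local']
print(letter_tokenize(url))
# ['POST', 'https', 'api', 'iextrading', 'com', 'stock', 'acwf', 'company', 'usr', 'local']
```

Note how "1.0" disappears entirely from the letter tokenizer's output: digits and dots are all separators, so they merge into one gap.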
Let's take a look at the Partial word tokenizer, as described in the following table:
**Partial word tokenizer**

| Tokenizer | Field | Value |
| --- | --- | --- |
| ngram | Input text | "Elasticsearch 7.0" |
| | Description | Slides a window along the input character stream, emitting substrings of the specified lengths. Use min_gram (defaults to 1) and max_gram (defaults to 2) to specify the lengths, and token_chars to specify which character classes (letter, digit, whitespace, punctuation, symbol) are kept in tokens. |
| | Custom tokenizer | {"type":"ngram", "min_gram":2, "max_gram":2, "token_chars": ["punctuation", "digit"]} |
| | Output tokens | ["7.", ".0"] |
| edge_ngram | Input text | "Elasticsearch 7.0" |
| | Description | Similar to the ngram tokenizer. The difference is that each gram is anchored to the starting point of the candidate word. |
| | Custom tokenizer | {"type":"edge_ngram", "min_gram":2, "max_gram":2, "token_chars": ["punctuation", "digit"]} |
| | Output tokens | ["7."] |
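The sliding-window behavior can be sketched in Python. This is a simplified model, not the Lucene implementation: it keeps only digit and punctuation characters (mimicking the token_chars setting in the custom tokenizers above, with an ad hoc ASCII punctuation class) and treats the surviving characters as one fragment:

```python
import re

# ad hoc approximation of token_chars: ["punctuation", "digit"]
KEEP = r"[0-9.,!?;:]"

def ngram(text, min_gram=1, max_gram=2):
    # drop characters outside the allowed classes, then slide a
    # window of every length from min_gram to max_gram
    stripped = "".join(re.findall(KEEP, text))
    grams = []
    for size in range(min_gram, max_gram + 1):
        for i in range(len(stripped) - size + 1):
            grams.append(stripped[i:i + size])
    return grams

def edge_ngram(text, min_gram=1, max_gram=2):
    # like ngram, but every gram is anchored at the start
    stripped = "".join(re.findall(KEEP, text))
    return [stripped[:size] for size in range(min_gram, max_gram + 1)
            if size <= len(stripped)]

print(ngram("Elasticsearch 7.0", 2, 2))       # ['7.', '.0']
print(edge_ngram("Elasticsearch 7.0", 2, 2))  # ['7.']
```

For "Elasticsearch 7.0" only "7.0" survives the character filter, so the 2-grams are "7." and ".0", and the edge variant keeps just the first of them.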
Let's take a look at the Structured text tokenizer, as described in the following table:
**Structured text tokenizer**

| Tokenizer | Field | Value |
| --- | --- | --- |
| keyword | Input text | "Elasticsearch 7.0" |
| | Description | Outputs the input character stream unchanged as a single term. |
| | Output tokens | ["Elasticsearch 7.0"] |
| pattern | Input text | "Elasticsearch 7.0" |
| | Description | Uses a regular expression to split the input character stream into terms. The default pattern is \W+. Use pattern to specify the regular expression, flags to specify the Java regular expression flags, and group to specify the group to extract. |
| | Output tokens | ["Elasticsearch", "7", "0"] |
| char_group | Input text | "Elasticsearch 7.0" |
| | Description | Splits the input character stream on a defined set of separators. Use tokenize_on_chars to specify the list of separators. |
| | Custom tokenizer | {"type":"char_group", "tokenize_on_chars": ["whitespace", "punctuation"]} |
| | Output tokens | ["Elasticsearch", "7", "0"] |
| simple_pattern | Input text | "Elasticsearch 7.0" |
| | Description | Similar to the pattern tokenizer, but uses Lucene regular expressions, so tokenization is usually faster (for more information, refer to https://lucene.apache.org/core/7_0_1/core/org/apache/lucene/util/automaton/RegExp.html). The following example matches words made only from letters. |
| | Custom tokenizer | {"type":"simple_pattern", "pattern": "[a-zA-Z]*"} |
| | Output tokens | ["Elasticsearch"] |
| simple_pattern_split | Input text | "Elasticsearch 7.0" |
| | Description | Uses the given pattern as a separator to split the input character stream into terms. Use pattern to specify the separator pattern. |
| | Custom tokenizer | {"type":"simple_pattern_split", "pattern": "[a-zA-Z.]*"} |
| | Output tokens | ["7", "0"] |
| path_hierarchy | Input text | "/usr/local" |
| | Description | Uses the path separator to split the input character stream into hierarchical terms. The following parameters can be set: delimiter (the separator), replacement (the character that replaces the delimiter in the output), buffer_size (the maximum number of characters processed in one batch), reverse (emits the terms in reverse order), and skip (the number of leading levels to skip). |
| | Custom tokenizer | {"type":"path_hierarchy", "replacement":"_"} |
| | Output tokens | ["_usr", "_usr_local"] |
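The path_hierarchy tokenizer emits one term per level of the path, each containing all levels above it. A minimal Python sketch of that behavior, covering only the delimiter and replacement parameters (the real tokenizer also supports buffer_size, reverse, and skip, and this sketch assumes an absolute path starting with the delimiter):

```python
def path_hierarchy(text, delimiter="/", replacement=None):
    # emit one token per hierarchy level: /usr, then /usr/local, ...
    if replacement is None:
        replacement = delimiter
    parts = text.split(delimiter)  # "/usr/local" -> ["", "usr", "local"]
    tokens = []
    for i in range(2, len(parts) + 1):
        # rejoin the first i components with the replacement character
        tokens.append(replacement.join(parts[:i]))
    return tokens

print(path_hierarchy("/usr/local"))                   # ['/usr', '/usr/local']
print(path_hierarchy("/usr/local", replacement="_"))  # ['_usr', '_usr_local']
```

With replacement set to "_", the delimiter is swapped in the emitted terms, which reproduces the ["_usr", "_usr_local"] output shown in the table.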