The tokenizer in an analyzer receives the output character stream from the character filters and splits it into a token stream, which is the input to the token filters. Elasticsearch supports three types of tokenizer, described as follows:
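This chain (character filters → tokenizer → token filters) can be illustrated with a minimal Python sketch; the function names and the specific filters are illustrative choices of ours, not Elasticsearch APIs:

```python
# A toy model of an analyzer chain: character filters feed the
# tokenizer, whose token stream feeds the token filters.

def char_filter(text):
    # e.g., expand a symbol the way a mapping character filter might
    return text.replace("&", " and ")

def tokenizer(text):
    # whitespace-style tokenization: split on runs of whitespace
    return text.split()

def token_filter(tokens):
    # a lowercase token filter
    return [token.lower() for token in tokens]

def analyze(text):
    # character stream -> token stream -> filtered token stream
    return token_filter(tokenizer(char_filter(text)))

print(analyze("Rock & Roll"))  # ['rock', 'and', 'roll']
```

In Elasticsearch itself, you would exercise the same chain with the `_analyze` API rather than local code; the sketch only shows how the three stages compose.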
- Word-oriented tokenizer: This splits full text into individual words.
- Partial word tokenizer: This splits text into small fragments of a bounded length, for partial-word matching.
- Structured text tokenizer: This splits the character stream into known structured tokens, such as keywords, email addresses, and zip codes.
We'll give an example for each built-in tokenizer and compile the results into the following tables. Let's first take a look at the Word-oriented tokenizer:
**Word-oriented tokenizer**

| Tokenizer | Field | Value |
| --- | --- | --- |
| standard | Input text | "POST https://api.iextrading.com/1.0/stock/acwf/company /usr/local" |
| | Description | Grammar-based tokenization; supports the max_token_length parameter to divide the input text into segments. |
| | Output tokens | ["POST", "https", "api.iextrading.com", "1.0", "stock", "acwf", "company", "usr", "local"] |
| letter | Input text | "POST https://api.iextrading.com/1.0/stock/acwf/company /usr/local" |
| | Description | Uses non-letters as separators to split the character stream into terms. |
| | Output tokens | ["POST", "https", "api", "iextrading", "com", "stock", "acwf", "company", "usr", "local"] |
| lowercase | Input text | "POST https://api.iextrading.com/1.0/stock/acwf/company /usr/local" |
| | Description | Similar to the letter tokenizer, it uses non-letters as separators to tokenize the input text; in addition, it converts the letters from uppercase to lowercase. |
| | Output tokens | ["post", "https", "api", "iextrading", "com", "stock", "acwf", "company", "usr", "local"] |
| whitespace | Input text | "POST https://api.iextrading.com/1.0/stock/acwf/company /usr/local" |
| | Description | Uses whitespace characters as separators to split the character stream into terms. |
| | Output tokens | ["POST", "https://api.iextrading.com/1.0/stock/acwf/company", "/usr/local"] |
| uax_url_email | Input text | "POST https://api.iextrading.com/1.0/stock/acwf/company /usr/local" |
| | Description | Like the standard tokenizer, but keeps URLs and email addresses as single terms. |
| | Output tokens | ["POST", "https://api.iextrading.com/1.0/stock/acwf/company", "usr", "local"] |
| classic | Input text | "POST https://api.iextrading.com/1.0/stock/acwf/company 192.168.0.1 100-123" |
| | Description | Grammar-based tokenization; uses punctuation as a separator but retains some special forms, such as dots between non-whitespace characters, hyphens between numbers, email addresses, and internet hostnames. |
| | Output tokens | ["POST", "https", "api.iextrading.com", "1.0/stock", "acwf", "company", "192.168.0.1", "100-123"] |
| thai | Input text | "คุณจะรัก Elasticsearch 7.0" |
| | Description | Similar to the standard tokenizer, but segments Thai text into words. |
| | Output tokens | ["คุณ", "จะ", "รัก", "Elasticsearch", "7.0"] |
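The whitespace and letter behaviors above can be approximated in a few lines of Python. This is a sketch for intuition only: Elasticsearch's tokenizers are Lucene implementations, and the real letter tokenizer recognizes all Unicode letters, whereas this regex only handles ASCII:

```python
import re

def whitespace_tokenize(text):
    # split on runs of whitespace, like the whitespace tokenizer
    return text.split()

def letter_tokenize(text):
    # treat every non-letter as a separator, like the letter tokenizer
    # (ASCII-only approximation of Lucene's Unicode letter classes)
    return [t for t in re.split(r"[^A-Za-z]+", text) if t]

url = "POST https://api.iextrading.com/1.0/stock/acwf/company /usr/local"
print(whitespace_tokenize(url))
# ['POST', 'https://api.iextrading.com/1.0/stock/acwf/company', '/usr/local']
print(letter_tokenize(url))
# ['POST', 'https', 'api', 'iextrading', 'com', 'stock', 'acwf', 'company', 'usr', 'local']
```

Note how "1.0" disappears entirely from the letter tokenizer's output: digits and dots are all separators, so they merge into one gap.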
Let's take a look at the Partial word tokenizer, as described in the following table:
**Partial word tokenizer**

| Tokenizer | Field | Value |
| --- | --- | --- |
| ngram | Input text | "Elasticsearch 7.0" |
| | Description | Slides a window along the input character stream, emitting substrings of the specified lengths. Use min_gram (defaults to 1) and max_gram (defaults to 2) to specify the lengths, and token_chars to specify which character classes (letter, digit, whitespace, punctuation, symbol) are kept in tokens. |
| | Custom tokenizer | {"type":"ngram", "min_gram":2, "max_gram":2, "token_chars": ["punctuation", "digit"]} |
| | Output tokens | ["7.", ".0"] |
| edge_ngram | Input text | "Elasticsearch 7.0" |
| | Description | Similar to the ngram tokenizer. The difference is that each gram is anchored to the starting point of the candidate word. |
| | Custom tokenizer | {"type":"edge_ngram", "min_gram":2, "max_gram":2, "token_chars": ["punctuation", "digit"]} |
| | Output tokens | ["7."] |
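The sliding-window behavior can be sketched in Python. This is a simplified model, not the Lucene implementation: it keeps only digit and punctuation characters (mimicking the token_chars setting in the custom tokenizers above, with an ad hoc ASCII punctuation class) and treats the surviving characters as one fragment:

```python
import re

# ad hoc approximation of token_chars: ["punctuation", "digit"]
KEEP = r"[0-9.,!?;:]"

def ngram(text, min_gram=1, max_gram=2):
    # drop characters outside the allowed classes, then slide a
    # window of every length from min_gram to max_gram
    stripped = "".join(re.findall(KEEP, text))
    grams = []
    for size in range(min_gram, max_gram + 1):
        for i in range(len(stripped) - size + 1):
            grams.append(stripped[i:i + size])
    return grams

def edge_ngram(text, min_gram=1, max_gram=2):
    # like ngram, but every gram is anchored at the start
    stripped = "".join(re.findall(KEEP, text))
    return [stripped[:size] for size in range(min_gram, max_gram + 1)
            if size <= len(stripped)]

print(ngram("Elasticsearch 7.0", 2, 2))       # ['7.', '.0']
print(edge_ngram("Elasticsearch 7.0", 2, 2))  # ['7.']
```

For "Elasticsearch 7.0" only "7.0" survives the character filter, so the 2-grams are "7." and ".0", and the edge variant keeps just the first of them.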
Let's take a look at the Structured text tokenizer, as described in the following table:
**Structured text tokenizer**

| Tokenizer | Field | Value |
| --- | --- | --- |
| keyword | Input text | "Elasticsearch 7.0" |
| | Description | Outputs the input character stream unchanged as a single term. |
| | Output tokens | ["Elasticsearch 7.0"] |
| pattern | Input text | "Elasticsearch 7.0" |
| | Description | Uses a regular expression to split the input character stream into terms. The default pattern is \W+. Use pattern to specify the regular expression, flags to specify the Java regular expression flags, and group to specify the group to extract. |
| | Output tokens | ["Elasticsearch", "7", "0"] |
| char_group | Input text | "Elasticsearch 7.0" |
| | Description | Splits the input character stream on a defined set of separators. Use tokenize_on_chars to specify the list of separators. |
| | Custom tokenizer | {"type":"char_group", "tokenize_on_chars": ["whitespace", "punctuation"]} |
| | Output tokens | ["Elasticsearch", "7", "0"] |
| simple_pattern | Input text | "Elasticsearch 7.0" |
| | Description | Similar to the pattern tokenizer, but uses Lucene regular expressions, so tokenization is usually faster (for more information, refer to https://lucene.apache.org/core/7_0_1/core/org/apache/lucene/util/automaton/RegExp.html). The following example matches words made only from letters. |
| | Custom tokenizer | {"type":"simple_pattern", "pattern": "[a-zA-Z]*"} |
| | Output tokens | ["Elasticsearch"] |
| simple_pattern_split | Input text | "Elasticsearch 7.0" |
| | Description | Uses the given pattern as a separator to split the input character stream into terms. Use pattern to specify the separator pattern. |
| | Custom tokenizer | {"type":"simple_pattern_split", "pattern": "[a-zA-Z.]*"} |
| | Output tokens | ["7", "0"] |
| path_hierarchy | Input text | "/usr/local" |
| | Description | Uses the path separator to split the input character stream into hierarchical terms. The following parameters can be set: delimiter (the separator), replacement (the character that replaces the delimiter in the output), buffer_size (the maximum number of characters processed in one batch), reverse (emits the terms in reverse order), and skip (the number of leading levels to skip). |
| | Custom tokenizer | {"type":"path_hierarchy", "replacement":"_"} |
| | Output tokens | ["_usr", "_usr_local"] |
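The path_hierarchy tokenizer emits one term per level of the path, each containing all levels above it. A minimal Python sketch of that behavior, covering only the delimiter and replacement parameters (the real tokenizer also supports buffer_size, reverse, and skip, and this sketch assumes an absolute path starting with the delimiter):

```python
def path_hierarchy(text, delimiter="/", replacement=None):
    # emit one token per hierarchy level: /usr, then /usr/local, ...
    if replacement is None:
        replacement = delimiter
    parts = text.split(delimiter)  # "/usr/local" -> ["", "usr", "local"]
    tokens = []
    for i in range(2, len(parts) + 1):
        # rejoin the first i components with the replacement character
        tokens.append(replacement.join(parts[:i]))
    return tokens

print(path_hierarchy("/usr/local"))                   # ['/usr', '/usr/local']
print(path_hierarchy("/usr/local", replacement="_"))  # ['_usr', '_usr_local']
```

With replacement set to "_", the delimiter is swapped in the emitted terms, which reproduces the ["_usr", "_usr_local"] output shown in the table.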