Token filters

The main function of a token filter is to add, modify, or delete the output tokens from the tokenizer. There are approximately 50 built-in token filters; we'll cover some popular ones in the following table. You can find out more and learn about the remaining token filters at https://www.elastic.co/guide/en/elasticsearch/reference/7.x/analysis-tokenfilters.html. Each example in the table uses a standard tokenizer and the specified token filter, with no character filter applied.
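You can reproduce each example with the _analyze API by supplying the tokenizer, the token filter, and the input text directly in the request body. As a minimal sketch, the first (asciifolding) entry in the table corresponds to the following request, run against your own cluster:

POST _analyze
{
  "tokenizer": "standard",
  "filter": [{"type": "asciifolding", "preserve_original": true}],
  "text": "Ÿőű'ľľ ľőνė Ȅľȁśťĩćŝėȁŕćĥ 7.0"
}

The table of token filters is as follows: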

asciifolding
Input text: "Ÿőű'ľľ ľőνė Ȅľȁśťĩćŝėȁŕćĥ 7.0"
Description: This converts alphabetic, numeric, and symbolic Unicode characters that are not in the first 127 ASCII characters to their ASCII equivalents, where such equivalents exist. The preserve_original parameter (this defaults to false) also retains the original terms when it is set to true.
Custom token filter: [{"type":"asciifolding", "preserve_original":true}]
Output tokens: ["You'll", "Ÿőű'ľľ", "love", "ľőνė", "Elasticsearch", "Ȅľȁśťĩćŝėȁŕćĥ", "7.0"]

ngram
Input text: "You'll love Elasticsearch 7.0"
Description: This slides a window along each term from the tokenizer and emits substrings of the specified lengths. Use min_gram (this defaults to 1) and max_gram (this defaults to 2) to specify the length range.
Custom token filter: [{"type":"ngram", "min_gram":10, "max_gram":10}]
Output tokens: ["Elasticsea", "lasticsear", "asticsearc", "sticsearch"]

edge_ngram
Input text: "You'll love Elasticsearch 7.0"
Description: This is similar to the ngram token filter, except that each substring is anchored to the beginning of the term.
Custom token filter: [{"type":"edge_ngram", "min_gram":10, "max_gram":10}]
Output tokens: ["Elasticsea"]

lowercase (uppercase)
Input text: "You'll love Elasticsearch 7.0"
Description: This converts all the letters of the terms to lowercase (or uppercase).
Custom token filter: ["lowercase"] (["uppercase"])
Output tokens: ["you'll", "love", "elasticsearch", "7.0"]

fingerprint
Input text: "You'll love Elasticsearch 7.0"
Description: This sorts, deduplicates, and concatenates the terms from the tokenizer into a single term. The separator parameter (this defaults to the space character) can be set to another character. The max_output_size parameter (this defaults to 255) limits the length of the concatenated term; if the limit is exceeded, no term is emitted.
Custom token filter: ["fingerprint"]
Output tokens: ["7.0 Elasticsearch You'll love"]

keep
Input text: "You'll love Elasticsearch 7.0"
Description: This keeps only those terms defined in the specified word list. Three options are provided: the keep_words parameter specifies the list of words inline in the filter; the keep_words_path parameter specifies a file path containing the word list; and the keep_words_case parameter (this defaults to false) lowercases the word list first.
Custom token filter: [{"type": "keep", "keep_words":["Elasticsearch", "7.0"]}]
Output tokens: ["Elasticsearch", "7.0"]

keep_types
Input text: "You'll love Elasticsearch 7.0"
Description: This keeps only those terms whose token type is in the specified list. One option is provided: the mode parameter (this defaults to include) determines whether the listed types are included or excluded.
Custom token filter: [{"type": "keep_types", "types":["<NUM>"]}]
Output tokens: ["7.0"]

stemmer
Input text: "love loves loved loving"
Description: This applies a stemmer in the specified language to the terms.
Custom token filter: [{"type": "stemmer", "name":"english"}]
Output tokens: ["love", "love", "love", "love"]

stop
Input text: "A an The and Elasticsearch"
Description: This deletes stop words from the terms. Predefined stop word lists are provided for different languages, such as _english_ and _spanish_. Use stopwords to specify the list of words to remove (this defaults to _english_). Use stopwords_path to specify a file path, relative to the config location, that contains the list of words to remove. Use ignore_case (this defaults to false) to lowercase the terms before matching. Use remove_trailing (this defaults to true) to control whether the last term is removed when it is a stop word.
Custom token filter: [{"type": "stop", "stopwords":["_english_"], "ignore_case":true}]
Output tokens: ["Elasticsearch"]

unique
Input text: "love loves loved loving"
Description: This removes duplicate terms. In this example, the custom filter chain combines the stemmer token filter with the unique token filter.
Custom token filter: [{"type": "stemmer", "name":"english"}, "unique"]
Output tokens: ["love"]

conditional
Input text: "You'll love Elasticsearch 7.0"
Description: This applies a list of token filters to a term only if the term matches a predicate script. Use the script parameter to specify the predicate and the filter parameter to specify the list of token filters. In the following example, the predicate matches the alphanumeric token type, so the reverse token filter reverses those terms.
Custom token filter: [{"type": "condition", "script":{"source":"token.getType()=='<ALPHANUM>'"}, "filter":["reverse"]}]
Output tokens: ["ll'uoY", "evol", "hcraescitsalE", "7.0"]

predicate_token_filter
Input text: "You'll love Elasticsearch 7.0"
Description: This removes any term that does not match a predicate script. Use the script parameter to specify the predicate. You can refer to the Elasticsearch Java documentation for the token properties available to the script (https://static.javadoc.io/org.elasticsearch/elasticsearch/7.0.0-beta1/org/elasticsearch/action/admin/indices/analyze/AnalyzeResponse.AnalyzeToken.html).
Custom token filter: [{"type": "predicate_token_filter", "script":{"source":"token.getType()=='<NUM>'"}}]
Output tokens: ["7.0"]
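
In practice, token filters such as these are usually registered in the index settings as part of a custom analyzer rather than tested ad hoc. The following is a minimal sketch, assuming a hypothetical index named my_index, a hypothetical filter name english_stemmer, and a hypothetical analyzer name my_analyzer; it chains the lowercase, stemmer, and unique filters from the table:

PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stemmer": {
          "type": "stemmer",
          "name": "english"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "english_stemmer", "unique"]
        }
      }
    }
  }
}

Fields that use my_analyzer will then have their terms lowercased, stemmed, and deduplicated at index time.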

word_delimiter is a more complex token filter, so we will introduce it separately. Essentially, it treats all non-alphanumeric characters as separators and splits the terms from the output of the tokenizer at those characters. In addition, it has many parameters that shape its behavior. Let's explore these in more detail in the following table; each example uses a standard tokenizer and the word_delimiter token filter, and shows the output tokens for both the true and false values of the parameter. Note that no character filter is applied:

generate_word_parts
Input text: "ElasticSearch 7.0"
Description: This generates word subwords from a term when the case changes.
Custom token filter: [{"type":"word_delimiter", "generate_word_parts": true|false}]
Output tokens (true, the default): ["Elastic", "Search", "7", "0"]
Output tokens (false): ["7", "0"]

generate_number_parts
Input text: "ElasticSearch 7.0"
Description: This generates number subwords from a term.
Custom token filter: [{"type":"word_delimiter", "generate_number_parts": true|false}]
Output tokens (true, the default): ["Elastic", "Search", "7", "0"]
Output tokens (false): ["Elastic", "Search"]

catenate_words
Input text: "Elastic_Search 7.0"
Description: This concatenates the split word parts that come from the same original term into a single term.
Custom token filter: [{"type":"word_delimiter", "catenate_words": true|false}]
Output tokens (true): ["Elastic", "ElasticSearch", "Search", "7", "0"]
Output tokens (false, the default): ["Elastic", "Search", "7", "0"]

catenate_numbers
Input text: "Elastic_Search 7.0"
Description: This concatenates the split number parts that come from the same original term into a single term.
Custom token filter: [{"type":"word_delimiter", "catenate_numbers": true|false}]
Output tokens (true): ["Elastic", "Search", "7", "70", "0"]
Output tokens (false, the default): ["Elastic", "Search", "7", "0"]

catenate_all
Input text: "Elastic_Search 7.0"
Description: This concatenates all the split word parts and number parts that come from the same original term into a single term.
Custom token filter: [{"type":"word_delimiter", "catenate_all": true|false}]
Output tokens (true): ["Elastic", "ElasticSearch", "Search", "7", "70", "0"]
Output tokens (false, the default): ["Elastic", "Search", "7", "0"]

split_on_case_change
Input text: "ElasticSearch 7.0"
Description: When this is false, terms are not split at case changes.
Custom token filter: [{"type":"word_delimiter", "split_on_case_change": true|false}]
Output tokens (true, the default): ["Elastic", "Search", "7", "0"]
Output tokens (false): ["ElasticSearch", "7", "0"]

preserve_original
Input text: "Elastic_Search 7.0"
Description: When this is true, the original terms from the tokenizer are preserved alongside the split terms.
Custom token filter: [{"type":"word_delimiter", "preserve_original": true|false}]
Output tokens (true): ["Elastic_Search", "Elastic", "Search", "7.0", "7", "0"]
Output tokens (false, the default): ["Elastic", "Search", "7", "0"]

split_on_numerics
Input text: "SN12X"
Description: When this is false, terms are not split at letter-number transitions.
Custom token filter: [{"type":"word_delimiter", "split_on_numerics":true|false}]
Output tokens (true, the default): ["SN", "12", "X"]
Output tokens (false): ["SN12X"]

stem_english_possessive
Input text: "Elasticsearch's analyzer"
Description: This removes the trailing English possessive ('s) from each term.
Custom token filter: [{"type":"word_delimiter", "stem_english_possessive":true|false}]
Output tokens (true, the default): ["Elasticsearch", "analyzer"]
Output tokens (false): ["Elasticsearch", "s", "analyzer"]

 

Two other parameters, protected_words and type_table, also have special uses; you can find out more about them at https://www.elastic.co/guide/en/elasticsearch/reference/7.x/analysis-word-delimiter-tokenfilter.html.
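
As with the earlier filters, the _analyze API is a convenient way to observe how these parameters interact. The following minimal sketch combines catenate_all with preserve_original on the same sample text used in the table:

POST _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "word_delimiter",
      "catenate_all": true,
      "preserve_original": true
    }
  ],
  "text": "Elastic_Search 7.0"
}

Based on the table rows above, the response should contain the original terms, the split word and number parts, and the concatenated forms.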
