Token filters

The main function of a token filter is to add, modify, or delete the output tokens from the tokenizer. There are approximately 50 built-in token filters; we'll cover some popular ones in the following table. You can find out more and learn about the remaining token filters at https://www.elastic.co/guide/en/elasticsearch/reference/7.x/analysis-tokenfilters.html. Each example in the table uses a standard tokenizer and the specified token filter, with no character filter applied.
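You can reproduce each example with the _analyze API by supplying the tokenizer, the token filter, and the input text directly in the request body. As a minimal sketch, the first (asciifolding) entry in the table corresponds to the following request, run against your own cluster:

POST _analyze
{
  "tokenizer": "standard",
  "filter": [{"type": "asciifolding", "preserve_original": true}],
  "text": "Ÿőű'ľľ ľőνė Ȅľȁśťĩćŝėȁŕćĥ 7.0"
}

The table of token filters is as follows: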

asciifolding
Input text: "Ÿőű'ľľ ľőνė Ȅľȁśťĩćŝėȁŕćĥ 7.0"
Description: This converts alphabetic, numeric, and symbolic Unicode characters that are not in the first 127 ASCII characters to their ASCII equivalents, where such equivalents exist. The preserve_original parameter (this defaults to false) also retains the original terms when it is set to true.
Custom token filter: [{"type":"asciifolding", "preserve_original":true}]
Output tokens: ["You'll", "Ÿőű'ľľ", "love", "ľőνė", "Elasticsearch", "Ȅľȁśťĩćŝėȁŕćĥ", "7.0"]

ngram
Input text: "You'll love Elasticsearch 7.0"
Description: This slides a window along each term from the tokenizer and emits substrings of the specified lengths. Use min_gram (this defaults to 1) and max_gram (this defaults to 2) to specify the length range.
Custom token filter: [{"type":"ngram", "min_gram":10, "max_gram":10}]
Output tokens: ["Elasticsea", "lasticsear", "asticsearc", "sticsearch"]

edge_ngram
Input text: "You'll love Elasticsearch 7.0"
Description: This is similar to the ngram token filter, except that each substring is anchored to the beginning of the term.
Custom token filter: [{"type":"edge_ngram", "min_gram":10, "max_gram":10}]
Output tokens: ["Elasticsea"]

lowercase (uppercase)
Input text: "You'll love Elasticsearch 7.0"
Description: This converts all the letters of the terms to lowercase (or uppercase).
Custom token filter: ["lowercase"] (["uppercase"])
Output tokens: ["you'll", "love", "elasticsearch", "7.0"]

fingerprint
Input text: "You'll love Elasticsearch 7.0"
Description: This sorts, deduplicates, and concatenates the terms from the tokenizer into a single term. The separator parameter (this defaults to the space character) can be set to another character. The max_output_size parameter (this defaults to 255) limits the length of the concatenated term; if the limit is exceeded, no term is emitted.
Custom token filter: ["fingerprint"]
Output tokens: ["7.0 Elasticsearch You'll love"]

keep
Input text: "You'll love Elasticsearch 7.0"
Description: This keeps only those terms defined in the specified word list. Three options are provided: the keep_words parameter specifies the list of words inline in the filter; the keep_words_path parameter specifies a file path containing the word list; and the keep_words_case parameter (this defaults to false) lowercases the word list first.
Custom token filter: [{"type": "keep", "keep_words":["Elasticsearch", "7.0"]}]
Output tokens: ["Elasticsearch", "7.0"]

keep_types
Input text: "You'll love Elasticsearch 7.0"
Description: This keeps only those terms whose token type is in the specified list. One option is provided: the mode parameter (this defaults to include) determines whether the listed types are included or excluded.
Custom token filter: [{"type": "keep_types", "types":["<NUM>"]}]
Output tokens: ["7.0"]

stemmer
Input text: "love loves loved loving"
Description: This applies a stemmer in the specified language to the terms.
Custom token filter: [{"type": "stemmer", "name":"english"}]
Output tokens: ["love", "love", "love", "love"]

stop
Input text: "A an The and Elasticsearch"
Description: This deletes stop words from the terms. Predefined stop word lists are provided for different languages, such as _english_ and _spanish_. Use stopwords to specify the list of words to remove (this defaults to _english_). Use stopwords_path to specify a file path, relative to the config location, that contains the list of words to remove. Use ignore_case (this defaults to false) to lowercase the terms before matching. Use remove_trailing (this defaults to true) to control whether the last term is removed when it is a stop word.
Custom token filter: [{"type": "stop", "stopwords":["_english_"], "ignore_case":true}]
Output tokens: ["Elasticsearch"]

unique
Input text: "love loves loved loving"
Description: This removes duplicate terms. In this example, the custom filter chain combines the stemmer token filter with the unique token filter.
Custom token filter: [{"type": "stemmer", "name":"english"}, "unique"]
Output tokens: ["love"]

conditional
Input text: "You'll love Elasticsearch 7.0"
Description: This applies a list of token filters to a term only if the term matches a predicate script. Use the script parameter to specify the predicate and the filter parameter to specify the list of token filters. In the following example, the predicate matches the alphanumeric token type, so the reverse token filter reverses those terms.
Custom token filter: [{"type": "condition", "script":{"source":"token.getType()=='<ALPHANUM>'"}, "filter":["reverse"]}]
Output tokens: ["ll'uoY", "evol", "hcraescitsalE", "7.0"]

predicate_token_filter
Input text: "You'll love Elasticsearch 7.0"
Description: This removes any term that does not match a predicate script. Use the script parameter to specify the predicate. You can refer to the Elasticsearch Java documentation for the token properties available to the script (https://static.javadoc.io/org.elasticsearch/elasticsearch/7.0.0-beta1/org/elasticsearch/action/admin/indices/analyze/AnalyzeResponse.AnalyzeToken.html).
Custom token filter: [{"type": "predicate_token_filter", "script":{"source":"token.getType()=='<NUM>'"}}]
Output tokens: ["7.0"]
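
In practice, token filters such as these are usually registered in the index settings as part of a custom analyzer rather than tested ad hoc. The following is a minimal sketch, assuming a hypothetical index named my_index, a hypothetical filter name english_stemmer, and a hypothetical analyzer name my_analyzer; it chains the lowercase, stemmer, and unique filters from the table:

PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stemmer": {
          "type": "stemmer",
          "name": "english"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "english_stemmer", "unique"]
        }
      }
    }
  }
}

Fields that use my_analyzer will then have their terms lowercased, stemmed, and deduplicated at index time.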

word_delimiter is a more complex token filter, so we will introduce it separately. Essentially, it treats all non-alphanumeric characters as separators and splits the terms from the output of the tokenizer at those characters. In addition, it has many parameters that shape its behavior. Let's explore these in more detail in the following table; each example uses a standard tokenizer and the word_delimiter token filter, and shows the output tokens for both the true and false values of the parameter. Note that no character filter is applied:

generate_word_parts
Input text: "ElasticSearch 7.0"
Description: This generates word subwords from a term when the case changes.
Custom token filter: [{"type":"word_delimiter", "generate_word_parts": true|false}]
Output tokens (true, the default): ["Elastic", "Search", "7", "0"]
Output tokens (false): ["7", "0"]

generate_number_parts
Input text: "ElasticSearch 7.0"
Description: This generates number subwords from a term.
Custom token filter: [{"type":"word_delimiter", "generate_number_parts": true|false}]
Output tokens (true, the default): ["Elastic", "Search", "7", "0"]
Output tokens (false): ["Elastic", "Search"]

catenate_words
Input text: "Elastic_Search 7.0"
Description: This concatenates the split word parts that come from the same original term into a single term.
Custom token filter: [{"type":"word_delimiter", "catenate_words": true|false}]
Output tokens (true): ["Elastic", "ElasticSearch", "Search", "7", "0"]
Output tokens (false, the default): ["Elastic", "Search", "7", "0"]

catenate_numbers
Input text: "Elastic_Search 7.0"
Description: This concatenates the split number parts that come from the same original term into a single term.
Custom token filter: [{"type":"word_delimiter", "catenate_numbers": true|false}]
Output tokens (true): ["Elastic", "Search", "7", "70", "0"]
Output tokens (false, the default): ["Elastic", "Search", "7", "0"]

catenate_all
Input text: "Elastic_Search 7.0"
Description: This concatenates all the split word parts and number parts that come from the same original term into a single term.
Custom token filter: [{"type":"word_delimiter", "catenate_all": true|false}]
Output tokens (true): ["Elastic", "ElasticSearch", "Search", "7", "70", "0"]
Output tokens (false, the default): ["Elastic", "Search", "7", "0"]

split_on_case_change
Input text: "ElasticSearch 7.0"
Description: When this is false, terms are not split at case changes.
Custom token filter: [{"type":"word_delimiter", "split_on_case_change": true|false}]
Output tokens (true, the default): ["Elastic", "Search", "7", "0"]
Output tokens (false): ["ElasticSearch", "7", "0"]

preserve_original
Input text: "Elastic_Search 7.0"
Description: When this is true, the original terms from the tokenizer are preserved alongside the split terms.
Custom token filter: [{"type":"word_delimiter", "preserve_original": true|false}]
Output tokens (true): ["Elastic_Search", "Elastic", "Search", "7.0", "7", "0"]
Output tokens (false, the default): ["Elastic", "Search", "7", "0"]

split_on_numerics
Input text: "SN12X"
Description: When this is false, terms are not split at letter-number transitions.
Custom token filter: [{"type":"word_delimiter", "split_on_numerics":true|false}]
Output tokens (true, the default): ["SN", "12", "X"]
Output tokens (false): ["SN12X"]

stem_english_possessive
Input text: "Elasticsearch's analyzer"
Description: This removes the trailing English possessive ('s) from each term.
Custom token filter: [{"type":"word_delimiter", "stem_english_possessive":true|false}]
Output tokens (true, the default): ["Elasticsearch", "analyzer"]
Output tokens (false): ["Elasticsearch", "s", "analyzer"]

 

Two other parameters, protected_words and type_table, also have special uses; you can find out more about them at https://www.elastic.co/guide/en/elasticsearch/reference/7.x/analysis-word-delimiter-tokenfilter.html.
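
As with the earlier filters, the _analyze API is a convenient way to observe how these parameters interact. The following minimal sketch combines catenate_all with preserve_original on the same sample text used in the table:

POST _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "word_delimiter",
      "catenate_all": true,
      "preserve_original": true
    }
  ],
  "text": "Elastic_Search 7.0"
}

Based on the table rows above, the response should contain the original terms, the split word and number parts, and the concatenated forms.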
