Built-in analyzers

Elasticsearch comes with several analyzers in its standard installation. Some of them are described in the following list:

  • Standard Analyzer: This uses the Standard Tokenizer to divide text. Its other components are the Standard Token Filter, the Lower Case Token Filter, and the Stop Token Filter. It normalizes tokens, lowercases them, and removes unwanted tokens. By default, Elasticsearch applies the standard analyzer.
  • Simple Analyzer: This uses the Lower Case Tokenizer to divide text, so the text is split at non-letter characters and the tokens are lowercased.
  • Whitespace Analyzer: This uses the Whitespace Tokenizer to divide text at spaces.
  • Stop Analyzer: This uses the Lower Case Tokenizer to divide text and adds the Stop Token Filter, which removes stop words from the token stream.
  • Pattern Analyzer: This uses a regular expression to divide text. It accepts lowercase and stopwords settings.
  • Language Analyzers: These are a set of analyzers that analyze text for a specific language. The supported languages are: Arabic, Armenian, Basque, Brazilian, Bulgarian, Catalan, Chinese, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Latvian, Norwegian, Persian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish, and Thai.
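
To see the difference between two of these analyzers quickly, you can use the _analyze API, as in the examples later in this chapter. The following is a small sketch that assumes a node running on localhost:9200; the sample text is arbitrary:

curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'The Quick Brown-Fox!'
curl -XGET 'localhost:9200/_analyze?analyzer=whitespace&pretty' -d 'The Quick Brown-Fox!'

The first request should return lowercased tokens split at word boundaries (for example, quick, brown, and fox), whereas the second should return The, Quick, and Brown-Fox! unchanged, because the whitespace analyzer only splits text at spaces.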

Analyzers fulfill the following three main functions using character filters, a tokenizer, and token filters:

  • Filtering of characters
  • Tokenization
  • Filtering of terms

Let's now look closely at how each of these main functions is realized.

Building blocks of Analyzer

In the analysis process, a tokenizer is used to break the text into tokens. Before this operation, the text is passed through any character filters. After tokenization, the token filters start working.
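
The following request is a minimal sketch of this order, assuming the query-string parameters of the _analyze API used elsewhere in this chapter and a node on localhost:9200; the sample text is arbitrary. The character filter removes the HTML markup first, the standard tokenizer then splits the remaining text, and the lowercase token filter finally normalizes the tokens:

curl -XGET 'localhost:9200/_analyze?char_filters=html_strip&tokenizer=standard&token_filters=lowercase&pretty' -d '<b>Lovers</b> Are IMMORTAL'

This should return the tokens lovers, are, and immortal.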

Character filters

Character filters preprocess the text before it is passed to the tokenizer in the analysis process. Elasticsearch has built-in character filters. You can also create your own character filters to meet your needs.

HTML Strip Char filter

This filter strips HTML markup out of the analyzed text and decodes HTML entities. For example, consider the following verse by the Turkish poet and Sufi mystic Yunus Emre:

&#194;&#351;&#305;klar &#246;lmez!

As you can see, HTML decimal character references are used instead of the Turkish and Latin accented characters. The original text is Âşıklar ölmez! (Translation: lovers are immortal!) Let's see what result we get when this text is analyzed with the standard tokenizer:

curl -XGET 'localhost:9200/_analyze?tokenizer=standard&pretty' -d '&#194;&#351;&#305;klar &#246;lmez!'
{
  "tokens" : [ {
    "token" : "194",
    "start_offset" :2,
    "end_offset" :5,
    "type" : "<NUM>",
    "position" : 1
  }, {
    "token" : "351",
    "start_offset" :8,
    "end_offset" :11,
    "type" : "<NUM>",
    "position" : 2
  }, {
    "token" : "305",
    "start_offset" :14,
    "end_offset" :17,
    "type" : "<NUM>",
    "position" : 3
  }, {
    "token" : "klar",
    "start_offset" :18,
    "end_offset" :22,
    "type" : "<ALPHANUM>",
    "position" : 4
  }, {
    "token" : "246",
    "start_offset" :25,
    "end_offset" :28,
    "type" : "<NUM>",
    "position" : 5
  }, {
    "token" : "lmez",
    "start_offset" :29,
    "end_offset" :33,
    "type" : "<ALPHANUM>",
    "position" : 6
  } ]
}

As you can see, these results are not useful or user-friendly. Remember, if the text is analyzed in this way, documents containing the word Âşıklar will not be returned when we search for the word Âşıklar. In this case, we need a filter that converts the HTML character references back into characters. The HTML Strip Char Filter performs this job, as shown:

curl -XGET 'localhost:9200/_analyze?tokenizer=standard&char_filters=html_strip&pretty' -d '&#194;&#351;&#305;klar &#246;lmez!'
{
  "tokens" : [ {
    "token" : "Âşıklar",
    "start_offset" :0,
    "end_offset" :22,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "ölmez",
    "start_offset" :23,
    "end_offset" :33,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}

Pattern Replace Char filter

This char filter allows you to use a regular expression to manipulate characters. The usage of this filter is exemplified in the Creating a Custom Analyzer section.

Tokenizer

A token is one of the basic concepts of lexical analysis in computer science: a sequence of characters (that is, a string) is turned into a sequence of tokens. For example, the string hello world becomes [hello, world]. Elasticsearch has several tokenizers that are used to break a string down into a stream of terms or tokens. A simple tokenizer may split the string into terms wherever it encounters word boundaries, whitespace, or punctuation.
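
As a minimal sketch of this, assuming a node running on localhost:9200 as in the other examples in this chapter:

curl -XGET 'localhost:9200/_analyze?tokenizer=standard&pretty' -d 'hello world'

This should return the two tokens hello and world, each with its start and end offsets, its type, and its position in the stream.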

Elasticsearch has built-in tokenizers, and you can combine them with character filters to create custom analyzers. Some of the tokenizers are described in the following list:

  • Standard Tokenizer: This finds the boundaries between words and divides the text there. To do this, it uses the Unicode Text Segmentation algorithm.
  • Letter Tokenizer: This divides text at non-letter characters.
  • Lowercase Tokenizer: This divides text at non-letter characters and lowercases the tokens, performing the function of the Letter Tokenizer and the Lower Case Token Filter together.
  • Whitespace Tokenizer: This divides text at spaces.
  • Pattern Tokenizer: This divides text via a regular expression.
  • UAX Email URL Tokenizer: This works like the standard tokenizer, but treats e-mail addresses and URLs as single tokens.
  • Path Hierarchy Tokenizer: This divides text at path delimiters (the default delimiter character is '/').
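
For instance, the following is a small sketch of the Path Hierarchy Tokenizer in action, again assuming a local node on localhost:9200; the sample path is arbitrary:

curl -XGET 'localhost:9200/_analyze?tokenizer=path_hierarchy&pretty' -d '/usr/local/elasticsearch'

This should emit one token per level of the path: /usr, /usr/local, and /usr/local/elasticsearch.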

Note

If you want more information about the Unicode Standard Annex #29, refer to http://unicode.org/reports/tr29/.

Token filters

Token filters accept a stream of tokens from a tokenizer and can modify, add, or remove tokens. Elasticsearch has built-in token filters. Some of them are described in the following list:

  • ASCII Folding Token Filter: This converts alphabetic, numeric, and symbolic Unicode characters that are not in the first 127 ASCII characters into their ASCII equivalents, where such equivalents exist.
  • Length Token Filter: This removes words that are longer or shorter than the specified lengths.
  • Lowercase Token Filter: This normalizes token text to lower case.
  • Uppercase Token Filter: This normalizes token text to upper case.
  • Stop Token Filter: This removes stop words (specified words, for example, the, is, are, and so on) from token streams.
  • Reverse Token Filter: This simply reverses each token.
  • Trim Token Filter: This trims the whitespace surrounding a token.
  • Normalization Token Filters: These normalize the special characters of a certain language (for example, Arabic, German, or Persian).
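
Token filters can also be chained. The following is a small sketch that applies the Lowercase and ASCII Folding token filters to the verse used earlier, assuming the same query-string style of the _analyze API and the local node used in this chapter:

curl -XGET 'localhost:9200/_analyze?tokenizer=standard&token_filters=lowercase,asciifolding&pretty' -d 'Âşıklar ölmez!'

This should return asiklar and olmez, with the Turkish accented characters folded to their closest ASCII equivalents.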
