Built-in analyzers

Elasticsearch comes with several analyzers in its standard installation. Some of them are described in the following list:

  • Standard Analyzer: This uses the Standard Tokenizer to divide text. Its other components are the Standard Token Filter, the Lower Case Token Filter, and the Stop Token Filter. It normalizes tokens, lowercases them, and removes unwanted tokens. By default, Elasticsearch applies the standard analyzer.
  • Simple Analyzer: This uses the Lower Case Tokenizer to divide text, so the text is split at non-letter characters and the tokens are lowercased.
  • Whitespace Analyzer: This uses the Whitespace Tokenizer to divide text at spaces.
  • Stop Analyzer: This uses the Lower Case Tokenizer to divide text and adds the Stop Token Filter, which removes stop words from the token stream.
  • Pattern Analyzer: This uses a regular expression to divide text. It accepts lowercase and stopwords settings.
  • Language Analyzers: These are a set of analyzers that analyze text for a specific language. The supported languages are: Arabic, Armenian, Basque, Brazilian, Bulgarian, Catalan, Chinese, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Latvian, Norwegian, Persian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish, and Thai.
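
To see the difference between two of these analyzers quickly, you can use the _analyze API, as in the examples later in this chapter. The following is a small sketch that assumes a node running on localhost:9200; the sample text is arbitrary:

curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'The Quick Brown-Fox!'
curl -XGET 'localhost:9200/_analyze?analyzer=whitespace&pretty' -d 'The Quick Brown-Fox!'

The first request should return lowercased tokens split at word boundaries (for example, quick, brown, and fox), whereas the second should return The, Quick, and Brown-Fox! unchanged, because the whitespace analyzer only splits text at spaces.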

Analyzers fulfill the following three main functions using character filters, a tokenizer, and token filters:

  • Filtering of characters
  • Tokenization
  • Filtering of terms

Let's now look closely at how each of these main functions is realized.

Building blocks of Analyzer

In the analysis process, a tokenizer is used to break the text into tokens. Before this operation, the text is passed through any character filters. After tokenization, the token filters start working.
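
The following request is a minimal sketch of this order, assuming the query-string parameters of the _analyze API used elsewhere in this chapter and a node on localhost:9200; the sample text is arbitrary. The character filter removes the HTML markup first, the standard tokenizer then splits the remaining text, and the lowercase token filter finally normalizes the tokens:

curl -XGET 'localhost:9200/_analyze?char_filters=html_strip&tokenizer=standard&token_filters=lowercase&pretty' -d '<b>Lovers</b> Are IMMORTAL'

This should return the tokens lovers, are, and immortal.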

Character filters

Character filters preprocess the text before it is passed to the tokenizer in the analysis process. Elasticsearch has built-in character filters. You can also create your own character filters to meet your needs.

HTML Strip Char filter

This filter strips HTML markup out of the analyzed text and decodes HTML entities. For example, consider the following verse by the Turkish poet and Sufi mystic Yunus Emre:

&#194;&#351;&#305;klar &#246;lmez!

As you can see, HTML decimal character references are used instead of the Turkish and Latin accented characters. The original text is Âşıklar ölmez! (Translation: lovers are immortal!) Let's see what result we get when this text is analyzed with the standard tokenizer:

curl -XGET 'localhost:9200/_analyze?tokenizer=standard&pretty' -d '&#194;&#351;&#305;klar &#246;lmez!'
{
  "tokens" : [ {
    "token" : "194",
    "start_offset" :2,
    "end_offset" :5,
    "type" : "<NUM>",
    "position" : 1
  }, {
    "token" : "351",
    "start_offset" :8,
    "end_offset" :11,
    "type" : "<NUM>",
    "position" : 2
  }, {
    "token" : "305",
    "start_offset" :14,
    "end_offset" :17,
    "type" : "<NUM>",
    "position" : 3
  }, {
    "token" : "klar",
    "start_offset" :18,
    "end_offset" :22,
    "type" : "<ALPHANUM>",
    "position" : 4
  }, {
    "token" : "246",
    "start_offset" :25,
    "end_offset" :28,
    "type" : "<NUM>",
    "position" : 5
  }, {
    "token" : "lmez",
    "start_offset" :29,
    "end_offset" :33,
    "type" : "<ALPHANUM>",
    "position" : 6
  } ]
}

As you can see, these results are not useful or user-friendly. Remember, if the text is analyzed in this way, documents containing the word Âşıklar will not be returned when we search for the word Âşıklar. In this case, we need a filter that converts the HTML character references back into characters. The HTML Strip Char Filter performs this job, as shown:

curl -XGET 'localhost:9200/_analyze?tokenizer=standard&char_filters=html_strip&pretty' -d '&#194;&#351;&#305;klar &#246;lmez!'
{
  "tokens" : [ {
    "token" : "Âşıklar",
    "start_offset" :0,
    "end_offset" :22,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "ölmez",
    "start_offset" :23,
    "end_offset" :33,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}

Pattern Replace Char filter

This char filter allows you to use a regular expression to manipulate characters. The usage of this filter is exemplified in the Creating a Custom Analyzer section.

Tokenizer

A token is one of the basic concepts of lexical analysis in computer science: a sequence of characters (that is, a string) is turned into a sequence of tokens. For example, the string hello world becomes [hello, world]. Elasticsearch has several tokenizers that are used to break a string down into a stream of terms or tokens. A simple tokenizer may split the string into terms wherever it encounters word boundaries, whitespace, or punctuation.
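
As a minimal sketch of this, assuming a node running on localhost:9200 as in the other examples in this chapter:

curl -XGET 'localhost:9200/_analyze?tokenizer=standard&pretty' -d 'hello world'

This should return the two tokens hello and world, each with its start and end offsets, its type, and its position in the stream.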

Elasticsearch has built-in tokenizers, and you can combine them with character filters to create custom analyzers. Some of the tokenizers are described in the following list:

  • Standard Tokenizer: This finds the boundaries between words and divides the text there. To do this, it uses the Unicode Text Segmentation algorithm.
  • Letter Tokenizer: This divides text at non-letter characters.
  • Lowercase Tokenizer: This divides text at non-letter characters and lowercases the tokens, performing the function of the Letter Tokenizer and the Lower Case Token Filter together.
  • Whitespace Tokenizer: This divides text at spaces.
  • Pattern Tokenizer: This divides text via a regular expression.
  • UAX Email URL Tokenizer: This works like the standard tokenizer, but treats e-mail addresses and URLs as single tokens.
  • Path Hierarchy Tokenizer: This divides text at path delimiters (the default delimiter character is '/').
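
For instance, the following is a small sketch of the Path Hierarchy Tokenizer in action, again assuming a local node on localhost:9200; the sample path is arbitrary:

curl -XGET 'localhost:9200/_analyze?tokenizer=path_hierarchy&pretty' -d '/usr/local/elasticsearch'

This should emit one token per level of the path: /usr, /usr/local, and /usr/local/elasticsearch.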

Note

If you want more information about the Unicode Standard Annex #29, refer to http://unicode.org/reports/tr29/.

Token filters

Token filters accept a stream of tokens from a tokenizer and can modify, add, or remove tokens. Elasticsearch has built-in token filters. Some of them are described in the following list:

  • ASCII Folding Token Filter: This converts alphabetic, numeric, and symbolic Unicode characters that are not in the first 127 ASCII characters into their ASCII equivalents, where such equivalents exist.
  • Length Token Filter: This removes words that are longer or shorter than the specified lengths.
  • Lowercase Token Filter: This normalizes token text to lower case.
  • Uppercase Token Filter: This normalizes token text to upper case.
  • Stop Token Filter: This removes stop words (specified words, for example, the, is, are, and so on) from token streams.
  • Reverse Token Filter: This simply reverses each token.
  • Trim Token Filter: This trims the whitespace surrounding a token.
  • Normalization Token Filters: These normalize the special characters of a certain language (for example, Arabic, German, or Persian).
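
Token filters can also be chained. The following is a small sketch that applies the Lowercase and ASCII Folding token filters to the verse used earlier, assuming the same query-string style of the _analyze API and the local node used in this chapter:

curl -XGET 'localhost:9200/_analyze?tokenizer=standard&token_filters=lowercase,asciifolding&pretty' -d 'Âşıklar ölmez!'

This should return asiklar and olmez, with the Turkish accented characters folded to their closest ASCII equivalents.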
