Elasticsearch analyzers

Elasticsearch stores data in a systematic, easily accessible, and searchable fashion. To make analysis easier and the data more searchable, the following steps are performed when data is ingested into Elasticsearch:

  1. Initial tidying (sanitizing) of the received string. This is done by a character filter in Elasticsearch, which can clean up the string before actual tokenization, removing unnecessary characters or transforming certain characters as needed.
  2. Tokenizing the string into terms for building an inverted index. This is done by tokenizers in Elasticsearch. Various types of tokenizers exist, each splitting the string into terms/tokens in its own way.
  3. Normalizing the tokens and search terms to make searches easier and more relevant (further filtering and sanitizing). This is done by token filters in Elasticsearch, which can either remove tokens that add little value to a search (a, the, and so on) or modify tokens as needed.

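The three steps above can be sketched as a plain-Python pipeline. This is only an illustration of the concept, not Elasticsearch's implementation; the sanitization rule, tokenization pattern, and stop-word list are arbitrary choices for the example.

```python
import re

def char_filter(text):
    # Step 1: sanitize before tokenization, here by stripping
    # HTML-like tags (a hypothetical sanitization rule)
    return re.sub(r"<[^>]+>", " ", text)

def tokenizer(text):
    # Step 2: split the string into terms on non-letter characters
    return [t for t in re.split(r"[^A-Za-z]+", text) if t]

def token_filter(tokens):
    # Step 3: normalize tokens (lowercase) and drop stop words
    stopwords = {"a", "an", "the"}
    return [t.lower() for t in tokens if t.lower() not in stopwords]

def analyze(text):
    # An analyzer chains the three stages in order
    return token_filter(tokenizer(char_filter(text)))

print(analyze("<b>The</b> Quick Brown Fox"))  # → ['quick', 'brown', 'fox']
```

Note how each stage only sees the output of the previous one; a real Elasticsearch analyzer composes its character filters, tokenizer, and token filters in exactly this order.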
Figure 17: Working of Analyzer

Elasticsearch provides many built-in character filters, tokenizers, and token filters. These components can be combined in any way needed, and such a combination forms an analyzer.

There are two kinds of Analyzers, as follows:

  • Built-in Analyzers
  • Custom Analyzers
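A custom analyzer is declared in the index settings by naming which character filters, tokenizer, and token filters it combines. A minimal sketch follows; the index name `my_index` and analyzer name `my_custom_analyzer` are placeholders, while `html_strip`, `standard`, `lowercase`, and `stop` are built-in components:

```
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}
```

The three keys mirror the three pipeline stages: `char_filter` sanitizes the raw string, `tokenizer` splits it into terms, and `filter` normalizes the resulting tokens.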