Standard tokenizer

Loosely speaking, the standard tokenizer breaks a stream of characters into tokens by splitting on whitespace and most punctuation. It follows the word-boundary rules of the Unicode Text Segmentation algorithm (UAX #29), which makes it a good default for most languages.

The following example shows how the standard tokenizer breaks a character stream into tokens:

POST _analyze
{
  "tokenizer": "standard",
  "text": "Tokenizer breaks characters into tokens!"
}

The preceding command produces the following output; notice the start_offset, end_offset, and position values for each token:

{
  "tokens": [
    {
      "token": "Tokenizer",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "breaks",
      "start_offset": 10,
      "end_offset": 16,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "characters",
      "start_offset": 17,
      "end_offset": 27,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "into",
      "start_offset": 28,
      "end_offset": 32,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "tokens",
      "start_offset": 33,
      "end_offset": 39,
      "type": "<ALPHANUM>",
      "position": 4
    }
  ]
}

This token stream can then be processed further by the analyzer's token filters.
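To see the effect of a token filter on this stream, you can chain one onto the same request; the _analyze API accepts a filter parameter that takes a list of token filters. The following sketch applies the built-in lowercase filter after the standard tokenizer:

POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "Tokenizer breaks characters into tokens!"
}

The output contains the same five terms, but the first token is now tokenizer rather than Tokenizer. The offsets and positions are unchanged, because token filters operate on the tokens produced by the tokenizer, not on the original character stream.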
