Standard tokenizer

Loosely speaking, the standard tokenizer breaks a stream of characters into tokens by splitting on whitespace and most punctuation. It follows the word-boundary rules of the Unicode Text Segmentation algorithm (UAX #29), which makes it a good default for most languages.

The following example shows how the standard tokenizer breaks a character stream into tokens:

POST _analyze
{
  "tokenizer": "standard",
  "text": "Tokenizer breaks characters into tokens!"
}

The preceding command produces the following output; notice the start_offset, end_offset, and position values for each token:

{
  "tokens": [
    {
      "token": "Tokenizer",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "breaks",
      "start_offset": 10,
      "end_offset": 16,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "characters",
      "start_offset": 17,
      "end_offset": 27,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "into",
      "start_offset": 28,
      "end_offset": 32,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "tokens",
      "start_offset": 33,
      "end_offset": 39,
      "type": "<ALPHANUM>",
      "position": 4
    }
  ]
}

This token stream can then be processed further by the analyzer's token filters.
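To see the effect of a token filter on this stream, you can chain one onto the same request; the _analyze API accepts a filter parameter that takes a list of token filters. The following sketch applies the built-in lowercase filter after the standard tokenizer:

POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "Tokenizer breaks characters into tokens!"
}

The output contains the same five terms, but the first token is now tokenizer rather than Tokenizer. The offsets and positions are unchanged, because token filters operate on the tokens produced by the tokenizer, not on the original character stream.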
