Standard tokenizer

The standard tokenizer splits the text field into tokens, treating white space and punctuation (commas, dots, hyphens, semicolons, colons, hashtags, and @) as delimiters. It discards all delimiters, with this exception:

Dots that are not followed by white space are kept as part of the token, as in internet domain names such as www.google.com.
Factory class: solr.StandardTokenizerFactory.
Arguments: maxTokenLength (integer, default 255). The maximum length of a token, in characters. Tokens that exceed the number of characters specified by maxTokenLength will be ignored.

Example:

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
 <analyzer>
 <tokenizer class="solr.StandardTokenizerFactory"/>
 </analyzer>
</fieldType>

Input: Please send a mail at dharmesh.vasoya@example.com by 12-11.

Output: Please, send, a, mail, at, dharmesh.vasoya, example.com, by, 12, 11

The standard tokenizer generates a total of 10 tokens. These are passed to its immediate successor in the analyzer chain. The standard tokenizer follows the word boundary rules of Unicode Standard Annex #29 (UAX#29, http://unicode.org/reports/tr29/#Word_Boundaries) and emits the following token types: <ALPHANUM>, <NUM>, <SOUTHEAST_ASIAN>, <IDEOGRAPHIC>, and <HIRAGANA>.
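
The maxTokenLength argument and the idea of a successor in the chain can be illustrated with a slightly extended field type. This is only a minimal sketch: the field type name text_en_short, the limit of 100 characters, and the choice of solr.LowerCaseFilterFactory as the next component are assumptions made for illustration and do not come from the example above.

```xml
<!-- Hypothetical field type; the name and the limit of 100 are illustrative -->
<fieldType name="text_en_short" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- maxTokenLength overrides the default of 255 characters -->
    <tokenizer class="solr.StandardTokenizerFactory" maxTokenLength="100"/>
    <!-- Immediate successor in the chain: receives the tokens emitted by the tokenizer -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this configuration, each token produced by the standard tokenizer would be lowercased by the filter before indexing, so Please would be stored as please.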
