The standard tokenizer splits the text field into tokens, treating white space and punctuation (commas, dots, hyphens, semicolons, colons, hashtags, and @) as delimiters and discarding them, with one exception: a dot that is not followed by white space is kept as part of the token, so internet domain names such as www.google.com remain a single token.
- Factory class: solr.StandardTokenizerFactory
- Arguments: maxTokenLength (integer, default 255), the maximum token length in characters. Tokens longer than maxTokenLength are ignored.
Example:
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
Input: Please send a mail at dharmesh.vasoya@example.com by 12-11.
Output: Please, send, a, mail, at, dharmesh.vasoya, example.com, by, 12, 11
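The standard tokenizer generates a total of 10 tokens, which are passed to its immediate successor in the chain. Because @ and the hyphen are delimiters, the email address and the date are each split into two tokens, while the dots inside dharmesh.vasoya and example.com are kept. The standard tokenizer follows the word boundary rules of Unicode Standard Annex #29 (UAX#29, http://unicode.org/reports/tr29/#Word_Boundaries) and produces the following token types: <ALPHANUM>, <NUM>, <SOUTHEAST_ASIAN>, <IDEOGRAPHIC>, and <HIRAGANA>.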
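The maxTokenLength argument can be set as an attribute on the tokenizer element. The following is a minimal sketch, not a recommended setting: the field type name text_en_short and the limit of 100 characters are assumptions chosen only for illustration.

```xml
<!-- Hypothetical field type: the name "text_en_short" is an illustrative assumption -->
<fieldType name="text_en_short" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- maxTokenLength overrides the default of 255; the value 100 is arbitrary -->
    <tokenizer class="solr.StandardTokenizerFactory" maxTokenLength="100"/>
  </analyzer>
</fieldType>
```

With this definition, the tokenizer would apply a limit of 100 characters instead of the default 255. The Analysis screen of the Solr Admin UI is a convenient place to check how a given input is tokenized under such a setting.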