This tokenizer splits the text field into tokens at whitespace and punctuation. The classic tokenizer behaves in the same way as the standard tokenizer of Solr versions 3.1 and older. Unlike the current standard tokenizer, it does not use the Unicode standard annex UAX#29 word boundary rules. Delimiter characters are discarded, with the following exceptions:
- Dots that are not followed by whitespace are kept as part of the token
- Words are split at hyphens unless the word contains a number, in which case the token is not split and the numbers and hyphens are preserved
- Internet domain names and email addresses are each preserved as a single token
Factory class: solr.ClassicTokenizerFactory
Arguments: maxTokenLength (integer, default 255): Maximum token length, in characters. Tokens longer than maxTokenLength are ignored.
Example:
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ClassicTokenizerFactory"/>
  </analyzer>
</fieldType>
Input: Please send a mail at [email protected] by 12-Nov.
Output: Please, send, a, mail, at, [email protected], by, 12-Nov
The input string is split at whitespace and punctuation, but the email address [email protected] and the hyphenated word 12-Nov, which contains a number, are each preserved as a single token.
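The maxTokenLength argument listed under Arguments can be set as an attribute on the tokenizer element. The following is a minimal sketch rather than a recommended configuration: the field type name text_en_short and the limit of 50 are illustrative assumptions and do not come from the original example.

```xml
<!-- Hypothetical field type name; the limit of 50 is an arbitrary illustration. -->
<fieldType name="text_en_short" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Tokens longer than 50 characters are ignored; the default limit is 255. -->
    <tokenizer class="solr.ClassicTokenizerFactory" maxTokenLength="50"/>
  </analyzer>
</fieldType>
```

With a setting like this, tokenization of ordinary words, email addresses, and hyphenated tokens containing numbers should be unchanged; only tokens exceeding the configured length are affected.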