This tokenizer splits the text stream at whitespace only. Unlike the standard tokenizer, it does not split the text at punctuation, so all punctuation remains inside the generated tokens.
Factory class: solr.WhitespaceTokenizerFactory
Arguments:
- rule (default: java): The definition of whitespace used as the delimiter. Possible values:
  - java: Uses Character.isWhitespace(int) (https://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-)
  - unicode: Uses Unicode's WHITESPACE property
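The two rules disagree on a handful of code points, which a small Java sketch can make concrete. It is an illustration only, not Solr's own code: the class name WhitespaceRules and its methods are hypothetical, and the unicode rule is approximated with the regex property \p{IsWhite_Space} from java.util.regex, on the assumption that it corresponds to the same Unicode property.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only (not part of Solr or Lucene).
class WhitespaceRules {
    // Assumption: the regex binary property White_Space stands in for rule="unicode".
    private static final Pattern UNICODE_WS = Pattern.compile("\\p{IsWhite_Space}");

    // rule="java": delegates to Character.isWhitespace(int).
    static boolean javaRule(int codePoint) {
        return Character.isWhitespace(codePoint);
    }

    // rule="unicode" (approximation): tests the Unicode White_Space property.
    static boolean unicodeRule(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return UNICODE_WS.matcher(s).matches();
    }

    public static void main(String[] args) {
        // SPACE, NO-BREAK SPACE, NEXT LINE, FILE SEPARATOR
        int[] samples = {0x0020, 0x00A0, 0x0085, 0x001C};
        for (int cp : samples) {
            System.out.printf("U+%04X java=%b unicode=%b%n",
                    cp, javaRule(cp), unicodeRule(cp));
        }
    }
}
```

In this sketch the non-breaking space (U+00A0) counts as whitespace only under the unicode approximation, while the file separator (U+001C) counts only under the java rule, so the choice of rule matters mainly for text that contains such characters.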
Example:
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory" rule="java" />
  </analyzer>
</fieldType>
Input: Please send a mail at [email protected] by 12-11.
Output: Please, send, a, mail, at, [email protected], by, 12-11.
The input string was split at whitespace, but the punctuation (@, ., and -) was preserved in the tokens.
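The splitting behavior described above can be approximated in a few lines of Java. This is a minimal sketch, not the Lucene implementation: the class WhitespaceSplitSketch and its tokenize method are hypothetical names, the sample sentence and its address are placeholders, and details of the real tokenizer such as offsets and position increments are ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of whitespace-only tokenization (not Solr's code).
class WhitespaceSplitSketch {
    // Splits text at runs of whitespace as defined by Character.isWhitespace(int),
    // which corresponds to rule="java". Punctuation is never treated as a delimiter.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isWhitespace(cp)) {
                // A delimiter ends the current token, if one is in progress.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        // Emit the trailing token when the text does not end in whitespace.
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Placeholder input; user@example.com is an assumed sample address.
        System.out.println(tokenize("Mail user@example.com by 12-11."));
        // prints [Mail, user@example.com, by, 12-11.]
    }
}
```

Note that the final token keeps its trailing period, since the period is punctuation rather than whitespace.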