Classic tokenizer

This tokenizer splits the text field into tokens at white space and punctuation. The classic tokenizer behaves in the same way as the standard tokenizer of Solr versions 3.1 and older. Unlike the current standard tokenizer, it does not use the Unicode standard annex UAX#29 word boundary rules. Delimiter characters are discarded, with the following exceptions:

  • Dots that are not followed by white space are kept as part of the token
  • Words are split at hyphens, unless the word contains a number, in which case the token is not split and the numbers and hyphens are preserved
  • Internet domain names and email addresses are preserved as single tokens

Factory class: solr.ClassicTokenizerFactory

Arguments:

maxTokenLength: (integer, default 255) The maximum token length, in characters. Tokens that exceed the number of characters specified by maxTokenLength are ignored.
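As a minimal sketch, the argument can be set directly on the tokenizer element; the field type name text_classic and the 120-character limit used here are illustrative values, not defaults:

<fieldType name="text_classic" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ClassicTokenizerFactory" maxTokenLength="120"/>
  </analyzer>
</fieldType>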

Example:

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ClassicTokenizerFactory"/>
  </analyzer>
</fieldType>
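
Once defined, the field type can be assigned to a field in the schema. A minimal sketch, assuming a hypothetical field named description:

<field name="description" type="text_en" indexed="true" stored="true"/>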

Input: Please send a mail at [email protected] by 12-Nov.

Output: "Please", "send", "a", "mail", "at", "[email protected]", "by", "12-Nov"

The input string is split at white space and punctuation, but the email address [email protected] and 12-Nov are each preserved as a single token.
