Classic tokenizer

The classic tokenizer splits the text field into tokens at whitespace and punctuation. It behaves in the same way as the standard tokenizer of Solr versions 3.1 and older. Unlike the standard tokenizer, it does not use the Unicode Standard Annex UAX#29 word boundary rules. Delimiter characters are discarded, with the following exceptions:

Dots that are not followed by whitespace are kept as part of the token.
Words are split at hyphens unless the word contains a number, in which case the token is not split and the numbers and hyphens are preserved.
Internet domain names and email addresses are each preserved as a single token.

Factory class: solr.ClassicTokenizerFactory

Arguments: maxTokenLength (integer, default 255): The maximum token length, in characters. Tokens longer than maxTokenLength are ignored.
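
To illustrate the argument, the following is a minimal sketch of a field type that sets maxTokenLength explicitly. The field type name text_classic_short and the value 64 are assumptions chosen for illustration only; they do not come from the Solr documentation.

<fieldType name="text_classic_short" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Illustrative value (an assumption): tokens longer than 64 characters are ignored -->
    <tokenizer class="solr.ClassicTokenizerFactory" maxTokenLength="64"/>
  </analyzer>
</fieldType>

When the attribute is omitted, the default of 255 applies.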

Example:

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ClassicTokenizerFactory"/>
  </analyzer>
</fieldType>

Input: Please send a mail at [email protected] by 12-Nov.

Output: Please, send, a, mail, at, [email protected], by, 12-Nov

The input string is split at whitespace and punctuation, but the email address [email protected] and the hyphenated term 12-Nov are each preserved as a single token.
