Standard tokenizer

The standard tokenizer splits the text field into tokens, treating white space and punctuation (commas, dots, hyphens, semicolons, colons, hashtags, and @) as delimiters and discarding them, with the following exception:

  • Dots that are not followed by white space are kept as part of the token, as in internet domain names such as www.google.com.

Factory class: solr.StandardTokenizerFactory

Arguments:

  • maxTokenLength (integer, default 255): The maximum token length, in characters. Tokens that exceed the number of characters specified by maxTokenLength will be ignored.
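
As a hedged illustration of the maxTokenLength argument, the attribute can be set directly on the tokenizer declaration. This is a minimal sketch: the field type name text_en_short and the value 100 are assumptions chosen for illustration, not recommended settings.

```xml
<fieldType name="text_en_short" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- maxTokenLength lowered from its default of 255; 100 is an arbitrary illustrative value -->
    <tokenizer class="solr.StandardTokenizerFactory" maxTokenLength="100"/>
  </analyzer>
</fieldType>
```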

Example:

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>

Input: Please send a mail at dharmesh.vasoya@example.com by 12-11.

Output: "Please", "send", "a", "mail", "at", "dharmesh.vasoya", "example.com", "by", "12", "11"

The standard tokenizer has generated a total of 10 tokens, which are passed to its immediate successor in the chain. The standard tokenizer follows the word boundary rules of Unicode Standard Annex UAX#29 (http://unicode.org/reports/tr29/#Word_Boundaries) and emits the following token types: <ALPHANUM>, <NUM>, <SOUTHEAST_ASIAN>, <IDEOGRAPHIC>, and <HIRAGANA>.
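
To illustrate what passing tokens to a successor in the chain can look like, the sketch below places a token filter after the tokenizer. The choice of solr.LowerCaseFilterFactory and the field type name text_en_lower are assumptions made for illustration; any token filter could occupy this position.

```xml
<fieldType name="text_en_lower" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- First stage: split the input text into tokens -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Successor stage (illustrative): receives each token and lowercases it -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a chain like this, a token such as Please would be handed to the filter and lowercased to please.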
