N-gram tokenizer

The n-gram tokenizer reads the input string and generates n-gram tokens of every size in the provided range.

Factory class: solr.NGramTokenizerFactory

Arguments:
minGramSize (integer, default 1): The minimum n-gram size.
maxGramSize (integer, default 2): The maximum n-gram size.

The following condition must be fulfilled when providing gram-size arguments:
0 < minGramSize <= maxGramSize

Example:

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="3"/>
  </analyzer>
</fieldType>

Input: send me

Output: "se", "sen", "en", "end", "nd", "nd ", "d ", "d m", " m", " me", "me" (tokens are quoted so that the white space inside them stays visible)

The n-gram tokenizer runs over the entire input string. It does not treat white space as a delimiter, so white space characters are included in the tokens. In the preceding example, the space between send and me is preserved as part of every token that spans it. The n-gram tokenizer is useful when search words must match at the start, at the end, or anywhere in the middle of a string, white space included.
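The sliding-window behavior described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Solr's or Lucene's implementation: the function name char_ngrams and its parameters are hypothetical, and the token order (by start offset, then by length) is assumed from the example output shown earlier.

```python
def char_ngrams(text, min_gram=1, max_gram=2):
    """Return character n-grams of text, ordered by start offset, then by length.

    Illustrative sketch only; white space is treated like any other character.
    """
    # Mirror the documented constraint: 0 < minGramSize <= maxGramSize.
    if not 0 < min_gram <= max_gram:
        raise ValueError("require 0 < min_gram <= max_gram")

    grams = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            end = start + size
            if end > len(text):
                break  # longer grams from this offset would run past the input
            grams.append(text[start:end])
    return grams


# Same input and gram sizes as the fieldType example.
print(char_ngrams("send me", 2, 3))
```

Printing the result as a Python list keeps the quotes around each token, which makes the tokens that contain a space easy to spot.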

For example, suppose the input string is Please send me a mail at [email protected] and we want to match the phrase mail at [email protected].
