The function of a tokenizer is to break input text into tokens, where each token is a subsequence of the characters in the text. You configure a tokenizer for a text field type in schema.xml with a <tokenizer> element, which is a child of <analyzer>, like this:
<fieldType name="text" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
  </analyzer>
</fieldType>
In the preceding example, the class attribute names a factory class that will instantiate a tokenizer object when needed. Tokenizer factory classes implement org.apache.solr.analysis.TokenizerFactory. You can pass arguments to a tokenizer factory by setting attributes on the <tokenizer> element, as in this example:
<fieldType name="semicolonDelimited" class="solr.TextField">
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="; "/>
  </analyzer>
</fieldType>
In the preceding example, the PatternTokenizerFactory class implements org.apache.solr.analysis.TokenizerFactory, and an argument is passed to it through the pattern attribute.
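To make the effect of that pattern concrete, here is the same tokenizer with a hypothetical input and the tokens it would produce (the sample text is our own illustration, not from the Solr distribution):

```xml
<!-- splits the field value wherever the literal string "; " occurs -->
<tokenizer class="solr.PatternTokenizerFactory" pattern="; "/>
<!-- Input text: "mango; kiwi; pomegranate" -->
<!-- Output:     "mango", "kiwi", "pomegranate" -->
```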
Many tokenizer factory classes are included in the Solr release. Let's go through some of them.
The standard tokenizer splits input text into tokens, treating whitespace and punctuation as delimiters. It is the most commonly used tokenizer in Solr configurations. Here is an example of it:
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>
Input text: "Hello, packt.pub@uk.com 01-17, re: m56-nm."
Output: "Hello", "packt.pub", "uk.com", "01", "17", "re", "m56", "nm"
In this tokenizer, delimiter characters are discarded, with one notable exception: periods (dots) that are not followed by whitespace are kept as part of the token, which is why packt.pub and uk.com survive intact. Note that the "@" character is still treated as a delimiter, so email addresses are not preserved as single tokens.
The keyword tokenizer treats the entire input text as a single token. An example of it is as follows:
<analyzer>
  <tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
Input text: "Hello, packt.pub@uk.com 01-17, re: m56-nm."
Output: "Hello, packt.pub@uk.com 01-17, re: m56-nm."
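The keyword tokenizer is typically paired with filters when a field must match as a whole. The following field type is our own sketch (the name lowercaseId is an assumption, not from the Solr distribution): it keeps an identifier as a single token while normalizing its case:

```xml
<fieldType name="lowercaseId" class="solr.TextField">
  <analyzer>
    <!-- the whole field value becomes one token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- normalize case so "ABC-123" and "abc-123" match -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```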
As the name suggests, the lowercase tokenizer tokenizes the input text by splitting at non-letter characters and converting all letters to lowercase. Whitespace and other non-letter characters are discarded. Here is an example of it:
<analyzer>
  <tokenizer class="solr.LowerCaseTokenizerFactory"/>
</analyzer>
Input text: "I LOVE Packtpub Books!"
Output: "i", "love", "packtpub", "books"
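The lowercase tokenizer behaves like a letter tokenizer followed by a lowercase filter. If you want the two steps to be separately configurable, the following combination is a sketch that should produce the same tokens for this input:

```xml
<analyzer>
  <!-- split at non-letter characters, as LowerCaseTokenizerFactory does -->
  <tokenizer class="solr.LetterTokenizerFactory"/>
  <!-- then fold case in a separate, swappable step -->
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```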
The N-gram tokenizer reads the input text and generates N-gram tokens whose sizes fall within a configurable range. An example of this tokenizer is as follows:
<analyzer>
  <tokenizer class="solr.NGramTokenizerFactory"/>
</analyzer>
The default behavior of this tokenizer is not to break the field at whitespace, so the resulting output in this example contains a whitespace as a token. The default minimum gram size is 1 and the default maximum gram size is 2.
Input text: "packt pub"
Output: "p", "a", "c", "k", "t", " ", "p", "u", "b", "pa", "ac", "ck", "kt", "t ", " p", "pu", "ub"
The following is an example with an N-gram size of 4 to 5:
<analyzer>
  <tokenizer class="solr.NGramTokenizerFactory" minGramSize="4" maxGramSize="5"/>
</analyzer>
Input text: "packtpub"
Output: "pack", "packt", "ackt", "acktp", "cktp", "cktpu", "ktpu", "ktpub", "tpub"
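When grams are only needed from the start of the field, as in prefix matching or autocomplete, the related EdgeNGramTokenizerFactory is usually a better fit. This fragment is our own illustration of it:

```xml
<analyzer>
  <!-- emits only leading grams: "pac", "pack", "packt" for "packtpub" -->
  <tokenizer class="solr.EdgeNGramTokenizerFactory" minGramSize="3" maxGramSize="5"/>
</analyzer>
```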
Other widely used tokenizers are the whitespace tokenizer (solr.WhitespaceTokenizerFactory), the letter tokenizer (solr.LetterTokenizerFactory), and the path hierarchy tokenizer (solr.PathHierarchyTokenizerFactory).
You can go through the schema.xml file shipped with the Solr examples; its well-commented sections contain good examples of tokenizers and their use cases.