Tokenizers

The function of a tokenizer is to break input text into tokens, where each token is a subsequence of the characters in the text. You configure a tokenizer for a text field type in schema.xml with a <tokenizer> element, which is a child of <analyzer>, as in this example:

<fieldType name="text" class="solr.TextField">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
    </analyzer>
</fieldType>

In the preceding example, you can see that a class attribute names a factory class that will instantiate a tokenizer object when needed. Tokenizer factory classes implement org.apache.solr.analysis.TokenizerFactory. You can pass arguments to tokenizer factories by setting attributes in the <tokenizer> element. Here is an example of this:

<fieldType name="semicolonDelimited" class="solr.TextField">
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="; "/>
  </analyzer>
</fieldType>

In the preceding example, the PatternTokenizerFactory class implements org.apache.solr.analysis.TokenizerFactory and an argument is passed to this class with the attribute name pattern.
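The splitting behavior of the pattern tokenizer can be sketched in a few lines of Python. This is only an illustrative stand-in, not Solr's implementation (which uses Java regular expressions, whose syntax differs slightly from Python's); the sample input string is made up for demonstration:

```python
import re

def pattern_tokenize(text, pattern="; "):
    """Illustrative sketch of PatternTokenizerFactory's default (split)
    mode: the regex matches the delimiters, and the text between the
    matches becomes the tokens. Empty tokens are dropped."""
    return [tok for tok in re.split(pattern, text) if tok]

print(pattern_tokenize("mirror; beveled; 48-in."))
# ['mirror', 'beveled', '48-in.']
```

Note that the delimiter pattern here includes the trailing space, so the tokens come out without leading whitespace.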

Many factory classes are included in the Solr release. Let's go through some of them.

Standard tokenizer

The standard tokenizer splits input text into tokens, treating whitespace and punctuation as delimiters. It is the most commonly used tokenizer in Solr configurations. Here is an example of it:

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>

Input text: "Hello, [email protected] 01-17, re: m56-nm."

Output: "Hello", "packt.pub", "uk.com", "01", "17", "re", "m56", "nm"

In this tokenizer, delimiter characters are discarded, with the following exceptions:

  • Periods (dots) that are not followed by whitespace are kept as part of the token, which preserves Internet domain names
  • The @ character is among the token-splitting punctuation, so e-mail addresses are not preserved as single tokens
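The two rules above can be approximated with a single regular expression. This is a deliberately rough sketch: the real standard tokenizer implements the full UAX#29 word-break rules, which this one-liner does not attempt to reproduce:

```python
import re

def rough_standard_tokenize(text):
    """Rough approximation of the standard tokenizer: runs of word
    characters, with interior dots kept (so domain names survive),
    and everything else -- including '@' -- acting as a delimiter.
    Not the real UAX#29 grammar; for illustration only."""
    return re.findall(r"\w+(?:\.\w+)*", text)

print(rough_standard_tokenize("visit packt.pub, re: m56-nm."))
# ['visit', 'packt.pub', 're', 'm56', 'nm']
```

Notice that "packt.pub" survives as one token because its dot is followed by a word character, while the trailing period after "nm" is dropped.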

Keyword tokenizer

The keyword tokenizer treats the entire input text as a single token. An example of it is as follows:

<analyzer>
  <tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>

Input text: "Hello, [email protected] 01-17, re: m56-nm."

Output: "Hello, [email protected] 01-17, re: m56-nm."
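Because the keyword tokenizer performs no splitting at all, it reduces to the identity function over the input, which is why it suits exact-match fields such as IDs, SKUs, or postal codes. A minimal sketch (with a made-up input string):

```python
def keyword_tokenize(text):
    """Sketch of the keyword tokenizer: the entire input is emitted
    as a single token, with no splitting or normalization."""
    return [text]

print(keyword_tokenize("Hello, world 01-17"))
# ['Hello, world 01-17']
```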

Lowercase tokenizer

As the name suggests, the lowercase tokenizer splits the input text at non-letters and converts all letters to lowercase; the whitespace and non-letter characters themselves are discarded. Here is an example of it:

<analyzer>
  <tokenizer class="solr.LowerCaseTokenizerFactory"/>
</analyzer>

Input text: "I LOVE Packtpub Books!"

Output: "i", "love", "packtpub", "books"
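The behavior just described can be sketched with a regex over letter runs. Note the simplifying assumption: this sketch matches only A-Z, whereas Solr's tokenizer recognizes Unicode letter classes:

```python
import re

def lowercase_tokenize(text):
    """Sketch of the lowercase tokenizer: maximal runs of letters
    become tokens, lowercased; digits, whitespace, and punctuation
    are discarded. (Solr uses Unicode letter classes, not just A-Z.)"""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

print(lowercase_tokenize("I LOVE Packtpub Books!"))
# ['i', 'love', 'packtpub', 'books']
```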

N-gram tokenizer

The N-gram tokenizer reads the input text and generates N-gram tokens whose sizes fall within the range given by its input parameters. An example of this tokenizer is as follows:

<analyzer>
  <tokenizer class="solr.NGramTokenizerFactory"/>
</analyzer>

The default behavior of this tokenizer is not to break the field at whitespaces. So, the resulting output in this example contains a whitespace as a token. The default minimum gram size is 1 and the maximum gram size is 2.

Input text: "packt pub"

Output: "p", "pa", "a", "ac", "c", "ck", "k", "kt", "t", "t ", " ", " p", "p", "pu", "u", "ub", "b"

The following is an example with an N-gram size of 4 to 5:

<analyzer>
  <tokenizer class="solr.NGramTokenizerFactory" minGramSize="4" maxGramSize="5"/>
</analyzer>

Input text: "packtpub"

Output: "pack", "packt", "ackt", "acktp", "cktp", "cktpu", "ktpu", "ktpub", "tpub"
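N-gram generation can be sketched as two nested loops: walk the input position by position and, at each position, emit every gram whose length falls in the configured range. The exact emission order has varied across Lucene versions; this sketch follows the order of the 4-to-5-gram example above (shorter grams first at each position):

```python
def ngrams(text, min_gram=1, max_gram=2):
    """Sketch of NGramTokenizerFactory: emit every substring whose
    length lies in [min_gram, max_gram]. Whitespace is not a
    delimiter, so grams may contain (or be) spaces."""
    out = []
    for i in range(len(text)):
        # At position i, only lengths that fit in the remaining text.
        for n in range(min_gram, min(max_gram, len(text) - i) + 1):
            out.append(text[i:i + n])
    return out

print(ngrams("packtpub", 4, 5))
# ['pack', 'packt', 'ackt', 'acktp', 'cktp', 'cktpu', 'ktpu', 'ktpub', 'tpub']
```

With the defaults (minGramSize=1, maxGramSize=2), `ngrams("packt pub")` emits the space both as a 1-gram of its own and inside the 2-grams "t " and " p".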

Other widely used tokenizers are:

  • Letter tokenizer
  • Classic tokenizer
  • Whitespace tokenizer
  • Edge N-gram tokenizer
  • ICU tokenizer
  • Path hierarchy tokenizer
  • Regular-expression tokenizer
  • UAX29 URL e-mail tokenizer
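One of the listed tokenizers, the edge N-gram tokenizer, is worth a quick sketch because it differs from the plain N-gram tokenizer in exactly one way: only grams anchored at the start of the input are emitted, which makes it a common choice for prefix and autocomplete matching. The gram-size parameters below mirror the minGramSize/maxGramSize attributes, though the defaults are assumptions for this sketch:

```python
def edge_ngrams(text, min_gram=1, max_gram=4):
    """Sketch of the edge N-gram tokenizer: emit only the prefixes
    of the input whose lengths lie in [min_gram, max_gram]."""
    return [text[:n] for n in range(min_gram, min(max_gram, len(text)) + 1)]

print(edge_ngrams("packt", 1, 4))
# ['p', 'pa', 'pac', 'pack']
```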

You can go through the schema.xml file that ships with Solr; its well-commented sections contain good examples of tokenizers and their use cases.
