What is a tokenizer?

A tokenizer is a Solr component that performs tokenization: it breaks a stream of text into tokens at some delimiter and produces a token stream. A tokenizer is configured in the managed-schema.xml file by naming its Java factory class. For example, we can define a standard tokenizer as:

<tokenizer class="solr.StandardTokenizerFactory"/>

Most tokenizer implementation classes do not provide a default no-arg constructor, so always configure a tokenizer factory class rather than the tokenizer class itself. Solr provides a standard way to define tokenizers in XML using these factory classes: each factory reads the XML configuration and creates an instance of the corresponding tokenizer implementation class. An analyzer may contain only a tokenizer, or a tokenizer followed by one or more filters. If only a tokenizer is configured, its output is ready to use; otherwise, the output is passed to the first filter in the list.
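
As an illustration of the two cases, the following sketch defines one field type whose analyzer stops at the tokenizer and another that passes the tokenizer output to a filter. The field type names text_tokenized_only and text_lowercased are made up for this example; the factory classes and the maxTokenLength argument are ones shipped with Solr.

```xml
<!-- Hypothetical field type: tokenizer only, so tokens are used as produced -->
<fieldType name="text_tokenized_only" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field type: tokenizer output is passed to the first filter -->
<fieldType name="text_lowercased" class="solr.TextField">
  <analyzer>
    <!-- Attributes on the element are handed to the factory as arguments -->
    <tokenizer class="solr.StandardTokenizerFactory" maxTokenLength="255"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

In the second definition, the factory receives maxTokenLength from the XML and uses it when it constructs the tokenizer, which is why the schema names the factory rather than the tokenizer class.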

Tokenization generates a new token stream. Each token holds its normalized text value along with some metadata, such as the location in the input at which it was generated. Token metadata is important for things like highlighting search results in the field text. A token's value may differ from the input text it came from, so we can't assume that the text of a token is the same as the input text, or that its length is the same. It's also possible for more than one token to have the same position or to refer to the same offset in the original text.
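
One place this metadata can surface in the schema is term vector storage. The sketch below assumes a field named content and an existing text_general field type, both chosen only for illustration. It asks Solr to keep each token's position and character offsets in the field's term vectors, so that a highlighter which supports term vectors can read them back rather than re-analyzing the stored text. Whether this is needed depends on the highlighter in use.

```xml
<!-- Hypothetical field; "text_general" is assumed to be defined elsewhere in the schema -->
<field name="content" type="text_general" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>
```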
