Tokenization

A tokenizer is an analysis component declared with the <tokenizer> element that takes text in the form of a character stream and splits it into so-called tokens, most of the time skipping insignificant bits like whitespace and adjoining punctuation. An analyzer has exactly one tokenizer. Your tokenizer choices are as follows:

  • KeywordTokenizerFactory: This tokenizer doesn't actually do any tokenization! The entire character stream becomes a single token. The string field type has a similar effect but doesn't allow configuration of text analysis such as lower-casing. Any field used for sorting, and most fields used for faceting, require an indexed field with no more than one term per original value; a field type sketch using this tokenizer appears near the end of this section.
  • WhitespaceTokenizerFactory: Text is tokenized by whitespace: spaces, tabs, carriage returns, line feeds.
  • StandardTokenizerFactory: This is a general-purpose tokenizer for most Western languages. It tokenizes on whitespace and other points specified by the Unicode standard's annex on word boundaries. Whitespace and punctuation characters at these boundaries get removed. Hyphens are considered a word boundary, making this tokenizer less desirable for use with WordDelimiterFilter.
  • UAX29URLEmailTokenizerFactory: This behaves like StandardTokenizer with the additional property of recognizing e-mail addresses and URLs as single tokens.
  • ClassicTokenizerFactory: (This was formerly the StandardTokenizer before Solr 3.) This is a general-purpose tokenizer for English. In English text, it does a few things better than StandardTokenizer: acronyms using periods are recognized, leaving the final period in place (which would otherwise be removed), for example, I.B.M.; hyphens don't split words when the token contains a number; and e-mail addresses and Internet hostnames survive as single tokens.

    Additionally, there is a ClassicFilter token filter that is usually configured to follow this tokenizer. It will strip the periods out of acronyms and remove any trailing apostrophes (English possessive). It will only work with ClassicTokenizer.

  • LetterTokenizerFactory: This tokenizer considers each contiguous sequence of letters (as defined by Unicode) as a token and disregards other characters.
  • LowerCaseTokenizerFactory: This tokenizer is functionally equivalent to LetterTokenizer followed by LowerCaseFilter, but faster.
  • PatternTokenizerFactory: This regular expression-based tokenizer can behave in one of the following two ways:
    • To split the text on a separator specified by a pattern, you can use it like this: <tokenizer class="solr.PatternTokenizerFactory" pattern="; *" />. This example tokenizes a semicolon-separated list, with any spaces following a semicolon treated as part of the separator.
    • To match only particular patterns, possibly using just a subset of each match as the token, for example: <tokenizer class="solr.PatternTokenizerFactory" pattern="'([^']+)'" group="1" />. The group attribute specifies which matching group becomes the token. Given input text like aaa 'bbb' 'ccc', this would result in the tokens bbb and ccc.
  • PathHierarchyTokenizerFactory: This is a configurable tokenizer that tokenizes strings following a simple character-delimiter pattern, such as file paths or domain names. It's useful for implementing hierarchical faceting, as discussed in Chapter 7, Faceting, or simply for filtering documents by some root prefix of the hierarchy. As an example, the input string /usr/local/apache would be tokenized to these three tokens: /usr, /usr/local, and /usr/local/apache. This tokenizer has four configuration options (a configured sketch follows this list):
    • delimiter: This is a character delimiter; the default is /
    • replace: This is a replacement character for delimiter (optional)
    • reverse: This is a Boolean to indicate whether the root of the hierarchy is on the right, such as with a hostname; the default is false
    • skip: This is a number of leading root tokens to skip; the default is 0
  • WikipediaTokenizerFactory: This is an experimental tokenizer for MediaWiki syntax, such as that used in Wikipedia.
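
As a sketch of the PathHierarchyTokenizerFactory options above, the following hypothetical field type (the name hostname_path is illustrative) splits hostnames on dots, with the hierarchy root on the right:

    <fieldType name="hostname_path" class="solr.TextField">
      <analyzer>
        <!-- Split on "." and treat the right-most part as the root -->
        <tokenizer class="solr.PathHierarchyTokenizerFactory"
                   delimiter="." reverse="true" />
      </analyzer>
    </fieldType>

With this configuration, an input such as www.example.com would yield the tokens www.example.com, example.com, and com.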

Other tokenizers exist for languages such as Chinese and Russian, as well as ICUTokenizer, which detects the language (or script) used and tokenizes accordingly. NGram-based tokenizers will be discussed later.
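
Whichever tokenizer you choose, it is declared the same way: inside an <analyzer> element, optionally followed by token filters. The following minimal sketch (the field type name sortable_text is illustrative, not from any example schema) pairs KeywordTokenizerFactory with a lower-casing filter, keeping one term per original value for sorting while still allowing the text analysis that the string type can't do:

    <fieldType name="sortable_text" class="solr.TextField" sortMissingLast="true">
      <analyzer>
        <!-- The entire input value becomes a single token -->
        <tokenizer class="solr.KeywordTokenizerFactory" />
        <!-- Normalize case so sorting isn't case-sensitive -->
        <filter class="solr.LowerCaseFilterFactory" />
      </analyzer>
    </fieldType>

Sorting on such a field then behaves case-insensitively, something a plain string field can't offer without changing the values themselves.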

Tip

For more information on these tokenizers, see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters or the API documentation.
