Filtering

The token filters are declared in the <filter> element and consume one stream of tokens, known as a TokenStream, and generate another; hence, they can be chained one after another indefinitely. A token filter may perform complex analysis by processing multiple tokens in the stream at once, but in most cases it processes each token sequentially and decides whether to keep, replace, or discard it.

There may only be one official tokenizer in an analyzer; however, the token filter named WordDelimiterFilter is, in effect, a tokenizer too:

<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="1" catenateNumbers="1"
        catenateAll="0" splitOnCaseChange="1"/>

(Not all options are shown here.) The purpose of this filter is to both split and join compound words, with various means of defining what counts as a compound word. It is typically used with WhitespaceTokenizer, not StandardTokenizer, because StandardTokenizer removes punctuation-based intra-word delimiters, thereby defeating some of this processing. The options take the value 1 to enable and 0 to disable.

Tip

This analysis component is the most configurable of all and it can be a little confusing. Use Solr's Analysis screen, described in the Experimenting with text analysis section, to validate your configuration.

The WordDelimiterFilter will first tokenize the input word, according to the configured options. Note that the commas on the right-hand side of the following examples denote separate terms, and options are all true by default:

  • Split on intra-word delimiters: Wi-Fi to Wi, Fi
  • Split on letter-number transitions: SD500 to SD, 500 (if splitOnNumerics)
  • Omit any delimiters: /hello--there, dude to hello, there, dude
  • Remove trailing 's: David's to David (if stemEnglishPossessive)
  • Split on lower to upper case transitions: WiFi to Wi, Fi (if splitOnCaseChange)

At this point, the resulting terms are all filtered out unless some of the following options are enabled. You should always enable at least one of them:

  • If generateWordParts or generateNumberParts is enabled, all-alphabetic terms or all-number terms pass through (meaning, they are not filtered). Either way, they are still considered for the concatenation options.
  • To concatenate a consecutive series of alphabetic terms, enable catenateWords (for example, wi-fi to wifi). If generateWordParts is also enabled, this example would generate wi and fi but not otherwise. This will work even if there is just one term in the series, thereby generating a term that disabling generateWordParts would have omitted. catenateNumbers works similarly but for numeric terms. The catenateAll option will concatenate all of the terms together. The concatenation process will take care to not emit duplicate terms.
  • To preserve the original word, enable preserveOriginal.

Here is an example exercising all the aforementioned options: WiFi-802.11b to Wi, Fi, WiFi, 802, 11, 80211, b, WiFi80211b, WiFi-802.11b.
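
A configuration enabling all of these options would look like the following sketch; paired with WhitespaceTokenizerFactory, it produces the terms listed for WiFi-802.11b:

<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="1" catenateNumbers="1" catenateAll="1"
        splitOnCaseChange="1" preserveOriginal="1"/>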

Internally, this filter assigns a type to each character (such as letter or number) before looking for word boundaries. The types are determined by Unicode character categories. If you want to customize how the filter determines what the type of each character is, you can provide one or more mapping files with the types option. An example use case would be indexing Twitter tweets in which you want # and @ treated as type ALPHA.

Note

For more details on this esoteric feature, see SOLR-2059. You can find sample configurations showing how to customize WordDelimiterFilter's tokenization rules at https://issues.apache.org/jira/browse/SOLR-2059.
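
As a sketch of the Twitter use case (the file name wdfftypes.txt is a conventional choice, and the exact escaping rules are an assumption here; see SOLR-2059 for the authoritative syntax), you point the filter at a mapping file in the conf directory:

<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" types="wdfftypes.txt"/>

The wdfftypes.txt file maps a character to a type, one mapping per line; '@' can be written literally, while '#' presumably needs its Unicode escape because a leading '#' starts a comment:

@ => ALPHA
\u0023 => ALPHA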

Lastly, if there is a limited set of known input words that you want this filter to pass through untouched, they can be listed in a file referred to by the protected option. Some other filters share this same feature.
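
For example, a minimal sketch (protwords.txt is just a conventional file name in the conf directory, listing one word per line):

<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        protected="protwords.txt"/>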

Solr's out-of-the-box configuration for the text_en_splitting field type is a reasonable way to use the WordDelimiterFilter—generation of word and number parts at both index- and query-time, but concatenating only at index time, since doing so at query time too would be redundant.

Stemming

Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form; for example, a stemming algorithm might reduce Riding and Rides to just Ride. Stemming is done to improve search result recall, but at the expense of some precision. If you are processing general text, you will improve your search results with stemming. However, if you have text that is mostly proper nouns, such as an artist's name in MusicBrainz, then anything more than light stemming will hurt the results. If you want to improve the precision of search results but retain the recall benefits, you should consider indexing the data in two fields, one stemmed and the other not stemmed. The DisMax query parser, described in Chapter 5, Searching, and Chapter 6, Search Relevancy, can then be configured to search the stemmed field and boost by the unstemmed one via its bq or pf options.

Many stemmers will generate stemmed tokens that are not correctly spelled words, such as Bunnies becoming Bunni instead of Bunny or stemming Quote to Quot; you'll see this in Solr's Analysis screen. This is harmless since stemming is applied at both index and search times; however, it does mean that a field that is stemmed like this cannot also be used for query spellcheck, wildcard searches, or search term autocomplete—features described in later chapters. These features directly use the indexed terms.

Tip

A stemming algorithm is very language specific compared to other text analysis components; remember to visit https://cwiki.apache.org/confluence/display/solr/Language+Analysis as advised earlier for non-English text. It includes information on a Solr token filter that performs decompounding, which is useful for certain languages (not English).

Here are stemmers suitable for the English language (sample declarations follow the list):

  • SnowballPorterFilterFactory: This one lets you choose among many stemmers that were generated by the so-called Snowball program, hence the name. It has a language attribute in which you make the implementation choice from a list. Specifying English uses the Porter2 algorithm—regarded as a slight improvement over the original. Specifying Lovins uses the Lovins algorithm for English—regarded as an improvement on Porter but too slow in its current form.
  • PorterStemFilterFactory: This is the original English Porter algorithm. It is said to be twice as fast as using Snowball English.
  • KStemFilterFactory: This English stemmer is less aggressive than Porter's algorithm. This means it will not stem in as many cases as Porter will in an effort to reduce false-positives at the expense of missing stemming opportunities. We recommend this as the default English stemmer.
  • EnglishMinimalStemFilterFactory: This is a simple stemmer that only stems on typical pluralization patterns. Unlike most other stemmers, the stemmed tokens that are generated are correctly spelled words; they are the singular form. A benefit of this is that a single Solr field with this stemmer is usable for both general searches and for query term autocomplete simultaneously, thereby saving index size and making indexing faster.
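
For instance, declaring the Snowball English stemmer, or KStem as our recommended default, is a one-line filter entry in the analyzer chain, typically placed after LowerCaseFilterFactory (a minimal sketch):

<filter class="solr.SnowballPorterFilterFactory" language="English"/>

or, alternatively:

<filter class="solr.KStemFilterFactory"/>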

Correcting and augmenting stemming

These stemmers are algorithmic instead of being based on a vetted Thesaurus for the target language. Languages have so many spelling idiosyncrasies that algorithmic stemmers are imperfect—they sometimes stem incorrectly or don't stem when they should.

If there are particularly troublesome words that get stemmed, you can prevent it by preceding the stemmer with a KeywordMarkerFilter with the protected attribute referring to a file of newline-separated tokens that should not be stemmed. An ignoreCase Boolean option is available too. Some stemmers have, or used to have, a protected attribute that worked similarly, but that old approach isn't advised any more.

If you need to augment the stemming algorithm so that you can tell it how to stem some specific words, precede the stemmer with StemmerOverrideFilter. It takes a dictionary attribute referring to a UTF-8 encoded file in the conf directory containing token pairs, one pair per line, with a tab separating the input token from the output token (the desired stemmed form of the input). An ignoreCase Boolean option is available too. This filter will skip tokens already marked by KeywordMarkerFilter, and it will keyword-mark all the tokens it replaces itself, so that the stemmer will skip them.

Here is a sample excerpt of an analyzer chain showing three filters in support of stemming:

<filter class="solr.KeywordMarkerFilterFactory"
  protected="protwords.txt" />
<filter class="solr.StemmerOverrideFilterFactory"
  dictionary="stemdict.txt" />
<filter class="solr.PorterStemFilterFactory" />

Processing synonyms

The purpose of synonym processing is straightforward. Someone searches using a word that wasn't in the original document but is synonymous with a word that is indexed, so you want that document to match the query. Of course, the synonyms need not be strictly those identified by a Thesaurus; they can be whatever you want, including terminology specific to your application's domain.

Tip

The most widely known free Thesaurus is WordNet (http://wordnet.princeton.edu/). From Solr 3.4, we have the ability to read WordNet's "prolog" formatted file via a format="wordnet" attribute on the synonym filter. However, don't be surprised if you lose precision in the search results—it's not a clear win, for example, "Craftsman" in context might be a proper noun referring to a brand, but WordNet would make it synonymous with "artisan". Synonym processing doesn't know about context—it's simple and dumb.

Here is a sample analyzer configuration line for synonym processing:

<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>

The synonym reference is to a file in the conf directory. Set ignoreCase to true for the case-insensitive lookup of synonyms.

Before describing the expand option, let's consider an example. The synonyms file is processed line-by-line. Here is a sample line with an explicit mapping that uses the arrow =>:

i-pod, i pod => ipod

This means that if either i-pod (one token) or i then pod (two tokens) are found in the incoming token stream to this filter, then they are replaced with ipod. There could have been multiple replacement synonyms, each of which might contain multiple tokens. Also notice that commas are what separate each synonym, which is then split by whitespace for multiple tokens. To customize the tokenization to be something more sophisticated than whitespace, there is a tokenizerFactory attribute, but it's rarely used.

Alternatively, you may have lines that look like this:

ipod, i-pod, i pod

These lines don't have => and are interpreted differently according to the expand parameter. If expand is true, the line will be translated to the following explicit mapping:

ipod, i-pod, i pod => ipod, i-pod, i pod

If expand is false, the aforementioned line will become this explicit mapping, in which the first source synonym is the replacement synonym:

ipod, i-pod, i pod => ipod

It's okay to have multiple lines that reference the same synonyms. If a source synonym in a new rule is already found to have replacement synonyms from another rule, then those replacements are merged.
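
For example, given these two hypothetical lines, the replacement lists for ipod are merged, so an incoming ipod token is replaced by apple, ipod, and mp3player:

ipod => apple, ipod
ipod => mp3player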

Tip

Multiword (also known as Phrase) synonyms

For multiword synonyms to work, the analysis must be applied at index time and with expansion so that both the original words and the combined word get indexed. The next section elaborates on why this is so. Also, be aware that the tokenizer and previous filters affect the tokens that the SynonymFilter sees; so, depending on the configuration, hyphens and other punctuation may or may not have been stripped out.

Synonym expansion at index time versus query time

If you are doing synonym expansion (have any source synonyms that map to multiple replacement synonyms or tokens), do synonym processing at either index time or query time, but not both. Doing it in both places will yield correct results but will perform slower. We recommend doing it at index time because of the following problems that occur when doing it at query time:

  • A source synonym containing multiple words (for example, i pod) isn't recognized at query time because the query parser tokenizes on whitespace before the analyzer gets it.
  • The IDF component of Lucene's scoring algorithm (discussed in Chapter 6, Search Relevancy) will be much higher for documents matching a synonym appearing rarely, compared to its equivalents that are common. This reduces the scoring effectiveness.
  • Prefix, wildcard, and fuzzy queries aren't analyzed, and thus won't match synonyms.

However, any analysis at index time is less flexible, because any changes to the synonyms will require a complete re-index to take effect. Moreover, the index will get larger if you do index-time expansion—perhaps too large if you have a large set of synonyms such as with WordNet. It's plausible to imagine the aforementioned issues being rectified at some point. In spite of this, we usually recommend index time.

Alternatively, you could choose not to do synonym expansion. This means for a given synonym token, there is just one token that should replace it. This requires processing at both index time and query time to effectively normalize the synonymous tokens. However, since there is query-time processing, it suffers from the problems mentioned earlier (with the exception of poor scores, which isn't applicable). The benefit to this approach is that the index size would be smaller, because the number of indexed tokens is reduced.

You might also choose a blended approach to meet different goals, for example, if you have a huge index that you don't want to re-index often, but you need to respond rapidly to new synonyms, then you can put new synonyms into both a query-time synonym file and an index-time one. When a re-index finishes, you empty the query-time synonym file. You might also be fond of the query-time benefits, but due to the multiple word token issue, you decide to handle those particular synonyms at index time.
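
Putting the index-time recommendation into schema.xml terms, here is a sketch of a field type (the name text_syn is hypothetical) that expands synonyms only in the index analyzer:

<fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- expansion happens at index time only -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>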

Working with stop words

There is a simple filter called StopFilterFactory that filters out certain so-called stop words specified in a file in the conf directory, optionally ignoring case. The example usage is as follows:

<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>

When used, it is present in both index and query analyzer chains.

For indexes with lots of text, common uninteresting words such as "the", "a", and so on, make the index large and slow down phrase queries that use them. A simple solution to this problem is to filter them out of the fields in which they often show up. Fields likely to contain more than a sentence are ideal candidates. Our MusicBrainz schema does not have content like this. The trade-off when omitting stop words from the index is that those words are no longer queryable. This is usually fine, but in some circumstances like searching for To be or not to be, it is obviously a problem.

Tip

The ideal solution to the common word problem is not to remove them. Chapter 10, Scaling Solr, discusses an approach called common-grams implemented with CommonGramsFilterFactory that can be used to improve phrase search performance, while keeping these words. It is highly recommended.

Solr comes with a decent set of stop words for the English language. You may want to supplement it or use a different list altogether if you're indexing non-English text. In order to determine which words appear commonly in your index, access the Schema Browser menu option in Solr's admin interface. All of your fields will appear in a drop-down list on the form. If the list does not appear at once, be patient; for large indexes, there is a considerable delay before the field list appears because Solr is analyzing the data in your index. Now, choose a field that you know contains a lot of text. In the main viewing area, you'll see a variety of statistics about the field, including the top 10 terms appearing most frequently. If you can't see the term info by default, click on the Load Term Info button and select the Autoload checkbox.

Note

You can also manage synonyms and stop words via a REST API using ManagedSynonymFilterFactory and ManagedStopFilterFactory respectively. You can read more and find sample configurations at https://cwiki.apache.org/confluence/display/solr/Managed+Resources.

Phonetic analysis

Another useful text analysis option to enable searches that sound like a queried word is phonetic translation. A filter is used at both index and query time that phonetically encodes each word into a phoneme-based word. There are many phonetic encoding algorithms to choose from: BeiderMorse, Caverphone, Cologne, DoubleMetaphone, Metaphone, RefinedSoundex, and Soundex. We suggest using DoubleMetaphone for most text, and definitely BeiderMorse for names. However, you might want to experiment in order to make your own choice.

Note

Solr has three tools for more aggressive inexact searching: phonetic, query spellchecking, and fuzzy searching. These are all employed a bit differently.

The following code shows how to configure text analysis for phonetic matching using the DoubleMetaphone encoding in the schema.xml file:

<!-- for phonetic (sounds-like) indexing -->
<fieldType name="phonetic" class="solr.TextField"
     positionIncrementGap="100">
   <analyzer>
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.DoubleMetaphoneFilterFactory"
          inject="false" maxCodeLength="8"/>
   </analyzer>
</fieldType>

The previous example uses the DoubleMetaphoneFilterFactory analysis filter, which has the following two options:

  • inject: This is a Boolean defaulting to true that will cause the original words to pass through the filter. It might interfere with other filter options, querying, and potentially scoring. Therefore, it is preferred to disable this, and use a separate field dedicated to phonetic indexing.
  • maxCodeLength: This is the maximum phoneme code (that is, phonetic character or syllable) length. It defaults to 4. Longer codes are truncated. Only DoubleMetaphone supports this option.

Note that the phonetic encoders internally handle both uppercase and lowercase, so there's no need to add a lowercase filter.

In the MusicBrainz schema that is supplied with the book, a field named a_phonetic is declared to use BeiderMorse because that encoding is best for names. The field has the artist name copied into it through a copyField directive. In Chapter 5, Searching, you will read about the DisMax query parser that can conveniently search across multiple fields with different scoring boosts. It can be configured to search not only the artist name (a_name) field, but also a_phonetic with a low boost so that regular exact matches will come above those that match phonetically.
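
In schema terms, that arrangement amounts to something like the following sketch (the book's actual schema may differ in details; stored is false here on the assumption that the display value comes from a_name):

<field name="a_phonetic" type="phonetic" indexed="true" stored="false"/>
<copyField source="a_name" dest="a_phonetic"/>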

Here is how BeiderMorse is configured:

<fieldType name="phonetic" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- ... potentially others ... -->
    <filter class="solr.BeiderMorseFilterFactory" ruleType="APPROX"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.BeiderMorseFilterFactory" ruleType="EXACT"/>
  </analyzer>
</fieldType>

Notice the difference in ruleType between the query and index analyzers. In order to use most of the phonetic encoding algorithms, you must use the following filter:

<filter class="solr.PhoneticFilterFactory" encoder="RefinedSoundex" inject="false"/>

The encoder attribute must be one of those algorithms listed in the first paragraph of this section, with the exception of DoubleMetaphone and BeiderMorse, which have dedicated filter factories.

Tip

Try Solr's Analysis admin page to see how variations in text change (or don't change) the phonemes that are indexed and searched.

Substring indexing and wildcards

Usually, the text indexing technology is employed to search entire words. Occasionally, however, there arises a need for a query to match an arbitrary substring of an indexed word or across them. Solr supports wildcards on queries (for example, mus*ainz), but there is some consideration needed in the way data is indexed.

It's useful to first get a sense of how Lucene handles a wildcard query at the index level. Lucene internally scans the sorted terms list on disk starting with the nonwildcard prefix (mus in the previous example). One thing to note about this is that the query takes exponentially longer for each fewer prefix character. In fact, Solr configures Lucene to not accept a leading wildcard to ameliorate the problem. Another thing to note is that stemming, phonetic, and other trivial text analysis will interfere with these kinds of searches, for example, if running is stemmed to run, then runni* would not match.

Tip

Before employing these approaches, consider whether you really need better tokenization for special codes; for example, if you have a long string code that internally has different parts that users might search on separately, then you can use a PatternReplaceFilterFactory with some other analysis components to split them up.

ReversedWildcardFilter

Solr doesn't permit a leading wildcard in a query unless you index the text in a reverse direction in addition to the forward direction. Doing this will also improve query performance when the wildcard is very close to the front. The following example configuration should appear at the end of the index analyzer chain:

<filter class="solr.ReversedWildcardFilterFactory" />

It has several performance-tuning options you can investigate further at its Javadocs, available at http://lucene.apache.org/solr/api/org/apache/solr/analysis/ReversedWildcardFilterFactory.html, but the defaults are reasonable.
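
For context, here is a sketch of where the filter sits (the field type name text_rev is hypothetical): it goes at the end of the index-time analyzer only, leaving the query-time analyzer alone:

<fieldType name="text_rev" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ReversedWildcardFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>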

Solr does not support a query with both a leading and trailing wildcard for performance reasons. Given our explanation of the internals, we hope you understand why.

N-gram analysis

N-gram analysis slices text into many smaller substrings ranging between a minimum and maximum configured size, for example, consider the word "Tonight". An NGramFilterFactory configured with minGramSize of 2 and maxGramSize of 5 would yield all of the following indexed terms: (2-grams:) To, on, ni, ig, gh, ht, (3-grams:) Ton, oni, nig, igh, ght, (4-grams:) Toni, onig, nigh, ight, (5-grams:) Tonig, onigh, night. Note that "Tonight" itself does not pass through, because it has more characters than maxGramSize. N-gram analysis can be used as a token filter, and it can also be used as a tokenizer with NGramTokenizerFactory, which will emit n-grams spanning across the words of the entire source text.

Note

The term n-gram can be ambiguous. Outside of Lucene, it is more commonly defined as word-based substrings, not character based. Lucene calls this shingling and you'll learn how to use that in Chapter 10, Scaling Solr.

The following is a suggested analyzer configuration using n-grams to match substrings:

<fieldType name="nGram" class="solr.TextField"
       positionIncrementGap="100">
   <analyzer type="index">
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <!-- potentially word delimiter, synonym filter, stop words, NOT stemming -->
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.NGramFilterFactory" minGramSize="2"             maxGramSize="15"/>
   </analyzer>
   <analyzer type="query">
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <!-- potentially word delimiter, synonym filter, stop words, NOT stemming -->
       <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
</fieldType>

Notice that the n-gramming only happens at index time. The range of gram sizes goes from the smallest number of characters you wish to enable substring searches on (2 in this example), to the maximum size permitted for substring searches (15 in this example).

Apply this analysis to a field created solely for the purpose of matching substrings. Another field should exist for typical searches; configure the DisMax query parser, described in Chapter 5, Searching, to search both fields, with a smaller boost for this one.

Another variation is EdgeNGramTokenizerFactory and EdgeNGramFilterFactory, which emit n-grams that are adjacent to either the start or end of the input text. For the filter factory, this input text is a token, and for the tokenizer, it is the entire input. In addition to minGramSize and maxGramSize, these analyzers take a side argument that is either front or back. If only prefix or suffix matching is needed instead of both, then an EdgeNGram analyzer is for you.
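
For instance, a prefix-only variant of the earlier index-time analyzer would simply swap in the edge filter, using the attributes just described (a sketch):

<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>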

N-gram costs

There is a high price to be paid for n-gramming. Recall that in the earlier example, Tonight was split into 18 substring terms, whereas typical analysis would probably leave only one. This translates to greater index sizes, and thus a longer time to index. Let's look at the effects of this in the MusicBrainz schema. The a_name field, which contains the artist name, is indexed in a typical fashion and is a stored field. The a_ngram field is fed by the artist name and is indexed with n-grams ranging from 2 to 15 characters in length. It is not a stored field because the artist name is already stored in a_name.

 

                    a_name        a_name + a_ngram    Increase
  Indexing Time     46 seconds    479 seconds         > 10x
  Disk Size         11.7 MB       59.7 MB             > 5x
  Distinct Terms    203,431       1,288,720           > 6x

The preceding table shows a comparison of index statistics of an index with just a_name versus both a_name and a_ngram. Note the ten-fold increase in indexing time for the artist name, and a five-fold increase in disk space. Remember that this is just one field!

Tip

Given these costs, n-gramming, if used at all, is generally only done on a field or two of small size where there is a clear requirement for fast substring matches.

The costs of n-gramming are lower if minGramSize is raised and to a lesser extent if maxGramSize is lowered. Edge n-gramming costs less too. This is because it is only based on one side. It definitely costs more to use the tokenizer-based n-grammers instead of the term-based filters used in the example before, because terms are generated that include and span whitespace. However, with such indexing, it is possible to match a substring spanning words.

Sorting text

Usually, search results are sorted by relevancy via the score pseudo-field, but it is common to need to support conventional sorting by field values too. And, in addition to sorting search results, there are ramifications to this discussion in doing a range query and when showing facet results in a sorted order.

Note

Sorting limitations

A field needs to be indexed, not be multivalued, and for text, it should not have multiple tokens (either there is no text analysis or it yields just one token).

It just so happens that MusicBrainz already supplies alternative artist and label names for sorting. When different from the original name, these sortable versions move words like "The" from the beginning to the end after a comma. We've marked the sort names as indexed but not stored since we're going to sort on it but not display it—deviating from what MusicBrainz does. Remember that indexed and stored are true by default. Because of the special text analysis restrictions of fields used for sorting, text fields in your schema that need to be sortable will usually be copied to another field and analyzed differently. The copyField directive in the schema facilitates this task. The string type is a type that has no text analysis and so it's perfect for our MusicBrainz case. As we're getting a sort-specific value from MusicBrainz, we don't need to derive something ourselves.

However, note that in the MusicBrainz schema there are no sort-specific release names, so let's add sorting support. One option is to use the string type again. That's fine, but you may want to lowercase the text, remove punctuation, and collapse multiple spaces into one (if the data isn't clean). You can even use PatternReplaceFilterFactory to move words like "The" to the end. It's up to you. For the sake of variety in our example, we'll be taking the latter route; we're using a type title_sort that does these kinds of things.
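
One possible definition of such a title_sort type (a sketch; not necessarily identical to the one supplied with the book) keeps the whole value as a single token via KeywordTokenizerFactory and then normalizes it:

<fieldType name="title_sort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <!-- strip punctuation, then collapse runs of whitespace -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="[^a-z0-9 ]" replacement="" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="\s+" replacement=" " replace="all"/>
  </analyzer>
</fieldType>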

By the way, Lucene sorts text by the internal Unicode code point. You probably won't notice any problem with the sort order. If you want sorting that is more accurate to the finer rules of various languages (English included), you should try CollationKeyFilterFactory. Since it isn't commonly used and it's already well documented, we'll refer you to the wiki page https://cwiki.apache.org/confluence/display/solr/Language+Analysis#LanguageAnalysis-UnicodeCollation.

Miscellaneous token filters

Solr includes many more token filters:

  • ClassicFilterFactory: (It was formerly named StandardFilter prior to Solr 3.1.) This filter works in conjunction with ClassicTokenizer. It will remove periods in between acronyms and 's at the end of terms:
    "I.B.M. cat's" => "IBM", "cat"
  • EnglishPossessiveFilterFactory: This removes the trailing 's.
  • TrimFilterFactory: This removes leading and trailing whitespace. We recommend doing this sort of thing before text analysis, for example with TrimFieldUpdateProcessorFactory (see Chapter 4, Indexing Data).
  • LowerCaseFilterFactory: This lowercases all text. Don't put this before WordDelimiterFilterFactory if you want to split on case transitions.
  • KeepWordFilterFactory: This filter omits all of the words, except those in the specified file:
    <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="true"/>

    If you want to ensure a certain vocabulary of words in a special field, you might enforce it with this.

  • LengthFilterFactory: This filters out the terms that do not have a length within an inclusive range. The following is an example:
    <filter class="solr.LengthFilterFactory" min="2" max="5" />
  • LimitTokenCountFilterFactory: This filter caps the number of tokens passing through to that specified in the maxTokenCount attribute. Even without any hard limits, you are effectively limited by the memory allocated to Java—reach that and Solr will throw an error.
  • RemoveDuplicatesTokenFilterFactory: This ensures that no duplicate terms appear at the same position. This can happen, for example, when synonyms stem to a common root. It's a good idea to add this to your last analysis step if you are doing a fair amount of other analysis.
  • ASCIIFoldingFilterFactory: See MappingCharFilterFactory in the earlier Character filters section for more information on this filter.
  • CapitalizationFilterFactory: This filter capitalizes each word according to the rules that you specify. For more information, see the Javadocs at http://lucene.apache.org/core/4_10_4/analyzers-common/org/apache/lucene/analysis/miscellaneous/CapitalizationFilterFactory.html.
  • PatternReplaceFilterFactory: This takes a regular expression and replaces the matches. Take a look at the following example:
    <filter class="solr.PatternReplaceFilterFactory" pattern=".*@(.*)"
            replacement="$1" replace="first" />

    This example is for processing an e-mail address field to get only the domain of the address. This replacement happens to be a reference to a regular expression group, but it might be any old string. If the replace attribute is set to first, then only the first match is replaced; if replace is all, the default, then all matches are replaced.

  • Write your own: Writing your own filter is an option if the existing ones don't suffice. Crack open the source code to Lucene for one of these to get a handle on what's involved. Before you head down this path though, you'd be surprised at what a little creativity with PatternReplaceFilterFactory and some of the others can offer you. For starters, check out the rType field type in the schema.xml that is supplied online with this book.

There are some other miscellaneous Solr filters we didn't mention for various reasons. For common-grams or shingling, see Chapter 10, Scaling Solr. See the all known implementing classes section at the top of http://lucene.apache.org/core/4_10_4/analyzers-common/org/apache/lucene/analysis/util/TokenFilterFactory.html for a complete list of token filter factories, including documentation.
