Character filters, declared with the <charFilter> element, process a stream of text prior to tokenization. There are only a few. This feature is not commonly used except for the first one described here, which is configured to strip accents:
MappingCharFilterFactory
: This maps a character (or string) to another, potentially none. In other words, it's a find-replace capability. There is a mapping attribute in which you specify a configuration file. Solr's example configuration includes two such configuration files with useful mappings:
mapping-FoldToASCII.txt
: This is a comprehensive mapping of non-ASCII characters to ASCII equivalents. For further details on the characters mapped, read the comments at the top of the file. This char filter has a token filter equivalent named ASCIIFoldingFilterFactory that should run faster and is recommended instead.
mapping-ISOLatin1Accent.txt
: This is a smaller subset covering just the ISO Latin1 accent characters (like ñ to n). Given that FoldToASCII is more comprehensive, it's likely to be a better default than this one.
HTMLStripCharFilterFactory
: This is used for HTML or XML, and it need not be well formed. Essentially, it removes all markup, leaving just the text content of elements. The text of script and style elements is removed. Entity references (for example, &amp;) are resolved.
PatternReplaceCharFilterFactory
: This performs a search based on a regular expression given as the pattern attribute, replacing each match with the value of the replacement attribute. Only use this char filter if the replacement should affect tokenization, such as by introducing a space.
The regular expression specification supported by Solr is the one that Java uses: http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html.
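To make the three char filters described above more concrete, the following schema fragment is a minimal sketch of how each might be declared. The field type names (text_folded, text_html, text_underscore), the choice of tokenizer and token filters, and the underscore pattern are assumptions made for illustration and are not taken from Solr's example configuration. Only the factory names and the mapping, pattern, and replacement attributes come from the descriptions above.

```xml
<!-- Hypothetical field types; they would sit among the other
     <fieldType> declarations in the schema. -->

<!-- Accent stripping with a mapping file. Char filters are listed
     before the tokenizer because they run on the raw character stream. -->
<fieldType name="text_folded" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-FoldToASCII.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Markup removal for HTML or XML input; no attributes are required. -->
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Regex replacement that is meant to change tokenization: the
     illustrative pattern turns each underscore into a space so that
     the tokenizer sees a word boundary there. -->
<fieldType name="text_underscore" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="_" replacement=" "/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

If the recommended token filter is preferred over the FoldToASCII mapping file, the charFilter line in the first field type would be dropped and a <filter class="solr.ASCIIFoldingFilterFactory"/> element added after the tokenizer instead.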