Character filters

Character filters, declared with the <charFilter> element, process a stream of text before tokenization. Solr ships with only a few of them, and the feature is not commonly used, except for the first one described here, which is typically configured to strip accents:

  • MappingCharFilterFactory: This maps a character (or string) to another, or possibly to nothing at all. In other words, it's a find-replace capability. Its mapping attribute names a configuration file that lists the mappings. Solr's example configuration includes two such configuration files with useful mappings:
    • mapping-FoldToASCII.txt: This is a comprehensive mapping of non-ASCII characters to ASCII equivalents. For further details on the characters mapped, read the comments at the top of the file. This char filter has a token filter equivalent named ASCIIFoldingFilterFactory that should run faster and is recommended instead.
    • mapping-ISOLatin1Accent.txt: This is a smaller subset covering just the ISO Latin1 accent characters (like ñ to n). Given that FoldToASCII is more comprehensive, it's likely to be a better default than this one.

      Tip

      This analysis component and quite a few others have an attribute in which you can specify a configuration file. Usually, you can specify more than one file, separated by commas, but some components don't support that. The files always reside in the conf directory and must be UTF-8 encoded.

  • HTMLStripCharFilterFactory: This is used for HTML or XML, and the input need not be well formed. Essentially, it removes all markup, leaving just the text content of elements. The text of script and style elements is removed. Entity references (for example, &amp;) are resolved.

    Tip

    Instead of stripping markup at the analysis stage, which is very late, consider whether this should be done at an earlier point with UpdateRequestProcessor, or even before Solr gets the document. If you need to retain the markup in Solr's stored value, then you will indeed need to perform this step here.

  • PatternReplaceCharFilterFactory: This searches the text for matches of the regular expression given in the pattern attribute and replaces each match with the value of the replacement attribute. Only use this char filter if the replacement should affect tokenization, such as by introducing a space.

    Note

    The regular expression specification supported by Solr is the one that Java uses: http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html.
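
As a rough sketch of how the three char filters described above might be declared in the schema, consider the following two field types. The field type names (text_folded and text_html), the camel-case pattern, and the choice of tokenizers and token filters are all hypothetical and chosen only for illustration; they are not taken from Solr's example configuration:

```xml
<!-- Hypothetical field type: fold non-ASCII characters before tokenization -->
<fieldType name="text_folded" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- The mapping attribute names a file in the conf directory -->
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-FoldToASCII.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field type: strip markup, then split camel-case words -->
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- Illustrative pattern: insert a space between a lowercase letter and
         an uppercase letter so that the tokenizer sees two separate words -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="([a-z])([A-Z])" replacement="$1 $2"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Char filters are applied in the order in which they are declared, and all of them run before the tokenizer. They affect only the text that is analyzed for indexing and searching; the field's stored value keeps the original text.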
