Beider-Morse Phonetic Matching (BMPM) helps you search for personal names or surnames. It is considerably more sophisticated than algorithms such as Soundex, Metaphone, and Caverphone. Its purpose is to match names that are phonetically equivalent to the expected name. As with Soundex, an exact spelling is not required, but BMPM is designed to generate far fewer false hits.
It performs the following steps to find phonetically equivalent names:
- Determines the language from the spelling of the name
- Applies the phonetic rules of the identified language to transliterate the name into a phonetic alphabet
- If the language cannot be identified from the name, applies generic phonetic rules instead
- Finally, it applies language-independent rules regarding things such as voiced and unvoiced consonants and vowels to further ensure the reliability of the matches
BMPM supports the following languages: English, French, German, Greek, Hebrew written in Hebrew script, Hungarian, Italian, Polish, Romanian, Russian written in Cyrillic script, Russian transliterated into Latin script, Spanish, and Turkish.
Factory class: solr.BeiderMorseFilterFactory
Arguments:
- nameType: Types of names. Valid values are GENERIC, ASHKENAZI, or SEPHARDIC. If you are not processing Ashkenazi or Sephardic names, use GENERIC.
- ruleType: The types of rules to apply. Valid values are APPROX or EXACT.
- concat: Defines whether multiple possible matches should be combined with a pipe (|).
- languageSet: The language set to use. The value auto will allow the filter to identify the language, or a comma-separated list can be provided.
Example:
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX" concat="true" languageSet="auto"/>
  </analyzer>
</fieldType>
Input: sandeep
Tokenizer to filter: sandeep
Output: sYndip, sandDp, sandi, sandip, sondDp, sondi, sondip, zYndip, zandip, zondip
Among the generated tokens, sandip is the one that most closely matches the phonetic form we would expect for the input.
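As a variation on the example above, the following sketch restricts matching to a fixed set of languages and uses the stricter EXACT rules instead of APPROX. The field type name text_names_exact is hypothetical, and the language identifiers (english, german, polish) are assumed to follow the lowercase names used by the underlying Apache Commons Codec rule sets; verify them against your Solr version before relying on them.

```xml
<!-- Hypothetical field type: exact Beider-Morse matching for a known set of languages -->
<fieldType name="text_names_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- languageSet values are assumed identifiers; "auto" would enable language detection instead -->
    <filter class="solr.BeiderMorseFilterFactory"
            nameType="GENERIC"
            ruleType="EXACT"
            concat="true"
            languageSet="english,german,polish"/>
  </analyzer>
</fieldType>
```

Naming the languages explicitly skips automatic detection, which may be preferable when the origin of the names in the index is already known.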
In addition to BMPM, Solr provides several other algorithms for phonetic matching, each with its own behavior. These algorithms are:
- Daitch-Mokotoff soundex
- Double metaphone
- Metaphone
- Soundex
- Refined soundex
- Caverphone
- Kölner Phonetik also known as Cologne Phonetic
- NYSIIS
Explaining each algorithm is beyond the scope of this section, but you can explore their behavior through the Analysis screen of the Solr Admin console after configuring them in the managed-schema.xml file.
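For instance, a few of the algorithms listed above might be configured as follows. This is a minimal sketch: the field type names are hypothetical, and the encoder value assumes the names accepted by solr.PhoneticFilterFactory (such as Soundex, Metaphone, RefinedSoundex, Caverphone, ColognePhonetic, and Nysiis).

```xml
<!-- Hypothetical field type: Double Metaphone via its dedicated filter factory -->
<fieldType name="text_double_metaphone" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- inject="false" replaces each token with its phonetic code instead of keeping both -->
    <filter class="solr.DoubleMetaphoneFilterFactory" inject="false"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field type: Soundex selected through the generic phonetic filter -->
<fieldType name="text_soundex" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Change the encoder value to try another algorithm supported by this factory -->
    <filter class="solr.PhoneticFilterFactory" encoder="Soundex" inject="true"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field type: Daitch-Mokotoff Soundex via its dedicated filter factory -->
<fieldType name="text_dm_soundex" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.DaitchMokotoffSoundexFilterFactory" inject="true"/>
  </analyzer>
</fieldType>
```

With field types like these in place, entering the same name against each one in the Admin console should make the differences between the generated phonetic codes easy to compare.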