Understanding Beider-Morse phonetic matching

Beider-Morse Phonetic Matching (BMPM) helps you search for personal names or surnames. It is more sophisticated than older phonetic algorithms such as soundex, metaphone, and caverphone. Its purpose is to match names that are phonetically equivalent to the name being searched for. Like soundex, BMPM does not require an exact spelling, but it is designed to produce far fewer false hits.

It executes these steps to extract names that are phonetically equivalent:

  • Determines the language from the spelling of the name
  • Applies the phonetic rules of the identified language to transliterate the name into a phonetic alphabet
  • If the language cannot be identified from the name, applies generic phonetic rules instead
  • Finally, it applies language-independent rules regarding things such as voiced and unvoiced consonants and vowels to further ensure the reliability of the matches

BMPM supports the following languages: English, French, German, Greek, Hebrew written in Hebrew script, Hungarian, Italian, Polish, Romanian, Russian written in Cyrillic script, Russian transliterated into Latin script, Spanish, and Turkish.

Factory class: solr.BeiderMorseFilterFactory

Arguments:

  • nameType: Types of names. Valid values are GENERIC, ASHKENAZI, or SEPHARDIC. If you are not processing Ashkenazi or Sephardic names, use GENERIC.
  • ruleType: The types of rules to apply. Valid values are APPROX or EXACT.
  • concat: Whether multiple possible matches should be combined with a pipe (|). Valid values are true or false.
  • languageSet: The language set to use. The value auto will allow the filter to identify the language, or a comma-separated list can be provided.

Example:

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX" concat="true" languageSet="auto"/>
  </analyzer>
</fieldType>

Input: sandeep

Tokenizer to filter: sandeep

Output: sYndip, sandDp, sandi, sandip, sondDp, sondi, sondip, zYndip, zandip, zondip

Among the generated tokens, sandip is the one closest to the spelling we would expect.
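
As a point of comparison, the following is a minimal sketch of a stricter variant of the same field type. It assumes a hypothetical field type name, text_name_exact, switches ruleType to EXACT, and restricts languageSet to an explicit comma-separated list. The lowercase language names english and german are an assumption about the identifiers the filter accepts. EXACT rules would be expected to produce fewer phonetic variants than APPROX, so check the actual tokens on the Analysis screen before relying on this configuration.

```xml
<!-- Sketch only: the field type name and the language list are illustrative assumptions -->
<fieldType name="text_name_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- EXACT rules and a fixed language list instead of APPROX and auto detection -->
    <filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="EXACT" concat="true" languageSet="english,german"/>
  </analyzer>
</fieldType>
```

Running the same input, sandeep, through both field types side by side is a quick way to see how much each setting widens or narrows the set of generated tokens.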

In addition to BMPM, Solr provides several other phonetic matching algorithms, each with its own behavior. They are as follows:

  • Daitch-Mokotoff soundex
  • Double metaphone
  • Metaphone
  • Soundex
  • Refined soundex
  • Caverphone
  • Kölner Phonetik (also known as Cologne Phonetic)
  • NYSIIS

Explaining each algorithm is beyond the scope of this section, but we can observe their behavior through the Solr Admin console by configuring them in the managed-schema.xml file.
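
As a starting point for such experiments, here is a minimal sketch of a field type that uses one of these algorithms. The field type name text_phonetic is hypothetical. The sketch relies on solr.PhoneticFilterFactory, whose encoder argument selects the algorithm, and on the inject argument, which keeps the original token alongside the phonetic code when set to true.

```xml
<!-- Sketch only: the field type name is an illustrative assumption -->
<fieldType name="text_phonetic" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- encoder accepts values such as DoubleMetaphone, Metaphone, Soundex,
         RefinedSoundex, Caverphone, ColognePhonetic, and Nysiis -->
    <filter class="solr.PhoneticFilterFactory" encoder="Soundex" inject="true"/>
    <!-- Alternatives that have their own factory classes:
    <filter class="solr.DoubleMetaphoneFilterFactory" inject="true"/>
    <filter class="solr.DaitchMokotoffSoundexFilterFactory" inject="true"/>
    -->
  </analyzer>
</fieldType>
```

Changing the encoder value, or swapping in one of the commented filters, and then re-running the same name on the Analysis screen shows how differently each algorithm encodes it.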
