2.17. Calculating Soundex

Problem

You need the Soundex code of a word or a name.

Solution

Use Jakarta Commons Codec’s Soundex. Supply a surname or a word, and Soundex will produce a phonetic encoding:

// Required import declaration
import org.apache.commons.codec.language.Soundex;

// Code body
Soundex soundex = new Soundex( );
String obrienSoundex = soundex.soundex( "O'Brien" );
String obrianSoundex = soundex.soundex( "O'Brian" );
String obryanSoundex = soundex.soundex( "O'Bryan" );

System.out.println( "O'Brien soundex: " + obrienSoundex );
System.out.println( "O'Brian soundex: " + obrianSoundex );
System.out.println( "O'Bryan soundex: " + obryanSoundex );

This will produce the following output for three similar surnames:

O'Brien soundex: O165
O'Brian soundex: O165
O'Bryan soundex: O165

Discussion

Soundex.soundex( ) takes a string, keeps the first letter as it is, and then calculates up to three digits from the consonants that follow. So, names such as “O’Bryan,” “O’Brien,” and “O’Brian,” all common variants of the same Irish surname, are given the same encoding: “O165.” The 1 corresponds to the B, the 6 corresponds to the R, and the 5 corresponds to the N. Characters that are not letters, such as the apostrophe, are ignored. Vowels and the letters H, W, and Y produce no digit, adjacent letters that share a code are counted once, and a short result is padded with zeros to four characters.
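To make the steps above concrete, here is a minimal sketch of the classic American Soundex rules in plain Java. It is an illustration only, not the Commons Codec source: the class name SoundexSketch and its encode( ) method are hypothetical, and input is assumed to be plain ASCII text.

```java
import java.util.Locale;

// Hypothetical illustration of the classic Soundex rules; not Commons Codec's code.
final class SoundexSketch {
    // Digit for each letter A..Z; '0' means the letter contributes no digit
    // (the vowels plus H, W, and Y).
    private static final String CODES = "01230120022455012623010202";

    static String encode(String input) {
        String upper = input.toUpperCase(Locale.ROOT);
        StringBuilder out = new StringBuilder(4);
        char lastCode = '0';
        for (int i = 0; i < upper.length() && out.length() < 4; i++) {
            char c = upper.charAt(i);
            if (c < 'A' || c > 'Z') {
                continue; // skip apostrophes, spaces, and other non-letters
            }
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c); // the first letter is kept as a letter
                lastCode = code;
            } else if (c != 'H' && c != 'W') {
                // H and W are transparent: they do not separate repeated codes.
                if (code != '0' && code != lastCode) {
                    out.append(code);
                }
                lastCode = code; // a vowel resets this, so a repeat after it counts
            }
        }
        while (out.length() > 0 && out.length() < 4) {
            out.append('0'); // pad short codes to four characters
        }
        return out.toString();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"O'Brien", "O'Bryan", "Boswell", "Baswol"}) {
            System.out.println(name + " -> " + encode(name));
        }
    }
}
```

In an application, prefer the library's Soundex class; a sketch like this is useful mainly for seeing why differently spelled names collapse to one code.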

The Soundex algorithm applies to words in general, but it is most often associated with surnames because historical United States census records are indexed using Soundex. Soundex is also used in the health care industry to index medical records and report statistics to the government. A system that provides access to individual records should allow a user to search for a person by the Soundex code of a surname. If a user types in the name “Boswell” to search for a patient in a hospital, the search result should also include patients named “Buswell” and “Baswol”; all three names share the code B240, so you can use Soundex to provide this capability if an application needs to locate individuals by the sound of a surname.
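One way to support this kind of lookup is to group stored surnames by their phonetic code. The following is a rough sketch under stated assumptions: SoundexIndex, add( ), and search( ) are hypothetical names, and the encoder is passed in as a function so the class itself does not depend on any library. With Commons Codec on the classpath, the encoder could be supplied as new Soundex( )::soundex.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical in-memory index that groups surnames by phonetic code.
final class SoundexIndex {
    private final Function<String, String> encoder;
    private final Map<String, List<String>> byCode = new HashMap<>();

    // The encoder maps a name to its code, for example new Soundex()::soundex.
    SoundexIndex(Function<String, String> encoder) {
        this.encoder = encoder;
    }

    // Store a surname under the code the encoder assigns to it.
    void add(String surname) {
        byCode.computeIfAbsent(encoder.apply(surname), k -> new ArrayList<>())
              .add(surname);
    }

    // Return every stored surname whose code matches the query's code.
    List<String> search(String query) {
        List<String> hits = byCode.get(encoder.apply(query));
        return hits == null
                ? Collections.<String>emptyList()
                : Collections.unmodifiableList(hits);
    }
}
```

A real records system would more likely store the code in an indexed database column next to the surname, but the idea is the same: encode once when a record is stored, then encode the search term and match on the code.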

The Soundex of a word or name can also be used as a primitive method to find out if two small words rhyme; because the code always begins with the word’s first letter, compare only the digits that follow it. Commons Codec contains other phonetic encodings, such as RefinedSoundex, Metaphone, and DoubleMetaphone. All of these alternatives solve a similar problem: capturing the phonemes or sounds contained in a word.

See Also

For more information on the Soundex encoding, take a look at the Dictionary of Algorithms and Data Structures at the National Institute of Standards and Technology (NIST), http://www.nist.gov/dads/HTML/soundex.html. There you will find links to a C implementation of the Soundex algorithm.

For more information about alternatives to Soundex encoding, read “The Double Metaphone Search Algorithm” by Lawrence Philips (http://www.cuj.com/documents/s=8038/cuj0006philips/). Or take a look at one of Lawrence Philips’s original Metaphone algorithm implementations at http://aspell.sourceforge.net/metaphone/. Both the Metaphone and Double Metaphone algorithms capture the sound of an English word; implementations of these algorithms are available in Commons Codec as Metaphone and DoubleMetaphone.
