Use Jakarta Commons Codec’s Soundex. Supply a surname or a word, and Soundex will produce a phonetic encoding:
// Required import declaration
import org.apache.commons.codec.language.Soundex;

// Code body
Soundex soundex = new Soundex( );
String obrienSoundex = soundex.soundex( "O'Brien" );
String obrianSoundex = soundex.soundex( "O'Brian" );
String obryanSoundex = soundex.soundex( "O'Bryan" );

System.out.println( "O'Brien soundex: " + obrienSoundex );
System.out.println( "O'Brian soundex: " + obrianSoundex );
System.out.println( "O'Bryan soundex: " + obryanSoundex );
This will produce the following output for three similar surnames:
O'Brien soundex: O165
O'Brian soundex: O165
O'Bryan soundex: O165
Soundex.soundex( ) takes a string, preserves the first letter as a letter code, and calculates the rest of the code from the consonants that follow. So, names such as “O’Bryan,” “O’Brien,” and “O’Brian,” all common variants of the Irish surname, are given the same encoding: “O165.” The 1 corresponds to the B, the 6 corresponds to the R, and the 5 corresponds to the N; the vowels and the apostrophe contribute nothing to the Soundex code.
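The encoding rules described above can be sketched in plain Java. This is an illustrative reimplementation of classic American Soundex, not the Commons Codec source: the class name SoundexSketch and its encode method are made up for this example, and the handling of H and W follows the commonly published rules, which may differ in edge cases from a particular library version.

```java
import java.util.Locale;

// Hypothetical, simplified Soundex encoder for illustration only.
final class SoundexSketch {
    // Digit for each letter A..Z; '0' marks letters that carry no code
    // (the vowels plus H, W, and Y).
    private static final String CODES = "01230120022455012623010202";

    static String encode(String name) {
        StringBuilder out = new StringBuilder();
        char lastCode = '0';
        for (char raw : name.toUpperCase(Locale.ROOT).toCharArray()) {
            if (raw < 'A' || raw > 'Z') {
                continue; // skip apostrophes, spaces, and other non-letters
            }
            char code = CODES.charAt(raw - 'A');
            if (out.length() == 0) {
                out.append(raw); // the first letter is kept as-is
                lastCode = code;
            } else {
                // Emit a digit only for a coded letter that does not
                // repeat the previous code.
                if (code != '0' && code != lastCode) {
                    out.append(code);
                }
                // Vowels reset the previous code; H and W do not.
                if (raw != 'H' && raw != 'W') {
                    lastCode = code;
                }
            }
            if (out.length() == 4) {
                break; // one letter plus three digits
            }
        }
        if (out.length() == 0) {
            return ""; // no letters in the input
        }
        while (out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("O'Brien")); // O165
        System.out.println(encode("O'Bryan")); // O165
        System.out.println(encode("Boswell")); // B240
    }
}
```

In an application, the Commons Codec Soundex class shown earlier remains the better choice; the sketch is only meant to make the letter-to-digit mapping visible.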
The Soundex algorithm can be used in a number of situations, but Soundex is usually associated with surnames, as the United States historical census records are indexed using Soundex. In addition to the role Soundex plays in the census, Soundex is also used in the health care industry to index medical records and report statistics to the government. A system to access individual records should allow a user to search for a person by the Soundex code of a surname. If a user types in the name “Boswell” to search for a patient in a hospital, the search result should include patients named “Buswell” and “Baswol”; you can use Soundex to provide this capability if an application needs to locate individuals by the sound of a surname.
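One way to support such a search is to file each surname under its phonetic code and look up the code of whatever the user types. The sketch below is a minimal illustration under stated assumptions: PhoneticIndex and its methods are hypothetical names, the encoder is passed in as a function, and the main method uses a crude vowel-stripping stand-in (not Soundex) so that the example runs without any library on the classpath.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper: groups surnames under the code an encoder assigns them.
final class PhoneticIndex {
    private final Function<String, String> encoder;
    private final Map<String, List<String>> byCode = new HashMap<>();

    public PhoneticIndex(Function<String, String> encoder) {
        this.encoder = encoder;
    }

    // Files a surname under its phonetic code.
    public void add(String surname) {
        byCode.computeIfAbsent(encoder.apply(surname), k -> new ArrayList<>())
              .add(surname);
    }

    // Returns every stored surname whose code equals the query's code.
    public List<String> search(String query) {
        List<String> hits = byCode.get(encoder.apply(query));
        return hits == null
                ? Collections.emptyList()
                : Collections.unmodifiableList(hits);
    }

    public static void main(String[] args) {
        // Stand-in encoder (an assumption for this demo, NOT Soundex):
        // uppercase the name and drop the vowels.
        PhoneticIndex index = new PhoneticIndex(
                s -> s.toUpperCase(Locale.ROOT).replaceAll("[AEIOU]", ""));
        index.add("Buswell");
        index.add("Smith");
        System.out.println(index.search("Boswell")); // [Buswell]
    }
}
```

With Commons Codec on the classpath, the encoder could instead be supplied as new Soundex( )::soundex, in which case a search for “Boswell” should also match “Baswol,” since Soundex collapses more spelling variation than the stand-in does.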
The Soundex of a word or name can also be used as a primitive method to find out if two small words rhyme. Commons Codec contains other phonetic encodings, such as RefinedSoundex, Metaphone, and DoubleMetaphone. All of these alternatives solve a similar problem: capturing the phonemes, or sounds, contained in a word.
For more information on the Soundex encoding, take a look at the Dictionary of Algorithms and Data Structures at the National Institute of Standards and Technology (NIST), http://www.nist.gov/dads/HTML/soundex.html. There you will find links to a C implementation of the Soundex algorithm.
For more information about alternatives to Soundex encoding, read “The Double Metaphone Search Algorithm” by Lawrence Philips (http://www.cuj.com/documents/s=8038/cuj0006philips/). Or take a look at one of Lawrence Philips’s original Metaphone algorithm implementations at http://aspell.sourceforge.net/metaphone/. Both the Metaphone and Double Metaphone algorithms capture the sound of an English word; implementations of these algorithms are available in Commons Codec as Metaphone and DoubleMetaphone.