I l@ve RuBoard

17.9 Looking Up Words by Sound Similarity

Credit: Greg Jorgensen, Scott David Daniels

17.9.1 Problem

You need to look up words (most often people's surnames) by sound, rather than by spelling, so that likely spelling mistakes don't spoil the search.

17.9.2 Solution

The Soundex algorithm (by Odell and Russell, made popular by Knuth) transforms each surname into a signature that is more representative of how that surname is likely to sound when pronounced than of how it's spelled:

def soundex(name, len=4):
    """ soundex module conforming to Odell-Russell algorithm """

    # digits holds the soundex values for the alphabet
    soundex_digits = '01230120022455012623010202'
    sndx = ''
    fc = ''

    # Translate letters in name to soundex digits
    for c in name.upper(  ):
        if c.isalpha(  ):
            if not fc: fc = c   # Remember first letter
            d = soundex_digits[ord(c)-ord('A')]
            # Duplicate consecutive soundex digits are skipped
            if not sndx or (d != sndx[-1]):
                sndx += d

    # Replace first digit with first letter
    sndx = fc + sndx[1:]

    # Remove all 0s from the soundex code
    sndx = sndx.replace('0', '')

    # Return soundex code truncated or 0-padded to len characters
    return (sndx + (len * '0'))[:len]

17.9.3 Discussion

The common approach to avoiding confusion when a name's spelling induces lookup errors is the Soundex algorithm, by Odell and Russell, as reported by Knuth. The algorithm is designed for English-language surnames. If you have a significant number of non-English surnames, you might want to alter the values in digits to improve your matches. For example, to accommodate a large number of Spanish surnames, you might count "J" and "L" ("L" because of how "ll" is used) as vowels, setting their positions in digits to 0.

The basic assumptions of Soundex are that the consonants are more important than the vowels, and they are placed in groups of letters that can be confused with each other. Coming up with a set of such groups for a language is not horribly tough if you know that language's typical pronunciation issues. Just remember that each group should contain all letters that can be confused with any of those in the group. For example, a slightly better code for both English and Spanish names has the digits "01230120002055012623010202".

In languages such as Italian, which has strong and very distinct vowels, the basic assumptions of Soundex break down. There, vowels should probably play a contrary role, that of anchors that cannot be confused with each other. However, Italian phonetics teaches us that this is true to a varying degree, depending in part on where the phonic accent falls in the surname�semivowels in destressed syllables are not good anchors�and these complications are somewhat difficult to handle in a simple-minded, speedy algorithm.

17.9.4 See Also

Soundex is described in Donald Knuth's The Art of Computer Programming (Addison-Wesley), which is discussed at http://www-cs-staff.stanford.edu/~knuth/taocp.html.

I l@ve RuBoard