CS21 Lab 7: Did you mean: spelling

Due 11:59pm Tuesday, March 25, 2008

A skeleton version of the program (google.py) will appear when you run update21 in a terminal window. The program handin21 will only submit files in this directory.

You may work with a partner on this assignment. If you work with a partner, put your name and the name of your partner in the list of authors at the top of your program. Only one partner should run handin21 to submit the files for the group. Both partners will receive the same grade.

Introduction

For this assignment, you will write a program that implements something similar to the algorithm Google uses for it's "Did you mean" feature. If you are unfamiliar with this feature, you can try it yourself by going to Google and entering a slightly misspelled search term, such as "speling". Google will suggest that perhaps you meant to type "spelling".

Google's algorithm is proprietary, so we can't know for sure what they do to determine when to suggest words, and which words to suggest. We can infer that Google does not use a dictionary to decide whether or not to propose alternate searches. For example, searching for "grap" (instead of "grape" or "graph") doesn't yield a "Did you mean" prompt even though "grap" isn't a word. Searching for "Wicentowsk" yields "Did you mean: Wicentowski" even though neither "Wicentowsk" or "Wicentowski" are words in any dictionary. Sometimes words that are in the dictionary are the targets of the "Did you mean" prompt. For example, even though "awkwardest" does not yield a "Did you mean" prompt, searching for "awkwarder", which is a word (albeit, an awkward one), yields "Did you mean: awkward".


Generating Alternatives

Our version of the algorithm will not return the same results that Google does, but it will be a reasonable approximation.

You will write a function which suggests alternative words. If the word is in the dictionary, your function will simply return the word. If the word is not in the dictionary, you will generate a large number of alternative words, rank them according to the most likely alternative, and then return this new alternative.

Let's begin by looking at how we generate alternatives. We'll use "speling" as the running example. This word is not in the dictionary, so we need to propose alternatives. Our alternatives will be generated using four different methods: deletion, transposition, substitution, and insertion. Each is explained below:

Deletions

We form alternatives of the word by generating all possible single-letter deletions. Using our example, "speling", we would generate: peling, seling, spling, speing, spelng, spelig, and spelin. Each alternative differs from the original word by only one letter. For a word of length n, there are n alternates we can form by deleting a single character; here, the word (speling) was 7 letters long and there are 7 alternatives we can form using deletion.

Transpositions

We form alternatives of the word by generating all possible single-letter transpositions. Using our example, "speling", we would generate: pseling, sepling, spleing, speilng, spelnig, and spelign. For a word of length n, there are n-1 alternates we can form by transposing a pair of characters; here, the word was 7 letters long and there are 6 alternatives we can form using transposition.

Substitutions

We form alternatives of the word by generating all possible single-letter substitutions. For each letter in the original word, we substitute a letter from the alphabet (abcd...xyz) to replace it. There are too many possible substitutions to list them all. Using our example, "speling", we would generate:

apelingsaelingspalingspeaingspelangspeliagspelina
bpelingsbelingspblingspebingspelbngspelibgspelinb
cpelingscelingspclingspecingspelcngspelicgspelinc
.....................
ypelingsyelingspylingspeyingspelyngspeliygspeliny
zpelingszelingspzlingspezingspelzngspelizgspelinz

For a word of length n, there are 26 * n alternates we can form by substituting characters; here, the word was 7 letters long and there are 182 alternatives we can form using substitution.

Insertions

We form alternatives of the word by generating all possible single-letter insertions. For each letter in the alphabet, we insert that letter into our word. There are too many possible insertions to list them all. Using our example, "speling", we would generate:

aspelingsapelingspaelingspealingspelaingspeliangspelinagspelinga
bspelingsbpelingspbelingspeblingspelbingspelibngspelinbgspelingb
cspelingscpelingspcelingspeclingspelcingspelicngspelincgspelingc
........................
yspelingsypelingspyelingspeylingspelyingspeliyngspelinygspelingy
zspelingszpelingspzelingspezlingspelzingspelizngspelinzgspelingz

For a word of length n, there are 26 * (n+1) alternates we can form by inserting characters; here, the word was 7 letters long and there are 208 alternatives we can form using insertion.

Ranking alternatives

In total, between deletions (n), transpositions (n-1), substitutions (26*n), and insertions (26*(n+1)), there are 54n+25 alternatives. This means that for a 7 letter word (like "speling"), there are 403 alternatives to choose from. However, when you type "speling" into Google, it doesn't suggest 403 alternatives. So, we need a way of deciding which one of the 403 alternatives are good choices to suggest.

For the "speling" example, we find that only 3 of the alternatives are legal words in our dictionary: spewing, spieling, and spelling. Even now, though, we can't suggest all three of these words, we need to pick just one.

We have provided you with a new dictionary (/usr/local/doc/wordcts.txt) which contains each of the words from /usr/share/dict/words augmented with a frequency with which this word occurred in a very large document collection. Here are the first 10 lines of that file:

a 7841087012
aardvark 93030
aardvarks 11762
abaci 5391
aback 272920
abacus 149548
abacuses 4473
abaft 19188
abalone 237386
abalones 6544
Very frequently used words such as "a" occurred very often (7,841,087,012 times). Common words such as "abacus" occurred more frequently (149,548 times) than uncommon words such as "abaci" (5,391 times). We will use the frequency with which words occurred to rank the alternatives and we will choose the most frequently occurring word as the proposed alternative.

In our running example, we want to choose the alternative of "speling" that is most frequent. Consulting wordcts.txt, we find that "spelling" is the most frequent:

spelling 5373561
spewing 262886
spieling 1239
Therefore, we would want to choose spelling as the proposed alternative.

To implement this ranking algorithm, you will first need to read the file /usr/local/doc/wordcts.txt into a list of (word, frequency) tuples:

[(a, 7841087012), (aardvark, 93030), (aardvarks, 11762), ...]
You will then need to search this list of (word, frequency) tuples for each of your proposed alternatives. This means for "speling" you will need to perform 403 searches in this list. You will want to use binary search since the list of words is in sorted order. You will then need to choose the word that has the highest frequency of all the proposed alternatives.

Note that for some misspelled words, there will be no alternatives that occur in the dictionary. For example, if you type in "telefone", there are no alternatives which occur in the word list. In cases like this, simply return the original word ("telefone").

Interface

Your program will ask the user repeatedly enter words until they type "QUIT". For each word the user enters, you will perform your algorithm above, returning either the original word or a proposed alternative. If you are proposing an alternative (such as "spelling"), you will ask the user "Did you mean: spelling". If you are returning the original word (either because it was spelled correctly or because you could not find a valid alternative), you should not print anything.

Here is a sample interaction:

Enter a word, or QUIT to exit: speling
Did you mean: spelling
Enter a word, or QUIT to exit: spelling
Enter a word, or QUIT to exit: telefone
Enter a word, or QUIT to exit: vaccuum
Did you mean: vacuum
Enter a word, or QUIT to exit: benifit
Did you mean: benefit
Enter a word, or QUIT to exit: google
Did you mean: goggle
Enter a word, or QUIT to exit: QUIT
Extensions

Once you are done implementing the algorithm as explained, you can tweak the algorithm to make it better. The only caveat is that you should save it as "google-extended.py" and leave the file "google.py" containing only the basic algorithm as described above.

There are lots of ideas you can explore. For example, it's more likely that people would mistake a "c" for a "k" than for an "l" since "c" and "k" sound the same (e.g. "elastik" vs "elastil"). People are more likely to accidentally press a neighboring key on the keyboard than they are to accidentally press a key on the other end of the keyboard (e.g. "walj" is a more likely typo of "walk" than "walb"). People often double letters when they're not supposed to (e.g. "vaccuum") and they forget to double letters when they're supposed to (e.g. "speling"). Plus, there are loads of other ideas. Implement any and all of them!

Submit

Once you are satisfied with your program, hand it in by typing handin21 in a terminal window.