CS21 Lab 9: Google's Did you mean...

Due by 11:59 pm, Tuesday, April 5

The files google.py and design.py will appear in the cs21/labs/09 directory when you run update21 in a terminal window. Please use top-down design first, then implement your program in google.py. Run handin21 to submit these files.


Introduction

For this assignment, you will write a program that implements something similar to the algorithm Google uses for its Did you mean feature. If you are unfamiliar with this feature, you can try it yourself by going to Google and entering a slightly misspelled search term, such as coputer. Google will suggest that perhaps you meant to type computer (they now show results for computer, and provide a link if you *really* wanted to search for coputer). Our version of the algorithm will not return exactly the same results that Google does, but it will be a reasonable approximation.

To accomplish this task you will need a dictionary of valid words and how often they occur. When the user enters a word to search for, you will first check if the word appears in the dictionary. If so, you should report that it was found. If not, you will generate a list of possible alternatives based on the typical kinds of typos that a user could make. You will then look up each of these alternatives in our dictionary and, if some of these alternatives are valid words, pick the one that occurs most frequently and ask the user if they actually meant it instead. Somtimes the search word will be so badly misspelled that we won't be able to find any viable alternatives. In this case, you should just report that it was not found.

Here is a sample interaction:

Enter a word or QUIT to exit: speling
Did you mean: spelling
Enter a word or QUIT to exit: spelling
Found
Enter a word or QUIT to exit: benifit
Did you mean: benefit
Enter a word or QUIT to exit: telefone
Not found
Enter a word or QUIT to exit: vaccuum
Did you mean: vacuum
Enter a word or QUIT to exit: google
Did you mean: goggle
Enter a word or QUIT to exit: QUIT

1. Top-down design

Read through this entire lab carefully, then spend at least an hour sketching out the main program and the functions you'll need to complete this program. Write your top-down design in the file design.py. Remember to focus on the interfaces. What parameters will each function need? What will each function return? How will you call the functions from main?

Once you have a good understanding of the structure of your program, begin your bottom-up implementation in the file google.py. You can copy your design to google.py like this:

       cp design.py google.py


2. Create a word frequency dictionary

We have provided you with a file (/usr/local/doc/wordcts.txt) that contains each word from /usr/share/dict/words and the frequency with which this word occurred in a very large document collection. Here are the first 10 lines of that file:

a 7841087012
aardvark 93030
aardvarks 11762
abaci 5391
aback 272920
abacus 149548
abacuses 4473
abaft 19188
abalone 237386
abalones 6544
Very frequently used words such as a occurred very often (7,841,087,012 times). Common words such as abacus occurred more frequently (149,548 times) than uncommon words such as abaci (5,391 times). We will use the word frequencies to rank the alternatives, choosing the most frequent word as the proposed alternative.

To implement the ranking algorithm, you should first read the file /usr/local/doc/wordcts.txt into a python dictionary. The keys for this dictionary will be the words and the values will be their frequencies. Remember that the order of the keys in the dictionary will not match the order in which you entered them. For example, the start of the dictionary might look like this:

{'abilities': 6726666, 'abdications': 1464, 'abbreviation': 1110315, ...}
Write a function that creates this word-frequency dictionary, and then unit test it.



3. Generating Alternatives

Write a function that suggests alternative words. You will call this function only when the word entered by the user is not in the dictionary. The function will generate a large number of alternative words, rank them according to the most likely alternative, and then return this new alternative.

Let's begin by looking at how we generate alternative words. We'll use speling as the running example. This word is not in the dictionary, so we need to propose alternatives. Our alternatives will be generated using four different methods: deletion, transposition, substitution, and insertion. Each is explained below:

Deletions

We form alternatives of the word by generating all possible single-letter deletions. Using our example, speling, we would generate: peling, seling, spling, speing, spelng, spelig, and spelin. Each alternative differs from the original word by only one letter. For a word of length n, there are n alternatives we can form by deleting a single character; here, the word (speling) was 7 letters long and there are 7 alternatives we can form using deletion.

Transpositions

We form alternatives of the word by generating all possible single-letter transpositions. Using our example, speling, we would generate: pseling, sepling, spleing, speilng, spelnig, and spelign. For a word of length n, there are n-1 alternatives we can form by transposing a pair of characters; here, the word was 7 letters long and there are 6 alternatives we can form using transposition.

Substitutions

We form alternatives of the word by generating all possible single-letter substitutions. For each letter in the original word, we substitute a letter from the alphabet (abcd...xyz) to replace it. There are too many possible substitutions to list them all. Using our example, speling, we would generate:

apelingsaelingspalingspeaingspelangspeliagspelina
bpelingsbelingspblingspebingspelbngspelibgspelinb
cpelingscelingspclingspecingspelcngspelicgspelinc
.....................
ypelingsyelingspylingspeyingspelyngspeliygspeliny
zpelingszelingspzlingspezingspelzngspelizgspelinz

For a word of length n, there are 26 * n alternatives we can form by substituting characters; here, the word was 7 letters long and there are 182 alternatives we can form using substitution.

Insertions

We form alternatives of the word by generating all possible single-letter insertions. For each letter in the alphabet, we insert that letter into our word. There are too many possible insertions to list them all. Using our example, speling, we would generate:

aspelingsapelingspaelingspealingspelaingspeliangspelinagspelinga
bspelingsbpelingspbelingspeblingspelbingspelibngspelinbgspelingb
cspelingscpelingspcelingspeclingspelcingspelicngspelincgspelingc
........................
yspelingsypelingspyelingspeylingspelyingspeliyngspelinygspelingy
zspelingszpelingspzelingspezlingspelzingspelizngspelinzgspelingz

For a word of length n, there are 26 * (n+1) alternatives we can form by inserting characters; here, the word was 7 letters long and there are 208 alternatives we can form using insertion.

4. Ranking alternatives

In total, from deletions (n), transpositions (n-1), substitutions (26*n), and insertions (26*(n+1)), there are 54n+25 alternatives. This means that for a 7 letter word (like speling), there are 403 alternatives to choose from. However, when you type speling into Google, it doesn't suggest 403 alternatives. So, we need a way of deciding which one of the 403 alternatives is the best to suggest.

For the speling example, only 3 of the alternatives are legal words in our dictionary: spewing, spieling, and spelling. From these legal words, we suggest the alternative that is most frequent. If you consult the word frequency dictionary, you will find that spelling is the most frequent:

spelling 5373561
spewing 262886
spieling 1239
Therefore, we would choose spelling as the proposed alternative.

Note that, for some misspelled words, none of our generated alternatives will occur in the dictionary. For example, if you type in telefone, there are no alternatives which occur in the word list. In cases like this, simply return the original word (telefone).


Optional Extensions

You should only attempt these if you have completed the entire program as described above.

Once you are done implementing the algorithm as described above, you may tweak the algorithm to make it better. Our only caveat is that you should save your tweaked program as "google-extended.py" and leave the file "google.py" containing only the basic algorithm as described above.

There are lots of ideas you can explore. For example, it's more likely that people would mistake a "c" for a "k" than for an "l" since "c" and "k" sound the same (e.g. "elastik" vs "elastil"). People are more likely to accidentally press a neighboring key on the keyboard than they are to accidentally press a key on the other end of the keyboard (e.g. "walj" is a more likely typo of "walk" than "walb"). People often double letters when they're not supposed to (e.g. "vaccuum") and they forget to double letters when they're supposed to (e.g. "speling"). Plus, there are loads of other ideas. Implement any and all of them!

Submit

Once you are satisfied with your program, hand it in by typing handin21 in a terminal window.