The files google.py and design.py will appear in the cs21/labs/09 directory when you run update21 in a terminal window. Please use top-down design first, then implement your program in google.py. Run handin21 to submit these files.
For this assignment, you will write a program that implements something similar to the algorithm Google uses for its Did you mean feature. If you are unfamiliar with this feature, you can try it yourself by going to Google and entering a slightly misspelled search term, such as coputer. Google will suggest that perhaps you meant to type computer (they now show results for computer, and provide a link if you *really* wanted to search for coputer). Our version of the algorithm will not return exactly the same results that Google does, but it will be a reasonable approximation.
To accomplish this task you will need a dictionary of valid words and how often they occur. When the user enters a word to search for, you will first check if the word appears in the dictionary. If so, you should report that it was found. If not, you will generate a list of possible alternatives based on the typical kinds of typos that a user could make. You will then look up each of these alternatives in our dictionary and, if some of these alternatives are valid words, pick the one that occurs most frequently and ask the user if they actually meant it instead. Somtimes the search word will be so badly misspelled that we won't be able to find any viable alternatives. In this case, you should just report that it was not found.
Here is a sample interaction:
Enter a word or QUIT to exit: speling Did you mean: spelling Enter a word or QUIT to exit: spelling Found Enter a word or QUIT to exit: benifit Did you mean: benefit Enter a word or QUIT to exit: telefone Not found Enter a word or QUIT to exit: vaccuum Did you mean: vacuum Enter a word or QUIT to exit: google Did you mean: goggle Enter a word or QUIT to exit: QUIT
Read through this entire lab carefully, then spend at least an hour sketching out the main program and the functions you'll need to complete this program. Write your top-down design in the file design.py. Remember to focus on the interfaces. What parameters will each function need? What will each function return? How will you call the functions from main?
Once you have a good understanding of the structure of
your program, begin your bottom-up implementation in the file
google.py. You can copy your design to google.py like
this:
cp design.py google.py
We have provided you with a file (/usr/local/doc/wordcts.txt) that contains each word from /usr/share/dict/words and the frequency with which this word occurred in a very large document collection. Here are the first 10 lines of that file:
a 7841087012 aardvark 93030 aardvarks 11762 abaci 5391 aback 272920 abacus 149548 abacuses 4473 abaft 19188 abalone 237386 abalones 6544Very frequently used words such as a occurred very often (7,841,087,012 times). Common words such as abacus occurred more frequently (149,548 times) than uncommon words such as abaci (5,391 times). We will use the word frequencies to rank the alternatives, choosing the most frequent word as the proposed alternative.
To implement the ranking algorithm, you should first read the file /usr/local/doc/wordcts.txt into a python dictionary. The keys for this dictionary will be the words and the values will be their frequencies. Remember that the order of the keys in the dictionary will not match the order in which you entered them. For example, the start of the dictionary might look like this:
{'abilities': 6726666, 'abdications': 1464, 'abbreviation': 1110315, ...}Write a function that creates this word-frequency dictionary, and then unit test it.
Write a function that suggests alternative words. You will call this function only when the word entered by the user is not in the dictionary. The function will generate a large number of alternative words, rank them according to the most likely alternative, and then return this new alternative.
Let's begin by looking at how we generate alternative words. We'll use speling as the running example. This word is not in the dictionary, so we need to propose alternatives. Our alternatives will be generated using four different methods: deletion, transposition, substitution, and insertion. Each is explained below:
We form alternatives of the word by generating all possible single-letter substitutions. For each letter in the original word, we substitute a letter from the alphabet (abcd...xyz) to replace it. There are too many possible substitutions to list them all. Using our example, speling, we would generate:
apeling | saeling | spaling | speaing | spelang | speliag | spelina |
bpeling | sbeling | spbling | spebing | spelbng | spelibg | spelinb |
cpeling | sceling | spcling | specing | spelcng | spelicg | spelinc |
... | ... | ... | ... | ... | ... | ... |
ypeling | syeling | spyling | speying | spelyng | speliyg | speliny |
zpeling | szeling | spzling | spezing | spelzng | spelizg | spelinz |
For a word of length n, there are 26 * n alternatives we can form by substituting characters; here, the word was 7 letters long and there are 182 alternatives we can form using substitution.
We form alternatives of the word by generating all possible single-letter insertions. For each letter in the alphabet, we insert that letter into our word. There are too many possible insertions to list them all. Using our example, speling, we would generate:
aspeling | sapeling | spaeling | spealing | spelaing | speliang | spelinag | spelinga |
bspeling | sbpeling | spbeling | spebling | spelbing | spelibng | spelinbg | spelingb |
cspeling | scpeling | spceling | specling | spelcing | spelicng | spelincg | spelingc |
... | ... | ... | ... | ... | ... | ... | ... |
yspeling | sypeling | spyeling | speyling | spelying | speliyng | spelinyg | spelingy |
zspeling | szpeling | spzeling | spezling | spelzing | spelizng | spelinzg | spelingz |
For a word of length n, there are 26 *
(n+1) alternatives we can form by inserting characters; here,
the word was 7 letters long and there are 208 alternatives we can form
using insertion.
In total, from deletions (n), transpositions (n-1), substitutions (26*n), and insertions (26*(n+1)), there are 54n+25 alternatives. This means that for a 7 letter word (like speling), there are 403 alternatives to choose from. However, when you type speling into Google, it doesn't suggest 403 alternatives. So, we need a way of deciding which one of the 403 alternatives is the best to suggest.
For the speling example, only 3 of the alternatives are legal words in our dictionary: spewing, spieling, and spelling. From these legal words, we suggest the alternative that is most frequent. If you consult the word frequency dictionary, you will find that spelling is the most frequent:
spelling 5373561 spewing 262886 spieling 1239Therefore, we would choose spelling as the proposed alternative.
Note that, for some misspelled words, none of our generated alternatives will occur in the dictionary. For example, if you type in telefone, there are no alternatives which occur in the word list. In cases like this, simply return the original word (telefone).
You should only attempt these if you have completed the entire program as described above.
Once you are done implementing the algorithm as described above, you may tweak the algorithm to make it better. Our only caveat is that you should save your tweaked program as "google-extended.py" and leave the file "google.py" containing only the basic algorithm as described above.
There are lots of ideas you can explore. For example, it's more
likely that people would mistake a "c" for a "k" than for an "l" since
"c" and "k" sound the same (e.g. "elastik" vs "elastil"). People are
more likely to accidentally press a neighboring key on the keyboard
than they are to accidentally press a key on the other end of the
keyboard (e.g. "walj" is a more likely typo of "walk" than "walb").
People often double letters when they're not supposed to
(e.g. "vaccuum") and they forget to double letters when they're
supposed to (e.g. "speling"). Plus, there are loads of other ideas.
Implement any and all of them!
Once you are satisfied with your program, hand it in by typing handin21 in a terminal window.