A skeleton version of the program (google.py) will appear when you run update21 in a terminal window. The program handin21 will only submit files in this directory.
You may work with a partner on this assignment. If you work with a
partner, put your name and the name of your partner in the list of
authors at the top of your program. Only one partner should run
handin21 to submit the files for the group. Both partners
will receive the same grade.
For this assignment, you will write a program that implements something similar to the algorithm Google uses for it's "Did you mean" feature. If you are unfamiliar with this feature, you can try it yourself by going to Google and entering a slightly misspelled search term, such as "speling". Google will suggest that perhaps you meant to type "spelling".
Google's algorithm is proprietary, so we can't know for sure what they do to determine when to suggest words, and which words to suggest. We can infer that Google does not use a dictionary to decide whether or not to propose alternate searches. For example, searching for "grap" (instead of "grape" or "graph") doesn't yield a "Did you mean" prompt even though "grap" isn't a word. Searching for "Wicentowsk" yields "Did you mean: Wicentowski" even though neither "Wicentowsk" or "Wicentowski" are words in any dictionary. Sometimes words that are in the dictionary are the targets of the "Did you mean" prompt. For example, even though "awkwardest" does not yield a "Did you mean" prompt, searching for "awkwarder", which is a word (albeit, an awkward one), yields "Did you mean: awkward".
Our version of the algorithm will not return the same results that Google does, but it will be a reasonable approximation.
You will write a function which suggests alternative words. If the word is in the dictionary, your function will simply return the word. If the word is not in the dictionary, you will generate a large number of alternative words, rank them according to the most likely alternative, and then return this new alternative.
Let's begin by looking at how we generate alternatives. We'll use "speling" as the running example. This word is not in the dictionary, so we need to propose alternatives. Our alternatives will be generated using four different methods: deletion, transposition, substitution, and insertion. Each is explained below:
We form alternatives of the word by generating all possible single-letter substitutions. For each letter in the original word, we substitute a letter from the alphabet (abcd...xyz) to replace it. There are too many possible substitutions to list them all. Using our example, "speling", we would generate:
For a word of length n, there are 26 * n alternates we can form by substituting characters; here, the word was 7 letters long and there are 182 alternatives we can form using substitution.
We form alternatives of the word by generating all possible single-letter insertions. For each letter in the alphabet, we insert that letter into our word. There are too many possible insertions to list them all. Using our example, "speling", we would generate:
For a word of length n, there are 26 * (n+1) alternates we can form by inserting characters; here, the word was 7 letters long and there are 208 alternatives we can form using insertion.
In total, between deletions (n), transpositions (n-1), substitutions (26*n), and insertions (26*(n+1)), there are 54n+25 alternatives. This means that for a 7 letter word (like "speling"), there are 403 alternatives to choose from. However, when you type "speling" into Google, it doesn't suggest 403 alternatives. So, we need a way of deciding which one of the 403 alternatives are good choices to suggest.
For the "speling" example, we find that only 3 of the alternatives are legal words in our dictionary: spewing, spieling, and spelling. Even now, though, we can't suggest all three of these words, we need to pick just one.
We have provided you with a new dictionary (/usr/local/doc/wordcts.txt) which contains each of the words from /usr/share/dict/words augmented with a frequency with which this word occurred in a very large document collection. Here are the first 10 lines of that file:
a 7841087012 aardvark 93030 aardvarks 11762 abaci 5391 aback 272920 abacus 149548 abacuses 4473 abaft 19188 abalone 237386 abalones 6544Very frequently used words such as "a" occurred very often (7,841,087,012 times). Common words such as "abacus" occurred more frequently (149,548 times) than uncommon words such as "abaci" (5,391 times). We will use the frequency with which words occurred to rank the alternatives and we will choose the most frequently occurring word as the proposed alternative.
In our running example, we want to choose the alternative of "speling" that is most frequent. Consulting wordcts.txt, we find that "spelling" is the most frequent:
spelling 5373561 spewing 262886 spieling 1239Therefore, we would want to choose spelling as the proposed alternative.
To implement this ranking algorithm, you will first need to read the file /usr/local/doc/wordcts.txt into a list of (word, frequency) tuples:
[(a, 7841087012), (aardvark, 93030), (aardvarks, 11762), ...]You will then need to search this list of (word, frequency) tuples for each of your proposed alternatives. This means for "speling" you will need to perform 403 searches in this list. You will want to use binary search since the list of words is in sorted order. You will then need to choose the word that has the highest frequency of all the proposed alternatives.
Note that for some misspelled words, there will be no alternatives that occur in the dictionary. For example, if you type in "telefone", there are no alternatives which occur in the word list. In cases like this, simply return the original word ("telefone").
Your program will ask the user repeatedly enter words until they type "QUIT". For each word the user enters, you will perform your algorithm above, returning either the original word or a proposed alternative. If you are proposing an alternative (such as "spelling"), you will print out "Did you mean: spelling". If you are returning the original word (either because it was spelled correctly or because you could not find a valid alternative), you should not print anything. Your program should match this format exactly.
Here is a sample interaction:
Enter a word, or QUIT to exit: speling Did you mean: spelling Enter a word, or QUIT to exit: spelling Enter a word, or QUIT to exit: telefone Enter a word, or QUIT to exit: vaccuum Did you mean: vacuum Enter a word, or QUIT to exit: benifit Did you mean: benefit Enter a word, or QUIT to exit: google Did you mean: goggle Enter a word, or QUIT to exit: QUIT
Once you are done implementing the algorithm as explained, you can tweak the algorithm to make it better. The only caveat is that you should save it as "google-extended.py" and leave the file "google.py" containing only the basic algorithm as described above.
There are lots of ideas you can explore. For example, it's more likely that people would mistake a "c" for a "k" than for an "l" since "c" and "k" sound the same (e.g. "elastik" vs "elastil"). The same is true for "f" and "ph" as shown above in the "telefone" example. People are more likely to accidentally press a neighboring key on the keyboard than they are to accidentally press a key on the other end of the keyboard (e.g. "walj" is a more likely typo of "walk" than "wall"). People often double letters when they're not supposed to (e.g. "vaccuum") and they forget to double letters when they're supposed to (e.g. "speling"). Plus, there are loads of other ideas. Implement any and all of them!
Once you are satisfied with your program, hand it in by typing handin21 in a terminal window.