Note: As part of this week's lab, also complete these worksheet problems, due Tuesday in class.
For Lab 10, you will be completing the rest of the features for our approximation of Google's "Did you mean?" algorithm. You will be using your solution from last week, google.py, as your starting point for this lab. Be sure to reread Lab 9 for background and a general overview. While we implemented some functionality last week, we were nowhere near challenging Google's search dominance. For this lab, you will implement three other functions for generating alternative spellings (see Generating Alternatives, below). In addition, you will use sorting to more intelligently suggest alternate spellings to the user (see Suggesting Alternatives, blow). Briefly, you will answer the question "Did you mean?" by giving priority to commonly occuring words in the English language (i.e., words with high frequency). The complete program requirements and a sample run of the program are found towards the bottom of this page.
IMPORTANT: To begin your lab, you must do the following:
$ cd $ update21 $ cd cs21/labs/10 $ cp ../09/google.py .Your Lab 10 solution should be completed only in your lab/10 directory. handin21 will only submit files from this directory.
There are four different methods you need to use to create alternative spellings: deletion, transposition, substitution, and insertion. Each is explained below. We'll use the word speling as our running example. This word is not in the word-frequency list, so we need to propose alternatives.
Deletions was implemented for Lab 9. For this lab, you should implement the other three functions: insertions, substitutions, and transpositions.
Form alternatives of the word by generating all possible single-letter substitutions. For each character in the word, replace the character by substituting a letter from the alphabet (abcd...xyz). For speling, first try every possible replacement for the first character, "s"; e.g., ateling, bteling, cteling, etc. Then try all substitutions for the second character of the original word; e.g., saeling, sbeling, sceling. There are too many possible substitutions to list them all, but this table should demonstrate the general pattern for the word speling:
For a word of length n, there are 26 * n alternatives we can form by substituting characters; here, the word was 7 letters long and there are 182 alternatives we can form using substitution.
Form alternatives of the word by inserting single-letters in the word. For each letter in the alphabet, we insert that letter into our word in all possible locations. For example, in our example, try inserting the letter "a" in each position: aspeling, sapeling, spaeling, spealing, spelaing, speliang, spelinag, and spelinga. We then do the same for the letter "b", and so on. Notice that there are 8 possible locations for the inserted letter "a" in the the 7-letter word speling. There are too many possible insertions to list them all. Using our example, speling, we would generate:
For a word of length n, there are 26 * (n+1) alternatives we can form by inserting characters; here, the word was 7 letters long and there are 208 alternatives we can form using insertion.
Hint: all of these methods will involve loops, some nested. In python, slicing and indexing can be very useful. And don't forget to use the python shell for testing!
Hint: while you are generating lots of different alternative spellings, you
only need to suggest actual words. That is, you only need to keep track of
alternatives that are in the wordcts.txt file.
In Lab 9, once all alternative spellings of a word are considered, your program simply printed to the screen all valid alternative spellings and their term frequencies values. For Lab 10, you will overwrite this technique to, instead, suggest alternate spellings that are most likely to be what the user intended to type. Google's algorithm uses something akin to the frequency of use of a word to rank possible alternatives. That is, a word that is seen widely in the English language (e.g., the) is more likely to be what the user meant than a rare word (e.g., thy). In essence, "Did you mean?" is a popularity contest.
For your program, your job is to take the list of possible alternate spellings for the original search word and suggest the word with the highest term frequency. Next, you should ask the user if the suggestion is correct. If the user types no, you should output the possible alternatives with the second highest frequency. You should then repeat the question, continuing to output the alternate spelling with the next highest frequency until a) there are no more alternate suggestions or b) the user types yes to the suggested alternative. You must implement your own function for sorting the alternate spellings. You cannot use built-in python methods to perform the sort such as list.sort().
For the speling example, only 3 of the alternatives are valid words in the English language: spewing, spieling, and spelling. From these words, we suggest the alternative that is most frequent. If you consult your word frequency list of lists, you should find that spelling is the most frequent:
[... ['spelling', 5373561],... ['spewing', 262886],... ['spieling', 1239], ...]Therefore, we would choose spelling as the proposed alternative. If that is not correct, we should suggest spewing. Here is what this looks like:
Enter a word or QUIT to exit: speling Did you mean spelling (frequency 5373561)? no Did you mean spewing (frequency 262886)? no Did you mean spieling (frequency 1239)? no Not found.
The general flow of your algorithm will be very similar to Lab 9. The requirements below build off of Lab 9 requirements, with a tag of New Requirement for requirements unique to Lab 10. You are responsible for all items on this list, however, including making corrections for any errors in Lab 9.
Here is what a sample run of your program should like like:
Welcome to "Did You Mean?". This program will search a list of known words to see if your search query is a valid word. If not, it will suggest alternate spellings that are valid words. Enter a word or QUIT to exit: speling Did you mean: spelling (frequency: 5373561)? no Did you mean: spewing (frequency: 262886)? no Did you mean: spieling (frequency: 1239)? no Not found Enter a word or QUIT to exit: abase Found Enter a word or QUIT to exit: ebash Did you mean: bash (frequency: 2781144)? yes Enter a word or QUIT to exit: quit Found Enter a word or QUIT to exit: hed Did you mean: he (frequency: 595282264)? no Did you mean: had (frequency: 472458757)? no Did you mean: her (frequency: 360656669)? no Did you mean: head (frequency: 71472344)? good Please enter yes or no: yes Enter a word or QUIT to exit: asdf No entries found Enter a word or QUIT to exit: poole Did you mean: pool (frequency: 28654230)? no Did you mean: pole (frequency: 5209605)? yes Enter a word or QUIT to exit: QUIT
Remember to do unit testing of your functions. You can either do unit testing in the python interpreter or by adding some test calls to your functions in main() and running it normally. Also, you can add debug print statements to see what your program is doing. For example, to see if your word-frequency list looks right, you could add some print statements to list a few values in the list. Just make sure to remove all debug prints before submitting your solution.
Lists and strings will probably be needed in this program. See our documentation about using strings and lists as objects for guidance on advanced operations for strings and lists.
Be sure to re-examine in class examples for programs that use lists of lists as well as file input/output.
Also, your "generate alternatives" solution may need to use string operations that we learned early in the semester, including concatenation, substring (or slicing), and repetitions. Here are some examples:
s1 = "hello there" s2 = "goodbye" newstr = s1 + s2 # newstr is "hello theregoodbye" newstr = s1[3:10] # newtr is "l0 ther" newstr = s1*2 # newsr is "goodbyegoodbye"
Lastly, strings can be compared using all relational operators learned in class (i.e., <, <=, >, >=, ==, !=). Strings are ordered lexographically, exactly like in the dictonary or phonebook. For example:
name1 = "Ameet" name2 = "Tia" if name1 < name2: print name1 else: print name2outputs the value Ameet since A comes before T in the alphabet. If the first letter of both strings is the same, we compare the second letter in each just as we would in the dictionary. For example, Ameet is greater than Alex. Python continues evaluating position-by-position until one string ends. In that case, the shorter word is less than the longer; for example, Amy is less than Amylia.
Are you looking to try out some extra challenges? Remember: this is not for credit, and should only be attempted after completing the rest of the assignment. First, copy of your solution to a new file:
cp google.py google_ext.py
Next, add some new functions for generating alternatives. One possibility is to search for prefix substrings. That is, words that start with the same letters as the search term, but with suffixes on them. For example, a search for spel should return spelling since it starts with the entire word spel.
Another alternate is to play the fun game of anagrams. An anagram is a word that results from the rearranging of letters of another word. For a good example (and a nerdy joke), Google the word anagram. Check out the Did you mean suggestion. In your program, a search for mary would return army (same letters, different order).
As a last suggestion, use wildcard searches. For example, if the user types sp*ing, you should return all words that start with sp and end in ing. Any combination of letters can occur in between (such as with spelling or spending). This is similar to the prefix-substring suggestion, but more difficult. As another example, if the user types *ful, you should return all words that end in the letters ful. That's a lot of possibilities!
Since none of these are really "Did you mean" algorithms, you may want to print the list differently. Rather than asking one at a time, sort the list of alternative suggestions and output 5 at a time until the user says they found the word or you have no more suggestions.
Once you are satisfied with your program, hand it in by typing handin21 in a terminal window. Don't forget to hand in your worksheet problems in class.