CS21 Lab 10

Google's Did you mean:Part 2

Due by 11:59 pm, Tuesday, Nov 15

Note: As part of this week's lab, also complete these worksheet problems, due Tuesday in class.

For Lab 10, you will be completing the rest of the features for our approximation of Google's "Did you mean?" algorithm. You will be using your solution from last week, google.py, as your starting point for this lab. Be sure to reread Lab 9 for background and a general overview. While we implemented some functionality last week, we were nowhere near challenging Google's search dominance. For this lab, you will implement three other functions for generating alternative spellings (see Generating Alternatives, below). In addition, you will use sorting to more intelligently suggest alternate spellings to the user (see Suggesting Alternatives, blow). Briefly, you will answer the question "Did you mean?" by giving priority to commonly occuring words in the English language (i.e., words with high frequency). The complete program requirements and a sample run of the program are found towards the bottom of this page.

IMPORTANT: To begin your lab, you must do the following:

$ cd
$ update21
$ cd cs21/labs/10
$ cp ../09/google.py .
Your Lab 10 solution should be completed only in your lab/10 directory. handin21 will only submit files from this directory.


Generating Alternatives

There are four different methods you need to use to create alternative spellings: deletion, transposition, substitution, and insertion. Each is explained below. We'll use the word speling as our running example. This word is not in the word-frequency list, so we need to propose alternatives.

Deletions was implemented for Lab 9. For this lab, you should implement the other three functions: insertions, substitutions, and transpositions.

Deletions

Already implemented for Lab 9. See Lab 9 for a detailed description.

Transpositions

Form alternatives of the word by using transpositions (swapping of adjacent letters within the word). Using our example, speling, we would generate: pseling (transposed the first 2 letters), sepling (transposed the second and third letters), spleing, speilng, spelnig, and spelign. For a word of length n, there are n-1 alternatives we can form by transposing a pair of neighboring characters; here, the word was 7 letters long and there are 6 alternatives we can form using transposition.

Substitutions

Form alternatives of the word by generating all possible single-letter substitutions. For each character in the word, replace the character by substituting a letter from the alphabet (abcd...xyz). For speling, first try every possible replacement for the first character, "s"; e.g., ateling, bteling, cteling, etc. Then try all substitutions for the second character of the original word; e.g., saeling, sbeling, sceling. There are too many possible substitutions to list them all, but this table should demonstrate the general pattern for the word speling:

apelingsaelingspalingspeaingspelangspeliagspelina
bpelingsbelingspblingspebingspelbngspelibgspelinb
cpelingscelingspclingspecingspelcngspelicgspelinc
.....................
ypelingsyelingspylingspeyingspelyngspeliygspeliny
zpelingszelingspzlingspezingspelzngspelizgspelinz

For a word of length n, there are 26 * n alternatives we can form by substituting characters; here, the word was 7 letters long and there are 182 alternatives we can form using substitution.

Insertions

Form alternatives of the word by inserting single-letters in the word. For each letter in the alphabet, we insert that letter into our word in all possible locations. For example, in our example, try inserting the letter "a" in each position: aspeling, sapeling, spaeling, spealing, spelaing, speliang, spelinag, and spelinga. We then do the same for the letter "b", and so on. Notice that there are 8 possible locations for the inserted letter "a" in the the 7-letter word speling. There are too many possible insertions to list them all. Using our example, speling, we would generate:

aspelingsapelingspaelingspealingspelaingspeliangspelinagspelinga
bspelingsbpelingspbelingspeblingspelbingspelibngspelinbgspelingb
cspelingscpelingspcelingspeclingspelcingspelicngspelincgspelingc
........................
yspelingsypelingspyelingspeylingspelyingspeliyngspelinygspelingy
zspelingszpelingspzelingspezlingspelzingspelizngspelinzgspelingz

For a word of length n, there are 26 * (n+1) alternatives we can form by inserting characters; here, the word was 7 letters long and there are 208 alternatives we can form using insertion.

Hint: all of these methods will involve loops, some nested. In python, slicing and indexing can be very useful. And don't forget to use the python shell for testing!

Hint: while you are generating lots of different alternative spellings, you only need to suggest actual words. That is, you only need to keep track of alternatives that are in the wordcts.txt file.


Suggesting Alternatives

In Lab 9, once all alternative spellings of a word are considered, your program simply printed to the screen all valid alternative spellings and their term frequencies values. For Lab 10, you will overwrite this technique to, instead, suggest alternate spellings that are most likely to be what the user intended to type. Google's algorithm uses something akin to the frequency of use of a word to rank possible alternatives. That is, a word that is seen widely in the English language (e.g., the) is more likely to be what the user meant than a rare word (e.g., thy). In essence, "Did you mean?" is a popularity contest.

For your program, your job is to take the list of possible alternate spellings for the original search word and suggest the word with the highest term frequency. Next, you should ask the user if the suggestion is correct. If the user types no, you should output the possible alternatives with the second highest frequency. You should then repeat the question, continuing to output the alternate spelling with the next highest frequency until a) there are no more alternate suggestions or b) the user types yes to the suggested alternative. You must implement your own function for sorting the alternate spellings. You cannot use built-in python methods to perform the sort such as list.sort().

For the speling example, only 3 of the alternatives are valid words in the English language: spewing, spieling, and spelling. From these words, we suggest the alternative that is most frequent. If you consult your word frequency list of lists, you should find that spelling is the most frequent:

[...
['spelling', 5373561],...
['spewing', 262886],...
['spieling', 1239],
...]
Therefore, we would choose spelling as the proposed alternative. If that is not correct, we should suggest spewing.

Here is what this looks like:
Enter a word or QUIT to exit: speling
Did you mean spelling (frequency 5373561)? no
Did you mean spewing (frequency 262886)? no
Did you mean spieling (frequency 1239)? no
Not found.

Lab 10 Requirements

The general flow of your algorithm will be very similar to Lab 9. The requirements below build off of Lab 9 requirements, with a tag of New Requirement for requirements unique to Lab 10. You are responsible for all items on this list, however, including making corrections for any errors in Lab 9.

  1. Your program should use a list of lists to store the word-frequency list read in from the file wordcts.txt.
  2. Your program should run until the user enters the word QUIT
  3. Use binary search any time you need to look for a word in your word-frequency list (i.e., your lists of lists).
  4. If the entered search word is in the the word-frequency list, print Found to the screen and do not suggest alternatives.
  5. New Requirement: For misspelt words, generate alternate spellings by follwing the instructions in the Generating Alternatives: Lab 10 section above. Specifically, last week you generated alternatives based on deletions. This week, you will generate more alternative spellings using insertions, transpositions, and substitutions.
  6. Only consider suggesting generated alternate spellings that are actually words.
  7. New Requirement: When your program suggests word alternatives to the user, it should not list a particular word more than one time (i.e., no duplicates allowed in alternate spelling list).
  8. New Requirement: Using the instructions in the Suggesting Alternatives: Lab 10 section above, you are going to present the alternatives to the user in a different way than lab 9. Specifically:
    • You will sort the possible suggestions based on their frequency value. To accomplish this, you must implement your own function to sort the data using selection sort or one of the other sorting algorithms dicussed in class. You may not use Python's list method for sorting (i.e. list.sort())
    • You should suggest the alternate spelling with the highest frequency value first
    • If the user is satisfied with the suggestion (i.e., answers yes), the search is done and you should ask for the next search word.
    • If the user does not agree with the suggested word (i.e., answers no ), you should print the word with the next highest frequency in your alternate list of words
    • If there are no alternate spellings for a word, or none that satisfy the user, print Not Found as you did in Lab 9.
  9. New Requirement: As always, any user input should be validated before being used. When asking the user "Did you mean" you should only except a yes or no response. All other responses should repeat the question.

Sample Run of google.py

Here is what a sample run of your program should like like:

Welcome to "Did You Mean?". This program will search a list of known
words to see if your search query is a valid word.  If not, it will
suggest alternate spellings that are valid words.

Enter a word or QUIT to exit: speling
Did you mean: spelling (frequency: 5373561)? no
Did you mean: spewing (frequency: 262886)? no
Did you mean: spieling (frequency: 1239)? no
Not found

Enter a word or QUIT to exit: abase
Found

Enter a word or QUIT to exit: ebash
Did you mean: bash (frequency: 2781144)? yes

Enter a word or QUIT to exit: quit
Found

Enter a word or QUIT to exit: hed
Did you mean: he (frequency: 595282264)? no
Did you mean: had (frequency: 472458757)? no
Did you mean: her (frequency: 360656669)? no
Did you mean: head (frequency: 71472344)? good
Please enter yes or no: yes

Enter a word or QUIT to exit: asdf
No entries found

Enter a word or QUIT to exit: poole
Did you mean: pool (frequency: 28654230)? no
Did you mean: pole (frequency: 5209605)? yes

Enter a word or QUIT to exit: QUIT

Useful Tips

Remember to do unit testing of your functions. You can either do unit testing in the python interpreter or by adding some test calls to your functions in main() and running it normally. Also, you can add debug print statements to see what your program is doing. For example, to see if your word-frequency list looks right, you could add some print statements to list a few values in the list. Just make sure to remove all debug prints before submitting your solution.

Lists and strings will probably be needed in this program. See our documentation about using strings and lists as objects for guidance on advanced operations for strings and lists.

Be sure to re-examine in class examples for programs that use lists of lists as well as file input/output.

Also, your "generate alternatives" solution may need to use string operations that we learned early in the semester, including concatenation, substring (or slicing), and repetitions. Here are some examples:

s1 = "hello there"
s2 = "goodbye"
newstr = s1 + s2   # newstr is "hello theregoodbye"
newstr = s1[3:10]  # newtr is "l0 ther"
newstr = s1*2      # newsr is "goodbyegoodbye"

Lastly, strings can be compared using all relational operators learned in class (i.e., <, <=, >, >=, ==, !=). Strings are ordered lexographically, exactly like in the dictonary or phonebook. For example:

name1 = "Ameet"
name2 = "Tia"

if name1 < name2:
  print name1
else:
  print name2
outputs the value Ameet since A comes before T in the alphabet. If the first letter of both strings is the same, we compare the second letter in each just as we would in the dictionary. For example, Ameet is greater than Alex. Python continues evaluating position-by-position until one string ends. In that case, the shorter word is less than the longer; for example, Amy is less than Amylia.


Extra Challenges

Are you looking to try out some extra challenges? Remember: this is not for credit, and should only be attempted after completing the rest of the assignment. First, copy of your solution to a new file:

cp google.py google_ext.py

Next, add some new functions for generating alternatives. One possibility is to search for prefix substrings. That is, words that start with the same letters as the search term, but with suffixes on them. For example, a search for spel should return spelling since it starts with the entire word spel.

Another alternate is to play the fun game of anagrams. An anagram is a word that results from the rearranging of letters of another word. For a good example (and a nerdy joke), Google the word anagram. Check out the Did you mean suggestion. In your program, a search for mary would return army (same letters, different order).

As a last suggestion, use wildcard searches. For example, if the user types sp*ing, you should return all words that start with sp and end in ing. Any combination of letters can occur in between (such as with spelling or spending). This is similar to the prefix-substring suggestion, but more difficult. As another example, if the user types *ful, you should return all words that end in the letters ful. That's a lot of possibilities!

Since none of these are really "Did you mean" algorithms, you may want to print the list differently. Rather than asking one at a time, sort the list of alternative suggestions and output 5 at a time until the user says they found the word or you have no more suggestions.

Submit

Once you are satisfied with your program, hand it in by typing handin21 in a terminal window. Don't forget to hand in your worksheet problems in class.