Lab 05

Due 11:59am Monday October 22, 2018

Note about a bug fix

IMPORTANT NOTE: The output in interaction.py is no longer valid. Please see the bottom of the webpage for what I hope is the correct output!

Introduction

In this week’s lab, you will build on last week’s edit distance finding code to implement a spell-checker that a) generates suggested spelling corrections and b) automatically fixes spelling errors.

Answers to written questions should be added to a file called Writeup.md in your repository.

EditDistanceFinder

This week’s starter code includes an EditDistance.py file that is the same as the one you wrote last week, but with a couple of additions:

Questions

  1. In Writeup.md, explain how Laplace smoothing works in general and how it is implemented in the EditDistance.py file. Why is Laplace smoothing needed to make the prob method work? In other words, why wouldn't the prob method work properly without smoothing?
  2. Describe the command-line interface for EditDistance.py. What command should you run to generate a model from /data/spelling/wikipedia_misspellings.txt and save it to ed.pkl?
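As background for question 1, here is a minimal sketch of add-alpha (Laplace) smoothing in general — not the lab's exact implementation, and the function name and alpha value are made up for illustration. The idea is that every possible outcome gets `alpha` pseudo-counts, so an outcome never seen in training still receives nonzero probability:

```python
from collections import Counter

def laplace_prob(counts, item, vocab_size, alpha=1.0):
    # Add-alpha smoothing: every item in the vocabulary gets alpha
    # pseudo-counts, so unseen items still have nonzero probability.
    total = sum(counts.values())
    return (counts[item] + alpha) / (total + alpha * vocab_size)

counts = Counter({"a": 3, "b": 1})
print(laplace_prob(counts, "a", vocab_size=3))  # (3+1)/(4+3)
print(laplace_prob(counts, "c", vocab_size=3))  # unseen, but still (0+1)/(4+3) > 0
```

Without the `alpha` terms, an unseen item would get probability 0, and taking its log probability (as a prob-style method typically does) would fail.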

LanguageModel

This lab’s starter code also includes a file called LanguageModel.py that defines an n-gram language model. Read through the code for the LanguageModel class, then answer the following questions:

  1. What n-gram orders are supported by the given LanguageModel class?
  2. How does the given LanguageModel class deal with the problem of 0-counts?
  3. What behavior does the `__contains__()` method of the LanguageModel class provide?
  4. Spacy uses a lot of memory if it tries to load a very large document. To avoid that problem, LanguageModel limits the amount of text that’s processed at once with the get_chunks method. Explain how that method works.
  5. Describe the command-line interface for LanguageModel.py. What command should you run to generate a model from /data/gutenberg/*.txt and save it to lm.pkl if you want an alpha value of 0.1 and a vocabulary size of 40000?
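For question 4, the starter code has the authoritative implementation; as a rough illustration of the general idea, a chunking generator can yield fixed-size slices of a long string so that only one piece is ever processed (e.g. by spacy) at a time. The chunk size and slicing strategy below are invented for the sketch — the real get_chunks may well split differently:

```python
def get_chunks(text, chunk_size=100000):
    # Yield successive fixed-size slices of a long string so that only
    # one slice needs to be held and processed at a time.
    for start in range(0, len(text), chunk_size):
        yield text[start:start + chunk_size]

print(list(get_chunks("abcdefghij", chunk_size=4)))  # ['abcd', 'efgh', 'ij']
```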

The language model takes a while to train: on the order of 20 minutes, depending on what machine you use. You may want to start training the LanguageModel in another window before you continue reading the lab writeup.

Required Part (Everyone Does the Same Thing)

Your job for this week will be to write a SpellChecker that uses the EditDistanceFinder class as the error (channel) model and the provided LanguageModel as the language model to implement spelling correction.
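This combination is a noisy-channel model: for an observed misspelling w, each candidate correction c is scored by log P(w | c) + log P(c), where the channel model supplies the first term and the language model the second. A toy sketch with made-up stand-in scores (the real terms would come from EditDistanceFinder and LanguageModel):

```python
# Hypothetical stand-ins for the lab's models; the numbers are invented.
def channel_logprob(observed, intended):
    # toy channel model: heavily prefer identical strings
    return 0.0 if observed == intended else -5.0

def lm_logprob(word):
    # toy unigram language model
    priors = {"by": -3.0, "be": -2.5, "yb": -20.0}
    return priors.get(word, -15.0)

def best_correction(observed, candidates):
    # argmax over log P(observed | c) + log P(c)
    return max(candidates, key=lambda c: channel_logprob(observed, c) + lm_logprob(c))

print(best_correction("yb", ["by", "be", "yb"]))  # 'be' under these toy numbers
```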

You will be using spacy again in this lab, making use of the built-in part-of-speech tagger and parser. To initialize spacy for the lab, use the line below. You will probably want the nlp variable to be an instance variable in your class.

nlp = spacy.load("en", pipeline=["tagger", "parser"])

Your class should have the following member functions:

Hints and additional information about some of these functions follow:

Sample Interaction

The file interaction.py gives a sample interaction with the SpellChecker class. If you call interaction.py from the command line with the language and edit distance models you created above, it should use them to check (and optionally autocorrect) sentences.

Evaluation

In /data/spelling/ there are two files:

For a variety of reasons, labeled corpora of spelling errors are hard to come by. You can perform a noisy evaluation of your system by comparing it to the ispell output.

The file autocorrect.py uses your spell checker, language model, and edit distance class to auto-correct every sentence that is passed to it. Use your SpellChecker to autocorrect the reddit_comments.txt file, then use the diff tool to compare your output with the ispell output. Based on a hand analysis of a reasonable subset of the differences, answer the following questions:

  1. How often did your spell checker do a better job of correcting than ispell? Conversely, how often did ispell do a better job than your spell checker?
  2. Can you characterize the types of errors your spell checker tended to do best at, and the types of errors ispell tended to do best at?
  3. Comment on anything else you notice that is interesting about spell checking – either for your model or for ispell.
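When tallying the differences, Python's difflib can produce the same unified-diff view as the command-line tool, which may be convenient if you want to post-process the comparison in code. The example lines below are invented:

```python
import difflib

# Hypothetical example lines: ispell's corrections vs. your spell checker's.
ispell_out = ["they did not by any means", "the cat sat on the mat"]
mine_out = ["they did not be any means", "the cat sat on the mat"]

# unified_diff yields header lines, then -/+/space-prefixed body lines
diff_lines = list(difflib.unified_diff(ispell_out, mine_out,
                                       fromfile="ispell", tofile="mine",
                                       lineterm=""))
print("\n".join(diff_lines))
```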

Optional Part (Pick One or More)

Once you have your spell checker working to correct non-words, you should add at least one of the following:

Phonetic Suggestions

Expand your generate_candidates to also suggest words whose pronunciation is within an edit distance of self.max_distance of each error word. Your solution should use the metaphone code that is included with the lab. In Writeup.md, you should:

  1. Describe your approach
  2. Give examples of how your approach works, including specific sentences where your new model gives a different (hopefully better!) result than the baseline model.
  3. Discuss any challenges you ran into, design decisions you made, etc.
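One possible shape for the phonetic candidate generation, assuming the lab's metaphone code exposes a word-to-phonetic-key function. The `toy_phonetic_key` below is a crude invented stand-in (it just drops non-initial vowels), not real metaphone, and `edit_distance` is a plain Levenshtein sketch rather than your EditDistanceFinder:

```python
def edit_distance(a, b):
    # Standard Levenshtein distance via dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def toy_phonetic_key(word):
    # Crude stand-in for the lab's metaphone code: keep the first
    # letter, drop later vowels, uppercase the result.
    head, rest = word[0], word[1:]
    return (head + "".join(ch for ch in rest if ch not in "aeiou")).upper()

def phonetic_candidates(error_word, vocabulary, max_distance=1):
    # Suggest vocabulary words whose phonetic key is within
    # max_distance edits of the error word's phonetic key.
    key = toy_phonetic_key(error_word)
    return [w for w in vocabulary
            if edit_distance(toy_phonetic_key(w), key) <= max_distance]

print(phonetic_candidates("fone", ["fine", "bone", "elephant"]))  # ['fine', 'bone']
```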

Real-Word Correction

Add a new member function to your SpellChecker class called check_words() that generates suggested corrections for real-word spelling errors. Your check_spelling() function should call check_words after check_sentence_words, so that functions like autocorrect_sentence and suggest_sentence operate on the combination of the two.

You should feel free to use the simplifying assumption that a sentence contains at most one real-word spelling error if that makes your task easier.
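Under that one-error assumption, check_words can try every single-word swap and keep the variant the language model likes best. The confusion sets and scores below are toy stand-ins invented for the sketch — in the lab, generate_candidates and the LanguageModel would play these roles:

```python
# Toy confusion sets and unigram log-scores standing in for
# generate_candidates and the LanguageModel (numbers are invented).
CONFUSIONS = {"their": ["there"], "there": ["their"], "two": ["too", "to"]}
SCORES = {"there": -2.0, "their": -4.0, "are": -1.0, "two": -3.0, "cats": -5.0}

def sentence_score(words):
    return sum(SCORES.get(w, -10.0) for w in words)

def check_words(words):
    # At most one real-word error: try each single-word swap from the
    # confusion set and return the highest-scoring variant sentence.
    best = list(words)
    for i, w in enumerate(words):
        for alt in CONFUSIONS.get(w, []):
            trial = words[:i] + [alt] + words[i + 1:]
            if sentence_score(trial) > sentence_score(best):
                best = trial
    return best

print(check_words(["their", "are", "two", "cats"]))  # ['there', 'are', 'two', 'cats']
```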

In Writeup.md, you should:

  1. Describe your approach
  2. Give examples of how your approach works, including specific sentences where your new model gives a different (hopefully better!) result than the baseline model.
  3. Discuss any challenges you ran into, design decisions you made, etc.

Transpositions

Extend your model to handle character transpositions, in which two adjacent characters are swapped, resulting in spelling errors like “teh.”
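Generating the transposition neighbors of a word is straightforward; a minimal sketch (how you then score them against your channel and language models is up to your design):

```python
def transposition_candidates(word):
    # All strings formed by swapping one adjacent character pair,
    # i.e. the Damerau-style transposition neighbors of `word`.
    swaps = []
    for i in range(len(word) - 1):
        swapped = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        if swapped != word:  # skip no-ops from doubled letters
            swaps.append(swapped)
    return swaps

print(transposition_candidates("teh"))  # ['eth', 'the']
```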

In Writeup.md, you should:

  1. Describe your approach
  2. Give examples of how your approach works, including specific sentences where your new model gives a different (hopefully better!) result than the baseline model.
  3. Discuss any challenges you ran into, design decisions you made, etc.

Other Extensions

With instructor approval, you are encouraged to come up with other ways to expand your spell checker. Some ideas:

Some good places to start looking for relevant research:

In Writeup.md, you should:

  1. Describe your approach
  2. Give examples of how your approach works, including specific sentences where your new model gives a different (hopefully better!) result than the baseline model.
  3. Discuss any challenges you ran into, design decisions you made, etc.

Bug fix

The original interaction.py file contained incorrect sample output. Below is the correct sample output.

>>> print(s.channel_model.prob("hello", "hello"))
-0.6520393913851943
>>> print(s.channel_model.prob("hellp", "hello"))
-10.655417526736118
>>> print(s.channel_model.prob("hllp", "hello"))
-12.889127454847866
>>> print(s.check_text("they did not yb any menas"))
[[['they'], ['did'], ['not'], ['be', 'by', 'my', 'you', 'i', 'in', 'ye', 'b', 'y', 'rib', 'yet', 
'ay', 'if', 'job', 'ob', 'yo', 'jib', 'of', 'ly', 'on', 'ab', 'o', 'rob', 'orb', 'jub', 'it', 
'ty', 'bo', 'is', 'a', 'yea', 'mob', 'cab', 'web', 'sob', 'to', 'up', 'yon', 'yew', 'yes', 'cob', 
'an', 'obi', 'ebb', 'nob', 'do', 'iv', 'alb', 'bab', 'eye', 'tob', 'yaw', 'v', 'abi', 'mab', 'at',
'he', 'go', 'as', 'x', 'rub', 'gob', 'lye', 'sub', 'or', 'ix', 'aye', 'd', 'lbs', 'cub', 'pub', 
'tub', 'z', 'so', 'dab', 'bob', 'we', 'l', 'dye', 'k', 'pmb', 'n', 'xv', 'ho', 'hye', 'il', 'yer',
'wo', 'yee', 'ex', 'bye', 'yis', 'vp', 'ox', 'rye', 'oh', 'w', 'io', 'en', 'm', 'ed', 'h', 'me',
'am', 'xx', 'el', 'us', 'no', 'fye', 'eh', 't', 'qu', 'ii', 'r', 'e', 'c', 'ah', 'ha', 's', 'lo',
'al', 'uz', 'em', 'ad', 'ao', 'ow', 'og', 'vs', 'er', 'ir', 'et', 'mr', 'un', 'hm', 'th', 'ji',
'ai', 'xi', 'je', 'hi', 'ze', 'co', 'wm', 'ee', 'au', 'ou', 'ar', 'ca', 'um', 'ro', 'vi', 'de',
'dr', 'fa', 'va', 'sh', 'la', 'nt', 'tm', 'ma', 'gr', 'ur', 'di', 're', 'st', 'tu', 'da', 'ms', 
'le', 'pi', 'si', 'se'], ['any'], ['men', 'means', 'mens', 'meals', 'mes', 'mans', 'meanes', 
'meats', 'meat', 'menials', 'omens', 'mean', 'mene', 'mines', 'enos', 'menace', 'mend', 'meads', 
'zenas', 'kenaz', 'menan', 'seas', 'ment', 'jonas', 'mess', 'mead', 'medes', 'medals', 'enan', 
'monks', 'minus', 'ends', 'mews', 'fens', 'minds', 'dens', 'meal', 'midas', 'eras', 'amends', 
'pens', 'hena', 'hens', 'tens', 'vedas', 'meres', 'mental', 'lens', 'peas', 'lena', 'meah', 
'medad', 'venus', 'arenas', 'aeneas', 'metals', 'enam', 'medan', 'demas', 'teas', 'zenan', 
'kenan', 'meets', 'sends', 'merab', 'texas', 'tents', 'bends', 'melts', 'metal', 'tends', 'penal', 
'dents', 'lends', 'cents', 'rents', 'annas']]]
>>> print(s.autocorrect_line("they did not yb any menas"))
[['they'], ['did'], ['not'], ['be'], ['any'], ['men']]
>>> print(s.suggest_text("they did not yb any menas", max_suggestions=2))
[['they'], ['did'], ['not'], ['be', 'by'], ['any'], ['men', 'means']]

In addition, you may find this to be helpful:

>>> text = """This should take a list of words as input and return a list of lists. 
	Each sublist in the return value corresponds to a single word in the input 
	sentence. Words in the sentence that are in the language model will be represented
	as a sublist containing just that word. Words in the sentence that are not in the
	language model will be represented as a sublist of possible corrections. This sublist
	of possible corrections should be, for each word in the sentence not in the language
	model, the result of calling generate_candidates with each of the candidates in the
	list and then sorting these candidates by the combination of LanguageModel score and
	EditDistance score. If no candidates are found and fallback is True, then non-words
	should be represented by a sublist with just the original word (the same 
	representation as correctly-spelled words).""".lower()
>>> result = sp.autocorrect_line(text)
>>> print(' '.join([x[0] for x in result]))
this should take a list of words as put and return a list of lists . each subtlest in the 
return value corresponds to a single word in the put sentence . words in the sentence that 
are in the language model will be represented as a subtlest containing just that word . 
words in the sentence that are not in the language model will be represented as a subtlest 
of possible corrections . this subtlest of possible corrections should be , for each word 
in the sentence not in the language model , the result of calling generate_candidates with 
each of the candidates in the list and then sorting these candidates by the combination 
of languagemodel score and editdistance score . if no candidates are found and fallacy is 
true , then non - words should be represented by a subtlest with just the original word 
( the same representation as correctly - spilled words ) .