Lab 02

Due 11:59am Monday September 17, 2018

Overview

This lab has three starter files: segmenter.py, tokenizer.py, and evaluate.py.

Answers to written questions should be included in your repository in a file called Writeup.md. You will also create (and should add to your repository) a file called ngrams.py during this assignment.

Docker and lxml

To process XML files, we’ll work with the lxml library this week. We’ve updated the Docker image to have everything you need, but you’ll need to run docker pull jmedero/nlp:fall2018 to get the most recent version on your own machine.

Spacy

This week, we’ll get to use the spacy python library for the first time. Spacy is designed to make it easy to use pre-trained models to analyze very large sets of data. At the beginning of the semester, we’ll be using spacy as a skeleton for building our own NLP algorithms. Later in the semester, you’ll get a chance to use more of its built-in functionality to build larger systems.

Sentence Segmentation

In the first part of the lab, you will write a simple sentence segmenter.

The /data/brown directory includes three text files taken from the Brown Corpus:

The files do not indicate where one sentence ends and the next begins. In the data set you are working with, sentences can only end with one of five characters: period, colon, semi-colon, exclamation point, and question mark.

However, there is a catch: not every period represents the end of a sentence, since many abbreviations (U.S.A., Dr., Mon., etc., etc.) can appear in the middle of a sentence, where their periods do not mark sentence ends. The text also has many examples where a colon is not the end of a sentence. The other three punctuation marks nearly unambiguously mark the ends of sentences. Yes, even semi-colons.

For each of the above files, I have also provided a file giving the line numbers (counting from 0) of the actual ends of sentences:

Your job is to write a sentence segmenter, and to add that segmenter to spacy’s processing pipeline.

Part 1a

The given segmenter.py has some starter code, but it can’t be run from the command line. We want it to be executable, though, and when it’s called from the command line, it should take one required argument and one optional argument:

$ python3 ./segmenter.py --help
usage: segmenter.py [-h] --textfile FILE [--hypothesis_file FILE]

Predict sentence segmentation for a file.

optional arguments:
  -h, --help            show this help message and exit
  --textfile FILE, -t FILE
                        Path to the unlabeled text file.
  --hypothesis_file FILE, -y FILE
                        Write hypothesized boundaries to FILE (default stdout)

Write the command-line interface for segmenter.py using python’s argparse module. In addition to the module documentation, you may also find the argparse tutorial useful.

Both arguments to segmenter.py should be FileType arguments.

As in Lab 01, all print statements should be in your main() function, which should only be called if segmenter.py is run from the command line.
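One possible shape for this interface is sketched below; it assumes the argument names shown in the usage message above and is meant as a starting point rather than the required implementation:

import argparse
import sys


def parse_args():
    parser = argparse.ArgumentParser(
        description='Predict sentence segmentation for a file.')
    parser.add_argument('--textfile', '-t', metavar='FILE', required=True,
                        type=argparse.FileType('r'),
                        help='Path to the unlabeled text file.')
    parser.add_argument('--hypothesis_file', '-y', metavar='FILE',
                        type=argparse.FileType('w'), default=sys.stdout,
                        help='Write hypothesized boundaries to FILE (default stdout)')
    return parser.parse_args()


def main():
    args = parse_args()
    # all printing and file writing happens here


if __name__ == '__main__':
    main()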

Part 1b

This week’s starter code includes a file called tokenizer.py that defines two things:

In the segmenter.py file, there’s one function called create_doc that takes a readable file pointer (the type of the textfile argument to segmenter.py) and returns a spacy document.

Stop now and make sure that the MyTokenizer class and the create_doc function make sense to you. They both use python and/or spacy components that are likely new to you; some places you can read to better understand how they work include:

Questions

  1. Explain the MyTokenizer class in your own words.
  2. Explain the create_doc function in your own words.
  3. How many tokens are in the file /data/brown/editorial.txt? (Hint: You can get the number of tokens in a spacy Doc by calling len() on the Doc object.)

Part 1c

Next, write a function called baseline_segmenter that takes a spacy Doc as its only argument. We’ll add this function to our spacy pipeline after tokenization, so you can assume that the Doc you get is word tokenized.

Your function should iterate through all of the tokens in the Doc (for token in doc:) and predict which ones are the ends of sentences. Instead of keeping track of the ends of sentences, though, spacy keeps track of the beginning of sentences. In particular, the first token in each sentence in a spacy Doc has its is_sent_start attribute set to True.

For every token that you predict corresponds to the end of a sentence, you should set the is_sent_start attribute to True for the next token.

Remember that every sentence in our data set ends with one of the five tokens ['.', ':', ';', '!', '?']. Since it’s a baseline approach, baseline_segmenter should predict that every instance of one of these characters is the end of a sentence. You can access the text content of a Token in spacy through its .text attribute:

>>> my_token = doc[0]
>>> type(my_token)
<class 'spacy.tokens.token.Token'>
>>> my_token.text
'The'
>>> type(my_token.text)
<class 'str'>
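Putting those pieces together, a minimal sketch of the baseline logic described above might look like this (one way among several to write it):

def baseline_segmenter(doc):
    # Predict that every candidate punctuation token ends a sentence by
    # marking the token that follows it as the start of a new sentence.
    for i, token in enumerate(doc):
        if token.text in ('.', ':', ';', '!', '?') and i + 1 < len(doc):
            doc[i + 1].is_sent_start = True
    return doc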

Next, add your baseline_segmenter function to the pipeline of tools that will be called on every Doc that is created with your create_doc function. To do that, you’ll want to look at the add_pipe function.
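With the spacy 2.x API used in this course, registering the component inside create_doc might look roughly like the line below (the variable name nlp is an assumption about how the starter code names its Language object):

nlp.add_pipe(baseline_segmenter)   # the component is handed each Doc and must return it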

Finally, update your main() function to write out the token numbers corresponding to the predicted sentence boundaries to hypothesis_file. Be sure to write out the last token number as a sentence boundary: since spacy keeps track of the starts of sentences rather than their ends, the end of the final sentence is never explicitly marked. You can access a list of the sentences in a spacy Doc with its sents attribute:

>>> doc = nlp("The cat in the hat came back, wrecked a lot of havoc on the way.")
>>> print(len(list(doc.sents)))
1
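One way to write the hypothesized boundaries is sketched here, with illustrative names that are not taken from the starter code; each sentence's final token number goes on its own line, which also covers the last token of the document:

def write_boundaries(doc, hypothesis_file):
    for sent in doc.sents:
        # sent.end is one past the sentence-final token, so sent.end - 1
        # is the token number of the predicted boundary.
        hypothesis_file.write('{}\n'.format(sent.end - 1))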

Confirm that when you run your baseline_segmenter on the file /data/brown/editorial.txt, it predicts 3278 sentence boundaries.

Part 1d

To evaluate your system, I am providing you a program called evaluate.py that compares your hypothesized sentence boundaries with the ground truth boundaries. This program will report to you the true positives, true negatives, false positives and false negatives (as well as precision, recall and F-measure, which we haven’t talked about in class just yet). You can run evaluate.py with the -h option to see all of the command-line options that it supports.

A sample run with the output of your baseline segmenter from above stored in editorial.hyp would be:

python3 evaluate.py -d /data/brown/ -c editorial -y editorial.hyp

Run the evaluation on your baseline system’s output for the editorial category, and confirm that you get the following before moving on:

TP:    2719	FN:       0
FP:     559	TN:   60055

PRECISION: 82.95%	RECALL: 100.00%	F: 90.68%

Part 1e

Now it’s time to improve the baseline sentence segmenter. We don’t have any false negatives (since we’re predicting that every instance of the possibly-end-of-sentence punctuation marks is, in fact, the end of a sentence), but we have quite a few false positives.

Make a copy of your baseline_segmenter function called my_best_segmenter. Change your create_doc function to call your new segmenter instead of the baseline one.

You can see the type of tokens that your system is mis-characterizing by setting the verbosity of evaluate.py to something greater than 0. Setting it to 1 will print out all of the false positives and false negatives so you can work to improve your my_best_segmenter function.
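For example, a common source of false positives is a period that follows an abbreviation. One sketch of that kind of check appears below; the abbreviation set is made up, and exactly how abbreviations are split into tokens depends on MyTokenizer, so treat this only as a starting point:

ABBREVIATIONS = {'Dr', 'Mr', 'Mrs', 'Mon', 'etc'}   # illustrative only


def my_best_segmenter(doc):
    for i, token in enumerate(doc):
        if i + 1 >= len(doc):
            break
        if token.text in (':', ';', '!', '?'):
            doc[i + 1].is_sent_start = True
        elif token.text == '.':
            prev = doc[i - 1].text if i > 0 else ''
            if prev not in ABBREVIATIONS:   # skip periods right after abbreviations
                doc[i + 1].is_sent_start = True
    return doc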

To test your segmenter, I will run it on a hidden text that you haven’t seen. It probably isn’t worth spending a lot of time fixing obscure cases that show up in the three texts provided, since those cases may never appear in the hidden text. It is important, though, to handle cases that occur multiple times, and even some cases that appear only once if you suspect they could show up in the hidden text. Write your code to be as general as possible so that it works well on the hidden text without introducing too many false positives.

NGrams

In class, we talked about the problem of 0’s in language modeling. If you were to train a unigram language model on the editorial category of the Brown corpus and then try to calculate the probability of generating the adventure category, you’d end up with a probability of 0, because there are words that occur in the adventure category that don’t appear in the editorial category (e.g. “badge”).
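To see why a single unseen word is fatal, consider a toy calculation (the probabilities below are made up, not estimated from the corpus):

unigram_prob = {'the': 0.06, 'sheriff': 0.001, 'wore': 0.0005, 'a': 0.04}   # no 'badge'

prob = 1.0
for word in ['the', 'sheriff', 'wore', 'a', 'badge']:
    prob *= unigram_prob.get(word, 0.0)   # any unseen word contributes a factor of 0

print(prob)   # 0.0: one zero factor wipes out the whole product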

In this part of the assignment, we’ll explore that problem in more depth. To do that, we’ll start working with the dataset that you’ll use for your final project.

You should put your code for this part in a file called ngrams.py.

Part 2a: Extracting data from XML files

A sample of the dataset we’ll use for your final project is in /data/semeval. There’s a single XML file that contains all of the articles we’ll look at this week. Each article has a label that gives its bias: “left”, “right”, or “least.”

The data file you’ll use for your final project is big (the most recent release is around 3.6G), so it’s best not to store the whole thing in memory at once if we can help it. Fortunately, the lxml library gives us a way to iteratively parse through an xml file, dealing with one node at a time. Here’s sample code that opens a file called myfile.xml and calls a function called my_func on every article node:

from lxml import etree

fp = open("myfile.xml", "rb")
for event, element in etree.iterparse(fp, events=("end",)):
    if element.tag == "article":
        my_func(element)
        element.clear()

Take a minute now to look at the contents of some of the articles in /data/semeval and skim the documentation for lxml. Then, write a function called get_xml_contents() that takes a file object and a spacy language object and returns a dictionary. The dictionary’s keys should be labels (that is, “left,” “right,” and “least”) and its values should be lists of spacy Documents, each containing the plain text contents of a single article. Hint: You might want to take a look at the itertext() method of etree Elements.

You should use your MyTokenizer from Part 1. We’ll ignore sentence boundaries for this part, so you don’t need to add your my_best_segmenter.
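A rough sketch of the shape of this function is given below. The attribute name used to read the bias label is a guess, so check the actual element and attribute names in the XML under /data/semeval before relying on it:

from collections import defaultdict
from lxml import etree


def get_xml_contents(xml_fp, nlp):
    docs_by_label = defaultdict(list)
    for event, element in etree.iterparse(xml_fp, events=('end',)):
        if element.tag == 'article':
            label = element.get('bias')           # attribute name is an assumption
            text = ''.join(element.itertext())    # plain text content of the article
            docs_by_label[label].append(nlp(text))
            element.clear()
    return dict(docs_by_label)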

Questions

  1. When you looked at the content of the articles, you probably noticed a lot of question marks. What’s your hypothesis about how those ended up in the file? In other words, what did the Task organizers not do correctly in preparing this data release? (The data release includes a note about the problem, and it should be fixed by the time we get to our final project. Alas, we will have to accept it the way it is for this week.)
  2. What percentage of the tokens that appear in the ‘left’ articles don’t appear in the ‘least’ articles? (One way to compute these percentages is sketched below, after the questions.)
  3. What percentage of tokens that appear in the ‘right’ articles don’t appear in the ‘least’ articles?
  4. What if you look at types instead of tokens?
  5. Are you surprised by these results? Why or why not?
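For questions 2 through 4, one possible way to compute the percentages is sketched below; it assumes docs_by_label is the dictionary returned by get_xml_contents, and the function name is only a suggestion:

def missing_percentage(test_docs, train_docs):
    # Token view counts every occurrence; type view uses the set of distinct words.
    train_types = {token.text for doc in train_docs for token in doc}
    test_tokens = [token.text for doc in test_docs for token in doc]
    test_types = set(test_tokens)
    missing_tokens = sum(1 for t in test_tokens if t not in train_types)
    missing_types = len(test_types - train_types)
    return (100.0 * missing_tokens / len(test_tokens),
            100.0 * missing_types / len(test_types))

# e.g. missing_percentage(docs_by_label['left'], docs_by_label['least'])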

Part 2c: Bigram analysis

What happens when you move to higher order n-gram models like bigrams and trigrams?

Write a function that takes as input a spacy Document, and returns a list of all of the bigrams in the document.

Write a function that takes as input a spacy Document, and returns a list of all of the trigrams in the document.
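Both functions can be built on one general helper; a minimal sketch (the function names here are suggestions, not requirements):

def get_ngrams(doc, n):
    # Return every contiguous run of n tokens as a tuple of their texts.
    tokens = [token.text for token in doc]
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def get_bigrams(doc):
    return get_ngrams(doc, 2)


def get_trigrams(doc):
    return get_ngrams(doc, 3)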

Questions

  1. What percentage of the bigrams (tokens, not types) that appear in the ‘left’ articles don’t appear in the ‘least’ articles? What percentage of the bigrams that appear in the ‘right’ articles don’t appear in the ‘least’ articles? Are you surprised by these results? Why or why not?

  2. What percentage of the trigrams (tokens, not types) that appear in the ‘left’ articles don’t appear in the ‘least’ articles? What percentage of the trigrams that appear in the ‘right’ articles don’t appear in the ‘least’ articles? Are you surprised by these results? Why or why not?

Part 2d: Collecting statistics on pairs of categories

Instead of collecting statistics on one set of articles (“training”) and seeing how closely these statistics match another set of articles (“testing”), what if you trained your language model on two of the sets (e.g. ‘left’ and ‘least’) and tested it on the third category (e.g. ‘right’)?

Questions

  1. Does that change your results? Why? Try each of the three combinations. Which worked best? Why? Your Writeup.md file should include a table of your results, similar to the table below. Report percentages, not raw counts.
Train         Test   Missing Tokens  Missing Types  Missing Bigrams  Missing Trigrams
Left, Right   Least
Least, Right  Left
Left, Least   Right

Part 2e: Train on equal-sized “chunks”

Instead of training on one set of articles and testing on another, suppose we break each of the sets into 4 equally-sized chunks and then combine chunks across all three sets. For simplicity, let’s call the first chunk of each category “chunk A”, the second “chunk B”, etc. That means that “chunk A” will contain the first 25% of ‘left’ articles and the first 25% of ‘right’ articles and the first 25% of ‘least’ articles. “Chunk B” contains the next 25% of each, etc.:

all text = left + least + right
chunk A = 1st 25% of left + 1st 25% of least + 1st 25% of right
chunk B = 2nd 25% of left + 2nd 25% of least + 2nd 25% of right
chunk C = 3rd 25% of left + 3rd 25% of least + 3rd 25% of right
chunk D = 4th 25% of left + 4th 25% of least + 4th 25% of right

Now, combine chunks A, B and C and use that as your training data. Use chunk D as your test data.
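One way to build the chunks, assuming docs_by_label maps each category to its list of spacy Documents (the names and the handling of leftover articles are illustrative):

def make_chunks(docs_by_label, n_chunks=4):
    chunks = [[] for _ in range(n_chunks)]
    for label, docs in docs_by_label.items():
        size = len(docs) // n_chunks
        for i in range(n_chunks):
            start = i * size
            end = (i + 1) * size if i < n_chunks - 1 else len(docs)
            chunks[i].extend(docs[start:end])
    return chunks


chunk_a, chunk_b, chunk_c, chunk_d = make_chunks(docs_by_label)
train_docs = chunk_a + chunk_b + chunk_c   # chunks A, B and C
test_docs = chunk_d                        # chunk D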

Questions

  1. Try training on A,B,C and testing on D; then train on A, B, D and test on C; then train on A, C, D and test on B; and finally train on B, C, D and test on A. How are your results different from the previous question? Why? Your Writeup.md file should include a table of your results, similar to the table below. Report percentages, not raw counts. Evaluating your system in this way is called cross-validation. In this case, since you are breaking your data into 4 distinct test sets, you are performing 4-fold cross-validation.
Train    Test   Missing Tokens  Missing Types  Missing Bigrams  Missing Trigrams
A, B, C  D
A, B, D  C
A, C, D  B
B, C, D  A