CS21 Lab 8: author detection

Due 11:59pm Fri, Apr 5

Run update21, if you haven't already, to create the cs21/labs/08 directory (and some additional files). Then cd into your cs21/labs/08 directory and create the python programs for lab 8 in this directory.

$ update21
$ cd cs21/labs/08
NOTE: thanks to Michelle Craig at U. Toronto for this lab idea!
Background Info...

For this lab we will read in a book and calculate various linguistic features, such as the average word length or the average number of words per sentence. With enough features, we might be able to see the difference between known authors. Furthermore, given a book by an unknown author, we might be able to figure out who wrote it by comparing its features with those of known authors. In addition to authorship detection, programs like this could be used for plagiarism detection or email filtering.

The five features we will calculate for each text are:

  1. average word length
  2. Type-Token Ratio: number of different words used/total number of words
  3. Hapax Legomana Ratio: number of words occurring exactly once/total number of words
  4. average number of words per sentence
  5. average number of phrases per sentence

Your job will be to read in a book (or something smaller) from a text file and write functions to calculate each feature. Here is a quick example:

$ cat roses.txt
Roses are red.
Violets are blue.
Sugar is sweet,
and so are you!

$ python author.py

text file: roses.txt

     average word length =  3.69230769231
        type token ratio =  0.846153846154
    hapax legomana ratio =  0.769230769231
  average words/sentence =  4.33333333333
average phrases/sentence =  1.33333333333

To do this, we want you to practice incremental development: write a function and then TEST IT to make sure it works before moving on. The following sections will guide you through this process and provide a few hints and tips.

1. average word length, reading in the text

Start with the first feature: average word length. To do this, you will need a list of all the words in the file. If you had a long string like this: "Roses are red. Violets are blue. Sugar is sweet, and so are you!", then a python list of words can almost be formed by calling split() on that string. The one problem, if you want the true length of each word, is punctuation. Once you split() the string into a list of words, the above example will contain words like "red.", "blue.", and "sweet,". These need to be cleaned up by stripping punctuation. Luckily, python's strip() method accepts a string of characters to remove:

>>> "red.".strip(".,!?")
'red'
>>> "you!".strip(".,!?")
'you'

Experiment with the two test files in your 08 directory: roses.txt and gb.txt. You may need to add other punctuation marks to the strip() call. Write some functions to read in the text file, build a list of cleaned-up (lowercase, punctuation-stripped) words, and calculate the average word length.

Results you should get are:

$ python author.py 
text file: roses.txt
average word length =  3.69230769231

$ python author.py 
text file: gb.txt
average word length =  4.25563909774

Note: some words are hyphenated, like "battle-field". Don't worry about removing hyphens from hyphenated words.
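
Here is one possible way to structure this first step. The function names getWords() and aveWordLength() are just suggestions (not requirements), and the exact set of punctuation characters in the strip() call is something you should adjust as you test:

def getWords(filename):
    """Read filename and return a list of lowercase words with leading
    and trailing punctuation stripped off."""
    words = []
    infile = open(filename, "r")
    for line in infile:
        for token in line.split():
            word = token.strip(".,!?;:\"'()").lower()   # add more punctuation as needed
            if len(word) > 0:
                words.append(word)
    infile.close()
    return words

def aveWordLength(words):
    """Return the average length of the words in the given list."""
    total = 0
    for word in words:
        total = total + len(word)
    return float(total) / len(words)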

2. word-count data list-of-lists

For the next two features (Type-Token Ratio and Hapax Legomana Ratio), you will need a count of how many times each word appears in the text. We would like you to create a word-count list-of-lists, with one inner [word, count] list for each word in the text.

For example, using the roses.txt file, you would create a list like this:

[['and', 1], ['are', 3], ['blue', 1], ['is', 1], ['red', 1], ['roses', 1], ['so', 1], ['sugar', 1], ['sweet', 1], ['violets', 1], ['you', 1]]

And here's a portion of the list for gb.txt:

[['a', 6], ['above', 1], ['add', 1], ['ago', 1], ['all', 1], ... ['work', 1], ['world', 1], ['years', 1]]

Notice that the lists are in alphabetic order, and the words are all lowercase and stripped of any punctuation.

To create this word-count list-of-lists, you should use your list of words from the previous section and a modified binary search function (maybe call it modifiedBS()?). Your modified binary search function should look through your list-of-lists (using the binary search algorithm!) for a given word. If it finds the word in the list, add one to the count. If it doesn't find the word, insert the new [word,1] list into the list-of-lists.

Here's a specific example. Suppose your current list-of-lists is:

LOL= [['and', 1], ['are', 3], ['blue', 1], ['red', 1], ['roses', 1], ['violets', 1]]

If the next word to process is "red", your modifiedBS() function should find an entry for "red" already in LOL, so it should increment the counter to 2:

LOL= [['and', 1], ['are', 3], ['blue', 1], ['red', 2], ['roses', 1], ['violets', 1]]

If the next word to process is "so", your modifiedBS() function will not find "so" in LOL, and should add ['so', 1] into the list in the correct position, like this:

LOL= [['and', 1], ['are', 3], ['blue', 1], ['red', 2], ['roses', 1], ['so', 1], ['violets', 1]]
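
Here is one possible shape for a modifiedBS() function; the details are up to you, but this sketch assumes LOL is always kept in alphabetical order by word:

def modifiedBS(LOL, word):
    """Binary search for word in LOL, a list of [word, count] lists kept in
    alphabetical order. If word is found, add one to its count; otherwise
    insert [word, 1] at the correct position so LOL stays sorted."""
    low = 0
    high = len(LOL) - 1
    while low <= high:
        mid = (low + high) // 2
        if LOL[mid][0] == word:
            LOL[mid][1] = LOL[mid][1] + 1   # found it: bump the count
            return
        elif LOL[mid][0] < word:
            low = mid + 1
        else:
            high = mid - 1
    LOL.insert(low, [word, 1])              # not found: low is the insertion point

Then building the full word-count list is just a loop over your list of words:

LOL = []
for word in words:
    modifiedBS(LOL, word)
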
3. Type-Token and Hapax Legomana Ratios

Once you have the word-count list, write functions to calculate these two features:

$ python author.py 
text file: roses.txt
     average word length =  3.69230769231
        type token ratio =  0.846153846154
    hapax legomana ratio =  0.769230769231

$ python author.py 
text file: gb.txt
     average word length =  4.25563909774
        type token ratio =  0.503759398496
    hapax legomana ratio =  0.327067669173
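
Given the word-count list-of-lists, both ratios are short functions. In this sketch, numWords is the total number of words in the text (the length of your word list from section 1), and the function names are just suggestions:

def typeTokenRatio(LOL, numWords):
    """Number of different words divided by the total number of words."""
    return float(len(LOL)) / numWords

def hapaxLegomanaRatio(LOL, numWords):
    """Number of words used exactly once divided by the total number of words."""
    count = 0
    for entry in LOL:
        if entry[1] == 1:
            count = count + 1
    return float(count) / numWords
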
4. sentences and phrases

The next two features are a little bit harder to calculate. Parsing the text into individual sentences and phrases is something you could do, but would require some extra work and testing. To keep this lab from being too long, we are giving you code snippets to help with these features:

To get a python list of sentences, given a long string of text, try this:

>>> from nlp_tools import *
>>> text = "Roses are red. Violets are blue. Sugar is sweet, and so are you!"
>>> sentences = tokenize(text)
>>> print sentences
['Roses are red.', 'Violets are blue.', 'Sugar is sweet, and so are you!']

We are providing the nlp_tools.py file in your 08 directory, and you are welcome to look at what's in it. For this section, it imports the nltk library, a natural language toolkit that already knows how to split text into individual sentences. It also takes care of special cases like abbreviations and words like "Mr." and "Mrs." in the middle of a sentence.

Given that list of sentences, write a function to calculate the average number of words per sentence.
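
Here is one possible sketch of such a function (the name is just a suggestion); it splits each sentence on whitespace to count its words:

def aveWordsPerSentence(sentences):
    """Average number of words per sentence."""
    total = 0
    for s in sentences:
        total = total + len(s.split())
    return float(total) / len(sentences)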

To divide a sentence into individual phrases, try this:

>>> for s in sentences:
...   phrases = re_split(s, ',:;')
...   print phrases
... 
['Roses are red.']
['Violets are blue.']
['Sugar is sweet', ' and so are you!']

This code, which calls another function from nlp_tools.py, uses regular expressions (re) to split each string on commas, colons, and semicolons.

Given the regular expressions code to split a sentence into a list of phrases, write a function to calculate the average number of phrases per sentence.
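
A sketch of this function might look like the following; it assumes you have done "from nlp_tools import *" at the top of author.py so that re_split() is available:

def avePhrasesPerSentence(sentences):
    """Average number of phrases per sentence, splitting on , : and ;"""
    total = 0
    for s in sentences:
        total = total + len(re_split(s, ',:;'))
    return float(total) / len(sentences)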

Test out your functions on the roses.txt and gb.txt files:

$ python author.py 
text file: roses.txt
     average word length =  3.69230769231
        type token ratio =  0.846153846154
    hapax legomana ratio =  0.769230769231
  average words/sentence =  4.33333333333
average phrases/sentence =  1.33333333333

$ python author.py 
text file: gb.txt
     average word length =  4.25563909774
        type token ratio =  0.503759398496
    hapax legomana ratio =  0.327067669173
  average words/sentence =  26.6
average phrases/sentence =  3.2

5. putting it all together...

Once you have these 5 features working, write one more function to compare these numbers with various other authors' numbers. We will compare them using a sum of the weighted differences of each feature. For example, given the 5 features above (1: ave word length, 2: type-token ratio, 3: hapax legomana, 4: ave words/sentence, 5: ave phrases/sentence), for two different texts A and B, the SUM we want is:

SUMAB = abs(F1A - F1B)*W1 +
        abs(F2A - F2B)*W2 +
        abs(F3A - F3B)*W3 +
        abs(F4A - F4B)*W4 +
        abs(F5A - F5B)*W5

where F1A is feature 1 (ave word length) for text A, F1B is feature 1 for text B, W1 is the weight for feature 1, and abs() is the absolute value function.

For this lab, use the following weights: weights = [11, 33, 50, 0.4, 4]

Using the above calculation, SUMAB will be smaller for similar texts (and should be zero for identical texts).
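
In code, this weighted sum might look something like the sketch below, where featuresA and featuresB are lists of the five feature values for texts A and B (sumAB is just a suggested name):

def sumAB(featuresA, featuresB, weights):
    """Return the weighted sum of absolute differences between two feature lists."""
    total = 0
    for i in range(len(weights)):
        total = total + abs(featuresA[i] - featuresB[i]) * weights[i]
    return total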

In your 08 directory are various known author stats files:

$ ls *.stats
agatha.christie.stats
alexandre.dumas.stats
charles.dickens.stats
james.joyce.stats
jane.austen.stats
lewis.caroll.stats
william.shakespeare.stats

$ cat lewis.caroll.stats 
lewis carroll
4.22709528497
0.111591342227
0.0537026953444
16.2728740581
2.86275565124

Your compare() function should read in each stats file and calculate SUMAB to compare the known author stats (text A) versus the current text stats (text B).

Here's a python way to make a list of all files in your current directory that end with .stats:

>>> import glob
>>> files = glob.glob('*.stats')
>>> print files
['lewis.caroll.stats', 'alexandre.dumas.stats', 'william.shakespeare.stats', 'agatha.christie.stats', 'james.joyce.stats', 'charles.dickens.stats', 'jane.austen.stats']
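
Putting these pieces together, your compare() function might look roughly like the sketch below. It assumes a sumAB() helper like the one shown earlier, and that each .stats file contains the author's name on the first line followed by the five feature values, one per line (as in lewis.caroll.stats above):

import glob

def compare(myStats, weights):
    """Compare myStats (this text's five features) against every .stats
    file in the current directory, printing SUMAB for each known author."""
    for fname in glob.glob('*.stats'):
        infile = open(fname, "r")
        lines = infile.readlines()
        infile.close()
        authorStats = []
        for line in lines[1:6]:          # skip the author's name on line 1
            authorStats.append(float(line))
        print("-" * 40)
        print("comparing to  " + fname)
        print(sumAB(authorStats, myStats, weights))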

And here's an example of comparing texts:

$ python author.py

text file: roses.txt

     average word length =  3.69230769231
        type token ratio =  0.846153846154
    hapax legomana ratio =  0.769230769231
  average words/sentence =  4.33333333333
average phrases/sentence =  1.33333333333

----------------------------------------
comparing to  james.joyce.stats
72.6586795995
----------------------------------------
comparing to  lewis.caroll.stats
76.7931354047
----------------------------------------
comparing to  william.shakespeare.stats
70.8484366319
----------------------------------------
comparing to  charles.dickens.stats
79.9357082887
----------------------------------------
comparing to  alexandre.dumas.stats
80.7503628899
----------------------------------------
comparing to  agatha.christie.stats
72.6887354247
----------------------------------------
comparing to  jane.austen.stats
81.246491732
----------------------------------------

Note: these SUMAB numbers are all large, indicating the roses.txt file was probably not written by any of these authors. :)

Test your code on some other text files we have in /usr/local/doc. Here's a list of some books we have downloaded from Project Gutenberg:

Below are 3 examples. Your numbers for these larger examples should be similar, but may not be exactly the same.


Submit

Once you are satisfied with your program, hand it in by typing handin21 in a terminal window.