Run update21, if you haven't already, to create the cs21/labs/08 directory (and some additional files). Then cd into your cs21/labs/08 directory and create your Python programs for lab 8 there.
$ update21
$ cd cs21/labs/08

NOTE: thanks to Michelle Craig at U. Toronto for this lab idea!
For this lab we will read in a book and calculate various linguistic features, such as the average word length or the average number of words per sentence. With enough features, we might be able to see differences between known authors. Furthermore, given a book by an unknown author, comparing its features with those of known authors might help us figure out who wrote it. Beyond authorship detection, programs like this could be used in plagiarism detection or email filtering.
The five features we will calculate for each text are:

1. average word length
2. type-token ratio
3. hapax legomana ratio
4. average words per sentence
5. average phrases per sentence
Your job will be to read in a book (or something smaller) from a text file and write functions to calculate each feature. Here is a quick example:
$ cat roses.txt
Roses are red.
Violets are blue.
Sugar is sweet,
and so are you!
$ python author.py
text file: roses.txt
average word length = 3.69230769231
type token ratio = 0.846153846154
hapax legomana ratio = 0.769230769231
average words/sentence = 4.33333333333
average phrases/sentence = 1.33333333333
To do this, we want you to practice incremental development: write a function and then TEST IT to make sure it works before moving on. The following sections will guide you through this process and provide a few hints and tips.
Start with the first feature: average word length. To do this, you will need a list of all words in the file. If you had a long string like this: "Roses are red. Violets are blue. Sugar is sweet, and so are you!", then a python list of words can almost be formed by using split() on that string. The one problem, if you want to eventually get the true length of each word, is punctuation. Once you split() the string into a list of words, the above example will contain words like "red.", "blue.", and "sweet,". These will need to be cleaned up by stripping punctuation. Luckily, python's strip() function takes multiple characters:
>>> "red.".strip(".,!?")
'red'
>>> "you!".strip(".,!?")
'you'
Experiment with the two test files in your 08 directory: roses.txt and gb.txt. You may need to add other punctuation marks to the strip() function. Write some functions to:

- read the text file and turn it into a list of lowercase words, stripped of punctuation
- calculate the average word length from that word list
Results you should get are:
$ python author.py
text file: roses.txt
average word length = 3.69230769231
$ python author.py
text file: gb.txt
average word length = 4.25563909774
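Putting these pieces together, here's one possible sketch. The function names (getWords, averageWordLength) and the exact set of stripped punctuation characters are just suggestions — adjust them as needed:

```python
def getWords(text):
    """Split text into a list of lowercase words, stripped of punctuation."""
    words = []
    for token in text.split():
        word = token.strip(".,!?;:'\"()").lower()
        if word != "":          # skip tokens that were pure punctuation
            words.append(word)
    return words

def averageWordLength(words):
    """Return the average length of the words in the list."""
    total = 0
    for word in words:
        total = total + len(word)
    return float(total) / len(words)
```

For the roses.txt text, getWords() should return 13 cleaned words, and averageWordLength() should give the 3.69230769231 shown above.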
Note: some words are hyphenated, like "battle-field". Don't worry about removing hyphens from hyphenated words.
For the next two features (Type-Token Ratio and Hapax Legomana Ratio), you will need a count of how many times each word appears in the text. We would like you to create a word-count list-of-lists, with each inner list being [word,count] data for each word in the text.
For example, using the roses.txt file, you would create a list like this:
[['and', 1], ['are', 3], ['blue', 1], ['is', 1], ['red', 1], ['roses', 1], ['so', 1], ['sugar', 1], ['sweet', 1], ['violets', 1], ['you', 1]]
And here's a portion of the list for gb.txt:
[['a', 6], ['above', 1], ['add', 1], ['ago', 1], ['all', 1], ... ['work', 1], ['world', 1], ['years', 1]]
Notice that the lists are in alphabetic order, and the words are all lowercase and stripped of any punctuation.
To create this word-count list-of-lists, you should use your list of words from the previous section and a modified binary search function (maybe call it modifiedBS()?). Your modified binary search function should look through your list-of-lists (using the binary search algorithm!) for a given word. If it finds the word in the list, add one to the count. If it doesn't find the word, insert the new [word,1] list into the list-of-lists.
Here's a specific example. Suppose your current list-of-lists is:
LOL= [['and', 1], ['are', 3], ['blue', 1], ['red', 1], ['roses', 1], ['violets', 1]]
If the next word to process is "red", your modifiedBS() function should find an entry for "red" already in LOL, so it should increment the counter to 2:
LOL= [['and', 1], ['are', 3], ['blue', 1], ['red', 2], ['roses', 1], ['violets', 1]]
If the next word to process is "so", your modifiedBS() function will not find "so" in LOL, and should add ['so', 1] into the list in the correct position, like this:
LOL= [['and', 1], ['are', 3], ['blue', 1], ['red', 2], ['roses', 1], ['so', 1], ['violets', 1]]
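Here's a sketch of one way to write modifiedBS(). It modifies the list-of-lists in place, using the usual binary search bounds to find either the word or the spot where it belongs (variable names are just suggestions):

```python
def modifiedBS(LOL, word):
    """Binary search LOL (a sorted list of [word, count] lists) for word.
    If found, add one to its count; if not, insert [word, 1] in order."""
    low = 0
    high = len(LOL) - 1
    while low <= high:
        mid = (low + high) // 2
        if LOL[mid][0] == word:
            LOL[mid][1] = LOL[mid][1] + 1   # found: bump the count
            return
        elif LOL[mid][0] < word:
            low = mid + 1
        else:
            high = mid - 1
    LOL.insert(low, [word, 1])              # not found: low is the insert spot
```

Starting from an empty list and calling modifiedBS() once per word in the text builds the full word-count list. Note that when the loop ends without a match, low is exactly the index where the new entry keeps the list in alphabetic order.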
Once you have the word-count list, write functions to calculate these two features:

- type-token ratio: the number of different words divided by the total number of words
- hapax legomana ratio: the number of words used exactly once divided by the total number of words

Results you should get are:
$ python author.py
text file: roses.txt
average word length = 3.69230769231
type token ratio = 0.846153846154
hapax legomana ratio = 0.769230769231
$ python author.py
text file: gb.txt
average word length = 4.25563909774
type token ratio = 0.503759398496
hapax legomana ratio = 0.327067669173
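If you define the two ratios as (number of different words) / (total words) and (number of words used exactly once) / (total words) — definitions that match the sample output above — the calculations might look like this (function names are suggestions, and the total word count is recovered by summing the counts):

```python
def typeTokenRatio(wordCounts):
    """Number of different words divided by total number of words."""
    total = 0
    for pair in wordCounts:
        total = total + pair[1]
    return float(len(wordCounts)) / total

def hapaxLegomanaRatio(wordCounts):
    """Number of words used exactly once divided by total number of words."""
    total = 0
    once = 0
    for pair in wordCounts:
        total = total + pair[1]
        if pair[1] == 1:
            once = once + 1
    return float(once) / total
```

For the roses.txt word-count list, there are 11 different words and 13 total, so the type-token ratio is 11/13, and 10 of the words appear exactly once, so the hapax legomana ratio is 10/13.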
The next two features are a little bit harder to calculate. Parsing the text into individual sentences and phrases is something you could do, but would require some extra work and testing. To keep this lab from being too long, we are giving you code snippets to help with these features:
To get a python list of sentences, given a long string of text, try this:
>>> from nlp_tools import *
>>> text = "Roses are red. Violets are blue. Sugar is sweet, and so are you!"
>>> sentences = tokenize(text)
>>> print sentences
['Roses are red.', 'Violets are blue.', 'Sugar is sweet, and so are you!']
We are providing the nlp_tools.py file in your 08 directory, and you are welcome to look at what's in it. For this section, it imports the nltk library, a natural language toolkit that already knows how to split text into individual sentences. It also takes care of special cases like abbreviations and titles such as "Mr." and "Mrs." in the middle of a sentence.
Given that list of sentences, write a function to calculate the average number of words per sentence.
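A minimal sketch, assuming each sentence's words can be counted with split():

```python
def averageWordsPerSentence(sentences):
    """Average number of words per sentence in a list of sentence strings."""
    total = 0
    for s in sentences:
        total = total + len(s.split())
    return float(total) / len(sentences)
```

For the three roses.txt sentences this gives (3 + 3 + 7) / 3 = 4.33333333333, matching the sample output.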
To divide a sentence into individual phrases, try this:
>>> for s in sentences:
...     phrases = re_split(s, ',:;')
...     print phrases
...
['Roses are red.']
['Violets are blue.']
['Sugar is sweet', ' and so are you!']
This code, also using a function in nlp_tools.py, uses regular expressions (re) to split strings on commas, colons, and semi-colons.
Given the regular expressions code to split a sentence into a list of phrases, write a function to calculate the average number of phrases per sentence.
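Here's one possible sketch. We don't show the internals of re_split() from nlp_tools.py here, so this sketch stands in for it with the standard library's re.split(), which should behave the same way for this purpose (splitting on commas, colons, and semi-colons):

```python
import re

def averagePhrasesPerSentence(sentences):
    """Average number of phrases per sentence, splitting on , : and ;"""
    total = 0
    for s in sentences:
        phrases = re.split('[,:;]', s)   # standing in for nlp_tools' re_split
        total = total + len(phrases)
    return float(total) / len(sentences)
```

For the three roses.txt sentences this gives (1 + 1 + 2) / 3 = 1.33333333333, matching the sample output.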
Test out your functions on the roses.txt and gb.txt files:
$ python author.py
text file: roses.txt
average word length = 3.69230769231
type token ratio = 0.846153846154
hapax legomana ratio = 0.769230769231
average words/sentence = 4.33333333333
average phrases/sentence = 1.33333333333
$ python author.py
text file: gb.txt
average word length = 4.25563909774
type token ratio = 0.503759398496
hapax legomana ratio = 0.327067669173
average words/sentence = 26.6
average phrases/sentence = 3.2
Once you have these 5 features working, write one more function to compare these numbers with various other authors' numbers. We will compare them using a sum of the weighted differences of each feature. For example, given the 5 features above (1: ave word length, 2: type-token ratio, 3: hapax legomana, 4: ave words/sentence, 5: ave phrases/sentence), for two different texts A and B, the SUM we want is:
SUMAB = abs(F1A - F1B)*W1 + abs(F2A - F2B)*W2 + abs(F3A - F3B)*W3 + abs(F4A - F4B)*W4 + abs(F5A - F5B)*W5
where F1A is feature 1 (ave word length) for text A, F1B is feature 1 for text B, W1 is the weight for feature 1, and abs() is the absolute value function.
For this lab, use the following weights:

weights = [11, 33, 50, 0.4, 4]
Using the above calculation, SUMAB will be smaller for similar texts (and should be zero for identical texts).
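The weighted sum can be sketched as a short function, assuming both texts' features are stored in parallel lists (the name weightedDiff is just a suggestion):

```python
def weightedDiff(featuresA, featuresB, weights):
    """Sum of the weighted absolute differences between two feature lists."""
    total = 0
    for i in range(len(weights)):
        total = total + abs(featuresA[i] - featuresB[i]) * weights[i]
    return total
```

Comparing a feature list against itself returns 0, which is a quick sanity check that the calculation behaves as expected for identical texts.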
In your 08 directory are various known author stats files:
$ ls *.stats
agatha.christie.stats    alexandre.dumas.stats    charles.dickens.stats
james.joyce.stats        jane.austen.stats        lewis.caroll.stats
william.shakespeare.stats
$ cat lewis.caroll.stats
lewis carroll
4.22709528497
0.111591342227
0.0537026953444
16.2728740581
2.86275565124
Your compare() function should read in each stats file and calculate SUMAB to compare the known author stats (text A) versus the current text stats (text B).
Here's a python way to make a list of all files in your current directory that end with .stats:
>>> import glob
>>> files = glob.glob('*.stats')
>>> print files
['lewis.caroll.stats', 'alexandre.dumas.stats', 'william.shakespeare.stats', 'agatha.christie.stats', 'james.joyce.stats', 'charles.dickens.stats', 'jane.austen.stats']
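To read one stats file, assuming the format shown above (the author's name on the first line, followed by the five feature values), a sketch might look like this:

```python
def readStats(filename):
    """Read a .stats file: author name on the first line, then five numbers."""
    infile = open(filename, 'r')
    lines = infile.readlines()
    infile.close()
    name = lines[0].strip()
    features = []
    for token in " ".join(lines[1:]).split():
        features.append(float(token))
    return name, features
```

Joining the remaining lines before splitting means this works whether the five numbers are one per line or all on one line. Your compare() function can then loop over glob.glob('*.stats'), read each file, and print the weighted sum against the current text's features.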
And here's an example of comparing texts:
$ python author.py
text file: roses.txt
average word length = 3.69230769231
type token ratio = 0.846153846154
hapax legomana ratio = 0.769230769231
average words/sentence = 4.33333333333
average phrases/sentence = 1.33333333333
----------------------------------------
comparing to james.joyce.stats
72.6586795995
----------------------------------------
comparing to lewis.caroll.stats
76.7931354047
----------------------------------------
comparing to william.shakespeare.stats
70.8484366319
----------------------------------------
comparing to charles.dickens.stats
79.9357082887
----------------------------------------
comparing to alexandre.dumas.stats
80.7503628899
----------------------------------------
comparing to agatha.christie.stats
72.6887354247
----------------------------------------
comparing to jane.austen.stats
81.246491732
----------------------------------------
Note: these SUMAB numbers are all large, indicating the roses.txt file was probably not written by any of these authors. :)
Test your code on some other text files we have in /usr/local/doc. Here's a list of some books we have downloaded from Project Gutenberg:
Below are 3 examples. Your numbers for these larger examples should be similar, but may not be exactly the same.
Once you are satisfied with your program, hand it in by typing handin21 in a terminal window.