CS21 Lab 9: Digital Humanities Lab

Part 2: Sorting

Due Monday, November 27, before midnight

Because of the Thanksgiving break, this lab will be due on Monday night after you return from break.

Please read through the entire lab before starting! More than most other labs you've done so far this semester, you will benefit from reading this lab through completely before beginning.

As always, run update21 to create your cs21/labs/09 directory and create your programs for lab 9 in this directory.

Note that this is part two of a two-part lab, but each part will be turned in and graded separately.

The goal of the two-part lab is to learn how to do some basic comparasions between literary texts. We will start in Part 1 by working with a single text, learning how to read in the text file, count the words in the file, and show the context in which specific words occur. In Part 2, we will work with two texts, examining the most common words in both texts and identifying the words that are much more "standout" or "distinctive" in one text than the other. If we were literary scholars, we might be able to use the output of this program to make speculative literary arguments about these texts. We could then examine the texts more closely by seeing how each of these more-"distinctive" words are used in the context of each text.

Although we may not be literary scholars, this lab is being done in conjuction with English 35: Rise of the Novel, taught by Professor Rachel Buurma. In ENGL 35, Professor Buurma's students will be using the solutions you write to examine texts that they have been reading in their class and compare them to a larger body of texts that they haven't read. They will interpret the similarities and differences they notice in order to construct an argument about these characteristics in the context of theoretical and historical approaches to the eighteenth- and nineteenth-century novels. They'll also gain some insight into the computational processes though which these textual features were extracted and what the assumptions, limitations, and further possibilities of such computationally-created representations of literary texts might be.

Prerequisites for Part 2

If your implementation of Part 1 (Lab 8) is complete, copy your code from the lab 8 directory to the lab 9 directory before beginning:

$ update21
$ cd ~/cs21/labs/09/
$ cp ../08/textinfo.py .

Once you copy the file to the lab 9 directory, you should be careful not to go back and edit the lab 8 file as this will just confuse you as to which version you're editing.

If your implementation for Part 1 was incomplete, you can still get full credit on Part 2 by importing a solution to Part 1. You can then build the remainder of Part 2 without worrying about problems you had in Part 1.

$ python3
>>> import dh
>>> filename = "/data/cs21/novels/shelley_frankenstein.txt"
>>> words = dh.read_file(filename)
>>> counts = dh.count_words(words)
>>> dh.get_word_count(counts,'yourself')
27
>>> dh.get_word_count(counts,'pickles')
0

Overview of Part 2

In part 2 of the lab, you will read in two texts and identify words that are more prevalent in one text relative to the other. For example, you might not be surprised that the word "laboratory" is more prevalent in Mary Shelley's "Frankenstein" than other novels published in the eighteenth century and early nineteenth century. Perhaps there are some surprising things you can find out about the tendencies of one author relative to other authors. For example, does Jane Austen use the word "very" way more than you'd expect from an author of her time period? What might that tell you if she did?

We will provide you with some starting code that you can use in your program. This starting code will present a menu from which the user can choose which texts they'd like to compare. Once the user chooses their two texts, you will read in the texts and count the words in each text using the functions you wrote in Part 1.

You will use the word counts from each text to compute the "most distinctive" words in each text. For both texts, you will display the 10 most frequent words in the text and also the 10 "most distinctive" words in the text. Then, you will use the code you wrote in Part 1 to display the Key Words In Context for words that the user might be interested in.

Here is an example run:

Examples, Requirements, and Tips

You have some freedom in how you want program to look. Here are our requirements:

the program should print out adequate instructions throughout (see examples)
the program should present the user with a menu where they can choose two texts
the program should print out the top 10 most frequent words in each text
the program should print out the top 10 most "distinctive" words in each text
the program should allow the user to see key words in context until they enter a blank word

Top-Down Design

Although you will not be submitting your top-down design (TDD) this week, we strongly encourage you to write your full design first. Design your main() function and any helper functions. Your main() function should be relatively short and easy to read. Implement and test the program one function at a time. A good program design and good use of functions is part of the grade.

We have provided you with a program called filemenu.py. If you look at the main() function you should be able to figure out how to use this in your own program. If you aren't sure, try running the filemenu.py program and see what happens. If you still can't figure it out, add some print statements to main to see what list_files and menu_selection return.

Once you've figured out how to use the menu, add import filemenu to the top of your existing program. Since you have imported filemenu.py, you can use filemenu.list_files and filemenu.menu_selection in your existing program.

Modify your code to use the filemenu so that the user doesn't have to type in the name of the text anymore.

Showing the most frequent words

In class we talked about sorting algorithms. In this lab, you need to implement one of the sorting algorithms we talked about. It won't matter which sorting algorithm you use in this lab, and you're welcome to start with any solutions that we gave you in class.

Recall that the word counts that we created in Part 1 look like this:

>>> counts
[['a', 1], ['above', 1], ['and', 2], ['are', 2], ['as', 1], ['blazing', 1], ... ]

You'd like to be able to print out the 10 most frequent words in the text. We could do this if we sorted this counts list by word frequency. Notice that this list is already sorted, but not by word frequency; instead, it is sorted alphabetically by the word. That is, because this counts list is a list-of-lists, we could sort either by the words or by the counts. The sorting algorithms we talked about in class didn't explicitly sort list-of-lists, but you'll find that if you try to use them on the counts list that they will sort the list by words, not by counts.

You need to modify one of the sort functions we gave you so that it sorts this list by count instead of by word. Once you've figured out how to sort the counts list, print out the top 10 words in the file. Here are some examples:

Note that for the longer texts, the sort function can take 5-10 seconds to run. This is fine! But that means that while you are writing and testing your program you should test it on the shorter texts before moving on to the longer texts once you're sure it's working properly.

Most Distinctive Words

In order to determine the most distinctive words, we will count how often each word occurs relative to the number of words in each text. Let's compare the usage of the word 'everything' in Jane Austen novels versus Maria Brunton novels. For our two source texts, we will use the file AUSTEN.txt (a text collection containing all seven of the individual Jane Austen novels you have access to) and the file BRUNTON.txt (containing both Maria Brunton texts). If you count the words in both texts, you will find that 'everything' occurs 2 out of 340,411 times (about 0.0006% of all words) in Brunton but 310 out of 752,331 times (about 0.04% of all words) in Austen. The ratio (310/752331)/(2/340411) = 70.1, which tells us that 'everything' is about 70 times more likely to appear in Austen than in Brunton. If we calculate this ratio for every word in both documents, we can find the set of words for each text (or group of texts) that are, in a sense, the most distinctive words in that text.

One problem with this "distinctiveness" score is that the word must appear in both texts in order to be useful. For example, the word 'anyone' appears 40 times in Austen texts but doesn't appear at all in Brunton texts. If we tried to calculate (40/752331)/(0/340411) we'd get a ZeroDivisionError. Effectively, our score is calculating that 'anyone' is infinitely more likely to appear in Austen than Brunton. For our purposes, we are going to ignore the cases where the word appears in one text but not the other and simply report on the most distinctive words from amongst the words that appear in both texts.

The algorithm to compute "distinctiveness" is as follows:

set n1 to the total number of words in the first text
set n2 to the total number of words in the second text
for each word w in one text, if w also appears in the second text, compute the fraction (count1/n1)/(count2/n2) where count1 is the frequency of w in the first text and count2 is the frequency of w in the second text.

Create a list of lists, much like you did for the counts, to store each word and its distinctiveness score. For example, in this list of lists you will have an element that represents the distinctivess of the word 'everything': ['everything', 70.13363134046051]

Once you have made this list of distinctiveness scores, sort it by the score (using the same sorting function you used to sort the word counts by frequency). When you do this for the Austen and Brunton texts, here are is the list of scores:

distinctiveness scores: Austen vs Brunton

Notice that the words that are most distinctive in Austen are way down at the end of the list. If you scroll all the way to the bottom you'll find 'everything'.

How can we find the words that are most distinctive in Brunton? You already have! If we invert the distinctiveness scores we already created (that is, compute 1/score for each score), we'll have the scores for the most distinctive words in Brunton. Since the list is already sorted, we can now just start at the beginning of the list. The word with the smallest score, shown in the list as ['laura', 0.0019226984741056523], is representing the fact that the word 'laura' (actually, the name Laura) is 1/0.0019226984741056523 = 520 times more likely to appear in Brunton than Austen. The first four words are all names, but the fifth word is forehead, which is about 70 times more likely to appear in Brunton than Austen!

Have your program print out the ten most distinctive words in the first text (those are the last ten values in your sorted distinctiveness score list) and the ten most distinctive words in the second text (those are the first ten words in your sorted distinctiveness score list). Show each word along with the score. Note that for very short texts (e.g. Beer on the Wall, Twinkle Twinkle), there may not be 10 words in the list, so show up to 10 words.

Compare how the scores show in the example distinctiveness scores above are represented in the final output which was shown at the start of the lab in the "Jane Austen vs. Maria Brunton" sample run.

A note on sorting

When building the list of distinctiveness scores, you will want to iterate over the word counts list for one of the texts. For each word in this text, you should get the word counts using your function from lab 08 (or dh.get_word_count(counts, word)) to find out how many times it appears in the other text. This function uses binary search so it expects counts to be sorted alphabetically, but your counts list may at this point be sorted by frequency. You can avoid this predicament in a few ways: (1) make a copy of the counts list before you sort it by frequency, (2) sort the list back into alphabetical order before computing the distinctiveness scores, or (3) compute the distinctiveness scores before computing the most frequent words.

If you'd like to make a copy of the counts list, here's how:

>>> counts_copy = counts[:]
>>> counts_copy == counts
True

Key Words in Context

Once your program has read in both texts, counted all the words, shown the 10 most frequent words in both texts, and shown the 10 most distinctive words in both texts, the user will want to explore the texts further by seeing how these distinctive words are used in context. Use the key words in context function that you wrote in Part 1 to ask the user for words they'd like to see in context. Show how this word is used in both texts. Then, ask the user for another word they'd like to see. Continue to ask the user for more words to show until they press Enter without inputting a value. At that point, the program ends.

And you are done with the lab!

Answer the Questionnaire

After you have turned in the full program, please edit the QUESTIONS-09.txt file in your cs21/labs/09 directory and answer the questions in that file.

Turning in your labs....

Don't forget to run handin21 to turn in your lab files! You may run handin21 as many times as you want. Each time it will turn in any new work. We recommend running handin21 after you complete each program or after you complete significant work on any one program.