CS21 Lab 8: Digital Humanities Lab

Part 1: Searching

Due Saturday, November 16, before midnight


Please read through the entire lab before starting! More than most other labs you've done so far this semester, you will benefit from reading this lab through completely before beginning.

As always, run update21 to create your cs21/labs/08 directory and create your program for lab 8 in this directory. Your code will be stored in a file called textinfo.py.

Note that this is part one of a two-part lab, but each part will be turned in and graded separately.

The goal of the two-part lab is to learn how to do some basic comparisons between literary texts. We will start in Part 1 by working with a single text, learning how to read in the text file, count the words in the file, and show the context in which specific words occur. In Part 2, we will work with two texts, examining the most common words in both texts and identifying the words that are much more "standout" or "distinctive" in one text than the other. If we were literary scholars, we might be able to use the output of this program to make speculative literary arguments about these texts. We could then examine the texts more closely by seeing how each of these more-"distinctive" words is used in the context of each text.

Although we may not be literary scholars, this lab is being done in conjunction with English 35: Rise of the Novel, taught by Professor Rachel Buurma. In ENGL 35, Professor Buurma's students will be using the solutions you write to examine texts that they have been reading in their class and compare them to a larger body of texts that they haven't read. They will interpret the similarities and differences they notice in order to construct an argument about these characteristics in the context of theoretical and historical approaches to the eighteenth- and nineteenth-century novel. They'll also gain some insight into the computational processes through which these textual features were extracted and what the assumptions, limitations, and further possibilities of such computationally created representations of literary texts might be.

Overview of Part 1

In part 1 of the lab, you will use the searching techniques that we have learned about in class to quickly count how often every word in the text occurs. Once you've counted all the words, you will report the most frequent word and its count. Then, you will ask the user to enter a word that they'd like more information about. Your program will tell the user how often that word occurred and then show them each occurrence of that word in the context in which it appeared. You will continue to ask the user for more words until they press Enter without entering a value. At that point, the program ends. Here are some examples:

Examples, Requirements, and Tips

You have some freedom in how you want your program to look. Here are our requirements:

Top-Down Design

Although you will not be submitting your top-down design (TDD) this week, we strongly encourage you to write your full design first. Design your main() function and any helper functions. Your main() function should be relatively short and easy to read. Implement and test the program one function at a time. A good program design and good use of functions is part of the grade.

Location of the texts

In /data/cs21/novels, you will find a lot of text files. (To list these files, type ls /data/cs21/novels.) Most of the files in this directory are novels. For example, shelley_frankenstein.txt is Mary Shelley's 1818 novel "Frankenstein". There are also two small example texts: taylor_twinkle.txt, which is Jane Taylor's 1806 poem "Twinkle, Twinkle, Little Star", and anonymous_beer.txt, the folk song "99 Bottles of Beer". You can open any of these files in atom to read through them and get a sense of the texts we'll be working with in this lab. (To open Frankenstein you can type atom /data/cs21/novels/shelley_frankenstein.txt.)

Processing the texts

Open shelley_frankenstein.txt in atom. Take a look at the contents of the file and notice that there are lowercase and capitalized letters, a lot of punctuation and newlines, and some numbers. Let's say we wanted to count how many times the word morning occurs in the text. In atom, you can search for the word morning. The first time you find it, it appears in the phrase "...lay to until the morning, fearing to encounter..." The next time, it appears in "...In the morning, however...". Searching further, you'll find "...at morning's dawn...", "...one in the morning; the rain..." and "Morning, dismal and wet". When we count how many times "morning" appears in the text, we want to count every one of those occurrences.

To simplify the problem, we will read in the text, lowercase it, then keep all the alphabetic characters and turn everything else into a space character. This will lead to a "cleaner" version of the text where it will be easier to count occurrences of each word. As an example, here is the list of words obtained after cleaning the taylor_twinkle.txt text and then splitting it into words.

To get this cleaned version of the text, you might want to write a function that takes a single line of text as a parameter and returns the cleaned version of that line. In that function, you can use an accumulator to build a new cleaned string from the original line: for every character in the original line, if the character is alphabetic, you'll add it to the cleaned string; otherwise, you'll add a space character (' ') to the cleaned string.
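For example, a minimal sketch of such a cleaning function might look like the following (the name clean_line is just an illustration, not a required name):

def clean_line(line):
    """Return a lowercased copy of line in which every
    non-alphabetic character is replaced by a space."""
    line = line.lower()
    cleaned = ""                      # accumulator for the cleaned string
    for ch in line:
        if ch.isalpha():
            cleaned = cleaned + ch    # keep letters
        else:
            cleaned = cleaned + " "   # everything else becomes a space
    return cleaned

For instance, clean_line("Twinkle, twinkle, little star,") returns "twinkle  twinkle  little star " (with doubled spaces where the commas were).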

Recall that you can split a string into a list of words by using the split method. Here is an example:

>>> s='twinkle twinkle little star'
>>> s.split()
['twinkle', 'twinkle', 'little', 'star']

Counting the words

Now that you've read in all the words in the file and created a clean version of the text, you can count each of the words. Working with the Twinkle text, we could easily use len to find out that there are 118 words in total. However, what you want to be able to find out is that the word the occurs 11 times, the word twinkle occurs 7 times, and the word diamond occurs just one time.
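For example, with the cleaned list of Twinkle words stored in a variable called text:

>>> len(text)
118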

The data structure we are going to use to store the counts will look like this:

>>> counts
[['a', 1], ['above', 1], ['and', 2], ['are', 2], ['as', 1], ['blazing', 1], ... ]

Each element of the list will contain two items: a word and the frequency with which the word occurs. We will keep this list sorted alphabetically by word. One reason to do that is so that if we wanted to look up the number of times that blazing occurs in the text, we could use binary search to quickly find out that it occurred once. Another reason for storing the words alphabetically is that it will allow us to build the list of counts one word at a time.

Let's walk through how to build up this list using a shorter text, anonymous_kid.txt. This is just a silly sentence that we wrote for the purposes of testing your lab. Here's the sentence: "The kid in the suit is in the car by the house." When we clean the text and read it into a list, we'll get a list like the one shown below (which we'll call text so we can refer to it later).

>>> text
['the', 'kid', 'in', 'the', 'suit', 'is', 'in', 'the', 'car', 'by', 'the', 'house']

We'll also initialize an empty list called counts where we'll store the word counts. Recall that this counts list will be a list of lists like the one shown above. Let's assume that we've already counted the first 3 words in the text and our counts look like this:

>>> counts
[['in', 1], ['kid', 1], ['the', 1]]

We now need to count the fourth word in the text, which is the word the. Using binary search, we can efficiently find that the word the is already in the counts list. Since we have now seen another the, we'll change its count from 1 to 2, so our counts list now looks like this:

>>> counts
[['in', 1], ['kid', 1], ['the', 2]]
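
That update is just a single assignment. Assuming binary search reported that the is at index 2 of the counts list, the change we made above was:

>>> counts[2][1] = counts[2][1] + 1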

The next word in the text is suit. When we do a binary search in the counts list for the word suit, we fail to find it. This means that we've now seen suit for the first time, and that means that we want ['suit', 1] to appear in our counts. In order for binary search to work, we'll need to make sure that we put suit into the counts list in sorted order. Fortunately, when the binary search algorithm ends without finding the item, the value low (from the implementation of binary search) tells us where suit should go in the list. In this case, when binary search fails, low is 2. (Trace through binary search if you aren't sure that this is so.) Now we can use the list.insert(index, object) method to insert ['suit', 1] at index 2:

>>> low
2
>>> word
'suit'
>>> counts.insert(low, [word, 1])
>>> counts
[['in', 1], ['kid', 1], ['suit', 1], ['the', 2]]

We will continue with this algorithm, using binary search to locate each word from the text in our counts list and either incrementing the count by 1 (as we did for the) or inserting a new list with [word, 1] (as we did with suit). When we are all done, we'll have the count for every word in the text. Here's the output we get for the anonymous_kid.txt file:

[['by', 1], ['car', 1], ['house', 1], ['in', 2], ['is', 1], ['kid', 1], ['suit', 1], ['the', 4]]

Try to do the anonymous_kid.txt example by hand to make sure you understand how binary search and insert are working to create the counts list. You might want to convince yourself that the algorithm works when starting from an empty list and building up the whole list.
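
To make the algorithm concrete, here is a minimal sketch of the counting loop. It assumes a binary search like the one from class, adapted to compare the search word against the first element of each [word, count] pair and to report low when the word is not found; the names count_search and count_words are illustrations, not requirements:

def count_search(counts, word):
    """Binary search a list of [word, count] pairs sorted by word.
    Returns (True, index) if word is found; otherwise (False, low),
    where low is the index where word belongs to keep counts sorted."""
    low = 0
    high = len(counts) - 1
    while low <= high:
        mid = (low + high) // 2
        if counts[mid][0] == word:
            return True, mid
        elif counts[mid][0] < word:
            low = mid + 1
        else:
            high = mid - 1
    return False, low

def count_words(text):
    """Build the sorted counts list from the cleaned list of words."""
    counts = []
    for word in text:
        found, index = count_search(counts, word)
        if found:
            counts[index][1] = counts[index][1] + 1   # seen before: add one
        else:
            counts.insert(index, [word, 1])           # first time: insert in order
    return counts

Calling count_words on the text list from anonymous_kid.txt produces exactly the counts shown above.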

Here are the counts from two other files for you to compare against:

Note that in Lab 08, the only thing we will use the counts for is determining the most frequent word. Don't worry: we will use these counts extensively in Lab 09!

Key Words In Context

The second component of the lab is to allow the user to see how words are used in the context of the text. For example, a student investigating Shelley's "Frankenstein" might be interested in seeing how the word "monster" is used in the text. You will ask the user to enter a word and then show them, in a nicely formatted way, the 4 words before and the 4 words after every occurrence of the word they asked to see. (See the sample runs above for details.)

The basic idea of this algorithm is to start at the beginning of the text list and do a linear search for the word the user is interested in. Each time you find the word, accumulate a string of the 4 words before it and another string of the 4 words after it. Use string formatting to make it all look nicely aligned.

The only tricky part of this is that sometimes the user might want to see a word that is close enough to the beginning or end that there aren't actually 4 words to show. You'll have to figure out how to make sure the program shows the right context without crashing!
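
As one illustration of that boundary handling, here is a sketch that clamps the context window with max and min; the name show_context and the field width are hypothetical choices, not requirements:

def show_context(text, word):
    """Print each occurrence of word with up to 4 words of
    context on either side, aligned around the search word."""
    for i in range(len(text)):
        if text[i] == word:
            before = ""                                    # accumulate words before
            for j in range(max(0, i - 4), i):              # max clamps at the start
                before = before + text[j] + " "
            after = ""                                     # accumulate words after
            for j in range(i + 1, min(len(text), i + 5)):  # min clamps at the end
                after = after + " " + text[j]
            print("%35s%s%s" % (before, word, after))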

Hint: Remember that string formatting right justifies text by default; to left justify, put a minus sign before the number in the format string. For example:

>>> s = "hello there"
>>> t = "how are you"
>>> print("%20s | %20s" % (s, t))
         hello there |          how are you
>>> print("%-20s | %-20s" % (s, t))
hello there          | how are you         
>>> print("%-20s | %20s" % (s, t))
hello there          |          how are you
>>> print("%20s | %-20s" % (s, t))
         hello there | how are you          

Answer the Questionnaire

After you have turned in the full program, please edit the QUESTIONS-08.txt file in your cs21/labs/08 directory and answer the questions in that file.


Turning in your labs....

Don't forget to run handin21 to turn in your lab files! You may run handin21 as many times as you want. Each time it will turn in any new work. We recommend running handin21 after you complete each program or after you complete significant work on any one program.